I am using Spark 1.5.1. I am running into some memory problems with a Java
unit test. Yes, I could fix it by setting -Xmx (it is set to 1024M); however, I
want to better understand what is going on so I can write better code in the
future. The test runs on a Mac, master="local[2]".

I have a Java unit test that starts by reading a 672K ASCII file. My
output data file is 152K. It seems strange that such a small amount of data
would cause an out-of-memory exception. I am running a pretty standard
machine learning process (a rough sketch follows the list):

1. Load data
2. Create an ML pipeline
3. Transform the data
4. Train a model
5. Make predictions
6. Join the predictions back to my original data set
7. coalesce(1); I only have a small amount of data and want to save it in a
single file
8. Save the final results back to disk
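
Roughly, the flow looks like the sketch below. The column names, the feature
stages, and the load/save calls are placeholders rather than my exact code (in
my real code I actually train the NaiveBayes model from an RDD<LabeledPoint>,
see further down), and it assumes an existing sqlContext:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.NaiveBayes;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.DataFrame;

// 1. Load data (placeholder; I actually read 20 small ASCII files)
DataFrame rawInput = sqlContext.read().json("path/to/input");

// 2. Create an ML pipeline: tokenize, hash to term-frequency features, index the label
Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
HashingTF hashingTF = new HashingTF().setInputCol("words").setOutputCol("features");
StringIndexer labelIndexer = new StringIndexer()
        .setInputCol("binomialLabel").setOutputCol("labelIndex");
NaiveBayes naiveBayes = new NaiveBayes()
        .setLabelCol("labelIndex").setFeaturesCol("features");
Pipeline pipeline = new Pipeline().setStages(
        new PipelineStage[] {tokenizer, hashingTF, labelIndexer, naiveBayes});

// 3-5. Transform the data, train the model, make predictions
PipelineModel model = pipeline.fit(rawInput);
DataFrame predictions = model.transform(rawInput);

// 6. Join the predictions back to the original data set on a shared "id" column
DataFrame results = predictions.select("id", "prediction").join(rawInput, "id");

// 7-8. Coalesce to a single partition and save (placeholder path and format)
results.coalesce(1).write().json("path/to/output");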

Step 7: I am unable to call coalesce(): "java.io.IOException: Unable to
acquire memory"

To try to figure out what is going on, I put in log messages to count the
number of partitions.

It turns out I have 20 input files, and each one winds up in a separate
partition. So after loading I call coalesce(1) and check to make sure I only
have a single partition.
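
The check is just something along these lines (counting the underlying RDD's
partitions):

// Count partitions right after loading and again after coalescing.
// Note that coalesce(1) returns a new DataFrame; it does not modify rawInput in place.
int loadedPartitions = rawInput.rdd().partitions().length;
DataFrame coalesced = rawInput.coalesce(1);
int afterCoalesce = coalesced.rdd().partitions().length;
logger.warn("partitions after load: {} after coalesce(1): {}",
        loadedPartitions, afterCoalesce);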

The total number of observations is 1998.

After calling step 7, I count the number of partitions and discover I have
224 partitions! That is surprising, given that I called coalesce(1) before I did
anything with the data. My data set should easily fit in memory. When I save it
to disk, 202 files are created, 162 of them empty!

In general I am not explicitly using cache.

Some of the data frames get registered as tables; I find it easier to use
SQL.

Some of the data frames get converted back to RDDs; I find it easier to
create an RDD<LabeledPoint> this way.
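
The conversion is along these lines. It assumes the transformed DataFrame has
"labelIndex" and "features" columns (those names come from my pipeline; the
rest is just a sketch):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

// Map each Row of the transformed DataFrame to an MLlib LabeledPoint.
JavaRDD<LabeledPoint> lpRDD = transformed.javaRDD().map(
        new Function<Row, LabeledPoint>() {
            @Override
            public LabeledPoint call(Row row) {
                double label = row.getDouble(row.fieldIndex("labelIndex"));
                Vector features = (Vector) row.get(row.fieldIndex("features"));
                return new LabeledPoint(label, features);
            }
        });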

I put calls to unpersist(true) in several places.

    private void memoryCheck(String name) {
        // Log JVM heap usage at a named checkpoint in the unit test.
        Runtime rt = Runtime.getRuntime();
        logger.warn("name: {} \t\ttotalMemory: {} \tfreeMemory: {}",
                name,
                String.format("%,d", rt.totalMemory()),
                String.format("%,d", rt.freeMemory()));
    }


Any idea how I can get a better understanding of what is going on? My goal
is to learn to write better Spark code.

Kind regards

Andy

Memory usage at various points in my unit test:
name: rawInput totalMemory:   447,741,952 freeMemory:   233,203,184

name: naiveBayesModel totalMemory:   509,083,648 freeMemory:   403,504,128

name: lpRDD totalMemory:   509,083,648 freeMemory:   402,288,104

name: results totalMemory:   509,083,648 freeMemory:   368,011,008



        DataFrame exploreDF = results.select(results.col("id"),
                results.col("label"),
                results.col("binomialLabel"),
                results.col("labelIndex"),
                results.col("prediction"),
                results.col("words"));

        exploreDF.show(10);

        

Yes, I realize it is strange to switch styles; however, this should not cause
memory problems.



        final String exploreTable = "exploreTable";
        exploreDF.registerTempTable(exploreTable);
        String fmt = "SELECT * FROM %s where binomialLabel = 'signal'";
        String stmt = String.format(fmt, exploreTable);

        DataFrame subsetToSave = sqlContext.sql(stmt); // .show(100);



name: subsetToSave totalMemory: 1,747,451,904 freeMemory: 1,049,447,144



Calling exploreDF.unpersist(true) does not resolve the memory issue.




