I am using Spark 1.5.1 and am running into memory problems with a Java unit test. Yes, I could fix it by raising -Xmx (it is currently set to 1024M); however, I want to better understand what is going on so I can write better code in the future. The test runs on a Mac with master = "local[2]".
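For context, the test harness is set up roughly like this (the app name is illustrative, not my exact code):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SQLContext;

    // Local-mode context for the unit test, two worker threads.
    SparkConf conf = new SparkConf()
            .setMaster("local[2]")
            .setAppName("memoryTest"); // illustrative name
    JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(javaSparkContext);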
I have a Java unit test that starts by reading a 672K ASCII file. My output data file is 152K. It seems strange that such a small amount of data would cause an out-of-memory exception.

I am running a pretty standard machine learning process:

1. Load the data
2. Create an ML pipeline
3. Transform the data
4. Train a model
5. Make predictions
6. Join the predictions back to my original data set
7. Coalesce(1); I only have a small amount of data and want to save it in a single file
8. Save the final results back to disk

At step 7 I am unable to call coalesce(): "java.io.IOException: Unable to acquire memory" (a rough sketch of steps 7 and 8 is at the end of this mail).

To try to figure out what is going on, I put in log messages to count the number of partitions (also sketched at the end of this mail). It turns out I have 20 input files, and each one winds up in a separate partition. So after loading I call coalesce(1) and check to make sure I only have a single partition. The total number of observations is 1998.

After step 7 I count the number of partitions and discover I have 224! That is surprising, given that I called coalesce(1) before I did anything with the data. My data set should easily fit in memory. When I save the results to disk, 202 files get created, 162 of them empty!

In general I am not explicitly using cache. Some of the data frames get registered as tables; I find it easier to use SQL. Some of the data frames get converted back to RDDs; I find it easier to create RDD<LabeledPoint> that way. I put calls to unpersist(true) in several places.

    private void memoryCheck(String name) {
        // Log JVM heap usage at a named checkpoint.
        Runtime rt = Runtime.getRuntime();
        logger.warn("name: {} \t\ttotalMemory: {} \tfreeMemory: {}",
                name,
                String.format("%,d", rt.totalMemory()),
                String.format("%,d", rt.freeMemory()));
    }

Any idea how I can get a better understanding of what is going on? My goal is to learn to write better Spark code.

Kind regards

Andy

Memory usage at various points in my unit test:

    name: rawInput          totalMemory: 447,741,952      freeMemory: 233,203,184
    name: naiveBayesModel   totalMemory: 509,083,648      freeMemory: 403,504,128
    name: lpRDD             totalMemory: 509,083,648      freeMemory: 402,288,104
    name: results           totalMemory: 509,083,648      freeMemory: 368,011,008

    DataFrame exploreDF = results.select(results.col("id"),
                                         results.col("label"),
                                         results.col("binomialLabel"),
                                         results.col("labelIndex"),
                                         results.col("prediction"),
                                         results.col("words"));
    exploreDF.show(10);

Yes, I realize it is strange to switch styles; however, this should not cause memory problems.

    final String exploreTable = "exploreTable";
    exploreDF.registerTempTable(exploreTable);
    String fmt = "SELECT * FROM %s WHERE binomialLabel = 'signal'";
    String stmt = String.format(fmt, exploreTable);
    DataFrame subsetToSave = sqlContext.sql(stmt); // .show(100);

    name: subsetToSave      totalMemory: 1,747,451,904    freeMemory: 1,049,447,144

Calling exploreDF.unpersist(true); does not resolve the memory issue.
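For reference, here is roughly how I am counting partitions (the helper name is my own, not a Spark API):

    import org.apache.spark.sql.DataFrame;

    // Illustrative helper; DataFrame.javaRDD() is the Spark 1.5 Java API,
    // and partitions().size() gives the current partition count.
    private void logPartitionCount(String name, DataFrame df) {
        int numPartitions = df.javaRDD().partitions().size();
        logger.warn("name: {} \tnumPartitions: {}", name, numPartitions);
    }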
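Steps 7 and 8 boil down to something like this (the output format and path shown are illustrative, not my exact code):

    // Collapse to a single partition so the save produces one file,
    // then write the final results back to disk.
    results.coalesce(1)
           .write()
           .format("json")              // illustrative format
           .save("target/test-output"); // illustrative path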
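One thing I plan to experiment with: I believe the join in step 6 triggers a shuffle, and spark.sql.shuffle.partitions defaults to 200, which is in the same ballpark as the 224 partitions I am seeing. If that is right, lowering it should shrink the shuffle output for a data set this small:

    // 4 is an arbitrary small value for ~2000 rows; the default is 200.
    sqlContext.setConf("spark.sql.shuffle.partitions", "4");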