Hi, I'm running into a strange memory scaling issue when using the partitionBy feature of DataFrameWriter.
I've generated a table (a CSV file) with 3 columns (A, B and C) and 32*32 rows, about 20 KB on disk. Column A has 32 distinct values and column B has 32 distinct values, and every value of A is combined with every value of B, producing a data set of 32*32 rows. Column C contains a random number for each row; its content doesn't matter.

I imported this into Spark and ran partitionBy("A", "B") to test its performance. This should create a nested directory structure with 32 folders, each containing another 32 folders. The job uses about 10 GB of RAM and runs slowly. If I increase the number of rows from 32*32 to 128*128, I get a Java heap space out-of-memory error no matter what value I use for the heap size.

Is this a known bug?

Scala code:

    val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("table.csv")
    df.write.partitionBy("A", "B").mode("overwrite").parquet("table.parquet")

How I ran the Spark shell:

    bin/spark-shell --driver-memory 16g --master local[8] --packages com.databricks:spark-csv_2.10:1.0.3

Attached you'll find table.csv, which I used.

table.csv <http://apache-spark-developers-list.1001551.n3.nabble.com/file/n12838/table.csv>

Thank you,
Vlad

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrame-partitionBy-issues-tp12838.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
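
For anyone trying to reproduce this without the attachment, here is a minimal sketch of how a table with the layout described above could be generated in plain Scala (it can be pasted into the Spark shell). The value format ("a0", "b0", ...), the seed, the helper names tableRows and writeTable, and the output file name table_generated.csv are all assumptions made for illustration; the attached table.csv may contain different values.

```scala
import java.io.{File, PrintWriter}
import scala.util.Random

// Build the rows of an n*n table: every combination of n values for A and
// n values for B, plus a random integer for C.
// ASSUMPTION: the "a<i>" / "b<j>" value format is hypothetical; the attached
// table.csv may use different values.
def tableRows(n: Int, seed: Long = 42L): Seq[String] = {
  val rng = new Random(seed)
  for (a <- 0 until n; b <- 0 until n) yield s"a$a,b$b,${rng.nextInt(1000000)}"
}

// Write a header line followed by the generated rows to a CSV file.
def writeTable(path: String, n: Int): Unit = {
  val out = new PrintWriter(new File(path))
  try {
    out.println("A,B,C")
    tableRows(n).foreach(row => out.println(row))
  } finally {
    out.close()
  }
}

// 32 values of A times 32 values of B = 1024 rows (hypothetical file name).
writeTable("table_generated.csv", 32)
```

Changing 32 to 128 gives the 128*128 = 16384-row case, and the generated file can be passed to load(...) in place of table.csv.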