Hi All,

I'm running Spark 1.4.1 on an 8-core machine with 16 GB RAM. I have a 500 MB CSV file with 10 columns, and I need to split it into multiple CSV/Parquet files based on one of the fields in the file. I've loaded the CSV with spark-csv and applied the transformations below. It takes a long time (more than 20-30 minutes) and sometimes terminates with an OOM. Any ideas for better ways to do this? Thanks in advance!
I start spark-shell using the below options:

# Enabled kryo serializer
bin/spark-shell --driver-memory 6G --executor-memory 6G --master "local[3]" \
  --conf spark.kryoserializer.buffer.max=200m \
  --packages com.databricks:spark-csv_2.11:1.1.0

val df = sqlContext.load("com.databricks.spark.csv",
  Map("header" -> "true",
      "path" -> "file:///file.csv",
      "partitionColumn" -> "date",
      "numPartitions" -> "4"))

df.map(r => (r(2), List(r))).reduceByKey((a, b) => a ++ b)

--
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
  -By Prof. Randy Pausch in "The Last Lecture"

My Journal :- http://varadharajan.in
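P.S. To make the goal concrete, here is a rough sketch of the per-date output I'm trying to produce (one set of CSV files per distinct value of the split column). The output path and the collect-then-filter loop are only illustrative; I haven't timed this and I'm not proposing it as the fix, it's just to show what "separating based on one field" means for my data:

// Illustration only: one CSV directory per distinct "date" value.
// The output path is a placeholder for my actual layout.
val dates = df.select("date").distinct.collect().map(_.getString(0))
dates.foreach { d =>
  df.filter(df("date") === d)
    .write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save(s"file:///output/date=$d")
}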