Re: Parquet files are only 6-20MB in size?

2014-11-03 Thread Akhil Das
Before calling saveAsParquetFile, you can call repartition with a suitable number of partitions; that number determines how many output files are generated. Thanks Best Regards On Mon, Nov 3, 2014 at 1:12 PM, ag007 agre...@mac.com wrote: Hi there, I have a pySpark job that is simply taking a
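
For picking that number, a rough rule of thumb is to divide the expected output size by the file size you want. The sketch below is only an illustration: the helper name partitions_for_target_size and the 128 MB target are assumptions, not something stated in the thread, and the Parquet output will usually be smaller than the CSV input because of compression, so treat the result as a starting point.

```python
import math

def partitions_for_target_size(total_bytes, target_file_bytes=128 * 1024 ** 2):
    """Estimate how many partitions (and so output files) to ask for.

    Hypothetical helper: the 128 MB default is an assumed target,
    not a value taken from the thread or from Spark.
    """
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Round up so no file exceeds the target; always keep at least one partition.
    return max(1, math.ceil(total_bytes / target_file_bytes))

# Example: about 10 GiB of data split into files of roughly 128 MiB each.
n = partitions_for_target_size(10 * 1024 ** 3)
```

The resulting n is what you would pass to repartition or coalesce before writing.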

Re: Parquet files are only 6-20MB in size?

2014-11-03 Thread ag007
Thanks Akhil, Am I right in saying that repartition will spread the data randomly, so I lose chronological order? I really just want to convert the CSV to Parquet format in the same order it came in. If I call repartition with 1, will this not be random? cheers, Ag

Re: Parquet files are only 6-20MB in size?

2014-11-03 Thread Davies Liu
Before saveAsParquetFile(), you can call coalesce(N); you will then have N files, and it will keep the order as before (repartition() will not). On Mon, Nov 3, 2014 at 1:16 AM, ag007 agre...@mac.com wrote: Thanks Akhil, Am I right in saying that the repartition will spread the data randomly so I
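
To make the difference concrete, here is a small pure-Python model of the two operations. It is not Spark code: coalesce_model and repartition_model are hypothetical names, and the model assumes that coalesce merges neighbouring partitions without a shuffle while repartition deals rows out across the new partitions. Real Spark may group partitions differently when data locality is involved, so read this as a sketch of the idea only.

```python
import random

def coalesce_model(partitions, n):
    """Merge adjacent partitions into at most n groups; no row changes its relative position."""
    count = len(partitions)
    n = min(n, count)  # this model never increases the partition count
    merged = []
    for i in range(n):
        start = i * count // n
        end = (i + 1) * count // n
        # Concatenate a contiguous run of input partitions, in order.
        merged.append([row for part in partitions[start:end] for row in part])
    return merged

def repartition_model(partitions, n, seed=0):
    """Deal the rows of each input partition round-robin over n outputs (a shuffle)."""
    rng = random.Random(seed)
    out = [[] for _ in range(n)]
    for part in partitions:
        start = rng.randrange(n)  # random starting output for this input partition
        for offset, row in enumerate(part):
            out[(start + offset) % n].append(row)
    return out

def flatten(partitions):
    """Read the partitions back in partition order, as writing the files in order would."""
    return [row for part in partitions for row in part]

rows = [[0, 1], [2, 3], [4, 5], [6, 7]]  # four partitions in chronological order
coalesced = coalesce_model(rows, 2)      # [[0, 1, 2, 3], [4, 5, 6, 7]]
shuffled = repartition_model(rows, 2)    # same rows, interleaved across outputs
```

Reading coalesced back partition by partition gives the rows in their original order; reading shuffled back does not, which is the behaviour described above.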

Re: Parquet files are only 6-20MB in size?

2014-11-03 Thread ag007
Davies, that's exactly what I was after :) Awesome, thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-files-are-only-6-20MB-in-size-tp17935p18002.html