Re: Parquet files are only 6-20MB in size?

2014-11-03 Thread Akhil Das
Before calling saveAsParquetFile, you can call repartition with a suitable
number of partitions; that number determines how many output files are
generated.
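
A minimal sketch of that suggestion (paths, Row fields, and the partition count are placeholder assumptions, and the API shown is the Spark 1.x SchemaRDD interface current when this thread was written; running it requires a Spark installation):

```python
# Hypothetical sketch for Spark 1.x: paths and Row fields are placeholders.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="tsv-to-parquet")
sqlContext = SQLContext(sc)

rows = sc.textFile("hdfs:///data/input.tsv") \
         .map(lambda line: line.split("\t")) \
         .map(lambda p: Row(ts=p[0], value=p[1]))

# repartition(16) shuffles the data into 16 partitions, so the Parquet
# write below emits 16 part files (the shuffle does not preserve order).
schemaRDD = sqlContext.inferSchema(rows.repartition(16))
schemaRDD.saveAsParquetFile("hdfs:///data/output.parquet")
```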

Thanks
Best Regards

On Mon, Nov 3, 2014 at 1:12 PM, ag007 agre...@mac.com wrote:

 Hi there,

 I have a pySpark job that is simply taking a tab separated CSV outputting
 it
 to a Parquet file.  The code is based on the SQL write parquet example.
 (Using a different inferred schema, only 35 columns). The input files range
 from 100MB to 12 Gb.

 I have tried different block sizes, from 10 MB through to 1 GB, and I have
 tried different parallelism settings. The part files show roughly 1:5
 compression overall.

 I am trying to get large Parquet files.  Having this many small files will
 cause problems for my name node.  I have over 500,000 of these files.
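
One back-of-the-envelope way to pick the partition count from these numbers (my own sketch, not from the thread; the 1:5 compression ratio and the 1 GB target are taken from the figures above, and `partition_count` is a hypothetical helper name):

```python
# Back-of-the-envelope sizing sketch (not from the thread): estimate a
# partition count so each output Parquet file lands near a target size,
# using the ~1:5 compression ratio reported in the post.
def partition_count(input_bytes, compression_ratio=5.0,
                    target_file_bytes=1024 ** 3):
    """Return the number of partitions that yields ~target-sized files."""
    compressed = input_bytes / compression_ratio
    return max(1, round(compressed / target_file_bytes))

# A 12 GB input compresses to ~2.4 GB, i.e. roughly two ~1 GB files;
# a 100 MB input still gets a single (small) file.
print(partition_count(12 * 1024 ** 3))   # → 2
print(partition_count(100 * 1024 ** 2))  # → 1
```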

 Your assistance would be greatly appreciated.

 cheers,
 Ag

 PS Another solution may be a Parquet concat tool, if one exists; I
 couldn't find one.  I understand that such a tool would have to adjust the
 footer.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-files-are-only-6-20MB-in-size-tp17935.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Parquet files are only 6-20MB in size?

2014-11-03 Thread ag007
Thanks Akhil,

Am I right in saying that repartition will spread the data randomly, so I
lose the chronological order?

I really just want the CSV converted to Parquet in the same order it came in.
If I set repartition to 1, will the result not be random?

cheers,
Ag



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-files-are-only-6-20MB-in-size-tp17935p17941.html



Re: Parquet files are only 6-20MB in size?

2014-11-03 Thread Davies Liu
Before saveAsParquetFile(), you can call coalesce(N); then you will
have N files, and it will keep the order as before (repartition() will not).
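
A sketch of that fix (same hypothetical paths and Row fields as before; Spark 1.x API, requires a Spark installation):

```python
# Hypothetical sketch for Spark 1.x; paths and Row fields are placeholders.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="tsv-to-parquet-ordered")
sqlContext = SQLContext(sc)

rows = sc.textFile("hdfs:///data/input.tsv") \
         .map(lambda line: line.split("\t")) \
         .map(lambda p: Row(ts=p[0], value=p[1]))

# coalesce(8) merges existing partitions without a shuffle, so the input
# order is preserved across the 8 output files; repartition(8) would
# shuffle and randomize the order.
schemaRDD = sqlContext.inferSchema(rows.coalesce(8))
schemaRDD.saveAsParquetFile("hdfs:///data/output.parquet")
```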


On Mon, Nov 3, 2014 at 1:16 AM, ag007 agre...@mac.com wrote:
 Thanks Akhil,

 Am I right in saying that repartition will spread the data randomly, so I
 lose the chronological order?

 I really just want the CSV converted to Parquet in the same order it came in.
 If I set repartition to 1, will the result not be random?

 cheers,
 Ag






Re: Parquet files are only 6-20MB in size?

2014-11-03 Thread ag007
Davies, that's exactly what I was after :) Awesome, thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-files-are-only-6-20MB-in-size-tp17935p18002.html