Hi there,

I have a PySpark job that simply takes a tab-separated CSV and writes it out
as a Parquet file.  The code is based on the SQL write parquet example
(using a different inferred schema, only 35 columns).  The input files range
from 100 MB to 12 GB.
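
For reference, the job is roughly of the following shape (the paths and
column names here are placeholders; the real schema has 35 columns):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="TsvToParquet")
sqlContext = SQLContext(sc)

# Read the tab-separated input and build Rows (the real job has 35 columns).
lines = sc.textFile("hdfs:///data/input/*.tsv")
rows = lines.map(lambda l: l.split("\t")) \
            .map(lambda p: Row(col1=p[0], col2=p[1], col3=p[2]))

# Infer the schema and write out Parquet, as in the SQL write parquet example.
schemaRdd = sqlContext.inferSchema(rows)
schemaRdd.saveAsParquetFile("hdfs:///data/output/parquet")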

I have tried different block sizes from 10 MB through to 1 GB, and I have
tried different parallelism.  The part files add up to roughly a 1:5
compression ratio.
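
By "block sizes" I mean setting parquet.block.size on the Hadoop
configuration, roughly like this (I'm not certain this is the right way to
pass it through from PySpark):

# Attempt at a 1 GB Parquet row group / block size.
sc._jsc.hadoopConfiguration().setInt("parquet.block.size", 1024 * 1024 * 1024)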

I am trying to get large Parquet files.  Having this many small files will
cause problems for my NameNode; I have over 500,000 of these files.
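
My understanding is that each partition ends up as one part file, so I have
been wondering whether coalescing to fewer partitions before the write,
along these lines, is the recommended way to control the output file sizes:

# Fewer partitions should mean fewer, larger part files; 32 is just a guess.
fewer = rows.coalesce(32)
schemaRdd = sqlContext.inferSchema(fewer)
schemaRdd.saveAsParquetFile("hdfs:///data/output/parquet")

Or is there a better knob for this?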

Your assistance would be greatly appreciated.

cheers,
Ag

PS Another solution might be a Parquet concat tool, if one exists; I
couldn't find one.  I understand that such a tool would have to adjust the
footer.



