Thanks Akhil,
Am I right in saying that repartition will spread the data randomly, so I
lose the chronological order?
I really just want the CSV converted to Parquet format in the same order it came in.
If I repartition with 1 partition, will this not be random?
cheers,
Ag
David, that's exactly what I was after :) Awesome, thanks.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-files-are-only-6-20MB-in-size-tp17935p18002.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi there,
I have a pySpark job that simply takes a tab-separated CSV and outputs it
to a Parquet file. The code is based on the SQL write-parquet example
(using a different inferred schema, with only 35 columns). The input files range
from 100 MB to 12 GB.
I have tried different block