Hi Nikos,
How many columns does the DataFrame have, and how many distinct values
of some_column? The Parquet writer is known to be very memory-consuming
for wide tables, and lots of distinct partition-column values result in many
concurrent Parquet writers. One possible workaround is to first
partition the data on a column with much lower cardinality.
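For example, something along these lines (just a minimal Scala sketch
against the Spark 1.4 API; the user_id/country column names, the sample
data, and the output path are all made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("partitioned-write"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical data: user_id has many distinct values, country has few.
    val df = sc.parallelize(Seq(
      ("u1", "GR", 1.0), ("u2", "US", 2.0), ("u3", "GR", 3.0)
    )).toDF("user_id", "country", "value")

    // partitionBy("user_id") would open one Parquet writer per distinct
    // user; partitioning on the low-cardinality column instead keeps the
    // number of concurrent writers (and thus memory pressure) small:
    df.write.partitionBy("country").parquet("/tmp/out_by_country")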
Hi Lian,
Thank you for the tip. Indeed, there were a lot of distinct values in my
result set (approximately 3000). As you suggested, I decided to partition
the data first on a column with much smaller cardinality.
Thanks
n
On Thu, Jul 16, 2015 at 2:09 PM, Cheng Lian <lian.cs@gmail.com> wrote:
Hi,
I am trying to test partitioning for DataFrames with Parquet, so I
attempted df.write().partitionBy(some_column).parquet(path) on a
small dataset of 20,000 records which, when saved as Parquet locally with
gzip compression, takes about 4 MB of disk space.
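Roughly what I ran (a Scala sketch; the input path, output path, and
some_column stand in for my actual names):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("partition-test"))
    val sqlContext = new SQLContext(sc)
    // use gzip compression for the Parquet output
    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

    val df = sqlContext.read.parquet("/tmp/input")  // ~20,000 records

    // one Parquet writer is opened per distinct some_column value
    df.write.partitionBy("some_column").parquet("/tmp/partitioned_output")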
However, on my dev machine with