Hello, I'm running Spark in standalone mode on several machines; the data storage is mounted on these machines via NFS. I have an input data stream, and when I try to store an hour's worth of data in Parquet, the job runs mostly on a single core and the hourly data takes 40-50 minutes to write. That is very slow, and it is not an I/O problem. After researching how the Parquet format works, I found that reads can be parallelized at the row-group level. I think the row groups in my files are too large; how can I change their size? When I create a very large DataFrame, it is split into parts nicely and writes quickly.
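For context, here is a minimal sketch of the kind of hourly write I mean (the paths, the JSON input format, and the application name are just illustrative, not my real job):

import org.apache.spark.sql.SparkSession

object HourlyParquetWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hourly-parquet-write")
      .getOrCreate()

    // Hypothetical: one hour of collected input records, read from the NFS mount.
    val hourly = spark.read.json("/mnt/nfs/incoming/hour=13/*.json")

    // Written out as Parquet; this is the step that runs mostly on one core.
    hourly.write.parquet("/mnt/nfs/parquet/hour=13")

    spark.stop()
  }
}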
Thanks, Pavel