Hello,
I'm using Spark on a few machines in standalone mode; the data storage is
mounted on these machines via NFS. I have an input data stream, and when I
try to store an hour of data as Parquet, the job executes mostly on one
core and takes 40-50 minutes to write the hourly data. It is very slow!
And it is not an I/O problem. After researching how the Parquet format
works, I found that it can be parallelized at the row group level.
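
Roughly, the write looks like the sketch below (simplified: the setup, the
input source and format, and the paths are just placeholders, not my real code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Placeholder setup; in reality the hourly DataFrame comes from the input stream.
val sc = new SparkContext(new SparkConf().setAppName("hourly-parquet-write"))
val sqlContext = new SQLContext(sc)
val hourlyDf = sqlContext.read.json("/mnt/nfs/input/some-hour")  // placeholder source

// The whole hour is written as one Parquet dataset on the NFS mount;
// this is the step that runs mostly on one core and takes 40-50 minutes.
hourlyDf.write.mode("append").parquet("/mnt/nfs/output/some-hour")
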
I think the row group for my files is too large; how can I change it?
When I create a very big DataFrame, it is divided into parts nicely and
written quickly!
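
Is setting parquet.block.size on the Hadoop configuration the right way to
change this? I was thinking of something like the sketch below (the 32 MB
value and the partition count are just guesses on my part):

// Shrink the row group size by setting parquet.block.size (row group size
// in bytes, default 128 MB) on the Hadoop configuration before writing.
sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)

// And/or split the hour into more partitions before the write, since a
// DataFrame with many partitions is exactly what writes quickly for me.
hourlyDf.repartition(16)
  .write
  .mode("append")
  .parquet("/mnt/nfs/output/some-hour")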

Thanks,
Pavel
