Hello, I'm running Spark in standalone mode on several machines; the data storage is mounted on these machines via NFS. I have an input data stream, and when I try to store an hour's worth of data in Parquet, the job runs mostly on a single core and the hourly data takes 40-50 minutes to write. That is very slow, and it is not an I/O problem. After researching how the Parquet format works, I found that reads can be parallelized at the row-group level. I think the row groups in my files are too large; how can I change their size? When I create a very large DataFrame, it is split into parts nicely and writes quickly.
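For context, here is a minimal sketch of the kind of hourly write I mean (the paths, the JSON input format, and the application name are just illustrative, not my real job):

import org.apache.spark.sql.SparkSession

object HourlyParquetWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hourly-parquet-write")
      .getOrCreate()

    // Hypothetical: one hour of collected input records, read from the NFS mount.
    val hourly = spark.read.json("/mnt/nfs/incoming/hour=13/*.json")

    // Written out as Parquet; this is the step that runs mostly on one core.
    hourly.write.parquet("/mnt/nfs/parquet/hour=13")

    spark.stop()
  }
}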
Thanks, Pavel