Parquet read performance for different schemas

Tomas Bartalos Thu, 19 Sep 2019 09:16:19 -0700

Hello,

I have 2 Parquet datasets (each containing 1 file):


   - parquet-wide - schema has 25 top level cols + 1 array
   - parquet-narrow - schema has 3 top level cols

Both files have the same data for the columns they share.
When I read from parquet-wide, Spark reports *52.6 KB read*; from
parquet-narrow, *only 2.6 KB*.
For a bigger dataset the difference is *413 MB vs 961 MB*. Needless to say,
reading the narrow parquet is much faster.

Since schema pruning is applied, I *expected to get similar results* for
both scenarios (timing and amount of data read).
What do you think is the reason for such a big difference? Is there any
tuning I can do?
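
One way to check whether pruning actually reaches the scan is to compare the `ReadSchema` that Spark prints in the physical plan for each dataset. The following is only a rough sketch: the helper names, the paths, and the column list are hypothetical, and `df._jdf` is an internal PySpark handle used here just to get the plan as a string (calling `df.explain()` and reading the output by eye works as well). Spark may also truncate long plan strings, so a short field list is not conclusive on its own.

```python
def read_schema_fields(plan_text):
    """Return the top-level field names of the first
    'ReadSchema: struct<...>' entry found in a physical plan string."""
    marker = "ReadSchema: struct<"
    start = plan_text.find(marker)
    if start < 0:
        return []
    i = start + len(marker)
    depth, field, fields = 1, "", []
    while i < len(plan_text):
        ch = plan_text[i]
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
            if depth == 0:
                break  # closing bracket of the outer struct
        if ch == "," and depth == 1:
            # a comma at depth 1 separates top-level fields
            fields.append(field)
            field = ""
        else:
            field += ch
        i += 1
    if field:
        fields.append(field)
    return [f.split(":", 1)[0].strip() for f in fields]


def compare_read_schemas(spark, paths, columns):
    """Print the columns each Parquet scan reports reading.

    `spark` is an existing SparkSession; `paths` and `columns` are
    placeholders for the two datasets and the shared columns.
    """
    for path in paths:
        df = spark.read.parquet(path).select(*columns)
        # _jdf is internal API (assumption: acceptable for a quick check)
        plan = df._jdf.queryExecution().executedPlan().toString()
        print(path, read_schema_fields(plan))


# Hypothetical usage:
# compare_read_schemas(spark, ["/data/parquet-wide", "/data/parquet-narrow"],
#                      ["col_a", "col_b", "col_c"])
```

If both scans list only the selected columns, the extra bytes are not coming from unpruned columns in the read schema and have to be explained some other way.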

Thank you,
Tomas
