GZip compresses very well but uses a lot of CPU during compression and
decompression. Try running a test with store.parquet.compression =
'zstd' (introduced in Drill 1.20.0). You can use CTAS statements in
Drill to create Parquet files compressed with Zstandard.
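For example, a minimal test (the table and workspace names below are placeholders for your own paths):

    ALTER SESSION SET `store.parquet.compression` = 'zstd';
    CREATE TABLE dfs.tmp.`mydata_zstd` AS
    SELECT * FROM dfs.root.`/path/to/mydata_gzip`;

Then point your query at the new files and compare the scan times.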
If some columns or row groups need not be scanned, make sure the query
actually excludes them: project only the columns you need and filter on
columns whose values let Drill prune row groups.
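As an illustration (the column names here are made up), a query like

    SELECT trip_id, fare
    FROM dfs.root.`/data/trips`
    WHERE trip_date = DATE '2023-01-01';

reads only the referenced columns, and the filter lets Drill's Parquet
filter pushdown skip row groups whose min/max statistics exclude that
date, whereas SELECT * with no WHERE clause scans everything.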
Ensure that your Parquet files have been partitioned to a suitable size,
normally somewhere between 250 and 1000 MB.
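If you rewrite the data with CTAS, you can control the row group size
Drill writes via the store.parquet.block-size option (value in bytes),
e.g. to target roughly 512 MB:

    ALTER SESSION SET `store.parquet.block-size` = 536870912;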
For some data, setting store.parquet.use_new_reader = false will be
significantly faster.
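It is cheap to A/B test this since the option is session-scoped:

    ALTER SESSION SET `store.parquet.use_new_reader` = false;
    -- re-run the query and compare the PARQUET_ROW_GROUP_SCAN times, then
    ALTER SESSION RESET `store.parquet.use_new_reader`;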
If profiling the Drillbits doing the scans reveals that they are waiting
for data due to limited I/O throughput, then consider faster storage.
For example, Drill can exploit data locality in HDFS to achieve higher
throughput.
On 2023/03/07 08:17, Prabhakar Bhosale wrote:
Hi team,
I have compressed (gzip) Parquet files created with Apache Drill. The total
folder size is 7.8 GB and the number of rows is 116,249,263. The query
takes 2 min 18 sec.
Most of the time is spent on "PARQUET_ROW_GROUP_SCAN". Is there any way to
improve this performance?
I am using:
Drill - 1.20
CPU - 8 cores
Memory - 16 GB
I also tried increasing memory to 32 GB, but there was not much difference.
I also tried certain recommendations given in the Drill documentation, but
with no success.
Any pointer/help is highly appreciated. Thanks.
Regards
Prabhakar