GZip compresses very well but uses a lot of CPU during compression and
decompression. Try running a test with store.parquet.compression =
'zstd' (introduced in Drill 1.20.0). You can use CTAS statements in
Drill to create Parquet files compressed with Zstandard.
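For example, a minimal test (the table and workspace names below are placeholders for your own paths):

    ALTER SESSION SET `store.parquet.compression` = 'zstd';
    CREATE TABLE dfs.tmp.`mydata_zstd` AS
    SELECT * FROM dfs.root.`/path/to/mydata_gzip`;

Then point your query at the new files and compare the scan times.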
If some columns or row groups need not be scanned, make sure the query
actually excludes them: project only the columns you need and filter on
columns whose values let Drill prune row groups.
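As an illustration (the column names here are made up), a query like

    SELECT trip_id, fare
    FROM dfs.root.`/data/trips`
    WHERE trip_date = DATE '2023-01-01';

reads only the referenced columns, and the filter lets Drill's Parquet
filter pushdown skip row groups whose min/max statistics exclude that
date, whereas SELECT * with no WHERE clause scans everything.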
Ensure that your Parquet files have been partitioned to a suitable size,
normally somewhere between 250 and 1000 MB.
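If you rewrite the data with CTAS, you can control the row group size
Drill writes via the store.parquet.block-size option (value in bytes),
e.g. to target roughly 512 MB:

    ALTER SESSION SET `store.parquet.block-size` = 536870912;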
For some data, setting store.parquet.use_new_reader = false will be
significantly faster.
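It is cheap to A/B test this since the option is session-scoped:

    ALTER SESSION SET `store.parquet.use_new_reader` = false;
    -- re-run the query and compare the PARQUET_ROW_GROUP_SCAN times, then
    ALTER SESSION RESET `store.parquet.use_new_reader`;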
If profiling the Drillbits doing the scans reveals that they are waiting
for data due to limited I/O throughput, then consider faster storage.
For example, Drill can exploit data locality in HDFS to achieve higher
throughput.
On 2023/03/07 08:17, Prabhakar Bhosale wrote:
Hi team,
I have compressed (gzip) Parquet files created with Apache Drill. The total
folder size is 7.8 GB and the number of rows is 116,249,263. The query
takes 2 min 18 sec.
Most of the time is spent on "PARQUET_ROW_GROUP_SCAN". Is there any way to
improve this performance?
I am using:
Drill - 1.20
CPU - 8 cores
Memory - 16 GB
I also tried increasing memory to 32 GB, but there was not much difference.
I also tried certain recommendations given in the Drill documentation, but
with no success.
Any pointer/help is highly appreciated. Thanks.
Regards
Prabhakar