To increase the minor fragment count, set the option planner.cpu_load_average. You can also increase the number of concurrent Parquet reader threads with store.parquet.reader.columnreader.async.
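For example (the values below are only illustrative starting points, not recommendations, and should be tuned for your own node):

-- illustrative values; tune for your own hardware
ALTER SYSTEM SET `planner.cpu_load_average` = 0.95;
ALTER SYSTEM SET `store.parquet.reader.columnreader.async` = true;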

However, since your tests with faster compression codecs showed no improvement, I think that your query is probably memory bandwidth bound, a common state of affairs for a single-node cluster. To add more memory bandwidth to your cluster you'll need to scale horizontally, e.g. two 16 GB Drillbits instead of one 32 GB Drillbit.

What do you see for TIME_DISK_SCAN? If that's also small then the scan isn't waiting on I/O and the time is going into CPU work (decompression and decoding), which would be consistent with the memory bandwidth theory above.

On 2023/03/18 06:36, Prabhakar Bhosale wrote:
Hi James,
Thanks for your detailed guidance. Please see my findings below.

*You wrote*: GZip compresses very well but uses a lot of CPU during compression and
decompression. Try running a test with store.parquet.compression =
'zstd' (introduced in Drill 1.20.0). You can use CTAS statements in
Drill to create Parquet files compressed with Zstandard.
*Me*: I tried both lz4 and zstd as well, but neither of them seems to give better results. lz4 gives some improvement, but not a considerable one. In the operator metrics, lz4 is faster at decompression, but that gain is cancelled out by time_load_datapage and time_to_decode_datapage.
For zstd the decompression time is the same as that of gzip.
Regarding CPU utilization, querying all 3 types of compressed files uses roughly the same amount of CPU.
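(A recompression test of this kind can be run with a session option plus CTAS, e.g. as below; dfs.tmp and the table names are just placeholders.)

ALTER SESSION SET `store.parquet.compression` = 'zstd';
CREATE TABLE dfs.tmp.`events_zstd` AS SELECT * FROM dfs.tmp.`events_gzip`;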

*You Wrote:* If some columns or row groups need not be scanned, ensure that they are
being excluded by the query.
*Me*: Yes, I had already tried this and it improved the performance considerably. I had to sort the data while creating the Parquet files.
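(One way to produce sorted Parquet like that is a CTAS with an ORDER BY; the table and column names below are hypothetical. Sorting on the commonly filtered column keeps each row group's min/max statistics tight, so row groups outside the filter range can be skipped, and selecting only the needed columns avoids reading the rest.)

-- hypothetical table and column names
CREATE TABLE dfs.tmp.`events_sorted` AS
SELECT event_ts, account_id, event_type
FROM dfs.tmp.`events_gzip`
ORDER BY event_ts;

SELECT account_id, event_type
FROM dfs.tmp.`events_sorted`
WHERE event_ts >= TIMESTAMP '2023-01-01 00:00:00';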

*You Wrote: *Ensure that your Parquet files have been partitioned to a suitable size,
normally somewhere between 250 and 1000Mb.
*Me*: No changes made to the Drill defaults.
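(If you do want to experiment with row group sizing, the option to set before the CTAS is store.parquet.block-size, with the value in bytes; 512 MB below is only an example.)

ALTER SESSION SET `store.parquet.block-size` = 536870912;  -- 512 MB per row group (example value)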

*You Wrote:* For some data, setting store.parquet.use_new_reader = false will be
significantly faster.
*Me*: I am using Drill 1.20.1. In this version the option's description says "Not supported in this version" and its value is false. I tried setting it to true, and the query could not complete even after 3 times the duration taken for gzip, so I think this is not useful for my data.
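(The current value and scope of that option can be checked from sys.options, e.g.:)

SELECT * FROM sys.options WHERE name = 'store.parquet.use_new_reader';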

*You Wrote: *If profiling the Drillbits doing the scans reveals that they are waiting
for data due to limited I/O throughput then consider faster storage.
E.g. Data locality in HDFS can be exploited by Drill to achieve higher
throughput.
*Me*: The operator metric "TIME_DISK_SCAN_WAIT" is less than 0.2 sec, so I don't think disk I/O is the bottleneck here.

*My additional observations are:*
1. The operator metric "TIME_VARCOLUMN_READ" is taking 19+ seconds, as most of the columns the query reads are VARCHAR. Is there any way to improve upon this?
2. The "numFiles" reported in the physical plan is different for all 3 compressions of the same exact data with the same query: 109 for gzip, 161 for lz4 and 152 for zstd. I was expecting this to be the same for all 3 compression formats. The same is the case with the NUM_ROWGROUPS metric (an EXPLAIN example is sketched after this list).
3. The minor fragments created under "PARQUET_ROW_GROUP_SCAN" number 6. I assume these are the parallel threads created to select the data. Is there any setting that would allow me to create more minor fragments for this operator?
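(One way to view numFiles and the row group counts in the physical plan is EXPLAIN; the table and column names below are placeholders.)

EXPLAIN PLAN FOR
SELECT account_id, event_type
FROM dfs.tmp.`events_gzip`
WHERE event_ts >= TIMESTAMP '2023-01-01 00:00:00';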

Thanks for reading this long email.

Regards
Prabhakar



On Mon, Mar 13, 2023 at 7:44 PM James Turton <[email protected]> wrote:

    GZip compresses very well but uses a lot of CPU during compression
    and
    decompression. Try running a test with store.parquet.compression =
    'zstd' (introduced in Drill 1.20.0). You can use CTAS statements in
    Drill to create Parquet files compressed with Zstandard.

    If some columns or row groups need not be scanned, ensure that
    they are
    being excluded by the query.

    Ensure that your Parquet files have been partitioned to a suitable
    size,
    normally somewhere between 250 and 1000Mb.

    For some data, setting store.parquet.use_new_reader = false will be
    significantly faster.

    If profiling the Drillbits doing the scans reveals that they are
    waiting
    for data due to limited I/O throughput then consider faster storage.
    E.g. Data locality in HDFS can be exploited by Drill to achieve
    higher
    throughput.


    On 2023/03/07 08:17, Prabhakar Bhosale wrote:
    > hi team,
    > I have compressed (gzip) parquet files created with apache
    drill. the total
    > folder size is 7.8gb and the number of rows are 116,249,263. the
    query
    > takes 2min 18sec.
    > Most of the time is spent on "PARQUET_ROW_GROUP_SCAN". Is there
    any way to
    > improve this performance?
    > i am using
    > Drill - 1.20
    > CPU - 8 core
    > mem - 16gb
    >
    > I also tried increasing memory to 32GB but no much difference. I
    also tried
    > certain recommendations given in drill documentation but with no
    success.
    >
    > Any pointer/help is highly appreciated. thx
    >
    > Regards
    > Prabhakar
    >
