[ 
https://issues.apache.org/jira/browse/PARQUET-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-2423:
------------------------------------
    Labels: pull-request-available  (was: )

> Avoid allocating buffer object in RecordReader's SkipRecords
> ------------------------------------------------------------
>
>                 Key: PARQUET-2423
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2423
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Jinpeng Zhou
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, each invocation of SkipRecords() for non-repeated fields creates
> a brand-new buffer object [1]. It is probably worth keeping the buffer
> object alive and just resizing it for each skip, since the buffer is only a
> bitmap (i.e., it should remain quite small even if we don't free its memory
> after a skip). Performance results are as follows:
> Keep buffer object alive:
> --------------------------------------------------------------------------------------------------------------
> Benchmark                                                    Time             CPU   Iterations UserCounters...
> --------------------------------------------------------------------------------------------------------------
> *RecordReaderSkipRecords/Repetition:0/BatchSize:1      29958201 ns     29943377 ns           23 bytes_per_second=163.068M/s*
> *RecordReaderSkipRecords/Repetition:0/BatchSize:10      3190298 ns      3190524 ns          227 bytes_per_second=1.49454G/s*
> *RecordReaderSkipRecords/Repetition:0/BatchSize:100      479056 ns       480437 ns         1500 bytes_per_second=9.92507G/s*
> *RecordReaderSkipRecords/Repetition:0/BatchSize:1000     256497 ns       257725 ns         2763 bytes_per_second=18.5018G/s*
> RecordReaderSkipRecords/Repetition:1/BatchSize:1000    2910364 ns      2910479 ns          239 bytes_per_second=893.189M/s
> RecordReaderSkipRecords/Repetition:2/BatchSize:1000   20539472 ns     20535632 ns           34 bytes_per_second=135.007M/s
>  
> Recreate upon each skip (current behavior):
> --------------------------------------------------------------------------------------------------------------
> Benchmark                                                    Time             CPU   Iterations UserCounters...
> --------------------------------------------------------------------------------------------------------------
> *RecordReaderSkipRecords/Repetition:0/BatchSize:1      33261760 ns     33199124 ns           21 bytes_per_second=147.077M/s*
> *RecordReaderSkipRecords/Repetition:0/BatchSize:10      3256993 ns      3254609 ns          216 bytes_per_second=1.46511G/s*
> *RecordReaderSkipRecords/Repetition:0/BatchSize:100      492856 ns       493377 ns         1447 bytes_per_second=9.66477G/s*
> *RecordReaderSkipRecords/Repetition:0/BatchSize:1000     262449 ns       263227 ns         2694 bytes_per_second=18.1151G/s*
> RecordReaderSkipRecords/Repetition:1/BatchSize:1000    2996951 ns      2997148 ns          235 bytes_per_second=867.36M/s
> RecordReaderSkipRecords/Repetition:2/BatchSize:1000   20864734 ns     20850593 ns           34 bytes_per_second=132.968M/s
> [1] https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1482
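
The idea in the description, keeping one scratch bitmap alive and resizing it instead of allocating a new one per skip, can be sketched roughly as follows. This is only an illustration under stated assumptions: `SkipScratchReader`, `BytesForBits`, and both `Skip...` methods are hypothetical names, `std::vector` stands in for the real buffer type, and the allocation counter is an approximation for demonstration. It is not the actual parquet-cpp code or the change made in the pull request.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed to hold `num_bits` validity bits (one bit per value).
inline std::size_t BytesForBits(std::int64_t num_bits) {
  return static_cast<std::size_t>((num_bits + 7) / 8);
}

// Hypothetical stand-in for a record reader; NOT the parquet-cpp class.
class SkipScratchReader {
 public:
  // Baseline: every call builds a fresh bitmap and drops it on return.
  std::size_t SkipWithFreshBuffer(std::int64_t num_values) {
    assert(num_values >= 0);
    std::vector<std::uint8_t> bitmap(BytesForBits(num_values), 0);
    if (!bitmap.empty()) ++fresh_allocations_;
    return bitmap.size();
  }

  // Proposal: keep one bitmap as a member and resize it per call.
  // Shrinking a std::vector keeps its capacity, so later calls that fit
  // within the largest size seen so far do not allocate again.
  std::size_t SkipWithReusedBuffer(std::int64_t num_values) {
    assert(num_values >= 0);
    const std::size_t needed = BytesForBits(num_values);
    if (needed > scratch_.capacity()) ++reused_allocations_;  // growth reallocates
    scratch_.resize(needed);
    std::fill(scratch_.begin(), scratch_.end(), 0);  // clear stale bits
    return scratch_.size();
  }

  int fresh_allocations() const { return fresh_allocations_; }
  int reused_allocations() const { return reused_allocations_; }

 private:
  std::vector<std::uint8_t> scratch_;  // lives as long as the reader
  int fresh_allocations_ = 0;
  int reused_allocations_ = 0;
};
```

In this sketch, calling `SkipWithReusedBuffer(1000)` three times allocates once, whereas `SkipWithFreshBuffer(1000)` allocates on every call. The trade-off is that the reader retains the bitmap's memory between skips, which stays small because it is one bit per value.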



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
