[ 
https://issues.apache.org/jira/browse/PARQUET-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinpeng Zhou updated PARQUET-2423:
----------------------------------
    Description: 
Currently, each invocation of SkipRecords() for non-repeated fields creates a 
brand new buffer object[1]. I think it is probably worth keeping the buffer 
object alive and just resizing it for each skip, as the buffer is just a bitmap 
(i.e., it should remain quite small even if we don't free its memory after a 
skip). Performance results are as follows:

Keep buffer object alive:

--------------------------------------------------------------------------------------------------------------
Benchmark                                                    Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------
*RecordReaderSkipRecords/Repetition:0/BatchSize:1      29958201 ns     29943377 ns           23 bytes_per_second=163.068M/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:10      3190298 ns      3190524 ns          227 bytes_per_second=1.49454G/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:100      479056 ns       480437 ns         1500 bytes_per_second=9.92507G/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:1000     256497 ns       257725 ns         2763 bytes_per_second=18.5018G/s*
RecordReaderSkipRecords/Repetition:1/BatchSize:1000    2910364 ns      2910479 ns          239 bytes_per_second=893.189M/s
RecordReaderSkipRecords/Repetition:2/BatchSize:1000   20539472 ns     20535632 ns           34 bytes_per_second=135.007M/s

Recreate upon each skip (current behavior):

--------------------------------------------------------------------------------------------------------------
Benchmark                                                    Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------
*RecordReaderSkipRecords/Repetition:0/BatchSize:1      33261760 ns     33199124 ns           21 bytes_per_second=147.077M/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:10      3256993 ns      3254609 ns          216 bytes_per_second=1.46511G/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:100      492856 ns       493377 ns         1447 bytes_per_second=9.66477G/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:1000     262449 ns       263227 ns         2694 bytes_per_second=18.1151G/s*
RecordReaderSkipRecords/Repetition:1/BatchSize:1000    2996951 ns      2997148 ns          235 bytes_per_second=867.36M/s
RecordReaderSkipRecords/Repetition:2/BatchSize:1000   20864734 ns     20850593 ns           34 bytes_per_second=132.968M/s
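
To make the two result sets easier to compare, the wall-time difference can be 
expressed as a percentage per benchmark case. The following is only a small 
sketch: the names SkipBenchRow, kRows, TimeReductionPercent and 
PrintTimeReductions are hypothetical helpers, not part of Arrow or Parquet, 
and the numbers are the "Time" column values copied from the tables in this 
description.

```cpp
#include <cassert>
#include <cstdio>

// One benchmark case, with wall times (ns) copied from the reported results.
struct SkipBenchRow {
  const char* name;
  double recreate_ns;  // "Recreate upon each skip (current behavior)"
  double reuse_ns;     // "Keep buffer object alive"
};

constexpr SkipBenchRow kRows[] = {
    {"Repetition:0/BatchSize:1", 33261760.0, 29958201.0},
    {"Repetition:0/BatchSize:10", 3256993.0, 3190298.0},
    {"Repetition:0/BatchSize:100", 492856.0, 479056.0},
    {"Repetition:0/BatchSize:1000", 262449.0, 256497.0},
    {"Repetition:1/BatchSize:1000", 2996951.0, 2910364.0},
    {"Repetition:2/BatchSize:1000", 20864734.0, 20539472.0},
};

// Percentage of wall time saved when going from recreating the buffer on
// every skip to keeping it alive.
inline double TimeReductionPercent(double recreate_ns, double reuse_ns) {
  assert(recreate_ns > 0);
  return 100.0 * (recreate_ns - reuse_ns) / recreate_ns;
}

// Prints one line per benchmark case with its time reduction in percent.
inline void PrintTimeReductions() {
  for (const SkipBenchRow& row : kRows) {
    std::printf("%-28s %6.2f%%\n", row.name,
                TimeReductionPercent(row.recreate_ns, row.reuse_ns));
  }
}
```

Calling PrintTimeReductions() from a main() prints one line per case. With the 
figures above, the Repetition:0/BatchSize:1 case comes out at roughly 10%. The 
Repetition:1 and Repetition:2 rows are included for comparison only, since the 
proposed change targets non-repeated fields.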

[1]https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1482
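
The proposal in the description (keep one bitmap alive and resize it per skip, 
rather than creating a new buffer each time) can be illustrated with a 
standalone sketch. This is not the Arrow code: std::vector stands in for the 
reader's buffer as an assumption, and SkipScratch, Prepare and BytesForBits 
are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bytes needed to hold `num_bits` bits of a bitmap.
inline std::size_t BytesForBits(std::int64_t num_bits) {
  assert(num_bits >= 0);
  return static_cast<std::size_t>((num_bits + 7) / 8);
}

// Hypothetical stand-in for the reader state: it owns a single scratch
// bitmap for its whole lifetime instead of allocating one per skip call.
class SkipScratch {
 public:
  // Returns a zeroed bitmap with room for `num_records` bits.
  // The vector only needs new memory when the request exceeds its current
  // capacity, so repeated skips of similar size reuse the same allocation.
  std::uint8_t* Prepare(std::int64_t num_records) {
    const std::size_t nbytes = BytesForBits(num_records);
    if (nbytes > bitmap_.capacity()) ++allocations_;
    bitmap_.assign(nbytes, 0);
    return bitmap_.data();
  }

  std::size_t size_bytes() const { return bitmap_.size(); }

  // How many times Prepare() had to grow the underlying storage.
  int allocations() const { return allocations_; }

 private:
  std::vector<std::uint8_t> bitmap_;
  int allocations_ = 0;
};
```

Because the bitmap needs only one bit per record, a skip of 1000 records asks 
for 125 bytes, which is why holding on to the memory between skips should cost 
little.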



> Avoid allocating buffer object in RecordReader's SkipRecords
> ------------------------------------------------------------
>
>                 Key: PARQUET-2423
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2423
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Jinpeng Zhou
>            Priority: Minor
>


