Jinpeng Zhou created PARQUET-2423:
-------------------------------------

             Summary: Avoid allocating buffer object in RecordReader's SkipRecords
                 Key: PARQUET-2423
                 URL: https://issues.apache.org/jira/browse/PARQUET-2423
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-cpp
            Reporter: Jinpeng Zhou


Currently, each invocation of SkipRecords() for non-repeated fields creates a 
brand new buffer object[1]. I think it is probably worth keeping the buffer 
object alive and just resizing it for each skip, as the buffer is just a bitmap 
(i.e., it should remain quite small even if we don't free its memory after a 
skip). Performance results are as follows:

Keep buffer object alive:

--------------------------------------------------------------------------------------------------------------
Benchmark                                                    Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------
*RecordReaderSkipRecords/Repetition:0/BatchSize:1      29958201 ns     29943377 ns           23 bytes_per_second=163.068M/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:10      3190298 ns      3190524 ns          227 bytes_per_second=1.49454G/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:100      479056 ns       480437 ns         1500 bytes_per_second=9.92507G/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:1000     256497 ns       257725 ns         2763 bytes_per_second=18.5018G/s*
RecordReaderSkipRecords/Repetition:1/BatchSize:1000    2910364 ns      2910479 ns          239 bytes_per_second=893.189M/s
RecordReaderSkipRecords/Repetition:2/BatchSize:1000   20539472 ns     20535632 ns           34 bytes_per_second=135.007M/s

Recreate upon each skip (current behavior):

--------------------------------------------------------------------------------------------------------------
Benchmark                                                    Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------
*RecordReaderSkipRecords/Repetition:0/BatchSize:1      33261760 ns     33199124 ns           21 bytes_per_second=147.077M/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:10      3256993 ns      3254609 ns          216 bytes_per_second=1.46511G/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:100      492856 ns       493377 ns         1447 bytes_per_second=9.66477G/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:1000     262449 ns       263227 ns         2694 bytes_per_second=18.1151G/s*
RecordReaderSkipRecords/Repetition:1/BatchSize:1000    2996951 ns      2997148 ns          235 bytes_per_second=867.36M/s
RecordReaderSkipRecords/Repetition:2/BatchSize:1000   20864734 ns     20850593 ns           34 bytes_per_second=132.968M/s
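
For illustration only, the buffer reuse proposed in the description could be 
sketched roughly as below. This is not the parquet-cpp code: SkipScratch, 
EnsureBitmap and CountAllocations are hypothetical names, and a std::vector 
stands in for the resizable buffer that the real reader would hold. The sketch 
only shows that a long-lived bitmap which is resized per skip allocates once, 
whereas a bitmap recreated per skip allocates every time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the scratch validity bitmap used while skipping.
// A std::vector is used here instead of the reader's real buffer type.
class SkipScratch {
 public:
  // Returns a zeroed bitmap with room for `num_values` bits. Storage is only
  // counted as (re)allocated when the request exceeds the current capacity.
  uint8_t* EnsureBitmap(int64_t num_values) {
    assert(num_values >= 0);
    const std::size_t num_bytes =
        static_cast<std::size_t>((num_values + 7) / 8);
    if (num_bytes > bitmap_.capacity()) ++allocations_;
    bitmap_.assign(num_bytes, 0);  // resize and clear; capacity is retained
    return bitmap_.data();
  }

  int allocations() const { return allocations_; }
  std::size_t size_bytes() const { return bitmap_.size(); }

 private:
  std::vector<uint8_t> bitmap_;
  int allocations_ = 0;
};

// Counts how many times bitmap storage is allocated across `num_skips` skips
// of `batch_size` values, either reusing one scratch object (proposed) or
// creating a fresh one per skip (analogous to the current behavior).
int CountAllocations(bool reuse, int num_skips, int64_t batch_size) {
  SkipScratch kept;  // lives across all skips in the reuse case
  int fresh_total = 0;
  for (int i = 0; i < num_skips; ++i) {
    if (reuse) {
      kept.EnsureBitmap(batch_size);
    } else {
      SkipScratch fresh;  // recreated on every skip
      fresh.EnsureBitmap(batch_size);
      fresh_total += fresh.allocations();
    }
  }
  return reuse ? kept.allocations() : fresh_total;
}
```

Because the bitmap needs only one bit per skipped value, a batch of 1000 values 
occupies 125 bytes, so keeping the storage alive between skips costs very 
little memory.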

[1]



