[ https://issues.apache.org/jira/browse/PARQUET-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jinpeng Zhou updated PARQUET-2423:
----------------------------------
Description:
Currently, each invocation of SkipRecords() for non-repeated fields creates
a brand-new buffer object[1]. I think it is probably worth keeping the buffer
object alive and just resizing it for each skip, as the buffer is just a bitmap
(i.e., it should remain quite small even if we don't free its memory after a skip).
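A minimal sketch of the idea, not the actual parquet-cpp code: the class name SkipScratchReader, its members, and the use of std::vector as a stand-in for a resizable buffer are all assumptions made for illustration. It shows a scratch bitmap that lives as long as the reader and is only resized on each skip, so a new allocation happens only when a skip needs more bytes than any earlier one.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a record reader; NOT the parquet-cpp class.
class SkipScratchReader {
 public:
  // Pretends to skip `num_records` records of a non-repeated field.
  // Returns the number of bitmap bytes this skip needs.
  std::size_t SkipRecords(std::size_t num_records) {
    // One validity bit per record, rounded up to whole bytes.
    const std::size_t bitmap_bytes = (num_records + 7) / 8;
    const std::size_t old_capacity = valid_bits_scratch_.capacity();

    // resize() reallocates only when the request exceeds the current
    // capacity; shrinking keeps the memory for the next skip.
    valid_bits_scratch_.resize(bitmap_bytes);
    if (valid_bits_scratch_.capacity() != old_capacity) {
      ++reallocations_;
    }

    // A real reader would decode into the scratch bitmap here and then
    // throw the decoded values away.
    return bitmap_bytes;
  }

  // Number of times the scratch buffer had to grow.
  std::size_t reallocations() const { return reallocations_; }

 private:
  std::vector<std::uint8_t> valid_bits_scratch_;  // kept alive across skips
  std::size_t reallocations_ = 0;
};
```

With this shape, skipping 1000 records, then 10, then 1000 again grows the scratch buffer once; the later skips reuse the same memory. The retained memory is bounded by the largest skip seen so far, at one bit per record.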
Performance results are as follows:
Keep buffer object alive:
--------------------------------------------------------------------------------------------------------------
Benchmark                                                    Time          CPU  Iterations  UserCounters...
--------------------------------------------------------------------------------------------------------------
*RecordReaderSkipRecords/Repetition:0/BatchSize:1     29958201 ns  29943377 ns          23  bytes_per_second=163.068M/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:10     3190298 ns   3190524 ns         227  bytes_per_second=1.49454G/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:100     479056 ns    480437 ns        1500  bytes_per_second=9.92507G/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:1000    256497 ns    257725 ns        2763  bytes_per_second=18.5018G/s*
RecordReaderSkipRecords/Repetition:1/BatchSize:1000    2910364 ns   2910479 ns         239  bytes_per_second=893.189M/s
RecordReaderSkipRecords/Repetition:2/BatchSize:1000   20539472 ns  20535632 ns          34  bytes_per_second=135.007M/s
Recreate upon each skip (current behavior):
--------------------------------------------------------------------------------------------------------------
Benchmark                                                    Time          CPU  Iterations  UserCounters...
--------------------------------------------------------------------------------------------------------------
*RecordReaderSkipRecords/Repetition:0/BatchSize:1     33261760 ns  33199124 ns          21  bytes_per_second=147.077M/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:10     3256993 ns   3254609 ns         216  bytes_per_second=1.46511G/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:100     492856 ns    493377 ns        1447  bytes_per_second=9.66477G/s*
*RecordReaderSkipRecords/Repetition:0/BatchSize:1000    262449 ns    263227 ns        2694  bytes_per_second=18.1151G/s*
RecordReaderSkipRecords/Repetition:1/BatchSize:1000    2996951 ns   2997148 ns         235  bytes_per_second=867.36M/s
RecordReaderSkipRecords/Repetition:2/BatchSize:1000   20864734 ns  20850593 ns          34  bytes_per_second=132.968M/s
[1]https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1482
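To make the two tables easier to compare, the wall-clock times can be turned into relative reductions. The sketch below is only arithmetic on the numbers reported above; the helper name TimeReductionPercent and the Row struct are made up for illustration. By this calculation the reduction is roughly 10% for Repetition:0/BatchSize:1 and between about 1.5% and 3% for the other rows.

```cpp
#include <cstdio>

// Relative wall-time reduction of `keep_alive_ns` versus `recreate_ns`,
// in percent. Positive means the keep-alive variant is faster.
double TimeReductionPercent(double recreate_ns, double keep_alive_ns) {
  return 100.0 * (recreate_ns - keep_alive_ns) / recreate_ns;
}

// Hypothetical holder for one benchmark row (illustration only).
struct Row {
  const char* name;
  double recreate_ns;    // "Recreate upon each skip" Time column
  double keep_alive_ns;  // "Keep buffer object alive" Time column
};

// Wall-clock times copied from the two tables above.
constexpr Row kRows[] = {
    {"Repetition:0/BatchSize:1", 33261760.0, 29958201.0},
    {"Repetition:0/BatchSize:10", 3256993.0, 3190298.0},
    {"Repetition:0/BatchSize:100", 492856.0, 479056.0},
    {"Repetition:0/BatchSize:1000", 262449.0, 256497.0},
    {"Repetition:1/BatchSize:1000", 2996951.0, 2910364.0},
    {"Repetition:2/BatchSize:1000", 20864734.0, 20539472.0},
};

// Prints one line per benchmark with its relative time reduction.
void PrintReductions() {
  for (const Row& row : kRows) {
    std::printf("%-28s %.2f%%\n", row.name,
                TimeReductionPercent(row.recreate_ns, row.keep_alive_ns));
  }
}
```

These are single-run numbers, so the small differences on the larger batch sizes may be within run-to-run noise.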
> Avoid allocating buffer object in RecordReader's SkipRecords
> ------------------------------------------------------------
>
> Key: PARQUET-2423
> URL: https://issues.apache.org/jira/browse/PARQUET-2423
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Jinpeng Zhou
> Priority: Minor
>
> Currently, each invocation of SkipRecords() for non-repeated fields creates
> a brand-new buffer object[1]. I think it is probably worth keeping the
> buffer object alive and just resizing it for each skip, as the buffer is just a
> bitmap (i.e., it should remain quite small even if we don't free its memory
> after a skip). Performance results are as follows:
> Keep buffer object alive:
> --------------------------------------------------------------------------------------------------------------
> Benchmark                                                    Time          CPU  Iterations  UserCounters...
> --------------------------------------------------------------------------------------------------------------
> *RecordReaderSkipRecords/Repetition:0/BatchSize:1     29958201 ns  29943377 ns          23  bytes_per_second=163.068M/s*
> *RecordReaderSkipRecords/Repetition:0/BatchSize:10     3190298 ns   3190524 ns         227  bytes_per_second=1.49454G/s*
> *RecordReaderSkipRecords/Repetition:0/BatchSize:100     479056 ns    480437 ns        1500  bytes_per_second=9.92507G/s*
> *RecordReaderSkipRecords/Repetition:0/BatchSize:1000    256497 ns    257725 ns        2763  bytes_per_second=18.5018G/s*
> RecordReaderSkipRecords/Repetition:1/BatchSize:1000    2910364 ns   2910479 ns         239  bytes_per_second=893.189M/s
> RecordReaderSkipRecords/Repetition:2/BatchSize:1000   20539472 ns  20535632 ns          34  bytes_per_second=135.007M/s
>
> Recreate upon each skip (current behavior):
> --------------------------------------------------------------------------------------------------------------
> Benchmark                                                    Time          CPU  Iterations  UserCounters...
> --------------------------------------------------------------------------------------------------------------
> *RecordReaderSkipRecords/Repetition:0/BatchSize:1     33261760 ns  33199124 ns          21  bytes_per_second=147.077M/s*
> *RecordReaderSkipRecords/Repetition:0/BatchSize:10     3256993 ns   3254609 ns         216  bytes_per_second=1.46511G/s*
> *RecordReaderSkipRecords/Repetition:0/BatchSize:100     492856 ns    493377 ns        1447  bytes_per_second=9.66477G/s*
> *RecordReaderSkipRecords/Repetition:0/BatchSize:1000    262449 ns    263227 ns        2694  bytes_per_second=18.1151G/s*
> RecordReaderSkipRecords/Repetition:1/BatchSize:1000    2996951 ns   2997148 ns         235  bytes_per_second=867.36M/s
> RecordReaderSkipRecords/Repetition:2/BatchSize:1000   20864734 ns  20850593 ns          34  bytes_per_second=132.968M/s
> [1]https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1482