pitrou opened a new pull request, #45202: URL: https://github.com/apache/arrow/pull/45202
### Rationale for this change

It was found in https://github.com/apache/arrow/pull/45085 that there is a non-trivial overhead when writing size statistics is enabled.

### What changes are included in this PR?

Dramatically reduce overhead by avoiding being limited by CPU cache latency when updating histogram values.

Performance results on the author's machine:

```
------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                    Time          CPU  Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                  8035203 ns   8029410 ns          88 bytes_per_second=1011.9Mi/s items_per_second=130.592M/s output_size=537.472k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>           8181015 ns   8175040 ns          84 bytes_per_second=993.879Mi/s items_per_second=128.266M/s output_size=537.49k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>    8218327 ns   8209084 ns          85 bytes_per_second=989.757Mi/s items_per_second=127.734M/s output_size=537.506k page_index_size=49
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType>                10162863 ns  10158627 ns          69 bytes_per_second=454.728Mi/s items_per_second=103.22M/s output_size=848.305k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>         10305078 ns  10300610 ns          67 bytes_per_second=448.46Mi/s items_per_second=101.797M/s output_size=848.327k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>  10317502 ns  10311593 ns          67 bytes_per_second=447.983Mi/s items_per_second=101.689M/s output_size=848.348k page_index_size=50
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                      13318338 ns  13304739 ns          50 bytes_per_second=641.689Mi/s items_per_second=78.8122M/s output_size=617.464k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>               14205107 ns  14190396 ns          51 bytes_per_second=601.639Mi/s items_per_second=73.8934M/s output_size=617.487k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>        14236149 ns  14222354 ns          48 bytes_per_second=600.288Mi/s items_per_second=73.7273M/s output_size=617.508k page_index_size=55
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType>                     16079282 ns  16062363 ns          44 bytes_per_second=313.274Mi/s items_per_second=65.2816M/s output_size=927.326k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>              17110015 ns  17093005 ns          40 bytes_per_second=294.385Mi/s items_per_second=61.3453M/s output_size=927.353k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>       17095732 ns  17077822 ns          41 bytes_per_second=294.646Mi/s items_per_second=61.3999M/s output_size=927.379k page_index_size=56
```

Performance results without this PR:

```
------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                    Time          CPU  Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                  8042576 ns   8037678 ns          87 bytes_per_second=1010.86Mi/s items_per_second=130.458M/s output_size=537.472k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>           9576627 ns   9571279 ns          73 bytes_per_second=848.894Mi/s items_per_second=109.554M/s output_size=537.488k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>    9570204 ns   9563595 ns          73 bytes_per_second=849.576Mi/s items_per_second=109.642M/s output_size=537.502k page_index_size=47
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType>                10165397 ns  10160868 ns          69 bytes_per_second=454.628Mi/s items_per_second=103.197M/s output_size=848.305k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>         11662568 ns  11657396 ns          60 bytes_per_second=396.265Mi/s items_per_second=89.9494M/s output_size=848.325k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>  11657135 ns  11653063 ns          60 bytes_per_second=396.412Mi/s items_per_second=89.9829M/s output_size=848.344k page_index_size=48
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                      13182006 ns  13168704 ns          51 bytes_per_second=648.318Mi/s items_per_second=79.6264M/s output_size=617.464k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>               16438205 ns  16421762 ns          43 bytes_per_second=519.89Mi/s items_per_second=63.8528M/s output_size=617.486k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>        16424615 ns  16409032 ns          42 bytes_per_second=520.293Mi/s items_per_second=63.9024M/s output_size=617.506k page_index_size=54
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType>                     15387808 ns  15373086 ns          46 bytes_per_second=327.32Mi/s items_per_second=68.2086M/s output_size=927.326k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>              18319628 ns  18302938 ns          37 bytes_per_second=274.924Mi/s items_per_second=57.29M/s output_size=927.352k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>       18346665 ns  18329336 ns          37 bytes_per_second=274.528Mi/s items_per_second=57.2075M/s output_size=927.377k page_index_size=55
```

### Are these changes tested?

Tested by existing tests, validated by existing benchmarks.

### Are there any user-facing changes?

No.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
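
### Illustration: reducing latency-bound histogram updates

The description under "What changes are included in this PR?" names the goal (not being limited by CPU cache latency when updating histogram values) but does not show code. The following is a hypothetical sketch of one common way to reach that goal, not the implementation in this PR: consecutive increments of the same histogram slot form a load/store dependency chain, so the updates are spread across several partial histograms that are summed at the end. The function names and the unroll factor `kUnroll = 4` are assumptions made for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Baseline: each iteration loads, increments and stores histogram[level].
// When consecutive levels are equal, every increment depends on the
// previous store to the same slot.
void UpdateHistogramNaive(const std::vector<int16_t>& levels,
                          std::vector<int64_t>& histogram) {
  for (int16_t level : levels) {
    ++histogram[static_cast<std::size_t>(level)];
  }
}

// Sketch (assumed technique): interleave updates over kUnroll partial
// histograms so that adjacent increments touch different memory and can
// overlap, then merge the partial counts into the caller's histogram.
// Precondition: every level is in [0, histogram.size()).
void UpdateHistogramInterleaved(const std::vector<int16_t>& levels,
                                std::vector<int64_t>& histogram) {
  assert(!histogram.empty());
  constexpr std::size_t kUnroll = 4;  // hypothetical unroll factor
  std::array<std::vector<int64_t>, kUnroll> partial;
  for (auto& p : partial) {
    p.assign(histogram.size(), 0);
  }

  const std::size_t n = levels.size();
  std::size_t i = 0;
  // Main loop: element i + j goes to partial histogram j.
  for (; i + kUnroll <= n; i += kUnroll) {
    for (std::size_t j = 0; j < kUnroll; ++j) {
      ++partial[j][static_cast<std::size_t>(levels[i + j])];
    }
  }
  // Tail: fewer than kUnroll elements remain.
  for (; i < n; ++i) {
    ++partial[0][static_cast<std::size_t>(levels[i])];
  }
  // Merge the partial counts.
  for (const auto& p : partial) {
    for (std::size_t k = 0; k < histogram.size(); ++k) {
      histogram[k] += p[k];
    }
  }
}
```

Both functions produce the same counts; whether the interleaved variant is actually faster depends on the level distribution, the histogram size and the CPU, so it would need to be measured with benchmarks such as the ones above.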
