AlenkaF commented on PR #41870: URL: https://github.com/apache/arrow/pull/41870#issuecomment-4251781446
I finally took the time to improve the benchmark results for this change. It has been clear since https://github.com/apache/arrow/pull/41870#issuecomment-2158757139 that creating a `Table` for the `RecordBatch` to `Tensor` conversion is the main issue. I consulted Claude Code and GitHub Copilot, both of which gave me two good ideas to test.

1. Pre-compute the index for the row-major conversion (https://github.com/apache/arrow/pull/41870/changes/4a879f96b0f9efbdd71c2c28377b122591164386)
   - Number of regressions fell from 17 to 13
   - Max regression fell from -43% to -38%

<details>
<summary>Benchmark result 1</summary>

```
$ archery --quiet benchmark diff --benchmark-filter=BatchToTensorSimple
----------------------------------------------------------------------------------------------------
Non-regressions: (11)
----------------------------------------------------------------------------------------------------
benchmark  baseline  contender  change %  counters
BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30  3.400 GiB/sec  4.006 GiB/sec  17.840  {'family_index': 2, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 609}
BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3  9.607 GiB/sec  11.257 GiB/sec  17.181  {'family_index': 2, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1723}
BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300  3.922 GiB/sec  4.246 GiB/sec  8.258  {'family_index': 2, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 711}
BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3  1.346 GiB/sec  1.360 GiB/sec  1.043  {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 241}
BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3  5.743 GiB/sec  5.754 GiB/sec  0.193  {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1024}
BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30  2.329 GiB/sec  2.319 GiB/sec  -0.430  {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 418}
BatchToTensorSimple<Int8Type>/size:65536/num_columns:3  1.375 GiB/sec  1.365 GiB/sec  -0.702  {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 15777}
BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30  980.412 MiB/sec  959.127 MiB/sec  -2.171  {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 172}
BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300  724.155 MiB/sec  708.308 MiB/sec  -2.188  {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 126}
BatchToTensorSimple<Int8Type>/size:65536/num_columns:30  1.255 GiB/sec  1.216 GiB/sec  -3.107  {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 14451}
BatchToTensorSimple<Int16Type>/size:65536/num_columns:3  5.227 GiB/sec  5.054 GiB/sec  -3.307  {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 59905}
----------------------------------------------------------------------------------------------------
Regressions: (13)
----------------------------------------------------------------------------------------------------
benchmark  baseline  contender  change %  counters
BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30  7.150 GiB/sec  6.782 GiB/sec  -5.158  {'family_index': 3, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1281}
BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300  2.229 GiB/sec  2.017 GiB/sec  -9.482  {'family_index': 1, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 401}
BatchToTensorSimple<Int16Type>/size:65536/num_columns:30  4.068 GiB/sec  3.445 GiB/sec  -15.303  {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 46683}
BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3  14.380 GiB/sec  11.985 GiB/sec  -16.657  {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2582}
BatchToTensorSimple<Int64Type>/size:65536/num_columns:3  17.690 GiB/sec  14.347 GiB/sec  -18.901  {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 203532}
BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300  13.195 GiB/sec  10.688 GiB/sec  -18.999  {'family_index': 3, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2358}
BatchToTensorSimple<Int8Type>/size:65536/num_columns:300  746.820 MiB/sec  595.770 MiB/sec  -20.226  {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 8300}
BatchToTensorSimple<Int32Type>/size:65536/num_columns:3  9.870 GiB/sec  7.690 GiB/sec  -22.088  {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 111961}
BatchToTensorSimple<Int64Type>/size:65536/num_columns:30  8.216 GiB/sec  6.032 GiB/sec  -26.581  {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 94142}
BatchToTensorSimple<Int32Type>/size:65536/num_columns:30  6.596 GiB/sec  4.725 GiB/sec  -28.357  {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 59518}
BatchToTensorSimple<Int16Type>/size:65536/num_columns:300  1.214 GiB/sec  870.740 MiB/sec  -29.978  {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 13917}
BatchToTensorSimple<Int64Type>/size:65536/num_columns:300  1.508 GiB/sec  989.418 MiB/sec  -35.907  {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17500}
BatchToTensorSimple<Int32Type>/size:65536/num_columns:300  1.421 GiB/sec  901.875 MiB/sec  -38.012  {'family_index': 2, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16439}
```

</details>

2. Template the `ToTensor` method for `RecordBatch` and `Table` separately, avoiding the heap allocations that come from creating a `Table` for a `RecordBatch` (https://github.com/apache/arrow/pull/41870/changes/1f12b9013faf76c663b79e6f62c7c1be8c9fec2f)
   - Number of regressions fell to between 3 and 5 across multiple runs
   - Max regression fell from -38% to between -15% and -18% across multiple runs
   - Net throughput improves on most shapes. The remaining losses are concentrated in Int64.

<details>
<summary>Benchmark result 2</summary>

```
$ archery --quiet benchmark diff --benchmark-filter=BatchToTensorSimple
----------------------------------------------------------------------------------------------------
Non-regressions: (21)
----------------------------------------------------------------------------------------------------
benchmark  baseline  contender  change %  counters
BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30  3.380 GiB/sec  4.020 GiB/sec  18.919  {'family_index': 2, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 609}
BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3  9.570 GiB/sec  11.156 GiB/sec  16.575  {'family_index': 2, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1718}
BatchToTensorSimple<Int16Type>/size:65536/num_columns:300  1.199 GiB/sec  1.385 GiB/sec  15.483  {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10000}
BatchToTensorSimple<Int32Type>/size:65536/num_columns:300  1.424 GiB/sec  1.631 GiB/sec  14.528  {'family_index': 2, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16702}
BatchToTensorSimple<Int64Type>/size:65536/num_columns:300  1.517 GiB/sec  1.735 GiB/sec  14.355  {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17535}
BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300  3.912 GiB/sec  4.381 GiB/sec  11.977  {'family_index': 2, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 602}
BatchToTensorSimple<Int8Type>/size:65536/num_columns:300  743.539 MiB/sec  800.822 MiB/sec  7.704  {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 8398}
BatchToTensorSimple<Int64Type>/size:65536/num_columns:30  8.443 GiB/sec  8.652 GiB/sec  2.476  {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 96545}
BatchToTensorSimple<Int16Type>/size:65536/num_columns:30  4.061 GiB/sec  4.161 GiB/sec  2.461  {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 46607}
BatchToTensorSimple<Int8Type>/size:65536/num_columns:30  1.250 GiB/sec  1.267 GiB/sec  1.378  {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 14433}
BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30  2.308 GiB/sec  2.339 GiB/sec  1.378  {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 417}
BatchToTensorSimple<Int16Type>/size:65536/num_columns:3  5.172 GiB/sec  5.237 GiB/sec  1.257  {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 59889}
BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3  5.737 GiB/sec  5.765 GiB/sec  0.493  {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1029}
BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300  723.279 MiB/sec  725.363 MiB/sec  0.288  {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 126}
BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30  979.022 MiB/sec  981.447 MiB/sec  0.248  {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 171}
BatchToTensorSimple<Int32Type>/size:65536/num_columns:30  6.662 GiB/sec  6.678 GiB/sec  0.242  {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 76524}
BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300  2.232 GiB/sec  2.219 GiB/sec  -0.569  {'family_index': 1, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 400}
BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3  1.344 GiB/sec  1.302 GiB/sec  -3.108  {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 238}
BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30  7.073 GiB/sec  6.852 GiB/sec  -3.124  {'family_index': 3, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1281}
BatchToTensorSimple<Int32Type>/size:65536/num_columns:3  9.812 GiB/sec  9.479 GiB/sec  -3.388  {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 112938}
BatchToTensorSimple<Int8Type>/size:65536/num_columns:3  1.372 GiB/sec  1.316 GiB/sec  -4.075  {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 15748}
----------------------------------------------------------------------------------------------------
Regressions: (3)
----------------------------------------------------------------------------------------------------
benchmark  baseline  contender  change %  counters
BatchToTensorSimple<Int64Type>/size:65536/num_columns:3  17.550 GiB/sec  16.345 GiB/sec  -6.865  {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 202920}
BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300  13.096 GiB/sec  11.669 GiB/sec  -10.894  {'family_index': 3, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2345}
BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3  14.224 GiB/sec  12.131 GiB/sec  -14.716  {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2570}
```

</details>

cc @jorisvandenbossche in case you are interested =)

I see there is a Python CI build with a failing test and a Windows C++ build failure. I will fix both ASAP.
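
For anyone re-checking the summaries against the tables, here is a minimal sketch of how the `change %` column relates to the baseline and contender throughput. It assumes a 5% regression cutoff, which is consistent with the rows above (-5.158% is listed as a regression, -4.075% is not) but is not confirmed here as the configured archery threshold. The function names are illustrative and are not part of archery.

```python
def change_percent(baseline: float, contender: float) -> float:
    """Relative throughput change of contender vs. baseline, in percent.

    Both values must be in the same unit (e.g. both GiB/sec).
    """
    return (contender - baseline) / baseline * 100.0


def is_regression(change: float, threshold: float = 5.0) -> bool:
    """Assumption: a drop of more than `threshold` percent is a regression."""
    return change < -threshold


# (baseline, contender) in GiB/sec, copied from "Benchmark result 2".
rows = {
    "Int64Type/size:65536/num_columns:3": (17.550, 16.345),
    "Int64Type/size:4194304/num_columns:3": (14.224, 12.131),
    "Int8Type/size:65536/num_columns:3": (1.372, 1.316),
}

for name, (baseline, contender) in rows.items():
    change = change_percent(baseline, contender)
    label = "regression" if is_regression(change) else "ok"
    print(f"{name}: {change:.2f}% ({label})")
```

The recomputed values differ from the reported column in the second or third decimal place because the table shows rounded throughput figures.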
