Re: [PR] GH-40062: [C++][Python] Conversion of Table to Arrow Tensor [arrow]

via GitHub Wed, 15 Apr 2026 04:59:08 -0700


AlenkaF commented on PR #41870:
URL: https://github.com/apache/arrow/pull/41870#issuecomment-4251781446


   I finally took the time to improve the benchmarks for this change. It has 
been clear since https://github.com/apache/arrow/pull/41870#issuecomment-2158757139 
that creating a `Table` for the `RecordBatch` to `Tensor` conversion is the 
main issue. I consulted Claude Code and GitHub Copilot, both of which 
suggested two good ideas to test.
   
   1. Pre-compute the index for the row-major conversion 
(https://github.com/apache/arrow/pull/41870/changes/4a879f96b0f9efbdd71c2c28377b122591164386)
   - Number of regressions fell from 17 to 13
   - Max regression fell from -43% to -38%
   
   <details>
    <summary>Benchmark result 1</summary>
   
   ```
   $ archery --quiet benchmark diff --benchmark-filter=BatchToTensorSimple
   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   Non-regressions: (11)
   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                     benchmark        baseline       contender  change %  counters
    BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30   3.400 GiB/sec   4.006 GiB/sec    17.840  {'family_index': 2, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 609}
     BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3   9.607 GiB/sec  11.257 GiB/sec    17.181  {'family_index': 2, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1723}
   BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300   3.922 GiB/sec   4.246 GiB/sec     8.258  {'family_index': 2, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 711}
      BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3   1.346 GiB/sec   1.360 GiB/sec     1.043  {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 241}
     BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3   5.743 GiB/sec   5.754 GiB/sec     0.193  {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1024}
    BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30   2.329 GiB/sec   2.319 GiB/sec    -0.430  {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 418}
        BatchToTensorSimple<Int8Type>/size:65536/num_columns:3   1.375 GiB/sec   1.365 GiB/sec    -0.702  {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 15777}
     BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30 980.412 MiB/sec 959.127 MiB/sec    -2.171  {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 172}
    BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300 724.155 MiB/sec 708.308 MiB/sec    -2.188  {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 126}
       BatchToTensorSimple<Int8Type>/size:65536/num_columns:30   1.255 GiB/sec   1.216 GiB/sec    -3.107  {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 14451}
       BatchToTensorSimple<Int16Type>/size:65536/num_columns:3   5.227 GiB/sec   5.054 GiB/sec    -3.307  {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 59905}
   
   
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   Regressions: (13)
   
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                     benchmark        baseline       contender  change %  counters
    BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30   7.150 GiB/sec   6.782 GiB/sec    -5.158  {'family_index': 3, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1281}
   BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300   2.229 GiB/sec   2.017 GiB/sec    -9.482  {'family_index': 1, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 401}
      BatchToTensorSimple<Int16Type>/size:65536/num_columns:30   4.068 GiB/sec   3.445 GiB/sec   -15.303  {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 46683}
     BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3  14.380 GiB/sec  11.985 GiB/sec   -16.657  {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2582}
       BatchToTensorSimple<Int64Type>/size:65536/num_columns:3  17.690 GiB/sec  14.347 GiB/sec   -18.901  {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 203532}
   BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300  13.195 GiB/sec  10.688 GiB/sec   -18.999  {'family_index': 3, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2358}
      BatchToTensorSimple<Int8Type>/size:65536/num_columns:300 746.820 MiB/sec 595.770 MiB/sec   -20.226  {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 8300}
       BatchToTensorSimple<Int32Type>/size:65536/num_columns:3   9.870 GiB/sec   7.690 GiB/sec   -22.088  {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 111961}
      BatchToTensorSimple<Int64Type>/size:65536/num_columns:30   8.216 GiB/sec   6.032 GiB/sec   -26.581  {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 94142}
      BatchToTensorSimple<Int32Type>/size:65536/num_columns:30   6.596 GiB/sec   4.725 GiB/sec   -28.357  {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 59518}
     BatchToTensorSimple<Int16Type>/size:65536/num_columns:300   1.214 GiB/sec 870.740 MiB/sec   -29.978  {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 13917}
     BatchToTensorSimple<Int64Type>/size:65536/num_columns:300   1.508 GiB/sec 989.418 MiB/sec   -35.907  {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17500}
     BatchToTensorSimple<Int32Type>/size:65536/num_columns:300   1.421 GiB/sec 901.875 MiB/sec   -38.012  {'family_index': 2, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16439}
   ```
   
   </details>
   
   2. Template the `ToTensor` method separately for `RecordBatch` and `Table`, 
avoiding the heap allocations from creating a `Table` for a `RecordBatch` 
(https://github.com/apache/arrow/pull/41870/changes/1f12b9013faf76c663b79e6f62c7c1be8c9fec2f)
   - Number of regressions fell from 13 to between 3 and 5 across multiple runs
   - Max regression fell from -38% to between -15% and -18% across multiple runs
   - Net throughput improves on most shapes. The remaining losses are 
concentrated in Int64
   
   <details>
    <summary>Benchmark result 2</summary>
   
   ```
   $ archery --quiet benchmark diff --benchmark-filter=BatchToTensorSimple
   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   Non-regressions: (21)
   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                     benchmark        baseline       contender  change %  counters
    BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30   3.380 GiB/sec   4.020 GiB/sec    18.919  {'family_index': 2, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 609}
     BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3   9.570 GiB/sec  11.156 GiB/sec    16.575  {'family_index': 2, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1718}
     BatchToTensorSimple<Int16Type>/size:65536/num_columns:300   1.199 GiB/sec   1.385 GiB/sec    15.483  {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10000}
     BatchToTensorSimple<Int32Type>/size:65536/num_columns:300   1.424 GiB/sec   1.631 GiB/sec    14.528  {'family_index': 2, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16702}
     BatchToTensorSimple<Int64Type>/size:65536/num_columns:300   1.517 GiB/sec   1.735 GiB/sec    14.355  {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17535}
   BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300   3.912 GiB/sec   4.381 GiB/sec    11.977  {'family_index': 2, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 602}
      BatchToTensorSimple<Int8Type>/size:65536/num_columns:300 743.539 MiB/sec 800.822 MiB/sec     7.704  {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 8398}
      BatchToTensorSimple<Int64Type>/size:65536/num_columns:30   8.443 GiB/sec   8.652 GiB/sec     2.476  {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 96545}
      BatchToTensorSimple<Int16Type>/size:65536/num_columns:30   4.061 GiB/sec   4.161 GiB/sec     2.461  {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 46607}
       BatchToTensorSimple<Int8Type>/size:65536/num_columns:30   1.250 GiB/sec   1.267 GiB/sec     1.378  {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 14433}
    BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30   2.308 GiB/sec   2.339 GiB/sec     1.378  {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 417}
       BatchToTensorSimple<Int16Type>/size:65536/num_columns:3   5.172 GiB/sec   5.237 GiB/sec     1.257  {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 59889}
     BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3   5.737 GiB/sec   5.765 GiB/sec     0.493  {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1029}
    BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300 723.279 MiB/sec 725.363 MiB/sec     0.288  {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 126}
     BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30 979.022 MiB/sec 981.447 MiB/sec     0.248  {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 171}
      BatchToTensorSimple<Int32Type>/size:65536/num_columns:30   6.662 GiB/sec   6.678 GiB/sec     0.242  {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 76524}
   BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300   2.232 GiB/sec   2.219 GiB/sec    -0.569  {'family_index': 1, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 400}
      BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3   1.344 GiB/sec   1.302 GiB/sec    -3.108  {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 238}
    BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30   7.073 GiB/sec   6.852 GiB/sec    -3.124  {'family_index': 3, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1281}
       BatchToTensorSimple<Int32Type>/size:65536/num_columns:3   9.812 GiB/sec   9.479 GiB/sec    -3.388  {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 112938}
        BatchToTensorSimple<Int8Type>/size:65536/num_columns:3   1.372 GiB/sec   1.316 GiB/sec    -4.075  {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 15748}
   
   
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   Regressions: (3)
   
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                     benchmark       baseline      contender  change %  counters
       BatchToTensorSimple<Int64Type>/size:65536/num_columns:3 17.550 GiB/sec 16.345 GiB/sec    -6.865  {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 202920}
   BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300 13.096 GiB/sec 11.669 GiB/sec   -10.894  {'family_index': 3, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2345}
     BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3 14.224 GiB/sec 12.131 GiB/sec   -14.716  {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2570}
   ```
   </details>
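   
   For anyone skimming the thread, here is a minimal sketch of how the two 
ideas above can fit together. It is not the code from the linked commits: 
`Batch`, `ChunkedTable`, `ForEachChunk` and `ToRowMajor` are hypothetical 
stand-ins for illustration, and the element type is fixed to `int32_t` to keep 
it short. The conversion is templated on the container, so a batch is walked 
directly instead of being wrapped in a table first, and the output index is 
set once per column and then advanced by the row stride rather than being 
recomputed for every element.
   
   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <vector>

   // Hypothetical stand-ins for the two containers; not Arrow's real classes.
   struct Batch {         // exactly one contiguous chunk per column
     std::vector<std::vector<int32_t>> columns;
   };
   struct ChunkedTable {  // any number of chunks per column
     std::vector<std::vector<std::vector<int32_t>>> columns;
   };

   // Visit each chunk of column `c`. The overloads let one templated
   // conversion serve both containers without building a temporary table.
   template <typename Fn>
   void ForEachChunk(const Batch& b, std::size_t c, Fn&& fn) {
     fn(b.columns[c]);
   }
   template <typename Fn>
   void ForEachChunk(const ChunkedTable& t, std::size_t c, Fn&& fn) {
     for (const auto& chunk : t.columns[c]) fn(chunk);
   }

   // Copy column-oriented input into a row-major buffer of shape
   // (num_rows, num_cols). Every column must hold num_rows values in total.
   template <typename Container>
   std::vector<int32_t> ToRowMajor(const Container& in, std::size_t num_rows) {
     const std::size_t num_cols = in.columns.size();
     std::vector<int32_t> out(num_rows * num_cols);
     for (std::size_t c = 0; c < num_cols; ++c) {
       // Output position is set once per column, then stepped by the stride.
       std::size_t idx = c;
       ForEachChunk(in, c, [&](const std::vector<int32_t>& chunk) {
         for (int32_t v : chunk) {
           assert(idx < out.size());
           out[idx] = v;
           idx += num_cols;
         }
       });
     }
     return out;
   }
   ```
   
   With two columns `{1, 2, 3}` and `{4, 5, 6}`, `ToRowMajor` fills the buffer 
as `1, 4, 2, 5, 3, 6` for either container.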
   
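   A note on reading the `change %` column, as a small sketch. The formula 
below is inferred from the numbers in the tables rather than taken from the 
archery source, and the 5% cutoff is an assumption that matches the rows 
listed here (-5.158% is reported as a regression, -4.075% is not). 
`ChangePercent` and `IsRegression` are hypothetical helper names.
   
   ```cpp
   #include <cmath>

   // Relative change of contender vs. baseline throughput, in percent.
   // Both values must be in the same unit, so convert MiB/sec to GiB/sec
   // (divide by 1024) before comparing mixed rows.
   inline double ChangePercent(double baseline, double contender) {
     return (contender - baseline) / baseline * 100.0;
   }

   // Assumed cutoff: a drop of more than `threshold_percent` is a regression.
   inline bool IsRegression(double change_percent,
                            double threshold_percent = 5.0) {
     return change_percent < -threshold_percent;
   }
   ```
   
   For example, the worst row of the first run goes from 1.421 GiB/sec to 
901.875 MiB/sec, and `ChangePercent(1.421, 901.875 / 1024.0)` comes out at 
roughly -38%, in line with the reported -38.012.
   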
   cc @jorisvandenbossche in case you are interested =)
   
   I see there is a Python CI build with a failing test and a Windows C++ 
build failure. I will fix both asap.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

