Re: [PR] [Parquet] Fix slow dictionary encoding of NaN float values [arrow-rs]

via GitHub Tue, 07 Jan 2025 19:49:40 -0800


adamreeve commented on PR #6953:
URL: https://github.com/apache/arrow-rs/pull/6953#issuecomment-2576679001


   Benchmark results from the new benchmarks, before changing the interning behaviour:
   ```
   write_batch primitive/4096 values float with NaNs
                           time:   [5.6968 ms 5.7060 ms 5.7141 ms]
                           thrpt:  [9.6186 MiB/s 9.6324 MiB/s 9.6479 MiB/s]
   Found 8 outliers among 100 measurements (8.00%)
     3 (3.00%) low severe
     4 (4.00%) low mild
     1 (1.00%) high mild
   write_batch primitive/4096 values float with no NaNs
                           time:   [383.44 µs 383.65 µs 383.85 µs]
                           thrpt:  [143.18 MiB/s 143.26 MiB/s 143.34 MiB/s]
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   ```
   This shows that writing a batch with 50% NaN values is about 15 times slower than writing one with no NaNs (5.71 ms vs. 384 µs).
   
   After the change, performance with NaNs is within about 6% of performance without NaNs (407 µs vs. 384 µs):
   ```
   write_batch primitive/4096 values float with NaNs
                           time:   [406.40 µs 406.63 µs 406.88 µs]
                           thrpt:  [135.08 MiB/s 135.16 MiB/s 135.24 MiB/s]
                    change:
                           time:   [-92.875% -92.861% -92.845%] (p = 0.00 < 0.05)
                           thrpt:  [+1297.6% +1300.7% +1303.5%]
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     1 (1.00%) high mild
     2 (2.00%) high severe
   write_batch primitive/4096 values float with no NaNs
                           time:   [382.52 µs 384.16 µs 385.50 µs]
                           thrpt:  [142.58 MiB/s 143.07 MiB/s 143.68 MiB/s]
                    change:
                           time:   [+0.1803% +0.3520% +0.5192%] (p = 0.00 < 0.05)
                           thrpt:  [-0.5165% -0.3507% -0.1799%]
                           Change within noise threshold.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) low severe
   ```
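
For readers wondering why NaNs could hurt dictionary encoding at all, here is a minimal sketch of one plausible mechanism. It is an assumption for illustration, not the actual arrow-rs interner: the function names `intern_by_eq` and `intern_by_bits` are hypothetical. The idea is that if interned values are matched with float equality, `NaN != NaN` means no NaN ever matches an existing entry, so every NaN adds a new dictionary entry. Matching on the bit pattern instead lets identical NaNs share one entry.

```rust
use std::collections::HashMap;

/// Hypothetical interner that matches entries with float equality.
/// Because NaN != NaN under IEEE 754, every NaN is treated as unseen
/// and appended, so the dictionary grows with each NaN in the input.
fn intern_by_eq(values: &[f32]) -> Vec<f32> {
    let mut dict: Vec<f32> = Vec::new();
    for &v in values {
        // Linear scan for simplicity; a NaN never satisfies `d == v`.
        if !dict.iter().any(|&d| d == v) {
            dict.push(v);
        }
    }
    dict
}

/// Hypothetical interner that matches entries by bit pattern.
/// NaNs with identical bits map to the same dictionary slot.
fn intern_by_bits(values: &[f32]) -> Vec<f32> {
    let mut index: HashMap<u32, usize> = HashMap::new();
    let mut dict: Vec<f32> = Vec::new();
    for &v in values {
        // `to_bits` gives the raw IEEE 754 representation as a u32.
        index.entry(v.to_bits()).or_insert_with(|| {
            dict.push(v);
            dict.len() - 1
        });
    }
    dict
}
```

With an input that alternates `f32::NAN` and `1.0`, the equality-based version keeps one entry per NaN, while the bit-based version keeps a single NaN entry. Note that bit-pattern matching also treats `0.0` and `-0.0` as distinct values, which is a trade-off of this approach.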
   
   (I removed the `4096 values float with no NaNs` benchmark from this PR after running these benchmarks, as I don't think there's much value in keeping it.)


