bkietz commented on pull request #8188: URL: https://github.com/apache/arrow/pull/8188#issuecomment-693622612
For a range of file and column counts, the time to read is as follows:

```
nfiles  ncolumns  legacy_time  default_time  regression
1       1         0.490398     0.401345      -0.181592
1       2         0.642569     0.523074      -0.185964
1       4         0.988469     0.945871      -0.043095
1       8         1.541519     1.602061       0.039274
2       1         1.078602     0.622690      -0.422688
2       2         1.275463     0.922737      -0.276548
2       4         1.601820     2.001778       0.249689
2       8         2.847058     4.283226       0.504439
4       1         2.116808     0.760073      -0.640935
4       2         2.458016     1.472731      -0.400846
4       4         3.975070     2.648561      -0.333707
4       8         6.531598     6.030903      -0.076657
```

(times in seconds; regression computed as `(default_time - legacy_time) / legacy_time`, so negative values mean the defaults are faster)

`$ python -m pyperf system show`

<details>
<pre>
System state
============

CPU: use 8 logical CPUs: 0-7
Perf event: Maximum sample rate: 1 per second
ASLR: Full randomization
Linux scheduler: No CPU is isolated
CPU Frequency: 0-7=min=max=1800 MHz
CPU scaling governor (intel_pstate): performance
Turbo Boost (intel_pstate): Turbo Boost disabled
IRQ affinity: irqbalance service: inactive
IRQ affinity: Default IRQ affinity: CPU 0-7
IRQ affinity: IRQ affinity: IRQ 0-17,51,120-127,129-130,138-139,146,155-158=CPU 0-7; IRQ 128=CPU 0; IRQ 131=CPU 1; IRQ 132=CPU 2; IRQ 133=CPU 3; IRQ 134=CPU 4; IRQ 135=CPU 5; IRQ 136=CPU 6; IRQ 137=CPU 7
Power supply: the power cable is plugged
</pre>
</details>

We mostly see a performance improvement with the defaults, including a moderate improvement in single-file read time (about 18% for 1 or 2 columns, shrinking to a slight 4% regression at 8 columns).

Note the significant regressions (about 25% and 50%) when reading two files with 4 or 8 columns. This is to be expected, since legacy is able to divide that work across 4 or 8 threads instead of only 2.
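
For anyone who wants to re-derive the regression column, here is a minimal sketch of the formula quoted above. The function name `regression` and the two sample rows are only illustrative and are not taken from the benchmark script. Because the printed times are rounded, the last digit can differ slightly from the table.

```python
def regression(legacy_time: float, default_time: float) -> float:
    """Relative change of default vs. legacy; negative means default is faster."""
    return (default_time - legacy_time) / legacy_time


# (nfiles, ncolumns, legacy_time, default_time), copied from the table above
rows = [
    (1, 1, 0.490398, 0.401345),
    (2, 8, 2.847058, 4.283226),
]

for nfiles, ncolumns, legacy, default in rows:
    print(f"{nfiles} {ncolumns} {regression(legacy, default):+.4f}")
```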