[GitHub] [arrow] bkietz commented on pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

GitBox Wed, 16 Sep 2020 12:37:20 -0700


bkietz commented on pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#issuecomment-693622612



   For a range of file and column counts, the time to read is as follows:
   ```
   nfiles  ncolumns  legacy_time  default_time  regression
        1         1     0.490398      0.401345   -0.181592
        1         2     0.642569      0.523074   -0.185964
        1         4     0.988469      0.945871   -0.043095
        1         8     1.541519      1.602061    0.039274
        2         1     1.078602      0.622690   -0.422688
        2         2     1.275463      0.922737   -0.276548
        2         4     1.601820      2.001778    0.249689
        2         8     2.847058      4.283226    0.504439
        4         1     2.116808      0.760073   -0.640935
        4         2     2.458016      1.472731   -0.400846
        4         4     3.975070      2.648561   -0.333707
        4         8     6.531598      6.030903   -0.076657
   ```
   (times in seconds; regression computed as (default_time - legacy_time)/legacy_time, so negative values mean the default is faster)
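
As a minimal sketch, the regression column can be reproduced from the two timing columns as follows. The function name `relative_regression` is hypothetical and is not part of the PR or the benchmark script; the rows are copied from the table above, and small differences in the last digit are expected because the printed times are rounded.

```python
def relative_regression(legacy_time: float, default_time: float) -> float:
    """Relative change of default vs. legacy; negative means default is faster."""
    return (default_time - legacy_time) / legacy_time


# A few rows copied from the table: (nfiles, ncolumns, legacy_time, default_time)
rows = [
    (1, 1, 0.490398, 0.401345),
    (2, 8, 2.847058, 4.283226),
    (4, 1, 2.116808, 0.760073),
]

for nfiles, ncolumns, legacy, default in rows:
    change = relative_regression(legacy, default)
    print(f"nfiles={nfiles} ncolumns={ncolumns} regression={change:+.4f}")
```

A positive value, as in the two-file, eight-column row, indicates that the default path was slower than legacy.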
   
   `$ python -m pyperf system show`
   <details>
   <pre>
   System state
   ============
   
   CPU: use 8 logical CPUs: 0-7
   Perf event: Maximum sample rate: 1 per second
   ASLR: Full randomization
   Linux scheduler: No CPU is isolated
   CPU Frequency: 0-7=min=max=1800 MHz
   CPU scaling governor (intel_pstate): performance
   Turbo Boost (intel_pstate): Turbo Boost disabled
   IRQ affinity: irqbalance service: inactive
   IRQ affinity: Default IRQ affinity: CPU 0-7
   IRQ affinity: IRQ affinity: IRQ 0-17,51,120-127,129-130,138-139,146,155-158=CPU 0-7; IRQ 128=CPU 0; IRQ 131=CPU 1; IRQ 132=CPU 2; IRQ 133=CPU 3; IRQ 134=CPU 4; IRQ 135=CPU 5; IRQ 136=CPU 6; IRQ 137=CPU 7
   Power supply: the power cable is plugged
   </pre>
   </details>
   
   With the defaults we mostly see a performance improvement, including a moderate improvement in single-file read time (roughly 4% to 19% faster for 1 to 4 columns; the 8-column case is about 4% slower). Note the significant regressions when reading two files with 4 or 8 columns (about 25% and 50% slower, respectively). These are to be expected, since legacy is able to divide that work across 4 or 8 threads instead of only 2.


