jorisvandenbossche commented on a change in pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#discussion_r488672303
##########
File path: python/pyarrow/parquet.py
##########
@@ -1479,6 +1479,11 @@ def read(self, columns=None, use_threads=True,
use_pandas_metadata=False):
]
columns = columns + list(set(index_columns) - set(columns))
+ if len(list(self._dataset.get_fragments())) <= 1:
+ # Allow per-column parallelism; would otherwise cause contention
+ # in the presence of per-file parallelism.
+ use_threads = False
Review comment:
An alternative would also be to handle this within the `to_table` method?
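Something like the following is what I have in mind, i.e. doing the check inside
`to_table` itself so every dataset consumer benefits. This is only a Python-style
sketch of the Cython method in `_dataset.pyx`; the signature and the scanner
construction are simplified and the exact names are approximations, not the actual
implementation:

```python
from pyarrow.dataset import Scanner  # for illustration only


def to_table(self, columns=None, filter=None, use_threads=True):
    # Sketch: decide about per-file threading here instead of in parquet.py.
    if use_threads and len(list(self.get_fragments())) <= 1:
        # A single fragment gets its parallelism from per-column conversion,
        # so drop the outer per-file threading to avoid the two levels of
        # threading contending with each other.
        use_threads = False
    scanner = Scanner.from_dataset(self, columns=columns, filter=filter,
                                   use_threads=use_threads)
    return scanner.to_table()
```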
##########
File path: python/pyarrow/parquet.py
##########
@@ -1479,6 +1479,11 @@ def read(self, columns=None, use_threads=True,
use_pandas_metadata=False):
]
columns = columns + list(set(index_columns) - set(columns))
+ if len(list(self._dataset.get_fragments())) <= 1:
+ # Allow per-column parallelism; would otherwise cause contention
+ # in the presence of per-file parallelism.
+ use_threads = False
Review comment:
We should probably check that `use_threads` is actually True when
enabling this per-column parallelism.
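For example (reusing the snippet above, only the condition changes):

```python
if use_threads and len(list(self._dataset.get_fragments())) <= 1:
    # Per-column parallelism replaces the outer per-file parallelism for a
    # single file; only make that trade when threading was requested, so an
    # explicit use_threads=False stays fully serial.
    use_threads = False
```

That also ties in with the question below about whether the format-level flag
should default to False.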
##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1070,6 +1070,7 @@ cdef class ParquetFileFormat(FileFormat):
options = &(wrapped.get().reader_options)
options.use_buffered_stream = read_options.use_buffered_stream
options.buffer_size = read_options.buffer_size
+ options.enable_parallel_column_conversion = True
Review comment:
Should the default be False here? (and then be set to True if we only
have a single file)
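Roughly what I am thinking of, as a sketch; having
`enable_parallel_column_conversion` as an attribute on the Python-level read
options is purely hypothetical here, just to illustrate defaulting to False and
opting in from `parquet.py` for the single-file case:

```python
# Hypothetical sketch: the read-options attribute below is an assumption for
# illustration, not an API that exists in this PR.
options.use_buffered_stream = read_options.use_buffered_stream
options.buffer_size = read_options.buffer_size
# Defaults to False; parquet.py would flip it to True only when reading a
# single file, where per-column conversion is the only source of parallelism.
options.enable_parallel_column_conversion = (
    read_options.enable_parallel_column_conversion
)
```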
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]