jorisvandenbossche commented on a change in pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#discussion_r488672303
##########
File path: python/pyarrow/parquet.py
##########
@@ -1479,6 +1479,11 @@ def read(self, columns=None, use_threads=True, use_pandas_metadata=False):
                 ]
                 columns = columns + list(set(index_columns) - set(columns))

+        if len(list(self._dataset.get_fragments())) <= 1:
+            # Allow per-column parallelism; would otherwise cause contention
+            # in the presence of per-file parallelism.
+            use_threads = False

Review comment:
       An alternative would be to handle this within the `to_table` method?

##########
File path: python/pyarrow/parquet.py
##########
@@ -1479,6 +1479,11 @@ def read(self, columns=None, use_threads=True, use_pandas_metadata=False):
                 ]
                 columns = columns + list(set(index_columns) - set(columns))

+        if len(list(self._dataset.get_fragments())) <= 1:
+            # Allow per-column parallelism; would otherwise cause contention
+            # in the presence of per-file parallelism.
+            use_threads = False

Review comment:
       We should probably check that `use_threads` is actually True when enabling this per-column parallelism.

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1070,6 +1070,7 @@ cdef class ParquetFileFormat(FileFormat):
         options = &(wrapped.get().reader_options)
         options.use_buffered_stream = read_options.use_buffered_stream
         options.buffer_size = read_options.buffer_size
+        options.enable_parallel_column_conversion = True

Review comment:
       Should the default be False here? (and then be set to True if we only have a single file)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
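[Editor's sketch] The gating the second review comment asks for, combined with the single-fragment condition from the diff, could be modeled as a small predicate. This is a hedged illustration only; `should_parallelize_columns` and its signature are hypothetical names invented here, not part of the pyarrow API.

```python
def should_parallelize_columns(use_threads: bool, num_fragments: int) -> bool:
    """Hypothetical model of the gating logic discussed in the review.

    Per-column parallelism should only be enabled when the caller asked
    for threading in the first place (use_threads=True) and the dataset
    has at most one fragment (a single file), so that per-column and
    per-file parallelism do not contend for the same thread pool.
    """
    return use_threads and num_fragments <= 1


# Single file with threading requested: per-column parallelism is safe.
print(should_parallelize_columns(True, 1))   # True
# Caller disabled threading: stay single-threaded.
print(should_parallelize_columns(False, 1))  # False
# Multiple fragments: per-file parallelism takes priority.
print(should_parallelize_columns(True, 4))   # False
```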