jorisvandenbossche commented on a change in pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#discussion_r488672303
##########
File path: python/pyarrow/parquet.py
##########
@@ -1479,6 +1479,11 @@ def read(self, columns=None, use_threads=True,
use_pandas_metadata=False):
]
columns = columns + list(set(index_columns) - set(columns))
+ if len(list(self._dataset.get_fragments())) <= 1:
+ # Allow per-column parallelism; would otherwise cause contention
+ # in the presence of per-file parallelism.
+ use_threads = False
Review comment:
An alternative would also be to handle this within the `to_table` method?
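Something like the following is what I have in mind, i.e. doing the check inside
`to_table` itself so every dataset consumer benefits. This is only a Python-style
sketch of the Cython method in `_dataset.pyx`; the signature and the scanner
construction are simplified and the exact names are approximations, not the actual
implementation:

```python
from pyarrow.dataset import Scanner  # for illustration only


def to_table(self, columns=None, filter=None, use_threads=True):
    # Sketch: decide about per-file threading here instead of in parquet.py.
    if use_threads and len(list(self.get_fragments())) <= 1:
        # A single fragment gets its parallelism from per-column conversion,
        # so drop the outer per-file threading to avoid the two levels of
        # threading contending with each other.
        use_threads = False
    scanner = Scanner.from_dataset(self, columns=columns, filter=filter,
                                   use_threads=use_threads)
    return scanner.to_table()
```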
##########
File path: python/pyarrow/parquet.py
##########
@@ -1479,6 +1479,11 @@ def read(self, columns=None, use_threads=True,
use_pandas_metadata=False):
]
columns = columns + list(set(index_columns) - set(columns))
+ if len(list(self._dataset.get_fragments())) <= 1:
+ # Allow per-column parallelism; would otherwise cause contention
+ # in the presence of per-file parallelism.
+ use_threads = False
Review comment:
We should probably check that `use_threads` is actually True when
enabling this per-column parallelism.
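For example (reusing the snippet above, only the condition changes):

```python
if use_threads and len(list(self._dataset.get_fragments())) <= 1:
    # Per-column parallelism replaces the outer per-file parallelism for a
    # single file; only make that trade when threading was requested, so an
    # explicit use_threads=False stays fully serial.
    use_threads = False
```

That also ties in with the question below about whether the format-level flag
should default to False.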
##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1070,6 +1070,7 @@ cdef class ParquetFileFormat(FileFormat):
options = &(wrapped.get().reader_options)
options.use_buffered_stream = read_options.use_buffered_stream
options.buffer_size = read_options.buffer_size
+ options.enable_parallel_column_conversion = True
Review comment:
Should the default be False here? (and then be set to True if we only
have a single file)
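Roughly what I am thinking of, as a sketch; having
`enable_parallel_column_conversion` as an attribute on the Python-level read
options is purely hypothetical here, just to illustrate defaulting to False and
opting in from `parquet.py` for the single-file case:

```python
# Hypothetical sketch: the read-options attribute below is an assumption for
# illustration, not an API that exists in this PR.
options.use_buffered_stream = read_options.use_buffered_stream
options.buffer_size = read_options.buffer_size
# Defaults to False; parquet.py would flip it to True only when reading a
# single file, where per-column conversion is the only source of parallelism.
options.enable_parallel_column_conversion = (
    read_options.enable_parallel_column_conversion
)
```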
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]