jorisvandenbossche commented on a change in pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#discussion_r492271433
##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1013,27 +1013,38 @@ cdef class ParquetReadOptions(_Weakrefable):
dictionary_columns : list of string, default None
Names of columns which should be dictionary encoded as
they are read.
+ enable_parallel_column_conversion : bool, default False
+ EXPERIMENTAL: Parallelize conversion across columns. This option is
+ ignored if a scan is already parallelized across input files to avoid
+ thread contention. This option will be removed after support is added
+ for simultaneous parallelization across files and columns.
Review comment:
I was going to comment: is it needed to add this option in the *public*
dataset interface? Since we can enable/disable it under the hood depending on
whether we are reading a single file or not, and since we do not plan to keep
this keyword long term.
But, currently we implement this logic in parquet.py, which of course needs
to be able to pass it through to the dataset API, so it needs to be a keyword
this way.
##########
File path: python/pyarrow/parquet.py
##########
@@ -1400,38 +1400,55 @@ def __init__(self, path_or_paths, filesystem=None,
filters=None,
buffer_size=buffer_size)
if read_dictionary is not None:
read_options.update(dictionary_columns=read_dictionary)
- parquet_format = ds.ParquetFileFormat(read_options=read_options)
# map filters to Expressions
self._filters = filters
self._filter_expression = filters and _filters_to_expression(filters)
- # check for single NativeFile dataset
- if not isinstance(path_or_paths, list):
- if not _is_path_like(path_or_paths):
- fragment = parquet_format.make_fragment(path_or_paths)
- self._dataset = ds.FileSystemDataset(
- [fragment], schema=fragment.physical_schema,
- format=parquet_format,
- filesystem=fragment.filesystem
- )
- return
+ # map old filesystems to new one
+ if filesystem is not None:
+ filesystem = _ensure_filesystem(
+ filesystem, use_mmap=memory_map)
+ else:
+ # assume local file system
+ filesystem = LocalFileSystem(use_mmap=memory_map)
Review comment:
I am not sure it is correct to *always* assume local filesystem if no
`filesystem` is specified. Eg for a URI this will then not work (but this
clearly not tested ..)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]