jorisvandenbossche commented on a change in pull request #8188: URL: https://github.com/apache/arrow/pull/8188#discussion_r492271433
########## File path: python/pyarrow/_dataset.pyx ########## @@ -1013,27 +1013,38 @@ cdef class ParquetReadOptions(_Weakrefable): dictionary_columns : list of string, default None Names of columns which should be dictionary encoded as they are read. + enable_parallel_column_conversion : bool, default False + EXPERIMENTAL: Parallelize conversion across columns. This option is + ignored if a scan is already parallelized across input files to avoid + thread contention. This option will be removed after support is added + for simultaneous parallelization across files and columns. Review comment: I was going to comment: is it needed to add this option in the *public* dataset interface? Since we can enable/disable it under the hood depending on whether we are reading a single file or not, and since we do not plan to keep this keyword long term. But, currently we implement this logic in parquet.py, which of course needs to be able to pass it through to the dataset API, so it needs to be a keyword this way. ########## File path: python/pyarrow/parquet.py ########## @@ -1400,38 +1400,55 @@ def __init__(self, path_or_paths, filesystem=None, filters=None, buffer_size=buffer_size) if read_dictionary is not None: read_options.update(dictionary_columns=read_dictionary) - parquet_format = ds.ParquetFileFormat(read_options=read_options) # map filters to Expressions self._filters = filters self._filter_expression = filters and _filters_to_expression(filters) - # check for single NativeFile dataset - if not isinstance(path_or_paths, list): - if not _is_path_like(path_or_paths): - fragment = parquet_format.make_fragment(path_or_paths) - self._dataset = ds.FileSystemDataset( - [fragment], schema=fragment.physical_schema, - format=parquet_format, - filesystem=fragment.filesystem - ) - return + # map old filesystems to new one + if filesystem is not None: + filesystem = _ensure_filesystem( + filesystem, use_mmap=memory_map) + else: + # assume local file system + filesystem = LocalFileSystem(use_mmap=memory_map) Review comment: I am not sure it is correct to *always* assume local filesystem if no `filesystem` is specified. Eg for a URI this will then not work (but this clearly not tested ..) ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org