[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

GitBox Mon, 21 Sep 2020 11:44:46 -0700


jorisvandenbossche commented on a change in pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#discussion_r492271433




##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1013,27 +1013,38 @@ cdef class ParquetReadOptions(_Weakrefable):
     dictionary_columns : list of string, default None
         Names of columns which should be dictionary encoded as
         they are read.
+    enable_parallel_column_conversion : bool, default False
+        EXPERIMENTAL: Parallelize conversion across columns. This option is
+        ignored if a scan is already parallelized across input files to avoid
+        thread contention. This option will be removed after support is added
+        for simultaneous parallelization across files and columns.

Review comment:
       I was going to comment: is it needed to add this option in the *public* 
dataset interface? Since we can enable/disable it under the hood depending on 
whether we are reading a single file or not, and since we do not plan to keep 
this keyword long term. 
   
   But, currently we implement this logic in parquet.py, which of course needs 
to be able to pass it through to the dataset API, so it needs to be a keyword 
this way. 

##########
File path: python/pyarrow/parquet.py
##########
@@ -1400,38 +1400,55 @@ def __init__(self, path_or_paths, filesystem=None, 
filters=None,
                                 buffer_size=buffer_size)
         if read_dictionary is not None:
             read_options.update(dictionary_columns=read_dictionary)
-        parquet_format = ds.ParquetFileFormat(read_options=read_options)
 
         # map filters to Expressions
         self._filters = filters
         self._filter_expression = filters and _filters_to_expression(filters)
 
-        # check for single NativeFile dataset
-        if not isinstance(path_or_paths, list):
-            if not _is_path_like(path_or_paths):
-                fragment = parquet_format.make_fragment(path_or_paths)
-                self._dataset = ds.FileSystemDataset(
-                    [fragment], schema=fragment.physical_schema,
-                    format=parquet_format,
-                    filesystem=fragment.filesystem
-                )
-                return
+        # map old filesystems to new one
+        if filesystem is not None:
+            filesystem = _ensure_filesystem(
+                filesystem, use_mmap=memory_map)
+        else:
+            # assume local file system
+            filesystem = LocalFileSystem(use_mmap=memory_map)

Review comment:
       I am not sure it is correct to *always* assume local filesystem if no 
`filesystem` is specified. Eg for a URI this will then not work (but this 
clearly not tested ..)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Reply via email to