[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10628: ARROW-12364: [Python] [Dataset] Add metadata_collector option to ds.write_dataset()

GitBox Fri, 09 Jul 2021 01:48:10 -0700


jorisvandenbossche commented on a change in pull request #10628:
URL: https://github.com/apache/arrow/pull/10628#discussion_r666783710




##########
File path: python/pyarrow/dataset.py
##########
@@ -731,6 +731,12 @@ def write_dataset(data, base_dir, basename_template=None, format=None,
         (e.g. S3)
     max_partitions : int, default 1024
         Maximum number of partitions any batch may be written into.
+    file_visitor : Function
+        If set, this function will be called with a WrittenFile instance
+        for each file created during the call.  This object will contain
+        the path and (if the dataset is a parquet dataset) the parquet

Review comment:
   > For my education, what is the concern with users relying on this class?
   > It seems less brittle than users relying on a snippet of documentation.
   
   It just "locks us in" to using exactly this class: users could start
   relying on the specific class (e.g. `isinstance(obj, WrittenFile)`,
   although that shouldn't be useful in practice) instead of on the interface
   of the class (the fact that it has `path` and `metadata` attributes).
   Without publicly exposing the class, we keep the freedom to change this in
   the future (e.g. expose the actual FileWriter) without having to worry
   about breaking user code, as long as the object still has the `path` and
   `metadata` attributes.
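
To illustrate the distinction, here is a minimal sketch of a visitor that depends only on the `path` and `metadata` attributes. The stand-in types below are hypothetical and not part of pyarrow; they only mimic the attributes described in the docstring, so the example runs without writing a dataset.

```python
from collections import namedtuple
from types import SimpleNamespace


def make_visitor(visited):
    """Return a file_visitor-style callback that records (path, metadata)."""
    def visitor(written_file):
        # Use only the documented attributes; no isinstance check on the class.
        visited.append((written_file.path, written_file.metadata))
    return visitor


# Two unrelated stand-in types (assumptions for illustration only).
FakeWrittenFile = namedtuple("FakeWrittenFile", ["path", "metadata"])

visited = []
visitor = make_visitor(visited)
visitor(FakeWrittenFile("part-0.parquet", None))
visitor(SimpleNamespace(path="part-1.parquet", metadata={"num_rows": 3}))
print(visited)
```

Because the callback never names a concrete class, it would keep working if the object passed to it were later replaced by a different type with the same two attributes.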
   
   I personally agree with Ben's comment that this is basically a namedtuple,
   but since Cython doesn't support namedtuples, a simple class seems a good
   alternative (a difference from e.g. ReadOptions is that a user never
   creates a WrittenFile themselves).
   
   Now, I don't care that much about it, and we could also simply expose it
   publicly :) (i.e. import it in the pyarrow.dataset namespace and add the
   class to the API reference docs).
   But I think that with your current updated docstring of write_dataset,
   this is clear enough.
   
   




