jorisvandenbossche commented on a change in pull request #10628:
URL: https://github.com/apache/arrow/pull/10628#discussion_r666783710
##########
File path: python/pyarrow/dataset.py
##########
@@ -731,6 +731,12 @@ def write_dataset(data, base_dir, basename_template=None, format=None,
         (e.g. S3)
     max_partitions : int, default 1024
         Maximum number of partitions any batch may be written into.
+    file_visitor : Function
+        If set, this function will be called with a WrittenFile instance
+        for each file created during the call.  This object will contain
+        the path and (if the dataset is a parquet dataset) the parquet

Review comment:
> For my education, what is the concern with users relying on this class? It seems less brittle than users relying on a snippet of documentation.

Exposing it just "locks us in" to using exactly this class: users could start relying on the specific class (e.g. `isinstance(obj, WrittenFile)`, although that shouldn't be useful in practice) instead of on the interface of the class (the fact that it has `path` and `metadata` attributes). Without publicly exposing the class, we keep the freedom to change this in the future (e.g. expose the actual FileWriter) without having to worry about possibly breaking code, as long as the object still has the `path` and `metadata` attributes.

I personally agree with Ben's comment about this basically being a namedtuple, but since cython doesn't support namedtuples, a simple class seems a good alternative (one difference from e.g. ReadOptions is that a user never creates a WrittenFile themselves).

Now, I don't care that much about it, and we could also simply expose it publicly :) (i.e. import it in the pyarrow.dataset namespace and add the class to the API reference docs). But I think with your current updated docstring of write_dataset, this is clear enough.
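To make the "interface, not class" point concrete, here is a minimal sketch of a visitor that depends only on the `path` and `metadata` attributes mentioned above. The stand-in objects (`SimpleNamespace` and the hypothetical `FutureWriter` class) are assumptions for illustration only; they are not part of pyarrow and merely imitate whatever object `write_dataset` would hand to the callback.

```python
from types import SimpleNamespace

collected = []


def file_visitor(written_file):
    # Use only the documented attributes; no isinstance() check on the
    # concrete class, so the callback keeps working if that class changes.
    collected.append((written_file.path, written_file.metadata))


class FutureWriter:
    """Hypothetical replacement type exposing the same two attributes."""

    def __init__(self, path, metadata):
        self.path = path
        self.metadata = metadata


# Two unrelated stand-in types are both accepted by the same visitor.
file_visitor(SimpleNamespace(path="base/part-0.parquet", metadata=None))
file_visitor(FutureWriter("base/part-1.parquet", {"num_rows": 3}))

paths = [path for path, _ in collected]
```

With pyarrow installed, the same function would be passed as the `file_visitor` argument of `write_dataset`; a visitor written this way should not need changes if the object's concrete class is swapped later.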