lidavidm commented on a change in pull request #12560:
URL: https://github.com/apache/arrow/pull/12560#discussion_r825877794
##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -2019,9 +2023,12 @@ cdef class Scanner(_Weakrefable):
dataset : Dataset
Dataset to scan.
columns : list of str or dict, default None
- The columns to project. This can be a list of column names to include
- (order and duplicates will be preserved), or a dictionary with
- {new_column_name: expression} values for more advanced projections.
+ The columns to project. This can be a list of column names to
+ include (order and duplicates will be preserved) which may contain the
+ augmented fields such as `batch_index`, `fragment_index`,
+ `last_in_fragment` and `filename`, or a dictionary
+ with {new_column_name: expression} values for more advanced
+ projections.
Review comment:
I think elsewhere in this file we have "common" template docstrings to
avoid duplication between various wrappers, it may be worth considering for
this docstring
##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -239,15 +239,17 @@ cdef class Dataset(_Weakrefable):
----------
columns : list of str, default None
The columns to project. This can be a list of column names to
- include (order and duplicates will be preserved), or a dictionary
+ include (order and duplicates will be preserved) which may contain
the
+ augmented fields such as `batch_index`, `fragment_index`,
+ `last_in_fragment` and `filename`, or a dictionary
Review comment:
A few things: the augmented fields start with a double underscore,
right? Also, the wording is a little confusing, it makes it sound like we can
only use the augmented fields in the list. Maybe we can put the list of
augmented fields at the bottom or after this paragraph, e.g. "The list of
columns or expressions may use the special fields __batch_index (the index of
the batch within the fragment), ...., or __filename (the name of the source
file or a description of the source fragment)."
##########
File path: cpp/src/arrow/dataset/scanner_test.cc
##########
@@ -1503,6 +1518,7 @@ TEST(ScanNode, Schema) {
fields.push_back(field("__fragment_index", int32()));
fields.push_back(field("__batch_index", int32()));
fields.push_back(field("__last_in_fragment", boolean()));
+ fields.push_back(field("__filename", utf8()));
Review comment:
Is it worth checking the values too? (We could sort the table before
comparison?)
##########
File path: cpp/src/arrow/dataset/scanner_test.cc
##########
@@ -1503,6 +1518,7 @@ TEST(ScanNode, Schema) {
fields.push_back(field("__fragment_index", int32()));
fields.push_back(field("__batch_index", int32()));
fields.push_back(field("__last_in_fragment", boolean()));
+ fields.push_back(field("__filename", utf8()));
Review comment:
Though I suppose `__filename` will have a different value on every run
so that may be annoying
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]