[GitHub] [arrow] lidavidm commented on a change in pull request #12560: ARROW-15281: [C++] Implement ability to retrieve fragment filename

GitBox Mon, 14 Mar 2022 05:12:11 -0700


lidavidm commented on a change in pull request #12560:
URL: https://github.com/apache/arrow/pull/12560#discussion_r825877794




##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -2019,9 +2023,12 @@ cdef class Scanner(_Weakrefable):
     dataset : Dataset
         Dataset to scan.
     columns : list of str or dict, default None
-        The columns to project. This can be a list of column names to include
-        (order and duplicates will be preserved), or a dictionary with
-        {new_column_name: expression} values for more advanced projections.
+        The columns to project. This can be a list of column names to
+        include (order and duplicates will be preserved) which may contain the
+        augmented fields such as `batch_index`, `fragment_index`, 
+        `last_in_fragment` and `filename`, or a dictionary
+        with {new_column_name: expression} values for more advanced
+        projections.

Review comment:
       I think elsewhere in this file we have "common" template docstrings to 
avoid duplication between various wrappers, it may be worth considering for 
this docstring

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -239,15 +239,17 @@ cdef class Dataset(_Weakrefable):
         ----------
         columns : list of str, default None
             The columns to project. This can be a list of column names to
-            include (order and duplicates will be preserved), or a dictionary
+            include (order and duplicates will be preserved) which may contain 
the
+            augmented fields such as `batch_index`, `fragment_index`, 
+            `last_in_fragment` and `filename`, or a dictionary

Review comment:
       A few things: the augmented fields start with a double underscore, 
right? Also, the wording is a little confusing, it makes it sound like we can 
only use the augmented fields in the list. Maybe we can put the list of 
augmented fields at the bottom or after this paragraph, e.g. "The list of 
columns or expressions may use the special fields __batch_index (the index of 
the batch within the fragment), ...., or __filename (the name of the source 
file or a description of the source fragment)."

##########
File path: cpp/src/arrow/dataset/scanner_test.cc
##########
@@ -1503,6 +1518,7 @@ TEST(ScanNode, Schema) {
   fields.push_back(field("__fragment_index", int32()));
   fields.push_back(field("__batch_index", int32()));
   fields.push_back(field("__last_in_fragment", boolean()));
+  fields.push_back(field("__filename", utf8()));

Review comment:
       Is it worth checking the values too? (We could sort the table before 
comparison?)

##########
File path: cpp/src/arrow/dataset/scanner_test.cc
##########
@@ -1503,6 +1518,7 @@ TEST(ScanNode, Schema) {
   fields.push_back(field("__fragment_index", int32()));
   fields.push_back(field("__batch_index", int32()));
   fields.push_back(field("__last_in_fragment", boolean()));
+  fields.push_back(field("__filename", utf8()));

Review comment:
       Though I suppose `__filename` will have a different value on every run 
so that may be annoying




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] lidavidm commented on a change in pull request #12560: ARROW-15281: [C++] Implement ability to retrieve fragment filename

Reply via email to