[ https://issues.apache.org/jira/browse/ARROW-16409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529618#comment-17529618 ]
David Li commented on ARROW-16409:
----------------------------------

Ah, good point. But it does seem like we can/should just get rid of scanner to start with.

> [C++][Python][R] Deprecate "scanner" (but keep "scan node") from public API
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-16409
>                 URL: https://issues.apache.org/jira/browse/ARROW-16409
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>             Fix For: 9.0.0
>
>
> The scanner, in its original form, was something of a prototype query engine.
> It handled complex projection (beyond just casting) and filtering. Over
> time features have been moved out of the scanner and into the execution
> engine to the point that the scanner now is just a tool for scanning multiple
> files simultaneously to feed as input to an exec plan (i.e. "scan node").
> The concept of a "scanner" should mostly be removed from our public API
> surface. Those working directly with the execution engine will still need to
> know about the scan node but that should be about it.
> For example, in python we have pages [like
> this|https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html]
> and code like this:
> {noformat}
> dataset = ds.dataset('/tmp/my_dataset', format='parquet')
> scanner = dataset.scanner(columns=['x'])
> ds.write_dataset(scanner, '/tmp/my_new_dataset', format='parquet')
> {noformat}
> Over time I think this will lead to confusion. It's already a little
> convoluted. For example, a call to {{dataset.to_table(...)}} creates a
> {{Scanner}} and calls {{ToTable}} with {{ScanOptions}}. This method then
> creates an {{ExecPlan}} and, in order to do so, must create a {{ScanNode}}.
> The {{ScanNode}} consumes some (but not all) of the options in {{ScanOptions}}
> while the {{ExecPlan}} consumes the rest.
> The {{Scanner}} (if one continues to exist) should be an internal detail not
> visible to users. The previous code could either change to use a new term
> {{query}}:
> {noformat}
> dataset = ds.dataset('/tmp/my_dataset', format='parquet')
> query = dataset.query(columns=['x'])
> ds.write_dataset(query, '/tmp/my_new_dataset', format='parquet')
> {noformat}
> Or we could use the record batch reader concept:
> {noformat}
> dataset = ds.dataset('/tmp/my_dataset', format='parquet')
> record_batch_reader = dataset.to_reader(columns=['x'])
> ds.write_dataset(record_batch_reader, '/tmp/my_new_dataset', format='parquet')
> {noformat}
> I would like to make some changes to the scanner in 9.0.0 and would hope to
> address this then so I'm happy to hear opinions / thoughts.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
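
For illustration only, the record batch reader variant from the quoted issue description can be modelled in plain Python without pyarrow. Everything in this sketch is a hypothetical stand-in and not an existing or planned Arrow API: `ToyDataset`, its `to_reader` method, the `Batch` alias, and the toy `write_dataset` sink are all invented here. The point is only the shape of the idea: the loop over files stays an internal detail, and the caller sees nothing but a lazy stream of batches.

```python
from typing import Dict, Iterator, List, Optional

# Toy stand-in for a record batch: column name -> list of values.
Batch = Dict[str, List[int]]


class ToyDataset:
    """Hypothetical in-memory dataset made of several 'files' (lists of batches)."""

    def __init__(self, files: List[List[Batch]]) -> None:
        self._files = files

    def to_reader(self, columns: Optional[List[str]] = None) -> Iterator[Batch]:
        """Lazily yield batches, keeping only the requested columns.

        The scanning loop lives here as an internal detail; callers only
        ever see a stream of batches, never a scanner object.
        """
        for file_batches in self._files:
            for batch in file_batches:
                if columns is None:
                    yield dict(batch)
                else:
                    yield {name: batch[name] for name in columns}


def write_dataset(reader: Iterator[Batch]) -> List[Batch]:
    """Toy sink: drain the reader and return what would have been written."""
    return list(reader)


dataset = ToyDataset([
    [{"x": [1, 2], "y": [10, 20]}],  # first "file"
    [{"x": [3], "y": [30]}],         # second "file"
])
record_batch_reader = dataset.to_reader(columns=["x"])
written = write_dataset(record_batch_reader)
print(written)
```

Because `to_reader` is a generator, no batch is produced until the sink starts pulling, which is the property that lets a reader replace a user-visible scanner in this model.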