Weston Pace created ARROW-16409:
-----------------------------------

             Summary: [C++][Python][R] Deprecate "scanner" (but keep "scan 
node") from public API
                 Key: ARROW-16409
                 URL: https://issues.apache.org/jira/browse/ARROW-16409
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace
            Assignee: Weston Pace
             Fix For: 9.0.0


The scanner, in its original form, was something of a prototype query engine.  
It handled complex projection (beyond just casting) and filtering.  Over time 
features have been moved out of the scanner and into the execution engine to 
the point that the scanner now is just a tool for scanning multiple files 
simultaneously to feed as input to an exec plan (i.e. "scan node").

The concept of a "scanner" should mostly be removed from our public API 
surface.  Those working directly with the execution engine will still need to 
know about the scan node but that should be about it.

For example, in Python we have documentation pages [like 
this|https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html]
 and code like this:

{noformat}
import pyarrow.dataset as ds

dataset = ds.dataset('/tmp/my_dataset', format='parquet')
scanner = dataset.scanner(columns=['x'])
ds.write_dataset(scanner, '/tmp/my_new_dataset', format='parquet')
{noformat}

Over time I think this will lead to confusion.  It's already a little 
convoluted.  For example, a call to {{dataset.to_table(...)}} creates a 
{{Scanner}} and calls {{ToTable}} with {{ScanOptions}}.  {{ToTable}} then 
creates an {{ExecPlan}} and, in order to do so, must create a {{ScanNode}}.  
The {{ScanNode}} consumes some (but not all) of the options in {{ScanOptions}} 
while the {{ExecPlan}} consumes the rest.

The {{Scanner}} (if one continues to exist) should be an internal detail not 
visible to users.  The previous code could change to use a new term, 
{{query}}:

{noformat}
dataset = ds.dataset('/tmp/my_dataset', format='parquet')
query = dataset.query(columns=['x'])
ds.write_dataset(query, '/tmp/my_new_dataset', format='parquet')
{noformat}

Or we could use the record batch reader concept:

{noformat}
dataset = ds.dataset('/tmp/my_dataset', format='parquet')
record_batch_reader = dataset.to_reader(columns=['x'])
ds.write_dataset(record_batch_reader, '/tmp/my_new_dataset', format='parquet')
{noformat}

I would like to make some changes to the scanner in 9.0.0 and hope to 
address this at the same time, so I'm happy to hear opinions and thoughts.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
