[jira] [Commented] (ARROW-16409) [C++][Python][R] Deprecate "scanner" (but keep "scan node") from public API

Weston Pace (Jira) Thu, 28 Apr 2022 12:37:07 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529607#comment-17529607
 ]


Weston Pace commented on ARROW-16409:
-------------------------------------

{{select}} sounds reasonable but right now the {{scanner()}} method can take a 
filter as well which might be a bit odd for {{select}}.  I think pyarrow still 
needs to figure out what it wants to be the API to the query engine.  For 
example, we have {{dataset.join}} and {{table.join}} but both of those output 
tables and not execution plans (technically {{dataset.join}} outputs an 
in-memory-dataset but that's effectively a table).  R has taken the dplyr 
approach.  There is also the Ibis/Substrait approach.  The status quo would be 
introducing some kind of {{Query}} class that does what {{Scanner}} does today 
(container for dataset, projection, & filter).

> [C++][Python][R] Deprecate "scanner" (but keep "scan node") from public API
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-16409
>                 URL: https://issues.apache.org/jira/browse/ARROW-16409
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>             Fix For: 9.0.0
>
>
> The scanner, in its original form, was something of a prototype query engine. 
>  It handled complex projection (beyond just casting) and filtering.  Over 
> time features have been moved out of the scanner and into the execution 
> engine to the point that the scanner now is just a tool for scanning multiple 
> files simultaneously to feed as input to an exec plan (i.e. "scan node").
> The concept of a "scanner" should mostly be removed from our public API 
> surface.  Those working directly with the execution engine will still need to 
> know about the scan node but that should be about it.
> For example, in python we have pages [like 
> this|https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html]
>  and code like this:
> {noformat}
> dataset = ds.dataset('/tmp/my_dataset', format='parquet')
> scanner = dataset.scanner(columns=['x'])
> ds.write_dataset(scanner, '/tmp/my_new_dataset', format='parquet')
> {noformat}
> Over time I think this will lead to confusion.  It's already a little 
> convoluted.  For example, a call to {{dataset.to_table(...)}} creates a 
> {{Scanner}} and calls {{ToTable}} with {{ScanOptions}}.  This method then 
> creates an {{ExecPlan}} and, in order to do so, must create a {{ScanNode}}.  
> The {{ScanNode}} consumes some (but not all) of the options in {{ScanOption}} 
> while the {{ExecPlan}} consumes the rest.
> The {{Scanner}} (if one continues to exist) should be an internal detail not 
> visible to users.  The previous code could either change to use a new term 
> {{query}}:
> {noformat}
> dataset = ds.dataset('/tmp/my_dataset', format='parquet')
> query = dataset.query(columns=['x'])
> ds.write_dataset(query, '/tmp/my_new_dataset', format='parquet')
> {noformat}
> Or we could use the record batch reader concept:
> {noformat}
> dataset = ds.dataset('/tmp/my_dataset', format='parquet')
> record_batch_reader = dataset.to_reader(columns=['x'])
> ds.write_dataset(record_batch_reader, '/tmp/my_new_dataset', format='parquet')
> {noformat}
> I would like to make some changes to the scanner in 9.0.0 and would hope to 
> address this then so I'm happy to hear opinions / thoughts.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Commented] (ARROW-16409) [C++][Python][R] Deprecate "scanner" (but keep "scan node") from public API

Reply via email to