[ https://issues.apache.org/jira/browse/ARROW-17462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-17462: ----------------------------------- Labels: pull-request-available (was: ) > [R] Cast scalars to type of field in Expression building > -------------------------------------------------------- > > Key: ARROW-17462 > URL: https://issues.apache.org/jira/browse/ARROW-17462 > Project: Apache Arrow > Issue Type: Improvement > Components: R > Reporter: Neal Richardson > Assignee: Neal Richardson > Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > After looking at the ExecPlan output of some queries, it jumped out at me how > we translate {{ int_field == 5 }} in R as {{ cast(int_field, float64) == 5 }} > because 5 is a double in R. > This extra work has a noticeable performance impact. Here's a simple query on > the taxi dataset, filtering down to 54 out of 1.5 billion rows and selecting > a single column. My idea was to make a query that does not much other than > evaluate the filter. > {code} > > system.time(ds |> select(passenger_count) |> filter(passenger_count > 10) > > |> compute()) > user system elapsed > 0.391 0.024 0.362 > > system.time(ds |> select(passenger_count) |> filter(passenger_count > > > Scalar$create(10, type = int8())) |> compute()) > user system elapsed > 0.206 0.025 0.179 > {code} > You can see the difference in the query plans too: > {code} > > ds |> select(passenger_count) |> filter(passenger_count > 10) |> explain() > ExecPlan with 4 nodes: > 3:SinkNode{} > 2:ProjectNode{projection=[passenger_count]} > 1:FilterNode{filter=(cast(passenger_count, {to_type=double, > allow_int_overflow=false, allow_time_truncate=false, > allow_time_overflow=false, allow_decimal_truncate=false, > allow_float_truncate=false, allow_invalid_utf8=false}) > 10)} > 0:SourceNode{} > > ds |> select(passenger_count) |> filter(passenger_count > Scalar$create(10, > > type = int8())) |> explain() > ExecPlan with 4 nodes: > 3:SinkNode{} > 2:ProjectNode{projection=[passenger_count]} > 1:FilterNode{filter=(passenger_count > 10)} > 0:SourceNode{} > {code} > Ideally Acero would do this more intelligently (cf. ARROW-11402), but we > should also be able to do smarter things when assembling the Expression in R. -- This message was sent by Atlassian Jira (v8.20.10#820010)