[ https://issues.apache.org/jira/browse/ARROW-14197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424721#comment-17424721 ]
David Li commented on ARROW-14197:
----------------------------------

Tangential, but the exec plan string repr being hard to read is unfortunate. It would help if R assigned names to the nodes, or perhaps we could have the ExecPlan auto-assign names if the client bindings don't give any. (Also, maybe GraphViz output would be nice after all…)

> [C++] Hashjoin + datasets hanging
> ---------------------------------
>
>                 Key: ARROW-14197
>                 URL: https://issues.apache.org/jira/browse/ARROW-14197
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Jonathan Keane
>            Priority: Critical
>              Labels: query-engine
>             Fix For: 6.0.0
>
>         Attachments: gdb.2.log, gdb.log, sample-while-hung.out.txt
>
>
> I’m getting a hang on the TPC-H query 4 pretty reliably (though it’s not
> _every_ time). The query is:
> {code}
> l <- input_table("lineitem") %>%
>   select(l_orderkey, l_commitdate, l_receiptdate) %>%
>   filter(l_commitdate < l_receiptdate) %>%
>   select(l_orderkey)
> o <- input_table("orders") %>%
>   select(o_orderkey, o_orderdate, o_orderpriority) %>%
>   # kludge: filter(o_orderdate >= "1993-07-01", o_orderdate < "1993-07-01" + interval '3' month) %>%
>   filter(o_orderdate >= as.Date("1993-07-01"), o_orderdate < as.Date("1993-10-01")) %>%
>   select(o_orderkey, o_orderpriority)
> # distinct after join, tested and indeed faster
> lo <- inner_join(l, o, by = c("l_orderkey" = "o_orderkey")) %>%
>   distinct() %>%
>   select(o_orderpriority)
> aggr <- lo %>%
>   group_by(o_orderpriority) %>%
>   summarise(order_count = n()) %>%
>   arrange(o_orderpriority) %>%
>   collect()
> {code}
> Basically, filtered lineitems, filtered orders, join those together,
> group_by, summarise, arrange.
> This happens pretty reliably when the {{input_table}} is a dataset backed by
> parquet or feather files (e.g. {{input_table}} returns something like
> {{arrow::open_dataset("path/to/{filename}.feather", format = "feather")}}).
> One can replicate this by installing an arrowbench branch
> (https://github.com/ursacomputing/arrowbench/pull/37) with, in R:
> {{remotes::install_github("ursacomputing/arrowbench@moar-tpch")}} and then
> running the following:
> {code}
> library(arrowbench)
> results <- run_benchmark(
>   tpc_h,
>   scale_factor = 1,
>   cpu_count = 8,
>   query_id = 4,
>   lib_path = "remote-apache/arrow@HEAD", # remove this line if you have a recent install of the arrow r package that supports hash joins and want to avoid building a separate copy.
>   format = "feather",
>   n_iter = 20
> )
> {code}
> Note this _sometimes_ will finish, but frequently it will not and will get stuck.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)