Jonathan Keane created ARROW-14197:
--------------------------------------

             Summary: [C++] Hashjoin + datasets hanging
                 Key: ARROW-14197
                 URL: https://issues.apache.org/jira/browse/ARROW-14197
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Jonathan Keane
         Attachments: sample-while-hung.out.txt

I’m getting a hang on TPC-H query 4 pretty reliably (though it’s not _every_ time). The query is:

{code}
l <- input_table("lineitem") %>%
  select(l_orderkey, l_commitdate, l_receiptdate) %>%
  filter(l_commitdate < l_receiptdate) %>%
  select(l_orderkey)

o <- input_table("orders") %>%
  select(o_orderkey, o_orderdate, o_orderpriority) %>%
  # kludge: filter(o_orderdate >= "1993-07-01", o_orderdate < "1993-07-01" + interval '3' month) %>%
  filter(o_orderdate >= as.Date("1993-07-01"), o_orderdate < as.Date("1993-10-01")) %>%
  select(o_orderkey, o_orderpriority)

# distinct after join, tested and indeed faster
lo <- inner_join(l, o, by = c("l_orderkey" = "o_orderkey")) %>%
  distinct() %>%
  select(o_orderpriority)

aggr <- lo %>%
  group_by(o_orderpriority) %>%
  summarise(order_count = n()) %>%
  arrange(o_orderpriority) %>%
  collect()
{code}

Basically: filter lineitems, filter orders, join the two together, then group_by, summarise, and arrange.

This happens pretty reliably when the {{input_table}} is a dataset backed by Parquet or Feather files (e.g. {{input_table}} returns something like {{arrow::open_dataset("path/to/{filename}.feather", format = "feather")}}).

One can replicate this by installing an arrowbench branch (https://github.com/ursacomputing/arrowbench/pull/37) with, in R: {{remotes::install_github("ursacomputing/arrowbench@moar-tpch")}} and then running the following:

{code}
library(arrowbench)

results <- run_benchmark(
  tpc_h,
  scale_factor = 1,
  cpu_count = 8,
  query_id = 4,
  # remove the lib_path line if you have a recent install of the arrow R package
  # that supports hash joins and want to avoid building a separate copy
  lib_path = "remote-apache/arrow@HEAD",
  format = "feather",
  n_iter = 20
)
{code}

Note that this _sometimes_ finishes, but frequently it does not and gets stuck.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)