[ https://issues.apache.org/jira/browse/ARROW-17115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569175#comment-17569175 ]
Jonathan Keane commented on ARROW-17115:
----------------------------------------

A reprex that causes this from R (which is effectively the TPC-H 12 query that segfaults):

{code:r}
library(arrow)
library(dplyr)
library(arrowbench)

ensure_source("tpch", scale_factor = 10)

open_dataset("data/lineitem_10.parquet") %>%
  filter(
    l_shipmode %in% c("MAIL", "SHIP"),
    l_commitdate < l_receiptdate,
    l_shipdate < l_commitdate,
    l_receiptdate >= as.Date("1994-01-01"),
    l_receiptdate < as.Date("1995-01-01")
  ) %>%
  inner_join(
    open_dataset("data/orders_10.parquet"),
    by = c("l_orderkey" = "o_orderkey")
  ) %>%
  group_by(l_shipmode) %>%
  summarise(
    high_line_count = sum(
      if_else(
        (o_orderpriority == "1-URGENT") | (o_orderpriority == "2-HIGH"),
        1L,
        0L
      )
    ),
    low_line_count = sum(
      if_else(
        (o_orderpriority != "1-URGENT") & (o_orderpriority != "2-HIGH"),
        1L,
        0L
      )
    )
  ) %>%
  ungroup() %>%
  arrange(l_shipmode) %>%
  collect()
{code}

> [C++] HashJoin fails if it encounters a batch with more than 32Ki rows
> ----------------------------------------------------------------------
>
>                 Key: ARROW-17115
>                 URL: https://issues.apache.org/jira/browse/ARROW-17115
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Blocker
>             Fix For: 9.0.0
>
>
> The new swiss join assumes that batches are being broken according to the
> morsel/batch model, and it assumes those batches have, at most, 32Ki rows
> (signed 16-bit indices are used in various places).
> However, we are not currently slicing all of our inputs to batches this
> small. This is causing conbench to fail and would likely be a problem with
> any large inputs.
> We should fix this by slicing batches in the engine to the appropriate
> maximum size.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
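The proposed fix (the engine slicing batches down to the 32Ki-row maximum before they reach the join) can be sketched as plain C++. This is an illustrative sketch only, not Arrow's actual API: the `BatchSlice` struct and `SliceBatch` helper are hypothetical names, standing in for the real slicing the engine performs.

{code:cpp}
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical: a zero-copy (offset, length) view into a larger batch.
struct BatchSlice {
  int64_t offset;
  int64_t length;
};

// Split a batch of num_rows rows into slices of at most max_rows rows,
// so downstream nodes (like the swiss join, which uses signed 16-bit
// row indices internally) never see a batch larger than 32Ki = 2^15 rows.
std::vector<BatchSlice> SliceBatch(int64_t num_rows,
                                   int64_t max_rows = int64_t{1} << 15) {
  std::vector<BatchSlice> out;
  for (int64_t offset = 0; offset < num_rows; offset += max_rows) {
    out.push_back({offset, std::min(max_rows, num_rows - offset)});
  }
  return out;
}

int main() {
  // A 100,000-row batch becomes four slices: 32768, 32768, 32768, 1696.
  for (const auto& s : SliceBatch(100000)) {
    std::cout << s.offset << " " << s.length << "\n";
  }
}
{code}

With slicing done once at the engine level, every operator can keep its 16-bit index assumption instead of each node defending against oversized batches.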