[ https://issues.apache.org/jira/browse/ARROW-15938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Krisztian Szucs updated ARROW-15938:
------------------------------------
    Fix Version/s: 9.0.0
                       (was: 10.0.0)

> [R][C++] Segfault in left join with empty right table when filtered on 
> partition
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-15938
>                 URL: https://issues.apache.org/jira/browse/ARROW-15938
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 7.0.2
>         Environment: ubuntu linux, R4.1.2
>            Reporter: Vitalie Spinu
>            Assignee: Weston Pace
>            Priority: Critical
>              Labels: pull-request-available, query-engine
>             Fix For: 9.0.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> When the right table in a join is empty as a result of a filtering on a 
> partition group the join segfaults:
> {code:r}
> library(arrow)
> library(dplyr)   # needed for mutate(), filter(), select(), left_join()
> library(glue)
>
> df <- mutate(iris, id = runif(n()))
>
> dir <- "./tmp/iris"
> dir.create(glue("{dir}/group=a/"), recursive = TRUE, showWarnings = FALSE)
> dir.create(glue("{dir}/group=b/"), recursive = TRUE, showWarnings = FALSE)
> write_parquet(df, glue("{dir}/group=a/part1.parquet"))
> write_parquet(df, glue("{dir}/group=b/part2.parquet"))
>
> # Filtering on a value that matches no partition leaves the right table empty
> db1 <- open_dataset(dir) %>%
>   filter(group == "blabla")
>
> open_dataset(dir) %>%
>   filter(group == "b") %>%
>   select(id) %>%
>   left_join(db1, by = "id") %>%
>   collect()
> {code}
> {code:java}
> ==24063== Thread 7:
> ==24063== Invalid read of size 1
> ==24063==    at 0x1FFE606D: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(long, 
> arrow::compute::ExecBatch*, arrow::compute::ExecBatch*, 
> arrow::compute::ExecBatch*, arrow::compute::ExecBatch*) (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE68CC: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(unsigned long, long, 
> int const*, int const*) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE84D5: 
> arrow::compute::HashJoinBasicImpl::ProbeBatch(unsigned long, 
> arrow::compute::ExecBatch const&) (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE8CB4: 
> arrow::compute::HashJoinBasicImpl::InputReceived(unsigned long, int, 
> arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x200011CF: 
> arrow::compute::HashJoinNode::InputReceived(arrow::compute::ExecNode*, 
> arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFB580E: 
> arrow::compute::MapNode::SubmitTask(std::function<arrow::Result<arrow::compute::ExecBatch>
>  (arrow::compute::ExecBatch)>, 
> arrow::compute::ExecBatch)::{lambda()#1}::operator()() const (in 
> /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFB6444: arrow::internal::FnOnce<void 
> ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture 
> (arrow::Future<arrow::internal::Empty>, 
> arrow::compute::MapNode::SubmitTask(std::function<arrow::Result<arrow::compute::ExecBatch>
>  (arrow::compute::ExecBatch)>, 
> arrow::compute::ExecBatch)::{lambda()#2}::operator()() const::{lambda()#1})> 
> >::invoke() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FE2B2A0: 
> std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}>
>  > >::_M_run() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x92844BF: ??? (in 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29)
> ==24063==    by 0x6DD46DA: start_thread (pthread_create.c:463)
> ==24063==    by 0x710D71E: clone (clone.S:95)
> ==24063==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
> ==24063==  *** caught segfault ***
> address 0x10, cause 'memory not mapped'
> Traceback:
>  1: Table__from_RecordBatchReader(self)
>  2: tab$read_table()
>  3: do_exec_plan(x)
>  4: doTryCatch(return(expr), name, parentenv, handler)
>  5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
>  6: tryCatchList(expr, classes, parentenv, handlers)
>  7: tryCatch(tab <- do_exec_plan(x), error = function(e) {    
> handle_csv_read_error(e, x$.data$schema)})
>  8: collect.arrow_dplyr_query(.)
>  9: collect(.)
> 10: open_dataset(dir) %>% filter(group == "b") %>% select(id) %>%
>     left_join(db1, by = "id") %>% collect()
> Possible actions:
> 1: abort (with core dump, if enabled)
> 2: normal R exit
> 3: exit R without saving workspace
> 4: exit R saving workspace {code}
> This is Arrow built from the current master (ece0e23f1).
> It's worth noting that the problem does not occur if the right table is
> filtered on a non-partition variable instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
