[ https://issues.apache.org/jira/browse/ARROW-15938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson updated ARROW-15938:
------------------------------------
    Labels: query-engine  (was: )

> [R][C++] Segfault in left join with empty right table when filtered on partition
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-15938
>                 URL: https://issues.apache.org/jira/browse/ARROW-15938
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 7.0.2
>         Environment: ubuntu linux, R4.1.2
>            Reporter: Vitalie Spinu
>            Priority: Major
>              Labels: query-engine
>
> When the right table in a join is empty as a result of filtering on a
> partition column, the join segfaults:
> {code:java}
> library(arrow)
> library(dplyr)  # needed for mutate(), n(), filter(), select(), left_join(), collect() and %>%
> library(glue)
>
> # Write the same data frame into two partition directories, group=a and group=b.
> df <- mutate(iris, id = runif(n()))
> dir <- "./tmp/iris"
> dir.create(glue("{dir}/group=a/"), recursive = TRUE, showWarnings = FALSE)
> dir.create(glue("{dir}/group=b/"), recursive = TRUE, showWarnings = FALSE)
> write_parquet(df, glue("{dir}/group=a/part1.parquet"))
> write_parquet(df, glue("{dir}/group=b/part2.parquet"))
>
> # Right table: filtering on a partition value that does not exist leaves it empty.
> db1 <- open_dataset(dir) %>%
>   filter(group == "blabla")
>
> # The left join against the empty right table segfaults on collect().
> open_dataset(dir) %>%
>   filter(group == "b") %>%
>   select(id) %>%
>   left_join(db1, by = "id") %>%
>   collect()
> {code}
> {code:java}
> ==24063== Thread 7:
> ==24063== Invalid read of size 1
> ==24063==    at 0x1FFE606D: arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(long, arrow::compute::ExecBatch*, arrow::compute::ExecBatch*, arrow::compute::ExecBatch*, arrow::compute::ExecBatch*) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE68CC: arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(unsigned long, long, int const*, int const*) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE84D5: arrow::compute::HashJoinBasicImpl::ProbeBatch(unsigned long, arrow::compute::ExecBatch const&) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFE8CB4: arrow::compute::HashJoinBasicImpl::InputReceived(unsigned long, int, arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x200011CF: arrow::compute::HashJoinNode::InputReceived(arrow::compute::ExecNode*, arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFB580E: arrow::compute::MapNode::SubmitTask(std::function<arrow::Result<arrow::compute::ExecBatch> (arrow::compute::ExecBatch)>, arrow::compute::ExecBatch)::{lambda()#1}::operator()() const (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FFB6444: arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, arrow::compute::MapNode::SubmitTask(std::function<arrow::Result<arrow::compute::ExecBatch> (arrow::compute::ExecBatch)>, arrow::compute::ExecBatch)::{lambda()#2}::operator()() const::{lambda()#1})> >::invoke() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x1FE2B2A0: std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
> ==24063==    by 0x92844BF: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29)
> ==24063==    by 0x6DD46DA: start_thread (pthread_create.c:463)
> ==24063==    by 0x710D71E: clone (clone.S:95)
> ==24063==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
> ==24063==
>  *** caught segfault ***
> address 0x10, cause 'memory not mapped'
>
> Traceback:
>  1: Table__from_RecordBatchReader(self)
>  2: tab$read_table()
>  3: do_exec_plan(x)
>  4: doTryCatch(return(expr), name, parentenv, handler)
>  5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
>  6: tryCatchList(expr, classes, parentenv, handlers)
>  7: tryCatch(tab <- do_exec_plan(x), error = function(e) {    handle_csv_read_error(e, x$.data$schema)})
>  8: collect.arrow_dplyr_query(.)
>  9: collect(.)
> 10: open_dataset(dir) %>% filter(group == "b") %>% select(id) %>% left_join(db1, by = "id") %>% collect()
>
> Possible actions:
> 1: abort (with core dump, if enabled)
> 2: normal R exit
> 3: exit R without saving workspace
> 4: exit R saving workspace
> {code}
> This is arrow built from the current master, ece0e23f1.
> It's worth noting that if the right table is filtered on a non-partition
> variable, the problem does not occur.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)