amoeba commented on PR #37709: URL: https://github.com/apache/arrow/pull/37709#issuecomment-1763573360
Hey @oliviermeslin, I ran your example from https://github.com/apache/arrow/issues/37655 w/o this patch and got the expected error after the script printed "Doing the join with 9 variables":

> ! Invalid: There are more than 2^32 bytes of key data. Acero cannot process a join of this magnitude

I then built libarrow and the R package off your patch and actually got a segfault. Full output below:

<details>
<summary>Output with segfault</summary>

```
❯ Rscript acero-join-test.R
Welcome to R :) This session's PID is 2890
Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
Attaching package: ‘arrow’
The following object is masked from ‘package:utils’:
timestamp
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
[1] "Doing the join with 2 variables"
[1] "id" "variable1"
[1] "Doing the join with 3 variables"
[1] "id" "variable1" "variable2"
[1] "Doing the join with 4 variables"
[1] "id" "variable1" "variable2" "variable3"
[1] "Doing the join with 5 variables"
[1] "id" "variable1" "variable2" "variable3" "variable4"
[1] "Doing the join with 6 variables"
[1] "id" "variable1" "variable2" "variable3" "variable4" "variable5"
[1] "Doing the join with 7 variables"
[1] "id" "variable1" "variable2" "variable3" "variable4" "variable5"
[7] "variable6"
[1] "Doing the join with 8 variables"
[1] "id" "variable1" "variable2" "variable3" "variable4" "variable5"
[7] "variable6" "variable7"
[1] "Doing the join with 9 variables"
[1] "id" "variable1" "variable2" "variable3" "variable4" "variable5"
[7] "variable6" "variable7" "variable8"
*** caught segfault ***
address 0x7ed0457b20, cause 'invalid permissions'
*** caught segfault ***
address 0x7f6676b420, cause 'invalid permissions'
*** caught segfault ***
address 0x7ed8992de0, cause 'invalid permissions'
*** caught segfault ***
address 0x7ee948e360, cause 'invalid
permissions' Traceback: 1: Table__from_ExecPlanReader(self) 2: x$read_table() 3: as_arrow_table.RecordBatchReader(reader) Traceback: 1: Table__from_ExecPlanReader(self) 2: x$read_table() 3: as_arrow_table.RecordBatchReader(reader) 4: as_arrow_table(reader) 5: as_arrow_table.arrow_dplyr_query(x) 6: as_arrow_table(x) Traceback: Traceback: 1: Table__from_ExecPlanReader(self) 4: as_arrow_table(reader) 2: x$read_table() 3: as_arrow_table.RecordBatchReader(reader) 7: doTryCatch(return(expr), name, parentenv, handler) 8: tryCatchOne(expr, names, parentenv, handlers[[1L]]) 9: tryCatchList(expr, classes, parentenv, handlers) 5: 1: as_arrow_table.arrow_dplyr_query(x) 6: as_arrow_table(x) 4: 7: doTryCatch(return(expr), name, parentenv, handler) 8: as_arrow_table(reader) 5: as_arrow_table.arrow_dplyr_query(x)Table__from_ExecPlanReader(self) 2: tryCatchOne(expr, names, parentenv, handlers[[1L]]) 9: tryCatchList(expr, classes, parentenv, handlers) 10: tryCatch(as_arrow_table(x), error = function(e, call = caller_env(n = 4)) { augment_io_error_msg(e, call, schema = schema())}) 11: compute.arrow_dplyr_query(left_join(data, data(all_of(vars_temp), all_of(vars_temp)), by = c(id = "id"))) 12: compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) 13: FUN(X[[i]], ...) 14: lapply(1:nb_var, function(n) { print(paste0("Doing the join with ", n + 1, " variables")) vars_temp <- c("id", vars[1:n]) print(vars_temp) data_out <- compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) return("Success!")}) An irrecoverable exception occurred. R is aborting now ... 
10: tryCatch(as_arrow_table(x), error = function(e, call = caller_env(n = 4)) { augment_io_error_msg(e, call, schema = schema())(x$read_table())() 6: as_arrow_table(x) 7: doTryCatch(return(expr), name, parentenv, handler) 8: }) tryCatchOne(expr, names, parentenv, handlers[[1L]]) 9: tryCatchList(expr, classes, parentenv, handlers) 3: as_arrow_table.RecordBatchReader(reader) 4: as_arrow_table(reader)10: tryCatch(as_arrow_table(x), error = function(e, call = caller_env(n = 4)) { 5: as_arrow_table.arrow_dplyr_query(x) 6: augment_io_error_msg(e, call, schema = schema())}) 11: compute.arrow_dplyr_query(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) 12: compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) 13: FUN(X[[i]], ...) 14: lapply(1:nb_var, function(n) { print(paste0("Doing the join with ", n + 1, " variables")) vars_temp <- c("id", vars[1:n]) print(vars_temp) data_out <- compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) return("Success!")}) An irrecoverable exception occurred. R is aborting now ... as_arrow_table(x) 7: doTryCatch(return(expr), name, parentenv, handler) 8: tryCatchOne(expr, names, parentenv, handlers[[1L]]) 9: tryCatchList(expr, classes, parentenv, handlers) 10: tryCatch(as_arrow_table(x), error = function(e, call = caller_env(n = 4)) { augment_io_error_msg(e, call, schema = schema())}) 11: compute.arrow_dplyr_query(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) 12: compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) 13: FUN(X[[i]], ...) 14: lapply(1:nb_var, function(n) { print(paste0("Doing the join with ", n + 1, "Doing thes")) vars_temp <- c("id", vars[1:n]) print(vars_temp) data_out <- compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) return("Success!")}) An irrecoverable exception occurred. R is aborting now ... 
11: compute.arrow_dplyr_query(left_join(data, vars_temp(), by = c(id = "id"))) 12: compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) 13: FUN(X[[i]], ...) 14: lapply(1:nb_var, function(n) { print(paste0("Doing the join with ", n + 1, " variables")) vars_temp <- c("id", vars[1:n]) print(vars_temp) data_out <- compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) return("Success!")}) An irrecoverable exception occurred. R is aborting now ... *** caught segfault *** address 0x7f6ecc2520, cause 'invalid permissions' Traceback: 1: Table__from_ExecPlanReader(self) 2: x$read_table() 3: as_arrow_table.RecordBatchReader(reader) 4: as_arrow_table(reader) 5: as_arrow_table.arrow_dplyr_query(x) 6: as_arrow_table(x) 7: doTryCatch(return(expr), name, parentenv, handler) 8: tryCatchOne(expr, names, parentenv, handlers[[1L]]) 9: tryCatchList(expr, classes, parentenv, handlers) 10: tryCatch(as_arrow_table(x), error = function(e, call = caller_env(n = 4)) { augment_io_error_msg(e, call, schema = schema())}) 11: compute.arrow_dplyr_query(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) 12: compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) 13: FUN(X[[i]], ...) 14: lapply(1:nb_var, function(n) { print(paste0("Doing the join with ", n + 1, " variables")) vars_temp <- c("id", vars[1:n]) print(vars_temp) data_out <- compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) return("Success!")}) An irrecoverable exception occurred. R is aborting now ... 
*** caught segfault *** address 0x7f5e1daf00, cause 'invalid permissions' Traceback: 1: Table__from_ExecPlanReader(self) 2: x$read_table() 3: as_arrow_table.RecordBatchReader(reader) 4: as_arrow_table(reader) 5: as_arrow_table.arrow_dplyr_query(x) 6: as_arrow_table(x) 7: doTryCatch(return(expr), name, parentenv, handler) 8: tryCatchOne(expr, names, parentenv, handlers[[1L]]) 9: tryCatchList(expr, classes, parentenv, handlers) 10: tryCatch(as_arrow_table(x), error = function(e, call = caller_env(n = 4)) { augment_io_error_msg(e, call, schema = schema())}) 11: compute.arrow_dplyr_query(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) 12: compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) 13: FUN(X[[i]], ...) 14: lapply(1:nb_var, function(n) { print(paste0("Doing the join with ", n + 1, " variables")) vars_temp <- c("id", vars[1:n]) print(vars_temp) data_out <- compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) return("Success!")}) An irrecoverable exception occurred. R is aborting now ... *** caught segfault *** address 0x7ee0f0bac0, cause 'invalid permissions' Traceback: 1: Table__from_ExecPlanReader(self) 2: x$read_table() 3: as_arrow_table.RecordBatchReader(reader) 4: as_arrow_table(reader) 5: as_arrow_table.arrow_dplyr_query(x) 6: as_arrow_table(x) 7: doTryCatch(return(expr), name, parentenv, handler) 8: tryCatchOne(expr, names, parentenv, handlers[[1L]]) 9: tryCatchList(expr, classes, parentenv, handlers) 10: tryCatch(as_arrow_table(x), error = function(e, call = caller_env(n = 4)) { augment_io_error_msg(e, call, schema = schema())}) 11: compute.arrow_dplyr_query(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) 12: compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) 13: FUN(X[[i]], ...) 
14: lapply(1:nb_var, function(n) { print(paste0("Doing the join with ", n + 1, " variables")) vars_temp <- c("id", vars[1:n]) print(vars_temp) data_out <- compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) return("Success!")})
An irrecoverable exception occurred. R is aborting now ...
*** caught segfault ***
address 0x7ecb394000, cause 'invalid permissions'
Traceback:
1: Table__from_ExecPlanReader(self)
2: x$read_table()
3: as_arrow_table.RecordBatchReader(reader)
4: as_arrow_table(reader)
5: as_arrow_table.arrow_dplyr_query(x)
6: as_arrow_table(x)
7: doTryCatch(return(expr), name, parentenv, handler)
8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
9: tryCatchList(expr, classes, parentenv, handlers)
10: tryCatch(as_arrow_table(x), error = function(e, call = caller_env(n = 4)) { augment_io_error_msg(e, call, schema = schema())})
11: compute.arrow_dplyr_query(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id")))
12: compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id")))
13: FUN(X[[i]], ...)
14: lapply(1:nb_var, function(n) { print(paste0("Doing the join with ", n + 1, " variables")) vars_temp <- c("id", vars[1:n]) print(vars_temp) data_out <- compute(left_join(data, select(data, all_of(vars_temp)), by = c(id = "id"))) return("Success!")})
An irrecoverable exception occurred. R is aborting now ...
fish: Job 1, 'Rscript acero-join-test.R' terminated by signal SIGSEGV (Address boundary error)
```

</details>
