paleolimbot commented on pull request #11730: URL: https://github.com/apache/arrow/pull/11730#issuecomment-984968571
The error *might* be related to the fact that the DuckDB stream returns an error once it is complete, instead of returning 0 (success) and outputting an invalid (released) array.

DuckDB code where this happens: https://github.com/duckdb/duckdb/blob/master/src/common/arrow_wrapper.cpp#L103-L105

Example of a carrow stream that does this (and also causes the exec plan to segfault):

<details>

``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

example_data <- tibble::tibble(
  int = c(1:3, NA_integer_, 5:10),
  dbl = c(1:8, NA, 10) + .1,
  dbl2 = rep(5, 10),
  lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE),
  false = logical(10),
  chr = letters[c(1:5, NA, 7:10)],
  fct = factor(letters[c(1:4, NA, NA, 7:10)])
)

tf <- tempfile()
new_ds <- rbind(
  cbind(example_data, part = 1),
  cbind(example_data, part = 2),
  cbind(example_data, part = 3),
  cbind(example_data, part = 4)
) %>%
  mutate(row_order = 1:n()) %>%
  select(-false, -lgl, -fct)

write_dataset(new_ds, tf, partitioning = "part")

ds <- open_dataset(tf)

stream <- carrow:::blank_invalid_array_stream()
stream_ptr <- carrow:::xptr_addr_double(stream)
s <- Scanner$create(
  ds,
  NULL,
  filter = TRUE,
  use_async = FALSE,
  use_threads = TRUE
)$
  ToRecordBatchReader()$
  export_to_c(stream_ptr)

rbr <- arrow:::ImportRecordBatchReader(stream_ptr)
rbr$read_table()
#> Table
#> 40 rows x 6 columns
#> $int <int32>
#> $dbl <double>
#> $dbl2 <double>
#> $chr <string>
#> $row_order <int32>
#> $part <int32>
#>
#> See $metadata for additional Schema metadata
rbr$read_table()
#> Table
#> 0 rows x 6 columns
#> $int <int32>
#> $dbl <double>
#> $dbl2 <double>
#> $chr <string>
#> $row_order <int32>
#> $part <int32>
#>
#> See $metadata for additional Schema metadata

# stream that errors after it's complete
stream <- carrow:::blank_invalid_array_stream()
stream_ptr <- carrow:::xptr_addr_double(stream)
s <- Scanner$create(
  ds,
  NULL,
  filter = TRUE,
  use_async = FALSE,
  use_threads = TRUE
)$
  ToRecordBatchReader()$
  export_to_c(stream_ptr)

stream2 <- carrow::carrow_array_stream_function(ds$schema, function() {
  carrow::carrow_array_stream_get_next(stream)
})

rbr <- carrow::carrow_array_stream_to_arrow(stream2)
rbr$read_table()
#> Table
#> 40 rows x 6 columns
#> $int <int32>
#> $dbl <double>
#> $dbl2 <double>
#> $chr <string>
#> $row_order <int32>
#> $part <int32>
#>
#> See $metadata for additional Schema metadata
rbr$read_table()
#> Error: Invalid: function array stream is finished
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/c/bridge.cc:1759  StatusFromCError(stream_.get_next(&stream_, &c_array))
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:326  ReadNext(&batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337  ReadAll(&batches)
```

<sup>Created on 2021-12-02 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>

</details>
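
For reference, here is a minimal C sketch of the two `get_next` behaviours being compared. It is not DuckDB's or carrow's code. The struct layouts follow the Arrow C data and C stream interface definitions; `StreamState`, `get_next_conforming`, `get_next_erroring`, `init_stream`, and `drain` are hypothetical names made up for illustration, and the batches are empty placeholders. As I understand the C stream interface, end of stream is signalled by returning 0 with a released array (`release == NULL`); the second producer instead returns an error code on any call made after the end has been reached, which is roughly what the second `read_table()` in the reprex runs into.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct ArrowSchema; /* not needed by this sketch */

/* Layout as given in the Arrow C data interface. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Layout as given in the Arrow C stream interface. */
struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};

/* Hypothetical producer state. */
struct StreamState {
  int remaining;     /* batches still to emit */
  int end_delivered; /* set once end-of-stream has been signalled */
};

static void release_array(struct ArrowArray* array) { array->release = NULL; }

/* Returns 0 with a released array at (and after) the end of the stream. */
static int get_next_conforming(struct ArrowArrayStream* stream,
                               struct ArrowArray* out) {
  struct StreamState* state = stream->private_data;
  memset(out, 0, sizeof(*out)); /* release == NULL marks end-of-stream */
  if (state->remaining > 0) {
    state->remaining--;
    out->release = release_array; /* a live (empty placeholder) batch */
  }
  return 0;
}

/* Signals end-of-stream once, then errors on any further call. */
static int get_next_erroring(struct ArrowArrayStream* stream,
                             struct ArrowArray* out) {
  struct StreamState* state = stream->private_data;
  if (state->remaining == 0 && state->end_delivered) return EINVAL;
  if (state->remaining == 0) state->end_delivered = 1;
  return get_next_conforming(stream, out);
}

/* Only get_next is wired up; the other callbacks are left NULL here. */
void init_stream(struct ArrowArrayStream* stream, struct StreamState* state,
                 int conforming) {
  memset(stream, 0, sizeof(*stream));
  stream->get_next = conforming ? get_next_conforming : get_next_erroring;
  stream->private_data = state;
}

/* Consumer: counts batches until end-of-stream, or returns -1 on error. */
int drain(struct ArrowArrayStream* stream) {
  int n = 0;
  for (;;) {
    struct ArrowArray array;
    if (stream->get_next(stream, &array) != 0) return -1;
    if (array.release == NULL) return n;
    array.release(&array);
    n++;
  }
}
```

With a two-batch `StreamState`, calling `drain()` twice on the conforming stream gives 2 and then 0, matching the 40-row and 0-row tables above. On the erroring stream the second `drain()` returns -1, so a consumer that reads past the end has to be prepared for an error rather than an empty result.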
