[GitHub] [arrow] paleolimbot commented on pull request #11730: ARROW-14745: [R] Enable true duckdb streaming

GitBox Thu, 02 Dec 2021 12:17:20 -0800


paleolimbot commented on pull request #11730:
URL: https://github.com/apache/arrow/pull/11730#issuecomment-984968571



   The error *might* be related to the fact that the DuckDB stream returns an 
error once it is exhausted, instead of returning 0 (success) and outputting an 
invalid (released) array to signal the end of the stream.
   
   DuckDB code where this happens:
   
   
https://github.com/duckdb/duckdb/blob/master/src/common/arrow_wrapper.cpp#L103-L105
   
   Below is an example of a carrow stream that behaves the same way (it also 
causes the exec plan to segfault).
   
   <details>
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   
   example_data <- tibble::tibble(
     int = c(1:3, NA_integer_, 5:10),
     dbl = c(1:8, NA, 10) + .1,
     dbl2 = rep(5, 10),
     lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE),
     false = logical(10),
     chr = letters[c(1:5, NA, 7:10)],
     fct = factor(letters[c(1:4, NA, NA, 7:10)])
   )
   
   tf <- tempfile()
   new_ds <- rbind(
     cbind(example_data, part = 1),
     cbind(example_data, part = 2),
     cbind(example_data, part = 3),
     cbind(example_data, part = 4)
   ) %>%
     mutate(row_order = 1:n()) %>% 
     select(-false, -lgl, -fct)
   
   write_dataset(new_ds, tf, partitioning = "part")
   
   ds <- open_dataset(tf)
   
   # stream that ends cleanly (a second read returns 0 rows)
   stream <- carrow:::blank_invalid_array_stream()
   stream_ptr <- carrow:::xptr_addr_double(stream)
   s <- Scanner$create(
     ds, 
     NULL,
     filter = TRUE,
     use_async = FALSE,
     use_threads = TRUE
   )$
     ToRecordBatchReader()$
     export_to_c(stream_ptr)
   
   
   rbr <- arrow:::ImportRecordBatchReader(stream_ptr)
   rbr$read_table()
   #> Table
   #> 40 rows x 6 columns
   #> $int <int32>
   #> $dbl <double>
   #> $dbl2 <double>
   #> $chr <string>
   #> $row_order <int32>
   #> $part <int32>
   #> 
   #> See $metadata for additional Schema metadata
   rbr$read_table()
   #> Table
   #> 0 rows x 6 columns
   #> $int <int32>
   #> $dbl <double>
   #> $dbl2 <double>
   #> $chr <string>
   #> $row_order <int32>
   #> $part <int32>
   #> 
   #> See $metadata for additional Schema metadata
   
   # stream that errors after it's complete
   stream <- carrow:::blank_invalid_array_stream()
   stream_ptr <- carrow:::xptr_addr_double(stream)
   s <- Scanner$create(
     ds, 
     NULL,
     filter = TRUE,
     use_async = FALSE,
     use_threads = TRUE
   )$
     ToRecordBatchReader()$
     export_to_c(stream_ptr)
   
   stream2 <- carrow::carrow_array_stream_function(ds$schema, function() {
     carrow::carrow_array_stream_get_next(stream)
   })
   
   rbr <- carrow::carrow_array_stream_to_arrow(stream2)
   rbr$read_table()
   #> Table
   #> 40 rows x 6 columns
   #> $int <int32>
   #> $dbl <double>
   #> $dbl2 <double>
   #> $chr <string>
   #> $row_order <int32>
   #> $part <int32>
   #> 
   #> See $metadata for additional Schema metadata
   rbr$read_table()
   #> Error: Invalid: function array stream is finished
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/c/bridge.cc:1759  StatusFromCError(stream_.get_next(&stream_, &c_array))
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:326  ReadNext(&batch)
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337  ReadAll(&batches)
   ```
   
   <sup>Created on 2021-12-02 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
   
   </details>
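
For reference, the end-of-stream convention discussed above can be sketched in plain C. This is only an illustration under stated assumptions, not DuckDB's or carrow's actual code: the `ArrowArray` and `ArrowArrayStream` layouts follow the Arrow C data/stream interface, while `demo_state`, `emit_empty_batch`, `get_next_conforming`, `get_next_erroring`, `make_stream`, and `drain` are hypothetical names made up for this sketch. One producer signals completion with success plus a released array; the other returns an error code once exhausted, which a consumer cannot distinguish from a real failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct ArrowSchema; /* schema handling is left out of this sketch */

/* Layouts as given by the Arrow C data and C stream interfaces. */
struct ArrowArray {
  int64_t length, null_count, offset, n_buffers, n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};

/* Hypothetical producer state: number of batches still to emit. */
struct demo_state { int remaining; };

static void demo_release_array(struct ArrowArray* array) {
  array->release = NULL; /* a NULL release callback marks a released array */
}

/* Emit a placeholder (empty) array that owns no memory. */
static void emit_empty_batch(struct ArrowArray* out) {
  memset(out, 0, sizeof(*out));
  out->release = demo_release_array;
}

/* End of stream is reported as success + released array. */
static int get_next_conforming(struct ArrowArrayStream* s, struct ArrowArray* out) {
  struct demo_state* state = s->private_data;
  if (state->remaining == 0) {
    out->release = NULL;
    return 0;
  }
  state->remaining--;
  emit_empty_batch(out);
  return 0;
}

/* End of stream is reported as an error (the behaviour described above). */
static int get_next_erroring(struct ArrowArrayStream* s, struct ArrowArray* out) {
  struct demo_state* state = s->private_data;
  if (state->remaining == 0) return EINVAL;
  state->remaining--;
  emit_empty_batch(out);
  return 0;
}

/* Build a stream over `state`; only get_next is populated in this sketch. */
struct ArrowArrayStream make_stream(struct demo_state* state, int erroring) {
  struct ArrowArrayStream s;
  memset(&s, 0, sizeof(s));
  s.get_next = erroring ? get_next_erroring : get_next_conforming;
  s.private_data = state;
  return s;
}

/* Consumer loop: returns the number of batches read, or -1 on error. */
int drain(struct ArrowArrayStream* stream) {
  assert(stream->get_next != NULL);
  int n = 0;
  for (;;) {
    struct ArrowArray array;
    if (stream->get_next(stream, &array) != 0) return -1;
    if (array.release == NULL) return n; /* clean end of stream */
    array.release(&array);
    n++;
  }
}
```

Calling `drain()` on a stream built with `make_stream(&state, 0)` returns the batch count, whereas the same loop over `make_stream(&state, 1)` returns -1 even though every batch was delivered. That mirrors the second `rbr$read_table()` call in the reprex failing with "function array stream is finished".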


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
