[jira] [Commented] (ARROW-13480) [C++] [R] [Python] C-interface error propagation
[ https://issues.apache.org/jira/browse/ARROW-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404012#comment-17404012 ]

Antoine Pitrou commented on ARROW-13480:
----------------------------------------

Ok, this is really a dataset bug. I will create a draft PR and let Weston iterate.

> [C++] [R] [Python] C-interface error propagation
> ------------------------------------------------
>
>                 Key: ARROW-13480
>                 URL: https://issues.apache.org/jira/browse/ARROW-13480
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python, R
>            Reporter: Jonathan Keane
>            Priority: Major
>
> While working on integration with DuckDB, we ran into an issue where it looks
> like errors are not being propagated fully or correctly by record batch
> readers that use the C interface. The DuckDB issue where this came up is
> https://github.com/duckdb/duckdb/issues/2055
>
> In the example I'm passing a dataset with either one or two files from R to
> Python. I've deliberately mis-specified the schema to trigger an error. The
> one-file version works as I expect, percolating the error up:
> {code:r}
> > library("arrow")
> >
> > venv <- try(reticulate::virtualenv_create("arrow-test"))
> virtualenv: arrow-test
> > install_pyarrow("arrow-test", nightly = TRUE)
> [output from installing pyarrow ...]
> > reticulate::use_virtualenv("arrow-test")
> >
> > file <- "arrow/r/inst/v0.7.1.parquet"
> > arrow_table <- arrow::open_dataset(rep(file, 1), schema(x=arrow::null()))
> >
> > scan <- Scanner$create(arrow_table)
> > reader <- scan$ToRecordBatchReader()
> > pyreader <- reticulate::r_to_py(reader)
> > pytab <- pyreader$read_all()
> Error in py_call_impl(callable, dots$args, dots$keywords) :
>   OSError: NotImplemented: Unsupported cast from double to null using function cast_null
> Detailed traceback:
>   File "pyarrow/ipc.pxi", line 563, in pyarrow.lib.RecordBatchReader.read_all
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> {code}
> But with 2 (or more) files, the process hangs while reading all of the batches:
> {code:r}
> > library("arrow")
> >
> > venv <- try(reticulate::virtualenv_create("arrow-test"))
> virtualenv: arrow-test
> > install_pyarrow("arrow-test", nightly = TRUE)
> [output from installing pyarrow ...]
> > reticulate::use_virtualenv("arrow-test")
> >
> > file <- "arrow/r/inst/v0.7.1.parquet"
> > arrow_table <- arrow::open_dataset(rep(file, 2), schema(x=arrow::null()))
> >
> > scan <- Scanner$create(arrow_table)
> > reader <- scan$ToRecordBatchReader()
> > pyreader <- reticulate::r_to_py(reader)
> > pytab <- pyreader$read_all()
> {hangs forever here}
> {code}
[jira] [Commented] (ARROW-13480) [C++] [R] [Python] C-interface error propagation
[ https://issues.apache.org/jira/browse/ARROW-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404004#comment-17404004 ]

Antoine Pitrou commented on ARROW-13480:
----------------------------------------

Ah, I can reproduce now using a larger number of files.
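For context, a minimal sketch of the kind of scaled-up reproducer this refers to; it adapts the Python test quoted in the comment below, and the file count of 32 (as well as reusing the same local parquet path) is my assumption rather than output from an actual run:
{code:python}
# Sketch only: same C-stream round-trip as the test quoted below, but with
# many copies of the file so several scan tasks hit the failing cast.
import pyarrow as pa
from pyarrow import dataset as ds
from pyarrow.cffi import ffi  # cdefs for the Arrow C stream interface

c_stream = ffi.new("struct ArrowArrayStream*")
ptr_stream = int(ffi.cast("uintptr_t", c_stream))

fn = "/home/antoine/arrow/dev/r/inst/v0.7.1.parquet"  # any parquet file with a double column 'x'
dataset = ds.dataset([fn] * 32, schema=pa.schema({'x': pa.null()}))  # 32 copies instead of 2

reader = dataset.scanner().to_reader()
reader._export_to_c(ptr_stream)
del reader, dataset

# With enough files this hangs instead of raising the expected
# "Unsupported cast from double to null" error.
reader_new = pa.ipc.RecordBatchReader._import_from_c(ptr_stream)
reader_new.read_all()
{code}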
[jira] [Commented] (ARROW-13480) [C++] [R] [Python] C-interface error propagation
[ https://issues.apache.org/jira/browse/ARROW-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404003#comment-17404003 ]

Antoine Pitrou commented on ARROW-13480:
----------------------------------------

I cannot reproduce using the equivalent code in Python, i.e. the following works here:
{code:python}
def test_dataset_error():
    from pyarrow import dataset as ds
    c_stream = ffi.new("struct ArrowArrayStream*")
    ptr_stream = int(ffi.cast("uintptr_t", c_stream))

    fn = "/home/antoine/arrow/dev/r/inst/v0.7.1.parquet"
    dataset = ds.dataset([fn, fn], schema=pa.schema({'x': pa.null()}))
    scanner = dataset.scanner()
    reader = scanner.to_reader()
    reader._export_to_c(ptr_stream)
    del reader, dataset, scanner

    reader_new = pa.ipc.RecordBatchReader._import_from_c(ptr_stream)
    with pytest.raises(OSError, match="Unsupported cast from double to null"):
        reader_new.read_all()
{code}
[jira] [Commented] (ARROW-13480) [C++] [R] [Python] C-interface error propagation
[ https://issues.apache.org/jira/browse/ARROW-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403999#comment-17403999 ]

Antoine Pitrou commented on ARROW-13480:
----------------------------------------

Also, can you get a gdb backtrace of where the hanging occurs?
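For anyone following along, one common way to capture such a backtrace is to attach gdb to the hung process; this assumes a Linux setup with gdb available and that the hang is in the R session, and the PID is a placeholder:
{code}
# Find the PID of the hung R process, then:
gdb -p <PID>
(gdb) thread apply all bt    # print backtraces for every thread
(gdb) detach
(gdb) quit
{code}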
[jira] [Commented] (ARROW-13480) [C++] [R] [Python] C-interface error propagation
[ https://issues.apache.org/jira/browse/ARROW-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403991#comment-17403991 ]

Antoine Pitrou commented on ARROW-13480:
----------------------------------------

If you call {{pyreader$read_next_batch()}}, do you get the error as expected?
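As a rough sketch of the same check on the pure-Python side (reusing {{reader_new}} from the test quoted in the earlier comment; whether the error shows up on an individual batch, or the hang appears there too, is exactly the open question):
{code:python}
# Pull batches one at a time instead of read_all(), to see whether the
# cast error surfaces on an individual batch.
while True:
    try:
        batch = reader_new.read_next_batch()
    except StopIteration:
        print("clean end of stream")
        break
    except OSError as exc:  # errors crossing the C stream interface surface as OSError here
        print("error surfaced on a batch:", exc)
        break
    print("got a batch with", batch.num_rows, "rows")
{code}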
[jira] [Commented] (ARROW-13480) [C++] [R] [Python] C-interface error propagation
[ https://issues.apache.org/jira/browse/ARROW-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400437#comment-17400437 ]

Neal Richardson commented on ARROW-13480:
-----------------------------------------

cc [~apitrou] [~westonpace]


--
This message was sent by Atlassian Jira
(v8.3.4#803005)