[jira] [Commented] (ARROW-13480) [C++] [R] [Python] C-interface error propagation

2021-08-24 Thread Antoine Pitrou (Jira)


[ https://issues.apache.org/jira/browse/ARROW-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404012#comment-17404012 ]

Antoine Pitrou commented on ARROW-13480:


Ok, this is really a dataset bug. I will create a draft PR and let Weston 
iterate.

> [C++] [R] [Python] C-interface error propagation 
> -
>
> Key: ARROW-13480
> URL: https://issues.apache.org/jira/browse/ARROW-13480
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python, R
> Reporter: Jonathan Keane
> Priority: Major
>
> Working on integration with DuckDB, we ran into an issue where it looks like 
> errors are not being propagated fully/correctly with record batch readers 
> using the C-interface. The DuckDB issue where this came up is 
> https://github.com/duckdb/duckdb/issues/2055
> In the example I'm passing a dataset with either one or two files from R to 
> Python. I've deliberately mis-specified the schema to trigger an error. The 
> one-file version works as I expect, percolating the error up:
> {code:r}
> > library("arrow")
> > 
> > venv <- try(reticulate::virtualenv_create("arrow-test"))
> virtualenv: arrow-test
> > install_pyarrow("arrow-test", nightly = TRUE)
> [output from installing pyarrow ...]
> > reticulate::use_virtualenv("arrow-test")
> > 
> > file <- "arrow/r/inst/v0.7.1.parquet"
> > arrow_table <- arrow::open_dataset(rep(file, 1), schema(x=arrow::null()))
> > 
> > scan <- Scanner$create(arrow_table)
> > reader <- scan$ToRecordBatchReader()
> > pyreader <- reticulate::r_to_py(reader)
> > pytab <- pyreader$read_all()
> Error in py_call_impl(callable, dots$args, dots$keywords) : 
>   OSError: NotImplemented: Unsupported cast from double to null using 
> function cast_null
> Detailed traceback:
>   File "pyarrow/ipc.pxi", line 563, in pyarrow.lib.RecordBatchReader.read_all
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> {code}
> But with 2 (or more) files, the process hangs while reading all of the 
> batches:
> {code:r}
> > library("arrow")
> > 
> > venv <- try(reticulate::virtualenv_create("arrow-test"))
> virtualenv: arrow-test
> > install_pyarrow("arrow-test", nightly = TRUE)
> [output from installing pyarrow ...]
> > reticulate::use_virtualenv("arrow-test")
> > 
> > file <- "arrow/r/inst/v0.7.1.parquet"
> > arrow_table <- arrow::open_dataset(rep(file, 2), schema(x=arrow::null()))
> > 
> > scan <- Scanner$create(arrow_table)
> > reader <- scan$ToRecordBatchReader()
> > pyreader <- reticulate::r_to_py(reader)
> > pytab <- pyreader$read_all()
> {hangs forever here}
> {code}





[jira] [Commented] (ARROW-13480) [C++] [R] [Python] C-interface error propagation

2021-08-24 Thread Antoine Pitrou (Jira)


[ https://issues.apache.org/jira/browse/ARROW-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404004#comment-17404004 ]

Antoine Pitrou commented on ARROW-13480:


Ah, I can reproduce now using a larger number of files.
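
For reference, a minimal sketch of such a variation, based on the Python test quoted further down in this thread (untested as written; the count of 100 files and the local parquet path are placeholders, not the exact values used):

{code:python}
import pyarrow as pa
import pytest
from pyarrow import dataset as ds
from pyarrow.cffi import ffi


def test_dataset_error_many_files():
    # Same round trip through the C stream interface, just with more files.
    c_stream = ffi.new("struct ArrowArrayStream*")
    ptr_stream = int(ffi.cast("uintptr_t", c_stream))

    fn = "/home/antoine/arrow/dev/r/inst/v0.7.1.parquet"
    dataset = ds.dataset([fn] * 100,
                         schema=pa.schema({'x': pa.null()}))
    reader = dataset.scanner().to_reader()
    reader._export_to_c(ptr_stream)
    del reader, dataset

    # The mis-specified schema should make read_all() fail with a cast
    # error instead of hanging.
    reader_new = pa.ipc.RecordBatchReader._import_from_c(ptr_stream)
    with pytest.raises(OSError,
                       match="Unsupported cast from double to null"):
        reader_new.read_all()
{code}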



[jira] [Commented] (ARROW-13480) [C++] [R] [Python] C-interface error propagation

2021-08-24 Thread Antoine Pitrou (Jira)


[ https://issues.apache.org/jira/browse/ARROW-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404003#comment-17404003 ]

Antoine Pitrou commented on ARROW-13480:


I cannot reproduce using the equivalent code in Python, i.e. the following 
works here:
{code:python}
import pyarrow as pa
import pytest
from pyarrow.cffi import ffi  # imports needed to run this test standalone


def test_dataset_error():
    from pyarrow import dataset as ds

    # Export a dataset-backed reader through the C stream interface
    c_stream = ffi.new("struct ArrowArrayStream*")
    ptr_stream = int(ffi.cast("uintptr_t", c_stream))

    fn = "/home/antoine/arrow/dev/r/inst/v0.7.1.parquet"
    dataset = ds.dataset([fn, fn],
                         schema=pa.schema({'x': pa.null()}))
    scanner = dataset.scanner()
    reader = scanner.to_reader()
    reader._export_to_c(ptr_stream)
    del reader, dataset, scanner

    # Re-import and expect the cast error to surface from read_all()
    reader_new = pa.ipc.RecordBatchReader._import_from_c(ptr_stream)
    with pytest.raises(OSError,
                       match="Unsupported cast from double to null"):
        reader_new.read_all()
{code}



[jira] [Commented] (ARROW-13480) [C++] [R] [Python] C-interface error propagation

2021-08-24 Thread Antoine Pitrou (Jira)


[ https://issues.apache.org/jira/browse/ARROW-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403999#comment-17403999 ]

Antoine Pitrou commented on ARROW-13480:


Also, can you get a gdb backtrace of where the hanging occurs?
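
Something along these lines, attaching from another terminal while read_all() is stuck (illustrative only; substitute the actual process id of the R session):

{code}
# Attach gdb to the hung R process, dump backtraces of every thread,
# then detach without killing the process.  <pid-of-R-process> is a placeholder.
gdb -p <pid-of-R-process>
(gdb) thread apply all bt
(gdb) detach
(gdb) quit
{code}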



[jira] [Commented] (ARROW-13480) [C++] [R] [Python] C-interface error propagation

2021-08-24 Thread Antoine Pitrou (Jira)


[ https://issues.apache.org/jira/browse/ARROW-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403991#comment-17403991 ]

Antoine Pitrou commented on ARROW-13480:


If you call {{pyreader$read_next_batch()}}, do you get the error as expected?
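
For instance, continuing the R session from the reproducer (untested sketch):

{code:r}
# Pull batches one at a time instead of read_all(), to see whether the
# cast error surfaces on the first batch or the hang happens before that
pybatch <- pyreader$read_next_batch()
pybatch <- pyreader$read_next_batch()
{code}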



[jira] [Commented] (ARROW-13480) [C++] [R] [Python] C-interface error propagation

2021-08-17 Thread Neal Richardson (Jira)


[ https://issues.apache.org/jira/browse/ARROW-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400437#comment-17400437 ]

Neal Richardson commented on ARROW-13480:
-

cc [~apitrou] [~westonpace]




--
This message was sent by Atlassian Jira
(v8.3.4#803005)