[ 
https://issues.apache.org/jira/browse/ARROW-18089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dewey Dunnington updated ARROW-18089:
-------------------------------------
    Fix Version/s: 12.0.0
                       (was: 11.0.0)

> [R] Cannot read_parquet on http URL
> -----------------------------------
>
>                 Key: ARROW-18089
>                 URL: https://issues.apache.org/jira/browse/ARROW-18089
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Neal Richardson
>            Priority: Critical
>              Labels: triaged
>             Fix For: 12.0.0
>
>
> {code}
> u <- "https://raw.githubusercontent.com/apache/arrow/master/r/inst/v0.7.1.parquet"
> read_parquet(u)
> # Error: file must be a "RandomAccessFile"
> read_parquet(url(u))
> # Error: file must be a "RandomAccessFile"
> {code}
> The issue is that urls get turned into InputStream by {{make_readable_file}}, 
> and parquet requires RandomAccessFile. 
> {code}
> arrow:::make_readable_file(u)
> # InputStream
> {code}
> There are two relevant code paths in {{make_readable_file}}. If given a 
> string URL, it tries {{FileSystem$from_uri()}} and falls back to 
> {{MakeRConnectionInputStream}}, which returns an InputStream, not a 
> RandomAccessFile. If given a connection object (e.g. {{url(u)}}), it tries 
> {{MakeRConnectionRandomAccessFile}} first and falls back to 
> {{MakeRConnectionInputStream}}. With a {{url()}} connection it does fall 
> back to InputStream, because the random-access attempt fails: 
> {code}
> arrow:::MakeRConnectionRandomAccessFile(url(u))
> # Error: Tell() returned an error
> {code}
> If we truly can't work with an HTTP URL in read_parquet, we should at least 
> document that. We could also do the workaround of downloading to a tempfile 
> first.
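
The tempfile workaround suggested at the end of the description could look roughly like the sketch below. It is an illustration under stated assumptions, not something arrow provides: the helper name read_parquet_via_tempfile and its reader argument are hypothetical, and the download goes through base R's utils::download.file(). The point is that read_parquet() then receives a local file path, which can be opened for random access, instead of an InputStream.

```r
# Hypothetical helper (not part of the arrow package): download the remote
# file to a temporary path first, then read it from that local path.
read_parquet_via_tempfile <- function(u, reader = arrow::read_parquet, ...) {
  tmp <- tempfile(fileext = ".parquet")
  # mode = "wb" keeps the binary Parquet bytes intact on Windows.
  utils::download.file(u, tmp, mode = "wb", quiet = TRUE)
  # The temporary file is left in tempdir(), which R removes at session end.
  reader(tmp, ...)
}

# Intended usage with the URL from the issue (needs arrow and network access):
# u <- "https://raw.githubusercontent.com/apache/arrow/master/r/inst/v0.7.1.parquet"
# df <- read_parquet_via_tempfile(u)
```

The reader argument defaults to arrow::read_parquet and is only there so the download step can be exercised without arrow installed. The whole file is fetched before reading, so this gives up any benefit of partial reads.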



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
