Neal Richardson created ARROW-18089:
---------------------------------------

             Summary: [R] Cannot read_parquet on http URL
                 Key: ARROW-18089
                 URL: https://issues.apache.org/jira/browse/ARROW-18089
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
            Reporter: Neal Richardson
             Fix For: 11.0.0


{code}
u <- "https://raw.githubusercontent.com/apache/arrow/master/r/inst/v0.7.1.parquet"
read_parquet(u)
# Error: file must be a "RandomAccessFile"
read_parquet(url(u))
# Error: file must be a "RandomAccessFile"
{code}

The issue is that URLs are turned into an InputStream by {{make_readable_file}}, 
and the Parquet reader requires a RandomAccessFile. 

{code}
arrow:::make_readable_file(u)
# InputStream
{code}

There are two relevant codepaths in {{make_readable_file}}. Given a string URL, 
it tries {{FileSystem$from_uri()}} and falls back to 
{{MakeRConnectionInputStream}}, which returns an InputStream, not a 
RandomAccessFile. Given a connection object (e.g. {{url(u)}}), it tries 
{{MakeRConnectionRandomAccessFile}} first and falls back to 
{{MakeRConnectionInputStream}}. For a {{url()}} connection the first attempt 
fails, so it does fall back to InputStream: 

{code}
arrow:::MakeRConnectionRandomAccessFile(url(u))
# Error: Tell() returned an error
{code}

If we truly can't work with an HTTP URL in {{read_parquet}}, we should at least 
document that. We could also work around it by downloading to a tempfile 
first.
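
A minimal sketch of that tempfile workaround, for illustration only. The helper 
name {{read_parquet_url}} is hypothetical and not part of arrow; it assumes the 
whole file fits on local disk and that the result is fully materialized before 
the tempfile is removed.

{code}
library(arrow)

# Hypothetical helper (not an arrow function): download the remote file to a
# local tempfile so that read_parquet() gets a path it can open for random
# access, then clean the tempfile up afterwards.
read_parquet_url <- function(url, ...) {
  tmp <- tempfile(fileext = ".parquet")
  on.exit(unlink(tmp), add = TRUE)
  # mode = "wb" keeps the binary file intact on Windows
  download.file(url, tmp, mode = "wb", quiet = TRUE)
  read_parquet(tmp, ...)
}

# Usage, with u as defined above:
# df <- read_parquet_url(u)
{code}

This trades the streaming read for a full download, which is acceptable for 
small files but may not be what we want for large ones.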



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
