oersted opened a new issue, #4965: URL: https://github.com/apache/arrow-datafusion/issues/4965
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

It is a common use case to dynamically download datasets (usually CSV or JSON) from third-party APIs, particularly when the data changes over time.

**Describe the solution you'd like**

It would be nice to be able to pass any URL to the `DataFrame::read_*` methods and stream the data in. Currently, a confusing error is displayed if the URL doesn't correspond to a supported source such as S3:

```
Error: Internal error: No suitable object store found for https://<...>.csv. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
```

**Describe alternatives you've considered**

It is easy enough to download the file into a temporary directory and read it from there. But if the dataset is large, it would be preferable to stream it progressively into memory rather than spending time and space making an unnecessary copy to disk.

This is not necessarily something that should be within the scope of DataFusion, but since object stores are already supported, I think it makes sense to support arbitrary URLs for consistency and completeness.
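For reference, the temporary-file workaround described above can be sketched in Python. This is a hypothetical helper (`download_to_temp` is my own name, and any HTTP client would do); it streams the response to disk in chunks so the whole body is never held in memory, after which the file can be read with DataFusion's usual file-based APIs:

```python
import shutil
import tempfile
import urllib.request


def download_to_temp(url: str, chunk_size: int = 1 << 20) -> str:
    """Stream a remote file to a named temporary file and return its path.

    Hypothetical helper illustrating the workaround: the data is copied
    to disk in fixed-size chunks, then read back from the local path.
    """
    with urllib.request.urlopen(url) as response:
        with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
            # copyfileobj streams in chunks rather than buffering the body
            shutil.copyfileobj(response, tmp, length=chunk_size)
            return tmp.name
```

The downside, as noted, is the extra disk copy; the feature requested here would let the same data be streamed directly into the engine instead.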