Re: Improve the ergonomics of new PyArrow FileSystem API in Python ARROW-7584

2020-01-27 Thread Wes McKinney
hi Fabian

I responded on the JIRA. I'm generally supportive of ergonomic
improvements to the FS API in Python. It might make sense to break the
work into multiple patches to ease the review burden.

Thanks for offering to work on this.

- Wes

On Fri, Jan 24, 2020 at 4:46 AM Fabian Höring wrote:
>
> Hello,
>
> I created this ticket to discuss possible improvements to the new PyArrow
> FileSystem API:
> https://issues.apache.org/jira/browse/ARROW-7584
>
> As of today there seem to be only two popular projects that offer a
> backend-agnostic FileSystem API able to handle S3 & HDFS from Python:
> - PyArrow via https://arrow.apache.org/docs/python/filesystems.html
> - TensorFlow via https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile
>
> On my side I would like to reuse a clean FileSystem API in my project and
> turned to Arrow for this purpose (I think TensorFlow already handles too
> many use cases and should not provide yet another feature).
>
> "Clean FileSystem API" for me also means covering the interactive use case,
> where one uses the API like file system shell commands. We actually used
> https://github.com/dask/hdfs3 before and it worked really well.
>
> Currently the new FileSystem API is a work in progress (see
> https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185) and
> I would like to take this opportunity to improve it and fix some issues with
> the existing API.
>
> Could you have a look at the comments on
> https://issues.apache.org/jira/browse/ARROW-7584 and give feedback?
>
> I can implement the changes I suggest myself, but would like to make sure
> they will be accepted.
>
> Best regards,
> Fabian Höring
>


Improve the ergonomics of new PyArrow FileSystem API in Python ARROW-7584

2020-01-24 Thread Fabian Höring
Hello,

I created this ticket to discuss possible improvements to the new PyArrow
FileSystem API:
https://issues.apache.org/jira/browse/ARROW-7584
 
As of today there seem to be only two popular projects that offer a
backend-agnostic FileSystem API able to handle S3 & HDFS from Python:
- PyArrow via https://arrow.apache.org/docs/python/filesystems.html
- TensorFlow via https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile
 
On my side I would like to reuse a clean FileSystem API in my project and
turned to Arrow for this purpose (I think TensorFlow already handles too
many use cases and should not provide yet another feature).

"Clean FileSystem API" for me also means covering the interactive use case,
where one uses the API like file system shell commands. We actually used
https://github.com/dask/hdfs3 before and it worked really well.
 
Currently the new FileSystem API is a work in progress (see
https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185) and I
would like to take this opportunity to improve it and fix some issues with the
existing API.
 
Could you have a look at the comments on
https://issues.apache.org/jira/browse/ARROW-7584 and give feedback?
 
I can implement the changes I suggest myself, but would like to make sure
they will be accepted.

Best regards,
Fabian Höring