> 3. Given our limited API sets, what would you recommend?

The filesystem interface is already rather minimal. We generally don't put a
function in there if we aren't using it somewhere. That being said, you can
often get away with a mock implementation. A quick rundown:
GetFileInfo/OpenInputStream/OpenOutputStream/OpenInputFile - These are used
almost everywhere.
CreateDir/DeleteDir/DeleteDirContents - These are used when writing datasets
(so you will need them if you want to write partitioned parquet).
DeleteFile/Move/CopyFile - I think these may only be used in our unit tests;
you could maybe get by without them.

> - CRUD bucket
> - CRUD objects
> Currently, there is no support for streaming or working with any type of file
> handle. I've already looked into how s3fs.cc was implemented but was not sure
> I could apply it in my situation.

Some thoughts:

* Do you support empty directories? This is a tricky one. We do rely on empty
directories in some of our datasets APIs. For example, we CreateDir and then
put files in it. There is some discussion on [1] about how we might emulate
this in GCS, but I don't know what exactly got implemented.

* No support for streaming? Does this mean you need to download an entire file
at a time (e.g. you can't stream the file or do a partial read of it)? In that
case you can mock it by downloading the file and then wrapping it with
arrow::io::BufferReader. That provides the input stream and readable file
interfaces on top of an in-memory buffer. You can also probably use
arrow::io::BufferOutputStream to collect all writes in memory and then
override the Close method to actually persist the write. That being said, you
will of course use considerably more memory than you need to, so you'll need
to make sure your files are small enough to fit into memory.

[1] https://issues.apache.org/jira/browse/ARROW-1231
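To sketch that out in Python (untested; `api`, `get_object`, and `put_object`
are hypothetical stand-ins for your CRUD object API -- on the Python side the
equivalents of the C++ classes above are pa.BufferReader and an in-memory
buffer adapted with pa.PythonFile):

    import io
    import pyarrow as pa

    def open_input_file(api, bucket, key):
        # Download the whole object (hypothetical get_object call) and
        # serve it from memory. BufferReader provides the random-access
        # file interface the parquet reader needs (e.g. for the footer).
        data = api.get_object(bucket, key)
        return pa.BufferReader(data)

    class UploadOnClose(io.BytesIO):
        """Collect all writes in memory; persist the object on close."""

        def __init__(self, api, bucket, key):
            super().__init__()
            self._api, self._bucket, self._key = api, bucket, key

        def close(self):
            # Persist before discarding the buffer (put_object is a
            # hypothetical stand-in for your API).
            self._api.put_object(self._bucket, self._key, self.getvalue())
            super().close()

    def open_output_stream(api, bucket, key):
        # pa.PythonFile adapts a Python file object to pyarrow's stream
        # interface, so pyarrow writers can use it directly.
        return pa.PythonFile(UploadOnClose(api, bucket, key), mode="w")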
On Thu, Feb 10, 2022 at 1:31 AM Joris Van den Bossche <[email protected]> wrote:
>
> Hi Jae,
>
> Mainly providing an answer to your first question:
>
> On Thu, 10 Feb 2022 at 05:06, Jae Lee <[email protected]> wrote:
>>
>> Hi Team,
>>
>> I would like to implement a custom subclass of pyarrow.filesystem.FileSystem
>> (or perhaps pyarrow.fs.FileSystem) and was hoping to leverage the full
>> potential of what pyarrow provides with parquet files - partitioning,
>> filtering, etc. The underlying storage is cloud-based and not S3 compatible.
>> Our API only provides support for
>> - CRUD bucket
>> - CRUD objects
>> Currently, there is no support for streaming or working with any type of
>> file handle. I've already looked into how s3fs.cc was implemented but was
>> not sure I could apply it in my situation.
>>
>> Questions:
>> 1. What Filesystem class do I need to implement to take full advantage of
>> what arrow provides in terms of dealing with parquet files?
>> (pyarrow.filesystem.FileSystem vs pyarrow.fs.FileSystem)
>
> The pyarrow.filesystem module is deprecated, so you should look at the
> pyarrow.fs filesystems. Those filesystems are mostly implemented in C++ and
> can't be directly subclassed in Python (only in C++), but there is a
> dedicated mechanism to implement a filesystem in Python, using the
> PyFileSystem class and the FileSystemHandler class (see
> https://arrow.apache.org/docs/python/api/filesystems.html#filesystem-implementations).
> You would need to implement your own FileSystemHandler, and then you can
> create a filesystem object that will be recognized by pyarrow functions with
> `fs = PyFileSystem(my_handler)`.
>
> We don't really have documentation about this (apart from the API docs for
> FileSystemHandler), so it is probably best to look at an example. We have an
> actual use case of this in our own code base, wrapping fsspec-compatible
> Python filesystems, which can serve as an example: see
> https://github.com/apache/arrow/blob/c0bae8daea2ace51c64f6db38cfb3d04c5bed657/python/pyarrow/fs.py#L254-L406
>
>> 2. Is there any example of an implementation of a cloud-based,
>> non-S3-compatible filesystem?
>
> I am not aware of one in Python (in C++ we now also have a Google Cloud
> Storage filesystem, but I suppose that has an extensive API). The Python
> fsspec package (which can be used in pyarrow through the above-mentioned
> handler) implements some filesystems for "cloud" storage (e.g. for HTTP and
> FTP), but I am not familiar with the implementation details.
>
>> 3. Given our limited API sets, what would you recommend?
>>
>> Initially, I was thinking of downloading the entire parquet file/directory
>> to a local file system and providing a handle, but was curious whether
>> there would be any better way to handle this.
>>
>> Thank you in advance!
>> Jae
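To make the FileSystemHandler route Joris describes concrete, here is a rough,
partial skeleton (untested; `api`, `head_object`, `get_object`, and the
bucket/key path convention are hypothetical stand-ins for your CRUD API -- the
FSSpecHandler linked above shows the full set of methods to implement):

    import pyarrow as pa
    from pyarrow.fs import FileInfo, FileSystemHandler, FileType, PyFileSystem

    class MyCloudHandler(FileSystemHandler):
        """Partial sketch of a handler over a CRUD-only object store."""

        def __init__(self, api):
            self.api = api  # hypothetical client for your bucket/object API

        def get_type_name(self):
            return "my-cloud"

        def get_file_info(self, paths):
            infos = []
            for path in paths:
                bucket, key = path.split("/", 1)
                meta = self.api.head_object(bucket, key)  # hypothetical
                if meta is None:
                    infos.append(FileInfo(path, FileType.NotFound))
                else:
                    infos.append(FileInfo(path, FileType.File,
                                          size=meta["size"]))
            return infos

        def open_input_file(self, path):
            bucket, key = path.split("/", 1)
            # Download the whole object and serve it from memory, as
            # sketched earlier in the thread.
            return pa.BufferReader(self.api.get_object(bucket, key))

        # ... plus get_file_info_selector, create_dir, delete_dir,
        # open_output_stream, etc. -- see the FSSpecHandler example.

    # Once all the handler methods are implemented:
    # fs = PyFileSystem(MyCloudHandler(api))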
