Hi Jae,

Mainly providing an answer to your first question:

On Thu, 10 Feb 2022 at 05:06, Jae Lee <[email protected]> wrote:

> Hi Team,
>
> I would like to implement a custom subclass of
> pyarrow.filesystem.FileSystem (or perhaps pyarrow.fs.FileSystem) and was
> hoping to leverage the full potential of what pyarrow provides with parquet
> files - partitioning, filter, etc. The underneath storage is cloud-based
> and not S3 compatible. Our API only provides support for
> - CRUD bucket
> - CRUD objects
> Currently, there is no support for streaming or working with any type of
> file handle. I've already looked into how s3fs.cc was implemented but was
> not sure I could apply it in my situation.
>
> Questions:
> 1. What Filesystem class do I need to implement to take full advantage of
> what arrow provides in terms of dealing with parquet files?
> (pyarrow.filesystem.FileSystem vs pyarrow.fs.FileSystem)
>

The pyarrow.filesystem module is deprecated, so you should look at the pyarrow.fs filesystems. Those filesystems are mostly implemented in C++ and can't be directly subclassed in Python (only in C++), but there is a dedicated mechanism to implement a FileSystem in Python, using the PyFileSystem class and the FileSystemHandler class (see https://arrow.apache.org/docs/python/api/filesystems.html#filesystem-implementations). You would need to implement your own FileSystemHandler, and then you can create a filesystem object that will be recognized by pyarrow functions with `fs = PyFileSystem(my_handler)`.

We don't really have documentation about this (apart from the API docs for FileSystemHandler), so it is probably best to look at an example. We have an actual use case of this in our own code base, where it wraps fsspec-compatible Python filesystems, that can serve as an example: see https://github.com/apache/arrow/blob/c0bae8daea2ace51c64f6db38cfb3d04c5bed657/python/pyarrow/fs.py#L254-L406

> 2. Is there any example of implementation of cloud-based non-s3 compatible
> filesystem?
>

I am not aware of one in Python (in C++, we now also have a Google Cloud Storage filesystem, but I suppose that service has an extensive API). The Python fsspec package (which can be used in pyarrow through the above-mentioned handler) implements some filesystems for "cloud" storage (e.g. for http, ftp), but I am not familiar with the implementation details.

> 3. Given our limited API sets, what would you recommend?
>
> Initially, I was thinking to download the entire parquet file/directory to
> a local file system and provide a handle but was curious if there would be
> an any better way to handle this.
>
> Thank you in advance!
> Jae
>
