> 3. Given our limited API sets, what would you recommend?

The filesystem interface is already rather minimal. We generally don't put a
function in there if we aren't using it somewhere. That being said, you can
often get away with a mock implementation. A quick rundown:
GetFileInfo/OpenInputStream/OpenOutputStream/OpenInputFile - These are used
almost everywhere.
CreateDir/DeleteDir/DeleteDirContents - These are used when writing datasets
(so you will need them if you want to write partitioned parquet).
DeleteFile/Move/CopyFile - I think these may only be used in our unit tests;
you could maybe get by without them.

> - CRUD bucket
> - CRUD objects
> Currently, there is no support for streaming or working with any type of file
> handle. I've already looked into how s3fs.cc was implemented but was not sure
> I could apply it in my situation.

Some thoughts:

* Do you support empty directories? This is a tricky one. We do rely on empty
directories in some of our datasets APIs. For example, we CreateDir and then
put files in it. There is some discussion on [1] about how we might emulate
this in GCS, but I don't know what exactly got implemented.

* No support for streaming? Does this mean you need to download an entire file
at a time (e.g. you can't stream the file or do a partial read of it)? In that
case you can mock it by downloading the file and then wrapping it with
arrow::io::BufferReader. That provides the input stream and readable file
interfaces on top of an in-memory buffer. You can also probably use
arrow::io::BufferOutputStream to collect all writes in memory and then
override the Close method to actually persist the write. That being said, you
will of course use considerably more memory than you need to, so you'll need
to make sure your files are small enough to fit into memory.

[1] https://issues.apache.org/jira/browse/ARROW-1231
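To sketch that out in Python (untested; `api`, `get_object`, and `put_object`
are hypothetical stand-ins for your CRUD object API -- on the Python side the
equivalents of the C++ classes above are pa.BufferReader and an in-memory
buffer adapted with pa.PythonFile):

    import io
    import pyarrow as pa

    def open_input_file(api, bucket, key):
        # Download the whole object (hypothetical get_object call) and
        # serve it from memory. BufferReader provides the random-access
        # file interface the parquet reader needs (e.g. for the footer).
        data = api.get_object(bucket, key)
        return pa.BufferReader(data)

    class UploadOnClose(io.BytesIO):
        """Collect all writes in memory; persist the object on close."""

        def __init__(self, api, bucket, key):
            super().__init__()
            self._api, self._bucket, self._key = api, bucket, key

        def close(self):
            # Persist before discarding the buffer (put_object is a
            # hypothetical stand-in for your API).
            self._api.put_object(self._bucket, self._key, self.getvalue())
            super().close()

    def open_output_stream(api, bucket, key):
        # pa.PythonFile adapts a Python file object to pyarrow's stream
        # interface, so pyarrow writers can use it directly.
        return pa.PythonFile(UploadOnClose(api, bucket, key), mode="w")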
On Thu, Feb 10, 2022 at 1:31 AM Joris Van den Bossche <[email protected]> wrote:
>
> Hi Jae,
>
> Mainly providing an answer to your first question:
>
> On Thu, 10 Feb 2022 at 05:06, Jae Lee <[email protected]> wrote:
>>
>> Hi Team,
>>
>> I would like to implement a custom subclass of pyarrow.filesystem.FileSystem
>> (or perhaps pyarrow.fs.FileSystem) and was hoping to leverage the full
>> potential of what pyarrow provides with parquet files - partitioning,
>> filtering, etc. The underlying storage is cloud-based and not S3 compatible.
>> Our API only provides support for
>> - CRUD bucket
>> - CRUD objects
>> Currently, there is no support for streaming or working with any type of
>> file handle. I've already looked into how s3fs.cc was implemented but was
>> not sure I could apply it in my situation.
>>
>> Questions:
>> 1. What Filesystem class do I need to implement to take full advantage of
>> what arrow provides in terms of dealing with parquet files?
>> (pyarrow.filesystem.FileSystem vs pyarrow.fs.FileSystem)
>
> The pyarrow.filesystem module is deprecated, so you should look at the
> pyarrow.fs filesystems. Those filesystems are mostly implemented in C++ and
> can't be directly subclassed in Python (only in C++), but there is a
> dedicated mechanism to implement a filesystem in Python, using the
> PyFileSystem class and the FileSystemHandler class (see
> https://arrow.apache.org/docs/python/api/filesystems.html#filesystem-implementations).
> You would need to implement your own FileSystemHandler, and then you can
> create a filesystem object that will be recognized by pyarrow functions with
> `fs = PyFileSystem(my_handler)`.
>
> We don't really have documentation about this (apart from the API docs for
> FileSystemHandler), so it is probably best to look at an example. We have an
> actual use case of this in our own code base, wrapping fsspec-compatible
> Python filesystems, which can serve as an example: see
> https://github.com/apache/arrow/blob/c0bae8daea2ace51c64f6db38cfb3d04c5bed657/python/pyarrow/fs.py#L254-L406
>
>> 2. Is there any example of an implementation of a cloud-based,
>> non-S3-compatible filesystem?
>
> I am not aware of one in Python (in C++ we now also have a Google Cloud
> Storage filesystem, but I suppose that has an extensive API). The Python
> fsspec package (which can be used in pyarrow through the above-mentioned
> handler) implements some filesystems for "cloud" storage (e.g. for HTTP and
> FTP), but I am not familiar with the implementation details.
>
>> 3. Given our limited API sets, what would you recommend?
>>
>> Initially, I was thinking of downloading the entire parquet file/directory
>> to a local file system and providing a handle, but was curious whether
>> there would be any better way to handle this.
>>
>> Thank you in advance!
>> Jae
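To make the FileSystemHandler route Joris describes concrete, here is a rough,
partial skeleton (untested; `api`, `head_object`, `get_object`, and the
bucket/key path convention are hypothetical stand-ins for your CRUD API -- the
FSSpecHandler linked above shows the full set of methods to implement):

    import pyarrow as pa
    from pyarrow.fs import FileInfo, FileSystemHandler, FileType, PyFileSystem

    class MyCloudHandler(FileSystemHandler):
        """Partial sketch of a handler over a CRUD-only object store."""

        def __init__(self, api):
            self.api = api  # hypothetical client for your bucket/object API

        def get_type_name(self):
            return "my-cloud"

        def get_file_info(self, paths):
            infos = []
            for path in paths:
                bucket, key = path.split("/", 1)
                meta = self.api.head_object(bucket, key)  # hypothetical
                if meta is None:
                    infos.append(FileInfo(path, FileType.NotFound))
                else:
                    infos.append(FileInfo(path, FileType.File,
                                          size=meta["size"]))
            return infos

        def open_input_file(self, path):
            bucket, key = path.split("/", 1)
            # Download the whole object and serve it from memory, as
            # sketched earlier in the thread.
            return pa.BufferReader(self.api.get_object(bucket, key))

        # ... plus get_file_info_selector, create_dir, delete_dir,
        # open_output_stream, etc. -- see the FSSpecHandler example.

    # Once all the handler methods are implemented:
    # fs = PyFileSystem(MyCloudHandler(api))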
