On Tue, 13 Aug 2024 at 03:50, Xuanwo <xua...@apache.org> wrote:

> Hi, André
>
> Thanks a lot for starting this thread.
>
> List operations on storage services are expensive and slow. That's why Iceberg is designed to store metadata in files and avoid using list operations in FileIO. However, `orphan file removal` or `garbage cleanup` are special tasks that do require scanning the entire storage location and comparing it with our existing metadata files.
>
Not quite. Listing via treewalking is awful because, on "pure" object stores with client-side mimicked directories, it has both high latency and a charge for every LIST call. In S3, as implemented by S3FileIO and HadoopFileIO, the SupportsPrefixOperations.listPrefix() operation is independent of directory structure and is instead just O(files). Results come back in pages of about 1000 entries; if you have versioned buckets you'll get fewer per page when there are many overwritten/tombstoned objects. To compensate for this you should really schedule processing of the results in threads separate from the one doing the listing. This is actually beneficial for classic tree walks on high-latency stores with "real" directories too -including Azure ADLS Gen2.

> I believe that if there is a way to ensure all engines use List operations correctly ( don't abuse list! ), it would be beneficial for us to introduce list files in FileIO.

Given SupportsPrefixOperations exists: use listPrefix().

Similarly, use SupportsBulkOperations.deleteFiles() for bulk deletion.

You should actually be able to wire them up, either directly: deleteFiles(listPrefix(path)), or, more interestingly, with a filter in between (a rough sketch follows further down). This would integrate paged LIST results with paged single/bulk delete calls.

The S3FileIO.deleteFiles() and the Hadoop 3.4.1 variant (which will be ready for review once we ship that) https://github.com/apache/iceberg/pull/10233 can both do the bulk delete in aggregate calls, which S3FileIO will actually do asynchronously. Each row in the batch counts as one write operation, so it is trivial to trigger throttling; if the AWS SDK is doing the retries you wouldn't even directly notice it -but all clients writing to that S3 shard will be delayed. Being aggressive here is a bit antisocial for any background vacuuming task.

Anyway: use listPrefix(), but know that even though deleteFiles() is optimised for cloud storage it can be slow and impact every other application writing to the same store. And someone should update org.apache.iceberg.aws.util.RetryDetector to count throttle events the way we do in the s3a codebase.

> I prefer to have this in FileIO and eventually exposed in pyiceberg/iceberg-rust's public API instead of letting users use opendal directly. The public API could be a metadata table or something similar; I haven't given it much thought yet.
>
> FileIO is now a widely shared design across different language implementations, and we have built a mature mechanism to allow users to implement and provide their own FileIO. By adding a new API in FileIO, we can ensure that we are not favoring any specific FileIO implementation.

Given SupportsPrefixOperations is there, just use that if the FileIO instance supports it.

> On Tue, Aug 13, 2024, at 07:01, André Luis Anastácio wrote:
>
> Thank you Fokko for the context! This blog post helped me a lot!
>
> I understand that in the Iceberg Java implementation the maintenance procedures are just interfaces <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/actions/DeleteOrphanFiles.java#L34>, and the implementation is done on the engine side <https://github.com/apache/iceberg/blob/ae08334cad1f1a9eebb9cdcf48ce5084da9bc44d/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L103>.
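To make the wiring mentioned above concrete, here is a rough, untested sketch against the Java FileIO mixins. It assumes a FileIO instance that implements both SupportsPrefixOperations and SupportsBulkOperations; the class name, and the 'referencedLocations' and 'olderThanMillis' inputs, are invented for illustration, not part of any existing API.

import java.util.Set;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.FileInfo;
import org.apache.iceberg.io.SupportsBulkOperations;
import org.apache.iceberg.io.SupportsPrefixOperations;

public class OrphanSweepSketch {

  // Lists everything under 'location', keeps only files that are old enough and are not
  // referenced by the table metadata, and hands the survivors to the bulk delete.
  // 'referencedLocations' and 'olderThanMillis' are assumed inputs; building the
  // referenced set is the metadata-side half of orphan file removal.
  static void deleteOrphans(
      FileIO io, String location, Set<String> referencedLocations, long olderThanMillis) {

    SupportsPrefixOperations lister = (SupportsPrefixOperations) io;  // assumed supported
    SupportsBulkOperations deleter = (SupportsBulkOperations) io;     // assumed supported

    // listPrefix() is paged and lazy; this stream never materialises the whole listing,
    // so the bulk deleter consumes entries as the pages arrive.
    Stream<String> orphans =
        StreamSupport.stream(lister.listPrefix(location).spliterator(), false)
            .filter(info -> info.createdAtMillis() < olderThanMillis)
            .map(FileInfo::location)
            .filter(path -> !referencedLocations.contains(path));

    // deleteFiles() batches paths into bulk delete calls (asynchronous in S3FileIO);
    // a real job should pace itself here to avoid throttling other writers.
    deleter.deleteFiles(orphans::iterator);
  }
}

The age filter plays the role of the grace period the Spark DeleteOrphanFiles action uses, so files from in-flight commits are not removed; everything else is just the listPrefix()-filter-deleteFiles() pipeline described above.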
> What do you think about this for PyIceberg?
>
> > I was hoping to leverage the metadata tables for that.
>
> I’m not sure if I understand correctly. Do you mean that the idea would be to access the metadata using the metadata tables through the table public API instead of reading the metadata files directly?
>
> If I understood correctly, and following what was done in the Java implementation, what are your thoughts on having the procedures module use only the PyIceberg public API and OpenDAL to handle the filesystem? With that, we would have something that is not coupled with the PyIceberg internals.
>
> André Anastácio
>
> On Monday, August 12th, 2024 at 5:03 PM, Fokko Driesprong <fo...@apache.org> wrote:
>
> Hi André,
>
> First of all, thanks for raising this. Maintenance routines are a long-awaited functionality in PyIceberg.
>
> The FileIO concept <https://iceberg.apache.org/fileio/> is not limited to PyIceberg, but is also present in Java <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/io/FileIO.java> and Iceberg-Rust <https://github.com/apache/iceberg-rust/blob/bbbea9751439dea6afb85f5acf0f3689357cf3de/crates/iceberg/src/io/file_io.rs#L40>. The main focus of FileIO is to provide object-store native operations to the Iceberg client (an excellent blog can be found here <https://tabular.io/blog/iceberg-fileio-cloud-native-tables/>). Based on this, I don't think we want to create a first-class citizen for FileSystem-like operations, because Iceberg is designed to work with object stores' native operations.
>
> That said, in PyIceberg the abstraction between the engine and the FileIO is not as clear as in other implementations. This is mostly because the ArrowFileIO <https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L328> returns Arrow buffers, and therefore we ended up with a more tightly coupled implementation than desired. It would be good to see if we can untangle that, and I'm sure that once we get OpenDAL or Iceberg-Rust in there, there will be a strong need to do that.
>
> Orphan file removal is quite a resource-intensive operation since it requires listing all the files under the location and comparing this with all the files in the metadata (I was hoping to leverage the metadata tables for that).
>
> Hope this helps!
>
> Kind regards,
> Fokko
>
> On Mon, 12 Aug 2024 at 14:38, André Luis Anastácio <ndrl...@proton.me.invalid> wrote:
>
> Hello everyone,
>
> I’ve been studying the Java implementation of orphan file removal to replicate it in PyIceberg. During this process, I noticed a key difference: in Java, we use the Hadoop Filesystem[1], while in PyIceberg, we use the Filesystem provided by FileIO[2][3].
>
> Currently, we support two FileIO implementations: Fsspec and PyArrow. However, there is a hard requirement to use PyArrow for the reading process, and when we instantiate the FileSystem, we wrap Fsspec with the PyArrow interface[4][5].
>
> Thus, we can say that the default filesystem interface is the PyArrow one.
>
> In the future, we aim to use the FileIO from rust-iceberg, which leverages OpenDAL—a tool that doesn’t have wrappers for the Fsspec or Arrow interfaces.
>
> For the FileIO context (write/read/delete operations), I believe we are in good shape. The challenge arises when we need to access the Filesystem object to handle tasks like listing files.
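For the "compare the listing with the metadata" half that Fokko and Xuanwo describe above, a single-process sketch against the Java API could look roughly like the following. The class and method names are invented, delete manifests and statistics files are deliberately skipped, and the real Spark action distributes this work rather than doing it in one loop.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.ManifestContent;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.ManifestFiles;
import org.apache.iceberg.ReachableFileUtil;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.FileIO;

public class ReferencedFilesSketch {

  // Collects the file locations the table metadata still references: metadata.json files,
  // manifest lists, manifests, and the data files they track. Anything found by the
  // storage listing that is not in this set (and is old enough) is an orphan candidate.
  static Set<String> referencedLocations(Table table) throws IOException {
    FileIO io = table.io();
    Set<String> referenced = new HashSet<>();

    // metadata.json files (including previous versions) and all manifest lists
    referenced.addAll(ReachableFileUtil.metadataFileLocations(table, true));
    referenced.addAll(ReachableFileUtil.manifestListLocations(table));

    for (Snapshot snapshot : table.snapshots()) {
      for (ManifestFile manifest : snapshot.allManifests(io)) {
        referenced.add(manifest.path());

        if (manifest.content() == ManifestContent.DATA) {
          try (CloseableIterable<DataFile> files = ManifestFiles.read(manifest, io)) {
            for (DataFile file : files) {
              referenced.add(file.path().toString());
            }
          }
        }
        // Delete manifests (ManifestFiles.readDeleteManifest) and statistics files are
        // omitted to keep the sketch short; a real job has to include them too.
      }
    }
    return referenced;
  }
}

PyIceberg could build the same set through its metadata tables rather than reading manifests directly; the output is what the earlier listPrefix() sketch filters against.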
> With this in mind, I want to open a discussion about how we should standardize an interface for file listing.
>
> What should be our default interface for listing files?
>
> - Create our own definition (e.g., extend FileIO or create a new Filesystem interface)
> - Use Fsspec
> - Use Arrow
> - Use OpenDAL
> - Other?
>
> Could we move the implementation for retrieving and wrapping the Filesystem[4][5] to another location, so it can be reused elsewhere?
>
> Any other suggestions?
>
> [1] https://github.com/apache/iceberg/blob/ae08334cad1f1a9eebb9cdcf48ce5084da9bc44d/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L356
> [2] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/fsspec.py#L350-L354
> [3] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L346-L401
> [4] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1335-L1349
> [5] https://github.com/apache/iceberg-python/blob/4f33f3a03841c9aa4f6ac389fea5726821f6f116/pyiceberg/io/pyarrow.py#L1429-L1443
>
> André Anastácio
>
> Xuanwo
>
> https://xuanwo.io/