Re: [DISCUSS] Use iceberg-rust for PyIceberg Bucket Transform

Sung Yun
Sun, 18 Aug 2024 18:09:14 -0700

Hi folks, thank you all for your input. It looks like there's general
agreement that this is a good idea. Given that Xuanwo has already put the
scaffolding in place for pyiceberg_core
<https://github.com/apache/iceberg-rust/pull/518>, I've put up this PR to
implement bucket_transform <https://github.com/apache/iceberg-rust/pull/556>
in hopes of exposing it to PyIceberg as soon as we can.


> To be honest, I don't think bucket transforms alone are a good
starting point; maybe a more general approach would be to provide another
set of transform implementations backed by iceberg-rust?

That's a valid point. I know we are in the process of defining the
integration points between PyIceberg and pyiceberg_core (the Rust Python
binding), so I think we have a good opportunity to replace all the
transforms.

At the same time, given that pyiceberg_core is meant to be used as a
private package of PyIceberg, I think there's also an argument for
exposing the functionality as soon as we can, and then taking our time to
properly define the integration points to replace existing, already
supported features within PyIceberg.
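
For anyone following along, here is a minimal sketch of the bucket logic
from the original proposal quoted below, following the bucket transform
details in the Iceberg spec: hash the value with 32-bit Murmur3, mask off
the sign bit, and take the result modulo the number of buckets. It only
covers int/long and string values and uses plain Python lists instead of
Arrow arrays. The function names (murmur3_x86_32, bucket_long,
bucket_string) are hypothetical and are not the pyiceberg_core API.

```python
import struct
from typing import List, Optional

MASK32 = 0xFFFFFFFF


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 (x86 variant); returns an unsigned 32-bit integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593

    def rotl(x: int, r: int) -> int:
        return ((x << r) | (x >> (32 - r))) & MASK32

    def mix(k: int) -> int:
        k = (k * c1) & MASK32
        k = rotl(k, 15)
        return (k * c2) & MASK32

    h = seed & MASK32
    body_len = len(data) & ~3  # largest multiple of 4 that fits
    for i in range(0, body_len, 4):
        h ^= mix(int.from_bytes(data[i:i + 4], "little"))
        h = rotl(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    tail = data[body_len:]
    if tail:  # 1 to 3 trailing bytes, read little-endian
        h ^= mix(int.from_bytes(tail, "little"))
    # Finalization mix
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def _to_bucket(hash_value: int, num_buckets: int) -> int:
    # Clear the sign bit, then reduce modulo the number of buckets.
    return (hash_value & 0x7FFFFFFF) % num_buckets


def bucket_long(values: List[Optional[int]], num_buckets: int) -> List[Optional[int]]:
    """Bucket int/long values; each is hashed as an 8-byte little-endian long."""
    return [
        None if v is None else _to_bucket(murmur3_x86_32(struct.pack("<q", v)), num_buckets)
        for v in values
    ]


def bucket_string(values: List[Optional[str]], num_buckets: int) -> List[Optional[int]]:
    """Bucket string values; each is hashed over its UTF-8 bytes."""
    return [
        None if v is None else _to_bucket(murmur3_x86_32(v.encode("utf-8")), num_buckets)
        for v in values
    ]
```

With this sketch, bucket_long([34, None], 16) gives [3, None]; the
underlying hash of 34 is 2017239379, which matches the test value listed
in the spec. Nulls stay null and the output keeps the input order, which
is the contract the proposed Arrow-based function would also need.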

On Fri, Aug 2, 2024 at 10:07 PM Renjie Liu <liurenjie2...@gmail.com> wrote:

> Hi:
>
> Would pushing the bucket transforms into Rust be a good exercise to get the
>> scaffolding in place?
>
>
> To be honest, I don't think bucket transforms alone are a good starting
> point; maybe a more general approach would be to provide another set of
> transform implementations backed by iceberg-rust? The class hierarchy may
> look like the following:
> ```
> Transform
>    RustTransform
>           BucketRustTransform
>           YearRustTransform
>           ....
> ```
>
> I think it would be a win-win situation where we can combine the momentum
>> of PyIceberg and push the things that Python doesn't do well (mostly
>> multithreading and heavy lifting) to Iceberg-Rust, and also have
>> Iceberg-Rust as a library for Rust query engines to interact with Iceberg
>> tables directly.
>
>
> +1.
>
>
>
>
> On Sat, Aug 3, 2024 at 2:20 AM Fokko Driesprong <fo...@apache.org> wrote:
>
>> Hey everyone,
>>
>> In the beginning of PyIceberg, one of the goals was to keep PyIceberg
>> pure Python. At some point, we added a Cython Avro decoder for
>> performance reasons, but we still have a pure Python fallback. Today you
>> can still do metadata operations using s3fs without any native code. Once
>> you start reading/writing data, you'll need to pull in PyArrow.
>>
>> I think it would be a win-win situation where we can combine the momentum
>> of PyIceberg and push the things that Python doesn't do well (mostly
>> multithreading and heavy lifting) to Iceberg-Rust, and also have
>> Iceberg-Rust as a library for Rust query engines to interact with Iceberg
>> tables directly.
>>
>> My main question here is about sequencing. Would pushing the bucket
>> transforms into Rust be a good exercise to get the scaffolding in place?
>>
>> Kind regards,
>> Fokko
>>
>>
>> Op vr 2 aug 2024 om 16:38 schreef Renjie Liu <liurenjie2...@gmail.com>:
>>
>>> Hi:
>>>
>>> Thanks Sung for raising this. Just as Ryan said, I'm also +1 for using
>>> more iceberg-rust in pyiceberg.
>>>
>>> Is it that we would introduce a hard dependency on iceberg-rust?
>>>
>>>
>>> I think so.
>>>
>>> Is that a risk that could make PyIceberg unusable for some people? I
>>>> don't think that would be a problem since we already have a requirement for
>>>> pyarrow for these cases.
>>>
>>>
>>> +1.
>>>
>>> In fact, what I have in mind is making iceberg-rust the backend of
>>> pyiceberg, which is what Xuanwo, Fokko and I have talked about in
>>> brainstorming several times. That goes beyond the bucket transform, and
>>> beyond the FileIO work in another thread initiated by Xuanwo. I think
>>> these are all intermediate steps toward our final goal.
>>>
>>>
>>> On Fri, Aug 2, 2024 at 3:21 AM Ryan Blue <b...@databricks.com.invalid>
>>> wrote:
>>>
>>>> In general, I think the idea of using iceberg-rust more from PyIceberg
>>>> is great. I think it will be a good path to pushing more things down to
>>>> native code.
>>>>
>>>> What are the trade-offs of doing it this way? Is it that we would
>>>> introduce a hard dependency on iceberg-rust? Is that a risk that could make
>>>> PyIceberg unusable for some people? I don't think that would be a problem
>>>> since we already have a requirement for pyarrow for these cases.
>>>>
>>>> On Thu, Aug 1, 2024 at 6:39 AM Sung Yun <sungwy...@gmail.com> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> This is something I've been mulling over for a while, and I thought
>>>>> this would be the right forum to discuss it as a follow-up to a
>>>>> similar discussion thread on using Python bindings from iceberg-rust
>>>>> to support pyiceberg.
>>>>>
>>>>> As soon as we released 0.7.0, which supports writes into tables with
>>>>> TimeTransform partitions
>>>>> <https://github.com/apache/iceberg-python/pull/784/files>, our
>>>>> prospective users started asking about the support for Bucket Transform
>>>>> partitions.
>>>>>
>>>>> Iceberg has custom logic for Bucket partitions (thanks for the link
>>>>> <https://iceberg.apache.org/spec/#bucket-transform-details>, Fokko). I
>>>>> took a look at the Java code
>>>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L99>
>>>>> and I think it looks somewhat like:
>>>>>
>>>>> * mmh3_hash(val) mod (num_buckets)
>>>>>
>>>>> It also has field-type-specific logic so that each type is hashed
>>>>> appropriately.
>>>>>
>>>>> Unfortunately, there is no existing pyarrow compute function that does
>>>>> this, so I'd like to propose that we write a function in iceberg-rust
>>>>> that takes an Arrow Array reference and the number of buckets as input,
>>>>> and returns a new Arrow Array reference holding the bucket value for
>>>>> each element of the input Arrow Array, in the same order.
>>>>>
>>>>> When iceberg-rust becomes more mature, I believe that the same
>>>>> underlying transform function can be reused for bucket partitions within
>>>>> this repository, and in the interim we could support writes into Bucket
>>>>> partitioned tables on PyIceberg by exposing this function as a Python
>>>>> binding that we import into PyIceberg.
>>>>>
>>>>> I'd love to hear how folks feel about this idea!
>>>>>
>>>>>
>>>>> Cross posted Discussion on iceberg-rust: #514
>>>>> <https://github.com/apache/iceberg-rust/discussions/514>
>>>>>
>>>>>
>>>>> Sung
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Databricks
>>>>
>>>
