advancedxy commented on PR #8579:
URL: https://github.com/apache/iceberg/pull/8579#issuecomment-1899645844
> First of all, we should evaluate other hash functions apart from Murmur3.
> Parquet, for instance, uses xxHash that is supposed to be much faster.

> Second, Parquet avoids the modulo operator for performance reasons.
Both sound like great improvements to me. Apart from a faster hash, I'd like
to add another option to explore: a user-defined hash function for the bucket
transform, while we are working on `bucketV2`. From time to time, I get
requests from users asking whether it is possible to customize Iceberg's
bucket partitioning strategy so that it produces exactly the same
distribution as their downstream systems.
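To make the two ideas concrete, here is a rough sketch, not Iceberg code. Every name in it (`PluggableBucketSketch`, `bucket`, `crc32Hash`) is hypothetical, and CRC32 only stands in for whatever hash a downstream system might use. For reference, the current spec computes the bucket as `(murmur3_x86_32_hash(x) & Integer.MAX_VALUE) % N`. The sketch shows a modulo-free multiply-shift reduction next to it and lets the caller supply the hash function.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.ToIntFunction;
import java.util.zip.CRC32;

/** Hypothetical sketch; none of these names exist in Iceberg. */
public class PluggableBucketSketch {

  /** Classic reduction: clear the sign bit, then take the remainder. */
  public static int bucketByModulo(int hash, int numBuckets) {
    return (hash & Integer.MAX_VALUE) % numBuckets;
  }

  /** Modulo-free reduction: scale the unsigned 32-bit hash into [0, numBuckets). */
  public static int bucketByMultiplyShift(int hash, int numBuckets) {
    return (int) (((hash & 0xFFFFFFFFL) * numBuckets) >>> 32);
  }

  /** Buckets a value using a caller-supplied hash function. */
  public static <T> int bucket(T value, ToIntFunction<T> hashFn, int numBuckets) {
    return bucketByMultiplyShift(hashFn.applyAsInt(value), numBuckets);
  }

  /** Stand-in for a downstream system's hash (assumption): CRC32 over UTF-8 bytes. */
  public static int crc32Hash(String s) {
    CRC32 crc = new CRC32();
    crc.update(s.getBytes(StandardCharsets.UTF_8));
    return (int) crc.getValue();
  }

  public static void main(String[] args) {
    for (String key : new String[] {"a", "b", "iceberg"}) {
      System.out.println(key + " -> " + bucket(key, PluggableBucketSketch::crc32Hash, 16));
    }
  }
}
```

Note that the two reductions assign the same hash to different buckets in general, so a modulo-free variant could only be introduced as a new transform, not as a change to the existing `bucket`.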
> If we merge a general change about multi-arg transforms, we can start
> working on changes to the expression API while figuring out the details about
> bucketV2.
I'm OK with merging the multi-arg transform change first. However, I'm not
sure how to provide examples for single-arg vs. multi-arg transforms, as there
will be no `bucketV2` transform for now. I am referring to this part:
```markdown
|**`Partition Field`** [2]|`JSON object: {`<br /> `"source-id": <id int>,`<br /> `"field-id": <field id int>,`<br /> `"name": <name string>,`<br /> `"transform": <transform JSON>`<br />`}`|`{`<br /> `"source-id": 1,`<br /> `"field-id": 1000,`<br /> `"name": "id_bucket",`<br /> `"transform": "bucket[16]"`<br />`}`|
|**`Partition Field with multi-arg transform`** [3]|`JSON object: {`<br /> `"source-id": -1,`<br /> `"source-ids": <list of ids>,`<br /> `"field-id": <field id int>,`<br /> `"name": <name string>,`<br /> `"transform": <transform JSON>`<br />`}`|`{`<br /> `"source-id": -1,`<br /> `"source-ids": [1,2],`<br /> `"field-id": 1000,`<br /> `"name": "id_type_bucket",`<br /> `"transform": "bucketV2[16]"`<br />`}`|
```
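For illustration only, this is how I imagine a reader could treat the two field shapes in the rows above. It is a sketch under my own assumptions, not the actual parser: `PartitionFieldSourcesSketch` and `sourceIds` are hypothetical names, and the rule "prefer `source-ids` when present, otherwise fall back to `source-id`" is my reading of the table, not something the spec states yet.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not Iceberg's partition spec parser. */
public class PartitionFieldSourcesSketch {

  /** Returns the source column ids for an already-parsed partition field object. */
  @SuppressWarnings("unchecked")
  public static List<Integer> sourceIds(Map<String, Object> field) {
    Object multi = field.get("source-ids");
    if (multi != null) {
      // Multi-arg shape: "source-ids" carries the real ids (assumed rule).
      return List.copyOf((List<Integer>) multi);
    }
    // Single-arg shape: only "source-id" is present.
    return List.of((Integer) field.get("source-id"));
  }

  public static void main(String[] args) {
    Map<String, Object> single = Map.of(
        "source-id", 1, "field-id", 1000,
        "name", "id_bucket", "transform", "bucket[16]");
    Map<String, Object> multi = Map.of(
        "source-id", -1, "source-ids", List.of(1, 2), "field-id", 1000,
        "name", "id_type_bucket", "transform", "bucketV2[16]");

    System.out.println(sourceIds(single)); // [1]
    System.out.println(sourceIds(multi));  // [1, 2]
  }
}
```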
@szehon-ho @aokolnychyi do you have any suggestions?