asheeshgarg commented on issue #208:
URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1912738795
@Fokko @syun64 @syun64 another option I can think is use polars to do it
simple example below with hashing and partitioning sorting in partition. Where
all the partition is handle by rust layer in Polars and we write parquet based
on arrow table returned.
Not sure if we want to add it as dependency? We can do custom transforms
like hours etc we have in iceberg as well easily.
import pyarrow as pa
import pyarrow.compute as pc
import polars as pl
t = pa.table({'strings':["A", "A", "B", "A"],'ints':[2, 1, 3, 4]})
df=pl.from_arrow(t)
N = 2
tables=(df.with_columns([
(pl.col("strings").hash() % N).alias("partition_id")
]).partition_by("partition_id"))
for tbl in tables:
print(tbl.to_arrow().sort_by("ints"))
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]