Hi Romin,

Spark has poor support for bucketed joins, and we have a design doc
<https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE/edit>
that we hope will improve that. We talked about this yesterday at the community
sync. Another piece we need to get into Spark is an interface for providing
functions; we also have a design doc for adding a FunctionCatalog API
<https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit>.
Given that we are still building consensus on those before taking them to the
Spark community, I think it might make sense to get a prototype working in the
Spark SQL extensions that were added in the last release. @Chao
<sunc...@apple.com> is also working on this, so let's discuss how we can work
together.
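
For a rough idea of the prototype path, here is a minimal sketch of wiring a
rule in through SparkSessionExtensions; the rule name and its body are
placeholders (a no-op), not the actual design from the doc:

import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder: the real rule would recognize joins whose keys are bucketed
// the same way in both Iceberg tables and plan them without a shuffle.
case class BucketJoinRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan // no-op for now
}

// Registered with: --conf spark.sql.extensions=IcebergBucketJoinExtensions
class IcebergBucketJoinExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectOptimizerRule(session => BucketJoinRule(session))
  }
}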

rb

On Tue, Jan 26, 2021 at 5:37 PM Romin Parekh <rominpar...@gmail.com> wrote:

> Hi Iceberg Devs,
> I am evaluating the performance of bucketed joins across two bucketed
> datasets to find an optimal bucketing strategy. I was able to ingest into a
> bucketed table [1], and using the TableScan API I can see that only a subset
> (total files / number of buckets) of the files is scanned [2]. I also
> benchmarked different joins across the two datasets (with different bucket
> variations). However, I recently came across this comment
> <https://github.com/apache/iceberg/issues/430#issuecomment-533360026> on
> issue #430 <https://github.com/apache/iceberg/issues/430> indicating that
> some work is still pending before Spark can leverage Iceberg bucket values.
> Is that comment still accurate? Is there anything I can contribute to help?
>
> *[1] - Partition Spec*
>
> import org.apache.iceberg.PartitionSpec
>
> // Partition by namespace identity, then hash-bucket the id column
> val partitionSpec = PartitionSpec
>     .builderFor(mergedSchema)
>     .identity("namespace")
>     .bucket("id", numberOfBuckets)
>     .build()
>
> *[2] - TableScan API*
>
> import org.apache.iceberg.expressions.Expressions
> import scala.collection.JavaConverters._
> // Count the files planned for a scan filtered on a single id value
> val iBucketIdExp  = Expressions.equal("id", "1")
> val iBucketIdScan = table.newScan().filter(iBucketIdExp)
> val filesScanned  = iBucketIdScan.planFiles.asScala.size.toLong
>
>
> --
> Thanks,
> Romin
>
>
>

-- 
Ryan Blue
Software Engineer
Netflix
