Thanks Ryan. I was unable to attend the community sync, but a few of my
colleagues did. We are discussing next steps internally and are also open
to contributing.

Thanks,
Romin

On Thu, Jan 28, 2021 at 2:20 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Romin,
>
> Spark has poor support for bucketed joins, and we have a design doc
> <https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE/edit>
> that we hope will improve that; we talked about it yesterday at the
> community sync. Another piece we need to get into Spark is an interface
> for providing functions, so we also have a design doc for adding a
> FunctionCatalog API
> <https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit>.
> Since we are still building consensus on both before taking them to the
> Spark community, I think it makes sense to get a prototype working in the
> Spark SQL extensions that were added in the last release. @Chao
> <sunc...@apple.com> is also working on this, so let's discuss how we can
> work together.
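>
> A minimal sketch of that wiring, in Scala (the extensions class name here
> is an assumption; a prototype may ship its own extensions entry point):
>
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("bucketed-join-prototype")
>   // assumed entry point; swap in wherever the prototype's rules land
>   .config("spark.sql.extensions",
>     "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
>   .getOrCreate()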
>
> rb
>
> On Tue, Jan 26, 2021 at 5:37 PM Romin Parekh <rominpar...@gmail.com>
> wrote:
>
>> Hi Iceberg Devs,
>> I am evaluating the performance of joins across two bucketed datasets to
>> find an optimal bucketing strategy. I was able to ingest into a bucketed
>> table [1], and using the TableScan API I can see that only a subset
>> (roughly total files / number of buckets) of files is scanned [2]. I also
>> benchmarked joins across the two datasets with different bucket counts (a
>> rough sketch of the join follows [2] below). However, I recently came
>> across this comment
>> <https://github.com/apache/iceberg/issues/430#issuecomment-533360026> on
>> issue #430 <https://github.com/apache/iceberg/issues/430> indicating that
>> some work is still pending before Spark can leverage Iceberg's bucket
>> values. Is that comment still accurate? Is there anything I can help
>> contribute?
>>
>> *[1] - Partition Spec*
>>
>> import org.apache.iceberg.PartitionSpec
>>
>> // identity-partition on namespace, then hash-bucket id into numberOfBuckets
>> val partitionSpec = PartitionSpec
>>     .builderFor(mergedSchema)
>>     .identity("namespace")
>>     .bucket("id", numberOfBuckets)
>>     .build()
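>>
>> For reference, the spec above is applied at table creation. A sketch, where
>> the catalog and the db/table names are assumptions:
>>
>> import org.apache.iceberg.catalog.TableIdentifier
>>
>> // catalog is an org.apache.iceberg.catalog.Catalog built elsewhere
>> val tableId = TableIdentifier.of("db", "events")
>> val table   = catalog.createTable(tableId, mergedSchema, partitionSpec)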
>>
>> *[2] - TableScan API*
>>
>> import org.apache.iceberg.expressions.Expressions
>> import scala.collection.JavaConverters._  // for planFiles.asScala
>> // Iceberg rewrites the id predicate through the bucket transform when planning
>> val iBucketIdExp  = Expressions.equal("id", "1")
>> val iBucketIdScan = table.newScan().filter(iBucketIdExp)
>> val filesScanned  = iBucketIdScan.planFiles.asScala.size.toLong
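>>
>> And a rough sketch of the join benchmark itself (the table names and the
>> spark session are assumptions; both tables are bucketed on "id" with the
>> same number of buckets):
>>
>> val left   = spark.table("db.events_a")
>> val right  = spark.table("db.events_b")
>> val joined = left.join(right, "id")
>> // still shows an Exchange on both sides until Spark can use Iceberg's bucketing
>> joined.explain()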
>>
>>
>> --
>> Thanks,
>> Romin
>>
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Thanks,
Romin
