Thanks, Ryan. I was unable to attend the community sync, but a few of my colleagues did. We are discussing next steps internally and are also open to contributing.
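For anyone experimenting along the same lines, here is a rough sketch of how Iceberg's bucket transform could be exposed to Spark today so that both sides of a join can be clustered the same way the tables were written. This is an untested illustration, not a proposal: the UDF name iceberg_bucket, numberOfBuckets = 16, the long-typed id column, and the leftDf/rightDf DataFrames are all assumptions on my part.

import org.apache.iceberg.transforms.Transforms
import org.apache.iceberg.types.Types
import org.apache.spark.sql.functions.expr

// Must match the bucket count in the table's partition spec (assumed here).
val numberOfBuckets = 16

// Iceberg's bucket transform for a long column; apply() returns the bucket ordinal.
val bucketFn = Transforms.bucket[java.lang.Long](Types.LongType.get(), numberOfBuckets)

// Register it as a SQL-callable UDF; nulls pass through unchanged.
spark.udf.register("iceberg_bucket",
  (id: java.lang.Long) => if (id == null) null else bucketFn.apply(id))

// Clustering both join inputs by the same bucket value keeps matching keys
// in the same partitions, which is the effect the design doc aims to get
// from Spark natively.
val leftClustered = leftDf.repartition(numberOfBuckets, expr("iceberg_bucket(id)"))
val rightClustered = rightDf.repartition(numberOfBuckets, expr("iceberg_bucket(id)"))

This would not remove the shuffle that the bucketed-join work targets, but it makes bucket alignment easy to benchmark while the FunctionCatalog proposal is pending.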
Thanks,
Romin

On Thu, Jan 28, 2021 at 2:20 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Romin,
>
> Spark has poor support for bucketed joins, and we have a design doc
> <https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE/edit>
> that will hopefully improve that. We talked about this yesterday at the
> community sync. One of the parts we also need to get into Spark is an
> interface to provide functions; we have a design doc for adding a
> FunctionCatalog API
> <https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit>
> as well. Given that we are still building consensus on those before
> taking them to the Spark community, I think it might make sense to get a
> prototype working in the Spark SQL extensions that were added in the last
> release. @Chao <sunc...@apple.com> is also working on this, so let's
> discuss how we can work together.
>
> rb
>
> On Tue, Jan 26, 2021 at 5:37 PM Romin Parekh <rominpar...@gmail.com> wrote:
>
>> Hi Iceberg Devs,
>>
>> I am evaluating the performance of bucketed joins across two bucketed
>> datasets to find an optimal bucketing strategy. I was able to ingest
>> into a bucketed table [1], and using the TableScan API I can see that
>> only a subset (total files / number of buckets) of files is scanned [2].
>> I also benchmarked joins across the two datasets with different bucket
>> variations. However, I recently came across this comment
>> <https://github.com/apache/iceberg/issues/430#issuecomment-533360026> on
>> issue #430 <https://github.com/apache/iceberg/issues/430> indicating
>> that some work is still pending before Spark can leverage Iceberg bucket
>> values. Is that comment still accurate? Is there anything I can help
>> contribute?
>>
>> *[1] - Partition Spec*
>>
>> val partitionSpec = PartitionSpec
>>   .builderFor(mergedSchema)
>>   .identity("namespace")
>>   .bucket("id", numberOfBuckets)
>>   .build()
>>
>> *[2] - TableScan API*
>>
>> val iBucketIdExp = Expressions.equal("id", "1")
>> val iBucketIdScan = table.newScan().filter(iBucketIdExp)
>> val filesScanned = iBucketIdScan.planFiles.asScala.size.toLong
>>
>> --
>> Thanks,
>> Romin
>
> --
> Ryan Blue
> Software Engineer
> Netflix