Very informative. Thanks!

On Mon, Jul 27, 2020 at 5:09 PM nishith agarwal <n3.nas...@gmail.com> wrote:
> Yes.
>
> SparkSession has a reference to something called a SessionState here ->
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L152
>
> Each SessionState allows for a bunch of experimentalMethods for specific
> optimizations that you can plug in for interception, here ->
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala#L64
>
> So, say you want to add a new rule to the execution/logical plan, you can
> do as follows:
>
> sparkSession.sessionState.experimentalMethods.extraOptimizations ++= <YOUR_OWN_RULE>
>
> Now, whenever a *df.filter(..)* or similar transformation is used, if you
> have enabled this RULE through the sparkSession, the indexing will
> automatically kick in based on your implementation of the RULE. Hyperspace
> provides some rules, one of them can be seen here ->
>
> https://github.com/microsoft/hyperspace/blob/master/src/main/scala/com/microsoft/hyperspace/index/rules/FilterIndexRule.scala
>
> Thanks,
> Nishith
>
> On Mon, Jul 27, 2020 at 11:53 AM Vinoth Chandar <vin...@apache.org> wrote:
>
> > Thanks Nishith!
> >
> > >>Plugs in at the time of spark query planning to allow for automatic
> > >>indexing optimizations based on the created index
> >
> > This is very interesting. Could you expand more? One day, I'd love to
> > support point(ish) lookups on Hudi tables :)
> >
> > On Mon, Jul 27, 2020 at 8:29 AM nishith agarwal <n3.nas...@gmail.com>
> > wrote:
> >
> > > Thanks Vinoth for kicking off this thread. I have also been looking into
> > > hyperspace and it is definitely an interesting project. On exploring the
> > > project, I found the following in addition to what you mentioned
> > >
> > > - Super easy to use, has a simple API to integrate into a spark based
> > >   application
> > > - Record-level (aka needle in a haystack), the index doesn't perform well.
> > >   The underlying file format for indexing seems to be parquet to leverage
> > >   min, max and other columnar advantages to skip indexes.
> > > - Plugs in at the time of spark query planning to allow for automatic
> > >   indexing optimizations based on the created index (something I found
> > >   interesting and worth exploring especially for RFC-08)
> > >
> > > +1 on stepping on the gas on RFC-08/15 for record level + incremental
> > > indexing. Although, hyperspace does have a promising roadmap and it would
> > > be good to see some collaboration here as well.
> > >
> > > Thanks,
> > > Nishith
> > >
> > > On Sun, Jul 26, 2020 at 3:28 PM Vinoth Chandar <vin...@apache.org> wrote:
> > >
> > > > Hello all,
> > > >
> > > > In case you have not followed, Hyperspace is a new indexing subsystem
> > > > for Spark from Microsoft. It seemed like a very interesting project and
> > > > I tried to explore if it can help us with an indexing option inside Hudi.
> > > >
> > > > TL;DR :
> > > >
> > > > - Was exploring if hyperspace can be used as an alternative for our
> > > >   record/bloom indexes
> > > > - For the needle-in-a-haystack search, i.e. a single id out of all the
> > > >   records, hyperspace also seems to be not very effective atm (might
> > > >   not be surprising given the recommendations so far).
> > > > - Index refresh still seems non-incremental, i.e. rebuilding the
> > > >   entire index from scratch every time.
> > > > - Our old workhorse BLOOM_INDEX still significantly outperforms. But we
> > > >   should really step on the gas for RFC-15 like efforts/RFC-08 to make
> > > >   this much faster, which gives us an incrementally updating version
> > > >
> > > > Everything said, Hyperspace is a very cool project and it is only going
> > > > to get better over time. We have good ways of collaborating in the future.
> > > > Any hyperspace folks (if lurking here), please chime in (it's worth a
> > > > shot)
> > > >
> > > > You can find my experiments here.
> > > > https://gist.github.com/vinothchandar/593b19c47bea2406b9a8a9aaed30775a
> > > >
> > > > Please keep the conversations to the mailing list, so everyone can
> > > > chime in.
> > > >
> > > > Thanks
> > > > Vinoth
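
For anyone following along, the rule-injection idea Nishith describes in the quoted reply can be sketched without a Spark dependency. The snippet below is a toy model only: `Plan`, `Scan`, `IndexScan`, `Filter`, `Rule`, `Session` and this `FilterIndexRule` are hypothetical stand-ins written for illustration, not Spark's Catalyst classes and not Hyperspace's actual rule. It only shows the shape of the mechanism: a session holds a list of extra rules, and each rule may rewrite a filter over a table scan into a filter over an index scan. In real Spark, as far as I know, `extraOptimizations` is a `Seq[Rule[LogicalPlan]]`, so a single rule would need to be wrapped in a `Seq(...)` before using `++=`.

```scala
// Toy model of rule-based plan rewriting. These are NOT Spark's classes;
// every name here is a hypothetical stand-in for illustration.
sealed trait Plan
case class Scan(table: String) extends Plan
case class IndexScan(table: String, column: String) extends Plan
case class Filter(column: String, value: String, child: Plan) extends Plan

// A rule maps a plan to a (possibly rewritten) plan.
trait Rule { def apply(plan: Plan): Plan }

// Hypothetical rule: when an index exists on the filtered column of a
// scanned table, read from the index instead of the base table.
class FilterIndexRule(indexes: Set[(String, String)]) extends Rule {
  def apply(plan: Plan): Plan = plan match {
    case Filter(col, v, Scan(t)) if indexes((t, col)) =>
      Filter(col, v, IndexScan(t, col))
    case Filter(col, v, child) => Filter(col, v, apply(child))
    case other                 => other
  }
}

// Stand-in for the session's mutable list of extra optimizer rules.
class Session {
  var extraOptimizations: Seq[Rule] = Nil

  // Apply every registered rule in order.
  def optimize(plan: Plan): Plan =
    extraOptimizations.foldLeft(plan)((p, rule) => rule(p))
}

object Demo {
  def main(args: Array[String]): Unit = {
    val session = new Session
    val query = Filter("id", "42", Scan("trips"))

    // No rules registered yet, so the plan comes back unchanged.
    println(session.optimize(query))

    // Register the rule; the same query is now rewritten to use the index.
    session.extraOptimizations ++= Seq(new FilterIndexRule(Set(("trips", "id"))))
    println(session.optimize(query))
  }
}
```

The point of the design is that the query author never mentions the index: the same `Filter` over `Scan` is written either way, and the rewrite happens only when a matching rule has been registered on the session.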