Yes. SparkSession has a reference to something called a SessionState here -> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L152
Each SessionState exposes a set of experimentalMethods that let you plug in your own optimizations, here ->
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala#L64

So, say you want to add a new optimization rule to the logical plan; you can do it as follows:

sparkSession.sessionState.experimentalMethods.extraOptimizations ++= Seq(<YOUR_OWN_RULE>)

(extraOptimizations is a Seq of rules, so ++= takes a collection.)

Now, whenever df.filter(..) or a similar transformation is used, and you have enabled this rule through the sparkSession, the indexing will automatically kick in based on your implementation of the rule.

Hyperspace provides some rules; one of them can be seen here ->
https://github.com/microsoft/hyperspace/blob/master/src/main/scala/com/microsoft/hyperspace/index/rules/FilterIndexRule.scala

Thanks,
Nishith

On Mon, Jul 27, 2020 at 11:53 AM Vinoth Chandar <vin...@apache.org> wrote:

> Thanks Nishith!
>
> >>Plugs in at the time of spark query planning to allow for automatic
> indexing optimizations based on the created index
>
> This is very interesting. Could you expand more? One day, love to support
> point(ish) lookups on Hudi tables :)
>
> On Mon, Jul 27, 2020 at 8:29 AM nishith agarwal <n3.nas...@gmail.com> wrote:
>
> > Thanks Vinoth for kicking off this thread. I have also been looking into
> > hyperspace and it is definitely an interesting project. On exploring the
> > project, I found the following in addition to what you mentioned
> >
> > - Super easy to use, has a simple API to integrate into a spark based
> >   application
> > - For record-level (aka needle in a haystack) lookups, the index doesn't
> >   perform well. The underlying file format for indexing seems to be
> >   parquet, to leverage min/max and other columnar advantages to skip
> >   indexes.
> > - Plugs in at the time of spark query planning to allow for automatic
> >   indexing optimizations based on the created index (something I found
> >   interesting and worth exploring especially for RFC-08)
> >
> > +1 on stepping on the gas on RFC-08/15 for record level + incremental
> > indexing. Although, hyperspace does have a promising roadmap and it would
> > be good to see some collaboration here as well.
> >
> > Thanks,
> > Nishith
> >
> > On Sun, Jul 26, 2020 at 3:28 PM Vinoth Chandar <vin...@apache.org> wrote:
> >
> > > Hello all,
> > >
> > > In case you have not followed, Hyperspace is a new indexing subsystem
> > > for Spark from Microsoft. It seemed like a very interesting project and
> > > I tried to explore if it can help us with an indexing option inside
> > > Hudi.
> > >
> > > TL;DR :
> > >
> > > - Was exploring if hyperspace can be used as an alternative for our
> > >   record/bloom indexes
> > > - For the needle-in-a-haystack search, i.e. a single id out of all the
> > >   records, hyperspace also seems to be not very effective atm (might
> > >   not be surprising given the recommendations so far).
> > > - Index refresh still seems non-incremental, i.e. rebuilding the entire
> > >   index from scratch every time.
> > > - Our old workhorse BLOOM_INDEX still significantly outperforms. But we
> > >   should really step on the gas for RFC-15 like efforts/RFC-08 to make
> > >   this much faster, which gives us an incrementally updating version
> > >
> > > Everything said, Hyperspace is a very cool project and it is only going
> > > to get better over time. We have good ways of collaborating in the
> > > future. Any hyperspace folks (if lurking here), please chime in (it's
> > > worth a shot)
> > >
> > > You can find my experiments here.
> > > https://gist.github.com/vinothchandar/593b19c47bea2406b9a8a9aaed30775a
> > >
> > > Please keep the conversations to the mailing list, so everyone can
> > > chime in.
> > >
> > > Thanks
> > > Vinoth
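
P.S. To make the extraOptimizations hook described at the top of this mail a bit more concrete, here is a minimal sketch of registering a custom rule. It is not how Hyperspace implements its rules: PrintFilterRule and RuleDemo are hypothetical names made up for illustration, and the rule only prints the Filter nodes it sees and returns the plan unchanged. It assumes Spark 2.x/3.x on the classpath, where spark.experimental is the public accessor for the same ExperimentalMethods object that sessionState.experimentalMethods points to.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.functions.col

// Hypothetical rule for illustration only: it inspects the logical plan,
// prints every Filter node it finds, and returns the plan untouched.
// A real indexing rule would instead return a rewritten plan that reads
// from the index.
object PrintFilterRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    plan.foreach {
      case f: Filter => println(s"PrintFilterRule saw filter: ${f.condition}")
      case _         => // ignore every other node type
    }
    plan
  }
}

object RuleDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("extra-optimizations-demo")
      .getOrCreate()

    // extraOptimizations is a Seq[Rule[LogicalPlan]], so append a Seq.
    spark.experimental.extraOptimizations ++= Seq(PrintFilterRule)

    // Any query planned from now on in this session passes through the
    // rule; the optimizer may invoke it more than once per query.
    val df = spark.range(0, 10).filter(col("id") > 5)
    df.explain(true)

    spark.stop()
  }
}
```

Since the rule is registered on the session, nothing changes in the df.filter(..) call itself, which is the "automatic" part mentioned above.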