Re: [DISCUSS] Hyperspace + Hudi

nishith agarwal Mon, 27 Jul 2020 17:10:09 -0700

Yes.

SparkSession has a reference to something called a SessionState here ->
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L152


Each SessionState exposes experimentalMethods, a set of hooks where you can
plug in your own optimizations to intercept query planning, here ->
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala#L64

So, say you want to add a new optimizer rule for the logical plan; you can
do it as follows (extraOptimizations is a Seq, so the rule is wrapped in one):

sparkSession.sessionState.experimentalMethods.extraOptimizations ++=
Seq(<YOUR_OWN_RULE>)

Now, whenever a df.filter(..) or a similar transformation is used, if you
have enabled this RULE through the sparkSession, the indexing will
automatically kick in based on your implementation of the RULE. Hyperspace
provides some rules; one of them can be seen here ->
https://github.com/microsoft/hyperspace/blob/master/src/main/scala/com/microsoft/hyperspace/index/rules/FilterIndexRule.scala
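
To make the registration step concrete, here is a minimal sketch, assuming
Spark 2.4/3.x with spark-sql on the classpath. CountFiltersRule and
ExtraOptimizationDemo are hypothetical names made up for illustration, not
part of Spark or Hyperspace. The rule only counts the Filter nodes it sees
and returns the plan unchanged; an indexing rule such as FilterIndexRule
would instead rewrite the matching part of the plan to read from the index.

```scala
import java.util.concurrent.atomic.AtomicLong

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule, for illustration only: it observes the plan and
// leaves it untouched. A real indexing rule would return a rewritten plan.
object CountFiltersRule extends Rule[LogicalPlan] {
  val filtersSeen = new AtomicLong(0)

  override def apply(plan: LogicalPlan): LogicalPlan = {
    // Walk every node of the logical plan and count the Filter nodes.
    plan.foreach {
      case _: Filter => filtersSeen.incrementAndGet()
      case _ => ()
    }
    plan
  }
}

object ExtraOptimizationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("extra-optimizations-sketch")
      .getOrCreate()

    // Register the rule for this session. extraOptimizations is a
    // Seq[Rule[LogicalPlan]], so the rule is appended as a one-element Seq.
    spark.sessionState.experimentalMethods.extraOptimizations ++= Seq(CountFiltersRule)

    // Any query planned through this session now passes through the rule.
    spark.range(10).filter("id = 2").collect()

    println(s"Filter nodes seen by the rule: ${CountFiltersRule.filtersSeen.get()}")
    spark.stop()
  }
}
```

spark.experimental is a public accessor for the same experimentalMethods
object, so spark.experimental.extraOptimizations can be used instead of going
through sessionState.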

Thanks,
Nishith

On Mon, Jul 27, 2020 at 11:53 AM Vinoth Chandar <vin...@apache.org> wrote:

> Thanks Nishith!
>
> >>Plugs in at the time of spark query planning to allow for automatic
> indexing optimizations based on the created index
>
> This is very interesting. Could you expand more? One day, I would love to
> support point(ish) lookups on Hudi tables :)
>
> On Mon, Jul 27, 2020 at 8:29 AM nishith agarwal <n3.nas...@gmail.com>
> wrote:
>
> > Thanks Vinoth for kicking off this thread. I have also been looking into
> > Hyperspace and it is definitely an interesting project. On exploring the
> > project, I found the following in addition to what you mentioned:
> >
> > - Super easy to use, has a simple API to integrate into a spark based
> > application
> > - For record-level lookups (aka needle in a haystack), the index doesn't
> > perform well. The underlying file format for indexing seems to be parquet,
> > to leverage min, max and other columnar advantages to skip data.
> > - Plugs in at the time of spark query planning to allow for automatic
> > indexing optimizations based on the created index (something I found
> > interesting and worth exploring especially for RFC-08)
> >
> > +1 on stepping on the gas on RFC-08/15 for record level + incremental
> > indexing. That said, Hyperspace does have a promising roadmap and it would
> > be good to see some collaboration here as well.
> >
> > Thanks,
> > Nishith
> >
> > On Sun, Jul 26, 2020 at 3:28 PM Vinoth Chandar <vin...@apache.org>
> wrote:
> >
> > > Hello all,
> > >
> > > In case you have not followed it, Hyperspace is a new indexing subsystem
> > > for Spark from Microsoft. It seemed like a very interesting project and I
> > > tried to explore if it can help us with an indexing option inside Hudi.
> > >
> > > TL;DR :
> > >
> > >    - Was exploring if Hyperspace can be used as an alternative for our
> > >    record/bloom indexes
> > >    - For the needle-in-a-haystack search, i.e. a single id out of all the
> > >    records, Hyperspace also seems to be not very effective atm (might not
> > >    be surprising given the recommendations so far).
> > >    - Index refresh still seems to be non-incremental, i.e. rebuilding the
> > >    entire index from scratch every time.
> > >    - Our old workhorse BLOOM_INDEX still significantly outperforms it. But
> > >    we should really step on the gas for RFC-15-like efforts/RFC-08 to make
> > >    this much faster, which gives us an incrementally updating version
> > >
> > > Everything said, Hyperspace is a very cool project and it is only going
> > > to get better over time. We have good ways of collaborating in the future.
> > > Any Hyperspace folks (if lurking here), please chime in (it's worth a shot)
> > >
> > > You can find my experiments here.
> > > https://gist.github.com/vinothchandar/593b19c47bea2406b9a8a9aaed30775a
> > >
> > > Please keep the conversations to the mailing list, so everyone can
> > > chime in.
> > >
> > > Thanks
> > > Vinoth
> > >
> >
>
