Re: [DISCUSS] Hyperspace + Hudi

Vinoth Chandar Tue, 28 Jul 2020 10:29:17 -0700

Very informative. Thanks!

On Mon, Jul 27, 2020 at 5:09 PM nishith agarwal <n3.nas...@gmail.com> wrote:


> Yes.
>
> SparkSession has a reference to something called a SessionState here ->
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L152
>
> Each SessionState exposes experimentalMethods, a hook where you can plug in
> your own optimizations to intercept query planning, here ->
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala#L64
>
> So, if you want to add a new rule to the logical plan optimizer, you can
> do the following (extraOptimizations is a Seq of rules, so append a Seq):
>
> sparkSession.sessionState.experimentalMethods.extraOptimizations ++=
> Seq(<YOUR_OWN_RULE>)
>
> Now, whenever a *df.filter(..)* or a similar transformation is used, if you
> have enabled this RULE through the sparkSession, the indexing will
> automatically kick in based on your implementation of the RULE. Hyperspace
> provides some rules; one of them can be seen here ->
>
> https://github.com/microsoft/hyperspace/blob/master/src/main/scala/com/microsoft/hyperspace/index/rules/FilterIndexRule.scala
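>
> As a rough illustration of the plug-in mechanism described above, here is
> a minimal sketch in Scala. LogFiltersRule and RuleDemo are hypothetical
> names invented for this example and are not part of Spark or Hyperspace.
> The rule only logs the Filter nodes it sees and returns the plan unchanged,
> whereas a real indexing rule would rewrite the plan to read from an index.
>
> ```scala
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
> import org.apache.spark.sql.catalyst.rules.Rule
>
> // Hypothetical rule (assumption for illustration): logs every Filter node
> // and returns the plan untouched. An index rule would instead replace the
> // scanned relation with an index-backed one when the filter allows it.
> object LogFiltersRule extends Rule[LogicalPlan] {
>   override def apply(plan: LogicalPlan): LogicalPlan = {
>     plan.foreach {
>       case f: Filter => logInfo(s"Filter seen by optimizer: ${f.condition}")
>       case _ => // ignore every other node type
>     }
>     plan
>   }
> }
>
> object RuleDemo {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .master("local[1]")
>       .appName("extra-optimizations-demo")
>       .getOrCreate()
>
>     // extraOptimizations is a Seq[Rule[LogicalPlan]], so append a Seq.
>     // spark.experimental.extraOptimizations refers to the same object.
>     spark.sessionState.experimentalMethods.extraOptimizations ++=
>       Seq(LogFiltersRule)
>
>     // Any later query goes through the optimizer, which now also runs
>     // the extra rule registered above.
>     spark.range(0, 10).filter("id = 3").explain(true)
>
>     spark.stop()
>   }
> }
> ```
>
> Since the rule is registered on the session, it applies to every query
> planned through that SparkSession, with no changes to the query code.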
>
> Thanks,
> Nishith
>
> On Mon, Jul 27, 2020 at 11:53 AM Vinoth Chandar <vin...@apache.org> wrote:
>
> > Thanks Nishith!
> >
> > >>Plugs in at the time of spark query planning to allow for automatic
> > indexing optimizations based on the created index
> >
> > This is very interesting. Could you expand more? One day, I would love to
> > support point(ish) lookups on Hudi tables :)
> >
> > On Mon, Jul 27, 2020 at 8:29 AM nishith agarwal <n3.nas...@gmail.com>
> > wrote:
> >
> > > Thanks Vinoth for kicking off this thread. I have also been looking into
> > > Hyperspace, and it is definitely an interesting project. On exploring the
> > > project, I found the following in addition to what you mentioned:
> > >
> > > - Super easy to use, has a simple API to integrate into a spark based
> > > application
> > > - For record-level lookups (aka needle in a haystack), the index doesn't
> > > perform well. The underlying file format for indexing seems to be Parquet,
> > > to leverage min/max and other columnar advantages to skip over index data.
> > > - Plugs in at the time of spark query planning to allow for automatic
> > > indexing optimizations based on the created index (something I found
> > > interesting and worth exploring especially for RFC-08)
> > >
> > > +1 on stepping on the gas on RFC-08/15 for record-level + incremental
> > > indexing. That said, Hyperspace does have a promising roadmap, and it would
> > > be good to see some collaboration here as well.
> > >
> > > Thanks,
> > > Nishith
> > >
> > > On Sun, Jul 26, 2020 at 3:28 PM Vinoth Chandar <vin...@apache.org>
> > > wrote:
> > >
> > > > Hello all,
> > > >
> > > > In case you have not followed, Hyperspace is a new indexing subsystem for
> > > > Spark from Microsoft. It seemed like a very interesting project, and I
> > > > tried to explore whether it can help us with an indexing option inside
> > > > Hudi.
> > > >
> > > > TL;DR :
> > > >
> > > >    - Was exploring if Hyperspace can be used as an alternative for our
> > > >    record/bloom indexes
> > > >    - For the needle-in-a-haystack search, i.e. a single id out of all the
> > > >    records, Hyperspace also seems to be not very effective atm (might not
> > > >    be surprising given the recommendations so far).
> > > >    - Index refresh still seems to be non-incremental, i.e. rebuilding the
> > > >    entire index from scratch every time.
> > > >    - Our old workhorse BLOOM_INDEX still significantly outperforms it. But
> > > >    we should really step on the gas for RFC-15-like efforts/RFC-08 to
> > > >    make this much faster, which gives us an incrementally updating version
> > > > Everything said, Hyperspace is a very cool project and it is only going
> > > > to get better over time. We have good ways of collaborating in the
> > > > future. Any Hyperspace folks (if lurking here), please chime in (it's
> > > > worth a shot).
> > > >
> > > > You can find my experiments here.
> > > >
> > > > https://gist.github.com/vinothchandar/593b19c47bea2406b9a8a9aaed30775a
> > > >
> > > > Please keep the conversations to the mailing list, so everyone can
> > > > chime in.
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> >
>
