Thanks Nishith!

>> Plugs in at the time of spark query planning to allow for automatic indexing optimizations based on the created index

This is very interesting. Could you expand more? One day, I'd love to support point(ish) lookups on Hudi tables :)

On Mon, Jul 27, 2020 at 8:29 AM nishith agarwal <[email protected]> wrote:

> Thanks Vinoth for kicking off this thread. I have also been looking into
> hyperspace and it is definitely an interesting project. On exploring the
> project, I found the following in addition to what you mentioned
>
> - Super easy to use, has a simple API to integrate into a spark based
> application
> - Record-level (aka needle in a haystack), the index doesn't perform well.
> The underlying file format for indexing seems to be parquet to leverage
> min, max and other columnar advantages to skip indexes.
> - Plugs in at the time of spark query planning to allow for automatic
> indexing optimizations based on the created index (something I found
> interesting and worth exploring especially for RFC-08)
>
> +1 on stepping on the gas on RFC-08/15 for record level + incremental
> indexing. Although, hyperspace does have a promising roadmap and it would
> be good to see some collaboration here as well.
>
> Thanks,
> Nishith
>
> On Sun, Jul 26, 2020 at 3:28 PM Vinoth Chandar <[email protected]> wrote:
>
> > Hello all,
> >
> > In case you have not followed, Hyperspace is a new indexing subsystem for
> > Spark from Microsoft. It seemed like a very interesting project and I
> > tried to explore if it can help us with an indexing option inside Hudi.
> >
> > TL;DR :
> >
> > - Was exploring if hyperspace can be used as an alternative for our
> > record/bloom indexes
> > - For the needle-in-a-haystack search i.e. a single id out of all the
> > records, hyperspace also seems to be not very effective atm (might not
> > be surprising given the recommendations so far).
> > - Index refresh still seems non-incremental i.e. rebuilding the
> > entire index from scratch every time.
> > - Our old workhorse BLOOM_INDEX still significantly outperforms. But we
> > should really step on the gas for RFC-15 like efforts/RFC-08 to make
> > this much faster, which gives us an incrementally updating version
> >
> > Everything said, Hyperspace is a very cool project and it is only going
> > to get better over time. We have good ways of collaborating in the
> > future. Any hyperspace folks (if lurking here), please chime in (it's
> > worth a shot)
> >
> > You can find my experiments here.
> > https://gist.github.com/vinothchandar/593b19c47bea2406b9a8a9aaed30775a
> >
> > Please keep the conversations to the mailing list, so everyone can chime
> > in.
> >
> > Thanks
> > Vinoth
