Thanks Vinoth for kicking off this thread. I have also been looking into Hyperspace, and it is definitely an interesting project. On exploring the project, I found the following in addition to what you mentioned:

- Super easy to use, has a simple API to integrate into a Spark-based application
- For record-level lookups (aka needle in a haystack), the index doesn't perform well. The underlying file format for indexing seems to be Parquet, to leverage min/max and other columnar advantages for skipping.
- Plugs in at Spark query planning time to allow for automatic indexing optimizations based on the created index (something I found interesting and worth exploring, especially for RFC-08)

+1 on stepping on the gas on RFC-08/15 for record-level + incremental indexing. Although, Hyperspace does have a promising roadmap, and it would be good to see some collaboration here as well.

Thanks,
Nishith

On Sun, Jul 26, 2020 at 3:28 PM Vinoth Chandar <[email protected]> wrote:

> Hello all,
>
> In case you have not followed, Hyperspace is a new indexing subsystem for
> Spark from Microsoft. It seemed like a very interesting project and I tried
> to explore if it can help us with an indexing option inside Hudi.
>
> TL;DR :
>
> - Was exploring if hyperspace can be used as an alternative for our
>   record/bloom indexes
> - For the needle-in-a-haystack search i.e. a single id out of all the
>   records, hyperspace also seems to be not very effective atm (might not
>   be surprising given the recommendations so far).
> - Index refresh still seems to be non-incremental i.e. rebuilding the
>   entire index from scratch every time.
> - Our old workhorse BLOOM_INDEX still significantly outperforms. But we
>   should really step on the gas for RFC-15 like efforts/RFC-08 to make
>   this much faster, which gives us an incrementally updating version
>
> Everything said, Hyperspace is a very cool project and it is only going to
> get better over time. We have good ways of collaborating in the future. Any
> hyperspace folks (if lurking here), please chime in (it's worth a shot)
>
> You can find my experiments here.
> https://gist.github.com/vinothchandar/593b19c47bea2406b9a8a9aaed30775a
>
> Please keep the conversations to the mailing list, so everyone can chime
> in.
>
> Thanks
> Vinoth
>
