Thanks, Vinoth, for kicking off this thread. I have also been looking into
Hyperspace, and it is definitely an interesting project. On exploring the
project, I found the following in addition to what you mentioned:

- Super easy to use; it has a simple API to integrate into a Spark-based
application.
- For record-level lookups (aka needle in a haystack), the index doesn't
perform well. The underlying file format for the index seems to be Parquet,
to leverage min/max statistics and other columnar advantages for skipping
index data.
- Plugs in at Spark query planning time to allow for automatic
optimizations based on the created index (something I found
interesting and worth exploring, especially for RFC-08).
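
As a rough illustration of the min/max point above, here is a toy model in
plain Python. It is not Hyperspace or Parquet code, and the file count, key
range and helper names are all made up for the example. It only shows the
general behavior that min/max statistics prune files when keys are clustered
by value, and prune very little when a single key can fall inside every
file's range. It is not meant as a diagnosis of the benchmark results.

```python
import random


def make_file_stats(keys, num_files):
    """Split keys into equally sized 'files' and keep each file's (min, max)."""
    size = len(keys) // num_files
    files = [keys[i * size:(i + 1) * size] for i in range(num_files)]
    return [(min(f), max(f)) for f in files]


def files_to_scan(stats, key):
    """Count files whose [min, max] range could contain the key."""
    return sum(1 for lo, hi in stats if lo <= key <= hi)


random.seed(42)
keys = list(range(100_000))   # hypothetical record keys
needle = 54_321               # the single record we are looking for

# Layout 1: keys written in sorted order, so each file covers a narrow range.
sorted_stats = make_file_stats(keys, num_files=100)

# Layout 2: keys written in random order, so each file spans almost the
# whole key space and its min/max range says little about its contents.
shuffled = keys[:]
random.shuffle(shuffled)
random_stats = make_file_stats(shuffled, num_files=100)

print("sorted layout: scan", files_to_scan(sorted_stats, needle), "of 100 files")
print("random layout: scan", files_to_scan(random_stats, needle), "of 100 files")
```

With the sorted layout only the file containing the key has to be read; with
the random layout almost every file remains a candidate.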

+1 on stepping on the gas for RFC-08/15 for record-level + incremental
indexing. That said, Hyperspace does have a promising roadmap, and it would
be good to see some collaboration here as well.

Thanks,
Nishith

On Sun, Jul 26, 2020 at 3:28 PM Vinoth Chandar <[email protected]> wrote:

> Hello all,
>
> In case you have not followed, Hyperspace is a new indexing subsystem for
> Spark from Microsoft. It seemed like a very interesting project and I tried
> to explore if it can help us with an indexing option inside Hudi.
>
> TL;DR :
>
>    - Was exploring whether Hyperspace can be used as an alternative to our
>    record/bloom indexes.
>    - For the needle-in-a-haystack search, i.e. a single id out of all the
>    records, Hyperspace also seems to be not very effective atm (might not
>    be surprising given the recommendations so far).
>    - Index refresh still seems to be non-incremental, i.e. rebuilding the
>    entire index from scratch every time.
>    - Our old workhorse BLOOM_INDEX still significantly outperforms it. But we
>    should really step on the gas for RFC-15-like efforts/RFC-08 to make
>    this much faster, which gives us an incrementally updating version.
>
> Everything said, Hyperspace is a very cool project and it is only going to
> get better over time. We have good ways of collaborating in the future. Any
> hyperspace folks (if lurking here), please chime in (it's worth a shot)
>
> You can find my experiments here.
> https://gist.github.com/vinothchandar/593b19c47bea2406b9a8a9aaed30775a
>
> Please keep the conversations to the mailing list, so everyone can chime
> in.
>
> Thanks
> Vinoth
>
