[DISCUSS] Hyperspace + Hudi

Vinoth Chandar Sun, 26 Jul 2020 15:28:17 -0700

Hello all,

In case you have not been following it, Hyperspace is a new indexing
subsystem for Spark from Microsoft. It seemed like a very interesting project,
and I tried to explore whether it can help us with an indexing option inside
Hudi.


TL;DR :

   - I was exploring whether Hyperspace can be used as an alternative to our
   record/bloom indexes.
   - For the needle-in-a-haystack search, i.e. looking up a single id out of
   all the records, Hyperspace does not seem very effective at the moment
   (which might not be surprising given the recommendations so far).
   - Index refresh still seems to be non-incremental, i.e. it rebuilds the
   entire index from scratch every time.
   - Our old workhorse BLOOM_INDEX still significantly outperforms it. But we
   should really step on the gas on efforts like RFC-15/RFC-08 to make this
   much faster, which would also give us an incrementally updating version.
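
For anyone less familiar with why a bloom-style index suits the single-id
lookup, here is a minimal Python sketch of the idea. It is an illustration
only, not Hudi's BLOOM_INDEX code and not what I ran in the experiments: the
names BloomFilter, add_file and candidate_files are made up for this example,
and the filter sizes are arbitrary. It keeps one small filter per file, uses
the filters to rule out files that cannot contain a key, and updates the
"index" by adding a filter for each new file instead of rebuilding everything.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def add_file(filters_by_file, file_name, record_keys):
    """Incremental update: build a filter for the newly written file only."""
    bf = BloomFilter()
    for key in record_keys:
        bf.add(key)
    filters_by_file[file_name] = bf


def candidate_files(filters_by_file, record_key):
    """Return the files whose filter cannot rule out record_key."""
    return [name for name, bf in filters_by_file.items()
            if bf.might_contain(record_key)]


# Hypothetical file names and record keys, for illustration only.
filters = {}
add_file(filters, "file-1.parquet", ["id-001", "id-002"])
add_file(filters, "file-2.parquet", ["id-101", "id-102"])
# A later write touches only the new file's filter; existing ones are untouched.
add_file(filters, "file-3.parquet", ["id-201"])

print(candidate_files(filters, "id-101"))
```

The file that actually holds the key is always among the candidates; other
files may occasionally show up as false positives and then have to be checked
by reading them.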

All that said, Hyperspace is a very cool project and it is only going to
get better over time. We have good ways of collaborating in the future. Any
Hyperspace folks (if lurking here), please chime in (it's worth a shot).

You can find my experiments here:
https://gist.github.com/vinothchandar/593b19c47bea2406b9a8a9aaed30775a

Please keep the conversations to the mailing list, so everyone can chime
in.

Thanks
Vinoth
