Hi all,

Geospatial use cases require indexing multiple dimensions; projects like GeoMesa and GeoWave have thought about this quite a bit (I work on the former).

If the ids are unique and queries have a definite time range, then a key structure like "id+timestamp" will let you do quick scans for each of the ids.  Each scan may take a few milliseconds, and only the necessary data will be read.
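A toy sketch of that key layout (the fixed 8-byte widths and non-negative id/timestamp values are assumptions; adapt the encoding to your actual id format):

```python
import struct

def row_key(id_, ts):
    # Rowkey = 8-byte big-endian id + 8-byte big-endian timestamp.
    # Big-endian unsigned encoding makes lexicographic byte order match
    # numeric order, which is the order HBase sorts rowkeys in.
    return struct.pack(">QQ", id_, ts)

start = row_key(42, 1_620_000_000)   # scan start (inclusive)
stop  = row_key(42, 1_620_001_000)   # scan stop (exclusive)
row   = row_key(42, 1_620_000_500)

# This row falls inside the per-id time-range scan:
print(start <= row < stop)                            # True
# Rows for other ids sort entirely outside the range:
print(start <= row_key(43, 1_620_000_500) < stop)     # False
```

With keys shaped like this, each id in the query becomes one small bounded scan over exactly its time window.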

If ids are contiguous, you may need to figure out a way to mash up the ids and the timestamps.  This would be similar to how spatial indexes in key-value stores (like HBase and Accumulo), as implemented by GeoMesa and GeoWave, use space-filling curves.  The queries would be two-dimensional ranges, and your query planner would map those ranges onto index/key ranges.
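To make that concrete, here is a toy sketch of the bit-interleaving behind a Z-order (Morton) curve; GeoMesa and GeoWave use more sophisticated curve implementations and range-decomposition logic, but the core idea is the same:

```python
def interleave(x, y, bits=32):
    # Z-order / Morton encoding: interleave the bits of the two
    # dimensions (e.g. id and coarsened timestamp) into one key, so
    # points close in both dimensions tend to be close on the 1-D curve.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits -> even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits -> odd positions
    return z

def deinterleave(z, bits=32):
    # Recover the original (x, y) pair from a Morton code.
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i)) & 1) << i
        y |= ((z >> (2 * i + 1)) & 1) << i
    return x, y

# Round-trip check: the curve position encodes both dimensions.
z = interleave(5, 9)
print(deinterleave(z))   # (5, 9)
```

The query planner's job is then to decompose a 2-D query rectangle into the (usually small) set of contiguous curve intervals that cover it, and issue one scan per interval.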

Cheers,

Jim

On 5/4/21 7:23 PM, Nick Dimiduk wrote:
Hi Kevin,

Did you get an answer to your question, maybe over on hbase-user?

As it seems you're aware, HBase is built on a single index -- the rowkey.
You may be able to implement something like MySQL's composite indexing on
HBase if the algorithm can be mapped to a 1-dimensional linear index. You
would have to implement this yourself, as HBase doesn't offer it out of
the box. Such an encoding would be an interesting contribution to HBase; it
might sit next to our other data encoding "types" in
`org.apache.hadoop.hbase.types`.

As for why your filtered queries are slow, you're the best person to start
answering that question. Is your data local to the region server that's
hosting it, or do you have multiple network hops and service
serialize/deserialize steps in your hot path? Is your index optimized for
your query (sounds like maybe not, based on the first question)? Have you
seen the Profiling Servlet [0]? You can start by setting that up, isolating
the workload, and collecting some FlameGraphs to analyze.

Thanks,
Nick

[0]: https://hbase.apache.org/book.html#profiler

On Mon, Apr 12, 2021 at 10:26 AM Kevin Wright <kevinwright1...@gmail.com>
wrote:

Hi!

Our application requires fast read queries that specify two ranges: one
range on timestamps, and another on ids. We are currently using Apache
HBase as our db, but we’re unsure how to optimally design the row keys /
schemas. Currently, scanning over the row key (the ids) with a filter on
time ranges is taking more time than we expect. A typical query would
match perhaps 200 rows on the id range and about 10 rows on both ranges,
and we currently have on the order of tens of millions of rows.

We’re wondering if there’s something we can do to increase throughput with
HBase (e.g., is there something like composite indexing in MySQL?).
Not sure if this is the best place to ask, but if anyone could point
us in the right direction, that would be great!

Thank you!
