I watched Lars George's video about HBase and read the documentation,
and both say that it's not a good idea to use the timestamp as the
key, because all writes will hit the same region until the timestamp
reaches a certain value and moves to the next region (hotspotting).

I have a table with a unique key, a file path and a "last update"
field. With the ID I can easily look up the file and find out when it
was last updated.

But I also need to find the files that have not been updated for more
than a certain period of time.

If I want to retrieve that from this single table, I will have to do a
full scan of the table, which might take a while.

So I thought of building a second table to reference that (a kind of
secondary index). The key is the "last update" value, there is one CF
(column family), and each column holds the ID of a file with dummy
content.

When a file is updated, I remove its cell from this table, and
introduce a new cell with the new timestamp as the key.

And so on.
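
To make the idea concrete, here is a rough sketch of the bookkeeping
in Python. It is only an in-memory stand-in for the two tables (a dict
for the main table, a sorted list for the index table); none of it is
HBase API, and the class and method names are made up for illustration.

```python
import bisect


class LastUpdateIndex:
    """In-memory stand-in for the main table plus the index table."""

    def __init__(self):
        self.files = {}   # main table: file_id -> (path, last_update)
        self.index = []   # index table: sorted (last_update, file_id) pairs

    def put(self, file_id, path, last_update):
        old = self.files.get(file_id)
        if old is not None:
            # Remove the stale index cell before adding the new one.
            pos = bisect.bisect_left(self.index, (old[1], file_id))
            del self.index[pos]
        self.files[file_id] = (path, last_update)
        bisect.insort(self.index, (last_update, file_id))

    def not_updated_since(self, cutoff):
        """Return the IDs whose last update is strictly before cutoff."""
        # (cutoff,) sorts before every (cutoff, file_id) pair, so this
        # is a range scan from the start of the index up to the cutoff.
        end = bisect.bisect_left(self.index, (cutoff,))
        return [file_id for _, file_id in self.index[:end]]
```

Because the index is sorted by timestamp, the "stale files" query is a
single range scan from the beginning, but for the same reason every new
entry lands at the end of the key space.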

With this schema, I can find the files by ID very quickly and I can
find the files which need to be updated pretty quickly too. But it's
hotspotting one region.

From the video (0:45:10) I can see 4 situations.
1) Hotspotting.
2) Salting.
3) Key field swap/promotion
4) Randomization.

I need to avoid hotspotting, so I looked at the 3 other options.

I can do salting, e.g. prefix the timestamp with a number between 0
and 9. That will distribute the load over 10 servers. To find all
the files with a timestamp below a specific value, I will need to run
10 requests instead of one. But when the load becomes too big for
10 servers, I will have to prefix with a value between 0 and 99, which
means 100 requests? And the more regions I have, the more requests
I will have to do. Is that really a good approach?
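
Here is roughly what I mean by "one request per salt", again as a
Python sketch over a sorted in-memory list rather than real HBase
calls. Deriving the salt from a CRC32 of the file ID is my own
assumption (it lets me recompute the old key when a file is updated);
the function names are hypothetical.

```python
import bisect
import zlib

NUM_BUCKETS = 10  # assumption: salt values 0..9


def salt_for(file_id, num_buckets=NUM_BUCKETS):
    # Assumed salt: stable hash of the file ID, so the old index key
    # of a file can be recomputed when the file is updated.
    return zlib.crc32(file_id.encode("utf-8")) % num_buckets


def index_entry(last_update, file_id, num_buckets=NUM_BUCKETS):
    # Stand-in for row key "salt + timestamp" with the ID as column.
    return (salt_for(file_id, num_buckets), last_update, file_id)


def not_updated_since(sorted_index, cutoff, num_buckets=NUM_BUCKETS):
    """Run one range scan per salt bucket and merge the results."""
    found = []
    for salt in range(num_buckets):
        start = bisect.bisect_left(sorted_index, (salt,))
        stop = bisect.bisect_left(sorted_index, (salt, cutoff))
        found.extend(sorted_index[start:stop])
    # The per-bucket results are each sorted; re-sort the union by time.
    found.sort(key=lambda entry: (entry[1], entry[2]))
    return [(ts, file_id) for _, ts, file_id in found]
```

The number of scans equals the number of buckets, which is exactly the
cost I am worried about when going from 10 to 100 prefixes.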

Key field swap is close to salting. I can add the first few bytes from
the path before the timestamp, but the issue will remain the same.

I looked at randomization, and I can't do that. Otherwise I will have
no way to retrieve the information I'm looking for.

So the question is: is there a good way to store the data so that I
can retrieve it based on the date?

Thanks,

JM
