Hi Flavio,
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)? It will allow you to model your multi-part row key like this:

CREATE TABLE flavio.analytics (
    source INTEGER,
    type INTEGER,
    qual VARCHAR,
    hash VARCHAR,
    ts DATE
CONSTRAINT pk PRIMARY KEY (source, type, qual, hash, ts) // Defines columns that make up the row key
)

Then you can issue SQL queries like this (to query for the last 7 days' worth of data):

SELECT * FROM flavio.analytics
WHERE source IN (1,2,5) AND type IN (55,66) AND ts > CURRENT_DATE() - 7

This will internally take advantage of our SkipScan (http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html) to jump through your key space similar to FuzzyRowFilter, but in parallel from the client taking into account your region boundaries.

Or do more complex GROUP BY queries like this (to aggregate over the last 30 days' worth of data, bucketized by day):

SELECT type,COUNT(*) FROM flavio.analytics
WHERE ts > CURRENT_DATE() - 30
GROUP BY type,TRUNCATE(ts,'DAY')

No need to worry about lexicographical sort order, flipping sign bits, normalizing/padding integer values, and all the other nuances of working with an API that works at the level of bytes. No need to write and manage installation of your own coprocessors to make aggregation efficient, perform topN queries, etc.

HTH.

Regards,
James
@JamesPlusPlus

On 07/03/2013 02:58 AM, Anoop John wrote:
When you build the row key and convert the int parts into byte[] (use
org.apache.hadoop.hbase.util.Bytes#toBytes(int)), it will give 4 bytes
for every int. Be careful about the ordering: when you convert a +ve
and a -ve integer into byte[] and do a lexicographical compare (as done in
HBase), you will see the -ve number being greater than the +ve one. If you
don't have to deal with -ve numbers, no issues :)
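
The ordering pitfall described above can be reproduced without a cluster. What follows is a minimal sketch in plain Java, assuming that `ByteBuffer.putInt` produces the same big-endian 4-byte layout as `Bytes.toBytes(int)`. The class `IntKeyOrder` and its methods are hypothetical names used only for illustration. It also shows one common workaround, flipping the sign bit before encoding, as a possible option rather than a required step.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not part of HBase.
class IntKeyOrder {
    // Big-endian 4-byte encoding (assumed to match Bytes.toBytes(int)).
    static byte[] encode(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    // Flip the sign bit so that negative values sort before positive ones
    // when the bytes are compared as unsigned values.
    static byte[] encodeSortable(int v) {
        return encode(v ^ Integer.MIN_VALUE);
    }

    // Unsigned lexicographic comparison, the way row keys are ordered.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }
}
```

With the raw encoding, `compare(encode(-1), encode(1))` is positive, so -1 sorts after 1. With `encodeSortable` the same comparison is negative, which matches numeric order.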

Well, when all the parts of the row key are of fixed width, do you need any
separator?

-Anoop-

On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

Yeah, I was thinking of using a normalization step in order to allow the use
of FuzzyRowFilter, but what is not clear to me is whether integers must also
be normalized or not.
Let me explain better. Suppose that I follow your advice and
produce keys like:
  - 1|1|somehash|sometimestamp
  - 55|555|somehash|sometimestamp

Would they match the same pattern, or do I have to normalize them to the
following?
  - 001|001|somehash|sometimestamp
  - 055|555|somehash|sometimestamp

Moreover, I noticed that you used dots ('.') to separate things instead of
pipes ('|'). Is there a reason for that (maybe performance or whatever), or
is it just your favourite separator?
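
On the padding question above: a position-based pattern only lines up if every field starts at the same offset in every key. That is what the zero-padded form gives you when keys are kept as strings. The snippet below is a minimal sketch, assuming three digits are enough for both integer parts. `PaddedKey` is a hypothetical name, and `Locale.ROOT` is used so the digits are always ASCII.

```java
import java.util.Locale;

// Hypothetical helper for illustration only.
class PaddedKey {
    // Zero-pad both integer parts to 3 digits so that every field
    // starts at a constant offset within the key.
    static String build(int source, int type, String hash, long ts) {
        return String.format(Locale.ROOT, "%03d|%03d|%s|%d", source, type, hash, ts);
    }
}
```

With this layout the hash always begins at character offset 8, whether the key was built from (1, 1) or from (55, 555).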

Best,
Flavio


On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <m...@axiak.net> wrote:

I'm not sure if you're eliding this fact or not, but you'd be much
better off if you used a fixed-width format for your keys. So in your
example, you'd have:

PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
hash.8-byte timestamp

Example: \x00\x00\x00\x01\x00\x00\x02\x03....

The advantage of this is not only that it's significantly less data
(remember your key is stored on each KeyValue), but also you can now
use FuzzyRowFilter and other techniques to quickly perform scans. The
disadvantage is that you have to map each source to an integer, but I
find I can either store that in an enum or cache it for a long time, so
it's not a big issue.
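
A minimal sketch of the fixed-width layout described above, in plain Java. The widths follow the pattern (4 + 4 + 16 + 8 = 32 bytes). `FixedWidthKey` and its methods are hypothetical names. The mask uses the convention documented for FuzzyRowFilter (0 = this byte must match, 1 = any byte), and `matches` is only a local stand-in for the comparison the filter performs on the server, not the filter's own code.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of HBase.
class FixedWidthKey {
    static final int LENGTH = 4 + 4 + 16 + 8;

    // source (4 bytes) . type (4 bytes) . hash (16 bytes) . timestamp (8 bytes)
    static byte[] build(int source, int type, byte[] hash16, long ts) {
        if (hash16.length != 16) {
            throw new IllegalArgumentException("hash must be 16 bytes");
        }
        return ByteBuffer.allocate(LENGTH)
                .putInt(source).putInt(type).put(hash16).putLong(ts)
                .array();
    }

    // Mask that fixes only the type field (bytes 4..7); every other byte is a wildcard.
    static byte[] typeOnlyMask() {
        byte[] mask = new byte[LENGTH];
        Arrays.fill(mask, (byte) 1);
        Arrays.fill(mask, 4, 8, (byte) 0);
        return mask;
    }

    // Local stand-in for the fuzzy match: compare only the bytes whose mask is 0.
    static boolean matches(byte[] row, byte[] pattern, byte[] mask) {
        if (row.length < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }
}
```

For example, `build(1, 515, new byte[16], 0L)` starts with the bytes `\x00\x00\x00\x01\x00\x00\x02\x03`, as in the example above. A pattern built with type 515 plus `typeOnlyMask()` matches rows of any source that carry that type, which a plain prefix scan cannot express.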

-Mike

On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

Thank you very much for the great support!
This is how I thought to design my key:

PATTERN: source|type|qualifier|hash(name)|timestamp
EXAMPLE:
google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753

Do you think my key could be good for my scope (my search will be
essentially by source or source|type)?
Another point is that initially I will not have so many sources, so I will
probably have only google|* but in the next phases there could be more
sources.

Best,
Flavio

On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <yuzhih...@gmail.com> wrote:

For #1, yes - the client receives less data after filtering.

For #2, please take a look at TestMultiVersions
(./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in 0.94)
for time range:

     Scan scan = new Scan();
     scan.setTimeRange(1000L, Long.MAX_VALUE);

For row key selection, you need a filter. Take a look at
FuzzyRowFilter.java
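
For the "key starts with 'someid-'" part of the question, one option besides a filter is to express the prefix as a start/stop row range. The sketch below is plain Java and only computes the two byte arrays. `PrefixRange` is a hypothetical name, and the assumption is that the arrays would then be handed to the scan (for instance through the `Scan(byte[] startRow, byte[] stopRow)` constructor) together with the `setTimeRange` call shown above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of HBase.
class PrefixRange {
    // The prefix itself is the inclusive start row.
    static byte[] startRow(String prefix) {
        return prefix.getBytes(StandardCharsets.UTF_8);
    }

    // Exclusive stop row: the smallest key greater than every key that starts
    // with the prefix. Trailing 0xFF bytes are dropped, then the last remaining
    // byte is incremented. An empty result means "no upper bound".
    static byte[] stopRow(byte[] prefix) {
        int end = prefix.length;
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }
}
```

For the prefix `someid-` the stop row comes out as `someid.`, since '.' is the byte right after '-'.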

Cheers

On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

Thanks for the reply! I thus have two more questions:

1) Is it true that filtering on timestamps doesn't affect performance?
2) Could you send me a little snippet of how you would do such a filter
(by row key + timestamps)? For example, get all rows whose key starts with
'someid-' and whose timestamp is greater than some timestamp?

Best,
Flavio


On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu <yuzhih...@gmail.com> wrote:

bq. Using timestamp in row-keys is discouraged

The above is true.
Prefixing the row key with a timestamp would create a hot region.

bq. should I filter by a simpler row-key plus a filter on timestamp?

You can do the above.

On Tue, Jul 2, 2013 at 9:13 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

Hi everybody,

in my use case I have to perform batch analysis skipping old data.
For example, I want to process all rows created after a certain timestamp,
passed as a parameter.

What is the most effective way to do this?
Should I design my row-key to embed the timestamp?
Or is just filtering by the timestamp of the row fast as well? Or what else?
Initially I was thinking to compose my key as:
timestamp|source|title|type

but:

1) Using timestamp in row-keys is discouraged
2) If this design is ok, using this approach I still have problems
filtering by timestamp because I cannot find a way to filter numerically
(instead of alphanumerically/by string). Example:
1372776400441|something has a timestamp lower than
1372778470913|somethingelse, but I cannot filter all rows whose key is
"numerically" greater than 1372776400441. Is it possible to overcome
this issue?
3) If this design is not ok, should I filter by a simpler row-key plus a
filter on timestamp? Or what else?
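
On point 2 above: decimal strings only sort numerically when they all have the same number of digits. One way around this, sketched below in plain Java, is to store the timestamp as a fixed 8-byte big-endian long, so that the unsigned byte-wise comparison used for row keys agrees with numeric order for non-negative values. `TimestampKey` is a hypothetical name used only for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not part of HBase.
class TimestampKey {
    // Fixed-width, big-endian encoding of a non-negative timestamp.
    static byte[] encode(long ts) {
        return ByteBuffer.allocate(8).putLong(ts).array();
    }

    // Unsigned lexicographic comparison, the way row keys are ordered.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }
}
```

As strings, "999" sorts after "1000". Encoded this way, 999 sorts before 1000, and 1372776400441 sorts before 1372778470913.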

Best,
Flavio

