The two-argument Bytes.add() delegates to the three-argument version:
return add(a, b, HConstants.EMPTY_BYTE_ARRAY);
where a new byte array is allocated:
byte [] result = new byte[a.length + b.length + c.length];
This means your code below would allocate two intermediate byte arrays.
Consider writing a method that accepts four byte [] parameters instead.
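A minimal sketch of such a helper in plain Java (a single allocation; this is not part of the Bytes API, just the idea):

```java
import java.util.Arrays;

public class ByteArrayUtil {
    // Concatenate four byte arrays with one allocation, instead of
    // chaining Bytes.add() calls (each of which allocates a new array).
    public static byte[] add(byte[] a, byte[] b, byte[] c, byte[] d) {
        byte[] result = new byte[a.length + b.length + c.length + d.length];
        int off = 0;
        System.arraycopy(a, 0, result, off, a.length); off += a.length;
        System.arraycopy(b, 0, result, off, b.length); off += b.length;
        System.arraycopy(c, 0, result, off, c.length); off += c.length;
        System.arraycopy(d, 0, result, off, d.length);
        return result;
    }

    public static void main(String[] args) {
        byte[] key = add(new byte[]{1}, new byte[]{2, 3},
                         new byte[]{4}, new byte[]{5, 6});
        System.out.println(Arrays.toString(key)); // [1, 2, 3, 4, 5, 6]
    }
}
```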
Cheers
On Wed, Jul 3, 2013 at 3:20 AM, Flavio Pompermaier <[email protected]> wrote:
> All my enums produce positive integers, so I don't have the +/-ve integer
> problem. Obviously, if I used fixed-length row keys I could drop the
> separator.
>
> Sorry, but I'm very much a newbie in this field. I'm trying to understand
> how to compose my key from the byte arrays. Is the following correct?
>
> final byte[] firstToken = Bytes.toBytes(source);
> final byte[] secondToken = Bytes.toBytes(type);
> final byte[] thirdToken = Bytes.toBytes(qualifier);
> final byte[] fourthToken = Bytes.toBytes(md5ofSomeString);
> byte[] rowKey = Bytes.add(firstToken,secondToken,thirdToken);
> rowKey = Bytes.add(rowKey,fourthToken);
>
> Best,
> Flavio
>
>
> On Wed, Jul 3, 2013 at 11:58 AM, Anoop John <[email protected]> wrote:
>
> > When you make the RK and convert the int parts into byte[] (use
> > org.apache.hadoop.hbase.util.Bytes#toBytes(int)), it will give 4 bytes
> > for every int. Be careful about the ordering: when you convert a +ve
> > and a -ve integer into byte[] and do a lexicographical compare (as done
> > in HBase), you will see the -ve number sorting greater than the +ve one.
> > If you don't have to deal with -ve numbers, no issues :)
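> > A small pure-Java demo of this ordering issue (the compare method here
> > is a stand-in for HBase's lexicographic comparator, not the real one):

```java
import java.nio.ByteBuffer;

public class SignedOrderDemo {
    // Big-endian 4-byte encoding, like Bytes.toBytes(int).
    static byte[] toBytes(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    // Unsigned lexicographic compare, the way HBase orders row keys.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // -1 encodes as FF FF FF FF, so it sorts AFTER +1 (00 00 00 01).
        System.out.println(compare(toBytes(-1), toBytes(1)) > 0); // true
    }
}
```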
> >
> > Also, when all the parts of the RK are of fixed width, do you need any
> > separator at all?
> >
> > -Anoop-
> >
> > On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier <[email protected]> wrote:
> >
> > > Yeah, I was thinking of using a normalization step in order to allow
> > > the use of FuzzyRowFilter, but what is not clear to me is whether the
> > > integers must also be normalized.
> > > I will explain myself better. Suppose that I follow your advice and
> > > produce keys like:
> > > - 1|1|somehash|sometimestamp
> > > - 55|555|somehash|sometimestamp
> > >
> > > Would they match the same pattern, or do I have to normalize them to
> > > the following?
> > > - 001|001|somehash|sometimestamp
> > > - 055|555|somehash|sometimestamp
> > >
> > > Moreover, I noticed that you used dots ('.') to separate things instead
> > > of pipes ('|'). Is there a reason for that (maybe performance or
> > > whatever), or is it just your favourite separator?
> > >
> > > Best,
> > > Flavio
> > >
> > >
> > > On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <[email protected]> wrote:
> > >
> > > > I'm not sure if you're eliding this fact or not, but you'd be much
> > > > better off if you used a fixed-width format for your keys. So in your
> > > > example, you'd have:
> > > >
> > > > PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
> > > > hash.8-byte timestamp
> > > >
> > > > Example: \x00\x00\x00\x01\x00\x00\x02\x03....
> > > >
> > > > The advantage of this is not only that it's significantly less data
> > > > (remember your key is stored on each KeyValue), but also that you can
> > > > now use FuzzyRowFilter and other techniques to quickly perform scans.
> > > > The disadvantage is that you have to normalize the source -> integer
> > > > mapping, but I find I can either store that in an enum or cache it
> > > > for a long time, so it's not a big issue.
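> > > > As a rough illustration, such a fixed-width key could be assembled
> > > > like this (plain Java sketch; the field values are made up):

```java
import java.nio.ByteBuffer;

public class FixedWidthKey {
    // 4-byte source + 4-byte type + 16-byte hash + 8-byte timestamp = 32 bytes.
    static byte[] makeKey(int source, int type, byte[] hash128, long timestamp) {
        if (hash128.length != 16) {
            throw new IllegalArgumentException("need a 128-bit hash");
        }
        return ByteBuffer.allocate(32)
                .putInt(source)
                .putInt(type)
                .put(hash128)
                .putLong(timestamp)
                .array();
    }

    public static void main(String[] args) {
        byte[] key = makeKey(1, 515, new byte[16], 1372837702753L);
        System.out.println(key.length); // 32 -- every key has the same width
    }
}
```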
> > > >
> > > > -Mike
> > > >
> > > > On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier <[email protected]> wrote:
> > > > > Thank you very much for the great support!
> > > > > This is how I thought to design my key:
> > > > >
> > > > > PATTERN: source|type|qualifier|hash(name)|timestamp
> > > > > EXAMPLE:
> > > > >
> google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
> > > > >
> > > > > Do you think my key could be good for my scope (my search will be
> > > > > essentially by source or source|type)?
> > > > > Another point is that initially I will not have many sources, so I
> > > > > will probably have only google|*, but in later phases there could
> > > > > be more sources.
> > > > >
> > > > > Best,
> > > > > Flavio
> > > > >
> > > > > On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <[email protected]> wrote:
> > > > >
> > > > >> For #1, yes - the client receives less data after filtering.
> > > > >>
> > > > >> For #2, please take a look at TestMultiVersions
> > > > >> (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in
> > > 0.94)
> > > > >> for time range:
> > > > >>
> > > > >> scan = new Scan();
> > > > >>
> > > > >> scan.setTimeRange(1000L, Long.MAX_VALUE);
> > > > >> For row key selection, you need a filter. Take a look at
> > > > >> FuzzyRowFilter.java
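> > > > >> As a very rough sketch of the idea behind FuzzyRowFilter (this is
> > > > >> not the actual HBase class, just the mask-matching concept: 0 means
> > > > >> the byte at that position must match, 1 means it is a wildcard):

```java
public class FuzzyMatchSketch {
    // Pattern plus mask: mask[i] == 0 -> row[i] must equal pattern[i];
    // mask[i] == 1 -> any byte is accepted at position i.
    static boolean fuzzyMatch(byte[] row, byte[] pattern, byte[] mask) {
        if (row.length < pattern.length) return false;
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] row     = {1, 9, 3};
        byte[] pattern = {1, 0, 3};
        byte[] mask    = {0, 1, 0}; // middle byte is "don't care"
        System.out.println(fuzzyMatch(row, pattern, mask)); // true
    }
}
```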
> > > > >>
> > > > >> Cheers
> > > > >>
> > > > >> On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier <[email protected]> wrote:
> > > > >>
> > > > >> > Thanks for the reply! I thus have two more questions:
> > > > >> >
> > > > >> > 1) Is it true that filtering on timestamps doesn't affect
> > > > >> > performance?
> > > > >> > 2) Could you send me a little snippet of how you would do such a
> > > > >> > filter (by row key + timestamp)? For example, get all rows whose
> > > > >> > key starts with 'someid-' and whose timestamp is greater than
> > > > >> > some given timestamp?
> > > > >> >
> > > > >> > Best,
> > > > >> > Flavio
> > > > >> >
> > > > >> >
> > > > >> > On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu <[email protected]> wrote:
> > > > >> >
> > > > >> > > bq. Using timestamp in row-keys is discouraged
> > > > >> > >
> > > > >> > > The above is true.
> > > > >> > > Prefixing row key with timestamp would create hot region.
> > > > >> > >
> > > > >> > > bq. should I filter by a simpler row-key plus a filter on
> > > > >> > > timestamp?
> > > > >> > >
> > > > >> > > You can do the above.
> > > > >> > >
> > > > >> > > On Tue, Jul 2, 2013 at 9:13 AM, Flavio Pompermaier <[email protected]> wrote:
> > > > >> > >
> > > > >> > > > Hi to everybody,
> > > > >> > > >
> > > > >> > > > in my use case I have to perform batch analysis skipping
> > > > >> > > > old data. For example, I want to process all rows created
> > > > >> > > > after a certain timestamp, passed as a parameter.
> > > > >> > > >
> > > > >> > > > What is the most effective way to do this?
> > > > >> > > > Should I design my row-key to embed the timestamp?
> > > > >> > > > Or is just filtering by the timestamp of the row fast as
> > > > >> > > > well? Or what else?
> > > > >> > > >
> > > > >> > > > Initially I was thinking to compose my key as:
> > > > >> > > > timestamp|source|title|type
> > > > >> > > >
> > > > >> > > > but:
> > > > >> > > >
> > > > >> > > > 1) Using timestamp in row-keys is discouraged
> > > > >> > > > 2) If this design is OK, using this approach I still have
> > > > >> > > > problems filtering by timestamp, because I cannot find a way
> > > > >> > > > to filter numerically (instead of alphanumerically/by
> > > > >> > > > string). Example: 1372776400441|something has a timestamp
> > > > >> > > > less than 1372778470913|somethingelse, but I cannot filter
> > > > >> > > > all rows whose key is "numerically" greater than
> > > > >> > > > 1372776400441. Is it possible to overcome this issue?
> > > > >> > > > 3) If this design is not OK, should I filter by a simpler
> > > > >> > > > row-key plus a filter on timestamp? Or what else?
> > > > >> > > >
> > > > >> > > > Best,
> > > > >> > > > Flavio
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
>