All my enums produce positive integers, so I don't have the positive/negative integer ordering problem. And obviously, if I use fixed-length row keys I can drop the separator.
Sorry, but I'm a newbie in this field and I'm trying to understand how to compose my key from bytes. Is the following correct?

    final byte[] firstToken = Bytes.toBytes(source);
    final byte[] secondToken = Bytes.toBytes(type);
    final byte[] thirdToken = Bytes.toBytes(qualifier);
    final byte[] fourthToken = Bytes.toBytes(md5ofSomeString);
    byte[] rowKey = Bytes.add(firstToken, secondToken, thirdToken);
    rowKey = Bytes.add(rowKey, fourthToken);

Best,
Flavio

On Wed, Jul 3, 2013 at 11:58 AM, Anoop John wrote:

> When you make the RK and convert the int parts into byte[] (use
> org.apache.hadoop.hbase.util.Bytes#toBytes(int)) it will give 4 bytes
> for every int. Be careful about the ordering: when you convert a +ve
> and a -ve integer into byte[] and do a lexicographical compare (as done
> in HBase), you will see the -ve number being greater than the +ve one.
> If you don't have to deal with -ve numbers, no issues :)
>
> Well, when all the parts of the RK are of fixed width, will you need any
> separator?
>
> -Anoop-
>
> On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier wrote:
>
> > Yeah, I was thinking of using a normalization step in order to allow
> > the use of FuzzyRowFilter, but what is not clear to me is whether
> > integers must also be normalized or not.
> > Let me explain better. Suppose that I follow your advice and
> > produce keys like:
> > - 1|1|somehash|sometimestamp
> > - 55|555|somehash|sometimestamp
> >
> > Would they match the same pattern, or do I have to normalize them to
> > the following?
> > - 001|001|somehash|sometimestamp
> > - 055|555|somehash|sometimestamp
> >
> > Moreover, I noticed that you used dots ('.') to separate things
> > instead of pipes ('|'). Is there a reason for that (maybe performance
> > or whatever), or is it just your favourite separator?
> >
> > Best,
> > Flavio
> >
> > On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak wrote:
> >
> > > I'm not sure if you're eliding this fact or not, but you'd be much
> > > better off if you used a fixed-width format for your keys. So in your
> > > example, you'd have:
> > >
> > > PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
> > > hash.8-byte timestamp
> > >
> > > Example: \x00\x00\x00\x01\x00\x00\x02\x03....
> > >
> > > The advantage of this is not only that it's significantly less data
> > > (remember your key is stored on each KeyValue), but also you can now
> > > use FuzzyRowFilter and other techniques to quickly perform scans. The
> > > disadvantage is that you have to normalize the source -> integer, but
> > > I find I can either store that in an enum or cache it for a long time,
> > > so it's not a big issue.
> > >
> > > -Mike
> > >
> > > On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier wrote:
> > >
> > > > Thank you very much for the great support!
> > > > This is how I thought to design my key:
> > > >
> > > > PATTERN: source|type|qualifier|hash(name)|timestamp
> > > > EXAMPLE:
> > > > google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
> > > >
> > > > Do you think my key could be good for my scope (my search will be
> > > > essentially by source or source|type)?
> > > > Another point is that initially I will not have so many sources, so
> > > > I will probably have only google|*, but in the next phases there
> > > > could be more sources.
> > > >
> > > > Best,
> > > > Flavio
> > > >
> > > > On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu wrote:
> > > >
> > > > > For #1, yes - the client receives less data after filtering.
> > > > >
> > > > > For #2, please take a look at TestMultiVersions
> > > > > (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in
> > > > > 0.94) for time range:
> > > > >
> > > > >     scan = new Scan();
> > > > >     scan.setTimeRange(1000L, Long.MAX_VALUE);
> > > > >
> > > > > For row key selection, you need a filter. Take a look at
> > > > > FuzzyRowFilter.java
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier wrote:
> > > > >
> > > > > > Thanks for the reply! I thus have two more questions:
> > > > > >
> > > > > > 1) Is it true that filtering on timestamps doesn't affect
> > > > > > performance?
> > > > > > 2) Could you send me a little snippet of how you would do such a
> > > > > > filter (by row key + timestamps)? For example, get all rows whose
> > > > > > key starts with 'someid-' and whose timestamp is greater than
> > > > > > some timestamp?
> > > > > >
> > > > > > Best,
> > > > > > Flavio
> > > > > >
> > > > > > On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu wrote:
> > > > > >
> > > > > > > bq. Using timestamp in row-keys is discouraged
> > > > > > >
> > > > > > > The above is true.
> > > > > > > Prefixing the row key with a timestamp would create a hot region.
> > > > > > >
> > > > > > > bq. should I filter by a simpler row-key plus a filter on
> > > > > > > timestamp?
> > > > > > >
> > > > > > > You can do the above.
> > > > > > >
> > > > > > > On Tue, Jul 2, 2013 at 9:13 AM, Flavio Pompermaier wrote:
> > > > > > >
> > > > > > > > Hi to everybody,
> > > > > > > >
> > > > > > > > in my use case I have to perform batch analysis skipping old
> > > > > > > > data. For example, I want to process all rows created after a
> > > > > > > > certain timestamp, passed as a parameter.
> > > > > > > >
> > > > > > > > What is the most effective way to do this?
> > > > > > > > Should I design my row-key to embed the timestamp?
> > > > > > > > Or is just filtering by the timestamp of the row fast as
> > > > > > > > well? Or what else?
> > > > > > > >
> > > > > > > > Initially I was thinking to compose my key as:
> > > > > > > > timestamp|source|title|type
> > > > > > > >
> > > > > > > > but:
> > > > > > > >
> > > > > > > > 1) Using timestamp in row-keys is discouraged
> > > > > > > > 2) If this design is ok, using this approach I still have
> > > > > > > > problems filtering by timestamp because I cannot find a way
> > > > > > > > to filter numerically (instead of alphanumerically/by
> > > > > > > > string). Example: 1372776400441|something has a timestamp
> > > > > > > > less than 1372778470913|somethingelse, but I cannot filter
> > > > > > > > all rows whose key is "numerically" greater than
> > > > > > > > 1372776400441. Is it possible to overcome this issue?
> > > > > > > > 3) If this design is not ok, should I filter by a simpler
> > > > > > > > row-key plus a filter on timestamp? Or what else?
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Flavio
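
To make the fixed-width layout discussed in this thread concrete, here is a minimal sketch in plain Java. It uses java.nio.ByteBuffer instead of HBase's Bytes class so that it runs without HBase on the classpath. The class name RowKeySketch and all of its helper methods are hypothetical and not part of any HBase API, and the big-endian int encoding is assumed to match what Bytes.toBytes(int) produces. It illustrates two points from the thread: fixed-width binary ints sort correctly without zero-padding or separators, and a negative int sorts after a positive one under an unsigned byte-by-byte comparison.

```java
import java.nio.ByteBuffer;

// Hypothetical helper class, for illustration only (not an HBase API).
final class RowKeySketch {

    // Big-endian 4-byte encoding of an int (ByteBuffer's default byte order).
    static byte[] intToBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Layout from the thread: source (4 bytes) + type (4 bytes)
    // + 128-bit hash (16 bytes) + timestamp (8 bytes) = 32 bytes, no separators.
    static byte[] compose(int source, int type, byte[] hash16, long timestamp) {
        if (hash16.length != 16) {
            throw new IllegalArgumentException("expected a 16-byte hash");
        }
        return ByteBuffer.allocate(32)
                .putInt(source)
                .putInt(type)
                .put(hash16)
                .putLong(timestamp)
                .array();
    }

    // Key prefix for a "search by source|type" scan.
    static byte[] prefix(int source, int type) {
        return ByteBuffer.allocate(8).putInt(source).putInt(type).array();
    }

    // Unsigned lexicographic comparison, the ordering the thread says
    // HBase applies to row keys.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    static boolean startsWith(byte[] key, byte[] prefix) {
        if (key.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (key[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] hash = new byte[16]; // placeholder hash, all zeros
        byte[] k1 = compose(1, 1, hash, 1372837702753L);
        byte[] k55 = compose(55, 555, hash, 1372837702753L);

        System.out.println("key length: " + k1.length);
        System.out.println("source 1 sorts before source 55: "
                + (compareUnsigned(k1, k55) < 0));
        System.out.println("-1 sorts after +1: "
                + (compareUnsigned(intToBytes(-1), intToBytes(1)) > 0));
        System.out.println("k55 matches prefix (55, 555): "
                + startsWith(k55, prefix(55, 555)));
    }
}
```

With this kind of layout the 1 vs 001 padding question only arises for string keys; in the binary form every int already occupies exactly 4 bytes.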
