Does it still make sense to follow the previous id generation we talked about? (for performance reasons instead of storing an entire string?)
<docId><byte1> = value1
<docId><byte2> = value2

instead of

<docId><author> = value1
<docId><status> = value2

etc?

On Mon, Jun 21, 2010 at 5:19 PM, N Kapshoo <[email protected]> wrote:
> Aha. That makes sense (both atomic writes and Filters).
>
> I am definitely only looking to filter within a given user, so looks
> like what you describe below might work for me.
>
> Thanks so much for all your help, Jonathan. You have saved me (at
> least) 2 weeks of tinkering and poking around!
>
> On Mon, Jun 21, 2010 at 5:10 PM, Jonathan Gray <[email protected]> wrote:
>> It would be inefficient to run that query against this schema, if you're
>> talking about finding all documents with a given author across all users.
>> In that case you'd want to use an additional table that had row keys as
>> authors.
>>
>> If you want to search for documents with a specific author within a given
>> user's documents (single row) then you could use filters, and as Andrey said,
>> it would be simpler if it was broken up into individual qualifiers but could
>> also be done with a custom filter to read the serialized value.
>>
>> To answer your question, you'd want a QualifierFilter that matched against
>> qualifiers of the form <anylong><author> and then a ValueFilter which
>> matched the value against the specific author you're looking for.
>>
>> JG
>>
>>> -----Original Message-----
>>> From: N Kapshoo [mailto:[email protected]]
>>> Sent: Monday, June 21, 2010 2:59 PM
>>> To: [email protected]
>>> Subject: Re: composite value vs composite qualifier
>>>
>>> I am not sure how to use filters in my case since I do not know the
>>> column name.
>>> Eg:
>>> DocInfo: 123213+author = "abc"
>>>
>>> 123213 is the docId. If I want to look for authors named 'abc' in all
>>> docs, how would I go about specifying a filter?
>>>
>>> Thanks.
>>>
>>> On Mon, Jun 21, 2010 at 4:20 PM, Andrey Stepachev <[email protected]> wrote:
>>> > 2010/6/22 N Kapshoo <[email protected]>
>>> >
>>> >> Is there any querying value in separating out values tied to each
>>> >> other vs. keeping them in a serialized object? I am guessing the
>>> >> second option would be much faster considering it is one composite
>>> >> value on the disk, but I would like to know if there are any specific
>>> >> advantages to doing things the other way. Thanks.
>>> >> The values themselves are very small, basic information in String.
>>> >>
>>> >> Eg:
>>> >>
>>> >> DocInfo: <docId><type> = value1
>>> >> DocInfo: <docId><priority> = value2
>>> >> DocInfo: <docId><etcetc> = value3
>>> >>
>>> >>
>>> >> Vs
>>> >>
>>> >> DocInfo: docId = value (JSON(type, priority, etcetc))
>>> >>
>>> >> Thank you.
>>> >>
>>> >
>>> > This mostly depends on the usage pattern.
>>> >
>>> > 1. Each value in storage carries the full key
>>> > (key/family/qualifier/timestamp), so the KeyValue size increases
>>> > (but this negative effect can be negated by using compression). So
>>> > the serialised form will be smaller, take less disk I/O, and can be
>>> > faster.
>>> >
>>> > 2. The second option gives you atomic updates (i.e. all data comes as
>>> > one "piece"), while with the first option you can have concurrent
>>> > updates of the fields (and of course individual history, as opposed
>>> > to a serialized object, which will have history for the whole
>>> > object).
>>> >
>>> > 3. In serialised form you can't use server-side filters (out of the
>>> > box; you would have to patch HBase to support custom filters which
>>> > deserialise the object or use JSONPath on its serialised form), but
>>> > with the first option you can.
>>> >
>>
>
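To make the <docId><byte> idea at the top of this message concrete, here is a minimal sketch in plain Python. It is not HBase client code. The field-to-byte mapping, the function names, and the in-memory dict standing in for a single user's row are all assumptions made for illustration. The last function mimics, on the client side, the qualifier-plus-value match Jonathan describes for finding documents by author within one row.

```python
import struct

# Hypothetical one-byte codes per field name (assumed, not from the thread).
FIELD_CODES = {"author": 1, "status": 2, "type": 3, "priority": 4}
CODE_TO_FIELD = {code: name for name, code in FIELD_CODES.items()}


def encode_qualifier(doc_id, field):
    """Build <docId><byte>: 8-byte big-endian long plus a 1-byte field code."""
    return struct.pack(">qB", doc_id, FIELD_CODES[field])


def decode_qualifier(qualifier):
    """Split a 9-byte qualifier back into (doc_id, field_name)."""
    doc_id, code = struct.unpack(">qB", qualifier)
    return doc_id, CODE_TO_FIELD[code]


def encode_qualifier_string(doc_id, field):
    """Build <docId><fieldName> for comparison: 8-byte long plus UTF-8 name."""
    return struct.pack(">q", doc_id) + field.encode("utf-8")


def docs_with_field_value(row, field, value):
    """Return sorted doc ids in one row whose `field` cell equals `value`.

    `row` is a dict of qualifier bytes -> value bytes, standing in for the
    cells of a single user's row. The suffix check plays the role of the
    qualifier match and the equality check plays the role of the value match.
    """
    code = FIELD_CODES[field]
    matches = []
    for qualifier, cell_value in row.items():
        if len(qualifier) == 9 and qualifier[8] == code and cell_value == value:
            matches.append(struct.unpack(">q", qualifier[:8])[0])
    return sorted(matches)


if __name__ == "__main__":
    row = {
        encode_qualifier(123213, "author"): b"abc",
        encode_qualifier(123213, "status"): b"open",
        encode_qualifier(555, "author"): b"xyz",
    }
    print(len(encode_qualifier(123213, "author")))         # byte-coded size
    print(len(encode_qualifier_string(123213, "author")))  # string-named size
    print(docs_with_field_value(row, "author", b"abc"))
```

The byte-coded qualifier is a fixed 9 bytes, while the string form grows with the field name, so the saving is per cell and depends on how long the names are. As Andrey notes in point 1 below, compression may already recover much of that difference, so it is probably worth measuring before giving up readable qualifiers.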
