We are actually already doing some filtering with a coprocessor during
major compactions, but we don't know in advance what is going to be
trimmed out. We only find out when an unfollow action happens.

Anyhow, this BulkDelete looks promising. I have never written coprocessor
endpoints before, so can you help me with a couple of questions:
1) I am running HBase 0.94.3. Do I need any configuration on the region
server side to take advantage of this, or can I simply follow the javadoc
(which is really informative) and use the endpoint from my client?
2) As I read it, this will run a single scan (with filters etc.) and place
delete markers for all the entries found during the scan. Is that right?
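
To check my understanding against the javadoc, here is roughly what I am
planning on the client side. The class and method names are what I see in
the endpoint's javadoc (HBASE-6942); the table name, row range, and column
family below are just placeholders for our schema:

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.coprocessor.example.BulkDeleteProtocol;
import org.apache.hadoop.hbase.coprocessor.example.BulkDeleteProtocol.DeleteType;
import org.apache.hadoop.hbase.coprocessor.example.BulkDeleteResponse;
import org.apache.hadoop.hbase.util.Bytes;

public class UnfollowBulkDelete {
  public static void main(String[] args) throws Throwable {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "following_feed");  // placeholder table

    // Scan selecting exactly the cells to trim: one row and one family
    // here, but filters/time ranges should work the same way.
    final Scan scan = new Scan(Bytes.toBytes("user123"));  // placeholder row
    scan.setStopRow(Bytes.toBytes("user123\0"));
    scan.addFamily(Bytes.toBytes("f"));

    // The endpoint runs once per region in [startRow, stopRow); each region
    // scans locally and batches delete markers for the KVs it finds.
    Map<byte[], BulkDeleteResponse> results = table.coprocessorExec(
        BulkDeleteProtocol.class, scan.getStartRow(), scan.getStopRow(),
        new Batch.Call<BulkDeleteProtocol, BulkDeleteResponse>() {
          public BulkDeleteResponse call(BulkDeleteProtocol instance)
              throws IOException {
            // COLUMN -> column delete markers; null TS; 500 rows per batch
            return instance.delete(scan, DeleteType.COLUMN, null, 500);
          }
        });

    long rowsDeleted = 0;
    for (BulkDeleteResponse response : results.values()) {
      rowsDeleted += response.getRowsDeleted();
    }
    System.out.println("Rows affected: " + rowsDeleted);
    table.close();
  }
}

Does that look about right, or is there a simpler way to drive it?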

Thanks
Varun

On Fri, Feb 8, 2013 at 10:45 PM, Varun Sharma <va...@pinterest.com> wrote:

> The use case is like your Twitter feed: tweets from people you follow. When
> someone unfollows, you need to delete a bunch of that user's tweets from the
> following feed. So it's frequent, and we are running into some extreme
> corner cases like the one above. We need high write throughput, since when
> someone tweets we need to fan the tweet out to all the followers. We need
> fast deletes (unfollow), fast adds (follow), and fast random gets when a
> real user loads the feed. I doubt we can change the schema much here since
> we need to support a bunch of use cases.
>
> @lars: It does not take 30 seconds to place 300 delete markers. It takes 30
> seconds to first find which of those 300 pins are among the columns actually
> present - this invokes 300 gets - and then place the appropriate delete
> markers. Note that we can have tens of thousands of columns in a single row,
> so a single get is not cheap.
>
> If we were to just place delete markers, that would be very fast. But when
> we started doing that, our random read performance suffered because of too
> many delete markers. The 90th percentile on random reads shot up from 40
> milliseconds to 150 milliseconds, which is not acceptable for our use case.
>
> Thanks
> Varun
>
>
> On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <la...@apache.org> wrote:
>
>> Can you organize your columns and then delete by column family?
>>
>> deleteColumn without specifying a TS is expensive, since HBase first has
>> to figure out what the latest TS is.
>>
>> Should be better in 0.94.1 or later since deletes are batched like Puts
>> (still need to retrieve the latest version, though).
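>>
>> For completeness: if you already know the timestamps you can pass them
>> explicitly and avoid the read entirely. Roughly (the family/qualifier/ts
>> variables are placeholders):
>>
>> Delete d = new Delete(row);
>> d.deleteColumn(family, qualifier, ts);  // exact version, no lookup needed
>> d.deleteColumns(family, qualifier);     // all versions, also no lookup
>> table.delete(d);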
>>
>> In 0.94.3 or later you can also use the BulkDeleteEndpoint, which basically
>> lets you specify a scan condition and then places a delete marker of the
>> requested type for all KVs encountered.
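>>
>> The endpoint has to be registered on the region servers first; something
>> like this in hbase-site.xml should do it (package name as in the example
>> code, worth double checking against your build):
>>
>> <property>
>>   <name>hbase.coprocessor.region.classes</name>
>>   <value>org.apache.hadoop.hbase.coprocessor.example.BulkDeleteEndpoint</value>
>> </property>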
>>
>>
>> If you wanted to get really fancy, you could hook up a coprocessor to the
>> compaction process and simply filter out all KVs you no longer want
>> (without ever placing any delete markers).
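>>
>> A rough, untested sketch of that against the 0.94 coprocessor API
>> (isUnwanted() is a placeholder for your own predicate):
>>
>> import java.io.IOException;
>> import java.util.Iterator;
>> import java.util.List;
>> import org.apache.hadoop.hbase.KeyValue;
>> import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
>> import org.apache.hadoop.hbase.coprocessor.ObserverContext;
>> import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
>> import org.apache.hadoop.hbase.regionserver.InternalScanner;
>> import org.apache.hadoop.hbase.regionserver.Store;
>>
>> public class TrimOnCompact extends BaseRegionObserver {
>>   @Override
>>   public InternalScanner preCompact(
>>       ObserverContext<RegionCoprocessorEnvironment> e, Store store,
>>       final InternalScanner scanner) {
>>     // Wrap the compaction scanner and drop unwanted KVs so they never
>>     // make it into the rewritten store file.
>>     return new InternalScanner() {
>>       public boolean next(List<KeyValue> results) throws IOException {
>>         boolean more = scanner.next(results);
>>         Iterator<KeyValue> it = results.iterator();
>>         while (it.hasNext()) {
>>           if (isUnwanted(it.next())) it.remove();
>>         }
>>         return more;
>>       }
>>       // The remaining overloads just delegate in this sketch.
>>       public boolean next(List<KeyValue> results, int limit)
>>           throws IOException {
>>         return next(results);
>>       }
>>       public boolean next(List<KeyValue> results, String metric)
>>           throws IOException {
>>         return next(results);
>>       }
>>       public boolean next(List<KeyValue> results, int limit, String metric)
>>           throws IOException {
>>         return next(results);
>>       }
>>       public void close() throws IOException { scanner.close(); }
>>     };
>>   }
>>
>>   // Placeholder: decide whether this KV belongs to an unfollowed user.
>>   private boolean isUnwanted(KeyValue kv) {
>>     return false;
>>   }
>> }
>>
>> The caveat is that the unwanted KVs stay visible until a compaction
>> actually runs, since no delete markers are ever placed.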
>>
>>
>> Are you saying it takes 15 seconds to place 300 version delete markers?!
>>
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: Varun Sharma <va...@pinterest.com>
>> To: user@hbase.apache.org
>> Sent: Friday, February 8, 2013 10:05 PM
>> Subject: Re: Get on a row with multiple columns
>>
>> We are given a set of 300 columns to delete. I tested two cases:
>>
>> 1) deleteColumns() - with the 's'
>>
>> This function simply adds delete markers for the 300 columns; in our case,
>> typically only a fraction of these columns - around 10 - are actually
>> present. After starting to use deleteColumns(), we began seeing a drop in
>> cluster-wide random read performance - 90th percentile latency worsened,
>> and so did the 99th - probably because of having to traverse delete
>> markers. I attribute this to a profusion of delete markers in the cluster.
>> Major compactions slowed down by almost 50 percent, probably because of
>> having to clean out significantly more delete markers.
>>
>> 2) deleteColumn()
>>
>> Ended up with intolerable 15-second calls, which clogged all the handlers,
>> making the cluster pretty much unresponsive.
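>>
>> Concretely, the Delete construction looks roughly like this (the family
>> and qualifier variables are placeholders for our schema):
>>
>> Delete d = new Delete(row);
>> for (byte[] q : qualifiersToDelete) {  // the 300 qualifiers we are given
>>   d.deleteColumns(family, q);          // case 1; case 2 used deleteColumn()
>> }
>> table.delete(d);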
>>
>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> > For the 300 column deletes, can you show us how the Delete(s) are
>> > constructed?
>> >
>> > Do you use this method?
>> >
>> >   public Delete deleteColumns(byte [] family, byte [] qualifier)
>> >
>> > Thanks
>> >
>> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com> wrote:
>> >
>> > > So a Get call with multiple columns on a single row should be much
>> > > faster than independent Get(s) on each of those columns for that row. I
>> > > am seeing severely poor performance (~15 seconds) for certain
>> > > deleteColumn() calls, and there is a prepareDeleteTimestamps() function
>> > > in HRegion.java which first tries to locate each column by doing an
>> > > individual get for every column you want to delete (I am doing 300
>> > > column deletes). I think this should ideally be one get call with the
>> > > batch of 300 columns, so that a single scan can retrieve the columns,
>> > > and the columns that are found are then deleted.
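>> > >
>> > > Client-side, the batched version I have in mind looks roughly like this
>> > > (names are placeholders) - one Get for all 300 columns, then precise
>> > > markers for the ones actually found:
>> > >
>> > > Get get = new Get(row);
>> > > for (byte[] q : qualifiers) {     // all 300 columns in one Get
>> > >   get.addColumn(family, q);
>> > > }
>> > > Result result = table.get(get);   // one seek/scan pass per store
>> > > if (!result.isEmpty()) {
>> > >   Delete delete = new Delete(row);
>> > >   for (KeyValue kv : result.raw()) {  // only the ~10 that exist
>> > >     delete.deleteColumn(family, kv.getQualifier(), kv.getTimestamp());
>> > >   }
>> > >   table.delete(delete);
>> > > }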
>> > >
>> > > Before I try this fix, I wanted to get an opinion on whether batching
>> > > the get() would make a difference, and from your answer it seems it
>> > > should.
>> > >
>> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org> wrote:
>> > >
>> > > > Everything is stored as a KeyValue in HBase.
>> > > > The Key part of a KeyValue contains the row key, column family,
>> > > > column name, and timestamp, in that order.
>> > > > Each column family has its own store and store files.
>> > > >
>> > > > So in a nutshell, a get is executed by starting a scan at the row key
>> > > > (which is a prefix of the key) in each store (CF) and then scanning
>> > > > forward in each store until the next row key is reached. (In reality
>> > > > it is a bit more complicated due to multiple versions, skipping
>> > > > columns, etc.)
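>> > > >
>> > > > Schematically, the sorted keys in a store look like this (timestamps
>> > > > sort newest first):
>> > > >
>> > > > row1/cf:colA/ts=9
>> > > > row1/cf:colB/ts=7
>> > > > row1/cf:colB/ts=3   <- older version of the same cell
>> > > > row2/cf:colA/ts=5   <- next row key; a get for row1 stops here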
>> > > >
>> > > >
>> > > > -- Lars
>> > > > ________________________________
>> > > > From: Varun Sharma <va...@pinterest.com>
>> > > > To: user@hbase.apache.org
>> > > > Sent: Friday, February 8, 2013 9:22 PM
>> > > > Subject: Re: Get on a row with multiple columns
>> > > >
>> > > > Sorry, I was a little unclear with my question.
>> > > >
>> > > > Let's say you have:
>> > > >
>> > > > Get get = new Get(row);
>> > > > get.addColumn(family, Bytes.toBytes("1"));
>> > > > get.addColumn(family, Bytes.toBytes("2"));
>> > > > ...
>> > > >
>> > > > When HBase executes the batch get internally, it will seek to column
>> > > > "1". Since the data is lexicographically sorted, it does not need to
>> > > > seek from the beginning to get to "2"; it can keep seeking forward,
>> > > > since column "2" will always come after column "1". I want to know
>> > > > whether this is how a multi-column get on a row works.
>> > > >
>> > > > Thanks
>> > > > Varun
>> > > >
>> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <mlor...@uci.cu> wrote:
>> > > >
>> > > > > Like Ishan said, a get gives you an instance of the Result class.
>> > > > > The utility methods that you can use are:
>> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
>> > > > >  byte[] value()
>> > > > >  byte[] getRow()
>> > > > >  int size()
>> > > > >  boolean isEmpty()
>> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
>> > > > >  List<KeyValue> list()
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
>> > > > >
>> > > > >> Based on what I read in Lars' book, a get will return a Result,
>> > > > >> which is internally a KeyValue[]. This KeyValue[] is sorted by key,
>> > > > >> and you access the array using the raw() or list() methods on the
>> > > > >> Result object.
>> > > > >>
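>> > > > >> For example (family/qual are placeholders):
>> > > > >>
>> > > > >> KeyValue[] kvs = result.raw();             // already sorted by key
>> > > > >> byte[] v = result.getValue(family, qual);  // lookup in sorted array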
>> > > > >>
>> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <va...@pinterest.com> wrote:
>> > > > >>
>> > > > >>  +user
>> > > > >>>
>> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <va...@pinterest.com> wrote:
>> > > > >>>
>> > > > >>>> Hi,
>> > > > >>>>
>> > > > >>>> When I do a Get on a row with multiple column qualifiers, do we
>> > > > >>>> sort the column qualifiers and make use of the sorted order when
>> > > > >>>> we get the results?
>> > > > >>>>
>> > > > >>>> Thanks
>> > > > >>>> Varun
>> > > > >>>>
>> > > > >>
>> > > > >>
>> > > > > --
>> > > > > Marcos Ortiz Valmaseda,
>> > > > > Product Manager && Data Scientist at UCI
>> > > > > Blog: http://marcosluis2186.posterous.com
>> > > > > Twitter: @marcosluis2186
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
