Re: Get on a row with multiple columns

lars hofhansl Sat, 09 Feb 2013 00:18:27 -0800

Should be set to true. If tcpnodelay is set to true, Nagle's algorithm is disabled.
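
For illustration only: at the socket level, "tcpnodelay = true" corresponds to the TCP_NODELAY option, which turns Nagle's algorithm off. A minimal sketch using the standard java.net.Socket API follows. It is not HBase code, and the class and method names (NoDelayDemo, applyNoDelay) are made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;

public class NoDelayDemo {
    // Requests the given TCP_NODELAY setting on a fresh, unconnected socket
    // and returns the value the socket reports back.
    // true  = Nagle's algorithm disabled (small writes are sent immediately)
    // false = Nagle's algorithm enabled (small writes may be coalesced)
    public static boolean applyNoDelay(boolean noDelay) {
        try (Socket socket = new Socket()) {
            socket.setTcpNoDelay(noDelay);
            return socket.getTcpNoDelay();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("requested true,  socket reports: " + applyNoDelay(true));
        System.out.println("requested false, socket reports: " + applyNoDelay(false));
    }
}
```

In HBase the same switch is reached through the tcpnodelay properties discussed in this thread rather than by touching sockets directly.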

-- Lars




________________________________
 From: Varun Sharma <va...@pinterest.com>
To: user@hbase.apache.org; lars hofhansl <la...@apache.org> 
Sent: Saturday, February 9, 2013 12:11 AM
Subject: Re: Get on a row with multiple columns
 

Okay I did my research - these need to be set to false. I agree.


On Sat, Feb 9, 2013 at 12:05 AM, Varun Sharma <va...@pinterest.com> wrote:

I have ipc.client.tcpnodelay, ipc.server.tcpnodelay set to false and the hbase 
one - [hbase].ipc.client.tcpnodelay set to true. Do these induce network 
latency ?
>
>
>On Fri, Feb 8, 2013 at 11:57 PM, lars hofhansl <la...@apache.org> wrote:
>
>Sorry.. I meant set these two config parameters to true (not false as I state 
>below).
>>
>>
>>
>>
>>----- Original Message -----
>>From: lars hofhansl <la...@apache.org>
>>To: "user@hbase.apache.org" <user@hbase.apache.org>
>>Cc:
>>Sent: Friday, February 8, 2013 11:41 PM
>>Subject: Re: Get on a row with multiple columns
>>
>>Only somewhat related. Seeing the magic 40ms random read time there. Did you 
>>disable Nagle's?
>>(set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to false in 
>>hbase-site.xml).
>>
>>________________________________
>>From: Varun Sharma <va...@pinterest.com>
>>To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
>>Sent: Friday, February 8, 2013 10:45 PM
>>Subject: Re: Get on a row with multiple columns
>>
>>The use case is like your Twitter feed: tweets from people you follow. When
>>someone unfollows, you need to delete a bunch of his tweets from the
>>following feed. So it's frequent, and we are essentially running into some
>>extreme corner cases like the one above. We need high write throughput for
>>this, since when someone tweets, we need to fan out the tweet to all the
>>followers. We need the ability to do fast deletes (unfollow) and fast adds
>>(follow) and also be able to do fast random gets - when a real user loads
>>the feed. I doubt we will be able to play much with the schema here since we
>>need to support a bunch of use cases.
>>
>>@lars: It does not take 30 seconds to place 300 delete markers. It takes 30
>>seconds to first find which of those 300 pins are in the set of columns
>>present (this invokes 300 gets) and then place the appropriate delete
>>markers. Note that we can have tens of thousands of columns in a single row,
>>so a single get is not cheap.
>>
>>If we were to just place delete markers, that is very fast. But when we
>>started doing that, our random read performance suffered because of too
>>many delete markers. The 90th percentile on random reads shot up from 40
>>milliseconds to 150 milliseconds, which is not acceptable for our use case.
>>
>>Thanks
>>Varun
>>
>>On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <la...@apache.org> wrote:
>>
>>> Can you organize your columns and then delete by column family?
>>>
>>> deleteColumn without specifying a TS is expensive, since HBase first has
>>> to figure out what the latest TS is.
>>>
>>> Should be better in 0.94.1 or later since deletes are batched like Puts
>>> (still need to retrieve the latest version, though).
>>>
>>> In 0.94.3 or later you can also use the BulkDeleteEndpoint, which basically
>>> lets you specify a scan condition and then places specific delete markers
>>> for all KVs encountered.
>>>
>>>
>>> If you wanted to get really fancy, you could hook up a coprocessor to the
>>> compaction process and simply filter all KVs you no longer want (without
>>> ever placing any delete markers).
>>>
>>>
>>> Are you saying it takes 15 seconds to place 300 version delete markers?!
>>>
>>>
>>> -- Lars
>>>
>>>
>>>
>>> ________________________________
>>>  From: Varun Sharma <va...@pinterest.com>
>>> To: user@hbase.apache.org
>>> Sent: Friday, February 8, 2013 10:05 PM
>>> Subject: Re: Get on a row with multiple columns
>>>
>>> We are given a set of 300 columns to delete. I tested two cases:
>>>
>>> 1) deleteColumns() - with the 's'
>>>
>>> This function simply adds delete markers for all 300 columns. In our case,
>>> typically only a fraction of these columns are actually present - 10. After
>>> starting to use deleteColumns, we started seeing a drop in cluster-wide
>>> random read performance - 90th percentile latency worsened, and so did the
>>> 99th, probably because of having to traverse delete markers. I attribute
>>> this to the profusion of delete markers in the cluster. Major compactions
>>> slowed down by almost 50 percent, probably because of having to clean out
>>> significantly more delete markers.
>>>
>>> 2) deleteColumn()
>>>
>>> Ended up with intolerable 15 second calls, which clogged all the handlers,
>>> making the cluster pretty much unresponsive.
>>>
>>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>> > For the 300 column deletes, can you show us how the Delete(s) are
>>> > constructed ?
>>> >
>>> > Do you use this method ?
>>> >
>>> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
>>> >
>>> > Thanks
>>> >
>>> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com>
>>> wrote:
>>> >
>>> > > So a Get call with multiple columns on a single row should be much
>>> > > faster than independent Get(s) on each of those columns for that row.
>>> > > I am basically seeing severely poor performance (~ 15 seconds) for
>>> > > certain deleteColumn() calls, and I see that there is a
>>> > > prepareDeleteTimestamps() function in HRegion.java which first tries to
>>> > > locate the column by doing individual gets on each column you want to
>>> > > delete (I am doing 300 column deletes). Now, I think this should ideally
>>> > > be 1 get call with the batch of 300 columns, so that one scan can
>>> > > retrieve the columns, and the columns that are found are then deleted.
>>> > >
>>> > > Before I try this fix, I wanted to get an opinion if it will make a
>>> > > difference to batch the get() and it seems from your answer, it should.
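
The batching idea in this message can be sketched outside HBase. Assuming the column qualifiers of a row are available in sorted order, one forward pass can find which of the requested qualifiers are present, instead of one lookup per qualifier. This is only a model of the proposal, not the code in HRegion.java; the class and method names (BatchedLookupSketch, findPresent) are hypothetical, and both collections are assumed to use natural String ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class BatchedLookupSketch {
    // Returns the requested qualifiers that exist in the row, using a single
    // forward pass over the row's sorted qualifiers (a merge of two sorted lists).
    public static List<String> findPresent(SortedSet<String> rowColumns,
                                           Collection<String> wanted) {
        List<String> present = new ArrayList<>();
        Iterator<String> stored = rowColumns.iterator();
        String current = stored.hasNext() ? stored.next() : null;
        // Sort the requested qualifiers so they can be matched in one pass.
        for (String qualifier : new TreeSet<>(wanted)) {
            // Skip stored columns that sort before the requested qualifier.
            while (current != null && current.compareTo(qualifier) < 0) {
                current = stored.hasNext() ? stored.next() : null;
            }
            if (current == null) {
                break; // ran off the end of the row; nothing further can match
            }
            if (current.equals(qualifier)) {
                present.add(qualifier);
            }
        }
        return present;
    }

    public static void main(String[] args) {
        SortedSet<String> row = new TreeSet<>(Arrays.asList("a", "c", "e", "g"));
        System.out.println(findPresent(row, Arrays.asList("g", "b", "c", "z")));
    }
}
```

Only the qualifiers returned by such a pass would then need delete markers, which is the point of batching the lookup.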
>>> > >
>>> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org>
>>> wrote:
>>> > >
>>> > > > Everything is stored as a KeyValue in HBase.
>>> > > > The Key part of a KeyValue contains the row key, column family,
>>> > > > column name, and timestamp, in that order.
>>> > > > Each column family has its own store and store files.
>>> > > >
>>> > > > So in a nutshell, a get is executed by starting a scan at the row key
>>> > > > (which is a prefix of the key) in each store (CF) and then scanning
>>> > > > forward in each store until the next row key is reached. (In reality
>>> > > > it is a bit more complicated due to multiple versions, skipping
>>> > > > columns, etc.)
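
The description above can be modeled with a sorted map. The sketch below is a deliberately simplified stand-in for a single store: it keeps only row key and column name in the sort key and ignores column families, timestamps, and versions. The names (SortedStoreSketch, getRow) are hypothetical and this is not how HBase is implemented internally; it only shows the "seek to the row prefix, then scan forward until the next row" pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SortedStoreSketch {
    // Separator that sorts before any printable character, so all cells of
    // row "r1" sort before any cell of row "r10".
    private static final char SEP = '\0';

    // Cells sorted by (row key, column name), as in one store of one CF.
    private final NavigableMap<String, String> cells = new TreeMap<>();

    public void put(String row, String qualifier, String value) {
        cells.put(row + SEP + qualifier, value);
    }

    // A "get": seek to the first key with the row prefix, then scan forward
    // until a key belonging to another row is reached.
    public List<String> getRow(String row) {
        String prefix = row + SEP;
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> cell : cells.tailMap(prefix, true).entrySet()) {
            if (!cell.getKey().startsWith(prefix)) {
                break; // next row key reached, stop scanning
            }
            values.add(cell.getValue());
        }
        return values;
    }

    public static void main(String[] args) {
        SortedStoreSketch store = new SortedStoreSketch();
        store.put("r1", "b", "vb");
        store.put("r2", "a", "x");
        store.put("r1", "a", "va");
        System.out.println(store.getRow("r1"));
    }
}
```

Because the cells of a row come back in column order, a reader never has to jump backwards to serve several columns of the same row.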
>>> > > >
>>> > > >
>>> > > > -- Lars
>>> > > > ________________________________
>>> > > > From: Varun Sharma <va...@pinterest.com>
>>> > > > To: user@hbase.apache.org
>>> > > > Sent: Friday, February 8, 2013 9:22 PM
>>> > > > Subject: Re: Get on a row with multiple columns
>>> > > >
>>> > > > Sorry, I was a little unclear with my question.
>>> > > >
>>> > > > Let's say you have
>>> > > >
>>> > > > Get get = new Get(row);
>>> > > > get.addColumn("1");
>>> > > > get.addColumn("2");
>>> > > > .
>>> > > > .
>>> > > > .
>>> > > >
>>> > > > When HBase internally executes the batch get, it will seek to column
>>> > > > "1". Since data is lexicographically sorted, it does not need to seek
>>> > > > from the beginning to get to "2"; it can continue seeking forward,
>>> > > > since column "2" will always be after column "1". I want to know
>>> > > > whether this is how a multicolumn get on a row works or not.
>>> > > >
>>> > > > Thanks
>>> > > > Varun
>>> > > >
>>> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <mlor...@uci.cu> wrote:
>>> > > >
>>> > > > > Like Ishan said, a get returns an instance of the Result class.
>>> > > > > The utility methods you can use are:
>>> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
>>> > > > >  byte[] value()
>>> > > > >  byte[] getRow()
>>> > > > >  int size()
>>> > > > >  boolean isEmpty()
>>> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
>>> > > > >  List<KeyValue> list()
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
>>> > > > >
>>> > > > >> Based on what I read in Lars' book, a get will return a Result,
>>> > > > >> which is internally a KeyValue[]. This KeyValue[] is sorted by the
>>> > > > >> key, and you access this array using the raw or list methods on the
>>> > > > >> Result object.
>>> > > > >>
>>> > > > >>
>>> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <va...@pinterest.com
>>> >
>>> > > > wrote:
>>> > > > >>
>>> > > > >>  +user
>>> > > > >>>
>>> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
>>> va...@pinterest.com>
>>> > > > >>> wrote:
>>> > > > >>>
>>> > > > >>>  Hi,
>>> > > > >>>>
>>> > > > >>>> When I do a Get on a row with multiple column qualifiers, do we
>>> > > > >>>> sort the column qualifiers and make use of the sorted order when
>>> > > > >>>> we get the results?
>>> > > > >>>
>>> > > > >>>> Thanks
>>> > > > >>>> Varun
>>> > > > >>>>
>>> > > > >>>>
>>> > > > >>
>>> > > > >>
>>> > > > > --
>>> > > > > Marcos Ortiz Valmaseda,
>>> > > > > Product Manager && Data Scientist at UCI
>>> > > > > Blog: http://marcosluis2186.posterous.com
>>> > > > > Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>
