How large are the KeyValues? Can you estimate how much data you're
materializing for this query? HBase's RPC implementation does not currently
support streaming, so the entire result set (all 4000 objects) will be held
in memory to service the request. This is a known issue (I'm lacking a JIRA
reference at the moment...). The way to mitigate this is to issue the queries
in smaller batches, or to use a scan with a limit on the batch size
(Scan#setBatch()).
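
Something along these lines (an untested sketch; the class name, batch size,
and processing step are placeholders, not your code, and it assumes the
0.94-era HTable client) is what I mean by batching the Gets:

// Sketch: split a large list of Gets into smaller batches so a single RPC
// never asks the region servers to materialize the entire result set.
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class BatchedGets {

  /** Issues the gets in chunks of batchSize rather than all at once. */
  public static void batchedGet(HTable table, List<Get> gets, int batchSize)
      throws IOException {
    for (int i = 0; i < gets.size(); i += batchSize) {
      List<Get> chunk = gets.subList(i, Math.min(i + batchSize, gets.size()));
      Result[] chunkResults = table.get(chunk);
      for (Result result : chunkResults) {
        // Consume each Result immediately so the client heap never holds
        // more than one batch at a time.
        process(result);
      }
    }
  }

  private static void process(Result result) {
    // application-specific handling goes here
  }
}

The same idea applies to scans: Scan#setBatch() together with a modest
Scan#setCaching() keeps any single RPC from carrying the whole result set.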

You might also look at the SkipScan implementation in Apache Phoenix. It
uses a Scan + Filter to get around this problem for these kinds of queries.
http://phoenix.apache.org/skip_scan.html
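
If you went that route, the query would look something like the following over
JDBC. This is illustration only: the table, schema, and ZooKeeper quorum are
made up, but the shape of the query (an IN clause on the leading primary-key
column) is what Phoenix can serve with a skip scan instead of thousands of
point Gets.

// Hypothetical Phoenix schema assumed here:
//   CREATE TABLE timeseries (series_id VARCHAR, ts BIGINT, val DOUBLE
//     CONSTRAINT pk PRIMARY KEY (series_id, ts))
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SkipScanExample {
  public static void main(String[] args) throws Exception {
    // Older Phoenix drivers may need explicit registration.
    Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
    try (Connection conn =
             DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
      try (PreparedStatement ps = conn.prepareStatement(
               "SELECT series_id, ts, val FROM timeseries "
             + "WHERE series_id IN (?, ?, ?)")) {  // the ~4000 keys would be bound here
        ps.setString(1, "key-0001");
        ps.setString(2, "key-0002");
        ps.setString(3, "key-0003");
        try (ResultSet rs = ps.executeQuery()) {
          while (rs.next()) {
            // Rows stream back through the scanner rather than one big RPC.
            System.out.println(rs.getString("series_id") + " " + rs.getLong("ts"));
          }
        }
      }
    }
  }
}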

On Wed, Feb 25, 2015 at 10:16 AM, Ted Tuttle <t...@mentacapital.com> wrote:

> Heaps are 16G w/ hfile.block.cache.size = 0.5
>
>
>
> Machines have 32G onboard and we used to run w/ 24G heaps but reduced them
> to lower GC times.
>
>
>
> Not so sure about which regions were hot.  And I don't want to repeat and
> take down my cluster again :)
>
>
>
> What I know:
>
>
>
> 1) The request was about 4000 gets.
>
> 2) The 4000 keys are likely contiguous and therefore probably represent
> entire regions
>
> 3) Once we batched the gets (so as not to kill the cluster) the result was
> >10G of data in the client. We blew the heap there :(
>
> 4) Our regions are 10G (hbase.hregion.max.filesize  = 10737418240)
>
>
>
> Distributing these keys via salting is not in our best interest as we
> typically do these types of timeseries queries (though only recently at
> this scale).
>
>
>
> I think I understand the failure mode; I guess I am just surprised that a
> greedy client can kill the cluster and that we are required to batch our
> gets in order to protect the cluster.
>
>
>
> *From:* Nick Dimiduk [mailto:ndimi...@gmail.com]
> *Sent:* Wednesday, February 25, 2015 9:40 AM
> *To:* hbase-user
> *Cc:* Ted Yu; Development
>
> *Subject:* Re: Table.get(List<Get>) overwhelms several RSs
>
>
>
> How large is your region server heap? What's your setting
> for hfile.block.cache.size? Can you identify which region is being burned
> up (i.e., is it META?)
>
>
>
> It is possible for a hot region to act as a "death pill" that roams around
> the cluster. We see this with the meta region with poorly-behaved clients.
>
>
>
> -n
>
>
>
> On Wed, Feb 25, 2015 at 8:38 AM, Ted Tuttle <t...@mentacapital.com> wrote:
>
> Hard to say how balanced the table is.
>
> We have a mixed requirement where we want some locality for timeseries
> queries against "clusters" of information.  However, the "clusters" in a
> table should be well distributed if the dataset is large enough.
>
> The query in question killed 5 RSs so I am inferring either:
>
> 1) the table was spread across these 5 RSs
> 2) the query moved around on the cluster as RSs failed
>
> Perhaps you could tell me if #2 is possible.
>
> We are running v0.94.9
>
> From: Ted Yu [mailto:yuzhih...@gmail.com]
> Sent: Wednesday, February 25, 2015 7:24 AM
> To: user@hbase.apache.org
> Cc: Development
> Subject: Re: Table.get(List<Get>) overwhelms several RSs
>
> Was the underlying table balanced (meaning its regions were spread evenly
> across region servers)?
>
> What release of HBase are you using?
>
> Cheers
>
> On Wed, Feb 25, 2015 at 7:08 AM, Ted Tuttle <t...@mentacapital.com> wrote:
> Hello-
>
> In the last week we had multiple times where we lost 5 of 8 RSs in the
> space of a few minutes because of slow GCs.
>
> We traced this back to a client calling Table.get(List<Get> gets) with a
> collection containing ~4000 individual gets.
>
> We've worked around this by limiting the number of Gets we send in a
> single call to Table.get(List<Get>).
>
> Is there some configuration parameter that we are missing here?
> Thanks,
> Ted
>
>
>
