Thanks Satish,

To clarify: I’m not looking up single rows. I’m looking up the history of each 
widget, which returns hundreds-to-thousands of results per widget (per query).

Each query is a range scan; it’s just that I’m performing thousands of them.
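
For reference, each lookup has roughly this shape (a sketch; the table and
column names are placeholders, not our real schema):

    // WIDGET_HISTORY, WIDGET_ID, and SAMPLE_TIME are placeholder names.
    // Because the widget ID is the leading part of the row key, Phoenix
    // executes this as a range scan over that one widget's slice of the
    // table.
    String sql = "SELECT * FROM WIDGET_HISTORY "
               + "WHERE WIDGET_ID = ? AND SAMPLE_TIME BETWEEN ? AND ?";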

From: Satish Iyengar [mailto:[email protected]]
Sent: Friday, December 04, 2015 9:43 AM
To: [email protected]
Subject: Re: Help tuning for bursts of high traffic?

Hi Zack,

Did you consider avoiding hitting HBase for every single row by doing that
step in an offline mode? I was thinking you could have some kind of daily
export of the HBase table, and then use Pig to perform a join (a co-group,
perhaps) to do the same. Obviously this would only work when your HBase
table is not maintained by a stream-based system. HBase is really good at
range scans and may not be ideal for a large number of single-row lookups.

Thanks,
Satish

On Fri, Dec 4, 2015 at 9:09 AM, Riesland, Zack
<[email protected]> wrote:
SHORT EXPLANATION: a much higher percentage of queries to Phoenix become
exceptionally slow after several minutes of very heavy querying.

LONGER EXPLANATION:

I’ve been using Phoenix for about a year as a data store for web-based
reporting tools, and it works well.

Now, I’m trying to use the data in a different (much more request-intensive) 
way and encountering some issues.

The scenario is basically this:

Daily, ingest very large CSV files with data for widgets.

Each input file has hundreds of rows of data for each widget, and tens of 
thousands of unique widgets.

As a first step, I want to de-duplicate this data against my Phoenix-based DB 
(I can’t rely on just upserting the data for de-dup because it will go through 
several ETL steps before being stored into Phoenix/HBase).

So, per widget, I perform a query against Phoenix (the table is keyed on
the unique widget ID + sample point). I get all the data for a given widget
ID within a certain period of time, and then I only ingest the rows for
that widget that are new to me.

I’m doing this in Java in a single step: I loop through my input file and 
perform one query per widget, using the same Connection object to Phoenix.
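
In rough outline, the loop looks like this (a minimal sketch; the class,
table, and column names are placeholders, and the real schema is keyed on
widget ID + sample point as described above):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Timestamp;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class WidgetDedup {
        // One prepared range scan, executed once per widget in the input
        // file. WIDGET_HISTORY is keyed on widget ID + sample point, so
        // the WHERE clause scans one widget's slice of the key space.
        private static final String HISTORY_SQL =
            "SELECT SAMPLE_TIME FROM WIDGET_HISTORY " +
            "WHERE WIDGET_ID = ? AND SAMPLE_TIME BETWEEN ? AND ?";

        static void dedup(Connection conn, List<String> widgetIds,
                          Timestamp from, Timestamp to) throws SQLException {
            // The same Connection and PreparedStatement are reused for
            // every widget (tens of thousands of iterations).
            try (PreparedStatement ps = conn.prepareStatement(HISTORY_SQL)) {
                for (String widgetId : widgetIds) {
                    ps.setString(1, widgetId);
                    ps.setTimestamp(2, from);
                    ps.setTimestamp(3, to);
                    Set<Timestamp> known = new HashSet<>();
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {  // hundreds-to-thousands of rows
                            known.add(rs.getTimestamp(1));
                        }
                    }
                    // ...then ingest only the input rows for this widget
                    // whose sample point is not already in 'known'.
                }
            }
        }
    }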

THE ISSUE:

What I’m finding is that for the first several thousand queries, I almost 
always get a very fast (less than 10 ms) response (good).

But after 15-20 thousand queries, the responses start to get MUCH slower.
Some queries respond as expected, but many take as long as 2-3 minutes,
pushing the total time to prime the data structure into the 12-15 hour
range, when it would only take 2-3 hours if all the queries were fast.

The exact same queries, when run manually rather than as part of this bulk
process, return in the expected < 10 ms.

So it SEEMS like the burst of queries puts Phoenix into some sort of busy state 
that causes it to respond far too slowly.

The connection properties I’m setting are:

phoenix.query.timeoutMs: 90000
phoenix.query.keepAliveMs: 90000
phoenix.query.threadPoolSize: 256
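
For what it’s worth, they’re passed in when the connection is created,
along these lines (a minimal sketch; the class name and the ZooKeeper
quorum address are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.Properties;

    public class PhoenixConn {
        static Connection open() throws SQLException {
            // Client-side tuning properties handed to the Phoenix JDBC driver.
            Properties props = new Properties();
            props.setProperty("phoenix.query.timeoutMs", "90000");
            props.setProperty("phoenix.query.keepAliveMs", "90000");
            props.setProperty("phoenix.query.threadPoolSize", "256");

            // "zk1,zk2,zk3:2181" is a placeholder ZooKeeper quorum.
            return DriverManager.getConnection(
                    "jdbc:phoenix:zk1,zk2,zk3:2181", props);
        }
    }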

Our cluster has 9 (beefy) region servers, and the table I’m referencing has
511 regions. We went through a lot of pain to get the data split extremely
well, and I don’t think schema design is the issue here.

Can anyone help me understand how to make this better? Is there a better 
approach I could take? A better set of configuration parameters? Is our cluster 
just too small for this?


Thanks!

--
Satish Iyengar

"Anyone who has never made a mistake has never tried anything new."
Albert Einstein
