Re: Writing bottleneck in HBase?

2016-12-07 Thread schausson
Hi Ted, thanks for your help!

It seems I was not clear in my explanation, so let me try again:
In my input file, let's say I have 2000 parameters and, for each parameter,
5000 values recorded over a given timeframe.
When I read the file, I read it part by part using a sliding time window:
for instance, I read all parameter values between t0 and t1, which returns
approximately 5 values per parameter. I write this chunk of data to HBase,
read the file for the subsequent time window (t1 to t2), write that data to
HBase, and so on...

About the hashing mechanism applied to the rowId, here is the algorithm:

public long hash(String string) {
  long h = 1125899906842597L; // prime seed
  int len = string.length();

  for (int i = 0; i < len; i++) {
    h = 31 * h + string.charAt(i);
  }
  return h;
}

which, from what I understand, does not guarantee any even distribution...
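
For what it's worth, a quick way to check empirically how that hash spreads keys over a fixed number of buckets is to take the hash modulo the bucket count and count hits per bucket; roughly equal counts would indicate an even spread. This is only a sketch: the key format and the bucket count of 12 below are made-up placeholders, not taken from the thread.

import java.util.Arrays;

public class HashDistributionCheck {
    // Same hash function as shown above.
    static long hash(String s) {
        long h = 1125899906842597L; // prime seed
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        int buckets = 12;                 // e.g. one bucket per region
        long[] counts = new long[buckets];
        for (int i = 0; i < 2000; i++) {
            // Hypothetical key shape: a ~25-character prefix plus the parameter id.
            String key = "SOME-25-CHARACTER-PREFIX/PARAM" + i;
            int bucket = (int) Math.floorMod(hash(key), (long) buckets);
            counts[bucket]++;
        }
        System.out.println(Arrays.toString(counts));
    }
}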

Regards





Re: Writing bottleneck in HBase?

2016-12-03 Thread Ted Yu
I was in China the past 10 days where I didn't have access to gmail.

bq. repeat this sequence a thousand times

You mean proceeding with the next parameter?

bq. use hashing mechanism to transform this long string

How is the hash generated?
The hash prefix should presumably evenly distribute the write load.
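
To make the idea of a distributing prefix concrete, here is a minimal sketch of a salted rowkey: a small bucket number derived from a hash and prepended to the key. The bucket count and key layout are assumptions for illustration, not something stated in this thread, and reads then have to fan out over all buckets for a given key.

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedRowKey {
  // Hypothetical bucket count; typically chosen to match the number of pre-split regions.
  private static final int BUCKETS = 12;

  /** Builds a rowkey of the form [1-byte salt][original key bytes]. */
  public static byte[] saltedRow(String originalKey) {
    int salt = (originalKey.hashCode() & 0x7fffffff) % BUCKETS;
    return Bytes.add(new byte[] { (byte) salt }, Bytes.toBytes(originalKey));
  }
}

With such a prefix, rows that would otherwise sort next to each other land in different regions, which spreads concurrent writes across region servers.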

Thanks

On Thu, Nov 24, 2016 at 8:13 AM, schausson wrote:

> Hi, thanks for your answer.
>
> About your question on thread management: yes, I have several
> threads (up to 4) that may call my persistence method.
>
> When I wrote the post, I had not configured anything special about regions
> for my table, so it basically used the default split policy, I guess.
> Following your answer, I gave this a try:
> byte[][] splits = new
> RegionSplitter.HexStringSplit().split(numberOfRegionServers);
> which led to 12 regions at table creation time.
>
> It slightly improved performance: persistence time dropped from about 2 min
> to roughly 1 min 40 s.
>
> I also tried with 24 regions, but nothing changed...
>
> About how parameter IDs are distributed: to keep it simple, I read 5 values
> per parameter (times 2000 parameters) and call persistence, and repeat this
> sequence a thousand times. So the writes should distribute across all my
> region servers, right?
> One additional clue: the parameter IDs are alphanumeric and evenly
> distributed between A and Z, but I add a prefix to them which is a long
> string (about 25 characters). To save storage space (because the rowId is
> duplicated for each cell), I use a hashing mechanism to transform this long
> string into a Long value (and I have a mapping table next to the main
> table), so I don't really know how these Long values "distribute"...
>
>


Re: Writing bottleneck in HBase?

2016-11-24 Thread schausson
Hi, thanks for your answer.

About your question on thread management: yes, I have several
threads (up to 4) that may call my persistence method.

When I wrote the post, I had not configured anything special about regions
for my table, so it basically used the default split policy, I guess.
Following your answer, I gave this a try:
byte[][] splits = new
RegionSplitter.HexStringSplit().split(numberOfRegionServers);
which led to 12 regions at table creation time.
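
For context, here is a rough sketch of how such splits can be passed to table creation with the HBase 1.x client API. The table and column family names are placeholders, and note that HexStringSplit boundaries only line up well with rowkeys that are themselves uniformly distributed hex strings.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.RegionSplitter;

public class CreatePreSplitTable {
  public static void main(String[] args) throws Exception {
    int numberOfRegions = 12;
    byte[][] splits = new RegionSplitter.HexStringSplit().split(numberOfRegions);

    try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = connection.getAdmin()) {
      HTableDescriptor descriptor = new HTableDescriptor(TableName.valueOf("timeseries")); // placeholder name
      descriptor.addFamily(new HColumnDescriptor("d"));                                    // placeholder family
      // Create the table pre-split so writes can spread across region servers immediately.
      admin.createTable(descriptor, splits);
    }
  }
}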

It slightly improved performance: persistence time dropped from about 2 min
to roughly 1 min 40 s.

I also tried with 24 regions, but nothing changed...

About how parameter IDs are distributed: to keep it simple, I read 5 values
per parameter (times 2000 parameters) and call persistence, and repeat this
sequence a thousand times. So the writes should distribute across all my
region servers, right?
One additional clue: the parameter IDs are alphanumeric and evenly
distributed between A and Z, but I add a prefix to them which is a long
string (about 25 characters). To save storage space (because the rowId is
duplicated for each cell), I use a hashing mechanism to transform this long
string into a Long value (and I have a mapping table next to the main
table), so I don't really know how these Long values "distribute"...

Not sure I'm clear...











Re: Writing bottleneck in HBase?

2016-11-23 Thread Ted Yu
bq. it calls the persistence method asynchronously

Assuming the persistence method is still executing when the next threshold
value is reached, do you have other threads to do persistence?
If so, how many threads can potentially run at the same time?

How many regions does the table have?

What's the distribution of parameter IDs in the input file? One case is
that the parameter IDs are sequential w.r.t. region boundaries, ending up
with the writes hitting one region at a time.

On Wed, Nov 23, 2016 at 8:01 AM, schausson wrote:

> Hi,
>
> I am new to HBase and I'm facing performance issues...
>
> Short story: I want to persist about 10,000,000 values in HBase, and it takes
> the same time on a basic sandbox (an HDP Hadoop sandbox with a single region
> server node) as it does on our "production" cluster (which comprises 12 region
> servers with higher capabilities than my developer's laptop...).
>
> Detailed case:
>
> Basically, the use case is: my Java application receives a binary file that
> contains time series, decodes them, and stores the decoded data into a single
> HBase table.
> HBase table design: we store one parameter per row, and we create one
> column per timestamp to store the associated value.
> My test case is based on an input file that yields ~2000 rows/parameters
> with ~5000 values per row (=> around 10,000,000 values to store in my
> HBase table in the end).
>
> For this purpose, my application uses the HBase client API.
> Basically, my code proceeds as follows: it decodes the parameter time series
> from the input file and stores these values in a map from parameter ID to its
> decoded (timestamp, value) pairs.
>
> When it reaches 10,000 buffered values (a threshold that may be changed), it
> calls the persistence method asynchronously and continues decoding until the
> end of the input file.
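
As I read that description, the decode loop looks roughly like the sketch below. The threshold of 10,000, the pool of 4 writer threads, and the nested map type are assumptions pieced together from this thread, not the original code.

import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DecodeAndFlush {
  private static final int THRESHOLD = 10_000;                          // values buffered before a flush
  private final ExecutorService pool = Executors.newFixedThreadPool(4); // "up to 4" persistence threads
  private Map<String, NavigableMap<Long, Double>> buffer = new HashMap<>();
  private int buffered = 0;

  void onDecodedValue(String paramId, long timestamp, double value) {
    buffer.computeIfAbsent(paramId, k -> new TreeMap<>()).put(timestamp, value);
    if (++buffered >= THRESHOLD) {
      final Map<String, NavigableMap<Long, Double>> chunk = buffer;
      buffer = new HashMap<>();
      buffered = 0;
      pool.submit(() -> persist(chunk)); // write this chunk while decoding continues
    }
  }

  void persist(Map<String, NavigableMap<Long, Double>> chunk) {
    // The HBase writes happen here (see the persistence method below).
  }
}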
> The persistence method proceeds like this (simplified code):
>
> for (paramId : map.keySet()) {
>     Put put = new Put(paramId);
>     for (value : map.get(paramId)) {
>         put.addColumn(family, columnName, value);
>     }
>     table.put(put); // one put, i.e. one row, per parameter
> }
> Choosing a threshold value of 10,000 leads to ~1,000 calls to the persistence
> method. Each call generates 2000 calls to the table.put() method, each put
> containing ~5 columns.
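
For comparison, here is a self-contained sketch of the same persistence step that collects the Puts and sends them as a single batch per chunk rather than one table.put() per row. The table and family names, the nested map type, and the batching itself are assumptions on my part, not the original code; the client groups a batched put list by region server, which cuts the number of round trips.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ChunkPersister {
  private static final byte[] FAMILY = Bytes.toBytes("d"); // placeholder family name
  private final Connection connection;

  ChunkPersister(Connection connection) {
    this.connection = connection;
  }

  /** Writes one decoded chunk: one row per parameter, one column per timestamp. */
  void persist(Map<String, NavigableMap<Long, Double>> chunk) throws IOException {
    List<Put> puts = new ArrayList<>(chunk.size());
    for (Map.Entry<String, NavigableMap<Long, Double>> param : chunk.entrySet()) {
      Put put = new Put(Bytes.toBytes(param.getKey())); // rowkey = (hashed) parameter id
      for (Map.Entry<Long, Double> sample : param.getValue().entrySet()) {
        // Column qualifier = timestamp, cell value = the recorded sample.
        put.addColumn(FAMILY,
            Bytes.toBytes(sample.getKey().longValue()),
            Bytes.toBytes(sample.getValue().doubleValue()));
      }
      puts.add(put);
    }
    try (Table table = connection.getTable(TableName.valueOf("timeseries"))) { // placeholder table name
      table.put(puts); // one batched call instead of ~2000 individual puts
    }
  }
}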
>
> When I run this on the HDP sandbox on my laptop (single region server), it
> completes in less than 2 minutes.
> When I run this on our production cluster (12 region servers), it takes
> 2 minutes and sometimes more.
>
> My question is: is the write load distributed across all the region
> servers? Obviously not... What should I do if I want my application to
> scale properly when we add additional region servers?
>
> I don't know if I gave enough information, so please do not hesitate to ask
> me for more details if needed; any help would be greatly appreciated...
>
> Regards
>
> Sebastien
>