Re: Optimizing Multi Gets in hbase

2013-02-19 Thread lars hofhansl
I should qualify that statement, actually.

I was comparing scanning 1m KVs to getting 1m KVs when all KVs are returned.

As James Taylor pointed out to me privately: A fairer comparison would have 
been to run a scan with a filter that lets x% of the rows pass (i.e. the 
selectivity of the scan would be x%) and compare that to a multi Get of the 
same x% of the rows.

There we found that a Scan+Filter is more efficient than issuing multi Gets if 
x is >= 1-2%.


Or in other words, translating many Gets into a Scan+Filter is beneficial if 
the Scan would return at least 1-2% of the rows to the client.
For example:
if you are looking for less than 10-20k rows in 1m rows, using multi Gets is 
likely more efficient.
if you are looking for more than 10-20k rows in 1m rows, using a Scan+Filter is 
likely more efficient.


Of course this is predicated on whether you have an efficient way to represent 
the rows you are looking for in a filter, so that would probably shift this 
slightly more towards Gets (just imagine a Filter that has to encode 100k random 
row keys to be matched; since Filters are instantiated per store there is another 
natural limit there).


As I said below, the crux of the matter is having some histograms of your data, 
so that such a decision could be made automatically.


-- Lars




 From: lars hofhansl la...@apache.org
To: user@hbase.apache.org user@hbase.apache.org 
Sent: Monday, February 18, 2013 5:48 PM
Subject: Re: Optimizing Multi Gets in hbase
 
As it happens we did some tests around this last week.
Turns out doing Gets in batches instead of a scan still gives you 1/3 of the 
performance.

I.e. when you have a table with, say, 10m rows and scanning takes N seconds, 
then calling 10m Gets in batches of 1000 takes ~3N, which is pretty impressive.

Now, this is with all data in the cache!
When the data is not in the cache and the Gets are random it is many orders of 
magnitude slower, as the Gets are sprayed all over the disk. In that case 
sorting the Gets and issuing scans would indeed be much more efficient.
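
(For reference, a minimal sketch of what "Gets in batches of 1000" looks like with the 0.94-era client; the table name, the 8-byte long key layout and the process() helper are made up for illustration.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedGets {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testTable");      // hypothetical table
    try {
      List<Get> batch = new ArrayList<Get>(1000);
      for (long i = 0; i < 10000000L; i++) {           // 10m rows, as in the test above
        batch.add(new Get(Bytes.toBytes(i)));          // assumes 8-byte long row keys
        if (batch.size() == 1000) {
          Result[] results = table.get(batch);         // one multi-get call per 1000 rows
          process(results);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        process(table.get(batch));                     // left-over partial batch
      }
    } finally {
      table.close();
    }
  }

  private static void process(Result[] results) {
    // consume the results here
  }
}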


The Gets in a batch are already sorted on the client, but as N. says it is hard 
to determine when to turn many Gets into a Scan with filters automatically. 
Without statistics/histograms I'd even wager a guess that it would be impossible 
to do.
Imagine you issue 1 random Gets, but your table has 10bn rows; in that case 
it is almost certain that the Gets are faster than a scan.
Now imagine the Gets only cover a small key range. With statistics we could tell 
whether it would be beneficial to turn this into a scan.

It's not that hard to add statistics to HBase. Would do it as part of the 
compactions, and record the histograms in some table.


You can always do that yourself. If you suspect you are touching most rows in a 
table/region, just issue a scan with an appropriate filter (you may have to 
implement your own filter, though). Maybe we could add a version of RowFilter that 
matches against multiple keys.
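
(In the absence of such a multi-key RowFilter, one way to approximate it today is to OR one RowFilter per wanted key inside a FilterList; a rough sketch, assuming the key list is sorted and with the helper name made up. As noted above, the whole filter is shipped to and instantiated on every store the scan touches, so this only pays off for reasonably small key sets.)

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanWithRowFilters {
  // Scan only the covering key range, but let just the wanted row keys through.
  static ResultScanner scanForKeys(HTable table, List<byte[]> sortedKeys) throws IOException {
    FilterList anyOf = new FilterList(FilterList.Operator.MUST_PASS_ONE);
    for (byte[] key : sortedKeys) {
      anyOf.addFilter(new RowFilter(CompareOp.EQUAL, new BinaryComparator(key)));
    }
    Scan scan = new Scan();
    scan.setStartRow(sortedKeys.get(0));
    // the stop row is exclusive, so pad the last wanted key with a trailing 0x00
    scan.setStopRow(Bytes.add(sortedKeys.get(sortedKeys.size() - 1), new byte[] { 0 }));
    scan.setFilter(anyOf);
    scan.setCaching(1000);
    return table.getScanner(scan);
  }
}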


-- Lars




From: Varun Sharma va...@pinterest.com
To: user@hbase.apache.org 
Sent: Monday, February 18, 2013 1:57 AM
Subject: Optimizing Multi Gets in hbase

Hi,

I am trying to do batched get(s) on a cluster. Here is the code:

List<Get> gets = ...
// Prepare my gets with the rows i need
myHTable.get(gets);

I have two questions about the above scenario:
i) Is this the most optimal way to do this ?
ii) I have a feeling that if there are multiple gets in this case, on the
same region, then each one of those shall instantiate separate scan(s) over
the region even though a single scan is sufficient. Am I mistaken here ?

Thanks
Varun

Re: HBase without compactions?

2013-02-19 Thread lars hofhansl
If you store data in LSM trees you need compactions.
The advantage is that your data files are immutable.
MapR has a mutable file system and they probably store their data in something 
more akin to B-Trees...?
Or maybe they somehow avoid the expensive merge sorting of many small files. It 
seems that it has to be one or the other.

(Maybe somebody from MapR reads this and can explain how it actually works.)

Compactions let you trade random IO for sequential IO (just to state the 
obvious). It seems that you can't have it both ways.

-- Lars




 From: Otis Gospodnetic otis.gospodne...@gmail.com
To: user@hbase.apache.org 
Sent: Monday, February 18, 2013 7:30 PM
Subject: HBase without compactions?
 
Hello,

It's kind of funny, we run SPM, which includes SPM for HBase (performance
monitoring service/tool for HBase essentially) and we currently store all
performance metrics in HBase.

I see a ton of HBase development activity, which is great, but it just
occurred to me that I don't think I recall seeing anything about getting
rid of compactions.  Yet, compactions are one thing that I know hurt us the
most and is one thing that MapR somehow got rid of in their implementation.

Have there been any discussions, attempts, or thoughts about finding a way
to avoid compactions?

Thanks,
Otis
--
HBASE Performance Monitoring - http://sematext.com/spm/index.html

Re: Optimizing Multi Gets in hbase

2013-02-19 Thread Nicolas Liochon
Looking at the code, it seems possible to do this server side within the
multi invocation: we could group the gets by region, and do a single scan.
We could also add some heuristics if necessary...
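
(Until something like that exists server side, a client can do the grouping itself; a sketch against the 0.94-era client API, with the method name and map layout just for illustration.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;

public class GroupGetsByRegion {
  // Bucket a list of Gets by the region that currently hosts each row key.
  static Map<String, List<Get>> groupByRegion(HTable table, List<Get> gets) throws IOException {
    Map<String, List<Get>> byRegion = new HashMap<String, List<Get>>();
    for (Get get : gets) {
      HRegionLocation loc = table.getRegionLocation(get.getRow());
      String regionName = loc.getRegionInfo().getRegionNameAsString();
      List<Get> bucket = byRegion.get(regionName);
      if (bucket == null) {
        bucket = new ArrayList<Get>();
        byRegion.put(regionName, bucket);
      }
      bucket.add(get);
    }
    return byRegion; // each bucket could then become one multi-get or one scan
  }
}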



On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl la...@apache.org wrote:

 I should qualify that statement, actually.

 I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
 returned.

 As James Taylor pointed out to me privately: A fairer comparison would
 have been to run a scan with a filter that lets x% of the rows pass (i.e.
 the selectivity of the scan would be x%) and compare that to a multi Get of
 the same x% of the row.

 There we found that a Scan+Filter is more efficient that issuing multi
 Gets if x is = 1-2%.


 Or in other words, translating many Gets into a Scan+Filter is beneficial
 if the Scan would return at least 1-2% of the rows to the client.
 For example:
 if you are looking for less than 10-20k rows in 1m rows, using muli Gets
 is likely more efficient.
 if you are looking for more than 10-20k rows in 1m rows, using a
 Scan+Filter is likely more efficient.


 Of course this is predicated on whether you have an efficient way to
 represent the rows you are looking for in a filter, so that would probably
 shift this slightly more towards Gets (just imaging a Filter that to encode
 100k random row keys to be matched; since Filters are instantiated store
 there is another natural limit there).


 As I said below, the crux of the matter is having some histograms of your
 data, so that such a decision could be made automatically.


 -- Lars



 
  From: lars hofhansl la...@apache.org
 To: user@hbase.apache.org user@hbase.apache.org
 Sent: Monday, February 18, 2013 5:48 PM
 Subject: Re: Optimizing Multi Gets in hbase

 As it happens we did some tests around last week.
 Turns out doing Gets in batches instead of a scan still gives you 1/3 of
 the performance.

 I.e. when you have a table with, say, 10m rows and scanning take N
 seconds, then calling 10m Gets in batches of 1000 take ~3N, which is pretty
 impressive.

 Now, this is with all data in the cache!
 When the data is not in the cache and the Gets are random it is many
 orders of magnitude slower, as the Gets are sprayed all over the disk. In
 that case sorting the Gets and issuing scans would indeed be much more
 efficient.


 The Gets in a batch are already sorted on the client, but as N. says it is
 hard to determine when to turn many Gets into a Scan with filters
 automatically. Without statistics/histograms I'd even wager a guess that
 would be impossible to do.
 Imagine you issue 1 random Gets, but your table has 10bn rows, in that
 case it is almost certain that the Gets are faster than a scan.
 Now image the Gets only cover a small key range. With statistics we could
 tell whether it would beneficial to turn this into a scan.

 It's not that hard to add statistics to HBase. Would do it as part of the
 compactions, and record the histograms in some table.


 You can always do that yourself. If you suspect you are touching most rows
 in a table/region, just issue a scan with a appropriate filter (may have to
 implement your own filter, though). Maybe we could a version of RowFilter
 that match against multiple keys.


 -- Lars



 
 From: Varun Sharma va...@pinterest.com
 To: user@hbase.apache.org
 Sent: Monday, February 18, 2013 1:57 AM
 Subject: Optimizing Multi Gets in hbase

 Hi,

 I am trying to batched get(s) on a cluster. Here is the code:

 ListGet gets = ...
 // Prepare my gets with the rows i need
 myHTable.get(gets);

 I have two questions about the above scenario:
 i) Is this the most optimal way to do this ?
 ii) I have a feeling that if there are multiple gets in this case, on the
 same region, then each one of those shall instantiate separate scan(s) over
 the region even though a single scan is sufficient. Am I mistaken here ?

 Thanks
 Varun



Re: PreSplit the table with Long format

2013-02-19 Thread Viral Bajaria
HBase shell is a jruby shell and so you can invoke any java commands from
it.

For example:
 import org.apache.hadoop.hbase.util.Bytes
 Bytes.toLong(Bytes.toBytes(1000))

Not sure if this works as expected since I don't have a terminal in front
of me but you could try (assuming the SPLITS keyword takes byte array as
input, never used SPLITS from the command line):
create 'testTable', 'cf1' , { SPLITS => [ Bytes.toBytes(1000),
Bytes.toBytes(2000), Bytes.toBytes(3000) ] }

Thanks,
Viral
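
(If the shell route stays awkward, the same presplit can be done from a few lines of Java; a sketch with the 0.94-era admin API, using Bytes.toBytes(1000L) so the split points are 8-byte longs like long row keys. Table and family names are the ones from this thread; everything else is illustrative.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PresplitWithLongKeys {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      HTableDescriptor desc = new HTableDescriptor("testTable");
      desc.addFamily(new HColumnDescriptor("cf1"));
      // split points as 8-byte longs, matching long row keys
      byte[][] splits = new byte[][] {
          Bytes.toBytes(1000L),
          Bytes.toBytes(2000L),
          Bytes.toBytes(3000L)
      };
      admin.createTable(desc, splits);
    } finally {
      admin.close();
    }
  }
}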

On Tue, Feb 19, 2013 at 1:52 AM, Farrokh Shahriari 
mohandes.zebeleh...@gmail.com wrote:

 Hi there
 As I use rowkeys in long format, I must presplit the table in long format too. But
 when I've run this command, it presplits the table with STRING format:
 create 'testTable','cf1',{SPLITS => [ '1000','2000','3000']}

 How can I presplit the table with Long format ?

 Farrokh



Re: PreSplit the table with Long format

2013-02-19 Thread Farrokh Shahriari
Thanks for your help, but it doesn't work. Do you have any other idea, because I
must run it from the shell.

Farrokh


On Tue, Feb 19, 2013 at 1:30 PM, Viral Bajaria viral.baja...@gmail.comwrote:

 HBase shell is a jruby shell and so you can invoke any java commands from
 it.

 For example:
  import org.apache.hadoop.hbase.util.Bytes
  Bytes.toLong(Bytes.toBytes(1000))

 Not sure if this works as expected since I don't have a terminal in front
 of me but you could try (assuming the SPLITS keyword takes byte array as
 input, never used SPLITS from the command line):
 create 'testTable', 'cf1' , { SPLITS = [ Bytes.toBytes(1000),
 Bytes.toBytes(2000), Bytes.toBytes(3000) ] }

 Thanks,
 Viral

 On Tue, Feb 19, 2013 at 1:52 AM, Farrokh Shahriari 
 mohandes.zebeleh...@gmail.com wrote:

  Hi there
  As I use rowkey in long format,I must presplit table in long format
 too.But
  when I've run this command,it presplit the table with STRING format :
  create 'testTable','cf1',{SPLITS = [ '1000','2000','3000']}
 
  How can I presplit the table with Long format ?
 
  Farrokh
 



Re: storing lists in columns

2013-02-19 Thread Stas Maksimov
Hi Jean-Marc,

I've validated this, it works perfectly. Very easy to implement and it's
very fast!

Thankfully in this project there isn't a lot of lists in each table, so I
won't have to create too many column families. In other scenarios it could
be a problem.

Many thanks,
Stas


On 16 February 2013 02:29, Jean-Marc Spaggiari jean-m...@spaggiari.orgwrote:

 Hi Stas,

 Few options are coming into my mind.

 Quickly:
 1) Why not store the products in specific columns instead of in the
 same one? Like:
 table, rowid1, cf:list, c:aa, value:true
 table, rowid1, cf:list, c:bb, value:true
 table, rowid1, cf:list, c:cc, value:true
 table, rowid2, cf:list, c:aabb, value:true
 table, rowid2, cf:list, c:cc, value:true
 That way when you do a search you query directly the right column for
 the right row. And using an exists call will also reduce the size of the
 data transferred.

 2) You can store the data in the opposite way. Like:
 table, aa, cf:products, c:rowid1, value:true
 table, aabb, cf:products, c:rowid2, value:true
 table, bb, cf:products, c:rowid1, value:true
 table, cc, cf:products, c:rowid1, value:true
 table, cc, cf:products, c:rowid2, value:true
 Here, you query by your product ID, and you search the column based on
 your previous rowid.


 I would say the 2 solutions are equivalent, but it will really depend
 on your data pattern and your query pattern.

 JM
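
(A minimal sketch of option 1 above: one column per list item, the value just a marker, plus an exists-style membership check. Class and method names are made up; the 0.94-era client API is assumed.)

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ListAsColumns {
  private static final byte[] CF = Bytes.toBytes("list");

  // Store the list {aa, bb, cc} for rowid1 as one column per item.
  static void storeList(HTable table, String rowId, String... items) throws IOException {
    Put put = new Put(Bytes.toBytes(rowId));
    for (String item : items) {
      put.add(CF, Bytes.toBytes(item), Bytes.toBytes(true));
    }
    table.put(put);
  }

  // Does the list stored under rowId contain 'item'?
  static boolean contains(HTable table, String rowId, String item) throws IOException {
    Get get = new Get(Bytes.toBytes(rowId));
    get.addColumn(CF, Bytes.toBytes(item));
    return table.exists(get); // only existence is checked, no value is shipped back
  }
}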

 2013/2/15, Stas Maksimov maksi...@gmail.com:
  Hi all,
 
  I have a requirement to store lists in HBase columns like this:
  table, rowid1, f:list, aa, bb, cc
  table, rowid2, f:list, aabb, cc
 
  There is a further requirement to be able to find rows where f:list
  contains a particular item, e.g. when I need to find rows having item
 aa
  only rowid1 should match, and for item cc both rowid1 and rowid2
  should match.
 
  For now I decided to use SingleColumnValueFilter with substring matching.
  As using comma-separated list proved difficult to search through, I'm
 using
  pipe symbols to separate items like this: |aa|bb|cc|, so that I could
  pass the search item surrounded by pipes into the filter:
  SingleColumnValueFilter ('f', 'list', =, 'substring:|aa|')
 
  This proved to work effectively enough, however I would prefer to use
  something more standard for my list storage (e.g. serialised JSON), or
  perhaps something even more optimised for a search - performance really
  does matter here.
 
  Any opinions on this solution and possible enhancements are much
  appreciated.
 
  Many thanks,
  Stas
 



Re: storing lists in columns

2013-02-19 Thread Jean-Marc Spaggiari
Hi Stas,

Don't forget that you should always try to keep the number of column
families lower than 3, else you might face some performance issues.

JM

2013/2/19, Stas Maksimov maksi...@gmail.com:
 Hi Jean-Marc,

 I've validated this, it works perfectly. Very easy to implement and it's
 very fast!

 Thankfully in this project there isn't a lot of lists in each table, so I
 won't have to create too many column families. In other scenarios it could
 be a problem.

 Many thanks,
 Stas


 On 16 February 2013 02:29, Jean-Marc Spaggiari
 jean-m...@spaggiari.orgwrote:

 Hi Stas,

 Few options are coming into my mind.

 Quickly:
 1) Why not storing the products in specif columns instead of in the
 same one? Like:
 table, rowid1, cf:list, c:aa, value:true
 table, rowid1, cf:list, c:bb, value:true
 table, rowid1, cf:list, c:cc, value:true
 table, rowid2, cf:list, c:aabb, value:true
 table, rowid2, cf:list, c:cc, value:true
 That way when you do a search you query directly the right column for
 the right row. And using exist call with also reduce the size of the
 data transfered.

 2) You can store the data in the oposite way. Like:
 table, aa, cf:products, c:rowid1, value:true
 table, aabb, cf:products, c:rowid2, value:true
 table, bb, cf:products, c:rowid1, value:true
 table, cc, cf:products, c:rowid1, value:true
 table, cc, cf:products, c:rowid2, value:true
 Here, you query by your product ID, and you search the column based on
 your previous rowid.


 I will say the 2 solutions are equivalent, but it will really depend
 on your data pattern and you query pattern.

 JM

 2013/2/15, Stas Maksimov maksi...@gmail.com:
  Hi all,
 
  I have a requirement to store lists in HBase columns like this:
  table, rowid1, f:list, aa, bb, cc
  table, rowid2, f:list, aabb, cc
 
  There is a further requirement to be able to find rows where f:list
  contains a particular item, e.g. when I need to find rows having item
 aa
  only rowid1 should match, and for item cc both rowid1 and
  rowid2
  should match.
 
  For now I decided to use SingleColumnValueFilter with substring
  matching.
  As using comma-separated list proved difficult to search through, I'm
 using
  pipe symbols to separate items like this: |aa|bb|cc|, so that I could
  pass the search item surrounded by pipes into the filter:
  SingleColumnValueFilter ('f', 'list', =, 'substring:|aa|')
 
  This proved to work effectively enough, however I would prefer to use
  something more standard for my list storage (e.g. serialised JSON), or
  perhaps something even more optimised for a search - performance really
  does matter here.
 
  Any opinions on this solution and possible enhancements are much
  appreciated.
 
  Many thanks,
  Stas
 




Table deleted after restart of computer

2013-02-19 Thread Paul van Hoven
I just started with hbase. Therefore I created a table and filled this
table with some data. But after restarting my computer all the data
has gone. This even happens when stopping hbase with stop-hbase.sh.

How can this happen?


Re: Table deleted after restart of computer

2013-02-19 Thread Ted Yu
Which HBase / hadoop version were you using ?

Did you start the cluster in standalone mode ?

Thanks

On Tue, Feb 19, 2013 at 5:23 AM, Paul van Hoven 
paul.van.ho...@googlemail.com wrote:

 I just started with hbase. Therefore I created a table and filled this
 table with some data. But after restarting my computer all the data
 has gone. This even happens when stopping hbase with stop-hbase.sh.

 How can this happen?



Re: Table deleted after restart of computer

2013-02-19 Thread Paul van Hoven
I installed hbase via brew.

brew install hadoop hbase pig hive

Then I started hbase via start-hbase.sh command. Therefore I'm pretty
sure it is a standalone version.



2013/2/19 Ted Yu yuzhih...@gmail.com:
 Which HBase / hadoop version were you using ?

 Did you start the cluster in standalone mode ?

 Thanks

 On Tue, Feb 19, 2013 at 5:23 AM, Paul van Hoven 
 paul.van.ho...@googlemail.com wrote:

 I just started with hbase. Therefore I created a table and filled this
 table with some data. But after restarting my computer all the data
 has gone. This even happens when stopping hbase with stop-hbase.sh.

 How can this happen?



Re: Table deleted after restart of computer

2013-02-19 Thread Ibrahim Yakti
Hello Paul,

The default location for hbase data is /tmp so when you restart your
machine it will be deleted, you need to change it as per
http://hbase.apache.org/book.html#quickstart




--
Ibrahim


On Tue, Feb 19, 2013 at 5:54 PM, Ted Yu yuzhih...@gmail.com wrote:

 Which HBase / hadoop version were you using ?

 Did you start the cluster in standalone mode ?

 Thanks

 On Tue, Feb 19, 2013 at 5:23 AM, Paul van Hoven 
 paul.van.ho...@googlemail.com wrote:

  I just started with hbase. Therefore I created a table and filled this
  table with some data. But after restarting my computer all the data
  has gone. This even happens when stopping hbase with stop-hbase.sh.
 
  How can this happen?
 



Re: coprocessor enabled put very slow, help please~~~

2013-02-19 Thread Wei Tan
A side question: if HTablePool is not encouraged to be used... how do we 
handle thread safety when using HTable? Any replacement for 
HTablePool in the plans?
Thanks,


Best Regards,
Wei




From:   Michel Segel michael_se...@hotmail.com
To: user@hbase.apache.org user@hbase.apache.org, 
Date:   02/18/2013 09:23 AM
Subject:Re: coprocessor enabled put very slow, help please~~~



Why are you using an HTable Pool?
Why are you closing the table after each iteration through?

Try using 1 HTable object. Turn off WAL
Initiate in start()
Close in Stop()
Surround the use in a try / catch
If exception caught, re instantiate new HTable connection.

Maybe want to flush the connection after puts. 


Again not sure why you are using check and put on the base table. Your 
count could be off.

As an example look at poem/rhyme 'Mary had a little lamb'.
Then check your word count.

Sent from a remote device. Please excuse any typos...

Mike Segel
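
(A bare-bones sketch of the single-table structure described above, against the 0.94-era coprocessor API. IndexObserver and countWord are made-up names; whether one HTableInterface can be shared safely across concurrent handlers is exactly the side question raised in this thread, so treat this as a sketch, not the OP's code.)

import java.io.IOException;

import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexObserver extends BaseRegionObserver {
  private HTableInterface indexTable;

  @Override
  public void start(CoprocessorEnvironment env) throws IOException {
    // one handle for the lifetime of the observer instead of a pool per call
    indexTable = env.getTable(Bytes.toBytes("doc_idx"));
  }

  @Override
  public void stop(CoprocessorEnvironment env) throws IOException {
    if (indexTable != null) {
      indexTable.close();
    }
  }

  // called from the postCheckAndPut/postPut hook for each word in the document
  void countWord(String word) throws IOException {
    Increment inc = new Increment(Bytes.toBytes(word));
    inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), 1L);
    indexTable.increment(inc);
  }
}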

On Feb 18, 2013, at 7:21 AM, prakash kadel prakash.ka...@gmail.com 
wrote:

 Thank you guys for your replies,
 Michael,
   I think i didnt make it clear. Here is my use case,
 
 I have text documents to insert in the hbase. (With possible duplicates)
 Suppose i have a document as :  I am working. He is not working
 
 I want to insert this document to a table in hbase, say table doc
 
 =doc table=
 -
 rowKey : doc_id
 cf: doc_content
 value: I am working. He is not working
 
 Now, I want to create another table that stores the word count, say doc_idx
 
 doc_idx table
 ---
 rowKey : I, cf: count, value: 1
 rowKey : am, cf: count, value: 1
 rowKey : working, cf: count, value: 2
 rowKey : He, cf: count, value: 1
 rowKey : is, cf: count, value: 1
 rowKey : not, cf: count, value: 1
 
 My MR job code:
 ==
 
 if (doc.checkAndPut(rowKey, "doc_content", "", null, putDoc)) {
     for (String word : doc_content.split("\\s+")) {
         Increment inc = new Increment(Bytes.toBytes(word));
         inc.addColumn("count", "", 1);
     }
 }
 
 Now, i wanted to do some experiments with coprocessors. So, i modified
 the code as follows.
 
 My MR job code:
 ===
 
 doc.checkAndPut(rowKey, "doc_content", "", null, putDoc);
 
 Coprocessor code:
 ===
 
 public void start(CoprocessorEnvironment env)  {
     pool = new HTablePool(conf, 100);
 }

 public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
  compareOp, comparator,  put, result) {

     if (!result) return true; // check if the put succeeded

     HTableInterface table_idx = pool.getTable("doc_idx");

     try {

         for (KeyValue contentKV : put.get("doc_content", "")) {
             for (String word :
                     new String(contentKV.getValue()).split("\\s+")) {
                 Increment inc = new Increment(Bytes.toBytes(word));
                 inc.addColumn("count", "", 1);
                 table_idx.increment(inc);
             }
         }
     } finally {
         table_idx.close();
     }
     return true;
 }

 public void stop(env) {
     pool.close();
 }
 
 I am a newbie to HBase. I am not sure this is the right way to do it.
 Given that, why is the coprocessor-enabled version much slower than
 the one without?
 
 
 Sincerely,
 Prakash Kadel
 
 
 On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
 michael_se...@hotmail.com wrote:
 
 The  issue I was talking about was the use of a check and put.
 The OP wrote:
 each map inserts to doc table.(checkAndPut)
 regionobserver coprocessor does a postCheckAndPut and inserts some 
rows to
 a index table.
 
 My question is why does the OP use a checkAndPut, and the 
 RegionObserver's postCheckAndPut?
 
 
 Here's a good example... 
http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put

 
 The OP doesn't really get in to the use case, so we don't know why the 
Check and Put in the M/R job.
 He should just be using put() and then a postPut().
 
 Another issue... since he's writing to  a different HTable... how? Does 
he create an HTable instance in the start() method of his RO object and 
then reference it later? Or does he create the instance of the HTable on 
the fly in each postCheckAndPut() ?
 Without seeing his code, we don't know.
 
 Note that this is a synchronous set of writes. Your overall return from 
the M/R call to put will wait until the second row is inserted.
 
 Interestingly enough, you may want to consider disabling the WAL on the 
write to the index.  You can always run a M/R job that rebuilds the index 
should something occur to the system where you might lose the data. 
Indexes *ARE* expendable. ;-)
 
 Does that explain it?
 
 -Mike
 
 On Feb 18, 2013, at 4:57 AM, yonghu yongyong...@gmail.com wrote:
 
 Hi, Michael
 
 I don't quite understand what you mean by round trip back to the
 client. In my understanding, as the RegionServer and TaskTracker can
 be the same node, MR don't have to pull data 

Re: Optimizing Multi Gets in hbase

2013-02-19 Thread Varun Sharma
I have another question, if I am running a scan wrapped around multiple
rows in the same region, in the following way:

Scan scan = new Scan(getWithMultipleRowsInSameRegion);

Now, how does execution occur? Is it just a sequential scan across the
entire region or does it seek to the HFile blocks containing the actual values?
What I truly mean is, let's say the multi get is on the following rows:

Row1 : HFileBlock1
Row2 : HFileBlock20
Row3 : Does not exist
Row4 : HFileBlock25
Row5 : HFileBlock100

The efficient way to do this would be to determine the correct blocks using
the index and then search within the blocks for, say, Row1. Then, seek to
HFileBlock20 and look for Row2. Eliminate Row3 and then keep on
seeking to + searching within HFileBlocks as needed.

I am wondering if a scan wrapped around a Get with multiple rows would do
the same ?

Thanks
Varun

On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon nkey...@gmail.com wrote:

 Looking at the code, it seems possible to do this server side within the
 multi invocation: we could group the get by region, and do a single scan.
 We could also add some heuristics if necessary...



 On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl la...@apache.org wrote:

  I should qualify that statement, actually.
 
  I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
  returned.
 
  As James Taylor pointed out to me privately: A fairer comparison would
  have been to run a scan with a filter that lets x% of the rows pass (i.e.
  the selectivity of the scan would be x%) and compare that to a multi Get
 of
  the same x% of the row.
 
  There we found that a Scan+Filter is more efficient that issuing multi
  Gets if x is = 1-2%.
 
 
  Or in other words, translating many Gets into a Scan+Filter is beneficial
  if the Scan would return at least 1-2% of the rows to the client.
  For example:
  if you are looking for less than 10-20k rows in 1m rows, using muli Gets
  is likely more efficient.
  if you are looking for more than 10-20k rows in 1m rows, using a
  Scan+Filter is likely more efficient.
 
 
  Of course this is predicated on whether you have an efficient way to
  represent the rows you are looking for in a filter, so that would
 probably
  shift this slightly more towards Gets (just imaging a Filter that to
 encode
  100k random row keys to be matched; since Filters are instantiated store
  there is another natural limit there).
 
 
  As I said below, the crux of the matter is having some histograms of your
  data, so that such a decision could be made automatically.
 
 
  -- Lars
 
 
 
  
   From: lars hofhansl la...@apache.org
  To: user@hbase.apache.org user@hbase.apache.org
  Sent: Monday, February 18, 2013 5:48 PM
  Subject: Re: Optimizing Multi Gets in hbase
 
  As it happens we did some tests around last week.
  Turns out doing Gets in batches instead of a scan still gives you 1/3 of
  the performance.
 
  I.e. when you have a table with, say, 10m rows and scanning take N
  seconds, then calling 10m Gets in batches of 1000 take ~3N, which is
 pretty
  impressive.
 
  Now, this is with all data in the cache!
  When the data is not in the cache and the Gets are random it is many
  orders of magnitude slower, as the Gets are sprayed all over the disk. In
  that case sorting the Gets and issuing scans would indeed be much more
  efficient.
 
 
  The Gets in a batch are already sorted on the client, but as N. says it
 is
  hard to determine when to turn many Gets into a Scan with filters
  automatically. Without statistics/histograms I'd even wager a guess that
  would be impossible to do.
  Imagine you issue 1 random Gets, but your table has 10bn rows, in
 that
  case it is almost certain that the Gets are faster than a scan.
  Now image the Gets only cover a small key range. With statistics we could
  tell whether it would beneficial to turn this into a scan.
 
  It's not that hard to add statistics to HBase. Would do it as part of the
  compactions, and record the histograms in some table.
 
 
  You can always do that yourself. If you suspect you are touching most
 rows
  in a table/region, just issue a scan with a appropriate filter (may have
 to
  implement your own filter, though). Maybe we could a version of RowFilter
  that match against multiple keys.
 
 
  -- Lars
 
 
 
  
  From: Varun Sharma va...@pinterest.com
  To: user@hbase.apache.org
  Sent: Monday, February 18, 2013 1:57 AM
  Subject: Optimizing Multi Gets in hbase
 
  Hi,
 
  I am trying to batched get(s) on a cluster. Here is the code:
 
  ListGet gets = ...
  // Prepare my gets with the rows i need
  myHTable.get(gets);
 
  I have two questions about the above scenario:
  i) Is this the most optimal way to do this ?
  ii) I have a feeling that if there are multiple gets in this case, on the
  same region, then each one of those shall instantiate separate scan(s)
 over
  the region even though a single scan is 

Re: coprocessor enabled put very slow, help please~~~

2013-02-19 Thread Michael Segel
Good question.. 

You create a class MyRO. 

How many instances of  MyRO exist per RS?

How many queries can access the instance MyRO at the same time? 




On Feb 19, 2013, at 9:15 AM, Wei Tan w...@us.ibm.com wrote:

 A side question: if HTablePool is not encouraged to be used... how we 
 handle the thread safeness in using HTable? Any replacement for 
 HTablePool, in plan?
 Thanks,
 
 
 Best Regards,
 Wei
 
 
 
 
 From:   Michel Segel michael_se...@hotmail.com
 To: user@hbase.apache.org user@hbase.apache.org, 
 Date:   02/18/2013 09:23 AM
 Subject:Re: coprocessor enabled put very slow, help please~~~
 
 
 
 Why are you using an HTable Pool?
 Why are you closing the table after each iteration through?
 
 Try using 1 HTable object. Turn off WAL
 Initiate in start()
 Close in Stop()
 Surround the use in a try / catch
 If exception caught, re instantiate new HTable connection.
 
 Maybe want to flush the connection after puts. 
 
 
 Again not sure why you are using check and put on the base table. Your 
 count could be off.
 
 As an example look at poem/rhyme 'Marry had a little lamb'.
 Then check your word count.
 
 Sent from a remote device. Please excuse any typos...
 
 Mike Segel
 
 On Feb 18, 2013, at 7:21 AM, prakash kadel prakash.ka...@gmail.com 
 wrote:
 
 Thank you guys for your replies,
 Michael,
  I think i didnt make it clear. Here is my use case,
 
 I have text documents to insert in the hbase. (With possible duplicates)
 Suppose i have a document as :  I am working. He is not working
 
 I want to insert this document to a table in hbase, say table doc
 
 =doc table=
 -
 rowKey : doc_id
 cf: doc_content
 value: I am working. He is not working
 
 Now, i to create another table that stores the word count, say doc_idx
 
 doc_idx table
 ---
 rowKey : I, cf: count, value: 1
 rowKey : am, cf: count, value: 1
 rowKey : working, cf: count, value: 2
 rowKey : He, cf: count, value: 1
 rowKey : is, cf: count, value: 1
 rowKey : not, cf: count, value: 1
 
 My MR job code:
 ==
 
 if(doc.checkAndPut(rowKey, doc_content, , null, putDoc)) {
   for(String word : doc_content.split(\\s+)) {
  Increment inc = new Increment(Bytes.toBytes(word));
  inc.addColumn(count, , 1);
   }
 }
 
 Now, i wanted to do some experiments with coprocessors. So, i modified
 the code as follows.
 
 My MR job code:
 ===
 
 doc.checkAndPut(rowKey, doc_content, , null, putDoc);
 
 Coprocessor code:
 ===
 
   public void start(CoprocessorEnvironment env)  {
   pool = new HTablePool(conf, 100);
   }
 
   public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
 compareOp, comparator,  put, result) {
 
   if(!result) return true; // check if the put succeeded
 
   HTableInterface table_idx = pool.getTable(doc_idx);
 
   try {
 
   for(KeyValue contentKV = put.get(doc_content, )) {
   for(String word :
 contentKV.getValue().split(\\s+)) {
   Increment inc = new
 Increment(Bytes.toBytes(word));
   inc.addColumn(count, , 1);
   table_idx.increment(inc);
   }
  }
   } finally {
   table_idx.close();
   }
   return true;
   }
 
   public void stop(env) {
   pool.close();
   }
 
 I am a newbee to HBASE. I am not sure this is the way to do.
 Given that, why is the cooprocessor enabled version much slower than
 the one without?
 
 
 Sincerely,
 Prakash Kadel
 
 
 On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
 michael_se...@hotmail.com wrote:
 
 The  issue I was talking about was the use of a check and put.
 The OP wrote:
 each map inserts to doc table.(checkAndPut)
 regionobserver coprocessor does a postCheckAndPut and inserts some 
 rows to
 a index table.
 
 My question is why does the OP use a checkAndPut, and the 
 RegionObserver's postChecAndPut?
 
 
 Here's a good example... 
 http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
 
 
 The OP doesn't really get in to the use case, so we don't know why the 
 Check and Put in the M/R job.
 He should just be using put() and then a postPut().
 
 Another issue... since he's writing to  a different HTable... how? Does 
 he create an HTable instance in the start() method of his RO object and 
 then reference it later? Or does he create the instance of the HTable on 
 the fly in each postCheckAndPut() ?
 Without seeing his code, we don't know.
 
 Note that this is synchronous set of writes. Your overall return from 
 the M/R call to put will wait until the second row is inserted.
 
 Interestingly enough, you may want to consider disabling the WAL on the 
 write to the index.  You can always run a M/R job that rebuilds the index 
 should something occur to the system where you might lose the data. 
 Indexes *ARE* expendable. ;-)
 
 Does that explain it?
 
 -Mike
 
 On Feb 18, 2013, at 

Rowkey design question

2013-02-19 Thread Paul van Hoven
Hi,

I'm currently playing with hbase. The design of the rowkey seems to be
critical.

The rowkey for a certain database table of mine is:

timestamp+ipaddress

It looks something like this when performing a scan on the table in the shell:
hbase(main):012:0 scan 'ToyDataTable'
ROW COLUMN+CELL
 135702000+192.168.178.9column=CF:SampleCol,
timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00

Since I got several rows for different timestamps I'd like to tell a
scan to cover just a portion of the table, for example from 2013-01-07 to
2013-01-09. Previously I only had a timestamp as the rowkey and I
could restrict the rowkey like this:

SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date startDate = formatter.parse("2013-01-07 07:00:00");
Date endDate = formatter.parse("2013-01-10 07:00:00");

HTableInterface toyDataTable = pool.getTable("ToyDataTable");
Scan scan = new Scan( Bytes.toBytes( startDate.getTime() ),
                      Bytes.toBytes( endDate.getTime() ) );

But this no longer works with my new design.

Is there a way to tell the scan object to filter the rows with respect
to the timestamp, or do I have to use a filter object?


Re: Rowkey design question

2013-02-19 Thread Mohammad Tariq
Hello Paul,

Try this and see if it works :
    scan.setStartRow(Bytes.toBytes(startDate.getTime() + ""));
    scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ""));

Also try not to use TS as the rowkey, as it may lead to RS hotspotting.
Just add a hash to your rowkeys so that data is distributed evenly on all
the RSs.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven 
paul.van.ho...@googlemail.com wrote:

 Hi,

 I'm currently playing with hbase. The design of the rowkey seems to be
 critical.

 The rowkey for a certain database table of mine is:

 timestamp+ipaddress

 It looks something like this when performing a scan on the table in the
 shell:
 hbase(main):012:0 scan 'ToyDataTable'
 ROW COLUMN+CELL
  135702000+192.168.178.9column=CF:SampleCol,
 timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00

 Since I got several rows for different timestamps I'd like to tell a
 scan to just a region of the table for example from 2013-01-07 to
 2013-01-09. Previously I only had a timestamp as the rowkey and I
 could restrict the rowkey like that:

 SimpleDateFormat formatter = new SimpleDateFormat(-MM-dd HH:mm:ss);
 Date startDate = formatter.parse(2013-01-07
 07:00:00);
 Date endDate = formatter.parse(2013-01-10
 07:00:00);

 HTableInterface toyDataTable =
 pool.getTable(ToyDataTable);
 Scan scan = new Scan( Bytes.toBytes(
 startDate.getTime() ),
 Bytes.toBytes( endDate.getTime() ) );

 But this no longer works with my new design.

 Is there a way to tell the scan object to filter the rows with respect
 to the timestamp, or do I have to use a filter object?



Re: Rowkey design question

2013-02-19 Thread Paul van Hoven
Hey Tariq,

thanks for your quick answer. I'm not sure if I got the idea in the
second part of your answer. You mean if I use a timestamp as a rowkey I
should append a hash like this:

135727920+MD5HASH

and then the data would be distributed more equally?


2013/2/19 Mohammad Tariq donta...@gmail.com:
 Hello Paul,

 Try this and see if it works :
scan.setStartRow(Bytes.toBytes(startDate.getTime() + ));
scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ));

 Also try not to use TS as the rowkey, as it may lead to RS hotspotting.
 Just add a hash to your rowkeys so that data is distributed evenly on all
 the RSs.

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com


 On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven 
 paul.van.ho...@googlemail.com wrote:

 Hi,

 I'm currently playing with hbase. The design of the rowkey seems to be
 critical.

 The rowkey for a certain database table of mine is:

 timestamp+ipaddress

 It looks something like this when performing a scan on the table in the
 shell:
 hbase(main):012:0 scan 'ToyDataTable'
 ROW COLUMN+CELL
  135702000+192.168.178.9column=CF:SampleCol,
 timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00

 Since I got several rows for different timestamps I'd like to tell a
 scan to just a region of the table for example from 2013-01-07 to
 2013-01-09. Previously I only had a timestamp as the rowkey and I
 could restrict the rowkey like that:

 SimpleDateFormat formatter = new SimpleDateFormat(-MM-dd HH:mm:ss);
 Date startDate = formatter.parse(2013-01-07
 07:00:00);
 Date endDate = formatter.parse(2013-01-10
 07:00:00);

 HTableInterface toyDataTable =
 pool.getTable(ToyDataTable);
 Scan scan = new Scan( Bytes.toBytes(
 startDate.getTime() ),
 Bytes.toBytes( endDate.getTime() ) );

 But this no longer works with my new design.

 Is there a way to tell the scan object to filter the rows with respect
 to the timestamp, or do I have to use a filter object?



Re: coprocessor enabled put very slow, help please~~~

2013-02-19 Thread Michael Segel
I should follow up by saying that I was asking why he was using an HTable Pool, not 
saying that it was wrong. 

Still. I think in the pool the writes shouldn't have to go to the WAL. 


On Feb 19, 2013, at 10:01 AM, Michael Segel michael_se...@hotmail.com wrote:

 Good question.. 
 
 You create a class MyRO. 
 
 How many instances of  MyRO exist per RS?
 
 How many queries can access the instance MyRO at the same time? 
 
 
 
 
 On Feb 19, 2013, at 9:15 AM, Wei Tan w...@us.ibm.com wrote:
 
 A side question: if HTablePool is not encouraged to be used... how we 
 handle the thread safeness in using HTable? Any replacement for 
 HTablePool, in plan?
 Thanks,
 
 
 Best Regards,
 Wei
 
 
 
 
 From:   Michel Segel michael_se...@hotmail.com
 To: user@hbase.apache.org user@hbase.apache.org, 
 Date:   02/18/2013 09:23 AM
 Subject:Re: coprocessor enabled put very slow, help please~~~
 
 
 
 Why are you using an HTable Pool?
 Why are you closing the table after each iteration through?
 
 Try using 1 HTable object. Turn off WAL
 Initiate in start()
 Close in Stop()
 Surround the use in a try / catch
 If exception caught, re instantiate new HTable connection.
 
 Maybe want to flush the connection after puts. 
 
 
 Again not sure why you are using check and put on the base table. Your 
 count could be off.
 
 As an example look at poem/rhyme 'Marry had a little lamb'.
 Then check your word count.
 
 Sent from a remote device. Please excuse any typos...
 
 Mike Segel
 
 On Feb 18, 2013, at 7:21 AM, prakash kadel prakash.ka...@gmail.com 
 wrote:
 
 Thank you guys for your replies,
 Michael,
 I think i didnt make it clear. Here is my use case,
 
 I have text documents to insert in the hbase. (With possible duplicates)
 Suppose i have a document as :  I am working. He is not working
 
 I want to insert this document to a table in hbase, say table doc
 
 =doc table=
 -
 rowKey : doc_id
 cf: doc_content
 value: I am working. He is not working
 
 Now, i to create another table that stores the word count, say doc_idx
 
 doc_idx table
 ---
 rowKey : I, cf: count, value: 1
 rowKey : am, cf: count, value: 1
 rowKey : working, cf: count, value: 2
 rowKey : He, cf: count, value: 1
 rowKey : is, cf: count, value: 1
 rowKey : not, cf: count, value: 1
 
 My MR job code:
 ==
 
 if(doc.checkAndPut(rowKey, doc_content, , null, putDoc)) {
  for(String word : doc_content.split(\\s+)) {
 Increment inc = new Increment(Bytes.toBytes(word));
 inc.addColumn(count, , 1);
  }
 }
 
 Now, i wanted to do some experiments with coprocessors. So, i modified
 the code as follows.
 
 My MR job code:
 ===
 
 doc.checkAndPut(rowKey, doc_content, , null, putDoc);
 
 Coprocessor code:
 ===
 
  public void start(CoprocessorEnvironment env)  {
  pool = new HTablePool(conf, 100);
  }
 
  public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
 compareOp, comparator,  put, result) {
 
  if(!result) return true; // check if the put succeeded
 
  HTableInterface table_idx = pool.getTable(doc_idx);
 
  try {
 
  for(KeyValue contentKV = put.get(doc_content, )) {
  for(String word :
 contentKV.getValue().split(\\s+)) {
  Increment inc = new
 Increment(Bytes.toBytes(word));
  inc.addColumn(count, , 1);
  table_idx.increment(inc);
  }
 }
  } finally {
  table_idx.close();
  }
  return true;
  }
 
  public void stop(env) {
  pool.close();
  }
 
 I am a newbee to HBASE. I am not sure this is the way to do.
 Given that, why is the cooprocessor enabled version much slower than
 the one without?
 
 
 Sincerely,
 Prakash Kadel
 
 
 On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
 michael_se...@hotmail.com wrote:
 
 The  issue I was talking about was the use of a check and put.
 The OP wrote:
 each map inserts to doc table.(checkAndPut)
 regionobserver coprocessor does a postCheckAndPut and inserts some 
 rows to
 a index table.
 
 My question is why does the OP use a checkAndPut, and the 
 RegionObserver's postChecAndPut?
 
 
 Here's a good example... 
 http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
 
 
 The OP doesn't really get in to the use case, so we don't know why the 
 Check and Put in the M/R job.
 He should just be using put() and then a postPut().
 
 Another issue... since he's writing to  a different HTable... how? Does 
 he create an HTable instance in the start() method of his RO object and 
 then reference it later? Or does he create the instance of the HTable on 
 the fly in each postCheckAndPut() ?
 Without seeing his code, we don't know.
 
 Note that this is synchronous set of writes. Your overall return from 
 the M/R call to put will wait until the second row is inserted.
 
 Interestingly enough, you may want to consider disabling the 

Re: Optimizing Multi Gets in hbase

2013-02-19 Thread Nicolas Liochon
Imho,  the easiest thing to do would be to write a filter.
You need to order the rows, then you can use hints to navigate to the next
row (SEEK_NEXT_USING_HINT).
The main drawback I see is that the filter will be invoked on all regions
servers, including the ones that don't need it. But this would also mean
you have a very specific query pattern (which could be the case, I just
don't know), and you can still use the startRow / stopRow of the scan, and
create multiple scan if necessary. I'm also interested in Lars' opinion on
this.

Nicolas
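
(A rough sketch of such a hinting filter against the 0.94-era filter API. The class name is made up, the key list must be sorted ascending, and the filter class has to be on the region servers' classpath; serialization here is the plain Writable form.)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

// Only lets rows whose key is in a given sorted set pass, and uses
// SEEK_NEXT_USING_HINT to jump the scanner directly to the next wanted key.
public class SortedRowSetFilter extends FilterBase {
  private List<byte[]> sortedKeys = new ArrayList<byte[]>(); // sorted ascending
  private int current = 0;

  public SortedRowSetFilter() {}                             // needed for deserialization
  public SortedRowSetFilter(List<byte[]> sortedKeys) { this.sortedKeys = sortedKeys; }

  @Override
  public ReturnCode filterKeyValue(KeyValue kv) {
    byte[] row = kv.getRow();
    // skip wanted keys that the scan has already passed
    while (current < sortedKeys.size()
        && Bytes.compareTo(sortedKeys.get(current), row) < 0) {
      current++;
    }
    if (current >= sortedKeys.size()) {
      return ReturnCode.NEXT_ROW;                            // nothing left to look for
    }
    if (Bytes.compareTo(sortedKeys.get(current), row) == 0) {
      return ReturnCode.INCLUDE;                             // one of the wanted rows
    }
    return ReturnCode.SEEK_NEXT_USING_HINT;                  // jump to the next wanted key
  }

  @Override
  public KeyValue getNextKeyHint(KeyValue currentKV) {
    return KeyValue.createFirstOnRow(sortedKeys.get(current));
  }

  @Override
  public boolean filterAllRemaining() {
    return current >= sortedKeys.size();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(sortedKeys.size());
    for (byte[] key : sortedKeys) {
      Bytes.writeByteArray(out, key);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    int n = in.readInt();
    sortedKeys = new ArrayList<byte[]>(n);
    for (int i = 0; i < n; i++) {
      sortedKeys.add(Bytes.readByteArray(in));
    }
    current = 0;
  }
}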



On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma va...@pinterest.com wrote:

 I have another question, if I am running a scan wrapped around multiple
 rows in the same region, in the following way:

 Scan scan = new scan(getWithMultipleRowsInSameRegion);

 Now, how does execution occur. Is it just a sequential scan across the
 entire region or does it seek to hfile blocks containing the actual values.
 What I truly mean is, lets say the multi get is on following rows:

 Row1 : HFileBlock1
 Row2 : HFileBlock20
 Row3 : Does not exist
 Row4 : HFileBlock25
 Row5 : HFileBlock100

 The efficient way to do this would be to determine the correct blocks using
 the index and then searching within the blocks for, say Row1. Then, seek to
 HFileBlock20 and then look for Row2. Elimininate Row3 and then keep on
 seeking to + searching within HFileBlocks as needed.

 I am wondering if a scan wrapped around a Get with multiple rows would do
 the same ?

 Thanks
 Varun

 On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon nkey...@gmail.com
 wrote:

  Looking at the code, it seems possible to do this server side within the
  multi invocation: we could group the get by region, and do a single scan.
  We could also add some heuristics if necessary...
 
 
 
  On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl la...@apache.org wrote:
 
   I should qualify that statement, actually.
  
   I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
   returned.
  
   As James Taylor pointed out to me privately: A fairer comparison would
   have been to run a scan with a filter that lets x% of the rows pass
 (i.e.
   the selectivity of the scan would be x%) and compare that to a multi
 Get
  of
   the same x% of the row.
  
   There we found that a Scan+Filter is more efficient that issuing multi
   Gets if x is = 1-2%.
  
  
   Or in other words, translating many Gets into a Scan+Filter is
 beneficial
   if the Scan would return at least 1-2% of the rows to the client.
   For example:
   if you are looking for less than 10-20k rows in 1m rows, using muli
 Gets
   is likely more efficient.
   if you are looking for more than 10-20k rows in 1m rows, using a
   Scan+Filter is likely more efficient.
  
  
   Of course this is predicated on whether you have an efficient way to
   represent the rows you are looking for in a filter, so that would
  probably
   shift this slightly more towards Gets (just imaging a Filter that to
  encode
   100k random row keys to be matched; since Filters are instantiated
 store
   there is another natural limit there).
  
  
   As I said below, the crux of the matter is having some histograms of
 your
   data, so that such a decision could be made automatically.
  
  
   -- Lars
  
  
  
   
From: lars hofhansl la...@apache.org
   To: user@hbase.apache.org user@hbase.apache.org
   Sent: Monday, February 18, 2013 5:48 PM
   Subject: Re: Optimizing Multi Gets in hbase
  
   As it happens we did some tests around last week.
   Turns out doing Gets in batches instead of a scan still gives you 1/3
 of
   the performance.
  
   I.e. when you have a table with, say, 10m rows and scanning take N
   seconds, then calling 10m Gets in batches of 1000 take ~3N, which is
  pretty
   impressive.
  
   Now, this is with all data in the cache!
   When the data is not in the cache and the Gets are random it is many
   orders of magnitude slower, as the Gets are sprayed all over the disk.
 In
   that case sorting the Gets and issuing scans would indeed be much more
   efficient.
  
  
   The Gets in a batch are already sorted on the client, but as N. says it
  is
   hard to determine when to turn many Gets into a Scan with filters
   automatically. Without statistics/histograms I'd even wager a guess
 that
   would be impossible to do.
   Imagine you issue 1 random Gets, but your table has 10bn rows, in
  that
   case it is almost certain that the Gets are faster than a scan.
   Now image the Gets only cover a small key range. With statistics we
 could
   tell whether it would beneficial to turn this into a scan.
  
   It's not that hard to add statistics to HBase. Would do it as part of
 the
   compactions, and record the histograms in some table.
  
  
   You can always do that yourself. If you suspect you are touching most
  rows
   in a table/region, just issue a scan with a appropriate filter (may
 have
  to
   implement your own filter, 

Re: Using HBase for Deduping

2013-02-19 Thread Rahul Ravindran
I could surround it with a try..catch, but then each time I insert a UUID 
for the first time (99% of the time), I would do a checkAndPut(), catch the 
resulting exception and perform a Put; so, 2 operations per reduce invocation, 
which is what I was looking to avoid.



 From: Michael Segel michael_se...@hotmail.com
To: user@hbase.apache.org; Rahul Ravindran rahu...@yahoo.com 
Sent: Friday, February 15, 2013 9:24 AM
Subject: Re: Using HBase for Deduping
 

Interesting. 

Surround with a Try Catch? 

But it sounds like you're on the right path. 

Happy Coding!


On Feb 15, 2013, at 11:12 AM, Rahul Ravindran rahu...@yahoo.com wrote:

I had tried checkAndPut yesterday with a null passed as the value and it had 
thrown an exception when the row did not exist. Perhaps, I was doing something 
wrong. Will try that again, since, yes, I would prefer a checkAndPut().



From: Michael Segel michael_se...@hotmail.com
To: user@hbase.apache.org 
Cc: Rahul Ravindran rahu...@yahoo.com 
Sent: Friday, February 15, 2013 4:36 AM
Subject: Re: Using HBase for Deduping


On Feb 15, 2013, at 3:07 AM, Asaf Mesika asaf.mes...@gmail.com wrote:


Michael, this means read for every write?

Yes and no. 

At the macro level, a read for every write would mean that your client would 
read a record from HBase, and then based on some logic it would either write a 
record, or not. 

So that you have a lot of overhead in the initial get() and then put(). 

At this macro level, with a Check and Put you have less overhead because of a 
single message to HBase.

Internal to HBase, you would still have to check the value in the row, if it 
exists, and then perform an insert or not. 

With respect to your billion events an hour... 

Dividing by 3600 to get the number of events per second, you would have less 
than 300,000 events a second. 

What exactly are you doing and how large are those events? 

Since you are processing these events in a batch job, timing doesn't appear to 
be that important and of course there is also async hbase which may improve 
some of the performance. 

YMMV but this is a good example of the checkAndPut()




On Friday, February 15, 2013, Michael Segel wrote:


What constitutes a duplicate?

An over simplification is to do a HTable.checkAndPut() where you do the
put if the column doesn't exist.
Then if the row is inserted (TRUE return value), you push the event.

That will do what you want.

At least at first blush.
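
(A minimal sketch of that check-and-put dedupe, assuming the UUID is the row key and a single marker column; the family/qualifier names are made up. If it returns true the event is new and gets pushed; if false it is a dupe.)

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class Deduper {
  private static final byte[] CF  = Bytes.toBytes("d");
  private static final byte[] COL = Bytes.toBytes("seen");

  // Returns true exactly once per UUID: the first caller wins, later callers get false.
  static boolean firstTimeSeen(HTable dedupeTable, byte[] uuid) throws IOException {
    Put put = new Put(uuid);
    put.add(CF, COL, Bytes.toBytes(true));
    // a null expected value means "apply the put only if the cell does not exist yet"
    return dedupeTable.checkAndPut(uuid, CF, COL, null, put);
  }
}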



On Feb 14, 2013, at 3:24 PM, Viral Bajaria viral.baja...@gmail.com
wrote:


Given the size of the data ( 1B rows) and the frequency of job run (once
per hour), I don't think your most optimal solution is to lookup HBase
for

every single event. You will benefit more by loading the HBase table
directly in your MR job.

In 1B rows, what's the cardinality ? Is it 100M UUID's ? 99% unique
UUID's ?


Also once you have done the unique, are you going to use the data again
in

some other way i.e. online serving of traffic or some other analysis ? Or
this is just to compute some unique #'s ?

It will be more helpful if you describe your final use case of the
computed

data too. Given the amount of back and forth, we can take it off list too
and summarize the conversation for the list.

On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran rahu...@yahoo.com
wrote:



We can't rely on the assumption that event dupes will not dupe outside an
hour boundary. So, your take is that doing a lookup per event within
the

MR job is going to be bad?



From: Viral Bajaria viral.baja...@gmail.com
To: Rahul Ravindran rahu...@yahoo.com
Cc: user@hbase.apache.org user@hbase.apache.org
Sent: Thursday, February 14, 2013 12:48 PM
Subject: Re: Using HBase for Deduping

You could do with a 2-pronged approach here i.e. some MR and some HBase
lookups. I don't think this is the best solution either given the # of
events you will get.

FWIW, the solution below again relies on the assumption that if a event
is

duped in the same hour it won't have a dupe outside of that hour
boundary.

If it can have then you are better of with running a MR job with the
current hour + another 3 hours of data or an MR job with the current
hour +

the HBase table as input to the job too (i.e. no HBase lookups, just
read

the HFile directly) ?

- Run a MR job which de-dupes events for the current hour i.e. only
runs on

1 hour worth of data.
- Mark records which you were not able to de-dupe in the current run
- For the records that you were not able to de-dupe, check against HBase
whether you saw that event in the past. If you did, you can drop the
current event or update the event to the new value (based on your
business

logic)
- Save all the de-duped events (via HBase bulk upload)

Sorry if I just rambled along, but without knowing the whole problem
it's

very tough to come up with a probable solution. So correct my
assumptions

and we could drill down more.


Re: Co-Processor in scanning the HBase's Table

2013-02-19 Thread Farrokh Shahriari
Thank you guys

On Mon, Feb 18, 2013 at 12:00 PM, Amit Sela am...@infolinks.com wrote:

 Yes... that was emailing half asleep... :)

 On Mon, Feb 18, 2013 at 7:23 AM, Anoop Sam John anoo...@huawei.com
 wrote:

  We don't have any hook like postScan(). In your case you can try
  postScannerClose(). This will be called once per region. When the scan on
  that region is over, the scanner opened on that region will get closed and
  at that time this hook will get executed.
 
  -Anoop-
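
  (A bare-bones sketch of an observer using that hook, assuming the 0.94-era RegionObserver signature; the body is just a placeholder for the custom per-region code.)

  import java.io.IOException;

  import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
  import org.apache.hadoop.hbase.coprocessor.ObserverContext;
  import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
  import org.apache.hadoop.hbase.regionserver.InternalScanner;

  // Runs once per region, when the scanner opened on that region is closed.
  public class PostScanObserver extends BaseRegionObserver {
    @Override
    public void postScannerClose(ObserverContext<RegionCoprocessorEnvironment> c,
                                 InternalScanner s) throws IOException {
      String region = c.getEnvironment().getRegion().getRegionNameAsString();
      // place the per-region custom code here
      System.out.println("Finished scanning region " + region);
    }
  }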
  
  From: Farrokh Shahriari [mohandes.zebeleh...@gmail.com]
  Sent: Monday, February 18, 2013 10:27 AM
  To: user@hbase.apache.org
  Cc: cdh-u...@cloudera.org
  Subject: Re: Co-Processor in scanning the HBase's Table
 
   Thank you Amit, I will check that.
   @Anoop: I wanna run that just after scanning a region or after scanning the
   regions that belong to one regionserver.
 
  On Mon, Feb 18, 2013 at 7:45 AM, Anoop Sam John anoo...@huawei.com
  wrote:
 
   I wanna use a custom code after scanning a large table and prefer to
 run
   the code after scanning each region
  
   Exactly at what point you want to run your custom code?  We have hooks
 at
   points like opening a scanner at a region, closing scanner at a region,
   calling next (pre/post) etc
  
   -Anoop-
   
   From: Farrokh Shahriari [mohandes.zebeleh...@gmail.com]
   Sent: Monday, February 18, 2013 12:21 AM
   To: cdh-u...@cloudera.org; user@hbase.apache.org
   Subject: Co-Processor in scanning the HBase's Table
  
   Hi there
   I wanna use a custom code after scanning a large table and prefer to
 run
   the code after scanning each region.I know that I should use
   co-processor,but don't know which of Observer,Endpoint or both of them
 I
   should use ? Is there any simple example of them ?
  
   Tnx
  
 



Re: Rowkey design question

2013-02-19 Thread Mohammad Tariq
No, before the timestamp. Row keys that are close to each other go to the
same region. This is the default HBase behavior and is meant to make the
performance better. But sometimes one machine gets overloaded with reads
and writes because they get concentrated on that particular machine, for
example with timeseries data. So it's better to hash the keys in order to make
them go to all the machines equally. HTH

BTW, did that range query work??

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Tue, Feb 19, 2013 at 9:54 PM, Paul van Hoven 
paul.van.ho...@googlemail.com wrote:

 Hey Tariq,

 thanks for your quick answer. I'm not sure if I got the idea in the
 seond part of your answer. You mean if I use a timestamp as a rowkey I
 should append a hash like this:

 135727920+MD5HASH

 and then the data would be distributed more equally?


 2013/2/19 Mohammad Tariq donta...@gmail.com:
  Hello Paul,
 
  Try this and see if it works :
 scan.setStartRow(Bytes.toBytes(startDate.getTime() + ));
 scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ));
 
  Also try not to use TS as the rowkey, as it may lead to RS hotspotting.
  Just add a hash to your rowkeys so that data is distributed evenly on all
  the RSs.
 
  Warm Regards,
  Tariq
  https://mtariq.jux.com/
  cloudfront.blogspot.com
 
 
  On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven 
  paul.van.ho...@googlemail.com wrote:
 
  Hi,
 
  I'm currently playing with hbase. The design of the rowkey seems to be
  critical.
 
  The rowkey for a certain database table of mine is:
 
  timestamp+ipaddress
 
  It looks something like this when performing a scan on the table in the
  shell:
  hbase(main):012:0 scan 'ToyDataTable'
  ROW COLUMN+CELL
   135702000+192.168.178.9column=CF:SampleCol,
  timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00
 
  Since I got several rows for different timestamps I'd like to tell a
  scan to just a region of the table for example from 2013-01-07 to
  2013-01-09. Previously I only had a timestamp as the rowkey and I
  could restrict the rowkey like that:
 
  SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  Date startDate = formatter.parse("2013-01-07 07:00:00");
  Date endDate = formatter.parse("2013-01-10 07:00:00");
 
  HTableInterface toyDataTable = pool.getTable("ToyDataTable");
  Scan scan = new Scan( Bytes.toBytes( startDate.getTime() ),
      Bytes.toBytes( endDate.getTime() ) );
 
  But this no longer works with my new design.
 
  Is there a way to tell the scan object to filter the rows with respect
  to the timestamp, or do I have to use a filter object?
 



Re: Rowkey design question

2013-02-19 Thread Paul van Hoven
Yeah it worked fine.

But as I understand: If I prefix my row key with something like

md5-hash + timestamp

then the rowkeys are probably evenly distributed, but how would I
then perform a scan restricted to a specific time range?


2013/2/19 Mohammad Tariq donta...@gmail.com:
 No. before the timestamp. All the row keys which are identical go to the
 same region. This is the default Hbase behavior and is meant to make the
 performance better. But sometimes the machine gets overloaded with reads
 and writes because we get concentrated on that particular machine. For
 example timeseries data. So it's better to hash the keys in order to make
 them go to all the machines equally. HTH

 BTW, did that range query work??

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com


 On Tue, Feb 19, 2013 at 9:54 PM, Paul van Hoven 
 paul.van.ho...@googlemail.com wrote:

 Hey Tariq,

 thanks for your quick answer. I'm not sure if I got the idea in the
 seond part of your answer. You mean if I use a timestamp as a rowkey I
 should append a hash like this:

 135727920+MD5HASH

 and then the data would be distributed more equally?


 2013/2/19 Mohammad Tariq donta...@gmail.com:
  Hello Paul,
 
  Try this and see if it works :
 scan.setStartRow(Bytes.toBytes(startDate.getTime() + ));
 scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ));
 
  Also try not to use TS as the rowkey, as it may lead to RS hotspotting.
  Just add a hash to your rowkeys so that data is distributed evenly on all
  the RSs.
 
  Warm Regards,
  Tariq
  https://mtariq.jux.com/
  cloudfront.blogspot.com
 
 
  On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven 
  paul.van.ho...@googlemail.com wrote:
 
  Hi,
 
  I'm currently playing with hbase. The design of the rowkey seems to be
  critical.
 
  The rowkey for a certain database table of mine is:
 
  timestamp+ipaddress
 
  It looks something like this when performing a scan on the table in the
  shell:
  hbase(main):012:0 scan 'ToyDataTable'
  ROW COLUMN+CELL
   135702000+192.168.178.9column=CF:SampleCol,
  timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00
 
  Since I got several rows for different timestamps I'd like to tell a
  scan to just a region of the table for example from 2013-01-07 to
  2013-01-09. Previously I only had a timestamp as the rowkey and I
  could restrict the rowkey like that:
 
  SimpleDateFormat formatter = new SimpleDateFormat(-MM-dd
 HH:mm:ss);
  Date startDate = formatter.parse(2013-01-07
  07:00:00);
  Date endDate = formatter.parse(2013-01-10
  07:00:00);
 
  HTableInterface toyDataTable =
  pool.getTable(ToyDataTable);
  Scan scan = new Scan( Bytes.toBytes(
  startDate.getTime() ),
  Bytes.toBytes( endDate.getTime() ) );
 
  But this no longer works with my new design.
 
  Is there a way to tell the scan object to filter the rows with respect
  to the timestamp, or do I have to use a filter object?
 



Re: Rowkey design question

2013-02-19 Thread Mohammad Tariq
You can use FuzzyRowFilter
(http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FuzzyRowFilter.html)
to do that.

Have a look at this link
(http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/).
You might find it helpful.
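
(For illustration, a rough sketch of using FuzzyRowFilter with a 16-byte hash
prefix and an 8-byte timestamp -- the key layout and values are assumptions, not
from this thread. Note that FuzzyRowFilter matches exact byte values at the
fixed positions, so this finds rows for one specific timestamp; a real time
range needs one fuzzy-key entry per value, or a different key design:)

import java.util.Arrays;
import java.util.Collections;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

long ts = 1357520400000L;                     // hypothetical timestamp to match
byte[] pattern = new byte[16 + 8];            // 16-byte MD5 prefix + 8-byte long
System.arraycopy(Bytes.toBytes(ts), 0, pattern, 16, 8);
byte[] mask = new byte[16 + 8];
Arrays.fill(mask, 0, 16, (byte) 1);           // 1 = this position may be anything
Arrays.fill(mask, 16, 24, (byte) 0);          // 0 = this position must match
Scan scan = new Scan();
scan.setFilter(new FuzzyRowFilter(
    Collections.singletonList(new Pair<byte[], byte[]>(pattern, mask))));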

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Tue, Feb 19, 2013 at 11:20 PM, Paul van Hoven 
paul.van.ho...@googlemail.com wrote:

 Yeah it worked fine.

 But as I understand: If I prefix my row key with something like

 md5-hash + timestamp

 then the rowkeys are probably evenly distributed but how would I
 perform then a scan restricted to a special time range?


 2013/2/19 Mohammad Tariq donta...@gmail.com:
  No. before the timestamp. All the row keys which are identical go to the
  same region. This is the default Hbase behavior and is meant to make the
  performance better. But sometimes the machine gets overloaded with reads
  and writes because we get concentrated on that particular machine. For
  example timeseries data. So it's better to hash the keys in order to make
  them go to all the machines equally. HTH
 
  BTW, did that range query work??
 
  Warm Regards,
  Tariq
  https://mtariq.jux.com/
  cloudfront.blogspot.com
 
 
  On Tue, Feb 19, 2013 at 9:54 PM, Paul van Hoven 
  paul.van.ho...@googlemail.com wrote:
 
  Hey Tariq,
 
  thanks for your quick answer. I'm not sure if I got the idea in the
  seond part of your answer. You mean if I use a timestamp as a rowkey I
  should append a hash like this:
 
  135727920+MD5HASH
 
  and then the data would be distributed more equally?
 
 
  2013/2/19 Mohammad Tariq donta...@gmail.com:
   Hello Paul,
  
   Try this and see if it works :
  scan.setStartRow(Bytes.toBytes(startDate.getTime() + ));
  scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ));
  
   Also try not to use TS as the rowkey, as it may lead to RS
 hotspotting.
   Just add a hash to your rowkeys so that data is distributed evenly on
 all
   the RSs.
  
   Warm Regards,
   Tariq
   https://mtariq.jux.com/
   cloudfront.blogspot.com
  
  
   On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven 
   paul.van.ho...@googlemail.com wrote:
  
   Hi,
  
   I'm currently playing with hbase. The design of the rowkey seems to
 be
   critical.
  
   The rowkey for a certain database table of mine is:
  
   timestamp+ipaddress
  
   It looks something like this when performing a scan on the table in
 the
   shell:
   hbase(main):012:0 scan 'ToyDataTable'
   ROW COLUMN+CELL
135702000+192.168.178.9column=CF:SampleCol,
   timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00
  
   Since I got several rows for different timestamps I'd like to tell a
   scan to just a region of the table for example from 2013-01-07 to
   2013-01-09. Previously I only had a timestamp as the rowkey and I
   could restrict the rowkey like that:
  
   SimpleDateFormat formatter = new SimpleDateFormat(-MM-dd
  HH:mm:ss);
   Date startDate = formatter.parse(2013-01-07
   07:00:00);
   Date endDate = formatter.parse(2013-01-10
   07:00:00);
  
   HTableInterface toyDataTable =
   pool.getTable(ToyDataTable);
   Scan scan = new Scan( Bytes.toBytes(
   startDate.getTime() ),
   Bytes.toBytes( endDate.getTime() ) );
  
   But this no longer works with my new design.
  
   Is there a way to tell the scan object to filter the rows with
 respect
   to the timestamp, or do I have to use a filter object?
  
 



Re: Optimizing Multi Gets in hbase

2013-02-19 Thread Varun Sharma
The other suggestion sounds better to me, where the multi call is modified
to run the Get(s) with this new filter, or to just initiate a scan with all the
Get(s). Since the client automatically groups the multi calls by region
server and only calls the respective region servers, that would eliminate
calling all region servers. This might be an interesting benchmark to run.

On Tue, Feb 19, 2013 at 9:28 AM, Nicolas Liochon nkey...@gmail.com wrote:

 Imho, the easiest thing to do would be to write a filter.
 You need to order the rows, then you can use hints to navigate to the next
 row (SEEK_NEXT_USING_HINT).
 The main drawback I see is that the filter will be invoked on all region
 servers, including the ones that don't need it. But this would also mean
 you have a very specific query pattern (which could be the case, I just
 don't know), and you can still use the startRow / stopRow of the scan, and
 create multiple scans if necessary. I'm also interested in Lars' opinion on
 this.

 Nicolas



 On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma va...@pinterest.com wrote:

  I have another question, if I am running a scan wrapped around multiple
  rows in the same region, in the following way:
 
  Scan scan = new scan(getWithMultipleRowsInSameRegion);
 
  Now, how does execution occur. Is it just a sequential scan across the
  entire region or does it seek to hfile blocks containing the actual
 values.
  What I truly mean is, lets say the multi get is on following rows:
 
  Row1 : HFileBlock1
  Row2 : HFileBlock20
  Row3 : Does not exist
  Row4 : HFileBlock25
  Row5 : HFileBlock100
 
  The efficient way to do this would be to determine the correct blocks
 using
  the index and then searching within the blocks for, say Row1. Then, seek
 to
  HFileBlock20 and then look for Row2. Eliminate Row3 and then keep on
  seeking to + searching within HFileBlocks as needed.
 
  I am wondering if a scan wrapped around a Get with multiple rows would do
  the same ?
 
  Thanks
  Varun
 
  On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon nkey...@gmail.com
  wrote:
 
   Looking at the code, it seems possible to do this server side within
 the
   multi invocation: we could group the get by region, and do a single
 scan.
   We could also add some heuristics if necessary...
  
  
  
   On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl la...@apache.org
 wrote:
  
I should qualify that statement, actually.
   
I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
returned.
   
As James Taylor pointed out to me privately: A fairer comparison
 would
have been to run a scan with a filter that lets x% of the rows pass
  (i.e.
the selectivity of the scan would be x%) and compare that to a multi
  Get
   of
the same x% of the row.
   
There we found that a Scan+Filter is more efficient that issuing
 multi
Gets if x is = 1-2%.
   
   
Or in other words, translating many Gets into a Scan+Filter is
  beneficial
if the Scan would return at least 1-2% of the rows to the client.
For example:
if you are looking for less than 10-20k rows in 1m rows, using muli
  Gets
is likely more efficient.
if you are looking for more than 10-20k rows in 1m rows, using a
Scan+Filter is likely more efficient.
   
   
Of course this is predicated on whether you have an efficient way to
represent the rows you are looking for in a filter, so that would
   probably
shift this slightly more towards Gets (just imaging a Filter that to
   encode
100k random row keys to be matched; since Filters are instantiated
  store
there is another natural limit there).
   
   
As I said below, the crux of the matter is having some histograms of
  your
data, so that such a decision could be made automatically.
   
   
-- Lars
   
   
   

 From: lars hofhansl la...@apache.org
To: user@hbase.apache.org user@hbase.apache.org
Sent: Monday, February 18, 2013 5:48 PM
Subject: Re: Optimizing Multi Gets in hbase
   
As it happens we did some tests around last week.
Turns out doing Gets in batches instead of a scan still gives you 1/3
  of
the performance.
   
I.e. when you have a table with, say, 10m rows and scanning take N
seconds, then calling 10m Gets in batches of 1000 take ~3N, which is
   pretty
impressive.
   
Now, this is with all data in the cache!
When the data is not in the cache and the Gets are random it is many
orders of magnitude slower, as the Gets are sprayed all over the
 disk.
  In
that case sorting the Gets and issuing scans would indeed be much
 more
efficient.
   
   
The Gets in a batch are already sorted on the client, but as N. says
 it
   is
hard to determine when to turn many Gets into a Scan with filters
automatically. Without statistics/histograms I'd even wager a guess
  that
would be impossible to do.
Imagine you issue 1 random Gets, 

Re: Optimizing Multi Gets in hbase

2013-02-19 Thread lars hofhansl
I was thinking along the same lines: doing a skip scan via filter hinting. The
problem is, as you say, that the Filter is instantiated everywhere and it might
be of significant size (it has to maintain all row keys you are looking for).


RegionScanner now has a reseek method, so it is possible to do this via a
coprocessor. Coprocessors are also loaded per region (but at least not for each
store), and one can use the shared coproc state I added to alleviate the memory
concern.

Thinking about this in terms of multiple scans is interesting. One could
identify clusters of close row keys in the Gets and issue a Scan for each
cluster.
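
(For concreteness, a rough sketch of such a hinting filter against the 0.94
Filter API -- the class below is invented for illustration and is not an actual
HBase filter; a real one would also need Writable serialization so it can be
shipped to the region servers:)

import java.util.SortedSet;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;

public class MultiRowHintFilter extends FilterBase {
    // Wanted row keys, sorted with Bytes.BYTES_COMPARATOR.
    private final SortedSet<byte[]> wantedRows;
    private boolean done = false;

    public MultiRowHintFilter(SortedSet<byte[]> wantedRows) {
        this.wantedRows = wantedRows;
    }

    @Override
    public boolean filterAllRemaining() {
        return done;  // stop the scan once we are past the last wanted key
    }

    @Override
    public ReturnCode filterKeyValue(KeyValue kv) {
        byte[] row = kv.getRow();
        if (wantedRows.contains(row)) {
            return ReturnCode.INCLUDE;
        }
        if (wantedRows.tailSet(row).isEmpty()) {
            done = true;              // nothing wanted after this row
            return ReturnCode.NEXT_ROW;
        }
        return ReturnCode.SEEK_NEXT_USING_HINT;
    }

    @Override
    public KeyValue getNextKeyHint(KeyValue currentKV) {
        // Seek to the first wanted row key after the current one; only called
        // after filterKeyValue returned SEEK_NEXT_USING_HINT, so it exists.
        byte[] next = wantedRows.tailSet(currentKV.getRow()).first();
        return KeyValue.createFirstOnRow(next);
    }
}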


-- Lars




 From: Nicolas Liochon nkey...@gmail.com
To: user user@hbase.apache.org 
Sent: Tuesday, February 19, 2013 9:28 AM
Subject: Re: Optimizing Multi Gets in hbase
 
Imho,  the easiest thing to do would be to write a filter.
You need to order the rows, then you can use hints to navigate to the next
row (SEEK_NEXT_USING_HINT).
The main drawback I see is that the filter will be invoked on all regions
servers, including the ones that don't need it. But this would also means
you have a very specific query pattern (which could be the case, I just
don't know), and you can still use the startRow / stopRow of the scan, and
create multiple scan if necessary. I'm also interested in Lars' opinion on
this.

Nicolas



On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma va...@pinterest.com wrote:

 I have another question, if I am running a scan wrapped around multiple
 rows in the same region, in the following way:

 Scan scan = new scan(getWithMultipleRowsInSameRegion);

 Now, how does execution occur. Is it just a sequential scan across the
 entire region or does it seek to hfile blocks containing the actual values.
 What I truly mean is, lets say the multi get is on following rows:

 Row1 : HFileBlock1
 Row2 : HFileBlock20
 Row3 : Does not exist
 Row4 : HFileBlock25
 Row5 : HFileBlock100

 The efficient way to do this would be to determine the correct blocks using
 the index and then searching within the blocks for, say Row1. Then, seek to
 HFileBlock20 and then look for Row2. Elimininate Row3 and then keep on
 seeking to + searching within HFileBlocks as needed.

 I am wondering if a scan wrapped around a Get with multiple rows would do
 the same ?

 Thanks
 Varun

 On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon nkey...@gmail.com
 wrote:

  Looking at the code, it seems possible to do this server side within the
  multi invocation: we could group the get by region, and do a single scan.
  We could also add some heuristics if necessary...
 
 
 
  On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl la...@apache.org wrote:
 
   I should qualify that statement, actually.
  
   I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
   returned.
  
   As James Taylor pointed out to me privately: A fairer comparison would
   have been to run a scan with a filter that lets x% of the rows pass
 (i.e.
   the selectivity of the scan would be x%) and compare that to a multi
 Get
  of
   the same x% of the row.
  
   There we found that a Scan+Filter is more efficient that issuing multi
   Gets if x is = 1-2%.
  
  
   Or in other words, translating many Gets into a Scan+Filter is
 beneficial
   if the Scan would return at least 1-2% of the rows to the client.
   For example:
   if you are looking for less than 10-20k rows in 1m rows, using muli
 Gets
   is likely more efficient.
   if you are looking for more than 10-20k rows in 1m rows, using a
   Scan+Filter is likely more efficient.
  
  
   Of course this is predicated on whether you have an efficient way to
   represent the rows you are looking for in a filter, so that would
  probably
   shift this slightly more towards Gets (just imaging a Filter that to
  encode
   100k random row keys to be matched; since Filters are instantiated
 store
   there is another natural limit there).
  
  
   As I said below, the crux of the matter is having some histograms of
 your
   data, so that such a decision could be made automatically.
  
  
   -- Lars
  
  
  
   
    From: lars hofhansl la...@apache.org
   To: user@hbase.apache.org user@hbase.apache.org
   Sent: Monday, February 18, 2013 5:48 PM
   Subject: Re: Optimizing Multi Gets in hbase
  
   As it happens we did some tests around last week.
   Turns out doing Gets in batches instead of a scan still gives you 1/3
 of
   the performance.
  
   I.e. when you have a table with, say, 10m rows and scanning take N
   seconds, then calling 10m Gets in batches of 1000 take ~3N, which is
  pretty
   impressive.
  
   Now, this is with all data in the cache!
   When the data is not in the cache and the Gets are random it is many
   orders of magnitude slower, as the Gets are sprayed all over the disk.
 In
   that case sorting the Gets and issuing scans would indeed be much more
   efficient.
  
  
   The Gets in a batch are already sorted on the 

Re: Optimizing Multi Gets in hbase

2013-02-19 Thread Nicolas Liochon
Interesting: in the client we're already doing a group-by-location for the
multiget. So we could have the filter as HBase core code, and then we could use
it in the client for the multiget: compared to my initial proposal, we don't
have to change anything in the server code and we reuse the filtering
framework. The filter can also be used independently.

Is there any issue with this? The reseek seems to be quite smart in the way
it handles the bloom filters; I don't know if it behaves differently in
this case vs. a simple get.


On Tue, Feb 19, 2013 at 7:27 PM, lars hofhansl la...@apache.org wrote:

 I was thinking along the same lines. Doing a skip scan via filter hinting.
 The problem is as you say that the Filter is instantiated everywhere and it
 might be of significant size (have to maintain all row keys you are looking
 for).


 RegionScanner now a reseek method, it is possible to do this via a
 coprocessor. They are also loaded per region (but at least not for each
 store), and one can use the shared coproc state I added to alleviate the
 memory concern.

 Thinking about this in terms of multiple scan is interesting. One could
 identify clusters of close row keys in the Gets and issue a Scan for each
 cluster.


 -- Lars



 
  From: Nicolas Liochon nkey...@gmail.com
 To: user user@hbase.apache.org
 Sent: Tuesday, February 19, 2013 9:28 AM
 Subject: Re: Optimizing Multi Gets in hbase

 Imho,  the easiest thing to do would be to write a filter.
 You need to order the rows, then you can use hints to navigate to the next
 row (SEEK_NEXT_USING_HINT).
 The main drawback I see is that the filter will be invoked on all regions
 servers, including the ones that don't need it. But this would also means
 you have a very specific query pattern (which could be the case, I just
 don't know), and you can still use the startRow / stopRow of the scan, and
 create multiple scan if necessary. I'm also interested in Lars' opinion on
 this.

 Nicolas



 On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma va...@pinterest.com wrote:

  I have another question, if I am running a scan wrapped around multiple
  rows in the same region, in the following way:
 
  Scan scan = new scan(getWithMultipleRowsInSameRegion);
 
  Now, how does execution occur. Is it just a sequential scan across the
  entire region or does it seek to hfile blocks containing the actual
 values.
  What I truly mean is, lets say the multi get is on following rows:
 
  Row1 : HFileBlock1
  Row2 : HFileBlock20
  Row3 : Does not exist
  Row4 : HFileBlock25
  Row5 : HFileBlock100
 
  The efficient way to do this would be to determine the correct blocks
 using
  the index and then searching within the blocks for, say Row1. Then, seek
 to
  HFileBlock20 and then look for Row2. Elimininate Row3 and then keep on
  seeking to + searching within HFileBlocks as needed.
 
  I am wondering if a scan wrapped around a Get with multiple rows would do
  the same ?
 
  Thanks
  Varun
 
  On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon nkey...@gmail.com
  wrote:
 
   Looking at the code, it seems possible to do this server side within
 the
   multi invocation: we could group the get by region, and do a single
 scan.
   We could also add some heuristics if necessary...
  
  
  
   On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl la...@apache.org
 wrote:
  
I should qualify that statement, actually.
   
I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
returned.
   
As James Taylor pointed out to me privately: A fairer comparison
 would
have been to run a scan with a filter that lets x% of the rows pass
  (i.e.
the selectivity of the scan would be x%) and compare that to a multi
  Get
   of
the same x% of the row.
   
There we found that a Scan+Filter is more efficient that issuing
 multi
Gets if x is = 1-2%.
   
   
Or in other words, translating many Gets into a Scan+Filter is
  beneficial
if the Scan would return at least 1-2% of the rows to the client.
For example:
if you are looking for less than 10-20k rows in 1m rows, using muli
  Gets
is likely more efficient.
if you are looking for more than 10-20k rows in 1m rows, using a
Scan+Filter is likely more efficient.
   
   
Of course this is predicated on whether you have an efficient way to
represent the rows you are looking for in a filter, so that would
   probably
shift this slightly more towards Gets (just imaging a Filter that to
   encode
100k random row keys to be matched; since Filters are instantiated
  store
there is another natural limit there).
   
   
As I said below, the crux of the matter is having some histograms of
  your
data, so that such a decision could be made automatically.
   
   
-- Lars
   
   
   

 From: lars hofhansl la...@apache.org
To: user@hbase.apache.org user@hbase.apache.org
Sent: Monday, February 18, 2013 

Re: Optimizing Multi Gets in hbase

2013-02-19 Thread Nicolas Liochon
As well, an advantage of going only to the servers needed is the famous
MTTR: there is less chance of going to a dead server or to a region that
just moved.


On Tue, Feb 19, 2013 at 7:42 PM, Nicolas Liochon nkey...@gmail.com wrote:

 Interesting, in the client we're doing a group by location the multiget.
 So we could have the filter as HBase core code, and then we could use it
 in the client for the multiget: compared to my initial proposal, we don't
 have to change anything in the server code and we reuse the filtering
 framework. The filter can be also be used independently.

 Is there any issue with this? The reseek seems to be quite smart in the
 way it handles the bloom filters, I don't know if it behaves differently in
 this case vs. a simple get.


 On Tue, Feb 19, 2013 at 7:27 PM, lars hofhansl la...@apache.org wrote:

 I was thinking along the same lines. Doing a skip scan via filter
 hinting. The problem is as you say that the Filter is instantiated
 everywhere and it might be of significant size (have to maintain all row
 keys you are looking for).


 RegionScanner now a reseek method, it is possible to do this via a
 coprocessor. They are also loaded per region (but at least not for each
 store), and one can use the shared coproc state I added to alleviate the
 memory concern.

 Thinking about this in terms of multiple scan is interesting. One could
 identify clusters of close row keys in the Gets and issue a Scan for each
 cluster.


 -- Lars



 
  From: Nicolas Liochon nkey...@gmail.com
 To: user user@hbase.apache.org
 Sent: Tuesday, February 19, 2013 9:28 AM
 Subject: Re: Optimizing Multi Gets in hbase

 Imho,  the easiest thing to do would be to write a filter.
 You need to order the rows, then you can use hints to navigate to the next
 row (SEEK_NEXT_USING_HINT).
 The main drawback I see is that the filter will be invoked on all regions
 servers, including the ones that don't need it. But this would also means
 you have a very specific query pattern (which could be the case, I just
 don't know), and you can still use the startRow / stopRow of the scan, and
 create multiple scan if necessary. I'm also interested in Lars' opinion on
 this.

 Nicolas



 On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma va...@pinterest.com
 wrote:

  I have another question, if I am running a scan wrapped around multiple
  rows in the same region, in the following way:
 
  Scan scan = new scan(getWithMultipleRowsInSameRegion);
 
  Now, how does execution occur. Is it just a sequential scan across the
  entire region or does it seek to hfile blocks containing the actual
 values.
  What I truly mean is, lets say the multi get is on following rows:
 
  Row1 : HFileBlock1
  Row2 : HFileBlock20
  Row3 : Does not exist
  Row4 : HFileBlock25
  Row5 : HFileBlock100
 
  The efficient way to do this would be to determine the correct blocks
 using
  the index and then searching within the blocks for, say Row1. Then,
 seek to
  HFileBlock20 and then look for Row2. Elimininate Row3 and then keep on
  seeking to + searching within HFileBlocks as needed.
 
  I am wondering if a scan wrapped around a Get with multiple rows would
 do
  the same ?
 
  Thanks
  Varun
 
  On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon nkey...@gmail.com
  wrote:
 
   Looking at the code, it seems possible to do this server side within
 the
   multi invocation: we could group the get by region, and do a single
 scan.
   We could also add some heuristics if necessary...
  
  
  
   On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl la...@apache.org
 wrote:
  
I should qualify that statement, actually.
   
I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
returned.
   
As James Taylor pointed out to me privately: A fairer comparison
 would
have been to run a scan with a filter that lets x% of the rows pass
  (i.e.
the selectivity of the scan would be x%) and compare that to a multi
  Get
   of
the same x% of the row.
   
There we found that a Scan+Filter is more efficient that issuing
 multi
Gets if x is = 1-2%.
   
   
Or in other words, translating many Gets into a Scan+Filter is
  beneficial
if the Scan would return at least 1-2% of the rows to the client.
For example:
if you are looking for less than 10-20k rows in 1m rows, using muli
  Gets
is likely more efficient.
if you are looking for more than 10-20k rows in 1m rows, using a
Scan+Filter is likely more efficient.
   
   
Of course this is predicated on whether you have an efficient way to
represent the rows you are looking for in a filter, so that would
   probably
shift this slightly more towards Gets (just imaging a Filter that to
   encode
100k random row keys to be matched; since Filters are instantiated
  store
there is another natural limit there).
   
   
As I said below, the crux of the matter is having some histograms of
  your
data, so that 

Scanning a row for certain md5hash does not work

2013-02-19 Thread Paul van Hoven
I'm currently reading a book about HBase (HBase in Action, by Manning).
In this book it is explained how to perform a scan if the rowkey is
made out of an md5 hash (page 45 in the book). My rowkey design (and
table filling method) looks like this:

SimpleDateFormat dateFormatter = new SimpleDateFormat("yyyy-MM-dd");
SimpleDateFormat timeFormatter = new SimpleDateFormat("HH:mm:ss");
Date date = dateFormatter.parse("2013-01-01");

for( int i = 0; i < 31; ++i ) {
    for( int k = 0; k < 24; ++k ) {
        for( int j = 0; j < 1; ++j ) {
            // md5() is a custom method that transforms a string into a md5 hash
            byte[] ts = md5( dateFormatter.format(date) );
            byte[] tm = md5( timeFormatter.format(date) );
            byte[] ip = md5( generateRandomIPAddress() /* toy method that generates ip addresses */ );
            byte[] rowkey = new byte[ ts.length + tm.length + ip.length ];
            System.arraycopy( ts, 0, rowkey, 0, ts.length );
            System.arraycopy( tm, 0, rowkey, ts.length, tm.length );
            System.arraycopy( ip, 0, rowkey, ts.length+tm.length, ip.length );
            Put p = new Put( rowkey );

            p.add( Bytes.toBytes("CF"), Bytes.toBytes("SampleCol"),
                Bytes.toBytes( "Value_" + (i+1) + " = " + dateFormatter.format(date) +
                " " + timeFormatter.format(date) ) );
            toyDataTable.put( p );
        }

        // custom method that adds an hour to the current date object
        date = addHours( date, 1 );
    }

}

Now I'd like to do the following scan (I more or less took the same
code from the example in the book):

SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd");
Date refDate = formatter.parse("2013-01-15");

HTableInterface toyDataTable = pool.getTable("ToyDataTable");

byte[] md5Key = md5( refDate.getTime() + "" );
int md5Length = 16;
int longLength = 8;
byte[] startRow = Bytes.padTail( md5Key, longLength );
byte[] endRow = Bytes.padTail( md5Key, longLength );
endRow[md5Length-1]++;

Scan scan = new Scan( startRow, endRow );
ResultScanner rs = toyDataTable.getScanner( scan );
for( Result r : rs ) {
    String value = Bytes.toString( r.getValue( Bytes.toBytes("CF"),
        Bytes.toBytes("SampleCol")) );
    System.out.println( value );
}

The result is empty. How is that possible?


Re: Scanning a row for certain md5hash does not work

2013-02-19 Thread Paul van Hoven
Sorry, I had a mistake in my rowkey generation.

Thanks for reading!

2013/2/19 Paul van Hoven paul.van.ho...@googlemail.com:
 I'm currently reading a book about hbase (hbase in action by manning).
 In this book it is explained how to perform a scan if the rowkey is
 made out of a md5 hash (page 45 in the book). My rowkey design (and
 table filling method) looks like this:

 SimpleDateFormat dateFormatter = new SimpleDateFormat(-MM-dd);
 SimpleDateFormat timeFormatter = new SimpleDateFormat(HH:mm:ss);
 Date date = dateFormatter.parse(2013-01-01);

 for( int i = 0; i  31; ++i ) {
 for( int k = 0; k  24; ++k ) {
 for( int j = 0; j  1; ++j ) {
 //md5() is a custom method that transforms a
 string into a md5 hash
 byte[] ts = md5( dateFormatter.format(date) );
 byte[] tm = md5( timeFormatter.format(date) );
 byte[] ip = md5( generateRandomIPAddress() /* toy 
 method that
 generates ip addresses */ );
 byte[] rowkey = new byte[ ts.length + tm.length + 
 ip.length ];
 System.arraycopy( ts, 0, rowkey, 0, ts.length );
 System.arraycopy( tm, 0, rowkey, ts.length, tm.length 
 );
 System.arraycopy( ip, 0, rowkey, ts.length+tm.length, 
 ip.length );
 Put p = new Put( rowkey );

 p.add( Bytes.toBytes(CF), 
 Bytes.toBytes(SampleCol),
 Bytes.toBytes( Value_ + (i+1) +  =  + dateFormatter.format(date) +
   + timeFormatter.format(date) ) );
 toyDataTable.put( p );
 }

 //custom method that adds an hour to the current date object
 date = addHours( date, 1 );
 }

 }

 Now I'd like to do the following scan (I more or less took the same
 code from the example in the book):

 SimpleDateFormat formatter = new SimpleDateFormat(-MM-dd);
 Date refDate = formatter.parse(2013-01-15);

 HTableInterface toyDataTable = pool.getTable(ToyDataTable);

 byte[] md5Key = md5( refDate.getTime() + );
 int md5Length = 16;
 int longLength = 8;
 byte[] startRow = Bytes.padTail( md5Key, longLength );
 byte[] endRow = Bytes.padTail( md5Key, longLength );
 endRow[md5Length-1]++;

 Scan scan = new Scan( startRow, endRow );
 ResultScanner rs = toyDataTable.getScanner( scan );
 for( Result r : rs ) {
 String value =  Bytes.toString( r.getValue( Bytes.toBytes(CF),
 Bytes.toBytes(SampleCol)) );
 System.out.println( value );
 }

 The result is empty. How is that possible?


Re: coprocessor enabled put very slow, help please~~~

2013-02-19 Thread Andrew Purtell
A coprocessor is some code running in a server process. The resources
available and rules of the road are different from client-side programming.
HTablePool (and HTable in general) is problematic for server-side
programming in my opinion: http://search-hadoop.com/m/XtAi5Fogw32 Since
this comes up now and again, it seems like a lightweight alternative for
server-side IPC could be useful.


On Tue, Feb 19, 2013 at 7:15 AM, Wei Tan w...@us.ibm.com wrote:

 A side question: if HTablePool is not encouraged to be used... how do we
 handle thread safety when using HTable? Is any replacement for
 HTablePool planned?
 Thanks,


 Best Regards,
 Wei




 From:   Michel Segel michael_se...@hotmail.com
 To: user@hbase.apache.org user@hbase.apache.org,
 Date:   02/18/2013 09:23 AM
 Subject:Re: coprocessor enabled put very slow, help please~~~



 Why are you using an HTablePool?
 Why are you closing the table after each iteration through?

 Try using 1 HTable object. Turn off WAL.
 Instantiate in start() (see the sketch below).
 Close in stop().
 Surround the use in a try / catch.
 If an exception is caught, re-instantiate a new HTable connection.

 Maybe you want to flush the connection after puts.
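
 (A minimal sketch of that suggestion -- the class and table names are invented,
 and this is only one way to do it, assuming the HBase 0.94 coprocessor APIs:)

 import java.io.IOException;
 import org.apache.hadoop.hbase.CoprocessorEnvironment;
 import org.apache.hadoop.hbase.client.HTableInterface;
 import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
 import org.apache.hadoop.hbase.util.Bytes;

 public class IndexObserver extends BaseRegionObserver {
     private HTableInterface indexTable;

     @Override
     public void start(CoprocessorEnvironment env) throws IOException {
         // One table instance per observer instance, reused across hooks.
         indexTable = env.getTable(Bytes.toBytes("doc_idx"));
     }

     @Override
     public void stop(CoprocessorEnvironment env) throws IOException {
         if (indexTable != null) {
             indexTable.close();
         }
     }
 }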


 Again not sure why you are using check and put on the base table. Your
 count could be off.

 As an example look at the poem/rhyme 'Mary had a little lamb'.
 Then check your word count.

 Sent from a remote device. Please excuse any typos...

 Mike Segel

 On Feb 18, 2013, at 7:21 AM, prakash kadel prakash.ka...@gmail.com
 wrote:

  Thank you guys for your replies,
  Michael,
  I think I didn't make it clear. Here is my use case:
 
  I have text documents to insert in the hbase. (With possible duplicates)
  Suppose i have a document as :  I am working. He is not working
 
  I want to insert this document to a table in hbase, say table doc
 
  =doc table=
  -
  rowKey : doc_id
  cf: doc_content
  value: I am working. He is not working
 
  Now, I want to create another table that stores the word count, say doc_idx
 
  doc_idx table
  ---
  rowKey : I, cf: count, value: 1
  rowKey : am, cf: count, value: 1
  rowKey : working, cf: count, value: 2
  rowKey : He, cf: count, value: 1
  rowKey : is, cf: count, value: 1
  rowKey : not, cf: count, value: 1
 
  My MR job code:
  ==
 
   if(doc.checkAndPut(rowKey, "doc_content", "", null, putDoc)) {
       for(String word : doc_content.split("\\s+")) {
           Increment inc = new Increment(Bytes.toBytes(word));
           inc.addColumn("count", "", 1);
       }
   }
 
  Now, i wanted to do some experiments with coprocessors. So, i modified
  the code as follows.
 
  My MR job code:
  ===
 
   doc.checkAndPut(rowKey, "doc_content", "", null, putDoc);
 
  Coprocessor code:
  ===
 
 public void start(CoprocessorEnvironment env)  {
 pool = new HTablePool(conf, 100);
 }
 
 public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
  compareOp, comparator,  put, result) {
 
 if(!result) return true; // check if the put succeeded
 
  HTableInterface table_idx = pool.getTable("doc_idx");
 
 try {
 
      for(KeyValue contentKV : put.get("doc_content", "")) {
          for(String word : contentKV.getValue().split("\\s+")) {
              Increment inc = new Increment(Bytes.toBytes(word));
              inc.addColumn("count", "", 1);
              table_idx.increment(inc);
 }
}
 } finally {
 table_idx.close();
 }
 return true;
 }
 
 public void stop(env) {
 pool.close();
 }
 
  I am a newbie to HBase. I am not sure this is the right way to do it.
  Given that, why is the coprocessor-enabled version much slower than
  the one without?
 
 
  Sincerely,
  Prakash Kadel
 
 
  On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
  michael_se...@hotmail.com wrote:
 
  The  issue I was talking about was the use of a check and put.
  The OP wrote:
  each map inserts to doc table.(checkAndPut)
  regionobserver coprocessor does a postCheckAndPut and inserts some
 rows to
  a index table.
 
  My question is why does the OP use a checkAndPut, and the
 RegionObserver's postChecAndPut?
 
 
  Here's a good example...

 http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put

 
  The OP doesn't really get in to the use case, so we don't know why the
 Check and Put in the M/R job.
  He should just be using put() and then a postPut().
 
  Another issue... since he's writing to  a different HTable... how? Does
 he create an HTable instance in the start() method of his RO object and
 then reference it later? Or does he create the instance of the HTable on
 the fly in each postCheckAndPut() ?
  Without seeing his code, we don't know.
 
  Note that this is synchronous set of writes. Your overall return from
 the M/R call to put will wait until the second row is inserted.
 

Re: coprocessor enabled put very slow, help please~~~

2013-02-19 Thread Asaf Mesika
1. Try batching your increment calls into a List<Row> and use batch() to
execute it (see the sketch after this list). Should reduce RPC calls by two
orders of magnitude.
2. Combine batching with scanning more words, thus aggregating your count
for a certain word and thus issuing fewer Increment commands.
3. Enable Bloom Filters. Should speed up Increment by a factor of 2 at
least.
4. Don't use keyValue.getValue(). It does a System.arraycopy behind the
scenes. Use getBuffer(), getValueOffset() and getValueLength() and
iterate on the existing array. Write your own split without going through
String functions, which involve encoding (expensive). Just find
your delimiter by byte comparison.
5. Enable BloomFilters on the doc table. It should speed up the checkAndPut.
6. I wouldn't give up the WAL. It ain't your bottleneck IMO.
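
(A rough sketch of tip 1, not from the thread -- it assumes an open
HTableInterface named table_idx and a byte[] document value named content:)

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.util.Bytes;

// Aggregate the counts per word first, then send one batch of Increments
// instead of one RPC per word occurrence.
Map<String, Long> counts = new HashMap<String, Long>();
for (String word : new String(content).split("\\s+")) {
    Long c = counts.get(word);
    counts.put(word, c == null ? 1L : c + 1L);
}
List<Row> actions = new ArrayList<Row>(counts.size());
for (Map.Entry<String, Long> e : counts.entrySet()) {
    Increment inc = new Increment(Bytes.toBytes(e.getKey()));
    inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), e.getValue());
    actions.add(inc);
}
// throws IOException / InterruptedException
table_idx.batch(actions, new Object[actions.size()]);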

On Monday, February 18, 2013, prakash kadel wrote:

 Thank you guys for your replies,
 Michael,
I think i didnt make it clear. Here is my use case,

 I have text documents to insert in the hbase. (With possible duplicates)
 Suppose i have a document as :  I am working. He is not working

 I want to insert this document to a table in hbase, say table doc

 =doc table=
 -
 rowKey : doc_id
 cf: doc_content
 value: I am working. He is not working

 Now, i to create another table that stores the word count, say doc_idx

 doc_idx table
 ---
 rowKey : I, cf: count, value: 1
 rowKey : am, cf: count, value: 1
 rowKey : working, cf: count, value: 2
 rowKey : He, cf: count, value: 1
 rowKey : is, cf: count, value: 1
 rowKey : not, cf: count, value: 1

 My MR job code:
 ==

 if(doc.checkAndPut(rowKey, doc_content, , null, putDoc)) {
 for(String word : doc_content.split(\\s+)) {
Increment inc = new Increment(Bytes.toBytes(word));
inc.addColumn(count, , 1);
 }
 }

 Now, i wanted to do some experiments with coprocessors. So, i modified
 the code as follows.

 My MR job code:
 ===

 doc.checkAndPut(rowKey, doc_content, , null, putDoc);

 Coprocessor code:
 ===

 public void start(CoprocessorEnvironment env)  {
 pool = new HTablePool(conf, 100);
 }

 public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
 compareOp,   comparator,  put, result) {

 if(!result) return true; // check if the put succeeded

 HTableInterface table_idx = pool.getTable(doc_idx);

 try {

 for(KeyValue contentKV = put.get(doc_content,
 )) {
 for(String word :
 contentKV.getValue().split(\\s+)) {
 Increment inc = new
 Increment(Bytes.toBytes(word));
 inc.addColumn(count, , 1);
 table_idx.increment(inc);
 }
}
 } finally {
 table_idx.close();
 }
 return true;
 }

 public void stop(env) {
 pool.close();
 }

 I am a newbee to HBASE. I am not sure this is the way to do.
 Given that, why is the cooprocessor enabled version much slower than
 the one without?


 Sincerely,
 Prakash Kadel


 On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
 michael_se...@hotmail.com javascript:; wrote:
 
  The  issue I was talking about was the use of a check and put.
  The OP wrote:
  each map inserts to doc table.(checkAndPut)
  regionobserver coprocessor does a postCheckAndPut and inserts some
 rows to
  a index table.
 
  My question is why does the OP use a checkAndPut, and the
 RegionObserver's postChecAndPut?
 
 
  Here's a good example...
 http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
 
  The OP doesn't really get in to the use case, so we don't know why the
 Check and Put in the M/R job.
  He should just be using put() and then a postPut().
 
  Another issue... since he's writing to  a different HTable... how? Does
 he create an HTable instance in the start() method of his RO object and
 then reference it later? Or does he create the instance of the HTable on
 the fly in each postCheckAndPut() ?
  Without seeing his code, we don't know.
 
  Note that this is synchronous set of writes. Your overall return from
 the M/R call to put will wait until the second row is inserted.
 
  Interestingly enough, you may want to consider disabling the WAL on the
 write to the index.  You can always run a M/R job that rebuilds the index
 should something occur to the system where you might lose the data.
  Indexes *ARE* expendable. ;-)
 
  Does that explain it?
 
  -Mike
 
  On Feb 18, 2013, at 4:57 AM, yonghu yongyong...@gmail.com wrote:
 
  Hi, Michael
 
  I don't quite understand what do you mean by round trip back to the
  client. In my understanding, as the RegionServer and TaskTracker can
  be the same node, MR don't have to pull data into client and then
  process.  And you also mention the 

Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
Hi

Is there any way to balance just one table? I found one of my tables is not
balanced, while all the other tables are balanced. So I want to fix this table.

Best Regards,
Raymond Liu



Re: Is there any way to balance one table?

2013-02-19 Thread Ted Yu
What version of HBase are you using ?

0.94 has per-table load balancing.

Cheers

On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond raymond@intel.com wrote:

 Hi

 Is there any way to balance just one table? I found one of my table is not
 balanced, while all the other table is balanced. So I want to fix this
 table.

 Best Regards,
 Raymond Liu




RE: Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
0.94.1

Is there any cmd in the shell? Or do I need to change the balance threshold to 0
and run the global balancer cmd in the shell?



Best Regards,
Raymond Liu

 -Original Message-
 From: Ted Yu [mailto:yuzhih...@gmail.com]
 Sent: Wednesday, February 20, 2013 9:09 AM
 To: user@hbase.apache.org
 Subject: Re: Is there any way to balance one table?
 
 What version of HBase are you using ?
 
 0.94 has per-table load balancing.
 
 Cheers
 
 On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond raymond@intel.com
 wrote:
 
  Hi
 
  Is there any way to balance just one table? I found one of my table is
  not balanced, while all the other table is balanced. So I want to fix
  this table.
 
  Best Regards,
  Raymond Liu
 
 


availability of 0.94.4 and 0.94.5 in maven repo?

2013-02-19 Thread James Taylor
Unless I'm doing something wrong, it looks like the Maven repository 
(http://mvnrepository.com/artifact/org.apache.hbase/hbase) only contains 
HBase up to 0.94.3. Is there a different repo I should use, or if not, 
any ETA on when it'll be updated?


James



Re: availability of 0.94.4 and 0.94.5 in maven repo?

2013-02-19 Thread Viral Bajaria
I have come across this too, I think someone with authorization needs to
perform a maven release to the apache maven repository and/or maven central.

For now, I just end up compiling the dot release from trunk and deploy it
to my local repository for other projects to use.

Thanks,
Viral

On Tue, Feb 19, 2013 at 5:30 PM, James Taylor jtay...@salesforce.comwrote:

 Unless I'm doing something wrong, it looks like the Maven repository (
  http://mvnrepository.com/artifact/org.apache.hbase/hbase)
 only contains HBase up to 0.94.3. Is there a different repo I should use,
 or if not, any ETA on when it'll be updated?

 James




Re: availability of 0.94.4 and 0.94.5 in maven repo?

2013-02-19 Thread Joarder KAMAL
I also ran into the same issue a day ago while building the YCSB HBase
client for HBase 0.94.5. Later I used the 0.94.3 version to carry out my
work for the time being.

Regards,
Joarder Kamal


On 20 February 2013 12:32, Viral Bajaria viral.baja...@gmail.com wrote:

 I have come across this too, I think someone with authorization needs to
 perform a maven release to the apache maven repository and/or maven
 central.

 For now, I just end up compiling the dot release from trunk and deploy it
 to my local repository for other projects to use.

 Thanks,
 Viral

 On Tue, Feb 19, 2013 at 5:30 PM, James Taylor jtay...@salesforce.com
 wrote:

  Unless I'm doing something wrong, it looks like the Maven repository (
   http://mvnrepository.com/artifact/org.apache.hbase/hbase)
  only contains HBase up to 0.94.3. Is there a different repo I should use,
  or if not, any ETA on when it'll be updated?
 
  James
 
 



RE: Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
I chose to move the regions manually. Is there any other approach?

 
 0.94.1
 
 Any cmd in shell? Or I need to change balance threshold to 0 an run global
 balancer cmd in shell?
 
 
 
 Best Regards,
 Raymond Liu
 
  -Original Message-
  From: Ted Yu [mailto:yuzhih...@gmail.com]
  Sent: Wednesday, February 20, 2013 9:09 AM
  To: user@hbase.apache.org
  Subject: Re: Is there any way to balance one table?
 
  What version of HBase are you using ?
 
  0.94 has per-table load balancing.
 
  Cheers
 
  On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond raymond@intel.com
  wrote:
 
   Hi
  
   Is there any way to balance just one table? I found one of my table
   is not balanced, while all the other table is balanced. So I want to
   fix this table.
  
   Best Regards,
   Raymond Liu
  
  


Re: availability of 0.94.4 and 0.94.5 in maven repo?

2013-02-19 Thread Andrew Purtell
Same here, just tripped over this moments ago.


On Tue, Feb 19, 2013 at 5:30 PM, James Taylor jtay...@salesforce.comwrote:

 Unless I'm doing something wrong, it looks like the Maven repository (
 http://mvnrepository.com/artifact/org.apache.hbase/hbase)
 only contains HBase up to 0.94.3. Is there a different repo I should use,
 or if not, any ETA on when it'll be updated?

 James




-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)


Re: Is there any way to balance one table?

2013-02-19 Thread Jean-Marc Spaggiari
Hi Liu,

Why did you not simply call the balancer? If other tables are
already balanced, it should not touch them and will only balance the
table which is not balanced.

JM

2013/2/19, Liu, Raymond raymond@intel.com:
 I choose to move region manually. Any other approaching?


 0.94.1

 Any cmd in shell? Or I need to change balance threshold to 0 an run
 global
 balancer cmd in shell?



 Best Regards,
 Raymond Liu

  -Original Message-
  From: Ted Yu [mailto:yuzhih...@gmail.com]
  Sent: Wednesday, February 20, 2013 9:09 AM
  To: user@hbase.apache.org
  Subject: Re: Is there any way to balance one table?
 
  What version of HBase are you using ?
 
  0.94 has per-table load balancing.
 
  Cheers
 
  On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond raymond@intel.com
  wrote:
 
   Hi
  
   Is there any way to balance just one table? I found one of my table
   is not balanced, while all the other table is balanced. So I want to
   fix this table.
  
   Best Regards,
   Raymond Liu
  
  



Problem In Understanding Compaction Process

2013-02-19 Thread Anty
Hi: Guys

  I have some problems understanding the compaction process. Can
someone shed some light on this? Much appreciated. Here is the problem:

  After the Region Server successfully generates the final compacted file,
it goes through two steps:
   1. move the above compacted file into the region's directory
   2. delete the replaced files.

   The above two steps are not atomic. If the Region Server crashes after
step 1 and before step 2, then there are duplicated records! Is this
problem handled in the read path, or is there another mechanism to fix
this?

-- 
Best Regards
Anty Rao


Re: Is there any way to balance one table?

2013-02-19 Thread Ted Yu
HBASE-3373 introduced hbase.master.loadbalance.bytable which defaults to
true.

This means when you issue 'balancer' command in shell, table should be
balanced for you.
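
(For reference, a small sketch of triggering the same thing from Java instead of
the shell -- assumes the HBase 0.94 client API:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class RunBalancer {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            // Same as the shell 'balancer' command; returns false if the
            // balancer did not run (e.g. it is switched off).
            boolean ran = admin.balancer();
            System.out.println("balancer ran: " + ran);
        } finally {
            admin.close();
        }
    }
}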

Cheers

On Tue, Feb 19, 2013 at 5:16 PM, Liu, Raymond raymond@intel.com wrote:

 0.94.1

 Any cmd in shell? Or I need to change balance threshold to 0 an run global
 balancer cmd in shell?



 Best Regards,
 Raymond Liu

  -Original Message-
  From: Ted Yu [mailto:yuzhih...@gmail.com]
  Sent: Wednesday, February 20, 2013 9:09 AM
  To: user@hbase.apache.org
  Subject: Re: Is there any way to balance one table?
 
  What version of HBase are you using ?
 
  0.94 has per-table load balancing.
 
  Cheers
 
  On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond raymond@intel.com
  wrote:
 
   Hi
  
   Is there any way to balance just one table? I found one of my table is
   not balanced, while all the other table is balanced. So I want to fix
   this table.
  
   Best Regards,
   Raymond Liu
  
  



RE: Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
Hi

I did call the balancer, but it seems it doesn't work. Might it be because this
table is small and the overall region-number difference is within the threshold?

 -Original Message-
 From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
 Sent: Wednesday, February 20, 2013 10:59 AM
 To: user@hbase.apache.org
 Subject: Re: Is there any way to balance one table?
 
 Hi Liu,
 
 Why did not you simply called the balancer? If other tables are already
 balanced, it should not touch them and will only balance the table which is 
 not
 balancer?
 
 JM
 
 2013/2/19, Liu, Raymond raymond@intel.com:
  I choose to move region manually. Any other approaching?
 
 
  0.94.1
 
  Any cmd in shell? Or I need to change balance threshold to 0 an run
  global balancer cmd in shell?
 
 
 
  Best Regards,
  Raymond Liu
 
   -Original Message-
   From: Ted Yu [mailto:yuzhih...@gmail.com]
   Sent: Wednesday, February 20, 2013 9:09 AM
   To: user@hbase.apache.org
   Subject: Re: Is there any way to balance one table?
  
   What version of HBase are you using ?
  
   0.94 has per-table load balancing.
  
   Cheers
  
   On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond
   raymond@intel.com
   wrote:
  
Hi
   
Is there any way to balance just one table? I found one of my
table is not balanced, while all the other table is balanced. So
I want to fix this table.
   
Best Regards,
Raymond Liu
   
   
 


Re: Is there any way to balance one table?

2013-02-19 Thread Marcos Ortiz

What is the size of your table?

On 02/19/2013 10:40 PM, Liu, Raymond wrote:

Hi

I do call balancer, while it seems it doesn't work. Might due to this table is 
small and overall region number difference is within threshold?


-Original Message-
From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
Sent: Wednesday, February 20, 2013 10:59 AM
To: user@hbase.apache.org
Subject: Re: Is there any way to balance one table?

Hi Liu,

Why did not you simply called the balancer? If other tables are already
balanced, it should not touch them and will only balance the table which is not
balancer?

JM

2013/2/19, Liu, Raymond raymond@intel.com:

I choose to move region manually. Any other approaching?


0.94.1

Any cmd in shell? Or I need to change balance threshold to 0 an run
global balancer cmd in shell?



Best Regards,
Raymond Liu


-Original Message-
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, February 20, 2013 9:09 AM
To: user@hbase.apache.org
Subject: Re: Is there any way to balance one table?

What version of HBase are you using ?

0.94 has per-table load balancing.

Cheers

On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond
raymond@intel.com
wrote:


Hi

Is there any way to balance just one table? I found one of my
table is not balanced, while all the other table is balanced. So
I want to fix this table.

Best Regards,
Raymond Liu




--
Marcos Ortiz Valmaseda,
Product Manager  Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186 http://twitter.com/marcosluis2186


Re: Is there any way to balance one table?

2013-02-19 Thread Ted Yu
You're right. Default sloppiness is 20%:
this.slop = conf.getFloat("hbase.regions.slop", (float) 0.2);
src/main/java/org/apache/hadoop/hbase/master/DefaultLoadBalancer.java

Meaning, region count on any server can be as far as 20% from average
region count.

You can tighten sloppiness.

On Tue, Feb 19, 2013 at 7:40 PM, Liu, Raymond raymond@intel.com wrote:

 Hi

 I do call balancer, while it seems it doesn't work. Might due to this
 table is small and overall region number difference is within threshold?

  -Original Message-
  From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
  Sent: Wednesday, February 20, 2013 10:59 AM
  To: user@hbase.apache.org
  Subject: Re: Is there any way to balance one table?
 
  Hi Liu,
 
  Why did not you simply called the balancer? If other tables are already
  balanced, it should not touch them and will only balance the table which
 is not
  balancer?
 
  JM
 
  2013/2/19, Liu, Raymond raymond@intel.com:
   I choose to move region manually. Any other approaching?
  
  
   0.94.1
  
   Any cmd in shell? Or I need to change balance threshold to 0 an run
   global balancer cmd in shell?
  
  
  
   Best Regards,
   Raymond Liu
  
-Original Message-
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, February 20, 2013 9:09 AM
To: user@hbase.apache.org
Subject: Re: Is there any way to balance one table?
   
What version of HBase are you using ?
   
0.94 has per-table load balancing.
   
Cheers
   
On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond
raymond@intel.com
wrote:
   
 Hi

 Is there any way to balance just one table? I found one of my
 table is not balanced, while all the other table is balanced. So
 I want to fix this table.

 Best Regards,
 Raymond Liu


  



RE: Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
I mean the region number is small.

Overall I have, say, 3000 regions on 4 nodes, while this table only has 96
regions. It won't be 24 on each region server; instead it will be something like
19/30/23/21 etc.

Does this mean that I need to limit the slop to 0.02 or so, so that the balancer
actually runs on this table?

Best Regards,
Raymond Liu

From: Marcos Ortiz [mailto:mlor...@uci.cu] 
Sent: Wednesday, February 20, 2013 11:44 AM
To: user@hbase.apache.org
Cc: Liu, Raymond
Subject: Re: Is there any way to balance one table?

What is the size of your table?
On 02/19/2013 10:40 PM, Liu, Raymond wrote:
Hi

I do call balancer, while it seems it doesn't work. Might due to this table is 
small and overall region number difference is within threshold?

-Original Message-
From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
Sent: Wednesday, February 20, 2013 10:59 AM
To: user@hbase.apache.org
Subject: Re: Is there any way to balance one table?

Hi Liu,

Why did not you simply called the balancer? If other tables are already
balanced, it should not touch them and will only balance the table which is not
balancer?

JM

2013/2/19, Liu, Raymond raymond@intel.com:
I choose to move region manually. Any other approaching?


0.94.1

Any cmd in shell? Or I need to change balance threshold to 0 an run
global balancer cmd in shell?



Best Regards,
Raymond Liu

-Original Message-
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, February 20, 2013 9:09 AM
To: user@hbase.apache.org
Subject: Re: Is there any way to balance one table?

What version of HBase are you using ?

0.94 has per-table load balancing.

Cheers

On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond
raymond@intel.com
wrote:

Hi

Is there any way to balance just one table? I found one of my
table is not balanced, while all the other table is balanced. So
I want to fix this table.

Best Regards,
Raymond Liu




-- 
Marcos Ortiz Valmaseda, 
Product Manager  Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186


RE: Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
Yeah, since balancing is already done per table, why is the slop not calculated
per table as well...

 
 You're right. Default sloppiness is 20%:
 this.slop = conf.getFloat(hbase.regions.slop, (float) 0.2);
 src/main/java/org/apache/hadoop/hbase/master/DefaultLoadBalancer.java
 
 Meaning, region count on any server can be as far as 20% from average region
 count.
 
 You can tighten sloppiness.
 
 On Tue, Feb 19, 2013 at 7:40 PM, Liu, Raymond raymond@intel.com
 wrote:
 
  Hi
 
  I do call balancer, while it seems it doesn't work. Might due to this
  table is small and overall region number difference is within threshold?
 
   -Original Message-
   From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
   Sent: Wednesday, February 20, 2013 10:59 AM
   To: user@hbase.apache.org
   Subject: Re: Is there any way to balance one table?
  
   Hi Liu,
  
   Why did not you simply called the balancer? If other tables are
   already balanced, it should not touch them and will only balance the
   table which
  is not
   balancer?
  
   JM
  
   2013/2/19, Liu, Raymond raymond@intel.com:
I choose to move region manually. Any other approaching?
   
   
0.94.1
   
 Any cmd in the shell? Or do I need to change the balance threshold to 0 and
 run the global balancer cmd in the shell?
   
   
   
Best Regards,
Raymond Liu
   
 -Original Message-
 From: Ted Yu [mailto:yuzhih...@gmail.com]
 Sent: Wednesday, February 20, 2013 9:09 AM
 To: user@hbase.apache.org
 Subject: Re: Is there any way to balance one table?

 What version of HBase are you using ?

 0.94 has per-table load balancing.

 Cheers

 On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond
 raymond@intel.com
 wrote:

  Hi
 
  Is there any way to balance just one table? I found that one of my
  tables is not balanced, while all the other tables are balanced.
  So I want to fix this table.
 
  Best Regards,
  Raymond Liu
 
 
   
 


Re: Is there any way to balance one table?

2013-02-19 Thread Ted Yu
Yes, Raymond.
You should lower sloppiness.
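For reference, hbase.regions.slop is read by the master's balancer, so lowering it is typically done in hbase-site.xml on the master (followed by a restart). Once it is lowered, the balancer can be kicked off explicitly. A minimal sketch against the 0.94 client API (the "balancer" shell command should do the same thing):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class RunBalancer {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);
      // Ask the master to run the balancer now; it returns false if it
      // declines, e.g. regions in transition or the balance switch is off.
      boolean ran = admin.balancer();
      System.out.println("Balancer ran: " + ran);
      admin.close();
    }
  }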

On Tue, Feb 19, 2013 at 7:48 PM, Liu, Raymond raymond@intel.com wrote:

 I mean the region number is small.

 Overall I have, say, 3000 regions on 4 nodes, while this table only has 96
 regions. It won't be 24 per region server; instead, it will be something
 like 19/30/23/21.

 Does this mean I need to limit the slop to something like 0.02 so that the
 balancer actually runs on this table?

 Best Regards,
 Raymond Liu

 From: Marcos Ortiz [mailto:mlor...@uci.cu]
 Sent: Wednesday, February 20, 2013 11:44 AM
 To: user@hbase.apache.org
 Cc: Liu, Raymond
 Subject: Re: Is there any way to balance one table?

 What is the size of your table?
 On 02/19/2013 10:40 PM, Liu, Raymond wrote:
 Hi

 I do call the balancer, but it seems it doesn't work. Might that be because this
 table is small and the overall region-number difference is within the threshold?

 -Original Message-
 From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
 Sent: Wednesday, February 20, 2013 10:59 AM
 To: user@hbase.apache.org
 Subject: Re: Is there any way to balance one table?

 Hi Liu,

 Why didn't you simply call the balancer? If the other tables are already
 balanced, it should not touch them and will only balance the table which
 is not balanced.

 JM

 2013/2/19, Liu, Raymond raymond@intel.com:
 I chose to move regions manually. Is there any other approach?


 0.94.1

 Any cmd in the shell? Or do I need to change the balance threshold to 0 and run
 the global balancer cmd in the shell?



 Best Regards,
 Raymond Liu

 -Original Message-
 From: Ted Yu [mailto:yuzhih...@gmail.com]
 Sent: Wednesday, February 20, 2013 9:09 AM
 To: user@hbase.apache.org
 Subject: Re: Is there any way to balance one table?

 What version of HBase are you using ?

 0.94 has per-table load balancing.

 Cheers

 On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond
 raymond@intel.com
 wrote:

 Hi

 Is there any way to balance just one table? I found that one of my
 tables is not balanced, while all the other tables are balanced. So
 I want to fix this table.

 Best Regards,
 Raymond Liu




 --
 Marcos Ortiz Valmaseda,
 Product Manager & Data Scientist at UCI
 Blog: http://marcosluis2186.posterous.com
 Twitter: @marcosluis2186



RE: Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
Hmm, in order to have the 96-region table balanced within 20% on a 3000-region 
cluster when all the other tables are already balanced, the slop would need to be 
around 20%/30, say 0.006. Won't that be too small?

 
 Yes, Raymond.
 You should lower sloppiness.
 
 On Tue, Feb 19, 2013 at 7:48 PM, Liu, Raymond raymond@intel.com
 wrote:
 
   I mean the region number is small.
  
   Overall I have, say, 3000 regions on 4 nodes, while this table only has
   96 regions. It won't be 24 per region server; instead, it will be
   something like 19/30/23/21.
  
   Does this mean I need to limit the slop to something like 0.02 so that the
   balancer actually runs on this table?
 
  Best Regards,
  Raymond Liu
 
  From: Marcos Ortiz [mailto:mlor...@uci.cu]
  Sent: Wednesday, February 20, 2013 11:44 AM
  To: user@hbase.apache.org
  Cc: Liu, Raymond
  Subject: Re: Is there any way to balance one table?
 
  What is the size of your table?
  On 02/19/2013 10:40 PM, Liu, Raymond wrote:
  Hi
 
   I do call the balancer, but it seems it doesn't work. Might that be because this
   table is small and the overall region-number difference is within the threshold?
 
  -Original Message-
  From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
  Sent: Wednesday, February 20, 2013 10:59 AM
  To: user@hbase.apache.org
  Subject: Re: Is there any way to balance one table?
 
  Hi Liu,
 
   Why didn't you simply call the balancer? If the other tables are
   already balanced, it should not touch them and will only balance the
   table which is not balanced.
 
  JM
 
  2013/2/19, Liu, Raymond raymond@intel.com:
   I chose to move regions manually. Is there any other approach?
 
 
  0.94.1
 
   Any cmd in the shell? Or do I need to change the balance threshold to 0 and run
   the global balancer cmd in the shell?
 
 
 
  Best Regards,
  Raymond Liu
 
  -Original Message-
  From: Ted Yu [mailto:yuzhih...@gmail.com]
  Sent: Wednesday, February 20, 2013 9:09 AM
  To: user@hbase.apache.org
  Subject: Re: Is there any way to balance one table?
 
  What version of HBase are you using ?
 
  0.94 has per-table load balancing.
 
  Cheers
 
  On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond raymond@intel.com
  wrote:
 
  Hi
 
   Is there any way to balance just one table? I found that one of my tables is
   not balanced, while all the other tables are balanced. So I want to fix
   this table.
 
  Best Regards,
  Raymond Liu
 
 
 
 
  --
  Marcos Ortiz Valmaseda,
   Product Manager & Data Scientist at UCI
  Blog: http://marcosluis2186.posterous.com
  Twitter: @marcosluis2186
 


Re: Is there any way to balance one table?

2013-02-19 Thread Ted Yu
bq. On a 3000 region cluster

Balancing is per-table. Meaning total number of regions doesn't come into
play.
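Concretely, the per-table check works roughly like this (a sketch of the bounds DefaultLoadBalancer applies in 0.94, as I read the code, using Raymond's numbers):

  public class SlopCheck {
    public static void main(String[] args) {
      int regions = 96, servers = 4;           // the table in question
      int[] perServer = {19, 30, 23, 21};      // observed region counts
      float slop = 0.2f;                       // default hbase.regions.slop
      float avg = (float) regions / servers;   // 24.0
      int floor = (int) Math.floor(avg * (1 - slop));   // 19
      int ceiling = (int) Math.ceil(avg * (1 + slop));  // 29
      boolean balanced = true;
      for (int load : perServer) {
        if (load < floor || load > ceiling) balanced = false;
      }
      // 30 > 29, so this table should be seen as unbalanced and acted on.
      System.out.println("floor=" + floor + " ceiling=" + ceiling
          + " balanced=" + balanced);
    }
  }

So with the default 20% slop, the 19/30/23/21 spread is already (just) outside the allowed 19..29 window for a 96-region table; the other ~3000 regions don't enter the calculation.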

On Tue, Feb 19, 2013 at 7:55 PM, Liu, Raymond raymond@intel.com wrote:

 Hmm, in order to have the 96-region table balanced within 20% on a 3000-region
 cluster when all the other tables are already balanced, the slop would need to be
 around 20%/30, say 0.006. Won't that be too small?

 
  Yes, Raymond.
  You should lower sloppiness.
 
  On Tue, Feb 19, 2013 at 7:48 PM, Liu, Raymond raymond@intel.com
  wrote:
 
    I mean the region number is small.
   
    Overall I have, say, 3000 regions on 4 nodes, while this table only has
    96 regions. It won't be 24 per region server; instead, it will be
    something like 19/30/23/21.
   
    Does this mean I need to limit the slop to something like 0.02 so that the
    balancer actually runs on this table?
  
   Best Regards,
   Raymond Liu
  
   From: Marcos Ortiz [mailto:mlor...@uci.cu]
   Sent: Wednesday, February 20, 2013 11:44 AM
   To: user@hbase.apache.org
   Cc: Liu, Raymond
   Subject: Re: Is there any way to balance one table?
  
   What is the size of your table?
   On 02/19/2013 10:40 PM, Liu, Raymond wrote:
   Hi
  
    I do call the balancer, but it seems it doesn't work. Might that be because this
    table is small and the overall region-number difference is within the
  threshold?
  
   -Original Message-
   From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
   Sent: Wednesday, February 20, 2013 10:59 AM
   To: user@hbase.apache.org
   Subject: Re: Is there any way to balance one table?
  
   Hi Liu,
  
    Why didn't you simply call the balancer? If the other tables are
    already balanced, it should not touch them and will only balance the
    table which is not balanced.
  
   JM
  
   2013/2/19, Liu, Raymond raymond@intel.com:
    I chose to move regions manually. Is there any other approach?
  
  
   0.94.1
  
    Any cmd in the shell? Or do I need to change the balance threshold to 0 and run
    the global balancer cmd in the shell?
  
  
  
   Best Regards,
   Raymond Liu
  
   -Original Message-
   From: Ted Yu [mailto:yuzhih...@gmail.com]
   Sent: Wednesday, February 20, 2013 9:09 AM
   To: user@hbase.apache.org
   Subject: Re: Is there any way to balance one table?
  
   What version of HBase are you using ?
  
   0.94 has per-table load balancing.
  
   Cheers
  
   On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond raymond@intel.com
   wrote:
  
   Hi
  
    Is there any way to balance just one table? I found that one of my tables is
    not balanced, while all the other tables are balanced. So I want to fix
    this table.
  
   Best Regards,
   Raymond Liu
  
  
  
  
   --
   Marcos Ortiz Valmaseda,
    Product Manager & Data Scientist at UCI
   Blog: http://marcosluis2186.posterous.com
   Twitter: @marcosluis2186
  



region server of -ROOT- table is dead, but not reassigned

2013-02-19 Thread Lu, Wei
Hi, all,

When I scan any table, I got:

13/02/20 05:16:45 INFO ipc.HBaseRPC: Server at Rs1/10.20.118.3:60020 could not 
be reached after 1 tries, giving up.
...
ERROR: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=7, exceptions:
...

What I observe:

1)  -ROOT- table is on Region Server rs1

Table Regions
Name    Region Server    Start Key    End Key    Requests
-ROOT-    Rs1:60020 (http://adcau03.machine.wisdom.com:60030/)    -    -



2)  But the region server rs1 is dead

Dead Region Servers
ServerName
Rs4,60020,1361109702535
Rs1,60020,1361109710150
Total: servers: 2






Does it mean that the region server holding the -ROOT- table is dead, but the 
-ROOT- region is not reassigned to any other region servers?
Why?

Thanks,
Wei








RE: region server of -ROOT- table is dead, but not reassigned

2013-02-19 Thread Lu, Wei
By the way, the hbase version I am using is 0.92.1-cdh4.0.1

From: Lu, Wei
Sent: Wednesday, February 20, 2013 1:28 PM
To: user@hbase.apache.org
Subject: region server of -ROOT- table is dead, but not reassigned

Hi, all,

When I scan any table, I got:

13/02/20 05:16:45 INFO ipc.HBaseRPC: Server at Rs1/10.20.118.3:60020 could not 
be reached after 1 tries, giving up.
...
ERROR: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=7, exceptions:
...

What I observe:

1)  -ROOT- table is on Region Server rs1

Table Regions
Name    Region Server    Start Key    End Key    Requests
-ROOT-    Rs1:60020 (http://adcau03.machine.wisdom.com:60030/)    -    -



2)  But the region server rs1 is dead

Dead Region Servers
ServerName
Rs4,60020,1361109702535
Rs1,60020,1361109710150
Total: servers: 2






Does it mean that the region server holding the -ROOT- table is dead, but the 
-ROOT- region is not reassigned to any other region servers?
Why?

Thanks,
Wei








[resend] region server of -ROOT- table is dead, but not reassigned

2013-02-19 Thread Lu, Wei
Hi, all,



When I scan any table, I got:



13/02/20 05:16:45 INFO ipc.HBaseRPC: Server at Rs1/10.20.118.3:60020 could not 
be reached after 1 tries, giving up.

...

ERROR: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=7, exceptions:

...



What I observe:



1)  -ROOT- table is on Region Server rs1



Table Regions

Name    Region Server    Start Key    End Key    Requests

-ROOT-  Rs1:60020   -   -





2)  But the region server rs1 is dead



Dead Region Servers

  ServerName

  Rs4,60020,1361109702535

  Rs1,60020,1361109710150

Total: servers: 2







Does it mean that the region server holding the -ROOT- table is dead, but the 
-ROOT- region is not reassigned to any other region servers?

Why?



By the way, the hbase version I am using is 0.92.1-cdh4.0.1



Thanks,

Wei



RE: Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
You mean the slop is also applied per table?
Weird, then it should work for my case. Let me check again.

Best Regards,
Raymond Liu

 
 bq. On a 3000 region cluster
 
 Balancing is per-table. Meaning total number of regions doesn't come into 
 play.
 
 On Tue, Feb 19, 2013 at 7:55 PM, Liu, Raymond raymond@intel.com
 wrote:
 
  Hmm, in order to have the 96-region table balanced within 20% on a
  3000-region cluster when all the other tables are already balanced,
 
  the slop would need to be around 20%/30, say 0.006. Won't that be too small?
 
  
   Yes, Raymond.
   You should lower sloppiness.
  
   On Tue, Feb 19, 2013 at 7:48 PM, Liu, Raymond
   raymond@intel.com
   wrote:
  
 I mean the region number is small.
    
 Overall I have, say, 3000 regions on 4 nodes, while this table only
 has 96 regions. It won't be 24 per region server; instead, it will
 be something like 19/30/23/21.
    
 Does this mean I need to limit the slop to something like 0.02 so that the
 balancer actually runs on this table?
   
Best Regards,
Raymond Liu
   
From: Marcos Ortiz [mailto:mlor...@uci.cu]
Sent: Wednesday, February 20, 2013 11:44 AM
To: user@hbase.apache.org
Cc: Liu, Raymond
Subject: Re: Is there any way to balance one table?
   
What is the size of your table?
On 02/19/2013 10:40 PM, Liu, Raymond wrote:
Hi
   
 I do call the balancer, but it seems it doesn't work. Might that be because
 this table is small and the overall region-number difference is within the
   threshold?
   
-Original Message-
From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
Sent: Wednesday, February 20, 2013 10:59 AM
To: user@hbase.apache.org
Subject: Re: Is there any way to balance one table?
   
Hi Liu,
   
 Why didn't you simply call the balancer? If the other tables are
 already balanced, it should not touch them and will only balance
 the table which is not balanced.
   
JM
   
2013/2/19, Liu, Raymond raymond@intel.com:
 I chose to move regions manually. Is there any other approach?
   
   
0.94.1
   
 Any cmd in the shell? Or do I need to change the balance threshold to 0 and
 run the global balancer cmd in the shell?
   
   
   
Best Regards,
Raymond Liu
   
-Original Message-
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, February 20, 2013 9:09 AM
To: user@hbase.apache.org
Subject: Re: Is there any way to balance one table?
   
What version of HBase are you using ?
   
0.94 has per-table load balancing.
   
Cheers
   
On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond
raymond@intel.com
wrote:
   
Hi
   
 Is there any way to balance just one table? I found that one of my
 tables is not balanced, while all the other tables are balanced. So I
 want to fix this table.
   
Best Regards,
Raymond Liu
   
   
   
   
--
Marcos Ortiz Valmaseda,
 Product Manager & Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186
   
 


Re: [resend] region server of -ROOT- table is dead, but not reassigned

2013-02-19 Thread ramkrishna vasudevan
Ideally the -ROOT- region should be reassigned once the RS carrying ROOT goes
down.  This should happen automatically.

What do your logs say?  That would give us an insight.

Before that, restarting your master may solve this problem.  If it still
persists after that, try deleting the ZK data and restarting the cluster.

Regards
Ram
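
While going through the logs, a small diagnostic can also help: print the dead servers the master reports and where the client currently thinks -ROOT- is assigned. A rough sketch against the 0.92 client API (method names from memory, so treat it as a starting point rather than exact code):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.ClusterStatus;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HConstants;
  import org.apache.hadoop.hbase.HRegionLocation;
  import org.apache.hadoop.hbase.ServerName;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.client.HConnection;
  import org.apache.hadoop.hbase.client.HConnectionManager;

  public class RootLocationCheck {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);
      ClusterStatus status = admin.getClusterStatus();
      for (ServerName dead : status.getDeadServerNames()) {
        System.out.println("Dead region server: " + dead);
      }
      // Where does the client think -ROOT- lives right now?
      HConnection conn = HConnectionManager.getConnection(conf);
      HRegionLocation root =
          conn.locateRegion(HConstants.ROOT_TABLE_NAME, HConstants.EMPTY_START_ROW);
      System.out.println("-ROOT- assigned to: "
          + root.getHostname() + ":" + root.getPort());
    }
  }

If the printed -ROOT- location is still the dead server, the master has not reassigned it, and the master log around the server-shutdown handling is the place to look.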

On Wed, Feb 20, 2013 at 11:06 AM, Lu, Wei w...@microstrategy.com wrote:

 Hi, all,



 When I scan any table, I got:



 13/02/20 05:16:45 INFO ipc.HBaseRPC: Server at Rs1/10.20.118.3:60020 could not 
 be reached after 1 tries, giving up.

 ...

 ERROR: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
 after attempts=7, exceptions:

 ...



 What I observe:



 1)  -ROOT- table is on Region Server rs1



 Table Regions

 Name    Region Server    Start Key    End Key    Requests

 -ROOT-  Rs1:60020   -   -





 2)  But the region server rs1 is dead



 Dead Region Servers

   ServerName

   Rs4,60020,1361109702535

   Rs1,60020,1361109710150

 Total: servers: 2





 

 Does it mean that the region server holding the -ROOT- table is dead, but
 the -ROOT- region is not reassigned to any other region servers?

 Why?



 By the way, the hbase version I am using is 0.92.1-cdh4.0.1



 Thanks,

 Wei




Re: availability of 0.94.4 and 0.94.5 in maven repo?

2013-02-19 Thread lars hofhansl
Time permitting, I will do that tomorrow.





 From: Andrew Purtell apurt...@apache.org
To: user@hbase.apache.org user@hbase.apache.org 
Sent: Tuesday, February 19, 2013 6:58 PM
Subject: Re: availability of 0.94.4 and 0.94.5 in maven repo?
 
Same here, just tripped over this moments ago.


On Tue, Feb 19, 2013 at 5:30 PM, James Taylor jtay...@salesforce.com wrote:

 Unless I'm doing something wrong, it looks like the Maven repository
 (http://mvnrepository.com/artifact/org.apache.hbase/hbase)
 only contains HBase up to 0.94.3. Is there a different repo I should use,
 or if not, any ETA on when it'll be updated?

     James




-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: PreSplit the table with Long format

2013-02-19 Thread Farrokh Shahriari
Hello again,

Does anyone know how I can do this?
The problem is:
When you insert something from the shell, it assumes it's a string, does a
Bytes.toBytes conversion on the string, and stores that in HBase.
So how can I tell the shell that the value I'm entering isn't a string? How
can I put a long-format value into HBase through the shell?

If you need to know why: I want to pre-split my table. I can't do it through
Java code, because I've installed a security library on HBase which adds a
securecreate command to the shell for creating encrypted tables; encrypted
tables can only be created from the shell, not from Java code. So I'm forced
to use the shell to create the table, and I want to pre-split it with long
values, because my row keys are in the long format.

Please help, I really need this.
Thanks
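
For what it's worth, this is what the pre-split looks like from plain Java, just to show the byte encoding the split keys need (a minimal sketch; it goes through HBaseAdmin rather than the securecreate shell command, so it won't produce an encrypted table in this setup):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PreSplitLongKeys {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);
      HTableDescriptor desc = new HTableDescriptor("testTable");
      desc.addFamily(new HColumnDescriptor("cf1"));
      // Split points encoded as 8-byte longs, matching long row keys.
      byte[][] splits = new byte[][] {
          Bytes.toBytes(1000L), Bytes.toBytes(2000L), Bytes.toBytes(3000L) };
      admin.createTable(desc, splits);
      admin.close();
    }
  }

The same Bytes.toBytes(long) encoding is what the shell would have to produce; a bare 1000 in the JRuby shell may well resolve to the int overload instead, which would give 4-byte split keys rather than 8-byte ones.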

On Tue, Feb 19, 2013 at 2:12 PM, Farrokh Shahriari 
mohandes.zebeleh...@gmail.com wrote:

 Thanks for your help, but it doesn't work. Do you have any other idea? I
 must run it from the shell.

 Farrokh



 On Tue, Feb 19, 2013 at 1:30 PM, Viral Bajaria viral.baja...@gmail.com wrote:

 HBase shell is a jruby shell and so you can invoke any java commands from
 it.

 For example:
  import org.apache.hadoop.hbase.util.Bytes
  Bytes.toLong(Bytes.toBytes(1000))

 Not sure if this works as expected since I don't have a terminal in front
 of me but you could try (assuming the SPLITS keyword takes byte array as
 input, never used SPLITS from the command line):
 create 'testTable', 'cf1' , { SPLITS => [ Bytes.toBytes(1000),
 Bytes.toBytes(2000), Bytes.toBytes(3000) ] }

 Thanks,
 Viral

 On Tue, Feb 19, 2013 at 1:52 AM, Farrokh Shahriari 
 mohandes.zebeleh...@gmail.com wrote:

  Hi there
  As I use row keys in long format, I must pre-split the table in long format too.
  But when I run this command, it pre-splits the table with STRING-format keys:
  create 'testTable','cf1',{SPLITS => [ '1000','2000','3000']}
 
  How can I presplit the table with Long format ?
 
  Farrokh