is there a HBase 0.98 hdfs directory structure introduction?

2014-11-02 Thread Liu, Ming (HPIT-GADSC)
Hi, all,

I have a program that calculates the per-table disk usage of HBase 0.94. I used to run 
the "hadoop fs -du" command against the directory "$rootdir/table" to get the size a 
table uses, as described in HBase's ref guide: 
http://hbase.apache.org/book/trouble.namenode.html .
However, after we upgraded to HBase 0.98, the directory structure changed a lot. 
Yes, I can use "ls" to find the table directory and modify the program myself, 
but I wish there were a good reference to learn more details about the change. 
The document on the HBase official web site does not seem to be updated. Can 
anyone briefly introduce the new directory structure or give me a link? 
It would be good to know what each directory is for.

Thanks,
Ming


RE: is there a HBase 0.98 hdfs directory structure introduction?

2014-11-05 Thread Liu, Ming (HPIT-GADSC)
Thanks Ted for the short but very useful reply! ^_^
It is clear now.

-Original Message-
From: Ted Yu [mailto:yuzhih...@gmail.com] 
Sent: Monday, November 03, 2014 11:30 AM
To: user@hbase.apache.org
Subject: Re: is there a HBase 0.98 hdfs directory structure introduction?

In 0.98, you would find your table under the following directory:
$rootdir/{namespace}/table

If you don't specify namespace at table creation time, 'default' namespace 
would be used.
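So the per-table "hadoop fs -du" trick still works, just one level deeper. A quick
sketch, assuming rootdir is /hbase and a table named "mytable" (names are made up; on
0.96+/0.98 installs the table directories usually sit under an extra data/ subdirectory,
so check with ls first):

  hadoop fs -ls /hbase/data                      # list namespaces
  hadoop fs -ls /hbase/data/default              # list tables in the default namespace
  hadoop fs -du -s /hbase/data/default/mytable   # disk usage of one table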

Cheers

On Sun, Nov 2, 2014 at 7:16 PM, Liu, Ming (HPIT-GADSC) 
wrote:

> Hi, all,
>
> I have a program to calculate the disk usage of hbase per table in 
> hbase 0.94. I used to use the "hadoop fs -du" command against 
> directory "$roodir/table" as the size a table uses, as described in 
> HBase's ref
> guide: http://hbase.apache.org/book/trouble.namenode.html .
> However, when we upgraded to HBase 0.98, the directory structure 
> changed a lot. Yes, I can use "ls" to find the table directory and 
> modify the program myself, but I wish there will be a good reference 
> to learn more details about the change. The document on hbase official 
> web site seems not updated. So can anyone help to briefly introduce 
> the new directory structure or give me a link? It will be good to know 
> what each directory is for.
>
> Thanks,
> Ming
>


Is it possible that HBase update performance is much better than read in YCSB test?

2014-11-11 Thread Liu, Ming (HPIT-GADSC)
Hi, all,

I am trying to use YCSB against our HBase 0.98.5 instance and got a strange result: 
update is 6x better than read. It is just an exercise, so HBase is running on a 
workstation in standalone mode.
I modified the workloada shipped with YCSB into two new workloads, workloadr and 
workloadu, where workloadr does 100% read operations and workloadu does 100% update 
operations. The workloadr and workloadu config files are at the bottom for your 
reference.

I found that the read performance is much worse than the update performance; read 
throughput is about 6,000 ops/sec:

YCSB Client 0.1
Command line: -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloadr -p 
columnfamily=family -s -t
[OVERALL], RunTime(ms), 16565.0
[OVERALL], Throughput(ops/sec), 6036.824630244491

And the update throughput is about 36,000 ops/sec, 6x better than read.

YCSB Client 0.1
Command line: -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloadu -p 
columnfamily=family -s -t
[OVERALL], RunTime(ms), 2767.0
[OVERALL], Throughput(ops/sec), 36140.22406938923

Is this possible? IMHO, read should be faster than update.
Maybe I got something wrong in the workload files? Or is it really possible that update 
is faster than read? I couldn't find a YCSB mailing list; if anyone knows of one, please 
give me a link so I can also ask the question there. But is it possible that put is 
faster than get in HBase? If not, the result must be wrong and I need to debug the YCSB 
code to figure out what is going on.

Workloadr:
recordcount=10
operationcount=10
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=1
updateproportion=0
scanproportion=0
insertproportion=0
requestdistribution=zipfian

workloadu:
recordcount=10
operationcount=10
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0
updateproportion=1
scanproportion=0
insertproportion=0
requestdistribution=zipfian


Thanks,
Ming


RE: Is it possible that HBase update performance is much better than read in YCSB test?

2014-11-12 Thread Liu, Ming (HPIT-GADSC)
Thank you Andrew, this is an excellent answer, I get it now. I will try your 
hbase client for a 'fair' test :-)

Best Regards,
Ming

-Original Message-
From: Andrew Purtell [mailto:apurt...@apache.org] 
Sent: Thursday, November 13, 2014 2:08 AM
To: user@hbase.apache.org
Cc: DeRoo, John
Subject: Re: Is it possible that HBase update performance is much better than 
read in YCSB test?

Try this HBase YCSB client instead:
https://github.com/apurtell/ycsb/tree/new_hbase_client

The HBase YCSB driver in the master repo holds on to one HTable instance per 
driver thread. We accumulate writes into a 12MB write buffer before flushing 
them en masse. This is why the behavior you are seeing confounds your 
expectations. It's not correct behavior IMHO. YCSB wants to measure the round 
trip of every op, not the non-cost of local caching. Worse, if we have a lot of 
driver threads accumulating 12MB of edits more or less at the same rate, then 
we will flush these buffers more or less at the same time and stampede the 
cluster, which leads to deep valleys in observed write performance of 30-60 
seconds or longer.
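(For reference, the buffering described here is the normal HTable client-side write 
buffer; a hedged sketch of forcing a round trip per operation in plain client code -- 
table, row, and family names below are just placeholders:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

Configuration conf = HBaseConfiguration.create();
// hbase.client.write.buffer controls the buffer size when autoflush is off
HTable table = new HTable(conf, "usertable");
// with autoflush on, each put() is shipped to the region server immediately,
// so the measured latency is a real round trip instead of a local buffer append
table.setAutoFlush(true);
Put put = new Put("user1".getBytes());
put.add("family".getBytes(), "field0".getBytes(), "value".getBytes());
table.put(put);
table.close();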



On Tue, Nov 11, 2014 at 8:40 PM, Liu, Ming (HPIT-GADSC) 
wrote:

> Hi, all,
>
> I am trying to use YCSB to test on our HBase 0.98.5 instance and got a 
> strange result: update is 6x better than read. It is just an exercise, 
> so the HBase is running in a workstation in standalone mode.
> I modified the workloada shipped with YCSB into two new workloads:
> workloadr and workloadu, where workloadr is do 100% read operation and 
> workloadu is do 100% update operation. At the bottom is the workloadr 
> and workloadu config files for your reference.
>
> I found out that the read performance is much worse than the update 
> performance, read is about 6000:
>
> YCSB Client 0.1
> Command line: -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloadr 
> -p columnfamily=family -s -t [OVERALL], RunTime(ms), 16565.0 
> [OVERALL], Throughput(ops/sec), 6036.824630244491
>
> And the update performance is about 36000, 6x better than read.
>
> YCSB Client 0.1
> Command line: -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloadu 
> -p columnfamily=family -s -t [OVERALL], RunTime(ms), 2767.0 [OVERALL], 
> Throughput(ops/sec), 36140.22406938923
>
> Is this possible? IMHO, read should be faster than update.
> Maybe I am wrong in the workload file? Or there is a possibility that 
> update is faster than read? I don't find a YCSB mailing list, if 
> anyone knows, please give me a link, so I can also ask question on 
> that mailing list. But is it possible that put is faster than get in 
> hbase? If not, the result must be wrong and I need to debug the YCSB 
> code to figure out what is going wrong.
>
> Workloadr:
> recordcount=10
> operationcount=10
> workload=com.yahoo.ycsb.workloads.CoreWorkload
> readallfields=true
> readproportion=1
> updateproportion=0
> scanproportion=0
> insertproportion=0
> requestdistribution=zipfian
>
> workloadu:
> recordcount=10
> operationcount=10
> workload=com.yahoo.ycsb.workloads.CoreWorkload
> readallfields=true
> readproportion=0
> updateproportion=1
> scanproportion=0
> insertproportion=0
> requestdistribution=zipfian
>
>
> Thanks,
> Ming
>



--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)


how to explain read/write performance change after modifying the hfile.block.cache.size?

2014-11-20 Thread Liu, Ming (HPIT-GADSC)
Hello, all,

I am playing with YCSB to test HBase performance. I am using HBase 0.98.5. I tried 
adjusting hfile.block.cache.size to see the difference: when I set 
hfile.block.cache.size to 0, read performance is very bad but write performance is 
very good; when I set hfile.block.cache.size to 0.4, read is better but write 
performance drops dramatically. I have already disabled the client-side write buffer.
This is hard for me to understand:
The HBase guide just says the hfile.block.cache.size setting controls how much memory 
is used as the block cache by StoreFiles. I have no idea how HBase works internally. 
It is easy to understand that increasing the cache size should help reads, but why 
would it harm writes? For reference, write performance dropped from 30,000 to 4,000 
ops/sec just by changing hfile.block.cache.size from 0 to 0.4.
Could anyone give me a brief explanation of this observation, or some advice about 
what to study to understand what the block cache is used for?

Another question: an HBase write first goes to the WAL and then to the memstore. Will 
the write to the WAL be synced to disk before HBase writes the memstore, or is it 
possible that the WAL write is still buffered somewhere when HBase puts the data into 
the memstore?
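(For context, the only related knob I know of is the per-mutation durability setting in 
the 0.98 client API; a small sketch, with the column names made up:)

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;

Put put = new Put("row1".getBytes());
put.add("cf".getBytes(), "q".getBytes(), "v".getBytes());
// SYNC_WAL asks the region server to sync the WAL before acknowledging the put;
// ASYNC_WAL allows the edit to sit in the WAL writer's buffer for a short while
put.setDurability(Durability.SYNC_WAL);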

Reading the source code may cost me months, so a kind reply will help me a lot... ...
Thanks very much!

Best Regards,
Ming


RE: how to explain read/write performance change after modifying the hfile.block.cache.size?

2014-11-20 Thread Liu, Ming (HPIT-GADSC)
Thank you Ted,
It is a great explanation. You are always very helpful ^_^
I will study the link carefully.

Thanks,
Ming

-Original Message-
From: Ted Yu [mailto:yuzhih...@gmail.com] 
Sent: Friday, November 21, 2014 1:32 AM
To: user@hbase.apache.org
Subject: Re: how to explain read/write performance change after modifying the 
hfile.block.cache.size?

When block cache size increases from 0 to 0.4, the amount of heap given to 
memstore decreases. This would slow down the writes.
Please see:
http://hbase.apache.org/book.html#store.memstore
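(In other words, the two caches compete for the same heap; a hedged hbase-site.xml 
sketch of the two knobs involved -- in 0.98 the memstore fraction is usually 
hbase.regionserver.global.memstore.upperLimit, and the values below are only examples:)

<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value>   <!-- fraction of heap for the read-side block cache -->
</property>
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>   <!-- fraction of heap shared by all memstores (write side) -->
</property>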

For your second question, see this thread:
http://search-hadoop.com/m/DHED4TEvBy1/lars+hbase+hflush&subj=Re+Clarifications+on+HBase+Durability

Cheers

On Thu, Nov 20, 2014 at 8:05 AM, Liu, Ming (HPIT-GADSC) 
wrote:

> Hello, all,
>
> I am playing with YCSB to test HBase performance. I am using HBase 0.98.5.
> I tried to adjust the hfile.block.cache.size to see the difference, 
> when I set hfile.block.cache.size to 0, read performance is very bad, 
> but write performance is very very very good; when I set 
> hfile.block.cache.size to 0.4, read is better, but write performance 
> drop dramatically. I disable the client side writebuffer already.
> This is hard to understand for me:
> The HBase guide just said hfile.block.cache.size setting is about how 
> much memory used as block cache used by StoreFile. I have no idea of 
> how HBase works internally. Typically, it is easy to understand that 
> increase the size of cache should help the read, but why it will harm 
> the write operation? The write performance down from 30,000 to 4,000 
> for your reference, just by changing the hfile.block.cache.size from 0 to 0.4.
> Could anyone give me a brief explanation about this observation or 
> give me some advices about what to study to understand what is block cache 
> used for?
>
> Another question: HBase write will first write to WAL then to memstore.
> Will the write to WAL go to disk directly before hbase write memstore, 
> a sync operation or it is possible that write to WAL is still buffered 
> somewhere when hbase put the data into the memstore?
>
> Reading src code may cost me months, so a kindly reply will help me a 
> lot... ...
> Thanks very much!
>
> Best Regards,
> Ming
>


RE: how to explain read/write performance change after modifying the hfile.block.cache.size?

2014-11-22 Thread Liu, Ming (HPIT-GADSC)
Thank you Nick,

I will increase the heap size.
The machine is a development workstation. People use it to code, build software, and 
run unit tests. I use 'who' and 'ps' to make sure it is exclusive to me when I run the 
tests, but that is not accurate. You are right, a benchmark is not really valid in that 
environment. I am just using it to 'develop' a benchmarking tool, since I wrote a new 
YCSB driver for our system on top of HBase, but the tests described here use the native 
HBase driver against HBase directly. For the real benchmark, we will run on a separate 
cluster later. But I hope the numbers at least make sense even on a shared environment.
The heap configuration is something I really need to check, thank you.

Best Regards,
Ming

-Original Message-
From: Nick Dimiduk [mailto:ndimi...@gmail.com] 
Sent: Saturday, November 22, 2014 5:57 AM
To: user@hbase.apache.org
Cc: lars hofhansl
Subject: Re: how to explain read/write performance change after modifying the 
hfile.block.cache.size?

400MB block cache? Ouch. What's in your hbase-env.sh? Have you configured a heap size? 
My guess is you're using the unconfigured default of 1G. It should be at least 8G, and 
maybe more like 30G on this kind of host.

How many users are sharing it and with what kinds of tasks? If there's no IO 
isolation between processes, I suspect your benchmarks will be worthless on 
this shared environment.

-n

On Friday, November 21, 2014, Liu, Ming (HPIT-GADSC) 
wrote:

> Thank you Lars,
>
> There must be something wrong with my testing yesterday. I cannot 
> reproduce the issue anymore. Now, changing the cache.size from 0 to 
> 0.4 will not slow down the write perf dramatically, but still will 
> slow down write (Yesterday I saw a 7x slowdown, today it is about 1.3x 
> slowdown which is acceptable for me). As Ted pointed out, it is 
> possible that memstore cannot get enough memory when more RAM give to 
> block cache so it flush more frequently, but I really need more 
> reading to understand when memstore will flush.
> And one thing I noticed is when I restart hbase for the very first 
> test, the performance is best, then the second time, it is slower, 
> both read and write, and slower and slower in the following tests and 
> get to a stable point after about 3 or 4 times, in each run I will 
> read 5,000,000 rows and update 5,000,000 rows. There are too many 
> factors affect the read/write OPS in hbase...
>
> My purpose is to find a proper way to evaluate performance, since we 
> are going to change something in hbase and it is good to have a base 
> benchmark so we can compare the performance after change. So I must 
> make sure the perf test itself make sense and should be trusted.
>
> I saw an entry in the log may help to see the cache settings in my system:
> hfile.LruBlockCache: Total=373.54 MB, free=13.13 MB, max=386.68 MB, 
> blocks=5655, accesses=17939947, hits=14065015, hitRatio=78.40%, , 
> cachingAccesses=17934857, cachingHits=14064420, 
> cachingHitsRatio=78.42%, evictions=15646, evicted=3861343, 
> evictedPerRun=246.7942657470703
>
> My testing environment is a workstation with 12 core CPU, 96G memory 
> and
> 1.7 T disk. But it is a shared workstation, many users share it and I 
> started hbase in standalone mode with hbase-site.xml as below :
>
> <configuration>
>   <property>
>     <name>hbase.rootdir</name>
>     <value>hdfs://localhost:24400/hbase</value>
>   </property>
>   <property>
>     <name>hbase.zookeeper.property.dataDir</name>
>     <value>hdfs://localhost:24400/zookeeper</value>
>   </property>
>   <property>
>     <name>hbase.master.port</name>
>     <value>24560</value>
>   </property>
>   <property>
>     <name>hbase.master.info.port</name>
>     <value>24561</value>
>   </property>
>   <property>
>     <name>hbase.regionserver.port</name>
>     <value>24562</value>
>   </property>
>   <property>
>     <name>hbase.regionserver.info.port</name>
>     <value>24563</value>
>   </property>
>   <property>
>     <name>hbase.zookeeper.peerport</name>
>     <value>24567</value>
>   </property>
>   <property>
>     <name>hbase.zookeeper.leaderport</name>
>     <value>24568</value>
>   </property>
>   <property>
>     <name>hbase.zookeeper.property.clientPort</name>
>     <value>24570</value>
>   </property>
>   <property>
>     <name>hbase.rest.port</name>
>     <value>24571</value>
>   </property>
>   <property>
>     <name>hbase.client.scanner.caching</name>
>     <value>100</value>
>   </property>
>   <property>
>     <name>hbase.client.scanner.timeout.period</name>
>     <value>6</value>
>   </property>
>   <property>
>     <name>hbase.bulkload.staging.dir</name>
>     <value>hdfs://localhost:24400/hbase-staging</value>
>   </property>
>   <property>
>     <name>hbase.snapshot.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hbase.master.distributed.log.splitting</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>zookeeper.session.timeout</name>
>     <value>9000</value> <!-- :-) I just want to make sure it never times out here, since I got timeouts so many times... -->
>   </property>
>   <property>
>     <name>hfile.block.cache.size</name>
>     <value>0.4</value>
>   </property>
> </configuration>
>
>
> [liuliumi@ YCSB]$ free -m
>  total   used   free sharedbuffers cached
> Mem: 96731  46828  49903  0984  32525
> -/+ buffers/cache:  

What is proper way to make a hbase connection? using HTable (conf,tbl) or createConnection? Zookeeper session run out.

2014-11-24 Thread Liu, Ming (HPIT-GADSC)
Hello,

I am using HBase 0.98.5. In example HBase client programs, some use createConnection() 
and some use HTable() directly. I found they behave differently, so I wrote two simple 
test programs using the two methods; each program starts two threads and does a simple 
put. I found that one program opens only 1 zookeeper session shared by the two threads, 
while the other opens 2 zookeeper sessions. I don't understand why the program using 
createConnection makes more zookeeper sessions than the one that simply uses HTable. 
Is it possible to use createConnection in two threads but share the same zookeeper 
session?

Here are the details:
Demo1 makes two zookeeper sessions, which looks like two HBase connections, but Demo2 
makes only one zookeeper session. My real program uses createConnection in multiple 
threads as in Demo1. Since I have a very small zookeeper that only allows 60 concurrent 
sessions, my program always fails when hundreds of threads are started. But I saw that 
using HTable directly consumes only 1 zookeeper session. Switching would change a lot 
in my current program, so I wish there were a way to use createConnection and get the 
same behavior as HTable. Is that possible?

Source code:

Demo1.java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

class ClientThread extends Thread {
    public static Configuration configuration;
    static {
        configuration = HBaseConfiguration.create();
    }

    public void run() {
        try {
            System.out.println("start insert data ..");
            // each thread creates its own HConnection -> two zookeeper sessions
            HConnection con = HConnectionManager.createConnection(configuration);
            HTable table = (HTable) con.getTable("hbase_table1");
            Put put = new Put("1".getBytes());
            put.add("c1".getBytes(), null, "baidu".getBytes());
            put.add("c2".getBytes(), null, "http://www.baidu.com1".getBytes());
            try {
                table.put(put);
            } catch (IOException e) {
                e.printStackTrace();
            }
            System.out.println("end insert data ..");
        } catch (Exception e) {
        }
    }
}

public class demo1 {
    public static void main(String[] args) throws Exception {
        Thread t1 = new ClientThread();
        Thread t2 = new ClientThread();
        t1.start();
        t2.start();
    }
}


Demo2.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;

class ClientThread1 extends Thread {
    public static Configuration configuration;
    static {
        configuration = HBaseConfiguration.create();
    }

    public void run() {
        System.out.println("start insert data ..");
        try {
            // both threads use the same Configuration, so the implicitly
            // created connection is shared -> one zookeeper session
            HTableInterface table = new HTable(configuration, "hbase_table1");
            Put put = new Put("1".getBytes());
            put.add("c1".getBytes(), null, "baidu".getBytes());
            put.add("c2".getBytes(), null, "http://www.baidu.com1".getBytes());
            table.put(put);
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("end insert data ..");
    }
}

public class demo2 {
    public static void main(String[] args) throws Exception {
        Thread t1 = new ClientThread1();
        Thread t2 = new ClientThread1();
        t1.start();
        t2.start();
    }
}



This should be a very basic question, sorry; I really did some searching but could not 
find a good explanation. Any help will be much appreciated.

Thanks,
Ming


RE: What is proper way to make a hbase connection? using HTable (conf,tbl) or createConnection? Zookeeper session run out.

2014-11-24 Thread Liu, Ming (HPIT-GADSC)
Thank you Bharath,

This is a very helpful reply! I will share the connection between the two threads. 
Simply put, HTable is not safe for multi-threaded use, is that true? With multiple 
threads, one must use HConnectionManager.

Thanks,
Ming
-Original Message-
From: Bharath Vissapragada [mailto:bhara...@cloudera.com] 
Sent: Monday, November 24, 2014 4:52 PM
To: hbase-user
Subject: Re: What is proper way to make a hbase connection? using HTable 
(conf,tbl) or createConnection? Zookeeper session run out.

Hi Ming,

HConnection connection = HConnectionManager.createConnection(conf);
HTableInterface table = connection.getTable("mytable");
table.get(...); // or table.put(...);

is the correct way to use it. HConnectionManager.createConnection(conf) gives you a 
"shared" HConnection which you can reuse across multiple threads; each thread gets its 
own table from conn.getTable() and does its puts/gets.
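A minimal sketch of that pattern applied to the two-thread demo from the original mail 
(a hypothetical demo3, reusing the table name from the demos):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;

public class demo3 {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // one connection (and one zookeeper session) for the whole process
        final HConnection con = HConnectionManager.createConnection(conf);
        Runnable work = new Runnable() {
            public void run() {
                try {
                    // cheap per-thread table handle on top of the shared connection
                    HTableInterface table = con.getTable("hbase_table1");
                    try {
                        Put put = new Put("1".getBytes());
                        put.add("c1".getBytes(), null, "baidu".getBytes());
                        table.put(put);
                    } finally {
                        table.close();   // closes the table, not the shared connection
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        con.close();   // close the connection once, when the application is done
    }
}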

Thanks,
Bharath

On Mon, Nov 24, 2014 at 2:03 PM, Liu, Ming (HPIT-GADSC) 
wrote:

> Hello,
>
> I am using HBase 0.98.5. In example hbase client programs, some use
> createConnection() and some use HTable() directly, I found they behave 
> different, I wrote two simple tests program using these two different 
> methods, each program will start two threads and do simple put.  and I
> found:
> One program will start only 1 zookeeper sessions shared in two threads 
> while another will start 2 zookeeper sessions. So I don't know why the 
> program using createConnection  will do more zookeeper requests than 
> simply use HTable. Is it possible to use createConnection in two 
> threads but share the same zookeeper session?
>
> Here is details:
> Demo1 will make two zookeeper sessions, seems two HBase connections; 
> but
> Demo2 will only make one zookeeper session. My real program is using 
> createConnection in multiple threads as in demo1, since I have a very 
> small zookeeper , it only allows 60 concurrent sessions, so my program 
> always fail when there are hundreds of threads started. But I saw if 
> using HTable directly, it will only consume 1 zookeeper session. But 
> it will change a lot in my current program, so I wish there is a way 
> to use createConnection and behave same as using HTable, is it possible?
>
> Source code:
>
> Demo1.java
> class ClientThread extends Thread
> {
> public static Configuration configuration;
> static {
> configuration = HBaseConfiguration.create();
> }
> public void run()
> {
> try {
> System.out.println("start insert data ..");
> HConnection
> con=HConnectionManager.createConnection(configuration);
> HTable table = (HTable)con.getTable("hbase_table1");
> Put put = new Put("1".getBytes());
> put.add("c1".getBytes(), null, "baidu".getBytes());
> put.add("c2".getBytes(), null, "http://www.baidu.com1 
> ".getBytes());
> try {
> table.put(put);
> } catch (IOException e) {
> e.printStackTrace();
> }
> System.out.println("end insert data ..");
> }
> catch  (Exception e) {
>}
>
> }
> }
> public class demo1 {
>
> public static void main(String[] args) throws Exception {
> Thread t1=new ClientThread();
> Thread t2=new ClientThread();
> t1.start();
> t2.start();
> }
>
> }
>
>
> Demo2.java
> class ClientThread1 extends Thread
> {
> public static Configuration configuration;
> static {
> configuration = HBaseConfiguration.create();
> }
> public void run()
> {
> System.out.println("start insert data ..");
> try {
> HTableInterface table = new 
> HTable(configuration, "hbase_table1");
> Put put = new Put("1".getBytes());
> put.add("c1".getBytes(), null, "baidu".getBytes());
> put.add("c2".getBytes(), null, "
> http://www.baidu.com1".getBytes());
> table.put(put);
> } catch (Exception e) {
> e.printStackTrace();
> }
> System.out.println("end insert data ..");
>
> }
>
>
> }
> public class demo2 {
>
> public static void main(String[] args) throws Exception {
> Thread t1=new ClientThread1();
> Thread t2=new ClientThread1();
> t1.start();
> t2.start();
> }
> }
>
>
>
> This should be a very basic question, sorry, I really did some search 
> but cannot find any good explaination. Any help will be very appreciated.
>
> Thanks,
> Ming
>



--
Bharath Vissapragada
<http://www.cloudera.com>


RE: YCSB load failed because hbase region too busy

2014-11-25 Thread Liu, Ming (HPIT-GADSC)
Hi, Louis,

Sorry I cannot help here; I am just very curious about your YCSB test results. It is 
not easy to find recent YCSB results on the internet.

In my own HBase environment, my results are 'update is always better than read' and 
'scan is slightly better than update'. I tried many times with various HBase 
configurations and on two different machines and got the same result; the absolute 
numbers differ, but 'random single write is always better than random single read' and 
'a 100-row scan is better than a single write' is a very stable result. I modified the 
workload to add a pure-update workload; it would be much appreciated if you could run 
that test too (by setting readproportion to 0 and updateproportion to 1). I also 
changed CoreWorkload.java doTransactionScan() to always do a 100-row scan instead of a 
random-length scan, so I can easily see how many rows were scanned and compare against 
the pure-write and pure-read results.

If it is not appropriate to share the absolute numbers of your test, could you at least 
tell me whether in your test 'read is better than write' or 'write is better than 
read', and by how much? I have asked a few times on this mailing list, and people 
explained that it is possible for write to be faster than read in HBase, but I still 
want to know whether this is common or just my environment.

I also thought you might be hitting the 'stuck' issue mentioned in 
http://hbase.apache.org/book.html in section 9.7.7.7.1.1, but I am not sure. I would be 
happy to hear how you solve the issue later. And as Ram and Qiang Tian mentioned, you 
can only 'alleviate' the issue by increasing the knob; if you put too much pressure on 
HBase, it will stop keeping up sooner or later. Everything has its limits :-)
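(For reference, the knobs Ram and Qiang Tian mention are plain hbase-site.xml settings; 
a sketch with example values only, not recommendations:)

<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>20</value>   <!-- how many store files a store may have before flushes are delayed -->
</property>
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>4</value>    <!-- block updates when the memstore reaches flush size times this multiplier -->
</property>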

Thanks,
Ming

-Original Message-
From: louis hust [mailto:louis.h...@gmail.com] 
Sent: Tuesday, November 25, 2014 9:44 PM
To: user@hbase.apache.org
Subject: Re: YCSB load failed because hbase region too busy

Hi Ram,
Thanks for the help. I am just doing a test of the bucket cache; in the production 
environment we will follow your suggestion.

Sent from my iPhone

> On 2014年11月25日, at 20:36, ramkrishna vasudevan 
>  wrote:
> 
> Your write ingest is too high. You have to control that by first 
> adding more nodes and ensuring that you have a more distributed load.  
> And also try with the changing the hbase.hstore.blockingStoreFiles.
> 
> Even changing the above value if your write ingest is so high such 
> that if it can reach this configured value again you can see blocking writes.
> 
> Regards
> RAm
> 
> 
>> On Tue, Nov 25, 2014 at 2:20 PM, Qiang Tian  wrote:
>> 
>> in your log:
>> 2014-11-25 13:31:35,048 WARN  [MemStoreFlusher.13]
>> regionserver.MemStoreFlusher: Region
>> usertable2,user8289,1416889268210.7e8fd83bb34b155bd0385aa63124a875. 
>> has too many store files; delaying flush up to 9ms
>> 
>> please see my original reply...you can try increasing 
>> "hbase.hstore.blockingStoreFiles", also you have only 1 RS and you 
>> split to
>> 100 regionsyou can try 2 RS with 20 regions.
>> 
>> 
>> 
>>> On Tue, Nov 25, 2014 at 3:42 PM, louis.hust  wrote:
>>> 
>>> yes, the stack trace like below:
>>> 
>>> 2014-11-25 13:35:40:946 4260 sec: 232700856 operations; 28173.18 
>>> current ops/sec; [INSERT AverageLatency(us)=637.59]
>>> 2014-11-25 13:35:50:946 4270 sec: 232700856 operations; 0 current
>> ops/sec;
>>> 14/11/25 13:35:59 INFO client.AsyncProcess: #14, table=usertable2,
>>> attempt=10/35 failed 109 ops, last exception:
>>> org.apache.hadoop.hbase.RegionTooBusyException:
>>> org.apache.hadoop.hbase.RegionTooBusyException: Above memstore 
>>> limit,
>> regionName=usertable2,user8289,1416889268210.7e8fd83bb34b155bd0385aa6
>> 3124a875.,
>>> server=l-hbase10.dba.cn1.qunar.com,60020,1416889404151,
>>> memstoreSize=536886800, blockingMemStoreSize=536870912
>>>at
>> org.apache.hadoop.hbase.regionserver.HRegion.checkResources(HRegion.j
>> ava:2822)
>>>at
>> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java
>> :2234)
>>>at
>> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java
>> :2201)
>>>at
>> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java
>> :2205)
>>>at
>> org.apache.hadoop.hbase.regionserver.HRegionServer.doBatchOp(HRegionS
>> erver.java:4253)
>>>at
>> org.apache.hadoop.hbase.regionserver.HRegionServer.doNonAtomicRegionM
>> utation(HRegionServer.java:3469)
>>>at
>> org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServe
>> r.java:3359)
>>>at
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService
>> $2.callBlockingMethod(ClientProtos.java:29503)
>>>at
>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
>>>at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
>>>at
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpc
>> Scheduler.java:160)
>>>at
>> org.apache.hadoop.hbase.ip

how to tell there is a OOM in regionserver

2014-12-01 Thread Liu, Ming (HPIT-GADSC)
Hi, all,

Recently one of our HBase 0.98.5 instances ran into trouble: when running a specific 
workload, all region servers suddenly shut down at the same time, but the master keeps 
running. When I check the logs, the master log has messages like
2014-12-01 08:28:11,072 DEBUG [main-EventThread] master.ServerManager: 
Added=n008.cluster,60020,1417413986550 to dead servers, submitted shutdown 
handler to be executed meta=false
On n008, the region server log has no ERROR message; the last log entry looks very much 
like a ZooKeeper startup message. The log just stops there, and the region server 
process is gone when we check with 'jps'.

We then increased the region server heap size and it works fine; the region servers no 
longer disappear. So we suspect there was an out-of-memory issue and the region server 
processes were killed. But my questions are:

1.   What log message indicates an OOM? Since the region server is 'kill -9'ed, I think 
there is no message that can tell this.

2.   If there is no typical log message about OOM, how can an admin confirm that a 
region server OOM happened? We are just guessing and cannot be sure. We hope there is a 
way to tell for certain that an OOM occurred.

3.   Does the ZooKeeper message appear every time a region server OOMs (if it is an 
OOM), or is it just a random artifact of our system?

So, in summary, I want to know what the typical clue is that lets people confirm an OOM 
in an HBase region server.

Thank you,
Ming


RE: how to tell there is a OOM in regionserver

2014-12-01 Thread Liu, Ming (HPIT-GADSC)
Thank you both!

Yes, I can see the '.out' file with clear proof that the process was killed. So we can 
confirm the issue now!
And it is also true that we must rely on the JVM itself for proof that the kill was due 
to an OOM.
Thank you both, this was a very good lesson.

Thanks,
Ming

-Original Message-
From: Bharath Vissapragada [mailto:bhara...@cloudera.com] 
Sent: Tuesday, December 02, 2014 2:00 PM
To: hbase-user
Subject: Re: how to tell there is a OOM in regionserver

I agree with Otis' response. Adding a few more details: there is a ".out" file in the 
logs/ directory, which is the stdout for each of these daemons, and in case of an OOM 
crash it prints something like this:

# java.lang.OutOfMemoryError: Java heap space

# -XX:OnOutOfMemoryError="kill -9 %p"

#   Executing /bin/sh -c "kill -9 "...
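(If it helps, you can also ask the JVM to write a heap dump on OOM so there is a 
durable artifact besides the .out file; a hedged hbase-env.sh sketch, where the dump 
path is just an example:)

export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/log/hbase/"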



On Tue, Dec 2, 2014 at 11:06 AM, Otis Gospodnetic < otis.gospodne...@gmail.com> 
wrote:

> Hi Ming,
>
> 1) There typically is an OOM message from the JVM itself
>
> 2) I would monitor the server instead of relying on log messages 
> mentioning OOMs.  For example, in SPM <http://sematext.com/spm/> we 
> have "hearbeat alerts" that tell us when we stop hearing from 
> RegionServers and other types of servers.  It also helps when servers 
> simply die for reasons other than OOM.
>
> 3) You could (should?) monitor individual memory pools and possibly 
> set alerts or anomaly detection on those.  If you have that, if there 
> was an OOM, you will typically see one of the memory pools approach 
> 100% utilization.  I personally really like this report in SPM because 
> it gives a bit more insight than just "heap size/utilization".  So I'd 
> point the admin to this sort of monitoring report.
>
> 4) High GC counts/time, or jump in those metrics, and then typically 
> also jump in CPU usage is what often precedes OOMs.
>
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management 
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Tue, Dec 2, 2014 at 12:22 AM, Liu, Ming (HPIT-GADSC) 
> 
> wrote:
>
> > Hi, all,
> >
> > Recently, one of our HBase 0.98.5 instance meet with issues: when 
> > run
> some
> > specific workload, all region servers will suddenly shut down at 
> > same
> time,
> > but master is still running. When I check the log, in master log, I 
> > can
> see
> > messages like
> > 2014-12-01 08:28:11,072 DEBUG [main-EventThread] master.ServerManager:
> > Added=n008.cluster,60020,1417413986550 to dead servers, submitted
> shutdown
> > handler to be executed meta=false
> > And on n008, regionserver log file, there is no ERROR message, the 
> > last log entry looks very like a ZooKeeper startup message. The log 
> > just
> stopped
> > with that last ZooKeeper startup message, and the Region Server 
> > process
> was
> > gone when we check with 'jps'.
> >
> > We then increased the heap size of regionserver, and it work fine.
> > RegionServer no longer disappear. So we doubt there was a Out Of 
> > Memory issue, so the region server processes are killed. But my questions 
> > are:
> >
> > 1.   What log message will indicate there is a OOM? Since the region
> > server is 'kill -9', so I think there is no message can tell this.
> >
> > 2.   If there is no typical log message about OOM, then how can an
> > admin make sure there is a region server OOM happened? We just 
> > guess, but can not make sure. We hope there is a method to tell OOM 
> > occured for
> sure.
> >
> > 3.   Does the Zookeeper message appears every time with RegionServer
> > OOM (if it is a OOM). Or it is just a random event just in our system?
> >
> > So in sum, I want to know what is the typical clue that people can 
> > make sure there is a OOM issue in HBase region server?
> >
> > Thank you,
> > Ming
> >
>



--
Bharath Vissapragada
<http://www.cloudera.com>


Durability of in-memory column family

2015-01-05 Thread Liu, Ming (HPIT-GADSC)
Hi, all,

I want to use a column family to save some runtime data. It is small, so I set that CF 
to in-memory to increase performance, while the user data is still saved in a normal CF.
My question is: will the data in the in-memory column family get lost if the region 
server fails? In other words, is the data in an in-memory CF as safe as in an ordinary 
CF, with no difference?
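(To be concrete, the setting I mean is the per-family IN_MEMORY flag; a small sketch of 
how the table is defined, with made-up table/family names:)

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;

HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
HColumnDescriptor runtimeCf = new HColumnDescriptor("runtime");
// IN_MEMORY is a block cache priority hint; the family is still written
// to the WAL and flushed to HFiles like any other family
runtimeCf.setInMemory(true);
desc.addFamily(runtimeCf);
desc.addFamily(new HColumnDescriptor("userdata"));   // ordinary column family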

I could run the test myself, but it would take some time, so I would like to be lazy 
and ask for help here :) If someone happens to know the answer, thanks in advance!

Thanks,
Ming


Given a Put object, is there any way to change the timestamp of it?

2015-01-20 Thread Liu, Ming (HPIT-GADSC)
Hello, there,

I am developing a coprocessor under HBase 0.98.6. The client sends a Put object to the 
coprocessor in protobuf; when the coprocessor receives the message, it invokes 
ProtobufUtil.toPut to convert it to a Put object, does various checks, and then puts it 
into the HBase table.
Now I have a requirement to change the timestamp of that Put object, but I found no way 
to do this.

My first attempt was to generate a new Put object with a new timestamp and copy the old 
one into the new object. But I found that, given a Put object, I have no way to get ALL 
its cells out if I don't know the column family and column qualifier names in advance. 
In my case, those CF/column names are user defined and effectively random, so I am 
stuck here. Does anyone have an idea how to work around this?

The Mutation class has a getTimestamp() method but no setTimestamp(). I wish there were 
a setTimestamp() for it. Is there any reason it is not provided? I hope a future 
release can expose a setTimestamp() method on Mutation; is that possible? If so, my job 
would get much easier...

Thanks,
Ming


RE: Given a Put object, is there any way to change the timestamp of it?

2015-01-21 Thread Liu, Ming (HPIT-GADSC)
Thanks Ted!
This is exactly what I need. 

This will be a memory copy, but it solves my problem. I hope HBase can provide a 
setTimestamp() method in a future release.

Best Regards,
Ming

-Original Message-
From: Ted Yu [mailto:yuzhih...@gmail.com] 
Sent: Wednesday, January 21, 2015 11:30 AM
To: user@hbase.apache.org
Subject: Re: Given a Put object, is there any way to change the timestamp of it?

bq. I have no way to get ALL its cells out

Mutation has the following method:

  public NavigableMap<byte[], List<Cell>> getFamilyCellMap()

FYI
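A rough sketch of the copy built on that method (the helper name is made up):

import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Put;

// Rebuild a Put with every cell re-stamped to newTs, without knowing the
// column families or qualifiers in advance.
static Put withTimestamp(Put original, long newTs) {
    Put copy = new Put(original.getRow(), newTs);
    for (List<Cell> cells : original.getFamilyCellMap().values()) {
        for (Cell cell : cells) {
            copy.add(CellUtil.cloneFamily(cell),
                     CellUtil.cloneQualifier(cell),
                     newTs,
                     CellUtil.cloneValue(cell));
        }
    }
    return copy;
}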

On Tue, Jan 20, 2015 at 5:43 PM, Liu, Ming (HPIT-GADSC) 
wrote:

> Hello, there,
>
> I am developing a coprocessor under HBase 0.98.6. The client send a 
> Put object to the coprocessor in Protobuf, when the coprocessor 
> receive the message , it invokes ProtobufUtil.toPut to convert it to a 
> Put object. Do various checking  and then put it into HBase table.
> Now, I get a requirement to change the timestamp of that Put object, 
> but I found no way to do this.
>
> I was first try to generate a new Put object with a new timestamp, and 
> try to copy the old one into this new object. But I found given a Put 
> object, I have no way to get ALL its cells out if I don't know the 
> column family and column qualifier name in advance. In my case, those 
> CF/Column names are random as user defined. So I stuck here. Could 
> anyone have idea how to workaround this?
>
> The Mutation class has getTimestamp() method but no setTimestamp(). I 
> wish there is a setTimestamp() for it. Is there any reason it is not 
> provided? I hope in future release Mutation can expose a 
> setTimestamp() method, is it possible? If so, my job will get much easier...
>
> Thanks,
> Ming
>


HTable or HConnectionManager, how a client connect to HBase?

2015-02-14 Thread Liu, Ming (HPIT-GADSC)
Hi,

I am using HBase 0.98.6.

I learned from this mailing list before that the recommended way to 'connect' to HBase 
from a client is to use HConnectionManager like this:

HConnection con = HConnectionManager.createConnection(configuration);
HTableInterface table = con.getTable("hbase_table1");

instead of:

HTableInterface table = new HTable(configuration, "hbase_table1");

I don't quite understand the reason. I was thinking that each time I instantiate an 
HTable, it needs to create a new HConnection, and that is expensive; with the first 
method, multiple HTable instances can share the same HConnection. That seems quite 
reasonable to me.
However, I read in some articles on the internet that even if I use the 'new 
HTable(conf, tbl)' method, as long as the 'conf' object is the same one, all the HTable 
instances will still share the same HConnection. I recently read yet another article 
saying that with 'new HTable(conf, tbl)' one doesn't even need to use the exact same 
'conf' object (the same one in memory): if two different 'conf' objects have all the 
same attributes (for example, both created from the same hbase-site.xml and never 
changed), the HTable objects can still share the same HConnection. I also tried to read 
the HTable source code; it is very hard, but it seems to me the last statement is 
correct: 'HTables will share an HConnection if the configuration is all the same'.

Sorry for being so verbose. My questions:
If two 'configuration' objects are equal, will two HTable objects instantiated with 
them (directly via 'new HTable()') still share the same HConnection or not?
If the answer is 'yes', then why do I still need HConnectionManager to create a shared 
connection?
I am talking about 0.98.6.
I googled for days and even tried to read the HBase source code, but I am still really 
confused. I also tried to do some tests, but since I am a newbie I don't know how to 
verify the difference; I really don't know what an HConnection does under the hood. I 
counted the ZooKeeper client requests and found some difference. If that ZooKeeper 
request count is a valid metric, it tells me that two HTables do NOT share an 
HConnection even when using the same 'configuration' in the constructor. So I am more 
and more confused.

Could someone kindly help with this newbie question? Thanks in advance.

Thanks,
Ming




RE: managing HConnection

2015-02-16 Thread Liu, Ming (HPIT-GADSC)
Hi, 

Thank you Serega for the helpful reply, and thanks Jneidi for asking this. I have 
similar confusion.
So Serega, when does your application finally close the HConnection? Or is the 
connection NEVER closed as long as your application is running? Is it OK to NOT close 
the HConnection and let the application exit directly?
My application is a long-running service that accepts user requests and does CRUD 
against HBase, so I would like to use your model here. But is it reasonable to keep 
that HConnection open for a very long time, for example months? Are there any potential 
problems I need to watch out for?
Also, as David Chen asked, if all threads share the same HConnection, it may limit the 
achievable throughput, so would a pool of connections be better?

Thanks,
Ming

-Original Message-
From: Serega Sheypak [mailto:serega.shey...@gmail.com] 
Sent: Wednesday, February 04, 2015 1:02 AM
To: user
Subject: Re: managing HConnection

Hi, guys from this group helped me a lot. I solved pretty much the same problem (a CRUD 
web app):

1. Use a single HConnection instance per application.
2. Instantiate it once.
3. Create an HTable instance for each CRUD operation and safely close it 
(try-catch-finally). Use the same HConnection to create every HTable for CRUD 
operations.
4. DO NOT close the HConnection after a CRUD operation.

I have logic controllers which get the HConnection injected in HttpServlet.init, so I 
have 5 HConnection instances per application, created during servlet initialization. A 
sketch of that shape follows below.
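A hedged sketch (servlet, table, and column names are invented; 0.98-era API):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;

public class CrudServlet extends HttpServlet {
    private HConnection connection;   // one long-lived connection per servlet

    @Override
    public void init() throws ServletException {
        try {
            connection = HConnectionManager.createConnection(HBaseConfiguration.create());
        } catch (IOException e) {
            throw new ServletException(e);
        }
    }

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // per-request: cheap table handle on top of the shared connection
        HTableInterface table = connection.getTable("mytable");
        try {
            Put put = new Put(req.getParameter("row").getBytes());
            put.add("cf".getBytes(), "q".getBytes(), req.getParameter("value").getBytes());
            table.put(put);
        } finally {
            table.close();            // close the table, keep the connection
        }
    }

    @Override
    public void destroy() {
        try {
            connection.close();       // the connection is closed only on shutdown
        } catch (IOException e) {
            // ignore on shutdown
        }
    }
}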


2015-02-03 18:12 GMT+03:00 Ted Yu :

> Please see '61.1. Cluster Connections' under 
> http://hbase.apache.org/book.html#architecture.client
>
> Cheers
>
> On Tue, Feb 3, 2015 at 6:47 AM, sleimanjneidi 
> 
> wrote:
>
> > Hi all,
> > I am using hbase-0.98.1-cdh5.1.4 client and I am a bit confused by 
> > the documentation of HConnection. The document says the following:
> >
> > HConnection instances can be shared. Sharing is usually what you 
> > want because rather than each HConnection instance having to do its 
> > own discovery of regions out on the cluster, instead, all clients 
> > get to
> share
> > the one cache of locations. HConnectionManager does the sharing for 
> > you
> if
> > you go by it getting connections. Sharing makes cleanup of 
> > HConnections awkward. .
> >
> > So now I have a simple question: Can I share the same HConnection
> instance
> > in my entire application?
> > And write some magic code to know when to close or never close at all?
> > Or I have to create an instance and close it every time I do a CRUD 
> > operation ?
> >
> > Many thanks
> >
> >
> >
>


RE: HTable or HConnectionManager, how a client connect to HBase?

2015-02-16 Thread Liu, Ming (HPIT-GADSC)
Hi,

I had to spend a lot of time looking into the source code of HTable and 
HConnectionManager.
IMHO, the document on the HBase website seems misleading. The HBase online document, 
http://hbase.apache.org/book.html#architecture.client , says:
==
For example, this is preferred:

HBaseConfiguration conf = HBaseConfiguration.create();
HTable table1 = new HTable(conf, "myTable");
HTable table2 = new HTable(conf, "myTable");

as opposed to this:

HBaseConfiguration conf1 = HBaseConfiguration.create();
HTable table1 = new HTable(conf1, "myTable");
HBaseConfiguration conf2 = HBaseConfiguration.create();
HTable table2 = new HTable(conf2, "myTable");
===
After checking the source code, it seems that only in 0.20 must HTable use the same 
Configuration instance in order to share the HConnection; 0.20 uses the Configuration 
instance as the key of the hashmap that stores HConnections. In the 0.90.0 code it 
already uses HConnectionKey as the key of the HashMap holding the shared HConnections.

So as far as I understand, the document is NOT accurate for HBase versions later than 
0.90: the two examples above can both share an HConnection instance. If I am wrong, 
please correct me.

Back to my previous question: if two HTables already share an HConnection, why do I 
need to create an HConnection first via HConnectionManager.createConnection()?
From the source code, it seems that HTable.close() will also close the HConnection, so 
once one table is closed the following HTables have to reconnect and there is no 
sharing. But if the HTable is obtained via HConnection.getTable(), a special HTable 
constructor is used so that HTable.close() does NOT close the connection, and the 
HConnection can be shared.

I will use the recommended method. As discussed in another thread here, to share an 
HConnection one still has to ensure the shared connection is not closed prematurely, so 
HConnectionManager is a good abstraction for controlling the life cycle of a 
connection. I think I understand now :-)

Thanks,
Ming


-----Original Message-
From: Liu, Ming (HPIT-GADSC) 
Sent: Saturday, February 14, 2015 10:45 PM
To: user@hbase.apache.org
Subject: HTable or HConnectionManager, how a client connect to HBase?

Hi,

I am using HBase 0.98.6.

I learned from this maillist before, that the recommended method to 'connect' 
to HBase from client is to use HConnectionManager like this:
HConnection 
con=HConnectionManager.createConnection(configuration);
HTableInterfacetable = 
con.getTable("hbase_table1"); Instead of
HTableInterface table = new 
HTable(configuration, "hbase_table1");

I don't quite understand the reason. I was thinking that each time I initialize 
a HTable instance, it needs to create a new HConnection. And that is expensive. 
But using the first method, multiple HTable instances can share the same 
HConnection. That is quite reasonable to me.
However, I was reading from some articles on internet that , even if I use the 
'new HTable(conf, tbl)' method, if the 'conf' object is the same one, all the 
HTable instances will still share the same HConnection. I was recently read yet 
another article and said when using 'new HTable(conf, tbl)', one don't need to 
use the exactly same 'conf' object (same one in memory). if two 'conf' objects, 
two different objects are all the same, I mean all attributes of these two are 
same (for example, created from the same hbase-site.xml and never change) then 
HTable objects can still share the same HConnection.  I also try to read the 
HTable src code, it is very hard, but it seems to me the last statement is 
correct: 'HTable will share HConnection, if configuration is all the same'.

Sorry for so verbose. My question:
If two 'configuration' objects are same, then two HTable object instantiated 
with them respectively can still share the same HConnection or not? Directly 
using the 'new HTable()' method.
If the answer is 'yes', then why I still need the HConnectionManager to create 
a shared connection?
I am talking about 0.98.6.
I googled for days, and even try to read HBase src code, but still get really 
confused. I try to do some tests also, but since I am too newbie, I don't know 
how to verify the difference, I really don't know what a HConnection do under 
the hood. I counted the ZooKeeper client requests, and I found some difference. 
If this ZooKeeper requests difference is a correct metrics, it means to me that 
two HTable do not share HConnetion even using same 'configuration' in the 
constructor. So it confused me more and more

Please someone kindly help me for this newbie question and thanks in advance.

Thanks,
Ming




RE: HTable or HConnectionManager, how a client connect to HBase?

2015-02-23 Thread Liu, Ming (HPIT-GADSC)
Thanks, Enis,

Your reply is very clear,  I finally understand it now.

Best Regards,
Ming
-Original Message-
From: Enis Söztutar [mailto:enis@gmail.com] 
Sent: Thursday, February 19, 2015 10:41 AM
To: hbase-user
Subject: Re: HTable or HConnectionManager, how a client connect to HBase?

It is a bit more complex than that. It is actually a hash of a subset of the 
configuration properties. See the HConnectionKey class if you want to learn more. But 
the important thing is that with the new style you do not need to worry about any of 
this, since there is no implicit connection sharing. Everything is explicit now.

Enis

On Tue, Feb 17, 2015 at 11:50 PM, Serega Sheypak 
wrote:

> Hi, Enis Söztutar
> You've wrote:
> >>You are right that the constructor new HTable(Configuration, ..) 
> >>will
> share the underlying connection if same configuration object is used.
>
> What do it mean "the same"? is equality checked using reference (java 
> == ) or using equals(Object other) method?
>
>
> 2015-02-18 7:34 GMT+03:00 Enis Söztutar :
>
> > Hi,
> >
> > You are right that the constructor new HTable(Configuration, ..) 
> > will
> share
> > the underlying connection if same configuration object is used.
> Connection
> > is a heavy weight object, that holds the zookeeper connection, rpc
> client,
> > socket connections to multiple region servers, master, and the 
> > thread
> pool,
> > etc. You definitely do not want to create multiple connections per
> process
> > unless you know what you are doing.
> >
> > The model is changed, and the old way of HTable(Configuration, ..) 
> > is deprecated because, we want to make the Connection lifecycle 
> > management explicit. In the new model, an opened Connection is 
> > closed by the user again, and light weight Table instances are obtained 
> > from the Connection.
> > Having HTable's share their connections implicitly makes reasoning 
> > about
> it
> > too hard. The new model should be pretty easy to follow.
> >
> > Enis
> >
> > On Sat, Feb 14, 2015 at 6:45 AM, Liu, Ming (HPIT-GADSC) <
> ming.l...@hp.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I am using HBase 0.98.6.
> > >
> > > I learned from this maillist before, that the recommended method 
> > > to 'connect' to HBase from client is to use HConnectionManager like this:
> > > HConnection 
> > > con=HConnectionManager.createConnection(configuration);
> > > HTableInterfacetable = 
> > > con.getTable("hbase_table1"); Instead of
> > > HTableInterface table = new 
> > > HTable(configuration, "hbase_table1");
> > >
> > > I don't quite understand the reason. I was thinking that each time 
> > > I initialize a HTable instance, it needs to create a new 
> > > HConnection. And that is expensive. But using the first method, 
> > > multiple HTable
> instances
> > > can share the same HConnection. That is quite reasonable to me.
> > > However, I was reading from some articles on internet that , even 
> > > if I
> > use
> > > the 'new HTable(conf, tbl)' method, if the 'conf' object is the 
> > > same
> one,
> > > all the HTable instances will still share the same HConnection. I 
> > > was recently read yet another article and said when using 'new 
> > > HTable(conf, tbl)', one don't need to use the exactly same 'conf' 
> > > object (same one
> in
> > > memory). if two 'conf' objects, two different objects are all the
> same, I
> > > mean all attributes of these two are same (for example, created 
> > > from
> the
> > > same hbase-site.xml and never change) then HTable objects can 
> > > still
> share
> > > the same HConnection.  I also try to read the HTable src code, it 
> > > is
> very
> > > hard, but it seems to me the last statement is correct: 'HTable 
> > > will
> > share
> > > HConnection, if configuration is all the same'.
> > >
> > > Sorry for so verbose. My question:
> > > If two 'configuration' objects are same, then two HTable object 
> > > instantiated with them respectively can still share the same
> HConnection
> > or
> > > not? Directly using the 'new HTable()' method.
> > > If the answer is 'yes', then why I still need the 

how to use RegionCoprocessorEnvironment getSharedData() to share data among coprocessors?

2015-04-14 Thread Liu, Ming (HPIT-GADSC)
Hi, all,

I am trying to learn how to share data between two coprocessors. I have one Observer 
coprocessor and one Endpoint coprocessor. The observer overrides prePut/preDelete to 
maintain a counter, and I want the Endpoint coprocessor to read that counter and return 
it to the client caller. So I want to use the getSharedData() method of 
RegionCoprocessorEnvironment, but I cannot make it work. Could anybody help me here?

In the Observer coprocessor:
During start(), create the shared object "counter":
-
public void start(CoprocessorEnvironment envi) throws IOException {
  ((RegionCoprocessorEnvironment) envi).getSharedData().put("counter", new Long(0)); // create the counter
-

In the Endpoint coprocessor:
During start(), try to read the shared "counter", but it fails:
--
public void start(CoprocessorEnvironment envi) throws IOException {
  LOG.info("The size of sharedData map is: " + ((RegionCoprocessorEnvironment) envi).getSharedData().size()); // try to get the counter
--
Here it prints 0, and containsKey("counter") on the shared map returns false.

When creating the table, I call the addCoprocessor() method to add the Observer first, 
then the Endpoint coprocessor. I confirmed the order by checking the HBase log 
messages. The table has only one region during the run; I confirmed that with the hbase 
shell "status 'detailed'" command.

There are not many examples I can find about how to use getSharedData(). Could someone 
help me here? What is missing in my simple code? Thanks very much in advance!

Thanks,
Ming


RE: how to use RegionCoprocessorEnvironment getSharedData() to share data among coprocessors?

2015-04-15 Thread Liu, Ming (HPIT-GADSC)
Thank you Ted,

I am using 0.98.11. I did read the first example, but I hadn't found the second one 
before; I will try to read it, but it seems too complex for me :-)
I read at
http://hadoop-hbase.blogspot.com/2012/10/coprocessor-access-to-hbase-internals.html
that "This shared data is per coprocessor class and per regionserver."
So that tells me HBase cannot share data between two different coprocessors, like an 
endpoint and an observer, using this sharedData, since in my case they are two 
different classes. I use ZooKeeper to share the data for now, but I was told not to 
depend on ZooKeeper too much, with no idea why. Is there any other good way to share 
data among different coprocessors?

Thanks,
Ming
-Original Message-
From: Ted Yu [mailto:yuzhih...@gmail.com] 
Sent: Wednesday, April 15, 2015 8:25 PM
To: user@hbase.apache.org
Subject: Re: how to use RegionCoprocessorEnvironment getSharedData() to share 
data among coprocessors?

Which hbase release are you using ?

Please take a look at the following tests for example of using getSharedData() :

hbase-examples/src/main/java/org/apache/hadoop/hbase/coprocessor/example/ZooKeeperScanPolicyObserver.java
hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestCoprocessorInterface.java

Cheers

On Tue, Apr 14, 2015 at 10:35 PM, Liu, Ming (HPIT-GADSC) 
wrote:

> Hi, all,
>
> I am trying to learn how to share data between two coprocessors. I 
> have one Observer coprocessor and one Endpoint coprocessor. In the 
> observer, it overload the prePut/preDelete to maintain a counter. And 
> I want the Endpoint coprocessor to read that counter and return to 
> client caller. So I want to use the getSharedData() method in 
> RegionCoprocessorEnvironment, but I cannot make it work. Could anybody help 
> me here?
>
> In the Observer Coprocessor :
> During start(), create the shared object "counter":
> -
> public void start(CoprocessorEnvironment envi) throws IOException {
>   Env.getSharedData().put("counter", new Long(0) ); //create the 
> counter
> -
>
> In the Endpoint coprocessor:
> During start(), try to read the shared "counter" , but failed.
> --
> public void start(CoprocessorEnvironment envi) throws IOException {
>  LOG.info("The size of sharedData map is: " +
> envi.getSharedData().size() ); //try to get the counter
> --
> Here it print 0, if I use evni.getSharedData().containsKey("counter"), 
> it will return false.
>
> When creating table, I call addCoprocessor() method to add Observer 
> first, then Endpoint coprocessor. I confirmed that by checking the 
> hbase log file message. I only have one region for that table during 
> the run. I confirmed by hbase shell status 'detailed' command.
>
> There is not much example I can find about how to use getSharedData(), 
> could someone help me here? What is missing in my simple code? Thanks 
> very much in advance!
>
> Thanks,
> Ming
>


RE: how to use RegionCoprocessorEnvironment getSharedData() to share data among coprocessors?

2015-04-15 Thread Liu, Ming (HPIT-GADSC)
Hi, Ted,

I am not sure. I am a C programmer and only know a very little Java. It seems 
to me that the two coprocessors have to extend two different base classes for 
Endpoint and Observer, and Java does not seem to be able to do this.
I am just trying to study coprocessors; there is no solid production requirement 
for this sharing. Coprocessors were introduced in 0.92, and no one needed this 
before, so I am thinking it is not something important.
Thanks as always for your help! You are very kind to answer all my questions on 
this mailing list :-)
I will try to read the getSharedData() code to understand it further, and I 
will report back here if I have more findings or find another simple way to 
share data.

Thanks,
Ming
 
-Original Message-
From: Ted Yu [mailto:yuzhih...@gmail.com] 
Sent: Thursday, April 16, 2015 5:16 AM
To: user@hbase.apache.org
Subject: Re: how to use RegionCoprocessorEnvironment getSharedData() to share 
data among coprocessors?

Can you implement the Observer coprocessor and the Endpoint coprocessor in one 
class ?
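
Something along these lines might work - a rough sketch only, assuming a 
protobuf-generated CounterProtos service (the proto and all names below are 
made up, untested):

import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import com.google.protobuf.RpcCallback;
import com.google.protobuf.RpcController;
import com.google.protobuf.Service;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.CoprocessorService;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

// Sketch: one class acting as both observer and endpoint, so the counter is
// just a field and no shared map is needed. CounterProtos is assumed to be
// generated from a .proto file with a CounterService and GetCountRequest /
// GetCountResponse messages - it is not a real HBase class.
public class CounterCoprocessor extends BaseRegionObserver
    implements CoprocessorService, CounterProtos.CounterService.Interface {

  private final AtomicLong counter = new AtomicLong(0);

  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
      Put put, WALEdit edit, Durability durability) throws IOException {
    counter.incrementAndGet();   // observer side: count every put on this region
  }

  @Override
  public Service getService() {
    // endpoint side: expose the protobuf service backed by this same instance
    return CounterProtos.CounterService.newReflectiveService(this);
  }

  @Override
  public void getCount(RpcController controller,
      CounterProtos.GetCountRequest request,
      RpcCallback<CounterProtos.GetCountResponse> done) {
    done.run(CounterProtos.GetCountResponse.newBuilder()
        .setCount(counter.get()).build());
  }
}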

Cheers

On Wed, Apr 15, 2015 at 9:00 AM, Liu, Ming (HPIT-GADSC) 
wrote:

> Thank you Ted,
>
> I am using 0.98.11.  I do read the first example, but I don't find the 
> second one before, I will try to read it, but it seems too complex for 
> me
> :-)
> I read from
> http://hadoop-hbase.blogspot.com/2012/10/coprocessor-access-to-hbase-i
> nternals.html That : "This shared data is per coprocessor class and 
> per regionserver."
> So it means to me that hbase cannot share between two different 
> coprocessors like an endpoint and an observer by using this 
> sharedData, since there are two different classes in my case. I use 
> Zookeeper to share data for now, but I was told not to depend on 
> ZooKeeper too much, no idea why. Is there any other good way I can use 
> to share data among different coprocessors?
>
> Thanks,
> Ming
> -Original Message-
> From: Ted Yu [mailto:yuzhih...@gmail.com]
> Sent: Wednesday, April 15, 2015 8:25 PM
> To: user@hbase.apache.org
> Subject: Re: how to use RegionCoprocessorEnvironment getSharedData() 
> to share data among coprocessors?
>
> Which hbase release are you using ?
>
> Please take a look at the following tests for example of using
> getSharedData() :
>
>
> hbase-examples/src/main/java/org/apache/hadoop/hbase/coprocessor/examp
> le/ZooKeeperScanPolicyObserver.java
>
> hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestCop
> rocessorInterface.java
>
> Cheers
>
> On Tue, Apr 14, 2015 at 10:35 PM, Liu, Ming (HPIT-GADSC) 
>  >
> wrote:
>
> > Hi, all,
> >
> > I am trying to learn how to share data between two coprocessors. I 
> > have one Observer coprocessor and one Endpoint coprocessor. In the 
> > observer, it overload the prePut/preDelete to maintain a counter. 
> > And I want the Endpoint coprocessor to read that counter and return 
> > to client caller. So I want to use the getSharedData() method in 
> > RegionCoprocessorEnvironment, but I cannot make it work. Could 
> > anybody
> help me here?
> >
> > In the Observer Coprocessor :
> > During start(), create the shared object "counter":
> > -
> > public void start(CoprocessorEnvironment envi) throws IOException {
> >   Env.getSharedData().put("counter", new Long(0) ); //create the 
> > counter
> > -
> >
> > In the Endpoint coprocessor:
> > During start(), try to read the shared "counter" , but failed.
> > --
> > public void start(CoprocessorEnvironment envi) throws IOException {
> >  LOG.info("The size of sharedData map is: " +
> > envi.getSharedData().size() ); //try to get the counter
> > --
> > Here it print 0, if I use 
> > evni.getSharedData().containsKey("counter"),
> > it will return false.
> >
> > When creating table, I call addCoprocessor() method to add Observer 
> > first, then Endpoint coprocessor. I confirmed that by checking the 
> > hbase log file message. I only have one region for that table during 
> > the run. I confirmed by hbase shell status 'detailed' command.
> >
> > There is not much example I can find about how to use 
> > getSharedData(), could someone help me here? What is missing in my 
> > simple code? Thanks very much in advance!
> >
> > Thanks,
> > Ming
> >
>


Why does hbase need manual split?

2014-08-05 Thread Liu, Ming (HPIT-GADSC)
Hi, all,

As I understand it, HBase will automatically split a region when the region 
gets too big.
So in what scenario does a user need to do a manual split? Could someone kindly 
give me some examples where a user needs to split a region explicitly via the 
HBase Shell or the Java API?

Thanks very much.

Regards,
Ming


RE: Why does hbase need manual split?

2014-08-06 Thread Liu, Ming (HPIT-GADSC)
Thanks Arun, and John,

Both of your scenarios make a lot of sense to me. But for the "sequence-based 
key" case, I am still confused. It is like an append-only workload, so new 
data is always written into the same region, but that region will eventually 
reach hbase.hregion.max.filesize and be split automatically, so why is a manual 
split still needed? If we set hbase.hregion.max.filesize to a "not too big" 
value, will a region then never grow too big?

And I think I first need to understand how HBase does the auto split internally 
(I am very new to HBase). Given a region with start key A and end key B, how 
does HBase split it internally? In the middle of the key range?
The original region covers the range [A,B], so does it split into [A, (A+B)/2] 
and [(A+B)/2+1, B]?
Then if most of the row keys are in a small range [A, C], where C is very close 
to (A+B)/2, I can see a problem with auto split.

Is this true? Can HBase do split in other ways?
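
By the way, while searching I noticed that the split policy looks pluggable per 
table, so maybe something like the sketch below is also possible (not verified; 
the table and family names are just examples):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableWithSplitPolicy {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    HTableDescriptor desc = new HTableDescriptor("mytable"); // example name
    desc.addFamily(new HColumnDescriptor("cf"));
    // ask this table to use the pure size-based split policy instead of the default
    desc.setValue(HTableDescriptor.SPLIT_POLICY,
        "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy");
    admin.createTable(desc);
    admin.close();
  }
}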

Thanks,
Ming

-Original Message-
From: john guthrie [mailto:graf...@gmail.com] 
Sent: Wednesday, August 06, 2014 6:01 PM
To: user@hbase.apache.org
Subject: Re: Why does hbase need manual split?

i had a customer with a sequence-based key (yes, he knew all the downsides for 
that). being able to split manually meant he could split a region that got too 
big at the end vice right down the middle. with a sequentially increasing key, 
splitting the region in half left one region half the desired size and likely 
to never be added to


On Wed, Aug 6, 2014 at 2:44 AM, Arun Allamsetty 
wrote:

> Hi Ming,
>
> The reason why we have it is because the user can decide where each 
> key goes. I can think multiple scenarios off the top of my head where 
> it would be useful and others can correct me if I am wrong.
>
> 1. Cases where you cannot have row keys which are equally lexically 
> distributed, leading in unequal loads on the regions. In such cases, 
> we can set key ranges to be assigned to different regions so that we 
> can have a more equal distribution.
>
> 2. The second scenario I am thinking of may be wrong and if it is, 
> it'll clear my misconceptions. In case you cannot denormalize your 
> data and you have to perform joins on certain range of row keys which 
> are lexically similar. So we split them and they would be assigned to 
> the same region server (right?) and the join would be performed locally.
>
> Cheers,
> Arun
>
> Sent from a mobile device. Please don't mind the typos.
> On Aug 6, 2014 12:30 AM, "Liu, Ming (HPIT-GADSC)" 
> wrote:
>
> > Hi, all,
> >
> > As I understand, HBase will automatically split a region when the 
> > region is too big.
> > So in what scenario, user needs to do a manual split? Could someone
> kindly
> > give me some examples that user need to do the region split 
> > explicitly
> via
> > HBase Shell or Java API?
> >
> > Thanks very much.
> >
> > Regards,
> > Ming
> >
>


RE: Why does hbase need manual split?

2014-08-06 Thread Liu, Ming (HPIT-GADSC)
Thanks John,  

This is a very good answer; now I understand why you use manual splits, thanks. 
And I had a typo in my previous post: 
C is very close to A, not to (A+B)/2. So every split in the middle of the key 
range would result in one big region and one small region, which is very bad.
Does HBase only auto-split in the middle of the key range, or is there another 
algorithm? Any help will be much appreciated!

Best Regards,
Ming
-Original Message-
From: john guthrie [mailto:graf...@gmail.com] 
Sent: Wednesday, August 06, 2014 6:35 PM
To: user@hbase.apache.org
Subject: Re: Why does hbase need manual split?

to be honest, we were doing manual splits for the main reason that we wanted to 
make sure it was done on our schedule.

but it also occurred to me that the automatic splits, at least by default, 
split the region in half. normally the idea is that both new halves continue to 
grow, but with a sequentially increasing key that won't be true. so if you're 
splitting in half you want your region split size to be twice your desired 
region size so that when a split does occur the "older"
half of the region is the size you want it. manual splitting lets you split at 
the end
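
for what it's worth, the java api call for this is small - something like the 
sketch below (table name and split key are made up; the shell equivalent is 
split 'mytable', 'row-0999999'):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ManualSplit {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    // split at an explicit key instead of letting hbase pick the midpoint
    admin.split("mytable", "row-0999999");
    admin.close();
  }
}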

hope this helps, and hope i'm not wrong, john



On Wed, Aug 6, 2014 at 6:25 AM, Liu, Ming (HPIT-GADSC) 
wrote:

> Thanks Arun, and John,
>
> Both of your scenarios make a lot of sense to me. But for the 
> "sequence-based key" case, I am still confused. It is like an 
> append-only operation, so new data are always written into the same 
> region, but that region will eventually reach the 
> hbase.hregion.max.filesize and be automatically split, why still need 
> a manual split? If we set the hbase.hregion.max.filesize to a "not too 
> big" value, then a region will never grow too big?
>
> And I think I need to first understand how HBase do the auto split 
> internally ( I am very new to HBase). Given a region with start key A, 
> and end key B. When split, how HBase do split internally? Split in the 
> middle of key range?
> Original region is in range [A,B], so split to [A, B-A/2] and 
> [B-A/2+1, B] ?
> Then if most of the row key are in a small range [A, C], while C is 
> very close to B-A/2, then I can see a problem of auto split.
>
> Is this true? Can HBase do split in other ways?
>
> Thanks,
> Ming
>
> -Original Message-
> From: john guthrie [mailto:graf...@gmail.com]
> Sent: Wednesday, August 06, 2014 6:01 PM
> To: user@hbase.apache.org
Subject: Re: Why does hbase need manual split?
>
> i had a customer with a sequence-based key (yes, he knew all the 
> downsides for that). being able to split manually meant he could split 
> a region that got too big at the end vice right down the middle. with 
> a sequentially increasing key, splitting the region in half left one 
> region half the desired size and likely to never be added to
>
>
> On Wed, Aug 6, 2014 at 2:44 AM, Arun Allamsetty 
>  >
> wrote:
>
> > Hi Ming,
> >
> > The reason why we have it is because the user can decide where each 
> > key goes. I can think multiple scenarios off the top of my head 
> > where it would be useful and others can correct me if I am wrong.
> >
> > 1. Cases where you cannot have row keys which are equally lexically 
> > distributed, leading in unequal loads on the regions. In such cases, 
> > we can set key ranges to be assigned to different regions so that we 
> > can have a more equal distribution.
> >
> > 2. The second scenario I am thinking of may be wrong and if it is, 
> > it'll clear my misconceptions. In case you cannot denormalize your 
> > data and you have to perform joins on certain range of row keys 
> > which are lexically similar. So we split them and they would be 
> > assigned to the same region server (right?) and the join would be performed 
> > locally.
> >
> > Cheers,
> > Arun
> >
> > Sent from a mobile device. Please don't mind the typos.
> > On Aug 6, 2014 12:30 AM, "Liu, Ming (HPIT-GADSC)" 
> > wrote:
> >
> > > Hi, all,
> > >
> > > As I understand, HBase will automatically split a region when the 
> > > region is too big.
> > > So in what scenario, user needs to do a manual split? Could 
> > > someone
> > kindly
> > > give me some examples that user need to do the region split 
> > > explicitly
> > via
> > > HBase Shell or Java API?
> > >
> > > Thanks very much.
> > >
> > > Regards,
> > > Ming
> > >
> >
>


when will hbase create the zookeeper znode 'root-region-server'? Hbase 0.94

2014-10-16 Thread Liu, Ming (HPIT-GADSC)
Hello,

I am trying to debug coprocessor code on hbase 0.94.24; it is said to work 
well on 0.94.5, but I cannot make it work on 0.94.24.

Here is the copy of some coprocessor init code:

public class TestEndpoint implements TestIface, HTableWrapper {
  …
  @Override
  public void start(CoprocessorEnvironment env) throws IOException {
    this.env = env;
    conf = env.getConfiguration();
    HBaseAdmin admin = new HBaseAdmin(conf);
    if (!admin.tableExists(SOME_TABLE)) {
      // do something if this table is not there
      …

When hbase starts, I noticed in the log file that when the regionserver loads 
this coprocessor, it hangs inside the admin.tableExists() call. That API tries 
to access the zookeeper znode 'root-region-server', so I started 'hbase zkcli' 
and ran 'ls /hbase' at that time, and found that the znode 'root-region-server' 
had not been created. Since the coprocessor wants to access a table, it must 
look up the -ROOT- region, whose location is saved in that znode, but that 
znode is not there, so it hangs forever. If I disable this coprocessor, hbase 
starts fine and I can see the 'root-region-server' znode created.

This coprocessor code is claimed to work well on 0.94.5, so I am wondering 
whether something changed about the ordering of 'load coprocessor' and 'create 
root-region-server znode' in the hbase 0.94 series after 0.94.5.

So my basic question is: when is the znode 'root-region-server' created, and 
who creates it? And is there any fixed ordering between that initialization and 
the time coprocessors are loaded?

By the way, I cannot find anywhere to download the 0.94.5 hbase source code; 
can anyone tell me where I can find it?

I know this old version is obsolete, but this is for research, not production, 
so please help me if you have any idea. Thanks very much in advance.

Thanks,
Ming


RE: when will hbase create the zookeeper znode 'root-region-server'? Hbase 0.94

2014-10-16 Thread Liu, Ming (HPIT-GADSC)
Thanks Ted and Sean,

It is great to find the archives; I wonder why I did not find them for such a 
long time... :-)

The original coprocessor author replied to me yesterday; it was my fault. In 
fact, that coprocessor is loaded after hbase startup, not during regionserver 
startup. In their client application code, they invoke addCoprocessor() to load 
the coprocessor at run time, so I should not have put it into hbase-site.xml. 
When loaded later, HBaseAdmin.tableExists() works, since by that time all 
initialization is done. 
So this is not a version issue; it works on 0.94.5 and on 0.94.24 as well.
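
For reference, the runtime loading they do looks roughly like the sketch below 
(the table and coprocessor class names here are just examples, not the real 
application code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class LoadCoprocessorAtRuntime {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    byte[] table = Bytes.toBytes("mytable");     // example table name

    admin.disableTable(table);                   // table must be offline to modify it
    HTableDescriptor desc = admin.getTableDescriptor(table);
    // the class must be on the region server classpath
    // (or use the overload that takes a jar path in HDFS)
    desc.addCoprocessor("com.example.TestEndpoint");
    admin.modifyTable(table, desc);
    admin.enableTable(table);
    admin.close();
  }
}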

Thank you all for the help.
Ming

-Original Message-
From: Ted Yu [mailto:yuzhih...@gmail.com] 
Sent: Thursday, October 16, 2014 10:29 PM
To: user@hbase.apache.org
Subject: Re: when will hbase create the zookeeper znode 'root-region-server'? 
Hbase 0.94

Ming:
The tar ball in the archive contains source code. See example below:

$ tar tzvf hbase-0.94.5.tar.gz | grep '\.java' | grep Assignment
-rw-r--r--  0 jenkins jenkins  47982 Feb  7  2013 
hbase-0.94.5/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
-rw-r--r--  0 jenkins jenkins  136645 Feb  7  2013 
hbase-0.94.5/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java

FYI

On Thu, Oct 16, 2014 at 7:18 AM, Sean Busbey  wrote:

> On Thu, Oct 16, 2014 at 2:18 AM, Liu, Ming (HPIT-GADSC) 
> 
> wrote:
>
> >
> >
> > By the way, I cannot find anywhere to download a 0.94.5 hbase source
> code,
> > can anyone tell me if there is somewhere I can find it?
> >
> > I know old version is obsolete , but this is not for production, but 
> > for research, so please help me if you have any idea. Thanks very 
> > much in advance.
> >
>
>
> In most cases you can get old versions of the HBase source from the 
> ASF
> archives:
>
> http://archive.apache.org/dist/hbase/hbase-0.94.5/
>
>
> --
> Sean
>