Re: Feed Aggregator Schema

2009-08-17 Thread Peter Rietzler

Hi 

In our project we are handling event lists where we have similar
requirements. We do ordering by choosing our row keys wisely. We use the
following key for our events (they should be ordered by time in ascending
order):

eventListName/yyyyMMddHHmmssSSS-000[-111]

where eventListName is the name of the event list and 000 is a three-digit
instance id to disambiguate between different running instances of the
application, and -111 is optional to disambiguate events that occurred in
the same millisecond on one instance. 

We additionally insert an artificial row for each day with the id

eventListName/yyyyMMddHHmmssSSS

This allows us to start scanning at the beginning of each day without
searching through the event list.
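
For illustration, a minimal Java sketch of how such keys could be built; the
helper names and the reversed-timestamp variant (useful for newest-first
feeds like the aggregated feed asked about below) are illustrative
assumptions, not Peter's actual code:

  import java.text.SimpleDateFormat;
  import java.util.Date;

  public class EventKeys {

    // eventListName/yyyyMMddHHmmssSSS-000[-111]
    public static String buildEventKey(String eventListName, Date eventTime,
        int instanceId, int sequence) {
      SimpleDateFormat ts = new SimpleDateFormat("yyyyMMddHHmmssSSS");
      return String.format("%s/%s-%03d-%03d",
          eventListName, ts.format(eventTime), instanceId, sequence);
    }

    // Artificial marker row at midnight of a day, used as a scan start key.
    public static String buildDayMarkerKey(String eventListName, Date day) {
      SimpleDateFormat dayFormat = new SimpleDateFormat("yyyyMMdd");
      return eventListName + "/" + dayFormat.format(day) + "000000000";
    }

    // For a newest-first feed, a common trick is to invert the timestamp so
    // that a plain forward scan returns rows in reverse chronological order
    // (row keys compare lexicographically, so widths must be fixed).
    public static String buildReverseKey(String feedName, long epochMillis) {
      return String.format("%s/%019d", feedName, Long.MAX_VALUE - epochMillis);
    }
  }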

You need to be aware that if you have a very high load of inserts, one
HBase region server is always busy inserting while the others are idle ...
if that's a problem for you, you have to find different keys for your
purpose. 

You could also use an HBase index table, but I have no experience with it and
I remember an email on the mailing list saying that this would double all
requests because the API would first look up the index table and then the
original table ??? (please correct me if this is not right ...)

Kind regards, 
Peter



Andrei Savu wrote:
> 
> Hello,
> 
> I am working on a project involving monitoring a large number of
> rss/atom feeds. I want to use hbase for data storage and I have some
> problems designing the schema. For the first iteration I want to be
> able to generate an aggregated feed (last 100 posts from all feeds in
> reverse chronological order).
> 
> Currently I am using two tables:
> 
> Feeds: column families Content and Meta : raw feed stored in Content:raw
> Urls: column families Content and Meta : raw post version stored in
> Content:raw and the rest of the data found in RSS stored in Meta
> 
> I need some sort of index table for the aggregated feed. How should I
> build that? Is hbase a good choice for this kind of application?
> 
> In other words: Is it possible (in HBase) to design a schema that
> could efficiently answer queries like the one listed below?
> 
> SELECT data FROM Urls ORDER BY date DESC LIMIT 100
> 
> Thanks.
> 
> --
> Savu Andrei
> 
> Website: http://www.andreisavu.ro/
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Feed-Aggregator-Schema-tp24974071p25002264.html
Sent from the HBase User mailing list archive at Nabble.com.



Re: Feed Aggregator Schema

2009-08-17 Thread Andrei Savu
Thanks for your answer Peter.

I will give it a try using this approach and I will let you know how it works.

On Mon, Aug 17, 2009 at 10:26 AM, Peter
Rietzler wrote:
>
> Hi
>
> In our project we are handling event lists where we have similar
> requirements. We do ordering by choosing our row keys wisely. We use the
> following key for our events (they should be ordered by time in ascending
> order):
>
> eventListName/yyyyMMddHHmmssSSS-000[-111]
>
> where eventListName is the name of the event list and 000 is a three-digit
> instance id to disambiguate between different running instances of the
> application, and -111 is optional to disambiguate events that occurred in
> the same millisecond on one instance.
>
> We additionally insert an artificial row for each day with the id
>
> eventListName/yyyyMMddHHmmssSSS
>
> This allows us to start scanning at the beginning of each day without
> searching through the event list.
>
> You need to be aware that if you have a very high load of inserts, one
> HBase region server is always busy inserting while the others are idle ...
> if that's a problem for you, you have to find different keys for your
> purpose.
>
> You could also use an HBase index table, but I have no experience with it and
> I remember an email on the mailing list saying that this would double all
> requests because the API would first look up the index table and then the
> original table ??? (please correct me if this is not right ...)
>
> Kind regards,
> Peter
>
>
>
> Andrei Savu wrote:
>>
>> Hello,
>>
>> I am working on a project involving monitoring a large number of
>> rss/atom feeds. I want to use hbase for data storage and I have some
>> problems designing the schema. For the first iteration I want to be
>> able to generate an aggregated feed (last 100 posts from all feeds in
>> reverse chronological order).
>>
>> Currently I am using two tables:
>>
>> Feeds: column families Content and Meta : raw feed stored in Content:raw
>> Urls: column families Content and Meta : raw post version stored in
>> Content:raw and the rest of the data found in RSS stored in Meta
>>
>> I need some sort of index table for the aggregated feed. How should I
>> build that? Is hbase a good choice for this kind of application?
>>
>> In other words: Is it possible (in HBase) to design a schema that
>> could efficiently answer queries like the one listed below?
>>
>> SELECT data FROM Urls ORDER BY date DESC LIMIT 100
>>
>> Thanks.
>>
>> --
>> Savu Andrei
>>
>> Website: http://www.andreisavu.ro/
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/Feed-Aggregator-Schema-tp24974071p25002264.html
> Sent from the HBase User mailing list archive at Nabble.com.
>
>



-- 
Savu Andrei

Website: http://www.andreisavu.ro/


Indexed Table in Hbase

2009-08-17 Thread bharath vissapragada
Hi all,

I have gone through the IndexedTableAdmin classes in the HBase 0.19.3 API. I
have seen some methods used to create an indexed table (on some column). I
have some doubts regarding the same:

1) Are these somewhat similar to hash indexes (in an RDBMS) where I can easily
look up a column value and find its corresponding rowkey(s)?
2) Can I expect any performance gain when I use IndexedTable to search for a
particular column value, instead of scanning an entire normal HTable?

Kindly clarify my doubts.

Thanks in advance


Disabling tables in HBase 0.20

2009-08-17 Thread Mathias De Maré
Hi,

I tried to disable a table using the shell, and got an Exception:

hbase(main):004:0* disable 'url'
NativeException: org.apache.hadoop.hbase.RegionException: Retries exhausted,
it took too long to wait for the table url to be disabled.
from org/apache/hadoop/hbase/client/HBaseAdmin.java:406:in
`disableTable'
from org/apache/hadoop/hbase/client/HBaseAdmin.java:366:in
`disableTable'
from sun/reflect/NativeMethodAccessorImpl.java:-2:in `invoke0'
from sun/reflect/NativeMethodAccessorImpl.java:39:in `invoke'
from sun/reflect/DelegatingMethodAccessorImpl.java:25:in `invoke'
from java/lang/reflect/Method.java:597:in `invoke'
from org/jruby/javasupport/JavaMethod.java:298:in
`invokeWithExceptionHandling'
from org/jruby/javasupport/JavaMethod.java:259:in `invoke'
from org/jruby/java/invokers/InstanceMethodInvoker.java:44:in `call'
from org/jruby/runtime/callsite/CachingCallSite.java:273:in
`cacheAndCall'
from org/jruby/runtime/callsite/CachingCallSite.java:112:in `call'
from org/jruby/ast/CallOneArgNode.java:57:in `interpret'
from org/jruby/ast/NewlineNode.java:104:in `interpret'
from org/jruby/ast/BlockNode.java:71:in `interpret'
from org/jruby/internal/runtime/methods/InterpretedMethod.java:163:in
`call'
from org/jruby/internal/runtime/methods/DefaultMethod.java:144:in `call'
... 108 levels...
from root/installation/hbase/bin/$_dot_dot_/bin/hirb#start:-1:in `call'
from org/jruby/internal/runtime/methods/DynamicMethod.java:226:in `call'
from org/jruby/internal/runtime/methods/CompiledMethod.java:211:in
`call'
from org/jruby/internal/runtime/methods/CompiledMethod.java:71:in `call'
from org/jruby/runtime/callsite/CachingCallSite.java:253:in
`cacheAndCall'
from org/jruby/runtime/callsite/CachingCallSite.java:72:in `call'
from root/installation/hbase/bin/$_dot_dot_/bin/hirb.rb:487:in
`__file__'
from root/installation/hbase/bin/$_dot_dot_/bin/hirb.rb:-1:in `load'
from org/jruby/Ruby.java:577:in `runScript'
from org/jruby/Ruby.java:480:in `runNormally'
from org/jruby/Ruby.java:354:in `runFromMain'
from org/jruby/Main.java:229:in `run'
from org/jruby/Main.java:110:in `run'
from org/jruby/Main.java:94:in `main'
from /root/installation/hbase/bin/../bin/hirb.rb:346:in `disable'
from (hbase):5
hbase(main):005:0>

Now, it seems I can no longer access any data in that table. (For example, I
can't do 'count 'url'' anymore.)
It looks somewhat related to
https://issues.apache.org/jira/browse/HBASE-1636 .
Is there a way to still get the data out of that table? And to disable and
enable the table, so I can change the blocksize?

Mathias


Re: Why the map input records are equal to the map output records

2009-08-17 Thread Mathias De Maré
On Wed, Aug 12, 2009 at 6:57 PM, Xine Jar  wrote:

>
> For my own information, is there a way I can verify that it did not read
> the table several times?
> should the map output record count be equal to the number of records in the
> table, or not necessarily?
>
> Thank you,


Yes, it should be the same.

Mathias


Re: master kills itself

2009-08-17 Thread Jean-Daniel Cryans
It seems your Master had trouble connecting to a ZK server and its
session expired. In this case it kills itself to make sure that it
won't be managing the cluster at the same time as another Master, which
may have started if one was waiting.

Starting a Master on any node should be OK to recover; HBase is built for that.

J-D

On Sun, Aug 16, 2009 at 2:49 AM, Zheng Lv wrote:
> Hello,
>    Thank you for your suggestions.
>    Several days ago we found our routing table had some problems; after
> adjusting it we are now sure that the bandwidth is ok.
>    And we have used LZO compression.
>    So we started the test program again, but after running normally for 23
> hours, the master killed itself. Following is part of the log.
>    By the way, this time we inserted only 10 webpages per second.
> 2009-08-14 13:36:31,840 INFO org.apache.hadoop.hbase.master.ServerManager: 4
> region servers, 0 dead, average load 48.75
> 2009-08-14 13:36:32,016 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.metaScanner scanning meta region {server: 192.168.33.5:60020,
> regionnam
> e: .META.,,1, startKey: <>}
> 2009-08-14 13:36:32,076 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scanning meta region {server: 192.168.33.6:60020,
> regionnam
> e: -ROOT-,,0, startKey: <>}
> 2009-08-14 13:36:32,084 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scan of 1 row(s) of meta region {server:
> 192.168.33.6:60020
> , regionname: -ROOT-,,0, startKey: <>} complete
> 2009-08-14 13:36:32,316 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.metaScanner scan of 193 row(s) of meta region {server:
> 192.168.33.5:600
> 20, regionname: .META.,,1, startKey: <>} complete
> 2009-08-14 13:36:32,316 INFO org.apache.hadoop.hbase.master.BaseScanner: All
> 1 .META. region(s) scanned
> 2009-08-14 13:37:00,366 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x22313002be80001 to sun.nio.ch.selectionkeyi...@4a407c9f
> java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0
> lim=4 cap=4]
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:653)
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:897)
> 2009-08-14 13:37:00,881 INFO org.apache.zookeeper.ClientCnxn: Attempting
> connection to server ubuntu3/192.168.33.8:
> 2009-08-14 13:37:04,366 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x22313002be8 to sun.nio.ch.selectionkeyi...@4ac6ee33
> java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0
> lim=4 cap=4]
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:653)
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:897)
> 2009-08-14 13:37:04,721 INFO org.apache.zookeeper.ClientCnxn: Attempting
> connection to server ubuntu2/192.168.33.9:
> 2009-08-14 13:37:08,872 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x22313002be80001 to sun.nio.ch.selectionkeyi...@2e93ebe0
> java.io.IOException: TIMED OUT
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:858)
> 2009-08-14 13:37:08,873 WARN org.apache.zookeeper.ClientCnxn: Ignoring
> exception during shutdown output
> java.net.SocketException: Transport endpoint is not connected
>        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
>        at
> sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
>        at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:956)
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:922)
> 2009-08-14 13:37:09,486 INFO org.apache.zookeeper.ClientCnxn: Attempting
> connection to server ubuntu2/192.168.33.9:
> 2009-08-14 13:37:12,712 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x22313002be8 to sun.nio.ch.selectionkeyi...@7162d703
> java.io.IOException: TIMED OUT
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:858)
> 2009-08-14 13:37:12,713 WARN org.apache.zookeeper.ClientCnxn: Ignoring
> exception during shutdown output
> java.net.SocketException: Transport endpoint is not connected
>        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
>        at
> sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
>        at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:956)
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:922)
> 2009-08-14 13:37:13,032 INFO org.apache.zookeeper.ClientCnxn: Attempting
> connection to server ubuntu3/192.168.33.8:
> 2009-08-14 13:37:17,482 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x22313002be80001 to sun.nio.ch.selectionkeyi...@1012401d
> java.io.IOExcep

Re: Indexed Table in Hbase

2009-08-17 Thread Kirill Shabunov

Hi!

As far as I understand you are talking about the secondary indexes. Yes, 
they can be used to quickly get the rowkey by a value in the indexed column.


--Kirill

bharath vissapragada wrote:

Hi all ,

I have gone through the IndexedTableAdmin classes in Hbase 0.19.3 API ..  I
have seen some methods used to create an Indexed Table (on some column).. I
have some doubts regarding the same ...

1) Are these somewhat similar to Hash indexes(in RDBMS) where i can easily
look up a column value and find its corresponding rowkey(s)
2) Can i find any performance gain when i use IndexedTable to search for a
particular column value .. instead of scanning an entire normal HTable ..

Kindly clarify my doubts

Thanks in advance



Re: Disabling tables in HBase 0.20

2009-08-17 Thread Jean-Daniel Cryans
Mathias,

Disabling is still not done inline, so you won't get feedback if it
takes time to disable your table. What takes that time is that it has
to flush all the data kept in the memstore; for example, if you have 20
regions in a region server and they are nearly full, it could take
minutes.

So maybe wait 1-2 minutes, do a flush '.META.' and major_compact
'.META.' just to be sure, and then reissue the disable. If it doesn't
work, then you may have hit the bug that blocks the release of HBase
0.20 (HBASE-1761). Or you can also try to enable it.
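
The same recipe in client code, as a hedged sketch (assuming the 0.20-era
HBaseAdmin exposes flush, majorCompact, and disableTable by table name;
check the Javadoc of your exact version before relying on these signatures):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class DisableWithFlush {
    public static void main(String[] args) throws Exception {
      HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());

      // Flush and major compact .META. first, as suggested above.
      admin.flush(".META.");
      admin.majorCompact(".META.");

      // Give memstore flushes a minute or two to finish.
      Thread.sleep(2 * 60 * 1000);

      // Then reissue the disable.
      admin.disableTable("url");
    }
  }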

J-D

2009/8/17 Mathias De Maré :
> Hi,
>
> I tried to disable a table using the shell, and got an Exception:
>
> hbase(main):004:0* disable 'url'
> NativeException: org.apache.hadoop.hbase.RegionException: Retries exhausted,
> it took too long to wait for the table url to be disabled.
>    from org/apache/hadoop/hbase/client/HBaseAdmin.java:406:in
> `disableTable'
>    from org/apache/hadoop/hbase/client/HBaseAdmin.java:366:in
> `disableTable'
>    from sun/reflect/NativeMethodAccessorImpl.java:-2:in `invoke0'
>    from sun/reflect/NativeMethodAccessorImpl.java:39:in `invoke'
>    from sun/reflect/DelegatingMethodAccessorImpl.java:25:in `invoke'
>    from java/lang/reflect/Method.java:597:in `invoke'
>    from org/jruby/javasupport/JavaMethod.java:298:in
> `invokeWithExceptionHandling'
>    from org/jruby/javasupport/JavaMethod.java:259:in `invoke'
>    from org/jruby/java/invokers/InstanceMethodInvoker.java:44:in `call'
>    from org/jruby/runtime/callsite/CachingCallSite.java:273:in
> `cacheAndCall'
>    from org/jruby/runtime/callsite/CachingCallSite.java:112:in `call'
>    from org/jruby/ast/CallOneArgNode.java:57:in `interpret'
>    from org/jruby/ast/NewlineNode.java:104:in `interpret'
>    from org/jruby/ast/BlockNode.java:71:in `interpret'
>    from org/jruby/internal/runtime/methods/InterpretedMethod.java:163:in
> `call'
>    from org/jruby/internal/runtime/methods/DefaultMethod.java:144:in `call'
> ... 108 levels...
>    from root/installation/hbase/bin/$_dot_dot_/bin/hirb#start:-1:in `call'
>    from org/jruby/internal/runtime/methods/DynamicMethod.java:226:in `call'
>    from org/jruby/internal/runtime/methods/CompiledMethod.java:211:in
> `call'
>    from org/jruby/internal/runtime/methods/CompiledMethod.java:71:in `call'
>    from org/jruby/runtime/callsite/CachingCallSite.java:253:in
> `cacheAndCall'
>    from org/jruby/runtime/callsite/CachingCallSite.java:72:in `call'
>    from root/installation/hbase/bin/$_dot_dot_/bin/hirb.rb:487:in
> `__file__'
>    from root/installation/hbase/bin/$_dot_dot_/bin/hirb.rb:-1:in `load'
>    from org/jruby/Ruby.java:577:in `runScript'
>    from org/jruby/Ruby.java:480:in `runNormally'
>    from org/jruby/Ruby.java:354:in `runFromMain'
>    from org/jruby/Main.java:229:in `run'
>    from org/jruby/Main.java:110:in `run'
>    from org/jruby/Main.java:94:in `main'
>    from /root/installation/hbase/bin/../bin/hirb.rb:346:in `disable'
>    from (hbase):5
> hbase(main):005:0>
>
> Now, it seems I can no longer access any data in that table. (For example, I
> can't do 'count 'url'' anymore.)
> It looks somewhat related to
> https://issues.apache.org/jira/browse/HBASE-1636 .
> Is there a way to still get the data out of that table? And to disable and
> enable the table, so I can change the blocksize?
>
> Mathias
>


Re: Indexed Table in Hbase

2009-08-17 Thread bharath vissapragada
But I have read somewhere that secondary indexes are somewhat slow compared
to normal HBase tables .. Does that affect the performance?

Also, do you know the type of index created on the column (I mean hash type or
B-tree etc.)?

On Mon, Aug 17, 2009 at 8:30 PM, Kirill Shabunov  wrote:

> Hi!
>
> As far as I understand you are talking about the secondary indexes. Yes,
> they can be used to quickly get the rowkey by a value in the indexed column.
>
> --Kirill
>
>
> bharath vissapragada wrote:
>
>> Hi all ,
>>
>> I have gone through the IndexedTableAdmin classes in Hbase 0.19.3 API ..
>>  I
>> have seen some methods used to create an Indexed Table (on some column)..
>> I
>> have some doubts regarding the same ...
>>
>> 1) Are these somewhat similar to Hash indexes(in RDBMS) where i can easily
>> look up a column value and find its corresponding rowkey(s)
>> 2) Can i find any performance gain when i use IndexedTable to search for a
>> particular column value .. instead of scanning an entire normal HTable ..
>>
>> Kindly clarify my doubts
>>
>> Thanks in advance
>>
>>


Re: Hi, I am still puzzled to find a row key by a cell.

2009-08-17 Thread Rocks
Thanks for your answer.
However, what I want to do is just like the "where" keyword of SQL in an
RDBMS. Is it impossible in HBase?

On Mon, Aug 17, 2009 at 2:58 PM, Ryan Rawson  wrote:

> hey,
>
> That isn't how hbase (or even rdbms) work.  Instead you can retrieve
> rows based on their row key. Otherwise you will have to read the
> entire table to find just that 1 row. Yes this is as inefficient as it
> sounds.
>
> If you frequently have this issue, you may need to build and maintain
> secondary indexes. Unlike relational dbs, there is no built in support
> for this, you have to write your app to handle this.
>
> -ryan
>
>
>
> On Sun, Aug 16, 2009 at 11:51 PM, lei wang wrote:
> > Hi, if I know a cell in HBase by its column::value, I need to
> > know which row key it belongs to. I searched the HBase API several times,
> > but I cannot find the right method to solve my problem. Thanks for any
> > suggestions.
> >
>


Re: Indexed Table in Hbase

2009-08-17 Thread Jonathan Gray
It's not an actual hash or btree index, but rather secondary indexes in 
HBase are implemented by creating an additional HBase table.


If I have a table "users" (row key is userid) with family "data" and 
column "email", and I want to index the value in that column...


I can create a table "users_email" where the row key is the email 
address (value from the column in "users" table) and a single column 
that contains the userid.


Doing an "index lookup" would mean doing a get on "users_email" and then 
using that userid to do a lookup on the "users" table.


IndexedTable does this transparently, but still does require two 
queries.  So it's slower than a single query, but certainly faster than 
a full table scan.
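
As a concrete illustration of the two-step lookup, a minimal sketch against
the 0.20-era client API (method signatures differ slightly between 0.19 and
0.20, so treat them as approximate; the table, family, and column names
follow the example above, and error handling is omitted):

  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ManualEmailIndex {

    // Write the user row and its index row together. Note there is no
    // transaction: readers can briefly see one put without the other.
    static void putUser(HTable users, HTable usersEmail,
        String userid, String email) throws Exception {
      Put user = new Put(Bytes.toBytes(userid));
      user.add(Bytes.toBytes("data"), Bytes.toBytes("email"),
          Bytes.toBytes(email));
      users.put(user);

      Put index = new Put(Bytes.toBytes(email));
      index.add(Bytes.toBytes("data"), Bytes.toBytes("userid"),
          Bytes.toBytes(userid));
      usersEmail.put(index);
    }

    // The "index lookup": one get on users_email, then one get on users.
    static Result getUserByEmail(HTable users, HTable usersEmail,
        String email) throws Exception {
      Result indexRow = usersEmail.get(new Get(Bytes.toBytes(email)));
      byte[] userid = indexRow.getValue(Bytes.toBytes("data"),
          Bytes.toBytes("userid"));
      if (userid == null) {
        return null; // no user indexed under this email
      }
      return users.get(new Get(userid));
    }
  }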


If you need hash-level performance on the index lookup, there are lots 
of solutions outside of HBase that would work... In-memory Java HashMap, 
Tokyo Cabinet on-disk HashMaps, BerkeleyDB, etc... If you need full-text 
indexing, you can use Lucene or the like.


Make sense?

JG

bharath vissapragada wrote:

But i have read somewhere that Secondary indexes are somewhat slow compared
to normal HBase tables .. Does that affect the performance?

Also do you know the type of index created on the column(i mean Hash type or
Btree etc)

On Mon, Aug 17, 2009 at 8:30 PM, Kirill Shabunov  wrote:


Hi!

As far as I understand you are talking about the secondary indexes. Yes,
they can be used to quickly get the rowkey by a value in the indexed column.

--Kirill


bharath vissapragada wrote:


Hi all ,

I have gone through the IndexedTableAdmin classes in Hbase 0.19.3 API ..
 I
have seen some methods used to create an Indexed Table (on some column)..
I
have some doubts regarding the same ...

1) Are these somewhat similar to Hash indexes(in RDBMS) where i can easily
look up a column value and find its corresponding rowkey(s)
2) Can i find any performance gain when i use IndexedTable to search for a
particular column value .. instead of scanning an entire normal HTable

Kindly clarify my doubts

Thanks in advance






Re: Hi, I am still puzzled to find a row key by a cell.

2009-08-17 Thread Jonathan Gray

There is some built-in support for secondary indexes.

Look at IndexedTable.  It is in contrib for 0.20 and part of the main 
codebase for 0.19.


0.20 Javadoc:

http://jgray.la/javadoc/hbase-0.20.0/org/apache/hadoop/hbase/client/tableindexed/package-summary.html

0.19 Javadoc:

http://hadoop.apache.org/hbase/docs/r0.19.3/api/org/apache/hadoop/hbase/client/tableindexed/package-summary.html


There have been some performance issues in the past, but I've not used 
the built-in facilities for some time.  I do know of some people using 
this in production.  Internally, we implement some indexes in the same 
manner but manage it in the application.  See my previous e-mail to the 
list to understand how it's implemented.


JG

Ryan Rawson wrote:

hey,

That isn't how hbase (or even rdbms) work.  Instead you can retrieve
rows based on their row key. Otherwise you will have to read the
entire table to find just that 1 row. Yes this is as inefficient as it
sounds.

If you frequently have this issue, you may need to build and maintain
secondary indexes. Unlike relational dbs, there is no built in support
for this, you have to write your app to handle this.

-ryan



On Sun, Aug 16, 2009 at 11:51 PM, lei wang wrote:

Hi, if I know a cell in HBase by its column::value, I need to
know which row key it belongs to. I searched the HBase API several times,
but I cannot find the right method to solve my problem. Thanks for any
suggestions.





Re: Hi, I am still puzzled to find a row key by a cell.

2009-08-17 Thread Jonathan Gray
This is possible by checking the values in your client, through 
server-side filters (see org.apache.hadoop.hbase.filter), or with 
secondary indexing (as described in my previous e-mail).


In the first two cases, you are doing a full table scan so it is very 
inefficient.  That's the same as it would be in an RDBMS if you were not 
using secondary indexes, however.
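
A minimal sketch of the first option (checking values client side), assuming
the 0.20-era client API; the "info:status" column is made up for
illustration, and every row is shipped to the client, which is exactly the
inefficiency described above:

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class WhereByFullScan {

    // Emulates: SELECT rowkey FROM mytable WHERE info:status = wanted
    static List<byte[]> rowsWithValue(HTable table, byte[] wanted)
        throws Exception {
      List<byte[]> matches = new ArrayList<byte[]>();
      Scan scan = new Scan();
      scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("status"));
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          byte[] value = r.getValue(Bytes.toBytes("info"),
              Bytes.toBytes("status"));
          if (value != null && Bytes.equals(value, wanted)) {
            matches.add(r.getRow()); // collect matching row keys
          }
        }
      } finally {
        scanner.close();
      }
      return matches;
    }
  }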


HBase has limited secondary indexing support, but do not expect the 
flexibility and performance you get from an RDBMS secondary index.


If this is central to your usage of HBase, make sure that HBase is what 
you want and take another look at your schema to see if there might be a 
better design to prevent needing heavy indexing or table scanning.


JG

Rocks wrote:

Thanks for your answer.
However, what I want to do is just like the "where" keyword of SQL in an
RDBMS. Is it impossible in HBase?

On Mon, Aug 17, 2009 at 2:58 PM, Ryan Rawson  wrote:


hey,

That isn't how hbase (or even rdbms) work.  Instead you can retrieve
rows based on their row key. Otherwise you will have to read the
entire table to find just that 1 row. Yes this is as inefficient as it
sounds.

If you frequently have this issue, you may need to build and maintain
secondary indexes. Unlike relational dbs, there is no built in support
for this, you have to write your app to handle this.

-ryan



On Sun, Aug 16, 2009 at 11:51 PM, lei wang wrote:

Hi, if I know a cell in HBase by its column::value, I need to
know which row key it belongs to. I searched the HBase API several times,
but I cannot find the right method to solve my problem. Thanks for any
suggestions.





Re: Indexed Table in Hbase

2009-08-17 Thread bharath vissapragada
I got it ... I think this is definitely useful in my app because I am
performing a full table scan every time for selecting the rowkeys based on
some column values.

BUT ..

We can have more than one rowkey for the same column value. Can you please
tell me how they are stored.

Thanks in advance

On Mon, Aug 17, 2009 at 9:27 PM, Jonathan Gray  wrote:

> It's not an actual hash or btree index, but rather secondary indexes in
> HBase are implemented by creating an additional HBase table.
>
> If I have a table "users" (row key is userid) with family "data" and column
> "email", and I want to index the value in that column...
>
> I can create a table "users_email" where the row key is the email address
> (value from the column in "users" table) and a single column that contains
> the userid.
>
> Doing an "index lookup" would mean doing a get on "users_email" and then
> using that userid to do a lookup on the "users" table.
>
> IndexedTable does this transparently, but still does require two queries.
>  So it's slower than a single query, but certainly faster than a full table
> scan.
>
> If you need hash-level performance on the index lookup, there are lots of
> solutions outside of HBase that would work... In-memory Java HashMap, Tokyo
> Cabinet on-disk HashMaps, BerkeleyDB, etc... If you need full-text indexing,
> you can use Lucene or the like.
>
> Make sense?
>
> JG
>
>
> bharath vissapragada wrote:
>
>> But i have read somewhere that Secondary indexes are somewhat slow
>> compared
>> to normal HBase tables .. Does that affect the performance?
>>
>> Also do you know the type of index created on the column(i mean Hash type
>> or
>> Btree etc)
>>
>> On Mon, Aug 17, 2009 at 8:30 PM, Kirill Shabunov  wrote:
>>
>>  Hi!
>>>
>>> As far as I understand you are talking about the secondary indexes. Yes,
>>> they can be used to quickly get the rowkey by a value in the indexed
>>> column.
>>>
>>> --Kirill
>>>
>>>
>>> bharath vissapragada wrote:
>>>
>>>  Hi all ,

 I have gone through the IndexedTableAdmin classes in Hbase 0.19.3 API ..
  I
 have seen some methods used to create an Indexed Table (on some
 column)..
 I
 have some doubts regarding the same ...

 1) Are these somewhat similar to Hash indexes(in RDBMS) where i can
 easily
 look up a column value and find its corresponding rowkey(s)
 2) Can i find any performance gain when i use IndexedTable to search for
 a
 particular column value .. instead of scanning an entire normal HTable
 ..

 Kindly clarify my doubts

 Thanks in advance



>>


Re: Indexed Table in Hbase

2009-08-17 Thread Jonathan Gray

I'm actually unsure about that.  Look at the code or experiment.

Seems to me that there would be a uniqueness requirement, otherwise what 
do you expect the behavior to be?  A get can only return a single row, 
so multiple index hits don't really make sense.


Clint?  You out there? :)

JG

bharath vissapragada wrote:

I got it ... I think this is definitely useful in my app because I am
performing a full table scan every time for selecting the rowkeys based on
some column values .

BUT ..

 we can have more than one rowkey for the same column value .Can you please
tell me how they are stored .

Thanks in advance

On Mon, Aug 17, 2009 at 9:27 PM, Jonathan Gray  wrote:


It's not an actual hash or btree index, but rather secondary indexes in
HBase are implemented by creating an additional HBase table.

If I have a table "users" (row key is userid) with family "data" and column
"email", and I want to index the value in that column...

I can create a table "users_email" where the row key is the email address
(value from the column in "users" table) and a single column that contains
the userid.

Doing an "index lookup" would mean doing a get on "users_email" and then
using that userid to do a lookup on the "users" table.

IndexedTable does this transparently, but still does require two queries.
 So it's slower than a single query, but certainly faster than a full table
scan.

If you need hash-level performance on the index lookup, there are lots of
solutions outside of HBase that would work... In-memory Java HashMap, Tokyo
Cabinet on-disk HashMaps, BerkeleyDB, etc... If you need full-text indexing,
you can use Lucene or the like.

Make sense?

JG


bharath vissapragada wrote:


But i have read somewhere that Secondary indexes are somewhat slow
compared
to normal HBase tables .. Does that affect the performance?

Also do you know the type of index created on the column(i mean Hash type
or
Btree etc)

On Mon, Aug 17, 2009 at 8:30 PM, Kirill Shabunov  wrote:

 Hi!

As far as I understand you are talking about the secondary indexes. Yes,
they can be used to quickly get the rowkey by a value in the indexed
column.

--Kirill


bharath vissapragada wrote:

 Hi all ,

I have gone through the IndexedTableAdmin classes in Hbase 0.19.3 API ..
 I
have seen some methods used to create an Indexed Table (on some
column)..
I
have some doubts regarding the same ...

1) Are these somewhat similar to Hash indexes(in RDBMS) where i can
easily
look up a column value and find its corresponding rowkey(s)
2) Can i find any performance gain when i use IndexedTable to search for
a
particular column value .. instead of scanning an entire normal HTable
..

Kindly clarify my doubts

Thanks in advance







Re: Indexed Table in Hbase

2009-08-17 Thread bharath vissapragada
Generally one may expect that, apart from the rowkey, other columns can have
repeated attributes, and similar is the case with my application ..
In the API there seems to be no such function doing that job.

If any others know more about it or have faced the same situation, kindly reply.

Thanks.


On Mon, Aug 17, 2009 at 10:30 PM, Jonathan Gray  wrote:

> I'm actually unsure about that.  Look at the code or experiment.
>
> Seems to me that there would be a uniqueness requirement, otherwise what do
> you expect the behavior to be?  A get can only return a single row, so
> multiple index hits doesn't really make sense.
>
> Clint?  You out there? :)
>
> JG
>
>
> bharath vissapragada wrote:
>
>> I got it ... I think this is definitely useful in my app because I am
>> performing a full table scan every time for selecting the rowkeys based on
>> some column values .
>>
>> BUT ..
>>
>>  we can have more than one rowkey for the same column value .Can you
>> please
>> tell me how they are stored .
>>
>> Thanks in advance
>>
>> On Mon, Aug 17, 2009 at 9:27 PM, Jonathan Gray  wrote:
>>
>>  It's not an actual hash or btree index, but rather secondary indexes in
>>> HBase are implemented by creating an additional HBase table.
>>>
>>> If I have a table "users" (row key is userid) with family "data" and
>>> column
>>> "email", and I want to index the value in that column...
>>>
>>> I can create a table "users_email" where the row key is the email address
>>> (value from the column in "users" table) and a single column that
>>> contains
>>> the userid.
>>>
>>> Doing an "index lookup" would mean doing a get on "users_email" and then
>>> using that userid to do a lookup on the "users" table.
>>>
>>> IndexedTable does this transparently, but still does require two queries.
>>>  So it's slower than a single query, but certainly faster than a full
>>> table
>>> scan.
>>>
>>> If you need hash-level performance on the index lookup, there are lots of
>>> solutions outside of HBase that would work... In-memory Java HashMap,
>>> Tokyo
>>> Cabinet on-disk HashMaps, BerkeleyDB, etc... If you need full-text
>>> indexing,
>>> you can use Lucene or the like.
>>>
>>> Make sense?
>>>
>>> JG
>>>
>>>
>>> bharath vissapragada wrote:
>>>
>>>  But i have read somewhere that Secondary indexes are somewhat slow
 compared
 to normal HBase tables .. Does that affect the performance?

 Also do you know the type of index created on the column(i mean Hash
 type
 or
 Btree etc)

 On Mon, Aug 17, 2009 at 8:30 PM, Kirill Shabunov 
 wrote:

  Hi!

> As far as I understand you are talking about the secondary indexes.
> Yes,
> they can be used to quickly get the rowkey by a value in the indexed
> column.
>
> --Kirill
>
>
> bharath vissapragada wrote:
>
>  Hi all ,
>
>> I have gone through the IndexedTableAdmin classes in Hbase 0.19.3 API
>> ..
>>  I
>> have seen some methods used to create an Indexed Table (on some
>> column)..
>> I
>> have some doubts regarding the same ...
>>
>> 1) Are these somewhat similar to Hash indexes(in RDBMS) where i can
>> easily
>> look up a column value and find its corresponding rowkey(s)
>> 2) Can i find any performance gain when i use IndexedTable to search
>> for
>> a
>> particular column value .. instead of scanning an entire normal HTable
>> ..
>>
>> Kindly clarify my doubts
>>
>> Thanks in advance
>>
>>
>>
>>
>>


Re: Indexed Table in Hbase

2009-08-17 Thread Gary Helmling
When defining the IndexSpecification for your table, you can pass your
own implementation of
org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator.

This allows you to control how the row keys are generated for the
secondary index table.  For example, you could append the original
table's row key to the indexed value to ensure uniqueness in
referencing the original rows.
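
A minimal sketch of what such a custom generator could look like. The class
name is made up, it mirrors the SimpleIndexKeyGenerator behavior quoted
later in this thread, and it assumes the interface also requires
Writable-style serialization, so verify against the 0.19/0.20 source:

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import java.util.Map;

  import org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator;
  import org.apache.hadoop.hbase.util.Bytes;

  // Index key = indexed value + original row key, so two rows with the
  // same value still get distinct index rows that sort next to each other.
  public class ValuePlusRowKeyGenerator implements IndexKeyGenerator {

    private byte[] column;

    public ValuePlusRowKeyGenerator() {
      // no-arg constructor needed for Writable deserialization
    }

    public ValuePlusRowKeyGenerator(byte[] column) {
      this.column = column;
    }

    public byte[] createIndexKey(byte[] rowKey, Map<byte[], byte[]> columns) {
      return Bytes.add(columns.get(column), rowKey);
    }

    public void write(DataOutput out) throws IOException {
      Bytes.writeByteArray(out, column);
    }

    public void readFields(DataInput in) throws IOException {
      column = Bytes.readByteArray(in);
    }
  }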

When you create an indexed scanner, the secondary index code opens and
wraps a scanner on the secondary index table, based on the start row
you specify (the indexed value you're looking up).  It applies any
filter passed to rows on the secondary index table, so make sure
anything you want to filter on is listed in the "indexed columns" in
your IndexSpecification.

For any rows returned by the wrapped scanner, the client code then
does a get for the original table record (the original row key is
stored in the "__INDEX__" column family I think).

So in total, when using secondary indexes, you wind up with 1 scan + N
gets to look at N rows.

At least, this was my understanding of how things worked as of 0.19.
I'm actually moving indexing into my app layer as I update to 0.20.

Hope this helps.

--gh


On Mon, Aug 17, 2009 at 1:00 PM, Jonathan Gray wrote:
> I'm actually unsure about that.  Look at the code or experiment.
>
> Seems to me that there would be a uniqueness requirement, otherwise what do
> you expect the behavior to be?  A get can only return a single row, so
> multiple index hits doesn't really make sense.
>
> Clint?  You out there? :)
>
> JG
>
> bharath vissapragada wrote:
>>
>> I got it ... I think this is definitely useful in my app because I am
>> performing a full table scan every time for selecting the rowkeys based on
>> some column values .
>>
>> BUT ..
>>
>>  we can have more than one rowkey for the same column value .Can you
>> please
>> tell me how they are stored .
>>
>> Thanks in advance
>>
>> On Mon, Aug 17, 2009 at 9:27 PM, Jonathan Gray  wrote:
>>
>>> It's not an actual hash or btree index, but rather secondary indexes in
>>> HBase are implemented by creating an additional HBase table.
>>>
>>> If I have a table "users" (row key is userid) with family "data" and
>>> column
>>> "email", and I want to index the value in that column...
>>>
>>> I can create a table "users_email" where the row key is the email address
>>> (value from the column in "users" table) and a single column that
>>> contains
>>> the userid.
>>>
>>> Doing an "index lookup" would mean doing a get on "users_email" and then
>>> using that userid to do a lookup on the "users" table.
>>>
>>> IndexedTable does this transparently, but still does require two queries.
>>>  So it's slower than a single query, but certainly faster than a full
>>> table
>>> scan.
>>>
>>> If you need hash-level performance on the index lookup, there are lots of
>>> solutions outside of HBase that would work... In-memory Java HashMap,
>>> Tokyo
>>> Cabinet on-disk HashMaps, BerkeleyDB, etc... If you need full-text
>>> indexing,
>>> you can use Lucene or the like.
>>>
>>> Make sense?
>>>
>>> JG
>>>
>>>
>>> bharath vissapragada wrote:
>>>
 But i have read somewhere that Secondary indexes are somewhat slow
 compared
 to normal HBase tables .. Does that affect the performance?

 Also do you know the type of index created on the column(i mean Hash
 type
 or
 Btree etc)

 On Mon, Aug 17, 2009 at 8:30 PM, Kirill Shabunov 
 wrote:

  Hi!
>
> As far as I understand you are talking about the secondary indexes.
> Yes,
> they can be used to quickly get the rowkey by a value in the indexed
> column.
>
> --Kirill
>
>
> bharath vissapragada wrote:
>
>  Hi all ,
>>
>> I have gone through the IndexedTableAdmin classes in Hbase 0.19.3 API
>> ..
>>  I
>> have seen some methods used to create an Indexed Table (on some
>> column)..
>> I
>> have some doubts regarding the same ...
>>
>> 1) Are these somewhat similar to Hash indexes(in RDBMS) where i can
>> easily
> >> look up a column value and find its corresponding rowkey(s)
>> 2) Can i find any performance gain when i use IndexedTable to search
>> for
>> a
> >> particular column value .. instead of scanning an entire normal HTable
>> ..
>>
>> Kindly clarify my doubts
>>
>> Thanks in advance
>>
>>
>>
>>
>


Re: Indexed Table in Hbase

2009-08-17 Thread Ski Gh3
Agree, I think the index-scan can probably be more useful than the
index-get.
Actually in Jonathan's example, I would compose the index table key with
"indexedvalue"+"primarykey".
Many rows may have the same indexed values (not in this email example, but
think about other stuff),
then I can get all primary keys with the same indexed values.

Cheers

On Mon, Aug 17, 2009 at 10:23 AM, Gary Helmling  wrote:

> When defining the IndexSpecification for your table, you can pass your
> own implementation of
> org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator.
>
> This allows you to control how the row keys are generated for the
> secondary index table.  For example, you could append the original
> table's row key to the indexed value to ensure uniqueness in
> referencing the original rows.
>
> When you create an indexed scanner, the secondary index code opens and
> wraps a scanner on the secondary index table, based on the start row
> you specify (the indexed value you're looking up).  It applies any
> filter passed to rows on the secondary index table, so make sure
> anything you want to filter on is listed in the "indexed columns" in
> your IndexSpecification.
>
> For any rows returned by the wrapped scanner, the client code then
> does a get for the original table record (the original row key is
> stored in the "__INDEX__" column family I think).
>
> So in total, when using secondary indexes, you wind up with 1 scan + N
> gets to look at N rows.
>
> At least, this was my understanding of how things worked as of 0.19.
> I'm actually moving indexing into my app layer as I update to 0.20.
>
> Hope this helps.
>
> --gh
>
>
> On Mon, Aug 17, 2009 at 1:00 PM, Jonathan Gray wrote:
> > I'm actually unsure about that.  Look at the code or experiment.
> >
> > Seems to me that there would be a uniqueness requirement, otherwise what
> do
> > you expect the behavior to be?  A get can only return a single row, so
> > multiple index hits doesn't really make sense.
> >
> > Clint?  You out there? :)
> >
> > JG
> >
> > bharath vissapragada wrote:
> >>
> >> I got it ... I think this is definitely useful in my app because I am
> >> performing a full table scan every time for selecting the rowkeys based
> on
> >> some column values .
> >>
> >> BUT ..
> >>
> >>  we can have more than one rowkey for the same column value .Can you
> >> please
> >> tell me how they are stored .
> >>
> >> Thanks in advance
> >>
> >> On Mon, Aug 17, 2009 at 9:27 PM, Jonathan Gray 
> wrote:
> >>
> >>> It's not an actual hash or btree index, but rather secondary indexes in
> >>> HBase are implemented by creating an additional HBase table.
> >>>
> >>> If I have a table "users" (row key is userid) with family "data" and
> >>> column
> >>> "email", and I want to index the value in that column...
> >>>
> >>> I can create a table "users_email" where the row key is the email
> address
> >>> (value from the column in "users" table) and a single column that
> >>> contains
> >>> the userid.
> >>>
> >>> Doing an "index lookup" would mean doing a get on "users_email" and
> then
> >>> using that userid to do a lookup on the "users" table.
> >>>
> >>> IndexedTable does this transparently, but still does require two
> queries.
> >>>  So it's slower than a single query, but certainly faster than a full
> >>> table
> >>> scan.
> >>>
> >>> If you need hash-level performance on the index lookup, there are lots
> of
> >>> solutions outside of HBase that would work... In-memory Java HashMap,
> >>> Tokyo
> >>> Cabinet on-disk HashMaps, BerkeleyDB, etc... If you need full-text
> >>> indexing,
> >>> you can use Lucene or the like.
> >>>
> >>> Make sense?
> >>>
> >>> JG
> >>>
> >>>
> >>> bharath vissapragada wrote:
> >>>
>  But i have read somewhere that Secondary indexes are somewhat slow
>  compared
>  to normal HBase tables .. Does that affect the performance?
> 
>  Also do you know the type of index created on the column(i mean Hash
>  type
>  or
>  Btree etc)
> 
>  On Mon, Aug 17, 2009 at 8:30 PM, Kirill Shabunov 
>  wrote:
> 
>   Hi!
> >
> > As far as I understand you are talking about the secondary indexes.
> > Yes,
> > they can be used to quickly get the rowkey by a value in the indexed
> > column.
> >
> > --Kirill
> >
> >
> > bharath vissapragada wrote:
> >
> >  Hi all ,
> >>
> >> I have gone through the IndexedTableAdmin classes in Hbase 0.19.3
> API
> >> ..
> >>  I
> >> have seen some methods used to create an Indexed Table (on some
> >> column)..
> >> I
> >> have some doubts regarding the same ...
> >>
> >> 1) Are these somewhat similar to Hash indexes(in RDBMS) where i can
> >> easily
> >> look up a column value and find its corresponding rowkey(s)
> >> 2) Can i find any performance gain when i use IndexedTable to search
> >> for
> >> a
> >> particular column value .. instead of scanning an enti

Re: Indexed Table in Hbase

2009-08-17 Thread bharath vissapragada
Thanks for your explanation Gary,

Consider my case where I can have repetitions of values .. So you say that I
edit the IndexKeyGenerator in such a way that instead of storing
(column->rowkey) I should do it in such a way that (column-> rowkey1,rowkey2)
as diff timestamps ... if yes, is that a good way?

On Mon, Aug 17, 2009 at 10:53 PM, Gary Helmling  wrote:

> When defining the IndexSpecification for your table, you can pass your
> own implementation of
> org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator.
>
> This allows you to control how the row keys are generated for the
> secondary index table.  For example, you could append the original
> table's row key to the indexed value to ensure uniqueness in
> referencing the original rows.
>
> When you create an indexed scanner, the secondary index code opens and
> wraps a scanner on the secondary index table, based on the start row
> you specify (the indexed value you're looking up).  It applies any
> filter passed to rows on the secondary index table, so make sure
> anything you want to filter on is listed in the "indexed columns" in
> your IndexSpecification.
>
> For any rows returned by the wrapped scanner, the client code then
> does a get for the original table record (the original row key is
> stored in the "__INDEX__" column family I think).
>
> So in total, when using secondary indexes, you wind up with 1 scan + N
> gets to look at N rows.
>
> At least, this was my understanding of how things worked as of 0.19.
> I'm actually moving indexing into my app layer as I update to 0.20.
>
> Hope this helps.
>
> --gh
>
>
> On Mon, Aug 17, 2009 at 1:00 PM, Jonathan Gray wrote:
> > I'm actually unsure about that.  Look at the code or experiment.
> >
> > Seems to me that there would be a uniqueness requirement, otherwise what
> do
> > you expect the behavior to be?  A get can only return a single row, so
> > multiple index hits doesn't really make sense.
> >
> > Clint?  You out there? :)
> >
> > JG
> >
> > bharath vissapragada wrote:
> >>
> >> I got it ... I think this is definitely useful in my app because I am
> >> performing a full table scan every time for selecting the rowkeys based
> on
> >> some column values .
> >>
> >> BUT ..
> >>
> >>  we can have more than one rowkey for the same column value .Can you
> >> please
> >> tell me how they are stored .
> >>
> >> Thanks in advance
> >>
> >> On Mon, Aug 17, 2009 at 9:27 PM, Jonathan Gray 
> wrote:
> >>
> >>> It's not an actual hash or btree index, but rather secondary indexes in
> >>> HBase are implemented by creating an additional HBase table.
> >>>
> >>> If I have a table "users" (row key is userid) with family "data" and
> >>> column
> >>> "email", and I want to index the value in that column...
> >>>
> >>> I can create a table "users_email" where the row key is the email
> address
> >>> (value from the column in "users" table) and a single column that
> >>> contains
> >>> the userid.
> >>>
> >>> Doing an "index lookup" would mean doing a get on "users_email" and
> then
> >>> using that userid to do a lookup on the "users" table.
> >>>
> >>> IndexedTable does this transparently, but still does require two
> queries.
> >>>  So it's slower than a single query, but certainly faster than a full
> >>> table
> >>> scan.
> >>>
> >>> If you need hash-level performance on the index lookup, there are lots
> of
> >>> solutions outside of HBase that would work... In-memory Java HashMap,
> >>> Tokyo
> >>> Cabinet on-disk HashMaps, BerkeleyDB, etc... If you need full-text
> >>> indexing,
> >>> you can use Lucene or the like.
> >>>
> >>> Make sense?
> >>>
> >>> JG
> >>>
> >>>
> >>> bharath vissapragada wrote:
> >>>
>  But i have read somewhere that Secondary indexes are somewhat slow
>  compared
>  to normal HBase tables .. Does that affect the performance?
> 
>  Also do you know the type of index created on the column(i mean Hash
>  type
>  or
>  Btree etc)
> 
>  On Mon, Aug 17, 2009 at 8:30 PM, Kirill Shabunov 
>  wrote:
> 
>   Hi!
> >
> > As far as I understand you are talking about the secondary indexes.
> > Yes,
> > they can be used to quickly get the rowkey by a value in the indexed
> > column.
> >
> > --Kirill
> >
> >
> > bharath vissapragada wrote:
> >
> >  Hi all ,
> >>
> >> I have gone through the IndexedTableAdmin classes in Hbase 0.19.3
> API
> >> ..
> >>  I
> >> have seen some methods used to create an Indexed Table (on some
> >> column)..
> >> I
> >> have some doubts regarding the same ...
> >>
> >> 1) Are these somewhat similar to Hash indexes(in RDBMS) where i can
> >> easily
> >> look up a column value and find its corresponding rowkey(s)
> >> 2) Can i find any performance gain when i use IndexedTable to search
> >> for
> >> a
> >> particular column value .. instead of scanning an entire normal
> HTable
> >> ..
> >>
> >> Ki

RE: Indexed Table in Hbase

2009-08-17 Thread Hegner, Travis
I'm not familiar with tableindexed at all, but my manually indexed tables have 
the value as the row key, and a single column for each row of the original 
table that has that value.

The key u...@domain.com would have columns rows:user1, rows:user7, rows:user12, 
etc.

Then just do a get on u...@domain.com and you'll have a whole list of users 
with that email address. The added benefit is that you can put some useful 
piece of info into any of the rows:user1 cells like whether the address is 
primary, or whatever fits your design.
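
A minimal sketch of this layout (illustrative code, not Travis's; the family
name "rows", the use of user ids as column qualifiers, and a readable address
in place of the archive-obfuscated one are all assumptions):

  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class EmailIndexByColumns {

    // Index row: key = email address, one column under "rows" per user id.
    // The cell value carries extra info, e.g. whether the address is primary.
    static void indexUser(HTable index, String email, String userid,
        boolean primary) throws Exception {
      Put put = new Put(Bytes.toBytes(email));
      put.add(Bytes.toBytes("rows"), Bytes.toBytes(userid),
          Bytes.toBytes(primary ? "primary" : "secondary"));
      index.put(put);
    }

    // One get returns every user id with this address as column qualifiers.
    static Result usersWithEmail(HTable index, String email) throws Exception {
      return index.get(new Get(Bytes.toBytes(email)));
    }
  }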

Just a thought, perhaps you could implement that method with the 
tableindexed.IndexKeyGenerator that Gary mentioned.

Thanks,

Travis Hegner
http://www.travishegner.com/


-Original Message-
From: bharath vissapragada [mailto:bharathvissapragada1...@gmail.com]
Sent: Monday, August 17, 2009 1:46 PM
To: hbase-user@hadoop.apache.org
Subject: Re: Indexed Table in Hbase

Thanks for your explanation Gary,

Consider my case where I can have repetitions of values .. So you say that I
edit the IndexKeyGenerator in such a way that instead of storing
(column->rowkey) I should do it in such a way that (column-> rowkey1,rowkey2)
as diff timestamps ... if yes, is that a good way?

On Mon, Aug 17, 2009 at 10:53 PM, Gary Helmling  wrote:

> When defining the IndexSpecification for your table, you can pass your
> own implementation of
> org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator.
>
> This allows you to control how the row keys are generated for the
> secondary index table.  For example, you could append the original
> table's row key to the indexed value to ensure uniqueness in
> referencing the original rows.
>
> When you create an indexed scanner, the secondary index code opens and
> wraps a scanner on the secondary index table, based on the start row
> you specify (the indexed value you're looking up).  It applies any
> filter passed to rows on the secondary index table, so make sure
> anything you want to filter on is listed in the "indexed columns" in
> your IndexSpecification.
>
> For any rows returned by the wrapped scanner, the client code then
> does a get for the original table record (the original row key is
> stored in the "__INDEX__" column family I think).
>
> So in total, when using secondary indexes, you wind up with 1 scan + N
> gets to look at N rows.
>
> At least, this was my understanding of how things worked as of 0.19.
> I'm actually moving indexing into my app layer as I update to 0.20.
>
> Hope this helps.
>
> --gh
>
>
> On Mon, Aug 17, 2009 at 1:00 PM, Jonathan Gray wrote:
> > I'm actually unsure about that.  Look at the code or experiment.
> >
> > Seems to me that there would be a uniqueness requirement, otherwise what
> do
> > you expect the behavior to be?  A get can only return a single row, so
> > multiple index hits doesn't really make sense.
> >
> > Clint?  You out there? :)
> >
> > JG
> >
> > bharath vissapragada wrote:
> >>
> >> I got it ... I think this is definitely useful in my app because I am
> >> performing a full table scan every time for selecting the rowkeys based
> on
> >> some column values .
> >>
> >> BUT ..
> >>
> >>  we can have more than one rowkey for the same column value .Can you
> >> please
> >> tell me how they are stored .
> >>
> >> Thanks in advance
> >>
> >> On Mon, Aug 17, 2009 at 9:27 PM, Jonathan Gray 
> wrote:
> >>
> >>> It's not an actual hash or btree index, but rather secondary indexes in
> >>> HBase are implemented by creating an additional HBase table.
> >>>
> >>> If I have a table "users" (row key is userid) with family "data" and
> >>> column
> >>> "email", and I want to index the value in that column...
> >>>
> >>> I can create a table "users_email" where the row key is the email
> address
> >>> (value from the column in "users" table) and a single column that
> >>> contains
> >>> the userid.
> >>>
> >>> Doing an "index lookup" would mean doing a get on "users_email" and
> then
> >>> using that userid to do a lookup on the "users" table.
> >>>
> >>> IndexedTable does this transparently, but still does require two
> queries.
> >>>  So it's slower than a single query, but certainly faster than a full
> >>> table
> >>> scan.
> >>>
> >>> If you need hash-level performance on the index lookup, there are lots
> of
> >>> solutions outside of HBase that would work... In-memory Java HashMap,
> >>> Tokyo
> >>> Cabinet on-disk HashMaps, BerkeleyDB, etc... If you need full-text
> >>> indexing,
> >>> you can use Lucene or the like.
> >>>
> >>> Make sense?
> >>>
> >>> JG
> >>>
> >>>
> >>> bharath vissapragada wrote:
> >>>
>  But i have read somewhere that Secondary indexes are somewhat slow
>  compared
>  to normal HBase tables .. Does that affect the performance?
> 
>  Also do you know the type of index created on the column(i mean Hash
>  type
>  or
>  Btree etc)
> 
>  On Mon, Aug 17, 2009 at 8:30 PM, Kirill Shabunov 
>  wrote:
> 
>   Hi!
> >
> > As far as I 

Re: Indexed Table in Hbase

2009-08-17 Thread Gary Helmling
Hi Bharath,

If you're using the default key generator
(org.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator),
it actually appends the base table row key for you.  So even though
the column value may be the same for multiple rows, the secondary
index table will still have 1 row for each row with the value in the
original table.  Here is the relevant method from SimpleIndexKeyGenerator:

  public byte[] createIndexKey(byte[] rowKey, Map<byte[], byte[]> columns) {
    return Bytes.add(columns.get(column), rowKey);
  }

So, say you have a table "mytable", with the columns:
info:keycol   (say this is the one you want to index)
info:col2
info:col3

If you define your table with the index specification -- new
IndexSpecification("keycol", Bytes.toBytes("info:keycol")) -- then
HBase will create the secondary index table named "mytable-by_keycol".

Then, say you add the following rows to "mytable":

"row1":  info:keycol="one", info:col2="abc", info:col3="def"
"row2":  info:keycol="one", info:col2="ghi", info:col3="jkl"

At this point, your index table ("mytable-by_keycol") will have the
following rows:

"onerow1": info:keycol="one", __INDEX__:ROW="row1"
"onerow2": info:keycol="one", __INDEX__:ROW="row2"

So you wind up with 2 rows in the index table (with unique row keys)
pointing back at the original table rows, even though we've only
stored a single distinct value for info:keycol.

To access the rows by the secondary index, you create a scanner using
IndexedTable.getIndexedScanner(...).  I don't think there's support
for using the indexes when performing a random read with
HTable.getRow()/HTable.get().  (But maybe I'm wrong?)

As Travis mentions, you could always use an alternate approach to
implement your own indexing (use the index value as the row key for
your own index table and store the original table row keys as
individual columns).  I'm using the same approach for one access
pattern and so far it seems to work very well.

But as far as I know, the built-in secondary indexing assumes 1
secondary index table row -> 1 original table row.

Sorry if this got a bit long-winded.  It gets a little complicated to
explain in text...

--gh


On Mon, Aug 17, 2009 at 1:46 PM, bharath
vissapragada wrote:
> Thanks for ur explanation Gary ,
>
> Consider my case where i can have repetitions of values .. So u say that i
> edit the IndexKeyGenerator in such a way that instead of storing
> (column->rowkey) i should do in such a way that (coulmn-> rowkey1,rowkey2)
> as diff timestamps ... if yes is that a good way ?
>
> On Mon, Aug 17, 2009 at 10:53 PM, Gary Helmling  wrote:
>
>> When defining the IndexSpecification for your table, you can pass your
>> own implementation of
>> org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator.
>>
>> This allows you to control how the row keys are generated for the
>> secondary index table.  For example, you could append the original
>> table's row key to the indexed value to ensure uniqueness in
>> referencing the original rows.
>>
>> When you create an indexed scanner, the secondary index code opens and
>> wraps a scanner on the secondary index table, based on the start row
>> you specify (the indexed value you're looking up).  It applies any
>> filter passed to rows on the secondary index table, so make sure
>> anything you want to filter on is listed in the "indexed columns" in
>> your IndexSpecification.
>>
>> For any rows returned by the wrapped scanner, the client code then
>> does a get for the original table record (the original row key is
>> stored in the "__INDEX__" column family I think).
>>
>> So in total, when using secondary indexes, you wind up with 1 scan + N
>> gets to look at N rows.
>>
>> At least, this was my understanding of how things worked as of 0.19.
>> I'm actually moving indexing into my app layer as I update to 0.20.
>>
>> Hope this helps.
>>
>> --gh
>>
>>
>> On Mon, Aug 17, 2009 at 1:00 PM, Jonathan Gray wrote:
>> > I'm actually unsure about that.  Look at the code or experiment.
>> >
>> > Seems to me that there would be a uniqueness requirement, otherwise what
>> do
>> > you expect the behavior to be?  A get can only return a single row, so
>> > multiple index hits doesn't really make sense.
>> >
>> > Clint?  You out there? :)
>> >
>> > JG
>> >
>> > bharath vissapragada wrote:
>> >>
>> >> I got it ... I think this is definitely useful in my app because I am
>> >> performing a full table scan every time for selecting the rowkeys based
>> >> on some column values.
>> >>
>> >> BUT ..
>> >>
>> >>  We can have more than one rowkey for the same column value. Can you
>> >> please tell me how they are stored?
>> >>
>> >> Thanks in advance
>> >>
>> >> On Mon, Aug 17, 2009 at 9:27 PM, Jonathan Gray 
>> wrote:
>> >>
>> >>> It's not an actual hash or btree index, but rather secondary indexes in
>> >>> HBase are implemented by creating an additional HBase table.
>> >>>
>> >>> If I have a table "users" (row key is userid) with family "data" and
>> 

NoServerForRegionException, TableNotFoundException and WrongRegionException

2009-08-17 Thread Marc Limotte
I'm seeing a nice variety of Exceptions from HBase and could use some
pointers about what to do next.

This is a new map/reduce program, updating about 550k rows with around a
dozen columns on a very small cluster (only 4 nodes... as we're still
testing and it doesn't have to support production yet).  Hbase Version
0.19.1.

I ran the job and it seems to make some progress, and then dies after
several hours, reporting "NoServerForRegionException: No server address
listed in .META. for region TABLEX,,1250526695078".  I retried it a few
times with the same result.  I also noticed that the load is not well
balanced, all requests seemed to be going to one node.  I adjusted
hadoop-site.xml with the addition of these two entries:

<property>
  <name>hbase.hregion.max.filesize</name>
  <value>33554432</value>
</property>
<property>
  <name>hbase.client.retries.number</name>
  <value>5</value>
</property>

And restarted hbase (and hadoop to be safe).  Re-ran and got the same error
in the M/R job.

*I thought I'd try dropping the table, since it's a new table and I can
recreate it.  But that gives another exception:
*
hbase(main):002:0> disable 'TABLEX'
NativeException: org.apache.hadoop.hbase.TableNotFoundException:
org.apache.hadoop.hbase.TableNotFoundException: TABLEX
at
org.apache.hadoop.hbase.master.TableOperation$ProcessTableOperation.call(TableOperation.java:129)
at
org.apache.hadoop.hbase.master.TableOperation$ProcessTableOperation.call(TableOperation.java:70)
at
org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:64)
at
org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:143)
at org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:691)
...


*And now I see this exception in the HBase logs:
*
org.apache.hadoop.hbase.regionserver.WrongRegionException:
org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row out
of range for HRegion .META.,,1250280235390, startKey='',
getEndKey()='TABLEX,,1250219949252',
row='TABLEX,840.56098.0544,1250526661861'
at
org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:1788)
at
org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1844)
at
org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:1912)
at
org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1244)
at
org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1216)
...


*As a test, tried a "count"...
*
hbase(main):007:0* count 'TABLEX'
NativeException: org.apache.hadoop.hbase.client.NoServerForRegionException:
No server address listed in .META. for region TABLEX,,1250526695078
from org/apache/hadoop/hbase/client/HConnectionManager.java:548:in
`locateRegionInMeta'
from org/apache/hadoop/hbase/client/HConnectionManager.java:478:in
`locateRegion'
from org/apache/hadoop/hbase/client/HConnectionManager.java:440:in
`locateRegion'
from org/apache/hadoop/hbase/client/HTable.java:114:in `<init>'
from org/apache/hadoop/hbase/client/HTable.java:97:in `<init>'
from sun/reflect/NativeConstructorAccessorImpl.java:-2:in `newInstance0'
...


*Also saw a thread somewhere that suggested doing a major compaction.  Did
that.  It returns almost immediately.  Not sure if that's normal or not...
no perceivable impact from doing this, though.*

hbase(main):013:0> major_compact '.META.'
0 row(s) in 0.0220 seconds
hbase(main):014:0>

Not sure what else to try?  Is there a way to force removal of the table in
question?  Is there something else I should be looking at?

Marc


Re: Hi, i am still puzzeld to find a row key by a cell.

2009-08-17 Thread Bradford Stephens
Just reiterating what JGray says -- usually, when you need a secondary
index, you can get away with denormalizing and duplicating your data.

On Mon, Aug 17, 2009 at 9:30 AM, Jonathan Gray wrote:
> This is possible by checking the values in your client, through server-side
> filters (see org.apache.hadoop.hbase.filter), or with secondary indexing (as
> described in my previous e-mail).
>
> In the first two cases, you are doing a full table scan so it is very
> inefficient.  That's the same as it would be in an RDBMS if you were not
> using secondary indexes, however.
>
> HBase has limited secondary indexing support, but do not expect the
> flexibility and performance you get from an RDBMS secondary index.
>
> If this is central to your usage of HBase, make sure that HBase is what you
> want and take another look at your schema to see if there might be a better
> design to prevent needing heavy indexing or table scanning.
>
> JG
>
> Rocks wrote:
>>
>> Thanks for your answer.
>> However, what I want to do is just like the "where" keyword of SQL in an
>> RDBMS. Is it impossible in hbase?
>>
>> On Mon, Aug 17, 2009 at 2:58 PM, Ryan Rawson  wrote:
>>
>>> hey,
>>>
>>> That isn't how hbase (or even an rdbms) works.  Instead you can retrieve
>>> rows based on their row key. Otherwise you will have to read the
>>> entire table to find just that 1 row. Yes, this is as inefficient as it
>>> sounds.
>>>
>>> If you frequently have this issue, you may need to build and maintain
>>> secondary indexes. Unlike relational dbs, there is no built in support
>>> for this, you have to write your app to handle this.
>>>
>>> -ryan
>>>
>>>
>>>
>>> On Sun, Aug 16, 2009 at 11:51 PM, lei wang
>>> wrote:

 Hi, if I know a cell in HBase by its column::value, I need to
 know which row key it belongs to. I searched the HBase API several
 times, but I cannot find the right method to solve my problem.
 Thanks for any suggestions.

>>
>



-- 
http://www.roadtofailure.com -- The Fringes of Scalability, Social
Media, and Computer Science


HBase-0.20.0 Performance Evaluation

2009-08-17 Thread Schubert Zhang
We have just done a Performance Evaluation on HBase-0.20.0.
See:
http://docloud.blogspot.com/2009/08/hbase-0200-performance-evaluation.html


Re: NoServerForRegionException, TableNotFoundException and WrongRegionException

2009-08-17 Thread stack
Please update to the head of 0.19 trunk, or better update to 0.20 trunk --
especially if you are testing.  Issues described below have been addressed.

How many regions do you have in your table?  Are all going to one
regionserver because you only have one region?

Yours,
St.Ack


On Mon, Aug 17, 2009 at 12:19 PM, Marc Limotte  wrote:

> I'm seeing a nice variety of Exceptions from HBase and could use some
> pointers about what to do next.
>
> This is a new map/reduce program, updating about 550k rows with around a
> dozen columns on a very small cluster (only 4 nodes... as we're still
> testing and it doesn't have to support production yet).  Hbase Version
> 0.19.1.
>
> I ran the job and it seems to make some progress, and then dies after
> several hours, reporting "NoServerForRegionException: No server address
> listed in .META. for region TABLEX,,1250526695078".  I retried it a few
> times with the same result.  I also noticed that the load is not well
> balanced, all requests seemed to be going to one node.  I adjusted
> hadoop-site.xml with the addition of these two entries:
>
> <property>
>   <name>hbase.hregion.max.filesize</name>
>   <value>33554432</value>
> </property>
> <property>
>   <name>hbase.client.retries.number</name>
>   <value>5</value>
> </property>
>
> And restarted hbase (and hadoop to be safe).  Re-ran and got the same error
> in the M/R job.
>
> *I thought I'd try dropping the table, since it's a new table and I can
> recreate it.  But that gives another exception:
> *
> hbase(main):002:0> disable 'TABLEX'
> NativeException: org.apache.hadoop.hbase.TableNotFoundException:
> org.apache.hadoop.hbase.TableNotFoundException: TABLEX
>at
>
> org.apache.hadoop.hbase.master.TableOperation$ProcessTableOperation.call(TableOperation.java:129)
>at
>
> org.apache.hadoop.hbase.master.TableOperation$ProcessTableOperation.call(TableOperation.java:70)
>at
>
> org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:64)
>at
>
> org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:143)
>at org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:691)
> ...
>
>
> *And now I see this exception in the HBase logs:
> *
> org.apache.hadoop.hbase.regionserver.WrongRegionException:
> org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row
> out
> of range for HRegion .META.,,1250280235390, startKey='',
> getEndKey()='TABLEX,,1250219949252',
> row='TABLEX,840.56098.0544,1250526661861'
>at
> org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:1788)
>at
>
> org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1844)
>at
> org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:1912)
>at
> org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1244)
>at
> org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1216)
> ...
>
>
> *As a test, tried a "count"...
> *
> hbase(main):007:0* count 'TABLEX'
> NativeException: org.apache.hadoop.hbase.client.NoServerForRegionException:
> No server address listed in .META. for region TABLEX,,1250526695078
>from org/apache/hadoop/hbase/client/HConnectionManager.java:548:in
> `locateRegionInMeta'
>from org/apache/hadoop/hbase/client/HConnectionManager.java:478:in
> `locateRegion'
>from org/apache/hadoop/hbase/client/HConnectionManager.java:440:in
> `locateRegion'
>from org/apache/hadoop/hbase/client/HTable.java:114:in `<init>'
>from org/apache/hadoop/hbase/client/HTable.java:97:in `<init>'
>from sun/reflect/NativeConstructorAccessorImpl.java:-2:in `newInstance0'
> ...
>
>
> *Also saw a thread somewhere that suggested doing a major compaction.  Did
> that.  It returns almost immediately.  Not sure if that's normal or not...
> no perceivable impact from doing this, though.*
>
> hbase(main):013:0> major_compact '.META.'
> 0 row(s) in 0.0220 seconds
> hbase(main):014:0>
>
> Not sure what else to try?  Is there a way to force removal of the table in
> question?  Is there something else I should be looking at?
>
> Marc
>


memcache size configurable?

2009-08-17 Thread Adam Silberstein
Hi,

The HBase architecture documentation says that the memcache size is
configurable.  What does it default to?  Is this an option I can set in
conf/hbase-site.xml?  If so, does anybody know the syntax?

 

Thanks,

Adam

 



Re: memcache size configurable?

2009-08-17 Thread stack
From hbase-default.xml:

  
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>67108864</value>
  <description>Memstore will be flushed to disk if size of the memstore
  exceeds this number of bytes.  Value is checked by a thread that runs
  every hbase.server.thread.wakefrequency.</description>
</property>

Our memstore used to be called memcache.
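
To answer the syntax question: yes, copy the property into your
conf/hbase-site.xml with whatever value you want and restart.  For example
(the 128MB value below is just an arbitrary illustration):

  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>134217728</value>
  </property>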

Why change it?
St.Ack



On Mon, Aug 17, 2009 at 2:42 PM, Adam Silberstein wrote:

> Hi,
>
> The HBase architecture documentation says that the memcache size is
> configurable.  What does it default to?  Is this an option I can set in
> conf/hbase-site.xml?  If so, does anybody know the syntax?
>
>
>
> Thanks,
>
> Adam
>
>
>
>


Hbase 0.20 example\manual

2009-08-17 Thread Alex Spodinets
Hello,

Could someone kindly point me to an example of HBase 0.20 API usage? All I
was able to find so far is a Map/Reduce example in the 0.20 SVN source.
It would also be good to have some info on how 0.20 should be installed,
especially ZooKeeper.

Thanks.


Re: Hbase 0.20 example\manual

2009-08-17 Thread Jonathan Gray

Look at the overview/summary in the javadocs.

I'm not sure if an official one has been posted yet, but you can check 
out the Getting Started guide here:


http://jgray.la/javadoc/hbase-0.20.0/overview-summary.html

And API examples here:

http://jgray.la/javadoc/hbase-0.20.0/org/apache/hadoop/hbase/client/package-summary.html

JG

Alex Spodinets wrote:

Hello,

Could someone kindly point me to an example of HBase 0.20 API usage? All I
was able to find so far is a Map/Reduce example in the 0.20 SVN source.
It would also be good to have some info on how 0.20 should be installed,
especially ZooKeeper.

Thanks.



Re: Hbase 0.20 example\manual

2009-08-17 Thread stack
Does this help?

http://people.apache.org/~stack/hbase-0.20.0-candidate-1/docs/api/overview-summary.html#overview_description

Includes sample client usage and all about zk + hbase.
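
If it helps while the docs settle, a bare-bones 0.20 client program looks
roughly like this (the table, family, and qualifier names are invented for
the example; treat it as a sketch rather than canonical usage):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ClientSketch {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(new HBaseConfiguration(), "testtable");
      // Write one cell.
      Put put = new Put(Bytes.toBytes("row1"));
      put.add(Bytes.toBytes("family"), Bytes.toBytes("qual"),
          Bytes.toBytes("value"));
      table.put(put);
      // Read it back.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("family"), Bytes.toBytes("qual"))));
      table.close();
    }
  }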

St.Ack


On Mon, Aug 17, 2009 at 3:57 PM, Alex Spodinets  wrote:

> Hello,
>
> Could someone kindly point me to an example of HBase 0.20 API usage? All I
> was able to find so far is a Map/Reduce example in the 0.20 SVN source.
> It would also be good to have some info on how 0.20 should be installed,
> especially ZooKeeper.
>
> Thanks.
>


Re: Hbase 0.20 example\manual

2009-08-17 Thread Alex Spodinets
exciting, thanks.

On Tue, Aug 18, 2009 at 2:05 AM, Jonathan Gray  wrote:

> Look at the overview/summary in the javadocs.
>
> I'm not sure if an official one has been posted yet, but you can check out
> the Getting Started guide here:
>
> http://jgray.la/javadoc/hbase-0.20.0/overview-summary.html
>
> And API examples here:
>
>
> http://jgray.la/javadoc/hbase-0.20.0/org/apache/hadoop/hbase/client/package-summary.html
>
> JG
>
>
> Alex Spodinets wrote:
>
>> Hello,
>>
>> Could someone kindly point me to an example of HBase 0.20 API usage? All I
>> was able to find so far is a Map/Reduce example in the 0.20 SVN source.
>> It would also be good to have some info on how 0.20 should be installed,
>> especially ZooKeeper.
>>
>> Thanks.
>>
>>


Re: HBase in a real world application

2009-08-17 Thread stack
Our writes were off by a factor of 7 or 8.  Writes should be better now
(HBASE-1771).
Thanks,
St.Ack


On Thu, Aug 13, 2009 at 4:53 PM, stack  wrote:

> I just tried it.  It seems slow to me writing too.  Let me take a look
> St.Ack
>
>
> On Thu, Aug 13, 2009 at 10:06 AM, llpind  wrote:
>
>>
>> Okay I changed replication to 2.  and removed "-XX:NewSize=6m
>> -XX:MaxNewSize=6m"
>>
>> here is results for randomWrite 3 clients:
>>
>>
>>
>> RandomWrite =
>>
>> hadoop-0.20.0/bin/hadoop jar hbase-0.20.0/hbase-0.20.0-test.jar
>>  --nomapred
>> randomWrite 3
>>
>>
>> 09/08/13 09:51:15 INFO hbase.PerformanceEvaluation: client-0 Start
>> randomWrite at offset 0 for 1048576 rows
>> 09/08/13 09:51:15 INFO hbase.PerformanceEvaluation: client-1 Start
>> randomWrite at offset 1048576 for 1048576 rows
>> 09/08/13 09:51:15 INFO hbase.PerformanceEvaluation: client-2 Start
>> randomWrite at offset 2097152 for 1048576 rows
>> 09/08/13 09:51:47 INFO hbase.PerformanceEvaluation: client-0
>> 0/104857/1048576
>> 09/08/13 09:51:48 INFO hbase.PerformanceEvaluation: client-1
>> 1048576/1153427/2097152
>> 09/08/13 09:51:48 INFO hbase.PerformanceEvaluation: client-2
>> 2097152/2201997/3145728
>> 09/08/13 09:52:22 INFO hbase.PerformanceEvaluation: client-1
>> 1048576/1258284/2097152
>> 09/08/13 09:52:23 INFO hbase.PerformanceEvaluation: client-0
>> 0/209714/1048576
>> 09/08/13 09:52:24 INFO hbase.PerformanceEvaluation: client-2
>> 2097152/2306854/3145728
>> 09/08/13 09:52:47 INFO hbase.PerformanceEvaluation: client-1
>> 1048576/1363141/2097152
>> 09/08/13 09:52:58 INFO hbase.PerformanceEvaluation: client-0
>> 0/314571/1048576
>> 09/08/13 09:52:58 INFO hbase.PerformanceEvaluation: client-2
>> 2097152/2411711/3145728
>> 09/08/13 09:53:24 INFO hbase.PerformanceEvaluation: client-1
>> 1048576/1467998/2097152
>> 09/08/13 09:53:27 INFO hbase.PerformanceEvaluation: client-0
>> 0/419428/1048576
>> 09/08/13 09:53:27 INFO hbase.PerformanceEvaluation: client-2
>> 2097152/2516568/3145728
>> 09/08/13 09:53:48 INFO hbase.PerformanceEvaluation: client-1
>> 1048576/1572855/2097152
>> 09/08/13 09:54:08 INFO hbase.PerformanceEvaluation: client-2
>> 2097152/2621425/3145728
>> 09/08/13 09:54:10 INFO hbase.PerformanceEvaluation: client-0
>> 0/524285/1048576
>> 09/08/13 09:54:40 INFO hbase.PerformanceEvaluation: client-1
>> 1048576/1677712/2097152
>> 09/08/13 09:54:49 INFO hbase.PerformanceEvaluation: client-2
>> 2097152/2726282/3145728
>> 09/08/13 09:54:52 INFO hbase.PerformanceEvaluation: client-0
>> 0/629142/1048576
>> 09/08/13 09:55:57 INFO hbase.PerformanceEvaluation: client-1
>> 1048576/1782569/2097152
>> 09/08/13 09:56:21 INFO hbase.PerformanceEvaluation: client-2
>> 2097152/2831139/3145728
>> 09/08/13 09:56:41 INFO hbase.PerformanceEvaluation: client-0
>> 0/733999/1048576
>> 09/08/13 09:57:23 INFO hbase.PerformanceEvaluation: client-1
>> 1048576/1887426/2097152
>> 09/08/13 09:58:40 INFO hbase.PerformanceEvaluation: client-2
>> 2097152/2935996/3145728
>> 09/08/13 09:58:54 INFO hbase.PerformanceEvaluation: client-0
>> 0/838856/1048576
>> 09/08/13 10:00:29 INFO hbase.PerformanceEvaluation: client-1
>> 1048576/1992283/2097152
>> 09/08/13 10:01:01 INFO hbase.PerformanceEvaluation: client-2
>> 2097152/3040853/3145728
>> 09/08/13 10:01:24 INFO hbase.PerformanceEvaluation: client-0
>> 0/943713/1048576
>> 09/08/13 10:02:36 INFO hbase.PerformanceEvaluation: client-1
>> 1048576/2097140/2097152
>> 09/08/13 10:02:37 INFO hbase.PerformanceEvaluation: client-1 Finished
>> randomWrite in 680674ms at offset 1048576 for 1048576 rows
>> 09/08/13 10:02:37 INFO hbase.PerformanceEvaluation: Finished 1 in 680674ms
>> writing 1048576 rows
>> 09/08/13 10:03:19 INFO hbase.PerformanceEvaluation: client-2
>> 2097152/3145710/3145728
>> 09/08/13 10:03:20 INFO hbase.PerformanceEvaluation: client-2 Finished
>> randomWrite in 723771ms at offset 2097152 for 1048576 rows
>> 09/08/13 10:03:20 INFO hbase.PerformanceEvaluation: Finished 2 in 723771ms
>> writing 1048576 rows
>> 09/08/13 10:03:41 INFO hbase.PerformanceEvaluation: client-0
>> 0/1048570/1048576
>> 09/08/13 10:03:42 INFO hbase.PerformanceEvaluation: client-0 Finished
>> randomWrite in 746054ms at offset 0 for 1048576 rows
>> 09/08/13 10:03:42 INFO hbase.PerformanceEvaluation: Finished 0 in 746054ms
>> writing 1048576 rows
>>
>>
>>
>> 
>>
>> Still pretty slow.  Any other ideas?  I'm running the client from the
>> master
>> box, but its not running any regionServers or datanodes.
>>
>> stack-3 wrote:
>> >
>> > Your config. looks fine.
>> >
>> > Only thing that gives me pause is:
>> >
>> > "-XX:NewSize=6m -XX:MaxNewSize=6m"
>> >
>> > Any reason for the above?
>> >
>> > If you study your GC logs, lots of pauses?
>> >
>> > Oh, and this: replication is set to 6.  Why 6?  Each write must commit
>> > to 6 datanodes before complete.  In the tests posted on wiki, we
>> > replicate to 3 nodes.
>> >
>> > In end of t

Re: HBase in a real world application

2009-08-17 Thread Jeff Hammerbacher
Hey Stack,

I notice that the patch for this issue doesn't include any sort of tests
that might have caught this regression. Do you guys have an HBaseBench,
HBaseMix, or similarly named tool for catching performance regressions?

Thanks,
Jeff

On Mon, Aug 17, 2009 at 4:51 PM, stack  wrote:

> Our writes were off by a factor of 7 or 8.  Writes should be better now
> (HBASE-1771).
> Thanks,
> St.Ack
>
>
> On Thu, Aug 13, 2009 at 4:53 PM, stack  wrote:
>
> > I just tried it.  It seems slow to me writing too.  Let me take a
> look
> > St.Ack
> >
> >
> > On Thu, Aug 13, 2009 at 10:06 AM, llpind  wrote:
> >
> >>
> >> Okay I changed replication to 2.  and removed "-XX:NewSize=6m
> >> -XX:MaxNewSize=6m"
> >>
> >> here is results for randomWrite 3 clients:
> >>
> >>
> >>
> >> RandomWrite =
> >>
> >> hadoop-0.20.0/bin/hadoop jar hbase-0.20.0/hbase-0.20.0-test.jar
> >>  --nomapred
> >> randomWrite 3
> >>
> >>
> >> 09/08/13 09:51:15 INFO hbase.PerformanceEvaluation: client-0 Start
> >> randomWrite at offset 0 for 1048576 rows
> >> 09/08/13 09:51:15 INFO hbase.PerformanceEvaluation: client-1 Start
> >> randomWrite at offset 1048576 for 1048576 rows
> >> 09/08/13 09:51:15 INFO hbase.PerformanceEvaluation: client-2 Start
> >> randomWrite at offset 2097152 for 1048576 rows
> >> 09/08/13 09:51:47 INFO hbase.PerformanceEvaluation: client-0
> >> 0/104857/1048576
> >> 09/08/13 09:51:48 INFO hbase.PerformanceEvaluation: client-1
> >> 1048576/1153427/2097152
> >> 09/08/13 09:51:48 INFO hbase.PerformanceEvaluation: client-2
> >> 2097152/2201997/3145728
> >> 09/08/13 09:52:22 INFO hbase.PerformanceEvaluation: client-1
> >> 1048576/1258284/2097152
> >> 09/08/13 09:52:23 INFO hbase.PerformanceEvaluation: client-0
> >> 0/209714/1048576
> >> 09/08/13 09:52:24 INFO hbase.PerformanceEvaluation: client-2
> >> 2097152/2306854/3145728
> >> 09/08/13 09:52:47 INFO hbase.PerformanceEvaluation: client-1
> >> 1048576/1363141/2097152
> >> 09/08/13 09:52:58 INFO hbase.PerformanceEvaluation: client-0
> >> 0/314571/1048576
> >> 09/08/13 09:52:58 INFO hbase.PerformanceEvaluation: client-2
> >> 2097152/2411711/3145728
> >> 09/08/13 09:53:24 INFO hbase.PerformanceEvaluation: client-1
> >> 1048576/1467998/2097152
> >> 09/08/13 09:53:27 INFO hbase.PerformanceEvaluation: client-0
> >> 0/419428/1048576
> >> 09/08/13 09:53:27 INFO hbase.PerformanceEvaluation: client-2
> >> 2097152/2516568/3145728
> >> 09/08/13 09:53:48 INFO hbase.PerformanceEvaluation: client-1
> >> 1048576/1572855/2097152
> >> 09/08/13 09:54:08 INFO hbase.PerformanceEvaluation: client-2
> >> 2097152/2621425/3145728
> >> 09/08/13 09:54:10 INFO hbase.PerformanceEvaluation: client-0
> >> 0/524285/1048576
> >> 09/08/13 09:54:40 INFO hbase.PerformanceEvaluation: client-1
> >> 1048576/1677712/2097152
> >> 09/08/13 09:54:49 INFO hbase.PerformanceEvaluation: client-2
> >> 2097152/2726282/3145728
> >> 09/08/13 09:54:52 INFO hbase.PerformanceEvaluation: client-0
> >> 0/629142/1048576
> >> 09/08/13 09:55:57 INFO hbase.PerformanceEvaluation: client-1
> >> 1048576/1782569/2097152
> >> 09/08/13 09:56:21 INFO hbase.PerformanceEvaluation: client-2
> >> 2097152/2831139/3145728
> >> 09/08/13 09:56:41 INFO hbase.PerformanceEvaluation: client-0
> >> 0/733999/1048576
> >> 09/08/13 09:57:23 INFO hbase.PerformanceEvaluation: client-1
> >> 1048576/1887426/2097152
> >> 09/08/13 09:58:40 INFO hbase.PerformanceEvaluation: client-2
> >> 2097152/2935996/3145728
> >> 09/08/13 09:58:54 INFO hbase.PerformanceEvaluation: client-0
> >> 0/838856/1048576
> >> 09/08/13 10:00:29 INFO hbase.PerformanceEvaluation: client-1
> >> 1048576/1992283/2097152
> >> 09/08/13 10:01:01 INFO hbase.PerformanceEvaluation: client-2
> >> 2097152/3040853/3145728
> >> 09/08/13 10:01:24 INFO hbase.PerformanceEvaluation: client-0
> >> 0/943713/1048576
> >> 09/08/13 10:02:36 INFO hbase.PerformanceEvaluation: client-1
> >> 1048576/2097140/2097152
> >> 09/08/13 10:02:37 INFO hbase.PerformanceEvaluation: client-1 Finished
> >> randomWrite in 680674ms at offset 1048576 for 1048576 rows
> >> 09/08/13 10:02:37 INFO hbase.PerformanceEvaluation: Finished 1 in
> 680674ms
> >> writing 1048576 rows
> >> 09/08/13 10:03:19 INFO hbase.PerformanceEvaluation: client-2
> >> 2097152/3145710/3145728
> >> 09/08/13 10:03:20 INFO hbase.PerformanceEvaluation: client-2 Finished
> >> randomWrite in 723771ms at offset 2097152 for 1048576 rows
> >> 09/08/13 10:03:20 INFO hbase.PerformanceEvaluation: Finished 2 in
> 723771ms
> >> writing 1048576 rows
> >> 09/08/13 10:03:41 INFO hbase.PerformanceEvaluation: client-0
> >> 0/1048570/1048576
> >> 09/08/13 10:03:42 INFO hbase.PerformanceEvaluation: client-0 Finished
> >> randomWrite in 746054ms at offset 0 for 1048576 rows
> >> 09/08/13 10:03:42 INFO hbase.PerformanceEvaluation: Finished 0 in
> 746054ms
> >> writing 1048576 rows
> >>
> >>
> >>
> >> 
> >>
> >> Still pretty slow.  Any other ideas?  I'm running the client f

Re: HBase in a real world application

2009-08-17 Thread Jonathan Gray

That would be PerformanceEvaluation :)

Jeff Hammerbacher wrote:

Hey Stack,

I notice that the patch for this issue doesn't include any sort of tests
that might have caught this regression. Do you guys have an HBaseBench,
HBaseMix, or similarly named tool for catching performance regressions?

Thanks,
Jeff

On Mon, Aug 17, 2009 at 4:51 PM, stack  wrote:


Our writes were off by a factor of 7 or 8.  Writes should be better now
(HBASE-1771).
Thanks,
St.Ack


On Thu, Aug 13, 2009 at 4:53 PM, stack  wrote:


I just tried it.  It seems slow to me writing too.  Let me take a look
St.Ack


On Thu, Aug 13, 2009 at 10:06 AM, llpind  wrote:


Okay I changed replication to 2.  and removed "-XX:NewSize=6m
-XX:MaxNewSize=6m"

here is results for randomWrite 3 clients:



RandomWrite =

hadoop-0.20.0/bin/hadoop jar hbase-0.20.0/hbase-0.20.0-test.jar
 --nomapred
randomWrite 3


09/08/13 09:51:15 INFO hbase.PerformanceEvaluation: client-0 Start
randomWrite at offset 0 for 1048576 rows
09/08/13 09:51:15 INFO hbase.PerformanceEvaluation: client-1 Start
randomWrite at offset 1048576 for 1048576 rows
09/08/13 09:51:15 INFO hbase.PerformanceEvaluation: client-2 Start
randomWrite at offset 2097152 for 1048576 rows
09/08/13 09:51:47 INFO hbase.PerformanceEvaluation: client-0
0/104857/1048576
09/08/13 09:51:48 INFO hbase.PerformanceEvaluation: client-1
1048576/1153427/2097152
09/08/13 09:51:48 INFO hbase.PerformanceEvaluation: client-2
2097152/2201997/3145728
09/08/13 09:52:22 INFO hbase.PerformanceEvaluation: client-1
1048576/1258284/2097152
09/08/13 09:52:23 INFO hbase.PerformanceEvaluation: client-0
0/209714/1048576
09/08/13 09:52:24 INFO hbase.PerformanceEvaluation: client-2
2097152/2306854/3145728
09/08/13 09:52:47 INFO hbase.PerformanceEvaluation: client-1
1048576/1363141/2097152
09/08/13 09:52:58 INFO hbase.PerformanceEvaluation: client-0
0/314571/1048576
09/08/13 09:52:58 INFO hbase.PerformanceEvaluation: client-2
2097152/2411711/3145728
09/08/13 09:53:24 INFO hbase.PerformanceEvaluation: client-1
1048576/1467998/2097152
09/08/13 09:53:27 INFO hbase.PerformanceEvaluation: client-0
0/419428/1048576
09/08/13 09:53:27 INFO hbase.PerformanceEvaluation: client-2
2097152/2516568/3145728
09/08/13 09:53:48 INFO hbase.PerformanceEvaluation: client-1
1048576/1572855/2097152
09/08/13 09:54:08 INFO hbase.PerformanceEvaluation: client-2
2097152/2621425/3145728
09/08/13 09:54:10 INFO hbase.PerformanceEvaluation: client-0
0/524285/1048576
09/08/13 09:54:40 INFO hbase.PerformanceEvaluation: client-1
1048576/1677712/2097152
09/08/13 09:54:49 INFO hbase.PerformanceEvaluation: client-2
2097152/2726282/3145728
09/08/13 09:54:52 INFO hbase.PerformanceEvaluation: client-0
0/629142/1048576
09/08/13 09:55:57 INFO hbase.PerformanceEvaluation: client-1
1048576/1782569/2097152
09/08/13 09:56:21 INFO hbase.PerformanceEvaluation: client-2
2097152/2831139/3145728
09/08/13 09:56:41 INFO hbase.PerformanceEvaluation: client-0
0/733999/1048576
09/08/13 09:57:23 INFO hbase.PerformanceEvaluation: client-1
1048576/1887426/2097152
09/08/13 09:58:40 INFO hbase.PerformanceEvaluation: client-2
2097152/2935996/3145728
09/08/13 09:58:54 INFO hbase.PerformanceEvaluation: client-0
0/838856/1048576
09/08/13 10:00:29 INFO hbase.PerformanceEvaluation: client-1
1048576/1992283/2097152
09/08/13 10:01:01 INFO hbase.PerformanceEvaluation: client-2
2097152/3040853/3145728
09/08/13 10:01:24 INFO hbase.PerformanceEvaluation: client-0
0/943713/1048576
09/08/13 10:02:36 INFO hbase.PerformanceEvaluation: client-1
1048576/2097140/2097152
09/08/13 10:02:37 INFO hbase.PerformanceEvaluation: client-1 Finished
randomWrite in 680674ms at offset 1048576 for 1048576 rows
09/08/13 10:02:37 INFO hbase.PerformanceEvaluation: Finished 1 in 680674ms
writing 1048576 rows
09/08/13 10:03:19 INFO hbase.PerformanceEvaluation: client-2
2097152/3145710/3145728
09/08/13 10:03:20 INFO hbase.PerformanceEvaluation: client-2 Finished
randomWrite in 723771ms at offset 2097152 for 1048576 rows
09/08/13 10:03:20 INFO hbase.PerformanceEvaluation: Finished 2 in 723771ms
writing 1048576 rows
09/08/13 10:03:41 INFO hbase.PerformanceEvaluation: client-0
0/1048570/1048576
09/08/13 10:03:42 INFO hbase.PerformanceEvaluation: client-0 Finished
randomWrite in 746054ms at offset 0 for 1048576 rows
09/08/13 10:03:42 INFO hbase.PerformanceEvaluation: Finished 0 in 746054ms
writing 1048576 rows





Still pretty slow.  Any other ideas?  I'm running the client from the
master
box, but its not running any regionServers or datanodes.

stack-3 wrote:

Your config. looks fine.

Only thing that gives me pause is:

"-XX:NewSize=6m -XX:MaxNewSize=6m"

Any reason for the above?

If you study your GC logs, lots of pauses?

Oh, and this: replication is set to 6.  Why 6?  Each write must commit
to 6 datanodes before complete.  In the tests posted on wiki, we replicate
to 3 nodes.

In end of this message you

Re: HBase in a real world application

2009-08-17 Thread stack
On Mon, Aug 17, 2009 at 4:54 PM, Jeff Hammerbacher wrote:

> Hey Stack,
>
> I notice that the patch for this issue doesn't include any sort of tests
> that might have caught this regression. Do you guys have an HBaseBench,
> HBaseMix, or similarly named tool for catching performance regressions?
>

Not as part of our build.  The way it's currently done is that near release,
we run our little PerformanceEvaluation doohickey.  If it's way off, crack
the profiler.

We have been trying to get some of the hadoop allotment of EC2 time so we
could set up a regular run up on AWS but no luck so far.

Good on you, Jeff,
St.Ack







>
> Thanks,
> Jeff
>
> On Mon, Aug 17, 2009 at 4:51 PM, stack  wrote:
>
> > Our writes were off by a factor of 7 or 8.  Writes should be better now
> > (HBASE-1771).
> > Thanks,
> > St.Ack
> >
> >
> > On Thu, Aug 13, 2009 at 4:53 PM, stack  wrote:
> >
> > > I just tried it.  It seems slow to me writing too.  Let me take a
> > look
> > > St.Ack
> > >
> > >
> > > On Thu, Aug 13, 2009 at 10:06 AM, llpind 
> wrote:
> > >
> > >>
> > >> Okay I changed replication to 2.  and removed "-XX:NewSize=6m
> > >> -XX:MaxNewSize=6m"
> > >>
> > >> here is results for randomWrite 3 clients:
> > >>
> > >>
> > >>
> > >> RandomWrite =
> > >>
> > >> hadoop-0.20.0/bin/hadoop jar hbase-0.20.0/hbase-0.20.0-test.jar
> > >>  --nomapred
> > >> randomWrite 3
> > >>
> > >>
> > >> 09/08/13 09:51:15 INFO hbase.PerformanceEvaluation: client-0 Start
> > >> randomWrite at offset 0 for 1048576 rows
> > >> 09/08/13 09:51:15 INFO hbase.PerformanceEvaluation: client-1 Start
> > >> randomWrite at offset 1048576 for 1048576 rows
> > >> 09/08/13 09:51:15 INFO hbase.PerformanceEvaluation: client-2 Start
> > >> randomWrite at offset 2097152 for 1048576 rows
> > >> 09/08/13 09:51:47 INFO hbase.PerformanceEvaluation: client-0
> > >> 0/104857/1048576
> > >> 09/08/13 09:51:48 INFO hbase.PerformanceEvaluation: client-1
> > >> 1048576/1153427/2097152
> > >> 09/08/13 09:51:48 INFO hbase.PerformanceEvaluation: client-2
> > >> 2097152/2201997/3145728
> > >> 09/08/13 09:52:22 INFO hbase.PerformanceEvaluation: client-1
> > >> 1048576/1258284/2097152
> > >> 09/08/13 09:52:23 INFO hbase.PerformanceEvaluation: client-0
> > >> 0/209714/1048576
> > >> 09/08/13 09:52:24 INFO hbase.PerformanceEvaluation: client-2
> > >> 2097152/2306854/3145728
> > >> 09/08/13 09:52:47 INFO hbase.PerformanceEvaluation: client-1
> > >> 1048576/1363141/2097152
> > >> 09/08/13 09:52:58 INFO hbase.PerformanceEvaluation: client-0
> > >> 0/314571/1048576
> > >> 09/08/13 09:52:58 INFO hbase.PerformanceEvaluation: client-2
> > >> 2097152/2411711/3145728
> > >> 09/08/13 09:53:24 INFO hbase.PerformanceEvaluation: client-1
> > >> 1048576/1467998/2097152
> > >> 09/08/13 09:53:27 INFO hbase.PerformanceEvaluation: client-0
> > >> 0/419428/1048576
> > >> 09/08/13 09:53:27 INFO hbase.PerformanceEvaluation: client-2
> > >> 2097152/2516568/3145728
> > >> 09/08/13 09:53:48 INFO hbase.PerformanceEvaluation: client-1
> > >> 1048576/1572855/2097152
> > >> 09/08/13 09:54:08 INFO hbase.PerformanceEvaluation: client-2
> > >> 2097152/2621425/3145728
> > >> 09/08/13 09:54:10 INFO hbase.PerformanceEvaluation: client-0
> > >> 0/524285/1048576
> > >> 09/08/13 09:54:40 INFO hbase.PerformanceEvaluation: client-1
> > >> 1048576/1677712/2097152
> > >> 09/08/13 09:54:49 INFO hbase.PerformanceEvaluation: client-2
> > >> 2097152/2726282/3145728
> > >> 09/08/13 09:54:52 INFO hbase.PerformanceEvaluation: client-0
> > >> 0/629142/1048576
> > >> 09/08/13 09:55:57 INFO hbase.PerformanceEvaluation: client-1
> > >> 1048576/1782569/2097152
> > >> 09/08/13 09:56:21 INFO hbase.PerformanceEvaluation: client-2
> > >> 2097152/2831139/3145728
> > >> 09/08/13 09:56:41 INFO hbase.PerformanceEvaluation: client-0
> > >> 0/733999/1048576
> > >> 09/08/13 09:57:23 INFO hbase.PerformanceEvaluation: client-1
> > >> 1048576/1887426/2097152
> > >> 09/08/13 09:58:40 INFO hbase.PerformanceEvaluation: client-2
> > >> 2097152/2935996/3145728
> > >> 09/08/13 09:58:54 INFO hbase.PerformanceEvaluation: client-0
> > >> 0/838856/1048576
> > >> 09/08/13 10:00:29 INFO hbase.PerformanceEvaluation: client-1
> > >> 1048576/1992283/2097152
> > >> 09/08/13 10:01:01 INFO hbase.PerformanceEvaluation: client-2
> > >> 2097152/3040853/3145728
> > >> 09/08/13 10:01:24 INFO hbase.PerformanceEvaluation: client-0
> > >> 0/943713/1048576
> > >> 09/08/13 10:02:36 INFO hbase.PerformanceEvaluation: client-1
> > >> 1048576/2097140/2097152
> > >> 09/08/13 10:02:37 INFO hbase.PerformanceEvaluation: client-1 Finished
> > >> randomWrite in 680674ms at offset 1048576 for 1048576 rows
> > >> 09/08/13 10:02:37 INFO hbase.PerformanceEvaluation: Finished 1 in
> > 680674ms
> > >> writing 1048576 rows
> > >> 09/08/13 10:03:19 INFO hbase.PerformanceEvaluation: client-2
> > >> 2097152/3145710/3145728
> > >> 09/08/13 10:03:20 INFO hbase.PerformanceEvaluation: client-2 Finished
> > >> randomWrite in 723771ms at offset 20

Re: NoServerForRegionException, TableNotFoundException and WrongRegionException

2009-08-17 Thread Marc Limotte
Regions seem to be reasonably dispersed... as of now... not sure if that was
true before I reset hbase.hregion.max.filesize.


Region Servers
Address        Start Code       Load
host1:60020    1250533702083    requests=0, regions=10, usedHeap=32, maxHeap=888
host2:60020    1250533702094    requests=0, regions=12, usedHeap=32, maxHeap=888
host3:60020    1250533702052    requests=0, regions=7, usedHeap=31, maxHeap=888
host4:60020    1250533702078    requests=0, regions=11, usedHeap=32, maxHeap=888
Total: servers: 4  requests=0, regions=40

Marc



> -- Forwarded message --
> From: stack 
> To: hbase-user@hadoop.apache.org
> Date: Mon, 17 Aug 2009 14:21:47 -0700
> Subject: Re: NoServerForRegionException, TableNotFoundException and
> WrongRegionException
> Please update to the head of 0.19 trunk, or better update to 0.20 trunk --
> especially if you are testing.  Issues described below have been addressed.
>
> How many regions do you have in your table?  Are all going to one
> regionserver because you only have one region?
>
> Yours,
> St.Ack
>
>
> On Mon, Aug 17, 2009 at 12:19 PM, Marc Limotte 
> wrote:
>
> > I'm seeing a nice variety of Exceptions from HBase and could use some
> > pointers about what to do next.
> >
> > This is a new map/reduce program, updating about 550k rows with around a
> > dozen columns on a very small cluster (only 4 nodes... as we're still
> > testing and it doesn't have to support production yet).  Hbase Version
> > 0.19.1.
> >
> > I ran the job and it seems to make some progress, and then dies after
> > several hours, reporting "NoServerForRegionException: No server address
> > listed in .META. for region TABLEX,,1250526695078".  I retried it a few
> > times with the same result.  I also noticed that the load is not well
> > balanced, all requests seemed to be going to one node.  I adjusted
> > hadoop-site.xml with the addition of these two entries:
> >
> > <property>
> >   <name>hbase.hregion.max.filesize</name>
> >   <value>33554432</value>
> > </property>
> > <property>
> >   <name>hbase.client.retries.number</name>
> >   <value>5</value>
> > </property>
> >
> > And restarted hbase (and hadoop to be safe).  Re-ran and got the same
> error
> > in the M/R job.
> >
> > *I thought I'd try dropping the table, since it's a new table and I can
> > recreate it.  But that gives another exception:
> > *
> > hbase(main):002:0> disable 'TABLEX'
> > NativeException: org.apache.hadoop.hbase.TableNotFoundException:
> > org.apache.hadoop.hbase.TableNotFoundException: TABLEX
> >at
> >
> >
> org.apache.hadoop.hbase.master.TableOperation$ProcessTableOperation.call(TableOperation.java:129)
> >at
> >
> >
> org.apache.hadoop.hbase.master.TableOperation$ProcessTableOperation.call(TableOperation.java:70)
> >at
> >
> >
> org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:64)
> >at
> >
> >
> org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:143)
> >at
> org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:691)
> > ...
> >
> >
> > *And now I see this exception in the HBase logs:
> > *
> > org.apache.hadoop.hbase.regionserver.WrongRegionException:
> > org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row
> > out
> > of range for HRegion .META.,,1250280235390, startKey='',
> > getEndKey()='TABLEX,,1250219949252',
> > row='TABLEX,840.56098.0544,1250526661861'
> >at
> > org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:1788)
> >at
> >
> >
> org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1844)
> >at
> > org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:1912)
> >at
> >
> org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1244)
> >at
> >
> org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1216)
> > ...
> >
> >
> > *As a test, tried a "count"...
> > *
> > hbase(main):007:0* count 'TABLEX'
> > NativeException:
> org.apache.hadoop.hbase.client.NoServerForRegionException:
> > No server address listed in .META. for region TABLEX,,1250526695078
> >from org/apache/hadoop/hbase/client/HConnectionManager.java:548:in
> > `locateRegionInMeta'
> >from org/apache/hadoop/hbase/client/HConnectionManager.java:478:in
> > `locateRegion'
> >from org/apache/hadoop/hbase/client/HConnectionManager.java:440:in
> > `locateRegion'
> >from org/apache/hadoop/hbase/client/HTable.java:114:in `<init>'
> >from org/apache/hadoop/hbase/client/HTable.java:97:in `<init>'
> >from sun/reflect/NativeConstructorAccessorImpl.java:-2:in
> `newInstance0'
> > ...
> >
> >
> > *Also saw a thread somewhere that suggested doing a major compaction.
>  Did
> > that.  It returns almost immediately.  Not sure if that's normal or
> not...
> > no perceivable impact from doing this, though.*
> >
> > hbase(main):013:0> major_compact '.META.'
> > 0 row(s) in 0.0220 seconds
> > hbase(main):014:0>
> >
> > Not sure what else to try?  Is there a way to force removal of the table
> in
> > question?  Is there somet

Re: HBase in a real world application

2009-08-17 Thread Jeff Hammerbacher
Thanks guys. For the lazy (e.g. me) and future searchers, here are some
links. The benchmark is meant to simulate the same performance tests quoted
in Google's BigTable paper.

* PerformanceEvaluation wiki page:
http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation
* PerformanceEvaluation.java:
http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/test/org/apache/hadoop/hbase/PerformanceEvaluation.java?view=co

Thanks,
Jeff

On Mon, Aug 17, 2009 at 5:09 PM, stack  wrote:

> On Mon, Aug 17, 2009 at 4:54 PM, Jeff Hammerbacher  >wrote:
>
> > Hey Stack,
> >
> > I notice that the patch for this issue doesn't include any sort of tests
> > that might have caught this regression. Do you guys have an HBaseBench,
> > HBaseMix, or similarly named tool for catching performance regressions?
> >
>
> Not as part of our build.  The way it's currently done is that near release,
> we run our little PerformanceEvaluation doohickey.  If it's way off, crack
> the profiler.
>
> We have been trying to get some of the hadoop allotment of EC2 time so we
> could set up a regular run up on AWS but no luck so far.
>
> Good on you, Jeff,
> St.Ack
>
>
>
>
>
>
>
> >
> > Thanks,
> > Jeff
> >
> > On Mon, Aug 17, 2009 at 4:51 PM, stack  wrote:
> >
> > > Our writes were off by a factor of 7 or 8.  Writes should be better now
> > > (HBASE-1771).
> > > Thanks,
> > > St.Ack
> > >
> > >
> > > On Thu, Aug 13, 2009 at 4:53 PM, stack  wrote:
> > >
> > > > I just tried it.  It seems slow to me writing too.  Let me take a
> > > look
> > > > St.Ack
> > > >
> > > >
> > > > On Thu, Aug 13, 2009 at 10:06 AM, llpind 
> > wrote:
> > > >
> > > >>
> > > >> Okay I changed replication to 2.  and removed "-XX:NewSize=6m
> > > >> -XX:MaxNewSize=6m"
> > > >>
> > > >> here is results for randomWrite 3 clients:
> > > >>
> > > >>
> > > >>
> > > >> RandomWrite =
> > > >>
> > > >> hadoop-0.20.0/bin/hadoop jar hbase-0.20.0/hbase-0.20.0-test.jar
> > > >>  --nomapred
> > > >> randomWrite 3
> > > >>
> > > >>
> > > >> 09/08/13 09:51:15 INFO hbase.PerformanceEvaluation: client-0 Start
> > > >> randomWrite at offset 0 for 1048576 rows
> > > >> 09/08/13 09:51:15 INFO hbase.PerformanceEvaluation: client-1 Start
> > > >> randomWrite at offset 1048576 for 1048576 rows
> > > >> 09/08/13 09:51:15 INFO hbase.PerformanceEvaluation: client-2 Start
> > > >> randomWrite at offset 2097152 for 1048576 rows
> > > >> 09/08/13 09:51:47 INFO hbase.PerformanceEvaluation: client-0
> > > >> 0/104857/1048576
> > > >> 09/08/13 09:51:48 INFO hbase.PerformanceEvaluation: client-1
> > > >> 1048576/1153427/2097152
> > > >> 09/08/13 09:51:48 INFO hbase.PerformanceEvaluation: client-2
> > > >> 2097152/2201997/3145728
> > > >> 09/08/13 09:52:22 INFO hbase.PerformanceEvaluation: client-1
> > > >> 1048576/1258284/2097152
> > > >> 09/08/13 09:52:23 INFO hbase.PerformanceEvaluation: client-0
> > > >> 0/209714/1048576
> > > >> 09/08/13 09:52:24 INFO hbase.PerformanceEvaluation: client-2
> > > >> 2097152/2306854/3145728
> > > >> 09/08/13 09:52:47 INFO hbase.PerformanceEvaluation: client-1
> > > >> 1048576/1363141/2097152
> > > >> 09/08/13 09:52:58 INFO hbase.PerformanceEvaluation: client-0
> > > >> 0/314571/1048576
> > > >> 09/08/13 09:52:58 INFO hbase.PerformanceEvaluation: client-2
> > > >> 2097152/2411711/3145728
> > > >> 09/08/13 09:53:24 INFO hbase.PerformanceEvaluation: client-1
> > > >> 1048576/1467998/2097152
> > > >> 09/08/13 09:53:27 INFO hbase.PerformanceEvaluation: client-0
> > > >> 0/419428/1048576
> > > >> 09/08/13 09:53:27 INFO hbase.PerformanceEvaluation: client-2
> > > >> 2097152/2516568/3145728
> > > >> 09/08/13 09:53:48 INFO hbase.PerformanceEvaluation: client-1
> > > >> 1048576/1572855/2097152
> > > >> 09/08/13 09:54:08 INFO hbase.PerformanceEvaluation: client-2
> > > >> 2097152/2621425/3145728
> > > >> 09/08/13 09:54:10 INFO hbase.PerformanceEvaluation: client-0
> > > >> 0/524285/1048576
> > > >> 09/08/13 09:54:40 INFO hbase.PerformanceEvaluation: client-1
> > > >> 1048576/1677712/2097152
> > > >> 09/08/13 09:54:49 INFO hbase.PerformanceEvaluation: client-2
> > > >> 2097152/2726282/3145728
> > > >> 09/08/13 09:54:52 INFO hbase.PerformanceEvaluation: client-0
> > > >> 0/629142/1048576
> > > >> 09/08/13 09:55:57 INFO hbase.PerformanceEvaluation: client-1
> > > >> 1048576/1782569/2097152
> > > >> 09/08/13 09:56:21 INFO hbase.PerformanceEvaluation: client-2
> > > >> 2097152/2831139/3145728
> > > >> 09/08/13 09:56:41 INFO hbase.PerformanceEvaluation: client-0
> > > >> 0/733999/1048576
> > > >> 09/08/13 09:57:23 INFO hbase.PerformanceEvaluation: client-1
> > > >> 1048576/1887426/2097152
> > > >> 09/08/13 09:58:40 INFO hbase.PerformanceEvaluation: client-2
> > > >> 2097152/2935996/3145728
> > > >> 09/08/13 09:58:54 INFO hbase.PerformanceEvaluation: client-0
> > > >> 0/838856/1048576
> > > >> 09/08/13 10:00:29 INFO hbase.PerformanceEvaluation: client-1
> > > >> 1048576/1992283/2097152
> > > >> 09/08/13 10:01:01 INFO hbase.PerformanceEvaluation:

Re: NoServerForRegionException, TableNotFoundException and WrongRegionException

2009-08-17 Thread Jonathan Gray

To reiterate what stack said, you need to upgrade.

There are serious, known bugs in 0.19.1.  Upgrade to 0.19 branch or 0.20 
branch, instructions can be found from this page:


http://hadoop.apache.org/hbase/version_control.html

For example,

svn co http://svn.apache.org/repos/asf/hadoop/hbase/branches/0.19/ hbase-0.19-branch
cd hbase-0.19-branch
ant jar


JG

Marc Limotte wrote:

Regions seem to be reasonably dispersed... as of now... not sure if that was
true before I reset hbase.hregion.max.filesize.


Region Servers
Address        Start Code       Load
host1:60020    1250533702083    requests=0, regions=10, usedHeap=32, maxHeap=888
host2:60020    1250533702094    requests=0, regions=12, usedHeap=32, maxHeap=888
host3:60020    1250533702052    requests=0, regions=7, usedHeap=31, maxHeap=888
host4:60020    1250533702078    requests=0, regions=11, usedHeap=32, maxHeap=888
Total: servers: 4  requests=0, regions=40

Marc




-- Forwarded message --
From: stack 
To: hbase-user@hadoop.apache.org
Date: Mon, 17 Aug 2009 14:21:47 -0700
Subject: Re: NoServerForRegionException, TableNotFoundException and
WrongRegionException
Please update to the head of 0.19 trunk, or better update to 0.20 trunk --
especially if you are testing.  Issues described below have been addressed.

How many regions do you have in your table?  Are all going to one
regionserver because you only have one region?

Yours,
St.Ack


On Mon, Aug 17, 2009 at 12:19 PM, Marc Limotte 
wrote:


I'm seeing a nice variety of Exceptions from HBase and could use some
pointers about what to do next.

This is a new map/reduce program, updating about 550k rows with around a
dozen columns on a very small cluster (only 4 nodes... as we're still
testing and it doesn't have to support production yet).  Hbase Version
0.19.1.

I ran the job and it seems to make some progress, and then dies after
several hours, reporting "NoServerForRegionException: No server address
listed in .META. for region TABLEX,,1250526695078".  I retried it a few
times with the same result.  I also noticed that the load is not well
balanced, all requests seemed to be going to one node.  I adjusted
hadoop-site.xml with the addition of these two entries:

<property>
  <name>hbase.hregion.max.filesize</name>
  <value>33554432</value>
</property>
<property>
  <name>hbase.client.retries.number</name>
  <value>5</value>
</property>

And restarted hbase (and hadoop to be safe).  Re-ran and got the same

error

in the M/R job.

*I thought I'd try dropping the table, since it's a new table and I can
recreate it.  But that gives another exception:
*
hbase(main):002:0> disable 'TABLEX'
NativeException: org.apache.hadoop.hbase.TableNotFoundException:
org.apache.hadoop.hbase.TableNotFoundException: TABLEX
   at



org.apache.hadoop.hbase.master.TableOperation$ProcessTableOperation.call(TableOperation.java:129)

   at



org.apache.hadoop.hbase.master.TableOperation$ProcessTableOperation.call(TableOperation.java:70)

   at



org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:64)

   at



org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:143)

   at

org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:691)

...


*And now I see this exception in the HBase logs:
*
org.apache.hadoop.hbase.regionserver.WrongRegionException:
org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row
out
of range for HRegion .META.,,1250280235390, startKey='',
getEndKey()='TABLEX,,1250219949252',
row='TABLEX,840.56098.0544,1250526661861'
   at
org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:1788)
   at



org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1844)

   at
org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:1912)
   at


org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1244)

   at


org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1216)

...


*As a test, tried a "count"...
*
hbase(main):007:0* count 'TABLEX'
NativeException:

org.apache.hadoop.hbase.client.NoServerForRegionException:

No server address listed in .META. for region TABLEX,,1250526695078
   from org/apache/hadoop/hbase/client/HConnectionManager.java:548:in
`locateRegionInMeta'
   from org/apache/hadoop/hbase/client/HConnectionManager.java:478:in
`locateRegion'
   from org/apache/hadoop/hbase/client/HConnectionManager.java:440:in
`locateRegion'
    from org/apache/hadoop/hbase/client/HTable.java:114:in `<init>'
    from org/apache/hadoop/hbase/client/HTable.java:97:in `<init>'
   from sun/reflect/NativeConstructorAccessorImpl.java:-2:in

`newInstance0'

...


*Also saw a thread somewhere that suggested doing a major compaction.

 Did

that.  It returns almost immediately.  Not sure if that's normal or

not...

no perceivable impact from doing this, though.*

hbase(main):013:0> major_compact '.META.'
0 row(s) in 0.0220 seconds
hbase(main):014:0>

Not sure what else to try?  Is there a way to force removal of the table

in

question?  Is there

Re: Hi, i am still puzzeld to find a row key by a cell.

2009-08-17 Thread Rocks
Thanks a lot. Maybe I will go with server-side filters.
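
For reference, a filtered scan looks roughly like the sketch below with the
0.19-era filter API (the "info:status"/"active" column and value are
invented, and in 0.20 the filter classes were reworked, so check the
org.apache.hadoop.hbase.filter javadocs for your version).  Note it is
still a full table scan; the filtering just happens in the region servers:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HConstants;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Scanner;
  import org.apache.hadoop.hbase.filter.ColumnValueFilter;
  import org.apache.hadoop.hbase.io.RowResult;
  import org.apache.hadoop.hbase.util.Bytes;

  public class WhereScan {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(new HBaseConfiguration(), "mytable");
      // Rough equivalent of: SELECT ... WHERE info:status = 'active'
      ColumnValueFilter filter = new ColumnValueFilter(
          Bytes.toBytes("info:status"),
          ColumnValueFilter.CompareOp.EQUAL,
          Bytes.toBytes("active"));
      Scanner scanner = table.getScanner(
          new byte[][] { Bytes.toBytes("info:status") },
          HConstants.EMPTY_START_ROW, filter);
      for (RowResult row : scanner) {
        System.out.println(Bytes.toString(row.getRow()));
      }
      scanner.close();
    }
  }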

On Tue, Aug 18, 2009 at 3:24 AM, Bradford Stephens <
bradfordsteph...@gmail.com> wrote:

> Just reiterating what JGray says -- usually, when you need a secondary
> index, you can get away with denormalizing and duplicating your data.
>
> On Mon, Aug 17, 2009 at 9:30 AM, Jonathan Gray wrote:
> > This is possible by checking the values in your client, through
> server-side
> > filters (see org.apache.hadoop.hbase.filter), or with secondary indexing
> (as
> > described in my previous e-mail).
> >
> > In the first two cases, you are doing a full table scan so it is very
> > inefficient.  That's the same as it would be in an RDBMS if you were not
> > using secondary indexes, however.
> >
> > HBase has limited secondary indexing support, but do not expect the
> > flexibility and performance you get from an RDBMS secondary index.
> >
> > If this is central to your usage of HBase, make sure that HBase is what
> you
> > want and take another look at your schema to see if there might be a
> better
> > design to prevent needing heavy indexing or table scanning.
> >
> > JG
> >
> > Rocks wrote:
> >>
> >> Thanks for your answer.
> >> However, what I want to do is just like the "where" keyword of SQL in
> >> an RDBMS. Is it impossible in hbase?
> >>
> >> On Mon, Aug 17, 2009 at 2:58 PM, Ryan Rawson 
> wrote:
> >>
> >>> hey,
> >>>
> >>> That isn't how hbase (or even an rdbms) works.  Instead you can retrieve
> >>> rows based on their row key. Otherwise you will have to read the
> >>> entire table to find just that 1 row. Yes, this is as inefficient as it
> >>> sounds.
> >>>
> >>> If you frequently have this issue, you may need to build and maintain
> >>> secondary indexes. Unlike relational dbs, there is no built in support
> >>> for this, you have to write your app to handle this.
> >>>
> >>> -ryan
> >>>
> >>>
> >>>
> >>> On Sun, Aug 16, 2009 at 11:51 PM, lei wang
> >>> wrote:
> 
>  Hi, if I know a cell in HBase by its column::value, I need to
>  know which row key it belongs to. I searched the HBase API several
>  times, but I cannot find the right method to solve my problem.
>  Thanks for any suggestions.
> 
> >>
> >
>
>
>
> --
> http://www.roadtofailure.com -- The Fringes of Scalability, Social
> Media, and Computer Science
>


ANN: hbase 0.20.0 Release Candidate 2 available for download

2009-08-17 Thread stack
The second hbase 0.20.0 release candidate is available for download:

http://people.apache.org/~stack/hbase-0.20.0-candidate-2/

Close to 450 issues have been addressed since the 0.19 branch.  Release notes
are available here: http://su.pr/18zcEO .

HBase 0.20.0 runs on Hadoop 0.20.0.  A lot has changed since 0.19.x, including
configuration fundamentals.  Be sure to read the 'Getting Started'
documentation in 0.20.0 available here: http://su.pr/8YQjHO

If you wish to bring your 0.19.x hbase data forward to 0.20.0, you will need
to run a migration.  See http://wiki.apache.org/hadoop/Hbase/HowToMigrate.
First read the overview and then go to the section, 'From 0.19.x to 0.20.x'.

Should we release this candidate as hbase 0.20.0?  Please vote +1/-1 by
Wednesday, August 25th.

Yours,
The HBasistas


Re: ANN: hbase 0.20.0 Release Candidate 2 available for download

2009-08-17 Thread Rocks
What is the difference between the two candidate versions?

On Tue, Aug 18, 2009 at 1:52 PM, stack  wrote:

> The second hbase 0.20.0 release candidate is available for download:
>
> http://people.apache.org/~stack/hbase-0.20.0-candidate-2/
>
> Close to 450 issues have been addressed since the 0.19 branch.  Release
> notes
> are available here: http://su.pr/18zcEO .
>
> HBase 0.20.0 runs on Hadoop 0.20.0.  A lot has changed since 0.19.x,
> including
> configuration fundamentals.  Be sure to read the 'Getting Started'
> documentation in 0.20.0 available here: http://su.pr/8YQjHO
>
> If you wish to bring your 0.19.x hbase data forward to 0.20.0, you will
> need
> to run a migration.  See http://wiki.apache.org/hadoop/Hbase/HowToMigrate.
> First read the overview and then go to the section, 'From 0.19.x to
> 0.20.x'.
>
> Should we release this candidate as hbase 0.20.0?  Please vote +1/-1 by
> Wednesday, August 25th.
>
> Yours,
> The HBasistas
>


Re: ANN: hbase 0.20.0 Release Candidate 2 available for download

2009-08-17 Thread Ryan Rawson
A few bugs were fixed, seemingly minor but critical.

enjoy!
-ryan

On Mon, Aug 17, 2009 at 11:16 PM, Rocks wrote:
> What is the difference between the two candidate versions?
>
> On Tue, Aug 18, 2009 at 1:52 PM, stack  wrote:
>
>> The second hbase 0.20.0 release candidate is available for download:
>>
>> http://people.apache.org/~stack/hbase-0.20.0-candidate-2/
>>
>> Close to 450 issues have been addressed since the 0.19 branch.  Release
>> notes
>> are available here: http://su.pr/18zcEO .
>>
>> HBase 0.20.0 runs on Hadoop 0.20.0.  A lot has changed since 0.19.x,
>> including
>> configuration fundamentals.  Be sure to read the 'Getting Started'
>> documentation in 0.20.0 available here: http://su.pr/8YQjHO
>>
>> If you wish to bring your 0.19.x hbase data forward to 0.20.0, you will
>> need
>> to run a migration.  See http://wiki.apache.org/hadoop/Hbase/HowToMigrate.
>> First read the overview and then go to the section, 'From 0.19.x to
>> 0.20.x'.
>>
>> Should we release this candidate as hbase 0.20.0?  Please vote +1/-1 by
>> Wednesday, August 25th.
>>
>> Yours,
>> The HBasistas
>>
>