Fwd: What should we expect from Hama Examples Rand?
Hi,

The 'RAND' example in hama-examples.jar is basically a simple M/R job that creates a table filled with random numbers. So, before running Hama, please check whether you are able to create tables via the HBase shell. Can one of the HBase developers help with this problem? Thanks

-- Forwarded message --
From: Ratner, Alan S (IS)
Date: Sat, Dec 12, 2009 at 1:32 AM
Subject: What should we expect from Hama Examples Rand?
To: hama-u...@incubator.apache.org

Having fixed the groomserver problem I did the following:
1) clean out /tmp files
2) format Hadoop namenode
3) start Hadoop
4) start HBase/Zookeeper
5) start Hama
6) launch Hama examples rand -m 10 -r 10 2000 2000 30.5% matrixA

The outcome is puzzling. A 3-second-long diary shows up in the Hama log file reporting nothing unusual, but the terminal reported problems with HBase. When I Googled this problem, all I saw was someone who had multiple versions of HBase on their system. I am using a fresh VM with Ubuntu 8.04, Hadoop 0.20.1, Zookeeper 3.2.1, HBase 0.20.2, Hama 0.2.0 and JDK 1.6.0_17. No older versions of anything were ever installed.

BTW: I thought the problem might be related to my running HBase in standalone mode, so I just switched to HBase pseudo-distributed mode, but I see the same problems. Any help appreciated.
-- Alan

Hama log
==
Fri Dec 11 09:23:03 EST 2009 Starting master on ngc
ulimit -n 1024
2009-12-11 09:23:05,512 INFO org.apache.hama.HamaMaster: STARTUP_MSG:
/
STARTUP_MSG: Starting HamaMaster
STARTUP_MSG:   host = ngc/127.0.1.1
STARTUP_MSG:   args = [start]
STARTUP_MSG:   version = 0.20.1
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1 -r 810220; compiled by 'oom' on Tue Sep 1 20:55:56 UTC 2009
/
2009-12-11 09:23:05,939 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=HamaMaster, port=4
2009-12-11 09:23:06,634 INFO org.apache.hama.HamaMaster: Cleaning up the system directory
2009-12-11 09:23:06,710 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2009-12-11 09:23:06,716 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 4: starting
2009-12-11 09:23:06,721 INFO org.apache.hama.HamaMaster: Starting RUNNING
2009-12-11 09:23:06,721 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 4: starting

Hama terminal (classpath excerpt; long jar listing condensed):
/hadoop-0.20.1-test.jar:/home/ngc/Desktop/hama/bin/../lib/hbase-0.20.0.jar:/home/ngc/Desktop/hama/bin/../lib/hbase-0.20.0-test.jar:[... jasper, javacc, jetty, jruby, json, junit, libthrift, log4j, servlet-api, xmlenc, zookeeper-3.2.1, jetty-ext and findbugs jars under /home/ngc/Desktop/hama/bin/../lib ...]:/home/ngc/Desktop/hadoop-0.20.1/conf:/home/ngc/Desktop/hbase-0.20.2/conf
09/12/11 09:25:12 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/home/ngc/Desktop/jre1.6.0_17/lib/i386/client:/home/ngc/Desktop/jre1.6.0_17/lib/i386:/home/ngc/Desktop/jre1.6.0_17/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib
09/12/11 09:25:12 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
09/12/11 09:25:12 INFO zookeeper.ZooKeeper: Client environment:java.compiler=
09/12/11 09:25:12 INFO zookeeper.ZooKeeper: Client environment:os.na[truncated]
Re: HBase Utility functions (for Java 5+)
On Tue, Dec 15, 2009 at 1:03 AM, stack wrote:
> HBase requires java 6 (1.6) or above.
> St.Ack
>
> On Mon, Dec 14, 2009 at 7:41 PM, Paul Smith wrote:
>
>> Just wondering if anyone knows of an existing HBase utility library that is
>> open sourced that can assist those that have Java 5 and above. I'm starting
>> off in HBase, and thinking it'd be great to have API calls similar to the
>> Google Collections framework. If one doesn't exist, I think I could start
>> off a new project in Google Code (ASL it). I think HBase is targeted <
>> Java 5, so can't take advantage of this yet internally.
>>
>> The sorts of API functions I thought would be useful to make code more
>> readable would be something like:
>>
>>     HTable hTable = new TableBuilder(hbaseConfiguration).withTableName("foo")
>>         .withSimpleColumnFamilies("bar", "eek", "moo").deleteAndRecreate();
>>
>> and
>>
>>     ResultScanner scanner = new ResultScannerBuilder(hTable).withColumnFamilies(
>>         "family1", "family2").build();
>>
>> taking advantage of varargs liberally and using nice patterns etc. While
>> the Bytes class is useful, I'd personally benefit from an API that can
>> easily pack arbitrary multiple ints (and other data types) together into
>> byte[] for RowKeyGoodness(tm) a la:
>>
>>     byte[] rowKey = BytePacker.pack(fooId, barId, eekId, mooId);
>>
>> (behind the scenes this is a varargs method that recursively packs each
>> into byte[] via Bytes.add(byte[] b1, byte[] b2) etc.)
>>
>> If anyone knows of a library that does this, pointers please.
>>
>> cheers,
>>
>> Paul

I could see this being very useful. My first barrier to HBase was trying to figure out how to turn what I knew of as an SQL select clause into a set of HBase server-side filters. Mostly, I pieced this together with help from the list and the test cases. That could be frustrating for some. Now that I am used to it, I notice that the HBase way is actually much cleaner and much less code.

So, yes, a helper library is a great thing. As part of the "proof of concept" I am working on, large sections of it are mostly descriptions of doing things like column projections in both SQL and HBase with filters. So I think both are very helpful for making HBase more attractive to an end user.
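Paul's BytePacker idea can be sketched in plain Java. (BytePacker is his hypothetical name, not an existing HBase class; this sketch uses java.nio.ByteBuffer rather than Bytes.add, but produces the same big-endian layout, which keeps HBase's lexicographic row-key ordering consistent with numeric order for non-negative ids.)

```java
import java.nio.ByteBuffer;

public class BytePacker {
    // Hypothetical helper: pack several ints into one row key.
    // Big-endian byte order means that for non-negative ids, sorting the
    // byte[] lexicographically matches sorting the ids numerically.
    public static byte[] pack(int... ids) {
        ByteBuffer buf = ByteBuffer.allocate(4 * ids.length);
        for (int id : ids) {
            buf.putInt(id); // ByteBuffer defaults to big-endian
        }
        return buf.array();
    }
}
```

Recursively concatenating with Bytes.add(byte[], byte[]), as Paul describes, would yield the same byte layout; a single pre-sized buffer just avoids the intermediate arrays.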
Re: HBase Utility functions (for Java 5+)
This definitely seems to be a common initial hurdle, though I think each project comes at it with its own specific needs. There are a variety of frameworks or libraries you can check out on the Supporting Projects page: http://wiki.apache.org/hadoop/SupportingProjects

In my case, I wanted a simple object -> hbase mapping layer that would take care of the boilerplate work of persistence and provide a slightly higher-level API for queries. It's been open-sourced on github:

http://github.com/ghelmling/meetup.beeno

It still really only accounts for my project's needs -- we're serving realtime requests from our web site and not currently doing any MR processing. But if it could be of use, I could always use feedback on how to evolve it. :)

Some of the other projects listed on the wiki page are doubtless more mature, so they may meet your needs as well. If none of them are quite what you're looking for, then there's always room for another!

--gh

On Tue, Dec 15, 2009 at 10:39 AM, Edward Capriolo wrote:
> [snip -- quoted text of the earlier messages in this thread]
Re: HBase Utility functions (for Java 5+)
On Tue, Dec 15, 2009 at 11:04 AM, Gary Helmling wrote:
> [snip -- quoted text of the earlier messages in this thread]

All interesting. In a sense, I believe you should learn to walk before you can run :). It is hard to troubleshoot how an ORM mapper is working if you are basically clueless about the HBase API.

You know when lots of user tools get pulled into the mix:
q: How do I only get column X?
a: You need a Spring-injectable, Grails, RESTful ORM mapper that is only found in git, but there are like 4 forks of it, so pick this one :)
Re: Performance related question
Thanks J-D & Motohiko for the tips. Significant improvement in performance, but there's still room for more. In my local pseudo-distributed mode the 2 map reduce jobs now run in less than 4 minutes (from 32 mins), and on a cluster of 10 nodes + 5 zk nodes they run in 11 minutes (down from 1 hour & 30 mins). But I would still like to get to a point where they run faster on a cluster than on my local machine.

Here's what I did:

1) Fixed a bug in my code that was causing unnecessary writes to HBase.
2) Added these two lines after creating 'new HTable':
   table.setAutoFlush(false);
   table.setWriteBufferSize(1024*1024*12);
3) Added this line after Put:
   put.setWriteToWAL(false);
4) Added this line (only when running on cluster):
   job.setNumReduceTasks(20);

There are other 64-bit-related improvements which I cannot try, mainly because Amazon charges (way) too much for 64-bit machines. It costs me over $25 for 15 machines for less than 3 hours, so I switched to 'm1.small' 32-bit machines. Of course, one of the promises of distributed computing is that we will be able to use "cheap commodity hardware", right :) So I would like to stick with 'm1.small' for now. (But I am willing to use about 30 machines if that's going to help.)

Anyway, I have noticed that one of my Mappers is taking too long. If anyone would share ideas on how to improve Mapper speed, that would be greatly appreciated. Basically, in this Mapper I read about 50,000 rows from an HBase table using TableMapReduceUtil.initTableMapperJob() and do some complex processing on the "values" of each row. I don't write anything back to HBase, but I do write quite a few lines (context.write()) to HDFS. Any suggestions?

Thanks once again for the help.

2009/12/13
> Hello,
>
> Something Something wrote:
> > PS: One thing I have noticed is that it goes to 66% very fast and then
> > slows down from there..
>
> It seems that only one reducer works. You should increase reduce tasks.
> The default reduce task number is documented in hadoop/docs/mapred-default.html.
> The default value of mapred.reduce.tasks is 1, so only one reduce task runs.
>
> There are two ways to increase reduce tasks:
> 1. Use Job.setNumReduceTasks(int tasks) in your MapReduce job file.
> 2. Set a higher mapred.reduce.tasks in hadoop/conf/mapred-site.xml.
>
> You can get the best performance if you run 20 reduce tasks. The details on the
> number of reduce tasks are written at
> http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Reducer
> under "How many Reduces?", as J-D wrote. Notice that JobConf.setNumReduceTasks(int)
> is already deprecated, so you should use Job.setNumReduceTasks(int tasks) rather
> than JobConf.setNumReduceTasks(int).
> --
> Motohiko Mouri
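The "How many Reduces?" guidance from the Hadoop tutorial linked above can be written as a small helper. (The class and method names here are hypothetical; the 0.95 and 1.75 factors are the ones the tutorial suggests, applied to nodes times per-node reduce slots.)

```java
public class ReduceTaskHeuristic {
    // Heuristic from the Hadoop MapReduce tutorial ("How many Reduces?"):
    // roughly 0.95 * (nodes * reduce slots per node) so all reduces launch
    // in a single wave, or 1.75 * (...) for better load balancing among
    // faster and slower nodes.
    public static int suggestedReduces(int nodes, int reduceSlotsPerNode, double factor) {
        return (int) Math.floor(factor * nodes * reduceSlotsPerNode);
    }

    public static void main(String[] args) {
        // A 10-node cluster with the default of 2 reduce slots per node:
        System.out.println(suggestedReduces(10, 2, 0.95)); // 19
        System.out.println(suggestedReduces(10, 2, 1.75)); // 35
    }
}
```

The result would then be passed to Job.setNumReduceTasks(int), as Motohiko recommends; for the 10-node cluster in this thread that lands near the 20 reducers mentioned above.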
Re: HBase Utility functions (for Java 5+)
I completely agree with the need to understand both the fundamental HBase API and how HBase stores data at a low level. Both are very important in knowing how to structure your data for best performance, which you should figure out before moving on to other niceties.

As far as the actual data storage goes, Lars George did a really informative write-up: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

And of course there's the HBase Architecture doc: http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture and the Google BigTable paper.

On Tue, Dec 15, 2009 at 11:21 AM, Edward Capriolo wrote:
> [snip -- quoted text of the earlier messages in this thread]
Re: HBase Utility functions (for Java 5+)
On Tue, Dec 15, 2009 at 9:21 AM, Gary Helmling wrote:
> I completely agree with the need to understand both the fundamental HBase
> API, and how HBase stores data at a low level. Both are very important in
> knowing how to structure your data for best performance. Which you should
> figure out before moving on to other niceties.

On the other hand, forcing the user to understand the details of how data is stored, instead of presenting them with a well-abstracted API, makes the learning curve steeper. These kinds of cleaner APIs would be a good way to prevent the standard situation of one engineer on the team figuring out HBase, then others saying "why is this so complicated" and writing an internal set of wrappers and utility methods. This wouldn't solve the problem for people who want a full ORM, but I think there's an in-between sweet spot that abstracts away byte[] but still exposes column families and such.
Re: running unit test based on HBaseClusterTestCase
Do you have hadoop jars in your eclipse classpath?
Stack

On Dec 14, 2009, at 10:58 PM, Guohua Hao wrote:

Hello All,

In my own application, I have a unit test case which extends HBaseClusterTestCase in order to test some of my operations over an HBase cluster. I override the setUp function in my own test case, and this setUp function begins with a super.setUp() call. When I try to run my unit test from within Eclipse, I get the following error:

java.lang.NoSuchMethodError: org.apache.hadoop.security.UserGroupInformation.setCurrentUser(Lorg/apache/hadoop/security/UserGroupInformation;)V
        at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:236)
        at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:119)
        at org.apache.hadoop.hbase.HBaseClusterTestCase.setUp(HBaseClusterTestCase.java:123)

I included hadoop-0.20.1-core.jar in my classpath, since this jar file contains the org.apache.hadoop.security.UserGroupInformation class. Could anybody give me some hints on how to solve this problem?

Thank you very much,
Guohua
Re: Fwd: What should we expect from Hama Examples Rand?
You are missing some supporting jar.

> java.io.IOException: java.io.IOException: java.lang.NullPointerException
>         at java.lang.Class.searchMethods(Unknown Source)

Note that the exception is in a JVM method (java.lang.Class.searchMethods). This is not really an HBase problem per se, but very likely a classpath issue. It looks to me like the JVM cannot (transitively) load a class to handle the RPC. Are all of the supporting jars for Hadoop and HBase on the classpath? It seems at least one is not.

- Andy

From: Edward J. Yoon
To: hbase-user@hadoop.apache.org
Cc: alan.rat...@ngc.com
Sent: Tue, December 15, 2009 1:07:02 AM
Subject: Fwd: What should we expect from Hama Examples Rand?

[snip -- quoted text of the forwarded message and logs, same as earlier in this thread]
hlogs do not get cleared
We're running a 13-node HBase cluster. We had some problems a week ago with it being overloaded and errors related to not being able to find a block on HDFS, but adding four more nodes and increasing max heap from 3GB to 4.5GB on all nodes fixed those problems.

Looking at the logs now, though, we see that HLogs are not getting removed:

2009-12-15 01:45:48,426 INFO org.apache.hadoop.hbase.regionserver.HLog: Roll /hbase/.logs/mi-prod-app33,60020,1260495617070/hlog.dat.1260867136036, entries=210524, calcsize=63757422, filesize=41073798. New hlog /hbase/.logs/mi-prod-app33,60020,1260495617070/hlog.dat.1260870348421
2009-12-15 01:45:48,427 INFO org.apache.hadoop.hbase.regionserver.HLog: Too many hlogs: logs=130, maxlogs=96; forcing flush of region with oldest edits: articles-article-id,f15489ea-38a4-4127-9179-1b2dc5f3b5d4,1260083783909
2009-12-15 01:57:14,188 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on region articles,\x00\x00\x01\x25\x8C\x0F\xCB\x18\xB5U\xF7\xC6\x5DoH\xB8\x98\xEBH,E\x7C\x07\x14,1260830133341
2009-12-15 01:57:17,519 INFO org.apache.hadoop.hbase.regionserver.HLog: Roll /hbase/.logs/mi-prod-app33,60020,1260495617070/hlog.dat.1260870348421, entries=92795, calcsize=63908073, filesize=54042783. New hlog /hbase/.logs/mi-prod-app33,60020,1260495617070/hlog.dat.1260871037510
2009-12-15 01:57:17,519 INFO org.apache.hadoop.hbase.regionserver.HLog: Too many hlogs: logs=131, maxlogs=96; forcing flush of region with oldest edits: articles-article-id,f1cd1b02-3d1b-453c-b44f-94ec5a1e3a46,1260007536878

From reading the log message, I interpret this as saying that every time it rolls an hlog, if there are more than maxlogs logs, it will flush one region. I'm assuming that a log could have edits for multiple regions, so this seems to mean that if we have 100 regions and maxlogs set to 96, and it flushes one region each time it rolls a log, it will create 100 logs before it flushes all regions and is able to delete a log, so it will reach steady state at 196 hlogs. Is this correct?

We're concerned because when we had problems last week, we saw lots of log messages related to "Too many hlogs" and had assumed they were related to the problems. Is this anything to worry about?
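Kevin's reasoning above can be checked with a toy model. This is entirely hypothetical code, not HBase internals: every roll creates a log dirtied by all regions, each roll past maxlogs force-flushes one region (round-robin, standing in for "oldest edits"), and a log becomes deletable once every region with edits in it has flushed. Under those assumptions the live-log count stays bounded by roughly maxlogs + regions.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

public class HlogModel {
    // Toy model of HLog rolling -- NOT HBase code, just the reasoning in
    // the message above. Each live log tracks which regions still have
    // unflushed edits in it; a log can be deleted once that set is empty.
    public static int liveLogsAfter(int regions, int maxlogs, int rolls) {
        ArrayDeque<Set<Integer>> logs = new ArrayDeque<Set<Integer>>();
        int nextFlush = 0; // flush regions round-robin ("oldest edits")
        for (int t = 0; t < rolls; t++) {
            // Roll: the new log receives edits from every region.
            Set<Integer> dirty = new HashSet<Integer>();
            for (int r = 0; r < regions; r++) dirty.add(r);
            logs.addLast(dirty);
            if (logs.size() > maxlogs) {
                // "Too many hlogs": force-flush one region, which un-pins
                // its edits in every live log.
                for (Set<Integer> log : logs) log.remove(nextFlush);
                nextFlush = (nextFlush + 1) % regions;
            }
            // Drop leading logs whose edits have all been flushed.
            while (!logs.isEmpty() && logs.peekFirst().isEmpty()) {
                logs.removeFirst();
            }
        }
        return logs.size();
    }
}
```

Running this with regions=100, maxlogs=96 shows the count peaking around the maxlogs + regions figure estimated above before settling lower, so a count well above maxlogs is expected behavior in this model rather than a leak.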
Re: hlogs do not get cleared
Kevin,

"Too many hlogs" means that the inserts are hitting a lot of regions, and that those regions aren't filled enough to flush on their own, so HBase has to force-flush them to make room. When you added region servers, it spread the region load so that hlogs were getting filled at a slower rate.

Could you tell us more about the rate of insertion, the size of the data, and the number of regions per region server?

Thx,

J-D

On Tue, Dec 15, 2009 at 10:34 AM, Kevin Peterson wrote:
> [snip -- quoted text of the previous message]
Re: Performance related question
Given that m1.small has 1 CPU, 1.7GB of RAM and 1/8 (or less) the IO of the host machine, and factoring in that those machines are networked as a whole, I expect it to be much, much slower than your local machine. Those machines are so under-powered that the overhead of hadoop/hbase probably overwhelms any gain from the total number of nodes. Instead do this: - Replace all your m1.small with m1.large at a ratio of 4:1. - Don't give ZK their own machine, in such a small environment it doesn't make much sense. (give them their own EBS maybe) - Use an ensemble of only 3 peers. - Give HBase plenty of RAM like 4GB. WRT your mappers, make sure you use scanner pre-fetching. In your job setup set hbase.client.scanner.caching to something like 30. J-D On Tue, Dec 15, 2009 at 9:14 AM, Something Something wrote: > Thanks J-D & Motohiko for the tips. Significant improvement in performance, > but there's still room for improvement. In my local pseudo distributed mode > the 2 map reduce jobs now run in less than 4 minutes (from 32 mins) and in > cluster of 10 nodes + 5 zk nodes they run in 11 minutes (down from 1 hour & > 30 mins). But still I would like to come to a point where they run faster > on a cluster than on my local machine. > > Here's what I did: > > 1) Fixed a bug in my code that was causing unnecessary writes to HBase. > 2) Added these two lines after creating 'new HTable': > table.setAutoFlush(false); > table.setWriteBufferSize(1024*1024*12); > 3) Added this line after Put: > put.setWriteToWAL(false); > 4) Added this line (only when running on cluster): > job.setNumReduceTasks(20); > > There are other 64-bit related improvements which I cannot try; mainly > because Amazon charges (way) too much for 64-bit machines. It costs me over > $25 for 15 machines for less than 3 hours, so I switched to 'm1.small' > 32-bit machines. 
Of course, one of the promises of the distributed > computing is that we will be able to use "cheap commodity hardware", right > :) So I would like to stick with 'm1.small' for now. (But I am willing to > use about 30 machines if that's going to help.) > > Anyway, I have noticed that one of my Mappers is taking too long. If anyone > would share ideas of how to improve Mapper speed, that would be greatly > appreciated. Basically, in this Mapper I read about 50,000 rows from a > HBase table using TableMapReduceUtil.initTableMapperJob() and do some > complex processing for "values" of each row. I don't write anything back in > HBase, but I do write quite a few lines (context.write()) to HDFS. Any > suggestions? > > Thanks once again for the help. > > > > 2009/12/13 > >> Hello, >> >> Something Something wrote: >> > PS: One thing I have noticed is that it goes to 66% very fast and then >> > slows down from there.. >> >> It seems that only one reducer works. You should increase reduce tasks. >> The default reduce task's number is written on >> hadoop/docs/mapred-default.html. >> The default parameter of mapred.reduce.tasks is 1. So only one reduce task >> runs. >> >> There are two ways to increase reduce tasks: >> 1. Use Job.setNumReduceTasks(int tasks) on your MapReduce job file. >> 2. Denote more mapred.reduce.tasks on hadoop/conf/mapred-site.xml. >> >> You can get the best perfomance if you run 20 reduce tasks. The detail of >> the number >> of reduce tasks is written on >> http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Reducer >> at "How many Reduces?" as J-D wrote. Notice that >> JobConf.setNumReduceTasks(int) is >> already deprecated, so you should use Job.setNumReduceTasks(int tasks) >> rather than >> JobConf.setNumReduceTasks(int). >> -- >> Motohiko Mouri >> >
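Steps 2 and 3 above mostly pay off by cutting the number of client round trips per write. A toy back-of-the-envelope model of that effect (plain Java, not the HBase client API; the 500-byte average Put size is an assumed figure for illustration):

```java
// Toy arithmetic for HBase client-side write buffering (plain Java,
// not the HBase API). Average Put size below is an assumed figure.
public class WriteBufferModel {
    static long rpcCount(long numPuts, long avgPutBytes, long bufferBytes) {
        if (bufferBytes <= avgPutBytes) {
            return numPuts; // autoflush behavior: one RPC per Put
        }
        long putsPerFlush = bufferBytes / avgPutBytes;
        // each buffer flush is one RPC; round up for the final partial flush
        return (numPuts + putsPerFlush - 1) / putsPerFlush;
    }

    public static void main(String[] args) {
        long puts = 1000000, avg = 500;
        System.out.println("autoflush:   " + rpcCount(puts, avg, avg) + " RPCs");
        System.out.println("12MB buffer: " + rpcCount(puts, avg, 12L * 1024 * 1024) + " RPCs");
    }
}
```

The same idea applies on the read side: hbase.client.scanner.caching controls how many rows each scanner round trip fetches (by default a scanner fetches one row per trip), which is why J-D suggests setting it to around 30 for the mappers.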
Re: running unit test based on HBaseClusterTestCase
Yes, I included all the necessary jar files I think. I guess my problem is probably related to my eclipse setup. I can create a MiniDFSCluster object by running my application in command line (e.g., bin/hadoop myApplicationClass) , and a MiniDFSCluster object is created inside the main function of myApplicationClass. But I can NOT run this program within eclipse, probably I did not do it in the right way. I got a similar error message saying java.lang.NoSuchMethodError: org.apache.hadoop.security. > > > UserGroupInformation.setCurrentUser(Lorg/apache/hadoop/security/UserGroupInformation;)V > at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:236) > at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:119) > Could you guys please give me more hints? Thanks Guohua On Tue, Dec 15, 2009 at 11:59 AM, Stack wrote: > Do you have hadoop jars in your eclipse classpath? > Stack > > > > > On Dec 14, 2009, at 10:58 PM, Guohua Hao wrote: > > Hello All, >> >> In my own application, I have a unit test case which extends >> HBaseClusterTestCase in order to test some of my operation over HBase >> cluster. I override the setup function in my own test case, and this setup >> function begins with super.setup() function call. >> >> When I try to run my unit test from within Eclipse, I got the following >> error: >> >> java.lang.NoSuchMethodError: >> >> org.apache.hadoop.security.UserGroupInformation.setCurrentUser(Lorg/apache/hadoop/security/UserGroupInformation;)V >> at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:236) >> at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:119) >> at >> >> org.apache.hadoop.hbase.HBaseClusterTestCase.setUp(HBaseClusterTestCase.java:123) >> >> I included the hadoop-0.20.1-core.jar in my classpath, since this jar file >> contains the org.apache.hadoop.security.UserGroupInformation class. >> >> Could anybody give me some hint on how to solve this problem? >> >> Thank you very much, >> Guohua >> >
Re: Performance related question
Btw, nothing says that ZK users (incl hbase) _must_ run a multi-node ZK ensemble. For coordination tasks a single ZK server (standalone mode) is often sufficient, you just need to realize you are sacrificing reliability/availability. Going from 1 -> 3 -> 5 -> 7 ZK servers in an ensemble should primarily be driven by reliability requirements. See this page for details on performance studies I've made for standalone and 3 server ZK ensembles: http://bit.ly/4ekN8G Patrick Jean-Daniel Cryans wrote: Given that m1.small has 1 CPU, 1.7GB of RAM and 1/8 (or less) the IO of the host machine and counting in the fact that those machines are networked as a whole I expect it to much much slower that your local machine. Those machines are so under-powered that the overhead of hadoop/hbase probably overwhelms any gain from the total number of nodes. Instead do this: - Replace all your m1.small with m1.large in a factor of 4:1. - Don't give ZK their own machine, in such a small environment it doesn't make much sense. (give them their own EBS maybe) - Use an ensemble of only 3 peers. - Give HBase plenty of RAM like 4GB. WRT your mappers, make sure you use scanner pre-fetching. In your job setup set hbase.client.scanner.caching to something like 30. J-D On Tue, Dec 15, 2009 at 9:14 AM, Something Something wrote: Thanks J-D & Mtohiko for the tips. Significant improvement in performance, but there's still room for improvement. In my local pseudo distributed mode the 2 map reduce jobs now run in less than 4 minutes (from 32 mins) and in cluster of 10 nodes + 5 zk nodes they run in 11 minutes (down from 1 hour & 30 mins). But still I would like to come to a point where they run faster on a cluster than on my local machine. Here's what I did: 1) Fixed a bug in my code that was causing unnecessary writes to HBase. 
2) Added these two lines after creating 'new HTable': table.setAutoFlush(false); table.setWriteBufferSize(1024*1024*12); 3) Added this line after Put: put.setWriteToWAL(false); 4) Added this line (only when running on cluster): job.setNumReduceTasks(20); There are other 64-bit related improvements which I cannot try; mainly because Amazon charges (way) too much for 64-bit machines. It costs me over $25 for 15 machines for less than 3 hours, so I switched to 'm1.small' 32-bit machines. Of course, one of the promises of the distributed computing is that we will be able to use "cheap commodity hardware", right :) So I would like to stick with 'm1.small' for now. (But I am willing to use about 30 machines if that's going to help.) Anyway, I have noticed that one of my Mappers is taking too long. If anyone would share ideas of how to improve Mapper speed, that would be greatly appreciated. Basically, in this Mapper I read about 50,000 rows from a HBase table using TableMapReduceUtil.initTableMapperJob() and do some complex processing for "values" of each row. I don't write anything back in HBase, but I do write quite a few lines (context.write()) to HDFS. Any suggestions? Thanks once again for the help. 2009/12/13 Hello, Something Something wrote: PS: One thing I have noticed is that it goes to 66% very fast and then slows down from there.. It seems that only one reducer works. You should increase reduce tasks. The default reduce task's number is written on hadoop/docs/mapred-default.html. The default parameter of mapred.reduce.tasks is 1. So only one reduce task runs. There are two ways to increase reduce tasks: 1. Use Job.setNumReduceTasks(int tasks) on your MapReduce job file. 2. Denote more mapred.reduce.tasks on hadoop/conf/mapred-site.xml. You can get the best perfomance if you run 20 reduce tasks. The detail of the number of reduce tasks is written on http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Reducer at "How many Reduces?" as J-D wrote. 
Notice that JobConf.setNumReduceTasks(int) is already deprecated, so you should use Job.setNumReduceTasks(int tasks) rather than JobConf.setNumReduceTasks(int). -- Motohiko Mouri
Re: running unit test based on HBaseClusterTestCase
Order can be important. Don't forget to include conf directories. Below is from an eclipse .classpath that seems to work for me: St.Ack On Tue, Dec 15, 2009 at 11:21 AM, Guohua Hao wrote: > Yes, I included all the necessary jar files I think. I guess my problem is > probably related to my eclipse setup. > > I can create a MiniDFSCluster object by running my application in command > line (e.g., bin/hadoop myApplicationClass) , and a MiniDFSCluster object is > created inside the main function of myApplicationClass. But I can NOT run > this program within eclipse, probably I did not do it in the right way. I > got the similar error message saying > > java.lang.NoSuchMethodError: > org.apache.hadoop.security. > > > > > > > UserGroupInformation.setCurrentUser(Lorg/apache/hadoop/security/UserGroupInformation;)V > > at > org.apache.hadoop.hdfs.MiniDFSCluster.(MiniDFSCluster.java:236) > > at > org.apache.hadoop.hdfs.MiniDFSCluster.(MiniDFSCluster.java:119) > > > > Could you guys please give me more hint? > > Thanks > Guohua > > > > On Tue, Dec 15, 2009 at 11:59 AM, Stack wrote: > > > Do you have hadoop jars in your eclipse classpath? > > Stack > > > > > > > > > > On Dec 14, 2009, at 10:58 PM, Guohua Hao wrote: > > > > Hello All, > >> > >> In my own application, I have a unit test case which extends > >> HBaseClusterTestCase in order to test some of my operation over HBase > >> cluster. I override the setup function in my own test case, and this > setup > >> function begins with super.setup() function call. 
> >> > >> When I try to run my unit test from within Eclipse, I got the following > >> error: > >> > >> java.lang.NoSuchMethodError: > >> > >> > org.apache.hadoop.security.UserGroupInformation.setCurrentUser(Lorg/apache/hadoop/security/UserGroupInformation;)V > >> at > org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:236) > >> at > org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:119) > >> at > >> > >> > org.apache.hadoop.hbase.HBaseClusterTestCase.setUp(HBaseClusterTestCase.java:123) > >> > >> I included the hadoop-0.20.1-core.jar in my classpath, since this jar > file > >> contains the org.apache.hadoop.security.UserGroupInformation class. > >> > >> Could anybody give me some hint on how to solve this problem? > >> > >> Thank you very much, > >> Guohua > >> > > >
Re: hlogs do not get cleared
I'd advise setting the upper limit for WALs back down to 32 rather than the 96 you have. Lets figure why old logs are not being cleared up even if only 32. When 96, it means that on crash, the log splitting process has more logs to process (~96 rather than ~32). It'll take longer for the split process to run and therefore longer for the regions to come back on line. Is this the state of things across all regionservers or just one or two? As J-D asks, your loading profile, how many regions per regionserver would be of interest. Next up would be your putting up a regionserver log that we could pull and look at. We'd check the edit sequence numbers to figure why we're not letting logs go. Thanks Kevin, St.Ack On Tue, Dec 15, 2009 at 10:34 AM, Kevin Peterson wrote: > We're running a 13 node HBase cluster. We had some problems a week ago with > it being overloaded and errors related to not being able to find a block on > HDFS, but adding four more nodes and increasing max heap from 3GB to 4.5GB > on all nodes fixed any problems. > > Looking at the logs now, though, we see that HLogs are not getting removed: > > 2009-12-15 01:45:48,426 INFO org.apache.hadoop.hbase.regionserver.HLog: > Roll > /hbase/.logs/mi-prod-app33,60020,1260495617070/hlog.dat.1260867136036, > entries=210524, calcsize=63757422, filesize=41073798. 
New hlog > /hbase/.logs/mi-prod-app33,60020,1260495617070/hlog.dat.1260870348421 > 2009-12-15 01:45:48,427 INFO org.apache.hadoop.hbase.regionserver.HLog: Too > many hlogs: logs=130, maxlogs=96; forcing flush of region with oldest > edits: > articles-article-id,f15489ea-38a4-4127-9179-1b2dc5f3b5d4,1260083783909 > 2009-12-15 01:57:14,188 INFO org.apache.hadoop.hbase.regionserver.HRegion: > Starting compaction on region > > articles,\x00\x00\x01\x25\x8C\x0F\xCB\x18\xB5U\xF7\xC6\x5DoH\xB8\x98\xEBH,E\x7C\x07\x14,1260830133341 > 2009-12-15 01:57:17,519 INFO org.apache.hadoop.hbase.regionserver.HLog: > Roll > /hbase/.logs/mi-prod-app33,60020,1260495617070/hlog.dat.1260870348421, > entries=92795, calcsize=63908073, filesize=54042783. New hlog > /hbase/.logs/mi-prod-app33,60020,1260495617070/hlog.dat.1260871037510 > 2009-12-15 01:57:17,519 INFO org.apache.hadoop.hbase.regionserver.HLog: Too > many hlogs: logs=131, maxlogs=96; forcing flush of region with oldest > edits: > articles-article-id,f1cd1b02-3d1b-453c-b44f-94ec5a1e3a46,1260007536878 > > From reading the log message, I interpret this as saying that every time it > rolls an hlog, if there are more than maxlogs logs, it will flush one > region. I'm assuming that a log could have edits for multiple regions, so > this seems to mean that if we have 100 regions and maxlogs set to 96, if it > flushes one region each time it rolls a log, it will create 100 logs before > it flushes all regions and is able to delete the log, so it will reach > steady state at 196 hlogs. Is this correct? > > We're concerned because when we had problems last week, we saw lots of log > messages related to "Too many hlogs" and had assumed they were related to > the problems. Is this anything to worry about? >
Re: Help on HBase shell alter command usage
Hi, I saw the following from scan 'crawltable' command in hbase shell: ... com.onsoft.www:http/column=stt:, timestamp=1260405530801, value=\003 3 row(s) in 0.2490 seconds How do I query the value for stt column ? hbase(main):005:0> get 'crawltable', 'com.onsoft.www:http/', { column='stt:' } SyntaxError: (hbase):6: odd number list for Hash. from (hbase):6 Can someone explain this 'odd number' error ? Thanks On Mon, Dec 14, 2009 at 10:16 PM, stack wrote: > Are you using hbase 0.20? If so, there is no 'compress'. Its NONE, LZO, > or > GZIP (You'll have to build lzo yourself. See hbase wiki for how). > > See the shell help. It has examples of how to change parameters on column > families. > > St.Ack > > 2009/12/14 Xin Jing > > > Hi all, > > > > I want to change the column family property for a existing hbase table. > > Setting one comlumn family COMPRESSION from 'none' to comress, and chagne > > one column family IN_MEMORY from 'false' to 'true'. > > > > I want to use hbase shell to achieve that, but I cannot find the detailed > > description on 'alter' command. Could anyone point me to a reference on > > that? > > > > Thanks > > - Xin > > >
Re: HBase Utility functions (for Java 5+)
On Tue, Dec 15, 2009 at 9:56 AM, Kevin Peterson wrote: > These kinds of cleaner APIs would be a good way to prevent the standard > situation of one engineer on the team figuring out HBase, then others say > "why is this so complicated" so they write an internal set of wrappers and > utility methods. > This wouldn't solve the problems for people who want a full ORM, but I think > there's an in-between sweet spot that abstracts away byte[] but still > exposes column families and such. > What do fellas think of Lars' George's genercizing (sp? word?) of the client API? See his patch up in https://issues.apache.org/jira/browse/HBASE-1990. Would this be enough? St.Ack
Re: Help on HBase shell alter command usage
Try: hbase(main):005:0> get 'crawltable', 'com.onsoft.www:http/', { COLUMNS => 'stt:'} i.e. '=>' rather than '='. Also, its COLUMNS (uppercase I believe) rather than column. Run 'help' in the shell for help and examples. St.Ack On Tue, Dec 15, 2009 at 11:53 AM, Ted Yu wrote: > Hi, > I saw the following from scan 'crawltable' command in hbase shell: > ... > com.onsoft.www:http/column=stt:, timestamp=1260405530801, > value=\003 > 3 row(s) in 0.2490 seconds > > How do I query the value for stt column ? > > hbase(main):005:0> get 'crawltable', 'com.onsoft.www:http/', { > column='stt:' > } > SyntaxError: (hbase):6: odd number list for Hash. >from (hbase):6 > > Can someone explain this 'odd number' error ? > > Thanks > > On Mon, Dec 14, 2009 at 10:16 PM, stack wrote: > > > Are you using hbase 0.20? If so, there is no 'compress'. Its NONE, LZO, > > or > > GZIP (You'll have to build lzo yourself. See hbase wiki for how). > > > > See the shell help. It has examples of how to change parameters on > column > > families. > > > > St.Ack > > > > 2009/12/14 Xin Jing > > > > > Hi all, > > > > > > I want to change the column family property for a existing hbase table. > > > Setting one comlumn family COMPRESSION from 'none' to comress, and > chagne > > > one column family IN_MEMORY from 'false' to 'true'. > > > > > > I want to use hbase shell to achieve that, but I cannot find the > detailed > > > description on 'alter' command. Could anyone point me to a reference on > > > that? > > > > > > Thanks > > > - Xin > > > > > >
Re: HBase Utility functions (for Java 5+)
Seems like an intuitive option to me. Tim On Tue, Dec 15, 2009 at 9:04 PM, stack wrote: > On Tue, Dec 15, 2009 at 9:56 AM, Kevin Peterson wrote: > >> These kinds of cleaner APIs would be a good way to prevent the standard >> situation of one engineer on the team figuring out HBase, then others say >> "why is this so complicated" so they write an internal set of wrappers and >> utility methods. >> > > This wouldn't solve the problems for people who want a full ORM, but I think >> there's an in-between sweet spot that abstracts away byte[] but still >> exposes column families and such. >> > > > What do fellas think of Lars' George's genercizing (sp? word?) of the client > API? See his patch up in https://issues.apache.org/jira/browse/HBASE-1990. > Would this be enough? > St.Ack >
Re: Help on HBase shell alter command usage
That works. scan command gives values for columns. Is there a shell command which lists unique row values, such as 'com.onsoft.www:http/' ? Thanks On Tue, Dec 15, 2009 at 12:09 PM, stack wrote: > Try: > > hbase(main):005:0> get 'crawltable', 'com.onsoft.www:http/', { COLUMNS => > 'stt:'} > > i.e. '=>' rather than '='. Also, its COLUMNS (uppercase I believe) rather > than column. > > Run 'help' in the shell for help and examples. > > St.Ack > > On Tue, Dec 15, 2009 at 11:53 AM, Ted Yu wrote: > > > Hi, > > I saw the following from scan 'crawltable' command in hbase shell: > > ... > > com.onsoft.www:http/column=stt:, timestamp=1260405530801, > > value=\003 > > 3 row(s) in 0.2490 seconds > > > > How do I query the value for stt column ? > > > > hbase(main):005:0> get 'crawltable', 'com.onsoft.www:http/', { > > column='stt:' > > } > > SyntaxError: (hbase):6: odd number list for Hash. > >from (hbase):6 > > > > Can someone explain this 'odd number' error ? > > > > Thanks > > > > On Mon, Dec 14, 2009 at 10:16 PM, stack wrote: > > > > > Are you using hbase 0.20? If so, there is no 'compress'. Its NONE, > LZO, > > > or > > > GZIP (You'll have to build lzo yourself. See hbase wiki for how). > > > > > > See the shell help. It has examples of how to change parameters on > > column > > > families. > > > > > > St.Ack > > > > > > 2009/12/14 Xin Jing > > > > > > > Hi all, > > > > > > > > I want to change the column family property for a existing hbase > table. > > > > Setting one comlumn family COMPRESSION from 'none' to comress, and > > chagne > > > > one column family IN_MEMORY from 'false' to 'true'. > > > > > > > > I want to use hbase shell to achieve that, but I cannot find the > > detailed > > > > description on 'alter' command. Could anyone point me to a reference > on > > > > that? > > > > > > > > Thanks > > > > - Xin > > > > > > > > > >
Re: hlogs do not get cleared
On Tue, Dec 15, 2009 at 10:43 AM, Jean-Daniel Cryans wrote: > > Too many hlogs means that the inserts are hitting a lot of regions, > that those regions aren't filled enough to flush so that we have to > force flush them to give some room. When you added region servers, it > spread the regions load so that hlogs were getting filled at a slower > rate. > > Could you tell us more about the rate of insertion, size of data, and > number of regions per region server? > > This makes some sense now. I currently have 2200 regions across 3 tables. My largest table accounts for about 1600 of those regions and is mostly active at one end of the keyspace -- our key is based on date, but data only roughly arrives in order. I also write to two secondary indexes, which have no pattern to the key at all. One of these secondary tables has 488 regions and the other has 96 regions. We write about 10M items per day to the main table (articles). All of these get written to one of the secondary indexes (article-ids). About a third get written to the other secondary index. Total volume of data is about 10GB / day written. I think the key is as you say that the regions aren't filled enough to flush. The articles table gets mostly written to near one end and I see splits happening regularly. The index tables have no pattern so the 10 million writes get scattered across the different regions. I've looked more closely at a log file (linked below), and if I forget about my main table (which would tend to get flushed), and look only at the indexes, this seems to be what's happening: 1. Up to maxLogs HLogs, it doesn't do any flushes. 2. Once it gets above maxLogs, it will start flushing one region each time it creates a new HLog. 3. If the first HLog had edits for say 50 regions, it will need to flush the region with oldest edits 50 times before the HLog can be removed. 
If N is the number of regions getting written to, but not getting enough writes to flush on their own, then I think this converges to maxLogs + N logs on average. If I think of maxLogs as "number of logs to start flushing regions at" this makes sense. http://kdpeterson.net/paste/hbase-hadoop-regionserver-mi-prod-app35.ec2.biz360.com.log.2009-12-14
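Kevin's back-of-the-envelope reasoning can be checked with a toy simulation of the rolling/flushing policy described in this thread. This is a simplified model, not HBase code, and it bakes in some assumptions: every slow region receives a few edits in every log, no region ever fills its memstore enough to flush on its own, and each roll past maxlogs force-flushes only the single region with the oldest edits (as the log message says). Under those assumptions the live log count does climb to about maxlogs + N during the transient (196 for 100 regions with maxlogs=96) before staggered flushes let old logs be deleted again:

```java
// Toy simulation of "roll a log; if logs > maxlogs, force flush the
// region with the oldest unflushed edits; delete logs no region needs".
// Simplified model for illustration only, not the HBase implementation.
import java.util.Arrays;

public class HlogModel {
    /** Returns {peak live log count, live log count after the last roll}. */
    static int[] simulate(int regions, int maxLogs, int rolls) {
        // oldest[r] = index of the oldest log holding unflushed edits of
        // region r; -1 means region r currently has no unflushed edits.
        int[] oldest = new int[regions];
        Arrays.fill(oldest, -1);
        int newestLog = -1, oldestLog = 0, liveLogs = 0, peak = 0;
        for (int t = 0; t < rolls; t++) {
            newestLog++;
            liveLogs++;
            for (int r = 0; r < regions; r++)
                if (oldest[r] < 0) oldest[r] = newestLog; // first edit since last flush
            if (liveLogs > peak) peak = liveLogs;
            if (liveLogs > maxLogs) {
                // force flush the region whose unflushed edits are oldest
                int victim = 0;
                for (int r = 1; r < regions; r++)
                    if (oldest[r] < oldest[victim]) victim = r;
                oldest[victim] = -1;
                // logs older than every region's oldest edit can be deleted
                int min = newestLog + 1;
                for (int o : oldest) if (o >= 0 && o < min) min = o;
                liveLogs -= (min - oldestLog);
                oldestLog = min;
            }
        }
        return new int[] { peak, liveLogs };
    }

    public static void main(String[] args) {
        int[] r = simulate(100, 96, 1000);
        System.out.println("peak=" + r[0] + " steady=" + r[1]);
    }
}
```

In this model the peak matches the maxLogs + N estimate, while the steady state after the transient settles lower once region flushes are staggered one per log; the real steady state depends on how edits actually spread across regions and logs.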
Re: HBase Utility functions (for Java 5+)
On 16/12/2009, at 7:04 AM, stack wrote: > On Tue, Dec 15, 2009 at 9:56 AM, Kevin Peterson wrote: > >> These kinds of cleaner APIs would be a good way to prevent the standard >> situation of one engineer on the team figuring out HBase, then others say >> "why is this so complicated" so they write an internal set of wrappers and >> utility methods. >> > > This wouldn't solve the problems for people who want a full ORM, but I think >> there's an in-between sweet spot that abstracts away byte[] but still >> exposes column families and such. >> > > > What do fellas think of Lars' George's genercizing (sp? word?) of the client > API? See his patch up in https://issues.apache.org/jira/browse/HBASE-1990. > Would this be enough? > St.Ack That's a pretty good start, but I think a good collection of useful builders and utilities that handle the 80% case will help HBase gain much more traction. As a person starting with HBase, there are a lot of concepts to get, and Bytes definitely get in the way of seeing the real underlying patterns. I'm a total believer in understanding the internals to get the best out of a product, but that often comes after experimentation, and these high-level libraries grease the wheels for faster 'grok'ing the concepts. Thinking out loud here, but something like this may be useful (more useful?, I dunno, I'm still used to this): PutBuilder builder = new PutBuilder(hTable); // first Row builder.withRowKey(1stRowKey).withColumnFamily("foo").put("columnA", valueA).put("columnB",valueB); // secondRow builder.withRowKey(2ndRowKey).withColumnFamily("eek").put("columnC", valueC).put("columnD",valueD); .. builder.putAll(); I also feel a little silly, because I've only JUST discovered the Writables class, my initial example of packing 4 ints is silly, a simple Class that implements Writable is a much more elegant solution (I wasn't sure why Bytes.add(..) only took 2 or 3 args). Paul
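The sketch above can be made concrete with a small, hypothetical builder. This version stands in plain Strings for the byte[]/Put machinery so it runs standalone; none of these names (withRowKey, withColumnFamily, put, putAll) are an actual HBase API:

```java
// A hypothetical PutBuilder along the lines Paul sketches, using plain
// Strings instead of HBase's byte[]/Put types. Illustration only.
import java.util.ArrayList;
import java.util.List;

public class PutBuilder {
    // each pending cell is {rowKey, family, qualifier, value}
    private final List<String[]> pending = new ArrayList<String[]>();
    private String rowKey, family;

    public PutBuilder withRowKey(String rowKey) { this.rowKey = rowKey; return this; }
    public PutBuilder withColumnFamily(String family) { this.family = family; return this; }

    public PutBuilder put(String qualifier, String value) {
        pending.add(new String[] { rowKey, family, qualifier, value });
        return this;
    }

    /** Flush all buffered cells in one batch; here we just return the count. */
    public int putAll() {
        // a real implementation would assemble one Put per row key and
        // hand the whole list to HTable.put(List<Put>) here
        int n = pending.size();
        pending.clear();
        return n;
    }

    public static void main(String[] args) {
        PutBuilder builder = new PutBuilder();
        builder.withRowKey("row1").withColumnFamily("foo")
               .put("columnA", "a").put("columnB", "b");
        builder.withRowKey("row2").withColumnFamily("eek")
               .put("columnC", "c").put("columnD", "d");
        System.out.println(builder.putAll() + " cells flushed"); // 4 cells flushed
    }
}
```

Batching per builder rather than per Put also pairs naturally with the client-side write buffer discussed in the performance thread.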
FilterList and SingleColumnValueFilter
I ran into some problems with FilterList and SingleColumnValueFilter. I created a FilterList with MUST_PASS_ONE and two SingleColumnValueFilters (each testing equality on a different column) and queried some trivial data: http://pastie.org/744890 The problems that I encountered were two-fold: SingleColumnValueFilter.filterKeyValues() returns ReturnCode.INCLUDE if the column names do not match. If FilterList is employed, then when the first Filter returns INCLUDE (because the column names do not match), no more filters for that KeyValue are evaluated. That is problematic because when filterRow() is finally called for those filters, matchedColumn is never found to be true because they were not invoked (due to FilterList exiting from the filterList iteration when the name-mismatched INCLUDE was returned). The fix (at least for this scenario) is for SingleColumnValueFilter.filterKeyValues() to return ReturnCode.NEXT_ROW (rather than INCLUDE). The second problem is at the bottom of FilterList.filterKeyValue() where ReturnCode.SKIP is returned if MUST_PASS_ONE is the operator, rather than always returning ReturnCode.INCLUDE and then leaving the final filter decision to be made by the call to filterRow(). I am sure there is a good reason for returning SKIP in other scenarios, but it is problematic in mine. Feedback would be much appreciated. Paul
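The interaction described above can be reproduced with a stripped-down model of the two classes. This is a simplified re-implementation for illustration, following the behavior the post describes, not the actual HBase source:

```java
// Toy model of the FilterList(MUST_PASS_ONE) + SingleColumnValueFilter
// interaction described above. Simplified for illustration; not HBase code.
public class FilterListDemo {
    enum ReturnCode { INCLUDE, SKIP, NEXT_ROW }

    /** Simplified column-value filter: row passes if column == value seen. */
    static class ColumnFilter {
        final String column, value;
        boolean matchedColumn;                  // set when column matches value
        ColumnFilter(String c, String v) { column = c; value = v; }
        ReturnCode filterKeyValue(String col, String val) {
            if (!col.equals(column)) return ReturnCode.INCLUDE; // the problematic case
            if (val.equals(value)) matchedColumn = true;
            return ReturnCode.INCLUDE;
        }
        boolean filterRow() { return !matchedColumn; }          // true = drop the row
    }

    /** MUST_PASS_ONE: short-circuits on the first INCLUDE, as described. */
    static ReturnCode listFilterKeyValue(ColumnFilter[] filters, String col, String val) {
        for (ColumnFilter f : filters)
            if (f.filterKeyValue(col, val) == ReturnCode.INCLUDE)
                return ReturnCode.INCLUDE;      // later filters never see this cell
        return ReturnCode.SKIP;
    }

    public static void main(String[] args) {
        ColumnFilter a = new ColumnFilter("colA", "x");
        ColumnFilter b = new ColumnFilter("colB", "y");
        ColumnFilter[] list = { a, b };
        // the row's cell colB=y would satisfy filter b on its own...
        listFilterKeyValue(list, "colB", "y");
        // ...but filter a returned INCLUDE on the name mismatch first, so
        // filter b never ran, matchedColumn stays false, and the row that
        // should MUST_PASS_ONE is wrongly dropped:
        System.out.println("row dropped: " + (a.filterRow() && b.filterRow()));
    }
}
```

Returning NEXT_ROW on a name mismatch, as Paul suggests, would stop the mismatching filter from short-circuiting evaluation of the other filters in the list.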
Re: FilterList and SingleColumnValueFilter
Hi Paul, I've encountered the same problem. I think its fixed as part of https://issues.apache.org/jira/browse/HBASE-2037 Regards, Yoram On Wed, Dec 16, 2009 at 10:45 AM, Paul Ambrose wrote: > I ran into some problems with FilterList and SingleColumnValueFilter. > > I created a FilterList with MUST_PASS_ONE and two SingleColumnValueFilters > (each testing equality on a different columns) and query some trivial data: > > http://pastie.org/744890 > > The problem that I encountered were two-fold: > > SingleColumnValueFilter.filterKeyValues() returns ReturnCode.INCLUDE > if the column names do not match. If FilterList is employed, then when the > first Filter returns INCLUDE (because the column names do not match), no > more filters for that KeyValue are evaluated. That is problematic because > when filterRow() is finally called for those filters, matchedColumn is > never > found to be true because they were not invoked (due to FilterList exiting > from > the filterList iteration when the name mismatched INCLUDE was returned). > The fix (at least for this scenario) is for > SingleColumnValueFilter.filterKeyValues() to > return ReturnCode.NEXT_ROW (rather than INCLUDE). > > The second problem is at the bottom of FilterList.filterKeyValue() > where ReturnCode.SKIP is returned if MUST_PASS_ONE is the operator, > rather than always returning ReturnCode.INCLUDE and then leaving the > final filter decision to be made by the call to filterRow(). I am sure > there is a good > reason for returning SKIP in other scenarios, but it is problematic in > mine. > > Feedback would be much appreciated. > > Paul > > > > > > > >
Re: HBase Utility functions (for Java 5+)
Thanks for the feedback Paul. I agree the Builder pattern is an interesting option. Please see https://issues.apache.org/jira/browse/HBASE-2051 - Andy From: Paul Smith To: hbase-user@hadoop.apache.org Sent: Tue, December 15, 2009 3:21:44 PM Subject: Re: HBase Utility functions (for Java 5+) On 16/12/2009, at 7:04 AM, stack wrote: > On Tue, Dec 15, 2009 at 9:56 AM, Kevin Peterson wrote: > >> These kinds of cleaner APIs would be a good way to prevent the standard >> situation of one engineer on the team figuring out HBase, then others say >> "why is this so complicated" so they write an internal set of wrappers and >> utility methods. >> > > This wouldn't solve the problems for people who want a full ORM, but I think >> there's an in-between sweet spot that abstracts away byte[] but still >> exposes column families and such. >> > > > What do fellas think of Lars' George's genercizing (sp? word?) of the client > API? See his patch up in https://issues.apache.org/jira/browse/HBASE-1990. > Would this be enough? > St.Ack That's a pretty good start, but I think a good collection of useful builders and utilities that handle the 80% case will help HBase gain much more traction. As an person starting with HBase, there are a lot of concepts to get, Bytes definitely get in the way of seeing the real underlying patterns. I'm a total believer in understanding the internals to get the best out of a product, but that often comes after experimentation, and these high-level libraries grease the wheels for faster 'grok'ing the concepts. Thinking out loud here, but something like this may be useful (more useful?, I dunno, I'm still used to this): PutBuilder builder = new PutBuilder(hTable); // first Row builder.withRowKey(1stRowKey).withColumnFamily("foo").put("columnA", valueA).put("columnB",valueB); // secondRow builder.withRowKey(2ndRowKey).withColumnFamily("eek").put("columnC", valueC).put("columnD",valueD); ... 
builder.putAll(); I also feel a little silly, because I've only JUST discovered the Writables class, my initial example of packing 4 ints is silly, a simple Class that implements Writeable is a much more elegant solution (I wasn't sure why Bytes.add(..) only took 2 or 3 args). Paul
Re: FilterList and SingleColumnValueFilter
Paul: I can apply the fix from hbase-2037... I can break it out of the posted patch that's up there. Just say the word.

St.Ack

On Tue, Dec 15, 2009 at 4:17 PM, Ram Kulbak wrote:
> Hi Paul,
>
> I've encountered the same problem. I think it's fixed as part of
> https://issues.apache.org/jira/browse/HBASE-2037
>
> Regards,
> Yoram
>
> On Wed, Dec 16, 2009 at 10:45 AM, Paul Ambrose wrote:
>
>> I ran into some problems with FilterList and SingleColumnValueFilter.
>>
>> I created a FilterList with MUST_PASS_ONE and two SingleColumnValueFilters
>> (each testing equality on a different column) and queried some trivial data:
>>
>> http://pastie.org/744890
>>
>> The problems that I encountered were two-fold:
>>
>> SingleColumnValueFilter.filterKeyValue() returns ReturnCode.INCLUDE
>> if the column names do not match. If FilterList is employed, then when the
>> first filter returns INCLUDE (because the column names do not match), no
>> more filters for that KeyValue are evaluated. That is problematic because
>> when filterRow() is finally called for those filters, matchedColumn is never
>> found to be true, because they were not invoked (due to FilterList exiting
>> the filter iteration when the name-mismatched INCLUDE was returned).
>> The fix (at least for this scenario) is for
>> SingleColumnValueFilter.filterKeyValue() to return ReturnCode.NEXT_ROW
>> (rather than INCLUDE).
>>
>> The second problem is at the bottom of FilterList.filterKeyValue(),
>> where ReturnCode.SKIP is returned if MUST_PASS_ONE is the operator,
>> rather than always returning ReturnCode.INCLUDE and leaving the final
>> filter decision to the call to filterRow(). I am sure there is a good
>> reason for returning SKIP in other scenarios, but it is problematic in mine.
>>
>> Feedback would be much appreciated.
>>
>> Paul
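The pastie link in Paul's report is no longer reachable, so here is a self-contained toy reconstruction of the short-circuit he describes. The Filter interface, ReturnCode enum, and the MUST_PASS_ONE loop below only mimic the HBase 0.20 interfaces in miniature; they are illustrative assumptions, not the actual HBase source.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the MUST_PASS_ONE short-circuit reported in this thread.
// ColumnEqualsFilter stands in for SingleColumnValueFilter; filterList()
// stands in for FilterList with Operator.MUST_PASS_ONE.
public class FilterListSketch {
    enum ReturnCode { INCLUDE, SKIP, NEXT_ROW }

    interface Filter {
        ReturnCode filterKeyValue(String kv);
        boolean matchedColumn();
    }

    // Only inspects its own column; records a match for filterRow().
    static class ColumnEqualsFilter implements Filter {
        private final String column, expected;
        private boolean matched = false;
        ColumnEqualsFilter(String column, String expected) {
            this.column = column;
            this.expected = expected;
        }
        public ReturnCode filterKeyValue(String kv) {
            String[] parts = kv.split("=");
            if (!parts[0].equals(column)) {
                // The behaviour Paul reports: INCLUDE on a column-name
                // mismatch (his proposed fix is NEXT_ROW here).
                return ReturnCode.INCLUDE;
            }
            if (parts[1].equals(expected)) matched = true;
            return ReturnCode.INCLUDE;
        }
        public boolean matchedColumn() { return matched; }
    }

    // MUST_PASS_ONE: the first INCLUDE wins, so later filters never run.
    static ReturnCode filterList(List<Filter> filters, String kv) {
        for (Filter f : filters) {
            if (f.filterKeyValue(kv) == ReturnCode.INCLUDE) {
                return ReturnCode.INCLUDE; // short-circuit
            }
        }
        return ReturnCode.SKIP;
    }

    public static void main(String[] args) {
        Filter a = new ColumnEqualsFilter("colA", "x");
        Filter b = new ColumnEqualsFilter("colB", "y");
        List<Filter> list = Arrays.asList(a, b);
        // A KeyValue for colB: filter a mismatches yet returns INCLUDE,
        // so filter b is never consulted and never records its match.
        filterList(list, "colB=y");
        System.out.println(a.matchedColumn() + " " + b.matchedColumn());
        // prints "false false" -- filterRow() would then wrongly drop the row
    }
}
```

This is exactly the failure mode in the report: neither filter's matchedColumn flag is set even though the row genuinely satisfies one of the two conditions, so a filterRow() pass afterwards excludes a row that MUST_PASS_ONE should have kept.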
Re: FilterList and SingleColumnValueFilter
Hey Michael, if hbase-2037 will make it into 0.20.3, I am fine. If not, I would greatly appreciate you breaking it out for 0.20.3.

Thanks,
Paul

On Dec 15, 2009, at 10:28 PM, stack wrote:
> Paul:
>
> I can apply the fix from hbase-2037... I can break it out of the posted
> patch that's up there. Just say the word.
>
> St.Ack