Hey Xueling:

Now I notice that you are the fellow who recently wrote up on the hadoop list.
Todd's described scheme I take it won't work for you then? There'd be fewer
moving parts for sure.

Up on the hadoop list you gave a description of your records as so:

"1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24

The highlighted field is called "position of match" and the query we are
interested in is the # of sequences in a certain range of this "position of
match". For instance the range can be "position of match" > 200 and
"position of match" + 36 < 200,000."

What are you thinking regards row key? Will each of the fields above be
concatenated as the row key, or will they each be individual columns, all in
the one column family or in many?

I'd suggest you get some subset of your dataset, say a million records or
so. This should load into a single hbase node fine. Use this small dataset
to figure the schema that best serves the way you'll be querying the data.
If you can get away with a single family, work on writing an import that
writes hfiles directly:
http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
It'll run an order of magnitude or more faster than going via the API.

Now, as to the size of the cluster, see the presentations section where Ryan
describes the hardware used loading up a 9B-row table. His hardware might be
more than you need. I'd suggest you start with 4 or 5 nodes and see how
loading goes. Check query latency. If the numbers are not to your liking,
add more nodes. HBase generally scales linearly.

Hope this helps,
St.Ack

On Thu, Dec 17, 2009 at 4:00 PM, Xueling Shu <[email protected]> wrote:

> Hi St.Ack:
>
> Wondering how many nodes in a cluster you would recommend to hold 5B
> records? Eventually we need to handle X times 5B records. I want to get an
> idea of how many resources we need.
>
> Thanks,
> Xueling
>
>
> On Thu, Dec 17, 2009 at 3:45 PM, stack <[email protected]> wrote:
>
> > Hey Xueling, 5B into a single node ain't going to work.
> > Get yourself a bit of a cluster somewhere. Single node is for messing
> > around. Not for doing 'real' stuff.
> >
> > St.Ack
> >
> >
> > On Thu, Dec 17, 2009 at 3:29 PM, stack <[email protected]> wrote:
> >
> > > On Thu, Dec 17, 2009 at 2:38 PM, Xueling Shu <[email protected]>
> > > wrote:
> > >
> > >> Things started fine until 5 mins after the data population started.
> > >>
> > >> Here is the exception:
> > >> Exception in thread "main"
> > >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> > >> contact region server 10.0.176.64:39045 for region
> > >> Genome,,1261087437258, row
> > >> '\x00\x00\x00\x00\x0E\xB00\xAC\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00s\xAD',
> > >> but failed after 10 attempts.
> > >> Exceptions:
> > >> java.io.IOException: java.io.IOException: Server not running, aborting
> > >
> > > See why it quit by looking in the regionserver log.
> > >
> > > Make sure you have latest hbase and read the 'Getting Started' section.
> > >
> > > St.Ack
> > >
> > >
> > >> at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2347)
> > >> at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1826)
> > >> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> > >> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >> at java.lang.reflect.Method.invoke(Method.java:597)
> > >> at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
> > >> at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> > >>
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >>
> > >> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1002)
> > >> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$2.doCall(HConnectionManager.java:1193)
> > >> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1115)
> > >> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1201)
> > >> at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:605)
> > >> at org.apache.hadoop.hbase.client.HTable.put(HTable.java:470)
> > >> at HadoopTrigger.populateData(HadoopTrigger.java:126)
> > >> at HadoopTrigger.main(HadoopTrigger.java:52)
> > >>
> > >> Can anybody let me know how to fix it?
> > >> Thanks,
> > >> Xueling
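[Editor's note: the thread never settles the row-key question. Below is a minimal sketch of one scheme that would serve the "position of match" range query: a fixed-width big-endian position prefix followed by the read id as a tiebreaker. The class and method names here are my own invention, and the encoding is an assumption, not anything decided in the thread. HBase itself is not needed to see why it works, since HBase orders rows by raw lexicographic byte comparison.]

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical row-key design (not from the thread): an 8-byte big-endian
// "position of match" prefix plus the read id (e.g. "1-1-174-418") as a
// tiebreaker. Big-endian non-negative longs sort in numeric order under
// byte-wise comparison, which is exactly how HBase sorts row keys, so a
// Scan over a key range covers a contiguous range of positions.
public class PositionRowKey {

    /** Encode position (big-endian) followed by the read id bytes. */
    static byte[] rowKey(long position, String readId) {
        byte[] id = readId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(8 + id.length)
                .putLong(position)   // ByteBuffer defaults to big-endian
                .put(id)
                .array();
    }

    /** Unsigned lexicographic compare, mirroring HBase's row ordering. */
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] low  = rowKey(200L, "1-1-174-418");
        byte[] mid  = rowKey(103570835L, "1-1-174-418");
        byte[] high = rowKey(200_000L, "x");
        // Byte order agrees with numeric order of the position prefix.
        System.out.println(compare(low, mid) < 0);   // true
        System.out.println(compare(mid, high) > 0);  // true
    }
}
```

With keys like these, the example query ("position of match" > 200 and "position of match" + 36 < 200,000, i.e. position in [201, 199963]) becomes a single Scan with start row rowKey(201, "") and exclusive stop row rowKey(199964, ""), counting the rows returned.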
