Hey Xueling:

I notice now that you're the fellow who recently posted over on the hadoop
list.

Todd's described scheme won't work for you then, I take it?  It'd have fewer
moving parts, for sure.

Up on the hadoop list you described your records like so:

"1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24

"The highlighted field is called "position of match" and the query we are
interested in is the # of sequences in a certain range of this "position of
match". For instance the range can be "position of match" > 200 and
"position of match" + 36 < 200,000."

What are you thinking regarding the row key?  Will each of the fields above
be concatenated into the row key, or will they each be individual columns,
all in the one column family or spread across many?
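
If it's the "position of match" ranges you'll be querying, one option is to
lead the row key with that position, zero-padded to a fixed width so keys
sort numerically, and tack something unique like the read id on the end.
Just a sketch; the 12-digit width and the '/' separator are made up:

import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: fixed-width, zero-padded "position of match" up front so
// rows sort by position; a read-id suffix keeps the keys unique.
public class RowKeys {
  public static byte[] makeKey(long positionOfMatch, String readId) {
    // 12 digits is arbitrary; pick a width that covers your largest position.
    return Bytes.toBytes(String.format("%012d", positionOfMatch) + "/" + readId);
  }
}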

I'd suggest you get some subset of your dataset, say a million records or
so.  This should load fine into a single HBase node.  Use this small dataset
to figure out the schema that best serves the way you'll be querying the data.
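
With a key like the above, your example range query becomes a plain scan
bounded by start and stop rows.  Untested, against the 0.20 client API, and
assuming the key layout from the sketch above and your Genome table:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: count rows whose "position of match" satisfies
// position > 200 and position + 36 < 200,000.
public class RangeCount {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "Genome");
    byte[] start = Bytes.toBytes(String.format("%012d", 201L));         // > 200; start row is inclusive
    byte[] stop = Bytes.toBytes(String.format("%012d", 200000L - 36));  // stop row is exclusive
    ResultScanner scanner = table.getScanner(new Scan(start, stop));
    long count = 0;
    for (Result r : scanner) {
      count++;
    }
    scanner.close();
    System.out.println(count + " rows in range");
  }
}

The scan only touches the rows inside the range, so latency should track the
size of the range rather than the size of the table.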

If you can get away with a single family, work on writing an import that
writes hfiles directly:
http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
It'll run an order of magnitude or more faster than going via the API.
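
On the map side the job ends up looking roughly like the below: parse each
record, build the row key, and emit KeyValues.  HFileOutputFormat is the
job's output format; the sort/partition setup and the step that moves the
hfiles under the table are covered in the doc above.  The class name, the
'd' family, the 'seq' qualifier, and the field indices are all made up:

import java.io.IOException;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the mapper only: one KeyValue per column, keyed the same way
// as the earlier row-key sketch.
public class RecordMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
  private static final byte[] FAMILY = Bytes.toBytes("d");

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\\s+");
    // Field positions are guesses off your example record; adjust to your format.
    long position = Long.parseLong(fields[7]);   // "position of match"
    String readId = fields[0];                   // e.g. 1-1-174-418
    byte[] row = Bytes.toBytes(String.format("%012d", position) + "/" + readId);
    KeyValue kv = new KeyValue(row, FAMILY, Bytes.toBytes("seq"),
        Bytes.toBytes(fields[1]));               // the sequence, as an example
    context.write(new ImmutableBytesWritable(row), kv);
  }
}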

Now, as to the size of the cluster, see the presentations section where Ryan
describes the hardware he used to load up a 9B-row table.  His hardware might
be more than you need.  I'd suggest you start with 4 or 5 nodes and see how
the loading goes.  Check query latency.  If the numbers are not to your
liking, add more nodes.  HBase generally scales linearly.

Hope this helps,
St.Ack

On Thu, Dec 17, 2009 at 4:00 PM, Xueling Shu <[email protected]> wrote:

> Hi St.Ack:
>
> Wondering how many nodes in a cluster you would recommend to hold 5B records.
> Eventually we need to handle X times 5B records.  I want to get an idea of how
> many resources we need.
>
> Thanks,
> Xueling
>
>
> On Thu, Dec 17, 2009 at 3:45 PM, stack <[email protected]> wrote:
>
> > Hey Xueling, 5B into a single node ain't going to work.  Get yourself a bit
> > of a cluster somewhere.  Single node is for messing around.  Not for doing
> > 'real' stuff.
> >
> > St.Ack
> >
> >
> > On Thu, Dec 17, 2009 at 3:29 PM, stack <[email protected]> wrote:
> >
> > > On Thu, Dec 17, 2009 at 2:38 PM, Xueling Shu <[email protected]> wrote:
> > >
> > >>
> > >> Things started fine until 5 mins after the data population started.
> > >>
> > >> Here is the exception:
> > >> Exception in thread "main"
> > >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
> > >> region server 10.0.176.64:39045 for region Genome,,1261087437258, row
> > >> '\x00\x00\x00\x00\x0E\xB00\xAC\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00s\xAD',
> > >> but failed after 10 attempts.
> > >> Exceptions:
> > >> java.io.IOException: java.io.IOException: Server not running, aborting
> > >>
> > >
> > > See why it quit by looking in the regionserver log.
> > >
> > > Make sure you have the latest hbase and read the 'Getting Started' section.
> > >
> > > St.Ack
> > >
> > >
> > >
> > >
> > >>        at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2347)
> > >>        at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1826)
> > >>        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> > >>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >>        at java.lang.reflect.Method.invoke(Method.java:597)
> > >>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
> > >>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> > >>
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >> java.net.ConnectException: Connection refused
> > >>
> > >>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1002)
> > >>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$2.doCall(HConnectionManager.java:1193)
> > >>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1115)
> > >>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1201)
> > >>        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:605)
> > >>        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:470)
> > >>        at HadoopTrigger.populateData(HadoopTrigger.java:126)
> > >>        at HadoopTrigger.main(HadoopTrigger.java:52)
> > >>
> > >> Can anybody let me know how to fix it?
> > >> Thanks,
> > >> Xueling
> > >>
> > >
> > >
> >
>
