I took a look at your attached configuration files. You have very little customization in them. Given you are running 0.19.x, you are missing some critical configuration. See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, in particular #5, #6, and #7. What about the file descriptor count? Did you up it above its default of 1024? See http://wiki.apache.org/hadoop/Hbase/FAQ#6 (a lot of these issues fade in 0.20.0).
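For what it's worth, here is a rough sketch of the kind of settings usually called out for 0.19-era clusters (I have not checked these against the exact wiki item numbers, and the values and the user name are only starting points, not something verified against your cluster):

  <!-- hadoop-site.xml on every datanode: raise the xceiver limit;
       the shipped default (256) is far too low for an HBase load -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2048</value>
  </property>

  # /etc/security/limits.conf on every node, for whichever user runs
  # hadoop/hbase (shown here as 'hadoop'); log out and back in, then
  # confirm with 'ulimit -n'
  hadoop  -  nofile  32768

Below your quoted mail I have also put two small sketches: one for the -conf/Tool question in ISSUE 1 and one for keeping hold of uncommitted BatchUpdates in ISSUE 3.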
St.Ack

On Wed, Sep 16, 2009 at 8:35 AM, <[email protected]> wrote:
> Hi all,
> We are in the process of evaluating HBase for managing "bigtable"-scale
> data (to give an idea, ~1G entries of 500 bytes). We are now facing some
> issues and I would like to have comments on what I have noticed.
> Our configuration is Hadoop 0.19.1 and HBase 0.19.3; both
> hadoop-default/site.xml and hbase-default/site.xml are attached. We have
> 15 nodes (16 or 8 GB RAM and 1.3 TB disk each, Linux kernel
> 2.6.24-standard, java version "1.6.0_12").
> For now the test case is one IndexedTable (without, at the moment, using
> the indexed column) with 25M entries/rows: the map formats the data and
> 15 reduces issue BatchUpdates of the textual data (fields like URL and
> simple text, < 500 bytes each).
> All processes (hadoop/hbase) are started with -Xmx1000m, and the
> IndexedTable is configured with autoCommit set to false.
>
> ISSUE 1: We need one indexed column to get "fast" UI queries (for
> instance, as an answer to a web form we could expect waiting at most
> 30 seconds). The only documentation I found concerning indexed columns
> comes from
> http://rajeev1982.blogspot.com/2009/06/secondary-indexes-in-hbase.html
> Instead of putting the index-table properties in hbase-site.xml (which I
> have tested, but which gives very poor performance and also loses
> entries...), I pass the properties to the job through
> -conf indextable_properties.xml (file is attached). I suppose that
> putting the index-table properties into hbase-site.xml applies them to
> the whole HBase cluster, making overall performance drop significantly?
> The best performance was reached by passing them through the -conf
> option of the Tool.run method.
>
> ISSUE 2: We are facing serious regionserver problems, often leading to a
> regionserver shutdown, like:
>
> 2009-09-16 10:21:15,887 INFO
> org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Too many store files
> for region urlsdata-validation,
> forum.telecharger.01net.com/index.php?page=01net_voter&forum=microhebdo&category=5&topic=344142&post=5653085,1253089082422:
> 23, waiting
>
> or
>
> 2009-09-14 16:39:24,611 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Blocking updates for 'IPC Server handler 1 on 60020' on region
> urlsdata-validation,
> www.abovetopsecret.com/forum/thread119/pg1&title=Underground+Communities,1252939031807:
> Memcache size 128.0m is >= than blocking 128.0m size
> 2009-09-14 16:39:24,942 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> createBlockOutputStream java.io.IOException: Could not read from stream
> 2009-09-14 16:39:24,942 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> block blk_-873614322830930554_111500
> 2009-09-14 16:39:31,180 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-873614322830930554_111500 bad datanode[0] nodes ==
> null
> 2009-09-14 16:39:31,181 WARN org.apache.hadoop.hdfs.DFSClient: Could not
> get block locations. Source file
> "/hbase/urlsdata-validation/1733902030/info/mapfiles/2690714750206504745/data"
> - Aborting...
> 2009-09-14 16:39:31,241 FATAL
> org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Replay of hlog
> required. Forcing server shutdown
>
> I've read some HBase JIRA issues (HBASE-1415, HBASE-1058, HBASE-1084...)
> concerning similar problems, but I cannot get a clear idea of what kind
> of fix is proposed.
>
> ISSUE 3: These problems cause a table.commit() IOException that loses all
> the entries:
>
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
> region server 192.168.255.8:60020 for region urlsdata-validation,
> twitter.com/statuses/434272962,1253089707924, row
> 'www.harmonicasurcher.com', but failed after 10 attempts.
> Exceptions:
> java.io.IOException: Call to /192.168.255.8:60020 failed on local
> exception: java.io.EOFException
> java.net.ConnectException: Call to /192.168.255.8:60020 failed on
> connection exception: java.net.ConnectException: Connection refused
>
> Is there a way to get back the uncommitted entries (there are many of
> them, because we run with autoCommit false) so we can resubmit them later?
> To give an idea, we sometimes lose about 170,000 entries out of 25M due to
> this commit exception.
>
> Guillaume Viland ([email protected])
> FT/TGPF/OPF/PORTAIL/DOP Sophia Antipolis
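On ISSUE 1: passing the index-table properties per job rather than cluster-wide sounds right. A bare-bones sketch of the Tool/ToolRunner shape that makes -conf work; the class name and the job setup are illustrative, not taken from your attachments:

  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class IndexedLoadJob extends Configured implements Tool {

    public int run(String[] args) throws Exception {
      // getConf() already holds whatever GenericOptionsParser picked up from
      //   hadoop jar load.jar IndexedLoadJob -conf indextable_properties.xml
      // so the index-table properties apply to this job only, not to the
      // cluster-wide hbase-site.xml.
      // ... build the JobConf from getConf() and wire up the map/reduce ...
      return 0;
    }

    public static void main(String[] args) throws Exception {
      System.exit(ToolRunner.run(new IndexedLoadJob(), args));
    }
  }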
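On ISSUE 3: one way to avoid losing the whole uncommitted batch is to keep the BatchUpdates on the client and hold on to whatever commit() rejects, so they can be resubmitted (or dumped to a side file) later. A rough, untested sketch against the 0.19 client API; the class name, the flush size, and the in-memory 'failed' list are just one way of doing it:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.RetriesExhaustedException;
  import org.apache.hadoop.hbase.io.BatchUpdate;

  public class ResubmittingWriter {
    private final HTable table;
    private final List<BatchUpdate> pending = new ArrayList<BatchUpdate>();
    private final List<BatchUpdate> failed = new ArrayList<BatchUpdate>();
    private final int flushSize;

    public ResubmittingWriter(String tableName, int flushSize) throws IOException {
      this.table = new HTable(new HBaseConfiguration(), tableName);
      this.flushSize = flushSize;
    }

    // Buffer an update; flush once the buffer reaches flushSize rows.
    public void add(BatchUpdate bu) {
      pending.add(bu);
      if (pending.size() >= flushSize) {
        flush();
      }
    }

    // Commit each buffered update; anything the cluster refuses is kept in
    // 'failed' instead of being lost, so it can be retried once the
    // regionservers are healthy again.
    public void flush() {
      for (BatchUpdate bu : pending) {
        try {
          table.commit(bu);
        } catch (RetriesExhaustedException e) {
          failed.add(bu);
        } catch (IOException e) {
          failed.add(bu);
        }
      }
      pending.clear();
    }

    public List<BatchUpdate> getFailed() {
      return failed;
    }
  }

In the reduce you would call add() per row, flush() in close(), and then either resubmit getFailed() or write those rows somewhere durable so a later job can reload them.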
