Hey,

Looks like you have some HDFS issues.

Things I did to make my cluster stable:

- run HDFS with -Xmx2000m
- raise the xciever limit to 2047 (dfs.datanode.max.xcievers, set in
hadoop-site.xml or hdfs-site.xml depending on your version)
- ulimit -n 32768 - also important
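For reference, a minimal sketch of the xciever setting, assuming 0.19-era config file names (on 0.20+ it goes in hdfs-site.xml):

```xml
<!-- hadoop-site.xml (or hdfs-site.xml on newer Hadoop) -->
<!-- Note: the property name really is spelled "xcievers" upstream. -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2047</value>
</property>
```

The heap setting goes in conf/hadoop-env.sh (HADOOP_HEAPSIZE=2000 translates to -Xmx2000m), and the ulimit change has to apply to the user actually running the datanode processes, e.g. via /etc/security/limits.conf.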

With this I find that HDFS is very stable; I've imported hundreds of gigs.

You want to make sure the HDFS xciever limit is set in the hadoop/conf
directory, copied to every node, and HDFS restarted.  It also sounds like you
might have a cluster with multiple versions of hadoop.  Double check that!
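To double-check for mixed versions, a comparison helper along these lines works; the hostnames and the ssh gathering step are assumptions about your setup, and only the comparison logic is shown running on sample strings:

```shell
# Compare "hadoop version" strings from each node and flag the odd one out.
check_versions() {
  first=""
  for v in "$@"; do
    if [ -z "$first" ]; then
      first="$v"
    elif [ "$v" != "$first" ]; then
      echo "MISMATCH: '$v' vs '$first'"
      return 1
    fi
  done
  echo "all nodes agree: $first"
}

# In practice you would feed it the first line of "hadoop version" from
# each node, e.g.:
#   check_versions "$(ssh node1 hadoop version | head -1)" \
#                  "$(ssh node2 hadoop version | head -1)"
check_versions "Hadoop 0.19.1" "Hadoop 0.19.0" || echo "reinstall the odd node out"
```

Any node that reports a different build is the one throwing the Version Mismatch DataXceiver errors.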

you're close!
-ryan

On Wed, Jun 10, 2009 at 3:32 PM, Bradford Stephens <
bradfordsteph...@gmail.com> wrote:

> Thanks so much for all the help, everyone... things are still broken,
> but maybe we're getting close.
>
> All the regionservers were dead by the time the job ended.  I see
> quite a few error messages like this:
>
> (I've put the entirety of the regionserver logs on pastebin:)
> http://pastebin.com/m2e6f9283
> http://pastebin.com/mf97bd57
>
> 2009-06-10 14:47:54,994 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: unable to process
> message: MSG_REGION_OPEN:
> joinedcontent,1DCC1616F7C7B53B69B5536F407A64DF,1244667570521:
> safeMode=false
> java.lang.NullPointerException
>
> There's also a scattering of messages like this:
> 2009-06-10 13:49:02,855 WARN
> org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 1 on
> 60020 took 3267ms appending an edit to HLog; editcount=21570
>
> aaand....
>
> 2009-06-10 14:03:27,270 INFO
> org.apache.hadoop.hbase.regionserver.HLog: Closed
>
> hdfs://dttest01:54310/hbase-0.19/log_192.168.18.49_1244659862699_60020/hlog.dat.1244667757560,
> entries=100006. New log writer:
> /hbase-0.19/log_192.168.18.49_1244659862699_60020/hlog.dat.1244667807249
> 2009-06-10 14:03:28,160 INFO org.apache.hadoop.hdfs.DFSClient:
> Exception in createBlockOutputStream java.io.IOException: Bad connect
> ack with firstBadLink 192.168.18.47:50010
> 2009-06-10 14:03:28,160 INFO org.apache.hadoop.hdfs.DFSClient:
> Abandoning block blk_4831127457964871573_140781
> 2009-06-10 14:03:34,170 INFO org.apache.hadoop.hdfs.DFSClient:
> Exception in createBlockOutputStream java.io.IOException: Could not
> read from stream
> 2009-06-10 14:03:34,170 INFO org.apache.hadoop.hdfs.DFSClient:
> Abandoning block blk_-6169186743102862627_140796
> 2009-06-10 14:03:34,485 INFO
> org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Forced flushing
> of joinedcontent,1F2F64F59088A3B121CFC66F7FCBA2A9,1244667654435
> because global memcache limit of 398.7m exceeded; currently 399.0m and
> flushing till 249.2m
>
> Finally, I saw this when I stopped and re-started my cluster:
>
> 2009-06-10 15:29:09,494 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(192.168.18.16:50010,
> storageID=DS-486600617-192.168.18.16-50010-1241838200467,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Version Mismatch
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:81)
>        at java.lang.Thread.run(Thread.java:619)
>
>
> On Wed, Jun 10, 2009 at 2:55 PM, Ryan Rawson<ryano...@gmail.com> wrote:
> > That is a client exception that is a sign of problems on the
> > regionserver...is it still running? What do the logs look like?
> >
> > On Jun 10, 2009 2:51 PM, "Bradford Stephens" <bradfordsteph...@gmail.com
> >
> > wrote:
> >
> > OK, I've tried all the optimizations you've suggested (still running
> > with a M/R job). Still having problems like this:
> >
> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> > contact region server 192.168.18.15:60020 for region
> > joinedcontent,242FEB3ED9BE0D8EF3856E9C4251464C,1244666594390, row
> > '291DB5C7440B0A5BDB0C12501308C55B', but failed after 10 attempts.
> > Exceptions:
> > java.io.IOException: Call to /192.168.18.15:60020 failed on local
> > exception: java.io.EOFException
> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
> > connection exception: java.net.ConnectException: Connection refused
> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
> > connection exception: java.net.ConnectException: Connection refused
> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
> > connection exception: java.net.ConnectException: Connection refused
> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
> > connection exception: java.net.ConnectException: Connection refused
> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
> > connection exception: java.net.ConnectException: Connection refused
> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
> > connection exception: java.net.ConnectException: Connection refused
> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
> > connection exception: java.net.ConnectException: Connection refused
> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
> > connection exception: java.net.ConnectException: Connection refused
> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
> > connection exception: java.net.ConnectException: Connection refused
> >
> > On Wed, Jun 10, 2009 at 12:40 AM, stack<st...@duboce.net> wrote: > On
> Tue,
> > Jun 9, 2009 at 11:51 AM,...
> >
>
