Hello Stack,

Does it take you too long to download the logs? May I upload them to another site that is easier for you to download from?

LvZheng

2010/3/19 Zheng Lv <lvzheng19800...@gmail.com>

> Hello Stack,
> I must say thank you for your patience too.
> I'm sorry that you had to retry so many times and that the logs you got
> were not that useful. I have now turned the logging up to DEBUG level,
> so if we get these exceptions again I will send you debug logs.
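> (What we changed, roughly -- a minimal sketch, assuming the stock
> conf/log4j.properties that ships with hbase 0.20, edited on every node
> and followed by a restart; the hdfs line is just optional extra detail:
>
>   # conf/log4j.properties
>   log4j.logger.org.apache.hadoop.hbase=DEBUG
>   log4j.logger.org.apache.hadoop.hdfs=DEBUG
>
> )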
> Anyway, I have uploaded the logs you wanted to rapidshare, although
> they are not at DEBUG level. The urls:
>
> http://rapidshare.com/files/365292889/hadoop-root-namenode-cactus207.log.2010-03-15.html
> http://rapidshare.com/files/365293127/hbase-root-master-cactus207.log.2010-03-15.html
> http://rapidshare.com/files/365293238/hbase-root-regionserver-cactus208.log.2010-03-15.html
> http://rapidshare.com/files/365293391/hbase-root-regionserver-cactus209.log.2010-03-15.html
> http://rapidshare.com/files/365293488/hbase-root-regionserver-cactus210.log.2010-03-15.html
>
> >For sure you've upped xceivers on your hdfs cluster and you've upped
> >the file descriptors as per the 'Getting Started'? (Sorry, have to
> >ask).
> Before I got your mail we had not set the properties you mentioned,
> because we had never seen "too many open files" or the other symptoms
> described in the 'Getting Started' docs. But now I have upped both
> settings. We'll see what happens.
>
> If you need more information, just tell me.
>
> Thanks again,
> LvZheng
>
> 2010/3/19 Stack <st...@duboce.net>
>
>> Yeah, I had to retry a couple of times ("Too busy; try back later --
>> or sign up premium service!").
>>
>> It would have been nice to have wider log snippets. I'd like to have
>> seen whether the issue was double assignment. The master log snippet
>> only shows the split. Regionserver 209's log is the one where the
>> interesting stuff is going on around this time, 2010-03-15
>> 16:06:51,150, but it's not in the provided set. Nor are you running
>> at DEBUG level, so it would be hard to see what is up even if you had
>> provided it.
>>
>> Looking in 208, I see a few exceptions beyond the one you pasted
>> below. For sure you've upped xceivers on your hdfs cluster and you've
>> upped the file descriptors as per the 'Getting Started'? (Sorry, have
>> to ask).
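>> (If you haven't, the usual edits are the xceiver count in
>> hdfs-site.xml on every datanode, plus the open-file limit for the
>> user running the daemons. A rough sketch -- the 2048 and 32768 values
>> below are common starting points, not something read off your
>> cluster:
>>
>>   <!-- hdfs-site.xml; note the property's historical misspelling -->
>>   <property>
>>     <name>dfs.datanode.max.xcievers</name>
>>     <value>2048</value>
>>   </property>
>>
>>   # /etc/security/limits.conf -- assuming the daemons run as 'root',
>>   # as your log file names suggest
>>   root  -  nofile  32768
>>
>> The datanodes need a restart, and the new ulimit only shows up on a
>> fresh login.)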
>>
>> Can I have more of the logs? Can I have all of the namenode log, all
>> of the master log, and 209's log? This rapidshare thing is fine with
>> me. I don't mind retrying.
>>
>> Sorry it took me a while to get to this.
>> St.Ack
>>
>> On Wed, Mar 17, 2010 at 8:32 PM, Zheng Lv <lvzheng19800...@gmail.com>
>> wrote:
>>
>>> Hello Stack,
>>> >Sorry. It's taken me a while. Let me try and get to this this
>>> >evening.
>>> Is it downloading the log files that takes you so long? I'm sorry; I
>>> used to upload files to skydrive, but now we can't access that
>>> website. Is there a file host you can download from quickly? I can
>>> upload to it.
>>> LvZheng
>>> 2010/3/18 Stack <saint....@gmail.com>
>>>
>>>> Sorry. It's taken me a while. Let me try and get to this this
>>>> evening.
>>>>
>>>> Thank you for your patience.
>>>>
>>>> On Mar 17, 2010, at 2:29 AM, Zheng Lv <lvzheng19800...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello Stack,
>>>>> Did you receive my mail? It looks like you didn't.
>>>>> LvZheng
>>>>>
>>>>> 2010/3/16 Zheng Lv <lvzheng19800...@gmail.com>
>>>>>
>>>>>> Hello Stack,
>>>>>> I have uploaded parts of the logs from the master, regionserver
>>>>>> 208, and regionserver 210 to:
>>>>>> http://rapidshare.com/files/363988384/master_207_log.txt.html
>>>>>> http://rapidshare.com/files/363988673/regionserver_208_log.txt.html
>>>>>> http://rapidshare.com/files/363988819/regionserver_210_log.txt.html
>>>>>> I noticed some LeaseExpiredExceptions and "2010-03-15
>>>>>> 16:06:32,864 ERROR
>>>>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>>>>> Compaction/Split failed for region ..." before 17 o'clock. Did
>>>>>> these lead to the error? Why did they happen? How can we avoid
>>>>>> them?
>>>>>> Thanks.
>>>>>> LvZheng
>>>>>> 2010/3/16 Stack <st...@duboce.net>
>>>>>>
>>>>>>> Maybe just the master log from around this time would be enough
>>>>>>> to figure out the story.
>>>>>>> St.Ack
>>>>>>>
>>>>>>> On Mon, Mar 15, 2010 at 10:04 PM, Stack <st...@duboce.net> wrote:
>>>>>>>
>>>>>>>> Hey Zheng:
>>>>>>>>
>>>>>>>> On Mon, Mar 15, 2010 at 8:16 PM, Zheng Lv
>>>>>>>> <lvzheng19800...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello Stack,
>>>>>>>>> After we got these exceptions, we restarted the cluster and
>>>>>>>>> reran the job that had failed, and it succeeded.
>>>>>>>>> Now when we access
>>>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894, we get
>>>>>>>>> "Cannot access
>>>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894: No such
>>>>>>>>> file or directory."
>>>>>>>>
>>>>>>>> So, that would seem to indicate that the reference was in
>>>>>>>> memory only; that file was not in the filesystem. You could
>>>>>>>> have tried closing that region. It would also have been
>>>>>>>> interesting to dig up that region's history, to try to figure
>>>>>>>> out how it came to hold in memory a reference to a file since
>>>>>>>> removed.
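>>>>>>>> (For next time -- closing it from the shell makes the master
>>>>>>>> reassign the region, and the regionserver reopens it with fresh
>>>>>>>> store file references. A sketch, if memory serves, for the 0.20
>>>>>>>> shell, with the region name elided; you'd paste the full name
>>>>>>>> from the master UI:
>>>>>>>>
>>>>>>>>   hbase> close_region 'summary,SITE_0000000032...,1268640385017'
>>>>>>>>
>>>>>>>> )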
>>>>>>>>
>>>>>>>>> The messages about this file in the namenode logs are here:
>>>>>>>>> http://rapidshare.com/files/363938595/log.txt.html
>>>>>>>>
>>>>>>>> This is interesting. Do you have regionserver logs from 209,
>>>>>>>> 208, and 210 for the corresponding times?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>>> The job that failed started at about 17 o'clock.
>>>>>>>>> By the way, the hadoop version we are using is 0.20.1 and the
>>>>>>>>> hbase version is 0.20.3.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> LvZheng
>>>>>>>>> 2010/3/16 Stack <st...@duboce.net>
>>>>>>>>>
>>>>>>>>>> Can you get that file from hdfs?
>>>>>>>>>>
>>>>>>>>>>   ./bin/hadoop fs -get /hbase/summary/1491233486/metrics/5046821377427277894
>>>>>>>>>>
>>>>>>>>>> Does it look wholesome? Is it empty?
>>>>>>>>>>
>>>>>>>>>> What if you trace the life of that file in the regionserver
>>>>>>>>>> logs or, probably better, over in the namenode log? If you
>>>>>>>>>> move this file aside, does the region deploy?
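>>>>>>>>>> (Something along these lines -- assuming you run from the
>>>>>>>>>> hadoop install directory and the namenode log sits under
>>>>>>>>>> logs/:
>>>>>>>>>>
>>>>>>>>>>   # does the namenode still list the file, and at what size?
>>>>>>>>>>   ./bin/hadoop fs -ls /hbase/summary/1491233486/metrics/
>>>>>>>>>>
>>>>>>>>>>   # copy it out and sanity-check the local copy
>>>>>>>>>>   ./bin/hadoop fs -get /hbase/summary/1491233486/metrics/5046821377427277894 /tmp/
>>>>>>>>>>   ls -l /tmp/5046821377427277894
>>>>>>>>>>
>>>>>>>>>>   # trace its life: creation, block allocations, deletion
>>>>>>>>>>   grep 5046821377427277894 logs/hadoop-*-namenode-*.log*
>>>>>>>>>>
>>>>>>>>>>   # move it aside (rather than deleting it) to see whether
>>>>>>>>>>   # the region then deploys
>>>>>>>>>>   ./bin/hadoop fs -mv /hbase/summary/1491233486/metrics/5046821377427277894 /tmp-aside-5046821377427277894
>>>>>>>>>>
>>>>>>>>>> )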
>>>>>>>>>>
>>>>>>>>>> St.Ack
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv
>>>>>>>>>> <lvzheng19800...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello Everyone,
>>>>>>>>>>> Recently we have often gotten these in our client logs:
>>>>>>>>>>>
>>>>>>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException:
>>>>>>>>>>> Trying to contact region server 172.16.1.208:60020 for region
>>>>>>>>>>> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
>>>>>>>>>>> row
>>>>>>>>>>> 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
>>>>>>>>>>> but failed after 10 attempts.
>>>>>>>>>>> Exceptions:
>>>>>>>>>>> java.io.IOException: java.io.IOException: Cannot open filename
>>>>>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
>>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
>>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
>>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
>>>>>>>>>>>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>>>>>>>>>>>   at java.io.DataInputStream.read(DataInputStream.java:132)
>>>>>>>>>>>   at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
>>>>>>>>>>>   at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>>>>>>>>>>>   at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
>>>>>>>>>>>   at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
>>>>>>>>>>>   at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
>>>>>>>>>>>   at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
>>>>>>>>>>>   at org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
>>>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
>>>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
>>>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
>>>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
>>>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
>>>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
>>>>>>>>>>>   at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>>>>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>>>>>>
>>>>>>>>>>> Is there any way to fix this problem? Or is there anything we
>>>>>>>>>>> can do, even manually, to relieve it?
>>>>>>>>>>> Any suggestions?
>>>>>>>>>>> Thank you.
>>>>>>>>>>> LvZheng