Freudian slip :)

-eran

On Thu, Apr 5, 2012 at 16:52, Ted Yu <yuzhih...@gmail.com> wrote:

> Thanks for writing back.
>
> I guess you meant 'things are now operating well', below :-)
>
> On Thu, Apr 5, 2012 at 6:25 AM, Eran Kutner <e...@gigya.com> wrote:
>
> > As promised I'm writing back to update the list.
> > Seems that after upgrading to cdh3u3 of the hadoop cluster and zookeeper
> > ensemble (hadoop alone wasn't enough) things are no operating well with no
> > HDFS errors in the logs. I've also set
> > hbase.regionserver.logroll.errors.tolerated to 3 just in case. Now that the
> > log is clean a new exception shows up but I'll open a separate thread about
> > it.
> >
> > Thanks everyone.
> >
> > -eran
> >
> > On Wed, Mar 28, 2012 at 23:06, Eran Kutner <e...@gigya.com> wrote:
> >
> > > hmmm... I couldn't find it either, so I've looked at the history of that
> > > file and sure enough a few check-ins back it had that message.
> > > I have no idea how something like this could happen. I know I had some
> > > merge issues when I first got the latest version and built that project but
> > > I've then reverted all local changes and rebuilt. The only thing I can
> > > imagine is that the previous compiled class file was not modified and it
> > > was the one that got included in the JAR, although I don't really know how
> > > can it happen.
> > >
> > > -eran
> > >
> > > On Wed, Mar 28, 2012 at 18:53, Ted Yu <yuzhih...@gmail.com> wrote:
> > >
> > >> Eran:
> > >> The error indicated some zookeeper related issue.
> > >> Do you see KeeperException after the Error log ?
> > >>
> > >> I searched 90 codebase but couldn't find the exact log phrase:
> > >>
> > >> zhihyu$ find src/main -name '*.java' -exec grep "getting node's version in CLOSI" {} \; -print
> > >> zhihyu$ find src/main -name '*.java' -exec grep 'Error getting ' {} \; -print
> > >>
> > >> Cheers
> > >>
> > >> On Wed, Mar 28, 2012 at 9:45 AM, Eran Kutner <e...@gigya.com> wrote:
> > >>
> > >> > I don't see any prior HDFS issues in the 15 minutes before this exception.
> > >> > The logs on the datanode reported as problematic are clean as well.
> > >> > However, I now see the log is full of errors like this:
> > >> > 2012-03-28 00:15:05,358 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of gs_users,731481|Sn쒪㝨眳ԫ䂣⫰==,1331226388691.29929cb2200b3541ead85e17b836ade5.
> > >> > 2012-03-28 00:15:05,359 WARN org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Error getting node's version in CLOSING state, aborting close of gs_users,731481|Sn쒪㝨眳ԫ䂣⫰==,1331226388691.29929cb2200b3541ead85e17b836ade5.
> > >> >
> > >> > -eran
> > >> >
> > >> > On Wed, Mar 28, 2012 at 18:38, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> > >> >
> > >> > > Any chance we can see what happened before that too? Usually you
> > >> > > should see a lot more HDFS spam before getting that all the datanodes
> > >> > > are bad.
> > >> > >
> > >> > > J-D
> > >> > >
> > >> > > On Wed, Mar 28, 2012 at 4:28 AM, Eran Kutner <e...@gigya.com> wrote:
> > >> > > > Hi,
> > >> > > >
> > >> > > > We have region server sporadically stopping under load due supposedly to
> > >> > > > errors writing to HDFS. Things like:
> > >> > > >
> > >> > > > 2012-03-28 00:37:11,210 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
> > >> > > > java.io.IOException: All datanodes 10.1.104.10:50010 are bad. Aborting..
> > >> > > >
> > >> > > > It's happening with a different region server and data node every time, so
> > >> > > > it's not a problem with one specific server and there doesn't seem to be
> > >> > > > anything really wrong with either of them. I've already increased the file
> > >> > > > descriptor limit, datanode xceivers and data node handler count. Any idea
> > >> > > > what can be causing these errors?
> > >> > > >
> > >> > > > A more complete log is here: http://pastebin.com/wC90xU2x
> > >> > > >
> > >> > > > Thanks.
> > >> > > >
> > >> > > > -eran
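
For anyone finding this thread later: the setting mentioned in the April 5 update, hbase.regionserver.logroll.errors.tolerated, would normally be set in hbase-site.xml on the region servers. The fragment below is only a sketch of what that might look like, using the value 3 quoted in the thread. The file location and the need to restart the region servers for it to take effect are assumptions about a typical deployment, not something stated in the thread.

```xml
<!-- hbase-site.xml (sketch; the value 3 is the one quoted in the thread) -->
<configuration>
  <property>
    <!-- Number of consecutive WAL log-roll errors the region server
         tolerates before aborting. -->
    <name>hbase.regionserver.logroll.errors.tolerated</name>
    <value>3</value>
  </property>
</configuration>
```

Note that in the thread this setting was applied "just in case"; the upgrade of the hadoop cluster and the zookeeper ensemble to cdh3u3 is what was reported to clear the HDFS errors.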