Thread dump of TaskTracker: http://gist.github.com/363898
Thread dump of RegionServer: http://gist.github.com/363899

Not clear what's going on. I'm going to have a look at HBASE-2180...

joost.

On Sat, Apr 10, 2010 at 10:41 PM, Stack <[email protected]> wrote:

> On Sat, Apr 10, 2010 at 4:38 PM, Joost Ouwerkerk <[email protected]> wrote:
> > We're mapping a table with about 2 million rows in 100 regions on 40 nodes.
> > In each map, we're doing a random read on the same table. We're
> > encountering a situation that looks a lot like deadlock. When the job is
> > launched, some of the tasktrackers appear to get blocked in doing the first
> > random read. The only trace we get is an eventual Unknown Scanner Exception
> > in the RegionServer log, at which point the task is actually reported as
> > successfully completed by MapReduce (1 row processed). There is no error in
> > the task's log. The job completes as SUCCESSFUL with an incomplete number
> > of rows. In the worst case scenario, we've actually seen ALL the
> > tasktrackers encounter this problem; the job completes successfully with 100
> > rows processed (1 per region).
> >
> Any chance of a threaddump on the problematic RS at the time? Can
> you even figure the culprit? There is a known deadlock that can
> happen writing (HBASE-2322) but this seems like something else. If
> it's a deadlock, often the JVM can recognize it as such and it'll be detailed
> on the tail of the threaddump. Todd has been messing too w/ jcarder
> (sp)? That found HBASE-2322 but that's all it found I believe (I need
> to run it on the next release candidate before it becomes a release
> candidate). Maybe you're running into very slow reads because you
> don't have HBASE-2180?
>
> St.Ack
>
> > When we remove the code that does the random read in the map, there are no
> > problems.
> >
> > Anyone? This is driving me crazy because I can't reproduce it locally (it
> > only seems to be a problem in a distributed environment with many nodes) and
> > because there is no stacktrace besides the scanner exception (which is
> > clearly a symptom, not a cause).
> >
> > j
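
For anyone checking dumps like the two linked above: as Stack says, when the JVM recognizes a deadlock it reports it at the tail of the thread dump. Below is a rough sketch of how one might scan a saved dump for that report. It assumes the dump came from a HotSpot JVM, which introduces the section with a line starting "Found one Java-level deadlock"; the function name and the sample text are made up for illustration and are not taken from the gists.

```python
import re

# Assumption: HotSpot-style thread dump. When the JVM detects a lock cycle it
# appends a section that starts with a line such as
# "Found one Java-level deadlock:".
DEADLOCK_MARKER = re.compile(r"^Found (one|\d+) Java-level deadlock", re.MULTILINE)


def find_deadlock_section(dump_text):
    """Return the deadlock report from a thread dump, or None if there is none."""
    match = DEADLOCK_MARKER.search(dump_text)
    if match is None:
        return None
    # Everything from the marker to the end of the dump is the report.
    return dump_text[match.start():].strip()


if __name__ == "__main__":
    # Hypothetical, heavily shortened dump text used only to exercise the function.
    sample = (
        '"IPC Server handler 1" daemon prio=10 waiting for monitor entry\n'
        "\n"
        "Found one Java-level deadlock:\n"
        "=============================\n"
        '"IPC Server handler 1":\n'
        "  waiting to lock monitor (details omitted)\n"
    )
    report = find_deadlock_section(sample)
    print(report if report else "no deadlock section in this dump")
```

A missing section does not prove there is no hang: the JVM only reports lock cycles it can detect, so threads that are merely slow or waiting on a remote call will not show up this way.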
