On Sat, Apr 10, 2010 at 4:38 PM, Joost Ouwerkerk <[email protected]> wrote:
> We're mapping a table with about 2 million rows in 100 regions on 40 nodes.
> In each map, we're doing a random read on the same table. We're
> encountering a situation that looks a lot like deadlock. When the job is
> launched, some of the tasktrackers appear to get blocked in doing the first
> random read. The only trace we get is an eventual Unknown Scanner Exception
> in the RegionServer log, at which point the task is actually reported as
> successfully completed by MapReduce (1 row processed). There is no error in
> the task's log. The job completes as SUCCESSFUL with an incomplete number
> of rows. In the worst case scenario, we've actually seen ALL the
> tasktrackers encounter this problem; the job completes successfully with 100
> rows processed (1 per region).

Any chance of a thread dump from the problematic RS at the time? Can you even figure out which one is the culprit? There is a known deadlock that can happen when writing (HBASE-2322), but this seems like something else. If it's a deadlock, the JVM can often recognize it as such, and it will be detailed at the tail of the thread dump. Todd has also been experimenting with jcarder (sp?). That found HBASE-2322, but that's all it found, I believe (I need to run it on the next release candidate before it becomes a release candidate).

Maybe you're running into very slow reads because you don't have HBASE-2180?

St.Ack

>
> When we remove the code that does the random read in the map, there are no
> problems.
>
> Anyone? This is driving me crazy because I can't reproduce it locally (it
> only seems to be a problem in a distributed environment with many nodes) and
> because there is no stacktrace besides the scanner exception (which is
> clearly a symptom, not a cause).
>
> j
>
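
On the thread dump point above: the deadlock report that the JVM appends to a thread dump is also available programmatically through the standard java.lang.management API. The following is only a rough sketch of that check, not anything HBase ships; the class name DeadlockCheck and the method describeDeadlocks are made up for illustration. It inspects only the JVM it runs in, so for a live regionserver the usual route is still an external dump (for example with jstack against the process id).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: asks the current JVM whether it
// has detected any threads deadlocked on monitors or ownable synchronizers.
final class DeadlockCheck {

    /** Returns one line per deadlocked thread; an empty list if none are detected. */
    static List<String> describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock is found
        List<String> report = new ArrayList<>();
        if (ids == null) {
            return report;
        }
        // Request locked monitors and synchronizers so lock owners are filled in.
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info == null) {
                continue; // thread exited between the two calls
            }
            report.add(info.getThreadName()
                    + " waiting on " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> report = describeDeadlocks();
        if (report.isEmpty()) {
            System.out.println("No deadlocked threads detected in this JVM.");
        } else {
            report.forEach(System.out::println);
        }
    }
}
```

An empty result only rules out a lock cycle that the JVM can see. Threads that are merely blocked on slow I/O or on a remote call will not show up here, which is why the full dump is still worth having.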
