There's something odd with your jstack: most of the lock ids are
missing... anyway, there's one I can trace:
"IPC Server handler 8 on 60020" daemon prio=10 tid=aaabcc31800
nid=0x3219 waiting for monitor entry [0x0000000044c8e000]
java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:2322)
- waiting to lock <fca7e28> (a
org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1823)
"IPC Server handler 6 on 60020" daemon prio=10 tid=aaabc9a5000
nid=0x3217 runnable [0x0000000044a8c000]
...
- locked <fca7e28> (a
org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner)
at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:2322)
- locked <fca7e28> (a
org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner)
It clearly shows that two handlers are trying to use the same
RegionScanner object. It would be nice to have a stack dump with
correct lock information, though...
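In the meantime, the client-side pattern I'd expect is a single thread
driving the ResultScanner and only handing the Result objects off to the
workers, never the scanner (or the HTable) itself. Very rough sketch against
the 0.90 client, untested, and processResult is just a placeholder for your
merge/write-back step:

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ScanHandoff {
  // One thread owns the scanner; the workers only ever see Result objects.
  public static void scanAndProcess(HTable table, Scan scan) throws IOException {
    ExecutorService workers = Executors.newFixedThreadPool(8);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (final Result r : scanner) {
        workers.submit(new Runnable() {
          public void run() {
            processResult(r);  // placeholder for the merge + write-back work
          }
        });
      }
    } finally {
      scanner.close();    // always release the server-side scanner
      workers.shutdown();
    }
  }

  private static void processResult(Result r) {
    // application-specific work goes here
  }
}

If anything other than that one thread ends up calling next() on the same
scanner, you could get exactly the kind of contention the jstack above shows.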
Regarding your code, would it be possible to see the unaltered
version? Feel free to send it directly to me, and if I do find
something I'll post the findings back here.
Thanks,
J-D
On Fri, Sep 16, 2011 at 10:58 PM, Douglas Campbell <[email protected]> wrote:
> Answers below.
>
>
>
> ________________________________
> From: Jean-Daniel Cryans <[email protected]>
> To: [email protected]
> Sent: Friday, September 16, 2011 2:08 PM
> Subject: Re: REcovering from SocketTimeout during scan in 90.3
>
> On Fri, Sep 16, 2011 at 12:17 PM, Douglas Campbell <[email protected]> wrote:
>> The min/max keys are for each region right? Are they pretty big?
>>
>> doug: Typically around 100 keys, and each key is 24 bytes
>
> A typical region would be like - stores=4, storefiles=4,
> storefileSizeMB=1005, memstoreSizeMB=46, storefileIndexSizeMB=6
>
> Sorry, I meant to ask how big the regions were, not the rows.
>
>> Are you sharing scanners between multiple threads?
>>
>> doug: no - but each Result from the scan is passed to a thread to merge with
>> input and write back.
>
> Yeah, this really isn't what I'm reading, though... Would it be possible
> to see a full stack trace that contains those BLOCKED threads? (Please
> put it in a pastebin.)
>
> http://kpaste.net/02f67d
>
>>> I had one or more runs where this error occurred and I wasn't taking care to
>>> call scanner.close()
>
> The other thing I was thinking, did you already implement the re-init
> of the scanner? If so, what's the code like?
>
>>>> The code traps runtime exceptions around the scanner iterator (pseudoish):
> while (toprocess.size() > 0 && !donescanning) {
>     ResultScanner scanner = table.getScanner(buildScan(toprocess));
>     try {
>         for (Result r : scanner) {
>             toprocess.remove(r.getRow());
>             // fork thread with r
>             if (toprocess.size() == 0) donescanning = true;
>         }
>     } catch (RuntimeException e) {
>         if (e.getCause() instanceof IOException) { // probably an hbase exception
>             // fall through: the while loop opens a fresh scanner on the remaining keys
>         } else {
>             donescanning = true;
>         }
>     } finally {
>         scanner.close();
>     }
> }
>
>>>> buildScan takes the remaining keys and crams them into the filter.
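For what it's worth, here's the kind of buildScan I'm picturing, with
FilterList/RowFilter just standing in for whatever you actually use (pure
guesswork on my end, so correct me if yours looks different):

import java.util.Set;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.RowFilter;

// Hypothetical buildScan: OR together one RowFilter per outstanding key.
Scan buildScan(Set<byte[]> toprocess) {
    FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ONE);
    for (byte[] key : toprocess) {
        filters.addFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
            new BinaryComparator(key)));
    }
    Scan scan = new Scan();
    scan.setFilter(filters);
    return scan;
}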
>
> Thx,
>
> J-D