There's something odd with your jstack: most of the lock ids are
missing... anyway, there's one I can trace:
"IPC Server handler 8 on 60020" daemon prio=10 tid=aaabcc31800
nid=0x3219 waiting for monitor entry [0x0000000044c8e000]
java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:2322)
- waiting to lock <fca7e28> (a
org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1823)
"IPC Server handler 6 on 60020" daemon prio=10 tid=aaabc9a5000
nid=0x3217 runnable [0x0000000044a8c000]
...
- locked <fca7e28> (a
org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner)
at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:2322)
- locked <fca7e28> (a
org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner)
It clearly shows that two handlers are trying to use the same
RegionScanner object. It would be nice to have a stack dump with
correct lock information, though...
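In the meantime, the client-side pattern I'd expect is a single thread
driving the ResultScanner and only handing the Result objects off to the
workers, never the scanner (or the HTable) itself. Very rough sketch against
the 0.90 client, untested, and processResult is just a placeholder for your
merge/write-back step:

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ScanHandoff {
  // One thread owns the scanner; the workers only ever see Result objects.
  public static void scanAndProcess(HTable table, Scan scan) throws IOException {
    ExecutorService workers = Executors.newFixedThreadPool(8);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (final Result r : scanner) {
        workers.submit(new Runnable() {
          public void run() {
            processResult(r);  // placeholder for the merge + write-back work
          }
        });
      }
    } finally {
      scanner.close();    // always release the server-side scanner
      workers.shutdown();
    }
  }

  private static void processResult(Result r) {
    // application-specific work goes here
  }
}

If anything other than that one thread ends up calling next() on the same
scanner, you could get exactly the kind of contention the jstack above shows.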
Regarding your code, would it be possible to see the unaltered
version? Feel free to send it directly to me, and if I do find
something I'll post the findings back here.
Thanks,
J-D
On Fri, Sep 16, 2011 at 10:58 PM, Douglas Campbell <[email protected]> wrote:
> Answers below.
>
>
>
> ________________________________
> From: Jean-Daniel Cryans <[email protected]>
> To: [email protected]
> Sent: Friday, September 16, 2011 2:08 PM
> Subject: Re: REcovering from SocketTimeout during scan in 90.3
>
> On Fri, Sep 16, 2011 at 12:17 PM, Douglas Campbell <[email protected]> wrote:
>> The min/max keys are for each region right? Are they pretty big?
>>
>> doug: Typically around 100 keys, and each key is 24 bytes
>
> A typical region would be like - stores=4, storefiles=4,
> storefileSizeMB=1005, memstoreSizeMB=46, storefileIndexSizeMB=6
>
> Sorry, I meant to ask how big the regions were, not the rows.
>
>> Are you sharing scanners between multiple threads?
>>
>> doug: no - but each Result from the scan is passed to a thread to merge with
>> input and write back.
>
> Yeah, this really isn't what I'm reading, though... Would it be possible
> to see a full stack trace that contains those BLOCKED threads? (Please
> put it in a pastebin.)
>
> http://kpaste.net/02f67d
>
>>> I had one or more runs where this error occurred and I wasn't taking care to
>>> call scanner.close()
>
> The other thing I was thinking, did you already implement the re-init
> of the scanner? If so, what's the code like?
>
>>>> The code traps runtime exceptions around the scanner iterator (pseudoish):
> while (toprocess.size() > 0 && !donescanning) {
>     ResultScanner scanner = table.getScanner(buildScan(toprocess));
>     try {
>         for (Result r : scanner) {
>             toprocess.remove(r.getRow());
>             // fork thread with r
>             if (toprocess.size() == 0) donescanning = true;
>         }
>     } catch (RuntimeException e) {
>         if (e.getCause() instanceof IOException) { // probably an hbase exception
>             // fall through: the while loop opens a fresh scanner on the remaining keys
>         } else {
>             donescanning = true;
>         }
>     } finally {
>         scanner.close();
>     }
> }
>
>>>> buildScan takes the remaining keys and crams them into the filter.
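For what it's worth, here's the kind of buildScan I'm picturing, with
FilterList/RowFilter just standing in for whatever you actually use (pure
guesswork on my end, so correct me if yours looks different):

import java.util.Set;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.RowFilter;

// Hypothetical buildScan: OR together one RowFilter per outstanding key.
Scan buildScan(Set<byte[]> toprocess) {
    FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ONE);
    for (byte[] key : toprocess) {
        filters.addFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
            new BinaryComparator(key)));
    }
    Scan scan = new Scan();
    scan.setFilter(filters);
    return scan;
}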
>
> Thx,
>
> J-D