[ https://issues.apache.org/jira/browse/HBASE-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356395#comment-15356395 ]
Yu Li commented on HBASE-16132:
-------------------------------

Let me add some background here:

This is a problem we ran into on our production cluster, which runs a patched 1.1.2 version. The steps below reproduce it reliably:

1. Special settings for the test:
{noformat}
regionserver.handler.count => 1
hbase.ipc.server.max.callqueue.length => 1
hbase.client.scanner.timeout.period => 3000
{noformat}
2. Load enough data into tableA using YCSB
3. Simulate a heavy load that keeps the call queue occupied and the RS busy: 4 physical clients, each with 32 YCSB processes, each process with 100 threads, doing random reads against tableA
4. Meanwhile, issue a scan request against tableA using the attached class (will attach the file later)

I'm not sure, but I think HBASE-16074 might be caused by the same problem, JFYI [~mantonov] [~eclark]

> Scan does not return all the result when regionserver is busy
> -------------------------------------------------------------
>
>                 Key: HBASE-16132
>                 URL: https://issues.apache.org/jira/browse/HBASE-16132
>             Project: HBase
>          Issue Type: Bug
>            Reporter: binlijin
>         Attachments: HBASE-16132.patch, HBASE-16132_v2.patch, HBASE-16132_v3.patch, HBASE-16132_v3.patch
>
>
> We have found a corner case: when the regionserver stays busy for a long time, some scanners may return null even though they have not scanned all the data.
> In ScannerCallableWithReplicas there is a case that is not handled correctly: when cs.poll times out without returning any result, the method returns null, so the scan receives a null result and ends.
> {code}
>     try {
>       Future<Pair<Result[], ScannerCallable>> f = cs.poll(timeout, TimeUnit.MILLISECONDS);
>       if (f != null) {
>         Pair<Result[], ScannerCallable> r = f.get(timeout, TimeUnit.MILLISECONDS);
>         if (r != null && r.getSecond() != null) {
>           updateCurrentlyServingReplica(r.getSecond(), r.getFirst(), done, pool);
>         }
>         return r == null ? null : r.getFirst(); // great we got an answer
>       }
>     } catch (ExecutionException e) {
>       RpcRetryingCallerWithReadReplicas.throwEnrichedException(e, retries);
>     } catch (CancellationException e) {
>       throw new InterruptedIOException(e.getMessage());
>     } catch (InterruptedException e) {
>       throw new InterruptedIOException(e.getMessage());
>     } catch (TimeoutException e) {
>       throw new InterruptedIOException(e.getMessage());
>     } finally {
>       // We get there because we were interrupted or because one or more of the
>       // calls succeeded or failed. In all case, we stop all our tasks.
>       cs.cancelAll();
>     }
>     return null; // unreachable
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
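
The timeout path described in the issue can be illustrated outside HBase. The following is a minimal sketch, not the HBase code and not the attached patch: it uses the JDK's ExecutorCompletionService as a stand-in for the `cs` object in the quoted snippet, and the class and method names (`PollTimeoutSketch`, `pollOrNull`, `pollOrThrow`, `describeOutcomes`) are hypothetical. It shows that a timed `poll` returns null when no task has completed, so a caller that falls through to `return null` cannot tell a timeout apart from a finished scan. Raising an exception on timeout is one possible way to make the two cases distinguishable.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class PollTimeoutSketch {

  /** Mirrors the quoted pattern: a poll timeout falls through and yields null. */
  static String[] pollOrNull(CompletionService<String[]> cs, long timeoutMs)
      throws InterruptedException, ExecutionException {
    Future<String[]> f = cs.poll(timeoutMs, TimeUnit.MILLISECONDS);
    if (f != null) {
      return f.get();
    }
    return null; // reached on timeout; looks the same as "no more rows" to the caller
  }

  /** Hypothetical alternative: surface the timeout as an exception instead of null. */
  static String[] pollOrThrow(CompletionService<String[]> cs, long timeoutMs)
      throws InterruptedException, ExecutionException, IOException {
    Future<String[]> f = cs.poll(timeoutMs, TimeUnit.MILLISECONDS);
    if (f == null) {
      throw new IOException("no response within " + timeoutMs + " ms");
    }
    return f.get();
  }

  /** Runs both variants against a task that outlasts the timeout. */
  static String describeOutcomes() {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      CompletionService<String[]> cs = new ExecutorCompletionService<>(pool);
      Callable<String[]> slowScan = () -> {
        Thread.sleep(2000); // stands in for a busy region server
        return new String[] {"row1"};
      };
      cs.submit(slowScan);

      String first = pollOrNull(cs, 100) == null ? "null" : "rows";
      String second;
      try {
        pollOrThrow(cs, 100);
        second = "rows";
      } catch (IOException e) {
        second = "timeout";
      }
      return first + "," + second;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdownNow(); // stop the sleeping task so the JVM can exit
    }
  }

  public static void main(String[] args) {
    System.out.println(describeOutcomes());
  }
}
```

Running `main` prints `null,timeout`: the first variant silently returns null while the task is still running, and the second reports the timeout explicitly.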