I’m getting data from HBase using a large Spark cluster with a parallelism of 
nearly 400. The query fails quite often with the message below. Sometimes a 
retry works, and sometimes the job ultimately fails (see below). 

If I reduce parallelism in Spark, it slows other parts of the algorithm 
unacceptably. I have also experimented with very large RPC/scanner timeouts of 
many minutes, to no avail.
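
For reference, here is a minimal sketch of the kind of timeout settings meant above. The post does not say which properties were actually changed, so this assumes the standard HBase client properties `hbase.rpc.timeout` and `hbase.client.scanner.timeout.period`; the class name, helper name, and the 10-minute value are purely illustrative. The resulting pairs would be copied onto the Hadoop `Configuration` handed to `TableInputFormat`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ScanTimeoutSettings {
    // Hypothetical helper: builds the client-side timeout properties.
    // Assumption: both timeouts are set to the same value, in milliseconds.
    static Map<String, String> timeoutSettings(int timeoutMinutes) {
        long timeoutMs = timeoutMinutes * 60_000L;
        Map<String, String> props = new LinkedHashMap<>();
        // Upper bound for a single RPC from the client to a region server.
        props.put("hbase.rpc.timeout", Long.toString(timeoutMs));
        // Scanner timeout (lease) period between successive next() calls.
        props.put("hbase.client.scanner.timeout.period", Long.toString(timeoutMs));
        return props;
    }

    public static void main(String[] args) {
        // Print key=value pairs; in a real job each pair would be applied
        // with conf.set(key, value) before creating the HBase RDD.
        timeoutSettings(10).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With 10 minutes, both properties come out as 600000 ms.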

Any clues about what to look for, or what may be set up wrong in my tables?

Job aborted due to stage failure: Task 44 in stage 147.0 failed 4 times, most 
recent failure: Lost task 44.3 in stage 147.0 (TID 24833, 
ip-172-16-3-9.eu-central-1.compute.internal): 
org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of 
OutOfOrderScannerNextException: was there a rpc timeout?
    at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:403)
    at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:232)
    at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
    at
