[ https://issues.apache.org/jira/browse/KUDU-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186743#comment-17186743 ]
Alexey Serbin commented on KUDU-3169:
-------------------------------------

This issue has been spotted elsewhere as well. It seems to be specific to the Java Kudu client; the C++ and Python Kudu clients don't have this issue.

Below is a snippet from the client-side log:

{noformat}
20/07/20 00:26:41 INFO client.AsyncKuduClient: Invalidating location edd230034290421aa36bbf83c4b3b97e(tserver-00.local:7050) for tablet a3dbde5879d3486fa68f442dff1b86d5: Service unavailable: Scan request on kudu.tserver.TabletServerService from 10.80.34.23:54724 dropped due to backpressure. The service queue is full; it has 50 items.
20/07/20 00:26:42 WARN client.AsyncKuduScanner: a3dbde5879d3486fa68f442dff1b86d5@[592d79bf710046a88bf6da9799fe26d6(terver-01.local:7050),d8677f078c754b1dac4a1aad2c5c1c7e(tserver-01.local:7050)] pretends to not know KuduScanner(table=impala::t00.p00, tablet=null, scannerId="33e4c93f3ca84ef8b5cd40c4846573f7", scanRequestTimeout=30000)
org.apache.kudu.client.NonRecoverableException: Scanner 33e4c93f3ca84ef8b5cd40c4846573f7 not found (it may have expired)
{noformat}

The tablet server at {{tserver-00.local}} drops the scan RPC because its service queue is full, and the Kudu client moves on to the next tablet server, {{tserver-01.local}}, sending it a scan continuation request (not a new scan request). The tablet server at {{tserver-01.local}} responds with a {{Status::NotFound}} status and the specific error code {{TabletServerErrorPB::SCANNER_EXPIRED}}, hinting that the scanner with identifier {{33e4c93f3ca84ef8b5cd40c4846573f7}} might have already expired (see [the server-side code|https://github.com/apache/kudu/blob/c590a05778443bb6112e831d0b0ad0dce4b74724/src/kudu/tserver/scanners.cc#L170-L175] for details).

In fact, the scanner has not expired. The tablet server at {{tserver-01.local}] cannot find it because the client never started a scan operation with that tablet server; it started the scan with the tablet server at {{tserver-00.local}}.
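To make the sequence above concrete, here is a minimal toy model in Java. It is only a sketch: {{ToyTabletServer}}, {{Outcome}}, {{continueWithFailover}} and {{continuePinned}} are hypothetical names invented for illustration, not part of the Kudu client or server, and the pinned variant is not meant to describe the actual fix. The only assumption taken from the log is that a scanner ID is known solely to the tablet server that opened the scan.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these types exist in the Kudu client or server.
class ScannerAffinitySketch {

    enum Outcome { OK, SERVICE_UNAVAILABLE, SCANNER_NOT_FOUND }

    /** Hypothetical stand-in for a tablet server that tracks its own open scanners. */
    static final class ToyTabletServer {
        final String name;
        private final Set<String> openScanners = new HashSet<>();
        private int callsToReject; // upcoming calls to drop as "service queue is full"

        ToyTabletServer(String name) { this.name = name; }

        void openScanner(String scannerId) { openScanners.add(scannerId); }

        void rejectNextCalls(int n) { callsToReject = n; }

        /** Handles a scan continuation request for an existing scanner ID. */
        Outcome continueScan(String scannerId) {
            if (callsToReject > 0) {
                callsToReject--;
                return Outcome.SERVICE_UNAVAILABLE;
            }
            return openScanners.contains(scannerId) ? Outcome.OK : Outcome.SCANNER_NOT_FOUND;
        }
    }

    /** The behaviour seen in the log: on overload, send the continuation to the next replica. */
    static Outcome continueWithFailover(List<ToyTabletServer> replicas, String scannerId) {
        Outcome last = Outcome.SERVICE_UNAVAILABLE;
        for (ToyTabletServer server : replicas) {
            last = server.continueScan(scannerId);
            if (last != Outcome.SERVICE_UNAVAILABLE) {
                return last;
            }
        }
        return last;
    }

    /** Illustrative alternative: retry only the server that owns the scanner (backoff omitted). */
    static Outcome continuePinned(ToyTabletServer owner, String scannerId, int maxAttempts) {
        Outcome last = Outcome.SERVICE_UNAVAILABLE;
        for (int attempt = 0; attempt < maxAttempts && last == Outcome.SERVICE_UNAVAILABLE; attempt++) {
            last = owner.continueScan(scannerId);
        }
        return last;
    }

    static Outcome runFailoverScenario() {
        ToyTabletServer ts0 = new ToyTabletServer("tserver-00.local");
        ToyTabletServer ts1 = new ToyTabletServer("tserver-01.local");
        ts0.openScanner("scanner-1"); // the scan was started on ts0 only
        ts0.rejectNextCalls(1);       // ts0 is briefly overloaded
        return continueWithFailover(Arrays.asList(ts0, ts1), "scanner-1");
    }

    static Outcome runPinnedScenario() {
        ToyTabletServer ts0 = new ToyTabletServer("tserver-00.local");
        ts0.openScanner("scanner-1");
        ts0.rejectNextCalls(1);
        return continuePinned(ts0, "scanner-1", 3);
    }

    public static void main(String[] args) {
        System.out.println("failover: " + runFailoverScenario());
        System.out.println("pinned:   " + runPinnedScenario());
    }
}
```

In this model the failover path ends with a "scanner not found" outcome even though the scanner is still open on the first server, which matches the misleading "it may have expired" message in the log. The pinned path succeeds once the overload clears.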

> kudu java client throws scanner expired error while processing large scan on
> High-load cluster
> -----------------------------------------------------------------------------------------------
>
>                 Key: KUDU-3169
>                 URL: https://issues.apache.org/jira/browse/KUDU-3169
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, java
>    Affects Versions: 1.8.0, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1
>            Reporter: mintao
>            Priority: Major
>              Labels: scalability, stability
>
> A user submits a Spark task to scan a Kudu table with a large number of
> records. After just a few minutes the job fails after 4 attempts, each
> attempt failing with this error:
> {code:java}
> org.apache.kudu.client.NonRecoverableException: Scanner 4e34e6f821be42b889022ec681e235cc not found (it may have expired)
> org.apache.kudu.client.NonRecoverableException: Scanner 4e34e6f821be42b889022ec681e235cc not found (it may have expired)
>     at org.apache.kudu.client.KuduException.transformException(KuduException.java:110)
>     at org.apache.kudu.client.KuduClient.joinAndHandleException(KuduClient.java:402)
>     at org.apache.kudu.client.KuduScanner.nextRows(KuduScanner.java:57)
>     at org.apache.kudu.spark.kudu.RowIterator.hasNext(KuduRDD.scala:153)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>     at org.apache.spark.scheduler.Task.run(Task.scala:109)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>     Suppressed: org.apache.kudu.client.KuduException$OriginalException: Original asynchronous stack trace
>         at org.apache.kudu.client.RpcProxy.dispatchTSError(RpcProxy.java:341)
>         at org.apache.kudu.client.RpcProxy.responseReceived(RpcProxy.java:263)
>         at org.apache.kudu.client.RpcProxy.access$000(RpcProxy.java:59)
>         at org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:152)
>         at org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:148)
>         at org.apache.kudu.client.Connection.messageReceived(Connection.java:391)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at org.apache.kudu.client.Connection.handleUpstream(Connection.java:243)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>         at org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
>         at org.apache.kudu.shaded.org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
>         at org.apache.kudu.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
>         at org.apache.kudu.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
>         at org.apache.kudu.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>         at org.apache.kudu.shaded.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>         at org.apache.kudu.shaded.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>         ... 3 more
> {code}
> Each task ran for only about 19 seconds before throwing the scanner-not-found
> error, while the tserver uses the default scanner_ttl_ms (60s). In the tserver
> log, we found that the scanner mentioned in the client log expired only after
> the Spark job had failed, and that another tserver had received the scan
> request specifying that scannerId.
> It seems that AsyncKuduScanner in the Kudu Java client chooses a random server
> when retrying scanNextRows, even though the AsyncKuduScanner already has a
> scannerId.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)