[ https://issues.apache.org/jira/browse/SPARK-32893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703182#comment-17703182 ]
Ranga Reddy commented on SPARK-32893:
-------------------------------------

A similar issue has been filed for [Kubernetes|https://issues.apache.org/jira/issues/?jql=project+%3D+SPARK+AND+component+%3D+Kubernetes]: https://issues.apache.org/jira/browse/SPARK-35625

> Structured Streaming and Dynamic Allocation on StandaloneCluster
> ----------------------------------------------------------------
>
>                 Key: SPARK-32893
>                 URL: https://issues.apache.org/jira/browse/SPARK-32893
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.0.1
>            Reporter: Duarte Ferreira
>            Priority: Major
>
> We are currently using a Spark 3.0.1 standalone cluster to run our Structured Streaming applications.
> We set the following configurations when running the application in cluster mode:
> * spark.dynamicAllocation.enabled = true
> * spark.shuffle.service.enabled = true
> * spark.cores.max = 5
> * spark.executor.memory = 1G
> * spark.executor.cores = 1
> We also have the external shuffle service (spark.shuffle.service.enabled) enabled on each worker, and the cluster is composed of 1 master and 2 slaves.
> The application reads data from a Kafka topic (readTopic) following [the Structured Streaming + Kafka integration guide|https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html], applies some transformations to the Dataset using Spark SQL, and writes the results to another Kafka topic (writeTopic). (A minimal sketch of this setup appears after the quoted report below.)
> When we start the application it behaves correctly: it starts with 0 executors and, as we feed data to readTopic, it increases the number of executors until it reaches the limit of 5, and all messages are transformed and written to writeTopic in Kafka.
> If we stop feeding messages to readTopic, the application also works as expected: it kills executors that are no longer needed until, once we stop sending data completely, it reaches 0 running executors.
> If we start sending data again right away, it behaves just as expected and starts increasing the number of executors again.
> But if we leave the application idle at 0 executors for around 10 minutes, we start getting errors like this:
> {noformat}
> 20/09/15 10:41:22 ERROR TransportClient: Failed to send RPC RPC 7570256331800450365 to sparkmaster/10.0.12.231:7077: java.nio.channels.ClosedChannelException
> java.nio.channels.ClosedChannelException
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
> 	at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
> 	at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1104)
> 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Connection reset by peer
> 	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> 	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> 	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> 	at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> 	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
> 	at org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:148)
> 	at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123)
> 	at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:362)
> 	at io.netty.channel.nio.AbstractNioByteChannel.doWriteInternal(AbstractNioByteChannel.java:235)
> 	at io.netty.channel.nio.AbstractNioByteChannel.doWrite0(AbstractNioByteChannel.java:209)
> 	at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:400)
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:930)
> 	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:354)
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:897)
> 	at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1372)
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:742)
> 	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:728)
> 	at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:127)
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:765)
> 	... 8 more
> 20/09/15 10:41:22 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 1 total executors!
> {noformat}
> If we restart the master node, everything works again; likewise, if we restart the Spark application, everything starts working again.
> All nodes can ping the master node, and they can start other applications or kill the ones running; there seems to be a problem only when increasing the executors.
> Is this a bug, or are we missing some configuration or timeout setting?
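For reference, here is a minimal Scala sketch of the setup described in the report above. It is a sketch, not the reporter's actual job: only the five configuration values and the topic names (readTopic, writeTopic) come from the report; the master host, Kafka bootstrap servers, checkpoint location, and the sample transformation are placeholder assumptions.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}

// Requires the org.apache.spark:spark-sql-kafka-0-10 artifact on the classpath.
object StreamingDynamicAllocationRepro {
  def main(args: Array[String]): Unit = {
    // Configuration values taken from the report; the master URL is a placeholder.
    val spark = SparkSession.builder()
      .appName("streaming-dynamic-allocation-repro")
      .master("spark://sparkmaster:7077")                    // placeholder standalone master
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")       // the external shuffle service must also run on each worker
      .config("spark.cores.max", "5")
      .config("spark.executor.memory", "1G")
      .config("spark.executor.cores", "1")
      .getOrCreate()

    // Read from the source topic, per the Structured Streaming + Kafka integration guide.
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")       // placeholder
      .option("subscribe", "readTopic")
      .load()

    // Placeholder Spark SQL transformation; the report does not say what the real job computes.
    val transformed = input
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .withColumn("value", upper(col("value")))

    // Write the results to the sink topic.
    val query = transformed
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")       // placeholder
      .option("topic", "writeTopic")
      .option("checkpointLocation", "/tmp/checkpoints/repro") // placeholder path
      .start()

    query.awaitTermination()
  }
}
{code}

With this setup, the executor count should scale between 0 and 5 as load on readTopic varies; per the report, the ClosedChannelException appears only after the application has sat at 0 executors for roughly 10 minutes and then asks the master for executors again.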