[
https://issues.apache.org/jira/browse/SPARK-38329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neven Jovic updated SPARK-38329:
--------------------------------
Description:
I'm currently running a Spark Structured Streaming application written in
Python (PySpark) where my source is a Kafka topic and my sink is MongoDB. I
changed my checkpoint location to Amazon EFS, which is mounted on all Spark
workers, and after that I saw increased I/O wait, averaging 8%.
!Screenshot from 2022-02-25 14-16-11.png!
Currently I have about 6000 messages coming into Kafka every second, and every
once in a while I get a WARN message:
{quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up
files for HDFSStateStoreProvider[id = (op=0,part=90),dir =
file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For
input string: ""
{quote}
I'm not quite sure whether that message has anything to do with the high I/O
wait. Is this behavior expected, or is it something to be concerned about?
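For context, the setup described above can be sketched roughly as below. This is a minimal illustration only: the broker address, topic name, MongoDB URI, and app name are hypothetical placeholders, not taken from the actual job, and on Spark 2.4.x (the affected version) the MongoDB sink is often wired up via foreachBatch with the older connector rather than a native streaming format.

```python
# Hypothetical sketch of the reported setup: Kafka source, MongoDB sink,
# checkpoint directory on an EFS mount shared by all Spark workers.
# All names, URIs, and ports below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-mongo").getOrCreate()

# Kafka source: ~6000 messages/second arrive on the input topic.
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "input-topic")                # placeholder
    .load())


def write_to_mongo(batch_df, batch_id):
    # foreachBatch-style sink, commonly used with Spark 2.4.x; the
    # connector format name and URI option are assumptions.
    (batch_df.write
        .format("mongo")
        .option("uri", "mongodb://host:27017/db.collection")  # placeholder
        .mode("append")
        .save())


# Offsets and operator state are checkpointed to the EFS mount; this is
# the directory tree the HDFSBackedStateStoreProvider warning points at
# (e.g. file:/mnt/efs_max_io/spark/state/0/90).
query = (events.writeStream
    .foreachBatch(write_to_mongo)
    .option("checkpointLocation", "file:/mnt/efs_max_io/spark/checkpoint")
    .start())
```

The state store provider writes and compacts many small delta/snapshot files under the checkpoint directory every micro-batch, which is one plausible source of elevated I/O wait on a network file system like EFS.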
was:
I'm currently running a Spark Structured Streaming application written in
Python (PySpark) where my source is a Kafka topic and my sink is MongoDB. I
changed my checkpoint location to Amazon EFS, which is mounted on all Spark
workers, and after that I saw increased I/O wait, averaging 8%.
!image-2022-02-25-14-42-31-904.png!
Currently I have about 6000 messages coming into Kafka every second, and every
once in a while I get a WARN message:
{quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up
files for HDFSStateStoreProvider[id = (op=0,part=90),dir =
file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For
input string: ""
{quote}
I'm not quite sure whether that message has anything to do with the high I/O
wait. Is this behavior expected, or is it something to be concerned about?
> High I/O wait when Spark Structured Streaming checkpoint changed to EFS
> -----------------------------------------------------------------------
>
> Key: SPARK-38329
> URL: https://issues.apache.org/jira/browse/SPARK-38329
> Project: Spark
> Issue Type: Question
> Components: EC2, Input/Output, PySpark, Structured Streaming
> Affects Versions: 2.4.6
> Reporter: Neven Jovic
> Priority: Major
> Attachments: Screenshot from 2022-02-25 14-16-11.png
>
>
> I'm currently running a Spark Structured Streaming application written in
> Python (PySpark) where my source is a Kafka topic and my sink is MongoDB. I
> changed my checkpoint location to Amazon EFS, which is mounted on all Spark
> workers, and after that I saw increased I/O wait, averaging 8%.
>
> !Screenshot from 2022-02-25 14-16-11.png!
> Currently I have about 6000 messages coming into Kafka every second, and
> every once in a while I get a WARN message:
> {quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up
> files for HDFSStateStoreProvider[id = (op=0,part=90),dir =
> file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For
> input string: ""
> {quote}
> I'm not quite sure whether that message has anything to do with the high I/O
> wait. Is this behavior expected, or is it something to be concerned about?
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]