Unfortunately, the information you provided doesn't indicate whether rows in
the state are being evicted correctly as the watermark advances, or whether
there's an unknown bug in which some rows in state are silently dropped. I
haven't heard of a case of the latter; you'd probably want to double-check it with
Hey
My 2 cents on CI/CD for PySpark. You can leverage pytest + Holden Karau's
Spark testing libs for CI, thus giving you `almost` the same functionality as
Scala. I say almost because in Scala you have nice and descriptive funcspecs -
For me the choice is based on expertise. Having worked with teams which
Hi Wim,
I think we are splitting the atom here, but my inference about functionality
was based on:
1. Spark is written in Scala, so knowing the Scala programming language
helps coders navigate the source code if something does not function
as expected.
2. Given the framework using
Hi Users,
I have created a wheel file using Poetry. I tried running the following
commands to run a Spark job using the wheel, but it is not working. Can anyone
please let me know the invocation step for the wheel file?
spark-submit --py-files /path/to/wheel
spark-submit --files /path/to/wheel
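Note that neither command above names an application script, and spark-submit
expects one as its positional argument (for example a main.py that imports the
packaged code). --files only ships a file to the working directory, while
--py-files puts archives on the PYTHONPATH; its documented extensions are .zip,
.egg and .py. A pure-Python wheel is itself a zip archive, so it may be
importable the same way, but treat that as an assumption to verify on your
cluster. A minimal sketch of the underlying mechanism, using only the standard
library (the package name demo_etl and the file name are hypothetical):

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny stand-in for a pure-Python wheel. A wheel is a zip archive;
# "demo_etl" and the file name below are hypothetical, for illustration only.
tmp_dir = tempfile.mkdtemp()
wheel_path = os.path.join(tmp_dir, "demo_etl-0.1.0-py3-none-any.whl")
with zipfile.ZipFile(wheel_path, "w") as zf:
    zf.writestr(
        "demo_etl/__init__.py",
        "def greet():\n    return 'hello from wheel'\n",
    )

# Placing the archive on sys.path lets Python's zip importer load from it.
# This is roughly the effect --py-files is meant to have for the job.
sys.path.insert(0, wheel_path)

import demo_etl  # resolved from inside the archive

message = demo_etl.greet()
print(message)
```

This only works for wheels without compiled extensions; packages with native
code generally need to be installed on the executors instead.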
It's really a very big discussion around PySpark vs Scala. I have a little
bit of experience with how we can automate CI/CD when it's a JVM-based
language.
I would like to take this as an opportunity to understand the end-to-end
CI/CD flow for PySpark-based ETL pipelines.
Could someone please
I think Sean is right, but in your argument you mention that 'functionality
is sacrificed in favour of the availability of resources'. That's where I
disagree with you but agree with Sean. That is mostly not true.
In your previous posts you also mentioned this. The only reason we
sometimes
We're using stateful Structured Streaming in Spark 2.4. We are noticing
that when the load on the system is heavy and lots of messages are coming in,
some of the state disappears with no error message. Any suggestions on how
we can debug this? Any tips for fixing this?
Thanks in advance.
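One thing worth ruling out is ordinary watermark-based eviction, which removes
state without raising any error. The toy model below is plain Python, not
Spark's implementation; the class name ToyStateStore and the 10-second delay
are made up for illustration. It assumes the watermark is the largest event
time seen so far minus the allowed delay, and that any key whose latest event
time falls behind the watermark is dropped.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStateStore:
    """Toy model of watermark eviction; not Spark's actual state store."""

    delay: int                                    # allowed lateness, seconds
    max_event_time: int = 0                       # largest event time seen
    state: dict = field(default_factory=dict)     # key -> latest event time
    evicted: list = field(default_factory=list)   # (key, event_time, watermark)

    def process_batch(self, events):
        # Update per-key state and track the largest event time seen so far.
        for key, event_time in events:
            self.state[key] = max(event_time, self.state.get(key, event_time))
            self.max_event_time = max(self.max_event_time, event_time)

        # Advance the watermark, then drop every key that fell behind it.
        watermark = self.max_event_time - self.delay
        stale = [k for k, t in self.state.items() if t < watermark]
        for key in stale:
            self.evicted.append((key, self.state.pop(key), watermark))
        return watermark


store = ToyStateStore(delay=10)
store.process_batch([("a", 100), ("b", 101)])  # watermark 91, nothing dropped
store.process_batch([("b", 130)])              # watermark 120, "a" is dropped
print(store.evicted)
```

Under heavy load, event times can jump forward quickly, so keys that were
recently active may fall behind the watermark sooner than expected. Logging
what gets dropped, as the evicted list does here, helps tell expected eviction
apart from a genuine bug.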