Re: States get dropped in Structured Streaming

2020-10-23 Thread Jungtaek Lim
Unfortunately, the information you provided gives no hint as to whether rows in the state are being evicted correctly on watermark advance, or whether there's an unknown bug in which some of the rows in state are silently dropped. I haven't heard of a case of the latter - you'd probably like to double check it with
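
To make the "evicted correctly on watermark advance" case concrete, here is a minimal pure-Python sketch of watermark-based state eviction. It is a simplified model, not Spark's implementation: the function name `run_batches`, the integer event times, and the rule "watermark = max event time seen minus delay, applied at the end of each batch" are all assumptions chosen for illustration.

```python
def run_batches(batches, delay):
    """Simplified model of watermark-driven state eviction (not Spark's code).

    batches: list of micro-batches, each a list of (key, event_time) rows.
    delay:   how far the watermark trails the largest event time seen.
    """
    state = {}      # key -> latest event time stored for that key
    watermark = 0   # rows and state older than this are considered expired
    evicted = []    # (key, event_time) pairs removed from state

    for batch in batches:
        for key, event_time in batch:
            if event_time < watermark:
                continue  # late row: ignored, so it never reaches state
            state[key] = max(state.get(key, event_time), event_time)

        # Advance the watermark from the largest event time in this batch.
        max_seen = max((t for _, t in batch), default=None)
        if max_seen is not None:
            watermark = max(watermark, max_seen - delay)

        # Evict every state entry that now sits behind the watermark.
        for key in [k for k, t in state.items() if t < watermark]:
            evicted.append((key, state.pop(key)))

    return state, evicted, watermark


# Hypothetical data: a jump in event time (10 -> 40) moves the watermark to 30.
batches = [[("a", 10), ("b", 12)], [("c", 40)], [("a", 15)]]
state, evicted, watermark = run_batches(batches, delay=10)
print(state, evicted, watermark)
```

In this model, keys "a" and "b" leave the state with no error as soon as a much newer event arrives, and the later row for "a" is dropped as late. Under heavy load with skewed event times, that expected behavior can look like state silently disappearing.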

Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Sofia’s World
Hey, my 2 cents on CI/CD for PySpark. You can leverage pytest + Holden Karau's Spark testing libs for CI, thus giving you `almost` the same functionality as Scala - I say almost because in Scala you have nice and descriptive funcspecs. For me the choice is based on expertise. Having worked with teams which

Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Mich Talebzadeh
Hi Wim, I think we are splitting the atom here, but my reference to functionality was based on: 1. Spark is written in Scala, so knowing the Scala programming language helps coders navigate the source code if something does not function as expected. 2. Given the framework using

Need help on Calling Pyspark code using Wheel

2020-10-23 Thread Sachit Murarka
Hi Users, I have created a wheel file using Poetry. I tried running the following commands to run a Spark job using the wheel, but they are not working. Can anyone please let me know about the invocation step for the wheel file? spark-submit --py-files /path/to/wheel spark-submit --files /path/to/wheel
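
A hedged note on the question above: spark-submit documents --py-files as taking .zip, .egg, or .py files, and it also expects the application script as an argument, which neither quoted command includes. A pure-Python wheel is a zip archive, so Python can import from it once it is on the module search path. The sketch below shows only that import mechanism, without Spark. The package name `demo_pkg`, its `greet` function, and the wheel file name are hypothetical.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_demo_wheel(directory):
    """Write a tiny wheel-like zip archive holding one pure-Python package."""
    wheel_path = os.path.join(directory, "demo_pkg-0.1.0-py3-none-any.whl")
    with zipfile.ZipFile(wheel_path, "w") as archive:
        archive.writestr(
            "demo_pkg/__init__.py",
            "def greet(name):\n    return 'hello ' + name\n",
        )
    return wheel_path


def import_from_archive(archive_path, module_name):
    """Put the archive on sys.path and import a module from inside it."""
    sys.path.insert(0, archive_path)
    return importlib.import_module(module_name)


workdir = tempfile.mkdtemp()
wheel_path = build_demo_wheel(workdir)
demo_pkg = import_from_archive(wheel_path, "demo_pkg")
message = demo_pkg.greet("spark")
print(message)
```

This only works for wheels without compiled extensions, since native modules cannot be loaded from inside a zip archive. Whether it explains the failure in the thread is an assumption, because the snippet does not show the error.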

Re: Scala vs Python for ETL with Spark

2020-10-23 Thread William R
PySpark vs Scala is really a very big discussion. I have a little experience with automating CI/CD for a JVM-based language. I would like to take this as an opportunity to understand the end-to-end CI/CD flow for PySpark-based ETL pipelines. Could someone please

Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Wim Van Leuven
I think Sean is right, but in your argument you mention that 'functionality is sacrificed in favour of the availability of resources'. That's where I disagree with you but agree with Sean. That is mostly not true. In your previous posts you also mentioned this. The only reason we sometimes

States get dropped in Structured Streaming

2020-10-23 Thread Eric Beabes
We're using Stateful Structured Streaming in Spark 2.4. We are noticing that when the load on the system is heavy and LOTS of messages are coming in, some of the states disappear with no error message. Any suggestions on how we can debug this? Any tips for fixing this? Thanks in advance.