Spark Structured Streaming with periodic persist and unpersist

2021-02-11 Thread act_coder
I am currently building a Spark Structured Streaming application where I am doing a batch-stream join, and the source for the batch data gets updated periodically. So I am planning to persist/unpersist that batch data periodically. Below is a sample of the code which I am using to persist and …
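A minimal sketch of the pattern being described (not the poster's actual code): a stream-static join done inside foreachBatch, where the static side is reloaded and re-persisted on a timer and the previous cached copy is unpersisted. The Kafka broker and topic, the Parquet path, the join key, and the 30-minute refresh interval are all assumptions for illustration.

    import java.util.concurrent.{Executors, TimeUnit}
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object PeriodicRefreshJoin {
      // Holds the currently persisted copy of the batch-side data.
      @volatile private var batchDf: DataFrame = _

      private def loadAndPersist(spark: SparkSession): DataFrame = {
        val df = spark.read.parquet("/data/reference")   // assumed batch source
        df.persist()
        df.count()                                        // force the cache to materialize
        df
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("periodic-refresh-join").getOrCreate()
        batchDf = loadAndPersist(spark)

        // Refresh the batch side every 30 minutes (assumed interval).
        val scheduler = Executors.newSingleThreadScheduledExecutor()
        scheduler.scheduleAtFixedRate(new Runnable {
          def run(): Unit = {
            val fresh = loadAndPersist(spark)
            val old = batchDf
            batchDf = fresh      // later micro-batches pick up the fresh copy
            old.unpersist()      // release the stale cached copy
          }
        }, 30, 30, TimeUnit.MINUTES)

        val stream = spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  // assumed broker
          .option("subscribe", "events")                         // assumed topic
          .load()
          .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

        // foreachBatch re-reads the batchDf reference on every micro-batch,
        // so the join always sees the latest persisted copy.
        stream.writeStream.foreachBatch { (micro: DataFrame, _: Long) =>
          micro.join(batchDf, Seq("key")).show()
        }.start().awaitTermination()
      }
    }

Doing the join inside foreachBatch, as above, is the more predictable route for picking up refreshed data; whether a plain stream-static join notices a re-persisted DataFrame can vary with the Spark version.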

Re: understanding spark shuffle file re-use better

2021-02-11 Thread Attila Zsolt Piros
No, it won't be reused. You should reuse the DataFrame in order to reuse the shuffle blocks (and cached data). I know this because the two actions will lead to building two separate DAGs, but I will show you a way you can check this on your own (with a small, simple Spark application). For …
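For anyone who wants to reproduce the check Attila describes, here is a small self-contained sketch (my own illustration; local mode and the default UI port are assumptions). Running two actions on the same shuffled DataFrame makes the second job show its shuffle map stage as "skipped" in the Spark UI, meaning the shuffle files are reused; rebuilding the DataFrame from scratch would produce a separate DAG and recompute the shuffle.

    import org.apache.spark.sql.SparkSession

    object ShuffleReuseCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("shuffle-reuse-check")
          .master("local[*]")          // assumed local run; UI at http://localhost:4040
          .getOrCreate()
        import spark.implicits._

        val aggregated = spark.range(0, 10000000L)
          .withColumn("bucket", $"id" % 100)
          .groupBy("bucket")           // forces a shuffle (Exchange)
          .count()

        aggregated.count()             // job 1: performs the shuffle, writes shuffle files
        aggregated.collect()           // job 2: same plan, shuffle map stage shows as skipped

        // Rebuilding the DataFrame here instead of reusing `aggregated`
        // would create a new DAG and redo the shuffle.

        Thread.sleep(600000)           // keep the UI up for inspection (experiment only)
        spark.stop()
      }
    }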

Trigger on GroupStateTimeout with no new data in group

2021-02-11 Thread Abhishek Gupta
Hi All, I had a question about modeling a user-session kind of analytics use case in Spark Structured Streaming. Is there a way to model something like this using arbitrary stateful Spark streaming? User session -> reads a few FAQs on a website and then decides whether or not to create a ticket. FAQ …
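A minimal sketch of how such a session could be modeled with arbitrary stateful processing (my illustration of the general pattern, not a confirmed answer to the thread): flatMapGroupsWithState keyed by user, with a processing-time timeout standing in for "session ended". The event schema, the CSV payload format, the Kafka settings, and the 30-minute gap are assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    case class PageEvent(userId: String, page: String, createdTicket: Boolean)
    case class SessionState(faqsRead: Int, ticketCreated: Boolean)
    case class SessionSummary(userId: String, faqsRead: Int, ticketCreated: Boolean)

    object SessionExample {
      def updateSession(
          userId: String,
          events: Iterator[PageEvent],
          state: GroupState[SessionState]): Iterator[SessionSummary] = {
        if (state.hasTimedOut) {
          // No new events within the timeout: emit the session summary, drop the state.
          val s = state.get
          state.remove()
          Iterator(SessionSummary(userId, s.faqsRead, s.ticketCreated))
        } else {
          val current = state.getOption.getOrElse(SessionState(0, ticketCreated = false))
          val evs = events.toSeq
          val updated = SessionState(
            current.faqsRead + evs.count(_.page.startsWith("faq")),
            current.ticketCreated || evs.exists(_.createdTicket))
          state.update(updated)
          state.setTimeoutDuration("30 minutes")   // assumed session gap
          Iterator.empty
        }
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("session-example").getOrCreate()
        import spark.implicits._

        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  // assumed broker
          .option("subscribe", "page-events")                    // assumed topic
          .load()
          .selectExpr("CAST(value AS STRING) AS value")
          .as[String]
          .map { line =>   // assumed payload: userId,page,createdTicket
            val Array(u, p, t) = line.split(",")
            PageEvent(u, p, t.toBoolean)
          }

        val sessions = events
          .groupByKey(_.userId)
          .flatMapGroupsWithState(OutputMode.Append(),
            GroupStateTimeout.ProcessingTimeTimeout())(updateSession _)

        sessions.writeStream.outputMode("append").format("console").start().awaitTermination()
      }
    }

Note that the timeout only fires when a micro-batch is processed, which is exactly the limitation the thread title refers to.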

How to handle Spark state which is growing too big even with a timeout set

2021-02-11 Thread Kuttaiah Robin
Hello, I have a use case where I need to read events (non-correlated) from a Kafka topic, then correlate them and push them forward to another topic. I use Spark Structured Streaming with FlatMapGroupsWithStateFunction along with GroupStateTimeout.ProcessingTimeTimeout(). After each timeout, I do some …
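Two levers that usually keep this kind of state bounded, sketched below under stated assumptions (this is not the poster's job; the class names, the 1000-event cap, and the 10-minute window are made up for illustration): explicitly remove() the state when a group times out and cap what each key buffers, and, on Spark 3.2+, move the state store to RocksDB so the state no longer sits on the executor heap.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.GroupState

    case class Event(sessionId: String, payload: String)
    case class Correlated(sessionId: String, events: Seq[String])

    object BoundedState {
      val MaxEventsPerKey = 1000   // assumed cap on buffered events per session

      def correlate(
          sessionId: String,
          newEvents: Iterator[Event],
          state: GroupState[Seq[String]]): Iterator[Correlated] = {
        if (state.hasTimedOut) {
          val buffered = state.get
          state.remove()                         // drop the state once the group times out
          Iterator(Correlated(sessionId, buffered))
        } else {
          val merged = (state.getOption.getOrElse(Seq.empty) ++ newEvents.map(_.payload))
            .takeRight(MaxEventsPerKey)          // never let one key grow without bound
          state.update(merged)
          state.setTimeoutDuration("10 minutes") // assumed correlation window
          Iterator.empty
        }
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("bounded-state")
          // Spark 3.2+: keep streaming state in RocksDB instead of the executor heap.
          .config("spark.sql.streaming.stateStore.providerClass",
            "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
          .getOrCreate()
        // ... read from Kafka, groupByKey(_.sessionId), and plug `correlate` into
        // flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.ProcessingTimeTimeout())
        // as in the session example above.
      }
    }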