Re: [spark streaming] checkpoint location feature for batch processing

2020-05-02 Thread Magnus Nilsson
I've always had a question about Trigger.Once that I never got around to asking or testing for myself. If you have a 24/7 stream to a Kafka topic, will Trigger.Once get the last offset(s) when it starts and then quit once it hits those offset(s), or will the job run until no new messages are added to the

Modularising Spark/Scala program

2020-05-02 Thread Mich Talebzadeh
Hi, I have a Spark Scala program created and compiled with Maven. It works fine. It basically does the following: 1. Reads an XML file from an HDFS location 2. Creates a DF on top of what it reads 3. Creates a new DF with some columns renamed etc. 4. Creates a new DF for rejected rows
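The four steps Mich lists can be sketched roughly as below. This is only an illustrative skeleton, not his actual program: reading XML needs the third-party spark-xml package (core Spark has no `xml` source), and the path, row tag, column names, and rejection predicate are all placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("XmlPipeline").getOrCreate()

// 1. Read an XML file from HDFS (requires the spark-xml package on the classpath)
val raw = spark.read
  .format("xml")
  .option("rowTag", "record")          // placeholder row tag
  .load("hdfs:///data/input.xml")      // placeholder path

// 2./3. Build a new DF with some columns renamed
val renamed = raw.withColumnRenamed("old_name", "new_name")  // placeholder names

// 4. Split out rejected rows with a hypothetical validation predicate
val rejected = renamed.filter(col("new_name").isNull)
val accepted = renamed.filter(col("new_name").isNotNull)
```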

Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
Hi Mich! I think you can combine the good/rejected logic into one method that internally: - creates good/rejected DFs given an input DF and input rules/predicates to apply to the DF - creates a third DF containing the good rows plus the rejected rows with the bad columns nulled out -
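The shape of Stephen's suggestion can be sketched Spark-free on plain Scala collections, which makes the logic easy to see; the `Row` case class, the validation predicate, and the choice of which columns to null out are all hypothetical.

```scala
// A stand-in for a DataFrame row; fields are Options so "nulled out" is None.
case class Row(id: Int, name: Option[String], amount: Option[Double])

// Hypothetical validation rule: name non-empty and amount positive.
def isGood(r: Row): Boolean =
  r.name.exists(_.nonEmpty) && r.amount.exists(_ > 0)

// One reusable method: returns (good, rejected, combined),
// where combined = good rows plus rejected rows with bad columns nulled.
def splitRows(rows: Seq[Row], pred: Row => Boolean): (Seq[Row], Seq[Row], Seq[Row]) = {
  val (good, rejected) = rows.partition(pred)
  val nulled = rejected.map { r =>
    r.copy(
      name   = r.name.filter(_.nonEmpty),  // drop empty names
      amount = r.amount.filter(_ > 0)      // drop non-positive amounts
    )
  }
  (good, rejected, good ++ nulled)
}

val rows = Seq(
  Row(1, Some("a"), Some(10.0)),
  Row(2, Some(""),  Some(5.0)),
  Row(3, Some("c"), Some(-1.0))
)
val (good, rejected, combined) = splitRows(rows, isGood)
```

In Spark the same split would typically be two `filter` calls on the predicate (or one pass with a validity column), with the nulling done via `when/otherwise` expressions.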

Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
I neglected to include the rationale: the assumption is that this will be a repeatedly needed process, so a reusable method would be helpful. The predicates/input rules that are supported will need to be flexible enough to cover the range of input data domains and use cases. For my workflows the

Re: Spark structured streaming - performance tuning

2020-05-02 Thread Srinivas V
Hi Alex, I read the book; it is a good one, but I don't see the things I most want to understand. You are right about partitions and tasks. 1. How to use coalesce with Spark structured streaming? Also, I want to ask a few more questions: 2. How to restrict the number of executors on structured
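On question 2: the executor count for a structured streaming job is capped the same way as for any Spark application, at submit time. A sketch with placeholder values (application jar and numbers are illustrative, not from the thread):

```
# Static allocation: a fixed executor count
spark-submit --num-executors 4 --executor-cores 2 your-app.jar

# Or with dynamic allocation enabled, cap the upper bound instead
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=4 \
  your-app.jar
```

For question 1, `coalesce(n)` can be applied to a streaming DataFrame before `writeStream` to reduce the number of output partitions per micro-batch, just as on a batch DataFrame.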

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-02 Thread Jungtaek Lim
If I understand correctly, Trigger.Once executes only one micro-batch and terminates; that's all. Your understanding of structured streaming applies there as well. It's like a hybrid approach: it brings incremental processing from micro-batches but has the processing interval of a batch job. That said,
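Jungtaek's description can be made concrete: with `Trigger.Once()` the query reads what is available beyond the checkpointed offsets at startup, processes it as a single micro-batch, and stops; rerunning the same job picks up from the checkpoint. A minimal sketch, where the broker address, topic, and paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("TriggerOnceDemo").getOrCreate()

val kafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "events")                     // placeholder topic
  .load()

val query = kafka.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/out")                // placeholder
  .option("checkpointLocation", "hdfs:///chk/once")  // offsets survive across runs
  .trigger(Trigger.Once())                           // one micro-batch, then stop
  .start()

query.awaitTermination()  // returns after the single batch completes
```

Note the batch only covers offsets visible when it is planned; messages arriving after the job starts are left for the next run, which answers Magnus's question at the top of the thread.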