Re: Increase batch interval in case of delay

2021-07-01 Thread András Kolbert
will save time. On Thu, 1 Jul 2021 at 13:45, Sean Owen wrote: > Wouldn't this happen naturally? The large batches would just take a longer time to complete already. > On Thu, Jul 1, 2021 at 6:32 AM András Kolbert wrote: >> Hi, >> I have a sp

Increase batch interval in case of delay

2021-07-01 Thread András Kolbert
Hi, I have a Spark streaming application which is generally able to process the data within the given time frame. However, in certain hours the processing time starts increasing, which causes a delay. In my scenario, the processing time does not grow linearly with the number of input records. Hence, ideally I'd like to
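
(Not from the thread: a minimal configuration sketch. In DStream-based Spark Streaming the batch interval is fixed when the StreamingContext is created and cannot be changed at runtime, so the usual knob for occasional delays is backpressure, which throttles the ingest rate instead of stretching the interval. The app name, the 60-second interval, and the rate cap below are illustrative assumptions.)

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("batch-interval-demo")                          # hypothetical app name
        .set("spark.streaming.backpressure.enabled", "true")        # adapt the per-batch ingest rate to processing speed
        .set("spark.streaming.kafka.maxRatePerPartition", "1000"))  # illustrative cap for Kafka direct streams

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=60)  # 60-second batches, fixed for the lifetime of the context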

Re: Tasks are skewed to one executor

2021-04-11 Thread András Kolbert
On Sat, 10 Apr 2021 at 14:28, András Kolbert wrote: >> hi, >> I have a streaming job an

Tasks are skewed to one executor

2021-04-10 Thread András Kolbert
hi, I have a streaming job and quite often executors die during processing (due to memory errors, "unable to find location for shuffle", etc.). I started digging and found that some of the tasks are concentrated on one executor, just as below: [image: image.png] Can this be the reason? Should I
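
(Not from the thread: a quick diagnostic sketch for this kind of skew. df and key are hypothetical names; the idea is to check whether one key or one partition holds a disproportionate share of the records, since that is what typically pins most tasks to a single executor.)

from pyspark.sql import functions as F

# records per key: a single dominant key usually explains one hot executor
df.groupBy("key").count().orderBy(F.desc("count")).show(10)

# records per partition: uneven partitioning also piles work onto one executor
df.groupBy(F.spark_partition_id().alias("partition")).count().show()

# spreading the data before the expensive stage is the usual mitigation
df = df.repartition(200, "key")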

Re: Use case advice

2021-01-14 Thread András Kolbert
Sorry, missed out a bit. Added, highlighted in yellow. On Thu, 14 Jan 2021 at 13:54, András Kolbert wrote: > Thanks, Muru, very helpful suggestion! Delta Lake is amazing, it completely changed a few of my projects! > One question regarding that. When I use the following state

Re: Use case advice

2021-01-14 Thread András Kolbert
, I keep running into the 'Error: java.lang.ClassNotFoundException: Failed to find data source: delta.' error message. What did I miss in my configuration/env variables? Thanks Andras On Sun, 10 Jan 2021, 3:33 am muru, wrote: > You could try Delta Lake or Apache Hudi for this use
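
(Not from the thread: a sketch of how Delta Lake is typically wired into a PySpark 2.4 session. The "Failed to find data source: delta" error usually means the delta-core package is not on the classpath. The Delta version, the Scala 2.11 build, the app name, and the table path below are assumptions for illustration.)

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-demo")                                                # hypothetical app name
         .config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.1")      # pulls the Delta jar onto the classpath
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .getOrCreate())

df = spark.read.format("delta").load("/path/to/delta-table")  # hypothetical path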

Re: Use case advice

2021-01-09 Thread András Kolbert
by 1)? Driver is only responsible for submitting the Spark job, not performing it. -- ND On 1/9/21 9:35 AM, András Kolbert wrote: > Hi, I would like to get your advice on my use case. I have a few Spark streaming applications where I need to keep

Use case advice

2021-01-09 Thread András Kolbert
Hi, I would like to get your advice on my use case. I have a few Spark streaming applications where I need to keep updating a dataframe after each batch. Each batch probably affects a small fraction of the dataframe (5k out of 200k records). The options I have been considering so far: 1) keep data
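
(Not from the thread: a minimal sketch of the update-a-running-DataFrame pattern described above. The names, the key column, and the merge semantics are assumptions; the idea is that rows present in the micro-batch replace the matching rows in the accumulated state.)

def update_state(state_df, batch_df, key="id"):
    # keep state rows whose key is absent from the batch, then append the batch rows
    updated = (state_df.join(batch_df.select(key), on=key, how="left_anti")
                       .unionByName(batch_df))
    return updated

Without periodically truncating the lineage (checkpointing, or writing the state out and reading it back), the query plan grows every batch, which is one common source of the slowdowns described in the other threads.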

Re: Spark Streaming Checkpointing

2020-09-04 Thread András Kolbert
specific reason to use that? BR, G. On Thu, Sep 3, 2020 at 11:41 AM András Kolbert wrote: >> Hi All, I have a Spark streaming application (2.4.4, Kafka 0.8, so Spark Direct Streaming) running just fine.

Spark Streaming Checkpointing

2020-09-03 Thread András Kolbert
Hi All, I have a Spark streaming application (2.4.4, Kafka 0.8, so Spark Direct Streaming) running just fine. I create a context in the following way: ssc = StreamingContext(sc, 60) opts = {"metadata.broker.list": kafka_hosts, "auto.offset.reset": "largest", "group.id": run_type} kvs = Kafk
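
(Not from the thread: a sketch of the standard DStream checkpointing pattern, since the original snippet is cut off. The checkpoint directory and the 60-second interval are assumptions. getOrCreate rebuilds the context from the checkpoint after a restart instead of calling the factory function again.)

from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs:///tmp/streaming-checkpoints"  # hypothetical location

def create_context():
    ssc = StreamingContext(sc, 60)   # sc: an existing SparkContext
    ssc.checkpoint(CHECKPOINT_DIR)   # enables metadata and state checkpointing
    # ... define the Kafka direct stream and the processing graph here ...
    return ssc

ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()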

Spark Streaming Memory

2020-05-17 Thread András Kolbert
Hi, I have a streaming job (Spark 2.4.4) in which the memory usage keeps increasing over time. Periodically (every 20-25 mins) the executors fall over (org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 6987) due to running out of memory. In the UI, I can see that the
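
(Not from the thread: a sketch, with hypothetical names, of the cache-hygiene pattern that usually addresses this symptom. If a cached state DataFrame is rebuilt every batch, each refresh should release the previous copy and truncate the lineage; otherwise executor memory and the query plan grow until shuffle fetches start failing.)

def refresh_state(old_state_df, batch_df):
    new_state = old_state_df.unionByName(batch_df)  # stand-in for the real per-batch update
    new_state = new_state.localCheckpoint()         # eager by default: materialises and cuts lineage
    old_state_df.unpersist()                        # drop the previous cached copy of the state
    return new_state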