will save time.
On Thu, 1 Jul 2021 at 13:45, Sean Owen wrote:
> Wouldn't this happen naturally? The large batches would just take a longer
> time to complete already.
>
> On Thu, Jul 1, 2021 at 6:32 AM András Kolbert
> wrote:
Hi,
I have a spark streaming application which is generally able to process the
data within the given time frame. However, in certain hours the processing
time starts increasing, which causes a delay.
In my scenario, the number of input records does not linearly increase the
processing time. Hence, ideally I'd like to
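
(Assuming the cut-off sentence was heading toward capping how much data a
single batch may pull in: a minimal sketch of the standard DStream
rate-limiting knobs, with placeholder values to tune.)

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Backpressure adapts the ingestion rate to the observed processing speed;
# maxRatePerPartition puts a hard ceiling on records/sec per Kafka partition,
# so a traffic spike yields more batches instead of one oversized batch.
conf = (SparkConf()
        .set("spark.streaming.backpressure.enabled", "true")
        .set("spark.streaming.kafka.maxRatePerPartition", "5000"))  # placeholder

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 60)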
> On Sat, 10 Apr 2021 at 14:28, András Kolbert
> wrote:
Hi,
I have a streaming job and quite often executors die (due to memory errors /
"unable to find location for shuffle", etc.) during processing. I started
digging and found that some of the tasks are concentrated on one executor,
just as below:
[image: Spark UI screenshot showing tasks concentrated on one executor]
Can this be the reason?
Should I
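
(If that concentration comes from a hot shuffle key, a common first aid is
salting the key so its rows spread across executors. A rough sketch; df and
key are hypothetical stand-ins for the actual dataframe and join/group key.)

import pyspark.sql.functions as F

# Split each hot key into 16 sub-keys (bucket count is a tuning assumption)
# so one key's rows no longer land on a single executor.
salted = (df
          .withColumn("salt", (F.rand() * 16).cast("int"))
          .repartition("key", "salt"))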
Sorry, I missed out a bit. Added, highlighted in yellow.
On Thu, 14 Jan 2021 at 13:54, András Kolbert
wrote:
> Thanks, Muru, very helpful suggestion! Delta Lake is amazing, completely
> changed a few of my projects!
>
> One question regarding that.
> When I use the following statement, I keep facing the
> 'Error: java.lang.ClassNotFoundException: Failed to find data source:
> delta.' error message.
> What did I miss in my configuration/env variables?
> Thanks
> Andras
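
(The usual cause of that ClassNotFoundException is that the Delta jar never
reached the classpath. A minimal sketch of the fix for Spark 2.4.x; the
package coordinates are an assumption, match them to your Spark/Scala build.)

from pyspark.sql import SparkSession

# Resolve the Delta Lake package onto the driver/executor classpath;
# without it, format("delta") fails exactly as above.
spark = (SparkSession.builder
         .appName("delta-check")
         .config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.1")
         .getOrCreate())

# Smoke test with a placeholder path.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-smoke")
spark.read.format("delta").load("/tmp/delta-smoke").show()

Passing --packages io.delta:delta-core_2.11:0.6.1 to spark-submit achieves
the same thing.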
On Sun, 10 Jan 2021, 3:33 am muru, wrote:
> You could try Delta Lake or Apache Hudi for this use case.
What do you mean by 1)? The driver is only
> responsible for submitting the Spark job, not performing it.
>
> -- ND
>
> On 1/9/21 9:35 AM, András Kolbert wrote:
Hi,
I would like to get your advice on my use case.
I have a few spark streaming applications where I need to keep updating a
dataframe after each batch. Each batch probably affects a small fraction of
the dataframe (5k out of 200k records).
The options I have been considering so far:
1) keep data
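
(Given the Delta Lake suggestion elsewhere in the thread, a rough sketch of
the per-batch upsert it enables; the table path, the key column, and batch_df
are assumptions, not from the original mail.)

from delta.tables import DeltaTable

def upsert_batch(spark, batch_df):
    # Merge the ~5k changed rows into the ~200k-row table in place,
    # instead of rewriting the whole dataframe every batch.
    target = DeltaTable.forPath(spark, "/data/state_table")  # placeholder path
    (target.alias("t")
           .merge(batch_df.alias("s"), "t.key = s.key")      # 'key' is assumed
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())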
> specific reason to use that?
>
> BR,
> G
>
>
> On Thu, Sep 3, 2020 at 11:41 AM András Kolbert
> wrote:
Hi All,
I have a Spark streaming application (2.4.4, Kafka 0.8 >> so Spark Direct
Streaming) running just fine.
I create a context in the following way:
ssc = StreamingContext(sc, 60)
opts = {"metadata.broker.list": kafka_hosts,
        "auto.offset.reset": "largest",
        "group.id": run_type}
kvs = Kafk
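
(The snippet cuts off at "kvs = Kafk"; presumably it continues with
KafkaUtils.createDirectStream. A sketch under that assumption; the topic
list is a placeholder.)

from pyspark.streaming.kafka import KafkaUtils

# Kafka 0.8 direct stream: one RDD partition per Kafka partition,
# with offsets tracked by Spark itself rather than ZooKeeper.
kvs = KafkaUtils.createDirectStream(ssc, ["my_topic"], opts)  # topic assumed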
Hi,
I have a streaming job (Spark 2.4.4) in which the memory usage keeps
increasing over time.
Periodically (every 20-25 mins) the executors fall over
(org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 6987) because they run out of memory. In the UI, I can
see that the
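
(The message is truncated, but this MetadataFetchFailedException pattern
usually means executors died under memory pressure and their shuffle files
went with them. A hedged sketch of common first mitigations; every value is
an assumption to tune.)

from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.executor.memory", "6g")                    # placeholder size
        .set("spark.executor.memoryOverhead", "2g")            # off-heap headroom
        .set("spark.streaming.backpressure.enabled", "true"))  # stop batch pile-up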