Hi all,
In Spark Streaming I want to count some metrics by day, but the mapWithState method has no API for this. Of course, I can achieve it by adding time information to each record, but I would still prefer a Spark API implementation. So, is there any direct or indirect way to do this?
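A minimal sketch of that record-keying workaround, assuming the input is a DStream[(String, Long)] of (metricName, value) pairs; the function name is hypothetical and the day is taken at processing time:

import java.time.LocalDate
import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Fold the day into the key so mapWithState tracks one running total
// per (day, metric). Requires checkpointing: ssc.checkpoint(dir).
def dailyTotals(metrics: DStream[(String, Long)]): DStream[((String, String), Long)] = {
  val keyedByDay = metrics.map { case (name, v) =>
    ((LocalDate.now.toString, name), v)  // e.g. ("2019-02-27", "clicks")
  }
  val spec = StateSpec.function(
    (key: (String, String), value: Option[Long], state: State[Long]) => {
      val total = state.getOption.getOrElse(0L) + value.getOrElse(0L)
      state.update(total)
      (key, total)  // emit the running total for that (day, metric)
    })
  keyedByDay.mapWithState(spec)
}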
Hi,
I ran into a pretty weird issue with to_avro and from_avro, where the data in a struct was not parsed back correctly. Please see the simple, self-contained example below. I am using Spark 2.4; I am not sure if I missed something.
This is how I start the spark-shell on my Mac:
./bin/
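The command and the repro were cut off in the archive; below is a minimal sketch of the kind of struct round trip being described, run in spark-shell with the spark-avro package on the classpath (column and record names are hypothetical):

import org.apache.spark.sql.avro._
import org.apache.spark.sql.functions.struct

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
  .select(struct($"id", $"name").as("s"))

// Avro schema matching the struct column
val avroSchema =
  """{"type":"record","name":"s","fields":[
    |  {"name":"id","type":"int"},
    |  {"name":"name","type":"string"}]}""".stripMargin

// Serialize the struct to Avro bytes, then parse it back
df.select(to_avro($"s").as("avro"))
  .select(from_avro($"avro", avroSchema).as("s"))
  .select("s.*")
  .show()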
Hi Guillermo,
What was the interval between restarting the Spark job? By default, a Kafka broker deletes the offsets for a consumer group after 24 hours of inactivity.
In such a case, a newly started Spark streaming job will read offsets from the beginning for the same groupId.
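For reference, these are the consumer settings involved (a sketch with hypothetical broker, topic, and group id; requires kafka-clients on the classpath):

import org.apache.kafka.common.serialization.StringDeserializer

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",              // hypothetical
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "my-streaming-group",       // hypothetical
  // Once the broker has expired the group's committed offsets, this is
  // what the job falls back to; "earliest" replays from the beginning:
  "auto.offset.reset"  -> "earliest"
)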
Akshay Bhardwaj
If it helps, below is a sample query progress report that I am able to fetch from the streaming query:
{
"id" : "f2cb24d4-622e-4355-b315-8e440f01a90c",
"runId" : "6f3834ff-10a9-4f57-ae71-8a434ee519ce",
"name" : "query_name_1",
"timestamp" : "2019-02-27T06:06:58.500Z",
"batchId" : 3725,
"num
Hi Experts,
I have a structured streaming query running on Spark 2.3 over a YARN cluster, with the following setup (a sketch of it follows the list):
- reading JSON messages from a Kafka topic, with:
  - maxOffsetsPerTrigger set to 5000
- trigger interval of my writeStream task is 500 ms
- streaming dataset is defined as eve
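Here is the sketch of that setup (the schema, topic, and console sink are placeholders; only maxOffsetsPerTrigger and the trigger interval reflect the list above):

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType().add("eventId", StringType)  // placeholder schema

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "events")                     // placeholder topic
  .option("maxOffsetsPerTrigger", "5000")            // cap records per micro-batch
  .load()
  .select(from_json($"value".cast("string"), schema).as("data"))

val query = events.writeStream
  .format("console")                                 // placeholder sink
  .trigger(Trigger.ProcessingTime("500 milliseconds"))
  .start()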
I've been facing this issue for the past few months too.
I always thought it was an infrastructure issue, but we were never able to
figure out what the infra issue was.
If others are facing this issue too, then maybe it's a valid bug.
Does anyone have any ideas on how we can debug this?
Hello,
I recently noticed that Spark doesn't optimize joins when we limit them.
Say we have:
payment.join(customer, Seq("customerId"), "left").limit(1).explain(true)
Spark doesn't optimize it:
> == Physical Plan ==
> CollectLimit 1
> +- *(5) Project [customerId#29, paymentId#28,
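A minimal way to reproduce this in spark-shell, with hypothetical schemas:

val payment  = Seq((1, 10), (2, 20)).toDF("paymentId", "customerId")
val customer = Seq((10, "a"), (20, "b")).toDF("customerId", "name")

// The limit shows up as a CollectLimit above the join in the physical
// plan, rather than being pushed down past it:
payment.join(customer, Seq("customerId"), "left").limit(1).explain(true)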