Fw: how to reset streaming state regularly

2019-02-26 Thread shicheng31...@gmail.com
Hi all: In Spark Streaming, I want to count some metrics by day, but the "mapWithState" method has no API for this. Of course, I can achieve it by adding some time information to the record, but I would still prefer to do it with the Spark API directly. So, is there any direct or
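
A minimal sketch of the workaround mentioned above, not a definitive answer: the date is folded into the key so each day gets its own state entry, and a StateSpec timeout lets old days expire. It assumes an input DStream `events` of (metricName, value) pairs and a running StreamingContext; names and the timeout value are illustrative.

    import java.time.LocalDate
    import org.apache.spark.streaming.{Seconds, State, StateSpec}

    // events: DStream[(String, Long)] of (metricName, value) pairs -- assumed to exist.
    // Keying by (metric, day) gives one state entry per day; the timeout drops
    // entries that stop receiving data (i.e. past days).
    def updateDaily(key: (String, String), value: Option[Long],
                    state: State[Long]): ((String, String), Long) = {
      val total = state.getOption.getOrElse(0L) + value.getOrElse(0L)
      state.update(total)
      (key, total)
    }

    val spec = StateSpec.function(updateDaily _).timeout(Seconds(48 * 3600))  // keep ~2 days

    val dailyCounts = events
      .map { case (metric, v) => ((metric, LocalDate.now.toString), v) }
      .mapWithState(spec)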

to_avro and from_avro not working with struct type in spark 2.4

2019-02-26 Thread Hien Luu
Hi, I ran into a pretty weird issue with to_avro and from_avro where the data in a struct was not parsed back correctly. Please see the simple, self-contained example below. I am using Spark 2.4. I am not sure if I missed something. This is how I start the spark-shell on my Mac:
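
The original example is cut off above; the following is an assumed minimal round trip of a struct column through to_avro/from_avro on Spark 2.4, not the poster's code. The package coordinate, column names, and Avro schema are placeholders.

    // launched e.g. with: spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0
    import org.apache.spark.sql.avro.{from_avro, to_avro}
    import org.apache.spark.sql.functions.struct
    import spark.implicits._

    // Placeholder data with a nested struct column.
    val df = Seq((1, "alice", 30), (2, "bob", 25))
      .toDF("id", "name", "age")
      .select($"id", struct($"name", $"age").as("person"))

    // Avro schema assumed to correspond to the struct column above; a mismatch
    // (e.g. in field nullability) between this reader schema and what to_avro
    // emits is a common source of wrong results.
    val personSchema =
      """{"type":"record","name":"person","fields":[
        |  {"name":"name","type":["string","null"]},
        |  {"name":"age","type":["int","null"]}
        |]}""".stripMargin

    val roundTrip = df
      .select($"id", to_avro($"person").as("avro"))
      .select($"id", from_avro($"avro", personSchema).as("person"))

    roundTrip.show(false)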

Re: Spark Streaming - Problem managing Kafka offsets: job restarts from the beginning.

2019-02-26 Thread Akshay Bhardwaj
Hi Guillermo, What was the interval between restarts of the Spark job? As a feature in Kafka, a broker deletes the offsets of a consumer group after 24 hours of inactivity. In such a case, the newly started Spark Streaming job will read offsets from the beginning for the same groupId. Akshay Bhardwaj
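
If committed offsets may not survive the broker's retention window, one common pattern with the spark-streaming-kafka-0-10 integration is to commit offsets explicitly after each batch and to choose auto.offset.reset deliberately. A rough sketch under those assumptions, with placeholder broker/topic/group names and an assumed StreamingContext `ssc`:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    // Placeholder connection settings; the 24h window itself is the broker-side
    // setting offsets.retention.minutes.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-group",
      "auto.offset.reset" -> "latest",              // where to start if no committed offset survives
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd ...
      // Commit back to Kafka only once this batch's output is safely written.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }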

Re: Spark 2.3 | Structured Streaming | Metric for numInputRows

2019-02-26 Thread Akshay Bhardwaj
If it helps, below is the progress report for the same query, fetched from the streaming query handle: { "id" : "f2cb24d4-622e-4355-b315-8e440f01a90c", "runId" : "6f3834ff-10a9-4f57-ae71-8a434ee519ce", "name" : "query_name_1", "timestamp" : "2019-02-27T06:06:58.500Z", "batchId" : 3725,
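
For reference, a report like the one quoted above can be pulled from a running query. A minimal sketch (the `query` handle and the listener registration are assumptions, not taken from the thread):

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

    // `query` is the handle returned by writeStream.start(); lastProgress is null
    // until the first batch has completed.
    println(query.lastProgress)                  // the full report, as quoted above
    println(query.lastProgress.numInputRows)     // rows read in the last trigger

    // Or subscribe to every progress update:
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(e: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(e: QueryProgressEvent): Unit =
        println(s"batch ${e.progress.batchId}: ${e.progress.numInputRows} input rows")
    })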

Spark 2.3 | Structured Streaming | Metric for numInputRows

2019-02-26 Thread Akshay Bhardwaj
Hi Experts, I have a structured streaming query running on Spark 2.3 on a YARN cluster, with the features below: - reading JSON messages from a Kafka topic with maxOffsetsPerTrigger set to 5000 - trigger interval of my writeStream task is 500ms - streaming dataset is defined as
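
The rough shape of such a query, reconstructed only from the features listed above; bootstrap servers, topic, schema, sink, and checkpoint path are placeholders rather than the poster's actual definitions.

    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.streaming.Trigger
    import org.apache.spark.sql.types._
    import spark.implicits._

    val messageSchema = new StructType().add("field1", StringType)   // placeholder schema

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "my-topic")
      .option("maxOffsetsPerTrigger", 5000L)       // cap on records per micro-batch
      .load()

    val parsed = raw
      .selectExpr("CAST(value AS STRING) AS json")
      .select(from_json($"json", messageSchema).as("msg"))

    val query = parsed.writeStream
      .format("console")                           // placeholder sink
      .trigger(Trigger.ProcessingTime("500 milliseconds"))
      .option("checkpointLocation", "/tmp/chk")    // placeholder path
      .start()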

Re: Occasional broadcast timeout when dynamic allocation is on

2019-02-26 Thread Abdeali Kothari
I've been facing this issue for the past few months too. I always thought it was an infrastructure issue, but we were never able to figure out what that infra issue was. If others are facing it too, then maybe it's a valid bug. Does anyone have any ideas on how we can debug this? On
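
No diagnosis is offered in the thread; the sketch below only lists configuration knobs that are commonly adjusted while investigating broadcast timeouts, with illustrative values.

    // Not a fix for the underlying issue -- just the usual starting points.
    spark.conf.set("spark.sql.broadcastTimeout", "600")            // default is 300 seconds
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)    // disable broadcast joins entirely
    // With dynamic allocation, keeping a floor of executors can reduce the churn
    // that delays broadcasts (set at submit time):
    //   --conf spark.dynamicAllocation.minExecutors=2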

Spark sql join optimizations

2019-02-26 Thread Akhilanand
Hello, I recently noticed that Spark doesn't optimize joins when we limit the result. Say we have payment.join(customer, Seq("customerId"), "left").limit(1).explain(true); Spark doesn't optimize it. > == Physical Plan == > CollectLimit 1 > +- *(5) Project [customerId#29, paymentId#28,
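
For comparison, assuming `payment` and `customer` are DataFrames shaped as in the snippet, the two plans below make the behaviour easy to inspect; the manual rewrite is only equivalent when customerId is unique in customer.

    // Limit applied after the join: the full join runs, then CollectLimit keeps one row.
    payment.join(customer, Seq("customerId"), "left").limit(1).explain(true)

    // Limiting the left side first (equivalent only if customerId is unique in
    // customer) lets the join see a single payment row instead of the whole table.
    payment.limit(1).join(customer, Seq("customerId"), "left").explain(true)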