structured streaming with mapGroupWithState

2020-03-11 Thread Srinivas V
Anyone using this combination for prod? I am planning to use for a use case with 15000 events per second from few Kafka topics. Through events are big, I would just have to take the businessIds, frequency, first and last event timestamp and save this into mapGroupWithState. I need to keep them for

Re: Time-based frequency table at scale

2020-03-11 Thread Nicolas Paris
Hi, did you try exploding the arrays, then doing the aggregation/count and at the end applying a udf to add the 0 values ? my experience is working on arrays is usually a bad idea. sakag writes: > Hi all, > > We have a rather interesting use case, and are struggling to come up with an >

Re: Time-based frequency table at scale

2020-03-11 Thread Enrico Minack
An interesting puzzle indeed. What is your measure of "that scales"? Does not fail, does not spill, does not need a huge amount of memory / disk, is O(N), processes X records per second and core? Enrico Am 11.03.20 um 16:59 schrieb sakag: Hi all, We have a rather interesting use case,

Re: ForEachBatch collecting batch to driver

2020-03-11 Thread Burak Yavuz
foreachBatch gives you the micro-batch as a DataFrame, which is distributed. If you don't call collect on that DataFrame, it shouldn't have any memory implications on the Driver. On Tue, Mar 10, 2020 at 3:46 PM Ruijing Li wrote: > Hi all, > > I’m curious on how foreachbatch works in spark

Time-based frequency table at scale

2020-03-11 Thread sakag
Hi all, We have a rather interesting use case, and are struggling to come up with an approach that scales. Reaching out to seek your expert opinion/feedback and tips. What we are trying to do is to find the count of numerical ids over a sliding time window where each of our data records has

Error in using hbase-spark connector

2020-03-11 Thread PRAKASH GOPALSAMY
Hi Team, We are trying to read hbase table from spark using hbase-spark connector. But our job is failing in the pushdown part of the filter in stage 0, due the below error. kindly help us to resolve this issue. caused by : java.lang.NoClassDefFoundError: scala/collection/immutable/StringOps at