Ordering pushdown for Spark Datasources

2021-04-04 Thread Kohki Nishio
Hello, I'm trying to use Spark SQL as a log analytics solution. As you might guess, for most use cases the data is ordered by timestamp and the amount of data is large. If I want to show the first 100 entries (ordered by timestamp) for a given condition, the Spark executor has to scan all the entries
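A minimal sketch (not from the original message) of the query pattern described above, assuming a Parquet source; the path and the "level"/"timestamp" column names are illustrative. Without an ordering/limit pushdown in the data source, the orderBy + limit below still forces the executors to read and sort every matching row:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object TopNLogs {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("top-n-logs").getOrCreate()

        // Illustrative source path and column names.
        val logs = spark.read.parquet("/data/logs")

        // Without a pushdown, Spark reads and sorts every row matching the
        // filter before it can return the first 100 entries.
        val first100 = logs
          .where(col("level") === "ERROR")
          .orderBy(col("timestamp"))
          .limit(100)

        first100.show(100, truncate = false)
      }
    }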

Re: Spark structured streaming + offset management in kafka + kafka headers

2021-04-04 Thread Gabor Somogyi
Just to be crystal clear: DStreams will be deprecated sooner or later and there will be no support, so it is highly advised to migrate... G

Re: Spark structured streaming + offset management in kafka + kafka headers

2021-04-04 Thread Ali Gouta
Thanks Mich! Ali Gouta.

Re: Spark structured streaming + offset management in kafka + kafka headers

2021-04-04 Thread Mich Talebzadeh
Hi Ali, The old saying that one experiment is worth a hundred hypotheses still stands. As per the test-driven approach, have a go at it and see what comes out. Forum members, including myself, have reported on SSS in the Spark user group, so you are at home on this. HTH

Re: Spark structured streaming + offset management in kafka + kafka headers

2021-04-04 Thread Ali Gouta
Great, so SSS also provides an API that allows handling RDDs through DataFrames using foreachBatch. Still, I am not sure this is a good practice in general, right? Well, it depends on the use case anyway. Thank you so much for the hints! Best regards, Ali Gouta.

Re: Spark structured streaming + offset management in kafka + kafka headers

2021-04-04 Thread Mich Talebzadeh
Hi Ali, On the practical side, I have used both the old DStreams and the newer Spark Structured Streaming (SSS). SSS does a good job at the micro-batch level in the form of foreachBatch(SendToSink). "foreach" performs custom write logic on each row, and "foreachBatch" performs custom write logic on each micro-batch.
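A minimal foreachBatch sketch (not from the thread), assuming a Kafka source; the broker address, topic, and output paths are placeholders. Each micro-batch arrives as an ordinary DataFrame, so regular batch write logic can be reused inside the function:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object ForeachBatchSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sss-foreachbatch").getOrCreate()

        // Placeholder Kafka source.
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

        // Custom write logic applied to each micro-batch.
        val writeBatch: (DataFrame, Long) => Unit = (batchDF, batchId) =>
          batchDF.write.mode("append").parquet(s"/tmp/events/batch=$batchId")

        events.writeStream
          .foreachBatch(writeBatch)
          .option("checkpointLocation", "/tmp/checkpoints/events")
          .start()
          .awaitTermination()
      }
    }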

Re: Spark structured streaming + offset management in kafka + kafka headers

2021-04-04 Thread Ali Gouta
Thank you guys for your answers. I will dig more into this new way of doing things and consider leaving the old DStreams to use Structured Streaming instead. Hopefully Structured Streaming + Spark on Kubernetes works well and the combination is production ready. Best regards, Ali Gouta.

Re: Spark structured streaming + offset management in kafka + kafka headers

2021-04-04 Thread Jacek Laskowski
Hi, Just to add to Gabor's excellent answer: checkpointing and offsets are infrastructure-related and should not really be in the hands of Spark devs, who should instead focus on the business purpose of the code (not offsets, which are very low-level and not really that important).

Re: Writing to Google Cloud Storage with v2 algorithm safe?

2021-04-04 Thread Jacek Laskowski
Hi Vaquar, Thanks a lot! Accepted as the answer (yet there was the other answer that was very helpful too). Tons of reading ahead to understand it more. That once again makes me feel that Hadoop MapReduce experience would help a great deal (and I've got none). Pozdrawiam, Jacek Laskowski

Re: Spark structured streaming + offset management in kafka + kafka headers

2021-04-04 Thread Gabor Somogyi
There is no way to store offsets in Kafka and restart from the stored offset. Structured Streaming stores offsets in the checkpoint and restarts from there without any user code. Offsets can be stored with a listener, but that can only be used for lag calculation. BR, G
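A minimal sketch (not from the thread) of what that looks like in practice, assuming a Kafka source; broker, topic, and checkpoint paths are placeholders. The checkpointLocation is what the query resumes from; the listener only observes offsets, e.g. for lag monitoring:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

    object OffsetHandlingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sss-offsets").getOrCreate()

        // Offsets are tracked in the checkpoint directory, not in Kafka;
        // on restart the query resumes from the checkpoint automatically.
        val df = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()

        // A listener can observe per-batch offsets (e.g. to publish them for
        // lag monitoring), but it is not used to resume the query.
        spark.streams.addListener(new StreamingQueryListener {
          override def onQueryStarted(event: QueryStartedEvent): Unit = ()
          override def onQueryProgress(event: QueryProgressEvent): Unit =
            event.progress.sources.foreach(s => println(s"end offsets: ${s.endOffset}"))
          override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
        })

        df.writeStream
          .format("console")
          .option("checkpointLocation", "/tmp/checkpoints/events")
          .start()
          .awaitTermination()
      }
    }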