The following Spark Streaming settings <http://spark.apache.org/docs/latest/configuration.html#spark-streaming> may be what you're looking for:
- spark.streaming.backpressure.enabled
- spark.streaming.backpressure.initialRate
- spark.streaming.receiver.maxRate
- spark.streaming.kafka.maxRatePerPartition

(A rough sketch of setting these is at the end of this message.)

On Mon, Mar 19, 2018 at 5:27 PM, kant kodali <kanth...@gmail.com> wrote:

> Yes, it indeed makes sense! Is there a way to get incremental counts when
> I start from 0 and go through 10M records? Perhaps a count for every
> micro batch or something?
>
> On Mon, Mar 19, 2018 at 1:57 PM, Geoff Von Allmen <ge...@ibleducation.com>
> wrote:
>
>> Trigger does not mean report the current result every 'trigger seconds'.
>> It means it will attempt to fetch new data and process it no faster than
>> trigger-second intervals.
>>
>> If you're reading from the beginning and you've got 10M entries in
>> Kafka, it's likely pulling everything down, processing it completely,
>> and then giving you an initial output. From here on out, it will check
>> Kafka every 1 second for new data and process it, showing you only the
>> updated rows. So the initial read gives you the entire output, since
>> there is nothing to be 'updating' from. If you add data to Kafka now
>> that the streaming job has completed its first batch (and leave it
>> running), it will then show you the new/updated rows since the last
>> batch every 1 second (assuming it can fetch + process in that time
>> span).
>>
>> If the combined fetch + processing time is > the trigger time, you will
>> notice warnings that it is 'falling behind' (I forget the exact
>> verbiage, but something to the effect of "the calculation took XX time
>> and is falling behind"). In that case, it will immediately check Kafka
>> for new messages and begin processing the next batch (if new messages
>> exist).
>>
>> Hope that makes sense -
>>
>>
>> On Mon, Mar 19, 2018 at 13:36 kant kodali <kanth...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I have 10 million records in my Kafka and I am just trying to run
>>> spark.sql(select count(*) from kafka_view). I am reading from Kafka
>>> and writing to Kafka.
>>>
>>> My writeStream is set to "update" mode with a trigger interval of one
>>> second (Trigger.ProcessingTime(1000)). I expect the counts to be
>>> printed every second, but it looks like it only prints after going
>>> through all 10M. Why?
>>>
>>> Also, it seems to take forever, whereas a Linux wc of 10M rows would
>>> take about 30 seconds.
>>>
>>> Thanks!
>>>
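
In case it helps, here is a minimal sketch of how the settings listed at the top could be passed in. These are Spark Streaming (DStream) configs, so the sketch builds a StreamingContext; the rate numbers are placeholders, not recommendations:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Placeholder values only; tune the rates for your cluster and partition count.
    val conf = new SparkConf()
      .setAppName("rate-limited-stream")
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.backpressure.initialRate", "10000")   // records/sec for the first batch, before backpressure adapts
      .set("spark.streaming.receiver.maxRate", "10000")           // hard cap per receiver
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")   // hard cap per Kafka partition (direct stream)

    val ssc = new StreamingContext(conf, Seconds(1))

The same values can also be supplied on the command line, e.g. spark-submit --conf spark.streaming.backpressure.enabled=true ...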
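
And for reference, a rough sketch of the kind of Structured Streaming query described in the quoted thread above (Kafka source, running count, "update" mode, 1-second processing-time trigger). The topic name and bootstrap servers are made up, and it writes to the console sink rather than back to Kafka to keep the example short:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("kafka-count").getOrCreate()

    // Read everything from the beginning of the topic (assumed names/addresses).
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my_topic")
      .option("startingOffsets", "earliest")
      .load()

    // Single running count of all records seen so far.
    val counts = df.groupBy().count()

    // "update" mode re-emits only rows whose aggregate changed in the last batch,
    // no faster than once per second.
    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .trigger(Trigger.ProcessingTime(1000))
      .start()

    query.awaitTermination()

With 10M records already in the topic, the first batch reads and aggregates all of them before the first count appears; subsequent batches then update every second as new data arrives, which matches the behavior Geoff describes.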