Even with the new direct stream in 1.3, isn't it the case that it is the job *scheduling* that follows the partition order, rather than the job *execution*? Or does the stream listen for the job-completion event (using a StreamingListener) before scheduling the next batch? To compare with Storm from a message-ordering point of view: unless a tuple is fully processed by the DAG (as defined by spout + bolts), the next tuple does not enter the DAG.
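
For concreteness, here is a rough sketch of the kind of listener I have in mind (the class name and the log line are just illustrative, and ssc is assumed to be an already-created StreamingContext):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Reacts to batch-completion events; something like this could, in principle,
// be used to gate submission of the next batch's work.
class BatchCompletionListener extends StreamingListener {
  override def onBatchCompleted(completed: StreamingListenerBatchCompleted): Unit = {
    val info = completed.batchInfo
    println(s"Batch ${info.batchTime} completed, processing delay: ${info.processingDelay.getOrElse(-1L)} ms")
  }
}

// Registration on an existing StreamingContext:
// ssc.addStreamingListener(new BatchCompletionListener)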
On Thu, Feb 19, 2015 at 9:47 PM, Cody Koeninger <c...@koeninger.org> wrote:

> Kafka ordering is guaranteed on a per-partition basis.
>
> The high-level consumer API, as used by the Spark Kafka streams prior to 1.3, will consume from multiple Kafka partitions, thus not giving any ordering guarantees.
>
> The experimental direct stream in 1.3 uses the "simple" consumer API, and there is a 1:1 correspondence between Spark partitions and Kafka partitions. So you will get deterministic ordering, but only on a per-partition basis.
>
> On Thu, Feb 19, 2015 at 11:31 PM, Neelesh <neele...@gmail.com> wrote:
>
>> I had a chance to talk to TD today at the Strata+Hadoop Conf in San Jose. We talked a bit about this after his presentation - the short answer is that Spark Streaming does not guarantee any sort of ordering (within batches, across batches). One would have to use updateStateByKey to collect the events and sort them based on some attribute of the event. But TD said message ordering is a frequently requested feature recently and is getting on his radar.
>>
>> I went through the source code and there does not seem to be any architectural/design limitation to supporting this. (JobScheduler and JobGenerator are a good starting point to see how stuff works under the hood.) Overriding DStream#compute and using a StreamingListener looks like a simple way of ensuring ordered execution of batches within a stream. But this would be a partial solution, since ordering within a batch needs some more work that I don't fully understand yet.
>>
>> Side note: my custom receiver polls the MetricsServlet once in a while to decide whether jobs are getting done fast enough, and throttles/relaxes pushing data into receivers based on the numbers provided by the MetricsServlet. I had to do this because out-of-the-box rate limiting right now is static and cannot adapt to the state of the cluster.
>>
>> thnx
>> -neelesh
>>
>> On Wed, Feb 18, 2015 at 4:13 PM, jay vyas <jayunit100.apa...@gmail.com> wrote:
>>
>>> This is a *fantastic* question. The idea of how we identify individual things in multiple DStreams is worth looking at.
>>>
>>> The reason being that you can then fine-tune your streaming job based on the RDD identifiers (i.e., do the timestamps from the producer correlate closely to the order in which RDD elements are being produced?). If *NO*, then you need to either (1) dial up throughput on producer sources or (2) increase cluster size so that Spark is capable of evenly handling the load.
>>>
>>> You can't decide to do (1) or (2) unless you can track when the streaming elements are being converted to RDDs by Spark itself.
>>>
>>> On Wed, Feb 18, 2015 at 6:54 PM, Neelesh <neele...@gmail.com> wrote:
>>>
>>>> There does not seem to be a definitive answer on this. Every time I google for message ordering, the only relevant thing that comes up is this:
>>>> http://samza.apache.org/learn/documentation/0.8/comparisons/spark-streaming.html
>>>>
>>>> With a Kafka receiver that pulls data from a single partition of a Kafka topic, are individual messages in the microbatch in the same order as in the Kafka partition? Are successive microbatches originating from a Kafka partition executed in order?
>>>>
>>>> Thanks!
>>>
>>> --
>>> jay vyas
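
P.S. For anyone finding this thread in the archives later, here is a minimal sketch of what using the 1.3 direct stream Cody describes looks like, as I understand it. The broker address, topic name, and batch interval below are placeholders, not anything from this thread:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamOrderingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-stream-ordering-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder broker list and topic name.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    // One Spark partition per Kafka partition; within a partition the
    // (key, message) pairs arrive in Kafka offset order.
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        partition.foreach { case (key, message) =>
          // Process (key, message) here; this iteration preserves per-partition order.
          println(s"$key -> $message")
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}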