Even with the new direct streams in 1.3, isn't it the case that job
*scheduling* follows the partition order, rather than job *execution*? Or
does the stream listen for a job-completion event (using a
StreamingListener) before scheduling the next batch?  To compare with Storm
from a message-ordering point of view: unless a tuple is fully processed by
the DAG (as defined by spout+bolts), the next tuple does not enter the DAG.


On Thu, Feb 19, 2015 at 9:47 PM, Cody Koeninger <c...@koeninger.org> wrote:

> Kafka ordering is guaranteed on a per-partition basis.
>
> The high-level consumer API as used by the Spark Kafka streams prior to
> 1.3 will consume from multiple Kafka partitions, thus not giving any
> ordering guarantees.
>
> The experimental direct stream in 1.3 uses the "simple" consumer API, and
> there is a 1:1 correspondence between Spark partitions and Kafka
> partitions.  So you will get deterministic ordering, but only on a
> per-partition basis.
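>
> For reference, a minimal sketch of using the direct stream that way (the
> broker list, topic name and batch interval below are just placeholders):
>
>   import kafka.serializer.StringDecoder
>   import org.apache.spark.SparkConf
>   import org.apache.spark.streaming.{Seconds, StreamingContext}
>   import org.apache.spark.streaming.kafka.KafkaUtils
>
>   val conf = new SparkConf().setAppName("direct-order").setMaster("local[2]")
>   val ssc = new StreamingContext(conf, Seconds(5))
>   val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
>   val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
>     ssc, kafkaParams, Set("events"))
>
>   // Each RDD partition maps 1:1 to a Kafka partition, so iterating a
>   // partition sees messages in Kafka offset order; there is no ordering
>   // guarantee across partitions.
>   stream.foreachRDD { rdd =>
>     rdd.foreachPartition { part =>
>       part.foreach { case (_, value) => println(value) }
>     }
>   }
>
>   ssc.start()
>   ssc.awaitTermination()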
>
> On Thu, Feb 19, 2015 at 11:31 PM, Neelesh <neele...@gmail.com> wrote:
>
>> I had a chance to talk to TD today at the Strata+Hadoop Conf in San Jose.
>> We talked a bit about this after his presentation - the short answer is
>> that Spark Streaming does not guarantee any sort of ordering (within
>> batches or across batches).  One would have to use updateStateByKey to
>> collect the events and sort them based on some attribute of the event.
>> But TD said message ordering is a frequently requested feature lately and
>> is getting on his radar.
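>>
>> A rough sketch of what that updateStateByKey approach might look like
>> (the Event type, its timestamp field and the keying are my assumptions,
>> not anything TD prescribed; it also needs ssc.checkpoint(...) set):
>>
>>   // Hypothetical event type with a timestamp attribute to sort on
>>   case class Event(key: String, ts: Long, payload: String)
>>
>>   // Accumulate events per key and keep them sorted by timestamp
>>   val updateFunc: (Seq[Event], Option[Seq[Event]]) => Option[Seq[Event]] =
>>     (newEvents, state) => {
>>       val merged = state.getOrElse(Seq.empty[Event]) ++ newEvents
>>       Some(merged.sortBy(_.ts))
>>     }
>>
>>   // events: DStream[(String, Event)], keyed however the application needs
>>   val orderedPerKey = events.updateStateByKey(updateFunc)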
>>
>> I went through the source code and there does not seem to be any
>> architectural/design limitation preventing support for this.  (JobScheduler
>> and JobGenerator are a good starting point to see how stuff works under the
>> hood.)  Overriding DStream#compute and using a StreamingListener looks like
>> a simple way of ensuring ordered execution of batches within a stream. But
>> this would be a partial solution, since ordering within a batch needs some
>> more work that I don't understand fully yet.
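>>
>> The listener hook I'm referring to is roughly this (just the skeleton;
>> wiring it into a DStream#compute override is the part I haven't worked out):
>>
>>   import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}
>>
>>   // Tracks batch completion; a compute override could consult this before
>>   // emitting the next batch's RDD, to keep batches strictly ordered.
>>   class BatchOrderListener extends StreamingListener {
>>     @volatile var lastCompletedBatch: Long = -1L
>>
>>     override def onBatchCompleted(done: StreamingListenerBatchCompleted): Unit = {
>>       lastCompletedBatch = done.batchInfo.batchTime.milliseconds
>>     }
>>   }
>>
>>   // registered once on the streaming context:
>>   // ssc.addStreamingListener(new BatchOrderListener)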
>>
>> Side note: my custom receiver polls the MetricsServlet once in a while
>> to decide whether jobs are getting done fast enough, and throttles/relaxes
>> pushing data into the receivers based on the numbers it provides. I had to
>> do this because the out-of-the-box rate limiting right now is static and
>> cannot adapt to the state of the cluster.
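>>
>> Heavily simplified, that receiver looks something like this (the metrics
>> URL, the "busy" heuristic and the dummy payload are placeholders for what
>> we actually do):
>>
>>   import scala.io.Source
>>   import org.apache.spark.storage.StorageLevel
>>   import org.apache.spark.streaming.receiver.Receiver
>>
>>   class ThrottlingReceiver(metricsUrl: String)
>>     extends Receiver[String](StorageLevel.MEMORY_ONLY) {
>>
>>     override def onStart(): Unit = {
>>       new Thread("throttling-receiver") {
>>         override def run(): Unit = receive()
>>       }.start()
>>     }
>>
>>     override def onStop(): Unit = {}
>>
>>     // Placeholder heuristic: the real code parses the MetricsServlet JSON
>>     // and compares batch processing time against the batch interval
>>     private def clusterLooksBusy(): Boolean = {
>>       val metricsJson = Source.fromURL(metricsUrl).mkString
>>       metricsJson.contains("processingDelay")   // stand-in condition only
>>     }
>>
>>     private def receive(): Unit = {
>>       while (!isStopped()) {
>>         if (clusterLooksBusy()) Thread.sleep(1000)      // back off
>>         store("next message from the upstream source")  // placeholder payload
>>       }
>>     }
>>   }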
>>
>> thnx
>> -neelesh
>>
>> On Wed, Feb 18, 2015 at 4:13 PM, jay vyas <jayunit100.apa...@gmail.com>
>> wrote:
>>
>>> This is a *fantastic* question.  The idea of how we identify individual
>>> things in multiple DStreams is worth looking at.
>>>
>>> The reason is that you can then fine-tune your streaming job based on the
>>> RDD identifiers (i.e., do the timestamps from the producer correlate
>>> closely to the order in which RDD elements are being produced?).  If *NO*,
>>> then you need to either (1) dial up throughput on the producer sources or
>>> (2) increase the cluster size so that Spark is capable of evenly handling
>>> the load.
>>>
>>> You can't decide between (1) and (2) unless you can track when the
>>> streaming elements are being converted to RDDs by Spark itself.
>>>
>>>
>>>
>>> On Wed, Feb 18, 2015 at 6:54 PM, Neelesh <neele...@gmail.com> wrote:
>>>
>>>> There does not seem to be a definitive answer on this. Every time I
>>>> google for message ordering, the only relevant thing that comes up is
>>>> this -
>>>> http://samza.apache.org/learn/documentation/0.8/comparisons/spark-streaming.html
>>>> .
>>>>
>>>> With a Kafka receiver that pulls data from a single partition of a
>>>> Kafka topic, are the individual messages in the microbatch in the same
>>>> order as in the Kafka partition? Are successive microbatches originating
>>>> from a Kafka partition executed in order?
>>>>
>>>>
>>>> Thanks!
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> jay vyas
>>>
>>
>>
>
