The difference between streaming and micro-batch is about the ordering of messages:
> Spark Streaming guarantees ordered processing of RDDs in one DStream. Since 
> each RDD is processed in parallel, there is not order guaranteed within the 
> RDD. This is a tradeoff design Spark made. If you want to process the 
> messages in order within the RDD, you have to process them in one thread, 
> which does not have the benefit of parallelism.

More about that:
http://samza.apache.org/learn/documentation/0.10/comparisons/spark-streaming.html
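
To make the ordering point concrete, here is a rough sketch in plain Python. It is not Spark code: `process_record` and `run_micro_batches` are hypothetical names I made up for illustration, a thread pool stands in for the parallel tasks of one RDD, and the sleep delays are invented. Batches are handled strictly one after another, while the records inside a batch may finish in any order.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_record(record):
    """Hypothetical per-record work: sleep for the record's delay, return its id."""
    record_id, delay = record
    time.sleep(delay)
    return record_id


def run_micro_batches(batches, workers=4):
    """Process batches in order; process the records of each batch in parallel.

    Returns a list of (batch_index, record_id) tuples in completion order.
    """
    completion_log = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch_index, batch in enumerate(batches):
            futures = [pool.submit(process_record, record) for record in batch]
            # Wait for the whole batch before the next one starts. This is the
            # only ordering that is enforced: batch k finishes before batch k+1.
            for future in as_completed(futures):
                completion_log.append((batch_index, future.result()))
    return completion_log


if __name__ == "__main__":
    # Illustrative data: (record_id, delay in seconds). Earlier records are
    # slower, so they will usually complete after later ones in the same batch.
    batches = [
        [(0, 0.03), (1, 0.02), (2, 0.01)],
        [(3, 0.02), (4, 0.01)],
    ]
    for batch_index, record_id in run_micro_batches(batches):
        print(f"batch {batch_index}: record {record_id} done")
```

Passing `workers=1` restores arrival order within each batch, at the cost of the parallelism, which is the tradeoff the quote describes.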
 

> On Sep 27, 2016, at 2:12 PM, kant kodali <kanth...@gmail.com> wrote:
> 
> What is the difference between mini-batch and real-time streaming in practice 
> (not theory)? In theory, I understand that mini-batch groups records within a 
> given time frame, whereas real-time streaming acts on each record as it 
> arrives. My biggest question is: why not run mini-batch with an epsilon time 
> frame (say, one millisecond)? I would like to understand why one would be a 
> more effective solution than the other.
> I recently came across an example where mini-batch (Apache Spark) is used 
> for fraud detection and real-time streaming (Apache Flink) is used for fraud 
> prevention. Someone also commented that mini-batches would not be an 
> effective solution for fraud prevention (since the goal is to prevent the 
> transaction from occurring as it happens). Now I wonder why this wouldn't be 
> as effective with mini-batch (Spark). Why is it not effective to run 
> mini-batch with 1 millisecond latency? Batching is a technique used 
> everywhere, including the OS and the kernel TCP/IP stack, where data going to 
> disk or the network is buffered, so what is the convincing factor for 
> saying one is more effective than the other?
> Thanks,
> kant
> 
