The difference between streaming and micro-batching is about the ordering of messages:

> Spark Streaming guarantees ordered processing of RDDs in one DStream. Since
> each RDD is processed in parallel, there is not order guaranteed within the
> RDD. This is a tradeoff design Spark made. If you want to process the
> messages in order within the RDD, you have to process them in one thread,
> which does not have the benefit of parallelism.
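
To make the ordering point concrete, here is a minimal sketch in plain Python. It is only an analogy and does not use Spark's API. The names process_record and run_micro_batches are made up for illustration, and a thread pool stands in for the parallel tasks that work on one RDD. Batches are handled strictly one after another, while records inside a batch finish in whatever order the workers happen to complete them.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_record(record):
    # Hypothetical per-record work; the random sleep lets completion
    # order differ from input order.
    time.sleep(random.uniform(0, 0.005))
    return record


def run_micro_batches(batches, workers=4):
    """Process batches sequentially; records within a batch run in parallel."""
    completed = []  # one list per batch, in completion order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batches:
            futures = [pool.submit(process_record, r) for r in batch]
            # Drain the whole batch before the next one is submitted,
            # so ordering holds across batches but not inside one.
            completed.append([f.result() for f in as_completed(futures)])
    return completed


batches = [[1, 2, 3, 4], [5, 6, 7, 8]]
result = run_micro_batches(batches)
print(result)  # the order inside each inner list may vary between runs
```

Every record of the first batch is done before any record of the second batch starts, but the order inside each inner list can change from run to run. Setting workers=1 restores the order within a batch at the cost of parallelism, which is the tradeoff described in the quote.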
More about that:
http://samza.apache.org/learn/documentation/0.10/comparisons/spark-streaming.html

> On Sep 27, 2016, at 2:12 PM, kant kodali <kanth...@gmail.com> wrote:
>
> What is the difference between mini-batch and real-time streaming in
> practice (not theory)? In theory, I understand that mini-batch is something
> that batches within a given time frame, whereas real-time streaming is more
> like doing something as the data arrives. My biggest question is: why not
> have mini-batch with an epsilon time frame (say, one millisecond)? Or, put
> differently, I would like to understand why one would be a more effective
> solution than the other.
>
> I recently came across an example where mini-batch (Apache Spark) is used
> for fraud detection and real-time streaming (Apache Flink) is used for fraud
> prevention. Someone also commented that mini-batches would not be an
> effective solution for fraud prevention (since the goal is to prevent the
> transaction from occurring as it happens). Now I wonder why this wouldn't be
> as effective with mini-batch (Spark). Why is it not effective to run
> mini-batch with 1 millisecond latency? Batching is a technique used
> everywhere, including the OS and the kernel TCP/IP stack, where data going
> to the disk or the network is indeed buffered. So what is the convincing
> factor for saying one is more effective than the other?
>
> Thanks,
> kant