Posting a comment from my previous mail post:

When data is received from a stream source, the receiver creates blocks of data. A new block is generated every blockInterval milliseconds, so N blocks are created during each batchInterval, where N = batchInterval / blockInterval. An RDD is created on the driver for the blocks created during the batchInterval, and those blocks are the partitions of that RDD.
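To make the N = batchInterval / blockInterval arithmetic concrete, here is a rough sketch in plain Python (not Spark code). The function name `blocks_per_batch` and the interval values are made up for illustration, and it assumes a single receiver and a batch interval that is a whole multiple of the block interval:

```python
def blocks_per_batch(batch_interval_ms: int, block_interval_ms: int) -> int:
    """Blocks one receiver produces per batch, i.e. partitions of the batch RDD.

    Hypothetical helper: assumes a single receiver and uses integer division,
    so a batch interval that is not a multiple of the block interval is
    rounded down.
    """
    if batch_interval_ms <= 0 or block_interval_ms <= 0:
        raise ValueError("intervals must be positive")
    return batch_interval_ms // block_interval_ms


# Illustrative values only: a 2 s batch with a 200 ms block interval.
n = blocks_per_batch(2000, 200)
print(n)  # 10
```

So with these example numbers each batch RDD would have 10 partitions, and a smaller blockInterval would give more partitions per batch.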
Now if you want to repartition based on a key, a shuffle is needed.

On Wed, Aug 12, 2015 at 4:36 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> How does partitioning in spark work when it comes to streaming? What's the
> best way to partition a time series data grouped by a certain tag like
> categories of product video, music etc.
>
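On why repartitioning by key needs a shuffle: the blocks are cut by arrival time, so records with the same tag end up scattered across partitions and have to be moved to be grouped. Below is a minimal plain-Python sketch of that idea. It is not Spark's implementation; the names `partition_for` and `shuffle_by_key`, the sample records, and the use of CRC32 as the hash are all assumptions for illustration. In Spark itself I would expect something along the lines of a `partitionBy` with a `HashPartitioner` on a pair RDD, applied per batch.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a target partition from the key (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_by_key(blocks, num_partitions):
    """Move every (key, value) record to the partition chosen by its key."""
    shuffled = [[] for _ in range(num_partitions)]
    for block in blocks:
        for key, value in block:
            shuffled[partition_for(key, num_partitions)].append((key, value))
    return shuffled


# Hypothetical input: three time-based blocks, tags mixed within each block.
blocks = [
    [("video", 1), ("music", 2)],
    [("music", 3), ("video", 4)],
    [("video", 5)],
]

shuffled = shuffle_by_key(blocks, num_partitions=2)
for index, partition in enumerate(shuffled):
    print(index, partition)
```

After the shuffle, all records for a given tag sit in the same partition, which is what a per-category grouping needs. The cost is that records have to move between partitions, and in a real cluster that means moving data across the network.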