Thanks for the info. When data is written to HDFS, how does Spark keep the filenames written by multiple executors unique?

On Tue, Aug 11, 2015 at 9:35 PM, Hemant Bhanawat <hemant9...@gmail.com> wrote:

> Posting a comment from my previous mail post:
>
> When data is received from a stream source, receiver creates blocks of
> data. A new block of data is generated every blockInterval milliseconds. N
> blocks of data are created during the batchInterval where N =
> batchInterval/blockInterval. A RDD is created on the driver for the blocks
> created during the batchInterval. The blocks generated during the
> batchInterval are partitions of the RDD.
>
> Now if you want to repartition based on a key, a shuffle is needed.
>
> On Wed, Aug 12, 2015 at 4:36 AM, Mohit Anchlia <mohitanch...@gmail.com>
> wrote:
>
>> How does partitioning in spark work when it comes to streaming? What's
>> the best way to partition a time series data grouped by a certain tag like
>> categories of product video, music etc.
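
As a side note on the quoted explanation, the N = batchInterval/blockInterval arithmetic can be sketched in plain Python. This is only an illustration, not Spark code: the function name `blocks_per_batch` and the interval values are assumptions made up for the example, and it assumes a single receiver whose blocks each become one partition of the batch's RDD.

```python
def blocks_per_batch(batch_interval_ms: int, block_interval_ms: int) -> int:
    """Blocks one receiver generates per batch, i.e. the number of
    partitions in that batch's RDD (N = batchInterval / blockInterval)."""
    if batch_interval_ms <= 0 or block_interval_ms <= 0:
        raise ValueError("intervals must be positive")
    # Integer division: only whole blocks completed within the batch count.
    return batch_interval_ms // block_interval_ms


if __name__ == "__main__":
    # Hypothetical settings: a 2 s batch interval and a 200 ms block interval.
    batch_ms = 2000
    block_ms = 200
    print(blocks_per_batch(batch_ms, block_ms))
```

With these assumed values each batch would have 10 partitions. Grouping by a key such as product category would still need a shuffle on top of this, as noted in the quoted reply.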