My streaming app has a requirement that its output be saved in the smallest number of files possible, such that no file exceeds a maximum number of rows. In my experience, each partition is written to a separate output file.
This was really easy to do in batch processing using data frames and RDDs: call count(), decide how many partitions I want, and finally call repartition(). I am having a heck of a time figuring out how to do the same thing with Spark Streaming:

    JavaDStream<Pojo> tidy = ...;
    JavaDStream<Long> counts = tidy.count();

Below is the documentation for count(). I do not see how I can use this to figure out how many partitions I need. DStream does not provide a collect(), foreachRDD() cannot return a value, and I tried using an accumulator but that did not work either. Any suggestions would be greatly appreciated.

http://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/streaming/api/java/JavaDStream.html

count

JavaDStream<java.lang.Long> count()
(http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/api/java/JavaDStream.html)

Return a new DStream in which each RDD has a single element generated by counting each RDD of this DStream.

Returns: (undocumented)
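For reference, here is a minimal sketch of the batch-mode sizing logic I described above. The names (PartitionSizer, maxRowsPerFile, tidy) are illustrative, not from any real API; the Spark calls appear only in comments, and the partition arithmetic itself is plain Java:

```java
// Sketch of the batch-mode approach: count rows, then repartition so that
// no partition (and hence no output file) exceeds a per-file row cap.
public class PartitionSizer {

    // Smallest number of partitions such that no partition holds more than
    // maxRowsPerFile rows: ceiling division, with a minimum of 1.
    public static int partitionsFor(long rowCount, long maxRowsPerFile) {
        return (int) Math.max(1L, (rowCount + maxRowsPerFile - 1) / maxRowsPerFile);
    }

    public static void main(String[] args) {
        // In the batch job this would be roughly:
        //   long rowCount = tidy.count();
        //   JavaRDD<Pojo> sized = tidy.repartition(partitionsFor(rowCount, maxRowsPerFile));
        //   sized.saveAsTextFile(outputPath);  // one file per partition
        System.out.println(partitionsFor(5, 2));  // 3 files for 5 rows at 2 rows/file
    }
}
```

The streaming case is exactly where this breaks down for me: count() on a DStream returns another JavaDStream<Long> rather than a value I can feed into repartition().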