My streaming app has a requirement that output be saved in the smallest
number of files possible, such that no file exceeds a maximum number of
rows. Based on my experience, it appears that each partition is written
to a separate output file.

This was easy to do in my batch processing using DataFrames and RDDs:
call count(), decide how many partitions I want, and then call
repartition().
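
For reference, a minimal sketch of that batch approach (writeCapped,
maxRowsPerFile, and outputPath are placeholder names, not anything from my
real job):

import org.apache.spark.api.java.JavaRDD;

// Count the rows, then size the partition count so no output file
// should exceed maxRowsPerFile; repartition() spreads the rows roughly
// evenly across that many partitions, one file each.
static <T> void writeCapped(JavaRDD<T> tidy, long maxRowsPerFile, String outputPath) {
    long total = tidy.count();
    // One partition per maxRowsPerFile rows, rounded up, never fewer than 1.
    int numPartitions = (int) Math.max(1L, (total + maxRowsPerFile - 1) / maxRowsPerFile);
    tidy.repartition(numPartitions).saveAsTextFile(outputPath);
}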

I am having a heck of a time figuring out how to do the same thing with
Spark Streaming.


JavaDStream<Pojo> tidy = …

// count() yields a DStream whose RDDs each hold a single Long:
// the number of records in that micro-batch.
JavaDStream<Long> counts = tidy.count();



Below is the documentation for count(). I do not see how I can use it to
figure out how many partitions I need. JavaDStream does not provide a
collect(), foreachRDD() cannot return a value, and I tried using an
accumulator but that did not work.
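
To be concrete, here is the shape of the per-batch workaround I have been
experimenting with (writeCappedStream, maxRowsPerFile, and outputPrefix are
placeholder names). Inside the foreachRDD closure the micro-batch is a plain
RDD, so count() and repartition() are available there, but this only caps
rows per batch and produces one output directory per batch interval rather
than the smallest number of files overall:

import org.apache.spark.streaming.api.java.JavaDStream;

// For each micro-batch: count it, size the partition count, and write
// the batch out under a batch-time-stamped path.
static <T> void writeCappedStream(JavaDStream<T> tidy, long maxRowsPerFile, String outputPrefix) {
    tidy.foreachRDD((rdd, time) -> {
        long total = rdd.count();   // per-batch count, usable inside the closure
        if (total == 0) {
            return;                 // skip empty micro-batches
        }
        int numPartitions = (int) Math.max(1L, (total + maxRowsPerFile - 1) / maxRowsPerFile);
        rdd.repartition(numPartitions)
           .saveAsTextFile(outputPrefix + "-" + time.milliseconds());
    });
}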



Any suggestions would be greatly appreciated.


http://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/streaming/api/java/JavaDStream.html

count
JavaDStream<java.lang.Long> count()
Return a new DStream in which each RDD has a single element generated by
counting each RDD of this DStream.
Returns: (undocumented)


