[ https://issues.apache.org/jira/browse/SPARK-19125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15812690#comment-15812690 ]
Sean Owen commented on SPARK-19125:
-----------------------------------

Yes, I don't think a distributed system is a great candidate for reproducible results; there are several stochastic elements. Still, what about queueStream()? You can create a fixed sequence of RDDs to pass to streaming for test-like situations like this.

> Streaming Duration by Count
> ---------------------------
>
>                 Key: SPARK-19125
>                 URL: https://issues.apache.org/jira/browse/SPARK-19125
>             Project: Spark
>          Issue Type: Improvement
>          Components: DStreams
>         Environment: Java
>            Reporter: Paulo Cândido
>
> I use Spark Streaming for scientific work. In these cases, we have to run the same experiment many times with the same seed to obtain the same result. All stochastic components take the seed as input, so I can control them. However, there is one component that does not depend on a seed and that we cannot control: the batch size. Regardless of the stream's input source, the metric used to split the micro-batches is wall time. This is a problem in a scientific setting because running the same experiment with the same parameters many times can yield a different result each time, depending on the number of elements read into each batch. The same stream source may produce different batch sizes across executions because of wall time.
> My suggestion is to provide a new Duration metric: count of elements. Regardless of the time spent filling a micro-batch, batches would always be the same size, and when the source uses a seed to generate the same values, independent of throughput, we would be able to replicate experiments with the same result.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
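The count-based batching the reporter proposes can be illustrated with a minimal, framework-independent sketch (plain Python, not Spark's actual API; the `batch_by_count` helper here is hypothetical, named only for illustration). The point is that split positions depend solely on the element count, so a seeded source replays into identical batches on every run:

```python
import random
from typing import Iterable, Iterator, List

def batch_by_count(stream: Iterable[int], n: int) -> Iterator[List[int]]:
    """Group a stream into fixed-size batches of n elements.

    Unlike wall-time batching, the split points depend only on how
    many elements have arrived, so a seeded source yields identical
    batches on every execution regardless of throughput.
    """
    batch: List[int] = []
    for item in stream:
        batch.append(item)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:  # emit the final, possibly short, batch
        yield batch

# A seeded source stands in for a reproducible stream generator.
random.seed(42)
source = (random.randint(0, 99) for _ in range(10))
batches = list(batch_by_count(source, 4))
# Batch sizes are 4, 4, 2 on every run with this seed.
```

Spark's own `queueStream()` (as Sean suggests) achieves a similar effect for tests by feeding the streaming context a fixed, pre-built sequence of RDDs instead of a live, time-sliced source.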