[ https://issues.apache.org/jira/browse/SPARK-19125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15812690#comment-15812690 ]

Sean Owen commented on SPARK-19125:
-----------------------------------

Yes, I don't think a distributed system is, in general, a great candidate for 
reproducible results. There are several stochastic elements.

Still, what about queueStream()? You can create a fixed sequence of RDDs to 
pass to streaming for test-like situations like this.
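
For example, a minimal sketch of that approach (assuming a local master and a 
driver-side seeded Random as the data source; both are just illustrative 
choices, not part of the proposal):

{code:scala}
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("QueueStreamRepro")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Pre-build a deterministic sequence of fixed-size RDDs on the driver,
    // using a fixed seed so every run produces identical data.
    val rng = new scala.util.Random(42L)
    val batches = mutable.Queue[RDD[Int]]()
    for (_ <- 1 to 10) {
      batches += ssc.sparkContext.parallelize(Seq.fill(1000)(rng.nextInt()))
    }

    // oneAtATime = true: exactly one queued RDD is consumed per batch interval,
    // so batch contents and sizes are the same on every execution.
    val stream = ssc.queueStream(batches, oneAtATime = true)
    stream.count().print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(15000)
    ssc.stop()
  }
}
{code}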

> Streaming Duration by Count
> ---------------------------
>
>                 Key: SPARK-19125
>                 URL: https://issues.apache.org/jira/browse/SPARK-19125
>             Project: Spark
>          Issue Type: Improvement
>          Components: DStreams
>         Environment: Java
>            Reporter: Paulo Cândido
>
> I use Spark Streaming in a scientific setting. In these cases, we have to run 
> the same experiment many times with the same seed to obtain the same result. 
> All random components take the seed as input, so I can control them. 
> However, there is one component that does not depend on a seed and that we 
> can't control: the batch size. Regardless of how the stream is ingested, the 
> metric used to split the micro-batches is wall time. This is a problem in a 
> scientific environment because if we run the same experiment with the same 
> parameters many times, each run can produce a different result, depending on 
> the number of elements read in each batch. The same stream source may generate 
> different batch sizes on multiple executions because of wall time.
> My suggestion is to provide a new Duration metric: count of elements.
> Regardless of the time spent filling a micro-batch, the batches would always 
> be the same size, and when the source uses a seed to generate the same values, 
> independent of throughput, we will be able to replicate experiments with the 
> same results.


