Paulo Cândido created SPARK-19125:
-------------------------------------

             Summary: Streaming Duration by Count
                 Key: SPARK-19125
                 URL: https://issues.apache.org/jira/browse/SPARK-19125
             Project: Spark
          Issue Type: Improvement
          Components: DStreams
         Environment: Java
            Reporter: Paulo Cândido
I use Spark Streaming for scientific work. In this setting, we have to run the same experiment many times with the same seed and obtain the same result. All random components take the seed as input, so I can control them. However, there is one component that does not depend on a seed and cannot be controlled: the batch size. Regardless of how the stream is ingested, the metric used to cut micro-batches is wall-clock time. This is a problem in a scientific environment because running the same experiment with the same parameters many times can yield a different result each time, depending on how many elements are read into each batch. The same stream source may produce different batch sizes across executions because of wall-clock time.

My suggestion is to provide a new Duration metric: count of elements. Regardless of the time spent filling a micro-batch, batches would always be the same size, and when the source uses a seed to generate the same values, we could replicate experiments with identical results, independent of throughput.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
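To illustrate the requested behavior (this is a plain-Java sketch, not an existing Spark API; the `CountBatcher` class and its method names are hypothetical): a batch boundary is drawn after exactly N elements instead of after a time interval, so a seeded source yields identical batches on every run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of count-based micro-batching.
// A batch closes after exactly batchSize elements, so with a seeded
// source every execution produces the same sequence of batches,
// regardless of arrival timing or throughput.
public class CountBatcher {
    private final int batchSize;
    private final List<Integer> buffer = new ArrayList<>();
    private final List<List<Integer>> batches = new ArrayList<>();

    public CountBatcher(int batchSize) {
        this.batchSize = batchSize;
    }

    // Called once per incoming element; closes a batch at the count boundary.
    public void onElement(int element) {
        buffer.add(element);
        if (buffer.size() == batchSize) {
            batches.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public List<List<Integer>> batches() {
        return batches;
    }

    public static void main(String[] args) {
        // Seeded source: same seed -> same elements -> same batches every run.
        Random source = new Random(42L);
        CountBatcher batcher = new CountBatcher(100);
        for (int i = 0; i < 1000; i++) {
            batcher.onElement(source.nextInt());
        }
        System.out.println("batches=" + batcher.batches().size()
                + " firstBatchSize=" + batcher.batches().get(0).size());
    }
}
```

With a wall-time trigger the number of elements per batch depends on throughput; with the count boundary above, only the element order matters, which the seed fixes.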