Sungho Ham created SPARK-19036:
----------------------------------

             Summary: Merging delayed micro batches for parallelism
                 Key: SPARK-19036
                 URL: https://issues.apache.org/jira/browse/SPARK-19036
             Project: Spark
          Issue Type: New Feature
            Reporter: Sungho Ham
            Priority: Minor
The efficiency of parallel execution worsens when data is not evenly distributed across both time and messages. These skews delay streaming batches despite sufficient resources. Merging small delayed batches could improve the efficiency of micro-batch processing.

Here is an example:

 t  messages
---------------------------------
 4  1     <-- current time
 3  1
 2  1
 1  1000  <-- processing

After the long-running batch (t=1), three batches have only one message each. These batches cannot utilize Spark's parallelism. By merging the stream RDDs from t=2 to t=4 into one RDD, Spark can process them 3 times faster.

If the processing time of each message is also highly skewed, utilizing parallelism is not the only concern. Suppose the batches from t=2 to t=4 each consist of one BIG message and ten small messages. Then merging the three RDDs could still considerably improve efficiency.

 t  messages
---------------------------------
 4  1+10  <-- current time
 3  1+10
 2  1+10
 1  1000  <-- processing

There could be two parameters to describe the merging behavior:
- delay_time_limit: when to start merging
- merge_record_limit: when to stop merging

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
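To make the proposal concrete, here is a minimal sketch of how the two proposed parameters could interact. Only the parameter names delay_time_limit and merge_record_limit come from the ticket; the function name, the (batch_time, record_count) tuple shape, and the exact policy are illustrative assumptions, not Spark internals.

```python
def select_batches_to_merge(pending, current_time,
                            delay_time_limit, merge_record_limit):
    """Pick the oldest pending micro-batches to merge into one RDD.

    `pending` is a list of (batch_time, record_count) tuples, oldest first.
    Merging starts only once the oldest pending batch has been waiting
    longer than `delay_time_limit`, and stops before the merged record
    count would exceed `merge_record_limit`.
    """
    # Nothing is delayed enough yet: keep batches separate.
    if not pending or current_time - pending[0][0] < delay_time_limit:
        return []

    selected, total = [], 0
    for batch_time, records in pending:
        if total + records > merge_record_limit:
            break  # stop merging: the combined batch would be too large
        selected.append(batch_time)
        total += records
    return selected


# Example from the ticket: batch t=1 (1000 records) is still running at
# t=5, so the tiny batches t=2..t=4 are queued behind it.
pending = [(2, 1), (3, 1), (4, 1)]
print(select_batches_to_merge(pending, current_time=5,
                              delay_time_limit=2, merge_record_limit=100))
# -> [2, 3, 4]: all three one-message batches are merged into one RDD
```

In a real implementation the selected batches' RDDs would then be combined (e.g. via an RDD union) before being handed to the scheduler, so their records are processed as tasks of a single job rather than three sequential ones.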