[ https://issues.apache.org/jira/browse/SPARK-19036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sungho Ham updated SPARK-19036:
-------------------------------
Description:

Efficiency of parallel execution worsens when data is not evenly distributed over both time and messages. These skews delay streaming batches despite sufficient resources.

Merging small delayed batches could improve the efficiency of micro-batching.

Here is an example.

batch_time   messages
---------------------------------
4            1       <-- current time
3            1
2            1
1            1000    <-- processing

After the long-running batch (t=1), three batches each have only one message. These batches cannot utilize the parallelism of Spark. By merging the stream RDDs from t=2 to t=4 into one RDD, Spark can process them roughly 3 times faster.

If the processing time of each message is also highly skewed, utilizing parallelism is not the only concern. Suppose the batches from t=2 to t=4 each consist of one BIG message and ten small messages. Then merging the three RDDs could still considerably improve efficiency.

batch_time   messages
---------------------------------
4            1+10    <-- current time
3            1+10
2            1+10
1            1000    <-- processing

Two parameters could describe the merging behavior:
- delay_time_limit: when to start merging
- merge_record_limit: when to stop merging

> Merging delayed micro batches for parallelism
> ----------------------------------------------
>
>                 Key: SPARK-19036
>                 URL: https://issues.apache.org/jira/browse/SPARK-19036
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Sungho Ham
>            Priority: Minor

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
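The two proposed parameters can be illustrated with a minimal sketch of the merge decision. This is plain Python rather than the Spark Streaming API; the function `merge_delayed_batches` and the batch representation are hypothetical, and only the parameter names `delay_time_limit` and `merge_record_limit` come from the proposal:

```python
# Hypothetical sketch of the proposed merge policy.
# Batches are (batch_time, record_count) pairs, queued oldest-first.
# delay_time_limit: a batch becomes eligible for merging once it has
#   waited at least this long.
# merge_record_limit: merging stops once the merged batch would exceed
#   this many records.

def merge_delayed_batches(queue, current_time,
                          delay_time_limit, merge_record_limit):
    """Return (merged, remaining): the batches folded into one logical
    batch, and the batches left in the queue."""
    merged, remaining = [], list(queue)
    total = 0
    while remaining:
        batch_time, records = remaining[0]
        delayed = (current_time - batch_time) >= delay_time_limit
        fits = total + records <= merge_record_limit
        if delayed and fits:
            merged.append(remaining.pop(0))
            total += records
        else:
            break
    return merged, remaining

# Example mirroring the issue: t=1 (1000 records) is still processing,
# so t=2..4 (one record each) are waiting in the queue.
queue = [(2, 1), (3, 1), (4, 1)]
merged, rest = merge_delayed_batches(queue, current_time=4,
                                     delay_time_limit=0,
                                     merge_record_limit=100)
# All three small batches are folded into one, so they can be
# processed as a single RDD instead of three sequential micro batches.
```

In an actual Spark implementation, the merged RDDs would presumably be combined with something like `RDD.union` before processing, but where that hook would live in the streaming scheduler is beyond this sketch.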