[ https://issues.apache.org/jira/browse/SPARK-19036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sungho Ham updated SPARK-19036:
-------------------------------
Description:

The efficiency of parallel execution worsens when data is not evenly distributed across both time and messages. These skews delay streaming batches despite sufficient resources. Merging small, delayed batches could improve the efficiency of micro-batch processing.

Here is an example.

||batch_time || messages ||
|4 | 1 <-- current time |
|3 | 1 |
|2 | 1 |
|1 | 1000 <-- processing |

After the long-running batch (t=1), each of the three following batches has only one message. These batches cannot utilize Spark's parallelism. By merging the stream RDDs from t=2 to t=4 into one RDD, Spark can process them up to three times faster.

If the processing time of each message is also highly skewed, utilizing parallelism is not the only concern. Suppose the batches from t=2 to t=4 each consist of one BIG message and ten small messages. Merging the three RDDs could still considerably improve efficiency.

||batch_time || messages ||
|4 | 1+10 <-- current time |
|3 | 1+10 |
|2 | 1+10 |
|1 | 1000 <-- processing |

Two parameters could describe the merging behavior:
- delay_time_limit: when to start merging
- merge_record_limit: when to stop merging
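The two proposed parameters can be illustrated with a small policy sketch. The following is a minimal, hypothetical simulation in plain Python (no Spark): the batch sizes and the names delay_time_limit / merge_record_limit come from the proposal above, while the function select_batches_to_merge and its exact semantics are assumptions made for illustration only.

```python
# Hypothetical sketch of the proposed merge policy (not actual Spark code).
# delay_time_limit: merge only batches that have waited at least this long.
# merge_record_limit: stop merging once the combined batch would exceed
# this many records.

def select_batches_to_merge(pending, current_time,
                            delay_time_limit, merge_record_limit):
    """pending: list of (batch_time, record_count) pairs, oldest first.
    Returns the batch_times chosen for one merged batch."""
    merged, total = [], 0
    for batch_time, records in pending:
        if current_time - batch_time < delay_time_limit:
            break  # newer batches have not been delayed long enough
        if total + records > merge_record_limit:
            break  # merging more would make the combined batch too large
        merged.append(batch_time)
        total += records
    return merged

# Example mirroring the first table above: batches t=2..4 each hold one
# message while the big t=1 batch (1000 records) is still processing.
pending = [(2, 1), (3, 1), (4, 1)]
print(select_batches_to_merge(pending, current_time=5,
                              delay_time_limit=1, merge_record_limit=100))
# -> [2, 3, 4]
```

In Spark itself, the selected RDDs could then be combined with something like SparkContext.union before the merged batch is submitted as a single job; the selection logic above only decides which delayed batches qualify.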
> Merging delayed micro batches for parallelism
> ----------------------------------------------
>
>                 Key: SPARK-19036
>                 URL: https://issues.apache.org/jira/browse/SPARK-19036
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Sungho Ham
>            Priority: Minor
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)