[ 
https://issues.apache.org/jira/browse/SPARK-19036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungho Ham updated SPARK-19036:
-------------------------------
    Description: 
Efficiency of parallel execution get worsen when data is not evenly distributed 
by both time and message. Theses skews make streaming batch delayed despite 
sufficient resources. 

Merging small-sized delayed batches could help increase efficiency of micro 
batch.  

Here is an example.

batch_time    messages 
---------------------------------                
4                      1           <--  current time
3                      1             
2                      1             
1                      1000      <-- processing 

After long-running batch (t=1),  three batches has only one message. These 
batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 to 
t=4 into one RDD, Spark can process them 3 times faster.

If processing time of each message is highly skewed also, not only utilizing 
parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG message 
and ten small messages. Then, merging three RDDs still could considerably 
improve efficiency.

batch_time    messages 
---------------------------------                
4                      1+10       <--  current time
3                      1+10           
2                      1+10            
1                      1000      <-- processing 

There could two parameters to describe merging behavior.
- delay_time_limit:  when to start merge 
- merge_record_limit: when to stop merge



  was:
Efficiency of parallel execution get worsen when data is not evenly distributed 
by both time and message. Theses skews make streaming batch delayed despite 
sufficient resources. 

Merging small-sized delayed batches could help increase efficiency of micro 
batch.  

Here is an example.

t    messages 
---------------------------------                
4        1           <--  current time
3        1             
2        1             
1        1000      <-- processing 

After long-running batch (t=1),  three batches has only one message. These 
batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 to 
t=4 into one RDD, Spark can process them 3 times faster.

If processing time of each message is highly skewed also, not only utilizing 
parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG message 
and ten small messages. Then, merging three RDDs still could considerably 
improve efficiency.

t    messages 
---------------------------------                
4        1+10       <--  current time
3        1+10           
2        1+10            
1        1000      <-- processing 

There could two parameters to describe merging behavior.
- delay_time_limit:  when to start merge 
- merge_record_limit: when to stop merge




> Merging dealyed micro batches for parallelism 
> ----------------------------------------------
>
>                 Key: SPARK-19036
>                 URL: https://issues.apache.org/jira/browse/SPARK-19036
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Sungho Ham
>            Priority: Minor
>
> Efficiency of parallel execution get worsen when data is not evenly 
> distributed by both time and message. Theses skews make streaming batch 
> delayed despite sufficient resources. 
> Merging small-sized delayed batches could help increase efficiency of micro 
> batch.  
> Here is an example.
> batch_time    messages 
> ---------------------------------                
> 4                      1           <--  current time
> 3                      1             
> 2                      1             
> 1                      1000      <-- processing 
> After long-running batch (t=1),  three batches has only one message. These 
> batches cannot utilize parallelism of Spark. By merging stream RDDs from t=1 
> to t=4 into one RDD, Spark can process them 3 times faster.
> If processing time of each message is highly skewed also, not only utilizing 
> parallelism matters. Suppose batches from t=2 to t=4 consist of one BIG 
> message and ten small messages. Then, merging three RDDs still could 
> considerably improve efficiency.
> batch_time    messages 
> ---------------------------------                
> 4                      1+10       <--  current time
> 3                      1+10           
> 2                      1+10            
> 1                      1000      <-- processing 
> There could two parameters to describe merging behavior.
> - delay_time_limit:  when to start merge 
> - merge_record_limit: when to stop merge



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to