[ https://issues.apache.org/jira/browse/SPARK-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

宿荣全 updated SPARK-4734:
-----------------------
    Description: 
Spark Streaming scans new files from HDFS and processes those files in each
batch. The current implementation has some problems:
1. When the number of files (and their total size) in a batch segment is very
large, the processing time becomes very long and may exceed the slideDuration,
which eventually delays the dispatch of the next batch.
2. When the total size of the file DStream in one batch is very large, the
shuffle over that DStream multiplies the memory occupation several times, and
the application becomes slow or is even terminated by the operating system.

So if we set an upper limit on the amount of input data for each batch to
control the batch processing time, the job dispatch delay and the processing
delay will be alleviated.

Modification:
Add a new parameter, "spark.streaming.segmentSizeThreshold", to InputDStream
(the base class for input data). The maximum size of each batch segment is
taken from this parameter, which can be set in [spark-defaults.conf] or in
the application source.
Every subclass of InputDStream takes the corresponding action based on
segmentSizeThreshold.
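
For illustration, a minimal sketch of how the proposed parameter might be set
from the application source. Note that "spark.streaming.segmentSizeThreshold"
is the key proposed by this patch, not an existing Spark configuration
property, and the threshold value below is an arbitrary example:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Proposed key (from this patch); the 256 MB value is only an example.
val conf = new SparkConf()
  .setAppName("FileStreamWithSizeLimit")
  .set("spark.streaming.segmentSizeThreshold", (256L * 1024 * 1024).toString)

val ssc = new StreamingContext(conf, Seconds(10))
// Files found under this directory would be packaged into batches whose
// total size stays below the threshold.
val lines = ssc.textFileStream("hdfs:///user/streaming/input")
{code}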
This patch modifies FileInputDStream: when new files are found, their names
and sizes are put into a queue, and elements are taken from that queue and
packaged into a batch whose total size is < segmentSizeThreshold. Please see
the source for the detailed logic; a simplified sketch follows below.
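
A minimal, self-contained sketch of that packaging logic, assuming a simple
FIFO queue of (path, size) pairs. All names here (PendingFile,
FileBatchPacker, etc.) are illustrative and are not the identifiers used in
the actual patch:

{code:scala}
import scala.collection.mutable

// Illustrative stand-in for a newly discovered HDFS file.
case class PendingFile(path: String, size: Long)

class FileBatchPacker(segmentSizeThreshold: Long) {
  private val pendingFiles = mutable.Queue[PendingFile]()

  // Called when the directory scan finds new files.
  def enqueue(files: Seq[PendingFile]): Unit = pendingFiles ++= files

  // Called once per batch interval: dequeue files while the accumulated
  // size stays below the threshold; the rest wait for later batches.
  def nextBatch(): Seq[PendingFile] = {
    val batch = mutable.ArrayBuffer[PendingFile]()
    var total = 0L
    while (pendingFiles.nonEmpty &&
           total + pendingFiles.head.size < segmentSizeThreshold) {
      val file = pendingFiles.dequeue()
      total += file.size
      batch += file
    }
    // Never starve on a single file larger than the threshold.
    if (batch.isEmpty && pendingFiles.nonEmpty) batch += pendingFiles.dequeue()
    batch.toSeq
  }
}
{code}

Files left in the queue spill into the following batches, which is what bounds
the per-batch input size instead of dropping data.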



> limit the file Dstream size for each batch
> ------------------------------------------
>
>                 Key: SPARK-4734
>                 URL: https://issues.apache.org/jira/browse/SPARK-4734
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: 宿荣全
>            Priority: Minor


