[ 
https://issues.apache.org/jira/browse/SPARK-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

宿荣全 updated SPARK-4734:
-----------------------
    Summary: [Streaming]limit the file Dstream size for each batch  (was: limit 
the file Dstream size for each batch)

> [Streaming]limit the file Dstream size for each batch
> -----------------------------------------------------
>
>                 Key: SPARK-4734
>                 URL: https://issues.apache.org/jira/browse/SPARK-4734
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: 宿荣全
>            Priority: Minor
>
> Streaming scan new files form the HDFS and process those files in each batch 
> process.Current streaming exist some problems:
> 1.When the number of files is very large(the count size of those files is 
> very large) in some batch segement.The processing time required will become 
> very long.The processing time maybe over slideDuration time.Eventually lead 
> to dispatch the next batch process is delay.
> 2.when the size of total file Dstream  is very large in one batch,those  
> dstream data do shuffle after memory will be n times increasing 
> occupation,app will be slow or even terminated by operating system.
> So if we set a upper limit value of input data for each batch to control the 
> batch process time,the job dispatch delay and the process delay wil be 
> alleviated.
> modification:
> Add a new parameter "spark.streaming.segmentSizeThreshold" in InputDStream 
> (input data base class).the size of each batch process segments  will be set 
> in this parameter from [spark-defaults.conf] or setting in source.
> all implements class of InputDStream will do corresponding action be aimed at 
> the segmentSizeThreshold.
> This patch is a modification about FileInputDStream ,so when find new files   
>    ,put those files's name and size in a queue and take elements package to a 
> batch data with totail size < segmentSizeThreshold  in 
> FileInputDStream.Please look source about detailed logic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to