[ https://issues.apache.org/jira/browse/SPARK-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
宿荣全 updated SPARK-4734:
-------------------------
    Summary: [Streaming]limit the file Dstream size for each batch  (was: limit the file Dstream size for each batch)

> [Streaming]limit the file Dstream size for each batch
> -----------------------------------------------------
>
>                 Key: SPARK-4734
>                 URL: https://issues.apache.org/jira/browse/SPARK-4734
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: 宿荣全
>            Priority: Minor
>
> Spark Streaming scans HDFS for new files and processes them in each batch.
> The current behavior has two problems:
> 1. When the number of new files (or their total size) in a batch is very
> large, the processing time grows and may exceed the slide duration, which
> delays the dispatch of the next batch.
> 2. When the total size of the file DStream in one batch is very large,
> shuffling that data multiplies its memory footprint several times over; the
> application slows down or is even terminated by the operating system.
> If we set an upper limit on the input data per batch, the batch processing
> time is bounded and both the job-dispatch delay and the processing delay are
> alleviated.
> Modification:
> Add a new parameter, "spark.streaming.segmentSizeThreshold", to InputDStream
> (the input-data base class). The size of each batch's segment is controlled
> by this parameter, read from [spark-defaults.conf] or set in source code.
> All subclasses of InputDStream can then take the corresponding action based
> on segmentSizeThreshold.
> This patch modifies FileInputDStream: when new files are found, their names
> and sizes are put into a queue, and elements are taken from the queue and
> packaged into a batch whose total size stays below segmentSizeThreshold.
> Please see the source for the detailed logic.
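The queue-and-package step described above could be sketched as follows. This is a minimal, hypothetical illustration, not the actual patch code: `SegmentBatcher` and its method names are invented here, and the rule of always taking at least one file (so a single oversized file cannot block the queue) is an assumption about how such a batcher would need to behave.

```scala
import scala.collection.mutable

// Hypothetical sketch of the proposed FileInputDStream batching logic:
// newly found files are queued with their sizes, and each batch dequeues
// files until adding the next one would exceed segmentSizeThreshold.
object SegmentBatcher {
  // Pending (fileName, fileSizeInBytes) pairs discovered but not yet processed.
  private val pending = mutable.Queue[(String, Long)]()

  def enqueue(file: String, size: Long): Unit = pending.enqueue((file, size))

  // Take files for one batch, keeping the running total under the threshold.
  // Always take at least one file so an oversized file cannot stall the queue.
  def nextBatch(segmentSizeThreshold: Long): Seq[String] = {
    val batch = mutable.ArrayBuffer[String]()
    var total = 0L
    while (pending.nonEmpty &&
           (batch.isEmpty || total + pending.head._2 <= segmentSizeThreshold)) {
      val (name, size) = pending.dequeue()
      batch += name
      total += size
    }
    batch.toSeq
  }
}
```

With a threshold of 100 bytes, files of sizes 40, 50, and 30 would be split into two batches: the first two files (total 90) in one batch, the third file deferred to the next, which matches the "total size < segmentSizeThreshold" packaging the issue describes.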
--
This message was sent by Atlassian JIRA (v6.3.4#6332)