[ https://issues.apache.org/jira/browse/SPARK-8605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377999#comment-16377999 ]
Apache Spark commented on SPARK-8605:
-------------------------------------

User 'ConcurrencyPractitioner' has created a pull request for this issue:
https://github.com/apache/spark/pull/20683

> Exclude files in StreamingContext.textFileStream(directory)
> -----------------------------------------------------------
>
>                 Key: SPARK-8605
>                 URL: https://issues.apache.org/jira/browse/SPARK-8605
>             Project: Spark
>          Issue Type: Improvement
>          Components: DStreams
>            Reporter: Noel Vo
>            Priority: Major
>              Labels: streaming, streaming_api
>
> Currently, Spark Streaming can monitor a directory and process newly added
> files. This causes a bug when the files copied into the directory are
> large. For example, in HDFS, a file that is still being copied is named
> file_name._COPYING_. Spark picks up that file and starts processing it.
> When the copy finishes, the file is renamed to file_name, so the path Spark
> picked up no longer exists and a FileDoesNotExist error results. It would be
> great if files in the directory could be excluded using a regex.
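The exclusion requested in the issue comes down to a name-based predicate applied before a file is handed to the stream. The following is a minimal sketch of such a predicate in plain Python. It does not call Spark and is not the change proposed in the pull request. The names EXCLUDE_PATTERN, should_process and select_files are hypothetical, and the pattern assumes that in-flight copies end in "._COPYING_" as described in the issue.

```python
import re

# Hypothetical exclusion pattern: per the issue, HDFS names a file that is
# still being copied "<file_name>._COPYING_".
EXCLUDE_PATTERN = re.compile(r".*\._COPYING_$")


def should_process(file_name, exclude=EXCLUDE_PATTERN):
    """Return True if the file name does not match the exclusion regex."""
    return exclude.fullmatch(file_name) is None


def select_files(file_names, exclude=EXCLUDE_PATTERN):
    """Keep only the names that pass the exclusion check, preserving order."""
    return [name for name in file_names if should_process(name, exclude)]


if __name__ == "__main__":
    listing = ["a.txt", "b.txt._COPYING_", "c.txt"]
    # Only completed files remain; the in-flight copy is skipped.
    print(select_files(listing))
```

With a filter like this, a file is ignored while it still carries the temporary suffix and becomes eligible once it has been renamed to its final name, which avoids processing a path that is about to disappear.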