[ https://issues.apache.org/jira/browse/SPARK-8605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
holdenk updated SPARK-8605:
---------------------------
    Component/s:     (was: PySpark)
                 Streaming

> Exclude files in StreamingContext.textFileStream(directory)
> ------------------------------------------------------------
>
>                 Key: SPARK-8605
>                 URL: https://issues.apache.org/jira/browse/SPARK-8605
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Noel Vo
>              Labels: streaming, streaming_api
>
> Currently, Spark Streaming can monitor a directory and process newly added
> files. This causes a bug when the files copied into the directory are large.
> For example, in HDFS, a file that is still being copied is named
> file_name._COPYING_. Spark picks up that file and starts processing it.
> However, once the copy completes, the file is renamed to file_name, which
> causes a FileDoesNotExist error. It would be great if we could exclude files
> in the directory using a regex.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
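
For illustration only, here is a minimal sketch of the kind of name-based exclusion the request describes. The names `IN_PROGRESS`, `is_ready`, and `select_ready` are hypothetical and not part of Spark's API; the pattern assumes the `._COPYING_` suffix mentioned in the report. It is plain Python over a list of file names, not a patch to `textFileStream`.

```python
import re

# Hypothetical exclusion pattern: the report says HDFS names an in-flight
# copy "<file_name>._COPYING_" until the copy completes.
IN_PROGRESS = re.compile(r".*\._COPYING_$")


def is_ready(file_name, exclude=IN_PROGRESS):
    """Return True if file_name does not match the exclusion regex."""
    return exclude.match(file_name) is None


def select_ready(file_names, exclude=IN_PROGRESS):
    """Keep only the names that are safe to hand to the stream."""
    return [name for name in file_names if is_ready(name, exclude)]


if __name__ == "__main__":
    # Example directory listing taken while one file is still being copied.
    listing = ["part-0001.txt", "part-0002.txt._COPYING_"]
    print(select_ready(listing))
```

As far as I know, the Scala `StreamingContext.fileStream` already has an overload that takes a `Path => Boolean` filter, so a predicate like this could plausibly be applied there; `textFileStream(directory)` itself exposes no such parameter, which is the gap this issue asks to close.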