Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/14731#discussion_r94407115

--- Diff: docs/streaming-programming-guide.md ---
@@ -644,17 +644,90 @@ methods for creating DStreams from files as input sources.
 </div>
 </div>
- Spark Streaming will monitor the directory `dataDirectory` and process any files created in that directory (files written in nested directories not supported). Note that
+ Spark Streaming will monitor the directory `dataDirectory` and process any files created in that directory.
+
+
++ The files must have the same data format.
+ + A simple directory can be monitored, such as `hdfs://namenode:8040/logs/`.
+ All files directly such a path will be processed as they are discovered.
+ + A POSIX glob pattern can be supplied, such as
+ `hdfs://namenode:8040/logs/2016-??-31`.
+ Here, the DStream will consist of all files directly under those directories
+ matching the regular expression.
--- End diff --

I added a link to the POSIX docs. If you follow them, you eventually end up on some coverage of regexps inside []. The Hadoop glob code does actually convert the shell expression to a Java regexp and then compiles it, so presumably it should handle everything that the regexp engine (originally {{java.util.regex}}, currently {{com.google.re2j}}) can compile. That's too much detail for this guide, and something that should really be covered in the Hadoop docs by someone.
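
As a rough illustration of the glob-to-regexp conversion described above, here is a minimal Python sketch. It is not Hadoop's implementation: `glob_to_regex` is a hypothetical helper, it only handles `?`, `*` and `[...]`, and the choice to stop wildcards from matching `/` is an assumption made for this example.

```python
import re


def glob_to_regex(glob: str) -> str:
    """Translate a simple shell glob into a regular-expression string.

    Hypothetical helper for illustration only; it is not Hadoop's code.
    Supports `?`, `*` and `[...]`; every other character is matched literally.
    """
    out = []
    i = 0
    while i < len(glob):
        ch = glob[i]
        if ch == "?":
            # Assumption: one character, never crossing a path separator.
            out.append("[^/]")
        elif ch == "*":
            # Assumption: any run of characters within one path component.
            out.append("[^/]*")
        elif ch == "[":
            # Search from i + 2 so that a "]" placed first in the class is literal.
            end = glob.find("]", i + 2)
            if end == -1:
                # Unterminated bracket: treat "[" as a literal character.
                out.append(re.escape(ch))
            else:
                body = glob[i + 1:end]
                if body.startswith("!"):
                    # Shell-style negation "[!...]" becomes regex "[^...]".
                    body = "^" + body[1:]
                out.append("[" + body + "]")
                i = end
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)


if __name__ == "__main__":
    # The glob from the documentation text under review.
    pattern = re.compile(glob_to_regex("hdfs://namenode:8040/logs/2016-??-31"))
    for path in (
        "hdfs://namenode:8040/logs/2016-01-31",
        "hdfs://namenode:8040/logs/2016-1-31",
    ):
        print(path, bool(pattern.fullmatch(path)))
```

Compiling the translated string once and reusing the compiled pattern mirrors the convert-then-compile flow mentioned in the comment. Anything the sketch passes through inside `[...]` is interpreted by the regexp engine, which is why the engine's own syntax ends up determining what a bracket expression accepts.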