[GitHub] spark pull request #14731: [SPARK-17159] [streaming]: optimise check for new...

steveloughran Tue, 03 Jan 2017 05:49:19 -0800

Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14731#discussion_r94407115
  
    --- Diff: docs/streaming-programming-guide.md ---
    @@ -644,17 +644,90 @@ methods for creating DStreams from files as input sources.
         </div>
         </div>
     
    -   Spark Streaming will monitor the directory `dataDirectory` and process any files created in that directory (files written in nested directories not supported). Note that
    +   Spark Streaming will monitor the directory `dataDirectory` and process any files created in that directory.
    +
    +     ++ The files must have the same data format.
    +     + A simple directory can be monitored, such as `hdfs://namenode:8040/logs/`.
    +       All files directly such a path will be processed as they are discovered.
    +     + A POSIX glob pattern can be supplied, such as
    +       `hdfs://namenode:8040/logs/2016-??-31`.
    +       Here, the DStream will consist of all files directly under those directories
    +       matching the regular expression.
    --- End diff --
    
    I added a link to the POSIX docs. If you follow them, you eventually end up
at some coverage of regexps inside []. The Hadoop glob code actually converts
the shell expression to a Java regexp and then compiles it, so presumably it
should handle everything that the regexp engine (originally
{{java.util.regex}}, currently {{com.google.re2j}}) can compile. That's too
much detail for this guide, and something that should really be covered in the
Hadoop docs by someone.
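
To make the glob-to-regexp idea concrete, here is a minimal sketch in Python. It is not Hadoop's glob code: `glob_to_regex` is a hypothetical helper written for this illustration, it covers only `?`, `*`, `[...]` and `{a,b}`, and keeping the wildcards within a single path component is a choice made by the sketch rather than a statement about Hadoop's behaviour.

```python
import re


def glob_to_regex(glob: str) -> str:
    """Translate a simple shell glob into a regular expression string.

    Hypothetical helper for illustration only; not Hadoop's implementation.
    """
    out = []
    depth = 0  # nesting level of {a,b} alternation groups
    i = 0
    while i < len(glob):
        c = glob[i]
        if c == "?":
            out.append("[^/]")       # exactly one character, not a separator
        elif c == "*":
            out.append("[^/]*")      # any run of characters, not a separator
        elif c == "[":
            # Search from i + 2 so that a leading "]" counts as a literal member.
            end = glob.find("]", i + 2)
            if end == -1:
                out.append(re.escape(c))  # unterminated class: literal "["
            else:
                body = glob[i + 1:end]
                if body.startswith("!"):
                    body = "^" + body[1:]  # shell negation to regexp negation
                out.append("[" + body + "]")
                i = end
        elif c == "{":
            out.append("(?:")
            depth += 1
        elif c == "}" and depth:
            out.append(")")
            depth -= 1
        elif c == "," and depth:
            out.append("|")
        else:
            out.append(re.escape(c))  # everything else matches literally
        i += 1
    return "".join(out)


# Compile the translated glob once, then match whole paths against it.
pattern = re.compile(glob_to_regex("hdfs://namenode:8040/logs/2016-??-31"))
print(bool(pattern.fullmatch("hdfs://namenode:8040/logs/2016-01-31")))
```

The point of the sketch is only that once a glob has been rewritten this way, whatever the underlying regexp engine accepts inside `[]` is accepted by the glob as well.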



