GitHub user lw-lin reopened a pull request: https://github.com/apache/spark/pull/16987
[SPARK-19633][SS] FileSource read from FileSink

## What changes were proposed in this pull request?

Right now the file source always uses `InMemoryFileIndex` to scan files from a given path. But when reading the outputs of another streaming query, the file source should use `MetadataFileIndex` to list files from the sink log instead. This patch adds that support.

## `MetadataFileIndex` or `InMemoryFileIndex`

```scala
spark
  .readStream
  .format(...)
  .load("/some/path")
// for a non-glob path:
// - use `MetadataFileIndex` when `/some/path/_spark_metadata` exists
// - fall back to `InMemoryFileIndex` otherwise
```

```scala
spark
  .readStream
  .format(...)
  .load("/some/path/*/*")
// for a glob path: always use `InMemoryFileIndex`
```

## How was this patch tested?

Two newly added tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lw-lin/spark source-read-from-sink

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16987.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16987

----

commit b66d2ccabcae41973bd8af4ed406567dc071ff67
Author: Liwei Lin <lwl...@gmail.com>
Date: 2017-02-18T01:20:18Z

    File Source reads from File Sink

----
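The selection rule described above can be sketched as follows. This is a hypothetical standalone helper, not the actual Spark implementation; `chooseFileIndex`, `hasGlob`, and the returned strings are illustrative, while `_spark_metadata` is the directory the file sink writes its log to:

```scala
import java.nio.file.{Files, Paths}

object FileIndexChoice {
  // A path is treated as a glob if it contains any glob metacharacter.
  def hasGlob(path: String): Boolean =
    path.exists(c => "*?[{".contains(c))

  // Decide which file index the source would use for a given load path.
  def chooseFileIndex(path: String): String = {
    if (hasGlob(path)) {
      // Glob paths always fall back to scanning the file system.
      "InMemoryFileIndex"
    } else if (Files.isDirectory(Paths.get(path, "_spark_metadata"))) {
      // A sink log exists: list files from the sink's own metadata.
      "MetadataFileIndex"
    } else {
      // Plain directory with no sink log: ordinary listing.
      "InMemoryFileIndex"
    }
  }
}
```

For example, `FileIndexChoice.chooseFileIndex("/some/path/*/*")` returns `"InMemoryFileIndex"` regardless of what is on disk, because the glob check short-circuits the metadata lookup.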