GitHub user lw-lin reopened a pull request: https://github.com/apache/spark/pull/16987
[SPARK-19633][SS] FileSource read from FileSink

## What changes were proposed in this pull request?

Right now the file source always uses `InMemoryFileIndex` to scan files from a given path. But when reading the outputs of another streaming query, the file source should use `MetadataFileIndex` to list files from the sink log instead. This patch adds that support.

## `MetadataFileIndex` or `InMemoryFileIndex`

```scala
spark
  .readStream
  .format(...)
  .load("/some/path")
// for a non-glob path:
// - use `MetadataFileIndex` when `/some/path/_spark_metadata` exists
// - fall back to `InMemoryFileIndex` otherwise
```

```scala
spark
  .readStream
  .format(...)
  .load("/some/path/*/*")
// for a glob path: always use `InMemoryFileIndex`
```

## How was this patch tested?

Two newly added tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lw-lin/spark source-read-from-sink

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16987.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16987

----

commit b66d2ccabcae41973bd8af4ed406567dc071ff67
Author: Liwei Lin <lwl...@gmail.com>
Date: 2017-02-18T01:20:18Z

    File Source reads from File Sink

----
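The selection rule described above can be sketched as follows. This is a hypothetical standalone helper, not the actual Spark implementation; `chooseFileIndex`, `hasGlob`, and the returned strings are illustrative, while `_spark_metadata` is the directory the file sink writes its log to:

```scala
import java.nio.file.{Files, Paths}

object FileIndexChoice {
  // A path is treated as a glob if it contains any glob metacharacter.
  def hasGlob(path: String): Boolean =
    path.exists(c => "*?[{".contains(c))

  // Decide which file index the source would use for a given load path.
  def chooseFileIndex(path: String): String = {
    if (hasGlob(path)) {
      // Glob paths always fall back to scanning the file system.
      "InMemoryFileIndex"
    } else if (Files.isDirectory(Paths.get(path, "_spark_metadata"))) {
      // A sink log exists: list files from the sink's own metadata.
      "MetadataFileIndex"
    } else {
      // Plain directory with no sink log: ordinary listing.
      "InMemoryFileIndex"
    }
  }
}
```

For example, `FileIndexChoice.chooseFileIndex("/some/path/*/*")` returns `"InMemoryFileIndex"` regardless of what is on disk, because the glob check short-circuits the metadata lookup.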