[GitHub] spark pull request: [SPARK-14288][SQL] Memory Sink for streaming

marmbrus Fri, 01 Apr 2016 16:05:21 -0700

GitHub user marmbrus opened a pull request:

    https://github.com/apache/spark/pull/12119


    [SPARK-14288][SQL] Memory Sink for streaming

    This PR exposes the internal testing `MemorySink` through the data source 
API.  This will allow users to easily test streaming applications in the Spark 
shell or other local tests.
    
    Usage:
    ```scala
    inputStream.write
      .format("memory")
      .queryName("memStream")
      .startStream()
    
    // Now you can query the result of the stream here.
    sqlContext.table("memStream")
    ```
    
    The most complicated part of the logic is setting the checkpoint directory.  
There are a few requirements we are attempting to satisfy here:
     - when working in the shell locally, it should just work with no extra 
configuration.
     - when working on a cluster you should be able to make it easily create 
the checkpoint on a distributed file system so you can test aggregation (state 
checkpoints are also stored in this directory and must be accessible from 
workers).
     - it should be clear that you can't resume since the data is just in 
memory.
    
    The chosen algorithm proceeds as follows:
     - if the user gives a checkpoint directory, use it
     - otherwise, if the conf has a checkpoint location, use `$location/$queryName`
     - if neither is set, create a local directory
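    
    The fallback order above can be sketched roughly as follows. This is only 
    an illustration, not the code in this PR: `CheckpointResolver`, `resolve`, 
    its parameter names, and the temp-directory prefix are all hypothetical, 
    and "create a local directory" is assumed here to mean a fresh temporary 
    directory.
    
    ```scala
    import java.nio.file.Files
    
    // Hypothetical helper (not the PR's actual code): picks a checkpoint
    // directory using the three-step fallback described above.
    object CheckpointResolver {
      def resolve(
          userDir: Option[String],      // directory given explicitly by the user
          confLocation: Option[String], // checkpoint location from the conf, if set
          queryName: String): String = {
        userDir
          // 2. No user directory: fall back to $location/$queryName.
          .orElse(confLocation.map(loc => s"$loc/$queryName"))
          // 3. Neither is set: create a local temporary directory.
          //    getOrElse is by-name, so the directory is only created when needed.
          .getOrElse(Files.createTempDirectory("memory-sink-checkpoint-").toString)
      }
    }
    ```
    
    With this sketch, `CheckpointResolver.resolve(None, Some("hdfs:///checkpoints"), "memStream")` 
    would yield `hdfs:///checkpoints/memStream`, while passing `None` for both 
    options yields a newly created local directory.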

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark memorySink

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12119.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12119
    
----
commit aaee000cd7bb5ad30710847c5bf48d96cdd870f5
Author: Michael Armbrust <mich...@databricks.com>
Date:   2016-03-31T06:33:02Z

    [SPARK-14288][SQL] Memory Sink for streaming

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
