GitHub user marmbrus opened a pull request:

    https://github.com/apache/spark/pull/2475

    [WIP][SPARK-3247][SQL] An API for adding foreign data sources to Spark SQL

    **Work in progress - APIs may change**
    
    This PR introduces a new set of APIs to Spark SQL that allow other 
developers to add support for reading data from new sources.  As an example, a 
library is included for reading data encoded using Avro.
    
    New sources must implement the `BaseRelation` interface, which is 
responsible for describing the schema of the data.  The relation must also 
implement at least one `Scan` interface, which is responsible for producing 
an RDD of row objects.  The various `Scan` interfaces allow for 
optimizations such as column pruning and filter pushdown when the underlying 
data source supports these operations.
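    
    Concretely, a minimal source might look like the sketch below.  This is 
illustrative only: the trait names `BaseRelation` and `TableScan` and their 
signatures are assumptions based on the description above and may change while 
this PR is WIP, and `IntRangeRelation` is a hypothetical example relation.
    
    ```scala
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql._
    import org.apache.spark.sql.sources._
    
    // Hypothetical minimal relation: serves the integers in [from, to].
    case class IntRangeRelation(from: Int, to: Int)
                               (@transient val sqlContext: SQLContext)
      extends BaseRelation with TableScan {
    
      // BaseRelation: describe the schema of the data.
      override def schema: StructType =
        StructType(StructField("i", IntegerType, nullable = false) :: Nil)
    
      // TableScan: produce an RDD containing row objects.
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.parallelize(from to to).map(Row(_))
    }
    ```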
    
    External data sources can be accessed using either the programmatic API or 
pure SQL.  For example, the included Avro library could be called from the 
Scala query DSL as follows:
    
    ```scala
    import org.apache.spark.sql.avro._
    
    val results = TestSQLContext
      .avroFile("../hive/src/test/resources/data/files/episodes.avro")
      .select('title)
      .collect()
    ```
    
    The same can be done in pure SQL, for example from the SQL command line or 
JDBC interface.
    
    ```sql
    CREATE FOREIGN TEMPORARY TABLE avroTable
    USING org.apache.spark.sql.avro
    OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro");
    
    SELECT * FROM avroTable;
    ```
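    
    For the `USING` clause to work, the named package presumably needs to 
expose a factory that turns the `OPTIONS` into a relation.  The sketch below 
shows one plausible shape for that hook; the class name `DefaultSource`, the 
`RelationProvider` trait, the `createRelation` signature, and `AvroRelation` 
are all assumptions for illustration, not the PR's confirmed API.
    
    ```scala
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.sources._
    
    // Hypothetical entry point resolved from `USING org.apache.spark.sql.avro`.
    class DefaultSource extends RelationProvider {
      // Build a relation from the key/value pairs in the OPTIONS clause.
      override def createRelation(
          sqlContext: SQLContext,
          parameters: Map[String, String]): BaseRelation = {
        val path = parameters.getOrElse("path",
          sys.error("'path' must be specified for Avro data"))
        AvroRelation(path)(sqlContext) // hypothetical Avro-backed relation
      }
    }
    ```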
    
    TODO:
     - [ ] Move command refactoring into separate PR
     - [ ] Transition parquet and json support to new API
     - [ ] Figure out how to package data sources and their dependencies
     - [ ] Examples / implementation of more advanced scan types
     - [ ] Support for foreign catalogs
 - [ ] Introspection like `describe` for foreign tables
     - [ ] More tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark foreign

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2475.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2475
    
----
commit 47d542cc0238fba04b6c4e4456393d812d559c4e
Author: Michael Armbrust <mich...@databricks.com>
Date:   2014-09-20T23:20:50Z

    First draft of foreign data API

----

