GitHub user marmbrus opened a pull request: https://github.com/apache/spark/pull/2475
[WIP][SPARK-3247][SQL] An API for adding foreign data sources to Spark SQL

**Work in progress - APIs may change**

This PR introduces a new set of APIs to Spark SQL that allow other developers to add support for reading data from new sources. As an example, a library is included for reading data encoded using Avro.

New sources must implement the interface `BaseRelation`, which is responsible for describing the schema of the data. This base relation must also implement at least one `Scan` interface, which is responsible for producing an RDD containing row objects. The various `Scan` interfaces allow for optimizations such as column pruning and filter push-down when the underlying data source can handle these operations.

External data sources can be accessed using either the programmatic API or pure SQL. For example, the included Avro library can be called from the Scala query DSL as follows:

```scala
import org.apache.spark.sql.avro._

val results = TestSQLContext
  .avroFile("../hive/src/test/resources/data/files/episodes.avro")
  .select('title)
  .collect()
```

The same can be done in pure SQL, for example from the SQL command line or the JDBC interface:

```sql
CREATE FOREIGN TEMPORARY TABLE avroTable
USING org.apache.spark.sql.avro
OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro");

SELECT * FROM avroTable;
```

TODO:
- [ ] Move command refactoring into separate PR
- [ ] Transition parquet and json support to new API
- [ ] Figure out how to package data sources and their dependencies
- [ ] Examples / implementation of more advanced scan types
- [ ] Support for foreign catalogs
- [ ] Introspection like `describe` for foreign tables.
- [ ] More tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark foreign

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2475.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2475

----

commit 47d542cc0238fba04b6c4e4456393d812d559c4e
Author: Michael Armbrust <mich...@databricks.com>
Date: 2014-09-20T23:20:50Z

    First draft of foreign data API
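The `BaseRelation`/`Scan` split described above can be sketched in plain Scala. This is only an illustrative model of the design, not Spark's actual API: `Row` here is a simple map, the scans return `Seq` instead of `RDD`, and `EpisodesRelation` with its sample rows is a hypothetical in-memory source invented for the example.

```scala
// Toy model of the proposed data source interfaces. Names mirror the PR's
// relation/scan split, but the types are simplified stand-ins, not Spark's.
type Row = Map[String, Any]

// A relation is responsible for describing the schema of the data.
// (Spark's version carries full column types, not just names.)
trait BaseRelation {
  def schema: Seq[String]
}

// Simplest scan: produce every row with every column.
// (Spark's version would return an RDD[Row].)
trait TableScan { self: BaseRelation =>
  def buildScan(): Seq[Row]
}

// A more advanced scan: the source itself drops unneeded columns
// (column pruning), so less data ever reaches the query engine.
trait PrunedScan { self: BaseRelation =>
  def buildScan(requiredColumns: Seq[String]): Seq[Row]
}

// Hypothetical in-memory relation implementing both scan interfaces.
class EpisodesRelation extends BaseRelation with TableScan with PrunedScan {
  private val data: Seq[Row] = Seq(
    Map("title" -> "Pilot", "airDate" -> "1989-12-17"),
    Map("title" -> "Finale", "airDate" -> "1990-05-13")
  )

  def schema: Seq[String] = Seq("title", "airDate")

  def buildScan(): Seq[Row] = data

  // Keep only the requested columns in each row.
  def buildScan(requiredColumns: Seq[String]): Seq[Row] =
    data.map(row => row.filter { case (k, _) => requiredColumns.contains(k) })
}
```

With this shape, the planner can call the pruned variant when a query such as `SELECT title FROM avroTable` only needs one column, and fall back to the full `TableScan` otherwise.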