GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/6194
[SPARK-2883] [SQL] ORC data source for Spark SQL

This PR is an update of #6135 authored by @zhzhan from Hortonworks.

----

This PR implements a Spark SQL data source for accessing ORC files.

> **NOTE**
>
> Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive. That's why the new ORC data source is under the `org.apache.spark.sql.hive` package.

## New Features

1. New save/load methods provided:

   - `df.saveAsOrcFile()`

     Used to save the table in ORC format.

   - `sqlContext.orcFile()`

     Used to import an ORC file as a Spark SQL table.

   To enable these two methods, please add the following line to bring the corresponding implicit conversions into scope:

   ```scala
   import org.apache.spark.sql.hive.orc._
   ```

1. Support for complex data types (i.e. array, map, and struct)

1. Aware of common optimizations provided by Spark SQL:

   - Column pruning
   - Partition pruning
   - Filter push-down

1. Saving/loading ORC files without contacting the Hive metastore

1. ORC files are accessed through `HiveContext`. This is purely a packaging matter: we don't want to bring a Hive dependency into Spark SQL. Note that ORC operations do not rely on the Hive metastore.

## Future Work

1. Schema evolution support
1. Hive metastore table conversion

## Acknowledgements

This PR also includes initial work done by @scwf from Huawei (PR #3753).
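For readers unfamiliar with how a single import can add `saveAsOrcFile()` and `orcFile()` to existing types (see New Features), here is a minimal, Spark-free sketch of the implicit-class pattern. Everything in it is a hypothetical stand-in: `FakeDataFrame`, `FakeContext`, `FakeStorage`, the `orcSketch` object and the `path` parameters are assumptions made for illustration, not the classes or signatures in this PR.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for DataFrame and HiveContext (NOT the Spark classes).
final case class FakeDataFrame(rows: Seq[Map[String, Any]])
final class FakeContext

// Hypothetical in-memory "file system" so the sketch runs without Spark or ORC.
object FakeStorage {
  val files: mutable.Map[String, FakeDataFrame] = mutable.Map.empty
}

// Plays the role of the package whose members are imported with `._`.
object orcSketch {
  // Adds saveAsOrcFile to FakeDataFrame once orcSketch._ is in scope.
  implicit class OrcDataFrameOps(df: FakeDataFrame) {
    def saveAsOrcFile(path: String): Unit = FakeStorage.files.update(path, df)
  }

  // Adds orcFile to FakeContext once orcSketch._ is in scope.
  implicit class OrcContextOps(ctx: FakeContext) {
    def orcFile(path: String): FakeDataFrame = FakeStorage.files(path)
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    import orcSketch._ // without this import, neither method below resolves

    val ctx = new FakeContext
    val df = FakeDataFrame(Seq(Map("id" -> 1, "name" -> "a")))

    df.saveAsOrcFile("/tmp/people.orc")
    val loaded = ctx.orcFile("/tmp/people.orc")
    assert(loaded == df) // round trip through the fake storage
  }
}
```

The extension methods live outside the types they extend, which is why the ORC support can sit in a separate Hive-dependent package without changing the core classes.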
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark polishing-orc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6194.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6194

----

commit 62fef1829f78b15c5caf6fc825ebcdf045eecbe5
Author: Zhan Zhang <zhaz...@gmail.com>
Date:   2015-05-13T16:21:26Z

    orc data source support

commit cd1b4340d35cb9ff9c820329e6d6e6dda094b2f0
Author: Zhan Zhang <zhaz...@gmail.com>
Date:   2015-05-13T19:17:02Z

    minor change

commit aced00f8acb6f18f6f8644fa6dd99affa186513f
Author: Zhan Zhang <zhaz...@gmail.com>
Date:   2015-05-13T23:12:49Z

    predicate fix

commit f156bf0af97a0ac11392c59c99e947eef04b96b7
Author: Zhan Zhang <zhaz...@gmail.com>
Date:   2015-05-14T00:01:06Z

    reuse test suite

commit 22b8a58c548db143f3e5245993a4aaacfd0802ff
Author: Zhan Zhang <zhaz...@gmail.com>
Date:   2015-05-14T02:48:02Z

    save mode fix

commit 00dd24c1a83796a6016aa2bb945c759587480f35
Author: Zhan Zhang <zhaz...@gmail.com>
Date:   2015-05-14T20:19:30Z

    resolve review comments

commit 3501a9b70161ad41ef4b5718c2b57fb32188d5e9
Author: Zhan Zhang <zhaz...@gmail.com>
Date:   2015-05-14T20:22:07Z

    resolve review comments

commit 4bc937fa37c2674c007726f6c9bb25911378049f
Author: Cheng Lian <l...@databricks.com>
Date:   2015-05-15T16:17:51Z

    Polishes the ORC data source