[GitHub] spark pull request: [SPARK-2883] [SQL] ORC data source for Spark S...

liancheng Fri, 15 May 2015 10:10:34 -0700

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/6194


    [SPARK-2883] [SQL] ORC data source for Spark SQL

    This PR is an update of #6135 authored by @zhzhan from Hortonworks.
    
    ----
    
    This PR implements a Spark SQL data source for accessing ORC files.
    
    > **NOTE**
    >
    > Although ORC is now an Apache TLP, its codebase is still tightly coupled with Hive. That is why the new ORC data source lives in the `org.apache.spark.sql.hive` package.
    
    ## New Features
    
    1.  New save/load methods provided:
    
        - `df.saveAsOrcFile()`
    
          Used to save the table in ORC format.
    
        - `sqlContext.orcFile()`
    
          Used to import an ORC file as a Spark SQL table.
    
    
        To enable these two methods, add the following import, which brings the corresponding implicit conversions into scope:
    
        ```scala
        import org.apache.spark.sql.hive.orc._
        ```
    
    1.  Support for complex data types (i.e. array, map, and struct)
    
    1.  Aware of common optimizations provided by Spark SQL:
    
        - Column pruning
        - Partition pruning
        - Filter push-down
    
    1.  Saving/loading ORC files without contacting the Hive metastore
    
    1.  ORC files are accessed through `HiveContext`. The only reason is packaging: we don't want to bring a Hive dependency into Spark SQL. Note that ORC operations do not rely on the Hive metastore.
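    
    A minimal round-trip sketch of the save/load methods listed above. Only `saveAsOrcFile`, `orcFile`, and the `org.apache.spark.sql.hive.orc._` import come from this description; the single path argument on both methods, the `OrcRoundTrip` object, and the output path are assumptions made for illustration and may not match the actual signatures:
    
    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    // Import named in this PR description; it is said to enable the implicit
    // conversions that add saveAsOrcFile() and orcFile().
    import org.apache.spark.sql.hive.orc._

    object OrcRoundTrip { // hypothetical name, for illustration only
      /** Writes two rows to `path` as ORC, reads them back, and returns the names with id > 1. */
      def roundTrip(path: String): Array[String] = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("orc-round-trip"))
        try {
          val hiveContext = new HiveContext(sc)
          import hiveContext.implicits._ // toDF() and the $"col" syntax

          val people = sc.parallelize(Seq((1, "Alice"), (2, "Bob"))).toDF("id", "name")

          // Assumption: both methods take a single path argument.
          people.saveAsOrcFile(path)
          val loaded = hiveContext.orcFile(path)

          // A projection plus a filter, i.e. the kind of query that column
          // pruning and filter push-down are meant to help with.
          loaded.filter($"id" > 1).select("name").collect().map(_.getString(0))
        } finally {
          sc.stop()
        }
      }

      def main(args: Array[String]): Unit =
        // Hypothetical output location; pick a path that does not exist yet.
        roundTrip("/tmp/people.orc").foreach(println)
    }
    ```
    
    Nothing in the sketch touches the Hive metastore; `HiveContext` is used only because the ORC data source is packaged with the Hive module.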
    
    ## Future Work
    
    1.  Schema evolution support
    1.  Hive metastore table conversion
    
    ## Acknowledgements
    
    This PR also includes initial work done by @scwf from Huawei (PR #3753).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark polishing-orc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6194.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6194
    
----
commit 62fef1829f78b15c5caf6fc825ebcdf045eecbe5
Author: Zhan Zhang <zhaz...@gmail.com>
Date:   2015-05-13T16:21:26Z

    orc data source support

commit cd1b4340d35cb9ff9c820329e6d6e6dda094b2f0
Author: Zhan Zhang <zhaz...@gmail.com>
Date:   2015-05-13T19:17:02Z

    minor change

commit aced00f8acb6f18f6f8644fa6dd99affa186513f
Author: Zhan Zhang <zhaz...@gmail.com>
Date:   2015-05-13T23:12:49Z

    predicate fix

commit f156bf0af97a0ac11392c59c99e947eef04b96b7
Author: Zhan Zhang <zhaz...@gmail.com>
Date:   2015-05-14T00:01:06Z

    reuse test suite

commit 22b8a58c548db143f3e5245993a4aaacfd0802ff
Author: Zhan Zhang <zhaz...@gmail.com>
Date:   2015-05-14T02:48:02Z

    save mode fix

commit 00dd24c1a83796a6016aa2bb945c759587480f35
Author: Zhan Zhang <zhaz...@gmail.com>
Date:   2015-05-14T20:19:30Z

    resolve review comments

commit 3501a9b70161ad41ef4b5718c2b57fb32188d5e9
Author: Zhan Zhang <zhaz...@gmail.com>
Date:   2015-05-14T20:22:07Z

    resolve review comments

commit 4bc937fa37c2674c007726f6c9bb25911378049f
Author: Cheng Lian <l...@databricks.com>
Date:   2015-05-15T16:17:51Z

    Polishes the ORC data source

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
