[GitHub] spark pull request: [SPARK-13244][SQL][WIP] Prototyping: migrates ...

liancheng Tue, 01 Mar 2016 03:16:29 -0800

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/11443


    [SPARK-13244][SQL][WIP] Prototyping: migrates DataFrame to Dataset using another approach

    ## What changes were proposed in this pull request?
    
    This is another attempt at the DataFrame-to-Dataset migration.
    
    The first approach we're trying is PR #11431 and its upcoming follow-ups. It
    first migrates all DataFrame methods into Dataset, then deletes the
    DataFrame class and makes DataFrame a type alias of `Dataset[Row]`.
    
    - Pros: Can be done incrementally. Each step is sane and self-contained, and
      can be merged into master separately.
    - Cons: Lots of DataFrame operations to migrate.
    
      See sub-tasks under SPARK-13244 for details
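    
    The end state of this first approach can be sketched as below. This is a
    minimal illustration only: `AliasSketch`, the simplified `Row` and
    `Dataset`, and the helper `firstColumn` are hypothetical stand-ins, not
    the real Spark classes.
    
    ```scala
    // Simplified stand-ins for illustration; not the real Spark classes.
    object AliasSketch {
      final case class Row(values: Seq[Any])
    
      // Stand-in for the unified class that absorbs the DataFrame methods.
      final class Dataset[T](val data: Seq[T]) {
        def count(): Long = data.size.toLong
        def filter(p: T => Boolean): Dataset[T] = new Dataset(data.filter(p))
      }
    
      // Once the DataFrame class is deleted, the name survives only as an alias.
      type DataFrame = Dataset[Row]
    
      // Code written against DataFrame keeps compiling, because the alias
      // and Dataset[Row] are the same type.
      def firstColumn(df: DataFrame): Seq[Any] = df.data.map(_.values.head)
    }
    ```
    
    Under these assumptions, a value typed as `DataFrame` can be passed
    wherever a `Dataset[Row]` is expected and vice versa, which is what lets
    existing Scala callers stay unchanged.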
    
    The second approach, which is demonstrated by this PR, does the migration in
    the opposite direction:
    
    - [x] Rename the original `Dataset` class to something else (`DS` in this PR)
          to avoid a naming conflict.
    - [x] Rename `DataFrame` to `Dataset` and add a type parameter `T` without the
          `Encoder` bound.
    
          In short, from `DataFrame` to `Dataset[T]` rather than
          `Dataset[T: Encoder]`.
    
    - [x] Add type alias `type DataFrame = Dataset[Row]` in the `sql` package
    - [x] Fix Java code
    
          Scala type aliases are not visible to Java, so `DataFrame` needs to be
          replaced with `Dataset<Row>` throughout the Java code base.
    
    - [ ] Migrate operations in class `DS` (the original Dataset) to the new
          `Dataset[T]`.
    
          During this step, we also need to change the new `Dataset[T]` to
          `Dataset[T: Encoder]`, and all the original DataFrame operations need to
          be updated to adapt to Dataset.
    
    - [ ] Remove class `DS`.
    
    This PR prototypes the first 4 steps.
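    
    As a rough sketch of what the first three steps look like at the type
    level, assuming a heavily simplified `Encoder` type class (the names
    `BoundSketch` and `intEncoder` and the `schema` member are hypothetical
    and are not taken from the Spark code base):
    
    ```scala
    // Simplified stand-ins for illustration; not the real Spark classes.
    object BoundSketch {
      final case class Row(values: Seq[Any])
    
      // Hypothetical, minimal Encoder type class.
      trait Encoder[T] { def schema: String }
      object Encoder {
        implicit val intEncoder: Encoder[Int] =
          new Encoder[Int] { def schema: String = "int" }
      }
    
      // Step 1: the original typed Dataset, renamed to DS. It keeps the
      // context bound, so constructing it requires an implicit Encoder[T].
      final class DS[T: Encoder](val data: Seq[T]) {
        def schema: String = implicitly[Encoder[T]].schema
      }
    
      // Step 2: the former DataFrame, renamed to Dataset with an unbounded
      // type parameter, so no Encoder is needed to construct it.
      final class Dataset[T](val data: Seq[T]) {
        def count(): Long = data.size.toLong
      }
    
      // Step 3: DataFrame survives only as a type alias.
      type DataFrame = Dataset[Row]
    }
    ```
    
    In this sketch, `new Dataset(Seq(Row(Seq(1))))` compiles although no
    `Encoder[Row]` is in scope, while `new DS(Seq(Row(Seq(1))))` would not.
    Adding the bound back in step 5 would impose the same requirement on the
    unified class.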
    
    - Pros: Dataset has far fewer operations than DataFrame, so fewer operations
      need to be migrated.
    - Cons: Hard to do incrementally.
    
    ## How was this patch tested?
    
    Existing tests cover the changes.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark ds-to-df

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11443.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11443
    
----
commit e1c28ca980847feee22ba30ad5eaa2e12c852b78
Author: Cheng Lian <l...@databricks.com>
Date:   2016-03-01T08:04:39Z

    Temporarily renames Dataset to DS

commit 5aa4faa1e737a8e7c0a6c1f1e88bc81e4ae5f8d7
Author: Cheng Lian <l...@databricks.com>
Date:   2016-03-01T08:49:40Z

    Renames DataFrame to Dataset[T]

commit 7fa44d7bfec0d0a4f3da0851690e1589cf9c6954
Author: Cheng Lian <l...@databricks.com>
Date:   2016-03-01T10:38:31Z

    Fixes Java API compilation failures

----

