GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/11443
[SPARK-13244][SQL][WIP] Prototyping: migrates DataFrame to Dataset using another approach

## What changes were proposed in this pull request?

This is another attempt at the DataFrame-to-Dataset migration.

The first approach we're trying is PR #11431 and its upcoming follow-ups. It first migrates all DataFrame methods into Dataset, then deletes the DataFrame class and makes DataFrame a type alias of `Dataset[Row]`.

- Pros: Can be done incrementally. Each step is sane and self-contained, and can be merged into master separately.
- Cons: Lots of DataFrame operations to migrate. See the sub-tasks under SPARK-13244 for details.

The second approach, which is demonstrated by this PR, does the migration in the opposite direction:

- [x] Rename the original `Dataset` class to something else (`DS` in this PR) to avoid a naming conflict.
- [x] Rename `DataFrame` to `Dataset` and add a type parameter `T` without the `Encoder` bound. In short, go from `DataFrame` to `Dataset[T]` rather than `Dataset[T: Encoder]`.
- [x] Add the type alias `type DataFrame = Dataset[Row]` in the `sql` package.
- [x] Fix Java code. A Scala type alias is not visible to Java, so `DataFrame` needs to be replaced with `Dataset<Row>` throughout the Java code base.
- [ ] Migrate the operations in class `DS` (the original Dataset) to the new `Dataset[T]`. During this step, we also need to change the new `Dataset[T]` to `Dataset[T: Encoder]`, and all the original DataFrame operations need to be updated to work with Dataset.
- [ ] Remove class `DS`.

This PR prototypes the first 4 steps.

- Pros: Dataset has far fewer operations than DataFrame, so fewer operations are expected to need migrating.
- Cons: Hard to do incrementally.

## How was this patch tested?

Existing tests do the work.
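
To make the shape of the second approach more concrete, here is a minimal, self-contained Scala sketch of steps 2 and 3: a `Dataset[T]` with no `Encoder` bound on the class, and `DataFrame` as a plain alias for `Dataset[Row]`. None of these are the real Spark classes. `Row`, `Encoder`, `Dataset`, and the `MigrationSketch` object are toy stand-ins invented for illustration, and the operations shown (`limit`, `count`, `map`) are only assumed examples of "untyped" versus "typed" methods.

```scala
// Toy model only: these are NOT the real Spark classes.
object MigrationSketch {
  // Hypothetical stand-in for Row: an untyped sequence of values.
  final case class Row(values: Any*)

  // Hypothetical stand-in for the Encoder type class.
  trait Encoder[T] { def describe: String }
  implicit val rowEncoder: Encoder[Row] = new Encoder[Row] { val describe = "row" }
  implicit val intEncoder: Encoder[Int] = new Encoder[Int] { val describe = "int" }

  // Step 2: the former DataFrame, renamed to Dataset and given a type
  // parameter T, but with no Encoder bound on the class itself.
  final class Dataset[T](val data: Seq[T]) {
    // DataFrame-style operations need no Encoder.
    def limit(n: Int): Dataset[T] = new Dataset(data.take(n))
    def count(): Long = data.size.toLong

    // A typed operation of the kind that lives in DS; in this sketch the
    // Encoder requirement sits on the method rather than on the class.
    def map[U: Encoder](f: T => U): Dataset[U] = new Dataset(data.map(f))
  }

  // Step 3: DataFrame becomes a type alias, so existing Scala code that
  // mentions DataFrame keeps compiling unchanged.
  type DataFrame = Dataset[Row]

  // Code written against the alias accepts any Dataset[Row].
  def rowCount(df: DataFrame): Long = df.count()

  def main(args: Array[String]): Unit = {
    val ds: Dataset[Row] = new Dataset(Seq(Row(1, "a"), Row(2, "b")))
    println(rowCount(ds))
  }
}
```

Because the alias exists only at the Scala source level, Java callers in this model would have to spell out the full generic type, which is why step 4 replaces `DataFrame` with `Dataset<Row>` in the Java code.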
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark ds-to-df

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11443.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11443

----
commit e1c28ca980847feee22ba30ad5eaa2e12c852b78
Author: Cheng Lian <l...@databricks.com>
Date:   2016-03-01T08:04:39Z

    Temporarily renames Dataset to DS

commit 5aa4faa1e737a8e7c0a6c1f1e88bc81e4ae5f8d7
Author: Cheng Lian <l...@databricks.com>
Date:   2016-03-01T08:49:40Z

    Renames DataFrame to Dataset[T]

commit 7fa44d7bfec0d0a4f3da0851690e1589cf9c6954
Author: Cheng Lian <l...@databricks.com>
Date:   2016-03-01T10:38:31Z

    Fixes Java API compilation failures

----