[GitHub] [spark] aokolnychyi commented on pull request #29066: [SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes

2020-12-15 Thread GitBox
aokolnychyi commented on pull request #29066: URL: https://github.com/apache/spark/pull/29066#issuecomment-745147943 Closing this one in favor of smaller PRs.

[GitHub] [spark] aokolnychyi commented on pull request #29066: [SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes

2020-12-10 Thread GitBox
aokolnychyi commented on pull request #29066: URL: https://github.com/apache/spark/pull/29066#issuecomment-742477298 The first PR with interfaces only is out.
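For readers following the thread, the sketch below shows the general shape of a write-side contract for required clustering and sorting. It is only a minimal, self-contained illustration: every type and method name in it (`Distribution`, `ClusteredDistribution`, `SortOrder`, `RequiresDistributionAndOrdering`, `LogTableWrite`) is an assumption made for this example, not necessarily what the PR introduces.

```java
// Illustrative sketch only: these are NOT the actual Spark interfaces from this PR.
// All type and method names here are assumptions used to show the general idea of a
// write that declares how its input must be clustered and sorted.

/** Describes how rows must be distributed across tasks before a write starts. */
interface Distribution { }

/** All rows sharing the same values of these columns must end up in the same task. */
final class ClusteredDistribution implements Distribution {
  final String[] clusteringColumns;
  ClusteredDistribution(String[] clusteringColumns) {
    this.clusteringColumns = clusteringColumns;
  }
}

/** A required sort key and its direction, applied within each task. */
final class SortOrder {
  final String column;
  final boolean ascending;
  SortOrder(String column, boolean ascending) {
    this.column = column;
    this.ascending = ascending;
  }
}

/** Mix-in a connector's write could expose so Spark adds a shuffle and/or sort first. */
interface RequiresDistributionAndOrdering {
  Distribution requiredDistribution();
  SortOrder[] requiredOrdering();
}

/** Example: a log table that wants each date handled by one task, sorted by hour within it. */
final class LogTableWrite implements RequiresDistributionAndOrdering {
  @Override
  public Distribution requiredDistribution() {
    return new ClusteredDistribution(new String[] {"date"});
  }

  @Override
  public SortOrder[] requiredOrdering() {
    return new SortOrder[] {new SortOrder("date", true), new SortOrder("hour", true)};
  }
}
```

The intent of the feature is that when a table's write declares such requirements, Spark can repartition and sort the incoming data to satisfy them before handing rows to the writers.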

[GitHub] [spark] aokolnychyi commented on pull request #29066: [SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes

2020-12-09 Thread GitBox
aokolnychyi commented on pull request #29066: URL: https://github.com/apache/spark/pull/29066#issuecomment-741711690 It is a bit hard to keep this large PR up to date since it touches many places. As splitting seems like a reasonable approach, I am going to break the work up and submit smaller PRs.

[GitHub] [spark] aokolnychyi commented on pull request #29066: [SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes

2020-12-07 Thread GitBox
aokolnychyi commented on pull request #29066: URL: https://github.com/apache/spark/pull/29066#issuecomment-739978109 It seems there is consensus on evolving this API alongside the interfaces in the `read` package. I am not sure whether we need to move the new interfaces to `write`, though.

[GitHub] [spark] aokolnychyi commented on pull request #29066: [SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes

2020-12-07 Thread GitBox
aokolnychyi commented on pull request #29066: URL: https://github.com/apache/spark/pull/29066#issuecomment-739974758 I've updated this PR and I am ready to split it into smaller mergeable parts. It would be great if everyone could take another look to make sure we are on the same page.

[GitHub] [spark] aokolnychyi commented on pull request #29066: [SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes

2020-12-01 Thread GitBox
aokolnychyi commented on pull request #29066: URL: https://github.com/apache/spark/pull/29066#issuecomment-736320032 I know that deprecating and then removing is usually a better idea, and I will be okay with evolving the read and write paths separately. The only concern I have is that while we use these …

[GitHub] [spark] aokolnychyi commented on pull request #29066: [SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes

2020-12-01 Thread GitBox
aokolnychyi commented on pull request #29066: URL: https://github.com/apache/spark/pull/29066#issuecomment-736316990 @dbtsai, I will rebase this one once PR #30558 is in.

[GitHub] [spark] aokolnychyi commented on pull request #29066: [SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes

2020-11-28 Thread GitBox
aokolnychyi commented on pull request #29066: URL: https://github.com/apache/spark/pull/29066#issuecomment-735220811 We should agree on the future of the existing `Distribution` and `ClusteredDistribution` interfaces used in `Partitioning`. Here is a quote from the design doc: …
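For context, these are the interfaces a `Scan` uses today to describe its output partitioning via `SupportsReportPartitioning`. The sketch below shows one way a connector might report a hash-clustered layout; it assumes the `org.apache.spark.sql.connector.read.partitioning` API roughly as it existed around Spark 3.0/3.1, so the exact signatures should be verified against the Spark version in use.

```java
// A hedged sketch of the existing read-side interfaces mentioned above, assuming the
// org.apache.spark.sql.connector.read.partitioning API around Spark 3.0/3.1; verify
// the exact signatures against the Spark version you build against.
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.spark.sql.connector.read.partitioning.ClusteredDistribution;
import org.apache.spark.sql.connector.read.partitioning.Distribution;
import org.apache.spark.sql.connector.read.partitioning.Partitioning;

/** Reports that a scan's input partitions are already clustered on a fixed set of columns. */
public class HashClusteredPartitioning implements Partitioning {
  private final String[] clusteringColumns;
  private final int numPartitions;

  public HashClusteredPartitioning(String[] clusteringColumns, int numPartitions) {
    this.clusteringColumns = clusteringColumns;
    this.numPartitions = numPartitions;
  }

  @Override
  public int numPartitions() {
    return numPartitions;
  }

  @Override
  public boolean satisfy(Distribution distribution) {
    // Clustering on columns C satisfies a ClusteredDistribution on columns D when C is a
    // subset of D: rows that agree on D also agree on C, so they land in the same partition.
    if (distribution instanceof ClusteredDistribution) {
      Set<String> requested =
          new HashSet<>(Arrays.asList(((ClusteredDistribution) distribution).clusteredColumns));
      return requested.containsAll(Arrays.asList(clusteringColumns));
    }
    return false;
  }
}
```

The open question raised here, which later comments in the thread return to, is whether the write path should reuse these read-side classes or define its own equivalents under the `write` package so the two paths can evolve independently.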

[GitHub] [spark] aokolnychyi commented on pull request #29066: [SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes

2020-11-27 Thread GitBox
aokolnychyi commented on pull request #29066: URL: https://github.com/apache/spark/pull/29066#issuecomment-735044798 also cc @dbtsai @dongjoon-hyun, it would be great to get your input on this one after the holidays.

[GitHub] [spark] aokolnychyi commented on pull request #29066: [SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes

2020-11-27 Thread GitBox
aokolnychyi commented on pull request #29066: URL: https://github.com/apache/spark/pull/29066#issuecomment-735044653 I also have a prototype for this logic in micro-batch streaming. I added dedicated plans, which I think we have been missing for a while. Right now, `MicroBatchExecution` …

[GitHub] [spark] aokolnychyi commented on pull request #29066: [SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes

2020-11-25 Thread GitBox
aokolnychyi commented on pull request #29066: URL: https://github.com/apache/spark/pull/29066#issuecomment-733842184 Tests failed as I overlooked recent changes around caching. Should be fixed now.

[GitHub] [spark] aokolnychyi commented on pull request #29066: [SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes

2020-11-24 Thread GitBox
aokolnychyi commented on pull request #29066: URL: https://github.com/apache/spark/pull/29066#issuecomment-733427234 I'd like to emphasize that all changes are in one place to simplify the review. I'll split the work into smaller PRs later.

[GitHub] [spark] aokolnychyi commented on pull request #29066: [SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes

2020-11-24 Thread GitBox
aokolnychyi commented on pull request #29066: URL: https://github.com/apache/spark/pull/29066#issuecomment-733425796 Okay, I went through the comments and I think they are all resolved, except for the points related to tests. This PR is no longer WIP and is ready for a detailed review.