Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-05 Thread via GitHub
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1979031337 I've force-updated my PR; it now contains the smallest possible change and fixes this issue completely. -- This is an automated message from the Apache Git Service. To respond to the message,

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-05 Thread via GitHub
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1978837075 > We can't cache the queryExecution in the Dataset itself because the queryExecution may come from another Dataset instance. See `isEmpty`: > > ```scala > def isEmpty: Boolean =

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-05 Thread via GitHub
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1978809925 We can't cache the queryExecution in the Dataset itself because the queryExecution may come from another Dataset instance. See `isEmpty`: ```scala def isEmpty: Boolean =
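For context, the `isEmpty` implementation being referenced looks roughly like the sketch below (a paraphrase, not verbatim Spark source). The point is that `select()` constructs a new `Dataset`, so the `queryExecution` handed to `withAction` belongs to that other instance rather than to `this` — which is why caching a `queryExecution` on the receiving `Dataset` would not cover this path.

```scala
// Rough paraphrase of Dataset#isEmpty (illustrative, not verbatim):
// select() builds a *new* Dataset, so the queryExecution passed to
// withAction comes from that other instance, not from `this`.
def isEmpty: Boolean = withAction("isEmpty", select().queryExecution) { plan =>
  plan.executeTake(1).isEmpty
}
```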

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-05 Thread via GitHub
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1978804309 > > Regardless of the answer I think it makes sense to use the same approach for both Dataset states (persisted and unpersisted). > > I agree. We can cache it in a lazy variable

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-05 Thread via GitHub
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1978337106 > Regardless of the answer I think it makes sense to use the same approach for both Dataset states (persisted and unpersisted). I agree. We can cache it in a lazy variable
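The "cache it in a lazy variable" idea is plain Scala memoization: a `lazy val` is computed once on first access and reused afterwards. A minimal, self-contained sketch (hypothetical `Holder` class, standing in for a `Dataset` memoizing a re-resolved `queryExecution`):

```scala
// Minimal memoization sketch: `derived` stands in for a re-resolved
// queryExecution; computeCount shows the derivation runs only once.
class Holder(base: Seq[Int]) {
  var computeCount = 0 // counts how often the derivation actually runs
  lazy val derived: Seq[Int] = {
    computeCount += 1
    base.map(_ * 2)
  }
}

val h = new Holder(Seq(1, 2, 3))
h.derived // computed on first access
h.derived // reused; no recomputation
```

Note the caveat implied by the thread's asymmetry discussion: a `lazy val` can never be invalidated, so a plain `lazy val` would not react to a later `persist`/`unpersist` on its own.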

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-04 Thread via GitHub
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1976817040 > What're your ideas? Looking at the code I think it's going to work and will fix the issue. I do have a couple of questions though: 1. We'll have an asymmetry:

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-03 Thread via GitHub
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1975750721 > @doki23 Do you have time to continue working on the pull request? It seems to me that it's close to completion. @dtarima Of course, I'm glad to move it forward.

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-03 Thread via GitHub
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1975636851 @doki23 Do you have time to continue working on the pull request? It seems to me that it's close to completion.

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-28 Thread via GitHub
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1969241145 `Dataset#withAction` accepts a `queryExecution`, so all call sites that pass it through unchanged would need to be fixed, like `withAction("collectAsArrowToR", queryExecution)`. And no need to

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-28 Thread via GitHub
cloud-fan commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1969143795 I think all these actions will go through `Dataset#withAction`, so we have a narrow waist at which to fix the issue.

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-28 Thread via GitHub
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1968973266 > > df.count() and df.collect().size should always agree. > > how about this idea: when calling `df.collect()`, if the plan is cached but the physical plan is not a cache scan,

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-27 Thread via GitHub
cloud-fan commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1968268514 > df.count() and df.collect().size should always agree. how about this idea: when calling `df.collect()`, if the plan is cached but the physical plan is not a cache scan, then we
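The idea above can be sketched as a hypothetical helper inside `Dataset` (an illustration of the proposal, not actual Spark code; it assumes access to internals such as `CacheManager.lookupCachedData`, `InMemoryTableScanExec`, and `SessionState.executePlan`):

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec

// Hypothetical helper: if the logical plan is registered in the cache
// but the already-compiled physical plan does not read from the cache,
// re-plan so the cached data is actually used.
private def refreshedIfCached(qe: QueryExecution): QueryExecution = {
  val isCached = sparkSession.sharedState.cacheManager
    .lookupCachedData(qe.logical).isDefined
  val usesCacheScan = qe.executedPlan
    .find(_.isInstanceOf[InMemoryTableScanExec]).isDefined
  if (isCached && !usesCacheScan) {
    // Fresh planning pass sees the cache entry and inserts a cache scan.
    sparkSession.sessionState.executePlan(qe.logical)
  } else qe
}
```

If every action funnels through `Dataset#withAction`, applying such a check there would make `df.count()` and `df.collect()` agree without requiring callers to derive a new `Dataset`.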

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-26 Thread via GitHub
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1964179051 > > This PR creates a new Dataset instance, but the old one would still have the same inconsistent behavior. > > Another possible approach is to create a new `queryExecution`

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-24 Thread via GitHub
nchammas commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1962513221 > > This PR creates a new Dataset instance, but the old one would still have the same inconsistent behavior. > > Another possible approach is to create a new `queryExecution`

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-22 Thread via GitHub
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1959304539 > This PR creates a new Dataset instance, but the old one would still have the same inconsistent behavior. Another possible approach is to create a new `queryExecution` instance.

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-22 Thread via GitHub
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1958979959 Hi @dongjoon-hyun @nchammas, I've made some changes; would you please take another look? Are there any problems with this PR?

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-21 Thread via GitHub
doki23 commented on code in PR #45181: URL: https://github.com/apache/spark/pull/45181#discussion_r1498639875 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala: ## @@ -82,6 +82,26 @@ class DatasetCacheSuite extends QueryTest assert(cached.storageLevel

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-21 Thread via GitHub
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1957724807 I believe the main issue is that `cache` changes the results (logically it shouldn't have any effect). This PR creates a new `Dataset` instance, but the old one would still have

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-21 Thread via GitHub
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1957040482 > Here is the Python test I am running, which is simplified from the original reproduction that Denis posted: I've not fixed PySpark's DataFrame API, so it should still be

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-21 Thread via GitHub
nchammas commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1956773367 > It's not an ideal behavior but should be easy to work around Just to be clear, do you not consider it a correctness issue? To me, it's a correctness issue since the existing

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-21 Thread via GitHub
nchammas commented on code in PR #45181: URL: https://github.com/apache/spark/pull/45181#discussion_r1497634106 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala: ## @@ -82,6 +82,26 @@ class DatasetCacheSuite extends QueryTest

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-20 Thread via GitHub
doki23 commented on code in PR #45181: URL: https://github.com/apache/spark/pull/45181#discussion_r1496852441 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -3878,7 +3878,7 @@ class Dataset[T] private[sql]( */ def persist(newLevel: StorageLevel):

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-20 Thread via GitHub
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1955732649 > Could you enable GitHub Actions on your repository, @doki23? The Apache Spark community uses the contributor's GitHub Actions resources.

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-20 Thread via GitHub
cloud-fan commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1955662603 It's not an ideal behavior but should be easy to work around (`df.select("*").collect()`). IIUC this PR is also like a workaround, as the original `df` can't apply cache anyway because
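A sketch of why the `df.select("*")` workaround helps, assuming a running `SparkSession` named `spark` (illustrative only; the exact symptom depends on which physical plan each action reuses):

```scala
import org.apache.spark.sql.functions.rand

// With a nondeterministic predicate, a plan that bypasses the cache can
// produce a different row set on every execution.
val df = spark.range(100).where(rand() < 0.5)
df.count()  // may force a physical plan before the cache exists
df.cache()  // registers the logical plan with the CacheManager

// The original df can keep executing its pre-cache physical plan, so
// its actions need not agree with each other. A derived Dataset is
// planned *after* caching, so its actions read the cached data and
// stay consistent with one another.
val stable = df.select("*")
stable.count() // consistent with stable.collect().length
```

This is what makes the workaround feel unsatisfying as a user-facing contract: whether results are stable depends on when each `Dataset` instance happened to compile its plan.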

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-20 Thread via GitHub
dongjoon-hyun commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1955030719 cc @cloud-fan and @HyukjinKwon .

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-20 Thread via GitHub
dongjoon-hyun commented on code in PR #45181: URL: https://github.com/apache/spark/pull/45181#discussion_r1496457259 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -3878,7 +3878,7 @@ class Dataset[T] private[sql]( */ def persist(newLevel:

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-20 Thread via GitHub
dongjoon-hyun commented on code in PR #45181: URL: https://github.com/apache/spark/pull/45181#discussion_r1496453576 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -3878,7 +3878,7 @@ class Dataset[T] private[sql]( */ def persist(newLevel:

Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-20 Thread via GitHub
dongjoon-hyun commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1955005622 Could you enable GitHub Actions on your repository, @doki23? The Apache Spark community uses the contributor's GitHub Actions resources.

[PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-20 Thread via GitHub
doki23 opened a new pull request, #45181: URL: https://github.com/apache/spark/pull/45181 ### What changes were proposed in this pull request? This PR fixes [SPARK-46992](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-46992). It makes the cache() method return a new
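The proposal can be sketched as follows (a hypothetical simplification, not the actual patch): register the plan with the `CacheManager`, then return a new `Dataset` whose `queryExecution`, and hence physical plan, is compiled after the cache entry exists.

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical simplification of the proposed persist()/cache() change.
def persist(newLevel: StorageLevel): Dataset[T] = {
  sparkSession.sharedState.cacheManager.cacheQuery(this, None, newLevel)
  // Build a fresh Dataset over the same logical plan; its physical plan
  // is planned against the now-populated cache, so a scan of the
  // in-memory data is inserted.
  new Dataset[T](sparkSession, logicalPlan, encoder)
}

def cache(): Dataset[T] = persist(StorageLevel.MEMORY_AND_DISK)
```

The trade-off debated in the thread above: this fixes callers who use the returned `Dataset`, but the original instance keeps its old plan, so the API change alone does not remove the inconsistency for code that continues to act on the original reference.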