doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1979031337
I've force-updated my PR; it now contains a minimal set of changes and fixes
this issue completely.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go
to the specific comment.
dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1978837075
> We can't cache the queryExecution in the Dataset itself because the
queryExecution may come from another Dataset instance. See `isEmpty`:
>
> ```scala
> def isEmpty: Boolean =
> ```
doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1978809925
We can't cache the queryExecution in the Dataset itself because the
queryExecution may come from another Dataset instance.
See `isEmpty`:
```scala
def isEmpty: Boolean =
```
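The objection above can be sketched without Spark. In this simplified model (every name here is hypothetical, not Spark's real API), an `isEmpty`-style action executes a query execution belonging to a *derived* Dataset, so a lazy cache field on the original Dataset instance is never the one that runs:

```scala
// Minimal stand-ins for Spark's classes; every name here is hypothetical.
final class QueryExecution(val plan: String)

final class Dataset(val plan: String) {
  // A lazy cache on *this* instance only helps actions that read it...
  lazy val cachedQueryExecution: QueryExecution = new QueryExecution(plan)

  // ...but an isEmpty-style action builds a *derived* Dataset and uses that
  // Dataset's queryExecution, bypassing the cache on the original instance.
  def select(): Dataset = new Dataset(s"Project($plan)")
  def isEmptyStyleExecution: QueryExecution = select().cachedQueryExecution
}

val ds = new Dataset("Scan(t)")
val used = ds.isEmptyStyleExecution
// The execution the action actually runs is not the one cached on `ds`:
assert(used ne ds.cachedQueryExecution)
```

This is only a sketch of the object-identity problem being quoted, not a claim about Spark's actual `isEmpty` body.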
dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1978804309
> > Regardless of the answer I think it makes sense to use the same approach
for both Dataset states (persisted and unpersisted).
>
> I agree. We can cache it in a lazy variable
doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1978337106
> Regardless of the answer I think it makes sense to use the same approach
for both Dataset states (persisted and unpersisted).
I agree. We can cache it in a lazy variable
dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1976817040
> What're your ideas?
Looking at the code, I think it's going to work and will fix the issue.
I do have a couple of questions, though:
1. We'll have an asymmetry:
doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1975750721
> @doki23 Do you have time to continue working on the pull request? It seems
to me that it's close to completion.
@dtarima Of course, I'm glad to move it forward.
dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1975636851
@doki23 Do you have time to continue working on the pull request? It seems to
me that it's close to completion.
dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1969241145
`Dataset#withAction` accepts a `queryExecution`, so every call site that
passes it through unchanged would need to be fixed, like
`withAction("collectAsArrowToR", queryExecution)`.
And no need to
cloud-fan commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1969143795
I think all these actions will go through `Dataset#withAction`, so we have a
narrow waist at which to fix the issue.
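The "narrow waist" idea can be sketched in isolation: if every action funnels through one helper, the consistency check lives in exactly one place. The following toy model (hypothetical names, not Spark's actual implementation) re-plans inside the choke point when the logical plan has been cached but the current physical plan is not a cache scan:

```scala
// Toy model of the "fix it in one place" idea; all names are hypothetical.
final case class PhysicalPlan(description: String) {
  def isCacheScan: Boolean = description.startsWith("InMemoryScan")
}

final case class QueryExecution(logical: String, physical: PhysicalPlan)

// Pretend cache registry: logical plans that have been persisted.
val cachedPlans = scala.collection.mutable.Set.empty[String]

def plan(logical: String): PhysicalPlan =
  if (cachedPlans.contains(logical)) PhysicalPlan(s"InMemoryScan($logical)")
  else PhysicalPlan(s"Exec($logical)")

// The single choke point every action goes through.
def withAction[A](qe: QueryExecution)(body: QueryExecution => A): A = {
  val effective =
    if (cachedPlans.contains(qe.logical) && !qe.physical.isCacheScan)
      qe.copy(physical = plan(qe.logical)) // re-plan so the cache is used
    else qe
  body(effective)
}

val stale = QueryExecution("Scan(t)", plan("Scan(t)")) // planned before caching
cachedPlans += "Scan(t)"                               // cache() happens later
val used = withAction(stale)(identity)
assert(used.physical.isCacheScan) // the action now reads from the cache
```

Because the check sits inside the one helper, actions that were planned before `cache()` was called still pick up the cached data, which is the consistency being discussed above.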
dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1968973266
> > df.count() and df.collect().size should always agree.
>
> how about this idea: when calling `df.collect()`, if the plan is cached
but the physical plan is not a cache scan,
cloud-fan commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1968268514
> df.count() and df.collect().size should always agree.
how about this idea: when calling `df.collect()`, if the plan is cached but
the physical plan is not a cache scan, then we
dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1964179051
> > This PR creates a new Dataset instance, but the old one would still have
the same inconsistent behavior.
>
> Another possible approach is to create a new `queryExecution`
nchammas commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1962513221
> > This PR creates a new Dataset instance, but the old one would still have
the same inconsistent behavior.
>
> Another possible approach is to create a new `queryExecution`
doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1959304539
> This PR creates a new Dataset instance, but the old one would still have
the same inconsistent behavior.
Another possible approach is to create a new `queryExecution` instance.
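The two alternatives in this exchange differ in which object gets replaced. A rough sketch (hypothetical names, not Spark's real API): returning a new Dataset leaves older references stale, while swapping in a new `queryExecution` behind the existing Dataset fixes every holder of that reference:

```scala
// Hypothetical model contrasting the two approaches; not Spark's real API.
final class QueryExecution(val usesCache: Boolean)

final class Dataset {
  // Mutable slot so caching can replace the execution in place.
  var queryExecution: QueryExecution = new QueryExecution(usesCache = false)

  // Approach 1 (as described of the PR): return a *new* Dataset instance.
  def cacheReturningNewDataset(): Dataset = {
    val fresh = new Dataset
    fresh.queryExecution = new QueryExecution(usesCache = true)
    fresh // `this` keeps the stale, non-cache execution
  }

  // Approach 2 (the alternative): create a new queryExecution for `this`.
  def cacheReplacingQueryExecution(): this.type = {
    queryExecution = new QueryExecution(usesCache = true)
    this
  }
}

val a = new Dataset
val b = a.cacheReturningNewDataset()
// The old reference `a` still carries the inconsistent behavior:
assert(b.queryExecution.usesCache && !a.queryExecution.usesCache)

val c = new Dataset
c.cacheReplacingQueryExecution()
assert(c.queryExecution.usesCache) // the same instance is now consistent
```

This only illustrates the reference-identity trade-off raised in the thread, not how Spark actually wires `QueryExecution` to `Dataset`.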
doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1958979959
Hi @dongjoon-hyun @nchammas,
I've made some changes; would you please take another look?
Are there any remaining problems with this PR?
doki23 commented on code in PR #45181:
URL: https://github.com/apache/spark/pull/45181#discussion_r1498639875
File: sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala
@@ -82,6 +82,26 @@ class DatasetCacheSuite extends QueryTest
assert(cached.storageLevel
dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1957724807
I believe the main issue is that `cache` changes the results (logically it
shouldn't have any effect).
This PR creates a new `Dataset` instance, but the old one would still have
doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1957040482
> Here is the Python test I am running, which is simplified from the
original reproduction that Denis posted:
I haven't fixed PySpark's DataFrame API, so it should still be
URL: https://github.com/apache/spark/pull/45181#issuecomment-1956773367
> It's not an ideal behavior but should be easy to work around
Just to be clear, do you not consider it a correctness issue? To me, it's a
correctness issue since the existing
nchammas commented on code in PR #45181:
URL: https://github.com/apache/spark/pull/45181#discussion_r1497634106
File: sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala
@@ -82,6 +82,26 @@ class DatasetCacheSuite extends QueryTest
doki23 commented on code in PR #45181:
URL: https://github.com/apache/spark/pull/45181#discussion_r1496852441
File: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -3878,7 +3878,7 @@ class Dataset[T] private[sql](
*/
def persist(newLevel: StorageLevel):
doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1955732649
> Could you enable GitHub Action on your repository, @doki23 ? Apache Spark
community uses the contributor's GitHub Action resources.
>
> *
cloud-fan commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1955662603
It's not ideal behavior, but it should be easy to work around
(`df.select("*").collect()`). IIUC this PR is also a workaround of sorts, as
the original `df` can't use the cache anyway because
dongjoon-hyun commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1955030719
cc @cloud-fan and @HyukjinKwon .
dongjoon-hyun commented on code in PR #45181:
URL: https://github.com/apache/spark/pull/45181#discussion_r1496457259
File: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -3878,7 +3878,7 @@ class Dataset[T] private[sql](
*/
def persist(newLevel:
dongjoon-hyun commented on code in PR #45181:
URL: https://github.com/apache/spark/pull/45181#discussion_r1496453576
File: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -3878,7 +3878,7 @@ class Dataset[T] private[sql](
*/
def persist(newLevel:
dongjoon-hyun commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1955005622
Could you enable GitHub Action on your repository, @doki23 ? Apache Spark
community uses the contributor's GitHub Action resources.
-
doki23 opened a new pull request, #45181:
URL: https://github.com/apache/spark/pull/45181
### What changes were proposed in this pull request?
This PR fixes
[SPARK-46992](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-46992).
It makes the `cache()` method return a new