Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-05 Thread via GitHub


doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1979031337

   I've force-updated my PR; it now makes the smallest possible change and fixes this issue completely.




Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-05 Thread via GitHub


dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1978837075

   > We can't cache the queryExecution in the Dataset itself, because the queryExecution may come from another Dataset instance. See `isEmpty`:
   > 
   > ```scala
   > def isEmpty: Boolean = withAction("isEmpty", select().limit(1).queryExecution) { plan =>
   >   plan.executeTake(1).isEmpty
   > }
   > ```
   
   I don't see a problem here.
   Yes, we can only cache our own `queryExecution` instances, the ones associated with our `logicalPlan` (the same way it's cached now as a `val` in the constructor).
   
   `select().limit(1)` will create two more short-lived `Dataset` instances, but we don't care about them: they're just an implementation detail, unrelated to our `Dataset` instance.
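   
   For context, a simplified sketch of how `queryExecution` is held today (paraphrased from `Dataset.scala`, with details omitted):
   ```scala
   class Dataset[T] private[sql](
       @DeveloperApi @Unstable @transient val queryExecution: QueryExecution,
       @DeveloperApi @Unstable @transient val encoder: Encoder[T])
     extends Serializable {
     // queryExecution is fixed at construction time, which is why a Dataset
     // whose physical plan was materialized before cache() never picks up
     // the cached relation afterwards.
   }
   ```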
   





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-05 Thread via GitHub


doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1978809925

   We can't cache the queryExecution in the Dataset itself, because the queryExecution may come from another Dataset instance.
   See `isEmpty`:
   ```scala
   def isEmpty: Boolean = withAction("isEmpty", select().limit(1).queryExecution) { plan =>
     plan.executeTake(1).isEmpty
   }
   ```





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-05 Thread via GitHub


dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1978804309

   > > Regardless of the answer I think it makes sense to use the same approach for both Dataset states (persisted and unpersisted).
   > 
   > I agree. We can cache it in a lazy variable `queryExecutionPersisted`.
   
   :+1:  to `lazy val`
   
   > > The additional responsibility is independent, so it should be in a separate method which provides the proper QueryExecution: if we do that then we'll get something similar to the def queryExecution: QueryExecution method above.
   > 
   > I'm afraid it'll be a user-facing change if we can only access `queryExecution` via the method `queryExecution()`.
   
   I don't think we have a choice... Otherwise, using the unpersisted `QueryExecution` when the dataset is cached may produce inconsistent results. Basically, the bug is user-facing, so our fix has to be user-facing too, by definition.
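   
   (For illustration only, my own sketch rather than anything from the PR: the observable difference with a `def` is that, for some Dataset `ds`, the returned instance is no longer stable across caching-state changes.)
   ```scala
   val qe1 = ds.queryExecution  // before ds.cache()
   ds.cache()
   val qe2 = ds.queryExecution  // after ds.cache()
   // With today's constructor val, qe1 eq qe2 always holds.
   // With the proposed def, qe2 would be the persisted-state instance,
   // so qe1 eq qe2 would be false: visible to any code relying on stability.
   ```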
   





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-05 Thread via GitHub


doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1978337106

   > Regardless of the answer I think it makes sense to use the same approach for both Dataset states (persisted and unpersisted).
   
   I agree. We can cache it in a lazy variable `queryExecutionPersisted`.
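   
   A minimal sketch of that idea (my own illustration; the name and the constructor-argument reuse are assumptions based on the snippet quoted elsewhere in this thread):
   ```scala
   // Hypothetical: lazily build and cache the persisted-state QueryExecution,
   // reusing this Dataset's own session, logical plan, tracker, and mode.
   @transient private lazy val queryExecutionPersisted: QueryExecution =
     new QueryExecution(
       queryExecution.sparkSession,
       queryExecution.logical,
       queryExecution.tracker,
       queryExecution.mode)
   ```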
   
   > The additional responsibility is independent, so it should be in a separate method which provides the proper QueryExecution: if we do that then we'll get something similar to the def queryExecution: QueryExecution method above.
   
   I'm afraid it'll be a user-facing change if we can only access `queryExecution` via the method `queryExecution()`.





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-04 Thread via GitHub


dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1976817040

   > What're your ideas?
   
   Looking at the code, I think it's going to work and will fix the issue.
   
   I do have a couple of questions though:
   
   1. We'll have an asymmetry: the `QueryExecution` instance for the unpersisted state is cached/reused in the `Dataset`, but for the persisted state it's always new (never cached/reused). Do we want to reuse a "cached" `QueryExecution` instance for performance reasons, or does it not matter much? Regardless of the answer, I think it makes sense to use the same approach for both `Dataset` states (persisted and unpersisted).
   
   If we want to reuse `QueryExecution` instances then we'll need to have both instances in the constructor:
   ```scala
   class Dataset[T] private[sql](
       @DeveloperApi @Unstable @transient val queryExecutionUnpersisted: QueryExecution,
       @DeveloperApi @Unstable @transient val queryExecutionPersisted: QueryExecution,
   ```
   
   If we don't need to reuse them, then we should probably have these values instead and always create new `QueryExecution` instances regardless of the `Dataset` persistence state:
   ```scala
   class Dataset[T] private[sql](
       val sparkSession: SparkSession,
       val logical: LogicalPlan,
       val tracker: QueryPlanningTracker = new QueryPlanningTracker,
       val mode: CommandExecutionMode.Value = CommandExecutionMode.ALL
   ```
   
   Even though the latter approach is simpler, I suspect that we want to reuse `QueryExecution` instances to avoid redoing the same work (the plan analysis) over and over again; see the sketch after this list.
   
   
   2. The method `withAction` accepts a `QueryExecution`, so logically that argument is expected to be used as-is, yet the method may swap it out based on the `Dataset` persistence state. Does this additional responsibility belong in the `withAction` method? It seems implicit to me: nothing in the name or arguments hints at it. And is this the only place where we need different `QueryExecution` instances? The additional responsibility is independent, so it should live in a separate method that provides the proper `QueryExecution`: if we do that, we'll end up with something similar to the `def queryExecution: QueryExecution` method above.
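   
   On the reuse question in point 1: reuse matters because `QueryExecution` computes its phases lazily, so a reused instance pays for analysis and planning only once. A simplified sketch of that structure (the real class has more phases and plumbing):
   ```scala
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
   import org.apache.spark.sql.execution.SparkPlan
   
   class QueryExecutionSketch(/* session, logical plan, ... */) {
     lazy val analyzed: LogicalPlan = ???      // runs the analyzer once
     lazy val optimizedPlan: LogicalPlan = ??? // runs the optimizer once
     lazy val executedPlan: SparkPlan = ???    // plans and prepares execution once
   }
   ```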





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-03 Thread via GitHub


doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1975750721

   > @doki23 Do you have time to continue working on the pull request? It seems to me that it's close to completion.
   
   @dtarima Of course, I'm glad to move it forward.





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-03-03 Thread via GitHub


dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1975636851

   @doki23 Do you have time to continue working on the pull request? It seems to me that it's close to completion.





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-28 Thread via GitHub


dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1969241145

   `Dataset#withAction` accepts a `queryExecution`, so every call site that passes it through unchanged would need to be fixed, like `withAction("collectAsArrowToR", queryExecution)`.
   There is no need to change the other call sites where `queryExecution` is already a different instance, like `withAction("head", limit(n).queryExecution)`. I'm afraid the code is going to be too implicit for new readers to quickly grasp the meaning of the additional `select`.
   
   We could have something like this instead (choosing the proper instance, depending on the caching state, in one place):
   ```scala
   class Dataset[T] private[sql](
       @DeveloperApi @Unstable @transient val queryExecutionUnpersisted: QueryExecution,
       @DeveloperApi @Unstable @transient val queryExecutionPersisted: QueryExecution,
       @DeveloperApi @Unstable @transient val encoder: Encoder[T])
     extends Serializable {
   
     def this(qe: QueryExecution, encoder: Encoder[T]) = {
       this(
         new QueryExecution(qe.sparkSession, qe.logical, qe.tracker, qe.mode),
         new QueryExecution(qe.sparkSession, qe.logical, qe.tracker, qe.mode),
         encoder)
     }
   
     def queryExecution: QueryExecution = {
       val cacheManager = queryExecutionUnpersisted.sparkSession.sharedState.cacheManager
       val plan = queryExecutionUnpersisted.logical
       if (cacheManager.lookupCachedData(plan).isEmpty) queryExecutionUnpersisted
       else queryExecutionPersisted
     }
   
     ...
   ```





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-28 Thread via GitHub


cloud-fan commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1969143795

   I think all these actions go through `Dataset#withAction`, so we have a narrow waist where we can fix the issue.





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-28 Thread via GitHub


dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1968973266

   > > df.count() and df.collect().size should always agree.
   > 
   > how about this idea: when calling `df.collect()`, if the plan is cached but the physical plan is not a cache scan, then we do `df.select("*").collect()` instead of executing the current physical plan.
   
   Possible, but to my mind it's less robust.
   
   Here is a representation of the consequences flowing from the root cause:
   `caching changes the result` -- affects --> `queryExecution caching in Dataset` -- affects --> `collect()`
   
   I know that `collect()` is not the only affected method: `toLocalIterator()` is another one. There might be more we don't know about, and more could be added in the future. If the root cause is not fixed, there is a high probability of a similar bug reappearing.
   
   Ideally `caching changes the result` should be fixed. If that's impossible for some reason, then `queryExecution caching in Dataset` would be the next issue to fix.





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-27 Thread via GitHub


cloud-fan commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1968268514

   > df.count() and df.collect().size should always agree.
   
   How about this idea: when calling `df.collect()`, if the plan is cached but the physical plan is not a cache scan, then we do `df.select("*").collect()` instead of executing the current physical plan.
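   
   A rough sketch of that check (my own illustration with a hypothetical helper; `CacheManager` and `lookupCachedData` are package-private, so this would have to live inside `org.apache.spark.sql`):
   ```scala
   import org.apache.spark.sql.{DataFrame, Row}
   import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec
   
   // Hypothetical helper sketching the proposed collect() behavior.
   def collectThroughCache(df: DataFrame): Array[Row] = {
     val cacheManager = df.sparkSession.sharedState.cacheManager
     val isCached = cacheManager.lookupCachedData(df.queryExecution.logical).isDefined
     val usesCacheScan = df.queryExecution.executedPlan
       .find(_.isInstanceOf[InMemoryTableScanExec]).isDefined
     if (isCached && !usesCacheScan) {
       // The plan was materialized before caching: re-plan via a no-op
       // projection so the cached relation can be substituted in.
       df.select("*").collect()
     } else {
       df.collect()
     }
   }
   ```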





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-26 Thread via GitHub


dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1964179051

   > > This PR creates a new Dataset instance, but the old one would still have the same inconsistent behavior.
   > 
   > Another possible approach is to create a new `queryExecution` instance. But we would need to change it to a mutable variable, which is unsafe.
   
   `Dataset` could keep two instances of `queryExecution`, one for the `persisted` and one for the `unpersisted` state
   (I added a more detailed description in a comment on the ticket).





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-24 Thread via GitHub


nchammas commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1962513221

   > > This PR creates a new Dataset instance, but the old one would still have the same inconsistent behavior.
   > 
   > Another possible approach is to create a new `queryExecution` instance. But we would need to change it to a mutable variable, which is unsafe.
   
   I don't understand the problem well enough to say whether this is the better approach. I'll defer to others for the time being.





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-22 Thread via GitHub


doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1959304539

   > This PR creates a new Dataset instance, but the old one would still have the same inconsistent behavior.
   
   Another possible approach is to create a new `queryExecution` instance. But we would need to change it to a mutable variable, which is unsafe.





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-22 Thread via GitHub


doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1958979959

   Hi @dongjoon-hyun @nchammas,
   I've made some changes; would you please take another look?
   Are there any remaining problems with this PR?





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-21 Thread via GitHub


doki23 commented on code in PR #45181:
URL: https://github.com/apache/spark/pull/45181#discussion_r1498639875


##
sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala:
##
@@ -82,6 +82,26 @@ class DatasetCacheSuite extends QueryTest
     assert(cached.storageLevel == StorageLevel.NONE, "The Dataset should not be cached.")
   }
 
+  test("SPARK-46992 collect before persisting") {
+    val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS().select(expr("_2 + 1").as[Int])
+    // collect first
+    ds.collect()
+    // and then cache it
+    val cached = ds.cache()
+    // ds is not cached
+    assertNotCached(ds)
+    // Make sure the Dataset is indeed cached.
+    assertCached(cached)
+
+    // Check result.
+    checkDataset(
+      cached,
+      2, 3, 4)

Review Comment:
   It makes sure that the cached data of the new `Dataset` instance is as expected. I'll also add one more case proving that the results of `cached.count()` and `cached.collect()` are consistent.
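   
   (A hedged sketch of what that extra case might look like, not the committed test:)
   ```scala
   // Hypothetical extra assertion: count() and collect() must agree on the
   // cached Dataset, which is exactly the invariant SPARK-46992 is about.
   assert(cached.count() === cached.collect().length.toLong)
   ```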






Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-21 Thread via GitHub


dtarima commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1957724807

   I believe the main issue is that `cache` changes the results (logically it shouldn't have any effect on them).
   
   This PR creates a new `Dataset` instance, but the old one would still have the same inconsistent behavior.





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-21 Thread via GitHub


doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1957040482

   > Here is the Python test I am running, which is simplified from the original reproduction that Denis posted:
   
   I haven't fixed PySpark's DataFrame API, so it should still be incorrect there.





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-21 Thread via GitHub


nchammas commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1956773367

   > It's not an ideal behavior but should be easy to work around
   
   Just to be clear, do you not consider it a correctness issue? To me it is one, since the existing behavior on `master` violates what should be a basic invariant: `df.count()` and `df.collect().size` should always agree.
   
   But this does not always hold, as the repro documented in the issue description shows. I also posted my own repro [just above][1].
   
   [1]: https://github.com/apache/spark/pull/45181#pullrequestreview-1893314799
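   
   (For readers of this archive: a hedged sketch of the kind of sequence that can expose the violation, assuming a spark-shell style `spark` session and using a non-deterministic plan to make the divergence visible. This is my own illustration, not the repro from the issue.)
   ```scala
   import org.apache.spark.sql.functions.rand
   
   val df = spark.range(100).select(rand().as("r")).filter("r > 0.5")
   df.collect()         // materializes df's physical plan before any caching
   df.cache()           // registers the logical plan with the CacheManager
   df.count()           // builds a new Dataset/plan -> scans the cache
   df.collect().size    // reuses the pre-cache physical plan -> recomputes rand()
   // The two can now disagree, because collect() bypasses the cache.
   ```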





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-21 Thread via GitHub


nchammas commented on code in PR #45181:
URL: https://github.com/apache/spark/pull/45181#discussion_r1497634106


##
sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala:
##
@@ -82,6 +82,26 @@ class DatasetCacheSuite extends QueryTest
     assert(cached.storageLevel == StorageLevel.NONE, "The Dataset should not be cached.")
   }
 
+  test("SPARK-46992 collect before persisting") {
+    val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS().select(expr("_2 + 1").as[Int])
+    // collect first
+    ds.collect()
+    // and then cache it
+    val cached = ds.cache()
+    // ds is not cached
+    assertNotCached(ds)
+    // Make sure the Dataset is indeed cached.
+    assertCached(cached)
+
+    // Check result.
+    checkDataset(
+      cached,
+      2, 3, 4)

Review Comment:
   Are you sure this is a valid test? Because this particular check passes for 
me on `master`.






Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-20 Thread via GitHub


doki23 commented on code in PR #45181:
URL: https://github.com/apache/spark/pull/45181#discussion_r1496852441


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -3878,7 +3878,7 @@ class Dataset[T] private[sql](
    */
   def persist(newLevel: StorageLevel): this.type = {
     sparkSession.sharedState.cacheManager.cacheQuery(this, None, newLevel)
-    this
+    Dataset(sparkSession, this.queryExecution.logical).asInstanceOf[this.type]

Review Comment:
   Hmm... sorry, I don't understand your concern. It doesn't change the immutability of the cached data.






Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-20 Thread via GitHub


doki23 commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1955732649

   > Could you enable GitHub Actions on your repository, @doki23? The Apache Spark community uses the contributor's GitHub Actions resources.
   > 
   > * https://github.com/apache/spark/pull/45181/checks?check_run_id=21768064346
   > 
   > ![Screenshot 2024-02-20 at 12 22 49](https://github.com/apache/spark/assets/9700541/09dc02a9-8329-41af-b127-52d6274995cd)
   
   OK, I've enabled it.





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-20 Thread via GitHub


cloud-fan commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1955662603

   It's not ideal behavior, but it should be easy to work around (`df.select("*").collect()`). IIUC this PR is also something of a workaround, as the original `df` can't use the cache anyway because its physical plan has already been materialized.





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-20 Thread via GitHub


dongjoon-hyun commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1955030719

   cc @cloud-fan and @HyukjinKwon .





Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45181:
URL: https://github.com/apache/spark/pull/45181#discussion_r1496457259


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -3878,7 +3878,7 @@ class Dataset[T] private[sql](
    */
   def persist(newLevel: StorageLevel): this.type = {
     sparkSession.sharedState.cacheManager.cacheQuery(this, None, newLevel)
-    this
+    Dataset(sparkSession, this.queryExecution.logical).asInstanceOf[this.type]

Review Comment:
   IIUC, Apache Spark's persisted data (the underlying RDD) can always be recomputed, can't it? Persistence is only a best-effort mechanism to reduce re-computation. Do we guarantee the cached data's immutability?







Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45181:
URL: https://github.com/apache/spark/pull/45181#discussion_r1496453576


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -3878,7 +3878,7 @@ class Dataset[T] private[sql](
    */
   def persist(newLevel: StorageLevel): this.type = {
     sparkSession.sharedState.cacheManager.cacheQuery(this, None, newLevel)
-    this
+    Dataset(sparkSession, this.queryExecution.logical).asInstanceOf[this.type]

Review Comment:
   BTW, the original code came from #4686 (at least) and appears to have been the default Apache Spark behavior for a long time.






Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]

2024-02-20 Thread via GitHub


dongjoon-hyun commented on PR #45181:
URL: https://github.com/apache/spark/pull/45181#issuecomment-1955005622

   Could you enable GitHub Actions on your repository, @doki23? The Apache Spark community uses the contributor's GitHub Actions resources.
   
   - https://github.com/apache/spark/pull/45181/checks?check_run_id=21768064346
   
   ![Screenshot 2024-02-20 at 12 22 
49](https://github.com/apache/spark/assets/9700541/09dc02a9-8329-41af-b127-52d6274995cd)
   

