Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1979031337

I've force-updated my PR; it now carries the smallest possible change and fixes this issue completely.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org; for additional commands, e-mail: reviews-h...@spark.apache.org. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1978837075

> We can't cache the queryExecution in the Dataset itself because the queryExecution may come from another Dataset instance. See `isEmpty`:
>
> ```scala
> def isEmpty: Boolean = withAction("isEmpty", select().limit(1).queryExecution) { plan =>
>   plan.executeTake(1).isEmpty
> }
> ```

I don't see a problem here. Yes, we can only cache our own `queryExecution` instances associated with our `logicalPlan` (the same way it's cached now as a `val` in the constructor). `select().limit(1)` will create two more short-lived `Dataset` instances, but we don't care about them - it's just an implementation detail, not related to our `Dataset` instance.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1978809925

We can't cache the queryExecution in the Dataset itself because the queryExecution may come from another Dataset instance. See `isEmpty`:

```scala
def isEmpty: Boolean = withAction("isEmpty", select().limit(1).queryExecution) { plan =>
  plan.executeTake(1).isEmpty
}
```
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1978804309

> > Regardless of the answer I think it makes sense to use the same approach for both Dataset states (persisted and unpersisted).
>
> I agree. We can cache it in a lazy variable `queryExecutionPersisted`.

:+1: to `lazy val`

> > The additional responsibility is independent so it should be in a separate method which provides proper QueryExecution: if we do that then we'll get something similar to def queryExecution: QueryExecution method above.
>
> I'm afraid it'll be a user-facing change if we can only access `queryExecution` by method `queryExecution()`.

I don't think we have a choice... Otherwise, using an unpersisted `QueryExecution` when the dataset is cached may result in inconsistencies. Basically, the bug is user-facing, so our fix has to be user-facing too, by definition.
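[Editorial illustration] The `lazy val` memoization being agreed on here can be modeled in miniature. The sketch below is plain Python (not Spark code), and all names are hypothetical stand-ins: `cached_property` plays the role of Scala's `lazy val`, so each `QueryExecution` flavor is resolved at most once per `Dataset`, however many actions touch it.

```python
# Toy model of lazy-val memoization: the expensive plan analysis runs at most
# once per attribute, no matter how many actions access it afterwards.
from functools import cached_property

ANALYSIS_RUNS = 0  # counts how many times the "analysis" actually executes


class AnalyzedPlan:
    """Stands in for QueryExecution; constructing it 'analyzes' the plan."""

    def __init__(self, logical):
        global ANALYSIS_RUNS
        ANALYSIS_RUNS += 1
        self.logical = logical


class ToyDataset:
    def __init__(self, logical):
        self.logical = logical

    @cached_property
    def query_execution_unpersisted(self):
        # Resolved on first access only, then reused by every later action.
        return AnalyzedPlan(self.logical)

    @cached_property
    def query_execution_persisted(self):
        # A separate, also-memoized instance for the cached path.
        return AnalyzedPlan(self.logical)


ds = ToyDataset("Project [_2 + 1]")
ds.query_execution_unpersisted   # first access: analyzes
ds.query_execution_unpersisted   # reused, no second analysis
ds.query_execution_persisted     # the persisted flavor analyzes once too
print(ANALYSIS_RUNS)             # 2
```

The point of the `lazy val` over a plain `val` is the same in both languages: the persisted-path analysis is paid only if the dataset is ever actually cached.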
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1978337106

> Regardless of the answer I think it makes sense to use the same approach for both Dataset states (persisted and unpersisted).

I agree. We can cache it in a lazy variable `queryExecutionPersisted`.

> The additional responsibility is independent so it should be in a separate method which provides proper QueryExecution: if we do that then we'll get something similar to def queryExecution: QueryExecution method above.

I'm afraid it'll be a user-facing change if we can only access `queryExecution` via the method `queryExecution()`.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1976817040

> What're your ideas?

Looking at the code, I think it's going to work and will fix the issue. I do have a couple of questions though:

1. We'll have an asymmetry: the `QueryExecution` instance for the unpersisted state is cached/reused in the `Dataset`, but for the persisted state it's always new (never cached/reused). Do we want to use a "cached" `QueryExecution` instance for performance reasons, or does it not matter much? Regardless of the answer, I think it makes sense to use the same approach for both `Dataset` states (persisted and unpersisted). If we want to reuse `QueryExecution` instances, then we'll need to have both instances in the constructor:

   ```
   class Dataset[T] private[sql](
       @DeveloperApi @Unstable @transient val queryExecutionUnpersisted: QueryExecution,
       @DeveloperApi @Unstable @transient val queryExecutionPersisted: QueryExecution,
   ```

   If we don't need to reuse them, then we should probably have these values instead and always create new `QueryExecution` instances regardless of the `Dataset` persistence state:

   ```
   class Dataset[T] private[sql](
       val sparkSession: SparkSession,
       val logical: LogicalPlan,
       val tracker: QueryPlanningTracker = new QueryPlanningTracker,
       val mode: CommandExecutionMode.Value = CommandExecutionMode.ALL
   ```

   Even though the latter approach is simpler, I suspect that we want to reuse `QueryExecution` instances to avoid doing the same work (the plan analysis) over and over again.

2. The `withAction` method accepts a `QueryExecution`, so logically that instance is expected to be used, but it may swap it based on the `Dataset` persistence state. Does the additional responsibility belong in `withAction`? It seems implicit to me: nothing in the name or arguments hints at it. Is this the only place where we need different `QueryExecution` instances?

   The additional responsibility is independent, so it should live in a separate method that provides the proper `QueryExecution`: if we do that, then we'll get something similar to the `def queryExecution: QueryExecution` method above.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1975750721

> @doki23 Do you have time to continue working on the pull request? It seems to me that it's close to completion.

@dtarima Of course, I'm glad to move it forward.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1975636851

@doki23 Do you have time to continue working on the pull request? It seems to me that it's close to completion.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1969241145

`Dataset#withAction` accepts a `queryExecution`, so every call site that passes it through unchanged would need to be fixed, like `withAction("collectAsArrowToR", queryExecution)`. No change is needed where `queryExecution` is already a different instance, like `withAction("head", limit(n).queryExecution)`. I'm afraid the code would be too implicit for new readers to quickly grasp the meaning of the additional `select`.

We could have something like this instead (choosing the proper instance, depending on the caching state, in one place):

```
class Dataset[T] private[sql](
    @DeveloperApi @Unstable @transient val queryExecutionUnpersisted: QueryExecution,
    @DeveloperApi @Unstable @transient val queryExecutionPersisted: QueryExecution,
    @DeveloperApi @Unstable @transient val encoder: Encoder[T])
  extends Serializable {

  def this(qe: QueryExecution, encoder: Encoder[T]) = {
    this(
      new QueryExecution(qe.sparkSession, qe.logical, qe.tracker, qe.mode),
      new QueryExecution(qe.sparkSession, qe.logical, qe.tracker, qe.mode),
      encoder)
  }

  def queryExecution: QueryExecution = {
    val cacheManager = queryExecutionUnpersisted.sparkSession.sharedState.cacheManager
    val plan = queryExecutionUnpersisted.logical
    if (cacheManager.lookupCachedData(plan).isEmpty) queryExecutionUnpersisted
    else queryExecutionPersisted
  }

  ...
```
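[Editorial illustration] The two-instance selection idea in dtarima's snippet can be exercised as a runnable toy model. This is plain Python rather than Spark's Scala, every name is a hypothetical stand-in, and the "cache manager" is just a dict; the point it demonstrates is that once the proper instance is chosen in one place, results stay consistent even after the dataset is cached.

```python
# Toy model: a Dataset keeps two QueryExecution stand-ins and a single
# accessor picks the right one based on the current caching state.
cache_manager = {}  # logical plan (str) -> cached rows


class ToyQueryExecution:
    """Stands in for QueryExecution; its physical plan is fixed on first use."""

    def __init__(self, logical, compute):
        self.logical = logical
        self.compute = compute
        self.materialized = None

    def execute(self):
        if self.materialized is None:
            # Plan against the cache state seen at first execution, then keep it.
            if self.logical in cache_manager:
                self.materialized = lambda: cache_manager[self.logical]
            else:
                self.materialized = self.compute
        return self.materialized()


class ToyDataset:
    def __init__(self, logical, compute):
        self.logical = logical
        self._unpersisted = ToyQueryExecution(logical, compute)
        self._persisted = ToyQueryExecution(logical, compute)

    @property
    def query_execution(self):
        # The one place that picks the proper instance for the current state.
        return self._persisted if self.logical in cache_manager else self._unpersisted

    def collect(self):
        return self.query_execution.execute()

    def cache(self):
        cache_manager[self.logical] = self.collect()


rows = [2, 3, 4]
ds = ToyDataset("Project [_2 + 1]", lambda: list(rows))
ds.collect()         # materializes the unpersisted instance's physical plan
ds.cache()
rows.append(5)       # the source changes after caching
print(ds.collect())  # [2, 3, 4]: the persisted instance reads through the cache
```

Because the old, already-materialized physical plan lives only in the unpersisted instance, the same `Dataset` object switches to the cache-aware instance the moment the plan appears in the cache manager.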
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
cloud-fan commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1969143795

I think all these actions go through `Dataset#withAction`, so we have a narrow waist at which to fix the issue.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1968973266

> > df.count() and df.collect().size should always agree.
>
> how about this idea: when calling `df.collect()`, if the plan is cached but the physical plan is not a cache scan, then we do `df.select("*").collect()` instead of executing the current physical plan.

Possible, but in my mind it's less robust. Here is a representation of the consequences from the root cause:

`caching changes the result` -- affects --> `queryExecution caching in Dataset` -- affects --> `collect()`

I know that `collect()` is not the only affected method: `toLocalIterator()` is another one - there might be more we don't know about, and more could be added in the future. If the root cause is not fixed, there is a high probability of a similar bug reappearing.

Ideally, `caching changes the result` should be fixed. If that's impossible for some reason, then `queryExecution caching in Dataset` would be the next issue to fix.
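[Editorial illustration] The causal chain above can be made concrete with a toy reproduction in plain Python (not Spark code; all names are invented). A "Dataset" whose query execution is memoized at construction never sees a later `cache()`, while an action routed through a fresh plan does, so `count()` and `collect()` can disagree when the source changes or is non-deterministic.

```python
# Toy reproduction of the root cause: memoizing one query execution at
# construction means cache() never reaches collect().
cache_manager = {}  # logical plan -> cached snapshot


class BuggyDataset:
    def __init__(self, logical, compute):
        self.logical = logical
        self._query_execution = compute  # fixed at construction, before any cache()

    def collect(self):
        # Stale path: re-executes the original plan, never consulting the cache.
        return self._query_execution()

    def count(self):
        # Actions built from a fresh plan (like Spark's select().limit(1)) see the cache.
        cached = cache_manager.get(self.logical)
        return len(cached if cached is not None else self._query_execution())

    def cache(self):
        cache_manager[self.logical] = self._query_execution()


rows = [2, 3, 4]
ds = BuggyDataset("Project [_2 + 1]", lambda: list(rows))
ds.collect()              # an action runs before caching
ds.cache()                # the cache snapshots [2, 3, 4]
rows.append(5)            # the source then changes (or is non-deterministic)
print(ds.count())         # 3 -- served from the cache
print(len(ds.collect()))  # 4 -- re-executed, so count() and collect() disagree
```

Fixing either link in the chain - making caching not change results, or making the dataset stop reusing its pre-cache query execution - restores the `count()`/`collect().size` invariant for every action at once.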
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
cloud-fan commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1968268514

> df.count() and df.collect().size should always agree.

How about this idea: when calling `df.collect()`, if the plan is cached but the physical plan is not a cache scan, then we do `df.select("*").collect()` instead of executing the current physical plan.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1964179051

> > This PR creates a new Dataset instance, but the old one would still have the same inconsistent behavior.
>
> Another possible approach is to create a new `queryExecution` instance. But we'd need to change it to a mutable variable, which is unsafe.

`Dataset` could keep two instances of `queryExecution`, one for the `persisted` and one for the `unpersisted` state (I added a more detailed description in a comment on the ticket).
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
nchammas commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1962513221

> > This PR creates a new Dataset instance, but the old one would still have the same inconsistent behavior.
>
> Another possible approach is to create a new `queryExecution` instance. But we'd need to change it to a mutable variable, which is unsafe.

I don't understand the problem well enough to say whether this is the better approach. I'll defer to others for the time being.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1959304539

> This PR creates a new Dataset instance, but the old one would still have the same inconsistent behavior.

Another possible approach is to create a new `queryExecution` instance. But we'd need to change it to a mutable variable, which is unsafe.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1958979959

Hi @dongjoon-hyun @nchammas, I've made some changes; would you please take another look? Are there any remaining problems with this PR?
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
doki23 commented on code in PR #45181: URL: https://github.com/apache/spark/pull/45181#discussion_r1498639875

## sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala:

```diff
@@ -82,6 +82,26 @@ class DatasetCacheSuite extends QueryTest
     assert(cached.storageLevel == StorageLevel.NONE, "The Dataset should not be cached.")
   }

+  test("SPARK-46992 collect before persisting") {
+    val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS().select(expr("_2 + 1").as[Int])
+    // collect first
+    ds.collect()
+    // and then cache it
+    val cached = ds.cache()
+    // ds is not cached
+    assertNotCached(ds)
+    // Make sure the Dataset is indeed cached.
+    assertCached(cached)
+
+    // Check result.
+    checkDataset(
+      cached,
+      2, 3, 4)
```

Review Comment: It makes sure that the cached data of the new `Dataset` instance is as expected. I'll also add one more case that proves the results of `cached.count()` and `cached.collect()` are consistent.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
dtarima commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1957724807

I believe the main issue is that `cache` changes the results (logically it shouldn't have any effect). This PR creates a new `Dataset` instance, but the old one would still have the same inconsistent behavior.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1957040482

> Here is the Python test I am running, which is simplified from the original reproduction that Denis posted:

I haven't fixed PySpark's DataFrame API, so it should still be incorrect there.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
nchammas commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1956773367

> It's not an ideal behavior but should be easy to work around

Just to be clear, do you not consider it a correctness issue? To me, it's a correctness issue since the existing behavior on `master` violates what should be a basic invariant: `df.count()` and `df.collect().size` should always agree. But this is not always true, as the repro documented in the issue description shows. I also posted my own repro [just above][1].

[1]: https://github.com/apache/spark/pull/45181#pullrequestreview-1893314799
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
nchammas commented on code in PR #45181: URL: https://github.com/apache/spark/pull/45181#discussion_r1497634106

## sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala:

```diff
@@ -82,6 +82,26 @@ class DatasetCacheSuite extends QueryTest
     assert(cached.storageLevel == StorageLevel.NONE, "The Dataset should not be cached.")
   }

+  test("SPARK-46992 collect before persisting") {
+    val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS().select(expr("_2 + 1").as[Int])
+    // collect first
+    ds.collect()
+    // and then cache it
+    val cached = ds.cache()
+    // ds is not cached
+    assertNotCached(ds)
+    // Make sure the Dataset is indeed cached.
+    assertCached(cached)
+
+    // Check result.
+    checkDataset(
+      cached,
+      2, 3, 4)
```

Review Comment: Are you sure this is a valid test? Because this particular check passes for me on `master`.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
doki23 commented on code in PR #45181: URL: https://github.com/apache/spark/pull/45181#discussion_r1496852441

## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:

```diff
@@ -3878,7 +3878,7 @@ class Dataset[T] private[sql](
    */
   def persist(newLevel: StorageLevel): this.type = {
     sparkSession.sharedState.cacheManager.cacheQuery(this, None, newLevel)
-    this
+    Dataset(sparkSession, this.queryExecution.logical).asInstanceOf[this.type]
```

Review Comment: Hmm... Sorry, I don't understand your concern. It doesn't change the immutability of the cached data.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
doki23 commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1955732649

> Could you enable GitHub Action on your repository, @doki23 ? Apache Spark community uses the contributor's GitHub Action resources.
>
> * https://github.com/apache/spark/pull/45181/checks?check_run_id=21768064346

Ok, I've enabled it.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
cloud-fan commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1955662603

It's not ideal behavior, but it should be easy to work around (`df.select("*").collect()`). IIUC this PR is also like a workaround, as the original `df` can't use the cache anyway because its physical plan is already materialized.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
dongjoon-hyun commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1955030719

cc @cloud-fan and @HyukjinKwon .
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
dongjoon-hyun commented on code in PR #45181: URL: https://github.com/apache/spark/pull/45181#discussion_r1496457259

## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:

```diff
@@ -3878,7 +3878,7 @@ class Dataset[T] private[sql](
    */
   def persist(newLevel: StorageLevel): this.type = {
     sparkSession.sharedState.cacheManager.cacheQuery(this, None, newLevel)
-    this
+    Dataset(sparkSession, this.queryExecution.logical).asInstanceOf[this.type]
```

Review Comment: IIUC, Apache Spark's `persist` data (the underlying RDD) can always be recomputed, can't it? It's only a best-effort approach to reduce re-computation. Do we guarantee the cached data's immutability?
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
dongjoon-hyun commented on code in PR #45181: URL: https://github.com/apache/spark/pull/45181#discussion_r1496453576

## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:

```diff
@@ -3878,7 +3878,7 @@ class Dataset[T] private[sql](
    */
   def persist(newLevel: StorageLevel): this.type = {
     sparkSession.sharedState.cacheManager.cacheQuery(this, None, newLevel)
-    this
+    Dataset(sparkSession, this.queryExecution.logical).asInstanceOf[this.type]
```

Review Comment: BTW, the original code came from #4686 (at least) and seems to have been the default Apache Spark behavior for a long time.
Re: [PR] [SPARK-46992]make dataset.cache() return new ds instance [spark]
dongjoon-hyun commented on PR #45181: URL: https://github.com/apache/spark/pull/45181#issuecomment-1955005622

Could you enable GitHub Action on your repository, @doki23 ? The Apache Spark community uses the contributor's GitHub Action resources.

- https://github.com/apache/spark/pull/45181/checks?check_run_id=21768064346

![Screenshot 2024-02-20 at 12 22 49](https://github.com/apache/spark/assets/9700541/09dc02a9-8329-41af-b127-52d6274995cd)