Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-12-22 Thread via GitHub
EnricoMi commented on PR #43519: URL: https://github.com/apache/spark/pull/43519#issuecomment-1867978567 Next release is a major release, so perfect opportunity to improve API. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-12-22 Thread via GitHub
cloud-fan commented on PR #43519: URL: https://github.com/apache/spark/pull/43519#issuecomment-1867814446 We cannot remove a released API (making it private is the same as removal for end users). We can update the documentation to say that Spark always ignores the `name` parameter, though.

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-12-22 Thread via GitHub
EnricoMi commented on PR #43519: URL: https://github.com/apache/spark/pull/43519#issuecomment-1867801729 @cloud-fan what do you think about making `Dataset.observe(str, Column, Column*)` private?

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-23 Thread via GitHub
EnricoMi commented on code in PR #43519: URL: https://github.com/apache/spark/pull/43519#discussion_r1403202213 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala: ## @@ -1024,6 +1024,27 @@ class DatasetSuite extends QueryTest assert(namedObservation.get

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-23 Thread via GitHub
EnricoMi commented on code in PR #43519: URL: https://github.com/apache/spark/pull/43519#discussion_r1403196154 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala: ## @@ -1024,6 +1024,27 @@ class DatasetSuite extends QueryTest assert(namedObservation.get

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-23 Thread via GitHub
cloud-fan commented on code in PR #43519: URL: https://github.com/apache/spark/pull/43519#discussion_r1403120274 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala: ## @@ -1024,6 +1024,27 @@ class DatasetSuite extends QueryTest assert(namedObservation.get

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-23 Thread via GitHub
EnricoMi commented on PR #43519: URL: https://github.com/apache/spark/pull/43519#issuecomment-1824032786 > Oh sorry I was a bit confused as well. I think it's because of self-join, we don't require the observation name to be unique. > > With the new df_id parameter, seems we can now?

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-23 Thread via GitHub
EnricoMi commented on PR #43519: URL: https://github.com/apache/spark/pull/43519#issuecomment-1824030046 This whole problem goes away with unnamed `Observation` instances:
```
>>> observation1 = Observation()
>>> observation2 = Observation()
```
What is the purpose of

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-22 Thread via GitHub
beliefer commented on PR #43519: URL: https://github.com/apache/spark/pull/43519#issuecomment-1822712201 I guess this PR fixes a bug caused by multiple datasets sharing the same Spark session. The listener of `Observation` could receive the `SparkListenerSQLExecutionEnd`
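
The failure mode described in this comment can be illustrated with a toy model. This is only a sketch in plain Python, not Spark's implementation: `ExecutionEnd`, `ToyObservation`, and both handler methods are hypothetical names invented for illustration, and the integer dataset id stands in for the `df_id` mentioned elsewhere in the thread. It assumes that a listener which matches metrics by name alone can pick up metrics produced by a different dataset that happens to use the same observation name.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ExecutionEnd:
    """Hypothetical stand-in for an execution-end event (not a Spark class)."""
    # Maps (observation name, dataset id) to the metrics collected for it.
    observed: Dict[Tuple[str, int], dict]


@dataclass
class ToyObservation:
    """Hypothetical observation attached to exactly one dataset."""
    name: str
    dataset_id: int
    result: Optional[dict] = None

    def on_finish_by_name(self, event: ExecutionEnd) -> None:
        # Name-only matching: any entry with the same name is accepted,
        # even if it was produced by a different dataset.
        for (name, _dataset_id), metrics in event.observed.items():
            if name == self.name and self.result is None:
                self.result = metrics

    def on_finish_by_name_and_dataset(self, event: ExecutionEnd) -> None:
        # Matching on name plus dataset id: metrics from other datasets are ignored.
        metrics = event.observed.get((self.name, self.dataset_id))
        if metrics is not None and self.result is None:
            self.result = metrics


# An execution on dataset 1 finishes first and reports metrics named "metrics".
event_from_dataset_1 = ExecutionEnd({("metrics", 1): {"count": 10}})

# This observation belongs to dataset 2 but shares the name.
by_name = ToyObservation("metrics", dataset_id=2)
by_name.on_finish_by_name(event_from_dataset_1)

by_key = ToyObservation("metrics", dataset_id=2)
by_key.on_finish_by_name_and_dataset(event_from_dataset_1)
```

In this toy model `by_name.result` ends up holding the metrics of dataset 1, while `by_key.result` stays `None` until an event for dataset 2 arrives.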

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-21 Thread via GitHub
cloud-fan commented on PR #43519: URL: https://github.com/apache/spark/pull/43519#issuecomment-1822188529 Oh sorry I was a bit confused as well. I think it's because of self-join, we don't require the observation name to be unique.

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-21 Thread via GitHub
EnricoMi commented on PR #43519: URL: https://github.com/apache/spark/pull/43519#issuecomment-1821529980 > name is unique but df can be self-joined and observation will be duplicated. That sounds like an unrelated edge case. Example in the description is not a self-join of an

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-21 Thread via GitHub
cloud-fan commented on PR #43519: URL: https://github.com/apache/spark/pull/43519#issuecomment-1821335854 name is unique but df can be self-joined and observation will be duplicated.

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-21 Thread via GitHub
EnricoMi commented on PR #43519: URL: https://github.com/apache/spark/pull/43519#issuecomment-1821283381 What is the point of the name of an observation, if that is not the unique identifier? Create an observation without a name if you cannot come up with a unique name and `Observation()`

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-10-25 Thread via GitHub
HyukjinKwon closed pull request #43519: [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets URL: https://github.com/apache/spark/pull/43519

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-10-25 Thread via GitHub
HyukjinKwon commented on PR #43519: URL: https://github.com/apache/spark/pull/43519#issuecomment-1778720002 Merged to master.

[PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-10-24 Thread via GitHub
ueshin opened a new pull request, #43519: URL: https://github.com/apache/spark/pull/43519

### What changes were proposed in this pull request?

Fixes observation when named observations with the same name on different datasets.

### Why are the changes needed?

Currently