Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-12-22 Thread via GitHub


EnricoMi commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1867978567

   Next release is a major release, so perfect opportunity to improve API.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-12-22 Thread via GitHub


cloud-fan commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1867814446

   We can not remove (making private is the same as removal for end users) a 
released API. We can update the document and say Spark always ignore the `name` 
parameter though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-12-22 Thread via GitHub


EnricoMi commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1867801729

   @cloud-fan what do you think about making `Dataset.observe(str, Column, 
Column*)` private?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-23 Thread via GitHub


EnricoMi commented on code in PR #43519:
URL: https://github.com/apache/spark/pull/43519#discussion_r1403202213


##
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
##
@@ -1024,6 +1024,27 @@ class DatasetSuite extends QueryTest
 assert(namedObservation.get === expected)
   }
 
+  test("SPARK-45656: named observations with the same name on different 
datasets") {
+val namedObservation1 = Observation("named")
+val df1 = spark.range(50)
+val observed_df1 = df1.observe(
+  namedObservation1, count(lit(1)).as("count"))

Review Comment:
   There is a rule that enforces that `name` is unique within a plan (except 
for identical expressions: self-joins).
   
   Could we have some similar check across the spark session?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-23 Thread via GitHub


EnricoMi commented on code in PR #43519:
URL: https://github.com/apache/spark/pull/43519#discussion_r1403196154


##
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
##
@@ -1024,6 +1024,27 @@ class DatasetSuite extends QueryTest
 assert(namedObservation.get === expected)
   }
 
+  test("SPARK-45656: named observations with the same name on different 
datasets") {
+val namedObservation1 = Observation("named")
+val df1 = spark.range(50)
+val observed_df1 = df1.observe(
+  namedObservation1, count(lit(1)).as("count"))

Review Comment:
   If `name` in `Dataset.observe(name, expr, exprs)` is not unique across the 
spark session, then yes, we have the same problem. `Observation(name)` inherits 
the problem from `Dataset.observe(name, expr, exprs)`. `Observation()` avoids 
this problem.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-23 Thread via GitHub


cloud-fan commented on code in PR #43519:
URL: https://github.com/apache/spark/pull/43519#discussion_r1403120274


##
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
##
@@ -1024,6 +1024,27 @@ class DatasetSuite extends QueryTest
 assert(namedObservation.get === expected)
   }
 
+  test("SPARK-45656: named observations with the same name on different 
datasets") {
+val namedObservation1 = Observation("named")
+val df1 = spark.range(50)
+val observed_df1 = df1.observe(
+  namedObservation1, count(lit(1)).as("count"))

Review Comment:
   does the `observe` with string input have the same problem? or it's only an 
issue with `Observation`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-23 Thread via GitHub


EnricoMi commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1824032786

   > Oh sorry I was a bit confused as well. I think it's because of self-join, 
we don't require the observation name to be unique.
   > 
   > With the new df_id parameter, seems we can now? But it may be a breaking 
change.
   
   In a self-join, the `df_id` is identical between left and right side of the 
join.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-23 Thread via GitHub


EnricoMi commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1824030046

   This whole problem goes away with unnamed `Observation` instances:
   
   ```
   >>> observation1 = Observation()
   >>> observation2 = Observation()
   ```
   
   What is the purpose of the `name` anyway? It is not used anywhere, nor 
visible anyway. It is merely used in the query plan to give `CollectMetrics` 
some id.
   
   Historically, the `name` argument was introduced by the 
`Dataset.observe(str, Column, Column*)` method introduced in 3.0.0, as this 
`name` was the only identifier required / used to retrieve the metrics from the 
event listener.
   
   Then, in 3.3.0, I have introduced the `Observation` class to simplify 
retrieval of the metrics from the events, and added the `name` argument as 
optional, though there is no value in giving your observation a name.
   
   I'd propose to remove the optional `name` from Observation in 4.0.0, which 
fixes [SPARK-45656] by reducing complexity, rather than increasing complexity 
as per this PR.
   
   Arguably, `Dataset.observe(str, Column, Column*)` could be made 
`private[spark]` in 4.0.0 as well as `Dataset.observe(Observation, Column, 
Column*)` seems to be more user-friendly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-22 Thread via GitHub


beliefer commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1822712201

   I guess this PR fix a bug that caused by multiple datasets could share with 
the same spark session. The listener of `Observation` could receives the 
`SparkListenerSQLExecutionEnd` belongs to other `Observation`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-21 Thread via GitHub


cloud-fan commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1822188529

   Oh sorry I was a bit confused as well. I think it's because of self-join, we 
don't require the observation name to be unique.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-21 Thread via GitHub


EnricoMi commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1821529980

   > name is unique but df can be self-joined and observation will be 
duplicated.
   
   That sounds like an unrelated edge case. Example in the description is not a 
self-join of an observed dataset. And this PR does not fix this edge case: The 
self-joined dataset still contains two identical `CollectMetrics` with the same 
`dataframeId`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-21 Thread via GitHub


cloud-fan commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1821335854

   name is unique but df can be self-joined and observation will be duplicated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-11-21 Thread via GitHub


EnricoMi commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1821283381

   What is the point of the name of an observation, if that is not the unique 
identifier? Create an observation without a name if you cannot come up with a 
unique name and `Observation()` will pick a unique name for you.
   
   Why introducing another identifier when `name` is supposed to be unique?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-10-25 Thread via GitHub


HyukjinKwon closed pull request #43519: [SPARK-45656][SQL] Fix observation when 
named observations with the same name on different datasets
URL: https://github.com/apache/spark/pull/43519


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-10-25 Thread via GitHub


HyukjinKwon commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1778720002

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

2023-10-24 Thread via GitHub


ueshin opened a new pull request, #43519:
URL: https://github.com/apache/spark/pull/43519

   ### What changes were proposed in this pull request?
   
   Fixes observation when named observations with the same name on different 
datasets.
   
   ### Why are the changes needed?
   
   Currently if there are observations with the same name on different dataset, 
one of them will be overwritten by the other execution.
   
   For example,
   
   ```py
   >>> observation1 = Observation("named")
   >>> df1 = spark.range(50)
   >>> observed_df1 = df1.observe(observation1, count(lit(1)).alias("cnt"))
   >>>
   >>> observation2 = Observation("named")
   >>> df2 = spark.range(100)
   >>> observed_df2 = df2.observe(observation2, count(lit(1)).alias("cnt"))
   >>>
   >>> observed_df1.collect()
   ...
   >>> observed_df2.collect()
   ...
   >>> observation1.get
   {'cnt': 50}
   >>> observation2.get
   {'cnt': 50}
   ```
   
   `observation2` should return `{'cnt': 100}`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, the observations with the same name will be available if they observe 
different datasets.
   
   ### How was this patch tested?
   
   Added the related tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org