EnricoMi commented on PR #47688: URL: https://github.com/apache/spark/pull/47688#issuecomment-2296226981
The [spark-extension](https://github.com/G-Research/spark-extension) package provides some [Dataset diff tooling](https://github.com/G-Research/spark-extension/blob/master/DIFF.md). There, a user-defined comparison can be defined simply by implementing the [`scala.math.Equiv` interface](https://www.scala-lang.org/api/2.13.5/scala/math/Equiv.html): https://github.com/G-Research/spark-extension/blob/master/src/main/scala/uk/co/gresearch/spark/diff/DiffComparators.scala#L41

That `Equiv` implementation is wrapped into an `Expression` (including codegen) and turned into a `Comparator` that the package then uses to diff columns: given two columns `left` and `right`, it returns a `Column` that evaluates (compares the columns) to `Boolean`:

- [EquivDiffComparator.scala:34](https://github.com/G-Research/spark-extension/blob/master/src/main/scala/uk/co/gresearch/spark/diff/comparator/EquivDiffComparator.scala#L34)
- [EquivDiffComparator.scala:67](https://github.com/G-Research/spark-extension/blob/master/src/main/scala/uk/co/gresearch/spark/diff/comparator/EquivDiffComparator.scala#L67)

This obviously won't work for Spark Connect, but with the Column Node API it does not work for the classic Spark client either. The package supports Spark 3.0 - 3.5. Creating a `Column` from an `Expression` would allow minimal changes to keep this working for Spark 4.0 with a non-Connect client. This is what I meant by backward compatibility. In order to support Spark Connect, there is no way around using the Spark Connect plugin / extensions.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
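To illustrate the user-facing side of the `Equiv` approach: a hypothetical sketch (not the package's actual code) of a tolerance-based comparison, which is the kind of user-defined `scala.math.Equiv` the package wraps into an `Expression`:

```scala
import scala.math.Equiv

// Hypothetical user-defined comparison: two doubles are considered equal
// when they differ by less than a given tolerance. spark-extension would
// wrap such an Equiv into an Expression (with codegen) to diff columns.
class EpsilonEquiv(epsilon: Double) extends Equiv[Double] {
  override def equiv(x: Double, y: Double): Boolean =
    math.abs(x - y) < epsilon
}

object EpsilonDemo {
  def main(args: Array[String]): Unit = {
    val eq = new EpsilonEquiv(0.001)
    println(eq.equiv(1.0, 1.0005)) // true: within tolerance
    println(eq.equiv(1.0, 1.01))   // false: outside tolerance
  }
}
```

The point of the comment above is that this plain-Scala interface is all a user has to implement; everything Spark-specific (the `Expression` wrapping and the `Column`-producing `Comparator`) lives inside the package, which is why it is sensitive to whether a `Column` can still be created from an `Expression`.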