Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-30 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1620668890


##
sql/core/benchmarks/CollationBenchmark-results.txt:
##
@@ -1,54 +1,79 @@
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-UTF8_BINARY_LCASE3571   3576   
7  0.0   35708.8   1.0X
-UNICODE  2235   2240   
7  0.0   22349.2   1.6X
-UTF8_BINARY  2237   2242   
6  0.0   22371.7   1.6X
-UNICODE_CI  18733  18817   
  118  0.0  187333.8   0.2X
+UTF8_BINARY_LCASE3268   3279   
   16  0.0   32676.7   1.0X
+UNICODE  2086   2087   
2  0.0   20857.9   1.6X
+UTF8_BINARY  2085   2088   
4  0.0   20854.2   1.6X
+UNICODE_CI  19807  19813   
7  0.0  198074.9   0.2X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - compareFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
---
-UTF8_BINARY_LCASE 4260   4290  
41  0.0   42602.6   1.0X
-UNICODE  19536  19624  
   124  0.0  195360.2   0.2X
-UTF8_BINARY   3582   3612  
43  0.0   35818.5   1.2X
-UNICODE_CI   20381  20454  
   103  0.0  203814.1   0.2X
+UTF8_BINARY_LCASE 3839   3843  
 6  0.0   38389.8   1.0X
+UNICODE  19096  19136  
57  0.0  190955.4   0.2X
+UTF8_BINARY   3196   3197  
 2  0.0   31955.7   1.2X
+UNICODE_CI   19038  19043  
 7  0.0  190383.4   0.2X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - hashFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE  7347   7349 
  3  0.0   73467.1   1.0X
-UNICODE   73462  73608 
206  0.0  734623.2   0.1X
-UTF8_BINARY5775   5815 
 57  0.0   57746.0   1.3X
-UNICODE_CI57543  57619 
108  0.0  575425.2   0.1X
+UTF8_BINARY_LCASE  6914   6921 
 10  0.0   69135.3   1.0X
+UNICODE   67702  67724 
 31  0.0  677019.3   0.1X
+UTF8_BINARY5330   5341 
 15  0.0   53296.6   1.3X
+UNICODE_CI65340  65342 
  3  0.0  653395.9   0.1X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - contains: Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE 15415  15424 
 13  0.0  154147.1   1.0X
-UNICODE  

Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-30 Thread via GitHub


GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1620662883


##
sql/core/benchmarks/CollationBenchmark-results.txt:
##
@@ -1,54 +1,79 @@
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-UTF8_BINARY_LCASE3571   3576   
7  0.0   35708.8   1.0X
-UNICODE  2235   2240   
7  0.0   22349.2   1.6X
-UTF8_BINARY  2237   2242   
6  0.0   22371.7   1.6X
-UNICODE_CI  18733  18817   
  118  0.0  187333.8   0.2X
+UTF8_BINARY_LCASE3268   3279   
   16  0.0   32676.7   1.0X
+UNICODE  2086   2087   
2  0.0   20857.9   1.6X
+UTF8_BINARY  2085   2088   
4  0.0   20854.2   1.6X
+UNICODE_CI  19807  19813   
7  0.0  198074.9   0.2X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - compareFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
---
-UTF8_BINARY_LCASE 4260   4290  
41  0.0   42602.6   1.0X
-UNICODE  19536  19624  
   124  0.0  195360.2   0.2X
-UTF8_BINARY   3582   3612  
43  0.0   35818.5   1.2X
-UNICODE_CI   20381  20454  
   103  0.0  203814.1   0.2X
+UTF8_BINARY_LCASE 3839   3843  
 6  0.0   38389.8   1.0X
+UNICODE  19096  19136  
57  0.0  190955.4   0.2X
+UTF8_BINARY   3196   3197  
 2  0.0   31955.7   1.2X
+UNICODE_CI   19038  19043  
 7  0.0  190383.4   0.2X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - hashFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE  7347   7349 
  3  0.0   73467.1   1.0X
-UNICODE   73462  73608 
206  0.0  734623.2   0.1X
-UTF8_BINARY5775   5815 
 57  0.0   57746.0   1.3X
-UNICODE_CI57543  57619 
108  0.0  575425.2   0.1X
+UTF8_BINARY_LCASE  6914   6921 
 10  0.0   69135.3   1.0X
+UNICODE   67702  67724 
 31  0.0  677019.3   0.1X
+UTF8_BINARY5330   5341 
 15  0.0   53296.6   1.3X
+UNICODE_CI65340  65342 
  3  0.0  653395.9   0.1X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - contains: Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE 15415  15424 
 13  0.0  154147.1   1.0X
-UNICODE  

Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-30 Thread via GitHub


GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1620663435


##
sql/core/benchmarks/CollationBenchmark-results.txt:
##
@@ -1,54 +1,79 @@
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
--
-UTF8_BINARY_LCASE3571   3576   
7  0.0   35708.8   1.0X
-UNICODE  2235   2240   
7  0.0   22349.2   1.6X
-UTF8_BINARY  2237   2242   
6  0.0   22371.7   1.6X
-UNICODE_CI  18733  18817   
  118  0.0  187333.8   0.2X
+UTF8_BINARY_LCASE3268   3279   
   16  0.0   32676.7   1.0X
+UNICODE  2086   2087   
2  0.0   20857.9   1.6X
+UTF8_BINARY  2085   2088   
4  0.0   20854.2   1.6X
+UNICODE_CI  19807  19813   
7  0.0  198074.9   0.2X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - compareFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 
---
-UTF8_BINARY_LCASE 4260   4290  
41  0.0   42602.6   1.0X
-UNICODE  19536  19624  
   124  0.0  195360.2   0.2X
-UTF8_BINARY   3582   3612  
43  0.0   35818.5   1.2X
-UNICODE_CI   20381  20454  
   103  0.0  203814.1   0.2X
+UTF8_BINARY_LCASE 3839   3843  
 6  0.0   38389.8   1.0X
+UNICODE  19096  19136  
57  0.0  190955.4   0.2X
+UTF8_BINARY   3196   3197  
 2  0.0   31955.7   1.2X
+UNICODE_CI   19038  19043  
 7  0.0  190383.4   0.2X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - hashFunction:  Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE  7347   7349 
  3  0.0   73467.1   1.0X
-UNICODE   73462  73608 
206  0.0  734623.2   0.1X
-UTF8_BINARY5775   5815 
 57  0.0   57746.0   1.3X
-UNICODE_CI57543  57619 
108  0.0  575425.2   0.1X
+UTF8_BINARY_LCASE  6914   6921 
 10  0.0   69135.3   1.0X
+UNICODE   67702  67724 
 31  0.0  677019.3   0.1X
+UTF8_BINARY5330   5341 
 15  0.0   53296.6   1.3X
+UNICODE_CI65340  65342 
  3  0.0  653395.9   0.1X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1021-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - contains: Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-UTF8_BINARY_LCASE 15415  15424 
 13  0.0  154147.1   1.0X
-UNICODE  

Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-28 Thread via GitHub


uros-db commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2134820952

   We can leave `PandasMode` for a separate PR, but we'll definitely need to 
take care of it at one point
   
   now that you've explored various options and finished the `groupMapReduce` 
approach, I think should can call in other SQL team reviewers to take a look at 
this and provide their feedback: @dbatomic @nikolamand-db @stefankandic 
@stevomitric
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-26 Thread via GitHub


GideonPotok commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2132241154

   @uros-db ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-24 Thread via GitHub


GideonPotok commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2129923419

   @uros-db I forgot but should I add collation support to 
`org.apache.spark.sql.catalyst.expressions.aggregate.PandasMode`?
   
   The only difference will be 
   1. Support for null keys (thus StringType won't necessarily mean all values 
in buffer are UTF8String, some might just be null, right?)
   2. PandasMode returns a list of all values that are tied for mode. In that 
case, should all the values be present? Eg if you have the pandas_mode of ['a', 
'a', 'a', 'b',  'b', 'B'], with utf_binary_lcase collation, what do you think 
pandas_mode should return? If we want to support PandasMode, I can do a little 
research on what other databases seem to favor for this type of question. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-24 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1613363048


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -74,16 +90,29 @@ case class Mode(
 if (buffer.isEmpty) {
   return null
 }
-
+val collationAwareBuffer = child.dataType match {
+  case c: StringType if
+!CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality 
=>
+val collationId = c.collationId
+val modeMap = buffer.toSeq.groupMapReduce {
+  case (key: String, _) =>
+CollationFactory.getCollationKey(UTF8String.fromString(key), 
collationId)
+  case (key: UTF8String, _) =>
+CollationFactory.getCollationKey(key, collationId)
+  case (key, _) => key

Review Comment:
   also, what case exactly does this catch? if `case c: StringType` is already 
matched, then we should be able to do 
`CollationFactory.getCollationKey(key.asInstanceOf[UTF8String], collationId)` 
right away?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-24 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1613361676


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -74,16 +90,29 @@ case class Mode(
 if (buffer.isEmpty) {
   return null
 }
-
+val collationAwareBuffer = child.dataType match {
+  case c: StringType if
+!CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality 
=>
+val collationId = c.collationId
+val modeMap = buffer.toSeq.groupMapReduce {
+  case (key: String, _) =>
+CollationFactory.getCollationKey(UTF8String.fromString(key), 
collationId)
+  case (key: UTF8String, _) =>
+CollationFactory.getCollationKey(key, collationId)

Review Comment:
   could we combine these with something like: 
`CollationFactory.getCollationKey(key.asInstanceOf[UTF8String], collationId)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-24 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1613358004


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -74,16 +90,29 @@ case class Mode(
 if (buffer.isEmpty) {
   return null
 }
-
+val collationAwareBuffer = child.dataType match {
+  case c: StringType if
+!CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality 
=>
+val collationId = c.collationId
+val modeMap = buffer.toSeq.groupMapReduce {
+  case (key: String, _) =>
+CollationFactory.getCollationKey(UTF8String.fromString(key), 
collationId)
+  case (key: UTF8String, _) =>
+CollationFactory.getCollationKey(key, collationId)

Review Comment:
   ```suggestion
 CollationFactory.getCollationKey(key.asInstanceOf[UTF8String], 
collationId)
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-24 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1613358004


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -74,16 +90,29 @@ case class Mode(
 if (buffer.isEmpty) {
   return null
 }
-
+val collationAwareBuffer = child.dataType match {
+  case c: StringType if
+!CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality 
=>
+val collationId = c.collationId
+val modeMap = buffer.toSeq.groupMapReduce {
+  case (key: String, _) =>
+CollationFactory.getCollationKey(UTF8String.fromString(key), 
collationId)
+  case (key: UTF8String, _) =>
+CollationFactory.getCollationKey(key, collationId)

Review Comment:
   ```suggestion
 CollationFactory.getCollationKey(key.asInstanceOf[UTF8String], 
collationId)
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-24 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1613351378


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -48,6 +49,21 @@ case class Mode(
 
   override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+val defaultCheck = super.checkInputDataTypes()
+if (defaultCheck.isFailure) {
+  defaultCheck
+} else if (UnsafeRowUtils.isBinaryStable(child.dataType) ||
+  child.dataType.isInstanceOf[StringType]) {
+  defaultCheck

Review Comment:
   ```suggestion
 TypeCheckResult.TypeCheckSuccess
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-22 Thread via GitHub


GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1610561627


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -48,6 +49,22 @@ case class Mode(
 
   override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+val defaultCheck = super.checkInputDataTypes()
+if (defaultCheck.isFailure) {
+  defaultCheck
+} else {
+  child.dataType match {
+case _: StructType | _: ArrayType if 
!UnsafeRowUtils.isBinaryStable(child.dataType) =>
+  TypeCheckResult.TypeCheckFailure(
+s"Input to function mode was a complex type" +
+  s" with non-binary collated fields," +
+  s" which is not yet supported by mode.")
+case _ => TypeCheckResult.TypeCheckSuccess
+  }

Review Comment:
   @uros-db done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-21 Thread via GitHub


GideonPotok commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2123419758

   @uros-db I have made changes for all but your latest suggestion (re 
whitelists -- will add that soon)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-21 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1608866089


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -48,6 +49,22 @@ case class Mode(
 
   override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+val defaultCheck = super.checkInputDataTypes()
+if (defaultCheck.isFailure) {
+  defaultCheck
+} else {
+  child.dataType match {
+case _: StructType | _: ArrayType if 
!UnsafeRowUtils.isBinaryStable(child.dataType) =>
+  TypeCheckResult.TypeCheckFailure(
+s"Input to function mode was a complex type" +
+  s" with non-binary collated fields," +
+  s" which is not yet supported by mode.")
+case _ => TypeCheckResult.TypeCheckSuccess
+  }

Review Comment:
   let's try to white-list instead of black-list (if we can, of course)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-21 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1608865653


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -48,6 +49,22 @@ case class Mode(
 
   override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+val defaultCheck = super.checkInputDataTypes()
+if (defaultCheck.isFailure) {
+  defaultCheck
+} else {
+  child.dataType match {
+case _: StructType | _: ArrayType if 
!UnsafeRowUtils.isBinaryStable(child.dataType) =>
+  TypeCheckResult.TypeCheckFailure(
+s"Input to function mode was a complex type" +
+  s" with non-binary collated fields," +
+  s" which is not yet supported by mode.")
+case _ => TypeCheckResult.TypeCheckSuccess
+  }

Review Comment:
   ```suggestion
 if (UnsafeRowUtils.isBinaryStable(child.dataType) || 
child.dataType.isInstanceOf[StringType]) {
   TypeCheckResult.TypeCheckSuccess
 } else {
   ...
 }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-21 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1608247812


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -74,18 +102,56 @@ case class Mode(
 if (buffer.isEmpty) {
   return null
 }
-
+val collationAwareBuffer = child.dataType match {
+  case c: StringType if
+!CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality 
=>
+val collationId = c.collationId
+val modeMap = buffer.toSeq.groupMapReduce {
+  case (key: String, _) =>
+CollationFactory.getCollationKey(UTF8String.fromString(key), 
collationId)
+  case (key: UTF8String, _) =>
+CollationFactory.getCollationKey(key, collationId)
+  case (key, _) => key
+}(x => x)((x, y) => (x._1, x._2 + y._2)).values
+modeMap
+//  case s: StructType => getBufferForStructType(buffer, s)
+  case _ => buffer
+}
 reverseOpt.map { reverse =>
   val defaultKeyOrdering = if (reverse) {
 
PhysicalDataType.ordering(child.dataType).asInstanceOf[Ordering[AnyRef]].reverse
   } else {
 
PhysicalDataType.ordering(child.dataType).asInstanceOf[Ordering[AnyRef]]
   }
   val ordering = Ordering.Tuple2(Ordering.Long, defaultKeyOrdering)
-  buffer.maxBy { case (key, count) => (count, key) }(ordering)
-}.getOrElse(buffer.maxBy(_._2))._1
+  collationAwareBuffer.maxBy { case (key, count) => (count, key) 
}(ordering)
+}.getOrElse(collationAwareBuffer.maxBy(_._2))._1
   }
-
+/*
+  private def getBufferForStructType(
+  buffer: OpenHashMap[AnyRef, Long],
+  s: StructType): Iterable[(AnyRef, Long)] = {
+val fIsNonBinaryString = s.fields.map(f => (f, f.dataType)).map {
+  case (f, t: StringType) if !t.supportsBinaryEquality => (f.name, true)
+  case (f, t) => (f.name, false)
+}.toMap
+val fCollationIDs = s.fields.collect {
+  case f if fIsNonBinaryString(f.name) =>
+(f.name, f.dataType.asInstanceOf[StringType].collationId)
+}.toMap
+
+buffer.groupMapReduce {
+  case (key: InternalRow, count) =>
+key.toSeq(s).zip(s.fields).map {
+  case (k: String, field) if fIsNonBinaryString(field.name) =>
+CollationFactory.getCollationKey(UTF8String.fromString(k), 
fCollationIDs(field.name))
+  case (k: UTF8String, field) if fIsNonBinaryString(field.name) =>
+CollationFactory.getCollationKey(k, fCollationIDs(field.name))
+  case (k, _) => k
+  }
+}(x => x)((x, y) => (x._1, x._2 + y._2)).values
+  }
+*/

Review Comment:
   I suppose you will remove this for now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-21 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1608240127


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -48,6 +49,33 @@ case class Mode(
 
   override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+checkDataType(child.dataType)
+  }
+
+  private def checkDataType(dataType: DataType, level1: Boolean = true): 
TypeCheckResult = {
+dataType match {
+  case ArrayType(elementType, _) =>
+checkDataType(elementType, level1 = false)
+  case StructType(fields) =>
+combineTypeCheckResults(fields.map { field =>
+  checkDataType(field.dataType, level1 = false)
+})
+  case dt: StringType if !level1 &&
+!CollationFactory.fetchCollation(dt.collationId).supportsBinaryEquality
+  => TypeCheckResult.TypeCheckFailure(
+s"Input to function $prettyName was a complex type" +
+  s" with strings collated on non-binary collations," +
+  s" which is not yet supported.")
+  case _ => TypeCheckResult.TypeCheckSuccess
+}
+  }

Review Comment:
   instead of this recursive check, consider using already existing recursive 
constructs (such as `existsRecursively` in `DataType`)
   
   (see comments above)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-21 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1608242763


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -48,6 +49,33 @@ case class Mode(
 
   override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+checkDataType(child.dataType)

Review Comment:
   related to the comment below, I think you may just go ahead and use 
something we've already implemented - like 
`UnsafeRowUtils.isBinaryStable(child.dataType)`
   
   also, you'll likely need to allow `child.dataType.isInstanceOf[StringType]` 
- since we do want to allow all binary stable types, as well as all string 
types (some of them might be binary "unstable" - such as UTF8_BINARY_LCASE or 
UNICODE_CI)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-21 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1608240127


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -48,6 +49,33 @@ case class Mode(
 
   override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+checkDataType(child.dataType)
+  }
+
+  private def checkDataType(dataType: DataType, level1: Boolean = true): 
TypeCheckResult = {
+dataType match {
+  case ArrayType(elementType, _) =>
+checkDataType(elementType, level1 = false)
+  case StructType(fields) =>
+combineTypeCheckResults(fields.map { field =>
+  checkDataType(field.dataType, level1 = false)
+})
+  case dt: StringType if !level1 &&
+!CollationFactory.fetchCollation(dt.collationId).supportsBinaryEquality
+  => TypeCheckResult.TypeCheckFailure(
+s"Input to function $prettyName was a complex type" +
+  s" with strings collated on non-binary collations," +
+  s" which is not yet supported.")
+  case _ => TypeCheckResult.TypeCheckSuccess
+}
+  }

Review Comment:
   instead of this recursive check, consider using already existing recursive 
constructs (such as `existsRecursively` in `DataType`)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-21 Thread via GitHub


uros-db commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1608236375


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -48,6 +49,33 @@ case class Mode(
 
   override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)
 
+  override def checkInputDataTypes(): TypeCheckResult = {

Review Comment:
   here, we also need to do something like:
   
   ```
   val defaultCheck = super.checkInputDataTypes()
   if (defaultCheck.isFailure) {
 defaultCheck
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-20 Thread via GitHub


GideonPotok commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2121332325

   
   @uros-db I agree that we should avoid auxiliary structures. And I don't see 
a good way to move the changes to implementation of  `merge` and `update` 
without keeping an auxiliary map from the collation key to the actual values 
seen (eg from "aa" to "aaaAAa", "AA" for a data frame with the values 
"aaaAAa" and "AA".) That would be an auxiliary structure. There is ton of 
of scaffolding to support having just that OpenHashMap available throughout the 
expression being executed.  

So I advise strongly against us pursuing this 
idea, which is good in theory, at least for now. Having said that, such a 
prototype of an approach might look like this: 
https://github.com/GideonPotok/spark/pull/1 . Thoughts? 
   
   Also, I am done with adding the exception for unsupported complex types! 
Take a look


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-19 Thread via GitHub


GideonPotok closed pull request #46526: [WIP][SPARK-47353][SQL] Enable 
collation support for the Mode expression using GroupMapReduce
URL: https://github.com/apache/spark/pull/46526


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-14 Thread via GitHub


GideonPotok commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1600668920


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -22,7 +22,7 @@ import 
org.apache.spark.sql.catalyst.expressions.{ExpressionEvalHelper, Literal,
 import org.apache.spark.sql.catalyst.util.CollationFactory
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.SharedSparkSession
-import org.apache.spark.sql.types.{ArrayType, BinaryType, BooleanType, 
DataType, IntegerType, StringType}
+import org.apache.spark.sql.types._

Review Comment:
   ```suggestion
   import org.apache.spark.sql.types.{ArrayType, BinaryType, BooleanType, 
DataType, IntegerType, StringType}
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-14 Thread via GitHub


uros-db commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1600471419


##
sql/core/benchmarks/CollationBenchmark-jdk21-results.txt:
##


Review Comment:
   all things considered, I would say proceed with this approach - clean 
everything up and get it running with your tests and all CI checks in order. 
then we can call in other reviewers and see where it goes



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-14 Thread via GitHub


uros-db commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1600468023


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##


Review Comment:
   "mode" is not a StringExpression
   
   let's move the tests to CollationSQLExpressionsSuite.scala



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-14 Thread via GitHub


uros-db commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1600463245


##
sql/core/benchmarks/CollationBenchmark-jdk21-results.txt:
##


Review Comment:
   apropos altering the benchmark to yield better results for this particular 
expression, I'm not sure if that's something we should encourage - the 
benchmark is not perfect and should only be used for rough estimates. it's fine 
to consider the worst case scenario (all different elements), and I think we 
should look for the best approach anyways



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-14 Thread via GitHub


uros-db commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1600457296


##
sql/core/benchmarks/CollationBenchmark-jdk21-results.txt:
##


Review Comment:
   apropos going through lower first, we need to be careful so as not to 
destroy the original data
   
   consider an example of finding the mode in ['a', 'B', 'B', 'A']. Here, 
correct answers would be:
   
   - 'a'
   - 'A'
   - 'B'
   
   but NOT:
   'b'
   
   because that value is not found in the original data



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-14 Thread via GitHub


GideonPotok commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1600041876


##
sql/core/benchmarks/CollationBenchmark-jdk21-results.txt:
##


Review Comment:
   0.  Note, by the way that because we are relying on supportsBinaryEquality, 
this is about preserving not only the performance for UTF8_BINARY, but also 
that of UNICODE
   1. @uros-db check again. I believe the benchmarks are slightly more 
realistic now. In that for each string there are 3-6 that are equal by 
collation. EG:
   
   ```
collation unit benchmarks - mode - 30105 elements:  Best Time(ms)   Avg 
Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative

-
UTF8_BINARY_LCASE - mode - 30105 elements  6
  6   0  5.1 195.6   1.0X
UNICODE - mode - 30105 elements3
  3   0 11.6  86.0   2.3X
UTF8_BINARY - mode - 30105 elements3
  3   0 11.6  85.9   2.3X
UNICODE_CI - mode - 30105 elements12
 12   1  2.6 382.9   0.5X
```

2. Still a slowdown (though there is more work, so how would we expect any 
different). I willl run these new benchmarks on the other approaches and 
assuming this one is best, we can get this one ready for final stages of 
cleanup and review...

3.  I will leave one benchmark for mode rather then having three for 
different input sizes...  that was just a temporary setup.

5. An idea would be for the case of UTF8_BINARY and UNICODE to go through 
the `lower` operation first. This would be a better way to check that, as the 
design doc instructs: "Performance regression for case insensitive collation 
must be no worse than using upper() or ilike() explicitly" . Let me know 
whether to change the benchmark accordingly. There will probably still be a 
performance degradation, but it would at least be a fairer comparison.


   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-14 Thread via GitHub


Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-13 Thread via GitHub


uros-db commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1598850667


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -74,6 +78,18 @@ case class Mode(
 if (buffer.isEmpty) {
   return null
 }
+val buffer2 = if 
(!CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {

Review Comment:
   ```suggestion
   val collationAwareBuffer = if 
(!CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-13 Thread via GitHub


GideonPotok commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1598747396


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -70,10 +74,22 @@ case class Mode(
 buffer
   }
 
-  override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
-if (buffer.isEmpty) {
+  override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+if (buff.isEmpty) {
   return null
 }
+val buffer = if (isCollatedString(child)) {
+  val modeMap = buff.toSeq.groupMapReduce {
+case (key: String, _) =>
+  CollationFactory.getCollationKey(UTF8String.fromString(key), 
collationId)
+case (key: UTF8String, _) =>
+  CollationFactory.getCollationKey(key, collationId)
+case (key, _) => key
+  }(x => x)((x, y) => (x._1, x._2 + y._2)).values
+  modeMap
+} else {
+  buff
+}

Review Comment:
   @uros-db Addressing the initial question about why Mode isn't functioning 
with collated strings in Spark as expected, even though Aggregation typically 
works: It seems that while the ordering for PhysicalStringType is correctly 
established using `private[sql] val ordering = 
CollationFactory.fetchCollation(collationId).comparator.compare(_, _)`, this 
does not automatically resolve the issue with Mode. To illustrate, consider the 
example of UTF8_BINARY_LCASE where an input like `Map("a" -> 3L, "b" -> 2L, "B" 
-> 2L)` results in evaluating the maximum over the tuples `(2L, "B"), (2L, 
"b"), (3L, "a")` rather than the expected `(3L, "a"), (4L, "b")`. This 
indicates that the current approach doesn't aggregate values as required for 
Mode to operate correctly. Unit tests confirm that Mode otherwise won't work 
for such cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-13 Thread via GitHub


GideonPotok commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1598747396


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -70,10 +74,22 @@ case class Mode(
 buffer
   }
 
-  override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
-if (buffer.isEmpty) {
+  override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+if (buff.isEmpty) {
   return null
 }
+val buffer = if (isCollatedString(child)) {
+  val modeMap = buff.toSeq.groupMapReduce {
+case (key: String, _) =>
+  CollationFactory.getCollationKey(UTF8String.fromString(key), 
collationId)
+case (key: UTF8String, _) =>
+  CollationFactory.getCollationKey(key, collationId)
+case (key, _) => key
+  }(x => x)((x, y) => (x._1, x._2 + y._2)).values
+  modeMap
+} else {
+  buff
+}

Review Comment:
   @uros-db While the ordering for PhysicalStringType is correctly established 
using `private[sql] val ordering = 
CollationFactory.fetchCollation(collationId).comparator.compare(_, _)`, this 
does not automatically resolve the issue with Mode. To illustrate, consider the 
example of UTF8_BINARY_LCASE where an input like `Map("a" -> 3L, "b" -> 2L, "B" 
-> 2L)` results in evaluating the maximum over the tuples `(2L, "B"), (2L, 
"b"), (3L, "a")` rather than the expected `(3L, "a"), (4L, "b")`. This 
indicates that the current approach doesn't aggregate values as required for 
Mode to operate correctly. Unit tests confirm that Mode otherwise won't work 
for such cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-13 Thread via GitHub


uros-db commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1598030581


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -70,10 +74,22 @@ case class Mode(
 buffer
   }
 
-  override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
-if (buffer.isEmpty) {
+  override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+if (buff.isEmpty) {
   return null
 }
+val buffer = if (isCollatedString(child)) {
+  val modeMap = buff.toSeq.groupMapReduce {
+case (key: String, _) =>
+  CollationFactory.getCollationKey(UTF8String.fromString(key), 
collationId)
+case (key: UTF8String, _) =>
+  CollationFactory.getCollationKey(key, collationId)
+case (key, _) => key
+  }(x => x)((x, y) => (x._1, x._2 + y._2)).values
+  modeMap
+} else {
+  buff
+}

Review Comment:
   Going back to the original issue (why Mode doesn't work already, while 
otherwise Aggregation generally works with collated strings in Spark), here's 
what I'm interested in: why does 
`PhysicalDataType.ordering(child.dataType).asInstanceOf[Ordering[AnyRef]]` not 
work here automatically?
   
   afaik, ordering for `PhysicalStringType` is defined correctly:
   ```
   private[sql] val ordering = 
CollationFactory.fetchCollation(collationId).comparator.compare(_, _)
   ```
   
   so one would naturally expect Mode to work "as is"
   
   did you investigate this maybe?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-13 Thread via GitHub


uros-db commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1598017669


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -70,10 +74,22 @@ case class Mode(
 buffer
   }
 
-  override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
-if (buffer.isEmpty) {
+  override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+if (buff.isEmpty) {

Review Comment:
   we shouldn't make unnecessary changes, let's not rename this if we don't 
have to
   
   just rename your new variable below to something else



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-13 Thread via GitHub


uros-db commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1598015529


##
sql/core/benchmarks/CollationBenchmark-jdk21-results.txt:
##


Review Comment:
   I would say that these results are somewhat better, but still not too good
   
   However - what's imperative right now is that we preserve the performance 
for UTF8_BINARY (by doing that if/else branch on supportsBinaryEquality). If we 
don't have a better approach at this moment, and we've already tried a couple 
of things - then I would say that's fine and we can proceed with the best of 
what we've got (which is this PR)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-13 Thread via GitHub


uros-db commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1598012310


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -86,6 +102,14 @@ case class Mode(
 }.getOrElse(buffer.maxBy(_._2))._1
   }
 
+  private def isCollatedString(child: Expression): Boolean = {
+child.dataType match {
+  case s: StringType if s.collationId != 
CollationFactory.UTF8_BINARY_COLLATION_ID => true
+  // maybe use supportsBinaryEquality or something else
+  case _ => false
+}
+  }
+

Review Comment:
   we generally shouldn't use collationIds to check equality, you correctly 
noticed that we should go with `supportsBinaryEquality` instead - you can 
remove this method, and perform the check above as suggested



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-13 Thread via GitHub


uros-db commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1598010702


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -70,10 +74,22 @@ case class Mode(
 buffer
   }
 
-  override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
-if (buffer.isEmpty) {
+  override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+if (buff.isEmpty) {
   return null
 }
+val buffer = if (isCollatedString(child)) {

Review Comment:
   ```suggestion
   val buffer = if 
(!CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

2024-05-10 Thread via GitHub


GideonPotok opened a new pull request, #46526:
URL: https://github.com/apache/spark/pull/46526

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org