[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19878

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154845901

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -730,23 +776,29 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
       input: String,
       result: String,
       fields: Array[StructField]): String = {
-    val localResult = ctx.freshName("localResult")
     val childResult = ctx.freshName("childResult")
-    fields.zipWithIndex.map { case (field, index) =>
+    val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
+      val computeFieldHash = nullSafeElementHash(
+        input, index.toString, field.nullable, field.dataType, childResult, ctx)
       s"""
-         $childResult = 0;
-         ${nullSafeElementHash(input, index.toString, field.nullable, field.dataType,
-           childResult, ctx)}
-         $localResult = (31 * $localResult) + $childResult;
-         """
-    }.mkString(
-      s"""
-         int $localResult = 0;
-         int $childResult = 0;
-       """,
-      "",
-      s"$result = (31 * $result) + $localResult;"
-    )
+         |$childResult = 0;
+         |$computeFieldHash
+         |$result = (31 * $result) + $childResult;
+       """.stripMargin
+    }
+
+    s"${ctx.JAVA_INT} $childResult = 0;\n" + ctx.splitExpressions(
--- End diff --

Yea, the input here is a row that may be produced by `row.getStruct` instead of `ctx.INPUT_ROW`, so we don't need this check, as the input won't be `ctx.currentVars`.
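The point above can be illustrated with a small hand-written Java sketch (hypothetical names, rows simulated as `Object[]`; this is not Spark's actual generated source): the input row for a nested struct's hash is itself fetched from the parent row via a `getStruct`-style accessor, so it is an ordinary local row reference that can always be passed into a split method, regardless of whole-stage-codegen local variables.

```java
// Hypothetical sketch: why hashing a struct's fields never needs the
// currentVars check. Rows are simulated as Object[] for illustration.
public class NestedStructHash {

    // Stand-in for row.getStruct(ordinal, numFields).
    static Object[] getStruct(Object[] row, int ordinal) {
        return (Object[]) row[ordinal];
    }

    // "Split method" hashing the fields of a struct; it receives the struct
    // row itself as an argument, so no global state is required.
    static int hashStructFields(Object[] structRow, int value) {
        for (Object field : structRow) {
            int childHash = (field == null) ? 0 : field.hashCode();
            value = (31 * value) + childHash;
        }
        return value;
    }

    static int hashRow(Object[] row, int seed) {
        int value = seed;
        // The struct at ordinal 0 becomes a plain local variable,
        // independent of how the outer row's columns are materialized.
        Object[] struct = getStruct(row, 0);
        return hashStructFields(struct, value);
    }
}
```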
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154819410

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -730,23 +776,29 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
       input: String,
       result: String,
       fields: Array[StructField]): String = {
-    val localResult = ctx.freshName("localResult")
     val childResult = ctx.freshName("childResult")
-    fields.zipWithIndex.map { case (field, index) =>
+    val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
+      val computeFieldHash = nullSafeElementHash(
+        input, index.toString, field.nullable, field.dataType, childResult, ctx)
       s"""
-         $childResult = 0;
-         ${nullSafeElementHash(input, index.toString, field.nullable, field.dataType,
-           childResult, ctx)}
-         $localResult = (31 * $localResult) + $childResult;
-         """
-    }.mkString(
-      s"""
-         int $localResult = 0;
-         int $childResult = 0;
-       """,
-      "",
-      s"$result = (31 * $result) + $localResult;"
-    )
+         |$childResult = 0;
+         |$computeFieldHash
+         |$result = (31 * $result) + $childResult;
+       """.stripMargin
+    }
+
+    s"${ctx.JAVA_INT} $childResult = 0;\n" + ctx.splitExpressions(
--- End diff --

No need to check `ctx.INPUT_ROW == null || ctx.currentVars != null` here?
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154819030

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -610,25 +637,44 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     ev.isNull = "false"
+
     val childHash = ctx.freshName("childHash")
-    val childrenHash = ctx.splitExpressions(children.map { child =>
+    val childrenHash = children.map { child =>
       val childGen = child.genCode(ctx)
       val codeToComputeHash = ctx.nullSafeExec(child.nullable, childGen.isNull) {
         computeHash(childGen.value, child.dataType, childHash, ctx)
       }
       s"""
          |${childGen.code}
+         |$childHash = 0;
          |$codeToComputeHash
          |${ev.value} = (31 * ${ev.value}) + $childHash;
-         |$childHash = 0;
        """.stripMargin
-    })
+    }
-    ctx.addMutableState(ctx.javaType(dataType), ev.value)
-    ctx.addMutableState(ctx.JAVA_INT, childHash, s"$childHash = 0;")
-    ev.copy(code = s"""
-      ${ev.value} = $seed;
-      $childrenHash""")
+    val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
+      childrenHash.mkString("\n")
+    } else {
+      ctx.splitExpressions(
+        expressions = childrenHash,
+        funcName = "computeHash",
+        arguments = Seq("InternalRow" -> ctx.INPUT_ROW, ctx.JAVA_INT -> ev.value),
+        returnType = ctx.JAVA_INT,
+        makeSplitFunction = body =>
+          s"""
+             |${ctx.JAVA_INT} $childHash = 0;
+             |$body
+             |return ${ev.value};
+           """.stripMargin,
+        foldFunctions = _.map(funcCall => s"${ev.value} = $funcCall;").mkString("\n"))
+    }
+
+    ev.copy(code =
+      s"""
+         |${ctx.JAVA_INT} ${ev.value} = $seed;
+         |${ctx.JAVA_INT} $childHash = 0;
--- End diff --

nvm, `splitExpressions` may not split the expressions if there is only one block.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154818350

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -610,25 +637,44 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     ev.isNull = "false"
+
     val childHash = ctx.freshName("childHash")
-    val childrenHash = ctx.splitExpressions(children.map { child =>
+    val childrenHash = children.map { child =>
       val childGen = child.genCode(ctx)
       val codeToComputeHash = ctx.nullSafeExec(child.nullable, childGen.isNull) {
         computeHash(childGen.value, child.dataType, childHash, ctx)
       }
       s"""
          |${childGen.code}
+         |$childHash = 0;
          |$codeToComputeHash
          |${ev.value} = (31 * ${ev.value}) + $childHash;
-         |$childHash = 0;
        """.stripMargin
-    })
+    }
-    ctx.addMutableState(ctx.javaType(dataType), ev.value)
-    ctx.addMutableState(ctx.JAVA_INT, childHash, s"$childHash = 0;")
-    ev.copy(code = s"""
-      ${ev.value} = $seed;
-      $childrenHash""")
+    val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
+      childrenHash.mkString("\n")
+    } else {
+      ctx.splitExpressions(
+        expressions = childrenHash,
+        funcName = "computeHash",
+        arguments = Seq("InternalRow" -> ctx.INPUT_ROW, ctx.JAVA_INT -> ev.value),
+        returnType = ctx.JAVA_INT,
+        makeSplitFunction = body =>
+          s"""
+             |${ctx.JAVA_INT} $childHash = 0;
+             |$body
+             |return ${ev.value};
+           """.stripMargin,
+        foldFunctions = _.map(funcCall => s"${ev.value} = $funcCall;").mkString("\n"))
+    }
+
+    ev.copy(code =
+      s"""
+         |${ctx.JAVA_INT} ${ev.value} = $seed;
+         |${ctx.JAVA_INT} $childHash = 0;
--- End diff --

nit: `childHash` only needs to be declared here when we don't split into functions.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154805001

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -270,17 +270,36 @@ abstract class HashExpression[E] extends Expression {
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     ev.isNull = "false"
-    val childrenHash = ctx.splitExpressions(children.map { child =>
+
+    val childrenHash = children.map { child =>
       val childGen = child.genCode(ctx)
       childGen.code + ctx.nullSafeExec(child.nullable, childGen.isNull) {
         computeHash(childGen.value, child.dataType, ev.value, ctx)
       }
-    })
+    }
+
+    val hashResultType = ctx.javaType(dataType)
+    val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
--- End diff --

That one has been merged, but this one is still different.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154691937

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -389,13 +408,21 @@ abstract class HashExpression[E] extends Expression {
       input: String,
       result: String,
       fields: Array[StructField]): String = {
-    val hashes = fields.zipWithIndex.map { case (field, index) =>
+    val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
       nullSafeElementHash(input, index.toString, field.nullable, field.dataType, result, ctx)
     }
+    val hashResultType = ctx.javaType(dataType)
--- End diff --

`ctx` is only available inside `doGenCode`.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154687417

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -389,13 +408,21 @@ abstract class HashExpression[E] extends Expression {
       input: String,
       result: String,
       fields: Array[StructField]): String = {
-    val hashes = fields.zipWithIndex.map { case (field, index) =>
+    val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
       nullSafeElementHash(input, index.toString, field.nullable, field.dataType, result, ctx)
     }
+    val hashResultType = ctx.javaType(dataType)
--- End diff --

nit: this is also done at line 281. Can we do it only once, maybe with a `lazy val`?
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154683889

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -730,23 +776,29 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
       input: String,
       result: String,
       fields: Array[StructField]): String = {
-    val localResult = ctx.freshName("localResult")
     val childResult = ctx.freshName("childResult")
-    fields.zipWithIndex.map { case (field, index) =>
+    val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
+      val computeFieldHash = nullSafeElementHash(
+        input, index.toString, field.nullable, field.dataType, childResult, ctx)
       s"""
-         $childResult = 0;
-         ${nullSafeElementHash(input, index.toString, field.nullable, field.dataType,
-           childResult, ctx)}
-         $localResult = (31 * $localResult) + $childResult;
-         """
-    }.mkString(
--- End diff --

We forgot to split the generated code for computing the Hive hash of a struct; it's fixed now.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154683716

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -270,17 +270,36 @@ abstract class HashExpression[E] extends Expression {
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     ev.isNull = "false"
-    val childrenHash = ctx.splitExpressions(children.map { child =>
+
+    val childrenHash = children.map { child =>
       val childGen = child.genCode(ctx)
       childGen.code + ctx.nullSafeExec(child.nullable, childGen.isNull) {
         computeHash(childGen.value, child.dataType, ev.value, ctx)
       }
-    })
+    }
+
+    val hashResultType = ctx.javaType(dataType)
+    val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
--- End diff --

I think @kiszk is doing this.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154682872

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -270,17 +270,36 @@ abstract class HashExpression[E] extends Expression {
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     ev.isNull = "false"
-    val childrenHash = ctx.splitExpressions(children.map { child =>
+
+    val childrenHash = children.map { child =>
       val childGen = child.genCode(ctx)
       childGen.code + ctx.nullSafeExec(child.nullable, childGen.isNull) {
         computeHash(childGen.value, child.dataType, ev.value, ctx)
       }
-    })
+    }
+
+    val hashResultType = ctx.javaType(dataType)
+    val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
--- End diff --

This pattern appears many times in the code base; we may need to create a `ctx.splitExpressionsWithCurrentInput` for it later.
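The guard being discussed can be illustrated with a hand-written Java sketch (hypothetical names, rows simulated as `int[]`; not Spark's actual generated source): when child values are read from an input row, the hashing code can move into a separate method that takes the row as an argument; when the values already sit in local variables (as with `ctx.currentVars` in whole-stage codegen), the fold must stay inline, because locals are not visible to other methods.

```java
// Hypothetical sketch of the two paths guarded by
// "ctx.INPUT_ROW == null || ctx.currentVars != null".
public class SplitDecisionSketch {

    // Row-based path: the row is a single reference, so it can be handed
    // to a split method together with the running hash value.
    static int computeHash(int[] row, int value) {
        for (int v : row) {
            value = (31 * value) + v;
        }
        return value;
    }

    static int hashViaSplitMethod(int[] row, int seed) {
        int value = seed;
        value = computeHash(row, value);  // foldFunctions: value = funcCall;
        return value;
    }

    // Local-variable path: child values are plain locals of the current
    // method, so the fold has to be emitted inline, unsplit.
    static int hashInline(int a, int b, int seed) {
        int value = seed;
        value = (31 * value) + a;
        value = (31 * value) + b;
        return value;
    }
}
```

Both paths compute the same hash; they differ only in whether the code can be factored out of the enclosing method.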
GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/19878
[SPARK-22682][SQL] HashExpression does not need to create global variables

## What changes were proposed in this pull request?

It turns out that `HashExpression` can pass some values around via parameters when splitting its generated code into methods, saving some global variable slots. This also prevents a weird case where a global variable appears in a parameter list, which was discovered by https://github.com/apache/spark/pull/19865

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark minor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19878.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19878

commit 0e9998e0704b54d8f1352a1936c9b6367ebee15e
Author: Wenchen Fan
Date: 2017-12-04T15:24:46Z

    HashExpression does not need to create global variables
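The parameter-passing idea in this PR can be sketched with hand-written Java (hypothetical names and a fixed batching; the real generated code is produced by `ctx.splitExpressions`): each split method receives the running hash as a parameter and returns the updated value, so neither the result nor the per-child scratch variable needs a global (instance-level) field.

```java
// Hypothetical sketch of the generated-code shape after this PR: the running
// hash is threaded through split methods as a parameter instead of living in
// a mutable global field.
public class HashSplitSketch {

    // First split function, covering the first "batch" of child expressions.
    static int computeHash_0(int[] row, int value) {
        for (int i = 0; i < 2; i++) {
            int childHash = row[i];            // stand-in for a child's hash
            value = (31 * value) + childHash;  // same fold the generated code uses
        }
        return value;
    }

    // Second split function, covering the remaining children.
    static int computeHash_1(int[] row, int value) {
        for (int i = 2; i < 4; i++) {
            int childHash = row[i];
            value = (31 * value) + childHash;
        }
        return value;
    }

    // Caller: corresponds to foldFunctions, which emits "value = funcCall;"
    // for each split function.
    static int hash(int[] row, int seed) {
        int value = seed;
        value = computeHash_0(row, value);
        value = computeHash_1(row, value);
        return value;
    }
}
```

The result is identical to folding all children inline in one method; only the code layout changes, which is why existing tests suffice.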