[GitHub] spark pull request: [SPARK-9400][SQL] codeGen stringLocate
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7779#issuecomment-126193054

Jenkins, ok to test.
[GitHub] spark pull request: [SPARK-9401][SQL] codeGen concatWs
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7782

[SPARK-9401][SQL] codeGen concatWs

Jira: https://issues.apache.org/jira/browse/SPARK-9401

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9401

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7782.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7782

commit 46a6c20da83939d70020c880a964d6bd10fcd00c
Author: Tarek Auel
Date: 2015-07-30T05:44:14Z

    [SPARK-9401][SQL] codeGen concatWs
[GitHub] spark pull request: [SPARK-9400][SQL] codeGen stringLocate
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7779

[SPARK-9400][SQL] codeGen stringLocate

Jira: https://issues.apache.org/jira/browse/SPARK-9400

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9400

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7779.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7779

commit 4c27625eaf3d6b02940390ac31f3000ac8247552
Author: Tarek Auel
Date: 2015-07-30T04:59:37Z

    [SPARK-9400][SQL] codeGen stringLocate
[GitHub] spark pull request: [SPARK-9403][SQL] codeGen in / inSet
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7778#discussion_r35837995

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ---

@@ -107,21 +109,71 @@ case class In(value: Expression, list: Seq[Expression]) extends Predicate with C
     val evaluatedValue = value.eval(input)
     list.exists(e => e.eval(input) == evaluatedValue)
   }
+
+  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    if (list.isEmpty) {
+      s"""
+        ${ev.primitive} = false;
+        ${ev.isNull} = false;
+      """
+    } else {
+      val valueGen = value.gen(ctx)
+      val listGen = list.map(_.gen(ctx))
+      val listCode = listGen.map(x =>
+        s"""
+          if (!${ev.primitive}) {
+            ${x.code}
+            if (${classOf[Objects].getName}.equals(${valueGen.primitive}, ${x.primitive})) {
+              ${ev.primitive} = true;
+            }
+          }
+        """).foldLeft("")((a, b) => a + "\n" + b)
+      s"""
+        ${valueGen.code}
+        boolean ${ev.primitive} = false;
+        boolean ${ev.isNull} = false;
+        $listCode
+      """
+    }
+  }
+}
+
+/**
+ * Helper companion object in order to support code generation.
+ */
+object InSet {
+
+  @transient var hset: Set[Any] = null

--- End diff --

@rxin Is there a better way to expose `hset` to the codeGen stuff?
[GitHub] spark pull request: [SPARK-9403][SQL] codeGen in / inSet
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7778

[SPARK-9403][SQL] codeGen in / inSet

Jira: https://issues.apache.org/jira/browse/SPARK-9403

@rxin ping

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9403

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7778.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7778

commit e69ebaa95340399f1112edf05745e1711cfdbdeb
Author: Tarek Auel
Date: 2015-07-30T04:44:05Z

    [SPARK-9403][SQL] codeGen in / inSet
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7571#issuecomment-123481020

@rxin Could you trigger Jenkins?
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7571#issuecomment-123446243

Should I add it again?
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7571#discussion_r35140450

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -486,6 +486,10 @@ case class StringFormat(children: Expression*) extends Expression with CodegenFa
   private def format: Expression = children(0)
   private def args: Seq[Expression] = children.tail

+  override def inputTypes: Seq[AbstractDataType] =
+    StringType :: List.fill(children.size - 1)(AnyDataType)

--- End diff --

I updated as well.
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7571

[SPARK-9154][SQL] codegen StringFormat

Jira: https://issues.apache.org/jira/browse/SPARK-9154

Fixes the bug of #7546. @marmbrus I can't reopen the other PR because I didn't close it. Can you trigger Jenkins?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9154

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7571.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7571

commit 086caba76f646f86840a2cee325188895ab42c8f
Author: Tarek Auel
Date: 2015-07-20T16:29:03Z

    [SPARK-9154][SQL] codegen string format

commit cd8322bc4e6c15cd9911363c4596eba1a935fcdd
Author: Tarek Auel
Date: 2015-07-20T21:40:30Z

    [SPARK-9154][SQL] codegen string format

commit 10b4de88c817a474b7b0a83d948cb86927638775
Author: Tarek Auel
Date: 2015-07-20T21:42:28Z

    [SPARK-9154][SQL] codegen removed fallback trait

commit a943d3e60649f4267e40376c0bb1ff30ae024436
Author: Tarek Auel
Date: 2015-07-21T06:26:58Z

    [SPARK-9154] implicit input cast, added tests for null, support for null primitives

commit f512c5f9219d38c2445e4e776aa739ae6310bb60
Author: Tarek Auel
Date: 2015-07-21T18:57:27Z

    [SPARK-9154][SQL] build fix
[GitHub] spark pull request: [Spark-8244][SQL] string function: find in set
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7186#issuecomment-123441184

Jenkins, test this please.
[GitHub] spark pull request: Revert "[SPARK-9154] [SQL] codegen StringForma...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7570#issuecomment-123419444

Can I reopen the last PR to fix the issue, or do I have to create a new one because the old one got merged?
[GitHub] spark pull request: [SPARK-9152][SQL] Implement code generation fo...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7561#discussion_r35079064

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -196,6 +254,40 @@ case class RLike(left: Expression, right: Expression)
   override def escape(v: String): String = v
   override def matches(regex: Pattern, str: String): Boolean = regex.matcher(str).find(0)
   override def toString: String = s"$left RLIKE $right"
+
+  override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    val patternClass = classOf[Pattern].getName
+
+    val literalRight: String = right match {
+      case x @ Literal(value: String, StringType) => escape(value)
+      case _ => null
+    }

--- End diff --

Okay, got it. But this caches the value only if it's a literal. I think if we save the value in `mutableState`, we could even use this "cache" when an expression returns the same result as the previous one.
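For illustration only (this is not code from the PR): with the Spark 1.5-era `CodeGenContext`, such a cache could be registered as mutable state on the generated class. The sketch assumes the `addMutableState(javaType, name, initCode)` signature of that API and a `rightStr` variable holding the evaluated right-hand side in the generated code:

```Scala
import java.util.regex.Pattern
import org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext

// Sketch: recompile the Pattern only when the regex string changes, so
// consecutive rows with the same non-literal right-hand side reuse the cache.
def cachedPatternCode(ctx: CodeGenContext): String = {
  val patternClass = classOf[Pattern].getName
  val pattern = ctx.freshName("pattern")      // cached compiled pattern
  val lastRegex = ctx.freshName("lastRegex")  // regex the cache was built from
  ctx.addMutableState(patternClass, pattern, s"$pattern = null;")
  ctx.addMutableState("String", lastRegex, s"$lastRegex = null;")
  s"""
    if ($lastRegex == null || !$lastRegex.equals(rightStr)) {
      $lastRegex = rightStr;
      $pattern = $patternClass.compile(rightStr);
    }
  """
}
```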
[GitHub] spark pull request: [SPARK-9152][SQL] Implement code generation fo...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7561#discussion_r35078428

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -187,6 +187,64 @@ case class Like(left: Expression, right: Expression)
   override def matches(regex: Pattern, str: String): Boolean = regex.matcher(str).matches()
   override def toString: String = s"$left LIKE $right"
+
+  override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    val patternClass = classOf[Pattern].getName
+
+    val literalRight: String = right match {
+      case x @ Literal(value: String, StringType) => escape(value)
+      case _ => null
+    }
+
+    val leftGen = left.gen(ctx)
+    val rightGen = right.gen(ctx)
+
+    val patternCode =
+      if (literalRight != null) {
+        s"${patternClass} pattern = $patternClass.compile($literalRight);"
+      } else {
+        s"""
+          StringBuilder regex = new StringBuilder("(?s)");
+          for (int idx = 1; idx < rightStr.length(); idx++) {
+            char prev = rightStr.charAt(idx - 1);
+            char curr = rightStr.charAt(idx);
+            if (prev == '\\') {
+              if (curr == '_') {
+                regex.append("_");
+              } else if (curr == '%') {
+                regex.append("%");
+              } else {
+                regex.append(${patternClass}.quote("" + curr));
+              }
+            } else {
+              if (curr != '\\') {
+                if (curr == '_') {
+                  regex.append(".");
+                } else if (curr == '%') {
+                  regex.append(".*");
+                } else {
+                  regex.append(${patternClass}.quote((new Character(curr)).toString()));
+                }
+              }
+            }
+          }
+          ${patternClass} pattern = ${patternClass}.compile(regex.toString());
+        """
+      }

--- End diff --

That is exactly what we want. `escape` is totally independent from the expression itself, isn't it? This simplifies the codegen, removes duplicated code, and has no negative impact.
[GitHub] spark pull request: [SPARK-9152][SQL] Implement code generation fo...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7561#discussion_r35074663

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -196,6 +254,40 @@ case class RLike(left: Expression, right: Expression)
   override def escape(v: String): String = v
   override def matches(regex: Pattern, str: String): Boolean = regex.matcher(str).find(0)
   override def toString: String = s"$left RLIKE $right"
+
+  override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    val patternClass = classOf[Pattern].getName
+
+    val literalRight: String = right match {
+      case x @ Literal(value: String, StringType) => escape(value)
+      case _ => null
+    }
+
+    val leftGen = left.gen(ctx)
+    val rightGen = right.gen(ctx)
+
+    val patternCode =
+      if (literalRight != null) {
+        s"${patternClass} pattern = $patternClass.compile($literalRight);"
+      } else {
+        s"""
+          ${patternClass} pattern = ${patternClass}.compile(rightStr);
+        """
+      }
+
+    s"""
+      ${leftGen.code}
+      ${rightGen.code}

--- End diff --

Please use a logic like this:

    codeA
    nullCheckA
    codeB
    nullCheckB

This makes it possible to skip the evaluation of `right` if `left` is already null.
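As a hand-written sketch of that shape (not the PR's output; the `GeneratedExpressionCode` fields are as in Spark 1.5, and the helper name is made up), the second operand's code sits inside the first operand's null check, so a null `left` short-circuits:

```Scala
import org.apache.spark.sql.catalyst.expressions.codegen.GeneratedExpressionCode

// Sketch: emit right's code only inside left's null check; if left is null,
// right is never evaluated and the result stays null.
def nullSafeBinaryCode(
    leftGen: GeneratedExpressionCode,
    rightGen: GeneratedExpressionCode,
    isNull: String,
    primitive: String,
    compute: String): String =
  s"""
    ${leftGen.code}
    boolean $isNull = true;
    boolean $primitive = false;
    if (!${leftGen.isNull}) {
      ${rightGen.code}
      if (!${rightGen.isNull}) {
        $isNull = false;
        $compute
      }
    }
  """
```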
[GitHub] spark pull request: [SPARK-9152][SQL] Implement code generation fo...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7561#discussion_r35074591

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -196,6 +254,40 @@ case class RLike(left: Expression, right: Expression)
   override def escape(v: String): String = v
   override def matches(regex: Pattern, str: String): Boolean = regex.matcher(str).find(0)
   override def toString: String = s"$left RLIKE $right"
+
+  override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    val patternClass = classOf[Pattern].getName
+
+    val literalRight: String = right match {
+      case x @ Literal(value: String, StringType) => escape(value)
+      case _ => null
+    }

--- End diff --

I guess this doesn't make sense here. This uses the interpreted evaluation. If you want to cache something, have a look at `ctx.addMutableState`. That allows caching things.
[GitHub] spark pull request: [SPARK-9152][SQL] Implement code generation fo...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7561#discussion_r35074260

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -187,6 +187,64 @@ case class Like(left: Expression, right: Expression)
   override def matches(regex: Pattern, str: String): Boolean = regex.matcher(str).matches()
   override def toString: String = s"$left LIKE $right"
+
+  override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    val patternClass = classOf[Pattern].getName
+
+    val literalRight: String = right match {
+      case x @ Literal(value: String, StringType) => escape(value)
+      case _ => null
+    }
+
+    val leftGen = left.gen(ctx)
+    val rightGen = right.gen(ctx)
+
+    val patternCode =
+      if (literalRight != null) {
+        s"${patternClass} pattern = $patternClass.compile($literalRight);"
+      } else {
+        s"""
+          StringBuilder regex = new StringBuilder("(?s)");
+          for (int idx = 1; idx < rightStr.length(); idx++) {
+            char prev = rightStr.charAt(idx - 1);
+            char curr = rightStr.charAt(idx);
+            if (prev == '\\') {
+              if (curr == '_') {
+                regex.append("_");
+              } else if (curr == '%') {
+                regex.append("%");
+              } else {
+                regex.append(${patternClass}.quote("" + curr));
+              }
+            } else {
+              if (curr != '\\') {
+                if (curr == '_') {
+                  regex.append(".");
+                } else if (curr == '%') {
+                  regex.append(".*");
+                } else {
+                  regex.append(${patternClass}.quote((new Character(curr)).toString()));
+                }
+              }
+            }
+          }
+          ${patternClass} pattern = ${patternClass}.compile(regex.toString());
+        """
+      }

--- End diff --

Please put all the code above in a static context. Then you can call it from `codeGen` and from the interpreted code, and we avoid duplicated code.
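A minimal sketch of that refactoring, assuming a hypothetical `LikeUtils` helper object (the name and the exact escaping rules shown are illustrative, not taken from the PR). Generated code can then reference it through `classOf[LikeUtils].getName`, and `eval` can call it directly, so both paths share one implementation:

```Scala
import java.util.regex.Pattern

// Hypothetical shared helper: translate a SQL LIKE pattern into a Java regex.
object LikeUtils {
  def escapeLikeRegex(v: String): String = {
    val sb = new StringBuilder("(?s)")
    var i = 0
    while (i < v.length) {
      val c = v.charAt(i)
      if (c == '\\' && i + 1 < v.length) {
        val next = v.charAt(i + 1)
        // An escaped _ or % matches itself; any other escaped char is quoted.
        sb.append(if (next == '_' || next == '%') next.toString else Pattern.quote(next.toString))
        i += 2
      } else {
        c match {
          case '_'   => sb.append(".")    // _ matches any single character
          case '%'   => sb.append(".*")   // % matches any sequence of characters
          case other => sb.append(Pattern.quote(other.toString))
        }
        i += 1
      }
    }
    sb.toString
  }
}
```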
[GitHub] spark pull request: [SPARK-9152][SQL] Implement code generation fo...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7561#discussion_r35074087

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -187,6 +187,64 @@ case class Like(left: Expression, right: Expression)
   override def matches(regex: Pattern, str: String): Boolean = regex.matcher(str).matches()
   override def toString: String = s"$left LIKE $right"
+
+  override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    val patternClass = classOf[Pattern].getName
+
+    val literalRight: String = right match {
+      case x @ Literal(value: String, StringType) => escape(value)
+      case _ => null
+    }
+
+    val leftGen = left.gen(ctx)
+    val rightGen = right.gen(ctx)
+
+    val patternCode =
+      if (literalRight != null) {
+        s"${patternClass} pattern = $patternClass.compile($literalRight);"
+      } else {
+        s"""
+          StringBuilder regex = new StringBuilder("(?s)");

--- End diff --

I am not sure if `StringBuilder` is imported. If not, define somewhere `val sb = classOf[StringBuilder].getName` and use `$sb`. You shouldn't use `regex` as the variable name; you can create a safe variable name with `ctx.freshName`.
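For example (a sketch; note that in Scala source `classOf[StringBuilder]` resolves to Scala's `StringBuilder`, so the Java class should be named explicitly):

```Scala
import org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext

// Sketch: qualify java.lang.StringBuilder and let freshName pick a
// collision-free variable name for the generated code.
def regexBuilderCode(ctx: CodeGenContext): String = {
  val sb = classOf[java.lang.StringBuilder].getName
  val regexVar = ctx.freshName("regex")
  s"""$sb $regexVar = new $sb("(?s)");"""
}
```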
[GitHub] spark pull request: [Spark-8244][SQL] string function: find in set
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7186#issuecomment-123187725

Jenkins, test this please.
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7546#discussion_r35072972

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -486,6 +486,10 @@ case class StringFormat(children: Expression*) extends Expression with CodegenFa
   private def format: Expression = children(0)
   private def args: Seq[Expression] = children.tail

+  override def inputTypes: Seq[AbstractDataType] =
+    children.zipWithIndex.map(x => if (x._2 == 0) StringType else AnyDataType)

--- End diff --

@marmbrus Is this what you proposed?
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7546#discussion_r35072868

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -501,6 +501,32 @@ case class StringFormat(children: Expression*) extends Expression with CodegenFa
     }
   }

+  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    val pattern = children.head.gen(ctx)
+
+    val argListGen = children.tail.map(_.gen(ctx))
+    val argListCode = argListGen.map(_.code + "\n")
+    val argListString = argListGen.foldLeft("")((s, v) => s + s", ${v.primitive}")

--- End diff --

`s", ${v.isNull} ? null : ${v.primitive}"` doesn't compile because of: `Incompatible expression types "void" and "int"`. Casting the null to the boxed type throws a null pointer exception:

```Java
int primitive6 = 0;
Object o = (true) ? (Integer) null : primitive6;
```
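A sketch of one way around both problems (this illustrates the Java conditional-operator rules, not necessarily what the PR merged): `(Integer) null : primitive6` auto-unboxes because both branches have numeric types, but casting the primitive branch to `Object` gives the whole conditional the type `Object`, so the primitive is boxed and the null branch is never unboxed:

```Scala
import org.apache.spark.sql.catalyst.expressions.codegen.GeneratedExpressionCode

// Sketch: generate `, isNull ? null : (Object) (primitive)` per argument.
// With Object as the non-null branch's type, no auto-unboxing (and no NPE) occurs.
def argList(argListGen: Seq[GeneratedExpressionCode]): String =
  argListGen.foldLeft("") { (s, v) =>
    s + s", ${v.isNull} ? null : (Object) (${v.primitive})"
  }
```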
[GitHub] spark pull request: [SPARK-9132][SPARK-9163][SQL] codegen conv
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7552#issuecomment-123139930

Jenkins, test this please.
[GitHub] spark pull request: [SPARK-9157][SQL] codegen substring
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7534#issuecomment-123138231

Jenkins, test this please.
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7546#discussion_r35066508

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala ---

@@ -353,7 +353,7 @@ class StringExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
   test("FORMAT") {
     val f = 'f.string.at(0)
     val d1 = 'd.int.at(1)
-    val s1 = 's.int.at(2)
+    val s1 = 's.string.at(2)

     val row1 = create_row("aa%d%s", 12, "cc")
     val row2 = create_row(null, 12, "cc")

--- End diff --

What do we expect if an Integer value is null? `printf` itself has no problems with null, but for codeGen we have a primitive value like `int` instead of `Integer`. One approach to solving this might be to box all values again and set a value to null if `isNull` returns true. Another approach might be to return null if one argument is null.
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7546#discussion_r35063546

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -476,7 +476,7 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression)

 /**
  * Returns the input formatted according do printf-style format strings
  */
-case class StringFormat(children: Expression*) extends Expression with CodegenFallback {
+case class StringFormat(children: Expression*) extends Expression {

--- End diff --

I do have to split the signature for this to `StringFormat(string: Expression, args: Expression*)`, don't I?
[GitHub] spark pull request: [SPARK-9132][SPARK-9163][SQL] codegen conv
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7552

[SPARK-9132][SPARK-9163][SQL] codegen conv

Jira: https://issues.apache.org/jira/browse/SPARK-9132 and https://issues.apache.org/jira/browse/SPARK-9163

@rxin As you proposed in the Jira ticket, I just moved the logic to a separate object. I haven't changed any of the logic of `NumberConverter`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9163

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7552.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7552

commit fa985bda663bdbf60e5b22c4a4113a772b647d35
Author: Tarek Auel
Date: 2015-07-21T01:12:23Z

    [SPARK-9132][SPARK-9163][SQL] codegen conv

commit 40dcde9c76232d79d51316f0b6ee978c18a22538
Author: Tarek Auel
Date: 2015-07-21T01:17:43Z

    [SPARK-9132][SPARK-9163][SQL] style fix
[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r35056820

--- Diff: python/pyspark/sql/functions.py ---

@@ -795,6 +796,22 @@ def weekofyear(col):
     return Column(sc._jvm.functions.weekofyear(col))

+@since(1.5)
+def size(col):
+    """
+    Collection function: returns the length of the array or map stored in the column.
+    :param col: name of column or expression
+
+    >>> from pyspark.sql import Row
+    >>> from pyspark.sql.functions import size
+    >>> df = sqlContext.createDataFrame([Row(data=[1, 2, 3]), Row(data=[1]), Row(data=[])])
+    >>> df.select(size(df.data)).collect()

--- End diff --

You don't have to import `size` and `Row`. Simply use

```Python
>>> df = sqlContext.createDataFrame([([1, 2, 3],), ([1],), ([],)], ['data'])
>>> df.select(size(df.data)).collect()
```
[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r35055882

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionFunctionsSuite.scala ---

@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.types._
+
+
+class CollectionFunctionsSuite extends SparkFunSuite with ExpressionEvalHelper {
+
+  test("Array and Map Size") {
+    val a0 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType))
+    val a1 = Literal.create(Seq[Integer](), ArrayType(IntegerType))
+    val a2 = Literal.create(Seq(1, 2), ArrayType(IntegerType))
+
+    checkEvaluation(Size(a0), 3)
+    checkEvaluation(Size(a1), 0)
+    checkEvaluation(Size(a2), 2)
+
+    val m0 = Literal.create(Map("a" -> "a", "b" -> "b"), MapType(StringType, StringType))
+    val m1 = Literal.create(Map[String, String](), MapType(StringType, StringType))
+    val m2 = Literal.create(Map("a" -> "a"), MapType(StringType, StringType))
+
+    checkEvaluation(Size(m0), 2)
+    checkEvaluation(Size(m1), 0)
+    checkEvaluation(Size(m2), 1)

--- End diff --

Can you add something like

```Scala
checkEvaluation(Size(Literal.create(null, MapType(StringType, StringType))), null)
checkEvaluation(Size(Literal.create(null, ArrayType(StringType))), null)
```
[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r35055176

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---

@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodeGenContext, GeneratedExpressionCode}
+import org.apache.spark.sql.types._
+
+/**
+ * Given an array or map, returns its size.
+ */
+case class Size(child: Expression) extends UnaryExpression with ExpectsInputTypes {
+  override def dataType: DataType = IntegerType
+  override def inputTypes: Seq[AbstractDataType] = Seq(TypeCollection(ArrayType, MapType))
+
+  override def nullSafeEval(value: Any): Int = child.dataType match {
+    case ArrayType(_, _) => value.asInstanceOf[Seq[Any]].size
+    case MapType(_, _, _) => value.asInstanceOf[Map[Any, Any]].size
+  }
+
+  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    child.dataType match {

--- End diff --

```Scala
override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
  nullSafeCodeGen(ctx, ev, c => s"${ev.primitive} = ($c).size();")
}
```

`nullSafeCodeGen` allows you to add multiple lines. `defineCodeGen` expects only the right part of the assignment.
[GitHub] spark pull request: [SPARK-9164][SQL] codegen hex/unhex
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7548

[SPARK-9164][SQL] codegen hex/unhex

Jira: https://issues.apache.org/jira/browse/SPARK-9164

The diff looks heavy, but I just moved the `hex` and `unhex` methods to `object Hex`. This allows me to call them from `eval` and `codeGen`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9164

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7548.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7548

commit dd91c57beba57d11091a9160a072fc889db411bf
Author: Tarek Auel
Date: 2015-07-20T23:05:10Z

    [SPARK-9164][SQL] codegen hex/unhex
[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r35053445

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---

@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodeGenContext, GeneratedExpressionCode}
+import org.apache.spark.sql.types._
+
+/**
+ * Given an array or map, returns its size.
+ */
+case class Size(child: Expression) extends UnaryExpression with ExpectsInputTypes {
+  override def dataType: DataType = IntegerType
+  override def inputTypes: Seq[AbstractDataType] = Seq(TypeCollection(ArrayType, MapType))
+
+  override def nullSafeEval(value: Any): Int = child.dataType match {
+    case ArrayType(_, _) => value.asInstanceOf[Seq[Any]].size
+    case MapType(_, _, _) => value.asInstanceOf[Map[Any, Any]].size
+  }
+
+  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    child.dataType match {

--- End diff --

1. Pattern matching is not necessary here. Just do `defineCodeGen(ctx, ev, c => s"($c).size()")`.
2. Maybe we should call `nullSafeCodeGen` here instead of `defineCodeGen`.
[GitHub] spark pull request: [SPARK-9156][SQL] codegen StringSplit
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7547

[SPARK-9156][SQL] codegen StringSplit

Jira: https://issues.apache.org/jira/browse/SPARK-9156

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9156

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7547.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7547

commit 5ad6a1f851683ee285a731a762c96e1ac398219a
Author: Tarek Auel
Date: 2015-07-20T07:58:50Z

    [SPARK-9156] codegen StringSplit

commit b860eaf09cd77da00889f576300318fa520a73f3
Author: Tarek Auel
Date: 2015-07-20T22:22:02Z

    [SPARK-9156][SQL] codegen StringSplit

commit 0be2700f2366614cae7faceb799085e96d33cd16
Author: Tarek Auel
Date: 2015-07-20T22:24:56Z

    [SPARK-9156][SQL] indention fix
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7546#issuecomment-123059863

Jenkins, ok to test.
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7546

[SPARK-9154][SQL] codegen StringFormat

Jira: https://issues.apache.org/jira/browse/SPARK-9154

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9154

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7546.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7546

commit 086caba76f646f86840a2cee325188895ab42c8f
Author: Tarek Auel
Date: 2015-07-20T16:29:03Z

    [SPARK-9154][SQL] codegen string format

commit cd8322bc4e6c15cd9911363c4596eba1a935fcdd
Author: Tarek Auel
Date: 2015-07-20T21:40:30Z

    [SPARK-9154][SQL] codegen string format

commit 10b4de88c817a474b7b0a83d948cb86927638775
Author: Tarek Auel
Date: 2015-07-20T21:42:28Z

    [SPARK-9154][SQL] codegen removed fallback trait
[GitHub] spark pull request: [SPARK-9161][SQL] codegen FormatNumber
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7545

[SPARK-9161][SQL] codegen FormatNumber

Jira: https://issues.apache.org/jira/browse/SPARK-9161

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9161

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7545.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7545

commit 21425c82ac3f6a68d3428c08de7ff27f50a12993
Author: Tarek Auel
Date: 2015-07-20T19:20:01Z

    [SPARK-9161][SQL] codegen FormatNumber
[GitHub] spark pull request: [SPARK-9155][SQL] codegen StringSpace
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7531#discussion_r35031504

--- Diff: unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---

@@ -77,6 +78,15 @@ public static UTF8String fromString(String str) {
     }
   }

+  /**
+   * Creates an UTF8String that contains `length` spaces.
+   */
+  public static UTF8String blankString(int length) {
+    byte[] spaces = new byte[length];
+    Arrays.fill(spaces, (byte) ' ');

--- End diff --

Char implements UTF-16 (http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html): "The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities". Shall I still change it?
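For context, an illustrative check (not part of the PR): a space is U+0020, which UTF-8 encodes as the single byte 0x20, so filling a `byte[]` with `(byte) ' '` already yields valid UTF-8 even though Java's `char` is a 16-bit UTF-16 code unit:

```Scala
import java.nio.charset.StandardCharsets

// A space encodes to exactly one byte in UTF-8, so the byte fill is
// equivalent to building the string from chars and encoding afterwards.
val spaces = Array.fill[Byte](5)(' '.toByte)
assert(new String(spaces, StandardCharsets.UTF_8) == " " * 5)
```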
[GitHub] spark pull request: [SPARK-9178][SQL] Add an empty string constant...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7509#issuecomment-122969582

@rxin Could you restart this? I don't understand what went wrong: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1121/testReport/org.apache.spark.sql/DatetimeExpressionsSuite/
[GitHub] spark pull request: [SPARK-9160][SQL] codegen encode, decode
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7543

[SPARK-9160][SQL] codegen encode, decode

Jira: https://issues.apache.org/jira/browse/SPARK-9160

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9160

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7543.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7543

commit 7528f0eac152fad6e8263a63fd78d138a18b5aa0
Author: Tarek Auel
Date: 2015-07-20T17:38:17Z

    [SPARK-9160][SQL] codegen encode, decode
[GitHub] spark pull request: [SPARK-9159][SQL] codegen ascii, base64, unbas...
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7542

[SPARK-9159][SQL] codegen ascii, base64, unbase64

Jira: https://issues.apache.org/jira/browse/SPARK-9159

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9159

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7542.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7542

commit 772e6bc5c2729cc50b207c0043f3380a1856ae80
Author: Tarek Auel
Date: 2015-07-20T17:22:49Z

    [SPARK-9159][SQL] codegen ascii, base64, unbase64
[GitHub] spark pull request: [SPARK-9157][SQL] codegen substring
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7534#issuecomment-122918124

@rxin I guess Jenkins didn't get the 'add to whitelist'.
[GitHub] spark pull request: [SPARK-9157][SQL] codegen substring
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7534#issuecomment-122916361

Jenkins, ok to test.
[GitHub] spark pull request: [SPARK-9157][SQL] codegen substring
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7534#discussion_r34976319

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -628,24 +623,59 @@ case class Substring(str: Expression, pos: Expression, len: Expression)

   override def eval(input: InternalRow): Any = {
     val string = str.eval(input)
-    val po = pos.eval(input)
-    val ln = len.eval(input)
-
-    if ((string == null) || (po == null) || (ln == null)) {
-      null

--- End diff --

I created a nested if in order to avoid evaluating the 2nd or 3rd argument if one of them is already null.
[GitHub] spark pull request: [SPARK-9157][SQL] codegen substring
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7534#discussion_r34976235

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -593,12 +593,7 @@ case class Substring(str: Expression, pos: Expression, len: Expression)
   override def foldable: Boolean = str.foldable && pos.foldable && len.foldable
   override def nullable: Boolean = str.nullable || pos.nullable || len.nullable

-  override def dataType: DataType = {
-    if (!resolved) {
-      throw new UnresolvedException(this, s"Cannot resolve since $children are not resolved")
-    }
-    if (str.dataType == BinaryType) str.dataType else StringType
-  }
+  override def dataType: DataType = StringType

--- End diff --

@rxin This simplification is correct, isn't it? The expression extends `ImplicitCastInputTypes`, so `BinaryType` can be cast to `StringType`.
[GitHub] spark pull request: [SPARK-9157][SQL] codegen substring
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7534

[SPARK-9157][SQL] codegen substring

Jira: https://issues.apache.org/jira/browse/SPARK-9157

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9157

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7534.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7534

commit 1a2e6110478642f30487e569f2f3645ef058bc78
Author: Tarek Auel
Date: 2015-07-20T08:39:08Z

    [SPARK-9157][SQL] codegen substring
[GitHub] spark pull request: [SPARK-9155][SQL] codegen StringSpace
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7531#issuecomment-122800467

Jenkins, test this please.
[GitHub] spark pull request: [SPARK-9155][SQL] codegen StringSpace
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7531#discussion_r34972744 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -556,6 +556,16 @@ case class StringSpace(child: Expression) UTF8String.fromBytes(spaces) } + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +nullSafeCodeGen(ctx, ev, (length) => { + val spaces = ctx.freshName("spaces") + s""" +byte[] $spaces = new byte[($length < 0) ? 0 : $length]; --- End diff -- Okay --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9155][SQL] codegen StringSpace
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7531 [SPARK-9155][SQL] codegen StringSpace Jira https://issues.apache.org/jira/browse/SPARK-9155 You can merge this pull request into a Git repository by running: $ git pull https://github.com/tarekauel/spark SPARK-9155 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7531.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7531 commit 4bc33e6794ee931e0a52645d8fc6ddf699754b32 Author: Tarek Auel Date: 2015-07-20T07:29:20Z [SPARK-9155] codegen StringSpace --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9153][SQL] codegen StringLPad/StringRPa...
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7527 [SPARK-9153][SQL] codegen StringLPad/StringRPad Jira: https://issues.apache.org/jira/browse/SPARK-9153 You can merge this pull request into a Git repository by running: $ git pull https://github.com/tarekauel/spark SPARK-9153 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7527.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7527 commit 92b6a5d5d89c909ae408bc5fb58542225f1f915c Author: Tarek Auel Date: 2015-07-20T06:50:30Z [SPARK-9153] codegen lpad/rpad --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9177][SQL] Reuse of calendar object in ...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7516#issuecomment-122774605 Sure. I am going to solve some of them. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9177][SQL] Reuse of calendar object in ...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7516#issuecomment-122773919 @rxin Jenkins still doesn't like me --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7462#issuecomment-122764175 @EntilZha 1. `eval` and `nullSafeEval`: `eval` is invoked to evaluate the expression. Most expressions should return `null` if one of their arguments is `null`. To avoid every expression having to check whether `left` or `right` is `null`, `nullSafeEval` was added: `eval` does the null check and then calls `nullSafeEval`, see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L289-L313 You should override `eval` only if you don't want to return `null` when one of the arguments is `null`. Most of the time you will use `nullSafeEval`. 2. `UnaryExpression`: the expression has one parameter (like `size(x)`). `BinaryExpression`: the expression has two parameters (like `contains(a, b)`). `ExpectsInputTypes`: lets you automatically check that the argument types are correct, see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpectsInputTypes.scala#L42-L57. You specify the allowed types by overriding `inputTypes`. `ImplicitCastInputTypes`: the difference to `ExpectsInputTypes` is that this one tries to cast the value. Most string operations are implemented with a byte array as input; a string can be "cast" to a byte array by calling `.getBytes`. `ImplicitCastInputTypes` allows calling both `contains(s: String, s2: String)` and `contains(s: Array[Byte], s2: Array[Byte])`. Typically you use this when a cast is reasonable: casting almost anything to string is usually reasonable, but implicitly (i.e. automatically) casting a string to an integer value is usually not helpful. Users can still invoke the `cast` function explicitly. 3. I don't know. 4. IntelliJ can run most suites from the IDE. And have a look at https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningIndividualTests --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
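Putting those pieces together, a minimal sketch of a hypothetical expression (`StrLength` is invented for illustration; the names follow the catalyst API quoted in the diffs of this thread):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, ImplicitCastInputTypes, UnaryExpression}
import org.apache.spark.sql.types.{AbstractDataType, DataType, IntegerType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Returns the length of its (string) argument, or null for a null input.
case class StrLength(child: Expression)
  extends UnaryExpression with ImplicitCastInputTypes {

  // ImplicitCastInputTypes: non-string arguments get an implicit cast inserted
  // during analysis, so nullSafeEval can assume a UTF8String.
  override def inputTypes: Seq[AbstractDataType] = Seq(StringType)
  override def dataType: DataType = IntegerType

  // eval() in UnaryExpression performs the null check and only then delegates
  // here, which is why no null handling appears in this method.
  override protected def nullSafeEval(string: Any): Any =
    string.asInstanceOf[UTF8String].toString.length
}
```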
[GitHub] spark pull request: [SPARK-9178][SQL] Add an empty string constant...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7509#issuecomment-122726603 @rxin Shall I add `final` and do a rebase? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9178][SQL] Add an empty string constant...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7509#issuecomment-122715547 @rxin Jenkins doesn't like me --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9177][SQL] Reuse of calendar object in ...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7516#discussion_r34960318 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala --- @@ -213,10 +213,14 @@ case class WeekOfYear(child: Expression) extends UnaryExpression with ImplicitCa override def dataType: DataType = IntegerType - override protected def nullSafeEval(date: Any): Any = { + private[this] final val c = { --- End diff -- Just to double-check: `java.util.Calendar` implements `Serializable`, so `@transient` isn't necessary. Am I right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
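A sketch of the pattern under discussion (reconstructed from the codegen snippet in the next message, not copied from the PR): the Calendar is created and configured once per instance and then reused for every row.

```scala
import java.util.{Calendar, TimeZone}

class WeekOfYearSketch extends Serializable {
  // Calendar implements Serializable, so @transient is optional here; it would
  // only be required if the field could not survive serialization.
  private[this] final val c = {
    val cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"))
    cal.setFirstDayOfWeek(Calendar.MONDAY)
    cal.setMinimalDaysInFirstWeek(4) // ISO 8601 week numbering
    cal
  }

  def weekOfYear(daysSinceEpoch: Int): Int = {
    c.setTimeInMillis(daysSinceEpoch.toLong * 24 * 3600 * 1000)
    c.get(Calendar.WEEK_OF_YEAR)
  }
}
```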
[GitHub] spark pull request: [SPARK-9177][SQL] Reuse of calendar object in ...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7516#discussion_r34960308 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala --- @@ -225,8 +229,8 @@ case class WeekOfYear(child: Expression) extends UnaryExpression with ImplicitCa nullSafeCodeGen(ctx, ev, (time) => { val cal = classOf[Calendar].getName val c = ctx.freshName("cal") + ctx.addMutableState(cal, c, s"""$cal.getInstance(java.util.TimeZone.getTimeZone("UTC"));""") s""" -$cal $c = $cal.getInstance(java.util.TimeZone.getTimeZone("UTC")); $c.setFirstDayOfWeek($cal.MONDAY); $c.setMinimalDaysInFirstWeek(4); --- End diff -- If we extend `CodeGenContext.addMutableState(javaName, variableName, initialValue)` to something like `CodeGenContext.addMutableState(javaName, variableName, initialValue, initialCode: Option[String] = None)` we could get rid of these two lines and allow more complex initialisations than a single method call. So far there is no way to pass a more complex initialisation, is there? @cloud-fan I guess you created the PR for `addMutableState`. Is there an opportunity to push the two lines into the initialisation? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
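A toy mock of that proposal (hypothetical `initialCode` parameter; the real `addMutableState` in this Spark version takes only the three arguments shown above):

```scala
import scala.collection.mutable

class CodeGenContextSketch {
  private val initStatements = mutable.ArrayBuffer.empty[String]

  def addMutableState(
      javaType: String,
      variableName: String,
      initialValue: String,
      initialCode: Option[String] = None): Unit = {
    // declaration plus single-expression initial value, as today
    initStatements += s"private $javaType $variableName = $initialValue;"
    // the proposed extra statements, e.g. the two Calendar setter calls above
    initialCode.foreach(initStatements += _)
  }

  // everything collected here would be emitted into the generated class's init
  def initializerBody: String = initStatements.mkString("\n")
}
```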
[GitHub] spark pull request: [SPARK-9177][SQL] Reuse of calendar object in ...
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7516 [SPARK-9177][SQL] Reuse of calendar object in WeekOfYear https://issues.apache.org/jira/browse/SPARK-9177 @rxin Are we sure that this is thread safe? @chenghao-intel explained in another PR that every partition (if I remember correctly) uses one expression instance. This instance isn't used by multiple threads, is it? If not, we are fine. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tarekauel/spark SPARK-9177 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7516.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7516 commit ff97b095c3c80f857f571c0087d271d32b208cb9 Author: Tarek Auel Date: 2015-07-19T17:40:21Z [SPARK-9177] Reuse calendar object in interpreted code and codegen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9178][SQL] Add an empty string constant...
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7509 [SPARK-9178][SQL] Add an empty string constant to UTF8String Jira: https://issues.apache.org/jira/browse/SPARK-9178 In order to avoid calls of `UTF8String.fromString("")` this PR adds an `EMPTY_STRING` constant to `UTF8String`. A `UTF8String` is immutable, so we can use a constant, can't we? I searched for current usages of `UTF8String.fromString("")` with `grep -R "UTF8String.fromString(\"\")" .` You can merge this pull request into a Git repository by running: $ git pull https://github.com/tarekauel/spark SPARK-9178 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7509.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7509 commit 748b87a38575664fcfc877ccc575678ba54a9df6 Author: Tarek Auel Date: 2015-07-19T08:22:43Z [SPARK-9178] Add empty string constant to UTF8String --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
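A sketch of the idea in Scala (the PR itself adds the constant to `UTF8String` on the Java side; the exact field name there may differ from this illustration):

```scala
import org.apache.spark.unsafe.types.UTF8String

object Utf8Constants {
  // UTF8String is immutable, so one shared empty instance can safely replace
  // every UTF8String.fromString("") call site found by the grep above.
  val EMPTY_STRING: UTF8String = UTF8String.fromBytes(new Array[Byte](0))
}
```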
[GitHub] spark pull request: [SPARK-8255][SPARK-8256][SQL]Add regex_extract...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7468#discussion_r34955908 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -673,6 +673,110 @@ case class Encode(value: Expression, charset: Expression) } /** + * Replace all substrings of str that match regexp with rep + */ +case class RegExpReplace(subject: Expression, regexp: Expression, rep: Expression) + extends Expression with ImplicitCastInputTypes { + + // last regex in string, we will update the pattern iff regexp value changed. + @transient private var lastRegex: UTF8String = _ + // last regex pattern, we cache it for performance concern + @transient private var pattern: Pattern = _ + // last replacement string, we don't want to convert a UTF8String => java.langString every time. + @transient private var lastReplacement: String = _ + @transient private var lastReplacementInUTF8: UTF8String = _ + // result buffer write by Matcher + @transient private val result: StringBuffer = new StringBuffer + + override def nullable: Boolean = children.foldLeft(false)(_ || _.nullable) + override def foldable: Boolean = children.foldLeft(true)(_ && _.foldable) + + override def eval(input: InternalRow): Any = { +val s = subject.eval(input) +if (null != s) { + val p = regexp.eval(input) + if (null != p) { +val r = rep.eval(input) +if (null != r) { + if (!p.equals(lastRegex)) { +// regex value changed +lastRegex = p.asInstanceOf[UTF8String] +pattern = Pattern.compile(lastRegex.toString) + } + if (!r.equals(lastReplacementInUTF8)) { +// replacement string changed +lastReplacementInUTF8 = r.asInstanceOf[UTF8String] +lastReplacement = lastReplacementInUTF8.toString + } + val m = pattern.matcher(s.toString()) + result.delete(0, result.length()) + + while (m.find) { +m.appendReplacement(result, lastReplacement) + } + m.appendTail(result) + + return UTF8String.fromString(result.toString) +} + } +} + +null + } + + override def dataType: DataType = StringType + override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, StringType) + override def children: Seq[Expression] = subject :: regexp :: rep :: Nil + override def prettyName: String = "regexp_replace" +} + +/** + * UDF to extract a specific(idx) group identified by a java regex. + */ +case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expression) + extends Expression with ImplicitCastInputTypes { + def this(s: Expression, r: Expression) = this(s, r, Literal(1)) + + // last regex in string, we will update the pattern iff regexp value changed. + @transient private var lastRegex: UTF8String = _ + // last regex pattern, we cache it for performance concern + @transient private var pattern: Pattern = _ + + override def nullable: Boolean = children.foldLeft(false)(_ || _.nullable) + override def foldable: Boolean = children.foldLeft(true)(_ && _.foldable) + + override def eval(input: InternalRow): Any = { +val s = subject.eval(input) +if (null != s) { + val p = regexp.eval(input) + if (null != p) { +val r = idx.eval(input) +if (null != r) { + if (!p.equals(lastRegex)) { +// regex value changed +lastRegex = p.asInstanceOf[UTF8String] +pattern = Pattern.compile(lastRegex.toString) + } + val m = pattern.matcher(s.toString()) + if (m.find) { +val mr: MatchResult = m.toMatchResult +return UTF8String.fromString(mr.group(r.asInstanceOf[Int])) + } + return UTF8String.fromString("") --- End diff -- Okay. 
I am going to create a Jira and check the code for existing empty strings. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8255][SPARK-8256][SQL]Add regex_extract...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7468#discussion_r34955738 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -673,6 +673,110 @@ case class Encode(value: Expression, charset: Expression) } /** + * Replace all substrings of str that match regexp with rep + */ +case class RegExpReplace(subject: Expression, regexp: Expression, rep: Expression) + extends Expression with ImplicitCastInputTypes { + + // last regex in string, we will update the pattern iff regexp value changed. + @transient private var lastRegex: UTF8String = _ + // last regex pattern, we cache it for performance concern + @transient private var pattern: Pattern = _ + // last replacement string, we don't want to convert a UTF8String => java.langString every time. + @transient private var lastReplacement: String = _ + @transient private var lastReplacementInUTF8: UTF8String = _ + // result buffer write by Matcher + @transient private val result: StringBuffer = new StringBuffer + + override def nullable: Boolean = children.foldLeft(false)(_ || _.nullable) + override def foldable: Boolean = children.foldLeft(true)(_ && _.foldable) + + override def eval(input: InternalRow): Any = { +val s = subject.eval(input) +if (null != s) { + val p = regexp.eval(input) + if (null != p) { +val r = rep.eval(input) +if (null != r) { + if (!p.equals(lastRegex)) { +// regex value changed +lastRegex = p.asInstanceOf[UTF8String] +pattern = Pattern.compile(lastRegex.toString) + } + if (!r.equals(lastReplacementInUTF8)) { +// replacement string changed +lastReplacementInUTF8 = r.asInstanceOf[UTF8String] +lastReplacement = lastReplacementInUTF8.toString + } + val m = pattern.matcher(s.toString()) + result.delete(0, result.length()) + + while (m.find) { +m.appendReplacement(result, lastReplacement) + } + m.appendTail(result) + + return UTF8String.fromString(result.toString) +} + } +} + +null + } + + override def dataType: DataType = StringType + override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, StringType) + override def children: Seq[Expression] = subject :: regexp :: rep :: Nil + override def prettyName: String = "regexp_replace" +} + +/** + * UDF to extract a specific(idx) group identified by a java regex. + */ +case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expression) + extends Expression with ImplicitCastInputTypes { + def this(s: Expression, r: Expression) = this(s, r, Literal(1)) + + // last regex in string, we will update the pattern iff regexp value changed. 
+ @transient private var lastRegex: UTF8String = _ + // last regex pattern, we cache it for performance concern + @transient private var pattern: Pattern = _ + + override def nullable: Boolean = children.foldLeft(false)(_ || _.nullable) + override def foldable: Boolean = children.foldLeft(true)(_ && _.foldable) + + override def eval(input: InternalRow): Any = { +val s = subject.eval(input) +if (null != s) { + val p = regexp.eval(input) + if (null != p) { +val r = idx.eval(input) +if (null != r) { + if (!p.equals(lastRegex)) { +// regex value changed +lastRegex = p.asInstanceOf[UTF8String] +pattern = Pattern.compile(lastRegex.toString) + } + val m = pattern.matcher(s.toString()) + if (m.find) { +val mr: MatchResult = m.toMatchResult +return UTF8String.fromString(mr.group(r.asInstanceOf[Int])) + } + return UTF8String.fromString("") --- End diff -- `UTF8String.fromBytes(Array[Byte]())` should be slightly faster and avoids creating the string. @rxin / @davies A little bit off-topic, but do you guys think we should add something to `UTF8String` to create an empty UTF8String? Something like: ``` public UTF8String EMPTY_STRING() { return UTF8String.fromBytes(new byte[0]); } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...
[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r34955687 --- Diff: python/pyspark/sql/functions.py --- @@ -652,6 +658,16 @@ def ntile(n): return Column(sc._jvm.functions.ntile(int(n))) +@since(1.5) +def size(col): +""" +Collection function: returns the length of the array or map stored in the column. +:param col: name of column or expression --- End diff -- Could you add an example/test here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL] Make date/time functions more consistent...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7506#issuecomment-122633436 @rxin Could you do this little fix as well? https://github.com/apache/spark/pull/7505/files Why do we switch from day_of_month to dayofmonth? Most SQL implementations use underscores: [MySQL](https://dev.mysql.com/doc/refman/5.0/en/func-op-summary-ref.html) [SAP HANA](http://help.sap.com/saphelp_hanaplatform/helpdata/en/20/9f228975191014baed94f1b69693ae/content.htm?frameset=/en/20/9ddefe75191014ac249bf78ba2a1e9/frameset.htm&current_toc=/en/2e/1ef8b4f4554739959886e55d4c127b/plain.htm&node_id=91&show_children=false) [Oracle](http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions001.htm#i88891) I would prefer underscores because they improve readability when all the SQL keywords are written in caps, like: `SELECT name, age, DAY_OF_MONTH(birthday) AS birthday FROM people WHERE age > 15` compared to `SELECT name, age, DAYOFMONTH(birthday) AS birthday FROM people WHERE age > 15` I'm not a Python pro, but I thought that underscores are 'pythonic', aren't they? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SQL] follow up; revert change in ...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7505#issuecomment-122632995 Now it's right, isn't it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SQL] follow up; revert change in ...
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7505 [SPARK-8199][SQL] follow up; revert change in test @rxin / @davies Sorry for that unnecessary change. And thanks again for all your support! You can merge this pull request into a Git repository by running: $ git pull https://github.com/tarekauel/spark SPARK-8199-FollowUp Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7505.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7505 commit 67acfe6ff366e2050a72069842b088935d81e2ef Author: Tarek Auel Date: 2015-07-19T06:01:02Z [SPARK-8199] follow up; revert change in test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34955066 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateFunctionsSuite.scala --- @@ -0,0 +1,249 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import java.sql.{Timestamp, Date} +import java.text.SimpleDateFormat +import java.util.{TimeZone, Calendar} + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.types.{StringType, TimestampType, DateType} + +class DateFunctionsSuite extends SparkFunSuite with ExpressionEvalHelper { + + val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss") + val sdfDate = new SimpleDateFormat("yyyy-MM-dd") + val d = new Date(sdf.parse("2015-04-08 13:10:15").getTime) + val ts = new Timestamp(sdf.parse("2013-11-08 13:10:15").getTime) + + test("Day in Year") { +val sdfDay = new SimpleDateFormat("D") +(2002 to 2004).foreach { y => + (0 to 11).foreach { m => +(0 to 5).foreach { i => + val c = Calendar.getInstance() + c.set(y, m, 28, 0, 0, 0) + c.add(Calendar.DATE, i) + checkEvaluation(DayInYear(Cast(Literal(new Date(c.getTimeInMillis)), DateType)), +sdfDay.format(c.getTime).toInt) +} + } +} + +(1998 to 2002).foreach { y => + (0 to 11).foreach { m => +(0 to 5).foreach { i => + val c = Calendar.getInstance() + c.set(y, m, 28, 0, 0, 0) + c.add(Calendar.DATE, 1) + checkEvaluation(DayInYear(Cast(Literal(new Date(c.getTimeInMillis)), DateType)), +sdfDay.format(c.getTime).toInt) +} + } +} + +(1969 to 1970).foreach { y => + (0 to 11).foreach { m => +(0 to 5).foreach { i => + val c = Calendar.getInstance() + c.set(y, m, 28, 0, 0, 0) + c.add(Calendar.DATE, 1) + checkEvaluation(DayInYear(Cast(Literal(new Date(c.getTimeInMillis)), DateType)), +sdfDay.format(c.getTime).toInt) +} + } +} + +(2402 to 2404).foreach { y => + (0 to 11).foreach { m => +(0 to 5).foreach { i => + val c = Calendar.getInstance() + c.set(y, m, 28, 0, 0, 0) + c.add(Calendar.DATE, 1) + checkEvaluation(DayInYear(Cast(Literal(new Date(c.getTimeInMillis)), DateType)), +sdfDay.format(c.getTime).toInt) +} + } +} + +(2398 to 2402).foreach { y => + (0 to 11).foreach { m => +(0 to 5).foreach { i => + val c = Calendar.getInstance() + c.set(y, m, 28, 0, 0, 0) + c.add(Calendar.DATE, 1) --- End diff -- I changed this when I looked for the last bug. I'm going to create a follow-up PR --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/6981#issuecomment-122573044 @rxin / @davies First of all, thanks for all your feedback so far. I removed the manual binary search. I guess we all agree that the if/else structure is much more readable. @davies I removed `testWithTimezone` because of @cloud-fan's comment in #7488. @cloud-fan Could you have a look at this PR, if you have time? That would be great. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34949893 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala --- @@ -54,3 +60,367 @@ case class CurrentTimestamp() extends LeafExpression { System.currentTimeMillis() * 1000L } } + +case class Hour(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +val time = timestamp.asInstanceOf[Long] / 1000 +val longTime: Long = time.asInstanceOf[Long] + TimeZone.getDefault.getOffset(time) +((longTime / (1000 * 3600)) % 24).toInt + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val tz = classOf[TimeZone].getName +defineCodeGen(ctx, ev, (c) => + s"""(int) ((($c / 1000) + $tz.getDefault().getOffset($c / 1000)) + / (1000 * 3600) % 24)""".stripMargin +) + } +} + +case class Minute(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +val time = timestamp.asInstanceOf[Long] / 1000 +val longTime: Long = time.asInstanceOf[Long] + TimeZone.getDefault.getOffset(time) +((longTime / (1000 * 60)) % 60).toInt + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val tz = classOf[TimeZone].getName +defineCodeGen(ctx, ev, (c) => + s"""(int) ((($c / 1000) + $tz.getDefault().getOffset($c / 1000)) + / (1000 * 60) % 60)""".stripMargin +) + } +} + +case class Second(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(time: Any): Any = { +(time.asInstanceOf[Long] / 1000L / 1000L % 60L).toInt + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +nullSafeCodeGen(ctx, ev, (time) => { + s"""${ev.primitive} = (int) ($time / 1000L / 1000L % 60L);""" +}) + } +} + +abstract class DateFormatExpression extends UnaryExpression with ImplicitCastInputTypes { + self: Product => + + val daysIn400Years: Int = 146097 + val to2001 = -11323 + + // this is year -17999, calculation: 50 * daysIn400Year + val toYearZero = to2001 + 7304850 + + protected def isLeapYear(year: Int): Boolean = { +(year % 4) == 0 && ((year % 100) != 0 || (year % 400) == 0) + } + + private[this] def yearBoundary(year: Int): Int = { +year * 365 + ((year / 4 ) - (year / 100) + (year / 400)) + } + + private[this] def numYears(in: Int): Int = { +val year = in / 365 +if (in > yearBoundary(year)) year else year - 1 + } + + override def dataType: DataType = IntegerType + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + protected def calculateYearAndDayInYear(daysIn: Int): (Int, Int) = { +val daysNormalized = daysIn + toYearZero +val numOfQuarterCenturies = daysNormalized / daysIn400Years +val daysInThis400 = daysNormalized % daysIn400Years + 1 +val years = numYears(daysInThis400) +val year: Int = (2001 - 2) + 400 * numOfQuarterCenturies + years +val dayInYear = daysInThis400 - yearBoundary(years) +(year, dayInYear) + } + + protected def codeGen(ctx: CodeGenContext, ev: 
GeneratedExpressionCode, input: String, + f: (String, String) => String): String = { +val daysIn400Years = ctx.freshName("daysIn400Years") +val to2001 = ctx.freshName("to2001") +val toYearZero = ctx.freshName("toYearZero") +val daysNormalized = ctx.freshName("daysNormalized") +val numOfQuarterCenturies = ctx.freshName("numOfQuarterCenturies") +val daysInThis400 = ctx.freshName("daysInThis400") +val years = ctx.freshName("years") +val year = ctx.freshName("year") +val dayInYear = ctx.freshName("dayInYear") + +s
[GitHub] spark pull request: [SPARK-9136][SQL] fix several bugs in DateTime...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7473#discussion_r34948052 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -320,16 +320,17 @@ object DateTimeUtils { Calendar.getInstance( TimeZone.getTimeZone(f"GMT${timeZone.get.toChar}${segments(7)}%02d:${segments(8)}%02d")) } +c.set(Calendar.MILLISECOND, 0) if (justTime) { - c.set(Calendar.HOUR, segments(3)) + c.set(Calendar.HOUR_OF_DAY, segments(3)) c.set(Calendar.MINUTE, segments(4)) c.set(Calendar.SECOND, segments(5)) } else { c.set(segments(0), segments(1) - 1, segments(2), segments(3), segments(4), segments(5)) } -Some(c.getTimeInMillis / 1000 * 100 + segments(6)) +Some(c.getTimeInMillis * 1000 + segments(6)) --- End diff -- Got it! Thanks for the explanation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34940088 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala --- @@ -54,3 +60,204 @@ case class CurrentTimestamp() extends LeafExpression { System.currentTimeMillis() * 1000L } } + +case class Hour(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getHours(timestamp.asInstanceOf[Long]) + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = DateTimeUtils.getClass.getName.stripSuffix("$") +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getHours($c)""" +) + } +} + +case class Minute(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getMinutes(timestamp.asInstanceOf[Long]) + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = DateTimeUtils.getClass.getName.stripSuffix("$") +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getMinutes($c)""" +) + } +} + +case class Second(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getSeconds(timestamp.asInstanceOf[Long]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = DateTimeUtils.getClass.getName.stripSuffix("$") +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getSeconds($c)""" +) + } +} + +case class DayInYear(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override def prettyName: String = "day_in_year" + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getDayInYear(date.asInstanceOf[Int]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = DateTimeUtils.getClass.getName.stripSuffix("$") +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getDayInYear($c)""" +) + } +} + + +case class Year(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getYear(date.asInstanceOf[Int]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = DateTimeUtils.getClass.getName.stripSuffix("$") +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getYear($c)""" +) + } +} + +case class Quarter(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getQuarter(date.asInstanceOf[Int]) + } + + override protected def genCode(ctx: 
CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = DateTimeUtils.getClass.getName.stripSuffix("$") +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getQuarter($c)""" +) + } +} + +case class Month(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34939885 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -378,4 +395,208 @@ object DateTimeUtils { c.set(segments(0), segments(1) - 1, segments(2), 0, 0, 0) Some((c.getTimeInMillis / 1000 / 3600 / 24).toInt) } + + /** + * Returns the hour value of a given timestamp value. The timestamp is expressed in microseconds. + */ + def getHours(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000 / 3600) % 24).toInt + } + + /** + * Returns the minute value of a given timestamp value. The timestamp is expressed in + * microseconds. + */ + def getMinutes(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000 / 60) % 60).toInt + } + + /** + * Returns the second value of a given timestamp value. The timestamp is expressed in + * microseconds. + */ + def getSeconds(timestamp: Long): Int = { +((timestamp / 1000 / 1000) % 60).toInt + } + + private[this] def isLeapYear(year: Int): Boolean = { +(year % 4) == 0 && ((year % 100) != 0 || (year % 400) == 0) + } + + /** + * Return the number of days since the start of 400 year period. + * The second year of a 400 year period (year 1) starts on day 365. + */ + private[this] def yearBoundary(year: Int): Int = { +year * 365 + ((year / 4 ) - (year / 100) + (year / 400)) + } + + /** + * Calculates the number of years for the given number of days. This depends + * on a 400 year period. + * @param days days since the beginning of the 400 year period + * @return (number of year, days in year) + */ + private[this] def numYears(days: Int): (Int, Int) = { +val year = days / 365 +val boundary = yearBoundary(year) +if (days > boundary) (year, days - boundary) else (year - 1, days - yearBoundary(year - 1)) + } + + /** + * Calculates the year and and the number of the day in the year for the given + * number of days. The given days is the number of days since 1.1.1970. + * + * The calculation uses the fact that the period 1.1.2001 until 31.12.2400 is + * equals to the period 1.1.1601 until 31.12.2000. + */ + private[this] def getYearAndDayInYear(daysSince1970: Int): (Int, Int) = { +// add the difference (in days) between 1.1.1970 and the artificial year 0 (-17999) +val daysNormalized = daysSince1970 + toYearZero +val numOfQuarterCenturies = daysNormalized / daysIn400Years +val daysInThis400 = daysNormalized % daysIn400Years + 1 +val (years, dayInYear) = numYears(daysInThis400) +val year: Int = (2001 - 2) + 400 * numOfQuarterCenturies + years +(year, dayInYear) + } + + /** + * Returns the 'day in year' value for the given date. The date is expressed in days + * since 1.1.1970. + */ + def getDayInYear(date: Int): Int = { +getYearAndDayInYear(date)._2 + } + + /** + * Returns the year value for the given date. The date is expressed in days + * since 1.1.1970. + */ + def getYear(date: Int): Int = { +getYearAndDayInYear(date)._1 + } + + /** + * Returns the quarter for the given date. The date is expressed in days + * since 1.1.1970. + */ + def getQuarter(date: Int): Int = { +var (year, dayInYear) = getYearAndDayInYear(date) +if (isLeapYear(year)) { + dayInYear = dayInYear - 1 +} +if (dayInYear <= 90) { + 1 +} else if (dayInYear <= 181) { + 2 +} else if (dayInYear <= 273) { + 3 +} else { + 4 +} + } + + /** + * Returns the month value for the given date. 
The date is expressed in days + * since 1.1.1970. January is month 1. + */ + def getMonth(date: Int): Int = { +var (year, dayInYear) = getYearAndDayInYear(date) +var isLeap = isLeapYear(year) +if (isLeap && dayInYear > 60) { + dayInYear = dayInYear - 1 + isLeap = false +} + +if (dayInYear <= 181) { + if (dayInYear <= 90) { +if (dayInYear <= 31) { + 1 +} else if (dayInYear <= 59 || (isLeap && dayInYear <= 60)) { --- End diff -- But the problem is: if you subtract 1, you cannot differentiate between 1.3. and 29.2. You somehow have to add an additiona
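To make the truncated argument above concrete, a small self-contained sketch (my reconstruction, using the convention visible in the diff that day-in-year 60 of a leap year is Feb 29 and day 61 is Mar 1):

```scala
// Months beyond March are elided; the point is only the Feb 29 / Mar 1 split.
def monthOfLeapYearDay(day: Int): Int = {
  var dayInYear = day
  var isLeap = true
  // Shift only days strictly after Feb 29 back onto the non-leap calendar,
  // keeping the flag so day 60 itself can still be recognised as February.
  if (dayInYear > 60) { dayInYear -= 1; isLeap = false }
  if (dayInYear <= 31) 1
  else if (dayInYear <= 59 || (isLeap && dayInYear <= 60)) 2
  else 3
}

// monthOfLeapYearDay(60) == 2 (29.2.) only because of the retained isLeap
// flag; without it, day 60 would fall through to the non-leap table, where
// day 60 is 1.3., and 29.2. would become indistinguishable from 1.3.
```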
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/6981#issuecomment-122385842 @rxin / @davies Can you trigger Jenkins again? BTW: I already had this issue yesterday; Jenkins could not fetch from the repo. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9136][SQL] fix several bugs in DateTime...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7473#discussion_r34912321 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -320,16 +320,17 @@ object DateTimeUtils { Calendar.getInstance( TimeZone.getTimeZone(f"GMT${timeZone.get.toChar}${segments(7)}%02d:${segments(8)}%02d")) } +c.set(Calendar.MILLISECOND, 0) if (justTime) { - c.set(Calendar.HOUR, segments(3)) + c.set(Calendar.HOUR_OF_DAY, segments(3)) c.set(Calendar.MINUTE, segments(4)) c.set(Calendar.SECOND, segments(5)) } else { c.set(segments(0), segments(1) - 1, segments(2), segments(3), segments(4), segments(5)) } -Some(c.getTimeInMillis / 1000 * 100 + segments(6)) +Some(c.getTimeInMillis * 1000 + segments(6)) --- End diff -- Sorry, I didn't get this. Why is dividing by 1000 a problem if the date is before 1.1.1970? It's a long value and the last three digits are the millis. To cut them off, I divide by 1000. This has no impact on the sign or anything else, does it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
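The explanation acknowledged with "Got it!" earlier in this thread hinges on JVM division semantics; a minimal illustration (my example, not from the thread):

```scala
// Long division truncates toward zero, which differs from flooring exactly
// when the value is negative, i.e. for timestamps before 1.1.1970.
val after  = 1234L   // 1.234 s after the epoch, in millis
val before = -1234L  // 1.234 s before the epoch, in millis

println(after / 1000)   // 1   (truncation equals floor for non-negative values)
println(before / 1000)  // -1  (truncated toward zero; the floor would be -2)

// So millis / 1000 * 1000 only recovers the enclosing second boundary for
// non-negative timestamps:
println(after / 1000 * 1000)   // 1000
println(before / 1000 * 1000)  // -1000, not the flooring -2000
```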
[GitHub] spark pull request: [SPARK-9136][SQL] fix several bugs in DateTime...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7473#issuecomment-122340977 LGTM. 1. Yes, that was a bug. Thanks for fixing it. 2. See my comment. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9136][SQL] fix several bugs in DateTime...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7473#discussion_r34908143 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -320,16 +320,17 @@ object DateTimeUtils { Calendar.getInstance( TimeZone.getTimeZone(f"GMT${timeZone.get.toChar}${segments(7)}%02d:${segments(8)}%02d")) } +c.set(Calendar.MILLISECOND, 0) if (justTime) { - c.set(Calendar.HOUR, segments(3)) + c.set(Calendar.HOUR_OF_DAY, segments(3)) c.set(Calendar.MINUTE, segments(4)) c.set(Calendar.SECOND, segments(5)) } else { c.set(segments(0), segments(1) - 1, segments(2), segments(3), segments(4), segments(5)) } -Some(c.getTimeInMillis / 1000 * 100 + segments(6)) +Some(c.getTimeInMillis * 1000 + segments(6)) --- End diff -- I don't know why you have changed this. I divided by `1000` first in order to remove the call of `c.set(Calendar.MILLISECOND, 0)` in line 323. But I am fine with this; I guess it's a personal preference. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34870733 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateFunctionsSuite.scala --- @@ -0,0 +1,259 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import java.sql.{Timestamp, Date} +import java.text.SimpleDateFormat +import java.util.{TimeZone, Calendar} + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.types.{StringType, TimestampType, DateType} + +class DateFunctionsSuite extends SparkFunSuite with ExpressionEvalHelper { + + val oldDefault = TimeZone.getDefault + + val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss") + val sdfDate = new SimpleDateFormat("yyyy-MM-dd") + val d = new Date(sdf.parse("2015-04-08 13:10:15").getTime) + val ts = new Timestamp(sdf.parse("2013-11-08 13:10:15").getTime) + + test("Day in Year") { +val sdfDay = new SimpleDateFormat("D") +(2002 to 2004).foreach { y => + (0 to 11).foreach { m => +(0 to 5).foreach { i => --- End diff -- 28, 29, 30, 31, 1.1. It checks until the 1st or 2nd of the next month, doesn't it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34868858 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -378,4 +395,183 @@ object DateTimeUtils { c.set(segments(0), segments(1) - 1, segments(2), 0, 0, 0) Some((c.getTimeInMillis / 1000 / 3600 / 24).toInt) } + + /** + * Returns the hour value of a given timestamp value. The timestamp is expressed in microseconds. + */ + def getHours(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000 / 3600) % 24).toInt + } + + /** + * Returns the minute value of a given timestamp value. The timestamp is expressed in + * microseconds. + */ + def getMinutes(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000 / 60) % 60).toInt + } + + /** + * Returns the second value of a given timestamp value. The timestamp is expressed in + * microseconds. + */ + def getSeconds(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000) % 60).toInt + } + + private[this] def isLeapYear(year: Int): Boolean = { +(year % 4) == 0 && ((year % 100) != 0 || (year % 400) == 0) + } + + /** + * Return the number of days since the start of 400 year period. + * The second year of a 400 year period (year 1) starts on day 365. + */ + private[this] def yearBoundary(year: Int): Int = { +year * 365 + ((year / 4 ) - (year / 100) + (year / 400)) + } + + /** + * Calculates the number of years for the given number of days. This depends + * on a 400 year period. + * @param days days since the beginning of the 400 year period + * @return number of year + */ + private[this] def numYears(days: Int): Int = { --- End diff -- Sorry, I'm not sure if I got this right. I should return ``` val boundary = yearBoundary(year) if (days > boundary) (year, boundary) else (year - 1, yearBoundary(year - 1)) ``` in order to avoid the call of `yearBoundary` in the line `val dayInYear = daysInThis400 - yearBoundary(years)`. That was your proposal, wasn't it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8269][SQL]string function: initcap
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7208#discussion_r34867691 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -593,6 +593,33 @@ case class Levenshtein(left: Expression, right: Expression) extends BinaryExpres } /** + * Returns string, with the first letter of each word in uppercase, + * all other letters in lowercase. Words are delimited by whitespace. + */ +case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes { + override def dataType: DataType = StringType + + override def inputTypes: Seq[DataType] = Seq(StringType) + + override def nullSafeEval(string: Any): Any = { +if (string.asInstanceOf[UTF8String].getBytes.length == 0) { + return string +} +else { + val sb = new StringBuffer() + sb.append(string) + sb.setCharAt(0, sb.charAt(0).toUpper) + for (i <- 1 until sb.length) { +if (sb.charAt(i - 1).equals(' ')) { + sb.setCharAt(i, sb.charAt(i).toUpper) +} + } + UTF8String.fromString(sb.toString) --- End diff -- My idea would be that we check if the next character fits into a `Char`. If yes, we convert it to a `Char`, call `Character.toUpperCase(c)`, and change the result in the array. If we cannot convert it to a `Char`, we "ignore" it and don't change it. But as Reynold mentioned, we can do this in a second step. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
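A rough sketch of that byte-level idea (my interpretation, not code from the PR; it only uppercases single-byte ASCII characters and treats a plain space as the word delimiter):

```scala
def initCapAsciiOnly(utf8: Array[Byte]): Array[Byte] = {
  val out = utf8.clone()
  var atWordStart = true
  var i = 0
  while (i < out.length) {
    val b = out(i)
    if (b >= 0) { // single-byte UTF-8 character: safe to treat as a Char
      if (atWordStart) out(i) = Character.toUpperCase(b.toChar).toByte
      atWordStart = b.toChar == ' '
    } else {
      atWordStart = false // byte of a multi-byte code point: "ignore" it
    }
    i += 1
  }
  out
}
```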
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34867157 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala --- @@ -54,3 +60,193 @@ case class CurrentTimestamp() extends LeafExpression { System.currentTimeMillis() * 1000L } } + +case class Hour(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getHours(timestamp.asInstanceOf[Long]) + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getHours($c)""" +) + } +} + +case class Minute(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getMinutes(timestamp.asInstanceOf[Long]) + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getMinutes($c)""" +) + } +} + +case class Second(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getSeconds(timestamp.asInstanceOf[Long]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getSeconds($c)""" +) + } +} + +case class DayInYear(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override def prettyName: String = "day_in_year" + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getDayInYear(date.asInstanceOf[Int]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getDayInYear($c)""" +) + } +} + + +case class Year(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getYear(date.asInstanceOf[Int]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getYear($c)""" +) + } +} + +case class Quarter(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getQuarter(date.asInstanceOf[Int]) + } + + override protected def 
genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getQuarter($c)""" +) + } +} + +case class Month(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34866296 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala --- @@ -54,3 +60,193 @@ case class CurrentTimestamp() extends LeafExpression { System.currentTimeMillis() * 1000L } } + +case class Hour(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getHours(timestamp.asInstanceOf[Long]) + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" --- End diff -- `DateTimeUtils.getClass.getName` returns `org.apache.spark.sql.catalyst.util.DateTimeUtils$`. Is there a way to select the name automatically and not get the `$`?
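One common answer (offered here as a suggestion, not necessarily what the PR settled on) is to strip the trailing `$` that Scala appends to the class name of an `object`:

```
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// getClass.getName on a Scala object yields "...DateTimeUtils$";
// stripSuffix drops the "$" so the name is usable from generated Java code.
val dtu = DateTimeUtils.getClass.getName.stripSuffix("$")
// dtu == "org.apache.spark.sql.catalyst.util.DateTimeUtils"
```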
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34866224 --- Diff: python/pyspark/sql/functions.py --- @@ -652,6 +652,135 @@ def ntile(n): return Column(sc._jvm.functions.ntile(int(n))) +@since(1.5) +def dateFormat(dateCol, formatCol): --- End diff -- @rxin camel case or underscore?
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34866189 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala --- @@ -178,7 +178,17 @@ object FunctionRegistry { // datetime functions expression[CurrentDate]("current_date"), -expression[CurrentTimestamp]("current_timestamp") +expression[CurrentTimestamp]("current_timestamp"), +expression[DateFormatClass]("date_format"), +expression[Year]("year"), +expression[Quarter]("quarter"), +expression[Month]("month"), +expression[Day]("day"), --- End diff -- @rxin In Jira you mentioned there should be an alias. Can I just add `expression[Day]("day_of_month")`?
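Illustration only (a toy registry, not Spark's `FunctionRegistry`): an alias is simply the same builder registered under a second name, which is all that an extra `expression[Day]("day_of_month")` entry would do.

```
// Toy sketch: two names resolving to one builder -- the shape of an alias.
object AliasSketch {
  private val registry = scala.collection.mutable.Map.empty[String, Seq[String] => String]

  def register(name: String)(builder: Seq[String] => String): Unit =
    registry.update(name, builder)

  def main(args: Array[String]): Unit = {
    val day: Seq[String] => String = xs => s"Day(${xs.mkString(", ")})"
    register("day")(day)
    register("day_of_month")(day) // the alias: same builder, second name
    println(registry("day_of_month")(Seq("2015-07-15"))) // Day(2015-07-15)
  }
}
```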
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34866129 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala --- @@ -54,3 +60,193 @@ case class CurrentTimestamp() extends LeafExpression { System.currentTimeMillis() * 1000L } } + +case class Hour(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getHours(timestamp.asInstanceOf[Long]) + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getHours($c)""" +) + } +} + +case class Minute(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getMinutes(timestamp.asInstanceOf[Long]) + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getMinutes($c)""" +) + } +} + +case class Second(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getSeconds(timestamp.asInstanceOf[Long]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getSeconds($c)""" +) + } +} + +case class DayInYear(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override def prettyName: String = "day_in_year" + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getDayInYear(date.asInstanceOf[Int]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getDayInYear($c)""" +) + } +} + + +case class Year(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getYear(date.asInstanceOf[Int]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getYear($c)""" +) + } +} + +case class Quarter(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getQuarter(date.asInstanceOf[Int]) + } + + override protected def 
genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getQuarter($c)""" +) + } +} + +case class Month(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34865980 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -378,4 +395,183 @@ object DateTimeUtils { c.set(segments(0), segments(1) - 1, segments(2), 0, 0, 0) Some((c.getTimeInMillis / 1000 / 3600 / 24).toInt) } + + /** + * Returns the hour value of a given timestamp value. The timestamp is expressed in microseconds. + */ + def getHours(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000 / 3600) % 24).toInt + } + + /** + * Returns the minute value of a given timestamp value. The timestamp is expressed in + * microseconds. + */ + def getMinutes(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000 / 60) % 60).toInt + } + + /** + * Returns the second value of a given timestamp value. The timestamp is expressed in + * microseconds. + */ + def getSeconds(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000) % 60).toInt + } + + private[this] def isLeapYear(year: Int): Boolean = { +(year % 4) == 0 && ((year % 100) != 0 || (year % 400) == 0) + } + + /** + * Return the number of days since the start of 400 year period. + * The second year of a 400 year period (year 1) starts on day 365. + */ + private[this] def yearBoundary(year: Int): Int = { +year * 365 + ((year / 4 ) - (year / 100) + (year / 400)) + } + + /** + * Calculates the number of years for the given number of days. This depends + * on a 400 year period. + * @param days days since the beginning of the 400 year period + * @return number of year + */ + private[this] def numYears(days: Int): Int = { +val year = days / 365 +if (days > yearBoundary(year)) year else year - 1 + } + + /** + * Calculates the year and and the number of the day in the year for the given + * number of days. The given days is the number of days since 1.1.1970. + * + * The calculation uses the fact that the period 1.1.2001 until 31.12.2400 is + * equals to the period 1.1.1601 until 31.12.2000. + */ + private[this] def getYearAndDayInYear(daysSince1970: Int): (Int, Int) = { +// add the difference (in days) between 1.1.1970 and the artificial year 0 (-17999) +val daysNormalized = daysSince1970 + toYearZero +val numOfQuarterCenturies = daysNormalized / daysIn400Years +val daysInThis400 = daysNormalized % daysIn400Years + 1 +val years = numYears(daysInThis400) +val year: Int = (2001 - 2) + 400 * numOfQuarterCenturies + years +val dayInYear = daysInThis400 - yearBoundary(years) +(year, dayInYear) + } + + /** + * Returns the 'day in year' value for the given date. The date is expressed in days + * since 1.1.1970. + */ + def getDayInYear(date: Int): Int = { +getYearAndDayInYear(date)._2 + } + + /** + * Returns the year value for the given date. The date is expressed in days + * since 1.1.1970. + */ + def getYear(date: Int): Int = { +getYearAndDayInYear(date)._1 + } + + /** + * Returns the quarter for the given date. The date is expressed in days + * since 1.1.1970. + */ + def getQuarter(date: Int): Int = { +val (year, dayInYear) = getYearAndDayInYear(date) +val leap = if (isLeapYear(year)) 1 else 0 +if (dayInYear <= 90 + leap) { + 1 +} else if (dayInYear <= 181 + leap) { + 2 +} else if (dayInYear <= 273 + leap) { + 3 +} else { + 4 +} + } + + /** + * Returns the month value for the given date. 
The date is expressed in days + * since 1.1.1970. January is month 1. + */ + def getMonth(date: Int): Int = { +val (year, dayInYear) = getYearAndDayInYear(date) +val leap = if (isLeapYear(year)) 1 else 0 +if (dayInYear <= 31) { + 1 +} else if (dayInYear <= 59 + leap) { + 2 +} else if (dayInYear <= 90 + leap) { + 3 +} else if (dayInYear <= 120 + leap) { + 4 +} else if (dayInYear <= 151 + leap) { + 5 +} else if (dayInYear <= 181 + leap) { + 6 +} else if (dayInYear <= 212 + leap) { + 7 +} else if (dayInYear <= 243 + leap) {
[GitHub] spark pull request: [SPARK-8269][SQL]string function: initcap
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7208#discussion_r34865919 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -593,6 +593,33 @@ case class Levenshtein(left: Expression, right: Expression) extends BinaryExpres } /** + * Returns string, with the first letter of each word in uppercase, + * all other letters in lowercase. Words are delimited by whitespace. + */ +case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes { + override def dataType: DataType = StringType + + override def inputTypes: Seq[DataType] = Seq(StringType) + + override def nullSafeEval(string: Any): Any = { +if (string.asInstanceOf[UTF8String].getBytes.length == 0) { + return string +} +else { + val sb = new StringBuffer() + sb.append(string) + sb.setCharAt(0, sb.charAt(0).toUpper) + for (i <- 1 until sb.length) { +if (sb.charAt(i - 1).equals(' ')) { + sb.setCharAt(i, sb.charAt(i).toUpper) +} + } + UTF8String.fromString(sb.toString) --- End diff -- I think we should consider implementing all of this on bytes directly. The conversion to `Char` isn't safe; I'm not sure what happens if a character doesn't fit into `Char`. Using the assumption that a lowercase and an uppercase character always have the same number of bytes, we could easily use `Array[Byte]`. Even though this isn't guaranteed by Unicode, it seems to be true (maybe we could propose this to Unicode). But we can do this in a follow-up PR.
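For comparison, a code-point based variant that stays safe for characters outside the Basic Multilingual Plane (a sketch; it ignores locale-sensitive and one-to-many case mappings such as `ß` -> `SS`):

```
// Sketch: upper-case by Unicode code point instead of raw bytes or Chars,
// so supplementary characters (which span two Chars) are handled correctly.
def toUpperByCodePoint(s: String): String = {
  val sb = new java.lang.StringBuilder(s.length)
  var i = 0
  while (i < s.length) {
    val cp = s.codePointAt(i)
    sb.appendCodePoint(Character.toUpperCase(cp))
    i += Character.charCount(cp) // 1 for BMP characters, 2 for supplementary
  }
  sb.toString
}
```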
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/6981#issuecomment-122055145 @davies could you trigger Jenkins? I'd like to get an idea of what is still crashing. I expect that `WeekOfYear` will crash (because of different timezones). The other stuff should be resolved by the new casting implementation.
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34762666 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,202 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding a corresponding [[Long]] value. + * The return type is [[Option]] in order to distinguish between 0L and null. The following + * formats are allowed: + * + * `` + * `-[m]m` + * `-[m]m-[d]d` + * `-[m]m-[d]d ` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + */ + def stringToTimestamp(s: UTF8String): Option[Long] = { +if (s == null) { + return None +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +var justTime = false +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - '0'.toByte + if (parsedValue < 0 || parsedValue > 9) { +if (j == 0 && b == 'T') { + justTime = true + i += 3 +} else if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else if (i == 0 && b == ':') { +justTime = true +segments(3) = currentSegmentValue +currentSegmentValue = 0 +i = 4 + } else { +return None + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } +} else if (i == 3 || i == 4) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } +} else if (i == 5 || i == 6) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } + if (i == 6 && b != '.') { +i += 1 + } +} else { + if (b == ':' || b == ' ') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} + +segments(i) = currentSegmentValue + +while (digitsMilli < 6) { + segments(6) *= 10 + digits
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34761324 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,202 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding a corresponding [[Long]] value. + * The return type is [[Option]] in order to distinguish between 0L and null. The following + * formats are allowed: + * + * `` + * `-[m]m` + * `-[m]m-[d]d` + * `-[m]m-[d]d ` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + */ + def stringToTimestamp(s: UTF8String): Option[Long] = { +if (s == null) { + return None +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +var justTime = false +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - '0'.toByte + if (parsedValue < 0 || parsedValue > 9) { +if (j == 0 && b == 'T') { + justTime = true + i += 3 +} else if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else if (i == 0 && b == ':') { +justTime = true +segments(3) = currentSegmentValue +currentSegmentValue = 0 +i = 4 + } else { +return None + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } +} else if (i == 3 || i == 4) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } +} else if (i == 5 || i == 6) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } + if (i == 6 && b != '.') { +i += 1 + } +} else { + if (b == ':' || b == ' ') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} + +segments(i) = currentSegmentValue + +while (digitsMilli < 6) { + segments(6) *= 10 + digits
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34742916 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,169 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding [[Timestamp]] object. The format of the + * date has to be one of the following: ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + */ + def stringToTimestamp(s: UTF8String): Timestamp = { +if (s == null) { + return null +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - 48 + if (parsedValue < 0 || parsedValue > 9) { +if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 5) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 7) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } + if (i == 6 && b != '.') { +i += 1 + } +} else if (i > 6) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} +if (i > 8) { + return null +} +segments(i) = currentSegmentValue + +// Hive compatibility 2011-05-06 07:08:09.1000 == 2011-05-06 07:08:09.1 +if (digitsMilli == 4) { + segments(6) = segments(6) / 10 +} + +// 18:3:1.1 is equals to 18:3:1:100 +if (digitsMilli == 1) { + segments(6) = segments(6) * 100 +} else if (digitsMilli == 2) { + segments(6) = segments(6) * 10 +} + +if (segments(0) < 0 || segments(0) > || segments(1) < 1 || segments(1) > 12 || +segments(2) < 1 || segments(2) > 31 || segments(3) < 0 || segments(3) > 23 || +segments(4) < 0 || segments(4) > 59 || segments(5) < 0 || segments(5) > 59 || +segments(6) < 0 || segments(6) > 999 || segments(7) < 0 || segments(7) > 14 || +segments(8) < 0 || segments(8) > 59) { + return null +} +val c = if (timeZone.isEmpty) { + Calendar.getInstance() +} else { + Calendar.getInstance( + TimeZone.getTimeZone(f"GMT${timeZone.get
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7353#issuecomment-121697293 @davies somehow Jenkins wasn't able to fetch from GitHub. Could you trigger Jenkins again?
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7353#issuecomment-121118086 @davies How should we deal with this? I don't know the value of `'value`, but it seems to be something that can be parsed to `1.1.1970`. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37171/testReport/org.apache.spark.sql.hive.execution/HiveQuerySuite/Cast_Timestamp_to_Timestamp_in_UDF/
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34526036 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,169 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding [[Timestamp]] object. The format of the + * date has to be one of the following: ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + */ + def stringToTimestamp(s: UTF8String): Timestamp = { +if (s == null) { + return null +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - 48 + if (parsedValue < 0 || parsedValue > 9) { +if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 5) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 7) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } + if (i == 6 && b != '.') { +i += 1 + } +} else if (i > 6) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} +if (i > 8) { + return null +} +segments(i) = currentSegmentValue + +// Hive compatibility 2011-05-06 07:08:09.1000 == 2011-05-06 07:08:09.1 +if (digitsMilli == 4) { + segments(6) = segments(6) / 10 +} + +// 18:3:1.1 is equals to 18:3:1:100 +if (digitsMilli == 1) { + segments(6) = segments(6) * 100 +} else if (digitsMilli == 2) { + segments(6) = segments(6) * 10 +} + +if (segments(0) < 0 || segments(0) > || segments(1) < 1 || segments(1) > 12 || +segments(2) < 1 || segments(2) > 31 || segments(3) < 0 || segments(3) > 23 || +segments(4) < 0 || segments(4) > 59 || segments(5) < 0 || segments(5) > 59 || +segments(6) < 0 || segments(6) > 999 || segments(7) < 0 || segments(7) > 14 || +segments(8) < 0 || segments(8) > 59) { + return null +} +val c = if (timeZone.isEmpty) { + Calendar.getInstance() +} else { + Calendar.getInstance( + TimeZone.getTimeZone(f"GM
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34523798 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,169 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding [[Timestamp]] object. The format of the + * date has to be one of the following: ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + */ + def stringToTimestamp(s: UTF8String): Timestamp = { +if (s == null) { + return null +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - 48 + if (parsedValue < 0 || parsedValue > 9) { +if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 5) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 7) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } + if (i == 6 && b != '.') { +i += 1 + } +} else if (i > 6) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} +if (i > 8) { + return null +} +segments(i) = currentSegmentValue + +// Hive compatibility 2011-05-06 07:08:09.1000 == 2011-05-06 07:08:09.1 +if (digitsMilli == 4) { + segments(6) = segments(6) / 10 +} + +// 18:3:1.1 is equals to 18:3:1:100 +if (digitsMilli == 1) { --- End diff -- Good idea! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
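The scaling being agreed to, as a standalone sketch (hypothetical helper name; this revision of the patch scales to three millisecond digits, while the later `Option[Long]` revision quoted earlier in this digest scales to six microsecond digits):

```
// Sketch: normalize a parsed fractional-second value to exactly three digits,
// replacing the digitsMilli special cases with two small loops.
def normalizeMillis(value: Int, digits: Int): Int = {
  var v = value
  var d = digits
  while (d < 3) { v *= 10; d += 1 } // "18:3:1.1" -> 100 milliseconds
  while (d > 3) { v /= 10; d -= 1 } // ".1000"    -> 100 (Hive compatibility)
  v
}
```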
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34523785 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,169 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding [[Timestamp]] object. The format of the + * date has to be one of the following: ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + */ + def stringToTimestamp(s: UTF8String): Timestamp = { +if (s == null) { + return null +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - 48 + if (parsedValue < 0 || parsedValue > 9) { +if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 5) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 7) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } + if (i == 6 && b != '.') { +i += 1 + } +} else if (i > 6) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} +if (i > 8) { + return null --- End diff -- Okay. If there is a space the garbage is ignored --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34523735 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,169 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding [[Timestamp]] object. The format of the + * date has to be one of the following: ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + */ + def stringToTimestamp(s: UTF8String): Timestamp = { +if (s == null) { + return null +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - 48 + if (parsedValue < 0 || parsedValue > 9) { +if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 5) { + if (b == ':') { --- End diff -- This is equal to `i == 3 || i == 4` because of the `if` and `else if` branches before. I am going to adjust the checks so that they are more readable.
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34521930 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,169 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding [[Timestamp]] object. The format of the + * date has to be one of the following: ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + */ + def stringToTimestamp(s: UTF8String): Timestamp = { +if (s == null) { + return null +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - 48 + if (parsedValue < 0 || parsedValue > 9) { +if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 5) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 7) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } + if (i == 6 && b != '.') { +i += 1 + } +} else if (i > 6) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} +if (i > 8) { + return null +} +segments(i) = currentSegmentValue + +// Hive compatibility 2011-05-06 07:08:09.1000 == 2011-05-06 07:08:09.1 +if (digitsMilli == 4) { + segments(6) = segments(6) / 10 +} + +// 18:3:1.1 is equals to 18:3:1:100 +if (digitsMilli == 1) { + segments(6) = segments(6) * 100 +} else if (digitsMilli == 2) { + segments(6) = segments(6) * 10 +} + +if (segments(0) < 0 || segments(0) > || segments(1) < 1 || segments(1) > 12 || +segments(2) < 1 || segments(2) > 31 || segments(3) < 0 || segments(3) > 23 || +segments(4) < 0 || segments(4) > 59 || segments(5) < 0 || segments(5) > 59 || +segments(6) < 0 || segments(6) > 999 || segments(7) < 0 || segments(7) > 14 || +segments(8) < 0 || segments(8) > 59) { + return null +} +val c = if (timeZone.isEmpty) { + Calendar.getInstance() +} else { + Calendar.getInstance( + TimeZone.getTimeZone(f"GMT${timeZone.get
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7353#issuecomment-121007809 @davies thanks for all your good comments. I'm going to incorporate your suggestions. I don't accept `18:03:20` yet; the design document doesn't allow this. I think we should parse a pure time string to today + time. But I wanted to double-check this with you guys.
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34486426 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala --- @@ -165,15 +165,8 @@ case class Cast(child: Expression, dataType: DataType) extends UnaryExpression w private[this] def castToTimestamp(from: DataType): Any => Any = from match { case StringType => buildCast[UTF8String](_, utfs => { -// Throw away extra if more than 9 decimal places -val s = utfs.toString -val periodIdx = s.indexOf(".") -var n = s -if (periodIdx != -1 && n.length() - periodIdx > 9) { - n = n.substring(0, periodIdx + 10) -} -try DateTimeUtils.fromJavaTimestamp(Timestamp.valueOf(n)) -catch { case _: java.lang.IllegalArgumentException => null } +val parsedDateString = DateTimeUtils.stringToTimestamp(utfs) +if (parsedDateString == null) null else DateTimeUtils.fromJavaTimestamp(parsedDateString) --- End diff -- I'm going to adjust this.
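With the `Option[Long]`-returning `stringToTimestamp` quoted earlier in this digest, the adjustment can collapse the cast to a map-or-null (a sketch of the direction, not the merged code):

```
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.unsafe.types.UTF8String

// Sketch: no intermediate java.sql.Timestamp and no explicit null check --
// None becomes null, Some(us) becomes the boxed microsecond value.
def castStringToTimestamp(utfs: UTF8String): Any =
  DateTimeUtils.stringToTimestamp(utfs).map(Long.box).orNull
```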
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34486393 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,169 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding [[Timestamp]] object. The format of the + * date has to be one of the following: ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + */ + def stringToTimestamp(s: UTF8String): Timestamp = { +if (s == null) { + return null +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - 48 + if (parsedValue < 0 || parsedValue > 9) { +if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 5) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 7) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } + if (i == 6 && b != '.') { +i += 1 + } +} else if (i > 6) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} +if (i > 8) { + return null +} +segments(i) = currentSegmentValue + +// Hive compatibility 2011-05-06 07:08:09.1000 == 2011-05-06 07:08:09.1 +if (digitsMilli == 4) { + segments(6) = segments(6) / 10 +} + +// 18:3:1.1 is equals to 18:3:1:100 +if (digitsMilli == 1) { + segments(6) = segments(6) * 100 +} else if (digitsMilli == 2) { + segments(6) = segments(6) * 10 +} + +if (segments(0) < 0 || segments(0) > || segments(1) < 1 || segments(1) > 12 || +segments(2) < 1 || segments(2) > 31 || segments(3) < 0 || segments(3) > 23 || +segments(4) < 0 || segments(4) > 59 || segments(5) < 0 || segments(5) > 59 || +segments(6) < 0 || segments(6) > 999 || segments(7) < 0 || segments(7) > 14 || +segments(8) < 0 || segments(8) > 59) { + return null +} +val c = if (timeZone.isEmpty) { + Calendar.getInstance() +} else { + Calendar.getInstance( + TimeZone.getTimeZone(f"GMT${timeZo
[GitHub] spark pull request: [SPARK-8269][SQL]string function: initcap
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7208#discussion_r34435744 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -570,6 +570,37 @@ case class StringLength(child: Expression) extends UnaryExpression with ExpectsI } /** + * Returns string, with the first letter of each word in uppercase, + * all other letters in lowercase. Words are delimited by whitespace. + */ +case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes { + override def dataType: DataType = StringType + + override def inputTypes: Seq[DataType] = Seq(StringType) + + override def eval(input: InternalRow): Any = { +val string = child.eval(input) +if (string == null) { + null +} +else if (string.asInstanceOf[UTF8String].getBytes.length == 0) { + UTF8String.fromString(string.toString) --- End diff -- @HuJiayin there was an `n` missing. I updated the command.
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7353#issuecomment-120668526 @cloud-fan The following strings can be parsed now: (String -> Date) ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, `-[m]m-[d]d *`, `-[m]m-[d]dT*` (String -> Timestamp) ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`
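For a sense of how these formats behave at the API level, a hedged usage sketch against the `Option[Long]`-returning signature quoted earlier in this digest (actual values depend on the default time zone):

```
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.unsafe.types.UTF8String

// Accepted: full timestamp with fractional seconds.
val ok = DateTimeUtils.stringToTimestamp(
  UTF8String.fromString("2015-07-15 10:00:00.1234")) // Some(<microseconds>)

// Rejected: the month segment is out of range, so the result is None
// rather than some arbitrary value.
val bad = DateTimeUtils.stringToTimestamp(
  UTF8String.fromString("2015-13-01")) // None
```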
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7353#issuecomment-120637117 Okay, yes, I'm going to extend it. What Hive does support is parsing the minute from `13:20:08`. Our minute method requires a Timestamp. Either we don't support the `minute` expression on pure time information, or we define a cast from a time string to timestamp. But even if we do that, we are still not compatible with Hive, because we would parse a pure date like `2015-10-20` to `2015-10-20 00:00:00` and could extract the minute `0`, whereas Hive would return `null`.
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/6981#issuecomment-120564697 @davies I proposed a solution for the cast issue in #7353