[GitHub] spark pull request: [SPARK-9400][SQL] codeGen stringLocate
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7779#issuecomment-126193054

Jenkins, ok to test.
[GitHub] spark pull request: [SPARK-9401][SQL] codeGen concatWs
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7782

[SPARK-9401][SQL] codeGen concatWs

Jira: https://issues.apache.org/jira/browse/SPARK-9401

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9401

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7782.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7782

commit 46a6c20da83939d70020c880a964d6bd10fcd00c
Author: Tarek Auel
Date: 2015-07-30T05:44:14Z

    [SPARK-9401][SQL] codeGen concatWs
[GitHub] spark pull request: [SPARK-9400][SQL] codeGen stringLocate
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7779

[SPARK-9400][SQL] codeGen stringLocate

Jira: https://issues.apache.org/jira/browse/SPARK-9400

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9400

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7779.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7779

commit 4c27625eaf3d6b02940390ac31f3000ac8247552
Author: Tarek Auel
Date: 2015-07-30T04:59:37Z

    [SPARK-9400][SQL] codeGen stringLocate
[GitHub] spark pull request: [SPARK-9403][SQL] codeGen in / inSet
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7778#discussion_r35837995

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ---

@@ -107,21 +109,71 @@ case class In(value: Expression, list: Seq[Expression]) extends Predicate with C
     val evaluatedValue = value.eval(input)
     list.exists(e => e.eval(input) == evaluatedValue)
   }
+
+  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    if (list.isEmpty) {
+      s"""
+        ${ev.primitive} = false;
+        ${ev.isNull} = false;
+      """
+    } else {
+      val valueGen = value.gen(ctx)
+      val listGen = list.map(_.gen(ctx))
+      val listCode = listGen.map(x =>
+        s"""
+          if (!${ev.primitive}) {
+            ${x.code}
+            if (${classOf[Objects].getName}.equals(${valueGen.primitive}, ${x.primitive})) {
+              ${ev.primitive} = true;
+            }
+          }
+        """).foldLeft("")((a, b) => a + "\n" + b)
+      s"""
+        ${valueGen.code}
+        boolean ${ev.primitive} = false;
+        boolean ${ev.isNull} = false;
+        $listCode
+      """
+    }
+  }
+}
+
+/**
+ * Helper companion object in order to support code generation.
+ */
+object InSet {
+
+  @transient var hset: Set[Any] = null

--- End diff --

@rxin Is there a better way to expose `hset` to the codeGen stuff?
[GitHub] spark pull request: [SPARK-9403][SQL] codeGen in / inSet
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7778

[SPARK-9403][SQL] codeGen in / inSet

Jira: https://issues.apache.org/jira/browse/SPARK-9403

@rxin ping

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9403

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7778.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7778

commit e69ebaa95340399f1112edf05745e1711cfdbdeb
Author: Tarek Auel
Date: 2015-07-30T04:44:05Z

    [SPARK-9403][SQL] codeGen in / inSet
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7571#issuecomment-123481020

@rxin Could you trigger Jenkins?
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7571#issuecomment-123446243

Should I add it again?
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7571#discussion_r35140450

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -486,6 +486,10 @@ case class StringFormat(children: Expression*) extends Expression with CodegenFa
   private def format: Expression = children(0)
   private def args: Seq[Expression] = children.tail

+  override def inputTypes: Seq[AbstractDataType] =
+    StringType :: List.fill(children.size - 1)(AnyDataType)

--- End diff --

I updated as well.
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7571

[SPARK-9154][SQL] codegen StringFormat

Jira: https://issues.apache.org/jira/browse/SPARK-9154

Fixes the bug of #7546. @marmbrus I can't reopen the other PR because I didn't close it. Can you trigger Jenkins?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9154

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7571.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7571

commit 086caba76f646f86840a2cee325188895ab42c8f
Author: Tarek Auel
Date: 2015-07-20T16:29:03Z

    [SPARK-9154][SQL] codegen string format

commit cd8322bc4e6c15cd9911363c4596eba1a935fcdd
Author: Tarek Auel
Date: 2015-07-20T21:40:30Z

    [SPARK-9154][SQL] codegen string format

commit 10b4de88c817a474b7b0a83d948cb86927638775
Author: Tarek Auel
Date: 2015-07-20T21:42:28Z

    [SPARK-9154][SQL] codegen removed fallback trait

commit a943d3e60649f4267e40376c0bb1ff30ae024436
Author: Tarek Auel
Date: 2015-07-21T06:26:58Z

    [SPARK-9154] implicit input cast, added tests for null, support for null primitives

commit f512c5f9219d38c2445e4e776aa739ae6310bb60
Author: Tarek Auel
Date: 2015-07-21T18:57:27Z

    [SPARK-9154][SQL] build fix
[GitHub] spark pull request: [Spark-8244][SQL] string function: find in set
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7186#issuecomment-123441184

Jenkins, test this please.
[GitHub] spark pull request: Revert "[SPARK-9154] [SQL] codegen StringForma...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7570#issuecomment-123419444

Can I reopen the last PR to fix the issue, or do I have to create a new one because the old one got merged?
[GitHub] spark pull request: [SPARK-9152][SQL] Implement code generation fo...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7561#discussion_r35079064

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -196,6 +254,40 @@ case class RLike(left: Expression, right: Expression)
   override def escape(v: String): String = v
   override def matches(regex: Pattern, str: String): Boolean = regex.matcher(str).find(0)
   override def toString: String = s"$left RLIKE $right"
+
+  override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    val patternClass = classOf[Pattern].getName
+
+    val literalRight: String = right match {
+      case x @ Literal(value: String, StringType) => escape(value)
+      case _ => null
+    }

--- End diff --

Okay, got it. But this caches the value only if it's a literal. I think if we save the value in `mutableState`, we could even use this "cache" when an expression returns the same result as the previous one.
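For illustration only (this is not code from the PR): with the Spark 1.5-era `CodeGenContext`, such a cache could be registered as mutable state on the generated class. The sketch assumes the `addMutableState(javaType, name, initCode)` signature of that API and a `rightStr` variable holding the evaluated right-hand side in the generated code:

```Scala
import java.util.regex.Pattern
import org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext

// Sketch: recompile the Pattern only when the regex string changes, so
// consecutive rows with the same non-literal right-hand side reuse the cache.
def cachedPatternCode(ctx: CodeGenContext): String = {
  val patternClass = classOf[Pattern].getName
  val pattern = ctx.freshName("pattern")      // cached compiled pattern
  val lastRegex = ctx.freshName("lastRegex")  // regex the cache was built from
  ctx.addMutableState(patternClass, pattern, s"$pattern = null;")
  ctx.addMutableState("String", lastRegex, s"$lastRegex = null;")
  s"""
    if ($lastRegex == null || !$lastRegex.equals(rightStr)) {
      $lastRegex = rightStr;
      $pattern = $patternClass.compile(rightStr);
    }
  """
}
```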
[GitHub] spark pull request: [SPARK-9152][SQL] Implement code generation fo...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7561#discussion_r35078428

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -187,6 +187,64 @@ case class Like(left: Expression, right: Expression)
   override def matches(regex: Pattern, str: String): Boolean = regex.matcher(str).matches()
   override def toString: String = s"$left LIKE $right"
+
+  override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    val patternClass = classOf[Pattern].getName
+
+    val literalRight: String = right match {
+      case x @ Literal(value: String, StringType) => escape(value)
+      case _ => null
+    }
+
+    val leftGen = left.gen(ctx)
+    val rightGen = right.gen(ctx)
+
+    val patternCode =
+      if (literalRight != null) {
+        s"${patternClass} pattern = $patternClass.compile($literalRight);"
+      } else {
+        s"""
+          StringBuilder regex = new StringBuilder("(?s)");
+          for (int idx = 1; idx < rightStr.length(); idx++) {
+            char prev = rightStr.charAt(idx - 1);
+            char curr = rightStr.charAt(idx);
+            if (prev == '\\') {
+              if (curr == '_') {
+                regex.append("_");
+              } else if (curr == '%') {
+                regex.append("%");
+              } else {
+                regex.append(${patternClass}.quote("" + curr));
+              }
+            } else {
+              if (curr != '\\') {
+                if (curr == '_') {
+                  regex.append(".");
+                } else if (curr == '%') {
+                  regex.append(".*");
+                } else {
+                  regex.append(${patternClass}.quote((new Character(curr)).toString()));
+                }
+              }
+            }
+          }
+          ${patternClass} pattern = ${patternClass}.compile(regex.toString());
+        """
+      }

--- End diff --

That is exactly what we want. `escape` is totally independent from the expression itself, isn't it? This simplifies the codegen, removes duplicated code, and has no negative impact.
[GitHub] spark pull request: [SPARK-9152][SQL] Implement code generation fo...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7561#discussion_r35074663

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -196,6 +254,40 @@ case class RLike(left: Expression, right: Expression)
   override def escape(v: String): String = v
   override def matches(regex: Pattern, str: String): Boolean = regex.matcher(str).find(0)
   override def toString: String = s"$left RLIKE $right"
+
+  override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    val patternClass = classOf[Pattern].getName
+
+    val literalRight: String = right match {
+      case x @ Literal(value: String, StringType) => escape(value)
+      case _ => null
+    }
+
+    val leftGen = left.gen(ctx)
+    val rightGen = right.gen(ctx)
+
+    val patternCode =
+      if (literalRight != null) {
+        s"${patternClass} pattern = $patternClass.compile($literalRight);"
+      } else {
+        s"""
+          ${patternClass} pattern = ${patternClass}.compile(rightStr);
+        """
+      }
+
+    s"""
+      ${leftGen.code}
+      ${rightGen.code}

--- End diff --

Please use a logic like this:

    codeA
    nullCheckA
    codeB
    nullCheckB

This makes it possible to skip the evaluation of `right` if `left` is already null.
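As a hand-written sketch of that shape (not the PR's output; the `GeneratedExpressionCode` fields are as in Spark 1.5, and the helper name is made up), the second operand's code sits inside the first operand's null check, so a null `left` short-circuits:

```Scala
import org.apache.spark.sql.catalyst.expressions.codegen.GeneratedExpressionCode

// Sketch: emit right's code only inside left's null check; if left is null,
// right is never evaluated and the result stays null.
def nullSafeBinaryCode(
    leftGen: GeneratedExpressionCode,
    rightGen: GeneratedExpressionCode,
    isNull: String,
    primitive: String,
    compute: String): String =
  s"""
    ${leftGen.code}
    boolean $isNull = true;
    boolean $primitive = false;
    if (!${leftGen.isNull}) {
      ${rightGen.code}
      if (!${rightGen.isNull}) {
        $isNull = false;
        $compute
      }
    }
  """
```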
[GitHub] spark pull request: [SPARK-9152][SQL] Implement code generation fo...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7561#discussion_r35074591

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -196,6 +254,40 @@ case class RLike(left: Expression, right: Expression)
   override def escape(v: String): String = v
   override def matches(regex: Pattern, str: String): Boolean = regex.matcher(str).find(0)
   override def toString: String = s"$left RLIKE $right"
+
+  override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    val patternClass = classOf[Pattern].getName
+
+    val literalRight: String = right match {
+      case x @ Literal(value: String, StringType) => escape(value)
+      case _ => null
+    }

--- End diff --

I guess this doesn't make sense here. This uses the interpreted evaluation. If you want to cache something, have a look at `ctx.addMutableState`. That allows caching things.
[GitHub] spark pull request: [SPARK-9152][SQL] Implement code generation fo...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7561#discussion_r35074260

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -187,6 +187,64 @@ case class Like(left: Expression, right: Expression)
   override def matches(regex: Pattern, str: String): Boolean = regex.matcher(str).matches()
   override def toString: String = s"$left LIKE $right"
+
+  override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    val patternClass = classOf[Pattern].getName
+
+    val literalRight: String = right match {
+      case x @ Literal(value: String, StringType) => escape(value)
+      case _ => null
+    }
+
+    val leftGen = left.gen(ctx)
+    val rightGen = right.gen(ctx)
+
+    val patternCode =
+      if (literalRight != null) {
+        s"${patternClass} pattern = $patternClass.compile($literalRight);"
+      } else {
+        s"""
+          StringBuilder regex = new StringBuilder("(?s)");
+          for (int idx = 1; idx < rightStr.length(); idx++) {
+            char prev = rightStr.charAt(idx - 1);
+            char curr = rightStr.charAt(idx);
+            if (prev == '\\') {
+              if (curr == '_') {
+                regex.append("_");
+              } else if (curr == '%') {
+                regex.append("%");
+              } else {
+                regex.append(${patternClass}.quote("" + curr));
+              }
+            } else {
+              if (curr != '\\') {
+                if (curr == '_') {
+                  regex.append(".");
+                } else if (curr == '%') {
+                  regex.append(".*");
+                } else {
+                  regex.append(${patternClass}.quote((new Character(curr)).toString()));
+                }
+              }
+            }
+          }
+          ${patternClass} pattern = ${patternClass}.compile(regex.toString());
+        """
+      }

--- End diff --

Please put all the code above in a static context. Then you can call it from `codeGen` and from the interpreted code, and we avoid duplicated code.
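A minimal sketch of that refactoring, assuming a hypothetical `LikeUtils` helper object (the name and the exact escaping rules shown are illustrative, not taken from the PR). Generated code can then reference it through `classOf[LikeUtils].getName`, and `eval` can call it directly, so both paths share one implementation:

```Scala
import java.util.regex.Pattern

// Hypothetical shared helper: translate a SQL LIKE pattern into a Java regex.
object LikeUtils {
  def escapeLikeRegex(v: String): String = {
    val sb = new StringBuilder("(?s)")
    var i = 0
    while (i < v.length) {
      val c = v.charAt(i)
      if (c == '\\' && i + 1 < v.length) {
        val next = v.charAt(i + 1)
        // An escaped _ or % matches itself; any other escaped char is quoted.
        sb.append(if (next == '_' || next == '%') next.toString else Pattern.quote(next.toString))
        i += 2
      } else {
        c match {
          case '_'   => sb.append(".")    // _ matches any single character
          case '%'   => sb.append(".*")   // % matches any sequence of characters
          case other => sb.append(Pattern.quote(other.toString))
        }
        i += 1
      }
    }
    sb.toString
  }
}
```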
[GitHub] spark pull request: [SPARK-9152][SQL] Implement code generation fo...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7561#discussion_r35074087

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -187,6 +187,64 @@ case class Like(left: Expression, right: Expression)
   override def matches(regex: Pattern, str: String): Boolean = regex.matcher(str).matches()
   override def toString: String = s"$left LIKE $right"
+
+  override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    val patternClass = classOf[Pattern].getName
+
+    val literalRight: String = right match {
+      case x @ Literal(value: String, StringType) => escape(value)
+      case _ => null
+    }
+
+    val leftGen = left.gen(ctx)
+    val rightGen = right.gen(ctx)
+
+    val patternCode =
+      if (literalRight != null) {
+        s"${patternClass} pattern = $patternClass.compile($literalRight);"
+      } else {
+        s"""
+          StringBuilder regex = new StringBuilder("(?s)");

--- End diff --

I am not sure if `StringBuilder` is imported. If not, define somewhere `val sb = classOf[StringBuilder].getName` and use `$sb`. You shouldn't use `regex` as the variable name; you can create a safe variable name with `ctx.freshName`.
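For example (a sketch; note that in Scala source `classOf[StringBuilder]` resolves to Scala's `StringBuilder`, so the Java class should be named explicitly):

```Scala
import org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext

// Sketch: qualify java.lang.StringBuilder and let freshName pick a
// collision-free variable name for the generated code.
def regexBuilderCode(ctx: CodeGenContext): String = {
  val sb = classOf[java.lang.StringBuilder].getName
  val regexVar = ctx.freshName("regex")
  s"""$sb $regexVar = new $sb("(?s)");"""
}
```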
[GitHub] spark pull request: [Spark-8244][SQL] string function: find in set
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7186#issuecomment-123187725

Jenkins, test this please.
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7546#discussion_r35072972

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -486,6 +486,10 @@ case class StringFormat(children: Expression*) extends Expression with CodegenFa
   private def format: Expression = children(0)
   private def args: Seq[Expression] = children.tail

+  override def inputTypes: Seq[AbstractDataType] =
+    children.zipWithIndex.map(x => if (x._2 == 0) StringType else AnyDataType)

--- End diff --

@marmbrus Is this what you proposed?
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7546#discussion_r35072868

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -501,6 +501,32 @@ case class StringFormat(children: Expression*) extends Expression with CodegenFa
     }
   }

+  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    val pattern = children.head.gen(ctx)
+
+    val argListGen = children.tail.map(_.gen(ctx))
+    val argListCode = argListGen.map(_.code + "\n")
+    val argListString = argListGen.foldLeft("")((s, v) => s + s", ${v.primitive}")

--- End diff --

`s", ${v.isNull} ? null : ${v.primitive}"` doesn't compile because of: `Incompatible expression types "void" and "int"`. Casting the null to the boxed type throws a null pointer exception:

```Java
int primitive6 = 0;
Object o = (true) ? (Integer) null : primitive6;
```
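A sketch of one way around both problems (this illustrates the Java conditional-operator rules, not necessarily what the PR merged): `(Integer) null : primitive6` auto-unboxes because both branches have numeric types, but casting the primitive branch to `Object` gives the whole conditional the type `Object`, so the primitive is boxed and the null branch is never unboxed:

```Scala
import org.apache.spark.sql.catalyst.expressions.codegen.GeneratedExpressionCode

// Sketch: generate `, isNull ? null : (Object) (primitive)` per argument.
// With Object as the non-null branch's type, no auto-unboxing (and no NPE) occurs.
def argList(argListGen: Seq[GeneratedExpressionCode]): String =
  argListGen.foldLeft("") { (s, v) =>
    s + s", ${v.isNull} ? null : (Object) (${v.primitive})"
  }
```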
[GitHub] spark pull request: [SPARK-9132][SPARK-9163][SQL] codegen conv
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7552#issuecomment-123139930

Jenkins, test this please.
[GitHub] spark pull request: [SPARK-9157][SQL] codegen substring
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7534#issuecomment-123138231

Jenkins, test this please.
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7546#discussion_r35066508

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala ---

@@ -353,7 +353,7 @@ class StringExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
   test("FORMAT") {
     val f = 'f.string.at(0)
     val d1 = 'd.int.at(1)
-    val s1 = 's.int.at(2)
+    val s1 = 's.string.at(2)

     val row1 = create_row("aa%d%s", 12, "cc")
     val row2 = create_row(null, 12, "cc")

--- End diff --

What do we expect if an Integer value is null? `printf` itself has no problems with null, but for codeGen we have a primitive value like `int` instead of `Integer`. One approach to solving this might be to box all values again and set a value to null if `isNull` returns true. Another approach might be to return null if one argument is null.
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7546#discussion_r35063546

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -476,7 +476,7 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression)

 /**
  * Returns the input formatted according do printf-style format strings
  */
-case class StringFormat(children: Expression*) extends Expression with CodegenFallback {
+case class StringFormat(children: Expression*) extends Expression {

--- End diff --

I do have to split the signature for this to `StringFormat(string: Expression, args: Expression*)`, don't I?
[GitHub] spark pull request: [SPARK-9132][SPARK-9163][SQL] codegen conv
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7552

[SPARK-9132][SPARK-9163][SQL] codegen conv

Jira: https://issues.apache.org/jira/browse/SPARK-9132 and https://issues.apache.org/jira/browse/SPARK-9163

@rxin As you proposed in the Jira ticket, I just moved the logic to a separate object. I haven't changed any of the logic of `NumberConverter`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9163

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7552.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7552

commit fa985bda663bdbf60e5b22c4a4113a772b647d35
Author: Tarek Auel
Date: 2015-07-21T01:12:23Z

    [SPARK-9132][SPARK-9163][SQL] codegen conv

commit 40dcde9c76232d79d51316f0b6ee978c18a22538
Author: Tarek Auel
Date: 2015-07-21T01:17:43Z

    [SPARK-9132][SPARK-9163][SQL] style fix
[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r35056820

--- Diff: python/pyspark/sql/functions.py ---

@@ -795,6 +796,22 @@ def weekofyear(col):
     return Column(sc._jvm.functions.weekofyear(col))

+@since(1.5)
+def size(col):
+    """
+    Collection function: returns the length of the array or map stored in the column.
+    :param col: name of column or expression
+
+    >>> from pyspark.sql import Row
+    >>> from pyspark.sql.functions import size
+    >>> df = sqlContext.createDataFrame([Row(data=[1, 2, 3]), Row(data=[1]), Row(data=[])])
+    >>> df.select(size(df.data)).collect()

--- End diff --

You don't have to import `size` and `Row`. Simply use

```Python
>>> df = sqlContext.createDataFrame([([1, 2, 3],), ([1],), ([],)], ['data'])
>>> df.select(size(df.data)).collect()
```
[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r35055882

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionFunctionsSuite.scala ---

@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.types._
+
+
+class CollectionFunctionsSuite extends SparkFunSuite with ExpressionEvalHelper {
+
+  test("Array and Map Size") {
+    val a0 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType))
+    val a1 = Literal.create(Seq[Integer](), ArrayType(IntegerType))
+    val a2 = Literal.create(Seq(1, 2), ArrayType(IntegerType))
+
+    checkEvaluation(Size(a0), 3)
+    checkEvaluation(Size(a1), 0)
+    checkEvaluation(Size(a2), 2)
+
+    val m0 = Literal.create(Map("a" -> "a", "b" -> "b"), MapType(StringType, StringType))
+    val m1 = Literal.create(Map[String, String](), MapType(StringType, StringType))
+    val m2 = Literal.create(Map("a" -> "a"), MapType(StringType, StringType))
+
+    checkEvaluation(Size(m0), 2)
+    checkEvaluation(Size(m1), 0)
+    checkEvaluation(Size(m2), 1)

--- End diff --

Can you add something like

```Scala
checkEvaluation(Size(Literal.create(null, MapType(StringType, StringType))), null)
checkEvaluation(Size(Literal.create(null, ArrayType(StringType))), null)
```
[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r35055176

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---

@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodeGenContext, GeneratedExpressionCode}
+import org.apache.spark.sql.types._
+
+/**
+ * Given an array or map, returns its size.
+ */
+case class Size(child: Expression) extends UnaryExpression with ExpectsInputTypes {
+  override def dataType: DataType = IntegerType
+  override def inputTypes: Seq[AbstractDataType] = Seq(TypeCollection(ArrayType, MapType))
+
+  override def nullSafeEval(value: Any): Int = child.dataType match {
+    case ArrayType(_, _) => value.asInstanceOf[Seq[Any]].size
+    case MapType(_, _, _) => value.asInstanceOf[Map[Any, Any]].size
+  }
+
+  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    child.dataType match {

--- End diff --

```Scala
override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
  nullSafeCodeGen(ctx, ev, c => s"${ev.primitive} = ($c).size();")
}
```

`nullSafeCodeGen` allows you to add multiple lines. `defineCodeGen` expects only the right part of the assignment.
[GitHub] spark pull request: [SPARK-9164][SQL] codegen hex/unhex
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7548

[SPARK-9164][SQL] codegen hex/unhex

Jira: https://issues.apache.org/jira/browse/SPARK-9164

The diff looks heavy, but I just moved the `hex` and `unhex` methods to `object Hex`. This allows me to call them from `eval` and `codeGen`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9164

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7548.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7548

commit dd91c57beba57d11091a9160a072fc889db411bf
Author: Tarek Auel
Date: 2015-07-20T23:05:10Z

    [SPARK-9164][SQL] codegen hex/unhex
[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r35053445

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---

@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodeGenContext, GeneratedExpressionCode}
+import org.apache.spark.sql.types._
+
+/**
+ * Given an array or map, returns its size.
+ */
+case class Size(child: Expression) extends UnaryExpression with ExpectsInputTypes {
+  override def dataType: DataType = IntegerType
+  override def inputTypes: Seq[AbstractDataType] = Seq(TypeCollection(ArrayType, MapType))
+
+  override def nullSafeEval(value: Any): Int = child.dataType match {
+    case ArrayType(_, _) => value.asInstanceOf[Seq[Any]].size
+    case MapType(_, _, _) => value.asInstanceOf[Map[Any, Any]].size
+  }
+
+  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    child.dataType match {

--- End diff --

1. Pattern matching is not necessary here. Just do `defineCodeGen(ctx, ev, c => s"($c).size()")`.
2. Maybe we should call `nullSafeCodeGen` here instead of `defineCodeGen`.
[GitHub] spark pull request: [SPARK-9156][SQL] codegen StringSplit
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7547

[SPARK-9156][SQL] codegen StringSplit

Jira: https://issues.apache.org/jira/browse/SPARK-9156

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9156

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7547.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7547

commit 5ad6a1f851683ee285a731a762c96e1ac398219a
Author: Tarek Auel
Date: 2015-07-20T07:58:50Z

    [SPARK-9156] codegen StringSplit

commit b860eaf09cd77da00889f576300318fa520a73f3
Author: Tarek Auel
Date: 2015-07-20T22:22:02Z

    [SPARK-9156][SQL] codegen StringSplit

commit 0be2700f2366614cae7faceb799085e96d33cd16
Author: Tarek Auel
Date: 2015-07-20T22:24:56Z

    [SPARK-9156][SQL] indention fix
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7546#issuecomment-123059863

Jenkins, ok to test.
[GitHub] spark pull request: [SPARK-9154][SQL] codegen StringFormat
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7546

[SPARK-9154][SQL] codegen StringFormat

Jira: https://issues.apache.org/jira/browse/SPARK-9154

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9154

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7546.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7546

commit 086caba76f646f86840a2cee325188895ab42c8f
Author: Tarek Auel
Date: 2015-07-20T16:29:03Z

    [SPARK-9154][SQL] codegen string format

commit cd8322bc4e6c15cd9911363c4596eba1a935fcdd
Author: Tarek Auel
Date: 2015-07-20T21:40:30Z

    [SPARK-9154][SQL] codegen string format

commit 10b4de88c817a474b7b0a83d948cb86927638775
Author: Tarek Auel
Date: 2015-07-20T21:42:28Z

    [SPARK-9154][SQL] codegen removed fallback trait
[GitHub] spark pull request: [SPARK-9161][SQL] codegen FormatNumber
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7545

[SPARK-9161][SQL] codegen FormatNumber

Jira: https://issues.apache.org/jira/browse/SPARK-9161

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9161

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7545.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7545

commit 21425c82ac3f6a68d3428c08de7ff27f50a12993
Author: Tarek Auel
Date: 2015-07-20T19:20:01Z

    [SPARK-9161][SQL] codegen FormatNumber
[GitHub] spark pull request: [SPARK-9155][SQL] codegen StringSpace
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7531#discussion_r35031504

--- Diff: unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---

@@ -77,6 +78,15 @@ public static UTF8String fromString(String str) {
     }
   }

+  /**
+   * Creates an UTF8String that contains `length` spaces.
+   */
+  public static UTF8String blankString(int length) {
+    byte[] spaces = new byte[length];
+    Arrays.fill(spaces, (byte) ' ');

--- End diff --

Char implements UTF-16 (http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html): "The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities". Shall I still change it?
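For context, an illustrative check (not part of the PR): a space is U+0020, which UTF-8 encodes as the single byte 0x20, so filling a `byte[]` with `(byte) ' '` already yields valid UTF-8 even though Java's `char` is a 16-bit UTF-16 code unit:

```Scala
import java.nio.charset.StandardCharsets

// A space encodes to exactly one byte in UTF-8, so the byte fill is
// equivalent to building the string from chars and encoding afterwards.
val spaces = Array.fill[Byte](5)(' '.toByte)
assert(new String(spaces, StandardCharsets.UTF_8) == " " * 5)
```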
[GitHub] spark pull request: [SPARK-9178][SQL] Add an empty string constant...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7509#issuecomment-122969582

@rxin Could you restart this? I don't understand what went wrong: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1121/testReport/org.apache.spark.sql/DatetimeExpressionsSuite/
[GitHub] spark pull request: [SPARK-9160][SQL] codegen encode, decode
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7543

[SPARK-9160][SQL] codegen encode, decode

Jira: https://issues.apache.org/jira/browse/SPARK-9160

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9160

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7543.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7543

commit 7528f0eac152fad6e8263a63fd78d138a18b5aa0
Author: Tarek Auel
Date: 2015-07-20T17:38:17Z

    [SPARK-9160][SQL] codegen encode, decode
[GitHub] spark pull request: [SPARK-9159][SQL] codegen ascii, base64, unbas...
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7542

[SPARK-9159][SQL] codegen ascii, base64, unbase64

Jira: https://issues.apache.org/jira/browse/SPARK-9159

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9159

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7542.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7542

commit 772e6bc5c2729cc50b207c0043f3380a1856ae80
Author: Tarek Auel
Date: 2015-07-20T17:22:49Z

    [SPARK-9159][SQL] codegen ascii, base64, unbase64
[GitHub] spark pull request: [SPARK-9157][SQL] codegen substring
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7534#issuecomment-122918124

@rxin I guess Jenkins didn't get the 'add to whitelist'.
[GitHub] spark pull request: [SPARK-9157][SQL] codegen substring
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7534#issuecomment-122916361

Jenkins, ok to test.
[GitHub] spark pull request: [SPARK-9157][SQL] codegen substring
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7534#discussion_r34976319

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -628,24 +623,59 @@ case class Substring(str: Expression, pos: Expression, len: Expression)

   override def eval(input: InternalRow): Any = {
     val string = str.eval(input)
-    val po = pos.eval(input)
-    val ln = len.eval(input)
-
-    if ((string == null) || (po == null) || (ln == null)) {
-      null

--- End diff --

I created a nested if in order to avoid evaluating the 2nd or 3rd argument if one of them is already null.
[GitHub] spark pull request: [SPARK-9157][SQL] codegen substring
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7534#discussion_r34976235

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -593,12 +593,7 @@ case class Substring(str: Expression, pos: Expression, len: Expression)
   override def foldable: Boolean = str.foldable && pos.foldable && len.foldable
   override def nullable: Boolean = str.nullable || pos.nullable || len.nullable

-  override def dataType: DataType = {
-    if (!resolved) {
-      throw new UnresolvedException(this, s"Cannot resolve since $children are not resolved")
-    }
-    if (str.dataType == BinaryType) str.dataType else StringType
-  }
+  override def dataType: DataType = StringType

--- End diff --

@rxin This simplification is correct, isn't it? The expression extends `ImplicitCastInputTypes`, so `BinaryType` can be cast to `StringType`.
[GitHub] spark pull request: [SPARK-9157][SQL] codegen substring
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7534

[SPARK-9157][SQL] codegen substring

Jira: https://issues.apache.org/jira/browse/SPARK-9157

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarekauel/spark SPARK-9157

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7534.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7534

commit 1a2e6110478642f30487e569f2f3645ef058bc78
Author: Tarek Auel
Date: 2015-07-20T08:39:08Z

    [SPARK-9157][SQL] codegen substring
[GitHub] spark pull request: [SPARK-9155][SQL] codegen StringSpace
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7531#issuecomment-122800467

Jenkins, test this please.
[GitHub] spark pull request: [SPARK-9155][SQL] codegen StringSpace
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7531#discussion_r34972744 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -556,6 +556,16 @@ case class StringSpace(child: Expression) UTF8String.fromBytes(spaces) } + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +nullSafeCodeGen(ctx, ev, (length) => { + val spaces = ctx.freshName("spaces") + s""" +byte[] $spaces = new byte[($length < 0) ? 0 : $length]; --- End diff -- Okay --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9155][SQL] codegen StringSpace
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7531 [SPARK-9155][SQL] codegen StringSpace Jira https://issues.apache.org/jira/browse/SPARK-9155 You can merge this pull request into a Git repository by running: $ git pull https://github.com/tarekauel/spark SPARK-9155 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7531.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7531 commit 4bc33e6794ee931e0a52645d8fc6ddf699754b32 Author: Tarek Auel Date: 2015-07-20T07:29:20Z [SPARK-9155] codegen StringSpace --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9153][SQL] codegen StringLPad/StringRPa...
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7527 [SPARK-9153][SQL] codegen StringLPad/StringRPad Jira: https://issues.apache.org/jira/browse/SPARK-9153 You can merge this pull request into a Git repository by running: $ git pull https://github.com/tarekauel/spark SPARK-9153 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7527.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7527 commit 92b6a5d5d89c909ae408bc5fb58542225f1f915c Author: Tarek Auel Date: 2015-07-20T06:50:30Z [SPARK-9153] codegen lpad/rpad --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9177][SQL] Reuse of calendar object in ...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7516#issuecomment-122774605 Sure. I am going to solve some of them. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9177][SQL] Reuse of calendar object in ...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7516#issuecomment-122773919 @rxin Jenkins still doesn't like me --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7462#issuecomment-122764175 @EntilZha 1. `eval` and `nullSafeEval`: `eval` is invoked to evaluate the expression. Most expressions should return `null` if one of their arguments is `null`. To avoid every expression having to check whether `left` or `right` is `null`, `nullSafeEval` was added: `eval` does the null check and then calls `nullSafeEval`, see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L289-L313 You should override `eval` only if you don't want to return `null` when one of the arguments is `null`. Most of the time you will use `nullSafeEval`. 2. `UnaryExpression`: the expression has one parameter (like `size(x)`). `BinaryExpression`: the expression has two parameters (like `contains(a, b)`). `ExpectsInputTypes`: lets you automatically check that the argument types are correct, see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpectsInputTypes.scala#L42-L57. You specify the allowed types by overriding `inputTypes`. `ImplicitCastInputTypes`: the difference to `ExpectsInputTypes` is that this one tries to cast the value. Most string operations are implemented with a byte array as input; a string can be "cast" to a byte array by calling `.getBytes`. `ImplicitCastInputTypes` allows calling both `contains(s: String, s2: String)` and `contains(s: Array[Byte], s2: Array[Byte])`. Typically you use this when a cast is reasonable: casting almost anything to string is usually reasonable, but implicitly (i.e. automatically) casting a string to an integer value is usually not helpful. Users can still invoke the `cast` function explicitly. 3. I don't know. 4. IntelliJ can run most suites from the IDE. And have a look at https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningIndividualTests --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
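Putting those pieces together, a minimal sketch of a hypothetical expression (`StrLength` is invented for illustration; the names follow the catalyst API quoted in the diffs of this thread):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, ImplicitCastInputTypes, UnaryExpression}
import org.apache.spark.sql.types.{AbstractDataType, DataType, IntegerType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Returns the length of its (string) argument, or null for a null input.
case class StrLength(child: Expression)
  extends UnaryExpression with ImplicitCastInputTypes {

  // ImplicitCastInputTypes: non-string arguments get an implicit cast inserted
  // during analysis, so nullSafeEval can assume a UTF8String.
  override def inputTypes: Seq[AbstractDataType] = Seq(StringType)
  override def dataType: DataType = IntegerType

  // eval() in UnaryExpression performs the null check and only then delegates
  // here, which is why no null handling appears in this method.
  override protected def nullSafeEval(string: Any): Any =
    string.asInstanceOf[UTF8String].toString.length
}
```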
[GitHub] spark pull request: [SPARK-9178][SQL] Add an empty string constant...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7509#issuecomment-122726603 @rxin Shall I add `final` and do a rebase? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9178][SQL] Add an empty string constant...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7509#issuecomment-122715547 @rxin Jenkins doesn't like me --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9177][SQL] Reuse of calendar object in ...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7516#discussion_r34960318 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala --- @@ -213,10 +213,14 @@ case class WeekOfYear(child: Expression) extends UnaryExpression with ImplicitCa override def dataType: DataType = IntegerType - override protected def nullSafeEval(date: Any): Any = { + private[this] final val c = { --- End diff -- Just to double-check: `java.util.Calendar` implements `Serializable`, so `@transient` isn't necessary. Am I right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
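A sketch of the pattern under discussion (reconstructed from the codegen snippet in the next message, not copied from the PR): the Calendar is created and configured once per instance and then reused for every row.

```scala
import java.util.{Calendar, TimeZone}

class WeekOfYearSketch extends Serializable {
  // Calendar implements Serializable, so @transient is optional here; it would
  // only be required if the field could not survive serialization.
  private[this] final val c = {
    val cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"))
    cal.setFirstDayOfWeek(Calendar.MONDAY)
    cal.setMinimalDaysInFirstWeek(4) // ISO 8601 week numbering
    cal
  }

  def weekOfYear(daysSinceEpoch: Int): Int = {
    c.setTimeInMillis(daysSinceEpoch.toLong * 24 * 3600 * 1000)
    c.get(Calendar.WEEK_OF_YEAR)
  }
}
```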
[GitHub] spark pull request: [SPARK-9177][SQL] Reuse of calendar object in ...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7516#discussion_r34960308 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala --- @@ -225,8 +229,8 @@ case class WeekOfYear(child: Expression) extends UnaryExpression with ImplicitCa nullSafeCodeGen(ctx, ev, (time) => { val cal = classOf[Calendar].getName val c = ctx.freshName("cal") + ctx.addMutableState(cal, c, s"""$cal.getInstance(java.util.TimeZone.getTimeZone("UTC"));""") s""" -$cal $c = $cal.getInstance(java.util.TimeZone.getTimeZone("UTC")); $c.setFirstDayOfWeek($cal.MONDAY); $c.setMinimalDaysInFirstWeek(4); --- End diff -- If we extend `CodeGenContext.addMutableState(javaName, variableName, initialValue)` to something like `CodeGenContext.addMutableState(javaName, variableName, initialValue, initialCode: Option[String] = None)` we could get rid of these two lines and allow more complex initialisations than a single method call. So far there is no way to pass a more complex initialisation, is there? @cloud-fan I guess you created the PR for `addMutableState`. Is there an opportunity to push the two lines into the initialisation? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
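A toy mock of that proposal (hypothetical `initialCode` parameter; the real `addMutableState` in this Spark version takes only the three arguments shown above):

```scala
import scala.collection.mutable

class CodeGenContextSketch {
  private val initStatements = mutable.ArrayBuffer.empty[String]

  def addMutableState(
      javaType: String,
      variableName: String,
      initialValue: String,
      initialCode: Option[String] = None): Unit = {
    // declaration plus single-expression initial value, as today
    initStatements += s"private $javaType $variableName = $initialValue;"
    // the proposed extra statements, e.g. the two Calendar setter calls above
    initialCode.foreach(initStatements += _)
  }

  // everything collected here would be emitted into the generated class's init
  def initializerBody: String = initStatements.mkString("\n")
}
```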
[GitHub] spark pull request: [SPARK-9177][SQL] Reuse of calendar object in ...
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7516 [SPARK-9177][SQL] Reuse of calendar object in WeekOfYear https://issues.apache.org/jira/browse/SPARK-9177 @rxin Are we sure that this is thread safe? @chenghao-intel explained in another PR that every partition (if I remember correctly) uses one expression instance. This instance isn't used by multiple threads, is it? If not, we are fine. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tarekauel/spark SPARK-9177 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7516.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7516 commit ff97b095c3c80f857f571c0087d271d32b208cb9 Author: Tarek Auel Date: 2015-07-19T17:40:21Z [SPARK-9177] Reuse calendar object in interpreted code and codegen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9178][SQL] Add an empty string constant...
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7509 [SPARK-9178][SQL] Add an empty string constant to UTF8String Jira: https://issues.apache.org/jira/browse/SPARK-9178 In order to avoid calls of `UTF8String.fromString("")` this PR adds an `EMPTY_STRING` constant to `UTF8String`. A `UTF8String` is immutable, so we can use a constant, can't we? I searched for current usages of `UTF8String.fromString("")` with `grep -R "UTF8String.fromString(\"\")" .` You can merge this pull request into a Git repository by running: $ git pull https://github.com/tarekauel/spark SPARK-9178 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7509.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7509 commit 748b87a38575664fcfc877ccc575678ba54a9df6 Author: Tarek Auel Date: 2015-07-19T08:22:43Z [SPARK-9178] Add empty string constant to UTF8String --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
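A sketch of the idea in Scala (the PR itself adds the constant to `UTF8String` on the Java side; the exact field name there may differ from this illustration):

```scala
import org.apache.spark.unsafe.types.UTF8String

object Utf8Constants {
  // UTF8String is immutable, so one shared empty instance can safely replace
  // every UTF8String.fromString("") call site found by the grep above.
  val EMPTY_STRING: UTF8String = UTF8String.fromBytes(new Array[Byte](0))
}
```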
[GitHub] spark pull request: [SPARK-8255][SPARK-8256][SQL]Add regex_extract...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7468#discussion_r34955908 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -673,6 +673,110 @@ case class Encode(value: Expression, charset: Expression) } /** + * Replace all substrings of str that match regexp with rep + */ +case class RegExpReplace(subject: Expression, regexp: Expression, rep: Expression) + extends Expression with ImplicitCastInputTypes { + + // last regex in string, we will update the pattern iff regexp value changed. + @transient private var lastRegex: UTF8String = _ + // last regex pattern, we cache it for performance concern + @transient private var pattern: Pattern = _ + // last replacement string, we don't want to convert a UTF8String => java.langString every time. + @transient private var lastReplacement: String = _ + @transient private var lastReplacementInUTF8: UTF8String = _ + // result buffer write by Matcher + @transient private val result: StringBuffer = new StringBuffer + + override def nullable: Boolean = children.foldLeft(false)(_ || _.nullable) + override def foldable: Boolean = children.foldLeft(true)(_ && _.foldable) + + override def eval(input: InternalRow): Any = { +val s = subject.eval(input) +if (null != s) { + val p = regexp.eval(input) + if (null != p) { +val r = rep.eval(input) +if (null != r) { + if (!p.equals(lastRegex)) { +// regex value changed +lastRegex = p.asInstanceOf[UTF8String] +pattern = Pattern.compile(lastRegex.toString) + } + if (!r.equals(lastReplacementInUTF8)) { +// replacement string changed +lastReplacementInUTF8 = r.asInstanceOf[UTF8String] +lastReplacement = lastReplacementInUTF8.toString + } + val m = pattern.matcher(s.toString()) + result.delete(0, result.length()) + + while (m.find) { +m.appendReplacement(result, lastReplacement) + } + m.appendTail(result) + + return UTF8String.fromString(result.toString) +} + } +} + +null + } + + override def dataType: DataType = StringType + override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, StringType) + override def children: Seq[Expression] = subject :: regexp :: rep :: Nil + override def prettyName: String = "regexp_replace" +} + +/** + * UDF to extract a specific(idx) group identified by a java regex. + */ +case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expression) + extends Expression with ImplicitCastInputTypes { + def this(s: Expression, r: Expression) = this(s, r, Literal(1)) + + // last regex in string, we will update the pattern iff regexp value changed. + @transient private var lastRegex: UTF8String = _ + // last regex pattern, we cache it for performance concern + @transient private var pattern: Pattern = _ + + override def nullable: Boolean = children.foldLeft(false)(_ || _.nullable) + override def foldable: Boolean = children.foldLeft(true)(_ && _.foldable) + + override def eval(input: InternalRow): Any = { +val s = subject.eval(input) +if (null != s) { + val p = regexp.eval(input) + if (null != p) { +val r = idx.eval(input) +if (null != r) { + if (!p.equals(lastRegex)) { +// regex value changed +lastRegex = p.asInstanceOf[UTF8String] +pattern = Pattern.compile(lastRegex.toString) + } + val m = pattern.matcher(s.toString()) + if (m.find) { +val mr: MatchResult = m.toMatchResult +return UTF8String.fromString(mr.group(r.asInstanceOf[Int])) + } + return UTF8String.fromString("") --- End diff -- Okay. 
I am going to create a Jira and check the code for existing empty strings. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8255][SPARK-8256][SQL]Add regex_extract...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7468#discussion_r34955738 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -673,6 +673,110 @@ case class Encode(value: Expression, charset: Expression) } /** + * Replace all substrings of str that match regexp with rep + */ +case class RegExpReplace(subject: Expression, regexp: Expression, rep: Expression) + extends Expression with ImplicitCastInputTypes { + + // last regex in string, we will update the pattern iff regexp value changed. + @transient private var lastRegex: UTF8String = _ + // last regex pattern, we cache it for performance concern + @transient private var pattern: Pattern = _ + // last replacement string, we don't want to convert a UTF8String => java.langString every time. + @transient private var lastReplacement: String = _ + @transient private var lastReplacementInUTF8: UTF8String = _ + // result buffer write by Matcher + @transient private val result: StringBuffer = new StringBuffer + + override def nullable: Boolean = children.foldLeft(false)(_ || _.nullable) + override def foldable: Boolean = children.foldLeft(true)(_ && _.foldable) + + override def eval(input: InternalRow): Any = { +val s = subject.eval(input) +if (null != s) { + val p = regexp.eval(input) + if (null != p) { +val r = rep.eval(input) +if (null != r) { + if (!p.equals(lastRegex)) { +// regex value changed +lastRegex = p.asInstanceOf[UTF8String] +pattern = Pattern.compile(lastRegex.toString) + } + if (!r.equals(lastReplacementInUTF8)) { +// replacement string changed +lastReplacementInUTF8 = r.asInstanceOf[UTF8String] +lastReplacement = lastReplacementInUTF8.toString + } + val m = pattern.matcher(s.toString()) + result.delete(0, result.length()) + + while (m.find) { +m.appendReplacement(result, lastReplacement) + } + m.appendTail(result) + + return UTF8String.fromString(result.toString) +} + } +} + +null + } + + override def dataType: DataType = StringType + override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, StringType) + override def children: Seq[Expression] = subject :: regexp :: rep :: Nil + override def prettyName: String = "regexp_replace" +} + +/** + * UDF to extract a specific(idx) group identified by a java regex. + */ +case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expression) + extends Expression with ImplicitCastInputTypes { + def this(s: Expression, r: Expression) = this(s, r, Literal(1)) + + // last regex in string, we will update the pattern iff regexp value changed. 
+ @transient private var lastRegex: UTF8String = _ + // last regex pattern, we cache it for performance concern + @transient private var pattern: Pattern = _ + + override def nullable: Boolean = children.foldLeft(false)(_ || _.nullable) + override def foldable: Boolean = children.foldLeft(true)(_ && _.foldable) + + override def eval(input: InternalRow): Any = { +val s = subject.eval(input) +if (null != s) { + val p = regexp.eval(input) + if (null != p) { +val r = idx.eval(input) +if (null != r) { + if (!p.equals(lastRegex)) { +// regex value changed +lastRegex = p.asInstanceOf[UTF8String] +pattern = Pattern.compile(lastRegex.toString) + } + val m = pattern.matcher(s.toString()) + if (m.find) { +val mr: MatchResult = m.toMatchResult +return UTF8String.fromString(mr.group(r.asInstanceOf[Int])) + } + return UTF8String.fromString("") --- End diff -- `UTF8String.fromBytes(Array[Byte]())` should be slightly faster and avoids creating the string. @rxin / @davies A little bit off-topic, but do you guys think we should add something to `UTF8String` to create an empty UTF8String? Something like: ``` public UTF8String EMPTY_STRING() { return UTF8String.fromBytes(new byte[0]); } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...
[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r34955687 --- Diff: python/pyspark/sql/functions.py --- @@ -652,6 +658,16 @@ def ntile(n): return Column(sc._jvm.functions.ntile(int(n))) +@since(1.5) +def size(col): +""" +Collection function: returns the length of the array or map stored in the column. +:param col: name of column or expression --- End diff -- Could you add an example/test here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL] Make date/time functions more consistent...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7506#issuecomment-122633436 @rxin Could you do this little fix as well? https://github.com/apache/spark/pull/7505/files Why do we switch from day_of_month to dayofmonth? Most SQL implementations use underscores: [MySQL](https://dev.mysql.com/doc/refman/5.0/en/func-op-summary-ref.html) [SAP HANA](http://help.sap.com/saphelp_hanaplatform/helpdata/en/20/9f228975191014baed94f1b69693ae/content.htm?frameset=/en/20/9ddefe75191014ac249bf78ba2a1e9/frameset.htm&current_toc=/en/2e/1ef8b4f4554739959886e55d4c127b/plain.htm&node_id=91&show_children=false) [Oracle](http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions001.htm#i88891) I would prefer underscores because they improve readability when all the SQL keywords are written in caps, like: `SELECT name, age, DAY_OF_MONTH(birthday) AS birthday FROM people WHERE age > 15` compared to `SELECT name, age, DAYOFMONTH(birthday) AS birthday FROM people WHERE age > 15` I'm not a Python pro, but I thought that underscores are 'pythonic', aren't they? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SQL] follow up; revert change in ...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7505#issuecomment-122632995 Now it's right, isn't it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SQL] follow up; revert change in ...
GitHub user tarekauel opened a pull request: https://github.com/apache/spark/pull/7505 [SPARK-8199][SQL] follow up; revert change in test @rxin / @davies Sorry for that unnecessary change. And thanks again for all your support! You can merge this pull request into a Git repository by running: $ git pull https://github.com/tarekauel/spark SPARK-8199-FollowUp Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7505.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7505 commit 67acfe6ff366e2050a72069842b088935d81e2ef Author: Tarek Auel Date: 2015-07-19T06:01:02Z [SPARK-8199] follow up; revert change in test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34955066 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateFunctionsSuite.scala --- @@ -0,0 +1,249 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import java.sql.{Timestamp, Date} +import java.text.SimpleDateFormat +import java.util.{TimeZone, Calendar} + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.types.{StringType, TimestampType, DateType} + +class DateFunctionsSuite extends SparkFunSuite with ExpressionEvalHelper { + + val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss") + val sdfDate = new SimpleDateFormat("yyyy-MM-dd") + val d = new Date(sdf.parse("2015-04-08 13:10:15").getTime) + val ts = new Timestamp(sdf.parse("2013-11-08 13:10:15").getTime) + + test("Day in Year") { +val sdfDay = new SimpleDateFormat("D") +(2002 to 2004).foreach { y => + (0 to 11).foreach { m => +(0 to 5).foreach { i => + val c = Calendar.getInstance() + c.set(y, m, 28, 0, 0, 0) + c.add(Calendar.DATE, i) + checkEvaluation(DayInYear(Cast(Literal(new Date(c.getTimeInMillis)), DateType)), +sdfDay.format(c.getTime).toInt) +} + } +} + +(1998 to 2002).foreach { y => + (0 to 11).foreach { m => +(0 to 5).foreach { i => + val c = Calendar.getInstance() + c.set(y, m, 28, 0, 0, 0) + c.add(Calendar.DATE, 1) + checkEvaluation(DayInYear(Cast(Literal(new Date(c.getTimeInMillis)), DateType)), +sdfDay.format(c.getTime).toInt) +} + } +} + +(1969 to 1970).foreach { y => + (0 to 11).foreach { m => +(0 to 5).foreach { i => + val c = Calendar.getInstance() + c.set(y, m, 28, 0, 0, 0) + c.add(Calendar.DATE, 1) + checkEvaluation(DayInYear(Cast(Literal(new Date(c.getTimeInMillis)), DateType)), +sdfDay.format(c.getTime).toInt) +} + } +} + +(2402 to 2404).foreach { y => + (0 to 11).foreach { m => +(0 to 5).foreach { i => + val c = Calendar.getInstance() + c.set(y, m, 28, 0, 0, 0) + c.add(Calendar.DATE, 1) + checkEvaluation(DayInYear(Cast(Literal(new Date(c.getTimeInMillis)), DateType)), +sdfDay.format(c.getTime).toInt) +} + } +} + +(2398 to 2402).foreach { y => + (0 to 11).foreach { m => +(0 to 5).foreach { i => + val c = Calendar.getInstance() + c.set(y, m, 28, 0, 0, 0) + c.add(Calendar.DATE, 1) --- End diff -- I changed this when I looked for the last bug. I'm going to create a follow-up PR --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/6981#issuecomment-122573044 @rxin / @davies First of all, thanks for all your feedback so far. I removed the manual binary search. I guess we all agree that the if/else structure is much more readable. @davies I removed `testWithTimezone` because of @cloud-fan's comment in #7488. @cloud-fan Could you have a look at this PR, if you have time? That would be great. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34949893 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala --- @@ -54,3 +60,367 @@ case class CurrentTimestamp() extends LeafExpression { System.currentTimeMillis() * 1000L } } + +case class Hour(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +val time = timestamp.asInstanceOf[Long] / 1000 +val longTime: Long = time.asInstanceOf[Long] + TimeZone.getDefault.getOffset(time) +((longTime / (1000 * 3600)) % 24).toInt + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val tz = classOf[TimeZone].getName +defineCodeGen(ctx, ev, (c) => + s"""(int) ((($c / 1000) + $tz.getDefault().getOffset($c / 1000)) + / (1000 * 3600) % 24)""".stripMargin +) + } +} + +case class Minute(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +val time = timestamp.asInstanceOf[Long] / 1000 +val longTime: Long = time.asInstanceOf[Long] + TimeZone.getDefault.getOffset(time) +((longTime / (1000 * 60)) % 60).toInt + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val tz = classOf[TimeZone].getName +defineCodeGen(ctx, ev, (c) => + s"""(int) ((($c / 1000) + $tz.getDefault().getOffset($c / 1000)) + / (1000 * 60) % 60)""".stripMargin +) + } +} + +case class Second(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(time: Any): Any = { +(time.asInstanceOf[Long] / 1000L / 1000L % 60L).toInt + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +nullSafeCodeGen(ctx, ev, (time) => { + s"""${ev.primitive} = (int) ($time / 1000L / 1000L % 60L);""" +}) + } +} + +abstract class DateFormatExpression extends UnaryExpression with ImplicitCastInputTypes { + self: Product => + + val daysIn400Years: Int = 146097 + val to2001 = -11323 + + // this is year -17999, calculation: 50 * daysIn400Year + val toYearZero = to2001 + 7304850 + + protected def isLeapYear(year: Int): Boolean = { +(year % 4) == 0 && ((year % 100) != 0 || (year % 400) == 0) + } + + private[this] def yearBoundary(year: Int): Int = { +year * 365 + ((year / 4 ) - (year / 100) + (year / 400)) + } + + private[this] def numYears(in: Int): Int = { +val year = in / 365 +if (in > yearBoundary(year)) year else year - 1 + } + + override def dataType: DataType = IntegerType + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + protected def calculateYearAndDayInYear(daysIn: Int): (Int, Int) = { +val daysNormalized = daysIn + toYearZero +val numOfQuarterCenturies = daysNormalized / daysIn400Years +val daysInThis400 = daysNormalized % daysIn400Years + 1 +val years = numYears(daysInThis400) +val year: Int = (2001 - 2) + 400 * numOfQuarterCenturies + years +val dayInYear = daysInThis400 - yearBoundary(years) +(year, dayInYear) + } + + protected def codeGen(ctx: CodeGenContext, ev: 
GeneratedExpressionCode, input: String, + f: (String, String) => String): String = { +val daysIn400Years = ctx.freshName("daysIn400Years") +val to2001 = ctx.freshName("to2001") +val toYearZero = ctx.freshName("toYearZero") +val daysNormalized = ctx.freshName("daysNormalized") +val numOfQuarterCenturies = ctx.freshName("numOfQuarterCenturies") +val daysInThis400 = ctx.freshName("daysInThis400") +val years = ctx.freshName("years") +val year = ctx.freshName("year") +val dayInYear = ctx.freshName("dayInYear") + +s
[GitHub] spark pull request: [SPARK-9136][SQL] fix several bugs in DateTime...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7473#discussion_r34948052 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -320,16 +320,17 @@ object DateTimeUtils { Calendar.getInstance( TimeZone.getTimeZone(f"GMT${timeZone.get.toChar}${segments(7)}%02d:${segments(8)}%02d")) } +c.set(Calendar.MILLISECOND, 0) if (justTime) { - c.set(Calendar.HOUR, segments(3)) + c.set(Calendar.HOUR_OF_DAY, segments(3)) c.set(Calendar.MINUTE, segments(4)) c.set(Calendar.SECOND, segments(5)) } else { c.set(segments(0), segments(1) - 1, segments(2), segments(3), segments(4), segments(5)) } -Some(c.getTimeInMillis / 1000 * 100 + segments(6)) +Some(c.getTimeInMillis * 1000 + segments(6)) --- End diff -- Got it! Thanks for the explanation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34940088 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala --- @@ -54,3 +60,204 @@ case class CurrentTimestamp() extends LeafExpression { System.currentTimeMillis() * 1000L } } + +case class Hour(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getHours(timestamp.asInstanceOf[Long]) + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = DateTimeUtils.getClass.getName.stripSuffix("$") +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getHours($c)""" +) + } +} + +case class Minute(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getMinutes(timestamp.asInstanceOf[Long]) + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = DateTimeUtils.getClass.getName.stripSuffix("$") +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getMinutes($c)""" +) + } +} + +case class Second(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getSeconds(timestamp.asInstanceOf[Long]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = DateTimeUtils.getClass.getName.stripSuffix("$") +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getSeconds($c)""" +) + } +} + +case class DayInYear(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override def prettyName: String = "day_in_year" + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getDayInYear(date.asInstanceOf[Int]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = DateTimeUtils.getClass.getName.stripSuffix("$") +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getDayInYear($c)""" +) + } +} + + +case class Year(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getYear(date.asInstanceOf[Int]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = DateTimeUtils.getClass.getName.stripSuffix("$") +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getYear($c)""" +) + } +} + +case class Quarter(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getQuarter(date.asInstanceOf[Int]) + } + + override protected def genCode(ctx: 
CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = DateTimeUtils.getClass.getName.stripSuffix("$") +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getQuarter($c)""" +) + } +} + +case class Month(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34939885 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -378,4 +395,208 @@ object DateTimeUtils { c.set(segments(0), segments(1) - 1, segments(2), 0, 0, 0) Some((c.getTimeInMillis / 1000 / 3600 / 24).toInt) } + + /** + * Returns the hour value of a given timestamp value. The timestamp is expressed in microseconds. + */ + def getHours(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000 / 3600) % 24).toInt + } + + /** + * Returns the minute value of a given timestamp value. The timestamp is expressed in + * microseconds. + */ + def getMinutes(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000 / 60) % 60).toInt + } + + /** + * Returns the second value of a given timestamp value. The timestamp is expressed in + * microseconds. + */ + def getSeconds(timestamp: Long): Int = { +((timestamp / 1000 / 1000) % 60).toInt + } + + private[this] def isLeapYear(year: Int): Boolean = { +(year % 4) == 0 && ((year % 100) != 0 || (year % 400) == 0) + } + + /** + * Return the number of days since the start of 400 year period. + * The second year of a 400 year period (year 1) starts on day 365. + */ + private[this] def yearBoundary(year: Int): Int = { +year * 365 + ((year / 4 ) - (year / 100) + (year / 400)) + } + + /** + * Calculates the number of years for the given number of days. This depends + * on a 400 year period. + * @param days days since the beginning of the 400 year period + * @return (number of year, days in year) + */ + private[this] def numYears(days: Int): (Int, Int) = { +val year = days / 365 +val boundary = yearBoundary(year) +if (days > boundary) (year, days - boundary) else (year - 1, days - yearBoundary(year - 1)) + } + + /** + * Calculates the year and and the number of the day in the year for the given + * number of days. The given days is the number of days since 1.1.1970. + * + * The calculation uses the fact that the period 1.1.2001 until 31.12.2400 is + * equals to the period 1.1.1601 until 31.12.2000. + */ + private[this] def getYearAndDayInYear(daysSince1970: Int): (Int, Int) = { +// add the difference (in days) between 1.1.1970 and the artificial year 0 (-17999) +val daysNormalized = daysSince1970 + toYearZero +val numOfQuarterCenturies = daysNormalized / daysIn400Years +val daysInThis400 = daysNormalized % daysIn400Years + 1 +val (years, dayInYear) = numYears(daysInThis400) +val year: Int = (2001 - 2) + 400 * numOfQuarterCenturies + years +(year, dayInYear) + } + + /** + * Returns the 'day in year' value for the given date. The date is expressed in days + * since 1.1.1970. + */ + def getDayInYear(date: Int): Int = { +getYearAndDayInYear(date)._2 + } + + /** + * Returns the year value for the given date. The date is expressed in days + * since 1.1.1970. + */ + def getYear(date: Int): Int = { +getYearAndDayInYear(date)._1 + } + + /** + * Returns the quarter for the given date. The date is expressed in days + * since 1.1.1970. + */ + def getQuarter(date: Int): Int = { +var (year, dayInYear) = getYearAndDayInYear(date) +if (isLeapYear(year)) { + dayInYear = dayInYear - 1 +} +if (dayInYear <= 90) { + 1 +} else if (dayInYear <= 181) { + 2 +} else if (dayInYear <= 273) { + 3 +} else { + 4 +} + } + + /** + * Returns the month value for the given date. 
The date is expressed in days + * since 1.1.1970. January is month 1. + */ + def getMonth(date: Int): Int = { +var (year, dayInYear) = getYearAndDayInYear(date) +var isLeap = isLeapYear(year) +if (isLeap && dayInYear > 60) { + dayInYear = dayInYear - 1 + isLeap = false +} + +if (dayInYear <= 181) { + if (dayInYear <= 90) { +if (dayInYear <= 31) { + 1 +} else if (dayInYear <= 59 || (isLeap && dayInYear <= 60)) { --- End diff -- But the problem is: if you subtract 1, you cannot differentiate between 1.3. and 29.2. You somehow have to add an additiona
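To make the truncated argument above concrete, a small self-contained sketch (my reconstruction, using the convention visible in the diff that day-in-year 60 of a leap year is Feb 29 and day 61 is Mar 1):

```scala
// Months beyond March are elided; the point is only the Feb 29 / Mar 1 split.
def monthOfLeapYearDay(day: Int): Int = {
  var dayInYear = day
  var isLeap = true
  // Shift only days strictly after Feb 29 back onto the non-leap calendar,
  // keeping the flag so day 60 itself can still be recognised as February.
  if (dayInYear > 60) { dayInYear -= 1; isLeap = false }
  if (dayInYear <= 31) 1
  else if (dayInYear <= 59 || (isLeap && dayInYear <= 60)) 2
  else 3
}

// monthOfLeapYearDay(60) == 2 (29.2.) only because of the retained isLeap
// flag; without it, day 60 would fall through to the non-leap table, where
// day 60 is 1.3., and 29.2. would become indistinguishable from 1.3.
```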
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/6981#issuecomment-122385842 @rxin / @davies Can you trigger Jenkins again? BTW: I already had this issue yesterday; Jenkins could not fetch from the repo. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9136][SQL] fix several bugs in DateTime...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7473#discussion_r34912321 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -320,16 +320,17 @@ object DateTimeUtils { Calendar.getInstance( TimeZone.getTimeZone(f"GMT${timeZone.get.toChar}${segments(7)}%02d:${segments(8)}%02d")) } +c.set(Calendar.MILLISECOND, 0) if (justTime) { - c.set(Calendar.HOUR, segments(3)) + c.set(Calendar.HOUR_OF_DAY, segments(3)) c.set(Calendar.MINUTE, segments(4)) c.set(Calendar.SECOND, segments(5)) } else { c.set(segments(0), segments(1) - 1, segments(2), segments(3), segments(4), segments(5)) } -Some(c.getTimeInMillis / 1000 * 100 + segments(6)) +Some(c.getTimeInMillis * 1000 + segments(6)) --- End diff -- Sorry, I didn't get this. Why is dividing by 1000 a problem if the date is before 1.1.1970? It's a long value and the last three digits are the millis. To cut them off, I divide by 1000. This has no impact on the sign or anything else, does it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
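The explanation acknowledged with "Got it!" earlier in this thread hinges on JVM division semantics; a minimal illustration (my example, not from the thread):

```scala
// Long division truncates toward zero, which differs from flooring exactly
// when the value is negative, i.e. for timestamps before 1.1.1970.
val after  = 1234L   // 1.234 s after the epoch, in millis
val before = -1234L  // 1.234 s before the epoch, in millis

println(after / 1000)   // 1   (truncation equals floor for non-negative values)
println(before / 1000)  // -1  (truncated toward zero; the floor would be -2)

// So millis / 1000 * 1000 only recovers the enclosing second boundary for
// non-negative timestamps:
println(after / 1000 * 1000)   // 1000
println(before / 1000 * 1000)  // -1000, not the flooring -2000
```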
[GitHub] spark pull request: [SPARK-9136][SQL] fix several bugs in DateTime...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7473#issuecomment-122340977 LGTM. 1. Yes, that was a bug. Thanks for fixing it. 2. See my comment. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9136][SQL] fix several bugs in DateTime...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7473#discussion_r34908143 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -320,16 +320,17 @@ object DateTimeUtils { Calendar.getInstance( TimeZone.getTimeZone(f"GMT${timeZone.get.toChar}${segments(7)}%02d:${segments(8)}%02d")) } +c.set(Calendar.MILLISECOND, 0) if (justTime) { - c.set(Calendar.HOUR, segments(3)) + c.set(Calendar.HOUR_OF_DAY, segments(3)) c.set(Calendar.MINUTE, segments(4)) c.set(Calendar.SECOND, segments(5)) } else { c.set(segments(0), segments(1) - 1, segments(2), segments(3), segments(4), segments(5)) } -Some(c.getTimeInMillis / 1000 * 100 + segments(6)) +Some(c.getTimeInMillis * 1000 + segments(6)) --- End diff -- I don't know why you have changed this. I divided by `1000` first in order to remove the call of `c.set(Calendar.MILLISECOND, 0)` in line 323. But I am fine with this; I guess it's a personal preference. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34870733 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateFunctionsSuite.scala --- @@ -0,0 +1,259 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import java.sql.{Timestamp, Date} +import java.text.SimpleDateFormat +import java.util.{TimeZone, Calendar} + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.types.{StringType, TimestampType, DateType} + +class DateFunctionsSuite extends SparkFunSuite with ExpressionEvalHelper { + + val oldDefault = TimeZone.getDefault + + val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss") + val sdfDate = new SimpleDateFormat("yyyy-MM-dd") + val d = new Date(sdf.parse("2015-04-08 13:10:15").getTime) + val ts = new Timestamp(sdf.parse("2013-11-08 13:10:15").getTime) + + test("Day in Year") { +val sdfDay = new SimpleDateFormat("D") +(2002 to 2004).foreach { y => + (0 to 11).foreach { m => +(0 to 5).foreach { i => --- End diff -- 28, 29, 30, 31, 1.1. It checks until the 1st or 2nd of the next month, doesn't it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34868858 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -378,4 +395,183 @@ object DateTimeUtils { c.set(segments(0), segments(1) - 1, segments(2), 0, 0, 0) Some((c.getTimeInMillis / 1000 / 3600 / 24).toInt) } + + /** + * Returns the hour value of a given timestamp value. The timestamp is expressed in microseconds. + */ + def getHours(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000 / 3600) % 24).toInt + } + + /** + * Returns the minute value of a given timestamp value. The timestamp is expressed in + * microseconds. + */ + def getMinutes(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000 / 60) % 60).toInt + } + + /** + * Returns the second value of a given timestamp value. The timestamp is expressed in + * microseconds. + */ + def getSeconds(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000) % 60).toInt + } + + private[this] def isLeapYear(year: Int): Boolean = { +(year % 4) == 0 && ((year % 100) != 0 || (year % 400) == 0) + } + + /** + * Return the number of days since the start of 400 year period. + * The second year of a 400 year period (year 1) starts on day 365. + */ + private[this] def yearBoundary(year: Int): Int = { +year * 365 + ((year / 4 ) - (year / 100) + (year / 400)) + } + + /** + * Calculates the number of years for the given number of days. This depends + * on a 400 year period. + * @param days days since the beginning of the 400 year period + * @return number of year + */ + private[this] def numYears(days: Int): Int = { --- End diff -- Sorry, I'm not sure if I got this right. I should return ``` val boundary = yearBoundary(year) if (days > boundary) (year, boundary) else (year - 1, yearBoundary(year - 1)) ``` in order to avoid the call of `yearBoundary` in the line `val dayInYear = daysInThis400 - yearBoundary(years)`. That was your proposal, wasn't it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8269][SQL]string function: initcap
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7208#discussion_r34867691 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -593,6 +593,33 @@ case class Levenshtein(left: Expression, right: Expression) extends BinaryExpres } /** + * Returns string, with the first letter of each word in uppercase, + * all other letters in lowercase. Words are delimited by whitespace. + */ +case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes { + override def dataType: DataType = StringType + + override def inputTypes: Seq[DataType] = Seq(StringType) + + override def nullSafeEval(string: Any): Any = { +if (string.asInstanceOf[UTF8String].getBytes.length == 0) { + return string +} +else { + val sb = new StringBuffer() + sb.append(string) + sb.setCharAt(0, sb.charAt(0).toUpper) + for (i <- 1 until sb.length) { +if (sb.charAt(i - 1).equals(' ')) { + sb.setCharAt(i, sb.charAt(i).toUpper) +} + } + UTF8String.fromString(sb.toString) --- End diff -- My idea would be that we check if the next character fits into a `Char`. If yes, we convert it to a `Char`, call `Character.toUpperCase(c)`, and change the result in the array. If we cannot convert it to a `Char`, we "ignore" it and don't change it. But as Reynold mentioned, we can do this in a second step. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
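A rough sketch of that byte-level idea (my interpretation, not code from the PR; it only uppercases single-byte ASCII characters and treats a plain space as the word delimiter):

```scala
def initCapAsciiOnly(utf8: Array[Byte]): Array[Byte] = {
  val out = utf8.clone()
  var atWordStart = true
  var i = 0
  while (i < out.length) {
    val b = out(i)
    if (b >= 0) { // single-byte UTF-8 character: safe to treat as a Char
      if (atWordStart) out(i) = Character.toUpperCase(b.toChar).toByte
      atWordStart = b.toChar == ' '
    } else {
      atWordStart = false // byte of a multi-byte code point: "ignore" it
    }
    i += 1
  }
  out
}
```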
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34867157 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala --- @@ -54,3 +60,193 @@ case class CurrentTimestamp() extends LeafExpression { System.currentTimeMillis() * 1000L } } + +case class Hour(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getHours(timestamp.asInstanceOf[Long]) + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getHours($c)""" +) + } +} + +case class Minute(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getMinutes(timestamp.asInstanceOf[Long]) + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getMinutes($c)""" +) + } +} + +case class Second(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getSeconds(timestamp.asInstanceOf[Long]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getSeconds($c)""" +) + } +} + +case class DayInYear(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override def prettyName: String = "day_in_year" + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getDayInYear(date.asInstanceOf[Int]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getDayInYear($c)""" +) + } +} + + +case class Year(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getYear(date.asInstanceOf[Int]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getYear($c)""" +) + } +} + +case class Quarter(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getQuarter(date.asInstanceOf[Int]) + } + + override protected def 
genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getQuarter($c)""" +) + } +} + +case class Month(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34866296 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala --- @@ -54,3 +60,193 @@ case class CurrentTimestamp() extends LeafExpression { System.currentTimeMillis() * 1000L } } + +case class Hour(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getHours(timestamp.asInstanceOf[Long]) + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" --- End diff -- `DateTimeUtils.getClass.getName` returns `org.apache.spark.sql.catalyst.util.DateTimeUtils$`. Is there a way to select the name automatically and not get the `$`?
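One common answer (offered here as a suggestion, not necessarily what the PR settled on) is to strip the trailing `$` that Scala appends to the class name of an `object`:

```
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// getClass.getName on a Scala object yields "...DateTimeUtils$";
// stripSuffix drops the "$" so the name is usable from generated Java code.
val dtu = DateTimeUtils.getClass.getName.stripSuffix("$")
// dtu == "org.apache.spark.sql.catalyst.util.DateTimeUtils"
```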
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34866224 --- Diff: python/pyspark/sql/functions.py --- @@ -652,6 +652,135 @@ def ntile(n): return Column(sc._jvm.functions.ntile(int(n))) +@since(1.5) +def dateFormat(dateCol, formatCol): --- End diff -- @rxin camel case or underscore?
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34866189 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala --- @@ -178,7 +178,17 @@ object FunctionRegistry { // datetime functions expression[CurrentDate]("current_date"), -expression[CurrentTimestamp]("current_timestamp") +expression[CurrentTimestamp]("current_timestamp"), +expression[DateFormatClass]("date_format"), +expression[Year]("year"), +expression[Quarter]("quarter"), +expression[Month]("month"), +expression[Day]("day"), --- End diff -- @rxin In Jira you mentioned there should be an alias. Can I just add `expression[Day]("day_of_month")`?
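Illustration only (a toy registry, not Spark's `FunctionRegistry`): an alias is simply the same builder registered under a second name, which is all that an extra `expression[Day]("day_of_month")` entry would do.

```
// Toy sketch: two names resolving to one builder -- the shape of an alias.
object AliasSketch {
  private val registry = scala.collection.mutable.Map.empty[String, Seq[String] => String]

  def register(name: String)(builder: Seq[String] => String): Unit =
    registry.update(name, builder)

  def main(args: Array[String]): Unit = {
    val day: Seq[String] => String = xs => s"Day(${xs.mkString(", ")})"
    register("day")(day)
    register("day_of_month")(day) // the alias: same builder, second name
    println(registry("day_of_month")(Seq("2015-07-15"))) // Day(2015-07-15)
  }
}
```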
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34866129 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala --- @@ -54,3 +60,193 @@ case class CurrentTimestamp() extends LeafExpression { System.currentTimeMillis() * 1000L } } + +case class Hour(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getHours(timestamp.asInstanceOf[Long]) + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getHours($c)""" +) + } +} + +case class Minute(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getMinutes(timestamp.asInstanceOf[Long]) + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getMinutes($c)""" +) + } +} + +case class Second(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(timestamp: Any): Any = { +DateTimeUtils.getSeconds(timestamp.asInstanceOf[Long]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getSeconds($c)""" +) + } +} + +case class DayInYear(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override def prettyName: String = "day_in_year" + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getDayInYear(date.asInstanceOf[Int]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getDayInYear($c)""" +) + } +} + + +case class Year(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getYear(date.asInstanceOf[Int]) + } + + override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getYear($c)""" +) + } +} + +case class Quarter(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +DateTimeUtils.getQuarter(date.asInstanceOf[Int]) + } + + override protected def 
genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val dtu = "org.apache.spark.sql.catalyst.util.DateTimeUtils" +defineCodeGen(ctx, ev, (c) => + s"""$dtu.getQuarter($c)""" +) + } +} + +case class Month(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(DateType) + + override def dataType: DataType = IntegerType + + override protected def nullSafeEval(date: Any): Any = { +
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/6981#discussion_r34865980 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -378,4 +395,183 @@ object DateTimeUtils { c.set(segments(0), segments(1) - 1, segments(2), 0, 0, 0) Some((c.getTimeInMillis / 1000 / 3600 / 24).toInt) } + + /** + * Returns the hour value of a given timestamp value. The timestamp is expressed in microseconds. + */ + def getHours(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000 / 3600) % 24).toInt + } + + /** + * Returns the minute value of a given timestamp value. The timestamp is expressed in + * microseconds. + */ + def getMinutes(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000 / 60) % 60).toInt + } + + /** + * Returns the second value of a given timestamp value. The timestamp is expressed in + * microseconds. + */ + def getSeconds(timestamp: Long): Int = { +val localTs = (timestamp / 1000) + defaultTimeZone.getOffset(timestamp / 1000) +((localTs / 1000) % 60).toInt + } + + private[this] def isLeapYear(year: Int): Boolean = { +(year % 4) == 0 && ((year % 100) != 0 || (year % 400) == 0) + } + + /** + * Return the number of days since the start of 400 year period. + * The second year of a 400 year period (year 1) starts on day 365. + */ + private[this] def yearBoundary(year: Int): Int = { +year * 365 + ((year / 4 ) - (year / 100) + (year / 400)) + } + + /** + * Calculates the number of years for the given number of days. This depends + * on a 400 year period. + * @param days days since the beginning of the 400 year period + * @return number of year + */ + private[this] def numYears(days: Int): Int = { +val year = days / 365 +if (days > yearBoundary(year)) year else year - 1 + } + + /** + * Calculates the year and and the number of the day in the year for the given + * number of days. The given days is the number of days since 1.1.1970. + * + * The calculation uses the fact that the period 1.1.2001 until 31.12.2400 is + * equals to the period 1.1.1601 until 31.12.2000. + */ + private[this] def getYearAndDayInYear(daysSince1970: Int): (Int, Int) = { +// add the difference (in days) between 1.1.1970 and the artificial year 0 (-17999) +val daysNormalized = daysSince1970 + toYearZero +val numOfQuarterCenturies = daysNormalized / daysIn400Years +val daysInThis400 = daysNormalized % daysIn400Years + 1 +val years = numYears(daysInThis400) +val year: Int = (2001 - 2) + 400 * numOfQuarterCenturies + years +val dayInYear = daysInThis400 - yearBoundary(years) +(year, dayInYear) + } + + /** + * Returns the 'day in year' value for the given date. The date is expressed in days + * since 1.1.1970. + */ + def getDayInYear(date: Int): Int = { +getYearAndDayInYear(date)._2 + } + + /** + * Returns the year value for the given date. The date is expressed in days + * since 1.1.1970. + */ + def getYear(date: Int): Int = { +getYearAndDayInYear(date)._1 + } + + /** + * Returns the quarter for the given date. The date is expressed in days + * since 1.1.1970. + */ + def getQuarter(date: Int): Int = { +val (year, dayInYear) = getYearAndDayInYear(date) +val leap = if (isLeapYear(year)) 1 else 0 +if (dayInYear <= 90 + leap) { + 1 +} else if (dayInYear <= 181 + leap) { + 2 +} else if (dayInYear <= 273 + leap) { + 3 +} else { + 4 +} + } + + /** + * Returns the month value for the given date. 
The date is expressed in days + * since 1.1.1970. January is month 1. + */ + def getMonth(date: Int): Int = { +val (year, dayInYear) = getYearAndDayInYear(date) +val leap = if (isLeapYear(year)) 1 else 0 +if (dayInYear <= 31) { + 1 +} else if (dayInYear <= 59 + leap) { + 2 +} else if (dayInYear <= 90 + leap) { + 3 +} else if (dayInYear <= 120 + leap) { + 4 +} else if (dayInYear <= 151 + leap) { + 5 +} else if (dayInYear <= 181 + leap) { + 6 +} else if (dayInYear <= 212 + leap) { + 7 +} else if (dayInYear <= 243 + leap) {
[GitHub] spark pull request: [SPARK-8269][SQL]string function: initcap
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7208#discussion_r34865919 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -593,6 +593,33 @@ case class Levenshtein(left: Expression, right: Expression) extends BinaryExpres } /** + * Returns string, with the first letter of each word in uppercase, + * all other letters in lowercase. Words are delimited by whitespace. + */ +case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes { + override def dataType: DataType = StringType + + override def inputTypes: Seq[DataType] = Seq(StringType) + + override def nullSafeEval(string: Any): Any = { +if (string.asInstanceOf[UTF8String].getBytes.length == 0) { + return string +} +else { + val sb = new StringBuffer() + sb.append(string) + sb.setCharAt(0, sb.charAt(0).toUpper) + for (i <- 1 until sb.length) { +if (sb.charAt(i - 1).equals(' ')) { + sb.setCharAt(i, sb.charAt(i).toUpper) +} + } + UTF8String.fromString(sb.toString) --- End diff -- I think we should consider implementing all of this on bytes directly. The conversion to `Char` isn't safe; I'm not sure what happens if a character doesn't fit into `Char`. Using the assumption that a lowercase and an uppercase character always have the same number of bytes, we could easily use `Array[Byte]`. Even though this isn't guaranteed by Unicode, it seems to be true (maybe we could propose this to Unicode). But we can do this in a follow-up PR.
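For comparison, a code-point based variant that stays safe for characters outside the Basic Multilingual Plane (a sketch; it ignores locale-sensitive and one-to-many case mappings such as `ß` -> `SS`):

```
// Sketch: upper-case by Unicode code point instead of raw bytes or Chars,
// so supplementary characters (which span two Chars) are handled correctly.
def toUpperByCodePoint(s: String): String = {
  val sb = new java.lang.StringBuilder(s.length)
  var i = 0
  while (i < s.length) {
    val cp = s.codePointAt(i)
    sb.appendCodePoint(Character.toUpperCase(cp))
    i += Character.charCount(cp) // 1 for BMP characters, 2 for supplementary
  }
  sb.toString
}
```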
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/6981#issuecomment-122055145 @davies could you trigger Jenkins? I'd like to get an idea of what is still crashing. I expect that `WeekOfYear` will crash (because of different timezones). The other stuff should be resolved by the new casting implementation.
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34762666 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,202 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding a corresponding [[Long]] value. + * The return type is [[Option]] in order to distinguish between 0L and null. The following + * formats are allowed: + * + * `` + * `-[m]m` + * `-[m]m-[d]d` + * `-[m]m-[d]d ` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + */ + def stringToTimestamp(s: UTF8String): Option[Long] = { +if (s == null) { + return None +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +var justTime = false +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - '0'.toByte + if (parsedValue < 0 || parsedValue > 9) { +if (j == 0 && b == 'T') { + justTime = true + i += 3 +} else if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else if (i == 0 && b == ':') { +justTime = true +segments(3) = currentSegmentValue +currentSegmentValue = 0 +i = 4 + } else { +return None + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } +} else if (i == 3 || i == 4) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } +} else if (i == 5 || i == 6) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } + if (i == 6 && b != '.') { +i += 1 + } +} else { + if (b == ':' || b == ' ') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} + +segments(i) = currentSegmentValue + +while (digitsMilli < 6) { + segments(6) *= 10 + digits
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34761324 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,202 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding a corresponding [[Long]] value. + * The return type is [[Option]] in order to distinguish between 0L and null. The following + * formats are allowed: + * + * `` + * `-[m]m` + * `-[m]m-[d]d` + * `-[m]m-[d]d ` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m` + * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m` + */ + def stringToTimestamp(s: UTF8String): Option[Long] = { +if (s == null) { + return None +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +var justTime = false +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - '0'.toByte + if (parsedValue < 0 || parsedValue > 9) { +if (j == 0 && b == 'T') { + justTime = true + i += 3 +} else if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else if (i == 0 && b == ':') { +justTime = true +segments(3) = currentSegmentValue +currentSegmentValue = 0 +i = 4 + } else { +return None + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } +} else if (i == 3 || i == 4) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } +} else if (i == 5 || i == 6) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } + if (i == 6 && b != '.') { +i += 1 + } +} else { + if (b == ':' || b == ' ') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return None + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} + +segments(i) = currentSegmentValue + +while (digitsMilli < 6) { + segments(6) *= 10 + digits
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34742916 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,169 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding [[Timestamp]] object. The format of the + * date has to be one of the following: ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + */ + def stringToTimestamp(s: UTF8String): Timestamp = { +if (s == null) { + return null +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - 48 + if (parsedValue < 0 || parsedValue > 9) { +if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 5) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 7) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } + if (i == 6 && b != '.') { +i += 1 + } +} else if (i > 6) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} +if (i > 8) { + return null +} +segments(i) = currentSegmentValue + +// Hive compatibility 2011-05-06 07:08:09.1000 == 2011-05-06 07:08:09.1 +if (digitsMilli == 4) { + segments(6) = segments(6) / 10 +} + +// 18:3:1.1 is equals to 18:3:1:100 +if (digitsMilli == 1) { + segments(6) = segments(6) * 100 +} else if (digitsMilli == 2) { + segments(6) = segments(6) * 10 +} + +if (segments(0) < 0 || segments(0) > || segments(1) < 1 || segments(1) > 12 || +segments(2) < 1 || segments(2) > 31 || segments(3) < 0 || segments(3) > 23 || +segments(4) < 0 || segments(4) > 59 || segments(5) < 0 || segments(5) > 59 || +segments(6) < 0 || segments(6) > 999 || segments(7) < 0 || segments(7) > 14 || +segments(8) < 0 || segments(8) > 59) { + return null +} +val c = if (timeZone.isEmpty) { + Calendar.getInstance() +} else { + Calendar.getInstance( + TimeZone.getTimeZone(f"GMT${timeZone.get
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7353#issuecomment-121697293 @davies somehow Jenkins wasn't able to fetch from GitHub. Could you trigger Jenkins again?
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7353#issuecomment-121118086 @davies How should we deal with this? I don't know the value of `'value`, but it seems to be something that can be parsed to `1.1.1970`. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37171/testReport/org.apache.spark.sql.hive.execution/HiveQuerySuite/Cast_Timestamp_to_Timestamp_in_UDF/
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34526036 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,169 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding [[Timestamp]] object. The format of the + * date has to be one of the following: ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + */ + def stringToTimestamp(s: UTF8String): Timestamp = { +if (s == null) { + return null +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - 48 + if (parsedValue < 0 || parsedValue > 9) { +if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 5) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 7) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } + if (i == 6 && b != '.') { +i += 1 + } +} else if (i > 6) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} +if (i > 8) { + return null +} +segments(i) = currentSegmentValue + +// Hive compatibility 2011-05-06 07:08:09.1000 == 2011-05-06 07:08:09.1 +if (digitsMilli == 4) { + segments(6) = segments(6) / 10 +} + +// 18:3:1.1 is equals to 18:3:1:100 +if (digitsMilli == 1) { + segments(6) = segments(6) * 100 +} else if (digitsMilli == 2) { + segments(6) = segments(6) * 10 +} + +if (segments(0) < 0 || segments(0) > || segments(1) < 1 || segments(1) > 12 || +segments(2) < 1 || segments(2) > 31 || segments(3) < 0 || segments(3) > 23 || +segments(4) < 0 || segments(4) > 59 || segments(5) < 0 || segments(5) > 59 || +segments(6) < 0 || segments(6) > 999 || segments(7) < 0 || segments(7) > 14 || +segments(8) < 0 || segments(8) > 59) { + return null +} +val c = if (timeZone.isEmpty) { + Calendar.getInstance() +} else { + Calendar.getInstance( + TimeZone.getTimeZone(f"GM
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34523798 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,169 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding [[Timestamp]] object. The format of the + * date has to be one of the following: ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + */ + def stringToTimestamp(s: UTF8String): Timestamp = { +if (s == null) { + return null +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - 48 + if (parsedValue < 0 || parsedValue > 9) { +if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 5) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 7) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } + if (i == 6 && b != '.') { +i += 1 + } +} else if (i > 6) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} +if (i > 8) { + return null +} +segments(i) = currentSegmentValue + +// Hive compatibility 2011-05-06 07:08:09.1000 == 2011-05-06 07:08:09.1 +if (digitsMilli == 4) { + segments(6) = segments(6) / 10 +} + +// 18:3:1.1 is equals to 18:3:1:100 +if (digitsMilli == 1) { --- End diff -- Good idea! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
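The scaling being agreed to, as a standalone sketch (hypothetical helper name; this revision of the patch scales to three millisecond digits, while the later `Option[Long]` revision quoted earlier in this digest scales to six microsecond digits):

```
// Sketch: normalize a parsed fractional-second value to exactly three digits,
// replacing the digitsMilli special cases with two small loops.
def normalizeMillis(value: Int, digits: Int): Int = {
  var v = value
  var d = digits
  while (d < 3) { v *= 10; d += 1 } // "18:3:1.1" -> 100 milliseconds
  while (d > 3) { v /= 10; d -= 1 } // ".1000"    -> 100 (Hive compatibility)
  v
}
```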
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34523785 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,169 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding [[Timestamp]] object. The format of the + * date has to be one of the following: ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + */ + def stringToTimestamp(s: UTF8String): Timestamp = { +if (s == null) { + return null +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - 48 + if (parsedValue < 0 || parsedValue > 9) { +if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 5) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 7) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } + if (i == 6 && b != '.') { +i += 1 + } +} else if (i > 6) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} +if (i > 8) { + return null --- End diff -- Okay. If there is a space the garbage is ignored --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34523735 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,169 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding [[Timestamp]] object. The format of the + * date has to be one of the following: ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + */ + def stringToTimestamp(s: UTF8String): Timestamp = { +if (s == null) { + return null +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - 48 + if (parsedValue < 0 || parsedValue > 9) { +if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 5) { + if (b == ':') { --- End diff -- This is equal to `i == 3 || i == 4` because of the `if` and `else if` branches before. I am going to adjust the checks so that they are more readable.
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34521930 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,169 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding [[Timestamp]] object. The format of the + * date has to be one of the following: ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + */ + def stringToTimestamp(s: UTF8String): Timestamp = { +if (s == null) { + return null +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - 48 + if (parsedValue < 0 || parsedValue > 9) { +if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 5) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 7) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } + if (i == 6 && b != '.') { +i += 1 + } +} else if (i > 6) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} +if (i > 8) { + return null +} +segments(i) = currentSegmentValue + +// Hive compatibility 2011-05-06 07:08:09.1000 == 2011-05-06 07:08:09.1 +if (digitsMilli == 4) { + segments(6) = segments(6) / 10 +} + +// 18:3:1.1 is equals to 18:3:1:100 +if (digitsMilli == 1) { + segments(6) = segments(6) * 100 +} else if (digitsMilli == 2) { + segments(6) = segments(6) * 10 +} + +if (segments(0) < 0 || segments(0) > || segments(1) < 1 || segments(1) > 12 || +segments(2) < 1 || segments(2) > 31 || segments(3) < 0 || segments(3) > 23 || +segments(4) < 0 || segments(4) > 59 || segments(5) < 0 || segments(5) > 59 || +segments(6) < 0 || segments(6) > 999 || segments(7) < 0 || segments(7) > 14 || +segments(8) < 0 || segments(8) > 59) { + return null +} +val c = if (timeZone.isEmpty) { + Calendar.getInstance() +} else { + Calendar.getInstance( + TimeZone.getTimeZone(f"GMT${timeZone.get
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7353#issuecomment-121007809 @davies thanks for all your good comments. I'm going to incorporate your suggestions. I don't accept `18:03:20` yet; the design document doesn't allow this. I think we should parse a pure time string to today + time. But I wanted to double-check this with you guys.
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34486426 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala --- @@ -165,15 +165,8 @@ case class Cast(child: Expression, dataType: DataType) extends UnaryExpression w private[this] def castToTimestamp(from: DataType): Any => Any = from match { case StringType => buildCast[UTF8String](_, utfs => { -// Throw away extra if more than 9 decimal places -val s = utfs.toString -val periodIdx = s.indexOf(".") -var n = s -if (periodIdx != -1 && n.length() - periodIdx > 9) { - n = n.substring(0, periodIdx + 10) -} -try DateTimeUtils.fromJavaTimestamp(Timestamp.valueOf(n)) -catch { case _: java.lang.IllegalArgumentException => null } +val parsedDateString = DateTimeUtils.stringToTimestamp(utfs) +if (parsedDateString == null) null else DateTimeUtils.fromJavaTimestamp(parsedDateString) --- End diff -- I'm going to adjust this.
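With the `Option[Long]`-returning `stringToTimestamp` quoted earlier in this digest, the adjustment can collapse the cast to a map-or-null (a sketch of the direction, not the merged code):

```
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.unsafe.types.UTF8String

// Sketch: no intermediate java.sql.Timestamp and no explicit null check --
// None becomes null, Some(us) becomes the boxed microsecond value.
def castStringToTimestamp(utfs: UTF8String): Any =
  DateTimeUtils.stringToTimestamp(utfs).map(Long.box).orNull
```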
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7353#discussion_r34486393 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -180,4 +182,169 @@ object DateTimeUtils { val nanos = (us % MICROS_PER_SECOND) * 1000L (day.toInt, secondsInDay * NANOS_PER_SECOND + nanos) } + + /** + * Parses a given UTF8 date string to the corresponding [[Timestamp]] object. The format of the + * date has to be one of the following: ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, + * `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, + */ + def stringToTimestamp(s: UTF8String): Timestamp = { +if (s == null) { + return null +} +var timeZone: Option[Byte] = None +val segments: Array[Int] = Array[Int](1, 1, 1, 0, 0, 0, 0, 0, 0) +var i = 0 +var currentSegmentValue = 0 +val bytes = s.getBytes +var j = 0 +var digitsMilli = 0 +while (j < bytes.length) { + val b = bytes(j) + val parsedValue = b - 48 + if (parsedValue < 0 || parsedValue > 9) { +if (i < 2) { + if (b == '-') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i == 2) { + if (b == ' ' || b == 'T') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 5) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} else if (i < 7) { + if (b == 'Z') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(43) + } else if (b == '-' || b == '+') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 +timeZone = Some(b) + } else if (b == '.' && i == 5) { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } + if (i == 6 && b != '.') { +i += 1 + } +} else if (i > 6) { + if (b == ':') { +segments(i) = currentSegmentValue +currentSegmentValue = 0 +i += 1 + } else { +return null + } +} + } else { +if (i == 6) { + digitsMilli += 1 +} +currentSegmentValue = currentSegmentValue * 10 + parsedValue + } + j += 1 +} +if (i > 8) { + return null +} +segments(i) = currentSegmentValue + +// Hive compatibility 2011-05-06 07:08:09.1000 == 2011-05-06 07:08:09.1 +if (digitsMilli == 4) { + segments(6) = segments(6) / 10 +} + +// 18:3:1.1 is equals to 18:3:1:100 +if (digitsMilli == 1) { + segments(6) = segments(6) * 100 +} else if (digitsMilli == 2) { + segments(6) = segments(6) * 10 +} + +if (segments(0) < 0 || segments(0) > || segments(1) < 1 || segments(1) > 12 || +segments(2) < 1 || segments(2) > 31 || segments(3) < 0 || segments(3) > 23 || +segments(4) < 0 || segments(4) > 59 || segments(5) < 0 || segments(5) > 59 || +segments(6) < 0 || segments(6) > 999 || segments(7) < 0 || segments(7) > 14 || +segments(8) < 0 || segments(8) > 59) { + return null +} +val c = if (timeZone.isEmpty) { + Calendar.getInstance() +} else { + Calendar.getInstance( + TimeZone.getTimeZone(f"GMT${timeZo
[GitHub] spark pull request: [SPARK-8269][SQL]string function: initcap
Github user tarekauel commented on a diff in the pull request: https://github.com/apache/spark/pull/7208#discussion_r34435744 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -570,6 +570,37 @@ case class StringLength(child: Expression) extends UnaryExpression with ExpectsI } /** + * Returns string, with the first letter of each word in uppercase, + * all other letters in lowercase. Words are delimited by whitespace. + */ +case class InitCap(child: Expression) extends UnaryExpression with ExpectsInputTypes { + override def dataType: DataType = StringType + + override def inputTypes: Seq[DataType] = Seq(StringType) + + override def eval(input: InternalRow): Any = { +val string = child.eval(input) +if (string == null) { + null +} +else if (string.asInstanceOf[UTF8String].getBytes.length == 0) { + UTF8String.fromString(string.toString) --- End diff -- @HuJiayin there was an `n` missing. I updated the command.
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7353#issuecomment-120668526 @cloud-fan The following strings can be parsed now: (String -> Date) ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, `-[m]m-[d]d *`, `-[m]m-[d]dT*` (String -> Timestamp) ``, `-[m]m`, `-[m]m-[d]d`, `-[m]m-[d]d `, `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]`, `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, `-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`, `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]`, `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z`, `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m`, `-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m`
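For a sense of how these formats behave at the API level, a hedged usage sketch against the `Option[Long]`-returning signature quoted earlier in this digest (actual values depend on the default time zone):

```
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.unsafe.types.UTF8String

// Accepted: full timestamp with fractional seconds.
val ok = DateTimeUtils.stringToTimestamp(
  UTF8String.fromString("2015-07-15 10:00:00.1234")) // Some(<microseconds>)

// Rejected: the month segment is out of range, so the result is None
// rather than some arbitrary value.
val bad = DateTimeUtils.stringToTimestamp(
  UTF8String.fromString("2015-13-01")) // None
```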
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/7353#issuecomment-120637117 Okay, yes, I'm going to extend it. What Hive does support is parsing the minute from `13:20:08`. Our minute method requires a Timestamp. Either we don't support the `minute` expression on pure time information, or we define a cast from a time string to timestamp. But even if we do that, we are still not compatible with Hive, because we would parse a pure date like `2015-10-20` to `2015-10-20 00:00:00` and could extract the minute `0`, whereas Hive would return `null`.
[GitHub] spark pull request: [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-818...
Github user tarekauel commented on the pull request: https://github.com/apache/spark/pull/6981#issuecomment-120564697 @davies I proposed a solution for the cast issue in #7353