[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14004


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-08 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r70029429
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -17,10 +17,15 @@
 
 package org.apache.spark.sql.catalyst.expressions
 
+import java.text.BreakIterator
--- End diff --

I'll move it to stringExpressions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-08 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r70029340
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +203,67 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
+  extended = "> SELECT _FUNC_('Hi there! Good morning.');\n  
[['Hi','there'], ['Good','morning']]")
+case class Sentences(
+str: Expression,
+language: Expression = Literal(""),
+country: Expression = Literal(""))
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
--- End diff --

It is needed for Hive compatibility, and there is a testcase for that.
```
checkAnswer(
  df.selectExpr("sentences(null)", "sentences(10)", "sentences(3.14)"),
  Row(null, Seq(Seq("10")), Seq(Seq("3.14"
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-08 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r70029134
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +203,67 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
--- End diff --

Thank you for review! I'll fix soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r70028972
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -17,10 +17,15 @@
 
 package org.apache.spark.sql.catalyst.expressions
 
+import java.text.BreakIterator
--- End diff --

why is this expression in this file?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r70028864
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +203,67 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
+  extended = "> SELECT _FUNC_('Hi there! Good morning.');\n  
[['Hi','there'], ['Good','morning']]")
+case class Sentences(
+str: Expression,
+language: Expression = Literal(""),
+country: Expression = Literal(""))
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
--- End diff --

ImplicitCastInputTypes -> ExpectsInputType


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r70028809
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +203,67 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
--- End diff --

```
_FUNC_(str[, lang, country])
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69948642
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +203,67 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
+  extended = "> SELECT _FUNC_('Hi there! Good morning.');\n  
[['Hi','there'], ['Good','morning']]")
+case class Sentences(
+str: Expression,
+language: Expression = Literal(""),
+country: Expression = Literal(""))
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  def this(str: Expression) = this(str, Literal(""), Literal(""))
+  def this(str: Expression, language: Expression) = this(str, language, 
Literal(""))
+
+  override def nullable: Boolean = true
+  override def dataType: DataType =
+ArrayType(ArrayType(StringType, containsNull = false), containsNull = 
false)
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+  override def children: Seq[Expression] = str :: language :: country :: 
Nil
+
+  override def eval(input: InternalRow): Any = {
+val string = str.eval(input)
+if (string == null) {
+  null
+} else {
+  var locale = Locale.getDefault
+  val lang = language.eval(input)
+  val coun = country.eval(input)
+  if (lang != null && coun != null) {
--- End diff --

Oh, it's my mistake. Sorry. I'll fix soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69943328
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +203,67 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
+  extended = "> SELECT _FUNC_('Hi there! Good morning.');\n  
[['Hi','there'], ['Good','morning']]")
+case class Sentences(
+str: Expression,
+language: Expression = Literal(""),
+country: Expression = Literal(""))
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  def this(str: Expression) = this(str, Literal(""), Literal(""))
+  def this(str: Expression, language: Expression) = this(str, language, 
Literal(""))
+
+  override def nullable: Boolean = true
+  override def dataType: DataType =
+ArrayType(ArrayType(StringType, containsNull = false), containsNull = 
false)
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+  override def children: Seq[Expression] = str :: language :: country :: 
Nil
+
+  override def eval(input: InternalRow): Any = {
+val string = str.eval(input)
+if (string == null) {
+  null
+} else {
+  var locale = Locale.getDefault
+  val lang = language.eval(input)
+  val coun = country.eval(input)
+  if (lang != null && coun != null) {
--- End diff --

`language.eval(input)` is `null`, isn't it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69942945
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +203,66 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
+  extended = "> SELECT _FUNC_('Hi there! Good morning.');\n  
[['Hi','there'], ['Good','morning']]")
+case class Sentences(
+str: Expression,
+language: Expression = Literal(""),
+country: Expression = Literal(""))
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  def this(str: Expression) = this(str, Literal(""), Literal(""))
+  def this(str: Expression, language: Expression) = this(str, language, 
Literal(""))
+
+  override def nullable: Boolean = true
+  override def dataType: DataType =
+ArrayType(ArrayType(StringType, containsNull = false), containsNull = 
false)
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+  override def children: Seq[Expression] = str :: language :: country :: 
Nil
+
+  override def eval(input: InternalRow): Any = {
+val string = str.eval(input)
+if (string == null) {
+  null
+} else {
+  val locale = try {
+new Locale(language.eval(input).asInstanceOf[UTF8String].toString,
+  country.eval(input).asInstanceOf[UTF8String].toString)
+  } catch {
+case _: NullPointerException | _: ClassCastException => 
Locale.getDefault
--- End diff --

Oh, it was ambiguous. `new Locale("abc", "xyz")` is created, it's not null. 
The following works but uses JVM Default locale. 
`BreakIterator.getSentenceInstance` does not return `null` with wrong locale 
instance.
```
scala> val x = java.text.BreakIterator.getSentenceInstance(new 
java.util.Locale("xxx", "bbb"))
x: java.text.BreakIterator = [checksum=0xcba403eb]
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69893328
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +203,66 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
+  extended = "> SELECT _FUNC_('Hi there! Good morning.');\n  
[['Hi','there'], ['Good','morning']]")
+case class Sentences(
+str: Expression,
+language: Expression = Literal(""),
+country: Expression = Literal(""))
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  def this(str: Expression) = this(str, Literal(""), Literal(""))
+  def this(str: Expression, language: Expression) = this(str, language, 
Literal(""))
+
+  override def nullable: Boolean = true
+  override def dataType: DataType =
+ArrayType(ArrayType(StringType, containsNull = false), containsNull = 
false)
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+  override def children: Seq[Expression] = str :: language :: country :: 
Nil
+
+  override def eval(input: InternalRow): Any = {
+val string = str.eval(input)
+if (string == null) {
+  null
+} else {
+  val locale = try {
+new Locale(language.eval(input).asInstanceOf[UTF8String].toString,
+  country.eval(input).asInstanceOf[UTF8String].toString)
+  } catch {
+case _: NullPointerException | _: ClassCastException => 
Locale.getDefault
--- End diff --

what do you mean by `ignored`? returning null or returning the default 
locale?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69893166
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +203,67 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
+  extended = "> SELECT _FUNC_('Hi there! Good morning.');\n  
[['Hi','there'], ['Good','morning']]")
+case class Sentences(
+str: Expression,
+language: Expression = Literal(""),
+country: Expression = Literal(""))
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  def this(str: Expression) = this(str, Literal(""), Literal(""))
+  def this(str: Expression, language: Expression) = this(str, language, 
Literal(""))
+
+  override def nullable: Boolean = true
+  override def dataType: DataType =
+ArrayType(ArrayType(StringType, containsNull = false), containsNull = 
false)
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+  override def children: Seq[Expression] = str :: language :: country :: 
Nil
+
+  override def eval(input: InternalRow): Any = {
+val string = str.eval(input)
+if (string == null) {
+  null
+} else {
+  var locale = Locale.getDefault
+  val lang = language.eval(input)
+  val coun = country.eval(input)
+  if (lang != null && coun != null) {
--- End diff --

I'd like to write:
```
val languageStr = language.eval(input).asInstanceOf[UTF8String]
val countryStr = country.eval(input).asInstanceOf[UTF8String]
val locale = if (languageStr != null && countryStr != null) {
  new Locale(languageStr, countryStr)
} else {
  Locale.getDefault
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69857442
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +203,66 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
+  extended = "> SELECT _FUNC_('Hi there! Good morning.');\n  
[['Hi','there'], ['Good','morning']]")
+case class Sentences(
+str: Expression,
+language: Expression = Literal(""),
+country: Expression = Literal(""))
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  def this(str: Expression) = this(str, Literal(""), Literal(""))
+  def this(str: Expression, language: Expression) = this(str, language, 
Literal(""))
+
+  override def nullable: Boolean = true
+  override def dataType: DataType =
+ArrayType(ArrayType(StringType, containsNull = false), containsNull = 
false)
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+  override def children: Seq[Expression] = str :: language :: country :: 
Nil
+
+  override def eval(input: InternalRow): Any = {
+val string = str.eval(input)
+if (string == null) {
+  null
+} else {
+  var locale = Locale.getDefault
+  if (language != null && language.eval(input) != null &&
--- End diff --

Oh. I didn't think in that way. Right for both. Thank you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69857049
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +203,66 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
+  extended = "> SELECT _FUNC_('Hi there! Good morning.');\n  
[['Hi','there'], ['Good','morning']]")
+case class Sentences(
+str: Expression,
+language: Expression = Literal(""),
+country: Expression = Literal(""))
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  def this(str: Expression) = this(str, Literal(""), Literal(""))
+  def this(str: Expression, language: Expression) = this(str, language, 
Literal(""))
+
+  override def nullable: Boolean = true
+  override def dataType: DataType =
+ArrayType(ArrayType(StringType, containsNull = false), containsNull = 
false)
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+  override def children: Seq[Expression] = str :: language :: country :: 
Nil
+
+  override def eval(input: InternalRow): Any = {
+val string = str.eval(input)
+if (string == null) {
+  null
+} else {
+  var locale = Locale.getDefault
+  if (language != null && language.eval(input) != null &&
--- End diff --

Usually we don't check the nullability of expression, but the eval result 
of expression. And it's really bad we call `eval` twice: one in the if 
condition and one in the if body.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69854198
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +203,66 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
+  extended = "> SELECT _FUNC_('Hi there! Good morning.');\n  
[['Hi','there'], ['Good','morning']]")
+case class Sentences(
+str: Expression,
+language: Expression = Literal(""),
+country: Expression = Literal(""))
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  def this(str: Expression) = this(str, Literal(""), Literal(""))
+  def this(str: Expression, language: Expression) = this(str, language, 
Literal(""))
+
+  override def nullable: Boolean = true
+  override def dataType: DataType =
+ArrayType(ArrayType(StringType, containsNull = false), containsNull = 
false)
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+  override def children: Seq[Expression] = str :: language :: country :: 
Nil
+
+  override def eval(input: InternalRow): Any = {
+val string = str.eval(input)
+if (string == null) {
+  null
+} else {
+  val locale = try {
+new Locale(language.eval(input).asInstanceOf[UTF8String].toString,
+  country.eval(input).asInstanceOf[UTF8String].toString)
+  } catch {
+case _: NullPointerException | _: ClassCastException => 
Locale.getDefault
--- End diff --

It created the wrong locale, then it compared the system available locales. 
So, finally, ignored. The following is the underlying code in `rt.jar`.
```
List var4 = 
Control.getControl(Control.FORMAT_DEFAULT).getCandidateLocales("", var1);
Iterator var5 = var4.iterator();

while(var5.hasNext()) {
Locale var6 = (Locale)var5.next();
if(!var6.equals(var1)) {
var2 = findAdapter(var0, var6);
if(var2 != null) {
((ConcurrentMap)var3).putIfAbsent(var1, var2);
return var2;
}
}
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69853682
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,41 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("Sentences") {
+val nullString = Literal.create(null, StringType)
+checkEvaluation(Sentences(nullString, nullString, nullString), null, 
EmptyRow)
+checkEvaluation(Sentences(nullString, nullString), null, EmptyRow)
+checkEvaluation(Sentences(nullString), null, EmptyRow)
+checkEvaluation(Sentences(Literal.create(null, NullType)), null, 
EmptyRow)
+checkEvaluation(Sentences("", nullString, nullString), Seq.empty, 
EmptyRow)
+checkEvaluation(Sentences("", nullString), Seq.empty, EmptyRow)
+checkEvaluation(Sentences(""), Seq.empty, EmptyRow)
+
+val correct_answer = Seq(
+  Seq("Hi", "there"),
+  Seq("The", "price", "was"),
+  Seq("But", "not", "now"))
+
+// Hive compatible test-cases.
+checkEvaluation(
+  Sentences("Hi there! The price was $1,234.56 But, not now."),
+  correct_answer,
+  EmptyRow)
--- End diff --

Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69853496
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala ---
@@ -347,4 +347,24 @@ class StringFunctionsSuite extends QueryTest with 
SharedSQLContext {
   df2.filter("b>0").selectExpr("format_number(a, b)"),
   Row("5.") :: Row("4.000") :: Row("4.000") :: Row("4.000") :: 
Row("3.00") :: Nil)
   }
+
+  test("string sentences function") {
+val df = Seq(("Hi there! The price was $1,234.56 But, not now.", 
"en", "US"))
+  .toDF("str", "language", "country")
+
+checkAnswer(
+  df.selectExpr("sentences(str, language, country)"),
+  Row(Seq(Seq("Hi", "there"), Seq("The", "price", "was"), Seq("But", 
"not", "now"
+
+// Type coercion
+checkAnswer(
+  df.selectExpr("sentences(null)", "sentences(10)", "sentences(3.14)"),
+  Row(null, Seq(Seq("10")), Seq(Seq("3.14"
+
+// Argument number exception
+val m = intercept[AnalysisException] {
+  df.selectExpr("sentences()")
+}.getMessage
+assert(m.contains("Invalid number of arguments"))
--- End diff --

It is `Invalid number of arguments for function sentences`. I'll update 
this, too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69848762
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala ---
@@ -347,4 +347,24 @@ class StringFunctionsSuite extends QueryTest with 
SharedSQLContext {
   df2.filter("b>0").selectExpr("format_number(a, b)"),
   Row("5.") :: Row("4.000") :: Row("4.000") :: Row("4.000") :: 
Row("3.00") :: Nil)
   }
+
+  test("string sentences function") {
+val df = Seq(("Hi there! The price was $1,234.56 But, not now.", 
"en", "US"))
+  .toDF("str", "language", "country")
+
+checkAnswer(
+  df.selectExpr("sentences(str, language, country)"),
+  Row(Seq(Seq("Hi", "there"), Seq("The", "price", "was"), Seq("But", 
"not", "now"
+
+// Type coercion
+checkAnswer(
+  df.selectExpr("sentences(null)", "sentences(10)", "sentences(3.14)"),
+  Row(null, Seq(Seq("10")), Seq(Seq("3.14"
+
+// Argument number exception
+val m = intercept[AnalysisException] {
+  df.selectExpr("sentences()")
+}.getMessage
+assert(m.contains("Invalid number of arguments"))
--- End diff --

btw what's the full error message here? I thought it would be `function not 
found` or something...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69848674
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,41 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("Sentences") {
+val nullString = Literal.create(null, StringType)
+checkEvaluation(Sentences(nullString, nullString, nullString), null, 
EmptyRow)
+checkEvaluation(Sentences(nullString, nullString), null, EmptyRow)
+checkEvaluation(Sentences(nullString), null, EmptyRow)
+checkEvaluation(Sentences(Literal.create(null, NullType)), null, 
EmptyRow)
+checkEvaluation(Sentences("", nullString, nullString), Seq.empty, 
EmptyRow)
+checkEvaluation(Sentences("", nullString), Seq.empty, EmptyRow)
+checkEvaluation(Sentences(""), Seq.empty, EmptyRow)
+
+val correct_answer = Seq(
+  Seq("Hi", "there"),
+  Seq("The", "price", "was"),
+  Seq("But", "not", "now"))
+
+// Hive compatible test-cases.
+checkEvaluation(
+  Sentences("Hi there! The price was $1,234.56 But, not now."),
+  correct_answer,
+  EmptyRow)
--- End diff --

`EmptyRow` is the default value, we don't need to pass it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69848618
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +203,66 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
+  extended = "> SELECT _FUNC_('Hi there! Good morning.');\n  
[['Hi','there'], ['Good','morning']]")
+case class Sentences(
+str: Expression,
+language: Expression = Literal(""),
+country: Expression = Literal(""))
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  def this(str: Expression) = this(str, Literal(""), Literal(""))
+  def this(str: Expression, language: Expression) = this(str, language, 
Literal(""))
+
+  override def nullable: Boolean = true
+  override def dataType: DataType =
+ArrayType(ArrayType(StringType, containsNull = false), containsNull = 
false)
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+  override def children: Seq[Expression] = str :: language :: country :: 
Nil
+
+  override def eval(input: InternalRow): Any = {
+val string = str.eval(input)
+if (string == null) {
+  null
+} else {
+  val locale = try {
+new Locale(language.eval(input).asInstanceOf[UTF8String].toString,
+  country.eval(input).asInstanceOf[UTF8String].toString)
+  } catch {
+case _: NullPointerException | _: ClassCastException => 
Locale.getDefault
--- End diff --

I'd like to use `if` to check the null explicitly. And `ClassCastException` 
will never happen as we have type checking.
BTW what will be the result of invalid language and country? e.g. `new 
Locale("abc", "xyz")`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69680871
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +202,63 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
+  extended = "> SELECT _FUNC_('Hi there! Good morning.');\n  
[['Hi','there'], ['Good','morning']]")
+case class Sentences(
+str: Expression,
+language: Expression = Literal(""),
+country: Expression = Literal(""))
+  extends TernaryExpression with ImplicitCastInputTypes with 
CodegenFallback {
+
+  def this(str: Expression) = this(str, Literal(""), Literal(""))
+  def this(str: Expression, language: Expression) = this(str, language, 
Literal(""))
+
+  override def dataType: DataType = ArrayType(ArrayType(StringType))
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+  override def children: Seq[Expression] = str :: language :: country :: 
Nil
+
+  override def nullSafeEval(string: Any, language: Any, country: Any): Any 
= {
+val sentences = getSentences(string.asInstanceOf[UTF8String].toString,
+  language.asInstanceOf[UTF8String].toString, 
country.asInstanceOf[UTF8String].toString)
+val result = ArrayBuffer.empty[GenericArrayData]
+sentences.foreach(sentence => result += new 
GenericArrayData(sentence.toArray))
+new GenericArrayData(result.toArray)
+  }
+
+  private def getSentences(sentences: String, language: String, country: 
String) = {
+val locale = try {
+  new Locale(language, country)
+} finally {
+  Locale.getDefault
+}
+
+val bi = BreakIterator.getSentenceInstance(locale)
+bi.setText(sentences)
+var idx = 0
+val result = new ArrayBuffer[ArrayBuffer[UTF8String]]
+while (bi.next != BreakIterator.DONE) {
+  val sentence = sentences.substring(idx, bi.current)
+  idx = bi.current
+
+  val wi = BreakIterator.getWordInstance(locale)
+  var widx = 0
+  wi.setText(sentence)
+  val words = new ArrayBuffer[UTF8String]
+  while (wi.next != BreakIterator.DONE) {
+val word = sentence.substring(widx, wi.current)
+widx = wi.current
+if (Character.isLetterOrDigit(word.charAt(0))) words += 
UTF8String.fromString(word)
--- End diff --

Yes. This is the Hive rule.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69680850
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +202,63 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
+  extended = "> SELECT _FUNC_('Hi there! Good morning.');\n  
[['Hi','there'], ['Good','morning']]")
+case class Sentences(
+str: Expression,
+language: Expression = Literal(""),
+country: Expression = Literal(""))
+  extends TernaryExpression with ImplicitCastInputTypes with 
CodegenFallback {
+
+  def this(str: Expression) = this(str, Literal(""), Literal(""))
+  def this(str: Expression, language: Expression) = this(str, language, 
Literal(""))
+
+  override def dataType: DataType = ArrayType(ArrayType(StringType))
--- End diff --

Hmm. Right. The return `dataType` could be `null`, but has no `nullable` 
element. I'll fix like the following.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69679940
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +202,63 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
+  extended = "> SELECT _FUNC_('Hi there! Good morning.');\n  
[['Hi','there'], ['Good','morning']]")
+case class Sentences(
+str: Expression,
+language: Expression = Literal(""),
+country: Expression = Literal(""))
+  extends TernaryExpression with ImplicitCastInputTypes with 
CodegenFallback {
+
+  def this(str: Expression) = this(str, Literal(""), Literal(""))
+  def this(str: Expression, language: Expression) = this(str, language, 
Literal(""))
+
+  override def dataType: DataType = ArrayType(ArrayType(StringType))
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+  override def children: Seq[Expression] = str :: language :: country :: 
Nil
+
+  override def nullSafeEval(string: Any, language: Any, country: Any): Any 
= {
+val sentences = getSentences(string.asInstanceOf[UTF8String].toString,
+  language.asInstanceOf[UTF8String].toString, 
country.asInstanceOf[UTF8String].toString)
+val result = ArrayBuffer.empty[GenericArrayData]
+sentences.foreach(sentence => result += new 
GenericArrayData(sentence.toArray))
+new GenericArrayData(result.toArray)
+  }
+
+  private def getSentences(sentences: String, language: String, country: 
String) = {
+val locale = try {
+  new Locale(language, country)
+} finally {
+  Locale.getDefault
+}
+
+val bi = BreakIterator.getSentenceInstance(locale)
+bi.setText(sentences)
+var idx = 0
+val result = new ArrayBuffer[ArrayBuffer[UTF8String]]
+while (bi.next != BreakIterator.DONE) {
+  val sentence = sentences.substring(idx, bi.current)
+  idx = bi.current
+
+  val wi = BreakIterator.getWordInstance(locale)
+  var widx = 0
+  wi.setText(sentence)
+  val words = new ArrayBuffer[UTF8String]
+  while (wi.next != BreakIterator.DONE) {
+val word = sentence.substring(widx, wi.current)
+widx = wi.current
+if (Character.isLetterOrDigit(word.charAt(0))) words += 
UTF8String.fromString(word)
--- End diff --

is it hive's rule? i.e. only include words that begin with letter or digit


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69679712
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +202,63 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, lang, country) - Splits str into an array of array 
of words.",
+  extended = "> SELECT _FUNC_('Hi there! Good morning.');\n  
[['Hi','there'], ['Good','morning']]")
+case class Sentences(
+str: Expression,
+language: Expression = Literal(""),
+country: Expression = Literal(""))
+  extends TernaryExpression with ImplicitCastInputTypes with 
CodegenFallback {
+
+  def this(str: Expression) = this(str, Literal(""), Literal(""))
+  def this(str: Expression, language: Expression) = this(str, language, 
Literal(""))
+
+  override def dataType: DataType = ArrayType(ArrayType(StringType))
--- End diff --

Is the String element nullable?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r6967
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -801,6 +804,49 @@ public static UTF8String concatWs(UTF8String 
separator, UTF8String... inputs) {
 return res;
   }
 
+  /**
+   * Return a locale of the given language and country, or a default 
locale when failures occur.
+   */
+  private Locale getLocale(UTF8String language, UTF8String country) {
--- End diff --

Sorry I mean the `sentences` method, not the `Sentences` expression...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69675532
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +200,35 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
--- End diff --

```
hive> SELECT sentences('hi there', 'x', '');
OK
_c0
[["hi","there"]]
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69675519
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +200,35 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
--- End diff --

Yep. It's Hive behavior. Hive supports `sentences(str)`, 
`sentences(str,lang)`, and `sentences(str,lang,country)`.
And Hive does not raise exception for the illegal `lang` and `country`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69674992
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +200,35 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
--- End diff --

Can you check hive's behaviour? Is it legal to omit only one of them? and 
will hive throw exception if the 'lang' and 'country' are illegal?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69674949
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -801,6 +804,49 @@ public static UTF8String concatWs(UTF8String 
separator, UTF8String... inputs) {
 return res;
   }
 
+  /**
+   * Return a locale of the given language and country, or a default 
locale when failures occur.
+   */
+  private Locale getLocale(UTF8String language, UTF8String country) {
--- End diff --

Okay. No problem. I'll move this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69674859
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -801,6 +804,49 @@ public static UTF8String concatWs(UTF8String 
separator, UTF8String... inputs) {
 return res;
   }
 
+  /**
+   * Return a locale of the given language and country, or a default 
locale when failures occur.
+   */
+  private Locale getLocale(UTF8String language, UTF8String country) {
--- End diff --

I think it's ok to inline this method into `sentences`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69407577
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala
 ---
@@ -246,4 +246,31 @@ class ComplexTypeSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 checkMetadata(CreateStructUnsafe(Seq(a, b)))
 checkMetadata(CreateNamedStructUnsafe(Seq("a", a, "b", b)))
   }
+
+  test("Sentences") {
+// Hive compatible test-cases.
+checkEvaluation(
+  Sentences("Hi there! The price was $1,234.56 But, not now."),
+  Seq(
+Seq("Hi", "there").map(UTF8String.fromString),
+Seq("The", "price", "was").map(UTF8String.fromString),
+Seq("But", "not", "now").map(UTF8String.fromString)),
+  EmptyRow)
+
+checkEvaluation(
+  Sentences("Hi there! The price was $1,234.56 But, not now.", 
"en"),
+  Seq(
+Seq("Hi", "there").map(UTF8String.fromString),
+Seq("The", "price", "was").map(UTF8String.fromString),
+Seq("But", "not", "now").map(UTF8String.fromString)),
+  EmptyRow)
+
+checkEvaluation(
+  Sentences("Hi there! The price was $1,234.56 But, not now.", 
"en", "US"),
+Seq(
--- End diff --

wrong ident here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69388199
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -801,6 +804,49 @@ public static UTF8String concatWs(UTF8String 
separator, UTF8String... inputs) {
 return res;
   }
 
+  /**
+   * Return a locale of the given language and country, or a default 
locale when failures occur.
+   */
+  private Locale getLocale(UTF8String language, UTF8String country) {
+try {
+  return new Locale(language.toString(), country.toString());
+} finally {
+  return Locale.getDefault();
+}
+  }
+
+  /**
+   * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+   */
+  public ArrayList sentences(UTF8String language, 
UTF8String country) {
--- End diff --

hmmm, I'm confused too. @rxin what's the convention here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69325521
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -801,6 +804,49 @@ public static UTF8String concatWs(UTF8String 
separator, UTF8String... inputs) {
 return res;
   }
 
+  /**
+   * Return a locale of the given language and country, or a default 
locale when failures occur.
+   */
+  private Locale getLocale(UTF8String language, UTF8String country) {
+try {
+  return new Locale(language.toString(), country.toString());
+} finally {
+  return Locale.getDefault();
+}
+  }
+
+  /**
+   * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+   */
+  public ArrayList sentences(UTF8String language, 
UTF8String country) {
--- End diff --

For backward compatibility, we need to make Wrappers here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69325407
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -801,6 +804,49 @@ public static UTF8String concatWs(UTF8String 
separator, UTF8String... inputs) {
 return res;
   }
 
+  /**
+   * Return a locale of the given language and country, or a default 
locale when failures occur.
+   */
+  private Locale getLocale(UTF8String language, UTF8String country) {
+try {
+  return new Locale(language.toString(), country.toString());
+} finally {
+  return Locale.getDefault();
+}
+  }
+
+  /**
+   * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+   */
+  public ArrayList sentences(UTF8String language, 
UTF8String country) {
--- End diff --

Oh, I thought it's a convention.

+1. Splitting not-native UTF8String functions sounds good to me. We need to 
split the followings together.

- split

https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L795

- translate

https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L805

- toTitleCaseSlow

https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L408




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69305302
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -801,6 +804,49 @@ public static UTF8String concatWs(UTF8String 
separator, UTF8String... inputs) {
 return res;
   }
 
+  /**
+   * Return a locale of the given language and country, or a default 
locale when failures occur.
+   */
+  private Locale getLocale(UTF8String language, UTF8String country) {
+try {
+  return new Locale(language.toString(), country.toString());
+} finally {
+  return Locale.getDefault();
+}
+  }
+
+  /**
+   * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+   */
+  public ArrayList sentences(UTF8String language, 
UTF8String country) {
--- End diff --

This method is implemented by String not UTF8String, I think we should put 
it into another util object(maybe in scala) and takes String as arguments. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-06-30 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/14004

[SPARK-16285][SQL] Implement sentences SQL functions

## What changes were proposed in this pull request?

This PR implements `sentences` SQL function.

## How was this patch tested?

Pass the Jenkins tests with a new testcase.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK_16285

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14004.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14004


commit f29a8c3c9e56a33cff4a7efe481b3ccc7bc9dcfa
Author: Dongjoon Hyun 
Date:   2016-06-30T22:08:09Z

[SPARK-16285][SQL] Implement sentences SQL functions




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org