[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-06-30 Thread janplus
GitHub user janplus opened a pull request:

https://github.com/apache/spark/pull/14008

[SPARK-16281][SQL] Implement parse_url SQL function

## What changes were proposed in this pull request?

This PR adds parse_url SQL functions in order to remove Hive fallback.

A new implementation of #13999 

## How was this patch tested?

Pass the exist tests including new testcases.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/janplus/spark SPARK-16281

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14008.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14008


commit def598206e2bcfae4553edf18333872d2f2eff35
Author: wujian 
Date:   2016-07-01T05:24:34Z

[SPARK-16281][SQL] Implement parse_url SQL function




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-06-30 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69250066
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 ---
@@ -285,6 +285,7 @@ object FunctionRegistry {
 expression[StringTrimLeft]("ltrim"),
 expression[JsonTuple]("json_tuple"),
 expression[FormatString]("printf"),
+expression[ParseUrl]("parse_url"),
--- End diff --

this should go before printf


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-06-30 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69250118
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +655,128 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO\n"
+  + "key specifies which query to extract\n"
+  + "Examples:\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'HOST') FROM src LIMIT 1;\n" + "  'spark.apache.org'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY') FROM src LIMIT 1;\n"  + "  'query=1'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY', 'query') FROM src LIMIT 1;\n" + "  '1'")
+case class ParseUrl(children: Expression*)
--- End diff --

there is no vararg in parse url, is there?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-06-30 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69251415
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 ---
@@ -285,6 +285,7 @@ object FunctionRegistry {
 expression[StringTrimLeft]("ltrim"),
 expression[JsonTuple]("json_tuple"),
 expression[FormatString]("printf"),
+expression[ParseUrl]("parse_url"),
--- End diff --

OK, Thank you for review. I'll fix this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-06-30 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69251673
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +655,128 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO\n"
+  + "key specifies which query to extract\n"
+  + "Examples:\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'HOST') FROM src LIMIT 1;\n" + "  'spark.apache.org'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY') FROM src LIMIT 1;\n"  + "  'query=1'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY', 'query') FROM src LIMIT 1;\n" + "  '1'")
+case class ParseUrl(children: Expression*)
--- End diff --

The parameter `key` can only be present when `partToExtract` is `QUERY`. So 
there are two parameters call and three parameters call. I have not figured out 
a better way to accomplish this.

I'll keep on trying to do it better. If there is any advice, that will be 
very helpful. Thanks again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-06-30 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69253202
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +655,128 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO\n"
+  + "key specifies which query to extract\n"
+  + "Examples:\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'HOST') FROM src LIMIT 1;\n" + "  'spark.apache.org'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY') FROM src LIMIT 1;\n"  + "  'query=1'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY', 'query') FROM src LIMIT 1;\n" + "  '1'")
+case class ParseUrl(children: Expression*)
--- End diff --

I think you can just declare a separate ctor that accepts two args


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69260667
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +655,128 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO\n"
+  + "key specifies which query to extract\n"
--- End diff --

Hi, @janplus .
There is a limitation of Scala 2.10 compiler. For `extended`, "+" breaks 
build. 
Please use one single `""""""` string like SubstringIndex 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L498
 .


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69262363
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +655,128 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO\n"
+  + "key specifies which query to extract\n"
+  + "Examples:\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'HOST') FROM src LIMIT 1;\n" + "  'spark.apache.org'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY') FROM src LIMIT 1;\n"  + "  'query=1'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY', 'query') FROM src LIMIT 1;\n" + "  '1'")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  private lazy val stringExprs = children.toArray
--- End diff --

What about making `stringExprs` as local variable `eval`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69262783
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +655,128 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO\n"
+  + "key specifies which query to extract\n"
+  + "Examples:\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'HOST') FROM src LIMIT 1;\n" + "  'spark.apache.org'\n"
--- End diff --

We can omit `FROM src LIMIT 1` here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69263461
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +655,128 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO\n"
+  + "key specifies which query to extract\n"
+  + "Examples:\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'HOST') FROM src LIMIT 1;\n" + "  'spark.apache.org'\n"
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69263441
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +655,128 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO\n"
+  + "key specifies which query to extract\n"
--- End diff --

Hi, @dongjoon-hyun .
Thank you for review. I'll fix this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69263681
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +655,128 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO\n"
+  + "key specifies which query to extract\n"
+  + "Examples:\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'HOST') FROM src LIMIT 1;\n" + "  'spark.apache.org'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY') FROM src LIMIT 1;\n"  + "  'query=1'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY', 'query') FROM src LIMIT 1;\n" + "  '1'")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  private lazy val stringExprs = children.toArray
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+  private lazy val HOST: UTF8String = UTF8String.fromString("HOST")
--- End diff --

For all variables, what about using the simple form? `lazy` has benefit, 
but also some overhead. Also, you can omit `: UTF8String` part, too.
```
private val HOST = UTF8String.fromString("HOST")
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69264025
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +655,128 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO\n"
+  + "key specifies which query to extract\n"
+  + "Examples:\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'HOST') FROM src LIMIT 1;\n" + "  'spark.apache.org'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY') FROM src LIMIT 1;\n"  + "  'query=1'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY', 'query') FROM src LIMIT 1;\n" + "  '1'")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  private lazy val stringExprs = children.toArray
--- End diff --

This is for performance concern. I do not want to call `toArray` every row.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69264274
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +655,128 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO\n"
+  + "key specifies which query to extract\n"
+  + "Examples:\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'HOST') FROM src LIMIT 1;\n" + "  'spark.apache.org'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY') FROM src LIMIT 1;\n"  + "  'query=1'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY', 'query') FROM src LIMIT 1;\n" + "  '1'")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  private lazy val stringExprs = children.toArray
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+  private lazy val HOST: UTF8String = UTF8String.fromString("HOST")
--- End diff --

Yes, you are right. I'll fix this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69264464
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +655,128 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO\n"
+  + "key specifies which query to extract\n"
+  + "Examples:\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'HOST') FROM src LIMIT 1;\n" + "  'spark.apache.org'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY') FROM src LIMIT 1;\n"  + "  'query=1'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY', 'query') FROM src LIMIT 1;\n" + "  '1'")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  private lazy val stringExprs = children.toArray
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+  private lazy val HOST: UTF8String = UTF8String.fromString("HOST")
+  private lazy val PATH: UTF8String = UTF8String.fromString("PATH")
+  private lazy val QUERY: UTF8String = UTF8String.fromString("QUERY")
+  private lazy val REF: UTF8String = UTF8String.fromString("REF")
+  private lazy val PROTOCOL: UTF8String = UTF8String.fromString("PROTOCOL")
+  private lazy val FILE: UTF8String = UTF8String.fromString("FILE")
+  private lazy val AUTHORITY: UTF8String = 
UTF8String.fromString("AUTHORITY")
+  private lazy val USERINFO: UTF8String = UTF8String.fromString("USERINFO")
+  private lazy val REGEXPREFIX: String = "[&^]?"
+  private lazy val REGEXSUBFIX: String = "=([^&]*)"
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or 
three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url == null || partToExtract == null) {
+  null
+} else {
+  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
+try {
+  lastUrlStr = url.asInstanceOf[UTF8String].clone()
+  lastUrl = new URL(url.toString)
+} catch {
+  case ex: MalformedURLException => return null
--- End diff --

We can use the following like Hive uses `Exception`. `SecurityException` 
occurs possibly as a `RuntimeException`.
```
case NonFatal(_) =>
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69265591
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +655,128 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO\n"
+  + "key specifies which query to extract\n"
+  + "Examples:\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'HOST') FROM src LIMIT 1;\n" + "  'spark.apache.org'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY') FROM src LIMIT 1;\n"  + "  'query=1'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY', 'query') FROM src LIMIT 1;\n" + "  '1'")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  private lazy val stringExprs = children.toArray
--- End diff --

If then, why do you call `toArray`? Actually, you can just use 
`children(0)`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69265722
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,43 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType)),
+expected)
+}
+def checkParseUrlWithKey(expected: String, urlStr: String,
+  partToExtract: String, key: String): Unit = {
--- End diff --

nit: Indentation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69266113
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,43 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType)),
+expected)
+}
+def checkParseUrlWithKey(expected: String, urlStr: String,
+  partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+ Literal.create(key, StringType)), expected)
--- End diff --

For this one, 2-indentation.

https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69266296
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,43 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType)),
+expected)
+}
+def checkParseUrlWithKey(expected: String, urlStr: String,
+  partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+ Literal.create(key, StringType)), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("jian", "http://j...@spark.apache.org/path?query=1";, 
"USERINFO")
--- End diff --

If you don't mind, could you replace `jian` with `userinfo`? :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69266407
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala ---
@@ -226,6 +226,20 @@ class StringFunctionsSuite extends QueryTest with 
SharedSQLContext {
   Row("???hi", "hi???", "h", "h"))
   }
 
+  test("string parse_url function") {
+val df = Seq[String](("http://j...@spark.apache.org/path?query=1#Ref";))
--- End diff --

nit: `jian`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69266361
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,43 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType)),
+expected)
+}
+def checkParseUrlWithKey(expected: String, urlStr: String,
+  partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+ Literal.create(key, StringType)), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("jian", "http://j...@spark.apache.org/path?query=1";, 
"USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
+
+// arguments checking
+assert(ParseUrl(Literal("1")).checkInputDataTypes().isFailure)
+assert(ParseUrl(Literal("1"), Literal("2"),
+Literal("3"), 
Literal("4")).checkInputDataTypes().isFailure)
--- End diff --

nit: 2-indentation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69266465
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala ---
@@ -226,6 +226,20 @@ class StringFunctionsSuite extends QueryTest with 
SharedSQLContext {
   Row("???hi", "hi???", "h", "h"))
   }
 
+  test("string parse_url function") {
+val df = Seq[String](("http://j...@spark.apache.org/path?query=1#Ref";))
+  .toDF("url")
+
+checkAnswer(
+  df.selectExpr("parse_url(url, 'HOST')", "parse_url(url, 'PATH')",
+"parse_url(url, 'QUERY')", "parse_url(url, 'REF')",
--- End diff --

nit: 2-indentation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69266491
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala ---
@@ -226,6 +226,20 @@ class StringFunctionsSuite extends QueryTest with 
SharedSQLContext {
   Row("???hi", "hi???", "h", "h"))
   }
 
+  test("string parse_url function") {
+val df = Seq[String](("http://j...@spark.apache.org/path?query=1#Ref";))
+  .toDF("url")
+
+checkAnswer(
+  df.selectExpr("parse_url(url, 'HOST')", "parse_url(url, 'PATH')",
+"parse_url(url, 'QUERY')", "parse_url(url, 'REF')",
+"parse_url(url, 'PROTOCOL')", "parse_url(url, 'FILE')",
+"parse_url(url, 'AUTHORITY')", "parse_url(url, 
'USERINFO')",
+"parse_url(url, 'QUERY', 'query')"),
+  Row("spark.apache.org", "/path", "query=1", "Ref",
+  "http", "/path?query=1", "j...@spark.apache.org", "jian", "1"))
--- End diff --

you know, `jian`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69266712
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,43 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType)),
+expected)
+}
+def checkParseUrlWithKey(expected: String, urlStr: String,
+  partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+ Literal.create(key, StringType)), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("jian", "http://j...@spark.apache.org/path?query=1";, 
"USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
--- End diff --

Could you add exceptional cases by using the following statement?
```
intercept[AnalysisException] {
  ...
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69276469
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +655,128 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO\n"
+  + "key specifies which query to extract\n"
+  + "Examples:\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'HOST') FROM src LIMIT 1;\n" + "  'spark.apache.org'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY') FROM src LIMIT 1;\n"  + "  'query=1'\n"
+  + "  > SELECT _FUNC_('http://spark.apache.org/path?query=1', "
+  + "'QUERY', 'query') FROM src LIMIT 1;\n" + "  '1'")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  private lazy val stringExprs = children.toArray
--- End diff --

Try to avoid Scala Seqs' potential performance problems.
https://github.com/apache/spark/pull/13966/files#r69184719


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69276506
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,43 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType)),
+expected)
+}
+def checkParseUrlWithKey(expected: String, urlStr: String,
+  partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+ Literal.create(key, StringType)), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("jian", "http://j...@spark.apache.org/path?query=1";, 
"USERINFO")
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69297640
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,43 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType)),
+expected)
+}
+def checkParseUrlWithKey(expected: String, urlStr: String,
+  partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+ Literal.create(key, StringType)), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("jian", "http://j...@spark.apache.org/path?query=1";, 
"USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
--- End diff --

I am not sure. Is there any exceptional case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69302431
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +657,125 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or 
three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url == null || partToExtract == null) {
+  null
+} else {
+  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
--- End diff --

is this optimization mainly for when the `url` is literal?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69302506
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +657,125 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
--- End diff --

should we put these constants into an object?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69302511
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +657,125 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
--- End diff --

should we put these constants into an object?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69302808
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +657,125 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
--- End diff --

It's ok to use `CodegenFallback` if the codegen is too hard to implement, 
but we should open a JIRA for the codegen part so we won't forget.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69302923
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +657,125 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Expression*)
--- End diff --

should we use `Seq[Expression]` here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69303124
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,43 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType)),
+expected)
+}
+def checkParseUrlWithKey(expected: String, urlStr: String,
+  partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+ Literal.create(key, StringType)), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("jian", "http://j...@spark.apache.org/path?query=1";, 
"USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
--- End diff --

e.g. invalid url, invalid `part`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69307094
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +657,125 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Expression*)
--- End diff --

You are right. I'll fix this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69307187
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +657,125 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
--- End diff --

OK. I'll open a JIRA for the codegen part and keep working on it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69307232
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +657,125 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
--- End diff --

Make sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69307503
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +657,125 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or 
three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url == null || partToExtract == null) {
+  null
+} else {
+  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
--- End diff --

Yes. When the `url` column has many same values.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69307883
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,43 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType)),
+expected)
+}
+def checkParseUrlWithKey(expected: String, urlStr: String,
+  partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+ Literal.create(key, StringType)), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("jian", "http://j...@spark.apache.org/path?query=1";, 
"USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
--- End diff --

Oh sorry, I miss the point.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69309179
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +657,125 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or 
three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url == null || partToExtract == null) {
+  null
+} else {
+  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
--- End diff --

you can follow `XPathBoolean` to optimize for literal case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69314031
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -653,6 +657,125 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
 }
 
 /**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Expression*)
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or 
three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url == null || partToExtract == null) {
+  null
+} else {
+  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
--- End diff --

Thought we judge on the `url` string, the main purpose is to cache the 
`URL` object. 
As We must handle the exceptions caused by invalid urls, the approach of 
`XPathBoolean` seems not suitable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69323801
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,43 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType)),
+expected)
+}
+def checkParseUrlWithKey(expected: String, urlStr: String,
+  partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+ Literal.create(key, StringType)), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("jian", "http://j...@spark.apache.org/path?query=1";, 
"USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
--- End diff --

As invalid url and invalid part just get `null` result, I wonder in what 
circumstance there would throw an exception?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69343465
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or 
three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
--- End diff --

Could you make this `private`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69345511
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -27,6 +29,8 @@ import org.apache.spark.sql.catalyst.util.ArrayData
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.{ByteArray, UTF8String}
 
+import scala.util.control.NonFatal
--- End diff --

Oh, we have import ordering. You should move `NonFatal` like the following. 
Blank line is needed.
```
import java.util.Date

import scala.util.control.NonFatal

import org.apache.hadoop._

import org.apache.spark.sql._
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69345634
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -27,6 +29,8 @@ import org.apache.spark.sql.catalyst.util.ArrayData
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.{ByteArray, UTF8String}
 
+import scala.util.control.NonFatal
--- End diff --

By the way, I'm wondering why this does not cause build errors. Usually, 
it's caught by build errors.

Anyway, could you run `./dev/scalastyle`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69345785
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
--- End diff --

```
override def nullable: Boolean = true
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69345860
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
--- End diff --

Public functions need explicit type declaration.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69346173
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -17,7 +17,9 @@
 
 package org.apache.spark.sql.catalyst.expressions
 
+import java.net.URL
 import java.text.{DecimalFormat, DecimalFormatSymbols}
+import java.util.regex.Pattern
 import java.util.{HashMap, Locale, Map => JMap}
--- End diff --

Also, imports need to be sorted.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69348238
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
--- End diff --

Maybe,
```
  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
  'spark.apache.org'
  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
  'query=1'
  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
  '1'""")
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69350632
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,43 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType)),
+expected)
+}
+def checkParseUrlWithKey(expected: String, urlStr: String,
+  partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+ Literal.create(key, StringType)), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("jian", "http://j...@spark.apache.org/path?query=1";, 
"USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
--- End diff --

Try this one.
```
SELECT parse_url('http://spark/?','QUERY', '???')
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69350933
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,43 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType)),
+expected)
+}
+def checkParseUrlWithKey(expected: String, urlStr: String,
+  partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+ Literal.create(key, StringType)), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("jian", "http://j...@spark.apache.org/path?query=1";, 
"USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
--- End diff --

I'm not sure how to handle this kind of malformed queries. Hive makes 
reasonable message.
```
hive> SELECT parse_url('https://?1=1&;?','QUERY', '???');
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments ''???'': 
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69370542
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,43 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType)),
+expected)
+}
+def checkParseUrlWithKey(expected: String, urlStr: String,
+  partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+ Literal.create(key, StringType)), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("jian", "http://j...@spark.apache.org/path?query=1";, 
"USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
--- End diff --

Yes, you are right. Invalid `key` does this job. I'll fix this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69371274
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or 
three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url == null || partToExtract == null) {
+  null
+} else {
+  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
+try {
+  lastUrl = new URL(url.toString)
+  lastUrlStr = url.asInstanceOf[UTF8String].clone()
--- End diff --

so if all urls are different, we need to clone them everytime?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69371328
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or 
three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url == null || partToExtract == null) {
+  null
+} else {
+  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
+try {
+  lastUrl = new URL(url.toString)
+  lastUrlStr = url.asInstanceOf[UTF8String].clone()
+} catch {
+  case NonFatal(_) => return null
--- End diff --

We should check the possible exceptions for `new URL` instead of catching 
NonFatal blindly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69371334
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or 
three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url == null || partToExtract == null) {
+  null
+} else {
+  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
+try {
+  lastUrl = new URL(url.toString)
+  lastUrlStr = url.asInstanceOf[UTF8String].clone()
+} catch {
+  case NonFatal(_) => return null
--- End diff --

btw, does hive return null for invalid url?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69371479
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or 
three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url == null || partToExtract == null) {
+  null
+} else {
+  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
+try {
+  lastUrl = new URL(url.toString)
+  lastUrlStr = url.asInstanceOf[UTF8String].clone()
+} catch {
+  case NonFatal(_) => return null
--- End diff --

As hive returns null for invalid urls, it seems reasonable that spark does 
the same thing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69371498
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or 
three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url == null || partToExtract == null) {
+  null
+} else {
+  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
+try {
+  lastUrl = new URL(url.toString)
+  lastUrlStr = url.asInstanceOf[UTF8String].clone()
--- End diff --

Em, good point. I'll try to find a better way.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69371739
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or 
three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url == null || partToExtract == null) {
+  null
+} else {
+  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
+try {
+  lastUrl = new URL(url.toString)
+  lastUrlStr = url.asInstanceOf[UTF8String].clone()
+} catch {
+  case NonFatal(_) => return null
--- End diff --

oh this is just code style suggestion: only catch the exception that may be 
thrown in the `try` block, `new URL` only throws some specific kind of 
exceptions right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69371935
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 
'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 
'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or 
three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url == null || partToExtract == null) {
+  null
+} else {
+  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
+try {
+  lastUrl = new URL(url.toString)
+  lastUrlStr = url.asInstanceOf[UTF8String].clone()
+} catch {
+  case NonFatal(_) => return null
--- End diff --

Yes, it only throws `MalformedURLException`. But Dongjoon Hyun suggested me 
to use `NonFatal(_)` here, as quoted

> We can use the following like Hive uses Exception. SecurityException 
occurs possibly as a RuntimeException.
> `case NonFatal(_) =>`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69383504
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
+  Literal.create(partToExtract, StringType))), expected)
+}
+def checkParseUrlWithKey(
+expected: String, urlStr: String,
+partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+  Literal.create(key, StringType))), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("userinfo", 
"http://useri...@spark.apache.org/path?query=1";, "USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "")
+
+// exceptional cases
+intercept[java.util.regex.PatternSyntaxException] {
--- End diff --

Hi, @janplus .
I thought about this a little more. Currently, this exception happens in 
`Executor` side. It's not desirable. IMO, we had better make this as 
`AnalysisException`.
Could you add some simple validation logic for `key`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69383544
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
+  Literal.create(partToExtract, StringType))), expected)
+}
+def checkParseUrlWithKey(
+expected: String, urlStr: String,
+partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+  Literal.create(key, StringType))), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("userinfo", 
"http://useri...@spark.apache.org/path?query=1";, "USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "")
+
+// exceptional cases
+intercept[java.util.regex.PatternSyntaxException] {
--- End diff --

In case of Hive, it's also `SemanticException`, not a raw 
`PatternSyntaxException`.
You may need to investigate Hive `SemanticException` logic.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69383569
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
+  Literal.create(partToExtract, StringType))), expected)
+}
+def checkParseUrlWithKey(
+expected: String, urlStr: String,
+partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+  Literal.create(key, StringType))), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("userinfo", 
"http://useri...@spark.apache.org/path?query=1";, "USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "")
+
+// exceptional cases
+intercept[java.util.regex.PatternSyntaxException] {
--- End diff --

In other words, Spark of this PR runs the execution for that problematic 
parameter while Hive does not.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69384848
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
+  Literal.create(partToExtract, StringType))), expected)
+}
+def checkParseUrlWithKey(
+expected: String, urlStr: String,
+partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+  Literal.create(key, StringType))), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("userinfo", 
"http://useri...@spark.apache.org/path?query=1";, "USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "")
+
+// exceptional cases
+intercept[java.util.regex.PatternSyntaxException] {
--- End diff --

I'll have a investigation on this.
It should be different whether `key` is `Literal`.

> hive> select parse_url("http://spark/path?";, "QUERY", "???");
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments '"???"': 
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method 
public java.lang.String 
org.apache.hadoop.hive.ql.udf.UDFParseUrl.evaluate(java.lang.String,java.lang.String,java.lang.String)
  on object org.apache.hadoop.hive.ql.udf.UDFParseUrl@6682e6a5 of class 
org.apache.hadoop.hive.ql.udf.UDFParseUrl with arguments 
{http://spark/path?:java.lang.String, QUERY:java.lang.String, 
???:java.lang.String} of size 3
>
> hive> select parse_url("http://spark/path?";, "QUERY", name) from test;
OK
Failed with exception 
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Unable to 
execute method public java.lang.String 
org.apache.hadoop.hive.ql.udf.UDFParseUrl.evaluate(java.lang.String,java.lang.String,java.lang.String)
  on object org.apache.hadoop.hive.ql.udf.UDFParseUrl@2035d65b of class 
org.apache.hadoop.hive.ql.udf.UDFParseUrl with arguments 
{http://spark/path?:java.lang.String, QUERY:java.lang.String, 
???:java.lang.String} of size 3
Time taken: 0.039 seconds


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69384995
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
+  Literal.create(partToExtract, StringType))), expected)
+}
+def checkParseUrlWithKey(
+expected: String, urlStr: String,
+partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+  Literal.create(key, StringType))), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("userinfo", 
"http://useri...@spark.apache.org/path?query=1";, "USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "")
+
+// exceptional cases
+intercept[java.util.regex.PatternSyntaxException] {
--- End diff --

Thank you, @janplus .


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69385193
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
+  Literal.create(partToExtract, StringType))), expected)
+}
+def checkParseUrlWithKey(
+expected: String, urlStr: String,
+partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+  Literal.create(key, StringType))), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("userinfo", 
"http://useri...@spark.apache.org/path?query=1";, "USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "")
+
+// exceptional cases
+intercept[java.util.regex.PatternSyntaxException] {
--- End diff --

Hi, @dongjoon-hyun 
It seems only when `url`, `partToExtract` and `key` is all `Literal`, then 
hive may give a `SemanticException`.

> hive> select * from url_parse_data;
OK
http://spark/path?  QUERY   ???
Time taken: 0.054 seconds, Fetched: 1 row(s)

> hive> select parse_url("http://spark/path?";, "QUERY", "???") from 
url_parse_data;
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments '"???"': 
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method 
public java.lang.String 
org.apache.hadoop.hive.ql.udf.UDFParseUrl.evaluate(java.lang.String,java.lang.String,java.lang.String)
  on object org.apache.hadoop.hive.ql.udf.UDFParseUrl@59e082f8 of class 
org.apache.hadoop.hive.ql.udf.UDFParseUrl with arguments 
{http://spark/path?:java.lang.String, QUERY:java.lang.String, 
???:java.lang.String} of size 3

> hive> select parse_url(url, "QUERY", "???") from url_parse_data;
OK
Failed with exception 
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Unable to 
execute method public java.lang.String 
org.apache.hadoop.hive.ql.udf.UDFParseUrl.evaluate(java.lang.String,java.lang.String,java.lang.String)
  on object org.apache.hadoop.hive.ql.udf.UDFParseUrl@7d1f3fe9 of class 
org.apache.hadoop.hive.ql.udf.UDFParseUrl with arguments 
{http://spark/path?:java.lang.String, QUERY:java.lang.String, 
???:java.lang.String} of size 3

> hive> select parse_url("http://spark/path?";, part, "???") from 
url_parse_data;
OK
Failed with exception 
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Unable to 
execute method public java.lang.String 
org.apache.hadoop.hive.ql.udf.UDFParseUrl.evaluate(java.lang.String,java.lang.String,java.lang.String)
  on object org.apache.hadoop.hive.ql.udf.UDFParseUrl@37fef327 of class 
org.apache.hadoop.hive.ql.udf.UDFParseUrl with arguments 
{http://spark/path?:java.lang.String, QUERY:java.lang.String, 
???:java.lang.String} of size 3

> hive> select parse_url("http://spark/path?";, "QUERY", key) from 
url_parse_data;
OK
Failed with exception 
java.io.IOException:org.apache.hadoop.hive.

[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69385614
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
+  Literal.create(partToExtract, StringType))), expected)
+}
+def checkParseUrlWithKey(
+expected: String, urlStr: String,
+partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+  Literal.create(key, StringType))), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("userinfo", 
"http://useri...@spark.apache.org/path?query=1";, "USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "")
+
+// exceptional cases
+intercept[java.util.regex.PatternSyntaxException] {
--- End diff --

Thank you for nice investigation. Yes, the validation of Hive seems to be 
too limited.
I think you can be better than Hive if you supports **Literal** `key` 
validation?
How do you think about that?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69385849
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
+  Literal.create(partToExtract, StringType))), expected)
+}
+def checkParseUrlWithKey(
+expected: String, urlStr: String,
+partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+  Literal.create(key, StringType))), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("userinfo", 
"http://useri...@spark.apache.org/path?query=1";, "USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "")
+
+// exceptional cases
+intercept[java.util.regex.PatternSyntaxException] {
--- End diff --

Yes, definitely I can do that. In fact I have finished it.
But before I do the commit, let us get thought it first.
In `checkAnalysis` method for `LogicalPlan`, the only method will be called 
for `Expression` is `checkInputDataTypes`

https://github.com/apache/spark/blob/d1e8108854deba3de8e2d87eb4389d11fb17ee57/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L64

Which means we can only implement this validation in `checkInputDataTypes` 
of `ParseUrl`. In that circumstance spark will give the AnalysisException like 
this
> org.apache.spark.sql.AnalysisException: cannot resolve 
'parse_url("http://spark.apache.org/path?";, "QUERY", "???")' due to data type 
mismatch: wrong key "???"; line 1 pos 0

But obviously this should not be a data type mismatch. This message may 
confuse the users. Also the different message for **Literal** `key` and **Not 
Literal** `key` may make them confused too.
Otherwise, if we do not validate the **Literal** `key`, the `Executor` will 
get an exception at the first row. It seems not that unacceptable.
So compared the both sides, I think we should not do the Literal `key` 
validation.
How do you think about this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69385942
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
--- End diff --

Hi, @janplus .
When I said before, I thought you could do validation here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69386079
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
--- End diff --

Oh, yes, I can do validation here.
But still we will have different exception types for **Literal** `key` and 
**Non Literal** `key`.
That is `AnalysisException` for invalid Literal `key`, and 
`PatternSyntaxException` for invalid non Literal `key`.
Will this be OK?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69386214
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
--- End diff --

Yep. I think so.
1. `PatternSyntaxException` is a result of `Executor` side failure. It 
means it could be any exception.
2. `AnalysisException` is a result of `Driver`-side static analysis.
We need to do our best to prevent the error of type 1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69386249
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or 
three arguments")
--- End diff --

And, here, let's use `$prettyName` instead of `parse_url`. I mean
```scala
TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two or 
three arguments")
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69386280
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
--- End diff --

BTW, implement here seems that the invalidation still happened in 
`Executor` side. IMO, the invalidation of key's value can hardly be considered 
as `AnalysisException`, just like the `rlike` function, isn't it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69386305
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or 
three arguments")
--- End diff --

OK, thank you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69386335
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
--- End diff --

It's not general pattern matching problem. You know, it's the 'key=value` 
part of the URL, isn't it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-02 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69386564
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
--- End diff --

I get your point. Since implement here will still be `Executor` side, I'll 
try to find another way to do this.
Thank you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69386854
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
--- End diff --

Oh, sorry for that. I missed that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69386898
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
--- End diff --

Then, you can try that in `checkInputDataTypes`. Please refer the following 
code.


https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L544


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-03 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69386958
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
--- End diff --

Thanks, this is my approach. It will give following message
> org.apache.spark.sql.AnalysisException: cannot resolve 
'parse_url("http://spark.apache.org/path?";, "QUERY", "???")' due to data type 
mismatch: invalid key '???'; line 1 pos 7
Does it look ok to you?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69388219
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
+  Literal.create(partToExtract, StringType))), expected)
+}
+def checkParseUrlWithKey(
+expected: String, urlStr: String,
+partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+  Literal.create(key, StringType))), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("userinfo", 
"http://useri...@spark.apache.org/path?query=1";, "USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "")
+
+// exceptional cases
+intercept[java.util.regex.PatternSyntaxException] {
--- End diff --

I think it's fine to throw the exception at executor side, no need to 
specially handle literal here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-03 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69401574
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
+  Literal.create(partToExtract, StringType))), expected)
+}
+def checkParseUrlWithKey(
+expected: String, urlStr: String,
+partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+  Literal.create(key, StringType))), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("userinfo", 
"http://useri...@spark.apache.org/path?query=1";, "USERINFO")
+checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1";, 
"QUERY", "query")
+
+// Null checking
+checkParseUrl(null, null, "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, null)
+checkParseUrl(null, null, null)
+checkParseUrl(null, "test", "HOST")
+checkParseUrl(null, "http://spark.apache.org/path?query=1";, "NO")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"HOST", "query")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "quer")
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", null)
+checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1";, 
"QUERY", "")
+
+// exceptional cases
+intercept[java.util.regex.PatternSyntaxException] {
--- End diff --

OK, @cloud-fan 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69406035
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two 
or three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  private def getPattern(key: Any): Pattern = {
+if (key != null) {
+  val sb = new StringBuilder()
+  sb.append(REGEXPREFIX).append(key.toString).append(REGEXSUBFIX)
+  Pattern.compile(sb.toString())
+} else {
+  null
+}
+  }
+
+  private def getUrl(url: Any): URL = {
+try {
+  new URL(url.toString)
+} catch {
+  case NonFatal(_) => null
--- End diff --

Seems `new URL` will only throw `MalformedURLException`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69406176
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two 
or three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  private def getPattern(key: Any): Pattern = {
+if (key != null) {
+  val sb = new StringBuilder()
+  sb.append(REGEXPREFIX).append(key.toString).append(REGEXSUBFIX)
+  Pattern.compile(sb.toString())
+} else {
+  null
+}
+  }
+
+  private def getUrl(url: Any): URL = {
+try {
+  new URL(url.toString)
+} catch {
+  case NonFatal(_) => null
+}
+  }
+
+
+  private def extractValueFromQuery(query: Any, pattern: Pattern): Any = {
+val m = pattern.matcher(query.toString)
+if (m.find()) {
+  UTF8String.fromString(m.group(2))
+} else {
+  null
+}
+  }
+
+  private def extractFromUrl(url: URL, partToExtract: Any): Any = {
+if (partToExtract.equals(HOST)) {
+  UTF8String.fromString(url.getHost)
+} else if (partToExtract.equals(PATH)) {
+  UTF8String.fromString(url.getPath)
+} else if (partToExtract.equals(QUERY)) {
+  UTF8String.fromString(url.getQuery)
+} else if (partToExtract.equals(REF)) {
+  UTF8String.fromString(url.getRef)
+} else if (partToExtract.equals(PROTOCOL)) {
+  UTF8String.fromString(url.getProtocol)
+} else if (partToExtract.equals(FILE)) {
+  UTF8String.fromString(url.getFile)
+} else if (partToExtract.equals(AUTHORITY)) {
+  UTF8String.fromString(url.getAuthority)
+} else if (partToExtract.equals(USERINFO)) {
+  UTF8String.fromString(url.getUserInfo)
+} else {
+  null
+}
+  }
+
+  private def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url != null && partToExtract != null) {
+  if (cachedUrl ne null) {
+extractFromUrl(cachedUrl, partToExtract)
+  } else {

[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69406199
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two 
or three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  private def getPattern(key: Any): Pattern = {
--- End diff --

we should explicitly say the argument is `UTF8String`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-03 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69408137
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two 
or three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  private def getPattern(key: Any): Pattern = {
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-03 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69408152
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two 
or three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  private def getPattern(key: Any): Pattern = {
+if (key != null) {
+  val sb = new StringBuilder()
+  sb.append(REGEXPREFIX).append(key.toString).append(REGEXSUBFIX)
+  Pattern.compile(sb.toString())
+} else {
+  null
+}
+  }
+
+  private def getUrl(url: Any): URL = {
+try {
+  new URL(url.toString)
+} catch {
+  case NonFatal(_) => null
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-03 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69408201
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +656,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two 
or three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  private def getPattern(key: Any): Pattern = {
+if (key != null) {
+  val sb = new StringBuilder()
+  sb.append(REGEXPREFIX).append(key.toString).append(REGEXSUBFIX)
+  Pattern.compile(sb.toString())
+} else {
+  null
+}
+  }
+
+  private def getUrl(url: Any): URL = {
+try {
+  new URL(url.toString)
+} catch {
+  case NonFatal(_) => null
+}
+  }
+
+
+  private def extractValueFromQuery(query: Any, pattern: Pattern): Any = {
+val m = pattern.matcher(query.toString)
+if (m.find()) {
+  UTF8String.fromString(m.group(2))
+} else {
+  null
+}
+  }
+
+  private def extractFromUrl(url: URL, partToExtract: Any): Any = {
+if (partToExtract.equals(HOST)) {
+  UTF8String.fromString(url.getHost)
+} else if (partToExtract.equals(PATH)) {
+  UTF8String.fromString(url.getPath)
+} else if (partToExtract.equals(QUERY)) {
+  UTF8String.fromString(url.getQuery)
+} else if (partToExtract.equals(REF)) {
+  UTF8String.fromString(url.getRef)
+} else if (partToExtract.equals(PROTOCOL)) {
+  UTF8String.fromString(url.getProtocol)
+} else if (partToExtract.equals(FILE)) {
+  UTF8String.fromString(url.getFile)
+} else if (partToExtract.equals(AUTHORITY)) {
+  UTF8String.fromString(url.getAuthority)
+} else if (partToExtract.equals(USERINFO)) {
+  UTF8String.fromString(url.getUserInfo)
+} else {
+  null
+}
+  }
+
+  private def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url != null && partToExtract != null) {
+  if (cachedUrl ne null) {
+extractFromUrl(cachedUrl, partToExtract)
+  } else {
  

[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-04 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69475224
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +654,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two 
or three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  private def getPattern(key: UTF8String): Pattern = {
+if (key != null) {
+  val sb = new StringBuilder()
+  sb.append(REGEXPREFIX).append(key.toString).append(REGEXSUBFIX)
--- End diff --

Fine. I'll fix it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69475002
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +654,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two 
or three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  private def getPattern(key: UTF8String): Pattern = {
+if (key != null) {
+  val sb = new StringBuilder()
+  sb.append(REGEXPREFIX).append(key.toString).append(REGEXSUBFIX)
--- End diff --

it's only 3 strings, I think it's ok to use `+` to combine them


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69476286
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +654,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two 
or three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  private def getPattern(key: UTF8String): Pattern = {
+if (key != null) {
+  val sb = new StringBuilder()
+  sb.append(REGEXPREFIX).append(key.toString).append(REGEXSUBFIX)
+  Pattern.compile(sb.toString())
+} else {
+  null
+}
+  }
+
+  private def getUrl(url: UTF8String): URL = {
+try {
+  new URL(url.toString)
+} catch {
+  case e: MalformedURLException => null
+}
+  }
+
+
+  private def extractValueFromQuery(query: Any, pattern: Pattern): Any = {
--- End diff --

It's better to explicitly declare `query` as `UTF8String`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-04 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69476414
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +654,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two 
or three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  private def getPattern(key: UTF8String): Pattern = {
+if (key != null) {
+  val sb = new StringBuilder()
+  sb.append(REGEXPREFIX).append(key.toString).append(REGEXSUBFIX)
+  Pattern.compile(sb.toString())
+} else {
+  null
+}
+  }
+
+  private def getUrl(url: UTF8String): URL = {
+try {
+  new URL(url.toString)
+} catch {
+  case e: MalformedURLException => null
+}
+  }
+
+
+  private def extractValueFromQuery(query: Any, pattern: Pattern): Any = {
--- End diff --

Oh, get your point.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69476921
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +654,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two 
or three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  private def getPattern(key: UTF8String): Pattern = {
+if (key != null) {
+  val sb = new StringBuilder()
+  sb.append(REGEXPREFIX).append(key.toString).append(REGEXSUBFIX)
+  Pattern.compile(sb.toString())
+} else {
+  null
+}
+  }
+
+  private def getUrl(url: UTF8String): URL = {
+try {
+  new URL(url.toString)
+} catch {
+  case e: MalformedURLException => null
+}
+  }
+
+
+  private def extractValueFromQuery(query: Any, pattern: Pattern): Any = {
+val m = pattern.matcher(query.toString)
+if (m.find()) {
+  UTF8String.fromString(m.group(2))
+} else {
+  null
+}
+  }
+
+  private def extractFromUrl(url: URL, partToExtract: Any): Any = {
+if (partToExtract.equals(HOST)) {
+  UTF8String.fromString(url.getHost)
+} else if (partToExtract.equals(PATH)) {
+  UTF8String.fromString(url.getPath)
+} else if (partToExtract.equals(QUERY)) {
+  UTF8String.fromString(url.getQuery)
+} else if (partToExtract.equals(REF)) {
+  UTF8String.fromString(url.getRef)
+} else if (partToExtract.equals(PROTOCOL)) {
+  UTF8String.fromString(url.getProtocol)
+} else if (partToExtract.equals(FILE)) {
+  UTF8String.fromString(url.getFile)
+} else if (partToExtract.equals(AUTHORITY)) {
+  UTF8String.fromString(url.getAuthority)
+} else if (partToExtract.equals(USERINFO)) {
+  UTF8String.fromString(url.getUserInfo)
+} else {
+  null
+}
+  }
+
+  private def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url != null && partToExtract != null) {
+  if (cachedUrl ne null) {
+extractFromUrl(cachedUrl, partToEx

[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-04 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69477433
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +654,163 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two 
or three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  private def getPattern(key: UTF8String): Pattern = {
+if (key != null) {
+  val sb = new StringBuilder()
+  sb.append(REGEXPREFIX).append(key.toString).append(REGEXSUBFIX)
+  Pattern.compile(sb.toString())
+} else {
+  null
+}
+  }
+
+  private def getUrl(url: UTF8String): URL = {
+try {
+  new URL(url.toString)
+} catch {
+  case e: MalformedURLException => null
+}
+  }
+
+
+  private def extractValueFromQuery(query: Any, pattern: Pattern): Any = {
+val m = pattern.matcher(query.toString)
+if (m.find()) {
+  UTF8String.fromString(m.group(2))
+} else {
+  null
+}
+  }
+
+  private def extractFromUrl(url: URL, partToExtract: Any): Any = {
+if (partToExtract.equals(HOST)) {
+  UTF8String.fromString(url.getHost)
+} else if (partToExtract.equals(PATH)) {
+  UTF8String.fromString(url.getPath)
+} else if (partToExtract.equals(QUERY)) {
+  UTF8String.fromString(url.getQuery)
+} else if (partToExtract.equals(REF)) {
+  UTF8String.fromString(url.getRef)
+} else if (partToExtract.equals(PROTOCOL)) {
+  UTF8String.fromString(url.getProtocol)
+} else if (partToExtract.equals(FILE)) {
+  UTF8String.fromString(url.getFile)
+} else if (partToExtract.equals(AUTHORITY)) {
+  UTF8String.fromString(url.getAuthority)
+} else if (partToExtract.equals(USERINFO)) {
+  UTF8String.fromString(url.getUserInfo)
+} else {
+  null
+}
+  }
+
+  private def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url != null && partToExtract != null) {
+  if (cachedUrl ne null) {
+extractFromUrl(cachedUrl, partToExtr

[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69509699
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
--- End diff --

`Literal.create(urlStr, StringType)` -> `Literal(urlString)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69509714
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
+  Literal.create(partToExtract, StringType))), expected)
+}
+def checkParseUrlWithKey(
+expected: String, urlStr: String,
--- End diff --

one parameter one line please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69509768
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
+  Literal.create(partToExtract, StringType))), expected)
+}
+def checkParseUrlWithKey(
+expected: String, urlStr: String,
+partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+  Literal.create(key, StringType))), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("userinfo", 
"http://useri...@spark.apache.org/path?query=1";, "USERINFO")
--- End diff --

what will happen if there is no userinfo in the url?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-05 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69532226
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-05 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69532282
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
+  Literal.create(partToExtract, StringType))), expected)
+}
+def checkParseUrlWithKey(
+expected: String, urlStr: String,
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-05 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69532511
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -725,4 +725,51 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 
0)
 checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 
0)
   }
+
+  test("ParseUrl") {
+def checkParseUrl(expected: String, urlStr: String, partToExtract: 
String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType),
+  Literal.create(partToExtract, StringType))), expected)
+}
+def checkParseUrlWithKey(
+expected: String, urlStr: String,
+partToExtract: String, key: String): Unit = {
+  checkEvaluation(
+ParseUrl(Seq(Literal.create(urlStr, StringType), 
Literal.create(partToExtract, StringType),
+  Literal.create(key, StringType))), expected)
+}
+
+checkParseUrl("spark.apache.org", 
"http://spark.apache.org/path?query=1";, "HOST")
+checkParseUrl("/path", "http://spark.apache.org/path?query=1";, "PATH")
+checkParseUrl("query=1", "http://spark.apache.org/path?query=1";, 
"QUERY")
+checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref";, "REF")
+checkParseUrl("http", "http://spark.apache.org/path?query=1";, 
"PROTOCOL")
+checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1";, 
"FILE")
+checkParseUrl("spark.apache.org:8080", 
"http://spark.apache.org:8080/path?query=1";, "AUTHORITY")
+checkParseUrl("userinfo", 
"http://useri...@spark.apache.org/path?query=1";, "USERINFO")
--- End diff --

Then the result is `null`. I'll add a test case for this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69850820
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +654,160 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two 
or three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  private def getPattern(key: UTF8String): Pattern = {
+if (key != null) {
+  Pattern.compile(REGEXPREFIX + key.toString + REGEXSUBFIX)
+} else {
+  null
+}
+  }
+
+  private def getUrl(url: UTF8String): URL = {
+try {
+  new URL(url.toString)
+} catch {
+  case e: MalformedURLException => null
+}
+  }
+
+  private def extractValueFromQuery(query: UTF8String, pattern: Pattern): 
UTF8String = {
+val m = pattern.matcher(query.toString)
+if (m.find()) {
+  UTF8String.fromString(m.group(2))
+} else {
+  null
+}
+  }
+
+  private def extractFromUrl(url: URL, partToExtract: UTF8String): 
UTF8String = {
+if (partToExtract.equals(HOST)) {
+  UTF8String.fromString(url.getHost)
+} else if (partToExtract.equals(PATH)) {
+  UTF8String.fromString(url.getPath)
+} else if (partToExtract.equals(QUERY)) {
+  UTF8String.fromString(url.getQuery)
+} else if (partToExtract.equals(REF)) {
+  UTF8String.fromString(url.getRef)
+} else if (partToExtract.equals(PROTOCOL)) {
+  UTF8String.fromString(url.getProtocol)
+} else if (partToExtract.equals(FILE)) {
+  UTF8String.fromString(url.getFile)
+} else if (partToExtract.equals(AUTHORITY)) {
+  UTF8String.fromString(url.getAuthority)
+} else if (partToExtract.equals(USERINFO)) {
+  UTF8String.fromString(url.getUserInfo)
+} else {
+  null
--- End diff --

It's weird that we return null if users provide an invalid `part` string, 
while we have a fixed list of supported `part` string. Even it's hive's rule, 
it still looks unreasonable to me.  cc @rxin @yhuai 


---
If your project is set up for it, you can reply to this email

[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69850899
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +654,160 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two 
or three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  private def getPattern(key: UTF8String): Pattern = {
+if (key != null) {
+  Pattern.compile(REGEXPREFIX + key.toString + REGEXSUBFIX)
+} else {
+  null
+}
+  }
+
+  private def getUrl(url: UTF8String): URL = {
+try {
+  new URL(url.toString)
+} catch {
+  case e: MalformedURLException => null
+}
+  }
+
+  private def extractValueFromQuery(query: UTF8String, pattern: Pattern): 
UTF8String = {
+val m = pattern.matcher(query.toString)
+if (m.find()) {
+  UTF8String.fromString(m.group(2))
+} else {
+  null
+}
+  }
+
+  private def extractFromUrl(url: URL, partToExtract: UTF8String): 
UTF8String = {
+if (partToExtract.equals(HOST)) {
+  UTF8String.fromString(url.getHost)
+} else if (partToExtract.equals(PATH)) {
+  UTF8String.fromString(url.getPath)
+} else if (partToExtract.equals(QUERY)) {
+  UTF8String.fromString(url.getQuery)
+} else if (partToExtract.equals(REF)) {
+  UTF8String.fromString(url.getRef)
+} else if (partToExtract.equals(PROTOCOL)) {
+  UTF8String.fromString(url.getProtocol)
+} else if (partToExtract.equals(FILE)) {
+  UTF8String.fromString(url.getFile)
+} else if (partToExtract.equals(AUTHORITY)) {
+  UTF8String.fromString(url.getAuthority)
+} else if (partToExtract.equals(USERINFO)) {
+  UTF8String.fromString(url.getUserInfo)
+} else {
+  null
+}
+  }
+
+  private def parseUrlWithoutKey(url: UTF8String, partToExtract: 
UTF8String): UTF8String = {
+if (url != null && partToExtract != null) {
+  if (cachedUrl ne null) {
+extractFromUrl(cachedUrl, partToExtract)
+  } else {
+va

[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69850990
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
 ---
@@ -652,6 +654,160 @@ case class StringRPad(str: Expression, len: 
Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, 
USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+  'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+  'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 
'query')
+  '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable: Boolean = true
+  override def inputTypes: Seq[DataType] = 
Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+  override def prettyName: String = "parse_url"
+
+  // If the url is a constant, cache the URL object so that we don't need 
to convert url
+  // from UTF8String to String to URL for every row.
+  @transient private lazy val cachedUrl = stringExprs(0) match {
+case Literal(url: UTF8String, _) => getUrl(url)
+case _ => null
+  }
+
+  // If the key is a constant, cache the Pattern object so that we don't 
need to convert key
+  // from UTF8String to String to StringBuilder to String to Pattern for 
every row.
+  @transient private lazy val cachedPattern = stringExprs(2) match {
+case Literal(key: UTF8String, _) => getPattern(key)
+case _ => null
+  }
+
+  private lazy val stringExprs = children.toArray
--- End diff --

I don't it's necessary... we only have 2 or 3 parameters...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   >