[jira] [Updated] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

2018-01-12 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-23056:
---
Description: 
When using internationalized domain names in URLs, like:

{code:java}
val url = "http://правительство.рф"
{code}
parse_url returns null, but it works fine when using Hive's version of parse_url.

On digging further, I found that the difference is in the call below in Spark:

{code:java}
private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}
{code}

while Hive uses java.net.URL:

{code:java}
url = new URL(urlStr)
{code}

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

{code:java}
val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
{code}
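For reference, a minimal sketch (an illustration, not Spark's code) showing that java.net.URI can resolve the host once the internationalized name is converted to its ASCII/punycode form with java.net.IDN:

{code:java}
import java.net.{IDN, URI}

// Sketch only: encode the IDN host to its ASCII (punycode) form
// before constructing the URI.
val asciiHost = IDN.toASCII("правительство.рф")
val uri = new URI("http", asciiHost, "/", null)

println(uri.getHost) // prints the xn-- punycode form of the host, not null
{code}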

To reproduce the problem in spark-sql:

{code:java}
spark-sql> select parse_url('http://日本語.JP/case/accessible/', 'HOST');
{code}
returns NULL

This problem was introduced by a change that was designed to improve the performance of PARSE_URL().

The same issue exists in the following SQL:

{code:java}
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
{code}

// returns null in Spark 2.1+
// returns ["abc"] in versions earlier than Spark 2.1

  was:
When using internationalized domain names in URLs, like:

{code:java}
val url = "http://правительство.рф"
{code}
parse_url returns null, but it works fine when using Hive's version of parse_url.

On digging further, I found that the difference is in the call below in Spark:

{code:java}
private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}
{code}

while Hive uses java.net.URL:

{code:java}
url = new URL(urlStr)
{code}

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

{code:java}
val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
{code}

To reproduce the problem in spark-sql:

{code:java}
spark-sql> select parse_url('http://日本語.JP/case/accessible/', 'HOST');
{code}
returns NULL

http://日本語.JP/case/accessible/

This problem was introduced by a change that was designed to improve the performance of PARSE_URL().

The same issue exists in the following SQL:

{code:java}
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
{code}

// returns null in Spark 2.1+
// returns ["abc"] in versions earlier than Spark 2.1


> parse_url regression when switched to using java.net.URI instead of 
> java.net.URL
> 
>
> Key: SPARK-23056
> URL: https://issues.apache.org/jira/browse/SPARK-23056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.2, 2.3.0
>Reporter: Yash Datta
>  Labels: regression
>
> When using internationalized domain names in URLs, like:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null, but it works fine when using Hive's version of
> parse_url.
> On digging further, I found that the difference is in the call below in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> while Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test demonstrates that URL works but URI does not in
> this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem in spark-sql:
> {code:java}
> spark-sql> select parse_url('http://日本語.JP/case/accessible/', 'HOST');
> {code}
> returns NULL
> This problem was introduced by a change that was designed to improve the
> performance of PARSE_URL().
> The same issue exists in the following SQL:
> {code:java}
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> {code}
> // returns null in Spark 2.1+
> // returns ["abc"] in versions earlier than Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

2018-01-12 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-23056:
---
Description: 
When using internationalized domain names in URLs, like:

{code:java}
val url = "http://правительство.рф"
{code}
parse_url returns null, but it works fine when using Hive's version of parse_url.

On digging further, I found that the difference is in the call below in Spark:

{code:java}
private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}
{code}

while Hive uses java.net.URL:

{code:java}
url = new URL(urlStr)
{code}

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

{code:java}
val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
{code}

To reproduce the problem in spark-sql:

{code:java}
spark-sql> select parse_url('http://日本語.JP/case/accessible/', 'HOST');
{code}
returns NULL

http://日本語.JP/case/accessible/

This problem was introduced by a change that was designed to improve the performance of PARSE_URL().

The same issue exists in the following SQL:

{code:java}
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
{code}

// returns null in Spark 2.1+
// returns ["abc"] in versions earlier than Spark 2.1

  was:
When using internationalized domain names in URLs, like:

{code:java}
val url = "http://правительство.рф"
{code}
parse_url returns null, but it works fine when using Hive's version of parse_url.

On digging further, I found that the difference is in the call below in Spark:

{code:java}
private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}
{code}

while Hive uses java.net.URL:

{code:java}
url = new URL(urlStr)
{code}

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

{code:java}
val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
{code}

To reproduce the problem in spark-sql:

{code:java}
spark-sql> select parse_url('http://日本語.JP/case/accessible/', 'HOST');
{code}
returns NULL

This problem was introduced by a change that was designed to improve the performance of PARSE_URL().

The same issue exists in the following SQL:

{code:java}
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
{code}

// returns null in Spark 2.1+
// returns ["abc"] in versions earlier than Spark 2.1


> parse_url regression when switched to using java.net.URI instead of 
> java.net.URL
> 
>
> Key: SPARK-23056
> URL: https://issues.apache.org/jira/browse/SPARK-23056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.2, 2.3.0
>Reporter: Yash Datta
>  Labels: regression
>
> When using internationalized domain names in URLs, like:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null, but it works fine when using Hive's version of
> parse_url.
> On digging further, I found that the difference is in the call below in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> while Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test demonstrates that URL works but URI does not in
> this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem in spark-sql:
> {code:java}
> spark-sql> select parse_url('http://日本語.JP/case/accessible/', 'HOST');
> {code}
> returns NULL
> http://日本語.JP/case/accessible/
> This problem was introduced by a change that was designed to improve the
> performance of PARSE_URL().
> The same issue exists in the following SQL:
> {code:java}
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> {code}
> // returns null in Spark 2.1+
> // returns ["abc"] in versions earlier than Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

2018-01-12 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-23056:
---
Description: 
When using internationalized domain names in URLs, like:

{code:java}
val url = "http://правительство.рф"
{code}
parse_url returns null, but it works fine when using Hive's version of parse_url.

On digging further, I found that the difference is in the call below in Spark:

{code:java}
private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}
{code}

while Hive uses java.net.URL:

{code:java}
url = new URL(urlStr)
{code}

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

{code:java}
val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
{code}

To reproduce the problem in spark-sql:

{code:java}
spark-sql> select parse_url('http://日本語.JP/case/accessible/', 'HOST');
{code}
returns NULL

This problem was introduced by a change that was designed to improve the performance of PARSE_URL().

The same issue exists in the following SQL:

{code:java}
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
{code}

// returns null in Spark 2.1+
// returns ["abc"] in versions earlier than Spark 2.1

  was:
When using internationalized domain names in URLs, like:

{code:java}
val url = "http://правительство.рф"
{code}
parse_url returns null, but it works fine when using Hive's version of parse_url.

On digging further, I found that the difference is in the call below in Spark:

{code:java}
private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}
{code}

while Hive uses java.net.URL:

{code:java}
url = new URL(urlStr)
{code}

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

{code:java}
val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
{code}

To reproduce the problem in spark-sql:

{code:java}
spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
{code}
returns NULL

This problem was introduced by a change that was designed to improve the performance of PARSE_URL().

The same issue exists in the following SQL:

{code:java}
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
{code}

// returns null in Spark 2.1+
// returns ["abc"] in versions earlier than Spark 2.1


> parse_url regression when switched to using java.net.URI instead of 
> java.net.URL
> 
>
> Key: SPARK-23056
> URL: https://issues.apache.org/jira/browse/SPARK-23056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.2, 2.3.0
>Reporter: Yash Datta
>  Labels: regression
>
> When using internationalized domain names in URLs, like:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null, but it works fine when using Hive's version of
> parse_url.
> On digging further, I found that the difference is in the call below in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> while Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test demonstrates that URL works but URI does not in
> this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem in spark-sql:
> {code:java}
> spark-sql> select parse_url('http://日本語.JP/case/accessible/', 'HOST');
> {code}
> returns NULL
> This problem was introduced by a change that was designed to improve the
> performance of PARSE_URL().
> The same issue exists in the following SQL:
> {code:java}
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> {code}
> // returns null in Spark 2.1+
> // returns ["abc"] in versions earlier than Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

2018-01-12 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324111#comment-16324111
 ] 

Yash Datta edited comment on SPARK-23056 at 1/12/18 3:26 PM:
-

Agreed that, going strictly by the standard, these are IRIs and not URLs, but
in practice they overlap. Please do suggest what we would like to support in
Spark.


was (Author: saucam):
Agreed that, going strictly by the standard, these are IRIs and not URLs, but
in practice they overlap. Please do suggest what we would like to support in
Spark.

http://xn--wgv71a119e.jp/case/accessible/

> parse_url regression when switched to using java.net.URI instead of 
> java.net.URL
> 
>
> Key: SPARK-23056
> URL: https://issues.apache.org/jira/browse/SPARK-23056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.2, 2.3.0
>Reporter: Yash Datta
>  Labels: regression
>
> When using internationalized domain names in URLs, like:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null, but it works fine when using Hive's version of
> parse_url.
> On digging further, I found that the difference is in the call below in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> while Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test demonstrates that URL works but URI does not in
> this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem in spark-sql:
> {code:java}
> spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
> {code}
> returns NULL
> This problem was introduced by a change that was designed to improve the
> performance of PARSE_URL().
> The same issue exists in the following SQL:
> {code:java}
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> {code}
> // returns null in Spark 2.1+
> // returns ["abc"] in versions earlier than Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

2018-01-12 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324111#comment-16324111
 ] 

Yash Datta edited comment on SPARK-23056 at 1/12/18 3:25 PM:
-

Agreed that, going strictly by the standard, these are IRIs and not URLs, but
in practice they overlap. Please do suggest what we would like to support in
Spark.

http://xn--wgv71a119e.jp/case/accessible/


was (Author: saucam):
Agreed that, going strictly by the standard, these are IRIs and not URLs, but
in practice they overlap. Please do suggest what we would like to support in
Spark.

> parse_url regression when switched to using java.net.URI instead of 
> java.net.URL
> 
>
> Key: SPARK-23056
> URL: https://issues.apache.org/jira/browse/SPARK-23056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.2, 2.3.0
>Reporter: Yash Datta
>  Labels: regression
>
> When using internationalized domain names in URLs, like:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null, but it works fine when using Hive's version of
> parse_url.
> On digging further, I found that the difference is in the call below in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> while Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test demonstrates that URL works but URI does not in
> this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem in spark-sql:
> {code:java}
> spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
> {code}
> returns NULL
> This problem was introduced by a change that was designed to improve the
> performance of PARSE_URL().
> The same issue exists in the following SQL:
> {code:java}
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> {code}
> // returns null in Spark 2.1+
> // returns ["abc"] in versions earlier than Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

2018-01-12 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324111#comment-16324111
 ] 

Yash Datta commented on SPARK-23056:


Agreed that, going strictly by the standard, these are IRIs and not URLs, but
in practice they overlap. Please do suggest what we would like to support in
Spark.

> parse_url regression when switched to using java.net.URI instead of 
> java.net.URL
> 
>
> Key: SPARK-23056
> URL: https://issues.apache.org/jira/browse/SPARK-23056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.2, 2.3.0
>Reporter: Yash Datta
>  Labels: regression
>
> When using internationalized domain names in URLs, like:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null, but it works fine when using Hive's version of
> parse_url.
> On digging further, I found that the difference is in the call below in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> while Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test demonstrates that URL works but URI does not in
> this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem in spark-sql:
> {code:java}
> spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
> {code}
> returns NULL
> This problem was introduced by a change that was designed to improve the
> performance of PARSE_URL().
> The same issue exists in the following SQL:
> {code:java}
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> {code}
> // returns null in Spark 2.1+
> // returns ["abc"] in versions earlier than Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

2018-01-12 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324056#comment-16324056
 ] 

Yash Datta edited comment on SPARK-23056 at 1/12/18 2:52 PM:
-

We have a production use case with many different IRIs in Japanese.
https://www.w3.org/International/articles/idn-and-iri/

Agreed that we should not re-introduce performance bottlenecks; that is why I
did not submit a patch reverting to the use of URL instead of URI, but at least
for us this is a valid use case.

If this is not a generic enough problem to be solved, we can close it, but I
disagree.
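One possible direction, sketched here only as an illustration (getHost and the
fallback logic are hypothetical, not a tested patch): keep java.net.URI as the
fast path and fall back to java.net.URL only when URI cannot extract a host,
which would preserve the performance fix for plain ASCII URLs:

{code:java}
import java.net.{URI, URISyntaxException, URL}

// Sketch only: fast URI path first, URL fallback for IRI-style inputs.
def getHost(urlStr: String): String = {
  val viaUri =
    try new URI(urlStr).getHost
    catch { case _: URISyntaxException => null }
  if (viaUri != null) viaUri
  else new URL(urlStr).getHost // slower, but tolerates IDN hosts
}
{code}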


was (Author: saucam):
We have a production use case with many different IRIs in Japanese.
https://www.w3.org/International/articles/idn-and-iri/

Agreed that we should not re-introduce performance bottlenecks; that is why I
did not submit a patch reverting to the use of URL instead of URI, but at least
for us this is a valid use case.

> parse_url regression when switched to using java.net.URI instead of 
> java.net.URL
> 
>
> Key: SPARK-23056
> URL: https://issues.apache.org/jira/browse/SPARK-23056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.2, 2.3.0
>Reporter: Yash Datta
>  Labels: regression
>
> When using internationalized domain names in URLs, like:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null, but it works fine when using Hive's version of
> parse_url.
> On digging further, I found that the difference is in the call below in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> while Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test demonstrates that URL works but URI does not in
> this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem in spark-sql:
> {code:java}
> spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
> {code}
> returns NULL
> This problem was introduced by a change that was designed to improve the
> performance of PARSE_URL().
> The same issue exists in the following SQL:
> {code:java}
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> {code}
> // returns null in Spark 2.1+
> // returns ["abc"] in versions earlier than Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

2018-01-12 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324056#comment-16324056
 ] 

Yash Datta commented on SPARK-23056:


We have a production use case with many different IRIs in Japanese.
https://www.w3.org/International/articles/idn-and-iri/

Agreed that we should not re-introduce performance bottlenecks; that is why I
did not submit a patch reverting to the use of URL instead of URI, but at least
for us this is a valid use case.

> parse_url regression when switched to using java.net.URI instead of 
> java.net.URL
> 
>
> Key: SPARK-23056
> URL: https://issues.apache.org/jira/browse/SPARK-23056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.2, 2.3.0
>Reporter: Yash Datta
>  Labels: regression
>
> When using internationalized domain names in URLs, like:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null, but it works fine when using Hive's version of
> parse_url.
> On digging further, I found that the difference is in the call below in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> while Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test demonstrates that URL works but URI does not in
> this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem in spark-sql:
> {code:java}
> spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
> {code}
> returns NULL
> This problem was introduced by a change that was designed to improve the
> performance of PARSE_URL().
> The same issue exists in the following SQL:
> {code:java}
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> {code}
> // returns null in Spark 2.1+
> // returns ["abc"] in versions earlier than Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

2018-01-12 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-23056:
---
Labels: regression  (was: regresion)

> parse_url regression when switched to using java.net.URI instead of 
> java.net.URL
> 
>
> Key: SPARK-23056
> URL: https://issues.apache.org/jira/browse/SPARK-23056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.2, 2.3.0
>Reporter: Yash Datta
>  Labels: regression
>
> When using internationalized domain names in URLs, like:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null, but it works fine when using Hive's version of
> parse_url.
> On digging further, I found that the difference is in the call below in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> while Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test demonstrates that URL works but URI does not in
> this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem in spark-sql:
> {code:java}
> spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
> {code}
> returns NULL
> This problem was introduced by a change that was designed to improve the
> performance of PARSE_URL().
> The same issue exists in the following SQL:
> {code:java}
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> {code}
> // returns null in Spark 2.1+
> // returns ["abc"] in versions earlier than Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

2018-01-12 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-23056:
---
Description: 
When using internationalized domain names in URLs, like:

{code:java}
val url = "http://правительство.рф"
{code}
parse_url returns null, but it works fine when using Hive's version of parse_url.

On digging further, I found that the difference is in the call below in Spark:

{code:java}
private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}
{code}

while Hive uses java.net.URL:

{code:java}
url = new URL(urlStr)
{code}

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

{code:java}
val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
{code}

To reproduce the problem in spark-sql:

{code:java}
spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
{code}
returns NULL

This problem was introduced by a change that was designed to improve the performance of PARSE_URL().

The same issue exists in the following SQL:

{code:java}
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
{code}

// returns null in Spark 2.1+
// returns ["abc"] in versions earlier than Spark 2.1

  was:
When using internationalized domain names in URLs, like:

{code:java}
val url = "http://правительство.рф"
{code}
parse_url returns null, but it works fine when using Hive's version of parse_url.

On digging further, I found that the difference is in the call below in Spark:

{code:java}
private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}
{code}

while Hive uses java.net.URL:

{code:java}
url = new URL(urlStr)
{code}

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

{code:java}
val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
{code}

To reproduce the problem in spark-sql:

{code:java}
spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
{code}
returns NULL

This problem was introduced by a change that was designed to improve the performance of PARSE_URL().

The same issue exists in the following SQL:

```SQL
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')

// returns null in Spark 2.1+
// returns ["abc"] in versions earlier than Spark 2.1
```


> parse_url regression when switched to using java.net.URI instead of 
> java.net.URL
> 
>
> Key: SPARK-23056
> URL: https://issues.apache.org/jira/browse/SPARK-23056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.2, 2.3.0
>Reporter: Yash Datta
>  Labels: regression
>
> When using internationalized domain names in URLs, like:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null, but it works fine when using Hive's version of
> parse_url.
> On digging further, I found that the difference is in the call below in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> while Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test demonstrates that URL works but URI does not in
> this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem in spark-sql:
> {code:java}
> spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
> {code}
> returns NULL
> This problem was introduced by a change that was designed to improve the
> performance of PARSE_URL().
> The same issue exists in the following SQL:
> {code:java}
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> {code}
> // returns null in Spark 2.1+
> // returns ["abc"] in versions earlier than Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

2018-01-12 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-23056:
---
Labels: regresion  (was: )

> parse_url regression when switched to using java.net.URI instead of 
> java.net.URL
> 
>
> Key: SPARK-23056
> URL: https://issues.apache.org/jira/browse/SPARK-23056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.2, 2.3.0
>Reporter: Yash Datta
>  Labels: regression
>
> When using internationalized domain names in URLs, like:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null, but it works fine when using Hive's version of
> parse_url.
> On digging further, I found that the difference is in the call below in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> while Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test demonstrates that URL works but URI does not in
> this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem in spark-sql:
> {code:java}
> spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
> {code}
> returns NULL
> This problem was introduced by a change that was designed to improve the
> performance of PARSE_URL().
> The same issue exists in the following SQL:
> ```SQL
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> // returns null in Spark 2.1+
> // returns ["abc"] in versions earlier than Spark 2.1
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

2018-01-12 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-23056:
---
Description: 
When using internationalized domain names in URLs, like:

{code:java}
val url = "http://правительство.рф"
{code}
parse_url returns null, but it works fine when using Hive's version of parse_url.

On digging further, I found that the difference is in the call below in Spark:

{code:java}
private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}
{code}

while Hive uses java.net.URL:

{code:java}
url = new URL(urlStr)
{code}

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

{code:java}
val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
{code}

To reproduce the problem in spark-sql:

{code:java}
spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
{code}
returns NULL

This problem was introduced by a change that was designed to improve the performance of PARSE_URL().

The same issue exists in the following SQL:

```SQL
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')

// returns null in Spark 2.1+
// returns ["abc"] in versions earlier than Spark 2.1
```

  was:
When using internationalized domain names in URLs, like:

val url = "http://правительство.рф"

parse_url returns null, but it works fine when using Hive's version of parse_url.

On digging further, I found that the difference is in the call below in Spark:

{code:java}
private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}
{code}

while Hive uses java.net.URL:

{code:java}
url = new URL(urlStr)
{code}

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

{code:java}
val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
{code}

To reproduce the problem in spark-sql:

{code:java}
spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
{code}
returns NULL

This problem was introduced by a change that was designed to improve the performance of PARSE_URL().

The same issue exists in the following SQL:

```SQL
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')

// returns null in Spark 2.1+
// returns ["abc"] in versions earlier than Spark 2.1
```


> parse_url regression when switched to using java.net.URI instead of 
> java.net.URL
> 
>
> Key: SPARK-23056
> URL: https://issues.apache.org/jira/browse/SPARK-23056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.2, 2.3.0
>Reporter: Yash Datta
>
> When using internationalized domain names in URLs, like:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null, but it works fine when using Hive's version of
> parse_url.
> On digging further, I found that the difference is in the call below in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> while Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test demonstrates that URL works but URI does not in
> this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem in spark-sql:
> {code:java}
> spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
> {code}
> returns NULL
> This problem was introduced by a change that was designed to improve the
> performance of PARSE_URL().
> The same issue exists in the following SQL:
> ```SQL
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> // returns null in Spark 2.1+
> // returns ["abc"] in versions earlier than Spark 2.1
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

2018-01-12 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-23056:
---
Description: 
When using internationalized domain names in URLs, like:

val url = "http://правительство.рф"

parse_url returns null, but it works fine when using Hive's version of parse_url.

On digging further, I found that the difference is in the call below in Spark:

{code:java}
private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}
{code}

while Hive uses java.net.URL:

{code:java}
url = new URL(urlStr)
{code}

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

{code:java}
val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
{code}

To reproduce the problem in spark-sql:

{code:java}
spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
{code}
returns NULL

This problem was introduced by a change that was designed to improve the performance of PARSE_URL().

The same issue exists in the following SQL:

```SQL
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')

// returns null in Spark 2.1+
// returns ["abc"] in versions earlier than Spark 2.1
```

  was:
When using internationalized domain names in URLs, like:

val url = "http://правительство.рф"

parse_url returns null, but it works fine when using Hive's version of parse_url.

On digging further, I found that the difference is in the call below in Spark:

private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}

while Hive uses java.net.URL:

url = new URL(urlStr)

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф

To reproduce the problem in spark-sql:

spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
returns NULL

This problem was introduced by a change that was designed to improve the performance of PARSE_URL().

The same issue exists in the following SQL:

```SQL
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')

// returns null in Spark 2.1+
// returns ["abc"] in versions earlier than Spark 2.1
```


> parse_url regression when switched to using java.net.URI instead of 
> java.net.URL
> 
>
> Key: SPARK-23056
> URL: https://issues.apache.org/jira/browse/SPARK-23056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.2, 2.3.0
>Reporter: Yash Datta
>
> When using internationalized domain names in URLs, like:
> val url = "http://правительство.рф"
> parse_url returns null, but it works fine when using Hive's version of
> parse_url.
> On digging further, I found that the difference is in the call below in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> while Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test demonstrates that URL works but URI does not in
> this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem in spark-sql:
> {code:java}
> spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
> {code}
> returns NULL
> This problem was introduced by a change that was designed to improve the
> performance of PARSE_URL().
> The same issue exists in the following SQL:
> ```SQL
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> // returns null in Spark 2.1+
> // returns ["abc"] in versions earlier than Spark 2.1
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

2018-01-12 Thread Yash Datta (JIRA)
Yash Datta created SPARK-23056:
--

 Summary: parse_url regression when switched to using java.net.URI 
instead of java.net.URL
 Key: SPARK-23056
 URL: https://issues.apache.org/jira/browse/SPARK-23056
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.3, 2.2.2, 2.3.0
Reporter: Yash Datta


When using internationalized domain names in URLs, like:

val url = "http://правительство.рф"

parse_url returns null, but it works fine when using Hive's version of parse_url.

On digging further, I found that the difference is in the call below in Spark:

private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}

while Hive uses java.net.URL:

url = new URL(urlStr)

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф

To reproduce the problem in spark-sql:

spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
returns NULL

This problem was introduced by a change that was designed to improve the performance of PARSE_URL().

The same issue exists in the following SQL:

```SQL
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')

// returns null in Spark 2.1+
// returns ["abc"] in versions earlier than Spark 2.1
```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5948) Support writing to partitioned table for the Parquet data source

2015-12-29 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073755#comment-15073755
 ] 

Yash Datta edited comment on SPARK-5948 at 12/29/15 10:18 AM:
--

Oh, I see.
So does it mean that when using Hive commands that use dynamic partitioning,
like:

insert overwrite table <target> partition (a, b)
select * from <source>

Spark will use this modified path?

Thanks for the prompt reply
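(For comparison, a hypothetical sketch of the equivalent DataFrame write path
that this issue added; it assumes a Spark 1.4-style SQLContext in scope, and the
table and path names are illustrative:)

{code:java}
// Sketch only: write with dynamic partition columns via the DataFrame API.
val df = sqlContext.table("source_table")
df.write
  .partitionBy("a", "b")        // one output directory per (a, b) value
  .parquet("/path/to/target")
{code}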


was (Author: saucam):
Oh, I see.
So does it mean that when using Hive commands that use dynamic partitioning,
like:

insert verwrite table <target> partition (a, b)
select * from <source>

Spark will use this modified path?

Thanks for the prompt reply

> Support writing to partitioned table for the Parquet data source
> 
>
> Key: SPARK-5948
> URL: https://issues.apache.org/jira/browse/SPARK-5948
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.4.0
>
>
> In 1.3.0, we added support for reading partitioned tables declared in the Hive
> metastore for the Parquet data source. However, writing to partitioned tables
> is not supported yet. This feature should probably be built upon SPARK-5947.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5948) Support writing to partitioned table for the Parquet data source

2015-12-29 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073755#comment-15073755
 ] 

Yash Datta commented on SPARK-5948:
---

Oh, I see.
So does it mean that when using Hive commands that use dynamic partitioning,
like:

insert verwrite table <target> partition (a, b)
select * from <source>

Spark will use this modified path?

Thanks for the prompt reply

> Support writing to partitioned table for the Parquet data source
> 
>
> Key: SPARK-5948
> URL: https://issues.apache.org/jira/browse/SPARK-5948
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.4.0
>
>
> In 1.3.0, we added support for reading partitioned tables declared in the Hive
> metastore for the Parquet data source. However, writing to partitioned tables
> is not supported yet. This feature should probably be built upon SPARK-5947.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5948) Support writing to partitioned table for the Parquet data source

2015-12-28 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073551#comment-15073551
 ] 

Yash Datta commented on SPARK-5948:
---

Can you please mention which change resolved this one?

> Support writing to partitioned table for the Parquet data source
> 
>
> Key: SPARK-5948
> URL: https://issues.apache.org/jira/browse/SPARK-5948
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.4.0
>
>
> In 1.3.0, we added support for reading partitioned tables declared in the Hive
> metastore for the Parquet data source. However, writing to partitioned tables
> is not supported yet. This feature should probably be built upon SPARK-5947.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11878) Eliminate distribute by in case group by is present with exactly the same grouping expressions

2015-11-19 Thread Yash Datta (JIRA)
Yash Datta created SPARK-11878:
--

 Summary: Eliminate distribute by in case group by is present with 
exactly the same grouping expressions
 Key: SPARK-11878
 URL: https://issues.apache.org/jira/browse/SPARK-11878
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yash Datta
Priority: Minor


For queries like:

select <> from table group by a distribute by a

we can eliminate the distribute by, since the group by will anyway do a hash
partitioning (an example of the equivalence is sketched below).

This is also applicable when the user uses the DataFrame API but the number of
partitions in RepartitionByExpression is not specified (None)
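A hypothetical before/after sketch of the proposed rule (table and column names
are illustrative):

{code:java}
-- Before: DISTRIBUTE BY repeats the hash partitioning that the
-- GROUP BY on the same expressions already performs.
SELECT a, count(*) FROM t GROUP BY a DISTRIBUTE BY a;

-- After the rule, the plan should be the same as for:
SELECT a, count(*) FROM t GROUP BY a;
{code}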



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10527) evaluate debugString only when log level is debug

2015-09-09 Thread Yash Datta (JIRA)
Yash Datta created SPARK-10527:
--

 Summary: evaluate debugString only when log level is debug
 Key: SPARK-10527
 URL: https://issues.apache.org/jira/browse/SPARK-10527
 Project: Spark
  Issue Type: Improvement
Reporter: Yash Datta
Priority: Trivial


Minor enhancement: evaluate debugString in the DAGScheduler only when the log
level is debug.
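A sketch of the intended pattern (names are illustrative; expensiveDebugString
stands in for the real debug-string computation):

{code:java}
import org.slf4j.LoggerFactory

val log = LoggerFactory.getLogger("DAGScheduler")

// Stand-in for the costly string construction.
def expensiveDebugString(): String = (1 to 1000000).mkString(",")

// The expensive string is built only when debug logging is enabled.
if (log.isDebugEnabled) {
  log.debug(expensiveDebugString())
}
{code}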



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10451) Prevent unnecessary serializations in InMemoryColumnarTableScan

2015-09-04 Thread Yash Datta (JIRA)
Yash Datta created SPARK-10451:
--

 Summary: Prevent unnecessary serializations in 
InMemoryColumnarTableScan
 Key: SPARK-10451
 URL: https://issues.apache.org/jira/browse/SPARK-10451
 Project: Spark
  Issue Type: Improvement
Reporter: Yash Datta


In InMemoryColumnarTableScan, serialization of certain fields like buildFilter,
the InMemoryRelation, etc. can be avoided during task execution by carefully
managing the closure of mapPartitions.
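A toy sketch of the technique (all names illustrative, not the actual operator):
assigning the needed fields to local vals keeps the mapPartitions closure from
capturing `this`, so only the small locals are serialized to executors:

{code:java}
import org.apache.spark.rdd.RDD

class Scan(rdd: RDD[Int]) extends Serializable {
  val threshold: Int = 42          // stands in for buildFilter etc.

  def run(): RDD[Int] = {
    val localThreshold = threshold // capture a local, not `this`
    rdd.mapPartitions(iter => iter.filter(_ > localThreshold))
  }
}
{code}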



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7340) Use latest parquet release 1.6.0 in spark

2015-05-04 Thread Yash Datta (JIRA)
Yash Datta created SPARK-7340:
-

 Summary: Use latest parquet release 1.6.0 in spark
 Key: SPARK-7340
 URL: https://issues.apache.org/jira/browse/SPARK-7340
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Yash Datta
 Fix For: 1.4.0


Bump the Parquet version used from 1.6.0rc3 to 1.6.0.
This brings a major improvement in the form of reading footers on the task
side instead of the driver, removing a major bottleneck.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7097) Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold

2015-04-27 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-7097:
--
Description: 
Currently, when deciding whether to create a HashJoin or a ShuffleHashJoin, the
size estimation of the partitioned tables involved considers the size of the
entire table. This results in many query plans using shuffle hash joins where
in fact only a small number of partitions may be referred to by the actual
query (due to additional filters), and hence these could be run using a
BroadcastHash join (a sketch of the intended estimation follows below).

The query plan should consider the size of only the referred partitions in such
cases
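A hedged sketch of the intended estimation, with toy types rather than Spark's
actual planner code:

{code:java}
case class Partition(values: Map[String, String], sizeInBytes: Long)

// Size the join input from only the partitions the query actually
// refers to, instead of the whole table.
def estimatedSize(partitions: Seq[Partition],
                  referred: Partition => Boolean): Long =
  partitions.filter(referred).map(_.sizeInBytes).sum

// Broadcast only if the referred partitions are small enough, e.g.:
// estimatedSize(parts, _.values("market") == "LA metro") <= autoBroadcastJoinThreshold
{code}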

  was:
Currently, when deciding whether to create a HashJoin or a ShuffleHashJoin, the
size estimation of the partitioned tables involved considers the size of the
entire table. This results in many query plans using shuffle hash joins where
in fact only a small number of partitions may be referred to by the actual
query (due to additional filters), and hence these could be run using a
map-side hash join.

The query plan should consider the size of only the referred partitions in such
cases


 Partitioned tables should only consider referred partitions in query during 
 size estimation for checking against autoBroadcastJoinThreshold
 ---

 Key: SPARK-7097
 URL: https://issues.apache.org/jira/browse/SPARK-7097
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1
Reporter: Yash Datta
 Fix For: 1.4.0


 Currently, when deciding whether to create a HashJoin or a ShuffleHashJoin,
 the size estimation of the partitioned tables involved considers the size of
 the entire table. This results in many query plans using shuffle hash joins
 where in fact only a small number of partitions may be referred to by the
 actual query (due to additional filters), and hence these could be run using
 a BroadcastHash join.
 The query plan should consider the size of only the referred partitions in
 such cases



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7142) Minor enhancement to BooleanSimplification Optimizer rule

2015-04-25 Thread Yash Datta (JIRA)
Yash Datta created SPARK-7142:
-

 Summary: Minor enhancement to BooleanSimplification Optimizer rule
 Key: SPARK-7142
 URL: https://issues.apache.org/jira/browse/SPARK-7142
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yash Datta
Priority: Minor


Add simplification using these rules:

A and (not(A) or B) = A and B

not(A and B) = not(A) or not(B)

not(A or B) = not(A) and not(B)
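
A toy sketch of these rewrites on a minimal boolean AST (the Expr types are 
hypothetical stand-ins for Catalyst's Not/And/Or expressions):

{code:java}
// Toy boolean AST; Catalyst's real Not/And/Or expressions work analogously.
sealed trait Expr
case class Var(name: String) extends Expr
case class Not(e: Expr) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

object BooleanSimplificationSketch {
  def simplify(e: Expr): Expr = e match {
    // A and (not(A) or B)  =>  A and B
    case And(a, Or(Not(a2), b)) if a == a2 => And(a, simplify(b))
    // not(A and B)  =>  not(A) or not(B)   (De Morgan)
    case Not(And(a, b)) => Or(simplify(Not(a)), simplify(Not(b)))
    // not(A or B)   =>  not(A) and not(B)  (De Morgan)
    case Not(Or(a, b)) => And(simplify(Not(a)), simplify(Not(b)))
    case other => other
  }
}
{code}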



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7097) Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold

2015-04-23 Thread Yash Datta (JIRA)
Yash Datta created SPARK-7097:
-

 Summary: Partitioned tables should only consider referred 
partitions in query during size estimation for checking against 
autoBroadcastJoinThreshold
 Key: SPARK-7097
 URL: https://issues.apache.org/jira/browse/SPARK-7097
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1, 1.3.0, 1.2.2, 1.2.1, 1.2.0, 1.1.1
Reporter: Yash Datta
 Fix For: 1.4.0


Currently, when deciding whether to create a HashJoin or a ShuffleHashJoin, 
the size estimation for partitioned tables considers the size of the entire 
table. This results in many query plans using shuffle hash joins where, in 
fact, only a small number of partitions may be referenced by the actual query 
(due to additional filters), and hence these could be run using a map-side hash 
join.

The query plan should consider the size of only the referred partitions in such 
cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6742) Spark pushes down filters in old parquet path that reference partitioning columns

2015-04-07 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-6742:
--

This is the same issue as SPARK-6554, which covers the new parquet path.

 Spark pushes down filters in old parquet path that reference partitioning 
 columns
 -

 Key: SPARK-6742
 URL: https://issues.apache.org/jira/browse/SPARK-6742
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Yash Datta

 Create a table with multiple fields, partitioned on the 'market' column. Run a 
 query like: 
 SELECT start_sp_time, end_sp_time, imsi, imei, enb_common_enbid FROM 
 csl_data_parquet WHERE (((technology = 'FDD') AND (bandclass = '800') AND 
 (region = 'R15') AND (market = 'LA metro')) OR ((technology = 'FDD') AND 
 (bandclass = '1900') AND (region = 'R15') AND (market = 'Indianapolis'))) AND 
 start_sp_time >= 1.4158368E9 AND end_sp_time < 1.4159232E9 AND dt >= 
 '2014-11-13-00-00' AND dt < '2014-11-14-00-00' ORDER BY end_sp_time DESC 
 LIMIT 100
 The OR filter is pushed down in this case, resulting in a 'column not found' 
 exception from parquet, since the pushed predicate references the partitioning 
 column 'market', which does not exist in the parquet files themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6742) Spark pushes down filters in old parquet path that reference partitioning columns

2015-04-07 Thread Yash Datta (JIRA)
Yash Datta created SPARK-6742:
-

 Summary: Spark pushes down filters in old parquet path that 
reference partitioning columns
 Key: SPARK-6742
 URL: https://issues.apache.org/jira/browse/SPARK-6742
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Yash Datta


Create a table with multiple fields, partitioned on the 'market' column. Run a 
query like: 

SELECT start_sp_time, end_sp_time, imsi, imei, enb_common_enbid FROM 
csl_data_parquet WHERE (((technology = 'FDD') AND (bandclass = '800') AND 
(region = 'R15') AND (market = 'LA metro')) OR ((technology = 'FDD') AND 
(bandclass = '1900') AND (region = 'R15') AND (market = 'Indianapolis'))) AND 
start_sp_time >= 1.4158368E9 AND end_sp_time < 1.4159232E9 AND dt >= 
'2014-11-13-00-00' AND dt < '2014-11-14-00-00' ORDER BY end_sp_time DESC LIMIT 
100

The OR filter is pushed down in this case, resulting in a 'column not found' 
exception from parquet, since the pushed predicate references the partitioning 
column 'market', which does not exist in the parquet files themselves.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4258) NPE with new Parquet Filters

2015-04-03 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394670#comment-14394670
 ] 

Yash Datta commented on SPARK-4258:
---

[~yhuai] No, it does not. I fixed this in parquet master and am waiting for 
parquet to release the next version. The current version (used in Spark) is 
1.6.0rc3.

 NPE with new Parquet Filters
 

 Key: SPARK-4258
 URL: https://issues.apache.org/jira/browse/SPARK-4258
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Critical
 Fix For: 1.2.0


 {code}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in 
 stage 21.0 (TID 160, ip-10-0-247-144.us-west-2.compute.internal): 
 java.lang.NullPointerException: 
 parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206)
 parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
 parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:210)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
 parquet.filter2.predicate.Operators$Or.accept(Operators.java:302)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:201)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
 parquet.filter2.predicate.Operators$And.accept(Operators.java:290)
 
 parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52)
 parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46)
 parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
 
 parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
 
 parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
 
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
 
 parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
 {code}
 This occurs when reading parquet data encoded with the older version of the 
 library, for TPC-DS query 34. Will work on coming up with a smaller 
 reproduction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6632) Optimize the parquetSchema to metastore schema reconciliation, so that the process is delegated to each map task itself

2015-03-31 Thread Yash Datta (JIRA)
Yash Datta created SPARK-6632:
-

 Summary: Optimize the parquetSchema to metastore schema 
reconciliation, so that the process is delegated to each map task itself
 Key: SPARK-6632
 URL: https://issues.apache.org/jira/browse/SPARK-6632
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yash Datta
 Fix For: 1.4.0


Currently in ParquetRelation2, the schema from all the part files is first 
merged, and then reconciled with the metastore schema. This approach does not 
scale when a table has thousands of partitions. We can take a different 
approach: go ahead with the metastore schema, and reconcile the names of the 
columns within each map task, using the ReadSupport hooks provided by parquet.
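
A rough sketch of the per-task reconciliation step (the Field type and the 
case-insensitive matching rule are illustrative assumptions; the real hook 
would live in a parquet ReadSupport implementation):

{code:java}
// Hypothetical sketch: map metastore column names onto the names that
// actually appear in this task's parquet footer, instead of merging all
// footers up front on the driver.
object SchemaReconcileSketch {
  case class Field(name: String, dataType: String)

  def reconcile(metastoreSchema: Seq[Field],
                taskParquetSchema: Seq[Field]): Seq[Field] = {
    // index this file's fields case-insensitively
    val byLowerName = taskParquetSchema.map(f => f.name.toLowerCase -> f).toMap
    metastoreSchema.map { ms =>
      byLowerName.get(ms.name.toLowerCase)
        .map(pq => ms.copy(name = pq.name)) // keep the parquet-side name
        .getOrElse(ms)                      // missing in this file: read as null
    }
  }
}
{code}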



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-23 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-6471:
--
Summary: Metastore schema should only be a subset of parquet schema to 
support dropping of columns using replace columns  (was: Metastoreschema should 
only be a subset of parquetSchema to support dropping of columns using replace 
columns)

 Metastore schema should only be a subset of parquet schema to support 
 dropping of columns using replace columns
 ---

 Key: SPARK-6471
 URL: https://issues.apache.org/jira/browse/SPARK-6471
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yash Datta
 Fix For: 1.4.0


 Currently in the parquet relation 2 implementation, an error is thrown in case 
 the merged schema is not exactly the same as the metastore schema. 
 But to support cases like deletion of a column using the replace columns 
 command, we can relax the restriction so that the query will work even if the 
 metastore schema is only a subset of the merged parquet schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6471) Metastoreschema should only be a subset of parquetSchema to support dropping of columns using replace columns

2015-03-23 Thread Yash Datta (JIRA)
Yash Datta created SPARK-6471:
-

 Summary: Metastoreschema should only be a subset of parquetSchema 
to support dropping of columns using replace columns
 Key: SPARK-6471
 URL: https://issues.apache.org/jira/browse/SPARK-6471
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yash Datta
 Fix For: 1.4.0


Currently in the parquet relation 2 implementation, an error is thrown in case 
the merged schema is not exactly the same as the metastore schema. 
But to support cases like deletion of a column using the replace columns 
command, we can relax the restriction so that the query will work even if the 
metastore schema is only a subset of the merged parquet schema (see the sketch 
below).
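
A minimal sketch of the relaxed check (the Field type is a simplified stand-in 
for Catalyst StructType fields; illustration only):

{code:java}
// Hypothetical sketch: accept the metastore schema as long as every
// metastore column exists, with the same type, in the merged parquet
// schema, instead of requiring exact equality.
object SchemaSubsetSketch {
  case class Field(name: String, dataType: String)

  def isCompatible(metastore: Seq[Field], mergedParquet: Seq[Field]): Boolean = {
    val parquetByName = mergedParquet.map(f => f.name -> f.dataType).toMap
    // subset check: columns dropped via 'replace columns' no longer fail the query
    metastore.forall(ms => parquetByName.get(ms.name).contains(ms.dataType))
  }
}
{code}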



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-23 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376286#comment-14376286
 ] 

Yash Datta commented on SPARK-6471:
---

https://github.com/apache/spark/pull/5141

 Metastore schema should only be a subset of parquet schema to support 
 dropping of columns using replace columns
 ---

 Key: SPARK-6471
 URL: https://issues.apache.org/jira/browse/SPARK-6471
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yash Datta
 Fix For: 1.4.0


 Currently in the parquet relation 2 implementation, an error is thrown in 
 case the merged schema is not exactly the same as the metastore schema. 
 But to support cases like deletion of a column using the replace columns 
 command, we can relax the restriction so that the query will work even if the 
 metastore schema is only a subset of the merged parquet schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-23 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-6471:
--
Comment: was deleted

(was: https://github.com/apache/spark/pull/5141)

 Metastore schema should only be a subset of parquet schema to support 
 dropping of columns using replace columns
 ---

 Key: SPARK-6471
 URL: https://issues.apache.org/jira/browse/SPARK-6471
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yash Datta
 Fix For: 1.4.0


 Currently in the parquet relation 2 implementation, an error is thrown in 
 case the merged schema is not exactly the same as the metastore schema. 
 But to support cases like deletion of a column using the replace columns 
 command, we can relax the restriction so that the query will work even if the 
 metastore schema is only a subset of the merged parquet schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6006) Optimize count distinct in case high cardinality columns

2015-02-25 Thread Yash Datta (JIRA)
Yash Datta created SPARK-6006:
-

 Summary: Optimize count distinct in case high cardinality columns
 Key: SPARK-6006
 URL: https://issues.apache.org/jira/browse/SPARK-6006
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.1, 1.1.1
Reporter: Yash Datta
Priority: Minor
 Fix For: 1.3.0


In case there are a lot of distinct values, count distinct becomes too slow 
since it tries to hash all the partial results into one map. It can be improved 
by creating buckets/partial maps in an intermediate stage, where the same key 
from multiple partial maps of the first stage hashes to the same bucket. Later 
we can sum the sizes of these buckets to get the total distinct count (see the 
sketch below).
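
A rough sketch of the idea on plain RDDs (the bucket count of 64 is arbitrary; 
this illustrates the two-stage shape, not Spark's actual aggregation code):

{code:java}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark

object DistinctCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("distinct-sketch"))
    val values = sc.parallelize(1 to 1000000).map(_ % 123456)

    // Stage 1: partial de-duplication within each input partition.
    // Stage 2: hash-partition the survivors so each key lands in exactly one
    // bucket, de-duplicate per bucket, then sum the bucket sizes -- avoiding
    // one giant hash map on a single node.
    val totalDistinct = values
      .mapPartitions(_.toSet.iterator)        // partial maps (stage 1)
      .map(v => (v, null))
      .partitionBy(new HashPartitioner(64))   // same key -> same bucket
      .mapPartitions(it => Iterator(it.map(_._1).toSet.size.toLong))
      .reduce(_ + _)                          // sum of bucket sizes

    println(s"distinct count = $totalDistinct")
    sc.stop()
  }
}
{code}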



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6006) Optimize count distinct in case of high cardinality columns

2015-02-25 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-6006:
--
Summary: Optimize count distinct in case of high cardinality columns  (was: 
Optimize count distinct in case high cardinality columns)

 Optimize count distinct in case of high cardinality columns
 ---

 Key: SPARK-6006
 URL: https://issues.apache.org/jira/browse/SPARK-6006
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.1, 1.2.1
Reporter: Yash Datta
Priority: Minor
 Fix For: 1.3.0


 In case there are a lot of distinct values, count distinct becomes too slow 
 since it tries to hash all the partial results into one map. It can be 
 improved by creating buckets/partial maps in an intermediate stage, where the 
 same key from multiple partial maps of the first stage hashes to the same 
 bucket. Later we can sum the sizes of these buckets to get the total distinct 
 count.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5684) Key not found exception is thrown in case location of added partition to a parquet table is different than a path containing the partition values

2015-02-09 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-5684:
--
Priority: Major  (was: Critical)

 Key not found exception is thrown in case location of added partition to a 
 parquet table is different than a path containing the partition values
 -

 Key: SPARK-5684
 URL: https://issues.apache.org/jira/browse/SPARK-5684
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.1.1, 1.2.0
Reporter: Yash Datta
 Fix For: 1.3.0


 Create a partitioned parquet table: 
 create table test_table (dummy string) partitioned by (timestamp bigint) 
 stored as parquet;
 Add a partition to the table and specify a different location:
 alter table test_table add partition (timestamp=9) location 
 '/data/pth/different'
 Run a simple select * query, and we get an exception:
 15/02/09 08:27:25 ERROR thriftserver.SparkSQLDriver: Failed in [select * from 
 db4_mi2mi_binsrc1_default limit 5]
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 
 (TID 21, localhost): java
 .util.NoSuchElementException: key not found: timestamp
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.MapLike$class.apply(MapLike.scala:141)
 at scala.collection.AbstractMap.apply(Map.scala:58)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:141)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:128)
 at 
 org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:247)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 This happens because the parquet path assumes that (key=value) patterns are 
 present in the partition location, which is not always the case!
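
A tiny sketch of the assumption that breaks (hypothetical parser; the real 
logic lives in Spark's parquet table scan). A location like '/data/pth/different' 
yields no (key=value) segments, so looking up the 'timestamp' partition key fails:

{code:java}
// Hypothetical sketch of key=value partition-path parsing.
object PartitionPathSketch {
  def parsePartitionValues(path: String): Map[String, String] =
    path.split("/").filter(_.contains("=")).map { seg =>
      val Array(k, v) = seg.split("=", 2)
      k -> v
    }.toMap

  def main(args: Array[String]): Unit = {
    val ok  = parsePartitionValues("/warehouse/test_table/timestamp=9")
    val bad = parsePartitionValues("/data/pth/different")
    println(ok("timestamp"))      // "9"
    println(bad.get("timestamp")) // None -- bad("timestamp") would throw
                                  // NoSuchElementException: key not found: timestamp
  }
}
{code}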



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5453) Use hive-site.xml to set class for adding custom filter for input files

2015-01-28 Thread Yash Datta (JIRA)
Yash Datta created SPARK-5453:
-

 Summary: Use hive-site.xml to set class for adding custom filter 
for input files
 Key: SPARK-5453
 URL: https://issues.apache.org/jira/browse/SPARK-5453
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yash Datta
Priority: Minor


It would be useful if users could add a custom input filter class in 
hive-site.xml and use it seamlessly in both hive and spark! A hedged sketch of 
the idea follows.
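
For illustration only (the property name "hive.input.filter.class" and the 
PathFilter wiring are hypothetical, not an existing Hive or Spark option):

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, PathFilter}

// Hypothetical custom filter: skip temporary files.
class SkipTmpFilter extends PathFilter {
  override def accept(p: Path): Boolean = !p.getName.endsWith(".tmp")
}

object FilterLoaderSketch {
  // Hypothetical wiring: read the class name from the (assumed) property
  // "hive.input.filter.class" in hive-site.xml and instantiate it.
  def loadInputFilter(conf: Configuration): Option[PathFilter] =
    Option(conf.get("hive.input.filter.class")).map { cls =>
      Class.forName(cls).newInstance().asInstanceOf[PathFilter]
    }
}
{code}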



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4786) Parquet filter pushdown for BYTE and SHORT types

2015-01-21 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287004#comment-14287004
 ] 

Yash Datta commented on SPARK-4786:
---

https://github.com/apache/spark/pull/4156

 Parquet filter pushdown for BYTE and SHORT types
 

 Key: SPARK-4786
 URL: https://issues.apache.org/jira/browse/SPARK-4786
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Lian

 Among all integral types, currently only INT and LONG predicates can be 
 converted to Parquet filter predicates. BYTE and SHORT predicates can be 
 covered by INT.
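
For illustration, a sketch of widening BYTE/SHORT predicates to INT with the 
filter2 API (pre-Apache 'parquet' package names assumed; BYTE and SHORT values 
are physically stored as INT32):

{code:java}
import parquet.filter2.predicate.{FilterApi, FilterPredicate}

// Parquet's filter2 API has no byte/short columns, but BYTE and SHORT are
// stored as INT32, so a predicate can simply widen the value to Int.
object WidenedPredicates {
  def byteEq(column: String, value: Byte): FilterPredicate =
    FilterApi.eq(FilterApi.intColumn(column), Int.box(value.toInt))

  def shortEq(column: String, value: Short): FilterPredicate =
    FilterApi.eq(FilterApi.intColumn(column), Int.box(value.toInt))
}
{code}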



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-4786) Parquet filter pushdown for BYTE and SHORT types

2015-01-21 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-4786:
--
Comment: was deleted

(was: https://github.com/apache/spark/pull/4156)

 Parquet filter pushdown for BYTE and SHORT types
 

 Key: SPARK-4786
 URL: https://issues.apache.org/jira/browse/SPARK-4786
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Lian

 Among all integral types, currently only INT and LONG predicates can be 
 converted to Parquet filter predicates. BYTE and SHORT predicates can be 
 covered by INT.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4762) Add support for tuples in where in clause query

2014-12-05 Thread Yash Datta (JIRA)
Yash Datta created SPARK-4762:
-

 Summary: Add support for tuples in where in clause query
 Key: SPARK-4762
 URL: https://issues.apache.org/jira/browse/SPARK-4762
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yash Datta
 Fix For: 1.3.0


Currently, in the 'where in' clause the filter is applied only on a single 
column. We can enhance it to accept a filter on multiple columns.

So current support is for queries like:
Select * from table where c1 in (value1, value2, ... value n);

Need to add support for queries like:
Select * from table where (c1, c2, ... cn) in ((value1, value2, ... value n), 
(value1', value2', ... value n'))

Also, we can add an optimized version of the 'where in' clause for tuples, 
where we create a hashset of the filter tuples for matching rows (see the 
sketch below).

This also requires a change in the hive parser, since currently there is no 
support for multiple columns in an IN clause.
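
A minimal sketch of the hashset-based tuple match (plain Scala; the Row alias 
here is a simplified stand-in, not Catalyst's Row):

{code:java}
object TupleInSketch {
  // Hypothetical sketch: evaluate (c1, c2) IN ((...), (...)) with a HashSet
  // so each row is matched in O(1) instead of scanning the value list.
  type Row = Map[String, Any]

  def tupleInFilter(rows: Seq[Row],
                    columns: Seq[String],
                    values: Set[Seq[Any]]): Seq[Row] =
    rows.filter(r => values.contains(columns.map(r)))

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Map[String, Any]("c1" -> 1, "c2" -> "a"),
      Map[String, Any]("c1" -> 2, "c2" -> "b")
    )
    val matched = tupleInFilter(rows, Seq("c1", "c2"),
      Set(Seq(1, "a"), Seq(3, "c")))
    println(matched) // keeps only the (c1=1, c2="a") row
  }
}
{code}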



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4762) Add support for tuples in 'where in' clause query

2014-12-05 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-4762:
--
Summary: Add support for tuples in 'where in' clause query  (was: Add 
support for tuples in where in clause query)

 Add support for tuples in 'where in' clause query
 -

 Key: SPARK-4762
 URL: https://issues.apache.org/jira/browse/SPARK-4762
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yash Datta
 Fix For: 1.3.0


 Currently, in the 'where in' clause the filter is applied only on a single 
 column. We can enhance it to accept a filter on multiple columns.
 So current support is for queries like:
 Select * from table where c1 in (value1, value2, ... value n);
 Need to add support for queries like:
 Select * from table where (c1, c2, ... cn) in ((value1, value2, ... value n), 
 (value1', value2', ... value n'))
 Also, we can add an optimized version of the 'where in' clause for tuples, 
 where we create a hashset of the filter tuples for matching rows.
 This also requires a change in the hive parser, since currently there is no 
 support for multiple columns in an IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4762) Add support for tuples in 'where in' clause query

2014-12-05 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14235280#comment-14235280
 ] 

Yash Datta commented on SPARK-4762:
---

Already created a PR for the hive parser

 Add support for tuples in 'where in' clause query
 -

 Key: SPARK-4762
 URL: https://issues.apache.org/jira/browse/SPARK-4762
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yash Datta
 Fix For: 1.3.0


 Currently, in the 'where in' clause the filter is applied only on a single 
 column. We can enhance it to accept a filter on multiple columns.
 So current support is for queries like:
 Select * from table where c1 in (value1, value2, ... value n);
 Need to add support for queries like:
 Select * from table where (c1, c2, ... cn) in ((value1, value2, ... value n), 
 (value1', value2', ... value n'))
 Also, we can add an optimized version of the 'where in' clause for tuples, 
 where we create a hashset of the filter tuples for matching rows.
 This also requires a change in the hive parser, since currently there is no 
 support for multiple columns in an IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4365) Remove unnecessary filter call on records returned from parquet library

2014-11-12 Thread Yash Datta (JIRA)
Yash Datta created SPARK-4365:
-

 Summary: Remove unnecessary filter call on records returned from 
parquet library
 Key: SPARK-4365
 URL: https://issues.apache.org/jira/browse/SPARK-4365
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yash Datta
Priority: Minor
 Fix For: 1.2.0


Since the parquet library has been updated, we no longer need to filter the 
records returned from the parquet library for null records, as the library now 
skips those itself. From 
parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java:


{code:java}
  public boolean nextKeyValue() throws IOException, InterruptedException {
    boolean recordFound = false;

    while (!recordFound) {
      // no more records left
      if (current >= total) { return false; }

      try {
        checkRead();
        currentValue = recordReader.read();
        current++;
        if (recordReader.shouldSkipCurrentRecord()) {
          // this record is being filtered via the filter2 package
          if (DEBUG) LOG.debug("skipping record");
          continue;
        }

        if (currentValue == null) {
          // only happens with FilteredRecordReader at end of block
          current = totalCountLoadedSoFar;
          if (DEBUG) LOG.debug("filtered record reader reached end of block");
          continue;
        }
        recordFound = true;

        if (DEBUG) LOG.debug("read value: " + currentValue);
      } catch (RuntimeException e) {
        throw new ParquetDecodingException(
            format("Can not read value at %d in block %d in file %s",
                current, currentBlock, file), e);
      }
    }
    return true;
  }
{code}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3968) Use parquet-mr filter2 api in spark sql

2014-10-20 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-3968:
--
Description: 
The parquet-mr project has introduced a new filter api, along with several 
fixes (like filtering on optional fields). It can also eliminate entire 
RowGroups based on certain statistics like min/max.
We can leverage that to further improve the performance of queries with filters.
The filter2 api also introduces the ability to create custom filters. We can 
create a custom filter for the optimized In clause (InSet), so that the 
elimination happens in the ParquetRecordReader itself (will create a separate 
ticket for that); see the sketch below.

This fixes the below ticket: 

https://issues.apache.org/jira/browse/SPARK-1847
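
For flavor, a sketch of what such a custom InSet filter could look like with 
the filter2 API (pre-Apache 'parquet' package names; the instance-accepting 
FilterApi.userDefined overload and the canDrop logic are assumptions for 
illustration, not Spark's implementation):

{code:java}
import parquet.filter2.predicate.{FilterApi, FilterPredicate, Statistics, UserDefinedPredicate}

// Sketch: keep a row only if the column value is in the given set; a row
// group whose [min, max] range contains no set element can be dropped.
class IntInSet(values: Set[Int]) extends UserDefinedPredicate[Integer]
    with Serializable {
  override def keep(value: Integer): Boolean = values.contains(value)

  override def canDrop(statistics: Statistics[Integer]): Boolean =
    !values.exists(v => v >= statistics.getMin && v <= statistics.getMax)

  override def inverseCanDrop(statistics: Statistics[Integer]): Boolean = false
}

object InSetFilterSketch {
  // note: some filter2 versions accept a predicate Class here instead of an
  // instance -- treat this wiring as an assumption
  def predicate(column: String, values: Set[Int]): FilterPredicate =
    FilterApi.userDefined(FilterApi.intColumn(column), new IntInSet(values))
}
{code}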


  was:
The parquet-mr project has introduced a new filter api, along with several 
fixes. It can also eliminate entire RowGroups based on certain statistics 
like min/max.
We can leverage that to further improve the performance of queries with filters.
The filter2 api also introduces the ability to create custom filters. We can 
create a custom filter for the optimized In clause (InSet), so that the 
elimination happens in the ParquetRecordReader itself (will create a separate 
ticket for that).


 Use parquet-mr filter2 api in spark sql
 ---

 Key: SPARK-3968
 URL: https://issues.apache.org/jira/browse/SPARK-3968
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yash Datta
Priority: Minor
 Fix For: 1.1.1


 The parquet-mr project has introduced a new filter api, along with several 
 fixes (like filtering on optional fields). It can also eliminate entire 
 RowGroups based on certain statistics like min/max.
 We can leverage that to further improve the performance of queries with 
 filters.
 The filter2 api also introduces the ability to create custom filters. We can 
 create a custom filter for the optimized In clause (InSet), so that the 
 elimination happens in the ParquetRecordReader itself (will create a separate 
 ticket for that).
 This fixes the below ticket: 
 https://issues.apache.org/jira/browse/SPARK-1847



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3968) Use parquet-mr filter2 api in spark sql

2014-10-18 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-3968:
--
Description: 
The parquet-mr project has introduced a new filter api , along with several 
fixes . It can also eliminate entire RowGroups depending on certain statistics 
like min/max
We can leverage that to further improve performance of queries with filters.
Also filter2 api introduces ability to create custom filters. We can create a 
custom filter for the optimized In clause (InSet) , so that elimination happens 
in the ParquetRecordReader itself (will create a separate ticket for that) .

  was:
The parquet-mr project has introduced a new filter api, along with several 
fixes, like filtering on OPTIONAL columns as well. It can also eliminate 
entire RowGroups based on certain statistics like min/max.
We can leverage that to further improve the performance of queries with filters.
The filter2 api also introduces the ability to create custom filters. We can 
create a custom filter for the optimized In clause (InSet), so that the 
elimination happens in the ParquetRecordReader itself (will create a separate 
ticket for that).


 Use parquet-mr filter2 api in spark sql
 ---

 Key: SPARK-3968
 URL: https://issues.apache.org/jira/browse/SPARK-3968
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yash Datta
Priority: Minor
 Fix For: 1.1.1


 The parquet-mr project has introduced a new filter api, along with several 
 fixes. It can also eliminate entire RowGroups based on certain 
 statistics like min/max.
 We can leverage that to further improve the performance of queries with 
 filters.
 The filter2 api also introduces the ability to create custom filters. We can 
 create a custom filter for the optimized In clause (InSet), so that the 
 elimination happens in the ParquetRecordReader itself (will create a separate 
 ticket for that).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3968) Using parquet-mr filter2 api in spark sql, add a custom filter for InSet clause

2014-10-16 Thread Yash Datta (JIRA)
Yash Datta created SPARK-3968:
-

 Summary: Using parquet-mr filter2 api in spark sql, add a custom 
filter for InSet clause
 Key: SPARK-3968
 URL: https://issues.apache.org/jira/browse/SPARK-3968
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yash Datta
Priority: Minor
 Fix For: 1.1.1


The parquet-mr project has introduced a new filter api, along with several 
fixes, like filtering on OPTIONAL columns as well. It can also eliminate 
entire RowGroups based on certain statistics like min/max.
We can leverage that to further improve the performance of queries with filters.
The filter2 api also introduces the ability to create custom filters. We can 
create a custom filter for the optimized In clause (InSet), so that the 
elimination happens in the ParquetRecordReader itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3968) Using parquet-mr filter2 api in spark sql, add a custom filter for InSet clause

2014-10-16 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-3968:
--
Shepherd: Yash Datta

 Using parquet-mr filter2 api in spark sql, add a custom filter for InSet 
 clause
 ---

 Key: SPARK-3968
 URL: https://issues.apache.org/jira/browse/SPARK-3968
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yash Datta
Priority: Minor
 Fix For: 1.1.1


 The parquet-mr project has introduced a new filter api, along with several 
 fixes, like filtering on OPTIONAL columns as well. It can also eliminate 
 entire RowGroups based on certain statistics like min/max.
 We can leverage that to further improve the performance of queries with 
 filters.
 The filter2 api also introduces the ability to create custom filters. We can 
 create a custom filter for the optimized In clause (InSet), so that the 
 elimination happens in the ParquetRecordReader itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3711) Optimize where in clause filter queries

2014-09-30 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152878#comment-14152878
 ] 

Yash Datta commented on SPARK-3711:
---

On a 2-node setup,
each machine's config: 24 cores, 96 GB RAM.

Invoking spark-sql with:
./bin/spark-sql --executor-memory 16G --driver-memory 8G --master <master-url>

Executing a filter query on a parquet table having 47750544 rows, with ~1000 
filters (the selected column was unique for each row):

select * from table where column in (A1, A2, ... A1000);

Time taken on spark-1.1 (after multiple runs):
~90 seconds

After the patch:
~7 seconds


 Optimize where in clause filter queries
 ---

 Key: SPARK-3711
 URL: https://issues.apache.org/jira/browse/SPARK-3711
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yash Datta
Priority: Minor
 Fix For: 1.1.1


 The In case class is replaced by an InSet class when all the filter values 
 are literals; InSet uses a hashset instead of a Sequence, thereby giving a 
 significant performance improvement. The maximum improvement should be 
 visible when only a small percentage of a large dataset matches the filter 
 list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3711) Optimize where in clause filter queries

2014-09-27 Thread Yash Datta (JIRA)
Yash Datta created SPARK-3711:
-

 Summary: Optimize where in clause filter queries
 Key: SPARK-3711
 URL: https://issues.apache.org/jira/browse/SPARK-3711
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yash Datta
Priority: Minor
 Fix For: 1.1.1


The In case class is replaced by an InSet class when all the filter values are 
literals; InSet uses a hashset instead of a Sequence, thereby giving a 
significant performance improvement (see the sketch below). The maximum 
improvement should be visible when only a small percentage of a large dataset 
matches the filter list.
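
A minimal illustration of why the hashset wins (plain Scala; evaluate-per-row 
stands in for Catalyst's expression evaluation):

{code:java}
object InVsInSetSketch {
  // In: O(n) scan of the literal list for every row.
  def inFilter(value: Any, list: Seq[Any]): Boolean =
    list.contains(value) // linear scan, repeated per row

  // InSet: the literals are hashed once, each row probes in O(1).
  def inSetFilter(value: Any, set: Set[Any]): Boolean =
    set.contains(value)

  def main(args: Array[String]): Unit = {
    val literals = (1 to 1000).map(i => s"A$i": Any)
    val set = literals.toSet // built once when the plan is optimized
    println(inFilter("A999", literals)) // walks ~999 entries
    println(inSetFilter("A999", set))   // single hash lookup
  }
}
{code}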



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org