[ https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324031#comment-16324031 ]
Sean Owen commented on SPARK-23056:
-----------------------------------

[~saucam] that is not a valid URI or URL. I don't think this can be considered a bug. I'm surprised the URL class parses it, and I agree it's good to be consistent with Hive, but I'm not sure this is guaranteed by the semantics of the function. The original problem was a big performance bottleneck. If there's a solution that avoids that problem and also makes this more lenient to match Hive, that could be OK, but I am not sure this should be considered a problem. You can URL-escape that URL.

> parse_url regression when switched to using java.net.URI instead of java.net.URL
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-23056
>                 URL: https://issues.apache.org/jira/browse/SPARK-23056
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.3, 2.2.2, 2.3.0
>            Reporter: Yash Datta
>              Labels: regression
>
> When using internationalized domains in URLs like:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null, but it works fine when using Hive's version of parse_url.
> On digging further, I found that the difference is in the call below in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> while Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test demonstrates that URL works but URI does not in this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem on spark-sql:
> {code:java}
> spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
> {code}
> returns NULL.
> This problem was introduced by <https://issues.apache.org/jira/browse/SPARK-16826>, which is designed to improve the performance of PARSE_URL().
> The same issue exists in the following SQL:
> {code:java}
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> {code}
> // returns null in Spark 2.1+
> // returns ["abc"] before Spark 2.1
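
As an illustration of the "URL-escape that URL" suggestion, and of one possible way to stay lenient without paying the java.net.URL cost on every row, here is a minimal sketch. It is not Spark's or Hive's implementation. The class name `LenientUrl` and the method `parseLenient` are hypothetical. The sketch tries the strict `java.net.URI` parse first and only falls back to `java.net.URL` when that fails or yields no host, in which case it converts the host to its ASCII (Punycode) form with `java.net.IDN` and lets the multi-argument `URI` constructor quote illegal characters.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper, for illustration only.
final class LenientUrl {

    // Returns a URI for the input, or null if neither path can parse it.
    @SuppressWarnings("deprecation") // new URL(String) is deprecated on recent JDKs
    static URI parseLenient(String raw) {
        URI strict = null;
        try {
            // Fast path: well-formed input never touches java.net.URL.
            strict = new URI(raw);
            if (strict.getHost() != null) {
                return strict;
            }
        } catch (URISyntaxException e) {
            // Fall through to the lenient path.
        }
        try {
            // Slow path: let URL split the string, then rebuild a strict URI.
            URL url = new URL(raw);
            // Convert an internationalized host to its ASCII form.
            String asciiHost = IDN.toASCII(url.getHost());
            // The multi-argument constructor quotes characters that are
            // illegal in each component (for example '"' in the query).
            return new URI(url.getProtocol(), url.getUserInfo(), asciiHost,
                    url.getPort(), url.getPath(), url.getQuery(), url.getRef());
        } catch (MalformedURLException | URISyntaxException
                | IllegalArgumentException e) {
            // Keep whatever the strict parse produced (possibly null).
            return strict;
        }
    }
}
```

Two caveats under these assumptions: the host returned for an internationalized domain is the Punycode form (`IDN.toUnicode` can convert it back if the Unicode form is wanted), and because the multi-argument constructor always quotes `%`, input that reaches the slow path with some characters already percent-encoded would be encoded twice.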