Hi devs,

I stumbled across an interesting problem with the parse_url function that
was implemented in Spark in
https://issues.apache.org/jira/browse/SPARK-16281

When a URL contains an internationalized domain name (IDN), for example:

val url = "http://правительство.рф" // Punycode form: http://xn--80aealotwbjpid2k.xn--p1ai

Spark's parse_url returns null, whereas Hive's version of parse_url works
fine.

On digging further, I found that the difference lies in the following call in Spark:

private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}


while Hive uses java.net.URL:

url = new URL(urlStr)


Sure enough, this simple test demonstrates URL works but URI does not in
this case:

import java.net.{URI, URL}

val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost")     // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
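
A note on why the host comes back as null rather than the constructor failing: as far as I can tell from the java.net.URI documentation, an authority that cannot be parsed as server-based (user-info, host, port) is silently treated as registry-based, and getHost is then undefined. Calling parseServerAuthority() forces the strict parse and surfaces the underlying error. The sketch below illustrates this; the object and method names are made up for the example and are not part of Spark.

```scala
import java.net.{URI, URISyntaxException}

// Illustrative only: UriAuthorityCheck and serverAuthorityError are
// hypothetical names, not Spark or Hive code.
object UriAuthorityCheck {

  // Returns the parser's complaint if the authority cannot be read as
  // user-info/host/port, or None if it parses as a server-based authority.
  def serverAuthorityError(raw: String): Option[String] =
    try {
      new URI(raw).parseServerAuthority()
      None
    } catch {
      case e: URISyntaxException => Some(e.getMessage)
    }

  def main(args: Array[String]): Unit = {
    val raw = "http://правительство.рф"
    val uri = new URI(raw) // construction succeeds, no exception here

    println(s"host      = ${uri.getHost}")      // the value the test above reports as null
    println(s"authority = ${uri.getAuthority}") // the authority text as URI kept it
    println(s"error     = ${serverAuthorityError(raw)}")
  }
}
```

If this reading is right, the catch block in getUrl is never reached for such URLs; the null comes from getHost on a successfully constructed URI.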


To reproduce the problem in spark-sql:

spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
returns NULL
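
One possible workaround, pending a decision on the URI vs. URL question, is to convert the host to its ASCII (Punycode) form before the URL reaches parse_url. The following is only a sketch under assumptions: toAsciiUrl is a hypothetical helper (not part of Spark or Hive), it uses java.net.URL for the lenient parse and java.net.IDN.toASCII for the conversion, and it naively rewrites the first occurrence of the host in the string, which may be wrong for URLs that repeat the host in the user-info part.

```scala
import java.net.{IDN, MalformedURLException, URI, URL}

// Hypothetical helper for illustration; not part of Spark or Hive.
object IdnHost {

  // Rewrites the host of `raw` to its ASCII (Punycode) form so that
  // java.net.URI can recognise it as a server-based authority.
  // Returns None if the URL cannot be parsed or has no host.
  def toAsciiUrl(raw: String): Option[String] =
    try {
      val host = new URL(raw).getHost
      val at = if (host == null || host.isEmpty) -1 else raw.indexOf(host)
      if (at < 0) None
      else Some(raw.substring(0, at) + IDN.toASCII(host) + raw.substring(at + host.length))
    } catch {
      case _: MalformedURLException    => None // e.g. missing or unknown protocol
      case _: IllegalArgumentException => None // host rejected by IDN.toASCII
    }

  def main(args: Array[String]): Unit = {
    val raw = "http://правительство.рф"
    val ascii = toAsciiUrl(raw)
    println(s"ascii url = $ascii")
    println(s"uri host  = ${ascii.map(u => new URI(u).getHost)}")
  }
}
```

Note that the host returned this way is the Punycode form, not the original Unicode name, so it differs from what Hive's parse_url returns for the same input.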

Could someone please explain the reason for using URI instead of URL? Does
this problem warrant creating a JIRA ticket?


Best Regards
Yash

-- 
When events unfold with calm and ease
When the winds that blow are merely breeze
Learn from nature, from birds and bees
Live your life in love, and let joy not cease.
