Hi devs,

I stumbled across an interesting problem with the parse_url function that was implemented in Spark in https://issues.apache.org/jira/browse/SPARK-16281

When using internationalized domain names in URLs, like:

    val url = "http://правительство.рф"

parse_url returns null, but it works fine when using Hive's version of parse_url.

On digging further, I found that the difference is in the call below in Spark:

    private def getUrl(url: UTF8String): URI = {
      try {
        new URI(url.toString)
      } catch {
        case e: URISyntaxException => null
      }
    }

while Hive uses java.net.URL:

    url = new URL(urlStr)

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

    import java.net.{URI, URL}

    val url = "http://правительство.рф"
    val uriHost = new URI(url).getHost
    val urlHost = new URL(url).getHost
    println(s"uriHost = $uriHost") // prints uriHost = null
    println(s"urlHost = $urlHost") // prints urlHost = правительство.рф

To reproduce the problem in spark-sql:

    spark-sql> select parse_url('http://千夏ともか.test', 'HOST');

This returns NULL.

Could someone please explain the reason for using URI instead of URL? Does this problem warrant creating a JIRA ticket?

Best Regards
Yash

--
When events unfold with calm and ease
When the winds that blow are merely breeze
Learn from nature, from birds and bees
Live your life in love, and let joy not cease.
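
P.S. In case it helps the discussion, here is a minimal sketch of one possible fallback, not a proposed patch. It assumes that trying java.net.URI first and falling back to java.net.URL when the host comes back null is acceptable behaviour. The names IdnHost, hostOf and asciiHostOf are hypothetical and do not exist in Spark or Hive. java.net.IDN.toASCII is used only to show how the host could be converted to its ASCII (Punycode) form, which URI can parse as a server-based host.

```scala
import java.net.{IDN, URI, URL}
import scala.util.Try

// Hypothetical helper for illustration only; not part of Spark or Hive.
object IdnHost {

  // Try URI first (what Spark's getUrl does); if the host is null or the
  // URI cannot be parsed, fall back to URL (what Hive does).
  def hostOf(url: String): Option[String] = {
    val viaUri = Try(new URI(url)).toOption.flatMap(u => Option(u.getHost))
    viaUri.orElse(
      Try(new URL(url)).toOption
        .flatMap(u => Option(u.getHost))
        .filter(_.nonEmpty) // URL.getHost returns "" when there is no host
    )
  }

  // Convert the extracted host to its ASCII-compatible encoding.
  // IDN.toASCII throws IllegalArgumentException on invalid input, so wrap it.
  def asciiHostOf(url: String): Option[String] =
    hostOf(url).flatMap(h => Try(IDN.toASCII(h)).toOption)
}
```

With this sketch, IdnHost.hostOf("http://правительство.рф") goes through the URL fallback because URI reports a null host, while plain ASCII URLs are still answered by URI. Whether the Unicode or the ASCII form should be returned by parse_url is an open question I have not looked into.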
