[ https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong Shen updated SPARK-18910:
------------------------------
    Description: 
When I create a UDF whose jar file is in HDFS, I can't use the UDF.
<code>
spark-sql> create function trans_array as 'com.test.udf.TransArray'  using jar 
'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';

spark-sql> describe function trans_array;
Function: test_db.trans_array
Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
Usage: N/A.
Time taken: 0.127 seconds, Fetched 3 row(s)

spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from 
test_spark limit 10;
Error in query: Undefined function: 'trans_array'. This function is neither a 
registered temporary function nor a permanent function registered in the 
database 'test_db'.; line 1 pos 7
</code>

The reason is that when 
org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource 
is called, uri.toURL throws an exception: "unknown protocol: hdfs"
<code>
  def addJar(path: String): Unit = {
    sparkSession.sparkContext.addJar(path)

    val uri = new Path(path).toUri
    val jarURL = if (uri.getScheme == null) {
      // `path` is a local file path without a URL scheme
      new File(path).toURI.toURL
    } else {
      // `path` is a URL with a scheme
      {color:red}uri.toURL{color}
    }
    jarClassLoader.addURL(jarURL)
    Thread.currentThread().setContextClassLoader(jarClassLoader)
  }
</code>
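The failure is easy to reproduce outside Spark: a plain JVM has no URLStreamHandler registered for the hdfs scheme, so constructing such a URL throws immediately. A minimal sketch (the jar path is illustrative, not a real one):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class HdfsUrlDemo {
    // Returns the exception message raised when constructing an hdfs:// URL
    // on a JVM with no stream handler registered for that scheme.
    static String tryHdfsUrl() {
        try {
            new URL("hdfs://host1:9000/spark/dev/share/libs/some.jar");
            return "no exception";
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryHdfsUrl()); // unknown protocol: hdfs
    }
}
```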

I think we should call URL.setURLStreamHandlerFactory with an instance of 
FsUrlStreamHandlerFactory, like:
<code>
static {
        // This method can be called at most once in a given JVM.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}
</code>
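To illustrate the mechanism with standard-JDK code only, the sketch below registers a factory once per JVM, after which hdfs:// URLs become constructible. A stub handler stands in for Hadoop's FsUrlStreamHandlerFactory, which would resolve HDFS paths for real; the class and jar path are hypothetical:

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class FactoryDemo {
    static {
        // This method can be called at most once in a given JVM.
        // In Spark one would pass Hadoop's FsUrlStreamHandlerFactory instead
        // of this stub, which only makes the scheme parseable.
        URL.setURLStreamHandlerFactory(protocol -> {
            if ("hdfs".equals(protocol)) {
                return new URLStreamHandler() {
                    @Override
                    protected URLConnection openConnection(URL u) throws IOException {
                        throw new IOException("stub handler: no real HDFS access");
                    }
                };
            }
            return null; // defer to the JDK's built-in handlers (http, file, ...)
        });
    }

    // With the factory registered, constructing an hdfs:// URL succeeds
    // instead of throwing MalformedURLException.
    static URL buildHdfsUrl() throws MalformedURLException {
        return new URL("hdfs://host1:9000/spark/dev/share/libs/some.jar");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildHdfsUrl().getProtocol()); // hdfs
    }
}
```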


> Can't use UDF that jar file in hdfs
> -----------------------------------
>
>                 Key: SPARK-18910
>                 URL: https://issues.apache.org/jira/browse/SPARK-18910
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Hong Shen
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
