Hi all, I am getting an exception when trying to execute a Spark job that uses the new Phoenix 4.5 Spark connector. The application works fine on my local machine, but fails to run in a cluster environment on top of YARN.
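For context, the save is done with the phoenix-spark RDD support, essentially the pattern from the Phoenix documentation. This is a sketch with placeholder table name, columns, and ZooKeeper quorum, not my exact code:

import org.apache.spark.SparkContext
import org.apache.phoenix.spark._  // adds saveToPhoenix to RDDs of Products

val sc = new SparkContext("local", "phoenix-test")
val dataSet = List((1L, "1"), (2L, "2"), (3L, "3"))

// Placeholder table/columns and ZK quorum; the real job writes its own schema.
sc.parallelize(dataSet)
  .saveToPhoenix(
    "OUTPUT_TABLE",
    Seq("ID", "COL1"),
    zkUrl = Some("zkhost:2181")
  )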
The cluster is a Cloudera CDH 5.4.4 with HBase 1.0.0 and Phoenix 4.5 (Phoenix is installed correctly, as sqlline works without errors). In the pom.xml, only the spark-core jar (version 1.3.0-cdh5.4.4) has scope "provided", while all the other jars have been copied by Maven into the /myapp/lib folder. I include all the dependent libs using the "--jars" option of the spark-submit command (among these libraries there is phoenix-core-xxx.jar, which contains the class PhoenixOutputFormat). This is the command:

spark-submit --class my.JobRunner \
  --master yarn --deploy-mode client \
  --jars `ls -dm /myapp/lib/* | tr -d ' \r\n'` \
  /myapp/mainjar.jar

The /myapp/lib folder contains the Phoenix core lib, which contains the class org.apache.phoenix.mapreduce.PhoenixOutputFormat. But it seems that the driver/executor cannot see it, and I get an exception when I try to save an RDD to Phoenix:

Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.phoenix.mapreduce.PhoenixOutputFormat not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2112)
        at org.apache.hadoop.mapreduce.task.JobContextImpl.getOutputFormatClass(JobContextImpl.java:232)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:971)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:903)
        at org.apache.phoenix.spark.ProductRDDFunctions.saveToPhoenix(ProductRDDFunctions.scala:51)
        at com.mypackage.save(DAOImpl.scala:41)
        at com.mypackage.ProtoStreamingJob.execute(ProtoStreamingJob.scala:58)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.mypackage.SparkApplication.sparkRun(SparkApplication.scala:95)
        at com.mypackage.SparkApplication$delayedInit$body.apply(SparkApplication.scala:112)
        at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
        at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
        at scala.App$$anonfun$main$1.apply(App.scala:71)
        at scala.App$$anonfun$main$1.apply(App.scala:71)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
        at scala.App$class.main(App.scala:71)
        at com.mypackage.SparkApplication.main(SparkApplication.scala:15)
        at com.mypackage.ProtoStreamingJobRunner.main(ProtoStreamingJob.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class org.apache.phoenix.mapreduce.PhoenixOutputFormat not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2018)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2110)
        ... 30 more

The phoenix-core-xxx.jar is definitely on the classpath: I am sure of this because I tried to instantiate an object of class PhoenixOutputFormat directly in the main class, and it worked. The problem is that the method "org.apache.hadoop.conf.Configuration.getClassByName" cannot find it. Since I am using the "client" deploy mode, the exception should have been thrown by the driver on the local machine. How can this happen?
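For reference, below is a rough sketch of the kind of check I mean (illustrative only, placeholder object name, not my exact code): the direct lookup through the application class loader succeeds, while the same lookup through a Hadoop Configuration is what fails in my job.

import org.apache.hadoop.conf.Configuration

object ClasspathCheck {
  def main(args: Array[String]): Unit = {
    // Loads fine: the jar passed via --jars is visible to this class loader.
    val direct = Class.forName("org.apache.phoenix.mapreduce.PhoenixOutputFormat")
    println(s"Direct lookup OK, loaded by: ${direct.getClassLoader}")

    // Configuration.getClassByName resolves through the Configuration's own
    // class loader (by default the thread context class loader), which is
    // where the ClassNotFoundException is raised in my job.
    val conf = new Configuration()
    println(s"Context loader: ${Thread.currentThread().getContextClassLoader}")
    val viaConf = conf.getClassByName("org.apache.phoenix.mapreduce.PhoenixOutputFormat")
    println(s"Configuration lookup OK: $viaConf")
  }
}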