Staffan Arvidsson created SPARK-5350:
----------------------------------------

             Summary: There are issues when combining Spark and CDK 
(https://github.com/egonw/cdk). 
                 Key: SPARK-5350
                 URL: https://issues.apache.org/jira/browse/SPARK-5350
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.2.0, 1.1.1
         Environment: Running Spark locally, on both Mac OS X and a Linux 
Ubuntu VM.
            Reporter: Staffan Arvidsson


I'm using Maven and Eclipse to build my project. When I import the CDK 
(https://github.com/egonw/cdk) jar files that I need, set up the 
SparkContext, and try, for instance, reading a file (simply "val lines = 
sc.textFile(filePath)"), I get the following errors in the log:
{quote}
[main] DEBUG org.apache.spark.rdd.HadoopRDD  - SplitLocationInfo and other new 
Hadoop classes are unavailable. Using the older Hadoop location info code.
java.lang.ClassNotFoundException: 
org.apache.hadoop.mapred.InputSplitWithLocationInfo
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:191)
        at 
org.apache.spark.rdd.HadoopRDD$SplitInfoReflections.<init>(HadoopRDD.scala:381)
        at org.apache.spark.rdd.HadoopRDD$.liftedTree1$1(HadoopRDD.scala:391)
        at org.apache.spark.rdd.HadoopRDD$.<init>(HadoopRDD.scala:390)
        at org.apache.spark.rdd.HadoopRDD$.<clinit>(HadoopRDD.scala)
        at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:159)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
        at org.apache.spark.rdd.RDD.foreach(RDD.scala:765)
{quote}
Later in the log: 
{quote}
[Executor task launch worker-0] DEBUG org.apache.spark.deploy.SparkHadoopUtil  
- Couldn't find method for retrieving thread-level FileSystem input data
java.lang.NoSuchMethodException: 
org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()
        at java.lang.Class.getDeclaredMethod(Class.java:2009)
        at org.apache.spark.util.Utils$.invoke(Utils.scala:1733)
        at 
org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178)
        at 
org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178)
        at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at 
org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178)
        at 
org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:138)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:220)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{quote}
There have also been issues related to "HADOOP_HOME" not being set, but 
these seem to be intermittent and only occur sometimes. 

After testing different versions of both CDK and Spark, I've found that 
Spark 0.9.1 seems to get things working. This does not solve my problem, 
though, as I will later need functionality from MLlib that is only available 
in newer versions of Spark.
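
For reference, here is a minimal sketch of the setup that produces the errors 
above. The master URL, application name, and file path are placeholders, and 
the CDK jars are assumed to be on the classpath through the Maven build:
{code}
// Minimal reproduction sketch (master URL, app name, and path are placeholders).
import org.apache.spark.{SparkConf, SparkContext}

object TextFileRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")        // running locally, as in this report
      .setAppName("cdk-spark-repro")
    val sc = new SparkContext(conf)

    // Simply reading a text file and running an action is enough to hit the
    // DEBUG-level ClassNotFoundException / NoSuchMethodException in the log.
    val filePath = "/path/to/input.txt"   // placeholder
    val lines = sc.textFile(filePath)
    lines.foreach(println)

    sc.stop()
  }
}
{code}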


