spark-0.9.1 compiled with Hadoop 2.3.0 doesn't work with S3?

2014-04-21 Thread Nan Zhu
Hi, all  

I’m writing a Spark application to load S3 data to HDFS,  

the HDFS version is 2.3.0, so I have to compile Spark with Hadoop 2.3.0

after I execute

val allfiles = sc.textFile("s3n://abc/*.txt")

val output = allfiles.saveAsTextFile("hdfs://x.x.x.x:9000/dataset")
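(For reference, with the s3n:// scheme the AWS credentials also have to reach the Hadoop configuration unless they are embedded in the URI; a minimal sketch, assuming the standard Hadoop s3n property names, with placeholder key values:

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")       // placeholder
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")   // placeholder
)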

Spark throws an exception (actually related to Hadoop?):

java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException  
at org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:100)
at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:90)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:891)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900)
at $iwC$$iwC$$iwC$$iwC.&lt;init&gt;(&lt;console&gt;:14)
at $iwC$$iwC$$iwC.&lt;init&gt;(&lt;console&gt;:19)
at $iwC$$iwC.&lt;init&gt;(&lt;console&gt;:21)
at $iwC.&lt;init&gt;(&lt;console&gt;:23)
at &lt;init&gt;(&lt;console&gt;:25)
at .&lt;init&gt;(&lt;console&gt;:29)
at .&lt;clinit&gt;(&lt;console&gt;)
at .&lt;init&gt;(&lt;console&gt;:7)
at .&lt;clinit&gt;(&lt;console&gt;)
at $print(&lt;console&gt;)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:793)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:838)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:750)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:598)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:605)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:608)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:931)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:881)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:881)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:881)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:973)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
Caused by: java.lang.ClassNotFoundException: org.jets3t.service.ServiceException
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 63 more
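(A quick way to confirm that jets3t is genuinely missing from the driver classpath is to try loading the class directly in the same shell; if the jar is absent this throws the same ClassNotFoundException:

// sketch: probe the driver classpath for the missing jets3t class
Class.forName("org.jets3t.service.ServiceException")
)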



Has anyone else met a similar problem?

Best,  

--  
Nan Zhu




Re: spark-0.9.1 compiled with Hadoop 2.3.0 doesn't work with S3?

2014-04-21 Thread Parviz Deyhim
I ran into the same issue. The problem seems to be with the jets3t library
that Spark uses in project/SparkBuild.scala.

change this:

"net.java.dev.jets3t" % "jets3t" % "0.7.1"

to

"net.java.dev.jets3t" % "jets3t" % "0.9.0"

0.7.1 is not the right version of jets3t for Hadoop 2.3.0
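In sbt syntax the fixed dependency would look roughly like this (a sketch; the exact enclosing settings block in project/SparkBuild.scala may differ):

// bump the jets3t dependency so it matches what Hadoop 2.3.0 links against
libraryDependencies += "net.java.dev.jets3t" % "jets3t" % "0.9.0"

The assembly then has to be rebuilt, e.g. with something like SPARK_HADOOP_VERSION=2.3.0 sbt/sbt assembly, per the 0.9 build docs.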


On Mon, Apr 21, 2014 at 11:30 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

 Hi, all

 I’m writing a Spark application to load S3 data to HDFS,

 the HDFS version is 2.3.0, so I have to compile Spark with Hadoop 2.3.0

 after I execute

 val allfiles = sc.textFile("s3n://abc/*.txt")

 val output = allfiles.saveAsTextFile("hdfs://x.x.x.x:9000/dataset")

 Spark throws an exception (actually related to Hadoop?):

 [quoted stack trace elided; identical to the one in the original message above]

Re: spark-0.9.1 compiled with Hadoop 2.3.0 doesn't work with S3?

2014-04-21 Thread Nan Zhu
Yes, I fixed it in the same way, but didn’t get a chance to get back here.

I also made a PR: https://github.com/apache/spark/pull/468

Best,  

--  
Nan Zhu


On Monday, April 21, 2014 at 8:19 PM, Parviz Deyhim wrote:

 I ran into the same issue. The problem seems to be with the jets3t library 
 that Spark uses in project/SparkBuild.scala.  
  
 change this:  
  
 "net.java.dev.jets3t" % "jets3t" % "0.7.1"
  
 to  
  
 "net.java.dev.jets3t" % "jets3t" % "0.9.0"
  
 0.7.1 is not the right version of jets3t for Hadoop 2.3.0
  
  
 On Mon, Apr 21, 2014 at 11:30 AM, Nan Zhu zhunanmcg...@gmail.com wrote:
  Hi, all

  I’m writing a Spark application to load S3 data to HDFS,

  the HDFS version is 2.3.0, so I have to compile Spark with Hadoop 2.3.0

  [quoted code and stack trace elided; identical to the original message above]