ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Nicholas Chammas
Just launched an EC2 cluster from git hash
9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD
accessing data in S3 yields the following error output.
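
For reference, the failing call is essentially of this shape in spark-shell (simplified; the bucket name and key prefix are placeholders, not the real paths):

    // In spark-shell on the cluster; "my-bucket" and "some/prefix" are placeholders.
    val lines = sc.textFile("s3n://my-bucket/some/prefix/*")
    lines.take(5).foreach(println)   // this take() is what triggers the error below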

I understand that NoClassDefFoundError errors may mean something in the
deployment was messed up. Is that correct? When I launch a cluster using
spark-ec2, I expect all critical deployment details to be taken care of by
the script.

So is something in the deployment executed by spark-ec2 borked?

Nick

java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:224)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:214)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:176)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
    at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:71)
    at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:79)
    at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
    at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:188)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.dependencies(RDD.scala:188)
    at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1144)
    at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:903)
    at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:174)
    at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191)
    at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:190)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
    at org.apache.spark.rdd.PartitionCoalescer$LocationIterator.<init>(CoalescedRDD.scala:185)
    at org.apache.spark.rdd.PartitionCoalescer.setupGroups(CoalescedRDD.scala:236)
    at org.apache.spark.rdd.PartitionCoalescer.run(CoalescedRDD.scala:337)
    at org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:83)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1036)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
    at $iwC$$iwC$$iwC.<init>(<console>:31)
    at $iwC$$iwC.<init>(<console>:33)
    at $iwC.<init>(<console>:35)
    at <init>(<console>:37)
    at .<init>(<console>:41)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Aaron Davidson
This one is typically due to a mismatch between the Hadoop versions --
i.e., Spark is compiled against 1.0.4 but is running with 2.3.0 in the
classpath, or something like that. Not certain why you're seeing this with
spark-ec2, but I'm assuming this is related to the issues you posted in a
separate thread.
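
One quick sanity check from spark-shell on the cluster (just a diagnostic sketch, nothing spark-ec2 specific) is to print the Hadoop version the running JVM actually picked up and compare it with the version the assembly was built against:

    // Hadoop version visible on the driver's classpath.
    println("Hadoop version: " + org.apache.hadoop.util.VersionInfo.getVersion)

If those two disagree, that would support the mismatch theory.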


On Mon, Jul 14, 2014 at 6:43 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Just launched an EC2 cluster from git hash
 9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD
 accessing data in S3 yields the following error output.

 I understand that NoClassDefFoundError errors may mean something in the
 deployment was messed up. Is that correct? When I launch a cluster using
 spark-ec2, I expect all critical deployment details to be taken care of by
 the script.

 So is something in the deployment executed by spark-ec2 borked?

 Nick

 java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
 [rest of stack trace snipped -- identical to the one in the original message above]

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Shivaram Venkataraman
My guess is that this is related to
https://issues.apache.org/jira/browse/SPARK-2471 where the S3 library gets
excluded from the SBT assembly jar. I am not sure if the assembly jar used
in EC2 is generated using SBT though.
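
If someone has one of these clusters up, a quick check from spark-shell would tell us whether the jets3t class from the stack trace made it into the assembly at all (just a sketch):

    // Is the class named in the NoClassDefFoundError on the classpath, and if so, from which jar?
    scala.util.Try(Class.forName("org.jets3t.service.S3ServiceException")) match {
      case scala.util.Success(c) =>
        val src = Option(c.getProtectionDomain.getCodeSource).map(_.getLocation).getOrElse("<unknown>")
        println(s"jets3t is present, loaded from $src")
      case scala.util.Failure(e) =>
        println(s"jets3t is missing: $e")
    }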

Shivaram


On Mon, Jul 14, 2014 at 10:02 PM, Aaron Davidson ilike...@gmail.com wrote:

 This one is typically due to a mismatch between the Hadoop versions --
 i.e., Spark is compiled against 1.0.4 but is running with 2.3.0 in the
 classpath, or something like that. Not certain why you're seeing this with
 spark-ec2, but I'm assuming this is related to the issues you posted in a
 separate thread.


 On Mon, Jul 14, 2014 at 6:43 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  Just launched an EC2 cluster from git hash
  9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD
  accessing data in S3 yields the following error output.
 
  I understand that NoClassDefFoundError errors may mean something in the
  deployment was messed up. Is that correct? When I launch a cluster using
  spark-ec2, I expect all critical deployment details to be taken care of by
  the script.
 
  So is something in the deployment executed by spark-ec2 borked?
 
  Nick
 
  java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
  [rest of stack trace snipped -- identical to the one in the original message above]

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Patrick Wendell
Yeah - this is likely caused by SPARK-2471.

On Mon, Jul 14, 2014 at 10:11 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
 My guess is that this is related to
 https://issues.apache.org/jira/browse/SPARK-2471 where the S3 library gets
 excluded from the SBT assembly jar. I am not sure if the assembly jar used
 in EC2 is generated using SBT though.

 Shivaram


 On Mon, Jul 14, 2014 at 10:02 PM, Aaron Davidson ilike...@gmail.com wrote:

 This one is typically due to a mismatch between the Hadoop versions --
 i.e., Spark is compiled against 1.0.4 but is running with 2.3.0 in the
 classpath, or something like that. Not certain why you're seeing this with
 spark-ec2, but I'm assuming this is related to the issues you posted in a
 separate thread.


 On Mon, Jul 14, 2014 at 6:43 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  Just launched an EC2 cluster from git hash
  9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD
  accessing data in S3 yields the following error output.
 
  I understand that NoClassDefFoundError errors may mean something in the
  deployment was messed up. Is that correct? When I launch a cluster using
  spark-ec2, I expect all critical deployment details to be taken care of by
  the script.
 
  So is something in the deployment executed by spark-ec2 borked?
 
  Nick
 
  java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
  [rest of stack trace snipped -- identical to the one in the original message above]

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Nicholas Chammas
Okie doke--added myself as a watcher on that issue.

On a related note, what are the thoughts on automatically spinning up/down
EC2 clusters and running tests against them? It would probably be way too
cumbersome to do that for every build, but perhaps on some schedule it
could help validate that we are still deploying EC2 clusters correctly.

Would something like that be valuable?
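
To make that concrete, the scheduled job could be little more than a wrapper around spark-ec2 itself -- something along these lines (cluster name, keypair, and the actual smoke test are placeholders, and spark-ec2's destroy action prompts for confirmation, so treat this purely as a sketch):

    import scala.sys.process._

    // Hypothetical nightly smoke test: launch a throwaway cluster, run a check, tear it down.
    val cluster = "spark-nightly-smoke"                       // placeholder name
    val common  = Seq("./ec2/spark-ec2", "-k", "my-keypair", "-i", "my-keypair.pem")
    try {
      val exit = (common ++ Seq("-s", "1", "launch", cluster)).!   // 0 on success
      assert(exit == 0, s"spark-ec2 launch failed for $cluster")
      // ...ssh in and run a small S3-backed take() job here, failing loudly if it throws...
    } finally {
      (common ++ Seq("destroy", cluster)).!                   // always clean up
    }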

Nick


On Tue, Jul 15, 2014 at 1:19 AM, Patrick Wendell pwend...@gmail.com wrote:

 Yeah - this is likely caused by SPARK-2471.

 On Mon, Jul 14, 2014 at 10:11 PM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  My guess is that this is related to
  https://issues.apache.org/jira/browse/SPARK-2471 where the S3 library gets
  excluded from the SBT assembly jar. I am not sure if the assembly jar used
  in EC2 is generated using SBT though.
 
  Shivaram
 
 
  On Mon, Jul 14, 2014 at 10:02 PM, Aaron Davidson ilike...@gmail.com
 wrote:
 
  This one is typically due to a mismatch between the Hadoop versions --
  i.e., Spark is compiled against 1.0.4 but is running with 2.3.0 in the
  classpath, or something like that. Not certain why you're seeing this with
  spark-ec2, but I'm assuming this is related to the issues you posted in a
  separate thread.
 
 
  On Mon, Jul 14, 2014 at 6:43 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
   Just launched an EC2 cluster from git hash
   9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD
   accessing data in S3 yields the following error output.
  
   I understand that NoClassDefFoundError errors may mean something in the
   deployment was messed up. Is that correct? When I launch a cluster using
   spark-ec2, I expect all critical deployment details to be taken care of by
   the script.
  
   So is something in the deployment executed by spark-ec2 borked?
  
   Nick
  
   java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
   [rest of stack trace snipped -- identical to the one in the original message above]