Spark Streaming w/ tshark exception problem on EC2

2014-07-15 Thread Gianluca Privitera
Hi,
I’ve got a problem with Spark Streaming and tshark.
When I run this code locally I have no problems, but when I run it on an EC2 cluster I get the exception shown just below the code.

// assumes: import scala.sys.process._ and import scala.collection.mutable.MutableList
def dissection(s: String): Seq[String] = {
  try {
    Process("hadoop ... ./localcopy.tmp").!       // calls hadoop to copy a file from S3 locally (actual command elided)
    val pb = Process("tshark ... localcopy.tmp")  // calls tshark to turn the local copy into a sequence of strings
    pb.lines_!.toSeq
  } catch {
    case e: Exception =>
      System.err.println("ERROR")
      new MutableList[String]()
  }
}

(Main.scala line 2051, referenced in the trace below, points to the function “dissection”.)

WARN scheduler.TaskSetManager: Loss was due to java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
at Main$$anonfun$11.apply(Main.scala:2051)
at Main$$anonfun$11.apply(Main.scala:2051)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
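
One thing to note about the trace: java.lang.ExceptionInInitializerError extends Error, not Exception, so the catch clause inside dissection would not intercept it even if it were thrown there, and the underlying failure is only reachable through its getCause(). A rough, untested sketch of surfacing that cause in the executor logs, assuming dissection is applied through something like a map over the stream (myStream is a made-up name):

    // hypothetical wrapper around the call site: catch the Error at the point
    // where the closure first touches the enclosing statics and print the
    // wrapped cause, which shows which initializer actually failed
    myStream.map { s =>
      try dissection(s)
      catch {
        case e: ExceptionInInitializerError =>
          System.err.println("static initialization failed:")
          e.getCause.printStackTrace()
          Seq.empty[String]
      }
    }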

Has anyone got an idea why that may happen? I’m pretty sure that the hadoop 
call works perfectly.

Thanks
Gianluca


Re: Spark Streaming w/ tshark exception problem on EC2

2014-07-15 Thread Tathagata Das
A quick Google search for that exception says it occurs when there is an
error during static initialization. It could be some issue related to how
dissection is defined. Maybe try putting the function in a separate object
(static class) that is unrelated to the Main class, which may have other
static initialization going on; see the sketch below. It's hard to say much
more without more context.
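
Something along these lines is what I mean (an untested sketch; Dissector is a made-up name and the shell commands are the elided ones from your snippet):

    import scala.sys.process._

    // Standalone object: the only static initialization an executor runs when
    // the closure first calls dissection is this object's own (empty) one,
    // rather than whatever top-level state Main sets up.
    object Dissector {
      def dissection(s: String): Seq[String] =
        try {
          Process("hadoop ... ./localcopy.tmp").!            // same hadoop copy command as before (elided)
          Process("tshark ... localcopy.tmp").lines_!.toSeq  // same tshark command as before (elided)
        } catch {
          case e: Exception =>
            System.err.println("ERROR in dissection: " + e)
            Seq.empty[String]
        }
    }

and then call Dissector.dissection(...) from the closure instead of the method on Main. If the ExceptionInInitializerError goes away, the failing initializer was something in Main that only breaks when it runs on the executors.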

TD

