RE: Running Spark/YARN on AWS EMR - Issues finding file on hdfs?

2015-05-16 Thread jaredtims
Any resolution to this? I'm having the same problem.






Re: zip files submitted with --py-files disappear from hdfs after a while on EMR

2015-05-16 Thread jaredtims
Any resolution to this? I am having the same problem.






Re: Difference among batchDuration, windowDuration, slideDuration

2015-03-18 Thread jaredtims
I think hsy541 is still confused by the same thing that is confusing me.
Namely, what interval is the sentence "Each RDD in a DStream contains data
from a certain interval" referring to? The sentence is from the Discretized
Streams section of the programming guide:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
The example there makes it seem like the batchDuration is 4 seconds, yet this
mystery interval is 1 second. Where is this mystery interval defined? Or am I
missing something altogether?
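
To make the question concrete, here is a minimal sketch of where I understand
each duration to be set (the host, port, and duration values below are just
placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// batchDuration is set once, on the StreamingContext itself; every RDD in
// every DStream covers exactly one interval of this length.
val conf = new SparkConf().setMaster("local[2]").setAppName("Durations")
val ssc = new StreamingContext(conf, Seconds(1)) // batchDuration = 1 second

val lines = ssc.socketTextStream("localhost", 9999) // placeholder source

// windowDuration (4 s) is how much history each windowed RDD spans;
// slideDuration (2 s) is how often a new windowed RDD is produced.
// Both must be multiples of the batchDuration above.
val windowed = lines.window(Seconds(4), Seconds(2))
windowed.count().print()

ssc.start()
ssc.awaitTermination()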

thanks






spark flume tryOrIOException NoSuchMethodError

2015-03-13 Thread jaredtims
I am trying to process events from a Flume Avro sink, but I keep getting the
same error. I am just running it locally, using Flume's avro-client, with the
following commands to start the job and the client. Since it is a
NoSuchMethodError it seems like it should be a configuration problem, but
everything is there.

Job command:
spark-submit --master local[2] --class com.streaming.SparkFlume
./target/scala-2.10/streaming.jar localhost 

Client command:
flume-ng avro-client --conf $FLUME_BASE/conf -H localhost -p  -F
/etc/passwd
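
For reference, the job itself boils down to roughly the following (a
simplified sketch, assuming FlumeUtils.createStream from the
spark-streaming-flume artifact; the trace below points at that call):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object SparkFlume {
  def main(args: Array[String]): Unit = {
    val host = args(0)       // e.g. localhost
    val port = args(1).toInt // the Avro sink port
    val conf = new SparkConf().setAppName("SparkFlume")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Receiver-based stream: binds an Avro listener on host:port and
    // turns incoming Flume events into a DStream of SparkFlumeEvents.
    val events = FlumeUtils.createStream(ssc, host, port,
      StorageLevel.MEMORY_AND_DISK_SER_2)
    events.map(e => new String(e.event.getBody.array())).print()
    ssc.start()
    ssc.awaitTermination()
  }
}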


The error:

15/03/13 16:55:10 INFO ReceiverTracker: Stream 0 received 0 blocks
15/03/13 16:55:10 INFO JobScheduler: Added jobs for time 142628011 ms
15/03/13 16:55:10 INFO JobScheduler: Starting job streaming job
142628011 ms.0 from job set of time 142628011 ms
15/03/13 16:55:10 INFO DefaultExecutionContext: Starting job:
CallSite(getCallSite at
DStream.scala:294,org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1077)
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:294)
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288)
scala.Option.orElse(Option.scala:257)
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:285)
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:115)
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
scala.util.Try$.apply(Try.scala:161)
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:221)
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:165))
15/03/13 16:55:10 INFO DefaultExecutionContext: Job finished:
CallSite(getCallSite at
DStream.scala:294,org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1077)
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:294)
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288)
scala.Option.orElse(Option.scala:257)
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:285)
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:115)
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
scala.util.Try$.apply(Try.scala:161)
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:221)
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:165)),
took 3.9692E-5 s
15/03/13 16:55:10 INFO JobScheduler: Finished job streaming job
142628011 ms.0 from job set of time 142628011 ms
15/03/13 16:55:10 INFO JobScheduler: Total delay: 0.008 s for time
142628011 ms (execution: 0.003 s)
15/03/13 16:55:10 INFO FilteredRDD: Removing RDD 12 from persistence list
15/03/13 16:55:10 INFO BlockManager: Removing RDD 12
15/03/13 16:55:10 INFO MappedRDD: Removing RDD 11 from persistence list
15/03/13 16:55:10 INFO BlockManager: Removing RDD 11
15/03/13 16:55:10 INFO BlockRDD: Removing RDD 10 from persistence list
15/03/13 16:55:10 INFO BlockManager: Removing RDD 10
15/03/13 16:55:10 INFO FlumeInputDStream: Removing blocks of RDD
BlockRDD[10] at createStream at SparkFlume.scala:45 of time 142628011 ms
15/03/13 16:55:11 INFO