Re: Streaming scheduling delay
On Fri, Feb 13, 2015 at 2:21 AM, Gerard Maas gerard.m...@gmail.com wrote: KafkaOutputServicePool Could you please give example code of what a KafkaOutputServicePool would look like? When I tried object pooling, I ended up with various NotSerializableExceptions. Thanks! Josh
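A minimal sketch of what such a pool could look like, assuming the Kafka 0.8 Scala producer API (the object and method names here are illustrative, not an established library). Keeping the producer inside an object means it is created lazily, once per executor JVM, and is never shipped inside a closure, which is the usual source of the NotSerializableException:

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    object KafkaProducerPool {
      // Created on first use in each executor JVM; never serialized.
      private lazy val producer: Producer[String, String] = {
        val props = new Properties()
        props.put("metadata.broker.list", "localhost:9092") // illustrative broker address
        props.put("serializer.class", "kafka.serializer.StringEncoder")
        new Producer[String, String](new ProducerConfig(props))
      }
      def send(topic: String, msg: String): Unit =
        producer.send(new KeyedMessage[String, String](topic, msg))
    }

Usage would then be, e.g., dstream.foreachRDD(rdd => rdd.foreachPartition(_.foreach(m => KafkaProducerPool.send("topic", m)))).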
Re: throughput in the web console?
Let me ask it like this: what would be the easiest way to display the throughput in the web console? Would I need to create a new tab and add the metrics? Any good or simple examples showing how this can be done? On Wed, Feb 25, 2015 at 12:07 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you have a look at https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.scheduler.SparkListener And for Streaming: https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener Thanks Best Regards On Tue, Feb 24, 2015 at 10:29 PM, Josh J joshjd...@gmail.com wrote: Hi, I plan to run a parameter search varying the number of cores, epoch, and parallelism. The web console provides a way to archive previous runs, but is there a way to view the throughput in the console, rather than logging the throughput separately and correlating the log files with the web console processing times? Thanks, Josh
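For the throughput itself, a StreamingListener can compute it per batch without touching the UI code. A minimal sketch, assuming a Spark version whose BatchInfo exposes numRecords and processingDelay (check the API docs for your version):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class ThroughputListener extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        // processingDelay is the batch's processing time in milliseconds
        for (ms <- info.processingDelay if ms > 0) {
          val recsPerSec = info.numRecords * 1000.0 / ms
          println(s"Batch ${info.batchTime}: ${info.numRecords} records, ~$recsPerSec records/sec")
        }
      }
    }

Register it with ssc.addStreamingListener(new ThroughputListener) before starting the context.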
Re: throughput in the web console?
On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das ak...@sigmoidanalytics.com wrote: For Spark Streaming applications, there is already a tab called Streaming which displays the basic statistics. Would I just need to extend this tab to add the throughput?
throughput in the web console?
Hi, I plan to run a parameter search varying the number of cores, epoch, and parallelism. The web console provides a way to archive previous runs, but is there a way to view the throughput in the console, rather than logging the throughput separately and correlating the log files with the web console processing times? Thanks, Josh
measuring time taken in map, reduceByKey, filter, flatMap
Hi, I have a streaming pipeline that invokes map, reduceByKey, filter, and flatMap. How can I measure the time taken in each stage? Thanks, Josh
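Spark only reports timings at stage granularity, and narrow transformations such as map, filter, and flatMap are pipelined into a single stage (reduceByKey introduces a stage boundary), so per-transformation times are not directly exposed. A minimal sketch of stage-level timing with a SparkListener:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

    class StageTimingListener extends SparkListener {
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
        val info = stage.stageInfo
        // Both timestamps are Options; present once the stage has actually run
        for (start <- info.submissionTime; end <- info.completionTime)
          println(s"Stage ${info.stageId} (${info.name}) took ${end - start} ms")
      }
    }

Register it with sc.addSparkListener(new StageTimingListener). To time a single transformation more precisely, one could also wrap its body with System.nanoTime() inside mapPartitions and accumulate the results.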
Re: dockerized spark executor on mesos?
We have dockerized Spark Master and worker(s) separately and are using it in our dev environment. Is this setup available on GitHub or Docker Hub? On Tue, Dec 9, 2014 at 3:50 PM, Venkat Subramanian vsubr...@gmail.com wrote: We have dockerized Spark Master and worker(s) separately and are using it in our dev environment. We don't use Mesos though, running it in Standalone mode, but adding Mesos should not be that difficult I think. Regards Venkat
spark standalone master with workers on two nodes
Hi, I'm trying to run Spark Streaming standalone on two nodes. I'm able to run on a single node fine. I start both workers and they register in the Spark UI. However, the application logs: SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 Any ideas? Thanks, Josh
sample is not a member of org.apache.spark.streaming.dstream.DStream
Hi, I'm trying to use sampling with Spark Streaming. I imported the following: import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.SparkContext._ I then call sample: val streamtoread = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_AND_DISK).map(_._2) streamtoread.sample(withReplacement = true, fraction = fraction) How do I use the sample() method (http://spark.apache.org/docs/latest/programming-guide.html#transformations) with Spark Streaming? Thanks, Josh
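DStream has no sample() method; sample() is an RDD transformation. One way (a minimal sketch) is to use transform(), which applies an arbitrary RDD-to-RDD function to every batch; the fraction value here is illustrative:

    val sampled = streamtoread.transform(rdd =>
      rdd.sample(withReplacement = true, fraction = 0.1))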
kafka pipeline exactly once semantics
Hi, In the Spark docs (http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node) it mentions: However, output operations (like foreachRDD) have *at-least once* semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure. I would like to set up a Kafka pipeline whereby I write my data to topic 1, continue processing with Spark Streaming, write the transformed results to topic 2, and finally read the results from topic 2. How do I configure Spark Streaming so that I can maintain exactly-once semantics when writing to topic 2? Thanks, Josh
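Spark Streaming itself only gives at-least-once output semantics, so one common workaround (a sketch of a general pattern, not a framework feature) is to make the write idempotent: attach a deterministic key derived from the batch time and the element's position, so a replayed write produces the same key and consumers of topic 2 can de-duplicate. All names here are illustrative:

    transformed.foreachRDD { (rdd, batchTime) =>
      rdd.mapPartitionsWithIndex { (partId, iter) =>
        iter.zipWithIndex.map { case (record, i) =>
          // Deterministic across replays, as long as partition contents and
          // order are themselves deterministic for a given batch
          (s"${batchTime.milliseconds}-$partId-$i", record)
        }
      }.foreachPartition { part =>
        part.foreach { case (key, record) =>
          // send (key, record) to topic 2 with a keyed producer here;
          // readers of topic 2 drop keys they have already seen
        }
      }
    }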
Re: Publishing a transformed DStream to Kafka
Is there a way to do this that preserves exactly-once semantics for the write to Kafka? On Tue, Sep 2, 2014 at 12:30 PM, Tim Smith secs...@gmail.com wrote: I'd be interested in finding the answer too. Right now, I do: val kafkaOutMsgs = kafkInMessages.map(x => myFunc(x._2, someParam)) kafkaOutMsgs.foreachRDD((rdd, time) => { rdd.foreach(rec => { writer.output(rec) }) }) // where writer.output is a method that takes a string and writer is an instance of a producer class. On Tue, Sep 2, 2014 at 10:12 AM, Massimiliano Tomassi max.toma...@gmail.com wrote: Hello all, after having applied several transformations to a DStream I'd like to publish all the elements in all the resulting RDDs to Kafka. What would be the best way to do that? Just using DStream.foreach and then RDD.foreach? Is there any other built-in utility for this use case? Thanks a lot, Max -- Massimiliano Tomassi e-mail: max.toma...@gmail.com
Adaptive stream processing and dynamic batch sizing
Hi, I was wondering whether the adaptive stream processing and dynamic batch sizing are available to use in Spark Streaming? Could someone point me in the right direction? Thanks, Josh
Re: Adaptive stream processing and dynamic batch sizing
Referring to this paper: http://dl.acm.org/citation.cfm?id=2670995. On Fri, Nov 14, 2014 at 10:42 AM, Josh J joshjd...@gmail.com wrote: Hi, I was wondering whether the adaptive stream processing and dynamic batch sizing are available to use in Spark Streaming? Could someone point me in the right direction? Thanks, Josh
concat two DStreams
Hi, Is it possible to concatenate or append two DStreams together? I have an incoming stream that I wish to combine with data that's generated by a utility. I then need to process the combined DStream. Thanks, Josh
Re: concat two DStreams
I think it's just called union. On Tue, Nov 11, 2014 at 2:41 PM, Josh J joshjd...@gmail.com wrote: Hi, Is it possible to concatenate or append two DStreams together? I have an incoming stream that I wish to combine with data that's generated by a utility. I then need to process the combined DStream. Thanks, Josh
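A minimal sketch, with illustrative stream names:

    val combined = incomingStream.union(utilityStream)
    // or, for more than two streams of the same element type:
    val combinedAll = ssc.union(Seq(incomingStream, utilityStream))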
convert List<String> to DStream
Hi, I have some data generated by some utilities that return the results as a List<String>. I would like to join this with a DStream of strings. How can I do this? I tried the following, though I get Scala compiler errors: val list_scalaconverted = ssc.sparkContext.parallelize(listvalues.toArray()) list_queue.add(list_scalaconverted) val list_stream = ssc.queueStream(list_queue) found: org.apache.spark.rdd.RDD[Object] required: org.apache.spark.rdd.RDD[String] Note: Object >: String, but class RDD is invariant in type T. You may wish to define T as -T instead. (SLS 4.5) and found: java.util.LinkedList[org.apache.spark.rdd.RDD[String]] required: scala.collection.mutable.Queue[org.apache.spark.rdd.RDD[?]] Thanks, Josh
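Both errors suggest two small fixes (a minimal sketch, assuming listvalues is a java.util.List[String]): convert it via asScala so parallelize infers RDD[String] instead of RDD[Object] (List.toArray() returns Object[]), and use a scala.collection.mutable.Queue rather than a java.util.LinkedList, since that is what queueStream expects:

    import scala.collection.JavaConverters._
    import scala.collection.mutable.Queue
    import org.apache.spark.rdd.RDD

    val rdd: RDD[String] = ssc.sparkContext.parallelize(listvalues.asScala)
    val queue = Queue[RDD[String]]()
    queue += rdd
    val list_stream = ssc.queueStream(queue)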
Re: scala RDD sortby compilation error
I'm using the same code https://github.com/apache/spark/blob/83b7a1c6503adce1826fc537b4db47e534da5cae/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala#L687, though I still receive: not enough arguments for method sortBy: (f: String => K, ascending: Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag: scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String]. Unspecified value parameter f. On Tue, Nov 4, 2014 at 11:28 AM, Josh J joshjd...@gmail.com wrote: Hi, Does anyone have any good examples of using sortBy with RDDs in Scala? I'm receiving: not enough arguments for method sortBy: (f: String => K, ascending: Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag: scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String]. Unspecified value parameter f. I tried to follow the example in the test case https://github.com/apache/spark/blob/83b7a1c6503adce1826fc537b4db47e534da5cae/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala, using the same approach, even the same method names and parameters, though no luck. Thanks, Josh
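For reference, a minimal sketch of sortBy on an RDD[String]; "Unspecified value parameter f" means the key function in the first argument list was not actually supplied at the call site:

    val rdd = sc.parallelize(Seq("banana", "apple", "cherry"))
    val sorted = rdd.sortBy(s => s)                         // sort by the string itself
    val byLength = rdd.sortBy(_.length, ascending = false)  // sort by a derived key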
random shuffle streaming RDDs?
Hi, Is there a nice or optimal method to randomly shuffle Spark Streaming RDDs? Thanks, Josh
Re: random shuffle streaming RDDs?
When I'm outputting the RDDs to an external source, I would like the RDDs to be output in a random shuffle, so that even the order is random. So far, my understanding is that the RDDs do have a kind of order: for Spark Streaming, the order would be the order in which Spark Streaming read the tuples from the source (i.e., ordered roughly by when the producer sent each tuple, plus any latency). On Mon, Nov 3, 2014 at 8:48 AM, Sean Owen so...@cloudera.com wrote: I think the answer will be the same in streaming as in the core. You want a random permutation of an RDD? In general RDDs don't have an ordering at all -- except when you sort, for example -- so a permutation doesn't make sense. Do you just want a well-defined but random ordering of the data? Do you just want to (re-)assign elements randomly to partitions? On Mon, Nov 3, 2014 at 4:33 PM, Josh J joshjd...@gmail.com wrote: Hi, Is there a nice or optimal method to randomly shuffle Spark Streaming RDDs? Thanks, Josh
Re: random shuffle streaming RDDs?
If you want to permute an RDD, how about a sortBy() on a good hash function of each value plus some salt? (Haven't thought this through much but sounds about right.) This sounds promising. Where can I read more about the space (memory and network overhead) and time complexity of sortBy? On Mon, Nov 3, 2014 at 10:38 AM, Sean Owen so...@cloudera.com wrote: If you iterated over an RDD's partitions, I'm not sure that in practice you would find the order matches the order they were received. The receiver is replicating data to another node or nodes as it goes, and I don't know how much is guaranteed about that. If you want to permute an RDD, how about a sortBy() on a good hash function of each value plus some salt? (Haven't thought this through much but sounds about right.) On Mon, Nov 3, 2014 at 4:59 PM, Josh J joshjd...@gmail.com wrote: When I'm outputting the RDDs to an external source, I would like the RDDs to be output in a random shuffle, so that even the order is random. So far, my understanding is that the RDDs do have a kind of order: for Spark Streaming, the order would be the order in which Spark Streaming read the tuples from the source (i.e., ordered roughly by when the producer sent each tuple, plus any latency). On Mon, Nov 3, 2014 at 8:48 AM, Sean Owen so...@cloudera.com wrote: I think the answer will be the same in streaming as in the core. You want a random permutation of an RDD? In general RDDs don't have an ordering at all -- except when you sort, for example -- so a permutation doesn't make sense. Do you just want a well-defined but random ordering of the data? Do you just want to (re-)assign elements randomly to partitions? On Mon, Nov 3, 2014 at 4:33 PM, Josh J joshjd...@gmail.com wrote: Hi, Is there a nice or optimal method to randomly shuffle Spark Streaming RDDs? Thanks, Josh
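A minimal sketch of that suggestion, for an RDD[String]; a fresh salt per call yields a different permutation each time. As for cost: sortBy keys every element and performs a full distributed sort, so expect a complete shuffle of the data over the network plus O(n log n) comparison work:

    import scala.util.hashing.MurmurHash3

    def randomPermute(rdd: org.apache.spark.rdd.RDD[String]) = {
      val salt = scala.util.Random.nextInt()
      rdd.sortBy(x => MurmurHash3.stringHash(x, salt))
    }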
run multiple spark applications in parallel
Hi, How do I run multiple Spark applications in parallel? I tried to run on a YARN cluster, though the second application submitted does not run. Thanks, Josh
Re: run multiple spark applications in parallel
Sorry, I should've included some stats with my email. I execute each job in the following manner: ./bin/spark-submit --class CLASSNAME --master yarn-cluster --driver-memory 1g --executor-memory 1g --executor-cores 1 UBER.JAR ${ZK_PORT_2181_TCP_ADDR} my-consumer-group1 1 The box has 24 CPUs (Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz) and 32 GB RAM. Thanks, Josh On Tue, Oct 28, 2014 at 4:15 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Try reducing the resources (cores and memory) of each application. On Oct 28, 2014, at 7:05 PM, Josh J joshjd...@gmail.com wrote: Hi, How do I run multiple Spark applications in parallel? I tried to run on a YARN cluster, though the second application submitted does not run. Thanks, Josh
exact count using rdd.count()?
Hi, Is the following guaranteed to always provide an exact count? foreachRDD(foreachFunc = rdd => { rdd.count() ... }) In the documentation it mentions: However, output operations (like foreachRDD) have *at-least once* semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure. http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node Thanks, Josh
combine rdds?
Hi, How can I combine RDDs? I would like to combine two RDDs if the count in one RDD is not above some threshold. Thanks, Josh
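A minimal sketch, assuming "combine" means union and the threshold applies to the first RDD's size; the names and threshold value are illustrative. Note that count() launches a job, so cache rdd1 first if it is expensive to recompute:

    val threshold = 1000L
    val combined = if (rdd1.count() < threshold) rdd1.union(rdd2) else rdd1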
docker spark 1.1.0 cluster
Hi, Are there Dockerfiles available that allow one to set up a Docker Spark 1.1.0 cluster? Thanks, Josh
streaming join sliding windows
Hi, How can I join neighboring sliding windows in Spark Streaming? Thanks, Josh
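A minimal sketch of one interpretation: window the same key-value stream twice, with the longer window also covering the preceding interval, then join. Here pairs is assumed to be a DStream of key-value pairs, and the durations are illustrative:

    import org.apache.spark.streaming.Seconds

    val current = pairs.window(Seconds(10), Seconds(10))      // this window only
    val withPrevious = pairs.window(Seconds(20), Seconds(10)) // this window plus the previous one
    val joined = current.join(withPrevious)                   // (key, (currentValue, neighborValue))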
Re: countByWindow save the count?
Thanks. I'm just confused about the syntax; I'm not sure which variable holds the count, or where its value is stored, so that I can save it. Any examples or tips? On Mon, Aug 25, 2014 at 9:49 PM, Daniil Osipov daniil.osi...@shazam.com wrote: You could try to use foreachRDD on the result of countByWindow with a function that performs the save operation. On Fri, Aug 22, 2014 at 1:58 AM, Josh J joshjd...@gmail.com wrote: Hi, Hopefully a simple question, though is there an example of where to save the output of countByWindow? I would like to save the results to external storage (Kafka or Redis). The examples only show stream.print(). Thanks, Josh
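A minimal sketch of that suggestion: countByWindow returns a DStream[Long] whose single element per RDD is the window's count, and foreachRDD exposes it for saving. Note countByWindow performs an incremental reduce and therefore requires a checkpoint directory (ssc.checkpoint(...)); the println is a placeholder for the Kafka or Redis write:

    val counts = stream.countByWindow(Seconds(30), Seconds(10))
    counts.foreachRDD { rdd =>
      rdd.collect().foreach(count => println(s"window count = $count"))
    }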
countByWindow save the count?
Hi, Hopefully a simple question, though is there an example of where to save the output of countByWindow? I would like to save the results to external storage (Kafka or Redis). The examples only show stream.print(). Thanks, Josh
multiple windows from the same DStream?
Hi, Can I build two sliding windows in parallel from the same DStream? Will these two windowed streams run in parallel and process the same data? I wish to apply two different functions (aggregation on one window and storage for the other) to the same original DStream data, with the same window sizes. JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(jssc, args[0], args[1], topicMap); Duration windowLength = new Duration(3); Duration slideInterval = new Duration(3); JavaPairDStream<String, String> windowMessages1 = messages.window(windowLength, slideInterval); JavaPairDStream<String, String> windowMessages2 = messages.window(windowLength, slideInterval); Thanks, Josh
DStream start a separate DStream
Hi, I would like to have a sliding-window DStream perform a streaming computation and store the results. Once the results are stored, I would then like to process them. However, I must wait until the final computation is done for all tuples in the sliding window before I begin the new DStream. How can I accomplish this with Spark? Sincerely, Josh
Difference between amplab docker and spark docker?
Hi, What's the difference between the amplab docker scripts https://github.com/amplab/docker-scripts and the spark docker directory https://github.com/apache/spark/tree/master/docker? Thanks, Josh