Re: Streaming scheduling delay

2015-03-01 Thread Josh J
On Fri, Feb 13, 2015 at 2:21 AM, Gerard Maas wrote:
> KafkaOutputServicePool

Could you please give example code of what KafkaOutputServicePool would look like? When I tried object pooling, I ended up with various not-serializable exceptions.

Thanks!
Josh
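
The not-serializable errors typically come from building the Kafka producer on the driver and closing over it; the usual fix is to create the producer lazily on each executor, for example behind a singleton holder used inside foreachPartition. A minimal sketch of that pattern, assuming the Kafka 0.8 producer API; the names KafkaProducerHolder, the broker list, the topic "outTopic", and dstream: DStream[String] are all hypothetical:

    import java.util.Properties
    import kafka.javaapi.producer.Producer
    import kafka.producer.{KeyedMessage, ProducerConfig}

    // Singleton holder: the producer is created lazily on each executor JVM,
    // so nothing non-serializable is ever shipped from the driver.
    object KafkaProducerHolder {
      lazy val producer: Producer[String, String] = {
        val props = new Properties()
        props.put("metadata.broker.list", "broker1:9092") // assumption: your brokers
        props.put("serializer.class", "kafka.serializer.StringEncoder")
        new Producer[String, String](new ProducerConfig(props))
      }
    }

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Runs on the executor; one producer per JVM instead of per record.
        partition.foreach(msg => KafkaProducerHolder.producer.send(
          new KeyedMessage[String, String]("outTopic", msg)))
      }
    }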

Re: throughput in the web console?

2015-02-25 Thread Josh J
On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das wrote:
> For Spark Streaming applications, there is already a tab called "Streaming"
> which displays the basic statistics.

Would I just need to extend this tab to add the throughput?

Re: throughput in the web console?

2015-02-25 Thread Josh J
Regards

> On Tue, Feb 24, 2015 at 10:29 PM, Josh J wrote:
>
>> Hi,
>>
>> I plan to run a parameter search varying the number of cores, epoch, and
>> parallelism. The web console provides a way to archive the previous runs,
>> though is there a way to view i

throughput in the web console?

2015-02-24 Thread Josh J
Hi, I plan to run a parameter search varying the number of cores, epoch, and parallelism. The web console provides a way to archive the previous runs, though is there a way to view the throughput in the console, rather than logging the throughput separately to the log files and correlating the log

measuring time taken in map, reduceByKey, filter, flatMap

2015-01-30 Thread Josh J
Hi, I have a stream pipeline which invokes map, reduceByKey, filter, and flatMap. How can I measure the time taken in each stage? Thanks, Josh
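
Spark doesn't expose per-operator timings directly (the web UI shows per-stage wall times), but one approximation is to wrap each operator's function with a named accumulator of elapsed nanoseconds. A sketch under stated assumptions: ssc is the StreamingContext and lines is a DStream[String]; it measures only time spent inside the user function, not shuffle or scheduling:

    import org.apache.spark.SparkContext._   // Long accumulator implicits (Spark 1.x)

    // One accumulator per operator; each record adds the time its function took.
    val mapTime = ssc.sparkContext.accumulator(0L, "map nanos")

    val mapped = lines.map { s =>
      val t0 = System.nanoTime()
      val out = s.toUpperCase            // stand-in for the real map logic
      mapTime += System.nanoTime() - t0
      out
    }

    mapped.foreachRDD { rdd =>
      rdd.count()                        // force evaluation so timings accumulate
      println(s"cumulative map time: ${mapTime.value / 1e6} ms")
    }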

Re: dockerized spark executor on mesos?

2015-01-14 Thread Josh J
> We have dockerized Spark Master and worker(s) separately and are using it in
> our dev environment.

Is this setup available on GitHub or Docker Hub?

On Tue, Dec 9, 2014 at 3:50 PM, Venkat Subramanian wrote:
> We have dockerized Spark Master and worker(s) separately and are using it
> in

spark standalone master with workers on two nodes

2015-01-13 Thread Josh J
Hi, I'm trying to run Spark Streaming standalone on two nodes. I'm able to run on a single node fine. I start both workers and it registers in the Spark UI. However, the application says "SparkDeploySchedulerBackend: Asked to remove non-existent executor 2" Any ideas? Thanks, Josh

sample is not a member of org.apache.spark.streaming.dstream.DStream

2014-12-28 Thread Josh J
Hi, I'm trying to use sampling with Spark Streaming. I imported the following

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

I then call sample

val streamtoread = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_AND_DISK).m
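
The compile error arises because sample is a method on RDD, not on DStream; it can be applied to every batch through transform. A one-line sketch against the streamtoread stream above (the fraction and seed values are arbitrary):

    // Apply RDD.sample to each batch RDD of the stream.
    val sampled = streamtoread.transform(rdd =>
      rdd.sample(withReplacement = false, fraction = 0.1, seed = 42L))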

Re: Publishing a transformed DStream to Kafka

2014-11-30 Thread Josh J
Is there a way to do this that preserves exactly-once semantics for the write to Kafka?

On Tue, Sep 2, 2014 at 12:30 PM, Tim Smith wrote:
> I'd be interested in finding the answer too. Right now, I do:
>
> val kafkaOutMsgs = kafkInMessages.map(x => myFunc(x._2, someParam))
> kafkaOutMsgs.foreachRDD
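
Spark Streaming's output actions are at-least-once, and the 0.8-era Kafka producer has no transactions, so true exactly-once isn't available here. A minimal sketch of the usual compromise, reusing the hypothetical KafkaProducerHolder and imports from the pooling sketch earlier in this digest: keep the write at-least-once, but attach a deterministic key to each record so a downstream consumer (or a log-compacted topic) can de-duplicate replays. The topic name "outTopic" is again an assumption:

    // Deterministic key per record: replays produce the same (key, value) pair.
    val keyed = kafkaOutMsgs.map(msg =>
      (java.util.UUID.nameUUIDFromBytes(msg.getBytes("UTF-8")).toString, msg))

    keyed.foreachRDD { rdd =>
      rdd.foreachPartition { part =>
        part.foreach { case (key, value) =>
          KafkaProducerHolder.producer.send(
            new KeyedMessage[String, String]("outTopic", key, value))
        }
      }
    }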

kafka pipeline exactly once semantics

2014-11-30 Thread Josh J
Hi, the Spark docs mention: "However, output operations (like foreachRDD) have *at-least once* semantics, that is, the transformed data may get written to an external entity more than once in the

Re: Adaptive stream processing and dynamic batch sizing

2014-11-14 Thread Josh J
Referring to this paper <http://dl.acm.org/citation.cfm?id=2670995>.

On Fri, Nov 14, 2014 at 10:42 AM, Josh J wrote:
> Hi,
>
> I was wondering if the adaptive stream processing and dynamic batch
> processing was available to use in spark streaming? If someone could help

Adaptive stream processing and dynamic batch sizing

2014-11-14 Thread Josh J
Hi, I was wondering whether adaptive stream processing and dynamic batch sizing are available to use in Spark Streaming? Could someone help point me in the right direction? Thanks, Josh

Re: concat two Dstreams

2014-11-11 Thread Josh J
I think it's just called "union".

On Tue, Nov 11, 2014 at 2:41 PM, Josh J wrote:
> Hi,
>
> Is it possible to concatenate or append two DStreams together? I have an
> incoming stream that I wish to combine with data that's generated by a
> utility. I then n
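
For reference, union is a method on DStream itself, and StreamingContext can union several streams at once. A sketch assuming two DStreams of the same element type (the stream names are illustrative):

    // Pairwise:
    val combined = incomingStream.union(utilityStream)
    // Or, for any number of streams:
    val all = ssc.union(Seq(incomingStream, utilityStream))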

concat two Dstreams

2014-11-11 Thread Josh J
Hi, Is it possible to concatenate or append two DStreams together? I have an incoming stream that I wish to combine with data that's generated by a utility. I then need to process the combined DStream. Thanks, Josh

convert List to dstream

2014-11-10 Thread Josh J
Hi, I have some data generated by some utilities that return the results as a List. I would like to join this with a DStream of strings. How can I do this? I tried the following, though I get Scala compiler errors:

val list_scalaconverted = ssc.sparkContext.parallelize(listvalues.toArray())
list_
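
One common way to do this: parallelize the list once on the driver, key both sides, and join inside transform, which runs for each batch. A sketch under stated assumptions: listvalues is a java.util.List[String] and stream is the DStream[String]; the join keys here are simply the strings themselves:

    import org.apache.spark.SparkContext._     // pair-RDD implicits (Spark 1.x)
    import scala.collection.JavaConverters._

    // Static RDD built once from the utility's list, keyed by the value.
    val listRDD = ssc.sparkContext.parallelize(listvalues.asScala).map(v => (v, v))

    // Join every batch of the stream against the static RDD.
    val joined = stream.map(s => (s, s)).transform(rdd => rdd.join(listRDD))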

Re: scala RDD sortby compilation error

2014-11-04 Thread Josh J
Please find my code here <https://gist.github.com/joshjdevl/b9af68b11398fd1823c4>.

On Tue, Nov 4, 2014 at 11:33 AM, Josh J wrote:
> I'm using the same code
> <https://github.com/apache/spark/blob/83b7a1c6503adce1826fc537b4db47e534da5cae/core/src/test/scala/org/apache/spar

Re: scala RDD sortby compilation error

2014-11-04 Thread Josh J
)(implicit ord: Ordering[K], implicit ctag: scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String]. Unspecified value parameter f.

On Tue, Nov 4, 2014 at 11:28 AM, Josh J wrote:
> Hi,
>
> Does anyone have any good examples of using sortby for RDDs and scala?
>
> I'm recei

scala RDD sortby compilation error

2014-11-04 Thread Josh J
Hi, does anyone have any good examples of using sortBy for RDDs and Scala? I'm receiving:

not enough arguments for method sortBy: (f: String => K, ascending: Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag: scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String]. Unspec
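
The "Unspecified value parameter f" error means sortBy was called without its key function; ascending and numPartitions have defaults, but f does not. A small sketch, assuming sc is a SparkContext:

    val rdd = sc.parallelize(Seq("banana", "apple", "cherry"))
    val natural  = rdd.sortBy(s => s)                          // natural String order
    val byLength = rdd.sortBy(s => s.length, ascending = false)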

Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
> d I don't know much is guaranteed about that.
>
> If you want to permute an RDD, how about a sortBy() on a good hash
> function of each value plus some salt? (Haven't thought this through
> much but sounds about right.)
>
> On Mon, Nov 3, 2014 at 4:59 PM, Josh J wrote:
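
A sketch of the suggestion quoted above (sortBy on a salted hash of each value), assuming an RDD[String]; drawing a fresh salt per call yields a different pseudo-random permutation each time:

    import org.apache.spark.rdd.RDD
    import scala.util.hashing.MurmurHash3

    def randomShuffle(rdd: RDD[String]): RDD[String] = {
      val salt = scala.util.Random.nextInt()
      // Salted hash as the sort key: deterministic within one run of this
      // function, different across runs.
      rdd.sortBy(v => MurmurHash3.stringHash(v, salt))
    }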

Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
> RDDs don't have
> ordering at all -- excepting when you sort for example -- so a
> permutation doesn't make sense. Do you just want a well-defined but
> random ordering of the data? Do you just want to (re-)assign elements
> randomly to partitions?
>
> On Mon, Nov 3,

random shuffle streaming RDDs?

2014-11-03 Thread Josh J
Hi, Is there a nice or optimal method to randomly shuffle Spark Streaming RDDs? Thanks, Josh

Re: run multiple spark applications in parallel

2014-10-28 Thread Josh J
CPUs, Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
32 GB RAM

Thanks,
Josh

On Tue, Oct 28, 2014 at 4:15 PM, Soumya Simanta wrote:
> Try reducing the resources (cores and memory) of each application.
>
> On Oct 28, 2014, at 7:05 PM, Josh J wrote:
>> Hi,
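
On YARN the usual cause is that the first application claims all available executors, leaving the second stuck in the ACCEPTED state; capping each application's footprint lets them run side by side. A sketch of the relevant settings (the values are illustrative, not recommendations):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("app-one")
      .set("spark.executor.memory", "4g")
      .set("spark.executor.cores", "2")   // per-executor cores (YARN)
      .set("spark.cores.max", "8")        // total-cores cap (standalone/Mesos)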

run multiple spark applications in parallel

2014-10-28 Thread Josh J
Hi, How do I run multiple Spark applications in parallel? I tried to run on a YARN cluster, though the second application submitted does not run. Thanks, Josh

combine rdds?

2014-10-27 Thread Josh J
Hi, How could I combine RDDs? I would like to combine two RDDs if the count in an RDD is not above some threshold. Thanks, Josh
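
A minimal sketch of that check, assuming rddA and rddB share an element type and that the threshold is application-specific; note that count() triggers a full job on rddA:

    val threshold = 1000L                   // assumed threshold
    val combined =
      if (rddA.count() <= threshold) rddA.union(rddB)
      else rddA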

exact count using rdd.count()?

2014-10-27 Thread Josh J
Hi,

Is the following guaranteed to always provide an exact count?

foreachRDD(foreachFunc = rdd => {
  rdd.count()

In the literature it mentions "However, output operations (like foreachRDD) have *at-least once* semantics, that is, the transformed data may get written to an external entity more

docker spark 1.1.0 cluster

2014-10-24 Thread Josh J
Hi, Are there Dockerfiles available which allow setting up a Docker Spark 1.1.0 cluster? Thanks, Josh

streaming join sliding windows

2014-10-22 Thread Josh J
Hi, How can I join neighboring sliding windows in Spark Streaming? Thanks, Josh
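
Windows in Spark Streaming are aligned to batch time, so there is no direct "previous window" operator; one workaround is to materialize both views at the same batch by joining the current window against a wider window that also covers its neighbor. A sketch assuming pairs is a DStream[(K, V)] and a 10-second slide:

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream implicits (Spark 1.x)

    val current = pairs.window(Seconds(10), Seconds(10))
    val wide    = pairs.window(Seconds(20), Seconds(10))  // current + previous
    val joined  = current.join(wide)   // per-batch join of the two views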

dynamic sliding window duration

2014-10-07 Thread Josh J
Hi, I have a source which fluctuates in the frequency of streaming tuples. I would like to process batches defined by tuple count, rather than by window duration. Is it possible to either 1) define batch window sizes by count or 2) dynamically adjust the duration of the sliding window? Thanks, Josh

Re: countByWindow save the count ?

2014-08-26 Thread Josh J
> with a
> function that performs the save operation.
>
> On Fri, Aug 22, 2014 at 1:58 AM, Josh J wrote:
>
>> Hi,
>>
>> Hopefully a simple question. Though is there an example of where to save
>> the output of countByWindow? I would like to save the resu
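
Putting the replies together, a sketch assuming stream is a DStream[String] and an assumed checkpoint path (countByWindow needs checkpointing because it uses an inverse reduce):

    import org.apache.spark.streaming.Seconds

    ssc.checkpoint("/tmp/checkpoint")   // assumed path
    stream.countByWindow(Seconds(30), Seconds(10)).foreachRDD { rdd =>
      // rdd holds a single Long; replace println with a Kafka/Redis write.
      rdd.collect().foreach(count => println(s"window count: $count"))
    }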

countByWindow save the count ?

2014-08-22 Thread Josh J
Hi, Hopefully a simple question: is there an example of where to save the output of countByWindow? I would like to save the results to external storage (Kafka or Redis). The examples show only stream.print(). Thanks, Josh

DStream start a separate DStream

2014-08-21 Thread Josh J
Hi, I would like to have a sliding window DStream perform a streaming computation and store these results. Once these results are stored, I would then like to process them. Though I must wait until the final computation is done for all tuples in the sliding window before I begin the new DStre

multiple windows from the same DStream ?

2014-08-21 Thread Josh J
Hi, Can I build two sliding windows in parallel from the same DStream? Will these two window streams run in parallel and process the same data? I wish to do two different functions (aggregation on one window and storage for the other window) across the same original DStream data though the same
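
Each call to window on the same DStream produces an independent stream over the same input data. A sketch assuming stream is a DStream[String] and an assumed output prefix; cache() keeps the shared batches from being recomputed for each window:

    import org.apache.spark.streaming.Seconds

    stream.cache()
    val aggWindow   = stream.window(Seconds(60), Seconds(10))
    val storeWindow = stream.window(Seconds(60), Seconds(10))

    aggWindow.count().print()                       // aggregation path
    storeWindow.saveAsTextFiles("/tmp/windows")     // storage path (assumed prefix)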

Difference between amplab docker and spark docker?

2014-08-20 Thread Josh J
Hi, What's the difference between amplab docker and spark docker? Thanks, Josh