Re: Streaming scheduling delay

2015-03-01 Thread Josh J
On Fri, Feb 13, 2015 at 2:21 AM, Gerard Maas gerard.m...@gmail.com wrote:

 KafkaOutputServicePool


Could you please give an example of what KafkaOutputServicePool would look
like? When I tried object pooling, I ended up with various
NotSerializableExceptions.
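
Would something along these lines be what you mean? A minimal sketch, assuming
the Kafka 0.8 producer API and a placeholder broker address: the producer lives
in a per-executor singleton object, so it is created lazily on each executor
and never serialized with the closure.

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.apache.spark.streaming.dstream.DStream

// Sketch only: one producer per executor JVM, created on first use so it is
// never shipped (and never fails serialization) with the task closure.
object KafkaSink {
  private var producer: Producer[String, String] = null

  def send(topic: String, msg: String): Unit = synchronized {
    if (producer == null) {
      val props = new Properties()
      props.put("metadata.broker.list", "localhost:9092") // placeholder broker address
      props.put("serializer.class", "kafka.serializer.StringEncoder")
      producer = new Producer[String, String](new ProducerConfig(props))
    }
    producer.send(new KeyedMessage[String, String](topic, msg))
  }
}

def writeToKafka(stream: DStream[String], topic: String): Unit =
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      records.foreach(rec => KafkaSink.send(topic, rec))
    }
  }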

Thanks!
Josh


Re: throughput in the web console?

2015-02-25 Thread Josh J
Let me ask it this way: what would be the easiest way to display the
throughput in the web console? Would I need to create a new tab and add the
metrics? Are there any good or simple examples showing how this can be done?
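
For example, would a StreamingListener along these lines be enough? A rough
sketch; note that numRecords on BatchInfo only exists in newer Spark Streaming
releases, so older versions would need to sum the per-receiver block info
instead.

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Sketch only: log records/sec for every completed batch.
class ThroughputListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    info.processingDelay.filter(_ > 0).foreach { procMs =>
      val recordsPerSec = info.numRecords * 1000.0 / procMs
      println(f"Batch ${info.batchTime}: ${info.numRecords} records, $recordsPerSec%.1f records/sec")
    }
  }
}

// registered once on the StreamingContext:
// ssc.addStreamingListener(new ThroughputListener)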

On Wed, Feb 25, 2015 at 12:07 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Did you have a look at


 https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.scheduler.SparkListener

 And for Streaming:


 https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener



 Thanks
 Best Regards

 On Tue, Feb 24, 2015 at 10:29 PM, Josh J joshjd...@gmail.com wrote:

 Hi,

 I plan to run a parameter search varying the number of cores, epochs, and
 parallelism. The web console provides a way to archive previous runs, but is
 there a way to view the throughput in the console, rather than logging the
 throughput separately to the log files and correlating the log files with
 the web console processing times?

 Thanks,
 Josh





Re: throughput in the web console?

2015-02-25 Thread Josh J
On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 For SparkStreaming applications, there is already a tab called Streaming
 which displays the basic statistics.


Would I just need to extend this tab to add the throughput?


throughput in the web console?

2015-02-24 Thread Josh J
Hi,

I plan to run a parameter search varying the number of cores, epochs, and
parallelism. The web console provides a way to archive previous runs, but is
there a way to view the throughput in the console, rather than logging the
throughput separately to the log files and correlating the log files with
the web console processing times?

Thanks,
Josh


measuring time taken in map, reduceByKey, filter, flatMap

2015-01-30 Thread Josh J
Hi,

I have a stream pipeline which invokes map, reduceByKey, filter, and
flatMap. How can I measure the time taken in each stage?
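
The closest I have found so far is stage-level timing through a SparkListener
(a sketch only, since per-transformation times are not exposed directly): map,
filter, and flatMap fuse into a single stage, while reduceByKey introduces a
shuffle boundary and therefore a new stage. Is there anything finer-grained
than this?

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Sketch only: report how long each completed stage took.
class StageTimingListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    for {
      start <- info.submissionTime
      end   <- info.completionTime
    } println(s"Stage ${info.stageId} (${info.name}) took ${end - start} ms")
  }
}

// sc.addSparkListener(new StageTimingListener)  // sc is the SparkContext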

Thanks,
Josh


Re: dockerized spark executor on mesos?

2015-01-14 Thread Josh J

 We have dockerized Spark Master and worker(s) separately and are using it
 in
 our dev environment.


Is this setup available on GitHub or Docker Hub?

On Tue, Dec 9, 2014 at 3:50 PM, Venkat Subramanian vsubr...@gmail.com
wrote:

 We have dockerized Spark Master and worker(s) separately and are using it
 in
 our dev environment. We don't use Mesos though, running it in Standalone
 mode, but adding Mesos should not be that difficult I think.

 Regards

 Venkat







spark standalone master with workers on two nodes

2015-01-13 Thread Josh J
Hi,

I'm trying to run Spark Streaming standalone on two nodes; it runs fine on a
single node. I start both workers and they register in the Spark UI. However,
the application says:

SparkDeploySchedulerBackend: Asked to remove non-existent executor 2

Any ideas?

Thanks,
Josh


sample is not a member of org.apache.spark.streaming.dstream.DStream

2014-12-28 Thread Josh J
Hi,

I'm trying to use sampling with Spark Streaming. I imported the following:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._


I then call sample


val streamtoread = KafkaUtils.createStream(ssc, zkQuorum, group,
topicMap,StorageLevel.MEMORY_AND_DISK).map(_._2)

streamtoread.sample(withReplacement = true, fraction = fraction)


How do I use the sample() method
(http://spark.apache.org/docs/latest/programming-guide.html#transformations)
with Spark Streaming?
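
Would going through transform be the right approach, since sample is defined
on RDD rather than DStream in these releases? Something along these lines
(just a sketch):

// Sketch only: transform exposes each batch's RDD, where sample is available.
val sampledStream = streamtoread.transform(rdd =>
  rdd.sample(withReplacement = true, fraction = fraction))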


Thanks,

Josh


kafka pipeline exactly once semantics

2014-11-30 Thread Josh J
Hi,

In the Spark docs
(http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node)
it mentions: "However, output operations (like foreachRDD) have *at-least
once* semantics, that is, the transformed data may get written to an
external entity more than once in the event of a worker failure."

I would like to set up a Kafka pipeline whereby I write my data to a single
topic (topic 1), continue processing with Spark Streaming and write the
transformed results to topic 2, and finally read the results from topic 2.
How do I configure Spark Streaming so that I can maintain exactly-once
semantics when writing to topic 2?
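
The best I have come up with so far (a sketch, not something Spark provides
out of the box) is to accept the at-least-once output and make the writes to
topic 2 idempotent, for example by keying every message with a deterministic
fingerprint of its content so that duplicates from a replayed batch can be
dropped by the topic 2 consumer. sendToKafka below is a hypothetical stand-in
for whatever producer wrapper is used.

import java.security.MessageDigest

// Sketch only: a deterministic content fingerprint used as a dedup key.
def fingerprint(s: String): String =
  MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8")).map("%02x".format(_)).mkString

def sendToKafka(topic: String, key: String, value: String): Unit = ???  // hypothetical producer wrapper

// transformed stands for the DStream[String] of results destined for topic 2.
transformed.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    records.foreach(rec => sendToKafka("topic2", fingerprint(rec), rec))
  }
}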

Thanks,
Josh


Re: Publishing a transformed DStream to Kafka

2014-11-30 Thread Josh J
Is there a way to do this that preserves exactly-once semantics for the
write to Kafka?

On Tue, Sep 2, 2014 at 12:30 PM, Tim Smith secs...@gmail.com wrote:

 I'd be interested in finding the answer too. Right now, I do:

 val kafkaOutMsgs = kafkInMessages.map(x => myFunc(x._2, someParam))
 kafkaOutMsgs.foreachRDD((rdd, time) => { rdd.foreach(rec => {
 writer.output(rec) }) } ) // where writer.output is a method that takes a
 string and writer is an instance of a producer class.
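
A variant I have been considering (a sketch only, with the Kafka 0.8 producer
API, producerProps as a placeholder java.util.Properties, and "topic2" as an
illustrative topic name) builds the writer inside foreachPartition, so it is
created on the executor and never serialized with the closure:

import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

// Sketch only: one producer per partition per batch, built on the executor.
kafkaOutMsgs.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val writer = new Producer[String, String](new ProducerConfig(producerProps)) // producerProps: placeholder
    records.foreach(rec => writer.send(new KeyedMessage[String, String]("topic2", rec)))
    writer.close()
  }
}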





 On Tue, Sep 2, 2014 at 10:12 AM, Massimiliano Tomassi 
 max.toma...@gmail.com wrote:

 Hello all,
 after having applied several transformations to a DStream, I'd like to
 publish all the elements in all the resulting RDDs to Kafka. What would be
 the best way to do that? Just using DStream.foreach and then RDD.foreach?
 Is there any other built-in utility for this use case?

 Thanks a lot,
 Max

 --
 
 Massimiliano Tomassi
 
 e-mail: max.toma...@gmail.com
 





Adaptive stream processing and dynamic batch sizing

2014-11-14 Thread Josh J
Hi,

I was wondering whether adaptive stream processing and dynamic batch sizing
are available to use in Spark Streaming. Could someone point me in the right
direction?

Thanks,
Josh


Re: Adaptive stream processing and dynamic batch sizing

2014-11-14 Thread Josh J
Referring to this paper http://dl.acm.org/citation.cfm?id=2670995.

On Fri, Nov 14, 2014 at 10:42 AM, Josh J joshjd...@gmail.com wrote:

 Hi,

 I was wondering whether adaptive stream processing and dynamic batch sizing
 are available to use in Spark Streaming. Could someone point me in the right
 direction?

 Thanks,
 Josh




concat two Dstreams

2014-11-11 Thread Josh J
Hi,

Is it possible to concatenate or append two DStreams? I have an incoming
stream that I wish to combine with data generated by a utility. I then need
to process the combined DStream.

Thanks,
Josh


Re: concat two Dstreams

2014-11-11 Thread Josh J
I think it's just called union.
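
Something like this sketch, where incomingStream and utilityStream stand in
for the two DStream[String]s:

import org.apache.spark.streaming.dstream.DStream

// Sketch only: union appends the batches of two DStreams of the same element type.
val combined: DStream[String] = incomingStream.union(utilityStream)
combined.foreachRDD(rdd => println(s"combined batch size: ${rdd.count()}"))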

On Tue, Nov 11, 2014 at 2:41 PM, Josh J joshjd...@gmail.com wrote:

 Hi,

 Is it possible to concatenate or append two DStreams? I have an incoming
 stream that I wish to combine with data generated by a utility. I then need
 to process the combined DStream.

 Thanks,
 Josh



convert List&lt;String&gt; to DStream

2014-11-10 Thread Josh J
Hi,

I have some data generated by some utilities that return the results as
a List&lt;String&gt;. I would like to join this with a DStream of strings. How
can I do this? I tried the following but get Scala compiler errors:

val list_scalaconverted = ssc.sparkContext.parallelize(listvalues.toArray())
   list_queue.add(list_scalaconverted)
val list_stream = ssc.queueStream(list_queue)

 found   : org.apache.spark.rdd.RDD[Object]

 required: org.apache.spark.rdd.RDD[String]

Note: Object >: String, but class RDD is invariant in type T.

You may wish to define T as -T instead. (SLS 4.5)


and


 found   : java.util.LinkedList[org.apache.spark.rdd.RDD[String]]

 required: scala.collection.mutable.Queue[org.apache.spark.rdd.RDD[?]]
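
From the error messages, it looks like the fixes would be along these lines (a
sketch, assuming listvalues is a java.util.List&lt;String&gt;): convert it to a
Scala Seq so that parallelize infers RDD[String], and use a
scala.collection.mutable.Queue, which is what queueStream expects. Is this the
right direction?

import scala.collection.JavaConverters._
import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Sketch only: asScala keeps the String element type (toArray() erased it to
// Object), and queueStream takes a Scala mutable Queue rather than a LinkedList.
val listRdd: RDD[String] = ssc.sparkContext.parallelize(listvalues.asScala.toSeq)
val rddQueue = mutable.Queue[RDD[String]](listRdd)
val listStream = ssc.queueStream(rddQueue)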


Thanks,

Josh


Re: scala RDD sortby compilation error

2014-11-04 Thread Josh J
I'm using the same code
https://github.com/apache/spark/blob/83b7a1c6503adce1826fc537b4db47e534da5cae/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala#L687,
though I still receive:

 not enough arguments for method sortBy: (f: String => K, ascending:
Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag:
scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String].

Unspecified value parameter f.
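
For reference, the minimal usage I am aiming for looks like this sketch (sc is
a SparkContext; the error above usually means the key function f was not
passed in the first parameter list):

// Sketch only: sortBy takes the key function first; ascending and
// numPartitions have defaults.
val words = sc.parallelize(Seq("banana", "apple", "cherry"))
val alphabetical = words.sortBy(w => w)
val byLengthDesc = words.sortBy(w => w.length, ascending = false)
println(alphabetical.collect().mkString(", "))  // apple, banana, cherry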

On Tue, Nov 4, 2014 at 11:28 AM, Josh J joshjd...@gmail.com wrote:

 Hi,

 Does anyone have any good examples of using sortby for RDDs and scala?

 I'm receiving

  not enough arguments for method sortBy: (f: String => K, ascending:
 Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag:
 scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String].

 Unspecified value parameter f.


 I tried to follow the example in the test case
 (https://github.com/apache/spark/blob/83b7a1c6503adce1826fc537b4db47e534da5cae/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala),
 using the same approach and even the same method names and parameters, but
 with no luck.


 Thanks,

 Josh



random shuffle streaming RDDs?

2014-11-03 Thread Josh J
Hi,

Is there a nice or optimal method to randomly shuffle spark streaming RDDs?

Thanks,
Josh


Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
When I'm outputting the RDDs to an external source, I would like the RDDs
to be output in a random shuffle so that even the order is random. So far,
my understanding is that the RDDs do have a kind of order: for Spark
Streaming, it is the order in which Spark Streaming read the tuples from the
source (i.e., roughly the order in which the producer sent the tuples, plus
any latency).

On Mon, Nov 3, 2014 at 8:48 AM, Sean Owen so...@cloudera.com wrote:

 I think the answer will be the same in streaming as in the core. You
 want a random permutation of an RDD? in general RDDs don't have
 ordering at all -- excepting when you sort for example -- so a
 permutation doesn't make sense. Do you just want a well-defined but
 random ordering of the data? Do you just want to (re-)assign elements
 randomly to partitions?

 On Mon, Nov 3, 2014 at 4:33 PM, Josh J joshjd...@gmail.com wrote:
  Hi,
 
  Is there a nice or optimal method to randomly shuffle spark streaming
 RDDs?
 
  Thanks,
  Josh



Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
 If you want to permute an RDD, how about a sortBy() on a good hash
function of each value plus some salt? (Haven't thought this through
much but sounds about right.)

This sounds promising. Where can I read more about the space (memory and
network overhead) and time complexity of sortBy?
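
Just to check my understanding of the suggestion, a sketch with MurmurHash3
standing in for the "good hash", and rdd / dstream standing for the data in
question:

import scala.util.Random
import scala.util.hashing.MurmurHash3

// Sketch only: sort by a salted hash of each element to get a well-defined
// but pseudo-random permutation; transform applies the same thing per batch.
val salt = Random.nextInt()
val shuffledRdd = rdd.sortBy(v => MurmurHash3.stringHash(v.toString, salt))
// val shuffledStream = dstream.transform(r => r.sortBy(v => MurmurHash3.stringHash(v.toString, salt)))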



On Mon, Nov 3, 2014 at 10:38 AM, Sean Owen so...@cloudera.com wrote:

 If you iterated over an RDD's partitions, I'm not sure that in
 practice you would find the order matches the order they were
 received. The receiver is replicating data to another node or nodes as
 it goes, and I don't know how much is guaranteed about that.

 If you want to permute an RDD, how about a sortBy() on a good hash
 function of each value plus some salt? (Haven't thought this through
 much but sounds about right.)

 On Mon, Nov 3, 2014 at 4:59 PM, Josh J joshjd...@gmail.com wrote:
  When I'm outputting the RDDs to an external source, I would like the
 RDDs to
  be outputted in a random shuffle so that even the order is random. So far
  what I understood is that the RDDs do have a type of order, in that the
  order for spark streaming RDDs would be the order in which spark
 streaming
  read the tuples from source (e.g. ordered by roughly when the producer
 sent
  the tuple in addition to any latency)
 
  On Mon, Nov 3, 2014 at 8:48 AM, Sean Owen so...@cloudera.com wrote:
 
  I think the answer will be the same in streaming as in the core. You
  want a random permutation of an RDD? in general RDDs don't have
  ordering at all -- excepting when you sort for example -- so a
  permutation doesn't make sense. Do you just want a well-defined but
  random ordering of the data? Do you just want to (re-)assign elements
  randomly to partitions?
 
  On Mon, Nov 3, 2014 at 4:33 PM, Josh J joshjd...@gmail.com wrote:
   Hi,
  
   Is there a nice or optimal method to randomly shuffle spark streaming
   RDDs?
  
   Thanks,
   Josh
 
 



run multiple spark applications in parallel

2014-10-28 Thread Josh J
Hi,

How do I run multiple spark applications in parallel? I tried to run on
yarn cluster, though the second application submitted does not run.

Thanks,
Josh


Re: run multiple spark applications in parallel

2014-10-28 Thread Josh J
Sorry, I should've included some stats with my email.

I execute each job in the following manner:

./bin/spark-submit --class CLASSNAME --master yarn-cluster --driver-memory
1g --executor-memory 1g --executor-cores 1 UBER.JAR
${ZK_PORT_2181_TCP_ADDR} my-consumer-group1 1


The box has

24 CPUs, Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz

32 GB RAM


Thanks,

Josh

On Tue, Oct 28, 2014 at 4:15 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:

 Try reducing the resources (cores and memory) of each application.



  On Oct 28, 2014, at 7:05 PM, Josh J joshjd...@gmail.com wrote:
 
  Hi,
 
  How do I run multiple spark applications in parallel? I tried to run on
 yarn cluster, though the second application submitted does not run.
 
  Thanks,
  Josh



exact count using rdd.count()?

2014-10-27 Thread Josh J
Hi,

Is the following guaranteed to always provide an exact count?

foreachRDD(foreachFunc = rdd => {
   rdd.count()

In the documentation it mentions: "However, output operations (like
foreachRDD) have *at-least once* semantics, that is, the transformed data may
get written to an external entity more than once in the event of a worker
failure."

http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node

Thanks,
Josh


combine rdds?

2014-10-27 Thread Josh J
Hi,

How can I combine RDDs? I would like to combine two RDDs if the count in
one RDD is not above some threshold.
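
Concretely, something like this sketch (rddA, rddB, and the threshold are
illustrative):

// Sketch only: union concatenates two RDDs of the same type; here the second
// is appended only while the first stays under a threshold.
val threshold = 1000L
val combined = if (rddA.count() <= threshold) rddA.union(rddB) else rddA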

Thanks,
Josh


docker spark 1.1.0 cluster

2014-10-24 Thread Josh J
Hi,

Are there Dockerfiles available that make it possible to set up a
Docker-based Spark 1.1.0 cluster?

Thanks,
Josh


streaming join sliding windows

2014-10-22 Thread Josh J
Hi,

How can I join neighboring sliding windows in Spark Streaming?

Thanks,
Josh


Re: countByWindow save the count ?

2014-08-26 Thread Josh J
Thanks. I'm just confused about the syntax; I'm not sure which variable
holds the count, or where its value is stored, so that I can save it. Any
examples or tips?
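
To check that I follow the suggestion below (foreachRDD on the result of
countByWindow), the shape would be something like this sketch, where the
window and slide durations are illustrative and saveToExternalStore is a
hypothetical stand-in for a Kafka or Redis write:

import org.apache.spark.streaming.Seconds

def saveToExternalStore(windowEndMs: Long, count: Long): Unit = ???  // hypothetical Kafka/Redis write

// Sketch only: countByWindow yields a DStream[Long] with one count per
// window; foreachRDD exposes that value so it can be written out. Note that
// countByWindow uses an incremental (inverse-reduce) window, so
// ssc.checkpoint(...) must be set.
val counts = stream.countByWindow(Seconds(30), Seconds(10))  // stream: the input DStream
counts.foreachRDD { (rdd, time) =>
  rdd.collect().foreach(count => saveToExternalStore(time.milliseconds, count))
}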


On Mon, Aug 25, 2014 at 9:49 PM, Daniil Osipov daniil.osi...@shazam.com
wrote:

 You could try to use foreachRDD on the result of countByWindow with a
 function that performs the save operation.


 On Fri, Aug 22, 2014 at 1:58 AM, Josh J joshjd...@gmail.com wrote:

 Hi,

 Hopefully a simple question: is there an example of how to save the output
 of countByWindow? I would like to save the results to external storage
 (Kafka or Redis). The examples only show stream.print().

 Thanks,
 Josh





countByWindow save the count ?

2014-08-22 Thread Josh J
Hi,

Hopefully a simple question: is there an example of how to save the output
of countByWindow? I would like to save the results to external storage
(Kafka or Redis). The examples only show stream.print().

Thanks,
Josh


multiple windows from the same DStream ?

2014-08-21 Thread Josh J
Hi,

Can I build two sliding windows in parallel from the same DStream? Will
these two windowed streams run in parallel and process the same data? I wish
to apply two different functions (aggregation on one window and storage on
the other) to the same original DStream data, with the same window sizes.

JavaPairReceiverInputDStream<String, String> messages =
    KafkaUtils.createStream(jssc, args[0], args[1], topicMap);

Duration windowLength = new Duration(3);
Duration slideInterval = new Duration(3);
JavaPairDStream<String, String> windowMessages1 =
    messages.window(windowLength, slideInterval);
JavaPairDStream<String, String> windowMessages2 =
    messages.window(windowLength, slideInterval);


Thanks,
Josh


DStream start a separate DStream

2014-08-21 Thread Josh J
Hi,

I would like to have a sliding-window DStream perform a streaming
computation and store the results. Once those results are stored, I would
then like to process them. However, I must wait until the final computation
is done for all tuples in the sliding window before I begin the new DStream.
How can I accomplish this with Spark?

Sincerely,
Josh


Difference between amplab docker and spark docker?

2014-08-20 Thread Josh J
Hi,

What's the difference between the AMPLab Docker scripts
(https://github.com/amplab/docker-scripts) and the Docker files in the Spark
repo (https://github.com/apache/spark/tree/master/docker)?

Thanks,
Josh