sortBy transformation shows as a job

2016-01-05 Thread Soumitra Kumar
Fellows, I have some simple code: sc.parallelize(Array(1, 4, 3, 2), 2).sortBy(i => i).foreach(println) This results in 2 jobs (sortBy, foreach) in Spark's application master UI. I thought there was a one-to-one relationship between RDD actions and jobs. Here, the only action is foreach, so there should be only
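
The second job is expected: sortBy shuffles through a RangePartitioner, which first samples the RDD to compute partition boundaries, and that sampling runs as a job of its own. A minimal sketch to reproduce the behavior:

    // Two jobs appear in the UI: one from sortBy's range-partition sampling,
    // one from the foreach action itself.
    val rdd = sc.parallelize(Array(1, 4, 3, 2), 2)
    rdd.sortBy(i => i).foreach(println)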

Re: Use Spark Streaming for Batch?

2015-02-22 Thread Soumitra Kumar
See if https://issues.apache.org/jira/browse/SPARK-3660 helps you. My patch has been accepted, and this enhancement is scheduled for 1.3.0. It lets you specify an initial RDD for the updateStateByKey operation. Let me know if you need any information. On Sun, Feb 22, 2015 at 5:21 PM, Tobias Pfeiffer
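
For reference, a minimal sketch of how the SPARK-3660 overload is used (the update function, the pair DStream wordCounts, and the seed values here are illustrative; assumes Spark 1.3.0+):

    // Seed updateStateByKey with state carried over from a previous run.
    import org.apache.spark.HashPartitioner

    val initialRDD = sc.parallelize(Seq(("hello", 1), ("world", 1)))
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0))
    val stateDStream = wordCounts.updateStateByKey[Int](
      updateFunc, new HashPartitioner(sc.defaultParallelism), initialRDD)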

Re: Error reporting/collecting for users

2015-01-27 Thread Soumitra Kumar
It is a Streaming application, so how/when do you plan to access the accumulator on the driver? On Tue, Jan 27, 2015 at 6:48 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, thanks for your mail! On Wed, Jan 28, 2015 at 11:44 AM, Tathagata Das tathagata.das1...@gmail.com wrote: That seems

Re: Stop streaming context gracefully when SIGTERM is passed

2014-12-15 Thread Soumitra Kumar
Hi Adam, I have the following Scala actor-based code to do a graceful shutdown: class TimerActor (val timeout : Long, val who : Actor) extends Actor { def act { reactWithin (timeout) { case TIMEOUT => who ! SHUTDOWN } } } class SSCReactor (val ssc :
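
An alternative that avoids the since-deprecated Scala actors library is a JVM shutdown hook, which runs when SIGTERM arrives. A sketch, assuming ssc is the application's StreamingContext:

    // Stop the StreamingContext gracefully when the JVM receives SIGTERM.
    sys.addShutdownHook {
      // Let in-flight batches finish before tearing everything down.
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
    ssc.start()
    ssc.awaitTermination()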

Re: Question about textFileStream

2014-11-10 Thread Soumitra Kumar
Entire file in a window. On Mon, Nov 10, 2014 at 9:20 AM, Saiph Kappa saiph.ka...@gmail.com wrote: Hi, In my application I am doing something like this: new StreamingContext(sparkConf, Seconds(10)).textFileStream("logs/"), and I get some unknown exceptions when I copy a file of about 800 MB
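
That is, textFileStream does not split a file across batches: each file that appears in the monitored directory is read in its entirety during the batch in which it is detected. A minimal sketch, assuming an existing sparkConf:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sparkConf, Seconds(10))
    val lines = ssc.textFileStream("logs/")
    lines.count().print()   // line count of files detected in this batch
    ssc.start()
    ssc.awaitTermination()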

Re: How to add HBase dependencies and conf with spark-submit?

2014-10-16 Thread Soumitra Kumar
Great, it worked. I don't have an answer as to what is special about SPARK_CLASSPATH vs --jars; I just found the working setting through trial and error. - Original Message - From: Fengyun RAO raofeng...@gmail.com To: Soumitra Kumar kumar.soumi...@gmail.com Cc: user@spark.apache.org, u

How to name a DStream

2014-10-16 Thread Soumitra Kumar
Hello, I am debugging my code to find out what else to cache. Following is a line in the log: 14/10/16 12:00:01 INFO TransformedDStream: Persisting RDD 6 for time 141348600 ms to StorageLevel(true, true, false, false, 1) at time 141348600 ms Is there a way to name a DStream? RDD has a
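
As far as I know, DStream does not expose a public name setter, but the underlying RDDs can be named with RDD.setName, which is what appears in the web UI's Storage tab. A sketch, assuming an existing dstream:

    // Name each batch's RDD so it is identifiable in logs and the UI.
    dstream.foreachRDD { rdd =>
      rdd.setName("my-dstream-batch")
      rdd.cache()
      rdd.count()   // materialize so the named, cached RDD shows up
    }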

Print dependency graph as DOT file

2014-10-16 Thread Soumitra Kumar
Hello, Is there a way to print the dependency graph of the complete program, or of an RDD/DStream, as a DOT file? It would be very helpful to have such a thing. Thanks, -Soumitra.
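
There was no built-in DOT export at the time, as far as I know; the closest built-in is RDD.toDebugString, which prints the lineage as indented text:

    // Print an RDD's dependency (lineage) graph as indented text.
    val rdd = sc.parallelize(1 to 100, 4).map(_ * 2).filter(_ % 3 == 0)
    println(rdd.toDebugString)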

Re: How to add HBase dependencies and conf with spark-submit?

2014-10-15 Thread Soumitra Kumar
I am writing to HBase; the following are my options:
export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar
spark-submit \
    --jars
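
For context, a hedged sketch of the full shape of such a submission; the jar list, hbase-site.xml path, class name, and application jar are all illustrative and depend on your CDH layout:

    # Illustrative spark-submit for an HBase-writing job on CDH.
    export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar
    spark-submit \
        --jars /opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar \
        --files /etc/hbase/conf/hbase-site.xml \
        --class MyHBaseApp \
        my-app.jar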

Re: Kafka-HDFS to store as Parquet format

2014-10-07 Thread Soumitra Kumar
to new schema. - Original Message - From: Buntu Dev buntu...@gmail.com To: Soumitra Kumar kumar.soumi...@gmail.com Cc: u...@spark.incubator.apache.org Sent: Tuesday, October 7, 2014 10:18:16 AM Subject: Re: Kafka-HDFS to store as Parquet format Thanks for the info Soumitra.. it's a good
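
For the era in question (pre-DataFrame Spark 1.1), the usual route was to give each batch a schema via a case class and write it with SchemaRDD.saveAsParquetFile. A sketch, where the Event class, field names, and output path are illustrative:

    // Persist each streaming batch to HDFS as Parquet (Spark 1.1-era API).
    import org.apache.spark.sql.SQLContext

    case class Event(key: String, value: String)
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD

    stream.foreachRDD { (rdd, time) =>
      rdd.map { case (k, v) => Event(k, v) }
        .saveAsParquetFile(s"hdfs:///events/batch-${time.milliseconds}")
    }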

Re: How to initialize updateStateByKey operation

2014-09-23 Thread Soumitra Kumar
I thought I did a good job ;-) OK, so what is the best way to initialize the updateStateByKey operation? I have counts from a previous spark-submit run, and want to load them in the next spark-submit job. - Original Message - From: Soumitra Kumar kumar.soumi...@gmail.com To: spark users user
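
One way to carry counts across runs: write the state out during one run and read it back as the initial RDD for the next (see SPARK-3660 above). A sketch with illustrative paths:

    // Persist the state each batch, tagged with the batch time...
    stateDStream.foreachRDD { (rdd, time) =>
      rdd.saveAsObjectFile(s"hdfs:///state/wordcounts-${time.milliseconds}")
    }
    // ...and on the next spark-submit, reload the latest snapshot.
    val initialRDD = sc.objectFile[(String, Int)]("hdfs:///state/wordcounts-<latest>")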

Re: Spark Streaming and ReactiveMongo

2014-09-19 Thread Soumitra Kumar
onStart should be non-blocking. You may try creating a thread in onStart instead. - Original Message - From: t1ny wbr...@gmail.com To: u...@spark.incubator.apache.org Sent: Friday, September 19, 2014 1:26:42 AM Subject: Re: Spark Streaming and ReactiveMongo Here's what we've tried so
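
The Receiver contract expects onStart to return immediately and leave the blocking work to a thread that feeds store(). A minimal sketch of that pattern (the data source is a placeholder):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class MyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
      def onStart() {
        // Spawn a thread and return right away; never block in onStart.
        new Thread("my-receiver") {
          override def run() {
            while (!isStopped) {
              store("record")   // replace with the real fetch, e.g. a Mongo cursor
            }
          }
        }.start()
      }
      def onStop() { /* the loop exits once isStopped becomes true */ }
    }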

Re: Bulk-load to HBase

2014-09-19 Thread Soumitra Kumar
I successfully did this once. Map the RDD to RDD[(ImmutableBytesWritable, KeyValue)], then: val conf = HBaseConfiguration.create() val job = new Job(conf, "CEF2HFile") job.setMapOutputKeyClass(classOf[ImmutableBytesWritable]); job.setMapOutputValueClass(classOf[KeyValue]); val table = new
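
For completeness, a hedged sketch of the whole bulk-load write path (HBase 0.96-era mapreduce API; table name and output path are illustrative, and rdd is assumed to already be RDD[(ImmutableBytesWritable, KeyValue)]):

    // Write the pairs as HFiles, ready for LoadIncrementalHFiles.
    import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
    import org.apache.hadoop.hbase.client.HTable
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
    import org.apache.hadoop.mapreduce.Job

    val conf = HBaseConfiguration.create()
    val job = new Job(conf, "CEF2HFile")
    job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setMapOutputValueClass(classOf[KeyValue])
    val table = new HTable(conf, "cef")   // illustrative table name
    HFileOutputFormat.configureIncrementalLoad(job, table)
    rdd.saveAsNewAPIHadoopFile("/tmp/hfiles",   // illustrative output path
      classOf[ImmutableBytesWritable], classOf[KeyValue],
      classOf[HFileOutputFormat], job.getConfiguration)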

Re: Stable spark streaming app

2014-09-17 Thread Soumitra Kumar
Hmm, no response to this thread! Adding to it: please share your experiences of building an enterprise-grade product based on Spark Streaming. I am exploring Spark Streaming for enterprise software and am cautiously optimistic about it. I see huge potential to improve the debuggability of Spark. -

Re: Stable spark streaming app

2014-09-17 Thread Soumitra Kumar
://issues.apache.org/jira/browse/SPARK-2316 Best I understand and have been told, this does not affect data integrity but may cause unnecessary recomputes. Hope this helps, Tim On Wed, Sep 17, 2014 at 8:30 AM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Hmm, no response to this thread! Adding

Re: How to initialize StateDStream

2014-09-13 Thread Soumitra Kumar
I had looked at that. If I have a set of saved word counts from a previous run, and want to load them in the next run, what is the best way to do it? I am thinking of hacking the Spark code to keep an initial RDD in StateDStream, and use that for the first batch. On Fri, Sep 12, 2014 at 11:04

Re: How to initialize StateDStream

2014-09-13 Thread Soumitra Kumar
Thanks for the pointers. I meant a previous run of spark-submit. For 1: this would add a bit more computation to every batch. For 2: it's a good idea, but it may be inefficient to retrieve each value. In general, for a generic state machine, the initialization and input sequence are critical for

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-07 Thread Soumitra Kumar
I have the following code: stream foreachRDD { rdd => if (rdd.take(1).size == 1) { rdd foreachPartition { iterator => initDbConnection() iterator foreach { write to db
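
Completed and hedged (initDbConnection, writeToDb, and close stand in for a real DB client), the pattern opens one connection per partition rather than per record, with the take(1) guard skipping empty batches:

    stream.foreachRDD { rdd =>
      if (rdd.take(1).nonEmpty) {           // cheap check for an empty batch
        rdd.foreachPartition { iterator =>
          val conn = initDbConnection()     // per partition, on the executor
          iterator.foreach(record => writeToDb(conn, record))
          conn.close()
        }
      }
    }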

Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Thread Soumitra Kumar
Yes, that is an option. I started with a function of batch time and index to generate the id as a Long. This may be faster than generating a UUID, with the added benefit of sorting based on time. - Original Message - From: Tathagata Das tathagata.das1...@gmail.com To: Soumitra Kumar kumar.soumi
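
The transform overload that receives the batch time makes this easy; a sketch (the packing scheme, milliseconds times 1e6 plus index, is illustrative and assumes well under a million records per batch):

    // Derive a time-sortable Long id from the batch time and per-RDD index.
    val withIds = dstream.transform { (rdd, time) =>
      rdd.zipWithIndex.map { case (record, i) =>
        (time.milliseconds * 1000000L + i, record)
      }
    }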

Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
Hello, If I do: DStream transform { rdd.zipWithIndex.map { ... } } is the index guaranteed to be unique across all RDDs here? Thanks, -Soumitra.

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
So, I guess zipWithUniqueId will be similar. Is there a way to get a unique index? On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote: No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
. Then you can use rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid) Just a hack .. On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: So, I guess zipWithUniqueId will be similar. Is there a way to get a unique index? On Wed, Aug 27, 2014

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
of the RDDs would contain more than 1 billion records. Then you can use rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid) Just a hack .. On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: So, I guess zipWithUniqueId will be similar. Is there a way
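
Put together inside transform, the hack looks like this; the 1e9 multiplier assumes no single RDD holds a billion or more records:

    // Make zipWithUniqueId indices globally unique by folding in the RDD id.
    val unique = dstream.transform { rdd =>
      rdd.zipWithUniqueId().map { case (record, uid) =>
        (record, rdd.id * 1e9.toLong + uid)
      }
    }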

Shared variable in Spark Streaming

2014-08-08 Thread Soumitra Kumar
Hello, I want to count the number of elements in the DStream, like RDD.count(). Since there is no such method on a DStream, I thought of using DStream.count and an accumulator. How do I do DStream.count() to count the number of elements in a DStream? How do I create a shared variable in

Re: Shared variable in Spark Streaming

2014-08-08 Thread Soumitra Kumar
variable will reside in the driver and will keep being updated after every batch. TD On Thu, Aug 7, 2014 at 10:16 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Hello, I want to count the number of elements in the DStream, like RDD.count() . Since there is no such method in DStream, I
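
TD's suggestion in a minimal sketch (assuming an existing StreamingContext ssc and DStream dstream; sc.accumulator is the pre-2.0 API):

    // Count elements across batches; the accumulator lives on the driver.
    val total = ssc.sparkContext.accumulator(0L, "total records")
    dstream.foreachRDD { rdd =>
      total += rdd.count()   // rdd.count() runs a job; += happens on the driver
      println(s"seen so far: ${total.value}")
    }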

Re: HBase row count

2014-02-25 Thread Soumitra Kumar
running a count every time on the full dataset) then caching is not going to help you. On Tue, Feb 25, 2014 at 4:59 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Thanks Nick. How do I figure out if the RDD fits in memory? On Tue, Feb 25, 2014 at 1:04 AM, Nick Pentreath nick.pentre
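
For reference, a sketch of how such a row count is usually set up (table name is illustrative); whether caching then helps depends on whether the scanned data fits in memory, which the web UI's Storage tab shows per cached RDD:

    // Count HBase rows by scanning the table through TableInputFormat.
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "mytable")   // illustrative name
    val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    hbaseRDD.cache()          // only pays off if the data fits in memory
    println(hbaseRDD.count())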