Re: Receiver Fault Tolerance

2015-05-06 Thread Tathagata Das
Incorrect. The receiver runs in an executor just like any other task. In cluster mode, the driver runs in a worker; however, it launches executors on OTHER workers in the cluster. It is those executors running on other workers that run tasks, and also the receivers. On Wed, May 6, 2015 at 5:09

Re: How update counter in cassandra

2015-05-06 Thread Tathagata Das
This may help. http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala On Wed, May 6, 2015 at 5:35 PM, Sergio Jiménez Barrio wrote: > I have a Counter family colums in Cassandra. I want update this counters > with a aplication in sp

Re: DStream Union vs. StreamingContext Union

2015-05-12 Thread Tathagata Das
@Vadim What happened when you tried unioning using DStream.union in python? TD On Tue, May 12, 2015 at 9:53 AM, Evo Eftimov wrote: > I can confirm it does work in Java > > > > *From:* Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] > *Sent:* Tuesday, May 12, 2015 5:53 PM > *To:* Evo Eftimo

Re: DStream Union vs. StreamingContext Union

2015-05-12 Thread Tathagata Das
lse). > > > On Tue, May 12, 2015 at 12:57 PM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > >> @Vadim What happened when you tried unioning using DStream.union in >> python? >> >> TD >> >> On Tue, May 12, 2015 at 9:53 AM

Re: how to use rdd.countApprox

2015-05-12 Thread Tathagata Das
From the code it seems that as soon as the "rdd.countApprox(5000)" returns, you can call "pResult.initialValue()" to get the approximate count at that point of time (that is after timeout). Calling "pResult.getFinalValue()" will further block until the job is over, and give the final correct valu
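
A minimal sketch of the usage described above, assuming a local SparkContext and a synthetic RDD; the 5000 ms timeout and 0.95 confidence are only illustrative values:

    import org.apache.spark.{SparkConf, SparkContext}

    object CountApproxSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("countApprox-sketch").setMaster("local[2]"))
        val rdd = sc.parallelize(1 to 1000000, 8)

        // Returns after at most ~5000 ms with a PartialResult handle.
        val pResult = rdd.countApprox(5000, 0.95)

        // Approximate count, available as soon as countApprox returns.
        val approx = pResult.initialValue
        println(s"approx count: ${approx.mean} in [${approx.low}, ${approx.high}]")

        // Blocks until the job finishes, then gives the exact count.
        val exact = pResult.getFinalValue()
        println(s"final count: ${exact.mean}")

        sc.stop()
      }
    }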

Re: how to use rdd.countApprox

2015-05-13 Thread Tathagata Das
ks, > Du > > > > On Wednesday, May 13, 2015 10:33 AM, Du Li > wrote: > > > Hi TD, > > Thanks a lot. rdd.countApprox(5000).initialValue worked! Now my streaming > app is standing a much better chance to complete processing each batch > within the batch interval. >

Re: how to use rdd.countApprox

2015-05-13 Thread Tathagata Das
> > > > On Wednesday, May 13, 2015 12:02 PM, Tathagata Das > wrote: > > > That is a good question. I don't see a direct way to do that. > > You could try the following > > val jobGroupId = > rdd.sparkContext.setJobGroup(jobGroupId) > val approxCount
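
The quoted snippet is cut off by the archive; it appears to be heading toward something like the following sketch, which tags the approximate-count job with a job group, takes the initial value, and then cancels the group so the exact count does not run to completion. The group id string is made up, and rdd is assumed to be the RDD from the thread:

    // Hypothetical job-group id; any unique string works.
    val jobGroupId = s"approx-count-${System.currentTimeMillis()}"
    rdd.sparkContext.setJobGroup(jobGroupId, "approximate count", interruptOnCancel = true)

    // Grab the approximate count once the timeout has elapsed...
    val approxCount = rdd.countApprox(5000).initialValue

    // ...and cancel the still-running exact count, since it is no longer needed.
    rdd.sparkContext.cancelJobGroup(jobGroupId)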

Re: how to use rdd.countApprox

2015-05-13 Thread Tathagata Das
uations similar to countApprox. > > Thanks, > Du > > > > On Wednesday, May 13, 2015 1:12 PM, Tathagata Das > wrote: > > > That is not supposed to happen :/ That is probably a bug. > If you have the log4j logs, would be good to file a JIRA. This may be > worth debuggin

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-14 Thread Tathagata Das
It would be good if you could tell me what I should add to the documentation to make it easier to understand. I can update the docs for the 1.4.0 release. On Tue, May 12, 2015 at 9:52 AM, Lee McFadden wrote: > Thanks for explaining Sean and Cody, this makes sense now. I'd like to > help improve this doc

Re: [Unit Test Failure] Test org.apache.spark.streaming.JavaAPISuite.testCount failed

2015-05-14 Thread Tathagata Das
Do you get this failure repeatedly? On Thu, May 14, 2015 at 12:55 AM, kf wrote: > Hi, all, i got following error when i run unit test of spark by > dev/run-tests > on the latest "branch-1.4" branch. > > the latest commit id: > commit d518c0369fa412567855980c3f0f426cde5c190d > Author: zsxwing

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Tathagata Das
What is the error you are seeing? TD On Thu, May 14, 2015 at 9:00 AM, Erich Ess wrote: > Hi, > > Is it possible to setup streams from multiple Kinesis streams and process > them in a single job? From what I have read, this should be possible, > however, the Kinesis layer errors out whenever I

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Tathagata Das
> union'd? > > it should work the same way - including union() of streams from totally > different source types (kafka, kinesis, flume). > > > > On Thu, May 14, 2015 at 2:07 PM, Tathagata Das > wrote: > >> What is the error you are seeing? >> >> T

Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-15 Thread Tathagata Das
How are you configuring the fair scheduler pools? On Fri, May 15, 2015 at 8:33 AM, Evo Eftimov wrote: > I have run / submitted a few Spark Streaming apps configured with Fair > scheduling on Spark Streaming 1.2.0, however they still run in a FIFO mode. > Is FAIR scheduling supported at all for S

Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-16 Thread Tathagata Das
pported for Spark Streaming Apps >> >> >> >> *From:* Richard Marscher [mailto:rmarsc...@localytics.com] >> *Sent:* Friday, May 15, 2015 7:20 PM >> *To:* Evo Eftimov >> *Cc:* Tathagata Das; user >> *Subject:* Re: Spark Fair Scheduler for Spark Streaming - 1

Re: Spark groupByKey, does it always create at least 1 partition per key?

2015-05-18 Thread Tathagata Das
By definition, all the values of a key will be only in one partition. This is some of the oldest API in Spark and will continue to work as it is now. On Mon, May 18, 2015 at 10:38 AM, tomboyle wrote: > I am currently using spark streaming. During my batch processing I must > groupByKey. Afterwar

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-18 Thread Tathagata Das
If you wanted to stop it gracefully, then why are you not calling ssc.stop(stopGracefully = true, stopSparkContext = true)? Then it doesn't matter whether the shutdown hook was called or not. TD On Mon, May 18, 2015 at 9:43 PM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > Hi,
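
For reference, a small sketch of the graceful stop referred to above, wired into a JVM shutdown hook; ssc is assumed to be the running StreamingContext:

    // Stop gracefully: finish processing all already-received data before shutting down.
    sys.addShutdownHook {
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }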

Re: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-18 Thread Tathagata Das
If you want the fileStream to start only after a certain event has happened, why not start the StreamingContext after that event? TD On Sun, May 17, 2015 at 7:51 PM, Haopu Wang wrote: > I want to use file stream as input. And I look at SparkStreaming > document again, it's saying file strea

Re: Interactive modification of DStreams

2014-06-02 Thread Tathagata Das
Currently Spark Streaming does not support addition/deletion/modification of DStream after the streaming context has been started. Nor can you restart a stopped streaming context. Also, multiple spark contexts (and therefore multiple streaming contexts) cannot be run concurrently in the same JVM.

Re: Window slide duration

2014-06-02 Thread Tathagata Das
I am assuming that you are referring to the "OneForOneStrategy: key not found: 1401753992000 ms" error, and not to the previous "Time 1401753992000 ms is invalid ...". Those two seem a little unrelated to me. Can you give us the stacktrace associated with the key-not-found error? TD On Mon, Jun

Re: NoSuchElementException: key not found

2014-06-02 Thread Tathagata Das
Do you have the info level logs of the application? Can you grep the value "32855" to find any references to it? Also, what version of Spark are you using (so that I can match the stack trace; it does not seem to match Spark 1.0)? TD On Mon, Jun 2, 2014 at 3:27 PM, Michael Chang wrote: >

Re: Window slide duration

2014-06-02 Thread Tathagata Das
Can you give all the logs? Would like to see what is clearing the key " 1401754908000 ms" TD On Mon, Jun 2, 2014 at 5:38 PM, Vadim Chekan wrote: > Ok, it seems like "Time ... is invalid" is part of normal workflow, when > window DStream will ignore RDDs at moments in time when they do not matc

Re: Failed to remove RDD error

2014-06-02 Thread Tathagata Das
spark.streaming.unpersist was an experimental feature introduced in Spark 0.9 (but kept disabled), which actively clears off RDDs that are no longer useful. In Spark 1.0 it has been enabled by default. It is possible that this is an unintended side-effect of that. If spark.cleaner.ttl works
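
A small configuration sketch showing the two settings mentioned above side by side; the values are placeholders, not recommendations:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("cleanup-config-sketch")
      // Actively unpersist RDDs generated by Spark Streaming once no longer needed
      // (enabled by default from Spark 1.0; "false" falls back to the old behaviour).
      .set("spark.streaming.unpersist", "true")
      // Periodic age-based metadata/RDD cleanup, in seconds.
      .set("spark.cleaner.ttl", "3600")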

Re: Failed to remove RDD error

2014-06-03 Thread Tathagata Das
, Michael Chang wrote: > Thanks Tathagata, > > Thanks for all your hard work! In the future, is it possible to mark > "experimental" features as such on the online documentation? > > Thanks, > Michael > > > On Mon, Jun 2, 2014 at 6:12 PM, Tathagata Das > w

Re: NoSuchElementException: key not found

2014-06-03 Thread Tathagata Das
condition. TD On Tue, Jun 3, 2014 at 10:09 AM, Michael Chang wrote: > I only had the warning level logs, unfortunately. There were no other > references of 32855 (except a repeated stack trace, I believe). I'm using > Spark 0.9.1 > > > On Mon, Jun 2, 2014 at 5:50 PM

Re: NoSuchElementException: key not found

2014-06-03 Thread Tathagata Das
Hi Tathagata, > > Thanks for your help! By not using coalesced RDD, do you mean not > repartitioning my Dstream? > > Thanks, > Mike > > > > > On Tue, Jun 3, 2014 at 12:03 PM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > >> I think I know w

Re: custom receiver in java

2014-06-04 Thread Tathagata Das
Yes, thanks for updating this old thread! We heard our community's demands and added support for Java receivers! TD On Wed, Jun 4, 2014 at 12:15 PM, lbustelo wrote: > Note that what TD was referring to above is already in 1.0.0 > > http://spark.apache.org/docs/1.0.0/streaming-custom-receivers.html > >

Re: NullPointerException on reading checkpoint files

2014-06-12 Thread Tathagata Das
Are you using broadcast variables in the streaming program? That might be the cause of this. As far as I can see, broadcast variables are probably not supported through restarts and checkpoints. TD On Thu, Jun 12, 2014 at 12:16 PM, Kiran wrote: > I am also seeing similar problem when trying t

Re: running Spark Streaming just once and stop it

2014-06-12 Thread Tathagata Das
You should be able to see the streaming tab in the Spark web UI (running on port 4040) if you have created a StreamingContext and you are using Spark 1.0. TD On Thu, Jun 12, 2014 at 1:06 AM, Ravi Hemnani wrote: > Hey, > > I did sparkcontext.addstreaminglistener(streaminglistener object) in my >

Re: Spark Streaming not processing file with particular number of entries

2014-06-13 Thread Tathagata Das
This is very odd. If it is running fine on Mesos, I don't see an obvious reason why it won't work on a Spark standalone cluster. Is the .4 million file already present in the monitored directory when the context is started? In that case, the file will not be picked up (unless textFileStream is created w

Re: Spark Streaming not processing file with particular number of entries

2014-06-13 Thread Tathagata Das
In the logs you posted (the 2nd set), I don't see the file being picked up. The lines having "FileInputDStream: Finding new files ..." should show the file name that has been picked up, and I don't see any file in the second set of logs. If the file is already present in the directory by the time streami

Re: Long running Spark Streaming Job increasing executing time per batch

2014-06-19 Thread Tathagata Das
That's quite odd. Yes, with checkpointing the lineage does not increase. Can you tell which stage in the processing of each batch is causing the increase in the processing time? Also, what is the batch interval, and checkpoint interval? TD On Thu, Jun 19, 2014 at 8:45 AM, Skogberg, Fredrik < fredri

Re: Issue while trying to aggregate with a sliding window

2014-06-19 Thread Tathagata Das
zeroTime marks the time when the streaming job started, and the first batch of data is from zeroTime to zeroTime + slideDuration. The validity check of (time - zeroTime) being a multiple of slideDuration is to ensure that, for a given dstream, it generates RDDs at the right times. For example, say the b

Re: How to achieve reasonable performance on Spark Streaming?

2014-06-19 Thread Tathagata Das
Hello all, Apologies for the late response, this thread went below my radar. There are a number of things that can be done to improve the performance. Here are some of them off the top of my head, based on what you have mentioned. Most of them are mentioned in the streaming guide's performance tunin

Re: Possible approaches for adding extra metadata (Spark Streaming)?

2014-06-20 Thread Tathagata Das
If the metadata is directly related to each individual record, then it can be done either way. Since I am not sure how easy or hard it will be for you to add tags before putting the data into Spark Streaming, it's hard to recommend one method over the other. However, if the metadata is related to ea

Re: Long running Spark Streaming Job increasing executing time per batch

2014-06-20 Thread Tathagata Das
In the Spark web UI, you should see the same pattern of stages repeating over time, as the same sequence of stages gets computed in every batch. From that you would be able to get a sense of how long corresponding stages take across different batches, and which stage is actually taking more time,

Re: Could not compute split, block not found

2014-06-30 Thread Tathagata Das
Are you by any chance using only memory in the storage level of the input streams? TD On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer wrote: > Bill, > > let's say the processing time is t' and the window size t. Spark does not > *require* t' < t. In fact, for *temporary* peaks in your streami

Re: Spark Streaming - two questions about the streamingcontext

2014-07-09 Thread Tathagata Das
1. Multiple output operations are processed in the order they are defined. That is because, by default, only one output operation is processed at a time. This *can* be parallelized using an undocumented config parameter "spark.streaming.concurrentJobs", which is by default set to 1. 2. Yes, the outpu
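
A sketch of how the undocumented setting mentioned in (1) would be applied, assuming a simple StreamingContext setup; treat it as experimental, exactly as the message says:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("concurrent-output-ops-sketch")
      .setMaster("local[4]")
      // Undocumented/experimental: run up to 2 streaming jobs (output operations) at a time.
      .set("spark.streaming.concurrentJobs", "2")

    val ssc = new StreamingContext(conf, Seconds(10))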

Re: Restarting a Streaming Context

2014-07-09 Thread Tathagata Das
I confirm that is indeed the case. It is designed to be so because it keeps things simpler - fewer chances of issues related to cleanup when stop() is called. Also, it keeps things consistent with the Spark context - once a Spark context is stopped it cannot be used any more. You can create a new s

Re: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-07-10 Thread Tathagata Das
Do you see any errors in the logs of the driver? On Thu, Jul 10, 2014 at 1:21 AM, Haopu Wang wrote: > I'm running an App for hours in a standalone cluster. From the data > injector and "Streaming" tab of web ui, it's running well. > > However, I see quite a lot of Active stages in web ui even s

Re: KMeans code is rubbish

2014-07-10 Thread Tathagata Das
I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your dataset as well, and I got the expected answer. And I believe that even though initialization is done using sampling, the example actually sets the seed to a constant 42, so the result should always be the same no matter how m

Re: Getting Persistent Connection using socketStream?

2014-07-10 Thread Tathagata Das
The implementation of the input-stream-to-iterator function in #2 is incorrect. The function should be such that, when hasNext is called on the iterator, it should try to read from the buffered reader. If an object (that is, a line) can be read, then return it; otherwise block and wait for data t
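
A sketch of an input-stream-to-iterator converter along the lines described above: hasNext blocks on the buffered reader until a line arrives (or the stream ends), and next() hands the line over. The host, port, storage level, and ssc are assumptions for illustration:

    import java.io.{BufferedReader, InputStream, InputStreamReader}
    import org.apache.spark.storage.StorageLevel

    def lineIterator(in: InputStream): Iterator[String] = new Iterator[String] {
      private val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
      private var nextLine: String = null

      // Blocks in readLine() until a full line is available; returns false at end of stream.
      override def hasNext: Boolean = {
        if (nextLine == null) nextLine = reader.readLine()
        nextLine != null
      }

      override def next(): String = {
        if (!hasNext) throw new NoSuchElementException("end of stream")
        val line = nextLine
        nextLine = null
        line
      }
    }

    val lines = ssc.socketStream("localhost", 9999, lineIterator, StorageLevel.MEMORY_AND_DISK_SER)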

Re: What version of twitter4j should I use with Spark Streaming?

2014-07-10 Thread Tathagata Das
Spark Streaming uses twitter4j 3.0.3. 3.0.6 should probably work fine. The exception that you are seeing is something that should be looked into. Can you give us more logs (especially executor logs) with the stack traces that have the error? TD On Thu, Jul 10, 2014 at 2:42 PM, Nick Chammas wrote: >

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-10 Thread Tathagata Das
How are you supplying the text file? On Wed, Jul 9, 2014 at 11:51 AM, M Singh wrote: > Hi Folks: > > I am working on an application which uses spark streaming (version 1.1.0 > snapshot on a standalone cluster) to process text file and save counters in > cassandra based on fields in each row. I

Re: Number of executors change during job running

2014-07-10 Thread Tathagata Das
Are you specifying the number of reducers in all the DStream *ByKey operations? If the number of reducers is not set, then the number of reducers used in the stages can keep changing across batches. TD On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay wrote: > Hi all, > > I have a Spark streaming job run
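
A sketch of what fixing the number of reducers looks like on a pair DStream; pairs is an assumed DStream[(String, Int)] and 32 is an arbitrary example value:

    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations

    // Fix the shuffle parallelism so it stays the same across batches.
    val counts = pairs.reduceByKey((a: Int, b: Int) => a + b, 32)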

Re: "NoSuchElementException: key not found" when changing the window lenght and interval in Spark Streaming

2014-07-10 Thread Tathagata Das
This bug has been fixed. Either use the master branch of Spark, or maybe wait a few days for Spark 1.0.1 to be released (voting has successfully closed). TD On Thu, Jul 10, 2014 at 2:33 AM, richiesgr wrote: > Hi > > I get exactly the same problem here, do you've found the problem ? > Thanks >

Re: Spark Streaming using File Stream in Java

2014-07-10 Thread Tathagata Das
The fileStream is not designed to work with continuously updating files, as one of the main design goals of Spark is immutability (to guarantee fault-tolerance by recomputation), and files that are appended to (mutated) defeat that. It is rather designed to pick up new files added atomically (using
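
A sketch of the intended usage: write the file somewhere else first, then move (rename) it into the monitored directory so it appears atomically. The paths are made up and ssc is an assumed StreamingContext:

    // Monitors the directory for new files; each newly detected file becomes part of one batch.
    val lines = ssc.textFileStream("hdfs:///data/incoming")

    // Outside the application, land files atomically, e.g.:
    //   hdfs dfs -put  local-part-0001.txt  /data/_staging/
    //   hdfs dfs -mv   /data/_staging/local-part-0001.txt  /data/incoming/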

Re: Number of executors change during job running

2014-07-10 Thread Tathagata Das
> more then 3 minutes. Thanks! > > Bill > > > On Thu, Jul 10, 2014 at 4:28 PM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > >> Are you specifying the number of reducers in all the DStream.ByKey >> operations? If the reduce by key is not set, then

Re: Some question about SQL and streaming

2014-07-10 Thread Tathagata Das
Yeah, the right solution is to have something like SchemaDStream, where the schema of all the SchemaRDDs generated by it can be stored. Something I really would like to see happen in the future :) TD On Thu, Jul 10, 2014 at 6:37 PM, Tobias Pfeiffer wrote: > Hi, > > I think it would be great if

Re: Getting Persistent Connection using socketStream?

2014-07-10 Thread Tathagata Das
Right, this uses NextIterator, which is elsewhere in the repo. It just makes it cleaner to implement a custom iterator. But I guess you got the high-level point, so it's okay. TD On Thu, Jul 10, 2014 at 7:21 PM, kytay wrote: > Hi TD > > Thanks. > > I have problem understanding the codes in githu

Re: Join two Spark Streaming

2014-07-10 Thread Tathagata Das
Do you want to continuously maintain the set of unique integers seen since the beginning of the stream? var uniqueValuesRDD: RDD[Int] = ... dstreamOfIntegers.transform(newDataRDD => { val newUniqueValuesRDD = newDataRDD.union(distinctValues).distinct uniqueValuesRDD = newUniqueValuesRDD //
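
The snippet is cut off by the archive; a hedged completion of the same pattern (names taken from the visible preview where possible, the rest assumed) might look like this:

    import org.apache.spark.rdd.RDD

    var uniqueValuesRDD: RDD[Int] = ssc.sparkContext.parallelize(Seq.empty[Int])

    val uniques = dstreamOfIntegers.transform { newDataRDD =>
      // Union the new batch with everything seen so far and de-duplicate.
      val newUniqueValuesRDD = newDataRDD.union(uniqueValuesRDD).distinct()
      uniqueValuesRDD = newUniqueValuesRDD
      // Checkpoint so the lineage of the accumulated RDD does not grow without bound
      // (requires ssc.checkpoint(...) / a checkpoint directory to be set).
      uniqueValuesRDD.checkpoint()
      uniqueValuesRDD
    }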

Re: Join two Spark Streaming

2014-07-11 Thread Tathagata Das
e batch instead of accumulative number of unique integers. > > I do have two questions about your code: > > 1. Why do we need uniqueValuesRDD? Why do we need to call > uniqueValuesRDD.checkpoint()? > > 2. Where is distinctValues defined? > > Bill > > > On Thu, Jul

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread Tathagata Das
t should be the optimal value for this. > > Thanks. > > > On Thursday, July 10, 2014 7:24 PM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > > > How are you supplying the text file? > > > On Wed, Jul 9, 2014 at 11:51 AM, M Singh wrote: > >

Re: writing FLume data to HDFS

2014-07-11 Thread Tathagata Das
What is the error you are getting when you say "I was trying to write the data to hdfs..but it fails…"? TD On Thu, Jul 10, 2014 at 1:36 PM, Sundaram, Muthu X. < muthu.x.sundaram@sabre.com> wrote: > I am new to spark. I am trying to do the following. > > Netcat → Flume → Spark streaming (process

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread Tathagata Das
So, is it expected for the process to generate stages/tasks even after > processing a file ? > > Also, is there a way to figure out the file that is getting processed and > when that process is complete ? > > Thanks > > > On Friday, July 11, 2014 1:51 PM, Tathagata Das &

Re: How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-11 Thread Tathagata Das
The same executor can be used for both receiving and processing, irrespective of the deployment mode (YARN, Spark standalone, etc.). It boils down to the number of cores / task slots that the executor has. Each receiver is like a long-running task, so each of them occupies a slot. If there are free slots
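
A small illustration of the core accounting described above, under the assumption of two socket receivers running locally; the hostnames and port are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // 4 task slots: 2 will be pinned by the two receivers (long-running tasks),
    // leaving 2 slots free to process the received data.
    val conf = new SparkConf().setAppName("receiver-slots-sketch").setMaster("local[4]")
    val ssc = new StreamingContext(conf, Seconds(5))

    val a = ssc.socketTextStream("host-a", 9999)   // receiver #1, occupies one slot
    val b = ssc.socketTextStream("host-b", 9999)   // receiver #2, occupies one slot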

Re: Number of executors change during job running

2014-07-11 Thread Tathagata Das
t the computation can be efficiently finished. I am > not sure how to achieve this. > > Thanks! > > Bill > > > On Thu, Jul 10, 2014 at 5:59 PM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > >> Can you try setting the number-of-partitions in all

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Tathagata Das
You don't get any exception from twitter.com, saying credential error or something? I have seen this happen once when someone was behind a VPN to their office, and Twitter was probably blocked there. You could be having a similar issue. TD On Fri, Jul 11, 2014 at 2:57 PM, SK wrote: > Hi, > >

Re: Some question about SQL and streaming

2014-07-11 Thread Tathagata Das
Yes, even though we don't have immediate plans, I would definitely like to see it happen some time in the not-so-distant future. TD On Thu, Jul 10, 2014 at 7:55 PM, Shao, Saisai wrote: > No specific plans to do so, since there has some functional loss like > time based windowing function which is

Re: Generic Interface between RDD and DStream

2014-07-11 Thread Tathagata Das
I totally agree that if we are able to do this it will be very cool. However, this requires having a common trait / interface between RDD and DStream, which we don't have as of now. It would be very cool though. On my wish list for sure. TD On Thu, Jul 10, 2014 at 11:53 AM, mshah wrote: >

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Tathagata Das
Does nothing get printed on the screen? If you are not getting any tweets but Spark Streaming is running successfully, you should at least get counts being printed every batch (which would be zero). But if they are not being printed either, check the Spark web UI to see whether stages are running or not. If th

Re: How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-11 Thread Tathagata Das
me? > > Best, > > Fang, Yan > yanfang...@gmail.com > +1 (206) 849-4108 > > > On Fri, Jul 11, 2014 at 1:45 PM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > >> The same executor can be used for both receiving and processing, >> irrespective of the

Re: Number of executors change during job running

2014-07-11 Thread Tathagata Das
t;= timeDiff) >> >> val watchTimeFields = filteredFields.map(fields => (fields._1, >> fields._2, fields._4, fields._5, fields._7)) >> val watchTimeTuples = watchTimeFields.map(fields => >> getWatchtimeTuple(fields)) >> val programDuids = watchTimeTuples

Re: Spark Streaming timing considerations

2014-07-11 Thread Tathagata Das
This is not in the current streaming API. Queue stream is useful for testing with generated RDDs, but not for actual data. For an actual data stream, the slack time can be implemented by doing DStream.window on a larger window that takes the slack time into consideration, and then the required application-t

Re: Generic Interface between RDD and DStream

2014-07-11 Thread Tathagata Das
t touching the calling side (using them), > and thus keeping my plain old backward compat' ^^. > > I know it's just an intermediate hack, but still ;-) > > greetz, > > > aℕdy ℙetrella > about.me/noootsab > [image: aℕdy ℙetrella on about.me] > > <http://ab

Re: Stopping StreamingContext does not kill receiver

2014-07-12 Thread Tathagata Das
Yes, that's a bug I just discovered. Race condition in the Twitter Receiver, will fix asap. Here is the JIRA https://issues.apache.org/jira/browse/SPARK-2464 TD On Sat, Jul 12, 2014 at 3:21 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > To add a potentially relevant piece of informa

Re: can't print DStream after reduce

2014-07-13 Thread Tathagata Das
Try doing DStream.foreachRDD and then printing the RDD count and further inspecting the RDD. On Jul 13, 2014 1:03 AM, "Walrus theCat" wrote: > Hi, > > I have a DStream that works just fine when I say: > > dstream.print > > If I say: > > dstream.map(_,1).print > > that works, too. However, if I d

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-13 Thread Tathagata Das
In case you still have issues with duplicate files in uber jar, here is a reference sbt file with assembly plugin that deals with duplicates https://github.com/databricks/training/blob/sparkSummit2014/streaming/scala/build.sbt On Fri, Jul 11, 2014 at 10:06 AM, Bill Jay wrote: > You may try to

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-13 Thread Tathagata Das
2014 at 5:48 PM, Tathagata Das wrote: > In case you still have issues with duplicate files in uber jar, here is a > reference sbt file with assembly plugin that deals with duplicates > > > https://github.com/databricks/training/blob/sparkSummit2014/streaming/scala/build.sbt > &

Re: writing FLume data to HDFS

2014-07-14 Thread Tathagata Das
bytePayload = avroEvent.getBody(); > > logRecord = new > String(bytePayload.array()); > > > > System.out.println(">>>>>>>>LOG RECORD = " + > logRecord); > > > &g

Re: Spark Streaming Json file groupby function

2014-07-14 Thread Tathagata Das
You have to import StreamingContext._ to enable groupByKey operations on DStreams. After importing that, you can apply groupByKey on any DStream of key-value pairs (e.g. DStream[(String, Int)]). The data in each pair RDD will be grouped by the first element in the tuple as the
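
A minimal sketch of the import and a groupByKey over a pair DStream; words is an assumed DStream[String]:

    import org.apache.spark.streaming.StreamingContext._   // enables the pair-DStream operations

    val pairs = words.map(w => (w, 1))      // DStream of (String, Int) key-value pairs
    val grouped = pairs.groupByKey()        // groups the values by key within each batch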

Re: Number of executors change during job running

2014-07-14 Thread Tathagata Das
er, this method will introduce > many dstreams. It will be good if we can control the number of executors in > the groupBy operation because the calculation needs to be finished within 1 > minute for different size of input data based on our production need. > > Thanks! > > > Bil

Re: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-07-14 Thread Tathagata Das
gt; at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46) > > > -- > > *From:* Haopu Wang > *Sent:* Thursday, July 10, 2014 7:38 PM > *To:* user@spark.apache.org > *Subject:* RE: All of the tasks have been

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-14 Thread Tathagata Das
multiple of hdfs block size. > > Mans > > > > On Friday, July 11, 2014 4:38 PM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > > > The model for file stream is to pick up and process new files written > atomically (by move) into a directory. So your f

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-14 Thread Tathagata Das
When you are sending data using simple socket code to send messages, are those messages "\n" delimited? If they are not, then the receiver of socketTextStream won't identify them as separate events, and will keep buffering them. TD On Sun, Jul 13, 2014 at 10:49 PM, kytay wrote: > Hi Tobias > > I have be

Re: Error in JavaKafkaWordCount.java example

2014-07-14 Thread Tathagata Das
Are you compiling it within Spark using Spark's recommended way (see the docs web page)? Or are you compiling it in your own project? In the latter case, make sure you are using Scala 2.10.4. TD On Sun, Jul 13, 2014 at 6:43 AM, Mahebub Sayyed wrote: > Hello, > > I am referring following example

Re: can't print DStream after reduce

2014-07-14 Thread Tathagata Das
; println((x.count,x.first))) // no output is >>>> printed to driver console >>>> >>>> >>>> >>>> >>>> On Sun, Jul 13, 2014 at 11:47 AM, Walrus theCat >>> > wrote: >>>> >>>>> >>>>> Thanks for your

Re: Number of executors change during job running

2014-07-14 Thread Tathagata Das
rs in the reduce task > (combineByKey). Even with the first batch which used more than 80 > executors, it took 2.4 mins to finish the reduce stage for a very small > amount of data. > > Bill > > > On Mon, Jul 14, 2014 at 12:30 PM, Tathagata Das < > tathagata.das1...@gmail.com&

Re: Spark Streaming Json file groupby function

2014-07-14 Thread Tathagata Das
In general it may be a better idea to actually convert the records from hashmaps to a specific data structure. Say case class Record(id: Int, name: String, mobile: String, score: Int, test_type: String ... ) Then you should be able to do something like val records = jsonf.map(m => convertMapToR
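
A hedged sketch of the conversion suggested above; the field names mirror those in the message, jsonf is assumed to be a DStream of Map[String, Any] parsed from the JSON, and the string-based parsing is only illustrative:

    case class Record(id: Int, name: String, mobile: String, score: Int, testType: String)

    def convertMapToRecord(m: Map[String, Any]): Record =
      Record(
        id       = m("id").toString.toInt,
        name     = m("name").toString,
        mobile   = m("mobile").toString,
        score    = m("score").toString.toInt,
        testType = m("test_type").toString)

    val records = jsonf.map(m => convertMapToRecord(m))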

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
The twitter functionality is not available through the shell. 1) we separated this non-core functionality into separate subprojects so that their dependencies do not collide with/pollute those of core Spark 2) a shell is not really the best way to start a long-running stream. It's best to use twitte

Re: Stateful RDDs?

2014-07-14 Thread Tathagata Das
Trying to answer your questions as concisely as possible 1. In the current implementation, the entire state RDD needs to be loaded for any update. It is a known limitation that we want to overcome in the future. Therefore the state DStream should not be persisted to disk, as all the data in the state RD
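
For context, a minimal updateStateByKey sketch showing the state DStream being discussed; keeping it serialized in memory rather than on disk follows the advice in point 1. pairs (a key-value DStream), ssc, and the checkpoint path are assumptions:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.StreamingContext._

    ssc.checkpoint("hdfs:///checkpoints/stateful-sketch")   // required for stateful operations

    // Running count per key: merge the batch's new values into the previous state.
    def updateFunc(newValues: Seq[Int], state: Option[Int]): Option[Int] =
      Some(newValues.sum + state.getOrElse(0))

    val stateDStream = pairs.updateStateByKey[Int](updateFunc)
    stateDStream.persist(StorageLevel.MEMORY_ONLY_SER)      // keep the state in memory, not on disk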

Re: SQL + streaming

2014-07-14 Thread Tathagata Das
Could you elaborate on what is the problem you are facing? Compiler error? Runtime error? Class-not-found error? Not receiving any data from Kafka? Receiving data but SQL command throwing error? No errors but no output either? TD On Mon, Jul 14, 2014 at 4:06 PM, hsy...@gmail.com wrote: > Hi Al

Re: Spark-Streaming collect/take functionality.

2014-07-14 Thread Tathagata Das
Why doesn't something like this work? If you want a continuously updated reference to the top counts, you can use a global variable. var topCounts: Array[(String, Int)] = null sortedCounts.foreachRDD (rdd => val currentTopCounts = rdd.take(10) // print currentTopCounts or whatever top
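
The code in the preview is cut off; a hedged completion of the same pattern, assuming sortedCounts is a DStream[(String, Int)] from the thread:

    // Driver-side reference that is refreshed every batch.
    var topCounts: Array[(String, Int)] = null

    sortedCounts.foreachRDD { rdd =>
      val currentTopCounts = rdd.take(10)
      topCounts = currentTopCounts
      currentTopCounts.foreach { case (word, count) => println(s"$word: $count") }
    }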

Re: Possible bug in Spark Streaming :: TextFileStream

2014-07-14 Thread Tathagata Das
Oh yes, this was a bug and it has been fixed. Checkout from the master branch! https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20created%20DESC%2C%20priority%20ASC TD On Mon, Jul 7,

Re: SQL + streaming

2014-07-14 Thread Tathagata Das
ul 14, 2014 at 4:59 PM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > >> Could you elaborate on what is the problem you are facing? Compiler >> error? Runtime error? Class-not-found error? Not receiving any data from >> Kafka? Receiving data but SQL comma

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
I guess this is not clearly documented. At a high level, any class that is in the package org.apache.spark.streaming.XXX where XXX is in { twitter, kafka, flume, zeromq, mqtt } is not available in the Spark shell. I have added this to the larger JIRA of things-to-add-to-streaming-docs https://

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
raising dependency-related concerns into the core of spark streaming. TD On Mon, Jul 14, 2014 at 6:29 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > On Mon, Jul 14, 2014 at 6:52 PM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > >> The twitter fun

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
< nicholas.cham...@gmail.com> wrote: > If we're talking about the issue you captured in SPARK-2464 > <https://issues.apache.org/jira/browse/SPARK-2464>, then it was a newly > launched EC2 cluster on 1.0.1. > > > On Mon, Jul 14, 2014 at 10:48 PM, Tathagata Das <

Re: Possible bug in Spark Streaming :: TextFileStream

2014-07-15 Thread Tathagata Das
gards, > Rajesh > > > On Tue, Jul 15, 2014 at 5:42 AM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > >> Oh yes, this was a bug and it has been fixed. Checkout from the master >> branch! >> >> >> https://issues.apache.org/jira/browse/SP

Re: SQL + streaming

2014-07-15 Thread Tathagata Das
ark-submit) and I couldn't > find any output from the yarn stdout logs > > > On Mon, Jul 14, 2014 at 6:25 PM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > >> Can you make sure you are running locally on more than 1 local cores? You >> could set the mast

Re: Spark Streaming Json file groupby function

2014-07-15 Thread Tathagata Das
I see you have the code to convert to the Record class but commented it out. That is the right way to go. When you are converting it to a 4-tuple with "(data("type"),data("name"),data("score"),data("school"))" ... it's of type (Any, Any, Any, Any), as data("xyz") returns Any. And registerAsTable probab

Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Tathagata Das
This sounds really really weird. Can you give me a piece of code that I can run to reproduce this issue myself? TD On Tue, Jul 15, 2014 at 12:02 AM, Walrus theCat wrote: > This is (obviously) spark streaming, by the way. > > > On Mon, Jul 14, 2014 at 8:27 PM, Walrus theCat > wrote: > >> Hi, >

Re: Cassandra driver Spark question

2014-07-15 Thread Tathagata Das
Can you find out which class is causing the NotSerializable exception? In fact, you can enable extended serialization debugging to figure out the object structure through the foreachRDD's

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Tathagata Das
The way the HDFS file writing works at a high level is that each attempt to write a partition to a file starts writing to a unique temporary file (say, something like targetDirectory/_temp/part-X_attempt-). If the writing into the file successfully completes, then the temporary file is moved

Re: NotSerializableException in Spark Streaming

2014-07-15 Thread Tathagata Das
I am not entirely sure off the top of my head. But a possible (usually works) workaround is to define the function as a val instead of a def. For example def func(i: Int): Boolean = { true } can be written as val func = (i: Int) => { true } Hope this helps for now. TD On Tue, Jul 15, 2014 at 9
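
Spelled out, the workaround looks like the sketch below; the point is that a val holds a standalone function object, whereas a def is a method whose use in a closure can drag the enclosing instance into serialization. The predicate and dstream are illustrative assumptions:

    // As a method (may pull in the enclosing class when captured by a closure):
    def isPositiveDef(i: Int): Boolean = i > 0

    // As a function value (a self-contained, serializable object):
    val isPositiveVal: Int => Boolean = (i: Int) => i > 0

    val filtered = dstream.filter(isPositiveVal)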

Re: NotSerializableException in Spark Streaming

2014-07-15 Thread Tathagata Das
I am very curious though. Can you post a concise code example which we can run to reproduce this problem? TD On Tue, Jul 15, 2014 at 2:54 PM, Tathagata Das wrote: > I am not entire sure off the top of my head. But a possible (usually > works) workaround is to define the function as

Re: Need help on spark Hbase

2014-07-15 Thread Tathagata Das
Also, it helps if you post us logs, stacktraces, exceptions, etc. TD On Tue, Jul 15, 2014 at 10:07 AM, Jerry Lam wrote: > Hi Rajesh, > > I have a feeling that this is not directly related to spark but I might be > wrong. The reason why is that when you do: > >Configuration configuration =

Re: Multiple streams at the same time

2014-07-15 Thread Tathagata Das
Creating multiple StreamingContexts using the same SparkContext is currently not supported. :) Guess it was not clear in the docs. Note to self. TD On Tue, Jul 15, 2014 at 1:50 PM, gorenuru wrote: > Hi everyone. > > I have some problems running multiple streams at the same time. > > What i am

Re: Multiple streams at the same time

2014-07-15 Thread Tathagata Das
Why do you need to create multiple streaming contexts at all? TD On Tue, Jul 15, 2014 at 3:43 PM, gorenuru wrote: > Oh, sad to hear that :( > From my point of view, creating a separate spark context for each stream is > too expensive. > Also, it's annoying because we have to be responsible for

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-15 Thread Tathagata Das
Yes, what Nick said is the recommended way. In most use cases, a Spark Streaming program in production is not run from the shell. Hence, we chose not to make the external stuff (twitter, kafka, etc.) available to the Spark shell, to avoid dependency conflicts brought in by them with Spark's depen

Re: Spark Streaming Json file groupby function

2014-07-15 Thread Tathagata Das
You need to have import sqlContext._ so just uncomment that and it should work. TD On Tue, Jul 15, 2014 at 1:40 PM, srinivas wrote: > I am still getting the error...even if i convert it to record > object KafkaWordCount { > def main(args: Array[String]) { > if (args.length < 4) { >
