Re: Out of any idea

2014-07-19 Thread Tathagata Das
Could you collect debug-level logs and send them to us? Without logs it's hard to speculate anything. :) TD On Sat, Jul 19, 2014 at 2:39 PM, boci boci.b...@gmail.com wrote: Hi guys! I run out of ideas... I created a spark streaming job (kafka - spark - ES). If I start my app local machine (inside

Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-18 Thread Tathagata Das
yanfang...@gmail.com wrote: Thank you, TD. This is important information for us. Will keep an eye on that. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 6:54 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Yes, this is the limitation

Re: spark streaming rate limiting from kafka

2014-07-18 Thread Tathagata Das
in general. I tried to explore the link you provided but could not find any specific JIRA related to this? Do you have the JIRA number for this? On Thu, Jul 17, 2014 at 9:21 PM, Tathagata Das tathagata.das1...@gmail.com wrote: You can create multiple kafka stream to partition your topics

Re: spark streaming rate limiting from kafka

2014-07-18 Thread Tathagata Das
Dang! Messed it up again! JIRA: https://issues.apache.org/jira/browse/SPARK-1341 Github PR: https://github.com/apache/spark/pull/945/files On Fri, Jul 18, 2014 at 11:35 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Oops, wrong link! JIRA: https://github.com/apache/spark/pull/945

Re: Spark Streaming with long batch / window duration

2014-07-18 Thread Tathagata Das
If you want to process data that spans across weeks, then it is best to use a dedicated data store (file system, sql / nosql database, etc.) that is designed for long-term data storage and retrieval. Spark Streaming is not designed as a long-term data store. Also it does not seem like you need low

Re: unserializable object in Spark Streaming context

2014-07-18 Thread Tathagata Das
That's a good question. My first reaction is a timeout. Timing out after tens of seconds should be sufficient. So there should be a timer in the singleton that runs a check every second on when the singleton was last used, and closes the connections after a timeout. Any attempts to use the connection
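
A minimal sketch of the idle-timeout singleton described above, using a JDBC connection as a stand-in (the URL, timeout value, and the ConnectionHolder name are all made up):

    import java.sql.{Connection, DriverManager}
    import java.util.{Timer, TimerTask}

    object ConnectionHolder {
      private val jdbcUrl = "jdbc:postgresql://dbhost/mydb"   // hypothetical URL
      private var connection: Connection = null
      private var lastUsed = 0L
      private val idleTimeoutMs = 30 * 1000                   // close after ~30s of inactivity

      // Daemon timer: once a second, check when the singleton was last used.
      new Timer(true).scheduleAtFixedRate(new TimerTask {
        def run() {
          ConnectionHolder.synchronized {
            if (connection != null && System.currentTimeMillis() - lastUsed > idleTimeoutMs) {
              connection.close()
              connection = null
            }
          }
        }
      }, 1000, 1000)

      def get(): Connection = synchronized {
        if (connection == null) connection = DriverManager.getConnection(jdbcUrl)
        lastUsed = System.currentTimeMillis()
        connection
      }
    }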

Re: unserializable object in Spark Streaming context

2014-07-18 Thread Tathagata Das
this helps! TD On Fri, Jul 18, 2014 at 8:14 PM, Tathagata Das tathagata.das1...@gmail.com wrote: That's a good question. My first reaction is a timeout. Timing out after tens of seconds should be sufficient. So there should be a timer in the singleton that runs a check every second on when

Re: Kyro deserialisation error

2014-07-17 Thread Tathagata Das
Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.com On Thu, Jul 17, 2014 at 2:58 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Is the class that is not found in the wikipediapagerank jar? TD On Wed, Jul 16, 2014 at 12:32 AM, Hao Wang wh.s...@gmail.com

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Tathagata Das
1. You can put in multiple kafka topics in the same Kafka input stream. See the example KafkaWordCount https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala . However they will all be read
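
A sketch of that with the Spark 1.0 Kafka receiver API (ssc is an existing StreamingContext; the topic names, ZooKeeper address, and group id are made up):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // One input stream, two topics; the Int is the number of consumer threads per topic.
    val topics = Map("clicks" -> 1, "impressions" -> 1)
    val lines = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", topics)
      .map(_._2)   // keep the message payload, drop the Kafka key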

Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Tathagata Das
For accessing previous version, I would do it the same way. :) 1. Can you elaborate on what you mean by that with an example? What do you mean by accessing keys? 2. Yeah, that is hard to do without the ability to do point lookups into an RDD, which we don't support yet. You could try embedding the

Re: Spark Streaming Json file groupby function

2014-07-17 Thread Tathagata Das
This is a basic scala problem. You cannot apply toInt to Any. Try doing toString.toInt. For such scala issues, I recommend trying it out in the Scala shell. For example, you could have tried this out as follows. [tdas @ Xion streaming] scala Welcome to Scala version 2.10.3 (Java HotSpot(TM)
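
A tiny illustration of the fix, assuming the parsed JSON comes back as Map[String, Any]:

    val data: Map[String, Any] = Map("score" -> "42")
    // data("score").toInt                       // does not compile: toInt is not a member of Any
    val score = data("score").toString.toInt     // 42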

Re: Spark Streaming timing considerations

2014-07-17 Thread Tathagata Das
, Laeeq Ahmed laeeqsp...@yahoo.com wrote: Hi, Thanks I will try to implement it. Regards, Laeeq On Saturday, July 12, 2014 4:37 AM, Tathagata Das tathagata.das1...@gmail.com wrote: This is not in the current streaming API. Queue stream is useful for testing with generated RDDs

Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Tathagata Das
yanfang...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 11:36 AM, Tathagata Das tathagata.das1...@gmail.com wrote: For accessing previous version, I would do it the same way. :) 1. Can you elaborate on what you mean by that with an example? What do you mean by accessing keys? 2. Yeah

Re: unserializable object in Spark Streaming context

2014-07-17 Thread Tathagata Das
And if Marcelo's guess is correct, then the right way to do this would be to lazily / dynamically create the jdbc connection server as a singleton in the workers/executors and use that. Something like this. dstream.foreachRDD(rdd => { rdd.foreachPartition((iterator: Iterator[...]) => {
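
A completed sketch of that pattern, reusing the kind of lazily initialized singleton sketched earlier in this listing (ConnectionHolder is hypothetical); the point is that the connection is created inside the executor JVM and never serialized from the driver:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionRecords =>
        val conn = ConnectionHolder.get()   // created lazily per executor, reused across batches
        partitionRecords.foreach { record =>
          // write `record` using conn here, e.g. via a PreparedStatement
        }
      }
    }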

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Tathagata Das
is it possible to create partitions while storing data direct from kafka to Parquet files??* *(likewise created in above query)* On Thu, Jul 17, 2014 at 12:35 PM, Tathagata Das tathagata.das1...@gmail.com wrote: 1. You can put in multiple kafka topics in the same Kafka input stream. See the example

Re: how to pass extra Java opts to workers for spark streaming jobs

2014-07-17 Thread Tathagata Das
Can you check in the environment tab of Spark web ui to see whether this configuration parameter is in effect? TD On Thu, Jul 17, 2014 at 2:05 PM, Chen Song chen.song...@gmail.com wrote: I am using spark 0.9.0 and I am able to submit job to YARN,

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Tathagata Das
You can create multiple kafka streams to partition your topics across them, which will run multiple receivers on multiple executors. This is covered in the Spark Streaming guide. http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving And for the
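
A sketch of that receiving-parallelism pattern (topic and connection details assumed): each created stream gets its own receiver, and the streams are unioned before further processing.

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numStreams = 3
    val kafkaStreams = (1 to numStreams).map { _ =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("mytopic" -> 1))
    }
    val unified = ssc.union(kafkaStreams)   // combine before applying the processing logic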

Re: Spark Streaming

2014-07-17 Thread Tathagata Das
Step 1: use rdd.mapPartitions(). That's equivalent to the mapper in MapReduce. You can open a connection, get all the data and buffer it, close the connection, and return an iterator over the buffer. Step 2: Make step 1 better by making it reuse connections. You can use singletons / static vars to lazily
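
A sketch of step 1 (createConnection() and conn.fetch() are made-up stand-ins for whatever external system is being queried):

    val enriched = dstream.mapPartitions { iterator =>
      val conn = createConnection()
      // Drain the input and buffer the results before closing the connection,
      // because iterator.map on its own is lazy.
      val buffer = iterator.map(record => conn.fetch(record)).toList
      conn.close()
      buffer.iterator
    }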

Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Tathagata Das
...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 1:47 PM, Tathagata Das tathagata.das1...@gmail.com wrote: The updateFunction given in updateStateByKey should be called on ALL the keys are in the state, even if there is no new data in the batch for some key. Is that not the behavior

Re: Kyro deserialisation error

2014-07-16 Thread Tathagata Das
hdfs://192.168.1.12:9000/freebase-26G 1 200 True Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University Address:800 Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.com On Wed, Jul 16, 2014 at 1:41 PM, Tathagata Das

Re: Spark Streaming, external windowing?

2014-07-16 Thread Tathagata Das
One way to do that is currently possible is given here http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAMwrk0=b38dewysliwyc6hmze8tty8innbw6ixatnd1ue2-...@mail.gmail.com%3E On Wed, Jul 16, 2014 at 1:16 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi Sargun, There have

Re: can't print DStream after reduce

2014-07-16 Thread Tathagata Das
wrote: Hi, thanks for creating the issue. It feels like in the last week, more or less half of the questions about Spark Streaming rooted in setting the master to local ;-) Tobias On Wed, Jul 16, 2014 at 11:03 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Aah, right, copied

Re: Multiple streams at the same time

2014-07-16 Thread Tathagata Das
I hope it all works :) On Wed, Jul 16, 2014 at 9:08 AM, gorenuru goren...@gmail.com wrote: Hi and thank you for your reply. Looks like it's possible. It looks like a hack for me because we are specifying batch duration when creating context. This means that if we will specify batch

Re: Spark Streaming Json file groupby function

2014-07-16 Thread Tathagata Das
I think I know what the problem is. Spark Streaming is constantly doing garbage cleanup by throwing away data that it does not need, based on the operations in the DStream. Here the DStream operations are not aware of the Spark SQL queries that are happening asynchronously to Spark Streaming. So data is

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Tathagata Das
Have you taken a look at DStream.transformWith(...)? That allows you to apply an arbitrary transformation between RDDs (of the same timestamp) of two different streams. So you can do something like this. 2s-window-stream.transformWith(1s-window-stream, (rdd1: RDD[...], rdd2: RDD[...]) => { ... //
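
Filled out a little as a sketch, assuming both streams are DStream[(String, Int)] built over a 1-second batch interval:

    import org.apache.spark.SparkContext._          // pair-RDD functions such as join
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Seconds

    val twoSecWindow = stream.window(Seconds(2))
    val oneSecWindow = stream.window(Seconds(1))
    // transformWith hands you the two RDDs that belong to the same batch time.
    val joined = twoSecWindow.transformWith(oneSecWindow,
      (rdd1: RDD[(String, Int)], rdd2: RDD[(String, Int)]) => rdd1.join(rdd2))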

Re: Spark Streaming timestamps

2014-07-16 Thread Tathagata Das
Answers inline. On Wed, Jul 16, 2014 at 5:39 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am currently using Spark Streaming to conduct a real-time data analytics. We receive data from Kafka. We want to generate output files that contain results that are based on the data we

Re: Possible bug in Spark Streaming :: TextFileStream

2014-07-15 Thread Tathagata Das
, Tathagata Das tathagata.das1...@gmail.com wrote: Oh yes, this was a bug and it has been fixed. Checkout from the master branch! https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY

Re: SQL + streaming

2014-07-15 Thread Tathagata Das
cluster(spark-submit) and I couldn't find any output from the yarn stdout logs On Mon, Jul 14, 2014 at 6:25 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Can you make sure you are running locally on more than 1 local cores? You could set the master in the SparkConf as conf.setMaster

Re: Spark Streaming Json file groupby function

2014-07-15 Thread Tathagata Das
I see you have the code to convert to the Record class but commented it out. That is the right way to go. When you are converting it to a 4-tuple with (data(type),data(name),data(score),data(school)) ... it's of type (Any, Any, Any, Any) as data(xyz) returns Any. And registerAsTable probably doesn't

Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Tathagata Das
This sounds really really weird. Can you give me a piece of code that I can run to reproduce this issue myself? TD On Tue, Jul 15, 2014 at 12:02 AM, Walrus theCat walrusthe...@gmail.com wrote: This is (obviously) spark streaming, by the way. On Mon, Jul 14, 2014 at 8:27 PM, Walrus theCat

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Tathagata Das
The way the HDFS file writing works at a high level is that each attempt to write a partition to a file starts writing to unique temporary file (say, something like targetDirectory/_temp/part-X_attempt-). If the writing into the file successfully completes, then the temporary file is moved

Re: NotSerializableException in Spark Streaming

2014-07-15 Thread Tathagata Das
I am very curious though. Can you post a concise code example which we can run to reproduce this problem? TD On Tue, Jul 15, 2014 at 2:54 PM, Tathagata Das tathagata.das1...@gmail.com wrote: I am not entirely sure off the top of my head. But a possible workaround (that usually works) is to define

Re: Need help on spark Hbase

2014-07-15 Thread Tathagata Das
Also, it helps if you post us logs, stacktraces, exceptions, etc. TD On Tue, Jul 15, 2014 at 10:07 AM, Jerry Lam chiling...@gmail.com wrote: Hi Rajesh, I have a feeling that this is not directly related to spark but I might be wrong. The reason why is that when you do: Configuration

Re: Multiple streams at the same time

2014-07-15 Thread Tathagata Das
Creating multiple StreamingContexts using the same SparkContext is currently not supported. :) Guess it was not clear in the docs. Note to self. TD On Tue, Jul 15, 2014 at 1:50 PM, gorenuru goren...@gmail.com wrote: Hi everyone. I have some problems running multiple streams at the same

Re: Multiple streams at the same time

2014-07-15 Thread Tathagata Das
Why do you need to create multiple streaming contexts at all? TD On Tue, Jul 15, 2014 at 3:43 PM, gorenuru goren...@gmail.com wrote: Oh, sad to hear that :( From my point of view, creating a separate spark context for each stream is too expensive. Also, it's annoying because we have to be

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-15 Thread Tathagata Das
Yes, what Nick said is the recommended way. In most use cases, a Spark Streaming program in production is not usually run from the shell. Hence, we chose not to make the external stuff (twitter, kafka, etc.) available to the Spark shell, to avoid dependency conflicts brought in by them with spark's

Re: Spark Streaming Json file groupby function

2014-07-15 Thread Tathagata Das
You need to have import sqlContext._ so just uncomment that and it should work. TD On Tue, Jul 15, 2014 at 1:40 PM, srinivas kusamsrini...@gmail.com wrote: I am still getting the error...even if i convert it to record object KafkaWordCount { def main(args: Array[String]) { if

Re: SQL + streaming

2014-07-15 Thread Tathagata Das
, Jul 15, 2014 at 12:48 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Could you run it locally first to make sure it works, and you see output? Also, I recommend going through the previous step-by-step approach to narrow down where the problem is. TD On Mon, Jul 14, 2014 at 9:15 PM

Re: Multiple streams at the same time

2014-07-15 Thread Tathagata Das
On Tue, Jul 15, 2014 at 5:22 PM, gorenuru goren...@gmail.com wrote: Because I want to have different streams with different durations. For example, one triggers snapshot analysis each 5 minutes and another each 10 seconds On Tue, Jul 15, 2014 at 3:59 pm, Tathagata Das [via Apache Spark User

Re: Spark Streaming w/ tshark exception problem on EC2

2014-07-15 Thread Tathagata Das
Quick google search of that exception says this occurs when there is an error in the initialization of static methods. Could be some issue related to how dissection is defined. Maybe try putting the function in a different static class that is unrelated to the Main class, which may have other

Re: can't print DStream after reduce

2014-07-15 Thread Tathagata Das
mode On Mon, Jul 14, 2014 at 3:57 PM, Tathagata Das tathagata.das1...@gmail.com wrote: The problem is not really for local[1] or local. The problem arises when there are more input streams than there are cores. But I agree, for people who are just beginning to use it by running it locally

Re: Spark Streaming Json file groupby function

2014-07-15 Thread Tathagata Das
Can you try defining the case class outside the main function. In fact outside the object? TD On Tue, Jul 15, 2014 at 8:20 PM, srinivas kusamsrini...@gmail.com wrote: Hi TD, I uncomment import sqlContext._ and tried to compile the code import java.util.Properties import kafka.producer._

Re: SQL + streaming

2014-07-15 Thread Tathagata Das
:53 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Oh yes, we have run sql, streaming and mllib all together. You can take a look at the demo https://databricks.com/cloud that DataBricks gave at the Spark Summit. I think I see what the problem is. sql() returns an RDD, and println(rdd

Re: Kyro deserialisation error

2014-07-15 Thread Tathagata Das
Are you using classes from external libraries that have not been added to the sparkContext, using sparkcontext.addJar()? TD On Tue, Jul 15, 2014 at 8:36 PM, Hao Wang wh.s...@gmail.com wrote: I am running the WikipediaPageRank in Spark example and share the same problem with you: 4/07/16

Re: writing FLume data to HDFS

2014-07-14 Thread Tathagata Das
()); System.out.println("LOG RECORD = " + logRecord); "I was trying to write the data to hdfs.. but it fails…" *From:* Tathagata Das [mailto:tathagata.das1...@gmail.com] *Sent:* Friday, July 11, 2014 1:43 PM *To:* user@spark.apache.org *Cc:* u

Re: Spark Streaming Json file groupby function

2014-07-14 Thread Tathagata Das
You have to import StreamingContext._ to enable groupByKey operations on DStreams. After importing that you can apply groupByKey on any DStream that is a DStream of key-value pairs (e.g. DStream[(String, Int)]). The data in each pair RDD will be grouped by the first element in the tuple as the
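
For instance (a sketch, assuming words is a DStream[String]):

    import org.apache.spark.streaming.StreamingContext._   // enables pair-DStream operations

    val pairs = words.map(word => (word, 1))
    val grouped = pairs.groupByKey()        // groups each batch's values by key
    val counts = pairs.reduceByKey(_ + _)   // also needs the import above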

Re: All of the tasks have been completed but the Stage is still shown as Active?

2014-07-14 Thread Tathagata Das
of the tasks have been completed but the Stage is still shown as Active? I didn't keep the driver's log. It's a lesson. I will try to run it again to see if it happens again. -- *From:* Tathagata Das [mailto:tathagata.das1...@gmail.com] *Sent:* July 10, 2014 17:29

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-14 Thread Tathagata Das
. Mans On Friday, July 11, 2014 4:38 PM, Tathagata Das tathagata.das1...@gmail.com wrote: The model for file stream is to pick up and process new files written atomically (by move) into a directory. So your file is being processed in a single batch, and then its waiting for any new files

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-14 Thread Tathagata Das
When you are sending data using simple socket code to send messages, are those messages \n delimited? If they are not, then the receiver of socketTextStream won't identify them as separate events, and will keep buffering them. TD On Sun, Jul 13, 2014 at 10:49 PM, kytay kaiyang@gmail.com wrote: Hi
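
For a quick test sender, something like this (port and messages are made up) makes sure every event ends with a newline:

    import java.io.PrintWriter
    import java.net.ServerSocket

    val server = new ServerSocket(9999)
    val socket = server.accept()                 // wait for socketTextStream to connect
    val out = new PrintWriter(socket.getOutputStream, true)
    out.println("event-1")                       // println appends the '\n' the receiver needs
    out.println("event-2")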

Re: Error in JavaKafkaWordCount.java example

2014-07-14 Thread Tathagata Das
Are you compiling it within Spark using Spark's recommended way (see doc web page)? Or are you compiling it in your own project? In the latter case, make sure you are using Scala 2.10.4. TD On Sun, Jul 13, 2014 at 6:43 AM, Mahebub Sayyed mahebub...@gmail.com wrote: Hello, I am referring

Re: can't print DStream after reduce

2014-07-14 Thread Tathagata Das
no count. Thanks On Sun, Jul 13, 2014 at 11:34 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Try doing DStream.foreachRDD and then printing the RDD count and further inspecting the RDD. On Jul 13, 2014 1:03 AM, Walrus theCat walrusthe...@gmail.com wrote: Hi, I have a DStream

Re: Number of executors change during job running

2014-07-14 Thread Tathagata Das
in the reduce task (combineByKey). Even with the first batch which used more than 80 executors, it took 2.4 mins to finish the reduce stage for a very small amount of data. Bill On Mon, Jul 14, 2014 at 12:30 PM, Tathagata Das tathagata.das1...@gmail.com wrote: After using repartition(300), how

Re: Spark Streaming Json file groupby function

2014-07-14 Thread Tathagata Das
In general it may be a better idea to actually convert the records from hashmaps to a specific data structure. Say case class Record(id: Int, name: String, mobile: String, score: Int, test_type: String ... ) Then you should be able to do something like val records = jsonf.map(m =>
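
Completed as a sketch, with trimmed-down field names, jsonf assumed to be a DStream[Map[String, Any]], and sqlContext assumed to be an existing SQLContext; the case class sits at the top level, outside the enclosing object, as TD suggests elsewhere in this listing:

    case class Record(name: String, score: Int, testType: String)

    val records = jsonf.map { m: Map[String, Any] =>
      Record(m("name").toString, m("score").toString.toInt, m("test_type").toString)
    }
    records.foreachRDD { rdd =>
      import sqlContext._                     // brings in createSchemaRDD and sql()
      rdd.registerAsTable("records")          // Spark 1.0 SchemaRDD API
      sql("SELECT testType, score FROM records WHERE score > 50").collect().foreach(println)
    }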

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
The twitter functionality is not available through the shell. 1) we separated this non-core functionality into separate subprojects so that their dependencies do not collide with / pollute those of core spark 2) a shell is not really the best way to start a long running stream. It's best to use

Re: Stateful RDDs?

2014-07-14 Thread Tathagata Das
Trying to answer your questions as concisely as possible: 1. In the current implementation, the entire state RDD needs to be loaded for any update. It is a known limitation that we want to overcome in the future. Therefore the state DStream should not be persisted to disk as all the data in the state

Re: SQL + streaming

2014-07-14 Thread Tathagata Das
Could you elaborate on what is the problem you are facing? Compiler error? Runtime error? Class-not-found error? Not receiving any data from Kafka? Receiving data but SQL command throwing error? No errors but no output either? TD On Mon, Jul 14, 2014 at 4:06 PM, hsy...@gmail.com

Re: Spark-Streaming collect/take functionality.

2014-07-14 Thread Tathagata Das
Why doesn't something like this work? If you want a continuously updated reference to the top counts, you can use a global variable. var topCounts: Array[(String, Int)] = null sortedCounts.foreachRDD (rdd => val currentTopCounts = rdd.take(10) // print currentTopCounts or whatever
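
Cleaned up, the snippet above would look roughly like this (sortedCounts is assumed to be a DStream[(String, Int)] already sorted by count):

    var topCounts: Array[(String, Int)] = Array.empty
    sortedCounts.foreachRDD { rdd =>
      val currentTopCounts = rdd.take(10)   // runs on the driver, so the var is safe to update
      topCounts = currentTopCounts
      currentTopCounts.foreach(println)
    }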

Re: Possible bug in Spark Streaming :: TextFileStream

2014-07-14 Thread Tathagata Das
Oh yes, this was a bug and it has been fixed. Checkout from the master branch! https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20created%20DESC%2C%20priority%20ASC TD On Mon, Jul

Re: SQL + streaming

2014-07-14 Thread Tathagata Das
at 4:59 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Could you elaborate on what is the problem you are facing? Compiler error? Runtime error? Class-not-found error? Not receiving any data from Kafka? Receiving data but SQL command throwing error? No errors but no output either? TD

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
I guess this is not clearly documented. At a high level, any class that is in the package org.apache.spark.streaming.XXX where XXX is in { twitter, kafka, flume, zeromq, mqtt } is not available in the Spark shell. I have added this to the larger JIRA of things-to-add-to-streaming-docs

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
raising dependency-related concerns into the core of spark streaming. TD On Mon, Jul 14, 2014 at 6:29 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Mon, Jul 14, 2014 at 6:52 PM, Tathagata Das tathagata.das1...@gmail.com wrote: The twitter functionality is not available through

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
nicholas.cham...@gmail.com wrote: If we're talking about the issue you captured in SPARK-2464 https://issues.apache.org/jira/browse/SPARK-2464, then it was a newly launched EC2 cluster on 1.0.1. On Mon, Jul 14, 2014 at 10:48 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Did you make any

Re: can't print DStream after reduce

2014-07-13 Thread Tathagata Das
Try doing DStream.foreachRDD and then printing the RDD count and further inspecting the RDD. On Jul 13, 2014 1:03 AM, Walrus theCat walrusthe...@gmail.com wrote: Hi, I have a DStream that works just fine when I say: dstream.print If I say: dstream.map(_,1).print that works, too.

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-13 Thread Tathagata Das
In case you still have issues with duplicate files in uber jar, here is a reference sbt file with assembly plugin that deals with duplicates https://github.com/databricks/training/blob/sparkSummit2014/streaming/scala/build.sbt On Fri, Jul 11, 2014 at 10:06 AM, Bill Jay

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-13 Thread Tathagata Das
at 5:48 PM, Tathagata Das tathagata.das1...@gmail.com wrote: In case you still have issues with duplicate files in uber jar, here is a reference sbt file with assembly plugin that deals with duplicates https://github.com/databricks/training/blob/sparkSummit2014/streaming/scala/build.sbt

Re: Stopping StreamingContext does not kill receiver

2014-07-12 Thread Tathagata Das
Yes, that's a bug I just discovered. Race condition in the Twitter receiver, will fix asap. Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-2464 TD On Sat, Jul 12, 2014 at 3:21 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: To add a potentially relevant piece of

Re: Join two Spark Streaming

2014-07-11 Thread Tathagata Das
instead of accumulative number of unique integers. I do have two questions about your code: 1. Why do we need uniqueValuesRDD? Why do we need to call uniqueValuesRDD.checkpoint()? 2. Where is distinctValues defined? Bill On Thu, Jul 10, 2014 at 8:46 PM, Tathagata Das tathagata.das1

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread Tathagata Das
for this. Thanks. On Thursday, July 10, 2014 7:24 PM, Tathagata Das tathagata.das1...@gmail.com wrote: How are you supplying the text file? On Wed, Jul 9, 2014 at 11:51 AM, M Singh mans6si...@yahoo.com wrote: Hi Folks: I am working on an application which uses spark streaming (version

Re: writing FLume data to HDFS

2014-07-11 Thread Tathagata Das
What is the error you are getting when you say "I was trying to write the data to hdfs.. but it fails…"? TD On Thu, Jul 10, 2014 at 1:36 PM, Sundaram, Muthu X. muthu.x.sundaram@sabre.com wrote: I am new to spark. I am trying to do the following. Netcat -> Flume -> Spark streaming(process Flume

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread Tathagata Das
...@yahoo.com wrote: So, is it expected for the process to generate stages/tasks even after processing a file ? Also, is there a way to figure out the file that is getting processed and when that process is complete ? Thanks On Friday, July 11, 2014 1:51 PM, Tathagata Das tathagata.das1

Re: How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-11 Thread Tathagata Das
The same executor can be used for both receiving and processing, irrespective of the deployment mode (yarn, spark standalone, etc.). It boils down to the number of cores / task slots that the executor has. Each receiver is like a long-running task, so each of them occupies a slot. If there are free slots

Re: Number of executors change during job running

2014-07-11 Thread Tathagata Das
that the computation can be efficiently finished. I am not sure how to achieve this. Thanks! Bill On Thu, Jul 10, 2014 at 5:59 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Can you try setting the number-of-partitions in all the shuffle-based DStream operations, explicitly. It may

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Tathagata Das
You don't get any exception from twitter.com, saying credential error or something? I have seen this happen when someone was behind a VPN to their office, and twitter was probably blocked in their office. You could be having a similar issue. TD On Fri, Jul 11, 2014 at 2:57 PM, SK

Re: Some question about SQL and streaming

2014-07-11 Thread Tathagata Das
Yes, even though we dont have immediate plans, I definitely would like to see it happen some time in not-so-distant future. TD On Thu, Jul 10, 2014 at 7:55 PM, Shao, Saisai saisai.s...@intel.com wrote: No specific plans to do so, since there has some functional loss like time based

Re: Generic Interface between RDD and DStream

2014-07-11 Thread Tathagata Das
I totally agree that if we are able to do this it will be very cool. However, this requires having a common trait / interface between RDD and DStream, which we don't have as of now. It would be very cool though. On my wish list for sure. TD On Thu, Jul 10, 2014 at 11:53 AM, mshah

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Tathagata Das
Does nothing get printed on the screen? If you are not getting any tweets but Spark Streaming is running successfully, you should at least get counts being printed every batch (which would be zero). If they are not being printed either, check the Spark web UI to see whether stages are running or not. If

Re: How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-11 Thread Tathagata Das
? Best, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Fri, Jul 11, 2014 at 1:45 PM, Tathagata Das tathagata.das1...@gmail.com wrote: The same executor can be used for both receiving and processing, irrespective of the deployment mode (yarn, spark standalone, etc.) It boils down

Re: Number of executors change during job running

2014-07-11 Thread Tathagata Das
days. No findings why there are always 2 executors for the groupBy stage. Thanks a lot! Bill On Fri, Jul 11, 2014 at 1:50 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Can you show us the program that you are running. If you are setting number of partitions in the XYZ-ByKey operation

Re: Spark Streaming timing considerations

2014-07-11 Thread Tathagata Das
This is not in the current streaming API. Queue stream is useful for testing with generated RDDs, but not for actual data. For an actual data stream, the slack time can be implemented by doing DStream.window on a larger window that takes the slack time into consideration, and then the required
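
One way to sketch that slack-time idea, assuming events is a DStream[Event] where each record carries its own event time, and a batch interval that divides the window sizes evenly:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, Time}

    case class Event(time: Long, payload: String)   // hypothetical record with its own timestamp

    val slack = Seconds(10)
    // Window over the wanted 60s plus the slack ...
    val windowed = events.window(Seconds(60) + slack, Seconds(60))
    // ... then keep only the records that fall outside the slack period.
    val adjusted = windowed.transform { (rdd: RDD[Event], batchTime: Time) =>
      val cutoff = batchTime.milliseconds - slack.milliseconds
      rdd.filter(e => e.time <= cutoff)
    }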

Re: Generic Interface between RDD and DStream

2014-07-11 Thread Tathagata Das
know it's just an intermediate hack, but still ;-) greetz, aℕdy ℙetrella about.me/noootsab [image: aℕdy ℙetrella on about.me] http://about.me/noootsab On Sat, Jul 12, 2014 at 12:57 AM, Tathagata Das tathagata.das1...@gmail.com wrote: I totally agree that doing if we are able to do

Re: Restarting a Streaming Context

2014-07-10 Thread Tathagata Das
I confirm that is indeed the case. It is designed to be so because it keeps things simpler - less chances of issues related to cleanup when stop() is called. Also it keeps things consistent with the spark context - once a spark context is stopped it cannot be used any more. You can create a new

Re: All of the tasks have been completed but the Stage is still shown as Active?

2014-07-10 Thread Tathagata Das
Do you see any errors in the logs of the driver? On Thu, Jul 10, 2014 at 1:21 AM, Haopu Wang hw...@qilinsoft.com wrote: I'm running an App for hours in a standalone cluster. From the data injector and Streaming tab of web ui, it's running well. However, I see quite a lot of Active stages in

Re: KMeans code is rubbish

2014-07-10 Thread Tathagata Das
I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your dataset as well, and I got the expected answer. And I believe that even though initialization is done using sampling, the example actually sets the seed to a constant 42, so the result should always be the same no matter how

Re: Getting Persistent Connection using socketStream?

2014-07-10 Thread Tathagata Das
The implementation of the input-stream-to-iterator function in #2 is incorrect. The function should be such that, when hasNext is called on the iterator, it should try to read from the buffered reader. If an object (that is, a line) can be read, return it; otherwise block and wait for data

Re: What version of twitter4j should I use with Spark Streaming?

2014-07-10 Thread Tathagata Das
Spark Streaming uses twitter4j 3.0.3. 3.0.6 should probably work fine. The exception that you are seeing is something that should be looked into. Can you give us more logs (especially executor logs) with stack traces that have the error. TD On Thu, Jul 10, 2014 at 2:42 PM, Nick Chammas

Re: Number of executors change during job running

2014-07-10 Thread Tathagata Das
Are you specifying the number of reducers in all the DStream.ByKey operations? If the reduce by key is not set, then the number of reducers used in the stages can keep changing across batches. TD On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I have a
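
Concretely, that means passing the partition count explicitly to every such operation (300 is just an illustrative number, and pairs an assumed pair DStream):

    val counts  = pairs.reduceByKey((a, b) => a + b, 300)   // fixed number of reducers
    val grouped = pairs.groupByKey(300)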

Re: NoSuchElementException: key not found when changing the window lenght and interval in Spark Streaming

2014-07-10 Thread Tathagata Das
This bug has been fixed. Either use the master branch of Spark, or maybe wait a few days for Spark 1.0.1 to be released (voting has successfully closed). TD On Thu, Jul 10, 2014 at 2:33 AM, richiesgr richie...@gmail.com wrote: Hi I get exactly the same problem here, do you've found the

Re: Spark Streaming using File Stream in Java

2014-07-10 Thread Tathagata Das
The fileStream is not designed to work with continuously updating files, as one of the main design goals of Spark is immutability (to guarantee fault-tolerance by recomputation), and files that are appending (mutating) defeat that. It is rather designed to pick up new files added atomically (using

Re: Number of executors change during job running

2014-07-10 Thread Tathagata Das
then 3 minutes. Thanks! Bill On Thu, Jul 10, 2014 at 4:28 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Are you specifying the number of reducers in all the DStream.ByKey operations? If the reduce by key is not set, then the number of reducers used in the stages can keep

Re: Some question about SQL and streaming

2014-07-10 Thread Tathagata Das
Yeah, the right solution is to have something like SchemaDStream, where the schema of all the schemaRDD generated by it can be stored. Something I really would like to see happen in the future :) TD On Thu, Jul 10, 2014 at 6:37 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, I think it

Re: Getting Persistent Connection using socketStream?

2014-07-10 Thread Tathagata Das
Right, this uses NextIterator, which is elsewhere in the repo. It just makes it cleaner to implement a custom iterator. But I guess you got the high-level point, so it's okay. TD On Thu, Jul 10, 2014 at 7:21 PM, kytay kaiyang@gmail.com wrote: Hi TD Thanks. I have problem understanding

Re: Join two Spark Streaming

2014-07-10 Thread Tathagata Das
Do you want to continuously maintain the set of unique integers seen since the beginning of the stream? var uniqueValuesRDD: RDD[Int] = ... dstreamOfIntegers.transform(newDataRDD => { val newUniqueValuesRDD = newDataRDD.union(distinctValues).distinct uniqueValuesRDD = newUniqueValuesRDD //
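
A cleaned-up sketch of that snippet, with the undefined distinctValues reference replaced by uniqueValuesRDD (presumably what was meant); checkpointing keeps the lineage of the accumulated RDD from growing without bound and requires a checkpoint directory to be set:

    var uniqueValuesRDD: RDD[Int] = ssc.sparkContext.parallelize(Seq.empty[Int])

    val uniqueIntegers = dstreamOfIntegers.transform { newDataRDD =>
      val newUniqueValuesRDD = newDataRDD.union(uniqueValuesRDD).distinct()
      uniqueValuesRDD = newUniqueValuesRDD
      uniqueValuesRDD.checkpoint()   // every batch here for simplicity; every N batches in practice
      uniqueValuesRDD
    }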

Re: Spark Streaming - two questions about the streamingcontext

2014-07-09 Thread Tathagata Das
1. Multiple output operations are processed in the order they are defined. That is because by default only one output operation is processed at a time. This *can* be parallelized using an undocumented config parameter spark.streaming.concurrentJobs, which is by default set to 1. 2. Yes, the output
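
Setting that parameter is just an ordinary config entry, e.g. (app name made up):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("my-streaming-app")
      .set("spark.streaming.concurrentJobs", "2")   // allow 2 output operations to run concurrently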

Re: Could not compute split, block not found

2014-07-01 Thread Tathagata Das
Are you by any chance using only memory in the storage level of the input streams? TD On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, let's say the processing time is t' and the window size t. Spark does not *require* t' < t. In fact, for *temporary* peaks in

Re: Possible approaches for adding extra metadata (Spark Streaming)?

2014-06-20 Thread Tathagata Das
If the metadata is directly related to each individual record, then it can be done either way. Since I am not sure how easy or hard it will be for you to add tags before putting the data into spark streaming, it's hard to recommend one method over the other. However, if the metadata is related to

Re: Long running Spark Streaming Job increasing executing time per batch

2014-06-19 Thread Tathagata Das
That's quite odd. Yes, with checkpointing the lineage does not increase. Can you tell which stage of the processing of each batch is causing the increase in the processing time? Also, what are the batch interval and checkpoint interval? TD On Thu, Jun 19, 2014 at 8:45 AM, Skogberg, Fredrik

Re: Issue while trying to aggregate with a sliding window

2014-06-19 Thread Tathagata Das
zeroTime marks the time when the streaming job started, and the first batch of data is from zeroTime to zeroTime + slideDuration. The validity check of (time - zeroTime) being a multiple of slideDuration is to ensure that, for a given dstream, it generates RDDs at the right times. For example, say the

Re: How to achieve reasonable performance on Spark Streaming?

2014-06-19 Thread Tathagata Das
Hello all, Apologies for the late response, this thread went below my radar. There are a number of things that can be done to improve the performance. Here are some of them off the top of my head, based on what you have mentioned. Most of them are mentioned in the streaming guide's performance

Re: Spark Streaming not processing file with particular number of entries

2014-06-13 Thread Tathagata Das
This is very odd. If it is running fine on mesos, I don't see an obvious reason why it won't work on a Spark standalone cluster. Is the .4 million-entry file already present in the monitored directory when the context is started? In that case, the file will not be picked up (unless textFileStream is created

Re: running Spark Streaming just once and stop it

2014-06-12 Thread Tathagata Das
You should be able to see the Streaming tab in the Spark web UI (running on port 4040) if you have created a StreamingContext and you are using Spark 1.0. TD On Thu, Jun 12, 2014 at 1:06 AM, Ravi Hemnani raviiihemn...@gmail.com wrote: Hey, I did
