Could you collect debug-level logs and send them to us? Without logs it's hard
to speculate about anything. :)
TD
On Sat, Jul 19, 2014 at 2:39 PM, boci boci.b...@gmail.com wrote:
Hi guys!
I've run out of ideas... I created a Spark Streaming job (Kafka -> Spark ->
ES).
If I start my app on my local machine (inside
yanfang...@gmail.com wrote:
Thank you, TD. This is important information for us. Will keep an eye on
that.
Cheers,
Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108
On Thu, Jul 17, 2014 at 6:54 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Yes, this is the limitation
in
general. I tried to explore the link you provided but could not find any
specific JIRA related to this. Do you have the JIRA number for it?
On Thu, Jul 17, 2014 at 9:21 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
You can create multiple Kafka streams to partition your topics
Dang! Messed it up again!
JIRA: https://issues.apache.org/jira/browse/SPARK-1341
Github PR: https://github.com/apache/spark/pull/945/files
On Fri, Jul 18, 2014 at 11:35 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Oops, wrong link!
JIRA: https://github.com/apache/spark/pull/945
If you want to process data that spans weeks, then it is best to use a
dedicated data store (file system, SQL / NoSQL database, etc.) that is
designed for long-term data storage and retrieval. Spark Streaming is not
designed to be a long-term data store. Also, it does not seem like you need low
That's a good question. My first thought is a timeout. Timing out after tens of
seconds should be sufficient. So there should be a timer in the singleton
that runs a check every second on when the singleton was last used, and
closes the connections after a timeout. Any attempts to use the connection
this helps!
TD
On Fri, Jul 18, 2014 at 8:14 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
That's a good question. My first thought is a timeout. Timing out after tens of
seconds should be sufficient. So there should be a timer in the singleton
that runs a check every second on when
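For illustration, a rough sketch of that lazily-initialized singleton with an idle-timeout check (the JDBC URL and the 30-second limit are hypothetical; this is not the thread's actual code):

import java.sql.{Connection, DriverManager}

object ConnectionManager {
  private var connection: Connection = null
  private var lastUsed = 0L

  // background daemon timer: every second, close the connection if idle for > 30s
  private val timer = new java.util.Timer(true)
  timer.schedule(new java.util.TimerTask {
    override def run(): Unit = ConnectionManager.synchronized {
      if (connection != null && System.currentTimeMillis() - lastUsed > 30000) {
        connection.close()
        connection = null
      }
    }
  }, 1000, 1000)

  def get(): Connection = this.synchronized {
    if (connection == null) connection = DriverManager.getConnection("jdbc:...")  // hypothetical URL
    lastUsed = System.currentTimeMillis()
    connection
  }
}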
Dongchuan Road, Minhang District, Shanghai, 200240
Email:wh.s...@gmail.com
On Thu, Jul 17, 2014 at 2:58 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Is the class that is not found in the WikipediaPageRank jar?
TD
On Wed, Jul 16, 2014 at 12:32 AM, Hao Wang wh.s...@gmail.com
1. You can put multiple Kafka topics in the same Kafka input stream. See
the example KafkaWordCount
https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala
.
However, they will all be read
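For illustration, a minimal sketch of one Kafka input stream subscribed to several topics (the ZooKeeper quorum, group id and topic names are hypothetical; assumes an existing StreamingContext ssc):

import org.apache.spark.streaming.kafka.KafkaUtils

// map of topic -> number of consumer threads for that topic
val topics = Map("topicA" -> 1, "topicB" -> 1)
val kafkaStream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group", topics)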
For accessing previous version, I would do it the same way. :)
1. Can you elaborate on what you mean by that with an example? What do you
mean by accessing keys?
2. Yeah, that is hard to do without the ability to do point lookups into an
RDD, which we don't support yet. You could try embedding the
This is a basic Scala problem. You cannot apply toInt to Any. Try doing
toString.toInt
For such Scala issues, I recommend trying them out in the Scala shell. For
example, you could have tried this out as follows.
[tdas @ Xion streaming] scala
Welcome to Scala version 2.10.3 (Java HotSpot(TM)
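For illustration, a hypothetical continuation of such a shell session (the map and its contents are made up):

scala> val data: Map[String, Any] = Map("score" -> "85")
data: Map[String,Any] = Map(score -> 85)

scala> data("score").toInt
<console>:9: error: value toInt is not a member of Any

scala> data("score").toString.toInt
res0: Int = 85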
, Laeeq Ahmed laeeqsp...@yahoo.com
wrote:
Hi,
Thanks I will try to implement it.
Regards,
Laeeq
On Saturday, July 12, 2014 4:37 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
This is not in the current streaming API.
Queue stream is useful for testing with generated RDDs
yanfang...@gmail.com
+1 (206) 849-4108
On Thu, Jul 17, 2014 at 11:36 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
For accessing previous version, I would do it the same way. :)
1. Can you elaborate on what you mean by that with an example? What do
you mean by accessing keys?
2. Yeah
And if Marcelo's guess is correct, then the right way to do this would be
to lazily / dynamically create the JDBC connection server as a singleton
in the workers/executors and use that. Something like this.
dstream.foreachRDD(rdd => {
rdd.foreachPartition((iterator: Iterator[...]) => {
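A rough, hedged sketch of what that could look like (the ConnectionPool object and JDBC URL are hypothetical; error handling and the idle timeout discussed elsewhere in this thread are omitted):

import java.sql.{Connection, DriverManager}

object ConnectionPool {
  // created lazily, once per executor JVM, and reused across batches
  lazy val connection: Connection = DriverManager.getConnection("jdbc:...")  // hypothetical URL
}

dstream.foreachRDD(rdd => {
  rdd.foreachPartition(iterator => {
    val conn = ConnectionPool.connection
    iterator.foreach(record => {
      // write each record using conn (e.g. via a PreparedStatement)
    })
  })
})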
Is it possible to create partitions while storing data directly from
Kafka to Parquet files?
(likewise created in the above query)
On Thu, Jul 17, 2014 at 12:35 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
1. You can put multiple Kafka topics in the same Kafka input stream.
See the example
Can you check the Environment tab of the Spark web UI to see whether this
configuration parameter is in effect?
TD
On Thu, Jul 17, 2014 at 2:05 PM, Chen Song chen.song...@gmail.com wrote:
I am using spark 0.9.0 and I am able to submit job to YARN,
You can create multiple Kafka streams to partition your topics across them,
which will run multiple receivers on multiple executors. This is covered in
the Spark Streaming guide.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving
And for the
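For illustration, a minimal sketch of creating several Kafka streams and unioning them (the parameters are hypothetical; assumes an existing StreamingContext ssc):

import org.apache.spark.streaming.kafka.KafkaUtils

val numReceivers = 3
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group", Map("topicA" -> 1))
}
// each stream runs its own receiver; union them to process the data as one DStream
val unifiedStream = ssc.union(kafkaStreams)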
Step 1: Use rdd.mapPartitions(). That's equivalent to the mapper of
MapReduce. You can open a connection, get all the data and buffer it, close
the connection, and return an iterator over the buffer.
Step 2: Make step 1 better by making it reuse connections. You can use
singletons / static vars to lazily
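A rough sketch of Step 1 (openConnection and conn.lookup are hypothetical helpers; the point is that the connection is opened and closed once per partition, and the results are buffered before the iterator is returned):

rdd.mapPartitions { partition =>
  val conn = openConnection()                                     // hypothetical: connect to the external store
  val buffered = partition.map(key => conn.lookup(key)).toList    // pull everything into memory
  conn.close()                                                    // safe to close: results are already buffered
  buffered.iterator
}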
...@gmail.com
+1 (206) 849-4108
On Thu, Jul 17, 2014 at 1:47 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
The update function given to updateStateByKey should be called on ALL the
keys that are in the state, even if there is no new data in the batch for some
key. Is that not the behavior
hdfs://192.168.1.12:9000/freebase-26G 1 200 True
Regards,
Wang Hao(王灏)
CloudTeam | School of Software Engineering
Shanghai Jiao Tong University
Address:800 Dongchuan Road, Minhang District, Shanghai, 200240
Email:wh.s...@gmail.com
On Wed, Jul 16, 2014 at 1:41 PM, Tathagata Das
One way in which that is currently possible is given here
http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAMwrk0=b38dewysliwyc6hmze8tty8innbw6ixatnd1ue2-...@mail.gmail.com%3E
On Wed, Jul 16, 2014 at 1:16 AM, Gerard Maas gerard.m...@gmail.com wrote:
Hi Sargun,
There have
wrote:
Hi,
thanks for creating the issue. It feels like in the last week, more or
less half of the questions about Spark Streaming were rooted in setting the
master to local ;-)
Tobias
On Wed, Jul 16, 2014 at 11:03 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Aah, right, copied
I hope it all works :)
On Wed, Jul 16, 2014 at 9:08 AM, gorenuru goren...@gmail.com wrote:
Hi and thank you for your reply.
Looks like it's possible. It looks like a hack to me, because we specify the
batch duration when creating the context. This means that if we
specify a batch
I think I know what the problem is. Spark Streaming is constantly doing
garbage cleanup, throwing away data that it does not need based on the
operations in the DStream. Here the DStream operations are not aware of the
Spark SQL queries that are happening asynchronously to Spark Streaming. So data
is
Have you taken a look at DStream.transformWith( ... )? That allows you to
apply arbitrary transformations between RDDs (of the same timestamp) of two
different streams.
So you can do something like this.
2s-window-stream.transformWith(1s-window-stream, (rdd1: RDD[...], rdd2:
RDD[...]) => {
...
//
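For illustration, a compilable sketch of the same idea (lines is an assumed DStream[String]; the RDD operation in the body is made up):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Seconds

val twoSecWindow = lines.window(Seconds(2))
val oneSecWindow = lines.window(Seconds(1))
val combined = twoSecWindow.transformWith(oneSecWindow, (rdd1: RDD[String], rdd2: RDD[String]) => {
  rdd1.union(rdd2)   // any RDD-to-RDD transformation of the two same-timestamp RDDs goes here
})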
Answers inline.
On Wed, Jul 16, 2014 at 5:39 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi all,
I am currently using Spark Streaming to conduct real-time data
analytics. We receive data from Kafka. We want to generate output files
that contain results that are based on the data we
, Tathagata Das
tathagata.das1...@gmail.com wrote:
Oh yes, this was a bug and it has been fixed. Check out from the master
branch!
https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY
cluster (spark-submit), and I couldn't
find any output in the YARN stdout logs
On Mon, Jul 14, 2014 at 6:25 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Can you make sure you are running locally with more than one local core? You
could set the master in the SparkConf as conf.setMaster
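For example (a sketch; the app name is hypothetical), "local[2]" gives the receiver and the processing at least one core each:

import org.apache.spark.SparkConf

val conf = new SparkConf().setMaster("local[2]").setAppName("MyStreamingApp")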
I see you have the code to convert to the Record class but commented it out.
That is the right way to go. When you are converting it to a 4-tuple with
(data(type), data(name), data(score), data(school)) ... it's of type
(Any, Any, Any, Any), as data(xyz) returns Any. And registerAsTable
probably doesn't
This sounds really really weird. Can you give me a piece of code that I can
run to reproduce this issue myself?
TD
On Tue, Jul 15, 2014 at 12:02 AM, Walrus theCat walrusthe...@gmail.com
wrote:
This is (obviously) spark streaming, by the way.
On Mon, Jul 14, 2014 at 8:27 PM, Walrus theCat
The way the HDFS file writing works at a high level is that each attempt to
write a partition to a file starts by writing to a unique temporary file (say,
something like targetDirectory/_temp/part-X_attempt-). If the
writing to the file completes successfully, then the temporary file is
moved
I am very curious though. Can you post a concise code example which we can
run to reproduce this problem?
TD
On Tue, Jul 15, 2014 at 2:54 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
I am not entirely sure off the top of my head. But a possible workaround
(that usually works) is to define
Also, it helps if you post logs, stack traces, exceptions, etc.
TD
On Tue, Jul 15, 2014 at 10:07 AM, Jerry Lam chiling...@gmail.com wrote:
Hi Rajesh,
I have a feeling that this is not directly related to spark but I might be
wrong. The reason why is that when you do:
Configuration
Creating multiple StreamingContexts using the same SparkContext is
currently not supported. :)
Guess it was not clear in the docs. Note to self.
TD
On Tue, Jul 15, 2014 at 1:50 PM, gorenuru goren...@gmail.com wrote:
Hi everyone.
I have some problems running multiple streams at the same
Why do you need to create multiple streaming contexts at all?
TD
On Tue, Jul 15, 2014 at 3:43 PM, gorenuru goren...@gmail.com wrote:
Oh, sad to hear that :(
From my point of view, creating a separate Spark context for each stream is
too
expensive.
Also, it's annoying because we have to be
Yes, what Nick said is the recommended way. In most use cases, a Spark
Streaming program in production is not usually run from the shell. Hence,
we chose not to make the external stuff (Twitter, Kafka, etc.) available to
the Spark shell, to avoid the dependency conflicts they bring in with Spark's
You need to have
import sqlContext._
so just uncomment that and it should work.
TD
On Tue, Jul 15, 2014 at 1:40 PM, srinivas kusamsrini...@gmail.com wrote:
I am still getting the error... even if I convert it to Record
object KafkaWordCount {
def main(args: Array[String]) {
if
, Jul 15, 2014 at 12:48 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Could you run it locally first to make sure it works, and you see
output? Also, I recommend going through the previous step-by-step approach
to narrow down where the problem is.
TD
On Mon, Jul 14, 2014 at 9:15 PM
On Tue, Jul 15, 2014 at 5:22 PM, gorenuru goren...@gmail.com wrote:
Because I want to have different streams with different durations.
For example, one triggers snapshot analysis every 5 minutes and another
every 10 seconds
On Tue, Jul 15, 2014 at 3:59 pm, Tathagata Das [via Apache Spark User
A quick Google search of that exception says it occurs when there is an
error in the initialization of static methods. It could be some issue related
to how dissection is defined. Maybe try putting the function in a different
static class that is unrelated to the Main class, which may have other
mode
On Mon, Jul 14, 2014 at 3:57 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
The problem is not really with local[1] or local. The problem arises when
there are more input streams than there are cores.
But I agree, for people who are just beginning to use it by running it
locally
Can you try defining the case class outside the main function, in fact
outside the object?
TD
On Tue, Jul 15, 2014 at 8:20 PM, srinivas kusamsrini...@gmail.com wrote:
Hi TD,
I uncomment import sqlContext._ and tried to compile the code
import java.util.Properties
import kafka.producer._
:53 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Oh yes, we have run SQL, streaming and MLlib all together.
You can take a look at the demo https://databricks.com/cloud that
Databricks gave at the Spark Summit.
I think I see what the problem is. sql() returns an RDD, and println(rdd
Are you using classes from external libraries that have not been added to
the SparkContext using sparkContext.addJar()?
TD
On Tue, Jul 15, 2014 at 8:36 PM, Hao Wang wh.s...@gmail.com wrote:
I am running the WikipediaPageRank example in Spark and have the same
problem as you:
4/07/16
());
System.out.println("LOG RECORD = " +
logRecord);
"I was trying to write the data to hdfs.. but
it fails…"
*From:* Tathagata Das [mailto:tathagata.das1...@gmail.com]
*Sent:* Friday, July 11, 2014 1:43 PM
*To:* user@spark.apache.org
*Cc:* u
You have to import StreamingContext._ to enable groupByKey operations on
DStreams. After importing that, you can apply groupByKey on any DStream
that is a DStream of key-value pairs (e.g. DStream[(String, Int)]). The
data in each pair RDD will be grouped by the first element in the tuple as
the
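For illustration, a minimal sketch (words is an assumed DStream[String]):

import org.apache.spark.streaming.StreamingContext._   // brings in the pair-DStream operations

val pairs = words.map(word => (word, 1))
val grouped = pairs.groupByKey()   // values are grouped by the first element of each tuple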
of the tasks have been completed but the Stage is
still shown as Active?
I didn't keep the driver's log. Lesson learned.
I will try to run it again to see if it happens again.
--
*From:* Tathagata Das [mailto:tathagata.das1...@gmail.com]
*Sent:* July 10, 2014 17:29
.
Mans
On Friday, July 11, 2014 4:38 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
The model for file stream is to pick up and process new files written
atomically (by a move) into a directory. So your file is being processed in a
single batch, and then it is waiting for any new files
When you are sending data using simple socket code to send messages, are
those messages \n delimited? If not, the receiver of
socketTextStream won't identify them as separate events, and will keep buffering
them.
TD
On Sun, Jul 13, 2014 at 10:49 PM, kytay kaiyang@gmail.com wrote:
Hi
Are you compiling it within Spark using Spark's recommended way (see the docs
web page)? Or are you compiling it in your own project? In the latter case,
make sure you are using Scala 2.10.4.
TD
On Sun, Jul 13, 2014 at 6:43 AM, Mahebub Sayyed mahebub...@gmail.com
wrote:
Hello,
I am referring
no count.
Thanks
On Sun, Jul 13, 2014 at 11:34 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Try doing DStream.foreachRDD and then printing the RDD count and
further inspecting the RDD.
On Jul 13, 2014 1:03 AM, Walrus theCat walrusthe...@gmail.com
wrote:
Hi,
I have a DStream
in the reduce task
(combineByKey). Even with the first batch, which used more than 80
executors, it took 2.4 minutes to finish the reduce stage for a very small
amount of data.
Bill
On Mon, Jul 14, 2014 at 12:30 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
After using repartition(300), how
In general, it may be a better idea to actually convert the records from
hashmaps to a specific data structure. Say
case class Record(id: Int, name: String, mobile: String, score: Int,
test_type: String ... )
Then you should be able to do something like
val records = jsonf.map(m =>
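A hedged sketch of how that conversion could look, assuming jsonf is a stream of Map[String, Any] with the field names used above:

case class Record(id: Int, name: String, mobile: String, score: Int, test_type: String)

val records = jsonf.map(m =>
  Record(m("id").toString.toInt,
         m("name").toString,
         m("mobile").toString,
         m("score").toString.toInt,
         m("test_type").toString))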
The Twitter functionality is not available through the shell.
1) We separated these non-core functionalities into separate subprojects so
that their dependencies do not collide with / pollute those of core Spark.
2) A shell is not really the best way to start a long-running stream.
It's best to use
Trying to answer your questions as concisely as possible.
1. In the current implementation, the entire state RDD needs to be loaded for
any update. It is a known limitation that we want to overcome in the
future. Therefore the state DStream should not be persisted to disk, as all
the data in the state
Could you elaborate on what is the problem you are facing? Compiler error?
Runtime error? Class-not-found error? Not receiving any data from Kafka?
Receiving data but SQL command throwing error? No errors but no output
either?
TD
On Mon, Jul 14, 2014 at 4:06 PM, hsy...@gmail.com
Why doesn't something like this work? If you want a continuously updated
reference to the top counts, you can use a global variable.
var topCounts: Array[(String, Int)] = null
sortedCounts.foreachRDD { rdd =>
val currentTopCounts = rdd.take(10)
// print currentTopCounts or whatever
topCounts = currentTopCounts
}
Oh yes, this was a bug and it has been fixed. Check out from the master
branch!
https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20created%20DESC%2C%20priority%20ASC
TD
On Mon, Jul
at 4:59 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Could you elaborate on what is the problem you are facing? Compiler
error? Runtime error? Class-not-found error? Not receiving any data from
Kafka? Receiving data but SQL command throwing error? No errors but no
output either?
TD
I guess this is not clearly documented. At a high level, any class that is
in the package
org.apache.spark.streaming.XXX, where XXX is in { twitter, kafka, flume,
zeromq, mqtt },
is not available in the Spark shell.
I have added this to the larger JIRA of things-to-add-to-streaming-docs
raising dependency-related concerns into the core of spark streaming.
TD
On Mon, Jul 14, 2014 at 6:29 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
On Mon, Jul 14, 2014 at 6:52 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
The twitter functionality is not available through
nicholas.cham...@gmail.com wrote:
If we're talking about the issue you captured in SPARK-2464
https://issues.apache.org/jira/browse/SPARK-2464, then it was a newly
launched EC2 cluster on 1.0.1.
On Mon, Jul 14, 2014 at 10:48 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Did you make any
Try doing DStream.foreachRDD and then printing the RDD count and further
inspecting the RDD.
On Jul 13, 2014 1:03 AM, Walrus theCat walrusthe...@gmail.com wrote:
Hi,
I have a DStream that works just fine when I say:
dstream.print
If I say:
dstream.map((_, 1)).print
that works, too.
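A minimal sketch of the debugging step suggested above:

dstream.foreachRDD { rdd =>
  println("batch count = " + rdd.count())   // confirm whether the batch actually contains data
  rdd.take(5).foreach(println)              // inspect a few elements
}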
In case you still have issues with duplicate files in the uber jar, here is a
reference sbt file with the assembly plugin that deals with duplicates
https://github.com/databricks/training/blob/sparkSummit2014/streaming/scala/build.sbt
On Fri, Jul 11, 2014 at 10:06 AM, Bill Jay
at 5:48 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
In case you still have issues with duplicate files in the uber jar, here is a
reference sbt file with the assembly plugin that deals with duplicates
https://github.com/databricks/training/blob/sparkSummit2014/streaming/scala/build.sbt
Yes, that's a bug I just discovered. Race condition in the Twitter receiver;
will fix ASAP.
Here is the JIRA https://issues.apache.org/jira/browse/SPARK-2464
TD
On Sat, Jul 12, 2014 at 3:21 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
To add a potentially relevant piece of
instead of the cumulative number of unique integers.
I do have two questions about your code:
1. Why do we need uniqueValuesRDD? Why do we need to call
uniqueValuesRDD.checkpoint()?
2. Where is distinctValues defined?
Bill
On Thu, Jul 10, 2014 at 8:46 PM, Tathagata Das
tathagata.das1
for this.
Thanks.
On Thursday, July 10, 2014 7:24 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
How are you supplying the text file?
On Wed, Jul 9, 2014 at 11:51 AM, M Singh mans6si...@yahoo.com wrote:
Hi Folks:
I am working on an application which uses spark streaming (version
What is the error you are getting when you say "I was trying to write the
data to hdfs.. but it fails…"?
TD
On Thu, Jul 10, 2014 at 1:36 PM, Sundaram, Muthu X.
muthu.x.sundaram@sabre.com wrote:
I am new to Spark. I am trying to do the following:
Netcat -> Flume -> Spark Streaming (process Flume
...@yahoo.com wrote:
So, is it expected for the process to generate stages/tasks even after
processing a file ?
Also, is there a way to figure out the file that is getting processed and
when that process is complete ?
Thanks
On Friday, July 11, 2014 1:51 PM, Tathagata Das
tathagata.das1
The same executor can be used for both receiving and processing,
irrespective of the deployment mode (YARN, Spark standalone, etc.). It boils
down to the number of cores / task slots that the executor has. Each receiver
is like a long-running task, so each of them occupies a slot. If there are
free slots
that the computation can be efficiently finished. I am
not sure how to achieve this.
Thanks!
Bill
On Thu, Jul 10, 2014 at 5:59 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Can you try setting the number of partitions explicitly in all the shuffle-based
DStream operations? It may
You don't get any exception from twitter.com, saying credential error or
something?
I have seen this happen when someone was behind a VPN to their office, and
probably Twitter was blocked from their office.
You could be having a similar issue.
TD
On Fri, Jul 11, 2014 at 2:57 PM, SK
Yes, even though we don't have immediate plans, I would definitely like to
see it happen sometime in the not-so-distant future.
TD
On Thu, Jul 10, 2014 at 7:55 PM, Shao, Saisai saisai.s...@intel.com wrote:
No specific plans to do so, since there would be some functional loss, like
time-based
I totally agree that if we are able to do this it will be very cool.
However, this requires having a common trait / interface between RDD and
DStream, which we don't have as of now. It would be very cool though. On my
wish list for sure.
TD
On Thu, Jul 10, 2014 at 11:53 AM, mshah
Does nothing get printed on the screen? If you are not getting any tweets
but Spark Streaming is running successfully, you should at least get counts
being printed every batch (which would be zero). If they are not being
printed either, check the Spark web UI to see whether stages are running or not. If
?
Best,
Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108
On Fri, Jul 11, 2014 at 1:45 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
The same executor can be used for both receiving and processing,
irrespective of the deployment mode (YARN, Spark standalone, etc.). It boils
down
days. No findings on why there are
always 2 executors for the groupBy stage. Thanks a lot!
Bill
On Fri, Jul 11, 2014 at 1:50 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Can you show us the program that you are running. If you are setting
number of partitions in the XYZ-ByKey operation
This is not in the current streaming API.
Queue stream is useful for testing with generated RDDs, but not for actual
data. For an actual data stream, the slack time can be implemented by doing
DStream.window over a larger window that takes the slack time into consideration,
and then the required
know it's just an intermediate hack, but still ;-)
greetz,
aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]
http://about.me/noootsab
On Sat, Jul 12, 2014 at 12:57 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
I totally agree that if we are able to do
I confirm that is indeed the case. It is designed to be so because it
keeps things simpler - fewer chances of issues related to cleanup when
stop() is called. Also, it keeps things consistent with the Spark context -
once a Spark context is stopped it cannot be used any more.
You can create a new
Do you see any errors in the logs of the driver?
On Thu, Jul 10, 2014 at 1:21 AM, Haopu Wang hw...@qilinsoft.com wrote:
I'm running an App for hours in a standalone cluster. From the data
injector and Streaming tab of web ui, it's running well.
However, I see quite a lot of Active stages in
I ran the SparkKMeans example (not the MLlib KMeans that Sean ran) with
your dataset as well, and I got the expected answer. And I believe that even
though initialization is done using sampling, the example actually sets the
seed to a constant 42, so the result should always be the same no matter
how
The implementation of the input-stream-to-iterator function in #2 is
incorrect. The function should be such that, when hasNext is called on
the iterator, it should try to read from the buffered reader. If an object
(that is, a line) can be read, then return it; otherwise block and wait for
data
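A rough sketch of that idea (simplified; a real implementation, like the NextIterator mentioned later in this thread, also needs to handle end-of-stream and socket errors cleanly):

import java.io.BufferedReader

def toIterator(reader: BufferedReader): Iterator[String] = new Iterator[String] {
  private var nextLine: String = null
  override def hasNext: Boolean = {
    if (nextLine == null) nextLine = reader.readLine()   // blocks until a line is available
    nextLine != null                                     // null means the stream has ended
  }
  override def next(): String = {
    if (!hasNext) throw new NoSuchElementException("end of stream")
    val line = nextLine
    nextLine = null
    line
  }
}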
Spark Streaming uses twitter4j 3.0.3; 3.0.6 should probably work fine. The
exception that you are seeing is something that should be looked into. Can
you give us more logs (especially executor logs) with the stack traces that have
the error.
TD
On Thu, Jul 10, 2014 at 2:42 PM, Nick Chammas
Are you specifying the number of reducers in all the DStream ...ByKey
operations? If the number of reducers is not set, then the number of reducers
used in the stages can keep changing across batches.
TD
On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay bill.jaypeter...@gmail.com wrote:
Hi all,
I have a
This bug has been fixed. Either use the master branch of Spark, or maybe
wait a few days for Spark 1.0.1 to be released (voting has successfully
closed).
TD
On Thu, Jul 10, 2014 at 2:33 AM, richiesgr richie...@gmail.com wrote:
Hi
I get exactly the same problem here; have you found the
The fileStream is not designed to work with continuously updating files, as
one of the main design goals of Spark is immutability (to guarantee
fault-tolerance by recomputation), and files that are appended to (mutating)
defeat that. It is rather designed to pick up new files added atomically
(using
then 3 minutes. Thanks!
Bill
On Thu, Jul 10, 2014 at 4:28 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Are you specifying the number of reducers in all the DStream.ByKey
operations? If the reduce by key is not set, then the number of reducers
used in the stages can keep
Yeah, the right solution is to have something like SchemaDStream, where the
schema of all the SchemaRDDs generated by it can be stored. Something I
would really like to see happen in the future :)
TD
On Thu, Jul 10, 2014 at 6:37 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
I think it
Right, this uses NextIterator, which is elsewhere in the repo. It just makes
it cleaner to implement a custom iterator. But I guess you got the high-level
point, so it's okay.
TD
On Thu, Jul 10, 2014 at 7:21 PM, kytay kaiyang@gmail.com wrote:
Hi TD
Thanks.
I have problem understanding
Do you want to continuously maintain the set of unique integers seen since
the beginning of the stream?
var uniqueValuesRDD: RDD[Int] = ...
dstreamOfIntegers.transform(newDataRDD => {
val newUniqueValuesRDD = newDataRDD.union(distinctValues).distinct
uniqueValuesRDD = newUniqueValuesRDD
//
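A fuller, hedged sketch of the same idea with consistent naming (distinctValues in the snippet above presumably refers to uniqueValuesRDD); ssc and dstreamOfIntegers are assumed, and the periodic checkpoint (which requires ssc.checkpoint(dir) to have been set) keeps the lineage from growing without bound:

var uniqueValuesRDD: RDD[Int] = ssc.sparkContext.parallelize(Seq.empty[Int])

val uniqueIntegers = dstreamOfIntegers.transform { newDataRDD =>
  val newUniqueValuesRDD = newDataRDD.union(uniqueValuesRDD).distinct()
  uniqueValuesRDD = newUniqueValuesRDD
  uniqueValuesRDD.checkpoint()   // truncate the lineage that otherwise grows every batch
  uniqueValuesRDD
}
uniqueIntegers.foreachRDD(rdd => println("unique integers so far: " + rdd.count()))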
1. Multiple output operations are processed in the order they are defined.
That is because, by default, only one output operation is processed at a
time. This *can* be parallelized using an undocumented config parameter,
spark.streaming.concurrentJobs, which is set to 1 by default.
2. Yes, the output
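For illustration, a sketch of setting that parameter (undocumented, so use with care; the app name is hypothetical):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MyStreamingApp")
  .set("spark.streaming.concurrentJobs", "2")   // allow 2 output operations to run concurrently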
Are you by any chance using only memory in the storage level of the input
streams?
TD
On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Bill,
let's say the processing time is t' and the window size t. Spark does not
*require* t' < t. In fact, for *temporary* peaks in
If the metadata is directly related to each individual record, then it can
be done either way. Since I am not sure how easy or hard it will be for
you to add tags before putting the data into Spark Streaming, it's hard to
recommend one method over the other.
However, if the metadata is related to
That's quite odd. Yes, with checkpointing the lineage does not keep increasing.
Can you tell which stage in the processing of each batch is causing the
increase in the processing time?
Also, what is the batch interval, and the checkpoint interval?
TD
On Thu, Jun 19, 2014 at 8:45 AM, Skogberg, Fredrik
zeroTime marks the time when the streaming job started, and the first batch
of data is from zeroTime to zeroTime + slideDuration. The validity check of
(time - zeroTime) being a multiple of slideDuration is to ensure that, for a
given dstream, it generates RDDs at the right times. For example, say the
Hello all,
Apologies for the late response, this thread went under my radar. There are
a number of things that can be done to improve the performance. Here are
some of them off the top of my head, based on what you have mentioned. Most
of them are mentioned in the streaming guide's performance
This is very odd. If it is running fine on Mesos, I don't see an obvious
reason why it won't work on a Spark standalone cluster.
Is the .4 million file already present in the monitored directory when the
context is started? In that case, the file will not be picked up (unless
textFileStream is created
You should be able to see the Streaming tab in the Spark web UI (running on
port 4040) if you have created a StreamingContext and you are using Spark 1.0.
TD
On Thu, Jun 12, 2014 at 1:06 AM, Ravi Hemnani raviiihemn...@gmail.com
wrote:
Hey,
I did