You should also check the NodeManager log; sometimes the failure happens because
your memory usage exceeds the limit configured for the YARN container.
I’ve hit a similar problem before; here is the warning from the NodeManager log:
2015-07-07 17:06:07,141 WARN
If you’re using WAL with Kafka, Spark Streaming will ignore this
configuration (auto.commit.enable) and explicitly call commitOffset to update
the offset in Kafka AFTER the WAL write is done.
No matter what you set for auto.commit.enable, internally Spark
Streaming will set it to false to turn off
Please see the inline comments.
From: Shushant Arora [mailto:shushantaror...@gmail.com]
Sent: Monday, July 6, 2015 8:51 PM
To: Shao, Saisai
Cc: user
Subject: Re: kafka offset commit in spark streaming 1.2
So if WAL is disabled, how can a developer commit offsets explicitly in a Spark
Streaming app
commit mechanism is actually timer-based, so it is
asynchronous with respect to replication.
From: Shushant Arora [mailto:shushantaror...@gmail.com]
Sent: Monday, July 6, 2015 8:30 PM
To: Shao, Saisai
Cc: user
Subject: Re: kafka offset commit in spark streaming 1.2
And what if I disable WAL and use
: Tuesday, June 9, 2015 5:28 PM
To: Shao, Saisai; user
Subject: RE: [SparkStreaming 1.3.0] Broadcast failure after setting
spark.cleaner.ttl
Jerry, I agree with you.
However, in my case, I kept monitoring the BlockManager folder. I do
sometimes see the number of files decrease, but the folder's
From the stack trace, I think this problem may be caused by deletion of the
broadcast variable. Because you set spark.cleaner.ttl, old broadcast variables
are deleted once that timeout expires, and you will hit this exception when you
try to use one again after the time limit.
Basically I think
I think you could use checkpoint to cut the lineage of `MyRDD`. I have a
similar scenario and use checkpoint to work around this problem :)
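As a rough sketch of what cutting the lineage with checkpoint could look like,
assuming an RDD that is updated repeatedly in a loop. The object name, the
update function, and the checkpoint directory are made up for illustration and
are not from the original thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CheckpointLineage {
  // Applies `rounds` incremental updates and checkpoints every `every` rounds,
  // so the lineage never grows beyond `every` steps.
  def updateWithCheckpoints(sc: SparkContext, rounds: Int, every: Int): RDD[(String, Int)] = {
    var state: RDD[(String, Int)] = sc.parallelize(Seq(("a", 0), ("b", 0)))
    for (i <- 1 to rounds) {
      state = state.mapValues(_ + 1) // each update adds one step to the lineage
      if (i % every == 0) {
        state.cache()      // avoids recomputing the RDD when the checkpoint job runs
        state.checkpoint() // only marks the RDD; must be called before the first action on it
        state.count()      // the action triggers the actual checkpoint write
      }
    }
    state
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-lineage").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Hypothetical path; on a cluster this should point to a reliable store such as HDFS.
    sc.setCheckpointDir("/tmp/spark-checkpoint")
    val result = updateWithCheckpoints(sc, rounds = 50, every = 10)
    println(result.collect().toMap)
    sc.stop()
  }
}
```

The checkpoint interval is a trade-off: checkpointing more often keeps the
lineage shorter but writes to the checkpoint directory more often.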
Thanks
Jerry
-Original Message-
From: yaochunnan [mailto:yaochun...@gmail.com]
Sent: Friday, May 8, 2015 1:57 PM
To: user@spark.apache.org
...@gmail.com]
Sent: Friday, May 8, 2015 2:51 PM
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: Re: Possible long lineage issue when using DStream to update a normal
RDD
Thank you for this suggestion! But may I ask what's the advantage to use
checkpoint instead of cache here? Cuz they both cut lineage
I don’t think this problem is related to Netty or NIO; switching to NIO will
not change the code path that fetches the index file for the sort-based shuffle
reader.
I think you could check your system from several aspects:
1. Is there any hardware problem like disk full or others which makes this
I think this paper will be a good resource
(https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf); the Dryad
paper is also a good one.
Thanks
Jerry
From: James King [mailto:jakwebin...@gmail.com]
Sent: Friday, April 17, 2015 3:26 PM
To: user
Subject: Spark Directed Acyclic
OK, there seems to be nothing strange in your code, so we need to narrow
down the cause. Would you please run the KafkaWordCount example in Spark to see
whether it works? If it does, we should focus on your implementation; otherwise
Kafka potentially has some problems.
Thanks
Jerry
From:
it with spark.shuffle.spill=false
Thanks
Best Regards
On Wed, Mar 18, 2015 at 3:39 PM, Darren Hoo
<darren@gmail.com> wrote:
Thanks, Shao
On Wed, Mar 18, 2015 at 3:34 PM, Shao, Saisai
<saisai.s...@intel.com> wrote:
Yeah, as I said your job
to add more
resources to your cluster.
Thanks
Jerry
From: Darren Hoo [mailto:darren@gmail.com]
Sent: Wednesday, March 18, 2015 3:24 PM
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: Re: [spark-streaming] can shuffle write to disk be disabled?
Hi, Saisai
Here is the duration of one
Please see the inline comments.
Thanks
Jerry
From: Darren Hoo [mailto:darren@gmail.com]
Sent: Wednesday, March 18, 2015 9:30 PM
To: Shao, Saisai
Cc: user@spark.apache.org; Akhil Das
Subject: Re: [spark-streaming] can shuffle write to disk be disabled?
On Wed, Mar 18, 2015 at 8:31 PM, Shao
Would you please check your driver log or the streaming web UI to see each
job's latency, including processing latency and total latency?
From your code, the sliding interval seems to be just 3 seconds, so you have to
process each 60 seconds' worth of data within 3 seconds; if processing latency
is larger than the
sliding
I think both ways are OK for writing a streaming job. `transform` is the more
general way to turn one DStream into another when there’s no corresponding
DStream API (but there is a corresponding RDD API). Using map may be more
straightforward and easier to understand.
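To make the comparison concrete, here is a small sketch of the same per-record
mapping written both ways, plus one case where only an RDD API is used. The
object and method names are hypothetical, and the input is assumed to be a
DStream[String].

```scala
import org.apache.spark.streaming.dstream.DStream

object TransformVsMap {
  // Direct DStream API: usually the more readable choice.
  def lengthsWithMap(lines: DStream[String]): DStream[Int] =
    lines.map(_.length)

  // The same result through transform, which exposes the RDD of each batch.
  def lengthsWithTransform(lines: DStream[String]): DStream[Int] =
    lines.transform(rdd => rdd.map(_.length))

  // A case that goes through the RDD API: sorting the records of each batch
  // with RDD.sortBy inside transform.
  def sortedPerBatch(lines: DStream[String]): DStream[String] =
    lines.transform(rdd => rdd.sortBy(line => line))
}
```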
Thanks
Jerry
I think you could change the pom file in the Spark project to update the
Tachyon dependency version and rebuild (provided the API is compatible and the
behavior is the same).
I'm not sure whether there is any command you can use to compile against a
specific Tachyon version.
Thanks
Jerry
From:
Hi Lin,
AFAIK, there's currently no built-in receiver API for RDBMSs, but you can
write your own custom receiver to get data from an RDBMS. For details, you can
refer to the docs.
Thanks
Jerry
From: Cui Lin [mailto:cui@hds.com]
Sent: Tuesday, March 10, 2015 8:36 AM
To: Tathagata Das
Cc:
Hi Du,
You could try sleeping for several seconds after creating the streaming context
to let all the executors register, so that the receivers are distributed across
the nodes more evenly. Setting locality is another way, as you mentioned.
Thanks
Jerry
From: Du Li
and assigned its hostname to each receiver.
Thanks
Jerry
From: Du Li [mailto:l...@yahoo-inc.com]
Sent: Thursday, March 5, 2015 2:29 PM
To: Shao, Saisai; User
Subject: Re: distribution of receivers in spark streaming
Hi Jerry,
Thanks for your response.
Is there a way to get the list of currently
Yes, if one key has too many values, there is still a chance of hitting the OOM.
Thanks
Jerry
From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Thursday, March 5, 2015 3:49 PM
To: Shao, Saisai
Cc: Cheng, Hao; user
Subject: Re: Having lots of FetchFailedException in join
I see. I'm using
Hi Jianshi,
From my understanding, it may not be a problem with NIO or Netty. Looking at
your stack trace, the OOM occurred in EAOM (ExternalAppendOnlyMap).
Theoretically EAOM can spill data to disk if memory is not enough, but
there might be some issues when the join key is skewed or key
to read the whole partition into memory. But if you use
SparkSQL, it depends on how SparkSQL uses these operators.
CC @hao if he has some thoughts on it.
Thanks
Jerry
From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Thursday, March 5, 2015 3:28 PM
To: Shao, Saisai
Cc: user
Subject: Re
Cool, great job☺.
Thanks
Jerry
From: Ryan Williams [mailto:ryan.blake.willi...@gmail.com]
Sent: Thursday, February 26, 2015 6:11 PM
To: user; d...@spark.apache.org
Subject: Monitoring Spark with Graphite and Grafana
If anyone is curious to try exporting Spark metrics to Graphite, I just
I don't think current Spark Streaming supports window operations that exceed
its available memory. Internally Spark Streaming keeps all the data that
belongs to the effective window in memory; if memory is not enough,
BlockManager will discard blocks according to an LRU policy, so something
If you call reduceByKey(), Spark will internally introduce a shuffle
operation no matter whether the data is already partitioned locally; Spark
itself does not know that the data is already well partitioned.
So if you want to avoid the shuffle, you have to write the code explicitly to
avoid this, from my
I think some RDD APIs like zipPartitions or others can do this as you wanted. I
might check the docs.
Thanks
Jerry
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Monday, February 23, 2015 1:35 PM
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: RE: Union and reduceByKey will trigger
I have no context on this book. AFAIK union will not trigger a shuffle, since
it just puts the partitions together; the reduceByKey() operator is what
actually triggers the shuffle.
Thanks
Jerry
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Monday, February 23, 2015 12:26 PM
To: Shao, Saisai
Cc
Hi Judy,
For driver, it is /metrics/json, there's no metricsServlet for executor.
Thanks
Jerry
From: Judy Nash [mailto:judyn...@exchange.microsoft.com]
Sent: Friday, February 6, 2015 3:47 PM
To: user@spark.apache.org
Subject: Spark Metrics Servlet for driver and executor
Hi all,
Looking at
Did you include the Kafka jars? This StringDecoder is under kafka/serializer. You
can refer to the unit test KafkaStreamSuite in Spark to see how to use this API.
Thanks
Jerry
From: Eduardo Costa Alfaia [mailto:e.costaalf...@unibs.it]
Sent: Friday, February 6, 2015 9:44 AM
To: Shao, Saisai
Cc: Sean
Hi,
I think you should change the `DefaultDecoder` in your type parameters to
`StringDecoder`, since it seems you want to decode the messages into String.
`DefaultDecoder` returns Array[Byte], not String, so the class cast
will fail here.
Thanks
Jerry
-Original Message-
From:
Hi all,
I have some questions about the future development of Spark's standalone
resource scheduler. We've heard some users have the requirements to have
multi-tenant support in standalone mode, like multi-user management, resource
management and isolation, whitelist of users. Seems current
Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Monday, February 2, 2015 4:49 PM
To: Shao, Saisai
Cc: d...@spark.apache.org; user@spark.apache.org
Subject: Re: Questions about Spark standalone resource scheduler
Hey Jerry,
I think standalone mode will still add more features
That's definitely a good supplement to the current Spark Streaming, I've heard
many guys want to process the data using log time. Looking forward to the code.
Thanks
Jerry
-Original Message-
From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Thursday, January 29, 2015 10:33
Aha, you’re right, I did a wrong comparison, the reason might be only for
checkpointing :).
Thanks
Jerry
From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Wednesday, January 28, 2015 10:39 AM
To: Shao, Saisai
Cc: user
Subject: Re: Why must the dstream.foreachRDD(...) parameter
Hey Tobias,
I think one consideration is checkpointing of the DStream, which guarantees
driver fault tolerance.
Also, this `foreachFunc` is more like an action function on an RDD; think of
rdd.foreach(func), in which `func` needs to be serializable. So I think
your way of using it is not a
Hi Larry,
I don’t think Spark’s current shuffle can support HDFS as a shuffle output.
Anyway, is there any specific reason to spill shuffle data to HDFS or NFS? This
will severely increase the shuffle time.
Thanks
Jerry
From: Larry Liu [mailto:larryli...@gmail.com]
Sent: Sunday, January 25,
No, the current RDD persistence mechanism does not support putting data on HDFS.
The directory is set by spark.local.dir.
Instead you can use checkpoint() to save the RDD on HDFS.
Thanks
Jerry
From: Larry Liu [mailto:larryli...@gmail.com]
Sent: Monday, January 26, 2015 3:08 PM
To: Charles Feduke
Cc:
, 2015 3:03 PM
To: Shao, Saisai
Cc: u...@spark.incubator.apache.org
Subject: Re: Shuffle to HDFS
Hi,Jerry
Thanks for your reply.
The reason I have this question is that in Hadoop, mapper intermediate output
(shuffle) will be stored in HDFS. I think the default location for spark is
/tmp I think
for you?
I think it’s better and easier for you to change your implementation rather than
rely on Spark to handle this.
Thanks
Jerry
From: Balakrishnan Narendran [mailto:balu.na...@gmail.com]
Sent: Friday, January 23, 2015 12:19 AM
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: Re: spark
Hi,
You have a very large window (24 hours), so the growth in memory usage is
expected, because the window operation caches the RDDs within this window in
memory. So for your requirement, memory should be large enough to hold 24 hours
of data.
I don't think checkpoint in Spark
Hi,
I don't think current Spark Streaming supports this feature; the whole DStream
lineage is fixed after the context is started.
Stopping a single stream is not supported either; currently we need to stop the
whole streaming context to do what you want.
Thanks
Saisai
-Original
Hi Jeff,
From my understanding it looks like a bug: since JavaDStreamLike is used
from Java code, returning a Scala DStream is not reasonable. You can fix this by
submitting a PR, or I can help you fix it.
Thanks
Jerry
From: Jeff Nadler [mailto:jnad...@srcginc.com]
Sent: Monday, January
I think there're two solutions:
1. Enable write ahead log in Spark Streaming if you're using Spark 1.2.
2. Using third-party Kafka consumer
(https://github.com/dibbhatt/kafka-spark-consumer).
Thanks
Saisai
-Original Message-
From: mykidong [mailto:mykid...@gmail.com]
Sent: Thursday,
I’m starting to understand your requirement. Maybe there are some limitations in
the current MetricsSystem; I think we can improve it as well.
Thanks
Jerry
From: Enno Shioji [mailto:eshi...@gmail.com]
Sent: Sunday, January 4, 2015 5:46 PM
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: Re: Better way
Hi,
I think there’s a StreamingSource in Spark Streaming which exposes the running
status of Spark Streaming to the metrics sinks; you can connect it with the
Graphite sink to expose metrics to Graphite. I’m not sure whether this is what
you want.
Besides you can customize the Source and Sink of the
Hi,
Hadoop Configuration is only Writable, not Java Serializable. You can use
SerializableWritable (in Spark) to wrap the Configuration to make it
serializable, and use a broadcast variable to broadcast this conf to all the
nodes; then you can use it in mapPartitions, rather than serialize it
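A minimal sketch of the wrapping and broadcasting described here, assuming a
Spark 1.x build where SerializableWritable is available as a developer API.
The object name, the method name, and the configuration key being read are
only illustrative.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SerializableWritable, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastHadoopConf {
  // Prefixes every line with a value read from the Hadoop Configuration on the executor.
  def tagWithDefaultFs(sc: SparkContext, input: RDD[String]): RDD[String] = {
    // Wrap the Writable-only Configuration so that it can be broadcast.
    val confBc = sc.broadcast(new SerializableWritable(sc.hadoopConfiguration))
    input.mapPartitions { iter =>
      // First .value unwraps the broadcast, second .value unwraps the SerializableWritable.
      val conf: Configuration = confBc.value.value
      // "fs.defaultFS" is just an example key; the fallback is used when it is unset.
      val fsName = conf.get("fs.defaultFS", "file:///")
      iter.map(line => s"$fsName\t$line")
    }
  }
}
```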
Hi,
We have a requirement to save RDD output to HDFS with a saveAsTextFile-like
API, but overwriting the data if it already exists. I'm not sure whether current
Spark supports this kind of operation, or whether I need to check this manually?
There's a thread on the mailing list that discussed this
Thanks Patrick for your detailed explanation.
BR
Jerry
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Thursday, December 25, 2014 3:43 PM
To: Cheng, Hao
Cc: Shao, Saisai; user@spark.apache.org; d...@spark.apache.org
Subject: Re: Question on saveAsTextFile
AFAIK, this will be a new feature in version 1.2, you can check out the master
branch or 1.2 branch to take a try.
Thanks
Jerry
From: Xiaoyong Zhu [mailto:xiaoy...@microsoft.com]
Sent: Monday, December 15, 2014 10:53 AM
To: user@spark.apache.org
Subject: Spark Streaming Python APIs?
Hi spark
Hi,
Acknowledging the offsets only once the RDD is fully processed is not trivial
work. From my understanding, modifying KafkaUtils alone is not
enough to meet your requirement; you need to add metadata management
for each block/RDD, and track them on both the executor and driver sides, and
Hi,
I don’t think it’s a problem with Spark Streaming. Judging from the call stack,
the problem occurs when BlockManager starts to initialize itself. Would you mind
checking your Spark configuration, hardware, and deployment? Mostly I
think it’s not a problem in Spark.
Thanks
Saisai
From:
Hi,
According to my knowledge of the current Spark Streaming Kafka connector, I think
there's no way for the application user to detect this kind of failure; it will
be handled either by the Kafka consumer with the ZK coordinator or by
ReceiverTracker in Spark Streaming, so I think you don't need to take care
Hi Rod,
The purpose of introducing the WAL mechanism in Spark Streaming as a general
solution is to let all receivers benefit from this mechanism.
Though as you said, external sources like Kafka have their own checkpoint
mechanism, instead of storing data in WAL, we can only store
, November 18, 2014 2:47 AM
To: Helena Edelson
Cc: Jay Vyas; u...@spark.incubator.apache.org; Tobias Pfeiffer; Shao, Saisai
Subject: Re: Spark streaming cannot receive any message from Kafka
Hi all,
I find the reason of this issue. It seems in the new version, if I do not
specify
Did you configure the Spark master as local? It should be local[n], n > 1, for
local mode. Besides, there’s a Kafka word count example among the Spark Streaming
examples that you can try. I’ve tested with latest master, it’s OK.
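For reference, a small sketch of setting the master to local[n] with n > 1.
With plain "local" there is only one thread, and a receiver occupies it, so
nothing is left to process the received data. The object name, application
name, and batch interval are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LocalStreamingMaster {
  // Builds a conf for local mode and rejects thread counts that would starve processing.
  def localConf(threads: Int): SparkConf = {
    require(threads > 1, "a receiver-based streaming job needs local[n] with n > 1")
    new SparkConf().setAppName("kafka-local-test").setMaster(s"local[$threads]")
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(localConf(2), Seconds(2))
    // Input streams and output operations would be defined here before ssc.start().
    ssc.stop()
  }
}
```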
Thanks
Jerry
From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Thursday,
Hi, would you mind describing your problem a little more specifically?
1. Does the Kafka broker currently have no data being fed in?
2. This code will print the lines, but not on the driver side; the code
runs on the executor side, so you can check the log in the worker dir to see if
there’s
From my understanding, the Spark code uses Kryo in a streaming manner for RDD
partitions; deserialization advances as the iterator moves forward. But
whether Kryo internally deserializes all the objects at once or incrementally is
actually a behavior of Kryo. I guess Kryo will not deserialize
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: Re: Kafka Consumer in Spark Streaming
The Kafka broker definitely has messages coming in. But your #2 point is
valid. Needless to say I am a newbie to Spark. I can't figure out where the
'executor' logs would be. How would I find them?
All I see
Hi Ryan,
This is an issue in sort-based shuffle, not consolidated hash-based shuffle.
I guess this issue mostly occurs when the Spark cluster is in an abnormal state,
maybe a long GC pause or something else. You can check the system status and
whether there are any other exceptions besides this one.
Hi Jianshi,
For simulation purposes, I think you can try ConstantInputDStream and
QueueInputDStream to convert one RDD or a series of RDDs into a DStream. The
first one outputs the same RDD in each batch duration, and the second one
outputs one RDD from a queue in each batch duration. You can take a
amount of data back to driver.
Thanks
Jerry
From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Monday, October 27, 2014 2:39 PM
To: Shao, Saisai
Cc: user@spark.apache.org; Tathagata Das (t...@databricks.com)
Subject: Re: RDD to DStream
Hi Saisai,
I understand it's non-trivial
cannot support nested RDD in closure.
Thanks
Jerry
From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Monday, October 27, 2014 3:30 PM
To: Shao, Saisai
Cc: user@spark.apache.org; Tathagata Das (t...@databricks.com)
Subject: Re: RDD to DStream
Ok, back to Scala code, I'm wondering why I
Yes, I understand what you want, but it may be hard to achieve without
collecting back to the driver node.
Besides, can we think of another way to do it?
Thanks
Jerry
From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Monday, October 27, 2014 4:07 PM
To: Shao, Saisai
Cc: user
Hi,
The problem you met is probably related to this JIRA ticket
(https://issues.apache.org/jira/browse/SPARK-3948). It's potentially a kernel
2.6.32 bug which makes sort-based shuffle fail. I'm not sure your problem
is the same as this one; would you mind checking your kernel version?
Thanks a lot, I will try to reproduce this in my local settings and dig into
the details, thanks for your information.
BR
Jerry
From: arthur.hk.c...@gmail.com [mailto:arthur.hk.c...@gmail.com]
Sent: Wednesday, October 22, 2014 8:35 PM
To: Shao, Saisai
Cc: arthur.hk.c...@gmail.com; user
Subject
...@gmail.com [mailto:arthur.hk.c...@gmail.com]
Sent: Thursday, October 23, 2014 11:32 AM
To: Shao, Saisai
Cc: arthur.hk.c...@gmail.com; user
Subject: Re: Spark Hive Snappy Error
Hi,
Please find the attached file.
my spark-default.xml
# Default system properties included when running spark-submit
Hi Song,
Here is what I know about sort-based shuffle.
Normally the number of files opened in parallel for sort-based shuffle is much
smaller than for hash-based shuffle.
In hash-based shuffle, the number of files opened in parallel is C * R (where C
is the number of cores used and R is the number of reducers), as you can see the file
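A quick back-of-the-envelope calculation of the C * R figure, as a sketch. The
sort-based number is an assumption on my side (roughly one data file per
running map task, ignoring spill and index files) and is not stated in the
message; the example values for C and R are arbitrary.

```scala
object ShuffleOpenFiles {
  // Hash-based shuffle: each running map task keeps one file open per reducer.
  def hashBased(cores: Int, reducers: Int): Long = cores.toLong * reducers

  // Assumption: sort-based shuffle writes about one data file per running map task.
  def sortBased(cores: Int): Long = cores.toLong

  def main(args: Array[String]): Unit = {
    val cores = 16      // C: cores running map tasks in parallel
    val reducers = 1000 // R: number of reduce partitions
    println(s"hash-based: ${hashBased(cores, reducers)} files open in parallel")
    println(s"sort-based: ${sortBased(cores)} files open in parallel")
  }
}
```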
Hi Arthur,
I think this is a known issue in Spark; you can check
(https://issues.apache.org/jira/browse/SPARK-3958). I’m curious about it: can
you always reproduce this issue? Is it related to some specific data
sets? Would you mind giving me some information about your workload, Spark
Hi Abraham,
You are correct: the “auto.offset.reset” behavior in KafkaReceiver differs
from the original Kafka semantics. If you set this configuration, KafkaReceiver will
clean the related immediately, but for Kafka this configuration is just a hint
which will be effective only when offset is
to keep the same semantics as Kafka, you need to remove the above
code path manually and recompile the Spark.
Thanks
Jerry
From: Abraham Jacob [mailto:abe.jac...@gmail.com]
Sent: Saturday, October 11, 2014 8:49 AM
To: Shao, Saisai
Cc: user@spark.apache.org; Sean McNamara
Subject: Re: Spark
Hi, I think you have to change the code to specify the type info,
like this:
val kafkaStream: ReceiverInputDStream[(String, String)] =
KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
ssc,
kafkaParams,
topicMap,
Hi
First conf is used for Hadoop to determine the locality distribution of HDFS
file. Second conf is used for Spark, though with the same name, actually they
are two different classes.
Thanks
Jerry
From: qinwei [mailto:wei@dewmobile.net]
Sent: Sunday, September 28, 2014 2:05 PM
To: user
Hi,
sortByKey is only for RDD[(K, V)]; each tuple can only have two members, and
Spark sorts by the first member. If you want to use sortByKey, you have to
change your RDD[(String, String, String, String)] into RDD[(String, (String,
String, String))].
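The reshaping could look roughly like this; the object and method names are
hypothetical. The SparkContext._ import is what brings in the pair-RDD
implicits on the Spark 1.x versions of that time.

```scala
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark 1.x releases
import org.apache.spark.rdd.RDD

object SortFourTuples {
  // Moves the first field into the key position and nests the rest as the value,
  // so the RDD has the (K, V) shape that sortByKey requires.
  def sortByFirst(
      rows: RDD[(String, String, String, String)]): RDD[(String, (String, String, String))] =
    rows.map { case (a, b, c, d) => (a, (b, c, d)) }.sortByKey()
}
```

If the flat four-field shape is needed afterwards, a second map can unpack the
nested tuple again.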
Thanks
Jerry
-Original Message-
Hi,
spark.local.dir is the one used to write map output data and persisted RDD
blocks, but the file paths are hashed, so you cannot directly find the
persisted RDD block files; they will definitely be in this folder on your
worker node.
Thanks
Jerry
From: Priya Ch
blocks would be persisted to hdfs then will i be able to
read the hdfs blocks as i could do in hadoop ?
On Tue, Sep 23, 2014 at 5:56 PM, Shao, Saisai [via Apache Spark User List]
[hidden email] wrote:
Hi,
Spark.local.dir is the one used to write map output
If you have enough memory, it will be faster, within one minute, since
most of the files are cached. You can also build your Spark project on a
mounted ramfs in Linux, which will also speed up the process.
Thanks
Jerry
-Original Message-
From: Zhan Zhang
Hi,
I’ve also met this problem before. I think you can try setting
“spark.core.connection.ack.wait.timeout” to a larger value to avoid the ack
timeout; the default is 60 seconds.
Sometimes, because of a GC pause or other reasons, the acknowledgement message
will time out, which will lead to this
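A sketch of raising that timeout programmatically through SparkConf. The value
of 600 seconds and the names used here are only examples, not a recommendation
from this thread.

```scala
import org.apache.spark.SparkConf

object AckTimeoutConf {
  // Returns a conf with a longer ack wait timeout than the 60-second default.
  def build(timeoutSeconds: Int): SparkConf =
    new SparkConf()
      .setAppName("ack-timeout-example") // placeholder application name
      .set("spark.core.connection.ack.wait.timeout", timeoutSeconds.toString)

  def main(args: Array[String]): Unit = {
    val conf = build(600) // illustrative value, in seconds
    println(conf.get("spark.core.connection.ack.wait.timeout"))
  }
}
```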
Hi Rafeeq,
I think this situation always occurs when your Spark Streaming application is
running in an abnormal situation. Would you mind checking your job processing
time in WebUI or log, is the total latency of job processing + job scheduling
time larger than batch duration? If your Spark
Hi,
I’m guessing the problem is that driver or executor cannot get the
metrics.properties configuration file in the yarn container, so metrics system
cannot load the right sinks.
Thanks
Jerry
From: Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com]
Sent: Thursday, September 11, 2014
put
metrics.properties file?
On Thu, Sep 11, 2014 at 4:18 PM, Shao, Saisai
<saisai.s...@intel.com> wrote:
Hi,
I’m guessing the problem is that driver or executor cannot get the
metrics.properties configuration file in the yarn container, so metrics system
cannot load
Hi,
Is there any specific scenario which needs to know the number of RDDs in the
DStream? To my knowledge, a DStream generates one RDD in each
batchDuration; some old RDDs are remembered for windowing-like functions and
are removed when no longer needed. The hashmap generatedRDDs
Hi,
I think every received stream will generate an RDD in each batch duration even
if there is no data fed in (an empty RDD will be generated). So you cannot use
the number of RDDs to judge whether any data has been received.
One way is to do this in DStream/foreachRDD(), like
a.foreachRDD { r =>
if
I think you should clarify a few things about Spark Streaming:
1. The closure in map runs on the remote side, so modifying the count var only
takes effect on the remote side. You will always get -1 on the driver side.
2. Code in the foreachRDD closure is executed lazily in each batch
duration, while
Hi Hemanth,
I think there is a bug in this API in Spark 0.8.1, so you will hit this
exception when using this API from Java code. The bug is fixed in the latest
version, as you can see in the patch (https://github.com/apache/spark/pull/1508).
But it’s only for Kafka 0.8+, as you still use kafka
, September 09, 2014 1:19 PM
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: Re: Setting Kafka parameters in Spark Streaming
Thanks, Shao, for providing the necessary information.
Hemanth
On Tue, Sep 9, 2014 at 8:21 AM, Shao, Saisai
<saisai.s...@intel.com> wrote:
Hi
Hi,
StreamSQL (https://github.com/thunderain-project/StreamSQL) is a POC project
based on Spark to combine the power of Catalyst and Spark Streaming, to offer
people the ability to manipulate SQL on top of DStream as you wanted, this keep
the same semantics with SparkSQL as offer a
Hi,
I don't think there's an NPE issue when using DStream/count() even if there is no
data fed into Spark Streaming. I tested using Kafka in my local setup; both
cases are OK, with and without data consumed.
Actually you can see the details in ReceiverInputDStream: even if there is no data
in this
Hi Meethu,
The spark.executor.memory is the Java heap size of forked executor process.
Increasing the spark.executor.memory can actually increase the runtime heap
size of executor process.
For the details of Spark configurations, you can check:
Hi,
Actually, several Java task threads run in a single executor, not several
processes, so each executor has only one JVM runtime, which is shared by the
different task threads.
Thanks
Jerry
From: rapelly kartheek [mailto:kartheek.m...@gmail.com]
Sent: Wednesday, August 20, 2014 5:29 PM
To:
I think Spark Streaming currently lacks a data acknowledgement mechanism for
when data is stored and replicated in BlockManager, so data can potentially be
lost even after being pulled from Kafka, say if data is stored only in
BlockGenerator and not yet in BM, while
in the meantime Kafka itself commits the consumer offset,
Hi Ali,
Maybe you can take a look at Twitter's Summingbird project
(https://github.com/twitter/summingbird), which is currently one of the few
open source choices for the Lambda Architecture. There's an ongoing sub-project
called summingbird-spark, that might be the one you wanted, might this can
Hi Rafeeq,
I think the current Spark Streaming API lets you fetch data
from Kafka and store it to another external store. If you do not need to
manage consumer offsets manually, there’s no need to use a low-level API such as
SimpleConsumer.
For Kafka 0.8.1 compatibility, you
Hi Haopu,
Please see the inline comments.
Thanks
Jerry
-Original Message-
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Wednesday, July 23, 2014 3:00 PM
To: user@spark.apache.org
Subject: spark.streaming.unpersist and spark.cleaner.ttl
I have a DStream receiving data from a
will take a look at DStream.scala although I have no Scala experience.
-Original Message-
From: Shao, Saisai [mailto:saisai.s...@intel.com]
Sent: 2014年7月23日 15:13
To: user@spark.apache.org
Subject: RE: spark.streaming.unpersist and spark.cleaner.ttl
Hi Haopu,
Please see the inline
Hi Denes,
I think you can register your customized metrics source with the metrics system
through metrics.properties; you can use metrics.properties.template as a
reference.
Basically you can do as follows if you want to monitor on executors:
Yes, it's normal when memory is not enough to put the third partition, as you
can see in your attached picture.
Thanks
Jerry
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Tuesday, July 22, 2014 3:09 PM
To: user@spark.apache.org
Subject: number of Cached Partitions v.s. Total Partitions
Yeah, I am starting to understand your purpose. The original design of customized
metrics sources focused on self-contained sources; it seems you need to rely on
an outer variable, so the way you mentioned may be the only way to register.
Besides, as you cannot see the source in Ganglia, I think you can
Actually we have a POC project which shows the power of combining Spark
Streaming and Catalyst, it can manipulate SQL on top of Spark Streaming and get
SchemaDStream. You can take a look at it:
https://github.com/thunderain-project/StreamSQL
Thanks
Jerry
From: Tathagata Das
...@preferred.jp]
Sent: Friday, July 11, 2014 10:47 AM
To: user@spark.apache.org
Subject: Re: Some question about SQL and streaming
Hi,
On Fri, Jul 11, 2014 at 11:38 AM, Shao, Saisai
<saisai.s...@intel.com> wrote:
Actually we have a POC project which shows the power of combining