Hi all,
I'm currently working on creating a set of docker images to facilitate
local development with Spark/streaming on Mesos (+zk, hdfs, kafka)
After solving the initial hurdles to get things working together in Docker
containers, now everything seems to start up correctly and the Mesos UI
of
clusters with Docker, I'm interested in your findings on sharing the
settings of Kafka and Zookeeper across nodes. How many brokers and ZooKeeper
nodes do you use?
Regards,
On Mon, May 5, 2014 at 10:11 PM, Gerard Maas gerard.m...@gmail.com wrote:
Hi all,
I'm currently working on creating a set
Hi Zhen,
Thanks a lot for sharing. I'm sure it will be useful for new users.
A small note: On the 'checkpoint' explanation:
sc.setCheckpointDir("my_directory_name")
it would be useful to specify that 'my_directory_name' should exist on all
slaves. As an alternative, you could use an HDFS directory
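For instance (the path below is just an example):
sc.setCheckpointDir("hdfs://namenode:8020/user/spark/checkpoints") // visible to all workers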
Gerard Maas ---05/05/2014 04:18:08 PM---Hi Benjamin, Yes, we initially used a
modified version of the AmpLabs docker scripts
By looking at your config, I think there's something wrong with your setup.
One of the key elements of Mesos is that you are abstracted from where the
execution of your task takes place. The SPARK_EXECUTOR_URI tells Mesos
where to find the 'framework' (in Mesos jargon) required to execute a job.
This error message says it can't find the config for the akka subsystem.
That is typically included in the Spark assembly.
First, you need to compile your Spark distro by running sbt/sbt assembly
in the SPARK_HOME dir.
Then, use the SPARK_HOME (through env or configuration) to point to your
for it to work.
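For example, roughly (the URI below is hypothetical; spark.executor.uri is the Mesos-specific setting):
val conf = new SparkConf()
  .setMaster("mesos://zk://zkhost:2181/mesos")
  .set("spark.executor.uri", "hdfs://namenode/dist/spark-assembly.tgz")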
The Spark REPL works differently. It uses some dark magic to send the
working session to the workers.
-kr, Gerard.
On Wed, May 21, 2014 at 2:47 PM, Gerard Maas gerard.m...@gmail.com wrote:
Hi Tobias,
I was curious about this issue and tried to run your example on my local
Hi Tobias,
On Wed, May 21, 2014 at 5:45 PM, Tobias Pfeiffer t...@preferred.jp wrote:
first, thanks for your explanations regarding the jar files!
No prob :-)
On Thu, May 22, 2014 at 12:32 AM, Gerard Maas gerard.m...@gmail.com
wrote:
I was discussing it with my fellow Sparkers here and I
Hi Andrew,
Thanks for the current doc.
I'd almost gotten to the point where I thought that my custom code needed
to be included in the SPARK_EXECUTOR_URI but that can't possibly be
correct. The Spark workers that are launched on Mesos slaves should start
with the Spark core jars and then
Hi,
I'm starting to explore the Spark Job Server contributed by Ooyala [1],
running from the master branch.
I started by developing and submitting a simple job and the JAR check gave
me errors on a seemingly good jar. I disabled the fingerprint checking on
the jar and I could submit it, but
, copy,
print, distribute or rely on this email.
On 22 May 2014 18:25, Gerard Maas gerard.m...@gmail.com wrote:
Hi,
I'm starting to explore the Spark Job Server contributed by Ooyala [1],
running from the master branch.
I started by developing and submitting a simple job and the JAR check
The RDD API has functions to join multiple RDDs, such as PairRDD.join
or PairRDD.cogroup, that take another RDD as input. e.g.
firstRDD.join(secondRDD)
I'm looking for ways to do the opposite: split an existing RDD. What is the
right way to create derivative RDDs from an existing RDD?
e.g.
I don't think that's supported by default, as when the standalone context
closes, the related RDDs will be GC'ed.
You should explore the Spark Job Server, which allows you to cache RDDs by name and
reuse them within a context.
https://github.com/ooyala/spark-jobserver
-kr, Gerard.
On Tue, Jun 3,
Can you go into more detail about why you want to split one RDD into
several?
On Mon, Jun 2, 2014 at 1:13 PM, Gerard Maas gerard.m...@gmail.com wrote:
The RDD API has functions to join multiple RDDs, such as PairRDD.join
or PairRDD.cogroup that take another RDD as input. e.g.
firstRDD.join
Have you tried re-compiling your job against the 1.0 release?
On Tue, Jun 3, 2014 at 8:46 PM, Marek Wiewiorka marek.wiewio...@gmail.com
wrote:
Hi All,
I've been experiencing a very strange error after upgrading from Spark 0.9
to 1.0 - it seems that the saveAsTextFile function is throwing
I think that you have two options:
- to run your code locally, you can use local mode by setting the 'local'
master like so:
new SparkConf().setMaster("local[4]"), where 4 is the number of cores
assigned to the local mode.
- to run your code remotely you need to build the jar with dependencies and
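A minimal sketch of the local-mode option (the app name is arbitrary):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("my-job").setMaster("local[4]")
val sc = new SparkContext(conf)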
Hi,
The scheduling related code can be found at:
https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/scheduler
The DAG (Directed Acyclic Graph) scheduler is a good starting point:
You can consult the docs at :
https://spark.apache.org/docs/latest/api/scala/index.html#package
In particular, the rdd docs contain the explanation of each method :
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
Kr, Gerard
On Jun 8, 2014 1:00 PM, Carter
The goal of rdd.persist is to create a cached RDD that breaks the DAG
lineage. Therefore, computations *in the same job* that use that RDD can
re-use that intermediate result, but it's not meant to survive between job
runs.
for example:
val baseData =
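Roughly this pattern (data source and filters are hypothetical):
val baseData = sc.textFile("hdfs:///data/events").persist()
val errors = baseData.filter(_.contains("ERROR")).count() // computes and caches
val warnings = baseData.filter(_.contains("WARN")).count() // reuses the cached data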
That stack trace is quite similar to the one that is generated when trying
to do a collect within a closure. In this case, it feels wrong to
collect in a closure, but I wonder what the reason behind the NPE is.
Curious to know whether they are related.
Here's a very simple example:
rdd1.flatMap(x =>
On Jun 14, 2014 4:05 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
You need to factor your program so that it’s not just a main(). This is
not a Spark-specific issue, it’s about how you’d unit test any program in
general. In this case, your main() creates a SparkContext, so you
Hi,
I've been doing some testing with Calliope as a way to do batch load from
Spark into Cassandra.
My initial results are promising on the performance area, but worrisome on
the memory footprint side.
I'm generating N records of about 50 bytes each and using the UPDATE
mutator to insert them
Hi,
(My excuses for the cross-post from SO)
I'm trying to create Cassandra SSTables from the results of a batch
computation in Spark. Ideally, each partition should create the SSTable for
the data it holds in order to parallelize the process as much as possible
(and probably even stream it to
On Wed, Jun 25, 2014 at 8:44 PM, Gerard Maas gerard.m...@gmail.com
wrote:
Hi,
(My excuses for the cross-post from SO)
I'm trying to create Cassandra SSTables from the results of a batch
computation in Spark. Ideally, each partition should create the SSTable for
the data it holds in order
for you.
Regards,
Rohit
Founder CEO, Tuplejump, Inc.
www.tuplejump.com
The Data Engineering Platform
On Thu, Jun 26, 2014 at 12:44 AM, Gerard Maas gerard.m...@gmail.com
wrote:
Thanks Nick.
We used the CassandraOutputFormat through Calliope. The Calliope API
makes the CassandraOutputFormat quite accessible
Hi Sargun,
There have been a few discussions on the list recently about the topic. The
short answer is that this is not supported at the moment.
This is a particularly good thread as it discusses the current state and
limitations:
Using a case class as a key doesn't seem to work properly. [Spark 1.0.0]
A minimal example:
case class P(name: String)
val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect
[Spark shell local mode] res: Array[(P, Int)] =
,ArrayBuffer(1, 1)))
On Tue, Jul 22, 2014 at 4:20 PM, Gerard Maas gerard.m...@gmail.com wrote:
Using a case class as a key doesn't seem to work properly. [Spark 1.0.0]
A minimal example:
case class P(name: String)
val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
sc.parallelize(ps).map(x => (x, 1
, Gerard Maas gerard.m...@gmail.com
wrote:
Just to narrow down the issue, it looks like it is in
'reduceByKey' and derivatives like 'distinct'.
groupByKey() seems to work
sc.parallelize(ps).map(x => (x.name, 1)).groupByKey().collect
res: Array[(String, Iterable[Int])] = Array((charly
, 2014 at 5:37 PM, Gerard Maas gerard.m...@gmail.com wrote:
Yes, right. 'sc.parallelize(ps).map(x => (x.name, 1)).groupByKey().collect'
An oversight from my side.
Thanks!, Gerard.
On Tue, Jul 22, 2014 at 5:24 PM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
I can confirm this bug
To read/write from/to Cassandra I recommend you to use the Spark-Cassandra
connector at [1]
Using it, saving a Spark Streaming RDD to Cassandra is fairly easy:
sparkConfig.set("spark.cassandra.connection.host", cassandraHost)
val sc = new SparkContext(sparkConfig)
val ssc = new StreamingContext(sc,
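A fuller sketch of that setup (keyspace, table and column names are hypothetical; the column-selection API varies a bit across connector versions):
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._
val ssc = new StreamingContext(sc, Seconds(10))
// given dstream: a DStream of (key, value) pairs built from some source
dstream.saveToCassandra("my_keyspace", "my_table", SomeColumns("key", "value"))
ssc.start()
ssc.awaitTermination()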
Hello Sparkers,
I'm currently running load tests on a Spark Streaming job. When the task
duration increases beyond the batchDuration, the job becomes unstable. In the
logs I see tasks failed with the following message:
Job aborted due to stage failure: Task 266.0:1 failed 4 times, most recent
Johnny,
Currently, probably the easiest (and most performant) way to integrate
Spark and Cassandra is using the spark-cassandra-connector [1]
Given an rdd, saving it to cassandra is as easy as:
rdd.saveToCassandra(keyspace, table, Seq(columns))
We tried many 'hand crafted' options to interact
My Spark Streaming job (running on Spark 1.0.2) stopped working today and
consistently throws the exception below.
No code changed for it, so I'm really puzzled about the cause of the issue.
Looks like a security issue at the HDFS level. Has anybody seen this exception
and maybe know the root cause?
Found it! (with sweat on my forehead)
The job was actually running on Mesos using a Spark 1.1.0 executor.
I guess there's some incompatibility between the 1.0.2 and the 1.1 versions
- still quite weird.
-kr, Gerard.
On Thu, Sep 18, 2014 at 12:29 PM, Gerard Maas gerard.m...@gmail.com wrote:
Hi,
We have been implementing several Spark Streaming jobs that are basically
processing data and inserting it into Cassandra, sorting it among different
keyspaces.
We've been following the pattern:
dstream.foreachRDD{ rdd =>
val records = rdd.map(elem => record(elem))
Pinging TD -- I'm sure you know :-)
-kr, Gerard.
On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas gerard.m...@gmail.com wrote:
Hi,
We have been implementing several Spark Streaming jobs that are basically
processing data and inserting it into Cassandra, sorting it among different
keyspaces
on the driver….
mn
On Oct 20, 2014, at 3:07 AM, Gerard Maas gerard.m...@gmail.com wrote:
Pinging TD -- I'm sure you know :-)
-kr, Gerard.
On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas gerard.m...@gmail.com
wrote:
Hi,
We have been implementing several Spark Streaming jobs
recordStream = dstream.map(elem => record(elem))
targets.foreach{ target => recordStream.filter(record =>
isTarget(target, record)).foreachRDD(_.writeToCassandra(target, table))}
?
kr, Gerard.
On Wed, Oct 22, 2014 at 2:12 PM, Gerard Maas gerard.m...@gmail.com wrote:
Thanks Matt,
Unlike the feared
Hi Ashic,
At the moment I see two options:
1) You could use the CassandraConnector object to execute your specialized
query. The recommended pattern is to do that within a
rdd.foreachPartition(...) in order to amortize DB connection setup over the
number of elements in one partition. Something
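Something along these lines (a sketch; the CQL statement and names are hypothetical):
import com.datastax.spark.connector.cql.CassandraConnector
val connector = CassandraConnector(sc.getConf)
rdd.foreachPartition { partition =>
  // one session per partition amortizes the connection setup
  connector.withSessionDo { session =>
    partition.foreach { elem =>
      session.execute(s"UPDATE my_ks.my_table SET value = '$elem' WHERE id = 'x'")
    }
  }
}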
There's an issue in the way case classes are handled on the REPL and you
won't be able to use a case class as a key. See:
https://issues.apache.org/jira/browse/SPARK-2620
BTW, case classes already implement equals and hashCode. There's no need to
implement those again.
Given that you already
Looks like you're having some classpath issues.
Are you providing your spark-cassandra-driver classes to your job?
sparkConf.setJars(Seq(jars...)) ?
On Tue, Oct 28, 2014 at 5:34 PM, Harold Nguyen har...@nexgate.com wrote:
Hi all,
I'm having trouble troubleshooting this particular block of
:37 AM, Gerard Maas gerard.m...@gmail.com
wrote:
PS: Just to clarify my statement:
Unlike the feared RDD operations on the driver, it's my understanding
that these Dstream ops on the driver are merely creating an execution plan
for each RDD.
With 'feared RDD operations on the driver' I meant
Hi,
I've been exploring the metrics exposed by Spark and I'm wondering whether
there's a way to register job-specific metrics that could be exposed
through the existing metrics system.
Would there be an example somewhere?
BTW, documentation about how the metrics work could be improved. I
If I remember correctly, EmptyRDD is private[spark].
You can create an empty RDD using the spark context:
val emptyRdd = sc.emptyRDD
-kr, Gerard.
On Fri, Nov 14, 2014 at 11:22 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
To get an empty RDD, I did this:
I have an rdd with one
at 11:35 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Thank You Gerard.
I was trying val emptyRdd = sc.EmptyRDD.
Yes it works but I am not able to do emptyRdd.collect.foreach(println)
Thank You
On Fri, Nov 14, 2014 at 3:58 PM, Gerard Maas gerard.m...@gmail.com
wrote:
If I remember
One 'rule of thumb' is to use rdd.toDebugString and check the lineage for
ShuffleRDD. As long as there's no need for restructuring the RDD,
operations can be pipelined on each partition.
rdd.toDebugString is your friend :-)
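A small illustration (the input path is hypothetical):
val counts = sc.textFile("hdfs:///logs/day1")
  .map(line => (line.split(" ")(0), 1))
  .reduceByKey(_ + _)
println(counts.toDebugString) // a ShuffledRDD shows up where the shuffle happens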
-kr, Gerard.
On Mon, Nov 17, 2014 at 7:37 AM, Mukesh Jha
As the Spark Streaming tuning guide indicates, the key indicators of a
healthy streaming job are:
- Processing Time
- Total Delay
The Spark UI page for the Streaming job [1] shows these two indicators but
the metrics source for Spark Streaming (StreamingSource.scala) [2] does
not.
Any reasons
I suppose that here function(x) = function3(function2(function1(x)))
In that case, the difference will be modularity and readability of your
program.
If function{1,2,3} are logically different steps and potentially reusable
somewhere else, I'd keep them separate.
A sequence of map
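For illustration, the two styles side by side (function names hypothetical):
// composed in a single pass:
rdd.map(x => function3(function2(function1(x))))
// separate, reusable steps; Spark pipelines these within the same stage anyway:
rdd.map(function1).map(function2).map(function3)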
Looks like metrics are not a hot topic to discuss - yet they are so important
for sleeping well when jobs are running in production.
I've created SPARK-4537 (https://issues.apache.org/jira/browse/SPARK-4537)
to track this issue.
-kr, Gerard.
On Thu, Nov 20, 2014 at 9:25 PM, Gerard Maas gerard.m...@gmail.com
Hi TD,
We also struggled with this error for a long while. The recurring scenario
is when the job takes longer to compute than the job interval and a backlog
starts to pile up.
Hint: Check
If the DStream storage level is set to MEMORY_ONLY_SER and memory runs
out, then you will get a 'Cannot
Hi,
We are currently running our Spark + Spark Streaming jobs on Mesos,
submitting our jobs through Marathon.
We see with some regularity that the Spark Streaming driver gets killed by
Mesos and then restarted on some other node by Marathon.
I've no clue why Mesos is killing the driver and
[Ping]
Any hints?
On Thu, Nov 27, 2014 at 3:38 PM, Gerard Maas gerard.m...@gmail.com wrote:
Hi,
We are currently running our Spark + Spark Streaming jobs on Mesos,
submitting our jobs through Marathon.
We see with some regularity that the Spark Streaming driver gets killed by
Mesos
I guess he's already doing so, given the 'saveToCassandra' usage.
What I don't understand is the question 'how do I specify a batch'. That
doesn't make much sense to me. Could you explain further?
-kr, Gerard.
On Thu, Dec 4, 2014 at 5:36 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You can
You're using two conflicting versions of the connector: the Scala version
at 1.1.0 and the Java version at 1.0.4.
I don't use Java, but I guess you only need the java dependency for your
job - and with the version fixed.
<dependency>
<groupId>com.datastax.spark</groupId>
Hi,
We've a number of Spark Streaming/Kafka jobs that would benefit from an even
spread of consumers over physical hosts in order to maximize network usage.
As far as I can see, the Spark Mesos scheduler accepts resource offers
until all required Mem + CPU allocation has been satisfied.
This
We have a similar case in which we don't want to save data to Cassandra if
the data is empty.
In our case, we filter the initial DStream to process messages that go to a
given table.
To do so, we're using something like this:
dstream.foreachRDD{ (rdd, time) =>
tables.foreach{ table =>
val
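The complete pattern is roughly this (helper names hypothetical; rdd.isEmpty exists as of Spark 1.3, on older versions rdd.take(1).isEmpty does the same check):
dstream.foreachRDD { (rdd, time) =>
  tables.foreach { table =>
    val forTable = rdd.filter(msg => belongsTo(msg, table))
    if (!forTable.isEmpty()) { // skip the write when there's no data for this table
      forTable.saveToCassandra("my_keyspace", table)
    }
  }
}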
Have you tried with kafkaStream.foreachRDD(rdd => { rdd.foreach(...) }) ?
Would that make a difference?
On Thu, Dec 11, 2014 at 10:24 AM, david david...@free.fr wrote:
Hi,
We use the following Spark Streaming code to collect and process Kafka
events:
kafkaStream.foreachRDD(rdd => {
If the timestamps in the logs are to be trusted, it looks like your driver
is dying with that java.io.FileNotFoundException, and therefore the
workers lose their connection and close down.
-kr, Gerard.
On Thu, Dec 11, 2014 at 7:39 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Try to add
I'm doing the same thing with Cassandra.
For Cassandra, use the Spark-Cassandra connector [1], which does the
Session management, as described by TD, for you.
[1] https://github.com/datastax/spark-cassandra-connector
-kr, Gerard.
On Thu, Dec 11, 2014 at 1:55 PM, Ashic Mahtab
There's some explanation and an example here:
http://stackoverflow.com/questions/26611471/spark-data-processing-with-grouping/26612246#26612246
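In short: aggregate folds each partition with seqOp and then merges the per-partition results with combOp. A small sketch computing sum and count in one pass:
val nums = sc.parallelize(1 to 100)
val (sum, count) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1), // seqOp: within a partition
  (a, b) => (a._1 + b._1, a._2 + b._2)  // combOp: across partitions
)
val mean = sum.toDouble / count // 50.5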
-kr, Gerard.
On Thu, Dec 11, 2014 at 7:15 PM, ll duy.huynh@gmail.com wrote:
any explanation on how aggregate works would be much appreciated. i
Are you using a buffered PrintWriter? That's probably a different flushing
behaviour. Try doing out.flush() after out.write(...) and you will have
the same result.
This is unrelated to Spark, btw.
-kr, Gerard.
Hi,
I don't get what the problem is. That map to selected columns looks like
the way to go given the context. What's not working?
Kr, Gerard
On Dec 14, 2014 5:17 PM, Denny Lee denny.g@gmail.com wrote:
I have a large number of files within HDFS that I would like to do a group by
statement ala
Hi Jean-Pascal,
At Virdata we do a similar thing to 'bucketize' our data to different
keyspaces in Cassandra.
The basic construction would be to filter the DStream (or the underlying
RDD) for each key and then apply the usual storage operations on that new
data set.
Given that, in your case, you
in memory anyway?
Also, any experience with minutes-long batch intervals?
Thanks for the quick answer!
On Sun, Dec 14, 2014 at 11:17 AM, Gerard Maas gerard.m...@gmail.com
wrote:
Hi Jean-Pascal,
At Virdata we do a similar thing to 'bucketize' our data to different
keyspaces in Cassandra
Hi Jeniba,
The second part of this meetup recording has a very good answer to your
question. TD explains the current behavior and the on-going work in Spark
Streaming to fix HA.
https://www.youtube.com/watch?v=jcJq3ZalXD8
-kr, Gerard.
On Tue, Dec 16, 2014 at 11:32 AM, Jeniba Johnson
Creating an RDD from a wildcard like this:
val data = sc.textFile("/user/foo/myfiles/*")
Will create 1 partition for each file found. 1000 files = 1000 partitions.
A task is a job stage (defined as a sequence of transformations) applied to
a partition, so 1000 partitions = 1000 tasks per stage.
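A quick way to check (path hypothetical):
val data = sc.textFile("/user/foo/myfiles/*")
println(data.partitions.length) // roughly one partition per (small) file found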
Hi Demi,
Thanks for sharing.
What we usually do is let the driver read the configuration for the job and
pass the config object to the actual job as a serializable object. That
avoids the need for a centralized config-sharing point that needs to be
accessed from the workers, as you have
You would do:
rdd.zipWithIndex gives you an RDD[(Original, Long)] where the second
element is the index.
To have a (index,original) tuple, you will need to map that previous RDD to
the desired shape:
rdd.zipWithIndex.map(_.swap)
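For example:
val letters = sc.parallelize(Seq("a", "b", "c"))
letters.zipWithIndex.map(_.swap).collect
// Array((0,a), (1,b), (2,c)) - note the index is a Long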
-kr, Gerard.
On Tue, Dec 16, 2014 at 4:12 PM,
Patrick,
I was wondering why one would choose rdd.map vs rdd.foreach to execute
a side-effecting function on an RDD.
-kr, Gerard.
On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell pwend...@gmail.com wrote:
The second choice is better. Once you call collect() you are pulling
all of the
You can create a DStream that contains the count, transforming the grouped
windowed RDD, like this:
val errorCount = grouping.map{ case (k, v) => v.size }
If you need to preserve the key:
val errorCount = grouping.map{ case (k, v) => (k, v.size) }
or if you don't care about the content of the
Check out the 'compiling for Scala 2.11' instructions:
http://spark.apache.org/docs/1.2.0/building-spark.html#building-for-scala-211
-kr, Gerard.
On Fri, Dec 19, 2014 at 12:00 PM, Jonathan Chayat jonatha...@supersonic.com
wrote:
The following ticket:
It will be instantiated once per VM, which translates to once per executor.
-kr, Gerard.
On Fri, Dec 19, 2014 at 12:21 PM, Ashic Mahtab as...@live.com wrote:
Hi Guys,
Are scala lazy values instantiated once per executor, or once per
partition? For example, if I have:
object Something =
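To illustrate the answer (createConnection and send are hypothetical):
object Something {
  // evaluated at most once per JVM, i.e. once per executor, on first access
  lazy val connection = createConnection()
}
rdd.foreachPartition { partition =>
  val conn = Something.connection // all partitions on this executor share it
  partition.foreach(elem => conn.send(elem))
}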
Hi,
After facing issues with the performance of some of our Spark Streaming
jobs, we invested quite some effort figuring out the factors that affect
the performance characteristics of a Streaming job. We defined an
empirical model that helps us reason about Streaming jobs and applied it to
tune
mode?
I'm making changes to the spark mesos scheduler and I think we can propose
a best way to achieve what you mentioned.
Tim
Sent from my iPhone
On Dec 22, 2014, at 8:33 AM, Gerard Maas gerard.m...@gmail.com wrote:
Hi,
After facing issues with the performance of some of our Spark
Hi,
I'm not sure what you are asking:
Whether we can use spouts and bolts in Spark (= no)
or whether we can do streaming in Spark:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
-kr, Gerard.
On Tue, Dec 23, 2014 at 9:03 AM, Ajay ajay.ga...@gmail.com wrote:
Hi,
Can
Streaming).
The idea is to use Spark as an in-memory computation engine and static data
coming from Cassandra/Hbase and streaming data from Storm.
Thanks
Ajay
On Tue, Dec 23, 2014 at 2:03 PM, Gerard Maas gerard.m...@gmail.com
wrote:
Hi,
I'm not sure what you are asking:
Whether we
and post the code (if possible).
In a nutshell, your processing time > batch interval, resulting in an
ever-increasing delay that will end up in a crash.
3 secs to process 14 messages looks like a lot. Curious what the job logic
is.
-kr, Gerard.
On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das
+1
On Fri, Jan 23, 2015 at 5:58 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
That sounds good to me. Shall I open a JIRA / PR about updating the site
community page?
On Fri, Jan 23, 2015 at 4:37 AM, Patrick Wendell patr...@databricks.com
wrote:
Hey Nick,
So I think what we can
Hi,
We are observing with certain regularity that our Spark jobs, as a Mesos
framework, are hoarding resources and not releasing them, resulting in
resource starvation to all jobs running on the Mesos cluster.
For example:
This is a job that has spark.cores.max = 4 and spark.executor.memory=3g
(looks like the list didn't like an HTML table in the previous email. My
excuses for any duplicates)
Hi,
We are observing with certain regularity that our Spark jobs, as a Mesos
framework, are hoarding resources and not releasing them, resulting in
resource starvation to all jobs running on the
This is more of a Scala question, so next time you might want to
address a Scala forum, e.g. http://stackoverflow.com/questions/tagged/scala
val optArrStr:Option[Array[String]] = ???
optArrStr.map(arr => arr.mkString(",")).getOrElse("") // empty string or
whatever default value you have for this.
I have been contributing to SO for a while now. Here are a few
observations I'd like to contribute to the discussion:
The level of questions on SO is often more entry-level. Harder
questions (that require expertise in a certain area) remain unanswered for
a while. Same questions here on the
http://www.virdata.com/tuning-spark/#Partitions)
-kr, Gerard.
On Thu, Jan 22, 2015 at 4:28 PM, Gerard Maas gerard.m...@gmail.com wrote:
So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in
4.761 secs.
I think there's evidence that setup costs are quite high in this case
So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in
4.761 secs.
I think there's evidence that setup costs are quite high in this case and
increasing the batch interval is helping.
On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee
asudipta.baner...@gmail.com wrote:
Hi Ashic
line, that is to fix the number of streams
and change the input and output channels dynamically.
But I could not make it work (seems that the receiver is not allowing any
change in the config after it started).
thanks,
On Wed, Jan 21, 2015 at 10:49 AM, Gerard Maas gerard.m...@gmail.com
wrote
One possible workaround could be to orchestrate launch/stopping of
Streaming jobs on demand as long as the number of jobs/streams stays within
the boundaries of the resources (cores) you have available.
e.g. if you're using Mesos, Marathon offers a REST interface to manage job
lifecycle. You will
Hi Mukesh,
How are you creating your receivers? Could you post the (relevant) code?
-kr, Gerard.
On Wed, Jan 21, 2015 at 9:42 AM, Mukesh Jha me.mukesh@gmail.com wrote:
Hello Guys,
I've re-partitioned my kafkaStream so that it gets evenly distributed
among the executors and the results
+1 for TypeSafe config
Our practice is to include all spark properties under a 'spark' entry in
the config file alongside job-specific configuration:
A config file would look like:
spark {
  master =
  cleaner.ttl = 123456
  ...
}
job {
  context {
    src = foo
    action =
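A sketch of how the 'spark' section can then be pushed into the SparkConf (exact key handling may vary):
import com.typesafe.config.ConfigFactory
import org.apache.spark.SparkConf
import scala.collection.JavaConverters._
val config = ConfigFactory.load()
val sparkConf = new SparkConf()
// copy every entry under the 'spark' section into the SparkConf
config.getConfig("spark").entrySet().asScala.foreach { e =>
  sparkConf.set("spark." + e.getKey, e.getValue.unwrapped.toString)
}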
https://issues.apache.org/jira/browse/MESOS-1688
On Mon, Jan 26, 2015 at 9:17 PM, Gerard Maas gerard.m...@gmail.com
wrote:
Hi,
We are observing with certain regularity that our Spark jobs, as a Mesos
framework, are hoarding resources and not releasing them, resulting in
resource starvation to all jobs
Hi,
Did you try asking this on StackOverflow?
http://stackoverflow.com/questions/tagged/apache-spark
I'd also suggest adding some sample data to help others understand your
logic.
-kr, Gerard.
On Tue, Jan 27, 2015 at 1:14 PM, 老赵 laozh...@sina.cn wrote:
Hello All,
I am writing a simple
Jan 2015 22:28, Gerard Maas gerard.m...@gmail.com wrote:
(looks like the list didn't like a HTML table on the previous email. My
excuses for any duplicates)
Hi,
We are observing with certain regularity that our Spark jobs, as a Mesos
framework, are hoarding resources
Hi,
Could you add the code where you create the Kafka consumer?
-kr, Gerard.
On Wed, Jan 7, 2015 at 3:43 PM, francois.garil...@typesafe.com wrote:
Hi Mukesh,
If my understanding is correct, each Stream only has a single Receiver.
So, if you have each receiver consuming 9 partitions, you
)
bytes
}
.saveAsTextFile("text")
Is there a way to achieve this with the MetricsSystem?
On Mon, Jan 5, 2015 at 10:24 AM, Gerard Maas gerard.m...@gmail.com
wrote:
Hi,
Yes, I managed to register custom metrics by creating an
implementation
You are looking for dstream.transform(rdd => rdd.op(otherRdd))
The docs contain an example on how to use transform.
https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
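For example, to join each batch against a static reference RDD (names hypothetical; join requires pair RDDs on both sides):
val referenceRdd = sc.parallelize(Seq(("k1", "meta1"), ("k2", "meta2")))
val enriched = pairDstream.transform { rdd =>
  rdd.join(referenceRdd)
}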
-kr, Gerard.
On Thu, Jan 8, 2015 at 5:50 PM, Asim Jalis asimja...@gmail.com
),
kafkaStreams);
On Wed, Jan 7, 2015 at 8:21 PM, Gerard Maas gerard.m...@gmail.com
wrote:
Hi,
Could you add the code where you create the Kafka consumer?
-kr, Gerard.
On Wed, Jan 7, 2015 at 3:43 PM, francois.garil...@typesafe.com wrote:
Hi Mukesh,
If my understanding is correct, each
Try writing this Spark Streaming idiom in Java and you'll choose Scala soon
enough:
dstream.foreachRDD{ rdd =>
rdd.foreachPartition( partition => ... )
}
When deciding between Java and Scala for Spark, IMHO Scala has the
upper hand. If you're concerned with readability, have a look at the Scala
This: java.lang.NoSuchMethodError almost always indicates a version
conflict somewhere.
It looks like you are using Spark 1.1.1 with the cassandra-spark connector
1.2.0. Try aligning those. Those metrics were introduced recently in the
1.2.0 branch of the cassandra connector.
Either upgrade your
In spark-streaming, the consumers will fetch data and put it into 'blocks'.
Each block becomes a partition of the rdd generated during that batch
interval.
The size of each block is controlled by the conf:
'spark.streaming.blockInterval'. That is, the amount of data the consumer
can collect in
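For instance (in the 1.x line the value is in milliseconds):
// smaller blockInterval => more blocks per batch => more partitions per RDD
val conf = new SparkConf().set("spark.streaming.blockInterval", "200")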