Hi,
Is there any sample code that implements a Kafka output for Spark Streaming? My
use case is that all the output needs to be dumped back into the Kafka cluster
after the data is processed. What would be the guidelines for implementing such
a function? I heard foreachRDD will create one instance of the producer per batch? If
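There is no built-in Kafka sink yet, but the usual pattern is foreachRDD plus
foreachPartition, so you end up with one producer per partition per batch
rather than one per record. A minimal sketch against the Kafka 0.8 producer
API; dstream, the broker list and the topic name are placeholders:

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One producer per partition per batch; much cheaper than one per record.
    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092") // placeholder broker list
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    val producer = new Producer[String, String](new ProducerConfig(props))
    partition.foreach { record =>
      producer.send(new KeyedMessage[String, String]("output-topic", record))
    }
    producer.close()
  }
}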
Since the breeze jar is brought into Spark by the mllib package, you may want
to add mllib as a dependency for Spark 1.0. To bring it in from your
application yourself, you can either use sbt assembly in your build project
to generate a flat myApp-assembly.jar which contains the breeze jar, or use
spark add
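For the first option, the dependency line in the sbt build (a sketch; the
version is assumed to match your Spark release) would look like:

// In build.sbt; pulls in the breeze jar transitively via mllib.
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.0.0"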
I executed the following commands to launch a Spark app in yarn-client mode.
I have Hadoop 2.3.0, Spark 0.8.1 and Scala 2.9.3:
SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly
SPARK_YARN_MODE=true \
SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar
.set("spark.cleaner.ttl", "120") drops broadcast_0, which causes the exception
below. It is strange, because broadcast_0 is no longer needed (I have broadcast_3
instead), and the most recent RDD is persisted, so there should be no need for
recomputation. What is the problem? I need help.
~~~
14/05/05 17:03:12 INFO
Hi, Yi
Your project sounds interesting to me. I'm also working in the 3G/4G communication
domain; besides, I've also done a tiny project based on Hadoop which analyzes
execution logs. Recently I planned to pick it up again. So, if you don't
mind, may I see an introduction to your log analyzing
Regards,
Chhaya Vishwakarma
Hi, all
We run a Spark cluster with three workers.
We created a Spark Streaming application,
then ran the project using the command below:
sbt run spark://192.168.219.129:7077 tcp://192.168.20.118:5556 foo
We looked at the web UI of the workers; the jobs failed without any error or
Use checkpointing. It removes the dependencies. :)
RDD.checkpoint works fine, but spark.cleaner.ttl is really ugly for broadcast
cleaning. Maybe broadcasts could be removed automatically when there are no
dependencies left.
Have you tried Broadcast.unpersist()?
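For reference, a minimal sketch of the unpersist pattern in an iterative job,
assuming Spark 1.0 where Broadcast.unpersist() is available; model, update and
merge are hypothetical names:

var bc = sc.broadcast(model)
for (i <- 1 to iterations) {
  val next = data.map(x => update(x, bc.value)).reduce(merge)
  bc.unpersist() // drop the stale copy from the executors before re-broadcasting
  bc = sc.broadcast(next)
}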
On Mon, May 5, 2014 at 6:34 PM, Earthson earthson...@gmail.com wrote:
RDD.checkpoint works fine, but spark.cleaner.ttl is really ugly for broadcast
cleaning. Maybe broadcasts could be removed automatically when there are no
dependencies left.
Hi all,
Is there a best practice for subscribing to JMS with Spark Streaming? I
have searched but not found anything conclusive.
In the absence of a standard practice the solution I was thinking of was to
use Akka + Camel (akka.camel.Consumer) to create a subscription for a Spark
Streaming
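A rough sketch of that idea, assuming Spark's actor-based receiver
(ActorHelper / actorStream in the 1.0 API) together with akka-camel's
Consumer; the queue name is a placeholder and the camel-jms component must be
on the classpath:

import akka.actor.Props
import akka.camel.{CamelMessage, Consumer}
import org.apache.spark.streaming.receiver.ActorHelper

// An Akka Camel consumer that forwards each JMS message body to Spark.
class JmsReceiver(queue: String) extends Consumer with ActorHelper {
  def endpointUri = "jms:queue:" + queue
  def receive = {
    case msg: CamelMessage => store(msg.body.toString) // hand one record to Spark
  }
}

val lines = ssc.actorStream[String](Props(new JmsReceiver("events")), "JmsReceiver")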
Ah, I think this should be fixed in 0.9.1.
Did you see the exception thrown on the worker side?
Best,
--
Nan Zhu
On Sunday, May 4, 2014 at 10:15 PM, Cheney Sun wrote:
Hi Nan,
Have you found a way to fix the issue? Now I run into the same problem with
version 0.9.1.
Thanks,
No replies yet. I guess everyone who had this problem knew the obvious reason
why the error occurred.
It took me some time to figure out the workaround, though.
It seems Shark depends on
/var/lib/spark/shark-0.9.1/lib_managed/jars/org.apache.hadoop/hadoop-core/hadoop-core.jar
for client server
Since 1.0 is still in development you can pick up the latest docs in git:
https://github.com/apache/spark/tree/branch-1.0/docs
I didn't see anywhere that you said you started the Spark history server.
There are multiple things that need to happen for the history server to work.
1)
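For reference, the usual pieces (a sketch, assuming Spark 1.0): the application
has to write event logs to a directory, and the history server has to be
started against that same directory with ./sbin/start-history-server.sh. The
log directory below is hypothetical:

import org.apache.spark.SparkConf

// Set before creating the SparkContext so the run is recorded.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-events")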
Hi Sparkers,
We have created a quick spark_gce script which can launch a Spark cluster
on Google Cloud. I'm sharing it because it might be helpful for anyone
using Google Cloud for deployment rather than AWS.
Here's the link to the script
https://github.com/sigmoidanalytics/spark_gce
Hi Aureliano,
You might want to check this script out,
https://github.com/sigmoidanalytics/spark_gce
Let me know if you need any help around that.
Thanks
Best Regards
On Tue, Apr 22, 2014 at 7:12 PM, Aureliano Buendia buendia...@gmail.comwrote:
On Tue, Apr 22, 2014 at 10:50 AM, Andras
I just upgraded my Spark version to 1.0.0_SNAPSHOT.
commit f25ebed9f4552bc2c88a96aef06729d9fc2ee5b3
Author: witgo wi...@qq.com
Date: Fri May 2 12:40:27 2014 -0700
I'm running a standalone cluster with 3 workers.
- *Workers:* 3
- *Cores:* 48 Total, 0 Used
- *Memory:* 469.8 GB
I’ve encountered this issue again and am able to reproduce it about 10% of the
time.
1. Here is the input:
RDD[ (a, 126232566, 1), (a, 126232566, 2) ]
RDD[ (a, 126232566, 1), (a, 126232566, 3) ]
RDD[ (a, 126232566, 3) ]
RDD[ (a, 126232566, 4) ]
RDD[ (a, 126232566, 2)
Forgot to mention my batch interval is 1 second:
val ssc = new StreamingContext(conf, Seconds(1))
hence the Thread.sleep(1100)
From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: May-05-14 12:06 PM
To: user@spark.apache.org
Cc: u...@spark.incubator.apache.org
Subject: RE: another
Is it documented anywhere how one would go about configuring every open
port a Spark application needs?
This seems like one of the main things that makes running Spark hard in
places like EC2 when you aren't using the canned Spark scripts.
Starting an app, you'll see ports open for
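For what it's worth, only some of the ports are pinnable at this point; a
sketch of the ones that can be fixed via SparkConf (the values are
placeholders), while several executor-side ports are still chosen at random:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.port", "7001") // driver RPC endpoint
  .set("spark.ui.port", "4040")     // application web UI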
Yes, I've tried.
The problem is that a new broadcast object is generated at every step until all
of the memory is eaten up.
I solved it by using RDD.checkpoint to remove the dependencies on the old
broadcast objects, and cleaner.ttl to clean up those broadcast objects
automatically.
If there's a more elegant way to
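A sketch of that workaround, with hypothetical names and paths:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().set("spark.cleaner.ttl", "3600") // clean metadata older than 1h
val sc = new SparkContext(conf)
sc.setCheckpointDir("hdfs:///checkpoints") // hypothetical path

var rdd = sc.textFile("hdfs:///input").map(parse) // parse is hypothetical
for (step <- 1 to numSteps) {
  val bc = sc.broadcast(buildModel(rdd)) // hypothetical per-step model
  rdd = rdd.map(x => refine(x, bc.value)).persist() // refine is hypothetical
  rdd.checkpoint()     // cut the lineage back to the older broadcasts
  rdd.foreach(_ => ()) // force evaluation so the checkpoint materializes
}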
Very cool! Have you thought about sending this as a pull request? We’d be happy
to maintain it inside Spark, though it might be interesting to find a single
Python package that can manage clusters across both EC2 and GCE.
Matei
On May 5, 2014, at 7:18 AM, Akhil Das ak...@sigmoidanalytics.com
Hi,
I'm trying to run a simple Spark job that uses a 3rd party class (in this
case twitter4j.Status) in the spark-shell using spark-1.0.0_SNAPSHOT
I'm starting my bin/spark-shell with the following command.
./spark-shell
Has anyone considered using jclouds tooling to support multiple cloud
providers? Maybe using Pallet?
François
On May 5, 2014, at 3:22 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
I second this motion. :)
A unified cloud deployment tool would be absolutely great.
On Mon,
Ethan, you're not the only one, which is why I was asking about this! :-)
Matei, thanks for your response. Your answer explains the performance jump
in my code, but shows I've missed something key in my understanding of
Spark!
I was not aware until just now that map output was saved to disk
One main reason why Spark Streaming can achieve higher throughput than
Storm is that Spark Streaming operates in coarser-grained batches -
second-scale massive batches - which reduce per-tuple overheads in
shuffles and other kinds of data movement.
Note that it is also true that
Hi all,
I'm currently working on creating a set of docker images to facilitate
local development with Spark/streaming on Mesos (+zk, hdfs, kafka)
After solving the initial hurdles to get things working together in Docker
containers, everything now seems to start up correctly, and the Mesos UI
A few high-level suggestions.
1. I recommend using the new Receiver API in the almost-released Spark 1.0 (see
branch-1.0 / master branch on GitHub). It's a slightly better version of the
earlier NetworkReceiver, as it hides away the block generator (which needed to
be unnecessarily manually started and
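For context, a minimal sketch of a receiver under the new API; the socket
source and all names are placeholders:

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  def onStart() {
    // Spawn a thread that connects and pushes records; no block generator to manage.
    new Thread("socket-receiver") {
      override def run() { receiveLoop() }
    }.start()
  }

  def onStop() {} // the loop below exits once isStopped() turns true

  private def receiveLoop() {
    val socket = new Socket(host, port)
    val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
    var line = reader.readLine()
    while (!isStopped() && line != null) {
      store(line) // hand one record to Spark
      line = reader.readLine()
    }
    socket.close()
    restart("retrying connection") // ask Spark to restart this receiver
  }
}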
Hi Benjamin,
Yes, we initially used a modified version of the AmpLabs docker scripts
[1]. The amplab docker images are a good starting point.
One of the biggest hurdles has been HDFS, which requires reverse-DNS and I
didn't want to go the dnsmasq route to keep the containers relatively
simple to
Hi,
Fairly new to Spark. I'm using Spark's saveAsSequenceFile() to write large
SequenceFiles to HDFS. The SequenceFiles need to be large to be
accessed efficiently in HDFS, preferably larger than Hadoop's block size,
64 MB. The task works for files smaller than 64 MB (with a warning for
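One way to steer the file size is to control the number of output partitions
before writing, since each partition becomes one file. A sketch, assuming a
rough estimate of the total output size and key/value types convertible to
Writables:

// Aim for roughly two HDFS blocks (128 MB) per output file.
val estimatedTotalBytes = 10L * 1024 * 1024 * 1024 // hypothetical: ~10 GB of output
val bytesPerFile = 128L * 1024 * 1024
val numFiles = math.max(1, (estimatedTotalBytes / bytesPerFile).toInt)
pairs.coalesce(numFiles).saveAsSequenceFile("hdfs:///out/seq") // path is hypothetical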
Update: checkpointing doesn't happen. I checked with the isCheckpointed
method, but it always returns false. ???
2014-05-05 23:14 GMT+02:00 Andrea Esposito and1...@gmail.com:
Checkpointing doesn't help, it seems. I do it at each iteration/superstep.
Looking deeper, the RDDs are recomputed just
checkpoint seems to just add a checkpoint mark? You need an action after
marking it. I have tried this with success :)
val newRdd = oldRdd.map(myFun).persist(myStorageLevel)
newRdd.checkpoint() // mark for checkpointing here
newRdd.isCheckpointed // still false here
newRdd.foreach(x => {}) // force evaluation; the checkpoint happens now
The file does not exist in fact and no permission issue.
francis@ubuntu-4:/test/spark-0.9.1$ ll work/app-20140505053550-/
total 24
drwxrwxr-x 6 francis francis 4096 May 5 05:35 ./
drwxrwxr-x 11 francis francis 4096 May 5 06:18 ../
drwxrwxr-x 2 francis francis 4096 May 5 05:35 2/
Hi,
I'm looking at the event log and I'm a little confused about some metrics.
Here's the info for one task:
Launch Time: 1399336904603
Finish Time: 1399336906465
Executor Run Time: 1781
Shuffle Read Metrics: Shuffle Finish Time: 1399336906027, Fetch Wait
Time: 0
Shuffle Write Metrics: {Shuffle Bytes
Hi, I still have over 1 GB left for my program.
Date: Sun, 4 May 2014 19:14:30 -0700
From: ml-node+s1001560n5340...@n3.nabble.com
To: gyz...@hotmail.com
Subject: Re: sbt/sbt run command returns a JVM problem
The total memory of your machine is 2 GB, right? Then how much memory is
left free?
Yes, I'm struggling with a similar problem where my classes are not found on the
worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if someone
could provide some documentation on the usage of spark-submit.
Thanks
On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com
Hi experts,
I have some pre-built Python parsers that I am planning to use, just
because I don't want to write them again in Scala. However, after the data is
parsed, I would like to take the RDD and use it in a Scala program. (Yes, I
like Scala more than Python and am more comfortable in Scala. :)
In
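One common way to reuse Python parsers from Scala is RDD.pipe, which streams
each record through an external process. A sketch; parse.py is a hypothetical
script that reads lines on stdin and writes one parsed record per line:

val raw = sc.textFile("hdfs:///raw")     // hypothetical input path
val parsed = raw.pipe("python parse.py") // stream records through the Python parser
val fields = parsed.map(_.split('\t'))   // continue in Scala from here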
In my code, there are two broadcast variables. Sometimes reading the small
one takes more time than the big one - so strange!
The log on a slave node is as follows:
Block broadcast_2 stored as values to memory (estimated size *4.0 KB*, free
17.2 GB)
Reading broadcast variable 2 took *9.998537123* s
Additionally, reading the big broadcast variable always takes about 2 s.