I am getting this error when I try to run K-Means in spark-0.9.0:
java.lang.NoClassDefFoundError: org/apache/spark/util/Vector
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
at
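For what it's worth, a NoClassDefFoundError on a class that ships with Spark itself usually means the jar the job was built against does not match the Spark version on the executors' classpath. A minimal sbt sketch (hypothetical build file) that pins the dependency to the cluster's release:

// build.sbt -- hypothetical; keep this in sync with the cluster's Spark version
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"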
On 27 March 2014 at 09:47, andy petrella andy.petre...@gmail.com wrote:
I'm hijacking the thread, but my 2c is that this feature is also important for
enabling ad-hoc queries, which are composed at runtime. It doesn't remove the
interest of such a macro for precompiled jobs, of course, but it may not be the first
Thank you very much, Sourav.
BR
On 3/26/14, 17:29, Sourav Chandra wrote:
def print() {
  def foreachFunc = (rdd: RDD[T], time: Time) => {
    val total = rdd.collect().toList
    println("-------------------------------------------")
    println("Time: " + time)
    println
On Thu, Mar 27, 2014 at 10:22 AM, andy petrella andy.petre...@gmail.com wrote:
I just mean queries sent at runtime ^^, like for any RDBMS.
In our project we have such a requirement: a layer for playing with the data
(a custom, low-level service layer of a lambda architecture), and something
like
nope (what I said :-P)
On Thu, Mar 27, 2014 at 11:05 AM, Pascal Voitot Dev pascal.voitot@gmail.com wrote:
On Thu, Mar 27, 2014 at 10:22 AM, andy petrella andy.petre...@gmail.com wrote:
I just mean queries sent at runtime ^^, like for any RDBMS.
In our project we have such
On Thu, Mar 27, 2014 at 11:08 AM, andy petrella andy.petre...@gmail.com wrote:
nope (what I said :-P)
That's also my answer to my own question :D
but I didn't understand that in your sentence: "my2c is that this feature
is also important to enable ad-hoc queries which is done at runtime."
Does Shark not suit your needs? That's what we use at the moment and it's been
good
Sent from my Samsung Galaxy S®4
Original message
From: andy petrella andy.petre...@gmail.com
Date: 03/27/2014 6:08 AM (GMT-05:00)
To: user@spark.apache.org
Subject: Re: Announcing Spark
Yes it could, of course. I didn't say that there is no tool to do it,
though ;-).
Andy
On Thu, Mar 27, 2014 at 12:49 PM, yana yana.kadiy...@gmail.com wrote:
Does Shark not suit your needs? That's what we use at the moment and it's
been good
Sent from my Samsung Galaxy S®4
when there is something new, it's also cool to let imagination fly far away
;)
Hello,
I would like to run the WikipediaPageRank example
(https://github.com/amplab/graphx/blob/f8544981a6d05687fa950639cb1eb3c31e9b6bf5/examples/src/main/scala/org/apache/spark/examples/bagel/WikipediaPageRank.scala),
but the Wikipedia dump XML files are no longer available on Freebase. Does anyone
The API docs for ssc.awaitTermination say simply "Wait for the execution to
stop. Any exceptions that occurs during the execution will be thrown in
this thread."
Can someone help me understand what this means? What causes execution to
stop? Why do we need to wait for that to happen?
I tried
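For context, awaitTermination simply blocks the main thread so the driver JVM stays alive while the streaming job runs in the background; execution stops when ssc.stop() is called or a failure propagates. A minimal sketch of the usual driver pattern:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
// ... wire up the DStreams here ...
ssc.start()            // begin receiving and processing
ssc.awaitTermination() // block until ssc.stop() is called or an error stops execution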
I'm working with spark streaming using spark-shell, and hoping folks could
answer a few questions I have.
I'm doing WordCount on a socket stream:
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.Seconds
var
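For reference, a minimal socket WordCount sketch for the shell (assuming nc -lk 9999 is serving text on localhost):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // brings in pair-DStream operations

val ssc = new StreamingContext(sc, Seconds(10)) // sc is the shell's SparkContext
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()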
Found this transform fn in StreamingContext which takes in a DStream[_] and a
function which acts on each of its RDDs.
Unfortunately I can't figure out how to transform my DStream[(String,Int)] into
a DStream[_]
/*** Create a new DStream in which each RDD is generated by applying a function
on
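For what it's worth, a sketch of that call (pairStream is a hypothetical name): StreamingContext.transform expects a Seq[DStream[_]], so a DStream[(String, Int)] can be passed as an element of that Seq, with the cast back to the concrete RDD type happening inside the function:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

val out: DStream[(String, Int)] = ssc.transform(
  Seq(pairStream), // a DStream[(String, Int)] fits DStream[_]
  (rdds: Seq[RDD[_]], time: Time) =>
    rdds.head.asInstanceOf[RDD[(String, Int)]].filter(_._2 > 0)
)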
Please disregard, I didn't see the Seq wrapper.
From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: March-27-14 11:57 AM
To: u...@spark.incubator.apache.org
Subject: StreamingContext.transform on a DStream
Found this transform fn in StreamingContext which takes in a DStream[_] and a
How exactly does rdd.mapPartitions get executed once in each VM?
I am running mapPartitions and the call function seems not to execute the
code?
JavaPairRDD<String, String> twos = input.map(new
Split()).sortByKey().partitionBy(new HashPartitioner(k));
twos.values().saveAsTextFile(args[2]);
Look at the tuning guide on Spark's webpage for strategies to cope with
this.
I have run into quite a few memory issues like these, some are resolved
by changing the StorageLevel strategy and employing things like Kryo,
some are solved by specifying the number of tasks to break down a given
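As a concrete illustration of two of those strategies (a sketch using the 0.9-era API; the path and names are hypothetical), Kryo plus a serialized storage level often shrinks the in-memory footprint considerably:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("memory-demo").setMaster("local[2]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
val data = sc.textFile("hdfs:///some/input") // hypothetical path
  .persist(StorageLevel.MEMORY_ONLY_SER)     // store serialized to save space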
I think now that this is because spark.local.dir is defaulting to /tmp, and
since the tasks are not running on the same machine, the file is not found
when the second task takes over.
How do you set spark.local.dir appropriately when running on mesos?
Are you caching a lot of RDDs? If so, maybe you should unpersist() the
ones that you're not using. Also, if you're on 0.9, make sure
spark.shuffle.spill is enabled (which it is by default). This allows your
application to spill in-memory content to disk if necessary.
How much memory are you
I have a simple streaming job that creates a kafka input stream on a topic
with 8 partitions, and does a forEachRDD
The job and tasks are running on mesos, and there are two tasks running, but
only 1 task doing anything.
I also set spark.streaming.concurrentJobs=8 but still there is only 1 task
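One workaround in common use at the time (a sketch; zkQuorum, group, and topic are hypothetical names) is to create several kafka input streams and union them, since a single receiver funnels all consumption through one task:

import org.apache.spark.streaming.kafka.KafkaUtils

val streams = (1 to 8).map { _ =>
  KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> 1))
}
val unioned = ssc.union(streams) // downstream stages now draw from 8 receivers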
Yea, it's in standalone mode, and I did use the SparkContext.addJar method and
tried setting SPARK_CLASSPATH via setExecutorEnv, etc., but none of it worked.
I finally made it work by modifying the ClientBase.scala code where I set
'appMasterOnly' to false before the addJars contents were added to
No, I am running on 0.8.1.
Yes, I am caching a lot: I am benchmarking simple code in Spark where
512MB, 1GB and 2GB text files are taken, some basic intermediate operations
are done, and the intermediate results that will be used in subsequent
operations are cached.
I thought that we need not
Hey Rohit,
I think external tables based on Cassandra or other datastores will work
out-of-the-box if you build Catalyst with Hive support.
Michael may have feelings about this, but I'd guess the longer-term design
for having schema support for Cassandra/HBase etc. likely wouldn't rely on
Hive
You have to raise the global limit as root. Also you have to do that on the
whole cluster.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Mar 27, 2014 at 4:07 AM, Hahn Jiang
Which storage scheme are you using? I am guessing it is MEMORY_ONLY. For
large datasets, MEMORY_AND_DISK or MEMORY_AND_DISK_SER work better.
You can call unpersist on an RDD to remove it from the cache, though.
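A small sketch of both suggestions together (the input path is hypothetical):

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("input.txt")          // hypothetical input
  .persist(StorageLevel.MEMORY_AND_DISK_SER) // spill to disk instead of OOM-ing
// ... run one experiment ...
data.unpersist()                             // evict before the next experiment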
On Thu, Mar 27, 2014 at 11:57 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
No, I am
I create several RDDs by merging several consecutive RDDs from a DStream. Is
there a way to add these new RDDs to a DStream?
-Adrian
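One possibility (a sketch; it assumes the merged RDDs are already materialized, and mergedRdd1/mergedRdd2 are hypothetical names) is queueStream, which builds a DStream from a queue of RDDs:

import scala.collection.mutable
import org.apache.spark.rdd.RDD

val queue = new mutable.SynchronizedQueue[RDD[(String, Int)]]()
val merged = ssc.queueStream(queue) // each queued RDD becomes one batch
queue += mergedRdd1
queue += mergedRdd2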
Heh, sorry, that wasn't a clear question. I know 'how' to set it but don't know
what value to use in a mesos cluster: since the processes are running in lxc
containers they won't be sharing a filesystem (or machine for that matter).
I can't use an s3n:// url for local dir, can I?
Actually, looking closer, it is stranger than I thought:
in the spark UI, one executor has executed 4 tasks, and one has executed
1928.
Can anyone explain the workings of a KafkaInputStream wrt kafka partitions
and mapping to spark executors and tasks?
2. I notice that once I call ssc.start(), my stream starts processing and
continues indefinitely... even if I close the socket on the server end (I'm
using the unix command nc to mimic a server, as explained in the streaming
programming guide). Can I tell my stream to detect if it's lost a
Well, it says that the jar was successfully added but can't reference
classes from it. Does this have anything to do with this bug?
http://stackoverflow.com/questions/22457645/when-to-use-spark-classpath-or-sparkcontext-addjar
On Thu, Mar 27, 2014 at 2:57 PM, Sandy Ryza sandy.r...@cloudera.com
Evgeniy Shishkin wrote:
So, at the bottom — kafka input stream just does not work.
That was the conclusion I was coming to as well. Are there open tickets
around fixing this up?
That bug only appears to apply to spark-shell.
Do things work in yarn-client mode or on a standalone cluster? Are you
passing a path with parent directories to addJar?
On Thu, Mar 27, 2014 at 3:01 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote:
Well, it says that the jar was successfully
Seems like the configuration of the Spark worker is not right. Either the
worker has not been given enough memory, or the allocation of memory to
RDD storage needs to be fixed. If configured correctly, the Spark
workers should not get OOMs.
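For the configuration itself, a sketch (property names taken from the 0.9-era docs; the values are illustrative):

import org.apache.spark.SparkConf

// In the application, before the SparkContext is created:
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")         // heap available to each executor
  .set("spark.storage.memoryFraction", "0.6") // share of that heap for cached blocks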
On Thu, Mar 27, 2014 at 2:52 PM, Evgeny
On 28 Mar 2014, at 01:44, Tathagata Das tathagata.das1...@gmail.com wrote:
The more I think about it, the problem is not about /tmp; it's more about the
workers not having enough memory. Blocks of received data could be falling
out of memory before they get processed.
BTW, what is the
Thanks everyone for the discussion.
Just to note, I restarted the job yet again, and this time there are indeed
tasks being executed by both worker nodes. So the behavior does seem
inconsistent/broken atm.
Then I added a third node to the cluster, and a third executor came up, and
everything
Christopher,
Sorry, I might be missing the obvious, but how do I get my function called on
all executors used by the app? I don't want to use RDDs unless necessary.
Once I start my shell or app, how do I get
TaskNonce.getSingleton().doThisOnce() executed on each executor?
@dmpour
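A sketch of the idiom being asked about (this TaskNonce is a hypothetical reconstruction from the name, not the actual class from the thread): a JVM-wide singleton guards the work, and a foreachPartition pass over a dummy RDD with plenty of partitions triggers it on every executor:

object TaskNonce {
  @volatile private var done = false
  def doThisOnce(work: => Unit): Unit = synchronized {
    if (!done) { work; done = true } // runs at most once per JVM
  }
}

sc.parallelize(1 to 1000, 100).foreachPartition { _ =>
  TaskNonce.doThisOnce(println("initializing this JVM")) // stand-in for real setup
}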
I didn't mention anything, so by default it should be MEMORY_AND_DISK, right?
My doubt was: between two different experiments, do the RDDs cached in
memory need to be unpersisted,
or does it not matter?
On Fri, Mar 28, 2014 at 1:43 AM, Syed A. Hashmi shas...@cloudera.com wrote:
Which storage
I see, did this also fail with previous versions of Spark (0.9 or 0.8)? We’ll
try to look into these, seems like a serious error.
Matei
On Mar 27, 2014, at 7:27 PM, Jim Blomo jim.bl...@gmail.com wrote:
Thanks, Matei. I am running Spark 1.0.0-SNAPSHOT built for Hadoop
1.0.4 from GitHub on
Can anyone help?
How can I configure a different spark.local.dir for each executor?
On 23 Mar, 2014, at 12:11 am, Tsai Li Ming mailingl...@ltsai.com wrote:
Hi,
Each of my worker nodes has its own unique spark.local.dir.
However, when I run spark-shell, the shuffle writes are always
Hi,
My worker nodes have more memory than the host from which I'm submitting my driver
program, but it seems that SPARK_MEM also sets the Xmx of the spark shell?
$ SPARK_MEM=100g MASTER=spark://XXX:7077 bin/spark-shell
Java HotSpot(TM) 64-Bit Server VM warning: INFO:
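One way around it (a sketch for a standalone driver program rather than the shell; the property name is from the docs of that era) is to request executor memory per application instead of through SPARK_MEM, so the driver JVM keeps a normal heap:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://XXX:7077")
  .setAppName("big-memory-job")
  .set("spark.executor.memory", "100g") // workers get 100g; the driver Xmx is untouched
val sc = new SparkContext(conf)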