Spark repeatedly fails to broadcast a large object on a cluster of 25
machines for me.
I get log messages like this:
[spark-akka.actor.default-dispatcher-4] WARN
org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager
BlockManagerId(3, cloud-33.dima.tu-berlin.de, 42185, 0) with no
Sandy, this looks like a version mismatch: compiling against Spark HEAD and
running against 0.8.1?
The method was added to Typesafe Config on Oct 8, 2012 and first released
in roughly version 0.6:
https://github.com/typesafehub/config/commit/5f486f65ac68745ca89059a5b6b144c2daa5d157#diff-3d5ac6ed49837be68d4f47d3b96b1c81
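For what it's worth, under that diagnosis the fix is to compile against the same
Spark release the cluster runs. A minimal sbt sketch, using the 0.8.1 artifact
coordinates:
// build.sbt -- pin Spark to the release deployed on the cluster
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.8.1-incubating"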
Hi,
I want to split an RDD by a certain percentage, like 10% (split the RDD into
10 pieces).
Ideally, the preferred function would look like this:
def deterministicSplit[T](dataSet: RDD[T], nb: Int): Array[RDD[T]] = {
/* code */
}
The dataSet is an RDD sorted by its key. For example, if nb = 10 here,
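A minimal sketch of such a function, assuming a Spark version where
RDD.zipWithIndex is available (it is not in 0.8.x); it numbers the rows once
and filters each contiguous slice out of the indexed parent:
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
// Split a sorted RDD into nb contiguous, deterministic slices by row index.
def deterministicSplit[T: ClassTag](dataSet: RDD[T], nb: Int): Array[RDD[T]] = {
  val total = dataSet.count()
  val sliceSize = math.ceil(total.toDouble / nb).toLong
  val indexed = dataSet.zipWithIndex().cache()  // pairs of (element, rowIndex)
  Array.tabulate(nb) { i =>
    val start = i * sliceSize
    val end = math.min(start + sliceSize, total)
    indexed.filter { case (_, idx) => idx >= start && idx < end }.map(_._1)
  }
}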
What do you mean by worker dir?
On Tue, Jan 7, 2014 at 11:43 AM, Nan Zhu zhunanmcg...@gmail.com wrote:
Hi, all
I’m trying to change my worker dir to a mounted disk with larger space,
but I found no documentation telling me how to do this,
so I had to check the source code and found the
On the worker machines you will find $SPARK_HOME/work/;
that's the worker dir.
--
Nan Zhu
On Tuesday, January 7, 2014 at 7:29 AM, Archit Thakur wrote:
What do you mean by worker dir?
On Tue, Jan 7, 2014 at 11:43 AM, Nan Zhu zhunanmcg...@gmail.com wrote:
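For the record, the setting this thread is after: the standalone worker's
scratch directory can be redirected in conf/spark-env.sh (the worker daemon
also accepts a --work-dir flag). A sketch, with an example mount point:
# conf/spark-env.sh -- point the worker's scratch space at the larger disk
export SPARK_WORKER_DIR=/mnt/bigdisk/spark-work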
On Thu, Jan 2, 2014 at 5:52 PM, Andrew Ash and...@andrewash.com wrote:
That sounds right, Mayur.
Also in 0.8.1 I hear there's a new repartition method that you might be
able to use to further distribute the data. But if your data is so small
that it fits in just a couple blocks, why are you
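A sketch of that suggestion; smallRdd and the partition count are illustrative:
// Shuffle a small RDD (e.g. one that landed in a single HDFS block)
// into more partitions so every machine in the cluster gets work.
val spread = smallRdd.repartition(100)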
Hi,
When we time an action, the measurement includes the timings of all of its
transformations too, and it is not clear how long each transformation takes. Is
there a way of timing each transformation separately?
Also, does Spark provide a way of more detailed progress reporting, broken
down into transformation steps? For
What's the size of your large object to be broadcast?
On Tue, Jan 7, 2014 at 8:55 AM, Sebastian Schelter s...@apache.org wrote:
Spark repeatedly fails to broadcast a large object on a cluster of 25
machines for me.
I get log messages like this:
[spark-akka.actor.default-dispatcher-4] WARN
If small-file is hosted in HDFS, I think the default is one partition per
HDFS block. If it fits in a single block (blocks are 64MB each by default),
that might mean just one partition.
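You can also ask for more partitions at load time, since textFile takes a
minimum-splits argument; the path here is illustrative:
// Request at least 25 partitions when reading, instead of one per HDFS block.
val data = sc.textFile("hdfs:///path/to/small-file", 25)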
Sent from my mobile phone
On Jan 7, 2014 8:46 AM, Aureliano Buendia buendia...@gmail.com wrote:
On Thu, Jan 2, 2014 at 5:52
I think that would do what you want. I'm guessing in ... you have an RDD
and then call .collect on it -- normally this would be a bad idea because
of large data sizes, but if you KNOW that it's small then you can force it
through just that one machine.
On Tue, Jan 7, 2014 at 9:20 AM, Aureliano
On Tue, Jan 7, 2014 at 6:04 PM, Andrew Ash and...@andrewash.com wrote:
I think that would do what you want. I'm guessing in ... you have an
RDD and then call .collect on it -- normally this would be a bad idea
because of large data sizes, but if you KNOW that it's small then you can
force it
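A related sketch for the single-output-file case discussed below, assuming the
data is known to be small (the path is illustrative):
// Funnel everything through one partition so saveAsTextFile emits a single part file.
// Only safe when the data comfortably fits on one machine.
rdd.coalesce(1).saveAsTextFile("hdfs:///out/single-file")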
When we time an action, the measurement includes the timings of all of its
transformations too, and it is not clear how long each transformation takes.
Is there a way of timing each transformation separately?
Not really, because even though you may logically specify several different
transformations within your
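One workaround, since transformations are lazy and only run when an action
fires: force each step with its own action and time that. A rough sketch (the
helper and the names input/parse are illustrative):
// Time a block by bracketing it with nanoTime.
def timed[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(label + " took " + (System.nanoTime() - start) / 1e6 + " ms")
  result
}
val parsed = input.map(parse).cache()
timed("parse step") { parsed.count() }  // count() forces just this transformation
Caching changes the execution profile, so treat the numbers as rough.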
Hi, all
Thanks for the reply
I actually need to provide a single file to an external system that will process
it… it seems that I have to make the consumer of the file support multiple inputs.
Best,
--
Nan Zhu
On Tuesday, January 7, 2014 at 12:37 PM, Aaron Davidson wrote:
HDFS, since 0.21
Hi Nan,
A cleaner approach is to expose a RESTful service to the external system.
The external system then calls the service through an appropriate API.
For Scala, Spray can be used to build these services. Twitter's OSS projects
also have many examples of this service design.
Thanks.
Deb
On Tue, Jan 7, 2014 at 10:25
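A minimal sketch of that design, assuming spray-routing's SimpleRoutingApp
(spray 1.2-era API); all names here are illustrative, not from the thread:
import akka.actor.ActorSystem
import spray.routing.SimpleRoutingApp
// Expose the job's output over HTTP instead of handing a file around.
object ResultService extends App with SimpleRoutingApp {
  implicit val system = ActorSystem("result-service")
  startServer(interface = "0.0.0.0", port = 8080) {
    path("results") {
      get {
        complete("the computed results would be served here")
      }
    }
  }
}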
I am trying to run Spark version 0.8.1 on Hadoop 2.2.0-cdh5.0.0-beta-1 with
YARN.
I am using the YARN Client with yarn-standalone mode as described here:
http://spark.incubator.apache.org/docs/latest/running-on-yarn.html
To simplify matters, I'll say my application code is all contained in
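For reference, the yarn-standalone submission in the linked docs looks roughly
like this; the angle-bracket placeholders are the docs' own:
SPARK_JAR=<SPARK_ASSEMBLY_JAR_FILE> ./spark-class org.apache.spark.deploy.yarn.Client \
  --jar <YOUR_APP_JAR_FILE> \
  --class <APP_MAIN_CLASS> \
  --args <APP_MAIN_ARGUMENTS> \
  --num-workers <NUMBER_OF_WORKER_MACHINES>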
Hi,
The EC2 documentation
(http://spark.incubator.apache.org/docs/0.8.1/ec2-scripts.html) has a
section called 'Running Applications', but it actually lacks the step
which should describe how to run the application.
The spark_ec2
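For what it's worth, the step that section presumably intends, as best I recall
from the same page: SSH into the master with the script's login action and run
the application from there against the cluster URL:
./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>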
Hi, all
I’m trying the ALS in MLlib.
The following is my code:
val result = als.run(ratingRDD)
val allMovies = ratingRDD.map(rating => rating.product).distinct()
val allUsers = ratingRDD.map(rating => rating.user).distinct()
val allUserMoviePair = allUsers.cartesian(allMovies)
I've spent some time trying to import data into an RDD using the Spark Java
API, but am not able to properly load data stored in a Hadoop v1.1.1
sequence file with key and value types both LongWritable. I've attached a
copy of the sequence file to this posting. It contains 3000 key, value
pairs.
Not found in which part of the code? If in the SparkContext thread, say on the AM,
--addJars should work.
If on tasks, then --addJars won't work; you need to use --file=local://xxx etc.
I'm not sure whether that is available in 0.8.1. Bundling everything into a single
jar should also work; if that doesn't work, something else might be wrong.
Sorry, you actually can’t call predict() on the cluster because the model
contains some RDDs. There was a recent patch that added a parallel predict
method, here: https://github.com/apache/incubator-spark/pull/328/files. You can
grab the code from that method there (which does a join) and call
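Following that patch, the parallel predict takes an RDD of (user, product)
pairs, so the cartesian built in the earlier snippet slots straight in. A
sketch, assuming a build that includes the patch:
// allUserMoviePair: RDD[(Int, Int)] from allUsers.cartesian(allMovies)
val predictions = result.predict(allUserMoviePair)  // returns RDD[Rating]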
Yeah, unfortunately sequenceFile() reuses the Writable objects across records.
If you plan to use the records repeatedly (e.g. cache them), you should clone
them using a map function. It was originally designed assuming you only look at
each record once, but that behavior is poorly documented.
Matei
On Jan
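Putting that advice together for the LongWritable sequence file described
above, a sketch (path illustrative, sc assumed in scope): copy the primitive
values out inside a map before caching, so no reused Writable is retained:
import org.apache.hadoop.io.LongWritable
val pairs = sc.sequenceFile[LongWritable, LongWritable]("hdfs:///path/data.seq")
// Extract the longs immediately; the Writable instances themselves are reused.
val cached = pairs.map { case (k, v) => (k.get, v.get) }.cache()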
Matei, do you mean something like A rather than B below?
A) rdd.map(_.clone).cache
B) rdd.cache
I'd be happy to add documentation if there's a good place for it, but I'm
not sure there's an obvious place for it.
On Tue, Jan 7, 2014 at 9:35 PM, Matei Zaharia matei.zaha...@gmail.comwrote:
Yup, a) would make it work.
I’d actually prefer that we change it so it clones the objects by default, and
add a boolean flag (default false) for people who want to reuse objects. We’d
have to do the same in hadoopRDD and the various versions of that as well.
Matei
On Jan 8, 2014, at 12:38
Hi, I have installed spark-0.8.1-incubating. Both my master and worker
processes are running. The master is listening on port 7077. I have also
installed Shark (0.8.0). When I try to start the Shark shell
(./shark-withinfo), I get this exception:
14/01/07 21:49:21 INFO client.Client$ClientActor:
Agreed on the clone-by-default approach -- this reused-object gotcha has
hit several people I know when using Avro.
We should be careful not to ignore the performance impact that made Hadoop
reuse objects in the first place, though. I'm not sure what this means in
practice; you either