first() is allowed to run locally, which means that the driver will
execute the action itself without launching any tasks. This is also true of
take(n) for sufficiently small n, for instance.
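For example (an untested sketch, assuming a spark-shell session with the usual sc in scope):

  val rdd = sc.parallelize(1 to 1000000)
  rdd.first()   // may be computed locally by the driver, with no tasks launched
  rdd.take(5)   // the same optimization can apply for small n
  rdd.collect() // by contrast, always launches a job to fetch every partition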
On Wed, Feb 19, 2014 at 9:55 AM, David Thomas dt5434...@gmail.com wrote:
If I perform a 'collect'
But my RDD is placed on the worker nodes. So how can the driver perform the
action by itself?
If you are intentionally opening many files at once and getting that error,
then it is a fixable OS issue. Please check out this discussion regarding
changing the file limit in /etc/limits.conf:
http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-td1464.html
If you feel that
Note that you may just be changing the soft limit while still being
capped by the hard (system-wide) limit. Changing the /etc/limits.conf file
as specified above lets you modify both the soft and hard limits, and
requires a restart of the machine to take effect.
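For reference, here's the kind of entry being discussed (a sketch only -- on many distros the file actually lives at /etc/security/limits.conf, and the limit value below is just an example):

  *  soft  nofile  1000000
  *  hard  nofile  1000000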
On Thu, Feb 13, 2014 at
This sounds bad, and probably related to shuffle file consolidation.
Turning off consolidation would probably get you working again, but I'd
really love to track down the bug. Do you know if any tasks fail before
those errors start occurring? It's very possible that another exception is
occurring
This method is doing very little. Line 2 constructs the CoGroupedRDD, which
will do all the real work. Note that while this cogroup function just
groups 2 RDDs together, CoGroupedRDD allows general n-way cogrouping, so it
takes a Seq[RDD[(K, _)]] rather than just 2 such key-value RDDs.
The rest of
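For a concrete picture of what cogroup gives you, here's an untested sketch (assumes a spark-shell session where sc and the pair-RDD implicits are in scope; the data is made up):

  val orders   = sc.parallelize(Seq((1, "order-a"), (2, "order-b")))
  val payments = sc.parallelize(Seq((1, "pay-x"),   (3, "pay-z")))
  // cogroup builds a CoGroupedRDD under the hood and groups both sides by key:
  // 1 -> (Seq(order-a), Seq(pay-x)), 2 -> (Seq(order-b), Seq()), 3 -> (Seq(), Seq(pay-z))
  val grouped = orders.cogroup(payments)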
Are you seeing any exceptions in between running apps? Does restarting the
master/workers actually cause Spark to speed back up again? It's possible,
for instance, that you have run out of disk space, which would cause exceptions
but would not be fixed by restarting the master/workers.
One thing to worry
Could you give an example where default parallelism is set to 2 where it
wasn't before?
Here is the relevant section for the spark standalone mode:
On Tue, Jan 7, 2014 at 2:52 AM, Aaron Davidson ilike...@gmail.com wrote:
Please take a look at the source code -- it's relatively friendly, and
very useful for digging into Spark internals!
(KryoSerializer: https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache
Buendia buendia...@gmail.com wrote:
On Sun, Jan 5, 2014 at 2:28 AM, Aaron Davidson ilike...@gmail.com wrote:
Additionally, which version of Spark are you running?
0.8.1.
Unfortunately, this doesn't work either:
MASTER=local[2] ADD_JARS=/path/to/my/jar
SPARK_CLASSPATH=/path/to/my/jar
Could you post the stack trace you see for the NPE?
On Mon, Dec 30, 2013 at 11:31 AM, Archit Thakur
archit279tha...@gmail.com wrote:
I am still getting it. I googled and found a similar open problem on
stackoverflow:
The error you're receiving is because the Akka frame size must be
a positive Java Integer, i.e., less than 2^31. However, the frame size is
not intended to be nearly the size of the job memory -- it is the smallest
unit of data transfer that Spark does. In this case, your task result
size is
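If you do need a larger frame size, something like this should work (a sketch; if I recall correctly the property is interpreted in MB in this version, and it must be set before the SparkContext is created):

  System.setProperty("spark.akka.frameSize", "128") // raise the max message size to ~128 MB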
I'd be fine with one-way mirrors here (Apache threads being reflected in
Google groups) -- I have no idea how one is supposed to navigate the Apache
list to look for historic threads.
On Thu, Dec 19, 2013 at 7:58 PM, Mike Potts maspo...@gmail.com wrote:
Thanks very much for the prompt and
You might also check the spark/work/ directory for application (Executor)
logs on the slaves.
On Tue, Nov 19, 2013 at 6:13 PM, Umar Javed umarj.ja...@gmail.com wrote:
I have a scala script that I'm trying to run on a Spark standalone cluster
with just one worker (existing on the master node).
This is very likely due to memory issues. The problem is that each
reducer (partition of the groupBy) builds an in-memory table of that
partition. If you have very few partitions, this will fail, so the solution
is to simply increase the number of reducers. For example:
sc.parallelize(1 to
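Something along these lines (an illustrative sketch with made-up numbers; the second argument to groupBy is the number of reduce partitions):

  val grouped = sc.parallelize(1 to 10000000)
    .groupBy(x => x % 100, 200) // 200 reducers instead of the default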
This discussion seems to indicate the possibility of a mismatch between one
side being Serializable and the other being Externalizable:
https://forums.oracle.com/thread/2147644
In general, the semantics of Serializable can be pretty strange as it
doesn't really behave the same as usual
There is a pull request currently to fix this exact issue, I believe, at
https://github.com/apache/incubator-spark/pull/192. It's very small and
only touches the script files, so you could apply it to your current
version and distribute it to the workers. The fix here is that you add an
additional
compression or seeing more than 48k
DiskBlockObjectWriters to account for the remaining memory used.
On Fri, Nov 22, 2013 at 9:05 AM, Aaron Davidson ilike...@gmail.com wrote:
Great, thanks for the feedback. It sounds like you're using the LZF
compression scheme -- switching to Snappy should see
Have you looked at the Spark executor logs? They're usually located in the
$SPARK_HOME/work/ directory. If you're running in a cluster, they'll be on
the individual slave nodes. These should hopefully reveal more information.
On Mon, Nov 18, 2013 at 3:42 PM, Chris Grier
The main issue with running a spark-shell locally is that it orchestrates
the actual computation, so you want it to be close to the actual Worker
nodes for latency reasons. Running a spark-shell on EC2 in the same region
as the Spark cluster avoids this problem.
The error you're seeing seems to
Also, in general, you can workaround shortcomings in the Java API by
converting to a Scala RDD (using JavaRDD's rdd() method). The API tends to
be much clunkier since you have to jump through some hoops to talk to a
Scala API in Java, though. In this case, JavaRDD's mapPartition() method
will
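For illustration, a Scala-side sketch of the round trip (names and data here are made up; assumes the 0.8-era packages):

  import java.util.Arrays
  import org.apache.spark.api.java.JavaSparkContext

  val jsc = new JavaSparkContext("local", "JavaInteropSketch")
  val javaRdd = jsc.parallelize(Arrays.asList("a", "bb", "ccc"))
  // rdd() exposes the underlying Scala RDD and its richer API
  val lengths = javaRdd.rdd.mapPartitions(iter => iter.map(_.length))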
The number of splits can be configured when reading the file, as an
argument to textFile(), sequenceFile(), etc (see
docs: http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext@textFile(String,Int):RDD[String]).
Note that this is a minimum, however, as
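For example (a sketch; the HDFS path and split count are placeholders):

  val lines = sc.textFile("hdfs://namenode:9000/data/input.txt", 64) // at least 64 splits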
Apache Ant(TM) version 1.8.2 compiled on May 18 2012
Mvh/BR
Egon Kidmose
On Sun, Nov 17, 2013 at 7:40 PM, Aaron Davidson ilike...@gmail.com
wrote:
Could you report your ant/Ivy version? Just run
ant -version
The fundamental problem is that Ivy is stupidly thinking .orbit is the
file
If you are using local mode, you can just pass -Xmx32g to the JVM that
is launching Spark, and it will have that much memory.
On Fri, Nov 15, 2013 at 6:30 PM, Aaron Davidson ilike...@gmail.com
wrote:
One possible workaround would be to use the local-cluster Spark mode. This
is normally used only for testing, but it will actually spawn a separate
process for the executor. The format is:
new SparkContext("local-cluster[1,4,32000]")
This will spawn 1 Executor that is allocated 4 cores and 32GB of memory.
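Putting it together (a sketch, assuming the two-argument (master, appName) constructor; the app name is a placeholder):

  val sc = new SparkContext("local-cluster[1,4,32000]", "LocalClusterTest")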
It is responsible for a subset of shuffle blocks. MapTasks split up their
data, creating one shuffle block for every reducer. During the shuffle
phase, the reducer will fetch all shuffle blocks that were intended for it.
On Sun, Nov 10, 2013 at 9:38 PM, Umar Javed umarj.ja...@gmail.com wrote:
As a followup on this, the memory footprint of all shuffle metadata has
been greatly reduced. For your original workload with 7k mappers, 7k
reducers, and 5 machines, the total metadata size should have decreased
from ~3.3 GB to ~80 MB.
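For a back-of-the-envelope sense of scale: 7,000 mappers x 7,000 reducers is 49
million shuffle blocks, so ~3.3 GB of metadata worked out to roughly 70 bytes
per block, while ~80 MB comes to under 2 bytes per block.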
On Tue, Oct 29, 2013 at 9:07 AM, Aaron Davidson ilike
I've seen this happen before due to the driver doing long GCs when the
driver machine was heavily memory-constrained. For this particular issue,
simply freeing up memory used by other applications fixed the problem.
On Fri, Nov 1, 2013 at 12:14 AM, Liu, Raymond raymond@intel.com wrote:
Hi
Stephen is exactly correct; I just wanted to point out that in Spark 0.8.1
and above, the repartition function has been added to be a clearer way to
accomplish what you want. (Coalescing into a larger number of partitions
doesn't make much linguistic sense.)
On Thu, Oct 31, 2013 at 7:48 AM,
You are correct. If you are just using spark-shell in local mode (i.e.,
without a cluster), you can set the SPARK_MEM environment variable to give
the driver more memory. E.g.:
SPARK_MEM=24g ./spark-shell
Otherwise, if you're using a real cluster, the driver shouldn't require a
significant amount
Snappy sounds like it'd be a better solution here. LZF requires a pretty
sizeable buffer per stream (accounting for the 300k you're seeing). It
looks like you have 7000 reducers, and each one requires an LZF-compressed
stream. Snappy has a much lower overhead per stream, so I'd give it a try.
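Switching should just be a matter of this (a sketch, assuming the 0.8-era property and codec class names; it must be set before creating the SparkContext):

  System.setProperty("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")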
Thanks for the heads up! I have submitted a pull request to fix it
(#98: https://github.com/apache/incubator-spark/pull/98),
so it should be corrected soon. In the meantime, if anyone is curious, the
real link should be
On the other hand, I totally agree that memory usage in Spark is rather
opaque, and is one area where we could do a lot better in terms of
communicating issues, through both docs and instrumentation. At least with
serialization and such, you can get meaningful exceptions (hopefully), but
OOMs
with a peculiarity of the Scala shell:
https://groups.google.com/forum/#!searchin/spark-users/error$3A$20type$20mismatch|sort:relevance/spark-users/bwAmbUgxWrA/HwP4Nv4adfEJ
On Fri, Oct 11, 2013 at 6:01 PM, Aaron Davidson ilike...@gmail.com wrote:
Playing around with this a little more, it seems that classOf[Animal] is
this.Animal in Spark and Animal in normal Scala.
Also, trying to do something like this:
class Zoo[A <: this.Animal](thing: A) { }
works in Scala but throws a weird error in Spark:
error: type Animal is not a member of
Also, please post feature requests here: http://spark-project.atlassian.net
Make sure to search prior to posting to avoid duplicates.
On Tue, Oct 8, 2013 at 11:50 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hi Shay,
We actually don't support Mesos in the EC2 scripts anymore -- sorry
You might try also setting spark.driver.host to the correct IP in the
conf/spark-env.sh SPARK_JAVA_OPTS as well.
e.g.,
-Dspark.driver.host=192.168.250.47
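i.e., something like this in conf/spark-env.sh (a sketch; the IP is just the example's placeholder):

  SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.driver.host=192.168.250.47"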
On Sat, Oct 5, 2013 at 2:45 PM, Aaron Babcock aaron.babc...@gmail.com wrote:
Hello,
I am using spark through a vpn. My driver machine