On Thu, Apr 10, 2014 at 12:24 PM, Andrew Ash and...@andrewash.com wrote:
The biggest issue I've come across is that the cluster is somewhat unstable
when under memory pressure. Meaning that if you attempt to persist an RDD
that's too big for memory, even with MEMORY_AND_DISK, you'll often still get OOMs.
It's highly dependent on what the issue is with your particular job, but
the ones I modify most commonly are:
spark.storage.memoryFraction
spark.shuffle.memoryFraction
parallelism (a parameter on many RDD calls) -- increase from the default
level to get more, smaller tasks that are more likely to fit in memory.
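The knobs above can be sketched in code like this (a minimal illustration, not from the thread; the app name, input path, fraction values, and partition counts are all hypothetical, and the right values depend on the job):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD functions in pre-1.0 APIs

// Illustrative values only -- tune per workload.
val conf = new SparkConf()
  .setAppName("memory-tuning-sketch")            // hypothetical app name
  .set("spark.storage.memoryFraction", "0.4")    // shrink the RDD cache share
  .set("spark.shuffle.memoryFraction", "0.3")    // give shuffles more working space
val sc = new SparkContext(conf)

// Higher parallelism => more, smaller tasks that are likelier to fit in memory.
val words = sc.textFile("hdfs:///data/input", 400)  // hypothetical path, 400 partitions
val counts = words.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, 400)
```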
It’s not a new API, it just happens underneath the current one if you have
spark.shuffle.spill set to true (which it is by default). Take a look at the
config settings that mention “spill” in
http://spark.incubator.apache.org/docs/latest/configuration.html.
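For reference, the spill-related settings can be set in spark-defaults.conf roughly like this (a sketch; the values shown are believed to be the defaults of that era -- check the linked configuration page for your version):

```
spark.shuffle.spill              true
spark.shuffle.memoryFraction     0.3
spark.shuffle.spill.compress     true
```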
Matei
On Apr 11, 2014, at 7:02 AM,
Excellent, thank you.
On Fri, Apr 11, 2014 at 12:09 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
It's not a new API, it just happens underneath the current one if you have
spark.shuffle.spill set to true (which it is by default). Take a look at
the config settings that mention "spill" in
http://spark.incubator.apache.org/docs/latest/configuration.html.
Hello Spark Users,
With the recent graduation of Spark to a top level project (grats, btw!),
maybe this is a well-timed question. :)
We are at the very beginning of a large scale big data project and after
two months of exploration work we'd like to settle on the technologies to
use, roll up our sleeves, and get to work.
When you say Spark is one of the forerunners for our technology choice,
what are the other options you are looking into?
I started cross-validation runs on a 40-core, 160 GB Spark job using a
script... I woke up in the morning and none of the jobs had crashed! And the
project just came out of incubation.
Spark has been endorsed by Cloudera as the successor to MapReduce. That
says a lot...
On Thu, Apr 10, 2014 at 10:11 AM, Andras Nemeth
andras.nem...@lynxanalytics.com wrote:
Hello Spark Users,
With the recent graduation of Spark to a top level project (grats, btw!),
maybe this is a well-timed question. :)
Here are several good ones:
https://www.google.com/search?q=cloudera+spark
On Thu, Apr 10, 2014 at 10:42 AM, Ian Ferreira ianferre...@hotmail.com wrote:
Do you have the link to the Cloudera endorsement?
Mike Olson's comment:
http://vision.cloudera.com/mapreduce-spark/
Here's the partnership announcement:
http://databricks.com/blog/2013/10/28/databricks-and-cloudera-partner-to-support-spark.html
On Thu, Apr 10, 2014 at 10:42 AM, Ian Ferreira ianferre...@hotmail.com
wrote:
Do you have the link to the Cloudera endorsement?
I'll provide answers from our own experience at Bizo. We've been using
Spark for 1+ year now and have found it generally better than previous
approaches (Hadoop + Hive mostly).
On Thu, Apr 10, 2014 at 7:11 AM, Andras Nemeth
andras.nem...@lynxanalytics.com wrote:
I. Is it too much magic? Lots
The biggest issue I've come across is that the cluster is somewhat unstable
when under memory pressure. Meaning that if you attempt to persist an RDD
that's too big for memory, even with MEMORY_AND_DISK, you'll often still
get OOMs. I had to carefully modify some of the space tuning parameters
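As a hedged illustration of the failure mode described (hypothetical input path; assumes an existing SparkContext `sc`):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///big/input")              // hypothetical path
val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk when full
cached.count()
// One commonly cited reason OOMs still occur: a partition is materialized
// in memory before it can be dropped to disk, so a single oversized
// partition can still exhaust an executor's heap.
```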
I would echo much of what Andrew has said.
I manage a small/medium sized cluster (48 cores, 512G ram, 512G disk
space dedicated to spark, data storage in separate HDFS shares). I've
been using Spark since 0.7, and as with Andrew I've observed
significant and consistent improvements in stability
On Thu, Apr 10, 2014 at 9:24 AM, Andrew Ash and...@andrewash.com wrote:
The biggest issue I've come across is that the cluster is somewhat
unstable when under memory pressure. Meaning that if you attempt to
persist an RDD that's too big for memory, even with MEMORY_AND_DISK, you'll
often still get OOMs.
Can anyone comment on their experience running Spark Streaming in
production?
On Thu, Apr 10, 2014 at 10:33 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
On Thu, Apr 10, 2014 at 9:24 AM, Andrew Ash and...@andrewash.com wrote:
The biggest issue I've come across is that the cluster is somewhat
unstable when under memory pressure.
Here are answers to a subset of your questions:
1. Memory management
The general direction of these questions is whether it's possible to take
RDD caching related memory management more into our own hands as LRU
eviction is nice most of the time but can be very suboptimal in some of our
use cases.
4. Shuffle on disk
Is it true - I couldn't find it in official docs, but did see this mentioned
in various threads - that shuffle _always_ hits disk? (Disregarding OS
caches.) Why is this the case? Are you planning to add a function to do
shuffle in memory, or are there some intrinsic reasons for it to work this way?
To add onto the discussion about memory working space, 0.9 introduced the
ability to spill data within a task to disk, and in 1.0 we’re also changing the
interface to allow spilling data within the same *group* to disk (e.g. when you
do groupBy and get a key with lots of values). The main
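A small sketch of the hot-key case Matei describes (synthetic skewed data; assumes an existing SparkContext `sc`):

```scala
import org.apache.spark.SparkContext._  // pair-RDD functions in pre-1.0 APIs

// Synthetic skew: a single hot key with a very large value group.
val pairs = sc.parallelize(1 to 1000000).map(i => ("hot", i))
val grouped = pairs.groupByKey()
// Before in-group spilling, all values for "hot" had to fit in memory at
// once; with the 1.0 change, an oversized value group can overflow to disk.
grouped.mapValues(_.size).collect()
```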