Re: Spark - ready for prime time?

2014-04-13 Thread Jim Blomo
On Thu, Apr 10, 2014 at 12:24 PM, Andrew Ash and...@andrewash.com wrote: The biggest issue I've come across is that the cluster is somewhat unstable when under memory pressure. Meaning that if you attempt to persist an RDD that's too big for memory, even with MEMORY_AND_DISK, you'll often

Re: Spark - ready for prime time?

2014-04-13 Thread Andrew Ash
It's highly dependent on what the issue is with your particular job, but the ones I modify most commonly are: spark.storage.memoryFraction spark.shuffle.memoryFraction parallelism (a parameter on many RDD calls) -- increase from the default level to get more, smaller tasks that are more likely to
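The settings Andrew names would typically go in `spark-defaults.conf` (or be set on a `SparkConf`). A minimal sketch, with illustrative values only (the defaults shown match the 0.9-era docs; tune per job):

```properties
# spark-defaults.conf -- example values, not recommendations
spark.storage.memoryFraction   0.5    # default 0.6; heap fraction reserved for cached RDDs
spark.shuffle.memoryFraction   0.3    # heap fraction for shuffle-side aggregation before spilling
spark.default.parallelism      200    # more, smaller tasks
```

Parallelism can also be raised per call site, e.g. `rdd.reduceByKey(func, numPartitions = 200)`, rather than globally.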

Re: Spark - ready for prime time?

2014-04-11 Thread Matei Zaharia
It’s not a new API, it just happens underneath the current one if you have spark.shuffle.spill set to true (which it is by default). Take a look at the config settings that mention “spill” in http://spark.incubator.apache.org/docs/latest/configuration.html. Matei On Apr 11, 2014, at 7:02 AM,
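For reference, the spill-related keys on the configuration page Matei links look like this (0.9-era names and defaults, shown as a config fragment):

```properties
# spark-defaults.conf -- shuffle spill settings
spark.shuffle.spill            true   # spill shuffle data to disk past the memory threshold (default)
spark.shuffle.memoryFraction   0.3    # in-memory limit for shuffle aggregation before spilling
spark.shuffle.spill.compress   true   # compress spilled shuffle data
```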

Re: Spark - ready for prime time?

2014-04-11 Thread Surendranauth Hiraman
Excellent, thank you. On Fri, Apr 11, 2014 at 12:09 PM, Matei Zaharia matei.zaha...@gmail.com wrote: It's not a new API, it just happens underneath the current one if you have spark.shuffle.spill set to true (which it is by default). Take a look at the config settings that mention spill in

Fwd: Spark - ready for prime time?

2014-04-10 Thread Andras Nemeth
Hello Spark Users, With the recent graduation of Spark to a top level project (grats, btw!), maybe a well timed question. :) We are at the very beginning of a large scale big data project and after two months of exploration work we'd like to settle on the technologies to use, roll up our sleeves

Re: Spark - ready for prime time?

2014-04-10 Thread Debasish Das
When you say Spark is one of the forerunners for our technology choice, what are the other options you are looking into? I started cross-validation runs on a 40-core, 160 GB Spark job using a script... I woke up in the morning and none of the jobs had crashed! And the project just came out of incubation

Re: Spark - ready for prime time?

2014-04-10 Thread Dean Wampler
Spark has been endorsed by Cloudera as the successor to MapReduce. That says a lot... On Thu, Apr 10, 2014 at 10:11 AM, Andras Nemeth andras.nem...@lynxanalytics.com wrote: Hello Spark Users, With the recent graduation of Spark to a top level project (grats, btw!), maybe a well timed

Re: Spark - ready for prime time?

2014-04-10 Thread Dean Wampler
Here are several good ones: https://www.google.com/search?q=cloudera+spark&oq=cloudera+spark&aqs=chrome..69i57j69i65l3j69i60l2.4439j0j7&sourceid=chrome&espv=2&es_sm=119&ie=UTF-8 On Thu, Apr 10, 2014 at 10:42 AM, Ian Ferreira ianferre...@hotmail.com wrote: Do you have the link to the Cloudera

Re: Spark - ready for prime time?

2014-04-10 Thread Sean Owen
Mike Olson's comment: http://vision.cloudera.com/mapreduce-spark/ Here's the partnership announcement: http://databricks.com/blog/2013/10/28/databricks-and-cloudera-partner-to-support-spark.html On Thu, Apr 10, 2014 at 10:42 AM, Ian Ferreira ianferre...@hotmail.com wrote: Do you have the

Re: Spark - ready for prime time?

2014-04-10 Thread Alex Boisvert
I'll provide answers from our own experience at Bizo. We've been using Spark for 1+ year now and have found it generally better than previous approaches (Hadoop + Hive mostly). On Thu, Apr 10, 2014 at 7:11 AM, Andras Nemeth andras.nem...@lynxanalytics.com wrote: I. Is it too much magic? Lots

Re: Spark - ready for prime time?

2014-04-10 Thread Andrew Ash
The biggest issue I've come across is that the cluster is somewhat unstable when under memory pressure. Meaning that if you attempt to persist an RDD that's too big for memory, even with MEMORY_AND_DISK, you'll often still get OOMs. I had to carefully modify some of the space tuning parameters
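The failure mode Andrew describes can be sketched as follows. This is a hedged illustration, not his code: `bigRdd` is a placeholder, and the point is that MEMORY_AND_DISK spills whole partitions that don't fit, yet a single oversized partition can still OOM while being computed, so smaller partitions help:

```scala
import org.apache.spark.storage.StorageLevel

// bigRdd is hypothetical. MEMORY_AND_DISK writes partitions that don't fit
// in memory to disk, but each partition must still be computed in memory,
// so repartitioning into smaller pieces lowers the peak memory per task.
val cached = bigRdd
  .repartition(400)                        // more, smaller partitions
  .persist(StorageLevel.MEMORY_AND_DISK)
cached.count()                             // materialize the cache
```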

Re: Spark - ready for prime time?

2014-04-10 Thread Brad Miller
I would echo much of what Andrew has said. I manage a small/medium-sized cluster (48 cores, 512 GB RAM, 512 GB disk space dedicated to Spark, data storage in separate HDFS shares). I've been using Spark since 0.7, and as with Andrew I've observed significant and consistent improvements in stability

Re: Spark - ready for prime time?

2014-04-10 Thread Dmitriy Lyubimov
On Thu, Apr 10, 2014 at 9:24 AM, Andrew Ash and...@andrewash.com wrote: The biggest issue I've come across is that the cluster is somewhat unstable when under memory pressure. Meaning that if you attempt to persist an RDD that's too big for memory, even with MEMORY_AND_DISK, you'll often

Re: Spark - ready for prime time?

2014-04-10 Thread Roger Hoover
Can anyone comment on their experience running Spark Streaming in production? On Thu, Apr 10, 2014 at 10:33 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Thu, Apr 10, 2014 at 9:24 AM, Andrew Ash and...@andrewash.com wrote: The biggest issue I've come across is that the cluster is

Re: Spark - ready for prime time?

2014-04-10 Thread Andrew Or
Here are answers to a subset of your questions: 1. Memory management The general direction of these questions is whether it's possible to take RDD caching related memory management more into our own hands as LRU eviction is nice most of the time but can be very suboptimal in some of our use
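One way to take caching "more into your own hands", rather than waiting on LRU eviction, is to unpersist explicitly once a cached RDD is no longer needed. A minimal sketch (`rawRdd` and `parse` are placeholders):

```scala
// Deterministic cache management instead of relying on LRU eviction:
// persist while the RDD is reused, then free the memory explicitly.
val parsed = rawRdd.map(parse).persist()
val total  = parsed.count()       // first use
val sample = parsed.take(10)      // second use, served from cache
parsed.unpersist()                // release executor memory now, not at LRU's discretion
```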

Re: Spark - ready for prime time?

2014-04-10 Thread Brad Miller
4. Shuffle on disk Is it true - I couldn't find it in official docs, but did see this mentioned in various threads - that shuffle _always_ hits disk? (Disregarding OS caches.) Why is this the case? Are you planning to add a function to do shuffle in memory or are there some intrinsic reasons

Re: Spark - ready for prime time?

2014-04-10 Thread Matei Zaharia
To add onto the discussion about memory working space, 0.9 introduced the ability to spill data within a task to disk, and in 1.0 we’re also changing the interface to allow spilling data within the same *group* to disk (e.g. when you do groupBy and get a key with lots of values). The main
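The groupBy hot-key case Matei mentions is the classic one. As a hedged sketch (not from the thread): where the downstream operation is an aggregation, `reduceByKey` sidesteps the problem entirely by combining map-side, so no single task ever holds all of a key's values:

```scala
// pairs: RDD[(String, Int)], placeholder. groupByKey gathers every value for
// a key into one task, which is what the 1.0-era per-group spilling addresses;
// reduceByKey aggregates incrementally and map-side instead.
val counts = pairs.reduceByKey(_ + _)
```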