Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
into an rdd with context.textFile(), flatmap that and union these rdds. also see http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files On 1 December 2014 at 16:50, Keith Simmons ke...@pulse.io wrote: This is a long shot, but... I'm trying to load a bunch

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
through the files record by record and outputs them to hdfs, then read them all in as RDDs and take the union? That would only use bounded memory. On 1 December 2014 at 17:19, Keith Simmons ke...@pulse.io wrote: Actually, I'm working with a binary format. The api allows reading out a single
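The record-by-record suggestion in the reply above (stream each file, then take the union) can be sketched outside Spark with plain lazy iterators; file handling and names here are hypothetical, but the bounded-memory shape is the same as building one RDD per file with `context.textFile()` and unioning them:

```python
import os
import tempfile
from itertools import chain

def records(path):
    # Lazily yield one record at a time; nothing beyond the
    # current line is ever held in memory.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def stream_all(paths):
    # Chain the per-file generators. Memory use stays bounded by a
    # single record, mirroring per-file RDDs combined with union().
    return chain.from_iterable(records(p) for p in paths)

# Demo with two temporary files standing in for the real inputs.
paths = []
for content in ("a\nb\n", "c\n"):
    fd, p = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.write(content)
    paths.append(p)

result = list(stream_all(paths))
print(result)  # ['a', 'b', 'c']

for p in paths:
    os.remove(p)
```

The key point from the thread is that only materializing the union forces anything; until then each file is read through in constant memory.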

Re: Setting only master heap

2014-10-26 Thread Keith Simmons
for SPARK_WORKER_MEMORY, but this has been deprecated. If you do set it, it just does the same thing as setting SPARK_EXECUTOR_MEMORY would have done. - Sameer On Wed, Oct 22, 2014 at 1:46 PM, Keith Simmons ke...@pulse.io wrote: We've been getting some OOMs from the spark master since

Setting only master heap

2014-10-22 Thread Keith Simmons
We've been getting some OOMs from the spark master since upgrading to Spark 1.1.0. I've found SPARK_DAEMON_MEMORY, but that also seems to increase the worker heap, which as far as I know is fine. Is there any setting which *only* increases the master heap size? Keith
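For reference, the settings discussed in this thread live in `conf/spark-env.sh` on a standalone cluster. The values below are illustrative only, and the semantics are as described in the replies above (Spark 1.1 era), not a recommendation:

```sh
# conf/spark-env.sh (standalone mode) -- illustrative values only

# Heap for the standalone daemons. Per this thread it raises the
# master *and* the worker daemon heaps; there is no master-only
# equivalent in Spark 1.1.
SPARK_DAEMON_MEMORY=2g

# Per-executor heap. Per the reply above, the deprecated
# SPARK_WORKER_MEMORY now behaves the same as setting this.
SPARK_EXECUTOR_MEMORY=4g
```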

Re: Hung spark executors don't count toward worker memory limit

2014-10-13 Thread Keith Simmons
Maybe I should put this another way. If spark has two jobs, A and B, both of which consume the entire allocated memory pool, is it expected that spark can launch B before the executor processes tied to A are completely terminated? On Thu, Oct 9, 2014 at 6:57 PM, Keith Simmons ke...@pulse.io

Hung spark executors don't count toward worker memory limit

2014-10-09 Thread Keith Simmons
Hi Folks, We have a spark job that is occasionally running out of memory and hanging (I believe in GC). This is its own issue we're debugging, but in the meantime, there's another unfortunate side effect. When the job is killed (most often because of GC errors), each worker attempts to kill

Re: Hung spark executors don't count toward worker memory limit

2014-10-09 Thread Keith Simmons
PM, Keith Simmons ke...@pulse.io wrote: Hi Folks, We have a spark job that is occasionally running out of memory and hanging (I believe in GC). This is its own issue we're debugging, but in the meantime, there's another unfortunate side effect. When the job is killed (most often because

Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
Hi folks, I'm running into the following error when trying to perform a join in my code: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.catalyst.types.LongType$ I see similar errors for StringType$ and also: scala.reflect.runtime.ReflectError: value apache is

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
locally. On Tue, Jul 15, 2014 at 11:56 AM, Keith Simmons keith.simm...@gmail.com wrote: Nope. All of them are registered from the driver program. However, I think we've found the culprit. If the join column between two tables is not in the same column position in both tables, it triggers

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
) On Tue, Jul 15, 2014 at 1:05 PM, Michael Armbrust mich...@databricks.com wrote: Can you print out the queryExecution? (i.e. println(sql().queryExecution)) On Tue, Jul 15, 2014 at 12:44 PM, Keith Simmons keith.simm...@gmail.com wrote: To give a few more details of my environment in case

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
Cool. So Michael's hunch was correct, it is a thread issue. I'm currently using a tarball build, but I'll do a spark build with the patch as soon as I have a chance and test it out. Keith On Tue, Jul 15, 2014 at 4:14 PM, Zongheng Yang zonghen...@gmail.com wrote: Hi Keith gorenuru, This

Re: Comparative study

2014-07-09 Thread Keith Simmons
Good point. Shows how personal use cases color how we interpret products. On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen so...@cloudera.com wrote: On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote: Impala is *not* built on map/reduce, though it was built to replace Hive, which

Re: Comparative study

2014-07-08 Thread Keith Simmons
Santosh, To add a bit more to what Nabeel said, Spark and Impala are very different tools. Impala is *not* built on map/reduce, though it was built to replace Hive, which is map/reduce based. It has its own distributed query engine, though it does load data from HDFS, and is part of the hadoop

Re: Spark Memory Bounds

2014-05-28 Thread Keith Simmons
, 2014 at 7:22 PM, Keith Simmons ke...@pulse.io wrote: A dash of both. I want to know enough that I can reason about, rather than strictly control, the amount of memory Spark will use. If I have a big data set, I want to understand how I can design it so that Spark's memory consumption falls

Spark Memory Bounds

2014-05-27 Thread Keith Simmons
I'm trying to determine how to bound my memory use in a job working with more data than can simultaneously fit in RAM. From reading the tuning guide, my impression is that Spark's memory usage is roughly the following: (A) In-Memory RDD use + (B) In memory Shuffle use + (C) Transient memory used
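A rough back-of-the-envelope version of the (A) + (B) + (C) breakdown above, assuming the Spark 1.x pool fractions (`spark.storage.memoryFraction`, default 0.6, and `spark.shuffle.memoryFraction`, default 0.2); the `memory_budget` helper name is hypothetical:

```python
def memory_budget(executor_heap_mb, storage_fraction=0.6, shuffle_fraction=0.2):
    """Split an executor heap into the three pools the tuning guide describes:
    (A) cached RDD storage, (B) shuffle buffers, and (C) everything
    transient (task objects, closures, deserialization scratch space).
    """
    storage = executor_heap_mb * storage_fraction      # (A) in-memory RDD use
    shuffle = executor_heap_mb * shuffle_fraction      # (B) shuffle use
    transient = executor_heap_mb - storage - shuffle   # (C) what's left over
    return {"storage_mb": storage, "shuffle_mb": shuffle, "transient_mb": transient}

budget = memory_budget(4096)
print(budget)  # {'storage_mb': 2457.6, 'shuffle_mb': 819.2, 'transient_mb': 819.2}
```

As the thread notes, (C) is the hard part to bound, since it depends on per-task object sizes rather than on a configurable fraction; the arithmetic above only reasons about the configured pools.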

Re: Spark Memory Bounds

2014-05-27 Thread Keith Simmons
memory requirements. We might work on this and submit a PR for it. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Tue, May 27, 2014 at 5:33 PM, Keith Simmons ke...@pulse.io wrote: I'm trying to determine how to bound my memory use in a job