into an RDD with context.textFile(), flatMap that, and union these RDDs.
Also see
http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files
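A minimal sketch of that approach in Scala, in case it helps. The paths and the parseRecords helper below are hypothetical stand-ins, and plain-text input is assumed; adjust for your format:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("load-many-files"))

// Hypothetical stand-in for whatever turns one line into zero or more records.
def parseRecords(line: String): Seq[String] = line.split(',').toSeq

// Hypothetical HDFS paths.
val paths = Seq("/data/input/part-000", "/data/input/part-001", "/data/input/part-002")

// One RDD per file, flatMap each into records, then a single union over all of them.
val perFile = paths.map(p => sc.textFile(p).flatMap(parseRecords))
val all = sc.union(perFile)

println(all.count())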
On 1 December 2014 at 16:50, Keith Simmons ke...@pulse.io wrote:
This is a long shot, but...
I'm trying to load a bunch
through the files record
by record and outputs them to HDFS, then read them all in as RDDs and take
the union? That would only use bounded memory.
On 1 December 2014 at 17:19, Keith Simmons ke...@pulse.io wrote:
Actually, I'm working with a binary format. The API allows reading out a
single
for SPARK_WORKER_MEMORY, but this has been deprecated. If you do
set it, it just does the same thing as setting SPARK_EXECUTOR_MEMORY would
have done.
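For what it's worth, a minimal sketch of the per-application route (standalone mode assumed; the 4g value is only illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// spark.executor.memory sets the executor heap for this application;
// the master/worker daemon heaps are governed separately by SPARK_DAEMON_MEMORY
// in spark-env.sh.
val conf = new SparkConf()
  .setAppName("memory-config-example")
  .set("spark.executor.memory", "4g") // illustrative value

val sc = new SparkContext(conf)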
- Sameer
On Wed, Oct 22, 2014 at 1:46 PM, Keith Simmons ke...@pulse.io wrote:
We've been getting some OOMs from the spark master since upgrading to Spark
1.1.0. I've found SPARK_DAEMON_MEMORY, but that also seems to increase the
worker heap, which as far as I know is fine. Is there any setting which
*only* increases the master heap size?
Keith
Maybe I should put this another way. If spark has two jobs, A and B, both
of which consume the entire allocated memory pool, is it expected that
spark can launch B before the executor processes tied to A are completely
terminated?
On Thu, Oct 9, 2014 at 6:57 PM, Keith Simmons ke...@pulse.io
Hi Folks,
We have a spark job that is occasionally running out of memory and hanging
(I believe in GC). This is its own issue we're debugging, but in the
meantime, there's another unfortunate side effect. When the job is killed
(most often because of GC errors), each worker attempts to kill
Hi folks,
I'm running into the following error when trying to perform a join in my
code:
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.spark.sql.catalyst.types.LongType$
I see similar errors for StringType$ and also:
scala.reflect.runtime.ReflectError: value apache is
locally.
On Tue, Jul 15, 2014 at 11:56 AM, Keith Simmons keith.simm...@gmail.com
wrote:
Nope. All of them are registered from the driver program.
However, I think we've found the culprit. If the join column between two
tables is not in the same column position in both tables, it triggers
On Tue, Jul 15, 2014 at 1:05 PM, Michael Armbrust mich...@databricks.com
wrote:
Can you print out the queryExecution?
(i.e. println(sql().queryExecution))
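For anyone reading the archive, a rough sketch of what that looks like, assuming a Spark 1.x SQLContext, an existing SparkContext sc, and two hypothetical registered tables a and b:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Assumes "a" and "b" were registered earlier via registerTempTable / registerAsTable.
val joined = sqlContext.sql("SELECT a.id, b.name FROM a JOIN b ON a.id = b.id")

// queryExecution prints the analyzed logical plan and the physical plan.
println(joined.queryExecution)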
On Tue, Jul 15, 2014 at 12:44 PM, Keith Simmons keith.simm...@gmail.com
wrote:
To give a few more details of my environment in case
Cool. So Michael's hunch was correct, it is a thread issue. I'm currently
using a tarball build, but I'll do a spark build with the patch as soon as
I have a chance and test it out.
Keith
On Tue, Jul 15, 2014 at 4:14 PM, Zongheng Yang zonghen...@gmail.com wrote:
Hi Keith, gorenuru,
This
Good point. Shows how personal use cases color how we interpret products.
On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen so...@cloudera.com wrote:
On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote:
Impala is *not* built on map/reduce, though it was built to replace
Hive, which
Santosh,
To add a bit more to what Nabeel said, Spark and Impala are very different
tools. Impala is *not* built on map/reduce, though it was built to replace
Hive, which is map/reduce based. It has its own distributed query engine,
though it does load data from HDFS, and is part of the Hadoop
, 2014 at 7:22 PM, Keith Simmons ke...@pulse.io wrote:
A dash of both. I want to know enough that I can reason about, rather
than strictly control, the amount of memory Spark will use. If I have a
big data set, I want to understand how I can design it so that Spark's
memory consumption falls
I'm trying to determine how to bound my memory use in a job working with
more data than can simultaneously fit in RAM. From reading the tuning
guide, my impression is that Spark's memory usage is roughly the following:
(A) in-memory RDD use + (B) in-memory shuffle use + (C) transient memory used
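A rough sketch of the knobs that bound (A) and (B) in 1.x, going off the tuning guide; the values are illustrative, not recommendations:

import org.apache.spark.SparkConf

// spark.storage.memoryFraction caps (A), the space for cached RDD blocks (default 0.6);
// spark.shuffle.memoryFraction caps (B), in-memory shuffle buffers before spilling (default 0.2).
// (C), transient memory used by task code, is whatever remains of the executor heap.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.storage.memoryFraction", "0.5")
  .set("spark.shuffle.memoryFraction", "0.2")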
memory requirements. We might work on this and submit a PR for
it.
--
Christopher T. Nguyen
Co-founder CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen
On Tue, May 27, 2014 at 5:33 PM, Keith Simmons ke...@pulse.io wrote:
I'm trying to determine how to bound my memory use in a job