Short: Why does coalesce use huge amounts of memory? How does it work
internally?
Long version:
I asked a similar question a few weeks ago, but I have a simpler test
with better numbers now. I have an RDD created from some HDFS files. I
want to sample it and then coalesce it into fewer partitions.
Why the coalesce without cache?
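A minimal sketch of the cache-first variant being hinted at here, assuming a
JavaSparkContext named sc; the paths, sample fraction, and partition count are
hypothetical:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaRDD<String> sampled = sc.textFile("hdfs:///data/input")
        .sample(false, 0.01);   // 1% sample, without replacement
sampled.cache();  // keep the sampled records in executor storage
sampled.count();  // any action materializes the cache at full parallelism
// Only the final write now runs in the 10 coalesced tasks; the expensive
// read-and-sample already happened with one task per HDFS split.
sampled.coalesce(10).saveAsTextFile("hdfs:///data/output");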
From: Christopher Brady
Sent: Friday, February 12, 2016 8:34 PM
To: Koert Kuipers; Silvio Fiorito
Cc: user
Subject: Re: coalesce and executor memory
Thank you for the responses. The map function just changes the format of the
record slightly, so I don't think it can account for the extra memory.
On 2/12/16, 1:13 PM, "Christopher Brady" <christopher.br...@oracle.com> wrote:
>Can anyone help me understand why using coalesce causes my executors to
>crash with out of memory? What happens during coalesce that increases
>memory usage so much?
Can anyone help me understand why using coalesce causes my executors to
crash with out of memory? What happens during coalesce that increases
memory usage so much?
If I do:
hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile
everything works fine, but if I do:
hadoopFile -> sample -> coalesce -> map -> saveAsNewAPIHadoopFile
the executors crash with out of memory.
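As far as I understand Spark's semantics (this is general coalesce behavior,
not something stated in the thread): coalesce without a shuffle is a narrow
dependency, so the job runs with only the reduced number of tasks, and each
of those tasks executes the entire read -> sample -> map chain over several
HDFS splits. A hedged sketch of the two pipelines, with hypothetical paths
and numbers, assuming a JavaSparkContext named sc:

import org.apache.spark.api.java.JavaRDD;

JavaRDD<String> sampled = sc.textFile("hdfs:///data/input")
        .sample(false, 0.01);

// Works: one task per original HDFS split.
sampled.map(r -> r.toUpperCase())
        .saveAsTextFile("hdfs:///data/out-many");

// The failing shape: coalesce(10) is a narrow dependency, so only 10
// tasks run the whole read -> sample -> map chain, each covering many
// splits' worth of input.
sampled.coalesce(10)
        .map(r -> r.toUpperCase())
        .saveAsTextFile("hdfs:///data/out-few");

If the goal is only fewer output files, coalesce(10, true) or repartition(10)
inserts a shuffle, which keeps the upstream stage at its original parallelism
at the cost of moving the sampled data across the network.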
The documentation for DataFrameWriter.format(String) says:
"Specifies the underlying output data source. Built-in options include
"parquet", "json", etc."
What options are there other than parquet and json? From googling I
found "com.databricks.spark.avro", but that doesn't seem to work
correctly.
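For reference, a minimal sketch of how format() is typically used, in the
same Java style as the code later in this digest; the paths and the
sqlContext variable are assumed, and as far as I know "orc", "jdbc", and
"text" were also built in during the Spark 1.x era:

import org.apache.spark.sql.DataFrame;

DataFrame df = sqlContext.read().format("json").load("hdfs:///data/in.json");
df.write().format("parquet").save("hdfs:///data/out-parquet");
// Third-party sources are addressed by their fully qualified name; the
// spark-avro jar must be on the classpath (e.g. via --packages):
df.write().format("com.databricks.spark.avro").save("hdfs:///data/out-avro");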
I'm having an issue where count(*) returns almost immediately using
Hive, but takes over 10 min using DataFrames. The table data is on HDFS
in an uncompressed CSV format. How is it possible for Hive to get the
count so fast? Is it caching this or putting it in the metastore?
Is there anything I can do to speed this up?
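One plausible explanation, not confirmed in this thread: Hive can answer
count(*) from table statistics in the metastore (hive.compute.query.using.stats)
without touching the data, while a DataFrame count has to scan every CSV
file. A sketch of the two paths, reusing the my_table name from the later
message as a hypothetical:

// Spark: scans all of the underlying CSV data.
long n = sqlContext.table("my_table").count();

// Hive: with hive.compute.query.using.stats=true, SELECT COUNT(*) can be
// served from statistics populated by:
//   ANALYZE TABLE my_table COMPUTE STATISTICS;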
Try adding the Hive jars to the executor classpath (with $HIVE_BASE_DIR
pointing at your hive installation):
--conf "spark.executor.extraClassPath=$HIVE_BASE_DIR/hive/lib/*"
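For context, a hedged sketch of a full spark-submit invocation carrying that
flag; the class name and jar are hypothetical, and shipping hive-site.xml
with --files is a common companion step, not something stated here:

spark-submit \
  --master yarn-client \
  --conf "spark.executor.extraClassPath=$HIVE_BASE_DIR/hive/lib/*" \
  --files $HIVE_BASE_DIR/hive/conf/hive-site.xml \
  --class com.example.HelloHive \
  my-app.jar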
On Sat, Dec 12, 2015 at 6:32 AM Christopher Brady
<christopher.br...@oracle.com> wrote:
I'm trying to run a basic "Hello world" type example using DataFrames
with Hive in yarn-client mode.
I'm trying to run a basic "Hello world" type example using DataFrames
with Hive in yarn-client mode. My code is:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

JavaSparkContext sc = new JavaSparkContext("yarn-client", "Test app");
HiveContext sqlContext = new HiveContext(sc.sc());
sqlContext.sql("SELECT * FROM my_table").count();
The exception I get is: