Short: Why does coalesce use huge amounts of memory? How does it work
internally?
Long version:
I asked a similar question a few weeks ago, but I have a simpler test
with better numbers now. I have an RDD created from some HDFS files. I
want to sample it and then coalesce it into fewer partitions. Should
coalesce be able to work without cache?
From: Christopher Brady
Sent: Friday, February 12, 2016 8:34 PM
To: Koert Kuipers ; Silvio Fiorito
Cc: user
Subject: Re: coalesce and executor memory
Thank you for the responses. The map function just changes the format of the
record slightly, so I don't think that would be the cause.
Can anyone help me understand why using coalesce causes my executors to
crash with out of memory? What happens during coalesce that increases
memory usage so much?
If I do:
hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile
everything works fine, but if I do:
hadoopFile -> sample -> coalesce -> map -> saveAsNewAPIHadoopFile
the executors crash with out of memory.
Do you have any details as to what the code for the map looks like?
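One common explanation: coalesce without a shuffle does not run as its own stage. Spark just groups the existing parent partitions into fewer bins and runs the whole upstream pipeline (sample, map, save) inside those fewer tasks, so each task has to process several parents' worth of data at once. A toy sketch of that grouping (an illustration only, not Spark's actual DefaultPartitionCoalescer, which also considers data locality):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of shuffle-free coalesce: n parent partitions are grouped
// into k output bins, so one output task reads many parents' data.
public class CoalesceSketch {
    // Assign each of n parent partitions to one of k bins, round-robin.
    static List<List<Integer>> coalesce(int n, int k) {
        List<List<Integer>> bins = new ArrayList<>();
        for (int i = 0; i < k; i++) bins.add(new ArrayList<>());
        for (int p = 0; p < n; p++) bins.get(p % k).add(p);
        return bins;
    }

    public static void main(String[] args) {
        // 12 HDFS input splits coalesced into 3 output partitions:
        List<List<Integer>> bins = coalesce(12, 3);
        for (int i = 0; i < bins.size(); i++) {
            // e.g. "output partition 0 reads parents [0, 3, 6, 9]"
            System.out.println("output partition " + i
                    + " reads parents " + bins.get(i));
        }
    }
}
```

Because the bins collapse into the same stage as the upstream transformations, each of the 3 tasks above processes 4 splits' worth of records, which can push an executor past its heap limit even when the 12-task version runs fine. Passing shuffle=true to coalesce (or calling repartition) adds a stage boundary instead, at the cost of a shuffle.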
On 2/12/16, 1:13 PM, "Christopher Brady" <christopher.br...@oracle.com> wrote:
>Can anyone help me understand why using coalesce causes my executors to crash
The documentation for DataFrameWriter.format(String) says:
"Specifies the underlying output data source. Built-in options include
"parquet", "json", etc."
What options are there other than parquet and json? From googling I
found "com.databricks.spark.avro", but that doesn't seem to work.
I'm having an issue where count(*) returns almost immediately using
Hive, but takes over 10 min using DataFrames. The table data is on HDFS
in an uncompressed CSV format. How is it possible for Hive to get the
count so fast? Is it caching this or putting it in the metastore?
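Most likely the latter: Hive can serve COUNT(*) straight from table statistics stored in the metastore (when hive.compute.query.using.stats is enabled) instead of scanning the files, while the DataFrame query reads and parses the whole CSV. A hedged illustration in HiveQL, reusing the my_table name from this thread:

```sql
-- Populates numRows (among other stats) in the metastore for the table.
ANALYZE TABLE my_table COMPUTE STATISTICS;
-- With stats present, SELECT COUNT(*) FROM my_table can be answered from
-- metadata, which is why it returns almost immediately in Hive.
```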
Is there anything else I need to set? I'm already adding the Hive jars
to the executor classpath like this (path is specific to my
installation):
--conf "spark.executor.extraClassPath=$HIVE_BASE_DIR/hive/lib/*"
On Sat, Dec 12, 2015 at 6:32 AM Christopher Brady
<christopher.br...@oracle.com <mailto:christopher.br...@oracle.com>>
wrote:
I'm trying to run a basic "Hello world" type example using DataFrames
with Hive in yarn-client mode. My code is:
JavaSparkContext sc = new JavaSparkContext("yarn-client", "Test app");
HiveContext sqlContext = new HiveContext(sc.sc());
sqlContext.sql("SELECT * FROM my_table").count();
The