coalesce executor memory explosion

2016-02-24 Thread Christopher Brady
Short: Why does coalesce use huge amounts of memory? How does it work internally? Long version: I asked a similar question a few weeks ago, but I have a simpler test with better numbers now. I have an RDD created from some HDFS files. I want to sample it and then coalesce it into fewer partitions ...
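
A minimal sketch of the pattern being described, using the Spark 1.x Java RDD API; the path, sample fraction, and partition count are placeholders rather than values from the post:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SampleThenCoalesce {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("yarn-client", "Sample then coalesce");
            JavaRDD<String> lines = sc.textFile("hdfs:///path/to/input");   // placeholder path
            JavaRDD<String> sampled = lines.sample(false, 0.1);             // 10% sample, no replacement
            JavaRDD<String> fewer = sampled.coalesce(10);                   // narrow dependency: no shuffle
            fewer.saveAsTextFile("hdfs:///path/to/output");                 // placeholder path
            sc.stop();
        }
    }

Because coalesce(n) without a shuffle folds several upstream partitions into one task, each task ends up processing (and, if cached, holding) several partitions' worth of data at once, which is likely what the question is probing.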

Re: coalesce and executor memory

2016-02-14 Thread Christopher Brady
...y the coalesce without cache? From: Christopher Brady Sent: Friday, February 12, 2016 8:34 PM To: Koert Kuipers; Silvio Fiorito Cc: user Subject: Re: coalesce and executor memory Thank you for the responses. The map function just changes the format of the record slightly, so I don't ...
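
One variant often suggested in this situation (an illustration only, not necessarily what this thread settled on) is to let the reduction in partition count happen across a shuffle boundary, so a single task is not asked to hold many upstream partitions at once. Here `sampled` stands in for the sampled RDD from the pipeline under discussion:

    // shuffle=true makes coalesce behave like repartition: upstream tasks keep
    // their original partitioning and the merge happens in a separate stage.
    JavaRDD<String> fewer = sampled.coalesce(10, true);   // equivalent to sampled.repartition(10)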

Re: coalesce and executor memory

2016-02-12 Thread Christopher Brady
... looks like? On 2/12/16, 1:13 PM, "Christopher Brady" <christopher.br...@oracle.com> wrote: > Can anyone help me understand why using coalesce causes my executors to crash with out of memory? What happens during coalesce that increases memory ...

coalesce and executor memory

2016-02-12 Thread Christopher Brady
Can anyone help me understand why using coalesce causes my executors to crash with out of memory? What happens during coalesce that increases memory usage so much? If I do: hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile everything works fine, but if I do: hadoopFile -> sample -> ...
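
A sketch of the two pipelines named above (Spark 1.x Java API; paths, types, and the sample fraction are placeholders, and the map is a stand-in for the record-format change mentioned in the replies). Per the thread title, the failing variant is assumed to be the one with coalesce inserted before the save:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class CoalescePipelines {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("yarn-client", "Coalesce pipelines");

            JavaPairRDD<LongWritable, Text> input = sc.hadoopFile(
                    "hdfs:///path/in", TextInputFormat.class, LongWritable.class, Text.class);

            // Variant reported to work: hadoopFile -> sample -> cache -> map -> save
            JavaPairRDD<Text, Text> ok = input
                    .sample(false, 0.1)
                    .cache()
                    .mapToPair(kv -> new Tuple2<Text, Text>(new Text(kv._2()), new Text(kv._2())));
            ok.saveAsNewAPIHadoopFile("hdfs:///path/out_ok", Text.class, Text.class,
                    TextOutputFormat.class);

            // Variant under discussion: the same chain with coalesce added, which
            // (per the thread) ends in executor out-of-memory errors.
            JavaPairRDD<Text, Text> fewer = input
                    .sample(false, 0.1)
                    .coalesce(24)
                    .mapToPair(kv -> new Tuple2<Text, Text>(new Text(kv._2()), new Text(kv._2())));
            fewer.saveAsNewAPIHadoopFile("hdfs:///path/out_few", Text.class, Text.class,
                    TextOutputFormat.class);

            sc.stop();
        }
    }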

DataFrameWriter.format(String) is there a list of options?

2015-12-23 Thread Christopher Brady
The documentation for DataFrameWriter.format(String) says: "Specifies the underlying output data source. Built-in options include "parquet", "json", etc." What options are there other than parquet and json? From googling I found "com.databricks.spark.avro", but that doesn't seem to work correctly ...
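
A hedged illustration of how the format name is passed in the 1.x DataFrame API (the table name and output paths are placeholders). The built-in short names in that era include "parquet", "json", and "orc" (the exact list depends on the Spark version); anything else is referenced by the package or class name of an external data source, such as the spark-avro package mentioned in the question:

    // Assumes an existing HiveContext named sqlContext and a Hive table "my_table".
    DataFrame df = sqlContext.table("my_table");
    df.write().format("parquet").save("/tmp/out_parquet");
    df.write().format("json").save("/tmp/out_json");
    // External data source, addressed by package name; requires the
    // spark-avro package on the driver and executor classpaths.
    df.write().format("com.databricks.spark.avro").save("/tmp/out_avro");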

count(*) performance in Hive vs Spark DataFrames

2015-12-16 Thread Christopher Brady
I'm having an issue where count(*) returns almost immediately using Hive, but takes over 10 min using DataFrames. The table data is on HDFS in an uncompressed CSV format. How is it possible for Hive to get the count so fast? Is it caching this or putting it in the metastore? Is there anything ...
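
For reference, a sketch of the two ways the count might be issued (the table name is a placeholder). One common explanation for this kind of gap is that Hive, with hive.compute.query.using.stats enabled and up-to-date table statistics, answers COUNT(*) directly from the metastore, whereas the DataFrame query scans the CSV files on HDFS:

    // Both run through the HiveContext; neither uses Hive's stats short-circuit,
    // so both end up reading the underlying files.
    long viaSql = sqlContext.sql("SELECT COUNT(*) FROM my_table").collect()[0].getLong(0);
    long viaApi = sqlContext.table("my_table").count();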

Re: Classpath problem trying to use DataFrames

2015-12-14 Thread Christopher Brady
... hive installation): --conf "spark.executor.extraClassPath=$HIVE_BASE_DIR/hive/lib/*" On Sat, Dec 12, 2015 at 6:32 AM Christopher Brady <christopher.br...@oracle.com> wrote: I'm trying to run a basic "Hello world" type example using DataFrames with ...
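
A programmatic equivalent of the --conf suggestion, in case it is easier to test from code (the Hive lib directory below is a placeholder and must point at the cluster's actual Hive installation). In yarn-client mode this should behave like passing the flag to spark-submit, since executors are launched after the context starts:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ExtraClassPathExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setMaster("yarn-client")
                    .setAppName("Test app")
                    .set("spark.executor.extraClassPath", "/opt/hive/lib/*");   // placeholder path
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... DataFrame work as in the original post ...
            sc.stop();
        }
    }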

Classpath problem trying to use DataFrames

2015-12-11 Thread Christopher Brady
I'm trying to run a basic "Hello world" type example using DataFrames with Hive in yarn-client mode. My code is: JavaSparkContext sc = new JavaSparkContext("yarn-client", "Test app"); HiveContext sqlContext = new HiveContext(sc.sc()); sqlContext.sql("SELECT * FROM my_table").count(); The exception ...
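
The same snippet made compilable end to end, with the surrounding class and imports added ("my_table" comes from the original post):

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.hive.HiveContext;

    public class DataFrameHello {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("yarn-client", "Test app");
            HiveContext sqlContext = new HiveContext(sc.sc());
            long rows = sqlContext.sql("SELECT * FROM my_table").count();
            System.out.println("Row count: " + rows);
            sc.stop();
        }
    }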