coalesce executor memory explosion

2016-02-24 Thread Christopher Brady
Short: Why does coalesce use huge amounts of memory? How does it work internally? Long version: I asked a similar question a few weeks ago, but I have a simpler test with better numbers now. I have an RDD created from some HDFS files. I want to sample it and then coalesce it into fewer
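
A minimal sketch of the operation being described, not code from the thread (input path, sample fraction, and target partition count are illustrative placeholders):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CoalesceSketch {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("coalesce-sketch"));
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input");
            JavaRDD<String> sampled = lines.sample(false, 0.01);

            // coalesce(n) with the default shuffle=false builds a narrow dependency:
            // each of the n remaining tasks reads several parent partitions itself,
            // so a single task handles far more records than before.
            JavaRDD<String> narrow = sampled.coalesce(10);

            // coalesce(n, true) inserts a shuffle (same effect as repartition), which
            // spreads records evenly at the cost of moving data across the network.
            JavaRDD<String> shuffled = sampled.coalesce(10, true);

            System.out.println(narrow.count() + " vs " + shuffled.count());
            sc.stop();
        }
    }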

Re: coalesce and executor memory

2016-02-14 Thread Christopher Brady
without cache? From: Christopher Brady Sent: Friday, February 12, 2016 8:34 PM To: Koert Kuipers; Silvio Fiorito Cc: user Subject: Re: coalesce and executor memory Thank you for the responses. The map function just changes the format of the record slightly, so I don't think that would

coalesce and executor memory

2016-02-12 Thread Christopher Brady
Can anyone help me understand why using coalesce causes my executors to crash with out-of-memory errors? What happens during coalesce that increases memory usage so much? If I do: hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile everything works fine, but if I do: hadoopFile -> sample
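
Not from the thread, but a minimal sketch of the two pipelines described above (paths, sample fraction, partition count, and the trivial mapValues are placeholders; map and cache are ordered so the cached values are fresh copies rather than Hadoop's reused record objects):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CoalescePipelines {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("coalesce-pipelines"));

            JavaPairRDD<LongWritable, Text> input =
                    sc.hadoopFile("hdfs:///data/input", TextInputFormat.class,
                                  LongWritable.class, Text.class);

            // Light per-record reformatting, copying each value into a new Text.
            JavaPairRDD<LongWritable, Text> mapped =
                    input.sample(false, 0.1)
                         .mapValues(v -> new Text(v.toString().trim()));

            // Pipeline reported to work: one output task per input split.
            mapped.cache()
                  .saveAsNewAPIHadoopFile("hdfs:///data/out-plain",
                          LongWritable.class, Text.class, TextOutputFormat.class);

            // Pipeline reported to crash: coalesce without a shuffle folds many
            // input splits into each of the 10 remaining tasks, so each task
            // (and the executor running it) processes far more data at once.
            mapped.coalesce(10)
                  .saveAsNewAPIHadoopFile("hdfs:///data/out-coalesced",
                          LongWritable.class, Text.class, TextOutputFormat.class);

            sc.stop();
        }
    }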

Re: coalesce and executor memory

2016-02-12 Thread Christopher Brady
s as to what the code for the map looks like? On 2/12/16, 1:13 PM, "Christopher Brady" <christopher.br...@oracle.com> wrote: >Can anyone help me understand why using coalesce causes my executors to >crash

DataFrameWriter.format(String) is there a list of options?

2015-12-23 Thread Christopher Brady
The documentation for DataFrameWriter.format(String) says: "Specifies the underlying output data source. Built-in options include "parquet", "json", etc." What options are there other than parquet and json? From googling I found "com.databricks.spark.avro", but that doesn't seem to work
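
For illustration, a sketch of DataFrameWriter.format with several sources; "my_table" and the output paths are placeholders. In Spark 1.x the built-in sources also include "orc" and "text", while package-provided ones such as "com.databricks.spark.avro" only resolve if the corresponding jar is on the classpath (for example via spark-submit --packages):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.hive.HiveContext;

    public class WriterFormats {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("writer-formats"));
            HiveContext sqlContext = new HiveContext(sc.sc());

            DataFrame df = sqlContext.table("my_table");

            // Built-in data sources.
            df.write().format("parquet").save("/tmp/out_parquet");
            df.write().format("json").save("/tmp/out_json");
            df.write().format("orc").save("/tmp/out_orc");

            // External data source, resolved by package name; this fails to load
            // unless the spark-avro jar is on the driver and executor classpath.
            df.write().format("com.databricks.spark.avro").save("/tmp/out_avro");

            sc.stop();
        }
    }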

count(*) performance in Hive vs Spark DataFrames

2015-12-16 Thread Christopher Brady
I'm having an issue where count(*) returns almost immediately using Hive, but takes over 10 min using DataFrames. The table data is on HDFS in an uncompressed CSV format. How is it possible for Hive to get the count so fast? Is it caching this or putting it in the metastore? Is there anything
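
A small sketch contrasting the two code paths (table name is a placeholder). One plausible explanation, not confirmed in this snippet: Hive can answer SELECT COUNT(*) directly from table statistics in the metastore when hive.compute.query.using.stats is enabled, while Spark always scans the CSV files on HDFS:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.hive.HiveContext;

    public class CountComparison {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("count-test"));
            HiveContext sqlContext = new HiveContext(sc.sc());

            // Both of these scan the underlying files through Spark; only a query
            // submitted to Hive itself can be short-circuited by metastore stats.
            long viaApi = sqlContext.table("my_table").count();
            long viaSql = sqlContext.sql("SELECT COUNT(*) FROM my_table")
                                    .collectAsList().get(0).getLong(0);

            System.out.println(viaApi + " / " + viaSql);
            sc.stop();
        }
    }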

Re: Classpath problem trying to use DataFrames

2015-12-14 Thread Christopher Brady
installation): --conf "spark.executor.extraClassPath=$HIVE_BASE_DIR/hive/lib/*" On Sat, Dec 12, 2015 at 6:32 AM Christopher Brady <christopher.br...@oracle.com> wrote: I'm trying to run a basic "Hello world" type exampl
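
The --conf flag above is the usual route; purely as an illustration, the same executor setting can also be placed on the SparkConf before the context starts (the driver-side classpath generally still has to go through --conf or spark-defaults.conf, since the driver JVM is already running). Class and table names here are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.hive.HiveContext;

    public class ClasspathExample {
        public static void main(String[] args) {
            // Mirrors the $HIVE_BASE_DIR variable used in the reply above.
            String hiveLib = System.getenv("HIVE_BASE_DIR") + "/hive/lib/*";

            SparkConf conf = new SparkConf()
                    .setAppName("dataframe-classpath-test")
                    .set("spark.executor.extraClassPath", hiveLib);

            JavaSparkContext sc = new JavaSparkContext(conf);
            HiveContext sqlContext = new HiveContext(sc.sc());
            System.out.println(sqlContext.sql("SELECT * FROM my_table").count());
            sc.stop();
        }
    }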

Classpath problem trying to use DataFrames

2015-12-11 Thread Christopher Brady
I'm trying to run a basic "Hello world" type example using DataFrames with Hive in yarn-client mode. My code is: JavaSparkContext sc = new JavaSparkContext("yarn-client", "Test app"); HiveContext sqlContext = new HiveContext(sc.sc()); sqlContext.sql("SELECT * FROM my_table").count(); The
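
For completeness, a self-contained version of the same "Hello world" snippet; the yarn-client master, app name, and table name come from the post above, everything else is boilerplate:

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.hive.HiveContext;

    public class HiveHelloWorld {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("yarn-client", "Test app");
            HiveContext sqlContext = new HiveContext(sc.sc());

            long rows = sqlContext.sql("SELECT * FROM my_table").count();
            System.out.println("Row count: " + rows);

            sc.stop();
        }
    }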