1.3.1: Persisting RDD in parquet - Conflicting partition column names

2015-04-27 Thread sranga
Hi, I am getting the following error when persisting an RDD in Parquet format to an S3 location. This code was working in version 1.2 but fails in 1.3.1. Any help is appreciated. Caused by: java.lang.AssertionError: assertion failed: Conflicting
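A minimal sketch of the partitioned-Parquet write being discussed (the schema, column names, and S3 path are hypothetical, and the writer API shown is from Spark 1.4+, where 1.3.x used df.save). The assertion is typically raised when directories under the same root mix partition column names, so the point is keeping one consistent key=value layout:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetPartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-partition-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical data with a single partition column "dt"
    val df = sc.parallelize(Seq(("2015-04-01", 1), ("2015-04-02", 2))).toDF("dt", "value")

    // Produces .../dt=2015-04-01/..., .../dt=2015-04-02/...
    // Mixing in directories named, say, date=... under the same root is what
    // triggers "Conflicting partition column names" when the data is read back.
    df.write.partitionBy("dt").parquet("s3n://my-bucket/events") // hypothetical path
  }
}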

Spark Streaming: HiveContext within Custom Actor

2014-12-29 Thread sranga
Hi, could Spark SQL be used from within a custom actor that acts as a receiver for a streaming application? If yes, what is the recommended way of passing the SparkContext to the actor? Thanks for your help. - Ranga
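One common pattern, shown here as a hedged sketch rather than what was necessarily done in this thread: keep the receiver actor purely for ingestion and create a lazily initialized HiveContext on the driver inside foreachRDD, since SparkContext/HiveContext are not serializable and should not be shipped to the receiver. The object and table names below are illustrative.

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.dstream.DStream

object HiveContextSingleton {
  @transient private var instance: HiveContext = _
  def getInstance(sc: SparkContext): HiveContext = synchronized {
    if (instance == null) instance = new HiveContext(sc)
    instance
  }
}

object StreamProcessor {
  def process(lines: DStream[String]): Unit = {
    lines.foreachRDD { rdd =>
      // foreachRDD runs on the driver, so the context can be taken from the RDD itself
      val hiveContext = HiveContextSingleton.getInstance(rdd.sparkContext)
      hiveContext.sql("SELECT COUNT(*) FROM some_table").collect().foreach(println) // hypothetical table
    }
  }
}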

Re: RDD Cache Cleanup

2014-11-26 Thread sranga
Just to close out this one, I noticed that the number of cache partitions was quite low for each of the RDDs (1 to 14). Increasing the number of partitions (to ~400) resolved this for me.
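A minimal sketch of the fix described above (the input path is hypothetical, and this assumes a spark-shell style session where sc is available): repartition into more, smaller partitions before persisting, so individual cached blocks are small enough not to be dropped under memory pressure.

import org.apache.spark.storage.StorageLevel

val raw = sc.textFile("hdfs:///data/input")  // hypothetical input
val repartitioned = raw.repartition(400)     // ~400 partitions, as mentioned above
repartitioned.persist(StorageLevel.MEMORY_AND_DISK)
repartitioned.count()                        // materialize the cache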

RDD Cache Cleanup

2014-11-25 Thread sranga
Hi, I am noticing that persisted RDDs get cleaned up very quickly, usually within a few minutes. I tried setting the spark.cleaner.ttl property to 20 hours and still see the same behavior. In my use case, I have to persist about 20 RDDs each of size
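A hedged sketch of the setup described (the input path is hypothetical): spark.cleaner.ttl is set in seconds at context creation, and persisting with a storage level that spills to disk means blocks evicted from memory can still be read back instead of being recomputed.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("persist-example")
  .set("spark.cleaner.ttl", "72000") // 20 hours, expressed in seconds

val sc = new SparkContext(conf)
val rdd = sc.textFile("hdfs:///data/part-00000") // hypothetical input
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
rdd.count() // materialize the cache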

Re: Spark-Shell: OOM: GC overhead limit exceeded

2014-10-08 Thread sranga
Increasing the driver memory resolved this issue. Thanks to Nick for the hint. Here is how I am starting the shell: spark-shell --driver-memory 4g --driver-cores 4 --master local
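The same values could also live in conf/spark-defaults.conf instead of being passed as flags each time (a sketch using the values from this thread; note that driver memory must be known before the driver JVM starts, so depending on the Spark version the command-line flag is the safer option):

spark.driver.memory  4g
spark.driver.cores   4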

Spark-Shell: OOM: GC overhead limit exceeded

2014-10-07 Thread sranga
Hi, I am new to Spark and trying to develop an application that loads data from Hive. Here is my setup:
* Spark 1.1.0 (built using -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive)
* Spark-shell running on a box with 16 GB RAM
* 4 cores, single processor
* OpenCSV library (SerDe)
* Hive table
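A hedged sketch of the spark-shell flow being described (the database, table, and column names are hypothetical): create a HiveContext from the shell's sc and query the Hive table backed by the OpenCSV SerDe.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc) // sc is provided by spark-shell
val rows = hiveContext.sql("SELECT * FROM my_db.my_csv_table") // hypothetical table
rows.take(10).foreach(println)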