How to avoid the delay associated with Hive Metastore when loading parquet?

2016-10-23 Thread ankits
Hi, I'm loading parquet files via spark, and the first time a file is loaded I see a 5-10s delay related to the Hive Metastore, with metastore messages in the console. How can I avoid this delay and keep the metadata around? I want the data to be persisted even after
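
A minimal sketch of one workaround, assuming Spark 2.x and that the job does not actually need Hive: build the session without Hive support so the in-memory catalog is used and no metastore is initialized, then cache the DataFrame and register a temp view so the schema stays around for later queries (app name and path below are hypothetical).

  import org.apache.spark.sql.SparkSession

  // Leaving Hive support off keeps the catalog in memory, so the first
  // read.parquet() never has to spin up a Hive Metastore connection.
  val spark = SparkSession.builder()
    .appName("parquet-no-metastore")
    .config("spark.sql.catalogImplementation", "in-memory")
    .getOrCreate()

  val df = spark.read.parquet("/path/to/parquet")
  df.cache()                        // keep the data in memory across queries
  df.createOrReplaceTempView("t")   // keep the schema queryable without re-reading files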

Partition parquet data by ENUM column

2015-07-21 Thread ankits
Hi, I am using a custom build of spark 1.4 with the parquet dependency upgraded to 1.7. I have thrift data encoded with parquet that I want to partition by a column of type ENUM. The Spark programming guide says partition discovery is only supported for string and numeric columns, so it seems
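
A minimal sketch of the usual workaround with the 1.4 DataFrame writer, assuming a DataFrame df whose ENUM surfaces as a column (a hypothetical "status" here) that can be cast to string: partition on a string-typed copy, since partition discovery only handles string and numeric columns.

  import org.apache.spark.sql.functions.col

  val withKey = df.withColumn("status_str", col("status").cast("string"))
  withKey.write.partitionBy("status_str").parquet("/out/by_status")   // hypothetical output path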

Limit # of parallel parquet decompresses

2015-03-12 Thread ankits
My jobs frequently run out of memory if the # of cores on an executor is too high, because each core launches a new parquet decompressor thread, which allocates memory off-heap to decompress. Consequently, even with, say, 12 cores on an executor, depending on the memory, I can only use 2-3 to avoid
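
A minimal sketch of one way to cap concurrency without shrinking the executor, assuming the off-heap pressure scales with the number of simultaneously running tasks: raise spark.task.cpus so each task reserves several cores, which bounds how many parquet decompressors run at once.

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .set("spark.executor.cores", "12")
    .set("spark.task.cpus", "4")   // at most 12 / 4 = 3 concurrent tasks (and decompressors) per executor
  val sc = new SparkContext(conf)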

Why are task results large in this case?

2015-02-04 Thread ankits
I am running a job, part of which is to add some null values to the rows of a SchemaRDD. The job fails with "Total size of serialized results of 2692 tasks (1024.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)". This is the code: val in = sqc.parquetFile(...) .. val presentColProj:
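
A minimal sketch of the blunt workaround while the root cause is tracked down, assuming the per-task results are legitimate: raise the driver-side cap, which defaults to 1g (setting it to 0 disables the check entirely).

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.driver.maxResultSize", "2g")   // default is 1g; "0" removes the limit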

Serialized task result size exceeded

2015-01-30 Thread ankits
This is on spark 1.2. I am loading ~6k parquet files, roughly 500 MB each, into a schemaRDD and calling count() on it. After about 2705 tasks complete (there is one per file), the job crashes with this error: Total size of serialized results of 2705 tasks (1024.0 MB) is bigger than
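
A minimal sketch of one mitigation, assuming much of the 1024 MB is per-task result overhead multiplied across thousands of tasks rather than real data: cut the task count by coalescing before the action (raising spark.driver.maxResultSize, as above, is the other option). The path is hypothetical.

  val in = sqc.parquetFile("/data/parquet")   // one task per file otherwise
  val total = in.coalesce(500).count()        // fewer tasks, so fewer serialized task results held by the driver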

StackOverflowError with SchemaRDD

2015-01-28 Thread ankits
Hi, I am getting a stack overflow error when querying a schemardd comprised of parquet files. This is (part of) the stack trace: Caused by: java.lang.StackOverflowError at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at
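
Two commonly suggested mitigations, sketched under the assumption that the recursion comes from a very deep query plan or lineage (for example, many unioned parquet SchemaRDDs); names and paths are hypothetical.

  // Option 1: a larger driver thread stack, set before launch, e.g.
  //   spark-submit --driver-java-options "-Xss8m" ...

  // Option 2: truncate the lineage with a checkpoint.
  sc.setCheckpointDir("/tmp/spark-checkpoints")
  val combined = parts.reduce(_ unionAll _)   // parts: Seq[SchemaRDD]
  combined.checkpoint()
  combined.count()                            // materialize so the checkpoint takes effect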

Re: saveAsTextFile

2015-01-15 Thread ankits
I have seen this happen when the RDD contains null values. Essentially, saveAsTextFile calls toString() on the elements of the RDD, so a call to null.toString will result in an NPE.
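
A minimal sketch of the guard this implies: drop (or stringify) null elements before writing so toString() is never invoked on null. The output path is hypothetical.

  val safe = rdd.filter(_ != null)            // or map nulls to a placeholder string instead
  safe.saveAsTextFile("hdfs:///out/path")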

spark.cleaner questions

2015-01-13 Thread ankits
I am using spark 1.1 with the ooyala job server (which basically creates long-running spark jobs as contexts to execute jobs in). These contexts have cached RDDs in memory (via RDD.persist()). I want to enable the spark.cleaner to clean up the /spark/work directories that are created for each app,
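
For reference, a sketch of the settings that actually sweep the work/ directories on a standalone worker (spark.cleaner.ttl only expires in-context metadata and cached RDDs); these go in spark-env.sh on each worker, and the values are illustrative.

  SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
    -Dspark.worker.cleanup.interval=1800 \
    -Dspark.worker.cleanup.appDataTtl=604800"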

RDDs being cleaned too fast

2014-12-10 Thread ankits
I'm using spark 1.1.0 and am seeing persisted RDDs being cleaned up too fast. How can I inspect the size of an RDD in memory and get more information about why it was cleaned up? There should be more than enough memory available on the cluster to store them, and by default, the spark.cleaner.ttl is
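
A minimal sketch of one way to inspect what is cached and how big it is from inside the driver program (the same information the web UI's Storage tab shows):

  sc.getRDDStorageInfo.foreach { info =>
    println(s"RDD ${info.id} '${info.name}': ${info.memSize} bytes in memory, " +
      s"${info.numCachedPartitions}/${info.numPartitions} partitions cached")
  }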

Remove added jar from spark context

2014-12-01 Thread ankits
Hi, Is there a way to REMOVE a jar (added via addJar) from a spark context? We have a long-running context used by the spark jobserver, but after trying to update versions of classes already in the class path via addJars, the context still runs the old versions. It would be helpful if I could remove
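
A minimal sketch of the usual workaround, since SparkContext exposes addJar but no removeJar: stop the long-running context and build a fresh one that registers the updated jar (with the jobserver this means recreating the named context). The app name and jar path are hypothetical.

  import org.apache.spark.{SparkConf, SparkContext}

  sc.stop()
  val freshSc = new SparkContext(new SparkConf().setAppName("jobserver-context"))
  freshSc.addJar("hdfs:///jars/myclasses-2.0.1.jar")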

Re: Imbalanced shuffle read

2014-11-12 Thread ankits
Adding a call to rdd.repartition() after randomizing the keys has no effect either. Code: //partitioning is done like partitionIdx = f(key) % numPartitions //we use random keys to get even partitioning val uniform = other_stream.transform(rdd => { rdd.map({ kv => val k
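
A cleaned-up sketch of the salting idea in the snippet above, assuming a DStream of key/value pairs named other_stream and a hypothetical numPartitions: replacing the real key with a random one spreads records evenly across partitions, at the cost of losing any grouping by the original key.

  import scala.util.Random
  import org.apache.spark.HashPartitioner
  import org.apache.spark.SparkContext._   // pair-RDD functions (pre-1.3 implicits)

  val uniform = other_stream.transform { rdd =>
    rdd.map { case (_, v) => (Random.nextInt(numPartitions), v) }
       .partitionBy(new HashPartitioner(numPartitions))
  }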

Re: Imbalanced shuffle read

2014-11-12 Thread ankits
I have made some progress - the partitioning is very uneven, and everything goes to one partition. I see that spark partitions by key, so I tried this: //partitioning is done like partitionIdx = f(key) % numPartitions //we use random keys to get even partitioning val uniform =

Imbalanced shuffle read

2014-11-11 Thread ankits
I'm running a job that uses groupByKey(), so it generates a lot of shuffle data. Then it processes this and writes files to HDFS in a forEachPartition block. Looking at the forEachPartition stage details in the web console, all but one executor is idle (SUCCESS in 50-60ms), and one is RUNNING with

How to trace/debug serialization?

2014-11-05 Thread ankits
In my spark job, I have a loop something like this: bla.forEachRdd(rdd => { //init some vars rdd.forEachPartition(partition => { //init some vars partition.foreach(kv => { ... I am seeing serialization errors (unread block data), because I think spark is trying to serialize the
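
A minimal sketch of one way to get more detail on what is being pulled into the closure, assuming the unread-block-data error comes from an unexpectedly captured object: enable the JVM's extended serialization debug info on both driver and executors so the stack trace names the offending field chain.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.driver.extraJavaOptions", "-Dsun.io.serialization.extendedDebugInfo=true")
    .set("spark.executor.extraJavaOptions", "-Dsun.io.serialization.extendedDebugInfo=true")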

Creating a SchemaRDD from RDD of thrift classes

2014-10-30 Thread ankits
I have one job with spark that creates some RDDs of type X and persists them in memory. The type X is an auto-generated Thrift Java class (not a case class though). Now in another job, I want to convert the RDD to a SchemaRDD using sqlContext.applySchema(). Can I derive a schema from the thrift
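
A minimal sketch using the Spark 1.1-era applySchema API, assuming an RDD of X named xRdd and hypothetical getId/getName accessors on the thrift-generated class: since X is not a case class, the schema is declared by hand and each X is mapped to a Row.

  import org.apache.spark.sql._   // StructType, StructField, Row, LongType, StringType (1.1 package aliases)

  val schema = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("name", StringType, nullable = true)))

  val rowRdd = xRdd.map(x => Row(x.getId, x.getName))
  val schemaRdd = sqlContext.applySchema(rowRdd, schema)
  schemaRdd.registerTempTable("x_records")   // hypothetical table name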

Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread ankits
I want to set up spark SQL to allow ad hoc querying over the last X days of processed data, where the data is processed through spark. This would also have to cache data (in memory only), so the approach I was thinking of was to build a layer that persists the appropriate RDDs and stores them in
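
A minimal sketch of the caching side, assuming the Thrift JDBC server is started against the same SQLContext/HiveContext that does the processing: register each processed SchemaRDD as a table and cache it, so ad hoc JDBC queries hit the in-memory columnar copy. Table and path names are hypothetical.

  val processed = sqlContext.parquetFile("/data/processed/latest")
  processed.registerTempTable("recent_events")
  sqlContext.cacheTable("recent_events")
  // JDBC clients connected through the shared Thrift server can then run
  // ad hoc SQL such as: SELECT ... FROM recent_events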

Re: Exceptions not caught?

2014-10-23 Thread ankits
I am simply catching all exceptions (like case e: Throwable => println("caught: " + e)). Here is the stack trace: 2014-10-23 15:51:10,766 ERROR [] Exception in task 1.0 in stage 1.0 (TID 1) java.io.IOException: org.apache.thrift.protocol.TProtocolException: Required field 'X' is unset! Struct:Y(id:,

Re: Exceptions not caught?

2014-10-23 Thread ankits
Also everything is running locally on my box, driver and workers.

Re: Exceptions not caught?

2014-10-23 Thread ankits
"Can you check your class Y and fix the above?" I can, but this is about catching the exception should it be thrown by any class in the spark job. Why is the exception not being caught?
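
For context, a minimal sketch of the usual explanation, assuming the TProtocolException is thrown while a task deserializes records: it is raised on an executor, so a driver-side try/catch only sees it (wrapped in a SparkException) when an action runs, and only if the catch surrounds that action. decodeThrift below is a hypothetical stand-in for the failing deserialization.

  try {
    val decoded = rdd.map(decodeThrift)   // runs on executors; a throw here never reaches this frame directly
    decoded.count()                       // the failure surfaces here, on the driver
  } catch {
    case e: org.apache.spark.SparkException => println("caught: " + e.getMessage)
  }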