Hi,
I'm loading Parquet files via Spark, and the first time a file is loaded there is
a 5-10s delay related to the Hive metastore, with metastore messages in the
console. How can I avoid this delay and keep the metadata around? I want the
data to be persisted even after
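What I am experimenting with in the meantime (just a sketch, assuming a Spark 1.x SQLContext; the path and table name below are hypothetical) is loading the file once when the context starts and keeping the table registered and cached, so later queries skip the initial metadata work:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// load once up front; the metastore / footer reads happen here
val data = sqlContext.parquetFile("/data/events.parquet")
// keep the metadata (and optionally the data) around for the life of the context
data.registerTempTable("events")
sqlContext.cacheTable("events")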
Hi, I am using a custom build of Spark 1.4 with the Parquet dependency upgraded
to 1.7. I have Thrift data encoded with Parquet that I want to partition by a
column of type ENUM. The Spark programming guide says partition discovery is
only supported for string and numeric columns, so it seems
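The workaround I am sketching (assumptions: Spark 1.4 DataFrame API; events, getId and getStatus are hypothetical names for my Thrift RDD and its accessors; the output path is made up) is to materialize the ENUM as its string name and partition on that, since string columns are supported by partition discovery:

import sqlContext.implicits._

// events: RDD[Event]; write the enum out as its string name
val df = events.map(e => (e.getId, e.getStatus.name()))
               .toDF("id", "status")
df.write.partitionBy("status").parquet("/out/events")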
My jobs frequently run out of memory if the number of cores on an executor is
too high, because each core launches a new Parquet decompressor thread, which
allocates memory off-heap to decompress. Consequently, even with, say, 12 cores
on an executor, depending on the memory I can only use 2-3 to avoid
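Right now I work around this by capping cores per executor and leaving extra off-heap headroom. A sketch of the submit command (the class name, jar, and values are placeholders, and spark.yarn.executor.memoryOverhead assumes YARN, where it covers allocations outside the JVM heap):

spark-submit \
  --class com.example.MyJob \
  --executor-cores 3 \
  --executor-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  my-job.jar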
I am running a job, part of which adds some null values to the rows of a
SchemaRDD. The job fails with: Total size of serialized results of 2692 tasks
(1024.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
This is the code:
val in = sqc.parquetFile(...)
..
val presentColProj:
This is on spark 1.2
I am loading ~6k Parquet files, roughly 500 MB each, into a SchemaRDD and
calling count() on it.
After about 2705 tasks (there is one per file), the job crashes with this error:
Total size of serialized results of 2705 tasks (1024.0 MB) is bigger than
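What I am going to try as a workaround (a sketch; 4g is an arbitrary value, and the property can also be set to 0 to remove the limit entirely) is raising the limit when building the context:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parquet-count")                // name is a placeholder
  .set("spark.driver.maxResultSize", "4g")    // default is 1g
val sc = new SparkContext(conf)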
Hi,
I am getting a stack overflow error when querying a schemardd comprised of
parquet files. This is (part of) the stack trace:
Caused by: java.lang.StackOverflowError
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
        at
I have seen this happen when the RDD contains null values. Essentially,
saveAsTextFile calls toString() on the elements of the RDD, so a call to
null.toString will result in an NPE.
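A defensive variant I have used (just a sketch; the empty-string placeholder and the output path are arbitrary) maps nulls to a placeholder before writing:

rdd.map(x => if (x == null) "" else x.toString)
   .saveAsTextFile("/out/text")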
I am using Spark 1.1 with the Ooyala job server (which basically keeps
long-running Spark contexts around to execute jobs in). These contexts have
cached RDDs in memory (via RDD.persist()).
I want to enable the Spark cleaner to clean up the /spark/work directories
that are created for each app,
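On a standalone cluster, what I am testing (the values are only examples) is turning on the worker's own cleanup in conf/spark-env.sh on each worker node:

# periodically delete finished application dirs under SPARK_HOME/work
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
  -Dspark.worker.cleanup.interval=1800 \
  -Dspark.worker.cleanup.appDataTtl=604800"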
I'm using Spark 1.1.0 and am seeing persisted RDDs being cleaned up too fast.
How can I inspect the size of an RDD in memory and get more information about
why it was cleaned up? There should be more than enough memory available on
the cluster to store them, and by default, the spark.cleaner.ttl is
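To see what is actually cached I have been dumping the storage info from the driver (a sketch; getRDDStorageInfo is a developer API, so the exact fields may vary slightly between versions):

sc.getRDDStorageInfo.foreach { info =>
  println(s"rdd=${info.name} cachedPartitions=${info.numCachedPartitions} " +
    s"memSize=${info.memSize} diskSize=${info.diskSize}")
}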
Hi,
Is there a way to REMOVE a jar (added via addJar) from a Spark context? We have
a long-running context used by the Spark jobserver, but after trying to
update versions of classes already on the classpath via addJars, the
context still runs the old versions. It would be helpful if I could remove
Adding a call to rdd.repartition() after randomizing the keys has no effect
either. Code:
// partitioning is done like partitionIdx = f(key) % numPartitions
// we use random keys to get even partitioning
val uniform = other_stream.transform(rdd => {
  rdd.map({ kv =>
    val k
I have made some progress - the partitioning is very uneven, and everything
goes to one partition. I see that Spark partitions by key, so I tried this:
// partitioning is done like partitionIdx = f(key) % numPartitions
// we use random keys to get even partitioning
val uniform =
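For reference, the full shape of what I am attempting (a sketch; numPartitions and the use of scala.util.Random are my choices, not anything prescribed by Spark) replaces the original key with a uniformly random one, repartitions, and then drops the synthetic key:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair RDD functions on pre-1.3 Spark
import scala.util.Random

val numPartitions = 64   // illustrative
val uniform = other_stream.transform { rdd =>
  rdd.map(kv => (Random.nextInt(numPartitions), kv))
     .partitionBy(new HashPartitioner(numPartitions))
     .map(_._2)
}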
I'm running a job that uses groupByKey(), so it generates a lot of shuffle
data. It then processes this and writes files to HDFS in a foreachPartition
block. Looking at the foreachPartition stage details in the web console, all
but one executor is idle (SUCCESS in 50-60 ms), and one is RUNNING with
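Before restructuring anything I am double-checking the key distribution from the driver (a sketch; grouped stands in for the pair RDD that feeds groupByKey, and countByKey assumes the number of distinct keys is small enough to collect):

val keyCounts = grouped.countByKey()
keyCounts.toSeq.sortBy(-_._2).take(10).foreach(println)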
In my Spark job, I have a loop something like this:
bla.foreachRDD(rdd => {
  // init some vars
  rdd.foreachPartition(partition => {
    // init some vars
    partition.foreach(kv => {
      ...
I am seeing serialization errors (unread block data), because I think Spark
is trying to serialize the
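The restructuring I am testing (a sketch only; createWriter and the writer it returns stand in for whatever non-serializable resource the vars hold) keeps anything non-serializable strictly inside foreachPartition, so it is constructed on the executor instead of being shipped from the driver:

bla.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val writer = createWriter()   // hypothetical; built per partition on the executor
    partition.foreach(kv => writer.write(kv))
    writer.close()
  }
}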
I have one Spark job that creates some RDDs of type X and persists them in
memory. The type X is an auto-generated Thrift Java class (not a case class,
though). Now, in another job, I want to convert the RDD to a SchemaRDD using
sqlContext.applySchema(). Can I derive a schema from the Thrift
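Failing automatic derivation, the fallback I am sketching (field names and types here are invented; applySchema is the 1.1/1.2 API that later became createDataFrame) builds the StructType by hand and maps each Thrift object to a Row:

import org.apache.spark.sql._

val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)))

// thriftRdd: RDD[X]; getId/getName are hypothetical Thrift accessors
val rowRdd = thriftRdd.map(x => Row(x.getId, x.getName))
val schemaRdd = sqlContext.applySchema(rowRdd, schema)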
I want to set up Spark SQL to allow ad hoc querying over the last X days of
processed data, where the data is processed through Spark. This would also
have to cache data (in memory only), so the approach I was thinking of was
to build a layer that persists the appropriate RDDs and stores them in
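The rough shape I have in mind (a sketch; the path and table name are made up, and it assumes one long-lived SQLContext like the job server provides) registers each day's data and caches it in Spark SQL's in-memory columnar store:

val day = sqc.parquetFile("/data/2014-10-23")
day.registerTempTable("events_2014_10_23")
sqc.cacheTable("events_2014_10_23")

// ad hoc queries later, against the same context
sqc.sql("SELECT count(*) FROM events_2014_10_23").collect()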
I am simply catching all exceptions (like case e: Throwable =>
println("caught: " + e))
Here is the stack trace:
2014-10-23 15:51:10,766 ERROR [] Exception in task 1.0 in stage 1.0 (TID 1)
java.io.IOException: org.apache.thrift.protocol.TProtocolException: Required
field 'X' is unset! Struct:Y(id:,
Also, everything is running locally on my box (driver and workers).
Can you check your class Y and fix the above?
I can, but this is about catching the exception should it be thrown by any
class in the Spark job. Why is the exception not being caught?