This will also depend on the file format you are using.
A word of advice: you would be much better off with the s3a file system.
As I found out recently the hard way, s3n has some issues with reading
through entire files even when looking for headers.
On Mon, Aug 17, 2015 at 2:10 AM, Akhil Das wrote:
It depends on which operation you are doing. If you do a .count() on a
Parquet file, it might not download the entire file, I think, but a
.count() on a normal text file might pull the entire file.
Thanks
Best Regards
On Sat, Aug 8, 2015 at 3:12 AM, Akshat Aranya aara
Hi,
I've been trying to track down some problems with Spark reads being very
slow with s3n:// URIs (NativeS3FileSystem). After some digging around, I
realized that this file system implementation fetches the entire file,
which isn't really a Spark problem, but it really slows down things when
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
... 21 more
On Mon, May 4, 2015 at 5:43 PM, Akshat Aranya aara
I worked around it by copying the jars to the cluster manually, and then using
spark.executor.extraClassPath.
On Wed, Apr 29, 2015 at 6:42 PM, Akshat Aranya aara...@gmail.com wrote:
Hi,
Is it possible to register kryo serialization for classes contained in
jars that are added with spark.jars? In my experiment it doesn't seem to
work, likely
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
I verified that the same configuration works without using Kryo serialization.
On Fri, May 1, 2015 at 9:44 AM, Akshat Aranya aara...@gmail.com wrote:
I cherry-picked the fix for SPARK-5470 and the problem has gone away.
On Fri, May
Hi,
I'm getting a ClassNotFoundException at the executor when trying to
register a class for Kryo serialization:
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
about Schema$MyRow?
On Fri, May 1, 2015 at 8:05 AM, Akshat Aranya aara...@gmail.com wrote:
Hi,
I'm getting a ClassNotFoundException at the executor when trying to
register a class for Kryo serialization:
java.lang.reflect.InvocationTargetException
I cherry-picked the fix for SPARK-5470 and the problem has gone away.
On Fri, May 1, 2015 at 9:15 AM, Akshat Aranya aara...@gmail.com wrote:
Yes, this class is present in the jar that was loaded in the classpath
of the executor Java process -- it wasn't even lazily added as a part
of the task
Looking at your classpath, it looks like you've compiled Spark yourself.
Depending on which version of Hadoop you've compiled against (it looks like
Hadoop 2.2 in your case), Spark will bundle its own version of protobuf.
You should make sure both your HBase and Spark are compiled against the
same version of protobuf.
Hi,
Is it possible to register kryo serialization for classes contained in jars
that are added with spark.jars? In my experiment it doesn't seem to
work, likely because the class registration happens before the jar is
shipped to the executor and added to the classloader. Here's the general
idea
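For what it's worth, the workaround mentioned upthread (pre-copying the jar to each node and using spark.executor.extraClassPath) can be sketched as a SparkConf configuration. The paths and the MyRecord class below are hypothetical, and this is a configuration sketch rather than a verified setup:

```scala
// Sketch only: ship the jar out-of-band so the classes exist on the
// executor classpath before Kryo registration runs.
val conf = new org.apache.spark.SparkConf()
  .setJars(Seq("/local/path/myclasses.jar"))                        // for driver/tasks
  .set("spark.executor.extraClassPath", "/opt/jars/myclasses.jar")  // pre-copied to each node
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[MyRecord]))                    // MyRecord is hypothetical
```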
Hi,
I'm trying to figure out when TaskCompletionListeners are called: are
they called at the end of the RDD's compute() method, or after the
iteration through the iterator returned by compute() is completed?
To put it another way, is this OK:
class DatabaseRDD[T] extends RDD[T] {
def
Michael Armbrust mich...@databricks.com wrote:
It is generated and cached on each of the executors.
On Mon, Apr 6, 2015 at 2:32 PM, Akshat Aranya aara...@gmail.com wrote:
Hi,
I'm curious as to how Spark does code generation for SQL queries.
Following through the code, I saw
Hi,
I'm curious as to how Spark does code generation for SQL queries.
Following through the code, I saw that an expression is parsed and compiled
into a class using Scala reflection toolbox. However, it's unclear to me
whether the actual byte code is generated on the master or on each of the
My guess would be that you are packaging too many things in your job, which
is causing problems with the classpath. When your jar goes in first, you
get the correct version of protobuf, but some other version of something
else. When your jar goes in later, other things work, but protobuf
breaks.
I am seeing a problem with a Spark job in standalone mode. Spark master's
web interface shows a task RUNNING on a particular executor, but the logs
of the executor never show the task being assigned to it; that is, a line
like this is missing from the log:
15/02/25 16:53:36 INFO
I have Spark running in standalone mode with 4 executors, each with 5
cores (spark.executor.cores=5). However, when I'm processing
an RDD with ~90,000 partitions, I only get 4 parallel tasks. Shouldn't I
be getting 4x5=20 parallel task executions?
Hi,
I am building a Spark-based service which requires initialization of a
SparkContext in a main():
def main(args: Array[String]) {
  val conf = new SparkConf(false)
    .setMaster("spark://foo.example.com:7077")
    .setAppName("foobar")
  val sc = new SparkContext(conf)
  val rdd =
Is it possible to have some state across multiple calls to mapPartitions on
each partition, for instance, if I want to keep a database connection open?
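The usual pattern is to open the resource at the start of the mapPartitions function and release it once the partition's iterator is exhausted. A minimal sketch, using a hypothetical stand-in Conn class instead of a real database client; the Spark call itself is left as a comment since it needs a cluster:

```scala
// Stand-in for a real DB connection (hypothetical).
class Conn { var open = true; def lookup(k: Int): String = s"row-$k"; def close(): Unit = { open = false } }

// One connection per partition: opened before the first element,
// closed once the last one has been consumed.
def withConn(part: Iterator[Int]): Iterator[String] = {
  val conn = new Conn
  new Iterator[String] {
    def hasNext: Boolean = {
      val more = part.hasNext
      if (!more && conn.open) conn.close()
      more
    }
    def next(): String = conn.lookup(part.next())
  }
}

// With Spark this would be: rdd.mapPartitions(withConn)
```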
Details?
What do you need to do with the DB connection?
Paolo
Sent from my Windows Phone
--
From: Akshat Aranya aara...@gmail.com
Sent: 04/12/2014 18:57
To: user@spark.apache.org
Subject: Stateful mapPartitions
Is it possible to have some state across
Hi,
I have a question regarding failure of executors: how does Spark reassign
partitions or tasks when executors fail? Is it necessary that new
executors have the same executor IDs as the ones that were lost, or are
these IDs irrelevant for failover?
Parquet is a column-oriented format, which means that you need to read
less data from the file system if you're only interested in a subset of
your columns. Also, Parquet pushes down selection predicates, which can
eliminate needless deserialization of rows that don't match the predicate.
Hi,
Sorry if this has been asked before; I didn't find a satisfactory answer
when searching. How can I integrate a Play application with Spark? I'm
running into akka-actor version conflicts: Play 2.2.x uses akka-actor
2.0, whereas Play 2.3.x uses akka-actor 2.3.4, and neither works fine with Spark.
Hi,
Is there a way to serialize Row objects to JSON? In the absence of a
built-in way, is this the right approach:
* get hold of the schema using SchemaRDD.schema
* iterate through each individual Row as a Seq and use the schema to
convert the values in the row to JSON types.
Thanks,
Akshat
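Those two bullets can be sketched in plain Scala, modeling a Row as a Seq[Any] and the schema as a list of field names. The quoting is naive (no escaping, no nested types), so treat it as an illustration only:

```scala
// Minimal sketch: pair schema field names with row values and emit JSON.
// Real SchemaRDD rows and StructType schemas carry richer type info.
def rowToJson(fieldNames: Seq[String], row: Seq[Any]): String =
  fieldNames.zip(row).map { case (name, value) =>
    val v = value match {
      case s: String => "\"" + s + "\""   // naive quoting; no escaping
      case other     => other.toString
    }
    s""""$name":$v"""
  }.mkString("{", ",", "}")
```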
Spark, in general, is good for iterating through an entire dataset again
and again. All operations are expressed in terms of iteration through all
the records of at least one partition. You may want to look at IndexedRDD
(https://issues.apache.org/jira/browse/SPARK-2365), which aims to improve
point lookups on RDDs.
This is as much a Scala question as a Spark question.
I have an RDD:
val rdd1: RDD[(Long, Array[Long])]
This RDD has duplicate keys that I can collapse like so:
val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => a ++ b)
If I start with an Array of primitive longs in rdd1, will rdd2
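Leaving Spark aside, the concatenation itself can be checked directly: `++` on two Array[Long]s produces another Array[Long], backed by a primitive long[] on the JVM. Whether the surrounding Tuple2 keys or the serializer box anything is a separate question:

```scala
val a: Array[Long] = Array(1L, 2L)
val b: Array[Long] = Array(3L, 4L)
val merged: Array[Long] = a ++ b   // still Array[Long], backed by a primitive long[]
// The reduceByKey function in question: (a, b) => a ++ b
```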
Hi,
How can I convert an RDD loaded from a Parquet file back into its original type:
case class Person(name: String, age: Int)
val rdd: RDD[Person] = ...
rdd.saveAsParquetFile("people")
val rdd2 = sqlContext.parquetFile("people")
How can I map rdd2 back into an RDD[Person]? All of the examples just
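One common answer is to map each generic row back through the case class constructor by position. A sketch below models the row as a Seq[Any]; with the SQL API it would look roughly like rdd2.map(r => Person(r.getString(0), r.getInt(1))):

```scala
case class Person(name: String, age: Int)

// Modeling a generic Row as Seq[Any] for illustration.
def toPerson(row: Seq[Any]): Person =
  Person(row(0).asInstanceOf[String], row(1).asInstanceOf[Int])

// With Spark SQL: val people = rdd2.map(r => Person(r.getString(0), r.getInt(1)))
```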
There seems to be some problem with what gets captured in the closure
that's passed into the mapPartitions (myfunc in your case).
I've had a similar problem before:
http://apache-spark-user-list.1001560.n3.nabble.com/TaskNotSerializableException-when-running-through-Spark-shell-td16574.html
Try
I just want to pitch in and say that I ran into the same problem with
running with 64GB executors. For example, some of the tasks take 5 minutes
to execute, out of which 4 minutes are spent in GC. I'll try out smaller
executors.
On Mon, Oct 6, 2014 at 6:35 PM, Otis Gospodnetic
Hi,
Can anyone explain how things get captured in a closure when running through
the REPL. For example:
def foo(..) = { .. }
rdd.map(foo)
sometimes complains about classes not being serializable that are
completely unrelated to foo. This happens even when I write it such:
object Foo {
def
Hi,
Is there a good way to materialize derivative RDDs from, say, a HadoopRDD
while reading in the data only once. One way to do so would be to cache
the HadoopRDD and then create derivative RDDs, but that would require
enough RAM to cache the HadoopRDD which is not an option in my case.
Thanks,
Using a var for RDDs in this way is not going to work. In this example,
tx1.zip(tx2) would create an RDD that depends on tx2, but then soon after
that, you change what tx2 refers to, so you would end up with a circular
dependency.
On Wed, Oct 8, 2014 at 12:01 PM, Sung Hwan Chung
Hi,
What's the relationship between Spark worker and executor memory settings
in standalone mode? Do they work independently or does the worker cap
executor memory?
Also, is the number of concurrent executors per worker capped by the number
of CPU cores configured for the worker?
On Wed, Oct 1, 2014 at 11:33 AM, Akshat Aranya aara...@gmail.com wrote:
On Wed, Oct 1, 2014 at 11:00 AM, Boromir Widas vcsub...@gmail.com wrote:
1. Worker memory caps executor memory.
2. With the default config, every job gets one executor per worker. This
executor runs with all the cores available
Hi,
I want to implement an RDD wherein the number of partitions is
based on the number of executors that have been set up. Is there some way I
can determine the number of executors within the getPartitions() call?
a hash map within each partition yourself.
On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote:
I have a use case where my RDD is set up like this:
Partition 0:
K1 -> [V1, V2]
K2 -> [V2]
Partition 1:
K3 -> [V1]
K4 -> [V3]
I want to invert this RDD, but only within
Hi,
I'm trying to implement a custom RDD that essentially works as a
distributed hash table, i.e. the key space is split up into partitions and
within a partition, an element can be looked up efficiently by the key.
However, the RDD lookup() function (in PairRDDFunctions) is implemented in
a way
I have a use case where my RDD is set up like this:
Partition 0:
K1 -> [V1, V2]
K2 -> [V2]
Partition 1:
K3 -> [V1]
K4 -> [V3]
I want to invert this RDD, but only within a partition, so that the
operation does not require a shuffle. It doesn't matter if the partitions
of the inverted RDD have non-unique keys across partitions.
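The inversion itself is just an Iterator-to-Iterator function, which mapPartitions applies per partition without triggering a shuffle. A sketch in plain Scala (the Spark call is left as a comment):

```scala
// Invert K -> [V...] into V -> [K...] within one partition only.
def invertPartition[K, V](part: Iterator[(K, Seq[V])]): Iterator[(V, Seq[K])] = {
  val inverted = scala.collection.mutable.Map.empty[V, List[K]]
  for ((k, vs) <- part; v <- vs)
    inverted(v) = k :: inverted.getOrElse(v, Nil)
  inverted.iterator.map { case (v, ks) => (v, ks.reverse) }
}

// With Spark (no shuffle): rdd.mapPartitions(invertPartition[String, String] _)
```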