Re: S3n, parallelism, partitions

2015-08-17 Thread Akshat Aranya
This will also depend on the file format you are using. A word of advice: you would be much better off with the s3a file system. As I found out recently the hard way, s3n has issues where it reads through entire files even when it is only looking for headers. On Mon, Aug 17, 2015 at 2:10 AM, Akhil Das
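
A minimal sketch of what switching to s3a can look like, assuming the hadoop-aws module and a compatible AWS SDK jar are on the classpath; the bucket, path, and credential handling are placeholders, not part of the original thread:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("s3a-read"))

    // fs.s3a.* keys configure the s3a connector; credentials are read here from
    // environment variables, but IAM instance profiles also work.
    sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Same data as before, but read through the s3a:// scheme instead of s3n://.
    val lines = sc.textFile("s3a://my-bucket/path/to/data.txt")
    println(lines.count())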

Re: Accessing S3 files with s3n://

2015-08-10 Thread Akshat Aranya
Depends on which operation you are doing. If you are doing a .count() on a Parquet file, it might not download the entire file, I think, but if you do a .count() on a normal text file it might pull the entire file. Thanks Best Regards On Sat, Aug 8, 2015 at 3:12 AM, Akshat Aranya aara

Accessing S3 files with s3n://

2015-08-07 Thread Akshat Aranya
Hi, I've been trying to track down some problems with Spark reads being very slow with s3n:// URIs (NativeS3FileSystem). After some digging around, I realized that this file system implementation fetches the entire file, which isn't really a Spark problem, but it really slows things down when

Re: Kryo serialization of classes in additional jars

2015-05-13 Thread Akshat Aranya
) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93) ... 21 more On Mon, May 4, 2015 at 5:43 PM, Akshat Aranya aara

Re: Kryo serialization of classes in additional jars

2015-05-04 Thread Akshat Aranya
to the cluster manually, and then using spark.executor.extraClassPath On Wed, Apr 29, 2015 at 6:42 PM, Akshat Aranya aara...@gmail.com wrote: Hi, Is it possible to register kryo serialization for classes contained in jars that are added with spark.jars? In my experiment it doesn't seem to work, likely

Re: ClassNotFoundException for Kryo serialization

2015-05-02 Thread Akshat Aranya
(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) I verified that the same configuration works without using Kryo serialization. On Fri, May 1, 2015 at 9:44 AM, Akshat Aranya aara...@gmail.com wrote: I cherry-picked the fix for SPARK-5470 and the problem has gone away. On Fri, May

ClassNotFoundException for Kryo serialization

2015-05-01 Thread Akshat Aranya
Hi, I'm getting a ClassNotFoundException at the executor when trying to register a class for Kryo serialization: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at

Re: ClassNotFoundException for Kryo serialization

2015-05-01 Thread Akshat Aranya
about Schema$MyRow ? On Fri, May 1, 2015 at 8:05 AM, Akshat Aranya aara...@gmail.com wrote: Hi, I'm getting a ClassNotFoundException at the executor when trying to register a class for Kryo serialization: java.lang.reflect.InvocationTargetException

Re: ClassNotFoundException for Kryo serialization

2015-05-01 Thread Akshat Aranya
I cherry-picked the fix for SPARK-5470 and the problem has gone away. On Fri, May 1, 2015 at 9:15 AM, Akshat Aranya aara...@gmail.com wrote: Yes, this class is present in the jar that was loaded in the classpath of the executor Java process -- it wasn't even lazily added as a part of the task

Re: spark with standalone HBase

2015-04-30 Thread Akshat Aranya
Looking at your classpath, it looks like you've compiled Spark yourself. Depending on which version of Hadoop you've compiled against (it looks like Hadoop 2.2 in your case), Spark will have its own version of protobuf. You should try making sure both HBase and Spark are compiled

Kryo serialization of classes in additional jars

2015-04-29 Thread Akshat Aranya
Hi, Is it possible to register Kryo serialization for classes contained in jars that are added with spark.jars? In my experiment it doesn't seem to work, likely because the class registration happens before the jar is shipped to the executor and added to the classloader. Here's the general idea
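
A sketch of the registration being discussed, with MyCustomClass standing in for a class packaged in an extra jar; it shows the workaround from the reply above (shipping the jar out-of-band and using spark.executor.extraClassPath) rather than spark.jars, and the jar path is a placeholder:

    import org.apache.spark.SparkConf

    // Stand-in for a class that really lives in a separate jar.
    class MyCustomClass(val id: Long, val payload: String)

    val conf = new SparkConf()
      .setAppName("kryo-registration")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Workaround from the reply above: copy the jar to the cluster and put it
      // on the executors' classpath directly, so the class is visible when Kryo
      // is initialized, rather than relying on spark.jars.
      .set("spark.executor.extraClassPath", "/opt/jars/my-classes.jar")
      .registerKryoClasses(Array(classOf[MyCustomClass]))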

When are TaskCompletionListeners called?

2015-04-17 Thread Akshat Aranya
Hi, I'm trying to figure out when TaskCompletionListeners are called: are they called at the end of the RDD's compute() method, or after iteration through the iterator returned by compute() is completed? To put it another way, is this OK: class DatabaseRDD[T] extends RDD[T] { def
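
For what it's worth, completion listeners run when the task finishes (after the returned iterator has been consumed), not when compute() returns, so the pattern in the question is workable. A hypothetical sketch with the JDBC details made up:

    import java.sql.DriverManager
    import scala.reflect.ClassTag
    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class DbPartition(val index: Int) extends Partition

    // The connection opened in compute() is closed by a task completion
    // listener, which fires after the iterator has been fully consumed.
    class DatabaseRDD[T: ClassTag](sc: SparkContext, url: String, numParts: Int,
                                   rowMapper: java.sql.ResultSet => T)
      extends RDD[T](sc, Nil) {

      override def getPartitions: Array[Partition] =
        (0 until numParts).map(i => new DbPartition(i): Partition).toArray

      override def compute(split: Partition, context: TaskContext): Iterator[T] = {
        val conn = DriverManager.getConnection(url)
        context.addTaskCompletionListener((_: TaskContext) => conn.close())
        // Made-up query: one slice of the table per partition.
        val rs = conn.createStatement()
          .executeQuery(s"SELECT * FROM rows WHERE part = ${split.index}")
        Iterator.continually(rs).takeWhile(_.next()).map(rowMapper)
      }
    }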

Re: Spark SQL code generation

2015-04-06 Thread Akshat Aranya
, Michael Armbrust mich...@databricks.com wrote: It is generated and cached on each of the executors. On Mon, Apr 6, 2015 at 2:32 PM, Akshat Aranya aara...@gmail.com wrote: Hi, I'm curious as to how Spark does code generation for SQL queries. Following through the code, I saw

Spark SQL code generation

2015-04-06 Thread Akshat Aranya
Hi, I'm curious as to how Spark does code generation for SQL queries. Following through the code, I saw that an expression is parsed and compiled into a class using Scala reflection toolbox. However, it's unclear to me whether the actual byte code is generated on the master or on each of the

Re: Getting to proto buff classes in Spark Context

2015-02-26 Thread Akshat Aranya
My guess would be that you are packaging too many things in your job, which is causing problems with the classpath. When your jar goes in first, you get the correct version of protobuf, but some other version of something else. When your jar goes in later, other things work, but protobuf breaks.

Missing tasks

2015-02-26 Thread Akshat Aranya
I am seeing a problem with a Spark job in standalone mode. The Spark master's web interface shows a task RUNNING on a particular executor, but the executor's logs never show the task being assigned to it; that is, a line like this is missing from the log: 15/02/25 16:53:36 INFO

Number of parallel tasks

2015-02-25 Thread Akshat Aranya
I have Spark running in standalone mode with 4 executors, each with 5 cores (spark.executor.cores=5). However, when I'm processing an RDD with ~90,000 partitions, I only get 4 parallel tasks. Shouldn't I be getting 4x5=20 parallel task executions?

Standalone Spark program

2014-12-18 Thread Akshat Aranya
Hi, I am building a Spark-based service which requires initialization of a SparkContext in a main(): def main(args: Array[String]) { val conf = new SparkConf(false) .setMaster("spark://foo.example.com:7077") .setAppName("foobar") val sc = new SparkContext(conf) val rdd =
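
A completed version of that skeleton, as a sketch: the master URL is the one from the post, while the jar path and the RDD contents are placeholders standing in for the truncated part:

    import org.apache.spark.{SparkConf, SparkContext}

    object FooBar {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf(false)
          .setMaster("spark://foo.example.com:7077")
          .setAppName("foobar")
          // Hypothetical: the application jar usually has to be shipped to the
          // executors as well, e.g. via setJars or spark-submit.
          .setJars(Seq("/path/to/foobar-assembly.jar"))
        val sc = new SparkContext(conf)
        val rdd = sc.parallelize(1 to 1000)   // placeholder data
        println(rdd.count())
        sc.stop()
      }
    }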

Stateful mapPartitions

2014-12-04 Thread Akshat Aranya
Is it possible to have some state across multiple calls to mapPartitions on each partition, for instance, if I want to keep a database connection open?
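
One way to get state that survives across tasks, and across repeated mapPartitions passes, is a per-JVM singleton referenced from the closure; a sketch, with the JDBC URL, table, and query made up:

    import java.sql.{Connection, DriverManager}
    import org.apache.spark.rdd.RDD

    // Initialized lazily once per executor JVM and reused by every task that
    // references it; it is never closed explicitly, which is usually acceptable
    // for a long-lived per-JVM connection.
    object ConnectionHolder {
      lazy val conn: Connection =
        DriverManager.getConnection("jdbc:postgresql://db.example.com/mydb")
    }

    def lookupNames(ids: RDD[Long]): RDD[(Long, String)] =
      ids.mapPartitions { iter =>
        val stmt = ConnectionHolder.conn
          .prepareStatement("SELECT name FROM users WHERE id = ?")
        iter.map { id =>
          stmt.setLong(1, id)
          val rs = stmt.executeQuery()
          (id, if (rs.next()) rs.getString(1) else "")
        }
      }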

Re: Stateful mapPartitions

2014-12-04 Thread Akshat Aranya
details? What do you need to do with the DB connection? Paolo Sent from my Windows Phone -- From: Akshat Aranya aara...@gmail.com Sent: 04/12/2014 18:57 To: user@spark.apache.org Subject: Stateful mapPartitions Is it possible to have some state across

Executor failover

2014-11-26 Thread Akshat Aranya
Hi, I have a question regarding failure of executors: how does Spark reassign partitions or tasks when executors fail? Is it necessary that new executors have the same executor IDs as the ones that were lost, or are these IDs irrelevant for failover?

Re: advantages of SparkSQL?

2014-11-24 Thread Akshat Aranya
Parquet is a column-oriented format, which means that you need to read in less data from the file system if you're only interested in a subset of your columns. Also, Parquet pushes down selection predicates, which can eliminate needless deserialization of rows that don't match a selection
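
A small illustration using the 1.x API, assuming an existing SparkContext `sc` and a hypothetical "people" Parquet dataset with more columns than the query needs:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val people = sqlContext.parquetFile("hdfs:///data/people.parquet")
    people.registerTempTable("people")

    // Only the name and age columns need to be read from the file (column
    // pruning), and the age > 30 predicate can be pushed down so rows that
    // don't match never have to be deserialized.
    val names = sqlContext.sql("SELECT name FROM people WHERE age > 30")
    names.collect().foreach(println)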

Spark and Play

2014-11-11 Thread Akshat Aranya
Hi, Sorry if this has been asked before; I didn't find a satisfactory answer when searching. How can I integrate a Play application with Spark? I'm running into akka-actor version conflicts. Play 2.2.x uses akka-actor 2.0, whereas Play 2.3.x uses akka-actor 2.3.4, neither of which works well

Mapping SchemaRDD/Row to JSON

2014-11-10 Thread Akshat Aranya
Hi, Is there a way to serialize Row objects to JSON? If not, is this the right approach: * get hold of the schema using SchemaRDD.schema * iterate through each Row as a Seq and use the schema to convert the values in the row to JSON types. Thanks, Akshat
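
A simplistic sketch of that approach against the 1.x SchemaRDD API: zip the field names from the schema with each row's values and emit flat JSON objects (nested types and full string escaping are glossed over):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SchemaRDD

    def toJson(schemaRdd: SchemaRDD): RDD[String] = {
      val fieldNames = schemaRdd.schema.fields.map(_.name)
      schemaRdd.map { row =>
        fieldNames.zip(row.toSeq).map { case (name, value) =>
          val v = value match {
            case null      => "null"
            case s: String => "\"" + s.replace("\"", "\\\"") + "\""
            case other     => other.toString
          }
          "\"" + name + "\": " + v
        }.mkString("{", ", ", "}")
      }
    }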

Re: Spark as key/value store?

2014-10-22 Thread Akshat Aranya
Spark, in general, is good for iterating through an entire dataset again and again. All operations are expressed in terms of iteration through all the records of at least one partition. You may want to look at IndexedRDD ( https://issues.apache.org/jira/browse/SPARK-2365) that aims to improve

Primitive arrays in Spark

2014-10-21 Thread Akshat Aranya
This is as much of a Scala question as a Spark question. I have an RDD: val rdd1: RDD[(Long, Array[Long])] This RDD has duplicate keys that I can collapse like so: val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => a ++ b) If I start with an Array of primitive longs in rdd1, will rdd2
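
For what it's worth, Array[Long] compiles to the JVM's primitive long[], and ++ on two Array[Long] values yields another Array[Long], so the reduced RDD should still hold primitive arrays, although each ++ allocates and copies a new array. A short sketch with the syntax spelled out:

    import org.apache.spark.SparkContext._   // pair-RDD functions in pre-1.3 Spark
    import org.apache.spark.rdd.RDD

    // Concatenate the arrays for duplicate keys; the element type stays long[].
    def collapse(rdd1: RDD[(Long, Array[Long])]): RDD[(Long, Array[Long])] =
      rdd1.reduceByKey((a, b) => a ++ b)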

Attaching schema to RDD created from Parquet file

2014-10-17 Thread Akshat Aranya
Hi, How can I convert an RDD loaded from a Parquet file back into its original type: case class Person(name: String, age: Int) val rdd: RDD[Person] = ... rdd.saveAsParquetFile("people") val rdd2 = sqlContext.parquetFile("people") How can I map rdd2 back into an RDD[Person]? All of the examples just
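
A sketch of mapping the rows back by hand, assuming an existing SparkContext `sc` and that the columns come back in the order they were written (name, age):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    val rdd2 = sqlContext.parquetFile("people")

    // Positional getters for brevity; looking fields up by name via the schema
    // would be less fragile if the column order isn't guaranteed.
    val people: RDD[Person] = rdd2.map(row => Person(row.getString(0), row.getInt(1)))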

Re: bug with MapPartitions?

2014-10-17 Thread Akshat Aranya
There seems to be some problem with what gets captured in the closure that's passed into the mapPartitions (myfunc in your case). I've had a similar problem before: http://apache-spark-user-list.1001560.n3.nabble.com/TaskNotSerializableException-when-running-through-Spark-shell-td16574.html Try

Re: Larger heap leads to perf degradation due to GC

2014-10-16 Thread Akshat Aranya
I just want to pitch in and say that I ran into the same problem when running with 64GB executors. For example, some of the tasks take 5 minutes to execute, out of which 4 minutes are spent in GC. I'll try out smaller executors. On Mon, Oct 6, 2014 at 6:35 PM, Otis Gospodnetic

TaskNotSerializableException when running through Spark shell

2014-10-16 Thread Akshat Aranya
Hi, Can anyone explain how things get captured in a closure when running through the REPL? For example: def foo(..) = { .. } rdd.map(foo) sometimes complains about classes not being serializable that are completely unrelated to foo. This happens even when I write it like so: object Foo { def
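
A sketch of one common workaround: keep the logic in its own serializable object (most reliably in a compiled jar rather than re-defined in the shell), so the closure does not drag in the REPL's line wrapper and whatever else was defined alongside foo:

    import org.apache.spark.rdd.RDD

    // Placeholder logic; the point is that the closure only needs this object,
    // not the REPL wrapper object that a top-level `def` belongs to.
    object LineParsers extends Serializable {
      def parse(line: String): Int = line.length
    }

    def lengths(rdd: RDD[String]): RDD[Int] = rdd.map(LineParsers.parse(_))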

One pass compute() to produce multiple RDDs

2014-10-09 Thread Akshat Aranya
Hi, Is there a good way to materialize derivative RDDs from, say, a HadoopRDD while reading in the data only once? One way to do so would be to cache the HadoopRDD and then create derivative RDDs, but that would require enough RAM to cache the HadoopRDD, which is not an option in my case. Thanks,

Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-08 Thread Akshat Aranya
Using a var for RDDs in this way is not going to work. In this example, tx1.zip(tx2) would create an RDD that depends on tx2, but then soon after that, you change what tx2 means, so you would end up having a circular dependency. On Wed, Oct 8, 2014 at 12:01 PM, Sung Hwan Chung

Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
Hi, What's the relationship between Spark worker and executor memory settings in standalone mode? Do they work independently or does the worker cap executor memory? Also, is the number of concurrent executors per worker capped by the number of CPU cores configured for the worker?

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
On Wed, Oct 1, 2014 at 11:33 AM, Akshat Aranya aara...@gmail.com wrote: On Wed, Oct 1, 2014 at 11:00 AM, Boromir Widas vcsub...@gmail.com wrote: 1. worker memory caps executor. 2. With default config, every job gets one executor per worker. This executor runs with all cores available
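
A sketch of how the two layers interact in standalone mode (the numbers are placeholders): the worker advertises memory and cores, and each executor has to fit within them.

    # spark-env.sh on each worker node
    SPARK_WORKER_MEMORY=32g   # total memory the worker will hand out to executors
    SPARK_WORKER_CORES=8      # total cores the worker will hand out

    # spark-defaults.conf (or SparkConf) for the application
    spark.executor.memory   16g   # must fit within SPARK_WORKER_MEMORY
    spark.cores.max         16    # total cores the app will take across the cluster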

Determining number of executors within RDD

2014-10-01 Thread Akshat Aranya
Hi, I want to implement an RDD wherein the decision about the number of partitions is based on the number of executors that have been set up. Is there some way I can determine the number of executors within the getPartitions() call?
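
A rough sketch of one way to approximate it, assuming the registered block managers are a good enough proxy: the driver has one too, so subtract it outside local mode, and executors that register later (or die) won't be reflected.

    import org.apache.spark.SparkContext

    // Treat this as a hint rather than an exact count.
    def currentExecutorCount(sc: SparkContext): Int =
      math.max(sc.getExecutorMemoryStatus.size - 1, 1)

    // e.g. when building partitions inside a custom RDD:
    // numPartitions = currentExecutorCount(sc) * tasksPerExecutor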

Re: partitioned groupBy

2014-09-17 Thread Akshat Aranya
a hash map within each partition yourself. On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote: I have a use case where my RDD is set up such: Partition 0: K1 -> [V1, V2] K2 -> [V2] Partition 1: K3 -> [V1] K4 -> [V3] I want to invert this RDD, but only within
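
A sketch of that suggestion, with Long keys and values standing in for the K/V types in the original post below:

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    // Invert each partition locally with a hash map; no shuffle is involved, and
    // the resulting keys are only unique within a partition.
    def invertWithinPartitions(rdd: RDD[(Long, Seq[Long])]): RDD[(Long, Seq[Long])] =
      rdd.mapPartitions { iter =>
        val inverted = mutable.HashMap.empty[Long, mutable.ArrayBuffer[Long]]
        for ((k, vs) <- iter; v <- vs)
          inverted.getOrElseUpdate(v, mutable.ArrayBuffer.empty[Long]) += k
        inverted.iterator.map { case (v, ks) => (v, ks.toSeq) }
      }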

Indexed RDD

2014-09-16 Thread Akshat Aranya
Hi, I'm trying to implement a custom RDD that essentially works as a distributed hash table, i.e. the key space is split up into partitions and within a partition, an element can be looked up efficiently by the key. However, the RDD lookup() function (in PairRDDFunctions) is implemented in a way

partitioned groupBy

2014-09-16 Thread Akshat Aranya
I have a use case where my RDD is set up like this: Partition 0: K1 -> [V1, V2] K2 -> [V2] Partition 1: K3 -> [V1] K4 -> [V3] I want to invert this RDD, but only within a partition, so that the operation does not require a shuffle. It doesn't matter if the partitions of the inverted RDD have non-unique