Re: IOError on createDataFrame

2015-08-31 Thread Akhil Das
Why not attach a bigger hard disk to the machines and point your SPARK_LOCAL_DIRS to it?

Thanks
Best Regards

On Sat, Aug 29, 2015 at 1:13 AM, fsacerdoti wrote:
> Hello,
>
> Similar to the thread below [1], when I tried to create an RDD from a 4GB pandas dataframe
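For reference, pointing Spark's scratch space at a bigger disk is a one-line change in `conf/spark-env.sh` on each worker. The mount point below is a placeholder, not a path from the thread:

```shell
# conf/spark-env.sh on each worker: point Spark's shuffle/spill scratch
# space at the larger disk. /mnt/bigdisk is a placeholder mount point.
export SPARK_LOCAL_DIRS=/mnt/bigdisk/spark-tmp

# Multiple comma-separated directories spread spill I/O across disks:
# export SPARK_LOCAL_DIRS=/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp
```

On YARN the equivalent setting is the NodeManager's local dirs; SPARK_LOCAL_DIRS applies to standalone and Mesos deployments.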

Re: IOError on createDataFrame

2015-08-31 Thread fsacerdoti
There are two issues here:

1. Suppression of the true reason for failure. The spark runtime reports "TypeError" but that is not why the operation failed.
2. The low performance of loading a pandas dataframe.

DISCUSSION

Number (1) is easily fixed, and the primary purpose for my post. Number
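On issue (2): the slow path comes from pickling the entire dataset as one payload. A workaround contemporaries used was to slice the data into chunks before handing it to Spark. Below is a minimal, hypothetical sketch of just the chunking step (plain Python, no Spark required); `chunk_records` and the chunk size are illustrative names, not PySpark API:

```python
def chunk_records(records, chunk_size):
    """Yield successive slices of `records`, each at most `chunk_size` rows.

    Feeding many modest chunks to parallelize/createDataFrame avoids
    pickling one multi-GB payload in a single call, which is the failure
    mode described in the original thread. Hypothetical helper.
    """
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]

# Plain tuples standing in for dataframe rows:
rows = [(i, i * 0.5) for i in range(10)]
chunks = list(chunk_records(rows, 4))  # three chunks: 4 + 4 + 2 rows

# With a live SparkContext one would do something like (not executed here):
# rdd = sc.union([sc.parallelize(c) for c in chunk_records(rows, 100_000)])
```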

Re: Tungsten off heap memory access for C++ libraries

2015-08-31 Thread Paul Weiss
Sounds good, want me to create a jira and link it to SPARK-9697? Will put down some ideas to start.

On Aug 31, 2015 4:14 AM, "Reynold Xin" wrote:
> BTW if you are interested in this, we could definitely get some help in terms of prototyping the feasibility, i.e. how we can

Re: KryoSerializer for closureSerializer in DAGScheduler

2015-08-31 Thread yash datta
Thanks josh ... i'll take a look

On 31 Aug 2015 19:21, "Josh Rosen" wrote:
> There are currently a few known issues with using KryoSerializer as the closure serializer, so it's going to require some changes to Spark if we want to properly support this. See

Re: ClassCastException using DataFrame only when num-executors > 2 ...

2015-08-31 Thread Olivier Girardot
tested now against Spark 1.5.0 rc2, and same exceptions happen when num-executors > 2:

15/08/25 10:31:10 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 5.0 (TID 501, xxx):
java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Long at
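The exception itself was a Spark-side regression in the 1.5 release candidates, but the failure pattern (a Double arriving where a Long was declared, surfacing deep in an executor stack trace) can also be caught defensively on the driver before building a DataFrame. A hypothetical pre-validation sketch, pure Python, no Spark required; `coerce_to_long` is an illustrative helper, not a Spark API:

```python
def coerce_to_long(value):
    """Coerce a numeric value to int when it is losslessly integral.

    Raising on the driver gives a clear error instead of a
    ClassCastException buried in an executor log. Hypothetical helper;
    the actual 1.5 RC bug was fixed in Spark itself.
    """
    if isinstance(value, bool):
        raise TypeError("expected a number, got bool: %r" % (value,))
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    raise TypeError("cannot losslessly coerce %r to long" % (value,))

cleaned = [coerce_to_long(v) for v in [1, 2.0, 3]]  # [1, 2, 3]
```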

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-31 Thread Reynold Xin
I'm going to -1 the release myself since the issue @yhuai identified is pretty serious. It basically OOMs the driver for reading any files with a large number of partitions. Looks like the patch for that has already been merged. I'm going to cut rc3 momentarily.

On Sun, Aug 30, 2015 at 11:30

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-31 Thread Chester Chen
Seems that the GitHub branch-1.5 has already changed the version to 1.5.1-SNAPSHOT. I am a bit confused: are we still on 1.5.0 RC3, or are we on 1.5.1?

Chester

On Mon, Aug 31, 2015 at 3:52 PM, Reynold Xin wrote:
> I'm going to -1 the release myself since the issue @yhuai

Re: Tungsten off heap memory access for C++ libraries

2015-08-31 Thread Reynold Xin
On Sun, Aug 30, 2015 at 5:58 AM, Paul Weiss wrote:
> Also, is this work being done on a branch I could look into further and try out?

We don't have a branch yet -- because there is no code nor design for this yet. As I said, it is one of the motivations behind

Re: Research of Spark scalability / performance issues

2015-08-31 Thread Steve Loughran
If you look at the recurrent issues in datacentre-scale computing systems, two stand out:

- resilience to failure: that's algorithms and the layers underneath (storage, work allocation & tracking ...)
- scheduling: maximising resource utilisation while prioritising high-SLA work (interactive

KryoSerializer for closureSerializer in DAGScheduler

2015-08-31 Thread yash datta
Hi devs,

Currently the only supported serializer for serializing tasks in DAGScheduler.scala is JavaSerializer:

    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef).array()
      case stage:
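For context, Spark 1.x did read an undocumented `spark.closure.serializer` property when constructing the closure serializer in SparkEnv, which is presumably what this question is probing. Shown as a config fragment for illustration only; Kryo was not fully supported in this role at the time (see Josh Rosen's reply in this thread), and the property was removed in later Spark versions:

```
# spark-defaults.conf -- historical Spark 1.x knob, illustrative only.
# Known issues existed with Kryo as the closure serializer, and this
# setting was later removed entirely.
spark.closure.serializer  org.apache.spark.serializer.KryoSerializer
```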

Re: Tungsten off heap memory access for C++ libraries

2015-08-31 Thread Reynold Xin
BTW if you are interested in this, we could definitely get some help in terms of prototyping the feasibility, i.e. how we can have a native (e.g. C++) API for data access shipped with Spark. There are a lot of questions (e.g. build, portability) that need to be answered. On Mon, Aug 31, 2015 at
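The idea under discussion is letting non-JVM code read Tungsten's binary row format directly from off-heap memory, with no per-row deserialization. As a rough illustration of why that is plausible, here is a Python sketch that decodes a simplified UnsafeRow-style layout (an 8-byte null bitset word followed by one 8-byte slot per field) straight out of a raw buffer. The layout is an approximation for illustration, not Spark's actual binary contract:

```python
import struct

# Simplified UnsafeRow-style layout (illustrative, NOT Spark's real ABI):
# [ 8-byte null bitset ][ 8-byte slot field 0 ][ 8-byte slot field 1 ] ...

def make_row(values):
    """Pack integer fields (None = null) into the simplified layout."""
    bitset = 0
    slots = []
    for i, v in enumerate(values):
        if v is None:
            bitset |= 1 << i
            slots.append(0)
        else:
            slots.append(v)
    return struct.pack("<q%dq" % len(slots), bitset, *slots)

def get_long(row, ordinal):
    """Read field `ordinal` directly from the buffer; None if null."""
    bitset = struct.unpack_from("<q", row, 0)[0]
    if (bitset >> ordinal) & 1:
        return None
    return struct.unpack_from("<q", row, 8 + 8 * ordinal)[0]

row = make_row([42, None, 7])
```

A C or C++ reader would do the same pointer arithmetic over the shared off-heap region, which is what makes a native API for Tungsten data feasible in principle.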