SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?
On 1/11/15 1:40 PM, Nathan McCarthy wrote:

Thanks Cheng & Michael! Makes sense. Appreciate the tips!

Idiomatic Scala isn't performant. I'll definitely start using while loops or
tail-recursive methods. I have noticed this in the Spark code base.
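To illustrate the point (a generic sketch, not code from the thread): the same per-element work written as an idiomatic iterator chain, as the while loop suggested above, and as a tail-recursive method.

```scala
// Summing doubled elements three ways. The idiomatic form allocates an
// iterator chain and boxes primitives; the imperative forms avoid per-element
// allocation, which matters in tight per-row loops inside mapPartitions.
import scala.annotation.tailrec

// Idiomatic: concise, but builds an iterator and boxes each Double.
def sumIdiomatic(xs: Array[Double]): Double =
  xs.iterator.map(_ * 2).sum

// While loop: no allocation per element, plain primitive arithmetic.
def sumWhile(xs: Array[Double]): Double = {
  var acc = 0.0
  var i = 0
  while (i < xs.length) {
    acc += xs(i) * 2
    i += 1
  }
  acc
}

// Tail-recursive: the compiler rewrites this into a loop (no stack growth).
@tailrec
def sumRec(xs: Array[Double], i: Int = 0, acc: Double = 0.0): Double =
  if (i >= xs.length) acc else sumRec(xs, i + 1, acc + xs(i) * 2)
```

All three return the same result; the difference only shows up as allocation and boxing overhead in hot loops.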
Date: Monday, 12 January 2015 1:21 am
To: Nathan <nathan.mccar...@quantium.com.au>, Michael Armbrust
<mich...@databricks.com>
Cc: user@spark.apache.org
Subject: Re: SparkSQL schemaRDD MapPartitions calls - performance issues -
columnar formats?
The other thing to note here is that Spark SQL defensively copies rows when
we switch into user code. This probably explains the difference between 1 & 2.

The difference between 1 & 3 is likely the cost of decompressing the column
buffers vs. accessing a bunch of uncompressed primitive objects.
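To see why those defensive copies exist (a plain-Scala sketch, no Spark: `MutableRow` and `scan` are hypothetical stand-ins for a columnar scanner that reuses one mutable row object per partition):

```scala
// A scanner that reuses a single mutable row, as columnar iterators often do.
final class MutableRow(var value: Int) {
  def copy(): MutableRow = new MutableRow(value)
}

def scan(values: Array[Int]): Iterator[MutableRow] = {
  val row = new MutableRow(0)                    // one object, reused per element
  values.iterator.map { v => row.value = v; row }
}

// Buffering WITHOUT copying keeps N references to the same object, so every
// entry ends up holding the last value written:
val wrong = scan(Array(1, 2, 3)).toArray.map(_.value)

// Copying each row before buffering preserves the distinct values:
val right = scan(Array(1, 2, 3)).map(_.copy()).toArray.map(_.value)
```

Spark SQL pays that `copy()` cost on your behalf before handing rows to user code, which is the per-row overhead Michael is pointing at.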
Hey Nathan,
Thanks for sharing, this is a very interesting post :) My comments are
inlined below.
Cheng
On 1/7/15 11:53 AM, Nathan McCarthy wrote:
From: Nathan <nathan.mccar...@quantium.com.au>
Date: Wednesday, 7 January 2015 2:53 pm
To: user@spark.apache.org
Subject: SparkSQL schemaRDD MapPartitions calls - performance issues -
columnar formats?

Hi,

I'm trying to use a combination of SparkSQL and 'normal' Spark/Scala via
rdd.mapPartitions(…). Using the latest release 1.2.0.

Simple example; load up some sample data from parquet on HDFS (about 380m
rows, 10 columns) on a 7 node cluster.

val t =

Any ideas? :)
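The quoted code is truncated above. As a hedged sketch only (hypothetical path, table, and column names; assuming the Spark 1.2 SQLContext/SchemaRDD API), the kind of pipeline the post describes looks like:

```scala
import org.apache.spark.sql.SQLContext

// Hypothetical reconstruction, not the original code: load parquet, run a
// SQL query, then drop into "normal" Spark per-partition code.
val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
val t = sqlContext.parquetFile("hdfs:///data/sample.parquet")
t.registerTempTable("t")

val counts = sqlContext.sql("SELECT col1, col2 FROM t")
  .mapPartitions { rows =>
    // Imperative per-row loop inside the partition, per the advice above.
    var n = 0L
    while (rows.hasNext) { rows.next(); n += 1 }
    Iterator.single(n)
  }
```

This only illustrates the SchemaRDD-to-mapPartitions handoff being benchmarked; the actual queries and data in the thread are not shown.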