[
https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495475#comment-14495475
]
Yin Huai commented on SPARK-6368:
-
I am adding results of a simple benchmark.
I built the master with
{code}
build/sbt -Phive assembly
{code}
Then, I launched the spark shell with
{code}
bin/spark-shell --master local-cluster[1,1,4096] \
  --conf spark.executor.memory=4096m \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.shuffle.compress=false \
  --conf "spark.executor.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp" \
  --conf fs.local.block.size=536870912 \
  -v
{code}
The following code was used:
{code}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val supportedTypes =
  Seq(StringType, BinaryType, BooleanType,
    ByteType, ShortType, IntegerType, LongType,
    DoubleType, DecimalType.Unlimited, DecimalType(6, 5),
    DateType, TimestampType)

val fields = supportedTypes.zipWithIndex.map { case (dataType, index) =>
  StructField(s"col$index", dataType, true)
}
val allColumns = fields.map(_.name).mkString(",")
val schema = StructType(fields)

val rdd =
  sc.parallelize((1 to 8), 8).flatMap { j => (1 to 100).map ( i =>
    Row(
      s"str${i}: test serializer2.",
      s"binary${i}: test serializer2.".getBytes("UTF-8"),
      i % 2 == 0,
      i.toByte,
      i.toShort,
      i,
      i.toLong,
      (i + 0.75),
      BigDecimal(Long.MaxValue.toString + ".12345"),
      new java.math.BigDecimal(s"${i % 9 + 1}" + ".23456"),
      new java.sql.Date(i),
      new java.sql.Timestamp(i)))
  }

sqlContext.createDataFrame(rdd, schema).registerTempTable("shuffle")
sqlContext.sql("cache table shuffle")
{code}
The query was:
{code}
sql(s"""
select
${allColumns}
from shuffle
cluster by
${allColumns}""").queryExecution.executedPlan(1).execute().foreach(x => Unit)
{code}
With the new serializer ({{sqlContext.sql("set spark.sql.useSerializer2=true")}}),
the execution time was 84.750775s. With Kryo ({{sqlContext.sql("set
spark.sql.useSerializer2=false")}}), the execution time was 189.129494s. I am
also attaching the profiling results from visualvm (I was using the sampler).
Build a specialized serializer for Exchange operator.
--
Key: SPARK-6368
URL: https://issues.apache.org/jira/browse/SPARK-6368
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical
Kryo is still pretty slow because it works on individual objects, and object
allocation is relatively expensive. For the Exchange operator, because the
schemas of the key and value are already defined, we can create a specialized
serializer that handles those specific schemas.
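The idea can be sketched roughly as follows (a hypothetical illustration with a made-up {{FixedSchemaSerializer}} object and an assumed {{(Int, Long, String)}} schema, not the actual implementation): when the row schema is fixed, the serializer can write field values directly in schema order, with no per-object type tags or class names, which a generic serializer such as Kryo must emit.

{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical sketch of a schema-specialized serializer. The schema
// (Int, Long, String) is assumed for illustration; each field is written
// directly in schema order with no per-object type information.
object FixedSchemaSerializer {
  def serialize(row: (Int, Long, String)): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(row._1)    // field 0: Int
    out.writeLong(row._2)   // field 1: Long
    out.writeUTF(row._3)    // field 2: String
    out.flush()
    bytes.toByteArray
  }

  def deserialize(data: Array[Byte]): (Int, Long, String) = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    // Read back in the same fixed schema order.
    (in.readInt(), in.readLong(), in.readUTF())
  }
}
{code}

A round trip preserves the row, and the serialized form carries only the field payloads, which is why this approach can beat a generic object serializer for shuffle data with a known schema.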
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org