[jira] [Commented] (SPARK-6368) Build a specialized serializer for Exchange operator.

2015-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524117#comment-14524117
 ] 

Apache Spark commented on SPARK-6368:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/5849

> Build a specialized serializer for Exchange operator. 
> --
>
> Key: SPARK-6368
> URL: https://issues.apache.org/jira/browse/SPARK-6368
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
> Fix For: 1.4.0
>
> Attachments: Kryo.nps, SchemaBased.nps
>
>
> Kryo is still pretty slow because it works on individual objects and is relatively 
> expensive to allocate. For the Exchange operator, because the schemas of the key and 
> value are already defined, we can create a specialized serializer to handle 
> those specific schemas. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6368) Build a specialized serializer for Exchange operator.

2015-04-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495475#comment-14495475
 ] 

Yin Huai commented on SPARK-6368:
-

I am adding the results of a simple benchmark.

I built the master branch with 
{code}
build/sbt -Phive assembly
{code}
Then I launched the Spark shell with
{code}
bin/spark-shell --master local-cluster[1,1,4096] --conf 
spark.executor.memory=4096m --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
spark.shuffle.compress=false --conf 
spark.executor.extraJavaOptions="-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/var/tmp" --conf fs.local.block.size=536870912 -v
{code}

The following code was used to generate and cache the test data:
{code}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val supportedTypes =
  Seq(StringType, BinaryType, BooleanType,
ByteType, ShortType, IntegerType, LongType,
DoubleType, DecimalType.Unlimited, DecimalType(6, 5),
DateType, TimestampType)

val fields = supportedTypes.zipWithIndex.map { case (dataType, index) =>
  StructField(s"col$index", dataType, true)
}
val allColumns = fields.map(_.name).mkString(",")
val schema = StructType(fields)

val rdd =
  sc.parallelize((1 to 8), 8).flatMap { j => (1 to 100).map ( i =>
Row(
  s"str${i}: test serializer2.",
  s"binary${i}: test serializer2.".getBytes("UTF-8"),
  i % 2 == 0,
  i.toByte,
  i.toShort,
  i,
  i.toLong,
  (i + 0.75),
  BigDecimal(Long.MaxValue.toString + ".12345"),
  new java.math.BigDecimal(s"${i % 9 + 1}" + ".23456"),
  new java.sql.Date(i),
  new java.sql.Timestamp(i)))
  }
  
sqlContext.createDataFrame(rdd, schema).registerTempTable("shuffle")

sqlContext.sql("cache table shuffle")
{code}

The query was:
{code}
sql(s"""
  select
${allColumns}
  from shuffle
  cluster by
${allColumns}""").queryExecution.executedPlan(1).execute().foreach(x => 
Unit)
{code}

With the specialized serializer ({{sqlContext.sql("set spark.sql.useSerializer2=true")}}), 
the execution time was 84.750775s. With Kryo ({{sqlContext.sql("set 
spark.sql.useSerializer2=false")}}), the execution time was 189.129494s. I am 
also attaching the profiling results from VisualVM (using its sampler). 
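For reference, the reported timings imply roughly a 2.2x speedup. This back-of-the-envelope check is mine, not part of the original comment:

```scala
// Speedup implied by the two reported execution times.
object SpeedupCheck {
  def main(args: Array[String]): Unit = {
    val kryo = 189.129494        // seconds with spark.sql.useSerializer2=false
    val specialized = 84.750775  // seconds with spark.sql.useSerializer2=true
    // Round to two decimal places in a locale-independent way.
    val speedup = math.round(kryo / specialized * 100) / 100.0
    println(speedup)             // prints 2.23
  }
}
```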




[jira] [Commented] (SPARK-6368) Build a specialized serializer for Exchange operator.

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493092#comment-14493092
 ] 

Apache Spark commented on SPARK-6368:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/5497
