[jira] [Commented] (SPARK-6368) Build a specialized serializer for Exchange operator.

2015-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524117#comment-14524117
 ] 

Apache Spark commented on SPARK-6368:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/5849

 Build a specialized serializer for Exchange operator. 
 --

 Key: SPARK-6368
 URL: https://issues.apache.org/jira/browse/SPARK-6368
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical
 Fix For: 1.4.0

 Attachments: Kryo.nps, SchemaBased.nps


 Kryo is still pretty slow because it works on individual objects and is 
 relatively expensive to allocate. For the Exchange operator, because the 
 schemas of the key and value are already defined, we can create a 
 specialized serializer to handle those specific schemas. 
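The proposal can be sketched in plain Scala: when the schema is known up front, serialization reduces to a fixed walk over the fields, with none of the per-object class metadata a generic serializer such as Kryo must emit. The {{FieldType}} tags and {{SchemaSerializer}} class below are hypothetical stand-ins for Spark's {{DataType}} hierarchy and the eventual implementation, not actual Spark API.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}

// Hypothetical field-type tags standing in for Spark's DataType.
sealed trait FieldType
case object IntField extends FieldType
case object LongField extends FieldType
case object StringField extends FieldType

// Because the schema is fixed, serialization is a straight walk over the
// fields -- no class names or type tags are written to the stream.
class SchemaSerializer(schema: Seq[FieldType]) {
  def write(row: Seq[Any], out: DataOutputStream): Unit = {
    schema.zip(row).foreach {
      case (IntField, v: Int)   => out.writeInt(v)
      case (LongField, v: Long) => out.writeLong(v)
      case (StringField, v: String) =>
        val bytes = v.getBytes("UTF-8")
        out.writeInt(bytes.length) // length prefix, then raw bytes
        out.write(bytes)
      case (t, v) =>
        throw new IllegalArgumentException(s"value $v does not match $t")
    }
  }
}

val buf = new ByteArrayOutputStream()
val out = new DataOutputStream(buf)
new SchemaSerializer(Seq(IntField, LongField, StringField))
  .write(Seq(1, 2L, "abc"), out)
out.flush()
// 4 (Int) + 8 (Long) + 4 (length prefix) + 3 (UTF-8 bytes) = 19 bytes
println(buf.size)
```

Writing one Int, one Long, and one length-prefixed three-byte String costs exactly 19 bytes; the decoding side can read the same fixed sequence back because it shares the schema.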



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6368) Build a specialized serializer for Exchange operator.

2015-04-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495475#comment-14495475
 ] 

Yin Huai commented on SPARK-6368:
-

I am adding results of a simple benchmark.

I built master with:
{code}
build/sbt -Phive assembly
{code}
Then, I launched the Spark shell with:
{code}
bin/spark-shell --master local-cluster[1,1,4096] \
  --conf spark.executor.memory=4096m \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.shuffle.compress=false \
  --conf "spark.executor.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp" \
  --conf fs.local.block.size=536870912 -v
{code}

The following code was used:
{code}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val supportedTypes =
  Seq(StringType, BinaryType, BooleanType,
    ByteType, ShortType, IntegerType, LongType,
    DoubleType, DecimalType.Unlimited, DecimalType(6, 5),
    DateType, TimestampType)

val fields = supportedTypes.zipWithIndex.map { case (dataType, index) =>
  StructField(s"col$index", dataType, true)
}
val allColumns = fields.map(_.name).mkString(",")
val schema = StructType(fields)

val rdd =
  sc.parallelize((1 to 8), 8).flatMap { j => (1 to 100).map ( i =>
    Row(
      s"str${i}: test serializer2.",
      s"binary${i}: test serializer2.".getBytes("UTF-8"),
      i % 2 == 0,
      i.toByte,
      i.toShort,
      i,
      i.toLong,
      (i + 0.75),
      BigDecimal(Long.MaxValue.toString + ".12345"),
      new java.math.BigDecimal(s"${i % 9 + 1}" + ".23456"),
      new java.sql.Date(i),
      new java.sql.Timestamp(i)))
  }

sqlContext.createDataFrame(rdd, schema).registerTempTable("shuffle")

sqlContext.sql("cache table shuffle")
{code}

The query was 
{code}
sql(s"""
  select
    ${allColumns}
  from shuffle
  cluster by
    ${allColumns}""").queryExecution.executedPlan(1).execute().foreach(x => Unit)
{code}

With the new serializer ({{sqlContext.sql("set spark.sql.useSerializer2=true")}}), 
the execution time was 84.750775s. With Kryo ({{sqlContext.sql("set 
spark.sql.useSerializer2=false")}}), the execution time was 189.129494s. I am 
also attaching the profiling results from VisualVM (I was using the sampler). 
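For context, those two timings work out to roughly a 2.2x speedup for the specialized serializer:

```scala
// Timings reported above, in seconds.
val kryoSeconds = 189.129494
val serializer2Seconds = 84.750775

val speedup = kryoSeconds / serializer2Seconds
println(f"$speedup%.2fx") // roughly 2.23x
```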




[jira] [Commented] (SPARK-6368) Build a specialized serializer for Exchange operator.

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493092#comment-14493092
 ] 

Apache Spark commented on SPARK-6368:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/5497
