[jira] [Updated] (SPARK-13745) Support columnar in memory representation on Big Endian platforms

2016-04-14 Thread Tim Preece (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Preece updated SPARK-13745:
---
Affects Version/s: 2.0.0

> Support columnar in memory representation on Big Endian platforms
> -----------------------------------------------------------------
>
> Key: SPARK-13745
> URL: https://issues.apache.org/jira/browse/SPARK-13745
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tim Preece
>  Labels: big-endian
>
> SPARK-12785 introduced a columnar in-memory representation. 
> Currently this feature is explicitly supported only on Little Endian 
> platforms. On Big Endian platforms the following exception is thrown:
> "org.apache.commons.lang.NotImplementedException: Only little endian is 
> supported."
> This JIRA should be used to extend support to Big Endian architectures, and 
> to decide whether the "in memory" columnar format should be consistent with 
> the Parquet format.






[jira] [Created] (SPARK-13745) Support columnar in memory representation on Big Endian platforms

2016-03-08 Thread Tim Preece (JIRA)
Tim Preece created SPARK-13745:
--

 Summary: Support columnar in memory representation on Big Endian 
platforms
 Key: SPARK-13745
 URL: https://issues.apache.org/jira/browse/SPARK-13745
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Tim Preece


SPARK-12785 introduced a columnar in-memory representation. 

Currently this feature is explicitly supported only on Little Endian platforms. 
On Big Endian platforms the following exception is thrown:
"org.apache.commons.lang.NotImplementedException: Only little endian is 
supported."

This JIRA should be used to extend support to Big Endian architectures, and to 
decide whether the "in memory" columnar format should be consistent with the 
Parquet format.
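
For reference, the JVM exposes the platform byte order via java.nio.ByteOrder, 
and the exception above comes from a guard of roughly this shape ( a sketch 
for illustration, not the actual Spark source ):

{code}
import java.nio.ByteOrder
import org.apache.commons.lang.NotImplementedException

object EndianGuard {
  // Refuses to run on anything other than a Little Endian platform,
  // mirroring the behaviour reported above.
  def checkSupported(): Unit = {
    if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) {
      throw new NotImplementedException("Only little endian is supported.")
    }
  }
}
{code}

Supporting Big Endian would mean either branching on ByteOrder.nativeOrder() 
at each read/write, or fixing the format to a single byte order ( Parquet, for 
comparison, fixes its encoding to little endian ).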






[jira] [Updated] (SPARK-13648) org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError

2016-03-03 Thread Tim Preece (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Preece updated SPARK-13648:
---
Description: 
When running the standard Spark unit tests on the IBM Java SDK, the Hive 
VersionsSuite fails with the following error.

java.lang.NoClassDefFoundError: org.apache.hadoop.hive.cli.CliSessionState 
when creating Hive client using classpath: ..
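
The VersionsSuite builds Hive clients from a set of downloaded jars using an 
isolated classloader, so the failure mode can be pictured with a rough sketch 
( illustrative only, not the actual Spark code ):

{code}
import java.net.{URL, URLClassLoader}

object HiveClassLoadCheck {
  def main(args: Array[String]): Unit = {
    // args: the jar URLs that make up the Hive client classpath
    val loader = new URLClassLoader(args.map(new URL(_)), null)
    // Fails if none of the jars provides the class; during client
    // construction this surfaces as the NoClassDefFoundError above.
    loader.loadClass("org.apache.hadoop.hive.cli.CliSessionState")
  }
}
{code}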


  was:
When running the standard Spark unit tests on the IBM Java SDK, the Hive 
VersionsSuite fails with the following error.

{panel}
java.lang.NoClassDefFoundError: org.apache.hadoop.hive.cli.CliSessionState when 
creating Hive client using classpath: 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/log4j_log4j-1.2.17.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.hadoop_hadoop-mapreduce-client-jobclient-2.6.0.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.directory.server_apacheds-kerberos-codec-2.0.0-M15.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.json_json-20090211.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/junit_junit-3.8.1.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/com.sun.xml.bind_jaxb-impl-2.2.3-1.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.commons_commons-math3-3.1.1.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/com.google.inject_guice-3.0.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.hadoop_hadoop-mapreduce-client-core-2.6.0.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.httpcomponents_httpclient-4.2.5.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/javax.servlet_servlet-api-2.5.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/commons-codec_commons-codec-1.4.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.iq80.snappy_snappy-0.2.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/javax.transaction_jta-1.1.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.datanucleus_datanucleus-core-3.2.2.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.avro_avro-1.7.4.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/aopalliance_aopalliance-1.0.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.directory.api_api-asn1-api-1.0.0-M20.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.httpcomponents_httpcore-4.2.5.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.codehaus.jettison_jettison-1.1.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/commons-digester_commons-digester-1.8.jar,
 

[jira] [Created] (SPARK-13648) org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError

2016-03-03 Thread Tim Preece (JIRA)
Tim Preece created SPARK-13648:
--

 Summary: org.apache.spark.sql.hive.client.VersionsSuite fails 
NoClassDefFoundError
 Key: SPARK-13648
 URL: https://issues.apache.org/jira/browse/SPARK-13648
 Project: Spark
  Issue Type: Bug
 Environment: Fails on vendor-specific JVMs ( e.g. the IBM JVM )
Reporter: Tim Preece


When running the standard Spark unit tests on the IBM Java SDK, the Hive 
VersionsSuite fails with the following error.

{panel}
java.lang.NoClassDefFoundError: org.apache.hadoop.hive.cli.CliSessionState when 
creating Hive client using classpath: 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/log4j_log4j-1.2.17.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.hadoop_hadoop-mapreduce-client-jobclient-2.6.0.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.directory.server_apacheds-kerberos-codec-2.0.0-M15.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.json_json-20090211.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/junit_junit-3.8.1.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/com.sun.xml.bind_jaxb-impl-2.2.3-1.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.commons_commons-math3-3.1.1.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/com.google.inject_guice-3.0.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.hadoop_hadoop-mapreduce-client-core-2.6.0.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.httpcomponents_httpclient-4.2.5.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/javax.servlet_servlet-api-2.5.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/commons-codec_commons-codec-1.4.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.iq80.snappy_snappy-0.2.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/javax.transaction_jta-1.1.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.datanucleus_datanucleus-core-3.2.2.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.avro_avro-1.7.4.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/aopalliance_aopalliance-1.0.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.directory.api_api-asn1-api-1.0.0-M20.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.httpcomponents_httpcore-4.2.5.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.codehaus.jettison_jettison-1.1.jar,
 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/commons-digester_commons-digester-1.8.jar,
 

[jira] [Commented] (SPARK-12785) Implement columnar in memory representation

2016-01-15 Thread Tim Preece (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101520#comment-15101520
 ] 

Tim Preece commented on SPARK-12785:


I notice there is an explicit check for endianness in OffHeapColumnVector.

Is there a technical reason why this feature is only being developed for 
Little Endian platforms?

I think the endianness should never matter ( provided Spark is run on a 
cluster of homogeneous endianness ).
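
For illustration ( a sketch only, not a proposal for the actual 
OffHeapColumnVector code ): if the writer and the reader both go through an 
explicitly fixed byte order, the platform's native endianness drops out 
entirely:

{code}
import java.nio.{ByteBuffer, ByteOrder}

object FixedOrderAccess {
  // Round-trips a value through an explicitly little-endian view of the
  // buffer; the result is identical on Big and Little Endian platforms.
  def roundTrip(value: Int): Int = {
    val buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN)
    buf.putInt(0, value)
    buf.getInt(0)
  }
}
{code}

The cost of a fixed order is a byte swap on the "wrong" platform; matching the 
native order avoids that but makes the in-memory bytes platform dependent.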



> Implement columnar in memory representation
> -------------------------------------------
>
> Key: SPARK-12785
> URL: https://issues.apache.org/jira/browse/SPARK-12785
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Nong Li
>Assignee: Nong Li
> Fix For: 2.0.0
>
>
> Tungsten can benefit from having a columnar in-memory representation, which 
> provides a few benefits:
>  - Enables vectorized execution
>  - Improves memory efficiency (memory is more tightly packed)
>  - Enables cheap serialization/zero-copy transfer with third party components 
> (e.g. numpy)
>  






[jira] [Updated] (SPARK-12555) Datasets: data is corrupted when input data is reordered

2016-01-14 Thread Tim Preece (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Preece updated SPARK-12555:
---
Description: 
Testcase
---
{code}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Dataset

case class people(age: Int, name: String)

object nameAgg extends Aggregator[people, String, String] {
  def zero: String = ""
  def reduce(b: String, a: people): String = a.name + b
  def merge(b1: String, b2: String): String = b1 + b2
  def finish(r: String): String = r
}

object DataSetAgg {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("DataSetAgg")
    val spark = new SparkContext(conf)
    val sqlContext = new SQLContext(spark)
    import sqlContext.implicits._

    val peopleds: Dataset[people] =
      sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people]
    peopleds.groupBy(_.age).agg(nameAgg.toColumn).show()
  }
}
{code}
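
For reference, nameAgg just concatenates the names within each age group, so a 
correct run would be expected to show something like the following ( inferred 
from the aggregator definition, not taken from an actual run, with the column 
naming as reported on the Big Endian run below ):

{noformat}
+----------+------------------+
|     value|nameAgg$(name,age)|
+----------+------------------+
|1279869254|        Tim Preece|
+----------+------------------+
{noformat}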

Result ( on a Little Endian Platform )

+----------+------+
|        _1|    _2|
+----------+------+
|1279869254|FAILTi|
+----------+------+

Explanation
---
Internally the String variable in the unsafe row is not updated after an unsafe 
row join operation.
The displayed string is corrupted and shows part of the integer ( interpreted 
as a string ) along with "Ti".
The column names also look incorrect on a Little Endian platform.

Result ( on a Big Endian Platform )
+----------+------------------+
|     value|nameAgg$(name,age)|
+----------+------------------+
|1279869254|            LIAFTi|
+----------+------------------+
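
( The age value appears to have been chosen so that its bytes are printable: 
1279869254 is 0x4C494146, the ASCII bytes 'L' 'I' 'A' 'F'. Laid out little 
endian they read "FAIL", big endian "LIAF", which is exactly the garbage shown 
above, followed by "Ti" from "Tim Preece". A quick check: )

{code}
// 1279869254 in hex is 0x4C494146; little endian its bytes spell "FAIL".
val bytes = java.nio.ByteBuffer.allocate(4)
  .order(java.nio.ByteOrder.LITTLE_ENDIAN)
  .putInt(1279869254)
  .array()
println(new String(bytes, "US-ASCII"))  // prints "FAIL"
{code}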

The following Unit test also fails ( but only explicitly on a Big Endian 
platform )

org.apache.spark.sql.DatasetAggregatorSuite

- typed aggregation: class input with reordering *** FAILED ***
  Results do not match for query:
  == Parsed Logical Plan ==
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Analyzed Logical Plan ==
  value: string, ClassInputAgg$(b,a): int
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Optimized Logical Plan ==
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Physical Plan ==
  TungstenAggregate(key=[value#748], 
functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 
output=[value#748,ClassInputAgg$(b,a)#762])
  +- TungstenExchange hashpartitioning(value#748,5), None
 +- TungstenAggregate(key=[value#748], 
functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], 
output=[value#748,value#758])
+- !AppendColumns , class[a[0]: int, b[0]: string], 
class[value[0]: string], [value#748]
   +- Project [one AS b#650,1 AS a#651]
  +- Scan OneRowRelation[]
  == Results ==
  !== Correct Answer - 1 ==   == Spark Answer - 1 ==
  ![one,1]                    [one,9] (QueryTest.scala:127)


  was:
Testcase
---
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Dataset

case class people(age: Int, name: String)

object nameAgg extends Aggregator[people, String, String] {
  def zero: String = ""
  def reduce(b: String, a: people): String = a.name + b
  def merge(b1: String, b2: String): String = b1 + b2
  def finish(r: String): String = r
}

object DataSetAgg {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("DataSetAgg")
    val spark = new SparkContext(conf)
    val sqlContext = new SQLContext(spark)
    import sqlContext.implicits._

    val peopleds: Dataset[people] =
      sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people]
    peopleds.groupBy(_.age).agg(nameAgg.toColumn).show()
  }
}

Result ( on a Little Endian Platform )

+----------+------+
|        _1|    _2|
+----------+------+
|1279869254|FAILTi|
+----------+------+

Explanation
---
Internally the String variable in the unsafe row is not updated after an unsafe 
row join operation.
The displayed string is corrupted and shows part of the integer ( interpreted 
as a string ) 

[jira] [Updated] (SPARK-12555) Datasets: data is corrupted when input data is reordered

2016-01-14 Thread Tim Preece (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Preece updated SPARK-12555:
---
Description: 
Testcase
---
{code}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Dataset

case class people(age: Int, name: String)

object nameAgg extends Aggregator[people, String, String] {
  def zero: String = ""
  def reduce(b: String, a: people): String = a.name + b
  def merge(b1: String, b2: String): String = b1 + b2
  def finish(r: String): String = r
}

object DataSetAgg {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("DataSetAgg")
    val spark = new SparkContext(conf)
    val sqlContext = new SQLContext(spark)
    import sqlContext.implicits._

    val peopleds: Dataset[people] =
      sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people]
    peopleds.groupBy(_.age).agg(nameAgg.toColumn).show()
  }
}
{code}

Result ( on a Little Endian Platform )

{noformat}
+----------+------+
|        _1|    _2|
+----------+------+
|1279869254|FAILTi|
+----------+------+
{noformat}

Explanation
---
Internally the String variable in the unsafe row is not updated after an unsafe 
row join operation.
The displayed string is corrupted and shows part of the integer ( interpreted 
as a string ) along with "Ti".
The column names also look incorrect on a Little Endian platform.

Result ( on a Big Endian Platform )
{noformat}
+----------+------------------+
|     value|nameAgg$(name,age)|
+----------+------------------+
|1279869254|            LIAFTi|
+----------+------------------+
{noformat}

The following Unit test also fails ( but only explicitly on a Big Endian 
platform )

org.apache.spark.sql.DatasetAggregatorSuite

- typed aggregation: class input with reordering *** FAILED ***
  Results do not match for query:
  == Parsed Logical Plan ==
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Analyzed Logical Plan ==
  value: string, ClassInputAgg$(b,a): int
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Optimized Logical Plan ==
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Physical Plan ==
  TungstenAggregate(key=[value#748], 
functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 
output=[value#748,ClassInputAgg$(b,a)#762])
  +- TungstenExchange hashpartitioning(value#748,5), None
 +- TungstenAggregate(key=[value#748], 
functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], 
output=[value#748,value#758])
+- !AppendColumns , class[a[0]: int, b[0]: string], 
class[value[0]: string], [value#748]
   +- Project [one AS b#650,1 AS a#651]
  +- Scan OneRowRelation[]
  == Results ==
  !== Correct Answer - 1 ==   == Spark Answer - 1 ==
  ![one,1]                    [one,9] (QueryTest.scala:127)


  was:
Testcase
---
{code}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Dataset

case class people(age: Int, name: String)

object nameAgg extends Aggregator[people, String, String] {
  def zero: String = ""
  def reduce(b: String, a: people): String = a.name + b
  def merge(b1: String, b2: String): String = b1 + b2
  def finish(r: String): String = r
}

object DataSetAgg {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("DataSetAgg")
    val spark = new SparkContext(conf)
    val sqlContext = new SQLContext(spark)
    import sqlContext.implicits._

    val peopleds: Dataset[people] =
      sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people]
    peopleds.groupBy(_.age).agg(nameAgg.toColumn).show()
  }
}
{code}

Result ( on a Little Endian Platform )

+----------+------+
|        _1|    _2|
+----------+------+
|1279869254|FAILTi|
+----------+------+

Explanation
---
Internally the String variable in the unsafe row is not updated after an unsafe 
row join operation.
The displayed string is corrupted 

[jira] [Updated] (SPARK-12555) Datasets: data is corrupted when input data is reordered

2016-01-14 Thread Tim Preece (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Preece updated SPARK-12555:
---
Description: 
Testcase
---
{code}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Dataset

case class people(age: Int, name: String)

object nameAgg extends Aggregator[people, String, String] {
  def zero: String = ""
  def reduce(b: String, a: people): String = a.name + b
  def merge(b1: String, b2: String): String = b1 + b2
  def finish(r: String): String = r
}

object DataSetAgg {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("DataSetAgg")
    val spark = new SparkContext(conf)
    val sqlContext = new SQLContext(spark)
    import sqlContext.implicits._

    val peopleds: Dataset[people] =
      sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people]
    peopleds.groupBy(_.age).agg(nameAgg.toColumn).show()
  }
}
{code}

Result ( on a Little Endian Platform )

{noformat}
+----------+------+
|        _1|    _2|
+----------+------+
|1279869254|FAILTi|
+----------+------+
{noformat}

Explanation
---
Internally the String variable in the unsafe row is not updated after an unsafe 
row join operation.
The displayed string is corrupted and shows part of the integer ( interpreted 
as a string ) along with "Ti".
The column names also look different on a Little Endian platform.

Result ( on a Big Endian Platform )
{noformat}
+----------+------------------+
|     value|nameAgg$(name,age)|
+----------+------------------+
|1279869254|            LIAFTi|
+----------+------------------+
{noformat}

The following Unit test also fails ( but only explicitly on a Big Endian 
platform )

org.apache.spark.sql.DatasetAggregatorSuite

- typed aggregation: class input with reordering *** FAILED ***
  Results do not match for query:
  == Parsed Logical Plan ==
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Analyzed Logical Plan ==
  value: string, ClassInputAgg$(b,a): int
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Optimized Logical Plan ==
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Physical Plan ==
  TungstenAggregate(key=[value#748], 
functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 
output=[value#748,ClassInputAgg$(b,a)#762])
  +- TungstenExchange hashpartitioning(value#748,5), None
 +- TungstenAggregate(key=[value#748], 
functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], 
output=[value#748,value#758])
+- !AppendColumns , class[a[0]: int, b[0]: string], 
class[value[0]: string], [value#748]
   +- Project [one AS b#650,1 AS a#651]
  +- Scan OneRowRelation[]
  == Results ==
  !== Correct Answer - 1 ==   == Spark Answer - 1 ==
  ![one,1]                    [one,9] (QueryTest.scala:127)


  was:
Testcase
---
{code}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Dataset

case class people(age: Int, name: String)

object nameAgg extends Aggregator[people, String, String] {
  def zero: String = ""
  def reduce(b: String, a: people): String = a.name + b
  def merge(b1: String, b2: String): String = b1 + b2
  def finish(r: String): String = r
}

object DataSetAgg {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("DataSetAgg")
    val spark = new SparkContext(conf)
    val sqlContext = new SQLContext(spark)
    import sqlContext.implicits._

    val peopleds: Dataset[people] =
      sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people]
    peopleds.groupBy(_.age).agg(nameAgg.toColumn).show()
  }
}
{code}

Result ( on a Little Endian Platform )

{noformat}
+----------+------+
|        _1|    _2|
+----------+------+
|1279869254|FAILTi|
+----------+------+
{noformat}

Explanation
---
Internally the String variable in the unsafe row is not updated after an unsafe 
row join operation.
The displayed 

[jira] [Updated] (SPARK-12555) DatasetAggregatorSuite fails on big-endian platforms

2016-01-14 Thread Tim Preece (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Preece updated SPARK-12555:
---
Description: 
Testcase
---
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Dataset

case class people(age: Int, name: String)

object nameAgg extends Aggregator[people, String, String] {
  def zero: String = ""
  def reduce(b: String, a: people): String = a.name + b
  def merge(b1: String, b2: String): String = b1 + b2
  def finish(r: String): String = r
}

object DataSetAgg {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("DataSetAgg")
    val spark = new SparkContext(conf)
    val sqlContext = new SQLContext(spark)
    import sqlContext.implicits._

    val peopleds: Dataset[people] =
      sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people]
    peopleds.groupBy(_.age).agg(nameAgg.toColumn).show()
  }
}

Result ( on a Little Endian Platform )

+----------+------+
|        _1|    _2|
+----------+------+
|1279869254|FAILTi|
+----------+------+

Explanation
---
Internally the String variable in the unsafe row is not updated after an unsafe 
row join operation.
The displayed string is corrupted and shows part of the integer ( interpreted 
as a string ) along with "Ti".
The column names also look incorrect on a Little Endian platform.

Result ( on a Big Endian Platform )
+----------+------------------+
|     value|nameAgg$(name,age)|
+----------+------------------+
|1279869254|            LIAFTi|
+----------+------------------+

The following Unit test also fails ( but only explicitly on a Big Endian 
platform )

org.apache.spark.sql.DatasetAggregatorSuite

- typed aggregation: class input with reordering *** FAILED ***
  Results do not match for query:
  == Parsed Logical Plan ==
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Analyzed Logical Plan ==
  value: string, ClassInputAgg$(b,a): int
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Optimized Logical Plan ==
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Physical Plan ==
  TungstenAggregate(key=[value#748], 
functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 
output=[value#748,ClassInputAgg$(b,a)#762])
  +- TungstenExchange hashpartitioning(value#748,5), None
 +- TungstenAggregate(key=[value#748], 
functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], 
output=[value#748,value#758])
+- !AppendColumns , class[a[0]: int, b[0]: string], 
class[value[0]: string], [value#748]
   +- Project [one AS b#650,1 AS a#651]
  +- Scan OneRowRelation[]
  == Results ==
  !== Correct Answer - 1 ==   == Spark Answer - 1 ==
  ![one,1]                    [one,9] (QueryTest.scala:127)


  was:
org.apache.spark.sql.DatasetAggregatorSuite

- typed aggregation: class input with reordering *** FAILED ***
  Results do not match for query:
  == Parsed Logical Plan ==
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Analyzed Logical Plan ==
  value: string, ClassInputAgg$(b,a): int
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Optimized Logical Plan ==
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Physical Plan ==
  TungstenAggregate(key=[value#748], 
functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 

[jira] [Updated] (SPARK-12555) Datasets: data is corrupted when input data is reordered

2016-01-14 Thread Tim Preece (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Preece updated SPARK-12555:
---
Summary: Datasets: data is corrupted when input data is reordered  (was: 
DatasetAggregatorSuite fails on big-endian platforms)

> Datasets: data is corrupted when input data is reordered
> --------------------------------------------------------
>
> Key: SPARK-12555
> URL: https://issues.apache.org/jira/browse/SPARK-12555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: ALL platforms ( although test only explicitly fails on 
> Big Endian platforms ).
>Reporter: Tim Preece
>
> Testcase
> ---
> import org.apache.spark.sql.expressions.Aggregator
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.Dataset
> case class people(age: Int, name: String)
> object nameAgg extends Aggregator[people, String, String] {
>   def zero: String = ""
>   def reduce(b: String, a: people): String = a.name + b
>   def merge(b1: String, b2: String): String = b1 + b2
>   def finish(r: String): String = r
> }
> object DataSetAgg {
>   def main(args: Array[String]) {
>     val conf = new SparkConf().setAppName("DataSetAgg")
>     val spark = new SparkContext(conf)
>     val sqlContext = new SQLContext(spark)
>     import sqlContext.implicits._
>     val peopleds: Dataset[people] =
>       sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people]
>     peopleds.groupBy(_.age).agg(nameAgg.toColumn).show()
>   }
> }
> Result ( on a Little Endian Platform )
> 
> +----------+------+
> |        _1|    _2|
> +----------+------+
> |1279869254|FAILTi|
> +----------+------+
> Explanation
> ---
> Internally the String variable in the unsafe row is not updated after an 
> unsafe row join operation.
> The displayed string is corrupted and shows part of the integer ( interpreted 
> as a string ) along with "Ti".
> The column names also look incorrect on a Little Endian platform.
> Result ( on a Big Endian Platform )
> +----------+------------------+
> |     value|nameAgg$(name,age)|
> +----------+------------------+
> |1279869254|            LIAFTi|
> +----------+------------------+
> The following Unit test also fails ( but only explicitly on a Big Endian 
> platform )
> org.apache.spark.sql.DatasetAggregatorSuite
> - typed aggregation: class input with reordering *** FAILED ***
>   Results do not match for query:
>   == Parsed Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Analyzed Logical Plan ==
>   value: string, ClassInputAgg$(b,a): int
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Optimized Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Physical Plan ==
>   TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 
> output=[value#748,ClassInputAgg$(b,a)#762])
>   +- TungstenExchange hashpartitioning(value#748,5), None
>  +- TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], 
> output=[value#748,value#758])
> +- !AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>+- Project [one AS b#650,1 AS a#651]
>   +- Scan OneRowRelation[]
>   == Results ==
>   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
>   ![one,1]                    [one,9] (QueryTest.scala:127)






[jira] [Updated] (SPARK-12555) Datasets: data is corrupted when input data is reordered

2016-01-14 Thread Tim Preece (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Preece updated SPARK-12555:
---
Description: 
Testcase
---
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Dataset

case class people(age: Int, name: String)

object nameAgg extends Aggregator[people, String, String] {
  def zero: String = ""
  def reduce(b: String, a: people): String = a.name + b
  def merge(b1: String, b2: String): String = b1 + b2
  def finish(r: String): String = r
}

object DataSetAgg {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("DataSetAgg")
    val spark = new SparkContext(conf)
    val sqlContext = new SQLContext(spark)
    import sqlContext.implicits._

    val peopleds: Dataset[people] =
      sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people]
    peopleds.groupBy(_.age).agg(nameAgg.toColumn).show()
  }
}

Result ( on a Little Endian Platform )

+----------+------+
|        _1|    _2|
+----------+------+
|1279869254|FAILTi|
+----------+------+

Explanation
---
Internally the String variable in the unsafe row is not updated after an unsafe 
row join operation.
The displayed string is corrupted and shows part of the integer ( interpreted 
as a string ) along with "Ti".
The column names also look incorrect on a Little Endian platform.

Result ( on a Big Endian Platform )
+----------+------------------+
|     value|nameAgg$(name,age)|
+----------+------------------+
|1279869254|            LIAFTi|
+----------+------------------+

The following Unit test also fails ( but only explicitly on a Big Endian 
platform )

org.apache.spark.sql.DatasetAggregatorSuite

- typed aggregation: class input with reordering *** FAILED ***
  Results do not match for query:
  == Parsed Logical Plan ==
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Analyzed Logical Plan ==
  value: string, ClassInputAgg$(b,a): int
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Optimized Logical Plan ==
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Physical Plan ==
  TungstenAggregate(key=[value#748], 
functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 
output=[value#748,ClassInputAgg$(b,a)#762])
  +- TungstenExchange hashpartitioning(value#748,5), None
 +- TungstenAggregate(key=[value#748], 
functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], 
output=[value#748,value#758])
+- !AppendColumns , class[a[0]: int, b[0]: string], 
class[value[0]: string], [value#748]
   +- Project [one AS b#650,1 AS a#651]
  +- Scan OneRowRelation[]
  == Results ==
  !== Correct Answer - 1 ==   == Spark Answer - 1 ==
  ![one,1]                    [one,9] (QueryTest.scala:127)


  was:
Testcase
---
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Dataset

case class people(age: Int, name: String)

object nameAgg extends Aggregator[people, String, String] {
  def zero: String = ""
  def reduce(b: String, a: people): String = a.name + b
  def merge(b1: String, b2: String): String = b1 + b2
  def finish(r: String): String = r
}

object DataSetAgg {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("DataSetAgg")
    val spark = new SparkContext(conf)
    val sqlContext = new SQLContext(spark)
    import sqlContext.implicits._

    val peopleds: Dataset[people] =
      sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people]
    peopleds.groupBy(_.age).agg(nameAgg.toColumn).show()
  }
}

Result ( on a Little Endian Platform )

+----------+------+
|        _1|    _2|
+----------+------+
|1279869254|FAILTi|
+----------+------+

Explanation
---
Internally the String variable in the unsafe row is not updated after an unsafe 
row join operation.
The displayed string is corrupted and shows part of the integer ( interpreted 
as a string ) along with "Ti"

[jira] [Updated] (SPARK-12555) Datasets: data is corrupted when input data is reordered

2016-01-14 Thread Tim Preece (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Preece updated SPARK-12555:
---
Environment: ALL platforms on 1.6  (was: ALL platforms ( although test only 
explicitly fails on Big Endian platforms ).)

> Datasets: data is corrupted when input data is reordered
> --------------------------------------------------------
>
> Key: SPARK-12555
> URL: https://issues.apache.org/jira/browse/SPARK-12555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: ALL platforms on 1.6
>Reporter: Tim Preece
>
> Testcase
> ---
> import org.apache.spark.sql.expressions.Aggregator
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.Dataset
> case class people(age: Int, name: String)
> object nameAgg extends Aggregator[people, String, String] {
>   def zero: String = ""
>   def reduce(b: String, a: people): String = a.name + b
>   def merge(b1: String, b2: String): String = b1 + b2
>   def finish(r: String): String = r
> }
> object DataSetAgg {
>   def main(args: Array[String]) {
>     val conf = new SparkConf().setAppName("DataSetAgg")
>     val spark = new SparkContext(conf)
>     val sqlContext = new SQLContext(spark)
>     import sqlContext.implicits._
>     val peopleds: Dataset[people] =
>       sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people]
>     peopleds.groupBy(_.age).agg(nameAgg.toColumn).show()
>   }
> }
> Result ( on a Little Endian Platform )
> 
> +----------+------+
> |        _1|    _2|
> +----------+------+
> |1279869254|FAILTi|
> +----------+------+
> Explanation
> ---
> Internally the String variable in the unsafe row is not updated after an 
> unsafe row join operation.
> The displayed string is corrupted and shows part of the integer ( interpreted 
> as a string ) along with "Ti".
> The column names also look incorrect on a Little Endian platform.
> Result ( on a Big Endian Platform )
> +----------+------------------+
> |     value|nameAgg$(name,age)|
> +----------+------------------+
> |1279869254|            LIAFTi|
> +----------+------------------+
> The following Unit test also fails ( but only explicitly on a Big Endian 
> platform )
> org.apache.spark.sql.DatasetAggregatorSuite
> - typed aggregation: class input with reordering *** FAILED ***
>   Results do not match for query:
>   == Parsed Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Analyzed Logical Plan ==
>   value: string, ClassInputAgg$(b,a): int
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Optimized Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Physical Plan ==
>   TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 
> output=[value#748,ClassInputAgg$(b,a)#762])
>   +- TungstenExchange hashpartitioning(value#748,5), None
>  +- TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], 
> output=[value#748,value#758])
> +- !AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>+- Project [one AS b#650,1 AS a#651]
>   +- Scan OneRowRelation[]
>   == Results ==
>   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
>   ![one,1]                    [one,9] (QueryTest.scala:127)






[jira] [Commented] (SPARK-12778) Use of Java Unsafe should take endianness into account

2016-01-12 Thread Tim Preece (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15094128#comment-15094128
 ] 

Tim Preece commented on SPARK-12778:


https://issues.apache.org/jira/browse/SPARK-12555 is not related, and in fact 
is not an endianness problem.

However, whilst investigating 12555 we did wonder how/if a mixed-endian Spark 
cluster could work, given that an unsafe row mixes writing integers and 
reading bytes.
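
The corrupted values in the report quoted below are consistent with plain 
byte swaps of 8-byte longs: Long.reverseBytes(19L) is 1369094286720630784, and 
Long.reverseBytes(1L) is 72057594037927936 ( 2^56 ). A minimal sketch of the 
mismatch, assuming one side writes a long in its native order and the other 
side reads it in the opposite order:

{code}
import java.nio.{ByteBuffer, ByteOrder}

object MixedEndianDemo {
  def main(args: Array[String]): Unit = {
    val buf = ByteBuffer.allocate(8)
    buf.order(ByteOrder.LITTLE_ENDIAN).putLong(0, 19L)    // LE worker writes
    val seen = buf.order(ByteOrder.BIG_ENDIAN).getLong(0) // BE master reads
    println(seen)                     // 1369094286720630784, as in the report
    println(java.lang.Long.reverseBytes(1L))              // 72057594037927936
  }
}
{code}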

> Use of Java Unsafe should take endianness into account
> ------------------------------------------------------
>
> Key: SPARK-12778
> URL: https://issues.apache.org/jira/browse/SPARK-12778
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Reporter: Ted Yu
>
> In Platform.java, methods of Java Unsafe are called directly without 
> considering endianness.
> In the thread 'Tungsten in a mixed endian environment', Adam Roberts reported 
> data corruption when "spark.sql.tungsten.enabled" is enabled in a mixed 
> endian environment.
> Platform.java should take endianness into account.
> Below is a copy of Adam's report:
> I've been experimenting with DataFrame operations in a mixed endian 
> environment - a big endian master with little endian workers. With tungsten 
> enabled I'm encountering data corruption issues. 
> For example, with this simple test code: 
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark._
> import org.apache.spark.sql.SQLContext
> object SimpleSQL {
>   def main(args: Array[String]): Unit = {
>     if (args.length != 1) {
>       println("Not enough args, you need to specify the master url")
>     }
>     val masterURL = args(0)
>     println("Setting up Spark context at: " + masterURL)
>     val sparkConf = new SparkConf
>     val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf)
>     println("Performing SQL tests")
>     val sqlContext = new SQLContext(sc)
>     println("SQL context set up")
>     val df = sqlContext.read.json("/tmp/people.json")
>     df.show()
>     println("Selecting everyone's age and adding one to it")
>     df.select(df("name"), df("age") + 1).show()
>     println("Showing all people over the age of 21")
>     df.filter(df("age") > 21).show()
>     println("Counting people by age")
>     df.groupBy("age").count().show()
>   }
> }
> {code}
> Instead of getting 
> {code}
> +----+-----+
> | age|count|
> +----+-----+
> |null|    1|
> |  19|    1|
> |  30|    1|
> +----+-----+
> {code}
> I get the following with my mixed endian set up: 
> {code}
> +-------------------+-----------------+
> |                age|            count|
> +-------------------+-----------------+
> |               null|                1|
> |1369094286720630784|72057594037927936|
> |                 30|                1|
> +-------------------+-----------------+
> {code}
> and on another run: 
> {code}
> +---+-----------------+
> |age|            count|
> +---+-----------------+
> |  0|72057594037927936|
> | 19|                1|
> {code}






[jira] [Commented] (SPARK-12778) Use of Java Unsafe should take endianness into account

2016-01-12 Thread Tim Preece (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15094185#comment-15094185
 ] 

Tim Preece commented on SPARK-12778:


The testcase in 12555 only fails on big-endian ( even though I can see the 
problem in a debugger on little endian ).

So perhaps it's best if I create a testcase which explicitly fails on both BE 
and LE, and then update 12555.

> Use of Java Unsafe should take endianness into account
> ------------------------------------------------------
>
> Key: SPARK-12778
> URL: https://issues.apache.org/jira/browse/SPARK-12778
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Reporter: Ted Yu
>
> In Platform.java, methods of Java Unsafe are called directly without 
> considering endianness.
> In the thread 'Tungsten in a mixed endian environment', Adam Roberts reported 
> data corruption when "spark.sql.tungsten.enabled" is enabled in a mixed 
> endian environment.
> Platform.java should take endianness into account.
> Below is a copy of Adam's report:
> I've been experimenting with DataFrame operations in a mixed endian 
> environment - a big endian master with little endian workers. With tungsten 
> enabled I'm encountering data corruption issues. 
> For example, with this simple test code: 
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark._
> import org.apache.spark.sql.SQLContext
> object SimpleSQL {
>   def main(args: Array[String]): Unit = {
>     if (args.length != 1) {
>       println("Not enough args, you need to specify the master url")
>     }
>     val masterURL = args(0)
>     println("Setting up Spark context at: " + masterURL)
>     val sparkConf = new SparkConf
>     val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf)
>     println("Performing SQL tests")
>     val sqlContext = new SQLContext(sc)
>     println("SQL context set up")
>     val df = sqlContext.read.json("/tmp/people.json")
>     df.show()
>     println("Selecting everyone's age and adding one to it")
>     df.select(df("name"), df("age") + 1).show()
>     println("Showing all people over the age of 21")
>     df.filter(df("age") > 21).show()
>     println("Counting people by age")
>     df.groupBy("age").count().show()
>   }
> }
> {code}
> Instead of getting 
> {code}
> +----+-----+
> | age|count|
> +----+-----+
> |null|    1|
> |  19|    1|
> |  30|    1|
> +----+-----+
> {code}
> I get the following with my mixed endian set up: 
> {code}
> +-------------------+-----------------+
> |                age|            count|
> +-------------------+-----------------+
> |               null|                1|
> |1369094286720630784|72057594037927936|
> |                 30|                1|
> +-------------------+-----------------+
> {code}
> and on another run: 
> {code}
> +---+-----------------+
> |age|            count|
> +---+-----------------+
> |  0|72057594037927936|
> | 19|                1|
> {code}






[jira] [Updated] (SPARK-12555) Build Failure on 1.6 ( DatasetAggregatorSuite )

2015-12-29 Thread Tim Preece (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Preece updated SPARK-12555:
---
Summary: Build Failure on 1.6 ( DatasetAggregatorSuite )  (was: Build 
Failure on 1.6)

> Build Failure on 1.6 ( DatasetAggregatorSuite )
> -----------------------------------------------
>
> Key: SPARK-12555
> URL: https://issues.apache.org/jira/browse/SPARK-12555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: ALL platforms ( although test only explicitly fails on 
> Big Endian platforms ).
>Reporter: Tim Preece
>Priority: Blocker
>
> org.apache.spark.sql.DatasetAggregatorSuite
> - typed aggregation: class input with reordering *** FAILED ***
>   Results do not match for query:
>   == Parsed Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Analyzed Logical Plan ==
>   value: string, ClassInputAgg$(b,a): int
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Optimized Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Physical Plan ==
>   TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 
> output=[value#748,ClassInputAgg$(b,a)#762])
>   +- TungstenExchange hashpartitioning(value#748,5), None
>  +- TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], 
> output=[value#748,value#758])
> +- !AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>+- Project [one AS b#650,1 AS a#651]
>   +- Scan OneRowRelation[]
>   == Results ==
>   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
>   ![one,1]                    [one,9] (QueryTest.scala:127)






[jira] [Commented] (SPARK-12555) Build Failure on 1.6

2015-12-29 Thread Tim Preece (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073795#comment-15073795
 ] 

Tim Preece commented on SPARK-12555:


Analysis shows that this test fails because of data corruption:

There is a mismatch between the unsafe row (string,int) and the schema 
(int,string), presumably because the test involves reordering of columns.

Subsequently, when joining (string,int) + (string), the code incorrectly 
patches the int value with the offset change of the first String.

This data corruption occurs on ALL platforms, and the offset part of the first 
string is always incorrect. On Big Endian platforms the value of the integer 
is also corrupted. This is simply due to the location of the 4-byte integer 
within the 8-byte unsafe row slot.
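
A sketch of the arithmetic ( illustrative only, using a ByteBuffer to stand in 
for the row's memory ): an unsafe row packs a string field as 
( offset << 32 | size ) into an 8-byte slot, and the join fixes up offsets by 
adding ( delta << 32 ) to each slot it believes holds a string. If the slot 
actually holds the 4-byte int 1 and the offset delta is 8, the int reads back 
unchanged on Little Endian but as 9 on Big Endian, matching the [one,1] vs 
[one,9] failure below:

{code}
import java.nio.{ByteBuffer, ByteOrder}

object SlotPatchDemo {
  def patched(order: ByteOrder): Int = {
    val slot = ByteBuffer.allocate(8).order(order)
    slot.putInt(0, 1)                             // int field at slot start
    slot.putLong(0, slot.getLong(0) + (8L << 32)) // bogus string-offset patch
    slot.getInt(0)                                // read the int back
  }

  def main(args: Array[String]): Unit = {
    println(patched(ByteOrder.LITTLE_ENDIAN)) // 1 ( corruption hidden )
    println(patched(ByteOrder.BIG_ENDIAN))    // 9 ( corruption visible )
  }
}
{code}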

> Build Failure on 1.6
> --------------------
>
> Key: SPARK-12555
> URL: https://issues.apache.org/jira/browse/SPARK-12555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: ALL platforms ( although test only explicitly fails on 
> Big Endian platforms ).
>Reporter: Tim Preece
>Priority: Blocker
>
> org.apache.spark.sql.DatasetAggregatorSuite
> - typed aggregation: class input with reordering *** FAILED ***
>   Results do not match for query:
>   == Parsed Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Analyzed Logical Plan ==
>   value: string, ClassInputAgg$(b,a): int
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Optimized Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Physical Plan ==
>   TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 
> output=[value#748,ClassInputAgg$(b,a)#762])
>   +- TungstenExchange hashpartitioning(value#748,5), None
>  +- TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], 
> output=[value#748,value#758])
> +- !AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>+- Project [one AS b#650,1 AS a#651]
>   +- Scan OneRowRelation[]
>   == Results ==
>   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
>   ![one,1]                    [one,9] (QueryTest.scala:127)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12555) Build Failure on 1.6 ( DatasetAggregatorSuite )

2015-12-29 Thread Tim Preece (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Preece updated SPARK-12555:
---
Priority: Major  (was: Blocker)

> Build Failure on 1.6 ( DatasetAggregatorSuite )
> ---
>
> Key: SPARK-12555
> URL: https://issues.apache.org/jira/browse/SPARK-12555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: ALL platforms ( although test only explicitly fails on 
> Big Endian platforms ).
>Reporter: Tim Preece
>
> org.apache.spark.sql.DatasetAggregatorSuite
> - typed aggregation: class input with reordering *** FAILED ***
>   Results do not match for query:
>   == Parsed Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Analyzed Logical Plan ==
>   value: string, ClassInputAgg$(b,a): int
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Optimized Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Physical Plan ==
>   TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 
> output=[value#748,ClassInputAgg$(b,a)#762])
>   +- TungstenExchange hashpartitioning(value#748,5), None
>  +- TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], 
> output=[value#748,value#758])
> +- !AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>+- Project [one AS b#650,1 AS a#651]
>   +- Scan OneRowRelation[]
>   == Results ==
>   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
>   ![one,1][one,9] (QueryTest.scala:127)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12555) DatasetAggregatorSuite fails on big-endian platforms

2015-12-29 Thread Tim Preece (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073868#comment-15073868
 ] 

Tim Preece commented on SPARK-12555:


Yes, it is a test failure. Thanks for updating the title.

> DatasetAggregatorSuite fails on big-endian platforms
> 
>
> Key: SPARK-12555
> URL: https://issues.apache.org/jira/browse/SPARK-12555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: ALL platforms ( although test only explicitly fails on 
> Big Endian platforms ).
>Reporter: Tim Preece
>
> org.apache.spark.sql.DatasetAggregatorSuite
> - typed aggregation: class input with reordering *** FAILED ***
>   Results do not match for query:
>   == Parsed Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Analyzed Logical Plan ==
>   value: string, ClassInputAgg$(b,a): int
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Optimized Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Physical Plan ==
>   TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 
> output=[value#748,ClassInputAgg$(b,a)#762])
>   +- TungstenExchange hashpartitioning(value#748,5), None
>  +- TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], 
> output=[value#748,value#758])
> +- !AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>+- Project [one AS b#650,1 AS a#651]
>   +- Scan OneRowRelation[]
>   == Results ==
>   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
>   ![one,1][one,9] (QueryTest.scala:127)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12555) Build Failure on 1.6

2015-12-29 Thread Tim Preece (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Preece updated SPARK-12555:
---
Summary: Build Failure on 1.6  (was: Build Failure on 1.6 RC4)

> Build Failure on 1.6
> 
>
> Key: SPARK-12555
> URL: https://issues.apache.org/jira/browse/SPARK-12555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: ALL platforms ( although test only explicitly fails on 
> Big Endian platforms ).
>Reporter: Tim Preece
>Priority: Blocker
>
> org.apache.spark.sql.DatasetAggregatorSuite
> - typed aggregation: class input with reordering *** FAILED ***
>   Results do not match for query:
>   == Parsed Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Analyzed Logical Plan ==
>   value: string, ClassInputAgg$(b,a): int
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Optimized Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Physical Plan ==
>   TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 
> output=[value#748,ClassInputAgg$(b,a)#762])
>   +- TungstenExchange hashpartitioning(value#748,5), None
>  +- TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], 
> output=[value#748,value#758])
> +- !AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>+- Project [one AS b#650,1 AS a#651]
>   +- Scan OneRowRelation[]
>   == Results ==
>   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
>   ![one,1][one,9] (QueryTest.scala:127)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12319) Address endian specific problems surfaced in 1.6

2015-12-29 Thread Tim Preece (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073795#comment-15073795
 ] 

Tim Preece commented on SPARK-12319:


Michael,
Since this JIRA's description is not quite right and involves two distinct 
problems, I have created a new JIRA 
https://issues.apache.org/jira/browse/SPARK-12555 to address the 
DatasetAggregatorSuite failure.

This is important to us since it causes an explicit build failure on our Big 
Endian platforms.

> Address endian specific problems surfaced in 1.6
> 
>
> Key: SPARK-12319
> URL: https://issues.apache.org/jira/browse/SPARK-12319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Problems apparent on BE, LE could be impacted too
>Reporter: Adam Roberts
>Priority: Critical
>
> JIRA to cover endian specific problems - since testing 1.6 I've noticed 
> problems with DataFrames on BE platforms, e.g. 
> https://issues.apache.org/jira/browse/SPARK-9858
> [~joshrosen] [~yhuai]
> Current progress: using com.google.common.io.LittleEndianDataInputStream and 
> com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer 
> fixes three test failures in ExchangeCoordinatorSuite but I'm concerned 
> around performance/wider functional implications
> "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input 
> with reordering" fails as we expect "one, 1" but instead get "one, 9" - we 
> believe the issue lies within BitSetMethods.java, specifically around: return 
> (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word); 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12555) Build Failure on 1.6 ( DatasetAggregatorSuite )

2015-12-29 Thread Tim Preece (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Preece updated SPARK-12555:
---
Target Version/s:   (was: 1.6.0)

> Build Failure on 1.6 ( DatasetAggregatorSuite )
> ---
>
> Key: SPARK-12555
> URL: https://issues.apache.org/jira/browse/SPARK-12555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: ALL platforms ( although test only explicitly fails on 
> Big Endian platforms ).
>Reporter: Tim Preece
>Priority: Blocker
>
> org.apache.spark.sql.DatasetAggregatorSuite
> - typed aggregation: class input with reordering *** FAILED ***
>   Results do not match for query:
>   == Parsed Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Analyzed Logical Plan ==
>   value: string, ClassInputAgg$(b,a): int
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Optimized Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Physical Plan ==
>   TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 
> output=[value#748,ClassInputAgg$(b,a)#762])
>   +- TungstenExchange hashpartitioning(value#748,5), None
>  +- TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], 
> output=[value#748,value#758])
> +- !AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>+- Project [one AS b#650,1 AS a#651]
>   +- Scan OneRowRelation[]
>   == Results ==
>   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
>   ![one,1][one,9] (QueryTest.scala:127)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12319) Address endian specific problems surfaced in 1.6

2015-12-29 Thread Tim Preece (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073821#comment-15073821
 ] 

Tim Preece commented on SPARK-12319:


The remaining problem is ExchangeCoordinatorSuite. I don't have the right 
access to update the description or title.

> Address endian specific problems surfaced in 1.6
> 
>
> Key: SPARK-12319
> URL: https://issues.apache.org/jira/browse/SPARK-12319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Problems apparent on BE, LE could be impacted too
>Reporter: Adam Roberts
>Priority: Critical
>
> JIRA to cover endian specific problems - since testing 1.6 I've noticed 
> problems with DataFrames on BE platforms, e.g. 
> https://issues.apache.org/jira/browse/SPARK-9858
> [~joshrosen] [~yhuai]
> Current progress: using com.google.common.io.LittleEndianDataInputStream and 
> com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer 
> fixes three test failures in ExchangeCoordinatorSuite but I'm concerned 
> around performance/wider functional implications
> "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input 
> with reordering" fails as we expect "one, 1" but instead get "one, 9" - we 
> believe the issue lies within BitSetMethods.java, specifically around: return 
> (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word); 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12555) Build Failure on 1.6 RC4

2015-12-29 Thread Tim Preece (JIRA)
Tim Preece created SPARK-12555:
--

 Summary: Build Failure on 1.6 RC4
 Key: SPARK-12555
 URL: https://issues.apache.org/jira/browse/SPARK-12555
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
 Environment: ALL platforms ( although test only explicitly fails on 
Big Endian platforms ).
Reporter: Tim Preece
Priority: Blocker


org.apache.spark.sql.DatasetAggregatorSuite

- typed aggregation: class input with reordering *** FAILED ***
  Results do not match for query:
  == Parsed Logical Plan ==
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Analyzed Logical Plan ==
  value: string, ClassInputAgg$(b,a): int
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Optimized Logical Plan ==
  Aggregate [value#748], 
[value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
ClassInputAgg$(b,a)#762]
  +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: 
string], [value#748]
 +- Project [one AS b#650,1 AS a#651]
+- OneRowRelation$
  
  == Physical Plan ==
  TungstenAggregate(key=[value#748], 
functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 
output=[value#748,ClassInputAgg$(b,a)#762])
  +- TungstenExchange hashpartitioning(value#748,5), None
 +- TungstenAggregate(key=[value#748], 
functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], 
output=[value#748,value#758])
+- !AppendColumns , class[a[0]: int, b[0]: string], 
class[value[0]: string], [value#748]
   +- Project [one AS b#650,1 AS a#651]
  +- Scan OneRowRelation[]
  == Results ==
  !== Correct Answer - 1 ==   == Spark Answer - 1 ==
  ![one,1][one,9] (QueryTest.scala:127)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12555) Build Failure on 1.6 ( DatasetAggregatorSuite )

2015-12-29 Thread Tim Preece (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073805#comment-15073805
 ] 

Tim Preece commented on SPARK-12555:


Set priority to Major following guidance here - 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-JIRA

Although I would suggest that the priority could be set higher.

> Build Failure on 1.6 ( DatasetAggregatorSuite )
> ---
>
> Key: SPARK-12555
> URL: https://issues.apache.org/jira/browse/SPARK-12555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: ALL platforms ( although test only explicitly fails on 
> Big Endian platforms ).
>Reporter: Tim Preece
>
> org.apache.spark.sql.DatasetAggregatorSuite
> - typed aggregation: class input with reordering *** FAILED ***
>   Results do not match for query:
>   == Parsed Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Analyzed Logical Plan ==
>   value: string, ClassInputAgg$(b,a): int
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Optimized Logical Plan ==
>   Aggregate [value#748], 
> [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS 
> ClassInputAgg$(b,a)#762]
>   +- AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>  +- Project [one AS b#650,1 AS a#651]
> +- OneRowRelation$
>   
>   == Physical Plan ==
>   TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], 
> output=[value#748,ClassInputAgg$(b,a)#762])
>   +- TungstenExchange hashpartitioning(value#748,5), None
>  +- TungstenAggregate(key=[value#748], 
> functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], 
> output=[value#748,value#758])
> +- !AppendColumns , class[a[0]: int, b[0]: string], 
> class[value[0]: string], [value#748]
>+- Project [one AS b#650,1 AS a#651]
>   +- Scan OneRowRelation[]
>   == Results ==
>   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
>   ![one,1][one,9] (QueryTest.scala:127)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12319) Address endian specific problems surfaced in 1.6

2015-12-28 Thread Tim Preece (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073145#comment-15073145
 ] 

Tim Preece commented on SPARK-12319:


Hi,
The failing test is already checked in. It is:
"org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input 
with reordering"

The test only explicitly fails on Big Endian platforms. This is because an integer takes an 8-byte slot in the unsafe row. When the data corruption occurs, the BE integer ends up with the wrong value. I added print statements which show the data corruption on Little Endian as well; it just happens not to affect the value of the LE integer, since the LE integer sits in the other 4 bytes of the 8-byte slot.
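
A minimal sketch of that placement effect, assuming the slot is read and patched through a native-order 8-byte load (the names here are mine, not Spark's):

    import java.nio.{ByteBuffer, ByteOrder}

    // One 8-byte row slot holding the int value 1.
    val slot = ByteBuffer.allocate(8).order(ByteOrder.nativeOrder())
    slot.putInt(0, 1)                        // the int fills 4 of the 8 bytes

    // Mis-apply a string-offset patch of 8 to the high 32 bits of the word,
    // as happens when the wrong field slot is patched.
    slot.putLong(0, slot.getLong(0) + (8L << 32))

    // LE: the int sits in the low-order bytes of the word, so it still reads 1.
    // BE: the int sits in the high-order bytes, so it now reads 9.
    println(slot.getInt(0))

This matches the observed failure, where [one,1] becomes [one,9] on Big Endian.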

> Address endian specific problems surfaced in 1.6
> 
>
> Key: SPARK-12319
> URL: https://issues.apache.org/jira/browse/SPARK-12319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Problems apparent on BE, LE could be impacted too
>Reporter: Adam Roberts
>Priority: Critical
>
> JIRA to cover endian specific problems - since testing 1.6 I've noticed 
> problems with DataFrames on BE platforms, e.g. 
> https://issues.apache.org/jira/browse/SPARK-9858
> [~joshrosen] [~yhuai]
> Current progress: using com.google.common.io.LittleEndianDataInputStream and 
> com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer 
> fixes three test failures in ExchangeCoordinatorSuite but I'm concerned 
> around performance/wider functional implications
> "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input 
> with reordering" fails as we expect "one, 1" but instead get "one, 9" - we 
> believe the issue lies within BitSetMethods.java, specifically around: return 
> (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word); 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12319) Address endian specific problems surfaced in 1.6

2015-12-22 Thread Tim Preece (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068176#comment-15068176
 ] 

Tim Preece commented on SPARK-12319:


[~marmbrus]
Hi Michael,
I think this may be a problem with the new Dataset API, in particular the new "as" function of DataFrame, which I see is tagged as Experimental.

When we run the DatasetAggregatorSuite test "typed aggregation: class input with reordering", the implementation seems to get confused between the ordering of the data in the UnsafeRow (string,int) and the schema (int,string). This results in a testcase failure that shows up on BE platforms (although the data is also corrupted on LE platforms).
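
For reference, the reordering pattern the test exercises boils down to this (AggData as declared in the suite; the projection emits columns in (b: string, a: int) order while the class declares (a: int, b: string)):

    case class AggData(a: Int, b: String)                   // class order: int, string
    val ds = sql("SELECT 'one' AS b, 1 as a").as[AggData]   // column order: string, int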

At the moment I'm not sure how to fix, so any pointers would be helpful.

> Address endian specific problems surfaced in 1.6
> 
>
> Key: SPARK-12319
> URL: https://issues.apache.org/jira/browse/SPARK-12319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Problems apparent on BE, LE could be impacted too
>Reporter: Adam Roberts
>Priority: Critical
>
> JIRA to cover endian specific problems - since testing 1.6 I've noticed 
> problems with DataFrames on BE platforms, e.g. 
> https://issues.apache.org/jira/browse/SPARK-9858
> [~joshrosen] [~yhuai]
> Current progress: using com.google.common.io.LittleEndianDataInputStream and 
> com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer 
> fixes three test failures in ExchangeCoordinatorSuite but I'm concerned 
> around performance/wider functional implications
> "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input 
> with reordering" fails as we expect "one, 1" but instead get "one, 9" - we 
> believe the issue lies within BitSetMethods.java, specifically around: return 
> (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word); 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12319) Address endian specific problems surfaced in 1.6

2015-12-21 Thread Tim Preece (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066628#comment-15066628
 ] 

Tim Preece commented on SPARK-12319:


I notice that for the failing testcase the schema (for row1) mismatches the actual data in row1.
Row1 has schema:
SpecificUnsafeRowJoiner schema1
StructType(StructField(a,IntegerType,false), StructField(b,StringType,true))
But row1 holds the following data (i.e. a string followed by an int):
row1 [0,180003,1,656e6f]

So why does the schema mismatch the data?

The name of the failing test may give a clue!

test("typed aggregation: class input with reordering") {
val ds = sql("SELECT 'one' AS b, 1 as a").as[AggData]

checkAnswer(
  ds.select(ClassInputAgg.toColumn),
  1)

checkAnswer(
  ds.select(expr("avg(a)").as[Double], ClassInputAgg.toColumn),
  (1.0, 1))

checkAnswer(
  ds.groupBy(_.b).agg(ClassInputAgg.toColumn),
  ("one", 1))
  } 
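
As a reading aid for the row dump above, this is how a variable-length field's 8-byte slot unpacks, assuming UnsafeRow's (offset << 32) | length encoding (the values below are illustrative, not decoded from the dump):

    val offsetAndSize = (24L << 32) | 3L        // e.g. a 3-byte string at row offset 24
    val offset = (offsetAndSize >> 32).toInt    // 24: where the bytes live in the row
    val length = offsetAndSize.toInt            // 3: how many bytes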

> Address endian specific problems surfaced in 1.6
> 
>
> Key: SPARK-12319
> URL: https://issues.apache.org/jira/browse/SPARK-12319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Problems apparent on BE, LE could be impacted too
>Reporter: Adam Roberts
>Priority: Critical
>
> JIRA to cover endian specific problems - since testing 1.6 I've noticed 
> problems with DataFrames on BE platforms, e.g. 
> https://issues.apache.org/jira/browse/SPARK-9858
> [~joshrosen] [~yhuai]
> Current progress: using com.google.common.io.LittleEndianDataInputStream and 
> com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer 
> fixes three test failures in ExchangeCoordinatorSuite but I'm concerned 
> around performance/wider functional implications
> "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input 
> with reordering" fails as we expect "one, 1" but instead get "one, 9" - we 
> believe the issue lies within BitSetMethods.java, specifically around: return 
> (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word); 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12319) Address endian specific problems surfaced in 1.6

2015-12-17 Thread Tim Preece (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15062175#comment-15062175
 ] 

Tim Preece commented on SPARK-12319:


Hi Sean, Yin
I've started to (and continue to) investigate this DatasetAggregatorSuite failure as described above.

So far I believe:
a) the description is incorrect, and it has nothing to do with endianness or BitSetMethods.java (it just happens that we see a failure on big-endian platforms - see below);
b) the problem is probably in the codegen for unsafe row joins (GenerateUnsafeRowJoiner).

I see two UnsafeRows being joined: a (string,int) + a (string), which results in an UnsafeRow with schema (string,int,string).

When we come to update the offsets for the variable-length data (in this case for the first string), the offset is miscalculated (in updateOffset in GenerateUnsafeRowJoiner). This means the int value in the second field slot is wrongly changed; on a BE platform (for this particular testcase) it is incremented by 8. On an LE platform the value in the second field is also changed, but in a way that does not alter the value of the int. However, for both BE and LE platforms the first string variable looks bogus, with an invalid variable-length offset.
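
Conceptually the patch applied to each variable-length slot has this shape (my own restatement for illustration, not the generated code):

    // offsetAndSize packs (offset << 32) | length for variable-length fields,
    // so shifting the data region means adding the shift to the high half.
    def patchSlot(word: Long, shift: Long): Long = word + (shift << 32)

    patchSlot((24L << 32) | 3L, 8L)   // a genuine string slot: offset 24 -> 32

    // Applied to a slot that actually holds an int (because the schema and the
    // row layout disagree), the same addition lands on whichever bytes the
    // native long load puts in the high half - hence the BE-only visible corruption.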

I'm continuing to investigate ( and so could well revise the above ), but 
thought I would share my observations so far.

Also, it would be useful if you happened to have a pointer to any design documentation for UnsafeRow. For example, I wasn't sure whether all the variable-length data should go at the end of the row - that is, whether the schema for the joined row should actually have been (int,string,string).

Tim Preece

> Address endian specific problems surfaced in 1.6
> 
>
> Key: SPARK-12319
> URL: https://issues.apache.org/jira/browse/SPARK-12319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: BE platforms
>Reporter: Adam Roberts
>Priority: Critical
>
> JIRA to cover endian specific problems - since testing 1.6 I've noticed 
> problems with DataFrames on BE platforms, e.g. 
> https://issues.apache.org/jira/browse/SPARK-9858
> [~joshrosen] [~yhuai]
> Current progress: using com.google.common.io.LittleEndianDataInputStream and 
> com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer 
> fixes three test failures in ExchangeCoordinatorSuite but I'm concerned 
> around performance/wider functional implications
> "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input 
> with reordering" fails as we expect "one, 1" but instead get "one, 9" - we 
> believe the issue lies within BitSetMethods.java, specifically around: return 
> (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word); 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org