[jira] [Updated] (SPARK-13745) Support columnar in memory representation on Big Endian platforms
[ https://issues.apache.org/jira/browse/SPARK-13745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Preece updated SPARK-13745: --- Affects Version/s: 2.0.0 > Support columnar in memory representation on Big Endian platforms > - > > Key: SPARK-13745 > URL: https://issues.apache.org/jira/browse/SPARK-13745 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Tim Preece > Labels: big-endian > > SPARK-12785 introduced a columnar in memory representation. > Currently this feature is explicitly only supported on Little Endian > platforms. On Big Endian platforms the following exception is thrown: > "org.apache.commons.lang.NotImplementedException: Only little endian is > supported." > This JIRA should be used to extend support to Big Endian architectures, and > decide whether the "in memory" columnar format should be consistent with > parquet format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13745) Support columnar in memory representation on Big Endian platforms
Tim Preece created SPARK-13745: -- Summary: Support columnar in memory representation on Big Endian platforms Key: SPARK-13745 URL: https://issues.apache.org/jira/browse/SPARK-13745 Project: Spark Issue Type: Improvement Components: SQL Reporter: Tim Preece SPARK-12785 introduced a columnar in memory representation. Currently this feature is explicitly only supported on Little Endian platforms. On Big Endian platforms the following exception is thrown: "org.apache.commons.lang.NotImplementedException: Only little endian is supported." This JIRA should be used to extend support to Big Endian architectures, and decide whether the "in memory" columnar format should be consistent with parquet format.
[jira] [Updated] (SPARK-13648) org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-13648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Preece updated SPARK-13648: --- Description: When running the standard Spark unit tests on the IBM Java SDK the hive VersionsSuite fails with the following error. java.lang.NoClassDefFoundError: org.apache.hadoop.hive.cli.CliSessionState when creating Hive client using classpath: .. was: When running the standard Spark unit tests on the IBM Java SDK the hive VersionsSuite fails with the following error. {panel} java.lang.NoClassDefFoundError: org.apache.hadoop.hive.cli.CliSessionState when creating Hive client using classpath: file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/log4j_log4j-1.2.17.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.hadoop_hadoop-mapreduce-client-jobclient-2.6.0.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.directory.server_apacheds-kerberos-codec-2.0.0-M15.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.json_json-20090211.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/junit_junit-3.8.1.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/com.sun.xml.bind_jaxb-impl-2.2.3-1.jar, 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.commons_commons-math3-3.1.1.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/com.google.inject_guice-3.0.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.hadoop_hadoop-mapreduce-client-core-2.6.0.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.httpcomponents_httpclient-4.2.5.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/javax.servlet_servlet-api-2.5.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/commons-codec_commons-codec-1.4.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.iq80.snappy_snappy-0.2.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/javax.transaction_jta-1.1.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.datanucleus_datanucleus-core-3.2.2.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.avro_avro-1.7.4.jar, 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/aopalliance_aopalliance-1.0.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.directory.api_api-asn1-api-1.0.0-M20.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.httpcomponents_httpcore-4.2.5.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.codehaus.jettison_jettison-1.1.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/commons-digester_commons-digester-1.8.jar,
[jira] [Created] (SPARK-13648) org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError
Tim Preece created SPARK-13648: -- Summary: org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError Key: SPARK-13648 URL: https://issues.apache.org/jira/browse/SPARK-13648 Project: Spark Issue Type: Bug Environment: Fails on vendor specific JVMs ( e.g. IBM JVM ) Reporter: Tim Preece When running the standard Spark unit tests on the IBM Java SDK the hive VersionsSuite fails with the following error. {panel} java.lang.NoClassDefFoundError: org.apache.hadoop.hive.cli.CliSessionState when creating Hive client using classpath: file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/log4j_log4j-1.2.17.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.hadoop_hadoop-mapreduce-client-jobclient-2.6.0.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.directory.server_apacheds-kerberos-codec-2.0.0-M15.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.json_json-20090211.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/junit_junit-3.8.1.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/com.sun.xml.bind_jaxb-impl-2.2.3-1.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.commons_commons-math3-3.1.1.jar, 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/com.google.inject_guice-3.0.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.hadoop_hadoop-mapreduce-client-core-2.6.0.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.httpcomponents_httpclient-4.2.5.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/javax.servlet_servlet-api-2.5.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/commons-codec_commons-codec-1.4.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.iq80.snappy_snappy-0.2.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/javax.transaction_jta-1.1.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.datanucleus_datanucleus-core-3.2.2.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.avro_avro-1.7.4.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/aopalliance_aopalliance-1.0.jar, 
file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.directory.api_api-asn1-api-1.0.0-M20.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.apache.httpcomponents_httpcore-4.2.5.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/org.codehaus.jettison_jettison-1.1.jar, file:/home/jenkins/workspace/Spark/GIT_BRANCH/branch-1.6/SCALA_VERSION/2.10/label/AMD64/sql/hive/target/tmp/hive-v12-aebc9334-4fea-43c5-8113-f902cbdfbfdc/commons-digester_commons-digester-1.8.jar,
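The error message above shows the Hive client being created from an explicit list of jar URLs, so class resolution depends on that list and the loader's parent delegation rather than on the surrounding test classpath. The following is a minimal, hypothetical sketch (not Spark's actual IsolatedClientLoader) of how a class that is present on the application classpath can still be unresolvable inside such an isolated loader:

```java
import java.net.URL;
import java.net.URLClassLoader;

public class IsolationSketch {
    public static void main(String[] args) throws Exception {
        // A loader built only from an explicit jar list, with no application
        // classloader as parent (bootstrap classes remain visible). An empty
        // list stands in for a jar list that is missing a needed artifact.
        URL[] jars = new URL[0];
        try (URLClassLoader isolated = new URLClassLoader(jars, null)) {
            // Bootstrap classes always resolve, even with a null parent:
            System.out.println(isolated.loadClass("java.util.ArrayList").getName());
            try {
                // Not on the jar list and not delegated to, so lookup fails,
                // which a caller can then surface as a NoClassDefFoundError.
                isolated.loadClass("org.apache.hadoop.hive.cli.CliSessionState");
            } catch (ClassNotFoundException e) {
                System.out.println("not found: " + e.getMessage());
            }
        }
    }
}
```

A vendor JVM can change which jars a build resolves or how delegation behaves at the margins, which is consistent with the suite failing only on the IBM Java SDK.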
[jira] [Commented] (SPARK-12785) Implement columnar in memory representation
[ https://issues.apache.org/jira/browse/SPARK-12785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101520#comment-15101520 ] Tim Preece commented on SPARK-12785: I notice there is an explicit check for endianness in OffHeapColumnVector. Is there a technical reason why this feature is only being developed for Little Endian platforms ? I think the endianness should never matter ( providing Spark is run on a homogeneous endianness cluster ). > Implement columnar in memory representation > --- > > Key: SPARK-12785 > URL: https://issues.apache.org/jira/browse/SPARK-12785 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Nong Li >Assignee: Nong Li > Fix For: 2.0.0 > > > Tungsten can benefit from having a columnar in memory representation which > can provide a few benefits: > - Enables vectorized execution > - Improves memory efficiency (memory is more tightly packed) > - Enables cheap serialization/zero-copy transfer with third party components > (e.g. numpy)
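The kind of guard the comment refers to can be sketched with `java.nio.ByteOrder`. This is an illustrative stand-in, not the actual OffHeapColumnVector code, which per the report above throws `org.apache.commons.lang.NotImplementedException`:

```java
import java.nio.ByteOrder;

public class EndianGuard {
    // Detect the platform's native byte order at runtime.
    static boolean isLittleEndian() {
        return ByteOrder.nativeOrder().equals(ByteOrder.LITTLE_ENDIAN);
    }

    public static void main(String[] args) {
        if (!isLittleEndian()) {
            // The behaviour the issues describe: big-endian platforms are rejected.
            throw new UnsupportedOperationException("Only little endian is supported.");
        }
        System.out.println("little-endian platform: columnar path enabled");
    }
}
```

Such a guard makes single-node behaviour endianness-independent only by refusing to run; supporting big-endian platforms means either byte-swapping at the boundaries of a fixed little-endian layout (as Parquet mandates) or using native order consistently across the cluster.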
[jira] [Updated] (SPARK-12555) Datasets: data is corrupted when input data is reordered
[ https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Preece updated SPARK-12555: --- Description: Testcase --- {code} import org.apache.spark.sql.expressions.Aggregator import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.SQLContext import org.apache.spark.sql.Dataset case class people(age: Int, name: String) object nameAgg extends Aggregator[people, String, String] { def zero: String = "" def reduce(b: String, a: people): String = a.name + b def merge(b1: String, b2: String): String = b1 + b2 def finish(r: String): String = r } object DataSetAgg { def main(args: Array[String]) { val conf = new SparkConf().setAppName("DataSetAgg") val spark = new SparkContext(conf) val sqlContext = new SQLContext(spark) import sqlContext.implicits._ val peopleds: Dataset[people] = sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people] peopleds.groupBy(_.age).agg(nameAgg.toColumn).show() } } {code} Result ( on a Little Endian Platform ) +--+--+ |_1|_2| +--+--+ |1279869254|FAILTi| +--+--+ Explanation --- Internally the String variable in the unsafe row is not updated after an unsafe row join operation. The displayed string is corrupted and shows part of the integer ( interpreted as a string ) along with "Ti" The column names also look incorrect on a Little Endian platform. 
Result ( on a Big Endian Platform ) +--+--+ | value|nameAgg$(name,age)| +--+--+ |1279869254|LIAFTi| +--+--+ The following Unit test also fails ( but only explicitly on a Big Endian platform ) org.apache.spark.sql.DatasetAggregatorSuite - typed aggregation: class input with reordering *** FAILED *** Results do not match for query: == Parsed Logical Plan == Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Analyzed Logical Plan == value: string, ClassInputAgg$(b,a): int Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Optimized Logical Plan == Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Physical Plan == TungstenAggregate(key=[value#748], functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], output=[value#748,ClassInputAgg$(b,a)#762]) +- TungstenExchange hashpartitioning(value#748,5), None +- TungstenAggregate(key=[value#748], functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], output=[value#748,value#758]) +- !AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- Scan OneRowRelation[] == Results == !== Correct Answer - 1 == == Spark Answer - 1 == ![one,1][one,9] (QueryTest.scala:127) was: Testcase --- import org.apache.spark.sql.expressions.Aggregator import org.apache.spark.{SparkConf, SparkContext} import 
org.apache.spark.sql.SQLContext import org.apache.spark.sql.Dataset case class people(age: Int, name: String) object nameAgg extends Aggregator[people, String, String] { def zero: String = "" def reduce(b: String, a: people): String = a.name + b def merge(b1: String, b2: String): String = b1 + b2 def finish(r: String): String = r } object DataSetAgg { def main(args: Array[String]) { val conf = new SparkConf().setAppName("DataSetAgg") val spark = new SparkContext(conf) val sqlContext = new SQLContext(spark) import sqlContext.implicits._ val peopleds: Dataset[people] = sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people] peopleds.groupBy(_.age).agg(nameAgg.toColumn).show() } } Result ( on a Little Endian Platform ) +--+--+ |_1|_2| +--+--+ |1279869254|FAILTi| +--+--+ Explanation --- Internally the String variable in the unsafe row is not updated after an unsafe row join operation. The displayed string is corrupted and shows part of the integer ( interpreted as a string )
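The stale-pointer failure described in the Explanation can be sketched outside Spark. This is a simplified model, not Spark's actual UnsafeRow code: a variable-length field is referenced by a fixed-width word packing a relative offset and a length, and when the row's bytes are copied behind a prefix during a join, the word keeps pointing at the old position unless the offset is adjusted:

```java
import java.nio.charset.StandardCharsets;

public class StaleOffsetSketch {
    // Read a string field referenced as (offset << 32) | length, with the
    // offset relative to the start of the row buffer.
    static String readString(byte[] row, long offsetAndLen) {
        int off = (int) (offsetAndLen >>> 32);
        int len = (int) offsetAndLen;
        return new String(row, off, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] payload = "Tim Preece".getBytes(StandardCharsets.UTF_8);
        int fixedWidth = 8;                         // one 8-byte slot for the field
        byte[] row = new byte[fixedWidth + payload.length];
        System.arraycopy(payload, 0, row, fixedWidth, payload.length);
        long word = ((long) fixedWidth << 32) | payload.length;
        System.out.println(readString(row, word));  // reads "Tim Preece" correctly

        // "Join": copy the row behind a 4-byte prefix of other columns.
        int prefix = 4;
        byte[] joined = new byte[prefix + row.length];
        System.arraycopy(row, 0, joined, prefix, row.length);

        // Stale word: the offset was never shifted, so it reads garbage bytes,
        // analogous to the corrupted "FAILTi" output in the report.
        System.out.println(readString(joined, word));

        // Adjusted word: shifting the offset by the prefix restores the string.
        long fixed = ((long) (fixedWidth + prefix) << 32) | payload.length;
        System.out.println(readString(joined, fixed));
    }
}
```

On a big-endian platform the same stale bytes decode differently ("LIAFTi" vs "FAILTi" above), which is why the corruption is platform-visible even though the underlying bug is endianness-independent.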
[jira] [Updated] (SPARK-12555) Datasets: data is corrupted when input data is reordered
[ https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Preece updated SPARK-12555: --- Description: Testcase --- {code} import org.apache.spark.sql.expressions.Aggregator import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.SQLContext import org.apache.spark.sql.Dataset case class people(age: Int, name: String) object nameAgg extends Aggregator[people, String, String] { def zero: String = "" def reduce(b: String, a: people): String = a.name + b def merge(b1: String, b2: String): String = b1 + b2 def finish(r: String): String = r } object DataSetAgg { def main(args: Array[String]) { val conf = new SparkConf().setAppName("DataSetAgg") val spark = new SparkContext(conf) val sqlContext = new SQLContext(spark) import sqlContext.implicits._ val peopleds: Dataset[people] = sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people] peopleds.groupBy(_.age).agg(nameAgg.toColumn).show() } } {code} Result ( on a Little Endian Platform ) {noformat} +--+--+ |_1|_2| +--+--+ |1279869254|FAILTi| +--+--+ {noformat} Explanation --- Internally the String variable in the unsafe row is not updated after an unsafe row join operation. The displayed string is corrupted and shows part of the integer ( interpreted as a string ) along with "Ti" The column names also look incorrect on a Little Endian platform. 
Result ( on a Big Endian Platform ) {noformat} +--+--+ | value|nameAgg$(name,age)| +--+--+ |1279869254|LIAFTi| +--+--+ {noformat} The following Unit test also fails ( but only explicitly on a Big Endian platform ) org.apache.spark.sql.DatasetAggregatorSuite - typed aggregation: class input with reordering *** FAILED *** Results do not match for query: == Parsed Logical Plan == Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Analyzed Logical Plan == value: string, ClassInputAgg$(b,a): int Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Optimized Logical Plan == Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Physical Plan == TungstenAggregate(key=[value#748], functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], output=[value#748,ClassInputAgg$(b,a)#762]) +- TungstenExchange hashpartitioning(value#748,5), None +- TungstenAggregate(key=[value#748], functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], output=[value#748,value#758]) +- !AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- Scan OneRowRelation[] == Results == !== Correct Answer - 1 == == Spark Answer - 1 == ![one,1][one,9] (QueryTest.scala:127) was: Testcase --- {code} import org.apache.spark.sql.expressions.Aggregator import 
org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.SQLContext import org.apache.spark.sql.Dataset case class people(age: Int, name: String) object nameAgg extends Aggregator[people, String, String] { def zero: String = "" def reduce(b: String, a: people): String = a.name + b def merge(b1: String, b2: String): String = b1 + b2 def finish(r: String): String = r } object DataSetAgg { def main(args: Array[String]) { val conf = new SparkConf().setAppName("DataSetAgg") val spark = new SparkContext(conf) val sqlContext = new SQLContext(spark) import sqlContext.implicits._ val peopleds: Dataset[people] = sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people] peopleds.groupBy(_.age).agg(nameAgg.toColumn).show() } } {code} Result ( on a Little Endian Platform ) +--+--+ |_1|_2| +--+--+ |1279869254|FAILTi| +--+--+ Explanation --- Internally the String variable in the unsafe row is not updated after an unsafe row join operation. The displayed string is corrupted
[jira] [Updated] (SPARK-12555) Datasets: data is corrupted when input data is reordered
[ https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Preece updated SPARK-12555: --- Description: Testcase --- {code} import org.apache.spark.sql.expressions.Aggregator import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.SQLContext import org.apache.spark.sql.Dataset case class people(age: Int, name: String) object nameAgg extends Aggregator[people, String, String] { def zero: String = "" def reduce(b: String, a: people): String = a.name + b def merge(b1: String, b2: String): String = b1 + b2 def finish(r: String): String = r } object DataSetAgg { def main(args: Array[String]) { val conf = new SparkConf().setAppName("DataSetAgg") val spark = new SparkContext(conf) val sqlContext = new SQLContext(spark) import sqlContext.implicits._ val peopleds: Dataset[people] = sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people] peopleds.groupBy(_.age).agg(nameAgg.toColumn).show() } } {code} Result ( on a Little Endian Platform ) {noformat} +--+--+ |_1|_2| +--+--+ |1279869254|FAILTi| +--+--+ {noformat} Explanation --- Internally the String variable in the unsafe row is not updated after an unsafe row join operation. The displayed string is corrupted and shows part of the integer ( interpreted as a string ) along with "Ti" The column names also look different on a Little Endian platform. 
Result ( on a Big Endian Platform ) {noformat} +--+--+ | value|nameAgg$(name,age)| +--+--+ |1279869254|LIAFTi| +--+--+ {noformat} The following Unit test also fails ( but only explicitly on a Big Endian platform ) org.apache.spark.sql.DatasetAggregatorSuite - typed aggregation: class input with reordering *** FAILED *** Results do not match for query: == Parsed Logical Plan == Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Analyzed Logical Plan == value: string, ClassInputAgg$(b,a): int Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Optimized Logical Plan == Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Physical Plan == TungstenAggregate(key=[value#748], functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], output=[value#748,ClassInputAgg$(b,a)#762]) +- TungstenExchange hashpartitioning(value#748,5), None +- TungstenAggregate(key=[value#748], functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], output=[value#748,value#758]) +- !AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- Scan OneRowRelation[] == Results == !== Correct Answer - 1 == == Spark Answer - 1 == ![one,1][one,9] (QueryTest.scala:127) was: Testcase --- {code} import org.apache.spark.sql.expressions.Aggregator import 
org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.SQLContext import org.apache.spark.sql.Dataset case class people(age: Int, name: String) object nameAgg extends Aggregator[people, String, String] { def zero: String = "" def reduce(b: String, a: people): String = a.name + b def merge(b1: String, b2: String): String = b1 + b2 def finish(r: String): String = r } object DataSetAgg { def main(args: Array[String]) { val conf = new SparkConf().setAppName("DataSetAgg") val spark = new SparkContext(conf) val sqlContext = new SQLContext(spark) import sqlContext.implicits._ val peopleds: Dataset[people] = sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people] peopleds.groupBy(_.age).agg(nameAgg.toColumn).show() } } {code} Result ( on a Little Endian Platform ) {noformat} +--+--+ |_1|_2| +--+--+ |1279869254|FAILTi| +--+--+ {noformat} Explanation --- Internally the String variable in the unsafe row is not updated after an unsafe row join operation. The displayed
[jira] [Updated] (SPARK-12555) DatasetAggregatorSuite fails on big-endian platforms
[ https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Preece updated SPARK-12555: --- Description: Testcase --- import org.apache.spark.sql.expressions.Aggregator import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.SQLContext import org.apache.spark.sql.Dataset case class people(age: Int, name: String) object nameAgg extends Aggregator[people, String, String] { def zero: String = "" def reduce(b: String, a: people): String = a.name + b def merge(b1: String, b2: String): String = b1 + b2 def finish(r: String): String = r } object DataSetAgg { def main(args: Array[String]) { val conf = new SparkConf().setAppName("DataSetAgg") val spark = new SparkContext(conf) val sqlContext = new SQLContext(spark) import sqlContext.implicits._ val peopleds: Dataset[people] = sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people] peopleds.groupBy(_.age).agg(nameAgg.toColumn).show() } } Result ( on a Little Endian Platform ) +--+--+ |_1|_2| +--+--+ |1279869254|FAILTi| +--+--+ Explanation --- Internally the String variable in the unsafe row is not updated after an unsafe row join operation. The displayed string is corrupted and shows part of the integer ( interpreted as a string ) along with "Ti" The column names look incorrect on a Little Endian platform. 
Result ( on a Big Endian Platform ) +--+--+ | value|nameAgg$(name,age)| +--+--+ |1279869254|LIAFTi| +--+--+ The following Unit test also fails ( but only explicitly on a Big Endian platform ) org.apache.spark.sql.DatasetAggregatorSuite - typed aggregation: class input with reordering *** FAILED *** Results do not match for query: == Parsed Logical Plan == Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Analyzed Logical Plan == value: string, ClassInputAgg$(b,a): int Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Optimized Logical Plan == Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Physical Plan == TungstenAggregate(key=[value#748], functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], output=[value#748,ClassInputAgg$(b,a)#762]) +- TungstenExchange hashpartitioning(value#748,5), None +- TungstenAggregate(key=[value#748], functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], output=[value#748,value#758]) +- !AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- Scan OneRowRelation[] == Results == !== Correct Answer - 1 == == Spark Answer - 1 == ![one,1][one,9] (QueryTest.scala:127) was: org.apache.spark.sql.DatasetAggregatorSuite - typed aggregation: class input with reordering *** FAILED *** Results do 
not match for query: == Parsed Logical Plan == Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Analyzed Logical Plan == value: string, ClassInputAgg$(b,a): int Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Optimized Logical Plan == Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Physical Plan == TungstenAggregate(key=[value#748], functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)],
[jira] [Updated] (SPARK-12555) Datasets: data is corrupted when input data is reordered
[ https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Preece updated SPARK-12555: --- Summary: Datasets: data is corrupted when input data is reordered (was: DatasetAggregatorSuite fails on big-endian platforms) > Datasets: data is corrupted when input data is reordered > > > Key: SPARK-12555 > URL: https://issues.apache.org/jira/browse/SPARK-12555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: ALL platforms ( although test only explicitly fails on > Big Endian platforms ). >Reporter: Tim Preece > > Testcase > --- > import org.apache.spark.sql.expressions.Aggregator > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.SQLContext > import org.apache.spark.sql.Dataset > case class people(age: Int, name: String) > object nameAgg extends Aggregator[people, String, String] { > def zero: String = "" > def reduce(b: String, a: people): String = a.name + b > def merge(b1: String, b2: String): String = b1 + b2 > def finish(r: String): String = r > } > object DataSetAgg { > def main(args: Array[String]) { > val conf = new SparkConf().setAppName("DataSetAgg") > val spark = new SparkContext(conf) > val sqlContext = new SQLContext(spark) > import sqlContext.implicits._ > val peopleds: Dataset[people] = sqlContext.sql("SELECT 'Tim Preece' AS > name, 1279869254 AS age").as[people] > peopleds.groupBy(_.age).agg(nameAgg.toColumn).show() > } > } > Result ( on a Little Endian Platform ) > > +--+--+ > |_1|_2| > +--+--+ > |1279869254|FAILTi| > +--+--+ > Explanation > --- > Internally the String variable in the unsafe row is not updated after an > unsafe row join operation. > The displayed string is corrupted and shows part of the integer ( interpreted > as a string ) along with "Ti" > The column names look incorrect on a Little Endian platform. 
> Result ( on a Big Endian Platform ) > +--+--+ > | value|nameAgg$(name,age)| > +--+--+ > |1279869254|LIAFTi| > +--+--+ > The following Unit test also fails ( but only explicitly on a Big Endian > platform ) > org.apache.spark.sql.DatasetAggregatorSuite > - typed aggregation: class input with reordering *** FAILED *** > Results do not match for query: > == Parsed Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Analyzed Logical Plan == > value: string, ClassInputAgg$(b,a): int > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Optimized Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Physical Plan == > TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], > output=[value#748,ClassInputAgg$(b,a)#762]) > +- TungstenExchange hashpartitioning(value#748,5), None > +- TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], > output=[value#748,value#758]) > +- !AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] >+- Project [one AS b#650,1 AS a#651] > +- Scan OneRowRelation[] > == Results == > !== Correct Answer - 1 == == Spark Answer - 1 == > ![one,1][one,9] (QueryTest.scala:127) -- This message 
was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
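The join corruption described in this issue can be sketched in miniature. The packing scheme below (a 64-bit slot whose high 32 bits hold a field's offset and whose low 32 bits hold its length, for variable-length fields only) follows the general Tungsten UnsafeRow layout, but the helper name and the concrete numbers are hypothetical, chosen only to illustrate why rebasing an int field's slot as if it held an offset corrupts the row:

```java
public class OffsetPatchSketch {
    // Hypothetical relocation of a variable-length field's 8-byte slot:
    // high 32 bits = offset into the row's variable-length region, low 32 bits = length.
    static long shiftOffset(long slot, long delta) {
        return slot + (delta << 32);
    }

    public static void main(String[] args) {
        // A string field at offset 48 with length 10, packed into one slot.
        long stringSlot = (48L << 32) | 10L;
        long moved = shiftOffset(stringSlot, 16);
        assert (int) (moved >>> 32) == 64; // offset correctly rebased by 16
        assert (int) moved == 10;          // length untouched

        // Applying the same patch to an int field's slot is the bug:
        long intSlot = 1279869254L;        // the int occupies the low 4 bytes
        long bad = shiftOffset(intSlot, 16);
        assert (int) bad == 1279869254;    // a read of the low 4 bytes still looks fine...
        assert bad != intSlot;             // ...but the slot is corrupted, which a read
                                           // of the other half of the slot exposes
        System.out.println("ok");
    }
}
```

This matches the symptom reported: on little-endian the int happens to sit in the untouched half of the slot, so the test passes by accident, while on big-endian the corrupted half is the one read back.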
[jira] [Updated] (SPARK-12555) Datasets: data is corrupted when input data is reordered
[ https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Preece updated SPARK-12555: --- Description: Testcase --- import org.apache.spark.sql.expressions.Aggregator import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.SQLContext import org.apache.spark.sql.Dataset case class people(age: Int, name: String) object nameAgg extends Aggregator[people, String, String] { def zero: String = "" def reduce(b: String, a: people): String = a.name + b def merge(b1: String, b2: String): String = b1 + b2 def finish(r: String): String = r } object DataSetAgg { def main(args: Array[String]) { val conf = new SparkConf().setAppName("DataSetAgg") val spark = new SparkContext(conf) val sqlContext = new SQLContext(spark) import sqlContext.implicits._ val peopleds: Dataset[people] = sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people] peopleds.groupBy(_.age).agg(nameAgg.toColumn).show() } } Result ( on a Little Endian Platform ) +--+--+ |_1|_2| +--+--+ |1279869254|FAILTi| +--+--+ Explanation --- Internally the String variable in the unsafe row is not updated after an unsafe row join operation. The displayed string is corrupted and shows part of the integer ( interpreted as a string ) along with "Ti" The column names also look incorrect on a Little Endian platform. 
Result ( on a Big Endian Platform ) +--+--+ | value|nameAgg$(name,age)| +--+--+ |1279869254|LIAFTi| +--+--+ The following Unit test also fails ( but only explicitly on a Big Endian platform ) org.apache.spark.sql.DatasetAggregatorSuite - typed aggregation: class input with reordering *** FAILED *** Results do not match for query: == Parsed Logical Plan == Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Analyzed Logical Plan == value: string, ClassInputAgg$(b,a): int Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Optimized Logical Plan == Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Physical Plan == TungstenAggregate(key=[value#748], functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], output=[value#748,ClassInputAgg$(b,a)#762]) +- TungstenExchange hashpartitioning(value#748,5), None +- TungstenAggregate(key=[value#748], functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], output=[value#748,value#758]) +- !AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- Scan OneRowRelation[] == Results == !== Correct Answer - 1 == == Spark Answer - 1 == ![one,1][one,9] (QueryTest.scala:127) was: Testcase --- import org.apache.spark.sql.expressions.Aggregator import org.apache.spark.{SparkConf, SparkContext} import 
org.apache.spark.sql.SQLContext import org.apache.spark.sql.Dataset case class people(age: Int, name: String) object nameAgg extends Aggregator[people, String, String] { def zero: String = "" def reduce(b: String, a: people): String = a.name + b def merge(b1: String, b2: String): String = b1 + b2 def finish(r: String): String = r } object DataSetAgg { def main(args: Array[String]) { val conf = new SparkConf().setAppName("DataSetAgg") val spark = new SparkContext(conf) val sqlContext = new SQLContext(spark) import sqlContext.implicits._ val peopleds: Dataset[people] = sqlContext.sql("SELECT 'Tim Preece' AS name, 1279869254 AS age").as[people] peopleds.groupBy(_.age).agg(nameAgg.toColumn).show() } } Result ( on a Little Endian Platform ) +--+--+ |_1|_2| +--+--+ |1279869254|FAILTi| +--+--+ Explanation --- Internally the String variable in the unsafe row is not updated after an unsafe row join operation. The displayed string is corrupted and shows part of the integer ( interpreted as a string ) along with "Ti"
[jira] [Updated] (SPARK-12555) Datasets: data is corrupted when input data is reordered
[ https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Preece updated SPARK-12555: --- Environment: ALL platforms on 1.6 (was: ALL platforms ( although test only explicitly fails on Big Endian platforms ).) > Datasets: data is corrupted when input data is reordered > > > Key: SPARK-12555 > URL: https://issues.apache.org/jira/browse/SPARK-12555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: ALL platforms on 1.6 >Reporter: Tim Preece > > Testcase > --- > import org.apache.spark.sql.expressions.Aggregator > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.SQLContext > import org.apache.spark.sql.Dataset > case class people(age: Int, name: String) > object nameAgg extends Aggregator[people, String, String] { > def zero: String = "" > def reduce(b: String, a: people): String = a.name + b > def merge(b1: String, b2: String): String = b1 + b2 > def finish(r: String): String = r > } > object DataSetAgg { > def main(args: Array[String]) { > val conf = new SparkConf().setAppName("DataSetAgg") > val spark = new SparkContext(conf) > val sqlContext = new SQLContext(spark) > import sqlContext.implicits._ > val peopleds: Dataset[people] = sqlContext.sql("SELECT 'Tim Preece' AS > name, 1279869254 AS age").as[people] > peopleds.groupBy(_.age).agg(nameAgg.toColumn).show() > } > } > Result ( on a Little Endian Platform ) > > +--+--+ > |_1|_2| > +--+--+ > |1279869254|FAILTi| > +--+--+ > Explanation > --- > Internally the String variable in the unsafe row is not updated after an > unsafe row join operation. > The displayed string is corrupted and shows part of the integer ( interpreted > as a string ) along with "Ti" > The column names also look incorrect on a Little Endian platform. 
> Result ( on a Big Endian Platform ) > +--+--+ > | value|nameAgg$(name,age)| > +--+--+ > |1279869254|LIAFTi| > +--+--+ > The following Unit test also fails ( but only explicitly on a Big Endian > platform ) > org.apache.spark.sql.DatasetAggregatorSuite > - typed aggregation: class input with reordering *** FAILED *** > Results do not match for query: > == Parsed Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Analyzed Logical Plan == > value: string, ClassInputAgg$(b,a): int > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Optimized Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Physical Plan == > TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], > output=[value#748,ClassInputAgg$(b,a)#762]) > +- TungstenExchange hashpartitioning(value#748,5), None > +- TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], > output=[value#748,value#758]) > +- !AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] >+- Project [one AS b#650,1 AS a#651] > +- Scan OneRowRelation[] > == Results == > !== Correct Answer - 1 == == Spark Answer - 1 == > ![one,1][one,9] (QueryTest.scala:127) -- This message 
was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12778) Use of Java Unsafe should take endianness into account
[ https://issues.apache.org/jira/browse/SPARK-12778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15094128#comment-15094128 ] Tim Preece commented on SPARK-12778: https://issues.apache.org/jira/browse/SPARK-12555 is not related, and in fact is not an Endian problem. However, whilst investigating 12555 we did wonder how/if a mixed Endian Spark cluster could work, given an unsafe row mixes writing Integers and reading bytes. > Use of Java Unsafe should take endianness into account > -- > > Key: SPARK-12778 > URL: https://issues.apache.org/jira/browse/SPARK-12778 > Project: Spark > Issue Type: Bug > Components: Input/Output >Reporter: Ted Yu > > In Platform.java, methods of Java Unsafe are called directly without > considering endianness. > In thread, 'Tungsten in a mixed endian environment', Adam Roberts reported > data corruption when "spark.sql.tungsten.enabled" is enabled in mixed endian > environment. > Platform.java should take endianness into account. > Below is a copy of Adam's report: > I've been experimenting with DataFrame operations in a mixed endian > environment - a big endian master with little endian workers. With tungsten > enabled I'm encountering data corruption issues. 
> For example, with this simple test code: > {code} > import org.apache.spark.SparkContext > import org.apache.spark._ > import org.apache.spark.sql.SQLContext > object SimpleSQL { > def main(args: Array[String]): Unit = { > if (args.length != 1) { > println("Not enough args, you need to specify the master url") > } > val masterURL = args(0) > println("Setting up Spark context at: " + masterURL) > val sparkConf = new SparkConf > val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf) > println("Performing SQL tests") > val sqlContext = new SQLContext(sc) > println("SQL context set up") > val df = sqlContext.read.json("/tmp/people.json") > df.show() > println("Selecting everyone's age and adding one to it") > df.select(df("name"), df("age") + 1).show() > println("Showing all people over the age of 21") > df.filter(df("age") > 21).show() > println("Counting people by age") > df.groupBy("age").count().show() > } > } > {code} > Instead of getting > {code} > ++-+ > | age|count| > ++-+ > |null|1| > | 19|1| > | 30|1| > ++-+ > {code} > I get the following with my mixed endian set up: > {code} > +---+-+ > |age|count| > +---+-+ > | null|1| > |1369094286720630784|72057594037927936| > | 30|1| > +---+-+ > {code} > and on another run: > {code} > +---+-+ > |age|count| > +---+-+ > | 0|72057594037927936| > | 19|1| > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
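The garbled counts in Adam's report are consistent with a byte-swapped long: 72057594037927936 is exactly 2^56, i.e. the value 1 with its eight bytes reversed, which is what a count serialized on one endianness and deserialized on the other would look like. A minimal standalone demonstration of the arithmetic (not Spark code):

```java
public class ByteSwapDemo {
    public static void main(String[] args) {
        // A long count of 1 read back on a machine of the opposite
        // endianness comes out with its bytes reversed.
        long swapped = Long.reverseBytes(1L);
        System.out.println(swapped);        // 72057594037927936, i.e. 1L << 56
        assert swapped == 72057594037927936L;
        assert swapped == (1L << 56);
        // Reversing again recovers the original value.
        assert Long.reverseBytes(swapped) == 1L;
    }
}
```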
[jira] [Commented] (SPARK-12778) Use of Java Unsafe should take endianness into account
[ https://issues.apache.org/jira/browse/SPARK-12778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15094185#comment-15094185 ] Tim Preece commented on SPARK-12778: The testcase in 12555 only fails on big-endian ( even though I can see the problem in a debugger on little endian ). So perhaps it's best if I create a testcase which explicitly fails on both BE and LE and then update 12555. > Use of Java Unsafe should take endianness into account > -- > > Key: SPARK-12778 > URL: https://issues.apache.org/jira/browse/SPARK-12778 > Project: Spark > Issue Type: Bug > Components: Input/Output >Reporter: Ted Yu > > In Platform.java, methods of Java Unsafe are called directly without > considering endianness. > In thread, 'Tungsten in a mixed endian environment', Adam Roberts reported > data corruption when "spark.sql.tungsten.enabled" is enabled in mixed endian > environment. > Platform.java should take endianness into account. > Below is a copy of Adam's report: > I've been experimenting with DataFrame operations in a mixed endian > environment - a big endian master with little endian workers. With tungsten > enabled I'm encountering data corruption issues. 
> For example, with this simple test code: > {code} > import org.apache.spark.SparkContext > import org.apache.spark._ > import org.apache.spark.sql.SQLContext > object SimpleSQL { > def main(args: Array[String]): Unit = { > if (args.length != 1) { > println("Not enough args, you need to specify the master url") > } > val masterURL = args(0) > println("Setting up Spark context at: " + masterURL) > val sparkConf = new SparkConf > val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf) > println("Performing SQL tests") > val sqlContext = new SQLContext(sc) > println("SQL context set up") > val df = sqlContext.read.json("/tmp/people.json") > df.show() > println("Selecting everyone's age and adding one to it") > df.select(df("name"), df("age") + 1).show() > println("Showing all people over the age of 21") > df.filter(df("age") > 21).show() > println("Counting people by age") > df.groupBy("age").count().show() > } > } > {code} > Instead of getting > {code} > ++-+ > | age|count| > ++-+ > |null|1| > | 19|1| > | 30|1| > ++-+ > {code} > I get the following with my mixed endian set up: > {code} > +---+-+ > |age|count| > +---+-+ > | null|1| > |1369094286720630784|72057594037927936| > | 30|1| > +---+-+ > {code} > and on another run: > {code} > +---+-+ > |age|count| > +---+-+ > | 0|72057594037927936| > | 19|1| > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12555) Build Failure on 1.6 ( DatasetAggregatorSuite )
[ https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Preece updated SPARK-12555: --- Summary: Build Failure on 1.6 ( DatasetAggregatorSuite ) (was: Build Failure on 1.6) > Build Failure on 1.6 ( DatasetAggregatorSuite ) > --- > > Key: SPARK-12555 > URL: https://issues.apache.org/jira/browse/SPARK-12555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: ALL platforms ( although test only explicitly fails on > Big Endian platforms ). >Reporter: Tim Preece >Priority: Blocker > > org.apache.spark.sql.DatasetAggregatorSuite > - typed aggregation: class input with reordering *** FAILED *** > Results do not match for query: > == Parsed Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Analyzed Logical Plan == > value: string, ClassInputAgg$(b,a): int > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Optimized Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Physical Plan == > TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], > output=[value#748,ClassInputAgg$(b,a)#762]) > +- TungstenExchange hashpartitioning(value#748,5), None > +- TungstenAggregate(key=[value#748], > 
functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], > output=[value#748,value#758]) > +- !AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] >+- Project [one AS b#650,1 AS a#651] > +- Scan OneRowRelation[] > == Results == > !== Correct Answer - 1 == == Spark Answer - 1 == > ![one,1][one,9] (QueryTest.scala:127) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12555) Build Failure on 1.6
[ https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073795#comment-15073795 ] Tim Preece commented on SPARK-12555: Analysis shows that this test fails because of data corruption: There is a mismatch between the unsafe row (string,int) and the schema (int,string), presumably because the test involves reordering of columns. Subsequently when joining (string,int) + (string) the code incorrectly patches the int value with the offset change of the first String. This data corruption occurs on ALL platforms and the offset part of the first string is always incorrect. On Big Endian platforms the value for the integer is also corrupted. This is simply due to the location of the 4-byte integer in the 8-byte unsafe row slot. > Build Failure on 1.6 > > > Key: SPARK-12555 > URL: https://issues.apache.org/jira/browse/SPARK-12555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: ALL platforms ( although test only explicitly fails on > Big Endian platforms ). 
>Reporter: Tim Preece >Priority: Blocker > > org.apache.spark.sql.DatasetAggregatorSuite > - typed aggregation: class input with reordering *** FAILED *** > Results do not match for query: > == Parsed Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Analyzed Logical Plan == > value: string, ClassInputAgg$(b,a): int > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Optimized Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Physical Plan == > TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], > output=[value#748,ClassInputAgg$(b,a)#762]) > +- TungstenExchange hashpartitioning(value#748,5), None > +- TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], > output=[value#748,value#758]) > +- !AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] >+- Project [one AS b#650,1 AS a#651] > +- Scan OneRowRelation[] > == Results == > !== Correct Answer - 1 == == Spark Answer - 1 == > ![one,1][one,9] (QueryTest.scala:127) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
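The final point of the analysis above, that the integer is only visibly corrupted on big-endian because of where a 4-byte int sits inside an 8-byte slot, can be checked directly with java.nio. This is a standalone illustration of the byte-layout difference, not Spark code:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SlotHalves {
    public static void main(String[] args) {
        // Store the int 1279869254 widened to a long in one 8-byte slot,
        // then read a 4-byte int from each half under both byte orders.
        ByteBuffer le = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
        le.putLong(0, 1279869254L);
        assert le.getInt(0) == 1279869254; // LE: the int is in the first 4 bytes
        assert le.getInt(4) == 0;

        ByteBuffer be = ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN);
        be.putLong(0, 1279869254L);
        assert be.getInt(0) == 0;          // BE: the first 4 bytes are the high word
        assert be.getInt(4) == 1279869254;
        // So code that corrupts one half of the slot only shows up as a
        // wrong int value on the endianness whose int lives in that half.
        System.out.println("ok");
    }
}
```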
[jira] [Updated] (SPARK-12555) Build Failure on 1.6 ( DatasetAggregatorSuite )
[ https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Preece updated SPARK-12555: --- Priority: Major (was: Blocker) > Build Failure on 1.6 ( DatasetAggregatorSuite ) > --- > > Key: SPARK-12555 > URL: https://issues.apache.org/jira/browse/SPARK-12555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: ALL platforms ( although test only explicitly fails on > Big Endian platforms ). >Reporter: Tim Preece > > org.apache.spark.sql.DatasetAggregatorSuite > - typed aggregation: class input with reordering *** FAILED *** > Results do not match for query: > == Parsed Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Analyzed Logical Plan == > value: string, ClassInputAgg$(b,a): int > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Optimized Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Physical Plan == > TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], > output=[value#748,ClassInputAgg$(b,a)#762]) > +- TungstenExchange hashpartitioning(value#748,5), None > +- TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], > 
output=[value#748,value#758]) > +- !AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] >+- Project [one AS b#650,1 AS a#651] > +- Scan OneRowRelation[] > == Results == > !== Correct Answer - 1 == == Spark Answer - 1 == > ![one,1][one,9] (QueryTest.scala:127) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12555) DatasetAggregatorSuite fails on big-endian platforms
[ https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073868#comment-15073868 ] Tim Preece commented on SPARK-12555: Yes it is a test failure. Thanks for updating the title. > DatasetAggregatorSuite fails on big-endian platforms > > > Key: SPARK-12555 > URL: https://issues.apache.org/jira/browse/SPARK-12555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: ALL platforms ( although test only explicitly fails on > Big Endian platforms ). >Reporter: Tim Preece > > org.apache.spark.sql.DatasetAggregatorSuite > - typed aggregation: class input with reordering *** FAILED *** > Results do not match for query: > == Parsed Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Analyzed Logical Plan == > value: string, ClassInputAgg$(b,a): int > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Optimized Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Physical Plan == > TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], > output=[value#748,ClassInputAgg$(b,a)#762]) > +- TungstenExchange hashpartitioning(value#748,5), None > +- TungstenAggregate(key=[value#748], > 
functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], > output=[value#748,value#758]) > +- !AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] >+- Project [one AS b#650,1 AS a#651] > +- Scan OneRowRelation[] > == Results == > !== Correct Answer - 1 == == Spark Answer - 1 == > ![one,1][one,9] (QueryTest.scala:127) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12555) Build Failure on 1.6
[ https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Preece updated SPARK-12555: --- Summary: Build Failure on 1.6 (was: Build Failure on 1.6 RC4) > Build Failure on 1.6 > > > Key: SPARK-12555 > URL: https://issues.apache.org/jira/browse/SPARK-12555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: ALL platforms ( although test only explicitly fails on > Big Endian platforms ). >Reporter: Tim Preece >Priority: Blocker > > org.apache.spark.sql.DatasetAggregatorSuite > - typed aggregation: class input with reordering *** FAILED *** > Results do not match for query: > == Parsed Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Analyzed Logical Plan == > value: string, ClassInputAgg$(b,a): int > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Optimized Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Physical Plan == > TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], > output=[value#748,ClassInputAgg$(b,a)#762]) > +- TungstenExchange hashpartitioning(value#748,5), None > +- TungstenAggregate(key=[value#748], > 
functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], > output=[value#748,value#758]) > +- !AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] >+- Project [one AS b#650,1 AS a#651] > +- Scan OneRowRelation[] > == Results == > !== Correct Answer - 1 == == Spark Answer - 1 == > ![one,1][one,9] (QueryTest.scala:127) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12319) Address endian specific problems surfaced in 1.6
[ https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073801#comment-15073801 ] Tim Preece commented on SPARK-12319: Michael, Since this JIRA's description is not quite right and involves two distinct problems, I have created a new JIRA https://issues.apache.org/jira/browse/SPARK-12555 to address the DatasetAggregatorSuite failure. This is important to us since it causes an explicit build failure on our Big Endian platforms. > Address endian specific problems surfaced in 1.6 > > > Key: SPARK-12319 > URL: https://issues.apache.org/jira/browse/SPARK-12319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Problems apparent on BE, LE could be impacted too >Reporter: Adam Roberts >Priority: Critical > > JIRA to cover endian specific problems - since testing 1.6 I've noticed > problems with DataFrames on BE platforms, e.g. > https://issues.apache.org/jira/browse/SPARK-9858 > [~joshrosen] [~yhuai] > Current progress: using com.google.common.io.LittleEndianDataInputStream and > com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer > fixes three test failures in ExchangeCoordinatorSuite but I'm concerned > around performance/wider functional implications > "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input > with reordering" fails as we expect "one, 1" but instead get "one, 9" - we > believe the issue lies within BitSetMethods.java, specifically around: return > (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
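The BitSetMethods expression quoted in SPARK-12319 converts a 64-bit word index plus an intra-word bit position into an absolute bit index: `wi << 6` multiplies the word index by 64, and `Long.numberOfTrailingZeros` locates the first set bit within the word. A simplified sketch of that search (the real method also takes a starting index, handled here by the `subIndex`-free assumption of searching from bit 0):

```java
public class NextSetBitSketch {
    // Simplified next-set-bit search over an array of 64-bit words,
    // starting from bit 0. Returns -1 if no bit is set.
    static int nextSetBit(long[] words) {
        for (int wi = 0; wi < words.length; wi++) {
            long word = words[wi];
            if (word != 0) {
                // wi << 6 == wi * 64; trailing zeros give the bit position
                // inside the word. The arithmetic operates on long values,
                // not raw bytes, so it is endian-neutral by itself.
                return (wi << 6) + Long.numberOfTrailingZeros(word);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] words = new long[2];
        words[1] |= 1L << 6;                  // set absolute bit 64 + 6 = 70
        assert nextSetBit(words) == 70;
        assert nextSetBit(new long[2]) == -1; // empty bit set
        System.out.println("ok");
    }
}
```

If the stored words themselves were written with one byte order and read with another, the trailing-zero count would be taken over a byte-swapped word, which is one way such an index could come out wrong in a mixed-endian setup.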
[jira] [Updated] (SPARK-12555) Build Failure on 1.6 ( DatasetAggregatorSuite )
[ https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Preece updated SPARK-12555: --- Target Version/s: (was: 1.6.0) > Build Failure on 1.6 ( DatasetAggregatorSuite ) > --- > > Key: SPARK-12555 > URL: https://issues.apache.org/jira/browse/SPARK-12555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: ALL platforms ( although test only explicitly fails on > Big Endian platforms ). >Reporter: Tim Preece >Priority: Blocker > > org.apache.spark.sql.DatasetAggregatorSuite > - typed aggregation: class input with reordering *** FAILED *** > Results do not match for query: > == Parsed Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Analyzed Logical Plan == > value: string, ClassInputAgg$(b,a): int > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Optimized Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Physical Plan == > TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], > output=[value#748,ClassInputAgg$(b,a)#762]) > +- TungstenExchange hashpartitioning(value#748,5), None > +- TungstenAggregate(key=[value#748], > 
functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], > output=[value#748,value#758]) > +- !AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] >+- Project [one AS b#650,1 AS a#651] > +- Scan OneRowRelation[] > == Results == > !== Correct Answer - 1 == == Spark Answer - 1 == > ![one,1][one,9] (QueryTest.scala:127) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12319) Address endian specific problems surfaced in 1.6
[ https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073821#comment-15073821 ] Tim Preece commented on SPARK-12319: The remaining problem is ExchangeCoordinatorSuite. I don't have the right access to update the description or title. > Address endian specific problems surfaced in 1.6 > > > Key: SPARK-12319 > URL: https://issues.apache.org/jira/browse/SPARK-12319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Problems apparent on BE, LE could be impacted too >Reporter: Adam Roberts >Priority: Critical > > JIRA to cover endian specific problems - since testing 1.6 I've noticed > problems with DataFrames on BE platforms, e.g. > https://issues.apache.org/jira/browse/SPARK-9858 > [~joshrosen] [~yhuai] > Current progress: using com.google.common.io.LittleEndianDataInputStream and > com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer > fixes three test failures in ExchangeCoordinatorSuite but I'm concerned > around performance/wider functional implications > "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input > with reordering" fails as we expect "one, 1" but instead get "one, 9" - we > believe the issue lies within BitSetMethods.java, specifically around: return > (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
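[Editor's note] The Guava LittleEndianDataInputStream/LittleEndianDataOutputStream workaround quoted in the description boils down to fixing the byte order of the serialized form explicitly, instead of relying on the JVM host order. A minimal sketch of that idea, using a plain ByteBuffer rather than Guava's streams (class and method names invented for illustration; this is not the actual UnsafeRowSerializer change):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Always write and read multi-byte values in an explicit byte order, so the
// on-wire format is identical on big-endian and little-endian hosts.
public class LittleEndianFraming {
    // Serialize a row-length prefix in little-endian order.
    public static byte[] writeLength(int length) {
        ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(length);
        return buf.array();
    }

    // Deserialize with the same explicit order; the host's native
    // endianness never enters the picture.
    public static int readLength(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) {
        byte[] wire = writeLength(24);
        // Little-endian: least-significant byte first.
        System.out.println(wire[0]);           // 24
        System.out.println(readLength(wire));  // 24
    }
}
```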
[jira] [Created] (SPARK-12555) Build Failure on 1.6 RC4
Tim Preece created SPARK-12555: -- Summary: Build Failure on 1.6 RC4 Key: SPARK-12555 URL: https://issues.apache.org/jira/browse/SPARK-12555 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Environment: ALL platforms ( although test only explicitly fails on Big Endian platforms ). Reporter: Tim Preece Priority: Blocker org.apache.spark.sql.DatasetAggregatorSuite - typed aggregation: class input with reordering *** FAILED *** Results do not match for query: == Parsed Logical Plan == Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Analyzed Logical Plan == value: string, ClassInputAgg$(b,a): int Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Optimized Logical Plan == Aggregate [value#748], [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS ClassInputAgg$(b,a)#762] +- AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- OneRowRelation$ == Physical Plan == TungstenAggregate(key=[value#748], functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], output=[value#748,ClassInputAgg$(b,a)#762]) +- TungstenExchange hashpartitioning(value#748,5), None +- TungstenAggregate(key=[value#748], functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], output=[value#748,value#758]) +- !AppendColumns , class[a[0]: int, b[0]: string], class[value[0]: string], [value#748] +- Project [one AS b#650,1 AS a#651] +- Scan OneRowRelation[] == Results == !== Correct Answer - 1 == == Spark Answer - 1 == 
![one,1][one,9] (QueryTest.scala:127)
[jira] [Commented] (SPARK-12555) Build Failure on 1.6 ( DatasetAggregatorSuite )
[ https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073805#comment-15073805 ] Tim Preece commented on SPARK-12555: Set priority to Major following guidance here - https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-JIRA Although I would suggest that the priority could be set higher. > Build Failure on 1.6 ( DatasetAggregatorSuite ) > --- > > Key: SPARK-12555 > URL: https://issues.apache.org/jira/browse/SPARK-12555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: ALL platforms ( although test only explicitly fails on > Big Endian platforms ). >Reporter: Tim Preece > > org.apache.spark.sql.DatasetAggregatorSuite > - typed aggregation: class input with reordering *** FAILED *** > Results do not match for query: > == Parsed Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Analyzed Logical Plan == > value: string, ClassInputAgg$(b,a): int > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Optimized Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Physical Plan == > TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], > 
output=[value#748,ClassInputAgg$(b,a)#762]) > +- TungstenExchange hashpartitioning(value#748,5), None > +- TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], > output=[value#748,value#758]) > +- !AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] >+- Project [one AS b#650,1 AS a#651] > +- Scan OneRowRelation[] > == Results == > !== Correct Answer - 1 == == Spark Answer - 1 == > ![one,1][one,9] (QueryTest.scala:127) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12319) Address endian specific problems surfaced in 1.6
[ https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073145#comment-15073145 ] Tim Preece commented on SPARK-12319: Hi, the failing test is already checked in. It is: "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input with reordering". The test only explicitly fails on Big Endian platforms because an integer takes an 8-byte slot in the UnsafeRow, and when the data corruption occurs the BE integer ends up with the wrong value. I added print statements which show the data corruption on Little Endian as well; it just happens not to affect the value of the LE integer, since the LE integer is in the other 4 bytes of the 8-byte slot. > Address endian specific problems surfaced in 1.6 > > > Key: SPARK-12319 > URL: https://issues.apache.org/jira/browse/SPARK-12319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Problems apparent on BE, LE could be impacted too >Reporter: Adam Roberts >Priority: Critical > > JIRA to cover endian specific problems - since testing 1.6 I've noticed > problems with DataFrames on BE platforms, e.g. 
> https://issues.apache.org/jira/browse/SPARK-9858 > [~joshrosen] [~yhuai] > Current progress: using com.google.common.io.LittleEndianDataInputStream and > com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer > fixes three test failures in ExchangeCoordinatorSuite but I'm concerned > around performance/wider functional implications > "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input > with reordering" fails as we expect "one, 1" but instead get "one, 9" - we > believe the issue lies within BitSetMethods.java, specifically around: return > (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word);
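[Editor's note] The 8-byte-slot behaviour described in the comment above can be simulated with a ByteBuffer standing in for Unsafe memory (class and method names invented for illustration). An int written at the base of an 8-byte slot sits in the low-order half of the slot's long value under little-endian order, but in the high-order half under big-endian order. A long-typed update that only touches the high 32 bits of the long therefore leaves the LE field intact while corrupting the BE field, reproducing the reported "one, 1" vs "one, 9" result:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Simulates an 8-byte UnsafeRow fixed-width slot holding an int field.
public class SlotCorruptionDemo {
    public static int intAfterCorruption(ByteOrder order) {
        ByteBuffer slot = ByteBuffer.allocate(8).order(order);
        slot.putInt(0, 1);                   // field value, as in "one, 1"
        long word = slot.getLong(0);
        slot.putLong(0, word + (8L << 32));  // corrupt only the high 32 bits of the long
        return slot.getInt(0);               // re-read the field at the slot base
    }

    public static void main(String[] args) {
        System.out.println(intAfterCorruption(ByteOrder.LITTLE_ENDIAN)); // 1
        System.out.println(intAfterCorruption(ByteOrder.BIG_ENDIAN));    // 9
    }
}
```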
[jira] [Commented] (SPARK-12319) Address endian specific problems surfaced in 1.6
[ https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068176#comment-15068176 ] Tim Preece commented on SPARK-12319: [~marmbrus] Hi Michael, I think this may be a problem with the new DataSet API, in particular the new "as" function of DataFrame, which I see is tagged as Experimental. When we run the DatasetAggregatorSuite test "typed aggregation: class input with reordering", the implementation seems to get confused between the ordering of the data in the UnsafeRow (string,int) and the schema (int,string). This results in a test case failure that shows up on BE platforms (although the data is also corrupted on LE platforms). At the moment I'm not sure how to fix it, so any pointers would be helpful. > Address endian specific problems surfaced in 1.6 > > > Key: SPARK-12319 > URL: https://issues.apache.org/jira/browse/SPARK-12319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Problems apparent on BE, LE could be impacted too >Reporter: Adam Roberts >Priority: Critical > > JIRA to cover endian specific problems - since testing 1.6 I've noticed > problems with DataFrames on BE platforms, e.g. 
> https://issues.apache.org/jira/browse/SPARK-9858 > [~joshrosen] [~yhuai] > Current progress: using com.google.common.io.LittleEndianDataInputStream and > com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer > fixes three test failures in ExchangeCoordinatorSuite but I'm concerned > around performance/wider functional implications > "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input > with reordering" fails as we expect "one, 1" but instead get "one, 9" - we > believe the issue lies within BitSetMethods.java, specifically around: return > (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word);
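[Editor's note] The quoted issue description points at the index computation in BitSetMethods.java, `(wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word)`. That expression turns a 64-bit word index, a sub-word bit position, and the trailing-zero count of the shifted word into an absolute bit index; because it operates on long values rather than raw bytes, it is endian-neutral, consistent with the later finding that BitSetMethods was not the culprit. A minimal standalone sketch of the computation (hypothetical class, not Spark's actual code):

```java
// Word-indexed bitset scan: find the next set bit at or after fromIndex.
public class NextSetBit {
    /** Index of the first set bit at or after fromIndex, or -1 if none. */
    public static int nextSetBit(long[] words, int fromIndex) {
        int wi = fromIndex >> 6;            // which 64-bit word
        if (wi >= words.length) return -1;
        int subIndex = fromIndex & 63;      // bit position inside that word
        long word = words[wi] >>> subIndex; // discard bits below fromIndex
        if (word != 0) {
            // The expression quoted in the issue description.
            return (wi << 6) + subIndex + Long.numberOfTrailingZeros(word);
        }
        for (wi++; wi < words.length; wi++) {
            if (words[wi] != 0) {
                return (wi << 6) + Long.numberOfTrailingZeros(words[wi]);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] words = {0b1010L};
        System.out.println(nextSetBit(words, 0)); // 1
        System.out.println(nextSetBit(words, 2)); // 3
    }
}
```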
[jira] [Commented] (SPARK-12319) Address endian specific problems surfaced in 1.6
[ https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066628#comment-15066628 ] Tim Preece commented on SPARK-12319: I notice that for the failing test case the schema (for row1) mismatches the actual data in row1. Row1 has schema: SpecificUnsafeRowJoiner schema1 StructType(StructField(a,IntegerType,false), StructField(b,StringType,true)) But row1 has the following data (i.e. a string followed by an int): row1 [0,180003,1,656e6f] So why does the schema mismatch the data? The name of the failing test may give a clue!
  test("typed aggregation: class input with reordering") {
    val ds = sql("SELECT 'one' AS b, 1 as a").as[AggData]
    checkAnswer(
      ds.select(ClassInputAgg.toColumn),
      1)
    checkAnswer(
      ds.select(expr("avg(a)").as[Double], ClassInputAgg.toColumn),
      (1.0, 1))
    checkAnswer(
      ds.groupBy(_.b).agg(ClassInputAgg.toColumn),
      ("one", 1))
  }
> Address endian specific problems surfaced in 1.6 > > > Key: SPARK-12319 > URL: https://issues.apache.org/jira/browse/SPARK-12319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Problems apparent on BE, LE could be impacted too >Reporter: Adam Roberts >Priority: Critical > > JIRA to cover endian specific problems - since testing 1.6 I've noticed > problems with DataFrames on BE platforms, e.g. 
> https://issues.apache.org/jira/browse/SPARK-9858 > [~joshrosen] [~yhuai] > Current progress: using com.google.common.io.LittleEndianDataInputStream and > com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer > fixes three test failures in ExchangeCoordinatorSuite but I'm concerned > around performance/wider functional implications > "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input > with reordering" fails as we expect "one, 1" but instead get "one, 9" - we > believe the issue lies within BitSetMethods.java, specifically around: return > (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word);
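[Editor's note] The row dump in the comment above, row1 [0,180003,1,656e6f], mixes fixed-width words and packed pointer words. In UnsafeRow, a variable-length field's fixed 8-byte slot holds a packed pair: the offset of the value within the row in the high 32 bits and its length in bytes in the low 32 bits. A hedged standalone sketch of that packing (invented class name; packing scheme as described for Spark's UnsafeRow):

```java
// Pack and unpack the (offset, size) pointer word that a variable-length
// UnsafeRow field stores in its fixed-width slot.
public class OffsetAndSize {
    public static long pack(int offset, int size) {
        return ((long) offset << 32) | (size & 0xFFFFFFFFL);
    }

    public static int offset(long word) { return (int) (word >>> 32); }

    public static int size(long word) { return (int) word; }

    public static void main(String[] args) {
        long word = pack(24, 3);             // a 3-byte string stored 24 bytes into the row
        System.out.println(offset(word));    // 24
        System.out.println(size(word));      // 3
    }
}
```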
[jira] [Commented] (SPARK-12319) Address endian specific problems surfaced in 1.6
[ https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15062175#comment-15062175 ] Tim Preece commented on SPARK-12319: Hi Sean, Yin, I've started to (and continue to) investigate this DatasetAggregatorSuite failure as described above. So far I believe: a) the description is incorrect and it has nothing to do with endianness or BitSetMethods.java (it just happens we see a failure on big-endian platforms - see below); b) the problem is probably in the codegen for UnsafeRow joins (GenerateUnsafeRowJoiner). I see two UnsafeRows being joined: a (string,int) + (string), which results in an UnsafeRow with schema (string,int,string). When we come to update the offsets for the variable-length data (in this case for the first String), the offset is miscalculated (in updateOffset in GenerateUnsafeRowJoiner). This means the int value in the second field slot is wrongly changed, and on a BE platform (for this particular test case) it is incremented by 8. On an LE platform the value in the second field is also changed, but in a way that does not alter the value of the int. However, for both BE and LE platforms the first String variable looks bogus, with an invalid variable offset. I'm continuing to investigate (and so could well revise the above), but thought I would share my observations so far. Also, it would be useful if you happened to have a pointer to any design documentation for UnsafeRow. For example, I wasn't sure if all the variable-length data should go at the end of the row - that is, whether the schema for the joined row should actually have been (int,string,string). 
Tim Preece > Address endian specific problems surfaced in 1.6 > > > Key: SPARK-12319 > URL: https://issues.apache.org/jira/browse/SPARK-12319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: BE platforms >Reporter: Adam Roberts >Priority: Critical > > JIRA to cover endian specific problems - since testing 1.6 I've noticed > problems with DataFrames on BE platforms, e.g. > https://issues.apache.org/jira/browse/SPARK-9858 > [~joshrosen] [~yhuai] > Current progress: using com.google.common.io.LittleEndianDataInputStream and > com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer > fixes three test failures in ExchangeCoordinatorSuite but I'm concerned > around performance/wider functional implications > "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input > with reordering" fails as we expect "one, 1" but instead get "one, 9" - we > believe the issue lies within BitSetMethods.java, specifically around: return > (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
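[Editor's note] The updateOffset behaviour discussed in the comment above is meant to adjust only the offset half of each variable-length field's packed (offset,size) word when the joined row's layout shifts the variable-length region. The reported corruption is, in effect, this adjustment also landing on a fixed-width int slot. A sketch of the intended update (invented names; not the actual generated code):

```java
// When two UnsafeRows are joined, variable-length data moves further from
// the row base, so each variable-length field's packed (offset,size) word
// must have its offset half (the high 32 bits) increased by the shift.
public class JoinerOffsetUpdate {
    public static long shiftOffset(long offsetAndSize, long shift) {
        return offsetAndSize + (shift << 32); // only the offset half grows
    }

    public static void main(String[] args) {
        long word = (24L << 32) | 3;         // offset 24, size 3
        long moved = shiftOffset(word, 8);   // the row grew by 8 bytes
        System.out.println(moved >>> 32);    // 32 (offset shifted)
        System.out.println((int) moved);     // 3  (size untouched)
    }
}
```

Applied to a slot that holds a plain int rather than a packed pointer, the same `shift << 32` update is exactly the long-typed corruption that changes the BE integer by the shift amount while leaving the LE integer's value intact.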