[ https://issues.apache.org/jira/browse/SPARK-12778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ted Yu updated SPARK-12778:
---------------------------
    Description:
In Platform.java, methods of Java Unsafe are called directly without considering endianness.
In thread, 'Tungsten in a mixed endian environment', Adam Roberts reported data corruption when "spark.sql.tungsten.enabled" is enabled in mixed endian environment.
Platform.java should take endianness into account.

Below is a copy of Adam's report:

I've been experimenting with DataFrame operations in a mixed endian environment - a big endian master with little endian workers. With tungsten enabled I'm encountering data corruption issues.

For example, with this simple test code:
{code}
import org.apache.spark.SparkContext
import org.apache.spark._
import org.apache.spark.sql.SQLContext

object SimpleSQL {
  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println("Not enough args, you need to specify the master url")
    }
    val masterURL = args(0)
    println("Setting up Spark context at: " + masterURL)
    val sparkConf = new SparkConf
    val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf)

    println("Performing SQL tests")

    val sqlContext = new SQLContext(sc)
    println("SQL context set up")
    val df = sqlContext.read.json("/tmp/people.json")
    df.show()
    println("Selecting everyone's age and adding one to it")
    df.select(df("name"), df("age") + 1).show()
    println("Showing all people over the age of 21")
    df.filter(df("age") > 21).show()
    println("Counting people by age")
    df.groupBy("age").count().show()
  }
}
{code}

Instead of getting

+----+-----+
| age|count|
+----+-----+
|null|    1|
|  19|    1|
|  30|    1|
+----+-----+

I get the following with my mixed endian set up:

+-------------------+-----------------+
|                age|            count|
+-------------------+-----------------+
|               null|                1|
|1369094286720630784|72057594037927936|
|                 30|                1|
+-------------------+-----------------+

and on another run:

+-------------------+-----------------+
|                age|            count|
+-------------------+-----------------+
|                  0|72057594037927936|
|                 19|                1|

  was:
In Platform.java, methods of Java Unsafe are called directly without considering endianness.
In thread, 'Tungsten in a mixed endian environment', Adam Roberts reported data corruption when "spark.sql.tungsten.enabled" is enabled in mixed endian environment.
Platform.java should take endianness into account.

> Use of Java Unsafe should take endianness into account
> ------------------------------------------------------
>
>                 Key: SPARK-12778
>                 URL: https://issues.apache.org/jira/browse/SPARK-12778
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>            Reporter: Ted Yu
>
> In Platform.java, methods of Java Unsafe are called directly without
> considering endianness.
> In thread, 'Tungsten in a mixed endian environment', Adam Roberts reported
> data corruption when "spark.sql.tungsten.enabled" is enabled in mixed endian
> environment.
> Platform.java should take endianness into account.
> Below is a copy of Adam's report:
> I've been experimenting with DataFrame operations in a mixed endian
> environment - a big endian master with little endian workers. With tungsten
> enabled I'm encountering data corruption issues.
> For example, with this simple test code:
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark._
> import org.apache.spark.sql.SQLContext
> object SimpleSQL {
>   def main(args: Array[String]): Unit = {
>     if (args.length != 1) {
>       println("Not enough args, you need to specify the master url")
>     }
>     val masterURL = args(0)
>     println("Setting up Spark context at: " + masterURL)
>     val sparkConf = new SparkConf
>     val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf)
>     println("Performing SQL tests")
>     val sqlContext = new SQLContext(sc)
>     println("SQL context set up")
>     val df = sqlContext.read.json("/tmp/people.json")
>     df.show()
>     println("Selecting everyone's age and adding one to it")
>     df.select(df("name"), df("age") + 1).show()
>     println("Showing all people over the age of 21")
>     df.filter(df("age") > 21).show()
>     println("Counting people by age")
>     df.groupBy("age").count().show()
>   }
> }
> {code}
> Instead of getting
> +----+-----+
> | age|count|
> +----+-----+
> |null|    1|
> |  19|    1|
> |  30|    1|
> +----+-----+
> I get the following with my mixed endian set up:
> +-------------------+-----------------+
> |                age|            count|
> +-------------------+-----------------+
> |               null|                1|
> |1369094286720630784|72057594037927936|
> |                 30|                1|
> +-------------------+-----------------+
> and on another run:
> +-------------------+-----------------+
> |                age|            count|
> +-------------------+-----------------+
> |                  0|72057594037927936|
> |                 19|                1|
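
A note on the corrupted values in the report, offered as an illustration rather than a confirmed diagnosis: 72057594037927936 equals 2^56, which is what the 64-bit value 1 becomes when its eight bytes are reversed, and 1369094286720630784 equals 19 * 2^56, the byte-reversed form of 19. That pattern is what one would expect if a long is written on a machine of one endianness and read back as raw bytes on a machine of the other. The sketch below reproduces the arithmetic with plain JDK classes only. The class name EndianSwapDemo and the helper writeThenRead are hypothetical names chosen for this example; nothing here is taken from Platform.java or from any eventual fix.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class, not part of Spark.
public class EndianSwapDemo {

    // Writes `value` into an 8-byte buffer using the `writer` byte order,
    // then reinterprets the same bytes using the `reader` byte order.
    static long writeThenRead(long value, ByteOrder writer, ByteOrder reader) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
        buf.order(writer).putLong(0, value);
        return buf.order(reader).getLong(0);
    }

    public static void main(String[] args) {
        // Reversing the bytes of the expected values yields the reported ones.
        System.out.println(Long.reverseBytes(1L));   // 72057594037927936
        System.out.println(Long.reverseBytes(19L));  // 1369094286720630784

        // Mismatched byte orders between writer and reader corrupt the value.
        System.out.println(
            writeThenRead(19L, ByteOrder.BIG_ENDIAN, ByteOrder.LITTLE_ENDIAN));

        // Agreeing on one byte order on both sides preserves it.
        System.out.println(
            writeThenRead(19L, ByteOrder.BIG_ENDIAN, ByteOrder.BIG_ENDIAN));
    }
}
```

The last call shows the general shape of a remedy: writer and reader agree on a single byte order regardless of the native order of either machine. Whether that is the right approach for Platform.java is for the issue discussion to decide.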