Hi all, I've been experimenting with DataFrame operations in a mixed-endian environment - a big-endian master with little-endian workers. With Tungsten enabled I'm encountering data corruption issues.
For example, with this simple test code:

import org.apache.spark.SparkContext
import org.apache.spark._
import org.apache.spark.sql.SQLContext

object SimpleSQL {
  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println("Not enough args, you need to specify the master url")
      sys.exit(1)
    }
    val masterURL = args(0)
    println("Setting up Spark context at: " + masterURL)
    val sparkConf = new SparkConf
    val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf)

    println("Performing SQL tests")
    val sqlContext = new SQLContext(sc)
    println("SQL context set up")

    val df = sqlContext.read.json("/tmp/people.json")
    df.show()

    println("Selecting everyone's age and adding one to it")
    df.select(df("name"), df("age") + 1).show()

    println("Showing all people over the age of 21")
    df.filter(df("age") > 21).show()

    println("Counting people by age")
    df.groupBy("age").count().show()
  }
}

Instead of getting

+----+-----+
| age|count|
+----+-----+
|null|    1|
|  19|    1|
|  30|    1|
+----+-----+

I get the following with my mixed-endian set up:

+-------------------+-----------------+
|                age|            count|
+-------------------+-----------------+
|               null|                1|
|1369094286720630784|72057594037927936|
|                 30|                1|
+-------------------+-----------------+

and on another run:

+-------------------+-----------------+
|                age|            count|
+-------------------+-----------------+
|                  0|72057594037927936|
|                 19|                1|

Is Spark expected to work in such an environment? If I turn off Tungsten (sparkConf.set("spark.sql.tungsten.enabled", "false")), in 20 runs I don't see any problems.
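For what it's worth, /tmp/people.json is just a copy of the standard people.json that ships with the Spark examples, i.e. the following three records (which is what the expected output above corresponds to):

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}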
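One observation that may help with diagnosis: the corrupted values look like byte-swapped 64-bit longs. A quick check, plain Scala and nothing Spark-specific, just java.lang.Long.reverseBytes applied to the values the correct output should contain:

object EndianCheck {
  def main(args: Array[String]): Unit = {
    // 1L with its bytes reversed is 0x0100000000000000L
    println(java.lang.Long.reverseBytes(1L))   // prints 72057594037927936
    // 19L with its bytes reversed is 0x1300000000000000L
    println(java.lang.Long.reverseBytes(19L))  // prints 1369094286720630784
  }
}

That reproduces exactly the count of 72057594037927936 and the age of 1369094286720630784 shown above, which suggests the long values are being written on one architecture and read back with the opposite byte order.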