Hi all, I've been experimenting with DataFrame operations in a mixed-endian
environment: a big-endian master with little-endian workers. With Tungsten
enabled I'm encountering data corruption issues.
For example, with this simple test code:
import org.apache.spark._
import org.apache.spark.sql.SQLContext

object SimpleSQL {
  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println("Not enough args, you need to specify the master URL")
      sys.exit(1)
    }
    val masterURL = args(0)
    println("Setting up Spark context at: " + masterURL)
    val sparkConf = new SparkConf
    val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf)

    println("Performing SQL tests")
    val sqlContext = new SQLContext(sc)
    println("SQL context set up")

    // Standard DataFrame operations over a small JSON file
    val df = sqlContext.read.json("/tmp/people.json")
    df.show()

    println("Selecting everyone's age and adding one to it")
    df.select(df("name"), df("age") + 1).show()

    println("Showing all people over the age of 21")
    df.filter(df("age") > 21).show()

    println("Counting people by age")
    df.groupBy("age").count().show()
  }
}
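For reference, /tmp/people.json is the three-record people.json that ships
with the Spark examples (an assumption on my part, but it matches the
expected counts below):

  {"name":"Michael"}
  {"name":"Andy", "age":30}
  {"name":"Justin", "age":19}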
Instead of getting
+----+-----+
| age|count|
+----+-----+
|null| 1|
| 19| 1|
| 30| 1|
+----+-----+
I get the following with my mixed-endian setup:
+-------------------+-----------------+
| age| count|
+-------------------+-----------------+
| null| 1|
|1369094286720630784|72057594037927936|
| 30| 1|
+-------------------+-----------------+
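The garbled numbers look like byte-swapped longs rather than random
corruption; a quick sanity check of my own in a Scala REPL:

  // Reversing the byte order of the expected values reproduces the
  // corrupted ones exactly, which points at an endianness mismatch.
  java.lang.Long.reverseBytes(19L)  // == 1369094286720630784
  java.lang.Long.reverseBytes(1L)   // == 72057594037927936 (2^56)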
and on another run:
+-------------------+-----------------+
| age| count|
+-------------------+-----------------+
| 0|72057594037927936|
|                 19|                1|
+-------------------+-----------------+
Is Spark expected to work in such an environment? If I turn Tungsten off
(sparkConf.set("spark.sql.tungsten.enabled", "false")), I don't see any
problems in 20 runs.
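That is, setting it on the conf before the context is created:

  val sparkConf = new SparkConf
  // fall back to the pre-Tungsten code path for SQL execution
  sparkConf.set("spark.sql.tungsten.enabled", "false")
  val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf)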