How big of a deal is this use case in a heterogeneous endianness
environment? If we do want to fix it, we should do it right before Spark
shuffles data, to minimize the performance penalty, i.e. turn big-endian
encoded data into little-endian encoded data before it goes on the wire.
This is a pretty involved change, and given the other things that might
break across heterogeneous endianness environments, I am not sure it is
high priority enough to warrant review bandwidth right now.

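For what it's worth, the corrupted values in Adam's output are exactly
byte-swapped 64-bit longs, which supports the endianness diagnosis:
Long.reverseBytes(19L) is 1369094286720630784 and Long.reverseBytes(1L)
is 72057594037927936, matching the two garbled cells. Below is a minimal
sketch of the kind of pre-shuffle normalization I mean; the
normalizeWords helper and the fixed little-endian wire order are my own
illustration, not an existing Spark API, and a real fix could not blindly
swap every word, since UnsafeRow mixes fixed-width fields with raw bytes
such as UTF-8 strings:

import java.nio.{ByteBuffer, ByteOrder}

object EndianNormalization {
  // Hypothetical hook: rewrite each 8-byte word of a row buffer into a
  // fixed little-endian wire order before it is handed to the shuffle.
  def normalizeWords(row: Array[Byte]): Array[Byte] = {
    if (ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN) {
      row // host already matches the wire order, nothing to do
    } else {
      val in = ByteBuffer.wrap(row) // ByteBuffer reads big-endian by default
      val out = ByteBuffer.allocate(row.length).order(ByteOrder.LITTLE_ENDIAN)
      while (in.remaining() >= 8) out.putLong(in.getLong()) // swap one word
      while (in.hasRemaining) out.put(in.get()) // copy trailing bytes as-is
      out.array()
    }
  }
}

The receiving side would need the mirror-image conversion back into its
native order, which is why confining this to the shuffle boundary keeps
the penalty contained, but also part of why the change is involved.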

On Tue, Jan 12, 2016 at 7:30 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> I logged SPARK-12778, where endian awareness in Platform.java should help
> in a mixed endian setup.
>
> There could be other parts of the code base which are related.
>
> Cheers
>
> On Tue, Jan 12, 2016 at 7:01 AM, Adam Roberts <arobe...@uk.ibm.com> wrote:
>
>> Hi all, I've been experimenting with DataFrame operations in a mixed
>> endian environment - a big endian master with little endian workers. With
>> Tungsten enabled, I'm encountering data corruption issues.
>>
>> For example, with this simple test code:
>>
>> import org.apache.spark._
>> import org.apache.spark.sql.SQLContext
>>
>> object SimpleSQL {
>>   def main(args: Array[String]): Unit = {
>>     if (args.length != 1) {
>>       println("Not enough args, you need to specify the master URL")
>>       sys.exit(1)
>>     }
>>     val masterURL = args(0)
>>     println("Setting up Spark context at: " + masterURL)
>>     val sparkConf = new SparkConf
>>     val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf)
>>
>>     println("Performing SQL tests")
>>
>>     val sqlContext = new SQLContext(sc)
>>     println("SQL context set up")
>>     val df = sqlContext.read.json("/tmp/people.json")
>>     df.show()
>>     println("Selecting everyone's age and adding one to it")
>>     df.select(df("name"), df("age") + 1).show()
>>     println("Showing all people over the age of 21")
>>     df.filter(df("age") > 21).show()
>>     println("Counting people by age")
>>     df.groupBy("age").count().show()
>>   }
>> }
>>
>> Instead of getting
>>
>> +----+-----+
>> | age|count|
>> +----+-----+
>> |null|    1|
>> |  19|    1|
>> |  30|    1|
>> +----+-----+
>>
>> I get the following with my mixed endian setup:
>>
>> +-------------------+-----------------+
>> |                age|            count|
>> +-------------------+-----------------+
>> |               null|                1|
>> |1369094286720630784|72057594037927936|
>> |                 30|                1|
>> +-------------------+-----------------+
>>
>> and on another run:
>>
>> +-------------------+-----------------+
>> |                age|            count|
>> +-------------------+-----------------+
>> |                  0|72057594037927936|
>> |                 19|                1|
>>
>> Is Spark expected to work in such an environment? If I turn off Tungsten
>> (sparkConf.set("spark.sql.tungsten.enabled", "false")), I don't see any
>> problems in 20 runs.
>>
>
>
