[ https://issues.apache.org/jira/browse/SPARK-12778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated SPARK-12778:
---------------------------
    Description: 
In Platform.java, methods of Java Unsafe are called directly without regard to 
endianness.

In the mailing-list thread 'Tungsten in a mixed endian environment', Adam 
Roberts reported data corruption when "spark.sql.tungsten.enabled" is set in a 
mixed-endian environment.

Platform.java should take endianness into account.
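One possible shape for the fix (a minimal sketch, not Spark's actual patch): pick a single fixed byte order for the Tungsten binary format, probe the JVM's native order via java.nio.ByteOrder, and byte-swap after each raw Unsafe read when the two disagree. The little-endian choice and the method name getLongLE below are assumptions for illustration; the raw Unsafe.getLong call is stood in for by a plain long so the sketch is self-contained:

```java
import java.nio.ByteOrder;

// Minimal sketch, not Spark's actual patch: make long reads independent
// of the JVM's native byte order by fixing the serialized format's order
// (little-endian here -- an assumption) and swapping when they differ.
public class EndianAwareReads {
    private static final boolean NEEDS_SWAP =
        ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN;

    // "raw" stands in for the value Unsafe.getLong would return;
    // the wrapper normalizes it to the fixed little-endian format.
    static long getLongLE(long raw) {
        return NEEDS_SWAP ? Long.reverseBytes(raw) : raw;
    }

    public static void main(String[] args) {
        System.out.println("native order: " + ByteOrder.nativeOrder());
        System.out.println("getLongLE(19) = " + getLongLE(19L));
    }
}
```

On a little-endian JVM the wrapper is a no-op, so the common x86 path pays no swap cost; only big-endian JVMs reverse bytes.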

Below is a copy of Adam's report:

I've been experimenting with DataFrame operations in a mixed endian environment 
- a big endian master with little endian workers. With Tungsten enabled I'm 
encountering data corruption issues. 

For example, with this simple test code: 
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SimpleSQL {
  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println("Not enough args, you need to specify the master URL")
      sys.exit(1)
    }
    val masterURL = args(0)
    println("Setting up Spark context at: " + masterURL)
    val sparkConf = new SparkConf
    val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf)

    println("Performing SQL tests")

    val sqlContext = new SQLContext(sc)
    println("SQL context set up")
    val df = sqlContext.read.json("/tmp/people.json")
    df.show()
    println("Selecting everyone's age and adding one to it")
    df.select(df("name"), df("age") + 1).show()
    println("Showing all people over the age of 21")
    df.filter(df("age") > 21).show()
    println("Counting people by age")
    df.groupBy("age").count().show()

    sc.stop()
  }
}
{code}
Instead of getting 

+----+-----+
| age|count|
+----+-----+
|null|    1|
|  19|    1|
|  30|    1|
+----+-----+ 

I get the following with my mixed endian setup: 

+-------------------+-----------------+
|                age|            count|
+-------------------+-----------------+
|               null|                1|
|1369094286720630784|72057594037927936|
|                 30|                1|
+-------------------+-----------------+ 

and on another run: 

+-------------------+-----------------+
|                age|            count|
+-------------------+-----------------+
|                  0|72057594037927936|
|                 19|                1| 
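The corrupted values are consistent with the endianness diagnosis: they are exactly the 8-byte-swapped encodings of the expected values 19 and 1. A quick check (not part of the original report):

```java
// Quick check, not from the original report: the garbage values above
// are the expected ones with their 8 bytes reversed.
public class SwapCheck {
    public static void main(String[] args) {
        // 19 (0x0000000000000013) reversed: 0x13 moves to the top byte.
        System.out.println(Long.reverseBytes(19L)); // 1369094286720630784
        // 1 reversed: the low byte becomes the top byte, i.e. 2^56.
        System.out.println(Long.reverseBytes(1L));  // 72057594037927936
    }
}
```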


  was:
In Platform.java, methods of Java Unsafe are called directly without 
considering endianness.

In thread, 'Tungsten in a mixed endian environment', Adam Roberts reported data 
corruption when "spark.sql.tungsten.enabled" is enabled in mixed endian 
environment.

Platform.java should take endianness into account.


> Use of Java Unsafe should take endianness into account
> ------------------------------------------------------
>
>                 Key: SPARK-12778
>                 URL: https://issues.apache.org/jira/browse/SPARK-12778
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>            Reporter: Ted Yu
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
