[ https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063909#comment-17063909 ]

angerszhu commented on SPARK-27097:
-----------------------------------

[~irashid] To be honest, I ran into this problem in the past few days.

 

[~dbtsai] I have a question. 
We run a self-developed Thrift server program that uses Spark as the compute 
engine, started with the following JVM options:
 
{code}
-Xmx64g
-Djava.library.path=/home/hadoop/hadoop/lib/native
-Djavax.security.auth.useSubjectCredsOnly=false
-Dcom.sun.management.jmxremote.port=9021
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-XX:MaxPermSize=1024m -XX:PermSize=256m
-XX:MaxDirectMemorySize=8192m -XX:-TraceClassUnloading
-XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSClassUnloadingEnabled -XX:+UseCMSCompactAtFullCollection
-XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSParallelRemarkEnabled
-XX:+DisableExplicitGC -XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
-Xnoclassgc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
{code}
With these options, Platform.BYTE_ARRAY_OFFSET is 24, while when we start a 
normal Spark Thrift Server it is 16, and this discrepancy caused strange data 
corruption. 
After a few days of investigation I traced the problem to Spark *codegen*, and 
this PR fixes our problem, but I can't find evidence for why 
Platform.BYTE_ARRAY_OFFSET is 24 under the options above. Testing locally, with 
-XX:+UseCompressedOops (pointer compression enabled) the offset is 16, and with 
-XX:-UseCompressedOops (pointer compression disabled) it is 24, so it is easy 
to see why the two offsets differ. But I don't understand why the options 
above yield 24 even though they include -XX:+UseCompressedOops, as I am no 
expert on the JVM or its object layout.
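For reference, a minimal sketch of the local check (hypothetical class name; 
it assumes Java 8, where sun.misc.Unsafe is reachable via reflection). It 
prints the same base offset that Spark's Platform.BYTE_ARRAY_OFFSET captures 
at class initialization:
{code:java}
import java.lang.reflect.Field;

import sun.misc.Unsafe;

public class ByteArrayOffsetCheck {
  public static void main(String[] args) throws Exception {
    // Grab the singleton Unsafe instance reflectively (works on Java 8).
    Field f = Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    Unsafe unsafe = (Unsafe) f.get(null);
    // Prints 16 when compressed oops are active, 24 when they are not.
    // Note: HotSpot turns compressed oops off when the max heap is too
    // large for them (roughly 32 GB), even if -XX:+UseCompressedOops is
    // passed, so -Xmx64g alone is enough to change this value.
    System.out.println("byte[] base offset: "
        + unsafe.arrayBaseOffset(byte[].class));
  }
}
{code}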
 
Could you give me some advice or pointers on how to understand this and find 
the root cause?
 

> Avoid embedding platform-dependent offsets literally in whole-stage generated 
> code
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-27097
>                 URL: https://issues.apache.org/jira/browse/SPARK-27097
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.3, 2.2.3, 2.3.4, 2.4.0
>            Reporter: Xiao Li
>            Assignee: Kris Mok
>            Priority: Critical
>              Labels: correctness
>             Fix For: 2.4.1
>
>
> Avoid embedding platform-dependent offsets literally in whole-stage generated 
> code.
> Spark SQL performs whole-stage code generation to speed up query execution. 
> There are two steps to it:
> 1. Java source code is generated from the physical query plan on the driver. 
> A single version of the source code is generated from a query plan and sent 
> to all executors. The source is also compiled to bytecode on the driver to 
> catch compilation errors early, but currently only the generated source code 
> gets sent to the executors; the driver-side bytecode compilation is for 
> fail-fast only.
> 2. Executors receive the generated source code and compile it to bytecode, 
> then the query runs like a hand-written Java program.
> In this model, there's an implicit assumption about the driver and executors 
> being run on similar platforms. Some code paths accidentally embedded 
> platform-dependent object layout information into the generated code, such as:
> {code:java}
> Platform.putLong(buffer, /* offset */ 24, /* value */ 1);
> {code}
> This code expects a field at offset +24 of the buffer object, and writes a 
> value to that field.
> But whole-stage code generation generally uses platform-dependent information 
> from the driver. If the object layout differs significantly between the 
> driver and the executors, the generated code can read and write at the wrong 
> offsets on the executors, causing all kinds of data corruption.
> One code pattern that leads to such problems is the use of Platform.XXX 
> constants in generated code, e.g. Platform.BYTE_ARRAY_OFFSET.
> Bad:
> {code:java}
> val baseOffset = Platform.BYTE_ARRAY_OFFSET
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> {code}
> This will embed the value of Platform.BYTE_ARRAY_OFFSET on the driver into 
> the generated code.
> Good:
> {code:java}
> val baseOffset = "Platform.BYTE_ARRAY_OFFSET"
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> {code}
> This will generate the offset symbolically -- Platform.putLong(buffer, 
> Platform.BYTE_ARRAY_OFFSET, value) -- which will pick up the correct value 
> on the executors.
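> Concretely (an illustrative sketch; the inlined value 16 assumes a driver 
> running with compressed oops enabled), the two templates emit:
> {code:java}
> // From the "Bad" template: the driver's offset is baked in as a literal,
> // which is wrong on an executor whose byte[] base offset is 24.
> Platform.putLong(buffer, 16, value);
>
> // From the "Good" template: the offset is resolved on each executor.
> Platform.putLong(buffer, Platform.BYTE_ARRAY_OFFSET, value);
> {code}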
> Caveat: these offset constants are declared as runtime-initialized static 
> final fields in Java, so they're not compile-time constants from the Java 
> language's perspective. Referencing them symbolically slightly increases the 
> size of the generated code, but this is necessary for correctness.
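> A minimal sketch of why (a hypothetical class, mirroring the field name in 
> Spark's Platform): the initializer below runs when the class is loaded, so a 
> symbolic reference reads whatever value the executor's own JVM computed, 
> whereas javac only inlines true constant expressions such as literals.
> {code:java}
> // Hypothetical stripped-down mirror of Spark's Platform class (Java 8).
> public final class MiniPlatform {
>   // Runtime-initialized: not a constant expression, so javac cannot
>   // inline it at call sites; each JVM computes its own value on load.
>   public static final int BYTE_ARRAY_OFFSET =
>       sun.misc.Unsafe.ARRAY_BYTE_BASE_OFFSET;
>
>   // A true compile-time constant, by contrast, WOULD be inlined into
>   // the bytecode of any class that references it.
>   public static final int A_REAL_CONSTANT = 16;
> }
> {code}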
> NOTE: there can be other patterns that generate platform-dependent code on 
> the driver which is invalid on the executors, e.g. if the endianness differs 
> between the driver and the executors and some generated code makes strong 
> assumptions about endianness, it would also be problematic.
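> For instance (an illustrative sketch, not code from Spark), a template would 
> need to consult the byte order at runtime rather than embedding the driver's 
> answer:
> {code:java}
> import java.nio.ByteOrder;
>
> public class EndiannessCheck {
>   public static void main(String[] args) {
>     // Evaluated on whichever JVM runs the code; baking the driver's
>     // result into the generated source would misorder bytes on an
>     // executor with the opposite endianness.
>     System.out.println(ByteOrder.nativeOrder());
>   }
> }
> {code}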


