[jira] [Commented] (SPARK-27097) Avoid embedding platform-dependent offsets literally in whole-stage generated code

2020-03-21 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063909#comment-17063909
 ] 

angerszhu commented on SPARK-27097:
---

[~irashid] To be honest, I ran into this problem recently.

[~dbtsai] I have a question. 
We run a self-developed Thrift server program that uses Spark as its compute 
engine, with the javaOptions parameters below:
 
-Xmx64g
-Djava.library.path=/home/hadoop/hadoop/lib/native
-Djavax.security.auth.useSubjectCredsOnly=false
-Dcom.sun.management.jmxremote.port=9021
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-XX:MaxPermSize=1024m -XX:PermSize=256m
-XX:MaxDirectMemorySize=8192m -XX:-TraceClassUnloading
-XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSClassUnloadingEnabled -XX:+UseCMSCompactAtFullCollection
-XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSParallelRemarkEnabled
-XX:+DisableExplicitGC -XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
-Xnoclassgc -XX:+PrintGCDetails -XX:+PrintGCDateStamps

With these options, Platform.BYTE_ARRAY_OFFSET is 24, whereas on a normal 
Spark Thrift server the value is 16. This mismatch causes strange data 
corruption. 
After a few days of checking, I traced the problem to Spark *codegen*, and 
this PR fixes it for us, but I can't find evidence for why 
Platform.BYTE_ARRAY_OFFSET is 24 with the parameters above. In my local tests:
-XX:+UseCompressedOops (pointer compression on) gives an offset of 16.
-XX:-UseCompressedOops (pointer compression off) gives an offset of 24.
It is easy to understand why the offset differs between these two cases. 
But I don't know why the parameters above give 24, since I am not an expert 
in the Java compiler or computer fundamentals.
 
Can you give me some advice or information on how to understand and find 
the root cause?
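
One way to compare the two JVMs is to print the byte array base offset directly. The sketch below is not from this thread: the class name ArrayOffsetCheck is made up for illustration, and it assumes a HotSpot JVM where sun.misc.Unsafe can be reached through reflection. To my knowledge Spark's Platform.BYTE_ARRAY_OFFSET comes from the same arrayBaseOffset(byte[].class) call, so the printed number should correspond to it.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class ArrayOffsetCheck {
    // Returns the offset of element 0 inside a byte[] object on this JVM.
    static int byteArrayBaseOffset() {
        try {
            // Unsafe has no public constructor; read its singleton field reflectively.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);
            return unsafe.arrayBaseOffset(byte[].class);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not accessible", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("byte[] base offset = " + byteArrayBaseOffset());
    }
}
```

Running it once with the Thrift server's options (for example `java -Xmx64g -XX:+UseCompressedOops ArrayOffsetCheck`) and once with a small heap should show whether the flag is actually in effect. HotSpot generally cannot use compressed oops once the maximum heap exceeds roughly 32GB, which is consistent with DB Tsai's comment on this issue, so the 64g heap is a plausible cause. Adding `-XX:+PrintFlagsFinal` prints the effective value of UseCompressedOops.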
 

> Avoid embedding platform-dependent offsets literally in whole-stage generated 
> code
> --
>
> Key: SPARK-27097
> URL: https://issues.apache.org/jira/browse/SPARK-27097
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 2.1.3, 2.2.3, 2.3.4, 2.4.0
> Reporter: Xiao Li
> Assignee: Kris Mok
> Priority: Critical
> Labels: correctness
> Fix For: 2.4.1
>
>
> Avoid embedding platform-dependent offsets literally in whole-stage generated 
> code.
> Spark SQL performs whole-stage code generation to speed up query execution. 
> There are two steps to it:
> 1. Java source code is generated from the physical query plan on the driver. A 
>    single version of the source code is generated from a query plan, and sent to 
>    all executors. It's compiled to bytecode on the driver to catch compilation 
>    errors before sending to executors, but currently only the generated source 
>    code gets sent to the executors. The bytecode compilation is for fail-fast only.
> 2. Executors receive the generated source code and compile to bytecode, then the 
>    query runs like a hand-written Java program.
> In this model, there's an implicit assumption about the driver and executors 
> being run on similar platforms. Some code paths accidentally embedded 
> platform-dependent object layout information into the generated code, such as:
> {code:java}
> Platform.putLong(buffer, /* offset */ 24, /* value */ 1);
> {code}
> This code expects a field to be at offset +24 of the buffer object, and sets 
> a value to that field.
> But whole-stage code generation generally uses platform-dependent information 
> from the driver. If the object layout is significantly different on the 
> driver and executors, the generated code can be reading/writing to wrong 
> offsets on the executors, causing all kinds of data corruption.
> One code pattern that leads to such a problem is the use of Platform.XXX 
> constants in generated code, e.g. Platform.BYTE_ARRAY_OFFSET.
> Bad:
> {code:java}
> val baseOffset = Platform.BYTE_ARRAY_OFFSET
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> // This embeds the driver's value of Platform.BYTE_ARRAY_OFFSET into 
> // the generated code.
> {code}
> Good:
> {code:java}
> val baseOffset = "Platform.BYTE_ARRAY_OFFSET"
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> This will generate the offset symbolically -- Platform.putLong(buffer, 
> 

[jira] [Commented] (SPARK-27097) Avoid embedding platform-dependent offsets literally in whole-stage generated code

2019-03-10 Thread DB Tsai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788862#comment-16788862
 ] 

DB Tsai commented on SPARK-27097:
-

[~irashid] Initially, I thought this fix addressed differences in endianness 
between platforms. Looking at the test, though, this bug can happen even when 
both the executors and the driver are x86: `UseCompressedOops` is turned off on 
the executors so they can access more than 32GB of heap, while the driver, with 
less memory, runs with the default JVM options and `UseCompressedOops` on. 
Thus, on the driver, references are 32-bit in a 64-bit JVM, resulting in a 
different byte array offset. 
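
The effect of such a mismatch can be imitated without Spark. The following is only a toy model under stated assumptions: a ByteBuffer stands in for the raw memory of an executor-side byte[], the offsets 16 and 24 are the values reported in this issue, and OffsetMismatchDemo and writeThenRead are hypothetical names.

```java
import java.nio.ByteBuffer;

public class OffsetMismatchDemo {
    // Assumed layouts: 16 header bytes on the driver, 24 on the executors.
    static final int DRIVER_BASE = 16;
    static final int EXECUTOR_BASE = 24;

    // Models one executor-side byte[] as raw memory (header followed by 8
    // payload bytes), writes 42 at writeOffset, then reads the first payload
    // long at the offset the executor's own runtime would use.
    static long writeThenRead(int writeOffset) {
        ByteBuffer memory = ByteBuffer.allocate(EXECUTOR_BASE + 8); // zero-filled
        memory.putLong(writeOffset, 42L);
        return memory.getLong(EXECUTOR_BASE);
    }

    public static void main(String[] args) {
        // Write at the driver's literal offset: the payload never sees the value.
        System.out.println("literal driver offset: " + writeThenRead(DRIVER_BASE));
        // Write at the executor's own offset: the value lands where it is read.
        System.out.println("executor's own offset: " + writeThenRead(EXECUTOR_BASE));
    }
}
```

In this model the first call returns 0 and the second returns 42. On a real JVM the misplaced write would land in the object header or in neighbouring data instead of a harmless buffer, which is where the corruption comes from.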

> Avoid embedding platform-dependent offsets literally in whole-stage generated 
> code
> --
>
> Key: SPARK-27097
> URL: https://issues.apache.org/jira/browse/SPARK-27097
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Xiao Li
> Assignee: Kris Mok
> Priority: Critical
> Labels: correctness
> Fix For: 2.4.1
>
>
> Avoid embedding platform-dependent offsets literally in whole-stage generated 
> code.
> Spark SQL performs whole-stage code generation to speed up query execution. 
> There are two steps to it:
> 1. Java source code is generated from the physical query plan on the driver. A 
>    single version of the source code is generated from a query plan, and sent to 
>    all executors. It's compiled to bytecode on the driver to catch compilation 
>    errors before sending to executors, but currently only the generated source 
>    code gets sent to the executors. The bytecode compilation is for fail-fast only.
> 2. Executors receive the generated source code and compile to bytecode, then the 
>    query runs like a hand-written Java program.
> In this model, there's an implicit assumption about the driver and executors 
> being run on similar platforms. Some code paths accidentally embedded 
> platform-dependent object layout information into the generated code, such as:
> {code:java}
> Platform.putLong(buffer, /* offset */ 24, /* value */ 1);
> {code}
> This code expects a field to be at offset +24 of the buffer object, and sets 
> a value to that field.
> But whole-stage code generation generally uses platform-dependent information 
> from the driver. If the object layout is significantly different on the 
> driver and executors, the generated code can be reading/writing to wrong 
> offsets on the executors, causing all kinds of data corruption.
> One code pattern that leads to such a problem is the use of Platform.XXX 
> constants in generated code, e.g. Platform.BYTE_ARRAY_OFFSET.
> Bad:
> {code:java}
> val baseOffset = Platform.BYTE_ARRAY_OFFSET
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> // This embeds the driver's value of Platform.BYTE_ARRAY_OFFSET into 
> // the generated code.
> {code}
> Good:
> {code:java}
> val baseOffset = "Platform.BYTE_ARRAY_OFFSET"
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> // This generates the offset symbolically:
> //   Platform.putLong(buffer, Platform.BYTE_ARRAY_OFFSET, value)
> // which picks up the correct value on the executors.
> {code}
> Caveat: these offset constants are declared as runtime-initialized static 
> final in Java, so they're not compile-time constants from the Java language's 
> perspective. It does lead to a slightly increased size of the generated code, 
> but this is necessary for correctness.
> NOTE: there can be other patterns that generate platform-dependent code on 
> the driver which is invalid on the executors. For example, if the endianness 
> differs between the driver and the executors, and some generated code makes 
> strong assumptions about endianness, that would also be problematic.
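
The Bad and Good templates quoted above can be imitated in plain Java to see what text each one ships to the executors. This is a sketch for illustration only, not Spark's codegen: OffsetTemplate, literal and symbolic are hypothetical names, and the offset passed to literal stands for whatever value the driver JVM happens to report.

```java
public class OffsetTemplate {
    // Bad: the driver's numeric offset is baked into the generated source text.
    static String literal(String buffer, long driverOffset, String value) {
        return "Platform.putLong(" + buffer + ", " + driverOffset + ", " + value + ");";
    }

    // Good: only the constant's name is emitted, so each executor resolves
    // Platform.BYTE_ARRAY_OFFSET against its own JVM when the code runs.
    static String symbolic(String buffer, String value) {
        return "Platform.putLong(" + buffer + ", Platform.BYTE_ARRAY_OFFSET, " + value + ");";
    }

    public static void main(String[] args) {
        System.out.println(literal("buffer", 16L, "value"));
        System.out.println(symbolic("buffer", "value"));
    }
}
```

The first line printed is `Platform.putLong(buffer, 16, value);`, which is only correct on JVMs whose byte array base offset is 16. The second keeps the name `Platform.BYTE_ARRAY_OFFSET` in the source, at the cost of the slightly larger generated code mentioned in the caveat.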



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27097) Avoid embedding platform-dependent offsets literally in whole-stage generated code

2019-03-08 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788369#comment-16788369
 ] 

Imran Rashid commented on SPARK-27097:
--

I'm kind of amazed Spark works at all on different platforms.  As you note, 
endianness probably cannot be different.  What kind of platform difference 
results in this issue?  Is it different versions of the JVM?  I'd also be 
amazed if that worked properly.

I'm not saying we shouldn't fix this if it's easy, but maybe we should clarify 
how different the "platform" can be between containers in a Spark app.

> Avoid embedding platform-dependent offsets literally in whole-stage generated 
> code
> --
>
> Key: SPARK-27097
> URL: https://issues.apache.org/jira/browse/SPARK-27097
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Xiao Li
> Assignee: Kris Mok
> Priority: Critical
> Labels: correctness
>






[jira] [Commented] (SPARK-27097) Avoid embedding platform-dependent offsets literally in whole-stage generated code

2019-03-07 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787356#comment-16787356
 ] 

Xiao Li commented on SPARK-27097:
-

The fix will be pushed soon. 

> Avoid embedding platform-dependent offsets literally in whole-stage generated 
> code
> --
>
> Key: SPARK-27097
> URL: https://issues.apache.org/jira/browse/SPARK-27097
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Xiao Li
> Assignee: Kris Mok
> Priority: Blocker
> Labels: correctness
>






[jira] [Commented] (SPARK-27097) Avoid embedding platform-dependent offsets literally in whole-stage generated code

2019-03-07 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787357#comment-16787357
 ] 

Xiao Li commented on SPARK-27097:
-

This is not a regression but a long-standing issue. 

> Avoid embedding platform-dependent offsets literally in whole-stage generated 
> code
> --
>
> Key: SPARK-27097
> URL: https://issues.apache.org/jira/browse/SPARK-27097
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Xiao Li
> Assignee: Kris Mok
> Priority: Blocker
> Labels: correctness
>


