[ https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063909#comment-17063909 ]
angerszhu commented on SPARK-27097:
-----------------------------------

[~irashid] To be honest, I have run into this problem recently.

[~dbtsai] I have some questions. We run a self-developed Thrift server program that uses Spark as its compute engine, started with the following javaOptions:

-Xmx64g
-Djava.library.path=/home/hadoop/hadoop/lib/native
-Djavax.security.auth.useSubjectCredsOnly=false
-Dcom.sun.management.jmxremote.port=9021
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-XX:MaxPermSize=1024m -XX:PermSize=256m -XX:MaxDirectMemorySize=8192m -XX:-TraceClassUnloading
-XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC -XX:+PrintTenuringDistribution -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 -Xnoclassgc -XX:+PrintGCDetails -XX:+PrintGCDateStamps

With these options, Platform.BYTE_ARRAY_OFFSET is 24, whereas on a normal Spark Thrift server it is 16. This mismatch causes strange data corruption. After a few days of investigation, I traced the problem to Spark *codegen*, and this PR fixes it for us, but I can't find an explanation for why Platform.BYTE_ARRAY_OFFSET is 24 with the options above. In my local tests:

-XX:+UseCompressedOops (pointer compression on): the offset is 16.
-XX:-UseCompressedOops (pointer compression off): the offset is 24.

It is easy to understand why the offset differs between these two cases.
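
The local test described above can be reproduced with a small standalone program. This is only a sketch: the class OffsetProbe and its method names are hypothetical and not part of Spark. It reads the base offset of byte[] through sun.misc.Unsafe, which is, as far as I know, where Platform.BYTE_ARRAY_OFFSET gets its value, and it also prints the UseCompressedOops value the JVM actually ended up with.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.lang.reflect.Field;
import sun.misc.Unsafe;

/** Hypothetical diagnostic helper; not part of Spark. */
public final class OffsetProbe {

    /** Returns the JVM-reported base offset of the first element of a byte[]. */
    public static int byteArrayBaseOffset() {
        try {
            // sun.misc.Unsafe exposes its singleton only through a private static field.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);
            return unsafe.arrayBaseOffset(byte[].class);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not accessible", e);
        }
    }

    /** Returns the effective UseCompressedOops value, or "unknown" if it cannot be queried. */
    public static String effectiveCompressedOops() {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption("UseCompressedOops").getValue();
        } catch (RuntimeException e) {
            // Non-HotSpot JVMs, or JVMs without this flag, end up here.
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("UseCompressedOops (effective) = " + effectiveCompressedOops());
        System.out.println("byte[] base offset            = " + byteArrayBaseOffset());
    }
}
```

Running it once per flag setting, and then once with the full javaOptions listed above, shows which case the server falls into. The effective flag is printed because it can differ from what was requested on the command line; for example, HotSpot turns compressed oops off when the maximum heap is too large for them.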
But I don't know why the options above give 24, even though they include -XX:+UseCompressedOops; I am not an expert in the Java compiler or in low-level computer internals. Can you give me some advice or pointers on how to understand this and find the root cause?

> Avoid embedding platform-dependent offsets literally in whole-stage generated code
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-27097
>                 URL: https://issues.apache.org/jira/browse/SPARK-27097
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.3, 2.2.3, 2.3.4, 2.4.0
>            Reporter: Xiao Li
>            Assignee: Kris Mok
>            Priority: Critical
>              Labels: correctness
>             Fix For: 2.4.1
>
>
> Avoid embedding platform-dependent offsets literally in whole-stage generated code.
> Spark SQL performs whole-stage code generation to speed up query execution. There are two steps to it:
> * Java source code is generated from the physical query plan on the driver. A single version of the source code is generated from a query plan, and sent to all executors.
> ** It's compiled to bytecode on the driver to catch compilation errors before sending to executors, but currently only the generated source code gets sent to the executors. The bytecode compilation is for fail-fast only.
> * Executors receive the generated source code and compile to bytecode, then the query runs like a hand-written Java program.
> In this model, there's an implicit assumption about the driver and executors being run on similar platforms. Some code paths accidentally embedded platform-dependent object layout information into the generated code, such as:
> {code:java}
> Platform.putLong(buffer, /* offset */ 24, /* value */ 1);
> {code}
> This code expects a field to be at offset +24 of the buffer object, and sets a value to that field.
> But whole-stage code generation generally uses platform-dependent information from the driver.
> If the object layout is significantly different on the driver and executors, the generated code can be reading/writing to wrong offsets on the executors, causing all kinds of data corruption.
> One code pattern that leads to such problems is the use of Platform.XXX constants in generated code, e.g. Platform.BYTE_ARRAY_OFFSET.
> Bad:
> {code:java}
> val baseOffset = Platform.BYTE_ARRAY_OFFSET
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> {code}
> This will embed the value of Platform.BYTE_ARRAY_OFFSET on the driver into the generated code.
> Good:
> {code:java}
> val baseOffset = "Platform.BYTE_ARRAY_OFFSET"
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> {code}
> This will generate the offset symbolically, as Platform.putLong(buffer, Platform.BYTE_ARRAY_OFFSET, value), which will be able to pick up the correct value on the executors.
> Caveat: these offset constants are declared as runtime-initialized static final in Java, so they're not compile-time constants from the Java language's perspective. It does lead to a slightly increased size of the generated code, but this is necessary for correctness.
> NOTE: there can be other patterns that generate platform-dependent code on the driver which is invalid on the executors. For example, if the endianness is different between the driver and the executors, and some generated code makes strong assumptions about endianness, it would also be problematic.
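
The difference between the Bad and Good patterns in the issue description can be sketched in plain Java. This is an illustration only: OffsetCodegenSketch, literalTemplate, and symbolicTemplate are hypothetical names, not Spark APIs, and the driver offset of 16 is an assumed value. The point is that the first template bakes the driver's number into the source text, while the second emits the constant's name so that whichever JVM compiles the source resolves it.

```java
/** Hypothetical sketch of literal vs. symbolic offsets in generated source; not Spark code. */
public final class OffsetCodegenSketch {

    /** Bad: the driver's offset value is spliced into the generated source as a number. */
    public static String literalTemplate(String buffer, int driverOffset, String value) {
        return "Platform.putLong(" + buffer + ", " + driverOffset + ", " + value + ");";
    }

    /** Good: the generated source refers to the constant by name. */
    public static String symbolicTemplate(String buffer, String value) {
        return "Platform.putLong(" + buffer + ", Platform.BYTE_ARRAY_OFFSET, " + value + ");";
    }

    public static void main(String[] args) {
        int driverOffset = 16; // assumed value observed on the driver JVM

        // Same text is shipped to every executor, so an executor whose real
        // offset is 24 would still write at 16.
        System.out.println(literalTemplate("buffer", driverOffset, "1L"));

        // Each executor looks up its own Platform.BYTE_ARRAY_OFFSET when the
        // compiled code runs.
        System.out.println(symbolicTemplate("buffer", "1L"));
    }
}
```

In the scenario from the comment above, a driver with offset 16 and an executor with offset 24 would disagree only under the literal template; the symbolic template produces identical source that behaves correctly on both.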