[ https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787357#comment-16787357 ]
Xiao Li commented on SPARK-27097: --------------------------------- This is not a regression but a long-standing issue. > Avoid embedding platform-dependent offsets literally in whole-stage generated > code > ---------------------------------------------------------------------------------- > > Key: SPARK-27097 > URL: https://issues.apache.org/jira/browse/SPARK-27097 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.0 > Reporter: Xiao Li > Assignee: Kris Mok > Priority: Blocker > Labels: correctness > > Avoid embedding platform-dependent offsets literally in whole-stage generated > code. > Spark SQL performs whole-stage code generation to speed up query execution. > There are two steps to it: > Java source code is generated from the physical query plan on the driver. A > single version of the source code is generated from a query plan, and sent to > all executors. > It's compiled to bytecode on the driver to catch compilation errors before > sending to executors, but currently only the generated source code gets sent > to the executors. The bytecode compilation is for fail-fast only. > Executors receive the generated source code and compile to bytecode, then the > query runs like a hand-written Java program. > In this model, there's an implicit assumption about the driver and executors > being run on similar platforms. Some code paths accidentally embedded > platform-dependent object layout information into the generated code, such as: > {code:java} > Platform.putLong(buffer, /* offset */ 24, /* value */ 1); > {code} > This code expects a field to be at offset +24 of the buffer object, and sets > a value to that field. > But whole-stage code generation generally uses platform-dependent information > from the driver. If the object layout is significantly different on the > driver and executors, the generated code can be reading/writing to wrong > offsets on the executors, causing all kinds of data corruption. > One code pattern that leads to such problem is the use of Platform.XXX > constants in generated code, e.g. Platform.BYTE_ARRAY_OFFSET. > Bad: > {code:java} > val baseOffset = Platform.BYTE_ARRAY_OFFSET > // codegen template: > s"Platform.putLong($buffer, $baseOffset, $value);" > This will embed the value of Platform.BYTE_ARRAY_OFFSET on the driver into > the generated code. > {code} > Good: > {code:java} > val baseOffset = "Platform.BYTE_ARRAY_OFFSET" > // codegen template: > s"Platform.putLong($buffer, $baseOffset, $value);" > This will generate the offset symbolically -- Platform.putLong(buffer, > Platform.BYTE_ARRAY_OFFSET, value), which will be able to pick up the correct > value on the executors. > {code} > Caveat: these offset constants are declared as runtime-initialized static > final in Java, so they're not compile-time constants from the Java language's > perspective. It does lead to a slightly increased size of the generated code, > but this is necessary for correctness. > NOTE: there can be other patterns that generate platform-dependent code on > the driver which is invalid on the executors. e.g. if the endianness is > different between the driver and the executors, and if some generated code > makes strong assumption about endianness, it would also be problematic. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org