[jira] [Resolved] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
     [ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Krogen resolved SPARK-42539.
---------------------------------
    Resolution: Fixed

[~csun] it looks like this didn't get marked as closed / fix-version updated when the PR was merged. I believe this went only into 3.5.0; the original PR went into branch-3.4 but was reverted and the second PR didn't make it to branch-3.4. I've marked the fix version as 3.5.0 but please correct me if I'm wrong here:
{code:java}
> glog apache/branch-3.4 | grep SPARK-42539
* 26009d47c1f 2023-02-28 Revert "[SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client" [Hyukjin Kwon ]
* 40a4019dfc5 2023-02-27 [SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client [Erik Krogen ]
> glog apache/master | grep SPARK-42539
* 2e34427d4f3 2023-03-01 [SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client [Erik Krogen ]
* 5627ceeddb4 2023-02-28 Revert "[SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client" [Hyukjin Kwon ]
* 27ad5830f9a 2023-02-27 [SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client [Erik Krogen ]
{code}

> User-provided JARs can override Spark's Hive metadata client JARs when using
> "builtin"
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-42539
>                 URL: https://issues.apache.org/jira/browse/SPARK-42539
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.3, 3.2.3, 3.3.2
>            Reporter: Erik Krogen
>            Priority: Major
>             Fix For: 3.5.0
>
>
> Recently we observed that on version 3.2.0 and Java 8, it is possible for
> user-provided Hive JARs to break the ability for Spark, via the Hive metadata
> client / {{IsolatedClientLoader}}, to communicate with Hive Metastore, when
> using the default behavior of the "builtin" Hive version.
> After SPARK-35321,
> when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client
> version is used, we will call the method {{Hive.getWithoutRegisterFns()}}
> (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for
> example, {{hive-exec-2.3.8.jar}} on their classpath, the client will break
> with a {{NoSuchMethodError}}. This particular failure mode was resolved in
> 3.2.1 by SPARK-37446, but while investigating, we found a general issue that
> it's possible for user JARs to override Spark's own JARs -- but only inside
> of the IsolatedClientLoader when using "builtin". This happens because even
> when Spark is configured to use the "builtin" Hive classes, it still creates
> a separate URLClassLoader for the HiveClientImpl used for HMS communication.
> To get the set of JAR URLs to use for this classloader, Spark [collects all
> of the JARs used by the user classloader (and its parent, and that
> classloader's parent, and so
> on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438].
> Thus the newly created classloader will have all of the same JARs as the
> user classloader, but the ordering has been reversed! User JARs get
> prioritized ahead of system JARs, because the classloader hierarchy is
> traversed from bottom-to-top. For example let's say we have user JARs
> "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this:
> {code}
> MutableURLClassLoader
> -- foo.jar
> -- hive-exec-2.3.8.jar
> -- parent: URLClassLoader
>     - spark-core_2.12-3.2.0.jar
>     - ...
>     - hive-exec-2.3.9.jar
>     - ...
> {code}
> This setup provides the expected behavior within the user classloader; it
> will first check the parent, so hive-exec-2.3.9.jar takes precedence, and the
> MutableURLClassLoader is only checked if the class doesn't exist in the
> parent.
> But when a JAR list is constructed for the IsolatedClientLoader, it
> traverses the URLs from MutableURLClassLoader first, then its parent, so the
> final list looks like (in order):
> {code}
> URLClassLoader [IsolatedClientLoader]
> -- foo.jar
> -- hive-exec-2.3.8.jar
> -- spark-core_2.12-3.2.0.jar
> -- ...
> -- hive-exec-2.3.9.jar
> -- ...
> -- parent: boot classloader (JVM classes)
> {code}
> Now when a lookup happens, all of the JARs are within the same
> URLClassLoader, and the user JARs are in front of the Spark ones, so the user
> JARs get prioritized. This is the opposite of the expected behavior when
> using the default user/application classloader in Spark, which has
> parent-first behavior, prioritizing the Spark/system classes over the user
> classes. (Note that thi
[jira] [Resolved] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
     [ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-42539.
------------------------------
    Fix Version/s: 3.5.0
       Resolution: Fixed

Issue resolved by pull request 40144
[https://github.com/apache/spark/pull/40144]

> User-provided JARs can override Spark's Hive metadata client JARs when using
> "builtin"
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-42539
>                 URL: https://issues.apache.org/jira/browse/SPARK-42539
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.3, 3.2.3, 3.3.2
>            Reporter: Erik Krogen
>            Assignee: Erik Krogen
>            Priority: Major
>             Fix For: 3.5.0
>
>
> Recently we observed that on version 3.2.0 and Java 8, it is possible for
> user-provided Hive JARs to break the ability for Spark, via the Hive metadata
> client / {{IsolatedClientLoader}}, to communicate with Hive Metastore, when
> using the default behavior of the "builtin" Hive version. After SPARK-35321,
> when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client
> version is used, we will call the method {{Hive.getWithoutRegisterFns()}}
> (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for
> example, {{hive-exec-2.3.8.jar}} on their classpath, the client will break
> with a {{NoSuchMethodError}}. This particular failure mode was resolved in
> 3.2.1 by SPARK-37446, but while investigating, we found a general issue that
> it's possible for user JARs to override Spark's own JARs -- but only inside
> of the IsolatedClientLoader when using "builtin". This happens because even
> when Spark is configured to use the "builtin" Hive classes, it still creates
> a separate URLClassLoader for the HiveClientImpl used for HMS communication.
> To get the set of JAR URLs to use for this classloader, Spark [collects all
> of the JARs used by the user classloader (and its parent, and that
> classloader's parent, and so
> on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438].
> Thus the newly created classloader will have all of the same JARs as the
> user classloader, but the ordering has been reversed! User JARs get
> prioritized ahead of system JARs, because the classloader hierarchy is
> traversed from bottom-to-top. For example, let's say we have user JARs
> "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this:
> {code}
> MutableURLClassLoader
> -- foo.jar
> -- hive-exec-2.3.8.jar
> -- parent: URLClassLoader
>     - spark-core_2.12-3.2.0.jar
>     - ...
>     - hive-exec-2.3.9.jar
>     - ...
> {code}
> This setup provides the expected behavior within the user classloader; it
> will first check the parent, so hive-exec-2.3.9.jar takes precedence, and the
> MutableURLClassLoader is only checked if the class doesn't exist in the
> parent. But when a JAR list is constructed for the IsolatedClientLoader, it
> traverses the URLs from MutableURLClassLoader first, then its parent, so the
> final list looks like (in order):
> {code}
> URLClassLoader [IsolatedClientLoader]
> -- foo.jar
> -- hive-exec-2.3.8.jar
> -- spark-core_2.12-3.2.0.jar
> -- ...
> -- hive-exec-2.3.9.jar
> -- ...
> -- parent: boot classloader (JVM classes)
> {code}
> Now when a lookup happens, all of the JARs are within the same
> URLClassLoader, and the user JARs are in front of the Spark ones, so the user
> JARs get prioritized. This is the opposite of the expected behavior when
> using the default user/application classloader in Spark, which has
> parent-first behavior, prioritizing the Spark/system classes over the user
> classes. (Note that this behavior is correct when using the
> {{ChildFirstURLClassLoader}}.)
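The ordering problem described above can be reproduced outside Spark with plain JDK classloaders. The following is a minimal sketch, not Spark's actual {{HiveUtils}} code: the class name {{FlattenedClasspathOrder}}, the helpers {{collectUrls}}, {{buildUserLoader}} and {{url}}, and the {{file:/...}} paths are all hypothetical and exist only for illustration. It builds a two-level {{URLClassLoader}} hierarchy mirroring the example and flattens it by walking from the child up to the parent, which puts the user JARs ahead of the system JARs in the resulting list.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FlattenedClasspathOrder {

    // Hypothetical helper: walk from the given loader up through its parents,
    // appending each URLClassLoader's URLs in the order they are encountered.
    static List<URL> collectUrls(ClassLoader loader) {
        List<URL> urls = new ArrayList<>();
        for (ClassLoader cl = loader; cl != null; cl = cl.getParent()) {
            if (cl instanceof URLClassLoader) {
                urls.addAll(Arrays.asList(((URLClassLoader) cl).getURLs()));
            }
        }
        return urls;
    }

    // Converts a string to a URL; the JARs do not need to exist on disk,
    // because this sketch only inspects the URL lists and never loads a class.
    static URL url(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds the hierarchy from the example: a "system" loader holding the
    // Spark JARs, and a child "user" loader holding the user-provided JARs.
    // The paths are made up for illustration.
    static URLClassLoader buildUserLoader() {
        URL[] systemJars = {
            url("file:/spark/jars/spark-core_2.12-3.2.0.jar"),
            url("file:/spark/jars/hive-exec-2.3.9.jar")
        };
        URL[] userJars = {
            url("file:/user/foo.jar"),
            url("file:/user/hive-exec-2.3.8.jar")
        };
        // A null parent means the bootstrap loader, so the walk stops there.
        URLClassLoader system = new URLClassLoader(systemJars, null);
        return new URLClassLoader(userJars, system);
    }

    public static void main(String[] args) {
        // Prints the flattened list: user JARs first, then the system JARs.
        for (URL u : collectUrls(buildUserLoader())) {
            System.out.println(u);
        }
    }
}
```

Passing the flattened list to a single new {{URLClassLoader}} would search {{hive-exec-2.3.8.jar}} before {{hive-exec-2.3.9.jar}}, whereas the original parent-first hierarchy consults the system loader first.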
> After SPARK-37446, the NoSuchMethodError is no longer an issue, but this
> still breaks assumptions about how user JARs should be treated vs. system
> JARs, and leaves the client able to break in other ways. For
> example, SPARK-37446 describes a scenario whereby Hive 2.3.8 JARs have
> been included; the changes in Hive 2.3.9 were needed to improve compatibility
> with older HMS, so if a user were to accidentally include these older JARs,
> it could break the ability of Spark to communicate with HMS 1.x.
> I see two solutions to this:
> *(A) Remove the separate classloader entirely when using "builtin"*
> Starting from 3.0.0, due to SPARK-26839, when using Java 9+, we don't even
> create a new classloader when using "builtin". This makes sense, as [called
> out in this
> comment|https://github.com/apache/spark/pull/24057#discussion_r265142878],
> since the point of "builtin" is to use the existing JARs on the classpath
> anyway. This proposes simply ext