[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Krogen updated SPARK-42539:
--------------------------------
    Description: 
Recently we observed that on version 3.2.0 and Java 8, it is possible for 
user-provided Hive JARs to break Spark's ability to communicate with the Hive 
Metastore via the Hive metadata client / {{IsolatedClientLoader}}, when using 
the default "builtin" Hive version. After SPARK-35321, when Spark is compiled 
against Hive >= 2.3.9 and the "builtin" Hive client version is used, we call 
{{Hive.getWithoutRegisterFns()}} (from HIVE-21563) instead of {{Hive.get()}}. 
If the user has included, for example, {{hive-exec-2.3.8.jar}} on their 
classpath, the client will break with a {{NoSuchMethodError}}. This particular 
failure mode was resolved in 3.2.1 by SPARK-37446, but while investigating, we 
found a more general issue: user JARs can override Spark's own JARs, but only 
inside the IsolatedClientLoader when using "builtin". This happens because even 
when Spark is configured to use the "builtin" Hive classes, it still creates a 
separate URLClassLoader for the HiveClientImpl used for HMS communication. To 
get the set of JAR URLs for this classloader, Spark [collects all of the JARs 
used by the user classloader (and its parent, and that classloader's parent, 
and so 
on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438].
 Thus the newly created classloader will contain all of the same JARs as the 
user classloader, but the ordering is reversed! User JARs are prioritized 
ahead of system JARs, because the classloader hierarchy is traversed from 
bottom to top. For example, say we have user JARs "foo.jar" and 
"hive-exec-2.3.8.jar". The user classloader will look like this:
{code}
MutableURLClassLoader
-- foo.jar
-- hive-exec-2.3.8.jar
-- parent: URLClassLoader
----- spark-core_2.12-3.2.0.jar
----- ...
----- hive-exec-2.3.9.jar
----- ...
{code}

This setup provides the expected behavior within the user classloader; it will 
first check the parent, so hive-exec-2.3.9.jar takes precedence, and the 
MutableURLClassLoader is only checked if the class doesn't exist in the parent. 
But when the JAR list is constructed for the IsolatedClientLoader, the URLs 
from the MutableURLClassLoader are traversed first, then its parent's, so the 
final list looks like (in order):
{code}
URLClassLoader [IsolatedClientLoader]
-- foo.jar
-- hive-exec-2.3.8.jar
-- spark-core_2.12-3.2.0.jar
-- ...
-- hive-exec-2.3.9.jar
-- ...
-- parent: boot classloader (JVM classes)
{code}
Now when a lookup happens, all of the JARs are within the same URLClassLoader, 
and the user JARs are in front of the Spark ones, so the user JARs get 
prioritized. This is the opposite of the expected behavior when using the 
default user/application classloader in Spark, which has parent-first behavior, 
prioritizing the Spark/system classes over the user classes. (Note that this 
behavior is correct when using the {{ChildFirstURLClassLoader}}.)
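
To make the ordering inversion concrete, here is a minimal sketch of the collection pattern described above. The helper name is hypothetical and this is not the actual HiveUtils code; it only illustrates how visiting the leaf (user) classloader before its parents flattens parent-first delegation into a child-first URL list:
{code}
import java.net.{URL, URLClassLoader}

// Hypothetical helper (not the real HiveUtils code): walk the classloader
// hierarchy from the leaf loader up to the root, gathering the URLs of every
// URLClassLoader along the way. Because the leaf is visited first, its JARs
// land ahead of the parent's JARs in the flattened list.
def collectJarsChildFirst(loader: ClassLoader): Seq[URL] = loader match {
  case null => Seq.empty
  case url: URLClassLoader => url.getURLs.toSeq ++ collectJarsChildFirst(url.getParent)
  case other => collectJarsChildFirst(other.getParent)
}

// The flattened list then seeds a single flat URLClassLoader, so lookup order
// becomes purely positional: user JARs first, Spark JARs second.
// val jars = collectJarsChildFirst(Thread.currentThread().getContextClassLoader)
// val isolated = new URLClassLoader(jars.toArray, null)
{code}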

After SPARK-37446, the NoSuchMethodError is no longer an issue, but this 
ordering still breaks assumptions about how user JARs should be treated vs. 
system JARs, and leaves the client open to breaking in other ways. For example, 
SPARK-37446 describes a scenario in which Hive 2.3.8 JARs have been included; 
the changes in Hive 2.3.9 were needed to improve compatibility with older HMS 
versions, so if a user were to accidentally include these older JARs, it could 
break Spark's ability to communicate with HMS 1.x.

I see two solutions to this:

*(A) Remove the separate classloader entirely when using "builtin"*
Starting from 3.0.0, due to SPARK-26839, when using Java 9+, we don't even 
create a new classloader when using "builtin". This makes sense, as [called out 
in this 
comment|https://github.com/apache/spark/pull/24057#discussion_r265142878], 
since the point of "builtin" is to use the existing JARs on the classpath 
anyway. This option proposes simply extending the changes from SPARK-26839 to 
all Java versions, rather than restricting them to Java 9+.
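
A rough sketch of what this would mean in practice, with illustrative names only, under the assumption that the Java-9+ branch from SPARK-26839 simply becomes the only branch:
{code}
import java.net.{URL, URLClassLoader}

// Hypothetical illustration of option (A), not the actual HiveUtils code: when
// the "builtin" Hive version is in use, reuse the classloader that already
// loaded Spark instead of constructing a new URLClassLoader, so user/system
// JAR precedence stays exactly as the application classloader defines it.
def hiveClientClassLoader(useBuiltin: Boolean, isolatedJars: Array[URL], base: ClassLoader): ClassLoader = {
  if (useBuiltin) base                         // no new loader; keep existing delegation
  else new URLClassLoader(isolatedJars, base)  // isolated mode still gets its own loader
}
{code}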

*(B) Reverse the ordering of parent/child JARs when constructing the URL list*
The most targeted fix is to simply reverse the ordering on [this line in 
HiveUtils|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L419],
 which currently places child-classloader JARs ahead of parent-classloader 
JARs, inverting the expected precedence. There is already special handling for 
{{ChildFirstURLClassLoader}}, so all that needs to be done is to flip this 
order.
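
Sketched abstractly (hypothetical helper mirroring the earlier sketch, assuming the concatenation order is the only change), the fix amounts to appending the child's URLs after the parent's instead of before:
{code}
import java.net.{URL, URLClassLoader}

// Hypothetical sketch of option (B), not the actual HiveUtils change: build
// the flattened JAR list parent-first, so system JARs keep precedence over
// user JARs, matching the default parent-first delegation. A
// ChildFirstURLClassLoader would keep its child-first order via the existing
// special handling.
def collectJarsParentFirst(loader: ClassLoader): Seq[URL] = loader match {
  case null => Seq.empty
  case url: URLClassLoader => collectJarsParentFirst(url.getParent) ++ url.getURLs.toSeq
  case other => collectJarsParentFirst(other.getParent)
}
{code}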

I prefer (A) because it is a clean solution: it both simplifies the classloader 
setup and reduces divergence / special handling across Java versions. At the 
time SPARK-26839 went in (2019), Java 9+ support was newer and less 
well-tested. Now, a few years later, we can see that the approach in 
SPARK-26839 clearly works well, so I see no reason _not_ to extend it to other 
Java versions as well.


> User-provided JARs can override Spark's Hive metadata client JARs when using 
> "builtin"
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-42539
>                 URL: https://issues.apache.org/jira/browse/SPARK-42539
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.3, 3.2.3, 3.3.2
>            Reporter: Erik Krogen
>            Priority: Major
>


