[GitHub] HyukjinKwon opened a new pull request #23356: [SPARK-26422][R] Support to disable Hive support in SparkR even for Hadoop versions unsupported by Hive fork

GitBox Thu, 20 Dec 2018 07:39:05 -0800

HyukjinKwon opened a new pull request #23356: [SPARK-26422][R] Support to 
disable Hive support in SparkR even for Hadoop versions unsupported by Hive fork
URL: https://github.com/apache/spark/pull/23356
 
 
   ## What changes were proposed in this pull request?
   
   Currently, even if I explicitly disable Hive support in a SparkR session as below:
   
   ```r
   sparkSession <- sparkR.session("local[4]", "SparkR", Sys.getenv("SPARK_HOME"),
                                  enableHiveSupport = FALSE)
   ```
   
   it produces the following error when the Hadoop version is not supported by our Hive fork:
   
   ```
   java.lang.reflect.InvocationTargetException
   ...
   Caused by: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.1.1.3.1.0.0-78
        at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174)
        at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
        at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
        at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:368)
        ... 43 more
   Error in handleErrors(returnStatus, conn) :
     java.lang.ExceptionInInitializerError
        at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.spark.util.Utils$.classForName(Utils.scala:193)
        at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:1116)
        at org.apache.spark.sql.api.r.SQLUtils$.getOrCreateSparkSession(SQLUtils.scala:52)
        at org.apache.spark.sql.api.r.SQLUtils.getOrCreateSparkSession(SQLUtils.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   ```
   
   The root cause is that:
   
   ```
   SparkSession.hiveClassesArePresent
   ```
   
   checks whether the class is loadable in order to tell whether it is on the classpath, but `org.apache.hadoop.hive.conf.HiveConf` checks the Hadoop version in static initialization logic, which runs as soon as the class is loaded. This throws an `IllegalArgumentException`, and that's not caught:
   
   
https://github.com/apache/spark/blob/36edbac1c8337a4719f90e4abd58d38738b2e1fb/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L1113-L1121
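
The mechanism can be reproduced outside Spark with a minimal sketch. Everything here is hypothetical: `FakeConf` stands in for `HiveConf`, `loadShims` for the shim version check, and the `catch` clauses are an assumption about what a "is this class on the classpath?" probe typically handles, not a copy of Spark's code. It shows that `Class.forName` runs the static initializer, and that a handler for missing classes does not cover a failing initializer.

```java
public class StaticInitDemo {
    // Hypothetical stand-in for HiveConf: initializing the class fails
    // because a static field is computed by a version check that throws.
    static class FakeConf {
        static final String SHIMS = loadShims("3.1.1");

        static String loadShims(String hadoopVersion) {
            throw new IllegalArgumentException(
                "Unrecognized Hadoop major version number: " + hadoopVersion);
        }
    }

    // Probes for the class the way a classpath check might.
    static String probe() {
        try {
            // The class literal alone does not initialize FakeConf;
            // the one-argument Class.forName does.
            Class.forName(FakeConf.class.getName());
            return "present";
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            // Only reached when the class is genuinely missing.
            return "absent";
        } catch (ExceptionInInitializerError e) {
            // Without this clause the error would propagate to the caller.
            return "initializer failed: " + e.getCause().getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(probe());
    }
}
```

The class is on the classpath, yet the probe neither returns "present" nor falls into the missing-class handler, which matches the `ExceptionInInitializerError` seen in the trace above.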
   
   So, currently, if users have a Spark build with Hive built in and a Hadoop version unsupported by our fork (namely 3+), there's no way to use SparkR even though it could work.
   
   This PR just proposes changing the order of the boolean comparison so that we don't execute `SparkSession.hiveClassesArePresent` when:
   
     1. `enableHiveSupport` is explicitly disabled
     2. `spark.sql.catalogImplementation` is `in-memory`
   
   so that, by short-circuiting, we **only** check `SparkSession.hiveClassesArePresent` when Hive support is explicitly enabled.
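
The effect of the reordering can be sketched as follows. This is an illustration only, not the patch itself: the method names, the `"hive"` catalog value, and the assumption that the probe was previously the first operand are all hypothetical stand-ins. The probe here always throws, imitating a Hive fork that rejects the Hadoop version.

```java
public class ShortCircuitDemo {
    static int probeCalls = 0;

    // Hypothetical stand-in for SparkSession.hiveClassesArePresent on an
    // unsupported Hadoop version: merely calling it fails.
    static boolean hiveClassesArePresent() {
        probeCalls++;
        throw new ExceptionInInitializerError("Unrecognized Hadoop major version number");
    }

    // Assumed old order: the probe is the first operand, so it always runs.
    static boolean useHiveBefore(boolean enableHiveSupport, String catalogImpl) {
        return hiveClassesArePresent() && enableHiveSupport && "hive".equals(catalogImpl);
    }

    // Reordered: && short-circuits, so the probe runs only when Hive support
    // is enabled and the catalog implementation is "hive".
    static boolean useHiveAfter(boolean enableHiveSupport, String catalogImpl) {
        return enableHiveSupport && "hive".equals(catalogImpl) && hiveClassesArePresent();
    }

    public static void main(String[] args) {
        System.out.println(useHiveAfter(false, "hive"));     // probe skipped
        System.out.println(useHiveAfter(true, "in-memory")); // probe skipped
        System.out.println("probe calls: " + probeCalls);
    }
}
```

With the reordered condition, both cases listed above return `false` without touching the probe, while `useHiveBefore(false, "hive")` would still fail even though Hive support is disabled.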
   
   ## How was this patch tested?
   
   It's difficult to write a test since we don't run tests against Hadoop 3 
yet. See https://github.com/apache/spark/pull/21588. Manually tested.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
