wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix 
ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r358000739
 
 

 ##########
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
 ##########
 @@ -132,7 +132,9 @@ class HadoopTableReader(
     val deserializedHadoopRDD = hadoopRDD.mapPartitions { iter =>
       val hconf = broadcastedHadoopConf.value.value
       val deserializer = deserializerClass.getConstructor().newInstance()
-      deserializer.initialize(hconf, localTableDesc.getProperties)
+      DeserializerLock.synchronized {
 
 Review comment:
   I don't think there will be much contention: we call 
`deserializer.initialize` only once per partition.
   The race is actually in 
`HCatRecordObjectInspectorFactory.getHCatRecordObjectInspector`, which is called 
by `JsonSerDe#initialize`. The factory maintains its cache in a `HashMap` rather 
than, e.g., a `ConcurrentHashMap`. We want the same `ObjectInspector` to be 
returned by the factory cache and set in each `JsonSerDe` instance as its 
`ObjectInspector`.
   Our objective is that no more than one task per executor calls 
`JsonSerDe#initialize` at a time.
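
   To illustrate the idea, here is a minimal, self-contained sketch. It does not use the Hive or Spark classes. `FakeInspector`, `FakeInspectorFactory`, `InitLock`, and `LockDemo` are hypothetical names, and the check-then-put cache is only an assumption about the general shape of the problem, not the actual factory code. It shows how serializing access through a single JVM-wide monitor lets every caller observe the same cached instance.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Hive's ObjectInspector; not a real Hive class.
final class FakeInspector(val typeName: String)

// Hypothetical factory with an unsynchronized check-then-put cache.
// An assumption for illustration, not the actual Hive code.
object FakeInspectorFactory {
  private val cache = mutable.HashMap.empty[String, FakeInspector]

  def get(typeName: String): FakeInspector =
    cache.get(typeName) match {
      case Some(cached) => cached
      case None =>
        // Without external locking, two threads can both reach this branch
        // and end up holding different instances for the same type.
        val created = new FakeInspector(typeName)
        cache.put(typeName, created)
        created
    }
}

// Hypothetical JVM-wide monitor, playing the role of DeserializerLock in the diff.
object InitLock

object LockDemo {
  // Runs `tasks` threads that each fetch an inspector under the shared lock.
  def initializeAll(tasks: Int): Seq[FakeInspector] = {
    val results = new Array[FakeInspector](tasks)
    val threads = (0 until tasks).map { i =>
      new Thread(new Runnable {
        override def run(): Unit = {
          // Only one thread at a time may touch the unsynchronized cache.
          results(i) = InitLock.synchronized {
            FakeInspectorFactory.get("struct<a:int>")
          }
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join()) // join() makes each thread's write visible here
    results.toSeq
  }
}
```

   With the lock held around the factory call, every element returned by `LockDemo.initializeAll(8)` should be the same instance. Dropping `InitLock.synchronized` removes that guarantee, which is the kind of mismatch the change is meant to prevent.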

