wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r358000739
########## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ##########

@@ -132,7 +132,9 @@ class HadoopTableReader(
     val deserializedHadoopRDD = hadoopRDD.mapPartitions { iter =>
       val hconf = broadcastedHadoopConf.value.value
       val deserializer = deserializerClass.getConstructor().newInstance()
-      deserializer.initialize(hconf, localTableDesc.getProperties)
+      DeserializerLock.synchronized {

Review comment:
   I don't think there will be much contention: we call `deserializer.initialize` only once per partition. The race is actually in `HCatRecordObjectInspectorFactory.getHCatRecordObjectInspector`, which is called by `JsonSerDe#initialize` (the factory maintains its cache in a plain `HashMap` rather than, e.g., a `ConcurrentHashMap`). We want the same `ObjectInspector` to be returned by the factory cache and set in each `JsonSerDe` instance as its `ObjectInspector`. The objective is that no more than one task per executor calls `JsonSerDe#initialize` at a time.
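To illustrate the pattern being discussed, here is a minimal, self-contained sketch (not the actual Hive or Spark code): `InspectorFactory` stands in for `HCatRecordObjectInspectorFactory` and its unsynchronized `HashMap` cache, and the `DeserializerLock` object mirrors the lock Spark synchronizes on so that at most one task per executor runs `initialize` at a time. All class and method names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for HCatRecordObjectInspectorFactory: it caches
// inspectors in a plain HashMap, so concurrent unsynchronized callers can
// race on get/put and end up with different inspector instances.
class InspectorFactory {
    private static final Map<String, Object> cache = new HashMap<>();

    static Object getInspector(String typeInfo) {
        Object inspector = cache.get(typeInfo);
        if (inspector == null) {
            inspector = new Object();         // stands in for building an ObjectInspector
            cache.put(typeInfo, inspector);   // unsynchronized put: racy without a lock
        }
        return inspector;
    }
}

public class DeserializerLockSketch {
    // Shared lock object, mirroring the role of Spark's DeserializerLock:
    // serializes initialization within one executor (JVM).
    private static final Object DeserializerLock = new Object();

    static Object initializeSerDe(String typeInfo) {
        // Stands in for JsonSerDe#initialize; the synchronized block is the fix.
        synchronized (DeserializerLock) {
            return InspectorFactory.getInspector(typeInfo);
        }
    }

    public static void main(String[] args) throws Exception {
        Object[] results = new Object[2];
        Thread t1 = new Thread(() -> results[0] = initializeSerDe("struct<a:int>"));
        Thread t2 = new Thread(() -> results[1] = initializeSerDe("struct<a:int>"));
        t1.start(); t2.start();
        t1.join(); t2.join();
        // With the lock, both "tasks" observe the same cached inspector.
        System.out.println(results[0] == results[1]); // prints "true"
    }
}
```

Without the `synchronized` block, two threads could both miss the cache, each build an inspector, and each `JsonSerDe` would hold a different `ObjectInspector` than the one left in the cache, which is the inconsistency the PR guards against.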