[spark] branch branch-2.4 updated: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
This is an automated email from the ASF dual-hosted git repository.

vanzin pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 07caebf  [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table

07caebf is described below

commit 07caebf2194fc34f1b4aacf2aa6c2d6961587482
Author: Wing Yew Poon
AuthorDate: Fri Dec 20 10:39:26 2019 -0800

    [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table

    When querying a partitioned table with format `org.apache.hive.hcatalog.data.JsonSerDe` and more than one task runs in each executor concurrently, the following exception is encountered:

    `java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord`

    The exception occurs in `HadoopTableReader.fillObject`. `org.apache.hive.hcatalog.data.JsonSerDe#initialize` populates a `cachedObjectInspector` field by calling `HCatRecordObjectInspectorFactory.getHCatRecordObjectInspector`, which is not thread-safe; this `cachedObjectInspector` is returned by `JsonSerDe#getObjectInspector`. We protect against this Hive bug by synchronizing on an object when we need to call `initialize` on `org.apache.hadoop.hive.serde2.Deserializer` instances (which may be `JsonSerDe` instances). By doing so, the `ObjectInspector` for the `Deserializer` of the partitions of the JSON table and that of the table `SerDe` are the same cached `ObjectInspector`, and `HadoopTableReader.fillObject` then works correctly. (If the `ObjectInspector`s are different, then a bug in `HCatRecordObjectInsp [...]

    This works around HIVE-15773 / HIVE-21752.

    ### Does this PR introduce any user-facing change?

    No.

    ### How was this patch tested?

    Tested manually on a cluster with a partitioned JSON table, running a query using more than one core per executor. Before this change, the ClassCastException happens consistently. With this change it does not happen.

    Closes #26895 from wypoon/SPARK-17398.
Authored-by: Wing Yew Poon
Signed-off-by: Marcelo Vanzin
(cherry picked from commit c72f88b0ba20727e831ba9755d9628d0347ee3cb)
Signed-off-by: Marcelo Vanzin
---
 .../org/apache/spark/sql/hive/TableReader.scala | 23 +++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
index 7d57389..6631073 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
@@ -133,7 +133,9 @@ class HadoopTableReader(
     val deserializedHadoopRDD = hadoopRDD.mapPartitions { iter =>
       val hconf = broadcastedHadoopConf.value.value
       val deserializer = deserializerClass.newInstance()
-      deserializer.initialize(hconf, localTableDesc.getProperties)
+      DeserializerLock.synchronized {
+        deserializer.initialize(hconf, localTableDesc.getProperties)
+      }
 
       HadoopTableReader.fillObject(iter, deserializer, attrsWithIndex, mutableRow, deserializer)
     }
@@ -255,10 +257,14 @@ class HadoopTableReader(
         partProps.asScala.foreach { case (key, value) =>
           props.setProperty(key, value)
         }
-        deserializer.initialize(hconf, props)
+        DeserializerLock.synchronized {
+          deserializer.initialize(hconf, props)
+        }
         // get the table deserializer
         val tableSerDe = localTableDesc.getDeserializerClass.newInstance()
-        tableSerDe.initialize(hconf, localTableDesc.getProperties)
+        DeserializerLock.synchronized {
+          tableSerDe.initialize(hconf, localTableDesc.getProperties)
+        }
 
         // fill the non partition key attributes
         HadoopTableReader.fillObject(iter, deserializer, nonPartitionKeyAttrs,
@@ -337,6 +343,17 @@ private[hive] object HiveTableUtil {
   }
 }
 
+/**
+ * Object to synchronize on when calling org.apache.hadoop.hive.serde2.Deserializer#initialize.
+ *
+ * [SPARK-17398] org.apache.hive.hcatalog.data.JsonSerDe#initialize calls the non-thread-safe
+ * HCatRecordObjectInspectorFactory.getHCatRecordObjectInspector, the results of which are
+ * returned by JsonSerDe#getObjectInspector.
+ * To protect against this bug in Hive (HIVE-15773/HIVE-21752), we synchronize on this object
+ * when calling initialize on Deserializer instances that could be JsonSerDe instances.
+ */
+private[hive] object DeserializerLock
+
 private[hive] object HadoopTableReader extends HiveInspectors with Logging {
 
   /**
    * Curried. After given an argument for 'path', the resulting JobConf => Unit closure is used to

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h.
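The lock-object pattern used by the patch above can be sketched in isolation. The following is an illustrative Java sketch, not Spark's actual code: `InspectorCacheDemo`, `INSPECTOR_CACHE`, and `initialize` are hypothetical stand-ins for `HCatRecordObjectInspectorFactory`'s non-thread-safe cache and `JsonSerDe#initialize`; only the idea of funneling every `initialize` call through one shared lock object mirrors `DeserializerLock`:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InspectorCacheDemo {
    // Stand-in for HCatRecordObjectInspectorFactory's cache: a plain HashMap,
    // which is NOT thread-safe under concurrent mutation.
    static final Map<String, Object> INSPECTOR_CACHE = new HashMap<>();

    // Stand-in for Spark's DeserializerLock: a single shared object to
    // synchronize on whenever initialize() is called.
    static final Object DESERIALIZER_LOCK = new Object();

    // Stand-in for JsonSerDe#initialize: returns the cached "ObjectInspector"
    // for a schema, creating it on first use.
    static Object initialize(String schema) {
        return INSPECTOR_CACHE.computeIfAbsent(schema, s -> new Object());
    }

    public static void main(String[] args) throws InterruptedException {
        List<Object> inspectors = Collections.synchronizedList(new ArrayList<>());
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            Thread t = new Thread(() -> {
                // The fix: serialize all initialize() calls through one lock,
                // so every thread observes the same cached instance.
                synchronized (DESERIALIZER_LOCK) {
                    inspectors.add(initialize("struct<a:int>"));
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // All 8 threads must have received the identical cached instance;
        // without the lock, concurrent first calls could race on the HashMap.
        boolean allSame = inspectors.stream().allMatch(o -> o == inspectors.get(0));
        System.out.println("all threads saw same inspector: " + allSame);
    }
}
```

Note that this only works because *every* call site takes the same lock, which is why the patch wraps all three `initialize` calls in `TableReader.scala` rather than just the partition deserializer's.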
[spark] branch master updated (7dff3b1 -> c72f88b)
This is an automated email from the ASF dual-hosted git repository.

vanzin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 7dff3b1  [SPARK-30272][SQL][CORE] Remove usage of Guava that breaks in 27; replace with workalikes
 add c72f88b  [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/hive/TableReader.scala | 23 +++---
 1 file changed, 20 insertions(+), 3 deletions(-)
[spark] branch master updated (07b04c4 -> 7dff3b1)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 07b04c4  [SPARK-29938][SQL] Add batching support in Alter table add partition flow
 add 7dff3b1  [SPARK-30272][SQL][CORE] Remove usage of Guava that breaks in 27; replace with workalikes

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/util/kvstore/CustomType1.java  | 9 +--
 .../network/buffer/FileSegmentManagedBuffer.java    | 11 +--
 .../spark/network/buffer/NettyManagedBuffer.java    | 7 +-
 .../spark/network/buffer/NioManagedBuffer.java      | 7 +-
 .../spark/network/client/TransportClient.java       | 11 +--
 .../spark/network/protocol/ChunkFetchFailure.java   | 13 ++--
 .../spark/network/protocol/ChunkFetchRequest.java   | 7 +-
 .../spark/network/protocol/ChunkFetchSuccess.java   | 13 ++--
 .../spark/network/protocol/OneWayMessage.java       | 9 ++-
 .../apache/spark/network/protocol/RpcFailure.java   | 13 ++--
 .../apache/spark/network/protocol/RpcRequest.java   | 13 ++--
 .../apache/spark/network/protocol/RpcResponse.java  | 13 ++-
 .../spark/network/protocol/StreamChunkId.java       | 13 ++--
 .../spark/network/protocol/StreamFailure.java       | 13 ++--
 .../spark/network/protocol/StreamRequest.java       | 9 ++-
 .../spark/network/protocol/StreamResponse.java      | 15 ++--
 .../spark/network/protocol/UploadStream.java        | 9 ++-
 .../shuffle/ExternalShuffleBlockResolver.java       | 13 ++--
 .../network/shuffle/protocol/BlocksRemoved.java     | 9 ++-
 .../shuffle/protocol/ExecutorShuffleInfo.java       | 16 ++--
 .../shuffle/protocol/FetchShuffleBlocks.java        | 17 +++--
 .../shuffle/protocol/GetLocalDirsForExecutors.java  | 10 ++-
 .../shuffle/protocol/LocalDirsForExecutors.java     | 11 +--
 .../spark/network/shuffle/protocol/OpenBlocks.java  | 18 +++--
 .../network/shuffle/protocol/RegisterExecutor.java  | 21 +++---
 .../network/shuffle/protocol/RemoveBlocks.java      | 23 +++---
 .../network/shuffle/protocol/StreamHandle.java      | 17 +++--
 .../network/shuffle/protocol/UploadBlock.java       | 24 +++---
 .../shuffle/protocol/UploadBlockStream.java         | 12 +--
 .../CleanupNonShuffleServiceServedFilesSuite.java   | 3 +-
 .../shuffle/ExternalShuffleCleanupSuite.java        | 3 +-
 .../spark/network/yarn/YarnShuffleService.java      | 10 ++-
 .../src/main/scala/org/apache/spark/SparkEnv.scala  | 5 +-
 .../spark/deploy/history/FsHistoryProvider.scala    | 3 +-
 .../scala/org/apache/spark/rdd/HadoopRDD.scala      | 4 +-
 .../apache/spark/status/ElementTrackingStore.scala  | 8 +-
 .../scala/org/apache/spark/util/ThreadUtils.scala   | 85 --
 .../main/scala/org/apache/spark/util/Utils.scala    | 6 +-
 .../shuffle/sort/UnsafeShuffleWriterSuite.java      | 3 +-
 .../spark/scheduler/TaskResultGetterSuite.scala     | 5 +-
 .../kinesis/KPLBasedKinesisTestUtils.scala          | 7 +-
 .../spark/sql/catalyst/JavaTypeInference.scala      | 23 --
 .../spark/sql/JavaBeanDeserializationSuite.java     | 36 +
 43 files changed, 360 insertions(+), 217 deletions(-)
[spark] branch master updated (0d2ef3a -> 07b04c4)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 0d2ef3a  [SPARK-30300][SQL][WEB-UI] Fix updating the UI max value string when driver updates the same metric id as the tasks
 add 07b04c4  [SPARK-29938][SQL] Add batching support in Alter table add partition flow

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/execution/command/ddl.scala   | 22 +-
 .../org/apache/spark/sql/sources/InsertSuite.scala | 14 ++
 2 files changed, 31 insertions(+), 5 deletions(-)
[spark] branch master updated: [SPARK-30300][SQL][WEB-UI] Fix updating the UI max value string when driver updates the same metric id as the tasks
This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0d2ef3a  [SPARK-30300][SQL][WEB-UI] Fix updating the UI max value string when driver updates the same metric id as the tasks

0d2ef3a is described below

commit 0d2ef3ae2b2d0b66f763d6bb2e490a667c83f9f2
Author: Niranjan Artal
AuthorDate: Fri Dec 20 07:29:28 2019 -0600

    [SPARK-30300][SQL][WEB-UI] Fix updating the UI max value string when driver updates the same metric id as the tasks

    ### What changes were proposed in this pull request?

    For a given metric id, we check whether the driver-side accumulator's value is greater than the maximum of all stage values. If it is, we remove that entry from the HashMap. As a result, "driver" is displayed on the UI for this metric, since the driver holds the maximum value.

    ### Why are the changes needed?

    This PR fixes https://issues.apache.org/jira/browse/SPARK-30300. Currently the driver's metric value is not compared while calculating the max.

    ### Does this PR introduce any user-facing change?

    For metrics where the driver's value is greater than the max of all stages, the displayed string changes.
    Previous: (min, median, max (stageId 0 (attemptId 1): taskId 2))
    Now: (min, median, max (driver))

    ### How was this patch tested?

    Ran unit tests.

    Closes #26941 from nartal1/SPARK-30300.
Authored-by: Niranjan Artal
Signed-off-by: Thomas Graves
---
 .../org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala
index 64d2f33..d5bb36e 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala
@@ -237,6 +237,12 @@ class SQLAppStatusListener(
     if (metricTypes.contains(id)) {
       val prev = allMetrics.getOrElse(id, null)
       val updated = if (prev != null) {
+        // If the driver updates same metrics as tasks and has higher value then remove
+        // that entry from maxMetricsFromAllStage. This would make stringValue function default
+        // to "driver" that would be displayed on UI.
+        if (maxMetricsFromAllStages.contains(id) && value > maxMetricsFromAllStages(id)(0)) {
+          maxMetricsFromAllStages.remove(id)
+        }
         val _copy = Arrays.copyOf(prev, prev.length + 1)
         _copy(prev.length) = value
         _copy
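The "drop the stage entry so the string falls back to driver" logic can be sketched in isolation. The following is a hypothetical Java sketch, not the `SQLAppStatusListener` code itself: `MetricMaxDemo`, the `long[]` layout `{maxValue, stageId, attemptId, taskId}`, and this simplified `stringValue` are illustrative stand-ins; only the removal condition mirrors the patch:

```java
import java.util.HashMap;
import java.util.Map;

public class MetricMaxDemo {
    // Stand-in for maxMetricsFromAllStages:
    // metric id -> {maxValue, stageId, attemptId, taskId} seen across stages.
    static final Map<Long, long[]> maxMetricsFromAllStages = new HashMap<>();

    // Mimics the fix: if the driver's value exceeds the recorded per-stage max,
    // remove the entry so the UI string defaults to "driver".
    static String stringValue(long id, long driverValue) {
        long[] stageMax = maxMetricsFromAllStages.get(id);
        if (stageMax != null && driverValue > stageMax[0]) {
            maxMetricsFromAllStages.remove(id);
            stageMax = null;
        }
        if (stageMax == null) {
            return "max (driver)";
        }
        return String.format("max (stageId %d (attemptId %d): taskId %d)",
            stageMax[1], stageMax[2], stageMax[3]);
    }

    public static void main(String[] args) {
        // Stage 0 / attempt 1 / task 2 recorded a max of 100 for metric 1.
        maxMetricsFromAllStages.put(1L, new long[]{100L, 0L, 1L, 2L});

        // Driver value below the stage max: stage attribution is kept.
        System.out.println(stringValue(1L, 50L));

        // Driver value above the stage max: entry is removed, "driver" shown.
        System.out.println(stringValue(1L, 500L));
    }
}
```

The design point is that no "driver" sentinel is stored: the absence of a per-stage entry is what makes the display fall back to "driver", so simply removing the entry is enough.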
[spark] branch master updated (a296d15 -> 12249fc)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from a296d15  [SPARK-30291] catch the exception when doing materialize in AQE
 add 12249fc  [SPARK-30301][SQL] Fix wrong results when datetimes as fields of complex types

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/expressions/csvExpressions.scala  | 2 +-
 .../sql/catalyst/expressions/jsonExpressions.scala | 2 +-
 .../apache/spark/sql/execution/HiveResult.scala    | 75 ++
 .../test/resources/sql-tests/results/array.sql.out | 4 +-
 .../sql-tests/results/csv-functions.sql.out        | 2 +-
 .../sql-tests/results/inline-table.sql.out         | 2 +-
 .../sql-tests/results/json-functions.sql.out       | 2 +-
 .../results/typeCoercion/native/concat.sql.out     | 2 +-
 .../results/typeCoercion/native/mapZipWith.sql.out | 2 +-
 .../results/typeCoercion/native/mapconcat.sql.out  | 2 +-
 .../sql-tests/results/udf/udf-inline-table.sql.out | 2 +-
 .../spark/sql/execution/HiveResultSuite.scala      | 30 ++---
 12 files changed, 50 insertions(+), 77 deletions(-)
[spark] branch master updated (18e8d1d -> a296d15)
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 18e8d1d  [SPARK-30307][SQL] remove ReusedQueryStageExec
 add a296d15  [SPARK-30291] catch the exception when doing materialize in AQE

No new revisions were added by this update.

Summary of changes:
 .../execution/adaptive/AdaptiveSparkPlanExec.scala | 18 ++--
 .../adaptive/AdaptiveQueryExecSuite.scala          | 25 ++
 2 files changed, 36 insertions(+), 7 deletions(-)