[ https://issues.apache.org/jira/browse/SPARK-29938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-29938.
----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 26569
[https://github.com/apache/spark/pull/26569]

> Add batching in alter table add partition flow
> ----------------------------------------------
>
>                 Key: SPARK-29938
>                 URL: https://issues.apache.org/jira/browse/SPARK-29938
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.4, 2.4.4
>            Reporter: Prakhar Jain
>            Assignee: Prakhar Jain
>            Priority: Major
>             Fix For: 3.0.0
>
>
> When a lot of new partitions are added by an Insert query on a partitioned
> datasource table, the query sometimes fails with:
> {noformat}
> An error was encountered: org.apache.spark.sql.AnalysisException:
> org.apache.hadoop.hive.ql.metadata.HiveException:
> org.apache.thrift.transport.TTransportException:
> java.net.SocketTimeoutException: Read timed out;
>   at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.createPartitions(HiveExternalCatalog.scala:928)
>   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createPartitions(SessionCatalog.scala:798)
>   at org.apache.spark.sql.execution.command.AlterTableAddPartitionCommand.run(ddl.scala:448)
>   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.refreshUpdatedPartitions$1(InsertIntoHadoopFsRelationCommand.scala:137)
> {noformat}
> This happens because adding thousands of partitions in a single call takes a
> long time, and the client eventually times out.
> Also, adding a lot of partitions at once can lead to OOM in the Hive Metastore
> (a similar issue in the [recover partition flow|https://github.com/apache/spark/pull/14607]
> was fixed).
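
The batching idea named in the issue title can be sketched as follows. This is a minimal illustration, not the code from pull request 26569: `PartitionSpec`, `BatchedAddPartitions`, `addInBatches`, and the `addBatch` callback are hypothetical names standing in for the catalog's partition type and its create-partitions call, and the batch size is left to the caller.

```scala
// Hypothetical stand-in for a catalog partition description (not a Spark class).
final case class PartitionSpec(values: Map[String, String])

object BatchedAddPartitions {
  // Splits `parts` into chunks of at most `batchSize` and invokes `addBatch`
  // once per chunk, so no single metastore call carries every partition.
  // Returns the number of calls that were issued.
  def addInBatches(parts: Seq[PartitionSpec], batchSize: Int)
                  (addBatch: Seq[PartitionSpec] => Unit): Int = {
    require(batchSize > 0, "batchSize must be positive")
    var calls = 0
    parts.grouped(batchSize).foreach { batch =>
      addBatch(batch) // assumed to wrap one create-partitions request
      calls += 1
    }
    calls
  }
}
```

With the 15000 partitions from the reproduction and an arbitrarily chosen batch size of 100, this issues 150 smaller calls instead of one large one, which keeps each request short relative to the client's read timeout.
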
> Steps to reproduce -
> {noformat}
> case class Partition(data: Int, partition_key: Int)
> val df = sc.parallelize(1 to 15000, 15000).map(x => Partition(x,x)).toDF
> df.registerTempTable("temp_table")
> spark.sql("""CREATE TABLE `test_table` (`data` INT, `partition_key` INT) USING parquet PARTITIONED BY (partition_key) """)
> spark.sql("INSERT OVERWRITE TABLE test_table select * from temp_table").collect()
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org