[GitHub] [spark] advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table
advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table
URL: https://github.com/apache/spark/pull/25306#discussion_r319878361

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala

## @@ -197,7 +197,9 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
       Some(partitionSchema))
     val logicalRelation = cached.getOrElse {
-      val sizeInBytes = relation.stats.sizeInBytes.toLong
+      val defaultSizeInBytes = sparkSession.sessionState.conf.defaultSizeInBytes

Review comment:
   Fixed by having `DetermineTableStats` handle `InsertIntoTable(HiveTableRelation, ...)`.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table
URL: https://github.com/apache/spark/pull/25306#discussion_r319567938

## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala

## @@ -136,10 +136,9 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
         |)""".stripMargin)
     spark.sql("REFRESH TABLE t1")
-    // Before SPARK-19678, sizeInBytes should be equal to dataSize.
-    // After SPARK-19678, sizeInBytes should be equal to DEFAULT_SIZE_IN_BYTES.
+    // After SPARK-28573, sizeInBytes should be equal to dataSize.
     val relation1 = spark.table("t1").queryExecution.analyzed.children.head
-    assert(relation1.stats.sizeInBytes === spark.sessionState.conf.defaultSizeInBytes)
+    assert(relation1.stats.sizeInBytes === dataSize)

Review comment:
   cc @wangyum, after the changes are applied to `HiveTableRelation`, we no longer modify `tableMeta`. The `LogicalRelation(HadoopFsRelation, xx)` converted from `HiveTableRelation` will actually get its size from `HadoopFsRelation.sizeInBytes`. Do you think this is an acceptable change, or should we deal with stats differently in `DetermineTableStats`?
advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table
URL: https://github.com/apache/spark/pull/25306#discussion_r313837686

## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala

## @@ -58,7 +58,11 @@ class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
         |TBLPROPERTIES('prop1Key'="prop1Val", '`prop2Key`'="prop2Val")
       """.stripMargin)
     sql("CREATE TABLE parquet_tab3(col1 int, `col 2` int)")
-    sql("CREATE TABLE parquet_tab4 (price int, qty int) partitioned by (year int, month int)")
+    sql(
+      """
+        |CREATE TABLE parquet_tab4 (price int, qty int) partitioned by (year int, month int)
+        |STORED AS PARQUET

Review comment:
   Reverted.
advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table
URL: https://github.com/apache/spark/pull/25306#discussion_r311376838

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala

## @@ -209,13 +211,16 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
     val updatedTable = inferIfNeeded(relation, options, fileFormat, Option(fileIndex))
+    // Spark SQL's data source table now support static and dynamic partition insert. Source
+    // table converted from Hive table should always use dynamic.
+    val enableDynamicPartition = options.updated("partitionOverwriteMode", "dynamic")

Review comment:
   Please refer to https://issues.apache.org/jira/browse/SPARK-20236. TL;DR version: "dynamic" follows Hive's behavior, while "static" is Spark's own style, which differs from Hive's. When converting `InsertIntoHiveTable`, we definitely want the Hive behavior, so this option takes precedence over `spark.sql.sources.partitionOverwriteMode`.
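The static/dynamic distinction from SPARK-20236 can be illustrated without Spark at all: on INSERT OVERWRITE, static mode first clears the whole region matched by the partition spec, while dynamic mode (Hive's behavior) only replaces partitions that actually receive new rows. A minimal sketch, modeling a partitioned table as a dict; the function and table layout are illustrative, not Spark's API:

```python
# Model a partitioned table as {partition_key: rows}. The mode names mirror
# spark.sql.sources.partitionOverwriteMode; the logic is a simplified sketch.
def insert_overwrite(table, new_rows, mode):
    """Return the table contents after an INSERT OVERWRITE of new_rows."""
    if mode == "static":
        # Static (Spark's default): the matched region is wiped first;
        # simplified here to "all partitions" for an empty static spec.
        result = {}
    elif mode == "dynamic":
        # Dynamic (Hive's behavior): partitions receiving no new data survive.
        result = dict(table)
    else:
        raise ValueError(f"unknown partitionOverwriteMode: {mode}")
    result.update(new_rows)
    return result

table = {("2015-04-01",): ["a"], ("2015-04-02",): ["b"]}
new = {("2015-04-02",): ["c"]}
# Static overwrite drops the untouched 2015-04-01 partition...
assert insert_overwrite(table, new, "static") == {("2015-04-02",): ["c"]}
# ...while dynamic overwrite keeps it.
assert insert_overwrite(table, new, "dynamic") == {
    ("2015-04-01",): ["a"], ("2015-04-02",): ["c"]}
```

This is why the conversion pins `partitionOverwriteMode` to `dynamic`: a converted Hive insert must keep Hive's semantics regardless of the session-level setting.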
advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table
URL: https://github.com/apache/spark/pull/25306#discussion_r311375710

## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetMetastoreSuite.scala

## @@ -512,16 +512,14 @@ class HiveParquetMetastoreSuite extends ParquetPartitioningTest {
         |PARTITION (`date`='2015-04-01')
         |select a, b from jt
       """.stripMargin)
-    // Right now, insert into a partitioned Parquet is not supported in data source Parquet.
-    // So, we expect it is not cached.
-    assert(getCachedDataSourceTable(tableIdentifier) === null)
+    checkCached(tableIdentifier)

Review comment:
   I will check the existing tests. There probably aren't many ORC cases, so I will add similar cases for ORC.
advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table
URL: https://github.com/apache/spark/pull/25306#discussion_r311375175

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala

## @@ -112,6 +112,15 @@ private[spark] object HiveUtils extends Logging {
       .booleanConf
       .createWithDefault(true)
+  val CONVERT_INSERTING_PARTITIONED_TABLE =
+    buildConf("spark.sql.hive.convertInsertingPartitionedTable")
+      .doc("When set to true, and \"spark.sql.hive.convertMetastoreParquet\" or " +
+        "\"spark.sql.hive.convertMetastoreOrc\" is true, the built-in ORC/Parquet writer is used" +

Review comment:
   I referred to `CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING`; will fix it.
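As the doc string above describes, the new flag only takes effect in combination with the per-format conversion flags. A session opting in to the converted insert path might set something like the following (the values are illustrative; the keys are the ones named in the diff):

```
spark.sql.hive.convertMetastoreParquet            true
spark.sql.hive.convertMetastoreOrc                true
spark.sql.hive.convertInsertingPartitionedTable   true
```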
advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table
URL: https://github.com/apache/spark/pull/25306#discussion_r311351669

## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala

## @@ -58,7 +58,11 @@ class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
         |TBLPROPERTIES('prop1Key'="prop1Val", '`prop2Key`'="prop2Val")
       """.stripMargin)
     sql("CREATE TABLE parquet_tab3(col1 int, `col 2` int)")
-    sql("CREATE TABLE parquet_tab4 (price int, qty int) partitioned by (year int, month int)")
+    sql(
+      """
+        |CREATE TABLE parquet_tab4 (price int, qty int) partitioned by (year int, month int)
+        |STORED AS PARQUET

Review comment:
   It's modified because I randomly chose it to make sure the insert into a partitioned table can be safely converted. However, that should already be covered by other test cases, so this is a neutral change. I can revert it if you think it's unnecessary.
advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table
URL: https://github.com/apache/spark/pull/25306#discussion_r309027655

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala

## @@ -197,7 +197,9 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
       Some(partitionSchema))
     val logicalRelation = cached.getOrElse {
-      val sizeInBytes = relation.stats.sizeInBytes.toLong
+      val defaultSizeInBytes = sparkSession.sessionState.conf.defaultSizeInBytes

Review comment:
   For the write path, the stats for the Hive table are not calculated, because InsertIntoTable's table field is not added to the children, so `org.apache.spark.sql.hive.DetermineTableStats` skips this table.
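The fallback being discussed amounts to: when `DetermineTableStats` never computed stats for the relation (as on the write path described above), use the session's `defaultSizeInBytes` instead of a real size. A minimal sketch of that fallback rule; the function name and the use of `None` for "no stats" are illustrative, not Spark's API:

```python
# Stand-in for spark.sql.defaultSizeInBytes (Spark's actual default is Long.MaxValue).
DEFAULT_SIZE_IN_BYTES = 2 ** 63 - 1

def effective_size_in_bytes(stats_size, default=DEFAULT_SIZE_IN_BYTES):
    """Return the relation's size estimate, falling back to the session
    default when no stats were computed (stats_size is None)."""
    return stats_size if stats_size is not None else default

# Read path: stats were computed, so the real size wins.
assert effective_size_in_bytes(1024) == 1024
# Write path: DetermineTableStats skipped the table, so the default applies.
assert effective_size_in_bytes(None) == DEFAULT_SIZE_IN_BYTES
```

Using a very large default is deliberately conservative: with no size information, the optimizer should not assume the table is small enough for, e.g., a broadcast join.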