[GitHub] [spark] advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table

2019-09-02 Thread GitBox
URL: https://github.com/apache/spark/pull/25306#discussion_r319878361
 
 

 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala

 @@ -197,7 +197,9 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
           Some(partitionSchema))

         val logicalRelation = cached.getOrElse {
 -      val sizeInBytes = relation.stats.sizeInBytes.toLong
 +      val defaultSizeInBytes = sparkSession.sessionState.conf.defaultSizeInBytes
 
 Review comment:
   Fixed by having `DetermineTableStats` handle `InsertIntoTable(HiveTableRelation, ...)`.
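
A minimal, self-contained sketch of the idea behind that fix, using toy case classes (these are simplified stand-ins for Spark's `LogicalPlan`, `HiveTableRelation`, and `InsertIntoTable`, not the real API): a stats rule that also matches the insert node itself, so the target relation still gets a default size even though it is held in a field rather than in `children`.

```scala
// Toy model only: simplified stand-ins for Spark's plan classes.
sealed trait Plan { def children: Seq[Plan] }
case class HiveRelation(name: String, sizeInBytes: Option[Long]) extends Plan {
  def children: Seq[Plan] = Nil
}
case class Insert(table: Plan, query: Plan) extends Plan {
  // As in Spark, the insert target is a field, not one of `children`.
  def children: Seq[Plan] = Seq(query)
}
case class Scan(name: String) extends Plan { def children: Seq[Plan] = Nil }

// The rule matches the insert node itself (in addition to bare relations),
// so the target relation gets a default size filled in.
def determineStats(plan: Plan, defaultSize: Long): Plan = plan match {
  case i @ Insert(r: HiveRelation, _) if r.sizeInBytes.isEmpty =>
    i.copy(table = r.copy(sizeInBytes = Some(defaultSize)))
  case r: HiveRelation if r.sizeInBytes.isEmpty =>
    r.copy(sizeInBytes = Some(defaultSize))
  case other => other
}

val fixed = determineStats(Insert(HiveRelation("t1", None), Scan("src")), 1024L)
```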


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table

2019-08-30 Thread GitBox
URL: https://github.com/apache/spark/pull/25306#discussion_r319567938
 
 

 File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala

 @@ -136,10 +136,9 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
           |)""".stripMargin)

     spark.sql("REFRESH TABLE t1")
 -    // Before SPARK-19678, sizeInBytes should be equal to dataSize.
 -    // After SPARK-19678, sizeInBytes should be equal to DEFAULT_SIZE_IN_BYTES.
 +    // After SPARK-28573, sizeInBytes should be equal to dataSize.
     val relation1 = spark.table("t1").queryExecution.analyzed.children.head
 -    assert(relation1.stats.sizeInBytes === spark.sessionState.conf.defaultSizeInBytes)
 +    assert(relation1.stats.sizeInBytes === dataSize)
 
 Review comment:
   cc @wangyum, after the changes applied to HiveTableRelation, we no longer modify `tableMeta`.
   The `LogicalRelation(HadoopFsRelation, xx)` converted from `HiveTableRelation` will actually get its size from `HadoopFsRelation.sizeInBytes`.

   Do you think this is an acceptable change, or should we handle stats differently in `DetermineTableStats`?
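
An illustrative sketch (not Spark's actual implementation) of the difference being discussed: a size derived from the relation's real files, in the spirit of `HadoopFsRelation.sizeInBytes`, versus a catalog-stats value that falls back to a session default. The `FileStatus` class and both helpers here are hypothetical stand-ins.

```scala
// Hypothetical stand-in for a file listing entry.
case class FileStatus(path: String, length: Long)

// File-based sizing: sum the lengths of the files actually on disk.
def fileBasedSize(files: Seq[FileStatus]): Long = files.map(_.length).sum

// Catalog-based sizing: use stored stats if present, else a session default.
def catalogSize(statsSize: Option[Long], sessionDefault: Long): Long =
  statsSize.getOrElse(sessionDefault)

val files = Seq(FileStatus("part-00000", 1024L), FileStatus("part-00001", 2048L))
val dataSize = fileBasedSize(files)        // actual bytes on disk
val oldSize  = catalogSize(None, 1L << 30) // falls back to the session default
```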





[GitHub] [spark] advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table

2019-08-14 Thread GitBox
URL: https://github.com/apache/spark/pull/25306#discussion_r313837686
 
 

 File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala

 @@ -58,7 +58,11 @@ class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
         |TBLPROPERTIES('prop1Key'="prop1Val", '`prop2Key`'="prop2Val")
       """.stripMargin)
     sql("CREATE TABLE parquet_tab3(col1 int, `col 2` int)")
 -    sql("CREATE TABLE parquet_tab4 (price int, qty int) partitioned by (year int, month int)")
 +    sql(
 +      """
 +        |CREATE TABLE parquet_tab4 (price int, qty int) partitioned by (year int, month int)
 +        |STORED AS PARQUET
 
 Review comment:
   Reverted.





[GitHub] [spark] advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table

2019-08-06 Thread GitBox
URL: https://github.com/apache/spark/pull/25306#discussion_r311376838
 
 

 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala

 @@ -209,13 +211,16 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log

       val updatedTable = inferIfNeeded(relation, options, fileFormat, Option(fileIndex))

 +      // Spark SQL's data source table now support static and dynamic partition insert. Source
 +      // table converted from Hive table should always use dynamic.
 +      val enableDynamicPartition = options.updated("partitionOverwriteMode", "dynamic")
 
 Review comment:
   Please refer to https://issues.apache.org/jira/browse/SPARK-20236.
   TL;DR version: dynamic overwrite follows Hive's behaviour, while static is Spark's own style, which differs from Hive. When converting InsertIntoHiveTable, we definitely want the Hive behaviour, so it takes precedence over `spark.sql.sources.partitionOverwriteMode`.
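
A minimal sketch of that precedence using plain Scala `Map` semantics: forcing the per-relation option to `"dynamic"` means it wins over whatever the session-wide `spark.sql.sources.partitionOverwriteMode` would otherwise say. The `overwriteMode` helper is hypothetical, only to show the lookup order.

```scala
// Session-wide conf and per-relation options, modeled as plain maps.
val sessionConf = Map("spark.sql.sources.partitionOverwriteMode" -> "static")
val options = Map("compression" -> "snappy")

// `updated` returns a new map with the key set, replacing any existing entry.
val enableDynamicPartition = options.updated("partitionOverwriteMode", "dynamic")

// Hypothetical lookup order: per-relation option first, then session conf.
def overwriteMode(opts: Map[String, String], conf: Map[String, String]): String =
  opts.getOrElse("partitionOverwriteMode",
    conf.getOrElse("spark.sql.sources.partitionOverwriteMode", "static"))
```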





[GitHub] [spark] advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table

2019-08-06 Thread GitBox
URL: https://github.com/apache/spark/pull/25306#discussion_r311375710
 
 

 File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetMetastoreSuite.scala

 @@ -512,16 +512,14 @@ class HiveParquetMetastoreSuite extends ParquetPartitioningTest {
         |PARTITION (`date`='2015-04-01')
         |select a, b from jt
       """.stripMargin)
 -    // Right now, insert into a partitioned Parquet is not supported in data source Parquet.
 -    // So, we expect it is not cached.
 -    assert(getCachedDataSourceTable(tableIdentifier) === null)
 +    checkCached(tableIdentifier)
 
 Review comment:
   I will check the existing tests. There are probably not that many ORC cases, so I will add similar cases for ORC then.





[GitHub] [spark] advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table

2019-08-06 Thread GitBox
URL: https://github.com/apache/spark/pull/25306#discussion_r311375175
 
 

 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala

 @@ -112,6 +112,15 @@ private[spark] object HiveUtils extends Logging {
     .booleanConf
     .createWithDefault(true)

 +  val CONVERT_INSERTING_PARTITIONED_TABLE =
 +    buildConf("spark.sql.hive.convertInsertingPartitionedTable")
 +      .doc("When set to true, and \"spark.sql.hive.convertMetastoreParquet\" or " +
 +        "\"spark.sql.hive.convertMetastoreOrc\" is true, the built-in ORC/Parquet writer is used" +
 
 Review comment:
   I was referring to `CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING`; will fix it.





[GitHub] [spark] advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table

2019-08-06 Thread GitBox
URL: https://github.com/apache/spark/pull/25306#discussion_r311351669
 
 

 File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala

 @@ -58,7 +58,11 @@ class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
         |TBLPROPERTIES('prop1Key'="prop1Val", '`prop2Key`'="prop2Val")
       """.stripMargin)
     sql("CREATE TABLE parquet_tab3(col1 int, `col 2` int)")
 -    sql("CREATE TABLE parquet_tab4 (price int, qty int) partitioned by (year int, month int)")
 +    sql(
 +      """
 +        |CREATE TABLE parquet_tab4 (price int, qty int) partitioned by (year int, month int)
 +        |STORED AS PARQUET
 
 Review comment:
   It's modified because I randomly chose it to verify that inserting into a partitioned table can be safely converted.
   However, it should already be covered by other test cases.

   So it's a neutral change; I can revert it if you think it's unnecessary.





[GitHub] [spark] advancedxy commented on a change in pull request #25306: [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table

2019-07-30 Thread GitBox
URL: https://github.com/apache/spark/pull/25306#discussion_r309027655
 
 

 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala

 @@ -197,7 +197,9 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
           Some(partitionSchema))

         val logicalRelation = cached.getOrElse {
 -      val sizeInBytes = relation.stats.sizeInBytes.toLong
 +      val defaultSizeInBytes = sparkSession.sessionState.conf.defaultSizeInBytes
 
 Review comment:
   For the write path, the stats for the Hive table are not calculated, because InsertIntoTable's table field is not included in its children.

   So `org.apache.spark.sql.hive.DetermineTableStats` skips this table.
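
A simplified illustration of why a children-based rule misses the insert target (these classes are hypothetical, not Spark's real ones): a traversal that only follows `children` never reaches a node held in a separate field.

```scala
// Hypothetical miniature plan tree.
sealed trait Node { def children: Seq[Node] }
case class Leaf(name: String) extends Node { def children: Seq[Node] = Nil }
case class InsertNode(target: Leaf, query: Node) extends Node {
  def children: Seq[Node] = Seq(query) // `target` is deliberately not a child
}

// Collect the names of all leaves reachable through `children`.
def collectNames(node: Node): Seq[String] = node match {
  case Leaf(n) => Seq(n)
  case other   => other.children.flatMap(collectNames)
}

val plan = InsertNode(Leaf("hive_table"), Leaf("source_query"))
val visited = collectNames(plan) // "hive_table" is never visited
```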

