[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395744#comment-17395744 ]

ASF GitHub Bot commented on HUDI-2208:

nsivabalan commented on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-894939778

thanks for your contribution! Good job

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [SQL] Support Bulk Insert For Spark Sql
> ---------------------------------------
>
>                 Key: HUDI-2208
>                 URL: https://issues.apache.org/jira/browse/HUDI-2208
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: pengzhiwei
>            Assignee: pengzhiwei
>            Priority: Blocker
>              Labels: pull-request-available, release-blocker
>             Fix For: 0.9.0
>
>
> Support the bulk insert for spark sql

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
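For context on the merged feature: the PR's tests exercise two session configs, `hoodie.sql.bulk.insert.enable` and `hoodie.sql.insert.mode`. A minimal sketch of how a user would turn on bulk insert before an INSERT, assuming those configs as shown in the tests (the table name and values here are hypothetical, not from the patch):

```sql
-- Sketch only: enable bulk insert for subsequent Spark SQL inserts.
set hoodie.sql.bulk.insert.enable = true;
set hoodie.sql.insert.mode = strict;

-- Hypothetical Hudi table; the insert below would take the bulk-insert path.
insert into hudi_tbl values (1, 'a1', 10, '2021-07-18');
```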
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395742#comment-17395742 ]

ASF GitHub Bot commented on HUDI-2208:

nsivabalan merged pull request #3328:
URL: https://github.com/apache/hudi/pull/3328
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395741#comment-17395741 ]

ASF GitHub Bot commented on HUDI-2208:

nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r684894126

File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala

```
@@ -378,6 +379,41 @@ class TestInsertTable extends TestHoodieSqlBase {
       Seq(1, "a1", 10.0, "2021-07-18"),
       Seq(2, "a2", 10.0, "2021-07-18")
     )
+
+    // Test bulk insert for multi-level partition
+    val tableMultiPartition = generateTableName
+    spark.sql(
+      s"""
+         |create table $tableMultiPartition (
+         |  id int,
+         |  name string,
+         |  price double,
+         |  dt string,
+         |  hh string
+         |) using hudi
+         | options (
+         |   type = '$tableType'
+         | )
+         | partitioned by (dt, hh)
+         | location '${tmp.getCanonicalPath}/$tableMultiPartition'
+       """.stripMargin)
+
+    // Enable the bulk insert
+    spark.sql("set hoodie.sql.bulk.insert.enable = true")
+    spark.sql(s"insert into $tableMultiPartition values(1, 'a1', 10, '2021-07-18', '12')")
+
+    checkAnswer(s"select id, name, price, dt, hh from $tableMultiPartition")(
```

Review comment: let's verify meta fields as well, as suggested in the other patch.
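The meta-field verification the reviewer asks for would, as a sketch, select Hudi's standard meta columns alongside the data columns. The column names below are Hudi's well-known meta fields, not taken from this diff, and the table name is illustrative:

```sql
-- Hedged sketch: check that meta fields are populated after a bulk insert.
select _hoodie_record_key, _hoodie_partition_path, id, name, price, dt, hh
from hudi_tbl_multi_partition;  -- hypothetical table name
```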
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395740#comment-17395740 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* abed45a1e858e7bd16b40e203c9aa88302e67921 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1558)

Bot commands: @hudi-bot supports `@hudi-bot run travis` (re-run the last Travis build) and `@hudi-bot run azure` (re-run the last Azure build).
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395724#comment-17395724 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* cf12550810f530b84896fb904f2feb60eb440ac5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1543)
* abed45a1e858e7bd16b40e203c9aa88302e67921 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1558)
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395715#comment-17395715 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* cf12550810f530b84896fb904f2feb60eb440ac5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1543)
* abed45a1e858e7bd16b40e203c9aa88302e67921 UNKNOWN
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395693#comment-17395693 ]

ASF GitHub Bot commented on HUDI-2208:

pengzhiwei2018 commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r684870787

File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala

```
@@ -303,5 +304,184 @@ class TestInsertTable extends TestHoodieSqlBase {
       "assertion failed: Required select columns count: 4, Current select columns(including static partition column)" +
         " count: 3,columns: (1,a1,10)"
     )
+    spark.sql("set hoodie.sql.bulk.insert.enable = true")
+    spark.sql("set hoodie.sql.insert.mode = strict")
+
+    val tableName2 = generateTableName
```

Review comment: Yes, will add the case for multi-level partition.
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395645#comment-17395645 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* cf12550810f530b84896fb904f2feb60eb440ac5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1543)
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395549#comment-17395549 ]

ASF GitHub Bot commented on HUDI-2208:

nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r684803078

File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala

```
@@ -303,5 +304,184 @@ class TestInsertTable extends TestHoodieSqlBase {
       "assertion failed: Required select columns count: 4, Current select columns(including static partition column)" +
         " count: 3,columns: (1,a1,10)"
     )
+    spark.sql("set hoodie.sql.bulk.insert.enable = true")
+    spark.sql("set hoodie.sql.insert.mode = strict")
+
+    val tableName2 = generateTableName
```

Review comment: Can we also enhance the test with both types of partitions (single-level and multi-level)?
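The two partition shapes the reviewer asks to cover differ only in the `partitioned by` clause. A minimal sketch, with illustrative table and column names (only `dt`/`hh` and the `using hudi` clause come from the patch's tests):

```sql
-- Single-level partition: one partition column.
create table tbl_single (id int, name string, price double, dt string)
using hudi
partitioned by (dt);

-- Multi-level partition: nested partition directories dt=.../hh=...
create table tbl_multi (id int, name string, price double, dt string, hh string)
using hudi
partitioned by (dt, hh);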
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395482#comment-17395482 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* cf12550810f530b84896fb904f2feb60eb440ac5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1543)
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395477#comment-17395477 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* Unknown: [CANCELED](TBD)
* cf12550810f530b84896fb904f2feb60eb440ac5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1543)
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395475#comment-17395475 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* Unknown: [CANCELED](TBD)
* cf12550810f530b84896fb904f2feb60eb440ac5 UNKNOWN
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395474#comment-17395474 ]

ASF GitHub Bot commented on HUDI-2208:

pengzhiwei2018 commented on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-894753661

@hudi-bot run azure
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395221#comment-17395221 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* Unknown: [CANCELED](TBD)
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395220#comment-17395220 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* Unknown: [CANCELED](TBD)
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395217#comment-17395217 ]

ASF GitHub Bot commented on HUDI-2208:

pengzhiwei2018 commented on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-894651032

@hudi-bot run azure
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395152#comment-17395152 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* ebd3310059d27544648a31a7c3fb3cb1febcea60 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1460)
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395147#comment-17395147 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1447) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1459)
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* ebd3310059d27544648a31a7c3fb3cb1febcea60 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1460)
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395144#comment-17395144 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427 ## CI report: * 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN * 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN * f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN * bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN * becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN * 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN * f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1447) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1459) * b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN * ebd3310059d27544648a31a7c3fb3cb1febcea60 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1460) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395124#comment-17395124 ] ASF GitHub Bot commented on HUDI-2208: -- pengzhiwei2018 commented on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-894613775 @hudi-bot run azure
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395125#comment-17395125 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1447), [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1459)
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* ebd3310059d27544648a31a7c3fb3cb1febcea60 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395122#comment-17395122 ] ASF GitHub Bot commented on HUDI-2208: -- pengzhiwei2018 edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-894613109

Hi @nsivabalan, the PR has been updated with the following change: an "upsert" mode was added to insert.mode. We now have three insert modes:

- upsert: In upsert mode for insert into, a duplicate record on the primary key is updated. This is the default insert mode for a pk-table.
- strict: In strict mode for insert into, we guarantee pk uniqueness for a COW pk-table. For a MOR pk-table, it behaves the same as "upsert" mode.
- non-strict: In non-strict mode for insert into, we use the insert operation to write data, which allows writing duplicate records.

The default insert mode is `upsert` for a pk-table. These configs only control the behavior of a pk-table; for a non-pk-table, the insert operation is always `insert` or `bulkinsert`.
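The three insert modes described in the comment above can be sketched as a small decision function. This is a hypothetical Python paraphrase for illustration only — `resolve_pk_insert` and its return strings are not Hudi identifiers:

```python
def resolve_pk_insert(mode: str, table_type: str) -> str:
    """Sketch: map an insert.mode value to the write behavior for
    INSERT INTO on a pk-table (primary-keyed table)."""
    if mode == "upsert":
        # duplicate records on the primary key are updated (default for pk-tables)
        return "upsert"
    if mode == "strict":
        if table_type == "cow":
            # COW pk-table: enforce pk uniqueness on insert
            return "insert-with-pk-uniqueness-check"
        # MOR pk-table: same behavior as "upsert" mode
        return "upsert"
    if mode == "non-strict":
        # plain insert; duplicate records are allowed
        return "insert"
    raise ValueError(f"unknown insert mode: {mode}")

assert resolve_pk_insert("upsert", "cow") == "upsert"
assert resolve_pk_insert("strict", "mor") == "upsert"
```

Per the comment, this only applies to pk-tables; a non-pk-table always uses plain insert or bulk insert regardless of mode.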
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395119#comment-17395119 ] ASF GitHub Bot commented on HUDI-2208: -- pengzhiwei2018 edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-894613109

Hi @nsivabalan, the PR has been updated with the following change: an "upsert" mode was added to insert.mode. We now have three insert modes:

- upsert: In upsert mode for insert into, a duplicate record on the primary key is updated. This is the default insert mode for a pk-table.
- strict: In strict mode for insert into, we guarantee pk uniqueness for a COW pk-table. For a MOR pk-table, it behaves the same as "upsert" mode.
- non-strict: In non-strict mode for insert into, we use the insert operation to write data, which allows writing duplicate records.

The default insert mode is `upsert`.
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395120#comment-17395120 ] ASF GitHub Bot commented on HUDI-2208: -- pengzhiwei2018 edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-894613109

Hi @nsivabalan, the PR has been updated with the following change: an "upsert" mode was added to insert.mode. We now have three insert modes:

- upsert: In upsert mode for insert into, a duplicate record on the primary key is updated. This is the default insert mode for a pk-table.
- strict: In strict mode for insert into, we guarantee pk uniqueness for a COW pk-table. For a MOR pk-table, it behaves the same as "upsert" mode.
- non-strict: In non-strict mode for insert into, we use the insert operation to write data, which allows writing duplicate records.

The default insert mode is `upsert` for a pk-table.
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395118#comment-17395118 ] ASF GitHub Bot commented on HUDI-2208: -- pengzhiwei2018 commented on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-894613109

Hi @nsivabalan, the PR has been updated with the following change: an "upsert" mode was added to insert.mode. We now have three insert modes:

- upsert: In upsert mode for insert into, a duplicate record on the primary key is updated. This is the default insert mode for a pk-table.
- strict: In strict mode for insert into, we guarantee pk uniqueness for a COW pk-table. For a MOR pk-table, it behaves the same as "upsert" mode.
- non-strict: In non-strict mode for insert into, we use the insert operation to write data, which allows writing duplicate records.
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395116#comment-17395116 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1447)
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* ebd3310059d27544648a31a7c3fb3cb1febcea60 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395115#comment-17395115 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1447)
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395037#comment-17395037 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1447)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395022#comment-17395022 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443), [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1447)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395021#comment-17395021 ] ASF GitHub Bot commented on HUDI-2208: -- nsivabalan commented on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-894571028 @hudi-bot run azure
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394899#comment-17394899 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394857#comment-17394857 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426), [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394854#comment-17394854 ] ASF GitHub Bot commented on HUDI-2208: -- nsivabalan commented on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-894355629 @hudi-bot run azure
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394586#comment-17394586 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394552#comment-17394552 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* 123f88802eb116939544de462d3fa372b8eb1684 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1406)
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394538#comment-17394538 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* 123f88802eb116939544de462d3fa372b8eb1684 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1406)
* f4e8be72b0140410daeb4eb01879047eba074751 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394384#comment-17394384 ]

ASF GitHub Bot commented on HUDI-2208:
--
nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r683843426

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
+            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert overwrite partition
+        case (_, _, true, _) if isPartitionedTable => INSERT_OVERWRITE_OPERATION_OPT_VAL
+        // insert overwrite table
+        case (_, _, true, _) if !isPartitionedTable => INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL
+        // if the table has primaryKey and the dropDuplicate has disable, use the upsert operation
+        case (true, false, false, false) => UPSERT_OPERATION_OPT_VAL
+        // if enableBulkInsert is true and the table is non-primaryKeyed, use the bulk insert operation
+        case (false, true, _, _) => BULK_INSERT_OPERATION_OPT_VAL
+        // for the rest case, use the insert operation
+        case (_, _, _, _) => INSERT_OPERATION_OPT_VAL

Review comment:
I went through every case here and have two suggestions; the rest of the cases look good. You don't need to consider my earlier proposal, but please consider the feedback below.

1.
```
case (true, true, _, _) if !isNonStrictMode =>
  throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert in strict mode.")
```
Can we enable preCombine here and proceed with the BULK_INSERT operation? Within Hudi, we can do the preCombine/dedup. As we agreed on using bulk_insert as the default for CTAS, this will be a very common use-case.

2.
```
case (_, true, true, _) if isPartitionedTable =>
  throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
```
Since we agreed on enabling bulk_insert as the default for CTAS, this will be a very common use-case as well. Can you help me understand why we fail this call? Why can't we let it proceed? This is basically CTAS for a partitioned table.

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
## @@ -159,7 +159,10 @@ object HoodieSparkSqlWriter {
     // Convert to RDD[HoodieRecord]
     val genericRecords: RDD[GenericRecord] = HoodieSparkUtils.createRdd(df, schema, structName, nameSpace)
-    val shouldCombine = parameters(INSERT_DROP_DUPS_OPT_KEY.key()).toBoolean || operation.equals(WriteOperationType.UPSERT);
+    val shouldCombine = parameters(INSERT_DROP_DUPS_OPT_KEY.key()).toBoolean ||
+      operation.equals(WriteOperationType.UPSERT) ||
+      parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_INSERT_PROP.key(),

Review comment:
My bad, I get it now. If InsertDropDups is set, we automatically set combine.before.insert. But if a user has set just "combine.before.insert", we need to do the preCombine here. I am not sure why this wasn't reported by anyone until now, though.
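For readers following the thread, the operation-selection logic being reviewed can be summarized with a small sketch. This is an illustrative Python transcription of the quoted Scala match expression (case order preserved), not Hudi's actual API; the returned strings stand in for the `*_OPERATION_OPT_VAL` constants:

```python
def pick_operation(is_pk_table, bulk_insert, overwrite, drop_dups, partitioned):
    """Mirror of the (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate)
    match in the patch, with isPartitionedTable as the guard condition."""
    if is_pk_table and bulk_insert:
        raise ValueError("Table with primaryKey can not use bulk insert.")
    if bulk_insert and overwrite and partitioned:
        raise ValueError("Insert Overwrite Partition can not use bulk insert.")
    if bulk_insert and drop_dups:
        raise ValueError("Bulk insert cannot support drop duplication.")
    # bulk insert for insert-overwrite of a non-partitioned table
    if bulk_insert and overwrite and not partitioned:
        return "bulk_insert"
    if overwrite and partitioned:
        return "insert_overwrite"        # overwrite partitions
    if overwrite and not partitioned:
        return "insert_overwrite_table"  # overwrite whole table
    # pk table with no overwrite and no drop-duplicates: upsert
    if is_pk_table and not bulk_insert and not overwrite and not drop_dups:
        return "upsert"
    if not is_pk_table and bulk_insert:
        return "bulk_insert"
    return "insert"                      # default for the remaining cases
```

For example, a plain `INSERT INTO` on a non-pk table with bulk insert enabled resolves to `bulk_insert`, while the same statement on a pk table raises, which is exactly the case the two suggestions above ask to relax.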
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394000#comment-17394000 ]

ASF GitHub Bot commented on HUDI-2208:
--
hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* 123f88802eb116939544de462d3fa372b8eb1684 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1406)
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393942#comment-17393942 ]

ASF GitHub Bot commented on HUDI-2208:
--
hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* b3e8a6d36161d5da60a1429e518253e1bff92a9d Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1402)
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* 123f88802eb116939544de462d3fa372b8eb1684 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1406)
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393933#comment-17393933 ]

ASF GitHub Bot commented on HUDI-2208:
--
pengzhiwei2018 commented on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-893396472

@vinothchandar @nsivabalan The PR has been updated with the following changes:

1. Allow bulk insert for a pk table. I introduced a config, `hoodie.sql.insert.mode`. If set to "strict", we enforce the pk uniqueness guarantee. If set to "non-strict", we ignore the uniqueness guarantee for the pk table, and bulk insert is supported in that case. By default the value is "non-strict".
2. CTAS uses bulk insert by default.
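The strict/non-strict behaviour described in this comment can be sketched as a simple gate. This is an illustration of the described semantics only (the config name `hoodie.sql.insert.mode` comes from the comment; the actual enforcement in Hudi may be structured differently):

```python
def allow_bulk_insert(is_pk_table, insert_mode="non-strict"):
    """Hypothetical gate: in strict mode a pk table must keep its uniqueness
    guarantee, so bulk insert (which skips the index lookup) is rejected;
    in non-strict mode the guarantee is waived and bulk insert is allowed."""
    if is_pk_table and insert_mode == "strict":
        raise ValueError(
            "Table with primaryKey can not use bulk insert in strict mode; "
            "set hoodie.sql.insert.mode=non-strict to allow it.")
    return True
```

Under the defaults described above, a CTAS into a pk table would therefore pass the gate and use bulk insert.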
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393821#comment-17393821 ]

ASF GitHub Bot commented on HUDI-2208:
--
pengzhiwei2018 commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r683207185

## File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
## @@ -248,6 +248,14 @@ object DataSourceWriteOptions {
     .withDocumentation("When set to true, will perform write operations directly using the spark native " +
       "`Row` representation, avoiding any additional conversion costs.")
+
+  /**
+   * Enable the bulk insert for sql insert statement.
+   */
+  val SQL_ENABLE_BULK_INSERT: ConfigProperty[String] = ConfigProperty

Review comment:
Sounds reasonable. CTAS uses bulk_insert by default, and INSERT INTO uses the regular insert by default.

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")

Review comment:
For CTAS, we can relax this, because no data exists in the target table yet. We can just combine the input by pk before the bulk insert to reach the same goal.

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
## @@ -159,7 +159,10 @@ object HoodieSparkSqlWriter {
     // Convert to RDD[HoodieRecord]
     val genericRecords: RDD[GenericRecord] = HoodieSparkUtils.createRdd(df, schema, structName, nameSpace)
-    val shouldCombine = parameters(INSERT_DROP_DUPS_OPT_KEY.key()).toBoolean || operation.equals(WriteOperationType.UPSERT);
+    val shouldCombine = parameters(INSERT_DROP_DUPS_OPT_KEY.key()).toBoolean ||
+      operation.equals(WriteOperationType.UPSERT) ||
+      parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_INSERT_PROP.key(),

Review comment:
@vinothchandar well, I think INSERT_DROP_DUPS_OPT_KEY is somewhat different from COMBINE_BEFORE_INSERT_PROP. **INSERT_DROP_DUPS_OPT_KEY** is used to drop duplicate records against the target table, while `COMBINE_BEFORE_INSERT_PROP` is used to combine duplicate records within the input. So they are not totally the same config, IMO.

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
+            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert overwrite partition
+        case (_, _, true, _) if isPartitionedTable =>
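The `shouldCombine` change discussed in this exchange reduces to a three-way predicate. A minimal Python sketch follows; the config key strings are my best recollection of the keys behind INSERT_DROP_DUPS_OPT_KEY and COMBINE_BEFORE_INSERT_PROP and should be treated as assumptions:

```python
def should_combine(params, operation):
    """Combine records by key before the write if the user asked to drop
    duplicates, the operation is an upsert, or combine-before-insert was
    set explicitly (the case the patch adds)."""
    drop_dups = params.get("hoodie.datasource.write.insert.drop.duplicates", "false") == "true"
    combine_before_insert = params.get("hoodie.combine.before.insert", "false") == "true"
    return drop_dups or operation == "upsert" or combine_before_insert
```

This captures the point made above: drop-duplicates implies combining, but a user can also request combining on its own, which the original one-clause predicate missed.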
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393781#comment-17393781 ]

ASF GitHub Bot commented on HUDI-2208:
--
hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* e88244d233d323364916c4fc240083566ddc4e56 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1272)
* b3e8a6d36161d5da60a1429e518253e1bff92a9d UNKNOWN
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392596#comment-17392596 ]

ASF GitHub Bot commented on HUDI-2208:
--
nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r682137257

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
+            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert overwrite partition
+        case (_, _, true, _) if isPartitionedTable => INSERT_OVERWRITE_OPERATION_OPT_VAL
+        // insert overwrite table
+        case (_, _, true, _) if !isPartitionedTable => INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL
+        // if the table has primaryKey and the dropDuplicate has disable, use the upsert operation
+        case (true, false, false, false) => UPSERT_OPERATION_OPT_VAL
+        // if enableBulkInsert is true and the table is non-primaryKeyed, use the bulk insert operation
+        case (false, true, _, _) => BULK_INSERT_OPERATION_OPT_VAL
+        // for the rest case, use the insert operation
+        case (_, _, _, _) => INSERT_OPERATION_OPT_VAL

Review comment:
Actually, I came across [INSERT OVERWRITE DIRECTORY](https://spark.apache.org/docs/latest/sql-ref-syntax-dml-insert-overwrite-directory.html), which can be mapped to insert_overwrite. Here is a suggestion without using any additional configs:

- CTAS -> bulk_insert
- Insert into -> insert
- INSERT OVERWRITE -> insert overwrite table
- INSERT OVERWRITE DIRECTORY -> insert overwrite (partitions)
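The default mapping proposed in this comment can be written down as a tiny lookup table. This is a sketch of the suggestion only, not the final implementation; the statement-kind keys are illustrative names:

```python
# Proposed defaults per SQL statement kind, with no extra configs
# (per the review suggestion above).
DEFAULT_OPERATION = {
    "ctas": "bulk_insert",
    "insert_into": "insert",
    "insert_overwrite": "insert_overwrite_table",
    "insert_overwrite_directory": "insert_overwrite",
}

def default_operation(statement_kind):
    """Resolve the Hudi write operation for a SQL statement kind."""
    return DEFAULT_OPERATION[statement_kind]
```

The key property is that each statement kind gets a sensible operation without the user having to set `hoodie.sql.enable.bulk.insert`-style flags.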
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392553#comment-17392553 ]

ASF GitHub Bot commented on HUDI-2208:
--
vinothchandar commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r682108786

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")

Review comment:
Insert overwrite partition should be using the `INSERT_OVERWRITE` operation.

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
## @@ -159,7 +159,10 @@ object HoodieSparkSqlWriter {
     // Convert to RDD[HoodieRecord]
     val genericRecords: RDD[GenericRecord] = HoodieSparkUtils.createRdd(df, schema, structName, nameSpace)
-    val shouldCombine = parameters(INSERT_DROP_DUPS_OPT_KEY.key()).toBoolean || operation.equals(WriteOperationType.UPSERT);
+    val shouldCombine = parameters(INSERT_DROP_DUPS_OPT_KEY.key()).toBoolean ||
+      operation.equals(WriteOperationType.UPSERT) ||
+      parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_INSERT_PROP.key(),

Review comment:
@pengzhiwei2018 do you agree with siva's analysis above?

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")

Review comment:
+1. Users might have another Hudi table, e.g. to CTAS from. So if we disallow bulk insert with a pk, there is no good way to do a full bootstrap. Can we relax this?

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392548#comment-17392548 ]

ASF GitHub Bot commented on HUDI-2208:
--

nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r682101452

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala

## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
        .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
        .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert overwrite partition
+        case (_, _, true, _) if isPartitionedTable => INSERT_OVERWRITE_OPERATION_OPT_VAL
+        // insert overwrite table
+        case (_, _, true, _) if !isPartitionedTable => INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL
+        // if the table has a primaryKey and dropDuplicate is disabled, use the upsert operation
+        case (true, false, false, false) => UPSERT_OPERATION_OPT_VAL
+        // if enableBulkInsert is true and the table has no primaryKey, use the bulk insert operation
+        case (false, true, _, _) => BULK_INSERT_OPERATION_OPT_VAL
+        // for the remaining cases, use the insert operation
+        case (_, _, _, _) => INSERT_OPERATION_OPT_VAL

Review comment: @vinothchandar: do check this out before reviewing the other feedback.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [SQL] Support Bulk Insert For Spark Sql
> ---
>
> Key: HUDI-2208
> URL: https://issues.apache.org/jira/browse/HUDI-2208
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: pengzhiwei
> Assignee: pengzhiwei
> Priority: Blocker
> Labels: pull-request-available, release-blocker
>
> Support the bulk insert for spark sql

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17391810#comment-17391810 ]

ASF GitHub Bot commented on HUDI-2208:
--

nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r680197209

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala

## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
        .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
        .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")

Review comment: Anyway, we can call out that it is the user's responsibility to ensure uniqueness. Also, IIUC, Hudi can handle duplicates: in case of updates, both records will be updated. But bulk_insert is very performant compared to a regular insert, especially with the row writer, so we should not make it too restrictive to use. I know from the community messages that a lot of users leverage bulk_insert. I would vote to relax this constraint.
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17391377#comment-17391377 ]

ASF GitHub Bot commented on HUDI-2208:
--

pengzhiwei2018 commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r680693753

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala

## @@ -243,6 +256,8 @@ object InsertIntoHoodieTableCommand {
        RECORDKEY_FIELD_OPT_KEY.key -> primaryColumns.mkString(","),
        PARTITIONPATH_FIELD_OPT_KEY.key -> partitionFields,
        PAYLOAD_CLASS_OPT_KEY.key -> payloadClassName,
+       ENABLE_ROW_WRITER_OPT_KEY.key -> enableBulkInsert.toString,
+       HoodieWriteConfig.COMBINE_BEFORE_INSERT_PROP.key -> isPrimaryKeyTable.toString, // if the table has a primaryKey, enable the combine

Review comment: Just like the upsert operation, where Hudi does the combine automatically, we can do this for the user too, which is much friendlier for our users.
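The wiring discussed above, deriving combine-before-insert from the presence of a primary key, can be sketched as a small option-building function. Plain strings stand in for Hudi's config constants here; this is an illustration of the idea, not the actual `InsertIntoHoodieTableCommand` code:

```scala
object CombineBeforeInsertSketch {
  // Plain strings stand in for Hudi's config constants
  // (RECORDKEY_FIELD_OPT_KEY, ENABLE_ROW_WRITER_OPT_KEY, COMBINE_BEFORE_INSERT_PROP).
  def writeOptions(primaryColumns: Seq[String], enableBulkInsert: Boolean): Map[String, String] = {
    val isPrimaryKeyTable = primaryColumns.nonEmpty
    Map(
      "hoodie.datasource.write.recordkey.field" -> primaryColumns.mkString(","),
      "hoodie.datasource.write.row.writer.enable" -> enableBulkInsert.toString,
      // If the table has a primary key, combine input records before insert,
      // mirroring what the upsert path does automatically.
      "hoodie.combine.before.insert" -> isPrimaryKeyTable.toString
    )
  }

  def main(args: Array[String]): Unit = {
    assert(writeOptions(Seq("id"), enableBulkInsert = false)("hoodie.combine.before.insert") == "true")
    assert(writeOptions(Nil, enableBulkInsert = true)("hoodie.combine.before.insert") == "false")
    println("ok")
  }
}
```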
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390829#comment-17390829 ]

ASF GitHub Bot commented on HUDI-2208:
--

nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r680212711

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala

## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
        .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
        .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert overwrite partition
+        case (_, _, true, _) if isPartitionedTable => INSERT_OVERWRITE_OPERATION_OPT_VAL
+        // insert overwrite table
+        case (_, _, true, _) if !isPartitionedTable => INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL
+        // if the table has a primaryKey and dropDuplicate is disabled, use the upsert operation
+        case (true, false, false, false) => UPSERT_OPERATION_OPT_VAL
+        // if enableBulkInsert is true and the table has no primaryKey, use the bulk insert operation
+        case (false, true, _, _) => BULK_INSERT_OPERATION_OPT_VAL
+        // for the remaining cases, use the insert operation
+        case (_, _, _, _) => INSERT_OPERATION_OPT_VAL

Review comment: Here is my thought on choosing the right operation. Having too many case statements might complicate things and is error-prone. As I mentioned earlier, we should try to do any valid conversions in HoodieSparkSqlWriter; only what is applicable just to sql dml should be kept here. Anyway, here is one simplified approach, ignoring the primary/non-primary-key table distinction for now. We can come back to that once we have consensus on this. We need just two configs:

hoodie.sql.enable.bulk_insert (default false)
hoodie.sql.overwrite.entire.table (default true)

From sql syntax, two commands are allowed: "INSERT" and "INSERT OVERWRITE". These need to map to four operations on the Hudi end (insert, bulk_insert, insert_overwrite, and insert_overwrite_table):

"INSERT" with no other configs set -> insert operation
"INSERT" with enable bulk insert set -> bulk_insert
"INSERT OVERWRITE" with no other configs set -> insert_overwrite_table operation
"INSERT OVERWRITE" with hoodie.sql.overwrite.entire.table = false -> insert_overwrite operation
"INSERT OVERWRITE" with enable bulk_insert set -> bulk_insert; pass the right save mode to HoodieSparkSqlWriter
"INSERT OVERWRITE" with enable bulk_insert set and hoodie.sql.overwrite.entire.table = false -> bulk_insert; pass the right save mode to HoodieSparkSqlWriter
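The simplified mapping proposed in this comment can be sketched as a small decision function. The two config keys below are the hypothetical ones from the proposal, not Hudi's shipped configs, and the returned strings merely name the operations:

```scala
object SqlOperationMappingSketch {
  // Hypothetical config keys from the proposal above (not actual Hudi keys).
  val EnableBulkInsert = "hoodie.sql.enable.bulk_insert"        // default false
  val OverwriteEntireTable = "hoodie.sql.overwrite.entire.table" // default true

  // Maps the two SQL commands (INSERT / INSERT OVERWRITE) plus the two
  // proposed configs onto the four Hudi write operations.
  def chooseOperation(isOverwrite: Boolean, params: Map[String, String]): String = {
    val bulkInsert = params.getOrElse(EnableBulkInsert, "false").toBoolean
    val overwriteTable = params.getOrElse(OverwriteEntireTable, "true").toBoolean
    (isOverwrite, bulkInsert) match {
      // bulk_insert wins either way; the save mode is passed downstream.
      case (_, true)      => "bulk_insert"
      case (true, false)  => if (overwriteTable) "insert_overwrite_table" else "insert_overwrite"
      case (false, false) => "insert"
    }
  }

  def main(args: Array[String]): Unit = {
    assert(chooseOperation(isOverwrite = false, Map.empty) == "insert")
    assert(chooseOperation(isOverwrite = false, Map(EnableBulkInsert -> "true")) == "bulk_insert")
    assert(chooseOperation(isOverwrite = true, Map.empty) == "insert_overwrite_table")
    assert(chooseOperation(isOverwrite = true, Map(OverwriteEntireTable -> "false")) == "insert_overwrite")
    println("ok")
  }
}
```

Collapsing the decision to two inputs plus two configs is exactly the simplification the comment argues for: fewer case arms, and anything key-related handled elsewhere.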
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390783#comment-17390783 ]

ASF GitHub Bot commented on HUDI-2208:
--

nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r680186620

## File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala

## @@ -248,6 +248,14 @@ object DataSourceWriteOptions {
    .withDocumentation("When set to true, will perform write operations directly using the spark native " +
      "`Row` representation, avoiding any additional conversion costs.")

+  /**
+   * Enable the bulk insert for sql insert statement.
+   */
+  val SQL_ENABLE_BULK_INSERT: ConfigProperty[String] = ConfigProperty

Review comment: @vinothchandar: In sql, we don't have two separate commands like INSERT INTO and BULK_INSERT INTO, so I guess we are going this route. But by default CTAS chooses the INSERT operation. I am thinking users may not end up using bulk_insert, since they have to set the property explicitly. Any thoughts? There are two things to discuss: 1. which operation to use with CTAS; 2. which operation to use with INSERT INTO. The state as of now is "insert", and the user has to explicitly set the operation type to bulk_insert before calling either of these commands.
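The `SQL_ENABLE_BULK_INSERT` declaration quoted in the diff above is truncated; the shape of such a declaration can be sketched with a minimal stand-in for Hudi's `ConfigProperty` builder. The stub class, the key name, and the documentation string here are illustrative assumptions, not the real `DataSourceOptions.scala` code:

```scala
object ConfigPropertySketch {
  // Minimal stand-in for Hudi's ConfigProperty builder (not the real class).
  case class ConfigProperty[T](key: String, defaultValue: T, doc: String = "") {
    def withDocumentation(d: String): ConfigProperty[T] = copy(doc = d)
  }
  object ConfigProperty {
    case class KeyBuilder(k: String) {
      def defaultValue[T](v: T): ConfigProperty[T] = ConfigProperty(k, v)
    }
    def key(k: String): KeyBuilder = KeyBuilder(k)
  }

  // Illustrative declaration; key name and wording are assumptions.
  val SQL_ENABLE_BULK_INSERT: ConfigProperty[String] = ConfigProperty
    .key("hoodie.sql.bulk.insert.enable")
    .defaultValue("false")
    .withDocumentation("When set to true, the sql insert statement will use bulk insert.")

  def main(args: Array[String]): Unit = {
    assert(!SQL_ENABLE_BULK_INSERT.defaultValue.toBoolean)
    assert(SQL_ENABLE_BULK_INSERT.key == "hoodie.sql.bulk.insert.enable")
    println("ok")
  }
}
```

The builder pattern (key, then default, then documentation) matches how the surrounding options in the diff are declared.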
## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala

## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
        .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
        .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert overwrite partition
+        case (_, _, true, _) if isPartitionedTable => INSERT_OVERWRITE_OPERATION_OPT_VAL
+        // insert overwrite table
+        case (_, _, true, _) if !isPartitionedTable => INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL

Review comment: HoodieSparkSqlWriter will handle this save mode.

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala

## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
        .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
        .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert