[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395744#comment-17395744 ]

ASF GitHub Bot commented on HUDI-2208:

nsivabalan commented on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-894939778

thanks for your contribution! Good job

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [SQL] Support Bulk Insert For Spark Sql
> ---------------------------------------
>
>                 Key: HUDI-2208
>                 URL: https://issues.apache.org/jira/browse/HUDI-2208
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: pengzhiwei
>            Assignee: pengzhiwei
>            Priority: Blocker
>              Labels: pull-request-available, release-blocker
>             Fix For: 0.9.0
>
>
> Support the bulk insert for spark sql

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
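For context on the merged feature: the PR's tests exercise two session configs, `hoodie.sql.bulk.insert.enable` and `hoodie.sql.insert.mode`. A minimal sketch of how a user would turn on bulk insert before an INSERT, assuming those configs as shown in the tests (the table name and values here are hypothetical, not from the patch):

```sql
-- Sketch only: enable bulk insert for subsequent Spark SQL inserts.
set hoodie.sql.bulk.insert.enable = true;
set hoodie.sql.insert.mode = strict;

-- Hypothetical Hudi table; the insert below would take the bulk-insert path.
insert into hudi_tbl values (1, 'a1', 10, '2021-07-18');
```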
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395742#comment-17395742 ]

ASF GitHub Bot commented on HUDI-2208:

nsivabalan merged pull request #3328:
URL: https://github.com/apache/hudi/pull/3328
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395741#comment-17395741 ]

ASF GitHub Bot commented on HUDI-2208:

nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r684894126

File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala

```
@@ -378,6 +379,41 @@ class TestInsertTable extends TestHoodieSqlBase {
       Seq(1, "a1", 10.0, "2021-07-18"),
       Seq(2, "a2", 10.0, "2021-07-18")
     )
+
+    // Test bulk insert for multi-level partition
+    val tableMultiPartition = generateTableName
+    spark.sql(
+      s"""
+         |create table $tableMultiPartition (
+         |  id int,
+         |  name string,
+         |  price double,
+         |  dt string,
+         |  hh string
+         |) using hudi
+         | options (
+         |   type = '$tableType'
+         | )
+         | partitioned by (dt, hh)
+         | location '${tmp.getCanonicalPath}/$tableMultiPartition'
+       """.stripMargin)
+
+    // Enable the bulk insert
+    spark.sql("set hoodie.sql.bulk.insert.enable = true")
+    spark.sql(s"insert into $tableMultiPartition values(1, 'a1', 10, '2021-07-18', '12')")
+
+    checkAnswer(s"select id, name, price, dt, hh from $tableMultiPartition")(
```

Review comment: let's verify meta fields as well, as suggested in the other patch.
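The meta-field verification the reviewer asks for would, as a sketch, select Hudi's standard meta columns alongside the data columns. The column names below are Hudi's well-known meta fields, not taken from this diff, and the table name is illustrative:

```sql
-- Hedged sketch: check that meta fields are populated after a bulk insert.
select _hoodie_record_key, _hoodie_partition_path, id, name, price, dt, hh
from hudi_tbl_multi_partition;  -- hypothetical table name
```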
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395740#comment-17395740 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* abed45a1e858e7bd16b40e203c9aa88302e67921 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1558)

Bot commands: @hudi-bot supports `@hudi-bot run travis` (re-run the last Travis build) and `@hudi-bot run azure` (re-run the last Azure build).
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395724#comment-17395724 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* cf12550810f530b84896fb904f2feb60eb440ac5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1543)
* abed45a1e858e7bd16b40e203c9aa88302e67921 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1558)
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395715#comment-17395715 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* cf12550810f530b84896fb904f2feb60eb440ac5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1543)
* abed45a1e858e7bd16b40e203c9aa88302e67921 UNKNOWN
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395693#comment-17395693 ]

ASF GitHub Bot commented on HUDI-2208:

pengzhiwei2018 commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r684870787

File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala

```
@@ -303,5 +304,184 @@ class TestInsertTable extends TestHoodieSqlBase {
       "assertion failed: Required select columns count: 4, Current select columns(including static partition column)" +
         " count: 3,columns: (1,a1,10)"
     )
+    spark.sql("set hoodie.sql.bulk.insert.enable = true")
+    spark.sql("set hoodie.sql.insert.mode = strict")
+
+    val tableName2 = generateTableName
```

Review comment: Yes, will add the case for multi-level partition.
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395645#comment-17395645 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* cf12550810f530b84896fb904f2feb60eb440ac5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1543)
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395549#comment-17395549 ]

ASF GitHub Bot commented on HUDI-2208:

nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r684803078

File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala

```
@@ -303,5 +304,184 @@ class TestInsertTable extends TestHoodieSqlBase {
       "assertion failed: Required select columns count: 4, Current select columns(including static partition column)" +
         " count: 3,columns: (1,a1,10)"
     )
+    spark.sql("set hoodie.sql.bulk.insert.enable = true")
+    spark.sql("set hoodie.sql.insert.mode = strict")
+
+    val tableName2 = generateTableName
```

Review comment: Can we also enhance the test with both types of partitions (single-level and multi-level)?
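The two partition shapes the reviewer asks to cover differ only in the `partitioned by` clause. A minimal sketch, with illustrative table and column names (only `dt`/`hh` and the `using hudi` clause come from the patch's tests):

```sql
-- Single-level partition: one partition column.
create table tbl_single (id int, name string, price double, dt string)
using hudi
partitioned by (dt);

-- Multi-level partition: nested partition directories dt=.../hh=...
create table tbl_multi (id int, name string, price double, dt string, hh string)
using hudi
partitioned by (dt, hh);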
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395482#comment-17395482 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* cf12550810f530b84896fb904f2feb60eb440ac5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1543)
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395477#comment-17395477 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* Unknown: [CANCELED](TBD)
* cf12550810f530b84896fb904f2feb60eb440ac5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1543)
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395475#comment-17395475 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* Unknown: [CANCELED](TBD)
* cf12550810f530b84896fb904f2feb60eb440ac5 UNKNOWN
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395474#comment-17395474 ]

ASF GitHub Bot commented on HUDI-2208:

pengzhiwei2018 commented on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-894753661

@hudi-bot run azure
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395221#comment-17395221 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
* Unknown: [CANCELED](TBD)
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395220#comment-17395220 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* Unknown: [CANCELED](TBD)
* 0fddc583afb6b8eb460eb3dcff57ba355db5b7a8 UNKNOWN
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395217#comment-17395217 ]

ASF GitHub Bot commented on HUDI-2208:

pengzhiwei2018 commented on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-894651032

@hudi-bot run azure
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395152#comment-17395152 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* ebd3310059d27544648a31a7c3fb3cb1febcea60 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1460)
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395147#comment-17395147 ]

ASF GitHub Bot commented on HUDI-2208:

hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1447) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1459)
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* ebd3310059d27544648a31a7c3fb3cb1febcea60 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1460)
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395144#comment-17395144 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427 ## CI report: * 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN * 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN * f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN * bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN * becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN * 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN * f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1447) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1459) * b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN * ebd3310059d27544648a31a7c3fb3cb1febcea60 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1460) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395124#comment-17395124 ] ASF GitHub Bot commented on HUDI-2208: -- pengzhiwei2018 commented on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-894613775 @hudi-bot run azure
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395125#comment-17395125 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1447), [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1459)
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* ebd3310059d27544648a31a7c3fb3cb1febcea60 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395122#comment-17395122 ] ASF GitHub Bot commented on HUDI-2208: -- pengzhiwei2018 edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-894613109

Hi @nsivabalan, the PR has been updated with the following change: an "upsert" mode was added to insert.mode. We now have three insert modes:

- upsert: In upsert mode for insert into, a duplicate record on the primary key is updated. This is the default insert mode for a pk-table.
- strict: In strict mode for insert into, we guarantee pk uniqueness for a COW pk-table. For a MOR pk-table, it behaves the same as "upsert" mode.
- non-strict: In non-strict mode for insert into, we use the insert operation to write data, which allows writing duplicate records.

The default insert mode is `upsert` for a pk-table. These configs only control the behavior of a pk-table; for a non-pk-table, the insert operation is always `insert` or `bulkinsert`.
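The three insert modes described in the comment above can be sketched as a small decision function. This is a hypothetical Python paraphrase for illustration only — `resolve_pk_insert` and its return strings are not Hudi identifiers:

```python
def resolve_pk_insert(mode: str, table_type: str) -> str:
    """Sketch: map an insert.mode value to the write behavior for
    INSERT INTO on a pk-table (primary-keyed table)."""
    if mode == "upsert":
        # duplicate records on the primary key are updated (default for pk-tables)
        return "upsert"
    if mode == "strict":
        if table_type == "cow":
            # COW pk-table: enforce pk uniqueness on insert
            return "insert-with-pk-uniqueness-check"
        # MOR pk-table: same behavior as "upsert" mode
        return "upsert"
    if mode == "non-strict":
        # plain insert; duplicate records are allowed
        return "insert"
    raise ValueError(f"unknown insert mode: {mode}")

assert resolve_pk_insert("upsert", "cow") == "upsert"
assert resolve_pk_insert("strict", "mor") == "upsert"
```

Per the comment, this only applies to pk-tables; a non-pk-table always uses plain insert or bulk insert regardless of mode.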
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395119#comment-17395119 ] ASF GitHub Bot commented on HUDI-2208: -- pengzhiwei2018 edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-894613109

Hi @nsivabalan, the PR has been updated with the following change: an "upsert" mode was added to insert.mode. We now have three insert modes:

- upsert: In upsert mode for insert into, a duplicate record on the primary key is updated. This is the default insert mode for a pk-table.
- strict: In strict mode for insert into, we guarantee pk uniqueness for a COW pk-table. For a MOR pk-table, it behaves the same as "upsert" mode.
- non-strict: In non-strict mode for insert into, we use the insert operation to write data, which allows writing duplicate records.

The default insert mode is `upsert`.
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395120#comment-17395120 ] ASF GitHub Bot commented on HUDI-2208: -- pengzhiwei2018 edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-894613109

Hi @nsivabalan, the PR has been updated with the following change: an "upsert" mode was added to insert.mode. We now have three insert modes:

- upsert: In upsert mode for insert into, a duplicate record on the primary key is updated. This is the default insert mode for a pk-table.
- strict: In strict mode for insert into, we guarantee pk uniqueness for a COW pk-table. For a MOR pk-table, it behaves the same as "upsert" mode.
- non-strict: In non-strict mode for insert into, we use the insert operation to write data, which allows writing duplicate records.

The default insert mode is `upsert` for a pk-table.
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395118#comment-17395118 ] ASF GitHub Bot commented on HUDI-2208: -- pengzhiwei2018 commented on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-894613109

Hi @nsivabalan, the PR has been updated with the following change: an "upsert" mode was added to insert.mode. We now have three insert modes:

- upsert: In upsert mode for insert into, a duplicate record on the primary key is updated. This is the default insert mode for a pk-table.
- strict: In strict mode for insert into, we guarantee pk uniqueness for a COW pk-table. For a MOR pk-table, it behaves the same as "upsert" mode.
- non-strict: In non-strict mode for insert into, we use the insert operation to write data, which allows writing duplicate records.
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395116#comment-17395116 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1447)
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN
* ebd3310059d27544648a31a7c3fb3cb1febcea60 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395115#comment-17395115 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1447)
* b986aa4cc25d36da24a8ea926e3ecbe8912f1f17 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395037#comment-17395037 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1447)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395022#comment-17395022 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443), [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1447)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395021#comment-17395021 ] ASF GitHub Bot commented on HUDI-2208: -- nsivabalan commented on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-894571028 @hudi-bot run azure
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394899#comment-17394899 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426), [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394857#comment-17394857 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426), [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1443)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394854#comment-17394854 ] ASF GitHub Bot commented on HUDI-2208: -- nsivabalan commented on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-894355629 @hudi-bot run azure
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394586#comment-17394586 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394552#comment-17394552 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* 123f88802eb116939544de462d3fa372b8eb1684 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1406)
* f4e8be72b0140410daeb4eb01879047eba074751 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1426)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394538#comment-17394538 ] ASF GitHub Bot commented on HUDI-2208: -- hudi-bot edited a comment on pull request #3328: URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* 123f88802eb116939544de462d3fa372b8eb1684 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1406)
* f4e8be72b0140410daeb4eb01879047eba074751 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394384#comment-17394384 ]

ASF GitHub Bot commented on HUDI-2208:
--
nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r683843426

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
+            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert overwrite partition
+        case (_, _, true, _) if isPartitionedTable => INSERT_OVERWRITE_OPERATION_OPT_VAL
+        // insert overwrite table
+        case (_, _, true, _) if !isPartitionedTable => INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL
+        // if the table has primaryKey and the dropDuplicate has disable, use the upsert operation
+        case (true, false, false, false) => UPSERT_OPERATION_OPT_VAL
+        // if enableBulkInsert is true and the table is non-primaryKeyed, use the bulk insert operation
+        case (false, true, _, _) => BULK_INSERT_OPERATION_OPT_VAL
+        // for the rest case, use the insert operation
+        case (_, _, _, _) => INSERT_OPERATION_OPT_VAL

Review comment:
I went through every case here and have two suggestions; the rest of the cases look good. You don't need to consider my earlier proposal, but please consider the feedback below.

1.
```
case (true, true, _, _) if !isNonStrictMode =>
  throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert in strict mode.")
```
Can we enable preCombine here and proceed with the BULK_INSERT operation? Within Hudi, we can do the preCombine/dedup. As we agreed on using bulk_insert as the default for CTAS, this will be a very common use-case.

2.
```
case (_, true, true, _) if isPartitionedTable =>
  throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
```
Since we agreed on enabling bulk_insert as the default for CTAS, this will be a very common use-case as well. Can you help me understand why we fail this call? Why can't we let it proceed? This is basically CTAS for a partitioned table.

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
## @@ -159,7 +159,10 @@ object HoodieSparkSqlWriter {
     // Convert to RDD[HoodieRecord]
     val genericRecords: RDD[GenericRecord] = HoodieSparkUtils.createRdd(df, schema, structName, nameSpace)
-    val shouldCombine = parameters(INSERT_DROP_DUPS_OPT_KEY.key()).toBoolean || operation.equals(WriteOperationType.UPSERT);
+    val shouldCombine = parameters(INSERT_DROP_DUPS_OPT_KEY.key()).toBoolean ||
+      operation.equals(WriteOperationType.UPSERT) ||
+      parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_INSERT_PROP.key(),

Review comment:
My bad, I get it now. If InsertDropDups is set, we automatically set combine.before.insert. But if a user has set just "combine.before.insert", we need to do the preCombine here. I am not sure why this wasn't reported by anyone until now, though.
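For readers following the thread, the operation-selection logic being reviewed can be summarized with a small sketch. This is an illustrative Python transcription of the quoted Scala match expression (case order preserved), not Hudi's actual API; the returned strings stand in for the `*_OPERATION_OPT_VAL` constants:

```python
def pick_operation(is_pk_table, bulk_insert, overwrite, drop_dups, partitioned):
    """Mirror of the (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate)
    match in the patch, with isPartitionedTable as the guard condition."""
    if is_pk_table and bulk_insert:
        raise ValueError("Table with primaryKey can not use bulk insert.")
    if bulk_insert and overwrite and partitioned:
        raise ValueError("Insert Overwrite Partition can not use bulk insert.")
    if bulk_insert and drop_dups:
        raise ValueError("Bulk insert cannot support drop duplication.")
    # bulk insert for insert-overwrite of a non-partitioned table
    if bulk_insert and overwrite and not partitioned:
        return "bulk_insert"
    if overwrite and partitioned:
        return "insert_overwrite"        # overwrite partitions
    if overwrite and not partitioned:
        return "insert_overwrite_table"  # overwrite whole table
    # pk table with no overwrite and no drop-duplicates: upsert
    if is_pk_table and not bulk_insert and not overwrite and not drop_dups:
        return "upsert"
    if not is_pk_table and bulk_insert:
        return "bulk_insert"
    return "insert"                      # default for the remaining cases
```

For example, a plain `INSERT INTO` on a non-pk table with bulk insert enabled resolves to `bulk_insert`, while the same statement on a pk table raises, which is exactly the case the two suggestions above ask to relax.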
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394000#comment-17394000 ]

ASF GitHub Bot commented on HUDI-2208:
--
hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* 123f88802eb116939544de462d3fa372b8eb1684 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1406)
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393942#comment-17393942 ]

ASF GitHub Bot commented on HUDI-2208:
--
hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* b3e8a6d36161d5da60a1429e518253e1bff92a9d Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1402)
* becb7a1eae66a8d9ea6a730e5206da9b4434a50e UNKNOWN
* 37d1608a58c08664f19ca8439162fece22d11e3f UNKNOWN
* 123f88802eb116939544de462d3fa372b8eb1684 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1406)
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393933#comment-17393933 ]

ASF GitHub Bot commented on HUDI-2208:
--
pengzhiwei2018 commented on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-893396472

@vinothchandar @nsivabalan The PR has been updated with the following changes:

1. Allow bulk insert for a pk table. I introduced a config, `hoodie.sql.insert.mode`. If set to "strict", we enforce the pk uniqueness guarantee. If set to "non-strict", we ignore the uniqueness guarantee for the pk table, and bulk insert is supported in that case. By default the value is "non-strict".
2. CTAS uses bulk insert by default.
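The strict/non-strict behaviour described in this comment can be sketched as a simple gate. This is an illustration of the described semantics only (the config name `hoodie.sql.insert.mode` comes from the comment; the actual enforcement in Hudi may be structured differently):

```python
def allow_bulk_insert(is_pk_table, insert_mode="non-strict"):
    """Hypothetical gate: in strict mode a pk table must keep its uniqueness
    guarantee, so bulk insert (which skips the index lookup) is rejected;
    in non-strict mode the guarantee is waived and bulk insert is allowed."""
    if is_pk_table and insert_mode == "strict":
        raise ValueError(
            "Table with primaryKey can not use bulk insert in strict mode; "
            "set hoodie.sql.insert.mode=non-strict to allow it.")
    return True
```

Under the defaults described above, a CTAS into a pk table would therefore pass the gate and use bulk insert.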
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393821#comment-17393821 ]

ASF GitHub Bot commented on HUDI-2208:
--
pengzhiwei2018 commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r683207185

## File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
## @@ -248,6 +248,14 @@ object DataSourceWriteOptions {
     .withDocumentation("When set to true, will perform write operations directly using the spark native " +
       "`Row` representation, avoiding any additional conversion costs.")
+
+  /**
+   * Enable the bulk insert for sql insert statement.
+   */
+  val SQL_ENABLE_BULK_INSERT: ConfigProperty[String] = ConfigProperty

Review comment:
Sounds reasonable. CTAS uses bulk_insert by default, and INSERT INTO uses the regular insert by default.

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")

Review comment:
For CTAS, we can relax this, because no data exists in the target table yet. We can just combine the input by pk before the bulk insert to reach the same goal.

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
## @@ -159,7 +159,10 @@ object HoodieSparkSqlWriter {
     // Convert to RDD[HoodieRecord]
     val genericRecords: RDD[GenericRecord] = HoodieSparkUtils.createRdd(df, schema, structName, nameSpace)
-    val shouldCombine = parameters(INSERT_DROP_DUPS_OPT_KEY.key()).toBoolean || operation.equals(WriteOperationType.UPSERT);
+    val shouldCombine = parameters(INSERT_DROP_DUPS_OPT_KEY.key()).toBoolean ||
+      operation.equals(WriteOperationType.UPSERT) ||
+      parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_INSERT_PROP.key(),

Review comment:
@vinothchandar well, I think INSERT_DROP_DUPS_OPT_KEY is somewhat different from COMBINE_BEFORE_INSERT_PROP. **INSERT_DROP_DUPS_OPT_KEY** is used to drop duplicate records against the target table, while `COMBINE_BEFORE_INSERT_PROP` is used to combine duplicate records within the input. So they are not totally the same config, IMO.

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
+            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert overwrite partition
+        case (_, _, true, _) if isPartitionedTable =>
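The `shouldCombine` change discussed in this exchange reduces to a three-way predicate. A minimal Python sketch follows; the config key strings are my best recollection of the keys behind INSERT_DROP_DUPS_OPT_KEY and COMBINE_BEFORE_INSERT_PROP and should be treated as assumptions:

```python
def should_combine(params, operation):
    """Combine records by key before the write if the user asked to drop
    duplicates, the operation is an upsert, or combine-before-insert was
    set explicitly (the case the patch adds)."""
    drop_dups = params.get("hoodie.datasource.write.insert.drop.duplicates", "false") == "true"
    combine_before_insert = params.get("hoodie.combine.before.insert", "false") == "true"
    return drop_dups or operation == "upsert" or combine_before_insert
```

This captures the point made above: drop-duplicates implies combining, but a user can also request combining on its own, which the original one-clause predicate missed.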
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393781#comment-17393781 ]

ASF GitHub Bot commented on HUDI-2208:
--
hudi-bot edited a comment on pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#issuecomment-884869427

## CI report:

* 9c9f804618dd0275abdae10673c21bf1f5737caf UNKNOWN
* 50539ec543951e7a4442798ac7c66e5dc3d3705a UNKNOWN
* f8b449c31ee8601542f00e3cc15fbcab77da7787 UNKNOWN
* bb9a6d83361f3a652b2c902b1b3dc846de617d93 UNKNOWN
* e88244d233d323364916c4fc240083566ddc4e56 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1272)
* b3e8a6d36161d5da60a1429e518253e1bff92a9d UNKNOWN
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392596#comment-17392596 ]

ASF GitHub Bot commented on HUDI-2208:
--
nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r682137257

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
+            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert overwrite partition
+        case (_, _, true, _) if isPartitionedTable => INSERT_OVERWRITE_OPERATION_OPT_VAL
+        // insert overwrite table
+        case (_, _, true, _) if !isPartitionedTable => INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL
+        // if the table has primaryKey and the dropDuplicate has disable, use the upsert operation
+        case (true, false, false, false) => UPSERT_OPERATION_OPT_VAL
+        // if enableBulkInsert is true and the table is non-primaryKeyed, use the bulk insert operation
+        case (false, true, _, _) => BULK_INSERT_OPERATION_OPT_VAL
+        // for the rest case, use the insert operation
+        case (_, _, _, _) => INSERT_OPERATION_OPT_VAL

Review comment:
Actually, I came across [INSERT OVERWRITE DIRECTORY](https://spark.apache.org/docs/latest/sql-ref-syntax-dml-insert-overwrite-directory.html), which can be mapped to insert_overwrite. Here is a suggestion without using any additional configs:

- CTAS -> bulk_insert
- Insert into -> insert
- INSERT OVERWRITE -> insert overwrite table
- INSERT OVERWRITE DIRECTORY -> insert overwrite (partitions)
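The default mapping proposed in this comment can be written down as a tiny lookup table. This is a sketch of the suggestion only, not the final implementation; the statement-kind keys are illustrative names:

```python
# Proposed defaults per SQL statement kind, with no extra configs
# (per the review suggestion above).
DEFAULT_OPERATION = {
    "ctas": "bulk_insert",
    "insert_into": "insert",
    "insert_overwrite": "insert_overwrite_table",
    "insert_overwrite_directory": "insert_overwrite",
}

def default_operation(statement_kind):
    """Resolve the Hudi write operation for a SQL statement kind."""
    return DEFAULT_OPERATION[statement_kind]
```

The key property is that each statement kind gets a sensible operation without the user having to set `hoodie.sql.enable.bulk.insert`-style flags.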
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392553#comment-17392553 ]

ASF GitHub Bot commented on HUDI-2208:
--
vinothchandar commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r682108786

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")

Review comment:
Insert overwrite partition should be using the `INSERT_OVERWRITE` operation.

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
## @@ -159,7 +159,10 @@ object HoodieSparkSqlWriter {
     // Convert to RDD[HoodieRecord]
     val genericRecords: RDD[GenericRecord] = HoodieSparkUtils.createRdd(df, schema, structName, nameSpace)
-    val shouldCombine = parameters(INSERT_DROP_DUPS_OPT_KEY.key()).toBoolean || operation.equals(WriteOperationType.UPSERT);
+    val shouldCombine = parameters(INSERT_DROP_DUPS_OPT_KEY.key()).toBoolean ||
+      operation.equals(WriteOperationType.UPSERT) ||
+      parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_INSERT_PROP.key(),

Review comment:
@pengzhiwei2018 do you agree with siva's analysis above?

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")

Review comment:
+1. Users might have another Hudi table, e.g. to CTAS from. So if we disallow bulk insert with a pk, there is no good way to do a full bootstrap. Can we relax this?

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
       .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
       .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392548#comment-17392548 ]

ASF GitHub Bot commented on HUDI-2208:
--

nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r682101452

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala

## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
        .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
        .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert overwrite partition
+        case (_, _, true, _) if isPartitionedTable => INSERT_OVERWRITE_OPERATION_OPT_VAL
+        // insert overwrite table
+        case (_, _, true, _) if !isPartitionedTable => INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL
+        // if the table has a primaryKey and dropDuplicate is disabled, use the upsert operation
+        case (true, false, false, false) => UPSERT_OPERATION_OPT_VAL
+        // if enableBulkInsert is true and the table has no primaryKey, use the bulk insert operation
+        case (false, true, _, _) => BULK_INSERT_OPERATION_OPT_VAL
+        // for the remaining cases, use the insert operation
+        case (_, _, _, _) => INSERT_OPERATION_OPT_VAL

Review comment: @vinothchandar: do check this out before reviewing the other feedback.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [SQL] Support Bulk Insert For Spark Sql
> ---
>
> Key: HUDI-2208
> URL: https://issues.apache.org/jira/browse/HUDI-2208
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: pengzhiwei
> Assignee: pengzhiwei
> Priority: Blocker
> Labels: pull-request-available, release-blocker
>
> Support the bulk insert for spark sql

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17391810#comment-17391810 ]

ASF GitHub Bot commented on HUDI-2208:
--

nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r680197209

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala

## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
        .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
        .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")

Review comment: Anyway, we can call out that it is the user's responsibility to ensure uniqueness. Also, IIUC, Hudi can handle duplicates: in case of updates, both records will be updated. But bulk_insert is very performant compared to a regular insert, especially with the row writer, so we should not make it too restrictive to use. I know from the community messages that a lot of users leverage bulk_insert. I would vote to relax this constraint.
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17391377#comment-17391377 ]

ASF GitHub Bot commented on HUDI-2208:
--

pengzhiwei2018 commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r680693753

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala

## @@ -243,6 +256,8 @@ object InsertIntoHoodieTableCommand {
        RECORDKEY_FIELD_OPT_KEY.key -> primaryColumns.mkString(","),
        PARTITIONPATH_FIELD_OPT_KEY.key -> partitionFields,
        PAYLOAD_CLASS_OPT_KEY.key -> payloadClassName,
+       ENABLE_ROW_WRITER_OPT_KEY.key -> enableBulkInsert.toString,
+       HoodieWriteConfig.COMBINE_BEFORE_INSERT_PROP.key -> isPrimaryKeyTable.toString, // if the table has a primaryKey, enable the combine

Review comment: Just like the upsert operation, where Hudi does the combine automatically, we can do this for the user too, which is much friendlier for our users.
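The wiring discussed above, deriving combine-before-insert from the presence of a primary key, can be sketched as a small option-building function. Plain strings stand in for Hudi's config constants here; this is an illustration of the idea, not the actual `InsertIntoHoodieTableCommand` code:

```scala
object CombineBeforeInsertSketch {
  // Plain strings stand in for Hudi's config constants
  // (RECORDKEY_FIELD_OPT_KEY, ENABLE_ROW_WRITER_OPT_KEY, COMBINE_BEFORE_INSERT_PROP).
  def writeOptions(primaryColumns: Seq[String], enableBulkInsert: Boolean): Map[String, String] = {
    val isPrimaryKeyTable = primaryColumns.nonEmpty
    Map(
      "hoodie.datasource.write.recordkey.field" -> primaryColumns.mkString(","),
      "hoodie.datasource.write.row.writer.enable" -> enableBulkInsert.toString,
      // If the table has a primary key, combine input records before insert,
      // mirroring what the upsert path does automatically.
      "hoodie.combine.before.insert" -> isPrimaryKeyTable.toString
    )
  }

  def main(args: Array[String]): Unit = {
    assert(writeOptions(Seq("id"), enableBulkInsert = false)("hoodie.combine.before.insert") == "true")
    assert(writeOptions(Nil, enableBulkInsert = true)("hoodie.combine.before.insert") == "false")
    println("ok")
  }
}
```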
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390829#comment-17390829 ]

ASF GitHub Bot commented on HUDI-2208:
--

nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r680212711

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala

## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
        .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
        .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert overwrite partition
+        case (_, _, true, _) if isPartitionedTable => INSERT_OVERWRITE_OPERATION_OPT_VAL
+        // insert overwrite table
+        case (_, _, true, _) if !isPartitionedTable => INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL
+        // if the table has a primaryKey and dropDuplicate is disabled, use the upsert operation
+        case (true, false, false, false) => UPSERT_OPERATION_OPT_VAL
+        // if enableBulkInsert is true and the table has no primaryKey, use the bulk insert operation
+        case (false, true, _, _) => BULK_INSERT_OPERATION_OPT_VAL
+        // for the remaining cases, use the insert operation
+        case (_, _, _, _) => INSERT_OPERATION_OPT_VAL

Review comment: Here is my thought on choosing the right operation. Having too many case statements might complicate things and is error-prone. As I mentioned earlier, we should try to do any valid conversions in HoodieSparkSqlWriter; only what is applicable just to sql dml should be kept here. Anyway, here is one simplified approach, ignoring the primary/non-primary-key table distinction for now. We can come back to that once we have consensus on this. We need just two configs:

hoodie.sql.enable.bulk_insert (default false)
hoodie.sql.overwrite.entire.table (default true)

From sql syntax, two commands are allowed: "INSERT" and "INSERT OVERWRITE". These need to map to four operations on the Hudi end (insert, bulk_insert, insert_overwrite, and insert_overwrite_table):

"INSERT" with no other configs set -> insert operation
"INSERT" with enable bulk insert set -> bulk_insert
"INSERT OVERWRITE" with no other configs set -> insert_overwrite_table operation
"INSERT OVERWRITE" with hoodie.sql.overwrite.entire.table = false -> insert_overwrite operation
"INSERT OVERWRITE" with enable bulk_insert set -> bulk_insert; pass the right save mode to HoodieSparkSqlWriter
"INSERT OVERWRITE" with enable bulk_insert set and hoodie.sql.overwrite.entire.table = false -> bulk_insert; pass the right save mode to HoodieSparkSqlWriter
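The simplified mapping proposed in this comment can be sketched as a small decision function. The two config keys below are the hypothetical ones from the proposal, not Hudi's shipped configs, and the returned strings merely name the operations:

```scala
object SqlOperationMappingSketch {
  // Hypothetical config keys from the proposal above (not actual Hudi keys).
  val EnableBulkInsert = "hoodie.sql.enable.bulk_insert"        // default false
  val OverwriteEntireTable = "hoodie.sql.overwrite.entire.table" // default true

  // Maps the two SQL commands (INSERT / INSERT OVERWRITE) plus the two
  // proposed configs onto the four Hudi write operations.
  def chooseOperation(isOverwrite: Boolean, params: Map[String, String]): String = {
    val bulkInsert = params.getOrElse(EnableBulkInsert, "false").toBoolean
    val overwriteTable = params.getOrElse(OverwriteEntireTable, "true").toBoolean
    (isOverwrite, bulkInsert) match {
      // bulk_insert wins either way; the save mode is passed downstream.
      case (_, true)      => "bulk_insert"
      case (true, false)  => if (overwriteTable) "insert_overwrite_table" else "insert_overwrite"
      case (false, false) => "insert"
    }
  }

  def main(args: Array[String]): Unit = {
    assert(chooseOperation(isOverwrite = false, Map.empty) == "insert")
    assert(chooseOperation(isOverwrite = false, Map(EnableBulkInsert -> "true")) == "bulk_insert")
    assert(chooseOperation(isOverwrite = true, Map.empty) == "insert_overwrite_table")
    assert(chooseOperation(isOverwrite = true, Map(OverwriteEntireTable -> "false")) == "insert_overwrite")
    println("ok")
  }
}
```

Collapsing the decision to two inputs plus two configs is exactly the simplification the comment argues for: fewer case arms, and anything key-related handled elsewhere.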
[jira] [Commented] (HUDI-2208) [SQL] Support Bulk Insert For Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390783#comment-17390783 ]

ASF GitHub Bot commented on HUDI-2208:
--

nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r680186620

## File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala

## @@ -248,6 +248,14 @@ object DataSourceWriteOptions {
    .withDocumentation("When set to true, will perform write operations directly using the spark native " +
      "`Row` representation, avoiding any additional conversion costs.")

+  /**
+   * Enable the bulk insert for sql insert statement.
+   */
+  val SQL_ENABLE_BULK_INSERT: ConfigProperty[String] = ConfigProperty

Review comment: @vinothchandar: In sql, we don't have two separate commands like INSERT INTO and BULK_INSERT INTO, so I guess we are going this route. But by default CTAS chooses the INSERT operation. I am thinking users may not end up using bulk_insert, since they have to set the property explicitly. Any thoughts? There are two things to discuss: 1. which operation to use with CTAS; 2. which operation to use with INSERT INTO. The state as of now is "insert", and the user has to explicitly set the operation type to bulk_insert before calling either of these commands.
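The `SQL_ENABLE_BULK_INSERT` declaration quoted in the diff above is truncated; the shape of such a declaration can be sketched with a minimal stand-in for Hudi's `ConfigProperty` builder. The stub class, the key name, and the documentation string here are illustrative assumptions, not the real `DataSourceOptions.scala` code:

```scala
object ConfigPropertySketch {
  // Minimal stand-in for Hudi's ConfigProperty builder (not the real class).
  case class ConfigProperty[T](key: String, defaultValue: T, doc: String = "") {
    def withDocumentation(d: String): ConfigProperty[T] = copy(doc = d)
  }
  object ConfigProperty {
    case class KeyBuilder(k: String) {
      def defaultValue[T](v: T): ConfigProperty[T] = ConfigProperty(k, v)
    }
    def key(k: String): KeyBuilder = KeyBuilder(k)
  }

  // Illustrative declaration; key name and wording are assumptions.
  val SQL_ENABLE_BULK_INSERT: ConfigProperty[String] = ConfigProperty
    .key("hoodie.sql.bulk.insert.enable")
    .defaultValue("false")
    .withDocumentation("When set to true, the sql insert statement will use bulk insert.")

  def main(args: Array[String]): Unit = {
    assert(!SQL_ENABLE_BULK_INSERT.defaultValue.toBoolean)
    assert(SQL_ENABLE_BULK_INSERT.key == "hoodie.sql.bulk.insert.enable")
    println("ok")
  }
}
```

The builder pattern (key, then default, then documentation) matches how the surrounding options in the diff are declared.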
## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala

## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
        .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
        .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert overwrite partition
+        case (_, _, true, _) if isPartitionedTable => INSERT_OVERWRITE_OPERATION_OPT_VAL
+        // insert overwrite table
+        case (_, _, true, _) if !isPartitionedTable => INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL

Review comment: HoodieSparkSqlWriter will handle this save mode.

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala

## @@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
        .getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
        .toBoolean
-    val operation = if (isOverwrite) {
-      if (table.partitionColumnNames.nonEmpty) {
-        INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
-      } else {
-        INSERT_OPERATION_OPT_VAL
+    val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+      DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+    val isPartitionedTable = table.partitionColumnNames.nonEmpty
+    val isPrimaryKeyTable = primaryColumns.nonEmpty
+    val operation =
+      (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+        case (true, true, _, _) =>
+          throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+        case (_, true, true, _) if isPartitionedTable =>
+          throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+        case (_, true, _, true) =>
+          throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
            s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+        // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+        case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+        // insert