[jira] [Updated] (HUDI-1297) [Umbrella] Spark Datasource Support
[ https://issues.apache.org/jira/browse/HUDI-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-1297:
-----------------------------
    Priority: Blocker  (was: Critical)

> [Umbrella] Spark Datasource Support
>
>                 Key: HUDI-1297
>                 URL: https://issues.apache.org/jira/browse/HUDI-1297
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: spark
>    Affects Versions: 0.9.0
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Blocker
>              Labels: hudi-umbrellas
>             Fix For: 0.11.0
>
> Yet to be fully scoped out, but at a high level we want:
> * First-class support for streaming reads/writes via Structured Streaming
> * Row-based readers/writers all the way
> * Support for file/partition pruning using Hudi metadata tables

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[GitHub] [hudi] xushiyan commented on issue #4411: [SUPPORT] - Presto Querying Issue in AWS EMR 6.3.1
xushiyan commented on issue #4411: URL: https://github.com/apache/hudi/issues/4411#issuecomment-1017200245

@rajgowtham24 I think you need `NonpartitionedKeyGenerator` instead for a non-partitioned table. `default/` is created when the complex key generator fails to extract the partition path properly. With `default/` in the directory, your table is unexpectedly "partitioned" now, while your `hive_sync.partition_extractor_class` is set to `NonPartitionedExtractor`; that mismatch is probably causing the read issue.

> @xushiyan - when we are using ComplexKeyGenerator in 0.5.0, by default Hudi is creating the data under the default partition (the same happens with 0.8.0 as well). So to query the table through Presto 0.230 we have to modify the table location to s3://bucket_name/test/table_name/default.
>
> For a partitioned table, we are able to query the table from Presto 0.230 without updating the table location.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
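For a non-partitioned table, the key generator and the Hive sync partition extractor need to agree, as the comment above explains. A sketch of the consistent option pairing (option keys as used by the Hudi datasource; double-check against your Hudi version's configuration docs):

```properties
# Non-partitioned table: the key generator, the (empty) partition path field,
# and the hive-sync extractor must all be the non-partitioned variants.
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.write.partitionpath.field=
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
```

With this pairing no `default/` directory should be created, so the table location can stay at the base path rather than being pointed at a `default/` subdirectory.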
[GitHub] [hudi] hudi-bot removed a comment on pull request #4647: [HUDI-3283] Bootstrap support overwrite existing table
hudi-bot removed a comment on pull request #4647: URL: https://github.com/apache/hudi/pull/4647#issuecomment-1017198356 ## CI report: * 6e9b2bfa50f349bc68e32f890a20df2da4b9f708 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4647: [HUDI-3283] Bootstrap support overwrite existing table
hudi-bot commented on pull request #4647: URL: https://github.com/apache/hudi/pull/4647#issuecomment-1017200201 ## CI report: * 6e9b2bfa50f349bc68e32f890a20df2da4b9f708 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5364) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] BruceKellan opened a new issue #4648: [SUPPORT] Upgrade Hudi to 0.10.1 from 0.10.0 using spark
BruceKellan opened a new issue #4648: URL: https://github.com/apache/hudi/issues/4648

**Describe the problem you faced**

I am using Spark 3 + Hudi 0.10.0. When I upgrade Hudi to 0.10.1-rc2, I get this:

java.io.InvalidClassException: org.apache.hudi.common.table.timeline.HoodieActiveTimeline; local class incompatible: stream classdesc serialVersionUID = -1280891512509140081, local class serialVersionUID = 1642514781003501811
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1941)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1807)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2098)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1624)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2343)
    at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:594)
    at org.apache.hudi.common.table.HoodieTableMetaClient.readObject(HoodieTableMetaClient.java:151)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2234)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2125)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1624)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2343)

**Environment Description**
* Hudi version : 0.10.1-rc2 using b670801afc110870f354766c872f386d18261add
* Spark version : 3.1.2
* Hadoop version : 2.8.5
* Storage (HDFS/S3/GCS..) : Aliyun OSS
* Running on Docker? (yes/no) : no
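The InvalidClassException above is Java's standard serialization compatibility check: the serialVersionUID baked into previously serialized state no longer matches the class on the new classpath. A minimal, self-contained sketch of the mechanism (the class, field, and UID value here are illustrative, not Hudi's actual code):

```java
import java.io.*;

public class SerialVersionDemo {
    // A class with an explicit serialVersionUID. As long as this value (and a
    // compatible field layout) is kept across releases, old serialized bytes
    // remain readable. The UID here is an arbitrary illustrative value.
    static class Timeline implements Serializable {
        private static final long serialVersionUID = 1L;
        final String lastInstant;
        Timeline(String lastInstant) { this.lastInstant = lastInstant; }
    }

    // Serialize a Timeline and read it straight back.
    static String roundTrip(String instant) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new Timeline(instant));
        }
        // This succeeds because the stream's serialVersionUID matches the local
        // class. If the class had no stable serialVersionUID and changed between
        // versions (as between two Hudi releases), the UIDs would differ and
        // readObject() would throw java.io.InvalidClassException, like the one
        // in the report above.
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return ((Timeline) ois.readObject()).lastInstant;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("20220119110215185"));
    }
}
```

In practice this means old serialized state (e.g. a long-running job's checkpointed objects) from 0.10.0 cannot be deserialized by the 0.10.1 class whose computed UID changed.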
[GitHub] [hudi] hudi-bot commented on pull request #4647: [HUDI-3283] Bootstrap support overwrite existing table
hudi-bot commented on pull request #4647: URL: https://github.com/apache/hudi/pull/4647#issuecomment-1017198356 ## CI report: * 6e9b2bfa50f349bc68e32f890a20df2da4b9f708 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-3283) Bootstrap support overwrite existing table
[ https://issues.apache.org/jira/browse/HUDI-3283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-3283: - Labels: pull-request-available (was: ) > Bootstrap support overwrite existing table > -- > > Key: HUDI-3283 > URL: https://issues.apache.org/jira/browse/HUDI-3283 > Project: Apache Hudi > Issue Type: Task > Components: bootstrap >Reporter: Xianghu Wang >Assignee: Xianghu Wang >Priority: Major > Labels: pull-request-available
[GitHub] [hudi] wangxianghu opened a new pull request #4647: [HUDI-3283] Bootstrap support overwrite existing table
wangxianghu opened a new pull request #4647: URL: https://github.com/apache/hudi/pull/4647 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] dongkelun commented on a change in pull request #4083: [HUDI-2837] Add support for using database name in incremental query
dongkelun commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r788447754 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java ## @@ -76,6 +76,11 @@ public static final String HOODIE_PROPERTIES_FILE = "hoodie.properties"; public static final String HOODIE_PROPERTIES_FILE_BACKUP = "hoodie.properties.backup"; + public static final ConfigProperty DATABASE_NAME = ConfigProperty + .key("hoodie.database.name") + .noDefaultValue() + .withDocumentation("Database name that will be used for incremental query."); Review comment: Although Spark SQL will use it, at present it is only used for incremental queries. So do we need to change this here?
[jira] [Assigned] (HUDI-3283) Bootstrap support overwrite existing table
[ https://issues.apache.org/jira/browse/HUDI-3283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianghu Wang reassigned HUDI-3283: -- Assignee: Xianghu Wang > Bootstrap support overwrite existing table > -- > > Key: HUDI-3283 > URL: https://issues.apache.org/jira/browse/HUDI-3283 > Project: Apache Hudi > Issue Type: Task > Components: bootstrap >Reporter: Xianghu Wang >Assignee: Xianghu Wang >Priority: Major
[jira] [Created] (HUDI-3283) Bootstrap support overwrite existing table
Xianghu Wang created HUDI-3283: -- Summary: Bootstrap support overwrite existing table Key: HUDI-3283 URL: https://issues.apache.org/jira/browse/HUDI-3283 Project: Apache Hudi Issue Type: Task Components: bootstrap Reporter: Xianghu Wang
[GitHub] [hudi] xushiyan commented on a change in pull request #4083: [HUDI-2837] Add support for using database name in incremental query
xushiyan commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r788428757 ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/InputPathHandler.java ## @@ -53,25 +57,27 @@ public static final Logger LOG = LogManager.getLogger(InputPathHandler.class); private final Configuration conf; - // tablename to metadata mapping for all Hoodie tables(both incremental & snapshot) + // tableName to metadata mapping for all Hoodie tables(both incremental & snapshot) private final Map tableMetaClientMap; private final Map> groupedIncrementalPaths; private final List snapshotPaths; private final List nonHoodieInputPaths; + private boolean isIncrementalUseDatabase; - public InputPathHandler(Configuration conf, Path[] inputPaths, List incrementalTables) throws IOException { + public InputPathHandler(Configuration conf, Path[] inputPaths, List incrementalTables, JobConf job) throws IOException { this.conf = conf; tableMetaClientMap = new HashMap<>(); snapshotPaths = new ArrayList<>(); nonHoodieInputPaths = new ArrayList<>(); groupedIncrementalPaths = new HashMap<>(); +this.isIncrementalUseDatabase = HoodieHiveUtils.isIncrementalUseDatabase(Job.getInstance(job)); Review comment: `JobConf` is passed in just to compute a boolean config? can we somehow make the config extracted from `Configuration conf` ? 
always prefer to have less args ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/InputPathHandler.java ## @@ -117,9 +124,11 @@ private void parseInputPaths(Path[] inputPaths, List incrementalTables) } } - private void tagAsIncrementalOrSnapshot(Path inputPath, String tableName, - HoodieTableMetaClient metaClient, List incrementalTables) { -if (!incrementalTables.contains(tableName)) { + private void tagAsIncrementalOrSnapshot(Path inputPath, HoodieTableMetaClient metaClient, List incrementalTables) { +String databaseName = metaClient.getTableConfig().getDatabaseName(); +String tableName = metaClient.getTableConfig().getTableName(); +if ((isIncrementalUseDatabase && !StringUtils.isNullOrEmpty(databaseName) && !incrementalTables.contains(databaseName + "." + tableName)) +|| (!(isIncrementalUseDatabase && !StringUtils.isNullOrEmpty(databaseName)) && !incrementalTables.contains(tableName))) { Review comment: this condition check is pretty hard to read.. can we improve this? ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/InputPathHandler.java ## @@ -95,19 +101,20 @@ private void parseInputPaths(Path[] inputPaths, List incrementalTables) // We already know the base path for this inputPath. basePathKnown = true; // Check if this is for a snapshot query - String tableName = metaClient.getTableConfig().getTableName(); - tagAsIncrementalOrSnapshot(inputPath, tableName, metaClient, incrementalTables); + tagAsIncrementalOrSnapshot(inputPath, metaClient, incrementalTables); break; } } if (!basePathKnown) { -// This path is for a table that we dont know about yet. +// This path is for a table that we don't know about yet. 
HoodieTableMetaClient metaClient; try { metaClient = getTableMetaClientForBasePath(inputPath.getFileSystem(conf), inputPath); + String databaseName = metaClient.getTableConfig().getDatabaseName(); String tableName = metaClient.getTableConfig().getTableName(); - tableMetaClientMap.put(tableName, metaClient); - tagAsIncrementalOrSnapshot(inputPath, tableName, metaClient, incrementalTables); + tableMetaClientMap.put(isIncrementalUseDatabase && !StringUtils.isNullOrEmpty(databaseName) + ? databaseName + "." + tableName : tableName, metaClient); Review comment: can we move the table name creation logic into a helper method? ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java ## @@ -76,6 +76,11 @@ public static final String HOODIE_PROPERTIES_FILE = "hoodie.properties"; public static final String HOODIE_PROPERTIES_FILE_BACKUP = "hoodie.properties.backup"; + public static final ConfigProperty DATABASE_NAME = ConfigProperty + .key("hoodie.database.name") + .noDefaultValue() + .withDocumentation("Database name that will be used for incremental query."); Review comment: this is not just for incremental query now, right? spark sql also uses it. Better to add more info here to explain the use cases of this config, which will show up in the website for users to understand.
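One way to address the readability comments above is to centralize the database-qualified name logic in a single helper, so both the `tableMetaClientMap` key and the incremental/snapshot check read as one call. A minimal sketch, not the actual patch (`resolveTableKey` is a hypothetical helper name):

```java
public class IncrementalTableKey {
    // Hypothetical helper: prefix the table name with the database only when
    // the "use database" flag is on AND a database name is actually set.
    public static String resolveTableKey(boolean isIncrementalUseDatabase,
                                         String databaseName, String tableName) {
        boolean useDb = isIncrementalUseDatabase
                && databaseName != null && !databaseName.isEmpty();
        return useDb ? databaseName + "." + tableName : tableName;
    }

    public static void main(String[] args) {
        System.out.println(resolveTableKey(true, "test_hudi", "t1"));  // test_hudi.t1
        System.out.println(resolveTableKey(false, "test_hudi", "t1")); // t1
        System.out.println(resolveTableKey(true, "", "t1"));           // t1
    }
}
```

The hard-to-read condition in `tagAsIncrementalOrSnapshot` then collapses to `!incrementalTables.contains(resolveTableKey(...))`, and the map key and tag logic cannot drift apart.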
[GitHub] [hudi] hudi-bot removed a comment on pull request #3745: [HUDI-2514] Add default hiveTableSerdeProperties for Spark SQL when sync Hive
hudi-bot removed a comment on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1017147751 ## CI report: * 1cd9e56e69b15edac5a12afdb9626c853eb6e83f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5343) * 942072ece4f6fc46251f922eabe4f6bdac767410 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5362) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #3745: [HUDI-2514] Add default hiveTableSerdeProperties for Spark SQL when sync Hive
hudi-bot commented on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1017178601 ## CI report: * 942072ece4f6fc46251f922eabe4f6bdac767410 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5362) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-2837) The original hoodie.table.name should be maintained in Spark SQL
[ https://issues.apache.org/jira/browse/HUDI-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2837:
-----------------------------
    Epic Link: HUDI-1658

> The original hoodie.table.name should be maintained in Spark SQL
>
>                 Key: HUDI-2837
>                 URL: https://issues.apache.org/jira/browse/HUDI-2837
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: spark
>            Reporter: 董可伦
>            Assignee: 董可伦
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When querying Hudi incrementally in Hive, we set the start query time of the table. This setting applies to every table with that name, not only the table in the current database. In practice, table names cannot be guaranteed to be unique across databases, so this can be handled by setting hoodie.table.name to database name + table name. However, at present the original value of hoodie.table.name is not kept consistent in Spark SQL.
[jira] [Updated] (HUDI-2837) The original hoodie.table.name should be maintained in Spark SQL
[ https://issues.apache.org/jira/browse/HUDI-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2837:
-----------------------------
    Reviewers: Raymond Xu, Yann Byron  (was: Raymond Xu, Yann Byron)

> The original hoodie.table.name should be maintained in Spark SQL
>
>                 Key: HUDI-2837
>                 URL: https://issues.apache.org/jira/browse/HUDI-2837
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: spark
>            Reporter: 董可伦
>            Assignee: 董可伦
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When querying Hudi incrementally in Hive, we set the start query time of the table. This setting applies to every table with that name, not only the table in the current database. In practice, table names cannot be guaranteed to be unique across databases, so this can be handled by setting hoodie.table.name to database name + table name. However, at present the original value of hoodie.table.name is not kept consistent in Spark SQL.
[jira] [Updated] (HUDI-3282) Fix delete exception for Spark SQL when sync Hive
[ https://issues.apache.org/jira/browse/HUDI-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

董可伦 updated HUDI-3282:
----------------------
    Description:

```
hudi 0.11.0 master build
spark: 2.4.5
```

```bash
hive
create database test_hudi;
```

```scala
spark-shell --master yarn --deploy-mode client --executor-memory 2G --num-executors 3 --executor-cores 2 --driver-memory 4G --driver-cores 2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --principal .. --keytab ..

import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.QuickstartUtils.{DataGenerator, convertToStringList, getQuickstartWriteConfigs}
import org.apache.hudi.config.HoodieWriteConfig.TBL_NAME
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.keygen.SimpleKeyGenerator
import org.apache.hudi.common.model.{DefaultHoodieRecordPayload, HoodiePayloadProps}
import org.apache.hudi.io.HoodieMergeHandle
import org.apache.hudi.common.table.HoodieTableConfig
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, "a1", 10, 1000, "2022-01-19")).toDF("id", "name", "value", "ts", "dt")
df.write.format("hudi").
  option(HoodieWriteConfig.TBL_NAME.key, "test_hudi_table_sync_hive").
  option(TABLE_TYPE.key, COW_TABLE_TYPE_OPT_VAL).
  option(RECORDKEY_FIELD.key, "id").
  option(PRECOMBINE_FIELD.key, "ts").
  option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
  option("hoodie.datasource.write.partitionpath.field", "").
  option("hoodie.metadata.enable", false).
  option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator").
  option(META_SYNC_ENABLED.key(), true).
  option(HIVE_USE_JDBC.key(), false).
  option(HIVE_DATABASE.key(), "test_hudi").
  option(HIVE_AUTO_CREATE_DATABASE.key(), true).
  option(HIVE_TABLE.key(), "test_hudi_table_sync_hive").
  option(HIVE_PARTITION_EXTRACTOR_CLASS.key(), "org.apache.hudi.hive.MultiPartKeysValueExtractor").
  mode("overwrite").
  save("/test_hudi/test_hudi_table_sync_hive")
```

```
# hoodie.properties
hoodie.table.precombine.field=ts
hoodie.table.partition.fields=
hoodie.table.type=COPY_ON_WRITE
hoodie.archivelog.folder=archived
hoodie.populate.meta.fields=true
hoodie.timeline.layout.version=1
hoodie.table.version=3
hoodie.table.recordkey.fields=id
hoodie.table.base.file.format=PARQUET
hoodie.table.timeline.timezone=LOCAL
hoodie.table.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.table.name=test_hudi_table_sync_hive
hoodie.datasource.write.hive_style_partitioning=false
```

hive:
```sql
show create table test_hudi_table_sync_hive;
```

```
| createtab_stmt |
| CREATE EXTERNAL TABLE `test_hudi_table_sync_hive`( |
| `_hoodie_commit_time` string, |
| `_hoodie_commit_seqno` string, |
| `_hoodie_record_key` string, |
| `_hoodie_partition_path` string, |
| `_hoodie_file_name` string, |
| `id` int, |
| `name` string, |
| `value` int, |
| `ts` int, |
| `dt` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' |
| WITH SERDEPROPERTIES ( |
| 'hoodie.query.as.ro.table'='false', |
| 'path'='/test_hudi/test_hudi_table_sync_hive') |
| STORED AS INPUTFORMAT |
| 'org.apache.hudi.hadoop.HoodieParquetInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
| LOCATION |
| 'hdfs://cluster1/test_hudi/test_hudi_table_sync_hive' |
| TBLPROPERTIES ( |
| 'last_commit_time_sync'='20220119110215185', |
| 'spark.sql.sources.provider'='hudi', |
| 'spark.sql.sources.schema.numParts'='1', |
| 'spark.sql.sources.schema.part.0'='\{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},\
[jira] [Updated] (HUDI-3282) Fix delete exception for Spark SQL when sync Hive
[ https://issues.apache.org/jira/browse/HUDI-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 董可伦 updated HUDI-3282: -- Description: {{hudi 0.11.0 master build spark: 2.4.5}} hive create database test_hudi; spark-shell --master yarn --deploy-mode client --executor-memory 2G --num-executors 3 --executor-cores 2 --driver-memory 4G --driver-cores 2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --principal .. --keytab .. import org.apache.hudi.DataSourceWriteOptions._ import org.apache.hudi.QuickstartUtils.{DataGenerator, convertToStringList, getQuickstartWriteConfigs} import org.apache.hudi.config.HoodieWriteConfig.TBL_NAME import org.apache.spark.sql.SaveMode._ import org.apache.spark.sql.{SaveMode, SparkSession} import org.apache.spark.sql.functions.lit import org.apache.hudi.DataSourceReadOptions._ import org.apache.hudi.config.HoodieWriteConfig import org.apache.hudi.keygen.SimpleKeyGenerator import org.apache.hudi.common.model.{DefaultHoodieRecordPayload, HoodiePayloadProps} import org.apache.hudi.io.HoodieMergeHandle import org.apache.hudi.common.table.HoodieTableConfig import org.apache.spark.sql.functions._ import spark.implicits._ val df = Seq((1, "a1", 10, 1000, "2022-01-19")).toDF("id", "name", "value", "ts", "dt") df.write.format("hudi"). option(HoodieWriteConfig.TBL_NAME.key, "test_hudi_table_sync_hive"). option(TABLE_TYPE.key, COW_TABLE_TYPE_OPT_VAL). option(RECORDKEY_FIELD.key, "id"). option(PRECOMBINE_FIELD.key, "ts"). option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.NonpartitionedKeyGenerator"). option("hoodie.datasource.write.partitionpath.field", ""). option("hoodie.metadata.enable", false). option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator"). option(META_SYNC_ENABLED.key(), true). option(HIVE_USE_JDBC.key(), false). option(HIVE_DATABASE.key(), "test_hudi"). 
option(HIVE_AUTO_CREATE_DATABASE.key(), true). option(HIVE_TABLE.key(), "test_hudi_table_sync_hive"). option(HIVE_PARTITION_EXTRACTOR_CLASS.key(), "org.apache.hudi.hive.MultiPartKeysValueExtractor"). mode("overwrite"). save("/test_hudi/test_hudi_table_sync_hive") {{# hoodie.properties hoodie.table.precombine.field=ts hoodie.table.partition.fields= hoodie.table.type=COPY_ON_WRITE hoodie.archivelog.folder=archived hoodie.populate.meta.fields=true hoodie.timeline.layout.version=1 hoodie.table.version=3 hoodie.table.recordkey.fields=id hoodie.table.base.file.format=PARQUET hoodie.table.timeline.timezone=LOCAL hoodie.table.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator hoodie.table.name=test_hudi_table_sync_hive hoodie.datasource.write.hive_style_partitioning=false}} hive show create table test_hudi_table_sync_hive; ++ | createtab_stmt | ++ | CREATE EXTERNAL TABLE `test_hudi_table_sync_hive`( | | `_hoodie_commit_time` string, | | `_hoodie_commit_seqno` string, | | `_hoodie_record_key` string, | | `_hoodie_partition_path` string, | | `_hoodie_file_name` string, | | `id` int, | | `name` string, | | `value` int, | | `ts` int, | | `dt` string) | | ROW FORMAT SERDE | | 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' | | WITH SERDEPROPERTIES ( | | 'hoodie.query.as.ro.table'='false', | | 'path'='/test_hudi/test_hudi_table_sync_hive') | | STORED AS INPUTFORMAT | | 'org.apache.hudi.hadoop.HoodieParquetInputFormat' | | OUTPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' | | LOCATION | | 'hdfs://cluster1/test_hudi/test_hudi_table_sync_hive' | | TBLPROPERTIES ( | | 'last_commit_time_sync'='20220119110215185', | | 'spark.sql.sources.provider'='hudi', | | 'spark.sql.sources.schema.numParts'='1', | | 
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[\{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},\{"name":"id","type":"integer","nullable":false,"metadata":{}},\{"name":"name","type":"string","nullable":true,"metadata":{}},\{"name":"value","type":"integer","nullable":false,"metadata":{}},\{"name":"ts","type":"integer","nullable":false,"metadata":{}},\{"name":"dt","type":"string","nullable":true,"metadata":{}}]}', | | 'transient_lastDdlTime'='1642561355') | ++ 28 rows selected (0.429 seconds) spark-sql --master yarn --deploy-mode client --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spar
[jira] [Updated] (HUDI-3088) Make Spark 3 the default profile for build and test
[ https://issues.apache.org/jira/browse/HUDI-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3088: - Story Points: 0.5 > Make Spark 3 the default profile for build and test > --- > > Key: HUDI-3088 > URL: https://issues.apache.org/jira/browse/HUDI-3088 > Project: Apache Hudi > Issue Type: Improvement > Components: spark >Reporter: Raymond Xu >Assignee: Yann Byron >Priority: Blocker > Fix For: 0.11.0 > > > By default, when people check out the code, they should have activated spark > 3 for the repo. Also all tests should be running against the latest supported > spark version. Correspondingly the default scala version becomes 2.12 and the > default parquet version 1.12.
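Making a profile the default in Maven is done via `activeByDefault`. A hedged sketch of what that looks like in a `pom.xml` (the profile id `spark3` and the exact property names/versions are illustrative; Hudi's actual pom may differ, beyond the Scala 2.12 and Parquet 1.12 defaults the ticket names):

```xml
<!-- Illustrative pom.xml fragment: the spark3 profile activates by default
     unless another profile is explicitly selected with -P. -->
<profiles>
  <profile>
    <id>spark3</id>
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <properties>
      <scala.binary.version>2.12</scala.binary.version>
      <parquet.version>1.12.2</parquet.version>
    </properties>
  </profile>
</profiles>
```

Note that `activeByDefault` is deactivated whenever any other profile in the same pom is activated on the command line, which is the usual mechanism for letting `-P`/`-D` flags override the Spark 3 default.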
[jira] [Updated] (HUDI-3161) Add Call Produce Command for spark sql
[ https://issues.apache.org/jira/browse/HUDI-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-3161:
-----------------------------
    Epic Link: HUDI-1658

> Add Call Produce Command for spark sql
>
>                 Key: HUDI-3161
>                 URL: https://issues.apache.org/jira/browse/HUDI-3161
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: spark-sql
>            Reporter: Forward Xu
>            Assignee: Forward Xu
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
> example
> {code:java}
> # Produce1
> call show_commits_metadata(table => 'test_hudi_table');
> commit_time action partition file_id previous_commit num_writes num_inserts num_deletes num_update_writes total_errors total_log_blocks total_corrupt_logblocks total_rollback_blocks total_log_records total_updated_records_compacted total_bytes_written
> 20220109225319449 commit dt=2021-05-03 d0073a12-085d-4f49-83e9-402947e7e90a-0 null 1 1 0 0 0 0 0 0 0 0 435349
> 20220109225311742 commit dt=2021-05-02 b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0 20220109214830592 1 1 0 0 0 0 0 0 0 0 435340
> 20220109225301429 commit dt=2021-05-01 0d7298b3-6b55-4cff-8d7d-b0772358b78a-0 20220109214830592 1 1 0 0 0 0 0 0 0 0 435340
> 20220109214830592 commit dt=2021-05-01 0d7298b3-6b55-4cff-8d7d-b0772358b78a-0 20220109191631015 0 0 1 0 0 0 0 0 0 0 432653
> 20220109214830592 commit dt=2021-05-02 b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0 20220109191648181 0 0 1 0 0 0 0 0 0 0 432653
> 20220109191648181 commit dt=2021-05-02 b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0 null 1 1 0 0 0 0 0 0 0 0 435341
> 20220109191631015 commit dt=2021-05-01 0d7298b3-6b55-4cff-8d7d-b0772358b78a-0 null 1 1 0 0 0 0 0 0 0 0 435341
> Time taken: 0.844 seconds, Fetched 7 row(s)
> # Produce2
> call rollback_to_instant(table => 'test_hudi_table', instant_time => '20220109225319449');
> rollback_result
> true
> Time taken: 5.038 seconds, Fetched 1 row(s)
> {code}
[GitHub] [hudi] hudi-bot commented on pull request #4646: [HUDI-3250] Upgrade Presto docker image
hudi-bot commented on pull request #4646: URL: https://github.com/apache/hudi/pull/4646#issuecomment-1017157186 ## CI report: * 097d027ebdd6cdb3246dc6dbe4aba8f00122e5e6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5363) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4646: [HUDI-3250] Upgrade Presto docker image
hudi-bot commented on pull request #4646: URL: https://github.com/apache/hudi/pull/4646#issuecomment-1017155666 ## CI report: * 097d027ebdd6cdb3246dc6dbe4aba8f00122e5e6 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] dongkelun commented on pull request #3745: [HUDI-2514] Add default hiveTableSerdeProperties for Spark SQL when sync Hive
dongkelun commented on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1017155450 To avoid confusion and ambiguity, I have reverted the branch to the historical commit [071b13b](https://github.com/apache/hudi/commit/071b13bfe3ebee1875d12432e550c8718566bfd4) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
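The comment above describes rolling a PR branch back to a historical commit. A minimal sketch of that git workflow, using a throwaway sandbox repository rather than the actual Hudi branch:

```shell
# Build a scratch repo with two commits, then move the branch back to the first.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git -c user.email=demo@example.com -c user.name=demo commit -q --allow-empty -m "first"
first=$(git rev-parse HEAD)
git -c user.email=demo@example.com -c user.name=demo commit -q --allow-empty -m "second"

# Discard "second" by resetting the branch pointer to the historical commit.
git reset --hard -q "$first"
```

On a shared PR branch this also requires rewriting the remote history afterwards (e.g. `git push --force-with-lease`), which is why the author flags it explicitly to avoid confusion for reviewers.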
[jira] [Updated] (HUDI-3250) Upgrade Presto version in docker setup and integ test
[ https://issues.apache.org/jira/browse/HUDI-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-3250: - Labels: pull-request-available (was: ) > Upgrade Presto version in docker setup and integ test > - > > Key: HUDI-3250 > URL: https://issues.apache.org/jira/browse/HUDI-3250 > Project: Apache Hudi > Issue Type: Test > Components: trino-presto >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] codope opened a new pull request #4646: [HUDI-3250] Upgrade Presto docker image
codope opened a new pull request #4646: URL: https://github.com/apache/hudi/pull/4646 ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #3745: [HUDI-2514] Add default hiveTableSerdeProperties for Spark SQL when sync Hive
hudi-bot commented on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1017147751 ## CI report: * 1cd9e56e69b15edac5a12afdb9626c853eb6e83f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5343) * 942072ece4f6fc46251f922eabe4f6bdac767410 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5362) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #3745: [HUDI-2514] Add default hiveTableSerdeProperties for Spark SQL when sync Hive
hudi-bot commented on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1017146397 ## CI report: * 1cd9e56e69b15edac5a12afdb9626c853eb6e83f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5343) * 942072ece4f6fc46251f922eabe4f6bdac767410 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #3745: [HUDI-2514] Add default hiveTableSerdeProperties for Spark SQL when sync Hive
hudi-bot removed a comment on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1016217784 ## CI report: * 1cd9e56e69b15edac5a12afdb9626c853eb6e83f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5343) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated (a08a2b7 -> 31b57a2)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git.

from a08a2b7 [MINOR] Add instructions to build and upload Docker Demo images (#4612)
add  31b57a2 [HUDI-3236] use fields'comments persisted in catalog to fill in schema (#4587)

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/catalog/HoodieCatalogTable.scala      | 31 +--
 .../spark/sql/hudi/HoodieSqlCommonUtils.scala          | 13 +++-
 .../AlterHoodieTableChangeColumnCommand.scala          | 16 ++
 .../AlterHoodieTableDropPartitionCommand.scala         |  2 +-
 .../command/ShowHoodieTablePartitionsCommand.scala     |  4 +--
 .../org/apache/spark/sql/hudi/TestAlterTable.scala     | 36 +-
 6 files changed, 75 insertions(+), 27 deletions(-)
[jira] [Closed] (HUDI-3236) ALTER TABLE COMMENT old comment gets reverted
[ https://issues.apache.org/jira/browse/HUDI-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-3236. Reviewers: Raymond Xu, Tao Meng (was: Raymond Xu) Resolution: Fixed

> ALTER TABLE COMMENT old comment gets reverted
> ---------------------------------------------
>
> Key: HUDI-3236
> URL: https://issues.apache.org/jira/browse/HUDI-3236
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark-sql
> Affects Versions: 0.10.1
> Reporter: Raymond Xu
> Assignee: Yann Byron
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0
>
> Original Estimate: 0.5h
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> {code:sql}
> create table if not exists cow_nonpt_nonpcf_tbl (
>   id int,
>   name string,
>   price double
> ) using hudi
> options (
>   type = 'cow',
>   primaryKey = 'id'
> );
> insert into cow_nonpt_nonpcf_tbl select 1, 'a1', 20;
> ALTER TABLE cow_nonpt_nonpcf_tbl alter column id comment "primary id";
> DESC cow_nonpt_nonpcf_tbl;
> -- this works fine so far
> ALTER TABLE cow_nonpt_nonpcf_tbl alter column name comment "name column";
> DESC cow_nonpt_nonpcf_tbl;
> -- this saves the comment for name column
> -- but comment for id column was reverted back to NULL
> {code}
> reported while testing on 0.10.1-rc1 (spark 3.0.3, 3.1.2)
-- This message was sent by Atlassian Jira (v8.20.1#820001)
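The fix merged for this issue (per the PR title, "use fields' comments persisted in catalog to fill in schema") amounts to back-filling column comments from the catalog whenever an incoming schema omits them. The actual change lives in Scala (`HoodieCatalogTable.scala`); the sketch below is an illustrative Python rendering of the idea, with hypothetical names:

```python
def fill_comments(new_schema, catalog_schema):
    """For every field whose comment is missing in new_schema,
    reuse the comment already persisted in the catalog."""
    catalog_comments = {f["name"]: f.get("comment") for f in catalog_schema}
    filled = []
    for field in new_schema:
        field = dict(field)  # avoid mutating the caller's field dicts
        if field.get("comment") is None:
            field["comment"] = catalog_comments.get(field["name"])
        filled.append(field)
    return filled

# The bug scenario from the SQL above: commenting on `name` must not
# revert the earlier comment on `id`.
catalog = [{"name": "id", "comment": "primary id"}, {"name": "name"}]
incoming = [{"name": "id"}, {"name": "name", "comment": "name column"}]
result = fill_comments(incoming, catalog)
```

With this merge step, altering one column's comment leaves every previously saved comment intact.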
[GitHub] [hudi] xushiyan merged pull request #4587: [HUDI-3236] use fields'comments persisted in catalog to fill in schema
xushiyan merged pull request #4587: URL: https://github.com/apache/hudi/pull/4587 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] harishraju-govindaraju edited a comment on issue #4641: [SUPPORT] - HudiDeltaStreamer - EMR - SparkSubmit Not working
harishraju-govindaraju edited a comment on issue #4641: URL: https://github.com/apache/hudi/issues/4641#issuecomment-1017138248 Tried to define a proper schema; still hitting the same error. Any help is much appreciated, as we are planning to use DeltaStreamer in production. Caused by: org.apache.hudi.exception.HoodieIOException: Unrecognized token 'Objavro': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false') at [Source: (String)"Objavro.schema�{"type":"record","name":"topLevelRecord","fields":[{"name":"id","type":["string","null"]},{"name":"creation_date","type":["string","null"]},{"name":"last_update_time","type":["string","null"]},{"name":"quantity","type":["string","null"]},{"name":"compcode","type":["string","null"]}]}0org.apache.spark.version"; line: 1, column: 11] -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4287: [DO NOT MERGE] 0.10.0 release patch for flink
hudi-bot removed a comment on pull request #4287: URL: https://github.com/apache/hudi/pull/4287#issuecomment-1017101009 ## CI report: * 5b7a535559d80359a3febc2d1a80bf9a8ac20cf9 UNKNOWN * 952a154b1c656cd8e3c9c0df9fee313d3890d938 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5319) * a9254d5c5059e77f883467f05dedf08d704b17f1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5359) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4287: [DO NOT MERGE] 0.10.0 release patch for flink
hudi-bot commented on pull request #4287: URL: https://github.com/apache/hudi/pull/4287#issuecomment-1017133273 ## CI report: * 5b7a535559d80359a3febc2d1a80bf9a8ac20cf9 UNKNOWN * a9254d5c5059e77f883467f05dedf08d704b17f1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5359) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4645: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target …
hudi-bot commented on pull request #4645: URL: https://github.com/apache/hudi/pull/4645#issuecomment-1017128365 ## CI report: * 5332458bfb61a6e13b9b59ae3813d236f86e01da Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5358) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4645: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target …
hudi-bot removed a comment on pull request #4645: URL: https://github.com/apache/hudi/pull/4645#issuecomment-1017099833 ## CI report: * 5332458bfb61a6e13b9b59ae3813d236f86e01da Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5358) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] harishraju-govindaraju edited a comment on issue #4641: [SUPPORT] - HudiDeltaStreamer - EMR - SparkSubmit Not working
harishraju-govindaraju edited a comment on issue #4641: URL: https://github.com/apache/hudi/issues/4641#issuecomment-1017122913 Hello @nsivabalan , Thanks for promptly responding to my question. I tried to clear the folder and reran the below spark-submit command. The folder .hoodie got created but the job ended with error with no data files. **_Unrecognized token 'Objavro': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false') at [Source: (String)"Objavro.schema�{"type":"record","name":"topLevelRecord","fields":[{"name":"id","type":["string","null"]},{"name":"creation_date","type":["string","null"]},{"name":"last_update_time","type":["string","null"]},{"name":"quantity","type":["string","null"]},{"name":"compcode","type":["string","null"]}]}0org.apache.spark.version"; line: 1, column: 11]_** spark-submit \ --jars "s3://zcustomjar/spark-avro_2.11-2.4.4.jar" \ --deploy-mode "client" \ --class "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer" /usr/lib/hudi/hudi-utilities-bundle.jar \ --schemaprovider-class "org.apache.hudi.utilities.schema.FilebasedSchemaProvider" \ --table-type COPY_ON_WRITE \ --source-ordering-field id \ --target-base-path s3://ztrusted1/default/hudi-table1/ --target-table hudi-table1 \ --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \ --hoodie-conf hoodie.datasource.write.recordkey.field=id \ --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://zlanding1/input1/ \ --hoodie-conf hoodie.datasource.write.partitionpath.field=compcode \ --hoodie-conf hoodie.datasource.write.operation=insert \ --hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=s3://zcustomjar/source2.avsc \ --hoodie-conf hoodie.deltastreamer.schemaprovider.target.schema.file=s3://zcustomjar/target.avsc \ I have manually created the schema .avsc file using notepad. Not sure if that is a problem. 
{
  "type": "record",
  "name": "triprec",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "creation_date", "type": "string" },
    { "name": "last_update_time", "type": "string" },
    { "name": "quantity", "type": "string" },
    { "name": "compcode", "type": "string" }
  ]
}
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
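Writing a `.avsc` file by hand in Notepad is fine as long as the result is valid JSON, since Avro schemas are plain JSON documents. A quick sanity check of the schema from this thread:

```python
import json

# The hand-written triprec schema from the comment above.
AVSC = """
{ "type": "record", "name": "triprec", "fields": [
  {"name": "id", "type": "string"},
  {"name": "creation_date", "type": "string"},
  {"name": "last_update_time", "type": "string"},
  {"name": "quantity", "type": "string"},
  {"name": "compcode", "type": "string"}
]}
"""

# json.loads raises an error if the file is malformed (e.g. smart quotes
# or a BOM sneaked in by the editor), which rules the schema file out
# as the cause of a parse failure.
schema = json.loads(AVSC)
field_names = [f["name"] for f in schema["fields"]]
```

This only validates JSON well-formedness, not full Avro-schema semantics, but it catches the most common hand-editing mistakes.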
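The `Unrecognized token 'Objavro'` error in this thread is a strong hint that the input files are Avro container files being fed to a JSON parser: every Avro object container file starts with the magic bytes `Obj` followed by the version byte 0x01, which a JSON reader surfaces as the token `Objavro…`. A small sketch for checking what a source file actually contains (the function name and file paths are illustrative, not part of the reported setup):

```python
import os
import tempfile

def looks_like_avro(path: str) -> bool:
    """Avro object container files begin with the magic bytes b'Obj\\x01'."""
    with open(path, "rb") as f:
        return f.read(4) == b"Obj\x01"

# A fake Avro header versus a plain JSON document, in a temp directory.
workdir = tempfile.mkdtemp()
avro_path = os.path.join(workdir, "sample.avro")
json_path = os.path.join(workdir, "sample.json")
with open(avro_path, "wb") as f:
    f.write(b"Obj\x01" + b"\x00" * 16)  # header only; the payload is irrelevant here
with open(json_path, "w") as f:
    f.write('{"id": "1"}')
```

If the files under the DFS root turn out to be Avro, the DeltaStreamer source class should match that format (e.g. via `--source-class`) rather than a JSON source, since a JSON source is what produces exactly this parse error.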
[GitHub] [hudi] LucassLin commented on issue #4642: [SUPPORT] Hudi Merge Into
LucassLin commented on issue #4642: URL: https://github.com/apache/hudi/issues/4642#issuecomment-1017119391 > ```scala > historicalDF.write.format("hudi").saveAsTable("tableName") > ``` > > Sorry, I didn't see this one just now. It's OK from Hudi version 0.9.0, because there's still hudi sparkSql > > In addition, it is also possible to configure sync hive, but there are bugs in previous versions. See this PR for details [3745](https://github.com/apache/hudi/pull/3745) > > Of course, you can also create a hive table. As long as the attributes are completely consistent, it is essentially the same as sync hive thanks for the replies. I tried using hudi createTable sql command but getting ``` Exception = MetaException(message:Got exception: java.io.IOException Error accessing gs://* ``` I also tried using saveAsTable but seems like there might be some issue with hive config which causes ``` Exception = Invalid host name: local host is: ``` I will try to resolve these once I get back to work and see if the entity table would solve the mergeInto issue. Thanks again for your help. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
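For reference, the Spark SQL `MERGE INTO` shape being attempted in this thread looks roughly like the following; table and column names are illustrative, and the target must be a Hudi table registered in the catalog, which is exactly what the `saveAsTable`/Hive-sync discussion above is about:

```sql
MERGE INTO target_tbl AS t
USING (SELECT id, name, price, ts FROM updates_tbl) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

This is a sketch only; the exact supported syntax depends on the Hudi and Spark versions in use.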
[GitHub] [hudi] RocMarshal closed pull request #3813: [HUDI-2563][hudi-client] Refactor CompactionTriggerStrategy.
RocMarshal closed pull request #3813: URL: https://github.com/apache/hudi/pull/3813 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] prashantwason commented on a change in pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field
prashantwason commented on a change in pull request #4449: URL: https://github.com/apache/hudi/pull/4449#discussion_r788345996

File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java

  .withDocumentation("Lower values increase the size of metadata tracked within HFile, but can offer potentially " + "faster lookup times.");

  public static final ConfigProperty<String> HFILE_SCHEMA_KEY_FIELD_NAME = ConfigProperty

Review comment: This setting is broken because the HFileReader does not have a way to use it. Assume I specify this setting to be "someotherkey". The HFileReader will still use the hardcoded "key". I suggest you remove this setting and all associated code and defer this for a later PR, which will plug this setting into the reader.

File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java

  public void writeAvro(String recordKey, IndexedRecord object) throws IOException {
    byte[] value = HoodieAvroUtils.avroToBytes((GenericRecord) object);
    if (schemaRecordKeyField.isPresent()) {
      GenericRecord recordKeyExcludedRecord = HoodieAvroUtils.bytesToAvro(value, this.schema);

Review comment: This will reduce performance, as you are converting the record to bytes in the line above and then immediately parsing it back into a GenericRecord. It may be better to check first, before creating the bytes.

File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileConfig.java

  private final Configuration hadoopConf;
  private final BloomFilter bloomFilter;
  private final KeyValue.KVComparator hfileComparator;
  private final String schemaKeyFieldId;

Review comment: Why is this an Id and not a name? schemaKeyFieldName

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
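The second review point above (serializing the record to bytes and then immediately parsing it back) is a general serialize-once concern. The writer under review is Java; the toy Python sketch below, with hypothetical function names and JSON standing in for Avro, only illustrates the reviewer's suggested ordering:

```python
import json

def write_record_slow(record, key_field):
    # The anti-pattern flagged in the review: serialize, then parse the
    # bytes straight back just to drop one field.
    data = json.dumps(record)        # record -> serialized form
    parsed = json.loads(data)        # ...and immediately back again
    parsed.pop(key_field, None)
    return json.dumps(parsed)        # serialized a second time

def write_record_fast(record, key_field):
    # The suggested ordering: transform first, serialize exactly once.
    trimmed = {k: v for k, v in record.items() if k != key_field}
    return json.dumps(trimmed)

rec = {"key": "k1", "value": 42}
```

Both functions produce the same output; the second simply avoids one serialization and one deserialization per record, which matters on a per-record write path.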
[GitHub] [hudi] hudi-bot removed a comment on pull request #4644: [HUDI-3282] Fix delete exception for Spark SQL when sync Hive
hudi-bot removed a comment on pull request #4644: URL: https://github.com/apache/hudi/pull/4644#issuecomment-1017091703 ## CI report: * ce8d15b8d574ce0591dd309dd27bf5e26d0bdef0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5357) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4644: [HUDI-3282] Fix delete exception for Spark SQL when sync Hive
hudi-bot commented on pull request #4644: URL: https://github.com/apache/hudi/pull/4644#issuecomment-1017113200 ## CI report: * ce8d15b8d574ce0591dd309dd27bf5e26d0bdef0 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5357) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] dongkelun edited a comment on issue #4642: [SUPPORT] Hudi Merge Into
dongkelun edited a comment on issue #4642: URL: https://github.com/apache/hudi/issues/4642#issuecomment-1017091320 @LucassLin In the official documentation: [https://hudi.apache.org/docs/quick-start-guide/](https://hudi.apache.org/docs/quick-start-guide/), under "Create Table" (Spark SQL):
```sql
-- create a mor non-partitioned table without preCombineField provided
create table hudi_mor_tbl (
  id int,
  name string,
  price double,
  ts bigint
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts'
);
```
The document is slightly wrong in this place: `type = 'cow'` should be `type = 'mor'`; this parameter only controls the table type. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
hudi-bot commented on pull request #4606: URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017109276 ## CI report: * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
hudi-bot removed a comment on pull request #4606: URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017071243 ## CI report: * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] dongkelun commented on issue #4642: [SUPPORT] Hudi Merge Into
dongkelun commented on issue #4642: URL: https://github.com/apache/hudi/issues/4642#issuecomment-1017108838 ```scala historicalDF.write.format("hudi").saveAsTable("tableName") ``` Sorry, I didn't see this one just now. This works from Hudi version 0.9.0 onward, because that release includes Hudi Spark SQL support. In addition, you can also configure Hive sync, but there were bugs in earlier versions; see PR [3745](https://github.com/apache/hudi/pull/3745) for details. Of course, you can also create a Hive table yourself: as long as its attributes are completely consistent, it is essentially the same as Hive sync.
[hudi] branch master updated: [MINOR] Add instructions to build and upload Docker Demo images (#4612)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new a08a2b7 [MINOR] Add instructions to build and upload Docker Demo images (#4612) a08a2b7 is described below commit a08a2b730674d57a0c40545023b21f0bac34e7df Author: Y Ethan Guo AuthorDate: Wed Jan 19 20:25:28 2022 -0800 [MINOR] Add instructions to build and upload Docker Demo images (#4612) * [MINOR] Add instructions to build and upload Docker Demo images * Add local test instruction --- docker/README.md | 93 ++ docker/push_to_docker_hub.png | Bin 0 -> 260126 bytes docker/setup_demo.sh | 6 ++- 3 files changed, 98 insertions(+), 1 deletion(-) diff --git a/docker/README.md b/docker/README.md new file mode 100644 index 000..19293de --- /dev/null +++ b/docker/README.md @@ -0,0 +1,93 @@ + + +# Docker Demo for Hudi + +This repo contains the docker demo resources for building docker demo images, setting up the demo, and running Hudi in the +docker demo environment. + +## Repo Organization + +### Configs for assembling docker images - `/hoodie` + +The `/hoodie` folder contains all the configs for assembling necessary docker images. The name and repository of each +docker image, e.g., `apachehudi/hudi-hadoop_2.8.4-trinobase_368`, is defined in the maven configuration file `pom.xml`. + +### Docker compose config for the Demo - `/compose` + +The `/compose` folder contains the yaml file to compose the Docker environment for running Hudi Demo. + +### Resources and Sample Data for the Demo - `/demo` + +The `/demo` folder contains useful resources and sample data used for the Demo. 
+ +## Build and Test Image locally + +To build all docker images locally, you can run the script: + +```shell +./build_local_docker_images.sh +``` + +To build a single image target, you can run + +```shell +mvn clean pre-integration-test -DskipTests -Ddocker.compose.skip=true -Ddocker.build.skip=false -pl :<image-module> -am +# For example, to build hudi-hadoop-trinobase-docker +mvn clean pre-integration-test -DskipTests -Ddocker.compose.skip=true -Ddocker.build.skip=false -pl :hudi-hadoop-trinobase-docker -am +``` + +Alternatively, you can use the `docker` cli directly under `hoodie/hadoop`. Note that you need to manually name your local +image using the `-t` option to match the naming in the `pom.xml`, so that you can update the corresponding image +repository in Docker Hub (detailed steps in the next section). + +```shell +# Run under hoodie/hadoop; the <tag> is optional, "latest" by default +docker build <image> -t <repo>/<image>[:<tag>] +# For example, to build trinobase +docker build trinobase -t apachehudi/hudi-hadoop_2.8.4-trinobase_368 +``` + +After new images are built, you can run the following script to bring up the docker demo with your local images: + +```shell +./setup_demo.sh dev +``` + +## Upload Updated Image to Repository on Docker Hub + +Once you have built the updated image locally, you can push the image to its corresponding repository in the Docker +Hub registry, designated by its name or tag: + +```shell +docker push <repo>/<image>:<tag> +# For example +docker push apachehudi/hudi-hadoop_2.8.4-trinobase_368 +``` + +You can also easily push the image to Docker Hub using the Docker Desktop app: go to `Images`, search for the image by +name, and then click on the three dots and `Push to Hub`. + +![Push to Docker Hub](push_to_docker_hub.png) + +Note that you need to ask for permission to upload the Hudi Docker Demo images to the repositories. + +You can find more information in the [Docker Hub Repositories Manual](https://docs.docker.com/docker-hub/repos/). 
+ +## Docker Demo Setup + +Please refer to the [Docker Demo Docs page](https://hudi.apache.org/docs/docker_demo). \ No newline at end of file diff --git a/docker/push_to_docker_hub.png b/docker/push_to_docker_hub.png new file mode 100644 index 000..faa431b Binary files /dev/null and b/docker/push_to_docker_hub.png differ diff --git a/docker/setup_demo.sh b/docker/setup_demo.sh index 634fe9e..9f0a100 100755 --- a/docker/setup_demo.sh +++ b/docker/setup_demo.sh @@ -17,10 +17,14 @@ # limitations under the License. SCRIPT_PATH=$(cd `dirname $0`; pwd) +HUDI_DEMO_ENV=$1 WS_ROOT=`dirname $SCRIPT_PATH` # restart cluster HUDI_WS=${WS_ROOT} docker-compose -f ${SCRIPT_PATH}/compose/docker-compose_hadoop284_hive233_spark244.yml down -HUDI_WS=${WS_ROOT} docker-compose -f ${SCRIPT_PATH}/compose/docker-compose_hadoop284_hive233_spark244.yml pull +if [ "$HUDI_DEMO_ENV" != "dev" ]; then + echo "Pulling docker demo images ..." + HUDI_WS=${WS_ROOT} docker-compose -f ${SCRIPT_PATH}/compose/docker-compose_hadoop284_hive233_spark244.yml pull +fi sleep 5 HUDI_WS=${WS_ROOT} docker-compose -f ${SCRIPT_PATH}/compose/docker-compose_hadoop284_hive233_spark244.yml up -d sleep 15
[GitHub] [hudi] codope merged pull request #4612: [MINOR] Add instructions to build and upload Docker Demo images
codope merged pull request #4612: URL: https://github.com/apache/hudi/pull/4612
[GitHub] [hudi] hudi-bot commented on pull request #4287: [DO NOT MERGE] 0.10.0 release patch for flink
hudi-bot commented on pull request #4287: URL: https://github.com/apache/hudi/pull/4287#issuecomment-1017101009 ## CI report: * 5b7a535559d80359a3febc2d1a80bf9a8ac20cf9 UNKNOWN * 952a154b1c656cd8e3c9c0df9fee313d3890d938 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5319) * a9254d5c5059e77f883467f05dedf08d704b17f1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5359)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4287: [DO NOT MERGE] 0.10.0 release patch for flink
hudi-bot removed a comment on pull request #4287: URL: https://github.com/apache/hudi/pull/4287#issuecomment-1017093744 ## CI report: * 5b7a535559d80359a3febc2d1a80bf9a8ac20cf9 UNKNOWN * 952a154b1c656cd8e3c9c0df9fee313d3890d938 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5319) * a9254d5c5059e77f883467f05dedf08d704b17f1 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4645: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target …
hudi-bot commented on pull request #4645: URL: https://github.com/apache/hudi/pull/4645#issuecomment-1017099833 ## CI report: * 5332458bfb61a6e13b9b59ae3813d236f86e01da Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5358)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4645: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target …
hudi-bot removed a comment on pull request #4645: URL: https://github.com/apache/hudi/pull/4645#issuecomment-1017098255 ## CI report: * 5332458bfb61a6e13b9b59ae3813d236f86e01da UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4645: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target …
hudi-bot commented on pull request #4645: URL: https://github.com/apache/hudi/pull/4645#issuecomment-1017098255 ## CI report: * 5332458bfb61a6e13b9b59ae3813d236f86e01da UNKNOWN
[GitHub] [hudi] watermelon12138 opened a new pull request #4645: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target …
watermelon12138 opened a new pull request #4645: URL: https://github.com/apache/hudi/pull/4645 ## What is the purpose of the pull request Enable MultiTableDeltaStreamer to update a single target table from multiple source tables. ## Brief change log - *Modify HoodieMultiTableDeltaStreamer so that it generates the per-table execution context based on the source tables.* - *Modify DeltaSync.java so that the source table can be associated with other tables and each source can configure an independent checkpoint.* - *Add unit tests.*
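The PR's core idea — one target table fed by several sources, each tracking its own checkpoint — can be illustrated with a toy model (pure Python, no Hudi; all names here are hypothetical and only mirror the described semantics, not the actual DeltaStreamer code):

```python
# Toy model of a multi-source -> single-target sync: each source keeps an
# independent checkpoint, and every sync round appends only that source's
# records past its own checkpoint to the shared target.
class MultiSourceSync:
    def __init__(self):
        self.checkpoints = {}   # source name -> last offset consumed
        self.target = []        # shared target "table"

    def sync(self, source, records):
        start = self.checkpoints.get(source, 0)
        new = records[start:]             # records past this source's checkpoint
        self.target.extend(new)
        self.checkpoints[source] = len(records)
        return len(new)

sync = MultiSourceSync()
sync.sync("orders", ["o1", "o2"])
sync.sync("refunds", ["r1"])
sync.sync("orders", ["o1", "o2", "o3"])   # only "o3" is new for this source
```

The point of the per-source checkpoint is visible in the last call: replaying the full "orders" feed appends only the unseen record, while "refunds" keeps its own offset untouched.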
[jira] [Updated] (HUDI-3279) Metadata table stores incorrect file sizes after Restore
[ https://issues.apache.org/jira/browse/HUDI-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3279: -- Attachment: Screen Shot 2022-01-19 at 7.56.37 PM.png > Metadata table stores incorrect file sizes after Restore > > > Key: HUDI-3279 > URL: https://issues.apache.org/jira/browse/HUDI-3279 > Project: Apache Hudi > Issue Type: Task >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.11.0 > > Attachments: Screen Shot 2022-01-19 at 12.17.21 PM.png, Screen Shot > 2022-01-19 at 12.18.27 PM.png, Screen Shot 2022-01-19 at 7.56.37 PM.png > > > While working on [https://github.com/apache/hudi/pull/4556], I have stumbled > upon an issue of the LogBlock Scanner EOF-ing on the log-files in tests after > performing a Restore operation. > The root cause turned out to be the Metadata Table storing incorrect > sizes of the files after Restore (sizes in MT are essentially 2x of what is > in FS): > !Screen Shot 2022-01-19 at 12.17.21 PM.png! > !Screen Shot 2022-01-19 at 12.18.27 PM.png! > > This seems to occur due to the following: > # The Metadata table treats new Records for the same file as "deltas", appending > the file-size to its records > (https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java#L227) > # Upon Restore (which is handled simply as a collection of Rollbacks) we > pick the *max* of the sizes of the files before and after the operation, without > regard to which instant we're actually rolling back to > (https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java#L254). > > *Proposal* > Instead of simply always picking the max size, we should pick the size of the > file as it was right before. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
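The failure mode described in the ticket can be reproduced with a toy model of the payload merge (pure Python, no Hudi; this only mirrors the semantics described above, not the actual HoodieMetadataPayload code):

```python
# Toy model: metadata-table records for a file are combined by summing
# "delta" sizes, so replaying a write after a restore double-counts it.
def combine(prev_size, delta):
    # mirrors the "append file-size as delta" behavior described in point 1
    return prev_size + delta

size = combine(0, 100)        # initial write: file is 100 bytes
size = combine(size, 100)     # same write replayed after restore
assert size == 200            # 2x the real on-disk size -- the reported bug

# Rollback merge: the current code keeps max(before, after); the proposal
# is to keep the size the file had right before the rolled-back operation.
def rollback_merge_current(before, after):
    return max(before, after)

def rollback_merge_proposed(before, after):
    return before
```

Under the proposed merge, rolling back the replayed write would restore the true 100-byte size instead of retaining the inflated 200.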
[jira] [Commented] (HUDI-3279) Metadata table stores incorrect file sizes after Restore
[ https://issues.apache.org/jira/browse/HUDI-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479086#comment-17479086 ] Alexey Kudinkin commented on HUDI-3279: --- This is an example of a test failing in CI: [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=5351&view=logs&j=7601efb9-4019-552e-11ba-eb31b66593b2&t=9688f101-287d-53f4-2a80-87202516f5d0&l=4344] !Screen Shot 2022-01-19 at 7.56.37 PM.png! > Metadata table stores incorrect file sizes after Restore
[GitHub] [hudi] stayrascal commented on a change in pull request #4141: [HUDI-2815] Support partial update for streaming change logs
stayrascal commented on a change in pull request #4141: URL: https://github.com/apache/hudi/pull/4141#discussion_r788331396 ## File path: hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateWithLatestAvroPayload.java ## @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.model; + +import org.apache.hudi.common.util.Option; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericRecord; +import org.apache.avro.generic.IndexedRecord; + +import java.io.IOException; +import java.util.List; +import java.util.Objects; +import java.util.Properties; + +import static org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro; + +/** + * The only difference with {@link DefaultHoodieRecordPayload} is that support update partial fields + * in latest record which value is not null to existing record instead of all fields. + * + * Assuming a {@link GenericRecord} has three fields: a int , b int, c int. The first record value: 1, 2, 3. + * The second record value is: 4, 5, null, the field c value is null. 
After call the combineAndGetUpdateValue method, + * we will get final record value: 4, 5, 3, field c value will not be overwritten because its value is null in latest record. + */ +public class PartialUpdateWithLatestAvroPayload extends DefaultHoodieRecordPayload { + Review comment: Hi @danny0405, if I understand correctly, the `preCombine` method will be called while deduplicating records via `flushBucket` or `flushRemaining` in `StreamWriteFunction`, which only deduplicates the (newly created/updated) records in the buffer. If we only override `preCombine`, it will only update/merge the records in the buffer; but if a record with the same recordKey already exists in a base/log file, that record will be overwritten by the new merged record from the buffer, right? For example, in COW mode we might still need to override the `combineAndGetUpdateValue` method, because it is called by `HoodieMergeHandle.write(GenericRecord oldRecord)`, which merges the new merged record with old records. ``` public void write(GenericRecord oldRecord) { String key = KeyGenUtils.getRecordKeyFromGenericRecord(oldRecord, keyGeneratorOpt); boolean copyOldRecord = true; if (keyToNewRecords.containsKey(key)) { // If we have duplicate records that we are updating, then the hoodie record will be deflated after // writing the first record. So make a copy of the record to be merged HoodieRecord hoodieRecord = new HoodieRecord<>(keyToNewRecords.get(key)); try { Option combinedAvroRecord = hoodieRecord.getData().combineAndGetUpdateValue(oldRecord, useWriterSchema ? tableSchemaWithMetaFields : tableSchema, config.getPayloadConfig().getProps()); if (combinedAvroRecord.isPresent() && combinedAvroRecord.get().equals(IGNORE_RECORD)) { // If it is an IGNORE_RECORD, just copy the old record, and do not update the new record. 
copyOldRecord = true; } else if (writeUpdateRecord(hoodieRecord, oldRecord, combinedAvroRecord)) { /* * ONLY WHEN 1) we have an update for this key AND 2) We are able to successfully * write the the combined new * value * * We no longer need to copy the old record over. */ copyOldRecord = false; } writtenRecordKeys.add(key); } catch (Exception e) { throw new HoodieUpsertException("Failed to combine/merge new record with old value in storage, for new record {" + keyToNewRecords.get(key) + "}, old value {" + oldRecord + "}", e); } } ```
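The partial-update semantics described in the class comment above (a null field in the newer record falls back to the stored value) can be sketched independent of Avro; `partial_merge` below is a hypothetical pure-Python stand-in for `combineAndGetUpdateValue`:

```python
def partial_merge(old, new):
    """Merge two records: take the new value unless it is None (null),
    in which case keep the old (stored) value."""
    return {field: (new_val if new_val is not None else old.get(field))
            for field, new_val in new.items()}

# The example from the class javadoc: stored (1, 2, 3) merged with
# incoming (4, 5, null) yields (4, 5, 3) -- field c keeps its stored value.
stored = {"a": 1, "b": 2, "c": 3}
incoming = {"a": 4, "b": 5, "c": None}
merged = partial_merge(stored, incoming)
assert merged == {"a": 4, "b": 5, "c": 3}
```

This is exactly why the reviewer argues the merge must also run against records already persisted in base/log files (via `combineAndGetUpdateValue`), not only against records co-resident in the write buffer (via `preCombine`).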
[GitHub] [hudi] hudi-bot commented on pull request #4287: [DO NOT MERGE] 0.10.0 release patch for flink
hudi-bot commented on pull request #4287: URL: https://github.com/apache/hudi/pull/4287#issuecomment-1017093744 ## CI report: * 5b7a535559d80359a3febc2d1a80bf9a8ac20cf9 UNKNOWN * 952a154b1c656cd8e3c9c0df9fee313d3890d938 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5319) * a9254d5c5059e77f883467f05dedf08d704b17f1 UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4287: [DO NOT MERGE] 0.10.0 release patch for flink
hudi-bot removed a comment on pull request #4287: URL: https://github.com/apache/hudi/pull/4287#issuecomment-1015311668 ## CI report: * 5b7a535559d80359a3febc2d1a80bf9a8ac20cf9 UNKNOWN * 952a154b1c656cd8e3c9c0df9fee313d3890d938 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5319)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4644: [HUDI-3282] Fix delete exception for Spark SQL when sync Hive
hudi-bot removed a comment on pull request #4644: URL: https://github.com/apache/hudi/pull/4644#issuecomment-1017090637 ## CI report: * ce8d15b8d574ce0591dd309dd27bf5e26d0bdef0 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4644: [HUDI-3282] Fix delete exception for Spark SQL when sync Hive
hudi-bot commented on pull request #4644: URL: https://github.com/apache/hudi/pull/4644#issuecomment-1017091703 ## CI report: * ce8d15b8d574ce0591dd309dd27bf5e26d0bdef0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5357)
[GitHub] [hudi] dongkelun commented on issue #4642: [SUPPORT] Hudi Merge Into
dongkelun commented on issue #4642: URL: https://github.com/apache/hudi/issues/4642#issuecomment-1017091320 @LucassLin See the "Create Table" section for Spark SQL in the official documentation: [https://hudi.apache.org/docs/quick-start-guide/](https://hudi.apache.org/docs/quick-start-guide/) ```sql -- create a cow non-partitioned table with preCombineField provided create table hudi_mor_tbl ( id int, name string, price double, ts bigint ) using hudi tblproperties ( type = 'cow', primaryKey = 'id', preCombineField = 'ts' ); ```
[jira] [Commented] (HUDI-3122) presto query failed for bootstrap tables
[ https://issues.apache.org/jira/browse/HUDI-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479084#comment-17479084 ] Yue Zhang commented on HUDI-3122: - Since https://github.com/apache/hudi/pull/4551 is merged maybe we can close this issue? > presto query failed for bootstrap tables > > > Key: HUDI-3122 > URL: https://issues.apache.org/jira/browse/HUDI-3122 > Project: Apache Hudi > Issue Type: Improvement > Components: trino-presto >Reporter: Wenning Ding >Priority: Major > > > {{java.lang.NoClassDefFoundError: > org/apache/hudi/org/apache/hadoop/hbase/io/hfile/CacheConfig > at > org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.createReader(HFileBootstrapIndex.java:181) > at > org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.access$400(HFileBootstrapIndex.java:76) > at > org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.partitionIndexReader(HFileBootstrapIndex.java:272) > at > org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.fetchBootstrapIndexInfo(HFileBootstrapIndex.java:262) > at > org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.initIndexInfo(HFileBootstrapIndex.java:252) > at > org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.(HFileBootstrapIndex.java:243) > at > org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.createReader(HFileBootstrapIndex.java:191) > at > org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$addFilesToView$2(AbstractTableFileSystemView.java:137) > at java.util.HashMap.forEach(HashMap.java:1290) > at > org.apache.hudi.common.table.view.AbstractTableFileSystemView.addFilesToView(AbstractTableFileSystemView.java:134) > at > org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$ensurePartitionLoadedCorrectly$9(AbstractTableFileSystemView.java:294) > at > 
java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660) > at > org.apache.hudi.common.table.view.AbstractTableFileSystemView.ensurePartitionLoadedCorrectly(AbstractTableFileSystemView.java:281)}}
[GitHub] [hudi] hudi-bot commented on pull request #4644: [HUDI-3282] Fix delete exception for Spark SQL when sync Hive
hudi-bot commented on pull request #4644: URL: https://github.com/apache/hudi/pull/4644#issuecomment-1017090637 ## CI report: * ce8d15b8d574ce0591dd309dd27bf5e26d0bdef0 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot commented on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1017090493 ## CI report: * 6b59b2758827fad29ac77c2e7ac00ba4ee00cbbd Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5355)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot removed a comment on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1017061971 ## CI report: * cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5044) * 6b59b2758827fad29ac77c2e7ac00ba4ee00cbbd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5355)
[GitHub] [hudi] dongkelun commented on pull request #3745: [HUDI-2514] Fix delete exception for Spark SQL when sync Hive
dongkelun commented on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1017090320 > > @YannByron thanks for the review. can you take it from here please? from the reproducing steps, looks like a different bug where primary key and other properties were not respected by HMS? > > @xushiyan ok, i'll take this. And it'a real different case, better to create another pr and ticket. Otherwise, it can feel strange and confusing. A new PR has been submitted: [4644](https://github.com/apache/hudi/pull/4644)
[jira] [Updated] (HUDI-3282) Fix delete exception for Spark SQL when sync Hive
[ https://issues.apache.org/jira/browse/HUDI-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-3282: - Labels: pull-request-available (was: ) > Fix delete exception for Spark SQL when sync Hive > - > > Key: HUDI-3282 > URL: https://issues.apache.org/jira/browse/HUDI-3282 > Project: Apache Hudi > Issue Type: Bug > Components: hive-sync, spark-sql >Affects Versions: 0.10.0 >Reporter: 董可伦 >Assignee: 董可伦 >Priority: Major > Labels: pull-request-available > Fix For: 0.10.1 > > > h1. Fix delete exception for Spark SQL when sync Hive
[GitHub] [hudi] dongkelun opened a new pull request #4644: [HUDI-3282] Fix delete exception for Spark SQL when sync Hive
dongkelun opened a new pull request #4644: URL: https://github.com/apache/hudi/pull/4644 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.* ## What is the purpose of the pull request Fix delete exception for Spark SQL when sync Hive ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] LucassLin commented on issue #4642: [SUPPORT] Hudi Merge Into
LucassLin commented on issue #4642: URL: https://github.com/apache/hudi/issues/4642#issuecomment-1017089251 > Instead of using a temporary table, try changing the target table to an entity table. The current version should not support the situation that the target table is a temporary table. And the target table must be a Hudi table Thanks for the reply. By entity table, do you mean something like ``` historicalDF.write.saveAsTable("tableName") ``` Can you also elaborate more on "And the target table must be a Hudi table"? How do I ensure the table I write is a Hudi table? Is there a specific API to use to create this entity table as a Hudi table? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-3282) Fix delete exception for Spark SQL when sync Hive
董可伦 created HUDI-3282: - Summary: Fix delete exception for Spark SQL when sync Hive Key: HUDI-3282 URL: https://issues.apache.org/jira/browse/HUDI-3282 Project: Apache Hudi Issue Type: Bug Components: hive-sync, spark-sql Affects Versions: 0.10.0 Reporter: 董可伦 Assignee: 董可伦 Fix For: 0.10.1 h1. Fix delete exception for Spark SQL when sync Hive -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] nsivabalan edited a comment on issue #3879: [SUPPORT] Incomplete Table Migration
nsivabalan edited a comment on issue #3879: URL: https://github.com/apache/hudi/issues/3879#issuecomment-1017075893 May I know what's the partition path field I should be choosing while writing to hudi? And I assume the record key is UUID and the preCombine field is SORT_KEY.
```
spark.sql("describe tbl1").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|    UUID|   string|   null|
|       A|   string|   null|
|       B|timestamp|   null|
|       C|   string|   null|
|       D|      int|   null|
|       E|timestamp|   null|
|       F|   string|   null|
|       G|timestamp|   null|
|SORT_KEY|timestamp|   null|
+--------+---------+-------+
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4643: [HUDI-3281][Performance]Tuning performance of getAllPartitionPaths API in FileSystemBackedTableMetadata
hudi-bot removed a comment on pull request #4643: URL: https://github.com/apache/hudi/pull/4643#issuecomment-1017060755 ## CI report: * d63b3431b7e4331bf5bcc5e8789d008a296848f4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5354) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4643: [HUDI-3281][Performance]Tuning performance of getAllPartitionPaths API in FileSystemBackedTableMetadata
hudi-bot commented on pull request #4643: URL: https://github.com/apache/hudi/pull/4643#issuecomment-1017083372 ## CI report: * d63b3431b7e4331bf5bcc5e8789d008a296848f4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5354) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] YannByron commented on pull request #3745: [HUDI-2514] Fix delete exception for Spark SQL when sync Hive
YannByron commented on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1017082441 > @YannByron thanks for the review. can you take it from here please? from the reproducing steps, looks like a different bug where primary key and other properties were not respected by HMS? @xushiyan ok, i'll take this. And it's really a different case; better to create another PR and ticket for it, otherwise it can feel strange and confusing. @dongkelun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on issue #3879: [SUPPORT] Incomplete Table Migration
nsivabalan commented on issue #3879: URL: https://github.com/apache/hudi/issues/3879#issuecomment-1017075893 May I know what's the partition path field I should be choosing while writing to hudi?
```
spark.sql("describe tbl1").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|    UUID|   string|   null|
|       A|   string|   null|
|       B|timestamp|   null|
|       C|   string|   null|
|       D|      int|   null|
|       E|timestamp|   null|
|       F|   string|   null|
|       G|timestamp|   null|
|SORT_KEY|timestamp|   null|
+--------+---------+-------+
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
hudi-bot commented on pull request #4606: URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017071243 ## CI report: * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
hudi-bot removed a comment on pull request #4606: URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017069976 ## CI report: * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
hudi-bot removed a comment on pull request #4606: URL: https://github.com/apache/hudi/pull/4606#issuecomment-1013607661 ## CI report: * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
hudi-bot commented on pull request #4606: URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017069976 ## CI report: * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] watermelon12138 closed pull request #4637: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target table from multiple source tables.
watermelon12138 closed pull request #4637: URL: https://github.com/apache/hudi/pull/4637 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot commented on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1017061971 ## CI report: * cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5044) * 6b59b2758827fad29ac77c2e7ac00ba4ee00cbbd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5355) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot removed a comment on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1017060566 ## CI report: * cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5044) * 6b59b2758827fad29ac77c2e7ac00ba4ee00cbbd UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4643: [HUDI-3281][Performance]Tuning performance of getAllPartitionPaths API in FileSystemBackedTableMetadata
hudi-bot commented on pull request #4643: URL: https://github.com/apache/hudi/pull/4643#issuecomment-1017060755 ## CI report: * d63b3431b7e4331bf5bcc5e8789d008a296848f4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5354) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4643: [HUDI-3281][Performance]Tuning performance of getAllPartitionPaths API in FileSystemBackedTableMetadata
hudi-bot removed a comment on pull request #4643: URL: https://github.com/apache/hudi/pull/4643#issuecomment-1017059408 ## CI report: * d63b3431b7e4331bf5bcc5e8789d008a296848f4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot commented on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1017060566 ## CI report: * cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5044) * 6b59b2758827fad29ac77c2e7ac00ba4ee00cbbd UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot removed a comment on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1008557130 ## CI report: * cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5044) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4643: [HUDI-3281][Performance]Tuning performance of getAllPartitionPaths API in FileSystemBackedTableMetadata
hudi-bot commented on pull request #4643: URL: https://github.com/apache/hudi/pull/4643#issuecomment-1017059408 ## CI report: * d63b3431b7e4331bf5bcc5e8789d008a296848f4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] watermelon12138 commented on pull request #4637: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target table from multiple source tables.
watermelon12138 commented on pull request #4637: URL: https://github.com/apache/hudi/pull/4637#issuecomment-1017058557 @nsivabalan Thank you for the advice. However, the resumeCheckpointStr calculation in DeltaSync only applies when a single source updates a single target table; it does not work when multiple sources update one target table. When multiple sources update a single target, each source needs an independent checkpoint so that it can recover from its own checkpoint. I only changed the methods that calculate resumeCheckpointStr and save checkpoints into checkpointCommitMetadata in DeltaSync (specifically, I added new methods for this). This does not affect the calculation logic when a single source updates a single target. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
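The per-source checkpoint idea described in the comment can be sketched as follows. This is a simplified illustration, not Hudi's actual code: the function names and the composite metadata keys are hypothetical, and the real logic lives in DeltaSync's handling of resumeCheckpointStr and checkpointCommitMetadata.

```python
def save_checkpoints(commit_metadata: dict, source_checkpoints: dict) -> dict:
    """Store one checkpoint per source under a composite key, so several
    sources feeding one target table do not overwrite each other."""
    for source_id, ckpt in source_checkpoints.items():
        commit_metadata[f"deltastreamer.checkpoint.{source_id}"] = ckpt
    return commit_metadata


def resume_checkpoint(commit_metadata: dict, source_id: str):
    """Resolve the resume point for one source; fall back to a legacy
    single-source key so existing single-source pipelines keep working."""
    return commit_metadata.get(
        f"deltastreamer.checkpoint.{source_id}",
        commit_metadata.get("deltastreamer.checkpoint.key"),
    )
```

The fallback branch is what keeps the single-source-to-single-target path unchanged: a pipeline that never wrote per-source keys resolves to the same checkpoint it always did.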
[jira] [Updated] (HUDI-3281) Tuning performance of getAllPartitionPaths in FileSystemBackedTableMetadata
[ https://issues.apache.org/jira/browse/HUDI-3281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-3281: - Labels: pull-request-available (was: ) > Tuning performance of getAllPartitionPaths in FileSystemBackedTableMetadata > --- > > Key: HUDI-3281 > URL: https://issues.apache.org/jira/browse/HUDI-3281 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Yue Zhang >Assignee: Yue Zhang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] zhangyue19921010 opened a new pull request #4643: [HUDI-3281][Performance]Tuning performance of getAllPartitionPaths API in FileSystemBackedTableMetadata
zhangyue19921010 opened a new pull request #4643: URL: https://github.com/apache/hudi/pull/4643 https://issues.apache.org/jira/projects/HUDI/issues/HUDI-3281 ## What is the purpose of the pull request The current implementation of the getAllPartitionPaths API lists/collects all the data files and checks for `.hoodie_partition_metadata` among them. As we know, listing is a pretty heavy operation, especially in scenarios where S3 is used as storage. We sometimes see 20+ seconds for a single list call, causing streaming job delays. ## Brief change log Just check whether `partitions/.hoodie_partition_metadata` exists instead of listing all the data files under each partition. Here are the test results based on S3 ![image](https://user-images.githubusercontent.com/69956021/150256681-379d2138-2c4a-4f11-a703-1399b268290c.png)

|           | time1 | time2 | time3 | time4 | time5 | time6 | time7 | time8 | time9 | time10 | avg (ms) |
|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|--------|----------|
| Original  | 6045  | 5627  | 5736  | 5697  | 5733  | 5321  | 5462  | 5700  | 6072  | 5367   | 5676     |
| Optimized | 2888  | 2730  | 2717  | 2680  | 2684  | 2778  | 2650  | 2728  | 3107  | 2908   | 2787     |

## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
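The optimization in the PR above — probing for the partition marker file instead of listing every data file — can be sketched in Python. This is a hedged illustration with hypothetical helper names, not Hudi's code (the real logic lives in FileSystemBackedTableMetadata and operates over Hadoop FileSystem calls):

```python
import os

PARTITION_METADATA_FILE = ".hoodie_partition_metadata"


def list_partitions_by_listing(base_path):
    """Original approach: list every file under every directory, then
    check whether the partition marker appears among them (cost grows
    with the number of files)."""
    partitions = []
    for root, _dirs, files in os.walk(base_path):
        if PARTITION_METADATA_FILE in files:
            partitions.append(os.path.relpath(root, base_path))
    return sorted(partitions)


def list_partitions_by_probe(base_path, candidate_dirs):
    """Optimized approach: issue one existence check per candidate
    directory instead of listing its contents (cost grows only with the
    number of directories)."""
    return sorted(
        d for d in candidate_dirs
        if os.path.isfile(os.path.join(base_path, d, PARTITION_METADATA_FILE))
    )
```

On object stores like S3, an existence check is a single HEAD-style request per directory, while a listing pages through every object, which is why the probe variant wins in the benchmark table above.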
[GitHub] [hudi] alexeykudinkin commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
alexeykudinkin commented on pull request #4606: URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017049008 @nsivabalan correct, all configs are kept and marked as deprecated. The only thing that changes is that some of them no longer have any effect. How should we handle this? For example, `LAYOUT_OPTIMIZATION_ENABLE` is not used anymore, but that should not affect users: 1. Those that didn't use Clustering based on Spatial Curves stay the same (other configs are required for that). 2. Those that did use Clustering based on Spatial Curves will also not be affected, because it also required clustering to be enabled (which they should already have enabled). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] watermelon12138 removed a comment on pull request #4637: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target table from multiple source tables.
watermelon12138 removed a comment on pull request #4637: URL: https://github.com/apache/hudi/pull/4637#issuecomment-1017048479 @nsivabalan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] watermelon12138 commented on pull request #4637: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target table from multiple source tables.
watermelon12138 commented on pull request #4637: URL: https://github.com/apache/hudi/pull/4637#issuecomment-1017048479 @nsivabalan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
alexeykudinkin commented on a change in pull request #4606: URL: https://github.com/apache/hudi/pull/4606#discussion_r788291798 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java ## @@ -207,41 +199,71 @@ .withDocumentation("Enable use z-ordering/space-filling curves to optimize the layout of table to boost query performance. " + "This parameter takes precedence over clustering strategy set using " + EXECUTION_STRATEGY_CLASS_NAME.key()); - public static final ConfigProperty LAYOUT_OPTIMIZE_STRATEGY = ConfigProperty + /** + * Determines ordering strategy in for records layout optimization. + * Currently, following strategies are supported + * + * Linear: simply orders records lexicographically + * Z-order: orders records along Z-order spatial-curve + * Hilbert: orders records along Hilbert's spatial-curve + * + * + * NOTE: "z-order", "hilbert" strategies may consume considerably more compute, than "linear". + * Make sure to perform small-scale local testing for your dataset before applying globally. + */ + public static final ConfigProperty LAYOUT_OPTIMIZE_STRATEGY = ConfigProperty .key(LAYOUT_OPTIMIZE_PARAM_PREFIX + "strategy") .defaultValue("z-order") .sinceVersion("0.10.0") - .withDocumentation("Type of layout optimization to be applied, current only supports `z-order` and `hilbert` curves."); + .withDocumentation("Determines ordering strategy used in records layout optimization. " + + "Currently supported strategies are \"linear\", \"z-order\" and \"hilbert\" values are supported."); /** - * There exists two method to build z-curve. - * one is directly mapping sort cols to z-value to build z-curve; - * we can find this method in Amazon DynamoDB https://aws.amazon.com/cn/blogs/database/tag/z-order/ - * the other one is Boundary-based Interleaved Index method which we proposed. simply call it sample method. - * Refer to rfc-28 for specific algorithm flow. 
- * Boundary-based Interleaved Index method has better generalization, but the build speed is slower than direct method. + * NOTE: This setting only has effect if {@link #LAYOUT_OPTIMIZE_STRATEGY} value is set to + * either "z-order" or "hilbert" (ie leveraging space-filling curves) + * + * Currently, two methods to order records along the curve are supported "build" and "sample": Review comment: Good catch! ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java ## @@ -134,16 +134,28 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, HoodieEngineContext * @return {@link RDDCustomColumnsSortPartitioner} if sort columns are provided, otherwise empty. */ protected Option> getPartitioner(Map strategyParams, Schema schema) { -if (getWriteConfig().isLayoutOptimizationEnabled()) { - // sort input records by z-order/hilbert - return Option.of(new RDDSpatialCurveOptimizationSortPartitioner((HoodieSparkEngineContext) getEngineContext(), - getWriteConfig(), HoodieAvroUtils.addMetadataFields(schema))); -} else if (strategyParams.containsKey(PLAN_STRATEGY_SORT_COLUMNS.key())) { - return Option.of(new RDDCustomColumnsSortPartitioner(strategyParams.get(PLAN_STRATEGY_SORT_COLUMNS.key()).split(","), - HoodieAvroUtils.addMetadataFields(schema), getWriteConfig().isConsistentLogicalTimestampEnabled())); -} else { - return Option.empty(); -} +Option orderByColumnsOpt = +Option.ofNullable(strategyParams.get(PLAN_STRATEGY_SORT_COLUMNS.key())) +.map(listStr -> listStr.split(",")); + +return orderByColumnsOpt.map(orderByColumns -> { Review comment: It will fallback to no-op in that case -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
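The "direct mapping" Z-curve method mentioned in the review comment above — interleaving the bits of the sort columns to form a Z-value — can be illustrated in Python. This is a simplified two-column sketch under the assumption of non-negative integer keys; it is not Hudi's implementation, which also supports the boundary-based interleaved-index (sample) method from RFC-28:

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two non-negative ints: bit i of x lands at
    position 2*i and bit i of y at position 2*i + 1, producing a key
    along the Z-order space-filling curve."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Sorting records by their Z-value keeps points that are close in the
# (x, y) plane close together in the sorted order, which is the locality
# that layout optimization exploits for file skipping.
records = [(3, 5), (0, 0), (7, 1), (2, 2)]
ordered = sorted(records, key=lambda r: z_value(*r))
```

A "linear" strategy, by contrast, would simply sort lexicographically on (x, y); the spatial-curve strategies cost more compute per record but preserve locality across all sort columns at once.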
[GitHub] [hudi] dongkelun edited a comment on issue #4642: [SUPPORT] Hudi Merge Into
dongkelun edited a comment on issue #4642: URL: https://github.com/apache/hudi/issues/4642#issuecomment-1017039375 Instead of using a temporary table, try changing the target table to an entity (non-temporary) table. The current version likely does not support a temporary table as the merge target. And the target table must be a Hudi table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org