[GitHub] [hudi] wxplovecc closed pull request #5185: [HUDI-3758] Optimize flink partition table with BucketIndex

2022-04-25 Thread GitBox


wxplovecc closed pull request #5185: [HUDI-3758] Optimize flink partition table 
with BucketIndex
URL: https://github.com/apache/hudi/pull/5185





[GitHub] [hudi] wxplovecc opened a new pull request, #5185: [HUDI-3758] Optimize flink partition table with BucketIndex

2022-04-25 Thread GitBox


wxplovecc opened a new pull request, #5185:
URL: https://github.com/apache/hudi/pull/5185

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   When using the Flink bucket index, I ran into two problems:
   
   - Not all stream writer tasks are used when a partitioned table has a small 
bucket number.
   - The job crashes after the following steps:
   1. start the job
   2. kill it before the first commit succeeds (leaving some log files behind)
   3. restart the job; it runs normally after one successful commit
   4. kill the job and restart it; it throws `Duplicate fileID 
0001-6f57-4c71-bf6f-ee7616ec7b14 from bucket 1 of partition  found during 
the BucketStreamWriteFunction index bootstrap`
   
![image](https://user-images.githubusercontent.com/3350718/161009251-a46a692d-60c9-465e-8bf8-6f33887905db.png)
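
   For context, a minimal sketch of the kind of partitioned bucket-index table 
this applies to, written against the Flink Java Table API; the table name, 
schema, path, and bucket count are hypothetical, while `index.type` and 
`hoodie.bucket.index.num.buckets` are the hudi-flink bucket index options:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class BucketIndexTableSketch {
  public static void main(String[] args) {
    TableEnvironment tableEnv =
        TableEnvironment.create(EnvironmentSettings.inStreamingMode());
    // A partitioned MOR table whose bucket number is small relative to the
    // stream-write parallelism: the setup under which both problems above
    // were observed.
    tableEnv.executeSql(
        "CREATE TABLE t1 ("
            + " id STRING PRIMARY KEY NOT ENFORCED,"
            + " ts TIMESTAMP(3),"
            + " dt STRING"
            + ") PARTITIONED BY (dt) WITH ("
            + " 'connector' = 'hudi',"
            + " 'path' = 'hdfs:///tmp/hudi/t1',"
            + " 'table.type' = 'MERGE_ON_READ',"
            + " 'index.type' = 'BUCKET',"
            + " 'hoodie.bucket.index.num.buckets' = '4'"
            + ")");
  }
}
```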
   
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   





[GitHub] [hudi] SabyasachiDasTR opened a new issue, #5422: Enabling metadata on MOR table causes FileNotFound exception [SUPPORT]

2022-04-25 Thread GitBox


SabyasachiDasTR opened a new issue, #5422:
URL: https://github.com/apache/hudi/issues/5422

   **Describe the problem you faced**
   
   We are incrementally upserting data into our Hudi table(s) every 5 minutes. 
As we process this data, we notice the error below and the upserts fail. The 
only operation we execute is upsert; we never call bulk insert/insert, and we 
use a single writer. This started happening when we enabled the metadata table, 
with the rest of the properties the same as before.
   Hudi table: MOR table 
   Compaction: inline compaction
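
   For reference, a minimal sketch of the write path described above, using the 
Spark Java API; the app/table names, key and ordering columns, and paths are 
hypothetical, while the option keys are standard Hudi write configs:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class MetadataUpsertSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-upsert").getOrCreate();
    Dataset<Row> batch = spark.read().parquet("s3://bucket/incoming/"); // hypothetical input

    batch.write().format("hudi")
        .option("hoodie.table.name", "my_table")
        .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
        .option("hoodie.datasource.write.operation", "upsert")      // the only operation used
        .option("hoodie.datasource.write.recordkey.field", "id")    // hypothetical key column
        .option("hoodie.datasource.write.precombine.field", "ts")   // hypothetical ordering column
        .option("hoodie.metadata.enable", "true")                   // enabling this surfaced the error
        .option("hoodie.compact.inline", "true")                    // inline compaction
        .mode(SaveMode.Append)
        .save("s3://bucket/table/");
  }
}
```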

   **StackTrace**
   
2022-04-21T12:31:34.433+ [WARN] [001qa_correlation_id] [org.apache.spark.scheduler.TaskSetManager] [TaskSetManager]: Lost task 181.0 in stage 769.0 (TID 177843) (ip-10-1--.aws-int.---.com executor 54): org.apache.hudi.exception.HoodieIOException: Failed to read footer for parquet s3://bucket/table/partition/7404f6ba-4b10-4d64-8d85-a7f855af18f3-1_15-273-66858_20220421115823.parquet
    at org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:178)
    at org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:194)
    at org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
    at org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
    at org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
    at org.apache.hudi.index.bloom.SparkHoodieBloomIndex.lambda$loadInvolvedFiles$dac7877d$1(SparkHoodieBloomIndex.java:179)
    at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
    at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
    at scala.collection.AbstractIterator.to(Iterator.scala:1429)
    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
    at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
    at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
    at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2281)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
   Caused by: java.io.FileNotFoundException: No such file or directory 's3://bucket/table/partition/7404f6ba-4b10-4d64-8d85-a7f855af18f3-1_15-273-66858_20220421115823.parquet'
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:521)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:694)
    at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:61)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:456)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:441)
    at org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:176)
    ... 33 more
   
   
   "throwable": [
 "Failed to upsert for commit time 20220421123126",
 "at 
org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:62)",
 "at 
org.apache.hudi.table.action.deltacommit.SparkUpsertDeltaCommitActionExecutor.execute(SparkUpsertDeltaCommitActionExecutor.java:46)",
 "at 
org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsert(HoodieSparkMergeOnReadTable.java:82)",

[jira] [Closed] (HUDI-3906) Prepare RC3 and run basic tests

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-3906.

Resolution: Done

> Prepare RC3 and run basic tests
> ---
>
> Key: HUDI-3906
> URL: https://issues.apache.org/jira/browse/HUDI-3906
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.11.0
>
>






[GitHub] [hudi] hudi-bot commented on pull request #5185: [HUDI-3758] Optimize flink partition table with BucketIndex

2022-04-25 Thread GitBox


hudi-bot commented on PR #5185:
URL: https://github.com/apache/hudi/pull/5185#issuecomment-1108201850

   
   ## CI report:
   
   * 64743cf541772b9addb74add001e5cd57916bc9d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8162)
 
   * 8b79ca5bfb14ff32fd84a52ea9432862da104d50 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #5185: [HUDI-3758] Fix duplicate fileId error in MOR table type with flink bucket hash Index

2022-04-25 Thread GitBox


hudi-bot commented on PR #5185:
URL: https://github.com/apache/hudi/pull/5185#issuecomment-1108205535

   
   ## CI report:
   
   * 64743cf541772b9addb74add001e5cd57916bc9d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8162)
 
   * 8b79ca5bfb14ff32fd84a52ea9432862da104d50 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8288)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] wxplovecc commented on pull request #5185: [HUDI-3758] Fix duplicate fileId error in MOR table type with flink bucket hash Index

2022-04-25 Thread GitBox


wxplovecc commented on PR #5185:
URL: https://github.com/apache/hudi/pull/5185#issuecomment-1108211728

   Done @garyli1019 





[GitHub] [hudi] XuQianJin-Stars commented on a diff in pull request #5238: [DOC]Add schema evolution doc for sparksql

2022-04-25 Thread GitBox


XuQianJin-Stars commented on code in PR #5238:
URL: https://github.com/apache/hudi/pull/5238#discussion_r857356824


##
website/docs/quick-start-guide.md:
##
@@ -1095,6 +1095,178 @@ Currently,  the result of `show partitions` is based on 
the filesystem table pat
 
 :::
 
+## Schema evolution
+Schema evolution allows users to easily change the current schema of a Hudi 
table to adapt to data that changes over time.
+As of the 0.11.0 release, Spark SQL (Spark 3.1.x and 3.2.1) DDL support for 
schema evolution has been added and is experimental.
+
+### Schema Evolution Scenarios
+1) Columns (including nested columns) can be added, deleted, modified, and 
moved.
+2) Partition columns cannot be evolved.
+3) You cannot add, delete, or perform operations on nested columns of the 
Array type.
+
+## SparkSQL Schema Evolution and Syntax Description
+Before using schema evolution, please set `spark.sql.extensions`. For Spark 3.2.1, 
`spark.sql.catalog.spark_catalog` also needs to be set.
+```shell
+# Spark SQL for spark 3.1.x
+spark-sql --packages 
org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.12:3.1.2
 \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--conf 
'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
+
+# Spark SQL for spark 3.2.1
+spark-sql --packages 
org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.12:3.2.1
 \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--conf 
'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
+--conf 
'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
+
+```
+After starting the Spark app, please execute `set schema.on.read.enable=true` to 
enable schema evolution.
+
+:::note
+Currently, schema evolution cannot be disabled once it has been enabled.
+
+
+:::
+
+### Adding Columns
+**Syntax**
+```sql
+-- add columns
+ALTER TABLE tableName ADD COLUMNS(col_spec[, col_spec ...])
+```
+**Parameter Description**
+
+| Parameter   | Description  |
+|-|--|
+| tableName   | Table name   |
+| col_spec| Column specifications, consisting of five fields, 
*col_name*, *col_type*, *nullable*, *comment*, and *col_position*.|
+
+**col_name** : name of the new column. It is mandatory. To add a sub-column to 
a nested column, specify the full name of the sub-column in this field.
+
+For example:
+
+1. To add sub-column col1 to a nested struct type column (e.g. users 
struct<...>), set this field to users.col1.
+
+2. To add sub-column col1 to a nested map type column whose map value is a 
struct (e.g. member map<...>), set this field to member.value.col1.
+
+**col_type** : type of the new column.
+
+**nullable** : whether the new column can be null. The value can be left 
empty. This field is currently not used in Hudi.
+
+**comment** : comment of the new column. The value can be left empty.
+
+**col_position** : position where the new column is added. The value can be 
*FIRST* or *AFTER* origin_col.
+
+1. If it is set to *FIRST*, the new column will be added as the first column 
of the table.
+
+2. If it is set to *AFTER* origin_col, the new column will be added after the 
original column origin_col.
+
+3. The value can be left empty. *FIRST* can be used only when new sub-columns 
are added to nested columns. Do not use *FIRST* on top-level columns. There are 
no restrictions on the usage of *AFTER*.
+
+**Examples**
+
+```sql
+alter table h0 add columns(ext0 string);
+alter table h0 add columns(new_col int not null comment 'add new column' after 
col1);
+alter table complex_table add columns(col_struct.col_name string comment 'add 
new column to a struct col' after col_from_col_struct);
+```
+
+### Altering Columns
+**Syntax**
+```sql
+-- alter table ... alter column
+ALTER TABLE tableName ALTER [COLUMN] col_old_name TYPE column_type [COMMENT] 
col_comment [FIRST|AFTER] column_name
+```
+
+**Parameter Description**
+
+| Parameter   | Description  |
+|-|--|
+| tableName  | Table name.   |
+| col_old_name   | Name of the column to be altered.|
+| column_type| Type of the target column.|
+| col_comment| Comment of the target column.|
+| column_name| New position to place the target column. For example, 
*AFTER* **column_name** indicates that the target column is placed after 
**column_name**.|
+
+
+**Examples**
+
+```sql
+--- Changing the column type
+ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint
+
+--- Altering other attributes
+ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
+ALTER TABLE table1 ALTER COLUMN a.b.c FIRST
+ALTER TABLE table1 ALTER COLUMN a.b.c AFTER x
+ALTER TABLE table1 ALTER COLUMN a.b.c DROP NOT NULL
+```
+
+**column type change**
+
+| old_type| new_type|
+|-|-|
+| int 

[GitHub] [hudi] hudi-bot commented on pull request #5410: [HUDI-3953]Flink Hudi module should support low-level read and write…

2022-04-25 Thread GitBox


hudi-bot commented on PR #5410:
URL: https://github.com/apache/hudi/pull/5410#issuecomment-1108293556

   
   ## CI report:
   
   * 0cccfe39468ec699aae59d0028dcc949e9161a6d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #5410: [HUDI-3953]Flink Hudi module should support low-level read and write…

2022-04-25 Thread GitBox


hudi-bot commented on PR #5410:
URL: https://github.com/apache/hudi/pull/5410#issuecomment-1108298432

   
   ## CI report:
   
   * 0cccfe39468ec699aae59d0028dcc949e9161a6d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8289)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #5185: [HUDI-3758] Fix duplicate fileId error in MOR table type with flink bucket hash Index

2022-04-25 Thread GitBox


hudi-bot commented on PR #5185:
URL: https://github.com/apache/hudi/pull/5185#issuecomment-1108339791

   
   ## CI report:
   
   * 8b79ca5bfb14ff32fd84a52ea9432862da104d50 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8288)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hadonchen commented on pull request #5420: Flink hive sync got ClassNotFoundException (org/apache/hudi/org/apache/hadoop/hive/ql/metadata/Hive)

2022-04-25 Thread GitBox


hadonchen commented on PR #5420:
URL: https://github.com/apache/hudi/pull/5420#issuecomment-1108345795

   Yes, I used the `release 0.10.1` branch to build the project. 
   And I found that when I build only the `hudi-flink-bundle` module (in path 
packaging/hudi-flink-bundle) the class can be found, but when I build all 
modules (in the project root path) the class cannot be found. Please fix the 
`release 0.10.0` and `release 0.10.1` branches. Thanks.





[GitHub] [hudi] danny0405 commented on pull request #5420: Flink hive sync got ClassNotFoundException (org/apache/hudi/org/apache/hadoop/hive/ql/metadata/Hive)

2022-04-25 Thread GitBox


danny0405 commented on PR #5420:
URL: https://github.com/apache/hudi/pull/5420#issuecomment-1108363673

   > Yes, I used the `release 0.10.1` branch to build the project. And I found 
that when I build only the `hudi-flink-bundle` module (in path 
packaging/hudi-flink-bundle) the class can be found, but when I build all 
modules (in the project root path) the class cannot be found. Please fix the 
`release 0.10.0` and `release 0.10.1` branches. Thanks.
   
   That's weird, the hive shade pattern seems overridden by the parent pom.





[GitHub] [hudi] chrischnweiss commented on issue #4887: [SUPPORT] Unexpected behaviour with partitioned hudi tables with impala as query engine

2022-04-25 Thread GitBox


chrischnweiss commented on issue #4887:
URL: https://github.com/apache/hudi/issues/4887#issuecomment-1108370616

   Hi guys,
   
   so it seems that we fixed the problem of reading partitioned Hudi tables 
with Impala by changing the Hudi operation from `insert_overwrite` to `upsert`. 
Maybe it's a problem of how Hudi handles insert-overwrites on the filesystem, 
paired with how Impala reads the resulting tables?
   
   Cheers,
   Christian
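
   For anyone applying the same workaround: the change amounts to switching the 
write operation option, as in this hedged sketch (the helper, table name, and 
paths are hypothetical; the option key is the standard Hudi write config):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class OperationSwitchSketch {
  // Writes a batch with the given Hudi operation; switching the value from
  // "insert_overwrite" to "upsert" is the workaround described above.
  static void writeBatch(Dataset<Row> df, String basePath, String operation) {
    df.write().format("hudi")
        .option("hoodie.table.name", "events")                  // hypothetical table name
        .option("hoodie.datasource.write.operation", operation) // "insert_overwrite" -> "upsert"
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```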





[GitHub] [hudi] danny0405 commented on pull request #5410: [HUDI-3953]Flink Hudi module should support low-level read and write…

2022-04-25 Thread GitBox


danny0405 commented on PR #5410:
URL: https://github.com/apache/hudi/pull/5410#issuecomment-1108384622

   Can we give some explanation about why we need this change? It looks like a 
code refactor, but I see no gains.





[GitHub] [hudi] danny0405 commented on pull request #5087: [HUDI-3614] [DO_NOT_MERGE]Replace List with HoodieData in HoodieFlink/JavaTable and commit executors

2022-04-25 Thread GitBox


danny0405 commented on PR #5087:
URL: https://github.com/apache/hudi/pull/5087#issuecomment-1108388472

   There are some conflicts; can you rebase the code onto the latest master?





[GitHub] [hudi] danny0405 commented on pull request #5087: [HUDI-3614] [DO_NOT_MERGE]Replace List with HoodieData in HoodieFlink/JavaTable and commit executors

2022-04-25 Thread GitBox


danny0405 commented on PR #5087:
URL: https://github.com/apache/hudi/pull/5087#issuecomment-1108391806

   > I see there has to be more code design / refactoring work to achieve it.
   
   Before we do that, my suggestion is to keep the current code as it is. It is 
an API refactoring and we should stay cautious.





[GitHub] [hudi] JerryYue-M commented on pull request #5410: [HUDI-3953]Flink Hudi module should support low-level read and write…

2022-04-25 Thread GitBox


JerryYue-M commented on PR #5410:
URL: https://github.com/apache/hudi/pull/5410#issuecomment-1108399547

   
   
   
   
   > Can we give some explanation about why we need this change? It looks like a 
code refactor, but I see no gains.
   > 
   > One rule is that we should not introduce new classes with APIs similar to 
existing public APIs.
   
   Indeed, it involves some code refactoring. The gain is to reuse the existing 
SQL API code to implement a low-level API for users, e.g.:
   Write:
```
   DataStream<RowData> input = dataStreamGen();
   Map<String, String> confMap = new HashMap<>();
   confMap.put("connector", "hudi");
   confMap.put("table.type", "MERGE_ON_READ");
   confMap.put("path", "hdfs://127.0.0.1:9000/hudi/hudi_db/mor_test9");
   
   SinkBuilder.builder()
       .input(input)
       .options(confMap)
       .partitions(Arrays.asList("dt", "hr"))
       .schema(CREATE_TABLE_SCHEMA)
       .sink();
   ```
   Read:
   ```
   Map<String, String> confMap = new HashMap<>();
   confMap.put("connector", "hudi");
   confMap.put("table.type", "MERGE_ON_READ");
   confMap.put("path", "hdfs://127.0.0.1:9000/hudi/hudi_db/mor_test9");
   confMap.put("read.streaming.enabled", "true");
   confMap.put("read.streaming.check-interval", "4");
   
   DataStream<RowData> rowDataStream = SourceBuilder
       .builder()
       .env(getEnv())
       .schema(CREATE_TABLE_SCHEMA)
       .options(confMap)
       .partitions(Arrays.asList("dt", "hr"))
       .source();
   rowDataStream.print();
   ```
   
   





[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #5238: [DOC]Add schema evolution doc for sparksql

2022-04-25 Thread GitBox


xiarixiaoyao commented on code in PR #5238:
URL: https://github.com/apache/hudi/pull/5238#discussion_r857491914


##
website/docs/quick-start-guide.md:
##
@@ -1095,6 +1095,178 @@ Currently,  the result of `show partitions` is based on 
the filesystem table pat
 
 :::
 
+## Schema evolution
+Schema evolution allows users to easily change the current schema of a Hudi 
table to adapt to data that changes over time.
+As of the 0.11.0 release, Spark SQL (Spark 3.1.x and 3.2.1) DDL support for 
schema evolution has been added and is experimental.
+
+### Schema Evolution Scenarios
+1) Columns (including nested columns) can be added, deleted, modified, and 
moved.
+2) Partition columns cannot be evolved.
+3) You cannot add, delete, or perform operations on nested columns of the 
Array type.
+
+## SparkSQL Schema Evolution and Syntax Description
+Before using schema evolution, please set `spark.sql.extensions`. For Spark 3.2.1, 
`spark.sql.catalog.spark_catalog` also needs to be set.
+```shell
+# Spark SQL for spark 3.1.x
+spark-sql --packages 
org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.12:3.1.2
 \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--conf 
'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
+
+# Spark SQL for spark 3.2.1
+spark-sql --packages 
org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.12:3.2.1
 \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--conf 
'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
+--conf 
'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
+
+```
+After starting the Spark app, please execute `set schema.on.read.enable=true` to 
enable schema evolution.
+
+:::note
+Currently, schema evolution cannot be disabled once it has been enabled.
+
+
+:::
+
+### Adding Columns
+**Syntax**
+```sql
+-- add columns
+ALTER TABLE tableName ADD COLUMNS(col_spec[, col_spec ...])
+```
+**Parameter Description**
+
+| Parameter   | Description  |
+|-|--|
+| tableName   | Table name   |
+| col_spec| Column specifications, consisting of five fields, 
*col_name*, *col_type*, *nullable*, *comment*, and *col_position*.|
+
+**col_name** : name of the new column. It is mandatory. To add a sub-column to 
a nested column, specify the full name of the sub-column in this field.
+
+For example:
+
+1. To add sub-column col1 to a nested struct type column (e.g. users 
struct<...>), set this field to users.col1.
+
+2. To add sub-column col1 to a nested map type column whose map value is a 
struct (e.g. member map<...>), set this field to member.value.col1.
+
+**col_type** : type of the new column.
+
+**nullable** : whether the new column can be null. The value can be left 
empty. This field is currently not used in Hudi.
+
+**comment** : comment of the new column. The value can be left empty.
+
+**col_position** : position where the new column is added. The value can be 
*FIRST* or *AFTER* origin_col.
+
+1. If it is set to *FIRST*, the new column will be added as the first column 
of the table.
+
+2. If it is set to *AFTER* origin_col, the new column will be added after the 
original column origin_col.
+
+3. The value can be left empty. *FIRST* can be used only when new sub-columns 
are added to nested columns. Do not use *FIRST* on top-level columns. There are 
no restrictions on the usage of *AFTER*.
+
+**Examples**
+
+```sql
+alter table h0 add columns(ext0 string);
+alter table h0 add columns(new_col int not null comment 'add new column' after 
col1);
+alter table complex_table add columns(col_struct.col_name string comment 'add 
new column to a struct col' after col_from_col_struct);
+```
+
+### Altering Columns
+**Syntax**
+```sql
+-- alter table ... alter column
+ALTER TABLE tableName ALTER [COLUMN] col_old_name TYPE column_type [COMMENT] 
col_comment [FIRST|AFTER] column_name
+```
+
+**Parameter Description**
+
+| Parameter   | Description  |
+|-|--|
+| tableName  | Table name.   |
+| col_old_name   | Name of the column to be altered.|
+| column_type| Type of the target column.|
+| col_comment| Comment of the target column.|
+| column_name| New position to place the target column. For example, 
*AFTER* **column_name** indicates that the target column is placed after 
**column_name**.|
+
+
+**Examples**
+
+```sql
+--- Changing the column type
+ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint
+
+--- Altering other attributes
+ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
+ALTER TABLE table1 ALTER COLUMN a.b.c FIRST
+ALTER TABLE table1 ALTER COLUMN a.b.c AFTER x
+ALTER TABLE table1 ALTER COLUMN a.b.c DROP NOT NULL
+```
+
+**column type change**
+
+| old_type| new_type|
+|-|-|
+| int | l

[GitHub] [hudi] leesf merged pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction

2022-04-25 Thread GitBox


leesf merged PR #4441:
URL: https://github.com/apache/hudi/pull/4441





[hudi] branch master updated: [HUDI-3085] Improve bulk insert partitioner abstraction (#4441)

2022-04-25 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new f2ba0fead2 [HUDI-3085] Improve bulk insert partitioner abstraction 
(#4441)
f2ba0fead2 is described below

commit f2ba0fead24072e37c68477d0cfc3489fa098938
Author: Yuwei XIAO 
AuthorDate: Mon Apr 25 18:42:17 2022 +0800

[HUDI-3085] Improve bulk insert partitioner abstraction (#4441)
---
 .../apache/hudi/table/BulkInsertPartitioner.java   | 28 +-
 .../table/action/commit/BaseBulkInsertHelper.java  |  2 +-
 .../run/strategy/JavaExecutionStrategy.java| 11 +
 .../table/action/commit/JavaBulkInsertHelper.java  | 17 +++--
 .../MultipleSparkJobExecutionStrategy.java |  5 ++--
 .../SparkSingleFileSortExecutionStrategy.java  |  1 +
 .../bulkinsert/BulkInsertMapFunction.java  | 11 +
 .../bulkinsert/RDDSpatialCurveSortPartitioner.java | 11 -
 .../table/action/commit/SparkBulkInsertHelper.java | 24 +++
 9 files changed, 65 insertions(+), 45 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/BulkInsertPartitioner.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/BulkInsertPartitioner.java
index fd1558a823..63b502531a 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/BulkInsertPartitioner.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/BulkInsertPartitioner.java
@@ -18,12 +18,18 @@
 
 package org.apache.hudi.table;
 
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.io.WriteHandleFactory;
+
+import java.io.Serializable;
+
 /**
  * Repartition input records into at least expected number of output spark 
partitions. It should give below guarantees -
  * Output spark partition will have records from only one hoodie partition. - 
Average records per output spark
  * partitions should be almost equal to (#inputRecords / 
#outputSparkPartitions) to avoid possible skews.
  */
-public interface BulkInsertPartitioner<I> {
+public interface BulkInsertPartitioner<I> extends Serializable {
 
   /**
* Repartitions the input records into at least expected number of output 
spark partitions.
@@ -38,4 +44,24 @@ public interface BulkInsertPartitioner {
* @return {@code true} if the records within a partition are sorted; {@code 
false} otherwise.
*/
   boolean arePartitionRecordsSorted();
+
+  /**
+   * Return the file group id prefix for the given data partition.
+   * By default, return a new file group id prefix, so that incoming records 
will route to a fresh new file group.
+   * @param partitionId data partition
+   * @return file group id prefix
+   */
+  default String getFileIdPfx(int partitionId) {
+    return FSUtils.createNewFileIdPfx();
+  }
+
+  /**
+   * Return the write handle factory for the given partition.
+   * @param partitionId data partition
+   * @return write handle factory, if any
+   */
+  default Option<WriteHandleFactory> getWriteHandleFactory(int partitionId) {
+    return Option.empty();
+  }
+
 }
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseBulkInsertHelper.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseBulkInsertHelper.java
index ad2145c350..5355194ff7 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseBulkInsertHelper.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseBulkInsertHelper.java
@@ -42,7 +42,7 @@ public abstract class BaseBulkInsertHelper table, 
HoodieWriteConfig config,
boolean performDedupe,
-   Option 
userDefinedBulkInsertPartitioner,
+   BulkInsertPartitioner partitioner,
boolean addMetadataFields,
int parallelism,
WriteHandleFactory writeHandleFactory);
diff --git 
a/hudi-client/hudi-java-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/JavaExecutionStrategy.java
 
b/hudi-client/hudi-java-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/JavaExecutionStrategy.java
index 7d7609f0fa..233c70ecf9 100644
--- 
a/hudi-client/hudi-java-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/JavaExecutionStrategy.java
+++ 
b/hudi-client/hudi-java-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/JavaExecutionStrategy.java
@@ -41,6 +41,7 @@ import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.exception.HoodieClusteringException;
+import 
org.apache.hudi.execution.bulkinsert.JavaBu
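
To illustrate the new abstraction, here is a hedged sketch of a custom 
partitioner overriding the two new default hooks; the class is hypothetical and 
assumes the Spark engine, where the input is a JavaRDD of HoodieRecord:

```java
import org.apache.hudi.common.fs.FSUtils;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.io.WriteHandleFactory;
import org.apache.hudi.table.BulkInsertPartitioner;
import org.apache.spark.api.java.JavaRDD;

// Hypothetical pass-through partitioner showing where the new hooks sit.
public class IdentityBulkInsertPartitioner
    implements BulkInsertPartitioner<JavaRDD<HoodieRecord>> {

  @Override
  public JavaRDD<HoodieRecord> repartitionRecords(JavaRDD<HoodieRecord> records,
                                                  int outputSparkPartitions) {
    // No sorting or re-bucketing; just match the requested parallelism.
    return records.coalesce(outputSparkPartitions);
  }

  @Override
  public boolean arePartitionRecordsSorted() {
    return false; // records within a partition are left unsorted
  }

  @Override
  public String getFileIdPfx(int partitionId) {
    // Same as the new default: route incoming records to a fresh file group.
    return FSUtils.createNewFileIdPfx();
  }

  @Override
  public Option<WriteHandleFactory> getWriteHandleFactory(int partitionId) {
    // Empty means the engine falls back to its default write handle factory.
    return Option.empty();
  }
}
```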

[GitHub] [hudi] JerryYue-M commented on pull request #5410: [HUDI-3953]Flink Hudi module should support low-level read and write…

2022-04-25 Thread GitBox


JerryYue-M commented on PR #5410:
URL: https://github.com/apache/hudi/pull/5410#issuecomment-1108403085

   @danny0405 
   > One rule is that we should not introduce new classes with APIs similar to 
existing public APIs.
   
   Thanks for the reminder. I will make some changes later.





[GitHub] [hudi] hudi-bot commented on pull request #5410: [HUDI-3953]Flink Hudi module should support low-level read and write…

2022-04-25 Thread GitBox


hudi-bot commented on PR #5410:
URL: https://github.com/apache/hudi/pull/5410#issuecomment-1108407276

   
   ## CI report:
   
   * 0cccfe39468ec699aae59d0028dcc949e9161a6d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8289)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hadonchen commented on pull request #5420: Flink hive sync got ClassNotFoundException (org/apache/hudi/org/apache/hadoop/hive/ql/metadata/Hive)

2022-04-25 Thread GitBox


hadonchen commented on PR #5420:
URL: https://github.com/apache/hudi/pull/5420#issuecomment-1108430967

   > > Yes, I used the `release 0.10.1` branch to build the project. And I found 
that when I build only the `hudi-flink-bundle` module (in path 
packaging/hudi-flink-bundle) the class can be found, but when I build all 
modules (in the project root path) the class cannot be found. Please fix the 
`release 0.10.0` and `release 0.10.1` branches. Thanks.
   > 
   > That's weird, the hive shade pattern seems overridden by the parent pom.
   
   Yes, so I changed the parent pom to fix it. I intended to commit to 0.10.0, 
but committed to the wrong branch. Thanks.





[jira] [Updated] (HUDI-3965) Spark sql dml w/ spark2 and scala12 fails w/ ClassNotFoundException for SparkSQLCLIDriver

2022-04-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3965:
--
Fix Version/s: 0.12.0

> Spark sql dml w/ spark2 and scala12 fails w/ ClassNotFoundException for 
> SparkSQLCLIDriver
> -
>
> Key: HUDI-3965
> URL: https://issues.apache.org/jira/browse/HUDI-3965
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>
> spark-sql dml when launched w/ spark2 and scala 12 profile, fails with 
> ClassNotFoundException: 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver for both 0.10.0 and 
> 0.11.0. 
>  
> {code:java}
> java.lang.ClassNotFoundException: 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:816)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:930)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:939)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Failed to load main class 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
> // launch command
> ./bin/spark-sql --jars /home/hadoop/hudi-spark2.4-bundle_2.12-0.11.0-rc3.jar 
> --packages org.apache.spark:spark-avro_2.12:2.4.8 --conf 
> 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension 
> {code}





[jira] [Updated] (HUDI-3965) Spark sql dml w/ spark2 and scala12 fails w/ ClassNotFoundException for SparkSQLCLIDriver

2022-04-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3965:
--
Sprint: Hudi-Sprint-Apr-19

> Spark sql dml w/ spark2 and scala12 fails w/ ClassNotFoundException for 
> SparkSQLCLIDriver
> -
>
> Key: HUDI-3965
> URL: https://issues.apache.org/jira/browse/HUDI-3965
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Priority: Major
>
> spark-sql dml when launched w/ spark2 and scala 12 profile, fails with 
> ClassNotFoundException: 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver for both 0.10.0 and 
> 0.11.0. 
>  
> {code:java}
> java.lang.ClassNotFoundException: 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:816)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:930)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:939)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Failed to load main class 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
> // launch command
> ./bin/spark-sql --jars /home/hadoop/hudi-spark2.4-bundle_2.12-0.11.0-rc3.jar 
> --packages org.apache.spark:spark-avro_2.12:2.4.8 --conf 
> 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension 
> {code}





[jira] [Updated] (HUDI-3965) Spark sql dml w/ spark2 and scala12 fails w/ ClassNotFoundException for SparkSQLCLIDriver

2022-04-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3965:
--
Priority: Blocker  (was: Major)

> Spark sql dml w/ spark2 and scala12 fails w/ ClassNotFoundException for 
> SparkSQLCLIDriver
> -
>
> Key: HUDI-3965
> URL: https://issues.apache.org/jira/browse/HUDI-3965
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.12.0
>
>
> spark-sql dml when launched w/ spark2 and scala 12 profile, fails with 
> ClassNotFoundException: 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver for both 0.10.0 and 
> 0.11.0. 
>  
> {code:java}
> java.lang.ClassNotFoundException: 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:816)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:930)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:939)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Failed to load main class 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
> // launch command
> ./bin/spark-sql --jars /home/hadoop/hudi-spark2.4-bundle_2.12-0.11.0-rc3.jar 
> --packages org.apache.spark:spark-avro_2.12:2.4.8 --conf 
> 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension 
> {code}





[jira] [Created] (HUDI-3965) Spark sql dml w/ spark2 and scala12 fails w/ ClassNotFoundException for SparkSQLCLIDriver

2022-04-25 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-3965:
-

 Summary: Spark sql dml w/ spark2 and scala12 fails w/ 
ClassNotFoundException for SparkSQLCLIDriver
 Key: HUDI-3965
 URL: https://issues.apache.org/jira/browse/HUDI-3965
 Project: Apache Hudi
  Issue Type: Task
  Components: spark-sql
Reporter: sivabalan narayanan


spark-sql dml when launched w/ spark2 and scala 12 profile, fails with 
ClassNotFoundException: 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver for both 0.10.0 and 
0.11.0. 

 
{code:java}
java.lang.ClassNotFoundException: 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:816)
at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:930)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:939)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Failed to load main class 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.


// launch command
./bin/spark-sql --jars /home/hadoop/hudi-spark2.4-bundle_2.12-0.11.0-rc3.jar 
--packages org.apache.spark:spark-avro_2.12:2.4.8 --conf 
'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 
'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension 
{code}





[jira] [Updated] (HUDI-3965) Spark sql dml w/ spark2 and scala12 fails w/ ClassNotFoundException for SparkSQLCLIDriver

2022-04-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3965:
--
Sprint:   (was: Hudi-Sprint-Apr-19)

> Spark sql dml w/ spark2 and scala12 fails w/ ClassNotFoundException for 
> SparkSQLCLIDriver
> -
>
> Key: HUDI-3965
> URL: https://issues.apache.org/jira/browse/HUDI-3965
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Priority: Major
>
> spark-sql dml when launched w/ spark2 and scala 12 profile, fails with 
> ClassNotFoundException: 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver for both 0.10.0 and 
> 0.11.0. 
>  
> {code:java}
> java.lang.ClassNotFoundException: 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:816)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:930)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:939)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Failed to load main class 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
> // launch command
> ./bin/spark-sql --jars /home/hadoop/hudi-spark2.4-bundle_2.12-0.11.0-rc3.jar 
> --packages org.apache.spark:spark-avro_2.12:2.4.8 --conf 
> 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension 
> {code}





[jira] [Updated] (HUDI-3752) Update website content based on 0.11 new features

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3752:
-
Story Points: 0  (was: 2)

> Update website content based on 0.11 new features
> -
>
> Key: HUDI-3752
> URL: https://issues.apache.org/jira/browse/HUDI-3752
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: docs
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Minor
> Fix For: 0.11.0
>
>
> content to update
> - utilities slim bundle https://github.com/apache/hudi/pull/5184/files
> - schema evol
> https://github.com/apache/hudi/pull/5238
> - spark bundle compatibilities
> https://github.com/apache/hudi/pull/5297





[jira] [Updated] (HUDI-3752) Update website content based on 0.11 new features

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3752:
-
Fix Version/s: 0.11.0

> Update website content based on 0.11 new features
> -
>
> Key: HUDI-3752
> URL: https://issues.apache.org/jira/browse/HUDI-3752
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: docs
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Minor
> Fix For: 0.11.0
>
>
> content to update
> - utilities slim bundle https://github.com/apache/hudi/pull/5184/files
> - schema evol
> https://github.com/apache/hudi/pull/5238
> - spark bundle compatibilities
> https://github.com/apache/hudi/pull/5297





[jira] [Updated] (HUDI-3654) Support basic actions based on hudi metastore server

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3654:
-
Fix Version/s: 0.12.0

> Support basic actions based on hudi metastore server 
> -
>
> Key: HUDI-3654
> URL: https://issues.apache.org/jira/browse/HUDI-3654
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: metadata
>Reporter: XiaoyuGeng
>Assignee: XiaoyuGeng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>






[jira] [Updated] (HUDI-1823) Hive/Presto Integration with ORC

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1823:
-
Fix Version/s: 0.12.0

> Hive/Presto Integration with ORC
> 
>
> Key: HUDI-1823
> URL: https://issues.apache.org/jira/browse/HUDI-1823
> Project: Apache Hudi
>  Issue Type: Task
>  Components: storage-management
>Reporter: Teresa Kang
>Priority: Major
> Fix For: 0.12.0
>
>
> Implement HoodieOrcInputFormat to support ORC with spark/presto query engines.





[GitHub] [hudi] Baibaiwuguo opened a new pull request, #5423: [HUDI-3964] hudi-spark-bundle miss avro dependency

2022-04-25 Thread GitBox


Baibaiwuguo opened a new pull request, #5423:
URL: https://github.com/apache/hudi/pull/5423

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   hudi-spark-bundle is missing the avro dependency; I repaired it.
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
- hudi-spark-bundle: add the avro dependency
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   





[jira] [Updated] (HUDI-3964) hudi-spark-bundle lack avro dependency

2022-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-3964:
-
Labels: pull-request-available  (was: )

> hudi-spark-bundle lack avro dependency
> --
>
> Key: HUDI-3964
> URL: https://issues.apache.org/jira/browse/HUDI-3964
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: hanjie
>Priority: Major
>  Labels: pull-request-available
>
> The project lacks the avro dependency in the hudi-spark-bundle module.





[jira] [Updated] (HUDI-1236) [UMBRELLA] Integ Test suite infra

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1236:
-
Fix Version/s: 0.12.0

> [UMBRELLA] Integ Test suite infra 
> --
>
> Key: HUDI-1236
> URL: https://issues.apache.org/jira/browse/HUDI-1236
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Testing, tests-ci
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: hudi-umbrellas
> Fix For: 0.11.0, 0.12.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Long running test suite that checks for correctness across all deployment 
> modes (batch/streaming) and writers (deltastreamer/spark) and readers (hive, 
> presto, spark)





[GitHub] [hudi] hehuiyuan commented on a diff in pull request #5405: [HUDI-3951]support general parameter 'sink.parallelism' for flink-hudi

2022-04-25 Thread GitBox


hehuiyuan commented on code in PR #5405:
URL: https://github.com/apache/hudi/pull/5405#discussion_r857559082


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -232,6 +233,8 @@ private FlinkOptions() {
   // 
   //  Write Options
   // 
+  public static final ConfigOption SINK_PARALLELISM = 
FactoryUtil.SINK_PARALLELISM;
+
   public static final ConfigOption TABLE_NAME = ConfigOptions

Review Comment:
   @danny0405 ok.
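
   For readers following along, a minimal sketch of what re-exposing Flink's shared option looks like; only `FactoryUtil.SINK_PARALLELISM` comes from the diff above, the resolve helper and class name are illustrative:

   ```java
   // Sketch only: re-expose Flink's generic "sink.parallelism" option and
   // resolve it with a fallback. The resolve helper is hypothetical.
   import org.apache.flink.configuration.ConfigOption;
   import org.apache.flink.configuration.Configuration;
   import org.apache.flink.table.factories.FactoryUtil;

   public final class SinkParallelismSketch {
     // Same ConfigOption instance the kafka/jdbc connectors use.
     public static final ConfigOption<Integer> SINK_PARALLELISM = FactoryUtil.SINK_PARALLELISM;

     // Use the user-set value if present, otherwise the pipeline default.
     public static int resolve(Configuration conf, int defaultParallelism) {
       return conf.getOptional(SINK_PARALLELISM).orElse(defaultParallelism);
     }
   }
   ```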



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-1822) [Umbrella] Multi Modal Indexing

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1822:
-
Fix Version/s: 0.12.0

> [Umbrella] Multi Modal Indexing
> ---
>
> Key: HUDI-1822
> URL: https://issues.apache.org/jira/browse/HUDI-1822
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: index
>Reporter: satish
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: hudi-umbrellas, pull-request-available
> Fix For: 0.11.0, 0.12.0
>
>
> RFC-27 umbrella ticket. The goal is to support a global range index to improve 
> query planning time.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3317) Partition specific pointed lookup/reading strategy for metadata table

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3317:
-
Priority: Critical  (was: Blocker)

> Partition specific pointed lookup/reading strategy for metadata table
> -
>
> Key: HUDI-3317
> URL: https://issues.apache.org/jira/browse/HUDI-3317
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata, writer-core
>Reporter: Manoj Govindassamy
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.12.0
>
>
> Today inline reading can only be turned on for the entire metadata table, 
> meaning all partitions either have this feature enabled or not. But for smaller 
> partitions like "files", inlining is not preferable, as it turns off external 
> spillable map caching of records, whereas for other partitions like 
> bloom_filters, inline reading is preferred. We need a partition-specific inline 
> reading strategy for the metadata table.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3951) Support general parameter 'sink.parallelism' to define the parallelism of the hudi sink operator for flink.

2022-04-25 Thread hehuiyuan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hehuiyuan updated HUDI-3951:

Description: For the flink connector, we can set the sink operator parallelism 
by 'sink.parallelism'.

> Support general parameter 'sink.parallelism' to define the parallelism of the 
> hudi sink operator for flink.
> ---
>
> Key: HUDI-3951
> URL: https://issues.apache.org/jira/browse/HUDI-3951
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: hehuiyuan
>Priority: Major
>  Labels: pull-request-available
>
> For the flink connector, we can set the sink operator parallelism by 'sink.parallelism'.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-686:

Priority: Major  (was: Blocker)

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi
>  Issue Type: Task
>  Components: index, performance
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Rajesh Mahindra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-the-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-1613) Document for interfaces for Spark on Hive & Spark Datasource integrations with index

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1613:
-
Priority: Minor  (was: Blocker)

> Document for interfaces for Spark on Hive & Spark Datasource integrations 
> with index 
> -
>
> Key: HUDI-1613
> URL: https://issues.apache.org/jira/browse/HUDI-1613
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs, metadata
>Reporter: Nishith Agarwal
>Assignee: Udit Mehrotra
>Priority: Minor
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-1613) Document for interfaces for Spark on Hive & Spark Datasource integrations with index

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-1613.

Fix Version/s: (was: 0.12.0)
 Assignee: (was: Udit Mehrotra)
   Resolution: Abandoned

> Document for interfaces for Spark on Hive & Spark Datasource integrations 
> with index 
> -
>
> Key: HUDI-1613
> URL: https://issues.apache.org/jira/browse/HUDI-1613
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs, metadata
>Reporter: Nishith Agarwal
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-2199) DynamoDB based external index implementation

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2199:
-
Priority: Major  (was: Blocker)

> DynamoDB based external index implementation
> 
>
> Key: HUDI-2199
> URL: https://issues.apache.org/jira/browse/HUDI-2199
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: index
>Reporter: Vinoth Chandar
>Assignee: Biswajit mohapatra
>Priority: Major
> Fix For: 0.12.0
>
>
> We have an HBaseIndex that provides users with the ability to store fileID <=> 
> recordKey mappings in an external kv store, for fast lookups during upsert 
> operations. We can potentially create a similar one for DynamoDB. 
> We use just a single column family in HBase, so we should be able to largely 
> re-use the code/key-value schema across them. 
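
For illustration, a minimal sketch of the store-agnostic contract such an index implies; the interface and method names are hypothetical, not Hudi's actual HoodieIndex API:

{code:java}
import java.util.Optional;

// Hypothetical contract: recordKey <=> fileID mappings behind a pluggable
// kv store, so HBase- and DynamoDB-backed variants can share one shape.
interface ExternalRecordIndex {
  // tagLocation fast path: which file currently holds this record key?
  Optional<String> getFileId(String recordKey);

  // Upsert the mapping after a successful commit.
  void putFileId(String recordKey, String fileId);
}
{code}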



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-2199) DynamoDB based external index implementation

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2199:
-
Epic Link:   (was: HUDI-1822)

> DynamoDB based external index implementation
> 
>
> Key: HUDI-2199
> URL: https://issues.apache.org/jira/browse/HUDI-2199
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: index
>Reporter: Vinoth Chandar
>Assignee: Biswajit mohapatra
>Priority: Blocker
> Fix For: 0.12.0
>
>
> We have an HBaseIndex that provides users with the ability to store fileID <=> 
> recordKey mappings in an external kv store, for fast lookups during upsert 
> operations. We can potentially create a similar one for DynamoDB. 
> We use just a single column family in HBase, so we should be able to largely 
> re-use the code/key-value schema across them. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3951) Support general parameter 'sink.parallelism' to define the parallelism of the hudi sink operator for flink.

2022-04-25 Thread hehuiyuan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hehuiyuan updated HUDI-3951:

Description: 
For the flink connector, we can set the sink operator parallelism by the general 
parameter 'sink.parallelism', 

for example, kafka, jdbc and so on:

[https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/table/kafka/#sink-parallelism]

[https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/table/jdbc/#sink-parallelism]

 

In this jira, I want to add the parameter to set the sink operator parallelism. 

 

  was:For the flink connector, we can set the sink operator parallelism by 
'sink.parallelism'.


> Support general parameter 'sink.parallelism' to define the parallelism of the 
> hudi sink operator for flink.
> ---
>
> Key: HUDI-3951
> URL: https://issues.apache.org/jira/browse/HUDI-3951
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: hehuiyuan
>Priority: Major
>  Labels: pull-request-available
>
> For the flink connector, we can set the sink operator parallelism by the general 
> parameter 'sink.parallelism', 
> for example, kafka, jdbc and so on:
> [https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/table/kafka/#sink-parallelism]
> [https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/table/jdbc/#sink-parallelism]
>  
> In this jira, I want to add the parameter to set the sink operator parallelism. 
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3951) Support general parameter 'sink.parallelism' to define the parallelism of the hudi sink operator for flink.

2022-04-25 Thread hehuiyuan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hehuiyuan updated HUDI-3951:

Description: 
For the flink connector, we can set the sink operator parallelism by the general 
parameter 'sink.parallelism', 

for example, kafka, jdbc and so on:

[https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/table/kafka/#sink-parallelism]

[https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/table/jdbc/#sink-parallelism]

 

In this jira, I want to add the parameter to set the sink operator parallelism. 

 

  was:
For the flink connector, we can set the sink operator parallelism by the general 
parameter 'sink.parallelism', 

for example, kafka, jdbc and so on:

[https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/table/kafka/#sink-parallelism]

[https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/table/jdbc/#sink-parallelism]

 

In this jira, I want to add the parameter to set the sink operator parallelism. 

 


> Support general parameter 'sink.parallelism' to define the parallelism of the 
> hudi sink operator for flink.
> ---
>
> Key: HUDI-3951
> URL: https://issues.apache.org/jira/browse/HUDI-3951
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: hehuiyuan
>Priority: Major
>  Labels: pull-request-available
>
> For the flink connector, we can set the sink operator parallelism by the general 
> parameter 'sink.parallelism', 
> for example, kafka, jdbc and so on:
> [https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/table/kafka/#sink-parallelism]
> [https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/table/jdbc/#sink-parallelism]
>  
> In this jira, I want to add the parameter to set the sink operator parallelism. 
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-3966) External Index implementation

2022-04-25 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-3966:


 Summary: External Index implementation
 Key: HUDI-3966
 URL: https://issues.apache.org/jira/browse/HUDI-3966
 Project: Apache Hudi
  Issue Type: Epic
  Components: index
Reporter: Raymond Xu






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-2199) DynamoDB based external index implementation

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2199:
-
Fix Version/s: (was: 0.12.0)

> DynamoDB based external index implementation
> 
>
> Key: HUDI-2199
> URL: https://issues.apache.org/jira/browse/HUDI-2199
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: index
>Reporter: Vinoth Chandar
>Assignee: Biswajit mohapatra
>Priority: Major
>
> We have an HBaseIndex that provides users with the ability to store fileID <=> 
> recordKey mappings in an external kv store, for fast lookups during upsert 
> operations. We can potentially create a similar one for DynamoDB. 
> We use just a single column family in HBase, so we should be able to largely 
> re-use the code/key-value schema across them. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-2657) Make inlining configurable based on diff use-case.

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2657:
-
Priority: Major  (was: Blocker)

> Make inlining configurable based on diff use-case. 
> ---
>
> Key: HUDI-2657
> URL: https://issues.apache.org/jira/browse/HUDI-2657
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Prashant Wason
>Priority: Major
> Fix For: 0.12.0
>
>
> Make inlining configurable based on the use-case.
> The files partition, column_stats and bloom_filters might need inlining, but the 
> record level index may not need inline reading. 
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-2199) DynamoDB based external index implementation

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2199:
-
Epic Link: HUDI-3966

> DynamoDB based external index implementation
> 
>
> Key: HUDI-2199
> URL: https://issues.apache.org/jira/browse/HUDI-2199
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: index
>Reporter: Vinoth Chandar
>Assignee: Biswajit mohapatra
>Priority: Major
> Fix For: 0.12.0
>
>
> We have an HBaseIndex that provides users with the ability to store fileID <=> 
> recordKey mappings in an external kv store, for fast lookups during upsert 
> operations. We can potentially create a similar one for DynamoDB. 
> We use just a single column family in HBase, so we should be able to largely 
> re-use the code/key-value schema across them. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3166) Implement new HoodieIndex based on metadata indices

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3166:
-
Priority: Blocker  (was: Critical)

> Implement new HoodieIndex based on metadata indices 
> 
>
> Key: HUDI-3166
> URL: https://issues.apache.org/jira/browse/HUDI-3166
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index, metadata
>Reporter: Manoj Govindassamy
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: metadata, pull-request-available
> Fix For: 0.12.0
>
>
> A new HoodieIndex implementation based on indices from the metadata 
> table. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3259) Code Refactor: Common prep records commit util for Spark and Flink

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3259:
-
Priority: Major  (was: Blocker)

> Code Refactor: Common prep records commit util for Spark and Flink
> --
>
> Key: HUDI-3259
> URL: https://issues.apache.org/jira/browse/HUDI-3259
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, metadata, writer-core
>Reporter: Manoj Govindassamy
>Priority: Major
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3323) Refactor: Metadata various partitions payload merging using delegation pattern

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3323:
-
Priority: Critical  (was: Blocker)

> Refactor: Metadata various partitions payload merging using delegation pattern
> --
>
> Key: HUDI-3323
> URL: https://issues.apache.org/jira/browse/HUDI-3323
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Manoj Govindassamy
>Priority: Critical
> Fix For: 0.12.0
>
>
> With the introduction of new partitions under the metadata table, the payload 
> merging needs to be more organized - e.g. merging for each partition type in its 
> own payload class, with the metadata table payload class acting just as a 
> delegator. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3323) Refactor: Metadata various partitions payload merging using delegation pattern

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3323:
-
Priority: Blocker  (was: Critical)

> Refactor: Metadata various partitions payload merging using delegation pattern
> --
>
> Key: HUDI-3323
> URL: https://issues.apache.org/jira/browse/HUDI-3323
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.12.0
>
>
> With the introduction of new partitions under the metadata table, the payload 
> merging needs to be more organized - e.g. merging for each partition type in its 
> own payload class, with the metadata table payload class acting just as a 
> delegator. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3167) Update RFC27 with the design for the new HoodieIndex type based on metadata indices

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3167:
-
Priority: Major  (was: Blocker)

> Update RFC27 with the design for the new HoodieIndex type based on metadata 
> indices 
> 
>
> Key: HUDI-3167
> URL: https://issues.apache.org/jira/browse/HUDI-3167
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs, metadata
>Reporter: Manoj Govindassamy
>Priority: Major
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-512) Support Logical Partitioning

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-512:

Epic Link: (was: HUDI-1822)

> Support Logical Partitioning
> 
>
> Key: HUDI-512
> URL: https://issues.apache.org/jira/browse/HUDI-512
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Common Core
>Affects Versions: 0.9.0
>Reporter: Alexander Filipchik
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: features, pull-request-available
> Fix For: 0.11.0, 0.12.0
>
>
> So for us to support logical partitioning in lieu of a physical one, the 
> following will be necessary: 
>  # If a user would like to apply any transformations on top of the raw 
> Partitioning column, such a transformed column will have to be *materialized* 
> (either in the table as a meta-column or elsewhere)
>  # We will have to make sure that individual Base (and Delta Log) files only 
> contain records with the *same partition values* (i.e. records have to be 
> implicitly clustered by partition values within files)
>  # Partition Values have to be exposed to the Query Engine such that 
> Partition pruning can be performed (limiting the number of files that will be 
> scanned) 
>  
> 
> h4. *--- Original Description ---*
> This one is more aspirational, but, I believe, will be very useful. 
> Currently hudi follows the Hive table format, which means that data is 
> logically and physically partitioned into a folder structure like:
> table_name
>   2019
>     01
>     02
>        bla.parquet
>  
> This has several issues:
>  1) Modern object stores (AWS S3, GCP) are more performant when each file 
> name starts with some kind of random value; by that measure the Hive layout 
> is not ideal.
> 2) The Hive Metastore stores partitions in a text field in a single table (2 
> tables with very similar information) and doesn't support proper filtering. 
> Data partitioned by day will be stored like:
> 2019/01/10
> 2019/01/11
> so only regexp queries are supported (at least in Hive 2.X.X)
> 3) Having a single point of failure which relies on a non-distributed DB is 
> dangerous and creates bottlenecks. 
>  
> The idea is to get rid of logical partitioning altogether (and the hive 
> metastore as well). If a dataset has a time column, the user should be able to 
> query it without understanding the physical layout of the table (by 
> specifying those partitions explicitly or ending up with a full table scan 
> accidentally).
> It will require some kind of mapping of time to file locations (similar to 
> Iceberg). I'm also leaning towards the idea that storing table metadata with 
> the table is a good thing, as it can be read by the engine in one shot and 
> will be faster than taxing a standalone metastore. 
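
A minimal sketch of the time-to-file mapping alluded to above, with all names hypothetical: per-file min/max timestamps kept with the table, so a time predicate can prune files without Hive-style folders.

{code:java}
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: per-file time ranges kept in table metadata.
public final class TimeRangeFileIndex {
  public static final class FileEntry {
    final String path;
    final Instant minTs;
    final Instant maxTs;
    FileEntry(String path, Instant minTs, Instant maxTs) {
      this.path = path; this.minTs = minTs; this.maxTs = maxTs;
    }
  }

  private final List<FileEntry> files = new ArrayList<>();

  public void add(FileEntry e) { files.add(e); }

  // Keep only files whose [minTs, maxTs] overlaps the query window.
  public List<FileEntry> prune(Instant start, Instant end) {
    return files.stream()
        .filter(f -> !f.maxTs.isBefore(start) && !f.minTs.isAfter(end))
        .collect(Collectors.toList());
  }
}
{code}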



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-512) Support Logical Partitioning

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-512:

Fix Version/s: 0.11.0
   Issue Type: Epic  (was: Task)

> Support Logical Partitioning
> 
>
> Key: HUDI-512
> URL: https://issues.apache.org/jira/browse/HUDI-512
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Common Core
>Affects Versions: 0.9.0
>Reporter: Alexander Filipchik
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: features, pull-request-available
> Fix For: 0.11.0, 0.12.0
>
>
> So for us to support logical partitioning in lieu of a physical one, the 
> following will be necessary: 
>  # If a user would like to apply any transformations on top of the raw 
> Partitioning column, such a transformed column will have to be *materialized* 
> (either in the table as a meta-column or elsewhere)
>  # We will have to make sure that individual Base (and Delta Log) files only 
> contain records with the *same partition values* (i.e. records have to be 
> implicitly clustered by partition values within files)
>  # Partition Values have to be exposed to the Query Engine such that 
> Partition pruning can be performed (limiting the number of files that will be 
> scanned) 
>  
> 
> h4. *--- Original Description ---*
> This one is more aspirational, but, I believe, will be very useful. 
> Currently hudi follows the Hive table format, which means that data is 
> logically and physically partitioned into a folder structure like:
> table_name
>   2019
>     01
>     02
>        bla.parquet
>  
> This has several issues:
>  1) Modern object stores (AWS S3, GCP) are more performant when each file 
> name starts with some kind of random value; by that measure the Hive layout 
> is not ideal.
> 2) The Hive Metastore stores partitions in a text field in a single table (2 
> tables with very similar information) and doesn't support proper filtering. 
> Data partitioned by day will be stored like:
> 2019/01/10
> 2019/01/11
> so only regexp queries are supported (at least in Hive 2.X.X)
> 3) Having a single point of failure which relies on a non-distributed DB is 
> dangerous and creates bottlenecks. 
>  
> The idea is to get rid of logical partitioning altogether (and the hive 
> metastore as well). If a dataset has a time column, the user should be able to 
> query it without understanding the physical layout of the table (by 
> specifying those partitions explicitly or ending up with a full table scan 
> accidentally).
> It will require some kind of mapping of time to file locations (similar to 
> Iceberg). I'm also leaning towards the idea that storing table metadata with 
> the table is a good thing, as it can be read by the engine in one shot and 
> will be faster than taxing a standalone metastore. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3532) Refactor FileSystemBackedTableMetadata and related classes to support getColumnStats directly

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3532:
-
Priority: Minor  (was: Blocker)

> Refactor FileSystemBackedTableMetadata and related classes to support 
> getColumnStats directly
> -
>
> Key: HUDI-3532
> URL: https://issues.apache.org/jira/browse/HUDI-3532
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, writer-core
>Reporter: Ethan Guo
>Priority: Minor
> Fix For: 0.12.0
>
>
> The api {{getColumnStats}} is not supported in FileSystemBackedTableMetadata:
> {code:java}
> @Override
>   public Map<Pair<String, String>, HoodieMetadataColumnStats> 
> getColumnStats(final List<Pair<String, String>> partitionNameFileNameList, 
> final String columnName)
>   throws HoodieMetadataException {
> throw new HoodieMetadataException("Unsupported operation: 
> getColumnsStats!");
>   } {code}
> It's better to support column stats from FileSystemBackedTableMetadata (i.e. 
> without the metadata table) as well, to unify the logic around col stats and 
> reduce the special-casing for col stats between metadata table and file 
> system backed metadata.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HUDI-3532) Refactor FileSystemBackedTableMetadata and related classes to support getColumnStats directly

2022-04-25 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527454#comment-17527454
 ] 

Raymond Xu commented on HUDI-3532:
--

We may not need to do this. It may not be a good practice to allow this API via 
FS listing.

> Refactor FileSystemBackedTableMetadata and related classes to support 
> getColumnStats directly
> -
>
> Key: HUDI-3532
> URL: https://issues.apache.org/jira/browse/HUDI-3532
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, writer-core
>Reporter: Ethan Guo
>Priority: Minor
> Fix For: 0.12.0
>
>
> The api {{getColumnStats}} is not supported in FileSystemBackedTableMetadata:
> {code:java}
> @Override
>   public Map<Pair<String, String>, HoodieMetadataColumnStats> 
> getColumnStats(final List<Pair<String, String>> partitionNameFileNameList, 
> final String columnName)
>   throws HoodieMetadataException {
> throw new HoodieMetadataException("Unsupported operation: 
> getColumnsStats!");
>   } {code}
> It's better to support column stats from FileSystemBackedTableMetadata (i.e. 
> without the metadata table) as well, to unify the logic around col stats and 
> reduce the special-casing for col stats between metadata table and file 
> system backed metadata.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3533) Refactor FileSystemBackedTableMetadata and related classes to support getBloomFilters directly

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3533:
-
Priority: Minor  (was: Blocker)

> Refactor FileSystemBackedTableMetadata and related classes to support 
> getBloomFilters directly
> --
>
> Key: HUDI-3533
> URL: https://issues.apache.org/jira/browse/HUDI-3533
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, writer-core
>Reporter: Ethan Guo
>Priority: Minor
> Fix For: 0.12.0
>
>
> The api {{getBloomFilters}} is not supported in FileSystemBackedTableMetadata:
> {code:java}
> @Override
>   public Map<Pair<String, String>, ByteBuffer> getBloomFilters(final 
> List<Pair<String, String>> partitionNameFileNameList)
>   throws HoodieMetadataException {
> throw new HoodieMetadataException("Unsupported operation: 
> getBloomFilters!");
>   }{code}
> It's better to support bloom filters from FileSystemBackedTableMetadata (i.e. 
> without the metadata table) as well, to unify the logic and reduce the 
> special-casing for bloom filters between metadata table and file system 
> backed metadata.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HUDI-3533) Refactor FileSystemBackedTableMetadata and related classes to support getBloomFilters directly

2022-04-25 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527455#comment-17527455
 ] 

Raymond Xu commented on HUDI-3533:
--

Similar to HUDI-3532

> Refactor FileSystemBackedTableMetadata and related classes to support 
> getBloomFilters directly
> --
>
> Key: HUDI-3533
> URL: https://issues.apache.org/jira/browse/HUDI-3533
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, writer-core
>Reporter: Ethan Guo
>Priority: Minor
> Fix For: 0.12.0
>
>
> The api {{getBloomFilters}} is not supported in FileSystemBackedTableMetadata:
> {code:java}
> @Override
>   public Map<Pair<String, String>, ByteBuffer> getBloomFilters(final 
> List<Pair<String, String>> partitionNameFileNameList)
>   throws HoodieMetadataException {
> throw new HoodieMetadataException("Unsupported operation: 
> getBloomFilters!");
>   }{code}
> It's better to support bloom filters from FileSystemBackedTableMetadata (i.e. 
> without the metadata table) as well, to unify the logic and reduce the 
> special-casing for bloom filters between metadata table and file system 
> backed metadata.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3582) Support record level index based on Apache Lucene to improve query/tagLocation performance

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3582:
-
Epic Link: HUDI-466  (was: HUDI-1822)

> Support record level index based on Apache Lucene to improve 
> query/tagLocation performance
> --
>
> Key: HUDI-3582
> URL: https://issues.apache.org/jira/browse/HUDI-3582
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: index
>Reporter: shibei
>Assignee: shibei
>Priority: Blocker
> Fix For: 0.12.0
>
>
> Nowadays, the record level index is mainly implemented for `tagLocation`, and 
> queries do not benefit from it; see 
> [https://github.com/apache/hudi/pull/3508] for more detail. In this issue, we 
> will implement a record level index based on Apache Lucene to gain the 
> following abilities:
> 1. For point queries, we can get the file-level row number from the record 
> level index, combining it with the parquet column index brought by Spark 3.2 
> to achieve accurate reading;
> 2. For `tagLocation`, we can directly get record location info from the record 
> level index.
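
A toy sketch of what ability 2 buys `tagLocation`, with a plain map standing in for the Lucene-backed index; all names here are hypothetical:

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: a record-level index is, logically, a direct
// key -> location map, replacing bloom-filter candidate pruning for hits.
public final class RecordLevelIndexSketch {
  public static final class Location {
    final String partitionPath;
    final String fileId;
    final long rowPosition; // what ability 1 uses for accurate reading
    Location(String partitionPath, String fileId, long rowPosition) {
      this.partitionPath = partitionPath;
      this.fileId = fileId;
      this.rowPosition = rowPosition;
    }
  }

  private final Map<String, Location> index = new HashMap<>();

  public void put(String recordKey, Location loc) { index.put(recordKey, loc); }

  // tagLocation-style point lookup: no file scan needed for a hit.
  public Optional<Location> tagLocation(String recordKey) {
    return Optional.ofNullable(index.get(recordKey));
  }
}
{code}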



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3907) RFC for lucene based record level index

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3907:
-
Epic Link: HUDI-466

> RFC for lucene based record level index
> ---
>
> Key: HUDI-3907
> URL: https://issues.apache.org/jira/browse/HUDI-3907
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: shibei
>Assignee: shibei
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3777) Optimize column stats storage

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3777:
-
Description: Avoid storing filename of each record in the colstats 
partition.

> Optimize column stats storage
> -
>
> Key: HUDI-3777
> URL: https://issues.apache.org/jira/browse/HUDI-3777
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Blocker
> Fix For: 0.12.0
>
>
> Avoid storing filename of each record in the colstats partition.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3809) Make full scan optional for metadata partitions other than FILES

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3809:
-
Priority: Critical  (was: Blocker)

> Make full scan optional for metadata partitions other than FILES
> 
>
> Key: HUDI-3809
> URL: https://issues.apache.org/jira/browse/HUDI-3809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Priority: Critical
> Fix For: 0.12.0
>
>
> Currently, just one config controls whether to do a full scan or a point lookup 
> while reading log records in the metadata table, and that config applies to ALL 
> metadata partitions. However, the full scan is disabled for the column_stats 
> and bloom_filters partitions: 
> HoodieBackedTableMetadata#isFullScanAllowedForPartition
>  
> We should make it configurable for the other partitions too.
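
A minimal sketch of the per-partition override this asks for; the class and config shape are hypothetical, not Hudi config keys:

{code:java}
import java.util.Map;

// Hypothetical sketch: one default plus per-partition overrides replaces
// the single global full-scan flag.
public final class MetadataScanConfig {
  private final boolean defaultFullScan;
  private final Map<String, Boolean> overrides; // e.g. "files" -> true

  public MetadataScanConfig(boolean defaultFullScan, Map<String, Boolean> overrides) {
    this.defaultFullScan = defaultFullScan;
    this.overrides = overrides;
  }

  public boolean isFullScanAllowed(String metadataPartition) {
    return overrides.getOrDefault(metadataPartition, defaultFullScan);
  }
}
{code}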



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3809) Make full scan optional for metadata partitions other than FILES

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3809:
-
Issue Type: Improvement  (was: Task)

> Make full scan optional for metadata partitions other than FILES
> 
>
> Key: HUDI-3809
> URL: https://issues.apache.org/jira/browse/HUDI-3809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Priority: Blocker
> Fix For: 0.12.0
>
>
> Currently, just one config controls whether to do a full scan or a point lookup 
> while reading log records in the metadata table, and that config applies to ALL 
> metadata partitions. However, the full scan is disabled for the column_stats 
> and bloom_filters partitions: 
> HoodieBackedTableMetadata#isFullScanAllowedForPartition
>  
> We should make it configurable for the other partitions too.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3914) Enhance TestColumnStatsIndex to test indexing with regular writes and table services

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3914:
-
Priority: Critical  (was: Blocker)

> Enhance TestColumnStatsIndex to test indexing with regular writes and table 
> services
> 
>
> Key: HUDI-3914
> URL: https://issues.apache.org/jira/browse/HUDI-3914
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Critical
> Fix For: 0.11.1
>
>
> Beef up {{TestColumnStatsIndex}} to run through all Hudi flows (insert, 
> upsert, deletion, rollback, compaction, clustering, etc.) for both COW/MOR.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3166) Implement new HoodieIndex based on metadata indices

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3166:
-
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14, Hudi-Sprint-Feb-22, 
Hudi-Sprint-May-02  (was: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14, 
Hudi-Sprint-Feb-22)

> Implement new HoodieIndex based on metadata indices 
> 
>
> Key: HUDI-3166
> URL: https://issues.apache.org/jira/browse/HUDI-3166
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index, metadata
>Reporter: Manoj Govindassamy
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: metadata, pull-request-available
> Fix For: 0.12.0
>
>
> A new HoodieIndex implementation based on indices from the metadata 
> table. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3806) Improve HoodieBloomIndex using bloom_filter and col_stats in MDT

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3806:
-
Sprint: Hudi-Sprint-Apr-05, Hudi-Sprint-Apr-12, Hudi-Sprint-May-02  (was: 
Hudi-Sprint-Apr-05, Hudi-Sprint-Apr-12)

> Improve HoodieBloomIndex using bloom_filter and col_stats in MDT
> 
>
> Key: HUDI-3806
> URL: https://issues.apache.org/jira/browse/HUDI-3806
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.0
>
>
> For a Deltastreamer job doing bulk inserts of 10GB batches, the job is stuck 
> at the stage where HoodieBloomIndex reads the bloom filter from the metadata 
> table, taking more than 2 hours. When the bloom filter is disabled in the 
> metadata table, each commit takes 10-20 minutes.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3806) Improve HoodieBloomIndex using bloom_filter and col_stats in MDT

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3806:
-
Sprint: Hudi-Sprint-Apr-05, Hudi-Sprint-Apr-12, Hudi-Sprint-Apr-25  (was: 
Hudi-Sprint-Apr-05, Hudi-Sprint-Apr-12, Hudi-Sprint-May-02)

> Improve HoodieBloomIndex using bloom_filter and col_stats in MDT
> 
>
> Key: HUDI-3806
> URL: https://issues.apache.org/jira/browse/HUDI-3806
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.0
>
>
> For a Deltastreamer job doing bulk inserts of 10GB batches, the job is stuck 
> at the stage where HoodieBloomIndex reads the bloom filter from the metadata 
> table, taking more than 2 hours. When the bloom filter is disabled in the 
> metadata table, each commit takes 10-20 minutes.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3166) Implement new HoodieIndex based on metadata indices

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3166:
-
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14, Hudi-Sprint-Feb-22, 
Hudi-Sprint-Apr-25  (was: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14, 
Hudi-Sprint-Feb-22, Hudi-Sprint-May-02)

> Implement new HoodieIndex based on metadata indices 
> 
>
> Key: HUDI-3166
> URL: https://issues.apache.org/jira/browse/HUDI-3166
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: index, metadata
>Reporter: Manoj Govindassamy
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: metadata, pull-request-available
> Fix For: 0.12.0
>
>
> A new HoodieIndex implementation based on indices from the metadata 
> table. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5423: [HUDI-3964] hudi-spark-bundle miss avro dependency

2022-04-25 Thread GitBox


hudi-bot commented on PR #5423:
URL: https://github.com/apache/hudi/pull/5423#issuecomment-1108526591

   
   ## CI report:
   
   * a7fa04a277661c65b4f7faa48f0b69b39a7cee9a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5423: [HUDI-3964] hudi-spark-bundle miss avro dependency

2022-04-25 Thread GitBox


hudi-bot commented on PR #5423:
URL: https://github.com/apache/hudi/pull/5423#issuecomment-1108530150

   
   ## CI report:
   
   * a7fa04a277661c65b4f7faa48f0b69b39a7cee9a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8291)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-1061) Hudi CLI savepoint command fail because of spark conf loading issue

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-1061:


Assignee: sivabalan narayanan

> Hudi CLI savepoint command fail because of spark conf loading issue
> ---
>
> Key: HUDI-1061
> URL: https://issues.apache.org/jira/browse/HUDI-1061
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Reporter: Wenning Ding
>Assignee: sivabalan narayanan
>Priority: Major
>
> h3. Reproduce
> Open hudi-cli.sh and run these two commands:
> {code:java}
> connect --path s3://wenningd-emr-dev/hudi/tables/events/hudi_null01
> savepoint create --commit 2019115109
> {code}
> You would see this error:
> {code:java}
> java.io.FileNotFoundException: File file:/tmp/spark-events does not exist at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:640)
>  at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:866)
>  at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:630)
>  at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:452)
>  at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:97)
>  at org.apache.spark.SparkContext.<init>(SparkContext.scala:523) at 
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58) 
> at org.apache.hudi.cli.utils.SparkUtil.initJavaSparkConf(SparkUtil.java:85) 
> at 
> org.apache.hudi.cli.commands.SavepointsCommand.savepoint(SavepointsCommand.java:79)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216)
>  at 
> org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68)
>  at 
> org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:59)
>  at 
> org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:134)
>  at org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533) 
> at org.springframework.shell.core.JLineShell.run(JLineShell.java:179) at 
> java.lang.Thread.run(Thread.java:748){code}
> Although {{spark-defaults.conf}} configures {{spark.eventLog.dir  
>  hdfs:///var/log/spark/apps}}, hudi cli still uses 
> {{file:/tmp/spark-events}} as the event log dir, which means the SparkContext 
> won't load the configs from {{spark-defaults.conf}}.
> We should make the initJavaSparkConf method able to read configs from the 
> spark config file.
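
One possible shape for that fix, as a sketch only (class and method names are hypothetical): read spark-defaults.conf as a properties file and seed the SparkConf before the context is created. Note that java.util.Properties accepts the whitespace-separated "key value" format spark-defaults.conf uses.

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;
import org.apache.spark.SparkConf;

// Hypothetical sketch: seed SparkConf from $SPARK_HOME/conf/spark-defaults.conf
// so settings like spark.eventLog.dir are honored by the CLI's SparkContext.
public final class SparkDefaultsLoader {
  public static SparkConf withSparkDefaults(SparkConf conf) throws IOException {
    String sparkHome = System.getenv("SPARK_HOME");
    if (sparkHome == null) {
      return conf; // nothing to load
    }
    Path defaults = Paths.get(sparkHome, "conf", "spark-defaults.conf");
    if (!Files.exists(defaults)) {
      return conf;
    }
    Properties props = new Properties();
    try (InputStream in = Files.newInputStream(defaults)) {
      props.load(in);
    }
    // Only copy spark.* keys; values set explicitly on the conf still win.
    for (String key : props.stringPropertyNames()) {
      if (key.startsWith("spark.")) {
        conf.setIfMissing(key, props.getProperty(key).trim());
      }
    }
    return conf;
  }
}
{code}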



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3884) Adding support to let archival proceed beyond savepointed commits

2022-04-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3884:
--
Summary: Adding support to let archival proceed beyond savepointed commits  
(was: Inspect why archival stops at first savepoint. Add support if possible)

> Adding support to let archival proceed beyond savepointed commits
> -
>
> Key: HUDI-3884
> URL: https://issues.apache.org/jira/browse/HUDI-3884
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-3967) Exempt savepoint from archiving and add automated savepoint

2022-04-25 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-3967:


 Summary: Exempt savepoint from archiving and add automated 
savepoint
 Key: HUDI-3967
 URL: https://issues.apache.org/jira/browse/HUDI-3967
 Project: Apache Hudi
  Issue Type: Epic
  Components: table-service
Reporter: Raymond Xu
Assignee: sivabalan narayanan
 Fix For: 0.12.0






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3734) Add savepoint restore nodes and yamls to integ test suite

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3734:
-
Fix Version/s: 0.12.0

> Add savepoint restore nodes and yamls to integ test suite
> -
>
> Key: HUDI-3734
> URL: https://issues.apache.org/jira/browse/HUDI-3734
> Project: Apache Hudi
>  Issue Type: Task
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3884) Adding support to let archival proceed beyond savepointed commits

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3884:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)

> Adding support to let archival proceed beyond savepointed commits
> -
>
> Key: HUDI-3884
> URL: https://issues.apache.org/jira/browse/HUDI-3884
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-3968) Add automated savepoint support to Hudi

2022-04-25 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-3968:
-

 Summary: Add automated savepoint support to Hudi
 Key: HUDI-3968
 URL: https://issues.apache.org/jira/browse/HUDI-3968
 Project: Apache Hudi
  Issue Type: Task
  Components: table-service
Reporter: sivabalan narayanan






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3884) Adding support to let archival proceed beyond savepointed commits

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3884:
-
Component/s: table-service
  Epic Link: HUDI-3967
 Issue Type: Improvement  (was: Task)

> Adding support to let archival proceed beyond savepointed commits
> -
>
> Key: HUDI-3884
> URL: https://issues.apache.org/jira/browse/HUDI-3884
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-1272) Add utility scripts to manage Savepoints

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1272:
-
Epic Link: HUDI-3967  (was: HUDI-1388)

> Add utility scripts to manage Savepoints
> 
>
> Key: HUDI-1272
> URL: https://issues.apache.org/jira/browse/HUDI-1272
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: cli, Utilities
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.12.0
>
>
> We need to expose commands for managing savepoints.
> We have a similar script for the cleaner: 
> org.apache.hudi.utilities.HoodieCleaner
> We need to add something similar for restores.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-1061) Hudi CLI savepoint command fail because of spark conf loading issue

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1061:
-
Fix Version/s: 0.12.0

> Hudi CLI savepoint command fail because of spark conf loading issue
> ---
>
> Key: HUDI-1061
> URL: https://issues.apache.org/jira/browse/HUDI-1061
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Reporter: Wenning Ding
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>
> h3. Reproduce
> Open hudi-cli.sh and run these two commands:
> {code:java}
> connect --path s3://wenningd-emr-dev/hudi/tables/events/hudi_null01
> savepoint create --commit 2019115109
> {code}
> You would see this error:
> {code:java}
> java.io.FileNotFoundException: File file:/tmp/spark-events does not exist at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:640)
>  at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:866)
>  at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:630)
>  at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:452)
>  at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:97)
>  at org.apache.spark.SparkContext.<init>(SparkContext.scala:523) at 
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58) 
> at org.apache.hudi.cli.utils.SparkUtil.initJavaSparkConf(SparkUtil.java:85) 
> at 
> org.apache.hudi.cli.commands.SavepointsCommand.savepoint(SavepointsCommand.java:79)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216)
>  at 
> org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68)
>  at 
> org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:59)
>  at 
> org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:134)
>  at org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533) 
> at org.springframework.shell.core.JLineShell.run(JLineShell.java:179) at 
> java.lang.Thread.run(Thread.java:748){code}
> Although {{spark-defaults.conf}} configures {{spark.eventLog.dir  
>  hdfs:///var/log/spark/apps}}, hudi cli still uses 
> {{file:/tmp/spark-events}} as the event log dir, which means the SparkContext 
> won't load the configs from {{spark-defaults.conf}}.
> We should make the initJavaSparkConf method able to read configs from the 
> spark config file.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-1061) Hudi CLI savepoint command fail because of spark conf loading issue

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1061:
-
Epic Link: HUDI-3967

> Hudi CLI savepoint command fail because of spark conf loading issue
> ---
>
> Key: HUDI-1061
> URL: https://issues.apache.org/jira/browse/HUDI-1061
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Reporter: Wenning Ding
>Assignee: sivabalan narayanan
>Priority: Major
>
> h3. Reproduce
> Open hudi-cli.sh and run these two commands:
> {code:java}
> connect --path s3://wenningd-emr-dev/hudi/tables/events/hudi_null01
> savepoint create --commit 2019115109
> {code}
> You would see this error:
> {code:java}
> java.io.FileNotFoundException: File file:/tmp/spark-events does not exist at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:640)
>  at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:866)
>  at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:630)
>  at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:452)
>  at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:97)
>  at org.apache.spark.SparkContext.<init>(SparkContext.scala:523) at 
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58) 
> at org.apache.hudi.cli.utils.SparkUtil.initJavaSparkConf(SparkUtil.java:85) 
> at 
> org.apache.hudi.cli.commands.SavepointsCommand.savepoint(SavepointsCommand.java:79)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216)
>  at 
> org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68)
>  at 
> org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:59)
>  at 
> org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:134)
>  at org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533) 
> at org.springframework.shell.core.JLineShell.run(JLineShell.java:179) at 
> java.lang.Thread.run(Thread.java:748){code}
> Although {{spark-defaults.conf}} configures {{spark.eventLog.dir  
>  hdfs:///var/log/spark/apps}}, hudi cli still uses 
> {{file:/tmp/spark-events}} as the event log dir, which means the SparkContext 
> won't load the configs from {{spark-defaults.conf}}.
> We should make the initJavaSparkConf method able to read configs from the 
> spark config file.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3968) Add automated savepoint support to Hudi

2022-04-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3968:
--
Epic Link: HUDI-3967

> Add automated savepoint support to Hudi
> ---
>
> Key: HUDI-3968
> URL: https://issues.apache.org/jira/browse/HUDI-3968
> Project: Apache Hudi
>  Issue Type: Task
>  Components: table-service
>Reporter: sivabalan narayanan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3968) Add automated savepoint support to Hudi

2022-04-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3968:
--
Fix Version/s: 0.12.0

> Add automated savepoint support to Hudi
> ---
>
> Key: HUDI-3968
> URL: https://issues.apache.org/jira/browse/HUDI-3968
> Project: Apache Hudi
>  Issue Type: Task
>  Components: table-service
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-1015) Audit all getAllPartitionPaths() calls and keep em out of fast path

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1015:
-
Priority: Critical  (was: Major)

> Audit all getAllPartitionPaths() calls and keep em out of fast path
> ---
>
> Key: HUDI-1015
> URL: https://issues.apache.org/jira/browse/HUDI-1015
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, writer-core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.10.0, 0.12.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Reopened] (HUDI-1015) Audit all getAllPartitionPaths() calls and keep em out of fast path

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reopened HUDI-1015:
--
Assignee: Ethan Guo  (was: sivabalan narayanan)

> Audit all getAllPartitionPaths() calls and keep em out of fast path
> ---
>
> Key: HUDI-1015
> URL: https://issues.apache.org/jira/browse/HUDI-1015
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, writer-core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.10.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-1015) Audit all getAllPartitionPaths() calls and keep em out of fast path

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1015:
-
Fix Version/s: 0.12.0

> Audit all getAllPartitionPaths() calls and keep em out of fast path
> ---
>
> Key: HUDI-1015
> URL: https://issues.apache.org/jira/browse/HUDI-1015
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, writer-core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.10.0, 0.12.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HUDI-1015) Audit all getAllPartitionPaths() calls and keep em out of fast path

2022-04-25 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527477#comment-17527477
 ] 

Raymond Xu commented on HUDI-1015:
--

This is for taking one more pass on the code paths. cc [~guoyihua] [~shivnarayan]

> Audit all getAllPartitionPaths() calls and keep em out of fast path
> ---
>
> Key: HUDI-1015
> URL: https://issues.apache.org/jira/browse/HUDI-1015
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, writer-core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.10.0, 0.12.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-1015) Audit all getAllPartitionPaths() calls and keep em out of fast path

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1015:
-
Epic Link: HUDI-1292

> Audit all getAllPartitionPaths() calls and keep em out of fast path
> ---
>
> Key: HUDI-1015
> URL: https://issues.apache.org/jira/browse/HUDI-1015
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, writer-core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.10.0, 0.12.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-1292) [Umbrella] RFC-15 : Metadata Table for File Listing and other table metadata

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1292:
-
Fix Version/s: 0.12.0

> [Umbrella] RFC-15 : Metadata Table for File Listing and other table metadata
> 
>
> Key: HUDI-1292
> URL: https://issues.apache.org/jira/browse/HUDI-1292
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: spark, writer-core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: hudi-umbrellas, pull-request-available
> Fix For: 0.11.0, 0.12.0
>
>
> This is the umbrella ticket that tracks the overall implementation of RFC-15



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-3969) Reliable lock provider and OCC across clouds

2022-04-25 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-3969:


 Summary: Reliable lock provider and OCC across clouds
 Key: HUDI-3969
 URL: https://issues.apache.org/jira/browse/HUDI-3969
 Project: Apache Hudi
  Issue Type: Epic
Reporter: Raymond Xu
 Fix For: 0.12.0






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-3970) Close testing gap in e2e test

2022-04-25 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-3970:


 Summary: Close testing gap in e2e test
 Key: HUDI-3970
 URL: https://issues.apache.org/jira/browse/HUDI-3970
 Project: Apache Hudi
  Issue Type: Epic
Reporter: Raymond Xu
 Fix For: 0.12.0


Identify existing testing gaps for 0.11 and 0.12 and cover them with adequate
tests (UT/FT/IT/e2e).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3753) Support a CLI command to delete any metadata partition and clean index commits from timeline

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3753:
-
Priority: Major  (was: Critical)

> Support a CLI command to delete any metadata partition and clean index  
> commits from timeline
> -
>
> Key: HUDI-3753
> URL: https://issues.apache.org/jira/browse/HUDI-3753
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3755) Change the index plan to capture everything that is needed to create index

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3755:
-
Priority: Blocker  (was: Critical)

> Change the index plan to capture everything that is needed to create index
> --
>
> Key: HUDI-3755
> URL: https://issues.apache.org/jira/browse/HUDI-3755
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Blocker
>
> Currently, we capture the MetadataPartitionType and the base instant up to
> which there are no holes in the timeline. We can also write other options to
> the plan, e.g. the columns that are going to be indexed, the catch-up start
> instant, etc. This will make it easier to reindex: just read the plan, map it
> to configs, and start indexing. It also makes the write client APIs cleaner.
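
A rough sketch of what such an enriched plan could look like; the class name
and config keys below are illustrative placeholders, not Hudi's actual plan
schema or config surface.

{code:java}
import java.io.Serializable;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shape of the enriched index plan described above: persist
// everything needed to (re)build the index so a re-index only has to read
// the plan and map it back to write configs.
public class IndexPlanSketch implements Serializable {

  private final String metadataPartitionType; // captured today, e.g. "column_stats"
  private final String baseInstantTime;       // captured today: no timeline holes up to here
  private final List<String> indexedColumns;  // proposed: columns to be indexed
  private final String catchupStartInstant;   // proposed: where the catch-up phase begins

  public IndexPlanSketch(String metadataPartitionType, String baseInstantTime,
                         List<String> indexedColumns, String catchupStartInstant) {
    this.metadataPartitionType = metadataPartitionType;
    this.baseInstantTime = baseInstantTime;
    this.indexedColumns = indexedColumns;
    this.catchupStartInstant = catchupStartInstant;
  }

  // Re-indexing then reduces to: read the persisted plan, map it to configs.
  public Map<String, String> toWriteConfigs() {
    Map<String, String> configs = new HashMap<>();
    configs.put("index.partition.type", metadataPartitionType);     // placeholder key
    configs.put("index.base.instant", baseInstantTime);             // placeholder key
    configs.put("index.columns", String.join(",", indexedColumns)); // placeholder key
    configs.put("index.catchup.start", catchupStartInstant);        // placeholder key
    return configs;
  }
}
{code}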



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3718) Support concurrent writes while dropping index

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3718:
-
Priority: Blocker  (was: Major)

> Support concurrent writes while dropping index
> --
>
> Key: HUDI-3718
> URL: https://issues.apache.org/jira/browse/HUDI-3718
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>
> Try to avoid taking a lock when dropping an index by doing lazy dropping,
> the way databases do.
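
A minimal sketch of the lazy-drop idea under those assumptions; this is not
Hudi's actual implementation, and {{LazyIndexDropper}} is an illustrative
name. Dropping an index only flips its state to DROPPED (a metadata-only
write), so concurrent writers never block on file deletion; a background
cleaner physically removes the files for DROPPED entries later.

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class LazyIndexDropper {

  enum IndexState { AVAILABLE, DROPPED }

  private final Map<String, IndexState> indexStates = new ConcurrentHashMap<>();

  // Fast path: O(1) metadata flip, no table-level lock needed.
  public void dropIndex(String indexName) {
    indexStates.put(indexName, IndexState.DROPPED);
  }

  // Readers and writers consult the state and simply skip dropped indexes.
  public boolean isUsable(String indexName) {
    return indexStates.get(indexName) == IndexState.AVAILABLE;
  }

  // Invoked by a background/cleaner thread outside the write path.
  public void cleanDroppedIndexes(Set<String> allIndexNames) {
    for (String name : allIndexNames) {
      if (indexStates.get(name) == IndexState.DROPPED) {
        // delete the index files here, then forget the entry
        indexStates.remove(name);
      }
    }
  }
}
{code}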



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3756) Clean up indexing APIs in write client

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3756:
-
Priority: Major  (was: Critical)

> Clean up indexing APIs in write client
> --
>
> Key: HUDI-3756
> URL: https://issues.apache.org/jira/browse/HUDI-3756
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>
> See HUDI-3755 for more details.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3971) Revisit UT/FTs for colstats and bloom filter and fill gaps

2022-04-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3971:
-
Priority: Blocker  (was: Major)

> Revisit UT/FTs for colstats and bloom filter and fill gaps
> --
>
> Key: HUDI-3971
> URL: https://issues.apache.org/jira/browse/HUDI-3971
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Raymond Xu
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-3971) Revisit UT/FTs for colstats and bloom filter and fill gaps

2022-04-25 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-3971:


 Summary: Revisit UT/FTs for colstats and bloom filter and fill gaps
 Key: HUDI-3971
 URL: https://issues.apache.org/jira/browse/HUDI-3971
 Project: Apache Hudi
  Issue Type: Test
Reporter: Raymond Xu






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


  1   2   3   >