contributor permission

2022-04-17 Thread Chuang Lee
Hi,

I want to contribute to Apache Hudi. Could you please grant me contributor 
permission? My JIRA ID is HUDI-3898.

Thank you.

Chuang Lee
codecooker_h...@163.com



Re: contributor permission

2022-04-17 Thread Sivabalan
Is HUDI-3898 your Apache ID? We might need your Apache ID to add you as
a contributor.


On Sun, 17 Apr 2022 at 05:16, Chuang Lee  wrote:

> Hi,
>
> I want to contribute to Apache Hudi. Would you please give me the
> contributor permission? My JIRA ID is HUDI-3898.
>
> Thank you.
>
> Chuang Lee
> codecooker_h...@163.com
>
>

-- 
Regards,
-Sivabalan


Re: contributor permission

2022-04-17 Thread Chuang Lee
Sorry, my mistake.
My Apache ID is CodeCooker17.
Please add the permission for me, thank you.


Chuang Lee
codecooker_h...@163.com


On 04/18/2022 10:18,Sivabalan wrote:
Is HUDI-3898 your Apache ID? We might need your Apache ID to add you as
a contributor.


On Sun, 17 Apr 2022 at 05:16, Chuang Lee  wrote:

Hi,

I want to contribute to Apache Hudi. Would you please give me the
contributor permission? My JIRA ID is HUDI-3898.

Thank you.

Chuang Lee
codecooker_h...@163.com



--
Regards,
-Sivabalan


[DISCUSS] hudi index improve

2022-04-17 Thread Forward Xu
Hi All,

I want to improve Hudi's indexing. There are four main steps to achieve this:

1. Implement index syntax
a. Implement index syntax for Spark SQL [1]; I have submitted the
first PR.
b. Implement index syntax for PrestoDB SQL
c. Implement index syntax for Trino SQL

2. Read/write index decoupling
Index reads and writes are decoupled from the compute engine side, so the
SQL index syntax from step 1 can be executed and invoked independently
through an API.

3. Build an index service

Promote the implementation of the Hudi service framework, including the index
service, metastore service [2], compaction/clustering service [3], etc.

4. Index Management
There are two kinds of management semantics for an index:

   - Automatic Refresh
   - Manual Refresh


   1. Automatic Refresh

When a user creates an index on the main table without using the WITH DEFERRED
REFRESH syntax, the index will be managed by the system automatically. For
every data load to the main table, the system will immediately trigger a
load to the index. These two data loads (to the main table and to the
index) are executed in a transactional manner, meaning that either both
succeed or neither does.

The data loading to the index is incremental, avoiding an expensive full
refresh.
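The all-or-nothing semantics above can be sketched roughly as follows (a minimal Python illustration; the class, its methods, and the staging/commit mechanics are hypothetical, not actual Hudi APIs):

```python
# Sketch of the transactional dual load: a batch is applied to the main
# table and to the index atomically -- either both succeed or neither does.
# All names here are illustrative placeholders, not real Hudi APIs.

class TransactionalLoad:
    def __init__(self):
        self.main_table = []   # committed rows of the main table
        self.index = {}        # committed index: key -> row position

    def load(self, rows):
        """Apply one batch to both the table and the index, atomically."""
        staged_table = self.main_table + rows
        staged_index = dict(self.index)
        try:
            for pos, row in enumerate(rows, start=len(self.main_table)):
                key = row["key"]
                if key in staged_index:
                    raise ValueError(f"duplicate key {key!r}")
                staged_index[key] = pos
        except ValueError:
            return False          # neither the table nor the index changes
        # commit both staged copies together
        self.main_table = staged_table
        self.index = staged_index
        return True
```

If index maintenance fails mid-batch, the committed state is untouched, so the table and index never diverge.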

If a user performs any of the following commands on the main table, the system
will reject the operation and return a failure:


   - Data management commands: UPDATE/DELETE.
   - Schema management commands: ALTER TABLE DROP COLUMN, ALTER TABLE CHANGE
   DATATYPE, ALTER TABLE RENAME. Note that adding a new column is supported;
   for the drop-column and change-datatype commands, Hudi will check
   whether the change impacts the index table: if not, the operation is
   allowed, otherwise it is rejected by throwing an exception.
   - Partition management commands: ALTER TABLE ADD/DROP PARTITION.

If a user does want to perform the above operations on the main table, they
can first drop the index, perform the operation, and then re-create the
index.
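A guard for these rejected operations might look like the following sketch (the command names are illustrative placeholders; a real engine-side check would inspect the parsed plan):

```python
# Sketch: reject table operations that would invalidate an automatically
# refreshed index. Command names are hypothetical, not Hudi identifiers.

REJECTED_ON_INDEXED_TABLE = {
    "UPDATE", "DELETE",                                 # data management
    "DROP_COLUMN", "CHANGE_DATATYPE", "RENAME_TABLE",   # schema management
    "ADD_PARTITION", "DROP_PARTITION",                  # partition management
}

def check_command(command, table_has_index, impacts_index=True):
    """Raise if `command` is not allowed on a table with an auto index.

    For DROP_COLUMN / CHANGE_DATATYPE the engine first checks whether the
    change impacts the index table; if it does not, the operation is allowed.
    """
    if not table_has_index:
        return True
    if command in ("DROP_COLUMN", "CHANGE_DATATYPE") and not impacts_index:
        return True
    if command in REJECTED_ON_INDEXED_TABLE:
        raise PermissionError(f"{command} rejected: drop the index first")
    return True
```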

If a user drops the main table, the index will be dropped immediately as well.

We recommend this management mode for indexing.

   2. Manual Refresh

When a user creates an index on the main table using the WITH DEFERRED REFRESH
syntax, the index is created with status disabled, and queries will NOT
use this index until the user issues a REFRESH INDEX command to build the
index. For every REFRESH INDEX command, the system triggers a full
refresh of the index. Once the refresh operation finishes, the system
changes the index status to enabled so that it can be used in query rewrite.

For every new data load, update, or delete, the related index is
disabled, which means that subsequent queries will not benefit from
the index until it becomes enabled again.

If the main table is dropped by the user, the related index is dropped
immediately as well.
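The status transitions described for manual refresh can be sketched as a small state machine (illustrative only; the DISABLED/ENABLED naming is an assumption, not taken from Hudi):

```python
# Sketch of the manual-refresh index lifecycle:
#   created WITH DEFERRED REFRESH -> disabled (queries ignore it)
#   REFRESH INDEX                 -> full rebuild, then enabled
#   any data load/update/delete   -> disabled again until the next refresh

class DeferredIndex:
    def __init__(self):
        self.status = "DISABLED"   # created disabled

    def refresh(self):
        # full refresh of the index, then make it usable in query rewrite
        self.status = "ENABLED"

    def on_table_write(self):
        # a data load, update, or delete invalidates the index
        self.status = "DISABLED"

    def usable_for_query_rewrite(self):
        return self.status == "ENABLED"
```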



Any feedback is welcome!

Thank you.

Regards,
Forward Xu

Related Links:
[1] Implement index syntax for spark sql

[2] Metastore service 

[3] compaction/clustering job in Service



Re: [DISCUSS] hudi index improve

2022-04-17 Thread ????
+1, it will be a great feature for Hudi.
Index support is very important to boost query performance, and we are also
trying to add index support for Trino on Hudi; maybe we can work together.
Looking forward to the design documents.
Some minor questions:
1. Do we need to consider concurrent operations?
2. Do we want to use the metadata table to store index information?






------------------ Original message ------------------
From: "dev"
[1] https://issues.apache.org/jira/browse/HUDI-3881
[2] Metastore service
[3] compaction/clustering job in Service

[DISCUSS][NEW FEATURE] Hudi Lake Manager

2022-04-17 Thread Yue Zhang


Hi all, 
I would like to discuss and contribute a new feature named Hudi Lake 
Manager.


As more and more users from different companies and different businesses 
begin to use Hudi pipelines to write data, data governance has gradually 
become one of the biggest pain points for users. To get better query 
performance or better timeliness, users need to carefully configure clustering, 
compaction, cleaning and archiving for each ingestion pipeline, which 
undoubtedly brings higher learning and maintenance costs. Imagine that if 
a business has hundreds or thousands of ingestion pipelines, users may even 
need to maintain hundreds or thousands of sets of configurations and keep 
tuning them.


This new feature, Hudi Lake Manager, decouples Hudi ingestion from Hudi 
table services, including the cleaner, archival, clustering, compaction and any 
table services added in the future.


Users only need to care about their own ingestion pipeline and leave all the 
table services to the manager, which automatically discovers and manages Hudi 
tables, thereby greatly reducing the operation and maintenance burden and 
the onboarding cost.


This lake manager plays the role of a Hudi table master/coordinator, which 
can discover Hudi tables and automatically invoke services such as 
cleaner/clustering/compaction/archival (multi-writer and async) based on 
certain conditions.


A common and interesting example: in our production environment, we 
basically use the date as the partition key and have specific data retention 
requirements. Today we need to write a script for each pipeline to delete the 
data and the corresponding Hive metadata. With the lake manager, we can expand 
the scope of the cleaner and implement a mechanism for data retention based on 
date partitions.
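As a rough illustration of that retention mechanism (plain Python with a hypothetical date-string partition layout; not an existing Hudi API):

```python
from datetime import date, timedelta

# Sketch: given date-named partitions and a retention window, pick the
# partitions an expanded cleaner would delete (data plus Hive metadata).
def expired_partitions(partitions, retention_days, today):
    """Return partitions strictly older than the retention window.

    `partitions` are 'YYYY-MM-DD' strings, as when date is the partition key.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partitions if date.fromisoformat(p) < cutoff)
```

A lake manager could run such a check periodically per table instead of every team maintaining its own deletion script.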


I found there is a very valuable RFC in progress, RFC-36 
(https://github.com/apache/hudi/pull/4718), proposing a Hudi metastore server 
that will store the metadata of Hudi tables. Maybe we could expand that 
RFC's scope to design and develop the lake manager, or we could raise a new 
RFC and take RFC-36 as an input.


I hope we can discuss the feasibility of this idea; any feedback would be 
greatly appreciated.
I also volunteer to contribute my part if possible.
Yue Zhang
zhangyue921...@163.com