[GitHub] [hudi] danny0405 commented on a diff in pull request #5392: [HUDI-3942] [RFC-50] Improve Timeline Server

2022-05-18 Thread GitBox


danny0405 commented on code in PR #5392:
URL: https://github.com/apache/hudi/pull/5392#discussion_r875645064


##
rfc/rfc-50/rfc-50.md:
##
@@ -0,0 +1,94 @@
+
+
+# RFC-50: Improve Timeline Server
+
+## Proposers
+- @yuzhaojing
+
+## Approvers
+ - @xushiyan
+ - @danny0405
+
+## Abstract
+
+Support clients in obtaining the timeline from the timeline server.
+
+## Background
+
+The core of Hudi is a timeline that records all operations performed on the table at different instants of time. Every read and write needs to obtain the state of the Hudi table through the timeline.
+At present, there are two ways to obtain the timeline of Hudi:
+- Create a MetaClient and get the complete timeline through MetaClient#getActiveTimeline, which directly scans the HDFS metadata directory.
+- Get the timeline through FileSystemView#getTimeline. This is the cached timeline obtained by requesting the Embedded timeline service; it avoids repeatedly scanning the HDFS metadata directory, but it only contains completed instants.
+
+### Problem description
+
+- Hudi provides the Timeline service for processing and caching metadata access, but it does not yet route all metadata access through the Timeline service; for example, obtaining the complete timeline still bypasses it.
+- As the number of write tasks grows, the large volume of repeated metadata accesses drives up HDFS NameNode requests, putting more pressure on the NameNode and making it hard to scale.
+
+### Spark and Flink write flow comparison diagram
+
+Since Hudi is designed around the Spark micro-batch model, in the Spark write path all operations on the timeline are completed on the driver side and then distributed to the executors, which start the write operation.
+
+For Flink, however, write tasks are long-running services because of its pure streaming model. There is also no highly reliable communication channel between the JM and the TM in Flink, so the TM has to obtain the latest instant for writing by polling the timeline.
+
+![](ComparisonDiagram.png)
+
+### Current
+
+![](CurrentDesign.png)
+
+The current design has two main problems with converging timeline access:
+- Since a task's timeline is pulled from the Embedded timeline service, the refresh mechanism of the Embedded timeline service itself does not take effect.
+- MetaClient and HoodieTable are decoupled: the timeline is obtained from the MetaClient, and file-related information is then requested from the Embedded timeline service through the FileSystemViewManager in HoodieTable, combined with that timeline. This creates a circular dependency, and it breaks down when a MetaClient is used alone without creating a HoodieTable.
+

Review Comment:
   No, we can try to reuse the timeline server.
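For orientation, here is a minimal sketch of the two timeline access paths described in the quoted Background section, assuming the 0.x Java client API; the exact builder calls and the view type that exposes `getTimeline()` are illustrative rather than authoritative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.timeline.HoodieTimeline;
import org.apache.hudi.common.table.view.SyncableFileSystemView;

public class TimelineAccessPaths {

  // Path 1: build a MetaClient and scan the .hoodie metadata directory directly.
  // Returns the complete active timeline, at the cost of a NameNode listing per call.
  static HoodieTimeline viaMetaClient(Configuration hadoopConf, String basePath) {
    HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
        .setConf(hadoopConf)
        .setBasePath(basePath)
        .build();
    return metaClient.getActiveTimeline();
  }

  // Path 2: reuse the timeline cached by the (embedded) timeline-server-backed view.
  // No repeated metadata scan, but only completed instants are visible.
  static HoodieTimeline viaFileSystemView(SyncableFileSystemView fsView) {
    return fsView.getTimeline();
  }
}
```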






[GitHub] [hudi] danny0405 commented on a diff in pull request #5392: [HUDI-3942] [RFC-50] Improve Timeline Server

2022-05-17 Thread GitBox


danny0405 commented on code in PR #5392:
URL: https://github.com/apache/hudi/pull/5392#discussion_r875427334


##
rfc/rfc-50/rfc-50.md:
##
@@ -0,0 +1,93 @@
+
+
+# RFC-50: Improve Timeline Server
+
+## Proposers
+- @yuzhaojing
+
+## Approvers
+ - @xushiyan
+ - @danny0405
+
+## Abstract
+
+Support client to obtain timeline from timeline server.
+
+## Background
+
+At its core, Hudi maintains a timeline of all actions performed on the table 
at different instants of time.Before each operation is performed on the Hoodie 
table, the information of the HUDI table needs to be obtained through the 
timeline.At present, there are two ways to obtain the timeline of HUDI :
+- Create a MetaClient and get the complete timeline through MetaClient 
#getActiveTimeline, which will directly scan the HDFS directory of metadata

Review Comment:
   `time.Before` -> `time. Before`
   `timeline.At present` -> `timeline. At present`






[GitHub] [hudi] danny0405 commented on a diff in pull request #5392: [HUDI-3942] [RFC-50] Improve Timeline Server

2022-05-17 Thread GitBox


danny0405 commented on code in PR #5392:
URL: https://github.com/apache/hudi/pull/5392#discussion_r875426679


##
rfc/rfc-50/rfc-50.md:
##
@@ -0,0 +1,94 @@
+
+
+# RFC-50: Improve Timeline Server
+
+## Proposers
+- @yuzhaojing
+
+## Approvers
+ - @xushiyan
+ - @danny0405
+
+## Abstract
+
+Support clients in obtaining the timeline from the timeline server.
+
+## Background
+
+The core of Hudi is a timeline that records all operations performed on the table at different instants of time. Every read and write needs to obtain the state of the Hudi table through the timeline.
+At present, there are two ways to obtain the timeline of Hudi:
+- Create a MetaClient and get the complete timeline through MetaClient#getActiveTimeline, which directly scans the HDFS metadata directory.
+- Get the timeline through FileSystemView#getTimeline. This is the cached timeline obtained by requesting the Embedded timeline service; it avoids repeatedly scanning the HDFS metadata directory, but it only contains completed instants.
+
+### Problem description
+
+- Hudi provides the Timeline service for processing and caching metadata access, but it does not yet route all metadata access through the Timeline service; for example, obtaining the complete timeline still bypasses it.
+- As the number of write tasks grows, the large volume of repeated metadata accesses drives up HDFS NameNode requests, putting more pressure on the NameNode and making it hard to scale.
+
+### Spark and Flink write flow comparison diagram
+
+Since Hudi is designed around the Spark micro-batch model, in the Spark write path all operations on the timeline are completed on the driver side and then distributed to the executors, which start the write operation.
+
+For Flink, however, write tasks are long-running services because of its pure streaming model. There is also no highly reliable communication channel between the JM and the TM in Flink, so the TM has to obtain the latest instant for writing by polling the timeline.
+
+![](ComparisonDiagram.png)
+
+### Current
+
+![](CurrentDesign.png)
+
+The current design has two main problems with converging timeline access:
+- Since a task's timeline is pulled from the Embedded timeline service, the refresh mechanism of the Embedded timeline service itself does not take effect.
+- MetaClient and HoodieTable are decoupled: the timeline is obtained from the MetaClient, and file-related information is then requested from the Embedded timeline service through the FileSystemViewManager in HoodieTable, combined with that timeline. This creates a circular dependency, and it breaks down when a MetaClient is used alone without creating a HoodieTable.
+

Review Comment:
   > We can record the last instant of the last request in fs view, and sync if there is a change
   
   Actually, the current code already caches a local timeline for each fs view instance; we can just compare that timeline's hash with the timeline hash from the client to decide whether the fs view is behind (the code already does so, see `RequestHandler#syncIfLocalViewBehind`).
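For readers following along, a simplified sketch of the hash comparison the comment describes; this is not the actual `RequestHandler` code, and the class, field, and method names below are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

// Hypothetical sketch: decide whether a server-side cached fs view is behind the client
// by comparing timeline hashes (the real logic lives in RequestHandler#syncIfLocalViewBehind).
public class TimelineHashCheck {

  /** Hash the completed instant timestamps, e.g. "20220518091011,20220518091500". */
  static String timelineHash(List<String> completedInstantTimes) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] digest = md.digest(
        String.join(",", completedInstantTimes).getBytes(StandardCharsets.UTF_8));
    StringBuilder sb = new StringBuilder();
    for (byte b : digest) {
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }

  /**
   * The client sends the hash of the timeline it planned against; if it differs from the
   * hash of the server's cached timeline, the cached view is refreshed before serving.
   */
  static boolean syncIfLocalViewBehind(String clientTimelineHash,
                                       List<String> serverCompletedInstantTimes,
                                       Runnable refreshCachedView) throws Exception {
    String localHash = timelineHash(serverCompletedInstantTimes);
    if (!localHash.equals(clientTimelineHash)) {
      refreshCachedView.run(); // reload the timeline and rebuild the cached fs view
      return true;
    }
    return false;
  }
}
```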






[GitHub] [hudi] danny0405 commented on a diff in pull request #5392: [HUDI-3942] [RFC-50] Improve Timeline Server

2022-04-26 Thread GitBox


danny0405 commented on code in PR #5392:
URL: https://github.com/apache/hudi/pull/5392#discussion_r858421514


##
rfc/rfc-50/rfc-50.md:
##
@@ -0,0 +1,94 @@
+
+
+# RFC-50: Improve Timeline Server
+
+## Proposers
+- @yuzhaojing
+
+## Approvers
+ - @xushiyan
+ - @danny0405
+
+## Abstract
+
+Support client to obtain timeline from timeline server.
+
+## Background
+
+The core of HUDI is to maintain all the operations performed by the timeline 
on the table at different times. Every time you write and read, you need to 
obtain the information of the HUDI table through the timeline.
+At present, there are two ways to obtain the timeline of HUDI :

Review Comment:
   The syntax should be re-organized.






[GitHub] [hudi] danny0405 commented on a diff in pull request #5392: [HUDI-3942] [RFC-50] Improve Timeline Server

2022-04-26 Thread GitBox


danny0405 commented on code in PR #5392:
URL: https://github.com/apache/hudi/pull/5392#discussion_r858421078


##
rfc/rfc-50/rfc-50.md:
##
@@ -0,0 +1,94 @@
+
+
+# RFC-50: Improve Timeline Server
+
+## Proposers
+- @yuzhaojing
+
+## Approvers
+ - @xushiyan
+ - @danny0405
+
+## Abstract
+
+Support clients in obtaining the timeline from the timeline server.
+
+## Background
+
+The core of Hudi is a timeline that records all operations performed on the table at different instants of time. Every read and write needs to obtain the state of the Hudi table through the timeline.
+At present, there are two ways to obtain the timeline of Hudi:
+- Create a MetaClient and get the complete timeline through MetaClient#getActiveTimeline, which directly scans the HDFS metadata directory.
+- Get the timeline through FileSystemView#getTimeline. This is the cached timeline obtained by requesting the Embedded timeline service; it avoids repeatedly scanning the HDFS metadata directory, but it only contains completed instants.
+
+### Problem description
+
+- Hudi provides the Timeline service for processing and caching metadata access, but it does not yet route all metadata access through the Timeline service; for example, obtaining the complete timeline still bypasses it.
+- As the number of write tasks grows, the large volume of repeated metadata accesses drives up HDFS NameNode requests, putting more pressure on the NameNode and making it hard to scale.
+
+### Spark and Flink write flow comparison diagram
+
+Since Hudi is designed around the Spark micro-batch model, in the Spark write path all operations on the timeline are completed on the driver side and then distributed to the executors, which start the write operation.
+
+For Flink, however, write tasks are long-running services because of its pure streaming model. There is also no highly reliable communication channel between the JM and the TM in Flink, so the TM has to obtain the latest instant for writing by polling the timeline.
+
+![](ComparisonDiagram.png)
+
+### Current
+
+![](CurrentDesign.png)
+
+The current design has two main problems with converging timeline access:
+- Since a task's timeline is pulled from the Embedded timeline service, the refresh mechanism of the Embedded timeline service itself does not take effect.
+- MetaClient and HoodieTable are decoupled: the timeline is obtained from the MetaClient, and file-related information is then requested from the Embedded timeline service through the FileSystemViewManager in HoodieTable, combined with that timeline. This creates a circular dependency, and it breaks down when a MetaClient is used alone without creating a HoodieTable.
+

Review Comment:
   Can you explain a little more about when the timeline on the timeline server is synced/refreshed? There are all kinds of async table services running during the writing process, and the timeline server should see all of their changes, such as cleaning and compaction.
   
   What I was expecting is that a timeline request from the meta client only triggers the timeline refresh on the timeline server; then, if there are subsequent requests for the fs view, the view is synced lazily.
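A rough sketch of the lazy behaviour described in the comment, under stated assumptions: all names here (`LazySyncedView`, `ensureViewUpToDate`, the loader/syncer hooks) are hypothetical, not the existing timeline-server classes. A timeline request from the meta client refreshes the server's timeline cache, while the cached fs view is only re-synced when the next fs-view request arrives.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical lazy-sync wrapper: timeline requests refresh the server's timeline cache
// immediately, but the (more expensive) file system view is only synced on demand.
public class LazySyncedView {

  private final Supplier<List<String>> timelineLoader; // e.g. re-scan the metadata dir
  private final Runnable fsViewSyncer;                  // e.g. rebuild/sync the cached fs view
  private final AtomicBoolean viewBehind = new AtomicBoolean(false);
  private volatile List<String> cachedTimeline;

  public LazySyncedView(Supplier<List<String>> timelineLoader, Runnable fsViewSyncer) {
    this.timelineLoader = timelineLoader;
    this.fsViewSyncer = fsViewSyncer;
    this.cachedTimeline = timelineLoader.get();
  }

  /** Called for timeline requests from the meta client: refresh the timeline cache only. */
  public synchronized List<String> getTimeline() {
    List<String> latest = timelineLoader.get();
    if (!latest.equals(cachedTimeline)) {
      cachedTimeline = latest;
      viewBehind.set(true); // defer the fs view rebuild until someone actually needs it
    }
    return cachedTimeline;
  }

  /** Called before serving any fs-view request: sync lazily if the timeline has moved. */
  public synchronized void ensureViewUpToDate() {
    if (viewBehind.compareAndSet(true, false)) {
      fsViewSyncer.run();
    }
  }
}
```

In this shape, frequent timeline polling (e.g. from Flink TMs) stays cheap, and the fs view is rebuilt at most once per timeline change, which matches the lazy sync the comment asks for.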


