zhangyue19921010 commented on code in PR #6600:
URL: https://github.com/apache/hudi/pull/6600#discussion_r971808477


##########
rfc/rfc-62/rfc-62.md:
##########
@@ -0,0 +1,443 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-62: Diagnostic Reporter
+
+
+
+## Proposers
+
+- zhangyue19921...@163.com
+
+## Approvers
+ - @codope
+ - @xushiyan
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4707
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+With the development of hudi, more and more users choose hudi to build their own ingestion pipelines to support real-time or batch upsert requirements.
+Subsequently, some of them may ask the community for help, for example: How can they improve the performance of their hudi ingestion jobs? Why did their hudi jobs fail?
+
+For volunteers in the hudi community, dealing with such issues usually means asking users to provide a list of information, including engine context, job configs,
+data pattern, Spark UI, etc. Users then need to spend extra effort reviewing their own jobs, collecting metrics one by one according to the list, and feeding the results back to the volunteers.
+Moreover, unexpected errors may occur while users manually collect this information.
+
+Clearly, this imposes a relatively high communication cost on both volunteers and users.
+
+On the other hand, advanced users also need an efficient way to understand the characteristics of their hudi tables, including data volume, upsert pattern, and so on.
+
+## Background
+As we know, hudi already has its own unique metrics system and metadata framework. This information is very important for hudi job tuning and troubleshooting. For example:
+
+1. Hudi records the complete timeline in the `.hoodie` directory, including the active timeline and the archived timeline. From this we can trace the historical state of a hudi job.
+
+2. The hudi metadata table records all the partitions, all the data files, etc.
+
+3. Each hudi commit records various metadata and runtime metrics for the data currently written, such as:
+```json
+{
+    "partitionToWriteStats":{
+        "20210623/0/20210825":[
+            {
+                "fileId":"4ae31921-eedd-4c56-8218-bb47849397a4-0",
+                "path":"20210623/0/20210825/4ae31921-eedd-4c56-8218-bb47849397a4-0_0-27-2006_20220818134233973.parquet",
+                "prevCommit":"null",
+                "numWrites":123352,
+                "numDeletes":0,
+                "numUpdateWrites":0,
+                "numInserts":123352,
+                "totalWriteBytes":4675371,
+                "totalWriteErrors":0,
+                "tempPath":null,
+                "partitionPath":"20210623/0/20210825",
+                "totalLogRecords":0,
+                "totalLogFilesCompacted":0,
+                "totalLogSizeCompacted":0,
+                "totalUpdatedRecordsCompacted":0,
+                "totalLogBlocks":0,
+                "totalCorruptLogBlock":0,
+                "totalRollbackBlocks":0,
+                "fileSizeInBytes":4675371,
+                "minEventTime":null,
+                "maxEventTime":null
+            }
+        ]
+    },
+    "compacted":false,
+    "extraMetadata":{
+        "schema":"xxxx"
+    },
+    "operationType":"UPSERT",
+    "totalRecordsDeleted":0,
+    "totalLogFilesSize":0,
+    "totalScanTime":0,
+    "totalCreateTime":21051,
+    "totalUpsertTime":0,
+    "minAndMaxEventTime":{
+        "Optional.empty":{
+            "val":null,
+            "present":false
+        }
+    },
+    "writePartitionPaths":[
+        "20210623/0/20210825"
+    ],
+    "fileIdAndRelativePaths":{
+        "c144908e-ca7d-401f-be1c-613de98d96a3-0":"20210623/0/20210825/c144908e-ca7d-401f-be1c-613de98d96a3-0_3-33-2009_20220818134233973.parquet"
+    },
+    "totalLogRecordsCompacted":0,
+    "totalLogFilesCompacted":0,
+    "totalCompactedRecordsUpdated":0
+}
+```
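As a rough illustration of how this per-commit metadata can feed a diagnostic report, the sketch below (plain Python, not the actual tool's API; the field names come from the commit JSON above) aggregates write stats across partitions:

```python
import json

def summarize_commit(commit_json: str) -> dict:
    """Aggregate per-partition write stats from a hudi commit metadata JSON string."""
    meta = json.loads(commit_json)
    summary = {"numWrites": 0, "numUpdateWrites": 0, "numInserts": 0, "totalWriteBytes": 0}
    # partitionToWriteStats maps each partition path to a list of per-file write stats.
    for stats in meta.get("partitionToWriteStats", {}).values():
        for stat in stats:
            for key in summary:
                summary[key] += stat.get(key, 0)
    return summary

# Minimal example payload mirroring the commit metadata shown above:
example = json.dumps({
    "partitionToWriteStats": {
        "20210623/0/20210825": [
            {"numWrites": 123352, "numInserts": 123352,
             "numUpdateWrites": 0, "totalWriteBytes": 4675371}
        ]
    }
})
print(summarize_commit(example))
```

A real diagnostic reporter would read these payloads from the completed commit files in the timeline rather than from an in-memory string.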
+
+In order to expose hudi table context more efficiently, this RFC proposes a Diagnostic Reporter Tool.
+This tool can be turned on as the final stage of an ingestion job, after the commit: it will collect common troubleshooting information, including engine runtime information (taking Spark as the example here), and generate a diagnostic report json file.
+
+Alternatively, users can trigger this diagnostic reporter tool through hudi-cli to generate the report json file.
+
+## Implementation
+
+This Diagnostic Reporter Tool will go through the whole hudi table and generate a report json file containing all the necessary information. The tool will also package the `.hoodie` folder as a zip compressed file.
+
+Users can use this Diagnostic Reporter Tool in the following two ways:
+1. Users can enable the diagnostic reporter directly in their writing jobs; the tool will then go through the current hudi table and generate the report files as the last stage after the commit.
+2. Users can generate the corresponding report file for a hudi table through the hudi-cli command.

Review Comment:
   Changed.


