hudi-agent commented on code in PR #19072:
URL: https://github.com/apache/hudi/pull/19072#discussion_r3485050811


##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -865,7 +865,18 @@ public interface SerializableFunction<T, R> extends 
Function<T, R>, Serializable
   private static Option<HoodieLogFile> getLatestLogFile(Stream<HoodieLogFile> 
logFiles) {
     return 
Option.fromJavaOptional(logFiles.min(HoodieLogFile.getReverseLogFileComparator()));
   }
-  
+
+  /**
+   * Return the file size of a log file.
+   */
+  public static long getFileSize(HoodieStorage storage, HoodieLogFile logFile) 
{
+    try {
+      return logFile.getFileSize() >= 0 ? logFile.getFileSize() : 
storage.getPathInfo(logFile.getPath()).getLength();

Review Comment:
   πŸ€– nit: `logFile.getFileSize()` is called twice in the same expression β€” 
could you extract it to a local variable (`long fileSize = 
logFile.getFileSize()`) so a reader doesn't have to wonder whether the double 
call is intentional?
   
   <sub><i>⚠️ AI-generated; verify before applying. React πŸ‘/πŸ‘Ž to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##########
@@ -530,7 +530,8 @@ private Stream<HoodieLogFile> 
convertFileStatusesToLogFiles(List<StoragePathInfo
     String logFileExtension = 
metaClient.getTableConfig().getLogFileFormat().getFileExtension();
     Predicate<StoragePathInfo> rtFilePredicate = pathInfo -> {
       String fileName = pathInfo.getPath().getName();
-      return FSUtils.isLogFile(pathInfo.getPath()) && 
fileName.contains(logFileExtension);
+      return FSUtils.isLogFile(pathInfo.getPath())
+          && (FSUtils.matchNativeLogFile(fileName).isPresent() || 
fileName.contains(logFileExtension));

Review Comment:
   πŸ€– This now surfaces native `.cdc.parquet` files in a file slice's log files 
(the new test asserts this). But the read path excludes CDC logs in 
`InputSplit` via `getFileName().endsWith(CDC_LOGFILE_SUFFIX)` (".cdc"), which 
won't match `.cdc.parquet`. Since `HoodieNativeLogFileReader` treats any 
non-delete native file as a data block, could native CDC logs leak into 
snapshot/merge reads here? @yihua
   
   <sub><i>⚠️ AI-generated; verify before applying. React πŸ‘/πŸ‘Ž to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/NativeLogFooterMetadata.java:
##########
@@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.table.log;
+
+import org.apache.hudi.common.model.HoodieFileFormat;
+import org.apache.hudi.common.schema.HoodieSchema;
+import 
org.apache.hudi.common.table.log.block.HoodieLogBlock.HeaderMetadataType;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.HoodieNotSupportedException;
+import org.apache.hudi.io.storage.HoodieIOFactory;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+
+import com.fasterxml.jackson.core.type.TypeReference;
+
+import java.io.IOException;
+import java.util.LinkedHashMap;
+import java.util.Map;
+import java.util.regex.Matcher;
+
+/**
+ * Shared on-disk contract for RFC-103 native log files. The log block header 
(schema, instant time,
+ * etc.) is persisted as a single JSON entry in the native file footer, keyed 
by
+ * {@link #FOOTER_METADATA_KEY}. This is the single source of truth used by 
both the write path
+ * ({@code HoodieNativeLogFormatWriter}) and the read path ({@code 
HoodieNativeLogFileReader}) so the
+ * two cannot drift apart.
+ */
+public class NativeLogFooterMetadata {
+
+  /**
+   * Footer key under which the serialized log block header is stored in a 
native log file.
+   */
+  public static final String FOOTER_METADATA_KEY = "hudi.log.format.metadata";
+
+  /**
+   * Current native log format version. Bump this when the footer layout 
changes in a
+   * backward-incompatible way; readers reject files written with a newer 
version.
+   */
+  public static final int CURRENT_FORMAT_VERSION = 2;
+
+  private static final TypeReference<LinkedHashMap<String, String>> MAP_TYPE =
+      new TypeReference<LinkedHashMap<String, String>>() {
+      };
+
+  private NativeLogFooterMetadata() {
+  }
+
+  /**
+   * Serializes the log block header into the footer key/value map written to 
the native file.
+   * The {@link HeaderMetadataType#VERSION} entry is injected automatically.
+   */
+  public static Map<String, String> toFooterMetadata(Map<HeaderMetadataType, 
String> header) {
+    Map<String, String> logFormatMetadata = new LinkedHashMap<>();
+    logFormatMetadata.put(HeaderMetadataType.VERSION.name(), 
String.valueOf(CURRENT_FORMAT_VERSION));
+    header.forEach((key, value) -> {
+      if (value != null) {
+        logFormatMetadata.put(key.name(), value);
+      }
+    });
+    String serialized;
+    try {
+      serialized = 
JsonUtils.getObjectMapper().writeValueAsString(logFormatMetadata);
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to serialize native log footer 
metadata", e);
+    }
+    Map<String, String> footer = new LinkedHashMap<>();
+    footer.put(FOOTER_METADATA_KEY, serialized);
+    return footer;
+  }
+
+  /**
+   * Parses the log block header back from the native file footer key/value 
map. Unknown header
+   * types are ignored so that files written by newer minor versions remain 
readable.
+   */
+  public static Map<HeaderMetadataType, String> fromFooterMetadata(Map<String, 
String> footerMetadata) {
+    Map<HeaderMetadataType, String> header = new LinkedHashMap<>();
+    String serialized = footerMetadata.get(FOOTER_METADATA_KEY);
+    if (serialized == null) {
+      return header;
+    }
+    Map<String, String> logFormatMetadata;
+    try {
+      logFormatMetadata = JsonUtils.getObjectMapper().readValue(serialized, 
MAP_TYPE);
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to parse native log footer metadata: 
" + serialized, e);
+    }
+    logFormatMetadata.forEach((name, value) -> {
+      HeaderMetadataType type = headerTypeOrNull(name);
+      if (type != null) {
+        header.put(type, value);
+      }
+    });
+    validateFormatVersion(header);
+    return header;
+  }
+
+  /**
+   * Reads the table schema embedded in a native data log file footer.
+   *
+   * <p>Native data logs persist the synthetic log-block header, including 
{@link HeaderMetadataType#SCHEMA},
+   * in the footer metadata entry keyed by {@link #FOOTER_METADATA_KEY}. This 
method is intended for
+   * schema discovery paths that do not have a {@code HoodieReaderContext} and 
therefore cannot construct
+   * a {@code HoodieNativeLogFileReader}. Native delete logs do not expose a 
generic table schema through
+   * this API because their schema depends on the table schema and configured 
ordering fields; those files
+   * return {@code null}, matching the legacy behavior when no data-block 
schema can be discovered.
+   *
+   * @param storage          storage used to read the native log file footer
+   * @param path             native log file path
+   * @param nativeLogMatcher matcher returned by {@code 
FSUtils.matchNativeLogFile(path.getName())}
+   * @return table schema for native data logs, or {@code null} for native 
non-data logs
+   */
+  public static HoodieSchema readSchemaFromNativeLogFile(
+      HoodieStorage storage, StoragePath path, Matcher nativeLogMatcher) {
+    String nativeLogType = nativeLogMatcher.group(8);

Review Comment:
   πŸ€– nit: `group(8)` and `group(9)` below are opaque without cross-referencing 
the regex in `FSUtils`. Could you extract named constants (e.g. 
`NATIVE_LOG_TYPE_GROUP = 8`, `NATIVE_LOG_EXTENSION_GROUP = 9`) here or in 
`FSUtils`, so a reader can understand what each group captures at a glance?
   
   <sub><i>⚠️ AI-generated; verify before applying. React πŸ‘/πŸ‘Ž to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to