[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible

2023-08-03 Thread via GitHub


the-other-tim-brown commented on code in PR #9337:
URL: https://github.com/apache/hudi/pull/9337#discussion_r1283557623


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieBaseFile.java:
##
@@ -52,14 +61,45 @@ public HoodieBaseFile(String filePath) {
   public HoodieBaseFile(String filePath, BaseFile bootstrapBaseFile) {
 super(filePath);
 this.bootstrapBaseFile = Option.ofNullable(bootstrapBaseFile);
+String[] fileIdAndCommitTime = getFileIdAndCommitTimeFromFileName();
+this.fileId = fileIdAndCommitTime[0];
+this.commitTime = fileIdAndCommitTime[1];
+  }
+
+  /**
+   * Parses the file ID and commit time from the fileName.
+   * @return String array of size 2 with fileId as the first and commitTime as 
the second element.
+   */
+  private String[] getFileIdAndCommitTimeFromFileName() {

Review Comment:
   Do you see any other places that would require this? I can add it and update 
callers if so, but I wanted to avoid adding more to that class if possible.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible

2023-08-03 Thread via GitHub


the-other-tim-brown commented on code in PR #9337:
URL: https://github.com/apache/hudi/pull/9337#discussion_r1283399352


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java:
##
@@ -214,12 +214,16 @@ protected List> 
loadColumnRangesFromMetaIndex(
 
 String keyField = 
hoodieTable.getMetaClient().getTableConfig().getRecordKeyFieldProp();
 
+List> baseFilesForAllPartitions = 
HoodieIndexUtils.getLatestBaseFilesForAllPartitions(partitions, context, 
hoodieTable);

Review Comment:
   Line 219 in the original, same question about whether this is really adding 
more memory usage. Can you elaborate on this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible

2023-08-03 Thread via GitHub


the-other-tim-brown commented on code in PR #9337:
URL: https://github.com/apache/hudi/pull/9337#discussion_r1283384777


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -481,10 +482,10 @@ private Pair> 
initializeRecordIndexPartition()
 // Collect the list of latest base files present in each partition
 List partitions = metadata.getAllPartitionPaths();
 fsView.loadAllPartitions();
-final List> partitionBaseFilePairs = new 
ArrayList<>();
+final List> partitionBaseFilePairs = new 
ArrayList<>();

Review Comment:
   Ok I'm a bit confused here. The HoodieBaseFile is and always was created at 
line 487 below so we are creating the object regardless. When we create a list 
with these objects, we will store a reference to the object and not make a copy 
of the object, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible

2023-08-03 Thread via GitHub


the-other-tim-brown commented on code in PR #9337:
URL: https://github.com/apache/hudi/pull/9337#discussion_r1283381296


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java:
##
@@ -214,12 +214,16 @@ protected List> 
loadColumnRangesFromMetaIndex(
 
 String keyField = 
hoodieTable.getMetaClient().getTableConfig().getRecordKeyFieldProp();
 
+List> baseFilesForAllPartitions = 
HoodieIndexUtils.getLatestBaseFilesForAllPartitions(partitions, context, 
hoodieTable);
+List> partitionFileNameList = new 
ArrayList<>(baseFilesForAllPartitions.size());
+Map, String> partitionAndFileNameToFileId = new 
HashMap<>(baseFilesForAllPartitions.size());
+baseFilesForAllPartitions.forEach(pair -> {
+  Pair parititonAndFileName = Pair.of(pair.getKey(), 
pair.getValue().getFileName());
+  partitionFileNameList.add(parititonAndFileName);
+  partitionAndFileNameToFileId.put(parititonAndFileName, 
pair.getValue().getFileId());
+});
 // Partition and file name pairs
-List> partitionFileNameList =
-HoodieIndexUtils.getLatestBaseFilesForAllPartitions(partitions, 
context, hoodieTable).stream()
-.map(partitionBaseFilePair -> 
Pair.of(partitionBaseFilePair.getLeft(), 
partitionBaseFilePair.getRight().getFileName()))
-.sorted()
-.collect(toList());
+Collections.sort(partitionFileNameList); // TODO why does this need to be 
sorted?

Review Comment:
   Is there good testing around this? I can remove and make sure the tests 
still pass



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible

2023-08-03 Thread via GitHub


the-other-tim-brown commented on code in PR #9337:
URL: https://github.com/apache/hudi/pull/9337#discussion_r1283380361


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java:
##
@@ -114,7 +115,8 @@ public HoodieLogFileReader(FileSystem fs, HoodieLogFile 
logFile, Schema readerSc
 // NOTE: We repackage {@code HoodieLogFile} here to make sure that the 
provided path
 //   is prefixed with an appropriate scheme given that we're not 
propagating the FS
 //   further
-this.logFile = new HoodieLogFile(FSUtils.makeQualified(fs, 
logFile.getPath()), logFile.getFileSize());
+Path updatedPath = FSUtils.makeQualified(fs, logFile.getPath());
+this.logFile = updatedPath.equals(logFile.getPath()) ? logFile : new 
HoodieLogFile(updatedPath, logFile.getFileSize());

Review Comment:
   This is to avoid creating an extra object and recomputing the metadata from 
the file name



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible

2023-08-01 Thread via GitHub


the-other-tim-brown commented on code in PR #9337:
URL: https://github.com/apache/hudi/pull/9337#discussion_r1281305345


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java:
##
@@ -214,12 +214,16 @@ protected List> 
loadColumnRangesFromMetaIndex(
 
 String keyField = 
hoodieTable.getMetaClient().getTableConfig().getRecordKeyFieldProp();
 
+List> baseFilesForAllPartitions = 
HoodieIndexUtils.getLatestBaseFilesForAllPartitions(partitions, context, 
hoodieTable);
+List> partitionFileNameList = new 
ArrayList<>(baseFilesForAllPartitions.size());
+Map, String> partitionAndFileNameToFileId = new 
HashMap<>(baseFilesForAllPartitions.size());
+baseFilesForAllPartitions.forEach(pair -> {
+  Pair parititonAndFileName = Pair.of(pair.getKey(), 
pair.getValue().getFileName());
+  partitionFileNameList.add(parititonAndFileName);
+  partitionAndFileNameToFileId.put(parititonAndFileName, 
pair.getValue().getFileId());
+});
 // Partition and file name pairs
-List> partitionFileNameList =
-HoodieIndexUtils.getLatestBaseFilesForAllPartitions(partitions, 
context, hoodieTable).stream()
-.map(partitionBaseFilePair -> 
Pair.of(partitionBaseFilePair.getLeft(), 
partitionBaseFilePair.getRight().getFileName()))
-.sorted()
-.collect(toList());
+Collections.sort(partitionFileNameList); // TODO why does this need to be 
sorted?

Review Comment:
   @nsivabalan or @yihua do you know why this partitionFileNameList has to be 
sorted?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible

2023-08-01 Thread via GitHub


the-other-tim-brown commented on code in PR #9337:
URL: https://github.com/apache/hudi/pull/9337#discussion_r1281245951


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieBaseFile.java:
##
@@ -52,14 +59,39 @@ public HoodieBaseFile(String filePath) {
   public HoodieBaseFile(String filePath, BaseFile bootstrapBaseFile) {
 super(filePath);
 this.bootstrapBaseFile = Option.ofNullable(bootstrapBaseFile);
+String[] fileIdAndCommitTime = getFileIdAndCommitTimeFromFileName();
+this.fileId = fileIdAndCommitTime[0];
+this.commitTime = fileIdAndCommitTime[1];
+  }
+
+  private String[] getFileIdAndCommitTimeFromFileName() {
+String[] values = new String[2];
+short underscoreCount = 0;
+short lastUnderscoreIndex = 0;
+for (int i = 0; i < fileName.length(); i++) {
+  char c = fileName.charAt(i);
+  if (c == '_') {
+if (underscoreCount == 0) {
+  values[0] = fileName.substring(0, i);
+}
+lastUnderscoreIndex = (short) i;
+underscoreCount++;
+  } else if (c == '.') {
+if (underscoreCount == 2) {
+  values[1] = fileName.substring(lastUnderscoreIndex + 1, i);
+  break;
+}
+  }
+}
+return values;

Review Comment:
   I don't think we strictly require file IDs to be a UUID so I don't think 
this will work. It will require multiple traversals of the string which is what 
I was trying to avoid.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible

2023-08-01 Thread via GitHub


the-other-tim-brown commented on code in PR #9337:
URL: https://github.com/apache/hudi/pull/9337#discussion_r1281188320


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieLogFile.java:
##
@@ -47,64 +52,86 @@ public class HoodieLogFile implements Serializable {
 
   private transient FileStatus fileStatus;
   private final String pathStr;
+  private final String fileId;
+  private final String baseCommitTime;
+  private final int logVersion;
+  private final String logWriteToken;
+  private final String fileExtension;
+  private final String suffix;
+  private final Path path;
   private long fileLen;
 
   public HoodieLogFile(HoodieLogFile logFile) {
-this.fileStatus = logFile.fileStatus;
-this.pathStr = logFile.pathStr;
-this.fileLen = logFile.fileLen;
+this(logFile.getFileStatus(), logFile.getPath(), logFile.pathStr, 
logFile.getFileSize());
   }
 
   public HoodieLogFile(FileStatus fileStatus) {
-this.fileStatus = fileStatus;
-this.pathStr = fileStatus.getPath().toString();
-this.fileLen = fileStatus.getLen();
+this(fileStatus, fileStatus.getPath(), fileStatus.getPath().toString(), 
fileStatus.getLen());
   }
 
   public HoodieLogFile(Path logPath) {
-this.fileStatus = null;
-this.pathStr = logPath.toString();
-this.fileLen = -1;
+this(null, logPath, logPath.toString(), -1);
   }
 
-  public HoodieLogFile(Path logPath, Long fileLen) {
-this.fileStatus = null;
-this.pathStr = logPath.toString();
-this.fileLen = fileLen;
+  public HoodieLogFile(Path logPath, long fileLen) {
+this(null, logPath, logPath.toString(), fileLen);
   }
 
   public HoodieLogFile(String logPathStr) {
-this.fileStatus = null;
+this(null, null, logPathStr, -1);
+  }
+
+  private HoodieLogFile(FileStatus fileStatus, Path logPath, String 
logPathStr, long fileLen) {
+this.fileStatus = fileStatus;
 this.pathStr = logPathStr;
-this.fileLen = -1;
+this.fileLen = fileLen;
+if (logPath != null) {
+  if (logPath instanceof CachingPath) {
+this.path = logPath;
+  } else {
+this.path = new CachingPath(logPath.getParent(), logPath.getName());
+  }
+} else {
+  this.path = new CachingPath(pathStr);
+}
+Matcher matcher = LOG_FILE_PATTERN.matcher(path.getName());
+if (!matcher.find()) {
+  throw new InvalidHoodiePathException(path, "LogFile");
+}
+this.fileId = matcher.group(1);
+this.baseCommitTime = matcher.group(2);
+this.fileExtension = matcher.group(3);
+this.logVersion = Integer.parseInt(matcher.group(4));
+this.logWriteToken = matcher.group(6);
+this.suffix = matcher.group(10);
   }
 
   public String getFileId() {
-return FSUtils.getFileIdFromLogPath(getPath());

Review Comment:
   Previously all of these methods would construct a path object and then run a 
matcher on the fileName from that path. Now we'll make a single Path object 
when creating the object and we'll run the matcher once and extract all the 
values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org