[GitHub] [hudi] vinothchandar commented on a change in pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.

2021-03-15 Thread GitBox


vinothchandar commented on a change in pull request #2494:
URL: https://github.com/apache/hudi/pull/2494#discussion_r594656644



##
File path: 
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
##
@@ -232,12 +246,13 @@ public long getTotalRecords() {
   }
 
   @Override
-  public void close() {
+  public synchronized void close() {
 try {
   reader.close();
   reader = null;
+  keyScanner = null;
 } catch (IOException e) {
-  e.printStackTrace();

Review comment:
   @prashantwason fixed this as well. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.

2021-03-14 Thread GitBox


vinothchandar commented on a change in pull request #2494:
URL: https://github.com/apache/hudi/pull/2494#discussion_r593866290



##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
##
@@ -147,82 +150,91 @@ private void initIfNeeded() {
 }
   }
   timings.add(timer.endTimer());
-  LOG.info(String.format("Metadata read for key %s took [open, 
baseFileRead, logMerge] %s ms", key, timings));
+  LOG.info(String.format("Metadata read for key %s took [baseFileRead, 
logMerge] %s ms", key, timings));
   return Option.ofNullable(hoodieRecord);
 } catch (IOException ioe) {
   throw new HoodieIOException("Error merging records from metadata table 
for key :" + key, ioe);
-} finally {

Review comment:
   this close is actually needed, when opening the metadata table from the 
executors. Otherwise, we will leak and suffer the same issues as before. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.

2021-02-28 Thread GitBox


vinothchandar commented on a change in pull request #2494:
URL: https://github.com/apache/hudi/pull/2494#discussion_r584393784



##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
##
@@ -112,13 +113,59 @@ private void initIfNeeded() {
 
   @Override
   protected Option> 
getRecordByKeyFromMetadata(String key) {
+// This function can be called in parallel through multiple threads. For 
each thread, we determine the thread-local
+// versions of the baseFile and logRecord readers to use.
+// - If reuse is enabled, we use the same readers and dont close them
+// - if reuse is disabled, we open new readers in each thread and close 
them
+HoodieFileReader localFileReader = null;
+HoodieMetadataMergedLogRecordScanner localLogRecordScanner = null;
+synchronized (this) {
+  if (!metadataConfig.enableReuse()) {
+// reuse is disabled so always open new readers
+try {
+  Pair readers 
= openReaders();
+  localFileReader = readers.getKey();
+  localLogRecordScanner = readers.getValue();
+} catch (IOException e) {
+  throw new HoodieIOException("Error opening readers", e);
+}
+  } else if (baseFileReader == null && logRecordScanner == null) {
+// reuse is enabled but we haven't opened the readers yet
+try {
+  Pair readers 
= openReaders();
+  localFileReader = readers.getKey();
+  localLogRecordScanner = readers.getValue();
+  // cache the readers
+  baseFileReader = localFileReader;
+  logRecordScanner = localLogRecordScanner;
+} catch (IOException e) {
+  throw new HoodieIOException("Error opening readers", e);
+}
+  } else {
+// reuse the already open readers
+ValidationUtils.checkState((baseFileReader != null || logRecordScanner 
!= null), "Readers should already be open");
+localFileReader = baseFileReader;
+localLogRecordScanner = logRecordScanner;
+  }
+}

Review comment:
   @prashantwason and I already synced on this. Will be catching up on 
reviews. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.

2021-02-17 Thread GitBox


vinothchandar commented on a change in pull request #2494:
URL: https://github.com/apache/hudi/pull/2494#discussion_r577798704



##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
##
@@ -112,13 +113,59 @@ private void initIfNeeded() {
 
   @Override
   protected Option> 
getRecordByKeyFromMetadata(String key) {
+// This function can be called in parallel through multiple threads. For 
each thread, we determine the thread-local
+// versions of the baseFile and logRecord readers to use.
+// - If reuse is enabled, we use the same readers and dont close them
+// - if reuse is disabled, we open new readers in each thread and close 
them
+HoodieFileReader localFileReader = null;
+HoodieMetadataMergedLogRecordScanner localLogRecordScanner = null;
+synchronized (this) {
+  if (!metadataConfig.enableReuse()) {
+// reuse is disabled so always open new readers
+try {
+  Pair readers 
= openReaders();
+  localFileReader = readers.getKey();
+  localLogRecordScanner = readers.getValue();
+} catch (IOException e) {
+  throw new HoodieIOException("Error opening readers", e);
+}
+  } else if (baseFileReader == null && logRecordScanner == null) {
+// reuse is enabled but we haven't opened the readers yet
+try {
+  Pair readers 
= openReaders();
+  localFileReader = readers.getKey();
+  localLogRecordScanner = readers.getValue();
+  // cache the readers
+  baseFileReader = localFileReader;
+  logRecordScanner = localLogRecordScanner;
+} catch (IOException e) {
+  throw new HoodieIOException("Error opening readers", e);
+}
+  } else {
+// reuse the already open readers
+ValidationUtils.checkState((baseFileReader != null || logRecordScanner 
!= null), "Readers should already be open");
+localFileReader = baseFileReader;
+localLogRecordScanner = logRecordScanner;
+  }
+}

Review comment:
   I suggest, we just stick to the instance variables and initialize/close 
them lazily as needed. This is typically the OO pattern





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.

2021-02-17 Thread GitBox


vinothchandar commented on a change in pull request #2494:
URL: https://github.com/apache/hudi/pull/2494#discussion_r577537335



##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
##
@@ -112,13 +113,59 @@ private void initIfNeeded() {
 
   @Override
   protected Option> 
getRecordByKeyFromMetadata(String key) {
+// This function can be called in parallel through multiple threads. For 
each thread, we determine the thread-local
+// versions of the baseFile and logRecord readers to use.
+// - If reuse is enabled, we use the same readers and dont close them
+// - if reuse is disabled, we open new readers in each thread and close 
them
+HoodieFileReader localFileReader = null;
+HoodieMetadataMergedLogRecordScanner localLogRecordScanner = null;
+synchronized (this) {

Review comment:
   if we synchronize access here, why synchronize at the lower levels?

##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
##
@@ -112,13 +113,59 @@ private void initIfNeeded() {
 
   @Override
   protected Option> 
getRecordByKeyFromMetadata(String key) {
+// This function can be called in parallel through multiple threads. For 
each thread, we determine the thread-local
+// versions of the baseFile and logRecord readers to use.
+// - If reuse is enabled, we use the same readers and dont close them
+// - if reuse is disabled, we open new readers in each thread and close 
them
+HoodieFileReader localFileReader = null;
+HoodieMetadataMergedLogRecordScanner localLogRecordScanner = null;
+synchronized (this) {
+  if (!metadataConfig.enableReuse()) {

Review comment:
   can we please go back to the previous style of lazily opening/closing 
depending on the config and hiding it from the flow of reading and processing 
the values. 
   
   Happy to jump of a call if needed. But typically, reusing fewer variables 
and fewer branching the better. This does add signficant tax when reading the 
code. Thats why I changed them this way





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.

2021-01-28 Thread GitBox


vinothchandar commented on a change in pull request #2494:
URL: https://github.com/apache/hudi/pull/2494#discussion_r566564182



##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
##
@@ -188,41 +196,51 @@ private synchronized void openFileSliceIfNeeded() throws 
IOException {
 
 // Load the schema
 Schema schema = 
HoodieAvroUtils.addMetadataFields(HoodieMetadataRecord.getClassSchema());
-logRecordScanner = new 
HoodieMetadataMergedLogRecordScanner(metaClient.getFs(), metadataBasePath,
-logFilePaths, schema, latestMetaInstantTimestamp, 
MAX_MEMORY_SIZE_IN_BYTES, BUFFER_SIZE,
+HoodieMetadataMergedLogRecordScanner logRecordScanner = new 
HoodieMetadataMergedLogRecordScanner(metaClient.getFs(),
+metadataBasePath, logFilePaths, schema, 
latestMetaInstantTimestamp, MAX_MEMORY_SIZE_IN_BYTES, BUFFER_SIZE,
 spillableMapDirectory, null);
 
 LOG.info("Opened metadata log files from " + logFilePaths + " at instant " 
+ latestInstantTime
 + "(dataset instant=" + latestInstantTime + ", metadata instant=" + 
latestMetaInstantTimestamp + ")");
 
 metrics.ifPresent(metrics -> 
metrics.updateMetrics(HoodieMetadataMetrics.SCAN_STR, timer.endTimer()));
+
+if (metadataConfig.enableReuse()) {
+  // cache for later reuse
+  cachedBaseFileReader = baseFileReader;
+  cachedLogRecordScanner = logRecordScanner;
+}
+
+return Pair.of(baseFileReader, logRecordScanner);
   }
 
-  private void closeIfNeeded() {
+  private void closeIfNeeded(Pair readers) {

Review comment:
   I prefer the previous approach for readability.  i.e have it just close 
the member variables. If you disagree, may be we can chat about why you think 
this is better for reading. I did face this issue when I originally read the 
code here

##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
##
@@ -188,41 +196,51 @@ private synchronized void openFileSliceIfNeeded() throws 
IOException {
 
 // Load the schema
 Schema schema = 
HoodieAvroUtils.addMetadataFields(HoodieMetadataRecord.getClassSchema());
-logRecordScanner = new 
HoodieMetadataMergedLogRecordScanner(metaClient.getFs(), metadataBasePath,
-logFilePaths, schema, latestMetaInstantTimestamp, 
MAX_MEMORY_SIZE_IN_BYTES, BUFFER_SIZE,
+HoodieMetadataMergedLogRecordScanner logRecordScanner = new 
HoodieMetadataMergedLogRecordScanner(metaClient.getFs(),
+metadataBasePath, logFilePaths, schema, 
latestMetaInstantTimestamp, MAX_MEMORY_SIZE_IN_BYTES, BUFFER_SIZE,
 spillableMapDirectory, null);
 
 LOG.info("Opened metadata log files from " + logFilePaths + " at instant " 
+ latestInstantTime
 + "(dataset instant=" + latestInstantTime + ", metadata instant=" + 
latestMetaInstantTimestamp + ")");
 
 metrics.ifPresent(metrics -> 
metrics.updateMetrics(HoodieMetadataMetrics.SCAN_STR, timer.endTimer()));
+
+if (metadataConfig.enableReuse()) {

Review comment:
   We are sort of creating two code paths again for reuse and non-reuse. 
Can we please go back to just always initializing the member variable here 
always? 
   then `closeIfNeeded()` can continue to work with members alone. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org