[GitHub] [hudi] vinothchandar commented on a change in pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
vinothchandar commented on a change in pull request #2494: URL: https://github.com/apache/hudi/pull/2494#discussion_r594656644 ## File path: hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java ## @@ -232,12 +246,13 @@ public long getTotalRecords() { } @Override - public void close() { + public synchronized void close() { try { reader.close(); reader = null; + keyScanner = null; } catch (IOException e) { - e.printStackTrace(); Review comment: @prashantwason fixed this as well. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on a change in pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
vinothchandar commented on a change in pull request #2494: URL: https://github.com/apache/hudi/pull/2494#discussion_r593866290 ## File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java ## @@ -147,82 +150,91 @@ private void initIfNeeded() { } } timings.add(timer.endTimer()); - LOG.info(String.format("Metadata read for key %s took [open, baseFileRead, logMerge] %s ms", key, timings)); + LOG.info(String.format("Metadata read for key %s took [baseFileRead, logMerge] %s ms", key, timings)); return Option.ofNullable(hoodieRecord); } catch (IOException ioe) { throw new HoodieIOException("Error merging records from metadata table for key :" + key, ioe); -} finally { Review comment: this close is actually needed, when opening the metadata table from the executors. Otherwise, we will leak and suffer the same issues as before. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on a change in pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
vinothchandar commented on a change in pull request #2494: URL: https://github.com/apache/hudi/pull/2494#discussion_r584393784 ## File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java ## @@ -112,13 +113,59 @@ private void initIfNeeded() { @Override protected Option> getRecordByKeyFromMetadata(String key) { +// This function can be called in parallel through multiple threads. For each thread, we determine the thread-local +// versions of the baseFile and logRecord readers to use. +// - If reuse is enabled, we use the same readers and dont close them +// - if reuse is disabled, we open new readers in each thread and close them +HoodieFileReader localFileReader = null; +HoodieMetadataMergedLogRecordScanner localLogRecordScanner = null; +synchronized (this) { + if (!metadataConfig.enableReuse()) { +// reuse is disabled so always open new readers +try { + Pair readers = openReaders(); + localFileReader = readers.getKey(); + localLogRecordScanner = readers.getValue(); +} catch (IOException e) { + throw new HoodieIOException("Error opening readers", e); +} + } else if (baseFileReader == null && logRecordScanner == null) { +// reuse is enabled but we haven't opened the readers yet +try { + Pair readers = openReaders(); + localFileReader = readers.getKey(); + localLogRecordScanner = readers.getValue(); + // cache the readers + baseFileReader = localFileReader; + logRecordScanner = localLogRecordScanner; +} catch (IOException e) { + throw new HoodieIOException("Error opening readers", e); +} + } else { +// reuse the already open readers +ValidationUtils.checkState((baseFileReader != null || logRecordScanner != null), "Readers should already be open"); +localFileReader = baseFileReader; +localLogRecordScanner = logRecordScanner; + } +} Review comment: @prashantwason and I already synced on this. Will be catching up on reviews. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on a change in pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
vinothchandar commented on a change in pull request #2494: URL: https://github.com/apache/hudi/pull/2494#discussion_r577798704 ## File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java ## @@ -112,13 +113,59 @@ private void initIfNeeded() { @Override protected Option> getRecordByKeyFromMetadata(String key) { +// This function can be called in parallel through multiple threads. For each thread, we determine the thread-local +// versions of the baseFile and logRecord readers to use. +// - If reuse is enabled, we use the same readers and dont close them +// - if reuse is disabled, we open new readers in each thread and close them +HoodieFileReader localFileReader = null; +HoodieMetadataMergedLogRecordScanner localLogRecordScanner = null; +synchronized (this) { + if (!metadataConfig.enableReuse()) { +// reuse is disabled so always open new readers +try { + Pair readers = openReaders(); + localFileReader = readers.getKey(); + localLogRecordScanner = readers.getValue(); +} catch (IOException e) { + throw new HoodieIOException("Error opening readers", e); +} + } else if (baseFileReader == null && logRecordScanner == null) { +// reuse is enabled but we haven't opened the readers yet +try { + Pair readers = openReaders(); + localFileReader = readers.getKey(); + localLogRecordScanner = readers.getValue(); + // cache the readers + baseFileReader = localFileReader; + logRecordScanner = localLogRecordScanner; +} catch (IOException e) { + throw new HoodieIOException("Error opening readers", e); +} + } else { +// reuse the already open readers +ValidationUtils.checkState((baseFileReader != null || logRecordScanner != null), "Readers should already be open"); +localFileReader = baseFileReader; +localLogRecordScanner = logRecordScanner; + } +} Review comment: I suggest, we just stick to the instance variables and initialize/close them lazily as needed. This is typically the OO pattern This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on a change in pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
vinothchandar commented on a change in pull request #2494: URL: https://github.com/apache/hudi/pull/2494#discussion_r577537335 ## File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java ## @@ -112,13 +113,59 @@ private void initIfNeeded() { @Override protected Option> getRecordByKeyFromMetadata(String key) { +// This function can be called in parallel through multiple threads. For each thread, we determine the thread-local +// versions of the baseFile and logRecord readers to use. +// - If reuse is enabled, we use the same readers and dont close them +// - if reuse is disabled, we open new readers in each thread and close them +HoodieFileReader localFileReader = null; +HoodieMetadataMergedLogRecordScanner localLogRecordScanner = null; +synchronized (this) { Review comment: if we synchronize access here, why synchronize at the lower levels? ## File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java ## @@ -112,13 +113,59 @@ private void initIfNeeded() { @Override protected Option> getRecordByKeyFromMetadata(String key) { +// This function can be called in parallel through multiple threads. For each thread, we determine the thread-local +// versions of the baseFile and logRecord readers to use. +// - If reuse is enabled, we use the same readers and dont close them +// - if reuse is disabled, we open new readers in each thread and close them +HoodieFileReader localFileReader = null; +HoodieMetadataMergedLogRecordScanner localLogRecordScanner = null; +synchronized (this) { + if (!metadataConfig.enableReuse()) { Review comment: can we please go back to the previous style of lazily opening/closing depending on the config and hiding it from the flow of reading and processing the values. Happy to jump of a call if needed. But typically, reusing fewer variables and fewer branching the better. This does add signficant tax when reading the code. Thats why I changed them this way This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on a change in pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
vinothchandar commented on a change in pull request #2494: URL: https://github.com/apache/hudi/pull/2494#discussion_r566564182 ## File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java ## @@ -188,41 +196,51 @@ private synchronized void openFileSliceIfNeeded() throws IOException { // Load the schema Schema schema = HoodieAvroUtils.addMetadataFields(HoodieMetadataRecord.getClassSchema()); -logRecordScanner = new HoodieMetadataMergedLogRecordScanner(metaClient.getFs(), metadataBasePath, -logFilePaths, schema, latestMetaInstantTimestamp, MAX_MEMORY_SIZE_IN_BYTES, BUFFER_SIZE, +HoodieMetadataMergedLogRecordScanner logRecordScanner = new HoodieMetadataMergedLogRecordScanner(metaClient.getFs(), +metadataBasePath, logFilePaths, schema, latestMetaInstantTimestamp, MAX_MEMORY_SIZE_IN_BYTES, BUFFER_SIZE, spillableMapDirectory, null); LOG.info("Opened metadata log files from " + logFilePaths + " at instant " + latestInstantTime + "(dataset instant=" + latestInstantTime + ", metadata instant=" + latestMetaInstantTimestamp + ")"); metrics.ifPresent(metrics -> metrics.updateMetrics(HoodieMetadataMetrics.SCAN_STR, timer.endTimer())); + +if (metadataConfig.enableReuse()) { + // cache for later reuse + cachedBaseFileReader = baseFileReader; + cachedLogRecordScanner = logRecordScanner; +} + +return Pair.of(baseFileReader, logRecordScanner); } - private void closeIfNeeded() { + private void closeIfNeeded(Pair readers) { Review comment: I prefer the previous approach for readability. i.e have it just close the member variables. If you disagree, may be we can chat about why you think this is better for reading. I did face this issue when I originally read the code here ## File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java ## @@ -188,41 +196,51 @@ private synchronized void openFileSliceIfNeeded() throws IOException { // Load the schema Schema schema = HoodieAvroUtils.addMetadataFields(HoodieMetadataRecord.getClassSchema()); -logRecordScanner = new HoodieMetadataMergedLogRecordScanner(metaClient.getFs(), metadataBasePath, -logFilePaths, schema, latestMetaInstantTimestamp, MAX_MEMORY_SIZE_IN_BYTES, BUFFER_SIZE, +HoodieMetadataMergedLogRecordScanner logRecordScanner = new HoodieMetadataMergedLogRecordScanner(metaClient.getFs(), +metadataBasePath, logFilePaths, schema, latestMetaInstantTimestamp, MAX_MEMORY_SIZE_IN_BYTES, BUFFER_SIZE, spillableMapDirectory, null); LOG.info("Opened metadata log files from " + logFilePaths + " at instant " + latestInstantTime + "(dataset instant=" + latestInstantTime + ", metadata instant=" + latestMetaInstantTimestamp + ")"); metrics.ifPresent(metrics -> metrics.updateMetrics(HoodieMetadataMetrics.SCAN_STR, timer.endTimer())); + +if (metadataConfig.enableReuse()) { Review comment: We are sort of creating two code paths again for reuse and non-reuse. Can we please go back to just always initializing the member variable here always? then `closeIfNeeded()` can continue to work with members alone. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org