[GitHub] [hudi] prashantwason commented on pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
prashantwason commented on pull request #2494: URL: https://github.com/apache/hudi/pull/2494#issuecomment-799740584 Looks good @vinothchandar This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
prashantwason commented on pull request #2494: URL: https://github.com/apache/hudi/pull/2494#issuecomment-797665038 @vinothchandar PTAL.
prashantwason commented on pull request #2494: URL: https://github.com/apache/hudi/pull/2494#issuecomment-794269612

@vinothchandar and I discussed simplifying this PR. The following changes are to be implemented:

1. Remove the "reuse" configuration as it does not make sense for performance reasons.
   - When timeline server is used, reuse should be on.
   - When timeline server is not used, each executor has its own instance of the Metadata Reader and reuse is implicit.
2. Simplify the above code to use the instance variables.
3. Locking is not required because of the usage pattern in #1. Locking will still be required in HFileReader because KeyScanner is not thread safe.

I am working on updating this PR.
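To illustrate point 3, here is a minimal sketch of why a lock is still needed around a stateful scanner even when the reader itself is held in an instance variable. Everything in it is hypothetical: `FakeKeyScanner` and `SynchronizedScannerSketch` are stand-ins invented for this example, not Hudi's HFileReader or the HBase scanner API. The only assumption carried over from the comment is that a seek followed by a read shares cursor state, so the pair must not interleave across threads.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SynchronizedScannerSketch {
    /** Hypothetical scanner: seekTo() moves a shared cursor, so it is not thread safe. */
    static final class FakeKeyScanner {
        private final Map<String, String> data;
        private String current; // cursor shared by every caller

        FakeKeyScanner(Map<String, String> data) {
            this.data = new HashMap<>(data);
        }

        boolean seekTo(String key) {
            current = data.containsKey(key) ? key : null;
            return current != null;
        }

        String getValue() {
            return current == null ? null : data.get(current);
        }
    }

    // Instance variable: the scanner is opened once and reused for every lookup.
    private final FakeKeyScanner scanner;

    SynchronizedScannerSketch(Map<String, String> data) {
        this.scanner = new FakeKeyScanner(data);
    }

    // seekTo() + getValue() must run as one unit, otherwise another thread
    // could move the cursor between the two calls.
    synchronized String getRecordByKey(String key) {
        return scanner.seekTo(key) ? scanner.getValue() : null;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> data = new HashMap<>();
        for (int i = 0; i < 100; i++) {
            data.put("partition-" + i, "files-" + i);
        }
        SynchronizedScannerSketch reader = new SynchronizedScannerSketch(data);

        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Boolean>> results = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            final int n = i;
            results.add(pool.submit(
                () -> ("files-" + n).equals(reader.getRecordByKey("partition-" + n))));
        }
        boolean allConsistent = true;
        for (Future<Boolean> f : results) {
            allConsistent &= f.get();
        }
        pool.shutdown();
        System.out.println("all lookups consistent: " + allConsistent);
    }
}
```

Dropping `synchronized` from `getRecordByKey` in this sketch lets one thread's `seekTo` overwrite the cursor before another thread's `getValue`, which is the kind of race the comment refers to.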
prashantwason commented on pull request #2494: URL: https://github.com/apache/hudi/pull/2494#issuecomment-768576158

With enableReuse=false, the caching of readers needs special handling because:

1. Multiple threads may call into HoodieBackedTableMetadata.getRecordByKeyFromMetadata() to read their respective keys.
2. If enableReuse=false, then each of these threads will try to close the readers after reading the key.

Hence, we essentially have two codepaths:

1. enableReuse=false: the readers cannot be cached.
2. enableReuse=true: the readers can be cached.

I have updated the patch to handle both these cases by modifying the openFileSliceIfNeeded function (renamed to getReader), which returns either:

1. cached readers when enableReuse=true
2. newly opened readers when enableReuse=false
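A rough sketch of the two codepaths, for readers following along without the diff. This is not the code in the patch: `FakeReader`, `ReaderReuseSketch`, and `openedCount()` are hypothetical names made up for the illustration, and only the `enableReuse` flag and the `getReader` name come from the comment above.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ReaderReuseSketch {
    /** Hypothetical stand-in for a base-file reader. */
    static final class FakeReader implements AutoCloseable {
        private volatile boolean closed;

        String lookup(String key) {
            if (closed) {
                throw new IllegalStateException("reader already closed");
            }
            return "value-of-" + key;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    private final boolean enableReuse;
    private final AtomicInteger opened = new AtomicInteger();
    private FakeReader cached; // only used when enableReuse=true; guarded by "this"

    ReaderReuseSketch(boolean enableReuse) {
        this.enableReuse = enableReuse;
    }

    // Returns the cached reader when reuse is on, a newly opened one otherwise.
    synchronized FakeReader getReader() {
        if (!enableReuse) {
            opened.incrementAndGet();
            return new FakeReader();
        }
        if (cached == null) {
            opened.incrementAndGet();
            cached = new FakeReader();
        }
        return cached;
    }

    String getRecordByKey(String key) {
        FakeReader reader = getReader();
        try {
            return reader.lookup(key);
        } finally {
            // Without reuse each caller owns its reader and closes it; a shared
            // cached reader must stay open for the other callers.
            if (!enableReuse) {
                reader.close();
            }
        }
    }

    int openedCount() {
        return opened.get();
    }

    public static void main(String[] args) {
        ReaderReuseSketch withReuse = new ReaderReuseSketch(true);
        withReuse.getRecordByKey("a");
        withReuse.getRecordByKey("b");
        System.out.println("readers opened with reuse: " + withReuse.openedCount());

        ReaderReuseSketch withoutReuse = new ReaderReuseSketch(false);
        withoutReuse.getRecordByKey("a");
        withoutReuse.getRecordByKey("b");
        System.out.println("readers opened without reuse: " + withoutReuse.openedCount());
    }
}
```

The point of the split is visible in the `finally` block: closing is safe only when the reader is private to the calling thread, which is why a cached reader and a close-after-read policy cannot be combined.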
prashantwason commented on pull request #2494: URL: https://github.com/apache/hudi/pull/2494#issuecomment-768495604

> > The size of the base file was 3MB so this means that the in-memory HFile block caching was also working.
>
> Trying to understand this part. Was the workload, trying to fetch all the keys out of the HFile or just 1?

The workload was a commit followed by a Clean operation with num_versions_retained=1, so it will clean all partitions. Hence, the number of key lookups should be equal to the number of partitions, and all the keys should have been read from the HFile.