[GitHub] [hudi] bvaradar commented on issue #1847: [SUPPORT] querying MoR tables on S3 becomes slow with number of files growing
bvaradar commented on issue #1847: URL: https://github.com/apache/hudi/issues/1847#issuecomment-685841451 @zuyanton : I am not sure whether this is still an issue. Since this seems specific to EMR, can you open a ticket with the EMR folks directly? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
bvaradar commented on issue #1847: URL: https://github.com/apache/hudi/issues/1847#issuecomment-663450319 @bschell : Thanks for the information. As getLen() is used extensively on both the read and write sides, can you elaborate on the cases in which it actually results in RPC calls? Is there a way to cache the value within the implementation?
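To make the caching question concrete, here is a minimal sketch of what memoizing the length could look like. It is only an illustration: `CachedLength` is a hypothetical helper that does not exist in Hadoop, EMRFS, or Hudi, and the `LongSupplier` passed in merely stands in for whatever expensive lookup (for example an RPC) the real implementation might perform.

```java
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Hadoop, EMRFS, or Hudi.
// It runs an expensive length lookup at most once and serves
// every later call from the stored value.
final class CachedLength implements LongSupplier {
    private final LongSupplier loader; // stands in for the expensive lookup
    private boolean loaded;
    private long value;

    CachedLength(LongSupplier loader) {
        this.loader = loader;
    }

    @Override
    public synchronized long getAsLong() {
        if (!loaded) {
            value = loader.getAsLong(); // only the first call pays the cost
            loaded = true;
        }
        return value;
    }

    public static void main(String[] args) {
        int[] lookups = {0};
        CachedLength len = new CachedLength(() -> {
            lookups[0]++; // count how often the "remote" lookup runs
            return 1024L;
        });
        len.getAsLong();
        len.getAsLong();
        System.out.println("length=" + len.getAsLong() + " lookups=" + lookups[0]);
    }
}
```

The `synchronized` method keeps the sketch simple and thread-safe; whether a real filesystem implementation could cache this way depends on how it tracks changes to the underlying object, which this thread does not establish.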
bvaradar commented on issue #1847: URL: https://github.com/apache/hudi/issues/1847#issuecomment-661461345 @zuyanton : I am not sure I can find the source code of this class. @umehrot2 : Can you let me know whether the current implementation of FileStatus returned by S3NativeFileSystem overrides getLen()?
bvaradar commented on issue #1847: URL: https://github.com/apache/hudi/issues/1847#issuecomment-660836081

@zuyanton : Thanks for the detailed write-up. This is very interesting. If you look at the base implementation of the FileStatus getLen() method, it returns a cached copy of the length, so I wouldn't expect it to be the cause of such high variance. Also, the 100 milliseconds you observed would definitely indicate a blocking operation such as an RPC call. Does the EMR/S3 filesystem implementation override these classes?

```
/**
 * Get the length of this file, in bytes.
 * @return the length of this file, in bytes.
 */
public long getLen() {
  return length;
}
```

@zuyanton : Can you track the class type of the incoming file-status object? cc @umehrot2
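One way to check the class type of the incoming file-status object is to log its runtime class and use reflection to see which class declares `getLen()`. The sketch below is self-contained, so it uses two hypothetical stand-ins (`StandInStatus` and `LazyStandInStatus`) instead of the real Hadoop classes; against a real cluster one would presumably pass the actual status object and `FileStatus.class` instead. The 100 ms sleep only simulates a blocking call and is not taken from any EMR code.

```java
import java.lang.reflect.Method;

// Hypothetical probe, not part of Hudi or Hadoop.
final class FileStatusProbe {
    // Returns true when the runtime class of `status` (or an intermediate
    // parent) declares its own public getLen() instead of inheriting `base`'s.
    static boolean overridesGetLen(Object status, Class<?> base) {
        try {
            Method m = status.getClass().getMethod("getLen");
            return m.getDeclaringClass() != base;
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(
                "no public getLen() on " + status.getClass().getName(), e);
        }
    }

    public static void main(String[] args) {
        StandInStatus[] samples = {new StandInStatus(42L), new LazyStandInStatus()};
        for (StandInStatus s : samples) {
            long start = System.nanoTime();
            long len = s.getLen();
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println(s.getClass().getName()
                + " overridesGetLen=" + overridesGetLen(s, StandInStatus.class)
                + " len=" + len + " took=" + micros + "us");
        }
    }
}

// Stand-in for the base behaviour quoted above: getLen() returns a stored field.
class StandInStatus {
    private final long length;

    StandInStatus(long length) {
        this.length = length;
    }

    public long getLen() {
        return length;
    }
}

// Stand-in for a subclass whose getLen() blocks (simulated with a sleep).
class LazyStandInStatus extends StandInStatus {
    LazyStandInStatus() {
        super(-1L);
    }

    @Override
    public long getLen() {
        try {
            Thread.sleep(100); // simulated blocking call, e.g. an RPC
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return 7L;
    }
}
```

If the probe reports an override for the object Hudi receives, that would point at the filesystem implementation rather than at the base `getLen()` as the source of the latency.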