[GitHub] [hudi] bvaradar commented on issue #1847: [SUPPORT] querying MoR tables on S3 becomes slow with number of files growing

2020-09-02 Thread GitBox


bvaradar commented on issue #1847:
URL: https://github.com/apache/hudi/issues/1847#issuecomment-685841451


   @zuyanton : I am not sure if this is still an issue. Since this seems 
specific to EMR, can you open a ticket with the EMR folks directly?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1847: [SUPPORT] querying MoR tables on S3 becomes slow with number of files growing

2020-07-24 Thread GitBox


bvaradar commented on issue #1847:
URL: https://github.com/apache/hudi/issues/1847#issuecomment-663450319


   @bschell : Thanks for the information. As getLen() is used extensively on 
both the read and write side, can you elaborate on the cases in which it 
actually results in RPC calls? Is there a way to cache the length within the 
implementation?
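
To make the caching question concrete, here is a minimal sketch of memoizing a length lookup so that the expensive call runs at most once. It is not Hudi or EMR code: `CachedLength`, `MemoizedLength`, and the `loader` supplier are hypothetical names, and the supplier merely stands in for whatever blocking call the real implementation makes.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongSupplier;

public class CachedLength {

    /** Hypothetical wrapper: runs the loader on first use, then serves the stored value. */
    public static final class MemoizedLength implements LongSupplier {
        private final LongSupplier loader;
        private boolean loaded;
        private long value;

        public MemoizedLength(LongSupplier loader) {
            this.loader = loader;
        }

        @Override
        public synchronized long getAsLong() {
            if (!loaded) {
                // Only the first call pays for the (possibly remote) lookup.
                value = loader.getAsLong();
                loaded = true;
            }
            return value;
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Stand-in for a blocking lookup; it counts how often it is invoked.
        LongSupplier expensive = () -> {
            calls.incrementAndGet();
            return 1024L;
        };

        MemoizedLength length = new MemoizedLength(expensive);
        length.getAsLong();
        length.getAsLong();
        System.out.println("loader calls: " + calls.get());
    }
}
```

Whether such a cache is safe depends on whether the file length can change between calls, which only the filesystem implementation can answer.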







[GitHub] [hudi] bvaradar commented on issue #1847: [SUPPORT] querying MoR tables on S3 becomes slow with number of files growing

2020-07-20 Thread GitBox


bvaradar commented on issue #1847:
URL: https://github.com/apache/hudi/issues/1847#issuecomment-661461345


   @zuyanton : I am not sure if I can find the source code of this class. 
@umehrot2 : Can you let me know if the current implementation of FileStatus 
returned by S3NativeFileSystem overrides getLen()?







[GitHub] [hudi] bvaradar commented on issue #1847: [SUPPORT] querying MoR tables on S3 becomes slow with number of files growing

2020-07-20 Thread GitBox


bvaradar commented on issue #1847:
URL: https://github.com/apache/hudi/issues/1847#issuecomment-660836081


   @zuyanton : Thanks for the detailed write-up. This is very interesting. If 
you look at the base implementation of the FileStatus getLen() method, it 
returns a cached copy of the length, so I wouldn't expect it to be the cause 
of such high variance. Also, the 100 milliseconds you observed would 
definitely mean a blocking operation such as an RPC call is being made. Does 
the EMR/S3 implementation of the filesystem override these classes?
   
   ```
   /**
    * Get the length of this file, in bytes.
    * @return the length of this file, in bytes.
    */
   public long getLen() {
     return length;
   }
   ```
   
   @zuyanton : Can you check the runtime class of the incoming FileStatus 
object?
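
One way to check this is to log the runtime class of the object and the class that declares the getLen() it actually uses. The sketch below is self-contained and does not depend on Hadoop: `BaseStatus` and `LazyStatus` are hypothetical stand-ins for FileStatus and for a subclass that overrides getLen(), and `fetchLengthRemotely` is a placeholder, not a real EMR method.

```java
import java.lang.reflect.Method;

public class FileStatusProbe {

    /** Hypothetical stand-in for the base class: the length is cached in a field. */
    public static class BaseStatus {
        private final long length;

        public BaseStatus(long length) {
            this.length = length;
        }

        public long getLen() {
            return length;
        }
    }

    /** Hypothetical subclass that overrides getLen() with a lazy lookup. */
    public static class LazyStatus extends BaseStatus {
        public LazyStatus() {
            super(-1L);
        }

        @Override
        public long getLen() {
            return fetchLengthRemotely();
        }

        // Placeholder for a blocking call; returns a fixed value here.
        private long fetchLengthRemotely() {
            return 42L;
        }
    }

    /** Returns the class that declares the getLen() implementation used by status. */
    public static Class<?> declaringClassOfGetLen(Object status) {
        try {
            Method getLen = status.getClass().getMethod("getLen");
            return getLen.getDeclaringClass();
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("object has no public getLen()", e);
        }
    }

    /** Builds a one-line description suitable for a debug log. */
    public static String describe(Object status) {
        return status.getClass().getName()
            + " uses getLen() declared in "
            + declaringClassOfGetLen(status).getName();
    }

    public static void main(String[] args) {
        System.out.println(describe(new BaseStatus(10L)));
        System.out.println(describe(new LazyStatus()));
    }
}
```

If the declaring class differs from the base class, getLen() is overridden, and the override is the place to look for a blocking call. With a real FileStatus object, the same `describe` call could be dropped into the code path being timed.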
   
   cc @umehrot2 


