[GitHub] [hudi] zuyanton opened a new issue #1847: [SUPPORT] querying MoR tables on S3 becomes slow with number of files growing

GitBox Fri, 17 Jul 2020 16:54:45 -0700


zuyanton opened a new issue #1847:
URL: https://github.com/apache/hudi/issues/1847



   We are noticing that Hudi MoR table on S3 starts perform slow with number of 
files growing. Although it may sound similar as #1829 , its a different issue, 
as this time we have tested on table with relatively few partitions (100) and 
log indicates a different bottleneck. We can manage to keep writing/reading 
time within acceptable limits if we keep number of files small (compacting 
every 10 delta commits, setting cleaner to only keep one commit ) however if we 
try to increase the number of historical commits to 30 - 40  thats when we 
start noticing increase in upsert and read time. Specifically to reading table: 
We run simple count query, when checking spark UI we can see that cluster is 
idled for the first 20 minutes and only master node does some work, after that 
20 minutes pause,spark starts running the count job.   
     
   
![read25](https://user-images.githubusercontent.com/67354813/87827172-26e85580-c82f-11ea-8bf4-c0c4a17d2de7.PNG)
   When checking logs we observe that first 20 minutes are taken by master node 
loading all the necessary metadata  from s3. More specifically we see a lot of 
lines like follow:  
   ```
   20/07/17 02:16:45 INFO HoodieTableFileSystemView: Adding file-groups for 
partition :11, #FileGroups=17
   20/07/17 02:16:45 INFO AbstractTableFileSystemView: addFilesToView: 
NumFiles=120, NumFileGroups=17, FileGroupsCreationTime=13226, StoreTimeTaken=0 
   20/07/17 02:16:45 INFO AbstractTableFileSystemView: Time to load partition 
(11) =13285
   ```  
   we observe FileGroupsCreationTime value is all over the place from less then 
a second for small partitions to 4 minutes per partition containing 1000+ 
files. I placed bunch of timer log lines in Hudi code to narrow done the bottle 
neck and my findings are following: most time consuming lines are this 
   
https://github.com/apache/hudi/blob/bf1d36fa639cae558aa388d8d547e58ad2e67aba/hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java#L254
  
   and  
   
https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java#L265
    
    
    
   more specifically instantiation of HoodieLogFile  and  HoodieBaseFile  more 
specifically grabbing file length value from FileStatus here 
https://github.com/apache/hudi/blob/bf1d36fa639cae558aa388d8d547e58ad2e67aba/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieBaseFile.java#L42
 and here 
https://github.com/apache/hudi/blob/bf1d36fa639cae558aa388d8d547e58ad2e67aba/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieLogFile.java#L51
 .
   so pretty much hoodie iterates all partitions and within each partition 
sequentially traverses all files and grabs their length, mentioned lines of 
code in HoodieLogFile and HoodieBaseFile constructors take on average 100 
milliseconds ,so every time  Hoodie processes partition with 600+ files , it 
takes 600*100 milliseconds =  1 minute+ per partition.   
     
   **Possible ways this process may be sped up**
   This are based on my not super deep understanding of Hudi functionality, I 
can be grossly wrong about them.   
   From my understanding ```HoodieParquetInputFormat.listStatus``` that gets 
executed per each partition and that eventually triggers File group creation, 
gets executed in multiple threads (multiple thread run listStatus) ,so Hudi 
does not process one partition at a time, there is some multi threading going 
on, however from my observations this parallelism is pretty small, maybe just 
handful of threads at a time and I dont know what parameter controls it. 
Theoretically increasing this number may improve performance. 
   It looks like Hudi grabs file sizes for all files in partition folder ,which 
may be unnecessary since if lets say we configured to keep last N commits, then 
only 1/N th of the folder are parquet files that need to be queried (plus log 
files of cause), the rest are just historical commits.  just grabbing file 
length of 1/N th of the total files should make file groups creation process 
faster.
   
   
   
   
   
   **Environment Description**
   
   * Hudi version : master branch
   
   * Spark version : 2.4.4
   
   * Hive version : 2.3.6
   
   * Hadoop version : 2.8.5
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Table Info**
   
   We observe read time increasing for all the tests tables, however it gets 
more obvious pretty quick on tables where every incremental update , updates 
data in 50+ out of 100 partitions. Initial state of test table was 100 
partitions,  initial size 100gb, initial file count 6k. compaction is set to be 
ran every 10 delta commits, cleaner is set to keep last 50 commits. We started 
observing 20+ minutes table loading time after table grew to 27k files and 
400gb. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [hudi] zuyanton opened a new issue #1847: [SUPPORT] querying MoR tables on S3 becomes slow with number of files growing

Reply via email to