hudi-bot opened a new issue, #14689:
URL: https://github.com/apache/hudi/issues/14689

   I am using Hudi 0.5.0 with the HoodieROTablePathFilter changes from 
HUDI-1144 applied on top. Even with the caching, I am seeing only 46 
directories cached per minute. Because I have 6 months' worth of hourly 
partitions, my job takes a long time to write. Is there a way to speed this 
up? I am running it on a production cluster and have enough vcores available 
to process.
   
    
   
   HoodieTableMetaClient metaClient = metaClientCache.get(baseDir.toString());
   if (null == metaClient) {
     metaClient = new HoodieTableMetaClient(fs.getConf(), baseDir.toString(), true);
     metaClientCache.put(baseDir.toString(), metaClient);
   }

   HoodieTableFileSystemView fsView = new HoodieTableFileSystemView(metaClient,
       metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants(),
       fs.listStatus(folder));
   List<HoodieDataFile> latestFiles = fsView.getLatestDataFiles().collect(Collectors.toList());
   // populate the cache
   if (!hoodiePathCache.containsKey(folder.toString())) {
     hoodiePathCache.put(folder.toString(), new HashSet<>());
   }
   LOG.info("Custom Code : Based on hoodie metadata from base path: " + baseDir.toString()
       + ", caching " + latestFiles.size() + " files under " + folder);
   for (HoodieDataFile lfile : latestFiles) {
     hoodiePathCache.get(folder.toString()).add(new Path(lfile.getPath()));
   }
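
The cache-population step above is a per-folder memoization: list a folder once, remember its latest files, and answer later lookups from memory. A minimal sketch of that pattern on its own, without any Hudi or Hadoop classes, might look like the following. The class name `FolderCacheSketch`, the `listLatestFiles` function, and the use of plain `String` paths are assumptions made for illustration; the lister only stands in for the file-system-view lookup and is not Hudi's API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class FolderCacheSketch {
    // folder -> set of "latest" file paths in that folder
    private final Map<String, Set<String>> pathCache = new ConcurrentHashMap<>();
    // Hypothetical stand-in for the expensive listing + file-system-view step.
    private final Function<String, List<String>> listLatestFiles;
    // Counts how many times the expensive listing actually ran.
    private final AtomicInteger listings = new AtomicInteger();

    public FolderCacheSketch(Function<String, List<String>> listLatestFiles) {
        this.listLatestFiles = listLatestFiles;
    }

    // Returns true if the file is one of the latest files of its folder.
    public boolean accept(String folder, String file) {
        // computeIfAbsent runs the listing at most once per folder key.
        Set<String> latest = pathCache.computeIfAbsent(folder, f -> {
            listings.incrementAndGet();
            return new HashSet<>(listLatestFiles.apply(f));
        });
        return latest.contains(file);
    }

    public int listingCount() {
        return listings.get();
    }

    public static void main(String[] args) {
        FolderCacheSketch filter = new FolderCacheSketch(
            folder -> Arrays.asList(folder + "/a.parquet", folder + "/b.parquet"));
        String folder = "trr/20200919/08";
        System.out.println(filter.accept(folder, folder + "/a.parquet"));   // true
        System.out.println(filter.accept(folder, folder + "/old.parquet")); // false
        System.out.println(filter.listingCount());                          // 1
    }
}
```

With this shape, the cost that matters is the first lookup per folder; every later file in the same folder is answered from the in-memory set.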
   
    
   
    
   
    
   
   Sample Logs here. I have attached the log file as well.
   
    
   
   20/11/01 08:16:00 INFO HoodieTableFileSystemView: Adding file-groups for 
partition :20200919/08, #FileGroups=2
    20/11/01 08:16:00 INFO AbstractTableFileSystemView: addFilesToView: 
NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
    20/11/01 08:16:00 INFO HoodieROTablePathFilter: Custom Code : Based on 
hoodie metadata from base path: 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files under 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/08
    20/11/01 08:16:01 WARN LoadBalancingKMSClientProvider: KMS provider at  
threw an IOException!! java.io.IOException: 
org.apache.hadoop.security.authentication.client.AuthenticationException: 
GSSException: No valid credentials provided (Mechanism level: Failed to find 
any Kerberos tgt)
    20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for 
partition :20200919/09, #FileGroups=2
    20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: 
NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
    20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on 
hoodie metadata from base path: 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files under 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/09
    20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for 
partition :20200919/10, #FileGroups=3
    20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: 
NumFiles=10, FileGroupsCreationTime=1, StoreTimeTaken=0
    20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on 
hoodie metadata from base path: 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 3 files under 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/10
    20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at  
threw an IOException!! java.io.IOException: 
org.apache.hadoop.security.authentication.client.AuthenticationException: 
GSSException: No valid credentials provided (Mechanism level: Failed to find 
any Kerberos tgt)
    20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at  
threw an IOException!! java.io.IOException: 
org.apache.hadoop.security.authentication.client.AuthenticationException: 
GSSException: No valid credentials provided (Mechanism level: Failed to find 
any Kerberos tgt)
    20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for 
partition :20200919/11, #FileGroups=2
    20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: 
NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
    20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on 
hoodie metadata from base path: 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files under 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/11
    20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at  
threw an IOException!! java.io.IOException: 
org.apache.hadoop.security.authentication.client.AuthenticationException: 
GSSException: No valid credentials provided (Mechanism level: Failed to find 
any Kerberos tgt)
    20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for 
partition :20200919/12, #FileGroups=3
    20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: 
NumFiles=10, FileGroupsCreationTime=1, StoreTimeTaken=0
    20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on 
hoodie metadata from base path: 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 3 files under 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/12
    20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for 
partition :20200919/13, #FileGroups=2
    20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: 
NumFiles=7, FileGroupsCreationTime=0, StoreTimeTaken=0
    20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on 
hoodie metadata from base path: 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files under 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/13
    20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at  
threw an IOException!! java.io.IOException: 
org.apache.hadoop.security.authentication.client.AuthenticationException: 
GSSException: No valid credentials provided (Mechanism level: Failed to find 
any Kerberos tgt)
    20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at 
threw an IOException!! java.io.IOException: 
org.apache.hadoop.security.authentication.client.AuthenticationException: 
GSSException: No valid credentials provided (Mechanism level: Failed to find 
any Kerberos tgt)
    20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for 
partition :20200919/14, #FileGroups=2
    20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: 
NumFiles=7, FileGroupsCreationTime=0, StoreTimeTaken=0
    20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on 
hoodie metadata from base path: 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files under 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/14
    20/11/01 08:16:03 WARN LoadBalancingKMSClientProvider: KMS provider at  
threw an IOException!! java.io.IOException: 
org.apache.hadoop.security.authentication.client.AuthenticationException: 
GSSException: No valid credentials provided (Mechanism level: Failed to find 
any Kerberos tgt)
    20/11/01 08:16:03 INFO HoodieTableFileSystemView: Adding file-groups for 
partition :20200919/15, #FileGroups=3
    20/11/01 08:16:03 INFO AbstractTableFileSystemView: addFilesToView: 
NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
    20/11/01 08:16:03 INFO HoodieROTablePathFilter: Custom Code : Based on 
hoodie metadata from base path: 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 3 files under 
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/15
    20/11/01 08:16:03 INFO HoodieTableFileSystemView: Adding file-groups for 
partition :20200919/16, #FileGroups=2
    20/11/01 08:16:03 INFO AbstractTableFileSystemView: addFilesToView: 
NumFiles=7, FileGroupsCreationTime=0, StoreTimeTaken=0
   
    
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-1365
   - Type: Bug
   - Attachment(s):
     - 01/Nov/20 
09:11;Selvaraj.periyasamy1983;image-2020-11-01-01-11-11-561.png;https://issues.apache.org/jira/secure/attachment/13014495/image-2020-11-01-01-11-11-561.png
     - 02/Nov/20 
16:44;Selvaraj.periyasamy1983;image.png;https://issues.apache.org/jira/secure/attachment/13014589/image.png
   
   
   ---
   
   
   ## Comments
   
   01/Nov/20 09:12;Selvaraj.periyasamy1983;Below is the job detail. You can see 
that job 21 took more than 20 minutes for listing. 
   
   !image-2020-11-01-01-11-11-561.png!;;;
   
   ---
   
   02/Nov/20 15:44;vbalaji;[~Selvaraj.periyasamy1983]: 0.5.0 is a very old 
version of Hudi. You should try moving to a later version, as later versions 
include other improvements such as the removal of "rename" operations. W.r.t. 
your performance, I see a lot of WARN-level logs with exceptions being caught. 
I wonder whether this is due to a misconfiguration that is slowing your query. 
   
   On a different note, we are going to support a feature in the next release 
that would avoid listing data partitions completely 
(https://issues.apache.org/jira/browse/HUDI-1292).;;;
   
   ---
   
   02/Nov/20 16:14;Selvaraj.periyasamy1983;Thanks Balaji. Could you provide the 
Git change info for the rename operations? I will take a look.;;;
   
   ---
   
   03/Nov/20 
15:05;vbalaji;https://github.com/apache/hudi/commit/9a1f698eef1044443adadbf7a1bf7b5eb94fb84e;;;

