hudi-bot opened a new issue, #14689:
URL: https://github.com/apache/hudi/issues/14689
I am using Hudi 0.5.0. I took 0.5.0 and applied the changes to
HoodieROTablePathFilter from HUDI-1144. Even though it caches, I am seeing
only 46 directories cached in 1 minute. Because of this, my job takes a long
time to write, since I have 6 months' worth of hourly partitions. Is there a
way to speed this up? I am running it in a production cluster and have enough
vcores available for processing.
HoodieTableMetaClient metaClient = metaClientCache.get(baseDir.toString());
if (null == metaClient) {
  metaClient = new HoodieTableMetaClient(fs.getConf(), baseDir.toString(), true);
  metaClientCache.put(baseDir.toString(), metaClient);
}
HoodieTableFileSystemView fsView = new HoodieTableFileSystemView(metaClient,
    metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants(),
    fs.listStatus(folder));
List<HoodieDataFile> latestFiles =
    fsView.getLatestDataFiles().collect(Collectors.toList());
// populate the cache
if (!hoodiePathCache.containsKey(folder.toString())) {
  hoodiePathCache.put(folder.toString(), new HashSet<>());
}
LOG.info("Custom Code : Based on hoodie metadata from base path: "
    + baseDir.toString() + ", caching " + latestFiles.size()
    + " files under " + folder);
for (HoodieDataFile lfile : latestFiles) {
  hoodiePathCache.get(folder.toString()).add(new Path(lfile.getPath()));
}
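
The cache-population step in the snippet above can be illustrated without Hudi or Hadoop on the classpath. This is only a minimal sketch, not the HoodieROTablePathFilter implementation: `FolderPathCache` and its method names are hypothetical, and plain strings stand in for `Path` objects. It shows `Map.computeIfAbsent` creating the per-folder set and returning it in one call, in place of the separate `containsKey`/`put`/`get` sequence. It does not address the listing time itself.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the per-folder path cache; not part of Hudi.
class FolderPathCache {
    // folder path -> set of latest data file paths under that folder
    private final Map<String, Set<String>> cache = new ConcurrentHashMap<>();

    // Adds the files to the folder's entry, creating the entry on first use.
    public void cacheLatestFiles(String folder, List<String> latestFiles) {
        cache.computeIfAbsent(folder, k -> ConcurrentHashMap.newKeySet())
             .addAll(latestFiles);
    }

    // True if the file was cached as a latest data file for the folder.
    public boolean isCached(String folder, String file) {
        return cache.getOrDefault(folder, Collections.emptySet()).contains(file);
    }

    // Number of folders that currently have a cache entry.
    public int cachedFolderCount() {
        return cache.size();
    }

    public static void main(String[] args) {
        FolderPathCache pathCache = new FolderPathCache();
        // Illustrative folder and file names only.
        pathCache.cacheLatestFiles("trr/20200919/08",
            Arrays.asList("file-a.parquet", "file-b.parquet"));
        System.out.println("folders cached: " + pathCache.cachedFolderCount());
    }
}
```

Because the sketch uses a `ConcurrentHashMap` and a concurrent key set, the same cache instance could be shared across threads, which the plain `HashSet` in the snippet above would not safely allow.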
Sample logs are below. I have attached the log file as well.
20/11/01 08:16:00 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200919/08, #FileGroups=2
20/11/01 08:16:00 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
20/11/01 08:16:00 INFO HoodieROTablePathFilter: Custom Code : Based on
hoodie metadata from base path:
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files under
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/08
20/11/01 08:16:01 WARN LoadBalancingKMSClientProvider: KMS provider at
threw an IOException!! java.io.IOException:
org.apache.hadoop.security.authentication.client.AuthenticationException:
GSSException: No valid credentials provided (Mechanism level: Failed to find
any Kerberos tgt)
20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200919/09, #FileGroups=2
20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on
hoodie metadata from base path:
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files under
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/09
20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200919/10, #FileGroups=3
20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=10, FileGroupsCreationTime=1, StoreTimeTaken=0
20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on
hoodie metadata from base path:
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 3 files under
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/10
20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at
threw an IOException!! java.io.IOException:
org.apache.hadoop.security.authentication.client.AuthenticationException:
GSSException: No valid credentials provided (Mechanism level: Failed to find
any Kerberos tgt)
20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at
threw an IOException!! java.io.IOException:
org.apache.hadoop.security.authentication.client.AuthenticationException:
GSSException: No valid credentials provided (Mechanism level: Failed to find
any Kerberos tgt)
20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200919/11, #FileGroups=2
20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on
hoodie metadata from base path:
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files under
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/11
20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at
threw an IOException!! java.io.IOException:
org.apache.hadoop.security.authentication.client.AuthenticationException:
GSSException: No valid credentials provided (Mechanism level: Failed to find
any Kerberos tgt)
20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200919/12, #FileGroups=3
20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=10, FileGroupsCreationTime=1, StoreTimeTaken=0
20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on
hoodie metadata from base path:
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 3 files under
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/12
20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200919/13, #FileGroups=2
20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=7, FileGroupsCreationTime=0, StoreTimeTaken=0
20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on
hoodie metadata from base path:
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files under
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/13
20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at
threw an IOException!! java.io.IOException:
org.apache.hadoop.security.authentication.client.AuthenticationException:
GSSException: No valid credentials provided (Mechanism level: Failed to find
any Kerberos tgt)
20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at
threw an IOException!! java.io.IOException:
org.apache.hadoop.security.authentication.client.AuthenticationException:
GSSException: No valid credentials provided (Mechanism level: Failed to find
any Kerberos tgt)
20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200919/14, #FileGroups=2
20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=7, FileGroupsCreationTime=0, StoreTimeTaken=0
20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on
hoodie metadata from base path:
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files under
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/14
20/11/01 08:16:03 WARN LoadBalancingKMSClientProvider: KMS provider at
threw an IOException!! java.io.IOException:
org.apache.hadoop.security.authentication.client.AuthenticationException:
GSSException: No valid credentials provided (Mechanism level: Failed to find
any Kerberos tgt)
20/11/01 08:16:03 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200919/15, #FileGroups=3
20/11/01 08:16:03 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
20/11/01 08:16:03 INFO HoodieROTablePathFilter: Custom Code : Based on
hoodie metadata from base path:
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 3 files under
hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/15
20/11/01 08:16:03 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200919/16, #FileGroups=2
20/11/01 08:16:03 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=7, FileGroupsCreationTime=0, StoreTimeTaken=0
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-1365
- Type: Bug
- Attachment(s):
  - 01/Nov/20 09:11;Selvaraj.periyasamy1983;image-2020-11-01-01-11-11-561.png;https://issues.apache.org/jira/secure/attachment/13014495/image-2020-11-01-01-11-11-561.png
  - 02/Nov/20 16:44;Selvaraj.periyasamy1983;image.png;https://issues.apache.org/jira/secure/attachment/13014589/image.png
---
## Comments
01/Nov/20 09:12;Selvaraj.periyasamy1983;Below is the job detail. You can see
that job 21 took more than 20 minutes for listing.
!image-2020-11-01-01-11-11-561.png!;;;
---
02/Nov/20 15:44;vbalaji;[~Selvaraj.periyasamy1983]: 0.5.0 is a very old
version of Hudi. You should try moving to a later version, as later versions
include other improvements such as removing "rename" operations. W.r.t. your
performance, I see a lot of WARN-level logs with exceptions getting caught. I
am wondering if this is due to a misconfiguration that is slowing your query.
On a different note, in the next release we are going to support a feature
that would avoid listing data partitions completely
(https://issues.apache.org/jira/browse/HUDI-1292). ;;;
---
02/Nov/20 16:14;Selvaraj.periyasamy1983;Thanks, Balaji. Could you provide the
Git change info for the rename operations? I will take a look.;;;
---
03/Nov/20 15:05;vbalaji;https://github.com/apache/hudi/commit/9a1f698eef1044443adadbf7a1bf7b5eb94fb84e;;;