Hi Selvaraj,

We fixed the relevant perf issue in 0.6.0 ([HUDI-1144] Speedup spark read
queries by caching metaclient in HoodieROPathFilter (#1919)). Can you please
try 0.6.0?

Balaji.V
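If you pull the Hudi Spark bundle in as a build dependency, trying 0.6.0 is a one-line version bump. A minimal sketch of the sbt fragment, assuming an sbt build with Scala 2.11 on Spark 2.x (the artifact suffix `_2.11` must match your Scala version):

```scala
// build.sbt fragment (sketch): bump the Hudi Spark bundle to 0.6.0,
// which includes the HUDI-1144 metaclient caching fix.
libraryDependencies += "org.apache.hudi" % "hudi-spark-bundle_2.11" % "0.6.0"
```

The same coordinate works with `--packages` if you launch jobs via spark-shell or spark-submit instead of building a fat jar.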
    On Friday, August 28, 2020, 01:31:42 PM PDT, selvaraj periyasamy 
<selvaraj.periyasamy1...@gmail.com> wrote:  
 
 I have created this https://issues.apache.org/jira/browse/HUDI-1232 ticket
for tracking a couple of issues.

One of the concerns I have in my use case involves a COW-type table named
TRR. I see the logs pasted below rolling for every individual partition,
even though my write touches only a couple of partitions, and it takes up
to 4 to 5 minutes. I pasted only a few of the log lines here. I am worried
that in the future, when I have 3 years' worth of data, every write will be
very slow even though it touches only a couple of partitions.

20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
type COPY_ON_WRITE from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
java.util.stream.ReferencePipeline$Head@fed0a8b
20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200714/01, #FileGroups=1
20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=1
20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
from base path:
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
files under
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/01
20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
type COPY_ON_WRITE from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
java.util.stream.ReferencePipeline$Head@285c67a9
20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200714/02, #FileGroups=1
20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=0
20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
from base path:
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
files under
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/02
20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
type COPY_ON_WRITE from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
java.util.stream.ReferencePipeline$Head@2edd9c8
20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200714/03, #FileGroups=1
20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=4, FileGroupsCreationTime=1, StoreTimeTaken=0
20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
from base path:
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
files under
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/03
20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
ugi=svchdc...@visa.com (auth:KERBEROS)]]]



It seems that the more partitions we have, the longer the path filter
listing takes. Could someone provide more insight into how to make this
faster and keep it scalable as the number of partitions increases?
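One workaround until the upgrade is to point the read at only the partitions you need, so the path filter lists just those directories instead of the whole table. A sketch, assuming an active SparkSession with the Hudi bundle on the classpath, and using the base path and yyyyMMdd/HH partition layout shown in the logs above:

```scala
// Sketch: restrict the read to a partition glob instead of the table root,
// so HoodieROTablePathFilter only evaluates files under those directories.
// Base path and partition values are taken from the logs above.
val basePath = "hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr"

val df = spark.read
  .format("org.apache.hudi")
  .load(basePath + "/20200714/*") // only the hour partitions of 2020-07-14

df.createOrReplaceTempView("trr_20200714")
```

This does not reduce the per-partition listing cost itself; it only narrows how many partitions get listed, so the 0.6.0 metaclient caching fix is still the real answer.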


Thanks,

Selva
