From the hudiLogs.txt, I find only HoodieROTablePathFilter related logs repeating, which suggests this is the read side. So, we recommend using the latest version. I tried Spark 2.3.3 and ran the quickstart without issues. Give it a shot and let us know if there are any issues.

Balaji.V

On Friday, August 28, 2020, 04:42:51 PM PDT, selvaraj periyasamy <selvaraj.periyasamy1...@gmail.com> wrote:

Thanks Balaji. My hadoop environment is still running with spark 2.3. Can I run 0.6.0 on spark 2.3?
For issue 1: I am able to manage it with a spark glob read instead of a hive read, and I am good with this approach.

Issue 2: I see the performance issue while writing into the COW table. This is purely a write, with no read involved. I attached the write logs (hudiLogs.txt) to the ticket. The more partitions my target has, the bigger the spike I notice in write time. The fix #1919 mentioned is applicable for writing as well.

On Fri, Aug 28, 2020 at 3:28 PM vbal...@apache.org <vbal...@apache.org> wrote:

> Hi Selvaraj,
> We had fixed relevant perf issue in 0.6.0 ([HUDI-1144] Speedup spark read
> queries by caching metaclient in HoodieROPathFilter (#1919)). Can you
> please try 0.6.0
> Balaji.V
> On Friday, August 28, 2020, 01:31:42 PM PDT, selvaraj periyasamy <
> selvaraj.periyasamy1...@gmail.com> wrote:
>
> I have created this https://issues.apache.org/jira/browse/HUDI-1232 ticket
> for tracking a couple of issues.
>
> One of the concerns I have in my use case is that I have a COW type table
> called TRR. I see the logs pasted below rolling for all individual
> partitions even though my write is on only a couple of partitions, and it
> takes up to 4 to 5 mins. I pasted only a few of them. I am wondering: in
> the future, I will have 3 years worth of data, and writing will be very
> slow every time I write into only a couple of partitions.
>
> 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@fed0a8b
> 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200714/01, #FileGroups=1
> 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=1
> 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1 files under hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/01
> 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1, ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
> 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@285c67a9
> 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200714/02, #FileGroups=1
> 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1 files under hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/02
> 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1, ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
> 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@2edd9c8
> 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200714/03, #FileGroups=1
> 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=4, FileGroupsCreationTime=1, StoreTimeTaken=0
> 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1 files under hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/03
> 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1, ugi=svchdc...@visa.com (auth:KERBEROS)]]]
>
> It seems that the more partitions we have, the more time the path filter
> lists take. Could someone provide more insight on how to make these things
> work faster and scale as the number of partitions increases?
>
> Thanks,
> Selva
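
The quoted logs show "Loading HoodieTableMetaClient" repeating once per partition for the same base path, and the [HUDI-1144] fix is described in the thread as caching the metaclient in the path filter. As a rough illustration of that general idea only (this is not Hudi's code, and `MetaClientCache` and `load_table_metadata` are hypothetical names invented for the sketch), a per-base-path cache could look like this:

```python
from typing import Callable, Dict


class MetaClientCache:
    """Hypothetical helper: keeps one loaded metadata object per table base path."""

    def __init__(self, loader: Callable[[str], dict]) -> None:
        self._loader = loader
        self._by_base_path: Dict[str, dict] = {}
        self.loads = 0  # counts how often the expensive loader actually ran

    def get(self, base_path: str) -> dict:
        # Run the loader only on the first request for a given base path
        if base_path not in self._by_base_path:
            self.loads += 1
            self._by_base_path[base_path] = self._loader(base_path)
        return self._by_base_path[base_path]


def load_table_metadata(base_path: str) -> dict:
    # Stand-in for the costly step (assumption: reading table properties and timeline)
    return {"base_path": base_path, "table_type": "COPY_ON_WRITE"}


cache = MetaClientCache(load_table_metadata)
base = "hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr"
partitions = ["20200714/01", "20200714/02", "20200714/03"]

for partition in partitions:
    # Every partition belongs to the same table, so the cache is hit after the first load
    meta = cache.get(base)
```

With a cache like this, the cost of loading table metadata would no longer grow with the number of partitions visited, since all partitions of one table share a single entry.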