Thanks Balaji. Since an upgrade is not an immediate option on a shared cluster, I tried a workaround: I added an org.apache.hudi.hadoop.HoodieROTablePathFilter class with the caching logic to a common project module, built a jar, and placed common.jar before the hudi jar. It now picks up the custom class and takes care of the caching. I can manage with this until we upgrade.
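For illustration, a minimal sketch of what such a drop-in caching filter might look like is below. This is an assumption about the approach, not the actual class from common.jar: the real override keeps Hudi's metadata check and only memoizes its result, whereas here the metadata lookup is a placeholder.

    package org.apache.hudi.hadoop;

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // Illustrative sketch only. Shadowing the bundled class with the same
    // fully-qualified name lets Spark pick up this version when common.jar
    // precedes the Hudi bundle on the classpath.
    public class HoodieROTablePathFilter implements PathFilter {

      // Memoized accept() decisions, so repeated listings of the same
      // partition do not re-read table metadata from HDFS every time.
      private static final Map<String, Boolean> DECISION_CACHE = new ConcurrentHashMap<>();

      @Override
      public boolean accept(Path path) {
        return DECISION_CACHE.computeIfAbsent(path.toString(), key -> isLatestBaseFile(path));
      }

      // Placeholder for the real check, which loads the Hudi table metadata
      // (HoodieTableMetaClient / file-system view) to decide whether this
      // file belongs to the latest commit.
      private boolean isLatestBaseFile(Path path) {
        return true; // hypothetical; the actual logic consults the Hudi timeline
      }
    }

Listing /home/selva/common.jar ahead of the Hudi bundle in --jars is what lets this class win the classpath lookup, as in the command below.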
spark2-submit --jars /home/selva/common.jar,/home/selva/hudi-spark-bundle-0.5.0-incubating.jar \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --master yarn --deploy-mode client \
  --driver-memory 4g --executor-memory 10g --num-executors 200 --executor-cores 1 \
  --conf spark.executor.memoryOverhead=4096 \
  --conf spark.shuffle.service.enabled=true \
  --class com.test.cdp.reporting.trr.TRREngine /home/seperiya/transformation-engine.jar

Thanks,
Selva

On Sat, Aug 29, 2020 at 12:55 PM Balaji Varadarajan <[email protected]> wrote:

> Hi Selvaraj,
> Yes, you are right. Sorry for the confusion. As mentioned in the release
> notes, a Spark 2.4.4 runtime is needed, although I don't remember what
> problem you will encounter with Spark 2.3.3. I think it will be a
> worthwhile exercise for you to upgrade to Spark 2.4.4 and the latest Hudi
> version, as we have been and are continuing to improve performance in
> Hudi :) For instance, the very next release will have consolidated
> metadata, which would avoid file listing in the first place.
> Thanks,
> Balaji.V
>
> On Saturday, August 29, 2020, 11:09:25 AM PDT, selvaraj periyasamy
> <[email protected]> wrote:
>
> Thanks Balaji,
>
> I am looking into the steps to upgrade to 0.6.0. I noticed the below
> content in the 0.5.1 release notes here: https://hudi.apache.org/releases.html.
> It says the runtime Spark version must be 2.4+. A little confused now.
> Could you shed more light on this?
>
> Release Highlights
> <https://hudi.apache.org/releases.html#release-highlights-3>
>
> - Dependency Version Upgrades
>   - Upgrade from Spark 2.1.0 to Spark 2.4.4
>   - Upgrade from Avro 1.7.7 to Avro 1.8.2
>   - Upgrade from Parquet 1.8.1 to Parquet 1.10.1
>   - Upgrade from Kafka 0.8.2.1 to Kafka 2.0.0 as a result of updating the
>     spark-streaming-kafka artifact from 0.8_2.11/2.12 to 0.10_2.11/2.12.
> - *IMPORTANT* This version requires your runtime Spark version to be
>   upgraded to 2.4+.
>
> Thanks,
> Selva
>
> On Sat, Aug 29, 2020 at 1:16 AM Balaji Varadarajan
> <[email protected]> wrote:
>
> > From hudiLogs.txt, I find only HoodieROTablePathFilter-related logs
> > repeating, which suggests this is the read side. So we recommend using
> > the latest version. I tried 2.3.3 and ran the quickstart without issues.
> > Give it a shot and let us know if there are any issues.
> > Balaji.V
> >
> > On Friday, August 28, 2020, 04:42:51 PM PDT, selvaraj periyasamy
> > <[email protected]> wrote:
> >
> > Thanks Balaji. My hadoop environment is still running Spark 2.3. Can I
> > run 0.6.0 on Spark 2.3?
> >
> > For issue 1: I am able to manage with a Spark glob read instead of a
> > Hive read, so I am good with that approach.
> > Issue 2: I see the performance issue while writing into the COW table.
> > This is purely a write; no read is involved. I attached the write logs
> > (hudiLogs.txt) to the ticket. The more partitions my target has, the
> > bigger the spike I notice in write time. The fix mentioned in #1919 is
> > applicable to writing as well.
> >
> > On Fri, Aug 28, 2020 at 3:28 PM [email protected] <[email protected]>
> > wrote:
> >
> > > Hi Selvaraj,
> > > We fixed a relevant perf issue in 0.6.0 ([HUDI-1144] Speedup spark
> > > read queries by caching metaclient in HoodieROPathFilter (#1919)).
> > > Can you please try 0.6.0?
> > > Balaji.V
> > >
> > > On Friday, August 28, 2020, 01:31:42 PM PDT, selvaraj periyasamy
> > > <[email protected]> wrote:
> > >
> > > I have created the ticket https://issues.apache.org/jira/browse/HUDI-1232
> > > for tracking a couple of issues.
> > >
> > > One of the concerns in my use case: I have a COW table named TRR. I
> > > see the logs pasted below rolling for every individual partition even
> > > though my write touches only a couple of partitions, and it takes up
> > > to 4 to 5 minutes. I pasted only a few of them. I am wondering
> > > whether, in the future, when I have 3 years' worth of data, writing
> > > will be very slow every time I write into only a couple of partitions.
> > >
> > > 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
> > > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> > > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
> > > type COPY_ON_WRITE from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > > 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
> > > java.util.stream.ReferencePipeline$Head@fed0a8b
> > > 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
> > > partition :20200714/01, #FileGroups=1
> > > 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
> > > NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=1
> > > 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
> > > from base path: hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr,
> > > caching 1 files under
> > > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/01
> > > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
> > > from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > > 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
> > > [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
> > > core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
> > > yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
> > > [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
> > > [email protected] (auth:KERBEROS)]]]
> > > 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
> > > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> > > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
> > > type COPY_ON_WRITE from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > > 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
> > > java.util.stream.ReferencePipeline$Head@285c67a9
> > > 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
> > > partition :20200714/02, #FileGroups=1
> > > 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
> > > NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=0
> > > 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
> > > from base path: hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr,
> > > caching 1 files under
> > > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/02
> > > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
> > > from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > > 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
> > > [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
> > > core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
> > > yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
> > > [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
> > > [email protected] (auth:KERBEROS)]]]
> > > 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
> > > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> > > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
> > > type COPY_ON_WRITE from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > > 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
> > > java.util.stream.ReferencePipeline$Head@2edd9c8
> > > 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
> > > partition :20200714/03, #FileGroups=1
> > > 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
> > > NumFiles=4, FileGroupsCreationTime=1, StoreTimeTaken=0
> > > 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
> > > from base path: hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr,
> > > caching 1 files under
> > > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/03
> > > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
> > > from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > > 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
> > > [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
> > > core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
> > > yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
> > > [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
> > > [email protected] (auth:KERBEROS)]]]
> > >
> > > It seems the more partitions we have, the more time the path-filter
> > > listing takes. Could someone provide more insight into how to make this
> > > faster and keep it scalable as the number of partitions increases?
> > >
> > > Thanks,
> > >
> > > Selva
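For reference, the "spark glob read" workaround mentioned for issue 1 in the thread above might look roughly like the sketch below. This is an assumption about the approach, not code from the thread; the session setup is boilerplate, and the partition glob (following the yyyyMMdd/HH layout visible in the logs) is illustrative.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class TrrGlobRead {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("trr-glob-read")
            .getOrCreate();

        // Read only the partitions of interest directly via a path glob,
        // bypassing the Hive metastore read; the path filter then only has
        // to cover the matched partitions instead of the whole table.
        Dataset<Row> trr = spark.read()
            .format("org.apache.hudi")
            .load("hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/*");

        trr.show();
      }
    }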
