Thanks Balaji. Since an upgrade is not an immediate option on a shared cluster, I tried a workaround: I added an org.apache.hudi.hadoop.HoodieROTablePathFilter class with the caching logic to a common project module, built a jar, and placed common.jar before the hudi jar. It now picks up the custom class and takes care of the caching. I can manage with this until we upgrade.
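For illustration, a minimal sketch of what such a drop-in caching filter might look like is below. This is an assumption about the approach, not the actual class from common.jar: the real override keeps Hudi's metadata check and only memoizes its result, whereas here the metadata lookup is a placeholder.

    package org.apache.hudi.hadoop;

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // Illustrative sketch only. Shadowing the bundled class with the same
    // fully-qualified name lets Spark pick up this version when common.jar
    // precedes the Hudi bundle on the classpath.
    public class HoodieROTablePathFilter implements PathFilter {

      // Memoized accept() decisions, so repeated listings of the same
      // partition do not re-read table metadata from HDFS every time.
      private static final Map<String, Boolean> DECISION_CACHE = new ConcurrentHashMap<>();

      @Override
      public boolean accept(Path path) {
        return DECISION_CACHE.computeIfAbsent(path.toString(), key -> isLatestBaseFile(path));
      }

      // Placeholder for the real check, which loads the Hudi table metadata
      // (HoodieTableMetaClient / file-system view) to decide whether this
      // file belongs to the latest commit.
      private boolean isLatestBaseFile(Path path) {
        return true; // hypothetical; the actual logic consults the Hudi timeline
      }
    }

Listing /home/selva/common.jar ahead of the Hudi bundle in --jars is what lets this class win the classpath lookup, as in the command below.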
spark2-submit --jars /home/selva/common.jar,/home/selva/hudi-spark-bundle-0.5.0-incubating.jar \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --master yarn --deploy-mode client \
  --driver-memory 4g --executor-memory 10g --num-executors 200 --executor-cores 1 \
  --conf spark.executor.memoryOverhead=4096 \
  --conf spark.shuffle.service.enabled=true \
  --class com.test.cdp.reporting.trr.TRREngine /home/seperiya/transformation-engine.jar

Thanks,
Selva

On Sat, Aug 29, 2020 at 12:55 PM Balaji Varadarajan <[email protected]> wrote:

> Hi Selvaraj,
> Yes, you are right. Sorry for the confusion. As mentioned in the release
> notes, a Spark 2.4.4 runtime is needed, although I don't remember what
> problem you will encounter with Spark 2.3.3. I think it will be a
> worthwhile exercise for you to upgrade to Spark 2.4.4 and the latest Hudi
> version, as we have been and are continuing to improve performance in
> Hudi :) For instance, the very next release will have consolidated
> metadata, which would avoid file listing in the first place.
> Thanks,
> Balaji.V
>
> On Saturday, August 29, 2020, 11:09:25 AM PDT, selvaraj periyasamy
> <[email protected]> wrote:
>
> Thanks Balaji,
>
> I am looking into the steps to upgrade to 0.6.0. I noticed the below
> content in the 0.5.1 release notes here: https://hudi.apache.org/releases.html.
> It says the runtime Spark version must be 2.4+. A little confused now.
> Could you shed more light on this?
>
> Release Highlights
> <https://hudi.apache.org/releases.html#release-highlights-3>
>
> - Dependency Version Upgrades
>   - Upgrade from Spark 2.1.0 to Spark 2.4.4
>   - Upgrade from Avro 1.7.7 to Avro 1.8.2
>   - Upgrade from Parquet 1.8.1 to Parquet 1.10.1
>   - Upgrade from Kafka 0.8.2.1 to Kafka 2.0.0 as a result of updating the
>     spark-streaming-kafka artifact from 0.8_2.11/2.12 to 0.10_2.11/2.12.
> - *IMPORTANT* This version requires your runtime Spark version to be
>   upgraded to 2.4+.
>
> Thanks,
> Selva
>
> On Sat, Aug 29, 2020 at 1:16 AM Balaji Varadarajan
> <[email protected]> wrote:
>
> > From hudiLogs.txt, I find only HoodieROTablePathFilter-related logs
> > repeating, which suggests this is the read side. So we recommend using
> > the latest version. I tried 2.3.3 and ran the quickstart without issues.
> > Give it a shot and let us know if there are any issues.
> > Balaji.V
> >
> > On Friday, August 28, 2020, 04:42:51 PM PDT, selvaraj periyasamy
> > <[email protected]> wrote:
> >
> > Thanks Balaji. My hadoop environment is still running Spark 2.3. Can I
> > run 0.6.0 on Spark 2.3?
> >
> > For issue 1: I am able to manage with a Spark glob read instead of a
> > Hive read, so I am good with that approach.
> > Issue 2: I see the performance issue while writing into the COW table.
> > This is purely a write; no read is involved. I attached the write logs
> > (hudiLogs.txt) to the ticket. The more partitions my target has, the
> > bigger the spike I notice in write time. The fix mentioned in #1919 is
> > applicable to writing as well.
> >
> > On Fri, Aug 28, 2020 at 3:28 PM [email protected] <[email protected]>
> > wrote:
> >
> > > Hi Selvaraj,
> > > We fixed a relevant perf issue in 0.6.0 ([HUDI-1144] Speedup spark
> > > read queries by caching metaclient in HoodieROPathFilter (#1919)).
> > > Can you please try 0.6.0?
> > > Balaji.V
> > >
> > > On Friday, August 28, 2020, 01:31:42 PM PDT, selvaraj periyasamy
> > > <[email protected]> wrote:
> > >
> > > I have created the ticket https://issues.apache.org/jira/browse/HUDI-1232
> > > for tracking a couple of issues.
> > >
> > > One of the concerns in my use case: I have a COW table named TRR. I
> > > see the logs pasted below rolling for every individual partition even
> > > though my write touches only a couple of partitions, and it takes up
> > > to 4 to 5 minutes. I pasted only a few of them. I am wondering
> > > whether, in the future, when I have 3 years' worth of data, writing
> > > will be very slow every time I write into only a couple of partitions.
> > >
> > > 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
> > > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> > > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
> > > type COPY_ON_WRITE from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > > 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
> > > java.util.stream.ReferencePipeline$Head@fed0a8b
> > > 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
> > > partition :20200714/01, #FileGroups=1
> > > 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
> > > NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=1
> > > 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
> > > from base path: hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr,
> > > caching 1 files under
> > > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/01
> > > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
> > > from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > > 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
> > > [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
> > > core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
> > > yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
> > > [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
> > > [email protected] (auth:KERBEROS)]]]
> > > 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
> > > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> > > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
> > > type COPY_ON_WRITE from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > > 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
> > > java.util.stream.ReferencePipeline$Head@285c67a9
> > > 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
> > > partition :20200714/02, #FileGroups=1
> > > 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
> > > NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=0
> > > 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
> > > from base path: hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr,
> > > caching 1 files under
> > > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/02
> > > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
> > > from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > > 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
> > > [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
> > > core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
> > > yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
> > > [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
> > > [email protected] (auth:KERBEROS)]]]
> > > 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
> > > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> > > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
> > > type COPY_ON_WRITE from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > > 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
> > > java.util.stream.ReferencePipeline$Head@2edd9c8
> > > 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
> > > partition :20200714/03, #FileGroups=1
> > > 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
> > > NumFiles=4, FileGroupsCreationTime=1, StoreTimeTaken=0
> > > 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
> > > from base path: hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr,
> > > caching 1 files under
> > > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/03
> > > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
> > > from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > > 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
> > > [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
> > > core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
> > > yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
> > > [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
> > > [email protected] (auth:KERBEROS)]]]
> > >
> > > It seems the more partitions we have, the more time the path-filter
> > > listing takes. Could someone provide more insight into how to make this
> > > faster and keep it scalable as the number of partitions increases?
> > >
> > > Thanks,
> > >
> > > Selva
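For reference, the "spark glob read" workaround mentioned for issue 1 in the thread above might look roughly like the sketch below. This is an assumption about the approach, not code from the thread; the session setup is boilerplate, and the partition glob (following the yyyyMMdd/HH layout visible in the logs) is illustrative.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class TrrGlobRead {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("trr-glob-read")
            .getOrCreate();

        // Read only the partitions of interest directly via a path glob,
        // bypassing the Hive metastore read; the path filter then only has
        // to cover the matched partitions instead of the whole table.
        Dataset<Row> trr = spark.read()
            .format("org.apache.hudi")
            .load("hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/*");

        trr.show();
      }
    }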
