Re: HUDI 0.6 Read Table Performance

Tanuj Mon, 26 Apr 2021 01:31:00 -0700

Thanks Udit. We have been using /*/*/*. When I use /*/* instead, it is much 
faster. But why does the Quick Start Guide say that we need the extra /*? 
What is the side effect or advantage of adding the extra /*?
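
To make the two forms concrete, here is a minimal sketch of how the globs 
differ for a table with two partition levels (year, then month). The helper 
names partition_glob, file_glob and read_hudi are hypothetical, not part of 
Hudi, and read_hudi assumes an existing SparkSession with the Hudi bundle on 
the classpath.

```python
def partition_glob(base_path: str, partition_depth: int) -> str:
    """Glob that stops at the partition directories (one '*' per level)."""
    if partition_depth < 0:
        raise ValueError("partition_depth must be non-negative")
    return base_path.rstrip("/") + "/*" * partition_depth


def file_glob(base_path: str, partition_depth: int) -> str:
    """Glob that goes one level deeper, down to the individual files."""
    return partition_glob(base_path, partition_depth) + "/*"


def read_hudi(spark, path: str):
    """Load a Hudi table via the Spark data source.

    `spark` is assumed to be an existing SparkSession (hypothetical helper).
    """
    return spark.read.format("hudi").load(path)


base = "s3a://sample_table/"
print(partition_glob(base, 2))  # s3a://sample_table/*/*   (partition level)
print(file_glob(base, 2))       # s3a://sample_table/*/*/* (file level)
```

With the partition-level form, the read would be 
read_hudi(spark, partition_glob(base, 2)).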


On 2021/04/26 04:20:53, Udit Mehrotra <[email protected]> wrote: 
> In the example provided, what is the exact path you pass to the Hudi data 
> source to read:
> 
> s3a://sample_table/*/*/*
> 
> Or
> 
> s3a://sample_table/*/*
> 
> is my question. The first one is bound to be much slower because of the 
> nature of Spark's listing, as the Hudi filter will be applied to each and 
> every file. You should use the second example, globbing to the partition 
> level, not the file level.
> 
> Sent from my iPhone
> 
> > On Apr 25, 2021, at 9:15 PM, Tanuj <[email protected]> wrote:
> > 
> > For example, say we have a table s3a://sample_table/ partitioned by year 
> > and then month.
> > When we read partitions using a glob string like 
> > s3a://sample_table/year=2020/month={1,2,3,4}, performance is good.
> > But when we read the whole table without providing any partition info, like 
> > s3a://sample_table/, it does a full listing of the partitions, I believe, 
> > and it becomes slow. We have close to 1k partitions in some of the tables.
> > 
> >> On 2021/04/26 04:07:31, Udit Mehrotra <[email protected]> wrote: 
> >> When you read the whole table using the data source, do you provide a glob 
> >> path up to the file level or the partition level?
> >> 
> >> Sent from my iPhone
> >> 
> >>>> On Apr 25, 2021, at 9:05 PM, Tanuj <[email protected]> wrote:
> >>> 
> >>> Thanks Udit. I am just using the Spark data source to read the full table. 
> >>> Sometimes we provide partitions and performance is OK, and sometimes we 
> >>> can't due to the nature of the data. Are you looking for the HUDI 
> >>> parameters that we set while reading the table?
> >>> 
> >>> 
> >>>> On 2021/04/26 04:02:31, Udit Mehrotra <[email protected]> wrote: 
> >>>> Hi Tanuj,
> >>>> 
> >>>> Can you provide the exact commands you use to read the table? We might 
> >>>> be able to guide based on that.
> >>>> 
> >>>> Thanks,
> >>>> Udit
> >>>> 
> >>>> Sent from my iPhone
> >>>> 
> >>>>>> On Apr 25, 2021, at 8:34 PM, Tanuj <[email protected]> wrote:
> >>>>> 
> >>>>> Hi,
> >>>>> We are using HUDI 0.6 and noticed that some Hudi tables are very slow 
> >>>>> to read, especially those with a large number of partitions, probably 
> >>>>> due to S3 listing. I know some of these issues have been fixed in later 
> >>>>> versions of HUDI, but it will take us some time to migrate. Is there 
> >>>>> anything in 0.6 I can leverage?
> >>>>> 
> >>>>> I also don't understand what .aux does, as these folders are empty for 
> >>>>> us. We sometimes do an S3-to-S3 copy and read HUDI tables from the newly 
> >>>>> copied location, and we are able to read them without the .aux/ folders. 
> >>>>> The S3 copy doesn't copy empty folders.
> >>>>> 
> >>>>> Thanks,
> >>>>> Tanu
> >>>> 
> >> 
>
