For example, say we have a table s3a://sample_table/ partitioned by year and
then month.
When we read specific partitions using a glob string like
s3a://sample_table/year=2020/month={1,2,3,4}, performance is good.
But when we read the whole table without providing any partition info, like
s3a://sample_table/, I believe it lists all of the partitions and it
becomes slow. We have close to 1k partitions in some of the tables.
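
To make the partition-pruned read concrete, here is a minimal sketch of how such a glob string could be built. It assumes the year=/month= directory layout from the example above; partition_glob is a hypothetical helper written for illustration, not a HUDI or Spark API, and the commented-out read assumes an existing SparkSession named spark.

```python
def partition_glob(base_path, year, months):
    """Build a Hadoop-style glob selecting some month partitions of one year."""
    base = base_path.rstrip("/")
    # Brace alternation: {1,2,3,4} matches any one of the listed values.
    month_list = ",".join(str(m) for m in months)
    return f"{base}/year={year}/month={{{month_list}}}"


path = partition_glob("s3a://sample_table/", 2020, [1, 2, 3, 4])
print(path)

# Assumed usage, not executed here; `spark` is an existing SparkSession:
# df = spark.read.format("hudi").load(path)
```

Whether the glob also needs a trailing /* to reach the file level may depend on the HUDI version, so treat the load call as illustrative only.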
On 2021/04/26 04:07:31, Udit Mehrotra <[email protected]> wrote:
> When you read the whole table using the datasource, do you provide the glob
> path up to the file level or the partition level?
>
> > On Apr 25, 2021, at 9:05 PM, Tanuj <[email protected]> wrote:
> >
> > Thanks Udit. I am just using the Spark data source to read the full table.
> > Sometimes we provide partitions and performance is OK, and sometimes we
> > can't due to the nature of the data. Are you looking for the HUDI parameters
> > that we set while reading the table?
> >
> >
> >> On 2021/04/26 04:02:31, Udit Mehrotra <[email protected]> wrote:
> >> Hi Tanuj,
> >>
> >> Can you provide the exact commands you use to read the table? We might be
> >> able to guide you based on that.
> >>
> >> Thanks,
> >> Udit
> >>
> >>>> On Apr 25, 2021, at 8:34 PM, Tanuj <[email protected]> wrote:
> >>>
> >>> Hi,
> >>> We are using HUDI 0.6 and noticed that some HUDI tables are very slow to
> >>> read, especially with a large number of partitions, probably due to S3
> >>> listing. I know that later versions of HUDI have fixed some of these
> >>> issues, but it will take us some time to migrate. Is there anything in
> >>> 0.6 I can leverage?
> >>>
> >>> I also don't understand what .aux/ does, as these folders are empty for
> >>> us. We sometimes do an S3-to-S3 copy and read HUDI tables from the newly
> >>> copied location, and we are able to read them without the .aux/ folders.
> >>> When the S3 copy runs, it doesn't copy empty folders.
> >>>
> >>> Thanks,
> >>> Tanu
> >>
>