For example, suppose we have a table at s3a://sample_table/ partitioned by year and 
then month.
When we read partitions using a glob string like 
s3a://sample_table/year=2020/month={1,2,3,4}, performance is good.
But when we read the whole table without providing any partition info, like 
s3a://sample_table/, I believe it lists all of the partitions and 
becomes slow. We have close to 1k partitions in some of the tables.
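
To make the two read patterns concrete, here is a rough sketch of how the load paths could be built. The helper names partition_glob and read_table are hypothetical, the year=/month= layout is taken from the example above, and read_table assumes a SparkSession with the Hudi bundle available. Depending on the Hudi version, the glob may need to go one level deeper (down to the files), so treat this as an illustration rather than the exact command we run.

```python
from typing import Iterable, Optional

BASE_PATH = "s3a://sample_table"  # table location from the example above


def partition_glob(base_path: str,
                   year: Optional[int] = None,
                   months: Optional[Iterable[int]] = None) -> str:
    """Build a load path for a table laid out as year=<y>/month=<m>."""
    base = base_path.rstrip("/")
    if year is None:
        # No partition info: the reader has to discover every partition.
        return f"{base}/"
    if months is None:
        # Only the year is known: wildcard the month level.
        return f"{base}/year={year}/*"
    month_list = [str(m) for m in months]
    if len(month_list) == 1:
        return f"{base}/year={year}/month={month_list[0]}"
    # Hadoop-style alternation, e.g. month={1,2,3,4}
    return f"{base}/year={year}/month={{{','.join(month_list)}}}"


def read_table(spark, path: str):
    # Assumption: `spark` is a SparkSession with the Hudi bundle on the classpath.
    return spark.read.format("hudi").load(path)


if __name__ == "__main__":
    print(partition_glob(BASE_PATH, 2020, [1, 2, 3, 4]))  # the fast case
    print(partition_glob(BASE_PATH))                      # the slow case
```

The first path restricts listing to four partitions, while the second leaves the reader to walk the whole table.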

On 2021/04/26 04:07:31, Udit Mehrotra <[email protected]> wrote: 
> When you read the whole table using the datasource, do you provide a glob path 
> up to the file level or the partition level?
> 
> > On Apr 25, 2021, at 9:05 PM, Tanuj <[email protected]> wrote:
> > 
> > Thanks Udit. I am just using the Spark data source to read the full table. 
> > Sometimes we provide partitions and performance is OK, and sometimes we can't 
> > due to the nature of the data. Are you looking for the HUDI parameters that we 
> > set while reading the table?
> > 
> > 
> >> On 2021/04/26 04:02:31, Udit Mehrotra <[email protected]> wrote: 
> >> Hi Tanuj,
> >> 
> >> Can you provide the exact commands you use to read the table? We might be 
> >> able to guide based on that.
> >> 
> >> Thanks,
> >> Udit
> >> 
> >>>> On Apr 25, 2021, at 8:34 PM, Tanuj <[email protected]> wrote:
> >>> 
> >>> Hi,
> >>> We are using HUDI 0.6 and noticed that some HUDI tables are very slow to 
> >>> read, especially those with a large number of partitions, probably due to 
> >>> S3 listing. I know some of these issues have been fixed in later versions 
> >>> of HUDI, but it will take us some time to migrate. Is there anything in 
> >>> 0.6 I can leverage?
> >>> 
> >>> I also don't understand what .aux/ does, as these folders are empty for us. 
> >>> We sometimes do an S3-to-S3 copy and read HUDI tables from the newly copied 
> >>> location, and we are able to read them without the .aux/ folders. 
> >>> The S3 copy doesn't copy empty folders.
> >>> 
> >>> Thanks,
> >>> Tanu
> >> 
> 
