Re: HUDI 0.6 Read Table Performance

Udit Mehrotra Sun, 25 Apr 2021 21:21:03 -0700

In the example you provided, what is the exact path you pass to the Hudi data 
source to read:


s3a://sample_table/*/*/*

Or

s3a://sample_table/*/*

is my question. The first one is bound to be much slower because of the nature 
of S3 listing, as the Hudi filter will be applied to each and every file. You 
should use the second form, globbing down to the partition level, not the file 
level.
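
To make the difference concrete, here is a minimal PySpark-style sketch. It 
assumes a two-level partition layout (year, then month) as in the example, 
and an existing SparkSession with the Hudi bundle on its classpath. The 
helper names partition_glob and read_hudi_table are made up for illustration 
and are not part of Hudi.

```python
def partition_glob(base_path, partition_depth):
    """Return base_path followed by one '*' per partition level.

    Hypothetical helper: for a table partitioned by year and then month,
    partition_depth is 2, which gives a glob ending at the partition
    directories rather than at the files inside them.
    """
    if partition_depth < 0:
        raise ValueError("partition_depth must be non-negative")
    return base_path.rstrip("/") + "/*" * partition_depth


def read_hudi_table(spark, base_path, partition_depth):
    """Load a Hudi table through the Spark data source with a partition-level glob.

    `spark` is assumed to be an existing SparkSession that already has the
    Hudi bundle available.
    """
    return spark.read.format("hudi").load(partition_glob(base_path, partition_depth))
```

With these, read_hudi_table(spark, "s3a://sample_table", 2) loads 
s3a://sample_table/*/* (partition level), whereas a depth of 3 would produce 
the file-level glob s3a://sample_table/*/*/*.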

Sent from my iPhone

> On Apr 25, 2021, at 9:15 PM, Tanuj <tanu.dua...@gmail.com> wrote:
> 
> For example, if we have a table s3a://sample_table/ with partition path year 
> and then month:
> When we read partitions using a glob string like 
> s3a://sample_table/year=2020/month={1,2,3,4}, performance is good.
> But when we read the whole table without providing any partition info, like 
> s3a://sample_table/, it does the whole listing of partitions, I believe, and 
> it becomes slow. We have close to 1k partitions in some of the tables.
> 
>> On 2021/04/26 04:07:31, Udit Mehrotra <udit.mehrotr...@gmail.com> wrote: 
>> When you read the whole table using datasource do you provide glob path up 
>> to file level or partition level ?
>> 
>> Sent from my iPhone
>> 
>>>> On Apr 25, 2021, at 9:05 PM, Tanuj <tanu.dua...@gmail.com> wrote:
>>> 
>>> Thanks Udit. I am just using the Spark data source to read the full table. 
>>> Sometimes we provide partitions and performance is OK, and sometimes we 
>>> can't due to the nature of the data. Are you looking for the HUDI 
>>> parameters that we set while reading the table?
>>> 
>>> 
>>>> On 2021/04/26 04:02:31, Udit Mehrotra <udit.mehrotr...@gmail.com> wrote: 
>>>> Hi Tanuj,
>>>> 
>>>> Can you provide the exact commands you use to read the table? We might be 
>>>> able to guide you based on that.
>>>> 
>>>> Thanks,
>>>> Udit
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>>> On Apr 25, 2021, at 8:34 PM, Tanuj <tanu.dua...@gmail.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> We are using HUDI 0.6 and noticed that some Hudi tables are very slow to 
>>>>> read, especially those with a large number of partitions, probably due to 
>>>>> S3 listing. I know some of these issues have been fixed in later versions 
>>>>> of HUDI, but it will take us some time to migrate. Is there anything in 
>>>>> 0.6 I can leverage?
>>>>> 
>>>>> I also don't understand what .aux/ does, as these folders are empty for 
>>>>> us. We sometimes do an S3-to-S3 copy and read HUDI tables from the newly 
>>>>> copied location, and we are able to read them without the .aux/ folders. 
>>>>> The S3 copy doesn't copy empty folders.
>>>>> 
>>>>> Thanks,
>>>>> Tanu
>>>> 
>>
