Re: Performance Issue with Parquet-Avro

Yan Qi Thu, 04 Dec 2014 10:44:39 -0800

Hi Ryan,

Thanks for your quick reply!


You're probably right. The projected schema has 4 columns (out of 13 in the
read schema). If that's the problem, how does the read schema get FILTERED?
I thought the read schema should always be the same as the file schema
(i.e., Profile.getClassSchema()), right?

Thanks,
Yan

On Thu, Dec 4, 2014 at 10:34 AM, Ryan Blue <[email protected]> wrote:

> On 12/04/2014 10:28 AM, Yan Qi wrote:
>
>> Hi rb,
>>
>> Thanks for your quick reply!
>>
>> I first set the read schema,
>> AvroParquetInputFormat.setAvroReadSchema(job, Profile.getClassSchema());
>>
>> Then I define a request schema which is a subset of
>> Profile.getClassSchema() and set the projection:
>> AvroParquetInputFormat.setRequestedProjection(job, requestSchema);
>>
>> Is there any problem with this? Or is there anything else I missed?
>>
>> Thanks,
>> Yan
>>
>
> How many fields are there in the iR record in your read schema? I think
> the problem is that you're getting defaults for the columns you're removing
> with the projected schema. So if you have 13 columns in the read schema but
> you're only loading 3 of them from the file, then you're defaulting 10
> columns and that might cause a slow-down.
>
> An easy way to check is to see how many columns in iR your read schema has
> and how many you're actually loading from the file. Then try filtering your
> read schema as well so you don't have as many and see if that helps
> performance.
>
>
> rb
>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>
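
The check suggested above (how many read-schema fields are not actually
loaded from the file) can be sketched in plain Java. This is only an
illustration under stated assumptions: the field names are hypothetical
placeholders, and the lists of names stand in for what Avro's
Schema.getFields() and Field.name() would return for the read schema and
the projected schema. It does not touch Parquet at all, and whether
filtering removes the slow-down is not confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Avro or Parquet.
final class ProjectionCheck {

    // Fields in the read schema that are missing from the projection.
    // Per the discussion above, these are the ones that get defaulted.
    static List<String> defaultedFields(List<String> readFields,
                                        List<String> projectedFields) {
        Set<String> projected = new HashSet<>(projectedFields);
        List<String> result = new ArrayList<>();
        for (String name : readFields) {
            if (!projected.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    // Read-schema fields that remain after filtering: those that are also
    // in the projection, kept in read-schema order.
    static List<String> filteredReadFields(List<String> readFields,
                                           List<String> projectedFields) {
        Set<String> projected = new HashSet<>(projectedFields);
        List<String> result = new ArrayList<>();
        for (String name : readFields) {
            if (projected.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 13 placeholder field names standing in for the full read schema.
        List<String> read = new ArrayList<>();
        for (int i = 1; i <= 13; i++) {
            read.add("field" + i);
        }
        // 4 placeholder names standing in for the projected schema.
        List<String> projected =
            Arrays.asList("field1", "field2", "field3", "field4");

        System.out.println("defaulted: " + defaultedFields(read, projected));
        System.out.println("kept:      " + filteredReadFields(read, projected));
    }
}
```

With the numbers from this thread (13 fields in the read schema, 4 in the
projection), 9 fields would be defaulted. A read schema built from only the
kept fields could then be passed to AvroParquetInputFormat.setAvroReadSchema
in place of Profile.getClassSchema(), alongside the existing
setRequestedProjection call.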
