Hi rb,

Thanks for your quick reply!
I first set the read schema:

  AvroParquetInputFormat.setAvroReadSchema(job, Profile.getClassSchema());

Then I define a request schema that is a subset of Profile.getClassSchema() and set the projection:

  AvroParquetInputFormat.setRequestedProjection(job, requestSchema);

Is there any problem with this? Or is there anything else I missed? A rough sketch of how the request schema is built is included at the end of this message, below the quoted thread.

Thanks,
Yan

On Tue, Dec 2, 2014 at 2:53 PM, Ryan Blue <[email protected]> wrote:

> On 12/02/2014 12:59 PM, Yan Qi wrote:
>
>> We have a table with billions of records, where a record may have
>> multiple nested tables and a nested table may have tens of thousands of
>> nested records. We currently use parquet-avro for data
>> serialization/deserialization.
>>
>> For example, given a table with the following schema:
>>
>> {
>>   "namespace": "user.model",
>>   "type": "record",
>>   "name": "Profile",
>>   "fields": [
>>     {"name": "a", "type": "long"},
>>     {"name": "b", "type": "int"},
>>     {"name": "c", "type": "int"},
>>     {"name": "d", "type": "int"},
>>     {"name": "e", "type": ["int", "null"], "default": 0},
>>     {"name": "f", "type": ["int", "null"], "default": 0},
>>     {"name": "g", "type": ["long", "null"], "default": 0},
>>     {"name": "h", "type": ["int", "null"], "default": 0},
>>     {"name": "i", "type": [{"type": "array", "items": "iR"}, "null"]},
>>     {"name": "j", "type": [{"type": "array", "items": "jR"}, "null"]},
>>     {"name": "k", "type": [{"type": "array", "items": "kR"}, "null"]},
>>     {"name": "l", "type": [{"type": "array", "items": "lR"}, "null"]},
>>     {"name": "m", "type": [{"type": "array", "items": "mR"}, "null"]},
>>     {"name": "n", "type": [{"type": "array", "items": "nR"}, "null"]},
>>     {"name": "o", "type": [{"type": "array", "items": "oR"}, "null"]},
>>     {"name": "p", "type": [{"type": "array", "items": "pR"}, "null"]},
>>     {"name": "q", "type": ["qR", "null"]},
>>     {"name": "r", "type": [{"type": "array", "items": "rR"}, "null"]},
>>     {"name": "s", "type": [{"type": "array", "items": "sR"}, "null"]},
>>     {"name": "t", "type": [{"type": "array", "items": "rR"}, "null"]}
>>   ]
>> }
>>
>> One of the embedded tables, 'iR', has the following definition:
>>
>> {
>>   "namespace": "user.model",
>>   "type": "record",
>>   "name": "iR",
>>   "fields": [
>>     {"name": "i1", "type": "int"},
>>     {"name": "i2", "type": "int"},
>>     {"name": "i3", "type": ["float", "null"], "default": 0},
>>     {"name": "i4", "type": ["boolean", "null"], "default": false},
>>     {"name": "i5", "type": ["boolean", "null"], "default": true},
>>     {"name": "i6", "type": ["int", "null"], "default": -1},
>>     {"name": "i7", "type": ["string", "null"], "default": "Unknown"},
>>     {"name": "i8", "type": ["string", "null"], "default": "Unknown"},
>>     {"name": "i9", "type": ["string", "null"], "default": "Unknown"},
>>     {"name": "i10", "type": ["string", "null"], "default": "Unknown"},
>>     {"name": "i11", "type": ["float", "null"], "default": 0},
>>     {"name": "i12", "type": ["string", "null"], "default": "Unknown"},
>>     {"name": "i13", "type": ["string", "null"], "default": "Unknown"}
>>   ]
>> }
>>
>> The problem is that reading a subset of columns from a nested table (by
>> applying "AvroParquetInputFormat.setRequestedProjection") is significantly
>> slower than selecting all columns of that nested table.
>>
>> For example, picking a few columns from the nested table 'iR', such as
>> {i1, i2, i3}, is slower than selecting all columns of iR {i1 to i13}.
>> This is counterintuitive, since selecting more columns should involve
>> more I/O.
>>
>> Does anyone know why this happens?
>>
>> Thanks,
>> Yan
>
> Are you setting a read schema and using a subset of columns in it? If the
> read schema is not set, then it will default to the file schema. In that
> case, or if you haven't filtered out columns in the read schema, the
> conversion process will fill in the defaults for columns that aren't
> present by deep-copying the default value.
>
> I wouldn't expect this to be a huge cost, especially with your schema,
> where it wouldn't cause any new objects to be allocated because the
> values are all immutable. But I don't see what else would cause this. Do
> you have the ability to profile one of these files with a tool like
> VisualVM to see what the difference is? Or do you have a sample file I
> could look at?
>
> Thanks for letting us know about this,
>
> rb
>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
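For reference, this is roughly how the read schema and the requested projection are wired into the job. It is a simplified sketch, not the exact production code: the projected fields (a plus i1/i2/i3 of the nested iR records), the Schema.Parser-based construction, and the ProjectionSetup/configure names are illustrative; only the Profile.getClassSchema() read schema and the two AvroParquetInputFormat calls are what the real job uses. The import reflects the pre-Apache package name of the release we are on.

  import org.apache.avro.Schema;
  import org.apache.hadoop.mapreduce.Job;

  import parquet.avro.AvroParquetInputFormat;  // org.apache.parquet.avro in later releases

  import user.model.Profile;  // Avro-generated class for the Profile schema above

  public class ProjectionSetup {

    // Requested projection: a subset of the Profile schema that keeps only
    // field 'a' and columns i1/i2/i3 of the nested iR records.
    // (The field choice here is illustrative.)
    private static final String PROJECTION_JSON =
        "{\"namespace\": \"user.model\", \"type\": \"record\", \"name\": \"Profile\","
        + " \"fields\": ["
        + "   {\"name\": \"a\", \"type\": \"long\"},"
        + "   {\"name\": \"i\", \"type\": [{\"type\": \"array\", \"items\": {"
        + "     \"type\": \"record\", \"name\": \"iR\", \"fields\": ["
        + "       {\"name\": \"i1\", \"type\": \"int\"},"
        + "       {\"name\": \"i2\", \"type\": \"int\"},"
        + "       {\"name\": \"i3\", \"type\": [\"float\", \"null\"], \"default\": 0}"
        + "     ]}}, \"null\"]}"
        + " ]}";

    public static void configure(Job job) {
      Schema requestSchema = new Schema.Parser().parse(PROJECTION_JSON);

      // Read schema: the Avro schema that records are materialized into
      // (currently the full generated Profile schema, as described above).
      AvroParquetInputFormat.setAvroReadSchema(job, Profile.getClassSchema());

      // Requested projection: controls which Parquet columns are actually read.
      AvroParquetInputFormat.setRequestedProjection(job, requestSchema);
    }
  }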
