On 12/02/2014 12:59 PM, Yan Qi wrote:
We have a table with billions of records, where a record may have multiple
nested tables and a nested table may have tens of thousands of nested records.
We use parquet-avro for data serialization/deserialization.

For example, consider a table with the following schema:

{
  "namespace": "user.model",
  "type": "record",
  "name": "Profile",
  "fields": [
    {"name": "a", "type": "long"},
    {"name": "b", "type": "int"},
    {"name": "c", "type": "int"},
    {"name": "d", "type": "int"},
    {"name": "e", "type": ["int", "null"], "default": 0},
    {"name": "f", "type": ["int", "null"], "default": 0},
    {"name": "g", "type": ["long", "null"], "default": 0},
    {"name": "h", "type": ["int", "null"], "default": 0},
    {"name": "i", "type": [{"type": "array", "items": "iR"}, "null"]},
    {"name": "j", "type": [{"type": "array", "items": "jR"}, "null"]},
    {"name": "k", "type": [{"type": "array", "items": "kR"}, "null"]},
    {"name": "l", "type": [{"type": "array", "items": "lR"}, "null"]},
    {"name": "m", "type": [{"type": "array", "items": "mR"}, "null"]},
    {"name": "n", "type": [{"type": "array", "items": "nR"}, "null"]},
    {"name": "o", "type": [{"type": "array", "items": "oR"}, "null"]},
    {"name": "p", "type": [{"type": "array", "items": "pR"}, "null"]},
    {"name": "q", "type": ["qR", "null"]},
    {"name": "r", "type": [{"type": "array", "items": "rR"}, "null"]},
    {"name": "s", "type": [{"type": "array", "items": "sR"}, "null"]},
    {"name": "t", "type": [{"type": "array", "items": "rR"}, "null"]}
  ]
}

One of the embedded tables, 'iR', has the following definition:

{
  "namespace": "user.model",
  "type": "record",
  "name": "iR",
  "fields": [
    {"name": "i1", "type": "int"},
    {"name": "i2", "type": "int"},
    {"name": "i3", "type": ["float", "null"], "default": 0},
    {"name": "i4", "type": ["boolean", "null"], "default": false},
    {"name": "i5", "type": ["boolean", "null"], "default": true},
    {"name": "i6", "type": ["int", "null"], "default": -1},
    {"name": "i7", "type": ["string", "null"], "default": "Unknown"},
    {"name": "i8", "type": ["string", "null"], "default": "Unknown"},
    {"name": "i9", "type": ["string", "null"], "default": "Unknown"},
    {"name": "i10", "type": ["string", "null"], "default": "Unknown"},
    {"name": "i11", "type": ["float", "null"], "default": 0},
    {"name": "i12", "type": ["string", "null"], "default": "Unknown"},
    {"name": "i13", "type": ["string", "null"], "default": "Unknown"}
  ]
}

One problem is that reading a subset of columns from a nested table (by applying
"AvroParquetInputFormat.setRequestedProjection") is significantly slower than
selecting all columns of that nested table.

For example, when I pick only a few columns from the nested table 'iR', say
{i1, i2, i3}, it is slower than selecting all of its columns {i1 to i13}. This
is counterintuitive, since selecting more columns should involve more I/O.
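
Roughly, the job setup looks like this (a simplified sketch, not our exact
code: 'profile_projection.avsc' stands in for a trimmed copy of the Profile
schema that keeps only the 'i' array with fields i1..i3, and the package
names assume the current parquet.avro release):

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import parquet.avro.AvroParquetInputFormat;

public class ProfileReadJob {
  public static Job setup(Configuration conf) throws IOException {
    Job job = Job.getInstance(conf, "read-profile-subset");

    // Read the Parquet files through the Avro bindings.
    job.setInputFormatClass(AvroParquetInputFormat.class);

    // 'profile_projection.avsc' (hypothetical path) holds the trimmed
    // Profile schema: only the 'i' array, and inside iR only i1, i2, i3.
    Schema projection = new Schema.Parser().parse(new File("profile_projection.avsc"));

    // Ask Parquet to materialize only the projected columns.
    AvroParquetInputFormat.setRequestedProjection(job, projection);

    return job;
  }
}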

Does anyone know why this happens?

Thanks,
Yan

Are you setting a read schema and using a subset of columns in it? If the read schema is not set, it will default to the file schema. In that case, or if you haven't filtered out columns in the read schema, the conversion process will fill in the defaults for columns that aren't present by deep-copying the default value.
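
Here is roughly what I mean, as a sketch (assuming the parquet.avro package
names; 'projection' is the trimmed Avro schema you pass to
setRequestedProjection):

import org.apache.avro.Schema;
import org.apache.hadoop.mapreduce.Job;

import parquet.avro.AvroParquetInputFormat;

public class ProjectedRead {
  public static void configure(Job job, Schema projection) {
    // Project the Parquet columns down to the requested fields...
    AvroParquetInputFormat.setRequestedProjection(job, projection);
    // ...and use the same trimmed schema as the Avro read schema, so the
    // record converter doesn't fill in defaults for the dropped fields.
    AvroParquetInputFormat.setAvroReadSchema(job, projection);
  }
}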

I wouldn't expect this to be a huge cost, especially with your schema, where filling in the defaults wouldn't allocate any new objects because the values are all immutable. But I don't see what else would cause this. Do you have the ability to profile one of these files with a tool like VisualVM to see what the difference is? Or do you have a sample file I could look at?

Thanks for letting us know about this,

rb


--
Ryan Blue
Software Engineer
Cloudera, Inc.
