On 12/02/2014 12:59 PM, Yan Qi wrote:
We have a table with billions of records, where a record may have multiple
nested tables and a nested table may have tens of thousands of nested records.
We use parquet-avro for data serialization/deserialization.

For example, consider a table with the following schema:

{
  "namespace": "user.model",
  "type": "record",
  "name": "Profile",
  "fields": [
    {"name": "a", "type": "long"},
    {"name": "b", "type": "int"},
    {"name": "c", "type": "int"},
    {"name": "d", "type": "int"},
    {"name": "e", "type": ["int", "null"], "default": 0},
    {"name": "f", "type": ["int", "null"], "default": 0},
    {"name": "g", "type": ["long", "null"], "default": 0},
    {"name": "h", "type": ["int", "null"], "default": 0},
    {"name": "i", "type": [{"type": "array", "items": "iR"}, "null"]},
    {"name": "j", "type": [{"type": "array", "items": "jR"}, "null"]},
    {"name": "k", "type": [{"type": "array", "items": "kR"}, "null"]},
    {"name": "l", "type": [{"type": "array", "items": "lR"}, "null"]},
    {"name": "m", "type": [{"type": "array", "items": "mR"}, "null"]},
    {"name": "n", "type": [{"type": "array", "items": "nR"}, "null"]},
    {"name": "o", "type": [{"type": "array", "items": "oR"}, "null"]},
    {"name": "p", "type": [{"type": "array", "items": "pR"}, "null"]},
    {"name": "q", "type": ["qR", "null"]},
    {"name": "r", "type": [{"type": "array", "items": "rR"}, "null"]},
    {"name": "s", "type": [{"type": "array", "items": "sR"}, "null"]},
    {"name": "t", "type": [{"type": "array", "items": "rR"}, "null"]}
  ]
}

One of the embedded tables, 'iR', has the following definition:

{
  "namespace": "user.model",
  "type": "record",
  "name": "iR",
  "fields": [
    {"name": "i1", "type": "int"},
    {"name": "i2", "type": "int"},
    {"name": "i3", "type": ["float", "null"], "default": 0},
    {"name": "i4", "type": ["boolean", "null"], "default": false},
    {"name": "i5", "type": ["boolean", "null"], "default": true},
    {"name": "i6", "type": ["int", "null"], "default": -1},
    {"name": "i7", "type": ["string", "null"], "default": "Unknown"},
    {"name": "i8", "type": ["string", "null"], "default": "Unknown"},
    {"name": "i9", "type": ["string", "null"], "default": "Unknown"},
    {"name": "i10", "type": ["string", "null"], "default": "Unknown"},
    {"name": "i11", "type": ["float", "null"], "default": 0},
    {"name": "i12", "type": ["string", "null"], "default": "Unknown"},
    {"name": "i13", "type": ["string", "null"], "default": "Unknown"}
  ]
}

One problem is that reading a subset of columns from a nested table (by applying
"AvroParquetInputFormat.setRequestedProjection") is significantly slower than
selecting all columns of that nested table.

For example, when I pick only a few columns from the nested table 'iR', say
{i1, i2, i3}, it is slower than selecting all of its columns {i1 to i13}. This
is counterintuitive, since selecting more columns should involve more I/O.
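
Roughly, the job setup looks like this (a simplified sketch, not our exact
code: 'profile_projection.avsc' stands in for a trimmed copy of the Profile
schema that keeps only the 'i' array with fields i1..i3, and the package
names assume the current parquet.avro release):

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import parquet.avro.AvroParquetInputFormat;

public class ProfileReadJob {
  public static Job setup(Configuration conf) throws IOException {
    Job job = Job.getInstance(conf, "read-profile-subset");

    // Read the Parquet files through the Avro bindings.
    job.setInputFormatClass(AvroParquetInputFormat.class);

    // 'profile_projection.avsc' (hypothetical path) holds the trimmed
    // Profile schema: only the 'i' array, and inside iR only i1, i2, i3.
    Schema projection = new Schema.Parser().parse(new File("profile_projection.avsc"));

    // Ask Parquet to materialize only the projected columns.
    AvroParquetInputFormat.setRequestedProjection(job, projection);

    return job;
  }
}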

Does anyone know why this happens?

Thanks,
Yan

Are you setting a read schema and using a subset of columns in it? If the read schema is not set, it will default to the file schema. In that case, or if you haven't filtered out columns in the read schema, the conversion process will fill in the defaults for columns that aren't present by deep-copying the default value.
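
Here is roughly what I mean, as a sketch (assuming the parquet.avro package
names; 'projection' is the trimmed Avro schema you pass to
setRequestedProjection):

import org.apache.avro.Schema;
import org.apache.hadoop.mapreduce.Job;

import parquet.avro.AvroParquetInputFormat;

public class ProjectedRead {
  public static void configure(Job job, Schema projection) {
    // Project the Parquet columns down to the requested fields...
    AvroParquetInputFormat.setRequestedProjection(job, projection);
    // ...and use the same trimmed schema as the Avro read schema, so the
    // record converter doesn't fill in defaults for the dropped fields.
    AvroParquetInputFormat.setAvroReadSchema(job, projection);
  }
}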

I wouldn't expect this to be a huge cost, especially with your schema, where filling in the defaults wouldn't allocate any new objects because the values are all immutable. But I don't see what else would cause this. Do you have the ability to profile one of these files with a tool like VisualVM to see what the difference is? Or do you have a sample file I could look at?

Thanks for letting us know about this,

rb


--
Ryan Blue
Software Engineer
Cloudera, Inc.
