We have a table with billions of records; a record may contain multiple
nested tables, and a nested table may hold tens of thousands of nested
records. We currently use parquet-avro for data
serialization/deserialization.
For example, given a table with the following schema:
{
  "namespace": "user.model",
  "type": "record",
  "name": "Profile",
  "fields": [
    {"name": "a", "type": "long"},
    {"name": "b", "type": "int"},
    {"name": "c", "type": "int"},
    {"name": "d", "type": "int"},
    {"name": "e", "type": ["int", "null"], "default": 0},
    {"name": "f", "type": ["int", "null"], "default": 0},
    {"name": "g", "type": ["long", "null"], "default": 0},
    {"name": "h", "type": ["int", "null"], "default": 0},
    {"name": "i", "type": [{"type": "array", "items": "iR"}, "null"]},
    {"name": "j", "type": [{"type": "array", "items": "jR"}, "null"]},
    {"name": "k", "type": [{"type": "array", "items": "kR"}, "null"]},
    {"name": "l", "type": [{"type": "array", "items": "lR"}, "null"]},
    {"name": "m", "type": [{"type": "array", "items": "mR"}, "null"]},
    {"name": "n", "type": [{"type": "array", "items": "nR"}, "null"]},
    {"name": "o", "type": [{"type": "array", "items": "oR"}, "null"]},
    {"name": "p", "type": [{"type": "array", "items": "pR"}, "null"]},
    {"name": "q", "type": ["qR", "null"]},
    {"name": "r", "type": [{"type": "array", "items": "rR"}, "null"]},
    {"name": "s", "type": [{"type": "array", "items": "sR"}, "null"]},
    {"name": "t", "type": [{"type": "array", "items": "rR"}, "null"]}
  ]
}
One of the embedded tables, 'iR', has the following definition:
{
  "namespace": "user.model",
  "type": "record",
  "name": "iR",
  "fields": [
    {"name": "i1", "type": "int"},
    {"name": "i2", "type": "int"},
    {"name": "i3", "type": ["float", "null"], "default": 0},
    {"name": "i4", "type": ["boolean", "null"], "default": false},
    {"name": "i5", "type": ["boolean", "null"], "default": true},
    {"name": "i6", "type": ["int", "null"], "default": -1},
    {"name": "i7", "type": ["string", "null"], "default": "Unknown"},
    {"name": "i8", "type": ["string", "null"], "default": "Unknown"},
    {"name": "i9", "type": ["string", "null"], "default": "Unknown"},
    {"name": "i10", "type": ["string", "null"], "default": "Unknown"},
    {"name": "i11", "type": ["float", "null"], "default": 0},
    {"name": "i12", "type": ["string", "null"], "default": "Unknown"},
    {"name": "i13", "type": ["string", "null"], "default": "Unknown"}
  ]
}
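For context, here is roughly how our read job is wired up. This is a
minimal sketch assuming the org.apache.parquet.avro package of recent
releases (older releases use parquet.avro instead); the class name, job
name, and input path argument are illustrative placeholders, not our
actual code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class ProfileScan {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "profile-scan");
    job.setJarByClass(ProfileScan.class);
    // The input format delivers one (Void, GenericRecord) pair per Profile row.
    job.setInputFormatClass(AvroParquetInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // ... mapper, output format, etc. omitted ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}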
One problem is that reading a subset of columns from a nested table (by
applying "AvroParquetInputFormat.setRequestedProjection") is significantly
slower than selecting all columns of that nested table.
For example, when I pick only some columns from the nested table 'iR',
such as {i1, i2, i3}, the read is slower than when I select all columns
of iR, {i1 to i13}. This is counterintuitive, since reading more columns
should involve more I/O.
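Concretely, the projection is applied roughly as below. The projection
schema string is my reconstruction of what we pass (it mirrors the full
Profile schema but keeps only i1-i3 inside iR), and 'job' is the Hadoop
Job being configured, as in the setup sketch above:

import org.apache.avro.Schema;
import org.apache.parquet.avro.AvroParquetInputFormat;

Schema projection = new Schema.Parser().parse(
    "{\"namespace\": \"user.model\", \"type\": \"record\", \"name\": \"Profile\","
  + " \"fields\": ["
  + "   {\"name\": \"i\", \"type\": [{\"type\": \"array\", \"items\": {"
  + "     \"type\": \"record\", \"name\": \"iR\", \"fields\": ["
  + "       {\"name\": \"i1\", \"type\": \"int\"},"
  + "       {\"name\": \"i2\", \"type\": \"int\"},"
  + "       {\"name\": \"i3\", \"type\": [\"float\", \"null\"], \"default\": 0}"
  + "     ]}}, \"null\"]}"
  + " ]}");
// Only the projected columns should be read and assembled by the reader.
AvroParquetInputFormat.setRequestedProjection(job, projection);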
Does anyone know why this happens?
Thanks,
Yan