Hi rb,

Thanks for your quick reply!
I first set the read schema:

  AvroParquetInputFormat.setAvroReadSchema(job, Profile.getClassSchema());

Then I define a request schema that is a subset of Profile.getClassSchema() and set the projection:

  AvroParquetInputFormat.setRequestedProjection(job, requestSchema);

Is there any problem with this? Or is there anything else I missed? A rough sketch of how the request schema is built is included at the end of this message, below the quoted thread.

Thanks,
Yan

On Tue, Dec 2, 2014 at 2:53 PM, Ryan Blue <[email protected]> wrote:

> On 12/02/2014 12:59 PM, Yan Qi wrote:
>
>> We have a table with billions of records, where a record may have
>> multiple nested tables and a nested table may have tens of thousands of
>> nested records. We currently use parquet-avro for data
>> serialization/deserialization.
>>
>> For example, given a table with the following schema:
>>
>> {
>>   "namespace": "user.model",
>>   "type": "record",
>>   "name": "Profile",
>>   "fields": [
>>     {"name": "a", "type": "long"},
>>     {"name": "b", "type": "int"},
>>     {"name": "c", "type": "int"},
>>     {"name": "d", "type": "int"},
>>     {"name": "e", "type": ["int", "null"], "default": 0},
>>     {"name": "f", "type": ["int", "null"], "default": 0},
>>     {"name": "g", "type": ["long", "null"], "default": 0},
>>     {"name": "h", "type": ["int", "null"], "default": 0},
>>     {"name": "i", "type": [{"type": "array", "items": "iR"}, "null"]},
>>     {"name": "j", "type": [{"type": "array", "items": "jR"}, "null"]},
>>     {"name": "k", "type": [{"type": "array", "items": "kR"}, "null"]},
>>     {"name": "l", "type": [{"type": "array", "items": "lR"}, "null"]},
>>     {"name": "m", "type": [{"type": "array", "items": "mR"}, "null"]},
>>     {"name": "n", "type": [{"type": "array", "items": "nR"}, "null"]},
>>     {"name": "o", "type": [{"type": "array", "items": "oR"}, "null"]},
>>     {"name": "p", "type": [{"type": "array", "items": "pR"}, "null"]},
>>     {"name": "q", "type": ["qR", "null"]},
>>     {"name": "r", "type": [{"type": "array", "items": "rR"}, "null"]},
>>     {"name": "s", "type": [{"type": "array", "items": "sR"}, "null"]},
>>     {"name": "t", "type": [{"type": "array", "items": "rR"}, "null"]}
>>   ]
>> }
>>
>> One of the embedded tables, 'iR', has the following definition:
>>
>> {
>>   "namespace": "user.model",
>>   "type": "record",
>>   "name": "iR",
>>   "fields": [
>>     {"name": "i1", "type": "int"},
>>     {"name": "i2", "type": "int"},
>>     {"name": "i3", "type": ["float", "null"], "default": 0},
>>     {"name": "i4", "type": ["boolean", "null"], "default": false},
>>     {"name": "i5", "type": ["boolean", "null"], "default": true},
>>     {"name": "i6", "type": ["int", "null"], "default": -1},
>>     {"name": "i7", "type": ["string", "null"], "default": "Unknown"},
>>     {"name": "i8", "type": ["string", "null"], "default": "Unknown"},
>>     {"name": "i9", "type": ["string", "null"], "default": "Unknown"},
>>     {"name": "i10", "type": ["string", "null"], "default": "Unknown"},
>>     {"name": "i11", "type": ["float", "null"], "default": 0},
>>     {"name": "i12", "type": ["string", "null"], "default": "Unknown"},
>>     {"name": "i13", "type": ["string", "null"], "default": "Unknown"}
>>   ]
>> }
>>
>> The problem is that reading a subset of columns from a nested table (by
>> applying "AvroParquetInputFormat.setRequestedProjection") is significantly
>> slower than selecting all columns of that nested table.
>>
>> For example, picking a few columns from the nested table 'iR', such as
>> {i1, i2, i3}, is slower than selecting all columns of iR {i1 to i13}.
>> This is counterintuitive, since selecting more columns should involve
>> more I/O.
>>
>> Does anyone know why this happens?
>>
>> Thanks,
>> Yan
>
> Are you setting a read schema and using a subset of columns in it? If the
> read schema is not set, then it will default to the file schema. In that
> case, or if you haven't filtered out columns in the read schema, the
> conversion process will fill in the defaults for columns that aren't
> present by deep-copying the default value.
>
> I wouldn't expect this to be a huge cost, especially with your schema,
> where it wouldn't cause any new objects to be allocated because the
> values are all immutable. But I don't see what else would cause this. Do
> you have the ability to profile one of these files with a tool like
> VisualVM to see what the difference is? Or do you have a sample file I
> could look at?
>
> Thanks for letting us know about this,
>
> rb
>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
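For reference, this is roughly how the read schema and the requested projection are wired into the job. It is a simplified sketch, not the exact production code: the projected fields (a plus i1/i2/i3 of the nested iR records), the Schema.Parser-based construction, and the ProjectionSetup/configure names are illustrative; only the Profile.getClassSchema() read schema and the two AvroParquetInputFormat calls are what the real job uses. The import reflects the pre-Apache package name of the release we are on.

  import org.apache.avro.Schema;
  import org.apache.hadoop.mapreduce.Job;

  import parquet.avro.AvroParquetInputFormat;  // org.apache.parquet.avro in later releases

  import user.model.Profile;  // Avro-generated class for the Profile schema above

  public class ProjectionSetup {

    // Requested projection: a subset of the Profile schema that keeps only
    // field 'a' and columns i1/i2/i3 of the nested iR records.
    // (The field choice here is illustrative.)
    private static final String PROJECTION_JSON =
        "{\"namespace\": \"user.model\", \"type\": \"record\", \"name\": \"Profile\","
        + " \"fields\": ["
        + "   {\"name\": \"a\", \"type\": \"long\"},"
        + "   {\"name\": \"i\", \"type\": [{\"type\": \"array\", \"items\": {"
        + "     \"type\": \"record\", \"name\": \"iR\", \"fields\": ["
        + "       {\"name\": \"i1\", \"type\": \"int\"},"
        + "       {\"name\": \"i2\", \"type\": \"int\"},"
        + "       {\"name\": \"i3\", \"type\": [\"float\", \"null\"], \"default\": 0}"
        + "     ]}}, \"null\"]}"
        + " ]}";

    public static void configure(Job job) {
      Schema requestSchema = new Schema.Parser().parse(PROJECTION_JSON);

      // Read schema: the Avro schema that records are materialized into
      // (currently the full generated Profile schema, as described above).
      AvroParquetInputFormat.setAvroReadSchema(job, Profile.getClassSchema());

      // Requested projection: controls which Parquet columns are actually read.
      AvroParquetInputFormat.setRequestedProjection(job, requestSchema);
    }
  }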
