Hi Ryan,

Thanks for your quick reply!

You're probably right. The projected schema has 4 columns (out of 13 in the read schema). If that's the problem, how does the read schema get FILTERED? I thought the read schema should always be the same as the file schema (i.e., Profile.getClassSchema()), right?

Thanks,
Yan

On Thu, Dec 4, 2014 at 10:34 AM, Ryan Blue <[email protected]> wrote:

> On 12/04/2014 10:28 AM, Yan Qi wrote:
>
>> Hi rb,
>>
>> Thanks for your quick reply!
>>
>> I first set the read schema,
>> AvroParquetInputFormat.setAvroReadSchema(job, Profile.getClassSchema());
>>
>> Then I define a request schema which is a subset of
>> Profile.getClassSchema() and set the projection:
>> AvroParquetInputFormat.setRequestedProjection(job, requestSchema);
>>
>> Is there any problem with this? Or is there anything else I missed?
>>
>> Thanks,
>> Yan
>>
>
> How many fields are there in the iR record in your read schema? I think
> the problem is that you're getting defaults for the columns you're removing
> with the projected schema. So if you have 13 columns in the read schema but
> you're only loading 3 of them from the file, then you're defaulting 10
> columns and that might cause a slow-down.
>
> An easy way to check is to see how many columns in iR your read schema has
> and how many you're actually loading from the file. Then try filtering your
> read schema as well so you don't have as many and see if that helps
> performance.
>
>
> rb
>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>
