On 12/09/2014 01:08 PM, Yan Qi wrote:
> Hi Ryan,
>
> You're right. I did some benchmarking and found that the function, fillInDefaults() took over 70% of time cost. I am wondering if it is possible to simply assign a NULL as the default for the column not showing in the read schema. For example,
>
>     private void fillInDefaults() {
>       for (Map.Entry<Schema.Field, Object> entry : recordDefaults.entrySet()) {
>         Schema.Field f = entry.getKey();
>         // replace following with model.deepCopy once AVRO-1455 is being used
>         Object defaultValue = null;
>         this.currentRecord.put(f.pos(), defaultValue);
>       }
>     }
>
> In the application, the default value would be corrected if the column is accessed. What do you think?
>
> Thanks,
> Yan

Great work! Thanks for letting us know that it was fillInDefaults. We'll have to figure out what to do about it now. :)
I don't think it's a good idea to use null because it breaks the contract set by Avro. Avro guarantees that reading with a requested schema will produce the correct data for all fields in that schema, or the default value if there is no data.
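To make that contract concrete, here is a minimal sketch in plain Java. It is not the Avro API: the class DefaultFillSketch, the resolve method, and the field names are all hypothetical, and schemas are modeled as simple maps. It only shows the rule that every field in the read schema ends up with either the stored value or the declared default, never an unexplained null.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for schema resolution; this is not the Avro API.
final class DefaultFillSketch {

    // readDefaults: field name -> default value declared in the read schema.
    // fileValues:   field name -> value actually stored in the file.
    static Map<String, Object> resolve(Map<String, Object> readDefaults,
                                       Map<String, Object> fileValues) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readDefaults.entrySet()) {
            String name = field.getKey();
            if (fileValues.containsKey(name)) {
                // The file has data for this field, so the stored value wins.
                record.put(name, fileValues.get(name));
            } else {
                // No data in the file: fall back to the declared default.
                record.put(name, field.getValue());
            }
        }
        return record;
    }

    public static void main(String[] args) {
        Map<String, Object> readDefaults = new LinkedHashMap<>();
        readDefaults.put("id", 0L);
        readDefaults.put("retries", 3);

        Map<String, Object> fileValues = new LinkedHashMap<>();
        fileValues.put("id", 42L); // "retries" was never written to the file

        System.out.println(resolve(readDefaults, fileValues));
    }
}
```

Putting null in place of the default in the else branch would hand callers a value that the read schema never promised, which is the breakage I'm worried about.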
We're not really doing the "right" thing now by setting default values because the value set on your object will probably differ from the real value in the file. I think we should actually fix that so we always read the values that your read schema expects. That means dropping the separately supplied projection schema and deriving the projection from the file schema and the read schema instead. While using null might be a way to get around the current problem, we would still not be correctly implementing the assumptions of the data model.
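Roughly, deriving the projection could look like the sketch below. This is only an illustration of the idea under simplifying assumptions: schemas are reduced to flat lists of field names, and ProjectionSketch, deriveProjection, and fieldsNeedingDefaults are hypothetical names, not existing code. Real schemas are nested and typed, so an actual implementation would have more to do.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; schemas are simplified to flat lists of field names.
final class ProjectionSketch {

    // Columns to read from the file: every read-schema field the file contains.
    static Set<String> deriveProjection(List<String> fileFields,
                                        List<String> readFields) {
        Set<String> projection = new LinkedHashSet<>(readFields);
        projection.retainAll(fileFields);
        return projection;
    }

    // Read-schema fields absent from the file; only these need a default.
    static Set<String> fieldsNeedingDefaults(List<String> fileFields,
                                             List<String> readFields) {
        Set<String> missing = new LinkedHashSet<>(readFields);
        missing.removeAll(fileFields);
        return missing;
    }
}
```

With the projection derived this way, a default is only ever applied to a field the file genuinely lacks, so it can never mask a real stored value.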
rb

--
Ryan Blue
Software Engineer
Cloudera, Inc.
