[ 
https://issues.apache.org/jira/browse/PIG-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888991#comment-15888991
 ] 

Nandor Kollar commented on PIG-5115:
------------------------------------

Does this mean, that the schema of the generated Avro files will change for 
every Avro file? I'm a bit worried, because users might already have Pig scrips 
producing Avro files with a specific schema (as far as I understood, names in 
Avro schema are the same as names in Pig schema - expect replacing :: to __ and 
rejecting field names not matching Avro name specification), it might cause 
serious headache for them when they upgrade to newer Pig version, and they see 
that the schema has completely changed. [~knoguchi] what do you think, is this 
a real concern?

> Builtin AvroStorage generates incorrect avro schema when the same pig field 
> name appears in the alias
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-5115
>                 URL: https://issues.apache.org/jira/browse/PIG-5115
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.17.0
>            Reporter: Anyi Li
>            Assignee: Anyi Li
>             Fix For: 0.17.0
>
>         Attachments: PIG-5115.patch
>
>
> Pig ResourceSchema allows to use same field names but different types when 
> they are not in the same level. The pig schema like
> {quote}
> data: {col1: (col2: (col1_data: chararray)),col2: {col2: (col2_data: 
> chararray)}}
> {quote}
> Although _col2_ has been redefined, they are not appeared in the same level, 
> it is a totally valid pig schema. 
> However, once it is translated by AvroStorage, it will throw exception 
> {noformat}
> Can't redefine: col2
>         at 
> org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
>         at 
> org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
>         at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
>         at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
>         at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
>         at 
> org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
>         at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
>         at 
> org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
>         at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
>         at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
>         at org.apache.pig.PigServer.execute(PigServer.java:1356)
>         at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
>         at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
>         at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
>         at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
>         at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
>         at org.apache.pig.Main.run(Main.java:631)
>         at org.apache.pig.Main.main(Main.java:177)
> Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
>         at org.apache.avro.Schema$Names.put(Schema.java:1042)
>         at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
>         at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
>         at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
>         at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
>         at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
>         at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
>         at org.apache.avro.Schema.toString(Schema.java:297)
>         at org.apache.avro.Schema.toString(Schema.java:287)
>         at 
> org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
>         at 
> org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
>         at 
> org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
>         ... 18 more
> {noformat}
> It is caused by a bug in AvroStorageSchemaConversionUtilities class which 
> uses tuple name as GenericRecord name as well as the fieldname that wraps the 
> record. 
> So it would like to  produces the avro schema like the following 
> {noformat}
> {
>   "type": "record",
>   "name": "data",
>   "fields": [
>     {
>       "name": "col1",
>       "type": {
>         "type": "record",
>         "name": "col1_1",
>         "fields": [
>           {
>             "name": "col2",
>             "type": {
>               "type": "record",
>               "name": "col2",
>               "fields": [
>                 {
>                   "name": "col1_data",
>                   "type": "string"
>                 }
>               ]
>             }
>           }
>         ]
>       }
>     },
>     {
>       "name": "col2",
>       "type": {
>         "type": "array",
>         "items": {
>           "type": "record",
>           "name": "col2",
>           "fields": [
>             {
>               "name": "col2_data",
>               "type": "string"
>             }
>           ]
>         }
>       }
>     }
>   ]
> }
> {noformat}
> But according to the avro 1.7.7  specs 
> ([https://avro.apache.org/docs/1.7.7/spec.html#Names]), _col2_ has been 
> defined as record and redefined as array later, it is an invalid schema, 
> unless the fullname (namespace + name) is unique. 
> Since AvroStorageSchemaConversionUtilities will generate avro record if the 
> pig schema is a tuple, we need a way to generate unique _recordName_. 
> {code}
> public static Schema resourceSchemaToAvroSchema(final ResourceSchema rs,
>       String recordName, final String recordNameSpace,
>       final Map<String, List<Schema>> definedRecordNames,
>       final Boolean doubleColonsToDoubleUnderscores) throws IOException {
>     if (rs == null) {
>       return null;
>     }
>     recordName = toAvroName(recordName, doubleColonsToDoubleUnderscores);
>     List<Schema.Field> fields = new ArrayList<Schema.Field>();
>     Schema newSchema = Schema.createRecord(
>             recordName, null, recordNameSpace, false);
> {code}
> The AvroStorage class from piggybank solved this problem by defining a static 
> method and generate unique _recordName_. We can implement the similar method 
> for the builtin AvroStorage 
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to