[ https://issues.apache.org/jira/browse/PIG-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lorand Bendig updated PIG-3911:
-------------------------------
    Description:
Based on PIG-2361, I took the liberty of extending {{@OutputSchema}} so that more flexible output schemas can be defined through annotations. As a result, the repeating patterns of {{EvalFunc#outputSchema()}} can be eliminated from most of the UDFs.

Examples:

{code}
@OutputSchema("bytearray")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
}
{code}

{code}
@OutputSchema("chararray")
@Unique
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema(
        getSchemaName(this.getClass().getName().toLowerCase(), input),
        DataType.CHARARRAY));
}
{code}

{code}
@OutputSchema(value = "dimensions:bag", useInputSchema = true)
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema("dimensions", input, DataType.BAG));
}
{code}

{code}
@OutputSchema(value = "${0}:bag", useInputSchema = true)
@Unique("${0}")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema(
        getSchemaName(this.getClass().getName().toLowerCase(), input),
        input, DataType.BAG));
}
{code}

If the {{useInputSchema}} attribute is set, the input schema is applied to the output schema, provided that:
* the output schema is "simple", i.e. [name][:type] or '()', '{}', '[]', and
* it has a complex field type (tuple, bag, map)

*@Unique*: this annotation defines which fields should be unique in the schema
* if no parameters are provided, all fields will be unique
* otherwise it takes a string array of field names

Unique field generation:
A unique field name is generated in the same manner as in {{EvalFunc#getSchemaName}}.
- if the field has an alias:
  - if it is a placeholder (${i}, i=0..n): fieldName -> com_myfunc_[input_alias]_[nextSchemaId]
  - otherwise: fieldName -> fieldName_[nextSchemaId]
- otherwise: com_myfunc_[input_alias]_[nextSchemaId]

Scripting UDFs:
The following scripting languages have been extended to use the above modifications: Python, Jython, Groovy, JRuby

---
The patch incorporates PIG-2361 and contains the following test cases:
Modified piggybank UDFs:
{{contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/TestEvalOutputAnnotation.java}}
Various output schema definitions:
{{/trunk/test/org/apache/pig/test/TestEvalFuncOutputAnnotation.java}}
Modified builtin UDFs:
{{test/org/apache/pig/test/TestBuiltinOutputAnnotation.java}}
Scripting UDFs:
{{test/org/apache/pig/test/TestPythonUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestJythonUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestGroovyUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestJRubyUDFOutputAnnotation.java}}
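
For illustration, the unique field generation rules listed in this description can be sketched in plain Java without any Pig dependency. This is only a sketch under assumptions, not the code from the patch: the class name {{UniqueFieldNameSketch}}, the method {{uniqueName}}, the counter starting at 0, and dropping the [input_alias] segment when the input has no alias are all hypothetical choices made for this example.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch of the naming rules; not part of Pig or the patch.
class UniqueFieldNameSketch {
    // Matches placeholder aliases such as ${0}, ${1}, ...
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{\\d+\\}");

    // Stand-in for the schema id counter; starting at 0 is an assumption.
    private int nextSchemaId = 0;

    /**
     * @param fieldAlias alias declared in the annotation, or null if there is none
     * @param funcPrefix function-derived prefix, e.g. "com_myfunc"
     * @param inputAlias alias taken from the input schema, or null if there is none
     */
    String uniqueName(String fieldAlias, String funcPrefix, String inputAlias) {
        int id = nextSchemaId++;
        boolean hasAlias = fieldAlias != null && !fieldAlias.isEmpty();
        if (hasAlias && !PLACEHOLDER.matcher(fieldAlias).matches()) {
            // Named field: keep the alias and append the id.
            return fieldAlias + "_" + id;
        }
        // No alias, or a ${i} placeholder: build the name from function and input alias.
        StringBuilder name = new StringBuilder(funcPrefix).append('_');
        if (inputAlias != null && !inputAlias.isEmpty()) {
            name.append(inputAlias).append('_');
        }
        return name.append(id).toString();
    }

    public static void main(String[] args) {
        UniqueFieldNameSketch gen = new UniqueFieldNameSketch();
        System.out.println(gen.uniqueName("${0}", "com_myfunc", "a"));       // com_myfunc_a_0
        System.out.println(gen.uniqueName("dimensions", "com_myfunc", "a")); // dimensions_1
        System.out.println(gen.uniqueName(null, "com_myfunc", "a"));         // com_myfunc_a_2
    }
}
```

A single counter is shared by all three branches here, so every generated name in one run gets a different suffix; whether the patch shares the counter in the same way is not stated in this issue.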

> Define unique fields with @OutputSchema
> ---------------------------------------
>
>                 Key: PIG-3911
>                 URL: https://issues.apache.org/jira/browse/PIG-3911
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Lorand Bendig
>            Assignee: Lorand Bendig

--
This message was sent by Atlassian JIRA
(v6.2#6252)