[ https://issues.apache.org/jira/browse/PIG-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lorand Bendig updated PIG-3911:
-------------------------------

    Description: 
Based on PIG-2361, I took the liberty of extending {{@OutputSchema}} so that more flexible output schemas can be defined through annotations. As a result, the repeating patterns of {{EvalFunc#outputSchema()}} can be eliminated from most of the UDFs.

Examples:

{code}
@OutputSchema("bytearray")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
}
{code}

{code}
@OutputSchema("chararray")
@Unique
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema(
        getSchemaName(this.getClass().getName().toLowerCase(), input),
        DataType.CHARARRAY));
}
{code}

{code}
@OutputSchema(value = "dimensions:bag", useInputSchema = true)
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new FieldSchema("dimensions", input, DataType.BAG));
}
{code}

{code}
@OutputSchema(value = "${0}:bag", useInputSchema = true)
@Unique("${0}")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema(
        getSchemaName(this.getClass().getName().toLowerCase(), input),
        input, DataType.BAG));
}
{code}

If the {{useInputSchema}} attribute is set, the input schema is applied to the output schema, provided that:
* the output schema is "simple", i.e. \[name\]\[:type\] or '()', '{}', '[]', and
* it has a complex field type (tuple, bag, map)

@Unique : this annotation defines which fields should be unique in the schema
* if no parameters are provided, all fields will be unique
* otherwise it takes a string array of field names

Unique field generation:
A unique field name is generated in the same manner that {{EvalFunc#getSchemaName}} does.
* if field has an alias:
** it's a placeholder ($\{i\}, i=0..n) : fieldName -> com_myfunc_\[input_alias\]\_\[nextSchemaId\]
** otherwise: fieldName -> fieldName\_\[nextSchemaId\]
* otherwise: com\_myfunc\_\[input_alias\]\_\[nextSchemaId\]

Supported scripting UDFs: Python, Jython, Groovy, JRuby

  was:
Based on PIG-2361, I took the liberty of extending {{@OutputSchema}} so that more flexible output schemas can be defined through annotations. As a result, the repeating patterns of {{EvalFunc#outputSchema()}} can be eliminated from most of the UDFs.

Examples:

{code}
@OutputSchema("bytearray")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
}
{code}

{code}
@OutputSchema("chararray")
@Unique
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema(
        getSchemaName(this.getClass().getName().toLowerCase(), input),
        DataType.CHARARRAY));
}
{code}

{code}
@OutputSchema(value = "dimensions:bag", useInputSchema = true)
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new FieldSchema("dimensions", input, DataType.BAG));
}
{code}

{code}
@OutputSchema(value = "${0}:bag", useInputSchema = true)
@Unique("${0}")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema(
        getSchemaName(this.getClass().getName().toLowerCase(), input),
        input, DataType.BAG));
}
{code}

If the useInputSchema attribute is set, the input schema is applied to the output schema, provided that:
* the output schema is "simple", i.e. [name][:type] or '()', '{}', '[]', and
* it has a complex field type (tuple, bag, map)

*@Unique* : this annotation defines which fields should be unique in the schema
* if no parameters are provided, all fields will be unique
* otherwise it takes a string array of field names

Unique field generation:
A
unique field is generated in the same manner that EvalFunc#getSchemaName does.
- if field has an alias:
  - it's a placeholder (${i}, i=0..n) : fieldName -> com_myfunc_[input_alias]_[nextSchemaId]
  - otherwise: fieldName -> fieldName_[nextSchemaId]
- otherwise: com_myfunc_[input_alias]_[nextSchemaId]

Scripting UDFs:
The following scripting languages have been extended to use the above modifications: Python, Jython, Groovy, JRuby

---

The patch incorporates PIG-2361 and contains the following test cases:

Modified piggybank UDFs:
{{contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/TestEvalOutputAnnotation.java}}

Various output schema definitions:
{{/trunk/test/org/apache/pig/test/TestEvalFuncOutputAnnotation.java}}

Modified builtin UDFs:
{{test/org/apache/pig/test/TestBuiltinOutputAnnotation.java}}

Scripting UDFs:
{{test/org/apache/pig/test/TestPythonUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestJythonUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestGroovyUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestJRubyUDFOutputAnnotation.java}}


> Define unique fields with @OutputSchema
> ---------------------------------------
>
>                 Key: PIG-3911
>                 URL: https://issues.apache.org/jira/browse/PIG-3911
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Lorand Bendig
>            Assignee: Lorand Bendig
>
> Based on PIG-2361, I took the liberty of extending {{@OutputSchema}} so that
> more flexible output schemas can be defined through annotations. As a result,
> the repeating patterns of {{EvalFunc#outputSchema()}} can be eliminated from
> most of the UDFs.
> Examples:
> {code}
> @OutputSchema("bytearray")
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>     return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
> }
> {code}
> {code}
> @OutputSchema("chararray")
> @Unique
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>     return new Schema(new Schema.FieldSchema(
>         getSchemaName(this.getClass().getName().toLowerCase(), input),
>         DataType.CHARARRAY));
> }
> {code}
> {code}
> @OutputSchema(value = "dimensions:bag", useInputSchema = true)
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>     return new Schema(new FieldSchema("dimensions", input, DataType.BAG));
> }
> {code}
> {code}
> @OutputSchema(value = "${0}:bag", useInputSchema = true)
> @Unique("${0}")
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>     return new Schema(new Schema.FieldSchema(
>         getSchemaName(this.getClass().getName().toLowerCase(), input),
>         input, DataType.BAG));
> }
> {code}
> If the {{useInputSchema}} attribute is set, the input schema is applied to
> the output schema, provided that:
> * the output schema is "simple", i.e. \[name\]\[:type\] or '()', '{}', '[]', and
> * it has a complex field type (tuple, bag, map)
> @Unique : this annotation defines which fields should be unique in the schema
> * if no parameters are provided, all fields will be unique
> * otherwise it takes a string array of field names
> Unique field generation:
> A unique field name is generated in the same manner that
> {{EvalFunc#getSchemaName}} does.
> * if field has an alias:
> ** it's a placeholder ($\{i\}, i=0..n) : fieldName -> com_myfunc_\[input_alias\]\_\[nextSchemaId\]
> ** otherwise: fieldName -> fieldName\_\[nextSchemaId\]
> * otherwise: com\_myfunc\_\[input_alias\]\_\[nextSchemaId\]
> Supported scripting UDFs: Python, Jython, Groovy, JRuby



--
This message was sent by Atlassian JIRA
(v6.2#6252)
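
The unique field naming rules listed in the description above can be sketched in plain Java. This is only an illustration under stated assumptions, not the patch's code: {{UniqueFieldNameSketch}}, {{uniqueName}} and its parameters are hypothetical names, the schema id is passed in by the caller instead of coming from Pig, and turning the class name into {{com_myfunc}} (lowercasing, dots replaced by underscores) is inferred from the pattern shown in the description.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical, standalone sketch; it does not use or reproduce any Pig class.
public class UniqueFieldNameSketch {

    // Matches a positional placeholder alias such as ${0} or ${12}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{\\d+\\}");

    /**
     * Builds a unique field name following the rules in the description.
     *
     * @param alias        field alias from the annotation, or null if none
     * @param className    fully qualified UDF class name, e.g. "com.MyFunc"
     * @param inputAlias   first alias of the input schema, or null if none
     * @param nextSchemaId counter value supplied by the caller (assumption)
     */
    public static String uniqueName(String alias, String className,
                                    String inputAlias, int nextSchemaId) {
        // Assumption: "com.MyFunc" is rendered as "com_myfunc".
        String funcPrefix = className.toLowerCase(Locale.ROOT).replace('.', '_');

        boolean noAlias = (alias == null);
        boolean isPlaceholder = !noAlias && PLACEHOLDER.matcher(alias).matches();

        if (noAlias || isPlaceholder) {
            // No alias, or a ${i} placeholder: derive the name from the function.
            StringBuilder name = new StringBuilder(funcPrefix).append('_');
            if (inputAlias != null) {
                name.append(inputAlias).append('_');
            }
            return name.append(nextSchemaId).toString();
        }
        // A regular alias is kept and only gets the id appended.
        return alias + "_" + nextSchemaId;
    }

    public static void main(String[] args) {
        System.out.println(uniqueName("${0}", "com.MyFunc", "a", 1));       // com_myfunc_a_1
        System.out.println(uniqueName("dimensions", "com.MyFunc", "a", 2)); // dimensions_2
        System.out.println(uniqueName(null, "com.MyFunc", "a", 3));         // com_myfunc_a_3
    }
}
```

The placeholder and no-alias cases share one branch because the description maps both to the same pattern; only a regular alias keeps its own name as the prefix.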