[jira] [Updated] (PIG-3911) Define unique fields with @OutputSchema

Lorand Bendig (JIRA) Sat, 24 May 2014 14:09:15 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Lorand Bendig updated PIG-3911:
-------------------------------

    Description: 
Based on PIG-2361, I took the liberty of extending {{@OutputSchema}} so that 
more flexible output schemas can be defined through annotations. As a result, 
the repetitive {{EvalFunc#outputSchema()}} overrides can be eliminated from 
most of the UDFs.
Examples:
{code}
@OutputSchema("bytearray")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
  return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
}
{code}

{code}
@OutputSchema("chararray")
@Unique
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
  return new Schema(new Schema.FieldSchema(
      getSchemaName(this.getClass().getName().toLowerCase(), input),
      DataType.CHARARRAY));
}
{code}
{code}
@OutputSchema(value = "dimensions:bag", useInputSchema = true)
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
  return new Schema(new Schema.FieldSchema("dimensions", input, DataType.BAG));
}
{code}
{code}
@OutputSchema(value = "${0}:bag", useInputSchema = true)
@Unique("${0}")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
  return new Schema(new Schema.FieldSchema(
      getSchemaName(this.getClass().getName().toLowerCase(), input),
      input, DataType.BAG));
}
{code}

If the useInputSchema attribute is set, the input schema is applied to the 
output schema, provided that:
* the output schema is "simple", i.e. [name][:type] or '()', '{}', '[]', and
* it has a complex field type (tuple, bag, map)
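
The two conditions above can be sketched as a small check. This is only an illustration, not code from the patch: the class {{InputSchemaRule}}, the method {{appliesInputSchema}} and the regular expression are hypothetical, and the sketch assumes that a single token without a colon is read as the type (as in {{@OutputSchema("bytearray")}}).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the patch: decides whether the input
// schema would be applied for a given @OutputSchema value.
final class InputSchemaRule {
    private static final Set<String> COMPLEX_TYPES =
            new HashSet<>(Arrays.asList("tuple", "bag", "map"));

    // "Simple" form [name][:type]; the name may be a placeholder such as ${0}.
    private static final Pattern SIMPLE =
            Pattern.compile("(\\$\\{\\d+\\}|\\w+)?(?::(\\w+))?");

    static boolean appliesInputSchema(String spec) {
        String s = spec.trim();
        // Empty tuple, bag and map literals are simple and complex by definition.
        if (s.equals("()") || s.equals("{}") || s.equals("[]")) {
            return true;
        }
        Matcher m = SIMPLE.matcher(s);
        if (!m.matches()) {
            return false; // nested or multi-field schema: not "simple"
        }
        // Assumption: a lone token is the type, e.g. "bytearray" or "bag".
        String type = m.group(2) != null ? m.group(2) : m.group(1);
        return type != null && COMPLEX_TYPES.contains(type.toLowerCase(Locale.ROOT));
    }
}
```

Under these assumptions, "dimensions:bag" and "${0}:bag" qualify, while "chararray" (not a complex type) and a nested definition such as "b:{t:(x:int)}" (not simple) do not.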

*@Unique*: this annotation defines which fields should be unique in the schema
* if no parameters are provided, all fields will be unique
* otherwise it takes a string array of field names
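
As a rough illustration of these two rules, the annotation could be declared and read as follows. The declaration is an assumption (the one in the patch may use a different retention, target or attribute name), and {{UniqueReader}}, {{AllFieldsUnique}} and {{SomeFieldsUnique}} are hypothetical names used only for this sketch.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

// Assumed declaration; the annotation shipped with the patch may differ.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Unique {
    String[] value() default {};
}

// No parameters: every field of the output schema is unique.
@Unique
final class AllFieldsUnique {}

// Explicit list: only the named fields are unique.
@Unique({"${0}", "total"})
final class SomeFieldsUnique {}

final class UniqueReader {
    // Returns true if 'field' has to be made unique for the given class.
    static boolean isUnique(Class<?> udf, String field) {
        Unique unique = udf.getAnnotation(Unique.class);
        if (unique == null) {
            return false; // annotation absent: nothing is made unique
        }
        if (unique.value().length == 0) {
            return true;  // no parameters: all fields are unique
        }
        return Arrays.asList(unique.value()).contains(field);
    }
}
```

With this sketch, {{UniqueReader.isUnique(SomeFieldsUnique.class, "total")}} is true and the same call with "count" is false, whereas any field of {{AllFieldsUnique}} is reported as unique.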

Unique field generation:
A unique field name is generated the same way {{EvalFunc#getSchemaName}} does it:

- if the field has an alias:
  - if the alias is a placeholder (${i}, i=0..n): fieldName -> com_myfunc_[input_alias]_[nextSchemaId]
  - otherwise: fieldName -> fieldName_[nextSchemaId]

- otherwise: com_myfunc_[input_alias]_[nextSchemaId]
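
The naming rules above can be sketched without any Pig classes. Everything here is hypothetical: {{UniqueFieldNames}} and {{generate}} are illustration names, the counter stands in for nextSchemaId and is assumed to start at 0, and the UDF prefix is passed in already formatted (e.g. "com_myfunc"). Omitting the input alias when it is missing is also an assumption.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

// Hypothetical sketch of the unique-name rules; not code from the patch.
final class UniqueFieldNames {
    // Matches placeholders of the form ${0}, ${1}, ...
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{\\d+\\}");

    // Stand-in for nextSchemaId; assumed to start at 0.
    private final AtomicInteger nextSchemaId = new AtomicInteger();

    String generate(String fieldAlias, String udfPrefix, String inputAlias) {
        int id = nextSchemaId.getAndIncrement();
        // A real alias (not a placeholder) is kept and only suffixed with the id.
        if (fieldAlias != null && !PLACEHOLDER.matcher(fieldAlias).matches()) {
            return fieldAlias + "_" + id;
        }
        // Placeholder or no alias: build the name from the UDF and the input alias.
        String input = inputAlias == null ? "" : inputAlias + "_";
        return udfPrefix + "_" + input + id;
    }
}
```

For example, on a fresh instance the calls {{generate("${0}", "com_myfunc", "a")}}, {{generate("dimensions", "com_myfunc", "a")}} and {{generate(null, "com_myfunc", "a")}} yield com_myfunc_a_0, dimensions_1 and com_myfunc_a_2 in that order.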

Scripting UDFs:
The following scripting languages have been extended to use the above 
modifications:
Python, Jython, Groovy, JRuby


---

The patch incorporates PIG-2361 and contains the following test cases:
Modified piggybank UDFs:
{{contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/TestEvalOutputAnnotation.java}}

Various output schema definitions:
{{/trunk/test/org/apache/pig/test/TestEvalFuncOutputAnnotation.java}}

Modified builtin UDFs:
{{test/org/apache/pig/test/TestBuiltinOutputAnnotation.java}}

Scripting UDFs:
{{test/org/apache/pig/test/TestPythonUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestJythonUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestGroovyUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestJRubyUDFOutputAnnotation.java}}

  was:
Based on PIG-2361, I took the liberty of extending {{@Outputschema}} so that 
more flexible output schema can be defined through annotations. As a result, 
the repeating patterns of {{EvalFunc#outputSchema()}} can be eliminated from 
most of the UDFs.
Examples:
{code}
@OutputSchema("bytearray")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
  return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
}
{code}

{code}
@OutputSchema("chararray")
@Unique
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
  return new Schema(new 
Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), 
input), DataType.CHARARRAY));
}
{code}
{code}
@OutputSchema(value = "dimensions:bag", useInputSchema = true)
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
  return new Schema(new FieldSchema("dimensions", input, DataType.BAG));
}
{code}
{code}
@OutputSchema(value = "${0}:bag", useInputSchema = true)
@Unique("${0}")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new 
Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), 
input), input, DataType.BAG));
}
{code}

If useInputSchema attribute is set then input schema will be applied to the 
output schema, provided that:
- outputschema is "simple", i.e: [name][:type]  or '()', '{}', '[]' and
- it has complex field type (tuple, bag, map)

@Unique : this annotation defines which fields should be unique in the schema
- if no parameters are provided, all fields will be unique
- otherwise it takes a string array of fields name

Unique field generation:
A unique field is generated in the same manner that EvalFunc#getSchemaName does.

- if field has an alias:
  - it's a placeholder (${i}, i=0..n) : fieldName -> 
com_myfunc_[input_alias]_[nextSchemaId]
  - otherwise: fieldName -> fieldName_[nextSchemaId]

- otherwise: com_myfunc_[input_alias]_[nextSchemaId]

Scripting UDFs:
The following scripting languages have been extended to use the above 
modifications:
Python, Jython, Groovy, JRuby


---

The patch incorporates PIG-2361, and contains the following testcases:
Modified piggybank UDFs:
{{contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/TestEvalOutputAnnotation.java}}

Various output schema definitions:
{{/trunk/test/org/apache/pig/test/TestEvalFuncOutputAnnotation.java}}

Modified builtin UDFs:
{{test/org/apache/pig/test/TestBuiltinOutputAnnotation.java}}

Scripting UDFs:
test/org/apache/pig/test/TestPythonUDFOutputAnnotation.java}}
test/org/apache/pig/test/TestJythonUDFOutputAnnotation.java}}
test/org/apache/pig/test/TestGroovyUDFOutputAnnotation.java}}
test/org/apache/pig/test/TestJRubyUDFOutputAnnotation.java}}


> Define unique fields with @OutputSchema
> ---------------------------------------
>
>                 Key: PIG-3911
>                 URL: https://issues.apache.org/jira/browse/PIG-3911
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Lorand Bendig
>            Assignee: Lorand Bendig
>
> Based on PIG-2361, I took the liberty of extending {{@OutputSchema}} so that 
> more flexible output schemas can be defined through annotations. As a result, 
> the repetitive {{EvalFunc#outputSchema()}} overrides can be eliminated from 
> most of the UDFs.
> Examples:
> {code}
> @OutputSchema("bytearray")
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>   return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
> }
> {code}
> {code}
> @OutputSchema("chararray")
> @Unique
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>   return new Schema(new 
> Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), 
> input), DataType.CHARARRAY));
> }
> {code}
> {code}
> @OutputSchema(value = "dimensions:bag", useInputSchema = true)
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>   return new Schema(new FieldSchema("dimensions", input, DataType.BAG));
> }
> {code}
> {code}
> @OutputSchema(value = "${0}:bag", useInputSchema = true)
> @Unique("${0}")
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>     return new Schema(new 
> Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), 
> input), input, DataType.BAG));
> }
> {code}
> If the useInputSchema attribute is set, the input schema is applied to the 
> output schema, provided that:
> * the output schema is "simple", i.e. [name][:type] or '()', '{}', '[]', and
> * it has a complex field type (tuple, bag, map)
> *@Unique*: this annotation defines which fields should be unique in the 
> schema
> * if no parameters are provided, all fields will be unique
> * otherwise it takes a string array of field names
> Unique field generation:
> A unique field is generated in the same manner that EvalFunc#getSchemaName 
> does.
> - if field has an alias:
>   - it's a placeholder (${i}, i=0..n) : fieldName -> 
> com_myfunc_[input_alias]_[nextSchemaId]
>   - otherwise: fieldName -> fieldName_[nextSchemaId]
> - otherwise: com_myfunc_[input_alias]_[nextSchemaId]
> Scripting UDFs:
> The following scripting languages have been extended to use the above 
> modifications:
> Python, Jython, Groovy, JRuby
> ---
> The patch incorporates PIG-2361, and contains the following testcases:
> Modified piggybank UDFs:
> {{contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/TestEvalOutputAnnotation.java}}
> Various output schema definitions:
> {{/trunk/test/org/apache/pig/test/TestEvalFuncOutputAnnotation.java}}
> Modified builtin UDFs:
> {{test/org/apache/pig/test/TestBuiltinOutputAnnotation.java}}
> Scripting UDFs:
> {{test/org/apache/pig/test/TestPythonUDFOutputAnnotation.java}}
> {{test/org/apache/pig/test/TestJythonUDFOutputAnnotation.java}}
> {{test/org/apache/pig/test/TestGroovyUDFOutputAnnotation.java}}
> {{test/org/apache/pig/test/TestJRubyUDFOutputAnnotation.java}}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
