[
https://issues.apache.org/jira/browse/PIG-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lorand Bendig updated PIG-3911:
-------------------------------
Description:
Based on PIG-2361, I took the liberty of extending {{@OutputSchema}} so that
more flexible output schemas can be defined through annotations. As a result,
the repeated {{EvalFunc#outputSchema()}} boilerplate can be eliminated from
most UDFs.
Examples:
{code}
@OutputSchema("bytearray")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
}
{code}
{code}
@OutputSchema("chararray")
@Unique
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema(
            getSchemaName(this.getClass().getName().toLowerCase(), input),
            DataType.CHARARRAY));
}
{code}
{code}
@OutputSchema(value = "dimensions:bag", useInputSchema = true)
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema("dimensions", input, DataType.BAG));
}
{code}
{code}
@OutputSchema(value = "${0}:bag", useInputSchema = true)
@Unique("${0}")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema(
            getSchemaName(this.getClass().getName().toLowerCase(), input),
            input, DataType.BAG));
}
{code}
If the {{useInputSchema}} attribute is set, the input schema is applied to the
output schema, provided that:
* the output schema is "simple", i.e. \[name\]\[:type\] or '()', '{}', '[]', and
* it has a complex field type (tuple, bag, map)
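The two conditions above can be sketched as a standalone check. This is only an illustration, not the code from the patch: the class {{InputSchemaRule}}, the method {{appliesInputSchema}} and the regular expression are hypothetical, and the sketch assumes that a single token such as "bag" is read as a type, as in the {{@OutputSchema("bytearray")}} example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: decides whether the input schema
// would be applied to an annotated output schema string.
class InputSchemaRule {
    // Empty complex literals: tuple, bag, map.
    private static final List<String> EMPTY_COMPLEX = Arrays.asList("()", "{}", "[]");
    private static final List<String> COMPLEX_TYPES = Arrays.asList("tuple", "bag", "map");
    // "Simple" form: optional name (plain word or ${i} placeholder), optional ":type".
    private static final Pattern SIMPLE =
            Pattern.compile("(\\$\\{\\d+\\}|\\w+)?(?::(\\w+))?");

    static boolean appliesInputSchema(String spec, boolean useInputSchema) {
        if (!useInputSchema || spec == null) {
            return false;
        }
        String s = spec.trim();
        if (EMPTY_COMPLEX.contains(s)) {
            return true;
        }
        Matcher m = SIMPLE.matcher(s);
        if (!m.matches()) {
            return false; // nested or multi-field schema: not "simple"
        }
        // Assumption: a lone token is treated as the type.
        String type = m.group(2) != null ? m.group(2) : m.group(1);
        return type != null && COMPLEX_TYPES.contains(type.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(appliesInputSchema("dimensions:bag", true)); // true
        System.out.println(appliesInputSchema("${0}:bag", true));       // true
        System.out.println(appliesInputSchema("chararray", true));      // false
        System.out.println(appliesInputSchema("dimensions:bag", false)); // false
    }
}
```

Under these assumptions, a schema string that already spells out its inner fields (for example a bag with a typed tuple) is not "simple", so the input schema would not be applied to it.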
@Unique : this annotation defines which fields should be unique in the schema
* if no parameters are provided, all fields will be unique
* otherwise it takes a string array of field names
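A minimal sketch of how such an annotation could be declared and read via reflection. The declaration here is an assumption made for illustration (the actual {{@Unique}} in the patch may differ in retention, target, or attribute names), and {{UniqueAnnotationSketch}}, {{isUnique}} and the sample classes are hypothetical.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

class UniqueAnnotationSketch {
    // Assumed shape: an empty array means "all fields are unique".
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Unique {
        String[] value() default {};
    }

    @Unique
    static class AllUnique {}

    @Unique({"${0}", "total"})
    static class SomeUnique {}

    static class NotAnnotated {}

    // Returns true when the given field should receive a unique alias.
    static boolean isUnique(Class<?> udf, String field) {
        Unique u = udf.getAnnotation(Unique.class);
        if (u == null) {
            return false;               // no annotation: nothing is made unique
        }
        if (u.value().length == 0) {
            return true;                // no parameters: every field is unique
        }
        return Arrays.asList(u.value()).contains(field);
    }

    public static void main(String[] args) {
        System.out.println(isUnique(AllUnique.class, "anything")); // true
        System.out.println(isUnique(SomeUnique.class, "total"));   // true
        System.out.println(isUnique(SomeUnique.class, "other"));   // false
    }
}
```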
Unique field generation:
A unique field name is generated the same way {{EvalFunc#getSchemaName}}
generates one.
* if the field has an alias:
** if the alias is a placeholder ($\{i\}, i=0..n): fieldName ->
com_myfunc_\[input_alias\]\_\[nextSchemaId\]
** otherwise: fieldName -> fieldName\_\[nextSchemaId\]
* otherwise: com\_myfunc\_\[input_alias\]\_\[nextSchemaId\]
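The naming rules above can be sketched without any Pig dependency. This is an illustration under stated assumptions, not the patch's implementation: {{UniqueFieldNames}} and {{uniqueName}} are hypothetical, the schema id is passed in rather than taken from Pig's counter, and replacing dots in the class name with underscores is assumed from the com_myfunc pattern shown above.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the unique-name rules described in the issue.
class UniqueFieldNames {
    // Placeholder aliases look like ${0}, ${1}, ...
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{\\d+\\}");

    static String uniqueName(String alias, String udfClassName,
                             String inputAlias, int nextSchemaId) {
        boolean generated = alias == null
                || alias.isEmpty()
                || PLACEHOLDER.matcher(alias).matches();
        // Generated names start from the lower-cased class name plus the input
        // alias; explicit aliases are kept and only suffixed with the id.
        String base = generated
                ? udfClassName.toLowerCase().replace('.', '_') + "_" + inputAlias
                : alias;
        return base + "_" + nextSchemaId;
    }

    public static void main(String[] args) {
        System.out.println(uniqueName("${0}", "com.MyFunc", "dims", 7));  // com_myfunc_dims_7
        System.out.println(uniqueName("total", "com.MyFunc", "dims", 7)); // total_7
        System.out.println(uniqueName(null, "com.MyFunc", "dims", 7));    // com_myfunc_dims_7
    }
}
```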
Supported scripting UDFs: Python, Jython, Groovy, JRuby
was:
Based on PIG-2361, I took the liberty of extending {{@OutputSchema}} so that
more flexible output schema can be defined through annotations. As a result,
the repeating patterns of {{EvalFunc#outputSchema()}} can be eliminated from
most of the UDFs.
Examples:
{code}
@OutputSchema("bytearray")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
}
{code}
{code}
@OutputSchema("chararray")
@Unique
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
return new Schema(new
Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(),
input), DataType.CHARARRAY));
}
{code}
{code}
@OutputSchema(value = "dimensions:bag", useInputSchema = true)
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
return new Schema(new FieldSchema("dimensions", input, DataType.BAG));
}
{code}
{code}
@OutputSchema(value = "${0}:bag", useInputSchema = true)
@Unique("${0}")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
return new Schema(new
Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(),
input), input, DataType.BAG));
}
{code}
If useInputSchema attribute is set then input schema will be applied to the
output schema, provided that:
* outputschema is "simple", i.e: [name][:type] or '()', '{}', '[]' and
* it has complex field type (tuple, bag, map)
*@Unique* : this annotation defines which fields should be unique in the schema
* if no parameters are provided, all fields will be unique
* otherwise it takes a string array of field names
Unique field generation:
A unique field is generated in the same manner that EvalFunc#getSchemaName does.
- if field has an alias:
- it's a placeholder (${i}, i=0..n) : fieldName ->
com_myfunc_[input_alias]_[nextSchemaId]
- otherwise: fieldName -> fieldName_[nextSchemaId]
- otherwise: com_myfunc_[input_alias]_[nextSchemaId]
Scripting UDFs:
The following scripting languages have been extended to use the above
modifications:
Python, Jython, Groovy, JRuby
---
The patch incorporates PIG-2361 and contains the following test cases:
Modified piggybank UDFs:
{{contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/TestEvalOutputAnnotation.java}}
Various output schema definitions:
{{/trunk/test/org/apache/pig/test/TestEvalFuncOutputAnnotation.java}}
Modified builtin UDFs:
{{test/org/apache/pig/test/TestBuiltinOutputAnnotation.java}}
Scripting UDFs:
{{test/org/apache/pig/test/TestPythonUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestJythonUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestGroovyUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestJRubyUDFOutputAnnotation.java}}
> Define unique fields with @OutputSchema
> ---------------------------------------
>
> Key: PIG-3911
> URL: https://issues.apache.org/jira/browse/PIG-3911
> Project: Pig
> Issue Type: Improvement
> Reporter: Lorand Bendig
> Assignee: Lorand Bendig