[
https://issues.apache.org/jira/browse/PIG-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975232#action_12975232
]
Gerrit Jansen van Vuuren commented on PIG-1717:
-----------------------------------------------
Hi,
The problem I'm facing is the following:
A load function can tell pig the schema through LoadMetadata.
But the default way in pig to define a schema is inside the script -- a =
LOAD 'input' as (schema)
In this case the load function has no way of knowing the schema and cannot
validate or modify it, e.g. to add partition keys.
The HiveRC classes already implement LoadMetadata and get around
the above problem by having the user specify the schema in the load function
constructor, e.g. a = LOAD 'input' using HiveColumnarLoader('schema'); no as
clause should be used here.
The loader can then tell pig what the partition keys are because:
(1) The loader now returns a Schema
(2) and as a result of this pig calls the getPartitionKeys method on the
LoadMetadata
The problem with this approach is that the user always has to define the schema
in the constructor and cannot use the as clause. The problem is not when the
LoadFunction returns a schema (e.g. from howl or other) but rather when the
user specifies a schema in the as clause which feels more normal in pig.
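
To make the behaviour described above concrete, here is a minimal sketch in
plain Java. It does not use the real Pig classes: MetadataSketch is a
hypothetical stand-in for LoadMetadata, schemas are reduced to lists of field
names, and partitionKeysSeenByPig is an invented name for whatever pig does
internally. It only illustrates that the partition keys are requested when the
loader returns a schema and skipped when it returns null.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins only; these are NOT the org.apache.pig interfaces.
final class PartitionKeySketch {

    /** Simplified stand-in for the LoadMetadata contract discussed above. */
    public interface MetadataSketch {
        List<String> getSchema();        // null means "the loader has no schema"
        List<String> getPartitionKeys();
    }

    /** Mimics the reported behaviour: keys are only requested when a schema exists. */
    public static List<String> partitionKeysSeenByPig(MetadataSketch loader) {
        if (loader.getSchema() == null) {
            // getPartitionKeys is never called in this branch.
            return Collections.emptyList();
        }
        return loader.getPartitionKeys();
    }

    /** Builds a loader that always knows its partition key but may lack a schema. */
    public static MetadataSketch loader(final List<String> schema) {
        return new MetadataSketch() {
            public List<String> getSchema() { return schema; }
            public List<String> getPartitionKeys() { return Arrays.asList("daydate"); }
        };
    }

    public static void main(String[] args) {
        // Schema passed in the constructor: pig learns about the partition key.
        System.out.println(partitionKeysSeenByPig(loader(Arrays.asList("type", "hits"))));
        // Schema only in the as clause: the loader returns null and the key is lost.
        System.out.println(partitionKeysSeenByPig(loader(null)));
    }
}
```

The second call is the case this issue is about: the loader knows the key
(daydate in the example path from the issue description) but pig never asks.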
There are three possible solutions that I see for this:
(A) Create a new interface or extend the LoadMetadata with a method that pig
calls to inform the Load function of the as clause schema.
e.g. setUserDefinedSchema(ResourceSchema schema)
Pros: it's clear in the code
Cons: breaks backwards compatibility
(B) Have pig store the as clause schema in a variable in the UDFContext/PigContext
e.g. context.set('userDefinedSchema', asClauseSchema:ResourceSchemaSerialised)
Pros: does not break backwards compatibility, any load function can
check this variable before returning from getSchema
Cons: needs good docs or else nobody will know of this :)
(C) Have pig always call getPartitionKeys on LoadMetadata even if the schema
from the load function is null, and have pig always add the partition keys to
the schema.
Pros: does not break backwards compatibility.
Cons: The LoadFunction still has no way of knowing the user defined as
clause schema. More complicated than option (B)
I would go for option B as this gives the greatest flexibility without
requiring great code changes.
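
As a rough illustration of option B, the sketch below uses a plain
java.util.Properties object as a stand-in for the UDFContext/PigContext. Only
the key name 'userDefinedSchema' comes from the proposal above; the class name,
the method names, the comma-separated "name:type" serialisation and the
chararray type given to partition keys are all assumptions made for the
example, not existing pig behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch of option B; none of this is existing pig code.
final class AsClauseSchemaSketch {

    // Key name taken from the proposal; the serialisation format is assumed.
    static final String KEY = "userDefinedSchema";

    /** What pig would do: publish the as clause schema before asking the loader. */
    public static void publishAsClauseSchema(Properties context, String serialisedSchema) {
        context.setProperty(KEY, serialisedSchema);
    }

    /** What a loader's getSchema could do: read that schema and append its partition keys. */
    public static List<String> getSchema(Properties context, List<String> partitionKeys) {
        String serialised = context.getProperty(KEY);
        if (serialised == null) {
            return null; // no as clause was given, so behave as before
        }
        List<String> fields = new ArrayList<String>();
        for (String field : serialised.split(",")) {
            String trimmed = field.trim();
            if (!trimmed.isEmpty()) {
                fields.add(trimmed);
            }
        }
        for (String key : partitionKeys) {
            String column = key + ":chararray"; // assumed type for partition keys
            if (!fields.contains(column)) {
                fields.add(column);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Properties context = new Properties();
        publishAsClauseSchema(context, "type:chararray, hits:int");
        List<String> keys = new ArrayList<String>();
        keys.add("daydate");
        System.out.println(getSchema(context, keys));
    }
}
```

Because the loader now returns a non-null schema, pig would go on to call
getPartitionKeys as it already does today, and loaders that never look at the
property keep working unchanged.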
> pig needs to call setPartitionFilter if schema is null but getPartitionKeys
> is not
> ----------------------------------------------------------------------------------
>
> Key: PIG-1717
> URL: https://issues.apache.org/jira/browse/PIG-1717
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Gerrit Jansen van Vuuren
> Priority: Minor
>
> I'm writing a loader that works with hive style partitioning e.g.
> /logs/type1/daydate=2010-11-01
> The loader does not know the schema upfront and this is something that the
> user adds in the script using the AS clause.
> The problem is that this user defined schema is not available to the loader,
> so the loader cannot return any schema, the Loader does know what the
> partition keys are and pig needs in some way to know about these partition
> keys.
> Currently if the schema is null pig never calls the
> LoadMetaData:getPartitionKeys method or the setPartitionFilter method.