[ 
https://issues.apache.org/jira/browse/PIG-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975232#action_12975232
 ] 

Gerrit Jansen van Vuuren commented on PIG-1717:
-----------------------------------------------

Hi,

The problem I'm facing is the following:
  A load function can tell pig the schema through LoadMetadata.
  But the default way in pig to define a schema is inside the script:
  a = LOAD 'input' AS (schema)
  In this case the load function has no way of knowing the schema and cannot
  validate or modify it, e.g. to add partition keys.

The HiveRC classes already implement LoadMetadata and get around the above
problem by having the user specify the schema in the load function
constructor, e.g. a = LOAD 'input' USING HiveColumnarLoader('schema'). No AS
clause should be used here.
The loader can then tell pig what the partition keys are because:
  (1) The loader now returns a Schema
  (2) and as a result of this pig calls the getPartitionKeys method on the 
LoadMetadata
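
A minimal sketch of the call order described above, using simplified stand-in
types rather than Pig's real classes. SketchLoadMetadata and
partitionKeysSeenByPig are hypothetical names, and a schema is reduced to a
list of field names purely for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PartitionKeyLookupSketch {
    // Mirrors the behaviour described in the comment: partition keys are
    // only requested when the loader has returned a schema.
    static List<String> partitionKeysSeenByPig(SketchLoadMetadata loader) {
        if (loader.getSchema() == null) {
            return new ArrayList<>(); // getPartitionKeys is never called
        }
        List<String> keys = loader.getPartitionKeys();
        return keys == null ? new ArrayList<>() : keys;
    }

    // Builds a stand-in loader that returns fixed values.
    static SketchLoadMetadata loader(final List<String> schema,
                                     final List<String> keys) {
        return new SketchLoadMetadata() {
            @Override public List<String> getSchema() { return schema; }
            @Override public List<String> getPartitionKeys() { return keys; }
        };
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("daydate");
        // Loader without a schema: the keys stay invisible.
        System.out.println(partitionKeysSeenByPig(loader(null, keys)));
        // Loader with a schema: the keys are picked up.
        System.out.println(partitionKeysSeenByPig(
                loader(Arrays.asList("uid", "daydate"), keys)));
    }
}

// Simplified stand-in for Pig's LoadMetadata (assumption: the real interface
// has more methods and different signatures).
interface SketchLoadMetadata {
    List<String> getSchema();        // null means "schema unknown"
    List<String> getPartitionKeys();
}
```

The first call models a loader used with an AS clause: it returns no schema, so
its partition keys never reach pig.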

The problem with this approach is that the user always has to define the schema
in the constructor and cannot use the AS clause. The problem does not arise when
the LoadFunction returns a schema itself (e.g. from howl or another metadata
source), but rather when the user specifies a schema in the AS clause, which
feels more normal in pig.

There are three possible solutions that I see for this:
 (A) Create a new interface or extend LoadMetadata with a method that pig
calls to inform the load function of the AS clause schema,
         e.g. setUserDefinedSchema(ResourceSchema schema)
         Pros: it's clear in the code
         Cons: breaks backwards compatibility
 
 (B) Have pig set a variable in the UDFContext/PigContext holding the AS clause
schema, e.g. context.set('userDefinedSchema', asClauseSchema:ResourceSchemaSerialised)
       Pros: does not break backwards compatibility; any load function can
check this variable before returning from getSchema
       Cons: needs good docs or else nobody will know of this :)

 (C) Have pig always call getPartitionKeys on LoadMetadata, even if the schema
returned by the load function is null, and have pig always add the partition
keys to the schema.
       Pros: does not break backwards compatibility.
       Cons: the LoadFunction still has no way of knowing the user-defined AS
clause schema. More complicated than option (B).
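
For illustration, option (C) could look roughly like the sketch below. It is
not Pig code: effectiveSchema is a hypothetical helper, schemas are reduced to
lists of field names, and the fallback to the AS clause schema when the loader
returns null is an assumption about how pig would choose the base schema.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AlwaysAddPartitionKeysSketch {
    // Option (C): partition keys are requested regardless of the loader
    // schema and appended to whichever schema is in effect.
    static List<String> effectiveSchema(List<String> loaderSchema,
                                        List<String> asClauseSchema,
                                        List<String> partitionKeys) {
        // Assumption: the AS clause schema is used when the loader has none.
        List<String> base = loaderSchema != null ? loaderSchema : asClauseSchema;
        List<String> result = new ArrayList<>();
        if (base != null) {
            result.addAll(base);
        }
        if (partitionKeys != null) {
            for (String key : partitionKeys) {
                if (!result.contains(key)) {
                    result.add(key); // avoid duplicating a declared key
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Loader returns no schema; the user declared (uid, cnt) in AS.
        System.out.println(effectiveSchema(
                null, Arrays.asList("uid", "cnt"), Arrays.asList("daydate")));
    }
}
```

Note that the merge happens entirely on the pig side here, which is why the
loader still never sees the AS clause schema under this option.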


I would go for option (B), as it gives the greatest flexibility without
requiring large code changes.
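
A rough sketch of how option (B) might work from the loader's point of view.
Everything here is an assumption for illustration: java.util.Properties stands
in for the UDFContext/PigContext, the schema is serialised as a comma-separated
list of name:type pairs, and publishAsClauseSchema is a hypothetical hook that
pig itself would have to provide. Only the key name 'userDefinedSchema' comes
from the proposal above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

final class UserDefinedSchemaSketch {
    // Key name taken from the proposal in option (B).
    static final String KEY = "userDefinedSchema";

    // Pig side (hypothetical): publish the AS clause schema before the
    // loader's getSchema is consulted.
    static void publishAsClauseSchema(Properties context,
                                      List<String> asClauseFields) {
        context.setProperty(KEY, String.join(",", asClauseFields));
    }

    // Loader side: check the context before returning from getSchema.
    // Returns null when no AS clause schema was published.
    static List<String> getSchema(Properties context,
                                  List<String> partitionKeys) {
        String serialised = context.getProperty(KEY);
        if (serialised == null || serialised.isEmpty()) {
            return null;
        }
        List<String> fields =
                new ArrayList<>(Arrays.asList(serialised.split(",")));
        for (String key : partitionKeys) {
            if (!fields.contains(key)) {
                fields.add(key); // the loader appends its partition keys
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Properties context = new Properties();
        publishAsClauseSchema(context,
                Arrays.asList("uid:chararray", "cnt:int"));
        System.out.println(
                getSchema(context, Arrays.asList("daydate:chararray")));
    }
}
```

Because the loader now returns a non-null schema, pig would go on to ask it for
partition keys without any change to the existing interfaces.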



> pig needs to call setPartitionFilter if schema is null but getPartitionKeys 
> is not
> ----------------------------------------------------------------------------------
>
>                 Key: PIG-1717
>                 URL: https://issues.apache.org/jira/browse/PIG-1717
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Gerrit Jansen van Vuuren
>            Priority: Minor
>
> I'm writing a loader that works with hive style partitioning e.g. 
> /logs/type1/daydate=2010-11-01
> The loader does not know the schema upfront and this is something that the 
> user adds in the script using the AS clause.
> The problem is that this user defined schema is not available to the loader, 
> so the loader cannot return any schema, the Loader does know what the 
> partition keys are and pig needs in some way to know about these partition 
> keys. 
> Currently if the schema is null pig never calls the 
> LoadMetaData:getPartitionKeys method or the setPartitionFilter method.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
