[ 
https://issues.apache.org/jira/browse/PIG-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976993#action_12976993
 ] 

Alan Gates commented on PIG-1717:
---------------------------------

We do not see the AS clause in LOAD as the preferred way to specify schemas in 
Pig Latin.  In some cases it is, such as ad hoc queries over one-time data 
sets.  But for regularly processed data we assume that the loader will supply 
the schema via the LoadMetadata interface.  This is how Howl (the metadata 
project that will work with Pig, Hive, and Map Reduce) will interact with Pig: 
it will expect the user not to give a schema in the AS clause.

The LoadMetadata interface was designed on the assumption that the user is 
either in the ad hoc research world (throw this data on a grid and run a few 
quick Pig Latin scripts against it) or in the pipeline world (run the same 
pipeline every day to process your data), and that the needs of the two are 
clearly distinct.  But your use case seems to straddle the two: you want the 
ad hoc schema but the known partitioning.  I'm guessing your reasoning is that 
you want to read Hive data without connecting to its metastore, correct?
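
To make the straddling case concrete, here is a minimal sketch of the behaviour 
the issue description reports.  It is not Pig code: SketchLoader, 
HiveStyleLoader, and the two pushFilter methods are hypothetical stand-ins, and 
the filter is a plain string rather than Pig's expression type.  It only 
illustrates a loader that knows its partition keys but not its schema, and how 
a null schema keeps the filter from ever reaching it.

```java
// Hypothetical, simplified stand-ins; these are NOT Pig's real classes.
class PartitionKeySketch {

    /** Simplified stand-in for the metadata side of a loader. */
    interface SketchLoader {
        String[] getSchema();          // null when the loader does not know the schema
        String[] getPartitionKeys();   // null when the data is not partitioned
        void setPartitionFilter(String filter);
    }

    /** Loader for paths like /logs/type1/daydate=2010-11-01: knows keys, not the schema. */
    static class HiveStyleLoader implements SketchLoader {
        String receivedFilter; // stays null unless a filter is pushed down

        public String[] getSchema() { return null; }
        public String[] getPartitionKeys() { return new String[] {"daydate"}; }
        public void setPartitionFilter(String filter) { receivedFilter = filter; }
    }

    /** Reported behaviour: a null schema means partition keys are never requested. */
    static boolean pushFilterCurrent(SketchLoader loader, String filter) {
        if (loader.getSchema() == null) {
            return false;
        }
        return pushFilterIfPartitioned(loader, filter);
    }

    /** Requested behaviour: consult the partition keys regardless of the schema. */
    static boolean pushFilterRequested(SketchLoader loader, String filter) {
        return pushFilterIfPartitioned(loader, filter);
    }

    private static boolean pushFilterIfPartitioned(SketchLoader loader, String filter) {
        String[] keys = loader.getPartitionKeys();
        if (keys == null || keys.length == 0) {
            return false; // nothing to filter on
        }
        loader.setPartitionFilter(filter);
        return true;
    }

    public static void main(String[] args) {
        String filter = "daydate == '2010-11-01'";
        System.out.println("current:   " + pushFilterCurrent(new HiveStyleLoader(), filter));
        System.out.println("requested: " + pushFilterRequested(new HiveStyleLoader(), filter));
    }
}
```

The sketch deliberately leaves out where the partition keys would then appear 
in the schema, since that is the part the options under discussion disagree on.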

I don't like option B because I don't like hidden environment variables that 
are really part of the interface.  I agree option A is difficult because it is 
not backward compatible.  And it is hard to justify breaking compatibility for 
something that at the moment looks like a corner case.  Option C is definitely 
out since we cannot have Pig adding keys to schemas when it has no idea whether 
it should or not.  Before I vote for either A or B I want to make sure I 
understand your use case and that you really need this.


> pig needs to call setPartitionFilter if schema is null but getPartitionKeys 
> is not
> ----------------------------------------------------------------------------------
>
>                 Key: PIG-1717
>                 URL: https://issues.apache.org/jira/browse/PIG-1717
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.9.0
>            Reporter: Gerrit Jansen van Vuuren
>            Assignee: Gerrit Jansen van Vuuren
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: PIG-1717.patch
>
>
> I'm writing a loader that works with Hive-style partitioning, e.g. 
> /logs/type1/daydate=2010-11-01
> The loader does not know the schema up front; the user adds it in the 
> script using the AS clause.
> The problem is that this user-defined schema is not available to the loader, 
> so the loader cannot return any schema.  The loader does know what the 
> partition keys are, and Pig needs some way to learn about these partition 
> keys. 
> Currently, if the schema is null, Pig never calls the 
> LoadMetadata.getPartitionKeys method or the setPartitionFilter method.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
