[ 
https://issues.apache.org/jira/browse/PIG-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980025#action_12980025
 ] 

Gerrit Jansen van Vuuren commented on PIG-1717:
-----------------------------------------------

Hi,

The Loader looks at the path for key=value patterns and uses the left-hand side 
as the partition key. The AllLoader, for example, scans for the first available 
(non-hidden) file and moves upwards through its path hierarchy looking for 
key=value pairs. This enables it to build the partition keys dynamically 
without needing to scan the whole directory tree.
It assumes that the partition scheme is the same for a group of log files. This 
is reasonable to assume (for me at least), and I think any hive partition 
scheme is uniform per table.
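
The upward walk described above can be sketched roughly as follows. This is 
an illustrative Python sketch, not the AllLoader code itself (which is Java); 
the function names are hypothetical, and treating names that start with "_" 
or "." as hidden is an assumption based on the usual Hadoop convention.

```python
from pathlib import PurePosixPath


def first_visible_file(paths):
    """Return the first path whose file name is not hidden, or None.

    Assumption: a file is hidden when its name starts with "_" or ".".
    """
    for path in paths:
        name = PurePosixPath(path).name
        if name and not name.startswith(("_", ".")):
            return path
    return None


def partition_keys_from_path(file_path):
    """Collect partition keys from the directories above file_path.

    Each parent directory named key=value contributes its left-hand side.
    Keys are returned outermost directory first.
    """
    keys = []
    # .parents yields the closest directory first, so we walk upwards.
    for directory in PurePosixPath(file_path).parents:
        key, sep, value = directory.name.partition("=")
        if sep and key and value:
            keys.append(key)
    keys.reverse()
    return keys
```

Only one file path is inspected, which is why the whole directory tree does 
not need to be scanned; the result is only valid under the uniform partition 
scheme assumption.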

There are indeed three problems here: 
 (1) The Loader needs to communicate the partition keys to pig so that pig can 
do LoadFunc path filtering (more efficient than row-by-row filtering on 
partitions).
 (2) The Loader needs some way (for adhoc style queries) to know the script 
schema.
 (3) Even if the Loader can communicate the partition keys to pig, if these 
are not available in the schema itself there is no variable name that the user 
can use to filter on this key.
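
To illustrate why path filtering in (1) is cheaper than row-by-row filtering, 
here is a minimal Python sketch that prunes whole partition directories before 
any rows are read. The function names and the dict-based filter are 
hypothetical and are not how pig passes a partition filter to a loader.

```python
def partition_values(path):
    """Map each key=value path segment to its value, e.g. {"daydate": "2010-11-01"}."""
    values = {}
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def prune_paths(paths, wanted):
    """Keep only the paths whose partition values match every entry in wanted."""
    return [
        path
        for path in paths
        if all(partition_values(path).get(k) == v for k, v in wanted.items())
    ]
```

With a filter such as {"daydate": "2010-11-01"}, every directory for another 
day is skipped without opening a single file in it.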

For adhoc style queries, having the script schema available to the LoadFunc 
solves all three problems: the Loader can dynamically add the partition keys 
to the schema (solving 2 and 3) and return a Schema to pig (solving 1).
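
As a rough sketch of that idea, assuming the script schema reaches the loader 
as a list of (name, type) pairs, the loader could append the discovered 
partition keys like this. The function name, the tuple representation and the 
choice of "chararray" for partition columns are all assumptions made for 
illustration; the real Pig Schema classes are Java and look different.

```python
def schema_with_partition_keys(script_schema, partition_keys):
    """Return the script schema with any missing partition keys appended.

    script_schema: list of (field_name, type_name) pairs from the AS clause.
    partition_keys: key names discovered from the path.
    Assumption: partition columns are exposed as "chararray".
    """
    fields = list(script_schema)
    declared = {name for name, _ in fields}
    for key in partition_keys:
        if key not in declared:
            fields.append((key, "chararray"))
            declared.add(key)
    return fields
```

The returned schema then contains a name for each partition key, so the user 
can refer to it in a filter, and pig has a non-null schema to work with.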




> pig needs to call setPartitionFilter if schema is null but getPartitionKeys 
> is not
> ----------------------------------------------------------------------------------
>
>                 Key: PIG-1717
>                 URL: https://issues.apache.org/jira/browse/PIG-1717
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.9.0
>            Reporter: Gerrit Jansen van Vuuren
>            Assignee: Gerrit Jansen van Vuuren
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: PIG-1717.patch
>
>
> I'm writing a loader that works with hive style partitioning e.g. 
> /logs/type1/daydate=2010-11-01
> The loader does not know the schema upfront and this is something that the 
> user adds in the script using the AS clause.
> The problem is that this user defined schema is not available to the loader, 
> so the loader cannot return any schema, the Loader does know what the 
> partition keys are and pig needs in some way to know about these partition 
> keys. 
> Currently if the schema is null pig never calls the 
> LoadMetaData:getPartitionKeys method or the setPartitionFilter method.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
