[jira] Updated: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

Pradeep Kamath (JIRA) Fri, 16 Oct 2009 11:32:55 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Pradeep Kamath updated PIG-953:
-------------------------------

    Attachment: PIG-953-6.patch

Dmitriy - by default, when the application does not set an OutputCommitter, 
Hadoop uses FileOutputCommitter. Since Pig does not currently set an 
OutputCommitter (in trunk code), Hadoop is using FileOutputCommitter. Hence I 
derived from FileOutputCommitter so that the current cleanup continues to 
happen and we also do the extra commit needed by Zebra.
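
To illustrate the idea of deriving from the default committer, here is a 
minimal sketch. It does not use the Hadoop classes: DefaultCommitterStandIn 
is a hypothetical stand-in for FileOutputCommitter, CommittableStoreStandIn 
is a hypothetical stand-in for a store function that needs a final commit, 
and the method names, the String job id, and the order of the two steps are 
assumptions made for the example, not what the patch does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Hadoop's FileOutputCommitter; only a job-level
// cleanup hook is modelled here.
class DefaultCommitterStandIn {
    protected final List<String> events = new ArrayList<>();

    void cleanupJob(String jobId) {
        // The real default committer would remove temporary job output here.
        events.add("cleanup:" + jobId);
    }

    List<String> events() {
        return events;
    }
}

// Hypothetical stand-in for a store function that needs an extra commit step.
interface CommittableStoreStandIn {
    void commit(String jobId);
}

// Derives from the default committer so the inherited cleanup still runs,
// then performs the additional commit (the ordering is an assumption).
class StoreAwareCommitterSketch extends DefaultCommitterStandIn {
    private final CommittableStoreStandIn store;

    StoreAwareCommitterSketch(CommittableStoreStandIn store) {
        this.store = store;
    }

    @Override
    void cleanupJob(String jobId) {
        super.cleanupJob(jobId);        // keep the existing cleanup behaviour
        store.commit(jobId);            // extra commit needed by the store
        events.add("commit:" + jobId);  // recorded only to make the order visible
    }
}

class CommitterSketchDemo {
    public static void main(String[] args) {
        StoreAwareCommitterSketch committer =
                new StoreAwareCommitterSketch(jobId -> { /* no-op store */ });
        committer.cleanupJob("job_1");
        System.out.println(committer.events());
    }
}
```

The point of subclassing rather than writing a committer from scratch is 
that the inherited cleanup cannot be forgotten: the override only adds a 
step around the call to super.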

The new load-store redesign already has an allFinished() method in StoreFunc 
which is the same as this commit, except that it does not take a 
Configuration. I have modified it to take a Configuration parameter.

It turns out Zebra needs the job configuration in order to open the right side 
file during merge join. Hence, in the attached patch, I am introducing an 
initialize(Configuration conf) method into the IndexableLoadFunc interface. 
The Pig runtime can call it, allowing Zebra to store this configuration for 
use in opening the right side file later.
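
A rough sketch of that initialize-then-use pattern follows. Only the 
initialize(...) method name comes from the patch description; everything 
else is assumed for illustration. A Map<String, String> stands in for 
Hadoop's Configuration, openRightSide(...) is a hypothetical method, the 
class names are made up, and the "fs.default.name" key is used only as a 
sample entry.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the interface discussed above. Map<String, String>
// replaces Hadoop's Configuration; openRightSide is hypothetical.
interface IndexableLoaderSketch {
    void initialize(Map<String, String> conf);

    String openRightSide(String path);
}

// A loader that keeps the configuration handed to it by the runtime so that
// it can be consulted later, when the right side input is actually opened.
class ConfigRetainingLoaderSketch implements IndexableLoaderSketch {
    private Map<String, String> conf;

    @Override
    public void initialize(Map<String, String> conf) {
        // Defensive copy so later changes by the caller do not leak in.
        this.conf = new HashMap<>(conf);
    }

    @Override
    public String openRightSide(String path) {
        if (conf == null) {
            throw new IllegalStateException("initialize(conf) must be called first");
        }
        // A real loader would open a file system handle here; the sketch only
        // shows that the stored configuration is available at this point.
        String fs = conf.getOrDefault("fs.default.name", "file:///");
        return "open " + path + " on " + fs;
    }
}

class LoaderSketchDemo {
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("fs.default.name", "hdfs://nn:8020");

        ConfigRetainingLoaderSketch loader = new ConfigRetainingLoaderSketch();
        loader.initialize(conf);                      // called by the runtime
        System.out.println(loader.openRightSide("/data/right"));
    }
}
```

Failing fast when initialize(...) has not been called makes the ordering 
requirement explicit: the runtime must hand over the configuration before 
the loader is asked to touch the right side input.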

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-953
>                 URL: https://issues.apache.org/jira/browse/PIG-953
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: PIG-953-2.patch, PIG-953-3.patch, PIG-953-4.patch, 
> PIG-953-5.patch, PIG-953-6.patch, PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
