[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data

Alan Gates (JIRA) Thu, 01 Oct 2009 09:31:54 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761257#action_12761257
 ]


Alan Gates commented on PIG-984:
--------------------------------

The controlling philosophic point here is that pigs are domestic animals (see 
http://wiki.apache.org/pig/PigPhilosophy).  Just as in join, where we have 
exposed all possible join implementations to the user, we want to do the same 
with this new feature.  At some future point when we have a capable optimizer, 
we will try to select the best type of join, and try to select this form of 
grouping when it's appropriate.  But even then, we want to expose this 
functionality to the user directly because the optimizer may not have access to 
the necessary information to determine the best grouping choice (e.g., data 
sources with no schema).  And we don't want to wait until the optimizer can 
handle these things to start exposing it.  

I don't agree with Santosh's assertion that the language is evolving with no 
definition.  I agree we do not yet have a comprehensive definition of Pig 
Latin, which we need.  But this is in line with what we've done for joins, 
philosophically, semantically, and syntactically.

> PERFORMANCE: Implement a map-side group operator to speed up processing of 
> ordered data 
> ----------------------------------------------------------------------------------------
>
>                 Key: PIG-984
>                 URL: https://issues.apache.org/jira/browse/PIG-984
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Richard Ding
>
> The general group by operation in Pig needs both mappers and reducers (the 
> aggregation is done in reducers). This incurs disk writes/reads between 
> mappers and reducers.
> However, in the cases where the input data has the following properties
>    1. The records with the same key are grouped together (for example, the 
> data is sorted by key).
>    2. The records with the same key are in the same mapper input.
> the group by operation can be performed in the mappers only and thus remove 
> the overhead of disk writes/reads.
> Alan proposed adding a hint to the group by clause like this one:
> {code}
> A = load 'input' using SomeLoader(...);
> B = group A by $0 using "mapside";
> C = foreach B generate ...
> {code}
> The proposed addition of using "mapside" to group will be a mapside group 
> operator that collects all records for a given key into a buffer. When it 
> sees a key change it will emit the key and bag for records it had buffered. 
> It will assume that all records for a given key are collected together and 
> thus there is no need to buffer across keys. 
> It is expected that "SomeLoader" will be implemented by data systems such as 
> Zebra to ensure the data emitted by the loader satisfies the above properties 
> (1) and (2).
> It will be the responsibility of the user (or the loader) to guarantee these 
> properties (1) & (2) before invoking the mapside hint for the group by 
> clause. The Pig runtime can't check for the errors in the input data.
> For the group by clauses with mapside hint, Pig Latin will only support group 
> by columns (including *), not group by expressions nor group all. 
>   
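
To make the buffering behaviour in the quoted description concrete, here is a 
minimal sketch in Python rather than in Pig's own code.  It is not the actual 
implementation: the function name mapside_group and the key argument are 
hypothetical, and records are assumed to be plain tuples.  It only illustrates 
"buffer until the key changes, then emit the key and bag".

```python
from operator import itemgetter


def mapside_group(records, key=itemgetter(0)):
    """Yield (key, bag) pairs from a stream of records.

    Assumes property (1) from the issue: records with the same key are
    contiguous. Nothing here verifies that assumption.
    """
    bag = []            # records buffered for the current key
    current_key = None
    seen_any = False    # distinguishes "no key yet" from a key of None
    for record in records:
        k = key(record)
        if seen_any and k != current_key:
            # Key change: emit the finished group and start a new buffer.
            yield current_key, bag
            bag = []
        current_key = k
        seen_any = True
        bag.append(record)
    if seen_any:
        # Emit the final group once the input is exhausted.
        yield current_key, bag


grouped = list(mapside_group([("a", 1), ("a", 2), ("b", 3)]))
ungrouped = list(mapside_group([("a", 1), ("b", 2), ("a", 3)]))
```

With the first input each key is emitted once.  With the second, the key "a" 
is emitted twice, once per contiguous run, and no error is raised.  That is 
the reason the guarantee has to come from the user or the loader: a streaming 
operator of this shape has no way to notice that a key has reappeared.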

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
