[ 
https://issues.apache.org/jira/browse/CRUNCH-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453655#comment-13453655
 ] 

Kiyan Ahmadizadeh commented on CRUNCH-58:
-----------------------------------------

I included these methods mostly because the backing PCollection exposes them. 
All three seemed like useful things to expose to the client (although this is 
debatable, and I could be convinced to remove some or all of them).  

I didn't want to expose a getter for the backing PCollection for a couple of 
reasons:

1. I wanted the PObject interface to be agnostic regarding what actually backed 
the implementation. Including a method in the interface that returned the 
backing PCollection would make this impossible.  The importance of this in the 
context of Crunch is debatable, since a PCollection is the mechanism through 
which all distributed computation has to happen.  Still, a PObject acts as a 
lazy Future, and later we might want to use that concept more generally.  

2. It felt like it would hurt the PObject abstraction by exposing 
implementation details.  It provides a means for the client to initiate further 
distributed computation on the data backing the PObject, which encourages bad 
practice with PObjects.  PObjects should be used for values small enough to fit 
into memory so they can be worked with locally or shipped around with do 
functions to act as side data for jobs.  I think hiding the underlying 
PCollection enforces this.  
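
To make the two points above concrete, here is a minimal sketch in Java of 
what I mean by a backing-agnostic interface.  It is only an illustration: 
PObjectSketch, CollectionBackedPObject, and count are hypothetical names that 
are not taken from the attached patch, and a plain Supplier stands in for 
materializing a real PCollection.  The client only ever sees getValue(), so 
nothing about the backing collection leaks through the interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the CRUNCH-58 patch.
final class PObjectSketch {

    /** Client-facing view: a lazily computed singleton, with no getter for what backs it. */
    interface PObject<T> {
        T getValue();
    }

    /** One possible implementation: a deferred collection plus a local reducer. */
    static final class CollectionBackedPObject<S, T> implements PObject<T> {
        // Assumption: a Supplier stands in for materializing a PCollection.
        private final Supplier<Iterable<S>> source;
        private final Function<Iterable<S>, T> reducer;
        private boolean computed = false;
        private T value;

        CollectionBackedPObject(Supplier<Iterable<S>> source,
                                Function<Iterable<S>, T> reducer) {
            this.source = source;
            this.reducer = reducer;
        }

        @Override
        public synchronized T getValue() {
            // The work is deferred until the first call, then the result is cached.
            if (!computed) {
                value = reducer.apply(source.get());
                computed = true;
            }
            return value;
        }
    }

    /** Example reducer: counts the elements of the backing collection. */
    static long count(Iterable<?> items) {
        long n = 0;
        for (Object ignored : items) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        PObject<Long> length = new CollectionBackedPObject<String, Long>(
            () -> {
                log.add("materialized");
                return Arrays.asList("a", "b", "c");
            },
            PObjectSketch::count);

        System.out.println("materializations before getValue: " + log.size());
        System.out.println("length: " + length.getValue());
        System.out.println("length again: " + length.getValue());
        System.out.println("materializations after two calls: " + log.size());
    }
}
```

Because the interface has no method returning the backing collection, a 
different implementation (for example one that wraps an already-computed 
value) could be swapped in without the client noticing.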

Thoughts?  
                
> Implement PObject in Crunch/Scrunch
> -----------------------------------
>
>                 Key: CRUNCH-58
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-58
>             Project: Crunch
>          Issue Type: New Feature
>    Affects Versions: 0.3.0
>            Reporter: Kiyan Ahmadizadeh
>            Assignee: Kiyan Ahmadizadeh
>         Attachments: CRUNCH-58.patch
>
>
> FlumeJava has the concept of a PObject<T>, a container for a singleton of 
> type T.  It is meant to represent the result of a distributed computation that 
> yields a singleton value (for example max, min, and length methods on 
> PCollection<T>).  Generally speaking, the result of any computation that 
> combines/reduces a PCollection into a singleton value could be represented by 
> a PObject.  
> Like PCollection, a PObject defers distributed computation until its value is 
> actually used.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
