Gabriel,

Generally agree with your line of thought-- where is the attached test case?

J

On Thu, Jun 28, 2012 at 2:11 PM, Gabriel Reid <[email protected]> wrote:
> Hi guys,
>
> As you may have seen, the topic of the PTable#collectValues method came up 
> today in the user mailing list. I hadn't been aware of this method before, 
> and when I took a closer look I saw that it just creates a Collection of 
> values based on the incoming Iterable, without doing any kind of a deep copy 
> of the contents of the Iterable. As far as I can see, something similar (i.e. 
> holding on to values from an Iterable from a reducer) is also done in the 
> Join methods.
>
> As Christian also pointed out (and added to the documentation for DoFn), this 
> can be an issue, as values made available as an Iterable in a reducer are 
> re-used within Hadoop.
>
> This object re-use isn't a problem in Crunch wherever a non-identity mapping 
> is used between the serialization type and the PCollection type within the 
> PType (for example, with primitives and String). However, using Writable 
> types or non-mapped Avro types won't work (as shown in the attached test 
> case).
>
> I think it's definitely a problem that PTable#collectValues (and probably 
> some other methods) doesn't work for Writables, or in a broader sense, that 
> the semantics can change for the Iterable that is passed in when processing a 
> grouped table.
>
> One really easy (but also inefficient) way we could solve this would be to 
> not use an IdentityFn as the default mapping function in Writables and 
> AvroType, and instead use a MapFn that does a deep copy of the object (i.e. 
> by serializing and deserializing itself in memory). This is of course a 
> pretty big overhead for a something that isn't necessary in a lot of cases.
>
> Another option I was considering was to do something like making the input 
> and output PTypes of a DoFn available to the DoFn, and adding a 
> createDetachedValue method (or something similar) to PType, which would then 
> serialize and deserialize objects in order to make a clone if necessary. With 
> this approach, the clone method would have to be called within the 
> collectValues method (or any other method that is holding on to values 
> outside of the iterator).
>
> I prefer the second approach, as it avoids the the waste of extra 
> cloning/serialization while still making it possible to get detached values 
> out of an Iterable.
>
> Does anyone else have any thoughts on this?
>
> - Gabriel
>



-- 
Director of Data Science
Cloudera
Twitter: @josh_wills

Reply via email to