Hi guys, 

As you may have seen, the topic of the PTable#collectValues method came up 
today in the user mailing list. I hadn't been aware of this method before, and 
when I took a closer look I saw that it just creates a Collection of values 
based on the incoming Iterable, without doing any kind of a deep copy of the 
contents of the Iterable. As far as I can see, something similar (i.e. holding 
on to values from an Iterable from a reducer) is also done in the Join methods.

As Christian also pointed out (and added to the documentation for DoFn), this 
can be an issue, as values made available as an Iterable in a reducer are 
re-used within Hadoop.
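
To make the re-use concrete, here is a minimal plain-Java sketch of the effect. It does not use Hadoop or Crunch at all: MutableValue and reusingIterable are hypothetical stand-ins for a Writable and for the Iterable handed to a reducer, and the only assumption is that one instance gets refilled on every call to next().

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReuseDemo {
    // Hypothetical mutable value, standing in for a Writable.
    static final class MutableValue {
        String content;
        @Override public String toString() { return content; }
    }

    // Simulates a reducer-side Iterable: one shared instance is refilled on each next().
    static Iterable<MutableValue> reusingIterable(final String... contents) {
        final MutableValue shared = new MutableValue();
        return new Iterable<MutableValue>() {
            @Override public Iterator<MutableValue> iterator() {
                return new Iterator<MutableValue>() {
                    private int index = 0;
                    @Override public boolean hasNext() { return index < contents.length; }
                    @Override public MutableValue next() {
                        shared.content = contents[index++];
                        return shared; // same object every time
                    }
                    @Override public void remove() { throw new UnsupportedOperationException(); }
                };
            }
        };
    }

    // Holds on to the references without copying, then reads them back afterwards.
    static List<String> collectNaively(String... contents) {
        List<MutableValue> held = new ArrayList<MutableValue>();
        for (MutableValue value : reusingIterable(contents)) {
            held.add(value);
        }
        List<String> seen = new ArrayList<String>();
        for (MutableValue value : held) {
            seen.add(value.toString());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(collectNaively("a", "b", "c")); // prints [c, c, c]
    }
}
```

Every element of the collected list is the same object, so all of them show the last value that was read.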

This object re-use isn't a problem in Crunch wherever a non-identity mapping is 
used between the serialization type and the PCollection type within the PType 
(for example, with primitives and String). However, using Writable types or 
non-mapped Avro types won't work (as shown in the attached test case).

I think it's definitely a problem that PTable#collectValues (and probably some 
other methods) doesn't work for Writables, or in a broader sense, that the 
semantics can change for the Iterable that is passed in when processing a 
grouped table.

One really easy (but also inefficient) way we could solve this would be to not 
use an IdentityFn as the default mapping function in Writables and AvroType, 
and instead use a MapFn that does a deep copy of the object (i.e. by 
serializing and deserializing itself in memory). This is of course a pretty big 
overhead for something that isn't necessary in a lot of cases.
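
As a rough sketch of what such an in-memory deep copy could look like, assuming a type with Writable-style write/readFields methods: SimpleWritable below is a hypothetical stand-in with the same two methods as org.apache.hadoop.io.Writable (so the example runs without Hadoop), and deepCopy and NameValue are illustration-only names, not existing Crunch code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class DeepCopyDemo {
    // Hypothetical stand-in mirroring the two methods of Hadoop's Writable.
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical value type used only for this example.
    static final class NameValue implements SimpleWritable {
        String name = "";
        @Override public void write(DataOutput out) throws IOException { out.writeUTF(name); }
        @Override public void readFields(DataInput in) throws IOException { name = in.readUTF(); }
    }

    // Serializes source into a byte buffer and reads the bytes back into target.
    static <T extends SimpleWritable> T deepCopy(T source, T target) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            source.write(out);
            out.flush();
            target.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        NameValue original = new NameValue();
        original.name = "first";
        NameValue copy = deepCopy(original, new NameValue());
        original.name = "second"; // simulates the framework refilling the instance
        System.out.println(copy.name); // prints first
    }
}
```

The round trip through a byte buffer on every single value is where the overhead comes from.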

Another option I was considering was to do something like making the input and 
output PTypes of a DoFn available to the DoFn, and adding a createDetachedValue 
method (or something similar) to PType, which would then serialize and 
deserialize objects in order to make a clone if necessary. With this approach, 
createDetachedValue would have to be called within the collectValues method (or 
any other method that holds on to values outside of the iterator).
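
Here is a rough plain-Java sketch of how this could fit together. Nothing in it is existing Crunch API: DetachableType is a hypothetical slice of PType containing only the proposed method, the collectValues shown is a simplified stand-in that takes the type as a parameter, and a StringBuilder copy stands in for the serialize/deserialize round trip.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

public class DetachDemo {
    // Hypothetical slice of PType: only the proposed method.
    interface DetachableType<T> {
        T createDetachedValue(T value);
    }

    // Immutable values need no copy, so the value itself is returned.
    static final DetachableType<String> STRINGS = new DetachableType<String>() {
        @Override public String createDetachedValue(String value) { return value; }
    };

    // Mutable values are cloned; the copy stands in for a serialization round trip.
    static final DetachableType<StringBuilder> BUILDERS = new DetachableType<StringBuilder>() {
        @Override public StringBuilder createDetachedValue(StringBuilder value) {
            return new StringBuilder(value.toString());
        }
    };

    // Simplified collectValues: detaches each value before holding on to it.
    static <T> Collection<T> collectValues(Iterable<T> values, DetachableType<T> type) {
        List<T> collected = new ArrayList<T>();
        for (T value : values) {
            collected.add(type.createDetachedValue(value));
        }
        return collected;
    }

    // Simulates a reducer-side Iterable that refills one shared StringBuilder.
    static Iterable<StringBuilder> reusing(final List<String> contents) {
        final StringBuilder shared = new StringBuilder();
        return new Iterable<StringBuilder>() {
            @Override public Iterator<StringBuilder> iterator() {
                return new Iterator<StringBuilder>() {
                    private int index = 0;
                    @Override public boolean hasNext() { return index < contents.size(); }
                    @Override public StringBuilder next() {
                        shared.setLength(0);
                        shared.append(contents.get(index++));
                        return shared;
                    }
                    @Override public void remove() { throw new UnsupportedOperationException(); }
                };
            }
        };
    }

    static List<String> collectAsStrings(String... contents) {
        List<String> result = new ArrayList<String>();
        for (StringBuilder sb : collectValues(reusing(Arrays.asList(contents)), BUILDERS)) {
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(collectAsStrings("a", "b", "c")); // prints [a, b, c]
    }
}
```

The cost of the copy is only paid where values are actually retained, and a type that maps to immutable objects can return the value unchanged.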

I prefer the second approach, as it avoids the waste of extra 
cloning/serialization while still making it possible to get detached values out 
of an Iterable.

Does anyone else have any thoughts on this?

- Gabriel 
