FYI -- we wound up going with a much cleaner and more memory-friendly solution: returning a new DataBag implementation that simply proxies all calls to the original bag, but returns a special Iterator that applies the necessary transformation to tuples on the fly. That way, we never hold the whole bag in memory twice, and we avoid the extra spilling that a second copy would cause.
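
A minimal sketch of the proxy idea follows. It does not use Pig's classes: SimpleBag is a hypothetical cut-down interface standing in for DataBag (the real interface has more methods to delegate), tuples are modeled as List<Object>, and the TransformingBag and DropField names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a bag: only the two calls this sketch needs.
interface SimpleBag extends Iterable<List<Object>> {
    long size();
}

// Proxies an existing bag without copying it. Every call is forwarded to
// the delegate; only iterator() differs, handing back an Iterator that
// transforms each tuple as it is read.
final class TransformingBag implements SimpleBag {
    private final SimpleBag delegate;
    private final Function<List<Object>, List<Object>> transform;

    TransformingBag(SimpleBag delegate,
                    Function<List<Object>, List<Object>> transform) {
        this.delegate = delegate;
        this.transform = transform;
    }

    @Override
    public long size() {
        return delegate.size();
    }

    @Override
    public Iterator<List<Object>> iterator() {
        final Iterator<List<Object>> it = delegate.iterator();
        return new Iterator<List<Object>>() {
            @Override
            public boolean hasNext() {
                return it.hasNext();
            }

            @Override
            public List<Object> next() {
                // Only the tuple currently being read is transformed.
                return transform.apply(it.next());
            }
        };
    }
}

// Illustrative transformation: copy a tuple, leaving out one field.
final class DropField {
    static Function<List<Object>, List<Object>> at(final int index) {
        return tuple -> {
            List<Object> out = new ArrayList<>(tuple);
            out.remove(index); // int argument, so this removes by position
            return out;
        };
    }
}
```

Wrapping a bag as new TransformingBag(bag, DropField.at(2)) would turn the tuple (111,222,3,121) into (111,222,121) at iteration time. The sketch allocates one small tuple per next() call and never builds a second bag.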

D

On Wed, Sep 5, 2012 at 7:38 PM, Alan Gates <ga...@hortonworks.com> wrote:
>
> On Sep 5, 2012, at 6:30 PM, Prasanth J wrote:
>
>> Ahh.. Now it makes more sense.
>>
>> I think I got the solution. I was adding to List<Tuple> and then finally
>> creating a DataBag with that list.. Instead I should create a bag and keep
>> adding to it..!! Is that correct?
> Yes.
>
> Alan.
>
>> Thanks Alan.
>>
>> Thanks
>> -- Prasanth
>>
>> On Sep 5, 2012, at 9:24 PM, Alan Gates <ga...@hortonworks.com> wrote:
>>
>>> You cannot modify a bag once it is written. The implementation is written
>>> around the assumption that bags are immutable after they are written.
>>>
>>> Creating a new bag should not create an OOM exception, as bags are built to
>>> spill when they grow too large. In fact it's this spilling feature that
>>> makes in place modification impossible.
>>>
>>> Alan.
>>>
>>> On Sep 5, 2012, at 6:08 PM, Prasanth J wrote:
>>>
>>>> Hello devs
>>>>
>>>> I have a specific case where I need to modify the contents (remove a field
>>>> from each tuple) of a DataBag, but I want to do it in-place and do not want
>>>> to create another DataBag with a new set of tuples.
>>>> The situation is, say I have the following input tuple for a UDF
>>>>
>>>> {(111,222,3,121), (112,223,2,131), (113,224,4,141)}
>>>>
>>>> I want to iterate through this bag and generate an output bag removing the
>>>> 3rd field of each tuple in the bag to get the following output
>>>> {(111,222,121), (112,223,131), (113,224,141)}
>>>>
>>>> Since the number of tuples in this bag is expected to be large I cannot
>>>> create a new set of tuples and create a bag, as this will cause an OOM
>>>> exception.
>>>>
>>>> Also I do not want to flatten this bag as this bag will be passed to the
>>>> DISTINCT operator for computing distinct elements in the bag.
>>>> As seen from the javadocs for DataBag, there is no way to convert a bag on
>>>> the fly. I wonder if there is any other way to solve this?
>>>>
>>>> Thanks
>>>> -- Prasanth
>>>>
>>>
>>
>