FYI -- we wound up going with a much cleaner and more memory-friendly
solution: returning a new DataBag implementation that simply proxies
all calls to the original bag, but returns a special Iterator that
applies the necessary transformation to tuples on the fly. That way,
we don't need to hold the whole thing in memory twice, which would
cause spilling.
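
For anyone who wants the general shape, here is a minimal sketch of the
idea in plain Java. It is not our actual code and it does not use the Pig
classes: a java.util.Collection stands in for DataBag, a List<Object>
stands in for Tuple, and the names TransformingBagView and dropField are
made up for illustration. A real version would implement Pig's DataBag
interface and delegate the remaining methods to the wrapped bag.

```java
import java.util.AbstractCollection;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/**
 * Read-only view over an existing "bag". Hypothetical stand-in for a
 * proxying DataBag: nothing is copied up front, and each element is
 * transformed only when the iterator hands it out.
 */
final class TransformingBagView<T, R> extends AbstractCollection<R> {
    private final Collection<T> original;
    private final Function<T, R> transform;

    TransformingBagView(Collection<T> original, Function<T, R> transform) {
        this.original = original;
        this.transform = transform;
    }

    // Size is proxied straight through to the wrapped bag.
    @Override
    public int size() {
        return original.size();
    }

    // The "special Iterator": wraps the original iterator and applies
    // the transformation to one element at a time.
    @Override
    public Iterator<R> iterator() {
        final Iterator<T> it = original.iterator();
        return new Iterator<R>() {
            @Override
            public boolean hasNext() {
                return it.hasNext();
            }

            @Override
            public R next() {
                return transform.apply(it.next());
            }
        };
    }

    // Example transformation: copy one tuple, leaving out the field at
    // the given zero-based index. Only this single tuple is copied.
    static List<Object> dropField(List<Object> tuple, int index) {
        List<Object> out = new ArrayList<>(Math.max(0, tuple.size() - 1));
        for (int i = 0; i < tuple.size(); i++) {
            if (i != index) {
                out.add(tuple.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> bag = new ArrayList<>();
        bag.add(Arrays.<Object>asList(111, 222, 3, 121));
        bag.add(Arrays.<Object>asList(112, 223, 2, 131));
        bag.add(Arrays.<Object>asList(113, 224, 4, 141));

        // Drop the 3rd field (index 2) of every tuple, lazily.
        TransformingBagView<List<Object>, List<Object>> view =
            new TransformingBagView<>(bag, t -> dropField(t, 2));
        for (List<Object> t : view) {
            System.out.println(t);
        }
    }
}
```

The view is read-only: add() and the iterator's remove() throw
UnsupportedOperationException by default, which fits the rule that bags
are not modified once written. The original tuples are left untouched.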

D

On Wed, Sep 5, 2012 at 7:38 PM, Alan Gates <ga...@hortonworks.com> wrote:
>
> On Sep 5, 2012, at 6:30 PM, Prasanth J wrote:
>
>> Ahh.. Now it makes more sense.
>>
>> I think I got the solution. I was adding to List<Tuple> and then finally 
>> creating a DataBag with that list.. Instead I should create a bag and keep 
>> adding to it..!! Is that correct?
> Yes.
>
> Alan.
>
>> Thanks Alan.
>>
>> Thanks
>> -- Prasanth
>>
>> On Sep 5, 2012, at 9:24 PM, Alan Gates <ga...@hortonworks.com> wrote:
>>
>>> You cannot modify a bag once it is written.  The implementation is written 
>>> around the assumption that bags are immutable after they are written.
>>>
>>> Creating a new bag should not create an OOM exception, as bags are built to 
>>> spill when they grow too large.  In fact it's this spilling feature that 
>>> makes in place modification impossible.
>>>
>>> Alan.
>>>
>>> On Sep 5, 2012, at 6:08 PM, Prasanth J wrote:
>>>
>>>> Hello devs
>>>>
>>>> I have a specific case where I need to modify the contents of a DataBag 
>>>> (remove a field from each tuple), but I want to do it in place and do not 
>>>> want to create another DataBag with a new set of tuples.
>>>> The situation is, say I have the following input bag for a UDF
>>>>
>>>> {(111,222,3,121), (112,223,2,131), (113,224,4,141)}
>>>>
>>>> I want to iterate through this bag and generate an output bag, removing the 
>>>> 3rd field of each tuple in the bag to get the following output
>>>> {(111,222,121), (112,223,131), (113,224,141)}
>>>>
>>>> Since the number of tuples in this bag is expected to be large, I cannot 
>>>> create a new set of tuples and create a bag, as this will cause an OOM 
>>>> exception.
>>>>
>>>> Also, I do not want to flatten this bag, as it will be passed to the 
>>>> DISTINCT operator for computing distinct elements in the bag.
>>>> As seen from the javadocs for DataBag, there is no way to convert a bag on 
>>>> the fly. I wonder if there is any other way to solve this?
>>>>
>>>> Thanks
>>>> -- Prasanth
>>>>
>>>
>>
>
