[ 
https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1516:
-------------------------------

    Attachment: PIG-1516.patch

I haven't removed the use of finalize in this patch, but with the patch the 
number of objects with a finalize() method that get created should be much 
smaller, and if all the bags used in a query are small enough, no such 
objects will be created at all. 
Even in the case of large bags that spill to disk, since finalize() is no 
longer a method of the bags, the tuples in the bags can be freed by GC without 
waiting on finalization.
This should stop queries from running out of memory because of the wait on 
finalize().

The creation of spill files happens only if the bag is very large, and the 
processing of the tuples in those bags is likely to give enough time for the 
finalization thread to catch up. 

The changes - 
1. As I proposed in the solution, finalize() has been removed from 
DefaultAbstractBag and its subclasses, and a FileList class with a finalize() 
method is used as the container for the list of spill files.
2. Removed the finalize() method in InternalCachedBag.CachedBagIterator. It 
was used to call close on the DataInputStream. The DataInputStream contains a 
FileInputStream, which would have non-memory resources to be freed. But the 
FileInputStream already has a finalize() method, so the finalize() method in 
InternalCachedBag.CachedBagIterator is unnecessary.
3. In the bags that have code to pre-merge files when there are a large number 
of spill files, the files that have been merged into larger files are now 
deleted.
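
To illustrate change 1, here is a minimal sketch of the idea, not the code 
from the patch. FileList is the class name used above, but its method bodies 
here are my assumption, and SketchBag (with String tuples and a temp-file 
"spill") is a hypothetical stand-in for a real bag. The point it shows is that 
only the file holder has a finalizer, and that the holder is not even created 
until the first spill.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Holder for spill files. It is the only class here with a finalizer,
// so only its instances have to wait for the finalizer thread.
class FileList extends ArrayList<File> {
    private static final long serialVersionUID = 1L;

    // Deletes every file in the list, empties the list, and returns
    // the number of files that were actually deleted.
    int deleteAll() {
        int deleted = 0;
        for (File f : this) {
            if (f.delete()) {
                deleted++;
            }
        }
        clear();
        return deleted;
    }

    // finalize() is deprecated in recent JDKs; it is used here only to
    // mirror the approach described in this comment.
    @Override
    @SuppressWarnings({"deprecation", "removal"})
    protected void finalize() throws Throwable {
        try {
            deleteAll();
        } finally {
            super.finalize();
        }
    }
}

// Hypothetical bag with no finalize() method of its own.
class SketchBag {
    private final List<String> tuples = new ArrayList<>();
    private FileList spillFiles; // stays null until the first spill

    void add(String tuple) {
        tuples.add(tuple);
    }

    // Pretends to spill: creates a temp file, records it, drops the tuples.
    // A real bag would serialize the tuples into the file here.
    File spill() throws IOException {
        if (spillFiles == null) {
            spillFiles = new FileList();
        }
        File f = File.createTempFile("pigbag", ".tmp");
        spillFiles.add(f);
        tuples.clear();
        return f;
    }

    // True only if this bag has created an object that needs finalization.
    boolean hasFinalizableState() {
        return spillFiles != null;
    }

    // Explicit cleanup, for callers that do not want to wait for GC.
    int deleteSpillFiles() {
        return spillFiles == null ? 0 : spillFiles.deleteAll();
    }
}
```

In this sketch a bag that never spills is an ordinary object to the GC, and 
even a bag that has spilled can have its tuples reclaimed right away, since 
only the small FileList goes through the finalization queue.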


Using WeakReferences as Scott suggested, we can get rid of finalization 
completely. I have created a separate jira for that - PIG-1519.
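
For readers unfamiliar with the reference-queue technique, the following is a 
rough sketch of how weak references can replace a finalizer. It is not the 
design in PIG-1519. SpillFileTracker, OwnerRef and all method names are 
hypothetical, and a real implementation would also need something (for 
example a background thread) to call the cleanup method periodically.

```java
import java.io.File;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical tracker: deletes spill files once their owning bag
// has been garbage collected, without any finalize() method.
class SpillFileTracker {

    // Weak reference to the owner that also remembers the owner's files.
    // It must not hold a strong reference to the owner itself.
    static final class OwnerRef extends WeakReference<Object> {
        final List<File> files;

        OwnerRef(Object owner, List<File> files, ReferenceQueue<Object> queue) {
            super(owner, queue);
            this.files = new ArrayList<>(files);
        }
    }

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();

    // Keeps the OwnerRef objects reachable until they have been processed.
    private final Set<OwnerRef> pending =
            Collections.synchronizedSet(new HashSet<>());

    // Registers the files that belong to the given owner.
    OwnerRef register(Object owner, List<File> files) {
        OwnerRef ref = new OwnerRef(owner, files, queue);
        pending.add(ref);
        return ref;
    }

    // Drains the queue of references whose owners are gone and deletes
    // their files. Returns the number of files actually deleted.
    int deleteFilesOfCollectedOwners() {
        int deleted = 0;
        Reference<?> r;
        while ((r = queue.poll()) != null) {
            OwnerRef ref = (OwnerRef) r;
            pending.remove(ref);
            for (File f : ref.files) {
                if (f.delete()) {
                    deleted++;
                }
            }
        }
        return deleted;
    }
}
```

With this arrangement the GC enqueues an OwnerRef once its bag is no longer 
reachable, so the bag's memory is reclaimed without waiting on a finalizer 
thread, and file deletion happens whenever the queue is next drained.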


> finalize in bag implementations causes pig to run out of memory in reduce 
> --------------------------------------------------------------------------
>
>                 Key: PIG-1516
>                 URL: https://issues.apache.org/jira/browse/PIG-1516
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1516.patch
>
>
> *Problem:*
> Pig bag implementations that are subclasses of DefaultAbstractBag have 
> finalize methods implemented. As a result, the garbage collector moves them 
> to a finalization queue, and the memory they use is freed only after 
> finalization has run on them.
> If the bags are not finalized fast enough, a lot of memory is consumed by the 
> finalization queue, and Pig runs out of memory. This can happen if a large 
> number of small bags are being created.
> *Solution:*
> The finalize function exists for the purpose of deleting the spill files that 
> are created when the bag is too large. But if the bags are small enough, no 
> spill files are created, and the finalize function serves no purpose.
> A new class that holds a list of files will be introduced (FileList). This 
> class will have a finalize method that deletes the files. The bags will no 
> longer have finalize methods, and the bags will use FileList instead of 
> ArrayList<File>.
> *Possible workaround for earlier releases:*
> Since the fix is going into 0.8, here is a workaround -
> Disabling the combiner will reduce the number of bags getting created, as 
> there will not be the stage of combining intermediate merge results. But I 
> would recommend disabling it only if you have this problem, as it is likely 
> to slow down the query.
> To disable the combiner, set the property: -Dpig.exec.nocombiner=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
