No problem. I've filed a bug report; it describes the root cause: https://issues.apache.org/jira/browse/PIG-3409
2013/8/2 Pradeep Gollakota <pradeep...@gmail.com>

> Oh... sorry... I missed the part where you were saying that you want to
> reimplement the replicated join algorithm.
>
> On Fri, Aug 2, 2013 at 9:13 AM, Pradeep Gollakota <pradeep...@gmail.com>
> wrote:
>
>> join BIG by key, SMALL by key using 'replicated';
>>
>> On Fri, Aug 2, 2013 at 5:29 AM, Serega Sheypak <serega.shey...@gmail.com>
>> wrote:
>>
>>> Hi. I've hit a problem with replicated join in Pig 0.11.
>>> I have two relations:
>>> BIG (3-6 GB) and SMALL (100 MB).
>>> I join them on four integer fields.
>>> It takes up to 30 minutes to join them.
>>>
>>> The join runs on 18 reducers: -Xmx=3072m for Java, with 128 GB of RAM
>>> and 32 cores on each TaskTracker.
>>>
>>> So our hardware is really powerful.
>>>
>>> I ran part of the join locally and found a terrible situation:
>>> 50% of the heap was:
>>> Integers,
>>> arrays of those Integers,
>>> and ArrayLists holding the arrays of Integers.
>>>
>>> A GC overhead limit error occurs. The same happened on the cluster. I raised
>>> Xms and Xmx on the cluster and the problem went away.
>>>
>>> Anyway, 30 minutes to join 6 GB across 18 reducers against 100 MB is far too much.
>>> I would like to reimplement the replicated join.
>>> How can I do it?
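The heap profile described above (boxed Integers, int arrays, and ArrayLists dominating memory) suggests one direction for a custom replicated join: pack the four integer join keys into primitive longs instead of boxed collections before loading the small relation into a hash table. The sketch below is a hypothetical illustration, not Pig's actual implementation; the class and method names (`CompactReplicatedJoin`, `Key4`, `probe`) are invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a memory-lean map-side ("replicated") hash join.
// Rather than keying the in-memory table on ArrayLists of boxed Integers
// (the pattern dominating the heap dump in the thread above), the four
// int join keys are packed into two primitive longs.
public class CompactReplicatedJoin {

    // One small object per key instead of an ArrayList of four Integers.
    static final class Key4 {
        final long hi, lo;
        Key4(int a, int b, int c, int d) {
            hi = ((long) a << 32) | (b & 0xFFFFFFFFL);
            lo = ((long) c << 32) | (d & 0xFFFFFFFFL);
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key4)) return false;
            Key4 k = (Key4) o;
            return hi == k.hi && lo == k.lo;
        }
        @Override public int hashCode() {
            long h = hi * 31 + lo;
            return (int) (h ^ (h >>> 32));
        }
    }

    private final Map<Key4, String> small = new HashMap<>();

    // Load a row of the SMALL relation into memory once, before the join.
    public void put(int a, int b, int c, int d, String value) {
        small.put(new Key4(a, b, c, d), value);
    }

    // Probe with a row from the BIG relation; null means no match.
    public String probe(int a, int b, int c, int d) {
        return small.get(new Key4(a, b, c, d));
    }
}
```

With this layout each key costs one object and two longs rather than an ArrayList plus four Integer objects, which should substantially cut both heap pressure and GC overhead for a 100 MB small relation.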