No problem.
I've filed a bug describing the root cause:
https://issues.apache.org/jira/browse/PIG-3409


2013/8/2 Pradeep Gollakota <pradeep...@gmail.com>

> Oh... sorry... I missed the part where you were saying that you want to
> reimplement the replicated join algorithm
>
>
> On Fri, Aug 2, 2013 at 9:13 AM, Pradeep Gollakota <pradeep...@gmail.com
> >wrote:
>
> > join BIG by key, SMALL by key using 'replicated';
> >
> >
> > On Fri, Aug 2, 2013 at 5:29 AM, Serega Sheypak <serega.shey...@gmail.com
> >wrote:
> >
> >> Hi. I've run into a problem with replicated join in Pig 0.11.
> >> I have two relations:
> >> BIG (3-6GB) and SMALL (100MB)
> >> I join them on four integer fields.
> >> The join takes up to 30 minutes.
> >>
> >> The join runs on 18 reducers with -Xmx3072m per JVM; each TaskTracker
> >> has 128 GB of RAM and 32 cores.
> >>
> >> So our hardware is really powerful.
> >>
> >> I ran a part of the join locally and hit a terrible situation.
> >> 50% of the heap was:
> >> Integers,
> >> arrays holding these Integers,
> >> and ArrayLists wrapping those arrays.
> >>
> >> The GC overhead limit was exceeded. The same happened on the cluster. I
> >> raised Xmx/Xms on the cluster and the problem went away.
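[A heap dominated by boxed Integers and ArrayLists is consistent with representing a four-integer join key as a list of boxed values. As a hedged illustration (not Pig's actual key representation; all names here are hypothetical), a flat `int[]` wrapper holds the same key with far fewer objects while still working as a hash-map key:]

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a four-integer join key stored as one int[] instead
// of an ArrayList<Integer>, avoiding one boxed Integer object per field.
final class IntKey {
    private final int[] fields;

    IntKey(int a, int b, int c, int d) {
        this.fields = new int[] {a, b, c, d};
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof IntKey && Arrays.equals(fields, ((IntKey) o).fields);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(fields);
    }
}

public class KeyDemo {
    public static void main(String[] args) {
        Map<IntKey, String> smallSide = new HashMap<>();
        smallSide.put(new IntKey(1, 2, 3, 4), "row-A");
        // An equal key built from the other relation finds the match.
        System.out.println(smallSide.get(new IntKey(1, 2, 3, 4))); // prints row-A
    }
}
```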
> >>
> >> Anyway, taking 30 minutes to join 6 GB (split across 18 reducers) with
> >> 100 MB is far too long.
> >> I would like to reimplement the replicated join.
> >> How can I do it?
> >>
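[For context, the core idea behind a replicated (fragment-replicate) join is to build an in-memory hash table from the small relation once, then stream the big relation and probe it. The sketch below shows that idea in plain Java; it is a minimal illustration under assumed types, not Pig's actual implementation, and all names are hypothetical:]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a fragment-replicate (map-side hash) join:
// build phase over the small relation, probe phase over the big one.
public class ReplicatedJoinSketch {
    static List<String[]> join(List<String[]> big, List<String[]> small, int keyIdx) {
        // Build phase: index the small relation by its join key.
        Map<String, List<String[]>> index = new HashMap<>();
        for (String[] row : small) {
            index.computeIfAbsent(row[keyIdx], k -> new ArrayList<>()).add(row);
        }
        // Probe phase: stream the big relation, emitting concatenated rows.
        List<String[]> out = new ArrayList<>();
        for (String[] bigRow : big) {
            for (String[] smallRow : index.getOrDefault(bigRow[keyIdx], List.of())) {
                String[] joined = new String[bigRow.length + smallRow.length];
                System.arraycopy(bigRow, 0, joined, 0, bigRow.length);
                System.arraycopy(smallRow, 0, joined, bigRow.length, smallRow.length);
                out.add(joined);
            }
        }
        return out;
    }
}
```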
> >
> >
>
