Hi Peter, Looks like mapjoin does not work with outer join so streamtable is instead a possible approach. You would stream the larger table through the smaller one:
can you see whether the following helps your perf issue? select /*+ streamtable(message) */ f.uuid from message m right outer join feed f on m.uuid = f.uuid where m.uuid = null; 2013/5/5 Peter Chu <pete....@outlook.com> > Thanks, Stephen, > > I do not quite understand what you mean by Stream, specifically "Stream > the 400M message records through the in-memory maps". > Can you please elaborate. > > Also, can you use MAPJOIN on left outer join? > > Peter > > ------------------------------ > Date: Sun, 5 May 2013 21:44:37 -0700 > > Subject: Re: Hive QL - NOT IN, NOT EXIST > From: java...@gmail.com > To: user@hive.apache.org > > > > @Peter Does the query plan demonstrate that the 3Meg row table is being > map-joined and the 400M table streamed through? That is what you want: but > you might either need to fiddle with hints to get it to happen > > Details: > Read uuids s of feed into in-memory map on all nodes (mapjoin) > Stream the 400M message records through the in-memory maps, copying > id's from the "all feed uuids" map to a "matched feed uuid's map for > entries that have matches in the messages > > Note: this way the 400M rows are only read once on the cluster. > > You can see whether hive can manage this or if you write a custom m/r job > to do it. > > > > 2013/5/5 Peter Chu <pete....@outlook.com> > > It works but it takes a very long time because the subqueries in NOT IN > contains 400 million rows (the message table in the example) and the feed > table contains 3 million rows. > > SELECT uuid from feed f WHERE f.uuid NOT IN (SELECT uuid FROM message); > > > Date: Sun, 5 May 2013 20:25:15 -0700 > > From: michaelma...@yahoo.com > > Subject: Re: Hive QL - NOT IN, NOT EXIST > > To: user@hive.apache.org > > > > > > > --- On Sun, 5/5/13, Peter Chu <pete....@outlook.com> wrote: > > > > > I am wondering if there is any way to do this without resorting to > > > using left outer join and finding nulls. > > > > I have found this to be an acceptable substitute. Is it not working for > you? > > > > >