Thank you both. At a quick glance, that looks like what I am looking for. When I get it working, I'll post the solution.

Cheers,
Tim

On Mon, Nov 8, 2010 at 6:55 AM, Namit Jain <nj...@facebook.com> wrote:
> Other option would be to create a wrapper script (not use either UDF or
> UDTF)
> That script, in any language, can emit any number of output rows per input
> row.
>
> Look at:
> http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
> for details
>
> ________________________________
> From: Sonal Goyal [sonalgoy...@gmail.com]
> Sent: Sunday, November 07, 2010 8:40 PM
> To: user@hive.apache.org
> Subject: Re: Unions causing many scans of input - workaround?
>
> Hey Tim,
>
> You have an interesting problem. Have you tried creating a UDTF for your
> case, so that you can possibly emit more than one record for each row of
> your input?
>
> http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF
>
> Thanks and Regards,
> Sonal
>
> Sonal Goyal | Founder and CEO | Nube Technologies LLP
> http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal
>
> On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson <timrobertson...@gmail.com>
> wrote:
>>
>> Hi all,
>>
>> I am porting custom MR code to Hive and have written working UDFs
>> where I need them. Is there a workaround to having to do this in
>> Hive:
>>
>> select * from
>> (
>>   select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as
>> y, 0 as zoom, funct2(longitude, 0) as f2_x, funct2(latitude,0) as
>> f2_y, count (1) as count
>>   from table
>>   group by name_id, x, y, f2_x, f2_y
>>
>>   UNION ALL
>>
>>   select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as
>> y, 1 as zoom, funct2(longitude, 1) as f2_x, funct2(latitude,1) as
>> f2_y, count (1) as count
>>   from table
>>   group by name_id, x, y, f2_x, f2_y
>>
>>   --- etc etc increasing in zoom
>> )
>>
>> The issue being that this does many passes over the table, whereas
>> previously in my Map() I would just emit many times from the same
>> input record and then let it all group in the shuffle and sort.
>> I actually emit 184 times for an input record (23 zoom levels of
>> google maps, and 8 ways to derive the name_id) for a single record
>> which means 184 union statements - Is it possible in hive to force it
>> to emit many times from the source record in the stage-1 map?
>>
>> (ahem) Does anyone know if Pig can do this if not in Hive?
>>
>> I hope I have explained this well enough to make sense.
>>
>> Thanks in advance,
>> Tim
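
For anyone finding this thread later: the wrapper-script approach Namit describes could look roughly like the sketch below. It is only an illustration under several assumptions, not the solution that was eventually used. The script name (emit_tiles.py), the input column order (name_id, latitude, longitude) and the function names are hypothetical, and to_tile() uses the standard Web Mercator tile formula as a stand-in for the toTileX/toTileY UDFs, whose actual logic is not shown in the thread. It covers only the 23 zoom levels, not the 8 ways of deriving name_id. Hive's TRANSFORM passes rows to the script as tab-separated lines on stdin and reads tab-separated rows back from stdout.

```python
#!/usr/bin/env python
"""Sketch of a Hive TRANSFORM script: emit one row per zoom level per input row."""
import math
import sys

ZOOM_LEVELS = 23          # zoom 0..22, the 23 levels mentioned in the thread
MAX_LATITUDE = 85.0511    # Web Mercator is undefined at the poles, so clamp


def to_tile(latitude, longitude, zoom):
    """Hypothetical stand-in for toTileX/toTileY (standard Web Mercator tiling)."""
    n = 2 ** zoom
    latitude = max(min(latitude, MAX_LATITUDE), -MAX_LATITUDE)
    lat_rad = math.radians(latitude)
    x = int((longitude + 180.0) / 360.0 * n)
    y = int((1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that edge values (e.g. longitude 180) stay inside the tile grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def expand_row(name_id, latitude, longitude):
    """Yield (name_id, x, y, zoom) once per zoom level for a single input record."""
    for zoom in range(ZOOM_LEVELS):
        x, y = to_tile(latitude, longitude, zoom)
        yield (name_id, x, y, zoom)


def main(stream):
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip rows that do not match the assumed 3-column layout
        name_id, lat_text, lon_text = fields
        try:
            latitude, longitude = float(lat_text), float(lon_text)
        except ValueError:
            continue  # skip NULLs (Hive writes them as \N) and unparsable values
        for row in expand_row(name_id, latitude, longitude):
            print("\t".join(str(value) for value in row))


if __name__ == "__main__":
    main(sys.stdin)
```

On the Hive side, something along the lines of ADD FILE emit_tiles.py; followed by SELECT TRANSFORM(name_id, latitude, longitude) USING 'python emit_tiles.py' AS (name_id, x, y, zoom) FROM table should feed the script. An outer query can then group by name_id, x, y, zoom and count, so the table is read once and the fan-out happens in the map stage instead of through many UNION ALL branches.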