An even better option is to drop the geoip database into the Distributed Cache; the option is available in 0.9 and trunk:
https://issues.apache.org/jira/browse/PIG-1752

On Mon, Jul 11, 2011 at 6:37 PM, Ross Nordeen <rjnor...@mtu.edu> wrote:

> Thanks for the quick help Matt! I'll try and work on getting that working.
> If anyone else has a UDF that they could send, that would be cool.
>
> -Ross
>
> --
> Ross Nordeen
> Computer Networking And Systems Administration
> Michigan Technological University
> http://www.linkedin.com/in/rjnordee
>
> ----- Original Message -----
> From: "Matt Davies" <m...@mattdavies.net>
> To: user@pig.apache.org
> Sent: Monday, July 11, 2011 1:34:29 PM GMT -08:00 US/Canada Pacific
> Subject: Re: GeoIP database lookups
>
> I can't really share the code, but I can tell you the general way of doing
> it that works well.
>
> When the UDF is instantiated, create a HashMap or the like with the data you
> need. So, you'll do a DEFINE GEO com.xyz.Geo('$filename') in your Pig code.
>
> This varies depending on how you are looking up the data. You could be
> looking up based on IP address or the actual ID of the location. This is a
> one-time hit, and in our case, very very fast and not even noticed.
>
> As each tuple hits the exec method, it becomes a quick lookup.
>
> In terms of why HDFS - we found that there were too many issues in our shop
> keeping things synced, and it was much easier to read the file out of HDFS.
> So, for instance, if a job has 1000 mappers, you read the file 1000 times
> from HDFS. True, you get performance gains reading from the classpath of the
> jar, but, as with all of programming, there are tradeoffs. This format
> worked best for us in our release structure. One instance is a general UDF
> like this that could have different input files dependent on the jobs. Or,
> to be even faster, we may filter out all non-US data from the locations so
> different files are used. YMMV.
>
> Hope that helps some.
>
> On Mon, Jul 11, 2011 at 2:18 PM, Ross Nordeen <rjnor...@mtu.edu> wrote:
>
>> Matt,
>>
>> So don't ship the GeoIP database with the jar? Does your mapper then cache
>> the locations.csv? Would you mind sending me your UDF? That sounds like an
>> interesting solution but I don't really understand how you would do that. I
>> was under the impression the fastest way to do it would be to ship and cache
>> the binary database instead of calling from the HDFS.
>>
>> -Ross
>>
>> ----- Original Message -----
>> From: "Matt Davies" <m...@mattdavies.net>
>> To: user@pig.apache.org, "Ross" <rjnord...@semesteratsea.net>
>> Sent: Monday, July 11, 2011 12:34:38 PM GMT -08:00 US/Canada Pacific
>> Subject: Re: GeoIP database lookups
>>
>> We wrote a snazzy UDF that does 1 initialization per mapper and does all
>> the necessary conversions. Quite efficient and fast.
>>
>> The trick to maintainability is to have your UDF initialize the
>> locations.csv from HDFS and not to include the csv file within your jar.
>> That way you can easily update the locations without recompiling.
>>
>> -Matt
>>
>> On Mon, Jul 11, 2011 at 12:57 PM, Ross Nordeen <rjnor...@mtu.edu> wrote:
>>
>> > Hello all,
>> >
>> > Is there an accepted way to use the GeoIP database with Pig?
>> >
>> > I've found some people have tried to write UDFs with their Java API.
>> > http://www.maxmind.com/java
>> >
>> > Others say to use the streaming interface within Pig and run the queries
>> > through a Perl script.
>> >
>> > http://www.cloudera.com/blog/2009/06/analyzing-apache-logs-with-pig/#comments
>> >
>> > I'm just trying to find the most efficient way to run this. Any ideas?
>> >
>> > -Ross
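
For anyone following the thread, here is a rough plain-Java sketch of the "load once, look up per tuple" pattern Matt describes. It is not his code (which was not shared) and it deliberately has no Pig dependency so it can be run on its own. The class name GeoLookupSketch, the lookup method, and the two-column "id,location" layout of the file are all assumptions made for illustration. In a real UDF this logic would presumably sit in a class extending EvalFunc, with the constructor taking the filename passed in DEFINE and exec doing the lookup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: build the lookup table once at construction time,
 * then answer each record with a single HashMap lookup.
 */
class GeoLookupSketch {
    // Maps a location id to its location string (assumed file layout: "id,location").
    private final Map<String, String> locations = new HashMap<>();

    /** Reads every "id,location" line from the source exactly once. */
    GeoLookupSketch(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",", 2);
                if (parts.length == 2) {            // skip lines without a comma
                    locations.put(parts[0].trim(), parts[1].trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Per-record lookup; returns null when the id is null or unknown. */
    String lookup(String id) {
        return id == null ? null : locations.get(id.trim());
    }
}
```

The Reader could be opened on the HDFS file, as in Matt's setup, or on a local copy delivered through the Distributed Cache. For the latter, my understanding is that PIG-1752 lets a UDF name the files to cache by overriding getCacheFiles() on EvalFunc, but check the JIRA and the UDF docs for the exact contract before relying on that.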