An even better option is to put the GeoIP database into the
Distributed Cache; UDFs can request this in Pig 0.9 and trunk:

https://issues.apache.org/jira/browse/PIG-1752
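
A rough sketch of the lookup side, in case it helps. As far as I understand
PIG-1752, a UDF can override EvalFunc.getCacheFiles() to return entries such
as "/some/hdfs/path/locations.csv#locations.csv", and the file then shows up
under the symlink name in each task's working directory. The class below is
hypothetical and deliberately free of Pig imports: GeoLookup, the two-column
"id,location" CSV layout, and the file name are all assumptions for
illustration. It loads the file once on the first call and answers later
calls from a HashMap, which is the "one initialization per mapper" pattern
described further down the thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Pig: loads a two-column CSV once and
// serves every later lookup from memory.
class GeoLookup {
    // Local path of the data file, e.g. the symlink name created by the
    // distributed cache in the task's working directory (assumption).
    private final String localPath;

    // Stays null until the first lookup triggers the one-time load.
    private Map<String, String> locations;

    GeoLookup(String localPath) {
        this.localPath = localPath;
    }

    // Returns the location stored for the key, or null if the key is unknown.
    String lookup(String key) throws IOException {
        if (locations == null) {
            locations = load(localPath);
        }
        return locations.get(key);
    }

    // Reads "key,value" lines into a map; lines without a comma are skipped.
    private static Map<String, String> load(String path) throws IOException {
        Map<String, String> map = new HashMap<>();
        try (BufferedReader in =
                 Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",", 2);
                if (parts.length == 2) {
                    map.put(parts[0].trim(), parts[1].trim());
                }
            }
        }
        return map;
    }
}
```

In a real UDF the lookup call would sit inside exec(), so each tuple costs
one map access after the first. For the binary GeoIP database you would
presumably hand the cached file to the MaxMind Java API instead of parsing
a CSV.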

On Mon, Jul 11, 2011 at 6:37 PM, Ross Nordeen <rjnor...@mtu.edu> wrote:
> Thanks for the quick help Matt!  I'll try to get that working. If anyone
> else has a UDF that they could send, that would be cool.
>
> -Ross
>
> --
> Ross Nordeen
> Computer Networking And Systems Administration
> Michigan Technological University
> http://www.linkedin.com/in/rjnordee
>
> ----- Original Message -----
> From: "Matt Davies" <m...@mattdavies.net>
> To: user@pig.apache.org
> Sent: Monday, July 11, 2011 1:34:29 PM GMT -08:00 US/Canada Pacific
> Subject: Re: GeoIP database lookups
>
> I can't really share the code, but I can tell you the general way of doing
> it that works well.
>
> When the UDF is instantiated create a HashMap or the like with the data you
> need. So, you'll do a DEFINE GEO com.xyz.Geo('$filename') in your Pig code.
>
> This varies depending on how you are looking up the data. You could be
> looking up based on IP address or the actual ID of the location.  This is a
> one-time hit, and in our case, very very fast and not even noticed.
>
> As each tuple hits the exec method then it becomes a quick lookup.
>
> In terms of why HDFS - we found that there were too many issues in our shop
> keeping things synced and much easier to read the file out of HDFS.  So, for
> instance, if a job has 1000 mappers, you read the file 1000 times from HDFS.
>  True, you get performance gains reading from the classpath of the jar, but,
> as with all of programming, there are tradeoffs. This format worked best for
> us in our release structure. One instance is a general UDF like this that
> could have different input files dependent on the jobs. Or, to be even
> faster we may filter out all non-US data from the locations so different
> files are used.   YMMV.
>
>
> Hope that helps some.
>
>
> On Mon, Jul 11, 2011 at 2:18 PM, Ross Nordeen <rjnor...@mtu.edu> wrote:
>
>> Matt,
>>
>> So don't ship the GeoIP database with the jar?  Does your mapper then cache
>> the locations.csv?  Would you mind sending me your UDF?  That sounds like an
>> interesting solution but I don't really understand how you would do that.  I
>> was under the impression the fastest way to do it would be to ship and cache
>> the binary database instead of calling from the HDFS.
>>
>> -Ross
>>
>> ----- Original Message -----
>> From: "Matt Davies" <m...@mattdavies.net>
>> To: user@pig.apache.org, "Ross" <rjnord...@semesteratsea.net>
>> Sent: Monday, July 11, 2011 12:34:38 PM GMT -08:00 US/Canada Pacific
>> Subject: Re: GeoIP database lookups
>>
>> We wrote a snazzy UDF that does 1 initialization per mapper and does all the
>> necessary conversions. Quite efficient and fast.
>>
>> The trick to maintainability is to have your UDF initialize the
>> locations.csv from HDFS and not to include the csv file within your jar.
>>  That way you can easily update the locations without recompiling.
>>
>> -Matt
>>
>> On Mon, Jul 11, 2011 at 12:57 PM, Ross Nordeen <rjnor...@mtu.edu> wrote:
>>
>> > Hello all,
>> >
>> > Is there an accepted way to use the GeoIP database with pig?
>> >
>> > I've found some people have tried to write UDFs with their Java API.
>> > http://www.maxmind.com/java
>> >
>> > Others say to use the streaming interface within pig and run the queries
>> > through a perl script.
>> >
>> > http://www.cloudera.com/blog/2009/06/analyzing-apache-logs-with-pig/#comments
>> >
>> > I'm just trying to find the most efficient way to run this.  Any ideas?
>> >
>> > -Ross
>> >
>>
>
