Sorry, for the benefit of others who may not know the GBIF code SVN sites, this particular code is all in the GBIF common resources SVN:
https://code.google.com/p/gbif-common-resources/source/browse/#svn%2Fgbif-parsers%2Ftrunk%2Fsrc%2Fmain%2Fjava%2Forg%2Fgbif%2Fcommon%2Fparsers
It is also available as a Mavenized release in the GBIF Maven repository:
http://repository.gbif.org/index.html#nexus-search;quick~gbif-parsers

The list mapping all the variations we see is:
https://code.google.com/p/gbif-common-resources/source/browse/gbif-parsers/trunk/src/main/resources/dictionaries/parse/countryName.txt

I hope this helps,
Tim

> Hi David,
>
> You've built your other libraries using GBIF parsers. Have you looked at how
> the GBIF country names interpretation works? It would be helpful to know why
> it is not suitable for your use.
>
> The GBIF library concatenates known lists (such as ISO) along with about 2500
> variations we've collected through periodic review of what we observe while
> indexing. Using Google Refine, we've mapped them to the ISO codes, and
> we follow the ISO code changes as best we can. Your narwhal-processor
> already has a software dependency on the GBIF code.
>
> Please remember that patches and additions are always welcome to the GBIF
> code, if you feel it could be improved. I'm biased of course, but I'd rather
> see something that is broken fixed than watch a recreation of something
> that already exists.
>
> Cheers,
> Tim
>
>
> On May 17, 2013, at 4:39 PM, Matt Jones wrote:
>
>> A good official list of countries is available from the Library of Congress:
>> http://www.loc.gov/standards/codelists/countries.xml
>> For background, see: http://www.loc.gov/marc/countries/
>>
>> And of course there's ISO 3166, the list of country codes:
>>
>> http://www.iso.org/iso/home/standards/country_codes/country_names_and_code_elements_xml.htm
>> http://www.iso.org/iso/country_codes
>>
>> Not sure about the alternate representations and misspellings, though.
>>
>> Matt
>>
>>
>> On Fri, May 17, 2013 at 5:57 AM, Shorthouse, David
>> <[email protected]> wrote:
>> Folks,
>>
>> The Canadensys development team, http://www.canadensys.net, is looking
>> for efficient, low-maintenance ways to validate and reconcile data in
>> its national cache of occurrence data. We are working on a Java
>> library to initially tackle single-field Darwin Core validations,
>> https://github.com/Canadensys/narwhal-processor. We hope this library
>> is sufficiently generalized for uses outside our project.
>>
>> Our current challenge is to reconcile country names, which requires
>> access to an up-to-date, well-maintained knowledge base of country
>> names, their alternative representations (possibly multilingual), and
>> mappings to known misspellings. For performance reasons, we'd like
>> this thesaurus to be embedded in the library, but with the capacity to
>> be periodically refreshed with data pulled from external resources
>> such as dbpedia.org. This clearly has ties to semantic web thinking
>> and, because we're new to the tools and services in this space, we'd
>> like to solicit pointers and feedback such that we build this part of
>> our library with maximal benefit to other projects. We started
>> collecting thoughts here:
>> https://github.com/Canadensys/narwhal-processor/issues/14.
>>
>> Cheers,
>>
>> David P. Shorthouse
>> Christian Gendreau
>> _______________________________________________
>> tdwg mailing list
>> [email protected]
>> http://lists.tdwg.org/mailman/listinfo/tdwg
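
For anyone following the thread who wants a feel for the dictionary approach discussed above (known lists plus observed variations, all mapped to ISO codes), here is a rough sketch. It is not the GBIF parser code or its API. It is written in Python only for brevity, it assumes a simple "variant<TAB>ISO code" line format that may not match countryName.txt exactly, and the function names and sample entries are made up for illustration.

```python
import unicodedata


def normalize(name):
    """Fold case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents)
    return " ".join(cleaned.upper().split())


def load_dictionary(lines):
    """Build a lookup table from 'variant<TAB>code' lines (assumed format)."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        variant, sep, code = line.partition("\t")
        if not sep:
            continue  # skip lines without a tab separator
        mapping[normalize(variant)] = code.strip().upper()
    return mapping


def reconcile(name, mapping):
    """Return the ISO code for a raw country string, or None if unknown."""
    return mapping.get(normalize(name))


# Hypothetical sample entries, not taken from the GBIF dictionary file.
SAMPLE = [
    "Canada\tCA",
    "Kanada\tCA",
    "Côte d'Ivoire\tCI",
    "Ivory Coast\tCI",
    "U.S.A.\tUS",
]
COUNTRIES = load_dictionary(SAMPLE)
```

Normalizing both the dictionary keys and the incoming value is what lets a single entry absorb differences in case, accents and punctuation. Genuine misspellings still need their own entries, which is where a maintained list of observed variations comes in.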
_______________________________________________
tdwg mailing list
[email protected]
http://lists.tdwg.org/mailman/listinfo/tdwg
