[ 
https://issues.apache.org/jira/browse/TIKA-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513286#comment-14513286
 ] 

Nick Burch commented on TIKA-1614:
----------------------------------

Is the data from geonames under a suitable license that we can distribute it 
with Tika? (http://www.apache.org/legal/resolved.html has the allowed licenses)

Even if it is, just distributing a big binary blob doesn't feel very "open 
source". What if people wanted to change it, update it, or move to a newer 
Lucene version? How would they be able to?

Also, what if people wanted to use a different names source, perhaps one 
built on OpenStreetMap?

What we did for Translation was to have a common API interface, then allow 
implementations to be plugged in as the user selects. Maybe we need an approach 
like that? Your geonames lookup could then be one such implementation (provided 
the source/tools to build/rebuild it came too!), and others could be added as 
desired.
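
As a rough illustration of that pluggable approach, here is a minimal sketch. 
Everything in it is hypothetical: GeoLocation, GeoNameResolver, 
InMemoryResolver and ResolverRegistry are names invented for this example, not 
existing Tika APIs, and the in-memory map only stands in for a real 
geonames- or OpenStreetMap-backed lookup.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical value type: a resolved place with its coordinates.
final class GeoLocation {
    final String name;
    final double longitude;
    final double latitude;

    GeoLocation(String name, double longitude, double latitude) {
        this.name = name;
        this.longitude = longitude;
        this.latitude = latitude;
    }
}

// Hypothetical common interface; each names source would implement it.
interface GeoNameResolver {
    Optional<GeoLocation> resolve(String placeName);
}

// Toy implementation backed by a map; a real one might query a Lucene index.
final class InMemoryResolver implements GeoNameResolver {
    private final Map<String, GeoLocation> gazetteer = new HashMap<>();

    InMemoryResolver add(String name, double lon, double lat) {
        gazetteer.put(name.toLowerCase(Locale.ROOT), new GeoLocation(name, lon, lat));
        return this;
    }

    @Override
    public Optional<GeoLocation> resolve(String placeName) {
        // Case-insensitive lookup; empty when the name is unknown.
        return Optional.ofNullable(gazetteer.get(placeName.toLowerCase(Locale.ROOT)));
    }
}

// Lets the user select an implementation by key, e.g. "geonames" or "osm".
final class ResolverRegistry {
    private final Map<String, GeoNameResolver> resolvers = new HashMap<>();

    void register(String key, GeoNameResolver resolver) {
        resolvers.put(key, resolver);
    }

    GeoNameResolver select(String key) {
        GeoNameResolver resolver = resolvers.get(key);
        if (resolver == null) {
            throw new IllegalArgumentException("No resolver registered for: " + key);
        }
        return resolver;
    }
}
```

A caller would register one resolver per names source and then pick one by 
key, for example registry.select("geonames").resolve("Paris"). Shipping the 
tools to build each source's index alongside its implementation would address 
the binary blob concern.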

> Geo Topic Parser
> ----------------
>
>                 Key: TIKA-1614
>                 URL: https://issues.apache.org/jira/browse/TIKA-1614
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Anya Yun Li
>              Labels: memex
>
> ##Description
> This program aims to identify geonames in unstructured text data for the 
> NSF polar research project. 
> https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1
> The project is a content-based geotagging solution built from a variety of 
> NLP tools, and could be used for any geotagging purpose. 
> ##Workflow
> 1. Plain text input is passed to the geoparser
> 2. Location names are extracted from the text using OpenNLP NER
> 3. Each extracted name is given one of two roles: 
>       * The most frequent location name is chosen as the best match for the 
> input text
>       * Other extracted locations are treated as alternatives (all equal)
> 4. For each location extracted above, search for the best GeoName object and 
> return the resolved objects with fields (name in gazetteer, longitude, 
> latitude)
> ##How to Use
> *Caution*: This program requires at least 1.2 GB of disk space to build the 
> Lucene index
> ```Java
>     // Assumes geoparser (the proposed parser), gazetteerPath and nerPath
>     // are defined by the caller.
>     void printGeoMetadata(InputStream stream) throws Exception {
>         Metadata metadata = new Metadata();
>         ParseContext context = new ParseContext();
>         GeoParserConfig config = new GeoParserConfig();
>         config.setGazetterPath(gazetteerPath);
>         config.setNERModelPath(nerPath);
>         context.set(GeoParserConfig.class, config);
>
>         geoparser.parse(stream, new BodyContentHandler(), metadata, context);
>
>         // Print every metadata field the parser populated.
>         for (String name : metadata.names()) {
>             String value = metadata.get(name);
>             System.out.println(name + " " + value);
>         }
>     }
> ```
> This parser adds geographical information to Tika's Metadata object. 
> Fields for the best-matched location:
> ```
> Geographic_NAME
> Geographic_LONGTITUDE
> Geographic_LATITUDE
> ```
> Fields for alternatives:
> ```
> Geographic_NAME1
> Geographic_LONGTITUDE1
> Geographic_LATITUDE1
> Geographic_NAME2
> Geographic_LONGTITUDE2
> Geographic_LATITUDE2
> ...
> ```
> If you have any questions, contact me: [email protected]
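
Step 3 of the quoted workflow, together with the field naming, could be 
sketched roughly as below. This is only an illustration under assumptions: 
LocationRanker and toNameFields are hypothetical names, the input list stands 
in for the OpenNLP NER output, and ordering the alternatives by frequency is a 
choice made for this example (the description treats them as equal). The 
gazetteer lookup for longitude and latitude is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks extracted location names by frequency.
final class LocationRanker {

    // Returns field name -> location name, most frequent name first.
    static Map<String, String> toNameFields(List<String> extracted) {
        // Count occurrences, remembering first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : extracted) {
            counts.merge(name, 1, Integer::sum);
        }

        // Sort by descending count; the sort is stable, so ties keep
        // first-seen order.
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));

        // Best match gets the bare field name; alternatives get a numeric suffix.
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < ranked.size(); i++) {
            String field = (i == 0) ? "Geographic_NAME" : "Geographic_NAME" + i;
            fields.put(field, ranked.get(i).getKey());
        }
        return fields;
    }
}
```

For the input Oslo, Tromso, Oslo, Nuuk, Tromso, Oslo, this maps 
Geographic_NAME to Oslo, Geographic_NAME1 to Tromso and Geographic_NAME2 to 
Nuuk.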



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
