Anya Yun Li created TIKA-1614:
---------------------------------

Summary: Geo Topic Parser
Key: TIKA-1614
URL: https://issues.apache.org/jira/browse/TIKA-1614
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Anya Yun Li

##Description

This program identifies geonames in unstructured text data for the NSF polar research project:
https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1

It is a content-based geotagging solution built from a variety of NLP tools, and it can be used for any geotagging purpose.

##Workflow

1. Plain text input is passed to the geoparser.
2. Location names are extracted from the text using OpenNLP NER.
3. Each extracted name is assigned one of two roles:
   * The most frequent location name is chosen as the best match for the input text.
   * All other extracted locations are treated as alternatives (with equal weight).
4. For each location extracted above, the parser searches for the best-matching GeoName object and returns the resolved objects with the fields name in gazetteer, longitude, and latitude.

##How to Use

*Caution*: This program requires at least 1.2 GB of disk space to build the Lucene index.

```Java
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
// GeoParserConfig is provided by this patch.

void printGeoMetadata(Parser geoparser, InputStream stream,
                      String gazetteerPath, String nerPath) throws Exception {
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    // Tell the parser where the gazetteer and the OpenNLP NER model live.
    GeoParserConfig config = new GeoParserConfig();
    config.setGazetterPath(gazetteerPath);
    config.setNERModelPath(nerPath);
    context.set(GeoParserConfig.class, config);

    geoparser.parse(stream, new BodyContentHandler(), metadata, context);

    // Print every metadata field the parser produced.
    for (String name : metadata.names()) {
        String value = metadata.get(name);
        System.out.println(name + " " + value);
    }
}
```

The parser adds geographical information to Tika's Metadata object.

Fields for the best-matched location:
```
Geographic_NAME
Geographic_LONGTITUDE
Geographic_LATITUDE
```

Fields for the alternatives:
```
Geographic_NAME1
Geographic_LONGTITUDE1
Geographic_LATITUDE1
Geographic_NAME2
Geographic_LONGTITUDE2
Geographic_LATITUDE2
...
```

If you have any questions, contact me: anyayu...@gmail.com
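
To make the selection rule in workflow step 3 and the numbered metadata fields concrete, here is a minimal sketch in plain Java. It is not the code from the patch: the class name `LocationRanker`, its methods, and the tie-breaking rule (the first-seen name wins when counts are equal) are assumptions made for illustration, and the input list stands in for whatever OpenNLP NER would return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the patch: illustrates "most frequent name wins".
class LocationRanker {

    /** Counts how often each name occurs, preserving first-seen order. */
    static Map<String, Integer> countNames(List<String> names) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : names) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Returns the distinct names with the most frequent one first.
     * The remaining names are alternatives and keep their first-seen order.
     * Assumption: on a tie, the name seen first is treated as the best match.
     */
    static List<String> rank(List<String> names) {
        Map<String, Integer> counts = countNames(names);
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        List<String> ranked = new ArrayList<>();
        if (best == null) {
            return ranked; // no locations were extracted
        }
        ranked.add(best);
        for (String name : counts.keySet()) {
            if (!name.equals(best)) {
                ranked.add(name);
            }
        }
        return ranked;
    }

    /** Field name for position i: 0 is the best match (no suffix), i >= 1 an alternative. */
    static String nameField(int i) {
        return i == 0 ? "Geographic_NAME" : "Geographic_NAME" + i;
    }

    public static void main(String[] args) {
        // Stand-in for NER output; the place names are made up for the example.
        List<String> extracted = Arrays.asList("Alaska", "Greenland", "Alaska", "Svalbard");
        List<String> ranked = rank(extracted);
        for (int i = 0; i < ranked.size(); i++) {
            System.out.println(nameField(i) + " " + ranked.get(i));
        }
    }
}
```

The longitude and latitude fields would follow the same suffix scheme once each ranked name has been resolved against the gazetteer; that lookup is left out here because it depends on the Lucene index.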