[ https://issues.apache.org/jira/browse/TIKA-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557791#comment-14557791 ]
Chris A. Mattmann commented on TIKA-1614:
-----------------------------------------

After some minor cleanup, all tests pass; a patch is about to be attached:

{noformat}
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Tika parent ................................. SUCCESS [ 1.516 s]
[INFO] Apache Tika core ................................... SUCCESS [ 20.335 s]
[INFO] Apache Tika parsers ................................ SUCCESS [02:27 min]
[INFO] Apache Tika XMP .................................... SUCCESS [ 2.535 s]
[INFO] Apache Tika serialization .......................... SUCCESS [ 2.583 s]
[INFO] Apache Tika batch .................................. SUCCESS [02:05 min]
[INFO] Apache Tika application ............................ SUCCESS [ 41.779 s]
[INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 22.209 s]
[INFO] Apache Tika translate .............................. SUCCESS [ 3.316 s]
[INFO] Apache Tika server ................................. SUCCESS [ 22.884 s]
[INFO] Apache Tika examples ............................... SUCCESS [ 6.708 s]
[INFO] Apache Tika Java-7 Components ...................... SUCCESS [ 2.623 s]
[INFO] Apache Tika ........................................ SUCCESS [ 0.028 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 06:39 min
[INFO] Finished at: 2015-05-24T09:55:56-07:00
[INFO] Final Memory: 116M/1523M
{noformat}

> Geo Topic Parser
> ----------------
>
>                 Key: TIKA-1614
>                 URL: https://issues.apache.org/jira/browse/TIKA-1614
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Anya Yun Li
>            Assignee: Chris A. Mattmann
>              Labels: memex
>
> ##Description
> This program aims to identify geonames in unstructured text data for the
> NSF polar research project.
> https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1
> This project is a content-based geotagging solution, built from a variety of NLP
> tools, and can be used for any geotagging purpose.
> ##Workflow
> 1. Plain text input is passed to the geoparser
> 2. Location names are extracted from the text using OpenNLP NER
> 3. The extracted names are given one of two roles:
> * The most frequent location name is chosen as the best match for the
> input text
> * The other extracted locations are treated as alternatives (all equal)
> 4. For each location extracted above, the best GeoName object is looked up and the
> resolved objects are returned with fields (name in gazetteer, longitude, latitude)
> ##How to Use
> *Caution*: This program requires at least 1.2 GB of disk space to build the
> Lucene index
> ```Java
> // geoparser, gazetteerPath and nerPath are assumed to be set up by the caller.
> void A(InputStream stream) throws Exception {
>     Metadata metadata = new Metadata();
>     ParseContext context = new ParseContext();
>     GeoParserConfig config = new GeoParserConfig();
>     config.setGazetterPath(gazetteerPath);
>     config.setNERModelPath(nerPath);
>     context.set(GeoParserConfig.class, config);
>
>     geoparser.parse(
>         stream,
>         new BodyContentHandler(),
>         metadata,
>         context);
>
>     // Print every metadata field the parser produced.
>     for (String name : metadata.names()) {
>         String value = metadata.get(name);
>         System.out.println(name + " " + value);
>     }
> }
> ```
> This parser adds useful geographical information to Tika's Metadata
> object.
> Fields for the best-matched location:
> ```
> Geographic_NAME
> Geographic_LONGTITUDE
> Geographic_LATITUDE
> ```
> Fields for alternatives:
> ```
> Geographic_NAME1
> Geographic_LONGTITUDE1
> Geographic_LATITUDE1
> Geographic_NAME2
> Geographic_LONGTITUDE2
> Geographic_LATITUDE2
> ...
> ```
> If you have any questions, contact me: anyayu...@gmail.com
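
For readers following step 3 of the quoted workflow, the ranking rule (most frequent name as best match, the rest as alternatives) can be sketched in plain Java. This is only an illustration, not the code in the patch: the class `GeoRanking`, its method names, the first-seen tie-breaking rule and the sample place names are all assumptions made for the example. Only the `Geographic_NAME`, `Geographic_NAME1`, ... key pattern comes from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: illustrates frequency-based ranking only.
public class GeoRanking {

    /** Returns distinct names ordered by descending frequency; ties keep first-seen order. */
    public static List<String> rankByFrequency(List<String> extractedNames) {
        // LinkedHashMap remembers the order in which names were first seen.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : extractedNames) {
            counts.merge(name, 1, Integer::sum);
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // List.sort is stable, so equally frequent names stay in first-seen order
        // (this tie-breaking rule is an assumption of the sketch).
        ranked.sort((a, b) -> Integer.compare(counts.get(b), counts.get(a)));
        return ranked;
    }

    /** Maps ranked names to Geographic_NAME (best match) and Geographic_NAME1.. (alternatives). */
    public static Map<String, String> toNameFields(List<String> ranked) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < ranked.size(); i++) {
            String key = (i == 0) ? "Geographic_NAME" : "Geographic_NAME" + i;
            fields.put(key, ranked.get(i));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up NER output for demonstration.
        List<String> names = Arrays.asList("Alaska", "Greenland", "Alaska", "Svalbard");
        for (Map.Entry<String, String> e : toNameFields(rankByFrequency(names)).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

With the sample input, "Alaska" occurs twice and so becomes `Geographic_NAME`, while "Greenland" and "Svalbard" become `Geographic_NAME1` and `Geographic_NAME2`. The gazetteer lookup in step 4, which resolves each name to a longitude and latitude, is deliberately left out of this sketch.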