GitHub user chenlica closed a discussion: Editing Integrating Stanford NLP 
(from old wiki)

>From the page https://github.com/apache/texera/wiki/Integrating-Stanford-NLP/ 
>(may be dangling)

====
Author(s): [Feng Hong](https://github.com/sam0227), Yang Jiao

## Synopsis
The Stanford NLP package is a powerful Java library for natural language 
processing. The goal of this project is to integrate some of its features as a 
Texera operator that lets users extract named entities and parts of speech.


## Status
As of 6/13/2016: **FINISHED**

## Modules

```
edu.uci.ics.texera.dataflow.nlpextractor
```

## Related Issues
https://github.com/Texera/texera/issues/33

## Stanford NLP package

Stanford NLP is a set of natural language analysis tools written in Java. It 
annotates raw human-language text, producing the base forms of words, their 
parts of speech, and whether they are names of companies, people, locations, 
etc. The package includes a POS tagger, a syntactic parser, and a named entity 
recognizer. Its analyses provide the foundational building blocks for 
higher-level and domain-specific text-understanding applications.

The purpose of this project is to implement Stanford NLP as an extractor in 
Texera. Users specify an NLP constant, which is either a Named Entity class 
(Number, Location, Person, Organization, Money, Percent, Date, Time) or a 
Part-of-Speech type (Adjective, Adverb, Noun, Verb).
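One way the user-facing constants above could be modeled is an enum that records whether each constant is a Named Entity class or a Part-of-Speech type. This is a hypothetical sketch for illustration, not the actual Texera code (the names `NlpConstants`, `Kind`, and `NlpConstant` are assumptions):

```java
// Sketch: model each user-selectable NLP constant together with its kind,
// so the extractor can decide whether to run NER or POS tagging.
public class NlpConstants {
    enum Kind { NAMED_ENTITY, PART_OF_SPEECH }

    enum NlpConstant {
        NUMBER(Kind.NAMED_ENTITY), LOCATION(Kind.NAMED_ENTITY),
        PERSON(Kind.NAMED_ENTITY), ORGANIZATION(Kind.NAMED_ENTITY),
        MONEY(Kind.NAMED_ENTITY), PERCENT(Kind.NAMED_ENTITY),
        DATE(Kind.NAMED_ENTITY), TIME(Kind.NAMED_ENTITY),
        ADJECTIVE(Kind.PART_OF_SPEECH), ADVERB(Kind.PART_OF_SPEECH),
        NOUN(Kind.PART_OF_SPEECH), VERB(Kind.PART_OF_SPEECH);

        final Kind kind;
        NlpConstant(Kind kind) { this.kind = kind; }
    }

    public static void main(String[] args) {
        System.out.println(NlpConstant.PERSON.kind);  // prints NAMED_ENTITY
        System.out.println(NlpConstant.NOUN.kind);    // prints PART_OF_SPEECH
    }
}
```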

Common usages of the Stanford NLP package:

1. Named Entity Recognition: for example, names (PERSON, LOCATION, 
ORGANIZATION, MISC), numbers (MONEY, NUMBER, ORDINAL, PERCENT), and temporal 
expressions (DATE, TIME, DURATION, SET).
2. Lemmatization: reduce a word to its base form.
3. Part-of-Speech tagging: determine whether a word is a noun, verb, 
adjective, etc.
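In the Stanford CoreNLP pipeline API, the three usages above correspond to annotators selected through a `Properties` object. A minimal sketch of the relevant configuration (the `annotators` key is the standard CoreNLP property name):

```properties
# Annotators covering the three usages above.
# ner requires tokenize, ssplit, pos, and lemma upstream in the pipeline.
annotators = tokenize, ssplit, pos, lemma, ner
```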

## Presentation Slides

4/11/2016 Presentation: [Project 
Overview](https://docs.google.com/presentation/d/1vB-UmBq4jgRclfOiAJStLb2ZhP8nx_HaauVZ6F7w11U/edit?usp=sharing)

4/18/2016 Presentation: [StanfordNLP 
introduction](https://docs.google.com/presentation/d/1YFapKofMvNy0wz_hhmunz7mUr5pjlb01LGcmGlnNP_0/edit?usp=sharing)

4/25/2016 Presentation: [Status Report](https://docs.google.com/presentation/d/1ek18Zr0OqQ0RONj8D7W2aSGs9sz1etnf9bEnWTEA2ag/edit?usp=sharing)

## Performance Test

Machine: MacBook Pro (Late 2015), Intel Core i5, SSD, 8 GB memory.

* Data set: 100k Medline records, about 150 MB
* Performance results (average time reported in seconds):

|              | All Named Entities | Part of Speech |
| ------------ | ------------------ | -------------- |
| NlpExtractor | 2937 s             | 209 s          |

* On average: about 34 docs/sec for Named Entity Recognition and about 480 
docs/sec for Part-of-Speech tagging.

* Data set: 1M Medline records, about 1.5 GB

|              | All Named Entities | Part of Speech |
| ------------ | ------------------ | -------------- |
| NlpExtractor | Too slow           | 2110 s         |

* On average, about 474 docs/sec for Part-of-Speech tagging (1M records in 
2110 s). Named Entity Recognition was too slow to complete on this data set.


## TODOs

* According to the performance test, Named Entity extraction runs very slowly, 
and future optimization is needed. One possible reason is that MEDLINE records 
have many fields, and we use the NLP package to process one field at a time. 
If a record has 10 fields and we want to extract information from all of them, 
we run 10 NLP pipeline passes per record, which takes a lot of time. One way 
to improve this is to concatenate those fields into a single text and run only 
one pipeline pass over it.
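The concatenation idea above needs one extra piece to stay correct: extracted spans must map back to the field they came from. A minimal self-contained sketch of that bookkeeping (the class and method names are hypothetical, not Texera code):

```java
// Sketch of the proposed optimization: join a record's fields with a
// separator so a single NLP pass covers them all, recording each field's
// start offset so a span in the joined text can be mapped back to its field.
public class FieldConcatenator {
    // Separator assumed not to occur inside a field's text.
    static final String SEP = "\n";

    // Join all fields into one text for a single pipeline pass.
    static String join(String[] fields) {
        return String.join(SEP, fields);
    }

    // Start offset of each field within the joined text.
    static int[] offsets(String[] fields) {
        int[] starts = new int[fields.length];
        int pos = 0;
        for (int i = 0; i < fields.length; i++) {
            starts[i] = pos;
            pos += fields[i].length() + SEP.length();
        }
        return starts;
    }

    // Index of the field containing the given character offset.
    static int fieldOf(int[] starts, int offset) {
        for (int i = starts.length - 1; i >= 0; i--) {
            if (offset >= starts[i]) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] fields = {"Title text", "Abstract text"};
        int[] starts = offsets(fields);
        // "Abstract text" starts at offset 11 in the joined string.
        System.out.println(fieldOf(starts, 11));  // prints 1
    }
}
```

This trades one pipeline initialization per field for a single pass plus an offset lookup, which is where the hoped-for speedup would come from.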

## Stanford NLP Package License
[GNU General Public License](http://www.gnu.org/licenses/gpl-2.0.html)



GitHub link: https://github.com/apache/texera/discussions/3971
