Dear all,

Rupert and I have been working on porting some of our OpenNLP-based natural
language processing to Apache Stanbol. While it is not yet completely finished,
we decided it might be worthwhile for you all to have a look at it and maybe
even contribute. I will try to briefly summarise the goals and the current
state of the implementation:

Goals
=====

1. provide a modular infrastructure for NLP-related things

Many tasks in NLP can be computationally intensive, and there is no "one size
fits all" NLP approach when analysing text. Therefore, we wanted an NLP
infrastructure that can be configured and wired together as needed for the
specific use case, with several specialised modules that can build upon each
other but many of which are optional.

2. provide a unified data model for representing NLP text annotations

In many scenarios, it will be necessary to implement custom engines that build
on the results of a previous "generic" analysis of the text (e.g. POS tagging
and chunking). For example, in one project we identify so-called "noun
phrases", use a lemmatizer to build the base form, and then convert this to
the nominative singular form to get a grammatically correct label for use in a
tag cloud. Most of this builds on generic NLP functionality, but the last step
is very specific to the use case.

Therefore, we also wanted to implement a generic NLP data model that allows
representing text annotations attached to individual words or to spans of
words.
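To make the idea concrete, a minimal span-annotation model could look like the
following. This is an illustrative Python sketch only, not the actual Stanbol
API; all class and method names here are invented:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    # character offsets into the analysed text
    start: int
    end: int
    # arbitrary annotations keyed by type, e.g. "pos", "lemma", "sentiment"
    annotations: dict = field(default_factory=dict)

@dataclass
class AnalyzedText:
    text: str
    spans: list = field(default_factory=list)

    def add_span(self, start, end, **annotations):
        span = Span(start, end, dict(annotations))
        self.spans.append(span)
        return span

    def spans_with(self, key):
        # all spans carrying a given annotation type
        return [s for s in self.spans if key in s.annotations]

# a POS tagger would annotate individual tokens ...
at = AnalyzedText("red cars")
at.add_span(0, 3, pos="JJ")
at.add_span(4, 8, pos="NNS")
# ... while a chunker would annotate a span covering several tokens
at.add_span(0, 8, chunk="NP")
```

The point is simply that both word-level and span-level annotations live in
the same structure, so later engines can query whatever an earlier engine
produced.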


Current State
=============

Rupert has implemented a first version of the unified data model. He has
tested it thoroughly, and it is reliable and useful for the scenarios we had
in mind. The current enhancement engines use OpenNLP for analysis, but the
model can in general be used by any NLP engine that associates tags with
tokens or spans of tokens.

In the meantime, I have concentrated on implementing modules for different NLP
tasks. The following modules are already finished:

- POS Tagger: takes text/plain from a content item and stores an AnalyzedText
content part in the content item where each token is assigned its grammatical
POS tag
- Chunker (Noun Phrase Detector): takes a content item with AnalyzedText 
content part (from POS tagger) and applies noun phrase chunking on the token 
stream; results are annotated token spans that are stored in the AnalyzedText
- Sentiment Analyzer (English/German): takes a content item with AnalyzedText 
content part (from POS tagger) and assigns sentiment values to each token in 
the stream; results are annotated tokens that are stored in the AnalyzedText

In progress:
- Lemmatizer (English/German): takes a token stream (POS tagged AnalyzedText) 
and adds the lemma for each token to the AnalyzedText content part


Future work
===========

Based on these generic modules, we intend to implement a number of "NLP result 
summarizers" that take the results in an AnalyzedText and perform some post 
processing on them, storing them as RDF in the metadata associated with the 
content item. Some ideas:
- Average Sentiment: compute the average sentiment value for the text by 
summing all sentiment values and dividing them by the number of annotated tokens
- Improved Sentiment: take into account negations in a sentence before a 
sentiment value and invert the values in this case; otherwise like average 
sentiment.
- Per-Noun Sentiment: associate sentiment values with each noun occurring in 
the text by taking into account the sentiment values of adjectives associated 
with the noun in a noun phrase and negations before them; results are text 
annotations where each noun is associated with a sentiment value, so you could 
say "Product XYZ is typically mentioned with an average sentiment of 0.N"
- Noun Adjectives: collect the adjectives that are commonly used in association 
with a noun by using the noun phrases and taking the adjectives
- Simple Tag Cloud: take nouns, build lemmatized form, generate a tag cloud in 
the metadata
- Noun Phrase Cloud: take noun phrases, build lemmatized form, build nominative 
singular form, generate tag cloud; this is useful when you want to provide more 
context for the tags, e.g. in faceted search ("red car", "blue car").
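The first two summarizer ideas can be sketched in a few lines. Again, the
function names, the (word, sentiment) token format, and the negation word list
are assumptions for illustration; the real summarizers would read sentiment
annotations from the AnalyzedText content part:

```python
NEGATIONS = {"not", "never", "no"}  # assumed negation word list

def average_sentiment(tokens):
    """Average the sentiment values of all annotated tokens.

    `tokens` is a list of (word, sentiment-or-None) pairs.
    """
    values = [s for _, s in tokens if s is not None]
    return sum(values) / len(values) if values else 0.0

def improved_sentiment(tokens):
    """Like average_sentiment, but invert a sentiment value when a
    negation word precedes it in the same sentence."""
    values, negated = [], False
    for word, sentiment in tokens:
        if word in {".", "!", "?"}:        # sentence boundary resets negation
            negated = False
        elif word.lower() in NEGATIONS:
            negated = True
        if sentiment is not None:
            values.append(-sentiment if negated else sentiment)
    return sum(values) / len(values) if values else 0.0

tokens = [("this", None), ("is", None), ("not", None), ("good", 0.8), (".", None)]
print(average_sentiment(tokens))   # 0.8
print(improved_sentiment(tokens))  # -0.8
```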

The possibilities are practically endless… feel free to think about other 
options :)


Availability
============

Since this is still experimental code, we have for the time being set up a 
separate (public) repository:

https://bitbucket.org/srfgkmt/stanbol-nlp

When it is more or less finished, however, we would like to include it in the 
main Stanbol code base so others can benefit from it more easily. Feel free to 
look at what we have implemented there!

;-)

Sebastian
-- 
| Dr. Sebastian Schaffert          [email protected]
| Salzburg Research Forschungsgesellschaft  http://www.salzburgresearch.at
| Head of Knowledge and Media Technologies Group          +43 662 2288 423
| Jakob-Haringer Strasse 5/II
| A-5020 Salzburg
