Dear all,

Rupert and I have been working on porting some of our OpenNLP-based natural language processing to Apache Stanbol. While it is not yet completely finished, we decided it might be worthwhile for you all to have a look at it and maybe even contribute. I will briefly summarise the goals and the current state of the implementation:
Goals
=====
1. provide a modular infrastructure for NLP-related things
Many tasks in NLP can be computationally intensive, and there is no "one size
fits all" NLP approach when analysing text. We therefore wanted an NLP
infrastructure that can be configured and wired together as needed for the
specific use case, with several specialised modules that can build upon each
other but many of which are optional.
2. provide a unified data model for representing NLP text annotations
In many scenarios, it will be necessary to implement custom engines that build
on the results of a previous "generic" analysis of the text (e.g. POS tagging
and chunking). For example, in one of our projects we identify so-called "noun
phrases", use a lemmatizer to build the base form, and then convert this to
nominative singular form to obtain a grammatically correct label for use in a
tag cloud. Most of this builds on generic NLP functionality, but the last step
is very specific to the use case.
We therefore also wanted to implement a generic NLP data model that allows
representing text annotations attached to individual words or to spans of
words.
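To make the idea concrete, here is a minimal sketch of such a span-based annotation model. All class and method names here are illustrative assumptions, not the actual Stanbol API: a Span covers a character range of the text and can carry arbitrary annotations (POS tag, lemma, sentiment, ...), whether it is a single token or a multi-word phrase.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a span-based annotation model (illustrative names,
 *  not the real Stanbol API). */
public class AnnotationModelSketch {

    /** A character range of the text carrying arbitrary annotations. */
    static class Span {
        final int start, end;                       // character offsets into the text
        final Map<String, Object> annotations = new HashMap<>();
        Span(int start, int end) { this.start = start; this.end = end; }
        void annotate(String key, Object value) { annotations.put(key, value); }
        Object get(String key) { return annotations.get(key); }
    }

    /** The analysed text plus all spans (tokens, chunks, sentences, ...). */
    static class AnalyzedText {
        final String text;
        final List<Span> spans = new ArrayList<>();
        AnalyzedText(String text) { this.text = text; }
        Span addSpan(int start, int end) {
            Span s = new Span(start, end);
            spans.add(s);
            return s;
        }
        String textOf(Span s) { return text.substring(s.start, s.end); }
    }

    public static void main(String[] args) {
        AnalyzedText at = new AnalyzedText("red car");
        at.addSpan(0, 3).annotate("pos", "JJ");     // token "red"
        at.addSpan(4, 7).annotate("pos", "NN");     // token "car"
        Span phrase = at.addSpan(0, 7);             // span covering both tokens
        phrase.annotate("chunk", "NP");
        System.out.println(at.textOf(phrase) + " -> " + phrase.get("chunk"));
    }
}
```

The point of the design is that engines only agree on the span/annotation contract, so a chunker or sentiment engine can consume the spans a POS tagger produced without knowing which library created them.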
Current State
=============
A first version of the unified data model has been implemented by Rupert. He
has tested it thoroughly, and it has proven reliable and useful for the
scenarios we had in mind. The current enhancement engines use OpenNLP for the
analysis, but the model can in general be used by any NLP engine that
associates tags with tokens or spans of tokens.
I have in the meantime concentrated on implementing modules for different NLP
tasks. The following modules are already finished:
- POS Tagger: takes text/plain from a content item and stores an AnalyzedText
content part in the content item where each token is assigned its
part-of-speech (POS) tag
- Chunker (Noun Phrase Detector): takes a content item with an AnalyzedText
content part (from the POS tagger) and applies noun phrase chunking to the
token stream; results are annotated token spans that are stored in the
AnalyzedText
- Sentiment Analyzer (English/German): takes a content item with AnalyzedText
content part (from POS tagger) and assigns sentiment values to each token in
the stream; results are annotated tokens that are stored in the AnalyzedText
In progress:
- Lemmatizer (English/German): takes a token stream (POS tagged AnalyzedText)
and adds the lemma for each token to the AnalyzedText content part
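To illustrate how the chunker stage builds on the POS tagger's output, here is a deliberately simplified stand-in for noun phrase detection. The real engine uses OpenNLP's trained chunker model; this sketch just groups determiner/adjective runs that end in one or more nouns, working over parallel token/tag arrays such as a POS tagger would produce.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, rule-based stand-in for the OpenNLP noun phrase chunker. */
public class NounPhraseSketch {

    /** Returns the noun phrases found in parallel token/POS-tag arrays. */
    static List<String> nounPhrases(String[] tokens, String[] tags) {
        List<String> phrases = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean sawNoun = false;
        for (int i = 0; i < tokens.length; i++) {
            String tag = tags[i];
            if (tag.startsWith("NN")) {             // noun: part of the phrase
                current.add(tokens[i]);
                sawNoun = true;
            } else if (!sawNoun && (tag.equals("DT") || tag.startsWith("JJ"))) {
                current.add(tokens[i]);             // determiner/adjective before the noun
            } else {                                // phrase (if any) ends here
                if (sawNoun) phrases.add(String.join(" ", current));
                current.clear();
                sawNoun = false;
            }
        }
        if (sawNoun) phrases.add(String.join(" ", current));
        return phrases;
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "red", "car", "drives", "fast"};
        String[] tags   = {"DT", "JJ", "NN", "VBZ", "RB"};
        System.out.println(nounPhrases(tokens, tags)); // [the red car]
    }
}
```

In the actual engines the input and output would of course be spans in the AnalyzedText content part rather than plain arrays.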
Future work
===========
Based on these generic modules, we intend to implement a number of "NLP result
summarizers" that take the results in an AnalyzedText, perform some
post-processing on them, and store the outcome as RDF in the metadata
associated with the content item. Some ideas:
- Average Sentiment: compute the average sentiment value for the text by
summing all sentiment values and dividing the sum by the number of annotated
tokens
- Improved Sentiment: like Average Sentiment, but take negations into account:
if a negation precedes a sentiment value in a sentence, invert that value.
- Per-Noun Sentiment: associate sentiment values with each noun occurring in
the text by taking into account the sentiment values of adjectives associated
with the noun in a noun phrase, and negations before them; the result is text
annotations where each noun is associated with a sentiment value, so you could
say "Product XYZ is typically mentioned with an average sentiment of 0.N"
- Noun Adjectives: collect the adjectives that are commonly used in association
with a noun by extracting the adjectives from its noun phrases
- Simple Tag Cloud: take nouns, build lemmatized form, generate a tag cloud in
the metadata
- Noun Phrase Cloud: take noun phrases, build the lemmatized form, build the
nominative singular form, and generate a tag cloud; this is useful when you
want to provide more context for the tags, e.g. in faceted search ("red car",
"blue car").
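The first two summarizer ideas above can be sketched in a few lines. This is only an illustration of the intended post-processing, not the planned implementation: per-token sentiment values would really come from the sentiment annotations in the AnalyzedText, and the negation word list here is a made-up placeholder.

```java
/** Hypothetical sketch of the "Average Sentiment" and "Improved Sentiment"
 *  summarizers. Tokens and values are passed in directly; in Stanbol they
 *  would be read from the AnalyzedText content part. */
public class SentimentSketch {

    /** Plain average over all sentiment-bearing tokens (a value of 0 means
     *  "no sentiment annotation" in this sketch and is not counted). */
    static double averageSentiment(double[] values) {
        double sum = 0;
        int count = 0;
        for (double v : values) {
            if (v != 0) { sum += v; count++; }
        }
        return count == 0 ? 0 : sum / count;
    }

    /** "Improved" variant: a negation word flips the sign of the next
     *  sentiment-bearing token. The negation list is a placeholder. */
    static double improvedSentiment(String[] tokens, double[] values) {
        double sum = 0;
        int count = 0;
        boolean negated = false;
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals("not") || tokens[i].equals("no")) {
                negated = true;                     // remember until the next value
            } else if (values[i] != 0) {
                sum += negated ? -values[i] : values[i];
                count++;
                negated = false;
            }
        }
        return count == 0 ? 0 : sum / count;
    }

    public static void main(String[] args) {
        String[] tokens = {"this", "is", "not", "a", "good", "product"};
        double[] values = {0, 0, 0, 0, 0.8, 0};
        System.out.println(averageSentiment(values));          // 0.8
        System.out.println(improvedSentiment(tokens, values)); // -0.8
    }
}
```

The example shows why the improved variant matters: the plain average rates "not a good product" as positive, while the negation-aware version inverts it.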
The possibilities are endless… feel free to think about other options :)
Availability
============
Since this is still experimental code, we have for the time being set up a
separate (public) repository:
https://bitbucket.org/srfgkmt/stanbol-nlp
When it is more or less finished, we would however like to include it in the
main Stanbol code base so others can benefit from it more easily. Feel free to
look at what we have implemented there!
;-)
Sebastian
--
| Dr. Sebastian Schaffert [email protected]
| Salzburg Research Forschungsgesellschaft http://www.salzburgresearch.at
| Head of Knowledge and Media Technologies Group +43 662 2288 423
| Jakob-Haringer Strasse 5/II
| A-5020 Salzburg