[ 
https://issues.apache.org/jira/browse/STANBOL-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276729#comment-13276729
 ] 

Rupert Westenthaler commented on STANBOL-614:
---------------------------------------------

There are several reasons why this could happen:

1) Language detection: For short texts, the correct language sometimes cannot 
be detected. In those cases, Enhancement Engines that depend on that 
information (e.g. openNLP-NER) will not work.
2) NER (Named Entity Recognition): Entities that are mentioned in parts of a 
text that are not full sentences are more likely to be overlooked.

If you send HTML to Apache Stanbol, it uses Apache Tika to convert the HTML 
to text. You can ask Stanbol to return the converted text, e.g. by making a 
request such as

    curl -v -X POST -H "Accept: text/plain" \
         -H "Content-Type: text/html;charset=utf-8" \
         --data-binary @test \
         "http://dev.iks-project.eu:8081/enhancer?omitMetadata=true"

You can also request the metadata and all content elements (parsed and 
converted) by a request like

    curl -v -X POST -H "Accept: multipart/form-data" \
         -H "Content-Type: text/html;charset=utf-8" \
         --data-binary @test \
         "http://dev.iks-project.eu:8081/enhancer?outputContent=*/*&rdfFormat=application/json"

This can help a lot with debugging.
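
For anyone who prefers scripting these debugging requests over typing curl 
commands, here is a minimal Python sketch that builds the same two requests. 
The helper name build_enhancer_request is hypothetical, and wrapping the 
sample sentence from the issue in a <p> element is an assumption; the 
endpoint, headers and query parameters are the ones used in the curl 
commands. The sketch only constructs the requests and does not send them.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_enhancer_request(endpoint, html, accept="text/plain", **params):
    """Build (but do not send) a POST request that submits HTML to the enhancer.

    Hypothetical helper: `params` become the query string, `accept` selects
    what the server is asked to return.
    """
    url = endpoint + ("?" + urlencode(params) if params else "")
    headers = {
        "Accept": accept,
        "Content-Type": "text/html;charset=utf-8",
    }
    return Request(url, data=html.encode("utf-8"), headers=headers, method="POST")


ENDPOINT = "http://dev.iks-project.eu:8081/enhancer"
# Sample sentence from the issue; the <p> wrapper is an assumption.
html = "<p>So, where does the Syria conflict stand now? CNN</p>"

# Ask only for the plain text produced by the HTML-to-text conversion.
text_only = build_enhancer_request(ENDPOINT, html, omitMetadata="true")

# Ask for the metadata plus all content parts in one multipart response.
everything = build_enhancer_request(
    ENDPOINT,
    html,
    accept="multipart/form-data",
    outputContent="*/*",
    rdfFormat="application/json",
)
```

Passing one of these objects to urllib.request.urlopen() would send it. Note 
that urlencode percent-encodes "*/*" in the query string, which the server is 
expected to decode.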

In general: If you send very short documents to the Enhancer, I would advise 
using the "KeywordLinkingEngine" instead of the combination of 
"NamedEntityExtractionEnhancementEngine" and "NamedEntityTaggingEngine".

However, if you use the "KeywordLinkingEngine" in combination with dbpedia as 
a vocabulary, you might need to filter results based on the "fise:entity-type" 
(e.g. ignoring all "fise:EntityAnnotation"s that do not have a value for 
"fise:entity-type").

An example of an Enhancement Chain configured to use the KeywordLinkingEngine 
with dbpedia can be found at [1].

best
Rupert


[1] http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-keyword

                
> Enhancer returns inconsistent results
> -------------------------------------
>
>                 Key: STANBOL-614
>                 URL: https://issues.apache.org/jira/browse/STANBOL-614
>             Project: Stanbol
>          Issue Type: Bug
>          Components: Enhancer
>    Affects Versions: 0.10.0-incubating
>         Environment: Debian squeeze Linux 2.6.32-5-amd64 SMP x86_64
> java version "1.6.0_26"
> Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
>            Reporter: Nosiert Batiste
>              Labels: newbie
>
> I'm trying to implement a tag suggestion feature in a document editing 
> application. I'm using the stanbol enhancer to get EntityAnnotations for a 
> piece of HTML.
> This works great most of the time, but sometimes no results are returned. The 
> difference between the text for which results are returned, and the text  for 
> which no results are returned is sometimes only a single character.
> I was able to reduce one case down to an additional  .
> With the following text, the enhancer returns an EntityAnnotation for Syria, 
> but not for CNN:
>     So, where does the Syria conflict stand now? CNN 
> With the following text, the enhancer returns EntityAnnotations for both 
> Syria and CNN:
>   So, where does the Syria conflict stand now? CNN 
> I post the text with the following command (where @test refers to the file 
> that contains the text):
> curl -v -X POST -H "Accept: application/json" -H "Content-Type: 
> text/html;charset=utf-8" --data-binary @test "http://localhost:8086/enhancer";
> I checked out stanbol from svn
> $ svnversion .
> 1337074
> and started it with the following command line
> java -Xmx1g -jar 
> launchers/full/target/org.apache.stanbol.launchers.full-0.10.0-incubating-SNAPSHOT.jar
>  -p 8086
> I will try to work around this problem by simply converting everything to 
> plain text.


        
