Create a maven artifact to embed all the default stanbol models data
--------------------------------------------------------------------

                 Key: STANBOL-90
                 URL: https://issues.apache.org/jira/browse/STANBOL-90
             Project: Stanbol
          Issue Type: New Feature
            Reporter: Olivier Grisel
            Assignee: Olivier Grisel


To make Stanbol useful, especially in offline mode, it needs some statistical
models and entity / topic indices. Those indices can be huge (several GB for
all the entities of DBpedia and GeoNames, for instance) and hence cannot be
packaged as part of the default distribution. However, it is very desirable to
embed some default statistical models and indices:

- OpenNLP sentence detector for English
- OpenNLP name finder models for English for organizations, people, places
- Solr index for the top 10000 most popular entities (of type organizations,
people, places) as measured by the number of incoming links in the Wikipedia
article graph.
- Solr index for the top 1000 most popular topics, as measured by the number
of Wikipedia articles categorized in this category or one of its
subcategories.
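
The "top N by incoming links" selection above could be sketched as follows.
This is only an illustration, not the actual indexing code: the function name
and the (source, target) pair format are assumptions, and a real run would
read the link graph from a Wikipedia or DBpedia dump rather than from a list
in memory.

```python
import heapq
from collections import Counter


def top_entities_by_incoming_links(links, n):
    """Return the n most linked-to entities as (entity, count) pairs.

    `links` is an iterable of (source, target) pairs; this input format is
    an assumption made for the sketch.
    """
    # Count how many links point at each target entity.
    incoming = Counter(target for _source, target in links)
    # Highest count first; ties are broken by entity name so the result
    # is deterministic.
    return heapq.nsmallest(
        n, incoming.items(), key=lambda item: (-item[1], item[0])
    )


if __name__ == "__main__":
    sample_links = [
        ("A", "Paris"), ("B", "Paris"), ("D", "Paris"),
        ("A", "IBM"), ("C", "IBM"),
        ("A", "Barack_Obama"),
    ]
    for entity, count in top_entities_by_incoming_links(sample_links, 2):
        print(entity, count)
```

The same ranking would apply to topics by counting articles per category
instead of incoming links per entity.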

The goal is to keep that maven artifact smaller than 100 MB (ideally even
smaller) so that it does not create a big barrier to entry for people
downloading the default distribution of Stanbol.
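
A size check along these lines could guard the 100 MB budget when the data
files are assembled. It is a hedged sketch: the helper names are hypothetical,
and reading "100 MB" as 100 * 1024 * 1024 bytes is an assumption.

```python
import os

# Assumption: "100 MB" is read as 100 * 1024 * 1024 bytes.
MAX_ARTIFACT_BYTES = 100 * 1024 * 1024


def total_size(paths):
    """Return the combined size in bytes of the given files."""
    return sum(os.path.getsize(path) for path in paths)


def within_budget(paths, limit=MAX_ARTIFACT_BYTES):
    """Return True if the files together fit within the size limit."""
    return total_size(paths) <= limit
```

Note that this measures the uncompressed files; the packaged jar would
usually be smaller, so the check errs on the conservative side.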

To avoid slowing down the svn repo, those data files will not be put under
version control; only the pom.xml and a script to rebuild the artifact from a
previous version of the jar will be.



        
