Re: Last call for participation (24-25. Sept): MLODE - Multilingual Linked Open Data for Enterprises

2012-09-21 Thread Rupert Westenthaler
Hi all,

Apache Stanbol will be represented at this workshop. AFAIK Fabian and
myself will participate. There will be also two Stanbol related
presentations and hopefully also a lot of discussions especially about
the new Stanbol NLP processing (STANBOL-733).

Hope to see you in Leipzig next week.

best
Rupert

On Fri, Sep 21, 2012 at 1:53 PM, Sebastian Hellmann
 wrote:
> ##Apologies for cross-posting##
> On September 23-24-25, the Multilingual Linked Open Data for Enterprises
> Workshop (MLODE) will happen in Leipzig, Germany and is co-located with SABRE
> and
> the Leipziger Semantic Web Day. Please find all information here:
> http://sabre2012.infai.org/mlode
>
> #News #
> * See the people attending the conference in our people viewer (add
> yourself, if
> you are attending) - http://mlode.nlp2rdf.org/people/view.html
> * In parallel to the code-a-thon there will be an Apache Stanbol and Linked
> Media Framework Tutorial from 9 am to 12:30 pm (please join no later than 10
> am)
> and a LOD2 Stack Tutorial at 2 pm -
> http://wiki.aksw.org/Events/2012/LeipzigerSemanticWebDay/Tutorien
> * Twitter tag #mlode
> * Program published - http://tinyurl.com/mlode-schedule
> * Please apply for lightning talks here: mlode2012 -at-
> lists.informatik.uni-leipzig.de
> * Don't forget to send your submission for the Monnet Challenge to John McCrae -
> to win up to 600 Euro - http://sabre2012.infai.org/mlode/monnet-challenge
> * Code-a-thon: We will provide support and assistance for developers new to
> RDF
> * If you arrive on Sunday, you can join us for the zero day, where we
> brainstorm
> for the code-a-thon: Leipziger Zoo at 10 am and the bar Kicker IN at 7 pm
> * The workshop is accompanied by data post proceeding Special Issue in the
> Semantic Web Journal -
> http://www.semantic-web-journal.net/blog/call-multilingual-linked-open-data-mlod-2012-data-post-proceedings
>
>
>
> We would like to thank our sponsors for supporting the workshop:
> * The MultilingualWeb-LT Working Group -
> http://www.w3.org/International/multilingualweb/lt/
> * The Interactive Knowledge Stack (IKS) EU Research Project -
> http://www.iks-project.eu/
> * The Monnet Project - http://www.monnet-project.eu/
>
>
> We all hope to see you there,
> Sebastian Hellmann and  Steven Moran
> on behalf of the whole MLODE organisation committee
>
>
> --
> Dipl. Inf. Sebastian Hellmann
> Department of Computer Science, University of Leipzig
> Events:
> * http://sabre2012.infai.org/mlode (Leipzig, Sept. 23-24-25, 2012)
> * http://wole2012.eurecom.fr (*Deadline: July 31st 2012*)
> Projects: http://nlp2rdf.org , http://dbpedia.org
> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
> Research Group: http://aksw.org



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Update CELI engines to use Stanbol NLP processing

2012-09-21 Thread Rupert Westenthaler
Hi,

I forgot to include the dev list in my last response to Alessio, hence the forward.

On Fri, Sep 21, 2012 at 3:22 PM, Rupert Westenthaler
 wrote:
> Hi Alessio,
>
> On Fri, Sep 21, 2012 at 12:51 PM, Alessio Bosca  wrote:
>> We are surely willing to contribute to the development of the engines and I
>> will work on the requested modifications for supporting the  AnalyzedText
>> content part.
>
> That's cool to hear. I already started some things. I will commit those
> later today so that you can continue from there.
>
>> We will also provide you a mapping for the POS tagset and the other lexical
>> features.
>
> If documentation of the POS tag sets is available, it would
> be cool if you could link it. When I commit my local changes there
> will be a "PosTagSetRegistry" in
> "org.apache.stanbol.enhancer.engines.celi" where you can add the
> mappings.
>
>>I will check with the team responsible for the morphological
>> analyzer about the confidence level or the ranking of multiple readings as
>> I'm not sure about that.
>>
>> Concerning the missing readings for some lexical entries it is because the
>> unrecognized term are not present in the lexicon of the morphological
>> analyzer; they are "unknown" words so to say.
> It happens with misspelled words or unknown named entities. It is possible to
> explicitly set a POS "Unknown" lexical feature for them, if you wish so, but
> there are no lexical features retrieved by the morphological analyzer itself.
>> Let me know if you want this update as well.
> Calling the named entities engine for Italian may be an alternative way for
> getting more info on those textual fragments.
>>
>
> OK, that explains a lot. I had the impression that there is first a POS
> tagger and then a morphological analyzer uses those results to provide
> the lemmas and other information. If the morphological analyzer adds
> possible lemmas based on a lexicon lookup of the words, I would expect that there are no
> results for some words and also that there are multiple readings for
> others.
>
> Does linguagrid also have a POS tagging service?
>
>> I will send you an update next week as soon as I finished to integrate the
>> updates
>>
>
> I am in Leipzig next week, so I might not be as responsive as usual.
>
> best
> Rupert
>
>>
>> Bests
>> Alessio
>>
>>
>> On 09/21/2012 09:16 AM, Rupert Westenthaler wrote:
>>>
>>> Hi Alessio, all
>>>
>>> I have started to work on the migration of the CELI lemmatizer Engine
>>> to the new Stanbol NLP processing module (STANBOL-733, STANBOL-738).
>>> Basically the Idea was to adapt the Lemmatizer Engine to use the
>>> AnalysedText ContentPart (STANBOL-734) to store its result. The goal
>>> of this work is being able to use word level NLP analyses result of
>>> CELI in Apache Stanbol (e.g. CELI POS tags and lemma information for
>>> looking up terms with the KeywordLinkingEngine). Achieving this would
>>> open up a lot of additional possibilities for Stanbol Users that want
>>> to use the CELI services.
>>>
>>> While working on this I came across the following things:
>>>
>>> (1) I recognized that the Lemmatizer Service does not provide
>>> information for all Words (LexicalEntry). As an example in the
>>> sentence
>>>
>>>  Lo scandalo dei fondi pubblici sperperati in allegria dalla Regione
>>>  Lazio ha dato i primi frutti: ieri il capogruppo Pdl Francesco
>>> Battistoni
>>>  si è dimesso e la sede del Consiglio è stata invasa dalla Guardia
>>> di Finanza.
>>>
>>> the LexicalEntries for "Pdl Francesco Battistoni si" do not have any
>>> metadata (no ). Do you know why this is the case? Is there a
>>> possibility to obtain LexicalFeatures for all words?
>>>
>>> (2) The Stanbol NLP processing module maps POS tag sets used by NLP
>>> processing frameworks to Morphosyntactic Categories defined by the
>>> OLIA ontology [1]. Used categories are defined by the LexicalCategory
>>> enumeration [2]. Actual POS tags are represented by the PosTag class
>>> [3] that provides (1) the tag as string and optionally (2) the
>>> LexicalCategory. While LexicalCategories are optional they are
>>> important as they allow other components to determine the type of a
>>> word in a language-independent way. Because of that it would be
>>> important to map the POS tag sets used by CELI to the
>>> LexicalCategories used b

Re: SVN moved - you have to switch

2012-09-21 Thread Rupert Westenthaler
Hi

On Fri, Sep 21, 2012 at 8:13 PM, Alessandro Adamou  wrote:

> Sure it isn't something like
>
>   svn switch https://svn.apache.org/repos/asf/stanbol/trunk .
> from within the working copy dir?
>
> This will preserve uncommitted changes, right?

I used exactly the command you proposed. It preserves uncommitted
changes, but it also does an svn up, so you might get conflicts. I had no
problems with the trunk and branches; only the Stanbol webpage created
problems because of some folder changes needed by the move to its own
sub-domain.

best
Rupert

-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Trunk Console Changes

2012-09-23 Thread Rupert Westenthaler
at
>> >
>> >
>> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:926)
>> > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>> > at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>> > at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>> > at
>> >
>> >
>> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>> > at org.mortbay.thread.QueuedThreadPool
>> >
>> > TIA
>> >
>> > Dave
>> > **
>> >
>>
>
>
>
> --
> Regards
>
> Dave Butler
> butlerdi-at-pharm2phork-dot-org
>
> Also on Skype as pharm2phork
>
> Get Skype here http://www.skype.com/download.html
>
>
> **
> This email and any files transmitted with it are confidential and
> intended solely for the use of the individual or entity to whom they
> are addressed. If you have received this email in error please notify
> the system manager.
>
> This footnote also confirms that this email message has been swept by
> MIMEsweeper for the presence of computer viruses.
>
> www.mimesweeper.com
> **



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: failing build because CELI NER test cannot reach external server

2012-09-25 Thread Rupert Westenthaler
Hi,

I can confirm that

1. the Integration tests do run in OfflineMode
2. that the CELI engines do support OfflineMode and therefore
deactivate themselves if stanbol is started in OfflineMode

For unit tests it is a real dilemma: on the one hand, tests failing
because of non-functional (or non-reachable) external services are
inconvenient; on the other hand, without running those tests with the
normal tests there is a good chance that we will miss API (and data)
changes of those external services. E.g. some time ago the confidence
values provided by geonames.org changed from [0..1] to [0..*]. This
would not have been detected without running those tests with every
build.

So what I typically do is catch IOExceptions that typically result
from unreachable (or timed-out) calls to external services. This ensures
that tests do not fail if the external service is down - or the
developer does not have an internet connection - while still ensuring
that the Stanbol engines are validated against the external services
with every build.
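
To illustrate the pattern with a minimal sketch (hypothetical JUnit 4 code - the
service URL and the helper method are placeholders, not the actual CELI test code):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;

    import org.junit.Assume;
    import org.junit.Test;
    import static org.junit.Assert.assertFalse;

    public class RemoteServiceEngineTest {

        @Test
        public void testAgainstRemoteService() throws Exception {
            String response;
            try {
                response = callRemoteService("http://example.org/some-remote-nlp-service");
            } catch (IOException e) {
                // service down, timed out or no internet connection:
                // mark the test as skipped instead of failing the build
                Assume.assumeNoException(e);
                return;
            }
            // only validate the response if the service was actually reachable
            assertFalse(response.isEmpty());
        }

        // placeholder for the actual call to the engine / remote service client
        private String callRemoteService(String serviceUrl) throws IOException {
            URLConnection con = new URL(serviceUrl).openConnection();
            con.setConnectTimeout(5000);
            con.setReadTimeout(5000);
            InputStream in = con.getInputStream();
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                for (int len = in.read(buf); len != -1; len = in.read(buf)) {
                    out.write(buf, 0, len);
                }
                return out.toString("UTF-8");
            } finally {
                in.close();
            }
        }
    }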

I will have a look at what kind of exception caused the failed build and
make sure that those cause tests to be skipped in future builds.

WDYT
Rupert

On Mon, Sep 24, 2012 at 2:28 PM, Reto Bachmann-Gmür  wrote:
> On Mon, Sep 24, 2012 at 1:37 PM, Bertrand Delacretaz > wrote:
>
>> On Mon, Sep 24, 2012 at 12:19 PM, Reto Bachmann-Gmür 
>> wrote:
>> > On Mon, Sep 24, 2012 at 8:54 AM, Bertrand Delacretaz <
>> bdelacre...@apache.org
>> >> On Sunday, September 23, 2012, Reto Bachmann-Gmür wrote:
>> >> > ...I think the tests should not access external services...
>> >> Yes - there's OfflineMode for that.
>> >>
>> > ...The offline mode is for not updating libraries from the remote repos.
>> If I
>> > have the libraries in the local repository I can use the offline mode. If
>> > tests are skipped in offline mode this means that some projects might
>> > build
>> > with -o and fail otherwise...
>>
>> Sorry I was too terse maybe, didn't mean maven's offline mode, but the
>> STANBOL-86 OfflineMode service, which allows you to modify your
>> service's behavior when the system should not make any external
>> requests, which IMO is needed when running our automated tests.
>>
>
> The test that was failing was a unit tests that doesn't use OSGi and thus
> couldn't use STANBOL-86.
>
> For the integration tests STANBOL-86 seems a good approach. If I understand
> STANBOL-87 <https://issues.apache.org/jira/browse/STANBOL-87> correctly the
> integrations tests are always run in offline mode. I think the unit tests
> should always be offline while for the integration tests it might be good
> to have the options to check if things integrate with the outside world.
>
> Cheers,
> Reto
>
>
>> -Bertrand
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: enhancer parameter outputContentPart

2012-09-27 Thread Rupert Westenthaler
Hi Melanie

from http://stanbol.apache.org/docs/trunk/components/enhancer/contentitem.html

There are two types of content parts:

2. Content parts that are registered under a predefined URI. [..] This
is used to
   share intermediate enhancement results between enhancement engines.
   An example would be tokens, sentences, POS tags and chunks that are
   extracted by some NLP engine.

An example of such a content part is the ExecutionMetadata.

from: 
http://stanbol.apache.org/docs/trunk/components/enhancer/executionmetadata.html

When the EnhancementJobManager starts the Enhancement of a ContentItem
it needs to check if the ContentItem already contains ExecutionMetadata in the
ContentPart with the URI


"http://stanbol.apache.org/ontology/enhancer/executionmetadata#ChainExecution";.


If this is the case it needs to initialize itself based on the
pre-existing information.
If no ExecutionMetadata are present, a new EnhancementProcess needs to be
created based on the parsed Chain. Differences between these two cases are
explained in the following two sub-sections.


So one example usage of the "outputContentPart" would be to explicitly
include the ExecutionMetadata within the response of the Stanbol Enhancer.

For this you need to add the parameter


outputContentPart=http://stanbol.apache.org/ontology/enhancer/executionmetadata#ChainExecution

to the request of the StanbolEnhancer.

If you want to include the plain text version of the parsed content in
the response you
need to use

outputContent=text/plain

as the URI of the text/plain content part is dynamically generated based on the
MD5 of the plain text content and can therefore not be known in advance.
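
For example (a minimal, hypothetical sketch assuming a Stanbol instance running
at http://localhost:8080; the sample text is arbitrary and the response is just
printed):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class EnhancerRequestExample {

        public static void main(String[] args) throws Exception {
            String chainExecution =
                    "http://stanbol.apache.org/ontology/enhancer/executionmetadata#ChainExecution";
            // request the ExecutionMetadata content part and the plain text version of the content
            URL url = new URL("http://localhost:8080/enhancer"
                    + "?outputContentPart=" + URLEncoder.encode(chainExecution, "UTF-8")
                    + "&outputContent=" + URLEncoder.encode("text/plain", "UTF-8"));
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestMethod("POST");
            con.setDoOutput(true);
            con.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");
            OutputStream out = con.getOutputStream();
            try {
                out.write("Paris is the capital of France.".getBytes("UTF-8"));
            } finally {
                out.close();
            }
            System.out.println("HTTP " + con.getResponseCode());
        }
    }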

best
Rupert

On Thu, Sep 27, 2012 at 2:15 PM, Melanie Reiplinger
 wrote:
> Hi all,
>
> could someone please briefly clarify what the outputContentPart parameter
> for the enhancer does?
>
> About 'content parts', it says on
> http://stanbol.apache.org/docs/trunk/components/enhancer/contentitem.html:
> Content parts are used to represent the original content as well as
> transformations of the original content (typically created by pre-processing
> enhancement engines
> <http://stanbol.apache.org/docs/trunk/components/enhancer/engines/list.html>
> such as the Metaxa engine
> <http://stanbol.apache.org/docs/trunk/components/enhancer/engines/metaxaengine.html>).
> etc. etc.
>
> In the REST doku, it says:
> outputContentPart=[uri/'*']: This parameter allows to explicitly include
> content parts with a specific URI in the response. Currently this only
> supports ContentParts that are stored as RDF graphs.
>
> Does this mean I'll specify the URI of a content item already present on the
> contenthub? And what is the use of that ?
>
> Thanks,
> Melanie



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Lessons learnt from EAP+ questions about future directions

2012-09-27 Thread Rupert Westenthaler
Hi Mihály

On Tue, Sep 25, 2012 at 9:07 PM, Mihály Héder  wrote:
> Hi All,
>
> I have written a blog post about the lessons learnt from the EAP project I
> had been working on:
> http://blog.iks-project.eu/lessons-learnt-while-working-with-apache-stanbol/
>

Thanks for this blog post. It is really valuable feedback.
I will try to answer some of your questions.

> The reason I'm citing this here is that I'm interested in your opinion on
> the following mid-term development questions and suggestions (discussed in
> detail in the post):
> -What is the best way to monitor a running stanbol instance with
> munin/nagios/icinga, etc? How can I extract e.g. an enhancement/hour
> statistic from stanbol?

Within Apache Stanbol the EnhancementJobManager collects the
ExecutionMetadata [1]. They are stored in a dedicated ContentPart of the
processed ContentItem.

So one possibility would be to add a feature to the EnhancementJobManager that
allows logging this information (or even storing it in an RDF triple store).

If we do that, it would really allow very fine-grained analyses of the requests
processed by the Stanbol Enhancer.


[1] 
http://stanbol.apache.org/docs/trunk/components/enhancer/executionmetadata.html
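
A rough sketch of the idea (hypothetical code, not existing Stanbol
functionality - the class and logger names are made up): the
EnhancementJobManager would write one log line per processed ContentItem, and a
monitoring tool such as munin could derive an enhancements/hour statistic from
that log.

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    /** Hypothetical helper the EnhancementJobManager could call after each request. */
    public class EnhancementStatsLogger {

        private static final Logger STATS = LoggerFactory.getLogger("stanbol.enhancer.stats");

        /** Logs one line per processed ContentItem (URI, chain and processing time). */
        public void logCompleted(String contentItemUri, String chainName, long startTimeMs) {
            long durationMs = System.currentTimeMillis() - startTimeMs;
            STATS.info("enhanced {} with chain '{}' in {} ms",
                    new Object[]{contentItemUri, chainName, durationMs});
        }
    }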

> -I think at some point we should create a standardized REST API through
> which non-java EEs could be accessed.

I am not sure what such an interface should look like. I could think
of an interface that POSTs the current metadata of the ContentItem
to some URI. The results could again be RDF that is then added to the
ContentItem. Maybe one could even allow the definition of some kind of
filter so that not the whole RDF metadata needs to be serialized.

Non-Java EEs that also need the content (e.g. the text/plain Blob)
would need a different kind of interface.

BTW: Serialization/Deserialization of ContentItems is already
implemented (by using multipart mime).

> -Also, I think that if we had some standardized description XML or whatever
> format that would tell what kind of output a certain EE produces, that
> would be helpful.

I would really like to have EnhancementEngines provide RDF
descriptions of themselves in response to a GET request to

http://{stanbol-instance}/enhancer/engine/{engine-name}

If those descriptions also included information about the
consumed/produced elements, that would be great.

However this feature is much more important for UIMA than for Stanbol,
because with Stanbol, EnhancementEngines are expected to create
annotations that conform to the EnhancementStructure.

best
Rupert


-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Replace or augment UIMA/OpenNLP pipeline with Stanbol

2012-09-27 Thread Rupert Westenthaler
Hi Wayne,


On Thu, Sep 27, 2012 at 6:17 PM, Wayne Rasmuss
 wrote:
> I've been working with UIMA and OpenNLP together. Basically I've got the
> OpenNLP/UIMA example working. This gives me annotated text with tokens,
> sentences, parts of speech, chunks (verb phrase, noun phrase, etc.) It also
> attempts organizations, dates and locations though I don't get reliable
> results with them. Mostly I'm interested in parts of speech and chunks
> anyway.
>

Word-level NLP annotations are currently not included in the
enhancement results. This is mainly because this would result in 20+
triples per word. However, with STANBOL-733 "Stanbol NLP processing"
this feature will be added. Development of this is done in its own
branch [1]. This branch also includes its own Stanbol launcher that
allows you to easily test the current state of development (build and
start the launcher and then post some text to
http://localhost:8080/enhancer/chain/nlp-processing).

I will give you a short overview. Details can be found in JIRA:

* AnalysedText: Java Domain Model that represents results of NLP. The
AnalysedText is added to the ContentItem as ContentPart (see
STANBOL-734 for code examples)
* NLP 2 RDF: This is an EnhancementEngine that converts the
information of the AnalysedText to RDF by using NIF (NLP Interchange
Format) - a set of OWL ontologies that allow NLP results to be formally
represented (see STANBOL-741). NOTE that the NLP results provided by the
nlp-processing chain of the Stanbol launcher already use NIF.
* The opennlp.pos EnhancementEngine supports POS tagging of parsed
texts in all languages supported by OpenNLP (STANBOL-735). As part of
that it also detects and adds Sentence annotations. The
opennlp.chunker EnhancementEngine consumes Tokens and POS tags and
performs chunking (STANBOL-736). Chunking is supported for English and
German. There is also a sentiment.wordclassifier EnhancementEngine
that adds sentiment tags on the word level (based on SentiWordNet for
English and SentiWS for German).

You might also have a look at a presentation [2] about the Stanbol NLP
processing module I gave at the MLODE workshop this week in Leipzig.

[1] http://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/
[2] http://stanbol.apache.org/presentations/Stanbol_NLP_processing_2012-09.pdf

> I've been looking around and Stanbol looks like it may be easier to deal
> with and give me more advanced capabilities. I've done the first part of
> the getting started guide, but not the "full" version. I got the web
> interface up and was able to get some enhanced text. So that was great.
>
> After that I'm kind of stumped. I would like to get the annotated text
> (like I'm getting from UIMA/OpenNLP) so we can do analysis on it. Can
> someone help get started with setting up/calling stanbol so I can get the
> details in the enhanced result?
>

If you want to stay with the RESTful service you will need to
implement against the NIF as generated by the "NLP2RDF" engine. If you
plan to access the StanbolEnhancer via its Java API I think that the
API of the AnalyzedText (STANBOL-734) should give you everything you
need.

You might also want to consider implementing your own analysis as a
Stanbol EnhancementEngine. This blog post [3] provides a good introduction
on how to do that.

[3] 
http://blog.iks-project.eu/getting-started-with-apache-stanbol-enhancement-engine/

>
> We're working with Groovy as our glue code. Bertrand provided me with this
> example: https://gist.github.com/2931050 which looks very promising. I think
> what I need to do is basically add OpenNLP enhancers here and figure out
> how to call it.
>

The "opennlp.pos" and "opennlp.cunker" Engines should exactly provide
the information you are looking for. AFAIK the Apache Camel example
provided by Bertrand should allow you to call the according
Engines/Chain and also support direct access to the results stored in
the AnalyzedText content part. But as I am not familiar with Camel it
would be good if Bertrand could confirm this.

Please NOTE that the Stanbol NLP processing is still in heavy
development. So things might still change. The current plan is to have
a first rather stable version of STANBOL-733 available in the trunk
by the end of October.

best
Rupert


-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Update CELI engines to use Stanbol NLP processing

2012-09-28 Thread Rupert Westenthaler
Hi,

Yesterday I committed some changes/additions to the
stanbol.enhancer.nlp module. Please make sure you are on the most
current version.

On Fri, Sep 28, 2012 at 10:44 AM, Alessio Bosca  wrote:
> Hi Rupert,
>
> I completed the POS mappings in the PosTagSetRegistry class and I'm starting
> to add mappings for other morphological features (like gender, number, case)
> using the same approach (i.e. creating a GenderTagsetRegistry).
> I need to create a few classes for the mappings (GenderTag,
> GenderValuesEnum, etc.); should I create them in the celi engine project or should
> I create a proper subpackage (like morphology) on the same level as nlp.pos?
> I'll send you a patch as soon as I finish.

Regarding "morphology"

Please have a look at the o.a.s.enhancer.nlp.morpho package. Yesterday I
defined enumerations for Tenses and Cases (based on the OLIA
ontology). There is also a MorphoAnnotation class. If you need to
change/extend those, feel free to do it. The current state is only a
first proposal (by myself) and clearly needs to be improved/changed.

Regarding "GenderTagsetRegistry":

I would rather opt for a single CeliTagsetRegistry class that can be
used for everything (e.g. getPosTagSet(), getGenderTagSet(), ...) but
you can also create multiple specific registries if you like.

Regarding "GenderTag"

Would you like to introduce "{type}Tag" classes (similar to PosTag)
that hold a String tag and a Category that is a member of the
corresponding enumeration?
Examples would be GenderTag, TenseTag, CaseTag ...
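
A rough sketch of what such a class could look like (only to illustrate the
proposal - the Gender enumeration used here is hypothetical and would be defined
analogous to the Tense and Case enumerations in o.a.s.enhancer.nlp.morpho):

    /** Sketch of the proposed "{type}Tag" pattern. */
    public class GenderTag {

        /** hypothetical enumeration of language-independent gender categories */
        public enum Gender { Masculine, Feminine, Neuter }

        private final String tag;     // the tag as used by the NLP framework (e.g. the CELI service)
        private final Gender gender;  // mapped category; may be null for unmapped tags

        public GenderTag(String tag) {
            this(tag, null);
        }

        public GenderTag(String tag, Gender gender) {
            if (tag == null || tag.isEmpty()) {
                throw new IllegalArgumentException("The tag MUST NOT be null nor empty");
            }
            this.tag = tag;
            this.gender = gender;
        }

        public String getTag() {
            return tag;
        }

        public Gender getGender() {
            return gender;
        }
    }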

best
Rupert

>
> Bests
> Alessio
>
>
> On 09/21/2012 03:22 PM, Rupert Westenthaler wrote:
>>
>> Hi Alessio,
>>
>> On Fri, Sep 21, 2012 at 12:51 PM, Alessio Bosca 
>> wrote:
>>>
>>> We are surely willing to contribute to the development of the engines and
>>> I
>>> will work on the requested modifications for supporting the  AnalyzedText
>>> content part.
>>
>> Thats cool to hear. I already started some thinks. I will commit those
>> later today so that you can continue from their.
>>
>>> We will also provide you a mapping for the POS tagset and the other
>>> lexical
>>> features.
>>
>> If there is a documentation of the POS Tag Sets are available it would
>> be cool if you could link those. When I commit my local changes there
>> will be a "PosTagSetRegistry" in
>> "org.apache.stanbol.enhancer.engines.celi" where you can add the
>> mappings.
>>
>>> I will check with the team responsible for the morphological
>>> analyzer about the confidence level or the ranking of multiple readings
>>> as
>>> I'm not sure about that.
>>>
>>> Concerning the missing readings for some lexical entries it is because
>>> the
>>> unrecognized term are not present in the lexicon of the morphological
>>> analyzer; they are "unknown" words so to say.
>>> It happens with mispelled words or unknown named entities. It is possible
>>> to
>>> explicitly set a POS "Unknown" lexical feature for them, if you wish so,
>>> but
>>> there are no lexical feature retrieved by the morphological analyzer
>>> itself.
>>> Let me know if you want this update as well.
>>> Calling the named entities engine for Italian may be an alternative way
>>> for
>>> getting more info on that textual fragments.
>>>
>> OK that explains a lot. I had the impression that there is first a POS
>> tagger and than a morphological analyzer uses those results to provide
>> the lemmas and other information. If the morphological analyzer adds
>> possible lemmas based on words I would expect that there are no
>> results for some words and also that there are multiple readings for
>> others.
>>
>> Does linguagrid also have a POS tagging service?
>>
>>> I will send you an update next week as soon as I finished to integrate
>>> the
>>> updates
>>>
>> I am in Leibzig next week so I might be not as responsive as usually.
>>
>> best
>> Rupert
>>
>>> Bests
>>>  Alessio
>>>
>>>
>>> On 09/21/2012 09:16 AM, Rupert Westenthaler wrote:
>>>>
>>>> Hi Alessio, all
>>>>
>>>> I have started to work on the migration of the CELI lemmatizer Engine
>>>> to the new Stanbol NLP processing module (STANBOL-733, STANBOL-738).
>>>> Basically the Idea was to adapt the Lemmatizer Engine to use the
>>>> AnalysedText ContentPart (STANBOL-734) to store its result. The goal
>>&g

Re: Lessons learnt from EAP+ questions about future directions

2012-10-01 Thread Rupert Westenthaler
Hi,

let me just comment on your last point

On Mon, Oct 1, 2012 at 8:55 PM, Mihály Héder  wrote:
>> However this feature is much more important for UIMA as for Stanbol,
>> because with Stanbol EnhancementEngines are expected to create
>> Annotations that confirm to the EnhancementStructure.
>
> I totally support the self-description interface you propose, as the
> conformity to the structure is really helpful but not everything. For
> instance I had to experiment with Stanbol to figure out that LangId
> will provide a "dc:language" property, and there will be only one of
> this, not multiple ones (e.g. for every sentence).

This is defined by STANBOL-613.

> Another example is
> that the UIMAToTriples in my current deployment puts an sso:posTag
> property on every TextAnnotation.

Here the idea is to use NIF (NLP Interchange Format), but this is
still in the works. Current work is done in STANBOL-741, but most
likely I will create a separate issue that defines how NIF annotations are
linked to Stanbol enhancements.

Generally representing Word/Phrase level annotations as RDF does not
scale. This is the reason why STANBOL-733 introduced the AnalyzedText
ContentPart. So if you would like to allow other Engines to consume
NLP annotations the UIMA integration should also support the
AnalyzedText ContentPart.

> That might be helpful for other EE
> developers but they have to figure the uri of the property somehow -
> ok, it is in the documentation, but still...
>

Maybe we can use the already existing

org.apache.stanbol.enhancer.servicesapi.ServiceProperties

interface (already implemented by most Enhancement Engines). Possible
additions would include:

* EnhancementFeature: MetadataExtraction, PlainTextExtraction,
LanguageIdentification, POS tagging, Chunking, NER, EntityLinking, ...
* RequiresFeature: Enhancements required by an EnhancementEngine
* supportsLanguage: list of supported languages (with support for
exclusions and wildcards, e.g. !fr, !de, *)
* supportsMimeType: allows an EnhancementEngine to define the
supported mime types
* ...

If we use an Ontology for those Features we can

1. implement the Webservice that publishes the RDF metadata for
EnhancementEngines based on the ServiceProperties provided by an
EnhancementEngine
2. the URIs of those properties would also be a good entry point for
the documentation of how those features are represented in the
EnhancementStructure (or NIF)
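
A rough sketch of how an engine could publish such information (hypothetical
code - the property keys and values below are invented to illustrate the
proposal above and are not existing Stanbol constants; a real engine would
expose them through the ServiceProperties interface mentioned above):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class SelfDescribingEngineSketch {

        /** in a real engine this map would be returned via ServiceProperties */
        public Map<String, Object> getServiceProperties() {
            Map<String, Object> props = new HashMap<String, Object>();
            // hypothetical property keys as proposed above
            props.put("stanbol.enhancer.engine.feature", "PosTagging");
            props.put("stanbol.enhancer.engine.requiresFeature", "LanguageIdentification");
            props.put("stanbol.enhancer.engine.supportsLanguage", new String[]{"!fr", "!de", "*"});
            props.put("stanbol.enhancer.engine.supportsMimeType", "text/plain");
            return Collections.unmodifiableMap(props);
        }
    }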

best
Rupert

> Cheers
> Mihály
>
>> best
>> Rupert
>>
>>
>> --
>> | Rupert Westenthaler rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11 ++43-699-11108907
>> | A-5500 Bischofshofen



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Trunk Console Changes

2012-10-02 Thread Rupert Westenthaler
Hi Reto,

I am really unhappy with this, as access to the Felix Web Console is
critical for Stanbol users and the workaround is much too complex for
most users. Because of that I would like to propose the removal of the
authentication bundle list from the Stanbol launchers as long as this
is not fixed.

Generally I would recommend moving the development of this feature to
its own branch, as I expect it to have a bigger impact on Stanbol and
its components. This would also allow more frequent commits of
intermediate states of the development.

WDYT
Rupert

On Sun, Sep 23, 2012 at 1:24 PM, Reto Bachmann-Gmür  wrote:
> Ok, the problem has to do with two username and password checks which grant
> access under mutually exclusive conditions.
>
> So the work around: set the same password for your stanbol security admin
> user as for the felix console admin user. To do this you may follow the
> reset password instructions at http://incubator.apache.org/clerezza/faq/,
> but you have to install the additional bundle
> org.apache.clerezza:rdf.scala.utils.
>
> To do this enter the following command on the console:
>
> zz>start("
> http://central.maven.org/maven2/org/apache/clerezza/rdf.scala.utils/0.3-incubating/rdf.scala.utils-0.3-incubating.jar
> ")
>
> Then reconnect the console (:q to terminate) and follow the instructions as
> per the clerezza faq.
>
> Clearly this is just a work around. I'm working on a real solution which
> integrates the two authentication mechanisms.
>
> Cheers,
> Reto
>
>
> On Sun, Sep 23, 2012 at 9:55 AM, Rupert Westenthaler <
> rupert.westentha...@gmail.com> wrote:
>
>> Hi Reto,
>>
>> The problem reported here by Dave now also appears on the
>>
>> http://dev.iks-project.eu:8081/system/console
>>
>> Stanbol instance after we have updated it yesterday. Hopefully this
>> helps you in tracking this down.
>>
>> Reto as far as I know you have an admin account on that machine. If
>> not please contact Szaby
>>
>> best
>> Rupert
>>
>>
>> On Sat, Sep 1, 2012 at 5:19 AM, Dave Butler  wrote:
>> > We have been using the same configuration now for about a year. The
>> build,
>> > deployment and running. The change seems to have occured in the last
>> eight
>> > days, as prior builds appear to function. However I was running these
>> from
>> > another machine. Ran from Safari and Chrome and the same behaviour.
>> >
>> >
>> >
>> > On 1 September 2012 03:14, Reto Bachmann-Gmür  wrote:
>> >
>> >> Quite weird, it looks like your browser is sending an Authorization
>> header
>> >> with a value that can't be recognized. Does it work with another
>> browser or
>> >> if you restart your browser. Anyway I'll see how the system could handle
>> >> unrecognized Authorization values more gracefully.
>> >>
>> >> Cheers,
>> >> Reto
>> >>
>> >> On Fri, Aug 31, 2012 at 5:07 PM, Dave Butler 
>> wrote:
>> >>
>> >> > Reto,
>> >> >
>> >> > Starting from command line as normal with java -Xmx2048m
>> >> > -XX:MaxPermSize=256m  -jar
>> >> > org.apache.stanbol.launchers.full-0.10.0-incubating-SNAPSHOT.jar
>> -p9080
>> >> >
>> >> > And I only get this when going to the Osgi Console.
>> >> >
>> >> > The error generated is
>> >> > org.apache.stanbol.commons.security.auth.AuthenticationCheckerImpl No
>> >> > service could unsuccessfully authenticate user admin. Reason: user
>> does
>> >> not
>> >> > exist
>> >> > 31.08.2012 15:43:14.765 *WARN* [141002294@qtp-2046274478-5]
>> >> > org.apache.felix.http.jetty /system/console
>> >> > (java.lang.ArrayIndexOutOfBoundsException: 0)
>> >> > java.lang.ArrayIndexOutOfBoundsException: 0
>> >> > at
>> >> >
>> >> >
>> >>
>> org.apache.stanbol.commons.authentication.basic.BasicAuthentication.authenticate(BasicAuthentication.java:72)
>> >> > at
>> >> >
>> >> >
>> >>
>> org.apache.stanbol.commons.security.auth.AuthenticatingFilter.doFilter(AuthenticatingFilter.java:137)
>> >> > at
>> >> >
>> >> >
>> >>
>> org.apache.felix.http.base.internal.handler.FilterHandler.doHandle(FilterHandler.java:88)
>> >> > at
>> >> >
>> &g

Re: Trunk Console Changes

2012-10-02 Thread Rupert Westenthaler
Hi all,

On Tue, Oct 2, 2012 at 1:18 PM, Reto Bachmann-Gmür
 wrote:
> Hi Rupert
>
> Expect a fix for this issue by the end of the week. Consider that the users
> experiencing the issue are the most capable ones that were able to change
> the console password.

It definitely also affects users that do not change the password.
Yesterday, Sebastian Schaffert experienced this on a fresh new Stanbol
instance (on the first start, first access to the Felix Console) and
it also happened to me. However, it appears only from time to time
and is not something one can easily reproduce ... it looks more like a
race condition.

>
> If you can't wait till the end of the week I could even commit a quick fix
> later today.
>

I think a fix later this week should be OK.

best
Rupert

> Cheers
> Reto
>  On 2 Oct 2012 10:58, "Rupert Westenthaler" 
> wrote:
>
>> Hi Reto,
>>
>> I am really unhappy with this, as access to the Felix Web Console is
>> critical for Stanbol users and the workaround is much to complex for
>> most users. Because of that I would like to propose the removal of the
>> authentication bundle list form the Stanbol Launchers as long as this
>> is not fixed.
>>
>> Generally I would recommend to move the development of this feature to
>> an own branch as I expect to have it a bigger impact to Stanbol and
>> its components. This would also allow more often commits of
>> intermediate states in the development
>>
>> WDYT
>> Rupert
>>
>> On Sun, Sep 23, 2012 at 1:24 PM, Reto Bachmann-Gmür 
>> wrote:
>> > Ok, the problem has to do with two username and password checks which
>> grant
>> > access under mutually exclusive conditions.
>> >
>> > So the work around: set the same password for your stanbol security admin
>> > user as for the felix console admin user. To do this you may follow the
>> > reset password instructions at http://incubator.apache.org/clerezza/faq/
>> ,
>> > but you have to install the additional bundle
>> > org.apche.clerezza:rdf.scala.utils.
>> >
>> > To do this enter the following command on the console:
>> >
>> > zz>start("
>> >
>> http://central.maven.org/maven2/org/apache/clerezza/rdf.scala.utils/0.3-incubating/rdf.scala.utils-0.3-incubating.jar
>> > ")
>> >
>> > Then reconnect the console (:q to terminate) and follow the instructions
>> as
>> > per the clerezza faq.
>> >
>> > Clearly this is just a work around. I'm working on a real solution which
>> > integrates the two authentication mechanisms.
>> >
>> > Cheers,
>> > Reto
>> >
>> >
>> > On Sun, Sep 23, 2012 at 9:55 AM, Rupert Westenthaler <
>> > rupert.westentha...@gmail.com> wrote:
>> >
>> >> Hi Reto,
>> >>
>> >> The problem reported here by Dave now also appears on the
>> >>
>> >> http://dev.iks-project.eu:8081/system/console
>> >>
>> >> Stanbol instance after we have updated it yesterday. Hopefully this
>> >> helps you in tracking this down.
>> >>
>> >> Reto as far as I know you have an admin account on that machine. If
>> >> not please contact Szaby
>> >>
>> >> best
>> >> Rupert
>> >>
>> >>
>> >> On Sat, Sep 1, 2012 at 5:19 AM, Dave Butler  wrote:
>> >> > We have been using the same configuration now for about a year. The
>> >> build,
>> >> > deployment and running. The change seems to have occured in the last
>> >> eight
>> >> > days, as prior builds appear to function. However I was running these
>> >> from
>> >> > another machine. Ran from Safari and Chrome and the same behaviour.
>> >> >
>> >> >
>> >> >
>> >> > On 1 September 2012 03:14, Reto Bachmann-Gmür 
>> wrote:
>> >> >
>> >> >> Quite weird, it looks like your browser is sending an Authorization
>> >> header
>> >> >> with a value that can't be recognized. Does it work with another
>> >> browser or
>> >> >> if you restart your browser. Anyway I'll see how the system could
>> handle
>> >> >> unrecognized Authorization values more gracefully.
>> >> >>
>> >> >> Cheers,
>> >> >> Reto
>> >> >>
>> >&g

Re: Engine to extract XMP and problems with the tika engine

2012-10-07 Thread Rupert Westenthaler
Hi Reto,

Normally it is not a problem if parsed content does not contain any
plain text. There is even a unit test for the TikaEngine that tests
EXIF metadata extraction for JPEG images (see
TikaEngineTest#testExifMetadata).

Because of that I assume that the library used by Tika has some
problem with your image. In fact TIKA-609 mentions a similar exception
and the first comment suggests an illegal char encoding as the cause (which
might make sense, because this could cause a different number of bytes
to be read from the stream).

I would suggest testing your image directly with Tika 1.2 to see if
you can reproduce the error.
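
Something along these lines should do (a minimal sketch against the Tika API;
the image path is a placeholder):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaJpegTest {

        public static void main(String[] args) throws Exception {
            // path to the problematic image (placeholder)
            InputStream in = new FileInputStream("problematic-image.jpg");
            try {
                Metadata metadata = new Metadata();
                // AutoDetectParser is also what the TikaEngine uses (see the stack trace)
                new AutoDetectParser().parse(in, new BodyContentHandler(), metadata, new ParseContext());
                for (String name : metadata.names()) {
                    System.out.println(name + ": " + metadata.get(name));
                }
            } finally {
                in.close();
            }
        }
    }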

best
Rupert

On Sat, Oct 6, 2012 at 2:48 PM, Reto Bachmann-Gmür  wrote:
> Hello
>
> I thought that adding an engine that extract XMP metadata and converts EXIF
> data to XMP would be pretty straight forward (expecially since clerezza
> provides a bundle with such utilities).
>
> However I've noticed that the tika engina already processes jpegs but for
> the jpeg I've been testing it I get:
>
> Caused
> by:org.apache.stanbol.enhancer.servicesapi.EngineException:
> Unable to convert ContentItem
> <urn:content-item-sha1-13b7a6ca2636d1e1e8d36b4bc69d623947a6acb7> with
> mimeType 'image/jpeg' to plain text!
> at
> org.apache.stanbol.enhancer.engines.tika.TikaEngine.computeEnhancements(TikaEngine.java:222)
> at
> org.apache.stanbol.enhancer.jobmanager.event.impl.EnhancementJobHandler.processEvent(EnhancementJobHandler.java:259)
> at
> org.apache.stanbol.enhancer.jobmanager.event.impl.EnhancementJobHandler.handleEvent(EnhancementJobHandler.java:181)
> at
> org.apache.felix.eventadmin.impl.tasks.HandlerTaskImpl.execute(HandlerTaskImpl.java:88)
> at
> org.apache.felix.eventadmin.impl.tasks.SyncDeliverTasks.execute(SyncDeliverTasks.java:221)
> at
> org.apache.felix.eventadmin.impl.tasks.AsyncDeliverTasks$TaskExecuter.run(AsyncDeliverTasks.java:110)
> at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown
> Source)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.tika.exception.TikaException: Can't read JPEG metadata
> at
> org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:104)
> at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at
> org.apache.stanbol.enhancer.engines.tika.TikaEngine.computeEnhancements(TikaEngine.java:220)
> ... 7 more
> Caused by: com.drew.imaging.jpeg.JpegProcessingException: segment size
> would extend beyond file stream length
> at com.drew.imaging.jpeg.JpegSegmentReader.readSegments(Unknown Source)
> at com.drew.imaging.jpeg.JpegSegmentReader.<init>(Unknown Source)
> at
> org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:94)
> ... 13 more
> 
> Caused by:org.apache.tika.exception.TikaException: Can't read
> JPEG metadata
> at
> org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:104)
> at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>
> Now its not surprising that a jpeg cannot be converted to plain text but
> why does tika attempts in the first place andy why can't the JPEG metadata
> be read?
>
> Any ideas?
>
> Cheers,
> Reto



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Getting topic classification to work

2012-10-07 Thread Rupert Westenthaler
Hi René

Based on the error you are getting, I assume that you tried to install
the Topic Engine to the stable launcher. I am able to reproduce this
and I also know the reason for it. But more on that later.

To work around that issue, please use the full launcher instead
(you will need to use "-XX:MaxPermSize=256M").


The reason why it does not work with the stable launcher is that
somehow during the changes of the POM files related to the graduation
of Stanbol (STANBOL-747) all SNAPSHOT dependencies of the stable
launcher were removed. What I guess (based on the SVN history) is that

{stanbol-module}-{version}-incubating-SNAPSHOT.jar

was changed to

{stanbol-module}-{version}-incubating.jar

instead of

{stanbol-module}-{version}-SNAPSHOT.jar

As those changes were committed by Fabian and there are also a lot of
other changes, it would be best if Fabian could have a look at his
changes in revision 1389314.

As soon as the stable launcher again uses the most current
SNAPSHOT dependencies, the Topic Engine should run fine in
the stable launcher as well.

best
Rupert


On Sun, Oct 7, 2012 at 3:20 PM, Rene Nederhand  wrote:
> Hi,
>
> Now that I have Stanbol up and running, I'd like to do some tests to see
> the capabilities of Stanbol.
>
> I am trying to follow the tutorial at the IKS ReviewMeeting
> [1].<http://dl.dropbox.com/u/5743203/IKS/ReviewMeeting2012/Topic-Classification.pdf>
>
> However, doing:
>
> cd ~/stanbol/enhancer/engines/topic
> mvn install -DskipTests -PinstallBundle -Dsling.url=
> http://localhost:8080/system/console
>
> gives me an error:
>
> ERROR: Bundle org.apache.stanbol.enhancer.engine.topic [147]: Error
> starting
> inputstream:org.apache.stanbol.enhancer.engine.topic-0.10.0-SNAPSHOT.jar
> (org.osgi.framework.BundleException: Unresolved constraint in bundle
> org.apache.stanbol.enhancer.engine.topic [147]: Unable to resolve 147.3:
> missing requirement [147.3] package;
> (&(package=org.apache.commons.compress.archivers)(version>=1.4.1)))
> org.osgi.framework.BundleException: Unresolved constraint in bundle
> org.apache.stanbol.enhancer.engine.topic [147]: Unable to resolve 147.3:
> missing requirement [147.3] package;
> (&(package=org.apache.commons.compress.archivers)(version>=1.4.1))
> at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
> at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
> at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
> at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264)
> at java.lang.Thread.run(Thread.java:679)
>
> As it seems commons-compress is not included. So, I thought I'd add this to
> the pom.xml in stanbol/enhancer/engines/topic:
>
>
> <dependency>
>   <groupId>org.apache.commons</groupId>
>   <artifactId>commons-compress</artifactId>
>   <version>1.4.1</version>
> </dependency>
> mvn install -PinstallBundle  -Dsling.url=
> http://localhost:8080/system/console
>
> This does *NOT* solve the problem.
>
> I get a similar problem, when I continue to the next step in the tutorial
> (installing topic-web) with freemarker.cache.
>
> Am I doing something wrong?
>
> Best wishes,
>
> René Nederhand
>
>
> [1]
> http://dl.dropbox.com/u/5743203/IKS/ReviewMeeting2012/Topic-Classification.pdf



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Celi inaccessible: Error building Stanbol from svn trunk

2012-10-07 Thread Rupert Westenthaler
Hi René

I recently implemented a utility that ensures that unit tests of
EnhancementEngines are skipped if remote services are not available
(e.g. because you do not have an Internet connection, or the service is
temporarily down) - STANBOL-759.

However, as you and Jenkins build 1061 have discovered, this new
utility was not yet used by all unit tests of the CELI engines.
Because of that you were experiencing failed instead of skipped
tests.

Hopefully http://svn.apache.org/viewvc?rev=1395408&view=rev does fix this.

best
Rupert

On Sun, Oct 7, 2012 at 3:12 PM, Rene Nederhand  wrote:
> OK. I was able to sove this by:
>
> cd enhancers/engines
>
> edit pom.xml, change:
>
>
> 
>
> 
>
> It compiles fine now, but I have some other problems that I will bring up
> in new topic.
>
> Best wishes,
>
> René Nederhand
>
>
>
> On Sun, Oct 7, 2012 at 2:09 PM, Rene Nederhand  wrote:
>
>> Hi,
>>
>> I am experimenting with Apache Stanbol to see whether it can be fit in a
>> recommendation system I am creating. However, I cannot build the system.
>>
>> The error I am getting is:
>> ===
>> [ERROR] Failed to execute goal
>> org.apache.maven.plugins:maven-surefire-plugin:2.11:test (default-test) on
>> project org.apache.stanbol.enhancer.engines.celi: There are test failures.
>> [ERROR]
>> [ERROR] Please refer to
>> /Users/nederhrj/src/stanbol/enhancer/engines/celi/target/surefire-reports
>> for the individual test results.
>> [ERROR] -> [Help 1]
>> 
>>
>> I did:
>> export MAVEN_OPTS="-Xmx700M -XX:MaxPermSize=128M"
>> svn co http://svn.apache.org/repos/asf/stanbol/trunk stanbol
>> cd stanbol
>> mvn clean install -DskipTests (I am also getting errors on the tests,
>> therefore skipping)
>>
>> Then, when I get the error I do:
>>
>> mvn install -DskipTests -rf :org.apache.stanbol.enhancer.engines.celi -X
>>
>> and get errors like:
>>
>> =
>> 14:04:32,241 WARN  [Utils] no CELI license key configured for this Engine,
>> a guest account will be used (max 100 requests per day). Go on
>> http://linguagrid.org for getting a proper license key.
>> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.157 sec
>> <<< FAILURE!
>> Running
>> org.apache.stanbol.enhancer.engines.celi.langid.impl.CeliLanguageIdentifierEnhancementEngineTest
>> 14:04:32,387 WARN  [Utils] no CELI license key configured for this Engine,
>> a guest account will be used (max 100 requests per day). Go on
>> http://linguagrid.org for getting a proper license key.
>> 14:04:32,421 WARN  [RemoteServiceHelper] deactivate Test because
>> connection to remote service was refused (Message: 'Connection refused')
>> org.apache.stanbol.enhancer.servicesapi.EngineException: Error while
>> calling the CELI language identifier service (configured URL:
>> http://linguagrid.org/LSGrid/ws/language-identifier)!
>> ==
>>
>> So, it seems that Stanbol is relying on external services (celi) causing
>> my build to fail. Indeed, the server "http://www.linguagrid.org/" is
>> unavailable at the moment.
>>
>> Is there anyway to resolve this? Can I skip the "celi" part?
>>
>> Looking forward to any help.
>>
>> Best wishes,
>>
>> René Nederhand
>>
>>
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Help creating a custom vocabulary

2012-10-09 Thread Rupert Westenthaler
Hi Rene,

The problem is that the files of this dataset use N-Quads and not
N-Triples (basically SPOC (Subject, Predicate, Object, Context) instead
of SPO).

I can try to add support for importing N-Quads, but because the
importing tool does not use named graphs you might then lose some
quads (multiple quads with the same SPO values).

best
Rupert

On Tue, Oct 9, 2012 at 2:44 PM, Rene Nederhand  wrote:
> Hi,
>
>
> I am trying to create a custom vocabulary using
> webdatacommons<http://webdatacommons.org/>RDFa data [1]. To do this I
> am following this
> tutorial <http://stanbol.apache.org/docs/trunk/customvocabulary.html> [2].
>
> I've installed the indexer tool without any problems, editing the config
> file and I am now working on the mapping.txt file. However, I am clueless
> on what I should change in this file.
>
> An example of the data is
> here<http://webdatacommons.org/samples/data/ccrdf.html-rdfa.sample.nq>[3]:
>
> head -n 5 ../resources/rdfdata/ccrdf.html-rdfa.sample.nq
> <
> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/>
> <http://creativecommons.org/ns#attributionURL> <http://turcanu.net> <
> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/>
>   .
> <
> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/>
> <http://creativecommons.org/ns#attributionName> "Sergiu Turcanu" <
> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/>
>   .
> <http://www.telemac0.net/marketing-50/> <
> http://purl.org/dc/elements/1.1/type> <http://purl.org/dc/dcmitype/Text> <
> http://www.telemac0.net/marketing-50/>   .
> <http://www.telemac0.net/marketing-50/> <
> http://purl.org/dc/elements/1.1/title> "telemac0" <
> http://www.telemac0.net/marketing-50/>   .
> <http://www.telemac0.net/marketing-50/> <
> http://creativecommons.org/ns#attributionURL> <http://telemac0.net> <
> http://www.telemac0.net/marketing-50/>
>
> Could anyone point me in de the right direction?
>
> Cheers,
>
> René Nederhand
>
>
> [1] http://webdatacommons.org/
> [2] http://stanbol.apache.org/docs/trunk/customvocabulary.html
> [3] http://webdatacommons.org/samples/data/ccrdf.html-rdfa.sample.nq



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Creating EE ...

2012-10-09 Thread Rupert Westenthaler
Hi Andrea

your path

> launchers/stable/target/classes/resources/bundles/20/org.apache.stanbol.enhancer.servicesapi-0.10.0-incubating-SNAPSHOT.jar

looks really outdated. First because the start level for the
o.a.stanbol.enhancer.servicesapi bundle was increased from 20 to 30
with rev1371819 (10.Aug.2012). Your path still refers to "20" and not
the expected "30". Second because since graduation the version
"0.10.0-incubating-SNAPSHOT" does no longer exist. Since rev1389387
(24.Sep.2012) the expected one is "0.10.0-SNAPSHOT".

On Tue, Oct 9, 2012 at 4:37 PM, Rene Nederhand  wrote:
> /launchers/full/target/classes/resources/bundles/30/org.apache.stanbol.enhancer.servicesapi-0.10.0-SNAPSHOT.jar

this is the expected path for the full launcher. For the stable
launcher it is the same but with "/stable/" instead of "/full/".

When did you check out the Stanbol source? What SVN URL did you
use (you can use "svn info" to get the URL)? The correct URL is
"https://svn.apache.org/repos/asf/stanbol/trunk/".

If you have another URL, I recommend a new checkout. In case the URL
is as expected, you can also try "svn update".

It could also help if you delete the Stanbol bundles from your local
Maven repository (~/.m2/repository/org/apache/stanbol/). This ensures
that you do not accidentally have old incubation bundles present in
your local repository.


I hope this helps
best
Rupert

>
> available.
>
> Can't you use that one?
>
> Best,
> René
>
> On Tue, Oct 9, 2012 at 3:52 PM, Andrea Taurchini wrote:
>
>> Dear Melanie,
>> thanks for your reply. I really don't know, however compiling stanbol
>> produced successfully the
>> jar org.apache.stanbol.enhancer.servicesapi-0.9.0-incubating.jar
>> under launchers/stable/target/classes/resources/bundles/20/ but not the
>> 0.10.0.
>> I really can't get any clue !!!
>>
>> best,
>> Andrea
>>
>>
>>
>>
>> 2012/10/9 Melanie Reiplinger 
>>
>> > Hi Andrea,
>> >
>> > not sure if this is of any help, but have you considered that paths
>> > containing "incubating" may not be correct anymore? Stanbol has graduated
>> > recently.
>> >
>> > best,
>> > Melanie
>> >
>> > Am 09.10.2012 15:22, schrieb Andrea Taurchini:
>> >
>> >  Dear Sirs,
>> >> I would like to try to create my own EE following this tutorial
>> >> http://blog.iks-project.eu/creating-enhancement-engines-for-stanbol-0-10-0-incubating-using-netbeans-7-1-2/
>> >> but
>> >> unfortunately, after checking out stanbol as in
>> >> http://incubator.apache.org/stanbol/docs/trunk/tutorial.html I cannot make
>> >> reference to the compiled jar under
>> >> launchers/stable/target/classes/resources/bundles/20/org.apache.stanbol.enhancer.servicesapi-0.10.0-incubating-SNAPSHOT.jar
>> >> since I only found the 0.9.0 version.
>> >> What did I make wrong ?
>> >>
>> >> KR,
>> >> Andrea
>> >>
>> >>
>> >
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Links to configuration at management console do not work.

2012-10-09 Thread Rupert Westenthaler
Hi,

this is a bug I have been aware of for some time, but have not yet had time to
fix. The reason is that the URL is constructed in the wrong way. If you
remove the leading "/enhancer/chain/", the link will work.

best
Rupert

On Tue, Oct 9, 2012 at 1:36 PM, Rene Nederhand  wrote:
> Hi,
>
> When, I go to http://dev.iks-project.eu:8081/enhancer/chain and click
> "configure" I get a 404 error:
>
> Problem accessing
> /enhancer/chain/system/console/configMgr/org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain.9fed368c-2529-481f-8469-8bac9bd37f40.
> Reason:
>
> Not Found
>
> The same happens on my pilot installation.
>
> Did I miss something or is this a bug?
>
> Cheers,
> René



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Help creating a custom vocabulary

2012-10-10 Thread Rupert Westenthaler
Hi Rene,

With STANBOL-764 the indexing tool now supports importing quads.
However, you will still have problems working with the CommonCrawl data.

1. Because a lot of the data uses BNodes and those are ignored by
the Entityhub. As indexing of BNodes has already been requested several
times, I created STANBOL-765 to address this. While this will not
allow the Entityhub to handle BNodes, it will allow users to specify
if/how BNodes are converted to dereferenceable URIs.

2. I got a parse exception with Jena Riot in the test data file
referred to in your original mail [3].

Caused by: org.openjena.riot.RiotException: [line: 3931, col: 124]
expected "_:"
at 
org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)

This was caused by a literal using a country specific language tag

<http://bearhungfactory.mysinablog.com/index.php>
<http://creativecommons.org/ns#attributionName>
"\u6D2A\u96C4\u718A"@zh_tw
<http://bearhungfactory.mysinablog.com/index.php>   .

changing "@zh_tw" to "@zh" fixed the problem. This is a bug in the
used Jena version.

com.hp.hpl.jena:jena:2.6.3
com.hp.hpl.jena:arq:2.8.5
com.hp.hpl.jena:tdb:0.8.7

Maybe upgrading to a newer Jena version could solve this. However this
would first require Clerezza to adopt the newer version (see
STANBOL-621).

best
Rupert

On Tue, Oct 9, 2012 at 10:34 PM, Rene Nederhand  wrote:
> Hi Rupert,
>
> It would be great if we could make it possible to use CommonCrawl data even
> if we would lose some information. As I remember well, this was one of the
> requests that came up in the validation reports quite frequently. Freebase
> is an alternative.
>
> So, if this involves importing N-quads then I would appreciate adding this
> feature. No need for hurry and I am more than happy to help. Thanks!
>
> Best,
> René
>
>
>
>
>
> On Tue, Oct 9, 2012 at 10:02 PM, Rupert Westenthaler <
> rupert.westentha...@gmail.com> wrote:
>
>> Hi Rene,
>>
>> The problem ist that the files of this dataset do use N-Quads and not
>> NTriples (basically SPOC (Subject, Predicate, Object, Context) instead
>> of SPO.
>>
>> I can try to add support for importing N-Quads, but because the
>> importing tool does not use named graphs you might even than lose some
>> quads ( multiple Quads with the same SPO values).
>>
>> best
>> Rupert
>>
>> On Tue, Oct 9, 2012 at 2:44 PM, Rene Nederhand  wrote:
>> > Hi,
>> >
>> >
>> > I am trying to create a custom vocabulary using
>> > webdatacommons<http://webdatacommons.org/>RDFa data [1]. To do this I
>> > am following this
>> > tutorial <http://stanbol.apache.org/docs/trunk/customvocabulary.html>
>> [2].
>> >
>> > I've installed the indexer tool without any problems, editing the config
>> > file and I am now working on the mapping.txt file. However, I am clueless
>> > on what I should change in this file.
>> >
>> > An example of the data is
>> > here<http://webdatacommons.org/samples/data/ccrdf.html-rdfa.sample.nq
>> >[3]:
>> >
>> > head -n 5 ../resources/rdfdata/ccrdf.html-rdfa.sample.nq
>> > <
>> >
>> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/
>> >
>> > <http://creativecommons.org/ns#attributionURL> <http://turcanu.net> <
>> >
>> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/
>> >
>> >   .
>> > <
>> >
>> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/
>> >
>> > <http://creativecommons.org/ns#attributionName> "Sergiu Turcanu" <
>> >
>> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/
>> >
>> >   .
>> > <http://www.telemac0.net/marketing-50/> <
>> > http://purl.org/dc/elements/1.1/type> <http://purl.org/dc/dcmitype/Text>
>> <
>> > http://www.telemac0.net/marketing-50/>   .
>> > <http://www.telemac0.net/marketing-50/> <
>> > http://purl.org/dc/elements/1.1/title> "telemac0" <
>> > http://www.telemac0.net/marketing-50/>   .
>> > <http://www.telemac0.net/marketing-50/> <
>> > http://creativecommons.org/ns#attributionURL> <http://telemac0.net> <
>> > http://www.telemac0.net/marketing-50/>
>> >
>> > Could anyone point me in de the right direction?
>> >
>> > Cheers,
>> >
>> > René Nederhand
>> >
>> >
>> > [1] http://webdatacommons.org/
>> > [2] http://stanbol.apache.org/docs/trunk/customvocabulary.html
>> > [3] http://webdatacommons.org/samples/data/ccrdf.html-rdfa.sample.nq
>>
>>
>>
>> --
>> | Rupert Westenthaler rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11 ++43-699-11108907
>> | A-5500 Bischofshofen
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Corrupted Files downloaded from dev.iks-project.eu (Fwd: Jenkins build became unstable: stanbol-trunk-1.6 #1068)

2012-10-10 Thread Rupert Westenthaler
Hi all,

during the Apache Stanbol build process some files (DBpedia default
index, OpenNLP models) are downloaded from dev.iks-project.eu. Since
last week it has happened that those files get corrupted. We do not
know the reason for that, as the Apache2 logs of dev.iks-project.eu do
not point to any problems. This is also the reason for a lot of
unstable Jenkins builds over the last week.

Users that are affected by this should see "java.io.EOFException"s in
their logs. Affected files are located in the
"{stanbol-trunk}/data/{module-path}/download/resources" folders.
Deleted files will be re-downloaded on the next build. Because of that,
deleting the affected files and running "mvn clean install" for the
affected module usually solves issues like that (see the example
commands below).
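
For example, for the DBpedia default index (the same pattern should
apply to the other affected modules):

   cd {stanbol-trunk}/data/sites/dbpedia
   rm -rf download
   mvn clean install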

best
Rupert

-- Forwarded message --
From: Apache Jenkins Server 
Date: Wed, Oct 10, 2012 at 12:15 PM
Subject: Jenkins build became unstable:  stanbol-trunk-1.6 #1068
To: dev@stanbol.apache.org, rupert.westentha...@gmail.com


See <https://builds.apache.org/job/stanbol-trunk-1.6/1068/changes>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: "Error reloading cached bundle"

2012-10-10 Thread Rupert Westenthaler
Reto,

have you looked at which module bundle64 refers to?

On Wed, Oct 10, 2012 at 11:53 AM, Reto Bachmann-Gmür  wrote:
> Occasionally when starting a fresh stanbol launcher I get the following
> error message. Does anybody knows what is causing this? After deleting the
> stanbol dectory and retrying the problem doesn't appear again.
>
> Cheers,
> Reto
>
> ERROR: Error reloading cached bundle, removing it:
> /home/reto/projects/apache/stanbol/launchers/full/target/stanbol/felix/bundle64
> (java.lang.Exception: No valid revisions in bundle archive directory:
> /home/reto/projects/apache/stanbol/launchers/full/target/stanbol/felix/bundle64)
> java.lang.Exception: No valid revisions in bundle archive directory:
> /home/reto/projects/apache/stanbol/launchers/full/target/stanbol/felix/bundle64
> at
> org.apache.felix.framework.cache.BundleArchive.(BundleArchive.java:205)
> at
> org.apache.felix.framework.cache.BundleCache.getArchives(BundleCache.java:223)
> at org.apache.felix.framework.Felix.init(Felix.java:656)
> at org.apache.sling.launchpad.base.impl.Sling.init(Sling.java:363)
> at org.apache.sling.launchpad.base.impl.Sling.(Sling.java:228)
> at
> org.apache.sling.launchpad.base.app.MainDelegate$1.(MainDelegate.java:181)
> at
> org.apache.sling.launchpad.base.app.MainDelegate.start(MainDelegate.java:181)
> at org.apache.sling.launchpad.app.Main.startSling(Main.java:424)
> at org.apache.sling.launchpad.app.Main.doStart(Main.java:349)
> at org.apache.sling.launchpad.app.Main.main(Main.java:123)
> at org.apache.stanbol.launchpad.Main.main(Main.java:61)



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Validate fix for STANBOL-768: Wrong "Install-Path" header when running Entityhub Indexing Tool on Windows

2012-10-10 Thread Rupert Westenthaler
Hi Gniewosław, all

it would be nice if you or anyone else could validate that OSGI
bundles created by the Entityhub Indexing Tool when running on Windows
now correctly install the configurations for the Entityhub
ReferencedSite when installed to a Stanbol instance. See STANBOL-768
[1] [2] for details.

I currently do not have access to any Windows box, so help with that
would be really appreciated.

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-768
[2] http://svn.apache.org/viewvc?rev=1396614&view=rev

-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: build problem

2012-10-10 Thread Rupert Westenthaler
Hi Harish,

On Thu, Oct 11, 2012 at 1:27 AM, harish suvarna  wrote:
> Failure to find
> org.apache.stanbol:org.apache.stanbol.data.sites.dbpedia:jar:1.0.5-SNAPSHOT

it should not be necessary to download this dependency from any Maven
repository, as it is added to your local repository by running "mvn
install" in the "{stanbol-trunk}/data/sites/dbpedia" module. As the
dependency in [1] refers to the version defined in [2], I would not
expect any problem.

You can check for this dependency in the local maven repository at
"~/.m2/repository/org/apache/stanbol/org.apache.stanbol.data.sites.dbpedia/"

best
Rupert

[1] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/ldpath/pom.xml
[2] http://svn.apache.org/repos/asf/stanbol/trunk/data/sites/dbpedia/pom.xml

On Thu, Oct 11, 2012 at 1:27 AM, harish suvarna  wrote:
> I am at svn rev 1396858.
>
> I get the following error while building ldpath.
>
> Error stacktraces are turned on.
> [INFO] Scanning for projects...
> [INFO]
>
> [INFO]
> 
> [INFO] Building Apache Stanbol Entityhub LDPath Support 0.11.0-SNAPSHOT
> [INFO]
> 
> [WARNING] The POM for
> org.apache.stanbol:org.apache.stanbol.data.sites.dbpedia:jar:1.0.5-SNAPSHOT
> is missing, no dependency information available
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 2.805s
> [INFO] Finished at: Wed Oct 10 16:16:45 PDT 2012
> [INFO] Final Memory: 8M/81M
> [INFO]
> 
> [ERROR] Failed to execute goal on project
> org.apache.stanbol.entityhub.ldpath: Could not resolve dependencies for
> project
> org.apache.stanbol:org.apache.stanbol.entityhub.ldpath:bundle:0.11.0-SNAPSHOT:
> Failure to find
> org.apache.stanbol:org.apache.stanbol.data.sites.dbpedia:jar:1.0.5-SNAPSHOT
> in http://repository.apache.org/snapshots was cached in the local
> repository, resolution will not be reattempted until the update interval of
> apache.snapshots has elapsed or updates are forced -> [Help 1]
>
> I checked repository.apache.org for dbpedia jar snapshot. But nothing is
> there.
>
>
> --
> Thanks
> Harish



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Build error: two child modules missing

2012-10-10 Thread Rupert Westenthaler
Hi,

those folders got recently moved in the SVN. You can check at [1] that
they are present on the server. Interestingly, I also had problems
while doing "svn up" for these changes. On my machine the old folders
were not correctly deleted and the new ones were not created - no idea
why.

I had to manually create and add those folders (mkdir {folder}, svn add
{folder}; see the example below). Only after that did I get the changes
from the server by calling "svn up". I would also be interested why
things like that happen from time to time.
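
For the two folders reported below, the workaround looked roughly like
this (a sketch; svn behaviour may differ depending on the client
version):

   cd {stanbol-checkout}/commons
   mkdir -p security/core security/authentication.basic
   svn add security
   svn up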

best
Rupert


[1] http://svn.apache.org/repos/asf/stanbol/trunk/commons/security/

On Tue, Oct 9, 2012 at 9:12 PM, Andreas Kuckartz  wrote:
> I currently get a build error.
>
> Cheers,
> Andreas
> ---
>
> [INFO] Scanning for projects...
> [ERROR] The build could not read 1 project -> [Help 1]
> [ERROR]
> [ERROR]   The project
> org.apache.stanbol:org.apache.stanbol.commons.reactor:0.10.0-SNAPSHOT
> (/home/andreas/workspace/stanbol/commons/pom.xml) has 2 errors
> [ERROR] Child module
> /home/andreas/workspace/stanbol/commons/security/core of
> /home/andreas/workspace/stanbol/commons/pom.xml does not exist
> [ERROR] Child module
> /home/andreas/workspace/stanbol/commons/security/authentication.basic of
> /home/andreas/workspace/stanbol/commons/pom.xml does not exist
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Help creating a custom vocabulary

2012-10-11 Thread Rupert Westenthaler
Hi René,

BTW I finished the work on STANBOL-765 today. See the first comment
for the documentation on how to enable the indexing of BNodes.

best
Rupert

On Thu, Oct 11, 2012 at 10:54 PM, Rene Nederhand  wrote:
> Hi Rupert,
>
> Thank you very much for all the work. I'd expected this would take much
> longer :)
>
> Probably this weekend, I will try to get some of the CommonCrawl data
> imported into Stanbol and see how this works out.
>
> In addition, I will try the Apache any23 tool (thx. A. Soroka).
>
> Best,
> René
>
> On Wed, Oct 10, 2012 at 11:39 AM, Rupert Westenthaler <
> rupert.westentha...@gmail.com> wrote:
>
>> Hi Rene,
>>
>> With STANBOL-764 the indexing tool now supports importing quads.
>> However you will still have problems to work with the CommonCrawl data.
>>
>> 1. Because a lot of the data do use BNodes and those are ignored by
>> the Entityhub. As indexing of Bnodes was already requested several
>> times from I created STANBOL-765 to address this. While this will not
>> allow the Entityhub to handle BNodes it will allow users to specify
>> if/how Bnodes are converted to dereferable URIs.
>>
>> 2. I got a parse exception with Jena Riot in the test data file
>> refered by your original mail [3].
>>
>> Caused by: org.openjena.riot.RiotException: [line: 3931, col: 124]
>> expected "_:"
>> at
>> org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
>>
>> This was caused by a literal using a country specific language tag
>>
>> <http://bearhungfactory.mysinablog.com/index.php>
>> <http://creativecommons.org/ns#attributionName>
>> "\u6D2A\u96C4\u718A"@zh_tw
>> <http://bearhungfactory.mysinablog.com/index.php>   .
>>
>> changing "@zh_tw" to "@zh" fixed the problem. This is a bug in the
>> used Jena version.
>>
>> com.hp.hpl.jena:jena:2.6.3
>> com.hp.hpl.jena:arq:2.8.5
>> com.hp.hpl.jena:tdb:0.8.7
>>
>> Maybe upgrading to a newer Jena version could solve this. However this
>> would previously require Clerezza to adopt the newer version (see
>> STANBOL-621).
>>
>> best
>> Rupert
>>
>> On Tue, Oct 9, 2012 at 10:34 PM, Rene Nederhand 
>> wrote:
>> > Hi Rupert,
>> >
>> > It would be great if we could make it possible to use CommonCrawl data
>> even
>> > if we would lose some information. As I remember well, this was one of
>> the
>> > requests that came up in the validation reports quite frequently.
>> Freebase
>> > is an alternative.
>> >
>> > So, if this involves importing N-quads then I would appreciate adding
>> this
>> > feature. No need for hurry and I am more than happy to help. Thanks!
>> >
>> > Best,
>> > René
>> >
>> >
>> >
>> >
>> >
>> > On Tue, Oct 9, 2012 at 10:02 PM, Rupert Westenthaler <
>> > rupert.westentha...@gmail.com> wrote:
>> >
>> >> Hi Rene,
>> >>
>> >> The problem ist that the files of this dataset do use N-Quads and not
>> >> NTriples (basically SPOC (Subject, Predicate, Object, Context) instead
>> >> of SPO.
>> >>
>> >> I can try to add support for importing N-Quads, but because the
>> >> importing tool does not use named graphs you might even than lose some
>> >> quads ( multiple Quads with the same SPO values).
>> >>
>> >> best
>> >> Rupert
>> >>
>> >> On Tue, Oct 9, 2012 at 2:44 PM, Rene Nederhand 
>> wrote:
>> >> > Hi,
>> >> >
>> >> >
>> >> > I am trying to create a custom vocabulary using
>> >> > webdatacommons<http://webdatacommons.org/>RDFa data [1]. To do this I
>> >> > am following this
>> >> > tutorial <http://stanbol.apache.org/docs/trunk/customvocabulary.html>
>> >> [2].
>> >> >
>> >> > I've installed the indexer tool without any problems, editing the
>> config
>> >> > file and I am now working on the mapping.txt file. However, I am
>> clueless
>> >> > on what I should change in this file.
>> >> >
>> >> > An example of the data is
>> >> > here<http://webdatacommons.org/samples/data/ccrdf.html-rdfa.sample.nq
>> >> >[3]:
>> >> >
>> >> > head -n 5 ../resources/rdfdata/ccrdf.html-rdfa.sample.

Stanbol Semantic Indexing (was Re: Next releases)

2012-10-13 Thread Rupert Westenthaler
...but I think we will add such a component. If that is the case then a
SemanticIndex would only need to specify its IndexingSource(s) and
Stanbol would keep the index in sync with its source.

### Provided Services

The services API of the semanticindexing module does NOT include the
actual Java APIs for SemanticSearch but rather leaves it to the
implementations to register those APIs themselves as OSGI services.
Stanbol already defines/uses a lot of those interfaces, and
implementations that implement those will naturally integrate. To give
some examples: a SemanticIndex storing its data in Solr can register
its SolrCore as OSGI service as described by [2]. SemanticIndexes
using a Clerezza TripleStore can be accessed via the Clerezza
TCManager and can expose a SPARQL endpoint as described in [3].

This design has the advantages that

* the semanticindexing API stays focused on the semantic indexing
process and is therefore easier to implement
* it allows greater flexibility and extensibility (e.g. one could
write a semantic index based on CouchDB and register the RESTful and
Java APIs similarly to how it is done for Solr)
* it allows both the storage and the semanticindex layer to provide
additional services (e.g. if a TripleStore is used to store the data,
it can directly provide the SPARQL endpoint; in case data are stored
in a CMS, the SPARQL endpoint can be provided by a SemanticIndex that
knows how to convert the CMS data to RDF)
* it fits very well with the service-oriented architecture of OSGI

BTW we will also use the same system for the Stanbol specific services
(e.g. the featured search of the Contenthub, the LDPath backend
functionality or the FieldQuery service of the Entityhub).

### Next Steps:

The first Stanbol component that will use this infrastructure will be
the Contenthub. Suat is the person in charge of this. Based on the
version re-integrated with the trunk, I will then continue the
development of the Entityhub on top of the semanticindexing module.

best
Rupert Westenthaler

[1] http://stanbol.apache.org/presentations/Stanbol_Overview_2012-04.pdf
[2] http://stanbol.apache.org/docs/trunk/utils/commons-solr
[3] http://markmail.org/message/zm2tqlvs4flwvjyd


-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Next releases

2012-10-13 Thread Rupert Westenthaler
Hi all

On Fri, Oct 12, 2012 at 5:48 PM, aj...@virginia.edu  wrote:
> Thanks for that detailed answer. Don't worry, I understand that the notion of 
> yard is specific to the EntityHub-- I was just using it as an analogy.
>

The current Entityhub Yard implementations will be used as backends
for SemanticIndex implementations with the new system. In fact this is
very similar to how the Yard interface is already used by the
Entityhub Indexing Tool - as an indexing destination.

>
> I have one other question about this specific effort: in IndexingSource I 
> find the important method:
>
> Item get(String uri) throws StoreException
>
> so it seems that this interface is meant to be used synchronously in direct 
> operation, when get() doesn't block for any long time waiting for a large 
> datum to transit or for slow storage to produce results. In order to use this 
> gear in these cases, would it be necessary to rewrite the upper-level 
> component "Content Create/Update"? Or could one expect to create a kind of 
> queuing component and wire it between "Content Create/Update" and "Content 
> Item Storage", maintaining synchronous behavior in the upper level of 
> architecture?

That is true. The intended usage of the interfaces of the
semanticindexing module is synchronous. If necessary, the semantic
indexing process as a whole can be implemented asynchronously (e.g.
using a queue that is processed by multiple worker threads; a sketch
of this is shown below).
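
To illustrate, a minimal sketch of such a queue based approach. The
Item, IndexingSource and SemanticIndex interfaces below are simplified
placeholders and NOT the actual semanticindexing API:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Minimal sketch: URIs of changed items are queued and worker threads
 * synchronously fetch the Item from the IndexingSource and pass it on
 * to the SemanticIndex.
 */
public class QueuedSemanticIndexer {

    public interface Item { String getUri(); }
    public interface IndexingSource { Item get(String uri) throws Exception; }
    public interface SemanticIndex { void index(Item item) throws Exception; }

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
    private final ExecutorService workers;

    public QueuedSemanticIndexer(final IndexingSource source,
                                 final SemanticIndex index, int numWorkers) {
        workers = Executors.newFixedThreadPool(numWorkers);
        for (int i = 0; i < numWorkers; i++) {
            workers.execute(new Runnable() {
                public void run() {
                    while (!Thread.currentThread().isInterrupted()) {
                        try {
                            String uri = queue.take(); // blocks until a change is queued
                            index.index(source.get(uri)); // synchronous calls inside the worker
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        } catch (Exception e) {
                            // a real implementation would log and possibly retry here
                        }
                    }
                }
            });
        }
    }

    /** to be called whenever the IndexingSource notifies a change */
    public void notifyChanged(String uri) {
        queue.add(uri);
    }

    public void shutdown() {
        workers.shutdownNow();
    }
}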

As mentioned in my other mail, the first release of the
semanticindexing module together with the Contenthub will most likely
not include a general implementation of the indexing process, but I
plan to implement such a component as part of the adaptation of the
Entityhub to the new system. As the Entityhub Indexing Tool already
uses a multi-threaded producer/consumer based indexing pipeline, I
will most likely start from there.

Note the description of the "indexing process" is included in my mail
about the "Stanbol Semantic Indexing" module.

best
Rupert

-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Stanbol Semantic Indexing (was Re: Next releases)

2012-10-15 Thread Rupert Westenthaler
>>
>> SemanticIndexes do have states: UNINT, INDEXING are used during the
>> initial indexing state. ACTIVE means that the index is in normal
>> operation and finally REINDEXING is used after an *epoch* change of
>> the IndexingSource. In this state the SemanticIndex can still be used
>> (with the data before the epoch change) while the re-indexing based on
>> the new data is preformed.
>>
>> In the first version the Stanbol semanticindexing will not include a
>> component that provides an implementation of the above workflow, but I
>> think we will add such a component. If that is the case than a
>> SemanticIndex would only need to specify its IndexingSource(s) and
>> Stanbol would keep the index in sync with its source.
>
> In the first step Contenthub will provide an implementation the above
> workflow you mention with a Store (e.g FileStore[1] and SemanticIndex
> implementation[2] which is synchronized with the Store. Do you mean
> another (more generic) implementation?
>

Yeah, I think this workflow should be a service provided by Stanbol.
Maybe you can even start such a component when you implement the
ClerezzaIndex (needed to keep the SPARQL endpoint feature over the
enhancement metadata).

best
Rupert

-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Stanbol - build - run

2012-10-15 Thread Rupert Westenthaler
Hi Adam.

While starting a bundle (in your case
"org.apache.stanbol.ontologymanager.servicesapi") the OSGI framework
checks if all referenced packages of the module and its dependencies
are available. In your case the package "com.hp.hpl.jena.graph" seems
to be missing for some reason. This has nothing to do with memory.
Typically "-XX:MaxPermSize=256m -Xmx1024m" is sufficient for the
Stanbol Full Launcher.

As this does not appear on continuous integration, it might be related
to some invalid data in your local Maven repository
("~/.m2/repository"). Can you try to delete the caches for the bundles
referenced in the error messages,

~/.m2/repository/com/hp/hpl
~/.m2/repository/org/apache/stanbol

and afterwards make a new build of Stanbol (see the commands below).
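
For example (assuming the default Maven setup):

   rm -rf ~/.m2/repository/com/hp/hpl
   rm -rf ~/.m2/repository/org/apache/stanbol
   cd {stanbol-trunk}
   mvn clean install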

If you want to validate your memory settings, a binary download of the
Stanbol launcher is also available at [1]. It is built every night
from a fresh checkout.

best
Rupert

[1] http://dev.iks-project.eu/downloads/stanbol-launchers/


On Mon, Oct 15, 2012 at 9:04 PM, adasal  wrote:
> Hi,
> I have spent the last several days trying to compile and run a local
> instance of the Stanbol project.
> I can compile. If I include tests I must exclude integration as this fails
> with similar errors (the same errors plus out of heap space) as when I run
> the compiled project skiping that test.
> The errors I get are such like:-
> ERROR: Bundle org.apache.stanbol.ontologymanager.servicesapi [131]: Error
> starting
> inputstream:org.apache.stanbol.ontologymanager.servicesapi-0.10.0-SNAPSHOT.jar
> (org.osgi.framework.BundleException: Unresolved constraint in bundle
> org.apache.stanbol.ontologymanager.servicesapi [131]: Unable to resolve
> 131.0: missing requirement [131.0] package;
> (&(package=org.apache.stanbol.commons.owl.util)(version>=0.10.0)) [caused
> by: Unable to resolve 58.0: missing requirement [58.0] package;
> (package=com.hp.hpl.jena.graph)])
> org.osgi.framework.BundleException: Unresolved constraint in bundle
> org.apache.stanbol.ontologymanager.servicesapi [131]: Unable to resolve
> 131.0: missing requirement [131.0] package;
> (&(package=org.apache.stanbol.commons.owl.util)(version>=0.10.0)) [caused
> by: Unable to resolve 58.0: missing requirement [58.0] package;
> (package=com.hp.hpl.jena.graph)]
> at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
> at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
> at
> org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
> at
> org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264)
> at java.lang.Thread.run(Thread.java:680)
>
> They all indicate the missing requirement package=com.hp.hpl.jena.graph,
> rdf.model and datatypes.
>
> Is this really that I am not able to allocate enough memory to the runtime?
> (My Mac has 4g but it gets eaten up by these processes) or am I missing
> something else?
>
> Any ideas?
>
> Adam



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Request to validate/correct STANBOL-774 related POM file changes

2012-10-18 Thread Rupert Westenthaler
Hi all

Yesterday I committed changes to the POM files of all modules that
produce bundles. The intention and nature of those changes are well
described by the description and comments of STANBOL-774 [1], so I
will not repeat them in this mail. But as I am also not very
experienced with this topic, feedback and suggestions are very
welcome.

The reason for this mail is that I ask the developers of those modules
to check/validate and, where necessary, correct my changes! Please take
the time to compare the maven-bundle-plugin package instructions in the
POM (the Export-Package / Private-Package / Import-Package definitions)

and compare them with the expected

   Import-Package:

entries in the generated MANIFEST.MF file
(/target/classes/META-INF/MANIFEST.MF). STANBOL-774 provides
information on what is expected and [2] provides the details.
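
To give a purely illustrative (hypothetical) example of what to
compare: if a POM declares instructions like

   <Export-Package>org.example.mymodule.api;version=${project.version}</Export-Package>
   <Private-Package>org.example.mymodule.impl</Private-Package>

then the Import-Package section of the generated MANIFEST.MF should
list the packages the bundle actually uses, ideally with version
ranges, e.g.

   Import-Package: org.example.otherapi;version="[0.10,1)",...

The package names and versions above are made up; only the comparison
itself matters.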

Note also that explicitly adding missing packages usually does not
solve the issue (as it typically would only lead to runtime issues).
The BnD tool does a really great job in analyzing dependencies, so
typically:

1. Your expectations are wrong (e.g. an exported package that is not
used in a private package of the same bundle does not need to be
imported).
2. Dependencies of any class in an exported package on a private
package will prevent the BnD tool from adding it to the list of
Import-Package entries. In this case you will need to adapt your
dependencies or packages (you can use STANBOL-773 for those changes).

IMHO doing this is really important before going for a 1.0 release, as
after such a release most of these changes would only be possible with
a 2.* release (or by keeping a lot of @Deprecated stuff that we would
need to maintain).

best
Rupert


[1] https://issues.apache.org/jira/browse/STANBOL-774
[2] http://www.aqute.biz/Bnd/Versioning

-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Corrupted Files downloaded from dev.iks-project.eu (Fwd: Jenkins build became unstable: stanbol-trunk-1.6 #1068)

2012-10-18 Thread Rupert Westenthaler
Hi Suat

in the module of the dbpedia default dataset there should be a
download folder containing the file downloaded from the server.
Deleting that folder will trigger the re-download of that file.
This is also the best way to check if the file is actually corrupted.

You can find the folder at

{stanbol-trunk}/data/sites/dbpedia/download

best
Rupert

On Wed, Oct 17, 2012 at 6:19 PM, Suat Gonul  wrote:
> Hi Rupert,
>
> I have a similar problem but I am not sure it is related with the
> situation here. Here is the exception I get:
>
> 17.10.2012 18:51:59.930 *ERROR* [Thread-47]
> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl
> IOException while activating Index 'default:dbpedia'!
> java.io.IOException: Unable to copy Data for index 'dbpedia' (server
> 'default')
> at
> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl.updateCore(ManagedSolrServerImpl.java:779)
> at
> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl$IndexUpdateDaemon.run(ManagedSolrServerImpl.java:1162)
> Caused by: java.io.IOException: Truncated ZIP file
> at
> org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readDeflated(ZipArchiveInputStream.java:389)
> at
> org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:322)
> at java.io.InputStream.read(InputStream.java:101)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1025)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:999)
> at
> org.apache.stanbol.commons.solr.utils.ConfigUtils.copyArchiveEntry(ConfigUtils.java:539)
> at
> org.apache.stanbol.commons.solr.utils.ConfigUtils.copyCore(ConfigUtils.java:497)
> at
> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl.updateCore(ManagedSolrServerImpl.java:777)
>
>
> I tried deleting the stanbol directory inside the .m2 and even the .m2
> itself, however I still get this exception. Do you have any idea why
> this happens?
>
> Best,
> Suat
>
> On 10/10/2012 3:49 PM, Rupert Westenthaler wrote:
>> Hi all,
>>
>> during the Apache Stanbol build process some files (DBpedia default
>> index, OpenNLP models) are downloaded from dev.iks-project.eu. Since
>> the last week it happens that those files are corrupted. We do not
>> know the reason for that as the Apache2 logs of the dev.iks-project.eu
>> do not point to any problems. This is also the reason for a lot of
>> unstable Jenkins build on the last week.
>>
>> Users that are affected by this should see "java.io.EOFException"s in
>> their logs. Affected files are located in the
>> "{stanbol-trunk}/data/{module-path}/download/resources" folders.
>> Deleted files will be re-downloaded on the next build. Because of that
>> deleting affected files and "mvm clean install" of the affected file
>> usually solves issues like that.
>>
>> best
>> Rupert
>>
>> -- Forwarded message --
>> From: Apache Jenkins Server 
>> Date: Wed, Oct 10, 2012 at 12:15 PM
>> Subject: Jenkins build became unstable:  stanbol-trunk-1.6 #1068
>> To: dev@stanbol.apache.org, rupert.westentha...@gmail.com
>>
>>
>> See <https://builds.apache.org/job/stanbol-trunk-1.6/1068/changes>
>>
>>
>>
>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Corrupted Files downloaded from dev.iks-project.eu (Fwd: Jenkins build became unstable: stanbol-trunk-1.6 #1068)

2012-10-18 Thread Rupert Westenthaler
Hi

On Thu, Oct 18, 2012 at 10:35 AM, Suat Gonul  wrote:
> Thanks Rupert.
>
> I was thinking that "mvn clean" would delete the files. Manually
> removing that folder solved the problem.
>

No "mvn clean" intensionally does NOT delete those files to avoid
re-downloading them again and again. However this can be easily
changed by an according configuration.
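
Just as an illustration (this is NOT the current Stanbol build
configuration): an additional fileset for the maven-clean-plugin in the
affected module would make "mvn clean" remove the download folder as
well:

   <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-clean-plugin</artifactId>
     <configuration>
       <filesets>
         <fileset>
           <directory>download</directory>
         </fileset>
       </filesets>
     </configuration>
   </plugin>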

best
Rupert


> Best,
> Suat
>
> On 10/18/2012 10:34 AM, Rupert Westenthaler wrote:
>> Hi Suat
>>
>> in the module of the dbpedia default dataset there should be a
>> download folder containing the file downloaded form the server.
>> Deleting that folder will trigger the re-download of that file.
>> This is also the best way to check if the file is actually corrupted.
>>
>> You can find the folder at
>>
>> {stanbol-trunk}/data/sites/dbpedia/download
>>
>> best
>> Rupert
>>
>> On Wed, Oct 17, 2012 at 6:19 PM, Suat Gonul  wrote:
>>> Hi Rupert,
>>>
>>> I have a similar problem but I am not sure it is related with the
>>> situation here. Here is the exception I get:
>>>
>>> 17.10.2012 18:51:59.930 *ERROR* [Thread-47]
>>> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl
>>> IOException while activating Index 'default:dbpedia'!
>>> java.io.IOException: Unable to copy Data for index 'dbpedia' (server
>>> 'default')
>>> at
>>> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl.updateCore(ManagedSolrServerImpl.java:779)
>>> at
>>> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl$IndexUpdateDaemon.run(ManagedSolrServerImpl.java:1162)
>>> Caused by: java.io.IOException: Truncated ZIP file
>>> at
>>> org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readDeflated(ZipArchiveInputStream.java:389)
>>> at
>>> org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:322)
>>> at java.io.InputStream.read(InputStream.java:101)
>>> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1025)
>>> at org.apache.commons.io.IOUtils.copy(IOUtils.java:999)
>>> at
>>> org.apache.stanbol.commons.solr.utils.ConfigUtils.copyArchiveEntry(ConfigUtils.java:539)
>>> at
>>> org.apache.stanbol.commons.solr.utils.ConfigUtils.copyCore(ConfigUtils.java:497)
>>> at
>>> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl.updateCore(ManagedSolrServerImpl.java:777)
>>>
>>>
>>> I tried deleting the stanbol directory inside the .m2 and even the .m2
>>> itself, however I still get this exception. Do you have any idea why
>>> this happens?
>>>
>>> Best,
>>> Suat
>>>
>>> On 10/10/2012 3:49 PM, Rupert Westenthaler wrote:
>>>> Hi all,
>>>>
>>>> during the Apache Stanbol build process some files (DBpedia default
>>>> index, OpenNLP models) are downloaded from dev.iks-project.eu. Since
>>>> the last week it happens that those files are corrupted. We do not
>>>> know the reason for that as the Apache2 logs of the dev.iks-project.eu
>>>> do not point to any problems. This is also the reason for a lot of
>>>> unstable Jenkins build on the last week.
>>>>
>>>> Users that are affected by this should see "java.io.EOFException"s in
>>>> their logs. Affected files are located in the
>>>> "{stanbol-trunk}/data/{module-path}/download/resources" folders.
>>>> Deleted files will be re-downloaded on the next build. Because of that
>>>> deleting affected files and "mvm clean install" of the affected file
>>>> usually solves issues like that.
>>>>
>>>> best
>>>> Rupert
>>>>
>>>> -- Forwarded message --
>>>> From: Apache Jenkins Server 
>>>> Date: Wed, Oct 10, 2012 at 12:15 PM
>>>> Subject: Jenkins build became unstable:  stanbol-trunk-1.6 #1068
>>>> To: dev@stanbol.apache.org, rupert.westentha...@gmail.com
>>>>
>>>>
>>>> See <https://builds.apache.org/job/stanbol-trunk-1.6/1068/changes>
>>>>
>>>>
>>>>
>>
>>
>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Hackathon at ApacheCon EU 2012?

2012-10-18 Thread Rupert Westenthaler
Hi Sergio,

cool idea! Thanks for sharing this information on the list. I would
enjoy participating in an Stanbol Hackathon.

best
Rupert

On Thu, Oct 18, 2012 at 2:52 PM, Sergio Fernández
 wrote:
> Hi,
>
> in addition to the Linked Data Track [1], what do you think to also organize
> a hackathon? They are collecting ideas at the wiki [2] until Monday 5th
> November. Maybe other projects (Jena, Clerezza and Any23) would be also
> interested.
>
> Kind regards,
>
> [1] http://www.apachecon.eu/tracks/#linked-data
> [2] http://wiki.apache.org/apachecon/HackathonEU12
>
>
> --
> Sergio Fernández
> Salzburg Research
> +43 662 2288 318
> Jakob-Haringer Strasse 5/II
> A-5020 Salzburg (Austria)
> http://www.salzburgresearch.at



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: How to add a new TripleCollection to Stanbol

2012-10-29 Thread Rupert Westenthaler



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Apache Stanbol ( Disambiguation Engine ) proposal and doubts

2012-10-30 Thread Rupert Westenthaler
y to create a version that works well
with the disambiguation-mlt engine. As soon as this is finished I can
also provide this demo on the http://dev.iks-project.eu server.

best
Rupert

> Thanks a lot for your attention. We hope to hear from you.
>
> Regards,
> Juan.




--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Opennlp NER ...

2012-10-30 Thread Rupert Westenthaler
Hi Andrea,

On Tue, Oct 30, 2012 at 4:15 PM, Andrea Taurchini  wrote:
> Dear All,
> I developed my own models for NER based on OPENNLP.
> Within these models I have more entities than person, organization and
> places ... will stanbol enhance text using this added entities ?
>

Currently both the OpenNLP NER engine and the NamedEntityLinkingEngine
can only handle Persons, Organizations and Places. In their current
form you will not be able to use them to link other types.

For both engines this is mainly because of the configuration. So
extending those engines to support other (or better, arbitrarily
configurable) types would require extending the engines' configuration
options. In the following I will try to describe the necessary
extensions.

## OpenNLP NER engine

The NER engine needs the mappings from a {ner-model} to its {language}
and the extracted {entity-type}. Currently this works via a constant
defining the mappings for persons, organizations and places. NLP
models are loaded by using the OpenNLP service (defined by the
o.a.stanbol.commons.opennlp module).

To configure additional models and types I would suggest to add an
additional configuration property that uses the following syntax

{model-file-name};lang={language};type={entity-type}

The OpenNLP TokenNameFinderModel would be loaded from the configured
"{model-file-name}" via the Stanbol DataFileProvider service.
Practically this means that users would need to copy their custom
models to the "{stanbol.home}/datafiles" directory.

The language parameter "lang={language}" would specify the language
supported by this model. The "type={entity-type}" parameter would
specify the dc:type value set for fise:TextAnnotations created for
named entities extracted by the model. An example configuration value
is shown below.
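
A purely hypothetical example value (the model file name and type URI
are made up for illustration):

   my-drug-ner-model.bin;lang=en;type=http://dbpedia.org/ontology/Drug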


## NamedEntityLinkingEngine

For this engine the main problem is that the current way to configure
mappings does not allow configuring arbitrary mappings. Because of
that, one would need to implement a different approach to configuring
the mappings for the dc:type values of linked fise:TextAnnotations.

I would suggest using a configuration similar to the "type mappings"
[1] already used by the KeywordLinkingEngine. The syntax would look
like

 {dc-type} > {vocabulary-type}; {vocabulary-type}; ...
 {dc-type} > *
 {dc-type}

where {dc-type} would be the value of the dc:type property of the
TextAnnotation and {vocabulary-type} is the rdf:type value required
for linked entities in the vocabulary linked against. * represents the
wildcard (any type) and a plain {dc-type} is a shorthand for {dc-type} >
{dc-type}.

The current default mappings would be represented in this syntax by

dbp-ont:Place
dbp-ont:Person
dbp-ont:Organisation

I would suggest keeping support for the current properties so as not
to break backward compatibility.

If this extension is sufficient, I suggest creating the corresponding JIRA issues.

best
Rupert

[1] 
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/keywordlinkingengine.html#type-mappings-syntax

> Thanks and best regards,
> Andrea



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Opennlp NER ...

2012-10-31 Thread Rupert Westenthaler
Hi

On Wed, Oct 31, 2012 at 2:25 PM, Andrea Taurchini  wrote:
> Dear Rupert,
> as always thanks for your support.
> Is it possible to use a single model file to detect multiple dc-type ... or
> should I add more than one configuration property each with the same model
> file but different dc-type ... or else should I produce different model
> file.

If this is possible with OpenNLP, then for sure, but AFAIK the
"opennlp.tools.namefind.NameFinderME#find(..)" method only provides the
token spans and probabilities. So it tells you only that you have found
a Named Entity from tokenA to tokenB, but not the type of the Named
Entity.

While I can imagine that one can train a model that detects different
types of entities, you would not know the specific type of a found
named entity. So found entities may have any of the trained types.

So if you want to distinguish between Named Entities of the different
types, you would need to train separate models.

Please correct me if I am wrong.

> However ... where do I have to set this configuration property (^_^) ?
> Throus OSGI admin ?

Using the configuration tab of the Felix Web Console is only one
option; there are also other possibilities for providing
configurations. For example, you can provide configuration files to
the Sling FileInstaller as described at [1]; this will soon also be
documented under the new "Production" section of the Stanbol webpage
(currently only available on the staging server [2]).



[1] http://markmail.org/message/jpxpl6x4nkmz6kda
[2] http://stanbol.staging.apache.org/production/partial-updates.html

>
> Thanks a lot.
>
> Kindest regards,
> Andrea
>
>
>
>
>
>
>
> 2012/10/31 Rupert Westenthaler 
>
>> Hi Andrea,
>>
>> On Tue, Oct 30, 2012 at 4:15 PM, Andrea Taurchini 
>> wrote:
>> > Dear All,
>> > I developed my own models for NER based on OPENNLP.
>> > Within these models I have more entities than person, organization and
>> > places ... will stanbol enhance text using this added entities ?
>> >
>>
>> Currently both the OpenNLP NER engine as well as the
>> NamedEntityLinkingEngine can only handle Persons, Organizations and
>> Places. In its current form you will not be able to use them to link
>> other types.
>>
>> For both engines this is mainly because of the configuration. So
>> extending those engines to support other (or better arbitrary
>> configureable) types would require to extend the engines configuration
>> options. In the following I will try to describe the necessary
>> extensions.
>>
>> ## OpenNLP NER engine
>>
>> The NER engine needs the mappings for an {ner-model} to its {language}
>> and the extracted {entity-type}. Currently this works by a constant
>> defining the mappings for persons, organizations and places. NLP
>> models are loaded by using the OpenNLP service (defined by the
>> o.a.stanbol.commons.opennlp module).
>>
>> To configure additional models and types I would suggest to add an
>> additional configuration property that uses the following syntax
>>
>> {model-file-name};lang={language};type={entity-type}
>>
>> The OpenNLP TokenNameFinderModel would be loaded from the configured
>> "{model-file-name}" via the Stanbol DataFileProvider service.
>> practically this means that users would need to copy their custom
>> models to the "{stanbol.home}/datafiles" directory.
>>
>> The language parameter "lang={language}" would specify the language
>> supported by this model. The "type={entity-type}" parameter would
>> specify the dc-type value set for fise:TextAnnotations created for
>> named entities extracted by the model.
>>
>>
>> ## NamedEntityLinkingEngine
>>
>> For this engine the main problem with the current implementation is
>> that the current way to configure mappings does not allow to configure
>> arbitrary mappings. Because of that one would need to implement a
>> different approach to configure the mappings for linked
>> fise:TextAnnotations dc:type values.
>>
>> I would suggest to use a configuration similar to the "type mapping"
>> [1] as already used by the KeywordLinkingEngine. The Syntax would be
>> like
>>
>>  {dc-type} > {vocabulary-type}; {vocabulary-type}; ...
>>  {dc-type} > *
>>  {dc-type}
>>
>> where the {dc-type} would be the value of the dc-type property of the
>> TextAnnotation and {vocabulary-type} is the rdf:type value required
>> for linked Entities in the vocabulary linked against. * represents the
>> wild-card (any type) and {dc-type} is a shorthand for {dc-t

Re: How to add a new TripleCollection to Stanbol

2012-10-31 Thread Rupert Westenthaler
Hi

AFAIK the Clerezza SPARQL implementation does not use the
graph-specific SPARQL implementation. Because of that you are limited
to what Clerezza supports and cannot access additional features. This
limitation is also the reason why I am interested in extending the
Stanbol SPARQL endpoint to directly support Jena Datasets and possibly
even others (Sesame, Virtuoso ...) registered with the same metadata
as currently supported for Clerezza TripleCollections.

best
Rupert

On Wed, Oct 31, 2012 at 2:32 PM, Andrea Di Menna  wrote:
> Hi Rupert,
>
> thanks for your precious help.
>
> I am using the default graph hence I had to build a custom component.
> After this was done I could access the TDB with Stanbol :-)
>
> From what I can see though, the Clerezza SPARQL processor Stanbol is using
> does not support aggregate functions like count.
> Can you confirm? Is it possible to switch to ARQ for SPARQL queries?
>
> At the moment I am using Fuseki to handle queries as well (b.t.w. I
> realised it was much much faster to build the TDB using tdbloader2 instead
> of sending triples to Fuseki - dumb me, should have know before starting).
>
> Thanks for your great support!
>
> Cheers
>
> 2012/10/30 Rupert Westenthaler 
>
>> Hi
>>
>> To use an existing Jena TDB store with Apache Stanbol you need:
>>
>> 1. to make the Jena TDB store available in Apache Clerezza
>> 2. configure a Stanbol Entityhub ClerezzaYard for your Graph URI
>>
>> ad1: Do you use named graphs or the TDB triple store? In In the
>> SNAPSHOT version of "rdf.jena.tdb.storage"
>> (org.apache.clerezza:rdf.jena.tdb.storage:0.6-incubating-SNAPSHOT)
>> there is a SingleTdbDatasetTcProvider. It allows you to configure
>> (e.g. via the Configuration tab of the Apache Felix WebConsole) the
>> directory of the local file system where your TDB store is located. If
>> you configure an instance with the location of your existing TDB
>> store, than Clerezza should have access to the data. However this
>> works only for named graphs (SPOC) and the union graph over all SPOC
>> graphs. The SPO graph is not exposed by the
>> SingleTdbDatasetTcProvider.
>>
>> ad2: As soon as you have your TDB store available in Clerezza you can
>> configure ClerezzaYard instance(s) (e.g. via the Configuration tab of
>> the Apache Felix WebConsole). Important is that the value of the
>> "Graph URI" property refers to a Context (C) of your named graphs
>> (SPOC) or to the URI of the union graph (as configured in the
>> configuration of the SingleTdbDatasetTcProvider.
>>
>> The ClerezzaYard will automatically register the Clerezza MGraph with
>> the Stanbol SPARQL endpoint.
>>
>>
>> As an alternative you could also implement an own component that (1)
>> opens the Jena TDB store (2) wraps the Jena graph with an Clerezza
>> MGraph
>>
>> For that you create your own module and implement a a component
>>
>> @Component(
>> configurationFactory=true,
>> policy=ConfigurationPolicy.REQUIRE, //the TDBpath is required!
>> specVersion="1.1",
>> metatype = true)
>>  public class TdbGraphRegistering component
>>
>> @Property
>> public static final String TDB_PATH = "jena.tdb.path";
>>
>> When your bundle starts OSGI will call the activate(..) method and
>> deactivate(..) when it is stopped.
>>
>> protected void activate(ComponentContext ctx) throws
>> ConfigurationException {
>> String tdbPath = (String)ctx.getProperties().get(TDB_PATH)
>> if(tdbPath == null){
>> throw new ConfigurationException(TDB_PATH,"Jena TDB path
>> MUST BE configured")
>> }
>>
>> So what you need to do is to initialize the Jena TDB store from the
>> configured TDB_PATH create
>> an Clerezza MGraph and register it as OSGI service
>>
>>  //Init the jena TDB model
>> com.hp.hpl.jena.rdf.model.Model model;
>>
>> MGraph graph = new LockableMGraphWrapper(
>> new PrivilegedMGraphWrapper(new JenaGraphAdaptor(model)
>>
>> and than registering this MGraph to the OSGI ServiceRegistry (whitboard
>> pattern)
>>
>> Dictionary graphRegProp = new
>> Hashtable();
>> //the URI under that you want to register your graph
>> graphRegProp.put("graph.uri", graphUri);
>> //optionally the name and description of the graph (used in the UI)
>> graphRegProp.put("graph.name", getConfig().getName());
>> graphRegProp.put("graph.de

Re: Opennlp NER ...

2012-10-31 Thread Rupert Westenthaler
On Wed, Oct 31, 2012 at 3:31 PM, Andrea Taurchini  wrote:
> Dear Rupert,
> thanks again.
> Uhmmm ... using tokennamefinder from command line of opennlp if you use a
> multitype trained model than you get a multitype tagged output ... as for
> api .find method I suppose is the way you told me (one type per model ??).
>

Maybe Span#getType() returns the type of the found entity. I will try
this out. If this really provides the different types, then the
configuration would look like


{model-file-name};language={language};{type}={type-uri};{type2}={type-uri2};...

BTW I created already
https://issues.apache.org/jira/browse/STANBOL-792 for this feature.

> Forgive me if I'm silly but I can't see how can I add configuration
> property under configuration tab of Felix WC.
>

The form you see in the configuration tab is generated from an XML
file in the bundle, and this XML file is generated from the @Property
annotations in the implementation of the engine. So as soon as these
new configuration options are implemented, you will see the
corresponding options in the form.


> Thanks and best regards,
> Andrea
>
>
>
>
>
> 2012/10/31 Rupert Westenthaler 
>
>> Hi
>>
>> On Wed, Oct 31, 2012 at 2:25 PM, Andrea Taurchini 
>> wrote:
>> > Dear Rupert,
>> > as always thanks for your support.
>> > Is it possible to use a single model file to detect multiple dc-type ...
>> or
>> > should I add more than one configuration property each with the same
>> model
>> > file but different dc-type ... or else should I produce different model
>> > file.
>>
>> If this is possible with OpenNLP, than for sure, but AFAIK the
>> "opennlp.tools.namefind.NameFinderME#find(..)" method only provide the
>> token spans and probability. So it tells you only that you have found
>> an Named Entity from tokenA to tokenB and not the type of the Named
>> Entity.
>>
>> While I can imagine that one can train a model that detects different
>> types of entities, you will not know the specific type of an found
>> named entity. So found Entities may have any of the trained types.
>>
>> So if you want to distinguish between NamedEntities of the different
>> types you will need to train separate models.
>>
>> Please correct me if I am wrong.
>>
>> > However ... where do I have to set this configuration property (^_^) ?
>> > Throus OSGI admin ?
>>
>> Using the configuration tab of the Felix Web Console is only one
>> option. There are also other possibilities to provide configurations.
>> You can also provide configuration files to the Sling FileInstaller as
>> described at [1] and soon also under the new "Production" section on
>> the Stanbol webpage (currently only available on the staging server
>> [2])
>>
>>
>>
>> [1] http://markmail.org/message/jpxpl6x4nkmz6kda
>> [2] http://stanbol.staging.apache.org/production/partial-updates.html
>>
>> >
>> > Thanks a lot.
>> >
>> > Kindest regards,
>> > Andrea
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > 2012/10/31 Rupert Westenthaler 
>> >
>> >> Hi Andrea,
>> >>
>> >> On Tue, Oct 30, 2012 at 4:15 PM, Andrea Taurchini > >
>> >> wrote:
>> >> > Dear All,
>> >> > I developed my own models for NER based on OPENNLP.
>> >> > Within these models I have more entities than person, organization and
>> >> > places ... will stanbol enhance text using this added entities ?
>> >> >
>> >>
>> >> Currently both the OpenNLP NER engine as well as the
>> >> NamedEntityLinkingEngine can only handle Persons, Organizations and
>> >> Places. In its current form you will not be able to use them to link
>> >> other types.
>> >>
>> >> For both engines this is mainly because of the configuration. So
>> >> extending those engines to support other (or better arbitrary
>> >> configureable) types would require to extend the engines configuration
>> >> options. In the following I will try to describe the necessary
>> >> extensions.
>> >>
>> >> ## OpenNLP NER engine
>> >>
>> >> The NER engine needs the mappings for an {ner-model} to its {language}
>> >> and the extracted {entity-type}. Currently this works by a constant
>> >> defining the mappings for persons, organizations and places. NLP
>> >> models are loaded by using the Open

Re: Opennlp NER ...

2012-10-31 Thread Rupert Westenthaler
Hi

just to let you know that I can confirm that the type of the Named
Entity is indeed provided by the Span#getType() method. So models for
multiple Named Entity types are also supported by the Java API. A
small example is shown below.
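
A small sketch of how this looks with the OpenNLP Java API (the model
file name and the token array are just placeholders):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class SpanTypeExample {

    public static void main(String[] args) throws Exception {
        // placeholder file name for a model trained on multiple entity types
        InputStream in = new FileInputStream("my-multitype-ner-model.bin");
        TokenNameFinderModel model = new TokenNameFinderModel(in);
        in.close();

        NameFinderME finder = new NameFinderME(model);
        String[] tokens = {"Paris", "is", "the", "capital", "of", "France", "."};

        for (Span span : finder.find(tokens)) {
            StringBuilder name = new StringBuilder();
            for (int i = span.getStart(); i < span.getEnd(); i++) {
                name.append(tokens[i]).append(' ');
            }
            // Span#getType() returns the entity type the span was tagged with
            System.out.println(span.getType() + ": " + name.toString().trim());
        }
        finder.clearAdaptiveData();
    }
}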

best
Rupert

On Wed, Oct 31, 2012 at 3:45 PM, Rupert Westenthaler
 wrote:
> On Wed, Oct 31, 2012 at 3:31 PM, Andrea Taurchini  
> wrote:
>> Dear Rupert,
>> thanks again.
>> Uhmmm ... using tokennamefinder from command line of opennlp if you use a
>> multitype trained model than you get a multitype tagged output ... as for
>> api .find method I suppose is the way you told me (one type per model ??).
>>
>
> Maybe the Span#getType() returns the type of the found entity. I will
> try this out. If this really provides the different types, that the
> configuration will be like
>
> 
> {model-file-name};language={language};{type}={type-uri};{type2}={type-uri2};...
>
> BTW I created already
> https://issues.apache.org/jira/browse/STANBOL-792 for this feature.
>
>> Forgive me if I'm silly but I can't see how can I add configuration
>> property under configuration tab of Felix WC.
>>
>
> The form you see in the configuration in generated from a XML file in
> the Bundle and this XML file is generated by the @Property annotations
> in the implementation of the Engine. So as soon as this new
> configuration options are implemented you will see the according
> options in the form.
>
>
>> Thanks and best regards,
>> Andrea
>>
>>
>>
>>
>>
>> 2012/10/31 Rupert Westenthaler 
>>
>>> Hi
>>>
>>> On Wed, Oct 31, 2012 at 2:25 PM, Andrea Taurchini 
>>> wrote:
>>> > Dear Rupert,
>>> > as always thanks for your support.
>>> > Is it possible to use a single model file to detect multiple dc-type ...
>>> or
>>> > should I add more than one configuration property each with the same
>>> model
>>> > file but different dc-type ... or else should I produce different model
>>> > file.
>>>
>>> If this is possible with OpenNLP, than for sure, but AFAIK the
>>> "opennlp.tools.namefind.NameFinderME#find(..)" method only provide the
>>> token spans and probability. So it tells you only that you have found
>>> an Named Entity from tokenA to tokenB and not the type of the Named
>>> Entity.
>>>
>>> While I can imagine that one can train a model that detects different
>>> types of entities, you will not know the specific type of an found
>>> named entity. So found Entities may have any of the trained types.
>>>
>>> So if you want to distinguish between NamedEntities of the different
>>> types you will need to train separate models.
>>>
>>> Please correct me if I am wrong.
>>>
>>> > However ... where do I have to set this configuration property (^_^) ?
>>> > Throus OSGI admin ?
>>>
>>> Using the configuration tab of the Felix Web Console is only one
>>> option. There are also other possibilities to provide configurations.
>>> You can also provide configuration files to the Sling FileInstaller as
>>> described at [1] and soon also under the new "Production" section on
>>> the Stanbol webpage (currently only available on the staging server
>>> [2])
>>>
>>>
>>>
>>> [1] http://markmail.org/message/jpxpl6x4nkmz6kda
>>> [2] http://stanbol.staging.apache.org/production/partial-updates.html
>>>
>>> >
>>> > Thanks a lot.
>>> >
>>> > Kindest regards,
>>> > Andrea
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > 2012/10/31 Rupert Westenthaler 
>>> >
>>> >> Hi Andrea,
>>> >>
>>> >> On Tue, Oct 30, 2012 at 4:15 PM, Andrea Taurchini >> >
>>> >> wrote:
>>> >> > Dear All,
>>> >> > I developed my own models for NER based on OPENNLP.
>>> >> > Within these models I have more entities than person, organization and
>>> >> > places ... will stanbol enhance text using this added entities ?
>>> >> >
>>> >>
>>> >> Currently both the OpenNLP NER engine as well as the
>>> >> NamedEntityLinkingEngine can only handle Persons, Organizations and
>>> >> Places. In its current form you will not be able to use them to link
>>> >> other types.
>>> >>
>>&g

Re: EntityHub Referenced Site and redirects

2012-11-03 Thread Rupert Westenthaler



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Opennlp NER ...

2012-11-03 Thread Rupert Westenthaler
Hi

The implementation of the CustomNERModelEnhancementEngine
(STANBOL-792) is now available. The documentation can be found at [1].

I also updated the eHealth demo ("{stanbol-trunk}/demo/ehealth") to
use the new Engine with 5 custom NER models for DNA, RNA, Proteins,
Cell Type and Cell Line based on the BioNLP2004 dataset [2]. When you
build it (mvn clean install) and install the ehealth demo bundle
(org.apache.stanbol.demo.ehealth-0.10.1-SNAPSHOT.jar) to the Stanbol
Launcher (revision > 1405306), then you can test the engine with the
chain http://localhost:8080/enhancer/chain/ehealth-ner
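
For a quick test from the command line something like this should work
(untested sketch - assumes a local default launcher; the sentence is
just an arbitrary example):

curl -X POST -H "Content-type: text/plain" -H "Accept: text/turtle" \
    --data "Activation of the IL-2 gene requires binding of NF-kappa B to the promoter." \
    http://localhost:8080/enhancer/chain/ehealth-ner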

@Andrea: I was not able to test the engine with NER models that
extract multiple entity types, as I was not able to find/build such a
model for testing. So if you find any issues regarding that please
report it.
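
If you want to check such a model outside of Stanbol first, the OpenNLP
command line tool can be used for that (rough sketch - assumes an
OpenNLP 1.5.x distribution; "my-multi-type-model.bin" and
"test-sentences.txt" - one whitespace-tokenized sentence per line - are
just placeholders):

bin/opennlp TokenNameFinder my-multi-type-model.bin < test-sentences.txt

Every detection is marked with <START:{type}> ... <END> in the output,
so you can directly see whether the different types are reported.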

I don't think I will have time to work on STANBOL-793 in the coming days
as ApacheCon is around the corner.

best
Rupert

[1] 
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/customnermodelengine.html
[2] http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html

On Wed, Oct 31, 2012 at 5:22 PM, Rupert Westenthaler
 wrote:
> Hi
>
> just to let you know that I can confirm that the type of the Named
> Entity is indeed provided by the Span#getType() method. So models for
> multiple Named Entity types are also supported by the Java API.
>
> best
> Rupert
>
> On Wed, Oct 31, 2012 at 3:45 PM, Rupert Westenthaler
>  wrote:
>> On Wed, Oct 31, 2012 at 3:31 PM, Andrea Taurchini  
>> wrote:
>>> Dear Rupert,
>>> thanks again.
>>> Uhmmm ... using tokennamefinder from command line of opennlp if you use a
>>> multitype trained model than you get a multitype tagged output ... as for
>>> api .find method I suppose is the way you told me (one type per model ??).
>>>
>>
>> Maybe the Span#getType() returns the type of the found entity. I will
>> try this out. If this really provides the different types, then the
>> configuration will be like
>>
>> 
>> {model-file-name};language={language};{type}={type-uri};{type2}={type-uri2};...
>>
>> BTW I created already
>> https://issues.apache.org/jira/browse/STANBOL-792 for this feature.
>>
>>> Forgive me if I'm silly but I can't see how can I add configuration
>>> property under configuration tab of Felix WC.
>>>
>>
>> The form you see in the configuration tab is generated from an XML file in
>> the Bundle and this XML file is generated by the @Property annotations
>> in the implementation of the Engine. So as soon as these new
>> configuration options are implemented you will see the corresponding
>> options in the form.
>>
>>
>>> Thanks and best regards,
>>> Andrea
>>>
>>>
>>>
>>>
>>>
>>> 2012/10/31 Rupert Westenthaler 
>>>
>>>> Hi
>>>>
>>>> On Wed, Oct 31, 2012 at 2:25 PM, Andrea Taurchini 
>>>> wrote:
>>>> > Dear Rupert,
>>>> > as always thanks for your support.
>>>> > Is it possible to use a single model file to detect multiple dc-type ...
>>>> or
>>>> > should I add more than one configuration property each with the same
>>>> model
>>>> > file but different dc-type ... or else should I produce different model
>>>> > file.
>>>>
>>>> If this is possible with OpenNLP, then for sure, but AFAIK the
>>>> "opennlp.tools.namefind.NameFinderME#find(..)" method only provides the
>>>> token spans and probabilities. So it tells you only that you have found
>>>> a Named Entity from tokenA to tokenB and not the type of the Named
>>>> Entity.
>>>>
>>>> While I can imagine that one can train a model that detects different
>>>> types of entities, you will not know the specific type of a found
>>>> named entity. So found Entities may have any of the trained types.
>>>>
>>>> So if you want to distinguish between Named Entities of the different
>>>> types you will need to train separate models.
>>>>
>>>> Please correct me if I am wrong.
>>>>
>>>> > However ... where do I have to set this configuration property (^_^) ?
>>>> > Through the OSGI admin ?
>>>>
>>>> Using the configuration tab of the Felix Web Console is only one
>>>> option. There are also other possibilities to provide configurations.
>>>> You can also provide configuration files to the Sling FileInstaller as
>>>> described at [1] and soon also under the new "



Re: Enhancer engine deps problem for releases

2012-11-08 Thread Rupert Westenthaler
Hi Fabian,

do you think that would also mean changing the package
structure/module names of those engines, or do you think it is OK for
any EnhancementEngine that is managed by the Stanbol Community to use
"org.apache.stanbol.enhancer.engine.{engine-name}" as artifactId and
package name?

Regardless of that +1 from my side.

best
Rupert



On Thu, Nov 8, 2012 at 11:37 AM, Olivier Grisel
 wrote:
> Sounds reasonable to me. +1 for refactorings that improve the release
> flow and lower the maintenance burden.
>
> 2012/11/8 Fabian Christ :
>> Hi,
>>
>> I am investigating the current SNAPSHOT deps of the Stanbol components in
>> order to find out what can be released and in which order.
>>
>> In the enhancer we have the problematic situation that we have enhancement
>> engines that rely on other components, like the refactor engine that relies
>> on rules.
>>
>> This is problematic to cut an Enhancer release because we would need to
>> release, e.g. the rules component first.
>>
>> I would like to prevent such situations. IMO it would be a more natural fit
>> if engines, that rely on a certain component, are removed from the Enhancer
>> source tree and moved to the source tree of that particular component or
>> even to a third place.
>>
>> The Engines included in the enhancer/engines directory should only be
>> engines that do not have such dependencies. If this is the case, releasing
>> the enhancer with all independent engines raises no problems anymore.
>>
>> My proposal would be to create a new top level folder in the source tree
>> for engines that rely on the availability of other components. We could
>> call it "enhancer-thirdparty-engines". This could also be a place for
>> contributed engines that we do not want to be in the default
>> enhancer/engines structure. Such engines will be released independently and
>> are not part of an Enhancer release anymore.
>>
>> WDYT?
>>
>> --
>> Fabian
>> http://twitter.com/fctwitt
>
>
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Future of Clerezza and Stanbol

2012-11-09 Thread Rupert Westenthaler
ion?
>> > >
>> > > Presumably the moved modules will be released by the new host - will
>> they
>> > > use group id org.apache.clerezza? or move to the new host project group
>> > id?
>> > > I'd suggest renaming the group to the new project but realise it is a
>> bit
>> > > more disruptive...
>> >
>> > I think that's really up to whatever project adopts that code. In
>> > theory package names should change but that's probably not convenient.
>> >
>> > Or maybe it's time to create a semantic module or two at
>> > http://commons.apache.org/ ? If existing committers are willing to
>> > support that with their work it should be easy to make it happen.
>> >
>> > -Bertrand
>> >
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Manually create a Vocabulary / Ontology

2012-11-09 Thread Rupert Westenthaler
Hi

As soon as we have SPARQL 1.1 support in Clerezza we/you can use
skos.js [1] with the SPARQL endpoint of Apache Stanbol. Another
possibility would be to add support for Entityhub Managed Sites to
VIE. This would allow you to create/update/delete entities in the
Entityhub Site used by the Stanbol enhancer.

best
Rupert

[1] https://github.com/tkurz/skosjs

On Thu, Nov 8, 2012 at 10:59 AM, Gabriel Vince
 wrote:
> Hi,
>
> I have never tried it, but just a long shot worth a try - you can create
> RDF with Protege and then import it into the stanbol semantic store.
> Just first it could be useful to get its basic classes to start with.
>
>
>
> Best regards
>GAbriel
>
> On Thu, Nov 8, 2012 at 10:53 AM, Rüdiger Kurz  wrote:
>> Hi all,
>>
>> Having a simple UI providing the creation of individual entities that can be
>> used by Stanbol would be really helpful also for small and medium "Use
>> Cases" (ca. 100 categories in a hierarchy)
>>
>> regards Rüdiger
>
>
>
> --
> Gabriel Vince
> Senior Consultant
> Apogado
> http://www.apogado.com



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: User story: Don't want to lose the semantic information I already have inside my CMS

2012-11-09 Thread Rupert Westenthaler
Hi Walter, all

I had already a look at the htmlextractor and I think it is a nice
addition to Stanbol!
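
For anyone who wants to try it: posting an HTML page to the enhancer
should be enough to trigger it, e.g. something like (untested sketch -
assumes a local default launcher, a chain that includes the
htmlextractor engine, and "page.html" as a placeholder for the
document):

curl -X POST -H "Content-type: text/html; charset=UTF-8" \
    -H "Accept: text/turtle" --data-binary "@page.html" \
    http://localhost:8080/enhancer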

I would be interested in an Engine that does not only extract embedded
knowledge, but also keeps the link to the actual position within the
parsed Content. In more detail I would like to link the extracted
knowledge with a fise:Enhancement (e.g. a fise:TextAnnotation) that
selects the annotated part of the content.

This would not only allow having the extracted knowledge in the
metadata of the ContentItem, but also allow EnhancementEngines to
process that information in the same way as if it had been
extracted by another engine (e.g. linking an RDFa annotation about a
Person or Place in the same way as a Person or Place detected by an NER
engine).

Jukka Zitting's presentation "Content extraction with Apache Tika" [1]
at the ApacheCon included a nice example of how to extract the text of
a link. I think this is a nice starting point for such a feature.

Generally I think it would be better to add RDFa and Micro Data support
directly to Tika instead of implementing custom solutions within
Stanbol. WDYT?

best
Rupert

[1] http://www.slideshare.net/jukka/content-extraction-with-apache-tika Slide 19

On Thu, Nov 8, 2012 at 12:31 PM, Walter Kasper  wrote:
> Hi Rüdiger,
>
> RDFa extraction from HTML is part of the htmlextractor engine in Stanbol.
> I would welcome it if you could test it with your OpenCms docs.
>
> Best regards,
>
> Walter
>
>
> Rüdiger Kurz wrote:
>>
>> Hi Staboler,
>>
>> during ApacheCon in Sinsheim I had some interesting conversations with
>> Fabian, Rupert and Anil; as a result I want to summarize one of the discussions
>> as a user story telling a typical requirement for us as a CMS provider.
>>
>> Talking about traditional Content Management Systems and assuming that
>> they don't store semantic information is not correct. For example CMS
>> systems already deliver RDFa annotated HTML, and nearly all systems are
>> providing some tagging/categorizing mechanism. Especially OpenCms provides a
>> generic approach to define a structured content and therefore we have the
>> information that a specific field/item of a content has a specified type and
>> a defined label. E.g. A technology event named ApacheCon takes place in
>> Sinsheim from 05. Nov until 08. Nov 2012 is the information that is already
>> stored in OpenCms. Moreover, OpenCms is able to connect that event with all
>> speakers/persons that will make a presentation on that event, ...
>>
>> What we would like to achieve is not only plain text enhancement; moreover
>> we are interested in telling Stanbol all information and associations
>> we already know. In other words we absolutely don't want to lose the
>> semantic information that already exists in OpenCms.
>>
>> A good starting point would be a REST endpoint providing the ability to
>> receive an RDFa annotated HTML document and then extract the RDFa in order
>> to store it inside the semantic-index/entity-hub/... as I previously
>> suggested on the list under the subject "Extend stanbol content hub for RDFa
>> support". Maybe the content hub is not the right component, but the
>> requirement of RDFa extraction is still existent.
>>
>
>
> --
> Dr. Walter Kasper
> DFKI GmbH
> Stuhlsatzenhausweg 3
> D-66123 Saarbrücken
> Tel.:  +49-681-85775-5300
> Fax:   +49-681-85775-5338
> Email: kas...@dfki.de
> -
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
>
> Vorsitzender des Aufsichtsrats:
> Prof. Dr. h.c. Hans A. Aukes
>
> Amtsgericht Kaiserslautern, HRB 2313
> -
>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Apache Stanbol: technical documentation and disambiguation

2012-11-10 Thread Rupert Westenthaler
Hi Jairo

thanks for your feedback regarding the disambiguation engine

On Fri, Nov 9, 2012 at 6:51 PM, Jairo Sarabia
 wrote:
> I'm Jairo Sarabia, a web developer at Notedlinks S.L. from Barcelona
> (Spain).
> We're very interested in Apache Stanbol and we would like to know how
> Stanbol works internally: how the framework is used, the directory
> structure, and how the configuration files work.
> Is there any documentation about this? Could you send it to me?
>

For the Stanbol Enhancer there is a Developer level documentation available.

http://stanbol.apache.org/docs/trunk/components/enhancer/

is the starting point. The Section "Main Interfaces and Utility
Classes" links to
the description of the different components.

> Meanwhile, we want to thank and congratulate you because we tested the
> disambiguation engine and we liked the improved responses in English,
> although I understand that the quality is still mediocre in some respects.
> Especially with Person and Organization topics, most of the time it only
> detects part of the name, especially in compound words, and this makes the
> disambiguation wrong.

This is probably because the disambiguation Engine does not refine the
fise:selected-text of the fise:TextAnnotation based on disambiguation
results. Can you provide some examples of this behavior so that I can
validate this assumption.

> We would like to know about future plans for the disambiguation engine, and
> whether it can be used for other languages.

Stanbol is a community driven project. The engine itself was developed
by Kritarth Anand in a GSoC project [1] and contributed to Stanbol
with STANBOL-723 [2]. I was mentoring this project.

I do not know Kritarth's plans, but personally I plan to continue
work on this engine as soon as I have finished - meaning re-integrated -
the Stanbol NLP module with the trunk. This work will mainly focus on
making the MLT disambiguation engine configurable and testing that it
works well with the new Stanbol NLP processing module (STANBOL-733).


[1] http://www.google-melange.com/gsoc/project/google/gsoc2012/kritarth/12001
[2] https://issues.apache.org/jira/browse/STANBOL-723

>
> Finally, we would like to know if it is possible to create multilingual
> DBpedia indexes so that the responses link to the DBpedia in the language
> of the text. For example, if the text is in Spanish then the
> literals found have relations to resources of the Spanish DBpedia (not
> English DBpedia resources).
> And if it's possible, could you explain to me how to do it?

The disambiguation-mlt engine is not language specific. Principally it
works with any Entityhub Site and any language where a disambiguation
context is available.

AFAIK the currently hard coded configuration uses the full-text field
(that contains texts in all languages) for the Solr MLT query. The
1GByte Solr index you probably use for disambiguation includes short
abstracts only for English. Long abstracts are not included for any
language. This is also the reason why you are not getting
disambiguation results for languages other than English.

A better suited environment would provide short (or even long)
abstracts for the language you want to disambiguate. The configuration
of the Engine would not use the all-language full text field for the
MLT queries, but instead the language specific one. The reason why
such information is not included in the distributed index is simply
to reduce its size. In addition, when this index was created there was
not yet an engine such as the disambiguation-mlt one that would have
consumed that information.

I have already created a DBpedia 3.8 based index that includes a lot
of information useful for disambiguation for several languages.
However this index in its current form is not easily shared as it is
about ~100GByte (45GByte compressed) in size. In addition I have not
yet had time to validate the index (as indexing only completed shortly
before I left for ApacheCon last week). Anyway I will use this index
as a base for further work on the disambiguation-mlt engine. I will also
share the used Entityhub indexing tool configuration and try to come
up with a modified configuration that is about 10GByte in size but
still useful for disambiguation with the MLT based engine.

best
Rupert

>
> That's all! and Thank you very much again!
>
> Best,
>
> Jairo Sarabia



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Future of Clerezza and Stanbol

2012-11-11 Thread Rupert Westenthaler
t offers doesn't alter the state of the
> system. If you interpret "stateless" very strictly then you would have to
> drop most parts of the felix webconsole as http requests to install bundle
> or configure services aren't stateless. For the user-configuration a simple
> file-based TcProvider would of course be enough so no TDB is needed for
> that.
>
> I think we should see where we want to go as a community. For me the
> important thing is that Stanbol remains very modular. I think statements
> like "Stanbol is no semantic CMS" do not bring us further. It's important
> that the stanbol services can be used as services and that many services
> are stateless. But the contenthub is a component to manage content (the
> entityhub to some degree as well), do we want to mandate a horrible user
> interface just to comply with some catchphrase about what Stanbol is not?
> Or do we want to reduce Stanbol to the be just the Enhancer and let the
> other stuff to other projects?
>
> I'd rather go for the vision of an ecosystem of modular semantic and
> restful osgi components, but if the community wants to focus on the
> enhancer I think a clear statement should be made to avoid unnecessary
> arguments about memory consumption.
>
> Cheers,
> Reto
>
>
> On Fri, Nov 9, 2012 at 10:56 AM, Rupert Westenthaler <
> rupert.westentha...@gmail.com> wrote:
>
>> Hi all,
>>
>> let me share my thoughts. Because this mail is rather long I tried to
>> split it up into three separate sections: (1) RDF, (2) RESTful / Web
>> Interface and (3) other related topics
>>
>>
>> RDF libs:
>> 
>>
>> From the viewpoint of Apache Stanbol one needs to ask the question
>> whether it makes sense to manage its own RDF API. I expect the Semantic Web
>> standards to evolve quite a bit in the coming years and I do have
>> concerns about whether the Clerezza RDF modules will be updated/extended to
>> provide implementations of those. One example of such a situation is
>> SPARQL 1.1, which has been around for quite some time and is still not
>> supported by Clerezza. While I do like the small API, the flexibility
>> to use different TripleStores and that Clerezza comes with OSGI
>> support, I think given the current situation we would need to discuss
>> all options, and those also include a switch to Apache Jena or
>> Sesame. Especially Sesame would be an attractive option as their RDF
>> Graph API [1] is very similar to what Clerezza uses. Apache Jena's
>> counterparts (Model [2] and Graph [3]) are considerably different and
>> more complex interfaces. In addition Jena will only change to
>> org.apache packages with the next major release so a switch before
>> that release would mean two incompatible API changes.
>>
>> My personal opinion is that we should keep using Clerezza for now.
>> Invest some effort to improve the Clerezza RDF modules and then see
>> how it further develops. Such an effort should include
>>
>> *  to implement a SPARQL fast lane (as already discussed with Reto
>> during ApacheCon). Fast lane would allow Clerezza to use the native
>> SPARQL engine of the used Triplestore. Meaning that Clerezza only
>> parses those parts of the SPARQL query needed to understand the RDF graph to
>> execute the Query on. This information is then used to pass the query
>> to the native SPARQL engine via an extended Interface of the
>> TcProvider. The Clerezza SPARQL implementation would only be used in
>> case the TcProvider does not provide a native SPARQL implementation or
>> if the Query spans RDF graphs managed by different TcProvider
>> instances. By that Clerezza users would be able to use any SPARQL
>> feature provided by the used TripleStore.
>> * update to the newest Jena versions (see also STANBOL-621; Peter
>> Ansell's Clerezza fork on github [5] as well as Sebastian Schaffert's
>> Jena bundle used for the Stanbol/LMF integration [5])
>> * finish and release the SingleTdbDatasetTcProvider.java
>> (CLEREZZA-691) as this is important for the Stanbol Ontology Manager
>> component
>> * move the Indexed in-memory graph (CLEREZZA-683) from the Stanbol
>> code base to Clerezza and release it so that we can use it from there
>> in Stanbol
>> * provide a Clerezza JsonLD parser/serializer. This is critical for
>> Stanbol as several CMS use this as preferred RDF serialization.
>>
>> [1]
>> http://www.openrdf.org/doc/sesame2/api/org/openrdf/model/package-summary.html
>> [2]
>> http://jena.apache.org/documentation/javadoc/jena/com/hp/hpl/jena/rdf/model/Model.html
>> [3]
>> http://jena.apache

Re: Future of Clerezza and Stanbol

2012-11-11 Thread Rupert Westenthaler
Hi all ,

On Sun, Nov 11, 2012 at 4:47 PM, Reto Bachmann-Gmür  wrote:
> - clerezza.rdf graudates as commons.rdf: a modular java/scala
> implementation of rdf related APIs, usable with and without OSGi

For me this immediately raises the question: Why should the Clerezza
API become commons.rdf if 90+% (just a guess) of the Java RDF stuff is
based on Jena and Sesame? Creating an Apache commons project based on
an RDF API that is only used by a very low percentage of all Java RDF
applications is not feasible. Generally I see not much room for a
commons RDF project as long as there is not a commonly agreed RDF API
for Java.

On Sun, Nov 11, 2012 at 5:40 PM, Fabian Christ
 wrote:
>
> Having the clerezza platform in Stanbol and thinking in the long term about
> merging and using this stuff is a good choice. This can not be done with
> some simple imports and we should carefully evaluate what will be the right
> way to go in Stanbol.

I would still suggest to do this within its own branch as this makes it
easier to commit/review unfinished stuff. In addition we will need a
branch for making a vote (I guess both for Clerezza and Stanbol) on
the proposed changes.

The following list tries to sum-up discussed points (please refine/complete)

* apache.commons.web:
+ Jersey -> Apache Wink
+ replace Viewable with LDViewable
+ Stanbol Web UI should become optional
* add type based Rendering (at a later time)
* apache.commons.security:
+ move security from Clerezza to Stanbol
+ based on Servlet filter
* Scala: no change needed
* TODO: observe the PermGen space issue
* Shell: no change needed
* Development Tools
* add Bundle-Dev-Tools to shell
* add Maven Archetype support to Stanbol
* Clerezza RDF framework:
? Is the community strong enough to manage its own RDF framework
? Where to manage the code
+ SPARQL 1.1 via fast lane (direct access to the native SPARQL
implementations)
+ Update to the newest Jena versions
+ Merge Indexed in-memory TripleCollections to clerezza
+ finish and release the SingleTdbDatasetTcProvider
+ add support for JSON-LD parsing/serializing
? Clerezza Platform: Can someone make a list of what else is present in Clerezza?

best
Rupert

-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Allow usage of OSGI Services without an OSGI environment (was: Future of Clerezza and Stanbol)

2012-11-12 Thread Rupert Westenthaler
so that it is possible to use those services in environments with
different life-cycle and configuration facilities.

best
Rupert Westenthaler

[1] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/site/managed/
the service Implementation:
org.apache.stanbol.entityhub.site.managed.impl.YardSite
the OSGI component
org.apache.stanbol.entityhub.site.managed.ManagedSiteComponent


On Mon, Nov 12, 2012 at 2:15 AM, Peter Ansell  wrote:
> On 12 November 2012 09:59, Reto Bachmann-Gmür  wrote:
>> Hi Peter and all,
>>
>> Good to read about your experiments. Just a first comment:
>>
>> In addition, I did not want to use OSGI, so I had to make changes in
>>> many cases to allow a completely programmatic instantiation of
>>> components, as some fields were left private with no mutator method
>>> and in some cases no public constructor that could be used to populate
>>> the field programmatically. For all of the good that OSGI may provide
>>> for otherwise complex systems, it is not good Java software
>>> engineering to make fields private.
>>>
>>
>> The clerezza.rdf package should all be usable withouth OSGi. OSGi cannot do
>> magic and set private fields, the compiled classes do have bind and unbind
>> methods for the private fields, these methods are added by the maven felix
>> scr-plugin.  For locating dependencies outside OSGi the META-INF/services
>> method is used so that for example one can add a serialization provider
>> simply by adding it to the classpath without requiring any manual binding.
>
> Sorry, I was under the impression that OSGi could actually do Java
> reflection magic to inject dependencies directly into private fields
> based on annotations without having any alternative method of setting
> the field for regular plain old java users. :)
>
> In general I would like if OSGi classes that currently rely on
> bind/unbind, still offered public mutator methods and a public
> initialise/deinitialise method for any work that needs to be done
> after using the mutator methods. The bind/unbind methodology from
> memory when I was working on Clerezza/Stanbol, seemed to require that
> all of the mutators were run immediately and the initialise was
> automatically run, without offering any other possible sequence.
>
> Additionally, offering public mutators and a public initialise method
> gives the added benefit of compile-time typesafety for plain old java
> users, which a bind method taking a Dictionary
> parameter does not provide.
>
> In addition, from memory I think some of the bind methods were
> protected, and not public, which means they are not directly
> accessible, without resorting to using reflection or subclassing just
> to be able to call bind.
>
> I use META-INF/services heavily in my projects, and I rely on it when
> using Sesame and with my extensions to OWLAPI. I extended OWLAPI to
> use Sesame META-INF/services dependencies to find
> serialisation/parsing providers for OWLAPI based on the Sesame
> parser/writer services that are available on the classpath. However, I
> always try to make sure that the use of the automatically populated
> service registries is optional, so that users can populate their own
> registries from scratch using purely programmatic methods, and they do
> not have to resort to modifying global singleton registries as one
> does when using Jena.
>
> The services that I register in META-INF/services are always factories
> based on interfaces, so that dependencies can be passed into type-safe
> java "createServiceInstance" methods when creating instances of the
> service using the factory instance. This means that it does not matter
> if the java.util.ServiceLoader loads classes in a different order, as
> the actual objects are created from the factories explicitly by users,
> with or without a key to specify which instance of the service they
> require/prefer.
>
> Cheers,
>
> Peter



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: DBpedia indexing ...

2012-11-12 Thread Rupert Westenthaler
Hi Andrea,


On Mon, Nov 12, 2012 at 12:59 PM, Andrea Taurchini  wrote:
> folder /indexing/dist the two files :
>
> 1)dbpedia.solrindex.zip
> 2)org.apache.stanbol.data.site.dbpedia-{version}.jar
>
> I prefer to install it as a new referenced site and not overwriting it to
> previous dbpedia english index so I made the following :
>
> 1) saved the zip in the stanbol/datafiles directory
> 2) installed the bundle using the Apache Felix web console
>
> So I have a new referenced site under http://localhost:8080/entityhub.
> The problem is that if I try to search for an entity such as
>
> curl "
> http://localhost:8080/entityhub/site/ITdbpedia/entity?id=http://dbpedia.org/resource/Paris
> "
>

How have you managed to deploy the Site under "ITdbpedia"? Have you
manually changed the configuration after installing the Bundle?

While this might work (if you correctly adapt the configuration for
the ReferencedSite, Cache and SolrYard), those will still override the
configurations of the default DBpedia index simply because the OSGI
config files provided by the bundle (2) have the same name as the
default dbpedia index config files.

> Problem accessing /entityhub/site/ITdbpedia/find. Reason:
> Unable to initialize the Cache with Yard ITdbpediaIndex! This
> is usually caused by Errors while reading the Cache Configuration from
> the Yard.Caused
> by:java.lang.IllegalStateException: Unable to initialize the
> Cache with Yard ITdbpediaIndex! This is usually caused by Errors while
> reading the Cache Configuration from the Yard.

This usually happens if the SolrYard "ITdbpediaIndex" is configured
for a SolrCore that is not available. Are you sure that a SolrCore
with the name configured for the "Solr Index/Core" property of the
ITdbpediaIndex SolrYard is available?
Assuming you have configured {solr-core} you will need to (a) extract
the "dbpedia.solrindex.zip" file (b) rename the root folder from
"dbpedia" to "{solr-core}" (c) re-create the ZIP file (d) rename it to
"{solr-core}.solrindex.tzp".

- - -

The intended way to change the name of a ReferencedSite created by the
Entityhub Indexing Tool is to change the value of the "name" property
within the
"./indexing/config/indexing.properties" file.

In case of the dbpedia Indexing tool you need to change the
"indexingDestination" from


indexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf,boosts:fieldboosts

to


indexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf:dbpedia,boosts:fieldboosts

NOTE the change from "solrConf" to "solrConf:dbpedia". This is
necessary to tell the SolrYardIndexingDestination component that the
SolrCore configuration is called "dbpedia". By default it assumes that
the name is equal to the value of the "name" property.

Before re-indexing you should also delete the "./indexing/destination"
folder as otherwise you will have both the data of the old index
(dbpedia) and the new one {name} in the destination folder.

- - -

If you want to create an "installable bundle" without reindexing the
data you can follow the following steps:

0. if there are still files in the indexing/resources/rdfdata folder
remove them as they are already imported into the Jena TDB store
(indexing/resources/tdb)
1. make the changes as described above
2. delete the indexing/destination folder (make sure to NOT delete the
indexing/dist folder!)
3. replace the indexing/resource/incoming_links.txt file with an empty
one (make sure to not delete the current version)
4. start the indexing (this should now complete in some seconds as no
entities are indexed).

After that you should see in the indexing/dist folder 4 files

a. "dbpedia.solrindex.zip"
b. "{name}.solrindex.zip" (this is empty - delete it)
c. "org.apache.stanbol.data.site.dbpedia-{version}.jar" (the old
bundle - delete it)
d. "org.apache.stanbol.data.site.{name}-{version}.jar (the new bundle)

(d) is the patched Bundle that you can use to install your custom
dbpedia index without overriding the default one. However to use this
bundle you still need to modify the "dbpedia.solrindex.zip" as described
above: (a) extract the "dbpedia.solrindex.zip" file (b) rename the
root folder from "dbpedia" to "{name}" (c) re-create the ZIP file (d)
rename it to "{name}.solrindex.zip".

I admit that those steps are complex, but they might save you the time
needed to re-create your index.
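
In shell terms the file related steps (0, 2 and 3) roughly translate to
something like the following (untested sketch; paths are relative to the
directory the indexing tool is run in):

rm -f indexing/resources/rdfdata/*
rm -rf indexing/destination
mv indexing/resources/incoming_links.txt indexing/resources/incoming_links.txt.bak
touch indexing/resources/incoming_links.txt
# then re-run the indexing tool as usual (step 4)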

best
Rupert


-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Future of Clerezza and Stanbol

2012-11-13 Thread Rupert Westenthaler
Hi all,

I would like to share some thoughts/comments and suggestions from my side:

ResourceFactory: Clerezza is missing a Factory for RDF resources. I
would like to have such a Factory. The Factory should be obtainable
via the Graph - the Collection of Triples. IMO such a Factory is
required if all resource types (IRI, Bnode, Literal) are represented
by interfaces.

BNodes: If Bnode is an interface than any implementation is free to
internally use a "bnode-id". One argument pro such ids (that was not
yet mentioned) is that such id's allow you to avoid in-memory mappings
for bnodes when wrapping an native implementation. In Clerezza you
currently need to have this Bidi maps.

Triple, Quads: While for some use cases the Triple-in-Graph based API
(Quad := Triple t =
TripleStore#getGraph(context).filter(subject,predicate,object)) is
sufficient this is no longer the case as soon as Applications want to
work with an Graph that contains Quads with several contexts. So I
would vote for having support for Quads.

Dataset,Graph: Out of an User perspective Dataset (how the TripleStore
looks at the Triples) and Graph (how RDF looks at the Triples) are not
so different. Because of that I would like to have a single domain
object fitting for both. The API should focus on the Graph aspects (as
Clerezza does) while still allowing efficient implementations that do
not load all triples into memory (e.g. use closeable iterators)

Immutable Graphs: I had really problems to get this right and the
current Clerezza API does not help with that task (resulting in things
like read-only mutable graphs that are no Graphs as they only provide
a read-only view on a Graph that might still be changed by other
means). I think read-only Graphs (like
Collections.unmodifiableCollection(..)) should be sufficient. IMHO the
use case to protect a returned graph from modifications by the caller
of the method is much more prominent as truly immutable graphs.

SPARQL: I would not deal with parsing SPARQL queries but rather
forward them as is to the underlaying implementation. If doing so the
API would only need to border with result sets. This would also avoid
the need to deal with "Datasets". This is not arguing against a
fallback (e.g. the trick Clerezza does by using the Jena SPARQL
implementation) but in practice efficient SPARQL executions can only
happen natively within the TripleStore. Trying to do otherwise will
only trick users into use cases that will not scale.

best
Rupert

On Tue, Nov 13, 2012 at 9:08 AM, Reto Bachmann-Gmür  wrote:
> On Mon, Nov 12, 2012 at 10:40 PM, Andy Seaborne  wrote:
>
>> On 12/11/12 19:42, Reto Bachmann-Gmür wrote:
>>
>>> On Mon, Nov 12, 2012 at 5:46 PM, Andy Seaborne  wrote:
>>>
>>>  On 09/11/12 09:56, Rupert Westenthaler wrote:
>>>>
>>>>  RDF libs:
>>>>> 
>>>>>
>>>>> Out of the viewpoint of Apache Stanbol one needs to ask the Question
>>>>> if it makes sense to manage an own RDF API. I expect the Semantic Web
>>>>> Standards to evolve quite a bit in the coming years and I do have
>>>>> concern that the Clerezza RDF modules will be updated/extended to
>>>>> provide implementations of those. One example of such an situation is
>>>>> SPARQL 1.1 that is around for quite some time and is still not
>>>>> supported by Clerezza. While I do like the small API, the flexibility
>>>>> to use different TripleStores and that Clerezza comes with OSGI
>>>>> support I think given the current situation we would need to discuss
>>>>> all options and those do also include a switch to Apache Jena or
>>>>> Sesame. Especially Sesame would be an attractive option as their RDF
>>>>> Graph API [1] is very similar to what Clerezza uses. Apache Jena's
>>>>> counterparts (Model [2] and Graph [3]) are considerable different and
>>>>> more complex interfaces. In addition Jena will only change to
>>>>> org.apache packages with the next major release so a switch before
>>>>> that release would mean two incompatible API changes.
>>>>>
>>>>>
>>>> Jena isn't changing the packaging as such -- what we've discussed is
>>>> providing a package for the current API and then a new, org.apache API.
>>>>   The new API may be much the same as the existing one or it may be
>>>> different - that depends on contributions made!
>>>>
>>>>
>>> I didn't know about jena planning to introduce such a common API.
>>>
>>>
>>>> I'd like to hear more about your experiences esp. with Graph API as that
>>>> is supposed to be quite

Re: DBpedia indexing ...

2012-11-13 Thread Rupert Westenthaler
Calculate the incoming_links.txt file for the Italian page links
(http://downloads.dbpedia.org/3.8/it/page_links_it.nt.bz2)


2. Download all the RDF files you need

* basically the same you currently use from
http://downloads.dbpedia.org/3.8/en/ but now from
http://downloads.dbpedia.org/3.8/it/
* language specific labels from other languages you are interested in.
 IMPORTANT: use the
 http://downloads.dbpedia.org/3.8/{lang}/{type}_{lang}.nt.bz2
 files and NOT the
 
http://downloads.dbpedia.org/3.8/{lang}/{type}_en_uris_{lang}.nt.bz2
* include http://downloads.dbpedia.org/3.8/en/instance_types_en.nq.bz2
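
For the Italian case the downloads could e.g. look like this (just an
illustration of the naming pattern; pick the files you actually need):

wget -c http://downloads.dbpedia.org/3.8/it/labels_it.nt.bz2
wget -c http://downloads.dbpedia.org/3.8/it/short_abstracts_it.nt.bz2
wget -c http://downloads.dbpedia.org/3.8/it/long_abstracts_it.nt.bz2
wget -c http://downloads.dbpedia.org/3.8/it/redirects_it.nt.bz2
wget -c http://downloads.dbpedia.org/3.8/en/instance_types_en.nq.bz2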


3. You will need to add the LdpathSourceProcessor to the list of
entityProcessor in the indexing.properties file. The configuration
should look like

entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:entityTypes;org.apache.stanbol.entityhub.indexing.core.processor.LdpathSourceProcessor,ldpath:dbpedia.ldpath;org.apache.stanbol.entityhub.indexing.core.processor.FiledMapperProcessor

4. Create an LDPath [2] program that merges all the data you need with
the Italian dbpedia resource.

[2] http://code.google.com/p/ldpath/

The configuration in (3) refers to the ldpath file "dbpedia.ldpath".
This is a text file that is expected to be located within the
"indexing/config" directory. I will not give an LDpath introduction,
but what you need is something like

1: rdfs:label = (rdfs:label | dbp-ont:wikiPageInterLanguageLink/rdfs:label);
2: skos:altLabel = (^dbp-ont:wikiPageRedirects/rdfs:label |
dbp-ont:wikiPageInterLanguageLink/^dbp-ont:wikiPageRedirects/rdfs:label);
3: rdfs:comment = (rdfs:label | dbp-ont:wikiPageInterLanguageLink/rdfs:label);
4: dbp-ont:abstract = (dbp-ont:abstract |
dbp-ont:wikiPageInterLanguageLink/dbp-ont:abstract);
5: rdf:type = (rdf:type | dbp-ont:wikiPageInterLanguageLink/rdf:type);

NOTE: you will need to remove the '{line-number}: ' before using this ldpath

(1) merges the rdfs:labels of the current Entity (the Italian label)
with labels of entities referenced by inter language links. So this
will ensure that you have labels for all languages for the Italian
entity.
(2) merges labels of redirected pages to the skos:altLabel field. For
this to work you will need to include the
"redirects_{language}.nt.bz2" file for the languages you are interested in
(3) same as for rdfs:labels but for short abstracts
(4) the same but for long abstracts
(5) rdf:type statements might be missing for Italian. So I merge those
as well with types from other languages. I would recommend to only
include types for the English dbpedia


5. Add surfaceForms mapping to the mappings.txt file

# add rdfs:labels and rdfs:labels of redirected sites to dbp-ont:surfaceForm
rdfs:label > dbp-ont:surfaceForm
skos:altLabel > dbp-ont:surfaceForm

Those two mappings ensure that both the rdfs:label and skos:altLabel
values are also stored in the dbp-ont:surfaceForm field. This allows
you to allow the Stanbol Enhancer (or more precisely the
NamedEntityLinkingEngine or KeywordLinkingEngine) to match against
labels of redirected pages by changing the name field form the default
rdfs:label to dbp-ont:surfaceForm


Let me conclude that I have never tried this exact use case myself,
but I have already created several dbpedia indexes with very similar
configurations. When using LDPath during indexing you need to expect
higher indexing times and you might also need to assign more memory to
the indexing tool.

Please also note http://markmail.org/message/67ivlyoxfqad6xoe as you
will most likely need to process dbpedia files for some languages using
the

bzcat ${filename}.bz2 \
| sed 's//\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
| gzip -c > ${filename}.gz
rm -f ${filename}.bz2

best
Rupert

>
> Thanks,
> Andrea
>
>
>
>
>
>
>
> 2012/11/12 Rupert Westenthaler 
>
>> Hi Andrea,
>>
>>
>> On Mon, Nov 12, 2012 at 12:59 PM, Andrea Taurchini 
>> wrote:
>> > folder /indexing/dist the two files :
>> >
>> > 1)dbpedia.solrindex.zip
>> > 2)org.apache.stanbol.data.site.dbpedia-{version}.jar
>> >
>> > I prefer to install it as a new referenced site and not overwriting it to
>> > previous dbpedia english index so I made the following :
>> >
>> > 1) saved the zip in the stanbol/datafiles directory
>> > 2) installed the bundle using the Apache Felix web console
>> >
>> > So I have a new referenced site under http://localhost:8080/entityhub.
>> > The problem is that if I try to search for an entity such as
>> >
>> > curl "
>> >
>> http://localhost:8080/entityhub/site/ITdbpedia/entity?id=http://dbpedia.org/resource/Paris
>> > "
>> >
>>

Re: Creating a spanish index for Stanbol (doubts)

2012-11-13 Thread Rupert Westenthaler
n speak it as a first or second
> language)."@en .
>1881 <http://dbpedia.org/resource/Bishkek> <
> http://www.w3.org/2000/01/rdf-schema#comment> "Bishkek, formerly Pishpek
> and Frunze, is the capital and the largest city of Kyrgyzstan. Bishkek is
> also the administrative centre of Chuy Province which surrounds the city,
> even though the city itself is not part of the province but rather a
> province-level unit of Kyrgyzstan. The name is thought to derive from a
> Kyrgyz word for a churn used to make fermented mare's milk, the Kyrgyz
> national drink."@en .
>
> Could someone tell me why errors like "broken pipe" appear, or if I'm doing
> something wrong? I think I followed the guide correctly. Thanks, and I hope that
> this information can help others that try to create indexes for Apache
> Stanbol, which is a really great project. Nice work!
>
> Best,
> Juan.



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: (Back to the) Future of Clerezza and Stanbol

2012-11-14 Thread Rupert Westenthaler
Hi

I am more with Fabian. The fact is that Clerezza does not have much
activity. I am a Clerezza Committer myself and the reason why I am
rather inactive is that I have enough things to do for Stanbol.
This will also not change much in the future. Moving the Clerezza
modules to Stanbol does not solve this problem. It only moves it
from Clerezza over to Stanbol.

 - RDF libs: If Clerezza is no longer actively developed, then Stanbol
should - in the long term - switch to another RDF framework. RDF is
not a core feature of Stanbol so we should rather use existing stuff
than manage our own. So "if" Clerezza can not graduate, then the
scenario mentioned by Fabian seems likely to me as well.

 - Linked Data Platform: Reto, I guess you have missed this
presentation [1] at ApacheCon. IMO a Linked Data Platform is something
that deserves its own project and as soon as there is such a Platform
available we should use it in Stanbol. This would allow us to remove a
lot of code in Stanbol (especially in the Entityhub) - a good thing as
it allows us to focus more on core features of Stanbol.

best
Rupert

[1] http://www.slideshare.net/Wikier/incubating-apache-linda

On Wed, Nov 14, 2012 at 4:56 PM, Reto Bachmann-Gmür  wrote:
> Thanks for bringing the discussion back to the main issue.
>
> Clerezza could graduate as it is. But imho it would make sense to split
> clerezza into:
>
> - RDF libs
> - Linked Data Platform
>
> Imho the Semantic Platform that should strive for compliance with LDPWG
> standards could merge with Apache Stanbol as in fact for many modules it's
> hard to say where they best belong. For this the clerezza stuff should
> not become a branch but a subproject of stanbol that can be released
> individually if needed. This subproject should become thinner and thinner
> as more stuff is being moved to the stanbol platform as technologies are
> being aligned. Discussing if this would be possible should be independent
> of the RDF API stuff.
>
> Cheers,
> Reto
>
> On Wed, Nov 14, 2012 at 4:18 PM, Fabian Christ > wrote:
>
>> Hi Andy,
>>
>> thanks for bringing the discussion back to the point where it started.
>>
>> Here is my view:
>>
>> If Clerezza can not graduate then the sources should be moved into the
>> archive. The Stanbol community can then freely fork from there and take
>> what it is needed. Other communities who also use Clerezza may do the same
>> to keep their projects working (it is not only a matter for Stanbol).
>> Clerezza committers are more than welcome to join Stanbol and help to
>> migrate the parts of Clerezza that are useful for Stanbol.
>>
>> I agree with Rupert that the best way to do it, is to set up branches to
>> explore different development paths.
>>
>> Maybe Clerezza will be able to graduate if they focus on a smaller set of
>> components. But this is a discussion for the Clerezza dev list.
>>
>> Best,
>>  - Fabian
>>
>>
>> 2012/11/14 Andy Seaborne 
>>
>> > The original issue was about whether migrating (part of) Clerezza into
>> > Stanbol made sense.  The concern raised was resourcing.
>> >
>> > Coupling this to new API design is making the resourcing more of a
>> > problem, not less.
>> >
>> > If I understand the discussion 
>> >
>> > Short term::
>> >
>> > Can Clerezza achieve graduation?
>> >
>> > Or not, does splitting out the part of Clerezza that Stanbol depends on
>> > work? (I sense "yes" with little work needed).  Maintaining such
>> > transferred code was raised as a concern - e.g. SPARQL 1.1 access.
>> >
>> > Long term::
>> >
>> > Where does this leave Stanbol?  Does the maintenance cost concern remain?
>> > or even get worse?
>> >
>> > I don't have sufficient knowledge of the codebase to know what the
>> balance
>> > is between fine-grained API work and query-based access (and update).
>> >
>> > How important is switching between (e.g.) storage providers?
>> >
>> > (local storage - remote would be SPARQL so stanbol-client-code and
>> > other-server can be chosen separately - that's why we do standards!)
>> >
>> > Andy
>> >
>> >
>>
>>
>> --
>> Fabian
>> http://twitter.com/fctwitt
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: REST API for dbpedia-spotlight chain

2012-11-14 Thread Rupert Westenthaler
Hi

When I send your request to http://dev.iks-project.eu:8080 I do get
the expected results.
Can you please try the same?

If you do not get those results then it most likely has to do with the
charset used by the terminal. The command you sent does not explicitly
set the charset so Stanbol will interpret it as "UTF-8" when parsing
the request.

I used the following command

curl -i -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" --data \
"üä Albrecht Václav Eusebius z Valdštejna and pe fights, leading to his \
imprisonment in town prison.[8] Already in February 1600,[8] Albrecht left \
Altdorf for his Grand Tour through the HRE, France and Italy,[10] where he \
studied at the universities of Bologna and Padua." \
http://dev.iks-project.eu:8080/enhancer/chain/dbpedia-spotlight

best
Rupert

On Wed, Nov 14, 2012 at 5:03 PM, Andriy Nikolov
 wrote:
> Dear all,
>
> I am working at fluid Operations AG on one of the IKS Early Adopters
> projects and trying to integrate Stanbol with our Information Workbench
> platform.
>
> Currently I am getting to know the Stanbol API, and I have a question
> related to the dbpedia-spotlight enhancement chain.
> I am trying to retrieve annotations via the REST interface, but I face a
> problem as the output I receive is different from the one I obtain via the
> web interface form.
> Do you know what can be the possible cause and how to deal with it?
> (possibly, it happens when sending input text with non-standard characters).
>
> As an example, I am trying to send the following string (it is meaningless,
> just that it contains non-standard chars and mentions of different entity
> types):
>
> üä Albrecht Václav Eusebius z Valdštejna and pe fights, leading to his
> imprisonment in town prison.[8] Already in February 1600,[8] Albrecht left
> Altdorf for his Grand Tour through the HRE, France and Italy,[10] where he
> studied at the universities of Bologna and Padua.
>
> When sending it via the web interface
> http://localhost:8080/enhancer/chain/dbpedia-spotlight,
> I retrieve a list of text and entity annotations, particularly the one
> mentioning the entity dbpedia:Albrecht_von_Wallenstein (the annotations are
> consistent with what I get from the dbpedia-spotlight demo service itself).
>
> However, when trying to send the same text via the API, e.g., with the
> following command:
>
> curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" --data
> "üä Albrecht Václav Eusebius z Valdštejna and pe fights, leading to his
> imprisonment in town prison.[8] Already in February 1600,[8] Albrecht left
> Altdorf for his Grand Tour through the HRE, France and Italy,[10] where he
> studied at the universities of Bologna and Padua."
> http://localhost:8080/enhancer/chain/dbpedia-spotlight
>
> I get a different set of annotations: particularly, there is no mention of
> dbpedia:Albrecht_von_Wallenstein, but there is a reference to
> dbpedia:Clavichord (extracted from the part "clav" of the name "Václav").
>
> Do you know what can be the reason for this problem? Are there any
> additional request parameters which has to be set?
>
> Thank you!
>
> Best regards,
> Andriy Nikolov



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: EntityHub Referenced Site and redirects

2012-11-15 Thread Rupert Westenthaler
Hi Andrea,

A followup:

(1) Sharing your indexes:

This would be great! I talked with a colleague of mine. Most likely we
will add an FTP upload folder to the dev.iks-project.eu server. For
that we will need to add more HDD space to this virtual host, which
might take some more time to accomplish. I will notify you as soon as
we are ready.

(2) dbp-ont:surfaceForm

I recommended to you to copy labels of redirected pages to the
"dbp-ont:surfaceForm" field. In the meantime I made some tests with an
index built like that. The results were really bad; because of that I
must revoke this recommendation!

The reason for that is that the scoring algorithm of Solr is affected
by the multi-valued "dbp-ont:surfaceForm" field. e.g. for
dbpedia:Paris you have ~35 "dbp-ont:surfaceForm" values where only
about ~15 contain "Paris". So if you now make a query for Paris in
this field

(((@en/dbp\-ont\:surfaceForm/:"paris")))

you will notice that dbpedia:Paris is not within the top 10 search
results. Instead Entities like "Paris Barclay" are listed because they
do have only a single value for "dbp-ont:surfaceForm" and therefore
the match for "Paris" is much more relevant.

This means that the current index-layout where URIs of redirected
pages are represented as their own Entities within the index is much better
suited for entity extraction.

On Mon, Nov 5, 2012 at 10:59 AM, Andrea Di Menna  wrote:
> Hi Rupert,
> I would be more than happy to share the indexes.
> I have also created one including redirects by forcibly inserting
> redirecting entities into the incoming_links.txt file.

Do you have a script for creating such an incoming_links.txt file?
Because this would be very useful for properly creating indexes that
include Entities of redirected pages.

best
Rupert

> Redirects have been assigned the same entity rank as the entities they
> redirect to.
>
> Please let me know how and where to store those indexes.
>
> Cheers
>
> 2012/11/3 Rupert Westenthaler 
>
>> Hi,
>>
>> I have started to play around with indexing dbpedia 3.8 myself as well
>> and I con confirm that one has to preprocess nearly all files. Because
>> of that I have written a nice shell script that downloads, processes
>> and re-compresses the RDF files
>>
>> # array syntax is ({item-1} {items-2} ... {item-n})
>> # names need to include the language path segment!
>> files=(dbpedia_3.8.owl \
>> en/labels_en.nt \
>> {all-the-other-files-you-need} \
>> )
>>
>> for i in "${files[@]}"
>> do
>> :
>> # clean possible encoding errors
>> filename=$(basename $i)
>> if [ ! -f ${filename}.gz ]
>> then
>> url=${DBPEDIA}/${i}.bz2
>> wget -c ${url}
>> echo "cleaning $filename ..."
>> #corrects encoding and recompress using gz
>> #gz is used because it is faster
>> bzcat ${filename}.bz2 \
>> | sed 's//\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
>> | gzip -c > ${filename}.gz
>> rm -f ${filename}.bz2
>> fi
>> done
>>
>> > the SolrIndex zip file is about 3.5GB.
>> > I am using a min-score=2 in minincoming.properties
>> > I think the 3.7 index file from the IKS project downloads site was
>> created
>> > with min-score=10.
>>
>> The dbpedia 3.7 index was built by ogrisel, but I think you are right.
>> 3.5GByte for all entities with >=2 incoming links (should be about
>> 4 million entities) sounds reasonable. If you want to share your index
>> with the Stanbol community I am sure we can find a server to host it.
>>
>>
>> Note about languages:
>>
>> while it is easy to include labels, comments and abstracts of additional
>> languages it is not so easy to add proper Solr field definitions for
>> languages. While there is a great wiki page that provides all the
>> necessary links [1] I find it still very hard to add configurations
>> for languages I do not understand. So if someone can help with that I
>> am happy to improve the Solr schemas used by the Entityhub (and the
>> Entityhub Indexing tool)!
>>
>>
>> Upgrading the default DBpedia index:
>>
>> After the ApacheCon I will work on replacing the default dbpedia index
>> used with the Stanbol launchers with a dbpedia 3.8 based version (the
>> current one is still based on 3.6). This will need some time because I
>> expect that I will need to adapt a lot of unit/integration tests
>> affected by data changes.
>>
>> [1] http://wiki.apache.org/solr/LanguageAnalysis
>&g

Re: EntityHub Referenced Site and redirects

2012-11-15 Thread Rupert Westenthaler
Hi (again)

> (2) dbp-ont:surfaceForm
>
> I recommended to you to copy labels of redirected pages to the
> "dbp-ont:surfaceForm" field. In the meantime I made some tests with an
> index build like that. The results where really bad because of that I
> must revoke this recommendation!
>
> The reason for that is that the scoring algorithm of Solr is affected
> by the multi-valued "dbp-ont:surfaceForm" field. e.g. for
> dbpedia:Paris you have ~35 "dbp-ont:surfaceForm" values where only
> about ~15 contain "Paris". So if you now make a query for Paris in
> this field
>
> (((@en/dbp\-ont\:surfaceForm/:"paris")))
>
> you will notice that dbpedia:Paris is not within the top 10 search
> results. Instead Entities like "Paris Barclay" are listed because they
> do have only a single value for "dbp-ont:surfaceForm" and therefore
> the match for "Paris" is much more relevant.

Just talked about this problem with Sebastian Schaffert. He suggested
to try setting

omitNorms="true"

for all fields used for labels within the Entityhub. This should have
the effect that Entities with a lot of "dbp-ont:surfaceForm" values
are no longer penalized by the Solr ranking algorithm. Testing that
will require some time.
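
A simple way to compare the ranking before/after such a schema change is
the "/find" endpoint of the referenced site, e.g. (sketch; this assumes
the default mapping of the "dbp-ont" prefix to http://dbpedia.org/ontology/):

curl -X POST -d "name=Paris" \
    -d "field=http://dbpedia.org/ontology/surfaceForm" \
    -d "limit=10" \
    http://localhost:8080/entityhub/site/dbpedia/find

If dbpedia:Paris makes it back into the top results the penalty should be
gone.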

best
Rupert


>
> This means that the current index-layout where URIs of redirected
> pages are represented as own Entities within the index is much better
> suited for entity extraction.


-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: REST API for dbpedia-spotlight chain

2012-11-15 Thread Rupert Westenthaler
Hi Andriy

So if our results differ from each other, then it is likely that the
reason for your issue is the charset used by the command.

Can you please try to copy the text into a {file} and then use

curl -v -X POST -H "Accept: text/turtle" -H \
"Content-type: text/plain" --data "@{file}" \
http://dev.iks-project.eu:8080/enhancer/chain/dbpedia-spotlight

if the file does not use UTF-8 you will need to pass the charset in
the Content-type header

Content-type: text/plain;charset={charset}
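
If you are not sure about the encoding of the {file} you can e.g. check
and convert it first (sketch):

file -i {file}
iconv -f ISO-8859-1 -t UTF-8 {file} > {file}.utf8

and then send the converted file with the command above.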

best
Rupert

On Thu, Nov 15, 2012 at 2:08 PM, Andriy Nikolov
 wrote:
> Hi Rupert,
>
> Thanks a lot for your reply.
> Actually, when I try it with http://dev.iks-project.eu:8080, I still get the
> same effect: the output I get when submitting through the web interface and
> via the API (I used the command from your mail) are different:
> in one case (using the web form), there is a mention of
> <http://dbpedia.org/resource/Albrecht_von_Wallenstein> (correct), while via
> the API there isn't, but there is a mention of
> http://dbpedia.org/resource/Clavichord (wrong).
>
> Best regards,
> Andriy
>
>
> On Wed, Nov 14, 2012 at 6:50 PM, Rupert Westenthaler
>  wrote:
>>
>> Hi
>>
>> When I send your request to http://dev.iks-project.eu:8080 I do get
>> the expected results.
>> Can you please try the same.
>>
>> If you do not get those results than it has most likely todo with the
>> charset used by the terminal. The command you sent does not explicitly
>> set the charset so Stanbol will interpret it as "UTF-8" when parsing
>> the request.
>>
>> I used the following command
>>
>> curl -i -X POST -H "Accept: text/turtle" -H "Content-type: text/plain"
>> --data \
>> "üä Albrecht Václav Eusebius z Valdštejna and pe fights, leading to his \
>> imprisonment in town prison.[8] Already in February 1600,[8] Albrecht left
>> \
>> Altdorf for his Grand Tour through the HRE, France and Italy,[10] where he
>> \
>> studied at the universities of Bologna and Padua." \
>> http://dev.iks-project.eu:8080/enhancer/chain/dbpedia-spotlight
>>
>> best
>> Rupert
>>
>> On Wed, Nov 14, 2012 at 5:03 PM, Andriy Nikolov
>>  wrote:
>> > Dear all,
>> >
>> > I am working at fluid Operations AG on one of the IKS Early Adopters
>> > projects and trying to integrate Stanbol with our Information Workbench
>> > platform.
>> >
>> > Currently I am getting to know the Stanbol API, and I have a question
>> > related to the dbpedia-spotlight enhancement chain.
>> > I am trying to retrieve annotations via the REST interface, but I face a
>> > problem as the output I receive is different from the one I obtain via
>> > the
>> > web interface form.
>> > Do you know what can be the possible cause and how to deal with it?
>> > (possibly, it happens when sending input text with non-standard
>> > characters).
>> >
>> > As an example, I am trying to send the following string (it is
>> > meaningless,
>> > just that it contains non-standard chars and mentions of different
>> > entity
>> > types):
>> >
>> > üä Albrecht Václav Eusebius z Valdštejna and pe fights, leading to his
>> > imprisonment in town prison.[8] Already in February 1600,[8] Albrecht
>> > left
>> > Altdorf for his Grand Tour through the HRE, France and Italy,[10] where
>> > he
>> > studied at the universities of Bologna and Padua.
>> >
>> > When sending it via the web interface
>> > http://localhost:8080/enhancer/chain/dbpedia-spotlight,
>> > I retrieve a list of text and entity annotations, particularly the one
>> > mentioning the entity dbpedia:Albrecht_von_Wallenstein (the annotations
>> > are
>> > consistent with what I get from the dbpedia-spotlight demo service
>> > itself).
>> >
>> > However, when trying to send the same text via the API, e.g., with the
>> > following command:
>> >
>> > curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain"
>> > --data
>> > "üä Albrecht Václav Eusebius z Valdštejna and pe fights, leading to his
>> > imprisonment in town prison.[8] Already in February 1600,[8] Albrecht
>> > left
>> > Altdorf for his Grand Tour through the HRE, France and Italy,[10] where
>> > he
>> > studied at the universities of Bologna and Padua."
>> > http://localhost:8080/enhancer/chain/dbpedia-spotlight

Re: Stopping the framework ...

2012-11-15 Thread Rupert Westenthaler
Hi

On Thu, Nov 15, 2012 at 2:36 PM, Andrea Taurchini  wrote:
> Dear All,
> maybe I'm missing (again) something, but if I stop the framework, no matter
> if through Felix Web Console or CTRL+C, configurations go to hell on the
> next restart.

No, you are not missing anything. All those ways to shut down Stanbol should
work just fine. I cannot remember ever having a problem like that.

> Even the default enhancement chain will stop working since the order or the
> engine is changed to :
>
>- *metaxa* ( optional , currently not available)
>- *entityhubExtraction* ( required , currently not available)
>- *tika* ( optional , TikaEngine)
>- *langdetect* ( required , LanguageDetectionEnhancementEngine)
>- *ner* ( required , NamedEntityExtractionEnhancementEngine)
>- *dbpediaLinking* ( required , NamedEntityTaggingEngine)
>

that "not available" engines are listed first is expected for the
WeightedChain. This chain determines the order based on information
provided by the Engine. So if an Engine is not available such
Information are not available. As the order does not matter for
Engines that are not available my decision was to list them first.

> not to mention the fact that my own configurations (topic classifier ...)
> is completely removed ...
>

Somehow it looks like OSGI is not able to write files to the disk.
Can you please check the Stanbol log file
{launcher-dir}/stanbol/logs/error.log to see if you can find related
information.
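
Something like (sketch)

grep -iE "exception|error" {launcher-dir}/stanbol/logs/error.log | tail -n 50

might already show whether the framework fails to write its configuration
on shutdown.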

best
Rupert

> Thanks,
> Andrea



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: [STANBOL-798] Vocabularies/ontologies not available according the best practices

2012-11-16 Thread Rupert Westenthaler
Hi all,

just to let you know: I have started the process to fix this for

http://fise.iks-project.eu/ontology/

the URI used by the Stanbol Enhancement Structure Ontology.

For the apache.stanbol.org namespaces we will need to copy the files
into the correct directories and then configure some things in the
.htaccess files. Here the question is whether the use of .htaccess is
possible (I have not yet had time to look this up ... so please no RTFM
responses ^^)

best
Rupert

On Fri, Nov 16, 2012 at 1:24 PM, Fabian Christ
 wrote:
> 2012/11/16 Sergio Fernández 
>
>> that should be quite easy to solve
>
>
> Do you have patch for that easy one? ;)
>
> --
> Fabian
> http://twitter.com/fctwitt



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Stanbol indexing tool

2012-11-16 Thread Rupert Westenthaler
The TDB database is located under

{indexing-working-dir}/indexing/resources/tdb

If you do have a TDB store with the required data, then you can
provide it under that directory. Just make sure that the

{indexing-working-dir}/indexing/resources/rdfdata

folder is empty when you start the tool. Otherwise the RDF files in
that folder would get imported.
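
Roughly something like the following (sketch - the paths depend on your
setup):

mkdir -p {indexing-working-dir}/indexing/resources/tdb
cp -r /path/to/your/tdb-store/* {indexing-working-dir}/indexing/resources/tdb/
# make sure nothing is left that would get (re)imported
rm -f {indexing-working-dir}/indexing/resources/rdfdata/*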

On Fri, Nov 16, 2012 at 2:18 PM, Andrea Di Menna  wrote:
> The first part of the process seems slower on my machine w.r.t. to
> loading triples in a TDB using directly tdbloader2 (Note: I am using
> the latest available version of Jena when running tdbloader2 standalone
> - namely 2.7.4).

Yes the indexing tool uses

com.hp.hpl.jena:jena:2.6.3
com.hp.hpl.jena:arq:2.8.5
com.hp.hpl.jena:tdb:0.8.7

but you could still try to use your datastore. Maybe they have not
changed the binary format of the files.

If not let me know and I will try to update the Jena Version used by
the Indexing Tool

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Beginning Apache Stanbol

2012-11-17 Thread Rupert Westenthaler
The stable launcher only contains the
Enhancer and Entityhub. This would explain why you are only seeing
these two exceptions. Also, this exception is expected during startup,
as the Refactor Engine is included in the stable launcher but is
missing the dependencies to the OntologyManager and Rules components.


However

> I stopped the stanbol instance and tried the '"full" build.
> java -Xmx1g -XX:MaxPermSize=256m -jar 
> full/target/org.apache.stanbol.launchers.full-0.10.0-SNAPSHOT.jar
> instead, but got
> "ERROR: Bundle org.apache.stanbol.enhancer.engines.refactor [110]: Error 
> starting 
> slinginstall:org.apache.stanbol.enhancer.engines.refactor-0.10.0-SNAPSHOT.jar 
> (org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.stanbol.enhancer.engines.refactor [110]: Unable to resolve 110.0: 
> missing requirement [110.0] package; 
> (&(package=org.apache.stanbol.ontologymanager.servicesapi.collector)(version>=0.10.0)(!(version>=1.0.0
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.stanbol.enhancer.engines.refactor [110]: Unable to resolve 110.0: 
> missing requirement [110.0] package; 
> (&(package=org.apache.stanbol.ontologymanager.servicesapi.collector)(version>=0.10.0)(!(version>=1.0.0)))"
>

this is not expected and indicates some issue with the ontology
manager. But as with the integration tests I was also unable to
reproduce this.

Also a look at the Exported packages of the
"org.apache.stanbol.ontologymanager.servicesapi" shows that this
module correctly exports
"org.apache.stanbol.ontologymanager.servicesapi.collector,version=0.10.0.SNAPSHOT"
and also that "org.apache.stanbol.enhancer.engines.refactor" imports
"org.apache.stanbol.ontologymanager.servicesapi.collector,version=0.10.0.SNAPSHOT
from org.apache.stanbol.ontologymanager.servicesapi (123)"

You can check that yourself via the Apache Felix Webconsole under
http://localhost:8080/system/console/bundles
Also the "http://localhost:8090/system/console/depfinder"; (packages)
tab is useful to check for packages that cause errors like that (in
your case org.apache.stanbol.ontologymanager.servicesapi.collector)

Can you please validate this in your launcher. Especially if
"org.apache.stanbol.ontologymanager.servicesapi" exports
"org.apache.stanbol.ontologymanager.servicesapi.collector".


best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Beginning Apache Stanbol

2012-11-18 Thread Rupert Westenthaler
You can re-trigger the initialisation of the dbpedia site by
stopping/starting the "org.apache.stanbol.data.sites.dbpedia" bundle
(e.g. via the Felix Webconsole under
http://localhost:8080/system/console/bundles). If this does not solve
your issue it at least makes it easier to find the reason by looking
at the logging during the startup.

Additional information is also available in a file called
"dbpedia.solrindex.ref" (best use find to search for the file, as it is
hard to explain where it is located). The file is a normal text
file with the current state of the site. If the state is ERROR it
should also contain the exception that caused the initialization to
fail.
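
On a Unix-like system something like

find {stanbol-working-dir}/stanbol -name "dbpedia.solrindex.ref"

should locate it.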

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Beginning Apache Stanbol

2012-11-18 Thread Rupert Westenthaler
Hi Jonathan,

now the 3rd and last part answering your questions

>
> I'd appreciate help with the following:
> - how to enable dbpediaLinking?

If there were no errors this would be enabled by default. The
only thing you might want to do is to install a full dbpedia index (as
the one included in Stanbol only contains ~40k Entities).

An index based on dbpedia 3.7 (spring 2011) can be found at
http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.7/
There is also a new dbpedia 3.8 (summer 2012) contributed by "Andrea
Di Menna" available at
http://dev.iks-project.eu/downloads/stanbol-indices/upload/dbpedia-3.8/
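
To install one of those full indexes you roughly need to copy the
downloaded archive into the datafiles folder of your launcher and restart
the dbpedia site bundle, e.g. (sketch):

wget http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.7/dbpedia.solrindex.zip
cp dbpedia.solrindex.zip {stanbol-working-dir}/stanbol/datafiles/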

> - should the integration tests be passing?

Definitely yes. This is also checked by the Stanbol Jenkins Server
after every commit to the trunk (see
https://builds.apache.org/job/stanbol-trunk-1.6/)

> - how to enable contenthub

You will need to use the full launcher to have the contenthub (or
build a custom launcher as described by
http://stanbol.apache.org/production/your-launcher.html)

> - how to enable additional engines (e.g. 
> https://github.com/insideout10/wordlift-stanbol has Freebase and Schema.org, 
> but I'm not clear on how to include that code in the Stanbol src.
>

I have not yet had time to look specifically at this contribution. But
generally you just need

* to add a bundle to your Stanbol Launcher
* to provide a configuration for the Component(s) provided by those bundles

You can do all this via the Felix Webconsole
(http://localhost:8090/system/console): install bundles via the
"Bundles" tab and provide configurations via the "Configuration" tab.
Sometimes you will also need to manually start configured services via
the "Components" tab.

Another possibility is to use the Sling FileInstaller (just create
the "{stanbol-working-dir}/stanbol/fileinstall" directory if it does not yet
exist) and then copy the bundles and configurations into this directory.
Stanbol will pick them up and install them automatically from this location.
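
e.g. (sketch - the bundle and configuration file names are only examples):

mkdir -p {stanbol-working-dir}/stanbol/fileinstall
cp my.enhancement.engine-1.0.0.jar {stanbol-working-dir}/stanbol/fileinstall/
cp org.example.MyEngine-myinstance.config {stanbol-working-dir}/stanbol/fileinstall/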

best
Rupert


--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Beginning Apache Stanbol

2012-11-18 Thread Rupert Westenthaler
w3.org/2000/01/rdf-schema#
> 18.11.2012 22:59:03.669 *DEBUG* [937106871@qtp-2017995693-10] 
> org.apache.stanbol.entityhub.yard.solr.impl.SolrFieldMapper  > prefix: 
> entityhub value: http://stanbol.apache.org/ontology/entityhub/entityhub#
> 18.11.2012 22:59:03.670 *WARN* [937106871@qtp-2017995693-10] 
> org.apache.felix.http.jetty /entityhub/entity 
> (java.lang.IllegalStateException: Unknown prefix foaf (parsed from field 
> foaf:schoolHomepage)!) java.lang.IllegalStateException: Unknown prefix foaf 
> (parsed from field foaf:schoolHomepage)!
[..]
> 18.11.2012 22:59:03.686 *DEBUG* [Event Job Manager Observer Daemon] 
> org.apache.stanbol.enhancer.jobmanager.event.impl.EnhancementJobHandler  -- 
> No active Enhancement Jobs
> 18.11.2012 22:59:04.545 *DEBUG* [Timer-1] 
> org.apache.sling.installer.provider.file.impl.FileMonitor Checking 
> /Users/jonathan/Documents/HuntDesign/Projects/stanbol/launchers/stanbol/fileinstall
> 18.11.2012 22:59:04.545 *DEBUG* [Timer-1] 
> org.apache.sling.installer.provider.file.impl.FileMonitor Checking 
> /Users/jonathan/Documents/HuntDesign/Projects/stanbol/launchers/stanbol/fileinstall/org.apache.stanbol.enhancer.engines.geonames.impl.LocationEnhancementEngine.config
>

I think these logs are from making the

> curl -X GET 
> "http://localhost:8080/entityhub/entity?id=http://huntdesign.co.nz/person/DavidBanner";

request and not from start/stopping the
"org.apache.stanbol.data.sites.dbpedia" bundle. Is this assumption
correct?

BTW, has restarting the "org.apache.stanbol.data.sites.dbpedia" bundle solved the issue?

>> Additional information are also available in a file called
>> "dbpedia.solrindex.ref" (best use find to search for the file as it is
>> hard to explain the where it is located). the file is a normal text
>> file with the current state of the site. If the State is Error it
>> should also contain the exception that caused the initialization to
>> fail.
>
> ~/Documents/HuntDesign/Projects/stanbol/data/sites/dbpedia/target/classes/org/apache/stanbol/data/site/dbpedia/default/config/dbpedia.solrindex.ref
>

this is the file as included in the stanbol distribution. The file I
was referring to should be under
"{stanbol-launcher-dir}/stanbol/felix/"

> shows
>
> Name=SolrIndex for dbpedia
> Description=DBpedia.org
> Index-Archive=dbpedia.solrindex.zip,dbpedia_43k.solrindex.zip
> Download-Location=http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.7/dbpedia.solrindex.zip
>

The file within the stanbol launcher directory should contain
additional information like

Directory=dbpedia-2012.11.18
Index-Name=dbpedia
State=ACTIVE
Archive=dbpedia_43k.solrindex.zip

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Tika content type detection

2012-11-19 Thread Rupert Westenthaler
Hi Andriy,

On Mon, Nov 19, 2012 at 10:25 AM, Andriy Nikolov
 wrote:
> Dear all,
>
> I have a question about the use of tika engine to detect the content-type
> of uploaded document. Does it require any special configuration of stanbol?

No, it does not, as Stanbol directly forwards the parsed content to the
Tika Mime Magic Detection if the Content-Type header is not set in the
request.

> Problem accessing /enhancer/engine/tika. Reason:
> Enhancement Chain failed because of required Engine 'tika' failed
> with Message: Unable to process ContentItem
> '<urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f>'
> with Enhancement Engine 'tika' because the engine is currently not
> active(Reason: Unexpected Exception while processing ContentItem
> <urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f> with
> EnhancementJobManager: class
> org.apache.stanbol.enhancer.jobmanager.event.impl.EventJobManagerImpl)!Caused
> by:org.apache.stanbol.enhancer.servicesapi.ChainException:
> Enhancement Chain failed because of required Engine 'tika' failed with
> Message: Unable to process ContentItem
> '<urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f>'
> with Enhancement Engine 'tika' because the engine is currently not
> active(Reason: Unexpected Exception while processing ContentItem
> <urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f> with
> EnhancementJobManager: class
> org.apache.stanbol.enhancer.jobmanager.event.impl.EventJobManagerImpl)!
> at
> org.apache.stanbol.enhancer.jobmanager.event.impl.EventJobManagerImpl.enhanceContent(EventJobManagerImpl.java:153)
> at
> org.apache.stanbol.enhancer.jersey.resource.AbstractEnhancerResource.enhance(AbstractEnhancerResource.java:233)
> at
> org.apache.stanbol.enhancer.jersey.resource.AbstractEnhancerResource.enhanceFromData(AbstractEnhancerResource.java:215)
>

This is the reason why it does not work for you. However, to determine
the problem I would need the whole stack trace including all 'caused
by' sections.

The other error referenced in your mail seems unrelated.

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Stopping the framework ...

2012-11-19 Thread Rupert Westenthaler
Hi,

you can not send attachments via the list. Feel free to send it directly to me.

On Mon, Nov 19, 2012 at 11:09 AM, Andrea Taurchini  wrote:
> 1) clean stanbol folder
> 2) launch java -Xmx1g -jar -XX:MaxPermSize=128m
> stanbol_src\launchers\full\target\org.apache.stanbol.launchers.full-0.10.0-SNAPSHOT.jar

-XX:MaxPermSize=128m is not enough for the full launcher as it
requires ~200 MByte. You should use -XX:MaxPermSize=256m instead.
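
So the startup command should look like this (one line):

java -Xmx1g -XX:MaxPermSize=256m -jar stanbol_src\launchers\full\target\org.apache.stanbol.launchers.full-0.10.0-SNAPSHOT.jar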

With only 128m of PermGen memory I would expect the full launcher to
throw OutOfMemory exceptions during startup. If this happens the
initial configuration (created during the first startup) will be
incomplete and corrupted. This could indeed explain the issues you are
experiencing.

best
Rupert


> 3) once fully active ... stop the service with CTRL^C
> 4) wait for stopping the service
> 5) relaunch the same startup command
> 6) verify that entityhubExtraction enhancer is no more available
>
> Thanks for your help.
>
>
> Best,
> Andrea
>
>
>
>
>
>
> 2012/11/15 Rupert Westenthaler 
>>
>> Hi
>>
>> On Thu, Nov 15, 2012 at 2:36 PM, Andrea Taurchini 
>> wrote:
>> > Dear All,
>> > maybe I'm missing (again) something, but if I stop the framework, no
>> > matter
>> > if through Felix Web Console or CTRL+C, configurations go to hell on the
>> > next restart.
>>
>> No you are missing nothing. All those ways to shutdown Stanbol should
>> work just fine. I can not remember having ever a problem like that.
>>
>> > Even the default enhancement chain will stop working since the order or
>> > the
>> > engine is changed to :
>> >
>> >- *metaxa* ( optional , currently not available)
>> >- *entityhubExtraction* ( required , currently not available)
>> >- *tika* ( optional , TikaEngine)
>> >- *langdetect* ( required , LanguageDetectionEnhancementEngine)
>> >- *ner* ( required , NamedEntityExtractionEnhancementEngine)
>> >- *dbpediaLinking* ( required , NamedEntityTaggingEngine)
>> >
>>
>> that "not available" engines are listed first is expected for the
>> WeightedChain. This chain determines the order based on information
>> provided by the Engine. So if an Engine is not available such
>> Information are not available. As the order does not matter for
>> Engines that are not available my decision was to list them first.
>>
>> > not to mention the fact that my own configurations (topic classifier
>> > ...)
>> > is completely removed ...
>> >
>>
>> Somehow it looks like as OSGI is not able to write files to the disc.
>> Can you please check the Stanbol log file
>> {launcher-dir}/stanbol/logs/error.log if you can find related
>> information.
>>
>> best
>> Rupert
>>
>> > Thanks,
>> > Andrea
>>
>>
>>
>> --
>> | Rupert Westenthaler rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11 ++43-699-11108907
>> | A-5500 Bischofshofen
>
>



--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Tika content type detection

2012-11-19 Thread Rupert Westenthaler
Hi Andriy, all

sending this again to the list as others might be affected/interested
as well. Especially Suat as he is currently fighting an very similar
issue in the CMS adapter

The assumption that Tika may miss XML Beans is wrong as Tika includes
xmlbeans 2.3.

java.lang.NoClassDefFoundError: Could not initialize class
org.apache.xmlbeans.XmlBeans

Errors like that indicate a problem during the initialization of a class.
This includes the initialization of static variables (or static
blocks) in the mentioned class and all super classes. Looking at the
source of XmlBeans shows that in this case nearly everything is called
during static initialization :(

However, when it comes to external dependencies there are only two, and
in that context only the dependency on
javax.xml.stream.XMLStreamReader seems relevant.

javax.xml.stream.XMLStreamReader is part of the "stax-api". This API
is included in JDK 1.6. Stanbol imports the stax-api twice

1. via the JDK, because the Stanbol framework fragment lists all the
packages of the stax-api
2. via the 
org.apache.servicemix.specs:org.apache.servicemix.specs.stax-api-1.0:2.1.0

This could indeed cause the error you are experiencing. I have created
a launcher with a preliminary fix for that. You can find it under [1].
Can you please check whether it solves your issue?

Please use "-Xmx1024m -XX:MaxPermSize=256M" when starting the full launcher.

best
Rupert

[1] http://dev.iks-project.eu/downloads/stanbol-launchers/tmp/stax-api-debug/

On Mon, Nov 19, 2012 at 1:19 PM, Andriy Nikolov
 wrote:
> Thanks a lot!
> The error message is attached (seems like XMLBeans is not on classpath - is
> this something to configure separately?).
>
> Best,
> Andriy
>
>
> On Mon, Nov 19, 2012 at 12:52 PM, Rupert Westenthaler
>  wrote:
>>
>> Hi Andriy,
>>
>> On Mon, Nov 19, 2012 at 10:25 AM, Andriy Nikolov
>>  wrote:
>> > Dear all,
>> >
>> > I have a question about the use of tika engine to detect the
>> > content-type
>> > of uploaded document. Does it require any special configuration of
>> > stanbol?
>>
>> No it does not as Stanbol directly forwards the parsed content to the
>> Tika Mime Magic Detction if the Content-Type header is not set in the
>> request.
>>
>> > Problem accessing /enhancer/engine/tika. Reason:
>> > Enhancement Chain failed because of required Engine 'tika'
>> > failed
>> > with Message: Unable to process ContentItem
>> > '<urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f>'
>> > with Enhancement Engine 'tika' because the engine is currently not
>> > active(Reason: Unexpected Exception while processing ContentItem
>> > <urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f>
>> > with
>> > EnhancementJobManager: class
>> >
>> > org.apache.stanbol.enhancer.jobmanager.event.impl.EventJobManagerImpl)!Caused
>> > by:org.apache.stanbol.enhancer.servicesapi.ChainException:
>> > Enhancement Chain failed because of required Engine 'tika' failed with
>> > Message: Unable to process ContentItem
>> > '<urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f>'
>> > with Enhancement Engine 'tika' because the engine is currently not
>> > active(Reason: Unexpected Exception while processing ContentItem
>> > <urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f>
>> > with
>> > EnhancementJobManager: class
>> > org.apache.stanbol.enhancer.jobmanager.event.impl.EventJobManagerImpl)!
>> > at
>> >
>> > org.apache.stanbol.enhancer.jobmanager.event.impl.EventJobManagerImpl.enhanceContent(EventJobManagerImpl.java:153)
>> > at
>> >
>> > org.apache.stanbol.enhancer.jersey.resource.AbstractEnhancerResource.enhance(AbstractEnhancerResource.java:233)
>> > at
>> >
>> > org.apache.stanbol.enhancer.jersey.resource.AbstractEnhancerResource.enhanceFromData(AbstractEnhancerResource.java:215)
>> >
>>
>> This is the reason why it does not work for your. However to determine
>> the problem I would need the whole stack trace including all 'caused
>> by' sections.
>>
>> The other error referenced in your mail seems unrelated.
>>
>> best
>> Rupert
>>
>> --
>> | Rupert Westenthaler rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11 ++43-699-11108907
>> | A-5500 Bischofshofen
>
>
>
>
> --
>
> Dr Andriy Nikolov
>
> R&D Engineer
>
> F +49 6227 3849-565
>
> 

Re: Stopping the framework ...

2012-11-19 Thread Rupert Westenthaler
Hi,

even after a detailed inspection of the log file you provided I was not
able to find an indication of any problem. Based on the logging the
Stanbol instance was started, stopped and then started and stopped
again. The logs of the first and the second startup are really
similar.

The only thing that might cause problems is the "java.io.IOException:
Unable to establish loopback connection" but as you mentioned in your
last mail solving this has also not solved your issue.

On Mon, Nov 19, 2012 at 4:17 PM, Andrea Taurchini  wrote:
> I should install stanbol on a windows server ... so it is not possible ?

AFAIK there are several users that do use Stanbol on Windows. Only
yesterday a Blog about MakoLab using Stanbol with their Windows CMS
was posted [1].

Andrea can you try to do the following

1. 1st time start of Stanbol (in an empty directory)
2. after the start archive the "stanbol\config" folder
3. stop the stanbol instance
4. after shutdown again archive the "stanbol\config" folder
5. start stanbol a 2nd time
6. make an third archive of the "stanbol\config" folder

If you can send me those three archives I will make a before/after
check of the OSGI component configurations as written to your HD.
Maybe this will provide a hint about your issue.

best
Rupert



[1] 
http://blog.iks-project.eu/makolabs-stanbol-integration-with-renault-international-cms-system/

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Apache Stanbol Enhancer Engine

2012-11-21 Thread Rupert Westenthaler
Hi Stefan

To be sure I would need to check this in detail, but I think the
reason is that your labels use 'en-GB' as language while the language
identification determines the text to be in 'en'. Because of that your
labels are not considered for linking. You can try to set the "Default
Matching Language" of the KeywordLinkingEngine to "en-GB"; if you then
get the expected results it would validate my assumption.

Do you need to support country specific language identifiers? Otherwise
I would suggest changing the language tags in your dataset from
"en-GB" to "en".

best
Rupert

On Wed, Nov 21, 2012 at 3:18 PM, Stefan Zwicklbauer
 wrote:
> Hello,
>
> I have generated my own index which is available in the entityhub. The index
> is small and the apropriate rdf file has the following structure:
>
> <rdf:Description rdf:about="http://cv.iptc.org/newscodes/genre/Archive_material">
>   <rdf:type rdf:resource="http://www.w3.org/TR/skos-reference/skos.html#Concept"/>
>   <skos:prefLabel xml:lang="en-GB">Archive material</skos:prefLabel>
>   <skos:definition xml:lang="en-GB">The object contains material
> distributed previously that has been selected from the originator's
> archives.</skos:definition>
> </rdf:Description>
> <rdf:Description rdf:about="http://cv.iptc.org/newscodes/genre/Background">
>   <rdf:type rdf:resource="http://www.w3.org/TR/skos-reference/skos.html#Concept"/>
>   <skos:prefLabel xml:lang="en-GB">Background</skos:prefLabel>
>   <skos:definition xml:lang="en-GB">The object provides some scene setting
> and explanation for the event being reported.</skos:definition>
> </rdf:Description>
> <rdf:Description rdf:about="http://cv.iptc.org/newscodes/genre/Biography">
>   <rdf:type rdf:resource="http://www.w3.org/TR/skos-reference/skos.html#Concept"/>
>   <skos:prefLabel xml:lang="en-GB">Biography</skos:prefLabel>
>   <skos:definition xml:lang="en-GB">Facts and views about a
> person</skos:definition>
> </rdf:Description>
>
> In the following I created an enhancement chain which consists of a Language
> Identification Engine and a Linking Keyword Engine. If i try to use this
> chain with an input word like "Background" (look example) the result does
> not contain any relevant information (only language).
>
> During the creation of the Keyword Linking Engine in the Web console i had
> to specify the label field and type field. What are the correct values
> concerning the given example above?
>
> Sincerly
> Stefan



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: stanbol internal dependencies to dependency management

2012-11-21 Thread Rupert Westenthaler
Those decisions were intentional.

Normally Fabian would be the right person to answer this, but I will
try it anyway:

This is only a short summary, as there was a long discussion that led to this:

Dependency management for Stanbol modules can not be done in the parent,
because the module versions would then be fixed there and this would not
allow releasing components without also releasing the parent.
Components depending on the oldest supported version give users
more freedom in their launcher configurations. In addition, keeping
dependencies on released versions is critical for releases of
single components or subsets of components.
If Stanbol modules do not depend on the latest version, doing the
dependency management in the parent does not work. Developers of
modules need to manage their dependencies themselves.
Only the Stanbol launchers are supposed to use the newest versions of modules.

On Wed, Nov 21, 2012 at 5:56 PM, Reto Bachmann-Gmür  wrote:
> e.g. the
> enhancer using 0.9.0-incubating of commons.web.base. This can cause
> incompatibilities as in the launchers 0.10.1-SNAPSHOT is used.

If there is a change in the enhancer.jersey module that requires the
current commons.web.base then the developer that introduces this
change needs to update the dependency.

This also happened to me. But after some time one gets used to it.

best
Rupert


--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: stanbol internal dependencies to dependency management

2012-11-21 Thread Rupert Westenthaler
> But I think they should nevertheless be kept up to date as
> otherwise we have no compile time check that the module will indeed work in
> the trunk version of the launcher. So I think we should regularly run a

Unit tests are executed using compile time dependencies.
Integration tests do check the runtime dependencies.

So I do not see a problem with that.

In addition one has to consider that the OSGI dependency management is
anyway different from the maven one.

To give two examples (for details have a look at the Semantic
Versioning Whitepaper [1])

1. consumer and provider policy: Stanbol uses (since STANBOL-774)

 -provider-policy : ${range;[==,=+)}
 -consumer-policy : ${range;[==,+)}

That means that by default dependencies do use a version range of
[==,+). However this is not feasible for imported packages that are
implemented by a module, as minor versions may e.g. extend an
interface by an additional method. So for such cases the import needs
to be marked with the provider-policy to ensure [==,=+).

2. Dependency management in Maven is on the module level whereas OSGI
uses package level granularity.

Depending on the latest version undermines version ranges (especially
for consumer-policy dependencies) - [==,=+) where the left side is the
most current version means basically that there is no version range at
all.

- - -

While such things are not really visible as long as you run everything
within the OSGI environment, it really starts to hurt as soon as you
want to access services from outside of OSGI (e.g. when you run
Stanbol in an embedded OSGI environment). In such settings one needs
to expose all packages of used interfaces via the system bundle and
therefore you do not have the possibility to use different versions of
the same class.

But also within OSGI there are some disadvantages one might encounter.
One example is a fragmentation of the service registry (basically a
bundle may not be able to use a service, because its version of the Interface
was loaded using a different classloader than the version of the
Interface provided by the Service). If that happens ServiceTracker
will not get notified for available services - because they would not
be compatible. Debugging that is not fun and solving such issues is
only possible by fixing version ranges.

I agree that from a Maven and build perspective this might look
like a bad choice, but from the OSGI perspective it is exactly how it
should be done.

I think the version number confusion of Sling is caused by the fact
that every single module can have totally different versions. I think
in Stanbol this can be easily avoided by ensuring that all modules of
a Stanbol component (enhancer, entityhub, ... ) are all within
[==,=+). For the commons stuff we could use the same rule but one
level below (e.g. that all commons.solr modules are within [==,=+))

best
Rupert

 [1] http://www.osgi.org/wiki/uploads/Links/SemanticVersioning.pdf

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: The KeywordLinkingEngine and the Stanbol NLP processing module (STANBOL-740)

2012-11-21 Thread Rupert Westenthaler
Hi

thanks for the feedback. I think we should go for (2) renaming the
engine. First because the current name (KeywordExtractionEngine) is
anyway not so fitting. Keyword extraction is typically more related to
finding central words within a text but the engine is more about
linking words with a vocabulary. Second because there might be some
use cases where it would still make sense to use the old engine in
parallel with the new one - e.g for extracting Product-Ids, ISBN
numbers, chemical formulas such as CH3CH2OH ... Third it is easier to
adapt the documentation - especially the usage scenarios - if there is
a new name for the new engine and finally I do also like to have
warnings instead of errors for users that have not yet adapted to the
new engine.

While Fabian's suggestion would clearly document the change, it would
still mean breaking most current Stanbol installations as most of the
users currently use the trunk version. However, as soon as we have a
faster release cycle this option would be much more attractive.

I would then suggest using "EntityhubLinkingEngine" as the new name
for the Engine, as this name makes it very clear what this engine does.

Thanks for the feedback
best
Rupert


On Thu, Nov 22, 2012 at 12:01 AM, Bertrand Delacretaz
 wrote:
> On Wed, Nov 21, 2012 at 8:46 PM, Fabian Christ
>  wrote:
>> ...what about creating a branch from the trunk with the current version
>> (before the merge) that is known to be working? People could switch to that
>> branch to keep the status quo and we should make clear that this branch
>> will not be maintained in the future...
>
> I'd make that just a BEFORE_740 tag then - that makes it clearer that
> this is not supposed to evolve further.
>
> -Bertrand



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: The KeywordLinkingEngine and the Stanbol NLP processing module (STANBOL-740)

2012-11-22 Thread Rupert Westenthaler
On Thu, Nov 22, 2012 at 12:10 PM, Bertrand Delacretaz
 wrote:
>
> Isn't the "hub" part an implementation detail?
>
> EntityLinkingEngine sounds better to be - but no strong opinion,
> whoever does the work decides.

Good point. While refactoring the code I came to the same conclusion

Currently I have

(1) "EntityLinkingEngine": This is the class implementing the
EnhancementEngine interface and in registered as OSGI service and
(2) "EntityhubLinkingEngine": The OSGI Component that gets the
configuration, registered an ServiceTracker for the Entityhub Site and
registers the  "EntityLinkingEngine" instance as soon as all the
required Services are available.

The goal of this is to make it really easy implement a
"MyServiceLinkingEngine". Even my current refactoring we are not yet
there, but it is getting much better.

best
Rupert



--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: "Error reloading cached bundle"

2012-11-22 Thread Rupert Westenthaler
Hi Reto,

I am now able to reproduce this by the following

(1) start the Stanbol launcher within the target folder
(2) make a mvn clean install while stanbol is still running in the
./target folder
(3) stop the Stanbol launcher (whose folder was deleted in the meantime)
(4) go to the newly created ./target folder
(5) start the stanbol launcher within the target folder

I think this is because in (3) the launcher writes some data to the
./target/stanbol folder of the new one. Because of that the
initialisation of the new launcher in (5) fails with the reported
exception.

Could this be related to the cases you are reporting?

best
Rupert

On Wed, Nov 14, 2012 at 1:29 PM, Reto Bachmann-Gmür  wrote:
> Now tha I saw the same error again its bundle 105 which is
> slinginstall:org.apache.stanbol.commons.solr.core-0.10.1-SNAPSHOT.jar.
> Again same symptoms.
>
> Reto
>
> On Wed, Oct 10, 2012 at 3:06 PM, Rupert Westenthaler <
> rupert.westentha...@gmail.com> wrote:
>
>> Reto,
>>
>> have you looked what module bundle64 refers to?
>>
>> On Wed, Oct 10, 2012 at 11:53 AM, Reto Bachmann-Gmür 
>> wrote:
>> > Occasionally when starting a fresh stanbol launcher I get the following
>> > error message. Does anybody knows what is causing this? After deleting
>> the
>> > stanbol dectory and retrying the problem doesn't appear again.
>> >
>> > Cheers,
>> > Reto
>> >
>> > ERROR: Error reloading cached bundle, removing it:
>> >
>> /home/reto/projects/apache/stanbol/launchers/full/target/stanbol/felix/bundle64
>> > (java.lang.Exception: No valid revisions in bundle archive directory:
>> >
>> /home/reto/projects/apache/stanbol/launchers/full/target/stanbol/felix/bundle64)
>> > java.lang.Exception: No valid revisions in bundle archive directory:
>> >
>> /home/reto/projects/apache/stanbol/launchers/full/target/stanbol/felix/bundle64
>> > at
>> >
>> org.apache.felix.framework.cache.BundleArchive.(BundleArchive.java:205)
>> > at
>> >
>> org.apache.felix.framework.cache.BundleCache.getArchives(BundleCache.java:223)
>> > at org.apache.felix.framework.Felix.init(Felix.java:656)
>> > at org.apache.sling.launchpad.base.impl.Sling.init(Sling.java:363)
>> > at org.apache.sling.launchpad.base.impl.Sling.(Sling.java:228)
>> > at
>> >
>> org.apache.sling.launchpad.base.app.MainDelegate$1.(MainDelegate.java:181)
>> > at
>> >
>> org.apache.sling.launchpad.base.app.MainDelegate.start(MainDelegate.java:181)
>> > at org.apache.sling.launchpad.app.Main.startSling(Main.java:424)
>> > at org.apache.sling.launchpad.app.Main.doStart(Main.java:349)
>> > at org.apache.sling.launchpad.app.Main.main(Main.java:123)
>> > at org.apache.stanbol.launchpad.Main.main(Main.java:61)
>>
>>
>>
>> --
>> | Rupert Westenthaler rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11 ++43-699-11108907
>> | A-5500 Bischofshofen
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Apache stanbol: Enhancer service codification problem

2012-11-22 Thread Rupert Westenthaler
Hi Jairo,

I created STANBOL-813 [1] and implemented a fix with revision [2].
Your test case now works for me, so it should be fine for you too.

Note that this fix does not tackle the general issues mentioned in
my first reply, so Stanbol might still write characters to the
Enhancement structure that might cause "application/rdf+xml"
serializations to fail.

best
Rupert



[1] https://issues.apache.org/jira/browse/STANBOL-813
[2] http://svn.apache.org/viewvc?rev=1412756&view=rev

On Tue, Nov 20, 2012 at 6:19 AM, Rupert Westenthaler
 wrote:
> Hi Jairo,
>
> This is caused by the "removeNonUtf8CompliantCharacters(..)" in the
> NEREngineCore class (OpenNLP-NER engine) [1]. The JavaDoc says that
> this was added to avoid errors while creating "application/rdf+xml"
> responses.
>
> I only recently noticed this method as I adapted the OpenNLP NER
> engine to work with the new Stanbol NLP processing chain
> (STANBOL-797). In the branch version of this engine [2] the method
> "removeNonUtf8CompliantCharacters(..)" is no longer called if the
> AnalyzedText ContentPart (STANBOL-734) is used as source for the
> enhancements.
>
> Generally I do not like this method as it creates a copy of the parsed
> content what can be a problem for big texts. In addition as this is
> only done by this engine there is still no guarantee that there are no
> non UTF-8 compliant chars in the response (they might even come from
> literals in dereferenced Entities).
>
> In addition this method seems to be overdoing it as well, because the 'í'
> in 'París' is clearly a UTF-8 conformant character. Maybe Olivier
> Grisel can comment on that, because as far as I can remember he was
> the one who added this feature years ago.
>
> best
> Rupert
>
>
> [1] 
> http://svn.apache.org/repos/asf/stanbol/trunk/enhancer/engines/opennlp-ner/src/main/java/org/apache/stanbol/enhancer/engines/opennlp/impl/NEREngineCore.java
> [2] 
> http://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/enhancer/engines/opennlp-ner/src/main/java/org/apache/stanbol/enhancer/engines/opennlp/impl/NEREngineCore.java
>
> On Mon, Nov 19, 2012 at 7:01 PM, Jairo Sarabia
>  wrote:
>> Hi Rupert,
>>
>> I tried to use enhancer service for spanish texts and I have problems with
>> codification.
>> In the service, the  caracters with accents disappear in json response and
>> consequently there are important words of de Language that no appear in the
>> responses.
>> I've tried using different codifications in the requests but none seem to
>> work:
>>
>> Examples of Headers:
>> 1)  -H "Accept: application/json", "Content-type: text/plain"
>> 2)  -H "Accept: application/json", "Content-type: text/plain; charset=utf-8"
>> 3)  -H "Accept: application/json", "Content-type: text/plain;
>> charset=iso-8859-1"
>> 4) -H "Accept: application/json", "Content-type: text/html; charset=utf-8",
>> "Accept-Language: es-es"
>> 5) -H "Accept: application/json", "Content-type: text/html;
>> charset=iso-8859-1", "Accept-Language: es-es"
>>
>> Example of curl request:
>>
>> REQUEST:
>>
>> curl -v -X POST -H "Accept: text/plain" -H "Content-type: text/html;
>> charset=utf-8" -H "Accept-language:es-es;en" --data "The
>> Stanbol enhancer puede detectar personas famosas como Mariano Rajoy y
>> ciudades como París."
>> "http://ec2-50-16-118-169.compute-1.amazonaws.com:8080/enhancer/chain/notedlinks";
>>
>> JSON RESPONSE:
>>
>> {
>>  
>>
>> {
>>   "@subject":
>> "urn:content-item-sha1-69a7889f31ea325dda4a9e08f735b1499e7d6e3c",
>>   "dc:format": "text/html; charset=UTF-8",
>>   "http://www.w3.org/ns/ma-ont#hasFormat": "text/html; charset=UTF-8"
>> },
>> {
>>   "@subject": "urn:enhancement-0367734f-e48d-4dc3-e634-e5a3a4770706",
>>   "@type": [
>> "enhancer:Enhancement",
>> "enhancer:TextAnnotation"
>>   ],
>>   "dc:created": "2012-11-19T17:48:25.977Z",
>>   "dc:creator":
>> "org.apache.stanbol.enhancer.engines.opennlp.impl.NamedEntityExtractionEnhancementEngine",
>>   "dc:type": "dbp-ont:Person",
>>   "enhancer:confidence": 0.98616,
>>   "enhancer:end": 71,
>>  

Re: Question: REST API expected content type

2012-11-23 Thread Rupert Westenthaler
Hi Andriy,

For the Enhancer RESTful API

The MediaType is taken from the "MediaType mediaType" parameter as
passed by JAX-RS to the "readFrom(..)" method of the
"MessageBodyReader". This should be equal to the 'Content-Type'
header parsed from the request. The uploaded content is stored as a Blob
in the created ContentItem.

In case you are sending "multipart/form-data" requests then you need
to consider the specification as documented in the "Multipart MIME
serialization" section of [1].


For the Tika Engine:

The MimeType is parsed from ContentItem#getBlob()#getMimeType() (see
also [1]). If the mime type can not be parsed or is
application/octet-stream then Tika is used to detect the correct
MimeType. Otherwise the content type as set in the Blob is used.

BTW. plain text files are not processed by the Tika engine.
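
So if you do not want to guess the type on the client side you can simply
send a generic type and let Tika do the detection, e.g. (sketch):

curl -X POST -H "Accept: text/turtle" \
    -H "Content-type: application/octet-stream" \
    --data-binary "@dummy.txt" \
    http://localhost:8080/enhancer/chain/dbpedia-spotlight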

best
Rupert


[1] http://stanbol.apache.org/docs/trunk/components/enhancer/contentitem.html

On Fri, Nov 23, 2012 at 9:18 AM, Andriy Nikolov
 wrote:
> Dear all,
>
> I have another question about the use of Stanbol enhancer REST API
> (apologies if it is already covered in the documentation, i didn't
> find it).
> Is there some default content type which is expected by the enhancer?
> For instance, if I send a PDF file to the dbpedia-spotlight chain
> without specifying its content type, it gets processed correctly:
> curl -X POST -H "Accept: text/turtle" -T test.pdf
> http://localhost:8080/enhancer/chain/dbpedia-spotlight?uri=urn:testItem
> However, if I send a plain text file instead, nothing is returned:
> curl -X POST -H "Accept: text/turtle" -T dummy.txt
> http://localhost:8080/enhancer/chain/dbpedia-spotlight
> I have to set "Content-type: text/plain" in the header.
> Similarly, when I send PDF content from Java client via
> HttpURLConnection, if I don't set "Content-type:
> application/octet-stream" explicitly, it gets interpreted as plain
> text.
>
> I guess, Tika engine is able to recognise both plain text and
> different binary formats, so can I set some "default" content type,
> which will just defer the recognition of input format to the Tika
> engine?
> That will allow me sending any file to the service without first doing
> some "pre-guessing" on the client side.
>
> Best regards,
>
> Andriy Nikolov
>
> R&D Engineer
>
> F +49 6227 3849-565
>
> andriy.niko...@fluidops.com
>
> http://www.fluidops.com
>
> fluid Operations AG
>
> Altrottstr. 31
>
> 69190 Walldorf, Germany
>
> Geschäftsführer/Managing Directors: Vasu Chandrasekhara, Dr. Andreas
> Eberhart, Dr. Stefan Kraus, Dr. Ulrich Walther
>
> Beirat/Advisory Board: Prof. Dr. Andreas Reuter, Thomas Reinhart
>
> Registergericht/Commercial Register: Mannheim, HRB 704027
>
> USt-Id Nr./VAT-No.: DE258759786
>
> This e-mail may contain confidential and/or privileged information. If
> you are not the intended recipient (or have received this e-mail in
> error) please notify the sender immediately and destroy this e-mail.
> Any unauthorised copying, disclosure or distribution of the material
> in this e-mail is strictly forbidden.



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Question: REST API expected content type

2012-11-23 Thread Rupert Westenthaler
On Fri, Nov 23, 2012 at 10:57 AM, Gniewosław Rzepka
 wrote:
> http://upload.wikimedia.org/wikipedia/commons/4/45/F1_logo.svg

The current URL used by wikipedia is

http://upload.wikimedia.org/wikipedia/en/4/45/F1_logo.svg

So basically it seems that they replace "commons" with "en" in the URL.

> I thought this might be useful information.

Thanks for the notice, but this is something we can not easily correct
as we do use the data as provided by DBpedia. In case of dbpedia 3.8
those are from this summer. So recent changes are not reflected in
them.

best
Rupert

>
> Gniewosław Rzepka



--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: OpenNLP models license

2012-11-24 Thread Rupert Westenthaler
On Fri, Nov 23, 2012 at 11:39 AM, Andrea Di Menna  wrote:
> Even though I am not copying models in the datafiles dir, it looks
> like those models are anyway available in the stable launcher.
>
> My questions follow:
> 1) Are the en model from lang and ner bundles licensed with a Apache
> 2.0 license?

No. This is the reason why you get Messages like that during the build

*
* WARNING - this build downloads some OpenNLP files that are *not*
* licensed under the Apache License, and have more restrictive usage
* terms than the Apache Stanbol code. See STANBOL-545 for more
* information: https://issues.apache.org/jira/browse/STANBOL-545
*

> 2) Is there any safe/preferable way to remove those models from a
> Stanbol instance without completely disrupting the Keyword Linking
> engine?

The KeywordLinking engine only requires Tokens. Those are also
available if no models are present. However this will have an influence
on the Results and the Performance.

> I am wondering if those models are absolutely needed for the purpose
> of Keyword Linking or if the related bundles can be safely removed
> from the Felix console.
>

just exclude/remove all org.apache.stanbol.data.opennlp.* bundles

Regarding Licenses: You will find a lot of relevant posts on the
OpenNLP mailing lists.

best
Rupert

>
> [1] https://issues.apache.org/jira/browse/STANBOL-545
> [2] http://opennlp.sourceforge.net/models-1.5/



--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: The KeywordLinkingEngine and the Stanbol NLP processing module (STANBOL-740)

2012-11-24 Thread Rupert Westenthaler
Hi all

The refactoring is completed (for now) - see STANBOL-812 [1].
Documentation is already online on the Staging Server

* EntityhubLinkingEngine [2]: This is the direct successor of the
KeywordlinkingEngine
* EntityLinkingEngine [3]: This is the "generic" implementation of
EntityLinking based on the NLP processing API [4]

There will be a 2nd refactoring step to make the EntityLinkingEngine
fully independent of the Stanbol Entityhub. But this will not have any
influence on public APIs, Chain configurations nor Enhancement results
so this can be done after reintegration with the trunk.

Thanks for the feedback
best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-812
[2] 
http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entityhublinking
[3] 
http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entitylinking
[4] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/nlp/

On Thu, Nov 22, 2012 at 1:05 PM, Rupert Westenthaler
 wrote:
> On Thu, Nov 22, 2012 at 12:10 PM, Bertrand Delacretaz
>  wrote:
>>
>> Isn't the "hub" part an implementation detail?
>>
>> EntityLinkingEngine sounds better to be - but no strong opinion,
>> whoever does the work decides.
>
> Good point. While refactoring the code I came to the same conclusion
>
> Currently I have
>
> (1) "EntityLinkingEngine": This is the class implementing the
> EnhancementEngine interface and in registered as OSGI service and
> (2) "EntityhubLinkingEngine": The OSGI Component that gets the
> configuration, registered an ServiceTracker for the Entityhub Site and
> registers the  "EntityLinkingEngine" instance as soon as all the
> required Services are available.
>
> The goal of this is to make it really easy implement a
> "MyServiceLinkingEngine". Even my current refactoring we are not yet
> there, but it is getting much better.
>
> best
> Rupert
>
>
>
> --
> | Rupert Westenthaler rupert.westentha...@gmail.com
> | Bodenlehenstraße 11 ++43-699-11108907
> | A-5500 Bischofshofen



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: stanbol internal dependencies to dependency management

2012-11-24 Thread Rupert Westenthaler
Hi Reto,

if you make incompatible changes to a module then you need to adapt
all dependent modules and update their dependency to the current
version.

Normally the

 -provider-policy : ${range;[==,=+)}
 -consumer-policy : ${range;[==,+)}

would ensure that released Bundles are not affected by that. This is
also the reason why for an incompatible API change a major version
increase is required. However, for pre 1.0.0 versions this is not the
case.

best
Rupert

On Fri, Nov 23, 2012 at 11:10 AM, Reto Bachmann-Gmür  wrote:
> Hi,
>
> The concrete problem: I've made changes to the WebFragment interface (in
> org.apache.stanbol.commons.web.base). The classes implementing it no longer
> compile if they have proper @Override annotations. Packages which used to
> implement the old version should remove the method and move the templates
> to another location.
>
> At runtime implementation of the old interface still work except that the
> method is never invoked and the templates are looked up in the new
> location. I've moved the templates to the new location in all modules and
> I've removed the method in those modules dependeing on the trunk version.
> The other modules are now in the state that they work only with the trunk
> launchers but compile only with the dependency to the old comms.web.base.
> If developer update the dependency version they'll have to find out why it
> fails and what adaptations are needed.
>
> I think it would be much more efficient if the one that changes an
> interface also changes all dependencies in trunk to compile with the new
> version. Of course one could just update the modules depending on the
> updated one to use the latest version. Howver I think it would be more
> consistent to keep the reactor modules to depend on the latest versions,
> this can be done running and needs no change to depenedency management:
>
> mvn org.codehaus.mojo:versions-maven-plugin:1.3.1:use-latest-versions
> "-Dincludes=org.apache.stanbol:*:*:*"  -DallowSnapshots=true
> -DexcludeReactor=false
>
> For the following modules we have other modules depending of older versions
> of them:
>
> org.apache.stanbol.commons.jsonld
> org.apache.stanbol.commons.solr.core
> org.apache.stanbol.commons.stanboltools.datafileprovider
> org.apache.stanbol.commons.stanboltools.offline
> org.apache.stanbol.commons.web.base
> org.apache.stanbol.entityhub.core
> org.apache.stanbol.entityhub.model.clerezza
> org.apache.stanbol.entityhub.servicesapi
> org.apache.stanbol.entityhub.yard.solr
>
> Given that in the launchers the reactor build they have to be  compatible
> with the latest versions anyway this seems inconsistent to me.
>
> For now I'll just update the modules to depend on the latest version of
> org.apache.stanbol.commons.web.base.
>
> Cheers,
> Reto
>
>
>
>
> On Fri, Nov 23, 2012 at 9:15 AM, Fabian Christ > wrote:
>
>> Hi,
>>
>> is there any concrete problem with this approach? I would like to live with
>> it at least for some releases and then decide upon our experience if it
>> fits. Otherwise it is just a meta-discussion. I see pros and cons on each
>> side.
>>
>> Let's do a few releases and collect some evidence ;)
>>
>>
>> 2012/11/22 Reto Bachmann-Gmür 
>>
>> > I agree that if integration tests offer full coverage they will fail when a
>> > compatibility breaking change is introduced. However the advantage of
>> > statically typed languages is that you can detect these problems already
>> > at compile time.
>> >
>> > The two arguments you mention, package split and interaction with the host
>> > environment, are in fact arguments for having all modules depend on the same
>> > versions of their dependencies. As in the trunk launchers we use the trunk
>> > versions, these modules should also depend exclusively on the trunk versions
>> > of other stanbol modules. Embedding is an important use case that should be
>> > supported; the easiest way to address it is to have just one version of the
>> > bundles and consistent dependencies. Backward compatibility (e.g. that
>> > somebody wants to use an old version of an engine with a new enhancer)
>> > seems less important, and to provide it the current approach of having
>> > engines compile but then fail at runtime doesn't seem a good approach
>> > anyway.
>> >
>> > Cheers,
>> > Reto
>> >
>> > On Wed, Nov 21, 2012 at 11:25 PM, Rupert Westenthaler <
>> > rupert.westentha...@gmail.com> wrote:
>> >
>> > > > But I think they should n

Re: stanbol internal dependencies to dependency management

2012-11-24 Thread Rupert Westenthaler
On Sat, Nov 24, 2012 at 1:09 PM, Reto Bachmann-Gmür  wrote:
> Hi Rupert,
>
> So assuming a module is in trunk at version 3.4.1-SNAPSHOT and I make an
> incompatible change, to what should I change the version number? Does
> the degree of incompatibility make a difference:
> - A change that affects clients of the interface

e.g. Changing/Removing/renaming any existing method of an interface

3.4.1 -> 4.0

The typical workaround is to keep the old method and deprecate it. In
this case an increase to 3.5 is sufficient.

> - A change that affects subclasses (when knowing that there are such
> subclasses/not knowing)

e.g. adding a method to an interface, or an abstract method to a class

3.4.1 -> 3.5

> - A change in the behaviour (documented behaviour/undocumented side effect)

3.4.1 -> 3.4.2

but these are only the minimum required version increases to ensure
that the OSGi provider-policy and consumer-policy in use work as
intended.
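
To make the three cases concrete, here is a minimal illustrative sketch;
GreetingService is a made-up interface (not actual Stanbol API) and the
version numbers simply follow the rules above:

public interface GreetingService {
    // Version 3.4.1 of a hypothetical exported API package
    String greet(String name);
}

// Client-affecting change: removing or renaming greet(String) breaks every
// caller at compile time -> next version is 4.0 (or keep the old method,
// mark it @Deprecated and only add a new one -> 3.5 is sufficient).
//
// Implementer-affecting change: adding a method such as
//     String greet(String name, java.util.Locale locale);
// breaks classes implementing the interface but not plain callers
// -> next version is 3.5.
//
// Behavioural change only: greet() keeps its signature but, for example,
// starts trimming whitespace from the name -> next version is 3.4.2.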

best
Rupert


--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: stanbol internal dependencies to dependency management

2012-11-24 Thread Rupert Westenthaler
On Sat, Nov 24, 2012 at 7:50 PM, Reto Bachmann-Gmür  wrote:
> Ok, thanks. Good to have such a policy.
>
> Just the last point:
>
>> - A change in the behaviour (documented behaviour/undocumented side
>> effect)
>>
>> 3.4.1 -> 3.4.2
>>
>
> The version in trunk is a snapshot version (3.4.1-SNAPSHOT) so the latest
> released version is probably 3.4 so this change would only change what
> changed already.

3.4.2-SNAPSHOT is automatically created as soon as 3.4.1 is released.
However, as long as there are no changes in the trunk there will not be a
3.4.2 release.

Practically that means that a minor change in the trunk does not
increase the version. But from a release perspective the first
change in the trunk does increase the version (as it triggers a new
release), and all further changes do not, unless one decides (for some
other reason) to increase the version number.

BTW: With the introduction of ManagedSites in the Entityhub I had to
make some incompatible changes. Back then I decided to
increase the version number of the trunk from 0.10.1 to 0.11. I also
created an entityhub-0.10 branch [1] so that we could do 0.10.*
releases if we wanted to fix bugs in the version with the old API.
This was mainly because the entityhub 0.11.* is no longer compatible
with the released stanbol version 0.9.0.

But as Fabian already noted, we really need to do some more releases
to see how this all works out in practice. The current discussions are
all very theoretical and need to be validated by forging real
releases.

best
Rupert


[1] http://svn.apache.org/repos/asf/stanbol/branches/entityhub-0.10/

>
> Reto



--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: stanbol internal dependencies to dependency management

2012-11-25 Thread Rupert Westenthaler
>
> What's missing for doinh a 1.0 release (which seems to be the precondition
> for all this major/minor/micro stuff?
>

Fabian and myself talked about that during ApacheCon. If I remember
correctly the plan was as follows:

* reintegrate the Stanbol NLP processing module (I am currently
merging ... should be finished today/tomorrow)
* make another 0.* release of all modules (a first release candidate next
week seems feasible)
   * this will be 0.10 for most components
   * and 0.11 for the entityhub
* work towards the 1.0 release
   * not all components will have a 1.0 release. This is something we
need to decide. Commons, Data, Enhancer, Entityhub are good
candidates. Contenthub will need to wait for the 2-layered storage
infrastructure. Not sure about the Ontonet/reasoning and Rules.
   * for the 1.0 release we might need to change the current folder
structure of the SVN a little bit (e.g. moving Engines that depend on
non-Enhancer components out of enhancer/engines ...

BTW Fabian has already started the work. If you look at the recent
Jira issues you will find some that cover things mentioned above.

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Improvement of DOAP file for Stanbol

2012-11-26 Thread Rupert Westenthaler
Hi

Actually a great idea.

> BTW: Maybe someone has a good idea on how that semantic data provided by
> the ASF can be used by Stanbol.

If someone could write a simple script that collects the RDF files
from all HTML files in

https://projects.apache.org/projects/

They are referenced by the following meta tag



Then we could create an Entityhub ManagedSite for that data and
include it in the Stanbol default configuration.
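
A rough sketch of such a script (plain Java, just to illustrate the idea;
the link patterns are assumptions, since the referenced meta tag is not
preserved above, and would need to be adapted to the actual markup of the
project pages):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CollectDoapFiles {

    // Assumed patterns: project pages are linked as *.html and reference
    // their RDF/DOAP file via an attribute value ending with ".rdf".
    private static final Pattern PAGE_LINK = Pattern.compile("href=[\"']([^\"']+\\.html)[\"']");
    private static final Pattern RDF_LINK = Pattern.compile("[\"']([^\"']+\\.rdf)[\"']");

    public static void main(String[] args) throws Exception {
        URL base = new URL("https://projects.apache.org/projects/");
        Path out = Files.createDirectories(Paths.get("doap-rdf"));
        Matcher pages = PAGE_LINK.matcher(read(base));
        while (pages.find()) {
            Matcher rdf = RDF_LINK.matcher(read(new URL(base, pages.group(1))));
            if (rdf.find()) {
                URL rdfUrl = new URL(base, rdf.group(1));
                String name = rdfUrl.getPath().replaceAll(".*/", "");
                try (InputStream in = rdfUrl.openStream()) {
                    // store the downloaded DOAP/RDF file locally
                    Files.copy(in, out.resolve(name), StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    private static String read(URL url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = r.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }
}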

BTW: I think there are even more RDF files available (see the information
on http://people.apache.org/foaf/) but I do not have a clear idea how
to access the RDF version with the publicly available information.

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Reintegration of the Stanbol NLP processing branch (STANBOL-733) with trunk

2012-11-26 Thread Rupert Westenthaler
Hi all,

with revision 1413560 [2] the stanbol-nlp-processing branch [1] is
re-integrated with the Stanbol trunk. There are still some TODOs such
as adding integration tests for the newly added engines based on the
"dbpedia-proper-noun chain" but starting from this revision the
Stanbol NLP processing module is available in the trunk.

Documentation is still in progress. The current version can be viewed
on the staging server

* NLP processing API [3]:
* Enhancement Engine List [4] with links to the newly added Engines

Especially note that the "EntityhubLinkingEngine" replaces the now
deprecated "KeywordLinkingEngine". Users that are not using the
"Keyword Tokenizer" feature should definitely switch!

If you want to link against DBpedia you should also give the new
DBpedia 3.8 index [5] a try

best
Rupert


[1] http://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/
[2] http://svn.apache.org/viewvc?rev=1413560&view=rev
[3] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/nlp/
[4] 
http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/list.html
[5] http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.8/


--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Fwd: Build failed in Jenkins: stanbol-trunk-1.6 #1116

2012-11-26 Thread Rupert Westenthaler
s not
within its bound

[ERROR] 
<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/Spanish.java>:[17,31]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound

[ERROR] 
<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/Spanish.java>:[17,59]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound

[INFO] 7 errors
[INFO] -
[JENKINS] Archiving disabled - not archiving
<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/pom.xml>
[INFO] 
[ERROR] BUILD FAILURE
[INFO] 
[INFO] Compilation failure

<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/German.java>:[20,31]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound

<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/German.java>:[20,57]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound

<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/model/impl/AnalysedTextFactoryImpl.java>:[20,7]
org.apache.stanbol.enhancer.nlp.model.impl.AnalysedTextFactoryImpl is
not abstract and does not override abstract method
createAnalysedText(org.apache.stanbol.enhancer.servicesapi.ContentItem,org.apache.stanbol.enhancer.servicesapi.Blob)
in org.apache.stanbol.enhancer.nlp.model.AnalysedTextFactory

<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/English.java>:[20,31]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound

<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/English.java>:[20,66]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound

<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/Spanish.java>:[17,31]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound

<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/Spanish.java>:[17,59]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound


[INFO] 
[INFO] For more information, run Maven with the -e switch
[INFO] ----
[INFO] Total time: 13 minutes 25 seconds
[INFO] Finished at: Mon Nov 26 12:22:01 UTC 2012
[INFO] Final Memory: 558M/907M
[INFO] 
Sending e-mails to: dev@stanbol.apache.org rupert.westentha...@gmail.com
channel stopped
[locks-and-latches] Releasing all the locks
[locks-and-latches] All the locks released


--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen

On Mon, Nov 26, 2012 at 1:23 PM, Apache Jenkins Server
 wrote:
> See <https://builds.apache.org/job/stanbol-trunk-1.6/1116/changes>
>
> Changes:
>
> [rwesten] STANBOL-733: Merged changed from the stanbol-nlp-processing branch 
> back to the trunk; added sentimentdata bundlelist; changed default 
> configuration of the stanbol launcher(s) by editing the /data/dafaultconfig 
> bundle; Adapted the EnhancerConfiguration integration test to the new 
> configuration.
>
> --
> [...truncated 13991 lines...]
> ---
>  T E S T S
> ---
>
> Results :
>
> Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
>
> [JENKINS] Recording test results
> [INFO] [jar:jar {execution: default-jar}]
> [INFO] Building jar: 
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT.jar>
> [INFO] Preparing source:jar
> [WARNING] Removing: jar from forked lifecycle, to prevent recursive 
> invocation.
> [JENKINS] Archiving disabled - not archiving 
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhan

Re: Changes to get rid of jersey dependencies

2012-11-26 Thread Rupert Westenthaler
yes this was the wrong thread ... sorry ... no change to the
contentitem.ftl. AFAIK this is duplicated to avoid a dependency
between the enhancer and the contenthub.

best
Rupert

On Mon, Nov 26, 2012 at 7:53 PM, Reto Bachmann-Gmür  wrote:
> Glad to hear, also this seems to have been the wrong thread. Or the wrong
> patch link, as I see no reference to the duplicate contentitem.ftl there.
>
> Cheers,
> Reto
>
> On Mon, Nov 26, 2012 at 4:43 PM, Rupert Westenthaler <
> rupert.westentha...@gmail.com> wrote:
>
>> Hi all
>>
>> with http://svn.apache.org/viewvc?rev=1413674&view=rev the trunk
>> should be fixed.
>>
>> best
>> Rupert
>>
>> On Sat, Nov 24, 2012 at 4:41 PM, Reto Bachmann-Gmür 
>> wrote:
>> > Hi Rupert,
>> >
>> > I see two templates by that name in the source:
>> >
>> > ./contenthub/web/target/classes/templates/imports/contentitem.ftl
>> > ./enhancer/jersey/src/main/resources/templates/imports/contentitem.ftl
>> >
>> > The two templates seem to differ only by a bit of formatting and they are
>> > registered at the same location where they should be included with
>> > <#include "/imports/contentitem">.
>> >
>> > Furthermore I see:
>> >
>> >
>> ./enhancer/jersey/src/main/resources/org/apache/stanbol/enhancer/jersey/templates/ajax/contentitem.ftl
>> >
>> > At that location it is not accessible to the templating system.
>> > I couldn't find an include for ajax/contentitem, the root for includes is
>> > the templates folder (not templates/html so to allow to include other
>> media
>> > types). From the error message I now moved the file to be where its is
>> > expected, verified with
>> >
>> > zz>val b = bundleContext.getBundle(28)
>> > b: org.osgi.framework.Bundle = org.apache.stanbol.enhancer.jersey [28]
>> >
>> zz>b.getResource("templates/html/org/apache/stanbol/enhancer/jersey/resource/ContentItemResource/ajax/contentitem.ftl")
>> > res3: java.net.URL =
>> >
>> bundle://28.3:1/templates/html/org/apache/stanbol/enhancer/jersey/resource/ContentItemResource/ajax/contentitem.ftl
>> >
>> > The error is gone now.
>> >
>> > Cheers,
>> > Reto
>> >
>> > On Sat, Nov 24, 2012 at 1:22 PM, Rupert Westenthaler <
>> > rupert.westentha...@gmail.com> wrote:
>> >
>> >> Hi Reto,
>> >>
>> I think I discovered another issue with the new template loading
>> mechanism while re-integrating the stanbol-nlp-processing branch.
>> However, a test on the current trunk also shows the same issue.
>> >>
>> When I post a request to the Stanbol Enhancer via the Web UI I do get
>> an "Invalid query" because of
>> >>
>> >> 24.11.2012 13:17:17.994 *WARN* [1346380557@qtp-2082765220-174]
>> >> org.apache.felix.http.jetty /enhancer (java.lang.RuntimeException:
>> >> java.io.FileNotFoundException: Template
>> >>
>> >>
>> html/org/apache/stanbol/enhancer/jersey/resource/ContentItemResource/ajax/contentitem
>> >> not found.) java.lang.RuntimeException: java.io.FileNotFoundException:
>> >> Template
>> >>
>> html/org/apache/stanbol/enhancer/jersey/resource/ContentItemResource/ajax/contentitem
>> >> not found.
>> >> at
>> >>
>> org.apache.stanbol.commons.ldpathtemplate.LdRenderer.renderPojo(LdRenderer.java:173)
>> >> at
>> >>
>> org.apache.stanbol.commons.ldviewable.mbw.ViewableWriter.writeTo(ViewableWriter.java:80)
>> >> at
>> >>
>> org.apache.stanbol.commons.ldviewable.mbw.ViewableWriter.writeTo(ViewableWriter.java:53)
>> >> [..]
>> >> Caused by: java.io.FileNotFoundException: Template
>> >>
>> >>
>> html/org/apache/stanbol/enhancer/jersey/resource/ContentItemResource/ajax/contentitem
>> >> not found.
>> >> at freemarker.template.Configuration.getTemplate(Configuration.java:580)
>> >> at freemarker.template.Configuration.getTemplate(Configuration.java:543)
>> >> at
>> >>
>> org.apache.stanbol.commons.ldpathtemplate.LdRenderer.renderPojo(LdRenderer.java:169)
>> >> ... 48 more
>> >>
>> I think this is because all the imports are still in the old location.
>> Can you please have a look at this? How do imports work with the new
>> infrastructure?

Re: Build failed in Jenkins: stanbol-trunk-1.6 #1116

2012-11-26 Thread Rupert Westenthaler
Hi all

with http://svn.apache.org/viewvc?rev=1413674&view=rev the trunk
should be fixed. Also Jenkins is happy again. Looks like the
Stanbol NLP module (STANBOL-733) has finally found its way to
the trunk!

best
Rupert

On Mon, Nov 26, 2012 at 3:27 PM, Rupert Westenthaler
 wrote:
> Hi,
>
> the commit
>
> http://svn.apache.org/viewvc?rev=1413560&view=rev
>
> got somehow broken. Basically the contents sent by the Eclipse SVN
> plugin were not the version as written on disk. The result is
> that
>
> (1) the data in SVN does not compile (because it is missing
> necessary adaptations after the merge of the nlp processing branch)
> (2) in my local version the SVN metadata are out of sync with the
> contents of the files (making it impossible to commit the correct
> version)
>
> I assume that resolving this will take some time as I will need to
> manually copy the correct files from the corrupted workspace to a
> fresh checkout.
>
> sorry for any inconvenience
> Rupert
>
> -- Forwarded message --
> From: Apache Jenkins Server 
> Date: Mon, Nov 26, 2012 at 1:23 PM
> Subject: Build failed in Jenkins: stanbol-trunk-1.6 #1116
> To: dev@stanbol.apache.org, rupert.westentha...@gmail.com
>
>
> See <https://builds.apache.org/job/stanbol-trunk-1.6/1116/changes>
>
> Changes:
>
> [rwesten] STANBOL-733: Merged changed from the stanbol-nlp-processing
> branch back to the trunk; added sentimentdata bundlelist; changed
> default configuration of the stanbol launcher(s) by editing the
> /data/dafaultconfig bundle; Adapted the EnhancerConfiguration
> integration test to the new configuration.
>
> --
> [...truncated 13991 lines...]
> ---
>  T E S T S
> ---
>
> Results :
>
> Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
>
> [JENKINS] Recording test results
> [INFO] [jar:jar {execution: default-jar}]
> [INFO] Building jar:
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT.jar>
> [INFO] Preparing source:jar
> [WARNING] Removing: jar from forked lifecycle, to prevent recursive 
> invocation.
> [JENKINS] Archiving disabled - not archiving
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/pom.xml>
> [JENKINS] Archiving disabled - not archiving
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT.jar>
> [INFO] [enforcer:enforce {execution: enforce-java}]
> [JENKINS] Archiving disabled - not archiving
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/pom.xml>
> [JENKINS] Archiving disabled - not archiving
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT.jar>
> [INFO] [source:jar {execution: attach-sources}]
> [INFO] META-INF already added, skipping
> [INFO] META-INF/LICENSE already added, skipping
> [INFO] META-INF/NOTICE already added, skipping
> [INFO] META-INF/DEPENDENCIES already added, skipping
> [INFO] Building jar:
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT-sources.jar>
> [INFO] META-INF already added, skipping
> [INFO] META-INF/LICENSE already added, skipping
> [INFO] META-INF/NOTICE already added, skipping
> [INFO] META-INF/DEPENDENCIES already added, skipping
> [INFO] [install:install {execution: default-install}]
> [INFO] Installing
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT.jar>
> to 
> /home/jenkins/jenkins-slave/maven-repositories/1/org/apache/stanbol/org.apache.stanbol.enhancer.test/0.10.0-SNAPSHOT/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT.jar
> [INFO] Installing
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT-sources.jar>
> to 
> /home/jenkins/jenkins-slave/maven-repositories/1/org/apache/stanbol/org.apache.stanbol.enhancer.test/0.10.0-SNAPSHOT/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT-sources.jar
> [JENKINS] Archiving disabled - not archiving
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/pom.xml>
> [JENKINS] Archiving disabled - not archiving
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10

Re: Confused by engines names

2012-11-27 Thread Rupert Westenthaler
Hi Fabian

Short version:

I totally agree. Our vocabulary has changed over time, but the Engines
still use the names from when they were introduced. Changing them
(artifactIds and class names) is dangerous as this does break
backwards compatibility. So I would suggest changing names only if we
can also come up with a better implementation/design.

Regarding vocabulary I think we should prefer the terms
"EntityLinking" and "NamedEntityLinking" and deprecate all others, like
"keyword" (instead of "entity") or "extraction"/"tagging" (instead of
"linking").

The 'engines/entitylinking' and 'engines/entityhublinking' introduced
by STANBOL-733 do already use this new terminology. They also
deprecate the 'engines/keywordextraction'.

- - -

Long version with more background information

Regarding the linking of Entities there are currently two different principles:

* "NamedEntityLinking": A "NamedEntity" has a 'selected text' AND a
'type'. So the selected text AND the type can be used for linking
* "EntityLinking": An "Entity" does only have a 'selected text'. Here
linking is only possible based on the selected text.

The plan would be to also have two Engine implementations that support
those linking models.

* 'NamedEntityLinkingEngine' (currently /engines/entitytagging)
* 'EntityLinkingEngine' (was /engines/keywordextraction (now
deprecated) ; since yesterday  /engines/entitylinking)

Those should not have external dependencies (meaning no dependencies on
Stanbol components other than Stanbol Commons and the Enhancer module;
also no other major frameworks such as Solr or OpenNLP; no calls to
external services). That would allow keeping those Engines within the
enhancer module, but it also means that those implementations can not be
directly used by the user (as the Service used for linking will just be
defined by an Interface without an actual implementation).

Because of that there will be "Engines" that are based on the above,
but come with adapters to Services that do support the EntityLookup.
The default will be implementations based on the StanbolEntityhub, but
Stanbol users could also implement versions for their own
infrastructure needs.

The "EntityhubLinking" module [1] is the first example. When you look
at the module you will recognize that it does not contain an single
EnhancementEngine implementation. It only provides Entityhub specific
implementations of the EntitySearcher interface defined by the
"EntityLinkingEngine" and a OSGI component that allows users to
configure an EntityLinkingEngine instance that uses the Entityhub to
lookup Entities.
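
To illustrate the adapter idea (the types and method signatures below are
simplified placeholders for this mail, not the actual EntitySearcher API
described in [3]):

// Simplified stand-ins -- the real EntitySearcher interface [3] is richer
// and currently still uses the Entityhub model classes.
interface Candidate {
    String getUri();
    String getLabel();
}

interface EntitySearcher {
    /** Look up candidate entities whose label matches the selected text. */
    java.util.Collection<Candidate> lookup(String label, String language);
}

// A service-specific module (like engines/entityhublinking) then only needs
// to contribute such an adapter for its own backend, plus the OSGi glue that
// configures a generic EntityLinkingEngine instance with it.
class InMemoryEntitySearcher implements EntitySearcher {

    private final java.util.Map<String, String> labelToUri =
            new java.util.HashMap<String, String>();

    void register(String label, String uri) {
        labelToUri.put(label.toLowerCase(java.util.Locale.ROOT), uri);
    }

    @Override
    public java.util.Collection<Candidate> lookup(String label, String language) {
        final String uri = labelToUri.get(label.toLowerCase(java.util.Locale.ROOT));
        if (uri == null) {
            return java.util.Collections.<Candidate>emptyList();
        }
        final String matched = label;
        return java.util.Collections.<Candidate>singletonList(new Candidate() {
            public String getUri() { return uri; }
            public String getLabel() { return matched; }
        });
    }
}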

Current state:

Currently we are not yet there. The '/engines/entitytagging' still
implements both NamedEntityLinking AND lookup via the Entityhub. This
engine could be replaced by an 'engines/namedentitylinking' that
follows the design described above. The new
'/engines/entitylinking' already implements the above design. However
it still depends on the Entityhub, because the EntitySearcher
interface [3] is still using the Entityhub model classes.

'engines/entityhublinking' currently provides the ability to do
'entitylinking' with the Entityhub. As soon as the
'engines/namedentitylinking' is available I would add named entity
linking functionality to that module. In a last step this module will
also move out of the /enhancer component (as already suggested by
STANBOL-805 [4]).


BTW this design was the result of this [2] discussion on the Stanbol
dev mailing list.

best
Rupert



[1] 
http://svn.apache.org/repos/asf/stanbol/trunk/enhancer/engines/entityhublinking/
[2] http://markmail.org/message/nptkntyuthv7wwqh
[3] 
http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entitylinking#entitysearcher
[4] https://issues.apache.org/jira/browse/STANBOL-805


On Tue, Nov 27, 2012 at 11:14 AM, Fabian Christ
 wrote:
> Hi,
>
> enhancement engines in Stanbol can have several names and this is confusing
> myself and very likely our users. Here are some examples that I came across
> when trying to identify the running engines. I started to look at the
> Web-UI and clicked through the OSGi console.
>
> dbpediaLinking (NamedEntityTaggingEngine) ->
> Named Entity Tagging -> Entity Tagging ->
> /engines/entitytagging
>
> entityhubExtraction (EntityLinkingEngine) ->
> Entityhub Linking -> Entityhub Linking ->
> /engines/entityhublinking
>
> Could we simplify this a bit to make it more obvious especially for new
> users what is going on?
>
> Best,
>  - Fabian
>
> --
> Fabian
> http://twitter.com/fctwitt



--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Confused by engines names

2012-11-27 Thread Rupert Westenthaler
Hi

there are also inconsistencies in the names of the OSGi parameters,
default names of Engines ... That's why I would like to make another
0.* release and then change/fix all those things while working towards
the 1.0 release.

best
Rupert

On Tue, Nov 27, 2012 at 8:38 PM, Reto Bachmann-Gmür  wrote:
> Which reminded me that we already discussed once that the artifact names
> are unnecessarily long; I created STANBOL-820. Maybe some other renaming could
> be done with that?
>
> Cheers,
> Reto
>
> On Tue, Nov 27, 2012 at 12:17 PM, Rupert Westenthaler <
> rupert.westentha...@gmail.com> wrote:
>
>> Hi Fabian
>>
>> Short version:
>>
>> I totally agree. Our vocabulary has changed over time, but the Engines
>> still use the names as when they where introduced. Changing them
>> (artifactIds and class names) is dangerous as this does break
>> backwards compatibility. So I would suggest change names only if we
>> can also come up with better implementation/design.
>>
>> Regarding Vocabulary I think we should prefer the terms
>> "EntityLinking" and "NamedEntityLinking" and deprecate all others like
>> "keyword" instead of "entity" or "extraction" or "tagging" instead of
>> "linking".
>>
>> The 'engines/entitylinking' and 'engines/entityhublinking' introduced
>> by STANBOL-733 do already use this new terminology. They also
>> deprecate the 'engines/keywordextraction'.
>>
>> - - -
>>
>> Long version with more background information
>>
>> Regarding the linking of Entities there are currently two different
>> principles:
>>
>> * "NamedEntityLinking": A "NamedEntity" has a 'selected text' AND a
>> 'type'. So the selected text AND the type can be used for linking
>> * "EntityLinking": An "Entity" does only have a 'selected text'. Here
>> linking is only possible based on the selected text.
>>
>> The plan would be to also have two Engine implementations that support
>> those linking models.
>>
>> * 'NamedEntityLinkingEngine' (currently /engines/entitytagging)
>> * 'EntityLinkingEngine' (was /engines/keywordextraction (now
>> deprecated) ; since yesterday  /engines/entitylinking)
>>
>> Those should not have external dependencies (meaning to Stanbol
>> components other than Stanbol Commons, Enhancer module; also not other
>> major frameworks such as Solr or OpenNLP; no calls to external
>> services). That would allow to keep those Engines within the enhancer
>> module but also means that those implementation can not be directly
>> used by the user (as the Service used for linking will be just defined
>> by an Interface without an actual implementation.
>>
>> Because of that there will be "Engines" that are based on the above,
>> but come with adapters to Services that do support the EntityLookup.
>> The default will be implementations based on the StanbolEntityhub, but
>> Stanbol users could also implement versions for their own
>> infrastructure needs.
>>
>> The "EntityhubLinking" module [1] is the first example. When you look
>> at the module you will recognize that it does not contain an single
>> EnhancementEngine implementation. It only provides Entityhub specific
>> implementations of the EntitySearcher interface defined by the
>> "EntityLinkingEngine" and a OSGI component that allows users to
>> configure an EntityLinkingEngine instance that uses the Entityhub to
>> lookup Entities.
>>
>> Current state:
>>
>> Currently we are not yet there. The '/engines/entitytagging' still
>> implements both NamedEntityLinking AND Lookup via the Entityhub. This
>> engine could be replaced by a 'engines/namedentitylinking' that
>> follows the design as described above. The new
>> '/engines/entitylinking' already implements the above design. However
>> it still depends on the Entityhub, because the EntitySearcher
>> interface [3] that is still using the Entityhub Model classes.
>>
>> 'engines/entityhublinking' currently provides the ability to do
>> 'entitylinking' with the Entityhub. As soon as the
>> 'engines/namedentitylinking' is available I would add named entity
>> linking functionality to that module. In a last step this module will
>> also move out of the /enhancer component (as already suggested by
>> STANBOL-805 [4]).
>>
>>
>> BTW this design

Re: Enabling security be default

2012-11-29 Thread Rupert Westenthaler
Hi all

Regarding Security I am missing the following things:

1. HOWTO configure users and passwords: I would like to have the
possibility to do that via the Felix Webconsole (e.g. a dedicated Stanbol
User Management and/or Stanbol Security tab). This is simply because
that will be the place where users will look first. So even if that is
not possible I would suggest adding such a tab that shows a
description of how to do it.

2. User Documentation: On the webpage there should be a dedicated section
for Security: which launchers support it, which bundle lists to include,
how to configure it ...

3. Developer Documentation: How to add higher level Permissions to a
Stanbol Component, with an example and walkthrough. The best would be
an example for an EnhancementEngine.

4. Definition/Implementation of Stanbol Component specific Permissions
in dedicated modules (e.g. a module like o.a.s.enhancer.security) that
contain Permissions (and other useful stuff) relevant for the Stanbol
Enhancer (e.g. Execute Enhancement Engine, Enhance Content for a
Language, Enhance a Content Item with a maximum size ...); see the
rough sketch after this list.

5. Integration tests that test security
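
As a very rough illustration of what point 4 could look like (the class
name and the checked permission are made up for this mail, not an agreed
design):

import java.security.BasicPermission;

/**
 * Hypothetical permission that guards the execution of a single
 * enhancement engine; the permission name is the name of the engine.
 */
public class ExecuteEnginePermission extends BasicPermission {

    public ExecuteEnginePermission(String engineName) {
        super(engineName);
    }
}

// The EnhancementJobManager (or the engine itself) could then check it:
//   SecurityManager sm = System.getSecurityManager();
//   if (sm != null) {
//       sm.checkPermission(new ExecuteEnginePermission(engine.getName()));
//   }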

If those things were available I would feel much better about voting
on Security, because currently my understanding is on a very
abstract level (based on the discussion in the thread already linked
by Fabian [1]).


best
Rupert


[1] http://markmail.org/message/yamwhcla3b2j4onj


--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Stanbol Namespace Prefix Service (was: How to introduce new features in Stanbol)

2012-11-30 Thread Rupert Westenthaler
On Fri, Nov 30, 2012 at 2:14 PM, Reto Bachmann-Gmür  wrote:
> Hi Fabian and Rupert,
>
> Was there something missing in the service I added for CLEREZZA-222, or why
> are you suggesting a new interface?
>

I had totally forgotten about CLEREZZA-222. I only searched for
STANBOL issues. The reason why I finally started to implement this
service is that the NamespaceEnums are the last part that blocks the
separation of the Enhancer EntityLinking from the Entityhub (see
STANBOL-823).

Is CLEREZZA-222 already implemented? If not, then we could have a
"parser level" solution in Clerezza that is compatible with the
current Jena Parsers AND use STANBOL-823 as a "user level"
implementation that provides different sources for mappings and also
allows managing custom mappings.


WDYT
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: DBPedia Spotlight Enhancer is not working

2012-12-03 Thread Rupert Westenthaler
Hi Rafa,

a quick lookup in the code has shown that SUPPORTED_LANGUAGES is
hardcoded to "en" at the moment. Making this configurable is not a
big deal. Even better would be if Spotlight provided a service
where one can request the supported languages.

I suggest opening a JIRA issue about that. If we go for the "make it
configurable" option, then I can provide a fix later this week.

best
Rupert

On Mon, Dec 3, 2012 at 4:33 PM, Iavor Jelev
 wrote:
> Hi Rafa,
>
> yes, it should. For Spanish - add (or change to) "es". For further
> languages, please refer to:
>
> http://stanbol.apache.org/docs/trunk/components/enhancer/engines/langidengine.html
>
> cheers,
> Iavor
>
> Am 03.12.2012 16:28, schrieb Rafa Haro:
>> Hi Iavor,
>>
>> thanks for your quick response. Until have it configurable, to get by
>> now, would it work just adding new languages to this parameter?
>>
>> Thanks. Regards
>>
>> El 03/12/12 15:58, Iavor Jelev escribió:
>>> Hi Rafa,
>>>
>>> you are correct, that's the cause. At the time we contributed it, we
>>> aggreed on english. The parameter can be found in [path to
>>> engines]/dbpspotlight/Constants.java
>>>
>>> The parameter is called SUPPORTED_LANGUAGES.
>>>
>>> I think it is time we make that configurable.
>>>
>>> best,
>>> Iavor
>>>
>>> Am 03.12.2012 14:37, schrieb Rafa Haro:
>>>> Hi again,
>>>>
>>>> In the post about DBpedia Spotlight and Apache Stanbol Integration by
>>>> Iavor Jelev [1] you can read exactly the following:
>>>>
>>>> */$chainURL/dbpspotlight/*/
>>>> //This chain replicates the functionality of dbpspotlightannotate, by
>>>> chaining dbpspotlightspot and dbpspotlightdisambiguate. Please note that
>>>> langidis run first, and only english texts are processed. In the near
>>>> future, DBpedia Spotlight will support multiple languages and this
>>>> constraint will be adapted accordingly./
>>>>
>>>> Is this maybe a hard-coded restriction?
>>>>
>>>> Regards
>>>>
>>>> [1]
>>>> http://blog.iks-project.eu/dbpedia-spotlight-integration-in-apache-stanbol-2/
>>>>
>>>>
>>>>
>>>> El 03/12/12 09:39, Rafa Haro escribió:
>>>>> Hi Rupert,
>>>>>
>>>>> As always, thanks for your help. Inspecting the logs, part of the
>>>>> mystery has clarified. Basically, the problem is the language. I'm
>>>>> trying to test DBPedia Spotlight enhancer with Spanish texts. So, I
>>>>> did a request to the Stanbol Dev Server with a Spanish text and got
>>>>> the same result. Then I configured again my local Stanbol to work with
>>>>> a local installation of DBPedia Spotlight, try again with a Spanish
>>>>> text and this time I can read the following messages in the log file:
>>>>> /
>>>>> //[Thread-114]
>>>>> org.apache.stanbol.enhancer.engines.langdetect.LanguageDetectionEnhancementEngine
>>>>>
>>>>> language identified: [es:0.96063192582]//
>>>>> //03.12.2012 08:52:55.386 *INFO* [Thread-116]
>>>>> org.apache.stanbol.enhancer.engines.dbpspotlight.utils.SpotlightEngineUtils
>>>>>
>>>>> DBpedia Spotlight can not process ContentItem
>>>>> 
>>>>> because language es is not supported (supported: [en])/
>>>>>
>>>>> So far, I haven't been able to find anything to change supported
>>>>> languages for the enhancer. I suppose that it should be possible to do
>>>>> that, am I wrong??
>>>>>
>>>>> Thanks. Regards
>>>>>
>>>>> El 01/12/12 15:19, Rupert Westenthaler escribió:
>>>>>> On Fri, Nov 30, 2012 at 2:14 PM, Rafa Haro  wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I was trying to test the DBPedia Spotlight enhancer with a local
>>>>>>> installation of DBPedia Spotlight in an out-of-the-box Stanbol
>>>>>>> from the
>>>>>>> repository. So, I changed the URL of the service in
>>>>>>> dbpspotlightannotate
>>>>>>> engine to my point to my local service endpoint. When I tested it,
>>>>>>> the
>>>>>>> enhancement chain always stopped at language detection engine. I
>>>&g

Re: DBPedia Spotlight Enhancer is not working

2012-12-03 Thread Rupert Westenthaler
Hi Reto, Rafa

@Rafa: I do not think this is related to the spotlight engine.

@Reto: Could this be related to the Stanbol Security? Is this already
active? Andreas Gruber was asking me about the same Exception earlier
today.

On Mon, Dec 3, 2012 at 7:09 PM, Rafa Haro  wrote:
> //03.12.2012 18:54:25.738 *WARN* [1548445156@qtp-558009892-5]
> org.apache.felix.http.jetty /enhancer/chain/dbpedia-spotlight
> (java.lang.RuntimeException: java.security.PrivilegedActionException:
> java.io.IOException: Stream closed) java.lang.RuntimeException:
> java.security.PrivilegedActionException: java.io.IOException: Stream
> closed//

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: DBPedia Spotlight Enhancer is not working

2012-12-04 Thread Rupert Westenthaler
Hi


Can you provide additional information on how to reproduce this? I
tried some texts that give no results but those have not triggered
this.

best
Rupert

On Wed, Dec 5, 2012 at 12:06 AM, Pablo N. Mendes  wrote:
> Just for clarity, the log message tells you (perhaps not clearly enough)
> that the output of spotlight is empty:
>
> "//Información: Removed 1 (100 percent) spots using spotSelector
> ChainedSelector//"
>
> However, there seems to be still a problem on Stanbol's side when no
> results are returned. Even with perfectly installed DBpedia Spotlight, it
> is conceivable that some piece of text will have no annotations (rare, but
> possible). The enhancement engine should not break from that. From Rupert's
> message I understand that he's on top of this issue, but I just wanted to
> make sure that this is clear.
>
> Cheers
> Pablo
> On Dec 4, 2012 5:58 PM, "Rafa Haro"  wrote:
>
>> Hi Rupert and Reto,
>>
>> Just wanted to let you know that finally the problem was produced by an
>> empty output of DBpedia Spotlight due to a bad configuration in its side.
>>
>> Thanks for your help again
>>
>> Regards
>>
>> El 03/12/12 19:20, Rupert Westenthaler escribió:
>>
>>> Hi Reto, Rafa
>>>
>>> @Rafa: I do not thing this is related to the spotlight engine
>>>
>>> @Reto: Could this be related to the Stanbol Security. Is this already
>>> active. Andreas Gruber was asking me about the same Exception earlier
>>> today.
>>>
>>> On Mon, Dec 3, 2012 at 7:09 PM, Rafa Haro  wrote:
>>>
>>>> //03.12.2012 18:54:25.738 *WARN* [1548445156@qtp-558009892-5]
>>>> org.apache.felix.http.jetty /enhancer/chain/dbpedia-**spotlight
>>>> (java.lang.RuntimeException: java.security.**PrivilegedActionException:
>>>> java.io.IOException: Stream closed) java.lang.RuntimeException:
>>>> java.security.**PrivilegedActionException: java.io.IOException: Stream
>>>> closed//
>>>>
>>> best
>>> Rupert
>>>
>>> --
>>> | Rupert Westenthaler rupert.westentha...@gmail.com
>>> | Bodenlehenstraße 11 ++43-699-11108907
>>> | A-5500 Bischofshofen
>>>
>>
>> This message should be regarded as confidential. If you have received this
>> email in error please notify the sender and destroy it immediately.
>> Statements of intent shall only become binding when confirmed in hard copy
>> by an authorised signatory.
>>
>> Zaizi Ltd is registered in England and Wales with the registration number
>> 6440931. The Registered Office is 222 Westbourne Studios, 242 Acklam Road,
>> London W10 5JJ, UK.
>>
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: DBPedia Spotlight Enhancer is not working

2012-12-06 Thread Rupert Westenthaler
On Wed, Dec 5, 2012 at 1:09 PM, Rafa Haro  wrote:
> Hi Rupert,
>
> I haven't been able to reproduce it using online dbpedia spotlight web
> service. The only way I know to trigger the exception is using a local
> installation of DBpedia Spotlight and loading a wrong spotting dictionary. I
> know that sounds very specific but I have tried other options and never get
> that exception again

Ok. I was just interested, because when we integrated Stanbol with the
LinkedMediaFramework we were also seeing "Caused by:
java.io.IOException: Stream closed" exceptions from time to time. We
were never completely sure about their cause (as they might also be
caused by the request being closed). However your issue suggests
that those exceptions can be indirectly caused by some other exception
within Stanbol, which is definitely interesting and might be an
indication of some bug somewhere in the Stanbol commons.web modules.

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: several integration questions

2012-12-08 Thread Rupert Westenthaler
On Sat, Dec 8, 2012 at 7:58 AM, Alexey Kudinov  wrote:
> Hi,
>
> I'm new to stanbol (I do have some experience with SOLR), and a few things
> are not clear from the wiki:
>
> 1.   Can I integrate Stanbol EntityHub with an external SOLR instance?
>

Yes this is possible. The {name}.solrindex.zip files are compressed
Solr core directory structures. Just unpack them and install them on
your Solr server. If you want to start from an empty Site you can use
[1].

When the Solr Core is available on your Solr server you need to
configure the URL to the RESTful API in the "Solr Index/Core"
(org.apache.stanbol.entityhub.yard.solr.solrUri) field of the Solr
Yard.


[1] 
https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/yard/solr/src/main/resources/solr/core/default.solrindex.zip
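
For illustration, the same configuration can also be created
programmatically via the OSGi ConfigurationAdmin (the factory PID and
the core URL below are assumptions made up for this mail; only the
solrUri property name is taken from above, and in practice you can
simply enter the value in the Felix Webconsole):

import java.util.Hashtable;

import org.osgi.service.cm.Configuration;
import org.osgi.service.cm.ConfigurationAdmin;

public class ExternalSolrYardConfigSketch {

    /** Creates a SolrYard configuration pointing to a remote Solr core. */
    public static void configure(ConfigurationAdmin configAdmin) throws Exception {
        // Assumed factory PID of the SolrYard component -- verify it in the
        // Felix Webconsole of your Stanbol instance.
        Configuration cfg = configAdmin.createFactoryConfiguration(
                "org.apache.stanbol.entityhub.yard.solr.impl.SolrYard", null);
        Hashtable<String, Object> props = new Hashtable<String, Object>();
        // Property name as given above; value points to the external core.
        props.put("org.apache.stanbol.entityhub.yard.solr.solrUri",
                "http://my-solr-host:8983/solr/myEntityCore");
        cfg.update(props);
    }
}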


> 2.   My application should run in the enterprise environment, and my
> managed site requires user information, mainly for security purposes. I
> already have a fine-grained security module for SOLR (reflecting user
> repositories). How can I pass relevant user information through Stanbol
> EntityHub API? I know that I can pass-by the EntityHub and call Solr API,
> but it would be the last resort.
>

I do not fully understand this question. But maybe the "Multiple Yard
Layout" of the SolrYard could help you.

Basically the SolrYard supports the creation of multiple instances
that access the same Solr Core. To activate this you need to enable
the "Multiple Yard Layout"
(org.apache.stanbol.entityhub.yard.solr.multiYardIndexLayout). What
this does is that the SolrYard will add an additional field '_domain'
and store the name of the SolrYard as its value for all Entities stored
by this SolrYard. Also all queries will use this as an additional
constraint.

This feature was introduced to allow the storage of multiple
(typically small) vocabularies within the same Solr Core, but maybe it
could also be useful for your use case.
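
To illustrate the effect of the multi yard layout: every lookup against
the shared core behaves as if an additional filter on the '_domain'
field was added. A SolrJ sketch of the equivalent query (the core URL,
the yard name and the label field are placeholders; only the '_domain'
field is taken from above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MultiYardQuerySketch {

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/sharedCore");
        SolrQuery query = new SolrQuery("label:Paris");
        // Mirrors the constraint the SolrYard adds automatically when the
        // multiYardIndexLayout is enabled:
        query.addFilterQuery("_domain:\"myYardName\"");
        QueryResponse rsp = solr.query(query);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}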

best
Rupert

> Thanks,
>
> Alexey
>



--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen

