Re: Last call for participation (24-25. Sept): MLODE - Multilingual Linked Open Data for Enterprises

2012-09-21 Thread Rupert Westenthaler
Hi all,

Apache Stanbol will be represented at this workshop. AFAIK Fabian and
myself will participate. There will be also two Stanbol related
presentations and hopefully also a lot of discussions especially about
the new Stanbol NLP processing (STANBOL-733).

Hope to see you in Leipzig next week.

best
Rupert

On Fri, Sep 21, 2012 at 1:53 PM, Sebastian Hellmann
 wrote:
> ##Apologies for cross-posting##
> On September 23-24-25, the Multilingual Linked Open Data for Enterprises
> Workshop (MLODE) will happen in Leipzig, Germany and is co-located with SABRE
> and
> the Leipziger Semantic Web Day. Please find all information here:
> http://sabre2012.infai.org/mlode
>
> #News #
> * See the people attending the conference in our people viewer (add
> yourself, if
> you are attending) - http://mlode.nlp2rdf.org/people/view.html
> * In parallel to the code-a-thon there will be an Apache Stanbol and Linked
> Media Framework Tutorial from 9 am to 12:30 pm (please join no later than 10
> am)
> and a LOD2 Stack Tutorial at 2 pm -
> http://wiki.aksw.org/Events/2012/LeipzigerSemanticWebDay/Tutorien
> * Twitter tag #mlode
> * Program published - http://tinyurl.com/mlode-schedule
> * Please apply for lightning talks here: mlode2012 -at-
> lists.informatik.uni-leipzig.de
> * Don't forget to send your submission for the Monnet Challenge to John McCrae -
> to win up to 600 Euro - http://sabre2012.infai.org/mlode/monnet-challenge
> * Code-a-thon: We will provide support and assistance for developers new to
> RDF
> * If you arrive on Sunday, you can join us for the zero day, where we
> brainstorm
> for the code-a-thon: Leipziger Zoo at 10 am and the bar Kicker IN at 7 pm
> * The workshop is accompanied by data post proceeding Special Issue in the
> Semantic Web Journal -
> http://www.semantic-web-journal.net/blog/call-multilingual-linked-open-data-mlod-2012-data-post-proceedings
>
>
>
> We would like to thank our sponsors for supporting the workshop:
> * The MultilingualWeb-LT Working Group -
> http://www.w3.org/International/multilingualweb/lt/
> * The Interactive Knowledge Stack (IKS) EU Research Project -
> http://www.iks-project.eu/
> * The Monnet Project - http://www.monnet-project.eu/
>
>
> We all hope to see you there,
> Sebastian Hellmann and  Steven Moran
> on behalf of the whole MLODE organisation committee
>
>
> --
> Dipl. Inf. Sebastian Hellmann
> Department of Computer Science, University of Leipzig
> Events:
> * http://sabre2012.infai.org/mlode (Leipzig, Sept. 23-24-25, 2012)
> * http://wole2012.eurecom.fr (*Deadline: July 31st 2012*)
> Projects: http://nlp2rdf.org , http://dbpedia.org
> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
> Research Group: http://aksw.org



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Update CELI engines to use Stanbol NLP processing

2012-09-21 Thread Rupert Westenthaler
Hi,

I forgot to include the dev list in my last response to Alessio, hence the forward.

On Fri, Sep 21, 2012 at 3:22 PM, Rupert Westenthaler
 wrote:
> Hi Alessio,
>
> On Fri, Sep 21, 2012 at 12:51 PM, Alessio Bosca  wrote:
>> We are surely willing to contribute to the development of the engines and I
>> will work on the requested modifications for supporting the  AnalyzedText
>> content part.
>
> That's cool to hear. I already started some things. I will commit those
> later today so that you can continue from there.
>
>> We will also provide you a mapping for the POS tagset and the other lexical
>> features.
>
> If documentation of the POS tag sets is available, it would
> be cool if you could link it. When I commit my local changes there
> will be a "PosTagSetRegistry" in
> "org.apache.stanbol.enhancer.engines.celi" where you can add the
> mappings.
>
>>I will check with the team responsible for the morphological
>> analyzer about the confidence level or the ranking of multiple readings as
>> I'm not sure about that.
>>
>> Concerning the missing readings for some lexical entries it is because the
>> unrecognized term are not present in the lexicon of the morphological
>> analyzer; they are "unknown" words so to say.
> It happens with misspelled words or unknown named entities. It is possible to
> explicitly set a POS "Unknown" lexical feature for them, if you wish so, but
> there are no lexical features retrieved by the morphological analyzer itself.
>> Let me know if you want this update as well.
> Calling the named entities engine for Italian may be an alternative way for
> getting more info on those textual fragments.
>>
>
> OK, that explains a lot. I had the impression that there is first a POS
> tagger and then a morphological analyzer uses those results to provide
> the lemmas and other information. If the morphological analyzer adds
> possible lemmas based on a lexicon lookup of the words, I would expect that there are no
> results for some words and also that there are multiple readings for
> others.
>
> Does linguagrid also have a POS tagging service?
>
>> I will send you an update next week as soon as I finished to integrate the
>> updates
>>
>
> I am in Leipzig next week, so I might not be as responsive as usual.
>
> best
> Rupert
>
>>
>> Bests
>> Alessio
>>
>>
>> On 09/21/2012 09:16 AM, Rupert Westenthaler wrote:
>>>
>>> Hi Alessio, all
>>>
>>> I have started to work on the migration of the CELI lemmatizer Engine
>>> to the new Stanbol NLP processing module (STANBOL-733, STANBOL-738).
>>> Basically the Idea was to adapt the Lemmatizer Engine to use the
>>> AnalysedText ContentPart (STANBOL-734) to store its result. The goal
>>> of this work is being able to use word level NLP analyses result of
>>> CELI in Apache Stanbol (e.g. CELI POS tags and lemma information for
>>> looking up terms with the KeywordLinkingEngine). Achieving this would
>>> open up a lot of additional possibilities for Stanbol Users that want
>>> to use the CELI services.
>>>
>>> While working on this I came across the following things:
>>>
>>> (1) I recognized that the Lemmatizer Service does not provide
>>> information for all Words (LexicalEntry). As an example in the
>>> sentence
>>>
>>>  Lo scandalo dei fondi pubblici sperperati in allegria dalla Regione
>>>  Lazio ha dato i primi frutti: ieri il capogruppo Pdl Francesco
>>> Battistoni
>>>  si è dimesso e la sede del Consiglio è stata invasa dalla Guardia
>>> di Finanza.
>>>
>>> the LexicalEntries for "Pdl Francesco Battistoni si" do not have any
>>> metadata (no ). Do you know why this is the case? Is there a
>>> possibility to obtain LexicalFeatures for all words?
>>>
>>> (2) The Stanbol NLP processing module maps POS tag sets used by NLP
>>> processing frameworks to Morphosyntactic Categories defined by the
>>> OLIA ontology [1]. Used categories are defined by the LexicalCategory
>>> enumeration [2]. Actual POS tags are represented by the PosTag class
>>> [3] that provides (1) the tag as string and optionally (2) the
>>> LexicalCategory. While LexicalCategories are optional they are
>>> important as they allow other components to determine the type of a
>>> word in a language-independent way. Because of that it would be
>>> important to map the POS tag sets used by CELI to the
>>> LexicalCategories used b

Re: SVN moved - you have to switch

2012-09-21 Thread Rupert Westenthaler
Hi

On Fri, Sep 21, 2012 at 8:13 PM, Alessandro Adamou  wrote:

> Sure it isn't something like
>
>   svn switch https://svn.apache.org/repos/asf/stanbol/trunk .
> from within the working copy dir?
>
> This will preserve uncommitted changes, right?

I used exactly the command you proposed. It preserves uncommitted
changes, but it also does an svn up, so you might get conflicts. I had no
problems with the trunk and branches; only the Stanbol webpage created
problems because of some folder changes needed by the move to its own
sub-domain.

best
Rupert

-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Trunk Console Changes

2012-09-23 Thread Rupert Westenthaler
at
>> >
>> >
>> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:926)
>> > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>> > at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>> > at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>> > at
>> >
>> >
>> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>> > at org.mortbay.thread.QueuedThreadPool
>> >
>> > TIA
>> >
>> > Dave
>> > **
>> >
>>
>
>
>
> --
> Regards
>
> Dave Butler
> butlerdi-at-pharm2phork-dot-org
>
> Also on Skype as pharm2phork
>
> Get Skype here http://www.skype.com/download.html
>
>
> **
> This email and any files transmitted with it are confidential and
> intended solely for the use of the individual or entity to whom they
> are addressed. If you have received this email in error please notify
> the system manager.
>
> This footnote also confirms that this email message has been swept by
> MIMEsweeper for the presence of computer viruses.
>
> www.mimesweeper.com
> **



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: failing build because CELI NER test cannot reach external server

2012-09-25 Thread Rupert Westenthaler
Hi,

I can confirm that

1. the Integration tests do run in OfflineMode
2. that the CELI engines do support OfflineMode and therefore
deactivate themselves if stanbol is started in OfflineMode

For unit tests it is a real dilemma: on the one hand, tests failing
because of non-functional (or non-reachable) external services are
inconvenient; on the other hand, without running those tests with the
normal tests there is a good chance that we will miss API (and data)
changes of those external services. E.g. some time ago the confidence
values provided by geonames.org changed from [0..1] to [0..*]. This
would not have been detected without running those tests with every
build.

So what I typically do is catch IOExceptions that typically result
from unreachable (or timed-out) calls to external services. This ensures
that tests do not fail if the external service is down - or the
developer does not have an internet connection - while still ensuring
that the Stanbol engines are validated against the external services
with every build.
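
To illustrate the pattern with a minimal sketch (hypothetical JUnit 4 code - the
service URL and the helper method are placeholders, not the actual CELI test code):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;

    import org.junit.Assume;
    import org.junit.Test;
    import static org.junit.Assert.assertFalse;

    public class RemoteServiceEngineTest {

        @Test
        public void testAgainstRemoteService() throws Exception {
            String response;
            try {
                response = callRemoteService("http://example.org/some-remote-nlp-service");
            } catch (IOException e) {
                // service down, timed out or no internet connection:
                // mark the test as skipped instead of failing the build
                Assume.assumeNoException(e);
                return;
            }
            // only validate the response if the service was actually reachable
            assertFalse(response.isEmpty());
        }

        // placeholder for the actual call to the engine / remote service client
        private String callRemoteService(String serviceUrl) throws IOException {
            URLConnection con = new URL(serviceUrl).openConnection();
            con.setConnectTimeout(5000);
            con.setReadTimeout(5000);
            InputStream in = con.getInputStream();
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                for (int len = in.read(buf); len != -1; len = in.read(buf)) {
                    out.write(buf, 0, len);
                }
                return out.toString("UTF-8");
            } finally {
                in.close();
            }
        }
    }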

I will have a look at what kind of exception caused the failed build and
make sure that those cause tests to be skipped in future builds.

WDYT
Rupert

On Mon, Sep 24, 2012 at 2:28 PM, Reto Bachmann-Gmür  wrote:
> On Mon, Sep 24, 2012 at 1:37 PM, Bertrand Delacretaz > wrote:
>
>> On Mon, Sep 24, 2012 at 12:19 PM, Reto Bachmann-Gmür 
>> wrote:
>> > On Mon, Sep 24, 2012 at 8:54 AM, Bertrand Delacretaz <
>> bdelacre...@apache.org
>> >> On Sunday, September 23, 2012, Reto Bachmann-Gmür wrote:
>> >> > ...I think the tests should not access external services...
>> >> Yes - there's OfflineMode for that.
>> >>
>> > ...The offline mode is for not updating libraries from the remote repos.
>> If I
>> > have the libraries in the local repository I can use the offline mode. If
>> > tests are skipped in offline mode this means that some projects might
>> > build
>> > with -o and fail otherwise...
>>
>> Sorry I was too terse maybe, didn't mean maven's offline mode, but the
>> STANBOL-86 OfflineMode service, which allows you to modify your
>> service's behavior when the system should not make any external
>> requests, which IMO is needed when running our automated tests.
>>
>
> The test that was failing was a unit tests that doesn't use OSGi and thus
> couldn't use STANBOL-86.
>
> For the integration tests STANBOL-86 seems a good approach. If I understand
> STANBOL-87 <https://issues.apache.org/jira/browse/STANBOL-87> correctly the
> integrations tests are always run in offline mode. I think the unit tests
> should always be offline while for the integration tests it might be good
> to have the options to check if things integrate with the outside world.
>
> Cheers,
> Reto
>
>
>> -Bertrand
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: enhancer parameter outputContentPart

2012-09-27 Thread Rupert Westenthaler
Hi Melanie

from http://stanbol.apache.org/docs/trunk/components/enhancer/contentitem.html

There are two types of content parts:

2. Content parts that are registered under a predefined URI. [..] This
is used to
   share intermediate enhancement results between enhancement engines.
   An example would be tokens, sentences, POS tags and chunks that are
   extracted by some NLP engine.

An example of such a content part is the ExecutionMetadata.

from: 
http://stanbol.apache.org/docs/trunk/components/enhancer/executionmetadata.html

When the EnhancementJobManager starts the Enhancement of a ContentItem
it needs to check if the ContentItem already contains ExecutionMetadata in the
ContentPart with the URI


"http://stanbol.apache.org/ontology/enhancer/executionmetadata#ChainExecution";.


If this is the case it needs to initialize itself based on the
pre-existing information.
If no ExecutionMetadata are present, a new EnhancementProcess needs to be
created based on the parsed Chain. Differences between these two cases are
explained in the following two sub-sections.


So one example usage of the "outputContentPart" would be to explicitly
include the ExecutionMetadata within the response of the Stanbol Enhancer.

For this you need to add the parameter


outputContentPart=http://stanbol.apache.org/ontology/enhancer/executionmetadata#ChainExecution

to the request of the StanbolEnhancer.

If you want to include the plain text version of the parsed content in
the response you
need to use

outputContent=text/plain

as the URI of the text/plain content part is dynamically generated based on the
MD5 of the plain text content and can therefore not be known in advance.
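
For example (a minimal, hypothetical sketch assuming a Stanbol instance running
at http://localhost:8080; the sample text is arbitrary and the response is just
printed):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class EnhancerRequestExample {

        public static void main(String[] args) throws Exception {
            String chainExecution =
                    "http://stanbol.apache.org/ontology/enhancer/executionmetadata#ChainExecution";
            // request the ExecutionMetadata content part and the plain text version of the content
            URL url = new URL("http://localhost:8080/enhancer"
                    + "?outputContentPart=" + URLEncoder.encode(chainExecution, "UTF-8")
                    + "&outputContent=" + URLEncoder.encode("text/plain", "UTF-8"));
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestMethod("POST");
            con.setDoOutput(true);
            con.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");
            OutputStream out = con.getOutputStream();
            try {
                out.write("Paris is the capital of France.".getBytes("UTF-8"));
            } finally {
                out.close();
            }
            System.out.println("HTTP " + con.getResponseCode());
        }
    }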

best
Rupert

On Thu, Sep 27, 2012 at 2:15 PM, Melanie Reiplinger
 wrote:
> Hi all,
>
> could someone please briefly clarify what the outputContentPart parameter
> for the enhancer does?
>
> About 'content parts', it says on
> http://stanbol.apache.org/docs/trunk/components/enhancer/contentitem.html:
> Content parts are used to represent the original content as well as
> transformations of the original content (typically created by pre-processing
> enhancement engines
> <http://stanbol.apache.org/docs/trunk/components/enhancer/engines/list.html>
> such as the Metaxa engine
> <http://stanbol.apache.org/docs/trunk/components/enhancer/engines/metaxaengine.html>).
> etc. etc.
>
> In the REST doku, it says:
> outputContentPart=[uri/'*']: This parameter allows to explicitly include
> content parts with a specific URI in the response. Currently this only
> supports ContentParts that are stored as RDF graphs.
>
> Does this mean I'll specify the URI of a content item already present on the
> contenthub? And what is the use of that ?
>
> Thanks,
> Melanie



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Lessons learnt from EAP+ questions about future directions

2012-09-27 Thread Rupert Westenthaler
Hi Mihály

On Tue, Sep 25, 2012 at 9:07 PM, Mihály Héder  wrote:
> Hi All,
>
> I have written a blog post about the lessons learnt from the EAP project I
> had been working on:
> http://blog.iks-project.eu/lessons-learnt-while-working-with-apache-stanbol/
>

Thanks for this blog post. It is really valuable feedback.
I will try to answer some of your questions.

> The reason I'm citing this here is that I'm interested in your opinion on
> the following mid-term development questions and suggestions (discussed in
> detail in the post):
> -What is the best way to monitor a running stanbol instance with
> munin/nagios/icinga, etc? How can I extract e.g. an enhancement/hour
> statistic from stanbol?

Within Apache Stanbol the EnhancementJobManager collects the
ExecutionMetadata [1]. They are stored in a dedicated ContentPart of the
processed ContentItem.

So one possibility would be to add a feature to the EnhancementJobManager that
allows logging this information (or even storing it in an RDF triple store).

If we do that, it would really allow very fine-grained analyses of the requests
processed by the Stanbol Enhancer.


[1] 
http://stanbol.apache.org/docs/trunk/components/enhancer/executionmetadata.html
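
A rough sketch of the idea (hypothetical code, not existing Stanbol
functionality - the class and logger names are made up): the
EnhancementJobManager would write one log line per processed ContentItem, and a
monitoring tool such as munin could derive an enhancements/hour statistic from
that log.

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    /** Hypothetical helper the EnhancementJobManager could call after each request. */
    public class EnhancementStatsLogger {

        private static final Logger STATS = LoggerFactory.getLogger("stanbol.enhancer.stats");

        /** Logs one line per processed ContentItem (URI, chain and processing time). */
        public void logCompleted(String contentItemUri, String chainName, long startTimeMs) {
            long durationMs = System.currentTimeMillis() - startTimeMs;
            STATS.info("enhanced {} with chain '{}' in {} ms",
                    new Object[]{contentItemUri, chainName, durationMs});
        }
    }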

> -I think at some point we should create a standardized REST API through
> which non-java EEs could be accessed.

I am not sure what such an interface should look like. I could think
of an interface that POSTs the current metadata of the ContentItem
to some URI. The results could again be RDF that is then added to the
ContentItem. Maybe one could even allow the definition of some kind of
filter so that not the whole RDF metadata needs to be serialized.

Non-Java EEs that also need the content (e.g. the text/plain Blob)
would need a different kind of interface.

BTW: Serialization/Deserialization of ContentItems is already
implemented (by using multipart mime).

> -Also, I think that if we had some standardized description XML or whatever
> format that would tell what kind of output a certain EE produces, that
> would be helpful.

I would really like to have EnhancementEngines provide RDF
descriptions of themselves in response to a GET request to

http://{stanbol-instance}/enhancer/engine/{engine-name}

If those descriptions also included information about the
consumed/produced elements, that would be great.

However this feature is much more important for UIMA than for Stanbol,
because with Stanbol, EnhancementEngines are expected to create
annotations that conform to the EnhancementStructure.

best
Rupert


-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Replace or augment UIMA/OpenNLP pipeline with Stanbol

2012-09-27 Thread Rupert Westenthaler
Hi Wayne,


On Thu, Sep 27, 2012 at 6:17 PM, Wayne Rasmuss
 wrote:
> I've been working with UIMA and OpenNLP together. Basically I've got the
> OpenNLP/UIMA example working. This gives me annotated text with tokens,
> sentences, parts of speech, chunks (verb phrase, noun phrase, etc.) It also
> attempts organizations, dates and locations though I don't get reliable
> results with them. Mostly I'm interested in parts of speech and chunks
> anyway.
>

Word-level NLP annotations are currently not included in the
enhancement results. This is mainly because this would result in 20+
triples per word. However, with STANBOL-733 "Stanbol NLP processing"
this feature will be added. Development of this is done in its own
branch [1]. This branch also includes its own Stanbol launcher that
allows you to easily test the current state of development (build and
start the launcher and then post some text to
http://localhost:8080/enhancer/chain/nlp-processing).

I will give you a short overview. Details can be found in JIRA:

* AnalysedText: Java Domain Model that represents results of NLP. The
AnalysedText is added to the ContentItem as ContentPart (see
STANBOL-734 for code examples)
* NLP 2 RDF: This is an EnhancementEngine that converts the
information of the AnalysedText to RDF by using NIF (NLP Interchange
Format) - a set of OWL ontologies that allow NLP results to be formally
represented (see STANBOL-741). NOTE that the NLP results provided by the
nlp-processing chain of the Stanbol launcher already use NIF.
* The opennlp.pos EnhancementEngine supports POS tagging of parsed
texts in all languages supported by OpenNLP (STANBOL-735). As part of
that it also detects and adds Sentence annotations. The
opennlp.chunker EnhancementEngine consumes Tokens and POS tags and
performs chunking (STANBOL-736). Chunking is supported for English and
German. There is also a sentiment.wordclassifier EnhancementEngine
that adds sentiment tags on the word level (based on SentiWordNet for
English and SentiWS for German).

You might also have a look at a presentation [2] about the Stanbol NLP
processing module I gave at the MLODE workshop this week in Leipzig.

[1] http://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/
[2] http://stanbol.apache.org/presentations/Stanbol_NLP_processing_2012-09.pdf

> I've been looking around and Stanbol looks like it may be easier to deal
> with and give me more advanced capabilities. I've done the first part of
> the getting started guide, but not the "full" version. I got the web
> interface up and was able to get some enhanced text. So that was great.
>
> After that I'm kind of stumped. I would like to get the annotated text
> (like I'm getting from UIMA/OpenNLP) so we can do analysis on it. Can
> someone help get started with setting up/calling stanbol so I can get the
> details in the enhanced result?
>

If you want to stay with the RESTful service you will need to
implement against the NIF as generated by the "NLP2RDF" engine. If you
plan to access the StanbolEnhancer via its Java API I think that the
API of the AnalyzedText (STANBOL-734) should give you everything you
need.

You might also want to consider implementing your own analysis as a
Stanbol EnhancementEngine. This blog post [3] provides a good introduction
on how to do that.

[3] 
http://blog.iks-project.eu/getting-started-with-apache-stanbol-enhancement-engine/

>
> We're working with Groovy as our glue code. Bertrand provided me with this
> example: https://gist.github.com/2931050 which looks very promising. I think
> what I need to do is basically add OpenNLP enhancers here and figure out
> how to call it.
>

The "opennlp.pos" and "opennlp.cunker" Engines should exactly provide
the information you are looking for. AFAIK the Apache Camel example
provided by Bertrand should allow you to call the according
Engines/Chain and also support direct access to the results stored in
the AnalyzedText content part. But as I am not familiar with Camel it
would be good if Bertrand could confirm this.

Please NOTE that the Stanbol NLP processing is still in heavy
development. So things might still change. The current plan is to have
a first rather stable version of STANBOL-733 available in the trunk
by the end of October.

best
Rupert


-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Update CELI engines to use Stanbol NLP processing

2012-09-28 Thread Rupert Westenthaler
Hi,

Yesterday I committed some changes/additions to the
stanbol.enhancer.nlp module. Please make sure you are on the most
current version.

On Fri, Sep 28, 2012 at 10:44 AM, Alessio Bosca  wrote:
> Hi Rupert,
>
> I completed the POS mappings in the PosTagSetRegistry class and I'm starting
> to add mappings for other morphological features (like gender, number, case)
> using the same approach (i.e. creating a GenderTagsetRegistry).
> I need to create a few classes for the mappings (GenderTag,
> GenderValuesEnum, etc.); should I create them in the celi engine project or should
> I create a proper subpackage (like morphology) on the same level as nlp.pos?
> I'll send you a patch as soon as I finish.

Regarding "morphology"

Please have a look at the o.a.s.enhancer.nlp.morpho package. Yesterday I
defined enumerations for Tenses and Cases (based on the OLIA
ontology). There is also a MorphoAnnotation class. If you need to
change/extend those, feel free to do it. The current state is only a
first proposal (by myself) and clearly needs to be improved/changed.

Regarding "GenderTagsetRegistry":

I would rather opt for a single CeliTagsetRegistry class that can be
used for everything (e.g. getPosTagSet(), getGenderTagSet(), ...) but
you can also create multiple specific registries if you like.

Regarding "GenderTag"

Would you like to introduce "{type}Tag" classes (similar to PosTag)
that hold a String tag and a Category that is a member of the
corresponding enumeration?
Examples would be GenderTag, TenseTag, CaseTag ...
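
A rough sketch of what such a class could look like (only to illustrate the
proposal - the Gender enumeration used here is hypothetical and would be defined
analogous to the Tense and Case enumerations in o.a.s.enhancer.nlp.morpho):

    /** Sketch of the proposed "{type}Tag" pattern. */
    public class GenderTag {

        /** hypothetical enumeration of language-independent gender categories */
        public enum Gender { Masculine, Feminine, Neuter }

        private final String tag;     // the tag as used by the NLP framework (e.g. the CELI service)
        private final Gender gender;  // mapped category; may be null for unmapped tags

        public GenderTag(String tag) {
            this(tag, null);
        }

        public GenderTag(String tag, Gender gender) {
            if (tag == null || tag.isEmpty()) {
                throw new IllegalArgumentException("The tag MUST NOT be null nor empty");
            }
            this.tag = tag;
            this.gender = gender;
        }

        public String getTag() {
            return tag;
        }

        public Gender getGender() {
            return gender;
        }
    }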

best
Rupert

>
> Bests
> Alessio
>
>
> On 09/21/2012 03:22 PM, Rupert Westenthaler wrote:
>>
>> Hi Alessio,
>>
>> On Fri, Sep 21, 2012 at 12:51 PM, Alessio Bosca 
>> wrote:
>>>
>>> We are surely willing to contribute to the development of the engines and
>>> I
>>> will work on the requested modifications for supporting the  AnalyzedText
>>> content part.
>>
>> Thats cool to hear. I already started some thinks. I will commit those
>> later today so that you can continue from their.
>>
>>> We will also provide you a mapping for the POS tagset and the other
>>> lexical
>>> features.
>>
>> If there is a documentation of the POS Tag Sets are available it would
>> be cool if you could link those. When I commit my local changes there
>> will be a "PosTagSetRegistry" in
>> "org.apache.stanbol.enhancer.engines.celi" where you can add the
>> mappings.
>>
>>> I will check with the team responsible for the morphological
>>> analyzer about the confidence level or the ranking of multiple readings
>>> as
>>> I'm not sure about that.
>>>
>>> Concerning the missing readings for some lexical entries it is because
>>> the
>>> unrecognized term are not present in the lexicon of the morphological
>>> analyzer; they are "unknown" words so to say.
>>> It happens with mispelled words or unknown named entities. It is possible
>>> to
>>> explicitly set a POS "Unknown" lexical feature for them, if you wish so,
>>> but
>>> there are no lexical feature retrieved by the morphological analyzer
>>> itself.
>>> Let me know if you want this update as well.
>>> Calling the named entities engine for Italian may be an alternative way
>>> for
>>> getting more info on that textual fragments.
>>>
>> OK that explains a lot. I had the impression that there is first a POS
>> tagger and than a morphological analyzer uses those results to provide
>> the lemmas and other information. If the morphological analyzer adds
>> possible lemmas based on words I would expect that there are no
>> results for some words and also that there are multiple readings for
>> others.
>>
>> Does linguagrid also have a POS tagging service?
>>
>>> I will send you an update next week as soon as I finished to integrate
>>> the
>>> updates
>>>
>> I am in Leibzig next week so I might be not as responsive as usually.
>>
>> best
>> Rupert
>>
>>> Bests
>>>  Alessio
>>>
>>>
>>> On 09/21/2012 09:16 AM, Rupert Westenthaler wrote:
>>>>
>>>> Hi Alessio, all
>>>>
>>>> I have started to work on the migration of the CELI lemmatizer Engine
>>>> to the new Stanbol NLP processing module (STANBOL-733, STANBOL-738).
>>>> Basically the Idea was to adapt the Lemmatizer Engine to use the
>>>> AnalysedText ContentPart (STANBOL-734) to store its result. The goal
>>&g

Re: Lessons learnt from EAP+ questions about future directions

2012-10-01 Thread Rupert Westenthaler
Hi,

let me just comment on your last point

On Mon, Oct 1, 2012 at 8:55 PM, Mihály Héder  wrote:
>> However this feature is much more important for UIMA as for Stanbol,
>> because with Stanbol EnhancementEngines are expected to create
>> Annotations that confirm to the EnhancementStructure.
>
> I totally support the self-description interface you propose, as the
> conformity to the structure is really helpful but not everything. For
> instance I had to experiment with Stanbol to figure out that LangId
> will provide a "dc:language" property, and there will be only one of
> this, not multiple ones (e.g. for every sentence).

This is defined by STANBOL-613.

> Another example is
> that the UIMAToTriples in my current deployment puts an sso:posTag
> property on every TextAnnotation.

Here the idea is to use NIF (NLP Interchange Format), but this is
still in the works. Current work is done in STANBOL-741, but most
likely I will create a separate issue that defines how NIF annotations are
linked to Stanbol enhancements.

Generally representing Word/Phrase level annotations as RDF does not
scale. This is the reason why STANBOL-733 introduced the AnalyzedText
ContentPart. So if you would like to allow other Engines to consume
NLP annotations the UIMA integration should also support the
AnalyzedText ContentPart.

> That might be helpful for other EE
> developers but they have to figure the uri of the property somehow -
> ok, it is in the documentation, but still...
>

Maybe we can use the already existing

org.apache.stanbol.enhancer.servicesapi.ServiceProperties

interface (already implemented by most Enhancement Engines). Possible
additions would include:

* EnhancementFeature: MetadataExtraction, PlainTextExtraction,
LanguageIdentification, POS tagging, Chunking, NER, EntityLinking, ...
* RequiresFeature: Enhancements required by an EnhancementEngine
* supportsLanguage: list of supported languages (with support for
exclusions and wildcards, e.g. !fr, !de, *)
* supportsMimeType: allows an EnhancementEngine to define the
supported mime types
* ...

If we use an Ontology for those Features we can

1. implement the Webservice that publishes the RDF metadata for
EnhancementEngines based on the ServiceProperties provided by an
EnhancementEngine
2. the URIs of those properties would also be a good entry point for
the documentation of how those features are represented in the
EnhancementStructure (or NIF)
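
A rough sketch of how an engine could publish such information (hypothetical
code - the property keys and values below are invented to illustrate the
proposal above and are not existing Stanbol constants; a real engine would
expose them through the ServiceProperties interface mentioned above):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class SelfDescribingEngineSketch {

        /** in a real engine this map would be returned via ServiceProperties */
        public Map<String, Object> getServiceProperties() {
            Map<String, Object> props = new HashMap<String, Object>();
            // hypothetical property keys as proposed above
            props.put("stanbol.enhancer.engine.feature", "PosTagging");
            props.put("stanbol.enhancer.engine.requiresFeature", "LanguageIdentification");
            props.put("stanbol.enhancer.engine.supportsLanguage", new String[]{"!fr", "!de", "*"});
            props.put("stanbol.enhancer.engine.supportsMimeType", "text/plain");
            return Collections.unmodifiableMap(props);
        }
    }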

best
Rupert

> Cheers
> Mihály
>
>> best
>> Rupert
>>
>>
>> --
>> | Rupert Westenthaler rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11 ++43-699-11108907
>> | A-5500 Bischofshofen



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Trunk Console Changes

2012-10-02 Thread Rupert Westenthaler
Hi Reto,

I am really unhappy with this, as access to the Felix Web Console is
critical for Stanbol users and the workaround is much too complex for
most users. Because of that I would like to propose the removal of the
authentication bundle list from the Stanbol launchers as long as this
is not fixed.

Generally I would recommend moving the development of this feature to
its own branch, as I expect it to have a bigger impact on Stanbol and
its components. This would also allow more frequent commits of
intermediate states of the development.

WDYT
Rupert

On Sun, Sep 23, 2012 at 1:24 PM, Reto Bachmann-Gmür  wrote:
> Ok, the problem has to do with two username and password checks which grant
> access under mutually exclusive conditions.
>
> So the work around: set the same password for your stanbol security admin
> user as for the felix console admin user. To do this you may follow the
> reset password instructions at http://incubator.apache.org/clerezza/faq/,
> but you have to install the additional bundle
> org.apache.clerezza:rdf.scala.utils.
>
> To do this enter the following command on the console:
>
> zz>start("
> http://central.maven.org/maven2/org/apache/clerezza/rdf.scala.utils/0.3-incubating/rdf.scala.utils-0.3-incubating.jar
> ")
>
> Then reconnect the console (:q to terminate) and follow the instructions as
> per the clerezza faq.
>
> Clearly this is just a work around. I'm working on a real solution which
> integrates the two authentication mechanisms.
>
> Cheers,
> Reto
>
>
> On Sun, Sep 23, 2012 at 9:55 AM, Rupert Westenthaler <
> rupert.westentha...@gmail.com> wrote:
>
>> Hi Reto,
>>
>> The problem reported here by Dave now also appears on the
>>
>> http://dev.iks-project.eu:8081/system/console
>>
>> Stanbol instance after we have updated it yesterday. Hopefully this
>> helps you in tracking this down.
>>
>> Reto as far as I know you have an admin account on that machine. If
>> not please contact Szaby
>>
>> best
>> Rupert
>>
>>
>> On Sat, Sep 1, 2012 at 5:19 AM, Dave Butler  wrote:
>> > We have been using the same configuration now for about a year. The
>> build,
>> > deployment and running. The change seems to have occured in the last
>> eight
>> > days, as prior builds appear to function. However I was running these
>> from
>> > another machine. Ran from Safari and Chrome and the same behaviour.
>> >
>> >
>> >
>> > On 1 September 2012 03:14, Reto Bachmann-Gmür  wrote:
>> >
>> >> Quite weird, it looks like your browser is sending an Authorization
>> header
>> >> with a value that can't be recognized. Does it work with another
>> browser or
>> >> if you restart your browser. Anyway I'll see how the system could handle
>> >> unrecognized Authorization values more gracefully.
>> >>
>> >> Cheers,
>> >> Reto
>> >>
>> >> On Fri, Aug 31, 2012 at 5:07 PM, Dave Butler 
>> wrote:
>> >>
>> >> > Reto,
>> >> >
>> >> > Starting from command line as normal with java -Xmx2048m
>> >> > -XX:MaxPermSize=256m  -jar
>> >> > org.apache.stanbol.launchers.full-0.10.0-incubating-SNAPSHOT.jar
>> -p9080
>> >> >
>> >> > And I only get this when going to the Osgi Console.
>> >> >
>> >> > The error generated is
>> >> > org.apache.stanbol.commons.security.auth.AuthenticationCheckerImpl No
>> >> > service could unsuccessfully authenticate user admin. Reason: user
>> does
>> >> not
>> >> > exist
>> >> > 31.08.2012 15:43:14.765 *WARN* [141002294@qtp-2046274478-5]
>> >> > org.apache.felix.http.jetty /system/console
>> >> > (java.lang.ArrayIndexOutOfBoundsException: 0)
>> >> > java.lang.ArrayIndexOutOfBoundsException: 0
>> >> > at
>> >> >
>> >> >
>> >>
>> org.apache.stanbol.commons.authentication.basic.BasicAuthentication.authenticate(BasicAuthentication.java:72)
>> >> > at
>> >> >
>> >> >
>> >>
>> org.apache.stanbol.commons.security.auth.AuthenticatingFilter.doFilter(AuthenticatingFilter.java:137)
>> >> > at
>> >> >
>> >> >
>> >>
>> org.apache.felix.http.base.internal.handler.FilterHandler.doHandle(FilterHandler.java:88)
>> >> > at
>> >> >
>> &g

Re: Trunk Console Changes

2012-10-02 Thread Rupert Westenthaler
Hi all,

On Tue, Oct 2, 2012 at 1:18 PM, Reto Bachmann-Gmür
 wrote:
> Hi Rupert
>
> Expect a fix for this issue by the end of the week. Consider that the users
> experiencing the issue are the most capable ones that were able to change
> the console password.

It definitely also affects users that do not change the password.
Yesterday, Sebastian Schaffert experienced this on a fresh new Stanbol
instance (on the first start, first access to the Felix Console) and
it also happened to me. However, it appears only from time to time
and is not something one can easily reproduce ... it looks more like a
race condition.

>
> If you can't wait till the end of the week I could even commit a quick fix
> later today.
>

I think a fix later this week should be OK.

best
Rupert

> Cheers
> Reto
>  On 2 Oct 2012 10:58, "Rupert Westenthaler" 
> wrote:
>
>> Hi Reto,
>>
>> I am really unhappy with this, as access to the Felix Web Console is
>> critical for Stanbol users and the workaround is much to complex for
>> most users. Because of that I would like to propose the removal of the
>> authentication bundle list form the Stanbol Launchers as long as this
>> is not fixed.
>>
>> Generally I would recommend to move the development of this feature to
>> an own branch as I expect to have it a bigger impact to Stanbol and
>> its components. This would also allow more often commits of
>> intermediate states in the development
>>
>> WDYT
>> Rupert
>>
>> On Sun, Sep 23, 2012 at 1:24 PM, Reto Bachmann-Gmür 
>> wrote:
>> > Ok, the problem has to do with two username and password checks which
>> grant
>> > access under mutually exclusive conditions.
>> >
>> > So the work around: set the same password for your stanbol security admin
>> > user as for the felix console admin user. To do this you may follow the
>> > reset password instructions at http://incubator.apache.org/clerezza/faq/
>> ,
>> > but you have to install the additional bundle
>> > org.apche.clerezza:rdf.scala.utils.
>> >
>> > To do this enter the following command on the console:
>> >
>> > zz>start("
>> >
>> http://central.maven.org/maven2/org/apache/clerezza/rdf.scala.utils/0.3-incubating/rdf.scala.utils-0.3-incubating.jar
>> > ")
>> >
>> > Then reconnect the console (:q to terminate) and follow the instructions
>> as
>> > per the clerezza faq.
>> >
>> > Clearly this is just a work around. I'm working on a real solution which
>> > integrates the two authentication mechanisms.
>> >
>> > Cheers,
>> > Reto
>> >
>> >
>> > On Sun, Sep 23, 2012 at 9:55 AM, Rupert Westenthaler <
>> > rupert.westentha...@gmail.com> wrote:
>> >
>> >> Hi Reto,
>> >>
>> >> The problem reported here by Dave now also appears on the
>> >>
>> >> http://dev.iks-project.eu:8081/system/console
>> >>
>> >> Stanbol instance after we have updated it yesterday. Hopefully this
>> >> helps you in tracking this down.
>> >>
>> >> Reto as far as I know you have an admin account on that machine. If
>> >> not please contact Szaby
>> >>
>> >> best
>> >> Rupert
>> >>
>> >>
>> >> On Sat, Sep 1, 2012 at 5:19 AM, Dave Butler  wrote:
>> >> > We have been using the same configuration now for about a year. The
>> >> build,
>> >> > deployment and running. The change seems to have occured in the last
>> >> eight
>> >> > days, as prior builds appear to function. However I was running these
>> >> from
>> >> > another machine. Ran from Safari and Chrome and the same behaviour.
>> >> >
>> >> >
>> >> >
>> >> > On 1 September 2012 03:14, Reto Bachmann-Gmür 
>> wrote:
>> >> >
>> >> >> Quite weird, it looks like your browser is sending an Authorization
>> >> header
>> >> >> with a value that can't be recognized. Does it work with another
>> >> browser or
>> >> >> if you restart your browser. Anyway I'll see how the system could
>> handle
>> >> >> unrecognized Authorization values more gracefully.
>> >> >>
>> >> >> Cheers,
>> >> >> Reto
>> >> >>
>> >&g

Re: Engine to extract XMP and problems with the tika engine

2012-10-07 Thread Rupert Westenthaler
Hi Reto,

Normally it is not a problem if parsed content does not contain any
plain text. There is even a unit test for the TikaEngine that tests
EXIF metadata extraction for JPEG images (see
TikaEngineTest#testExifMetadata).

Because of that I assume that the library used by Tika has some
problem with your image. In fact TIKA-609 mentions a similar exception
and the first comment suggests an illegal char encoding as the cause (which
might make sense, because this could cause a different number of bytes
to be read from the stream).

I would suggest testing your image directly with Tika 1.2 to see if
you can reproduce the error.
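
Something along these lines should do (a minimal sketch against the Tika API;
the image path is a placeholder):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaJpegTest {

        public static void main(String[] args) throws Exception {
            // path to the problematic image (placeholder)
            InputStream in = new FileInputStream("problematic-image.jpg");
            try {
                Metadata metadata = new Metadata();
                // AutoDetectParser is also what the TikaEngine uses (see the stack trace)
                new AutoDetectParser().parse(in, new BodyContentHandler(), metadata, new ParseContext());
                for (String name : metadata.names()) {
                    System.out.println(name + ": " + metadata.get(name));
                }
            } finally {
                in.close();
            }
        }
    }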

best
Rupert

On Sat, Oct 6, 2012 at 2:48 PM, Reto Bachmann-Gmür  wrote:
> Hello
>
> I thought that adding an engine that extract XMP metadata and converts EXIF
> data to XMP would be pretty straight forward (expecially since clerezza
> provides a bundle with such utilities).
>
> However I've noticed that the tika engina already processes jpegs but for
> the jpeg I've been testing it I get:
>
> Caused
> by:org.apache.stanbol.enhancer.servicesapi.EngineException:
> Unable to convert ContentItem
> <urn:content-item-sha1-13b7a6ca2636d1e1e8d36b4bc69d623947a6acb7> with
> mimeType 'image/jpeg' to plain text!
> at
> org.apache.stanbol.enhancer.engines.tika.TikaEngine.computeEnhancements(TikaEngine.java:222)
> at
> org.apache.stanbol.enhancer.jobmanager.event.impl.EnhancementJobHandler.processEvent(EnhancementJobHandler.java:259)
> at
> org.apache.stanbol.enhancer.jobmanager.event.impl.EnhancementJobHandler.handleEvent(EnhancementJobHandler.java:181)
> at
> org.apache.felix.eventadmin.impl.tasks.HandlerTaskImpl.execute(HandlerTaskImpl.java:88)
> at
> org.apache.felix.eventadmin.impl.tasks.SyncDeliverTasks.execute(SyncDeliverTasks.java:221)
> at
> org.apache.felix.eventadmin.impl.tasks.AsyncDeliverTasks$TaskExecuter.run(AsyncDeliverTasks.java:110)
> at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown
> Source)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.tika.exception.TikaException: Can't read JPEG metadata
> at
> org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:104)
> at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at
> org.apache.stanbol.enhancer.engines.tika.TikaEngine.computeEnhancements(TikaEngine.java:220)
> ... 7 more
> Caused by: com.drew.imaging.jpeg.JpegProcessingException: segment size
> would extend beyond file stream length
> at com.drew.imaging.jpeg.JpegSegmentReader.readSegments(Unknown Source)
> at com.drew.imaging.jpeg.JpegSegmentReader.<init>(Unknown Source)
> at
> org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:94)
> ... 13 more
> 
> Caused by:org.apache.tika.exception.TikaException: Can't read
> JPEG metadata
> at
> org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:104)
> at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>
> Now its not surprising that a jpeg cannot be converted to plain text but
> why does tika attempts in the first place andy why can't the JPEG metadata
> be read?
>
> Any ideas?
>
> Cheers,
> Reto



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Getting topic classification to work

2012-10-07 Thread Rupert Westenthaler
Hi René

Based on the error you are getting, I assume that you tried to install
the Topic Engine to the stable launcher. I am able to reproduce this
and I also know the reason for it. But more on that later.

To work around that issue, please use the full launcher instead
(you will need to use "-XX:MaxPermSize=256M").


The reason why it does not work with the stable launcher is that
somehow during the changes of the POM files related to the graduation
of Stanbol (STANBOL-747) all SNAPSHOT dependencies of the stable
launcher were removed. What I guess (based on the SVN history) is that

{stanbol-module}-{version}-incubating-SNAPSHOT.jar

was changed to

{stanbol-module}-{version}-incubating.jar

instead of

{stanbol-module}-{version}-SNAPSHOT.jar

As those changes were committed by Fabian and there are also a lot of
other changes, it would be best if Fabian could have a look at his
changes in revision 1389314.

As soon as the stable launcher again uses the most current
SNAPSHOT dependencies, the Topic Engine should run fine in
the stable launcher as well.

best
Rupert


On Sun, Oct 7, 2012 at 3:20 PM, Rene Nederhand  wrote:
> Hi,
>
> Now that I have Stanbol up and running, I'd like to do some tests to see
> the capabilities of Stanbol.
>
> I am trying to follow the tutorial at the IKS ReviewMeeting
> [1].<http://dl.dropbox.com/u/5743203/IKS/ReviewMeeting2012/Topic-Classification.pdf>
>
> However, doing:
>
> cd ~/stanbol/enhancer/engines/topic
> mvn install -DskipTests -PinstallBundle -Dsling.url=
> http://localhost:8080/system/console
>
> gives me an error:
>
> ERROR: Bundle org.apache.stanbol.enhancer.engine.topic [147]: Error
> starting
> inputstream:org.apache.stanbol.enhancer.engine.topic-0.10.0-SNAPSHOT.jar
> (org.osgi.framework.BundleException: Unresolved constraint in bundle
> org.apache.stanbol.enhancer.engine.topic [147]: Unable to resolve 147.3:
> missing requirement [147.3] package;
> (&(package=org.apache.commons.compress.archivers)(version>=1.4.1)))
> org.osgi.framework.BundleException: Unresolved constraint in bundle
> org.apache.stanbol.enhancer.engine.topic [147]: Unable to resolve 147.3:
> missing requirement [147.3] package;
> (&(package=org.apache.commons.compress.archivers)(version>=1.4.1))
> at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
> at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
> at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
> at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264)
> at java.lang.Thread.run(Thread.java:679)
>
> As it seems commons-compress is not included. So, I thought I'd add this to
> the pom.xml in stanbol/enhancer/engines/topic:
>
>
> <dependency>
>   <groupId>org.apache.commons</groupId>
>   <artifactId>commons-compress</artifactId>
>   <version>1.4.1</version>
> </dependency>
> mvn install -PinstallBundle  -Dsling.url=
> http://localhost:8080/system/console
>
> This does *NOT* solve the problem.
>
> I get a similar problem, when I continue to the next step in the tutorial
> (installing topic-web) with freemarker.cache.
>
> Am I doing something wrong?
>
> Best wishes,
>
> René Nederhand
>
>
> [1]
> http://dl.dropbox.com/u/5743203/IKS/ReviewMeeting2012/Topic-Classification.pdf



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Celi inaccessible: Error building Stanbol from svn trunk

2012-10-07 Thread Rupert Westenthaler
Hi René

I recently implemented a utility that ensures that unit tests of
EnhancementEngines are skipped if remote services are not available
(e.g. because you do not have an Internet connection, or the service is
temporarily down) - STANBOL-759.

However, as you and Jenkins build 1061 have discovered, this new
utility was not yet used by all unit tests of the CELI engines.
Because of that you were experiencing failed instead of skipped
tests.

Hopefully http://svn.apache.org/viewvc?rev=1395408&view=rev does fix this.

best
Rupert

On Sun, Oct 7, 2012 at 3:12 PM, Rene Nederhand  wrote:
> OK. I was able to sove this by:
>
> cd enhancers/engines
>
> edit pom.xml, change:
>
>
> 
>
> 
>
> It compiles fine now, but I have some other problems that I will bring up
> in new topic.
>
> Best wishes,
>
> René Nederhand
>
>
>
> On Sun, Oct 7, 2012 at 2:09 PM, Rene Nederhand  wrote:
>
>> Hi,
>>
>> I am experimenting with Apache Stanbol to see whether it can be fit in a
>> recommendation system I am creating. However, I cannot build the system.
>>
>> The error I am getting is:
>> ===
>> [ERROR] Failed to execute goal
>> org.apache.maven.plugins:maven-surefire-plugin:2.11:test (default-test) on
>> project org.apache.stanbol.enhancer.engines.celi: There are test failures.
>> [ERROR]
>> [ERROR] Please refer to
>> /Users/nederhrj/src/stanbol/enhancer/engines/celi/target/surefire-reports
>> for the individual test results.
>> [ERROR] -> [Help 1]
>> 
>>
>> I did:
>> export MAVEN_OPTS="-Xmx700M -XX:MaxPermSize=128M"
>> svn co http://svn.apache.org/repos/asf/stanbol/trunk stanbol
>> cd stanbol
>> mvn clean install -DskipTests (I am also getting errors on the tests,
>> therefore skipping)
>>
>> Then, when I get the error I do:
>>
>> mvn install -DskipTests -rf :org.apache.stanbol.enhancer.engines.celi -X
>>
>> and get errors like:
>>
>> =
>> 14:04:32,241 WARN  [Utils] no CELI license key configured for this Engine,
>> a guest account will be used (max 100 requests per day). Go on
>> http://linguagrid.org for getting a proper license key.
>> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.157 sec
>> <<< FAILURE!
>> Running
>> org.apache.stanbol.enhancer.engines.celi.langid.impl.CeliLanguageIdentifierEnhancementEngineTest
>> 14:04:32,387 WARN  [Utils] no CELI license key configured for this Engine,
>> a guest account will be used (max 100 requests per day). Go on
>> http://linguagrid.org for getting a proper license key.
>> 14:04:32,421 WARN  [RemoteServiceHelper] deactivate Test because
>> connection to remote service was refused (Message: 'Connection refused')
>> org.apache.stanbol.enhancer.servicesapi.EngineException: Error while
>> calling the CELI language identifier service (configured URL:
>> http://linguagrid.org/LSGrid/ws/language-identifier)!
>> ==
>>
>> So, it seems that Stanbol is relying on external services (celi) causing
>> my build to fail. Indeed, the server "http://www.linguagrid.org/" is
>> unavailable at the moment.
>>
>> Is there anyway to resolve this? Can I skip the "celi" part?
>>
>> Looking forward to any help.
>>
>> Best wishes,
>>
>> René Nederhand
>>
>>
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Help creating a custom vocabulary

2012-10-09 Thread Rupert Westenthaler
Hi Rene,

The problem is that the files of this dataset use N-Quads and not
N-Triples (basically SPOC (Subject, Predicate, Object, Context) instead
of SPO).

I can try to add support for importing N-Quads, but because the
importing tool does not use named graphs you might then lose some
quads (multiple quads with the same SPO values).

best
Rupert

On Tue, Oct 9, 2012 at 2:44 PM, Rene Nederhand  wrote:
> Hi,
>
>
> I am trying to create a custom vocabulary using
> webdatacommons<http://webdatacommons.org/>RDFa data [1]. To do this I
> am following this
> tutorial <http://stanbol.apache.org/docs/trunk/customvocabulary.html> [2].
>
> I've installed the indexer tool without any problems, editing the config
> file and I am now working on the mapping.txt file. However, I am clueless
> on what I should change in this file.
>
> An example of the data is
> here<http://webdatacommons.org/samples/data/ccrdf.html-rdfa.sample.nq>[3]:
>
> head -n 5 ../resources/rdfdata/ccrdf.html-rdfa.sample.nq
> <
> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/>
> <http://creativecommons.org/ns#attributionURL> <http://turcanu.net> <
> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/>
>   .
> <
> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/>
> <http://creativecommons.org/ns#attributionName> "Sergiu Turcanu" <
> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/>
>   .
> <http://www.telemac0.net/marketing-50/> <
> http://purl.org/dc/elements/1.1/type> <http://purl.org/dc/dcmitype/Text> <
> http://www.telemac0.net/marketing-50/>   .
> <http://www.telemac0.net/marketing-50/> <
> http://purl.org/dc/elements/1.1/title> "telemac0" <
> http://www.telemac0.net/marketing-50/>   .
> <http://www.telemac0.net/marketing-50/> <
> http://creativecommons.org/ns#attributionURL> <http://telemac0.net> <
> http://www.telemac0.net/marketing-50/>
>
> Could anyone point me in de the right direction?
>
> Cheers,
>
> René Nederhand
>
>
> [1] http://webdatacommons.org/
> [2] http://stanbol.apache.org/docs/trunk/customvocabulary.html
> [3] http://webdatacommons.org/samples/data/ccrdf.html-rdfa.sample.nq



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Creating EE ...

2012-10-09 Thread Rupert Westenthaler
Hi Andrea

your path

> launchers/stable/target/classes/resources/bundles/20/org.apache.stanbol.enhancer.servicesapi-0.10.0-incubating-SNAPSHOT.jar

looks really outdated. First because the start level for the
o.a.stanbol.enhancer.servicesapi bundle was increased from 20 to 30
with rev1371819 (10.Aug.2012). Your path still refers to "20" and not
the expected "30". Second because since graduation the version
"0.10.0-incubating-SNAPSHOT" does no longer exist. Since rev1389387
(24.Sep.2012) the expected one is "0.10.0-SNAPSHOT".

On Tue, Oct 9, 2012 at 4:37 PM, Rene Nederhand  wrote:
> /launchers/full/target/classes/resources/bundles/30/org.apache.stanbol.enhancer.servicesapi-0.10.0-SNAPSHOT.jar

this is the expected path for the full launcher. For the stable
launcher it is the same but with "/stable/" instead of "/full/".

When did you check out the Stanbol source? What SVN URL did you
use (you can use "svn info" to get the URL)? The correct URL is
"https://svn.apache.org/repos/asf/stanbol/trunk/".

If you have another URL, I recommend a new checkout. In case the URL
is as expected, you can also try "svn update".

It could also help if you delete the Stanbol bundles from your local
Maven repository (~/.m2/repository/org/apache/stanbol/). This ensures
that you do not accidentally have old incubation bundles present in
your local repository.


I hope this helps
best
Rupert

>
> available.
>
> Can't you use that one?
>
> Best,
> René
>
> On Tue, Oct 9, 2012 at 3:52 PM, Andrea Taurchini wrote:
>
>> Dear Melanie,
>> thanks for your reply. I really don't know, however compiling stanbol
>> produced successfully the
>> jar org.apache.stanbol.enhancer.servicesapi-0.9.0-incubating.jar
>> under launchers/stable/target/classes/resources/bundles/20/ but not the
>> 0.10.0.
>> I really can't get any clue !!!
>>
>> best,
>> Andrea
>>
>>
>>
>>
>> 2012/10/9 Melanie Reiplinger 
>>
>> > Hi Andrea,
>> >
>> > not sure if this is of any help, but have you considered that paths
>> > containing "incubating" may not be correct anymore? Stanbol has graduated
>> > recently.
>> >
>> > best,
>> > Melanie
>> >
>> > Am 09.10.2012 15:22, schrieb Andrea Taurchini:
>> >
>> >  Dear Sirs,
>> >> I would like to try to create my own EE following this tutorial
>> >> http://blog.iks-project.eu/creating-enhancement-engines-for-stanbol-0-10-0-incubating-using-netbeans-7-1-2/
>> >> but
>> >> unfortunately, after checking out stanbol as in
>> >> http://incubator.apache.org/stanbol/docs/trunk/tutorial.html I cannot make
>> >> reference to the compiled jar under
>> >> launchers/stable/target/classes/resources/bundles/20/org.apache.stanbol.enhancer.servicesapi-0.10.0-incubating-SNAPSHOT.jar
>> >> since I only found the 0.9.0 version.
>> >> What did I make wrong ?
>> >>
>> >> KR,
>> >> Andrea
>> >>
>> >>
>> >
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Links to configuration at management console do not work.

2012-10-09 Thread Rupert Westenthaler
Hi,

this is a bug I have been aware of for some time, but have not yet had time to
fix. The reason is that the URL is constructed in the wrong way. If you
remove the leading "/enhancer/chain/", the link will work.

best
Rupert

On Tue, Oct 9, 2012 at 1:36 PM, Rene Nederhand  wrote:
> Hi,
>
> When, I go to http://dev.iks-project.eu:8081/enhancer/chain and click
> "configure" I get a 404 error:
>
> Problem accessing
> /enhancer/chain/system/console/configMgr/org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain.9fed368c-2529-481f-8469-8bac9bd37f40.
> Reason:
>
> Not Found
>
> The same happens on my pilot installation.
>
> Did I miss something or is this a bug?
>
> Cheers,
> René



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Help creating a custom vocabulary

2012-10-10 Thread Rupert Westenthaler
Hi Rene,

With STANBOL-764 the indexing tool now supports importing quads.
However, you will still have problems working with the CommonCrawl data.

1. Because a lot of the data uses BNodes and those are ignored by
the Entityhub. As indexing of BNodes has already been requested several
times, I created STANBOL-765 to address this. While this will not
allow the Entityhub to handle BNodes, it will allow users to specify
if/how BNodes are converted to dereferenceable URIs.

2. I got a parse exception with Jena Riot in the test data file
referred to in your original mail [3].

Caused by: org.openjena.riot.RiotException: [line: 3931, col: 124]
expected "_:"
at 
org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)

This was caused by a literal using a country specific language tag

<http://bearhungfactory.mysinablog.com/index.php>
<http://creativecommons.org/ns#attributionName>
"\u6D2A\u96C4\u718A"@zh_tw
<http://bearhungfactory.mysinablog.com/index.php>   .

changing "@zh_tw" to "@zh" fixed the problem. This is a bug in the
used Jena version.

com.hp.hpl.jena:jena:2.6.3
com.hp.hpl.jena:arq:2.8.5
com.hp.hpl.jena:tdb:0.8.7

Maybe upgrading to a newer Jena version could solve this. However this
would first require Clerezza to adopt the newer version (see
STANBOL-621).

best
Rupert

On Tue, Oct 9, 2012 at 10:34 PM, Rene Nederhand  wrote:
> Hi Rupert,
>
> It would be great if we could make it possible to use CommonCrawl data even
> if we would lose some information. As I remember well, this was one of the
> requests that came up in the validation reports quite frequently. Freebase
> is an alternative.
>
> So, if this involves importing N-quads then I would appreciate adding this
> feature. No need for hurry and I am more than happy to help. Thanks!
>
> Best,
> René
>
>
>
>
>
> On Tue, Oct 9, 2012 at 10:02 PM, Rupert Westenthaler <
> rupert.westentha...@gmail.com> wrote:
>
>> Hi Rene,
>>
>> The problem ist that the files of this dataset do use N-Quads and not
>> NTriples (basically SPOC (Subject, Predicate, Object, Context) instead
>> of SPO.
>>
>> I can try to add support for importing N-Quads, but because the
>> importing tool does not use named graphs you might even than lose some
>> quads ( multiple Quads with the same SPO values).
>>
>> best
>> Rupert
>>
>> On Tue, Oct 9, 2012 at 2:44 PM, Rene Nederhand  wrote:
>> > Hi,
>> >
>> >
>> > I am trying to create a custom vocabulary using
>> > webdatacommons<http://webdatacommons.org/>RDFa data [1]. To do this I
>> > am following this
>> > tutorial <http://stanbol.apache.org/docs/trunk/customvocabulary.html>
>> [2].
>> >
>> > I've installed the indexer tool without any problems, editing the config
>> > file and I am now working on the mapping.txt file. However, I am clueless
>> > on what I should change in this file.
>> >
>> > An example of the data is
>> > here<http://webdatacommons.org/samples/data/ccrdf.html-rdfa.sample.nq
>> >[3]:
>> >
>> > head -n 5 ../resources/rdfdata/ccrdf.html-rdfa.sample.nq
>> > <
>> >
>> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/
>> >
>> > <http://creativecommons.org/ns#attributionURL> <http://turcanu.net> <
>> >
>> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/
>> >
>> >   .
>> > <
>> >
>> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/
>> >
>> > <http://creativecommons.org/ns#attributionName> "Sergiu Turcanu" <
>> >
>> http://turcanu.net/blog/2008/07/16/honglaowai-if-there-were-no-communist-party-then-there-would-be-no-new-china/
>> >
>> >   .
>> > <http://www.telemac0.net/marketing-50/> <
>> > http://purl.org/dc/elements/1.1/type> <http://purl.org/dc/dcmitype/Text>
>> <
>> > http://www.telemac0.net/marketing-50/>   .
>> > <http://www.telemac0.net/marketing-50/> <
>> > http://purl.org/dc/elements/1.1/title> "telemac0" <
>> > http://www.telemac0.net/marketing-50/>   .
>> > <http://www.telemac0.net/marketing-50/> <
>> > http://creativecommons.org/ns#attributionURL> <http://telemac0.net> <
>> > http://www.telemac0.net/marketing-50/>
>> >
>> > Could anyone point me in de the right direction?
>> >
>> > Cheers,
>> >
>> > René Nederhand
>> >
>> >
>> > [1] http://webdatacommons.org/
>> > [2] http://stanbol.apache.org/docs/trunk/customvocabulary.html
>> > [3] http://webdatacommons.org/samples/data/ccrdf.html-rdfa.sample.nq
>>
>>
>>
>> --
>> | Rupert Westenthaler rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11 ++43-699-11108907
>> | A-5500 Bischofshofen
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Corrupted Files downloaded from dev.iks-project.eu (Fwd: Jenkins build became unstable: stanbol-trunk-1.6 #1068)

2012-10-10 Thread Rupert Westenthaler
Hi all,

during the Apache Stanbol build process some files (DBpedia default
index, OpenNLP models) are downloaded from dev.iks-project.eu. Since
last week it has happened that those files get corrupted. We do not
know the reason for that, as the Apache2 logs of dev.iks-project.eu do
not point to any problems. This is also the reason for a lot of
unstable Jenkins builds over the last week.

Users that are affected by this should see "java.io.EOFException"s in
their logs. Affected files are located in the
"{stanbol-trunk}/data/{module-path}/download/resources" folders.
Deleted files will be re-downloaded on the next build. Because of that,
deleting the affected files and running "mvn clean install" for the
affected module usually solves issues like that (see the example
commands below).
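
For example, for the DBpedia default index (the same pattern should
apply to the other affected modules):

   cd {stanbol-trunk}/data/sites/dbpedia
   rm -rf download
   mvn clean install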

best
Rupert

-- Forwarded message --
From: Apache Jenkins Server 
Date: Wed, Oct 10, 2012 at 12:15 PM
Subject: Jenkins build became unstable:  stanbol-trunk-1.6 #1068
To: dev@stanbol.apache.org, rupert.westentha...@gmail.com


See <https://builds.apache.org/job/stanbol-trunk-1.6/1068/changes>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: "Error reloading cached bundle"

2012-10-10 Thread Rupert Westenthaler
Reto,

have you looked at which module bundle64 refers to?

On Wed, Oct 10, 2012 at 11:53 AM, Reto Bachmann-Gmür  wrote:
> Occasionally when starting a fresh stanbol launcher I get the following
> error message. Does anybody knows what is causing this? After deleting the
> stanbol dectory and retrying the problem doesn't appear again.
>
> Cheers,
> Reto
>
> ERROR: Error reloading cached bundle, removing it:
> /home/reto/projects/apache/stanbol/launchers/full/target/stanbol/felix/bundle64
> (java.lang.Exception: No valid revisions in bundle archive directory:
> /home/reto/projects/apache/stanbol/launchers/full/target/stanbol/felix/bundle64)
> java.lang.Exception: No valid revisions in bundle archive directory:
> /home/reto/projects/apache/stanbol/launchers/full/target/stanbol/felix/bundle64
> at
> org.apache.felix.framework.cache.BundleArchive.(BundleArchive.java:205)
> at
> org.apache.felix.framework.cache.BundleCache.getArchives(BundleCache.java:223)
> at org.apache.felix.framework.Felix.init(Felix.java:656)
> at org.apache.sling.launchpad.base.impl.Sling.init(Sling.java:363)
> at org.apache.sling.launchpad.base.impl.Sling.(Sling.java:228)
> at
> org.apache.sling.launchpad.base.app.MainDelegate$1.(MainDelegate.java:181)
> at
> org.apache.sling.launchpad.base.app.MainDelegate.start(MainDelegate.java:181)
> at org.apache.sling.launchpad.app.Main.startSling(Main.java:424)
> at org.apache.sling.launchpad.app.Main.doStart(Main.java:349)
> at org.apache.sling.launchpad.app.Main.main(Main.java:123)
> at org.apache.stanbol.launchpad.Main.main(Main.java:61)



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Validate fix for STANBOL-768: Wrong "Install-Path" header when running Entityhub Indexing Tool on Windows

2012-10-10 Thread Rupert Westenthaler
Hi Gniewosław, all

it would be nice if you or anyone else could validate that OSGI
bundles created by the Entityhub Indexing Tool when running on Windows
now correctly install the configurations for the Entityhub
ReferencedSite when installed to a Stanbol instance. See STANBOL-768
[1] [2] for details.

I currently do not have access to any Windows box, so help with that
would be really appreciated.

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-768
[2] http://svn.apache.org/viewvc?rev=1396614&view=rev

-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: build problem

2012-10-10 Thread Rupert Westenthaler
Hi Harish,

On Thu, Oct 11, 2012 at 1:27 AM, harish suvarna  wrote:
> Failure to find
> org.apache.stanbol:org.apache.stanbol.data.sites.dbpedia:jar:1.0.5-SNAPSHOT

it should not be necessary to download this dependency from any Maven
repository, as it is added to your local repository by running "mvn
install" in the "{stanbol-trunk}/data/sites/dbpedia" module. As the
dependency in [1] refers to the version defined in [2], I would not
expect any problem.

You can check for this dependency in the local maven repository at
"~/.m2/repository/org/apache/stanbol/org.apache.stanbol.data.sites.dbpedia/"

best
Rupert

[1] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/ldpath/pom.xml
[2] http://svn.apache.org/repos/asf/stanbol/trunk/data/sites/dbpedia/pom.xml

On Thu, Oct 11, 2012 at 1:27 AM, harish suvarna  wrote:
> I am at svn rev 1396858.
>
> I get the following error while building ldpath.
>
> Error stacktraces are turned on.
> [INFO] Scanning for projects...
> [INFO]
>
> [INFO]
> 
> [INFO] Building Apache Stanbol Entityhub LDPath Support 0.11.0-SNAPSHOT
> [INFO]
> 
> [WARNING] The POM for
> org.apache.stanbol:org.apache.stanbol.data.sites.dbpedia:jar:1.0.5-SNAPSHOT
> is missing, no dependency information available
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 2.805s
> [INFO] Finished at: Wed Oct 10 16:16:45 PDT 2012
> [INFO] Final Memory: 8M/81M
> [INFO]
> 
> [ERROR] Failed to execute goal on project
> org.apache.stanbol.entityhub.ldpath: Could not resolve dependencies for
> project
> org.apache.stanbol:org.apache.stanbol.entityhub.ldpath:bundle:0.11.0-SNAPSHOT:
> Failure to find
> org.apache.stanbol:org.apache.stanbol.data.sites.dbpedia:jar:1.0.5-SNAPSHOT
> in http://repository.apache.org/snapshots was cached in the local
> repository, resolution will not be reattempted until the update interval of
> apache.snapshots has elapsed or updates are forced -> [Help 1]
>
> I checked repository.apache.org for dbpedia jar snapshot. But nothing is
> there.
>
>
> --
> Thanks
> Harish



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Build error: two child modules missing

2012-10-10 Thread Rupert Westenthaler
Hi,

those folders got recently moved in the SVN. You can check at [1] that
they are present on the server. Interestingly, I also had problems
while doing "svn up" for these changes. On my machine the old folders
were not correctly deleted and the new ones were not created - no idea
why.

I had to manually create and add those folders (mkdir {folder}, svn add
{folder}; see the example below). Only after that did I get the changes
from the server by calling "svn up". I would also be interested why
things like that happen from time to time.
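
For the two folders reported below, the workaround looked roughly like
this (a sketch; svn behaviour may differ depending on the client
version):

   cd {stanbol-checkout}/commons
   mkdir -p security/core security/authentication.basic
   svn add security
   svn up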

best
Rupert


[1] http://svn.apache.org/repos/asf/stanbol/trunk/commons/security/

On Tue, Oct 9, 2012 at 9:12 PM, Andreas Kuckartz  wrote:
> I currently get a build error.
>
> Cheers,
> Andreas
> ---
>
> [INFO] Scanning for projects...
> [ERROR] The build could not read 1 project -> [Help 1]
> [ERROR]
> [ERROR]   The project
> org.apache.stanbol:org.apache.stanbol.commons.reactor:0.10.0-SNAPSHOT
> (/home/andreas/workspace/stanbol/commons/pom.xml) has 2 errors
> [ERROR] Child module
> /home/andreas/workspace/stanbol/commons/security/core of
> /home/andreas/workspace/stanbol/commons/pom.xml does not exist
> [ERROR] Child module
> /home/andreas/workspace/stanbol/commons/security/authentication.basic of
> /home/andreas/workspace/stanbol/commons/pom.xml does not exist
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Help creating a custom vocabulary

2012-10-11 Thread Rupert Westenthaler
Hi René,

BTW I finished the work on STANBOL-765 today. See the first comment
for the documentation on how to enable the indexing of BNodes.

best
Rupert

On Thu, Oct 11, 2012 at 10:54 PM, Rene Nederhand  wrote:
> Hi Rupert,
>
> Thank you very much for all the work. I'd expected this would take much
> longer :)
>
> Probably this weekend, I will try to get some of the CommonCrawl data
> imported into Stanbol and see how this works out.
>
> In addition, I will try the Apache any23 tool (thx. A. Soroka).
>
> Best,
> René
>
> On Wed, Oct 10, 2012 at 11:39 AM, Rupert Westenthaler <
> rupert.westentha...@gmail.com> wrote:
>
>> Hi Rene,
>>
>> With STANBOL-764 the indexing tool now supports importing quads.
>> However you will still have problems to work with the CommonCrawl data.
>>
>> 1. Because a lot of the data do use BNodes and those are ignored by
>> the Entityhub. As indexing of Bnodes was already requested several
>> times from I created STANBOL-765 to address this. While this will not
>> allow the Entityhub to handle BNodes it will allow users to specify
>> if/how Bnodes are converted to dereferable URIs.
>>
>> 2. I got a parse exception with Jena Riot in the test data file
>> refered by your original mail [3].
>>
>> Caused by: org.openjena.riot.RiotException: [line: 3931, col: 124]
>> expected "_:"
>> at
>> org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
>>
>> This was caused by a literal using a country specific language tag
>>
>> <http://bearhungfactory.mysinablog.com/index.php>
>> <http://creativecommons.org/ns#attributionName>
>> "\u6D2A\u96C4\u718A"@zh_tw
>> <http://bearhungfactory.mysinablog.com/index.php>   .
>>
>> changing "@zh_tw" to "@zh" fixed the problem. This is a bug in the
>> used Jena version.
>>
>> com.hp.hpl.jena:jena:2.6.3
>> com.hp.hpl.jena:arq:2.8.5
>> com.hp.hpl.jena:tdb:0.8.7
>>
>> Maybe upgrading to a newer Jena version could solve this. However this
>> would previously require Clerezza to adopt the newer version (see
>> STANBOL-621).
>>
>> best
>> Rupert
>>
>> On Tue, Oct 9, 2012 at 10:34 PM, Rene Nederhand 
>> wrote:
>> > Hi Rupert,
>> >
>> > It would be great if we could make it possible to use CommonCrawl data
>> even
>> > if we would lose some information. As I remember well, this was one of
>> the
>> > requests that came up in the validation reports quite frequently.
>> Freebase
>> > is an alternative.
>> >
>> > So, if this involves importing N-quads then I would appreciate adding
>> this
>> > feature. No need for hurry and I am more than happy to help. Thanks!
>> >
>> > Best,
>> > René
>> >
>> >
>> >
>> >
>> >
>> > On Tue, Oct 9, 2012 at 10:02 PM, Rupert Westenthaler <
>> > rupert.westentha...@gmail.com> wrote:
>> >
>> >> Hi Rene,
>> >>
>> >> The problem ist that the files of this dataset do use N-Quads and not
>> >> NTriples (basically SPOC (Subject, Predicate, Object, Context) instead
>> >> of SPO.
>> >>
>> >> I can try to add support for importing N-Quads, but because the
>> >> importing tool does not use named graphs you might even than lose some
>> >> quads ( multiple Quads with the same SPO values).
>> >>
>> >> best
>> >> Rupert
>> >>
>> >> On Tue, Oct 9, 2012 at 2:44 PM, Rene Nederhand 
>> wrote:
>> >> > Hi,
>> >> >
>> >> >
>> >> > I am trying to create a custom vocabulary using
>> >> > webdatacommons<http://webdatacommons.org/>RDFa data [1]. To do this I
>> >> > am following this
>> >> > tutorial <http://stanbol.apache.org/docs/trunk/customvocabulary.html>
>> >> [2].
>> >> >
>> >> > I've installed the indexer tool without any problems, editing the
>> config
>> >> > file and I am now working on the mapping.txt file. However, I am
>> clueless
>> >> > on what I should change in this file.
>> >> >
>> >> > An example of the data is
>> >> > here<http://webdatacommons.org/samples/data/ccrdf.html-rdfa.sample.nq
>> >> >[3]:
>> >> >
>> >> > head -n 5 ../resources/rdfdata/ccrdf.html-rdfa.sample.

Stanbol Semantic Indexing (was Re: Next releases)

2012-10-13 Thread Rupert Westenthaler
...but I think we will add such a component. If that is the case then a
SemanticIndex would only need to specify its IndexingSource(s) and
Stanbol would keep the index in sync with its source.

### Provided Services

The services API of the semanticindexing module does NOT include the
actual Java APIs for SemanticSearch but rather leaves it to the
implementations to register those APIs themselves as OSGI services.
Stanbol already defines/uses a lot of those interfaces, and
implementations that implement those will naturally integrate. To give
some examples: a SemanticIndex storing its data in Solr can register
its SolrCore as OSGI service as described by [2]. SemanticIndexes
using a Clerezza TripleStore can be accessed via the Clerezza
TCManager and can expose a SPARQL endpoint as described in [3].

This design has the advantages that

* the semanticindexing API stays focused on the semantic indexing
process and is therefore easier to implement
* it allows greater flexibility and extensibility (e.g. one could
write a semantic index based on CouchDB and register the RESTful and
Java APIs similarly to how it is done for Solr)
* it allows both the storage and the semanticindex layer to provide
additional services (e.g. if a TripleStore is used to store the data,
it can directly provide the SPARQL endpoint; in case data are stored
in a CMS, the SPARQL endpoint can be provided by a SemanticIndex that
knows how to convert the CMS data to RDF)
* it fits very well with the service-oriented architecture of OSGI

BTW we will also use the same system for the Stanbol specific services
(e.g. the featured search of the Contenthub, the LDPath backend
functionality or the FieldQuery service of the Entityhub).

### Next Steps:

The first Stanbol component that will use this infrastructure will be
the Contenthub. Suat is the person in charge of this. Based on the
version re-integrated with the trunk, I will then continue the
development of the Entityhub on top of the semanticindexing module.

best
Rupert Westenthaler

[1] http://stanbol.apache.org/presentations/Stanbol_Overview_2012-04.pdf
[2] http://stanbol.apache.org/docs/trunk/utils/commons-solr
[3] http://markmail.org/message/zm2tqlvs4flwvjyd


-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Next releases

2012-10-13 Thread Rupert Westenthaler
Hi all

On Fri, Oct 12, 2012 at 5:48 PM, aj...@virginia.edu  wrote:
> Thanks for that detailed answer. Don't worry, I understand that the notion of 
> yard is specific to the EntityHub-- I was just using it as an analogy.
>

The current Entityhub Yard implementations will be used as backends
for SemanticIndex implementations with the new system. In fact this is
very similar to how the Yard interface is already used by the
Entityhub Indexing Tool - as an indexing destination.

>
> I have one other question about this specific effort: in IndexingSource I 
> find the important method:
>
> Item get(String uri) throws StoreException
>
> so it seems that this interface is meant to be used synchronously in direct 
> operation, when get() doesn't block for any long time waiting for a large 
> datum to transit or for slow storage to produce results. In order to use this 
> gear in these cases, would it be necessary to rewrite the upper-level 
> component "Content Create/Update"? Or could one expect to create a kind of 
> queuing component and wire it between "Content Create/Update" and "Content 
> Item Storage", maintaining synchronous behavior in the upper level of 
> architecture?

That is true. The intended usage of the interfaces of the
semanticindexing module is synchronous. If necessary, the semantic
indexing process as a whole can be implemented asynchronously (e.g.
using a queue that is processed by multiple worker threads; a sketch
of this is shown below).
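
To illustrate, a minimal sketch of such a queue based approach. The
Item, IndexingSource and SemanticIndex interfaces below are simplified
placeholders and NOT the actual semanticindexing API:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Minimal sketch: URIs of changed items are queued and worker threads
 * synchronously fetch the Item from the IndexingSource and pass it on
 * to the SemanticIndex.
 */
public class QueuedSemanticIndexer {

    public interface Item { String getUri(); }
    public interface IndexingSource { Item get(String uri) throws Exception; }
    public interface SemanticIndex { void index(Item item) throws Exception; }

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
    private final ExecutorService workers;

    public QueuedSemanticIndexer(final IndexingSource source,
                                 final SemanticIndex index, int numWorkers) {
        workers = Executors.newFixedThreadPool(numWorkers);
        for (int i = 0; i < numWorkers; i++) {
            workers.execute(new Runnable() {
                public void run() {
                    while (!Thread.currentThread().isInterrupted()) {
                        try {
                            String uri = queue.take(); // blocks until a change is queued
                            index.index(source.get(uri)); // synchronous calls inside the worker
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        } catch (Exception e) {
                            // a real implementation would log and possibly retry here
                        }
                    }
                }
            });
        }
    }

    /** to be called whenever the IndexingSource notifies a change */
    public void notifyChanged(String uri) {
        queue.add(uri);
    }

    public void shutdown() {
        workers.shutdownNow();
    }
}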

As mentioned in my other mail, the first release of the
semanticindexing module together with the Contenthub will most likely
not include a general implementation of the indexing process, but I
plan to implement such a component as part of the adaptation of the
Entityhub to the new system. As the Entityhub Indexing Tool already
uses a multi-threaded producer/consumer based indexing pipeline, I
will most likely start from there.

Note the description of the "indexing process" is included in my mail
about the "Stanbol Semantic Indexing" module.

best
Rupert

-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Stanbol Semantic Indexing (was Re: Next releases)

2012-10-15 Thread Rupert Westenthaler
>>
>> SemanticIndexes do have states: UNINT, INDEXING are used during the
>> initial indexing state. ACTIVE means that the index is in normal
>> operation and finally REINDEXING is used after an *epoch* change of
>> the IndexingSource. In this state the SemanticIndex can still be used
>> (with the data before the epoch change) while the re-indexing based on
>> the new data is preformed.
>>
>> In the first version the Stanbol semanticindexing will not include a
>> component that provides an implementation of the above workflow, but I
>> think we will add such a component. If that is the case than a
>> SemanticIndex would only need to specify its IndexingSource(s) and
>> Stanbol would keep the index in sync with its source.
>
> In the first step Contenthub will provide an implementation the above
> workflow you mention with a Store (e.g FileStore[1] and SemanticIndex
> implementation[2] which is synchronized with the Store. Do you mean
> another (more generic) implementation?
>

Yeah, I think this workflow should be a service provided by Stanbol.
Maybe you can even start such a component when you implement the
ClerezzaIndex (needed to keep the SPARQL endpoint feature over the
enhancement metadata).

best
Rupert

-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Stanbol - build - run

2012-10-15 Thread Rupert Westenthaler
Hi Adam.

While starting a bundle (in your case
"org.apache.stanbol.ontologymanager.servicesapi") the OSGI framework
checks if all referenced packages of the module and its dependencies
are available. In your case the package "com.hp.hpl.jena.graph" seems
to be missing for some reason. This has nothing to do with memory.
Typically "-XX:MaxPermSize=256m -Xmx1024m" is sufficient for the
Stanbol Full Launcher.

As this does not appear on continuous integration, it might be related
to some invalid data in your local Maven repository
("~/.m2/repository"). Can you try to delete the caches for the bundles
referenced in the error messages,

~/.m2/repository/com/hp/hpl
~/.m2/repository/org/apache/stanbol

and afterwards make a new build of Stanbol (see the commands below).
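
For example (assuming the default Maven setup):

   rm -rf ~/.m2/repository/com/hp/hpl
   rm -rf ~/.m2/repository/org/apache/stanbol
   cd {stanbol-trunk}
   mvn clean install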

If you want to validate your memory settings, a binary download of the
Stanbol launcher is also available at [1]. It is built every night
from a fresh checkout.

best
Rupert

[1] http://dev.iks-project.eu/downloads/stanbol-launchers/


On Mon, Oct 15, 2012 at 9:04 PM, adasal  wrote:
> Hi,
> I have spent the last several days trying to compile and run a local
> instance of the Stanbol project.
> I can compile. If I include tests I must exclude integration as this fails
> with similar errors (the same errors plus out of heap space) as when I run
> the compiled project skiping that test.
> The errors I get are such like:-
> ERROR: Bundle org.apache.stanbol.ontologymanager.servicesapi [131]: Error
> starting
> inputstream:org.apache.stanbol.ontologymanager.servicesapi-0.10.0-SNAPSHOT.jar
> (org.osgi.framework.BundleException: Unresolved constraint in bundle
> org.apache.stanbol.ontologymanager.servicesapi [131]: Unable to resolve
> 131.0: missing requirement [131.0] package;
> (&(package=org.apache.stanbol.commons.owl.util)(version>=0.10.0)) [caused
> by: Unable to resolve 58.0: missing requirement [58.0] package;
> (package=com.hp.hpl.jena.graph)])
> org.osgi.framework.BundleException: Unresolved constraint in bundle
> org.apache.stanbol.ontologymanager.servicesapi [131]: Unable to resolve
> 131.0: missing requirement [131.0] package;
> (&(package=org.apache.stanbol.commons.owl.util)(version>=0.10.0)) [caused
> by: Unable to resolve 58.0: missing requirement [58.0] package;
> (package=com.hp.hpl.jena.graph)]
> at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
> at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
> at
> org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
> at
> org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264)
> at java.lang.Thread.run(Thread.java:680)
>
> They all indicate the missing requirement package=com.hp.hpl.jena.graph,
> rdf.model and datatypes.
>
> Is this really that I am not able to allocate enough memory to the runtime?
> (My Mac has 4g but it gets eaten up by these processes) or am I missing
> something else?
>
> Any ideas?
>
> Adam



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Request to validate/correct STANBOL-774 related POM file changes

2012-10-18 Thread Rupert Westenthaler
Hi all

Yesterday I committed changes to the POM files of all modules that
produce bundles. The intention and nature of those changes are well
described by the description and comments of STANBOL-774 [1], so I
will not repeat them in this mail. But as I am also not very
experienced with this topic, feedback and suggestions are very
welcome.

The reason for this mail is that I ask the developers of those modules
to check/validate and, where necessary, correct my changes! Please take
the time to compare the maven-bundle-plugin package instructions in the
POM (the Export-Package / Private-Package / Import-Package definitions)

and compare them with the expected

   Import-Package:

entries in the generated MANIFEST.MF file
(/target/classes/META-INF/MANIFEST.MF). STANBOL-774 provides
information on what is expected and [2] provides the details.
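
To give a purely illustrative (hypothetical) example of what to
compare: if a POM declares instructions like

   <Export-Package>org.example.mymodule.api;version=${project.version}</Export-Package>
   <Private-Package>org.example.mymodule.impl</Private-Package>

then the Import-Package section of the generated MANIFEST.MF should
list the packages the bundle actually uses, ideally with version
ranges, e.g.

   Import-Package: org.example.otherapi;version="[0.10,1)",...

The package names and versions above are made up; only the comparison
itself matters.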

Note also that explicitly adding missing packages usually does not
solve the issue (as it typically would only lead to runtime issues).
The BnD tool does a really great job in analyzing dependencies, so
typically:

1. Your expectations are wrong (e.g. an exported package that is not
used in a private package of the same bundle does not need to be
imported).
2. Dependencies of any class in an exported package on a private
package will prevent the BnD tool from adding it to the list of
Import-Package entries. In this case you will need to adapt your
dependencies or packages (you can use STANBOL-773 for those changes).

IMHO doing this is really important before going for a 1.0 release, as
after such a release most of these changes would only be possible with
a 2.* release (or by keeping a lot of @Deprecated stuff that we would
need to maintain).

best
Rupert


[1] https://issues.apache.org/jira/browse/STANBOL-774
[2] http://www.aqute.biz/Bnd/Versioning

-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Corrupted Files downloaded from dev.iks-project.eu (Fwd: Jenkins build became unstable: stanbol-trunk-1.6 #1068)

2012-10-18 Thread Rupert Westenthaler
Hi Suat

in the module of the dbpedia default dataset there should be a
download folder containing the file downloaded from the server.
Deleting that folder will trigger the re-download of that file.
This is also the best way to check if the file is actually corrupted.

You can find the folder at

{stanbol-trunk}/data/sites/dbpedia/download

best
Rupert

On Wed, Oct 17, 2012 at 6:19 PM, Suat Gonul  wrote:
> Hi Rupert,
>
> I have a similar problem but I am not sure it is related with the
> situation here. Here is the exception I get:
>
> 17.10.2012 18:51:59.930 *ERROR* [Thread-47]
> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl
> IOException while activating Index 'default:dbpedia'!
> java.io.IOException: Unable to copy Data for index 'dbpedia' (server
> 'default')
> at
> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl.updateCore(ManagedSolrServerImpl.java:779)
> at
> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl$IndexUpdateDaemon.run(ManagedSolrServerImpl.java:1162)
> Caused by: java.io.IOException: Truncated ZIP file
> at
> org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readDeflated(ZipArchiveInputStream.java:389)
> at
> org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:322)
> at java.io.InputStream.read(InputStream.java:101)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1025)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:999)
> at
> org.apache.stanbol.commons.solr.utils.ConfigUtils.copyArchiveEntry(ConfigUtils.java:539)
> at
> org.apache.stanbol.commons.solr.utils.ConfigUtils.copyCore(ConfigUtils.java:497)
> at
> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl.updateCore(ManagedSolrServerImpl.java:777)
>
>
> I tried deleting the stanbol directory inside the .m2 and even the .m2
> itself, however I still get this exception. Do you have any idea why
> this happens?
>
> Best,
> Suat
>
> On 10/10/2012 3:49 PM, Rupert Westenthaler wrote:
>> Hi all,
>>
>> during the Apache Stanbol build process some files (DBpedia default
>> index, OpenNLP models) are downloaded from dev.iks-project.eu. Since
>> the last week it happens that those files are corrupted. We do not
>> know the reason for that as the Apache2 logs of the dev.iks-project.eu
>> do not point to any problems. This is also the reason for a lot of
>> unstable Jenkins build on the last week.
>>
>> Users that are affected by this should see "java.io.EOFException"s in
>> their logs. Affected files are located in the
>> "{stanbol-trunk}/data/{module-path}/download/resources" folders.
>> Deleted files will be re-downloaded on the next build. Because of that
>> deleting affected files and "mvm clean install" of the affected file
>> usually solves issues like that.
>>
>> best
>> Rupert
>>
>> -- Forwarded message --
>> From: Apache Jenkins Server 
>> Date: Wed, Oct 10, 2012 at 12:15 PM
>> Subject: Jenkins build became unstable:  stanbol-trunk-1.6 #1068
>> To: dev@stanbol.apache.org, rupert.westentha...@gmail.com
>>
>>
>> See <https://builds.apache.org/job/stanbol-trunk-1.6/1068/changes>
>>
>>
>>
>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Corrupted Files downloaded from dev.iks-project.eu (Fwd: Jenkins build became unstable: stanbol-trunk-1.6 #1068)

2012-10-18 Thread Rupert Westenthaler
Hi

On Thu, Oct 18, 2012 at 10:35 AM, Suat Gonul  wrote:
> Thanks Rupert.
>
> I was thinking that "mvn clean" would delete the files. Manually
> removing that folder solved the problem.
>

No "mvn clean" intensionally does NOT delete those files to avoid
re-downloading them again and again. However this can be easily
changed by an according configuration.
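
Just as an illustration (this is NOT the current Stanbol build
configuration): an additional fileset for the maven-clean-plugin in the
affected module would make "mvn clean" remove the download folder as
well:

   <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-clean-plugin</artifactId>
     <configuration>
       <filesets>
         <fileset>
           <directory>download</directory>
         </fileset>
       </filesets>
     </configuration>
   </plugin>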

best
Rupert


> Best,
> Suat
>
> On 10/18/2012 10:34 AM, Rupert Westenthaler wrote:
>> Hi Suat
>>
>> in the module of the dbpedia default dataset there should be a
>> download folder containing the file downloaded form the server.
>> Deleting that folder will trigger the re-download of that file.
>> This is also the best way to check if the file is actually corrupted.
>>
>> You can find the folder at
>>
>> {stanbol-trunk}/data/sites/dbpedia/download
>>
>> best
>> Rupert
>>
>> On Wed, Oct 17, 2012 at 6:19 PM, Suat Gonul  wrote:
>>> Hi Rupert,
>>>
>>> I have a similar problem but I am not sure it is related with the
>>> situation here. Here is the exception I get:
>>>
>>> 17.10.2012 18:51:59.930 *ERROR* [Thread-47]
>>> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl
>>> IOException while activating Index 'default:dbpedia'!
>>> java.io.IOException: Unable to copy Data for index 'dbpedia' (server
>>> 'default')
>>> at
>>> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl.updateCore(ManagedSolrServerImpl.java:779)
>>> at
>>> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl$IndexUpdateDaemon.run(ManagedSolrServerImpl.java:1162)
>>> Caused by: java.io.IOException: Truncated ZIP file
>>> at
>>> org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readDeflated(ZipArchiveInputStream.java:389)
>>> at
>>> org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:322)
>>> at java.io.InputStream.read(InputStream.java:101)
>>> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1025)
>>> at org.apache.commons.io.IOUtils.copy(IOUtils.java:999)
>>> at
>>> org.apache.stanbol.commons.solr.utils.ConfigUtils.copyArchiveEntry(ConfigUtils.java:539)
>>> at
>>> org.apache.stanbol.commons.solr.utils.ConfigUtils.copyCore(ConfigUtils.java:497)
>>> at
>>> org.apache.stanbol.commons.solr.managed.impl.ManagedSolrServerImpl.updateCore(ManagedSolrServerImpl.java:777)
>>>
>>>
>>> I tried deleting the stanbol directory inside the .m2 and even the .m2
>>> itself, however I still get this exception. Do you have any idea why
>>> this happens?
>>>
>>> Best,
>>> Suat
>>>
>>> On 10/10/2012 3:49 PM, Rupert Westenthaler wrote:
>>>> Hi all,
>>>>
>>>> during the Apache Stanbol build process some files (DBpedia default
>>>> index, OpenNLP models) are downloaded from dev.iks-project.eu. Since
>>>> the last week it happens that those files are corrupted. We do not
>>>> know the reason for that as the Apache2 logs of the dev.iks-project.eu
>>>> do not point to any problems. This is also the reason for a lot of
>>>> unstable Jenkins build on the last week.
>>>>
>>>> Users that are affected by this should see "java.io.EOFException"s in
>>>> their logs. Affected files are located in the
>>>> "{stanbol-trunk}/data/{module-path}/download/resources" folders.
>>>> Deleted files will be re-downloaded on the next build. Because of that
>>>> deleting affected files and "mvm clean install" of the affected file
>>>> usually solves issues like that.
>>>>
>>>> best
>>>> Rupert
>>>>
>>>> -- Forwarded message --
>>>> From: Apache Jenkins Server 
>>>> Date: Wed, Oct 10, 2012 at 12:15 PM
>>>> Subject: Jenkins build became unstable:  stanbol-trunk-1.6 #1068
>>>> To: dev@stanbol.apache.org, rupert.westentha...@gmail.com
>>>>
>>>>
>>>> See <https://builds.apache.org/job/stanbol-trunk-1.6/1068/changes>
>>>>
>>>>
>>>>
>>
>>
>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Hackathon at ApacheCon EU 2012?

2012-10-18 Thread Rupert Westenthaler
Hi Sergio,

cool idea! Thanks for sharing this information on the list. I would
enjoy participating in an Stanbol Hackathon.

best
Rupert

On Thu, Oct 18, 2012 at 2:52 PM, Sergio Fernández
 wrote:
> Hi,
>
> in addition to the Linked Data Track [1], what do you think to also organize
> a hackathon? They are collecting ideas at the wiki [2] until Monday 5th
> November. Maybe other projects (Jena, Clerezza and Any23) would be also
> interested.
>
> Kind regards,
>
> [1] http://www.apachecon.eu/tracks/#linked-data
> [2] http://wiki.apache.org/apachecon/HackathonEU12
>
>
> --
> Sergio Fernández
> Salzburg Research
> +43 662 2288 318
> Jakob-Haringer Strasse 5/II
> A-5020 Salzburg (Austria)
> http://www.salzburgresearch.at



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: How to add a new TripleCollection to Stanbol

2012-10-29 Thread Rupert Westenthaler



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Apache Stanbol ( Disambiguation Engine ) proposal and doubts

2012-10-30 Thread Rupert Westenthaler
y to create a version that works well
with the disambiguation-mlt engine. As soon as this is finished I can
also provide this demo on the http://dev.iks-project.eu server.

best
Rupert

> Thanks a lot for your attention. We hope to hear from you.
>
> Regards,
> Juan.




--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Opennlp NER ...

2012-10-30 Thread Rupert Westenthaler
Hi Andrea,

On Tue, Oct 30, 2012 at 4:15 PM, Andrea Taurchini  wrote:
> Dear All,
> I developed my own models for NER based on OPENNLP.
> Within these models I have more entities than person, organization and
> places ... will stanbol enhance text using this added entities ?
>

Currently both the OpenNLP NER engine and the NamedEntityLinkingEngine
can only handle Persons, Organizations and Places. In their current
form you will not be able to use them to link other types.

For both engines this is mainly because of the configuration. So
extending those engines to support other (or better, arbitrarily
configurable) types would require extending the engines' configuration
options. In the following I will try to describe the necessary
extensions.

## OpenNLP NER engine

The NER engine needs the mappings from a {ner-model} to its {language}
and the extracted {entity-type}. Currently this works via a constant
defining the mappings for persons, organizations and places. NLP
models are loaded by using the OpenNLP service (defined by the
o.a.stanbol.commons.opennlp module).

To configure additional models and types I would suggest to add an
additional configuration property that uses the following syntax

{model-file-name};lang={language};type={entity-type}

The OpenNLP TokenNameFinderModel would be loaded from the configured
"{model-file-name}" via the Stanbol DataFileProvider service.
Practically this means that users would need to copy their custom
models to the "{stanbol.home}/datafiles" directory.

The language parameter "lang={language}" would specify the language
supported by this model. The "type={entity-type}" parameter would
specify the dc:type value set for fise:TextAnnotations created for
named entities extracted by the model. An example configuration value
is shown below.
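
A purely hypothetical example value (the model file name and type URI
are made up for illustration):

   my-drug-ner-model.bin;lang=en;type=http://dbpedia.org/ontology/Drug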


## NamedEntityLinkingEngine

For this engine the main problem is that the current way to configure
mappings does not allow configuring arbitrary mappings. Because of
that, one would need to implement a different approach to configuring
the mappings for the dc:type values of linked fise:TextAnnotations.

I would suggest using a configuration similar to the "type mappings"
[1] already used by the KeywordLinkingEngine. The syntax would look
like

 {dc-type} > {vocabulary-type}; {vocabulary-type}; ...
 {dc-type} > *
 {dc-type}

where {dc-type} would be the value of the dc:type property of the
TextAnnotation and {vocabulary-type} is the rdf:type value required
for linked entities in the vocabulary linked against. * represents the
wildcard (any type) and a plain {dc-type} is a shorthand for {dc-type} >
{dc-type}.

The current default mappings would be represented in this syntax by

dbp-ont:Place
dbp-ont:Person
dbp-ont:Organisation

I would suggest keeping support for the current properties so as not
to break backward compatibility.

If this extension is sufficient, I suggest creating the corresponding JIRA issues.

best
Rupert

[1] 
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/keywordlinkingengine.html#type-mappings-syntax

> Thanks and best regards,
> Andrea



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Opennlp NER ...

2012-10-31 Thread Rupert Westenthaler
Hi

On Wed, Oct 31, 2012 at 2:25 PM, Andrea Taurchini  wrote:
> Dear Rupert,
> as always thanks for your support.
> Is it possible to use a single model file to detect multiple dc-type ... or
> should I add more than one configuration property each with the same model
> file but different dc-type ... or else should I produce different model
> file.

If this is possible with OpenNLP, then for sure, but AFAIK the
"opennlp.tools.namefind.NameFinderME#find(..)" method only provides the
token spans and probabilities. So it tells you only that you have found
a Named Entity from tokenA to tokenB, but not the type of the Named
Entity.

While I can imagine that one can train a model that detects different
types of entities, you would not know the specific type of a found
named entity. So found entities may have any of the trained types.

So if you want to distinguish between Named Entities of the different
types, you would need to train separate models.

Please correct me if I am wrong.

> However ... where do I have to set this configuration property (^_^) ?
> Throus OSGI admin ?

Using the configuration tab of the Felix Web Console is only one
option; there are also other possibilities for providing
configurations. For example, you can provide configuration files to
the Sling FileInstaller as described at [1]; this will soon also be
documented under the new "Production" section of the Stanbol webpage
(currently only available on the staging server [2]).



[1] http://markmail.org/message/jpxpl6x4nkmz6kda
[2] http://stanbol.staging.apache.org/production/partial-updates.html

>
> Thanks a lot.
>
> Kindest regards,
> Andrea
>
>
>
>
>
>
>
> 2012/10/31 Rupert Westenthaler 
>
>> Hi Andrea,
>>
>> On Tue, Oct 30, 2012 at 4:15 PM, Andrea Taurchini 
>> wrote:
>> > Dear All,
>> > I developed my own models for NER based on OPENNLP.
>> > Within these models I have more entities than person, organization and
>> > places ... will stanbol enhance text using this added entities ?
>> >
>>
>> Currently both the OpenNLP NER engine as well as the
>> NamedEntityLinkingEngine can only handle Persons, Organizations and
>> Places. In its current form you will not be able to use them to link
>> other types.
>>
>> For both engines this is mainly because of the configuration. So
>> extending those engines to support other (or better arbitrary
>> configureable) types would require to extend the engines configuration
>> options. In the following I will try to describe the necessary
>> extensions.
>>
>> ## OpenNLP NER engine
>>
>> The NER engine needs the mappings for an {ner-model} to its {language}
>> and the extracted {entity-type}. Currently this works by a constant
>> defining the mappings for persons, organizations and places. NLP
>> models are loaded by using the OpenNLP service (defined by the
>> o.a.stanbol.commons.opennlp module).
>>
>> To configure additional models and types I would suggest to add an
>> additional configuration property that uses the following syntax
>>
>> {model-file-name};lang={language};type={entity-type}
>>
>> The OpenNLP TokenNameFinderModel would be loaded from the configured
>> "{model-file-name}" via the Stanbol DataFileProvider service.
>> practically this means that users would need to copy their custom
>> models to the "{stanbol.home}/datafiles" directory.
>>
>> The language parameter "lang={language}" would specify the language
>> supported by this model. The "type={entity-type}" parameter would
>> specify the dc-type value set for fise:TextAnnotations created for
>> named entities extracted by the model.
>>
>>
>> ## NamedEntityLinkingEngine
>>
>> For this engine the main problem with the current implementation is
>> that the current way to configure mappings does not allow to configure
>> arbitrary mappings. Because of that one would need to implement a
>> different approach to configure the mappings for linked
>> fise:TextAnnotations dc:type values.
>>
>> I would suggest to use a configuration similar to the "type mapping"
>> [1] as already used by the KeywordLinkingEngine. The Syntax would be
>> like
>>
>>  {dc-type} > {vocabulary-type}; {vocabulary-type}; ...
>>  {dc-type} > *
>>  {dc-type}
>>
>> where the {dc-type} would be the value of the dc-type property of the
>> TextAnnotation and {vocabulary-type} is the rdf:type value required
>> for linked Entities in the vocabulary linked against. * represents the
>> wild-card (any type) and {dc-type} is a shorthand for {dc-t

Re: How to add a new TripleCollection to Stanbol

2012-10-31 Thread Rupert Westenthaler
Hi

AFAIK the Clerezza SPARQL implementation does not use the
graph-specific SPARQL implementation. Because of that you are limited
to what Clerezza supports and cannot access additional features. This
limitation is also the reason why I am interested in extending the
Stanbol SPARQL endpoint to directly support Jena Datasets and possibly
even others (Sesame, Virtuoso ...) registered with the same metadata
as currently supported for Clerezza TripleCollections.

best
Rupert

On Wed, Oct 31, 2012 at 2:32 PM, Andrea Di Menna  wrote:
> Hi Rupert,
>
> thanks for your precious help.
>
> I am using the default graph hence I had to build a custom component.
> After this was done I could access the TDB with Stanbol :-)
>
> From what I can see though, the Clerezza SPARQL processor Stanbol is using
> does not support aggregate functions like count.
> Can you confirm? Is it possible to switch to ARQ for SPARQL queries?
>
> At the moment I am using Fuseki to handle queries as well (b.t.w. I
> realised it was much much faster to build the TDB using tdbloader2 instead
> of sending triples to Fuseki - dumb me, should have know before starting).
>
> Thanks for your great support!
>
> Cheers
>
> 2012/10/30 Rupert Westenthaler 
>
>> Hi
>>
>> To use an existing Jena TDB store with Apache Stanbol you need:
>>
>> 1. to make the Jena TDB store available in Apache Clerezza
>> 2. configure a Stanbol Entityhub ClerezzaYard for your Graph URI
>>
>> ad1: Do you use named graphs or the TDB triple store? In In the
>> SNAPSHOT version of "rdf.jena.tdb.storage"
>> (org.apache.clerezza:rdf.jena.tdb.storage:0.6-incubating-SNAPSHOT)
>> there is a SingleTdbDatasetTcProvider. It allows you to configure
>> (e.g. via the Configuration tab of the Apache Felix WebConsole) the
>> directory of the local file system where your TDB store is located. If
>> you configure an instance with the location of your existing TDB
>> store, than Clerezza should have access to the data. However this
>> works only for named graphs (SPOC) and the union graph over all SPOC
>> graphs. The SPO graph is not exposed by the
>> SingleTdbDatasetTcProvider.
>>
>> ad2: As soon as you have your TDB store available in Clerezza you can
>> configure ClerezzaYard instance(s) (e.g. via the Configuration tab of
>> the Apache Felix WebConsole). Important is that the value of the
>> "Graph URI" property refers to a Context (C) of your named graphs
>> (SPOC) or to the URI of the union graph (as configured in the
>> configuration of the SingleTdbDatasetTcProvider.
>>
>> The ClerezzaYard will automatically register the Clerezza MGraph with
>> the Stanbol SPARQL endpoint.
>>
>>
>> As an alternative you could also implement an own component that (1)
>> opens the Jena TDB store (2) wraps the Jena graph with an Clerezza
>> MGraph
>>
>> For that you create your own module and implement a a component
>>
>> @Component(
>> configurationFactory=true,
>> policy=ConfigurationPolicy.REQUIRE, //the TDBpath is required!
>> specVersion="1.1",
>> metatype = true)
>>  public class TdbGraphRegistering component
>>
>> @Property
>> public static final String TDB_PATH = "jena.tdb.path";
>>
>> When your bundle starts OSGI will call the activate(..) method and
>> deactivate(..) when it is stopped.
>>
>> protected void activate(ComponentContext ctx) throws
>> ConfigurationException {
>> String tdbPath = (String)ctx.getProperties().get(TDB_PATH)
>> if(tdbPath == null){
>> throw new ConfigurationException(TDB_PATH,"Jena TDB path
>> MUST BE configured")
>> }
>>
>> So what you need to do is to initialize the Jena TDB store from the
>> configured TDB_PATH create
>> an Clerezza MGraph and register it as OSGI service
>>
>>  //Init the jena TDB model
>> com.hp.hpl.jena.rdf.model.Model model;
>>
>> MGraph graph = new LockableMGraphWrapper(
>> new PrivilegedMGraphWrapper(new JenaGraphAdaptor(model)
>>
>> and than registering this MGraph to the OSGI ServiceRegistry (whitboard
>> pattern)
>>
>> Dictionary graphRegProp = new
>> Hashtable();
>> //the URI under that you want to register your graph
>> graphRegProp.put("graph.uri", graphUri);
>> //optionally the name and description of the graph (used in the UI)
>> graphRegProp.put("graph.name", getConfig().getName());
>> graphRegProp.put("graph.de

Re: Opennlp NER ...

2012-10-31 Thread Rupert Westenthaler
On Wed, Oct 31, 2012 at 3:31 PM, Andrea Taurchini  wrote:
> Dear Rupert,
> thanks again.
> Uhmmm ... using tokennamefinder from command line of opennlp if you use a
> multitype trained model than you get a multitype tagged output ... as for
> api .find method I suppose is the way you told me (one type per model ??).
>

Maybe Span#getType() returns the type of the found entity. I will try
this out. If this really provides the different types, then the
configuration would look like


{model-file-name};language={language};{type}={type-uri};{type2}={type-uri2};...

BTW I created already
https://issues.apache.org/jira/browse/STANBOL-792 for this feature.

> Forgive me if I'm silly but I can't see how can I add configuration
> property under configuration tab of Felix WC.
>

The form you see in the configuration tab is generated from an XML
file in the bundle, and this XML file is generated from the @Property
annotations in the implementation of the engine. So as soon as these
new configuration options are implemented, you will see the
corresponding options in the form.


> Thanks and best regards,
> Andrea
>
>
>
>
>
> 2012/10/31 Rupert Westenthaler 
>
>> Hi
>>
>> On Wed, Oct 31, 2012 at 2:25 PM, Andrea Taurchini 
>> wrote:
>> > Dear Rupert,
>> > as always thanks for your support.
>> > Is it possible to use a single model file to detect multiple dc-type ...
>> or
>> > should I add more than one configuration property each with the same
>> model
>> > file but different dc-type ... or else should I produce different model
>> > file.
>>
>> If this is possible with OpenNLP, than for sure, but AFAIK the
>> "opennlp.tools.namefind.NameFinderME#find(..)" method only provide the
>> token spans and probability. So it tells you only that you have found
>> an Named Entity from tokenA to tokenB and not the type of the Named
>> Entity.
>>
>> While I can imagine that one can train a model that detects different
>> types of entities, you will not know the specific type of an found
>> named entity. So found Entities may have any of the trained types.
>>
>> So if you want to distinguish between NamedEntities of the different
>> types you will need to train separate models.
>>
>> Please correct me if I am wrong.
>>
>> > However ... where do I have to set this configuration property (^_^) ?
>> > Throus OSGI admin ?
>>
>> Using the configuration tab of the Felix Web Console is only one
>> option. There are also other possibilities to provide configurations.
>> You can also provide configuration files to the Sling FileInstaller as
>> described at [1] and soon also under the new "Production" section on
>> the Stanbol webpage (currently only available on the staging server
>> [2])
>>
>>
>>
>> [1] http://markmail.org/message/jpxpl6x4nkmz6kda
>> [2] http://stanbol.staging.apache.org/production/partial-updates.html
>>
>> >
>> > Thanks a lot.
>> >
>> > Kindest regards,
>> > Andrea
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > 2012/10/31 Rupert Westenthaler 
>> >
>> >> Hi Andrea,
>> >>
>> >> On Tue, Oct 30, 2012 at 4:15 PM, Andrea Taurchini > >
>> >> wrote:
>> >> > Dear All,
>> >> > I developed my own models for NER based on OPENNLP.
>> >> > Within these models I have more entities than person, organization and
>> >> > places ... will stanbol enhance text using this added entities ?
>> >> >
>> >>
>> >> Currently both the OpenNLP NER engine as well as the
>> >> NamedEntityLinkingEngine can only handle Persons, Organizations and
>> >> Places. In its current form you will not be able to use them to link
>> >> other types.
>> >>
>> >> For both engines this is mainly because of the configuration. So
>> >> extending those engines to support other (or better arbitrary
>> >> configureable) types would require to extend the engines configuration
>> >> options. In the following I will try to describe the necessary
>> >> extensions.
>> >>
>> >> ## OpenNLP NER engine
>> >>
>> >> The NER engine needs the mappings for an {ner-model} to its {language}
>> >> and the extracted {entity-type}. Currently this works by a constant
>> >> defining the mappings for persons, organizations and places. NLP
>> >> models are loaded by using the Open

Re: Opennlp NER ...

2012-10-31 Thread Rupert Westenthaler
Hi

just to let you know that I can confirm that the type of the Named
Entity is indeed provided by the Span#getType() method. So models for
multiple Named Entity types are also supported by the Java API. A
small example is shown below.
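
A small sketch of how this looks with the OpenNLP Java API (the model
file name and the token array are just placeholders):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class SpanTypeExample {

    public static void main(String[] args) throws Exception {
        // placeholder file name for a model trained on multiple entity types
        InputStream in = new FileInputStream("my-multitype-ner-model.bin");
        TokenNameFinderModel model = new TokenNameFinderModel(in);
        in.close();

        NameFinderME finder = new NameFinderME(model);
        String[] tokens = {"Paris", "is", "the", "capital", "of", "France", "."};

        for (Span span : finder.find(tokens)) {
            StringBuilder name = new StringBuilder();
            for (int i = span.getStart(); i < span.getEnd(); i++) {
                name.append(tokens[i]).append(' ');
            }
            // Span#getType() returns the entity type the span was tagged with
            System.out.println(span.getType() + ": " + name.toString().trim());
        }
        finder.clearAdaptiveData();
    }
}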

best
Rupert

On Wed, Oct 31, 2012 at 3:45 PM, Rupert Westenthaler
 wrote:
> On Wed, Oct 31, 2012 at 3:31 PM, Andrea Taurchini  
> wrote:
>> Dear Rupert,
>> thanks again.
>> Uhmmm ... using tokennamefinder from command line of opennlp if you use a
>> multitype trained model than you get a multitype tagged output ... as for
>> api .find method I suppose is the way you told me (one type per model ??).
>>
>
> Maybe the Span#getType() returns the type of the found entity. I will
> try this out. If this really provides the different types, that the
> configuration will be like
>
> 
> {model-file-name};language={language};{type}={type-uri};{type2}={type-uri2};...
>
> BTW I created already
> https://issues.apache.org/jira/browse/STANBOL-792 for this feature.
>
>> Forgive me if I'm silly but I can't see how can I add configuration
>> property under configuration tab of Felix WC.
>>
>
> The form you see in the configuration in generated from a XML file in
> the Bundle and this XML file is generated by the @Property annotations
> in the implementation of the Engine. So as soon as this new
> configuration options are implemented you will see the according
> options in the form.
>
>
>> Thanks and best regards,
>> Andrea
>>
>>
>>
>>
>>
>> 2012/10/31 Rupert Westenthaler 
>>
>>> Hi
>>>
>>> On Wed, Oct 31, 2012 at 2:25 PM, Andrea Taurchini 
>>> wrote:
>>> > Dear Rupert,
>>> > as always thanks for your support.
>>> > Is it possible to use a single model file to detect multiple dc-type ...
>>> or
>>> > should I add more than one configuration property each with the same
>>> model
>>> > file but different dc-type ... or else should I produce different model
>>> > file.
>>>
>>> If this is possible with OpenNLP, than for sure, but AFAIK the
>>> "opennlp.tools.namefind.NameFinderME#find(..)" method only provide the
>>> token spans and probability. So it tells you only that you have found
>>> an Named Entity from tokenA to tokenB and not the type of the Named
>>> Entity.
>>>
>>> While I can imagine that one can train a model that detects different
>>> types of entities, you will not know the specific type of an found
>>> named entity. So found Entities may have any of the trained types.
>>>
>>> So if you want to distinguish between NamedEntities of the different
>>> types you will need to train separate models.
>>>
>>> Please correct me if I am wrong.
>>>
>>> > However ... where do I have to set this configuration property (^_^) ?
>>> > Throus OSGI admin ?
>>>
>>> Using the configuration tab of the Felix Web Console is only one
>>> option. There are also other possibilities to provide configurations.
>>> You can also provide configuration files to the Sling FileInstaller as
>>> described at [1] and soon also under the new "Production" section on
>>> the Stanbol webpage (currently only available on the staging server
>>> [2])
>>>
>>>
>>>
>>> [1] http://markmail.org/message/jpxpl6x4nkmz6kda
>>> [2] http://stanbol.staging.apache.org/production/partial-updates.html
>>>
>>> >
>>> > Thanks a lot.
>>> >
>>> > Kindest regards,
>>> > Andrea
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > 2012/10/31 Rupert Westenthaler 
>>> >
>>> >> Hi Andrea,
>>> >>
>>> >> On Tue, Oct 30, 2012 at 4:15 PM, Andrea Taurchini >> >
>>> >> wrote:
>>> >> > Dear All,
>>> >> > I developed my own models for NER based on OPENNLP.
>>> >> > Within these models I have more entities than person, organization and
>>> >> > places ... will stanbol enhance text using this added entities ?
>>> >> >
>>> >>
>>> >> Currently both the OpenNLP NER engine as well as the
>>> >> NamedEntityLinkingEngine can only handle Persons, Organizations and
>>> >> Places. In its current form you will not be able to use them to link
>>> >> other types.
>>> >>
>>&g

Re: EntityHub Referenced Site and redirects

2012-11-03 Thread Rupert Westenthaler



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Opennlp NER ...

2012-11-03 Thread Rupert Westenthaler
Hi

The implementation of the CustomNERModelEnhancementEngine
(STANBOL-792) is now available. The documentation can be found at [1].

I also updated the eHealth demo ("{stanbol-trunk}/demo/ehealth") to
use the new Engine with 5 custom NER models for DNA, RNA, Proteins,
Cell Type and Cell Line based on the BioNLP2004 dataset [2]. When you
build it (mvn clean install) and install the ehealth demo bundle
(org.apache.stanbol.demo.ehealth-0.10.1-SNAPSHOT.jar) to the Stanbol
Launcher (revision > 1405306), then you can test the engine with the
chain http://localhost:8080/enhancer/chain/ehealth-ner
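
For a quick test from the command line something like this should work
(untested sketch - assumes a local default launcher; the sentence is
just an arbitrary example):

curl -X POST -H "Content-type: text/plain" -H "Accept: text/turtle" \
    --data "Activation of the IL-2 gene requires binding of NF-kappa B to the promoter." \
    http://localhost:8080/enhancer/chain/ehealth-ner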

@Andrea: I was not able to test the engine with NER models that
extract multiple entity types, as I was not able to find/build such a
model for testing. So if you find any issues regarding that please
report it.
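
If you want to check such a model outside of Stanbol first, the OpenNLP
command line tool can be used for that (rough sketch - assumes an
OpenNLP 1.5.x distribution; "my-multi-type-model.bin" and
"test-sentences.txt" - one whitespace-tokenized sentence per line - are
just placeholders):

bin/opennlp TokenNameFinder my-multi-type-model.bin < test-sentences.txt

Every detection is marked with <START:{type}> ... <END> in the output,
so you can directly see whether the different types are reported.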

I don't think I will have time to work on STANBOL-793 in the coming days
as ApacheCon is around the corner.

best
Rupert

[1] 
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/customnermodelengine.html
[2] http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html

On Wed, Oct 31, 2012 at 5:22 PM, Rupert Westenthaler
 wrote:
> Hi
>
> just to let you know that I can confirm that the type of the Named
> Entity is indeed provided by the Span#getType() method. So models for
> multiple Named Entity types are also supported by the Java API.
>
> best
> Rupert
>
> On Wed, Oct 31, 2012 at 3:45 PM, Rupert Westenthaler
>  wrote:
>> On Wed, Oct 31, 2012 at 3:31 PM, Andrea Taurchini  
>> wrote:
>>> Dear Rupert,
>>> thanks again.
>>> Uhmmm ... using tokennamefinder from command line of opennlp if you use a
>>> multitype trained model than you get a multitype tagged output ... as for
>>> api .find method I suppose is the way you told me (one type per model ??).
>>>
>>
>> Maybe the Span#getType() returns the type of the found entity. I will
>> try this out. If this really provides the different types, then the
>> configuration will be like
>>
>> 
>> {model-file-name};language={language};{type}={type-uri};{type2}={type-uri2};...
>>
>> BTW I created already
>> https://issues.apache.org/jira/browse/STANBOL-792 for this feature.
>>
>>> Forgive me if I'm silly but I can't see how can I add configuration
>>> property under configuration tab of Felix WC.
>>>
>>
>> The form you see in the configuration tab is generated from an XML file in
>> the Bundle and this XML file is generated by the @Property annotations
>> in the implementation of the Engine. So as soon as these new
>> configuration options are implemented you will see the corresponding
>> options in the form.
>>
>>
>>> Thanks and best regards,
>>> Andrea
>>>
>>>
>>>
>>>
>>>
>>> 2012/10/31 Rupert Westenthaler 
>>>
>>>> Hi
>>>>
>>>> On Wed, Oct 31, 2012 at 2:25 PM, Andrea Taurchini 
>>>> wrote:
>>>> > Dear Rupert,
>>>> > as always thanks for your support.
>>>> > Is it possible to use a single model file to detect multiple dc-type ...
>>>> or
>>>> > should I add more than one configuration property each with the same
>>>> model
>>>> > file but different dc-type ... or else should I produce different model
>>>> > file.
>>>>
>>>> If this is possible with OpenNLP, then for sure, but AFAIK the
>>>> "opennlp.tools.namefind.NameFinderME#find(..)" method only provides the
>>>> token spans and probabilities. So it tells you only that you have found
>>>> a Named Entity from tokenA to tokenB and not the type of the Named
>>>> Entity.
>>>>
>>>> While I can imagine that one can train a model that detects different
>>>> types of entities, you will not know the specific type of a found
>>>> named entity. So found Entities may have any of the trained types.
>>>>
>>>> So if you want to distinguish between Named Entities of the different
>>>> types you will need to train separate models.
>>>>
>>>> Please correct me if I am wrong.
>>>>
>>>> > However ... where do I have to set this configuration property (^_^) ?
>>>> > Through the OSGI admin ?
>>>>
>>>> Using the configuration tab of the Felix Web Console is only one
>>>> option. There are also other possibilities to provide configurations.
>>>> You can also provide configuration files to the Sling FileInstaller as
>>>> described at [1] and soon also under the new "



Re: Enhancer engine deps problem for releases

2012-11-08 Thread Rupert Westenthaler
Hi Fabian,

do you think that would also mean changing the package
structure/module names of those engines, or do you think it is OK for
any EnhancementEngine that is managed by the Stanbol Community to use
"org.apache.stanbol.enhancer.engine.{engine-name}" as artifactId and
package name?

Regardless of that +1 from my side.

best
Rupert



On Thu, Nov 8, 2012 at 11:37 AM, Olivier Grisel
 wrote:
> Sounds reasonable to me. +1 for refactorings that improve the release
> flow and lower the maintenance burden.
>
> 2012/11/8 Fabian Christ :
>> Hi,
>>
>> I am investigating the current SNAPSHOT deps of the Stanbol components in
>> order to find out what can be released and in which order.
>>
>> In the enhancer we have the problematic situation that we have enhancement
>> engines that rely on other components, like the refactor engine that relies
>> on rules.
>>
>> This is problematic to cut an Enhancer release because we would need to
>> release, e.g. the rules component first.
>>
>> I would like to prevent such situations. IMO it would be a more natural fit
>> if engines, that rely on a certain component, are removed from the Enhancer
>> source tree and moved to the source tree of that particular component or
>> even to a third place.
>>
>> The Engines included in the enhancer/engines directory should only be
>> engines that do not have such dependencies. If this is the case, releasing
>> the enhancer with all independent engines raises no problems anymore.
>>
>> My proposal would be to create a new top level folder in the source tree
>> for engines that rely on the availability of other components. We could
>> call it "enhancer-thirdparty-engines". This could also be a place for
>> contributed engines that we do not want to be in the default
>> enhancer/engines structure. Such engines will be released independently and
>> are not part of an Enhancer release anymore.
>>
>> WDYT?
>>
>> --
>> Fabian
>> http://twitter.com/fctwitt
>
>
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Future of Clerezza and Stanbol

2012-11-09 Thread Rupert Westenthaler
ion?
>> > >
>> > > Presumably the moved modules will be released by the new host - will
>> they
>> > > use group id org.apache.clerezza? or move to the new host project group
>> > id?
>> > > I'd suggest renaming the group to the new project but realise it is a
>> bit
>> > > more disruptive...
>> >
>> > I think that's really up to whatever project adopts that code. In
>> > theory package names should change but that's probably not convenient.
>> >
>> > Or maybe it's time to create a semantic module or two at
>> > http://commons.apache.org/ ? If existing committers are willing to
>> > support that with their work it should be easy to make it happen.
>> >
>> > -Bertrand
>> >
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Manually create a Vocabulary / Ontology

2012-11-09 Thread Rupert Westenthaler
Hi

As soon as we have SPARQL 1.1 support in Clerezza we/you can use
skos.js [1] with the SPARQL endpoint of Apache Stanbol. Another
possibility would be to add support for Entityhub Managed Sites to
VIE. This would allow you to create/update/delete entities in the
Entityhub Site used by the Stanbol enhancer.

best
Rupert

[1] https://github.com/tkurz/skosjs

On Thu, Nov 8, 2012 at 10:59 AM, Gabriel Vince
 wrote:
> Hi,
>
> I have never tried it, but just a long shot worth a try - you can create
> RDF with Protege and then import it into the stanbol semantic store.
> Just first it could be useful to get its basic classes to start with.
>
>
>
> Best regards
>GAbriel
>
> On Thu, Nov 8, 2012 at 10:53 AM, Rüdiger Kurz  wrote:
>> Hi all,
>>
>> Having a simple UI providing the creation of individual entities that can be
>> used by Stanbol would be really helpful also for small and medium "Use
>> Cases" (ca. 100 categories in a hierarchy)
>>
>> regards Rüdiger
>
>
>
> --
> Gabriel Vince
> Senior Consultant
> Apogado
> http://www.apogado.com



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: User story: Don't want to lose the semantic information I already have inside my CMS

2012-11-09 Thread Rupert Westenthaler
Hi Walter, all

I had already a look at the htmlextractor and I think it is a nice
addition to Stanbol!
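
For anyone who wants to try it: posting an HTML page to the enhancer
should be enough to trigger it, e.g. something like (untested sketch -
assumes a local default launcher, a chain that includes the
htmlextractor engine, and "page.html" as a placeholder for the
document):

curl -X POST -H "Content-type: text/html; charset=UTF-8" \
    -H "Accept: text/turtle" --data-binary "@page.html" \
    http://localhost:8080/enhancer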

I would be interested in an Engine that does not only extract embedded
knowledge, but also keeps the link to the actual position within the
parsed Content. In more detail I would like to link the extracted
knowledge with a fise:Enhancement (e.g. a fise:TextAnnotation) that
selects the annotated part of the content.

This would not only allow having the extracted knowledge in the
metadata of the ContentItem, but also allow EnhancementEngines to
process that information in the same way as if it had been
extracted by another engine (e.g. linking an RDFa annotation about a
Person or Place in the same way as a Person or Place detected by an NER
engine).

Jukka Zitting's presentation "Content extraction with Apache Tika" [1]
at the ApacheCon included a nice example of how to extract the text of
a link. I think this is a nice starting point for such a feature.

Generally I think it would be better to add RDFa and Micro Data support
directly to Tika instead of implementing custom solutions within
Stanbol. WDYT?

best
Rupert

[1] http://www.slideshare.net/jukka/content-extraction-with-apache-tika Slide 19

On Thu, Nov 8, 2012 at 12:31 PM, Walter Kasper  wrote:
> Hi Rüdiger,
>
> RDFa extraction from HTML is part of the htmlextractor engine in Stanbol.
> I would welcome it if you could test it with your OpenCms docs.
>
> Best regards,
>
> Walter
>
>
> Rüdiger Kurz wrote:
>>
>> Hi Staboler,
>>
>> during ApacheCon in Sinsheim I had some interesting conversations with
>> Fabian, Rupert and Anil; as a result I want to summarize one of the discussions
>> as a user story telling a typical requirement for us as a CMS provider.
>>
>> Talking about traditional Content Management Systems and assuming that
>> they don't store semantic information is not correct. For example CMS
>> systems already deliver RDFa annotated HTML, and nearly all systems are
>> providing some tagging/categorizing mechanism. Especially OpenCms provides a
>> generic approach to define a structured content and therefore we have the
>> information that a specific field/item of a content has a specified type and
>> a defined label. E.g. A technology event named ApacheCon takes place in
>> Sinsheim from 05. Nov until 08. Nov 2012 is the information that is already
>> stored in OpenCms. Moreover, OpenCms is able to connect that event with all
>> speakers/persons that will make a presentation on that event, ...
>>
>> What we would like to achieve is not only plain text enhancement; moreover
>> we are interested in telling Stanbol all information and associations
>> we already know. In other words we absolutely don't want to lose the
>> semantic information that already exists in OpenCms.
>>
>> A good starting point would be a REST endpoint providing the ability to
>> receive an RDFa annotated HTML document and then extract the RDFa in order
>> to store it inside the semantic-index/entity-hub/... as I previously
>> suggested on the list under the subject "Extend stanbol content hub for RDFa
>> support". Maybe the content hub is not the right component, but the
>> requirement of RDFa extraction is still existent.
>>
>
>
> --
> Dr. Walter Kasper
> DFKI GmbH
> Stuhlsatzenhausweg 3
> D-66123 Saarbrücken
> Tel.:  +49-681-85775-5300
> Fax:   +49-681-85775-5338
> Email: kas...@dfki.de
> -
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
>
> Vorsitzender des Aufsichtsrats:
> Prof. Dr. h.c. Hans A. Aukes
>
> Amtsgericht Kaiserslautern, HRB 2313
> -
>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Apache Stanbol: technical documentation and disambiguation

2012-11-10 Thread Rupert Westenthaler
Hi Jairo

thanks for your feedback regarding the disambiguation engine

On Fri, Nov 9, 2012 at 6:51 PM, Jairo Sarabia
 wrote:
> I'm Jairo Sarabia, a web developer at Notedlinks S.L. from Barcelona
> (Spain).
> We're very interested in Apache Stanbol and we would like to know how
> Stanbol works internally: how the framework is used, the directory
> structure, and how the configuration files work.
> Is there any documentation about this? Could you send it to me?
>

For the Stanbol Enhancer there is a Developer level documentation available.

http://stanbol.apache.org/docs/trunk/components/enhancer/

is the starting point. The Section "Main Interfaces and Utility
Classes" links to
the description of the different components.

> Meanwhile, we want to thank and congratulate you because we tested the
> disambiguation engine and we liked the improved responses in English,
> although I understand that the quality is still mediocre in some respects.
> Especially with Person and Organization topics, most of the time it only
> detects part of the name, especially in compound words, and this makes the
> disambiguation wrong.

This is probably because the disambiguation Engine does not refine the
fise:selected-text of the fise:TextAnnotation based on disambiguation
results. Can you provide some examples of this behavior so that I can
validate this assumption.

> We would like to know about future plans for the disambiguation engine, and
> whether it can be used for other languages.

Stanbol is a community driven project. The engine itself was developed
by Kritarth Anand in a GSoC project [1] and contributed to Stanbol
with STANBOL-723 [2]. I was mentoring this project.

I do not know Kritarth's plans, but personally I plan to continue
work on this engine as soon as I have finished - meaning re-integrated -
the Stanbol NLP module with the trunk. This work will mainly focus on
making the MLT disambiguation engine configurable and testing that it
works well with the new Stanbol NLP processing module (STANBOL-733).


[1] http://www.google-melange.com/gsoc/project/google/gsoc2012/kritarth/12001
[2] https://issues.apache.org/jira/browse/STANBOL-723

>
> Finally, we would like to know if it is possible to create multilingual
> DBpedia indexes so that the responses link to the DBpedia in the language
> of the text. For example, if the text is in Spanish then the
> literals found have relations to resources of the Spanish DBpedia (not
> English DBpedia resources).
> And if it's possible, could you explain to me how to do it?

The disambiguation-mlt engine is not language specific. Principally it
works with any Entityhub Site and any language where a disambiguation
context is available.

AFAIK the currently hard coded configuration uses the full-text field
(that contains texts in all languages) for the Solr MLT query. The
1GByte Solr index you probably use for disambiguation includes short
abstracts only for English. Long abstracts are not included for any
language. This is also the reason why you are not getting
disambiguation results for languages other than English.

A better suited environment would provide short (or even long)
abstracts for the language you want to disambiguate. The configuration
of the Engine would not use the all-language full text field for the
MLT queries, but instead the language specific one. The reason why
such information is not included in the distributed index is simply
to reduce its size. In addition, when this index was created there was
not yet an engine such as the disambiguation-mlt one that would have
consumed that information.

I have already created a DBpedia 3.8 based index that includes a lot
of information useful for disambiguation for several languages.
However this index in its current form is not easily shared as it is
about ~100GByte (45GByte compressed) in size. In addition I have not
yet had time to validate the index (as indexing only completed shortly
before I left for ApacheCon last week). Anyway I will use this index
as a base for further work on the disambiguation-mlt engine. I will also
share the used Entityhub indexing tool configuration and try to come
up with a modified configuration that is about 10GByte in size but
still useful for disambiguation with the MLT based engine.

best
Rupert

>
> That's all! and Thank you very much again!
>
> Best,
>
> Jairo Sarabia



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Future of Clerezza and Stanbol

2012-11-11 Thread Rupert Westenthaler
t offers doesn't alter the state of the
> system. If you interpret "stateless" very strictly then you would have to
> drop most parts of the felix webconsole as http requests to install bundle
> or configure services aren't stateless. For the user-configuration a simple
> file-based TcProvider would of course be enough so no TDB is needed for
> that.
>
> I think we should see where we want to go as a community. For me the
> important thing is that Stanbol remains very modular. I think statements
> like "Stanbol is no semantic CMS" do not bring us further. It's important
> that the stanbol services can be used as services and that many services
> are stateless. But the contenthub is a component to manage content (the
> entityhub to some degree as well), do we want to mandate a horrible user
> interface just to comply with some catchphrase about what Stanbol is not?
> Or do we want to reduce Stanbol to the be just the Enhancer and let the
> other stuff to other projects?
>
> I'd rather go for the vision of an ecosystem of modular semantic and
> restful osgi components, but if the community wants to focus on the
> enhancer I think a clear statement should be made to avoid unnecessary
> arguments about memory consumption.
>
> Cheers,
> Reto
>
>
> On Fri, Nov 9, 2012 at 10:56 AM, Rupert Westenthaler <
> rupert.westentha...@gmail.com> wrote:
>
>> Hi all,
>>
>> let me share my thoughts. Because this mail is rather long I tried to
>> split it up into three separate sections: (1) RDF, (2) RESTful / Web
>> Interface and (3) other related topics
>>
>>
>> RDF libs:
>> 
>>
>> From the viewpoint of Apache Stanbol one needs to ask the question
>> whether it makes sense to manage its own RDF API. I expect the Semantic Web
>> standards to evolve quite a bit in the coming years and I do have
>> concerns about whether the Clerezza RDF modules will be updated/extended to
>> provide implementations of those. One example of such a situation is
>> SPARQL 1.1, which has been around for quite some time and is still not
>> supported by Clerezza. While I do like the small API, the flexibility
>> to use different TripleStores and that Clerezza comes with OSGI
>> support, I think given the current situation we would need to discuss
>> all options, and those also include a switch to Apache Jena or
>> Sesame. Especially Sesame would be an attractive option as their RDF
>> Graph API [1] is very similar to what Clerezza uses. Apache Jena's
>> counterparts (Model [2] and Graph [3]) are considerably different and
>> more complex interfaces. In addition Jena will only change to
>> org.apache packages with the next major release so a switch before
>> that release would mean two incompatible API changes.
>>
>> My personal opinion is that we should keep using Clerezza for now.
>> Invest some effort to improve the Clerezza RDF modules and then see
>> how it further develops. Such an effort should include
>>
>> *  to implement a SPARQL fast lane (as already discussed with Reto
>> during ApacheCon). Fast lane would allow Clerezza to use the native
>> SPARQL engine of the used Triplestore. Meaning that Clerezza only
>> parses those parts of the SPARQL query needed to understand the RDF graph to
>> execute the Query on. This information is then used to pass the query
>> to the native SPARQL engine via an extended Interface of the
>> TcProvider. The Clerezza SPARQL implementation would only be used in
>> case the TcProvider does not provide a native SPARQL implementation or
>> if the Query spans RDF graphs managed by different TcProvider
>> instances. By that Clerezza users would be able to use any SPARQL
>> feature provided by the used TripleStore.
>> * update to the newest Jena versions (see also STANBOL-621; Peter
>> Ansell's Clerezza fork on github [5] as well as Sebastian Schaffert's
>> Jena bundle used for the Stanbol/LMF integration [5])
>> * finish and release the SingleTdbDatasetTcProvider.java
>> (CLEREZZA-691) as this is important for the Stanbol Ontology Manager
>> component
>> * move the Indexed in-memory graph (CLEREZZA-683) from the Stanbol
>> code base to Clerezza and release it so that we can use it from there
>> in Stanbol
>> * provide a Clerezza JsonLD parser/serializer. This is critical for
>> Stanbol as several CMS use this as preferred RDF serialization.
>>
>> [1]
>> http://www.openrdf.org/doc/sesame2/api/org/openrdf/model/package-summary.html
>> [2]
>> http://jena.apache.org/documentation/javadoc/jena/com/hp/hpl/jena/rdf/model/Model.html
>> [3]
>> http://jena.apache

Re: Future of Clerezza and Stanbol

2012-11-11 Thread Rupert Westenthaler
Hi all ,

On Sun, Nov 11, 2012 at 4:47 PM, Reto Bachmann-Gmür  wrote:
> - clerezza.rdf graudates as commons.rdf: a modular java/scala
> implementation of rdf related APIs, usable with and without OSGi

For me this immediately raises the question: Why should the Clerezza
API become commons.rdf if 90+% (just a guess) of the Java RDF stuff is
based on Jena and Sesame? Creating an Apache commons project based on
an RDF API that is only used by a very low percentage of all Java RDF
applications is not feasible. Generally I see not much room for a
commons RDF project as long as there is not a commonly agreed RDF API
for Java.

On Sun, Nov 11, 2012 at 5:40 PM, Fabian Christ
 wrote:
>
> Having the clerezza platform in Stanbol and thinking in the long term about
> merging and using this stuff is a good choice. This can not be done with
> some simple imports and we should carefully evaluate what will be the right
> way to go in Stanbol.

I would still suggest to do this within its own branch as this makes it
easier to commit/review unfinished stuff. In addition we will need a
branch for making a vote (I guess both for Clerezza and Stanbol) on
the proposed changes.

The following list tries to sum-up discussed points (please refine/complete)

* apache.commons.web:
+ Jersey -> Apache Wink
+ replace Viewable with LDViewable
+ Stanbol Web UI should become optional
* add type based Rendering (at a later time)
* apache.commons.security:
+ move security from Clerezza to Stanbol
+ based on Servlet filter
* Scala: no change needed
* TODO: observe the PermGen space issue
* Shell: no change needed
* Development Tools
* add Bundle-Dev-Tools to shell
* add Maven Archetype support to Stanbol
* Clerezza RDF framework:
? Is the community strong enough to manage its own RDF framework
? Where to manage the code
+ SPARQL 1.1 via fast lane (direct access to the native SPARQL
implementations)
+ Update to the newest Jena versions
+ Merge Indexed in-memory TripleCollections to clerezza
+ finish and release the SingleTdbDatasetTcProvider
+ add support for JSON-LD parsing/serializing
? Clerezza Platform: Can someone make a list of what else is present in Clerezza?

best
Rupert

-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Allow usage of OSGI Services without an OSGI environment (was: Future of Clerezza and Stanbol)

2012-11-12 Thread Rupert Westenthaler
so that it is possible to use those services in environments with
different life-cycle and configuration facilities.

best
Rupert Westenthaler

[1] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/site/managed/
the service Implementation:
org.apache.stanbol.entityhub.site.managed.impl.YardSite
the OSGI component
org.apache.stanbol.entityhub.site.managed.ManagedSiteComponent


On Mon, Nov 12, 2012 at 2:15 AM, Peter Ansell  wrote:
> On 12 November 2012 09:59, Reto Bachmann-Gmür  wrote:
>> Hi Peter and all,
>>
>> Good to read about your experiments. Just a first comment:
>>
>> In addition, I did not want to use OSGI, so I had to make changes in
>>> many cases to allow a completely programmatic instantiation of
>>> components, as some fields were left private with no mutator method
>>> and in some cases no public constructor that could be used to populate
>>> the field programmatically. For all of the good that OSGI may provide
>>> for otherwise complex systems, it is not good Java software
>>> engineering to make fields private.
>>>
>>
>> The clerezza.rdf package should all be usable withouth OSGi. OSGi cannot do
>> magic and set private fields, the compiled classes do have bind and unbind
>> methods for the private fields, these methods are added by the maven felix
>> scr-plugin.  For locating dependencies outside OSGi the META-INF/services
>> method is used so that for example one can add a serialization provider
>> simply by adding it to the classpath without requiring any manual binding.
>
> Sorry, I was under the impression that OSGi could actually do Java
> reflection magic to inject dependencies directly into private fields
> based on annotations without having any alternative method of setting
> the field for regular plain old java users. :)
>
> In general I would like if OSGi classes that currently rely on
> bind/unbind, still offered public mutator methods and a public
> initialise/deinitialise method for any work that needs to be done
> after using the mutator methods. The bind/unbind methodology from
> memory when I was working on Clerezza/Stanbol, seemed to require that
> all of the mutators were run immediately and the initialise was
> automatically run, without offering any other possible sequence.
>
> Additionally, offering public mutators and a public initialise method
> gives the added benefit of compile-time typesafety for plain old java
> users, which a bind method taking a Dictionary
> parameter does not provide.
>
> In addition, from memory I think some of the bind methods were
> protected, and not public, which means they are not directly
> accessible, without resorting to using reflection or subclassing just
> to be able to call bind.
>
> I use META-INF/services heavily in my projects, and I rely on it when
> using Sesame and with my extensions to OWLAPI. I extended OWLAPI to
> use Sesame META-INF/services dependencies to find
> serialisation/parsing providers for OWLAPI based on the Sesame
> parser/writer services that are available on the classpath. However, I
> always try to make sure that the use of the automatically populated
> service registries is optional, so that users can populate their own
> registries from scratch using purely programmatic methods, and they do
> not have to resort to modifying global singleton registries as one
> does when using Jena.
>
> The services that I register in META-INF/services are always factories
> based on interfaces, so that dependencies can be passed into type-safe
> java "createServiceInstance" methods when creating instances of the
> service using the factory instance. This means that it does not matter
> if the java.util.ServiceLoader loads classes in a different order, as
> the actual objects are created from the factories explicitly by users,
> with or without a key to specify which instance of the service they
> require/prefer.
>
> Cheers,
>
> Peter



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: DBpedia indexing ...

2012-11-12 Thread Rupert Westenthaler
Hi Andrea,


On Mon, Nov 12, 2012 at 12:59 PM, Andrea Taurchini  wrote:
> folder /indexing/dist the two files :
>
> 1)dbpedia.solrindex.zip
> 2)org.apache.stanbol.data.site.dbpedia-{version}.jar
>
> I prefer to install it as a new referenced site and not overwriting it to
> previous dbpedia english index so I made the following :
>
> 1) saved the zip in the stanbol/datafiles directory
> 2) installed the bundle using the Apache Felix web console
>
> So I have a new referenced site under http://localhost:8080/entityhub.
> The problem is that if I try to search for an entity such as
>
> curl "
> http://localhost:8080/entityhub/site/ITdbpedia/entity?id=http://dbpedia.org/resource/Paris
> "
>

How have you managed to deploy the Site under "ITdbpedia"? Have you
manually changed the configuration after installing the Bundle?

While this might work (if you correctly adapt the configuration for
the ReferencedSite, Cache and SolrYard), those will still override the
configurations of the default DBpedia index simply because the OSGI
config files provided by the bundle (2) have the same name as the
default dbpedia index config files.

> Problem accessing /entityhub/site/ITdbpedia/find. Reason:
> Unable to initialize the Cache with Yard ITdbpediaIndex! This
> is usually caused by Errors while reading the Cache Configuration from
> the Yard.Caused
> by:java.lang.IllegalStateException: Unable to initialize the
> Cache with Yard ITdbpediaIndex! This is usually caused by Errors while
> reading the Cache Configuration from the Yard.

This usually happens if the SolrYard "ITdbpediaIndex" is configured
for a SolrCore that is not available. Are you sure that a SolrCore
with the name configured for the "Solr Index/Core" property of the
ITdbpediaIndex SolrYard is available?
Assuming you have configured {solr-core} you will need to (a) extract
the "dbpedia.solrindex.zip" file (b) rename the root folder from
"dbpedia" to "{solr-core}" (c) re-create the ZIP file (d) rename it to
"{solr-core}.solrindex.tzp".

- - -

The intended way to change the name of a ReferencedSite created by the
Entityhub Indexing Tool is to change the value of the "name" property
within the
"./indexing/config/indexing.properties" file.

In case of the dbpedia Indexing tool you need to change the
"indexingDestination" from


indexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf,boosts:fieldboosts

to


indexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf:dbpedia,boosts:fieldboosts

NOTE the change from "solrConf" to "solrConf:dbpedia". This is
necessary to tell the SolrYardIndexingDestination component that the
SolrCore configuration is called "dbpedia". By default it assumes that
the name is equal to the value of the "name" property.

Before re-indexing you should also delete the "./indexing/destination"
folder as otherwise you will have both the data of the old index
(dbpedia) and the new one {name} in the destination folder.

- - -

If you want to create an "installable bundle" without reindexing the
data you can follow the following steps:

0. if there are still files in the indexing/resources/rdfdata folder
remove them as they are already imported into the Jena TDB store
(indexing/resources/tdb)
1. make the changes as described above
2. delete the indexing/destination folder (make sure to NOT delete the
indexing/dist folder!)
3. replace the indexing/resource/incoming_links.txt file with an empty
one (make sure to not delete the current version)
4. start the indexing (this should now complete in some seconds as no
entities are indexed).

After that you should see in the indexing/dist folder 4 files

a. "dbpedia.solrindex.zip"
b. "{name}.solrindex.zip" (this is empty - delete it)
c. "org.apache.stanbol.data.site.dbpedia-{version}.jar" (the old
bundle - delete it)
d. "org.apache.stanbol.data.site.{name}-{version}.jar (the new bundle)

(d) is the patched Bundle that you can use to install your custom
dbpedia index without overriding the default one. However to use this
bundle you still need to modify the "dbpedia.solrindex.zip" as described
above: (a) extract the "dbpedia.solrindex.zip" file (b) rename the
root folder from "dbpedia" to "{name}" (c) re-create the ZIP file (d)
rename it to "{name}.solrindex.zip".

I admit that those steps are complex, but they might save you the time
needed to re-create your index.
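
In shell terms the file related steps (0, 2 and 3) roughly translate to
something like the following (untested sketch; paths are relative to the
directory the indexing tool is run in):

rm -f indexing/resources/rdfdata/*
rm -rf indexing/destination
mv indexing/resources/incoming_links.txt indexing/resources/incoming_links.txt.bak
touch indexing/resources/incoming_links.txt
# then re-run the indexing tool as usual (step 4)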

best
Rupert


-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Future of Clerezza and Stanbol

2012-11-13 Thread Rupert Westenthaler
Hi all,

I would like to share some thoughts/comments and suggestions from my side:

ResourceFactory: Clerezza is missing a Factory for RDF resources. I
would like to have such a Factory. The Factory should be obtainable
via the Graph - the Collection of Triples. IMO such a Factory is
required if all resource types (IRI, Bnode, Literal) are represented
by interfaces.

BNodes: If Bnode is an interface than any implementation is free to
internally use a "bnode-id". One argument pro such ids (that was not
yet mentioned) is that such id's allow you to avoid in-memory mappings
for bnodes when wrapping an native implementation. In Clerezza you
currently need to have this Bidi maps.

Triple, Quads: While for some use cases the Triple-in-Graph based API
(Quad := Triple t =
TripleStore#getGraph(context).filter(subject,predicate,object)) is
sufficient this is no longer the case as soon as Applications want to
work with an Graph that contains Quads with several contexts. So I
would vote for having support for Quads.

Dataset,Graph: Out of an User perspective Dataset (how the TripleStore
looks at the Triples) and Graph (how RDF looks at the Triples) are not
so different. Because of that I would like to have a single domain
object fitting for both. The API should focus on the Graph aspects (as
Clerezza does) while still allowing efficient implementations that do
not load all triples into memory (e.g. use closeable iterators)

Immutable Graphs: I had really problems to get this right and the
current Clerezza API does not help with that task (resulting in things
like read-only mutable graphs that are no Graphs as they only provide
a read-only view on a Graph that might still be changed by other
means). I think read-only Graphs (like
Collections.unmodifiableCollection(..)) should be sufficient. IMHO the
use case to protect a returned graph from modifications by the caller
of the method is much more prominent as truly immutable graphs.

SPARQL: I would not deal with parsing SPARQL queries but rather
forward them as is to the underlaying implementation. If doing so the
API would only need to border with result sets. This would also avoid
the need to deal with "Datasets". This is not arguing against a
fallback (e.g. the trick Clerezza does by using the Jena SPARQL
implementation) but in practice efficient SPARQL executions can only
happen natively within the TripleStore. Trying to do otherwise will
only trick users into use cases that will not scale.

best
Rupert

On Tue, Nov 13, 2012 at 9:08 AM, Reto Bachmann-Gmür  wrote:
> On Mon, Nov 12, 2012 at 10:40 PM, Andy Seaborne  wrote:
>
>> On 12/11/12 19:42, Reto Bachmann-Gmür wrote:
>>
>>> On Mon, Nov 12, 2012 at 5:46 PM, Andy Seaborne  wrote:
>>>
>>>  On 09/11/12 09:56, Rupert Westenthaler wrote:
>>>>
>>>>  RDF libs:
>>>>> 
>>>>>
>>>>> Out of the viewpoint of Apache Stanbol one needs to ask the Question
>>>>> if it makes sense to manage an own RDF API. I expect the Semantic Web
>>>>> Standards to evolve quite a bit in the coming years and I do have
>>>>> concern that the Clerezza RDF modules will be updated/extended to
>>>>> provide implementations of those. One example of such an situation is
>>>>> SPARQL 1.1 that is around for quite some time and is still not
>>>>> supported by Clerezza. While I do like the small API, the flexibility
>>>>> to use different TripleStores and that Clerezza comes with OSGI
>>>>> support I think given the current situation we would need to discuss
>>>>> all options and those do also include a switch to Apache Jena or
>>>>> Sesame. Especially Sesame would be an attractive option as their RDF
>>>>> Graph API [1] is very similar to what Clerezza uses. Apache Jena's
>>>>> counterparts (Model [2] and Graph [3]) are considerable different and
>>>>> more complex interfaces. In addition Jena will only change to
>>>>> org.apache packages with the next major release so a switch before
>>>>> that release would mean two incompatible API changes.
>>>>>
>>>>>
>>>> Jena isn't changing the packaging as such -- what we've discussed is
>>>> providing a package for the current API and then a new, org.apache API.
>>>>   The new API may be much the same as the existing one or it may be
>>>> different - that depends on contributions made!
>>>>
>>>>
>>> I didn't know about jena planning to introduce such a common API.
>>>
>>>
>>>> I'd like to hear more about your experiences esp. with Graph API as that
>>>> is supposed to be quite

Re: DBpedia indexing ...

2012-11-13 Thread Rupert Westenthaler
Calculate the incoming_links.txt file for the Italian page links
(http://downloads.dbpedia.org/3.8/it/page_links_it.nt.bz2)


2. Download all the RDF files you need

* basically the same you currently use from
http://downloads.dbpedia.org/3.8/en/ but now from
http://downloads.dbpedia.org/3.8/it/
* language specific labels from other languages you are interested in.
 IMPORTANT: use the
 http://downloads.dbpedia.org/3.8/{lang}/{type}_{lang}.nt.bz2
 files and NOT the
 
http://downloads.dbpedia.org/3.8/{lang}/{type}_en_uris_{lang}.nt.bz2
* include http://downloads.dbpedia.org/3.8/en/instance_types_en.nq.bz2
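
For the Italian case the downloads could e.g. look like this (just an
illustration of the naming pattern; pick the files you actually need):

wget -c http://downloads.dbpedia.org/3.8/it/labels_it.nt.bz2
wget -c http://downloads.dbpedia.org/3.8/it/short_abstracts_it.nt.bz2
wget -c http://downloads.dbpedia.org/3.8/it/long_abstracts_it.nt.bz2
wget -c http://downloads.dbpedia.org/3.8/it/redirects_it.nt.bz2
wget -c http://downloads.dbpedia.org/3.8/en/instance_types_en.nq.bz2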


3. You will need to add the LdpathSourceProcessor to the list of
entityProcessor in the indexing.properties file. The configuration
should look like

entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:entityTypes;org.apache.stanbol.entityhub.indexing.core.processor.LdpathSourceProcessor,ldpath:dbpedia.ldpath;org.apache.stanbol.entityhub.indexing.core.processor.FiledMapperProcessor

4. Create an LDPath [2] program that merges all the data you need with
the Italian dbpedia resource.

[2] http://code.google.com/p/ldpath/

The configuration in (3) refers to the ldpath file "dbpedia.ldpath".
This is a text file that is expected to be located within the
"indexing/config" directory. I will not give an LDpath introduction,
but what you need is something like

1: rdfs:label = (rdfs:label | dbp-ont:wikiPageInterLanguageLink/rdfs:label);
2: skos:altLabel = (^dbp-ont:wikiPageRedirects/rdfs:label |
dbp-ont:wikiPageInterLanguageLink/^dbp-ont:wikiPageRedirects/rdfs:label);
3: rdfs:comment = (rdfs:label | dbp-ont:wikiPageInterLanguageLink/rdfs:label);
4: dbp-ont:abstract = (dbp-ont:abstract |
dbp-ont:wikiPageInterLanguageLink/dbp-ont:abstract);
5: rdf:type = (rdf:type | dbp-ont:wikiPageInterLanguageLink/rdf:type);

NOTE: you will need to remove the '{line-number}: ' before using this ldpath

(1) merges the rdfs:labels of the current Entity (the Italian label)
with labels of entities referenced by inter language links. So this
will ensure that you have labels for all languages for the Italian
entity.
(2) merges labels of redirected pages to the skos:altLabel field. For
this to work you will need to include the
"redirects_{language}.nt.bz2" file for the languages you are interested in
(3) same as for rdfs:labels but for short abstracts
(4) the same but for long abstracts
(5) rdf:type statements might be missing for Italian. So I merge those
as well with types from other languages. I would recommend to only
include types for the English dbpedia


5. Add surfaceForms mapping to the mappings.txt file

# add rdfs:labels and rdfs:labels of redirected sites to dbp-ont:surfaceForm
rdfs:label > dbp-ont:surfaceForm
skos:altLabel > dbp-ont:surfaceForm

Those two mappings ensure that both the rdfs:label and skos:altLabel
values are also stored in the dbp-ont:surfaceForm field. This allows
you to allow the Stanbol Enhancer (or more precisely the
NamedEntityLinkingEngine or KeywordLinkingEngine) to match against
labels of redirected pages by changing the name field form the default
rdfs:label to dbp-ont:surfaceForm


Let me conclude that I have never tried this exact use case myself,
but I have already created several dbpedia indexes with very similar
configurations. When using LDPath during indexing you need to expect
higher indexing times and you might also need to assign more memory to
the indexing tool.

Please also note http://markmail.org/message/67ivlyoxfqad6xoe as you
will most likely need to process dbpedia files for some languages using
the

bzcat ${filename}.bz2 \
| sed 's//\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
| gzip -c > ${filename}.gz
rm -f ${filename}.bz2

best
Rupert

>
> Thanks,
> Andrea
>
>
>
>
>
>
>
> 2012/11/12 Rupert Westenthaler 
>
>> Hi Andrea,
>>
>>
>> On Mon, Nov 12, 2012 at 12:59 PM, Andrea Taurchini 
>> wrote:
>> > folder /indexing/dist the two files :
>> >
>> > 1)dbpedia.solrindex.zip
>> > 2)org.apache.stanbol.data.site.dbpedia-{version}.jar
>> >
>> > I prefer to install it as a new referenced site and not overwriting it to
>> > previous dbpedia english index so I made the following :
>> >
>> > 1) saved the zip in the stanbol/datafiles directory
>> > 2) installed the bundle using the Apache Felix web console
>> >
>> > So I have a new referenced site under http://localhost:8080/entityhub.
>> > The problem is that if I try to search for an entity such as
>> >
>> > curl "
>> >
>> http://localhost:8080/entityhub/site/ITdbpedia/entity?id=http://dbpedia.org/resource/Paris
>> > "
>> >
>>

Re: Creating a spanish index for Stanbol (doubts)

2012-11-13 Thread Rupert Westenthaler
n speak it as a first or second
> language)."@en .
>1881 <http://dbpedia.org/resource/Bishkek> <
> http://www.w3.org/2000/01/rdf-schema#comment> "Bishkek, formerly Pishpek
> and Frunze, is the capital and the largest city of Kyrgyzstan. Bishkek is
> also the administrative centre of Chuy Province which surrounds the city,
> even though the city itself is not part of the province but rather a
> province-level unit of Kyrgyzstan. The name is thought to derive from a
> Kyrgyz word for a churn used to make fermented mare's milk, the Kyrgyz
> national drink."@en .
>
> Could someone tell me why errors like "broken pipe" appear, or if I'm doing
> something wrong? I think I followed the guide correctly. Thanks, and I hope that
> this information can help others that try to create indexes for Apache
> Stanbol, which is a really great project. Nice work!
>
> Best,
> Juan.



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: (Back to the) Future of Clerezza and Stanbol

2012-11-14 Thread Rupert Westenthaler
Hi

I am more with Fabian. The fact is that Clerezza does not have much
activity. I am a Clerezza Committer myself and the reason why I am
rather inactive is that I have enough things to do for Stanbol.
This will also not change much in the future. Moving the Clerezza
modules to Stanbol does not solve this problem. It only moves it
from Clerezza over to Stanbol.

 - RDF libs: If Clerezza is no longer actively developed, then Stanbol
should - in the long term - switch to another RDF framework. RDF is
not a core feature of Stanbol so we should rather use existing stuff
than manage our own. So "if" Clerezza can not graduate, then the
scenario mentioned by Fabian seems likely to me as well.

 - Linked Data Platform: Reto, I guess you have missed this
presentation [1] at ApacheCon. IMO a Linked Data Platform is something
that deserves its own project and as soon as there is such a Platform
available we should use it in Stanbol. This would allow us to remove a
lot of code in Stanbol (especially in the Entityhub) - a good thing as
it allows us to focus more on core features of Stanbol.

best
Rupert

[1] http://www.slideshare.net/Wikier/incubating-apache-linda

On Wed, Nov 14, 2012 at 4:56 PM, Reto Bachmann-Gmür  wrote:
> Thanks for bringing the discussion back to the main issue.
>
> Clerezza could graduate as it is. But imho it would make sense to split
> clerezza into:
>
> - RDF libs
> - Linked Data Platform
>
> Imho the Semantic Platform that should strive for compliance with LDPWG
> standards could merge with Apache Stanbol as in fact for many modules it's
> hard to say where they best belong. For this the clerezza stuff should
> not become a branch but a subproject of stanbol that can be released
> individually if needed. This subproject should become thinner and thinner
> as more stuff is being moved to the stanbol platform as technologies are
> being aligned. Discussing if this would be possible should be independent
> of the RDF API stuff.
>
> Cheers,
> Reto
>
> On Wed, Nov 14, 2012 at 4:18 PM, Fabian Christ > wrote:
>
>> Hi Andy,
>>
>> thanks for bringing the discussion back to the point where it started.
>>
>> Here is my view:
>>
>> If Clerezza can not graduate then the sources should be moved into the
>> archive. The Stanbol community can then freely fork from there and take
>> what it is needed. Other communities who also use Clerezza may do the same
>> to keep their projects working (it is not only a matter for Stanbol).
>> Clerezza committers are more than welcome to join Stanbol and help to
>> migrate the parts of Clerezza that are useful for Stanbol.
>>
>> I agree with Rupert that the best way to do it, is to set up branches to
>> explore different development paths.
>>
>> Maybe Clerezza will be able to graduate if they focus on a smaller set of
>> components. But this is a discussion for the Clerezza dev list.
>>
>> Best,
>>  - Fabian
>>
>>
>> 2012/11/14 Andy Seaborne 
>>
>> > The original issue was about whether migrating (part of) Clerezza into
>> > Stanbol made sense.  The concern raised was resourcing.
>> >
>> > Coupling this to new API design is making the resourcing more of a
>> > problem, not less.
>> >
>> > If I understand the discussion 
>> >
>> > Short term::
>> >
>> > Can Clerezza achieve graduation?
>> >
>> > Or not, does splitting out the part of Clerezza that Stanbol depends on
>> > work? (I sense "yes" with little work needed).  Maintaining such
>> > transferred code was raised as a concern - e.g. SPARQL 1.1 access.
>> >
>> > Long term::
>> >
>> > Where does this leave Stanbol?  Does the maintenance cost concern remain?
>> > or even get worse?
>> >
>> > I don't have sufficient knowledge of the codebase to know what the
>> balance
>> > is between fine-grained API work and query-based access (and update).
>> >
>> > How important is switching between (e.g.) storage providers?
>> >
>> > (local storage - remote would be SPARQL so stanbol-client-code and
>> > other-server can be chosen separately - that's why we do standards!)
>> >
>> > Andy
>> >
>> >
>>
>>
>> --
>> Fabian
>> http://twitter.com/fctwitt
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: REST API for dbpedia-spotlight chain

2012-11-14 Thread Rupert Westenthaler
Hi

When I send your request to http://dev.iks-project.eu:8080 I do get
the expected results.
Can you please try the same?

If you do not get those results then it most likely has to do with the
charset used by the terminal. The command you sent does not explicitly
set the charset so Stanbol will interpret it as "UTF-8" when parsing
the request.

I used the following command

curl -i -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" --data \
"üä Albrecht Václav Eusebius z Valdštejna and pe fights, leading to his \
imprisonment in town prison.[8] Already in February 1600,[8] Albrecht left \
Altdorf for his Grand Tour through the HRE, France and Italy,[10] where he \
studied at the universities of Bologna and Padua." \
http://dev.iks-project.eu:8080/enhancer/chain/dbpedia-spotlight

best
Rupert

On Wed, Nov 14, 2012 at 5:03 PM, Andriy Nikolov
 wrote:
> Dear all,
>
> I am working at fluid Operations AG on one of the IKS Early Adopters
> projects and trying to integrate Stanbol with our Information Workbench
> platform.
>
> Currently I am getting to know the Stanbol API, and I have a question
> related to the dbpedia-spotlight enhancement chain.
> I am trying to retrieve annotations via the REST interface, but I face a
> problem as the output I receive is different from the one I obtain via the
> web interface form.
> Do you know what can be the possible cause and how to deal with it?
> (possibly, it happens when sending input text with non-standard characters).
>
> As an example, I am trying to send the following string (it is meaningless,
> just that it contains non-standard chars and mentions of different entity
> types):
>
> üä Albrecht Václav Eusebius z Valdštejna and pe fights, leading to his
> imprisonment in town prison.[8] Already in February 1600,[8] Albrecht left
> Altdorf for his Grand Tour through the HRE, France and Italy,[10] where he
> studied at the universities of Bologna and Padua.
>
> When sending it via the web interface
> http://localhost:8080/enhancer/chain/dbpedia-spotlight,
> I retrieve a list of text and entity annotations, particularly the one
> mentioning the entity dbpedia:Albrecht_von_Wallenstein (the annotations are
> consistent with what I get from the dbpedia-spotlight demo service itself).
>
> However, when trying to send the same text via the API, e.g., with the
> following command:
>
> curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" --data
> "üä Albrecht Václav Eusebius z Valdštejna and pe fights, leading to his
> imprisonment in town prison.[8] Already in February 1600,[8] Albrecht left
> Altdorf for his Grand Tour through the HRE, France and Italy,[10] where he
> studied at the universities of Bologna and Padua."
> http://localhost:8080/enhancer/chain/dbpedia-spotlight
>
> I get a different set of annotations: particularly, there is no mention of
> dbpedia:Albrecht_von_Wallenstein, but there is a reference to
> dbpedia:Clavichord (extracted from the part "clav" of the name "Václav").
>
> Do you know what can be the reason for this problem? Are there any
> additional request parameters which has to be set?
>
> Thank you!
>
> Best regards,
> Andriy Nikolov



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: EntityHub Referenced Site and redirects

2012-11-15 Thread Rupert Westenthaler
Hi Andrea,

A followup:

(1) Sharing your indexes:

This would be great! I talked with a colleague of mine. Most likely we
will add an FTP upload folder to the dev.iks-project.eu server. For
that we will need to add more HDD space to this virtual host, which
might take some more time to accomplish. I will notify you as soon as
we are ready.

(2) dbp-ont:surfaceForm

I recommended to you to copy labels of redirected pages to the
"dbp-ont:surfaceForm" field. In the meantime I made some tests with an
index built like that. The results were really bad; because of that I
must revoke this recommendation!

The reason for that is that the scoring algorithm of Solr is affected
by the multi-valued "dbp-ont:surfaceForm" field. e.g. for
dbpedia:Paris you have ~35 "dbp-ont:surfaceForm" values where only
about ~15 contain "Paris". So if you now make a query for Paris in
this field

(((@en/dbp\-ont\:surfaceForm/:"paris")))

you will notice that dbpedia:Paris is not within the top 10 search
results. Instead Entities like "Paris Barclay" are listed because they
do have only a single value for "dbp-ont:surfaceForm" and therefore
the match for "Paris" is much more relevant.

This means that the current index-layout where URIs of redirected
pages are represented as their own Entities within the index is much better
suited for entity extraction.

On Mon, Nov 5, 2012 at 10:59 AM, Andrea Di Menna  wrote:
> Hi Rupert,
> I would be more than happy to share the indexes.
> I have also created one including redirects by forcibly inserting
> redirecting entities into the incoming_links.txt file.

Do you have a script for creating such an incoming_links.txt file?
Because this would be very useful for properly creating indexes that
include Entities of redirected pages.

best
Rupert

> Redirects have been assigned the same entity rank as the entities they
> redirect to.
>
> Please let me know how and where to store those indexes.
>
> Cheers
>
> 2012/11/3 Rupert Westenthaler 
>
>> Hi,
>>
>> I have started to play around with indexing dbpedia 3.8 myself as well
>> and I con confirm that one has to preprocess nearly all files. Because
>> of that I have written a nice shell script that downloads, processes
>> and re-compresses the RDF files
>>
>> # array syntax is ({item-1} {items-2} ... {item-n})
>> # names need to include the language path segment!
>> files=(dbpedia_3.8.owl \
>> en/labels_en.nt \
>> {all-the-other-files-you-need} \
>> )
>>
>> for i in "${files[@]}"
>> do
>> :
>> # clean possible encoding errors
>> filename=$(basename $i)
>> if [ ! -f ${filename}.gz ]
>> then
>> url=${DBPEDIA}/${i}.bz2
>> wget -c ${url}
>> echo "cleaning $filename ..."
>> #corrects encoding and recompress using gz
>> #gz is used because it is faster
>> bzcat ${filename}.bz2 \
>> | sed 's//\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
>> | gzip -c > ${filename}.gz
>> rm -f ${filename}.bz2
>> fi
>> done
>>
>> > the SolrIndex zip file is about 3.5GB.
>> > I am using a min-score=2 in minincoming.properties
>> > I think the 3.7 index file from the IKS project downloads site was
>> created
>> > with min-score=10.
>>
>> The dbpedia 3.7 index was built by ogrisel, but I think you are right.
>> 3.5GByte for all entities with >=2 incoming links (should be about
>> 4 million entities) sounds reasonable. If you want to share your index
>> with the Stanbol community I am sure we can find a server to host it.
>>
>>
>> Note about languages:
>>
>> while it is easy to include labels, comments and abstracts of additional
>> languages it is not so easy to add proper Solr field definitions for
>> languages. While there is a great wiki page that provides all the
>> necessary links [1] I find it still very hard to add configurations
>> for languages I do not understand. So if someone can help with that I
>> am happy to improve the Solr schemas used by the Entityhub (and the
>> Entityhub Indexing tool)!
>>
>>
>> Upgrading the default DBpedia index:
>>
>> After the ApacheCon I will work on replacing the default dbpedia index
>> used with the Stanbol launchers with a dbpedia 3.8 based version (the
>> current one is still based on 3.6). This will need some time because I
>> expect that I will need to adapt a lot of unit/integration tests
>> affected by data changes.
>>
>> [1] http://wiki.apache.org/solr/LanguageAnalysis
>&g

Re: EntityHub Referenced Site and redirects

2012-11-15 Thread Rupert Westenthaler
Hi (again)

> (2) dbp-ont:surfaceForm
>
> I recommended to you to copy labels of redirected pages to the
> "dbp-ont:surfaceForm" field. In the meantime I made some tests with an
> index build like that. The results where really bad because of that I
> must revoke this recommendation!
>
> The reason for that is that the scoring algorithm of Solr is affected
> by the multi-valued "dbp-ont:surfaceForm" field. e.g. for
> dbpedia:Paris you have ~35 "dbp-ont:surfaceForm" values where only
> about ~15 contain "Paris". So if you now make a query for Paris in
> this field
>
> (((@en/dbp\-ont\:surfaceForm/:"paris")))
>
> you will notice that dbpedia:Paris is not within the top 10 search
> results. Instead Entities like "Paris Barclay" are listed because they
> do have only a single value for "dbp-ont:surfaceForm" and therefore
> the match for "Paris" is much more relevant.

Just talked about this problem with Sebastian Schaffert. He suggested
to try setting

omitNorms="true"

for all fields used for labels within the Entityhub. This should have
the effect that Entities with a lot of "dbp-ont:surfaceForm" values
are no longer penalized by the Solr ranking algorithm. Testing that
will require some time.
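
A simple way to compare the ranking before/after such a schema change is
the "/find" endpoint of the referenced site, e.g. (sketch; this assumes
the default mapping of the "dbp-ont" prefix to http://dbpedia.org/ontology/):

curl -X POST -d "name=Paris" \
    -d "field=http://dbpedia.org/ontology/surfaceForm" \
    -d "limit=10" \
    http://localhost:8080/entityhub/site/dbpedia/find

If dbpedia:Paris makes it back into the top results the penalty should be
gone.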

best
Rupert


>
> This means that the current index-layout where URIs of redirected
> pages are represented as own Entities within the index is much better
> suited for entity extraction.


-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: REST API for dbpedia-spotlight chain

2012-11-15 Thread Rupert Westenthaler
Hi Andriy

So if our results differ from each other, then it is likely that the
reason for your issue is the charset used by the command.

Can you please try to copy the text into a {file} and then use

curl -v -X POST -H "Accept: text/turtle" -H \
"Content-type: text/plain" --data "@{file}" \
http://dev.iks-project.eu:8080/enhancer/chain/dbpedia-spotlight

if the file does not use UTF-8 you will need to pass the charset in
the Content-type header

Content-type: text/plain;charset={charset}
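
If you are not sure about the encoding of the {file} you can e.g. check
and convert it first (sketch):

file -i {file}
iconv -f ISO-8859-1 -t UTF-8 {file} > {file}.utf8

and then send the converted file with the command above.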

best
Rupert

On Thu, Nov 15, 2012 at 2:08 PM, Andriy Nikolov
 wrote:
> Hi Rupert,
>
> Thanks a lot for your reply.
> Actually, when I try it with http://dev.iks-project.eu:8080, I still get the
> same effect: the output I get when submitting through the web interface and
> via the API (I used the command from your mail) are different:
> in one case (using the web form), there is a mention of
> <http://dbpedia.org/resource/Albrecht_von_Wallenstein> (correct), while via
> the API there isn't, but there is a mention of
> http://dbpedia.org/resource/Clavichord (wrong).
>
> Best regards,
> Andriy
>
>
> On Wed, Nov 14, 2012 at 6:50 PM, Rupert Westenthaler
>  wrote:
>>
>> Hi
>>
>> When I send your request to http://dev.iks-project.eu:8080 I do get
>> the expected results.
>> Can you please try the same.
>>
>> If you do not get those results than it has most likely todo with the
>> charset used by the terminal. The command you sent does not explicitly
>> set the charset so Stanbol will interpret it as "UTF-8" when parsing
>> the request.
>>
>> I used the following command
>>
>> curl -i -X POST -H "Accept: text/turtle" -H "Content-type: text/plain"
>> --data \
>> "üä Albrecht Václav Eusebius z Valdštejna and pe fights, leading to his \
>> imprisonment in town prison.[8] Already in February 1600,[8] Albrecht left
>> \
>> Altdorf for his Grand Tour through the HRE, France and Italy,[10] where he
>> \
>> studied at the universities of Bologna and Padua." \
>> http://dev.iks-project.eu:8080/enhancer/chain/dbpedia-spotlight
>>
>> best
>> Rupert
>>
>> On Wed, Nov 14, 2012 at 5:03 PM, Andriy Nikolov
>>  wrote:
>> > Dear all,
>> >
>> > I am working at fluid Operations AG on one of the IKS Early Adopters
>> > projects and trying to integrate Stanbol with our Information Workbench
>> > platform.
>> >
>> > Currently I am getting to know the Stanbol API, and I have a question
>> > related to the dbpedia-spotlight enhancement chain.
>> > I am trying to retrieve annotations via the REST interface, but I face a
>> > problem as the output I receive is different from the one I obtain via
>> > the
>> > web interface form.
>> > Do you know what can be the possible cause and how to deal with it?
>> > (possibly, it happens when sending input text with non-standard
>> > characters).
>> >
>> > As an example, I am trying to send the following string (it is
>> > meaningless,
>> > just that it contains non-standard chars and mentions of different
>> > entity
>> > types):
>> >
>> > üä Albrecht Václav Eusebius z Valdštejna and pe fights, leading to his
>> > imprisonment in town prison.[8] Already in February 1600,[8] Albrecht
>> > left
>> > Altdorf for his Grand Tour through the HRE, France and Italy,[10] where
>> > he
>> > studied at the universities of Bologna and Padua.
>> >
>> > When sending it via the web interface
>> > http://localhost:8080/enhancer/chain/dbpedia-spotlight,
>> > I retrieve a list of text and entity annotations, particularly the one
>> > mentioning the entity dbpedia:Albrecht_von_Wallenstein (the annotations
>> > are
>> > consistent with what I get from the dbpedia-spotlight demo service
>> > itself).
>> >
>> > However, when trying to send the same text via the API, e.g., with the
>> > following command:
>> >
>> > curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain"
>> > --data
>> > "üä Albrecht Václav Eusebius z Valdštejna and pe fights, leading to his
>> > imprisonment in town prison.[8] Already in February 1600,[8] Albrecht
>> > left
>> > Altdorf for his Grand Tour through the HRE, France and Italy,[10] where
>> > he
>> > studied at the universities of Bologna and Padua."
>> > http://localhost:8080/enhancer/chain/dbpedia-spotlight

Re: Stopping the framework ...

2012-11-15 Thread Rupert Westenthaler
Hi

On Thu, Nov 15, 2012 at 2:36 PM, Andrea Taurchini  wrote:
> Dear All,
> maybe I'm missing (again) something, but if I stop the framework, no matter
> if through Felix Web Console or CTRL+C, configurations go to hell on the
> next restart.

No, you are not missing anything. All those ways to shut down Stanbol should
work just fine. I cannot remember ever having a problem like that.

> Even the default enhancement chain will stop working since the order or the
> engine is changed to :
>
>- *metaxa* ( optional , currently not available)
>- *entityhubExtraction* ( required , currently not available)
>- *tika* ( optional , TikaEngine)
>- *langdetect* ( required , LanguageDetectionEnhancementEngine)
>- *ner* ( required , NamedEntityExtractionEnhancementEngine)
>- *dbpediaLinking* ( required , NamedEntityTaggingEngine)
>

that "not available" engines are listed first is expected for the
WeightedChain. This chain determines the order based on information
provided by the Engine. So if an Engine is not available such
Information are not available. As the order does not matter for
Engines that are not available my decision was to list them first.

> not to mention the fact that my own configurations (topic classifier ...)
> is completely removed ...
>

Somehow it looks like OSGI is not able to write files to the disk.
Can you please check the Stanbol log file
{launcher-dir}/stanbol/logs/error.log to see if you can find related
information.
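
Something like (sketch)

grep -iE "exception|error" {launcher-dir}/stanbol/logs/error.log | tail -n 50

might already show whether the framework fails to write its configuration
on shutdown.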

best
Rupert

> Thanks,
> Andrea



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: [STANBOL-798] Vocabularies/ontologies not available according the best practices

2012-11-16 Thread Rupert Westenthaler
Hi all,

just to let you know: I have started the process to fix this for

http://fise.iks-project.eu/ontology/

the URI used by the Stanbol Enhancement Structure Ontology.

For the apache.stanbol.org namespaces we will need to copy the files
into the correct directories and then configure some things in the
.htaccess files. Here the question is whether the use of .htaccess is
possible (I have not yet had time to look this up ... so please no RTFM
responses ^^)

best
Rupert

On Fri, Nov 16, 2012 at 1:24 PM, Fabian Christ
 wrote:
> 2012/11/16 Sergio Fernández 
>
>> that should be quite easy to solve
>
>
> Do you have patch for that easy one? ;)
>
> --
> Fabian
> http://twitter.com/fctwitt



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Stanbol indexing tool

2012-11-16 Thread Rupert Westenthaler
The TDB database is located under

{indexing-working-dir}/indexing/resources/tdb

If you do have a TDB store with the required data, then you can
provide it under that directory. Just make sure that the

{indexing-working-dir}/indexing/resources/rdfdata

folder is empty when you start the tool. Otherwise the RDF files in
that folder would get imported.
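
Roughly something like the following (sketch - the paths depend on your
setup):

mkdir -p {indexing-working-dir}/indexing/resources/tdb
cp -r /path/to/your/tdb-store/* {indexing-working-dir}/indexing/resources/tdb/
# make sure nothing is left that would get (re)imported
rm -f {indexing-working-dir}/indexing/resources/rdfdata/*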

On Fri, Nov 16, 2012 at 2:18 PM, Andrea Di Menna  wrote:
> The first part of the process seems slower on my machine w.r.t. to
> loading triples in a TDB using directly tdbloader2 (Note: I am using
> the latest available version of Jena when running tdbloader2 standalone
> - namely 2.7.4).

Yes the indexing tool uses

com.hp.hpl.jena:jena:2.6.3
com.hp.hpl.jena:arq:2.8.5
com.hp.hpl.jena:tdb:0.8.7

but you could still try to use your datastore. Maybe they have not
changed the binary format of the files.

If not let me know and I will try to update the Jena Version used by
the Indexing Tool

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Beginning Apache Stanbol

2012-11-17 Thread Rupert Westenthaler
The stable launcher only contains the
Enhancer and Entityhub. This would explain why you are only seeing
these two exceptions. Also, this exception is expected during startup,
as the Refactor Engine is included in the stable launcher but is
missing the dependencies to the OntologyManager and Rules components.


However

> I stopped the stanbol instance and tried the '"full" build.
> java -Xmx1g -XX:MaxPermSize=256m -jar 
> full/target/org.apache.stanbol.launchers.full-0.10.0-SNAPSHOT.jar
> instead, but got
> "ERROR: Bundle org.apache.stanbol.enhancer.engines.refactor [110]: Error 
> starting 
> slinginstall:org.apache.stanbol.enhancer.engines.refactor-0.10.0-SNAPSHOT.jar 
> (org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.stanbol.enhancer.engines.refactor [110]: Unable to resolve 110.0: 
> missing requirement [110.0] package; 
> (&(package=org.apache.stanbol.ontologymanager.servicesapi.collector)(version>=0.10.0)(!(version>=1.0.0
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.stanbol.enhancer.engines.refactor [110]: Unable to resolve 110.0: 
> missing requirement [110.0] package; 
> (&(package=org.apache.stanbol.ontologymanager.servicesapi.collector)(version>=0.10.0)(!(version>=1.0.0)))"
>

this is not expected and indicates some issue with the ontology
manager. But as with the integration tests I was also unable to
reproduce this.

Also a look at the Exported packages of the
"org.apache.stanbol.ontologymanager.servicesapi" shows that this
module correctly exports
"org.apache.stanbol.ontologymanager.servicesapi.collector,version=0.10.0.SNAPSHOT"
and also that "org.apache.stanbol.enhancer.engines.refactor" imports
"org.apache.stanbol.ontologymanager.servicesapi.collector,version=0.10.0.SNAPSHOT
from org.apache.stanbol.ontologymanager.servicesapi (123)"

You can check that yourself via the Apache Felix Webconsole under
http://localhost:8080/system/console/bundles
Also the "http://localhost:8090/system/console/depfinder"; (packages)
tab is useful to check for packages that cause errors like that (in
your case org.apache.stanbol.ontologymanager.servicesapi.collector)

Can you please validate this in your launcher. Especially if
"org.apache.stanbol.ontologymanager.servicesapi" exports
"org.apache.stanbol.ontologymanager.servicesapi.collector".


best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Beginning Apache Stanbol

2012-11-18 Thread Rupert Westenthaler
You can re-trigger the initialisation of the dbpedia site by
stopping/starting the "org.apache.stanbol.data.sites.dbpedia" bundle
(e.g. via the Felix Webconsole under
http://localhost:8080/system/console/bundles). If this does not solve
your issue it at least makes it easier to find the reason by looking
at the logging during the startup.

Additional information is also available in a file called
"dbpedia.solrindex.ref" (best use find to search for the file, as it is
hard to explain where it is located). The file is a normal text
file with the current state of the site. If the state is ERROR it
should also contain the exception that caused the initialization to
fail.
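
On a Unix-like system something like

find {stanbol-working-dir}/stanbol -name "dbpedia.solrindex.ref"

should locate it.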

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Beginning Apache Stanbol

2012-11-18 Thread Rupert Westenthaler
Hi Jonathan,

now the 3rd and last part answering your questions

>
> I'd appreciate help with the following:
> - how to enable dbpediaLinking?

If there were no errors this would be enabled by default. The
only thing you might want to do is to install a full dbpedia index (as
the one included in Stanbol only contains ~40k Entities).

An index based on dbpedia 3.7 (spring 2011) can be found at
http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.7/
There is also a new dbpedia 3.8 (summer 2012) contributed by "Andrea
Di Menna" available at
http://dev.iks-project.eu/downloads/stanbol-indices/upload/dbpedia-3.8/
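
To install one of those full indexes you roughly need to copy the
downloaded archive into the datafiles folder of your launcher and restart
the dbpedia site bundle, e.g. (sketch):

wget http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.7/dbpedia.solrindex.zip
cp dbpedia.solrindex.zip {stanbol-working-dir}/stanbol/datafiles/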

> - should the integration tests be passing?

Definitely yes. This is also checked by the Stanbol Jenkins Server
after every commit to the trunk (see
https://builds.apache.org/job/stanbol-trunk-1.6/)

> - how to enable contenthub

You will need to use the full launcher to have the contenthub (or
build a custom launcher as described by
http://stanbol.apache.org/production/your-launcher.html)

> - how to enable additional engines (e.g. 
> https://github.com/insideout10/wordlift-stanbol has Freebase and Schema.org, 
> but I'm not clear on how to include that code in the Stanbol src.
>

I have not yet had time to look specifically at this contribution. But
generally you just need

* to add a bundle to your Stanbol Launcher
* to provide a configuration for the Component(s) provided by those bundles

You can do all this via the Felix Webconsole
(http://localhost:8090/system/console): install bundles via the
"Bundles" tab and provide configurations via the "Configuration" tab.
Sometimes you will also need to manually start configured services via
the "Components" tab.

Another possibility is to use the Sling FileInstaller (just create
the "{stanbol-working-dir}/stanbol/fileinstall" directory if it does not yet
exist) and then copy the bundles and configurations into this directory.
Stanbol will pick them up and install them automatically from this location.
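
e.g. (sketch - the bundle and configuration file names are only examples):

mkdir -p {stanbol-working-dir}/stanbol/fileinstall
cp my.enhancement.engine-1.0.0.jar {stanbol-working-dir}/stanbol/fileinstall/
cp org.example.MyEngine-myinstance.config {stanbol-working-dir}/stanbol/fileinstall/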

best
Rupert


--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Beginning Apache Stanbol

2012-11-18 Thread Rupert Westenthaler
w3.org/2000/01/rdf-schema#
> 18.11.2012 22:59:03.669 *DEBUG* [937106871@qtp-2017995693-10] 
> org.apache.stanbol.entityhub.yard.solr.impl.SolrFieldMapper  > prefix: 
> entityhub value: http://stanbol.apache.org/ontology/entityhub/entityhub#
> 18.11.2012 22:59:03.670 *WARN* [937106871@qtp-2017995693-10] 
> org.apache.felix.http.jetty /entityhub/entity 
> (java.lang.IllegalStateException: Unknown prefix foaf (parsed from field 
> foaf:schoolHomepage)!) java.lang.IllegalStateException: Unknown prefix foaf 
> (parsed from field foaf:schoolHomepage)!
[..]
> 18.11.2012 22:59:03.686 *DEBUG* [Event Job Manager Observer Daemon] 
> org.apache.stanbol.enhancer.jobmanager.event.impl.EnhancementJobHandler  -- 
> No active Enhancement Jobs
> 18.11.2012 22:59:04.545 *DEBUG* [Timer-1] 
> org.apache.sling.installer.provider.file.impl.FileMonitor Checking 
> /Users/jonathan/Documents/HuntDesign/Projects/stanbol/launchers/stanbol/fileinstall
> 18.11.2012 22:59:04.545 *DEBUG* [Timer-1] 
> org.apache.sling.installer.provider.file.impl.FileMonitor Checking 
> /Users/jonathan/Documents/HuntDesign/Projects/stanbol/launchers/stanbol/fileinstall/org.apache.stanbol.enhancer.engines.geonames.impl.LocationEnhancementEngine.config
>

I think these logs are from making the

> curl -X GET 
> "http://localhost:8080/entityhub/entity?id=http://huntdesign.co.nz/person/DavidBanner";

request and not from start/stopping the
"org.apache.stanbol.data.sites.dbpedia" bundle. Is this assumption
correct?

BTW, has restarting the "org.apache.stanbol.data.sites.dbpedia" bundle solved the issue?

>> Additional information are also available in a file called
>> "dbpedia.solrindex.ref" (best use find to search for the file as it is
>> hard to explain the where it is located). the file is a normal text
>> file with the current state of the site. If the State is Error it
>> should also contain the exception that caused the initialization to
>> fail.
>
> ~/Documents/HuntDesign/Projects/stanbol/data/sites/dbpedia/target/classes/org/apache/stanbol/data/site/dbpedia/default/config/dbpedia.solrindex.ref
>

this is the file as included in the stanbol distribution. The file I
was referring to should be under
"{stanbol-launcher-dir}/stanbol/felix/"

> shows
>
> Name=SolrIndex for dbpedia
> Description=DBpedia.org
> Index-Archive=dbpedia.solrindex.zip,dbpedia_43k.solrindex.zip
> Download-Location=http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.7/dbpedia.solrindex.zip
>

The file within the stanbol launcher directory should contain
additional information like

Directory=dbpedia-2012.11.18
Index-Name=dbpedia
State=ACTIVE
Archive=dbpedia_43k.solrindex.zip

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Tika content type detection

2012-11-19 Thread Rupert Westenthaler
Hi Andriy,

On Mon, Nov 19, 2012 at 10:25 AM, Andriy Nikolov
 wrote:
> Dear all,
>
> I have a question about the use of tika engine to detect the content-type
> of uploaded document. Does it require any special configuration of stanbol?

No, it does not, as Stanbol directly forwards the parsed content to the
Tika Mime Magic Detection if the Content-Type header is not set in the
request.

> Problem accessing /enhancer/engine/tika. Reason:
> Enhancement Chain failed because of required Engine 'tika' failed
> with Message: Unable to process ContentItem
> '<urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f>'
> with Enhancement Engine 'tika' because the engine is currently not
> active(Reason: Unexpected Exception while processing ContentItem
> <urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f> with
> EnhancementJobManager: class
> org.apache.stanbol.enhancer.jobmanager.event.impl.EventJobManagerImpl)!Caused
> by:org.apache.stanbol.enhancer.servicesapi.ChainException:
> Enhancement Chain failed because of required Engine 'tika' failed with
> Message: Unable to process ContentItem
> '<urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f>'
> with Enhancement Engine 'tika' because the engine is currently not
> active(Reason: Unexpected Exception while processing ContentItem
> <urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f> with
> EnhancementJobManager: class
> org.apache.stanbol.enhancer.jobmanager.event.impl.EventJobManagerImpl)!
> at
> org.apache.stanbol.enhancer.jobmanager.event.impl.EventJobManagerImpl.enhanceContent(EventJobManagerImpl.java:153)
> at
> org.apache.stanbol.enhancer.jersey.resource.AbstractEnhancerResource.enhance(AbstractEnhancerResource.java:233)
> at
> org.apache.stanbol.enhancer.jersey.resource.AbstractEnhancerResource.enhanceFromData(AbstractEnhancerResource.java:215)
>

This is the reason why it does not work for you. However, to determine
the problem I would need the whole stack trace including all 'caused
by' sections.

The other error referenced in your mail seems unrelated.

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Stopping the framework ...

2012-11-19 Thread Rupert Westenthaler
Hi,

you can not send attachments via the list. Feel free to send it directly to me.

On Mon, Nov 19, 2012 at 11:09 AM, Andrea Taurchini  wrote:
> 1) clean stanbol folder
> 2) launch java -Xmx1g -jar -XX:MaxPermSize=128m
> stanbol_src\launchers\full\target\org.apache.stanbol.launchers.full-0.10.0-SNAPSHOT.jar

-XX:MaxPermSize=128m is not enough for the full launcher as it
requires ~200 MByte. You should use -XX:MaxPermSize=256m instead.
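
So the startup command should look like this (one line):

java -Xmx1g -XX:MaxPermSize=256m -jar stanbol_src\launchers\full\target\org.apache.stanbol.launchers.full-0.10.0-SNAPSHOT.jar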

With only 128m of PermGen memory I would expect the full launcher to
throw OutOfMemory exceptions during startup. If this happens the
initial configuration (created during the first startup) will be
incomplete and corrupted. This could indeed explain the issues you are
experiencing.

best
Rupert


> 3) once fully active ... stop the service with CTRL^C
> 4) wait for stopping the service
> 5) relaunch the same startup command
> 6) verify that entityhubExtraction enhancer is no more available
>
> Thanks for your help.
>
>
> Best,
> Andrea
>
>
>
>
>
>
> 2012/11/15 Rupert Westenthaler 
>>
>> Hi
>>
>> On Thu, Nov 15, 2012 at 2:36 PM, Andrea Taurchini 
>> wrote:
>> > Dear All,
>> > maybe I'm missing (again) something, but if I stop the framework, no
>> > matter
>> > if through Felix Web Console or CTRL+C, configurations go to hell on the
>> > next restart.
>>
>> No you are missing nothing. All those ways to shutdown Stanbol should
>> work just fine. I can not remember having ever a problem like that.
>>
>> > Even the default enhancement chain will stop working since the order or
>> > the
>> > engine is changed to :
>> >
>> >- *metaxa* ( optional , currently not available)
>> >- *entityhubExtraction* ( required , currently not available)
>> >- *tika* ( optional , TikaEngine)
>> >- *langdetect* ( required , LanguageDetectionEnhancementEngine)
>> >- *ner* ( required , NamedEntityExtractionEnhancementEngine)
>> >- *dbpediaLinking* ( required , NamedEntityTaggingEngine)
>> >
>>
>> that "not available" engines are listed first is expected for the
>> WeightedChain. This chain determines the order based on information
>> provided by the Engine. So if an Engine is not available such
>> Information are not available. As the order does not matter for
>> Engines that are not available my decision was to list them first.
>>
>> > not to mention the fact that my own configurations (topic classifier
>> > ...)
>> > is completely removed ...
>> >
>>
>> Somehow it looks like as OSGI is not able to write files to the disc.
>> Can you please check the Stanbol log file
>> {launcher-dir}/stanbol/logs/error.log if you can find related
>> information.
>>
>> best
>> Rupert
>>
>> > Thanks,
>> > Andrea
>>
>>
>>
>> --
>> | Rupert Westenthaler rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11 ++43-699-11108907
>> | A-5500 Bischofshofen
>
>



--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Tika content type detection

2012-11-19 Thread Rupert Westenthaler
Hi Andriy, all

sending this again to the list as others might be affected/interested
as well. Especially Suat as he is currently fighting an very similar
issue in the CMS adapter

The assumption that Tika may miss XML Beans is wrong as Tika includes
xmlbeans 2.3.

java.lang.NoClassDefFoundError: Could not initialize class
org.apache.xmlbeans.XmlBeans

Errors like that indicate a problem during the initialization of a class.
This includes the initialization of static variables (or static
blocks) in the mentioned class and all super classes. Looking at the
source of XmlBeans shows that in this case nearly everything is called
during static initialization :(

However, when it comes to external dependencies there are only two, and
in that context only the dependency on
javax.xml.stream.XMLStreamReader seems relevant.

javax.xml.stream.XMLStreamReader is part of the "stax-api". This API
is included in JDK 1.6. Stanbol imports the stax-api twice

1. via the JDK, because the Stanbol framework fragment lists all the
packages of the stax-api
2. via the 
org.apache.servicemix.specs:org.apache.servicemix.specs.stax-api-1.0:2.1.0

This could indeed cause the error you are experiencing. I have created
a launcher with a preliminary fix for that. You can find it under [1].
Can you please check whether it solves your issue?

Please use "-Xmx1024m -XX:MaxPermSize=256M" when starting the full launcher.

best
Rupert

[1] http://dev.iks-project.eu/downloads/stanbol-launchers/tmp/stax-api-debug/

On Mon, Nov 19, 2012 at 1:19 PM, Andriy Nikolov
 wrote:
> Thanks a lot!
> The error message is attached (seems like XMLBeans is not on classpath - is
> this something to configure separately?).
>
> Best,
> Andriy
>
>
> On Mon, Nov 19, 2012 at 12:52 PM, Rupert Westenthaler
>  wrote:
>>
>> Hi Andriy,
>>
>> On Mon, Nov 19, 2012 at 10:25 AM, Andriy Nikolov
>>  wrote:
>> > Dear all,
>> >
>> > I have a question about the use of tika engine to detect the
>> > content-type
>> > of uploaded document. Does it require any special configuration of
>> > stanbol?
>>
>> No it does not as Stanbol directly forwards the parsed content to the
>> Tika Mime Magic Detction if the Content-Type header is not set in the
>> request.
>>
>> > Problem accessing /enhancer/engine/tika. Reason:
>> > Enhancement Chain failed because of required Engine 'tika'
>> > failed
>> > with Message: Unable to process ContentItem
>> > '<urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f>'
>> > with Enhancement Engine 'tika' because the engine is currently not
>> > active(Reason: Unexpected Exception while processing ContentItem
>> > <urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f>
>> > with
>> > EnhancementJobManager: class
>> >
>> > org.apache.stanbol.enhancer.jobmanager.event.impl.EventJobManagerImpl)!Caused
>> > by:org.apache.stanbol.enhancer.servicesapi.ChainException:
>> > Enhancement Chain failed because of required Engine 'tika' failed with
>> > Message: Unable to process ContentItem
>> > '<urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f>'
>> > with Enhancement Engine 'tika' because the engine is currently not
>> > active(Reason: Unexpected Exception while processing ContentItem
>> > <urn:content-item-sha1-445158f36b9d4c42842c1f190950891524ba957f>
>> > with
>> > EnhancementJobManager: class
>> > org.apache.stanbol.enhancer.jobmanager.event.impl.EventJobManagerImpl)!
>> > at
>> >
>> > org.apache.stanbol.enhancer.jobmanager.event.impl.EventJobManagerImpl.enhanceContent(EventJobManagerImpl.java:153)
>> > at
>> >
>> > org.apache.stanbol.enhancer.jersey.resource.AbstractEnhancerResource.enhance(AbstractEnhancerResource.java:233)
>> > at
>> >
>> > org.apache.stanbol.enhancer.jersey.resource.AbstractEnhancerResource.enhanceFromData(AbstractEnhancerResource.java:215)
>> >
>>
>> This is the reason why it does not work for your. However to determine
>> the problem I would need the whole stack trace including all 'caused
>> by' sections.
>>
>> The other error referenced in your mail seems unrelated.
>>
>> best
>> Rupert
>>
>> --
>> | Rupert Westenthaler rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11 ++43-699-11108907
>> | A-5500 Bischofshofen
>
>
>
>
> --
>
> Dr Andriy Nikolov
>
> R&D Engineer
>
> F +49 6227 3849-565
>
> 

Re: Stopping the framework ...

2012-11-19 Thread Rupert Westenthaler
Hi,

even after a detailed inspection of the log file you provided I was not
able to find an indication of any problem. Based on the logging the
Stanbol instance was started, stopped and then started and stopped
again. The logs of the first and the second startup are really
similar.

The only thing that might cause problems is the "java.io.IOException:
Unable to establish loopback connection" but as you mentioned in your
last mail solving this has also not solved your issue.

On Mon, Nov 19, 2012 at 4:17 PM, Andrea Taurchini  wrote:
> I should install stanbol on a windows server ... so it is not possible ?

AFAIK there are several users that do use Stanbol on Windows. Only
yesterday a Blog about MakoLab using Stanbol with their Windows CMS
was posted [1].

Andrea can you try to do the following

1. 1st time start of Stanbol (in an empty directory)
2. after the start archive the "stanbol\config" folder
3. stop the stanbol instance
4. after shutdown again archive the "stanbol\config" folder
5. start stanbol a 2nd time
6. make an third archive of the "stanbol\config" folder

If you can send me those three archives I will make a before/after
check of the OSGI component configurations as written to your HD.
Maybe this will provide a hint about your issue.

best
Rupert



[1] 
http://blog.iks-project.eu/makolabs-stanbol-integration-with-renault-international-cms-system/

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Apache Stanbol Enhancer Engine

2012-11-21 Thread Rupert Westenthaler
Hi Stefan

To be sure I would need to check this in detail, but I think the
reason is that your labels use 'en-GB' as language while the language
identification determines the text to be in 'en'. Because of that your
labels are not considered for linking. You can try to set the "Default
Matching Language" of the KeywordLinkingEngine to "en-GB"; if you then
get the expected results it would validate my assumption.

Do you need to support country specific language identifiers? Otherwise
I would suggest changing the language tags in your dataset from
"en-GB" to "en".

best
Rupert

On Wed, Nov 21, 2012 at 3:18 PM, Stefan Zwicklbauer
 wrote:
> Hello,
>
> I have generated my own index which is available in the entityhub. The index
> is small and the apropriate rdf file has the following structure:
>
> <rdf:Description rdf:about="http://cv.iptc.org/newscodes/genre/Archive_material">
>   <rdf:type rdf:resource="http://www.w3.org/TR/skos-reference/skos.html#Concept"/>
>   <skos:prefLabel xml:lang="en-GB">Archive material</skos:prefLabel>
>   <skos:definition xml:lang="en-GB">The object contains material
> distributed previously that has been selected from the originator's
> archives.</skos:definition>
> </rdf:Description>
> <rdf:Description rdf:about="http://cv.iptc.org/newscodes/genre/Background">
>   <rdf:type rdf:resource="http://www.w3.org/TR/skos-reference/skos.html#Concept"/>
>   <skos:prefLabel xml:lang="en-GB">Background</skos:prefLabel>
>   <skos:definition xml:lang="en-GB">The object provides some scene setting
> and explanation for the event being reported.</skos:definition>
> </rdf:Description>
> <rdf:Description rdf:about="http://cv.iptc.org/newscodes/genre/Biography">
>   <rdf:type rdf:resource="http://www.w3.org/TR/skos-reference/skos.html#Concept"/>
>   <skos:prefLabel xml:lang="en-GB">Biography</skos:prefLabel>
>   <skos:definition xml:lang="en-GB">Facts and views about a
> person</skos:definition>
> </rdf:Description>
>
> In the following I created an enhancement chain which consists of a Language
> Identification Engine and a Linking Keyword Engine. If i try to use this
> chain with an input word like "Background" (look example) the result does
> not contain any relevant information (only language).
>
> During the creation of the Keyword Linking Engine in the Web console i had
> to specify the label field and type field. What are the correct values
> concerning the given example above?
>
> Sincerly
> Stefan



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: stanbol internal dependencies to dependency management

2012-11-21 Thread Rupert Westenthaler
Those decisions were intentional.

Normally Fabian would be the right person to answer this, but I will
try it anyway:

This is only a short summary, as there was a long discussion that led to this:

Dependency management for Stanbol modules can not be done in the parent,
because the module versions would then be fixed there and this would not
allow releasing components without also releasing the parent.
Components depending on the oldest supported version give users
more freedom in their launcher configurations. In addition, keeping
dependencies on released versions is critical for releases of
single components or subsets of components.
If Stanbol modules do not depend on the latest version, doing the
dependency management in the parent does not work. Developers of
modules need to manage their dependencies themselves.
Only the Stanbol launchers are supposed to use the newest versions of modules.

On Wed, Nov 21, 2012 at 5:56 PM, Reto Bachmann-Gmür  wrote:
> e.g. the
> enhancer using 0.9.0-incubating of commons.web.base. This can cause
> incompatibilities as in the launchers 0.10.1-SNAPSHOT is used.

If there is a change in the enhancer.jersey module that requires the
current commons.web.base then the developer that introduces this
change needs to update the dependency.

This also happened to me. But after some time one gets used to it.

best
Rupert


--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: stanbol internal dependencies to dependency management

2012-11-21 Thread Rupert Westenthaler
> But I think they should nevertheless be kept up to date as
> otherwise we have no compile time check that the module will indeed work in
> the trunk version of the launcher. So I think we should regularly run a

Unit tests are executed using compile time dependencies.
Integration tests do check the runtime dependencies.

So I do not see a problem with that.

In addition one has to consider that the OSGI dependency management is
anyway different from the maven one.

To give two examples (for details have a look at the Semantic
Versioning Whitepaper [1])

1. consumer and provider policy: Stanbol uses (since STANBOL-774)

 -provider-policy : ${range;[==,=+)}
 -consumer-policy : ${range;[==,+)}

That means that by default dependencies do use a version range of
[==,+). However this is not feasible for imported packages that are
implemented by a module, as minor versions may e.g. extend an
interface by an additional method. So for such cases the import needs
to be marked with the provider-policy to ensure [==,=+).

2. Dependency management in Maven is on the module level whereas OSGI
uses package level granularity.

Depending on the latest version undermines version ranges (especially
for consumer-policy dependencies) - [==,=+) where the left side is the
most current version means basically that there is no version range at
all.

- - -

While such things are not really visible as long as you run everything
within the OSGI environment, it really starts to hurt as soon as you
want to access services from outside of OSGI (e.g. when you run
Stanbol in an embedded OSGI environment). In such settings one needs
to expose all packages of used interfaces via the system bundle and
therefore you do not have the possibility to use different versions of
the same class.

But also within OSGI there are some disadvantages one might encounter.
One example is a fragmentation of the service registry (basically a
bundle may not be able to use a service, because its version of the Interface
was loaded using a different classloader than the version of the
Interface provided by the Service). If that happens ServiceTracker
will not get notified for available services - because they would not
be compatible. Debugging that is not fun and solving such issues is
only possible by fixing version ranges.

I agree that from a Maven and build perspective this might look
like a bad choice, but from the OSGI perspective it is exactly how it
should be done.

I think the version number confusion of Sling is caused by the fact
that every single module can have totally different versions. I think
in Stanbol this can be easily avoided by ensuring that all modules of
a Stanbol component (enhancer, entityhub, ... ) are all within
[==,=+). For the commons stuff we could use the same rule but one
level below (e.g. that all commons.solr modules are within [==,=+))

best
Rupert

 [1] http://www.osgi.org/wiki/uploads/Links/SemanticVersioning.pdf

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: The KeywordLinkingEngine and the Stanbol NLP processing module (STANBOL-740)

2012-11-21 Thread Rupert Westenthaler
Hi

thanks for the feedback. I think we should go for (2) renaming the
engine. First because the current name (KeywordExtractionEngine) is
anyway not so fitting. Keyword extraction is typically more related to
finding central words within a text but the engine is more about
linking words with a vocabulary. Second because there might be some
use cases where it would still make sense to use the old engine in
parallel with the new one - e.g for extracting Product-Ids, ISBN
numbers, chemical formulas such as CH3CH2OH ... Third it is easier to
adapt the documentation - especially the usage scenarios - if there is
a new name for the new engine and finally I do also like to have
warnings instead of errors for users that have not yet adapted to the
new engine.

While Fabian's suggestion would clearly document the change, it would
still mean breaking most current Stanbol installations as most of the
users currently use the trunk version. However, as soon as we have a
faster release cycle this option would be much more attractive.

I would then suggest using "EntityhubLinkingEngine" as the new name
for the Engine, as this name makes it very clear what this engine does.

Thanks for the feedback
best
Rupert


On Thu, Nov 22, 2012 at 12:01 AM, Bertrand Delacretaz
 wrote:
> On Wed, Nov 21, 2012 at 8:46 PM, Fabian Christ
>  wrote:
>> ...what about creating a branch from the trunk with the current version
>> (before the merge) that is known to be working? People could switch to that
>> branch to keep the status quo and we should make clear that this branch
>> will not be maintained in the future...
>
> I'd make that just a BEFORE_740 tag then - that makes it clearer that
> this is not supposed to evolve further.
>
> -Bertrand



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: The KeywordLinkingEngine and the Stanbol NLP processing module (STANBOL-740)

2012-11-22 Thread Rupert Westenthaler
On Thu, Nov 22, 2012 at 12:10 PM, Bertrand Delacretaz
 wrote:
>
> Isn't the "hub" part an implementation detail?
>
> EntityLinkingEngine sounds better to be - but no strong opinion,
> whoever does the work decides.

Good point. While refactoring the code I came to the same conclusion

Currently I have

(1) "EntityLinkingEngine": This is the class implementing the
EnhancementEngine interface and in registered as OSGI service and
(2) "EntityhubLinkingEngine": The OSGI Component that gets the
configuration, registered an ServiceTracker for the Entityhub Site and
registers the  "EntityLinkingEngine" instance as soon as all the
required Services are available.

The goal of this is to make it really easy implement a
"MyServiceLinkingEngine". Even my current refactoring we are not yet
there, but it is getting much better.

best
Rupert



--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: "Error reloading cached bundle"

2012-11-22 Thread Rupert Westenthaler
Hi Reto,

I am now able to reproduce this by the following

(1) start the Stanbol launcher within the target folder
(2) make a mvn clean install while stanbol is still running in the
./target folder
(3) stop the Stanbol launcher (whose folder was deleted in the meantime)
(4) go to the newly created ./target folder
(5) start the stanbol launcher within the target folder

I think this is because in (3) the launcher writes some data to the
./target/stanbol folder of the new one. Because of that the
initialisation of the new launcher in (5) fails with the reported
exception.

Could this be related to the cases you are reporting?

best
Rupert

On Wed, Nov 14, 2012 at 1:29 PM, Reto Bachmann-Gmür  wrote:
> Now tha I saw the same error again its bundle 105 which is
> slinginstall:org.apache.stanbol.commons.solr.core-0.10.1-SNAPSHOT.jar.
> Again same symptoms.
>
> Reto
>
> On Wed, Oct 10, 2012 at 3:06 PM, Rupert Westenthaler <
> rupert.westentha...@gmail.com> wrote:
>
>> Reto,
>>
>> have you looked what module bundle64 refers to?
>>
>> On Wed, Oct 10, 2012 at 11:53 AM, Reto Bachmann-Gmür 
>> wrote:
>> > Occasionally when starting a fresh stanbol launcher I get the following
>> > error message. Does anybody knows what is causing this? After deleting
>> the
>> > stanbol dectory and retrying the problem doesn't appear again.
>> >
>> > Cheers,
>> > Reto
>> >
>> > ERROR: Error reloading cached bundle, removing it:
>> >
>> /home/reto/projects/apache/stanbol/launchers/full/target/stanbol/felix/bundle64
>> > (java.lang.Exception: No valid revisions in bundle archive directory:
>> >
>> /home/reto/projects/apache/stanbol/launchers/full/target/stanbol/felix/bundle64)
>> > java.lang.Exception: No valid revisions in bundle archive directory:
>> >
>> /home/reto/projects/apache/stanbol/launchers/full/target/stanbol/felix/bundle64
>> > at
>> >
>> org.apache.felix.framework.cache.BundleArchive.(BundleArchive.java:205)
>> > at
>> >
>> org.apache.felix.framework.cache.BundleCache.getArchives(BundleCache.java:223)
>> > at org.apache.felix.framework.Felix.init(Felix.java:656)
>> > at org.apache.sling.launchpad.base.impl.Sling.init(Sling.java:363)
>> > at org.apache.sling.launchpad.base.impl.Sling.(Sling.java:228)
>> > at
>> >
>> org.apache.sling.launchpad.base.app.MainDelegate$1.(MainDelegate.java:181)
>> > at
>> >
>> org.apache.sling.launchpad.base.app.MainDelegate.start(MainDelegate.java:181)
>> > at org.apache.sling.launchpad.app.Main.startSling(Main.java:424)
>> > at org.apache.sling.launchpad.app.Main.doStart(Main.java:349)
>> > at org.apache.sling.launchpad.app.Main.main(Main.java:123)
>> > at org.apache.stanbol.launchpad.Main.main(Main.java:61)
>>
>>
>>
>> --
>> | Rupert Westenthaler rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11 ++43-699-11108907
>> | A-5500 Bischofshofen
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Apache stanbol: Enhancer service codification problem

2012-11-22 Thread Rupert Westenthaler
Hi Jairo,

I created STANBOL-813 [1] and implemented a fix with revision [2].
Your test case now works for me, so it should be fine for you too.

Note that this fix does not tackle the general issues mentioned in
my first reply, so Stanbol might still write characters to the
Enhancement structure that might cause "application/rdf+xml"
serializations to fail.

best
Rupert



[1] https://issues.apache.org/jira/browse/STANBOL-813
[2] http://svn.apache.org/viewvc?rev=1412756&view=rev

On Tue, Nov 20, 2012 at 6:19 AM, Rupert Westenthaler
 wrote:
> Hi Jairo,
>
> This is caused by the "removeNonUtf8CompliantCharacters(..)" in the
> NEREngineCore class (OpenNLP-NER engine) [1]. The JavaDoc says that
> this was added to avoid errors while creating "application/rdf+xml"
> responses.
>
> I only recently noticed this method as I adapted the OpenNLP NER
> engine to work with the new Stanbol NLP processing chain
> (STANBOL-797). In the branch version of this engine [2] the method
> "removeNonUtf8CompliantCharacters(..)" is no longer called if the
> AnalyzedText ContentPart (STANBOL-734) is used as source for the
> enhancements.
>
> Generally I do not like this method as it creates a copy of the parsed
> content what can be a problem for big texts. In addition as this is
> only done by this engine there is still no guarantee that there are no
> non UTF-8 compliant chars in the response (they might even come from
> literals in dereferenced Entities).
>
> In addition this method seems to be overdoing it as well, because the 'í'
> in 'París' is clearly a UTF-8 conformant character. Maybe Olivier
> Grisel can comment on that, because as far as I can remember he was
> the one who added this feature years ago.
>
> best
> Rupert
>
>
> [1] 
> http://svn.apache.org/repos/asf/stanbol/trunk/enhancer/engines/opennlp-ner/src/main/java/org/apache/stanbol/enhancer/engines/opennlp/impl/NEREngineCore.java
> [2] 
> http://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/enhancer/engines/opennlp-ner/src/main/java/org/apache/stanbol/enhancer/engines/opennlp/impl/NEREngineCore.java
>
> On Mon, Nov 19, 2012 at 7:01 PM, Jairo Sarabia
>  wrote:
>> Hi Rupert,
>>
>> I tried to use enhancer service for spanish texts and I have problems with
>> codification.
>> In the service, the  caracters with accents disappear in json response and
>> consequently there are important words of de Language that no appear in the
>> responses.
>> I've tried using different codifications in the requests but none seem to
>> work:
>>
>> Examples of Headers:
>> 1)  -H "Accept: application/json", "Content-type: text/plain"
>> 2)  -H "Accept: application/json", "Content-type: text/plain; charset=utf-8"
>> 3)  -H "Accept: application/json", "Content-type: text/plain;
>> charset=iso-8859-1"
>> 4) -H "Accept: application/json", "Content-type: text/html; charset=utf-8",
>> "Accept-Language: es-es"
>> 5) -H "Accept: application/json", "Content-type: text/html;
>> charset=iso-8859-1", "Accept-Language: es-es"
>>
>> Example of curl request:
>>
>> REQUEST:
>>
>> curl -v -X POST -H "Accept: text/plain" -H "Content-type: text/html;
>> charset=utf-8" -H "Accept-language:es-es;en" --data "The
>> Stanbol enhancer puede detectar personas famosas como Mariano Rajoy y
>> ciudades como París."
>> "http://ec2-50-16-118-169.compute-1.amazonaws.com:8080/enhancer/chain/notedlinks";
>>
>> JSON RESPONSE:
>>
>> {
>>  
>>
>> {
>>   "@subject":
>> "urn:content-item-sha1-69a7889f31ea325dda4a9e08f735b1499e7d6e3c",
>>   "dc:format": "text/html; charset=UTF-8",
>>   "http://www.w3.org/ns/ma-ont#hasFormat": "text/html; charset=UTF-8"
>> },
>> {
>>   "@subject": "urn:enhancement-0367734f-e48d-4dc3-e634-e5a3a4770706",
>>   "@type": [
>> "enhancer:Enhancement",
>> "enhancer:TextAnnotation"
>>   ],
>>   "dc:created": "2012-11-19T17:48:25.977Z",
>>   "dc:creator":
>> "org.apache.stanbol.enhancer.engines.opennlp.impl.NamedEntityExtractionEnhancementEngine",
>>   "dc:type": "dbp-ont:Person",
>>   "enhancer:confidence": 0.98616,
>>   "enhancer:end": 71,
>>  

Re: Question: REST API expected content type

2012-11-23 Thread Rupert Westenthaler
Hi Andriy,

For the Enhancer RESTful API

The MediaType is taken from the "MediaType mediaType" parameter as
passed by JAX-RS to the "readFrom(..)" method of the
"MessageBodyReader". This should be equal to the 'Content-Type'
header parsed from the request. The uploaded content is stored as a Blob
in the created ContentItem.

In case you are sending "multipart/form-data" requests then you need
to consider the specification as documented in the "Multipart MIME
serialization" section of [1].


For the Tika Engine:

The MimeType is parsed from ContentItem#getBlob()#getMimeType() (see
also [1]). If the mime type can not be parsed or is
application/octet-stream then Tika is used to detect the correct
MimeType. Otherwise the content type as set in the Blob is used.

BTW. plain text files are not processed by the Tika engine.
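
So if you do not want to guess the type on the client side you can simply
send a generic type and let Tika do the detection, e.g. (sketch):

curl -X POST -H "Accept: text/turtle" \
    -H "Content-type: application/octet-stream" \
    --data-binary "@dummy.txt" \
    http://localhost:8080/enhancer/chain/dbpedia-spotlight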

best
Rupert


[1] http://stanbol.apache.org/docs/trunk/components/enhancer/contentitem.html

On Fri, Nov 23, 2012 at 9:18 AM, Andriy Nikolov
 wrote:
> Dear all,
>
> I have another question about the use of Stanbol enhancer REST API
> (apologies if it is already covered in the documentation, i didn't
> find it).
> Is there some default content type which is expected by the enhancer?
> For instance, if I send a PDF file to the dbpedia-spotlight chain
> without specifying its content type, it gets processed correctly:
> curl -X POST -H "Accept: text/turtle" -T test.pdf
> http://localhost:8080/enhancer/chain/dbpedia-spotlight?uri=urn:testItem
> However, if I send a plain text file instead, nothing is returned:
> curl -X POST -H "Accept: text/turtle" -T dummy.txt
> http://localhost:8080/enhancer/chain/dbpedia-spotlight
> I have to set "Content-type: text/plain" in the header.
> Similarly, when I send PDF content from Java client via
> HttpURLConnection, if I don't set "Content-type:
> application/octet-stream" explicitly, it gets interpreted as plain
> text.
>
> I guess, Tika engine is able to recognise both plain text and
> different binary formats, so can I set some "default" content type,
> which will just defer the recognition of input format to the Tika
> engine?
> That will allow me sending any file to the service without first doing
> some "pre-guessing" on the client side.
>
> Best regards,
>
> Andriy Nikolov
>
> R&D Engineer
>
> F +49 6227 3849-565
>
> andriy.niko...@fluidops.com
>
> http://www.fluidops.com
>
> fluid Operations AG
>
> Altrottstr. 31
>
> 69190 Walldorf, Germany
>
> Geschäftsführer/Managing Directors: Vasu Chandrasekhara, Dr. Andreas
> Eberhart, Dr. Stefan Kraus, Dr. Ulrich Walther
>
> Beirat/Advisory Board: Prof. Dr. Andreas Reuter, Thomas Reinhart
>
> Registergericht/Commercial Register: Mannheim, HRB 704027
>
> USt-Id Nr./VAT-No.: DE258759786
>
> This e-mail may contain confidential and/or privileged information. If
> you are not the intended recipient (or have received this e-mail in
> error) please notify the sender immediately and destroy this e-mail.
> Any unauthorised copying, disclosure or distribution of the material
> in this e-mail is strictly forbidden.



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Question: REST API expected content type

2012-11-23 Thread Rupert Westenthaler
On Fri, Nov 23, 2012 at 10:57 AM, Gniewosław Rzepka
 wrote:
> http://upload.wikimedia.org/wikipedia/commons/4/45/F1_logo.svg

The current URL used by wikipedia is

http://upload.wikimedia.org/wikipedia/en/4/45/F1_logo.svg

So basically it seems that they replace "commons" with "en" in the URL.

> I thought this might be useful information.

Thanks for the notice, but this is something we can not easily correct
as we do use the data as provided by DBpedia. In case of dbpedia 3.8
those are from this summer. So recent changes are not reflected in
them.

best
Rupert

>
> Gniewosław Rzepka



--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: OpenNLP models license

2012-11-24 Thread Rupert Westenthaler
On Fri, Nov 23, 2012 at 11:39 AM, Andrea Di Menna  wrote:
> Even though I am not copying models in the datafiles dir, it looks
> like those models are anyway available in the stable launcher.
>
> My questions follow:
> 1) Are the en model from lang and ner bundles licensed with a Apache
> 2.0 license?

No. This is the reason why you get Messages like that during the build

*
* WARNING - this build downloads some OpenNLP files that are *not*
* licensed under the Apache License, and have more restrictive usage
* terms than the Apache Stanbol code. See STANBOL-545 for more
* information: https://issues.apache.org/jira/browse/STANBOL-545
*

> 2) Is there any safe/preferable way to remove those models from a
> Stanbol instance without completely disrupting the Keyword Linking
> engine?

The KeywordLinking engine only requires Tokens. Those are also
available if no models are present. However this will have an influence
on the Results and the Performance.

> I am wondering if those models are absolutely needed for the purpose
> of Keyword Linking or if the related bundles can be safely removed
> from the Felix console.
>

just exclude/remove all org.apache.stanbol.data.opennlp.* bundles

Regarding Licenses: You will find a lot of relevant posts on the
OpenNLP mailing lists.

best
Rupert

>
> [1] https://issues.apache.org/jira/browse/STANBOL-545
> [2] http://opennlp.sourceforge.net/models-1.5/



--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: The KeywordLinkingEngine and the Stanbol NLP processing module (STANBOL-740)

2012-11-24 Thread Rupert Westenthaler
Hi all

The refactoring is completed (for now) - see STANBOL-812 [1].
Documentation is already online on the Staging Server

* EntityhubLinkingEngine [2]: This is the direct successor of the
KeywordlinkingEngine
* EntityLinkingEngine [3]: This is the "generic" implementation of
EntityLinking based on the NLP processing API [4]

There will be a 2nd refactoring step to make the EntityLinkingEngine
fully independent of the Stanbol Entityhub. But this will not have any
influence on public APIs, Chain configurations nor Enhancement results
so this can be done after reintegration with the trunk.

Thanks for the feedback
best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-812
[2] 
http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entityhublinking
[3] 
http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entitylinking
[4] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/nlp/

On Thu, Nov 22, 2012 at 1:05 PM, Rupert Westenthaler
 wrote:
> On Thu, Nov 22, 2012 at 12:10 PM, Bertrand Delacretaz
>  wrote:
>>
>> Isn't the "hub" part an implementation detail?
>>
>> EntityLinkingEngine sounds better to be - but no strong opinion,
>> whoever does the work decides.
>
> Good point. While refactoring the code I came to the same conclusion
>
> Currently I have
>
> (1) "EntityLinkingEngine": This is the class implementing the
> EnhancementEngine interface and in registered as OSGI service and
> (2) "EntityhubLinkingEngine": The OSGI Component that gets the
> configuration, registered an ServiceTracker for the Entityhub Site and
> registers the  "EntityLinkingEngine" instance as soon as all the
> required Services are available.
>
> The goal of this is to make it really easy implement a
> "MyServiceLinkingEngine". Even my current refactoring we are not yet
> there, but it is getting much better.
>
> best
> Rupert
>
>
>
> --
> | Rupert Westenthaler rupert.westentha...@gmail.com
> | Bodenlehenstraße 11 ++43-699-11108907
> | A-5500 Bischofshofen



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: stanbol internal dependencies to dependency management

2012-11-24 Thread Rupert Westenthaler
Hi Reto,

if you make incompatible changes to a module then you need to adapt
all dependent modules and update their dependency to the current
version.

Normally the

 -provider-policy : ${range;[==,=+)}
 -consumer-policy : ${range;[==,+)}

would ensure that released Bundles are not affected by that. This is
also the reason why for an incompatible API change a major version
increase is required. However, for pre 1.0.0 versions this is not the
case.

best
Rupert

On Fri, Nov 23, 2012 at 11:10 AM, Reto Bachmann-Gmür  wrote:
> Hi,
>
> The concrete problem: I've made changes to the WebFragment interface (in
> org.apache.stanbol.commons.web.base). The classes implementing it no longer
> compile if they have proper @Override annotations. Packages which used to
> implement the old version should remove the method and move the templates
> to another location.
>
> At runtime implementation of the old interface still work except that the
> method is never invoked and the templates are looked up in the new
> location. I've moved the templates to the new location in all modules and
> I've removed the method in those modules dependeing on the trunk version.
> The other modules are now in the state that they work only with the trunk
> launchers but compile only with the dependency to the old comms.web.base.
> If developer update the dependency version they'll have to find out why it
> fails and what adaptations are needed.
>
> I think it would be much more efficient if the one that changes an
> interface also changes all dependencies in trunk to compile with the new
> version. Of course one could just update the modules depending on the
> updated one to use the latest version. Howver I think it would be more
> consistent to keep the reactor modules to depend on the latest versions,
> this can be done running and needs no change to depenedency management:
>
> mvn org.codehaus.mojo:versions-maven-plugin:1.3.1:use-latest-versions
> "-Dincludes=org.apache.stanbol:*:*:*"  -DallowSnapshots=true
> -DexcludeReactor=false
>
> For the following modules we have other modules depending of older versions
> of them:
>
> org.apache.stanbol.commons.jsonld
> org.apache.stanbol.commons.solr.core
> org.apache.stanbol.commons.stanboltools.datafileprovider
> org.apache.stanbol.commons.stanboltools.offline
> org.apache.stanbol.commons.web.base
> org.apache.stanbol.entityhub.core
> org.apache.stanbol.entityhub.model.clerezza
> org.apache.stanbol.entityhub.servicesapi
> org.apache.stanbol.entityhub.yard.solr
>
> Given that in the launchers the reactor build they have to be  compatible
> with the latest versions anyway this seems inconsistent to me.
>
> For now I'll just update the modules to depend on the latest version of
> org.apache.stanbol.commons.web.base.
>
> Cheers,
> Reto
>
>
>
>
> On Fri, Nov 23, 2012 at 9:15 AM, Fabian Christ > wrote:
>
>> Hi,
>>
>> is there any concrete problem with this approach? I would like to live with
>> it at least for some releases and then decide upon our experience if it
>> fits. Otherwise it is just a meta-discussion. I see pros and cons on each
>> side.
>>
>> Let's do a few releases and collect some evidence ;)
>>
>>
>> 2012/11/22 Reto Bachmann-Gmür 
>>
>> > I agree that if integration tests offer full coverage they will fail when a
>> > compatibility breaking change is introduced. However the advantage of
>> > statically typed languages is that you can detect these problems already
>> > at compile time.
>> >
>> > The two arguments you mention, package split and interaction with the host
>> > environment, are in fact arguments for having all modules depend on the same
>> > versions of their dependencies. As in the trunk launchers we use the trunk
>> > versions, these modules should also depend exclusively on the trunk versions
>> > of other stanbol modules. Embedding is an important use case that should be
>> > supported; the easiest way to address it is to have just one version of the
>> > bundles and consistent dependencies. Backward compatibility (e.g. that
>> > somebody wants to use an old version of an engine with a new enhancer)
>> > seems less important, and to provide it the current approach of having
>> > engines compile but then fail at runtime doesn't seem a good approach
>> > anyway.
>> >
>> > Cheers,
>> > Reto
>> >
>> > On Wed, Nov 21, 2012 at 11:25 PM, Rupert Westenthaler <
>> > rupert.westentha...@gmail.com> wrote:
>> >
>> > > > But I think they should n

Re: stanbol internal dependencies to dependency management

2012-11-24 Thread Rupert Westenthaler
On Sat, Nov 24, 2012 at 1:09 PM, Reto Bachmann-Gmür  wrote:
> Hi Rupert,
>
> So assuming a module is in trunk at version 3.4.1-SNAPSHOT and I make an
> incompatible change, to what should I change the version number? Does
> the degree of incompatibility make a difference:
> - A change that affects clients of the interface

e.g. Changing/Removing/renaming any existing method of an interface

3.4.1 -> 4.0

The typical workaround is to keep the old method and deprecate it. In
this case an increase to 3.5 is sufficient.

> - A change that affects subclasses (when knowing that there are such
> subclasses/not knowing)

e.g. adding a method to an interface, or an abstract method to a class

3.4.1 -> 3.5

> - A change in the behaviour (documented behaviour/undocumented side effect)

3.4.1 -> 3.4.2

but these are only the minimum required version increases to ensure
that the OSGi provider-policy and consumer-policy in use work as
intended.
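
To make the three cases concrete, here is a minimal illustrative sketch;
GreetingService is a made-up interface (not actual Stanbol API) and the
version numbers simply follow the rules above:

public interface GreetingService {
    // Version 3.4.1 of a hypothetical exported API package
    String greet(String name);
}

// Client-affecting change: removing or renaming greet(String) breaks every
// caller at compile time -> next version is 4.0 (or keep the old method,
// mark it @Deprecated and only add a new one -> 3.5 is sufficient).
//
// Implementer-affecting change: adding a method such as
//     String greet(String name, java.util.Locale locale);
// breaks classes implementing the interface but not plain callers
// -> next version is 3.5.
//
// Behavioural change only: greet() keeps its signature but, for example,
// starts trimming whitespace from the name -> next version is 3.4.2.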

best
Rupert


--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: stanbol internal dependencies to dependency management

2012-11-24 Thread Rupert Westenthaler
On Sat, Nov 24, 2012 at 7:50 PM, Reto Bachmann-Gmür  wrote:
> Ok, thanks. Good to have such a policy.
>
> Just the last point:
>
>> - A change in the behaviour (documented behaviour/undocumented side
>> effect)
>>
>> 3.4.1 -> 3.4.2
>>
>
> The version in trunk is a snapshot version (3.4.1-SNAPSHOT) so the latest
> released version is probably 3.4 so this change would only change what
> changed already.

3.4.2-SNAPSHOT is automatically created as soon as 3.4.1 is released.
However, as long as there are no changes in the trunk there will not be a
3.4.2 release.

Practically that means that a minor change in the trunk does not
increase the version. But from a release perspective the first
change in the trunk does increase the version (as it triggers a new
release), and all further changes do not, unless one decides (for some
other reason) to increase the version number.

BTW: With the introduction of ManagedSites in the Entityhub I had to
make some incompatible changes. Back then I decided to
increase the version number of the trunk from 0.10.1 to 0.11. I also
created an entityhub-0.10 branch [1] so that we could do 0.10.*
releases if we wanted to fix bugs in the version with the old API.
This was mainly because the entityhub 0.11.* is no longer compatible
with the released stanbol version 0.9.0.

But as Fabian already noted, we really need to do some more releases
to see how this all works out in practice. The current discussions are
all very theoretical and need to be validated by forging real
releases.

best
Rupert


[1] http://svn.apache.org/repos/asf/stanbol/branches/entityhub-0.10/

>
> Reto



--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: stanbol internal dependencies to dependency management

2012-11-25 Thread Rupert Westenthaler
>
> What's missing for doinh a 1.0 release (which seems to be the precondition
> for all this major/minor/micro stuff?
>

Fabian and myself talked about that during ApacheCon. If I remember
correctly the plan was as follows:

* reintegrate the Stanbol NLP processing module (I am currently
merging ... should be finished today/tomorrow)
* make another 0.* release of all modules (a first release candidate next
week seems feasible)
   * this will be 0.10 for most components
   * and 0.11 for the entityhub
* work towards the 1.0 release
   * not all components will have a 1.0 release. This is something we
need to decide. Commons, Data, Enhancer, Entityhub are good
candidates. Contenthub will need to wait for the 2-layered storage
infrastructure. Not sure about the Ontonet/reasoning and Rules.
   * for the 1.0 release we might need to change the current folder
structure of the SVN a little bit (e.g. moving Engines that depend on
non-Enhancer components out of enhancer/engines ...

BTW Fabian has already started the work. If you look at the recent
Jira issues you will find some that cover things mentioned above.

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Improvement of DOAP file for Stanbol

2012-11-26 Thread Rupert Westenthaler
Hi

Actually a great idea.

> BTW: Maybe someone has a good idea on how that semantic data provided by
> the ASF can be used by Stanbol.

If someone could write a simple script that collects the RDF files
from all HTML files in

https://projects.apache.org/projects/

They are referenced by the following meta tag



Then we could create an Entityhub ManagedSite for that data and
include it in the Stanbol default configuration.
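
A rough sketch of such a script (plain Java, just to illustrate the idea;
the link patterns are assumptions, since the referenced meta tag is not
preserved above, and would need to be adapted to the actual markup of the
project pages):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CollectDoapFiles {

    // Assumed patterns: project pages are linked as *.html and reference
    // their RDF/DOAP file via an attribute value ending with ".rdf".
    private static final Pattern PAGE_LINK = Pattern.compile("href=[\"']([^\"']+\\.html)[\"']");
    private static final Pattern RDF_LINK = Pattern.compile("[\"']([^\"']+\\.rdf)[\"']");

    public static void main(String[] args) throws Exception {
        URL base = new URL("https://projects.apache.org/projects/");
        Path out = Files.createDirectories(Paths.get("doap-rdf"));
        Matcher pages = PAGE_LINK.matcher(read(base));
        while (pages.find()) {
            Matcher rdf = RDF_LINK.matcher(read(new URL(base, pages.group(1))));
            if (rdf.find()) {
                URL rdfUrl = new URL(base, rdf.group(1));
                String name = rdfUrl.getPath().replaceAll(".*/", "");
                try (InputStream in = rdfUrl.openStream()) {
                    // store the downloaded DOAP/RDF file locally
                    Files.copy(in, out.resolve(name), StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    private static String read(URL url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = r.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }
}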

BTW: I think there are even more RDF files available (see the information
on http://people.apache.org/foaf/) but I do not have a clear idea how
to access the RDF version with the publicly available information.

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Reintegration of the Stanbol NLP processing branch (STANBOL-733) with trunk

2012-11-26 Thread Rupert Westenthaler
Hi all,

with revision 1413560 [2] the stanbol-nlp-processing branch [1] is
re-integrated with the Stanbol trunk. There are still some TODOs such
as adding integration tests for the newly added engines based on the
"dbpedia-proper-noun chain" but starting from this revision the
Stanbol NLP processing module is available in the trunk.

Documentation is still in progress. The current version can be viewed
on the staging server

* NLP processing API [3]:
* Enhancement Engine List [4] with links to the newly added Engines

Especially note that the "EntityhubLinkingEngine" replaces the now
deprecated "KeywordLinkingEngine". Users that are not using the
"Keyword Tokenizer" feature should definitely switch!

If you want to link against DBpedia you should also give the new
DBpedia 3.8 index [5] a try

best
Rupert


[1] http://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/
[2] http://svn.apache.org/viewvc?rev=1413560&view=rev
[3] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/nlp/
[4] 
http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/list.html
[5] http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.8/


--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Fwd: Build failed in Jenkins: stanbol-trunk-1.6 #1116

2012-11-26 Thread Rupert Westenthaler
s not
within its bound

[ERROR] 
<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/Spanish.java>:[17,31]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound

[ERROR] 
<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/Spanish.java>:[17,59]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound

[INFO] 7 errors
[INFO] -
[JENKINS] Archiving disabled - not archiving
<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/pom.xml>
[INFO] 
[ERROR] BUILD FAILURE
[INFO] 
[INFO] Compilation failure

<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/German.java>:[20,31]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound

<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/German.java>:[20,57]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound

<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/model/impl/AnalysedTextFactoryImpl.java>:[20,7]
org.apache.stanbol.enhancer.nlp.model.impl.AnalysedTextFactoryImpl is
not abstract and does not override abstract method
createAnalysedText(org.apache.stanbol.enhancer.servicesapi.ContentItem,org.apache.stanbol.enhancer.servicesapi.Blob)
in org.apache.stanbol.enhancer.nlp.model.AnalysedTextFactory

<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/English.java>:[20,31]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound

<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/English.java>:[20,66]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound

<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/Spanish.java>:[17,31]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound

<https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/olia/Spanish.java>:[17,59]
type parameter org.apache.stanbol.enhancer.nlp.pos.PosTag is not
within its bound


[INFO] 
[INFO] For more information, run Maven with the -e switch
[INFO] ----
[INFO] Total time: 13 minutes 25 seconds
[INFO] Finished at: Mon Nov 26 12:22:01 UTC 2012
[INFO] Final Memory: 558M/907M
[INFO] 
Sending e-mails to: dev@stanbol.apache.org rupert.westentha...@gmail.com
channel stopped
[locks-and-latches] Releasing all the locks
[locks-and-latches] All the locks released


--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen

On Mon, Nov 26, 2012 at 1:23 PM, Apache Jenkins Server
 wrote:
> See <https://builds.apache.org/job/stanbol-trunk-1.6/1116/changes>
>
> Changes:
>
> [rwesten] STANBOL-733: Merged changed from the stanbol-nlp-processing branch 
> back to the trunk; added sentimentdata bundlelist; changed default 
> configuration of the stanbol launcher(s) by editing the /data/dafaultconfig 
> bundle; Adapted the EnhancerConfiguration integration test to the new 
> configuration.
>
> --
> [...truncated 13991 lines...]
> ---
>  T E S T S
> ---
>
> Results :
>
> Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
>
> [JENKINS] Recording test results
> [INFO] [jar:jar {execution: default-jar}]
> [INFO] Building jar: 
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT.jar>
> [INFO] Preparing source:jar
> [WARNING] Removing: jar from forked lifecycle, to prevent recursive 
> invocation.
> [JENKINS] Archiving disabled - not archiving 
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhan

Re: Changes to get rid of jersey dependencies

2012-11-26 Thread Rupert Westenthaler
yes this was the wrong thread ... sorry ... no change to the
contentitem.ftl. AFAIK this is duplicated to avoid a dependency
between the enhancer and the contenthub.

best
Rupert

On Mon, Nov 26, 2012 at 7:53 PM, Reto Bachmann-Gmür  wrote:
> Glad to hear, also this seems to have been the wrong thread. Or the wrong
> patch link, as I see no reference to the duplicate contentitem.ftl there.
>
> Cheers,
> Reto
>
> On Mon, Nov 26, 2012 at 4:43 PM, Rupert Westenthaler <
> rupert.westentha...@gmail.com> wrote:
>
>> Hi all
>>
>> with http://svn.apache.org/viewvc?rev=1413674&view=rev the trunk
>> should be fixed.
>>
>> best
>> Rupert
>>
>> On Sat, Nov 24, 2012 at 4:41 PM, Reto Bachmann-Gmür 
>> wrote:
>> > Hi Rupert,
>> >
>> > I see two templates by that name in the source:
>> >
>> > ./contenthub/web/target/classes/templates/imports/contentitem.ftl
>> > ./enhancer/jersey/src/main/resources/templates/imports/contentitem.ftl
>> >
>> > The two templates seem to differ only by a bit of formatting and they are
>> > registered at the same location where they should be included with
>> > <#include "/imports/contentitem">.
>> >
>> > Furthermore I see:
>> >
>> >
>> ./enhancer/jersey/src/main/resources/org/apache/stanbol/enhancer/jersey/templates/ajax/contentitem.ftl
>> >
>> > At that location it is not accessible to the templating system.
>> > I couldn't find an include for ajax/contentitem, the root for includes is
>> > the templates folder (not templates/html so to allow to include other
>> media
>> > types). From the error message I now moved the file to be where its is
>> > expected, verified with
>> >
>> > zz>val b = bundleContext.getBundle(28)
>> > b: org.osgi.framework.Bundle = org.apache.stanbol.enhancer.jersey [28]
>> >
>> zz>b.getResource("templates/html/org/apache/stanbol/enhancer/jersey/resource/ContentItemResource/ajax/contentitem.ftl")
>> > res3: java.net.URL =
>> >
>> bundle://28.3:1/templates/html/org/apache/stanbol/enhancer/jersey/resource/ContentItemResource/ajax/contentitem.ftl
>> >
>> > The error is gone now.
>> >
>> > Cheers,
>> > Reto
>> >
>> > On Sat, Nov 24, 2012 at 1:22 PM, Rupert Westenthaler <
>> > rupert.westentha...@gmail.com> wrote:
>> >
>> >> Hi Reto,
>> >>
>> I think I discovered another issue with the new template loading
>> mechanism while re-integrating the stanbol-nlp-processing branch.
>> However, a test on the current trunk also shows the same issue.
>> >>
>> When I post a request to the Stanbol Enhancer via the Web UI I do get
>> an "Invalid query" because of
>> >>
>> >> 24.11.2012 13:17:17.994 *WARN* [1346380557@qtp-2082765220-174]
>> >> org.apache.felix.http.jetty /enhancer (java.lang.RuntimeException:
>> >> java.io.FileNotFoundException: Template
>> >>
>> >>
>> html/org/apache/stanbol/enhancer/jersey/resource/ContentItemResource/ajax/contentitem
>> >> not found.) java.lang.RuntimeException: java.io.FileNotFoundException:
>> >> Template
>> >>
>> html/org/apache/stanbol/enhancer/jersey/resource/ContentItemResource/ajax/contentitem
>> >> not found.
>> >> at
>> >>
>> org.apache.stanbol.commons.ldpathtemplate.LdRenderer.renderPojo(LdRenderer.java:173)
>> >> at
>> >>
>> org.apache.stanbol.commons.ldviewable.mbw.ViewableWriter.writeTo(ViewableWriter.java:80)
>> >> at
>> >>
>> org.apache.stanbol.commons.ldviewable.mbw.ViewableWriter.writeTo(ViewableWriter.java:53)
>> >> [..]
>> >> Caused by: java.io.FileNotFoundException: Template
>> >>
>> >>
>> html/org/apache/stanbol/enhancer/jersey/resource/ContentItemResource/ajax/contentitem
>> >> not found.
>> >> at freemarker.template.Configuration.getTemplate(Configuration.java:580)
>> >> at freemarker.template.Configuration.getTemplate(Configuration.java:543)
>> >> at
>> >>
>> org.apache.stanbol.commons.ldpathtemplate.LdRenderer.renderPojo(LdRenderer.java:169)
>> >> ... 48 more
>> >>
>> I think this is because all the imports are still in the old location.
>> Can you please have a look at this? How do imports work with the new
>> infrastructure?

Re: Build failed in Jenkins: stanbol-trunk-1.6 #1116

2012-11-26 Thread Rupert Westenthaler
Hi all

with http://svn.apache.org/viewvc?rev=1413674&view=rev the trunk
should be fixed. Also Jenkins is happy again. Looks like the
Stanbol NLP module (STANBOL-733) has finally found its way to
the trunk!

best
Rupert

On Mon, Nov 26, 2012 at 3:27 PM, Rupert Westenthaler
 wrote:
> Hi,
>
> the commit
>
> http://svn.apache.org/viewvc?rev=1413560&view=rev
>
> got somehow broken. Basically the contents sent by the Eclipse SVN
> plugin were not the version as written on disk. The result is
> that
>
> (1) the data in SVN does not compile (because it is missing
> necessary adaptations after the merge of the nlp processing branch)
> (2) in my local version the SVN metadata are out of sync with the
> contents of the files (making it impossible to commit the correct
> version)
>
> I assume that resolving this will take some time as I will need to
> manually copy the correct files from the corrupted workspace to a
> fresh checkout.
>
> sorry for any inconvenience
> Rupert
>
> -- Forwarded message --
> From: Apache Jenkins Server 
> Date: Mon, Nov 26, 2012 at 1:23 PM
> Subject: Build failed in Jenkins: stanbol-trunk-1.6 #1116
> To: dev@stanbol.apache.org, rupert.westentha...@gmail.com
>
>
> See <https://builds.apache.org/job/stanbol-trunk-1.6/1116/changes>
>
> Changes:
>
> [rwesten] STANBOL-733: Merged changed from the stanbol-nlp-processing
> branch back to the trunk; added sentimentdata bundlelist; changed
> default configuration of the stanbol launcher(s) by editing the
> /data/dafaultconfig bundle; Adapted the EnhancerConfiguration
> integration test to the new configuration.
>
> --
> [...truncated 13991 lines...]
> ---
>  T E S T S
> ---
>
> Results :
>
> Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
>
> [JENKINS] Recording test results
> [INFO] [jar:jar {execution: default-jar}]
> [INFO] Building jar:
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT.jar>
> [INFO] Preparing source:jar
> [WARNING] Removing: jar from forked lifecycle, to prevent recursive 
> invocation.
> [JENKINS] Archiving disabled - not archiving
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/pom.xml>
> [JENKINS] Archiving disabled - not archiving
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT.jar>
> [INFO] [enforcer:enforce {execution: enforce-java}]
> [JENKINS] Archiving disabled - not archiving
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/pom.xml>
> [JENKINS] Archiving disabled - not archiving
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT.jar>
> [INFO] [source:jar {execution: attach-sources}]
> [INFO] META-INF already added, skipping
> [INFO] META-INF/LICENSE already added, skipping
> [INFO] META-INF/NOTICE already added, skipping
> [INFO] META-INF/DEPENDENCIES already added, skipping
> [INFO] Building jar:
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT-sources.jar>
> [INFO] META-INF already added, skipping
> [INFO] META-INF/LICENSE already added, skipping
> [INFO] META-INF/NOTICE already added, skipping
> [INFO] META-INF/DEPENDENCIES already added, skipping
> [INFO] [install:install {execution: default-install}]
> [INFO] Installing
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT.jar>
> to 
> /home/jenkins/jenkins-slave/maven-repositories/1/org/apache/stanbol/org.apache.stanbol.enhancer.test/0.10.0-SNAPSHOT/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT.jar
> [INFO] Installing
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT-sources.jar>
> to 
> /home/jenkins/jenkins-slave/maven-repositories/1/org/apache/stanbol/org.apache.stanbol.enhancer.test/0.10.0-SNAPSHOT/org.apache.stanbol.enhancer.test-0.10.0-SNAPSHOT-sources.jar
> [JENKINS] Archiving disabled - not archiving
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/pom.xml>
> [JENKINS] Archiving disabled - not archiving
> <https://builds.apache.org/job/stanbol-trunk-1.6/ws/trunk/enhancer/generic/test/target/org.apache.stanbol.enhancer.test-0.10

Re: Confused by engines names

2012-11-27 Thread Rupert Westenthaler
Hi Fabian

Short version:

I totally agree. Our vocabulary has changed over time, but the Engines
still use the names from when they were introduced. Changing them
(artifactIds and class names) is dangerous as this does break
backwards compatibility. So I would suggest changing names only if we
can also come up with a better implementation/design.

Regarding vocabulary I think we should prefer the terms
"EntityLinking" and "NamedEntityLinking" and deprecate all others, like
"keyword" (instead of "entity") or "extraction"/"tagging" (instead of
"linking").

The 'engines/entitylinking' and 'engines/entityhublinking' introduced
by STANBOL-733 do already use this new terminology. They also
deprecate the 'engines/keywordextraction'.

- - -

Long version with more background information

Regarding the linking of Entities there are currently two different principles:

* "NamedEntityLinking": A "NamedEntity" has a 'selected text' AND a
'type'. So the selected text AND the type can be used for linking
* "EntityLinking": An "Entity" does only have a 'selected text'. Here
linking is only possible based on the selected text.

The plan would be to also have two Engine implementations that support
those linking models.

* 'NamedEntityLinkingEngine' (currently /engines/entitytagging)
* 'EntityLinkingEngine' (was /engines/keywordextraction (now
deprecated) ; since yesterday  /engines/entitylinking)

Those should not have external dependencies (meaning no dependencies on
Stanbol components other than Stanbol Commons and the Enhancer module;
also no other major frameworks such as Solr or OpenNLP; no calls to
external services). That would allow keeping those Engines within the
enhancer module, but it also means that those implementations can not be
directly used by the user (as the Service used for linking will just be
defined by an Interface without an actual implementation).

Because of that there will be "Engines" that are based on the above,
but come with adapters to Services that do support the EntityLookup.
The default will be implementations based on the StanbolEntityhub, but
Stanbol users could also implement versions for their own
infrastructure needs.

The "EntityhubLinking" module [1] is the first example. When you look
at the module you will recognize that it does not contain an single
EnhancementEngine implementation. It only provides Entityhub specific
implementations of the EntitySearcher interface defined by the
"EntityLinkingEngine" and a OSGI component that allows users to
configure an EntityLinkingEngine instance that uses the Entityhub to
lookup Entities.
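
To illustrate the adapter idea (the types and method signatures below are
simplified placeholders for this mail, not the actual EntitySearcher API
described in [3]):

// Simplified stand-ins -- the real EntitySearcher interface [3] is richer
// and currently still uses the Entityhub model classes.
interface Candidate {
    String getUri();
    String getLabel();
}

interface EntitySearcher {
    /** Look up candidate entities whose label matches the selected text. */
    java.util.Collection<Candidate> lookup(String label, String language);
}

// A service-specific module (like engines/entityhublinking) then only needs
// to contribute such an adapter for its own backend, plus the OSGi glue that
// configures a generic EntityLinkingEngine instance with it.
class InMemoryEntitySearcher implements EntitySearcher {

    private final java.util.Map<String, String> labelToUri =
            new java.util.HashMap<String, String>();

    void register(String label, String uri) {
        labelToUri.put(label.toLowerCase(java.util.Locale.ROOT), uri);
    }

    @Override
    public java.util.Collection<Candidate> lookup(String label, String language) {
        final String uri = labelToUri.get(label.toLowerCase(java.util.Locale.ROOT));
        if (uri == null) {
            return java.util.Collections.<Candidate>emptyList();
        }
        final String matched = label;
        return java.util.Collections.<Candidate>singletonList(new Candidate() {
            public String getUri() { return uri; }
            public String getLabel() { return matched; }
        });
    }
}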

Current state:

Currently we are not yet there. The '/engines/entitytagging' still
implements both NamedEntityLinking AND lookup via the Entityhub. This
engine could be replaced by an 'engines/namedentitylinking' that
follows the design described above. The new
'/engines/entitylinking' already implements the above design. However
it still depends on the Entityhub, because the EntitySearcher
interface [3] is still using the Entityhub model classes.

'engines/entityhublinking' currently provides the ability to do
'entitylinking' with the Entityhub. As soon as the
'engines/namedentitylinking' is available I would add named entity
linking functionality to that module. In a last step this module will
also move out of the /enhancer component (as already suggested by
STANBOL-805 [4]).


BTW this design was the result of this [2] discussion on the Stanbol
dev mailing list.

best
Rupert



[1] 
http://svn.apache.org/repos/asf/stanbol/trunk/enhancer/engines/entityhublinking/
[2] http://markmail.org/message/nptkntyuthv7wwqh
[3] 
http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entitylinking#entitysearcher
[4] https://issues.apache.org/jira/browse/STANBOL-805


On Tue, Nov 27, 2012 at 11:14 AM, Fabian Christ
 wrote:
> Hi,
>
> enhancement engines in Stanbol can have several names and this is confusing
> myself and very likely our users. Here are some examples that I came across
> when trying to identify the running engines. I started to look at the
> Web-UI and clicked through the OSGi console.
>
> dbpediaLinking (NamedEntityTaggingEngine) ->
> Named Entity Tagging -> Entity Tagging ->
> /engines/entitytagging
>
> entityhubExtraction (EntityLinkingEngine) ->
> Entityhub Linking -> Entityhub Linking ->
> /engines/entityhublinking
>
> Could we simplify this a bit to make it more obvious especially for new
> users what is going on?
>
> Best,
>  - Fabian
>
> --
> Fabian
> http://twitter.com/fctwitt



--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Confused by engines names

2012-11-27 Thread Rupert Westenthaler
Hi

there are also inconsistencies in the names of the OSGi parameters,
default names of Engines ... That's why I would like to make another
0.* release and then change/fix all those things while working towards
the 1.0 release.

best
Rupert

On Tue, Nov 27, 2012 at 8:38 PM, Reto Bachmann-Gmür  wrote:
> Which reminded me that we already discussed once that the artifact names
> are unnecessarily long; I created STANBOL-820. Maybe some other renaming could
> be done with that?
>
> Cheers,
> Reto
>
> On Tue, Nov 27, 2012 at 12:17 PM, Rupert Westenthaler <
> rupert.westentha...@gmail.com> wrote:
>
>> Hi Fabian
>>
>> Short version:
>>
>> I totally agree. Our vocabulary has changed over time, but the Engines
>> still use the names as when they where introduced. Changing them
>> (artifactIds and class names) is dangerous as this does break
>> backwards compatibility. So I would suggest change names only if we
>> can also come up with better implementation/design.
>>
>> Regarding Vocabulary I think we should prefer the terms
>> "EntityLinking" and "NamedEntityLinking" and deprecate all others like
>> "keyword" instead of "entity" or "extraction" or "tagging" instead of
>> "linking".
>>
>> The 'engines/entitylinking' and 'engines/entityhublinking' introduced
>> by STANBOL-733 do already use this new terminology. They also
>> deprecate the 'engines/keywordextraction'.
>>
>> - - -
>>
>> Long version with more background information
>>
>> Regarding the linking of Entities there are currently two different
>> principles:
>>
>> * "NamedEntityLinking": A "NamedEntity" has a 'selected text' AND a
>> 'type'. So the selected text AND the type can be used for linking
>> * "EntityLinking": An "Entity" does only have a 'selected text'. Here
>> linking is only possible based on the selected text.
>>
>> The plan would be to also have two Engine implementations that support
>> those linking models.
>>
>> * 'NamedEntityLinkingEngine' (currently /engines/entitytagging)
>> * 'EntityLinkingEngine' (was /engines/keywordextraction (now
>> deprecated) ; since yesterday  /engines/entitylinking)
>>
>> Those should not have external dependencies (meaning to Stanbol
>> components other than Stanbol Commons, Enhancer module; also not other
>> major frameworks such as Solr or OpenNLP; no calls to external
>> services). That would allow to keep those Engines within the enhancer
>> module but also means that those implementation can not be directly
>> used by the user (as the Service used for linking will be just defined
>> by an Interface without an actual implementation.
>>
>> Because of that there will be "Engines" that are based on the above,
>> but come with adapters to Services that do support the EntityLookup.
>> The default will be implementations based on the StanbolEntityhub, but
>> Stanbol users could also implement versions for their own
>> infrastructure needs.
>>
>> The "EntityhubLinking" module [1] is the first example. When you look
>> at the module you will recognize that it does not contain an single
>> EnhancementEngine implementation. It only provides Entityhub specific
>> implementations of the EntitySearcher interface defined by the
>> "EntityLinkingEngine" and a OSGI component that allows users to
>> configure an EntityLinkingEngine instance that uses the Entityhub to
>> lookup Entities.
>>
>> Current state:
>>
>> Currently we are not yet there. The '/engines/entitytagging' still
>> implements both NamedEntityLinking AND Lookup via the Entityhub. This
>> engine could be replaced by a 'engines/namedentitylinking' that
>> follows the design as described above. The new
>> '/engines/entitylinking' already implements the above design. However
>> it still depends on the Entityhub, because the EntitySearcher
>> interface [3] that is still using the Entityhub Model classes.
>>
>> 'engines/entityhublinking' currently provides the ability to do
>> 'entitylinking' with the Entityhub. As soon as the
>> 'engines/namedentitylinking' is available I would add named entity
>> linking functionality to that module. In a last step this module will
>> also move out of the /enhancer component (as already suggested by
>> STANBOL-805 [4]).
>>
>>
>> BTW this design

Re: Enabling security be default

2012-11-29 Thread Rupert Westenthaler
Hi all

Regarding Security I am missing the following things:

1. HOWTO configure users and passwords: I would like to have the
possibility to do that via the Felix Webconsole (e.g. a dedicated Stanbol
User Management and/or Stanbol Security tab). This is simply because
that will be the place where users will look first. So even if that is
not possible I would suggest adding such a tab that shows a
description of how to do it.

2. User Documentation: On the webpage there should be a dedicated section
for Security: which launchers support it, which bundle lists to include,
how to configure it ...

3. Developer Documentation: How to add higher level Permissions to a
Stanbol Component, with an example and walkthrough. The best would be
an example for an EnhancementEngine.

4. Definition/Implementation of Stanbol Component specific Permissions
in dedicated modules (e.g. a module like o.a.s.enhancer.security) that
contain Permissions (and other useful stuff) relevant for the Stanbol
Enhancer (e.g. Execute Enhancement Engine, Enhance Content for a
Language, Enhance a Content Item with a maximum size ...); see the
rough sketch after this list.

5. Integration tests that test security
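
As a very rough illustration of what point 4 could look like (the class
name and the checked permission are made up for this mail, not an agreed
design):

import java.security.BasicPermission;

/**
 * Hypothetical permission that guards the execution of a single
 * enhancement engine; the permission name is the name of the engine.
 */
public class ExecuteEnginePermission extends BasicPermission {

    public ExecuteEnginePermission(String engineName) {
        super(engineName);
    }
}

// The EnhancementJobManager (or the engine itself) could then check it:
//   SecurityManager sm = System.getSecurityManager();
//   if (sm != null) {
//       sm.checkPermission(new ExecuteEnginePermission(engine.getName()));
//   }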

If those things were available I would feel much better about voting
on Security, because currently my understanding is on a very
abstract level (based on the discussion in the thread already linked
by Fabian [1]).


best
Rupert


[1] http://markmail.org/message/yamwhcla3b2j4onj


--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: Stanbol Namespace Prefix Service (was: How to introduce new features in Stanbol)

2012-11-30 Thread Rupert Westenthaler
On Fri, Nov 30, 2012 at 2:14 PM, Reto Bachmann-Gmür  wrote:
> Hi Fabian and Rupert,
>
> Was there something missing in the service I added for CLEREZZA-222, or why
> are you suggesting a new interface?
>

I had totally forgotten about CLEREZZA-222. I only searched for
STANBOL issues. The reason why I finally started to implement this
service is that the NamespaceEnums are the last part that blocks the
separation of the Enhancer EntityLinking from the Entityhub (see
STANBOL-823).

Is CLEREZZA-222 already implemented? If not, then we could have a
"parser level" solution in Clerezza that is compatible with the
current Jena Parsers AND use STANBOL-823 as a "user level"
implementation that provides different sources for mappings and also
allows managing custom mappings.


WDYT
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: DBPedia Spotlight Enhancer is not working

2012-12-03 Thread Rupert Westenthaler
Hi Rafa,

a quick lookup in the code has shown that SUPPORTED_LANGUAGES is
hardcoded to "en" at the moment. Making this configurable is not a
big deal. Even better would be if Spotlight provided a service
where one can request the supported languages.

I suggest opening a JIRA issue about that. If we go for the "make it
configurable" option, then I can provide a fix later this week.

best
Rupert

On Mon, Dec 3, 2012 at 4:33 PM, Iavor Jelev
 wrote:
> Hi Rafa,
>
> yes, it should. For Spanish - add (or change to) "es". For further
> languages, please refer to:
>
> http://stanbol.apache.org/docs/trunk/components/enhancer/engines/langidengine.html
>
> cheers,
> Iavor
>
> Am 03.12.2012 16:28, schrieb Rafa Haro:
>> Hi Iavor,
>>
>> thanks for your quick response. Until have it configurable, to get by
>> now, would it work just adding new languages to this parameter?
>>
>> Thanks. Regards
>>
>> El 03/12/12 15:58, Iavor Jelev escribió:
>>> Hi Rafa,
>>>
>>> you are correct, that's the cause. At the time we contributed it, we
>>> aggreed on english. The parameter can be found in [path to
>>> engines]/dbpspotlight/Constants.java
>>>
>>> The parameter is called SUPPORTED_LANGUAGES.
>>>
>>> I think it is time we make that configurable.
>>>
>>> best,
>>> Iavor
>>>
>>> Am 03.12.2012 14:37, schrieb Rafa Haro:
>>>> Hi again,
>>>>
>>>> In the post about DBpedia Spotlight and Apache Stanbol Integration by
>>>> Iavor Jelev [1] you can read exactly the following:
>>>>
>>>> */$chainURL/dbpspotlight/*/
>>>> //This chain replicates the functionality of dbpspotlightannotate, by
>>>> chaining dbpspotlightspot and dbpspotlightdisambiguate. Please note that
>>>> langidis run first, and only english texts are processed. In the near
>>>> future, DBpedia Spotlight will support multiple languages and this
>>>> constraint will be adapted accordingly./
>>>>
>>>> Is this maybe a hard-coded restriction?
>>>>
>>>> Regards
>>>>
>>>> [1]
>>>> http://blog.iks-project.eu/dbpedia-spotlight-integration-in-apache-stanbol-2/
>>>>
>>>>
>>>>
>>>> El 03/12/12 09:39, Rafa Haro escribió:
>>>>> Hi Rupert,
>>>>>
>>>>> As always, thanks for your help. Inspecting the logs, part of the
>>>>> mystery has clarified. Basically, the problem is the language. I'm
>>>>> trying to test DBPedia Spotlight enhancer with Spanish texts. So, I
>>>>> did a request to the Stanbol Dev Server with a Spanish text and got
>>>>> the same result. Then I configured again my local Stanbol to work with
>>>>> a local installation of DBPedia Spotlight, try again with a Spanish
>>>>> text and this time I can read the following messages in the log file:
>>>>> /
>>>>> //[Thread-114]
>>>>> org.apache.stanbol.enhancer.engines.langdetect.LanguageDetectionEnhancementEngine
>>>>>
>>>>> language identified: [es:0.96063192582]//
>>>>> //03.12.2012 08:52:55.386 *INFO* [Thread-116]
>>>>> org.apache.stanbol.enhancer.engines.dbpspotlight.utils.SpotlightEngineUtils
>>>>>
>>>>> DBpedia Spotlight can not process ContentItem
>>>>> 
>>>>> because language es is not supported (supported: [en])/
>>>>>
>>>>> So far, I haven't been able to find anything to change supported
>>>>> languages for the enhancer. I suppose that it should be possible to do
>>>>> that, am I wrong??
>>>>>
>>>>> Thanks. Regards
>>>>>
>>>>> El 01/12/12 15:19, Rupert Westenthaler escribió:
>>>>>> On Fri, Nov 30, 2012 at 2:14 PM, Rafa Haro  wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I was trying to test the DBPedia Spotlight enhancer with a local
>>>>>>> installation of DBPedia Spotlight in an out-of-the-box Stanbol
>>>>>>> from the
>>>>>>> repository. So, I changed the URL of the service in
>>>>>>> dbpspotlightannotate
>>>>>>> engine to my point to my local service endpoint. When I tested it,
>>>>>>> the
>>>>>>> enhancement chain always stopped at language detection engine. I
>>>&g

Re: DBPedia Spotlight Enhancer is not working

2012-12-03 Thread Rupert Westenthaler
Hi Reto, Rafa

@Rafa: I do not think this is related to the spotlight engine.

@Reto: Could this be related to the Stanbol Security? Is this already
active? Andreas Gruber was asking me about the same Exception earlier
today.

On Mon, Dec 3, 2012 at 7:09 PM, Rafa Haro  wrote:
> //03.12.2012 18:54:25.738 *WARN* [1548445156@qtp-558009892-5]
> org.apache.felix.http.jetty /enhancer/chain/dbpedia-spotlight
> (java.lang.RuntimeException: java.security.PrivilegedActionException:
> java.io.IOException: Stream closed) java.lang.RuntimeException:
> java.security.PrivilegedActionException: java.io.IOException: Stream
> closed//

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: DBPedia Spotlight Enhancer is not working

2012-12-04 Thread Rupert Westenthaler
Hi


Can you provide additional information on how to reproduce this? I
tried some texts that give no results but those have not triggered
this.

best
Rupert

On Wed, Dec 5, 2012 at 12:06 AM, Pablo N. Mendes  wrote:
> Just for clarity, the log message tells you (perhaps not clearly enough)
> that the output of spotlight is empty:
>
> "//Información: Removed 1 (100 percent) spots using spotSelector
> ChainedSelector//"
>
> However, there seems to be still a problem on Stanbol's side when no
> results are returned. Even with perfectly installed DBpedia Spotlight, it
> is conceivable that some piece of text will have no annotations (rare, but
> possible). The enhancement engine should not break from that. From Rupert's
> message I understand that he's on top of this issue, but I just wanted to
> make sure that this is clear.
>
> Cheers
> Pablo
> On Dec 4, 2012 5:58 PM, "Rafa Haro"  wrote:
>
>> Hi Rupert and Reto,
>>
>> Just wanted to let you know that finally the problem was produced by an
>> empty output of DBpedia Spotlight due to a bad configuration in its side.
>>
>> Thanks for your help again
>>
>> Regards
>>
>> El 03/12/12 19:20, Rupert Westenthaler escribió:
>>
>>> Hi Reto, Rafa
>>>
>>> @Rafa: I do not thing this is related to the spotlight engine
>>>
>>> @Reto: Could this be related to the Stanbol Security. Is this already
>>> active. Andreas Gruber was asking me about the same Exception earlier
>>> today.
>>>
>>> On Mon, Dec 3, 2012 at 7:09 PM, Rafa Haro  wrote:
>>>
>>>> //03.12.2012 18:54:25.738 *WARN* [1548445156@qtp-558009892-5]
>>>> org.apache.felix.http.jetty /enhancer/chain/dbpedia-**spotlight
>>>> (java.lang.RuntimeException: java.security.**PrivilegedActionException:
>>>> java.io.IOException: Stream closed) java.lang.RuntimeException:
>>>> java.security.**PrivilegedActionException: java.io.IOException: Stream
>>>> closed//
>>>>
>>> best
>>> Rupert
>>>
>>> --
>>> | Rupert Westenthaler rupert.westentha...@gmail.com
>>> | Bodenlehenstraße 11 ++43-699-11108907
>>> | A-5500 Bischofshofen
>>>
>>
>> This message should be regarded as confidential. If you have received this
>> email in error please notify the sender and destroy it immediately.
>> Statements of intent shall only become binding when confirmed in hard copy
>> by an authorised signatory.
>>
>> Zaizi Ltd is registered in England and Wales with the registration number
>> 6440931. The Registered Office is 222 Westbourne Studios, 242 Acklam Road,
>> London W10 5JJ, UK.
>>
>>



-- 
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: DBPedia Spotlight Enhancer is not working

2012-12-06 Thread Rupert Westenthaler
On Wed, Dec 5, 2012 at 1:09 PM, Rafa Haro  wrote:
> Hi Rupert,
>
> I haven't been able to reproduce it using online dbpedia spotlight web
> service. The only way I know to trigger the exception is using a local
> installation of DBpedia Spotlight and loading a wrong spotting dictionary. I
> know that sounds very specific but I have tried other options and never get
> that exception again

Ok. I was just interested, because when we integrated Stanbol with the
LinkedMediaFramework we were also seeing "Caused by:
java.io.IOException: Stream closed" exceptions from time to time. We
were never completely sure about their cause (as they might also be
caused by the request being closed). However your issue suggests
that those exceptions can be indirectly caused by some other exception
within Stanbol, which is definitely interesting and might be an
indication of some bug somewhere in the Stanbol commons.web modules.

best
Rupert

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen


Re: several integration questions

2012-12-08 Thread Rupert Westenthaler
On Sat, Dec 8, 2012 at 7:58 AM, Alexey Kudinov  wrote:
> Hi,
>
> I'm new to stanbol (I do have some experience with SOLR), and a few things
> are not clear from the wiki:
>
> 1.   Can I integrate Stanbol EntityHub with an external SOLR instance?
>

Yes this is possible. The {name}.solrindex.zip files are compressed
Solr core directory structures. Just unpack them and install them on
your Solr server. If you want to start from an empty Site you can use
[1].

When the Solr Core is available on your Solr server you need to
configure the URL to the RESTful API in the "Solr Index/Core"
(org.apache.stanbol.entityhub.yard.solr.solrUri) field of the Solr
Yard.


[1] 
https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/yard/solr/src/main/resources/solr/core/default.solrindex.zip
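
For illustration, the same configuration can also be created
programmatically via the OSGi ConfigurationAdmin (the factory PID and
the core URL below are assumptions made up for this mail; only the
solrUri property name is taken from above, and in practice you can
simply enter the value in the Felix Webconsole):

import java.util.Hashtable;

import org.osgi.service.cm.Configuration;
import org.osgi.service.cm.ConfigurationAdmin;

public class ExternalSolrYardConfigSketch {

    /** Creates a SolrYard configuration pointing to a remote Solr core. */
    public static void configure(ConfigurationAdmin configAdmin) throws Exception {
        // Assumed factory PID of the SolrYard component -- verify it in the
        // Felix Webconsole of your Stanbol instance.
        Configuration cfg = configAdmin.createFactoryConfiguration(
                "org.apache.stanbol.entityhub.yard.solr.impl.SolrYard", null);
        Hashtable<String, Object> props = new Hashtable<String, Object>();
        // Property name as given above; value points to the external core.
        props.put("org.apache.stanbol.entityhub.yard.solr.solrUri",
                "http://my-solr-host:8983/solr/myEntityCore");
        cfg.update(props);
    }
}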


> 2.   My application should run in the enterprise environment, and my
> managed site requires user information, mainly for security purposes. I
> already have a fine-grained security module for SOLR (reflecting user
> repositories). How can I pass relevant user information through Stanbol
> EntityHub API? I know that I can pass-by the EntityHub and call Solr API,
> but it would be the last resort.
>

I do not fully understand this question. But maybe the "Multiple Yard
Layout" of the SolrYard could help you.

Basically the SolrYard supports the creation of multiple instances
that access the same Solr Core. To activate this you need to enable
the "Multiple Yard Layout"
(org.apache.stanbol.entityhub.yard.solr.multiYardIndexLayout). What
this does is that the SolrYard will add an additional field '_domain'
and store the name of the SolrYard as its value for all Entities stored
by this SolrYard. Also all queries will use this as an additional
constraint.

This feature was introduced to allow the storage of multiple
(typically small) vocabularies within the same Solr Core, but maybe it
could also be useful for your use case.
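
To illustrate the effect of the multi yard layout: every lookup against
the shared core behaves as if an additional filter on the '_domain'
field was added. A SolrJ sketch of the equivalent query (the core URL,
the yard name and the label field are placeholders; only the '_domain'
field is taken from above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MultiYardQuerySketch {

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/sharedCore");
        SolrQuery query = new SolrQuery("label:Paris");
        // Mirrors the constraint the SolrYard adds automatically when the
        // multiYardIndexLayout is enabled:
        query.addFilterQuery("_domain:\"myYardName\"");
        QueryResponse rsp = solr.query(query);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}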

best
Rupert

> Thanks,
>
> Alexey
>



--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen

