Re: [GSOC Idea] Disambiguation algorithm for Apache Stanbol

Rupert Westenthaler Wed, 01 May 2013 01:15:49 -0700

On Tue, Apr 30, 2013 at 6:03 PM, Rafa Haro <[email protected]> wrote:
> Hi Rupert, Antonio, all
>
> On 27/04/13 16:35, Rupert Westenthaler wrote:
>
>>> For this, I would like to discuss some topics about the proposal:
>>>
>>> - Knowledge Base: I have decided to stick first to Freebase, because it
>>> has a REST API allowing 100k calls per day for read and 10k for write.
>>> Besides the REST API, an alternative could be to integrate the whole
>>> Freebase graph in Stanbol and use their Java API to manage it. Ideally,
>>> the management framework should be valid for other knowledge bases such
>>> as Wikipedia or DBpedia.
>>>
>>
>> I recently created my first Freebase index for Stanbol (see
>> STANBOL-1014 for the Indexing tool). A first test on an index with all
>> Freebase Topics and all languages has shown very nice results! IMO
>> Freebase is currently for sure the better choice over DBpedia. However
>> one needs to see/wait how Freebase compares to the Wikidata project
>> [4] that only recently entered phase 2.
>>
>> Designing disambiguation in a way that it can be applied to other
>> datasets would be for sure a great bonus. But given the good results
>> one can get with Freebase I would even be very interested if the
>> results would only work on Freebase ^^
>
> Following Rupert's idea, I agree that maybe the best approach is to develop
> a Knowledge Base manager within Stanbol for disambiguation purposes. IMO, it
> would be a mistake to try to come up with a universal solution. I suppose
> that one wants to generate one's knowledge base differently according to
> custom data domains. For instance, a graph representation is more suitable
> for "real world" knowledge bases, while most domains are well covered by a
> taxonomy structure.
>
> It would be important to develop tools that allow Stanbol to move data
> between these knowledge bases and EntityHub sites. Of course, a good way to
> learn how to do that could be to first develop a nice solution only for
> Freebase.
>


There is already the "2-layered storage infrastructure" [1] for the
Contenthub, developed mainly by Suat and Anil in its own branch. In
this trunk there is also a new commons.semanticindexing [2] package.
This architecture would allow using a "knowledge base" as the
"Indexing Source" and an Entityhub Site as the "Indexing Destination".
So it goes exactly in the proposed direction.

I have long planned to adapt this for the Entityhub, but development
in this branch has not shown much progress recently and I also have
higher-priority tasks ATM.

To be clear: I would not recommend using this for a GSoC project, but
rather using the available APIs of the Entityhub, as this is clearly off
topic for disambiguation. However, the results of the proposed GSoC
project could be a nice application/validation case for [1].


[1] https://issues.apache.org/jira/browse/STANBOL-471
[2] https://issues.apache.org/jira/browse/STANBOL-701

best
Rupert


--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
