2011/11/23 valentina presutti <[email protected]>:
> Hi Olivier,
> as I said we have some work done related to automatic categorization.
> In the meantime, I have collected some documentation that you may want to
> have a look at.
> We are willing to bring this method into Stanbol either by reusing the
> software directly or by re-implementing the methods.
> The only thing we ask is that anything that comes from it be open :)
> I am pretty sure that some services can be reused and integrated, hence
> you're welcome to review it and ask us any questions or for support. We are
> happy to discuss solutions that can be carried out collaboratively. If there
> is space for this in the hackathon, Alberto can join you.
> At [1] you can find a diagram that describes the workflow implemented.
> Please note that the software addresses NER, terminology extraction and
> identity resolution, and relies on some of these elaborations for performing
> automatic categorization.
> We use the Alchemy API, which is commercial, but this is not a mandatory
> piece of the component; it can be replaced with Stanbol Enhancers.
> Of course, the performance of this step affects the overall performance.
> The exploitation of identity resolution makes this approach slightly
> different from yours, but still I think we can find a good hybrid for
> improving performance.
> [2] contains a description of the main functionalities and the methods
> implemented. You will notice that the index is obtained through customizable
> SPARQL queries; we see here a possible integration with the EntityHub.
> [3] is the javadoc.
> Val
> [1]
> http://wit.istc.cnr.it/API/WikiFierAPI/WikiFierFlowChart/WikiFierFlowChart_Page-1.html
> [2] http://stlab.istc.cnr.it/stlab/STLabWikifier
> [3] http://wit.istc.cnr.it/API/WikiFierAPI/javadoc/index.html
> On Nov 18, 2011, at 4:53 PM, Olivier Grisel wrote:

Ok, thanks for the links. I think I will go on with my version first: it
might be a bit more complicated to build the initial Solr index, but it is
probably much faster at classification time (a single full-text query,
albeit a large one) vs. many steps involving sub-queries in the system you
describe.

Once implemented, it would be worth comparing the results on the same
dataset to make some qualitative evaluation of the output.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
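
To make the "single full-text query" idea concrete, here is a minimal
in-memory sketch in Python. It is not the Stanbol or Solr implementation: the
function names (`build_index`, `classify`), the TF-IDF weighting and the toy
topics are all assumptions chosen for illustration. The point is only the
shape of the trade-off: an index of one weighted term vector per topic is
built once up front, and classification is then one pass that scores the
whole input text against every topic.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase ASCII words only.
    return re.findall(r"[a-z]+", text.lower())


def build_index(topic_texts):
    """Offline step: one normalized TF-IDF vector per topic."""
    counts = {topic: Counter(tokenize(text)) for topic, text in topic_texts.items()}
    n_topics = len(counts)
    # Document frequency: in how many topics does each term occur?
    doc_freq = Counter(term for c in counts.values() for term in c)
    idf = {t: math.log((1 + n_topics) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    index = {}
    for topic, c in counts.items():
        vec = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        index[topic] = {t: w / norm for t, w in vec.items()}
    return index, idf


def classify(text, index, idf, top_n=3):
    """Classification step: one large bag-of-words query over all topics."""
    query = {t: tf * idf[t] for t, tf in Counter(tokenize(text)).items() if t in idf}
    scores = {
        topic: sum(w * vec.get(t, 0.0) for t, w in query.items())
        for topic, vec in index.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Keep only topics that share at least one term with the text.
    return [(topic, score) for topic, score in ranked[:top_n] if score > 0]


if __name__ == "__main__":
    # Hypothetical toy topics; a real index would hold one entry per category.
    topics = {
        "Sports": "football match goal team player league",
        "Music": "guitar song album band concert melody",
    }
    index, idf = build_index(topics)
    print(classify("The band played a new song at the concert", index, idf))
```

In a real deployment the index would live in Solr and the scoring would be
done by the search engine itself; the sketch only mirrors the one-query
structure so that it can be compared against a multi-step pipeline on the
same dataset.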
