2011/2/15 Rupert Westenthaler <[email protected]>:
>> As soon as you have such an howto ready I would be glad to write a
>> bunch of pig scripts to build indexes for topics (rather than
>> entities) so as to be able to perform document level topic assignment
>> rather than occurrence-based entity lookups.
>>
> OK I do not really understand what you mean by that.

OK, let me explain. In the old autotagging enhancer, there is a tool
that runs "more like this" similarity queries to find the main topic
of a complete document or paragraph (without first using OpenNLP to
find occurrences of names). To make this usable we need to build a
topic index from the top SKOS categories available in DBpedia. Each
category should have a full-text indexed field holding the aggregated
text of the abstracts of the most popular articles for entities in
that category. That way "more like this" can tell that a document is
about "Economy of India" if it sees statistically significant terms
such as "rupee", "Tata Nano", "GDP", "Bangalore".
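
To make the idea concrete, here is a minimal sketch of document-level
topic assignment in plain Python. It is not the enhancer's actual
implementation: it swaps in a simple TF-IDF cosine similarity as a
stand-in for a "more like this" query, and the topic texts, the sample
document and all function names are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_topic_index(topic_texts):
    # topic_texts maps a topic name to its aggregated abstract text.
    term_counts = {t: Counter(tokenize(txt)) for t, txt in topic_texts.items()}
    n_topics = len(term_counts)
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    # Smoothed inverse document frequency: terms seen in few topics weigh more.
    idf = {term: math.log((1 + n_topics) / (1 + df)) + 1.0
           for term, df in doc_freq.items()}
    vectors = {t: {term: c * idf[term] for term, c in counts.items()}
               for t, counts in term_counts.items()}
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_topic(document, vectors, idf):
    # Terms unknown to the index are ignored, as they carry no topic signal.
    counts = Counter(tokenize(document))
    doc_vec = {term: c * idf[term] for term, c in counts.items() if term in idf}
    return max(vectors, key=lambda t: cosine(doc_vec, vectors[t]))


# Hypothetical aggregated texts standing in for real DBpedia abstracts.
topic_texts = {
    "Economy of India": (
        "The rupee is the currency of India. Tata Nano is a car made by "
        "Tata Motors. Bangalore is a technology hub contributing to GDP."
    ),
    "Association football": (
        "Football is played by two teams of eleven players. The goalkeeper "
        "defends the goal. The World Cup is held every four years."
    ),
}

vectors, idf = build_topic_index(topic_texts)
doc = "Sales of the Tata Nano rose in Bangalore as the rupee strengthened."
print(best_topic(doc, vectors, idf))
```

No occurrence detection is involved: the whole document is compared
against each topic's aggregated text, and the terms that are rare
across topics drive the match.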

To build such an index we need to compute joins between the DBpedia
categories, their entities and the entities' abstracts. That would
take a lot of time with a triple store, so it's best IMHO to use
Apache Pig scripts run on a cluster of machines on EC2. This would
be similar to what I did to build new OpenNLP models here:

  
http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html
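
As a rough illustration of the kind of join involved, here is a small
in-memory sketch in Python rather than Pig. It is only a sketch under
assumptions: the input shapes, the popularity score (for instance an
inbound link count), the top_n cutoff and the function name are all
hypothetical, and a real run over the DBpedia dumps would need the
distributed equivalent of these steps.

```python
from collections import defaultdict


def build_topic_documents(entity_categories, entity_abstracts,
                          entity_popularity, top_n=2):
    # Join (entity, category) pairs with abstracts: entities without
    # an abstract are dropped, as in an inner join.
    by_category = defaultdict(list)
    for entity, category in entity_categories:
        if entity in entity_abstracts:
            by_category[category].append(entity)

    topic_docs = {}
    for category, entities in by_category.items():
        # Most popular first; ties broken by name to keep output stable.
        ranked = sorted(entities,
                        key=lambda e: (-entity_popularity.get(e, 0), e))
        # Aggregate the abstracts of the top_n entities into one text field.
        topic_docs[category] = " ".join(entity_abstracts[e]
                                        for e in ranked[:top_n])
    return topic_docs


# Hypothetical sample data, not taken from DBpedia.
entity_categories = [
    ("Tata_Nano", "Economy of India"),
    ("Indian_rupee", "Economy of India"),
    ("Bangalore", "Economy of India"),
    ("FIFA_World_Cup", "Association football"),
]
entity_abstracts = {
    "Tata_Nano": "The Tata Nano is a city car.",
    "Indian_rupee": "The rupee is the currency of India.",
    "FIFA_World_Cup": "The World Cup is a football competition.",
}
entity_popularity = {"Tata_Nano": 20, "Indian_rupee": 50, "FIFA_World_Cup": 90}

topic_docs = build_topic_documents(entity_categories, entity_abstracts,
                                   entity_popularity)
for category, text in sorted(topic_docs.items()):
    print(category, "->", text)
```

Each value in topic_docs is the text that would go into the full-text
indexed field of the corresponding category.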

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
