2011/2/15 Rupert Westenthaler <[email protected]>:
>> As soon as you have such an howto ready I would be glad to write a
>> bunch of pig scripts to build indexes for topics (rather than
>> entities) so as to be able to perform document level topic assignment
>> rather than occurrence-based entity lookups.
>>
> OK I do not really understand what you mean by that.
Ok, let me explain. The old autotagging enhancer includes a tool that runs "more like this" similarity queries to find the main topic of a complete document or paragraph (without first using OpenNLP to find occurrences of names).

To make this usable we need to build a topic index from the top SKOS categories available in DBpedia. Each category should contain a full-text indexed field holding the aggregated text of the abstracts of the most popular entities of that category. That way "more like this" will be able to tell that a document is about "Economy of India" if it sees statistically significant terms such as "rupee", "Tata Nano", "GDP", "Bangalore".

To build such an index we need to compute joins. Doing that with a triple store would take a lot of time, so IMHO it is best to use Apache Pig scripts run on a cluster of machines on EC2. This would be similar to what I did to build new OpenNLP models here:

http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
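To make the join described above more concrete, here is a minimal Python sketch on toy in-memory data. The three inputs (article-to-category links, abstracts, and a popularity score such as an incoming link count), the `top_n` cutoff and all names are assumptions made for illustration, not the actual DBpedia dump formats. A real run would express the same join and group-by steps as Pig scripts over the full dumps.

```python
from collections import defaultdict

# Hypothetical toy inputs standing in for the DBpedia dumps.
article_categories = [
    ("Tata_Nano", "Economy of India"),
    ("Indian_rupee", "Economy of India"),
    ("Bangalore", "Economy of India"),
    ("Bangalore", "Cities in India"),
]
abstracts = {
    "Tata_Nano": "The Tata Nano is a city car built by Tata Motors.",
    "Indian_rupee": "The rupee is the official currency of India.",
    "Bangalore": "Bangalore is a major hub of the Indian IT industry.",
}
# Made-up popularity scores (e.g. incoming link counts), for illustration only.
popularity = {"Tata_Nano": 120, "Indian_rupee": 900, "Bangalore": 1500}


def build_topic_documents(article_categories, abstracts, popularity, top_n=2):
    """Join categories with abstracts, keeping the top_n most popular per category."""
    by_category = defaultdict(list)
    for article, category in article_categories:
        if article in abstracts:  # inner join: skip articles without an abstract
            by_category[category].append(article)

    topics = {}
    for category, articles in by_category.items():
        # Rank the articles of this category by decreasing popularity.
        ranked = sorted(articles, key=lambda a: popularity.get(a, 0), reverse=True)
        # Concatenate the abstracts into the text field to be indexed.
        topics[category] = " ".join(abstracts[a] for a in ranked[:top_n])
    return topics


topic_docs = build_topic_documents(article_categories, abstracts, popularity)
for category, text in sorted(topic_docs.items()):
    print(category, "=>", text)
```

Each value of `topic_docs` is the text that would go into the full-text field of one topic, so that a "more like this" query built from an incoming document can be matched against it.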
