Hi Lewis,

Thanks for your explanations. I will try to describe more precisely what I
need:

Suppose I want to index the web page http://example.com/persons1.html.
It contains a list of persons, all with the same structure. I already have
a Nutch plugin that extracts information about one person and adds it to the
index fields (say firstname, lastname, ...). What I need now (to make my Solr
queries possible) is that each document in the index contains information
about only one person, not about all persons in persons1.html.
Ideally, after Nutch has crawled the document, a plugin of mine inspects the
HTML content and decides where to split it into several pieces (for example,
each person is within a div with class=person).
From this one document I would then like to create several documents that
all have the same URL, but whose content is only the part my plugin extracted
(namely the information about one person). My already existing plugin then
extracts the meta information about that one person. So if persons1.html
contains information about 10 persons, crawling it should generate 10
documents in the index, each with separate content but the same URL.
So I do not want to remove clutter, but it is also not exactly case [1].
I would like to have one indexed document for each <paragraph>...</paragraph>.
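
To make the splitting step concrete, here is a minimal sketch in plain Java.
It does not use any Nutch API. The class name PersonSplitter, the method
names and the "#person-N" key suffix are made up for illustration, and it
assumes that each person sits in a <div class="person">...</div> block that
contains no nested div. Whether the indexer would keep such keys apart as
separate documents is exactly the part I am unsure about.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
final class PersonSplitter {

    // Assumption: <div class="person">...</div> with no nested <div> inside.
    private static final Pattern PERSON_DIV = Pattern.compile(
            "<div\\s+class\\s*=\\s*[\"']person[\"']\\s*>(.*?)</div>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the inner HTML of every person block, in page order.
    static List<String> split(String html) {
        List<String> fragments = new ArrayList<>();
        Matcher m = PERSON_DIV.matcher(html);
        while (m.find()) {
            fragments.add(m.group(1).trim());
        }
        return fragments;
    }

    // Maps one distinct key per person (url + "#person-N") to its fragment.
    static Map<String, String> toSubDocuments(String url, String html) {
        Map<String, String> docs = new LinkedHashMap<>();
        List<String> fragments = split(html);
        for (int i = 0; i < fragments.size(); i++) {
            docs.put(url + "#person-" + (i + 1), fragments.get(i));
        }
        return docs;
    }
}
```

Calling PersonSplitter.toSubDocuments("http://example.com/persons1.html", html)
on a page with 10 person blocks would give 10 entries, one per person. A
regular expression is only good enough for flat markup like this; real pages
would probably need a proper HTML parser.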

The big question for me now is: where can I access the current document,
count how many persons are in the content, copy it n times with each copy
containing one person, and let the parsers that I already have deal with
each document separately?

I tried to add new information to the ParseResult within my filter() method
with something like this:

    parseResult.put(content.getUrl(), new ParseText("myParseText2"),
        new ParseData(new ParseStatus(ParseStatus.SUCCESS), "myTitle2",
            new Outlink[0], content.getMetadata()));

But the Solr index contains only one document, holding the information above.
The original one is overwritten. If I change the URL to something like
content.getUrl() + "#123", the original information is added to the index,
but again there is only one document.

How would you tackle this problem?


--
View this message in context: 
http://lucene.472066.n3.nabble.com/indexing-hierarchical-data-schema-design-tp3052894p3079208.html
Sent from the Nutch - User mailing list archive at Nabble.com.
