Hi Lewis, thanks for your explanations. I will try to describe more precisely what I need:
Suppose I want to index the webpage http://example.com/persons1.html. It contains a list of persons, all with the same structure. I already have a Nutch plugin that extracts information about one person and adds it to the index fields (say firstname, lastname, ...). What I need now, to make my Solr queries possible, is for each document in the index to contain information about only one person, not everything in persons1.html.

Ideally, after Nutch has crawled the document, a plugin inspects the HTML content and decides where to split it into pieces (for example, each person is within a div with class=person). From this one document I would then like to create several documents that all have the same URL, but whose content is only the part my plugin extracted, namely the information about one person. My existing plugin then extracts the meta information about that person. So if persons1.html contains information about 10 persons, crawling it should produce 10 documents in the index, each with separate content but the same URL.

So I do not want to remove clutter, but it is also not exactly case [1]. I would like to have an indexed document for each <paragraph>...</paragraph>.

The big question for me now is: where can I access the current document, count how many persons are in the content, copy it n times so that each copy contains one person, and let the parsers I already have deal with each copy separately?

I tried to add new information to the ParseResult within my filter() method with something like this:

    parseResult.put(content.getUrl(),
        new ParseText("myParseText2"),
        new ParseData(new ParseStatus(ParseStatus.SUCCESS), "myTitle2",
            new Outlink[0], content.getMetadata()));

But the Solr index only contains one document, with the above information. The original one is overwritten.
If I change the URL in that call to something like content.getUrl() + "#123", the original information is added to the index, but again there is only one document.

How would you tackle this problem?
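
To make the splitting step concrete, here is a rough sketch of what I have in mind, written as plain Java without any Nutch classes. Everything in it is an assumption for illustration: the class name PersonSplitter, the "#person-<n>" key suffix, and the simplification that each person sits in a <div class="person"> with no nested div elements. It only shows how one page could be turned into n fragments with n distinct keys; it is not a working Nutch plugin.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PersonSplitter {

    // Assumption: each person is wrapped in <div class="person">...</div>
    // and these blocks contain no nested <div>. A real plugin would more
    // likely walk the DOM that the HTML parser already provides.
    private static final Pattern PERSON_DIV = Pattern.compile(
            "<div\\s+class=\"person\"\\s*>(.*?)</div>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the inner HTML of every person block, in document order. */
    public static List<String> split(String html) {
        List<String> persons = new ArrayList<>();
        Matcher m = PERSON_DIV.matcher(html);
        while (m.find()) {
            persons.add(m.group(1).trim());
        }
        return persons;
    }

    /**
     * Maps one distinct key per person (url + "#person-<n>", a hypothetical
     * naming scheme) to that person's fragment.
     */
    public static Map<String, String> toSubDocuments(String url, String html) {
        Map<String, String> docs = new LinkedHashMap<>();
        List<String> persons = split(html);
        for (int i = 0; i < persons.size(); i++) {
            docs.put(url + "#person-" + (i + 1), persons.get(i));
        }
        return docs;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<div class=\"person\">Ada Lovelace</div>"
                + "<div class=\"person\">Alan Turing</div>"
                + "</body></html>";
        toSubDocuments("http://example.com/persons1.html", html)
                .forEach((key, text) -> System.out.println(key + " -> " + text));
    }
}
```

Each key/fragment pair would then go into its own parseResult.put(...) call instead of reusing content.getUrl(). My guess is that entries with the same key replace each other, which would explain the overwrite I see, but I do not know whether Nutch and Solr would keep such fragment keys as separate documents all the way into the index.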

