Please don't mail me further.
Thanks

On 6/21/11, Khang Ich <[email protected]> wrote:
> Hi,
>
> how do you use Solr to index your "documents" ?
>
>
> The big question for me now is where can I access the current document,
> count how many persons are in the content, copy it n times, each containing
> one person, and let the parsers that I already have deal with
> each document separately?
>
>
> And what do you mean by
>
> "But the Solr index only contains one document having the above
> information. The original one is overwritten."
>
>
> I'm not very sure about your problem. IMO the easiest way is this: you have
> a list of ten people, so just construct ten documents and write them into
> XML files. Each person's data is written into one specific file, with file
> names generated from an id or any random text to make them unique.
>
> So for person 1 you have 1.xml, for person 2 you have 2.xml ... and so on.
>
> Finally you index all the produced documents as in the example: post.sh
>
> -- Khang
>
> On Sat, Jun 18, 2011 at 3:58 PM, jasimop <[email protected]> wrote:
>
>> Hi Lewis,
>>
>> thanks for your explanations, I will try to describe more precisely what
>> I need:
>>
>> Suppose I have the webpage http://example.com/persons1.html that I want
>> to index.
>> It contains a list of persons, all having the same structure. I already
>> have a nutch plugin that allows me to extract information about one
>> person and add it to the index fields (say firstname, lastname, ...).
>> What I now need (to make my solr queries possible) is that one document
>> in the index contains only information about one person, not everything
>> contained in persons1.html.
>> Ideally, after nutch crawls the document, I have a plugin that inspects
>> the html content. My plugin then decides where to split the content into
>> the several pieces (something like each person is within a div with
>> class=person).
>> I would now like to create, out of this one document, several documents
>> that all have the same url but whose content is only the part my plugin
>> extracted (namely the information about one person). Then my already
>> existing plugin extracts the meta information about one person. So if
>> persons1.html contained information about 10 persons, in the index I
>> would like to have 10 documents, each with separate content but the same
>> url.
>> So crawling persons1.html should generate 10 documents in the index.
>> So I do not want to remove clutter, but it is also not exactly like case
>> [1]. I would like to have an indexed document for each
>> <paragraph>...</paragraph>.
>>
>> The big question for me now is where can I access the current document,
>> count how many persons are in the content, copy it n times, each
>> containing one person, and let the parsers that I already have deal with
>> each document separately?
>>
>> I tried to add new information to the ParseResult within my filter()
>> method with something like this:
>>
>> parseResult.put(content.getUrl(), new ParseText("myParseText2"),
>>     new ParseData(new ParseStatus(ParseStatus.SUCCESS), "myTitle2",
>>         new Outlink[0], content.getMetadata()));
>>
>> But the Solr index only contains one document having the above
>> information. The original one is overwritten. If I change the URL to
>> something like content.getUrl()+"#123", the original information
>> is added to the index, but again only one document.
>>
>> How would you tackle this problem?
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/indexing-hierarchical-data-schema-design-tp3052894p3079208.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
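Khang's per-person XML suggestion above can be sketched as follows. This is a minimal illustration, not the thread's actual code: the `Person` record and the field names `id`, `firstname`, `lastname` are assumptions and would need to match your Solr schema. It writes one `<add><doc>` file per person (1.xml, 2.xml, ...) ready for the post.sh example script.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class PersonDocWriter {
    // Hypothetical person data; in practice this comes from the parsed page.
    record Person(String firstname, String lastname) {}

    // Escape the XML special characters so field values cannot break the markup.
    static String esc(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                .replace("\"", "&quot;").replace("'", "&apos;");
    }

    // Build one Solr <add><doc> payload for a single person.
    static String toSolrXml(String id, Person p) {
        return "<add>\n  <doc>\n"
             + "    <field name=\"id\">" + esc(id) + "</field>\n"
             + "    <field name=\"firstname\">" + esc(p.firstname()) + "</field>\n"
             + "    <field name=\"lastname\">" + esc(p.lastname()) + "</field>\n"
             + "  </doc>\n</add>\n";
    }

    public static void main(String[] args) throws IOException {
        List<Person> persons = List.of(
            new Person("Ada", "Lovelace"),
            new Person("Alan", "Turing"));
        Path dir = Files.createTempDirectory("persondocs");
        // One file per person, named 1.xml, 2.xml, ... so the names are unique.
        for (int i = 0; i < persons.size(); i++) {
            Path f = dir.resolve((i + 1) + ".xml");
            Files.writeString(f, toSolrXml(String.valueOf(i + 1), persons.get(i)));
        }
    }
}
```

Each generated file is then indexed with the example post.sh script (or the Solr update handler), giving one index document per person.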
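The splitting step jasimop describes (one sub-document per `<div class="person">`) can be sketched in plain Java, independent of the Nutch API. The class and method names here are hypothetical, the regex assumes flat, non-nested person divs, and note that giving each piece a unique url suffix only keeps the ParseResult keys distinct; as the thread reports, whether Solr keeps all sub-documents still depends on the schema's uniqueKey field.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PersonSplitter {
    // Matches one flat <div class="person">...</div> block; DOTALL lets the
    // content span multiple lines. Nested divs would need a real HTML parser.
    private static final Pattern PERSON_DIV =
        Pattern.compile("<div class=\"person\">(.*?)</div>", Pattern.DOTALL);

    // Map of unique sub-url -> content of one person div. The "#person1",
    // "#person2", ... suffixes keep the keys unique so one sub-document
    // does not overwrite another when each is put into the ParseResult.
    static Map<String, String> split(String baseUrl, String html) {
        Map<String, String> docs = new LinkedHashMap<>();
        Matcher m = PERSON_DIV.matcher(html);
        int i = 0;
        while (m.find()) {
            i++;
            docs.put(baseUrl + "#person" + i, m.group(1).trim());
        }
        return docs;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
            + "<div class=\"person\">Ada Lovelace</div>"
            + "<div class=\"person\">Alan Turing</div>"
            + "</body></html>";
        split("http://example.com/persons1.html", html)
            .forEach((url, content) -> System.out.println(url + " -> " + content));
    }
}
```

Inside a Nutch parse filter, each entry of the returned map would then become one `parseResult.put(subUrl, ...)` call, letting the existing per-person extraction plugin run on each fragment separately.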

