Hi, how do you use Solr to index your "documents"?
"The big question for me now is where can I access the current document, count how many persons are in the content, copy it n times, each containing one person, and let the parsers that I already have deal with each document separately?"

And what do you mean by "But the Solr index only contains one document having the above information. The original one is overwritten."? I'm not very sure about your problem. IMO the easiest way is this: you have a list of ten people, so just construct ten documents and write them into XML files. Each person's data is written into one specific file, with file names generated from an id or some random text to make them unique. So for person 1 you have 1.xml, for person 2 you have 2.xml, and so on. Finally, you index all the produced documents as in the example, using post.sh.

--
Khang

On Sat, Jun 18, 2011 at 3:58 PM, jasimop <[email protected]> wrote:

> Hi Lewis,
>
> thanks for your explanations, I will try to describe more precisely what I need:
>
> Suppose I have the webpage http://example.com/persons1.html that I want to index.
> It contains a list of persons, all having the same structure. I already have a Nutch
> plugin that allows me to extract information about one person and add it to the index
> fields (say firstname, lastname, ...). What I now need (to make my Solr queries
> possible) is that one document in the index contains only information about one
> person, and not everything contained in persons1.html.
> Ideally, after Nutch has crawled the document, I have a plugin that inspects the HTML
> content. My plugin then decides where to split the content into several pieces
> (something like: each person is within a div with class=person).
> I would now like to create, out of this one document, several documents which all
> have the same URL but whose content is only the part my plugin extracted (namely the
> information about one person). Then my already existing plugin extracts the meta
> information about one person.
> So if persons1.html contained information about 10 persons, in the index I would like
> to have 10 documents, each with separate content but the same URL.
> So crawling persons1.html should generate 10 documents in the index.
> So I do not want to remove clutter, but it is also not exactly case [1]. I would like
> to have an indexed document for each <paragraph>...</paragraph>.
>
> The big question for me now is: where can I access the current document, count how
> many persons are in the content, copy it n times, each containing one person, and let
> the parsers that I already have deal with each document separately?
>
> I tried to add new information to the ParseResult within my filter() method with
> something like this:
>
>     parseResult.put(content.getUrl(), new ParseText("myParseText2"), new ParseData(
>         new ParseStatus(ParseStatus.SUCCESS), "myTitle2", new Outlink[0],
>         content.getMetadata()));
>
> But the Solr index only contains one document having the above information.
> The original one is overwritten. If I change the URL to something like
> content.getUrl() + "#123", the original information is added to the index, but again
> only one document.
>
> How would you tackle this problem?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/indexing-hierarchical-data-schema-design-tp3052894p3079208.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
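Khang's suggestion (one XML file per person, then post them all) can be sketched as a small stand-alone Java program. Note that Solr overwrites documents that share the same uniqueKey, which is why ten documents under one identical id or URL collapse into a single index entry; the sketch below therefore derives a unique id per person. The `<div class="person">` marker, the field names (`id`, `url`, `content`), and the regex split are assumptions taken from the thread — the regex is only a stand-in for the poster's existing parsing plugin.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PersonSplitter {

    // Extract the inner HTML of each <div class="person">...</div> block.
    // A stand-in for the real parsing plugin mentioned in the thread.
    static List<String> splitPersons(String html) {
        List<String> persons = new ArrayList<>();
        Matcher m = Pattern
            .compile("<div class=\"person\">(.*?)</div>", Pattern.DOTALL)
            .matcher(html);
        while (m.find()) {
            persons.add(m.group(1).trim());
        }
        return persons;
    }

    // Build one Solr <add> document. The id must be unique per person,
    // otherwise Solr overwrites earlier documents with the same uniqueKey.
    static String toSolrXml(String url, int n, String personHtml) {
        return "<add><doc>"
             + "<field name=\"id\">" + url + "#person" + n + "</field>"
             + "<field name=\"url\">" + url + "</field>"
             + "<field name=\"content\">" + personHtml + "</field>"
             + "</doc></add>";
    }

    public static void main(String[] args) throws IOException {
        String url = "http://example.com/persons1.html";
        String html = "<html><body>"
            + "<div class=\"person\">Ada Lovelace</div>"
            + "<div class=\"person\">Alan Turing</div>"
            + "</body></html>";
        List<String> persons = splitPersons(html);
        // One numbered file per person: 1.xml, 2.xml, ...
        for (int i = 0; i < persons.size(); i++) {
            try (Writer w = new FileWriter((i + 1) + ".xml")) {
                w.write(toSolrXml(url, i + 1, persons.get(i)));
            }
        }
    }
}
```

The produced files can then be indexed with the Solr example script, e.g. `sh post.sh *.xml`, which submits each `<add>` document to the update handler.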
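For the ParseResult route tried in the quoted mail, the key change that yields N index entries instead of one is giving each `put()` call its own unique key, since entries sharing a key replace one another. A non-runnable sketch (the Nutch classes and `filter()` context are elided; `splitPersons()` is a hypothetical helper, and the other calls mirror the snippet in the mail above):

```java
// Inside the plugin's filter() method: one ParseResult entry per person,
// each under a unique key so later entries do not overwrite earlier ones.
int i = 0;
for (String personHtml : splitPersons(content)) {  // hypothetical helper
    i++;
    parseResult.put(content.getUrl() + "#person" + i,
        new ParseText(personHtml),
        new ParseData(new ParseStatus(ParseStatus.SUCCESS),
                      "person " + i, new Outlink[0],
                      content.getMetadata()));
}
return parseResult;
```

Whether each entry then survives all the way into the Solr index also depends on the schema's uniqueKey: if it is the URL without the fragment, the per-person documents will still collapse into one at indexing time.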

