Please don't mail me further.


Thanks

On 6/21/11, Khang Ich <[email protected]> wrote:
> Hi,
>
> how do you use Solr to index your "documents"?
>
>
> The big question for me now is where can I access the current document,
> count how many persons are in the content, copy it n times (each copy
> containing one person), and let the parsers that I already have deal with
> each document separately?
>
>
> And what do you mean by
>
> "But the Solr index only contains one document having the above
> information. The original one is overwritten."
>
>
> I'm not quite sure about your problem. IMO the easiest way is this: you have
> a list of ten people, so just construct ten documents and write them into
> XML files. Each person's data is written into its own file, with file names
> generated from an id or some random text to make them unique.
>
> So for person 1 you have 1.xml, person 2 you will have 2.xml ... and so on.
>
> Finally, you index all the produced documents the same way as in the
> post.sh example.
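For illustration, one of those per-person files (e.g. 1.xml) could look like the following, in Solr's XML update format. The field names firstname and lastname are assumptions here; they must match whatever the Solr schema actually defines:

```xml
<add>
  <doc>
    <field name="id">person-1</field>
    <field name="firstname">Alice</field>
    <field name="lastname">Example</field>
  </doc>
</add>
```

Each such file can then be posted to Solr with post.sh, just like the example documents that ship with Solr.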
>
> -- Khang
>
> On Sat, Jun 18, 2011 at 3:58 PM, jasimop <[email protected]> wrote:
>
>> Hi Lewis,
>>
>> thanks for your explanations, I will try to describe more precisely what I
>> need:
>>
>> Suppose I have the webpage http://example.com/persons1.html that I want to
>> index.
>> It contains a list of persons, all having the same structure. I already
>> have a Nutch plugin that allows me to extract information about one person
>> and add it to the index fields (say firstname, lastname, ...). What I now
>> need (to make my Solr queries possible) is that one document in the index
>> contains information about only one person, not everything contained in
>> persons1.html.
>> Ideally, after Nutch has crawled the document, I have a plugin that
>> inspects the HTML content. My plugin then decides where to split the
>> content into the several pieces (for example, each person is within a div
>> with class=person).
>> I would now like to create several documents out of this one document,
>> all with the same URL but each containing only the part my plugin extracted
>> (namely the information about one person). Then my already existing plugin
>> extracts the meta information about that person. So if persons1.html
>> contained information about 10 persons, in the index I would like to have
>> 10 documents, each with separate content but the same URL.
>> So crawling persons1.html should generate 10 documents in the index.
>> So I do not want to remove clutter, but it is also not exactly case [1].
>> I would like to have an indexed document for each
>> <paragraph>...</paragraph>.
>>
>> The big question for me now is where can I access the current document,
>> count how many persons are in the content, copy it n times (each copy
>> containing one person), and let the parsers that I already have deal with
>> each document separately?
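As a rough illustration of the counting/splitting step, here is a plain-Java sketch (not Nutch-specific) that assumes each person sits in a <div class="person"> block, as described above; the class name PersonSplitter is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class PersonSplitter {

    // Split HTML content into one chunk per <div class="person">...</div>.
    // A minimal sketch using plain string scanning; a real plugin would use
    // a proper HTML parser instead, and would also have to handle nested divs.
    static List<String> splitPersons(String html) {
        List<String> chunks = new ArrayList<String>();
        String open = "<div class=\"person\">";
        String close = "</div>";
        int pos = 0;
        while ((pos = html.indexOf(open, pos)) != -1) {
            int end = html.indexOf(close, pos);
            if (end == -1) {
                break; // unbalanced markup, stop scanning
            }
            chunks.add(html.substring(pos, end + close.length()));
            pos = end + close.length();
        }
        return chunks;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<div class=\"person\">Alice</div>"
                + "<div class=\"person\">Bob</div>"
                + "</body></html>";
        for (String chunk : splitPersons(html)) {
            System.out.println(chunk);
        }
    }
}
```

Each returned chunk could then be handed to the existing parsers as the content of its own document; the remaining question of how to register n such documents with Nutch is exactly what the ParseResult code below tries to do.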
>>
>> I tried to add new information to the ParseResult within my filter()
>> method with something like this:
>>
>> parseResult.put(content.getUrl(), new ParseText("myParseText2"),
>>     new ParseData(new ParseStatus(ParseStatus.SUCCESS), "myTitle2",
>>         new Outlink[0], content.getMetadata()));
>> But the Solr index then only contains one document having the above
>> information; the original one is overwritten. If I change the URL to
>> something like content.getUrl()+"#123", the original information is
>> added to the index, but again only one document.
>>
>> How would you tackle this problem?
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/indexing-hierarchical-data-schema-design-tp3052894p3079208.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>
