Hi Jasimop,

I am abroad just now and do not have access to a workstation for any decent
length of time to try this out. I will look into this come Thursday. In the
meantime, if you get any closer, please post your results. Once this one is
cracked it would make an excellent contribution to the wiki. Sorry I can't
be of more help just now.

Lewis

On Sat, Jun 18, 2011 at 12:58 AM, jasimop <[email protected]> wrote:

> Hi Lewis,
>
> thanks for your explanations. I will try to describe more precisely what I
> need:
>
> Suppose I have the webpage http://example.com/persons1.html that I want to
> index.
> It contains a list of persons, all with the same structure. I already have
> a Nutch plugin that allows me to extract information about one person and
> add it to the index fields (say firstname, lastname, ...). What I now need
> (to make my Solr queries possible) is that one document in the index
> contains only the information about one person, and not everything
> contained in persons1.html.
> Ideally, after Nutch has crawled the document, a plugin of mine inspects
> the HTML content and decides where to split it into several pieces
> (something like: each person is within a div with class="person").
> I would now like to create several documents out of this one document,
> where all have the same URL but each contains only the part my plugin
> extracted (namely the information about one person). Then my already
> existing plugin extracts the meta information about that person. So if
> persons1.html contained information about 10 persons, I would like to have
> 10 documents in the index, each with separate content but the same URL.
> Crawling persons1.html should therefore generate 10 documents in the index.
> So I do not want to remove clutter, but it is also not exactly case [1]:
> I would like to have an indexed document for each
> <paragraph>...</paragraph>.
>
> The big question for me now is: where can I access the current document,
> count how many persons are in the content, copy it n times (each copy
> containing one person), and let the parsers that I already have deal with
> each document separately?
>
> I tried to add new information to the ParseResult within my filter() method
> with something like this:
> parseResult.put(content.getUrl(),
>     new ParseText("myParseText2"),
>     new ParseData(new ParseStatus(ParseStatus.SUCCESS), "myTitle2",
>         new Outlink[0], content.getMetadata()));
> But the Solr index then contains only one document with the above
> information; the original one is overwritten. If I change the URL to
> something like content.getUrl() + "#123", the original information is
> added to the index, but again there is only one document.
>
> How would you tackle this problem?
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/indexing-hierarchical-data-schema-design-tp3052894p3079208.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
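The splitting step described above could be sketched in plain Java, outside
of Nutch. This is only a rough illustration under assumptions taken from the
mail (a flat `<div class="person">` block per person, no nested divs, and a
hypothetical `#person-N` fragment scheme for the per-person URLs); the
Nutch-specific plumbing (ParseResult, ParseText, ParseData) is deliberately
left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical splitter: extracts each <div class="person">...</div>
// fragment so that each one could later become its own index document,
// keyed by url + "#person-" + i. Assumes the divs are not nested.
public class PersonSplitter {

    // Non-greedy match so each div is captured separately.
    private static final Pattern PERSON =
        Pattern.compile("<div\\s+class=\"person\">(.*?)</div>", Pattern.DOTALL);

    public static List<String> split(String html) {
        List<String> fragments = new ArrayList<>();
        Matcher m = PERSON.matcher(html);
        while (m.find()) {
            fragments.add(m.group(1).trim());
        }
        return fragments;
    }

    public static void main(String[] args) {
        String page = "<html><body>"
            + "<div class=\"person\">Alice Smith</div>"
            + "<div class=\"person\">Bob Jones</div>"
            + "</body></html>";
        List<String> persons = split(page);
        // One document per fragment: same page URL plus a fragment id.
        for (int i = 0; i < persons.size(); i++) {
            System.out.println("http://example.com/persons1.html#person-" + i
                + " -> " + persons.get(i));
        }
    }
}
```

Whether Solr keeps the ten documents separate then depends on what the index
uses as its unique key; if the key is the plain URL, the fragments would
still need distinct keys to avoid overwriting each other.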



-- 
*Lewis*
