On 6/20/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
> Thanks, Doğacan.
>
> Thanks for the clarification concerning the content setting.
>
> The index-basic plugin modification is okay, but is it possible to access the 
> segment data containing content from a lucene client?

It is possible if you are OK with adding the Hadoop jar to your Lucene
client. Take a look at FetchedSegments; it shows how to access
content in a segment. Basically, content is a set of MapFiles (all
part-* entries under content are MapFiles). To access the content
of a URL, you first apply a hash function to find out which part
it is stored under, then fetch it with MapFile.get(). This may sound
difficult, but it is actually very easy. I would definitely suggest
reading FetchedSegments.java, especially getContent and getEntry.
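
To make the lookup concrete, here is a minimal sketch of the idea only,
not the actual Nutch code. It uses in-memory sorted maps as stand-ins for
the part-* MapFiles and String.hashCode() as a stand-in for the hash of
the real key type, so it needs no Hadoop jar. The class and method names
(SegmentLookupSketch, partitionFor) are made up for illustration; the
real logic lives in FetchedSegments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class SegmentLookupSketch {
    // Hypothetical stand-in for the part-* MapFiles under <segment>/content:
    // each part is a sorted map from URL to page content.
    private final List<TreeMap<String, String>> parts = new ArrayList<>();

    public SegmentLookupSketch(int numParts) {
        if (numParts <= 0) {
            throw new IllegalArgumentException("numParts must be positive");
        }
        for (int i = 0; i < numParts; i++) {
            parts.add(new TreeMap<>());
        }
    }

    // Hash the key and map it to a part index. Masking with
    // Integer.MAX_VALUE clears the sign bit so the result is never negative.
    // Assumption: String.hashCode() stands in for the real key's hash.
    public static int partitionFor(String url, int numParts) {
        return (url.hashCode() & Integer.MAX_VALUE) % numParts;
    }

    // Writing side: the same hash decides which part receives the entry.
    public void put(String url, String content) {
        parts.get(partitionFor(url, parts.size())).put(url, content);
    }

    // Reading side: pick the part by hash, then do a keyed lookup in it.
    // Returns null when the URL is not in the segment.
    public String get(String url) {
        return parts.get(partitionFor(url, parts.size())).get(url);
    }

    public static void main(String[] args) {
        SegmentLookupSketch segment = new SegmentLookupSketch(4);
        segment.put("http://example.com/", "<html>home</html>");
        System.out.println(segment.get("http://example.com/"));
    }
}
```

The point is that a reader never scans every part: as long as it uses the
same hash and the same number of parts as the writer, it opens exactly one
part and does one keyed lookup there.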

> I kind of like the speed Nutch provides by caching content as segment data, 
> and if searching the index becomes a big performance issue after storing 
> content in the index, I will choose to access the segment data if possible.
>
> Also, I must specify that I am bound to use Java 1.4 for our client, but I 
> guess I could rewrite/recompile the necessary Nutch code for Java 1.4 to 
> access the segment data.
>
> Best regards,
> Ronny
>
>
> -----Original message-----
> From: Doğacan Güney [mailto:[EMAIL PROTECTED]
> Sent: 20 June 2007 08:14
> To: [EMAIL PROTECTED]
> Subject: Re: Lucene client and nutch index
>
> On 6/20/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
> > I tried your tip Brian, but the property
> >
> >  <property>
> >          <name>fetcher.store.content</name>
> >          <value>true</value>
> >          <description>If true, fetcher will store content.</description>
> >  </property>
> >
> > set in nutch-site.xml does not seem to work (still no content) and I
> > found exactly the same setting in nutch-default.xml anyway, and it was
> > also set to true.....strange!!??
> >
> > Does it mean what we think it does as in store into index or does it
> > mean store as segment data?
>
> If fetcher.store.content is set to true, then fetcher stores the original 
> version of the page (its 'content') in <segment>/content directory. It has 
> nothing to do with indexing.
>
> Note that content is not available to Indexer but parse text is. If you want 
> to store parse text in the index, just change the index-basic plugin so that 
> it adds the "content" field with Store.YES. (If there is any confusion, parse 
> text is indexed as "content").
>
> >
> > Regards,
> > Ronny
> >
> > -----Original message-----
> > From: Brian Whitman [mailto:[EMAIL PROTECTED]
> > Sent: 19 June 2007 19:52
> > To: [EMAIL PROTECTED]
> > Subject: Re: Lucene client and nutch index
> >
> >
> > On Jun 19, 2007, at 1:39 PM, Naess, Ronny wrote:
> >
> > > I have made a small Lucene client reading my Nutch index created with
> > > Nutch-0.9.
> > >
> > > This works fine. However, since 'content' is only indexed, not stored,
> > > in the index, I have to find a way to access the content to create a
> > > summary (and highlight the query terms).
> > >
> >
> > You can simply set the content to be stored in the Lucene index then
> > highlighting will work normally from any Lucene client. Search the
> > mailing list (there was a post just yesterday) about how to accomplish
> > this, there's a single line of code to change. Do realise that storing
> > content will slow down some queries and your index size will grow very
> > large.
> >
> > -Brian
> >
> >
> >
> >
> >
> >
>
>
> --
> Doğacan Güney
>
>
>


-- 
Doğacan Güney
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
