On 6/20/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
> Thanks, Doğacan.
>
> Thanks for the clarification concerning the content setting.
>
> The index-basic plugin modification is okay, but is it possible to access the
> segment data containing content from a lucene client?

It is possible if you are OK with adding the hadoop jar to your lucene
client. Take a look at FetchedSegments; it shows how to access content in a
segment. Basically, content is a set of MapFiles (all the part-*'s under
content are MapFiles). When you want the content of a url, you first apply a
hash function to the url to find out which part it is stored under, then
fetch it with a MapFile.get(). This may sound difficult, but it is actually
very easy. I would definitely suggest reading FetchedSegments.java,
especially getContent and getEntry.

> I kind of like the speed nutch provides by caching content as segment data,
> and if searching the index will be a big performance issue after storing
> content in index I will choose to access the segmented data if possible.
>
> Also, I must specify that I am bound to use java 1.4 for our client, but I
> guess I could rewrite/recompile some needed Nutch code for java 1.4 if needed
> to access segmented data.
>
> Best regards,
> Ronny
>
>
> -----Original Message-----
> From: Doğacan Güney [mailto:[EMAIL PROTECTED]
> Sent: 20 June 2007 08:14
> To: [EMAIL PROTECTED]
> Subject: Re: Lucene client and nutch index
>
> On 6/20/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
> > I tried your tip Brian, but the property
> >
> > <property>
> >   <name>fetcher.store.content</name>
> >   <value>true</value>
> >   <description>If true, fetcher will store content.</description>
> > </property>
> >
> > set in nutch-site.xml does not seem to work (still no content) and I
> > found exactly the same setting in nutch-default.xml anyway, and it was
> > also set to true.....strange!!??
> >
> > Does it mean what we think it does as in store into index or does it
> > mean store as segment data?
>
> If fetcher.store.content is set to true, then fetcher stores the original
> version of the page (its 'content') in <segment>/content directory. It has
> nothing to do with indexing.
>
> Note that content is not available to Indexer but parse text is.
> If you want to store parse text in index, just change index-basic plugin
> where it adds the "content" field to Store.YES. (If there is any confusion,
> parse text is indexed as "content").
>
> >
> > Regards,
> > Ronny
> >
> > -----Original Message-----
> > From: Brian Whitman [mailto:[EMAIL PROTECTED]
> > Sent: 19 June 2007 19:52
> > To: [EMAIL PROTECTED]
> > Subject: Re: Lucene client and nutch index
> >
> >
> > On Jun 19, 2007, at 1:39 PM, Naess, Ronny wrote:
> >
> > > I have made a small Lucene client reading my nutch index created
> > > with Nutch-0.9
> > >
> > > This works fine. However since 'content' is not stored only indexed
> > > in the index I have to find a way to access the content to create a
> > > summary (and highlighting the query terms).
> > >
> >
> > You can simply set the content to be stored in the Lucene index then
> > highlighting will work normally from any Lucene client. Search the
> > mailing list (there was a post just yesterday) about how to accomplish
> > this, there's a single line of code to change. Do realise that storing
> > content will slow down some queries and your index size will grow very
> > large.
> >
> > -Brian
>
> --
> Doğacan Güney

--
Doğacan Güney
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
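
For anyone who wants to see the shape of the lookup described in the reply
(hash the url to pick a part, then do a keyed get on that part) before
reading FetchedSegments.java, here is a minimal sketch that runs without the
hadoop jar. It is not Nutch or Hadoop code: SegmentLookupSketch and all of
its methods are made-up names, and each TreeMap only stands in for one sorted
part-* MapFile. The hash and modulo formulas are an assumption about what
Hadoop's default hash partitioning does with a Text key; check them against
the Hadoop version you actually ship before relying on them. It also uses
generics, so it will not compile on java 1.4 as written.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** In-memory model of "pick a part by hash, then get by key". Not Hadoop code. */
class SegmentLookupSketch {

    // Each TreeMap stands in for one sorted part-* MapFile under <segment>/content.
    private final List<TreeMap<String, String>> parts = new ArrayList<>();

    SegmentLookupSketch(int numParts) {
        for (int i = 0; i < numParts; i++) {
            parts.add(new TreeMap<>());
        }
    }

    // ASSUMPTION: models the hash Hadoop computes over a Text key's UTF-8 bytes.
    static int hashBytes(byte[] bytes) {
        int hash = 1;
        for (byte b : bytes) {
            hash = 31 * hash + b;
        }
        return hash;
    }

    // ASSUMPTION: models default hash partitioning (clear the sign bit, then modulo).
    static int partitionFor(String url, int numParts) {
        int hash = hashBytes(url.getBytes(StandardCharsets.UTF_8));
        return (hash & Integer.MAX_VALUE) % numParts;
    }

    // Writer side: the url decides which part the record lands in.
    void put(String url, String content) {
        parts.get(partitionFor(url, parts.size())).put(url, content);
    }

    // Reader side: recompute the same part index, then do one keyed lookup there.
    String getContent(String url) {
        return parts.get(partitionFor(url, parts.size())).get(url);
    }

    public static void main(String[] args) {
        SegmentLookupSketch segment = new SegmentLookupSketch(4);
        segment.put("http://example.com/", "<html>hello</html>");
        System.out.println(segment.getContent("http://example.com/"));
        System.out.println(segment.getContent("http://example.com/missing"));
    }
}
```

The point of the sketch is that the reader never scans every part: as long as
it uses the same hash and the same number of parts as the writer did, one
keyed get on one part is enough, and a url that was never fetched simply
comes back as null.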
