Field.Text vs Field.UnStored

2005-08-12 Thread EM
I need some help figuring out the following. I was looking at BasicIndexingFilter.java, where it's stated:
// url is both stored and indexed, so it's both searchable and returned
doc.add(Field.Text(url, url));
// content is indexed, so that it's searchable, but not stored in index
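The distinction those two comments draw (stored and indexed versus indexed only) can be illustrated with a small toy model. This is only a sketch and does not use Lucene: `FieldModesSketch`, `ToyField`, `ToyIndex`, and the `text`/`unStored` helpers are hypothetical names invented here to mirror what the comments say, and the tokenization is a naive split on non-word characters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: none of these types are part of Lucene or Nutch.
final class FieldModesSketch {

    // Hypothetical field with two independent flags: stored and indexed.
    static final class ToyField {
        final String name;
        final String value;
        final boolean stored;
        final boolean indexed;

        ToyField(String name, String value, boolean stored, boolean indexed) {
            this.name = name;
            this.value = value;
            this.stored = stored;
            this.indexed = indexed;
        }
    }

    // Mirrors the "stored and indexed" comment: searchable and returned.
    static ToyField text(String name, String value) {
        return new ToyField(name, value, true, true);
    }

    // Mirrors the "indexed, but not stored" comment: searchable only.
    static ToyField unStored(String name, String value) {
        return new ToyField(name, value, false, true);
    }

    static final class ToyIndex {
        // "field:term" -> ids of documents containing that term
        private final Map<String, Set<Integer>> postings = new HashMap<>();
        // per-document map of the field values that were kept verbatim
        private final List<Map<String, String>> storedValues = new ArrayList<>();

        int add(List<ToyField> doc) {
            final int id = storedValues.size();
            Map<String, String> kept = new HashMap<>();
            for (ToyField f : doc) {
                if (f.indexed) {
                    // Naive tokenization: lowercase, split on non-word characters.
                    for (String term : f.value.toLowerCase().split("\\W+")) {
                        if (!term.isEmpty()) {
                            postings.computeIfAbsent(f.name + ":" + term,
                                    k -> new TreeSet<>()).add(id);
                        }
                    }
                }
                if (f.stored) {
                    kept.put(f.name, f.value);
                }
            }
            storedValues.add(kept);
            return id;
        }

        Set<Integer> search(String field, String term) {
            return postings.getOrDefault(field + ":" + term.toLowerCase(),
                    Collections.<Integer>emptySet());
        }

        // Returns null when the field was indexed without being stored.
        String stored(int id, String field) {
            return storedValues.get(id).get(field);
        }
    }
}
```

In this model a document added with `text("url", ...)` and `unStored("content", ...)` is found by a search on either field, but only the `url` value can be read back from a hit; `stored(id, "content")` returns null.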

Re: page ranking weights

2005-08-12 Thread Jay Pound
Also, how does it keep track of incoming links globally for these pages? If the weight is determined by the number of incoming links, then it would have to keep track of them somewhere, so that when you split your indexes it can still have an accurate value for the distributed search. -J - Original Message

Re: Different Number of Doc in Index and WebDB

2005-08-12 Thread Michael Weber
In the database, all (let's say 24) pages are stored. The database stores 24 URLs: the one URL that is indexed and the 23 URLs that are linked on that page and that Nutch must index in the next crawl. Best regards from Germany, Michael. Nils Hoeller wrote: Hi, I've got the following

Re: Different Number of Doc in Index and WebDB

2005-08-12 Thread Nils Hoeller
Hi Michael, On Friday, 12.08.2005, at 15:36 +0200, Michael Weber wrote: In the database, all (let's say 24) pages are stored. The database stores 24 URLs: the one URL that is indexed and the 23 URLs that are linked on that page and that Nutch must index in the next crawl.

[jira] Closed: (NUTCH-30) rss feed parser

2005-08-12 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-30?page=all ] Andrzej Bialecki closed NUTCH-30. Resolution: Fixed. Assign To: Andrzej Bialecki (was: Chris A. Mattmann). Committed to trunk. Thank you! rss feed parser

Re: Site Content not indexed ? Nutch 0.7

2005-08-12 Thread Andrzej Bialecki
Nils Hoeller wrote: Hi, actually I thought the content of the pages is being indexed. When I have a look with Luke at the index of a Nutch crawl, it says the contents are not available. Please try the Reconstruct & Edit button, and you should see some text from the content. The plain text is NOT

Clustering plugin upgrade

2005-08-12 Thread Dawid Weiss
I'll prepare an upgrade of the clustering code to the Carrot2 HEAD. There have been a few fixes, so it is worth it before the release. If anyone objects, please speak up. Also, what's the preferred way of submitting that (remember it is a few megabytes) -- JIRA? Direct contact with somebody

Re: Injecting documents manually.

2005-08-12 Thread Andy Liu
This is built into Nutch. Instead of injecting http:// URLs, use file:// URLs, and Nutch will use protocol-file to fetch the files locally. Andy On 8/12/05, Dawid Weiss [EMAIL PROTECTED] wrote: Has anyone considered/implemented injecting static pages with a different URL scheme? I mean the

mapred

2005-08-12 Thread webmaster
I need some help with how to use mapred. What are the commands to use with it? Thanks, Jay Pound -- Pound Web Hosting www.poundwebhosting.com (607)-435-3048

Language detection

2005-08-12 Thread Ken Krugler
Given the recent discussion regarding charset/language detection on this list, people might find this IBM research paper interesting: ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf Linguini: