I need some help figuring out the following:
I was looking at BasicIndexingFilter.java, where it states:
// url is both stored and indexed, so it's both searchable and returned
doc.add(Field.Text("url", url));
// content is indexed, so that it's searchable, but not stored in index
Also, how does it keep track of incoming links to these pages globally? If
the weight is determined by the number of incoming links, then it would have
to keep track of them somewhere, so that when you split your indexes it still
has an accurate value for the distributed search.
-J
- Original Message
In the database, all (let's say 24) pages are stored.
The database stores 24 URLs: the one URL that was indexed, and
the 23 URLs that are linked from that page and that Nutch must index in
the next crawl.
Best regards from Germany
Michael
Nils Hoeller wrote:
Hi,
I've got the following
Hi Michael,
On Friday, 12.08.2005, at 15:36 +0200, Michael Weber wrote:
In the database, all (let's say 24) pages are stored.
The database stores 24 URLs: the one URL that was indexed, and
the 23 URLs that are linked from that page and that Nutch must index in
the next crawl.
[ http://issues.apache.org/jira/browse/NUTCH-30?page=all ]
Andrzej Bialecki closed NUTCH-30:
--
Resolution: Fixed
Assign To: Andrzej Bialecki (was: Chris A. Mattmann)
Committed to trunk. Thank you!
rss feed parser
---
Nils Hoeller wrote:
Hi,
Actually, I thought the content of the pages
was being indexed.
When I look with Luke at the
index of a Nutch crawl, it says
contents not available.
Please try the "Reconstruct & Edit" button, and you should see some text
from the content. The plain text is NOT
I'll prepare an upgrade of the clustering code to the Carrot2 HEAD.
There have been a few fixes, so it is worth it before the release.
If anyone objects, please speak up. Also, what's the preferred way of
submitting that (remember it is a few megabytes) -- JIRA? Direct contact
with somebody
This is built into Nutch. Instead of injecting http:// URLs, use
file:// URLs, and Nutch will use protocol-file to fetch the files locally.
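
As a rough sketch of preparing such seed URLs: the class name, method name, and example path below are hypothetical, and the code uses only the JDK, not Nutch itself. It assumes the resulting strings would be written to whatever URL list is normally injected in place of http:// entries.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns local file paths into file: URLs for a seed list.
public class FileSeedUrls {

    // Converts each local path to an absolute, normalized file: URL string.
    static List<String> toFileUrls(List<String> localPaths) {
        List<String> urls = new ArrayList<>();
        for (String p : localPaths) {
            Path abs = Paths.get(p).toAbsolutePath().normalize();
            // Path.toUri() yields a URI with the "file" scheme.
            urls.add(abs.toUri().toString());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Example path is an assumption; substitute real document locations.
        for (String url : toFileUrls(Arrays.asList("/tmp/docs/index.html"))) {
            System.out.println(url);
        }
    }
}
```

Each printed line can then be used as one seed entry.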
Andy
On 8/12/05, Dawid Weiss [EMAIL PROTECTED] wrote:
Has anyone considered/implemented injecting static pages with a
different URL scheme? I mean the
I need some help with how to use mapred. What are the commands to use with it?
Thanks,
Jay Pound
--
Pound Web Hosting www.poundwebhosting.com
(607)-435-3048
Given the recent discussion regarding charset/language detection on
this list, people might find this IBM research paper interesting:
ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf
Linguini: