Re: [Nutch-dev] File does not exist (-l\fetchlist\data)

2005-01-19 Thread Matt Kangas
Charitha, what is your value for $s1 in this example? ("echo $s1" to find out) For more logging detail, try "nutch fetch -logLevel finest $s1" --matt On Thu, 20 Jan 2005 00:03:38 -0600, Charitha Tillekeratne <[EMAIL PROTECTED]> wrote: > I am trying to use Nutch by following the tutorial on a Wind
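The check suggested above can be sketched as a small shell function. This is only an illustration: `check_segment` is a hypothetical helper, not part of Nutch, and it assumes `$s1` is meant to hold the path of a segment directory. The path in the reported error, `-l\fetchlist\data`, would be consistent with `$s1` holding `-l` instead of a directory name, although the excerpt does not confirm that.

```shell
#!/bin/sh
# check_segment is a hypothetical helper, not part of Nutch.
# It reports whether a value looks like a usable segment directory.
check_segment() {
    seg="$1"
    # An empty value means the variable was never set.
    if [ -z "$seg" ]; then
        echo "empty"
        return 1
    fi
    # A leading dash suggests a command-line option ended up in the variable.
    case "$seg" in
        -*)
            echo "looks like an option: $seg"
            return 1
            ;;
    esac
    # Anything else must at least be an existing directory.
    if [ ! -d "$seg" ]; then
        echo "not a directory: $seg"
        return 1
    fi
    echo "ok: $seg"
}
```

Running `check_segment "$s1"` before the fetch step would show whether the variable is empty, holds an option such as `-l`, or points at a real directory.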

[Nutch-dev] File does not exist (-l\fetchlist\data)

2005-01-19 Thread Charitha Tillekeratne
I am trying to use Nutch by following the tutorial on a Windows system using Cygwin. When executing the fetch command I get the following error (at the end). I added a print statement in LocalFileSystem.java and found out that it was looking for the file "-l\fetchlist\data". Any idea on how to fix

[Nutch-dev] Nutch feature questions

2005-01-19 Thread Gavin Chan
1. Does Nutch support URL aliases, meaning that the URL I crawled is different from the URL I display in the search result page? 2. Does Nutch support file system crawling and database crawling? What would be the configuration? 3. So far the Nutch documentation I can find is: 1. the tutorial 2.

Re: [Nutch-dev] Nutch crawling issues

2005-01-19 Thread Gavin Chan
1. Yup, the outlink config fixes the problem. 2. The segread -fix is one way to save the broken data. This can be a workaround for the problem. How much time would it take to copy the data compared to crawling? I think copying data from the local disk is still faster than re-starting a new crawl

[Nutch-dev] File URL, HTTPClient patch

2005-01-19 Thread Ken Meltsner
Three queries, somewhat related: 1. I'm feeling stupid, but I can't figure out the right syntax for a file URL for the crawler for Nutch on Windows/Cygwin. Suggestions? I've tried: file:///c:/foo file:///c|/foo [Netscape style] and a few others. 2. A while ago, Andy Hedges mentioned that h

[Nutch-dev] [ nutch-Bugs-1105652 ] Ignore HTML links with 'rel=nofollow' attribute

2005-01-19 Thread SourceForge.net
Bugs item #1105652, was opened at 2005-01-19 18:04 Message generated for change (Tracker Item Submitted) made by Item Submitter You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1105652&group_id=59548 Category: fetcher Group: None Status: Open Resolution: N

[Nutch-dev] Antigen found HTML/Bofr virus

2005-01-19 Thread Antigen_DR_MAIL
Antigen for Exchange found Unknown infected with HTML/Bofr virus. The file is currently Removed. The message, "[Nutch-dev] Confirmation", was sent from [EMAIL PROTECTED] and was discovered in First Storage Group\Nikola Midich\Inbox\Mailing Lists\Nutch-Dev located at Perfectinfo/First Administrati

[Nutch-dev] [ nutch-Bugs-1077261 ] [PATCH] UpdateDatabaseTool: make "pageXXX" methods protected

2005-01-19 Thread SourceForge.net
Bugs item #1077261, was opened at 2004-12-01 21:34 Message generated for change (Comment added) made by mkangas You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1077261&group_id=59548 Category: tools Group: mainline Status: Open Resolution: None Priority:

[Nutch-dev] Nutch samples

2005-01-19 Thread Joshua Oliver
I have seen different samples of Nutch in the wiki. Are there any samples of Nutch in its default (out-of-the-box) state? Thanks

nutch-developers@lists.sourceforge.net

2005-01-19 Thread Andrzej Bialecki
Nutch wrote: Nutch wrote: >> Hi, >> >> I have been testing Nutch on our Intranet site but since we have "&" in >> our url's Nutch doesn't work very well. Is there some way of getting Nutch >> to accept url's containing &? > Yes, just add it to the allowed characters in the regex-urlfilter.t

[Nutch-dev] [ nutch-Bugs-1077261 ] [PATCH] UpdateDatabaseTool: make "pageXXX" methods protected

2005-01-19 Thread SourceForge.net
Bugs item #1077261, was opened at 2004-12-02 03:34 Message generated for change (Comment added) made by abial You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1077261&group_id=59548 Category: tools Group: mainline Status: Open Resolution: None Priority: 5

[Nutch-dev] [ nutch-Bugs-1077258 ] [PATCH] WebDBInjector.addPage - make public

2005-01-19 Thread SourceForge.net
Bugs item #1077258, was opened at 2004-12-02 03:27 Message generated for change (Comment added) made by abial You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1077258&group_id=59548 Category: web db Group: mainline >Status: Closed >Resolution: Accepted Pri

[Nutch-dev] [ nutch-Bugs-1077173 ] [PATCH] configurable IndexWriter/Segment Lucene parameters

2005-01-19 Thread SourceForge.net
Bugs item #1077173, was opened at 2004-12-02 00:44 Message generated for change (Comment added) made by abial You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1077173&group_id=59548 Category: indexer Group: mainline >Status: Closed Resolution: None Priorit

nutch-developers@lists.sourceforge.net

2005-01-19 Thread Nutch
Nutch wrote: >> Hi, >> >> I have been testing Nutch on our Intranet site but since we have "&" in >> our url's Nutch doesn't work very well. Is there some way of getting Nutch >> to accept url's containing &? > Yes, just add it to the allowed characters in the regex-urlfilter.txt > config file

Re: [Nutch-dev] MS word files parsing

2005-01-19 Thread Oscar Picasso
Thank you all for your input. --- Ken Meltsner <[EMAIL PROTECTED]> wrote: > [...] have Windows, while Java or C++ solutions (POI, *WVWare*, OpenOffice) > run on Linux/Unix as well. I didn't know about WVWare. Did you have a chance to use it? How does it compare to POI or OpenOffice/UNO?

nutch-developers@lists.sourceforge.net

2005-01-19 Thread Andrzej Bialecki
Nutch wrote: Hi, I have been testing Nutch on our Intranet site but since we have "&" in our url's Nutch doesn't work very well. Is there some way of getting Nutch to accept url's containing &? Yes, just add it to the allowed characters in the regex-urlfilter.txt config file. -- Best regards,
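To illustrate the idea behind the reply above, here is a minimal sketch of a character-based URL filter. It assumes the filter rejects any URL containing a character from a configured set, which the excerpt does not spell out. `filter_url`, the bracket expressions, and the host `intranet.example` are all hypothetical; this is not Nutch code and not the actual contents of regex-urlfilter.txt.

```shell
#!/bin/sh
# filter_url is a hypothetical helper, not Nutch code.
# $1: bracket expression listing characters that cause a URL to be skipped
# $2: URL to test
filter_url() {
    skip_class="$1"
    url="$2"
    # grep -q exits 0 when the URL contains any character in the class.
    if printf '%s\n' "$url" | grep -q -e "$skip_class"; then
        echo "skip"
    else
        echo "accept"
    fi
}

url='http://intranet.example/view&lang=en'

# With "&" in the skipped set, the URL is rejected.
filter_url '[&@]' "$url"    # prints: skip

# With "&" taken out of the skipped set, the same URL passes.
filter_url '[@]' "$url"     # prints: accept
```

Under this assumption, allowing "&" amounts to making sure no rejecting rule lists it.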

nutch-developers@lists.sourceforge.net

2005-01-19 Thread Nutch
Hi, I have been testing Nutch on our Intranet site but since we have "&" in our url's Nutch doesn't work very well. Is there some way of getting Nutch to accept url's containing &? Thanks Fredrik

Re: [Nutch-dev] Nutch crawling issues

2005-01-19 Thread Andrzej Bialecki
Gavin Chan wrote: We are evaluating Nutch for our internet and intranet crawling. However, I am encountering the following problems/questions when using it and would like to seek your comments/suggestions: 1. Not all URLs in an HTML page are crawled/indexed. * After massaging the URL filters and