[Nutch-dev] Fetchlist generation taking hours...

2005-01-15 Thread Andrew Chen
All, Once I have a couple million pages in the DB, how long should it take to generate a fetchlist? I'm finding that all the steps where it says "Processing " take a short amount of time, but once it starts to sort through the DB, it takes much, much longer, often hours. How long is it supp

Re: [Nutch-dev] 0.6 release?

2004-12-04 Thread Andrew Chen
Hey Doug, following up on your e-mail from earlier last month... Has there been an official 0.6 release? I'd love to try the new NDFS stuff, but don't want to grab some random development version. On Thu, 11 Nov 2004 15:53:41 +0100, Doug Cutting <[EMAIL PROTECTED]> wrote: > It looks to me like t

Re: [Nutch-dev] SegmentMergeTool INPUT/OUTPUT diff?

2004-12-03 Thread Andrew Chen
Earlier in the output, they will tell you a couple reasons. For example, duplicate URLs, 404s, empty bodies, etc., etc. Should be nothing to worry about... that looks about right. Andrew On Fri, 3 Dec 2004 15:12:59 -0800 (PST), [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > Hello, > > I just r

Re: [Nutch-dev] updatedb performance issues

2004-12-02 Thread Andrew Chen
I have the same problem. I'm doing repeat fetching of frequently updated content (needed for blogs and other breaking-news sites) and am finding that updatedb and fetchlist generation are becoming the bottleneck as the webdb gets bigger. Any thoughts or suggestions? Andre

[Nutch-dev] does banned-hosts.txt still work?

2004-11-29 Thread Andrew Chen
I don't see any reference to it in the code. Every once in a while, I run into sites like: http://spodzone.org.uk/cesspit.jl ... that seem designed to ensnare crawlers like Nutch. I e-mailed the website owner because he seems to have made a half-hearted attempt in the robots.txt file to be nice t

Re: [Nutch-dev] Re: Any way to sort hits by date?

2004-11-01 Thread Andrew Chen
> -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of Andrew > Chen > Sent: Sunday, October 31, 2004 11:51 PM > To: [EMAIL PROTECTED] > Subject: [Nutch-dev] Re: Any way to sort hits by date? > > Let me answer my own question: Crea

[Nutch-dev] Re: Any way to sort hits by date?

2004-10-31 Thread Andrew Chen
Let me answer my own question: Create a new queryFilter for date. Then create a search function that searches by today, yesterday, the day before, etc., and returns those hits. I believe that should work... Andrew On Sun, 31 Oct 2004 20:33:03 -0800, Andrew Chen <[EMAIL PROTECTED]> wrote
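The day-by-day idea above can be sketched in plain Java. This is only an illustration under stated assumptions, not Nutch code: `DaySearcher` is a hypothetical hook standing in for a search restricted to one day (for example, through a date query filter), and hits are represented as plain URL strings.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

class SearchByDay {
    /** Hypothetical hook: hit URLs for a single day, already ordered by score. */
    interface DaySearcher {
        List<String> search(String query, LocalDate day);
    }

    /**
     * Queries today, then yesterday, and so on, appending each day's hits.
     * The result is ordered newest day first, and by score within a day.
     */
    static List<String> newestFirst(DaySearcher searcher, String query,
                                    LocalDate today, int days, int maxHits) {
        List<String> results = new ArrayList<>();
        for (int back = 0; back < days && results.size() < maxHits; back++) {
            for (String hit : searcher.search(query, today.minusDays(back))) {
                if (results.size() == maxHits) {
                    break; // enough hits collected; the outer loop stops too
                }
                results.add(hit);
            }
        }
        return results;
    }
}
```

One query per day keeps each result set small, at the cost of several searches per request.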

[Nutch-dev] Any way to sort hits by date?

2004-10-31 Thread Andrew Chen
Is there an easy way to sort search results by date? I'm working on a project that requires date first, then relevance. Looking at the Hits and Hit classes, it doesn't look like there's an easy way to sort results except by score. I can always do a sort at the very top layer in a bean, but maybe
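Sorting at the top layer, as mentioned above, might look roughly like this. It is a minimal sketch and not the Nutch API: `DatedHit` is a hypothetical stand-in that carries a date and a score, since the fields of the real Hit class are not shown here.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DateThenScoreSort {
    /** Hypothetical stand-in for a search hit; not the Nutch Hit class. */
    static final class DatedHit {
        final String url;
        final LocalDate date;
        final float score;

        DatedHit(String url, LocalDate date, float score) {
            this.url = url;
            this.date = date;
            this.score = score;
        }
    }

    /** Newest date first; for equal dates, highest score first. */
    static final Comparator<DatedHit> DATE_THEN_SCORE = (a, b) -> {
        int byDate = b.date.compareTo(a.date);
        return byDate != 0 ? byDate : Float.compare(b.score, a.score);
    };

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<DatedHit> sort(List<DatedHit> hits) {
        List<DatedHit> copy = new ArrayList<>(hits);
        copy.sort(DATE_THEN_SCORE);
        return copy;
    }
}
```

This only reorders hits that were already retrieved, so a recent page that scored too low to be returned at all will still be missing.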

Re: [Nutch-dev] Copyrighted Sites

2004-10-25 Thread Andrew Chen
Yeah, the DMCA says that search engines are protected under "safe harbor" provisions, but that we have to allow some form of "notice and takedown" action to occur. Here's a good /. post on it: http://yro.slashdot.org/yro/04/04/25/1746200.shtml?tid=103&tid=126&tid=188&tid=95&tid=99 Basically, sear

[Nutch-dev] Quick question on token seperators

2004-10-24 Thread Andrew Chen
Quick question: I'd like to add underscore "_" as a token separator. Is there anything else I need to do other than adding an entry to NutchAnalysisConstants? Andrew
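To make the intended behavior concrete, here is a stand-alone sketch of what treating "_" as a separator means for the resulting tokens. It uses a plain regular expression and is not how the Nutch analyzer is implemented; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class UnderscoreSplit {
    /**
     * Splits on runs of non-word characters or underscores.
     * In Java regexes "\W" alone would keep "_" inside a token,
     * because "_" counts as a word character.
     */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[\\W_]+")) {
            if (!token.isEmpty()) { // split() yields "" for a leading separator
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

With this rule, "nutch_dev list" produces the three tokens nutch, dev, and list.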

Re: [Nutch-dev] Fetch intervals of hours, not days...

2004-10-17 Thread Andrew Chen
I'll wait, and stick with the hack I have until the next release ;) Awesome that you're making this change though... Andrew On Sat, 16 Oct 2004 19:07:00 +0200, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > Andrew Chen wrote: > > > > > All, > > > >

[Nutch-dev] Fetch intervals of hours, not days...

2004-10-15 Thread Andrew Chen
All, I'm working on a project where one component is looking at RSS feeds. Although Nutch has been quite focused on crawling lots of pages, without necessarily having "freshness" as the goal, for RSS feeds it's pretty crucial to be able to hit the server every couple of hours, rather than even every day... Right no

Re: [Nutch-dev] Re: Indexing links to robots.txt blocked pages

2004-10-04 Thread Andrew Chen
rawler by robots.txt) is a violation of the robots > exclusion protocol; after all, if I wanted the crawler to index and > display those pages, I wouldn't put them under robots.txt, right? > > I'm looking forward to your point of view on the subject. > > Dawid >

[Nutch-dev] Re: Indexing links to robots.txt blocked pages

2004-10-03 Thread Andrew Chen
avior though, especially if Google handles it this way! Sorry for the spam ;) Andrew On Sun, 3 Oct 2004 23:13:39 -0700, Andrew Chen <[EMAIL PROTECTED]> wrote: > Hi everyone, > > I'm trying very very hard to not modify the core Nutch code, and build > everything as plug-i

[Nutch-dev] Indexing links to robots.txt blocked pages

2004-10-03 Thread Andrew Chen
Hi everyone, I'm trying very very hard to not modify the core Nutch code, and build everything as plug-ins. Kudos to Doug, Mike, and everyone else for building a system that's so easily extended. I haven't had to make any core changes except on one occasion, involving the page score thread discuss

Re: [Nutch-dev] Changing the score

2004-09-14 Thread Andrew Chen
> discussion on this topic would be very helpful :) > > Thanks, > > --Jagdeep > > > > > -Original Message----- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] Behalf Of Andrew > Chen > Sent: Tuesday, September 14, 2004 12:27 AM > To: [EMAIL PROTEC

Re: [Nutch-dev] Changing the score

2004-09-14 Thread Andrew Chen
> > Warning: My familiarity with Nutch code is at the Intermediate level. So more > discussion on this topic would be very helpful :) > > Thanks, > > --Jagdeep > > > > > -Original Message----- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTEC

[Nutch-dev] Changing the score

2004-09-13 Thread Andrew Chen
Hi everyone, I'm in the middle of making a couple extensions to Nutch, and I had a question about the best way to plug into the scoring engine. Maybe I'm missing something simple, so any assistance would be helpful. The project I'm working on is indexing a specific type of file (let's say MPEG fi

Re: [Nutch-dev] Building QueryFilter Extensions...

2004-08-30 Thread Andrew Chen
That does help - I'll take a close look at both of the classes you suggested. Thanks Doug. Andrew On Mon, 30 Aug 2004 13:11:31 -0700, Doug Cutting <[EMAIL PROTECTED]> wrote: > Andrew Chen wrote: > > Don't know if anyone else has been building QueryFilter extension

[Nutch-dev] Building QueryFilter Extensions...

2004-08-29 Thread Andrew Chen
Don't know if anyone else has been building QueryFilter extensions, but here's a quick question. In the filter method that needs to be implemented as part of the QueryFilter interface, the method looks like this: public BooleanQuery filter(Query input, BooleanQuery translation) Query is a Nutch c