All,
Once I have a couple million pages in the DB, how long should it take
to generate a fetchlist? I'm finding that the step where it says
"Processing " takes a short amount of time, but once it starts
to sort through the DB, it takes much, much longer, often hours.
How long is it supp
Hey Doug, following up on your e-mail from earlier last month...
Has there been an official 0.6 release? I'd love to try the new NDFS
stuff, but don't want to grab some random development version.
On Thu, 11 Nov 2004 15:53:41 +0100, Doug Cutting <[EMAIL PROTECTED]> wrote:
> It looks to me like t
Earlier in the output, it will tell you a couple of reasons. For
example, duplicate URLs, 404s, empty bodies, etc.
Should be nothing to worry about... that looks about right.
Andrew
On Fri, 3 Dec 2004 15:12:59 -0800 (PST), [EMAIL PROTECTED]
<[EMAIL PROTECTED]> wrote:
> Hello,
>
> I just r
I have the same problem.
I'm doing repeat fetching of frequently updated content (needed for
blogs and other breaking-news sites) and am finding that the process of
running updatedb and generating fetchlists is becoming the bottleneck
as the webdb gets bigger.
Any thoughts or suggestions?
Andre
I don't see any reference to it in the code.
Every once in a while, I run into sites like:
http://spodzone.org.uk/cesspit.jl
... that seem designed to ensnare crawlers like Nutch. I e-mailed the
website owner because he seems to have made a half-hearted attempt in
the robots.txt file to be nice t
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Andrew
> Chen
> Sent: Sunday, October 31, 2004 11:51 PM
> To: [EMAIL PROTECTED]
> Subject: [Nutch-dev] Re: Any way to sort hits by date?
>
> Let me answer my own question: Crea
Let me answer my own question: Create a new queryFilter for date. Then
create a search function that searches by today, yesterday, the day
before, etc., and return those hits. I believe that should work...
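
A minimal sketch of the "date first, then relevance" ordering, done as a plain sort at the top layer rather than as a queryFilter. DatedHit is a hypothetical stand-in class invented for this example, not Nutch's Hit class, and it assumes each hit already carries a day and a score:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DateFirstSort {

    // Hypothetical stand-in for a search hit; NOT Nutch's Hit class.
    public static final class DatedHit {
        public final String url;
        public final LocalDate day;   // day the page was published/fetched (assumed known)
        public final float score;     // relevance score (assumed known)

        public DatedHit(String url, LocalDate day, float score) {
            this.url = url;
            this.day = day;
            this.score = score;
        }
    }

    // Returns a new list ordered newest day first; within a day, highest score first.
    public static List<DatedHit> sortByDayThenScore(List<DatedHit> hits) {
        List<DatedHit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> {
            int byDay = b.day.compareTo(a.day);      // newer day sorts earlier
            if (byDay != 0) {
                return byDay;
            }
            return Float.compare(b.score, a.score);  // higher score sorts earlier
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<DatedHit> hits = Arrays.asList(
            new DatedHit("a", LocalDate.of(2004, 10, 30), 0.9f),
            new DatedHit("b", LocalDate.of(2004, 10, 31), 0.2f),
            new DatedHit("c", LocalDate.of(2004, 10, 31), 0.7f));
        for (DatedHit h : sortByDayThenScore(hits)) {
            System.out.println(h.day + " " + h.score + " " + h.url);
        }
    }
}
```

This only works when the whole result set fits in memory; searching day by day, as described above, avoids pulling every hit first.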
Andrew
On Sun, 31 Oct 2004 20:33:03 -0800, Andrew Chen <[EMAIL PROTECTED]> wrote:
Is there an easy way to sort search results by date?
I'm working on a project that requires date first, then relevance.
Looking at the Hits and Hit classes, it doesn't look like there's an
easy way to sort results except by score.
I can always do a sort at the very top layer in a bean, but maybe
Yeah, the DMCA says that search engines are protected under "safe
harbor" provisions, but that we have to allow some form of "notice and
takedown" action to occur.
Here's a good /. post on it:
http://yro.slashdot.org/yro/04/04/25/1746200.shtml?tid=103&tid=126&tid=188&tid=95&tid=99
Basically, sear
Quick question: I'd like to add underscore "_" as a token separator.
Is there anything else other than adding an entry to NutchAnalysisConstants?
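
To make the intended behavior concrete, here is a rough sketch of the tokenization I'm after. It does not touch Nutch's analyzer or NutchAnalysisConstants; UnderscoreSplit and tokenize are hypothetical names made up for this example:

```java
import java.util.ArrayList;
import java.util.List;

public class UnderscoreSplit {

    // Splits text into lowercase tokens, treating every character that is not
    // a letter or digit (including the underscore) as a separator.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Separator reached: close the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("my_file_name.mpg and more"));
    }
}
```

So "my_file_name" should come out as three tokens instead of one.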
Andrew
I'll wait, and stick with the hack I have until the next release ;)
Awesome that you're making this change though...
Andrew
On Sat, 16 Oct 2004 19:07:00 +0200, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Andrew Chen wrote:
>
>
>
> > All,
> >
> >
All,
I'm working on a project where one component is looking at RSS feeds.
Although Nutch has been quite focused on lots of pages without
necessarily having "freshness" as the goal, for RSS feeds it's pretty
crucial to be able to hit the server every couple of hours, rather than
even every day...
Right no
rawler by robots.txt) is a violation of the robots
> exclusion protocol; after all, if I wanted the crawler to index and
> display those pages, I wouldn't put them under robots.txt, right?
>
> I'm looking forward to your point of view on the subject.
>
> Dawid
>
>
avior though,
especially if Google handles it this way!
Sorry for the spam ;)
Andrew
On Sun, 3 Oct 2004 23:13:39 -0700, Andrew Chen <[EMAIL PROTECTED]> wrote:
> Hi everyone,
>
> I'm trying very very hard to not modify the core Nutch code, and build
> everything as plug-i
Hi everyone,
I'm trying very very hard to not modify the core Nutch code, and build
everything as plug-ins. Kudos to Doug, Mike, and everyone else for
building a system that's so easily extended. I haven't had to make any
core changes except on one occasion, involving the page score thread
discuss
> discussion on this topic would be very helpful :)
>
> Thanks,
>
> --Jagdeep
>
>
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of Andrew
> Chen
> Sent: Tuesday, September 14, 2004 12:27 AM
> To: [EMAIL PROTEC
>
> Warning: My familiarity with Nutch code is at the Intermediate level. So more
> discussion on this topic would be very helpful :)
>
> Thanks,
>
> --Jagdeep
>
>
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTEC
Hi everyone,
I'm in the middle of making a couple extensions to Nutch, and I had a
question about the best way to plug into the scoring engine. Maybe I'm
missing something simple, so any assistance would be helpful.
The project I'm working on is indexing a specific type of file (let's
say MPEG fi
That does help - I'll take a close look at both of the classes you
suggested. Thanks Doug.
Andrew
On Mon, 30 Aug 2004 13:11:31 -0700, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Andrew Chen wrote:
> > Don't know if anyone else has been building QueryFilter extension
Don't know if anyone else has been building QueryFilter extensions,
but here's a quick question. In the filter method that needs to be
implemented as part of the QueryFilter interface, the method looks
like this:
public BooleanQuery filter(Query input, BooleanQuery translation)
Query is a Nutch c