One additional word. The 3.2 index structure will make it possible
to index large amounts of data, but the way the crawler works still prevents
updating a large number of URLs: at present an update retries *all* of them.
It should have some heuristics along the lines of: retry a URL only 15 days
after a successful fetch, and spread URL updates so that at most N URLs are
checked per day, to avoid saturating the bandwidth, etc.
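
A rough sketch of such a revisit heuristic is below. It is only an
illustration, not ht://Dig code: the UrlRecord struct, the
selectUrlsToRecheck function, the 15-day window and the per-day cap N are
placeholder names and values standing in for whatever the real crawler
would store and configure.

// Sketch of a revisit-scheduling heuristic for the crawler (not ht://Dig code).
// A URL is retried only if its last successful fetch is older than a minimum
// revisit interval, and at most max_per_day URLs are selected per run.
#include <algorithm>
#include <cstddef>
#include <ctime>
#include <iostream>
#include <string>
#include <vector>

struct UrlRecord {
    std::string url;
    std::time_t last_success;   // 0 if never fetched successfully
};

std::vector<std::string>
selectUrlsToRecheck(const std::vector<UrlRecord>& records,
                    std::time_t now,
                    int revisit_days,          // e.g. 15
                    std::size_t max_per_day)   // the "N" cap on daily checks
{
    const std::time_t min_age =
        static_cast<std::time_t>(revisit_days) * 24 * 60 * 60;

    // Keep only URLs that are due: never fetched, or fetched long enough ago.
    std::vector<const UrlRecord*> due;
    for (const UrlRecord& r : records)
        if (r.last_success == 0 || now - r.last_success >= min_age)
            due.push_back(&r);

    // Check the stalest URLs first so nothing starves behind newer entries.
    std::sort(due.begin(), due.end(),
              [](const UrlRecord* a, const UrlRecord* b) {
                  return a->last_success < b->last_success;
              });

    // Cap the batch size to avoid saturating the bandwidth in one day.
    if (due.size() > max_per_day)
        due.resize(max_per_day);

    std::vector<std::string> out;
    for (const UrlRecord* r : due)
        out.push_back(r->url);
    return out;
}

int main()
{
    std::time_t now = std::time(nullptr);
    std::vector<UrlRecord> records = {
        {"http://example.com/a.html", now - 20 * 24 * 60 * 60}, // 20 days old: due
        {"http://example.com/b.html", now - 2 * 24 * 60 * 60},  //  2 days old: skip
        {"http://example.com/c.html", 0},                       // never fetched: due
    };
    for (const std::string& u : selectUrlsToRecheck(records, now, 15, 1000))
        std::cout << u << '\n';
    return 0;
}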
Geoff Hutchison writes:
>
> On Mon, 16 Aug 1999, William Freman wrote:
>
> > memory and disk space... what about a quad PIII/550 Xeon with
> > 2GB RAM and a 5TB RAID array on a T3? Something like that would take
> > away a good number of the theoretical hindrances. With those out of the
> > way, would it be possible to index the web?
>
> I still wouldn't recommend it. It's only recently that we've been
> receiving feedback on scaling to huge (e.g. 500,000+ URL) indexes. In
> particular, the 3.1.x series requires the htmerge phase, which in turn
> requires sorting the word database, and even for modest-sized databases
> that sort can take an enormous amount of RAM.
>
> The 3.2 development code should help with some of these performance
> bottlenecks.
>
> -Geoff Hutchison
> Williams Students Online
> http://wso.williams.edu/
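
On the htmerge point above: the RAM cost comes from sorting the whole word
database in memory. The sketch below is only an illustration of how such a
sort can run in bounded memory by spilling sorted runs to disk and merging
them; it is not the htmerge or 3.2 implementation, and the file names and
chunk size are made up for the example.

// A minimal external merge sort sketch (not htmerge): words are read one per
// line, sorted in memory-sized chunks that are spilled to disk as sorted runs,
// then the runs are merged with a small heap so peak RAM stays bounded.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Phase 1: sort chunks of at most chunk_size words, write each as a run file.
static std::vector<std::string>
writeSortedRuns(std::istream& in, std::size_t chunk_size)
{
    std::vector<std::string> run_files;
    std::vector<std::string> chunk;
    std::string word;
    bool more = true;
    while (more) {
        more = static_cast<bool>(std::getline(in, word));
        if (more)
            chunk.push_back(word);
        if ((!more || chunk.size() >= chunk_size) && !chunk.empty()) {
            std::sort(chunk.begin(), chunk.end());
            std::string name = "run" + std::to_string(run_files.size()) + ".tmp";
            std::ofstream run(name);
            for (const std::string& w : chunk)
                run << w << '\n';
            run_files.push_back(name);
            chunk.clear();
        }
    }
    return run_files;
}

// Phase 2: k-way merge of the sorted runs; memory use is proportional to the
// number of runs, not to the total amount of data.
static void mergeRuns(const std::vector<std::string>& run_files, std::ostream& out)
{
    typedef std::pair<std::string, std::size_t> Entry;   // (word, run index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;
    std::vector<std::ifstream> runs;
    std::string word;

    for (std::size_t i = 0; i < run_files.size(); ++i) {
        runs.emplace_back(run_files[i]);
        if (std::getline(runs[i], word))
            heap.push(Entry(word, i));
    }
    while (!heap.empty()) {
        Entry top = heap.top();
        heap.pop();
        out << top.first << '\n';
        if (std::getline(runs[top.second], word))
            heap.push(Entry(word, top.second));
    }
    for (std::size_t i = 0; i < run_files.size(); ++i)
        std::remove(run_files[i].c_str());
}

int main()
{
    // Example: sort words from stdin to stdout, keeping at most 100000 in RAM.
    std::vector<std::string> runs = writeSortedRuns(std::cin, 100000);
    mergeRuns(runs, std::cout);
    return 0;
}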