Yeah, but if it's only disk-based it's way too slow. What's needed is a mixture of both. We thought about the following:
- Links show a high degree of locality. That means 90% of the links go back to
  the same host. We have to take advantage of that. We want to hold the links
  of the hosts currently being crawled in RAM if possible. The data structure
  behind that may be what is called a red-black tree: some nodes are in RAM,
  some on disk, some are compressed, some not.
  http://citeseer.nj.nec.com/shkapenyuk01design.html gives insights on this.
- A prerequisite for that is that only a limited number of hosts is crawled at
  any one time. We want to change the crawler threads such that one thread
  loads subsequent URLs only from _one_ host, which also allows for adding the
  politeness features.

--Clemens

----- Original Message -----
From: "Ype Kingma" <[EMAIL PROTECTED]>
To: "Lucene Developers List" <[EMAIL PROTECTED]>; "Clemens Marschner" <[EMAIL PROTECTED]>
Sent: Thursday, October 31, 2002 9:10 AM
Subject: Re: LARM web crawler: use lucene itself for visited URLs

> On Wednesday 30 October 2002 23:30, Clemens Marschner wrote:
> > There's a good paper on compressing URLs:
> > http://citeseer.nj.nec.com/suel01compressing.html. It takes advantage of
> > the regular structure of the sorted list of URLs and compresses the
> > resulting structure with some Huffman encoding.
> > I have already implemented a somewhat simpler algorithm that can compress
> > URLs based on their prefixes. I may contribute that a little later.
>
> Compressing is one part; storing the visited URLs on disk (to save RAM)
> is another. Once the hashtable being used now grows over a maximum size,
> it could be added to a Lucene db, after which a new IndexReader can be
> opened and the table can be flushed from RAM.
> No analyzer is needed to create the Lucene documents, as the URLs are
> already normalized.
> Lookup can be done directly with an IndexReader, in case the lookup in
> RAM fails.
> The nice thing about it is that this way Lucene scales up quite a bit.
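[The RAM-hashtable-plus-index scheme Ype describes above could be sketched roughly as follows. This is a simplified illustration with invented names, not LARM or Lucene code: the immutable sorted `String[]` segments stand in for the on-disk Lucene index, and `flush()` stands in for adding the table to the index and opening a new IndexReader.]

```java
import java.util.*;

// Simplified sketch of a hybrid visited-URL set: a RAM hash table that
// is flushed to an immutable sorted "segment" once it exceeds maxRamSize.
// Lookup checks RAM first and falls back to binary search over the
// segments (standing in for an IndexReader lookup on disk).
class VisitedUrls {
    private final int maxRamSize;
    private final Set<String> ram = new HashSet<>();
    private final List<String[]> segments = new ArrayList<>();

    VisitedUrls(int maxRamSize) { this.maxRamSize = maxRamSize; }

    /** Returns true if the URL had not been seen before. */
    boolean add(String url) {
        if (contains(url)) return false;
        ram.add(url);
        if (ram.size() >= maxRamSize) flush();
        return true;
    }

    boolean contains(String url) {
        if (ram.contains(url)) return true;
        for (String[] seg : segments)            // fall back to "disk"
            if (Arrays.binarySearch(seg, url) >= 0) return true;
        return false;
    }

    private void flush() {                       // like opening a new IndexReader
        String[] seg = ram.toArray(new String[0]);
        Arrays.sort(seg);
        segments.add(seg);
        ram.clear();
    }
}
```

[Since URLs are already normalized, exact-match lookup is all that's needed, which is why no analyzer would be involved on the Lucene side.]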
>
> Have fun,
> Ype
>
> > ----- Original Message -----
> > From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Wednesday, October 30, 2002 11:00 PM
> > Subject: Re: LARM web crawler: use lucene itself for visited URLs
> >
> > > Redirecting this to lucene-dev, seems more appropriate.
> > >
> > > Clemens is the person to talk to.
> > > Yes, I thought of that, but it always felt like a weird idea to me. I
> > > can't really explain why... Clemens, what do you think about this? I
> > > was imagining something like skipping the link parts that are the same
> > > as in the previous link... and now I know where I got that :)
> > >
> > > Otis
> > >
> > > --- Ype Kingma <[EMAIL PROTECTED]> wrote:
> > > > I managed to lose some recent messages on the LARM crawler and the
> > > > Lucene file formats, so I don't know whom to address.
> > > >
> > > > Anyway, I noticed this on the LARM crawler info page
> > > > http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
> > > > <<<
> > > > Something worthwhile would be to compress the URLs. A lot of parts of
> > > > URLs are the same between hundreds of URLs (i.e. the host name). And
> > > > since only a limited number of characters are allowed in URLs,
> > > > Huffman compression will lead to a good compression rate.
> > > > >>>
> > > > and this on the file formats page
> > > > http://jakarta.apache.org/lucene/docs/fileformats.html
> > > > <<<
> > > > Term text prefixes are shared. The PrefixLength is the number of
> > > > initial characters from the previous term which must be pre-pended to
> > > > a term's suffix in order to form the term's text. Thus, if the
> > > > previous term's text was "bone" and the term is "boy", the
> > > > PrefixLength is two and the suffix is "y".
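[The shared-prefix scheme from the file formats page, applied to a sorted URL list, could look like the following sketch. This is an illustration with invented names, not the algorithm Clemens implemented: each entry is stored as the number of leading characters shared with the previous entry plus the remaining suffix (so "bone" followed by "boy" becomes prefix length 2, suffix "y"). A real implementation would write these as variable-length integers and could additionally Huffman-code the suffixes.]

```java
import java.util.*;

// Sketch of prefix compression over a sorted list of strings/URLs:
// compress() emits (sharedPrefixLength, suffix) pairs relative to the
// previous entry; decompress() rebuilds the original list.
class PrefixCodec {
    static List<Map.Entry<Integer, String>> compress(List<String> sorted) {
        List<Map.Entry<Integer, String>> out = new ArrayList<>();
        String prev = "";
        for (String s : sorted) {
            int p = 0;
            int max = Math.min(prev.length(), s.length());
            while (p < max && prev.charAt(p) == s.charAt(p)) p++;
            out.add(new AbstractMap.SimpleEntry<>(p, s.substring(p)));
            prev = s;
        }
        return out;
    }

    static List<String> decompress(List<Map.Entry<Integer, String>> enc) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Map.Entry<Integer, String> e : enc) {
            String s = prev.substring(0, e.getKey()) + e.getValue();
            out.add(s);
            prev = s;
        }
        return out;
    }
}
```

[Because URLs from one host sort next to each other, the shared prefixes are long (often the entire scheme-plus-host part), which is what makes this pay off for a crawler's URL lists.]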
> > > > >>>
> > > >
> > > > Somehow I get the impression that Lucene itself would be quite
> > > > helpful for the crawler, by using indexed, non-stored fields for the
> > > > normalized visited URLs.
> > > >
> > > > Have fun,
> > > > Ype

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
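[Clemens's per-host crawling plan at the top of the thread could be sketched as below. This is a hypothetical illustration, not LARM code: URLs are grouped by host so one crawler thread works on a single host at a time, only the queues of hosts currently being crawled need to stay in RAM (idle hosts' queues could be spilled to disk), and the per-host grouping is the natural hook for politeness delays.]

```java
import java.net.URI;
import java.util.*;

// Sketch of a per-host URL frontier: add() buckets URLs by host,
// leaseHost() hands one host's whole queue to a single crawler thread.
class HostFrontier {
    private final Map<String, Deque<String>> byHost = new HashMap<>();
    private final Deque<String> idleHosts = new ArrayDeque<>();

    synchronized void add(String url) {
        String host = URI.create(url).getHost();
        Deque<String> q = byHost.get(host);
        if (q == null) {
            q = new ArrayDeque<>();
            byHost.put(host, q);
            idleHosts.add(host);   // host becomes available for a thread
        }
        q.add(url);
    }

    /** One crawler thread leases a host and receives all its queued URLs. */
    synchronized List<String> leaseHost() {
        String host = idleHosts.poll();
        if (host == null) return Collections.emptyList();
        return new ArrayList<>(byHost.remove(host));
    }
}
```

[With this shape, politeness (one connection or a fixed delay per host) lives entirely inside the thread holding the lease, and the 90% same-host link locality means most newly discovered links land in the queue the thread is already working on.]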
