There's a good paper on compressing URLs at http://citeseer.nj.nec.com/suel01compressing.html. It takes advantage of the regular structure of a sorted list of URLs and compresses the resulting structure with Huffman encoding. I have already implemented a somewhat simpler algorithm that compresses URLs based on their shared prefixes; I may contribute that a little later.
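To give an idea of what I mean by prefix-based compression: since the URL list is sorted, each URL tends to share a long prefix with its predecessor, so it's enough to store the shared-prefix length plus the differing suffix. Here's a minimal sketch of that scheme (class and method names are just illustrative, not my actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: delta-encode a sorted URL list as (sharedPrefixLength, suffix) pairs.
// Adjacent sorted URLs usually share a long prefix (scheme + host + path),
// so only the differing tail needs to be stored.
public class UrlPrefixCodec {

    public static final class Entry {
        public final int prefixLen;   // chars shared with the previous URL
        public final String suffix;   // remaining characters of this URL
        Entry(int prefixLen, String suffix) {
            this.prefixLen = prefixLen;
            this.suffix = suffix;
        }
    }

    // Length of the common prefix of two strings.
    static int sharedPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static List<Entry> encode(List<String> sortedUrls) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String url : sortedUrls) {
            int p = sharedPrefix(prev, url);
            out.add(new Entry(p, url.substring(p)));
            prev = url;
        }
        return out;
    }

    public static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String url = prev.substring(0, e.prefixLen) + e.suffix;
            out.add(url);
            prev = url;
        }
        return out;
    }
}
```

The suffixes could then be Huffman-coded on top of this, as the paper suggests, since URLs draw from a small character set.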
----- Original Message -----
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, October 30, 2002 11:00 PM
Subject: Re: LARM web crawler: use lucene itself for visited URLs

> Redirecting this to lucene-dev, seems more appropriate.
>
> Clemens is the person to talk to.
> Yes, I thought of that, but it always felt like a weird idea to me. I
> can't really explain why.... Clemens, what do you think about this? I
> was imagining something like skipping the link parts that are the same
> in the previous link....and now I know where I got that :)
>
> Otis
>
> --- Ype Kingma <[EMAIL PROTECTED]> wrote:
> >
> > I managed to lose some recent messages on the LARM crawler and the
> > lucene file formats, so I don't know whom to address.
> >
> > Anyway, I noticed this on the LARM crawler info page
> > http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
> > <<<
> > Something worthwhile would be to compress the URLs. A lot of parts
> > of URLs are the same between hundreds of URLs (e.g. the host name).
> > And since only a limited number of characters are allowed in URLs,
> > Huffman compression will lead to a good compression rate.
> > >>>
> >
> > and this on the file formats page
> > http://jakarta.apache.org/lucene/docs/fileformats.html
> > <<<
> > Term text prefixes are shared. The PrefixLength is the number of
> > initial characters from the previous term which must be pre-pended
> > to a term's suffix in order to form the term's text. Thus, if the
> > previous term's text was "bone" and the term is "boy", the
> > PrefixLength is two and the suffix is "y".
> > >>>
> >
> > Somehow I get the impression that lucene itself would be quite
> > helpful for the crawler by using indexed, non-stored fields for the
> > normalized visited URLs.
> >
> > Have fun,
> > Ype

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
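Incidentally, the term-dictionary encoding Ype quotes from the file formats page is exactly the same trick applied to terms: rebuild each term from the first PrefixLength characters of the previous term plus its own suffix. A tiny sketch of that decoding step (not Lucene's actual reader code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Lucene-style shared-prefix term decoding: each term is the first
// prefixLength chars of the previous term, followed by its stored suffix.
// E.g. previous term "bone", PrefixLength 2, suffix "y" -> "boy".
public class TermPrefixDecode {

    public static List<String> decodeAll(int[] prefixLengths, String[] suffixes) {
        List<String> terms = new ArrayList<>();
        String prev = "";
        for (int i = 0; i < prefixLengths.length; i++) {
            String term = prev.substring(0, prefixLengths[i]) + suffixes[i];
            terms.add(term);
            prev = term;
        }
        return terms;
    }
}
```

So reusing Lucene's term dictionary for visited URLs would give the crawler this prefix compression for free, which I suppose is the point of Ype's suggestion.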
