Redirecting this to lucene-dev, which seems more appropriate.

Clemens is the person to talk to.

Yes, I thought of that, but it always felt like a weird idea to me. I can't really explain why... Clemens, what do you think about this?

I was imagining something like skipping the link parts that are the same in the previous link... and now I know where I got that :)
Otis

--- Ype Kingma <[EMAIL PROTECTED]> wrote:
> I managed to lose some recent messages on the LARM crawler and the
> lucene file formats, so I don't know whom to address.
>
> Anyway, I noticed this on the LARM crawler info page
> http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
> <<<
> Something worth while would be to compress the URLs. A lot of parts
> of URLs are the same between hundreds of URLs (i.e. the host name).
> And since only a limited number of characters are allowed in URLs,
> Huffman compression will lead to a good compression rate.
> >>>
>
> and this on the file formats page
> http://jakarta.apache.org/lucene/docs/fileformats.html
> <<<
> Term text prefixes are shared. The PrefixLength is the number of
> initial characters from the previous term which must be pre-pended
> to a term's suffix in order to form the term's text. Thus, if the
> previous term's text was "bone" and the term is "boy", the
> PrefixLength is two and the suffix is "y".
> >>>
>
> Somehow I get the impression that lucene itself would be quite
> helpful for the crawler by using indexed, non stored fields for the
> normalized visited URLs.
>
> Have fun,
> Ype
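
To make the "skip the parts that are the same as in the previous link" idea concrete, here is a rough sketch of prefix sharing applied to a sorted list of URLs, in the spirit of the PrefixLength description quoted from the file formats page. It is only an illustration, not LARM or Lucene code: the function names (common_prefix_length, front_encode, front_decode) are made up for this example, and it assumes the URLs are already normalized and sorted so that neighbours share long prefixes.

```python
from typing import List, Tuple


def common_prefix_length(a: str, b: str) -> int:
    """Count the leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_urls: List[str]) -> List[Tuple[int, str]]:
    """Store each entry as (prefix length shared with the previous entry, suffix)."""
    encoded = []
    previous = ""
    for url in sorted_urls:
        k = common_prefix_length(previous, url)
        encoded.append((k, url[k:]))
        previous = url
    return encoded


def front_decode(encoded: List[Tuple[int, str]]) -> List[str]:
    """Rebuild the full strings by re-using the prefix of the previous entry."""
    urls = []
    previous = ""
    for k, suffix in encoded:
        url = previous[:k] + suffix
        urls.append(url)
        previous = url
    return urls


# The example from the file formats page:
# front_encode(["bone", "boy"]) gives [(0, "bone"), (2, "y")]
```

With URLs from one host, most entries shrink to a small number plus a short path suffix, which is why sorting first matters. The catch is that a single entry can only be decoded by walking forward from an entry stored in full, so a real implementation would probably store a full URL every so often.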
