There's a good paper on compressing URLs at http://citeseer.nj.nec.com/suel01compressing.html. It takes advantage of the regular structure of a sorted list of URLs and compresses the resulting structure with Huffman encoding. I have already implemented a somewhat simpler algorithm that compresses URLs based on their shared prefixes; I may contribute that a little later.
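To give an idea of what I mean by prefix-based compression: since the URL list is sorted, each URL tends to share a long prefix with its predecessor, so it's enough to store the shared-prefix length plus the differing suffix. Here's a minimal sketch of that scheme (class and method names are just illustrative, not my actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: delta-encode a sorted URL list as (sharedPrefixLength, suffix) pairs.
// Adjacent sorted URLs usually share a long prefix (scheme + host + path),
// so only the differing tail needs to be stored.
public class UrlPrefixCodec {

    public static final class Entry {
        public final int prefixLen;   // chars shared with the previous URL
        public final String suffix;   // remaining characters of this URL
        Entry(int prefixLen, String suffix) {
            this.prefixLen = prefixLen;
            this.suffix = suffix;
        }
    }

    // Length of the common prefix of two strings.
    static int sharedPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static List<Entry> encode(List<String> sortedUrls) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String url : sortedUrls) {
            int p = sharedPrefix(prev, url);
            out.add(new Entry(p, url.substring(p)));
            prev = url;
        }
        return out;
    }

    public static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String url = prev.substring(0, e.prefixLen) + e.suffix;
            out.add(url);
            prev = url;
        }
        return out;
    }
}
```

The suffixes could then be Huffman-coded on top of this, as the paper suggests, since URLs draw from a small character set.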
----- Original Message -----
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, October 30, 2002 11:00 PM
Subject: Re: LARM web crawler: use lucene itself for visited URLs

> Redirecting this to lucene-dev, seems more appropriate.
>
> Clemens is the person to talk to.
> Yes, I thought of that, but it always felt like a weird idea to me. I
> can't really explain why.... Clemens, what do you think about this? I
> was imagining something like skipping the link parts that are the same
> in the previous link....and now I know where I got that :)
>
> Otis
>
> --- Ype Kingma <[EMAIL PROTECTED]> wrote:
> >
> > I managed to lose some recent messages on the LARM crawler and the
> > lucene file formats, so I don't know whom to address.
> >
> > Anyway, I noticed this on the LARM crawler info page
> > http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
> > <<<
> > Something worthwhile would be to compress the URLs. A lot of parts
> > of URLs are the same between hundreds of URLs (e.g. the host name).
> > And since only a limited number of characters are allowed in URLs,
> > Huffman compression will lead to a good compression rate.
> > >>>
> >
> > and this on the file formats page
> > http://jakarta.apache.org/lucene/docs/fileformats.html
> > <<<
> > Term text prefixes are shared. The PrefixLength is the number of
> > initial characters from the previous term which must be pre-pended
> > to a term's suffix in order to form the term's text. Thus, if the
> > previous term's text was "bone" and the term is "boy", the
> > PrefixLength is two and the suffix is "y".
> > >>>
> >
> > Somehow I get the impression that lucene itself would be quite
> > helpful for the crawler by using indexed, non-stored fields for the
> > normalized visited URLs.
> >
> > Have fun,
> > Ype

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
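Incidentally, the term-dictionary encoding Ype quotes from the file formats page is exactly the same trick applied to terms: rebuild each term from the first PrefixLength characters of the previous term plus its own suffix. A tiny sketch of that decoding step (not Lucene's actual reader code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Lucene-style shared-prefix term decoding: each term is the first
// prefixLength chars of the previous term, followed by its stored suffix.
// E.g. previous term "bone", PrefixLength 2, suffix "y" -> "boy".
public class TermPrefixDecode {

    public static List<String> decodeAll(int[] prefixLengths, String[] suffixes) {
        List<String> terms = new ArrayList<>();
        String prev = "";
        for (int i = 0; i < prefixLengths.length; i++) {
            String term = prev.substring(0, prefixLengths[i]) + suffixes[i];
            terms.add(term);
            prev = term;
        }
        return terms;
    }
}
```

So reusing Lucene's term dictionary for visited URLs would give the crawler this prefix compression for free, which I suppose is the point of Ype's suggestion.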
