Yeah, but if it's only disk-based it's way too slow. What's needed is a mixture of both. We thought about the following:
- Links show a high degree of locality. That means 90% of the links go back to
  the same host. We have to take advantage of that. We want to hold the links
  of the hosts currently being crawled in RAM if possible. The data structure
  behind that may be what is called a red-black tree: some nodes are in RAM,
  some on disk, some are compressed, some not.
  http://citeseer.nj.nec.com/shkapenyuk01design.html gives insights on this.
- A prerequisite for that is that only a limited number of hosts is crawled at
  any one time. We want to change the crawler threads such that one thread
  loads subsequent URLs only from _one_ host, which also allows for adding the
  politeness features.

--Clemens

----- Original Message -----
From: "Ype Kingma" <[EMAIL PROTECTED]>
To: "Lucene Developers List" <[EMAIL PROTECTED]>; "Clemens Marschner" <[EMAIL PROTECTED]>
Sent: Thursday, October 31, 2002 9:10 AM
Subject: Re: LARM web crawler: use lucene itself for visited URLs

> On Wednesday 30 October 2002 23:30, Clemens Marschner wrote:
> > There's a good paper on compressing URLs:
> > http://citeseer.nj.nec.com/suel01compressing.html. It takes advantage of
> > the regular structure of the sorted list of URLs and compresses the
> > resulting structure with some Huffman encoding.
> > I have already implemented a somewhat simpler algorithm that can compress
> > URLs based on their prefixes. I may contribute that a little later.
>
> Compressing is one part; storing the visited URLs on disk (to save RAM)
> is another. Once the hashtable being used now grows over a maximum size,
> it could be added to a Lucene db, after which a new IndexReader can be
> opened and the table can be flushed from RAM.
> No analyzer is needed to create the Lucene documents, as the URLs are
> already normalized.
> Lookup can be done directly with an IndexReader, in case the lookup in
> RAM fails.
> The nice thing about it is that this way Lucene scales up quite a bit.
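[The RAM-hashtable-plus-index scheme Ype describes above could be sketched roughly as follows. This is a simplified illustration with invented names, not LARM or Lucene code: the immutable sorted `String[]` segments stand in for the on-disk Lucene index, and `flush()` stands in for adding the table to the index and opening a new IndexReader.]

```java
import java.util.*;

// Simplified sketch of a hybrid visited-URL set: a RAM hash table that
// is flushed to an immutable sorted "segment" once it exceeds maxRamSize.
// Lookup checks RAM first and falls back to binary search over the
// segments (standing in for an IndexReader lookup on disk).
class VisitedUrls {
    private final int maxRamSize;
    private final Set<String> ram = new HashSet<>();
    private final List<String[]> segments = new ArrayList<>();

    VisitedUrls(int maxRamSize) { this.maxRamSize = maxRamSize; }

    /** Returns true if the URL had not been seen before. */
    boolean add(String url) {
        if (contains(url)) return false;
        ram.add(url);
        if (ram.size() >= maxRamSize) flush();
        return true;
    }

    boolean contains(String url) {
        if (ram.contains(url)) return true;
        for (String[] seg : segments)            // fall back to "disk"
            if (Arrays.binarySearch(seg, url) >= 0) return true;
        return false;
    }

    private void flush() {                       // like opening a new IndexReader
        String[] seg = ram.toArray(new String[0]);
        Arrays.sort(seg);
        segments.add(seg);
        ram.clear();
    }
}
```

[Since URLs are already normalized, exact-match lookup is all that's needed, which is why no analyzer would be involved on the Lucene side.]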
>
> Have fun,
> Ype
>
> > ----- Original Message -----
> > From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Wednesday, October 30, 2002 11:00 PM
> > Subject: Re: LARM web crawler: use lucene itself for visited URLs
> >
> > > Redirecting this to lucene-dev, seems more appropriate.
> > >
> > > Clemens is the person to talk to.
> > > Yes, I thought of that, but it always felt like a weird idea to me. I
> > > can't really explain why... Clemens, what do you think about this? I
> > > was imagining something like skipping the link parts that are the same
> > > as in the previous link... and now I know where I got that :)
> > >
> > > Otis
> > >
> > > --- Ype Kingma <[EMAIL PROTECTED]> wrote:
> > > > I managed to lose some recent messages on the LARM crawler and the
> > > > Lucene file formats, so I don't know whom to address.
> > > >
> > > > Anyway, I noticed this on the LARM crawler info page
> > > > http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
> > > > <<<
> > > > Something worthwhile would be to compress the URLs. A lot of parts of
> > > > URLs are the same between hundreds of URLs (i.e. the host name). And
> > > > since only a limited number of characters are allowed in URLs,
> > > > Huffman compression will lead to a good compression rate.
> > > > >>>
> > > > and this on the file formats page
> > > > http://jakarta.apache.org/lucene/docs/fileformats.html
> > > > <<<
> > > > Term text prefixes are shared. The PrefixLength is the number of
> > > > initial characters from the previous term which must be pre-pended to
> > > > a term's suffix in order to form the term's text. Thus, if the
> > > > previous term's text was "bone" and the term is "boy", the
> > > > PrefixLength is two and the suffix is "y".
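[The shared-prefix scheme from the file formats page, applied to a sorted URL list, could look like the following sketch. This is an illustration with invented names, not the algorithm Clemens implemented: each entry is stored as the number of leading characters shared with the previous entry plus the remaining suffix (so "bone" followed by "boy" becomes prefix length 2, suffix "y"). A real implementation would write these as variable-length integers and could additionally Huffman-code the suffixes.]

```java
import java.util.*;

// Sketch of prefix compression over a sorted list of strings/URLs:
// compress() emits (sharedPrefixLength, suffix) pairs relative to the
// previous entry; decompress() rebuilds the original list.
class PrefixCodec {
    static List<Map.Entry<Integer, String>> compress(List<String> sorted) {
        List<Map.Entry<Integer, String>> out = new ArrayList<>();
        String prev = "";
        for (String s : sorted) {
            int p = 0;
            int max = Math.min(prev.length(), s.length());
            while (p < max && prev.charAt(p) == s.charAt(p)) p++;
            out.add(new AbstractMap.SimpleEntry<>(p, s.substring(p)));
            prev = s;
        }
        return out;
    }

    static List<String> decompress(List<Map.Entry<Integer, String>> enc) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Map.Entry<Integer, String> e : enc) {
            String s = prev.substring(0, e.getKey()) + e.getValue();
            out.add(s);
            prev = s;
        }
        return out;
    }
}
```

[Because URLs from one host sort next to each other, the shared prefixes are long (often the entire scheme-plus-host part), which is what makes this pay off for a crawler's URL lists.]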
> > > > >>>
> > > >
> > > > Somehow I get the impression that Lucene itself would be quite
> > > > helpful for the crawler, by using indexed, non-stored fields for the
> > > > normalized visited URLs.
> > > >
> > > > Have fun,
> > > > Ype

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
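[Clemens's per-host crawling plan at the top of the thread could be sketched as below. This is a hypothetical illustration, not LARM code: URLs are grouped by host so one crawler thread works on a single host at a time, only the queues of hosts currently being crawled need to stay in RAM (idle hosts' queues could be spilled to disk), and the per-host grouping is the natural hook for politeness delays.]

```java
import java.net.URI;
import java.util.*;

// Sketch of a per-host URL frontier: add() buckets URLs by host,
// leaseHost() hands one host's whole queue to a single crawler thread.
class HostFrontier {
    private final Map<String, Deque<String>> byHost = new HashMap<>();
    private final Deque<String> idleHosts = new ArrayDeque<>();

    synchronized void add(String url) {
        String host = URI.create(url).getHost();
        Deque<String> q = byHost.get(host);
        if (q == null) {
            q = new ArrayDeque<>();
            byHost.put(host, q);
            idleHosts.add(host);   // host becomes available for a thread
        }
        q.add(url);
    }

    /** One crawler thread leases a host and receives all its queued URLs. */
    synchronized List<String> leaseHost() {
        String host = idleHosts.poll();
        if (host == null) return Collections.emptyList();
        return new ArrayList<>(byHost.remove(host));
    }
}
```

[With this shape, politeness (one connection or a fixed delay per host) lives entirely inside the thread holding the lease, and the 90% same-host link locality means most newly discovered links land in the queue the thread is already working on.]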
