Re: Is Python good for web crawlers?

2006-02-07 Thread Andrew Gwozdziewycz
On 7 Feb 2006 08:33:28 -0800, Tempo <[EMAIL PROTECTED]> wrote:
> I was wondering if python is a good language to build a web crawler
> with? For example, to construct a program that will routinely search x
> amount of sites to check the availability of a product. Or to search
> for news articles containing the word 'XYZ'. These are just random
> ideas to try to explain my question a bit further. Well if you have an
> opinion about this please let me know because I am very interested to
> hear what you have to say. Thanks.
>

Google supplies a basic web crawler as a Google Desktop plugin called
Kongulo (http://sourceforge.net/projects/goog-kongulo/), which is
written in Python. I would think Python would be perfect for this sort
of application. Your bottleneck is always going to be downloading the
page.

--
Andrew Gwozdziewycz <[EMAIL PROTECTED]>
http://ihadagreatview.org
http://plasticandroid.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Python good for web crawlers?

2006-02-07 Thread Tempo
Why do you say that the bottleneck of the crawler will always be
downloading the page? Is it because there isn't already a module to do
this and I will have to start from scratch? Or a bandwidth issue?



Re: Is Python good for web crawlers?

2006-02-07 Thread Diez B. Roggisch
Tempo wrote:

> Why do you say that the bottleneck of the crawler will always be
> downloading the page? Is it because there isn't already a module to do
> this and I will have to start from scratch? Or a bandwidth issue?

Because of bandwidth - not necessarily yours directly, but the maximum flow
between your uplink and the site in question. It will always take anywhere
from a fraction of a second to several seconds until the data is there - in
that time, lots of Python code can run.
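
One way to act on this, sketched under assumptions that are not in the
thread: since each download mostly waits on the network, a crawler can
overlap several downloads with a thread pool from the Python 3 standard
library. The names fetch, fetch_all and slow_fake_fetcher are illustrative,
and the fake fetcher only simulates latency with a sleep so the sketch can
run without network access.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10):
    """Download one page and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetcher=fetch, max_workers=8):
    """Fetch several URLs concurrently and return {url: body}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its body.
        bodies = list(pool.map(fetcher, urls))
    return dict(zip(urls, bodies))


def slow_fake_fetcher(url):
    """Hypothetical stand-in for a real download: waits, then returns a page."""
    time.sleep(0.2)  # simulated network latency (an assumption)
    return b"<html>" + url.encode() + b"</html>"
```

With real URLs, fetch_all(urls) goes through urllib.request.urlopen; the
total time then tends toward the slowest single download rather than the sum
of all of them, as long as bandwidth is not the limit.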

Diez


Re: Is Python good for web crawlers?

2006-02-07 Thread Tempo
Does a web crawler have to download an entire page if it only needs to
check if the product is in stock on a page? Or if it just needs to
search for one match of a certain word on a page?



Re: Is Python good for web crawlers?

2006-02-07 Thread Tim Parkin
Tempo wrote:

>Does a web crawler have to download an entire page if it only needs to
>check if the product is in stock on a page? Or if it just needs to
>search for one match of a certain word on a page?
>
>  
>
Typically you would download the whole HTML file and then perform any
analysis on it. It is possible to parse the stream of characters as
they come back from the server, but on average this would only cut
the download time in half (presuming the item you want is a single
byte long and equally likely to appear anywhere in the HTML). In reality,
unless the pages you are requesting are very large (200k+) or your bandwidth
is very expensive (in time and/or capacity), it is probably easier for
you to just download the whole file.
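
The streaming approach can be sketched as follows; this is a minimal
illustration, not code from the thread. stream_contains is a hypothetical
helper that reads any file-like object in chunks and stops at the first
match, keeping a small overlap so that a match straddling two chunks is not
missed. An io.BytesIO object stands in for the HTTP response here.

```python
import io


def stream_contains(stream, needle, chunk_size=4096):
    """Return (found, bytes_read), reading only as far as necessary."""
    if not needle:
        return True, 0
    tail = b""
    bytes_read = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return False, bytes_read  # end of stream, no match
        bytes_read += len(chunk)
        window = tail + chunk
        if needle in window:
            return True, bytes_read  # stop early, skip the rest
        # Keep the last len(needle) - 1 bytes to catch boundary matches.
        tail = window[-(len(needle) - 1):] if len(needle) > 1 else b""


# Simulated page with the phrase roughly in the middle.
page = b"x" * 10000 + b"In stock" + b"y" * 10000
found, bytes_read = stream_contains(io.BytesIO(page), b"In stock", chunk_size=1024)
```

The object returned by urllib.request.urlopen also supports read(n), so the
same helper should work on a live response; whether the saving is worth the
extra code depends on page size, as noted above.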

I would recommend that you use BeautifulSoup to parse badly formatted
html documents (which is most of the web). (google 'beautiful soup' and
you should find it easily).
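
For the crawling part specifically, here is a small sketch that pulls links
out of sloppy markup. It uses html.parser from the Python 3 standard library
as a stand-in, not BeautifulSoup's own API; LinkCollector and extract_links
are illustrative names, and example.invalid is a placeholder host.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, resolved against base_url."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Unclosed <p>, unclosed <a>, mixed-case tags: typical of real pages.
messy = '<p>Unclosed paragraph<A HREF="/item/1">One</a><a href="item/2">Two<br><a name="x">no href</a>'
links = extract_links(messy, "http://example.invalid/shop/")
```

A crawler would push the returned links onto its queue of pages to visit.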

Tim Parkin


Re: Is Python good for web crawlers?

2006-02-07 Thread Tempo
I took your advice and got a copy of BeautifulSoup, but I am having
trouble installing the module. Any advice? I noticed that I just can't
put it into the 'lib' directory of python to install it.



Re: Is Python good for web crawlers?

2006-02-07 Thread George Sakkis
You may want to start from HarvestMan:

http://harvestman.freezope.org/

George



Re: Is Python good for web crawlers?

2006-02-07 Thread Tim Parkin
Tempo wrote:

>I took your advice and got a copy of BeautifulSoup, but I am having
>trouble installing the module. Any advice? I noticed that I just can't
>put it into the 'lib' directory of python to install it.
>
>  
>
Just save the file in the same directory as your project; then you should
be able to use the sample code.
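
This works because, when a script runs, Python puts the script's own
directory first on its module search path (sys.path). The sketch below
imitates that with a temporary directory; single_file_lib is a made-up
module name used only for illustration, not the BeautifulSoup file itself.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for "your project directory" (an assumption for this sketch).
project_dir = Path(tempfile.mkdtemp())

# Stand-in for a single-file library saved next to your own script.
(project_dir / "single_file_lib.py").write_text(
    "def greet():\n    return 'imported from project directory'\n"
)

# Running a script in project_dir would do this step automatically.
sys.path.insert(0, str(project_dir))
importlib.invalidate_caches()

single_file_lib = importlib.import_module("single_file_lib")
message = single_file_lib.greet()
```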

Tim Parkin


Re: Is Python good for web crawlers?

2006-02-07 Thread Paul Rubin
"Tempo" <[EMAIL PROTECTED]> writes:
> I was wondering if python is a good language to build a web crawler
> with? For example, to construct a program that will routinely search x
> amount of sites to check the availability of a product. Or to search
> for news articles containing the word 'XYZ'. These are just random
> ideas to try to explain my question a bit further.

I've written a few of these in Python.  The language itself is fine
for this.  The built-in libraries do most of what you'd hope, though
they have room for improvement.  Generally I use urllib.urlopen(url).read()
to get the whole html page as a string, then process it from there.  I just
look for the substrings I'm interested in, making no attempt to
actually parse the html into a DOM or anything like that.
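
A rough sketch of that download-then-search approach, written for Python 3,
where the call lives at urllib.request.urlopen. The helper names download
and find_phrases are illustrative, and the UTF-8 fallback is an assumption
for pages that do not declare a charset.

```python
import urllib.request


def download(url, timeout=10):
    """Fetch a page and return it as one string."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def find_phrases(html, phrases):
    """Return the phrases that occur in html, ignoring case."""
    lowered = html.lower()
    return [phrase for phrase in phrases if phrase.lower() in lowered]


# No parsing at all: the markup is treated as plain text.
sample = "<p>Widget 3000: IN STOCK, ships today</p>"
hits = find_phrases(sample, ["in stock", "sold out"])
```

In practice the string would come from download(url). Plain substring
matching is quick to write but can also match text inside comments, scripts
or attributes, which is where a real parser starts to pay off.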


Re: Is Python good for web crawlers?

2006-02-07 Thread Xavier Morel
Paul Rubin wrote:
> Generally I use urllib.urlopen(url).read() to get
> the whole html page as a string, then process it from there.  I just
> look for the substrings I'm interested in, making no attempt to
> actually parse the html into a DOM or anything like that.
>
BeautifulSoup works *really* well when you want to parse the source 
(e.g. when you don't want to use string matching, or when the structures 
you're looking for are a bit too complicated for simple string 
matching/substring search)

The API of the package is extremely simple, straightforward and... obvious.


Re: Is Python good for web crawlers?

2006-02-07 Thread Paul Rubin
Xavier Morel <[EMAIL PROTECTED]> writes:
> BeautifulSoup...
> The API of the package is extremely simple, straightforward and... obvious.

I did not find that.  I spent a few minutes looking at the
documentation and it wasn't obvious at all how to use it.  Maybe I
could have figured it out with more effort, but I got whatever the
immediate task was done without it instead.  It does look like a nice
package but the docs need improvement.


Re: Is Python good for web crawlers?

2006-02-07 Thread Tempo
I agree. I think the way that I will learn to use most of it is by
going through the source code.



Re: Is Python good for web crawlers?

2006-02-08 Thread Magnus Lycka
Tempo wrote:
> I was wondering if python is a good language to build a web crawler
> with? For example, to construct a program that will routinely search x
> amount of sites to check the availability of a product. Or to search
> for news articles containing the word 'XYZ'. These are just random
> ideas to try to explain my question a bit further. Well if you have an
> opinion about this please let me know because I am very interested to
> hear what you have to say. Thanks.

I dunno, but there are these two guys, Sergey Brin and Lawrence Page,
who wrote a web crawler in Python. As far as I understood, they were
fairly successful with it. I think they called their system Koogle,
Bugle, or Gobble or something like that. Goo...can't remember.

See http://www-db.stanford.edu/~backrub/google.html

They've also employed some clever Python programmers, such as Greg
Stein, Alex Martelli (isn't he a bot?) and some obscure Dutch
mathematician called Guido van something. It seems they still like
Python.


Re: Is Python good for web crawlers?

2006-02-08 Thread Simon Brunning
On 2/8/06, Alex Martelli <[EMAIL PROTECTED]> wrote:
>
> Bot? me? did I fail a Turing test again without even noticing?!

If you'd noticed the test, you'd have passed.

;-)

--
Cheers,
Simon B,
[EMAIL PROTECTED],
http://www.brunningonline.net/simon/blog/


Re: Is Python good for web crawlers?

2006-02-08 Thread Alex Martelli
Magnus Lycka <[EMAIL PROTECTED]> wrote:
   ...
> I dunno, but there are these two guys, Sergey Brin and Lawrence Page,
> who wrote a web crawler in Python. As far as I understood, they were
> fairly successful with it. I think they called their system Koogle,
> Bugle, or Gobble or something like that. Goo...can't remember.
> 
> See http://www-db.stanford.edu/~backrub/google.html

Yeah, I've heard of them, too.


> They've also employed some clever Python programmers, such as Greg
> Stein, Alex Martelli (isn't he a bot?) and some obscure Dutch
> mathematician called Guido van something. It seems they still like
> Python.

Bot? me? did I fail a Turing test again without even noticing?!


Alex


Re: Is Python good for web crawlers?

2006-02-08 Thread Magnus Lycka
Simon Brunning wrote:
> On 2/8/06, Alex Martelli <[EMAIL PROTECTED]> wrote:
> 
>>Bot? me? did I fail a Turing test again without even noticing?!
> 
> 
> If you'd noticed the test, you'd have passed.

No no, it's just a regular expression that notices the
word 'bot' close to 'Martelli'. Wouldn't surprise me
if more or less the same message appears again as a
response to this post. ;)


Re: Is Python good for web crawlers?

2006-02-09 Thread John J. Lee
[EMAIL PROTECTED] (Alex Martelli) writes:

> Magnus Lycka <[EMAIL PROTECTED]> wrote:
>...
> > I dunno, but there are these two guys, Sergey Brin and Lawrence Page,
> > who wrote a web crawler in Python. As far as I understood, they were
> > fairly successful with it. I think they called their system Koogle,
> > Bugle, or Gobble or something like that. Goo...can't remember.
> > 
> > See http://www-db.stanford.edu/~backrub/google.html
> 
> Yeah, I've heard of them, too.

I wonder if that little outfit has considered open-sourcing any of
their web client code?

(Declaring my interest: I'm maintaining, and very slowly developing,
some open-source libraries for web scraping and testing)


John



Re: Is Python good for web crawlers?

2006-02-09 Thread Alex Martelli
John J. Lee <[EMAIL PROTECTED]> wrote:
   ...
> I wonder if that little outfit has considered open-sourcing any of
> their web client code?

What they've open-sourced so far is listed at
 -- of these, the only
crawl/spider is Könguló, so far.


Alex


Re: Is Python good for web crawlers?

2006-02-10 Thread gene tani

Paul Rubin wrote:
> Xavier Morel <[EMAIL PROTECTED]> writes:
> > BeautifulSoup...
> > The API of the package is extremely simple, straightforward and... obvious.
>
> I did not find that.  I spent a few minutes looking at the
> documentation and it wasn't obvious at all how to use it.  Maybe I

1. read about Soup and mechanize
http://sig.levillage.org/?p=599

2. flip thru O'Reilly's Spidering Hacks book (put on YAPH t-shirt)

3. go at your task

4. write Spidering Hacks in Python, 1st edition.  Cite me as
inspiration.
