Re: web crawler in python

2009-12-10 Thread Philip Semanchuk


On Dec 9, 2009, at 7:39 PM, my name wrote:


I'm currently planning on writing a web crawler in python but have a
question as far as how I should design it. My goal is speed and maximum
efficient use of the hardware\bandwidth I have available.

As of now I have a Dual 2.4ghz xeon box, 4gb ram, 500gb sata and a 20mbps
bandwidth cap (for now) . Running FreeBSD.

What would be the best way to design the crawler? Using the thread module?
Would I be able to max out this connection with the hardware listed above
using python threads?


I wrote a web crawler in Python (under FreeBSD, in fact) and I chose  
to do it using separate processes. Process A would download pages and  
write them to disk, process B would attempt to convert them to  
Unicode, process C would evaluate the content, etc. That worked well  
for me because the processes were very independent of one another so  
they had very little data to share. Each process had a work queue  
(Postgres database table); process A would feed B's queue, B would
feed C and D's queues, etc.
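
A rough sketch of the table-as-work-queue idea described above. This is not the original crawler's code: sqlite3 stands in for Postgres so the example is self-contained, the table and column names are hypothetical, and the stages run one after another in a single process here, where the real design used separate processes.

```python
import sqlite3

# Hypothetical schema: one table holds the work queues for every stage.
SCHEMA = """
CREATE TABLE work_queue (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    stage   TEXT NOT NULL,
    payload TEXT NOT NULL
)
"""

def enqueue(conn, stage, payload):
    """Add one unit of work to the queue of the named stage."""
    conn.execute("INSERT INTO work_queue (stage, payload) VALUES (?, ?)",
                 (stage, payload))
    conn.commit()

def dequeue(conn, stage):
    """Remove and return the oldest payload for a stage, or None if empty."""
    row = conn.execute(
        "SELECT id, payload FROM work_queue WHERE stage = ? ORDER BY id LIMIT 1",
        (stage,)).fetchone()
    if row is None:
        return None
    conn.execute("DELETE FROM work_queue WHERE id = ?", (row[0],))
    conn.commit()
    return row[1]

def run_stage(conn, stage, handler, next_stages):
    """Drain one stage's queue, feeding each result to the downstream queues."""
    processed = 0
    while True:
        payload = dequeue(conn, stage)
        if payload is None:
            return processed
        result = handler(payload)
        for next_stage in next_stages:
            enqueue(conn, next_stage, result)
        processed += 1

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
enqueue(conn, "A", "http://example.com/")

# Stand-in handlers; a real stage A would download the page and save it.
run_stage(conn, "A", lambda url: "saved:" + url, ["B"])
run_stage(conn, "B", lambda path: path.upper(), ["C", "D"])
```

Because the stages only communicate through rows in the table, each one can be stopped, restarted, or rewritten without touching the others.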


I should point out that my crawler spidered one site at a time. As a  
result the downloading process spent a lot of time waiting (in order  
to be polite to the remote Web server). This sounds pretty different  
from what you want to do (and indeed from most crawlers).
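
The waiting mentioned above can be sketched as a minimum gap between requests to the same site. The class name and the two-second figure are assumptions for illustration only; the clock and sleep functions are injectable so the logic can be checked without really sleeping.

```python
import time

class PoliteDelay:
    """Enforce a minimum gap between successive requests to one site."""

    def __init__(self, min_gap_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough that requests are min_gap seconds apart."""
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            if elapsed < self.min_gap:
                self._sleep(self.min_gap - elapsed)
        self._last_request = self._clock()

# Usage: call wait() immediately before each download from the same host.
delay = PoliteDelay(2.0)
delay.wait()  # the first call returns at once
```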


Figuring out the best design for your crawler depends on a host of  
factors that you haven't mentioned.  (What are you doing with the  
pages you download? Is the box doing anything else? Are you storing  
the pages long term or discarding them? etc.) I don't think we can do  
it for you -- I know *I* can't; I have a day job. ;)  But I encourage  
you to try something out. If you find your code isn't giving what you  
want, come back to the list with a specific problem. It's always  
easier to help with specific than with general problems.


Good luck
Philip
--
http://mail.python.org/mailman/listinfo/python-list


web crawler in python

2009-12-09 Thread my name
I'm currently planning on writing a web crawler in python but have a
question as far as how I should design it. My goal is speed and maximum
efficient use of the hardware\bandwidth I have available.

As of now I have a Dual 2.4ghz xeon box, 4gb ram, 500gb sata and a 20mbps
bandwidth cap (for now) . Running FreeBSD.

What would be the best way to design the crawler? Using the thread module?
Would I be able to max out this connection with the hardware listed above
using python threads?

Thank you kindly.


Web crawler on python

2008-10-30 Thread yura
I need a simple web crawler. I found Ruya, but it seems not to be currently
maintained. Does anybody know a good web crawler in Python or with a
Python interface?
http://watch-me.890m.com


Re: Web crawler on python

2008-10-30 Thread James Mills
On Fri, Oct 31, 2008 at 8:13 AM, yura [EMAIL PROTECTED] wrote:
 I need simple web crawler, I found Ruya, but it's seems not currently
 maintained. Does anybody know good web crawler on python or with
 python interface?
 http://watch-me.890m.com

http://hg.softcircuit.com.au/index.wsgi/projects/pymills/file/edc08c87ecb7/examples/spider.py

cheers
James

-- 
--
-- Problems are solved by method


Re: Web crawler on python

2008-10-28 Thread Alex
On Oct 26, 9:54 pm, sonich [EMAIL PROTECTED] wrote:
 I need simple web crawler,
 I found Ruya, but it's seems not currently maintained.
 Does anybody know good web crawler on python or with python interface?

You should try Orchid http://pypi.python.org/pypi/Orchid/1.1
or you can have a look at my project on Launchpad,
https://code.launchpad.net/~esaurito/jazz-crawler/experimental.
It's a single-site crawler, but you can easily modify it.

Bye.

Alex


RE: Web crawler on python

2008-10-27 Thread Support Desk


-Original Message-
From: James Mills [mailto:[EMAIL PROTECTED] 
Sent: Sunday, October 26, 2008 5:26 PM
To: sonich
Cc: python-list@python.org
Subject: Re: Web crawler on python

On Mon, Oct 27, 2008 at 6:54 AM, sonich [EMAIL PROTECTED] wrote:
 I need simple web crawler,
 I found Ruya, but it's seems not currently maintained.
 Does anybody know good web crawler on python or with python interface?

Simple, but  it works. Extend it all you like.

http://hg.softcircuit.com.au/index.wsgi/projects/pymills/file/330d047ff663/examples/spider.py

$ spider.py --help
Usage: spider.py [options] url

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -q, --quiet           Enable quiet mode
  -l, --links           Get links for specified url only
  -d DEPTH, --depth=DEPTH
                        Maximum depth to traverse

cheers
James

-- 
--
-- Problems are solved by method




Web crawler on python

2008-10-26 Thread sonich
I need a simple web crawler.
I found Ruya, but it seems not to be currently maintained.
Does anybody know a good web crawler in Python or with a Python interface?


Re: Web crawler on python

2008-10-26 Thread Mr . SpOOn
On Sun, Oct 26, 2008 at 9:54 PM, sonich [EMAIL PROTECTED] wrote:
 I need simple web crawler,
 I found Ruya, but it's seems not currently maintained.
 Does anybody know good web crawler on python or with python interface?

What about BeautifulSoup?

http://www.crummy.com/software/BeautifulSoup/


Re: Web crawler on python

2008-10-26 Thread James Mills
On Mon, Oct 27, 2008 at 6:54 AM, sonich [EMAIL PROTECTED] wrote:
 I need simple web crawler,
 I found Ruya, but it's seems not currently maintained.
 Does anybody know good web crawler on python or with python interface?

Simple, but  it works. Extend it all you like.

http://hg.softcircuit.com.au/index.wsgi/projects/pymills/file/330d047ff663/examples/spider.py

$ spider.py --help
Usage: spider.py [options] url

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -q, --quiet           Enable quiet mode
  -l, --links           Get links for specified url only
  -d DEPTH, --depth=DEPTH
                        Maximum depth to traverse
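
For readers wondering what a depth limit like the one in the help text amounts to, here is a minimal sketch of a breadth-first, depth-limited crawl. It is not taken from spider.py; the function names are made up, and an in-memory dictionary stands in for real HTTP fetching and link extraction.

```python
from collections import deque

def crawl(start_url, get_links, max_depth):
    """Breadth-first crawl from start_url, following links up to max_depth hops."""
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# A tiny in-memory "web" standing in for real pages and their links.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["a"],
    "d": [],
}

pages = crawl("a", lambda url: FAKE_WEB.get(url, []), max_depth=1)
print(pages)
```

The `seen` set is what stops the crawl from looping when pages link back to each other.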

cheers
James

-- 
--
-- Problems are solved by method


Re: web crawler in python or C?

2006-02-20 Thread Magnus Lycka
abhinav wrote:
 I want to strike a balance between development speed and crawler speed.

"The best performance improvement is the transition from the
nonworking state to the working state." - J. Ousterhout

Try to get there as soon as possible. You can figure out what
that means. ;^)

When you do all your programming in Python, most of the code that
is relevant for speed *is* written in C already. If performance
is slow, measure! Use the profiler to see if you are spending a
lot of time in Python code. If that is your problem, take a close
look at your algorithms and perhaps your data structures and see
what you can improve with Python. In the long run, going from
e.g. O(n^2) to O(n log n) might mean much more than going from
Python to C. A poor algorithm in machine code still sucks when you
have to handle enough data. Changing your code to improve on
algorithms and structure is a lot easier in Python than in C.
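
As a small illustration of "measure!", the standard library's cProfile and pstats modules can report where the time goes. The `tokenize` workload and the `profile_report` helper below are made-up stand-ins, not part of any crawler mentioned in this thread.

```python
import cProfile
import io
import pstats

def tokenize(pages):
    """Stand-in workload: split each page into words."""
    return [page.split() for page in pages]

def profile_report(func, *args):
    """Run func under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()

report = profile_report(tokenize, ["some words on a page"] * 10000)
print(report)
```

Entries that bottom out in built-in methods (such as `str.split` here) are the ones already running in C; time attributed to your own Python functions is where a better algorithm is most likely to pay off.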

If you've done all these things, still have performance problems,
and have identified a bottleneck in your Python code, it might
be time to get that piece rewritten in C. The easiest and least
intrusive way to do that might be with Pyrex. You might also want
to try Psyco before you do this.

Even if you end up writing a whole program in C, it's not unlikely
that you will get to your goal faster if your first version is
written in Python.

Good luck!

P.S. Why someone would want to write yet another web crawler is
a puzzle to me. Surely there are plenty of good ideas that haven't
been properly implemented yet! It's probably very difficult to
beat Google on their home turf now, but I'd really like to see
a good tool to manage all that information I got from the net,
or through mail or wrote myself. I don't think they wrote that
yet--although I'm sure they are trying.


Re: web crawler in python or C?

2006-02-20 Thread [EMAIL PROTECTED]
I think something that may be even more important to consider than just
the pure speed of your program, would be ease of design as well as the
overall stability of your code.

My opinion would be that writing in Python would have many benefits
over the speed gains of using C. For instance, your crawler will have to
handle all types of input from all over the web. Who can say what types
of malformed or poorly written data it will come across? I think it
would be easier to create a system to handle this type of data in
Python than in C.
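
One concrete case of "malformed data" is a page whose bytes do not match the encoding it claims. A minimal sketch of tolerant decoding follows; the function name and the fallback order are my own assumptions, not an established recipe.

```python
def to_text(raw, declared_encoding=None):
    """Decode bytes from the web without raising on bad input."""
    for encoding in (declared_encoding, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: try the next candidate.
            continue
    # Last resort: latin-1 maps every byte to a character, so it cannot fail.
    return raw.decode("latin-1")

print(to_text(b"plain ascii"))
print(to_text(b"caf\xe9"))                 # not valid UTF-8, falls back
print(to_text(b"abc", "no-such-codec"))    # bogus declared encoding
```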

I don't want to pigeon-hole your project, but if it is for any use
other than a commercial product, I would say speed would be a concern
lower on the list than accuracy or time to develop. As others have
pointed out, if you hit many performance barriers, chances are the
problem is the algorithm and not Python itself.

I wish you luck and hope you will experiment in Python first. If your
crawler is still not up to par, at the very least you might come up
with some ideas for how Python could be improved.



Re: web crawler in python or C?

2006-02-17 Thread Steve Holden
abhinav wrote:
 Hi guys.I have to implement a topical crawler as a part of my
 project.What language should i implement
 C or Python?Python though has fast development cycle but my concern is
 speed also.I want to strike a balance between development speed and
 crawler speed.Since Python is an interpreted language it is rather
 slow.The crawler which will be working on huge set of pages should be
 as fast as possible.One possible implementation would be implementing
 partly in C and partly in Python so that i can have best of both
 worlds.But i don't know to approach about it.Can anyone guide me on
 what part should i implement in C and what should be in Python?
 
Get real. Any web crawler is bound to spend huge amounts of its time 
waiting for data to come in over network pipes. Or do you have plans for 
massive parallelism previously unheard of in the Python world?

regards
  Steve
-- 
Steve Holden   +44 150 684 7255  +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006  www.python.org/pycon/



Re: web crawler in python or C?

2006-02-17 Thread Ravi Teja
This is following the pattern of your previous post on language choice
wrt. writing a mail server. It is very common for beginners to
overemphasize performance requirements, size of the executable etc. More is
always good. Right? Yes! But at what cost?

The rule of thumb for all your Python Vs C questions is ...
1.) Choose Python by default.
2.) If your program is slow, it's your algorithm that you need to check
first. Python strictly speaking will be slow because of its dynamism.
However, most of whatever is performance critical in Python is already
implemented in C. And the speed difference of well written Python
programs with properly chosen extensions and algorithms is not far off.
3.) Remember that you can always drop back to C wherever you need to
without throwing away all of your code. And even if you had to, Python is
very valuable as a prototyping tool since it is very agile. You would
have figured out what you needed to do by then, so rewriting it in C
will take only a fraction of the time it would have taken to write it in
C directly.

Don't even start asking the question "is it fast enough?" until
you have already written it in Python and it turns out that it is not
running fast enough despite the correctness of your code. If that happens,
you can fix it relatively easily. It is easy to write bad code in C, and
poorly written C code performs worse than well written Python
code.

Remember Donald Knuth's quote.
"Premature optimization is the root of all evil in programming."

C is a language intended to be used when you NEED tight control over
memory allocation. It has few advantages in other scenarios. Don't
abuse it by choosing it by default.



Re: web crawler in python or C?

2006-02-17 Thread Alex Martelli
Ravi Teja [EMAIL PROTECTED] wrote:
   ...
 The rule of thumb for all your Python Vs C questions is ...
 1.) Choose Python by default.

+1 QOTW!-)


 2.) If your program is slow, it's your algorithm that you need to check

Seriously: yes, and (often even more importantly) data structure.

However, often the most important tip, particularly for large-scale systems,
is to consider your program's _architecture_ (algorithms are about
details of computation, architecture is about partitioning systems into
components, locating their deployment, and so forth). At a generic and
lowish level: are you for example creating a lot of threads each for a
small amount of work? Then consider reusing threads from a worker
threads pool. Or maybe you could avoid threads and use event-driven
programming; or, at the other extreme, have multiple processes
communicating by TCP/IP so you can scale up your system to tens or
hundreds of processors -- in the latter case, partitioning your system
appropriately to minimize inter-process communication may be the
bottleneck. Consider UDP, when you can afford missing a packet once in a
while -- sometimes it may let you reduce overheads compared to TCP
connections.
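
A minimal sketch of the worker-thread-pool suggestion above, using the standard library's concurrent.futures module (which postdates this thread). The `fetch` function is a stand-in that sleeps instead of touching the network, and the pool size of 8 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url):
    """Stand-in for a network download; sleeping mimics waiting on I/O."""
    time.sleep(0.05)
    return url, len(url)

def fetch_all(urls, max_workers=8):
    """Process all urls with a fixed pool of reusable worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() hands each URL to an idle worker and yields results in input order.
        return list(pool.map(fetch, urls))

urls = ["http://example.com/page%d" % i for i in range(16)]
results = fetch_all(urls)
print(len(results), "pages fetched")
```

The threads are created once and reused for every URL, which avoids paying thread start-up cost for each small piece of work.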

Database connections, and less importantly database cursors, are well
worth reusing. What are you caching, and what instead is getting
recomputed over and over?  It's possible to undercache (needless
repeated computation) but also to overcache (tying up memory and causing
paging). Are you making lots of system calls that you might be able to
avoid? Each system call has a context-switching cost, after all...
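
To make the undercache/overcache trade-off concrete, here is a sketch using functools.lru_cache with a bounded size. The `lookup` function is hypothetical; it stands in for something costly such as a DNS lookup or a robots.txt fetch, and the bound of 1024 entries is an arbitrary assumption.

```python
from functools import lru_cache

CALLS = []  # records every real (uncached) computation

@lru_cache(maxsize=1024)  # bounded, so the cache cannot grow without limit
def lookup(host):
    """Stand-in for an expensive per-host operation."""
    CALLS.append(host)
    return "rules for " + host

for host in ["example.com", "example.com", "example.org", "example.com"]:
    lookup(host)

info = lookup.cache_info()
print("hits=%d misses=%d real calls=%d" % (info.hits, info.misses, len(CALLS)))
```

Without the cache the work would be repeated on every call; without the `maxsize` bound, a crawl across millions of hosts would keep every result in memory.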

Any or all of these hints may be irrelevant to a specific category of
applications, but then, so can the hint about algorithms be. One cool
thing about Python is that it makes it easy and fast for you to try out
different approaches (particularly to architecture, but to algorithms as
well), even drastically different ones, when simple reasoning about the
issues leaves you undecided and you need to settle them empirically.

 
 Remember Donald Knuth's quote.
 Premature optimization is the root of all evil in programming.

I believe Knuth himself said he was quoting Tony Hoare, and indeed
referred to this as Hoare's dictum.


Alex


Re: web crawler in python or C?

2006-02-16 Thread Fuzzyman

abhinav wrote:
 It is DSL broadband 128kbps.But thats not the point.What i am saying is
 that would python be fine for implementing fast crawler algorithms or
 should i use C.

But a web crawler is going to be *mainly* I/O bound - so language
efficiency won't be the main issue. There are several web crawlers
implemented in Python.

 Handling huge data,multithreading,file
 handling,heuristics for ranking,and maintaining huge data
 structures.What should be the language so as not to compromise that
 much on speed.What is the performance of python based crawlers vs C
 based crawlers.Should I use both the languages(partly C and python).How

If your data processing requirements are fairly heavy you will
*probably* get a speed advantage coding them in C and accessing them
from Python.

The usual advice (which seems to be applicable to you) is to
prototype in Python (which will be much more fun than in C), then test.

Profile to find your real bottlenecks (if the Python one isn't fast
enough - which it may be), and move your bottlenecks to C.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

 should i decide what part to be implemented in C and what should be
 done in python?
 Please guide me.Thanks.



Re: web crawler in python or C?

2006-02-16 Thread Paul Rubin
abhinav [EMAIL PROTECTED] writes:
 It is DSL broadband 128kbps.But thats not the point.

But it is the point.
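
A back-of-the-envelope calculation shows why the link speed matters. The average page size of 25,000 bytes is purely an assumed figure for illustration; the 20 Mbps line is included only as a comparison point.

```python
def pages_per_second(link_bits_per_second, average_page_bytes):
    """Upper bound on the fetch rate if the link is the only limit."""
    bytes_per_second = link_bits_per_second / 8
    return bytes_per_second / average_page_bytes

AVERAGE_PAGE_BYTES = 25000  # assumed figure, for illustration only

for label, bits in [("128 kbps DSL", 128000), ("20 Mbps", 20000000)]:
    rate = pages_per_second(bits, AVERAGE_PAGE_BYTES)
    print("%s: at most %.2f pages/second" % (label, rate))
```

Under that assumption a 128 kbps link tops out below one page per second, a rate that leaves nearly all of the CPU idle regardless of the implementation language.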

 What i am saying is that would python be fine for implementing fast
 crawler algorithms or should i use C.Handling huge
 data,multithreading,file handling,heuristics for ranking,and
 maintaining huge data structures.What should be the language so as
 not to compromise that much on speed.What is the performance of
 python based crawlers vs C based crawlers.Should I use both the
 languages(partly C and python).How should i decide what part to be
 implemented in C and what should be done in python?  Please guide
 me.Thanks.

I think if you don't know how to answer these questions for yourself,
you're not ready to take on projects of that complexity.  My advice
is start in Python since development will be much easier.  If and when
you start hitting performance problems, you'll have to examine many
combinations of tactics for dealing with them, and switching languages
is just one such tactic.  


Re: web crawler in python or C?

2006-02-16 Thread gene tani

Paul Rubin wrote:
 abhinav [EMAIL PROTECTED] writes:

  maintaining huge data structures.What should be the language so as
  not to compromise that much on speed.What is the performance of
  python based crawlers vs C based crawlers.Should I use both the
  languages(partly C and python).How should i decide what part to be
  implemented in C and what should be done in python?  Please guide
  me.Thanks.

 I think if you don't know how to answer these questions for yourself,
 you're not ready to take on projects of that complexity.  My advice
 is start in Python since development will be much easier.  If and when
 you start hitting performance problems, you'll have to examine many
 combinations of tactics for dealing with them, and switching languages
 is just one such tactic.

There's another potential bottleneck, parsing HTML and extracting the
text you want, especially when you hit pages that don't meet HTML 4 or
XHTML spec.
http://sig.levillage.org/?p=599
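
As a rough sketch of extracting links from markup that does not meet any spec, the standard library's html.parser copes with unclosed tags, uppercase names, and unquoted attribute values. The class and function names below are made up for this example.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive already lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html_text):
    """Return the href of every anchor found in html_text."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links

# Unclosed tags, uppercase names, and an unquoted attribute value.
messy = "<HTML><body><p>Hi<A HREF=one.html>one<a href='two.html'>two</body>"
links = extract_links(messy)
print(links)
```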

Paul's advice is very sound, given what little info you've provided.

http://trific.ath.cx/resources/python/optimization/
Also look at Psyco, Pyrex, Boost, SWIG, and ctypes for bridging C and
Python; you have a lot of options.  And look at HarvestMan, mechanize,
and other existing libs.



Re: web crawler in python or C?

2006-02-16 Thread gene tani

abhinav wrote:
 Hi guys.I have to implement a topical crawler as a part of my
 project.What language should i implement

Oh, and there are some really good books out there, besides O'Reilly's
Spidering Hacks.  Springer Verlag has a couple of books on Text Mining
and at least a couple of books with "web intelligence" in the title.
Expensive but worth it.



Re: web crawler in python or C?

2006-02-16 Thread Andrew Gwozdziewycz
On 15 Feb 2006 21:56:52 -0800, abhinav [EMAIL PROTECTED] wrote:
 Hi guys.I have to implement a topical crawler as a part of my
 project.What language should i implement
 C or Python?

Why does this keep coming up on here as of late? If you search the
archives, you can find numerous posts about spiders. One interesting
fact is that Google itself started out with its spiders in Python.
http://www-db.stanford.edu/~backrub/google.html I'm _sure_ it'll work
for you.



--
Andrew Gwozdziewycz [EMAIL PROTECTED]
http://ihadagreatview.org
http://plasticandroid.org


Re: web crawler in python or C?

2006-02-16 Thread Steven D'Aprano
On Wed, 15 Feb 2006 21:56:52 -0800, abhinav wrote:

 Hi guys.I have to implement a topical crawler as a part of my
 project.What language should i implement
 C or Python?Python though has fast development cycle but my concern is
 speed also.I want to strike a balance between development speed and
 crawler speed.Since Python is an interpreted language it is rather
 slow.

Python is no more interpreted than Java. Like Java, it is compiled to
byte-code. Unlike Java, it doesn't take three weeks to start the runtime
environment. (Okay, maybe it just *seems* like three weeks.)
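
The byte-code compilation step can be seen directly with the built-in compile() function and the dis module. The source string below is an arbitrary example, and the exact instructions printed vary between Python versions.

```python
import dis

source = "total = price * quantity"
code = compile(source, "<example>", "exec")  # source text -> code object

# co_code holds the raw byte-code that the interpreter loop executes.
print(type(code.co_code), len(code.co_code), "bytes of byte-code")

# dis renders the same byte-code as readable instructions.
dis.dis(code)
```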

The nice clean distinctions between compiled and interpreted languages
haven't existed in most serious programming languages for a decade or
more. In these days of tokenizers and byte-code compilers and processors
emulating other processors, the difference is more of degree than kind.

It is true that standard Python doesn't compile to platform dependent
machine code, but that is rarely an issue since the bottleneck for most
applications is I/O or human interaction, not language speed. And for
those cases where it is a problem, there are solutions, like Psyco.

After all, it is almost never true that your code must run as fast as
physically possible. That's called over-engineering. It just needs to
run as fast as needed, that's all. And that's a much simpler problem to
solve cheaply.



 The crawler which will be working on huge set of pages should be
 as fast as possible.

Web crawler performance is almost certainly going to be I/O bound. Sounds
to me like you are guilty of trying to optimize your code before even
writing a single line of code. What you call "huge" may not be huge to
your computer. Have you tried? The great thing about Python is you can
write a prototype in maybe a tenth the time it would take you to do the
same thing in C. Instead of trying to guess what the performance
bottlenecks will be, you can write your code and profile it and find the
bottlenecks with accuracy.


 One possible implementation would be implementing
 partly in C and partly in Python so that i can have best of both
 worlds.

Sure you can do that, if you need to. 

 But i don't know to approach about it.Can anyone guide me on
 what part should i implement in C and what should be in Python?

Yes. Write it all in Python. Test it, debug it, get it working. 

Once it is working, and not before, rigorously profile it. You may find it
is fast enough.

If it is not fast enough, find the bottlenecks. Replace them with better
algorithms. We had an example on comp.lang.python just a day or two ago
where a function which was taking hours to complete was re-written with a
better algorithm which took only seconds. And still in Python.
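
This is not the comp.lang.python example referred to above, just a small made-up illustration of the same effect in a crawler setting: removing duplicate URLs with a list lookup versus a set lookup.

```python
def unique_urls_quadratic(urls):
    """Order-preserving de-duplication using a list for membership tests."""
    seen = []
    for url in urls:
        if url not in seen:     # linear scan of a list: O(n) per lookup
            seen.append(url)
    return seen

def unique_urls_linear(urls):
    """Same result, but membership tests go through a set."""
    seen = set()
    ordered = []
    for url in urls:
        if url not in seen:     # hash lookup: O(1) on average
            seen.add(url)
            ordered.append(url)
    return ordered

urls = ["http://example.com/%d" % (i % 500) for i in range(5000)]
assert unique_urls_quadratic(urls) == unique_urls_linear(urls)
print(len(unique_urls_linear(urls)), "unique URLs")
```

Both functions are pure Python and return the same list; only the data structure differs, and the gap between them widens as the number of distinct URLs grows.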

If it is still too slow after using better algorithms, or if there are no
better algorithms, then and only then re-write those bottlenecks in C for
speed.



-- 
Steven.



web crawler in python or C?

2006-02-15 Thread abhinav
Hi guys. I have to implement a topical crawler as a part of my
project. What language should I implement it in,
C or Python? Python has a fast development cycle, but my concern is
speed also. I want to strike a balance between development speed and
crawler speed. Since Python is an interpreted language it is rather
slow. The crawler, which will be working on a huge set of pages, should be
as fast as possible. One possible implementation would be implementing
partly in C and partly in Python so that I can have the best of both
worlds. But I don't know how to approach it. Can anyone guide me on
what part I should implement in C and what should be in Python?



Re: web crawler in python or C?

2006-02-15 Thread Paul Rubin
abhinav [EMAIL PROTECTED] writes:
 The crawler which will be working on huge set of pages should be
 as fast as possible.

What kind of network connection do you have, that's fast enough
that even a fairly cpu-inefficient crawler won't saturate it?


Re: web crawler in python or C?

2006-02-15 Thread abhinav
It is DSL broadband, 128kbps. But that's not the point. What I am saying is:
would Python be fine for implementing fast crawler algorithms, or
should I use C? That covers handling huge data, multithreading, file
handling, heuristics for ranking, and maintaining huge data
structures. What should the language be so as not to compromise that
much on speed? What is the performance of Python-based crawlers vs
C-based crawlers? Should I use both languages (partly C and Python)? How
should I decide what part is to be implemented in C and what should be
done in Python?
Please guide me. Thanks.
