Re: web crawler in python

2009-12-10 Thread Philip Semanchuk


On Dec 9, 2009, at 7:39 PM, my name wrote:


I'm currently planning on writing a web crawler in python but have a
question as far as how I should design it. My goal is speed and maximally
efficient use of the hardware/bandwidth I have available.

As of now I have a dual 2.4GHz Xeon box, 4GB RAM, 500GB SATA and a 20Mbps
bandwidth cap (for now). Running FreeBSD.

What would be the best way to design the crawler? Using the thread module?
Would I be able to max out this connection with the hardware listed above
using python threads?

I wrote a web crawler in Python (under FreeBSD, in fact) and I chose  
to do it using separate processes. Process A would download pages and  
write them to disk, process B would attempt to convert them to  
Unicode, process C would evaluate the content, etc. That worked well  
for me because the processes were very independent of one another so  
they had very little data to share. Each process had a work queue  
(Postgres database table); process A would feed B's queue, B would  
feed C & D's queues, etc.
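
The stage-per-queue structure described above can be sketched as follows.
This is a minimal illustration only: it uses threads and in-memory
queue.Queue objects where Philip used separate OS processes with Postgres
tables as work queues, and fetch() is a hypothetical stand-in for a real
HTTP download.

```python
import queue
import threading

def fetch(url):
    # Hypothetical stand-in for a real HTTP download; returns raw bytes.
    return ("<html>page at %s</html>" % url).encode("latin-1")

def download_stage(urls, raw_q):
    # Stage A: download each page and feed the decoder's work queue.
    for url in urls:
        raw_q.put((url, fetch(url)))
    raw_q.put(None)  # sentinel: no more work for the next stage

def decode_stage(raw_q, text_q):
    # Stage B: convert raw bytes to Unicode and feed the next stage.
    while True:
        item = raw_q.get()
        if item is None:
            text_q.put(None)  # pass the sentinel along
            break
        url, raw = item
        text_q.put((url, raw.decode("latin-1", errors="replace")))

def run_pipeline(urls):
    raw_q, text_q = queue.Queue(), queue.Queue()
    threading.Thread(target=download_stage, args=(urls, raw_q)).start()
    threading.Thread(target=decode_stage, args=(raw_q, text_q)).start()
    results = []
    while True:  # the main thread plays the role of the final stage
        item = text_q.get()
        if item is None:
            break
        results.append(item)
    return results
```

With real processes the queues become database tables (or a
multiprocessing.Queue), but the hand-off discipline is the same: each
stage reads one queue, feeds the next, and a sentinel signals completion.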


I should point out that my crawler spidered one site at a time. As a  
result the downloading process spent a lot of time waiting (in order  
to be polite to the remote Web server). This sounds pretty different  
from what you want to do (and indeed from most crawlers).
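
That kind of politeness is easy to sketch: insert a fixed delay between
successive requests to the same server. The fetch argument below is a
placeholder for whatever download function the crawler uses, and the
one-second default is an assumption, not a universal rule.

```python
import time

def polite_crawl(urls, fetch, delay=1.0):
    # Fetch pages one at a time, sleeping between requests so the
    # remote server is never hit in rapid succession.
    pages = []
    for i, url in enumerate(urls):
        if i:  # no need to wait before the very first request
            time.sleep(delay)
        pages.append((url, fetch(url)))
    return pages
```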


Figuring out the best design for your crawler depends on a host of  
factors that you haven't mentioned.  (What are you doing with the  
pages you download? Is the box doing anything else? Are you storing  
the pages long term or discarding them? etc.) I don't think we can do  
it for you -- I know *I* can't; I have a day job. ;)  But I encourage  
you to try something out. If you find your code isn't giving you what you
want, come back to the list with a specific problem. It's always
easier to help with specific problems than with general ones.


Good luck
Philip
--
http://mail.python.org/mailman/listinfo/python-list


web crawler in python

2009-12-09 Thread my name
I'm currently planning on writing a web crawler in python but have a
question as far as how I should design it. My goal is speed and maximally
efficient use of the hardware/bandwidth I have available.

As of now I have a dual 2.4GHz Xeon box, 4GB RAM, 500GB SATA and a 20Mbps
bandwidth cap (for now). Running FreeBSD.

What would be the best way to design the crawler? Using the thread module?
Would I be able to max out this connection with the hardware listed above
using python threads?

Thank you kindly.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: web crawler in python or C?

2006-02-20 Thread [EMAIL PROTECTED]
I think something that may be even more important to consider than just
the pure speed of your program, would be ease of design as well as the
overall stability of your code.

My opinion would be that writing in Python would have many benefits
over the speed gains of using C. For instance, your crawler will have to
handle all types of input from all over the web. Who can say what kinds
of malformed or poorly written data it will come across? I think it
would be easier to create a system to handle this type of data in
Python than in C.

I don't want to pigeon-hole your project, but if it is for any use
other than a commercial product, I would say speed would be a concern
lower on the list than accuracy or time to develop. As others have
pointed out, if you hit many performance barriers, chances are the
problem is the algorithm and not Python itself.

I wish you luck and hope you will experiment in Python first. If your
crawler is still not up to par, at the very least you might come up
with some ideas for how Python could be improved.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: web crawler in python or C?

2006-02-20 Thread Magnus Lycka
abhinav wrote:
> I want to strike a balance between development speed and crawler speed.

"The best performance improvement is the transition from the
nonworking state to the working state." - John Ousterhout

Try to get there as soon as possible. You can figure out what
that means. ;^)

When you do all your programming in Python, most of the code that
is relevant for speed *is* written in C already. If performance
is slow, measure! Use the profiler to see if you are spending a
lot of time in Python code. If that is your problem, take a close
look at your algorithms and perhaps your data structures and see
what you can improve with Python. In the long run, going e.g. from
O(n^2) to O(n log n) might mean much more than going from
Python to C. A poor algorithm in machine code still sucks when you
have to handle enough data. Changing your code to improve on
algorithms and structure is a lot easier in Python than in C.
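
Measuring before rewriting is straightforward with the standard-library
profiler. The sketch below profiles a deliberately quadratic duplicate
filter alongside its linear replacement; the function names are made up
for illustration.

```python
import cProfile
import io
import pstats

def dedup_quadratic(items):
    # O(n^2): each membership test scans a growing list.
    seen = []
    for x in items:
        if x not in seen:
            seen.append(x)
    return seen

def dedup_linear(items):
    # O(n): membership tests against a set are constant time.
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def profile_report(func, *args):
    # Run func under cProfile and return (result, human-readable stats).
    prof = cProfile.Profile()
    result = prof.runcall(func, *args)
    buf = io.StringIO()
    pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(5)
    return result, buf.getvalue()
```

Running profile_report on each version over the same data shows where
the time actually goes, which is exactly the measurement Magnus is
recommending before any rewrite in C.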

If you've done all these things, still have performance problems,
and have identified a bottleneck in your Python code, it might
be time to get that piece rewritten in C. The easiest and least
intrusive way to do that might be with pyrex. You might also want
to try Psyco before you do this.

Even if you end up writing a whole program in C, it's not unlikely
that you will get to your goal faster if your first version is
written in Python.

Good luck!

P.S. Why someone would want to write yet another web crawler is
a puzzle to me. Surely there are plenty of good ideas that haven't
been properly implemented yet! It's probably very difficult to
beat Google on their home turf now, but I'd really like to see
a good tool to manage all that information I got from the net,
or through mail or wrote myself. I don't think they wrote that
yet--although I'm sure they are trying.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: web crawler in python or C?

2006-02-17 Thread Alex Martelli
Ravi Teja <[EMAIL PROTECTED]> wrote:
   ...
> The rule of thumb for all your Python Vs C questions is ...
> 1.) Choose Python by default.

+1 QOTW!-)


> 2.) If your program is slow, it's your algorithm that you need to check

Seriously: yes, and (often even more importantly) data structure.

However, often most important tip, particularly for large-scale systems,
is to consider your program's _architecture_ (algorithms are about
details of computation, architecture is about partitioning systems into
components, locating their deployment, and so forth). At a generic and
lowish level: are you for example creating a lot of threads each for a
small amount of work? Then consider reusing threads from a "worker
threads" pool. Or maybe you could avoid threads and use event-driven
programming; or, at the other extreme, have multiple processes
communicating by TCP/IP so you can scale up your system to tens or
hundreds of processors -- in the latter case, partitioning your system
appropriately to minimize inter process communication may be the
bottleneck. Consider UDP, when you can afford missing a packet once in a
while -- sometimes it may let you reduce overheads compared to TCP
connections.
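
The worker-pool idea, for example, is a few lines with
concurrent.futures (a modern convenience; in 2006 one would have built
the pool by hand). The fetch function below is again a hypothetical
stand-in for the real blocking download.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Hypothetical stand-in for a blocking network download.
    return "contents of " + url

def crawl_with_pool(urls, workers=8):
    # A fixed pool of reusable worker threads: thread creation cost is
    # paid once for the pool, not once per URL.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```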

Database connections, and less importantly database cursors, are well
worth reusing. What are you "caching", and what instead is getting
recomputed over and over?  It's possible to undercache (needless
repeated computation) but also to overcache (tying up memory and causing
paging). Are you making lots of system calls that you might be able to
avoid? Each system call has a context-switching cost, after all...

Any or all of these hints may be irrelevant to a specific category of
applications, but then, so can the hint about algorithms be. One cool
thing about Python is that it makes it easy and fast for you to try out
different approaches (particularly to architecture, but to algorithms as
well), even drastically different ones, when simple reasoning about the
issues leaves you undecided and you need to settle them empirically.

 
> Remember Donald Knuth's quote.
> "Premature optimization is the root of all evil in programming".

I believe Knuth himself said he was quoting Tony Hoare, and indeed
referred to this as "Hoare's dictum".


Alex
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: web crawler in python or C?

2006-02-17 Thread Ravi Teja
This is following the pattern of your previous post on language choice
wrt. writing a mail server. It is very common for beginners to
overemphasize performance requirements, size of the executable, etc. More
is always good, right? Yes! But at what cost?

The rule of thumb for all your Python Vs C questions is ...
1.) Choose Python by default.
2.) If your program is slow, it's your algorithm that you need to check
first. Python strictly speaking will be slow because of its dynamism.
However, most of whatever is performance critical in Python is already
implemented in C. And well-written Python programs, with properly chosen
extensions and algorithms, are not far off C in speed.
3.) Remember that you can always drop back to C wherever you need to
without throwing away all of your code. And even if you had to rewrite,
Python is very valuable as a prototyping tool since it is very agile. By
then you would have figured out what you needed to do, so rewriting it
in C will take only a fraction of the time it would have taken to write
it in C directly.

Don't even start asking the question "is it fast enough?" until you have
already written it in Python and it turns out not to be running fast
enough despite the correctness of your code. If that happens, you can
fix it relatively easily. It is easy to write bad code in C, and poorly
written C code performs worse than well-written Python code.

Remember Donald Knuth's quote.
"Premature optimization is the root of all evil in programming".

C is a language intended to be used when you NEED tight control over
memory allocation. It has few advantages in other scenarios. Don't
abuse it by choosing it by default.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: web crawler in python or C?

2006-02-17 Thread Steve Holden
abhinav wrote:
> Hi guys. I have to implement a topical crawler as a part of my
> project. What language should I implement it in, C or Python? Python
> has a fast development cycle, but my concern is also speed. I want to
> strike a balance between development speed and crawler speed. Since
> Python is an interpreted language it is rather slow. The crawler,
> which will be working on a huge set of pages, should be as fast as
> possible. One possible implementation would be implementing partly in
> C and partly in Python so that I can have the best of both worlds. But
> I don't know how to approach this. Can anyone guide me on what part I
> should implement in C and what should be in Python?
> 
Get real. Any web crawler is bound to spend huge amounts of its time 
waiting for data to come in over network pipes. Or do you have plans for 
massive parallelism previously unheard of in the Python world?
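
Steve's point is easy to demonstrate: when each "download" is mostly
waiting, threads overlap the waits and the wall-clock time collapses.
The sleep below simulates network latency; the actual numbers depend
entirely on the connection, so treat this as a sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(url, latency=0.05):
    # Simulate an I/O-bound request: nearly all the time is spent waiting.
    time.sleep(latency)
    return url

def sequential(urls):
    # One request at a time: total time is roughly len(urls) * latency.
    return [fake_download(u) for u in urls]

def threaded(urls):
    # All requests in flight at once: the waits overlap.
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        return list(pool.map(fake_download, urls))
```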

regards
  Steve
-- 
Steve Holden   +44 150 684 7255  +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006  www.python.org/pycon/

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: web crawler in python or C?

2006-02-16 Thread Steven D'Aprano
On Wed, 15 Feb 2006 21:56:52 -0800, abhinav wrote:

> Hi guys. I have to implement a topical crawler as a part of my
> project. What language should I implement it in, C or Python? Python
> has a fast development cycle, but my concern is also speed. I want to
> strike a balance between development speed and crawler speed. Since
> Python is an interpreted language it is rather slow.

Python is no more interpreted than Java. Like Java, it is compiled to
byte-code. Unlike Java, it doesn't take three weeks to start the runtime
environment. (Okay, maybe it just *seems* like three weeks.)

The nice clean distinctions between "compiled" and "interpreted" languages
haven't existed in most serious programming languages for a decade or
more. In these days of tokenizers and byte-code compilers and processors
emulating other processors, the difference is more of degree than kind.

It is true that standard Python doesn't compile to platform dependent
machine code, but that is rarely an issue since the bottleneck for most
applications is I/O or human interaction, not language speed. And for
those cases where it is a problem, there are solutions, like Psyco.

After all, it is almost never true that your code must run as fast as
physically possible. That's called "over-engineering". It just needs to
run as fast as needed, that's all. And that's a much simpler problem to
solve cheaply.



> The crawler, which will be working on a huge set of pages, should be
> as fast as possible.

Web crawler performance is almost certainly going to be I/O bound. Sounds
to me like you are guilty of trying to optimize your code before even
writing a single line of code. What you call "huge" may not be huge to
your computer. Have you tried? The great thing about Python is you can
write a prototype in maybe a tenth the time it would take you to do the
same thing in C. Instead of trying to guess what the performance
bottlenecks will be, you can write your code and profile it and find the
bottlenecks with accuracy.


> One possible implementation would be implementing
> partly in C and partly in Python so that I can have the best of both
> worlds.

Sure you can do that, if you need to. 

> But I don't know how to approach this. Can anyone guide me on
> what part I should implement in C and what should be in Python?

Yes. Write it all in Python. Test it, debug it, get it working. 

Once it is working, and not before, rigorously profile it. You may find it
is fast enough.

If it is not fast enough, find the bottlenecks. Replace them with better
algorithms. We had an example on comp.lang.python just a day or two ago
where a function which was taking hours to complete was re-written with a
better algorithm which took only seconds. And still in Python.

If it is still too slow after using better algorithms, or if there are no
better algorithms, then and only then re-write those bottlenecks in C for
speed.



-- 
Steven.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: web crawler in python or C?

2006-02-16 Thread Andrew Gwozdziewycz
On 15 Feb 2006 21:56:52 -0800, abhinav <[EMAIL PROTECTED]> wrote:
> Hi guys. I have to implement a topical crawler as a part of my
> project. What language should I implement it in,
> C or Python?

Why does this keep coming up on here as of late? If you search the
archives, you can find numerous posts about spiders. One interesting
fact is that Google itself started with its spiders written in Python.
http://www-db.stanford.edu/~backrub/google.html I'm _sure_ it'll work
for you.



--
Andrew Gwozdziewycz <[EMAIL PROTECTED]>
http://ihadagreatview.org
http://plasticandroid.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: web crawler in python or C?

2006-02-16 Thread gene tani

abhinav wrote:
> Hi guys. I have to implement a topical crawler as a part of my
> project. What language should I implement it in?

Oh, and there are some really good books out there, besides the O'Reilly
Spidering Hacks. Springer-Verlag has a couple of books on "Text Mining"
and at least a couple with "web intelligence" in the title.
Expensive but worth it.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: web crawler in python or C?

2006-02-16 Thread gene tani

Paul Rubin wrote:
> "abhinav" <[EMAIL PROTECTED]> writes:

> > maintaining huge data structures. What should the language be so as
> > not to compromise that much on speed? What is the performance of
> > Python-based crawlers vs C-based crawlers? Should I use both
> > languages (partly C and partly Python)? How should I decide what part
> > should be implemented in C and what should be done in Python? Please
> > guide me. Thanks.
>
> I think if you don't know how to answer these questions for yourself,
> you're not ready to take on projects of that complexity.  My advice
> is start in Python since development will be much easier.  If and when
> you start hitting performance problems, you'll have to examine many
> combinations of tactics for dealing with them, and switching languages
> is just one such tactic.

There's another potential bottleneck: parsing HTML and extracting the
text you want, especially when you hit pages that don't meet the HTML 4
or XHTML spec.
http://sig.levillage.org/?p=599
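
The standard library's html.parser is one way to cope with that: it is
event-driven and deliberately forgiving, skipping markup it cannot make
sense of rather than raising. A minimal link extractor as a sketch:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Collects href values from <a> tags; tolerant of the unclosed
    # and mis-nested tags that real-world pages are full of.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html_text):
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links
```

Feeding it markup with missing closing tags still yields the links; a
stricter XML parser would reject the same input outright.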

Paul's advice is very sound, given what little info you've provided.

http://trific.ath.cx/resources/python/optimization/
Look at Psyco, Pyrex, Boost, SWIG, and ctypes for bridging C and
Python; you have a lot of options. Also look at HarvestMan, mechanize,
and other existing libs.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: web crawler in python or C?

2006-02-16 Thread Paul Rubin
"abhinav" <[EMAIL PROTECTED]> writes:
> It is DSL broadband, 128kbps. But that's not the point.

But it is the point.

> What I am saying is: would Python be fine for implementing fast
> crawler algorithms, or should I use C? Handling huge data,
> multithreading, file handling, heuristics for ranking, and
> maintaining huge data structures. What should the language be so as
> not to compromise that much on speed? What is the performance of
> Python-based crawlers vs C-based crawlers? Should I use both
> languages (partly C and partly Python)? How should I decide what part
> should be implemented in C and what should be done in Python? Please
> guide me. Thanks.

I think if you don't know how to answer these questions for yourself,
you're not ready to take on projects of that complexity.  My advice
is start in Python since development will be much easier.  If and when
you start hitting performance problems, you'll have to examine many
combinations of tactics for dealing with them, and switching languages
is just one such tactic.  
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: web crawler in python or C?

2006-02-16 Thread Fuzzyman

abhinav wrote:
> It is DSL broadband, 128kbps. But that's not the point. What I am
> saying is: would Python be fine for implementing fast crawler
> algorithms, or should I use C?

But a web crawler is going to be *mainly* I/O bound, so language
efficiency won't be the main issue. There are several web crawlers
implemented in Python.

> Handling huge data, multithreading, file
> handling, heuristics for ranking, and maintaining huge data
> structures. What should the language be so as not to compromise that
> much on speed? What is the performance of Python-based crawlers vs
> C-based crawlers? Should I use both languages (partly C and Python)? How

If your data processing requirements are fairly heavy you will
*probably* get a speed advantage coding them in C and accessing them
from Python.

The usual advice (which seems applicable to you) is to
prototype in Python (which will be much more fun than in C), then test.

Profile to find your real bottlenecks (if the Python one isn't fast
enough - which it may be), and move your bottlenecks to C.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

> should I decide what part should be implemented in C and what should be
> done in Python?
> Please guide me. Thanks.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: web crawler in python or C?

2006-02-15 Thread abhinav
It is DSL broadband, 128kbps. But that's not the point. What I am saying
is: would Python be fine for implementing fast crawler algorithms, or
should I use C? Handling huge data, multithreading, file handling,
heuristics for ranking, and maintaining huge data structures. What
should the language be so as not to compromise that much on speed? What
is the performance of Python-based crawlers vs C-based crawlers? Should
I use both languages (partly C and partly Python)? How should I decide
what part should be implemented in C and what should be done in Python?
Please guide me. Thanks.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: web crawler in python or C?

2006-02-15 Thread Paul Rubin
"abhinav" <[EMAIL PROTECTED]> writes:
> The crawler, which will be working on a huge set of pages, should be
> as fast as possible.

What kind of network connection do you have, that's fast enough
that even a fairly cpu-inefficient crawler won't saturate it?
-- 
http://mail.python.org/mailman/listinfo/python-list


web crawler in python or C?

2006-02-15 Thread abhinav
Hi guys. I have to implement a topical crawler as a part of my
project. What language should I implement it in, C or Python? Python
has a fast development cycle, but my concern is also speed. I want to
strike a balance between development speed and crawler speed. Since
Python is an interpreted language it is rather slow. The crawler, which
will be working on a huge set of pages, should be as fast as possible.
One possible implementation would be implementing it partly in C and
partly in Python so that I can have the best of both worlds. But I
don't know how to approach this. Can anyone guide me on what part I
should implement in C and what should be in Python?

-- 
http://mail.python.org/mailman/listinfo/python-list