Re: web crawler in python
On Dec 9, 2009, at 7:39 PM, my name wrote:
> I'm currently planning on writing a web crawler in Python but have a question as far as how I should design it. My goal is speed and maximally efficient use of the hardware/bandwidth I have available. As of now I have a dual 2.4 GHz Xeon box, 4 GB RAM, a 500 GB SATA drive and a 20 Mbps bandwidth cap (for now), running FreeBSD. What would be the best way to design the crawler? Using the thread module? Would I be able to max out this connection with the hardware listed above using Python threads?

I wrote a web crawler in Python (under FreeBSD, in fact) and I chose to do it using separate processes. Process A would download pages and write them to disk, process B would attempt to convert them to Unicode, process C would evaluate the content, etc. That worked well for me because the processes were very independent of one another, so they had very little data to share. Each process had a work queue (a Postgres database table); process A would feed B's queue, B would feed C's and D's queues, etc.

I should point out that my crawler spidered one site at a time. As a result the downloading process spent a lot of time waiting (in order to be polite to the remote Web server). This sounds pretty different from what you want to do (and indeed from most crawlers).

Figuring out the best design for your crawler depends on a host of factors that you haven't mentioned. (What are you doing with the pages you download? Is the box doing anything else? Are you storing the pages long term or discarding them? etc.) I don't think we can do it for you -- I know *I* can't; I have a day job. ;) But I encourage you to try something out. If you find your code isn't giving you what you want, come back to the list with a specific problem. It's always easier to help with specific problems than with general ones.

Good luck
Philip
--
http://mail.python.org/mailman/listinfo/python-list
web crawler in python
I'm currently planning on writing a web crawler in Python but have a question as far as how I should design it. My goal is speed and maximally efficient use of the hardware/bandwidth I have available. As of now I have a dual 2.4 GHz Xeon box, 4 GB RAM, a 500 GB SATA drive and a 20 Mbps bandwidth cap (for now), running FreeBSD. What would be the best way to design the crawler? Using the thread module? Would I be able to max out this connection with the hardware listed above using Python threads? Thank you kindly.
Web crawler on python
I need a simple web crawler. I found Ruya, but it seems not to be currently maintained. Does anybody know a good web crawler in Python, or one with a Python interface? http://watch-me.890m.com
Re: Web crawler on python
On Fri, Oct 31, 2008 at 8:13 AM, yura [EMAIL PROTECTED] wrote:
> I need a simple web crawler. I found Ruya, but it seems not to be currently maintained. Does anybody know a good web crawler in Python, or one with a Python interface? http://watch-me.890m.com

http://hg.softcircuit.com.au/index.wsgi/projects/pymills/file/edc08c87ecb7/examples/spider.py

cheers
James
--
Problems are solved by method
Re: Web crawler on python
On Oct 26, 9:54 pm, sonich [EMAIL PROTECTED] wrote:
> I need a simple web crawler. I found Ruya, but it seems not to be currently maintained. Does anybody know a good web crawler in Python, or one with a Python interface?

You should try Orchid http://pypi.python.org/pypi/Orchid/1.1 or you can have a look at my project on Launchpad: https://code.launchpad.net/~esaurito/jazz-crawler/experimental. It's a single-site crawler, but you can easily modify it.

Bye.
Alex
Web crawler on python
I need a simple web crawler. I found Ruya, but it seems not to be currently maintained. Does anybody know a good web crawler in Python, or one with a Python interface?
Re: Web crawler on python
On Sun, Oct 26, 2008 at 9:54 PM, sonich [EMAIL PROTECTED] wrote:
> I need a simple web crawler. I found Ruya, but it seems not to be currently maintained. Does anybody know a good web crawler in Python, or one with a Python interface?

What about BeautifulSoup? http://www.crummy.com/software/BeautifulSoup/
Re: Web crawler on python
On Mon, Oct 27, 2008 at 6:54 AM, sonich [EMAIL PROTECTED] wrote:
> I need a simple web crawler. I found Ruya, but it seems not to be currently maintained. Does anybody know a good web crawler in Python, or one with a Python interface?

Simple, but it works. Extend it all you like. http://hg.softcircuit.com.au/index.wsgi/projects/pymills/file/330d047ff663/examples/spider.py

$ spider.py --help
Usage: spider.py [options] url

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -q, --quiet           Enable quiet mode
  -l, --links           Get links for specified url only
  -d DEPTH, --depth=DEPTH
                        Maximum depth to traverse

cheers
James
--
Problems are solved by method
Re: web crawler in python or C?
abhinav wrote:
> I want to strike a balance between development speed and crawler speed.

"The best performance improvement is the transition from the nonworking state to the working state." - J. Ousterhout

Try to get there as soon as possible. You can figure out what that means. ;^)

When you do all your programming in Python, most of the code that is relevant for speed *is* written in C already. If performance is slow, measure! Use the profiler to see if you are spending a lot of time in Python code. If that is your problem, take a close look at your algorithms and perhaps your data structures and see what you can improve with Python. In the long run, going from e.g. O(n^2) to O(n log n) might mean much more than going from Python to C. A poor algorithm in machine code still sucks when you have to handle enough data. Changing your code to improve on algorithms and structure is a lot easier in Python than in C.

If you've done all these things, still have performance problems, and have identified a bottleneck in your Python code, it might be time to get that piece rewritten in C. The easiest and least intrusive way to do that might be with Pyrex. You might also want to try Psyco before you do this.

Even if you end up writing a whole program in C, it's not unlikely that you will get to your goal faster if your first version is written in Python. Good luck!

P.S. Why someone would want to write yet another web crawler is a puzzle to me. Surely there are plenty of good ideas that haven't been properly implemented yet! It's probably very difficult to beat Google on their home turf now, but I'd really like to see a good tool to manage all that information I got from the net, or through mail, or wrote myself. I don't think they wrote that yet--although I'm sure they are trying.
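The O(n^2) vs. O(n log n) point is easy to demonstrate on a crawler-shaped task: testing whether a URL was already visited is a linear scan against a list (quadratic over the whole crawl) but an average constant-time hash lookup against a set. A toy benchmark (the URL list is invented):

```python
# Deduplicating visited URLs: a list gives O(n) membership tests
# (quadratic overall), a set gives O(1) on average.
import timeit

urls = ["http://example.com/page/%d" % i for i in range(2000)]

def dedup_list(items):
    seen = []
    for item in items:
        if item not in seen:   # O(n) scan on every test
            seen.append(item)
    return seen

def dedup_set(items):
    seen = set()
    for item in items:
        if item not in seen:   # O(1) average hash lookup
            seen.add(item)
    return list(seen)

t_list = timeit.timeit(lambda: dedup_list(urls), number=3)
t_set = timeit.timeit(lambda: dedup_set(urls), number=3)
print("list: %.3fs  set: %.3fs" % (t_list, t_set))
```

Both versions are pure Python; only the data structure changed, and the gap widens quadratically as the crawl grows.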
Re: web crawler in python or C?
I think something that may be even more important to consider than the pure speed of your program would be ease of design, as well as the overall stability of your code. My opinion is that writing in Python would have many benefits over the speed gains of using C. For instance, your crawler will have to handle all types of input from all over the web. Who can say what types of malformed or poorly written data it will come across? I think it would be easier to create a system to handle this type of data in Python than in C.

I don't want to pigeon-hole your project, but if it is for any use other than a commercial product, I would say speed would be a concern lower on the list than accuracy or time to develop. As others have pointed out, if you hit many performance barriers, chances are the problem is the algorithm and not Python itself.

I wish you luck and hope you will experiment in Python first. If your crawler is still not up to par, at the very least you might come up with some ideas for how Python could be improved.
Re: web crawler in python or C?
abhinav wrote:
> Hi guys. I have to implement a topical crawler as a part of my project. What language should I use to implement it, C or Python? Python has a fast development cycle, but my concern is speed also. I want to strike a balance between development speed and crawler speed. Since Python is an interpreted language, it is rather slow. The crawler, which will be working on a huge set of pages, should be as fast as possible. One possible implementation would be implementing partly in C and partly in Python so that I can have the best of both worlds. But I don't know how to approach it. Can anyone guide me on what part should be implemented in C and what should be in Python?

Get real. Any web crawler is bound to spend huge amounts of its time waiting for data to come in over network pipes. Or do you have plans for massive parallelism previously unheard of in the Python world?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/
Re: web crawler in python or C?
This is following the pattern of your previous post on language choice wrt. writing a mail server. It is very common for beginners to over-emphasize performance requirements, size of the executable, etc. More is always good, right? Yes! But at what cost?

The rule of thumb for all your Python vs. C questions is:

1.) Choose Python by default.

2.) If your program is slow, it's your algorithm that you need to check first. Python, strictly speaking, will be slow because of its dynamism. However, most of whatever is performance-critical in Python is already implemented in C, and the speed of well-written Python programs with properly chosen extensions and algorithms is not far off.

3.) Remember that you can always drop back to C wherever you need to, without throwing away all of your code. And even if you had to, Python is very valuable as a prototyping tool, since it is very agile. You would have figured out what you needed to do by then, so rewriting it in C will only take a fraction of the time compared to if it was written in C directly.

Don't even start with the question "is it fast enough?" until you have already written it in Python and it turns out that it is not running fast enough despite the correctness of your code. If that happens, you can fix it relatively easily. It is easy to write bad code in C, and poorly written C code performs worse than well-written Python code.

Remember Donald Knuth's quote: "Premature optimization is the root of all evil in programming."

C is a language intended to be used when you NEED tight control over memory allocation. It has few advantages in other scenarios. Don't abuse it by choosing it by default.
Re: web crawler in python or C?
Ravi Teja [EMAIL PROTECTED] wrote:
> The rule of thumb for all your Python vs. C questions is:
> 1.) Choose Python by default.

+1 QOTW!-)

> 2.) If your program is slow, it's your algorithm that you need to check

Seriously: yes, and (often even more importantly) data structure. However, often the most important tip, particularly for large-scale systems, is to consider your program's _architecture_ (algorithms are about details of computation; architecture is about partitioning systems into components, locating their deployment, and so forth).

At a generic and lowish level: are you, for example, creating a lot of threads, each for a small amount of work? Then consider reusing threads from a worker-threads pool. Or maybe you could avoid threads and use event-driven programming; or, at the other extreme, have multiple processes communicating by TCP/IP so you can scale up your system to tens or hundreds of processors -- in the latter case, partitioning your system appropriately to minimize inter-process communication may be the bottleneck. Consider UDP, when you can afford missing a packet once in a while -- sometimes it may let you reduce overheads compared to TCP connections. Database connections, and less importantly database cursors, are well worth reusing. What are you caching, and what instead is getting recomputed over and over? It's possible to undercache (needless repeated computation) but also to overcache (tying up memory and causing paging). Are you making lots of system calls that you might be able to avoid? Each system call has a context-switching cost, after all... Any or all of these hints may be irrelevant to a specific category of applications, but then, so can the hint about algorithms be.

One cool thing about Python is that it makes it easy and fast for you to try out different approaches (particularly to architecture, but to algorithms as well), even drastically different ones, when simple reasoning about the issues leaves you undecided and you need to settle them empirically.

> Remember Donald Knuth's quote: "Premature optimization is the root of all evil in programming."

I believe Knuth himself said he was quoting Tony Hoare, and indeed referred to this as "Hoare's dictum".

Alex
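The worker-thread-pool idea Alex mentions fits a crawler's fetch stage well, since the threads spend most of their time blocked on the network. A minimal sketch with a fixed pool pulling URLs from a shared queue; fetch() is a stand-in for a real HTTP request:

```python
# Reusing a fixed pool of worker threads instead of spawning one thread
# per URL. Workers pull from a shared queue until they see a sentinel.
import queue
import threading

NUM_WORKERS = 4
SENTINEL = None

def fetch(url):
    # Placeholder for real network I/O (urllib, sockets, ...),
    # which is where the threads would actually block.
    return "content of %s" % url

def worker(work_q, results):
    while True:
        url = work_q.get()
        if url is SENTINEL:
            break
        results.append((url, fetch(url)))  # list.append is thread-safe in CPython

work_q = queue.Queue()
results = []
threads = [threading.Thread(target=worker, args=(work_q, results))
           for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for i in range(10):
    work_q.put("http://example.com/%d" % i)
for _ in range(NUM_WORKERS):
    work_q.put(SENTINEL)  # one sentinel per worker shuts the pool down
for t in threads:
    t.join()
print(len(results))
```

The queue decouples producers from consumers, so thread creation cost is paid once rather than per URL, which is exactly the reuse Alex is recommending.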
Re: web crawler in python or C?
abhinav wrote:
> It is DSL broadband, 128 kbps. But that's not the point. What I am saying is: would Python be fine for implementing fast crawler algorithms, or should I use C?

But a web crawler is going to be *mainly* I/O bound - so language efficiency won't be the main issue. There are several web crawlers implemented in Python.

> Handling huge data, multithreading, file handling, heuristics for ranking, and maintaining huge data structures. What should be the language so as not to compromise that much on speed? What is the performance of Python-based crawlers vs C-based crawlers? Should I use both languages (partly C and Python)? How should I decide what part should be implemented in C and what should be done in Python? Please guide me. Thanks.

If your data processing requirements are fairly heavy you will *probably* get a speed advantage coding them in C and accessing them from Python. The usual advice (which seems to be applicable to you) is to prototype in Python (which will be much more fun than in C), then test. Profile to find your real bottlenecks (if the Python one isn't fast enough - which it may be), and move your bottlenecks to C.

All the best,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
Re: web crawler in python or C?
abhinav [EMAIL PROTECTED] writes:
> It is DSL broadband, 128 kbps. But that's not the point.

But it is the point.

> What I am saying is: would Python be fine for implementing fast crawler algorithms, or should I use C? Handling huge data, multithreading, file handling, heuristics for ranking, and maintaining huge data structures. What should be the language so as not to compromise that much on speed? What is the performance of Python-based crawlers vs C-based crawlers? Should I use both languages (partly C and Python)? How should I decide what part should be implemented in C and what should be done in Python? Please guide me. Thanks.

I think if you don't know how to answer these questions for yourself, you're not ready to take on projects of that complexity. My advice is to start in Python, since development will be much easier. If and when you start hitting performance problems, you'll have to examine many combinations of tactics for dealing with them, and switching languages is just one such tactic.
Re: web crawler in python or C?
Paul Rubin wrote:
> I think if you don't know how to answer these questions for yourself, you're not ready to take on projects of that complexity. My advice is start in Python since development will be much easier. If and when you start hitting performance problems, you'll have to examine many combinations of tactics for dealing with them, and switching languages is just one such tactic.

There's another potential bottleneck: parsing HTML and extracting the text you want, especially when you hit pages that don't meet the HTML 4 or XHTML spec. http://sig.levillage.org/?p=599

Paul's advice is very sound, given what little info you've provided. http://trific.ath.cx/resources/python/optimization/

Also look at Psyco, Pyrex, Boost, SWIG and ctypes for bridging C and Python; you have a lot of options. And look at HarvestMan, mechanize, and other existing libs.
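On the parsing bottleneck: the standard library's html.parser is deliberately tolerant, which matters once the crawler hits pages that violate the spec. A sketch of link extraction that survives unclosed tags (the sample markup is invented):

```python
# Extracting links with the stdlib's tolerant HTML parser; it keeps
# going past malformed markup instead of raising an error.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <A HREF=...>
        # is handled the same as <a href=...>.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

broken_html = """
<html><body>
<p>No closing tag here
<a href="http://example.com/one">one</a>
<A HREF='http://example.com/two'>two
</body>
"""

parser = LinkExtractor()
parser.feed(broken_html)
print(parser.links)
```

The unclosed <p>, the unclosed second anchor, and the missing </html> are all absorbed silently, which is roughly the behaviour a real-web crawler needs from its parser.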
Re: web crawler in python or C?
abhinav wrote:
> Hi guys. I have to implement a topical crawler as a part of my project. What language should I use to implement it?

Oh, and there are some really good books out there, besides the O'Reilly "Spidering Hacks". Springer-Verlag has a couple of books on text mining, and at least a couple of books with "web intelligence" in the title. Expensive, but worth it.
Re: web crawler in python or C?
On 15 Feb 2006 21:56:52 -0800, abhinav [EMAIL PROTECTED] wrote:
> Hi guys. I have to implement a topical crawler as a part of my project. What language should I use to implement it, C or Python?

Why does this keep coming up on here as of late? If you search the archives, you can find numerous posts about spiders. One interesting fact is that Google itself started out with spiders written in Python. http://www-db.stanford.edu/~backrub/google.html I'm _sure_ it'll work for you.

--
Andrew Gwozdziewycz [EMAIL PROTECTED]
http://ihadagreatview.org
http://plasticandroid.org
Re: web crawler in python or C?
On Wed, 15 Feb 2006 21:56:52 -0800, abhinav wrote:
> Hi guys. I have to implement a topical crawler as a part of my project. What language should I use to implement it, C or Python? Python has a fast development cycle, but my concern is speed also. I want to strike a balance between development speed and crawler speed. Since Python is an interpreted language, it is rather slow.

Python is no more interpreted than Java. Like Java, it is compiled to byte-code. Unlike Java, it doesn't take three weeks to start the runtime environment. (Okay, maybe it just *seems* like three weeks.)

The nice clean distinctions between compiled and interpreted languages haven't existed in most serious programming languages for a decade or more. In these days of tokenizers and byte-code compilers and processors emulating other processors, the difference is more of degree than kind. It is true that standard Python doesn't compile to platform-dependent machine code, but that is rarely an issue, since the bottleneck for most applications is I/O or human interaction, not language speed. And for those cases where it is a problem, there are solutions, like Psyco.

After all, it is almost never true that your code must run as fast as physically possible. That's called over-engineering. It just needs to run as fast as needed, that's all. And that's a much simpler problem to solve cheaply.

> The crawler, which will be working on a huge set of pages, should be as fast as possible.

Web crawler performance is almost certainly going to be I/O bound. Sounds to me like you are guilty of trying to optimize your code before even writing a single line of code. What you call huge may not be huge to your computer. Have you tried?

The great thing about Python is you can write a prototype in maybe a tenth the time it would take you to do the same thing in C. Instead of trying to guess what the performance bottlenecks will be, you can write your code and profile it and find the bottlenecks with accuracy.
> One possible implementation would be implementing partly in C and partly in Python so that I can have the best of both worlds.

Sure you can do that, if you need to.

> But I don't know how to approach it. Can anyone guide me on what part should be implemented in C and what should be in Python?

Yes. Write it all in Python. Test it, debug it, get it working. Once it is working, and not before, rigorously profile it. You may find it is fast enough. If it is not fast enough, find the bottlenecks. Replace them with better algorithms. We had an example on comp.lang.python just a day or two ago where a function which was taking hours to complete was re-written with a better algorithm which took only seconds. And still in Python. If it is still too slow after using better algorithms, or if there are no better algorithms, then and only then re-write those bottlenecks in C for speed.

--
Steven.
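Steven's "rigorously profile it" step takes only a few lines with the standard library: run the code under cProfile and sort the report by cumulative time. A minimal sketch with an artificial hot spot standing in for a real bottleneck:

```python
# Profiling with the stdlib: cProfile records where time goes, and
# pstats sorts and prints the report. slow_part() is a deliberately
# expensive placeholder, not real crawler code.
import cProfile
import io
import pstats

def slow_part():
    # Artificial hot spot: dominates the runtime of crawl_step().
    return sum(i * i for i in range(200000))

def fast_part():
    return sum(range(100))

def crawl_step():
    fast_part()
    return slow_part()

profiler = cProfile.Profile()
profiler.enable()
crawl_step()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats()
report = out.getvalue()
print(report)
```

In the sorted output, slow_part() rises to the top of the cumulative-time column, which is the kind of hard evidence Steven is asking for before anything gets rewritten in C.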
web crawler in python or C?
Hi guys. I have to implement a topical crawler as a part of my project. What language should I use to implement it, C or Python? Python has a fast development cycle, but my concern is speed also. I want to strike a balance between development speed and crawler speed. Since Python is an interpreted language, it is rather slow. The crawler, which will be working on a huge set of pages, should be as fast as possible. One possible implementation would be implementing partly in C and partly in Python so that I can have the best of both worlds. But I don't know how to approach it. Can anyone guide me on what part should be implemented in C and what should be in Python?
Re: web crawler in python or C?
abhinav [EMAIL PROTECTED] writes:
> The crawler, which will be working on a huge set of pages, should be as fast as possible.

What kind of network connection do you have, that's fast enough that even a fairly CPU-inefficient crawler won't saturate it?
Re: web crawler in python or C?
It is DSL broadband, 128 kbps. But that's not the point. What I am saying is: would Python be fine for implementing fast crawler algorithms, or should I use C? Handling huge data, multithreading, file handling, heuristics for ranking, and maintaining huge data structures -- what should be the language so as not to compromise that much on speed? What is the performance of Python-based crawlers vs C-based crawlers? Should I use both languages (partly C and Python)? How should I decide what part should be implemented in C and what should be done in Python? Please guide me. Thanks.
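Paul's point can be put in numbers. At the 128 kbps stated above, and assuming an average page of 50 KB (an invented figure), the link itself limits the crawl to about one page every three seconds; no language choice matters at that rate:

```python
# Back-of-envelope: how fast can a crawler possibly go on a 128 kbps
# link? The 50 KB average page size is an assumed figure.
LINK_BITS_PER_SEC = 128_000   # 128 kbps DSL, as stated in the thread
PAGE_BYTES = 50_000           # assumed average page size

bytes_per_sec = LINK_BITS_PER_SEC / 8
pages_per_sec = bytes_per_sec / PAGE_BYTES
print("%.0f KB/s, %.2f pages/s, one page every %.1f s"
      % (bytes_per_sec / 1000, pages_per_sec, 1 / pages_per_sec))
```

Even a CPU-heavy pure-Python crawler would spend almost all of its time idle waiting on this link, which is the sense in which the connection *is* the point.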