RE: [Robots] robot in python?
At 11:47 PM 2003-11-17, SsolSsinclair wrote:
> Open Source is a project which came into being through a collective effort. Intelligence matching Intelligence. This movement cannot be stopped or prevented, short of ceasing all communication [resulting in deaf silence, and the elimination of sound as a sensory perception, clearly not in the interest of any individual or body or civilization], if it were possible in the first place.

You talk funny! This pleases me.

-- Sean M. Burke
http://search.cpan.org/~sburke/

___ Robots mailing list [EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots
Re: [Robots] robot in python?
Petter Karlström wrote:
> Hello all, Nice to see that this list woke up again! :^)

And now the list owner finally woke up, too... I hadn't noticed the recent traffic on the list until just now. Are those messages about an address no longer in use going to the whole list? Aghh. I've taken care of that, I hope, but the source address wasn't actually subscribed, so I had to guess.

Back to the point at hand... I've written several specialized robots in Python over the last few years. They are specifically for crawling on-line discussions and parsing out individual messages and metadata. Look for Aahz's examples (do a Google search on Aahz and Python; I'm sure that'll lead you there). He makes multi-threading for your spider pretty easy and adaptable to various kinds of processing.

> I have written crawlers in Perl before, but I wish to try out Python for a hobby project. Has anybody here written a webbot in Python? Python is of course a smaller language, so the libraries aren't as extensive as the Perl counterparts. Also, I find the documentation somewhat lacking (or it could be me being new to the language).

After switching from Perl to Python a couple of years ago, I haven't ever found the Python libraries lacking, although I expected to. Documentation, in the form of published books, has been a bit scarce, but new ones have been coming out lately. I just looked through one on text applications in Python, but haven't bought it yet. It definitely looked good.

> Are there any small examples available on use of HTMLParser and htmllib? Specifically, I need something like the linkextor available in Perl.

One trick is to search on "import [modulename]" as a phrase. That'll often uncover code you can use as an example. What does linkextor do? Link extractor? If so, I just use regular expressions.

> Also, what is the neatest way to store session data like login and password? PassWordMgr?

Store in what sense? I'll take a look at my code and see if I can share something generic.
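For what it's worth, the regular-expression approach mentioned above can be sketched in a few lines. The pattern below is deliberately naive (it misses unquoted attributes and will happily match inside comments), and the sample page is made up:

```python
import re

# Naive href-matching pattern: requires a quoted attribute value
# and does not distinguish live markup from commented-out markup.
HREF_RE = re.compile(r'<a\s+[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)

def extract_links(html):
    """Return all href targets found by simple pattern matching."""
    return HREF_RE.findall(html)

page = '<p><a href="http://example.com/">home</a> <a href=\'/about\'>about</a></p>'
print(extract_links(page))  # ['http://example.com/', '/about']
```

For a quick crawler this is often good enough; the parser-based alternative discussed later in the thread trades a little more code for fewer false positives.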
Since we're doing www.opensector.org, I suppose it would only be right for us to share at least *some* of our code! However... I just looked at what I have, and the older stuff doesn't really add much to Aahz's examples, other than some simple use of MySQL as the store; my newer stuff is far too specific to the task I'm doing to be able to quickly sanitize it.

The main thing I did to address our specific needs was to create a Java class for message pages in specific types of web-based discussion forums. That's partly to extract URLs, but mostly to extract other features and to intelligently (in the sense of being able to update my database rapidly, re-visiting the minimum number of pages) navigate the threading structures, which work in various ways. The class for Jive-based forums is only 225 lines, as an example. The multi-threaded module that uses it is 100 lines; a single-threaded version is 25 lines.

We also have a Python robot for NNTP servers, which obviously doesn't need recursion. It's about 400 lines. A lot of it deals with things like missing messages, zeroing in on desired date ranges, avoiding downloading huge messages, recovery from failure, etc. All of these talk to MySQL...

Nick
-- Nick Arnett Phone/fax: (408) 904-7198 [EMAIL PROTECTED]
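For the archives, the worker-queue pattern that Nick and Aahz's examples rely on can be sketched in a few lines of modern Python. This is not Nick's or Aahz's actual code; the fetch function is injected as a parameter so the sketch stays self-contained and testable without a network:

```python
import queue
import threading

def crawl(urls, fetch, num_workers=4):
    """Fetch every URL using a pool of worker threads.

    `fetch` is any callable taking a URL and returning its body;
    results are collected in a dict keyed by URL.
    """
    work = queue.Queue()
    results = {}
    lock = threading.Lock()

    for url in urls:
        work.put(url)

    def worker():
        while True:
            try:
                url = work.get_nowait()
            except queue.Empty:
                return  # queue drained; worker exits
            body = fetch(url)
            with lock:  # dict writes guarded for clarity
                results[url] = body

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In a real spider, `fetch` would wrap an HTTP GET with timeouts and error recovery, and workers would push newly discovered links back onto the queue; the skeleton above keeps only the threading shape.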
Re: [Robots] robot in python?
Me again. Still wondering how to handle logins, though... In case anyone else is interested, and for the archives... John J. Lee's ClientCookie module is excellent for handling cookies and other kinds of session data. Here: http://wwwsearch.sourceforge.net/ClientCookie/

cheers /petter
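For later readers of the archive: ClientCookie's code later became the basis of Python's standard cookielib module (now http.cookiejar), so the same session handling can be sketched with nothing but the standard library. The login URL and form-field names below are purely hypothetical:

```python
import urllib.request
from http import cookiejar

# An opener that stores cookies across requests, so a login POST
# followed by later GETs behaves like one browser session.
jar = cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar))

# Hypothetical login flow; real field names depend on the site's form:
# import urllib.parse
# data = urllib.parse.urlencode({"user": "me", "password": "secret"}).encode()
# opener.open("http://example.com/login", data)   # server sets session cookie
# opener.open("http://example.com/members")       # cookie sent back automatically
```

The key design point is the same one ClientCookie makes: keep a single cookie jar alive for the lifetime of the session and route every request through the opener that owns it.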
Re: [Robots] robot in python?
Anyone know the best method for simulating a session with .NET/C#? This may be built into the framework; I don't know. Any suggestions? -E

> Me again. Still wondering how to handle logins, though... In case anyone else is interested, and for the archives... John J. Lee's ClientCookie module is excellent for handling cookies and other kinds of session data. Here: http://wwwsearch.sourceforge.net/ClientCookie/
> cheers /petter
RE: [Robots] robot in python?
There are built-in methods/objects for dealing with both cookies and HTTP authentication. It can even handle X.509 certs. See:

HttpWebRequest.Credentials
HttpWebRequest.ClientCertificates
HttpWebRequest.CookieContainer

So if you wanted to simulate a session, you'd have to hold onto an instance of a cookie container, and make sure that you use that container for every request.

HTH, Erick

-----Original Message----- From: Eric Thompson Sent: Tuesday, November 25, 2003 12:35 PM To: [EMAIL PROTECTED] Subject: Re: [Robots] robot in python?
> Anyone know the best method for simulating a session with .NET/C#? This may be built into the framework; I don't know. Any suggestions? -E
RE: [Robots] robot in python?
developing a crawler for a search engine with Python ==

> You may find that the threads and exceptions in Python more than make up for anything you are missing in Perl. The Python libraries are not as extensive, but that is mostly because they have one of everything instead of five or six of everything. Extracting links using a regular HTML parser works fine, and isn't that much work. One of the major issues in an HTML parser is dealing with all the illegal HTML on the web.

This concluding statement [above] comes from Walter, and outlines the coding advantages of using Python. A person capable of inventing these statements on the spot would know them to be true. I am unclear, therefore, why portions of Verity Ultraseek [a commercial product] would need to use C or C++ modules.

> Has anybody here written a webbot in Python?

Answer:
> Verity Ultraseek is a web crawler and search engine written in Python. Portions of it are C or C++ native modules. Ultraseek is a commercial product, so we don't give out the code. Sorry.

From Alexander Halavais:
> It really depends on what you are looking for, and how tolerant of errors you are. For most of what I do, I use the HTML parser, but I have also done simple expression matching to pull out links. This tends to overestimate the links (e.g., pulling out references in comments, etc.), and often yields fragments that are not really followable, but it is at least a possibility.

I am unclear and have an uninformed opinion about the intent of finding web pages in which the original author did not wish to be found. I cannot perceive any possibility of these pages being of interest to the general public, or to the corporate citizenship.
However, this statement by Walter: "This tends to overestimate the links (e.g., pulling out references in comments, etc.), and often yields fragments that are not really followable, but it is at least a possibility." seems to indicate there is a difference between the number of pages retrieved and the possible number of pages that could be retrieved. This numerical difference is QUITE notable, and of interest to competing coders. It is, however, a privately held number. Thanks, Alex.

> Specifically, I need something like the linkextor available in Perl.

Petter wrote:
> Yes, in fact I found some very good examples on the website Dive Into Python, including how to do a linkextor. Quite simple. http://diveintopython.org/html_processing/extracting_data.html This uses SGMLParser, which presumably is more tolerant of illegal HTML.

Primary concern of Petter: "Still wonder how to handle logins, though..."

Me: This is really your determination to make. Taking anyone's opinion on the matter would end in your system being less secure. I would guess some method of encryption. I am not really clear on why a PassWordMgr is necessary for the development of a search engine crawler. This system should really already be in place in your work environment. Maybe even a firewall?

.Ssol. Digital Acquizitionatory Inventory Drive Imaging Sol
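For the archives: the Dive Into Python link extractor that Petter mentions was built on SGMLParser; the same pattern works with the standard HTMLParser class (which is what survives into Python 3, where SGMLParser was removed). A minimal sketch in that style:

```python
from html.parser import HTMLParser

class URLLister(HTMLParser):
    """Collect href values from <a> tags as the page is parsed,
    in the style of the Dive Into Python URLLister example."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            self.urls.extend(value for name, value in attrs if name == "href")

lister = URLLister()
lister.feed('<a href="/one">first</a> <a href="/two">second</a>')
print(lister.urls)  # ['/one', '/two']
```

Feeding the page through a real parser also handles uppercase tags, extra attributes, and single-quoted values without any changes to the pattern.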
RE: [Robots] robot in python?
BUSH CALLS ON SENATE TO RATIFY CYBERCRIME TREATY

President Bush has asked the US Senate to ratify the first international cybercrime treaty. Bush called the Council of Europe's controversial treaty an effective tool in the global effort to combat computer-related crime, and the only multilateral treaty to address the problems of computer-related crime and electronic evidence gathering. http://news.com.com/2100-1028_3-5108854.html

This item looks likely to affect the coding environment. I'd appreciate comments on its pertinence. I don't think Bush has time, however, to spend developing code.
RE: [Robots] robot in python?
From [EMAIL PROTECTED]:
> I have written crawlers in Perl before, but I wish to try out Python for a hobby project. Has anybody here written a webbot in Python?

> Verity Ultraseek is a web crawler and search engine written in Python. Portions of it are C or C++ native modules. Ultraseek is a commercial product, so we don't give out the code. Sorry.

I accept the passage of this thought, for the sake of the situational information it yields. On any other reasoning [meaning] of issuance, I object. And under NO circumstances is an apology necessary or appropriate, as it is, does and remains strictly your choice. Thanks for participating.

> Python is of course a smaller language, so the libraries aren't as extensive as the Perl counterparts. Also, I find the documentation somewhat lacking (or it could be me being new to the language).

Refer to the extra-curricular information in the previous thread on Python, basically explaining that it is better to use older code.

wunder on, Chief. Sincere thanks, Ssol
Re: [Robots] robot in python?
--On Sunday, November 16, 2003 4:23 PM +0100 Petter Karlström [EMAIL PROTECTED] wrote:
> I have written crawlers in Perl before, but I wish to try out Python for a hobby project. Has anybody here written a webbot in Python?

Verity Ultraseek is a web crawler and search engine written in Python. Portions of it are C or C++ native modules. Ultraseek is a commercial product, so we don't give out the code. Sorry.

> Python is of course a smaller language, so the libraries aren't as extensive as the Perl counterparts. Also, I find the documentation somewhat lacking (or it could be me being new to the language).

You may find that the threads and exceptions in Python more than make up for anything you are missing in Perl. The Python libraries are not as extensive, but that is mostly because they have one of everything instead of five or six of everything. Extracting links using a regular HTML parser works fine, and isn't that much work. One of the major issues in an HTML parser is dealing with all the illegal HTML on the web.

wunder
-- Walter Underwood Principal Architect Verity Ultraseek
Re: [Robots] robot in python?
Walter Underwood wrote:
> Extracting links using a regular HTML parser works fine, and isn't that much work. One of the major issues in an HTML parser is dealing with all the illegal HTML on the web.

It really depends on what you are looking for, and how tolerant of errors you are. For most of what I do, I use the HTML parser, but I have also done simple expression matching to pull out links. This tends to overestimate the links (e.g., pulling out references in comments, etc.), and often yields fragments that are not really followable, but it is at least a possibility.
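A small sketch for the archives, illustrating exactly this overestimate: pattern matching happily pulls a link out of an HTML comment, while a real parser dispatches the comment separately and skips it. The page snippet is made up:

```python
import re
from html.parser import HTMLParser

page = '<a href="/real">ok</a> <!-- <a href="/stale">old</a> -->'

# Pattern matching sees both hrefs, including the commented-out one.
pattern_links = re.findall(r'href="([^"]+)"', page)
print(pattern_links)  # ['/real', '/stale']

# A parser routes <!-- ... --> to comment handling, so the stale
# link never reaches handle_starttag.
class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

parser = LinkParser()
parser.feed(page)
print(parser.links)  # ['/real']
```

Which overestimate is tolerable is the judgment call being discussed: a crawler that re-checks every fetched URL can afford the extra candidates, while one trying to minimize page visits cannot.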