Re: [Robots] robot in python?

2003-11-26 Thread Nick Arnett
Petter Karlström wrote:

> Hello all,
> 
> Nice to see that this list woke up again! :^)
And now the list owner finally woke up, too... I hadn't noticed the
recent traffic on the list until just now.  Are those messages about an
address no longer in use going to the whole list?  Aghh.  I've taken
care of that, I hope, but the source address wasn't actually subscribed,
so I had to guess.
Back to the point at hand... I've written several specialized robots in
Python over the last few years.  They are specifically for crawling
on-line discussions and parsing out individual messages and meta-data.
Look for Aahz's examples (do a Google search on Aahz and Python, I'm
sure that'll lead you there).  He makes multi-threading for your spider
pretty easy and adaptable to various kinds of processing.
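
The general shape is just a queue of URLs and a pool of worker threads -- something
like this rough sketch (this is not Aahz's code, only the pattern; fetching and
parsing are stubbed out, and politeness, robots.txt and shutdown are omitted):

    import threading, Queue, urllib

    url_queue = Queue.Queue()

    def worker():
        while True:
            url = url_queue.get()
            if url is None:                  # sentinel: tells the worker to quit
                break
            page = urllib.urlopen(url).read()
            # ... parse the page, store results, put newly found URLs on url_queue ...

    workers = [threading.Thread(target=worker) for i in range(4)]
    for w in workers:
        w.start()
    url_queue.put("http://www.example.com/")   # seed URL (placeholder)
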
> I have written crawlers in Perl before, but I wish to try out Python for
> a hobby project. Has anybody here written a webbot in Python?
> Python is of course a smaller language, so the libraries aren't as
> extensive as the Perl counterparts. Also, I find the documentation
> somewhat lacking (or it could be me being new to the language).
After switching from Perl to Python a couple of years ago, I haven't
ever found the Python libraries lacking, although I expected to.
Documentation, in the form of published books, has been a bit scarce,
but new ones have been coming out lately.  I just looked through one on
text applications in Python, but haven't bought it yet.  It definitely
looked good.
> Are there any small examples available on use of HTMLParser and htmllib?
> Specifically, I need something like the linkextor available in Perl.
One trick is to search on "import [modulename]" as a phrase.  That'll
often uncover code you can use as an example.  What does linkextor do?
Link extractor?  If so, I just use regular expressions.
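
For example, something as crude as this (untested; the pattern is deliberately
loose and will pick up some junk, which is usually fine for a first pass):

    import re, urllib

    HREF_RE = re.compile(r'href\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)

    html = urllib.urlopen("http://www.example.com/").read()   # placeholder URL
    for link in HREF_RE.findall(html):
        print link
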
> Also, what is the neatest way to store session data like login and
> password? PassWordMgr?
Store in what sense?

I'll take a look at my code and see if I can share something generic.
Since we're doing www.opensector.org, I suppose it would only be right
for us to share at least *some* of our code!
However... I just looked at what I have and the older stuff doesn't
really add much to Aahz's examples, other than some simple use of MySQL
as the store; my newer stuff is far too specific to the task I'm doing
to be able to quickly "sanitize" it.
The main thing I did to address our specific needs was to create a Java
class for message pages in specific types of web-based discussion
forums.  That's partly to extract URLs, but mostly to extract other
features and to intelligently (in the sense of being able to update my
database rapidly, re-visiting the minimum number of pages) navigate the
threading structures, which work in various ways.  The class for
Jive-based forums is only 225 lines, as an example.  The multi-threaded
module that uses it is 100 lines; a single-threaded version is 25 lines.
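
In outline it's nothing fancier than a skeleton like this (purely illustrative,
not our actual class, and all the parsing is left out):

    class JiveMessagePage:
        """One page of a thread in a Jive-based forum."""
        def __init__(self, html):
            self.html = html
        def message_urls(self):
            """URLs of the individual messages on this page."""
            return []        # the real version parses self.html
        def next_page_url(self):
            """The next page of the thread, or None if this is the last one."""
            return None
        def meta_data(self):
            """Author, date, thread id, etc. -- used to decide what needs re-visiting."""
            return {}
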
We also have a Python robot for NNTP servers, which obviously doesn't
need recursion.  It's about 400 lines.  A lot of it deals with things
like missing messages, zeroing in on desired date ranges, avoiding
downloading huge messages, recovery from failure, etc.
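
For anyone starting from scratch, the core would just be nntplib plus some
bookkeeping, roughly this shape (an untested sketch, not the real thing; the
server and group are placeholders and all error handling is left out):

    from nntplib import NNTP

    MAX_BYTES = 100 * 1024                       # skip anything bigger than ~100K

    server = NNTP("news.example.com")            # placeholder host
    resp, count, first, last, name = server.group("comp.lang.python")
    resp, overviews = server.xover(first, last)  # (number, subject, poster, date,
                                                 #  message-id, references, size, lines)
    for (number, subject, poster, date, msgid, refs, size, lines) in overviews:
        if int(size) > MAX_BYTES:
            continue                             # avoid downloading huge messages
        resp, num, msgid, body_lines = server.body(number)
        # ... parse body_lines, handle missing articles, store in MySQL, etc.
    server.quit()
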
All of these talk to MySQL...

Nick

--
Nick Arnett
Phone/fax: (408) 904-7198
[EMAIL PROTECTED]
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] robot in python?

2003-11-26 Thread Sean M. Burke
At 11:47 PM 2003-11-17, SsolSsinclair wrote:
> Open Source is a project which came into being through a collective
> effort. Intelligence matching Intelligence. This movement cannot be
> stopped or prevented, SHORT of ceasing communication of all [resulting in
> Deaf Silence, and the Elimination of Sound as a sensory perception],
> clearly not in the interest of any individual or body or civilization, if
> it were possible in the first place.
You talk funny!

This pleases me.

--
Sean M. Burke    http://search.cpan.org/~sburke/
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] robot in python?

2003-11-25 Thread Erick Thompson
There are built-in methods/objects for dealing with both cookies and HTTP 
authentication. It can even handle X.509 certs.

see 
HttpWebRequest.Credentials
HttpWebRequest.ClientCertificates
HttpWebRequest.CookieContainer

So if you wanted to simulate a session, you'd have to hold onto an instance of a 
cookie container, and make sure that you use that container for every request. 

HTH
Erick

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]
> Behalf Of Eric Thompson
> Sent: Tuesday, November 25, 2003 12:35 PM
> To: [EMAIL PROTECTED]
> Subject: Re: [Robots] robot in python?
> 
> 
> Anyone know the best method for simulating a session with 
> .NET/C#?  This may be 
> built into the framework, I don't know.  Any suggestions?
> 
> -E
> 
> 
> Me again,
> 
> 
> >Still wonder how to handle logins, though...
> >
> 
> In case anyone else is interested, and for the archives... 
> John J. Lee's
> ClientCookie module is excellent for handling cookies and 
> other kinds of
> session data. Here: http://wwwsearch.sourceforge.net/ClientCookie/
> 
> cheers
> 
> /petter
> 
> 
> ___
> Robots mailing list
> [EMAIL PROTECTED]
> http://www.mccmedia.com/mailman/listinfo/robots
> 
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] robot in python?

2003-11-25 Thread Eric Thompson
Anyone know the best method for simulating a session with .NET/C#?  This may be 
built into the framework, I don't know.  Any suggestions?

-E

> Me again,
> 
> 
> Still wonder how to handle logins, though...
> 
> In case anyone else is interested, and for the archives... John J. Lee's
> ClientCookie module is excellent for handling cookies and other kinds of
> session data. Here: http://wwwsearch.sourceforge.net/ClientCookie/
> cheers
> 
> /petter


___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] robot in python?

2003-11-25 Thread Petter Karlström
Me again,


Still wonder how to handle logins, though...

In case anyone else is interested, and for the archives... John J. Lee's 
ClientCookie module is excellent for handling cookies and other kinds of 
session data. Here: http://wwwsearch.sourceforge.net/ClientCookie/
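
For the login itself, something like this ought to work (untested; the URL and
form field names are made up, and I'm assuming ClientCookie.urlopen keeps a
shared cookie jar between calls, as the docs seem to say):

    import urllib
    import ClientCookie

    login_data = urllib.urlencode({"user": "petter", "password": "secret"})
    # the session cookie set by the login should then be sent automatically
    # on later requests made through ClientCookie
    response = ClientCookie.urlopen("http://www.example.com/login", login_data)
    response = ClientCookie.urlopen("http://www.example.com/members/")
    print response.read()
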

cheers

/petter

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] robot in python?

2003-11-19 Thread Petter Karlström
SsolSsinclair wrote:

> Walter:
> You may find that the threads and exceptions in Python more than
> make up for anything you are missing in Perl. The Python libraries
> are not as extensive, but that is mostly because they have one of
> everything instead of five or six of everything.
> SsolSsinclair:
> this conclusional statement [above] comes from Walter, which outlines the coding
> advantages of using Python. A person capable of inventing these statements on the
> spot would know them to be true. I am unclear, therefore, why portions of Verity
> UltraSeek [a commercial product] would need to use C or C++ modules.
Well, Walter compared Python and Perl, not Python and C or C++. I can see 
why portions of a bot would be written in C or C++; performance issues 
would perhaps not be too wild a guess.



> I am unclear and have an uninformed opinion about the intent of "finding"
> web-pages in which the original author did not wish to be found. I cannot
> perceive there being any possibility these pages to be of any interest to the
> general public, or to the corporate citizenship. However, this statement by
> Walter: "This tends to overestimate the links (e.g., pulling out references in
> comments, etc.), and often yields fragments that are not really followable, but
> it is at least a possibility." seems to indicate there is a difference between
> the # of pages retrieved, and the "possible number of pages that could be
> retrieved". This numerical difference is QUITE notable, and of interest to
> competing coders. It is, however, a privately held #. Thanks Alex.
Sorry, but what you're discussing is a quite different matter than the 
discussion you're quoting. Overestimation by, for example, regexps just 
means that the bot may mistakenly store some things as tags that really 
aren't. Overestimating what you may need in more general terms is a 
different (albeit interesting) matter.


> Primary concern of Petter:: Still wunder how to handle logins, though...
> 
> This is really your determination to make. Taking anyone's opinion on the matter
> would end in your system being less secure. I would guess some method of
> encryption. I am not really clear on why the need of a PassWordMgr is necessary
> with the development of a Search Engine, crawler. This system should really
> already be in place in your work environment. Maybe even a firewall?
The reason I need some password management is that my app has to log in 
to a secure site. Encryption would certainly be nice!

/Petter

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] robot in python?

2003-11-19 Thread SsolSsinclair
BUSH CALLS ON SENATE TO RATIFY CYBERCRIME TREATY
President Bush has asked the US Senate to ratify the first
international cybercrime treaty.  Bush called the Council of
Europe's controversial treaty "an effective tool in the
global effort to combat computer-related crime" and "the
only multilateral treaty to address the problems of
computer-related crime and electronic evidence gathering."
http://news.com.com/2100-1028_3-5108854.html

This comment looks to affect the coding environment. Appreciate comments on 
pertinence. I don't think Bush has time, however, to spend developing code.


___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] robot in python?

2003-11-19 Thread SsolSsinclair
developing a crawler for a search engine with Python
==

You may find that the threads and exceptions in Python more than
make up for anything you are missing in Perl. The Python libraries
are not as extensive, but that is mostly because they have one of
everything instead of five or six of everything.

Extracting links using a regular HTML parser works fine, and isn't
that much work. One of the major issues in an HTML parser is
dealing with all the illegal HTML on the web.

this conclusional statement [above] comes from Walter, which outlines the coding
advantages of using Python. A person capable of inventing these statements on the
spot would know them to be true. I am unclear, therefore, why portions of Verity
UltraSeek [a commercial product] would need to use C or C++ modules.

Has anybody here written a webbot in Python?

Answer
Verity Ultraseek is a web crawler and search engine written in
Python. Portions of it are C or C++ native modules. Ultraseek
is a commercial product, so we don't give out the code. Sorry.

from: Alexander Halavais
It really depends on what you are looking for, and how tolerant of
errors you are. For most of what I do, I use the HTML parser, but I have
also done simple expression matching to pull out links. This tends to
overestimate the links (e.g., pulling out references in comments, etc.),
and often yields fragments that are not really followable, but it is at
least a possibility.

I am unclear and have an uninformed opinion about the intent of "finding"
web-pages in which the original author did not wish to be found. I cannot
perceive there being any possibility these pages to be of any interest to the
general public, or to the corporate citizenship. However, this statement by
Walter: "This tends to overestimate the links (e.g., pulling out references in
comments, etc.), and often yields fragments that are not really followable, but
it is at least a possibility." seems to indicate there is a difference between
the # of pages retrieved, and the "possible number of pages that could be
retrieved". This numerical difference is QUITE notable, and of interest to
competing coders. It is, however, a privately held #. Thanks Alex.

Specifically, I need something like the linkextor available in Perl.

petter wrote::
Yes, in fact I found some very good examples on the website "Dive Into
Python", including how to do a linkextor. Quite simple.
http://diveintopython.org/html_processing/extracting_data.html This uses
SGMLParser, which presumably is more tolerant of illegal HTML.


Primary concern of Petter:: Still wunder how to handle logins, though...

This is really your determination to make. Taking anyone's opinion on the matter
would end in your system being less secure. I would guess some method of
encryption. I am not really clear on why the need of a PassWordMgr is necessary
with the development of a Search Engine, crawler. This system should really
already be in place in your work environment. Maybe even a firewall?

.Ssol>.
Digital Acquizitionatory Inventory Drive Imaging

>Sol


___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] robot in python?

2003-11-19 Thread Petter Karlström
Walter Underwood wrote:


>> Python is of course a smaller language, so the libraries aren't as
>> extensive as the Perl counterparts. Also, I find the documentation
>> somewhat lacking (or it could be me being new to the language).


> You may find that the threads and exceptions in Python more than
> make up for anything you are missing in Perl. The Python libraries
> are not as extensive, but that is mostly because they have one of
> everything instead of five or six of everything.
Yup, that's why I'm learning Python! I got tired of the "after the fact" 
object orientation and the sometimes maddening syntax of Perl.

> Extracting links using a regular HTML parser works fine, and isn't
> that much work. One of the major issues in an HTML parser is
> dealing with all the illegal HTML on the web.
Yes, in fact I found some very good examples on the website "Dive Into 
Python", including how to do a linkextor. Quite simple. 
http://diveintopython.org/html_processing/extracting_data.html This uses 
SGMLParser, which presumably is more tolerant of illegal HTML.
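
The heart of that example is roughly this (quoting from memory, so treat it as
a sketch):

    import urllib
    from sgmllib import SGMLParser

    class URLLister(SGMLParser):
        def reset(self):
            SGMLParser.reset(self)
            self.urls = []
        def start_a(self, attrs):
            # attrs is a list of (name, value) pairs for the <a> tag
            href = [v for k, v in attrs if k == "href"]
            self.urls.extend(href)

    parser = URLLister()
    parser.feed(urllib.urlopen("http://diveintopython.org/").read())
    parser.close()
    print parser.urls
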

Still wonder how to handle logins, though...

/petter

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] robot in python?

2003-11-18 Thread SsolSsinclair

This following statement indicates a primary concern for World Wealth, thereby
considered a critical issue needing co-addressing, issued by Mr. Walter Underwood.
The Importance is Hereby noted, and the above HereIn regarded as simply Walter,
from my standpoint, in the least.

Walter Underwood wrote:

>Extracting links using a regular HTML parser works fine, and isn't
>that much work. One of the major issues in an HTML parser is
>dealing with all the illegal HTML on the web.

It really depends on what you are looking for, and how tolerant of
errors you are. For most of what I do, I use the HTML parser, but I have
also done simple expression matching to pull out links. This tends to
overestimate the links (e.g., pulling out references in comments, etc.),
and often yields fragments that are not really followable, but it is at
least a possibility.

>.This statement is based in principle on the coder's accuracy of assumption. In
other words, it is an intelligence barrier. The ultimate goal would be not to try
to infiltrate a system, or set up some sort of SPYnet, but rather to ask national
concerns [ISP's] to be receptive. I do not know whether Open Source is acceptable,
but in code world it does exist, thereby having a presence.

Open Source is a project which came into being through a collective effort.
Intelligence matching Intelligence. This movement cannot be stopped or prevented,
SHORT of ceasing communication of all [resulting in Deaf Silence, and the
Elimination of Sound as a sensory perception], clearly not in the interest of any
individual or body or civilization, if it were possible in the first place.

This is said, therefore, to make the point that it is not in our interest to hold
so-called Global CODE wars, when it can exist on a much more competitive level.
Reasoning people would agree, I would think. This division of people can, in fact,
mean that there are competing ISP's offering different proprietary systems to
people of common geography. All people, therefore, can try either. Not really to
matter, but if they do [will] not contribute their code, they cannot compete, as
it would be unfair.

Really am trying to quit philosophizing. Sorry for the bore.

>>
Inspiring Questions



___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots




RE: [Robots] robot in python?

2003-11-18 Thread SsolSsinclair
from <[EMAIL PROTECTED]> wrote:
>
> I have written crawlers in Perl before, but I wish to try out Python for
> a hobby project. Has anybody here written a webbot in Python?

Verity Ultraseek is a web crawler and search engine written in
Python. Portions of it are C or C++ native modules. Ultraseek
is a commercial product, so we don't give out the code. Sorry.

>I accept the passage of this thought, for the sake of the situational
information it yields. On any other reasoning [meaning] of issuance, I object. AND
under NO circumstances is an apology necessary or appropriate, as it is, does and
remains Strictly your choice. thanks for participating.

> Python is of course a smaller language, so the libraries aren't as
> extensive as the Perl counterparts. Also, I find the documentation
> somewhat lacking (or it could be me being new to the language).

>.Refer to Extra-Curricular Information in previous thread, on Python, basically
explaining it is better to use older code.

wunder on,  Chief

sincere thanks >Ssol


___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] robot in python?

2003-11-17 Thread Alexander Halavais
Walter Underwood wrote:

> Extracting links using a regular HTML parser works fine, and isn't
> that much work. One of the major issues in an HTML parser is
> dealing with all the illegal HTML on the web.


It really depends on what you are looking for, and how tolerant of 
errors you are. For most of what I do, I use the HTML parser, but I have 
also done simple expression matching to pull out links. This tends to 
overestimate the links (e.g., pulling out references in comments, etc.), 
and often yields fragments that are not really followable, but it is at 
least a possibility.
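
With the standard library the parser route can be as small as this (untested
sketch, placeholder URL):

    import htmllib, formatter, urllib

    page = urllib.urlopen("http://www.example.com/").read()
    parser = htmllib.HTMLParser(formatter.NullFormatter())
    parser.feed(page)
    parser.close()
    print parser.anchorlist      # href values of all <a> tags, in document order
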

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] robot in python?

2003-11-17 Thread Walter Underwood
--On Sunday, November 16, 2003 4:23 PM +0100 Petter Karlström <[EMAIL PROTECTED]> 
wrote:
> 
> I have written crawlers in Perl before, but I wish to try out Python for
> a hobby project. Has anybody here written a webbot in Python?

Verity Ultraseek is a web crawler and search engine written in
Python. Portions of it are C or C++ native modules. Ultraseek
is a commercial product, so we don't give out the code. Sorry.

> Python is of course a smaller language, so the libraries aren't as
> extensive as the Perl counterparts. Also, I find the documentation
> somewhat lacking (or it could be me being new to the language).

You may find that the threads and exceptions in Python more than
make up for anything you are missing in Perl. The Python libraries
are not as extensive, but that is mostly because they have one of
everything instead of five or six of everything.

Extracting links using a regular HTML parser works fine, and isn't
that much work. One of the major issues in an HTML parser is
dealing with all the illegal HTML on the web.
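
Not Ultraseek code, obviously, but the general shape of a parser-based extractor
with the standard HTMLParser module is something like this (untested sketch; it
will choke on sufficiently broken markup, which is exactly the problem above):

    import urllib
    from HTMLParser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collect the href of every <a> tag fed to it."""
        def __init__(self):
            HTMLParser.__init__(self)
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href":
                        self.links.append(value)

    extractor = LinkExtractor()
    extractor.feed(urllib.urlopen("http://www.example.com/").read())   # placeholder URL
    print extractor.links
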

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] robot in python?

2003-11-16 Thread SsolSsinclair
I have never worked with Python, thereby not being able to justify giving a
correct answer to your question.

If there is not a linkextor available in the Python INTERPRETATION, I would
not suggest using it, unless there is some other justification.

If you are considering developing a linkextor for the project, I would, at
least, be interested in the resulting code. How much time is the project
ongoing?

I would understand that the htmlParser + htmlLib samples would result in the
same user behavioural course outline as the number of browsers.

PassWord Mgr = likely UNIX-style

I frankly cannot honestly answer these inquiries.

.Ssol>.
Digital Acquizitionatory Inventory Drive Imaging

thanks >Sol



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of Petter Karlstrom
Sent: Sunday, November 16, 2003 10:24 AM
To: [EMAIL PROTECTED]
Subject: [Robots] robot in python?


Hello all,

Nice to see that this list woke up again! :^)

I have written crawlers in Perl before, but I wish to try out Python for
a hobby project. Has anybody here written a webbot in Python?

Python is of course a smaller language, so the libraries aren't as
extensive as the Perl counterparts. Also, I find the documentation
somewhat lacking (or it could be me being new to the language).

Are there any small examples available on use of HTMLParser and htmllib?
Specifically, I need something like the linkextor available in Perl.

Also, what is the neatest way to store session data like login and
password? PassWordMgr?

Pointers, anyone?

cheers

/Petter


___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots