[htdig3-dev] Ht://Check

Gabriele Bartolini Thu, 20 Apr 2000 01:03:07 -0700
Ciao my friends,

        I am glad to inform you that the first version (beta-beta-beta-...) of 
htcheck is nearly ready. This program uses a large part of the htdig library 
and technology and stores its data in a MySQL database. It goes without 
saying that this program needs improvements (it is my biggest work so far), 
and if any of you with MySQL experience offers to test it, I can send the 
sources. I need new ideas and advice (who doesn't?).

        Its purpose is to help webmasters maintain the sites they manage. Of 
course, checking websites on a local network is faster; what makes htcheck 
slow is checking "external" URLs. It can tell you everything you can ask 
about a website (a set of HTML documents, or of documents that can be 
parsed; only HTML for now): failed URLs, which URLs link to a given URL and 
which tags and attributes are involved. And if you know SQL, this list can 
grow ...

        In short, htcheck can store every tag present in an HTML document on 
the web (only HTML so far; I'll see about other formats in the future), 
every attribute of every HTML element parsed, and every link they create.
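
To make the storing step concrete, here is a minimal sketch in Python. It is 
not htcheck's code: the message does not describe the real tables, so the 
schema (statement, attribute, link), the LINK_ATTRS set and the 
store_document() helper are all hypothetical, and sqlite3 stands in for 
MySQL only so that the example is self-contained.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical schema: the message does not describe htcheck's real tables.
SCHEMA = """
CREATE TABLE statement (id INTEGER PRIMARY KEY, doc_url TEXT, tag TEXT);
CREATE TABLE attribute (statement_id INTEGER, name TEXT, value TEXT);
CREATE TABLE link (statement_id INTEGER, source_url TEXT, target_url TEXT);
"""

# Attributes treated as link-creating in this sketch (an assumption).
LINK_ATTRS = {"href", "src"}


class StoringParser(HTMLParser):
    """Stores every start tag, each of its attributes, and any link it creates."""

    def __init__(self, db, doc_url):
        super().__init__()
        self.db = db
        self.doc_url = doc_url

    def handle_starttag(self, tag, attrs):
        cur = self.db.execute(
            "INSERT INTO statement (doc_url, tag) VALUES (?, ?)",
            (self.doc_url, tag),
        )
        statement_id = cur.lastrowid
        for name, value in attrs:
            self.db.execute(
                "INSERT INTO attribute VALUES (?, ?, ?)",
                (statement_id, name, value),
            )
            if name in LINK_ATTRS and value:
                # Resolve relative references against the document URL.
                target = urljoin(self.doc_url, value)
                self.db.execute(
                    "INSERT INTO link VALUES (?, ?, ?)",
                    (statement_id, self.doc_url, target),
                )


def store_document(db, doc_url, html_text):
    """Parse one HTML document and store its tags, attributes and links."""
    parser = StoringParser(db, doc_url)
    parser.feed(html_text)
    parser.close()
    db.commit()


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
store_document(
    db,
    "http://example.org/index.html",
    '<html><body><a href="page.html" title="Next">next</a>'
    '<img src="/logo.png"></body></html>',
)
```

Keeping one row per tag, with attributes and links pointing back to it, is 
what would let a later query say which tag and attribute produced a given 
link.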

        As of today, I have developed the "crawling" and storing system, and I 
need to build the interface for querying the resulting database (I can 
choose the name of the DB for every scan operation). This is the easiest 
part, because I only have to write SQL statements. I have two ideas: a 
standalone program and a PHP interface. I am using the latter right now (a 
very, very simple one) to check the results htcheck produces.
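
As an illustration of the kind of SQL such an interface would run, the 
sketch below answers two of the questions mentioned earlier: which URLs 
failed, and which URLs link to a given URL through which tag and attribute. 
Everything in it is assumed for the example: the url and link tables, the 
sample rows, the rule that a status of 400 or more counts as failed, and the 
use of sqlite3 in place of MySQL.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical tables; htcheck's real schema is not given in the message.
db.executescript("""
CREATE TABLE url (url TEXT PRIMARY KEY, status INTEGER);
CREATE TABLE link (source TEXT, target TEXT, tag TEXT, attribute TEXT);
""")
# Made-up sample rows, for illustration only.
db.executemany("INSERT INTO url VALUES (?, ?)", [
    ("http://example.org/", 200),
    ("http://example.org/a.html", 200),
    ("http://example.org/missing.html", 404),
])
db.executemany("INSERT INTO link VALUES (?, ?, ?, ?)", [
    ("http://example.org/", "http://example.org/a.html", "a", "href"),
    ("http://example.org/", "http://example.org/missing.html", "a", "href"),
    ("http://example.org/a.html", "http://example.org/missing.html", "img", "src"),
])


def failed_urls(db):
    """URLs whose HTTP status signals a client or server error."""
    rows = db.execute(
        "SELECT url, status FROM url WHERE status >= 400 ORDER BY url"
    )
    return rows.fetchall()


def links_to(db, target):
    """Pages linking to `target`, with the tag and attribute involved."""
    rows = db.execute(
        "SELECT source, tag, attribute FROM link WHERE target = ? ORDER BY source",
        (target,),
    )
    return rows.fetchall()


failed = failed_urls(db)
referrers = links_to(db, "http://example.org/missing.html")
```

Here failed holds the single 404 row, and referrers lists the two pages that 
point at it, so a webmaster would know exactly where to fix the broken link.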

        The crawling system is very similar to htdig's: it uses HTTP/1.1 
(partially developed by me for htdig) and the same configuration-based 
inclusion/exclusion of URLs. htcheck tries to retrieve each included URL 
and, depending on the content type returned, parses it. "External" URLs can 
optionally be checked for existence (via a configuration attribute).

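The decision described above can be sketched roughly as follows. This is 
only an illustration: the CrawlConfig fields, classify() and should_parse() 
are hypothetical names rather than htcheck or htdig configuration 
attributes, and skipping URLs that match an exclusion pattern is my 
assumption about how exclusion behaves.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlConfig:
    # All three names are hypothetical, not real configuration attributes.
    include_prefixes: list = field(default_factory=list)
    exclude_substrings: list = field(default_factory=list)
    check_external: bool = True


def classify(url, config):
    """Return 'retrieve', 'check' or 'skip' for a URL."""
    included = any(url.startswith(p) for p in config.include_prefixes)
    excluded = any(s in url for s in config.exclude_substrings)
    if included and not excluded:
        return "retrieve"  # fetch it, then parse if the content type allows
    if not included and config.check_external:
        return "check"     # "external" URL: existence check only
    return "skip"


def should_parse(content_type):
    """Only HTML is parsed; the header may carry parameters such as charset."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


config = CrawlConfig(
    include_prefixes=["http://example.org/"],
    exclude_substrings=["/cgi-bin/"],
)
decision = classify("http://example.org/index.html", config)
```

With this configuration, decision is "retrieve", while a URL on another host 
would be classified as "check" and only tested for existence.
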
The last indexing run, on Red Hat Linux 6.0 with an Intel Pentium II 333 MHz, 
gave these results:
- about 18,000 URLs seen
- 12 minutes and 46 seconds
- about 30,000 HTTP/1.1 requests with only 366 connections to 15 servers
- 100,000 links stored
- 102,000 HTML statements and attributes stored
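
To put these figures in perspective, the short calculation below derives a 
few rates from them. The inputs are the rounded numbers from the list above, 
so the results are approximate; the variable names are just for the example.

```python
# Figures from the run above ("about" values, so results are approximate).
urls_seen = 18_000
requests = 30_000
connections = 366
elapsed_seconds = 12 * 60 + 46  # 12 minutes and 46 seconds = 766 s

# How many requests each connection carried on average.
requests_per_connection = requests / connections   # about 82

# Overall throughput of the run.
urls_per_second = urls_seen / elapsed_seconds      # about 23.5
requests_per_second = requests / elapsed_seconds   # about 39

print(f"{requests_per_connection:.0f} requests per connection")
print(f"{urls_per_second:.1f} URLs per second")
print(f"{requests_per_second:.0f} requests per second")
```

Roughly 82 requests per connection on average suggests that most requests 
reused an already open HTTP/1.1 persistent connection instead of opening a 
new one.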

I only need to get it ready for the open source community, but I think 
Geoff, Gilles, Loic and you guys can help me.

Ciao and thanks to all of you
-Gabriele

P.S.: Have a good Easter, in case we don't get the chance to be in touch before then ...

-------------------------------------------------

Gabriele Bartolini
Computer Programmer (are U sure?)
U.O. Rete Civica - Comune di Prato
Prato - Italia - Europa

e-mail: [EMAIL PROTECTED]
http://www.po-net.prato.it

-------------------------------------------------
Zinedine "Zizou" Zidane. Just for soccer lovers.
-------------------------------------------------


[htdig3-dev] Ht://Check