Ciao amici,

        I tested my patch last night (in automatic mode, obviously ;-) and
again this morning, and I have come up with good results.

        The retrieving system now works fine, with persistent_connections
either activated or not. I also tried the head_before_get attribute, and it
seems OK to me. Obviously I am not completely sure, because I haven't
touched that code for a couple of months. That's why I want to wait before
COMMITting the changes. But, as I wrote the HtHTTP and Transport code, I
think that 'logically' it works. And it works wonderfully in my environment
(11 web servers with about 10000 documents: the whole process took 2 hours
and 10 minutes with persistent_connections activated and head_before_get 'on').

        I also modified the Retriever code to show HTTP connection stats at
the end of htdig if the '-s' option has been chosen.

Here are some interesting results:

I ran htdig 3 times in a restricted area of my environment, indexing the
following sites:

htdig: Run complete
htdig: 4 servers seen:
htdig:     balwww.comune.prato.it:80 53 documents
htdig:     search.comune.prato.it:80 1 document
htdig:     sportelloamico.po-net.prato.it:80 3 documents
htdig:     www.po-net.prato.it:80 428 documents


This is the first result, with persistent connections and no HEAD before GET:
HTTP statistics
===============
Persistent connections: Yes
HEAD call before GET: No
 Connections opened        : 92
 Connections closed        : 91
 Changes of server         : 3
 HTTP Requests             : 491
 HTTP KBytes requested     : 2018,69
 HTTP Average request time : 0,01222 secs
 HTTP Average speed        : 336,448 KBytes/secs

Here's the second, with both options activated:

Persistent connections: Yes
HEAD call before GET: Yes
 Connections opened        : 17
 Connections closed        : 16
 Changes of server         : 3
 HTTP Requests             : 909
 HTTP KBytes requested     : 2113,91
 HTTP Average request time : 0,00660066 secs
 HTTP Average speed        : 352,318 KBytes/secs

And here's the test with the traditional way of retrieving (no persistent
connections):

HTTP statistics
===============
Persistent connections: No
 Connections opened        : 489
 Connections closed        : 489
 Changes of server         : 110
 HTTP Requests             : 489
 HTTP KBytes requested     : 2018,65
 HTTP Average request time : 0,0163599 secs
 HTTP Average speed        : 252,331 KBytes/secs


Obviously I can take strong advantage of persistent connections in a
"closed" environment where I have only a few sites to index (and therefore
only a few "server changes", which require connections to be closed and
reopened several times). But it's all CONFIGURABLE, isn't it? So ... no
problem !!!

Let me have your opinion on these results.

And above all, try the patch over the weekend, so that at the beginning of
the week I can either fix it (if something goes wrong - fingers crossed) or
- I HOPE - commit it to the CVS tree.

Then let me know whether you want to add the new configuration attribute
for limiting the number of consecutive requests to the same server when
persistent connections are on (and maybe propose a name).
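If it helps the discussion, here is how the attribute might look in
htdig.conf. The name "server_repeat_connections" is only the placeholder I
used in the patch comments, so take it as a proposal, not a decision:

```
# Hypothetical htdig.conf fragment -- attribute name still to be decided.
persistent_connections:     true
head_before_get:            true
server_repeat_connections:  50    # -1 would mean unlimited
```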

Ah ... I vote +1 for this configuration attribute, and please, Geoff,
suggest a name.

I say Ciao to everyone because I'm stopping work now ... I am getting ready
to dive into a new wonderful weekend !!! Have a nice weekend and don't work
too much !!! And ... I hope Juventus is going to win, maybe with a goal by
Zidane !!! OK, Gab, stop it !!!

Ciao
-Gabriele
Index: Document.h
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdig/Document.h,v
retrieving revision 1.10.2.5
diff -3 -u -p -r1.10.2.5 Document.h
--- Document.h  2000/01/14 01:23:43     1.10.2.5
+++ Document.h  2000/01/21 11:49:41
@@ -74,6 +74,8 @@ public:
     //
     void                       setUsernamePassword(const char *credentials)
                                           { authorization = credentials;}
+
+    HtHTTP *GetHTTPHandler() { return HTTPConnect; }

 private:
     enum
Index: Retriever.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdig/Retriever.cc,v
retrieving revision 1.72.2.15
diff -3 -u -p -r1.72.2.15 Retriever.cc
--- Retriever.cc        2000/01/20 03:55:47     1.72.2.15
+++ Retriever.cc        2000/01/21 11:49:48
@@ -26,6 +26,7 @@
 #include "StringList.h"
 #include "WordType.h"
 #include "Transport.h"
+#include "HtHTTP.h"    // For HTTP statistics

 #include <pwd.h>
 #include <signal.h>
@@ -297,43 +298,83 @@ Retriever::Start()

     while (more && noSignal)
     {
-       more = 0;
+        more = 0;

        //
-       // Go through all the current servers in sequence.  We take only one
-       // URL from each server during this loop.  This ensures that the load
-       // on the servers is distributed evenly.
+       // Go through all the current servers in sequence.
+        // If a server supports persistent connections, we keep popping
+        // from its queue until we reach a maximum number of
+        // consecutive requests (so we will probably have to introduce a
+        // new attribute, like "server_repeat_connections"). Or the loop
+        // may continue indefinitely, if we set the max to -1 (and maybe
+        // the attribute may allow that value too).
+        // If the server doesn't support persistent connections, we take
+        // only one URL from it, then we skip to the next server.
        //
+
+        // Let's position at the beginning
        servers.Start_Get();
+
+        int count;
+
+        // Maximum number of repeated requests with the same
+        // socket connection.
+        int max_repeat_requests;
+
        while ( (server = (Server *)servers.Get_NextElement()) && noSignal)
        {
            if (debug > 1)
-               cout << "pick: " << server->host() << ", # servers = " <<
+               cout << "pick: " << server->host() << ", # servers = " <<
                    servers.Count() << endl;

-           ref = server->pop();
-           if (!ref)
-               continue;                     // Nothing on this server
-           // There may be no more documents, or the server
-           // has passed the server_max_docs limit
+            // Decide the maximum number of consecutive requests; if the
+            // server doesn't support persistent connections, it is 1.

-           //
-           // We have a URL to index, now.  We need to register the
-           // fact that we are not done yet by setting the 'more'
-           // variable.
-           //
-           more = 1;
+            // We already know if a server supports HTTP pers. connections,
+            // because we asked it for the robots.txt file (constructor of
+            // the class).
+
+            if (server->IsPersistentConnectionAllowed())
+                // Once the new attribute is set
+                // max_repeat_requests=config["server_repeat_connections"];
+                max_repeat_requests = -1; // Set to -1 (infinite loop)
+            else
+                max_repeat_requests = 1;
+
+            count = 0;
+
+           while ( ( (max_repeat_requests ==-1) ||
+                          (count < max_repeat_requests) ) &&
+                    (ref = server->pop()) && noSignal)
+            {
+                count ++;
+
+               //
+               // We have a URL to index, now.  We need to register the
+               // fact that we are not done yet by setting the 'more'
+               // variable. So, we have to restart scanning the queue.
+               //
+
+               more = 1;
+
+               //
+               // Deal with the actual URL.
+               // We'll check with the server to see if we need to sleep()
+               // before parsing it.
+               //
+
+               parse_url(*ref);
+                delete ref;
+
+                // No HTTP connections available, so we change server and pause
+               if (max_repeat_requests == 1)
+                    server->delay();   // This will pause if needed
+                                       // and reset the time

-           //
-           // Deal with the actual URL.
-           // We'll check with the server to see if we need to sleep()
-           // before parsing it.
-           //
-           server->delay();   // This will pause if needed and reset the time
-           parse_url(*ref);
-            delete ref;
-       }
+            }
+        }
     }
+
     // if we exited on signal
     if (Retriever_noLog != log && !noSignal)
     {
@@ -1562,5 +1603,28 @@ Retriever::ReportStatistics(const String
        cout << "\n" << name << ": Errors to take note of:\n";
        cout << notFound;
     }
+
+    cout << endl;
+
+    // Report HTTP connections stats
+    cout << "HTTP statistics" << endl;
+    cout << "===============" << endl;
+
+    if (config.Boolean("persistent_connections"))
+    {
+        cout << " Persistent connections    : Yes" << endl;
+
+        if (config.Boolean("head_before_get"))
+            cout << " HEAD call before GET      : Yes" << endl;
+        else
+            cout << " HEAD call before GET      : No" << endl;
+    }
+    else
+    {
+        cout << " Persistent connections    : No" << endl;
+    }
+
+    HtHTTP::ShowStatistics(cout) << endl;
+
 }

Index: Server.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdig/Server.cc,v
retrieving revision 1.17.2.6
diff -3 -u -p -r1.17.2.6 Server.cc
--- Server.cc   1999/12/11 16:19:47     1.17.2.6
+++ Server.cc   2000/01/21 11:49:49
@@ -21,6 +21,7 @@
 #include "Document.h"
 #include "URLRef.h"
 #include "Transport.h"
+#include "HtHTTP.h"    // for checking persistent connections

 #include <ctype.h>

@@ -38,8 +39,10 @@ Server::Server(URL u, String *local_robo
     _port = u.port();
     _bad_server = 0;
     _documents = 0;
-    _persistent_connections = 1;  // Allowed by default

+    // We take it from the configuration
+    _persistent_connections = config.Boolean("persistent_connections");
+
     _max_documents = config.Value("server",_host,"server_max_docs", -1);
     _connection_space = config.Value("server",_host,"server_wait_time", 0);
     _last_connection.SettoNow();  // For getting robots.txt
@@ -78,7 +81,23 @@ Server::Server(URL u, String *local_robo
              }
          }
        else if (!local_urls_only)
+        {
          status = doc.Retrieve(timeZero);
+
+          // Let's check if persistent connections are both
+          // allowed by the configuration and possible after
+          // having requested the robots.txt file.
+
+          HtHTTP *http;
+          if (IsPersistentConnectionAllowed() &&
+                  (http = doc.GetHTTPHandler()))
+          {
+              if (! http->isPersistentConnectionPossible())
+                  _persistent_connections=0;  // not possible. Let's disable
+                                              // them on this server.
+          }
+
+        }
        else
          status = Transport::Document_not_found;



-------------------------------------------------

Gabriele Bartolini
Computer Programmer (are U sure?)
U.O. Rete Civica - Comune di Prato
Prato - Italia - Europa

e-mail: [EMAIL PROTECTED]
http://www.po-net.prato.it

-------------------------------------------------
Zinedine "Zizou" Zidane. Just for soccer lovers.
-------------------------------------------------