This doesn't really work for me because my max doc size is about 5000 bytes.
I'm indexing a LOT of pages with a lot of .confs and don't want to fetch that
much of them.  I just had to replace a 9 gig drive with an 18 gig drive
because I was running out of space all the time.

Shouldn't robots.txt just be handled as a special case and ignore the max doc
size?  It should grab the whole file.
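Just to make that concrete, I'm imagining something like this against the
Server.cc line in Gilles' patch below (untested, and the 64K figure is an
arbitrary cap I picked out of thin air, just meant to be bigger than any
robots.txt you'd plausibly run into):

    String   url = "http://";
    url << host << ':' << port << "/robots.txt";
    // Hypothetical: give robots.txt its own generous fixed cap instead of
    // tying it to max_doc_size, so a small max_doc_size can't truncate it.
    Document doc(url, 64 * 1024);

That way a 5000-byte max_doc_size would still fetch the whole 24664-byte
geocities file.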



> According to Ryan Scott:
> > Ok, I've been working with this, and it seems that the entire robots.txt
> > that geocities uses is not processed.  Their file is very large, btw.
> > 
> > It basically stops right after architext, having not found itself, and
> > then htdig refuses to bring any geocities pages back.  Is there a
> > filesize limit with robots.txt?
> > 
> > Here's much of the output, sorry for the length:
> > 
> > 
> > New server: www.geocities.com, 80
> > Retrieval command for http://www.geocities.com/robots.txt: GET /robots.txt HTTP/1.0
> > User-Agent: htdig/3.1.0b1 ([EMAIL PROTECTED])
> > Host: www.geocities.com
> > 
> > Header line: HTTP/1.1 200 OK
> > Header line: Date: Fri, 15 Jan 1999 22:52:02 GMT
> > Header line: Server: Apache/1.2.6
> > Header line: Last-Modified: Thu, 14 Jan 1999 22:05:26 GMT
> > Translated Thu, 14 Jan 1999 22:05:26 GMT to Thu, 14 Jan 1999 22:05:26 (99)
> > And converted to Thu, 14 Jan 1999 22:05:26
> > Header line: ETag: "c6f2-6058-369e6a26"
> > Header line: Content-Length: 24664
> > Header line: Accept-Ranges: bytes
> > Header line: Connection: close
> > Header line: Content-Type: text/plain
> > Header line: 
> > returnStatus = 0
> > Read 8192 from document
> > Read 8192 from document
> > Read a total of 8192 bytes    (I think this is the problem!  their server
> > says it is 24664 bytes, not 8192)
> > Parsing robots.txt file using myname = htdig
> [snip]
> > It looks to me like you are asking for up to 8 K of robots.txt, which
> > isn't enough for this whole file.
> > 
> > So how can we work around this or fix it?
> 
> There are actually two separate problems here.  First of all,
> htdig/Server.cc has a hard-coded size limit of 10000 bytes for the
> robots.txt file, which should be changed.  Setting it to 0 will make
> the Document constructor use the "max_doc_size" attribute, which
> puts this limit under user control.  Alternatively, you could replace the
> 10000 with whatever hard-coded limit you want, or introduce a new
> configuration attribute, e.g. max_robotstxt_size, and replace the 10000
> with config.Value("max_robotstxt_size") instead.  Personally, I think the
> patch below is adequate, as max_doc_size is almost always going to be
> generous enough to handle the robots.txt file.
> 
> --- ./htdig/Server.cc.robots  Thu Dec 10 20:54:07 1998
> +++ ./htdig/Server.cc Mon Jan 18 12:35:09 1999
> @@ -64,7 +64,7 @@
>      //
>      String   url = "http://";
>      url << host << ':' << port << "/robots.txt";
> -    Document doc(url, 10000);
> +    Document doc(url, 0);
>      switch (doc.RetrieveHTTP(0))
>      {
>   case Document::Document_ok:
> 
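(For what it's worth, the max_robotstxt_size alternative Gilles mentions
would presumably just be a one-line variation on the same patch -- this is my
guess at it, not something I've tried, and you'd still need to register a
default for the new attribute wherever the other defaults are defined:

-    Document doc(url, 10000);
+    Document doc(url, config.Value("max_robotstxt_size"));

That would let people like me keep a small max_doc_size for ordinary pages
while still pulling in a big robots.txt.)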
> Secondly, there's a bug in RetrieveHTTP() and RetrieveLocal(), in how they
> deal with files that are over the size limit.  These functions read in the
> file 8K at a time, and if appending the most recent 8K chunk would take
> the string over the size limit, the whole chunk is tossed out instead of
> being truncated to fit the remaining space.  This patch solves that
> problem.
> 
> --- ./htdig/Document.cc.robots        Wed Jan 13 15:20:50 1999
> +++ ./htdig/Document.cc       Mon Jan 18 12:38:27 1999
> @@ -511,8 +511,10 @@
>   if (debug > 2)
>       cout << "Read " << bytesRead << " from document\n";
>   if (contents.length() + bytesRead > max_doc_size)
> -         break;
> +         bytesRead = max_doc_size - contents.length();
>   contents.append(docBuffer, bytesRead);
> +     if (contents.length() >= max_doc_size)
> +         break;
>      }
>      c.close();
>      document_length = contents.length();
> @@ -657,8 +659,10 @@
>   if (debug > 2)
>       cout << "Read " << bytesRead << " from document\n";
>   if (contents.length() + bytesRead > max_doc_size)
> -         break;
> +         bytesRead = max_doc_size - contents.length();
>   contents.append(docBuffer, bytesRead);
> +     if (contents.length() >= max_doc_size)
> +         break;
>      }
>      fclose(f);
>      document_length = contents.length();
> 
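(To restate the fix in isolation: the loop now clamps the final chunk to
whatever room is left instead of throwing the whole chunk away.  A
stripped-down illustration of the same append logic -- not the actual htdig
code, just the idea:

    #include <string>

    // Append up to max_size bytes total of a document that arrives in
    // fixed-size chunks; the final chunk is truncated rather than dropped.
    static void appendChunk(std::string &contents, const char *buf,
                            size_t bytesRead, size_t max_size)
    {
        if (contents.size() >= max_size)
            return;                                  // already full
        if (contents.size() + bytesRead > max_size)
            bytesRead = max_size - contents.size();  // clamp, don't discard
        contents.append(buf, bytesRead);
    }

With an 8192-byte read buffer and a max_doc_size of 5000, the old code
appended nothing at all; this way you at least keep the first 5000 bytes.)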
> Both patches were to the htdig-3.1.0b5dev-011299 source, but should be
> applicable to the 3.1.0b4 source as well.
> 
> -- 
> Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
> Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
> Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


______________________________________________________________________
Ryan Scott - [EMAIL PROTECTED] - 212 625 1370
PostMaster Direct Response - Targeted 100% OPT IN Email
http://www.postmasterdirect.com/


