According to Jaap de Heer:
> I have a little problem excluding JavaScript from ht://dig
> indexes, being that the noindex_start, noindex_end
> attributes are apparently case sensitive.
> So when i set them to <SCRIPT and </SCRIPT>, pages with
> lowercase tags (<script>, </script>) still get indexed -
> including the JavaScript.
> I guess the solution to this could be to either make the
> noindex attributes case insensitive or allow multiple
> exclusions.. could anyone tell me if such is possible?

This was fixed a couple of weeks ago in the development snapshots,
along with a small bug and some documentation errors.  The patch below,
which I posted at the time, should work for 3.1.1.  The mailing list
archives at http://www.htdig.org/ are a good source of patches for
known bugs and common complaints.
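
For anyone who wants to see what the change amounts to before applying
it, here is a rough standalone sketch of the matching logic.  The names
starts_with_nocase, find_nocase and skip_noindex are made up for this
illustration and are not ht://Dig functions; the real code calls
mystrncasecmp and mystrcasestr, as the patch shows.  What happens after
the end marker is found is my assumption, since the patch context stops
before that branch.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>

// Hypothetical stand-in for the mystrncasecmp() test in the patch:
// true if `text` begins with `prefix`, ignoring ASCII case.
// An empty prefix never matches, mirroring the `*skip_start &&` guard.
bool starts_with_nocase(const char *text, const char *prefix)
{
    assert(text && prefix);
    if (*prefix == '\0')
        return false;
    for (; *prefix; ++text, ++prefix) {
        // A NUL in `text` compares unequal to any prefix character,
        // so this loop never reads past the end of `text`.
        if (std::tolower((unsigned char)*text) !=
            std::tolower((unsigned char)*prefix))
            return false;
    }
    return true;
}

// Hypothetical stand-in for mystrcasestr(): first case-insensitive
// occurrence of `needle` in `haystack`, or nullptr if there is none.
const char *find_nocase(const char *haystack, const char *needle)
{
    assert(haystack && needle);
    if (*needle == '\0')
        return haystack;
    for (; *haystack; ++haystack) {
        if (starts_with_nocase(haystack, needle))
            return haystack;
    }
    return nullptr;
}

// If a noindex section starts at `position`, return a pointer to where
// parsing would resume; otherwise return nullptr.  A missing end marker
// skips the rest of the document (returns the terminating NUL).
// Assumes skip_end is non-empty, and assumes parsing resumes right
// after the end marker.
const char *skip_noindex(const char *position,
                         const char *skip_start,
                         const char *skip_end)
{
    if (!starts_with_nocase(position, skip_start))
        return nullptr;
    const char *q = find_nocase(position, skip_end);
    if (!q)
        return position + std::strlen(position);
    return q + std::strlen(skip_end);
}
```

With noindex_start set to <SCRIPT and noindex_end set to </SCRIPT>, a
call such as skip_noindex(p, "<SCRIPT", "</SCRIPT>") recognises a
lowercase <script ...> tag at p just as it does an uppercase one.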

--- ./htdig/HTML.cc.skipendbug  Wed Mar 17 16:11:52 1999
+++ ./htdig/HTML.cc     Wed Mar 17 17:05:15 1999
@@ -125,9 +125,10 @@
       // Filter out section marked to be ignored for indexing. 
       // This can contain any HTML. 
       //
-      if (strncmp((char *)position, skip_start, strlen(skip_start)) == 0)
+      if (*skip_start &&
+         mystrncasecmp((char *)position, skip_start, strlen(skip_start)) == 0)
        {
-         q = (unsigned char*)strstr((char *)position, skip_end);
+         q = (unsigned char*)mystrcasestr((char *)position, skip_end);
          if (!q)
            *position = '\0';       // Rest of document will be skipped...
          else
--- ./htdoc/attrs.html.skipendbug       Tue Feb 16 23:03:53 1999
+++ ./htdoc/attrs.html  Wed Mar 17 16:21:55 1999
@@ -3433,7 +3433,7 @@
        <dl>
          <dt>
                <strong><a name="noindex_start">noindex_start</a>,
-               <a name="noindex_stop">noindex_stop</a></strong>
+               <a name="noindex_end">noindex_end</a></strong>
          </dt>
          <dd>
                <dl>
@@ -3453,7 +3453,7 @@
                        <em>default:</em>
                  </dt>
                  <dd>
-                       &lt;!--htdig-noindex--&gt; &lt;!--/htdig-noindex--&gt;
+                       &lt;!--htdig_noindex--&gt; &lt;!--/htdig_noindex--&gt;
                  </dd>
                  <dt>
                        <em>description:</em>
@@ -3468,14 +3468,14 @@
                        SCRIPT sections in 'uneditable' documents can be skipped; note how
                        noindex_start does not contain an ending &gt;: this allows for all SCRIPT
                        tags to be matched regardless of attributes defined (different types or
-                       languages).
+                       languages). Note that the match for this string is case insensitive.
                  </dd>
                  <dt>
                        <em>example:</em>
                  </dt>
                  <dd>
                        noindex_start: &lt;SCRIPT<br>
-                       noindex_stop: &lt;/SCRIPT&gt;
+                       noindex_end: &lt;/SCRIPT&gt;
                  </dd>
                </dl>
          </dd>
--- ./htdoc/cf_byname.html.skipendbug   Tue Feb 16 23:03:54 1999
+++ ./htdoc/cf_byname.html      Wed Mar 17 16:22:47 1999
@@ -105,8 +105,8 @@
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#next_page_text">next_page_text</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#no_excerpt_text">no_excerpt_text</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#no_excerpt_show_top">no_excerpt_show_top</a><br>
+         <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_end">noindex_end</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_start">noindex_start</a><br>
-         <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_stop">noindex_stop</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#no_next_page_text">no_next_page_text</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#no_page_list_header">no_page_list_header</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#no_page_number_text">no_page_number_text</a><br>
--- ./htdoc/cf_byprog.html.skipendbug   Tue Feb 16 23:03:54 1999
+++ ./htdoc/cf_byprog.html      Wed Mar 17 16:23:10 1999
@@ -56,8 +56,8 @@
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#meta_description_factor">meta_description_factor</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#minimum_word_length">minimum_word_length</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#modification_time_is_now">modification_time_is_now</a><br>
+         <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_end">noindex_end</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_start">noindex_start</a><br>
-         <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_stop">noindex_stop</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#pdf_parser">pdf_parser</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#remove_default_doc">remove_default_doc</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#robotstxt_name">robotstxt_name</a><br>

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930