[htdig] exclude_urls vs. url_part_aliases

2001-01-15 Thread SMantscheff

I exclude URLs like
exclude_urls: our.server.de/F1 \
our.server.de/F2 \
our.server.de/F3 \
our.server.de/F4 \
our.server.de/F5

Instead, I index a database with URLs like
db.server/db/F1 \
db.server/db/F2 \
db.server/db/F3 \
db.server/db/F4 \
db.server/db/F5 

Then I rewrite URLs with 
url_part_aliases our.server.de/F db.server/db/F

This works. But the results from the DB URLs are not displayed. 
By the number of pages I know that all matching documents from the database 
are found. But no document excerpts are shown. Is this a bug, a feature, or 
am I missing something?

s.m.


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  
FAQ:




Re: [htdig] more solaris problems

2001-01-15 Thread Geoff Hutchison

At 7:13 PM -0500 1/15/01, Ronald Edward Petty wrote:
>I found these articles on the problem, but I'm not sure whether they are
>correct or whether there is some other way to fix this

Obviously the GCC website is a good reference and you are unlikely to 
find better advice elsewhere. I can say that the code does compile on 
Solaris, so I would normally think it's some sort of compiler error, 
though most of these are usually picked up by the configure script.

>I ran as -version and it is GNU, so what is wrong? Any ideas?

No, but someone on the gcc mailing list might have some.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/






Re: [htdig] how do you index local pages in 3.1.5?

2001-01-15 Thread Geoff Hutchison

At 4:02 PM -0800 1/15/01, Jon Beyer wrote:
>This is probably a really easy thing, but I can't get
>htdig to index HTML from my hard drive.

You can't in 3.1.5. It only understands http:// URLs natively.

The current 3.2 development snapshots will index file:// URLs and 
recursively generate directory listings if this is something very 
important to you.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/






[htdig] more solaris problems

2001-01-15 Thread Ronald Edward Petty


c++ -o htfuzzy -L../htlib -L../htcommon -L../db/dist -L/usr/lib Endings.o
EndingsDB.o Exact.o Fuzzy.o Metaphone.o Soundex.o SuffixEntry.o Synonym.o
htfuzzy.o Substring.o Prefix.o ../htcommon/libcommon.a ../htlib/libht.a
../db/dist/libdb.a -lz -lnsl -lsocket
/usr/local/lib/gcc-lib/sparc-sun-solaris2.6/2.95.2/libgcc.a: could not
read symbols: Bad value
collect2: ld returned 1 exit status
make[1]: *** [htfuzzy] Error 1
make[1]: Leaving directory `/export/netapp/user/rpy/htdig-3.1.5/htfuzzy'
make: *** [all] Error 1


I found these articles on the problem, but I'm not sure whether they are
correct or whether there is some other way to fix this:

http://gcc.gnu.org/install/specific.html#sparc-sun-solaris*
http://gcc.gnu.org/fom_serv/cache/16.html
http://gcc.gnu.org/ml/gcc-bugs/2000-03/msg00952.html

I ran as -version and it is GNU, so what is wrong? Any ideas?
Thanks a lot,
ron







[htdig] how do you index local pages in 3.1.5?

2001-01-15 Thread Jon Beyer

This is probably a really easy thing, but I can't get
htdig to index HTML from my hard drive.  I tried
setting start_url to file:/, but that didn't work
and I played around with local_urls_only and
local_urls but couldn't get it to work.  Any advice is
greatly appreciated.  Thanks.





[htdig] PATCH correction: backport ExternalParser.cc from 3.2.0b3 to 3.1.5

2001-01-15 Thread Gilles Detillieux

I discovered some problems with the argument handling in the patch I posted
earlier today.  Please ignore that one and apply this one instead...

According to Elijah Kagan:
> I run htdig 3.1.5.
> I tried both the Debian package and a compiled one with the same result.
> I am absolutely sure there is something stupid I forgot to put into the
> configuration.

OK, after getting to the bottom of this (I think!), I have backported
the 3.2.0b3 development code for htdig/ExternalParser.cc to version
3.1.5, to fix this and other problems.  Please give this patch file
a try and let me know if it works.  You will probably get a warning
about the wait() function being implicitly declared, unless you manually
define HAVE_WAIT_H or HAVE_SYS_WAIT_H (depending on whether your system
has <wait.h> or <sys/wait.h>).  Also, if your system has the mkstemp()
function, you may want to define HAVE_MKSTEMP manually as well, as this
will enhance security.  I didn't have time to figure out how to patch
aclocal.m4 and configure to add tests for all of these.

The patch fixes the following problems in external_parsers support in
3.1.5:
  - it got confused by "; charset=..." in the Content-Type header,
    as described in "http://www.htdig.org/mail/2000/09/index.html#75".
  - security problems with using popen(), and therefore the shell,
    to parse URL and content-type strings from untrusted sources
    (now uses pipe/fork/exec instead of popen) - PR#542, PR#951.
  - used predictable temporary file name, which could be exploited
    via symlinks - fixed if mkstemp() exists & HAVE_MKSTEMP is defined.
  - binary output from an external converter could get mangled.
  - error messages were sometimes ambiguous or missing altogether.
  - didn't open temporary file in binary mode for non-Unix systems
    (attempts were made to fix this, but it's not clear yet whether
    the security fixes and pipe/fork/exec will port well to Cygwin).
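
The pipe/fork/exec approach mentioned in the list above can be sketched
roughly as follows.  This is a minimal POSIX illustration, not the code
from the patch; the function name runWithoutShell is hypothetical.  The
arguments reach the program as a vector, so no shell ever interprets
characters taken from a URL or content-type string.

```cpp
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not part of ht://Dig): run args[0] with the given
// arguments, without a shell, and collect what it writes to stdout.
// Returns true only if the program ran and exited with status 0.
bool runWithoutShell(const std::vector<std::string> &args, std::string &output)
{
    if (args.empty())
        return false;

    int fds[2];
    if (pipe(fds) == -1)
        return false;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }

    if (pid == 0) {
        // Child: send stdout into the pipe, then replace the process image.
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);

        std::vector<char *> argv;
        for (const std::string &a : args)
            argv.push_back(const_cast<char *>(a.c_str()));
        argv.push_back(nullptr);

        execvp(argv[0], argv.data());
        _exit(127);  // only reached if exec failed
    }

    // Parent: read until the child closes its end of the pipe.
    close(fds[1]);
    output.clear();
    char buf[4096];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        output.append(buf, static_cast<std::string::size_type>(n));
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) == -1)
        return false;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Called with {"echo", "a;b $HOME"}, the sketch passes the second string to
echo untouched; with popen() the shell would have split it at the ';' and
expanded $HOME.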

Here's the patch, which you can apply in the main source directory for
htdig-3.1.5 using "patch -p0 < this-file":

--- htdig/ExternalParser.cc.orig Thu Feb 24 20:29:10 2000
+++ htdig/ExternalParser.cc Mon Jan 15 17:16:50 2001
@@ -1,14 +1,24 @@
 //
 // ExternalParser.cc
 //
-// Implementation of ExternalParser
-// Allows external programs to parse unknown document formats.
-// The parser is expected to return the document in a specific format.
-// The format is documented in http://www.htdig.org/attrs.html#external_parser
+// ExternalParser: Implementation of ExternalParser
+// Allows external programs to parse unknown document formats.
+// The parser is expected to return the document in a 
+// specific format. The format is documented 
+// in http://www.htdig.org/attrs.html#external_parser
 //
-#if RELEASE
-static char RCSid[] = "$Id: ExternalParser.cc,v 1.9.2.3 1999/11/24 02:14:09 grdetil Exp $";
-#endif
+// Part of the ht://Dig package   
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+// 
+//
+// $Id: ExternalParser.cc,v 1.9.2.4 2001/01/15 17:16:50 grdetil Exp $
+//
+
+#ifdef HAVE_CONFIG_H
+#include "htconfig.h"
+#endif /* HAVE_CONFIG_H */
 
 #include "ExternalParser.h"
 #include "HTML.h"
@@ -19,9 +29,18 @@ static char RCSid[] = "$Id: ExternalPars
 #include "QuotedStringList.h"
 #include "URL.h"
 #include "Dictionary.h"
+#include "good_strtok.h"
+
 #include 
 #include 
-#include "good_strtok.h"
+#include 
+#include 
+#include 
+#ifdef HAVE_WAIT_H
+#include <wait.h>
+#elif HAVE_SYS_WAIT_H
+#include <sys/wait.h>
+#endif
 
 static Dictionary  *parsers = 0;
 static Dictionary  *toTypes = 0;
@@ -32,9 +51,18 @@ extern String configFile;
 //
 ExternalParser::ExternalParser(char *contentType)
 {
+  String mime;
+  int sep;
+
 if (canParse(contentType))
 {
-   currentParser = ((String *)parsers->Find(contentType))->get();
+String mime = contentType;
+   mime.lowercase();
+   sep = mime.indexOf(';');
+   if (sep != -1)
+ mime = mime.sub(0, sep).get();
+   
+   currentParser = ((String *)parsers->Find(mime))->get();
 }
 ExternalParser::contentType = contentType;
 }
@@ -89,6 +117,8 @@ ExternalParser::readLine(FILE *in, Strin
 int
 ExternalParser::canParse(char *contentType)
 {
+  int  sep;
+
 if (!parsers)
 {
parsers = new Dictionary();
@@ -97,7 +127,6 @@ ExternalParser::canParse(char *contentTy
 QuotedStringList qsl(config["external_parsers"], " \t");
String  from, to;
int i;
-   int sep;
 
for (i = 0; qsl[i]; i += 2)
{
@@ -109,11 +138,22 @@ ExternalParser::canParse(char *contentTy
to = from.sub(sep+2).get();
from = from.sub(0, sep).get();
}
+  

Re: [htdig] make error on solaris 2.6

2001-01-15 Thread Ronald Edward Petty

I understood you about the web root thing...  I have a really funky way of
doing things here at work, so that's why it is like this.  We test like
that and then move it to a REAL config once we prove it works.  That's
what I'm trying to do.  Let me see if the space is the problem and I'll
reply so you know... thanks for the speedy help.

ron

> the DocumentRoot, so that the installed image files can be accessed by
> web clients.
>
> E.g., on my system, IMAGE_DIR is set to /home/httpd/html/htdig, and
> my Apache configuration sets DocumentRoot to /home/httpd/html, so my
> IMAGE_URL_PREFIX is simply "/htdig".
>







htdig@htdig.org

2001-01-15 Thread AddieADOBY

[EMAIL PROTECTED]







htdig@htdig.org

2001-01-15 Thread AddieADOBY

Unsubscribe me- I tried several times I still get mail, the same things over 
and over
[EMAIL PROTECTED]







Re: [htdig] NEED HELP with indexing

2001-01-15 Thread Gilles Detillieux

On Mon, 15 Jan 2001, George Roberts wrote:
> > I'm completely new to this software, but inherited a large site which
> > uses it.  I made a simple change to some javascript on one of the
> > indexed pages, and I have NO CLUE how to reindex the whole site.  Could
> > someone please help?

According to Geoff Hutchison:
> It depends a lot on how the original person installed it and your system.
> But usually there's a program "rundig" that creates the databases. Many
> people just hack this script to fit local needs, others create local
> versions (e.g. mine is "rundig.sh"). Of course the best thing to do is to
> write a version that can be run through the cron program which ensures the
> indexes are updated on a regular basis automatically. But I digress.
> 
> So first, I'd suggest finding the directory containing the databases, e.g.
> 
> locate db.wordlist
> 
> Next, make a backup of the files in there. Then see if you can find the
> rundig script. If so, look for any evidence of a local version with a
> possibly newer date. If the rundig script you have mentions "alt"
> somewhere in it, try running "rundig -a" which will update the databases
> using alternate .work files.
> 
> That should get you started in the right direction.

But bear in mind that htdig does not index JavaScript, so your
changes to the JavaScript on one of the indexed pages may not
have any effect at all on searches even after you reindex.
See http://www.htdig.org/FAQ.html#q5.18

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930






Re: [htdig] NEED HELP with indexing

2001-01-15 Thread Geoff Hutchison

On Mon, 15 Jan 2001, George Roberts wrote:

> indexed pages, and I have NO CLUE how to reindex the whole site.  Could
> someone please help?

It depends a lot on how the original person installed it and your system.
But usually there's a program "rundig" that creates the databases. Many
people just hack this script to fit local needs, others create local
versions (e.g. mine is "rundig.sh"). Of course the best thing to do is to
write a version that can be run through the cron program which ensures the
indexes are updated on a regular basis automatically. But I digress.

So first, I'd suggest finding the directory containing the databases, e.g.

locate db.wordlist

Next, make a backup of the files in there. Then see if you can find the
rundig script. If so, look for any evidence of a local version with a
possibly newer date. If the rundig script you have mentions "alt"
somewhere in it, try running "rundig -a" which will update the databases
using alternate .work files.

That should get you started in the right direction.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/







Re: [htdig] make error on solaris 2.6

2001-01-15 Thread Gilles Detillieux

According to Ronald Edward Petty:
> When I was doing make I got this error for DocumentDB.cc and I did a work
> around doing this, but then I type make again and it gets past
> DocumentDB.cc and does this for the next file... Is there something wrong
> with my shell or something...  I don't feel like typing
> #!/usr/bin/tcsh
> 
> setenv BIN_DIR /export/netapp/user/rpy/htdig/bin
> setenv DCOMMON_DIR "/export/netapp/user/rpy/htdig/common"
> setenv DCONFIG_DIR "/export/netapp/user/rpy/htdig/conf"
> setenv DATABASE_DIR "/export/netapp/user/rpy/htdig/db"
> setenv IMAGE_URL_PREFIX "/export/netapp/user/rpy/htdig/images"
> setenv PDF_PARSER "/usr/local/bin/acroread"
> setenv SORT_PROG "/bin/sort"
> setenv DEFAULT_CONFIG_FILE "/export/netapp/user/rpy/htdig/conf/htdig.conf"
> 
> 
> c++ -c -DBIN_DIR -DCOMMON_DIR -DCONFIG_DIR -DDATABASE_DIR
> -DIMAGE_URL_PREFIX -DPDF_PARSER -DSORT_PROG -DDEFAULT_CONFIG_FILE
> -I../htlib -I../htcommon -I../db/dist -I../include -g -O2 DocumentDB.cc
> 
> 
> -
> Any idea why this top thing worked but the other doesn't
> -
> 
> 
> ares:/export/netapp/user/rpy/htdig-3.1.5/> make
> make[1]: Entering directory `/export/netapp/user/rpy/htdig-3.1.5/db/dist'
> make[1]: Nothing to be done for `all'.
> make[1]: Leaving directory `/export/netapp/user/rpy/htdig-3.1.5/db/dist'
> make[1]: Entering directory `/export/netapp/user/rpy/htdig-3.1.5/htlib'
> make[1]: Nothing to be done for `all'.
> make[1]: Leaving directory `/export/netapp/user/rpy/htdig-3.1.5/htlib'
> make[1]: Entering directory `/export/netapp/user/rpy/htdig-3.1.5/htcommon'
> c++ -c -DBIN_DIR=\"/export/netapp/user/rpy/htdig/bin\"
> -DCOMMON_DIR=\"/export/netapp/user/rpy/htdig/common\"
> -DCONFIG_DIR=\"/export/netapp/user/rpy/htdig/conf\"
> -DDATABASE_DIR=\"/export/netapp/user/rpy/htdig/db\"
> -DIMAGE_URL_PREFIX=\"/export/netapp/user/rpy/htdig/images \"

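I think the problem is right here, just before the closing \" on the line
above.  There seems to be a space (or maybe a control character) at the end
of your definition for IMAGE_URL_PREFIX, which is messing things up: the
shell splits the option at that space, which matches the compiler's
complaint below about a file named " and an unterminated string constant.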

> -DPDF_PARSER=\"/usr/local/bin/acroread\" -DSORT_PROG=\"/bin/sort\"
> -DDEFAULT_CONFIG_FILE=\"/export/netapp/user/rpy/htdig/conf/htdig.conf\"
> -I../htlib -I../htcommon -I../db/dist -I../include -g -O2 DocumentRef.cc
> c++: ": No such file or directory
> DocumentRef.cc:0: unterminated string or character constant
> DocumentRef.cc:0: possible real start of unterminated constant
> make[1]: *** [DocumentRef.o] Error 1
> make[1]: Leaving directory `/export/netapp/user/rpy/htdig-3.1.5/htcommon'
> make: *** [all] Error 1
> ares:/export/netapp/user/rpy/htdig-3.1.5/>

By the way, I think you may be misunderstanding what the IMAGE_URL_PREFIX
is supposed to be.  It's supposed to be an URL path, relative to the
DocumentRoot of your web server, not relative to your system's root
directory.  It's the IMAGE_DIR that is relative to the system's root
directory, but it must point to a directory that will be somewhere under
the DocumentRoot, so that the installed image files can be accessed by
web clients.

E.g., on my system, IMAGE_DIR is set to /home/httpd/html/htdig, and
my Apache configuration sets DocumentRoot to /home/httpd/html, so my
IMAGE_URL_PREFIX is simply "/htdig".
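
To make the relationship concrete, here is a small C++ sketch that derives
the URL prefix by stripping the DocumentRoot from the front of IMAGE_DIR.
The function urlPrefixFor is hypothetical and only illustrates the rule;
you still set both values yourself when building ht://Dig.

```cpp
#include <string>

// Hypothetical helper (not part of ht://Dig): given the filesystem
// directory where the images are installed and the web server's
// DocumentRoot, return the URL path under which web clients can reach
// that directory.  Returns an empty string if imageDir does not lie
// under documentRoot.
std::string urlPrefixFor(const std::string &imageDir, std::string documentRoot)
{
    // Ignore trailing slashes on the DocumentRoot.
    while (!documentRoot.empty() && documentRoot.back() == '/')
        documentRoot.pop_back();

    // imageDir must start with the DocumentRoot.
    if (imageDir.compare(0, documentRoot.size(), documentRoot) != 0)
        return "";

    std::string rest = imageDir.substr(documentRoot.size());

    // Reject partial matches such as /home/httpd/html2 against
    // /home/httpd/html.
    if (!rest.empty() && rest[0] != '/')
        return "";

    return rest.empty() ? "/" : rest;
}
```

With the values from my system, urlPrefixFor("/home/httpd/html/htdig",
"/home/httpd/html") gives "/htdig".  A directory outside the DocumentRoot
gives an empty result, since web clients could not reach it.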

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930






[htdig] make error on solaris 2.6

2001-01-15 Thread Ronald Edward Petty

When I was doing make I got this error for DocumentDB.cc and I did a work
around doing this, but then I type make again and it gets past
DocumentDB.cc and does this for the next file... Is there something wrong
with my shell or something...  I don't feel like typing
#!/usr/bin/tcsh

setenv BIN_DIR /export/netapp/user/rpy/htdig/bin
setenv DCOMMON_DIR "/export/netapp/user/rpy/htdig/common"
setenv DCONFIG_DIR "/export/netapp/user/rpy/htdig/conf"
setenv DATABASE_DIR "/export/netapp/user/rpy/htdig/db"
setenv IMAGE_URL_PREFIX "/export/netapp/user/rpy/htdig/images"
setenv PDF_PARSER "/usr/local/bin/acroread"
setenv SORT_PROG "/bin/sort"
setenv DEFAULT_CONFIG_FILE "/export/netapp/user/rpy/htdig/conf/htdig.conf"


c++ -c -DBIN_DIR -DCOMMON_DIR -DCONFIG_DIR -DDATABASE_DIR
-DIMAGE_URL_PREFIX -DPDF_PARSER -DSORT_PROG -DDEFAULT_CONFIG_FILE
-I../htlib -I../htcommon -I../db/dist -I../include -g -O2 DocumentDB.cc


-
Any idea why this top thing worked but the other doesn't
-


ares:/export/netapp/user/rpy/htdig-3.1.5/> make
make[1]: Entering directory `/export/netapp/user/rpy/htdig-3.1.5/db/dist'
make[1]: Nothing to be done for `all'.
make[1]: Leaving directory `/export/netapp/user/rpy/htdig-3.1.5/db/dist'
make[1]: Entering directory `/export/netapp/user/rpy/htdig-3.1.5/htlib'
make[1]: Nothing to be done for `all'.
make[1]: Leaving directory `/export/netapp/user/rpy/htdig-3.1.5/htlib'
make[1]: Entering directory `/export/netapp/user/rpy/htdig-3.1.5/htcommon'
c++ -c -DBIN_DIR=\"/export/netapp/user/rpy/htdig/bin\"
-DCOMMON_DIR=\"/export/netapp/user/rpy/htdig/common\"
-DCONFIG_DIR=\"/export/netapp/user/rpy/htdig/conf\"
-DDATABASE_DIR=\"/export/netapp/user/rpy/htdig/db\"
-DIMAGE_URL_PREFIX=\"/export/netapp/user/rpy/htdig/images \"
-DPDF_PARSER=\"/usr/local/bin/acroread\" -DSORT_PROG=\"/bin/sort\"
-DDEFAULT_CONFIG_FILE=\"/export/netapp/user/rpy/htdig/conf/htdig.conf\"
-I../htlib -I../htcommon -I../db/dist -I../include -g -O2 DocumentRef.cc
c++: ": No such file or directory
DocumentRef.cc:0: unterminated string or character constant
DocumentRef.cc:0: possible real start of unterminated constant
make[1]: *** [DocumentRef.o] Error 1
make[1]: Leaving directory `/export/netapp/user/rpy/htdig-3.1.5/htcommon'
make: *** [all] Error 1
ares:/export/netapp/user/rpy/htdig-3.1.5/>










[htdig] PATCH: backport ExternalParser.cc from 3.2.0b3 to 3.1.5

2001-01-15 Thread Gilles Detillieux

According to Elijah Kagan:
> I run htdig 3.1.5.
> I tried both the Debian package and a compiled one with the same result.
> I am absolutely sure there is something stupid I forgot to put into the
> configuration.

OK, after getting to the bottom of this (I think!), I have backported
the 3.2.0b3 development code for htdig/ExternalParser.cc to version
3.1.5, to fix this and other problems.  Please give this patch file
a try and let me know if it works.  You will probably get a warning
about the wait() function being implicitly declared, unless you manually
define HAVE_WAIT_H or HAVE_SYS_WAIT_H (depending on whether your system
has <wait.h> or <sys/wait.h>).  Also, if your system has the mkstemp()
function, you may want to define HAVE_MKSTEMP manually as well, as this
will enhance security.  I didn't have time to figure out how to patch
aclocal.m4 and configure to add tests for all of these.

The patch fixes the following problems in external_parsers support in
3.1.5:
  - it got confused by "; charset=..." in the Content-Type header,
    as described in "http://www.htdig.org/mail/2000/09/index.html#75".
  - security problems with using popen(), and therefore the shell,
    to parse URL and content-type strings from untrusted sources
    (now uses pipe/fork/exec instead of popen) - PR#542, PR#951.
  - used predictable temporary file name, which could be exploited
    via symlinks - fixed if mkstemp() exists & HAVE_MKSTEMP is defined.
  - binary output from an external converter could get mangled.
  - error messages were sometimes ambiguous or missing altogether.
  - didn't open temporary file in binary mode for non-Unix systems
    (attempts were made to fix this, but it's not clear yet whether
    the security fixes and pipe/fork/exec will port well to Cygwin).
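
The temporary-file item in the list above comes down to letting the system
pick an unpredictable name and create the file atomically.  A minimal
sketch using POSIX mkstemp() follows; the function name makeTempFile and
the "/tmp/htdext.XXXXXX" template are assumptions made for illustration,
not what the patch uses.

```cpp
#include <stdlib.h>
#include <string>
#include <vector>
#include <unistd.h>

// Hypothetical helper (not part of ht://Dig): create a uniquely named
// temporary file and return an open file descriptor, or -1 on failure.
// The chosen name is stored in `path`.  mkstemp() creates the file with
// O_EXCL, so it never reuses a file or symlink planted in advance.
int makeTempFile(std::string &path)
{
    // The template must end in six 'X' characters and must be writable,
    // because mkstemp() overwrites them in place.
    const std::string tmpl = "/tmp/htdext.XXXXXX";  // illustrative name
    std::vector<char> buf(tmpl.begin(), tmpl.end());
    buf.push_back('\0');

    int fd = mkstemp(buf.data());
    if (fd != -1)
        path.assign(buf.data());
    return fd;
}
```

The caller is responsible for closing the descriptor and unlinking the
file once the external program has finished with it.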

Here's the patch, which you can apply in the main source directory for
htdig-3.1.5 using "patch -p0 < this-file":

--- htdig/ExternalParser.cc.orig Thu Feb 24 20:29:10 2000
+++ htdig/ExternalParser.cc Mon Jan 15 13:18:47 2001
@@ -1,14 +1,24 @@
 //
 // ExternalParser.cc
 //
-// Implementation of ExternalParser
-// Allows external programs to parse unknown document formats.
-// The parser is expected to return the document in a specific format.
-// The format is documented in http://www.htdig.org/attrs.html#external_parser
+// ExternalParser: Implementation of ExternalParser
+// Allows external programs to parse unknown document formats.
+// The parser is expected to return the document in a 
+// specific format. The format is documented 
+// in http://www.htdig.org/attrs.html#external_parser
 //
-#if RELEASE
-static char RCSid[] = "$Id: ExternalParser.cc,v 1.9.2.3 1999/11/24 02:14:09 grdetil Exp $";
-#endif
+// Part of the ht://Dig package   
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+// 
+//
+// $Id: ExternalParser.cc,v 1.9.2.4 2001/01/15 13:18:47 grdetil Exp $
+//
+
+#ifdef HAVE_CONFIG_H
+#include "htconfig.h"
+#endif /* HAVE_CONFIG_H */
 
 #include "ExternalParser.h"
 #include "HTML.h"
@@ -19,9 +29,18 @@ static char RCSid[] = "$Id: ExternalPars
 #include "QuotedStringList.h"
 #include "URL.h"
 #include "Dictionary.h"
+#include "good_strtok.h"
+
 #include 
 #include 
-#include "good_strtok.h"
+#include 
+#include 
+#include 
+#ifdef HAVE_WAIT_H
+#include <wait.h>
+#elif HAVE_SYS_WAIT_H
+#include <sys/wait.h>
+#endif
 
 static Dictionary  *parsers = 0;
 static Dictionary  *toTypes = 0;
@@ -32,9 +51,18 @@ extern String configFile;
 //
 ExternalParser::ExternalParser(char *contentType)
 {
+  String mime;
+  int sep;
+
 if (canParse(contentType))
 {
-   currentParser = ((String *)parsers->Find(contentType))->get();
+String mime = contentType;
+   mime.lowercase();
+   sep = mime.indexOf(';');
+   if (sep != -1)
+ mime = mime.sub(0, sep).get();
+   
+   currentParser = ((String *)parsers->Find(mime))->get();
 }
 ExternalParser::contentType = contentType;
 }
@@ -89,6 +117,8 @@ ExternalParser::readLine(FILE *in, Strin
 int
 ExternalParser::canParse(char *contentType)
 {
+  int  sep;
+
 if (!parsers)
 {
parsers = new Dictionary();
@@ -97,7 +127,6 @@ ExternalParser::canParse(char *contentTy
 QuotedStringList qsl(config["external_parsers"], " \t");
String  from, to;
int i;
-   int sep;
 
for (i = 0; qsl[i]; i += 2)
{
@@ -109,11 +138,22 @@ ExternalParser::canParse(char *contentTy
to = from.sub(sep+2).get();
from = from.sub(0, sep).get();
}
+   from.lowercase();
+   sep = from.indexOf(';');
+   if (sep != -1)
+ from = from.sub(0, sep).get();
+

Re: [htdig] htdig ignores *.doc file extension

2001-01-15 Thread Geoff Hutchison

On Mon, 15 Jan 2001, Evelio Martinez wrote:

> Now I do not understand anything.  I ran the same command, but
> with "-u user:password", and now htdig finds all the .doc and
> .pdf files and creates the links.
[snip]
> Does this have something to do with the Apache configuration?

Yes. If you need the -u flag or the authorization attribute, then you have
a password-protected site. If you don't supply the password, then htdig
will index very little.

> The next problem to solve will be the character set.

Have you taken a look at the FAQ? For example,


> but it ignores the Windows 2000 server.

Are these also password-protected? If so, are they using the "Basic"
authentication scheme?

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/







[htdig] NEED HELP with indexing

2001-01-15 Thread George Roberts

Hi-

I'm completely new to this software, but inherited a large site which
uses it.  I made a simple change to some javascript on one of the
indexed pages, and I have NO CLUE how to reindex the whole site.  Could
someone please help?

Thanks







Re: [htdig] htdig ignores *.doc file extension

2001-01-15 Thread Evelio Martinez

Geoff Hutchison wrote:

> At 12:15 PM +0100 1/15/01, Evelio Martinez wrote:
> >I have run  bin/htdig -i -vvv -s  | tee /tmp/ht   and   the 3 .doc
> >and 2  .pdf files that are under /home/httpd/html does not have any
> >reference in the debug file /tmp/ht.
>
> No, this is not normal. So you're saying when htdig hits a document
> linking to these .doc or .pdf files, it doesn't list the link?

Right

> Or do
> you not have any documents linking to these files?
>

Now I do not understand anything.  I ran the same command, but
with "-u user:password", and now
htdig finds all the .doc and .pdf files and creates the links.

Also, it seems that with -u it behaves recursively, but without it I have
to list every directory.

Does this have something to do with the Apache configuration?


The next problem to solve will be the character set.

I see   tÊcnicos  instead of   técnicos
        aÓade     instead of   añade
        ...
        etc.


By the way, htdig is supposed to index any web server, isn't it?
We have 3 Linux servers and 1 Windows 2000 server. There is no problem
with the Linux servers,
but it ignores the Windows 2000 server.

Any idea?

--
Evelio Martínez
Testanet. Dept. desarrollo software.
Av. Reino de Valencia, 15 - 5
46005 Valencia (Spain)
Tel: +34 96 395 90 00
Fax: +34 96 316 23 19








Re: [htdig] Problems compiling 3.20b2

2001-01-15 Thread Gilles Detillieux

According to Richard van Drimmelen:
> I'm trying to compile 3.2.0b2 on a Sparc Solaris 7 machine with gcc
> 2.95.2
> 
> During 'make':
> 
> ld: warning: symbol `Object type_info node' has differing alignments:
> (file Endings.o value=0x8; file ../htlib/libht.a(StringMatch.o)
> value=0x4);
> largest value applied
> Undefined                       first referenced
>  symbol                             in file
> __eh_pc                             Endings.o
> ld: fatal: Symbol referencing errors. No output written to htfuzzy
> collect2: ld returned 1 exit status
> make[1]: *** [htfuzzy] Error 1
> 
> Any suggestions ?

I can't say for sure that the next beta will solve this problem, but could
you please try the latest development snapshot of it to see if it does?
The 3.2.0b2 beta has a number of known bugs and many compilation problems
that are fixed in the upcoming 3.2.0b3 beta.  You can try the latest
development snapshot of it at...

   http://www.htdig.org/files/snapshots/htdig-3.2.0b3-011401.tar.gz

In either case, let us know whether or not it solves this problem, so we
can know if it still needs fixing before releasing it.

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930






Re: [htdig] Phrases

2001-01-15 Thread Gilles Detillieux

According to Bill Vick:
> We have tried both the current and beta versions and
> are having problems getting the phrase search to work
> correctly and consistently. Any patches or should we
> hang tight for the next version?

What do you mean by current?  If you mean the current stable release,
3.1.5, it does not support phrase searching, as explained in FAQ 1.9.
The 3.2.0b2 beta, the last one released, has a number of known bugs.
The upcoming 3.2.0b3 beta should be much more reliable than the last
beta.  You can wait for it, or you can try the latest development
snapshot of it...

   http://www.htdig.org/files/snapshots/htdig-3.2.0b3-011401.tar.gz

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930






Re: [htdig] Problem with PDF files....

2001-01-15 Thread Gilles Detillieux

According to Elijah Kagan:
> I run htdig 3.1.5.
> I tried both the Debian package and a compiled one with the same result.
> I am absolutely sure there is something stupid I forgot to put into the
> configuration.
> 
> Attached is the config file.
> 
> Thanks for your help.
> 
> Elijah
> 
> 
> On Fri, 12 Jan 2001, Gilles Detillieux wrote:
> 
> > According to Elijah Kagan:
> > > 1. I run htdig with an explicit -c option, so it uses the correct conf
> > > file.
> > > 2. I rewrote the external_parsers so it includes only one line...
> > > 3. ..and it is the first line in the file
> > > 
> > > Results are the same! It is still looking for an acroread!
> > > 
> > > Please, help. I am getting desperate...
> > 
> > Hmm.  You're sure you're running version 3.1.5 of htdig, and you
> > don't have a pre-3.1.4 binary of htdig kicking around that you might be
> > unknowingly running instead?  External converter support was added to the
> > external_parsers attribute only in version 3.1.4 and above.  If you're
> > sure this isn't the problem either, please send me a copy of your conf
> > file as it stands now (preferably uuencoded right on your htdig box to
> > prevent e-mail mangling of it), and I'll have a look and try a test or two.
> > 
> > Oh, another thing.  You mentioned this was on a Debian system.  Did you
> > compile htdig yourself, or did you use a pre-compiled binary?  If the
> > latter, which one?

OK, it took a while, but the light finally came on!  If you look up the
following thread on the mailing list archives:

http://www.htdig.org/mail/2000/09/index.html#75

you'll see that the bug has come up before.  I think there's something
about the Debian configuration for Apache that causes it to add the
"; charset=..." string to the Content-Type header, which is the source
of the problem here.  At least I strongly suspect it must be the same
problem, as I can't see anything else that would explain the behaviour
you're reporting.  If you run htdig -vvv -i -c ..., you can then look
at the header lines returned by your server for the PDF files, and see
if the Content-Type header does indeed have something on the line after
the application/pdf string.
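
If reading the -vvv output by eye is tedious, the same check can be sketched in a few lines of Python. This is only an illustration, not part of ht://Dig: the helper names split_content_type and fetch_content_type are made up here, and the URL you pass in would be one of your own PDF files.

```python
from urllib.request import Request, urlopen


def split_content_type(value):
    """Split a Content-Type header value into (media_type, params).

    "application/pdf; charset=iso-8859-1" becomes
    ("application/pdf", {"charset": "iso-8859-1"}).
    """
    parts = [part.strip() for part in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, _, val = part.partition("=")
            params[key.strip().lower()] = val.strip().strip('"')
    return media_type, params


def fetch_content_type(url):
    """Send a HEAD request and return the raw Content-Type header ("" if absent)."""
    with urlopen(Request(url, method="HEAD")) as response:
        return response.headers.get("Content-Type", "")
```

Calling split_content_type(fetch_content_type(url)) on one of the PDF URLs and getting a non-empty parameter dictionary back would suggest the server is appending something like a charset to the header.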

Geoff and I made some hacks to ExternalParser.cc in the 3.2.0b3
development code to address this, but none of this has been backported
to 3.1.5 yet.  I'll see if I can backport some or all of the external
parser patches to 3.1.5 in the next day or two.  In the meantime,
you can try working around this either by using local_urls, if you're
running htdig on the same machine as your Apache server, or by using
the same hack that Klaus used, i.e. add a line like the following to
your external_parsers definition.

"application/pdf; charset=iso-8859-1->text/html" /usr/share/htdig/conv_doc.pl
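
To see why the extra entry helps, here is a rough Python sketch of the lookup. It is not the actual ExternalParser.cc logic. It simply assumes the parser table is keyed on the full Content-Type string, which is what the symptoms suggest, and the function names are invented for the illustration.

```python
def build_table(pairs):
    """Turn (spec, program) pairs into a lookup table.

    A spec is either "type" or "from->to", as in external_parsers.
    """
    table = {}
    for spec, program in pairs:
        source, _, target = spec.partition("->")
        table[source] = (target or None, program)
    return table


def find_exact(table, header_value):
    """Look up the header value exactly as the server sent it (assumed behaviour)."""
    return table.get(header_value)


def find_normalized(table, header_value):
    """Drop any '; charset=...' parameters before looking up the type."""
    return table.get(header_value.split(";", 1)[0].strip())


CONVERTER = "/usr/share/htdig/conv_doc.pl"
header = "application/pdf; charset=iso-8859-1"

plain = build_table([("application/pdf->text/html", CONVERTER)])
workaround = build_table([
    ("application/pdf->text/html", CONVERTER),
    ("application/pdf; charset=iso-8859-1->text/html", CONVERTER),
])

print(find_exact(plain, header))        # no entry matches the decorated header
print(find_exact(workaround, header))   # the extra entry matches
print(find_normalized(plain, header))   # stripping parameters also matches
```

The workaround entry corresponds to the second table. Stripping the parameters before the lookup is one way a proper fix could go, but that part is a guess at the approach, not a description of the patch.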

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930






[htdig] Phrases

2001-01-15 Thread Bill Vick

We have tried both the current and beta versions and
are having problems getting the phrase search to work
correctly and consistently. Any patches or should we
hang tight for the next version?







[htdig] Problems compiling 3.20b2

2001-01-15 Thread Richard van Drimmelen

I'm trying to compile 3.2.0b2 on a SPARC Solaris 7 machine with gcc
2.95.2.

During 'make':

ld: warning: symbol `Object type_info node' has differing alignments:
        (file Endings.o value=0x8; file ../htlib/libht.a(StringMatch.o) value=0x4);
        largest value applied
Undefined                       first referenced
 symbol                             in file
__eh_pc                             Endings.o
ld: fatal: Symbol referencing errors. No output written to htfuzzy
collect2: ld returned 1 exit status
make[1]: *** [htfuzzy] Error 1

Any suggestions?


--
Richard van Drimmelen   | email: [EMAIL PROTECTED]
Facility Management | phone: +31 20 5928080
SARA Computing Services | fax:   +31 20 6683167






Re: [htdig] static compilation of htdig

2001-01-15 Thread Geoff Hutchison

At 11:52 AM +0100 1/15/01, Matthias Kleine wrote:
>I tried to give a --static option to configure, but it doesn't know this
>option. Do I have to edit the Makefile myself or is there another

Any release of 3.1.x or before compiles essentially statically: it 
still links dynamically to system libraries such as your libc, but 
the internal libraries like htcommon and htlib are linked statically.

In the 3.2 code, you may link these libraries statically with 
"--disable-shared" as an option to configure. (With all configure 
scripts, you may get a list of options by typing ./configure --help)

If you even want to link against libc statically, you'll have to edit 
the Makefiles.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/






Re: [htdig] htdig ignores *.doc file extension

2001-01-15 Thread Geoff Hutchison

At 12:15 PM +0100 1/15/01, Evelio Martinez wrote:
>I have run  bin/htdig -i -vvv -s  | tee /tmp/ht   and   the 3 .doc 
>and 2  .pdf files that are under /home/httpd/html do not have any 
>reference in the debug file /tmp/ht.

No, this is not normal. So you're saying when htdig hits a document 
linking to these .doc or .pdf files, it doesn't list the link? Or do 
you not have any documents linking to these files?

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/






Re: [htdig] htdig ignores *.doc file extension

2001-01-15 Thread Evelio Martinez


Geoff Hutchison wrote:
> On Fri, 12 Jan 2001, Evelio Martinez wrote:
> > htdig is ignoring the files with pdf and doc extension.
>
> By this, I assume you mean they're not indexed.

Correct.

> Try running htdig -vvv and take a look at what happens when it encounters
> a link to a PDF file. Does it reject the link? Or does it get to the link
> and try to index it later?

I have run  bin/htdig -i -vvv -s | tee /tmp/ht  and the 3 .doc and 2 .pdf
files that are under /home/httpd/html do not have any reference in the
debug file /tmp/ht.
Is this normal?

> If it's the former, then one of your limits is set incorrectly. (e.g.
> bad_extensions, valid_extensions, exclude_urls, limit_urls_to ...)

I have not seen anything apparently wrong. Do you?
I attached the htdig.conf

> If it's the latter, then make sure you can run a .doc or a .pdf through
> the external converter itself and get reasonable-looking output.

If I execute /usr/local/bin/catdoc /home/httpd/html/*.doc I can see
reasonable-looking output.

Any idea?
Thanks
-- 
Evelio Martínez
Testanet. Dept. desarrollo software.
Av. Reino de Valencia, 15 - 5
46005 Valencia (Spain)
Tel: +34 96 395 90 00
Fax: +34 96 316 23 19
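
For reference, the URL limits in the attached config can be checked by hand with a small Python sketch like the one below. It is only an approximation of the rules as the config comments describe them (substring patterns for limit_urls_to and exclude_urls, end-of-URL match for bad_extensions). The order of the checks, the case sensitivity, and the function name url_verdict are assumptions, not ht://Dig internals.

```python
# Values copied from the attached htdig.conf.
LIMIT_URLS_TO = [
    "http://correo.testanet.com/",
    "http://correo.testanet.com/akopia/",
]
EXCLUDE_URLS = ["/cgi-bin/", ".cgi"]
BAD_EXTENSIONS = [
    ".wav", ".gz", ".z", ".sit", ".au", ".zip", ".tar", ".hqx", ".exe",
    ".com", ".gif", ".jpg", ".jpeg", ".aiff", ".class", ".map", ".ram",
    ".tgz", ".bin", ".rpm", ".mpg", ".mov", ".avi",
]


def url_verdict(url):
    """Return a short string saying whether these limits would accept the URL."""
    # limit_urls_to: the URL must contain at least one of the patterns.
    if not any(pattern in url for pattern in LIMIT_URLS_TO):
        return "rejected: outside limit_urls_to"
    # exclude_urls: a match anywhere in the URL rejects it.
    if any(pattern in url for pattern in EXCLUDE_URLS):
        return "rejected: matches exclude_urls"
    # bad_extensions: only checked at the end of the URL.
    if any(url.endswith(ext) for ext in BAD_EXTENSIONS):
        return "rejected: bad extension"
    return "accepted"


for candidate in (
    "http://correo.testanet.com/report.doc",
    "http://correo.testanet.com/manual.pdf",
    "http://correo.testanet.com/cgi-bin/search",
    "http://www.testanet.com/",
):
    print(candidate, "->", url_verdict(candidate))
```

Under these assumptions neither .doc nor .pdf is caught by any of the three attributes, which fits the observation that the files are never mentioned in the log at all instead of being rejected.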
 

#
# Example config file for ht://Dig.
#
# This configuration file is used by all the programs that make up ht://Dig.
# Please refer to the attribute reference manual for more details on what
# can be put into this file.  (http://www.htdig.org/confindex.html)
# Note that most attributes have very reasonable default values so you
# really only have to add attributes here if you want to change the defaults.
#
# What follows are some of the common attributes you might want to change.
#

#
# Specify where the database files need to go.  Make sure that there is
# plenty of free disk space available for the databases.  They can get
# pretty big.
#
database_dir:   /opt/www/htdig/db

#
# This specifies the URL where the robot (htdig) will start.  You can specify
# multiple URLs here.  Just separate them by some whitespace.
# The example here will cause the ht://Dig homepage and related pages to be
# indexed.
# You could also index all the URLs in a file like so:
# start_url:   `${common_dir}/start.url`
#
start_url:  http://correo.testanet.com/   \
http://correo.testanet.com/akopia/
#   http://correo.testanet.com/manual \ 
#   http://correo.testanet.com/tareas \
#   http://correo.testanet.com/tienda \
#   http://correo.testanet.com/phpshop\
#   http://correo.testanet.com/icons  \
#   http://correo.testanet.com/pruebas\
#   http://correo.testanet.com/ps_image   \
#   http://correo.testanet.com/freetrade  \
#   http://correo.testanet.com/phpfwgen   \
#   http://correo.testanet.com/akopia \
#   http://correo.testanet.com/construct
#   http://www.testanet.com/

#
# This attribute limits the scope of the indexing process.  The default is to
# set it to the same as the start_url above.  This way only pages that are on
# the sites specified in the start_url attribute will be indexed and it will
# reject any URLs that go outside of those sites.
#
# Keep in mind that the value for this attribute is just a list of string
# patterns. As long as URLs contain at least one of the patterns it will be
# seen as part of the scope of the index.
#
limit_urls_to:  ${start_url}

#
# If there are particular pages that you definitely do NOT want to index, you
# can use the exclude_urls attribute.  The value is a list of string patterns.
# If a URL matches any of the patterns, it will NOT be indexed.  This is
# useful to exclude things like virtual web trees or database accesses.  By
# default, all CGI URLs will be excluded.  (Note that the /cgi-bin/ convention
# may not work on your web server.  Check the  path prefix used on your web
# server.)
#
exclude_urls:   /cgi-bin/ .cgi


#
# Since ht://Dig does not (and cannot) parse every document type, this 
# attribute is a list of strings (extensions) that will be ignored during 
# indexing. These are *only* checked at the end of a URL, whereas 
# exclude_urls patterns are matched anywhere.
#
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
.jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi


#
# The string htdig will send in every request to identify the robot.  Change
# this to your email address.
#
maintainer: [EMAIL PROTECTED]

#
# The excerpts that are displayed in long results rely on stored information
# in the index databases.  The compiled default only stores 512 characters of
# text from each document (this excludes any HTML markup...)  If you plan on
# using the excerpts you probably want to make this larger.  The only 

[htdig] static compilation of htdig

2001-01-15 Thread Matthias Kleine

Hi there!

I tried to give a --static option to configure, but it doesn't know this
option. Do I have to edit the Makefile myself, or is there another
way to compile statically? (The machine runs Linux 2.2.14,
glibc 2.1.3, gcc v2.95.2.)

Thanks for any hints,
Matthias 
-- 
-
Matthias Kleine   Phone: ++49-(0)6 11-17 31-624
Patzschke + Rasp Software AG  Fax:   ++49-(0)6 11-17 31-31
Bierstadter Straße 7  mailto:[EMAIL PROTECTED]
D-65189 Wiesbaden Web Site: http://www.prs.de/
-

