Dear list,

I recently had my hosting provider install the ht://Dig search engine on our server, 
which runs Apache on Linux.  They installed it as root, so I had to ask them to give 
me permissions on all the files.

I ran rundig and it appears to have created the db files, but when I use the search 
form in my browser to pull up results, I get:

------------------------------------------
ht://Dig error
htsearch detected an error. Please report this to the webmaster of this site. The 
error message is:

Unable to read configuration file
------------------------------------------

I assumed that the configuration file is "htdig.conf", but that file seems fine: I 
can read it, and its permissions are already set for my account.  Is there another 
configuration file that htsearch is failing to read?

Please note that I had to create a symlink in my cgi-bin to the htsearch binary at 
/opt/www/cgi-bin/htsearch.  All the permissions are fine.

The form to do a search is at http://uahc.org/htdig.html.

Also, are there any consultants on this list who can help troubleshoot further 
problems like the one above?


Jonathan Lam 
<http://uahc.org> 



-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 13, 2002 4:35 AM
To: [EMAIL PROTECTED]
Subject: htdig-general digest, Vol 1 #551 - 6 msgs


Send htdig-general mailing list submissions to
        [EMAIL PROTECTED]

To subscribe or unsubscribe via the World Wide Web, visit
        https://lists.sourceforge.net/lists/listinfo/htdig-general
or, via email, send a message with subject or body 'help' to
        [EMAIL PROTECTED]

You can reach the person managing the list at
        [EMAIL PROTECTED]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of htdig-general digest..."


Today's Topics:

   1. Re: Help building a select list (Gilles Detillieux)
   2. Re: recreate url list (Gilles Detillieux)
   3. Re: Problem with Foreign Chars (Swedish) (Gilles Detillieux)
   4. No Navegue Mas!!! (Luis)
   5. RE: Deleted, no excerpt with pdf files (Steve Marshall)
   6. Re: Deleted, no excerpt with pdf files (David Adams)

--__--__--

Message: 1
From: Gilles Detillieux <[EMAIL PROTECTED]>
Subject: Re: [htdig] Help building a select list
To: [EMAIL PROTECTED]
Date: Tue, 12 Mar 2002 17:51:41 -0600 (CST)
Cc: [EMAIL PROTECTED] (ht://Dig mailing list)

According to [EMAIL PROTECTED]:
> Thank you Gilles for your response. I made the changes you suggested and it
> still does not work. I wonder if it is the combination of the two lists I
> built that is the problem.
> 
> build_select_lists:   RESTRICT_LIST restrict restrict_names 2 1 2 restrict "" \
>                       EXCLUDE_LIST,checkbox exclude exclude_names 2 1 2 exclude ""
> 
> restrict_names: "" "Austin City Connection" \
>                 "http://www.ci.austin.tx.us/budget/" "Budget" \
>                 "http://www.ci.austin.tx.us/council/" "Council" \
>                 "http://www.ci.austin.tx.us/library/" "Library" \
>                 "http://www.ci.austin.tx.us/minutes/" "Minutes" \
>                 "http://www.ci.austin.tx.us/news/" "News" \
>                 "http://www.ci.austin.tx.us/police/" "Police" \
>                 "http://www.ci.austin.tx.us/sws/" "Solid Waste Services" \
>                 "http://www.ci.austin.tx.us/watershed/" "Watershed Protection"
> 
> exclude_names:  "http://www.ci.austin.tx.us/agenda/" "Council Agenda" \
>                 "http://www.ci.austin.tx.us/council/" "Council Transcripts" \
>                 "http://www.ci.austin.tx.us/minutes/" "Council Minutes" \
>                 "http://www.ci.austin.tx.us/news/" "News"
> 
> My form for the html is as follows:
> 
> <form method="get" action="$(CGI)"><input type="hidden" name="config"
> value="$&(CONFIG)">
>       <table bgcolor="#BCB6A0" cellpadding="2" cellspacing="1">
>       <tr>
>       <td bgcolor="#EAE8DC" width="96"><p>Search for:</p></td>
>       <td bgcolor="#EAE8DC" width="208">
>       <input type="text" size="25" name="words" value="$(WORDS)"></td>
>       <td colspan="2" align="center" valign="middle"><p><input
> type="submit" value="Search">
>       </p></td>
>       </tr>
>       <tr>
>       <td bgcolor="#EAE8DC"><p>Search in:</p></td>
>       <td bgcolor="#EAE8DC" colspan="3">$(RESTRICT_LIST)</td>
>       </tr>
>       <tr>
>       <td bgcolor="#EAE8DC"><p>Match:</p></td>
>       <td bgcolor="#EAE8DC">$(METHOD)</td>
>       <td bgcolor="#EAE8DC" width="96"><p>Format:</p></td>
>       <td bgcolor="#EAE8DC" width="207">$(FORMAT)</td>
>       </tr>
>       <tr>
>       <td colspan="4" bgcolor="#EAE8DC"><p>
>        If you elect to search all of Austin City Connection, you may want
> to exclude one or more of the following directories. These large directories
> make it difficult to target documents.</p></td>
>       </tr>
>       <tr>
>       <td bgcolor="#EAE8DC" valign="top"><p>Exclude:</p></td>
>       <td bgcolor="#EAE8DC" valign="top"
> colspan="3">$(EXCLUDE_LIST)</form></td>
>       </tr>
> 
> The Restrict list works, the Exclude list does not. 
> 
> Thanks again for your help. I have the search engine so close to working
> like I want it.

All the above looks fine to me.  I have to ask, though, are you sure
you're running version 3.1.6 of htsearch?  3.1.5 doesn't support the
extensions to build_select_lists for checkboxes, radio buttons and
select multiple without a patch.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


--__--__--

Message: 2
From: Gilles Detillieux <[EMAIL PROTECTED]>
Subject: Re: [htdig] recreate url list
To: [EMAIL PROTECTED] (Gabriele Bartolini)
Date: Tue, 12 Mar 2002 18:05:59 -0600 (CST)
Cc: [EMAIL PROTECTED], [EMAIL PROTECTED] (htdig-general)

According to Gabriele Bartolini:
> At 13.44 12/03/2002 +0200, Greg wrote:
> >I am running htdig 3.1.4 and want to do a re-index of the existing URL list
> >within the current database.  The conf file no longer contains the original
> >URL list.  Is there a way to redig the existing URL list without the
> >start_url list and, if possible, place the new db into a separate set of
> >files so that general access is not affected?  If not, how do I extract the
> >existing url list?  Please can you be very specific as I am not at all
> >familiar with htdig command syntax.
> 
> Ciao,
> 
>     Please correct me if I am wrong, guys, but I think that you, Greg, should 
> probably switch to version 3.1.6 if you can. It should be almost 
> painless. It's just a thought I'm having now ... :-)
> 
>     Geoff, Gilles & co, is the 3.1.4 database compatible with the 3.1.6 
> version?

Yes, 3.1.6 should be able to handle a 3.1.4 database without any
difficulty.  3.1.6 also includes an htdump utility to extract the
whole document database as an ASCII file.  You could probably fairly
easily extract the list of URLs from the db.docs file produced by
htdump, using an awk/sed/perl script.  See http://www.htdig.org/ for
all ht://Dig documentation, including syntax for individual commands.
See the documentation for awk, sed or Perl for information about how
you could use one of these to strip out the URLs from db.docs.

Something like "sed -n 's/^.*   u:\([^  ]*\)    .*/\1/p' db.docs" would
probably do it, where the spaces in the s/// command are actually tab
characters.
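
For anyone who would rather not fight with literal tabs in a shell command,
the same idea can be sketched in Python.  This is only a rough equivalent of
the sed one-liner above, and it rests on the same assumption: that each record
in db.docs is one line of tab-separated fields and that the URL field is
written as "u:<url>".  The function name extract_urls is made up for this
example, and unlike the sed pattern it takes the first "u:" field on a line and
does not require another field after it.

```python
from typing import Iterable, Iterator


def extract_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield the value of the first 'u:' field found on each line.

    Assumption: records are single lines of tab-separated fields, with the
    URL stored as 'u:<url>' (this is what the sed pattern implies).
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the first field: the sed pattern requires a tab before 'u:'.
        for field in fields[1:]:
            if field.startswith("u:"):
                yield field[2:]
                break
```

To use it, open db.docs as text (for example with encoding="latin-1", which
is a guess about the file's encoding) and print each value yielded by
extract_urls(handle).  Lines without a "u:" field are silently skipped.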

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


--__--__--

Message: 3
From: Gilles Detillieux <[EMAIL PROTECTED]>
Subject: Re: [htdig] Problem with Foreign Chars (Swedish)
To: [EMAIL PROTECTED] (Stefan Wold)
Date: Tue, 12 Mar 2002 18:07:46 -0600 (CST)
Cc: [EMAIL PROTECTED] (htdig-general)

According to Stefan Wold:
> I'm running htdig 3.1.6 on Linux. When I use rundig to create the
> database for a website, it indexes it correctly except that it doesn't take
> ANY foreign chars (Swedish) at all; neither åäö nor ÅÄÖ can be found in the
> db.wordlist. It seems to skip the whole word if it contains a Swedish
> char. I have tried different locale settings before running rundig,
> without any luck.
> 
> Has anyone had this kind of problem?

Lots of people do!  See http://www.htdig.org/FAQ.html#q5.8
I added a couple paragraphs to it this morning.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


--__--__--

Message: 5
Reply-To: <[EMAIL PROTECTED]>
From: "Steve Marshall" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>,
        "'David Adams'" <[EMAIL PROTECTED]>
Subject: RE: [htdig] Deleted, no excerpt with pdf files
Date: Wed, 13 Mar 2002 08:49:50 -0000
Organization: Atelier Ten

David, thanks for your suggestion

The trouble is, htdig's output looks fine to me, seems to get the
Content-Type correct, the length looks sensible at 29122 bytes, it just
doesn't put anything it finds into its  database scratch files. It lists
the text from the pdf when in -vvvv mode, so it's not one of those
pdf-image issues.

Output is listed below

Any other thoughts?

Steve


title: Atelier Ten Web Graphics
image: http://192.168.1.2/pdfs/TSB_Exterior_thumb.gif
href: http://192.168.1.2/pdfs/phoenix.pdf (support images)
resolving 'http://192.168.1.2/pdfs/phoenix.pdf'

   pushing http://192.168.1.2/pdfs/phoenix.pdf
+ size = 1186
pick: 192.168.1.2, # servers = 1
1:1:1:http://192.168.1.2/pdfs/phoenix.pdf: Retrieval command for
http://192.168.1.2/pdfs/phoenix.pdf: GET /pdfs/phoenix.pdf HTTP/1.0
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
Referer: http://192.168.1.2/
Host: 192.168.1.2

Header line: HTTP/1.1 200 OK
Header line: Date: Tue, 12 Mar 2002 20:00:42 GMT
Header line: Server: Apache/1.3.20 (Linux/SuSE) PHP/4.0.6
Header line: Last-Modified: Thu, 14 Jun 2001 08:59:02 GMT
Converted Thu, 14 Jun 2001 08:59:02 GMT to Thu, 14 Jun 2001 08:59:02
Header line: ETag: "9813c-71c2-3b287cd6"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 29122
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 4546 from document
Read a total of 29122 bytes
PDF::setContents(29122 bytes)
PDF::parse(http://192.168.1.2/pdfs/phoenix.pdf)
PDF::parse: 19272 lines parsed
PDF::parse ends normally
 size = 29122
pick: 192.168.1.2, # servers = 1






--__--__--

Message: 6
From: "David Adams" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>
Subject: Re: [htdig] Deleted, no excerpt with pdf files
Date: Wed, 13 Mar 2002 09:31:03 -0000

Steve,

It looks as though there must be a problem with your configuration file.
The lines:

PDF::setContents(29122 bytes)
PDF::parse(http://192.168.1.2/pdfs/phoenix.pdf)
PDF::parse: 19272 lines parsed
PDF::parse ends normally
 size = 29122

are definitely NOT what I would expect from doc2html.pl, pdf2html.pl or
pdftotext.
Some other parser is being used.

--
David Adams
Computing Services
Southampton University
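
The check described above can be sketched as a small Python helper that scans
saved htdig -vvvv output for lines beginning with "PDF::".  This is only an
illustration of the heuristic in this message (that such lines point to a
parser other than doc2html.pl, pdf2html.pl or pdftotext); the function name
builtin_pdf_lines is hypothetical and the "PDF::" prefix is taken from Steve's
log, not from any documented output format.

```python
import re
from typing import List

# Matches lines such as "PDF::parse ends normally" from the quoted log.
# Assumption: a leading "PDF::" prefix marks output from that other parser.
_PDF_PARSER_LINE = re.compile(r"^\s*PDF::\w+")


def builtin_pdf_lines(log_text: str) -> List[str]:
    """Return the log lines that start with a 'PDF::' prefix."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if _PDF_PARSER_LINE.match(line)
    ]
```

If the returned list is non-empty for a PDF that should have gone through an
external converter, the configuration file would be the first place to look.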





--__--__--

_______________________________________________
htdig-general list digest <[EMAIL PROTECTED]>
Information: https://lists.sourceforge.net/lists/listinfo/htdig-general
FAQ: http://htdig.sourceforge.net/FAQ.html


End of htdig-general Digest
