Yeah, you are right, I think it doesn't like the output at all. Instead of
the real words, this is what it is taking as words:
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
:S
So I suppose htdig just doesn't really like the output of the parser. I'm
attaching the output of the parser (executed manually) and the output of
htdig, just in case you have any more ideas :)
Thanks a lot!
Ainhoa
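
One way to see which words the converter output exposes, independent of htdig, is to pull the visible text out of the HTML and list it. The sketch below uses Python's standard html.parser; it is not htdig's own tokenizer, and the names WordCollector, visible_words and sample are made up for illustration. It only assumes that text inside <style> and <script> should not count as words.

```python
import re
from html.parser import HTMLParser


class WordCollector(HTMLParser):
    """Collect visible words, skipping the contents of <style> and <script>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside <style> or <script>
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only text outside <style>/<script> is treated as indexable.
        if self.skip_depth == 0:
            self.words.extend(re.findall(r"[A-Za-z0-9]+", data.lower()))


def visible_words(html_text):
    """Return the lower-cased words found in the visible text of html_text."""
    parser = WordCollector()
    parser.feed(html_text)
    parser.close()
    return parser.words


# Small hypothetical sample in the style of the Word-generated HTML.
sample = (
    "<html><head><style><!-- p {margin:0cm;} --></style></head>"
    "<body><p>You can use the <b>Beep</b> action.</p></body></html>"
)
print(visible_words(sample))
```

Feeding it the converter output saved to a file gives a word list that can be compared by eye with the "word:" lines in the htdig log.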
On Feb 11, 2008 12:16 PM, <[EMAIL PROTECTED]> wrote:
> Ainhoa,
> My first instinct would now be to check the parser output - try adding
> another v to your config, (and possibly restricting your indexing to just
> this one file) and check the log output - it may be that htdig does not like
> the output from your PERL script. www.htdig.org explains what the output
> means. I seem to recall you saying that you had already tested that it ran
> on its own, but possibly there is something not right there, or a typo in
> the config that neither of us can see.
>
> Regards,
> Mike
>
>
> ------------------------------
> *From:* Ainhoa L [mailto:[EMAIL PROTECTED]
> *Sent:* Monday, February 11, 2008 9:33 AM
> *To:* Brockington,MJ,Michael,JPGA4X R
> *Cc:* [email protected]
> *Subject:* Re: [htdig] Htdig and MHT files
>
> Hi Mike,
> Yes, you were right, I was missing that part and I didn't even notice!
> I changed the config file and wrote this:
>
> application/pdf->text/html /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \
> application/vnd.wap.xhtml+xml->text/html /opt/vin/mht2html.pl
>
> application/vnd.wap.xhtml+xml was the MIME type for my mht documents. So I ran
> htdig and everything seemed to go fine, ending with:
>
>
> 0/http://172.26.0.169/testdig/
> 1/http://172.26.0.169/testdig/About_comments_eex3.mht
> 2/http://172.26.0.169/testdig/aster.pdf
> 3/http://172.26.0.169/testdig/beepmacro.mht
> 4/http://172.26.0.169/testdig/index.txt
> 5/http://172.26.0.169/testdig/test.html
>
> (I am doing this in a test folder)
>
> But when I go to the search page, it won't find words inside the mht
> files. It works for the pdf, txt and html ones, but can't find the words
> that are in the mht ones.
>
> I suppose I am missing something here... do I need to configure anything
> else for the search engine?
>
> Thanks a lot for all your help,
>
> Ainhoa
>
>
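
For reference, an .mht file is a MIME multipart archive, so converting it to HTML mostly means finding the text/html part and decoding it. The following is a minimal sketch of that idea with Python's standard email package. It is not the mht2html.pl script mentioned above; html_from_mht and the sample archive are hypothetical, and it assumes the first text/html part is the document body.

```python
import email
from email import policy


def html_from_mht(raw_bytes):
    """Return the first text/html part of an MHT (MIME) archive, or None."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes quoted-printable/base64 and the charset.
            return part.get_content()
    return None


# Tiny hand-made archive: one HTML part plus one image part.
sample = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUND"\r\n'
    b"\r\n"
    b"--BOUND\r\n"
    b'Content-Type: text/html; charset="us-ascii"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<html><body><p>Beep Macro=\r\n"
    b" Action</p></body></html>\r\n"
    b"--BOUND\r\n"
    b"Content-Type: image/gif\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"R0lGODlhAQABAAAAACw=\r\n"
    b"--BOUND--\r\n"
)
print(html_from_mht(sample))
```

For a real file, the bytes would come from something like open(path, "rb").read(), and the result printed to standard output is what a converter would hand on as text/html.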
<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=us-ascii">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 12">
<meta name=Originator content="Microsoft Word 12">
<link rel=File-List href="BeepMacroAction_archivos/filelist.xml">
<link rel=Edit-Time-Data href="BeepMacroAction_archivos/editdata.mso">
<link rel=themeData href="BeepMacroAction_archivos/themedata.thmx">
<link rel=colorSchemeMapping
href="BeepMacroAction_archivos/colorschememapping.xml">
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Wingdings;
panose-1:5 0 0 0 0 0 0 0 0 0;
mso-font-charset:2;
mso-generic-font-family:auto;
mso-font-pitch:variable;
mso-font-signature:0 268435456 0 0 -2147483648 0;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;
mso-font-charset:0;
mso-generic-font-family:roman;
mso-font-pitch:variable;
mso-font-signature:-1610611985 1107304683 0 0 159 0;}
@font-face
{font-family:"Arial Unicode MS";
panose-1:2 11 6 4 2 2 2 2 2 4;
mso-font-charset:128;
mso-generic-font-family:swiss;
mso-font-pitch:variable;
mso-font-signature:-134238209 -371195905 63 0 4129279 0;}
@font-face
{font-family:"\@Arial Unicode MS";
panose-1:2 11 6 4 2 2 2 2 2 4;
mso-font-charset:128;
mso-generic-font-family:swiss;
mso-font-pitch:variable;
mso-font-signature:-134238209 -371195905 63 0 4129279 0;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-unhide:no;
mso-style-qformat:yes;
mso-style-parent:"";
margin:0cm;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:"Times New Roman","serif";
mso-fareast-font-family:"Times New Roman";
mso-ansi-language:EN-US;
mso-fareast-language:EN-US;}
h2
{mso-style-unhide:no;
mso-style-qformat:yes;
mso-margin-top-alt:auto;
margin-right:0cm;
mso-margin-bottom-alt:auto;
margin-left:0cm;
mso-pagination:widow-orphan;
mso-outline-level:2;
font-size:18.0pt;
font-family:"Arial Unicode MS","sans-serif";
mso-ansi-language:EN-US;
mso-fareast-language:EN-US;
font-weight:bold;}
a:link, span.MsoHyperlink
{mso-style-noshow:yes;
mso-style-unhide:no;
color:blue;
text-decoration:underline;
text-underline:single;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-noshow:yes;
mso-style-priority:99;
color:purple;
mso-themecolor:followedhyperlink;
text-decoration:underline;
text-underline:single;}
p
{mso-style-noshow:yes;
mso-style-unhide:no;
mso-margin-top-alt:auto;
margin-right:0cm;
mso-margin-bottom-alt:auto;
margin-left:0cm;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:"Arial Unicode MS","sans-serif";
mso-ansi-language:EN-US;
mso-fareast-language:EN-US;}
span.cdappliestotitle
{mso-style-name:cdappliestotitle;
mso-style-unhide:no;}
span.cdappliestotext
{mso-style-name:cdappliestotext;
mso-style-unhide:no;}
span.acicollapsed
{mso-style-name:acicollapsed;
mso-style-unhide:no;}
span.GramE
{mso-style-name:"";
mso-gram-e:yes;}
@page Section1
{size:612.0pt 792.0pt;
margin:72.0pt 90.0pt 72.0pt 90.0pt;
mso-header-margin:36.0pt;
mso-footer-margin:36.0pt;
mso-paper-source:0;}
div.Section1
{page:Section1;}
/* List Definitions */
@list l0
{mso-list-id:1373536133;
mso-list-type:hybrid;
mso-list-template-ids:-983145446 -1000322458 2027998910 -628996008
-865584228 756866296 -631309372 448825086 975487862 272377032;}
@list l0:level1
{mso-level-number-format:bullet;
mso-level-text:\F0B7;
mso-level-tab-stop:36.0pt;
mso-level-number-position:left;
text-indent:-18.0pt;
mso-ansi-font-size:10.0pt;
font-family:Symbol;}
ol
{margin-bottom:0cm;}
ul
{margin-bottom:0cm;}
-->
</style>
</head>
<body lang=ES link=blue vlink=purple style='tab-interval:36.0pt'>
<div class=Section1>
<p class=MsoNormal><span lang=EN-US>Beep Macro Action</span><span lang=EN-US
style='font-family:"Arial Unicode MS","sans-serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span class=cdappliestotitle><span lang=EN-US>Applies to:
</span></span><span
class=cdappliestotext><span lang=EN-US><a
href="http://office.microsoft.com/en-us/access/FX100646911033.aspx">Microsoft
Office Access 2007</a></span></span></p>
<p class=MsoNormal><span lang=EN-US> </span></p>
<p class=MsoNormal align=right style='text-align:right'><span lang=EN-US
style='display:none;mso-hide:all'><a
href="javascript:AlterAllDivs('block');"><span
style='text-decoration:none;text-underline:none'><span
style='mso-ignore:vglayout'><img
border=0 width=15 height=10 src="image001.gif"
alt="Show All" v:shapes="picHeader"></span></span>Show
All</a><o:p></o:p></span></p>
<p class=MsoNormal align=right style='text-align:right'><span
lang=EN-US><o:p> </o:p></span></p>
<p><span lang=EN-US>You can use the <b>Beep</b> action to sound a beep tone
through the computer's speaker.</span></p>
<h2><span lang=EN-US>Setting</span></h2>
<p style='margin:0cm;margin-bottom:.0001pt'><span lang=EN-US>The <b>Beep</b>
action doesn't have any arguments.</span></p>
<h2><span lang=EN-US>Remarks</span></h2>
<p style='margin:0cm;margin-bottom:.0001pt'><span lang=EN-US>You can use the
<b>Beep</b>
action to signal the following occurrences:</span></p>
<ul type=disc>
<li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;
mso-list:l0 level1 lfo1;tab-stops:list 36.0pt'><span lang=EN-US>Important
screen changes have occurred.</span></li>
<li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;
mso-list:l0 level1 lfo1;tab-stops:list 36.0pt'><span lang=EN-US>The wrong
kind of data has been entered in a <a
href="javascript:AppendPopup(this,'ofControl_1')">control<span
class=acicollapsed> (control: A graphical user interface object, such
as a text box, check box, scroll bar, or command button, that <span
class=GramE>lets</span> users control the program. You use controls to
display data or choices, perform an action, or make the user interface
easier to read.)</span></a>. For example, the user has entered numeric
data in a <a href="javascript:AppendPopup(this,'defTextBox_2')">text
box<span
class=acicollapsed> (text box: A control, also called an edit <span
class=GramE>field, that</span> is used on a form, report, or data access
page to display text or accept data entry. It can have a label attached to
it.)</span></a> <span class=GramE>control</span>.</span></li>
<li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;
mso-list:l0 level1 lfo1;tab-stops:list 36.0pt'><span lang=EN-US>A <a
href="javascript:AppendPopup(this,'AcMacro_3')">macro<span
class=acicollapsed> (macro: An action or set of actions that you can
use to automate tasks.)</span></a> <span class=GramE>has</span> reached a
specified point or has completed its actions.</span></li>
</ul>
<p style='margin:0cm;margin-bottom:.0001pt'><span lang=EN-US>The frequency and
duration of the beep depend on the hardware, which may vary between
computers.</span></p>
<p style='margin:0cm;margin-bottom:.0001pt'><span lang=EN-US>To run the
<b>Beep</b>
action in a Visual Basic for Applications (VBA) module, use the <b>Beep</b>
method of the <b>DoCmd</b> object.</span></p>
<p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p>
<p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p>
<p class=MsoNormal><span lang=EN-US><a
href="http://office.microsoft.com/en-us/access/HA012262081033.aspx">http://office.microsoft.com/en-us/access/HA012262081033.aspx</a></span></p>
<p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p>
</div>
</body>
</html>

1:1:http://172.26.0.169/testdig/
New server: 172.26.0.169, 80
Retrieval command for http://172.26.0.169/robots.txt: GET /robots.txt HTTP/1.0
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
Host: 172.26.0.169
Header line: HTTP/1.1 404 Not Found
Header line: Date: Mon, 11 Feb 2008 11:56:32 GMT
Header line: Server: Apache/2.2.4 (Unix) mod_ssl/2.2.4 OpenSSL/0.9.7a DAV/2 PHP/5.1.6
Header line: X-Powered-By: PHP/5.1.6
Header line: Content-Length: 2815
Header line: Connection: close
Header line: Content-Type: text/html; charset=utf-8
Header line:
returnStatus = 1
pushed
pick: 172.26.0.169, # servers = 1
0:0:0:http://172.26.0.169/testdig/: Retrieval command for http://172.26.0.169/testdig/: GET /testdig/ HTTP/1.0
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
Host: 172.26.0.169
Header line: HTTP/1.1 200 OK
Header line: Date: Mon, 11 Feb 2008 11:56:32 GMT
Header line: Server: Apache/2.2.4 (Unix) mod_ssl/2.2.4 OpenSSL/0.9.7a DAV/2 PHP/5.1.6
Header line: Content-Length: 270
Header line: Connection: close
Header line: Content-Type: text/html; charset=utf-8
Header line:
returnStatus = 0
Read 270 from document
Read a total of 270 bytes
Tag: <html>, matched -1
Tag: <head>, matched -1
Tag: <title>, matched 0
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Tag: </title>, matched 1
title: Index of /testdig
Tag: </head>, matched -1
Tag: <body>, matched -1
Tag: <h1>, matched 4
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Tag: </h1>, matched 10
Tag: <ul>, matched -1
Tag: <li>, matched 19
Tag: <a href="/">, matched 2
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://172.26.0.169/ (Parent Directory)
Rejected: URL not in the limits!
url rejected: (level 1)http://172.26.0.169/
Tag: </li>, matched -1
Tag: <li>, matched 19
Tag: <a href="beepmacro.mht">, matched 2
word: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
word part: [EMAIL PROTECTED]
Tag: </a>, matched 3
href: http://172.26.0.169/testdig/beepmacro.mht (beepmacro.mht)
resolving 'http://172.26.0.169/testdig/beepmacro.mht'
pushing http://172.26.0.169/testdig/beepmacro.mht
+Tag: </li>, matched -1
Tag: </ul>, matched -1
Tag: </body>, matched -1
Tag: </html>, matched -1
head: Index of /testdig * Parent Directory * beepmacro.mht
size = 270
pick: 172.26.0.169, # servers = 1
1:1:1:http://172.26.0.169/testdig/beepmacro.mht: Retrieval command for http://172.26.0.169/testdig/beepmacro.mht: GET /testdig/beepmacro.mht HTTP/1.0
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
Referer: http://172.26.0.169/testdig/
Host: 172.26.0.169
Header line: HTTP/1.1 200 OK
Header line: Date: Mon, 11 Feb 2008 11:56:32 GMT
Header line: Server: Apache/2.2.4 (Unix) mod_ssl/2.2.4 OpenSSL/0.9.7a DAV/2 PHP/5.1.6
Header line: Last-Modified: Fri, 11 Jan 2008 06:36:08 GMT
Converted Fri, 11 Jan 2008 06:36:08 GMT to Fri, 11 Jan 2008 06:36:08
Header line: ETag: "99866e-8f5a-8a9e4600"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 36698
Header line: Connection: close
Header line: Content-Type: application/vnd.wap.xhtml+xml; charset=us-ascii
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 3930 from document
Read a total of 36698 bytes
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
head: Read 8192 from document Read 8192 from document Read 8192 from document Read 8192 from document Read 3930 from document Read a total of 36698 bytes
size = 36698
pick: 172.26.0.169, # servers = 1
htmerge: Sorting...
htmerge: Merging...
0/http://172.26.0.169/testdig/
1/http://172.26.0.169/testdig/beepmacro.mht
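
A verbose log like the one above gets long quickly, so it can help to reduce it to a per-document summary: how many "Tag:" and "word:" lines each URL produced and what ended up in "head:". The sketch below is one possible way to do that, based only on the line shapes visible in this log; summarise and the sample text are hypothetical, and "word: example" stands in for the real word lines.

```python
import re

# A document section starts with a line like
# "1:1:1:http://host/path: Retrieval command for ..."
DOC_START = re.compile(r"^\d+:\d+:\d+:(\S+?): Retrieval command")


def summarise(log_text):
    """Map each retrieved URL to its Tag/word line counts and head excerpt."""
    docs = {}
    current = None
    for line in log_text.splitlines():
        # One line in the log above carries a leading "+"; drop it.
        line = line.lstrip("+")
        match = DOC_START.match(line)
        if match:
            current = docs.setdefault(
                match.group(1), {"tags": 0, "words": 0, "head": None}
            )
        elif current is not None:
            if line.startswith("Tag: "):
                current["tags"] += 1
            elif line.startswith("word: "):
                current["words"] += 1
            elif line.startswith("head: "):
                current["head"] = line[len("head: "):]
    return docs


sample = """\
0:0:0:http://172.26.0.169/testdig/: Retrieval command for
Tag: <html>, matched -1
+Tag: <title>, matched 0
word: example
head: Index of /testdig
1:1:1:http://172.26.0.169/testdig/beepmacro.mht: Retrieval command for
word: example
word: example
head: Read 8192 from document
"""
for url, info in summarise(sample).items():
    print(url, info)
```

In the log above, beepmacro.mht has "word:" lines but no "Tag:" lines, and its head is the "Read ... from document" text rather than anything from the page, so that is the entry to look at first.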