Hello,

I am fairly new to htDig and brand new to this mailing list.

We have a situation in which the indexing of some PDF documents is causing some pain.

Here's the deal. We currently use WebTrends On Demand for website statistics gathering and reporting. To use this service, we need to send snippets of information to the WebTrends server whenever a file is read on our website. We do this for a boatload of PDFs about our company and its products.

So, the URLs kinda look like the following:

http://www.qad.com/cgi-bin/sdc/sdc.pl?file=/company/resources/data_sheets/lean_manufacturing.pdf

Now, under normal circumstances (from what I can tell) the /cgi-bin directory is part of the exclude list, so this particular URL would not even be indexed. I got around this by allowing the /cgi-bin/blah directory (in the include list) and excluding other specific sub-directories of /cgi-bin. After doing this, I was able to search for and find the PDF file. However, the index appears to be based only on the file name and not on the content of the file, since the PDF file is not being converted and indexed directly. So the search result looks kinda like:

Content Download (PDF file icon)

www.qad.com/cgi-bin/sdc/sdc.pl?file=/company/resources/data_sheets....

There is no document information describing the hit that would normally be there if the PDF file had been converted and indexed.

So, here's the question: is there any way to have htdig parse a URL like the one above for the "file=" portion, index the actual PDF file, and still have the URL stored in the database include the "/cgi-bin/sdc/sdc.pl?file=" portion? If I searched for "lean manufacturing", I would like to see a hit like the following:

Lean Manufacturing Data Sheet (PDF File Icon)
Description from the converted and parsed PDF file......
www.qad.com/cgi-bin/sdc/sdc.pl?file=/company/resources/data_sheets.....
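To make the mapping I have in mind concrete, here is a minimal Python sketch. It is not htdig functionality, just an illustration: the function name direct_pdf_url is made up, and it assumes the path given in "file=" is also served directly from the same host.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs


def direct_pdf_url(wrapper_url):
    """Return the direct URL of the file named in the 'file=' parameter.

    Returns None when the URL carries no 'file=' parameter.
    """
    parts = urlsplit(wrapper_url)
    values = parse_qs(parts.query).get("file")
    if not values:
        return None
    # Assumption: the path in 'file=' is reachable directly on the same host.
    return urlunsplit((parts.scheme, parts.netloc, values[0], "", ""))


wrapper = ("http://www.qad.com/cgi-bin/sdc/sdc.pl"
           "?file=/company/resources/data_sheets/lean_manufacturing.pdf")

# The content would be fetched from the direct URL, while the search
# result would keep pointing at the original wrapper URL.
print(direct_pdf_url(wrapper))
```

The idea is that the indexer would fetch and convert the document at the direct URL, but record the wrapper URL as the hit, so clicks from the search results still go through sdc.pl.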

Hope this makes sense.    Appreciate any assistance you may be able to give.

Thanks,

Bruce
