Re: [aseek-users] Why aren't I indexing .pdf files?

Kir Kolyshkin Tue, 14 Jan 2003 08:49:41 -0800

KEVIN ZEMBOWER wrote:

I'm trying to get my first installation of aspseek working. It seems to index HTML documents fine, but now I'm trying to expand into .pdf documents.

My aspseek.conf file looks like this:
aspseek@www:~$ grep -v '^[[:space:]]*$' etc/aspseek.conf |grep -v "^#"
Include db.conf
Include ucharset.conf
Include stopwords.conf
Converter application/pdf text/html /usr/local/bin/pdftohtml -i -noframes -stdout $in > $out
DeleteNoServer no
Server http://www.jhuccp.org/
DeltaBufferSize 64
Disallow /cgi-bin/ \.cgi /nph
Disallow \.tif$ \.au$ \.mov$ \.jpe$ \.cur$ \.qt$
Disallow \.b$ \.sh$ \.md5$ \.rpm$
Disallow \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
Disallow \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
Disallow \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$ \.xpm$ \.xbm$
Disallow \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
Disallow \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$ \.ra$
Disallow \.vrml$ \.wrl$ \.png$
Disallow \.exe$ \.cab$ \.dll$ \.bin$ \.class$
Disallow \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
Disallow \.rtf$ \.cdf$ \.ps$
Disallow \.ai$ \.eps$ \.ppt$ \.hqx$
Disallow \.cpt$ \.bms$ \.oda$ \.tcl$
Disallow \.o$ \.a$ \.la$ \.so$ \.so\.[0-9]$
Disallow \.pat$ \.pm$ \.m4$ \.am$
Disallow \?D=A$ \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
Disallow [^:]//
Disallow mmc/.*\.php
Disallow PHPTEST
aspseek@www:~$
I've got links to .pdf files in my .shtml files which seem to be indexed fine:
aspseek@www:~$ find /var/www/main/htdocs/ -iname "*.*htm*" -o -iname "*.stm"|xargs fgrep .pdf |head
//var/www/main/htdocs/popreporter/2002/08-19.shtml: | <a href="http://www.jhuccp.org/pr/j52/J52.pdf">PDF</a></p>

The first thing I notice is that the document is linked as J52.pdf, while it
is available as j52.pdf from your server. Note the case!
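
To make the case difference concrete, here is a minimal Python sketch that compares the linked URL with the lowercase form. The helper name differs_only_by_path_case is made up for illustration and is not part of ASPseek. It relies on the general rule that the scheme and host of a URL are case-insensitive while the path is usually case-sensitive.

```python
from urllib.parse import urlsplit


def differs_only_by_path_case(a: str, b: str) -> bool:
    """Return True when two URLs are distinct but match once the path is lowercased.

    Hypothetical helper for illustration only; not an ASPseek function.
    """
    ua, ub = urlsplit(a), urlsplit(b)
    # Scheme and host compare case-insensitively.
    same_origin = (ua.scheme.lower(), ua.netloc.lower()) == (
        ub.scheme.lower(),
        ub.netloc.lower(),
    )
    # The path is compared exactly first, then with case folded away.
    return same_origin and ua.path != ub.path and ua.path.lower() == ub.path.lower()


linked = "http://www.jhuccp.org/pr/j52/J52.pdf"   # href quoted from the .shtml page
served = "http://www.jhuccp.org/pr/j52/j52.pdf"   # lowercase form from the server
print(differs_only_by_path_case(linked, served))
```

Whether the server actually treats the two spellings as the same file depends on its filesystem and configuration, so this only flags the mismatch.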

<snip>

There are 14 rows in the urlword table which end in '.pdf':
mysql> select * from urlword where url like '%pdf';
+--------+---------+---------+-----------------------------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
| url_id | site_id | deleted | url | next_index_time | status | crc | last_modified | etag | last_index_time | referrer | tag | hops | redir | origin |
+--------+---------+---------+-----------------------------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
| 5244 | 1 | 0 | http://www.jhuccp.org/pr/j52/j52.pdf | 1043164839 | 200 | d41d8cd98f00b204e9800998ecf8427e | Fri, 03 Jan 2003 17:06:16 GMT | "20d0ae-1328a5-3e15c308" | 1042496187 | 2794 | 0 | 5 | 0 | 0 |
<snip>
14 rows in set (0.06 sec)

The "200" in the status column indicates that it was found.
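
One more detail in that row may be worth a look: the crc value d41d8cd98f00b204e9800998ecf8427e is exactly the MD5 digest of empty input. The Python sketch below only confirms that equality. It is an assumption, not something stated in this thread, that the crc column holds an MD5 of the converted document body; if it does, the value would hint that the indexer received zero bytes for this URL.

```python
import hashlib

# MD5 digest of zero bytes of input.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()

# Value copied from the crc column of the urlword row for url_id 5244.
crc_from_row = "d41d8cd98f00b204e9800998ecf8427e"

print(EMPTY_MD5 == crc_from_row)
```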

For this first .pdf document, I computed the urlwords table name as 'urlwords12' (5244 mod 16)

That is the right answer, although ASPseek uses 'urlid & 15', which gives the same result but is much more efficient ;)

, but there's no entry in that table for this url_id:

mysql> select * from urlwords12 where url_id="5244";
Empty set (0.00 sec)
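
The table-name arithmetic above can be checked with a short Python sketch. The function names are made up for illustration, and the count of 16 tables is taken from the "mod 16" in the message rather than from the ASPseek source. For non-negative integers, taking the remainder modulo 16 and masking with 15 always agree, because 16 is a power of two.

```python
NUM_TABLES = 16  # assumed from the "mod 16" above: urlwords0 .. urlwords15


def table_by_mod(url_id: int) -> str:
    """Table name computed with the remainder, as in the original question."""
    return f"urlwords{url_id % NUM_TABLES}"


def table_by_mask(url_id: int) -> str:
    """Table name computed with a bit mask, as in 'urlid & 15'."""
    return f"urlwords{url_id & (NUM_TABLES - 1)}"


print(table_by_mod(5244))   # 5244 = 327 * 16 + 12
print(table_by_mask(5244))
```

Both calls print urlwords12 for url_id 5244.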

This leads me to believe that .pdf documents are being checked, but not indexed.

When I run this document, http://www.jhuccp.org/pr/j52/j52.pdf, through pdftohtml, I get HTML output, so pdftohtml seems to be working okay.
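
Since the Converter line redirects to $out, it may also help to run the command in that exact form and check the size of the resulting file, ideally as the same user that runs the indexer. The Python sketch below does this. The function run_converter is hypothetical, and the way it substitutes $in and $out is only an assumption about how ASPseek expands the Converter line.

```python
import shlex
import subprocess
from pathlib import Path
from string import Template


def run_converter(template: str, in_path: str, out_path: str) -> int:
    """Run a Converter-style command line and return the output file size in bytes.

    Hypothetical helper: $in and $out are filled in here with shell-quoted
    paths, which may differ from what ASPseek does internally.
    """
    command = Template(template).substitute(
        {"in": shlex.quote(in_path), "out": shlex.quote(out_path)}
    )
    # shell=True is needed because the template contains a redirection.
    subprocess.run(command, shell=True, check=True)
    return Path(out_path).stat().st_size


# Example call (file names are placeholders; requires pdftohtml to be installed):
# size = run_converter(
#     "/usr/local/bin/pdftohtml -i -noframes -stdout $in > $out",
#     "j52.pdf",
#     "j52.html",
# )
# print(size)
```

A size of zero would mean the converter ran but wrote nothing to $out, which would be a different failure from the converter not being invoked at all.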

Can anyone suggest any other diagnostics that could help me solve this problem? Any thoughts or comments?

Thank you all in advance for your help.

Hmm...

Try index -T http://www.jhuccp.org/pr/j52/j52.pdf and see what happens.

--
== kir_at_asplinux.ru == 7551596_at_ICQ == 6722750_at_sms.beemail.ru ==

Dream like you'll live forever...Love like you've never been hurt...
Work like you don't need the money...and Dance like nobody is watching!
       -- Satchel Paige
