Thanks, again, Kir, for your offer of help.

I had already fixed the case in the link to http://www.jhuccp.org/pr/j52/J52.pdf from 
the document http://www.jhuccp.org/popreporter/2002/08-19.shtml while I was writing 
the note, but forgot to update my snippet. Sorry for the confusion.

Here's the output you asked for:
aspseek@www:~$ sbin/index -T http://www.jhuccp.org/pr/j52/j52.pdf
Loading configuration from /usr/local/aspseek/etc/db.conf
Loading configuration from /usr/local/aspseek/etc/ucharset.conf
Loading configuration from /usr/local/aspseek/etc/stopwords.conf
Loading configuration from /usr/local/aspseek/etc/aspseek.conf
Adding URL: http://www.jhuccp.org/pr/j52/j52.pdf
Status: OK
index process finished.
aspseek@www:~$ 

And yet:
mysql> select * from urlword where url like '%pdf' limit 1;
+--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
| url_id | site_id | deleted | url                                  | next_index_time | status | crc                              | last_modified                 | etag                     | last_index_time | referrer | tag | hops | redir | origin |
+--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
|   5244 |       1 |       0 | http://www.jhuccp.org/pr/j52/j52.pdf |      1043167913 |    200 | d41d8cd98f00b204e9800998ecf8427e | Fri, 03 Jan 2003 17:06:16 GMT | "20d0ae-1328a5-3e15c308" |      1042496187 |     2794 |   0 |    5 |     0 |      0 |
+--------+---------+---------+--------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
1 row in set (0.07 sec)

mysql> select * from urlwords12 where url_id="5244";       
Empty set (0.00 sec)

mysql> 

Just for good measure, I checked all the urlwordsNN tables for '5244' without luck.
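
For reference, that check across every table can be sketched in Python as below. This is only a sketch: it assumes the sixteen tables are named urlwords00 through urlwords15 (two digits, zero-padded), and the helper name urlwords_queries is made up for illustration.

```python
# Hypothetical helper: build one SELECT statement per urlwordsNN table.
# Assumption: there are 16 tables, named urlwords00 .. urlwords15.
def urlwords_queries(url_id, table_count=16):
    return [
        f"SELECT * FROM urlwords{n:02d} WHERE url_id = {int(url_id)};"
        for n in range(table_count)
    ]

# Print the statements so they can be pasted into the mysql client.
for query in urlwords_queries(5244):
    print(query)
```

Each printed statement can be run by hand in the mysql client; in my case every one of them came back empty.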

Are there any extra diagnostics or logging I could turn on to help with this problem? 
Any other suggestions?

Thanks, again, for your help.

-Kevin

>>> [EMAIL PROTECTED] 01/14/03 11:50AM >>>
> I've got links to .pdf files in my .shtml files which seem to be indexed fine:
> aspseek@www:~$ find /var/www/main/htdocs/ -iname "*.*htm*" -o -iname "*.stm"|xargs fgrep .pdf |head
> //var/www/main/htdocs/popreporter/2002/08-19.shtml:                            | <a href="http://www.jhuccp.org/pr/j52/J52.pdf">PDF</a></p>

The first thing I notice is that the document is named J52.pdf in the link, while
it is available as j52.pdf from your server. Notice the case!

> <snip>
> 
> There are 14 rows in the urlword table which end in '.pdf':
> mysql> select * from urlword where url like '%pdf'; 
> 
> +--------+---------+---------+-----------------------------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
> | url_id | site_id | deleted | url                                                       | next_index_time | status | crc                              | last_modified                 | etag                     | last_index_time | referrer | tag | hops | redir | origin |
> +--------+---------+---------+-----------------------------------------------------------+-----------------+--------+----------------------------------+-------------------------------+--------------------------+-----------------+----------+-----+------+-------+--------+
> |   5244 |       1 |       0 | http://www.jhuccp.org/pr/j52/j52.pdf                      |      1043164839 |    200 | d41d8cd98f00b204e9800998ecf8427e | Fri, 03 Jan 2003 17:06:16 GMT | "20d0ae-1328a5-3e15c308" |      1042496187 |     2794 |   0 |    5 |     0 |      0 |
> <snip>
> 14 rows in set (0.06 sec)
> 
> The "200" in the status column indicates that it was found.
> 
> For this first .pdf document, I computed the urlwords table name as 'urlwords12'
> (5244 mod 16)

That is the right answer, although ASPseek uses 'urlid & 15', which gives the
same result but is much more efficient ;)
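
To illustrate why the two agree, here is a minimal sketch in Python (not ASPseek's actual code; both function names are made up for this example). For non-negative integers, masking with 15 keeps the low four bits, which is exactly the remainder after dividing by 16.

```python
# Hypothetical helpers, for illustration only.
def suffix_by_mod(url_id):
    # Remainder after division by 16.
    return url_id % 16

def suffix_by_mask(url_id):
    # Keep only the low four bits (15 == 0b1111).
    return url_id & 15

# The two agree for every non-negative url_id in this range.
assert all(suffix_by_mod(i) == suffix_by_mask(i) for i in range(100000))

print(suffix_by_mask(5244))  # prints 12, i.e. table urlwords12
```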

> , but there's no entry in that table for this url_id:
> mysql> select * from urlwords12 where url_id="5244";
> Empty set (0.00 sec)
> 
> This leads me to believe that .pdf documents are being checked, but not indexed.
> 
> When I run this document, http://www.jhuccp.org/pr/j52/j52.pdf, through pdftohtml, I
> get HTML output, so pdftohtml seems to be working okay.
> 
> Can anyone suggest any other diagnostics that could help me solve this problem? Any
> thoughts or comments?
> 
> Thank you all in advance for your help.

Hmm...

Try index -T http://www.jhuccp.org/pr/j52/j52.pdf and see what happens.

-- 
== kir_at_asplinux.ru == 7551596_at_ICQ == 6722750_at_sms.beemail.ru ==

Dream like you'll live forever...Love like you've never been hurt...
Work like you don't need the money...and Dance like nobody is watching!
        -- Satchel Paige
