According to Zachary Jenks:
> Greetings Mr. Adams,

Please see http://www.htdig.org/FAQ.html#q1.16

> Currently I have auto indexing turned off in my Apache setup because I do
> not want the public to access any of my php applications or view file lists.
> Therefore, htdig is not indexing my directories.  I've read over the FAQ and
> entered in the sample script (FAQ 5.25) into my htdig.conf file but I'm not
> getting any results.  I get the following message after ./rundig:
> ------------------------------------------------------------------
> htmerge: Unable to open word list file
> '/www3/umesd/searchengine/htdig/db/db.wordlist'.
> Did you index anything?
> Check your config file and try running htdig again.
> ------------------------------------------------------------------
> And -vvv shows me that it's setting New server to:  , 0.
>
> Question1: Can you tell me exactly how and where to place that sample
> script so that it works?  I put it all in htdig.conf after "start_url" as
> follows:
> -------------------------------------------------------------------------
> start_url:              '/www3/umesd/searchengine/docs/':
>
> find /www3 -type f -name \*.html -print | \
>     sed -e 's|/www3|http://www.umesd.k12.or.us/|' > \
>         /www3/umesd/searchengine/docs/
> -------------------------------------------------------------------------
>
> Is this correct???

Incorrect on 3 counts.  First of all, the output from find and sed
should go to a regular file, not a directory, so you either need to
remove the trailing slash after "docs", if docs is a file and not a
directory, or append a file name after docs/ if docs is a directory.
This change must be made in both the start_url entry in htdig.conf and in
the script that runs find and sed.  Secondly, the file name you use in the
start_url entry must be enclosed in backquotes (`), i.e. the character
usually on the same key as the tilde (~), and not in apostrophes (').
Finally, the find command above doesn't go in htdig.conf, but rather in
a separate script that should be run before you run htdig and htmerge,
or rundig or whatever you use to do an indexing run.  If you use a shell
script to do the indexing, you can just add the find and sed commands
above to that script.
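
As a rough sketch of the three points above, the separate script could
look something like the following.  It assumes /www3 is your DocumentRoot
and that docs/ is a directory, and it picks the file name urls.txt purely
for illustration; the function name make_url_list is made up as well.
Note that the sed pattern ends in a slash, so the generated URLs don't
end up with a double slash after the host name.

```sh
#!/bin/sh
# Sketch only: the function name and urls.txt are hypothetical.

# make_url_list DOCROOT BASEURL OUTFILE
# Writes one URL per line for every .html file found under DOCROOT.
make_url_list() {
    docroot=${1%/}      # strip a trailing slash, if any
    baseurl=${2%/}
    # The pattern ends in "/" so the result has exactly one slash
    # between the host name and the path.
    find "$docroot" -type f -name '*.html' -print |
        sed -e "s|^$docroot/|$baseurl/|" > "$3"
}

# Typical use before an indexing run (commented out here):
# make_url_list /www3 http://www.umesd.k12.or.us \
#     /www3/umesd/searchengine/docs/urls.txt
# ./rundig
#
# with this in htdig.conf (backquotes, not apostrophes):
# start_url:      `/www3/umesd/searchengine/docs/urls.txt`
```

The function only writes the URL list; htdig itself still fetches each
URL when you run rundig afterwards.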

> > Question2: Will this script allow me to index directories without
> providing access to the public?

That actually depends a lot on how you set things up.  Normally, htdig
will read the URLs it's given via HTTP requests, so technically those
URLs would be publicly accessible.  However, they can be protected from
public access if you set up "Basic" authentication in your web server
and use the -u option to htdig to give the username/password, or the
authorization attribute (http://www.htdig.org/attrs.html#authorization)
in htdig.conf.  You can also side-step the HTTP server using local_urls.
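
To make those options a bit more concrete, here is a minimal sketch.
The config path, the user name and the password are placeholders, the
wrapper function dig_protected is hypothetical, and the local_urls
mapping assumes /www3 is the directory served as the root of your site.

```sh
#!/bin/sh
# Sketch only: dig_protected and all values shown are placeholders.
HTDIG=${HTDIG:-htdig}

# dig_protected CONFIG CREDENTIALS
# Runs the dig step with Basic authentication credentials,
# given in username:password form.
dig_protected() {
    "$HTDIG" -c "$1" -u "$2"
}

# Example call (commented out so the sketch has no side effects):
# dig_protected /path/to/htdig.conf 'indexer:secret'
#
# Alternative to -u, as a line in htdig.conf:
#   authorization:  indexer:secret
#
# To bypass the HTTP server and read the files straight from disk,
# map the URL prefix to a directory in htdig.conf:
#   local_urls:     http://www.umesd.k12.or.us/=/www3/
```

If you put the password in htdig.conf, make sure that file isn't
readable by other users on the system.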

Note too that the find command above will find documents that aren't
linked from other documents on your site, so it may make things accessible
from the search engine that would otherwise be hidden from view by simply
following links on your site.  This, however, is different from documents
that aren't accessible to the public -- documents that are "hidden" by not
linking to them are still accessible if they're under your DocumentRoot
and aren't otherwise protected by web server security controls.
Such documents may be found by some editing of URLs in the browser's
"Location" field.  (Several recent news stories about confidential
information being unwittingly "leaked" to the public from a company's
web server hinged on such misguided attempts at securing information.)

There's also the question of whether you make the resulting index
publicly searchable or not.  You can protect the original documents
all you want, but if the index you make from them is wide open to the
public, you're opening up a pretty big peephole into that content.
See http://www.htdig.org/FAQ.html#q4.20

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

