Re: [htdig] Inherent limitations of htdig

2000-10-03 Thread Geoff Hutchison

On Tue, 3 Oct 2000, Eric Bliss wrote:

> Does anybody know of any inherent limitations between htdig, an Intel
> Pentium 3 class architecture, a Red Hat Linux kernel, and roughly
> 50,000 files containing roughly 650 megs of data being indexed?  I'm

None. I'd only see a limitation if you said "oh and we only have 100 megs
of free space" or "oh and we only have 16MB of RAM."

> having no end of troubles with this, and htdig isn't the first search
> engine we've had problems with.  htdig seems to just be silently dying
> during indexing of the sites we have.  Any thoughts?

In the case of htdig, it only dies silently if you've set it to *run*
silently. If you use flags like -s for statistics or -v for debugging (or
-vvv for *more* debugging), it will give you enough information to figure
out your problems in almost all circumstances.

But you haven't given a whole lot of information. I'd guess that quite a
few people on this list have similar setups and don't have problems.
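
A minimal sketch of the advice above: a small Python wrapper that runs htdig with the -s and -v flags and keeps everything it prints in one log file. The helper names and the paths in the usage comment are hypothetical, and -c is assumed to be how your htdig build is pointed at its config file, so check your own version's usage message first.

```python
import subprocess


def build_htdig_command(config_path, verbosity=3, htdig="htdig"):
    """Build an htdig command line with -s (statistics) and -v repeated."""
    if verbosity < 1:
        raise ValueError("verbosity must be at least 1")
    return [htdig, "-s", "-" + "v" * verbosity, "-c", config_path]


def run_logged(command, log_path):
    """Run a command, sending both stdout and stderr to log_path.

    Returns the exit status, so a "silent" death still leaves a trace:
    a non-zero code, or a negative one if a signal killed the process.
    """
    with open(log_path, "w") as log:
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


# Usage (hypothetical paths, adjust to your installation):
#   status = run_logged(build_htdig_command("/etc/htdig/htdig.conf"),
#                       "htdig-run.log")
```

The last URLs named in the log before the run stops are usually the first place to look.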

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  
FAQ:




[htdig] Inherent limitations of htdig

2000-10-03 Thread Eric Bliss

Does anybody know of any inherent limitations between htdig, an Intel
Pentium 3 class architecture, a Red Hat Linux kernel, and roughly
50,000 files containing roughly 650 megs of data being indexed?  I'm
having no end of troubles with this, and htdig isn't the first search
engine we've had problems with.  htdig seems to just be silently dying
during indexing of the sites we have.  Any thoughts?

Eric
Web Developer
Digital Media Online
Santa Ana, CA
http://www.digitalmedianet.com








Re: [htdig] Last modified date revisited - Apache

2000-10-03 Thread Peaveway

It looks like a neverending story for you...


In an email dated 03.10.00 20:18:27 (CET) - Central European Summer Time,
[EMAIL PROTECTED] writes:

> From what I can glean, there are 2 ways to get this.
>  
>  Either by putting an echo command into the html files (SSI),

Which echo command? The SSI code that I sent you? That was only ever a test
of whether the internal SSI handler is activated for *.html or not.

>  or by setting xbithack=full and setting the executable bits on for group and
>  user.
>  
>  Neither approach is good for us.  We have many static html pages, and many
>  more being created every day.
>  It's not feasible for us to put the extra code into each html file, or to
>  change the x bit on each file as well.

You can set XBitHack in your httpd.conf, and a little shell script started
via cron can make the changes before htdig starts digging. Another way is to
change the umask for the people who can publish HTML documents.

>  IS THERE ANY OTHER WAY.??
>  It seems like a pretty simple request, and yet we can't find any answer in
>  the apache docs or htdig.
>  It seems to be more of an apache issue, but its my HTdig that needs the
>  info!

I think the whole world is now interested in seeing your httpd.conf :)

The SSI handler is configured via the AddHandler/AddType directives, which
can be set in your main server config file, in a virtual host, or in .htaccess!
Is it possible that all your static pages are parsed through an external parser
such as the PHP CGI?
Is htdig running on the same machine as the web server you would like to
index? If not, is there a proxy server between htdig and the web server?

Questions upon questions...

Joerg Behrens






Re: [htdig] from IIS searching to HtDig

2000-10-03 Thread Geoff Hutchison

On Tue, 3 Oct 2000, Frances Santiago wrote:

> # CiScope is the directory (virtual or real) under which results are
> # returned.  If a file matches the query but is not in a directory beneath
> # CiScope, it is not returned in the result set.
> # A scope of / means all hits matching the query are returned.
> 
> How I understand it is that if I have a search page under http://foo.com/x
> the search will return hits in the files recursively from x - not the files
> from http://foo.com. Can htdig do this without using a different config
> file and db for each? 

I would read the IIS documentation the same way.

Yes, you can do this (and more) using just the search form to ht://Dig. To
mirror the IIS behaviour, you'd want the "restrict" field.
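
As a rough sketch of what that looks like on the wire, the Python below builds the GET URL a search form under /x could submit, with "restrict" set to that directory. The function name and the /cgi-bin/htsearch location are assumptions for illustration; "words" and "restrict" are the form field names htsearch reads.

```python
from urllib.parse import urlencode


def htsearch_url(base, words, restrict=None):
    """Return a GET URL for htsearch, optionally limited to a URL prefix."""
    params = {"words": words}
    if restrict:
        # Only hits whose URL matches this pattern should come back.
        params["restrict"] = restrict
    return base + "?" + urlencode(params)


# Hypothetical CGI location; the search page under /x restricts hits to /x.
print(htsearch_url("http://foo.com/cgi-bin/htsearch", "apache", "http://foo.com/x/"))
```

In an HTML form the same thing is normally done with a hidden input named "restrict", so one config file and one database can serve every section of the site.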


--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/







Re: [htdig] Last modified date revisited - Apache

2000-10-03 Thread Gilles Detillieux

According to Roger Weiss:
> From what I can glean, there are 2 ways to get this.
> 
> Either by putting an echo command into the html files (SSI),
> or by setting xbithack=full and setting the executable bits on for group and
> user.

No, the XBitHack turns .html files with execute permission into SSI
files (equivalent to .shtml), and for SSI files, Apache does NOT put
out a Last-Modified header because SSI generates dynamic content.  To my
knowledge, you can't put out HTTP headers from an SSI file, so I don't
think this is the way to go.

In fact, the behaviour you describe, i.e. no Last-Modified header for
supposedly static .html files, suggests to me that your Apache server
is set to use a server-parsed handler for .html files, just as it is
normally configured by default for .shtml files.  Take a close look
at your *.conf files and any relevant .htaccess file for AddHandler
directives.
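
To make that check less tedious, here is a small Python sketch that scans the text of a config or .htaccess file for AddHandler lines mapping .html or .htm to the server-parsed handler. The function name and the sample text are made up for illustration, and it only looks at AddHandler; it does not follow Include directives or understand container sections.

```python
def find_server_parsed_html(config_text):
    """Return (line_number, line) pairs where an AddHandler directive
    maps .html or .htm files to the server-parsed (SSI) handler."""
    hits = []
    for number, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) < 3 or parts[0].lower() != "addhandler":
            continue
        handler, extensions = parts[1].lower(), parts[2:]
        if handler != "server-parsed":
            continue
        # Extensions may be written with or without the leading dot.
        if any(ext.lower().lstrip(".") in ("html", "htm") for ext in extensions):
            hits.append((number, line))
    return hits


sample = """\
# SSI for .shtml only would be harmless:
AddHandler server-parsed .shtml
AddHandler server-parsed .html
"""
for number, line in find_server_parsed_html(sample):
    print(number, line)
```

Any line it reports is a candidate for why static .html pages come back without a Last-Modified header.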

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930






Re: [htdig] ... but not changed

2000-10-03 Thread Gilles Detillieux

According to David Adams:
> When, during an update run, htdig says of a page: "retrieved but not
> changed", how does htdig decide that the page is the same as the last time?
> 
> An author is maintaining that she added a link to a page and that an update
> run of htdig failed to follow the new link(s) she had added.

The retrieved but not changed message occurs when the web server ignores
the "If-Modified-Since" header that htdig sends it, and sends the page
anyway, but htdig sees that the Last-Modified header contains the exact
same date it did last time the document was indexed.

I would check the modification time on the document, and if it's wrong,
correct it.  You may also want to check the clock on the web server
and/or on the system where the file was edited.

Another possibility, but I'm not sure about this one, is that the server
isn't returning a Last-Modified header at all, so the DocTime field is
0 for both the old and new versions.  You can confirm this by seeing if
the modification time shows up for this document in htsearch results.
It doesn't if the field is 0.  If this is the case, you should set
modification_time_is_now to true.
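
The decision described above can be sketched in a few lines of Python. This is only an illustration of the logic as explained in this message, not htdig's actual source; the function names are hypothetical, and treating a missing header as time 0 follows the unconfirmed second possibility.

```python
from email.utils import parsedate_to_datetime


def header_to_epoch(value):
    """Convert a Last-Modified header value to epoch seconds (None if absent)."""
    if not value:
        return None
    return int(parsedate_to_datetime(value).timestamp())


def classify_fetch(status, last_modified, stored_doc_time):
    """Classify one fetch during an update run.

    status          -- HTTP status code of the response
    last_modified   -- Last-Modified as epoch seconds, or None if absent
    stored_doc_time -- time recorded at the previous index run (0 if none)
    """
    if status == 304:
        # The server honoured If-Modified-Since and sent no body.
        return "not modified"
    new_time = last_modified if last_modified is not None else 0
    if new_time == stored_doc_time:
        # The page was sent anyway, but its date matches the stored one.
        return "retrieved but not changed"
    return "changed"
```

Under these assumptions a page with no Last-Modified header at all compares 0 against 0 on every update run, so it would be reported as unchanged no matter what was edited.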

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930






[htdig] from IIS searching to HtDig

2000-10-03 Thread Frances Santiago

I am switching a site running on MS IIS to Unix Apache. In order to make
the transition as seamless as possible I will need htdig to mirror the way
the IIS search engine works. In the .idq file the following is listed:

# CiScope is the directory (virtual or real) under which results are
# returned.  If a file matches the query but is not in a directory beneath
# CiScope, it is not returned in the result set.
# A scope of / means all hits matching the query are returned.

How I understand it is that if I have a search page under http://foo.com/x
the search will return hits in the files recursively from x - not the files
from http://foo.com. Can htdig do this without using a different config
file and db for each? 

I have never taken the time to learn IIS - so I could be misunderstanding
how this works. If anyone has experience transferring from IIS to
Apache/HtDig I would be very interested in hearing from you.

Thanks,
Frances Santiago








Re: [htdig] ... but not changed

2000-10-03 Thread Geoff Hutchison

On Tue, 3 Oct 2000, David Adams wrote:

> When, during an update run, htdig says of a page: "retrieved but not
> changed", how does htdig decide that the page is the same as the last time?

It checks the date it received from the server (if present) against the
date in the database. If they're the same, it ignores the file.

> An author is maintaining that she added a link to a page and that an update
> run of htdig failed to follow the new link(s) she had added.

Are these static or dynamic pages? If the server is not returning
Last-Modified headers, then this could be the problem.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/








Re: [htdig] Last modified date revisited - Apache

2000-10-03 Thread Geoff Hutchison

On Tue, 3 Oct 2000, Roger Weiss wrote:

> Neither approach is good for us.  We have many static html pages, and many
> more being created every day.
> It's not feasible for us to put the extra code into each html file, or to
> change the x bit on each file as well.
> 

I'm going to be completely honest. I've *never* had this problem on an
Apache installation. Out of the box, Apache serves up Last-Modified
headers for all static HTML pages for any version of Apache I've used.
(Please note, I compile my own Apache and don't include things like
mod_usertrack.)

> It seems like a pretty simple request, and yet we can't find any answer in
> the apache docs or htdig.

Since people seem to have this problem with increasing frequency, why
don't you post some information about your Apache configuration? Include
as much as you can.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/







[htdig] Last modified date revisited - Apache

2000-10-03 Thread Roger Weiss

From what I can glean, there are 2 ways to get this.

Either by putting an echo command into the html files (SSI),
or by setting xbithack=full and setting the executable bits on for group and
user.

Neither approach is good for us.  We have many static html pages, and many
more being created every day.
It's not feasible for us to put the extra code into each html file, or to
change the x bit on each file as well.

IS THERE ANY OTHER WAY.??
It seems like a pretty simple request, and yet we can't find any answer in
the apache docs or htdig.
It seems to be more of an apache issue, but its my HTdig that needs the
info!
Hasn't anybody had a similar problem? Any assistance would be welcome.

Thanks,
Roger


Roger Weiss
[EMAIL PROTECTED]
(978) 318-7301
http://www.trellix.com







[htdig] ... but not changed

2000-10-03 Thread David Adams

A simple query (I hope).

When, during an update run, htdig says of a page: "retrieved but not
changed", how does htdig decide that the page is the same as the last time?

An author is maintaining that she added a link to a page and that an update
run of htdig failed to follow the new link(s) she had added.

-- 
 
David Adams
<[EMAIL PROTECTED]>
Computing Services
University of Southampton






Re: [htdig] Problems with PDF files

2000-10-03 Thread David Adams

> 
> Hello,
> 
> > 
> > At 10:06 AM +0200 10/3/00, Martin Mielke wrote:
> > >Error (0): PDF file is damaged - attempting to reconstruct xref table...
> > >Error: Couldn't find trailer dictionary
> > >Error: Couldn't read xref table
> > >Error (0): PDF file is damaged - attempting to reconstruct xref table...
> > >Error: Couldn't find trailer dictionary
> > >Error: Couldn't read xref table
> > >Error (139803): Bad colorspace
> > 
> > There are a few possibilities. One is that your max_doc_size 
> > attribute is too small for your PDF files and so they're being 
> > truncated.
> > 
> > 
> > 
> 
> the max_doc_size is greater than the biggest PDF file actually

I would double-check that; the symptoms are precisely those you
get when max_doc_size is too small.

> 
> > The other possibility, as the messages say, is that one or more of 
> > your PDF files is actually damaged. In that case, the best thing to 
> > do is to run htdig with more debugging turned on and send both STDOUT 
> > and STDERR to a file to peruse. Obviously the files reported just 
> > before this output would be ones to check.
> 
> All PDFs are readable or, at least, Acrobat Reader 4.x doesn't complain ...

-- 
 
David J Adams
<[EMAIL PROTECTED]>
Computing Services
University of Southampton






RE: [htdig] Problems with PDF files

2000-10-03 Thread Martin Mielke

Hello,

> 
> At 10:06 AM +0200 10/3/00, Martin Mielke wrote:
> >Error (0): PDF file is damaged - attempting to reconstruct xref table...
> >Error: Couldn't find trailer dictionary
> >Error: Couldn't read xref table
> >Error (0): PDF file is damaged - attempting to reconstruct xref table...
> >Error: Couldn't find trailer dictionary
> >Error: Couldn't read xref table
> >Error (139803): Bad colorspace
> 
> There are a few possibilities. One is that your max_doc_size 
> attribute is too small for your PDF files and so they're being 
> truncated.
> 
> 
> 

The max_doc_size is actually greater than the biggest PDF file.

> The other possibility, as the messages say, is that one or more of 
> your PDF files is actually damaged. In that case, the best thing to 
> do is to run htdig with more debugging turned on and send both STDOUT 
> and STDERR to a file to peruse. Obviously the files reported just 
> before this output would be ones to check.

All PDFs are readable or, at least, Acrobat Reader 4.x doesn't complain ...








Re: [htdig] Analyzer script for access_log

2000-10-03 Thread Bill Carlson

On Tue, 3 Oct 2000, Charles Nepote wrote:

> And may be a better choice than Webalizer as it can give stats on search
> words (which Webalizer cannot).

Webalizer does support search words, as of 1.3.0 at least. Haven't looked
at the latest versions of analog (stuck on an ancient version), but
Webalizer does a good job of getting the stats out. And the source is
fairly good too, easy to modify.

$.02

Bill Carlson

Systems Programmer                       [EMAIL PROTECTED]   |  Opinions are mine,
Virtual Hospital                         http://www.vh.org/  |  not my employer's.
University of Iowa Hospitals and Clinics                     |







Re: [htdig] Question about Htdig's database

2000-10-03 Thread Geoff Hutchison

At 10:40 AM +0100 10/3/00, Adam Rice wrote:
>Geoff Hutchison wrote:
>  > > Second question -- can I use such a database from my own Perl script?
>  >
>  > Can you use a Berkeley DB? Sure, use the DBI interface--it should be part
>  > of your Perl 5 installation.
>
>No. Well, maybe there's a DBI driver that works with Berkeley DB, but I
>recommend using the DB_File module instead.
>
>use DB_File;
>tie %something, "DB_File", "a_file_on_disk.db";
>
>that's all the code you need to make the %something hash be stored on
>disk using Berkeley DB. There are some restrictions compared to a normal
>Perl hash, and some more clever stuff you can do, see the DB_File
>manpage for details. AFAIK, a complete Perl installation on most modern
>Linux systems will include the DB_File module, otherwise you will have
>to mess around getting it from CPAN.
>
>Adam

Yes, I misspoke. I meant to say DB_File, not DBI. Thanks for correcting me.

Cheers,

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/






Re: [htdig] Problems with PDF files

2000-10-03 Thread Geoff Hutchison

At 10:06 AM +0200 10/3/00, Martin Mielke wrote:
>Error (0): PDF file is damaged - attempting to reconstruct xref table...
>Error: Couldn't find trailer dictionary
>Error: Couldn't read xref table
>Error (0): PDF file is damaged - attempting to reconstruct xref table...
>Error: Couldn't find trailer dictionary
>Error: Couldn't read xref table
>Error (139803): Bad colorspace

There are a few possibilities. One is that your max_doc_size 
attribute is too small for your PDF files and so they're being 
truncated.



The other possibility, as the messages say, is that one or more of 
your PDF files is actually damaged. In that case, the best thing to 
do is to run htdig with more debugging turned on and send both STDOUT 
and STDERR to a file to peruse. Obviously the files reported just 
before this output would be ones to check.
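
Both possibilities can be checked mechanically. The Python sketch below relies on one fact about the format: a complete PDF carries a %%EOF marker near its end, which is where a reader starts looking for the trailer and xref table, so a file cut off at max_doc_size loses exactly the parts the errors complain about. The function names, the sample bytes and the 2048-byte limit are made up for illustration.

```python
def looks_truncated(data):
    """True if the bytes lack the %%EOF marker a complete PDF ends with."""
    return b"%%EOF" not in data[-1024:]


def would_be_cut(size_in_bytes, max_doc_size):
    """True if a document of this size exceeds the configured limit."""
    return size_in_bytes > max_doc_size


# A stand-in for a real file: a tiny PDF-like byte string, then the same
# bytes cut off the way a too-small max_doc_size would cut them.
whole = b"%PDF-1.3\n" + b"x" * 5000 + b"\ntrailer\n<< >>\nstartxref\n0\n%%EOF\n"
limit = 2048  # hypothetical max_doc_size in bytes
cut = whole[:limit]
print(would_be_cut(len(whole), limit), looks_truncated(whole), looks_truncated(cut))
```

Running would_be_cut over the real file sizes, and looks_truncated over the files as stored on disk, separates a limit that is too small from a file that is genuinely damaged at the source.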

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/






Re: [htdig] Analyzer script for access_log

2000-10-03 Thread Kapil Biyani

Has anybody tried Summary?
It is a little costly, but being shareware you can use it for a month to
test it, and believe me, you won't feel like using anything else after
Summary. It gives you around 118+ reports, with more included in the new
version.

Check it out at http://www.summary.net
The author, Jason T. Linhart, is always available on the Summary mailing
list for any queries or suggestions.

Give it a try. Again, it is a binary program, but worth it.

Signing off
Kaps

 When I read about the evils of drinking,
 I gave up reading.

Kapil Biyani - India Infoline.com  http://www.indiainfoline.com?sig
Net Prodigy - Your Date with the Net - http://www.indiainfoline.com/week/netp.html?sig

----- Original Message -----
From: "Charles Nepote" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Tuesday, October 03, 2000 1:44 PM
Subject: [htdig] Analyzer script for access_log


|
|
| Todd Wallace ([EMAIL PROTECTED]) asked :
|
| > Does anyone have a nice analyzer script for the access_log that apache
| > produces? Preferably a Perl script.
| >
| > Thanks,
| > Todd Wallace
|
|
| I think Analog is a good choice (it is not a perl script).
| http://www.analog.cx
|
| And may be a better choice than Webalizer as it can give stats on search
| words (which Webalizer cannot).
| Analog is *very* powerful, but it can be installed and put to use in
| little time. By excluding or including some data from Apache's log, you
| can build very precise stats (centered only on one service, for example
| a search engine).
| It may be too powerful for you, and the documentation is not always
| very clear.
| Analog is less sexy than Webalizer.
|
| Charles Népote.
|
| 







[htdig] Analyzer script for access_log

2000-10-03 Thread Charles Nepote



Todd Wallace ([EMAIL PROTECTED]) asked :
 
> Does anyone have a nice analyzer script for the access_log that apache
> produces? Preferably a Perl script.
> 
> Thanks,
> Todd Wallace


I think Analog is a good choice (it is not a perl script).
http://www.analog.cx

And may be a better choice than Webalizer as it can give stats on search
words (which Webalizer cannot).
Analog is *very* powerful, but it can be installed and put to use in
little time. By excluding or including some data from Apache's log, you
can build very precise stats (centered only on one service, for example
a search engine).
It may be too powerful for you, and the documentation is not always
very clear.
Analog is less sexy than Webalizer.

Charles Népote.






[htdig] Problems with PDF files

2000-10-03 Thread Martin Mielke

Dear all,

Indexing the database from a crontab generates (short) emails like this:

--8<--8<--8<--

Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
Error (139803): Bad colorspace

--8<--8<--8<--

Is there any way to correct it? What's going wrong? :-/


