Re: [htdig] Can htdig kill Linux?

2000-12-07 Thread Bill Carlson

On Wed, 6 Dec 2000, Clint Gilders wrote:

 David Gewirtz wrote:
 
  I just love getting to know new software. There's always some form of
  teething pain. Yesterday, I started running my first set of reasonably
  large htdig/htmerge processes. Came in today to find the Linux server
  (which is running nothing besides basic Mandrake processes and, of course,
  htdig) was deader than a doornail (have to say "deader than" because saying
  "hung more than" would just be too weird).

   I use Mandrake at home and love it, but have nothing but problems with
 it in a server environment.  Our lone Linux server (the rest are FreeBSD)
 has been crashing daily (hanging, no telnet, no ftp, etc.) since we
 installed apache/mod_ssl.  Even before that it wasn't the most reliable
 box going.  If you are going to continue to use it in a production
 environment I suggest not running X or KDE, as these can eat up 60% of
 your CPU.

   We have indexed well over 200,000 documents with htdig running on a
 single FreeBSD machine without so much as a hiccup.

 Almost makes me wish for NT.
 Be careful what you wish for!  You just might get it.   Ahh!!! The
 horror.
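The quoted advice about keeping X and KDE off a server box can be made permanent on Red Hat-style distros (Mandrake included) by booting straight to a text-mode runlevel; this is a generic config fragment, not htdig-specific:

```
# /etc/inittab -- boot to runlevel 3 (multi-user, no X) instead of 5:
id:3:initdefault:
```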

I can say from experience, the only times I've crashed a Linux box has
been due to faulty hardware or faulty admin. There might be times when the
system is so loaded that it might take 2 minutes to login, but login it
eventually does. The few times where even login wouldn't work have been
admin error, things like accidentally writing memory bombs or letting
file systems get full.

Now, having also run htdig for quite a while, here are the things that
could cause a box to become overloaded and die:

* running htmerge where TMPDIR points to a file system that is too
small. When sort runs it fills the file system. Worse, people usually
run the dig as root, so even the blocks reserved for root get used and
the file system fills completely. If this happens to be the / file
system, well, things get very ugly when / is full.

* running htdig against a large number of pages and filling up / .
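Both failure modes above can be caught with a small wrapper around the merge step. This is only a sketch: the paths are hypothetical and the 1 MB threshold is a stand-in you would size to your expected sort scratch usage.

```shell
# Point sort's scratch space at a file system with room to spare, and
# refuse to run if it looks tight (need_kb here is a stand-in value).
TMPDIR=/var/tmp; export TMPDIR
need_kb=1024
avail_kb=`df -Pk "$TMPDIR" | awk 'NR==2 {print $4}'`
if [ "$avail_kb" -lt "$need_kb" ]; then
    echo "only ${avail_kb}KB free in $TMPDIR, skipping merge" >&2
    exit 1
fi
# The real run would be something like (path is an example):
echo "would run: htmerge -c /opt/htdig/conf/htdig.conf"
```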

First, I would verify the hardware. The test of choice is still compiling
the kernel; this really does exercise the system more than anything else
(to really have fun, compile several kernels at once, or raise the -j
parameter for make in the Makefile). I had a machine that could not
compile a kernel but otherwise ran fine. Turned out the CPU was
overheating, but only when it was really pushed.

So, compile a kernel or two and then start looking at htdig again.
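The burn-in described above amounts to something like the following. The source path assumes a 2.2-era kernel tree and the -j value is a guess; the commands are shown as echoes so nothing here actually builds.

```shell
# Kernel-compile stress test sketch: load CPU, RAM, and disk at once.
JOBS=4
echo "cd /usr/src/linux && make dep && make -j$JOBS bzImage"
# To really push the box, run several builds in parallel:
echo "for i in 1 2 3; do make -j$JOBS bzImage & done; wait"
```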

$.02

Bill Carlson
-- 
Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|



To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




Re: [htdig] Can htdig kill Linux? (redux)

2000-12-07 Thread Bill Carlson

On Wed, 6 Dec 2000, David Gewirtz wrote:


 Well, I can't be sure what caused it, but the end result was that the
 crash left some serious filesystem errors. I did an fsck and the filesystem
 now seems better, but there are a heck of a lot of lost+found nodes.

 So, here are my questions (could be Linux-newbie questions, sorry):

 * Is there a way to tell what files got chomped by the fsck and have
 lost+found nodes?
 * Is there a way to check a log for htdig?
 * Is an fsck -f -y good enough, or should I reformat and reinstall the hard
 drive?


If the machine goes down while there is a lot going on in the file system,
file changes that are in the memory cache don't get written to disk and
that is what fsck cleans up.

Generally, those lost+found nodes are going to be the files that were
being written at the time of the crash. In most cases, these will be
working files or something along those lines. If you're running an
RPM-based distro, I'd run rpm -Va and see if you're missing any files
(check the man page for rpm; this command will also list alterations
you have made to some files).

The last thing is to examine the files in lost+found. Use less on them,
then file(1) if the contents don't make any sense.

Finally, reformatting and reinstalling is a bad habit, break it if you
can. You'll learn much more by trying to fix things rather than reinstall.

Contrary to Windows, with Linux you CAN fix these types of things. :)

Good Luck,

Bill Carlson
-- 
Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|








Re: [htdig] SQL handling start_url

2000-12-07 Thread Bill Carlson

On Wed, 6 Dec 2000, Curtis Ireland wrote:

 2) Before htDig starts its database build, dump all the links to a text
 file and have the htdig.conf include this file

 The one problem with these two solutions is how would the limit_urls_to
 variable work? I want to make sure the links are properly indexed
 without going past the linked site.

This is the method I used, though in my case the backend was an email full
of links from the person directing the crawl. :)

Write 2 files, one for start_url and one for limit_urls, include both in
the conf file like so:

start_url:  `/home/htdig/conf/start_url_file`

limit_urls_to:  `/home/htdig/conf/limit_url_file`


The contents of both files are just links.
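One way to feed those two files from a master link list (a stand-in for the SQL dump; all paths here are demo values) is a small pre-dig step:

```shell
# Build start_url_file and limit_url_file from one list of links.
mkdir -p ./htdig-demo
cat > ./htdig-demo/links.txt <<'EOF'
http://site-one.example.com/
http://site-two.example.com/docs/
http://site-one.example.com/
EOF
# Dedupe into the file start_url points at...
sort -u ./htdig-demo/links.txt > ./htdig-demo/start_url_file
# ...and limit the dig to the same set of prefixes.
cp ./htdig-demo/start_url_file ./htdig-demo/limit_url_file
```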

Good Luck,

Bill Carlson
-- 
Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|







Re: [htdig] Analyzer script for access_log

2000-10-03 Thread Bill Carlson

On Tue, 3 Oct 2000, Charles Nepote wrote:

 And may be a better choice than Webalizer as it can give stats on search
 words (which Webalizer cannot).

Webalizer does support search words, as of 1.3.0 at least. Haven't looked
at the latest versions of analog (stuck on an ancient version), but
Webalizer does a good job of getting the stats out. And the source is
fairly good too, easy to modify.

$.02

Bill Carlson

Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|







RE: [htdig] URL substitution (or other solution)

2000-05-25 Thread Bill Carlson

On Wed, 24 May 2000, Bruce Fancher wrote:

 Is everything else I did, like the url_part_alias: config entries, correct?
 
 Thanks

Hey Bruce,

The parameter is url_part_aliases. Add the 'es' in each config file and
you should be set.

Bill Carlson

Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|








Re: [htdig] Introductory questions

2000-03-17 Thread Bill Carlson

On Thu, 16 Mar 2000, Gary Day wrote:

 I almost never join a list and immediately send a question without first
 listening in a bit but I need some information.
 
 I've installed htdig on a RedHat 6.0 Linux server.  Everything runs just
 dandy (one of the simplest compiles and installs I've ever done).
 
 I've read the docs.
 
 1. In your experience, how scalable is htdig?  I'm just using it to
 prototype a "community" search engine now so it will be fine for now, but
 will it scale to 5000 sites if I have the disk space?  So far, it looks
 like it should but the digging time may be a while.
 
 2. It looks like it is clearly possible to just reindex one site without
 all the rest.  Is that correct?  Currently when I do it, no matter what is
 in my config file, it at least confirms all the existing sites/urls as
 well as the one in the config file.

Hi Gary,

I'll bite on this one. How well it scales depends more on the number of
documents than the number of sites. For example, my site has something like
25,000 pages and htdig does a great job. I know others are indexing much
more than that; I don't know what kind of hardware they are using.

When indexing, it is possible to merge separate digs into one large
database. It's all a matter of planning and reading the fine print in the
documentation (which is excellent).

Building a scaling solution is always very iffy and challenging, but I
think with htdig you've got a great start.

Bill Carlson

Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|







Re: [htdig] How to change the directory for star.gif

2000-03-15 Thread Bill Carlson

On Tue, 14 Mar 2000, wenlong wrote:

 Hello, All,
 
 I'm trying to use a different directory for the htdig program. Everything
 works fine except the star won't show up to indicate the match status. The
 default directory for star.gif is /htdig/star.gif.  Which file
 should I modify to change the directory of star.gif to, say,
 "temp/htdig/star.gif"?
 

You need to either provide templates other than the "builtin" ones in your
conf file or recompile htdig after modifying CONFIG (look for IMAGE).

I'd use the builtin templates if you need speed, otherwise make your own
custom ones.
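If you do go the custom-template route, the conf-side hookup looks roughly like this. The attribute name template_map is from the htdig docs, but the "Fancy" label and the path are made up; check the exact syntax against your version.

```
template_map: Long builtin-long builtin-long \
              Fancy fancy /opt/htdig/templates/fancy.html
```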

Bill Carlson

Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|







[htdig] Databases on different Platforms?

2000-03-14 Thread Bill Carlson

Hey all,

I'm just starting to work on this and wanted to check with you all before
I got too involved.

Are the databases transportable across platforms? I.e., if I dig on a Sun
box, should I be able to move the resulting dbs to an Intel box and expect
htdig to work?

Thanks!

Bill Carlson

Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|







Re: [htdig] htdig seg. fault (core dumped) on cobalt raq

2000-03-13 Thread Bill Carlson

On Fri, 10 Mar 2000, Gilles Detillieux wrote:

 According to atta dubson:
  could anyone send me a precompiled binary of the program htdig for a
  cobalt raq server (linux version 2.0.34C52_SK on mips), or maybe any
  linux-mips binary?  i have read the new faq, but i am not able to upgrade
  the server i need it on.
  
  i just need that one program which core-dumps, not the whole collection.
 
 If you can't get past the htdig stage, are you really sure that htmerge,
 htfuzzy and htsearch won't dump core as well?  If anyone out there is
 running MIPS Linux, and has rpm installed, it would be great if they
 grabbed a copy of
 
   http://www.htdig.org/files/binaries/htdig-3.1.5-0.src.rpm
 
 and rebuilt it (rpm --rebuild htdig-3.1.5-0.src.rpm), and sent us the
 binary RPM for others to use.  If I'm not mistaken, Cobalt servers do
 support rpm.

Unfortunately, the different Cobalt products tend to have different
versions of libraries installed, so a pre-compiled RPM doesn't help.

For a start on the RaQ, upgrade to the latest glibc and ldconfig; these
versions from Cobalt's site got me up and running:

glibc-2.0.7-29C2
ldconfig-1.9.5-2

I believe the fix in ldconfig is what stops the core dump from htdig.

HTH,

Bill Carlson

Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|







Re: [htdig] virtual domain name searches

2000-01-11 Thread Bill Carlson

On Mon, 10 Jan 2000, Jake Johnson wrote:

 is it possible to have one large database while searching individually for
 four separate virtual domains?
 


Sure. When you rundig, use a configuration file that lists all the virtual
sites; this will index them all.

Then for your searches, look at the restrict field for htsearch. A tag
like this will do what you need:

<input type="hidden" name="restrict" value="virtualhost.com/">

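In context, a minimal per-domain search form might look like this (the /cgi-bin/htsearch path is an assumption; use wherever your htsearch actually lives):

```html
<form method="get" action="/cgi-bin/htsearch">
  <input type="hidden" name="restrict" value="virtualhost.com/">
  <input type="text" name="words">
  <input type="submit" value="Search">
</form>
```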

Bill Carlson

Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|







Re: [htdig] limited search on domains

1999-10-25 Thread Bill Carlson


On Mon, 25 Oct 1999, Florin Andrei wrote:

 
 
   I have many domains (and websites) hosted on the same machine (with virtual
 servers, in Apache). How can I create different search forms, with the search
 scope limited to a certain domain or a certain site?


Look at the htsearch documentation, specifically the restrict tag.

Include something like:

<input name="restrict" type="hidden"
value="http://some.host/Search/Here/Only/">

in your search form. 

Bill Carlson

Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|







Re: [htdig] url_part_aliases - howto needed

1999-09-29 Thread Bill Carlson


On Tue, 28 Sep 1999 [EMAIL PROTECTED] wrote:

 
 Howdy!
 
 Sorry, if this topic is discussed various times already, but digging thru
 the mail archive didn't bring me any further.
 I simply don't understand the documentation for url_part_aliases.
 
 I made an index with htdig and when I search it, it returns my machine name
 in the URLs. Works fine locally, but unfortunately no other host on the LAN
 knows my name and can access my apache webserver only through the IP.
 So I would like htdig to put the IP in the URL whenever it encounters the
 name in the index database.
 Putting the following 2 lines in my htdig.conf simply doesn't work:
 
 url_part_aliases: http://myname/
 url_part_aliases: http://12.34.56.78/
 
 I understood the documentation in a way that the first url_part_aliases
 line contains the from-part and the second the to-part. Obviously I'm
 wrong, but what would be right?
 Is it possible to change the URLs on the fly or do I have to rebuild my
 database? How is that done?


Hey Kai,

This one is not covered by the docs, though I understand 3.2 will have
clearer documentation.

To get this to work requires 2 different conf files. One conf is used when
digging, the other when searching. The only difference between the
2 files would be the url_part_aliases line, and it would run like this:

Dig conf:
url_part_aliases: http://myname/ *1

Search conf:
url_part_aliases: http://12.34.56.78/ *1

Think of the *1 as a placeholder which is set when digging and replaced
when searching.
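A sketch of the whole round trip, with hypothetical file names (only the url_part_aliases line differs between the two confs; everything else would be identical):

```shell
# Write the digging conf and the searching conf side by side.
cat > dig.conf <<'EOF'
url_part_aliases: http://myname/ *1
EOF
cat > search.conf <<'EOF'
url_part_aliases: http://12.34.56.78/ *1
EOF
# Dig with the first; htsearch must be pointed at the second so that
# *1 expands back to the IP at search time.
echo "htdig -c dig.conf && htmerge -c dig.conf"
```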

HTH,

Bill Carlson

Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|







Re: [htdig] Question about htdig scope

1999-09-28 Thread Bill Carlson


On Tue, 28 Sep 1999, Greg W wrote:

 
 So how could you force it to look at all docs in a directory and under it?
 I would like to be able to dump docs at random into a directory without
 having to have it organised.
 

On your webserver, turn on indexes. Htdig will treat the 'index' page as a
regular page and follow the links to all files and subdirectories.

For Apache, this is done with the Options Indexes configuration directive.
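For a specific directory rather than the whole server, the Apache side would be something like this (the path is an example):

```
<Directory /home/httpd/html/dropbox>
    Options Indexes
</Directory>
```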

Bill Carlson

Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|







RE: [htdig] Merging Databases

1999-09-21 Thread Bill Carlson


Hello again,

Are there some configuration options that would prevent the merge from
working? I used -vv when running htmerge; I get a line saying merging,
but then no URLs are listed as merged. I have run a test merge with
different database sets; it works just fine and the verbose output is what
I would expect:

..
htmerge: Merged URL:
http://ryoko.radiology.uiowa.edu/manual/sourcereorg.html
htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/stopping.html
htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/suexec.html
htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/unixware.html
htmerge: Merged URL:
http://ryoko.radiology.uiowa.edu/manual/upgrading_to_1_3.html
htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/vhosts/
htmerge: Merged URL:
http://ryoko.radiology.uiowa.edu/manual/vhosts/details.html
htmerge: Merged URL:
http://ryoko.radiology.uiowa.edu/manual/vhosts/examples.html
htmerge: Merged URL:
http://ryoko.radiology.uiowa.edu/manual/vhosts/fd-limits.html
htmerge: Merged URL:
http://ryoko.radiology.uiowa.edu/manual/vhosts/ip-based.html
htmerge: Merged URL:
http://ryoko.radiology.uiowa.edu/manual/vhosts/name-based.html
htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/windows.html
htmerge: Sorting...
htmerge: Merging...
htmerge: 100:06b  
htmerge: 200:0x00f6  
htmerge: 300:0x0c6  
htmerge: 400:0x1cf  
..


Now, on the two sets I want to merge ( not the above test sets ), all I
get is:

htmerge: Sorting...
htmerge: doc #2145 has been superceeded.
htmerge: Merging...
htmerge: Removing doc #102
htmerge: Removing doc #10538
htmerge: Removing doc #10626
htmerge: Removing doc #10653
..

I ran htmerge with the following:

htmerge -c crawl.conf -m main.conf -vvv | tee htmerge.log

It sat there for the longest time, then started with htmerge: Sorting.

Is there some other option I can use to get more information about what is
going on?

FWIW, these databases are fairly large, 200 MB and 600 MB (hence why I am
so eager to merge : )

Thanks,


Bill Carlson

Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|








[htdig] Merging Databases

1999-09-20 Thread Bill Carlson


Hello,

I have been trying to get a handle on the merge feature of htmerge with no
success. Would someone mind explaining step by step what the process is?

Here is what I have been doing:

Say I have 2 databases main and crawl. I need searches that get main or
main and crawl. So I:

dig main (stock rundig)
htdig crawl
htmerge crawl with '-m main.conf'
rundig skipping htdig and htmerge

The merge doesn't seem to happen. Search on main works fine, search on
crawl only returns crawl hits. What am I doing wrong?

Setup:
HtDig 3.1.2
Solaris 2.6

Any help appreciated.

Bill Carlson

Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|







Re: [htdig] Merging Databases

1999-09-20 Thread Bill Carlson


On Mon, 20 Sep 1999, Geoff Hutchison wrote:

 The merge doesn't seem to happen. Search on main works fine, search on
 crawl only returns crawl hits. What am I doing wrong?
 
 OK, since you didn't specify conf files, I'll go through the syntax 
 exactly (obviously you may use additional flags):
 
 htdig -c main.conf
 htmerge -c main.conf
 htdig -c crawl.conf
 htmerge -m main.conf -c crawl.conf
 
 That's it. This will merge main into crawl, performing the normal 
 htmerge runs as needed.

Ok, that is essentially what I am doing:

rundig -c main.conf
rundig -m main.conf -c crawl.conf -skipdig

where I modified rundig to take -skipdig and skip the htdig and htmerge
portion.
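The rundig tweak mentioned above amounts to something like this (a hypothetical reconstruction, not the actual patch; the real dig/merge commands are shown as an echo):

```shell
# Inside rundig: recognize -skipdig and bypass the dig/merge stage.
SKIPDIG=no
for arg in "$@"; do
    case "$arg" in
        -skipdig) SKIPDIG=yes ;;
    esac
done
if [ "$SKIPDIG" = "yes" ]; then
    echo "skipping htdig and htmerge"
else
    echo "would run: htdig ... && htmerge ..."
fi
```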

How can I verify that the merge happens other than performing a search?

What can I lookup for in the verbose output? In one of the databases?

Bill Carlson

Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|







[htdig] doclist, perl db.docdb access

1999-09-10 Thread Bill Carlson


Hello,

I am stumbling into some problems using any of the contrib perl scripts. I
understand that various fields have been added to the docdb that aren't in
some of the scripts; I have accounted for those.

I have access to the database, but the hashed information doesn't seem
right. For example, the key should be the URL in question, yet when
running doclist.pl for example, the output is something like:

^Gwww.somewhere.org/index.html^S

where those are control characters that only show when piping through
less.

I modified the script to use BerkeleyDB instead of GDBM_File, but no
change.

Any pointers?

Thanks,

Bill Carlson

Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|







Re: [htdig] compile issues

1999-09-08 Thread Bill Carlson


On Wed, 8 Sep 1999, Daniel Stringfield wrote:

 
 I'm trying to install htdig on a Cobalt Qube 2700WG.  When I run
 ./configure, I get to the point at which it checks for a c++ compiler.
 It states that the c++ compiler can not create executables and stops.
 
 After looking through the online FAQs for htdig, I did see that you need
 libstdc++ which wasn't installed. I got it installed, and that does not
 fix it.  For those of you not familiar with the Cobalt Qube, it runs a
 modified version of Redhat Linux, on a MIPS cpu.


Daniel,

Glad you hit this before I did, I was about to try htdig on a qube.

At any rate, to compile htdig you need a C++ compiler (such as g++), not
just libstdc++. I would guess there is an rpm on Cobalt's site for g++, if
not for htdig itself.
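A quick pre-configure sanity check along these lines tells you which camp you're in (a sketch; it only tries to compile a trivial program, the same thing configure's failing test does):

```shell
# Can we actually compile and link a C++ program?
cat > /tmp/conftest.cc <<'EOF'
int main() { return 0; }
EOF
if command -v g++ >/dev/null 2>&1 && g++ /tmp/conftest.cc -o /tmp/conftest 2>/dev/null; then
    CXX_OK=yes
    echo "C++ compiler looks OK"
else
    CXX_OK=no
    echo "no working C++ compiler - install g++, not just libstdc++"
fi
rm -f /tmp/conftest /tmp/conftest.cc
```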

Good Luck! 

Bill Carlson

Systems Programmer[EMAIL PROTECTED]|  Opinions are mine,
Virtual Hospital  http://www.vh.org/|  not my employer's.
University of Iowa Hospitals and Clinics|



