Re: [htdig] Can htdig kill Linux?
On Wed, 6 Dec 2000, Clint Gilders wrote:

> David Gewirtz wrote:
> > I just love getting to know new software. There's always some form of
> > teething pain. Yesterday, I started running my first set of reasonably
> > large htdig/htmerge processes. Came in today to find the Linux server
> > (which is running nothing besides basic Mandrake processes and, of
> > course, htdig) was deader than a doornail (I have to say "deader than"
> > because saying "hung more than" would just be too weird).
>
> I use Mandrake at home and love it, but have had nothing but problems
> with it in a server environment. Our lone Linux server (the rest are
> FreeBSD) has been crashing daily (hanging: no telnet, no ftp, etc.)
> since we installed apache/mod_ssl. Even before that it wasn't the most
> reliable box going. If you are going to continue to use it in a
> production environment, I suggest not running X or KDE, as these can
> eat up 60% of your CPU. We have indexed well over 200,000 documents
> with htdig running on a single FreeBSD machine without so much as a
> hiccup. Almost makes me wish for NT.

Be careful what you wish for! You just might get it. Ahh!!! The horror.

I can say from experience that the only times I've crashed a Linux box
have been due to faulty hardware or faulty admin. There might be times
when the system is so loaded that it takes 2 minutes to log in, but log
in it eventually does. The few times when even login wouldn't work have
been admin error: things like accidentally writing memory bombs or
letting file systems fill up.

Now, having also run htdig for quite a while, here are the things that
could cause a box to become overloaded and die:

* Running htmerge where TMPDIR points to a file system that is too
  small. When sort runs, it fills the file system, which is bad. And
  people usually run the dig as root, which means the file system
  really gets full. If this happens to be the / file system, things get
  very ugly when / is full.

* Running htdig against a large number of pages and filling up /.
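The TMPDIR failure mode above is easy to guard against before starting a
big merge. A minimal sketch (the scratch path and the 2 GB threshold are
my own illustrative choices, not from the thread):

```shell
# Sketch: make sure sort's scratch space won't fill / before running
# htmerge. Paths and the threshold are illustrative assumptions.
TMPDIR=/var/tmp/htdig-sort
export TMPDIR
mkdir -p "$TMPDIR"

# Free kilobytes on the filesystem holding TMPDIR (POSIX df -P output:
# column 4 of the second line is "Available").
free_kb=$(df -P "$TMPDIR" | awk 'NR==2 {print $4}')

if [ "$free_kb" -lt 2097152 ]; then     # less than ~2 GB free
    echo "TMPDIR has only ${free_kb} KB free; aborting htmerge" >&2
    exit 1
fi

htmerge -c /home/htdig/conf/htdig.conf
```

Pointing TMPDIR at a partition other than / also means that even a
runaway sort only fills the scratch partition, not the root filesystem.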
First, I would verify the hardware. The test of choice is still
compiling the kernel; this really does exercise the system more than
anything else (to really have fun, compile several kernels at once, or
raise the -j parameter for make in the Makefile). I had a machine that
could not compile a kernel but otherwise ran fine; it turned out the CPU
was overheating, but only when it was really pushed.

So, compile a kernel or two and then start looking at htdig again.

$.02

Bill Carlson
Systems Programmer  [EMAIL PROTECTED] | Opinions are mine,
Virtual Hospital    http://www.vh.org/ | not my employer's.
University of Iowa Hospitals and Clinics

To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]. You will receive a message to confirm this.
List archives: http://www.htdig.org/mail/menu.html
FAQ: http://www.htdig.org/FAQ.html
Re: [htdig] Can htdig kill Linux? (redux)
On Wed, 6 Dec 2000, David Gewirtz wrote:

> Well, I can't be sure what caused it, but the end result of the Linux
> crash was some serious filesystem errors. I did an fsck and the
> filesystem now seems better, but there are a heck of a lot of
> lost+found nodes. So, here are my questions (could be Linux-newbie
> questions, sorry):
>
> * Is there a way to tell what files got chomped by the fsck and have
>   lost+found nodes?
> * Is there a way to check a log for htdig?
> * Is an fsck -f -y good enough, or should I reformat and reinstall
>   the hard drive?

If the machine goes down while there is a lot going on in the file
system, file changes that are in the memory cache don't get written to
disk, and that is what fsck cleans up. Generally, those lost+found
nodes are going to be the files that were being written at the time of
the crash. In most cases, this will be working files or something along
those lines.

If you're running an RPM-based distro, I'd run rpm -Va and see if
you're missing any files (check the man page for rpm; this command will
also list alterations you have made to some files). The last thing is
to examine those files in lost+found: use less against them, then file
if that doesn't make any sense.

Finally, reformatting and reinstalling is a bad habit; break it if you
can. You'll learn much more by trying to fix things than by
reinstalling. Contrary to Windows, with Linux you CAN fix these types
of things. :)

Good Luck,

Bill Carlson
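The triage sequence described above can be sketched as a short script.
This assumes an ext2-style /lost+found where fsck names recovered inodes
`#<inode-number>`; adjust the path and glob for your filesystem:

```shell
# Sketch: see what fsck dropped into lost+found, assuming ext2
# conventions ('#'-prefixed names for recovered inodes).
cd /lost+found || exit 1

for f in '#'*; do
    [ -e "$f" ] || continue     # the glob matched nothing
    file "$f"                   # guess each recovered file's type
done

# On an RPM-based distro, verify packages to spot files that went
# missing or were altered in the crash (see rpm(8)):
rpm -Va
```

Anything `file` reports as text can then be paged with `less` to see
whether it is worth keeping.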
Re: [htdig] SQL handling start_url
On Wed, 6 Dec 2000, Curtis Ireland wrote:

> 2) Before htDig starts its database build, dump all the links to a
> text file and have the htdig.conf include this file.
>
> The one problem with these two solutions is how the limit_urls_to
> variable would work. I want to make sure the links are properly
> indexed without going past the linked site.

This is the method I used, though in my case the backend was an email
full of links from the person directing the crawl. :)

Write 2 files, one for start_url and one for limit_urls_to, and include
both in the conf file like so:

    start_url: `/home/htdig/conf/start_url_file`
    limit_urls_to: `/home/htdig/conf/limit_url_file`

The contents of both files are just links.

Good Luck,

Bill Carlson
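One way to automate the two files from an SQL backend, as the original
question asked, is to regenerate them before each dig. A sketch: the
mysql invocation and the database/table/column names (`linkdb`, `sites`,
`url`) are illustrative assumptions, not anything from the thread:

```shell
# Sketch: rebuild the start_url/limit_urls_to files from a database,
# then dig. 'linkdb', 'sites', and 'url' are hypothetical names.
conf_dir=/home/htdig/conf

mysql -N -e 'SELECT url FROM sites' linkdb > "$conf_dir/start_url_file"

# Limit the crawl to exactly the sites we start from, so htdig never
# wanders past the linked sites.
cp "$conf_dir/start_url_file" "$conf_dir/limit_url_file"

rundig -c "$conf_dir/htdig.conf"
```

Using the same list for both files is what keeps the crawl from going
past the linked sites, since every start URL is also a limit prefix.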
Re: [htdig] Analyzer script for access_log
On Tue, 3 Oct 2000, Charles Nepote wrote:

> And it may be a better choice than Webalizer, as it can give stats on
> search words (which Webalizer cannot).

Webalizer does support search words, as of 1.3.0 at least. I haven't
looked at the latest versions of analog (I'm stuck on an ancient
version), but Webalizer does a good job of getting the stats out. And
the source is fairly good too, easy to modify.

$.02

Bill Carlson
RE: [htdig] URL substitution (or other solution)
On Wed, 24 May 2000, Bruce Fancher wrote:

> Is everything else I did, like the url_part_alias: config entries,
> correct? Thanks

Hey Bruce,

The parameter is url_part_aliases. Add the 'es' in each config file and
you should be set.

Bill Carlson
Re: [htdig] Introductory questions
On Thu, 16 Mar 2000, Gary Day wrote:

> I almost never join a list and immediately send a question without
> first listening in a bit, but I need some information. I've installed
> htdig on a RedHat 6.0 Linux server. Everything runs just dandy (one
> of the simplest compiles and installs I've ever done). I've read the
> docs.
>
> 1. In your experience, how scalable is htdig? I'm just using it to
> prototype a "community" search engine now, so it will be fine for the
> moment, but will it scale to 5000 sites if I have the disk space? So
> far, it looks like it should, but the digging time may be a while.
>
> 2. It looks like it is clearly possible to reindex just one site
> without all the rest. Is that correct? Currently when I do it, no
> matter what is in my config file, it at least confirms all the
> existing sites/URLs as well as the one in the config file.

Hi Gary,

I'll bite on this one. How well it scales depends more on how many
documents than on how many sites. For example, my site has something
like 25,000 pages and htdig does a great job. I know others are
indexing much more than that; I don't know what kind of hardware they
are using.

When indexing, it is possible to merge separate digs into one large
database. It's all a matter of planning and reading the fine print in
the documentation (which is excellent).

Building a scaling solution is always very iffy and challenging, but I
think with htdig you've got a great start.

Bill Carlson
Re: [htdig] How to change the directory for star.gif
On Tue, 14 Mar 2000, wenlong wrote:

> Hello, all. I'm trying to use a different directory for the htdig
> program. Everything works fine except the star won't show up to
> indicate the match status. The default directory for star.gif is
> /htdig/star.gif. Which file should I modify to change the directory
> of star.gif, such as to "temp/htdig/star.gif"?

You need to either provide templates other than the "builtin" ones in
your conf file, or recompile htdig after modifying CONFIG (look for
IMAGE). I'd use the builtin templates if you need speed; otherwise make
your own custom ones.

Bill Carlson
[htdig] Databases on different Platforms?
Hey all,

I'm just starting to work on this and wanted to check with you all
before I got too involved. Are the databases transportable across
platforms? I.e., if I dig on a Sun box, should I be able to move the
resulting dbs to an Intel box and expect htdig to work?

Thanks!

Bill Carlson
Re: [htdig] htdig seg. fault (core dumped) on cobalt raq
On Fri, 10 Mar 2000, Gilles Detillieux wrote:

> According to atta dubson:
> > could anyone send me a precompiled binary of the program htdig for
> > a cobalt raq server (linux version 2.0.34C52_SK on mips), or maybe
> > any linux-mips binary? i have read the new faq, but i am not able
> > to upgrade the server i need it on. i just need that one program
> > which core-dumps, not the whole collection.
>
> If you can't get past the htdig stage, are you really sure that
> htmerge, htfuzzy and htsearch won't dump core as well? If anyone out
> there is running MIPS Linux and has rpm installed, it would be great
> if they grabbed a copy of
> http://www.htdig.org/files/binaries/htdig-3.1.5-0.src.rpm and rebuilt
> it (rpm --rebuild htdig-3.1.5-0.src.rpm), and sent us the binary RPM
> for others to use. If I'm not mistaken, Cobalt servers do support
> rpm.

Unfortunately, the different Cobalt products tend to have different
versions of libraries installed, so pre-compiling the RPM doesn't help.
For a start on the RaQ, upgrade to the latest glibc and ldconfig; these
versions from Cobalt's site got me up and running:

    glibc-2.0.7-29C2
    ldconfig-1.9.5-2

I believe the fix in ldconfig is what stops the core dump from htdig.

HTH,

Bill Carlson
Re: [htdig] virtual domain name searches
On Mon, 10 Jan 2000, Jake Johnson wrote:

> Is it possible to have one large database while searching
> individually for four separate virtual domains?

Sure. When you run rundig, use a configuration file that lists all the
virtual sites; this will index them all. Then for your searches, look
at the restrict field for htsearch. A tag like this will do what you
need:

    <input type="hidden" name="restrict" value="virtualhost.com/">

Bill Carlson
Re: [htdig] limited search on domains
On Mon, 25 Oct 1999, Florin Andrei wrote:

> I have many domains (and websites) hosted on the same machine (with
> virtual servers, in Apache). How can I create different search forms,
> with the search scope limited to a certain domain or a certain site?

Look at the htsearch documentation, specifically the restrict
parameter. Include something like:

    <input name="restrict" type="hidden" value="http://some.host/Search/Here/Only/">

in your search form.

Bill Carlson
Re: [htdig] url_part_aliases - howto needed
On Tue, 28 Sep 1999, [EMAIL PROTECTED] wrote:

> Howdy! Sorry if this topic has been discussed various times already,
> but digging through the mail archive didn't bring me any further. I
> simply don't understand the documentation for url_part_aliases.
>
> I made an index with htdig, and when I search it, it returns my
> machine name in the URLs. That works fine locally, but unfortunately
> no other host in the LAN knows my name and can access my apache
> webserver only through the IP. So I would like htdig to put the IP in
> the URL whenever it encounters the name in the index database.
> Putting the following 2 lines in my htdig.conf simply doesn't work:
>
>     url_part_aliases: http://myname/
>     url_part_aliases: http://12.34.56.78/
>
> I understood the documentation to mean that the first
> url_part_aliases line contains the from-part and the second the
> to-part. Obviously I'm wrong, but what would be right? Is it possible
> to change the URLs on the fly, or do I have to rebuild my database?
> How is that done?

Hey Kai,

This one is not covered by the docs, though I understand 3.2 will have
clearer documentation. Getting this to work requires 2 different conf
files: one used when digging, the other used when searching. The only
difference between the 2 files would be the url_part_aliases line, like
this:

Dig conf:

    url_part_aliases: http://myname/ *1

Search conf:

    url_part_aliases: http://12.34.56.78/ *1

Think of the *1 as a placeholder which is set when digging and replaced
when searching.

HTH,

Bill Carlson
Re: [htdig] Question about htdig scope
On Tue, 28 Sep 1999, Greg W wrote:

> So how could you force it to look at all docs in a directory and
> under it? I would like to be able to dump docs at random into a
> directory without having to keep it organised.

On your webserver, turn on indexes. Htdig will treat the generated
'index' page as a regular page and follow the links to all files and
subdirectories. For Apache, this is done with the Options Indexes
configuration directive.

Bill Carlson
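In Apache's configuration that looks like the following sketch (the
directory path is an illustrative assumption):

```apache
# Let Apache generate automatic directory listings; htdig then follows
# the listing's links to every file and subdirectory under it.
<Directory "/var/www/docs">
    Options Indexes
</Directory>
```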
RE: [htdig] Merging Databases
Hello again,

Are there some configuration options that would prevent the merge from
working? I used -vv when running htmerge; I get a line saying
"Merging", but then no URLs are listed as merged.

I have run a test merge with different database sets; it works just
fine and the verbose output is what I would expect:

    ..
    htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/sourcereorg.html
    htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/stopping.html
    htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/suexec.html
    htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/unixware.html
    htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/upgrading_to_1_3.html
    htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/vhosts/
    htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/vhosts/details.html
    htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/vhosts/examples.html
    htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/vhosts/fd-limits.html
    htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/vhosts/ip-based.html
    htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/vhosts/name-based.html
    htmerge: Merged URL: http://ryoko.radiology.uiowa.edu/manual/windows.html
    htmerge: Sorting...
    htmerge: Merging...
    htmerge: 100:06b
    htmerge: 200:0x00f6
    htmerge: 300:0x0c6
    htmerge: 400:0x1cf
    ..

Now, on the two sets I want to merge (not the above test sets), all I
get is:

    htmerge: Sorting...
    htmerge: doc #2145 has been superceeded.
    htmerge: Merging...
    htmerge: Removing doc #102
    htmerge: Removing doc #10538
    htmerge: Removing doc #10626
    htmerge: Removing doc #10653
    ..

I ran htmerge with the following:

    htmerge -c crawl.conf -m main.conf -vvv | tee htmerge.log

It sat there for the longest time, then started with "htmerge:
Sorting". Is there some other option I can use to get more information
about what is going on?
FWIW, these databases are fairly large, 200 MB and 600 MB (hence why I
am so eager to merge :) ).

Thanks,

Bill Carlson
[htdig] Merging Databases
Hello,

I have been trying to get a handle on the merge feature of htmerge,
with no success. Would someone mind explaining step by step what the
process is? Here is what I have been doing. Say I have 2 databases,
main and crawl, and I need searches that hit main, or main and crawl
together. So I:

    1. dig main (stock rundig)
    2. htdig crawl
    3. htmerge crawl with '-m main.conf'
    4. rundig, skipping htdig and htmerge

The merge doesn't seem to happen. Searching main works fine; searching
crawl only returns crawl hits. What am I doing wrong?

Setup: htdig 3.1.2, Solaris 2.6.

Any help appreciated.

Bill Carlson
Re: [htdig] Merging Databases
On Mon, 20 Sep 1999, Geoff Hutchison wrote:

> > The merge doesn't seem to happen. Searching main works fine;
> > searching crawl only returns crawl hits. What am I doing wrong?
>
> OK, since you didn't specify conf files, I'll go through the syntax
> exactly (obviously you may use additional flags):
>
>     htdig -c main.conf
>     htmerge -c main.conf
>     htdig -c crawl.conf
>     htmerge -m main.conf -c crawl.conf
>
> That's it. This will merge main into crawl, performing the normal
> htmerge runs as needed.

OK, that is essentially what I am doing:

    rundig -c main.conf
    rundig -m main.conf -c crawl.conf -skipdig

where I modified rundig to take -skipdig and skip the htdig and htmerge
portion.

How can I verify that the merge happens, other than by performing a
search? What can I look for in the verbose output? In one of the
databases?

Bill Carlson
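One way to answer the "how can I verify the merge" question from the
command line is to capture the verbose output and check it, along with
the database size, instead of relying on a search. A sketch; the
database path and the log-scraping are my own additions:

```shell
# Sketch: run the merge with output captured, then sanity-check it.
db=/opt/www/htdig/db            # illustrative path to the db directory

ls -l "$db/db.docdb"            # note the size before merging

htmerge -m main.conf -c crawl.conf -vv 2>&1 | tee htmerge.log

# Count the URLs htmerge reported merging; zero here means the merge
# brought nothing over.
grep -c 'Merged URL:' htmerge.log

ls -l "$db/db.docdb"            # should have grown if documents came over
```

If the count is zero and the database did not grow, the two runs were
not actually merged, whatever the exit status said.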
[htdig] doclist, perl db.docdb access
Hello,

I am stumbling into some problems using any of the contrib perl
scripts. I understand that various fields have been added to the docdb
that aren't in some of the scripts; I have accounted for those. I have
access to the database, but the hashed information doesn't seem right.
For example, the key should be the URL in question, yet when running
doclist.pl, the output is something like:

    ^Gwww.somewhere.org/index.html^S

where those are control characters that only show when piping through
less. I modified the script to use BerkeleyDB instead of GDBM_File, but
no change. Any pointers?

Thanks,

Bill Carlson
Re: [htdig] compile issues
On Wed, 8 Sep 1999, Daniel Stringfield wrote:

> I'm trying to install htdig on a Cobalt Qube 2700WG. When I run
> ./configure, it gets to the point where it checks for a C++ compiler,
> states that the C++ compiler cannot create executables, and stops.
> After looking through the online FAQs for htdig, I did see that you
> need libstdc++, which wasn't installed. I got it installed, but that
> does not fix it. For those of you not familiar with the Cobalt Qube,
> it runs a modified version of Redhat Linux on a MIPS CPU.

Daniel,

Glad you hit this before I did; I was about to try htdig on a Qube. At
any rate, to compile htdig you need a C++ compiler (such as g++), not
just libstdc++. I would guess there is an rpm for g++ on Cobalt's site,
if not for htdig itself.

Good Luck!

Bill Carlson