[htdig] htstat crashes while generating the URL list
Dear all,

I use htdig 3.2b2 and I have a problem with htstat: when I call "htstat -u url_list", htstat crashes with the following message:

WordDB: /opt/www/var/htdig/db.words.db: page 83131 doesn't exist, create flag not set
WordDBCursor::Get(15) failed Cannot allocate memory
WordDB: /opt/www/var/htdig/db.words.db: page 1 doesn't exist, create flag not set
WordDBCursor::Get(22) failed Cannot allocate memory
Segmentation fault

The machine has 512MB RAM and a 265MB swap partition, so I added another 1GB swap file. I got the same message, only a little bit later... (Curious: when this happens, "top" says that there are 700MB of swap space left...)

Any idea how to solve this problem?

Mike

To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
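For reference, the 1GB swap file Mike mentions is usually created along these lines (a command sketch; requires root, and /swapfile is a placeholder path):

```
# Create and enable a 1GB swap file (sketch; run as root)
dd if=/dev/zero of=/swapfile bs=1024 count=1048576
mkswap /swapfile
swapon /swapfile
```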
Re: [htdig] Can htdig kill Linux?
On Wed, 6 Dec 2000, Clint Gilders wrote:

David Gewirtz wrote: I just love getting to know new software. There's always some form of teething pain. Yesterday, I started running my first set of reasonably large htdig/htmerge processes. Came in today to find the Linux server (which is running nothing besides basic Mandrake processes and, of course, htdig) was deader than a doornail (have to say "deader than" because saying "hung more than" would just be too weird).

I use Mandrake at home and love it, but have had nothing but problems with it in a server environment. Our lone Linux server (the rest are FreeBSD) has been crashing daily (hanging: no telnet, no ftp, etc.) since we installed apache/mod_ssl. Even before that it wasn't the most reliable box going. If you are going to continue to use it in a production environment, I suggest not running X or KDE, as these can eat up 60% of your CPU. We have indexed well over 200,000 documents with htdig running on a single FreeBSD machine without so much as a hiccup. Almost makes me wish for NT.

Be careful what you wish for! You just might get it. Ahh!!! The horror.

I can say from experience that the only times I've crashed a Linux box have been due to faulty hardware or faulty admin. There might be times when the system is so loaded that it takes 2 minutes to log in, but log in it eventually does. The few times where even login wouldn't work have been admin error: things like writing memory bombs accidentally or letting file systems fill up.

Now, having also run htdig for quite a while, here are the things that could cause a box to become overloaded and die:

* running htmerge where TMPDIR points to a file system that is too small. When sort runs, it fills the file system, which is bad. And people usually run the dig as root, which means the file system really gets full. If this happens to be the / file system, well, things get very ugly when / is full.

* running htdig against a large number of pages and filling up /.
First, I would verify the hardware. The test of choice is still compiling the kernel; this really does exercise the system more than anything else (to really have fun, compile several kernels at once, or alter the -j parameter for make in the Makefile). I had a machine that could not compile a kernel but otherwise ran fine; it turned out the CPU was overheating, but only when it was really pushed. So, compile a kernel or two and then start looking at htdig again.

$.02

Bill Carlson
--
Systems Programmer [EMAIL PROTECTED] | Opinions are mine,
Virtual Hospital http://www.vh.org/ | not my employer's.
University of Iowa Hospitals and Clinics |
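Before a big htmerge run, the TMPDIR problem above can be caught ahead of time by checking the free space on sort's temp filesystem (a sketch; the awk column assumes POSIX-format df output):

```shell
# Report available kilobytes on the filesystem holding $TMPDIR.
# htmerge's external sort writes its work files here; if this
# fills up, things get ugly, especially when it's the / filesystem.
TMPDIR=${TMPDIR:-/tmp}
df -kP "$TMPDIR" | awk 'NR==2 {print $4}'
```

Pointing TMPDIR at a roomy partition before the run avoids the failure mode Bill describes.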
Re: [htdig] Can htdig kill Linux? (redux)
On Wed, 6 Dec 2000, David Gewirtz wrote: Well, I can't be sure what caused it, but the end result was that the Linux crash left some serious filesystem errors. I did an fsck and the filesystem now seems better, but there are a heck of a lot of lost+found nodes. So, here are my questions (could be Linux-newbie questions, sorry):

* Is there a way to tell what files got chomped by the fsck and have lost+found nodes?
* Is there a way to check a log for htdig?
* Is an fsck -f -y good enough, or should I reformat and reinstall the hard drive?

If the machine goes down while there is a lot going on in the file system, file changes that are in the memory cache don't get written to disk, and that is what fsck cleans up. Generally, those lost+found nodes are going to be the files that were being written to at the time of the crash. In most cases, this will be working files or something along those lines. If you're running an RPM-based distro, I'd run "rpm -Va" and see if you're missing any files (check the man page for rpm; this command will also list alterations you have made to some files). The last thing is to examine those files in lost+found: use less against them, then file if that doesn't make any sense.

Finally, reformatting and reinstalling is a bad habit; break it if you can. You'll learn much more by trying to fix things than by reinstalling. Contrary to Windows, with Linux you CAN fix these types of things. :)

Good Luck,

Bill Carlson
--
Systems Programmer [EMAIL PROTECTED] | Opinions are mine,
Virtual Hospital http://www.vh.org/ | not my employer's.
University of Iowa Hospitals and Clinics |
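Bill's "use less, then file" advice can be wrapped in a small loop to survey everything fsck recovered (a sketch; /lost+found is the usual location at the root of the affected filesystem, and you would normally run this as root):

```shell
# Print the detected type of each recovered inode in lost+found.
# 'file' often identifies text, gzip data, Berkeley DB files, etc.
# at a glance, which helps decide what is worth keeping.
for f in /lost+found/*; do
    [ -e "$f" ] && file "$f"
done
```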
Re: [htdig] SQL handling start_url
On Wed, 6 Dec 2000, Curtis Ireland wrote: 2) Before htDig starts its database build, dump all the links to a text file and have the htdig.conf include this file. The one problem with these two solutions is how the limit_urls_to variable would work. I want to make sure the links are properly indexed without going past the linked site.

This is the method I used, though in my case the backend was an email full of links from the person directing the crawl. :) Write 2 files, one for start_url and one for limit_urls_to, and include both in the conf file like so:

start_url: `/home/htdig/conf/start_url_file`
limit_urls_to: `/home/htdig/conf/limit_url_file`

The contents of both files are just links.

Good Luck,

Bill Carlson
--
Systems Programmer [EMAIL PROTECTED] | Opinions are mine,
Virtual Hospital http://www.vh.org/ | not my employer's.
University of Iowa Hospitals and Clinics |
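Bill's two-file setup might look like this on disk (a sketch; the paths are from his message, but the URLs are hypothetical examples):

```
# /home/htdig/conf/start_url_file -- one URL per line
http://www.example.com/
http://www.example.org/docs/

# htdig.conf excerpt, using htdig's backquote file-inclusion syntax:
start_url:     `/home/htdig/conf/start_url_file`
limit_urls_to: `/home/htdig/conf/limit_url_file`
```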
Re: [htdig] SQL handling start_url
According to Curtis Ireland: Is there any way to have start_url get its list from an SQL back-end? Has anyone already built a patch to handle this? Here are a couple of solutions I can think of to bypass the problem, but I'm sure I'm not alone in desiring this feature. 1) Build a PHP page with links to all the sites we want to index, and have htDig use this as its start_url. 2) Before htDig starts its database build, dump all the links to a text file and have the htdig.conf include this file. The one problem with these two solutions is how the limit_urls_to variable would work. I want to make sure the links are properly indexed without going past the linked site.

Either solution seems workable - it all depends on what your preference is. For the first solution, you'd need to have a limit_urls_to setting that's liberal enough to allow through all the links that the PHP script will spit out. You should probably set your max_hop_count to 1 to avoid having htdig go beyond the first hop, from the PHP output to the documents it references. For the second solution, you could probably just leave limit_urls_to as the default, which is the same as the value of start_url, and set your max_hop_count to 0.

--
Gilles R. Detillieux E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
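For the second solution, the relevant htdig.conf fragment might look like this (a sketch; the dump-file path is hypothetical):

```
# Seed the dig from the dumped link list. With max_hop_count 0,
# only the listed pages themselves are fetched; limit_urls_to is
# left at its default, which is the value of start_url.
start_url:     `/home/htdig/conf/dumped_links`
max_hop_count: 0
```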
Re: [htdig] Problem indexing HTML with htdig 3.1.5
According to André LAGADEC: I use htdig 3.1.5 on Red Hat Linux 5.0, and I want to index a new web site. But when I run rundig, I get only one document. So, to see what it is doing, I use rundig -vvv and get this output:

Header line: HTTP/1.1 200 OK
Header line: Server: Netscape-Enterprise/3.5.1C
Header line: Date: Wed, 06 Dec 2000 07:32:02 GMT
Header line: Content-type: text/html
Header line: Last-modified: Mon, 15 Nov 1999 10:45:01 GMT
Translated Mon, 15 Nov 1999 10:45:01 GMT to 1999-11-15 10:45:01 (99)
And converted to Mon, 15 Nov 1999 10:45:01
Header line: Content-length: 1258
Header line: Accept-ranges: bytes
Header line: Connection: close
Header line:
returnStatus = 0
Read 1258 from document
Read a total of 1258 bytes
Tag: html, matched -1
head: size = 1258
pick: x.y.z.t, # servers = 1
htdig: Run complete
htdig: 1 server seen:
htdig: x.y.z.t:8000 1 document

You should be getting much more output than that with a verbosity level of 7! Is it possible that there is a NUL byte in the document, soon after the "html" tag? For some reason, htdig seems to be stopping right after this tag, and not getting anywhere close to the other tags in the document. I've tried it myself on the document you sent, and on that copy it worked fine. The comment around the JavaScript code is correct, and htdig was able to handle it. There must be something different in your copy of the document, such as a NUL byte, which is causing htdig's parser to end prematurely.

I think that htdig doesn't like the HTML code "!--//" and "//--": it sees the beginning of the comment but not the end, and ignores the rest of the HTML code of the page. Am I right? Any other ideas? What can I do?

N.B.: The HTML code of the first page on the site is below this line.
<html>
<head>
<title>Accueil DIRECTION</title>
<base target="rtop">
<script language="JavaScript">
<!--//
var url="";
var nom="";
var bName="";
function Ouvrir() {
  bName = navigator.appName
  Version = navigator.appVersion
  Version = Version.substring(0,1)
  browserOK = ((Version >= 2))
  if (browserOK) {
    this.name="home";
    msgWindow=window.open("actu/default2.htm","popupdpd","location=no,toolbar=no,status=no,directories=no,scrollbars=yes,width=400,height=450");
    bName=navigator.appName;
    if (bName=="Netscape") msgWindow.focus();
  }
}
Ouvrir()
//-->
</script>
</head>
<frameset framespacing="0" border="false" frameborder="0" cols="155,*">
<frame name="gauche" scrolling="no" noresize target="haut_droite" src="defaulta.htm" marginwidth="0" marginheight="5">
<frameset rows="*,45">
<frame name="texte" target="bas_droite" src="defaultb.htm" scrolling="auto" marginwidth="0" marginheight="0" noresize>
<frame name="bas" src="basac.htm" scrolling="no" marginwidth="7" marginheight="15" noresize>
</frameset>
<noframes>
<body>
<p>Cette page utilise des cadres, mais votre navigateur ne les prend pas en charge.</p>
</body>
</noframes>
</frameset>
</html>

--
Gilles R. Detillieux E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
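One way to test Gilles' NUL-byte theory is to count NUL bytes in a saved local copy of the page (a sketch; 'page.html' is a placeholder for a copy fetched from the same server):

```shell
# Count NUL bytes in a saved copy of the page.
# A non-zero count would explain htdig's parser stopping early.
tr -cd '\0' < page.html | wc -c
```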
[htdig] Incremental indexing
Hi,

Does htdig support incremental indexing? I mean, is it possible to index only newly created or modified files? Thanks in advance.

Wayne
[htdig] htdig fails to parse all files
I've compiled htdig 3.1.5 on a Solaris 2.6 system. I have 5 directories on my web server containing a total of 54190 HTML docs, and when I run htdig it only finds just over 18,000. I've used the -vvv -s options and see no errors during the dig. I am able to successfully htmerge these into the database and search, but I can't figure out why htdig doesn't see them all. Anybody have an idea where I can go from here?
Re: [htdig] Incremental indexing
According to Wanrong Qiu: Does htdig support incremental indexing? I mean, is it possible to index only newly created or modified files? Thanks in advance.

Yes, this is what htdig does by default if there is an existing database and the htdig program is called without the -i (initialize) option. However, the rundig script that comes with the package calls htdig with the initialize option, as its main purpose is to create all the initial databases, so don't use the standard rundig script for update runs.

--
Gilles R. Detillieux E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
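Given that, an update run can be as simple as calling the tools without -i (a command sketch; the config-file path is a placeholder):

```
# Update an existing database: no -i here, so htdig revisits the
# existing URL list and re-fetches only new or modified documents.
htdig -c /etc/htdig/htdig.conf
htmerge -c /etc/htdig/htdig.conf
```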
Re: [htdig] htdig fails to parse all files
According to Jeffery T Aiken: I've compiled htdig 3.1.5 on a Solaris 2.6 system. I have 5 directories on my web server containing a total of 54190 HTML docs, and when I run htdig it only finds just over 18,000. I've used the -vvv -s options and see no errors during the dig. I am able to successfully htmerge these into the database and search, but can't figure out why htdig doesn't see them all. Anybody have an idea where I can go from here?

Have you looked at FAQs 5.25 and 5.1? FAQ: http://www.htdig.org/FAQ.html

--
Gilles R. Detillieux E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
Re: [htdig] Htdig in spanish
At 07:40 p.m. 06/12/00 -0600, Geoff Hutchison wrote: At 5:59 PM -0600 12/6/00, Heriberto Cantu wrote: It was a fast job, so it probably needs a second review and the completion of the synonyms.es file. I think it would be a good idea to have this package on the www.htdig.org site, but I couldn't find a way to upload it.

You can try ftp://www.htdig.org/upload/ but it might be worth thinking about a "File Upload" form. If anyone has coded a CGI like this (and can ensure that files transfer in binary form), it might be worth trying.

I've been looking at the French version and found that the bad_words.fr and dictionary files have accented chars. I had problems with the words "oír", "prohibido", "grande" in the generation of the endings, so I changed the double chars to single accented ones, e.g. ('a 'e 'i 'o 'u "u 'n) == (á é í ó ú ü ñ). Now the ending generation works better and adds accented words to the list. I have a new .tar.gz with accented-char files bad_words.es, espa~nol.0 and espa~nol.aff. Since I still couldn't upload, you can get it at http://www.elinux.com.mx/pub/htdig-3.1.5-es-1.1.tar.gz

Thanks

Heriberto Cantu
http://www.elinux.com.mx Monterrey, Mexico
Tel: (8)129-1121 Cel: 0448-256-8807
Re: [htdig] htdig fails to parse all files
Sorry for the dup, Gilles... I have looked at those FAQs, particularly 5.1, which seems to match my problem. I increased my max_doc_size to 5MB (no actual file is over 800K - directory listings can get up to 2MB) and still I get the same results. I do get files from each of the 5 directories, but not all of them, so I don't see how the web server could be the problem. Any other suggestions?
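For reference, the size cap being discussed is a single htdig.conf attribute (a sketch; the value is in bytes, 5MB shown here):

```
# Documents larger than this are truncated at this size,
# so directory listings up to 2MB fit comfortably under 5MB.
max_doc_size: 5000000
```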
[htdig] indexing mySQL table
Can htdig index mySQL tables?

rgds.
[htdig] htdig dumps core on Linux
My env is:

Linux: 2.2.14-5.0smp (Red Hat 6.2)
HTDIG: 3.1.5
Apache: 1.3.14

When I search for the word "rajkumar" in the news finder window on http://news.indiainfo.com/2000/12/08/india-index.html it gives me an error. When I check the cgi-bin dir I see a core file.

% file core
core: ELF 32-bit LSB core file of 'htsearch' (signal 11), Intel 80386, version 1

Why does this happen?

---
B.G. Mahesh [EMAIL PROTECTED]
http://www.indiainfo.com/ http://mail.indiainfo.com
First you had 10MB of free mail space. Now you can send mails in your own language!!!
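A backtrace from that core would show where htsearch hit the signal 11 (a command sketch; the path is a placeholder, and htsearch would need to be built with debugging symbols for the trace to be readable):

```
gdb /path/to/cgi-bin/htsearch core
# then at the prompt:
(gdb) bt
```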
[htdig] PDF problem
Hi,

I am using htdig 3.1.5 on Linux. I get these errors when I try to index the files. How can I fix the problem?

[ii@iinj-lxs015 bin]$
/disk2/v/apache/htdocs/VIRTUAL/ii/search/HTDIG//db/htdig11551.pdf: Unterminated string.
PDF::parse: cannot open acroread output from http://www.indiainfo.com/awards/ET-ArmyInKashmir.pdf
/disk2/v/apache/htdocs/VIRTUAL/ii/search/HTDIG//db/htdig11551.pdf: Could not repair file.
PDF::parse: cannot open acroread output from http://travel.indiainfo.com/utilities/passport/passport_app.pdf
/disk2/v/apache/htdocs/VIRTUAL/ii/search/HTDIG//db/htdig11551.pdf: Could not repair file.
PDF::parse: cannot open acroread output from http://travel.indiainfo.com/utilities/passport/lostpp.pdf

--
B.G. Mahesh
http://www.indiainfo.com/ mailto:[EMAIL PROTECTED]
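In htdig 3.1.x, PDF parsing like this is driven by acroread via htdig.conf attributes along these lines (a sketch; the acroread path is a guess for your system, and truncation by max_doc_size is one common cause of "Unterminated string" / "Could not repair file"):

```
# Where htdig finds the acroread binary used to extract PDF text
pdf_parser: /usr/local/Acrobat4/bin/acroread
# PDFs larger than max_doc_size are truncated before parsing,
# which acroread then reports as a damaged file
max_doc_size: 5000000
```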