[htdig] htstat crashes while generating the URL list

2000-12-07 Thread Michael Schulz

Dear all,

I use htdig 3.2b2 and I have a problem with htstat.
When I call

htstat -u  url_list

htstat crashes with the following message:

WordDB: /opt/www/var/htdig/db.words.db: page 83131 doesn't exist, create flag no
WordDBCursor::Get(15) failed Cannot allocate memory
WordDB: /opt/www/var/htdig/db.words.db: page 1 doesn't exist, create flag not set
WordDBCursor::Get(22) failed Cannot allocate memory
Segmentation fault

The machine has 512 MB of RAM and a 265 MB swap partition,
so I added another 1 GB swap file.
I got the same message, only a little bit later...
(Curious: when this happens, "top" says that there's 700 MB of swap space left...)

Any idea how to solve that problem?

Mike


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:            http://www.htdig.org/FAQ.html




Re: [htdig] Can htdig kill Linux?

2000-12-07 Thread Bill Carlson

On Wed, 6 Dec 2000, Clint Gilders wrote:

 David Gewirtz wrote:
 
  I just love getting to know new software. There's always some form of
  teething pain. Yesterday, I started running my first set of reasonably
  large htdig/htmerge processes. Came in today to find the Linux server
  (which is running nothing besides basic Mandrake processes and, of course,
  htdig) was deader than a doornail (have to say "deader than" because saying
  "hung more than" would just be too weird).

   I use Mandrake at home and love it, but have nothing but problems with
 it in a server environment.  Our lone Linux server (the rest are FreeBSD)
 has been crashing daily (hanging, no telnet, no ftp, etc.) since we
 installed apache/mod_ssl.  Even before that it wasn't the most reliable
 box going.   If you are going to continue to use it in a production
 environment I suggest not running X or KDE as these can eat up 60% of
 your CPU.

   We have indexed well over 200,000 documents with htdig running on a
 single FreeBSD machine without so much as a hiccup.

 Almost makes me wish for NT.
 Be careful what you wish for!  You just might get it.   Ahh!!! The
 horror.

I can say from experience, the only times I've crashed a Linux box have
been due to faulty hardware or faulty admin. There might be times when the
system is so loaded that it might take 2 minutes to log in, but log in it
eventually does. The few times where even login wouldn't work have been
admin error, things like writing memory bombs accidentally or letting
file systems get full.

Now, having also run htdig for quite a while, here are the things that
could cause a box to become overloaded and die:

* running htmerge where TMPDIR points to a file system that is too
small. When sort runs it fills the file system, which is bad. And
people usually run dig as root, which means the file system really gets full.
If this happens to be the / file system, well, things get very ugly
when / is full.

* running htdig against a large number of pages and filling up / .
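
Both cases above come down to a file system running out of room, so a pre-flight check before a big htmerge run is cheap insurance. What follows is only a minimal sketch, assuming Python 3 is available on the box; the 500 MB threshold and the function names are made up for illustration and are not part of htdig.

```python
import os
import shutil
import tempfile


def free_megabytes(path):
    """Return the free space, in MB, on the file system holding `path`."""
    return shutil.disk_usage(path).free / (1024 * 1024)


def safe_to_sort(tmpdir, needed_mb):
    """True if `tmpdir` has at least `needed_mb` MB free."""
    return free_megabytes(tmpdir) >= needed_mb


if __name__ == "__main__":
    # TMPDIR is where sort (and so htmerge) puts its scratch files.
    tmpdir = os.environ.get("TMPDIR", tempfile.gettempdir())
    needed_mb = 500  # hypothetical estimate; size it to your own word database
    if safe_to_sort(tmpdir, needed_mb):
        print(f"{tmpdir}: enough room, go ahead and run htmerge")
    else:
        print(f"{tmpdir}: less than {needed_mb} MB free, point TMPDIR elsewhere")
```

The same check can be pointed at / or at the database directory before a long dig.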

First, I would verify the hardware. The test of choice is still compiling
the kernel; this really does exercise the system more than anything else
(to really have fun, compile several kernels at once or alter the -j
parameter for make in the Makefile). I had a machine that could not
compile a kernel but otherwise ran fine. Turned out the CPU was
overheating, but only when it was really pushed.

So, compile a kernel or two and then start looking at htdig again.

$.02

Bill Carlson
-- 
Systems Programmer    [EMAIL PROTECTED]      |  Opinions are mine,
Virtual Hospital      http://www.vh.org/     |  not my employer's.
University of Iowa Hospitals and Clinics     |







Re: [htdig] Can htdig kill Linux? (redux)

2000-12-07 Thread Bill Carlson

On Wed, 6 Dec 2000, David Gewirtz wrote:


 Well, I can't be sure what caused it, but the end result was that the Linux
 crash left some serious filesystem errors. I did an fsck and the filesystem
 now seems better, but there are a heck of a lot of lost+found nodes.

 So, here are my questions (could be Linux-newbie questions, sorry):

 * Is there a way to tell what files got chomped by the fsck and have
 lost+found nodes?
 * Is there a way to check a log for htdig?
 * Is an fsck -f -y good enough, or should I reformat and reinstall the hard
 drive?


If the machine goes down while there is a lot going on in the file system,
file changes that are in the memory cache don't get written to disk and
that is what fsck cleans up.

Generally, those lost+found nodes are going to be those files that were
being written to at the time of the crash. In most cases, this will be
working files or something along those lines. If you're running an
RPM-based distro, I'd run rpm -Va and see if you're missing any files (check
the man page for rpm; this command will also list alterations you have
made to some files).

Last thing is to examine those files in lost+found. Use less against them,
then file if that doesn't make any sense.

Finally, reformatting and reinstalling is a bad habit, break it if you
can. You'll learn much more by trying to fix things rather than reinstall.

Contrary to Windows, with Linux you CAN fix these types of things. :)

Good Luck,

Bill Carlson
-- 
Systems Programmer    [EMAIL PROTECTED]      |  Opinions are mine,
Virtual Hospital      http://www.vh.org/     |  not my employer's.
University of Iowa Hospitals and Clinics     |








Re: [htdig] SQL handling start_url

2000-12-07 Thread Bill Carlson

On Wed, 6 Dec 2000, Curtis Ireland wrote:

 2) Before htDig starts its database build, dump all the links to a text
 file and have the htdig.conf include this file

 The one problem with these two solutions is how would the limit_urls_to
 variable work? I want to make sure the links are properly indexed
 without going past the linked site.

This is the method I used, though in my case the backend was an email full
of links from the person directing the crawl. :)

Write 2 files, one for start_url and one for limit_urls_to, and include both in
the conf file like so:

start_url:  `/home/htdig/conf/start_url_file`

limit_urls_to:  `/home/htdig/conf/limit_url_file`


The contents of both files are just links.
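
For anyone generating the two files from a script rather than by hand, here is a minimal sketch in Python. It is an assumption-laden illustration, not something shipped with htdig: the function names are invented, and reducing each link to its scheme://host/ prefix for the limit file is just one way to keep the crawl on the linked sites. Writing the exact links into both files works too.

```python
from pathlib import Path
from urllib.parse import urlsplit


def site_prefix(url):
    """Reduce a URL to scheme://host/ so the crawl stays on that site."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


def write_url_files(links, start_path, limit_path):
    """Write one link per line to start_path and one site prefix per line
    to limit_path. Returns the two lists that were written."""
    starts = sorted(set(links))  # drop duplicates, keep output stable
    limits = sorted({site_prefix(url) for url in starts})
    Path(start_path).write_text("\n".join(starts) + "\n")
    Path(limit_path).write_text("\n".join(limits) + "\n")
    return starts, limits
```

The links themselves could come from anywhere (an SQL query, an email, a spreadsheet); only the two output paths need to match what the conf file includes.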

Good Luck,

Bill Carlson
-- 
Systems Programmer    [EMAIL PROTECTED]      |  Opinions are mine,
Virtual Hospital      http://www.vh.org/     |  not my employer's.
University of Iowa Hospitals and Clinics     |







Re: [htdig] SQL handling start_url

2000-12-07 Thread Gilles Detillieux

According to Curtis Ireland:
 Is there any way to have start_url get its list from an SQL back-end?
 Has anyone already built a patch to handle this?
 
 Here are a couple of solutions I can think of to bypass the problem,
 but I'm sure I'm not alone in desiring this feature.
 
 1) Build a PHP page with links to all the sites we want to index.
 Have htDig use this as its start_url
 2) Before htDig starts its database build, dump all the links to a text
 file and have the htdig.conf include this file
 
 The one problem with these two solutions is how would the limit_urls_to
 variable work? I want to make sure the links are properly indexed
 without going past the linked site.

Either solution seems workable - it all depends on what your preference
is.  For the first solution, you'd need to have a limit_urls_to setting
that's liberal enough to allow through all the links that the PHP script
will spit out.  You should probably set your max_hop_count to 1 to avoid
having htdig go beyond the first hop, from the PHP output to the documents
it references.

For the second solution, you could probably just leave limit_urls_to as
the default, which is the same as the value of start_url, and set your
max_hop_count to 0.

-- 
Gilles R. Detillieux              E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930






Re: [htdig] Pb indexing HTML with htdig 3.1.5

2000-12-07 Thread Gilles Detillieux

According to André LAGADEC:
 I use htdig 3.1.5 on a Red Hat Linux 5.0, and I want to index a new web
 site. But when I run rundig I get only one document.
 
 So to see what it is doing, I use rundig -vvv and I get this output:
 Header line: HTTP/1.1 200 OK
 Header line: Server: Netscape-Enterprise/3.5.1C
 Header line: Date: Wed, 06 Dec 2000 07:32:02 GMT
 Header line: Content-type: text/html
 Header line: Last-modified: Mon, 15 Nov 1999 10:45:01 GMT
 Translated Mon, 15 Nov 1999 10:45:01 GMT to 1999-11-15 10:45:01 (99)
 And converted to Mon, 15 Nov 1999 10:45:01
 Header line: Content-length: 1258
 Header line: Accept-ranges: bytes
 Header line: Connection: close
 Header line: 
 returnStatus = 0
 Read 1258 from document
 Read a total of 1258 bytes
 Tag: html, matched -1
 head:  
  size = 1258
 pick: x.y.z.t, # servers = 1
 htdig: Run complete
 htdig: 1 server seen:
 htdig: x.y.z.t:8000 1 document

You should be getting much more output than that with a verbosity level of
7!  Is it possible that there is a NUL byte in the document, soon after the
"html" tag?  For some reason, htdig seems to be stopping right after this
tag, and not getting anywhere close to the other tags in the document.  I've
tried it myself on the document you sent, and on that copy it worked fine.
The comment around the JavaScript code is correct, and htdig was able to
handle it.  There must be something different in your copy of the document,
such as a NUL byte, which is causing htdig's parser to end prematurely.
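
One way to test the NUL-byte theory is to save the page exactly as the server sends it and scan the raw bytes. The following is a minimal sketch in Python, offered as an assumption about how one might check and not as anything htdig provides; the function names are invented for illustration.

```python
def find_nul_bytes(data, limit=10):
    """Return the offsets of the first `limit` NUL bytes in `data` (bytes)."""
    offsets = []
    pos = data.find(b"\x00")
    while pos != -1 and len(offsets) < limit:
        offsets.append(pos)
        pos = data.find(b"\x00", pos + 1)
    return offsets


def report(path):
    """Print where each NUL byte sits in the file at `path`."""
    with open(path, "rb") as handle:  # binary mode, so nothing gets decoded
        data = handle.read()
    for offset in find_nul_bytes(data):
        # Show a little of what comes just before the NUL to locate it.
        context = data[max(0, offset - 20):offset]
        print(f"NUL at byte {offset}, after {context!r}")
```

An empty result on the server's copy would rule this explanation out and point the search elsewhere.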

 I think that htdig doesn't like the HTML code "<!--//" and "//-->": it
 sees the beginning of the comment but not the end, and ignores the rest of
 the HTML code of the page.
 
 Am I right? Any other ideas? What can I do?
 
 N.B.: The HTML code of the first page on the site is below this line.
 _
 <html>
 
 <head>
 <title>Accueil DIRECTION</title>
 <base target="rtop">
 <script language="JavaScript">
 <!--//
 var url="";
 var nom="";
 var bName="";
 
 function Ouvrir()
 {
 bName = navigator.appName
 Version = navigator.appVersion
 Version = Version.substring(0,1)
 browserOK = ((Version >= 2))
 
 if (browserOK) 
 {
 this.name="home";
 
 msgWindow=window.open("actu/default2.htm","popupdpd","location=no,toolbar=no,status=no,directories=no,scrollbars=yes,width=400,height=450");
 bName=navigator.appName;
 if (bName=="Netscape") msgWindow.focus();
 
 }
 }
 Ouvrir()
 
 //-->
 </script>
 </head>
 
 <frameset framespacing="0" border="false" frameborder="0" cols="155,*">
   <frame name="gauche" scrolling="no" noresize target="haut_droite"
 src="defaulta.htm"
   marginwidth="0" marginheight="5">
   <frameset rows="*,45">
 <frame name="texte" target="bas_droite" src="defaultb.htm"
 scrolling="auto"
 marginwidth="0" marginheight="0" noresize>
 <frame name="bas" src="basac.htm" scrolling="no" marginwidth="7"
 marginheight="15"
 noresize>
   </frameset>
   <noframes>
   <body>
   <p>Cette page utilise des cadres, mais votre navigateur ne les prend
 pas en charge.</p>
   </body>
   </noframes>
 </frameset>
 </html>


-- 
Gilles R. Detillieux              E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930






[htdig] Incremental indexing

2000-12-07 Thread Wanrong Qiu

Hi,

Does htdig support incremental indexing? I mean, is it possible to index
only newly created or modified files? Thanks in advance.

Wayne








[htdig] htdig fails to parse all files

2000-12-07 Thread Jeffery T Aiken

I've compiled htdig 3.1.5 on a Solaris 2.6 system.  I have 5 directories on my
web server containing a total of 54190 html docs and when I run htdig it only
finds just over 18,000.  I've used the -vvv -s options and see no errors during
the dig.  I am able to successfully htmerge these into the database and search,
but can't figure out why htdig doesn't see them all.

Anybody have an idea where I can go from here?








Re: [htdig] Incremental indexing

2000-12-07 Thread Gilles Detillieux

According to Wanrong Qiu:
 Does htdig support incremental indexing? I mean, is it possible to index
 only newly created or modified files? Thanks in advance.

Yes, this is what htdig does by default if there is an existing database,
and the htdig program is called without the -i (initialize) option.
However, the rundig script that comes with the package calls htdig with
the initialize option, as its main purpose is to create all the initial
databases, so don't use the standard rundig script for update runs.
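
As a rough illustration of the difference, the sketch below (Python, with an invented helper name and a placeholder config path) builds the command lines for an update run versus an initial run. It covers only the htdig and htmerge steps and leaves out whatever else your copy of rundig does, so treat it as a starting point and not as a replacement for the script.

```python
import shlex


def update_commands(config_path, initial=False):
    """Build the htdig and htmerge command lines for one indexing run."""
    dig = ["htdig", "-c", config_path]
    if initial:
        # -i makes htdig discard the old databases and start from scratch.
        dig.insert(1, "-i")
    merge = ["htmerge", "-c", config_path]
    return [dig, merge]


if __name__ == "__main__":
    # "/path/to/htdig.conf" is a placeholder; use your own config file.
    for command in update_commands("/path/to/htdig.conf"):
        print(shlex.join(command))
```

Leaving `initial` at its default gives the update behaviour described above, where only new or modified documents are fetched again.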

-- 
Gilles R. Detillieux              E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930






Re: [htdig] htdig fails to parse all files

2000-12-07 Thread Gilles Detillieux

According to Jeffery T Aiken:
 I've compiled htdig 3.1.5 on a Solaris 2.6 system.  I have 5 directories on my
 web server containing a total of 54190 html docs and when I run htdig it only
 finds just over 18,000.  I've used the -vvv -s options and see no errors during
 the dig.  I am able to successfully htmerge these into the database and search,
 but can't figure out why htdig doesn't see them all.
 
 Anybody have an idea where I can go from here?

Have you looked at FAQ 5.25 & 5.1?

 FAQ:            http://www.htdig.org/FAQ.html

-- 
Gilles R. Detillieux              E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930






Re: [htdig] Htdig in spanish

2000-12-07 Thread Heriberto Cantu

At 07:40 p.m. 06/12/00 -0600, Geoff Hutchison wrote:
At 5:59 PM -0600 12/6/00, Heriberto Cantu wrote:
It was quick work, so it probably needs a second review and the completion
of the synonyms.es file.

I think it would be a good idea to have this package on the www.htdig.org site,
but I couldn't find a way to upload it.

You can try ftp://www.htdig.org/upload/ but it might be worth 
thinking about a "File Upload" form. If anyone has coded a CGI like 
this (and can ensure that files transfer in binary form), it might be 
worth trying.

I have been looking at the French version and found that the files
bad_words.fr and the dictionaries have accented chars.

I had problems with the words "oír", "prohibido", and "grande" in the generation
of the endings, so I changed the two-character sequences to single accented
chars, e.g.
('a 'e 'i 'o 'u "u 'n) == (á é í ó ú ü ñ)

Now the ending generation works better and adds accented words to the list.

I have a new .tar.gz with the accented-char files bad_words.es,
espa~nol.0 and espa~nol.aff

I still couldn't upload it, so you can get it at
http://www.elinux.com.mx/pub/htdig-3.1.5-es-1.1.tar.gz

Thanks


Heriberto Cantu
http://www.elinux.com.mx
Monterrey, Mexico
Tel: (8)129-1121
Cel: 0448-256-8807








Re: [htdig] htdig fails to parse all files

2000-12-07 Thread Jeffery T Aiken

Sorry for the dup, Gilles...

I have looked at those FAQs, particularly 5.1, which seems to match my problem.
I increased my max_doc_size to 5 MB (no actual file is over 800K - directory
listings can get up to 2 MB) and still I get the same results.  I do get files
from each of the 5 directories, but not all of them, so I don't see how the web
server could be the problem.

Any other suggestions?
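
To narrow down where the missing documents sit, it can help to compare what is on disk with what the dig actually reached. A minimal sketch in Python follows, assuming you can produce two lists yourself: the URL paths you expect (for example from a directory walk) and the URLs the dig reported. Both inputs and the function name are hypothetical and not something htdig emits in this form.

```python
from collections import Counter
from urllib.parse import urlsplit


def missing_by_directory(expected_paths, indexed_urls):
    """Count, per top-level directory, the expected paths that never
    show up among the indexed URLs."""
    indexed = {urlsplit(url).path for url in indexed_urls}
    missing = Counter()
    for path in expected_paths:
        if path not in indexed:
            top = path.strip("/").split("/")[0]
            missing[top] += 1
    return missing
```

If the gaps cluster in a few subtrees, that hints at pages htdig never found a link to; if they are spread evenly, a size or hop limit is a more likely suspect.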








[htdig] indexing mySQL table

2000-12-07 Thread Zon Hisham Bin Zainal Abidin

Can htdig index MySQL tables?

rgds.






[htdig] htdig dumps core on Linux

2000-12-07 Thread B.G. Mahesh


My env is

Linux: 2.2.14-5.0smp (Redhat 6.2)
HTDIG: 3.1.5
Apache: 1.3.14

When I search for the word "rajkumar" in
the news finder window on
 http://news.indiainfo.com/2000/12/08/india-index.html
it gives me an error. When I check the cgi-bin dir I see a core file.

% file core
core: ELF 32-bit LSB core file of 'htsearch' (signal 11), Intel 80386, version 1

Why does this happen?


---
B.G. Mahesh
[EMAIL PROTECTED]
http://www.indiainfo.com/







[htdig] PDF problem

2000-12-07 Thread bg . mahesh

hi

I am using htdig 3.1.5 on Linux. I get these errors when I try to index
the files.

How can I fix the problem?

[ii@iinj-lxs015 bin]$ 
/disk2/v/apache/htdocs/VIRTUAL/ii/search/HTDIG//db/htdig11551.pdf: Unterminated string.
PDF::parse: cannot open acroread output from http://www.indiainfo.com/awards/ET-ArmyInKashmir.pdf
/disk2/v/apache/htdocs/VIRTUAL/ii/search/HTDIG//db/htdig11551.pdf: Could not repair file.
PDF::parse: cannot open acroread output from http://travel.indiainfo.com/utilities/passport/passport_app.pdf
/disk2/v/apache/htdocs/VIRTUAL/ii/search/HTDIG//db/htdig11551.pdf: Could not repair file.
PDF::parse: cannot open acroread output from http://travel.indiainfo.com/utilities/passport/lostpp.pdf


-- 
B.G. Mahesh   
http://www.indiainfo.com/
mailto:[EMAIL PROTECTED]

