Re: Possible Fix? (Re: UdmSearch: DeleteNoServer still broken in 3.1.9)

2001-02-01 Thread Alexander Barkov

  Hello!

Please find below a patch which fixes the incorrect behaviour of the
"DeleteNoServer no" command. The patch is against server.c.

  Thanks for help in debugging!





Caffeinate The World wrote:
 
 --- Caffeinate The World [EMAIL PROTECTED] wrote:
  # ./indexer -ma -n 1 -u http://www.mnpage.com/%
  AddServer 'http://www.state.mn.us/' 17
  AddServer 'http://www.mnworkforcecenter.org/' 17
  AddServer 'http://www.exploreminnesota.com/' 17
  AddServer 'http://www.tpt.org/' 17
  AddServer 'http://www.gorp.com/gorp/location/mn/mn.htm' 17
  AddServer 'http://lists.rootsweb.com/index/usa/MN/' 17
  AddServer 'http://*.mn.us/*' 18
  AddServer '(null)' 17
  Indexer[12748]: indexer from mnogosearch-3.1.9/PgSQL started with
  '/usr/local/install/mnogosearch-3.1.9/etc/indexer.conf'
  Indexer[12748]: [1] http://www.mnpage.com/magazines.html
  0 'http://www.state.mn.us/' 17
  1 'http://www.mnworkforcecenter.org/' 17
  2 'http://www.exploreminnesota.com/' 17
  3 'http://www.tpt.org/' 17
  4 'http://www.gorp.com/gorp/location/mn/mn.htm' 17
  5 'http://lists.rootsweb.com/index/usa/MN/' 17
  6 'http://*.mn.us/*' 18
  7 'http://lists.rootsweb.com/index/usa/MN/' 17
  Indexer[12748]: [1] No 'Server' command for url... deleted.
  Indexer[12748]: [1] Done (627 seconds)
 
  ---cut---
 
  looks like the 'null' server wasn't matched?
 
 also note that #5 and #7 are the same:
 
 'http://lists.rootsweb.com/index/usa/MN/'
 
 
  --- Alexander Barkov [EMAIL PROTECTED] wrote:
   Well, indexer.conf is loaded as expected.
  
   Now find this in UdmFindServer()  :
  
  
   for(i=0;i<Conf->nservers;i++){
   int res;
   regmatch_t subs[NS];
  
  and insert here:
  
  printf("%d '%s' %d\n",i,Conf->Server[i].url,Conf->Server[i].match_type);
  

--- server.c.orig   Thu Feb  1 14:17:35 2001
+++ server.c        Thu Feb  1 14:20:46 2001
@@ -30,9 +30,9 @@
        /* to keep srv->url unchanged */
        strcpy(urlstr,UDM_NULL2EMPTY(srv->url));
 
-       if(UDM_SRV_TYPE(match_type)==UDM_SERVER_SUBSTR){
+       if((UDM_SRV_TYPE(match_type)==UDM_SERVER_SUBSTR)&&(urlstr[0])){
                /* Check whether valid URL is passed */
-               if((urlstr[0])&&(res=UdmParseURL(from,urlstr))){
+               if((res=UdmParseURL(from,urlstr))){
                        switch(res){
                                case UDM_PARSEURL_LONG:
                                        Conf->errcode=1;
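
The control-flow change made by the patch can be sketched in isolation. This is only an illustration: the two functions and the MATCH_* constants below are hypothetical stand-ins for the UDM_SRV_TYPE(match_type)==UDM_SERVER_SUBSTR test and are not code from server.c. The sketch shows just one thing: before the patch the block was entered for every substring-type server and the empty-string test guarded only the URL parser call, while after the patch an empty URL skips the whole block.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the real match-type constants. */
enum { MATCH_SUBSTR = 1, MATCH_REGEX = 2 };

/* Before the patch: the block is entered for every substring-type
 * server, whether or not the URL string is empty. */
int enters_block_before(int match_type, const char *urlstr)
{
    (void)urlstr; /* the empty-string test was only made inside the block */
    return match_type == MATCH_SUBSTR;
}

/* After the patch: an empty URL string skips the whole block. */
int enters_block_after(int match_type, const char *urlstr)
{
    return match_type == MATCH_SUBSTR && urlstr[0] != '\0';
}
```

With an empty URL, as used for the virtual server, enters_block_before(MATCH_SUBSTR, "") is 1 and enters_block_after(MATCH_SUBSTR, "") is 0; for a normal URL both return 1.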



Possible Fix? (Re: UdmSearch: DeleteNoServer still broken in 3.1.9)

2001-01-30 Thread Caffeinate The World

alex or serge, could you look over this patch? i believe this patch
should fix this problem described below:

---cut---
# diff -ru indexer.c.orig indexer.c
--- indexer.c.orig  Tue Jan 30 10:45:03 2001
+++ indexer.c   Tue Jan 30 10:47:29 2001
@@ -368,7 +368,7 @@
        }

        /* Find correspondent Server record from indexer.conf */
-       if(!(CurSrv=UdmFindServer(Indexer->Conf,Doc->url,aliastr))){
+       if((!(CurSrv=UdmFindServer(Indexer->Conf,Doc->url,aliastr)) && (!CurSrv->delete_no_server))){
                UdmLog(Indexer,UDM_LOG_WARN,"No 'Server' command for url... deleted.");
                if(!strcmp(CurURL.filename,"robots.txt")){
                        if(IND_OK==(result=UdmDeleteRobotsFromHost(Indexer,CurURL.hostinfo)))
---/cut---


--- Caffeinate The World [EMAIL PROTECTED] wrote:
 i reported this back in 3.1.9pre13. i have 'DeleteNoServer no' set with many
 URL's in my sql db not having associated Server commands. here i just tried to
 reindex and i see that my URL is being deleted:
 
 # indexer -m -s 200
 Indexer[2397]: indexer from mnogosearch-3.1.9/PgSQL started with
 '/usr/local/install/mnogosearch-3.1.9/etc/indexer.conf'
 jobs
 Indexer[2397]: [1] http://www.mnworkforcecenter.org/lmi/pub1/mms/index.htm
 Indexer[2397]: [1] No 'Server' command for url... deleted.
 ^C
 Received signal 2 - exit! (NOTE: i had to Ctrl-C it to stop it from deleting
 more URL's.)
 
 

Re: Possible Fix? (Re: UdmSearch: DeleteNoServer still broken in 3.1.9)

2001-01-30 Thread Caffeinate The World

oops that didn't work. but i'm pretty sure we need to test for the
condition of delete_no_server here. i also tried:

  /* Find correspondent Server record from indexer.conf */
  if(!(CurSrv=UdmFindServer(Indexer->Conf,Doc->url,aliastr))){
    if(Indexer->Conf->csrv->delete_no_server){
      UdmLog(Indexer,UDM_LOG_WARN,"No 'Server' command for url... deleted.");
      if(!strcmp(CurURL.filename,"robots.txt")){
        if(IND_OK==(result=UdmDeleteRobotsFromHost(Indexer,CurURL.hostinfo)))
          result=UdmLoadRobots(Indexer);
      }else{
        result=IND_OK;
      }
      if(result==IND_OK)result=UdmDeleteUrl(Indexer,Doc->url_id);
      FreeDoc(Doc);
      return(result);
    }
  }

---/cut---

but that didn't work either. any ideas?



Re: UdmSearch: DeleteNoServer still broken in 3.1.9

2001-01-30 Thread Alexander Barkov

That's strange. I've tested your indexer.conf. Everything works fine.
indexer does not delete this URL.



Caffeinate The World wrote:
 
 i reported this back in 3.1.9pre13. i have 'DeleteNoServer no' set with many
 URL's in my sql db not having associated Server commands. here i just tried to
 reindex and i see that my URL is being deleted:
 
 # indexer -m -s 200
 Indexer[2397]: indexer from mnogosearch-3.1.9/PgSQL started with
 '/usr/local/install/mnogosearch-3.1.9/etc/indexer.conf'
 jobs
 Indexer[2397]: [1] http://www.mnworkforcecenter.org/lmi/pub1/mms/index.htm
 Indexer[2397]: [1] No 'Server' command for url... deleted.
 ^C
 Received signal 2 - exit! (NOTE: i had to Ctrl-C it to stop it from deleting
 more URL's.)
 
 here is my full indexer.conf:
 
 ---cut---
 #Include inc1.conf
 
 DBAddr  pgsql://***:*@/work/
 DBMode cache
 #SyslogFacility local7
 LogdAddr localhost:7000
 LocalCharset iso-8859-1
 Ispellmode db
 StopwordTable stopword
 
 #ServerTable server
 
 DeleteNoServer no
 
 #Allow *
 
 #Disallow NoMatch *.state.mn.us/*
 Disallow http://www.rootsweb.com/~mn*
 Disallow http://www.wxusa.com/*
 Disallow http://www.vitalrec.com/*
 Disallow http://*yahoo.com/*
 Disallow http://*aol.com/*
 Disallow http://www.salescircular.com/*
 Disallow http://*.wellsfargo.com/*
 # Disallow any except known extensions and directory index using "regex" match:
 Disallow NoMatch Regex \/$|\/SMTMall|\.htm$|\.html$|\.shtml$|\.jhtml$|\.phtml$|\.php$|\.php3$|\.asp|\.txt$
 # Exclude cgi-bin and non-parsed-headers using "string" match:
 Disallow */cgi-bin/* *.cgi */nph-*
 # Exclude anything with '?' sign in URL. Note that '?' sign has a
 # special meaning in "string" match, so we have to use "regex" match here:
 #Disallow Regex  \?
 
 # Exclude some known extensions using fast "String" match:
 Disallow *.b    *.sh   *.md5  *.rpm
 Disallow *.arj  *.tar  *.zip  *.tgz  *.gz   *.z *.bz2
 Disallow *.lha  *.lzh  *.rar  *.zoo  *.ha   *.tar.Z
 Disallow *.gif  *.jpg  *.jpeg *.bmp  *.tiff *.tif   *.xpm  *.xbm *.pcx
 Disallow *.vdo  *.mpeg *.mpe  *.mpg  *.avi  *.movie *.mov  *.dat
 Disallow *.mid  *.mp3  *.rm   *.ram  *.wav  *.aiff  *.ra
 Disallow *.vrml *.wrl  *.png
 Disallow *.exe  *.com  *.cab  *.dll  *.bin  *.class *.ex_
 Disallow *.tex  *.texi *.xls  *.doc  *.texinfo
 Disallow *.rtf  *.pdf  *.cdf  *.ps
 Disallow *.ai   *.eps  *.ppt  *.hqx
 Disallow *.cpt  *.bms  *.oda  *.tcl
 Disallow *.o    *.a    *.la   *.so
 Disallow *.pat  *.pm   *.m4   *.am   *.css
 Disallow *.map  *.aif  *.sit  *.sea
 Disallow *.m3u  *.qt   *.mov
 
 # Exclude Apache directory list in different sort order using "string" match:
 Disallow *D=A *D=D *M=A *M=D *N=A *N=D *S=A *S=D
 
 # More complicated case. RAR .r00-.r99, ARJ a00-a99 files
 # and unix shared libraries. We use "Regex" match type here:
 Disallow Regex \.r[0-9][0-9]$ \.a[0-9][0-9]$ \.so\.[0-9]$
 
 #CheckOnly *.b    *.sh   *.md5
 #CheckOnly *.arj  *.tar  *.zip  *.tgz  *.gz
 #CheckOnly *.lha  *.lzh  *.rar  *.zoo  *.tar*.Z
 #CheckOnly *.gif  *.jpg  *.jpeg *.bmp  *.tiff
 #CheckOnly *.vdo  *.mpeg *.mpe  *.mpg  *.avi  *.movie
 #CheckOnly *.mid  *.mp3  *.rm   *.ram  *.wav  *.aiff
 #CheckOnly *.vrml *.wrl  *.png
 #CheckOnly *.exe  *.cab  *.dll  *.bin  *.class
 #CheckOnly *.tex  *.texi *.xls  *.doc  *.texinfo
 #CheckOnly *.rtf  *.pdf  *.cdf  *.ps
 #CheckOnly *.ai   *.eps  *.ppt  *.hqx
 #CheckOnly *.cpt  *.bms  *.oda  *.tcl
 #CheckOnly *.rpm  *.m3u  *.qt   *.mov
 #CheckOnly *.map  *.aif  *.sit  *.sea
 #
 # or check ANY except known text extensions using "regex" match:
 #Check NoMatch Regex \/$|\.html$|\.shtml$|\.phtml$|\.php$|\.txt$
 
 #HrefOnly */mail*.html */thread*.html
 
 UseRemoteContentType yes
 
 AddType text/plain  *.txt  *.pl *.js *.h *.c *.pm *.e
 AddType text/html   *.html *.htm *.m
 AddType image/x-xpixmap *.xpm
 AddType image/x-xbitmap *.xbm
 AddType image/gif   *.gif
 AddType Regex \.r[0-9][0-9]$
 AddType application/unknown *.*
 
 #Mime application/msword   "text/plain; charset=cp1251"   "catdoc $1"
 #Mime application/x-troff-man  text/plain "deroff"
 #Mime text/x-postscript    text/plain "ps2ascii"
 
 Period 6m
 #Tag string
 #Category FFAABBCCDD
 MaxHops 56
 MaxNetErrors 6
 ReadTimeOut 30s
 DocTimeOut 1m30s
 NetErrorDelayTime 1d
 Robots yes
 Clones yes
 BodyWeight 2
 TitleWeight 4
 KeywordWeight 8
 DescWeight 16
 #UrlWeight 16
 #UrlHostWeight 8
 #UrlPathWeight 8
 #UrlFileWeight 0
 #IspellCorrectFactor    1
 #IspellIncorrectFactor  1
 #NumberFactor 1
 #AlnumFactor  1
 #MinWordLength 1
 #MaxWordLength 32
 #DeleteBad no
 Index yes
 Follow path
 Server site http://www.state.mn.us/
 Server site http://www.exploreminnesota.com/
 Server site http://www.tpt.org/
 Server page 

Re: Possible Fix? (Re: UdmSearch: DeleteNoServer still broken in 3.1.9)

2001-01-30 Thread Alexander Barkov

This patch will not fix the problem. The problem is not here.
"DeleteNoServer no" is implemented by adding one virtual empty server
after loading indexer.conf. This means that if no other Server or Realm
command corresponds to some URL, indexer will reach this last, empty
server and will execute something like this:

   strncmp(url,Server[i].url,strlen(Server[i].url))

 where Server[i].url is an empty string. So, any URL will pass this
condition. 
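
A minimal sketch of the lookup described above, assuming a much simplified server table. The server_t struct and the find_server function are hypothetical and are not the actual UdmFindServer code. The sketch shows only that strncmp() with a length of zero returns 0, so an empty server URL placed last matches every URL that no earlier entry claimed.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified server record (not the real UDM structure). */
typedef struct {
    const char *url;
} server_t;

/* Return the index of the first server whose url is a prefix of `url`,
 * or -1 if no server matches. */
int find_server(const server_t *servers, size_t nservers, const char *url)
{
    size_t i;
    for (i = 0; i < nservers; i++) {
        /* For an empty servers[i].url the length is 0, and strncmp()
         * comparing zero characters always returns 0. */
        if (strncmp(url, servers[i].url, strlen(servers[i].url)) == 0)
            return (int)i;
    }
    return -1;
}

/* Example table: two real servers followed by the virtual empty one
 * that "DeleteNoServer no" is described as appending. */
const server_t example_servers[] = {
    { "http://www.state.mn.us/" },
    { "http://www.tpt.org/" },
    { "" }
};
```

With this table, find_server(example_servers, 3, "http://www.mnpage.com/magazines.html") falls through to the empty entry at index 2. If the table is searched without its last entry (nservers set to 2), the same call returns -1, which corresponds to the "No 'Server' command for url" case.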


 I can't reproduce the same unexpected behaviour on my box.
To debug it, please check two things:

1. function UdmAddServer in the file server.c

   add as a first statement:

  printf("AddServer '%s' %d\n",srv->url,match_type);

   and check that an empty string appears in the output after all the
"Server" arguments given in indexer.conf.


If the above works fine, then:

2. Add debugging output to the UdmFindServer function. I think it is clear
enough how it works.




Re: Possible Fix? (Re: UdmSearch: DeleteNoServer still broken in 3.1.9)

2001-01-30 Thread Alexander Barkov

Well, indexer.conf is loaded as expected.

Now find this in UdmFindServer()  :


for(i=0;i<Conf->nservers;i++){
int res;
regmatch_t subs[NS];

   and insert here:

   printf("%d '%s' %d\n",i,Conf->Server[i].url,Conf->Server[i].match_type);



Caffeinate The World wrote:
 
 # ./indexer -ma -u http://www.mnpage.com/%
 AddServer 'http://www.state.mn.us/' 17
 AddServer 'http://www.mnworkforcecenter.org/' 17
 AddServer 'http://www.exploreminnesota.com/' 17
 AddServer 'http://www.tpt.org/' 17
 AddServer 'http://www.gorp.com/gorp/location/mn/mn.htm' 17
 AddServer 'http://lists.rootsweb.com/index/usa/MN/' 17
 AddServer 'http://*.mn.us/*' 18
 AddServer '(null)' 17