UdmSearch: Bug in udmsearch-3.0.8, its improvement and other related matters

Max A.Rozhkov Mon, 20 Mar 2000 03:18:09 -0800

Hello, All!

I just tried udmsearch v3.0.8, found some bugs in it,
and have a few comments about it.

A few words about my configuration:
OS:     FreeBSD-3.4
SQL:    PostgreSQL v6.5

I decided to use postgres because it was already installed and
configured.

First, I found that not all words from my web server appeared
in the database. After some research I found a bug in src/conf.c
that makes some configuration directives (for example 'Index')
have no effect.

Sample patch:
------------------cut to file 'src/conf.c.patch'---------------------
304c304
< if(!STRNCASECMP(str,name)){ \
---
> if(!STRNCASECMP(src,name)){ \
306c306
< if(sscanf(str+7,"%s",xxx)==1){ \
---
> if(sscanf(src+strlen(name),"%s",xxx)==1){ \
------------------------------cut------------------------------------
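
The idea behind the second hunk, skipping the directive name by its real
length instead of a fixed offset of 7, can be sketched on its own. This is
not the code from src/conf.c: `parse_directive` and its buffer convention
are hypothetical, and I am assuming that STRNCASECMP compares only the
first strlen(name) characters, as POSIX strncasecmp() does here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical helper, not part of UdmSearch: if `line` starts with the
 * directive `name` (case-insensitive), copy the first whitespace-delimited
 * argument into `value` and return 1; otherwise return 0. */
static int parse_directive(const char *line, const char *name,
                           char *value, size_t value_size)
{
    size_t name_len = strlen(name);
    char fmt[32];

    assert(value_size > 1);
    if (strncasecmp(line, name, name_len) != 0)
        return 0;
    /* Build a bounded "%Ns" format so sscanf cannot overflow `value`. */
    snprintf(fmt, sizeof fmt, "%%%zus", value_size - 1);
    /* Skip exactly strlen(name) characters, not a hard-coded 7. */
    return sscanf(line + name_len, fmt, value) == 1;
}
```

A fixed offset is only right for names of exactly that length; with
strlen(name) a call such as parse_directive("Index no", "Index", buf,
sizeof buf) reads its argument from the correct position.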

The next important thing is the database size. It was too large!
For example, the web server has a ~9Mb file tree, but the db was
about ~40Mb. As far as I can tell, Postgres does not vary the size
of cells in a column of type 'character varying(xx)': every cell
seems to be sized for the declared number of chars! A workaround is
to replace this type with 'text'. After that the 'url' table became
much smaller. I think this problem may also be present in other DBMSs.

But this was not the end. I also found that the Postgres btree
index on int4 (which is used in the crc model) crashes the postmaster.
A temporary workaround is to use a hash index. But this is a bad thing -
see below.

Now the database tables 'ndict' + 'url' + 'robots' plus the indices of
'url' and 'robots' are somewhat smaller than the original web tree.
But the two indices for 'ndict' are each about the same size as 'ndict'
itself. For example, 'ndict' is ~6Mb, but the index 'ndict_url_key' is
about ~8Mb and 'ndict_word_key' is about ~5Mb! :-((( Too bad!

I'll try to download the latest version of Postgres and use it. If it is
interesting, I'll mail the results in a week.

I'm also thinking about using Oracle for index storage. Can anybody tell
me about that? Is it a bad idea? Has anybody correctly installed Linux
Oracle on a FreeBSD box and got it working? Are there any hints, bugs
or difficulties? Please tell me about them. I heard that Linux Oracle
needs about 256Mb to work - is that right?

I tried the Linux version of Inprise InterBase v4.0 under FreeBSD
(just to look at it) and found that it doesn't work with udmsearch.
I think it doesn't work at all under FreeBSD (to my happiness!)
because its libraries make some function calls that are not in
FreeBSD's libc but are in Linux's libc. Has anybody used
InterBase v6.0?

I think that DBMSs such as mSQL and MySQL don't suit me, because
they are the same as Postgres, aren't they? I also need a DBMS for
a mission-critical task. So I am thinking of testing some freely
distributed commercial DBMS that runs under FreeBSD (distributed as
FreeBSD or Linux binaries or as UNIX sources). Can anybody describe
the variants they have tested and give some statistics?

I also found that the udmsearch indexer for the crc and multi database
formats is not implemented as optimally as it could be, and works slowly
with Postgres (maybe not only with Postgres). So I rewrote the crc
variant in 'sql.c' and got some speed improvement. The patch is below.
I took the storing algorithm for the single database format and
rewrote it for the crc db format. If anyone, including the authors,
wants it, I can do the same for the multi database format.
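
The approach in the patch below is to compare the rows already stored for
a URL with the freshly parsed words and issue only the statements that are
needed, instead of deleting everything and inserting it again. Here is a
minimal in-memory sketch of that decision logic. The struct and function
names are hypothetical and no database is involved; it only counts what
would be updated, deleted and added.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory stand-ins for an ndict row and a parsed word. */
struct stored_row  { int word_id; int intag; };
struct new_word    { int word_id; int count; };  /* count==0: already handled */
struct change_plan { int updated, deleted, added; };

static struct new_word *find_word(struct new_word *words, size_t n, int word_id)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (words[i].word_id == word_id)
            return &words[i];
    return NULL;
}

/* Walk the stored rows first, then count the new words that are left over.
 * A real indexer would send UPDATE/DELETE/INSERT where the counters move. */
static struct change_plan plan_changes(const struct stored_row *rows, size_t nrows,
                                       struct new_word *words, size_t nwords)
{
    struct change_plan plan = {0, 0, 0};
    size_t i;

    assert(rows != NULL || nrows == 0);
    for (i = 0; i < nrows; i++) {
        struct new_word *w = find_word(words, nwords, rows[i].word_id);
        if (w == NULL) {
            plan.deleted++;            /* word is no longer on the page */
        } else {
            if (w->count && rows[i].intag && w->count != rows[i].intag)
                plan.updated++;        /* weight changed */
            w->count = 0;              /* already stored: do not insert again */
        }
    }
    for (i = 0; i < nwords; i++)
        if (words[i].count)
            plan.added++;              /* genuinely new word */
    return plan;
}
```

For a page that was re-indexed with few changes, most rows fall into the
"found, same weight" case and cost no statement at all, which is where the
speed improvement comes from.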

For the authors of udmsearch, I can offer an idea for a db format that
combines the advantages of the single and crc db formats. There would
be tables 'dict' and 'ndict', but the table 'dict' would have the
following format:

CREATE TABLE dict (
        word_crc        int4    NOT NULL        PRIMARY KEY,
        word            text    NOT NULL        UNIQUE
);

So there would be a table of words and a table of their occurrences in
URLs. The first table should be much smaller than 'ndict'. Now, what
is this for? With it, it is simple to arrange that if a word in the
web query string is only partially given (for example as 'hydro%n'
or as 'cultur*'), it can still be found quickly.

If the authors think that this is a good idea, I can help them
implement this feature.
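
To make the partial-word lookup concrete, here is a rough sketch of how a
user pattern could be turned into a query against the proposed 'dict'
table. `build_word_lookup` is a hypothetical helper, not existing udmsearch
code, and for simplicity it accepts only ASCII letters, digits and the two
wildcards, so national alphabets would need more work.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: turn a user pattern such as "cultur*" or "hydro%n"
 * into a query on the proposed `dict` table. Only letters, digits and the
 * two wildcards are accepted, so no SQL quoting is needed.
 * Returns 0 on success, -1 on a rejected pattern or a too-small buffer. */
static int build_word_lookup(const char *pattern, char *out, size_t out_size)
{
    char like[128];
    size_t i, n = strlen(pattern);
    int written;

    assert(out != NULL && out_size > 0);
    if (n == 0 || n >= sizeof like)
        return -1;
    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char)pattern[i];
        if (c == '*')
            like[i] = '%';             /* shell-style wildcard -> SQL wildcard */
        else if (c == '%' || isalnum(c))
            like[i] = (char)c;
        else
            return -1;                 /* refuse quotes, backslashes, spaces */
    }
    like[n] = '\0';
    written = snprintf(out, out_size,
                       "SELECT word_crc FROM dict WHERE word LIKE '%s'", like);
    return (written < 0 || (size_t)written >= out_size) ? -1 : 0;
}
```

The word_crc values returned by such a query would then be matched against
word_id in 'ndict', so the large table is still searched by integer only.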


Here is a somewhat revised version of the included PostgreSQL database
creation scripts, and a patch for 'sql.c':
---------------cut to file 'create/pgsql/create.txt'------------------
DROP table url;
DROP table dict;
DROP table robots;
DROP table stopword;
DROP SEQUENCE next_url_id;

BEGIN;

CREATE SEQUENCE next_url_id;
CREATE TABLE url (
        rec_id          int4    DEFAULT nextval('next_url_id'),
        status          int4    NOT NULL DEFAULT 0,
        url             text    NOT NULL,
        content_type    text    NOT NULL DEFAULT '',
        last_modified   text    NOT NULL DEFAULT '',
        title           text    NOT NULL DEFAULT '',
        txt             text    NOT NULL DEFAULT '',
        docsize         int4    NOT NULL DEFAULT 0,
        last_index_time int4    NOT NULL,
        next_index_time int4    NOT NULL,
        referrer        int4    NOT NULL DEFAULT 0,
        tag             int4    NOT NULL DEFAULT 0,
        hops            int4    NOT NULL DEFAULT 0,
        keywords        text    NOT NULL DEFAULT '',
        description     text    NOT NULL DEFAULT '',
        crc             char(33) NOT NULL DEFAULT '',
        lang            char(2) NOT NULL DEFAULT ' ',
        PRIMARY KEY (rec_id)
);
CREATE  UNIQUE  INDEX url_url_key       on url ( url );
CREATE          INDEX url_crc_key       on url ( crc );

CREATE TABLE dict (
        url_id          int4    NOT NULL,
        word            text    NOT NULL,
        intag           int4    NOT NULL
);
CREATE          INDEX dict_word_key     on dict ( word   );
CREATE          INDEX dict_url_key      on dict ( url_id );

CREATE TABLE robots (
        hostinfo        text    NOT NULL,
        path            text    NOT NULL
);

CREATE TABLE stopword (
        word            text    NOT NULL DEFAULT '',
        lang            char(2) DEFAULT '' NOT NULL,
        PRIMARY KEY (word, lang)
);

END;
----------------cut to file 'create/pgsql/crc.txt'--------------------
DROP TABLE ndict;

CREATE TABLE ndict (
        url_id          int4,
        word_id         int4,
        intag           int2
);
CREATE          INDEX   ndict_url_key   ON ndict USING hash (url_id);
CREATE          INDEX   ndict_word_key  ON ndict USING hash (word_id);
------------------cut to file 'src/sql.c.patch'-----------------------
*** sql.c.orig  Sun Mar 19 13:20:12 2000
--- sql.c       Sun Mar 19 14:08:55 2000
***************
*** 175,180 ****
--- 175,183 ----
  #define lock_dict(db) sql_query(db,"LOCK TABLES dict WRITE")
  #define flush_dict(db)        sql_query(db,"UNLOCK TABLES");sql_query(db,"LOCK TABLES dict WRITE")
  #define unlock_dict(db)       sql_query(db,"UNLOCK TABLES")
+ #define lock_ndict(db)        sql_query(db,"LOCK TABLES ndict WRITE")
+ #define flush_ndict(db)       sql_query(db,"UNLOCK TABLES");sql_query(db,"LOCK TABLES ndict WRITE")
+ #define unlock_ndict(db) sql_query(db,"UNLOCK TABLES")
  #define lock_url(db)  sql_query(db,"LOCK TABLES url WRITE")
  #define unlock_url(db)        sql_query(db,"UNLOCK TABLES")
  
***************
*** 268,273 ****
--- 271,279 ----
  #define lock_dict(db) sql_query(db,"BEGIN WORK")
  #define       flush_dict(db)  sql_query(db,"END WORK");sql_query(db,"BEGIN WORK")
  #define unlock_dict(db)       sql_query(db,"END WORK")
+ #define lock_ndict(db)        sql_query(db,"BEGIN WORK")
+ #define       flush_ndict(db) sql_query(db,"END WORK");sql_query(db,"BEGIN WORK")
+ #define unlock_ndict(db) sql_query(db,"END WORK")
  #define lock_url(db)  sql_query(db,"BEGIN WORK");sql_query(db,"LOCK url")
  #define unlock_url(db)        sql_query(db,"END WORK")
  static int InitDB(DB *db){
***************
*** 289,294 ****
--- 295,303 ----
        sprintf(db->errstr, "%s", PQerrorMessage(db->pgsql));
  }
  static PGresult * safe_pgsql_query(DB *db,char *q){
+ #ifdef DEBUG_SQL
+       fprintf(stderr, "QUERY: %s\n", q);
+ #endif
        if(!(db->connected)){
                InitDB(db);
                if(db->errcode)return(0);
***************
*** 339,344 ****
--- 348,356 ----
  #define lock_dict(db) 
  #define       flush_dict(db)  
  #define unlock_dict(db)       
+ #define lock_ndict(db)        
+ #define       flush_ndict(db) 
+ #define unlock_ndict(db)      
  #define lock_url(db)  
  #define unlock_url(db)
  
***************
*** 425,430 ****
--- 437,445 ----
  #define lock_dict(db)
  #define       flush_dict(db)  
  #define unlock_dict(db)       
+ #define lock_ndict(db)        
+ #define       flush_ndict(db) 
+ #define unlock_ndict(db)      
  #define lock_url(db)
  #define unlock_url(db)
  #define SQL_OK(rc)    ((rc==SQL_SUCCESS)||(rc==SQL_SUCCESS_WITH_INFO))
***************
*** 621,626 ****
--- 636,644 ----
  #define lock_dict(x)
  #define       flush_dict(x)   
  #define unlock_dict(x)        
+ #define lock_ndict(db)        
+ #define       flush_ndict(db) 
+ #define unlock_ndict(db)      
  #define lock_url(x)
  #define unlock_url(x)
  
***************
*** 939,944 ****
--- 957,965 ----
  #define lock_dict(db)
  #define       flush_dict(db)
  #define unlock_dict(db)
+ #define lock_ndict(db)        
+ #define       flush_ndict(db) 
+ #define unlock_ndict(db)      
  #define lock_url(db)
  #define unlock_url(db)
  #define SQL_OK(rc)    ((rc==OCI_SUCCESS)||(rc==OCI_SUCCESS_WITH_INFO))
***************
*** 1232,1237 ****
--- 1253,1261 ----
  #define lock_dict(x)
  #define       flush_dict(x)
  #define unlock_dict(x)
+ #define lock_ndict(db)        
+ #define       flush_ndict(db) 
+ #define unlock_ndict(db)      
  #define lock_url(x)
  #define unlock_url(x)
  static int InitDB(DB *db){
***************
*** 1602,1607 ****
--- 1626,1639 ----
                        return(&Word[i]);
        return(0);
  }
+ extern int crc32(char * buf);
+ static UDM_WORD * findcrcword(int wcur,UDM_WORD *Word,int crc){
+ int i;
+       for(i=0;i<wcur;i++)
+               if((int)crc32(Word[i].word)==crc)
+                       return(&Word[i]);
+       return(0);
+ }
  static int StoreWordsSingle(UDM_INDEXER * Indexer,int url_id){
  int i,old,new,flush,wcur;
  int were,added,deleted,updated;
***************
*** 1664,1686 ****
  
  extern int crc32(char * buf);
  static int StoreWordsSingleCRC(UDM_INDEXER * Indexer,int url_id){
! int i,wcur,res;
! UDM_WORD *Word;
  char qbuf[UDMSTRSIZ];
- char tablename[64]="ndict";
- int crc;
- 
        wcur=Indexer->nwords;
        Word=Indexer->Word;
!       if(IND_OK!=(res=DeleteWordFromURL(Indexer,url_id)))return(res);
        for(i=0;i<wcur;i++){
                if(Word[i].count){
!                       crc=(int)crc32(Word[i].word);
!                       sprintf(qbuf,"INSERT INTO %s (url_id,word_id,intag) VALUES(%d,%d,%d)",tablename,url_id,crc,Word[i].count);
                        sql_query(((DB*)(Indexer->db)),qbuf);
                        if(DBErrorCode(Indexer->db))return(IND_ERROR);
                }
        }
        return(IND_OK);
  }
  
--- 1696,1756 ----
  
  extern int crc32(char * buf);
  static int StoreWordsSingleCRC(UDM_INDEXER * Indexer,int url_id){
! int i,old,new,flush,wcur;
! int were,added,deleted,updated;
! int s;UDM_WORD *w,*Word;
! int e;
  char qbuf[UDMSTRSIZ];
        wcur=Indexer->nwords;
        Word=Indexer->Word;
!       flush=were=added=deleted=updated=0;
!       sprintf(qbuf,"SELECT word_id,intag FROM ndict WHERE url_id=%d",url_id);
!       ((DB*)(Indexer->db))->res=sql_query(((DB*)(Indexer->db)),qbuf);
!       if(DBErrorCode(Indexer->db))return(IND_ERROR);
!       if(DBUseLock)lock_ndict((DB*)(Indexer->db));
!       if(DBErrorCode(Indexer->db))return(IND_ERROR);
! 
!       were=SQL_NUM_ROWS(((DB*)(Indexer->db))->res);
!       
!       for(i=0;i<were;i++){
!               e=s=atoi(sql_value(((DB*)(Indexer->db))->res,i,0));
!               old=atoi(sql_value(((DB*)(Indexer->db))->res,i,1));
!               if((w=findcrcword(wcur,Word,s))){
!                       new=w->count;
!                       if((new)&&(old)&&(new!=old)){
!                               sprintf(qbuf,"UPDATE ndict SET intag=%d WHERE word_id=%d AND url_id='%d'",new,s,url_id);
!                               sql_query(((DB*)(Indexer->db)),qbuf);
!                               if(DBErrorCode(Indexer->db))return(IND_ERROR);
!                               updated++;
!                               flush++;
!                       }
!                       w->count=0;
!               }else{
!                       sprintf(qbuf,"DELETE FROM ndict WHERE url_id=%d AND word_id=%d",url_id,s);
!                       sql_query(((DB*)(Indexer->db)),qbuf);
!                       if(DBErrorCode(Indexer->db))return(IND_ERROR);
!                       deleted++;flush++;
!               }
!               if(flush>1024){
!                       flush_ndict(Indexer->db);
!                       flush=0;
!               }
!       }
!       SQL_FREE(((DB*)(Indexer->db))->res);
        for(i=0;i<wcur;i++){
                if(Word[i].count){
!                       sprintf(qbuf,"INSERT INTO ndict (url_id,word_id,intag) VALUES(%d,%d,%d)",url_id,(int)crc32(Word[i].word),Word[i].count);
                        sql_query(((DB*)(Indexer->db)),qbuf);
                        if(DBErrorCode(Indexer->db))return(IND_ERROR);
+                       flush++;added++;
+                       if(flush>1024){
+                               flush_ndict((DB*)(Indexer->db));
+                               flush=0;
+                       }
                }
        }
+       if(DBUseLock)unlock_ndict((DB*)(Indexer->db));
+       if(DBErrorCode(Indexer->db))return(IND_ERROR);
        return(IND_OK);
  }
-------------------------------cut------------------------------------
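
The patch relies on crc32() to turn each word into the integer word_id
stored in 'ndict'. The udmsearch implementation of crc32() is not shown in
this message, so the following is only a sketch of the common CRC-32
(reflected polynomial 0xEDB88320); that udmsearch uses exactly this variant
is an assumption, and `crc32_sketch` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 over a NUL-terminated word. Slow but table-free; a
 * production version would normally use a 256-entry lookup table. */
static uint32_t crc32_sketch(const char *word)
{
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;
    int bit;

    assert(word != NULL);
    for (i = 0; word[i] != '\0'; i++) {
        crc ^= (unsigned char)word[i];
        for (bit = 0; bit < 8; bit++)
            /* Shift right; XOR in the polynomial when the low bit was set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

Because the result is only 32 bits wide, two different words can end up
with the same word_id. The crc format cannot tell such words apart, which
is one more argument for keeping the real words in a separate 'dict' table.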



--
With best wishes,
Max Rozhkov.
