Hello, All!

I just tried udmsearch v3.0.8, found some bugs in it, and want to say a few words about it.

My configuration:
  OS:  FreeBSD-3.4
  SQL: PostgreSQL v6.5

I decided to use Postgres because it was already installed and configured.

First, I found that not all words from my web server appeared in the database. After some research I found a bug in src/conf.c that makes some configuration instructions (for example 'Index') have no effect. Sample patch:

------------------cut to file 'src/conf.c.patch'---------------------
304c304
< if(!STRNCASECMP(str,name)){ \
---
> if(!STRNCASECMP(src,name)){ \
306c306
< if(sscanf(str+7,"%s",xxx)==1){ \
---
> if(sscanf(src+strlen(name),"%s",xxx)==1){ \
------------------------------cut------------------------------------

The next important thing is the database size. It was too large! For example, the web server has a ~9Mb file tree, but the db was about ~40Mb. I found that Postgres doesn't vary the size of cells in a column of type 'character varying(xx)': every cell is always sized for the given number of chars! A workaround is to replace this type with 'text'. After that the 'url' table became much smaller. I think this problem may also be present in other dbms.

But that was not the end. I also found that Postgres's btree index on int4 (which is used in the crc model) crashes the postmaster. A temporary workaround is to use a hash index. But this is a bad thing - see below.

Now the database tables 'ndict' + 'url' + 'robots' + the indices of 'url' and 'robots' are somewhat smaller than the original web tree. But the two indices for 'ndict' are about the same size as 'ndict' itself. For example, 'ndict' is ~6Mb, but the index 'ndict_url_key' is about ~8Mb and 'ndict_word_key' is about ~5Mb! :-((( Too bad!

I'll try to download the latest version of Postgres and use it. If it is interesting, I'll mail the results in a week.

Also I'm thinking about using Oracle for index storage.
Can anybody tell me about that? Is it a bad idea? Has anybody installed Linux Oracle on a FreeBSD box correctly and got it working? Are there any hints, bugs or difficulties? Please tell me about them. I heard that Linux Oracle needs about 256Mb to work - is that right?

I tried the Linux version of Inprise InterBase v4.0 under FreeBSD (just to look at it) and found that it doesn't work with udmsearch. I think it doesn't work at all under FreeBSD (to my happiness!) because its libraries use some function calls that aren't in FreeBSD's libc but are in Linux's libc. Has anybody used InterBase v6.0?

I think that dbms such as msql and mysql don't suit me, because they are the same as Postgres, aren't they? Also, I need a dbms for a mission-critical task. So I'm thinking of testing some freely distributed commercial dbms that runs under FreeBSD (distributed as FreeBSD or Linux binaries or UNIX sources). Can anybody tell me about tested variants and give some statistics?

I also found that the udmsearch indexer for the crc and multi database formats is not implemented as optimally as it could be and works slowly with Postgres (maybe not only with Postgres). So I rewrote the crc variant in 'sql.c' and got some speed improvement. The patch is below. I took the storing algorithm for the single database format and rewrote it for the crc db format. If anyone, including the authors, wants it, I can do the same for the multi database format.

For the authors of udmsearch, I can offer an idea for a db format that combines the advantages of the single and crc db formats. There will be tables 'dict' and 'ndict', but the table 'dict' has the following format:

CREATE TABLE dict (
        word_crc int4 NOT NULL PRIMARY KEY,
        word text NOT NULL UNIQUE
);

So there will be a table of words and a table of their occurrences in urls. The first table should be much smaller than 'ndict'. Now, what is this for.
It is then simple to implement that if a word in the web query string is given partially (for example as 'hydro%n' or as 'cultur*'), it can be found quickly. If the authors think this is a good idea, I can help them implement this feature.

Here is a somewhat revised version of the included pgsql database creation scripts, and the patch for 'sql.c':

---------------cut to file 'create/pgsql/create.txt'------------------
DROP table url;
DROP table dict;
DROP table robots;
DROP table stopword;
DROP SEQUENCE next_url_id;

BEGIN;

CREATE SEQUENCE next_url_id;

CREATE TABLE url (
        rec_id int4 DEFAULT nextval('next_url_id'),
        status int4 NOT NULL DEFAULT 0,
        url text NOT NULL,
        content_type text NOT NULL DEFAULT '',
        last_modified text NOT NULL DEFAULT '',
        title text NOT NULL DEFAULT '',
        txt text NOT NULL DEFAULT '',
        docsize int4 NOT NULL DEFAULT 0,
        last_index_time int4 NOT NULL,
        next_index_time int4 NOT NULL,
        referrer int4 NOT NULL DEFAULT 0,
        tag int4 NOT NULL DEFAULT 0,
        hops int4 NOT NULL DEFAULT 0,
        keywords text NOT NULL DEFAULT '',
        description text NOT NULL DEFAULT '',
        crc char(33) NOT NULL DEFAULT '',
        lang char(2) NOT NULL DEFAULT ' ',
        PRIMARY KEY (rec_id)
);
CREATE UNIQUE INDEX url_url_key on url ( url );
CREATE INDEX url_crc_key on url ( crc );

CREATE TABLE dict (
        url_id int4 NOT NULL,
        word text NOT NULL,
        intag int4 NOT NULL
);
CREATE INDEX dict_word_key on dict ( word );
CREATE INDEX dict_url_key on dict ( url_id );

CREATE TABLE robots (
        hostinfo text NOT NULL,
        path text NOT NULL
);

CREATE TABLE stopword (
        word text NOT NULL DEFAULT '',
        lang char(2) DEFAULT '' NOT NULL,
        PRIMARY KEY (word, lang)
);

END;
----------------cut to file 'create/pgsql/crc.txt'--------------------
DROP TABLE ndict;
CREATE TABLE ndict (
        url_id int4,
        word_id int4,
        intag int2
);
CREATE INDEX ndict_url_key ON ndict USING hash (url_id);
CREATE INDEX ndict_word_key ON ndict USING hash (word_id);
------------------cut to file 'src/sql.c.patch'-----------------------
*** sql.c.orig	Sun Mar 19 13:20:12 2000
--- sql.c	Sun Mar 19 14:08:55 2000
***************
*** 175,180 ****
--- 175,183 ----
  #define lock_dict(db) sql_query(db,"LOCK TABLES dict WRITE")
  #define flush_dict(db) sql_query(db,"UNLOCK TABLES");sql_query(db,"LOCK TABLES dict WRITE")
  #define unlock_dict(db) sql_query(db,"UNLOCK TABLES")
+ #define lock_ndict(db) sql_query(db,"LOCK TABLES ndict WRITE")
+ #define flush_ndict(db) sql_query(db,"UNLOCK TABLES");sql_query(db,"LOCK TABLES ndict WRITE")
+ #define unlock_ndict(db) sql_query(db,"UNLOCK TABLES")
  #define lock_url(db) sql_query(db,"LOCK TABLES url WRITE")
  #define unlock_url(db) sql_query(db,"UNLOCK TABLES")
***************
*** 268,273 ****
--- 271,279 ----
  #define lock_dict(db) sql_query(db,"BEGIN WORK")
  #define flush_dict(db) sql_query(db,"END WORK");sql_query(db,"BEGIN WORK")
  #define unlock_dict(db) sql_query(db,"END WORK")
+ #define lock_ndict(db) sql_query(db,"BEGIN WORK")
+ #define flush_ndict(db) sql_query(db,"END WORK");sql_query(db,"BEGIN WORK")
+ #define unlock_ndict(db) sql_query(db,"END WORK")
  #define lock_url(db) sql_query(db,"BEGIN WORK");sql_query(db,"LOCK url")
  #define unlock_url(db) sql_query(db,"END WORK")
  static int InitDB(DB *db){
***************
*** 289,294 ****
--- 295,303 ----
      sprintf(db->errstr, "%s", PQerrorMessage(db->pgsql));
  }
  static PGresult * safe_pgsql_query(DB *db,char *q){
+ #ifdef DEBUG_SQL
+     fprintf(stderr, "QUERY: %s\n", q);
+ #endif
      if(!(db->connected)){
          InitDB(db);
          if(db->errcode)return(0);
***************
*** 339,344 ****
--- 348,356 ----
  #define lock_dict(db)
  #define flush_dict(db)
  #define unlock_dict(db)
+ #define lock_ndict(db)
+ #define flush_ndict(db)
+ #define unlock_ndict(db)
  #define lock_url(db)
  #define unlock_url(db)
***************
*** 425,430 ****
--- 437,445 ----
  #define lock_dict(db)
  #define flush_dict(db)
  #define unlock_dict(db)
+ #define lock_ndict(db)
+ #define flush_ndict(db)
+ #define unlock_ndict(db)
  #define lock_url(db)
  #define unlock_url(db)
  #define SQL_OK(rc) ((rc==SQL_SUCCESS)||(rc==SQL_SUCCESS_WITH_INFO))
***************
*** 621,626 ****
--- 636,644 ----
  #define lock_dict(x)
  #define flush_dict(x)
  #define unlock_dict(x)
+ #define lock_ndict(db)
+ #define flush_ndict(db)
+ #define unlock_ndict(db)
  #define lock_url(x)
  #define unlock_url(x)
***************
*** 939,944 ****
--- 957,965 ----
  #define lock_dict(db)
  #define flush_dict(db)
  #define unlock_dict(db)
+ #define lock_ndict(db)
+ #define flush_ndict(db)
+ #define unlock_ndict(db)
  #define lock_url(db)
  #define unlock_url(db)
  #define SQL_OK(rc) ((rc==OCI_SUCCESS)||(rc==OCI_SUCCESS_WITH_INFO))
***************
*** 1232,1237 ****
--- 1253,1261 ----
  #define lock_dict(x)
  #define flush_dict(x)
  #define unlock_dict(x)
+ #define lock_ndict(db)
+ #define flush_ndict(db)
+ #define unlock_ndict(db)
  #define lock_url(x)
  #define unlock_url(x)
  static int InitDB(DB *db){
***************
*** 1602,1607 ****
--- 1626,1639 ----
              return(&Word[i]);
      return(0);
  }
+ extern int crc32(char * buf);
+ static UDM_WORD * findcrcword(int wcur,UDM_WORD *Word,int crc){
+     int i;
+     for(i=0;i<wcur;i++)
+         if((int)crc32(Word[i].word)==crc)
+             return(&Word[i]);
+     return(0);
+ }
  static int StoreWordsSingle(UDM_INDEXER * Indexer,int url_id){
      int i,old,new,flush,wcur;
      int were,added,deleted,updated;
***************
*** 1664,1686 ****
  extern int crc32(char * buf);
  static int StoreWordsSingleCRC(UDM_INDEXER * Indexer,int url_id){
!     int i,wcur,res;
!     UDM_WORD *Word;
      char qbuf[UDMSTRSIZ];
-     char tablename[64]="ndict";
-     int crc;
-
      wcur=Indexer->nwords;
      Word=Indexer->Word;
!     if(IND_OK!=(res=DeleteWordFromURL(Indexer,url_id)))return(res);
      for(i=0;i<wcur;i++){
          if(Word[i].count){
!             crc=(int)crc32(Word[i].word);
!             sprintf(qbuf,"INSERT INTO %s (url_id,word_id,intag) VALUES(%d,%d,%d)",tablename,url_id,crc,Word[i].count);
              sql_query(((DB*)(Indexer->db)),qbuf);
              if(DBErrorCode(Indexer->db))return(IND_ERROR);
          }
      }
      return(IND_OK);
  }
--- 1696,1756 ----
  extern int crc32(char * buf);
  static int StoreWordsSingleCRC(UDM_INDEXER * Indexer,int url_id){
!     int i,old,new,flush,wcur;
!     int were,added,deleted,updated;
!     int s;UDM_WORD *w,*Word;
!     int e;
      char qbuf[UDMSTRSIZ];
      wcur=Indexer->nwords;
      Word=Indexer->Word;
!     flush=were=added=deleted=updated=0;
!     sprintf(qbuf,"SELECT word_id,intag FROM ndict WHERE url_id=%d",url_id);
!     ((DB*)(Indexer->db))->res=sql_query(((DB*)(Indexer->db)),qbuf);
!     if(DBErrorCode(Indexer->db))return(IND_ERROR);
!     if(DBUseLock)lock_ndict((DB*)(Indexer->db));
!     if(DBErrorCode(Indexer->db))return(IND_ERROR);
!
!     were=SQL_NUM_ROWS(((DB*)(Indexer->db))->res);
!
!     for(i=0;i<were;i++){
!         e=s=atoi(sql_value(((DB*)(Indexer->db))->res,i,0));
!         old=atoi(sql_value(((DB*)(Indexer->db))->res,i,1));
!         if((w=findcrcword(wcur,Word,s))){
!             new=w->count;
!             if((new)&&(old)&&(new!=old)){
!                 sprintf(qbuf,"UPDATE ndict SET intag=%d WHERE word_id=%d AND url_id='%d'",new,s,url_id);
!                 sql_query(((DB*)(Indexer->db)),qbuf);
!                 if(DBErrorCode(Indexer->db))return(IND_ERROR);
!                 updated++;
!                 flush++;
!             }
!             w->count=0;
!         }else{
!             sprintf(qbuf,"DELETE FROM ndict WHERE url_id=%d AND word_id=%d",url_id,s);
!             sql_query(((DB*)(Indexer->db)),qbuf);
!             if(DBErrorCode(Indexer->db))return(IND_ERROR);
!             deleted++;flush++;
!         }
!         if(flush>1024){
!             flush_ndict(Indexer->db);
!             flush=0;
!         }
!     }
!     SQL_FREE(((DB*)(Indexer->db))->res);
      for(i=0;i<wcur;i++){
          if(Word[i].count){
!             sprintf(qbuf,"INSERT INTO ndict (url_id,word_id,intag) VALUES(%d,%d,%d)",url_id,(int)crc32(Word[i].word),Word[i].count);
              sql_query(((DB*)(Indexer->db)),qbuf);
              if(DBErrorCode(Indexer->db))return(IND_ERROR);
+             flush++;added++;
+             if(flush>1024){
+                 flush_ndict((DB*)(Indexer->db));
+                 flush=0;
+             }
          }
      }
+     if(DBUseLock)unlock_ndict((DB*)(Indexer->db));
+     if(DBErrorCode(Indexer->db))return(IND_ERROR);
      return(IND_OK);
  }
-------------------------------cut------------------------------------

--
With best wishes, Max Rozhkov.