fellow php programmers, sorry to you two who were kind enough to take the time to attempt to answer my question. i thought it was enough to say "i had 550 lines with a bunch of sql calls and loops and big data structures"
to help you make a better assessment i will summarize in more detail now. for my biggest file ( http://www.myowndictionary.com/index.php?old_cfile=rss&cfile=rss&rss_id=198&cid=242&from_lang=korean ) i get this: [EMAIL PROTECTED] mysql]# tail -f /var/lib/mysql/mysqld_query.log > out [EMAIL PROTECTED] mysql]# cat out | wc -l 15910 <<< ----------- that's line count [EMAIL PROTECTED] mysql]# head out 9795 Query select word,def,wordid,pos,posn,wordsize,syn from korean_english where word like '운전할 %' order by wordsize desc 9795 Query select word,def,wordid,pos,posn,wordsize,syn from korean_english where word like '운전할' order by wordsize desc 9795 Query select word,def,wordid,pos,posn,wordsize,syn from korean_english where word like ' %' order by wordsize desc 9795 Query select word,def,wordid,pos,posn,wordsize,syn from korean_english where word like '수 %' order by wordsize desc 9795 Query select word,def,wordid,pos,posn,wordsize,syn from korean_english where word like '수' order by wordsize desc takes 6 minutes [EMAIL PROTECTED] current]$ date 2006. 03. 13. (월) 20:25:16 KST [EMAIL PROTECTED] current]$ date 2006. 03. 13. (월) 20:31:37 KST [EMAIL PROTECTED] current]$ in general, you will see what is involved in processing each token of text and sending that token through a match against databases .... and processing results: a summary of the code is: mainLoop starts, it breaks some text into tokens using my_mb_preg_split. (line 3) the reg exp is like $split_pattern = "/((?:[\.!\s\?,:-]|\\\"는)+)/"; mb_preg_split i'm sure is broken, but haven't prepared a legal case at this time ;) then, it sends each token to CallmatchThis (line 14) which calls matchThis (line 27 - 47 below) matchThis may be called twice (2 sql queeries) (line 51) select * where word = '$token' and another (take that, but if it's not there .... issue the next sql ) (line 55) select * where word like '$token%'; those can result sometimes in arrays from mysql of ~ what 100 words and 6 fields each word .... for those queeries. then all that array is narrowed down by finding the longest word (yes, a field in the DB is pre-set with that value) .... there is an algo involved, i cannot simply select the longest match, because i am also selecting against not just that single token but against the following tokens as well, to stop getting a match for "hair" if "hair of the dog that bit you" is in the databse, i want the longer match. that was a necessary embellishment. the array of arrays returned from matchThis are processed again by MainLoop at line 16 to produce a javascript overlib for every word. as for indexes, i only search against word, so i have an index for that. ----------------------------------------------------------------------------------------------------------------------------------------------------------- ------------------------------------------+ | korean_english | CREATE TABLE `korean_english` ( `wordid` int(11) NOT NULL auto_increment, `word` varchar(130) default NULL, `syn` varchar(190) default NULL, `def` blob, `posn` int(2) default '1', `pos` varchar(13) default '1', `submitter` varchar(25) default NULL, `doe` datetime NOT NULL default '0000-00-00 00:00:00', `wordsize` tinyint(3) unsigned default NULL, PRIMARY KEY (`wordid`), KEY `word_idx` (`word`), KEY `wordid_idx` (`wordid`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8 | +----------------+----------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------------- 1 function MainLoop() 2 3 -> $text_wordArr = my_mb_preg_split($subject, $utf8_split_pattern,$debug_my_mb); 4 while (!empty($text_word) ) { 5 // somehow this allows matches for 'in the bag' (idiom) by some algorithm or other i have 6 if (preg_match($split_pattern,$text_word) 7 && (!(preg_match("/^$one_word_matches$/", $text_wordArr[$i-2][0]) && ($text_wordArr[$i-1][0]==' '))) // gobble up the space follow ing those in such list, for previuos one 8 && (!(preg_match("/^$one_word_matches$/", $text_wordArr[$i-3][0])) && ($text_wordArr[$i-2][0]==' '))) { // try to limit it to jus t blanks and quotes, other such garbage for previous TWO 9 } else if ( preg_match("/^$one_word_matches$/i", $text_word)) { // case insensitive. this eliminates searching for the% instead of "the One %" 10 } else { // else do matching and processing as normal 11 // section to match against this word and its array of 12 // all possible grammatical variations on the form of the word 13 // start with the longest first 14 $matchArr = callMatchThis($db, $text_word, $match_count,$subject,$debug_matchThis_detail,$debug_matchThis_general, $debug_matchThis_dbArr ay, $debug_array_of_possibles, $debug_matchThis_great_detail); 15 // make the overlib function 16 $sideviewStr = dict_get_sideview($sideview_word_to_replace, $matchArr, $debug_sideview); // overlib stuff, all the href tag 17 }}}} 18 19 // end MainLoop function 20 21 22 // find possible matches for submitted query 23 // run query, get db resultArray and match for longest match length 24 // against $subject 25 // return best match in array form as $matchArr 26 // vary text_word as well as db_word trying to match everything possible (by max length order, of course) 27 function matchThis(&$db, $query_part1, $query_part2,$text_word,& $match_count,$subject,$debug_matchThis_detail,$debug_matchThis_general, $debug_match This_dbArray, $debug_array_of_possibles, $debug_matchThis_great_detail ) { 28 for ($idx_outer =0,$text_word_variation = $text_word_variationArr[$idx_outer]; $idx_outer < count($text_word_variationArr) && !$major_break; $i dx_outer++) { 29 30 $text_word_variation = $text_word_variationArr[$idx_outer]; 31 $query = $query_part1 . $text_word_variation . $query_part2; 32 // run a query on each variation 33 $dbTWVArr = db_getAssoc($db, $query); // gives an array of arrays 34 while ((list($db_word, $valArr) = each($dbTWVArr)) && ! $major_break) { // key is a word, $val is an array 35 if ($debug_matchThis_detail) echo "starting search for db_word***".$db_word."***on subject***".substr($subject,0,13)."***<br>"; 36 if (preg_match("/^$db_word_match/i",$subject)) { // check for match on this possibly 1+ word "db_word" on all remaining text 37 for ($index=0; $index < count($valArr); $index ++) { 38 // code to select longest db_word out of those words which do match 39 $matchArr[$match_count]['text_word'] = $text_word; // from way up on top!! as arg to function,the original word from the text 40 $matchArr[$match_count]['db_word'] = $db_word; // 41 // get other stuff from table as well 42 $matchArr[$match_count]['word_id'] = $valArr[$index][1]; 43 $matchArr[$match_count]['pos'] = $valArr[$index][2]; 44 $matchArr[$match_count]['pos_number'] = $valArr[$index][3]; 45 } 46 } 47 return $matchArr; 48 } 49 50 function callMatchThis(&$db, $text_word,&$match_count,$subject, $debug_matchThis_detail,$debug_matchThis_general, $debug_matchThis_dbArray, $debug_ar ray_of_possibles, $debug_matchThis_great_detail) { 51 $query_part1 = "select word,def,wordid,pos,posn,wordsize,syn from $def_table where word like '"; 52 $query_part2 = " %' order by wordsize desc"; 53 $matchArr = matchThis($db, $query_part1, $query_part2, $text_word,$match_count,$subject,$debug_matchThis_detail, $debug_matchThis_general,$de bug_matchThis_dbArray, $debug_array_of_possibles,$debug_matchThis_great_detail); 54 if (!$match_count) { 55 $query_part1 = "select word,def,wordid,pos,posn,wordsize,syn from $def_table where 56 $query_part2 = "' order by wordsize desc "; 57 $matchArr = matchThis($db, $query_part1, $query_part2,$text_word,$match_count,$subject,$debug_matchThis_detail, $debug_matchThis_general ,$debug_matchThis_dbArray, $debug_array_of_possibles,$debug_matchThis_great_detail); 58 if ($debug_matchThis_general) { echo "after matchThis#2**match_count***".$match_count."<br>"; } 59 } 60 61 } 62 63 // unrelated function which makes the javascript overlib 64 65 function dict_get_sideview($word_in, $defArr,$debug_sideview=0) { 66 } 67 68 2006-03-13 (월), 10:16 +0100, Jochem Maas 쓰시길: > joseph wrote: > > hi, > > my site is www.myowndictionary.com > > i use javascript to create definitions for words (from open source > > dictionaries) as pop-ups and hook word-lists for vocabulary study with > > that. > > i can now parse the html from rss feeds and match only text of interest. > > i just wrote my own multibyte html parser by combining various other > > code i already had on hand for the parsing of multibyte text anyhow. > > but, that's not the slowest part. that takes maybe 3 seconds to get the > > page, 3 seconds to parse it, but it sometimes takes 5 minutes for the > > next step on huge rss files. > > (I know the time because they are cached) > > that last step... > > The slow part that's left is the dictionary searching itself. It's the > > parsing of complicated php code -- 550 lines of complex data structure > > loops, function calls, looping through arrays to make sql queeries -- > > lots of sql queeries.... > > the seconds ~ 5 minute it takes it all takes so much time. > > use the 'EXPLAIN' syntax of MySQL to figure out if there are any bottlenecks > related to crap or 'missing' indexes with regard to the sql queries you > mention > above. - chances are adding a couple of well placed indexes will probably > shave massive ammounts from the process time. > > > > > i want to ask for opinion about: > > 1) rewriting the php as C extension. Would it slow my down to the below > > 15 second range? > > How Long is a chinaman. How Long is a chinaman? > yes the C extension would be much faster (assuming you wrote it correctly - > but it's much harder to write and maintain) > > > 2) does mysql have server-side functions yet? > > there are stored procedures in mysql5. > > > 3) if i moved the php code off mysql to postgresql and wrote server-side > > functions (which i've been trained to do), how much of a time > > improvement would we be talking about? > > how on gods' (or whoever runs the place these days) earth can we tell > whether an unwritten stored procedure in postgres would be faster or slower > than an unknown (to us) piece of php code? > > that said I doubt postgres will beat mysql in terms of pure speed assuming > you db schema is sound (indexes are good, etc). > > > > > please send me your input! > > i am at a growing stage and need direction here. > > because any of those steps will require considerable time. > > it's monday morning here too ;-) > > interesting language tool your building, you might consider repackaging it > in a modular form so that other people could implement in their own site. > (you do call it 'open source') > > rgds, > Jochem > > > > > thank you. > > > > joseph. > > > -- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php