Re: [PHP] what would a c extension buy me

2006-03-13 Thread Ben Litton
Something that is chiefly bottlenecked at the database isn't going to be  
improved enough at the C level to be worth the trouble. Always optimize  
where it's slow. Database design and indexing helps. Minimizing  
unnecessary queries, writing good ones, helps too.


Once you get to the point that your code is the problem, you should look  
into your algorithms to see if they're as fast as they should be before  
you dive into C. A crappy algorithm will run like crap in any language. I  
had a program in C run in three minutes which was, in my estimation,  
crappy. I re-worked it and used a more sensible algorithm and it now runs  
in less than a second. Obviously it wasn't the language.


In the end maybe an extension will be the way to go for you, depending on  
your needs, but do everything else you can first.


Just my opinion,
Ben

On Mon, 13 Mar 2006 08:04:53 -0500, Jochem Maas <[EMAIL PROTECTED]>  
wrote:



...


word,def,wordid,pos,posn,wordsize,syn from  korean_english  where word
like '운전할 %'  order by wordsize desc


oh would you look at this
you're ordering by WORDSIZE.
stick an index on WORDSIZE!!!




http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html


In some cases, MySQL cannot use indexes to resolve the ORDER BY,


note 'In Some Cases'.

so stick an index on WORDSIZE and find out.


although it still uses indexes to find the rows that match the WHERE
clause. These cases include the following:

The key used to fetch the rows is not the same as the one used in the
ORDER BY:

SELECT * FROM t1 WHERE key2=constant ORDER BY key1;

2)



you have an iceballs' chance in hell that I'm going to even read
the 500+ lines of code that followed here ... let alone try to
optimize it. ;-)




it was just a 40 line summary of 550 lines of code 


ah, talk about being caught out :-)
regardless a 40line summary won't cut it either - you have to take
the block as a whole.





questions i may deem myself to answer:


) i get this:
[EMAIL PROTECTED] mysql]# tail -f /var/lib/mysql/mysqld_query.log  > out

[EMAIL PROTECTED] mysql]# cat out | wc -l
15910<<< --- that's line count


how many queries?


15, 910 queries


so roughly (15 * 60) seconds to run 15,000+
queries and do the processing? that doesn't actually
sound so bad.





things already taken care of:
1)


  9795 Query   select
word,def,wordid,pos,posn,wordsize,syn from  korean_english where word
like '운전할'  order by wordsize desc


in cases when you are not using the wildcard tokens (percentage signs)
try changing the query to use something like:

... word = '운전할' ...



your suggestion and a line from the query_log match exactly.


2)





then, it sends each token to CallmatchThis (line 14) which calls
matchThis (line 27 - 47 below)
matchThis may be called twice (2 sql queeries)
(line 51) select * where word = '$token' and another  (take that, but  
if

it's not there  issue the next sql )
(line 55) select * where word like '$token%';


Dont do "SELECT *" - always explicitly specify the fields you want.



that was just a paraphrase. the previous email points to the line number
of the code summary


never paraphrase code - you only end up with smart ass comments like  
mine!


have you reordered you fields in the db yet? adn made as many VARCHARs as
possible into CHARs?








--+
| korean_english | CREATE TABLE `korean_english` (
 `wordid` int(11) NOT NULL auto_increment,
 `word` varchar(130) default NULL,
 `syn` varchar(190) default NULL,
 `def` blob,
 `posn` int(2) default '1',
 `pos` varchar(13) default '1',
 `submitter` varchar(25) default NULL,
 `doe` datetime NOT NULL default '-00-00 00:00:00',
 `wordsize` tinyint(3) unsigned default NULL,
 PRIMARY KEY  (`wordid`),
 KEY `word_idx` (`word`),
 KEY `wordid_idx` (`wordid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 |


reorder the fields so that the VARCHARS are at the end of the
table (and the BLOB field at the very, very end of the table).

also change VARCHARs to CHARs where you can.



++-
---
-





1 function MainLoop()







--

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] what would a c extension buy me

2006-03-13 Thread Curt Zirzow
On Mon, Mar 13, 2006 at 10:32:55PM +0900, joseph wrote:
> mr. maas,
> 
> psychic me  you are a man.
> 
> i created an index, no change.  but i already knew that because one of
> the cases where an index is never used is when 
> > The key used to fetch the rows is not the same as the one used in the
> > > ORDER BY: 
> > 
> > which is the case here.
> 
> (index creation stuff follows)
> 
> i'm just going to bite the bullet for a few months and do a massive code
> rehaul.  move all the matching code preferably to the sql-server.   or
> pre-compiled C code for the php [ make an extension ].  i think the mb_
> functions in php are very slow ( reportedly ) ... so move those
> somewhere else is a good idea.  maybe the 'select ... like' clauses to
> the sql-server 

Personally I think you are making a wrong decision.

If you think having a store_procedure of what you are wanting to do
it may buy you perhaps 1 second of compile time. The problem is
design.

Put stuff in a pre-compile C code for php, well, i'd like to see
that happen but besides that, let me ask you, if you have query 'A'
that you are exeucting in php. how long does query 'A' take to
execute without using php?

blaming mb_* functions on speed isn't even the question of your
problems.

Curt.
-- 
cat .signature: No such file or directory

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] what would a c extension buy me

2006-03-13 Thread joseph
mr. maas,

psychic me  you are a man.

i created an index, no change.  but i already knew that because one of
the cases where an index is never used is when 
> The key used to fetch the rows is not the same as the one used in the
> > ORDER BY: 
> 
> which is the case here.

(index creation stuff follows)

i'm just going to bite the bullet for a few months and do a massive code
rehaul.  move all the matching code preferably to the sql-server.   or
pre-compiled C code for the php [ make an extension ].  i think the mb_
functions in php are very slow ( reportedly ) ... so move those
somewhere else is a good idea.  maybe the 'select ... like' clauses to
the sql-server 
maybe the improvements you suggest will make a difference, maybe not,
but  when i think about how fast mysql runs its  

mysqlimport --local --fields-terminated-by='*X*X' dict_explicit
french_english french_english 

compared with 

mysql < a.sql  (bunches of insert statements)

that's the kind of improvement i am hoping for.  more than 500 queeries
a second i am getting now, i need to really really improve that if i am
going to parse html on the fly with the technology i have.




mysql> describe korean_english;
+---+-+--+-+-++
| Field | Type| Null | Key | Default |
Extra  |
+---+-+--+-+-++
| wordid| int(11) |  | PRI | NULL|
auto_increment |
| word  | varchar(130)| YES  | MUL | NULL|
|
| syn   | varchar(190)| YES  | | NULL|
|
| def   | blob| YES  | | NULL|
|
| posn  | int(2)  | YES  | | 1   |
|
| pos   | varchar(13) | YES  | | 1   |
|
| submitter | varchar(25) | YES  | | NULL|
|
| doe   | datetime|  | | -00-00 00:00:00 |
|
| wordsize  | tinyint(3) unsigned | YES  | | NULL|
|
+---+-+--+-+-++
9 rows in set (0.00 sec)

mysql> explain select
-> word,def,wordid,pos,posn,wordsize,syn from  korean_english  where
word
-> like '운전할 %'  order by wordsize desc;
++-++---+---+--+-+--+--+-+
| id | select_type | table  | type  | possible_keys | key  |
key_len | ref  | rows | Extra   |
++-++---+---+--+-+--+--+-+
|  1 | SIMPLE  | korean_english | range | word_idx  | word_idx |
391 | NULL |1 | Using where; Using filesort |
++-++---+---+--+-+--+--+-+
1 row in set (0.01 sec)

mysql> CREATE INDEX wordsize_index USING BTREE ON
korean_english(wordsize);
Query OK, 205265 rows affected (11.16 sec)
Records: 205265  Duplicates: 0  Warnings: 0

mysql> explain select word,def,wordid,pos,posn,wordsize,syn from
korean_english  where word like '운전할 %'  order by wordsize desc;e);
++-++---+---+--+-+--+--+-+
| id | select_type | table  | type  | possible_keys | key  |
key_len | ref  | rows | Extra   |
++-++---+---+--+-+--+--+-+
|  1 | SIMPLE  | korean_english | range | word_idx  | word_idx |
391 | NULL |1 | Using where; Using filesort |
++-++---+---+--+-+--+--+-+
1 row in set (0.00 sec)

mysql>   

2006-03-13 (월), 14:04 +0100, Jochem Maas 쓰시길:
> ...
> 
> >>>word,def,wordid,pos,posn,wordsize,syn from  korean_english  where word
> >>>like '운전할 %'  order by wordsize desc
> >>
> >>oh would you look at this
> >>you're ordering by WORDSIZE.
> >>stick an index on WORDSIZE!!!
> > 
> > 
> > 
> > http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
> > 
> > 
> > In some cases, MySQL cannot use indexes to resolve the ORDER BY,
> 
> note 'In Some Cases'.
> 
> so stick an index on WORDSIZE and find out.
> 
> > although it still uses indexes to find the rows that match the WHERE
> > clause. These cases include the following: 
> > 
> > The key used to fetch the rows is not the same as the one used in the
> > ORDER BY: 
> > 
> > SELECT * FROM t1 WHERE key2=constant ORDER BY key1;
> > 
> > 2) 
> > 
> > 
> >>you have an iceballs' chance in hell that I'm going to even read
> >>the 500+ lines of code that f

Re: [PHP] what would a c extension buy me

2006-03-13 Thread Jochem Maas
joseph wrote:
> sorry, i made a mistake before.
> 
> 
>>>   9795 Query   select
>>>word,def,wordid,pos,posn,wordsize,syn from  korean_english where word
>>>like '운전할'  order by wordsize desc
>>
>>in cases when you are not using the wildcard tokens (percentage signs)
>>try changing the query to use something like:
>>
>>  ... word = '운전할' ...
>>
> 
> 
> i have to stick with 'like' because it matches case-insensitive, else
> the first word of a sentence (with a capital letter) doesn't match the
> db word.  FYI

do you really care about case (in languages where characters even have case)?
if not stick everything into the DB as uppercase and test against that (always 
normalizing
everything to uppercase before doing the queries). it would be faster than a
case-insensitive search

also consider it might be worth creating a special index for the first (couple?)
of letters of the WORD field, that would allow something like this (not in all 
cases obviously):

SELECT * FROM korean_english WHERE word = '운전할' AND SUBSTR(word,0,1) = 
'운'

(now I'm allowed to sue 'SELECT * ...' because I'm offering a possible
concept/idea - as opposed to pasting code for people look at) - basically the 
idea
entails creating an index (possibly based on an extra column) that would allow 
the
mysql engine to start searching in a smaller subset based on the 2nd (in this 
case)
WHERE clause and using the more compact 'SUBSTR(word,0,1)' index.

this might not have any effect whatsoever depending on how indexes are used
internally by the mysql engine. so basically this is me guessing :-)


> 

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] what would a c extension buy me

2006-03-13 Thread Jochem Maas
...

>>>word,def,wordid,pos,posn,wordsize,syn from  korean_english  where word
>>>like '운전할 %'  order by wordsize desc
>>
>>oh would you look at this
>>you're ordering by WORDSIZE.
>>stick an index on WORDSIZE!!!
> 
> 
> 
> http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
> 
> 
> In some cases, MySQL cannot use indexes to resolve the ORDER BY,

note 'In Some Cases'.

so stick an index on WORDSIZE and find out.

> although it still uses indexes to find the rows that match the WHERE
> clause. These cases include the following: 
> 
> The key used to fetch the rows is not the same as the one used in the
> ORDER BY: 
> 
> SELECT * FROM t1 WHERE key2=constant ORDER BY key1;
> 
> 2) 
> 
> 
>>you have an iceballs' chance in hell that I'm going to even read
>>the 500+ lines of code that followed here ... let alone try to
>>optimize it. ;-)
>>
> 
> 
> it was just a 40 line summary of 550 lines of code   

ah, talk about being caught out :-)
regardless a 40line summary won't cut it either - you have to take
the block as a whole.

> 
> 
> 
> questions i may deem myself to answer: 
> 
>>>) i get this:
>>>[EMAIL PROTECTED] mysql]# tail -f /var/lib/mysql/mysqld_query.log  > out
>>>
>>>[EMAIL PROTECTED] mysql]# cat out | wc -l
>>>15910<<< --- that's line count
>>
>>how many queries?
> 
> 15, 910 queries 

so roughly (15 * 60) seconds to run 15,000+
queries and do the processing? that doesn't actually
sound so bad.

> 
> 
> 
> things already taken care of:
> 1) 
> 
>>>   9795 Query   select
>>>word,def,wordid,pos,posn,wordsize,syn from  korean_english where word
>>>like '운전할'  order by wordsize desc
>>
>>in cases when you are not using the wildcard tokens (percentage signs)
>>try changing the query to use something like:
>>
>>  ... word = '운전할' ...
> 
> 
> your suggestion and a line from the query_log match exactly.
> 
> 
> 2) 
> 
>>
>>
>>>then, it sends each token to CallmatchThis (line 14) which calls
>>>matchThis (line 27 - 47 below)  
>>>matchThis may be called twice (2 sql queeries)
>>>(line 51) select * where word = '$token' and another  (take that, but if
>>>it's not there  issue the next sql )
>>>(line 55) select * where word like '$token%';
>>
>>Dont do "SELECT *" - always explicitly specify the fields you want.
>>
> 
> that was just a paraphrase. the previous email points to the line number
> of the code summary 

never paraphrase code - you only end up with smart ass comments like mine!

have you reordered you fields in the db yet? adn made as many VARCHARs as
possible into CHARs?

> 
> 
>>
>>
>>>--+
>>>| korean_english | CREATE TABLE `korean_english` (
>>>  `wordid` int(11) NOT NULL auto_increment,
>>>  `word` varchar(130) default NULL,
>>>  `syn` varchar(190) default NULL,
>>>  `def` blob,
>>>  `posn` int(2) default '1',
>>>  `pos` varchar(13) default '1',
>>>  `submitter` varchar(25) default NULL,
>>>  `doe` datetime NOT NULL default '-00-00 00:00:00',
>>>  `wordsize` tinyint(3) unsigned default NULL,
>>>  PRIMARY KEY  (`wordid`),
>>>  KEY `word_idx` (`word`),
>>>  KEY `wordid_idx` (`wordid`)
>>>) ENGINE=MyISAM DEFAULT CHARSET=utf8 |
>>
>>reorder the fields so that the VARCHARS are at the end of the
>>table (and the BLOB field at the very, very end of the table).
>>
>>also change VARCHARs to CHARs where you can.
>>
>>
>>>++-
>>>---
>>>-
>>>
>>>
>>>
>>>
>>>
>>> 1 function MainLoop()
> 
> 

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] what would a c extension buy me

2006-03-13 Thread joseph

sorry, i made a mistake before.

> >9795 Query   select
> > word,def,wordid,pos,posn,wordsize,syn from  korean_english where word
> > like '운전할'  order by wordsize desc
> 
> in cases when you are not using the wildcard tokens (percentage signs)
> try changing the query to use something like:
> 
>   ... word = '운전할' ...
> 

i have to stick with 'like' because it matches case-insensitive, else
the first word of a sentence (with a capital letter) doesn't match the
db word.  FYI

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] what would a c extension buy me

2006-03-13 Thread joseph
many points given to my esteemed and humble correspondant for:
1) 
> psychics-php is a seperate mailing lists, please channel/mindmeld the correct
> subscription procedure from John Nichel ;-)

> 


points to me for a quick comeback  ( word is indexed, wordsize is not)

1) per mysql website 
...
> 
> > 
> > as for indexes, i only search against word, so i have an index for that.
> 
> er, you might want to read up a bit on indexes :-)
> 
> [EMAIL PROTECTED] mysql]# head out 
> >9795 Query   select
> > word,def,wordid,pos,posn,wordsize,syn from  korean_english  where word
> > like '운전할 %'  order by wordsize desc
> 
> oh would you look at this
> you're ordering by WORDSIZE.
> stick an index on WORDSIZE!!!


http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html


In some cases, MySQL cannot use indexes to resolve the ORDER BY,
although it still uses indexes to find the rows that match the WHERE
clause. These cases include the following: 

The key used to fetch the rows is not the same as the one used in the
ORDER BY: 

SELECT * FROM t1 WHERE key2=constant ORDER BY key1;

2) 

> you have an iceballs' chance in hell that I'm going to even read
> the 500+ lines of code that followed here ... let alone try to
> optimize it. ;-)
> 

it was just a 40 line summary of 550 lines of code   



questions i may deem myself to answer: 
> > 
> > ) i get this:
> > [EMAIL PROTECTED] mysql]# tail -f /var/lib/mysql/mysqld_query.log  > out
> > 
> > [EMAIL PROTECTED] mysql]# cat out | wc -l
> > 15910<<< --- that's line count
> 
> how many queries?
15, 910 queries 
> 


things already taken care of:
1) 
> >9795 Query   select
> > word,def,wordid,pos,posn,wordsize,syn from  korean_english where word
> > like '운전할'  order by wordsize desc
> 
> in cases when you are not using the wildcard tokens (percentage signs)
> try changing the query to use something like:
> 
>   ... word = '운전할' ...

your suggestion and a line from the query_log match exactly.


2) 
> 
> 
> > 
> > then, it sends each token to CallmatchThis (line 14) which calls
> > matchThis (line 27 - 47 below)  
> > matchThis may be called twice (2 sql queeries)
> > (line 51) select * where word = '$token' and another  (take that, but if
> > it's not there  issue the next sql )
> > (line 55) select * where word like '$token%';
> 
> Dont do "SELECT *" - always explicitly specify the fields you want.
> 
that was just a paraphrase. the previous email points to the line number
of the code summary 

> 
> 
> 
> > --+
> > | korean_english | CREATE TABLE `korean_english` (
> >   `wordid` int(11) NOT NULL auto_increment,
> >   `word` varchar(130) default NULL,
> >   `syn` varchar(190) default NULL,
> >   `def` blob,
> >   `posn` int(2) default '1',
> >   `pos` varchar(13) default '1',
> >   `submitter` varchar(25) default NULL,
> >   `doe` datetime NOT NULL default '-00-00 00:00:00',
> >   `wordsize` tinyint(3) unsigned default NULL,
> >   PRIMARY KEY  (`wordid`),
> >   KEY `word_idx` (`word`),
> >   KEY `wordid_idx` (`wordid`)
> > ) ENGINE=MyISAM DEFAULT CHARSET=utf8 |
> 
> reorder the fields so that the VARCHARS are at the end of the
> table (and the BLOB field at the very, very end of the table).
> 
> also change VARCHARs to CHARs where you can.
> 
> > ++-
> > ---
> > -
> > 
> > 
> > 
> > 
> > 
> >  1 function MainLoop()

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] what would a c extension buy me

2006-03-13 Thread Jochem Maas
joseph wrote:
> fellow php programmers, 
> 
> sorry to you two who were kind enough to take the time to attempt to
> answer my question.  i thought it was enough to say "i had 550 lines
> with a bunch of sql calls and loops and big data structures"

psychics-php is a seperate mailing lists, please channel/mindmeld the correct
subscription procedure from John Nichel ;-)

> 
> to help you make a better assessment i will summarize in more detail
> now.
> 
> for my biggest file  (  
> 
> http://www.myowndictionary.com/index.php?old_cfile=rss&cfile=rss&rss_id=198&cid=242&from_lang=korean
> 
> ) i get this:
> [EMAIL PROTECTED] mysql]# tail -f /var/lib/mysql/mysqld_query.log  > out
> 
> [EMAIL PROTECTED] mysql]# cat out | wc -l
> 15910<<< --- that's line count

how many queries?

> [EMAIL PROTECTED] mysql]# head out
>9795 Query   select
> word,def,wordid,pos,posn,wordsize,syn from  korean_english  where word
> like '운전할 %'  order by wordsize desc

oh would you look at this
you're ordering by WORDSIZE.

stick an index on WORDSIZE!!!


>9795 Query   select
> word,def,wordid,pos,posn,wordsize,syn from  korean_english where word
> like '운전할'  order by wordsize desc

in cases when you are not using the wildcard tokens (percentage signs)
try changing the query to use something like:

... word = '운전할' ...




> 
> then, it sends each token to CallmatchThis (line 14) which calls
> matchThis (line 27 - 47 below)  
> matchThis may be called twice (2 sql queeries)
> (line 51) select * where word = '$token' and another  (take that, but if
> it's not there  issue the next sql )
> (line 55) select * where word like '$token%';

Dont do "SELECT *" - always explicitly specify the fields you want.

...

> 
> as for indexes, i only search against word, so i have an index for that.

er, you might want to read up a bit on indexes :-)

> --+
> | korean_english | CREATE TABLE `korean_english` (
>   `wordid` int(11) NOT NULL auto_increment,
>   `word` varchar(130) default NULL,
>   `syn` varchar(190) default NULL,
>   `def` blob,
>   `posn` int(2) default '1',
>   `pos` varchar(13) default '1',
>   `submitter` varchar(25) default NULL,
>   `doe` datetime NOT NULL default '-00-00 00:00:00',
>   `wordsize` tinyint(3) unsigned default NULL,
>   PRIMARY KEY  (`wordid`),
>   KEY `word_idx` (`word`),
>   KEY `wordid_idx` (`wordid`)
> ) ENGINE=MyISAM DEFAULT CHARSET=utf8 |

reorder the fields so that the VARCHARS are at the end of the
table (and the BLOB field at the very, very end of the table).

also change VARCHARs to CHARs where you can.

> ++-
> ---
> -
> 
> 
> 
> 
> 
>  1 function MainLoop()

you have an iceballs' chance in hell that I'm going to even read
the 500+ lines of code that followed here ... let alone try to
optimize it. ;-)

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] what would a c extension buy me

2006-03-13 Thread joseph
fellow php programmers, 

sorry to you two who were kind enough to take the time to attempt to
answer my question.  i thought it was enough to say "i had 550 lines
with a bunch of sql calls and loops and big data structures"

to help you make a better assessment i will summarize in more detail
now.

for my biggest file  (  

http://www.myowndictionary.com/index.php?old_cfile=rss&cfile=rss&rss_id=198&cid=242&from_lang=korean

) i get this:
[EMAIL PROTECTED] mysql]# tail -f /var/lib/mysql/mysqld_query.log  > out

[EMAIL PROTECTED] mysql]# cat out | wc -l
15910<<< --- that's line count
[EMAIL PROTECTED] mysql]# head out
   9795 Query   select
word,def,wordid,pos,posn,wordsize,syn from  korean_english  where word
like '운전할 %'  order by wordsize desc
   9795 Query   select
word,def,wordid,pos,posn,wordsize,syn from  korean_english where word
like '운전할'  order by wordsize desc
   9795 Query   select
word,def,wordid,pos,posn,wordsize,syn from  korean_english  where word
like '  %'  order by wordsize desc
   9795 Query   select
word,def,wordid,pos,posn,wordsize,syn from  korean_english  where word
like '수 %'  order by wordsize desc
   9795 Query   select
word,def,wordid,pos,posn,wordsize,syn from  korean_english where word
like '수'  order by wordsize desc

takes 6 minutes


[EMAIL PROTECTED] current]$ date
2006. 03. 13. (월) 20:25:16 KST
[EMAIL PROTECTED] current]$ date
2006. 03. 13. (월) 20:31:37 KST
[EMAIL PROTECTED] current]$



in general, you will see what is involved in processing each token of
text and sending that token through a match against databases  and
processing results:

a summary of the code is:
mainLoop starts, it breaks some text into tokens using my_mb_preg_split.
(line 3)  
the reg exp is like 
$split_pattern =   "/((?:[\.!\s\?,:-]|\\\"는)+)/";
mb_preg_split i'm sure is broken, but haven't prepared a legal case at
this time  ;)


then, it sends each token to CallmatchThis (line 14) which calls
matchThis (line 27 - 47 below)  
matchThis may be called twice (2 sql queeries)
(line 51) select * where word = '$token' and another  (take that, but if
it's not there  issue the next sql )
(line 55) select * where word like '$token%';

those can result sometimes in arrays from mysql of ~ what 100 words and
6 fields each word  for those queeries.

then all that array is narrowed down by finding the longest word (yes, a
field in the DB is pre-set with that value)   there is an algo
involved, i cannot simply select the longest match, because i am also
selecting against not just that single token but against the following
tokens as well, to stop getting a match for "hair" if "hair of the dog
that bit you" is in the databse, i want the longer match.  that was a
necessary embellishment.

the array of arrays returned from matchThis are processed again by
MainLoop  at line 16 to produce a javascript overlib for every word.

as for indexes, i only search against word, so i have an index for that.
---
--+
| korean_english | CREATE TABLE `korean_english` (
  `wordid` int(11) NOT NULL auto_increment,
  `word` varchar(130) default NULL,
  `syn` varchar(190) default NULL,
  `def` blob,
  `posn` int(2) default '1',
  `pos` varchar(13) default '1',
  `submitter` varchar(25) default NULL,
  `doe` datetime NOT NULL default '-00-00 00:00:00',
  `wordsize` tinyint(3) unsigned default NULL,
  PRIMARY KEY  (`wordid`),
  KEY `word_idx` (`word`),
  KEY `wordid_idx` (`wordid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 |
++-
---
-





 1 function MainLoop()
  2
  3 -> $text_wordArr = my_mb_preg_split($subject,
$utf8_split_pattern,$debug_my_mb);
  4 while (!empty($text_word) ) {
  5 // somehow this allows matches for 'in the bag' (idiom)
by some algorithm or other i have
  6 if (preg_match($split_pattern,$text_word)
  7 &&  (!(preg_match("/^$one_word_matches$/",
$text_wordArr[$i-2][0]) && ($text_wordArr[$i-1][0]==' '))) // gobble up
the space following those in such list, for previuos one
  8 && (!(preg_match("/^$one_word_matches$/",
$text_wordArr[$i-3][0])) && ($text_wordArr[$i-2][0]==' '))) {   // try
to limit it to just blanks and quotes, other such garbage for
previous TWO
  9 } else if ( preg_match("/^$one_word_matches$/i",
$text_word)) { 

Re: [PHP] what would a c extension buy me

2006-03-13 Thread Jochem Maas

joseph wrote:

hi,
my site is www.myowndictionary.com
i use javascript to create definitions for words (from open source
dictionaries) as pop-ups and hook word-lists for vocabulary study with
that.  
i can now parse the html from rss feeds and match only text of interest.

i just wrote my own multibyte html parser by combining various other
code i already had on hand for the parsing of multibyte text anyhow.  
but, that's not the slowest part.  that takes maybe 3 seconds to get the

page, 3 seconds to parse it, but it sometimes takes 5 minutes for the
next step on huge rss files.
(I know the time because they are cached)
that last step...
The slow part that's left is the dictionary searching itself.  It's the
parsing of complicated php code -- 550 lines of complex data structure
loops, function calls, looping through arrays to make sql queeries --
lots of sql queeries 
the seconds ~ 5 minute it takes it all takes so much time.  


use the 'EXPLAIN' syntax of MySQL to figure out if there are any bottlenecks
related to crap or 'missing' indexes with regard to the sql queries you mention
above. - chances are adding a couple of well placed indexes will probably
shave massive ammounts from the process time.



i want to ask for opinion about:
1) rewriting the php as C extension.  Would it slow my down to the below
15 second range?


How Long is a chinaman. How Long is a chinaman?
yes the C extension would be much faster (assuming you wrote it correctly -
but it's much harder to write and maintain)


2) does mysql have server-side functions yet?


there are stored procedures in mysql5.


3) if i moved the php code off mysql to postgresql and wrote server-side
functions (which i've been trained to do), how much of a time
improvement would we be talking about?


how on gods' (or whoever runs the place these days) earth can we tell
whether an unwritten stored procedure in postgres would be faster or slower
than an unknown (to us) piece of php code?

that said I doubt postgres will beat mysql in terms of pure speed assuming
you db schema is sound (indexes are good, etc).



please send me your input!  
i am at a growing stage and need direction here.  
because any of those steps will require considerable time.  


it's monday morning here too ;-)

interesting language tool your building, you might consider repackaging it
in a modular form so that other people could implement in their own site.
(you do call it 'open source')

rgds,
Jochem



thank you.

joseph.



--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] what would a c extension buy me

2006-03-12 Thread Chris

joseph wrote:

hi,
my site is www.myowndictionary.com
i use javascript to create definitions for words (from open source
dictionaries) as pop-ups and hook word-lists for vocabulary study with
that.  
i can now parse the html from rss feeds and match only text of interest.

i just wrote my own multibyte html parser by combining various other
code i already had on hand for the parsing of multibyte text anyhow.  
but, that's not the slowest part.  that takes maybe 3 seconds to get the

page, 3 seconds to parse it, but it sometimes takes 5 minutes for the
next step on huge rss files.
(I know the time because they are cached)
that last step...
The slow part that's left is the dictionary searching itself.  It's the
parsing of complicated php code -- 550 lines of complex data structure
loops, function calls, looping through arrays to make sql queeries --
lots of sql queeries 
the seconds ~ 5 minute it takes it all takes so much time.  


i want to ask for opinion about:
1) rewriting the php as C extension.  Would it slow my down to the below
15 second range?


Who knows? We don't know what the code does or how it works.

C will be quicker because it's already compiled. It will be harder to 
maintain and debug because it's compiled.



2) does mysql have server-side functions yet?


Mysql 5 has stored procedures if that's what you mean.


3) if i moved the php code off mysql to postgresql and wrote server-side
functions (which i've been trained to do), how much of a time
improvement would we be talking about?


See # 1.

--
Postgresql & php tutorials
http://www.designmagick.com/

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php