Re: [PHP] Fast search

2006-05-18 Thread Richard Lynch
On Wed, May 17, 2006 11:37 am, René Fournier wrote:
> Looking for suggestions on the most compute-efficient way to search
> variable-length strings (~200 characters) for the occurrence of one
> of about 100 possible needles. In other words:
>
> $needles = array ( 1 => "Hello Jim" , 2 => "Yellow Banana" , 3 =>
> "Red Car", ... etc  (100 elements)
>
> $haystack = "Once upon a time there was a programming language that
> everyone loved. Its name was PHP, and the people that worked with it
> also liked Yellow Bananas. One day...";
>
> Now perhaps I'm blind, but I can't see an obvious string or array
> function that simply allows you to use an array as a needle while
> searching a string haystack. What I'm really looking for is something
> like the reverse of in_array.
>
> The obvious thing would be to loop through the needles and run
> substr_count on the haystack... But is there a faster way?

//Kind of a hack, I guess, but might be decently fast:
$pattern = implode('|', $needles);
if (preg_match("/($pattern)/", $haystack, $matches)){
  var_dump($matches);
}

-- 
Like Music?
http://l-i-e.com/artists.htm

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Fast search

2006-05-17 Thread Robert Cummings
On Wed, 2006-05-17 at 16:38, Robin Vickery wrote:
> On 17/05/06, René Fournier <[EMAIL PROTECTED]> wrote:
> > Looking for suggestions on the most compute-efficient way to search
> > variable-length strings (~200 characters) for the occurrence of one
> > of about 100 possible needles. In other words:
> 
> >
> >
> >
> > $needles = array ( 1 => "Hello Jim" , 2 => "Yellow Banana" , 3 =>
> > "Red Car", ... etc  (100 elements)
> >
> > $haystack = "Once upon a time there was a programming language that
> > everyone loved. Its name was PHP, and the people that worked with it
> > also liked Yellow Bananas. One day...";
> >
> 
> What version of php?
> 
> If you're using php5  then you could do:
> 
> $count = 0;
>str_replace($needles, '', $haystack, $count);
>if ($count)  { /* there was a match */ }
> ?>

Surely you meant strpos( $haystack, $needle ) ???

Incidentally do you need to match on words beginnings or endings? For
instance should 'put' match all of the following:

I put the apple core in the compost.
Someone always puts the fire out.
The car sputtered and died.

Cheers,
Rob.
-- 
..
| InterJinn Application Framework - http://www.interjinn.com |
::
| An application and templating framework for PHP. Boasting  |
| a powerful, scalable system for accessing system services  |
| such as forms, properties, sessions, and caches. InterJinn |
| also provides an extremely flexible architecture for   |
| creating re-usable components quickly and easily.  |
`'

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Fast search

2006-05-17 Thread Evan Priestley
You can make a pretty effective time/memory tradeoff -- particularly  
if your array of patterns is relatively fixed -- but unless you re- 
implement this in C, it will probably be slower than a brute force  
approach with the str* functions unless you are searching for a  
fairly huge number of needles.


Below, I build a tree of pattern matches, which you navigate branch- 
by-branch by using successive letters of the pattern. So, after the  
tree is built, the index $tree[ 'h' ][ 'a' ][ 't' ] exists and has  
the value `true'. To search the string, you push cursors onto a  
stack, advance them as you advance through the string, and destroy  
them when they hit a dead end.


This has a worst case of O( N^2 ) (on the length of the string),  
which happens when you search for patterns like "a"  
and "b" in the string "h". The  
average case for random or typical strings should be very close to O 
( N ). Critically, search time is a function only of the length of  
the string, not of the number of patterns. Memory usage is also only  
a function of the size of the character set and the length of the  
longest pattern, not of the total number of patterns.


This particular implementation has some nice side-effects; uncomment  
the var_dump() line and note how 'pirate' and 'firm' have been  
optimized out of the tree in favor of 'pir' and 'fir'.


Evan

/*  var_dump 
( $tree );  */


$corpus = 'lorem ipsum dolor sur amet pi blue do he hi h fipir  
slog';

$clen   = strlen( $corpus );

$stack  = array( );
for( $ii = 0; $ii < $clen; $ii++ ) {
$c = $corpus[ $ii ];
foreach( $stack as $k => $_ ) {
if( isset( $stack[ $k ][ $c ] ) ) {
if( $stack[ $k ][ $c ] === true ) {
die( 'Matched at position '.$ii );
}
$stack[ $k ] =& $stack[ $k ][ $c ];
}
else {
unset( $stack[ $k ] );
}
}
if( isset( $tree[ $c ] ) ) {
$stack[] =& $tree[ $c ];
}
}
die( 'Did not find a match.' );

?>



On May 17, 2006, at 12:37 PM, René Fournier wrote:

Looking for suggestions on the most compute-efficient way to search  
variable-length strings (~200 characters) for the occurrence of one  
of about 100 possible needles. In other words:




$needles = array ( 1 => "Hello Jim" , 2 => "Yellow Banana" , 3 =>  
"Red Car", ... etc  (100 elements)


$haystack = "Once upon a time there was a programming language that  
everyone loved. Its name was PHP, and the people that worked with  
it also liked Yellow Bananas. One day...";




Now perhaps I'm blind, but I can't see an obvious string or array  
function that simply allows you to use an array as a needle while  
searching a string haystack. What I'm really looking for is  
something like the reverse of in_array.


The obvious thing would be to loop through the needles and run  
substr_count on the haystack... But is there a faster way?


...Rene

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Fast search

2006-05-17 Thread Robin Vickery

On 17/05/06, René Fournier <[EMAIL PROTECTED]> wrote:

Looking for suggestions on the most compute-efficient way to search
variable-length strings (~200 characters) for the occurrence of one
of about 100 possible needles. In other words:






$needles = array ( 1 => "Hello Jim" , 2 => "Yellow Banana" , 3 =>
"Red Car", ... etc  (100 elements)

$haystack = "Once upon a time there was a programming language that
everyone loved. Its name was PHP, and the people that worked with it
also liked Yellow Bananas. One day...";



What version of php?

If you're using php5  then you could do:



-robin

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php