On Monday 23 February 2004 03:02 pm, Axel IS Main wrote:
> That's not bad, but I found a way to do it simply using chr() and
> passing it a value. It turns out that if I go 0-31 almost nothing will
> get through. Even the simplest HTML has something in there from that
> list. However, by just looking between 14 and 26, one more than carriage
> return, and one less than escape, it worked really well. I crawled a
> site with a large number of jpg, gif, mp3, wav, and pdf files. Of the
> hundreds of binaries there, only one PDF got through. Not a bad record. I
It should be noted that a PDF isn't necessarily binary. It's just that most
people like to use compression and embed images, sounds, etc. But if you want
to, you can fire up emacs and create a PDF from scratch. So really the record
is better than you think ;)
> also found that in order for this to work I have to process the URLs.
> This makes things really slow so I'm going to have to use both this and
> the "check for extension" function together. Still, I can worry a lot
> less about getting my index weighed down by binary files. The code is
> pretty basic at this point, but here it is:
>
> // Check for binaries: look for any control character from 14 to 26
> $ckbin = 14;
> while($ckbin <= 26){
>     $ck = chr($ckbin);
>     $cbin = substr_count($read, $ck);
>     if($cbin > 0){
>         echo "Killing off binary file URL: $url\n";
>         $kill = mysql_unbuffered_query("DELETE FROM search WHERE url_id='$url_id'");
>         // continue 2 jumps to the next iteration of the enclosing loop (not shown)
>         continue 2;
>     }
>     ++$ckbin;
> }
> I know it looks kind of funky out of context, but it works really well.
>
> Nick
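
For what it's worth, the same 14-26 check could probably be folded into one
function call per document instead of thirteen substr_count() passes. This is
only a rough sketch: looks_binary() is a name I made up, and it assumes the
fetched content is already sitting in a string the way $read is above.

```php
<?php
// Hypothetical helper (not part of the original script): returns true if
// $data contains any byte in the range 14..26, i.e. just above carriage
// return and just below escape.
function looks_binary($data)
{
    // Build the list of control characters: chr(14) through chr(26).
    $controls = implode('', array_map('chr', range(14, 26)));

    // strcspn() gives the length of the leading run of $data that contains
    // none of $controls. If that run is shorter than the whole string, one
    // of the control characters is in there somewhere.
    return strcspn($data, $controls) < strlen($data);
}
```

In the crawl loop you would then test looks_binary($read) once and do the
DELETE and continue from there. It stops at the first offending byte, so it
should be no slower than the loop above, but I haven't timed it.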
>
> Richard Davey wrote:
> >Hello Evan,
> >
> >Monday, February 23, 2004, 8:57:43 PM, you wrote:
> >>>It would be wise to check for characters from 0 to 31, if they appear
> >>>then it's almost certainly (but not guaranteed) binary.
> >
> >EN> Assuming that's decimal, you're including 0x09 0x0a and 0x0d which
> >EN> are, respectively, tab, line feed, and carriage return. That's off
> >EN> the top of my head, which means two things: (1) I may be forgetting
> >EN> something, and (2) I need a life ;)
> >
> >Let me rephrase - check for the existence of characters 0 through 31
> >and count how many there are. Set a percentage weight yourself and
> >figure out in your script if you deem the quantity too many or too
> >few.
> >
> >The count_chars() function will be absolutely ideal for this.
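
Richard's weighted approach might look something like the following. Treat it
as a sketch under assumptions, not tested against a real crawl: the function
names are invented, the 5% threshold is an arbitrary placeholder you would
have to tune, and I'm leaving tab, line feed and carriage return out of the
count for the reason I gave earlier.

```php
<?php
// Hypothetical helper: fraction of bytes in $data that are control
// characters (0..31), not counting tab (9), line feed (10) and
// carriage return (13).
function control_char_ratio($data)
{
    $length = strlen($data);
    if ($length === 0) {
        return 0.0;
    }

    // Mode 1 returns only the byte values that actually occur,
    // as an array of byte value => number of occurrences.
    $counts = count_chars($data, 1);

    $suspicious = 0;
    foreach ($counts as $byte => $count) {
        if ($byte < 32 && $byte !== 9 && $byte !== 10 && $byte !== 13) {
            $suspicious += $count;
        }
    }

    return $suspicious / $length;
}

// Hypothetical wrapper: the default threshold of 0.05 is a guess, not a
// recommendation. Tune it on your own data.
function probably_binary($data, $threshold = 0.05)
{
    return control_char_ratio($data) > $threshold;
}
```

The upside over a plain "any control character at all" test is that a stray
form feed in an otherwise plain text file won't get the whole page thrown out.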
--
Evan Nemerson
[EMAIL PROTECTED]
http://coeusgroup.com/en