On Thu, 2005-01-27 at 13:05 -0600, Damian Menscher wrote:

> Oh, ok.  Apparently we have a different definition of plaintext.  I 
> generally take anything using only the lower 7 bits (ASCII table) to 
> mean plaintext, and things that use the 8th bit to mean binary. 
> Regardless of your definition of "plaintext", it would seem that my 
> conclusion that phishing signatures that rely exclusively on 7-bit ascii 
> are more likely to have a false positive than binary signatures that use 
> the full 8 bits is correct.

Even with your definition of plaintext you are still wrong :-)

Why? Because the structure of language in plaintext files is much richer
than that used in the binaries of computer programs.

An aside:
HTML's document character set is actually the Universal Character Set
(UCS), not ASCII. To quote the standard:

"The ASCII character set is not sufficient for a global information
system such as the Web, so HTML uses the much more complete character
set called the Universal Character Set (UCS), defined in [ISO10646].
This standard defines a repertoire of thousands of characters used by
communities all over the world."

and

"When HTML text is transmitted in UTF-16 (charset=UTF-16), text data
should be transmitted in network byte order ("big-endian", high-order
byte first) in accordance with [ISO10646], Section 6.3 and [UNICODE],
clause C3, page 3-1.

Furthermore, to maximize chances of proper interpretation, it is
recommended that documents transmitted as UTF-16 always begin with a
ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called
Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal
FFFE, a character guaranteed never to be assigned. Thus, a user-agent
receiving a hexadecimal FFFE as the first bytes of a text would know
that bytes have to be reversed for the remainder of the text."
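
As a rough illustration of the byte-order check described in that passage, here is a minimal Python sketch. The function names utf16_byte_order and decode_utf16 are made up for this example and do not come from the standard or from any scanner; falling back to big-endian when no BOM is present is my reading of the "network byte order" sentence above, not something every user-agent is guaranteed to do.

```python
def utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from its first two bytes."""
    if data[:2] == b"\xfe\xff":
        # FEFF read as-is: the sender used network (big-endian) order.
        return "big-endian"
    if data[:2] == b"\xff\xfe":
        # FFFE is FEFF byte-reversed, so the rest must be reversed too.
        return "little-endian"
    # No BOM at all; the caller has to pick a default.
    return "unknown"


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 text, using the BOM (if any) to choose byte order."""
    order = utf16_byte_order(data)
    if order == "little-endian":
        return data[2:].decode("utf-16-le")
    if order == "big-endian":
        return data[2:].decode("utf-16-be")
    # Assumption: without a BOM, treat the data as network byte order.
    return data.decode("utf-16-be")
```

Calling decode_utf16 on the same text sent in either byte order, each prefixed with its BOM, should give the same string back; only the first two bytes tell the receiver which way round to read the rest.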

-trog




_______________________________________________
http://lists.clamav.net/cgi-bin/mailman/listinfo/clamav-users
