[PHP] Cleaning pasted Word text

2002-10-29 Thread a . h . s . boy
I'm working on a PHP-based CMS that allows users to post lengthy  
article texts by submitting through a form. The short version of my  
quandary is this: How can I create a conversion routine that reliably  
substitutes HTML-acceptable output for high-ASCII characters pasted  
into the form (from a variety of operating systems)?

The longer version is this:
In order to prevent scripting vulnerabilities and a variety of other  
undesirable content, I run the body of the text through a cleantext()  
function. This function first strips out illegal HTML tags and  
JavaScript. So far so good.

Then it attempts to perform some character conversions to clean up
8-bit ASCII characters in the text, so smart quotes, en- and em-dashes,
ellipses, etc. are converted to suitable alternatives, or to HTML
entities. I'm using:

// Reference:
// chr(133) = ellipsis
// chr(145) = left curly single quote
// chr(146) = right curly single quote (apostrophe)
// chr(147) = left curly double quote
// chr(148) = right curly double quote
// chr(149) = bullet
// chr(150) = en dash
// chr(151) = em dash
// chr(153) = trademark
// chr(160) = non-breaking space
// chr(161) = inverted exclamation mark
// chr(169) = copyright symbol
// chr(171) = left guillemet
// chr(173) = soft hyphen
// chr(174) = registered trademark
// chr(187) = right guillemet
// chr(188) = 1/4 fraction
// chr(189) = 1/2 fraction
// chr(190) = 3/4 fraction
// chr(191) = inverted question mark
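$changearr = array(
	' ' => ' ',	// whitespace pair; exact characters not recoverable from the archived copy
	"\r" => "\n",
	"\r\n" => "\n",
	"\n\n\n" => "\n\n",
	chr(133) => '...',
	chr(145) => "'",
	chr(146) => "'",
	chr(147) => "\"",
	chr(148) => "\"",
	chr(149) => '*',
	chr(150) => '-',
	chr(151) => '--',
	chr(153) => '(TM)',
	chr(160) => '&nbsp;',
	chr(161) => '&iexcl;',
	chr(169) => '&copy;',
	chr(171) => '&laquo;',
	chr(173) => '-',
	chr(174) => '(R)',
	chr(187) => '&raquo;',
	chr(188) => '1/4',
	chr(189) => '1/2',
	chr(190) => '3/4',
	chr(191) => '&iquest;');
$returnstr = strtr($returnstr, $changearr);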

The server's on a Linux box (RedHat 7.2, standard US installation);  
users can obviously post from any sort of operating system.

This routine seems to work well on Word text pasted in from my Mac (OS  
X 10.2.1), but I see a number of articles appearing on the site with  
text like:

Wouldnâ€(TM)t you say?

(That's Wouldn[a circumflex][Euro symbol](TM)t instead of Wouldn't.)

...which was almost definitely pasted in from a Windows-based Microsoft
Word, and the conversion routines are failing. (And inserting even
weirder characters...why would the single quote be replaced by _3_
character substitutions?)
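
On the three-character question, one likely explanation is that the browser submitted the text as UTF-8, where the right single quote (U+2019) occupies three bytes: 0xE2 0x80 0x99. Viewed as Windows-1252, those bytes are a-circumflex, the Euro sign, and the trademark sign, and the chr(153) rule then rewrites the last one as (TM). A minimal sketch, assuming UTF-8 input (the variable names are illustrative only):

```php
<?php
// Assumption: the form data arrived as UTF-8 rather than Windows-1252.
$apostrophe = "\xE2\x80\x99";          // U+2019 encoded as UTF-8
$word = "Wouldn" . $apostrophe . "t";

echo strlen($apostrophe), "\n";        // number of bytes, not characters
echo bin2hex($apostrophe), "\n";       // hex dump of those bytes

// A single-byte rule only matches the last byte (0x99 = 153).
$mangled = strtr($word, array(chr(153) => '(TM)'));
echo bin2hex($mangled), "\n";
```

The first two bytes survive untouched, which would account for the a-circumflex and Euro symbol left in front of (TM).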

I understand that Windows may well use a different character set for  
high-ASCII, but I frankly don't understand how to work that knowledge  
into this situation. And the combination of original text, Linux,
chr(), and ord() stuff just doesn't make sense to me. For example, if I  
post text (from my Mac) containing only:

“”‘’…
(that's  
[open-double-quote][close-double-quote][open-single-quote][close- 
single-quote][ellipsis])

and have PHP run this:

for ($x = 0; $x < strlen($str); $x++) {
   $mailstr .= $str[$x].' is '.ord($str[$x])."\n";
}
mail('me','Characters',$mailstr);

I get mail that says (in parentheses is a description of the character):

ì is 147 (accent-grave-i)
î is 148 (circumflex-i)
ë is 145 (umlaut-e)
í is 146 (accent-acute-i)
Ö is 133 (umlaut capital o)

...which means that PHP recognizes the correct ASCII value (147) of a
double-quote, though my Linux box seems to think that the character is  
a lowercase i with a grave accent on it. With this kind of strange  
sub-conversion going on, I'm not all that surprised that things are  
getting mucked up.
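
Note that ord() reports byte values only; which glyph shows up in the mail depends on the character set the mail program assumes. In MacRoman, for example, byte 147 is i-grave, which matches the output above. A small sketch that dumps bytes numerically so that no glyph interpretation is involved (dump_bytes is a hypothetical helper, not part of the original code):

```php
<?php
// Hypothetical helper: list each byte of $str as hex and decimal.
function dump_bytes($str)
{
    $lines = array();
    for ($x = 0; $x < strlen($str); $x++) {
        $byte = ord($str[$x]);
        $lines[] = sprintf('0x%02X = %d', $byte, $byte);
    }
    return implode("\n", $lines);
}

// The five test characters as Windows-1252 bytes (assumed encoding).
echo dump_bytes("\x93\x94\x91\x92\x85"), "\n";
```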

Is there some way of getting pasted Word text from Windows clean in  
this manner, as well as accommodating the already-working-right Mac  
Word text?

Cheers,
spud.

-
a.h.s. boy
[EMAIL PROTECTED]
dadaIMC support
http://www.dadaimc.org/
-

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Cleaning pasted Word text

2002-10-29 Thread Brent Baisley
I think you have posted before and probably didn't get an answer. I'm 
not going to give you an answer (because I don't have one), but perhaps 
I can point you in the right direction.
Look at http://www.w3.org/TR/REC-html40/charset.html and see if that 
helps you. Below is a paragraph I pulled from it.

The document character set, however, does not suffice to allow user 
agents to correctly interpret HTML documents as they are typically 
exchanged -- encoded as a sequence of bytes in a file or during a 
network transmission. User agents must also know the specific character 
encoding that was used to transform the document character stream into a 
byte stream.


--
Brent Baisley
Systems Architect
Landover Associates, Inc.
Search & Advisory Services for Advanced Technology Environments
p: 212.759.6400/800.759.0577





Re: [PHP] Cleaning pasted Word text

2002-10-29 Thread a . h . s . boy
Brent --

Thanks for the pointer, but it doesn't really address the problem. I am 
specifying the character set for the page (ISO-8859-1), and I'm 
inserting an ACCEPT-CHARSET parameter into the FORM element, but it 
specifies acceptable charsets as UTF-8, ISO-8859-1, and Windows 1252. 
The problem isn't accepting or displaying the characters correctly, the 
problem is figuring out what characters PHP thinks it's looking at!

After further investigation, I find that ISO-8859-1 doesn't even use 
ASCII codes 128-159, so when a user types in a smart quote, it can't 
_really_ be using Latin 1 (but could be Windows Latin 1).

Oddly enough, I've set the page charset to ISO-8859-1 (which doesn't 
have a smart quote), and my browser is set to "Use character set 
specified by server", and it displays a smart quote just fine with 
chr(147). If I manually change my browser to use Latin 1, it displays 
a ? (unknown character symbol). So between browsers, character sets, 
meta tags, and operating systems, I'm beginning to think that 
interpreting high-ASCII input is an art rather than a science...
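
The difference between the two character sets can be checked directly. A sketch, assuming the mbstring extension is available: byte 147 decoded as Windows-1252 is the left curly double quote (U+201C), while decoded as ISO-8859-1 it is the C1 control code U+0093. Many browsers treat a page labelled ISO-8859-1 as Windows-1252, which would explain why the quote displays anyway.

```php
<?php
// Decode the same byte under both character sets and compare the results.
$byte = chr(147);

$as_cp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
$as_latin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($as_cp1252), "\n"; // e2809c (U+201C, left curly double quote)
echo bin2hex($as_latin1), "\n"; // c293   (U+0093, a control code)
```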

spud.

---
a.h.s. boy
spud(at)nothingness.org
as yes is to if, love is to yes
http://www.nothingness.org/
---





Re: [PHP] Cleaning pasted Word text

2002-10-29 Thread Daniel Guerrier
Paste into Notepad, then copy the text from Notepad.
Notepad should remove the high ASCII text.




Re: [PHP] Cleaning pasted Word text

2002-10-29 Thread Jimmy Brake
for file maker pro (windows/mac) -- word (windows/mac) 

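function make_safe($text)
{
// Flatten the text to a single line: every line break becomes a space.
$text = preg_replace('/(\cM)/', ' ', $text);   // Ctrl-M (carriage return)
$text = preg_replace('/(\c])/', ' ', $text);   // Ctrl-] (0x1D)
$text = str_replace("\r\n", ' ', $text);
$text = str_replace("\x0B", ' ', $text);       // vertical tab
$text = str_replace('', ' ', $text);           // search string missing in the archived copy
$text = explode("\n", $text);
$text = implode(' ', $text);
$text = addslashes(trim($text));
return($text);
}

function make_safe2($text)
{
// Same idea, but line breaks are normalized to "\n" instead of removed.
$text = str_replace("\r\n", "\n", $text);
$text = preg_replace('/(\cM)/', "\n", $text);  // Ctrl-M (carriage return)
$text = preg_replace('/(\c])/', "\n", $text);  // Ctrl-] (0x1D)
$text = str_replace("\x0B", "\n", $text);      // vertical tab
$text = addslashes($text);
return($text);
}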

I cannot remember why I put in two functions ... but anyhow, have fun. You
will probably not need the implode / explode either.







Re: [PHP] Cleaning pasted Word text

2002-10-29 Thread a . h . s . boy
Errr...I'm not sure how this is applicable to my situation. I'm 
concerned, above all, with converting

curly double quotes
curly single quotes
em and en dashes
inverted exclamation points
inverted question marks
ellipses
non-breaking spaces
registered trademark symbols
bullets
left and right guillemets

Many of these characters do not exist in the ISO Latin 1 character set, 
but can nonetheless be inserted by a browser which defaults to MacRoman 
or Windows Latin 1 (1252) character sets.

The big questions, I suppose, are:

1) What character/ASCII code does PHP interpret “ (left curly quote) 
as, when pasted into a form?
2) Does it interpret it the same way pasted in on a Mac as on a Windows 
box?
3) What influence does the page charset meta tag have on such a 
submission?
4) What influence does the form ACCEPT-CHARSET parameter have?
5) What influence does the browser encoding setting have on such 
submissions?
and finally,
6) If all of these factors can influence the final interpretation of a 
character, what's the best way to approach handling all possible 
combinations?
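
One possible approach to question 6, sketched under stated assumptions rather than offered as a definitive answer: normalize every submission to UTF-8 first, then run a single replacement table keyed on UTF-8 sequences. The sketch assumes the mbstring extension, assumes that anything which is not valid UTF-8 is Windows-1252, and does not detect MacRoman at all. The names to_utf8 and clean_typography are hypothetical.

```php
<?php
// Hypothetical helper: bring submitted text to UTF-8.
function to_utf8($text)
{
    // Valid UTF-8 (including plain ASCII) is left alone.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Assumption: anything else is Windows-1252.
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// Hypothetical helper: replace typographic characters after normalization.
function clean_typography($text)
{
    $map = array(
        "\r\n"         => "\n",
        "\r"           => "\n",
        "\xE2\x80\xA6" => '...',     // U+2026 ellipsis
        "\xE2\x80\x98" => "'",       // U+2018 left single quote
        "\xE2\x80\x99" => "'",       // U+2019 right single quote
        "\xE2\x80\x9C" => '"',       // U+201C left double quote
        "\xE2\x80\x9D" => '"',       // U+201D right double quote
        "\xE2\x80\xA2" => '*',       // U+2022 bullet
        "\xE2\x80\x93" => '-',       // U+2013 en dash
        "\xE2\x80\x94" => '--',      // U+2014 em dash
        "\xE2\x84\xA2" => '(TM)',    // U+2122 trademark
        "\xC2\xA0"     => '&nbsp;',  // U+00A0 non-breaking space
        "\xC2\xA9"     => '&copy;',  // U+00A9 copyright
        "\xC2\xAE"     => '(R)',     // U+00AE registered trademark
    );
    // strtr() with an array tries the longest keys first, so "\r\n" wins over "\r".
    return strtr(to_utf8($text), $map);
}

echo clean_typography("Wouldn\x92t you say?"), "\n";          // Windows-1252 input
echo clean_typography("Wouldn\xE2\x80\x99t you say?"), "\n";  // UTF-8 input
```

Both calls should print the same plain-ASCII sentence, since the Windows-1252 byte is converted to the same UTF-8 sequence before the table is applied.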

All of this would be so much easier if I'd just get my hands on a 
Windows box for testing. Guess I'll have to do that. I'm just a bit 
surprised that no one seems to have tackled this problem already...it 
can't be that uncommon.

Then again, I've seen any number of CMS-driven web sites that obviously 
haven't done this sort of conversion, including large news corporation 
sites. And given the paucity of Mac-friendly programming on the web, 
it's not too surprising that so few sites attempt to accommodate Mac 
users. (Testing for Mac compatibility tends to be on par with testing 
for Netscape 3.0 compatibility...not usually a very high priority, 
despite IE 5 for the Mac supposedly being more standards-compliant than 
the Windows version.)

spud.
