On 18.12.2010 08:12, Mike Cowlishaw wrote:
>>>> I have a Rexx program that merges several small files onto one large
>>>> one. As it turned out a few of the small files were prefixed with a
>>>> UTF8 BOM, |0xEFBBBF|. Should the BOM have been recognized and
>>>> discarded?
>>>>
>>> How could Rexx (or any other processor) decide that some particular
>>> prefix/content/suffix of a file is worthless and should be discarded?
>>> ("darn it, this file ends in 'ILY'; delete that!").
>>>
>> It would handle it as any other text processor. Open the 
>> file, read the first three or four bytes. If no BOM is 
>> present reposition to the beginning, else position to the 
>> first char after the BOM.
>>
>> I realize that Rexx cannot handle wide characters and that use of
>> the UTF8 BOM is discouraged and, at least on *ix systems, can
>> lead to problems with some apps.
>> But the use of UTF8 is not forbidden. So when processing text
>> files, it seems to me that a BOM should be checked for, even
>> if it is ignored. Or an error issued for an unsupported
>> encoding. For UTF8 I would ignore it and process the file as ASCII.
>>
> I wasn't clear -- sorry.  I meant: what if the program *wants* to be able to
> read the BOM and then process the file appropriately?  Such a program would be
> broken if the standard file read discarded the BOM.
>
> Perhaps what you need is a wrapper function/class that will do exactly that
> ('readUTF8' ... etc.).

This kind of situation, processing non-ASCII text files, is becoming more
and more common on every operating system. The upcoming "HTML5"
standard even suggests using UTF-8 encoding, which will probably cause
the number of BOM'ed text files to explode on every platform.

It is interesting to note that "linein" was used, which means that the
author expected, and indicated, that a text file was being processed.
Now, an LF or CR-LF sequence is never part of the data returned by
"linein", and it is appended automatically by lineout/say when writing
textual data. So Rexx has always processed text files differently from
binary files (for which charin/charout are available), which makes text
files easy for the programmer to handle. I can see that Rexx programmers
might therefore expect automatic BOM handling for non-ASCII text files
whenever they use "linein" or "lineout/say".

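To make the "wrapper" idea concrete, here is a minimal sketch of the
check-and-skip logic described above (read the first bytes; if no BOM is
present, reposition to the beginning). It is written in Python purely to
illustrate the logic; it is not Rexx and not an existing ooRexx API. The
function name read_lines_skipping_bom is made up for this example, and
only the three-byte UTF-8 BOM (EF BB BF) is handled:

```python
import codecs


def read_lines_skipping_bom(path):
    """Return the lines of a text file, without line ends and without
    a leading UTF-8 BOM (hypothetical helper, for illustration only)."""
    with open(path, "rb") as f:
        head = f.read(len(codecs.BOM_UTF8))  # first three bytes: EF BB BF?
        if head != codecs.BOM_UTF8:
            f.seek(0)                        # no BOM: reposition to the beginning
        data = f.read()                      # starts right after the BOM, if any
    # bytes.splitlines() splits on LF, CR-LF and CR and drops the line ends,
    # loosely mirroring what "linein" does for text files.
    return [line.decode("utf-8") for line in data.splitlines()]
```

Because the skipping lives in a separate helper, a program that wants to
see the BOM itself can keep using the unmodified read routines.
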
---rony


_______________________________________________
Oorexx-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oorexx-devel
