At 1:12 am +0200 26/10/03, Marco Baroni wrote:

I am new to (explicit) unicode handling, and right now I am facing this problem.

I have some data (lots of data) that in theory should be in ascii (with entity references in place of non-ascii characters). I have no easy way of finding out exactly how these data were generated.
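
Before converting anything, it might be worth confirming whether the files really are plain ascii. A quick check along these lines (just a sketch, with a made-up file name) should flag any line that carries non-ascii bytes:

#!/usr/bin/perl -w

# only a sketch: /tmp/data.txt stands in for one of your files
my $file = "/tmp/data.txt";
open IN, "<", $file or die "can't open $file: $!";
binmode IN;                      # raw bytes, no encoding layer
while (<IN>) {
    print "line $. contains non-ascii bytes\n" if /[^\x00-\x7F]/;
}
close IN;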

Presumably you have some idea what OS the files were created on. If they are MacRoman files, us-ascii or not, then you might try something like this. The first part of the script simply creates a sample file for testing purposes:



#!/usr/bin/perl -w

# write some MacRoman to file some.txt
my $text = "/tmp/some.txt" ;
open TEXT, ">$text" ;
print TEXT 'œ∑鮆¥üîøπ' ;
##### `open -a 'SimpleText' $text` ; # if you like
close TEXT;
#
# use encoding "MacRoman", STDOUT => "utf8";

my $top = q(<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>Some chars</title>);

my $html = "/tmp/some.html";
open HTML, ">:encoding(utf8)", "$html";
print HTML $top;

# write contents of some.txt to html file as utf-8
open TEXT, "<:encoding(MacRoman)", "$text" ;
for (<TEXT>) {
        s~∑~S~g ;
        print HTML;
}
`open -a 'Safari' $html` ;
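
If what you actually want is to stay ascii-only and fall back on entity references, the way your existing data apparently do, then a variation along these lines might be closer. This is only a sketch: it leans on the HTML::Entities module and on made-up file names.

#!/usr/bin/perl -w
use Encode;
use HTML::Entities;

# sketch: read MacRoman bytes, write pure ascii with entity references
# for everything outside the printable ascii range (paths are placeholders)
my $in  = "/tmp/some.txt";
my $out = "/tmp/some-entities.txt";
open IN,  "<", $in  or die "can't open $in: $!";
open OUT, ">", $out or die "can't open $out: $!";
binmode IN;
while (<IN>) {
    my $chars = decode("MacRoman", $_);                  # bytes -> perl characters
    print OUT encode_entities($chars, '^\n\x20-\x7e');   # keep newline and printable ascii, entity-encode the rest
}
close IN;
close OUT;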


