Re: confusing bullets

2004-01-11 Thread John Delacour
At 9:26 pm -0500 10/1/04, Vic Norton wrote:

> I'm sorry, John. I was talking figuratively. I didn't mean real bullets.

Figuratively or no, you were right on target with 
your choice.  The bullet is a character in the 
'macintosh' character set (wrongly called 
MacRoman by the Perl people) which does not exist 
in the widely used (or at least declared) charset 
Latin-1 and has the same 8-bit codepoint as the 
yen sign in the Windows-1252 charset. 
It is to rebuild this Tower of Babel that Unicode 
was conceived and, far too slowly, brought into 
the computer world first in Windows NT and 
finally in Mac OS X.  Unicode is a 'good thing' 
but it has to be learned, and you'll 
come unstuck pretty often if you don't put aside 
a bit of time to do so.
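
The three charsets named above can be compared directly. This is a minimal sketch, assuming Perl 5.8 or later with the core Encode module; hex_bytes and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Return the bytes of a string as space-separated upper-case hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

my $bullet = "\x{2022}";    # U+2022 BULLET, option-8 on a Mac keyboard

# In MacRoman the bullet occupies a single byte.
my $mac = hex_bytes(encode('MacRoman', $bullet));

# The same byte value read as Windows-1252 is a different character.
my $as_cp1252 = sprintf 'U+%04X', ord(decode('cp1252', "\xA5"));

# Latin-1 has no bullet. With FB_CROAK the conversion dies instead of
# quietly substituting another character.
my $copy = $bullet;         # encode() may modify its argument when CHECK is set
my $in_latin1 =
    eval { encode('iso-8859-1', $copy, Encode::FB_CROAK()); 1 } ? 'yes' : 'no';

print "Bullet in MacRoman:      $mac\n";
print "Byte A5 in Windows-1252: $as_cp1252\n";
print "Bullet in Latin-1?       $in_latin1\n";
```

The point of the sketch is only that one byte value names different characters depending on the charset the reader assumes.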

http://www.unicode.org/standard/WhatIsUnicode.html

%   perldoc -X Encode | more

SEE ALSO
Encode::Encoding, Encode::Supported, Encode::PerlIO, encoding,
perlebcdic, open in perlfunc, perlunicode, utf8, the Perl Unicode
Mailing List [EMAIL PROTECTED]

> How come Perl sees C2 A0 whenever HexEdit sees 
> CA and vice versa? I don't care what kind of 
> characters we are talking here. To paraphrase 
> Gertrude Stein, a byte is a byte is a byte. At 
> least that's what I thought until now.

Gertrude Stein was a character.  Some characters 
are a byte. Some are not, and you have to care.

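use strict ;
my $file = "/tmp/file" ;
# Write character \xF0 through the :utf8 layer (two bytes on disk).
open FILE, ">:utf8", $file ;
print FILE "\xF0" ;
close FILE ;
# Read the raw bytes back.
open FILE, $file ;
print "UTF-8:\t" . <FILE> . $/ ;
close FILE ;
# Write \xF0 again with no layer (one byte on disk).
open FILE, ">$file" ;
print FILE "\xF0" ;
close FILE ;
# Read the raw byte back.
open FILE, $file ;
print "MacRoman:\t" . <FILE> . $/ ;
close FILE ;
exit ;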
JD




Re: confusing bullets

2004-01-11 Thread Doug McNutt
At 13:52 -0500 1/11/04, Vic Norton wrote:
> Now I seem to have resolved the problem--sort of. I believe it's a bug in BBEdit.

I suspect BareBones will call it a feature.

The unwillingness of BBEdit to work on a file without doing things like changing all 
of the line ends has been a pain. It is impossible to edit and save a file that has 
mixed line ends. When looking at a file, even with show invisibles enabled, it's 
impossible to see what kind of line end is present. It's also possible that BBEdit is 
preprocessing the UTF* stuff before you get to see it. I usually run to MPW on my 8500 
to handle such things.

BBEdit's Hex Dump tool will let you look at the raw file but, in ordinary 
editing mode, there is no way to see, for instance, the byte order mark. Hex Dump will 
not allow any editing of the file.
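
For anyone without MPW to hand, the same two checks can be made on the raw bytes from perl itself. This is a rough sketch: inspect_bytes and the sample string are invented for the example and are not part of BBEdit or any module.

```perl
use strict;
use warnings;

# Report a leading UTF-8 byte-order mark and the mix of line endings.
sub inspect_bytes {
    my ($bytes) = @_;
    my %report = (bom => 0, crlf => 0, cr => 0, lf => 0);
    $report{bom} = 1 if substr($bytes, 0, 3) eq "\xEF\xBB\xBF";
    # The lookarounds keep a CRLF pair from also counting as a lone CR or LF.
    $report{crlf}++ while $bytes =~ /\x0D\x0A/g;
    $report{cr}++   while $bytes =~ /\x0D(?!\x0A)/g;
    $report{lf}++   while $bytes =~ /(?<!\x0D)\x0A/g;
    return \%report;
}

# Hypothetical sample: a BOM, then one line each with Mac, Unix and DOS endings.
my $sample = "\xEF\xBB\xBFmac\x0Dunix\x0Ados\x0D\x0A";
my $r = inspect_bytes($sample);

printf "BOM: %s  CRLF: %d  CR: %d  LF: %d\n",
    ($r->{bom} ? 'yes' : 'no'), $r->{crlf}, $r->{cr}, $r->{lf};
```

To run it against a real file, the bytes would have to be read with a ':raw' layer so that perl does no translation of its own.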

-- 
--  There are 10 kinds of people:  those who understand binary, and those who don't 
--


Re: confusing bullets

2004-01-11 Thread John Delacour
At 1:52 pm -0500 11/1/04, Vic Norton wrote:

>  # file0.pl - The data in file0.pl is a real bullet,
>  #namely A5. But the script file0.pl can't
>  #find it when run from BBEdit.

Vic, before I spend time testing this, what do 
you mean by "real bullet namely A5"?  Do you mean 
that we are to replace \x5 in the scripts you 
posted with Macintosh character • and save the 
scripts as Unix without "Encode as Unicode" 
checked?  There are a lot of options and your 
instructions are far from unambiguous.

I get the feeling you have not yet visited the Unicode page I recommended.

As to BBEdit, it's important to realise that it 
does not speak Unicode.  It uses an interpreter. 
What you see is not what you get.  If the 
interpreter can't convert to mac it prints the 
raw UTF-8.

If you run this script from BBEdit you will see how real your bullet is.

#!/usr/bin/perl
no warnings ;   # silences "Wide character in print"
$f = "/tmp/vicsbullet.html" ;  open  F, ">$f" ;
print "\x{1F00}";  print F "<html>\x{1F00}";
print chr 10;  print F chr 13;
print "\xA5";  print F "\xA5";
close F;
`open -a safari $f` ;


JD



Re: confusing bullets

2004-01-11 Thread Vic Norton
Hi John,

What I meant and sent was what Sherm Pendley called a bullet, namely 
what is produced on a Mac when you type option-8. The Unicode 
character is \x{2022}. On web pages it is &#8226;.

When I put <p>bullet &#8226;</p> on a web page, open the page in 
Safari, copy the bullet from the page, and paste it into a Mac 
Roman BBEdit text file, I see a bullet that is encoded in the file as 
A5. I can get the same bullet by typing option-8 in the text file.

I now realize that I was only seeing the CA bullet before because I 
was showing invisibles in BBEdit. Then the CA character looked like a 
faded bullet. Ordinarily it is invisible.
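
The numbers in this message can be lined up with a short sketch. It assumes Perl 5.8 or later with the core Encode and charnames modules; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use charnames ();

# &#8226; is a decimal character reference; 8226 is 2022 in hex.
my $codepoint = 8226;
my $as_hex    = sprintf '%X', $codepoint;
my $name      = charnames::viacode($codepoint);

# In MacRoman that character is a single byte (what option-8 types).
my $mac_byte = sprintf '%02X', ord(encode('MacRoman', chr($codepoint)));

# The byte CA in MacRoman is a different character altogether.
my $ca      = decode('MacRoman', "\xCA");
my $ca_name = charnames::viacode(ord($ca));

print "8226 = U+$as_hex $name, MacRoman byte $mac_byte\n";
printf "MacRoman CA = U+%04X %s\n", ord($ca), $ca_name;
```

A no-break space draws nothing on screen, which fits the observation that the CA character only showed up, as a faded bullet, with invisibles turned on.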

Regards,

Vic

P.S. I tried your script. I haven't the slightest idea what the 
output means. As I said, a real web bullet is &#8226;.




Re: confusing bullets

2004-01-11 Thread John Delacour
At 6:21 pm -0500 11/1/04, Vic Norton wrote:

> P.S. I tried your script below. I haven't the slightest idea what 
> the output means. As I said, a real web bullet is &#8226;.

A bullet can be written in valid html code in the real world in half 
a dozen different ways, and that would not be the preferred one 
either, but enough of this.

$f = "/tmp/bullet.html";
open F, ">$f";
print F '<html>&bull;';
close F;   # flush to disk before Safari reads the file
`open -a safari $f` ;
JD



confusing bullets

2004-01-10 Thread Vic Norton
Now and then I copy data from the web and paste it into a perl script
after __END__ or __DATA__. I plan to take the data apart with perl.
The file is generally a BBEdit text file with unix line feeds.
Sometimes there are bullets in the data. According to HexEdit these
bullets are \xca characters, but when I try to spot these characters
with m/\xca/ I find none.
So I look at a line that contains just a bullet, nothing more. HexEdit shows
that line as CA 0A, a bullet and a new line. On the other hand perl says that
the line has length 3. Using ord I see that the line consists of C2 A0 0A.
To prove its point perl has no trouble finding all bullets with the pattern
m/\xc2\xa0/.
What is going on here? HexEdit sees one byte for each bullet and perl
sees two. I thought hex stuff was unambiguous, but, as a mathematician,
I am pretty certain that 1 is not equal to 2.
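
One possible reading of these symptoms, offered as a sketch and not a diagnosis: if the pasted text is converted from MacRoman to UTF-8 somewhere between the editor and perl, the single byte CA turns into the two bytes C2 A0. The example assumes Perl 5.8 or later with the core Encode module; hex_bytes and the sample line are made up to match the description above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return the bytes of a string as space-separated upper-case hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# The line as HexEdit shows it: CA followed by a line feed.
my $on_disk = "\xCA\x0A";

# Assumption: something re-encodes the text from MacRoman to UTF-8.
my $seen_by_perl = encode('UTF-8', decode('MacRoman', $on_disk));

my $finds_ca    = $seen_by_perl =~ /\xca/     ? 'match' : 'no match';
my $finds_c2_a0 = $seen_by_perl =~ /\xc2\xa0/ ? 'match' : 'no match';

print 'HexEdit:     ', hex_bytes($on_disk), "\n";
print 'perl:        ', hex_bytes($seen_by_perl), "\n";
print 'length:      ', length($seen_by_perl), "\n";
print 'm/\xca/:     ', $finds_ca, "\n";
print 'm/\xc2\xa0/: ', $finds_c2_a0, "\n";
```

Under that assumption both tools are right: they are looking at two different byte sequences for the same character.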
Regards,

Vic

--
*---* mailto:[EMAIL PROTECTED]
| Victor Thane Norton, Jr.
| Mathematician and Motorcyclist
| phone: 419-353-3399
*---* http://vic.norton.name


Re: confusing bullets

2004-01-10 Thread John Delacour
At 11:22 am -0500 10/1/04, Vic Norton wrote:

> What is going on here? HexEdit sees one byte for each bullet and perl
> sees two. I thought hex stuff was unambiguous, but, as a mathematician,
> I am pretty certain that 1 is not equal to 2.

Perl talks UTF-8.  The bullet in Unicode is chr(8226), i.e. \x{2022}.

perldoc -X encoding | more

TMTOWTDI but it sounds as though you'd like to 
work as though Unicode didn't exist and something 
like this might be simplest.

binmode(STDOUT, ':encoding(MacRoman)') ;
my $display_in_dumb_editor = 1 ;
my $f = '/tmp/bullet.txt' ;
open F, ">$f" ;
print F "Here's a bullet •\r" ;
close F ;   # close first so the editor sees the finished file
`open -a 'simpletext' $f` if $display_in_dumb_editor ;
open F, $f ;
for (<F>) {
  /•/ and print "Got one !" or print " :- " ;
}
PS. for anyone rash enough, like me, to have 
installed 5.8.3 and having problems finding 
ConfigLocal.pm, this will solve the problem:

enc2xs -C

JD


Re: confusing bullets

2004-01-10 Thread Vic Norton
I'm sorry, John. I was talking figuratively. I didn't mean real bullets.

How come Perl sees C2 A0 whenever HexEdit sees CA and vice versa? 
I don't care what kind of characters we are talking here. To 
paraphrase Gertrude Stein, a byte is a byte is a byte. At least 
that's what I thought until now.

Regards,

Vic





Re: confusing bullets

2004-01-10 Thread Sherm Pendley
On Jan 10, 2004, at 9:26 PM, Vic Norton wrote:

> How come Perl sees C2 A0 whenever HexEdit sees CA and vice versa? 
> I don't care what kind of characters we are talking here. To 
> paraphrase Gertrude Stein, a byte is a byte is a byte. At least 
> that's what I thought until now.

Like John said - text encoding.

The file you're viewing with HexEdit is most likely encoded using 
MacRoman, or possibly ISO 8859-1. Internally, Perl uses UTF8 encoding.

Try this: Create a new text file in BBEdit, and enter a bullet (opt-8). 
Save it using the default text encoding. HexEdit shows a single byte in 
the file: A5. Now, open the file again, and save a copy of it using 
UTF8 encoding with no byte-order mark. HexEdit now shows *three* bytes: 
E2 80 A2. And, you have to tell BBEdit what encoding the file uses when 
you open it - without the byte-order mark, BBEdit can't tell it's UTF8.

Just for grins, save it again, this time *with* the byte-order mark. 
HexEdit now reports *six* bytes in the file: EF BB BF E2 80 A2.

In other words, yes - a byte is a byte is a byte. But you're not 
working with bytes, you're working with text. A character is not always 
a byte. It can be several bytes, depending on how it's encoded.
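
The byte counts above can be reproduced without BBEdit or HexEdit. This is a minimal sketch, assuming Perl 5.8 or later with the core Encode module; hex_bytes is a helper written for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the bytes of a string as space-separated upper-case hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

my $bullet = "\x{2022}";                  # one character: U+2022 BULLET

my $macroman = encode('MacRoman', $bullet);
my $utf8     = encode('UTF-8',    $bullet);
my $with_bom = "\xEF\xBB\xBF" . $utf8;    # byte-order mark prepended by hand

for my $row ( [ 'MacRoman',       $macroman ],
              [ 'UTF-8',          $utf8     ],
              [ 'UTF-8 with BOM', $with_bom ] ) {
    my ($label, $bytes) = @$row;
    printf "%-15s %d byte(s): %s\n", $label, length($bytes), hex_bytes($bytes);
}
print 'characters: ', length($bullet), "\n";
```

The character count stays at one throughout; only the number of bytes changes with the encoding.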

sherm--