Re: confusing bullets
At 9:26 pm -0500 10/1/04, Vic Norton wrote:

> I'm sorry, John. I was talking figuratively. I didn't mean real bullets.

Figuratively or no, you were right on target with your choice. The bullet is a character in the 'macintosh' character set (wrongly called MacRoman by the Perl people) which does not exist in the widely used (or at least widely declared) charset Latin-1, and which has the same 8-bit code point as the yen sign in the Windows-1252 charset. It is to rebuild this Tower of Babel that Unicode was conceived and, far too slowly, brought into the computer world, first in Windows NT and finally in Mac OS X. Unicode is a 'good thing' but it has to be learned, and you'll come unstuck pretty often if you don't put aside a bit of time to do so.

http://www.unicode.org/standard/WhatIsUnicode.html

% perldoc -X Encode | more

SEE ALSO
    Encode::Encoding, Encode::Supported, Encode::PerlIO, encoding, perlebcdic, "open" in perlfunc, perlunicode, utf8, the Perl Unicode Mailing List [EMAIL PROTECTED]

> How come Perl sees C2 A0 whenever HexEdit sees CA and vice versa? I don't care what kind of characters we are talking here. To paraphrase Gertrude Stein, a byte is a byte is a byte. At least that's what I thought until now.

Gertrude Stein was a character. Some characters are a byte. Some are not, and you have to care.

use strict ;
my $file = "/tmp/file" ;
# write character F0 through the :utf8 layer
open FILE, ">:utf8", $file ; print FILE "\xF0" ; close FILE ;
# read the file back as raw bytes
open FILE, $file ; print "UTF-8:\t" . <FILE> . $/ ; close FILE ;
# write the single byte F0 with no layer
open FILE, ">$file" ; print FILE "\xF0" ; close FILE ;
# read the file back as raw bytes
open FILE, $file ; print "MacRoman:\t" . <FILE> . $/ ; close FILE ;
exit ;

JD
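The point that one 8-bit value means different characters in different charsets can be checked directly. The following is a minimal sketch using the core Encode and charnames modules; the helper name `describe_byte` is made up for illustration, and the `//` operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Decode one byte under the given charset and describe the resulting character.
# (describe_byte is an illustrative helper, not part of any module.)
sub describe_byte {
    my ($encoding, $byte) = @_;
    my $char = decode($encoding, $byte);
    my $name = charnames::viacode(ord($char)) // '(unnamed)';
    return sprintf('U+%04X %s', ord($char), $name);
}

# The same byte, A5, read as three different charsets.
for my $encoding (qw(MacRoman iso-8859-1 cp1252)) {
    printf "%-11s 0xA5 -> %s\n", $encoding, describe_byte($encoding, "\xA5");
}
```

Only the MacRoman reading yields U+2022 BULLET; the other two charsets assign that byte to a different character.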
Re: confusing bullets
At 13:52 -0500 1/11/04, Vic Norton wrote:

> Now I seem to have resolved the problem--sort of. I believe it's a bug in BBEdit. I suspect BareBones will call it a feature.

The unwillingness of BBEdit to work on a file without doing things like changing all of the line ends has been a pain. It is impossible to edit and save a file that has mixed line ends. When looking at a file, even with show invisibles enabled, it's impossible to see what kind of line end is present.

It's also possible that BBEdit is preprocessing the UTF* stuff before you get to see it. I usually run to MPW on my 8500 to handle such things.

BBEdit's Hex Dump tool will allow you to look at the raw file but, in ordinary editing mode, there is no way to see, for instance, the byte order mark. Hex Dump will not allow any editing of the file.

-- 
There are 10 kinds of people: those who understand binary, and those who don't
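When an editor hides the byte order mark and the line ends, a few lines of Perl can report them from the raw bytes. This is a rough sketch, not a BBEdit feature: the names `inspect_bytes` and `inspect_file` are invented here, and only the UTF-8 and UTF-16 byte order marks are checked.

```perl
use strict;
use warnings;

# Report the byte order mark (if any) and count each kind of line end.
# Works on raw bytes; both sub names are illustrative.
sub inspect_bytes {
    my ($bytes) = @_;
    my $bom = $bytes =~ /\A\xEF\xBB\xBF/ ? 'UTF-8'
            : $bytes =~ /\A\xFE\xFF/     ? 'UTF-16BE'
            : $bytes =~ /\A\xFF\xFE/     ? 'UTF-16LE'
            :                              'none';
    my $crlf = () = $bytes =~ /\x0D\x0A/g;        # DOS line ends
    my $cr   = () = $bytes =~ /\x0D(?!\x0A)/g;    # classic Mac line ends
    my $lf   = () = $bytes =~ /(?<!\x0D)\x0A/g;   # Unix line ends
    return { bom => $bom, crlf => $crlf, cr => $cr, lf => $lf };
}

# Slurp a file without any decoding layer and inspect it.
sub inspect_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;
    my $bytes = <$fh>;
    close $fh;
    return inspect_bytes($bytes);
}

# Usage: perl inspect.pl somefile.txt
if (@ARGV) {
    my $report = inspect_file($ARGV[0]);
    print "$_: $report->{$_}\n" for qw(bom crlf cr lf);
}
```

A file with mixed line ends shows up as more than one non-zero count among crlf, cr and lf.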
Re: confusing bullets
At 1:52 pm -0500 11/1/04, Vic Norton wrote:

> # file0.pl - The data in file0.pl is a real bullet,
> #namely A5. But the script file0.pl can't
> #find it when run from BBEdit.

Vic, before I spend time testing this, what do you mean by "real bullet namely A5"? Do you mean that we are to replace \x5 in the scripts you posted with the Macintosh character • and save the scripts as Unix without "Encode as Unicode" checked? There are a lot of options and your instructions are far from unambiguous. I get the feeling you have not yet visited the Unicode page I recommended.

As to BBEdit, it's important to realise that it does not speak Unicode. It uses an interpreter. What you see is not what you get. If the interpreter can't convert to mac it prints the raw UTF-8. If you run this script from BBEdit you will see how real your bullet is.

#!/usr/bin/perl
no warnings ;
$f = "/tmp/vicsbullet.html" ;
open F, ">$f" ;
print "\x{1F00}" ; print F "<html>\x{1F00}" ;
print chr 10 ; print F chr 13 ;
print "\xA5" ; print F "\xA5" ;
close F ;
`open -a safari $f` ;

JD
Re: confusing bullets
Hi John,

What I meant and sent was what Sherm Pendley called a bullet, namely what is produced on a Mac when you type option-8. The Unicode character is \x{2022}. On web pages it is &#8226;.

When I put <p>bullet &#8226;</p> on a web page, open the page in Safari, copy the bullet from the page, and paste it into a Mac Roman BBEdit text file, I see a bullet that is encoded in the file as A5. I can get the same bullet by typing option-8 in the text file.

I now realize that I was only seeing the CA bullet before because I was showing invisibles in BBEdit. Then the CA character looked like a faded bullet. Ordinarily it is invisible.

Regards,

Vic

P.S. I tried your script below. I haven't the slightest idea what the output means. As I said, a real web bullet is &#8226;.

At 8:37 PM + 1/11/04, John Delacour wrote:

> At 1:52 pm -0500 11/1/04, Vic Norton wrote:
>
> > # file0.pl - The data in file0.pl is a real bullet,
> > #namely A5. But the script file0.pl can't
> > #find it when run from BBEdit.
>
> Vic, before I spend time testing this, what do you mean by "real bullet namely A5"? Do you mean that we are to replace \x5 in the scripts you posted with the Macintosh character • and save the scripts as Unix without "Encode as Unicode" checked? There are a lot of options and your instructions are far from unambiguous. I get the feeling you have not yet visited the Unicode page I recommended.
>
> As to BBEdit, it's important to realise that it does not speak Unicode. It uses an interpreter. What you see is not what you get. If the interpreter can't convert to mac it prints the raw UTF-8. If you run this script from BBEdit you will see how real your bullet is.
>
> #!/usr/bin/perl
> no warnings ;
> $f = "/tmp/vicsbullet.html" ;
> open F, ">$f" ;
> print "\x{1F00}" ; print F "<html>\x{1F00}" ;
> print chr 10 ; print F chr 13 ;
> print "\xA5" ; print F "\xA5" ;
> close F ;
> `open -a safari $f` ;
>
> JD
Re: confusing bullets
At 6:21 pm -0500 11/1/04, Vic Norton wrote:

> P.S. I tried your script below. I haven't the slightest idea what the output means. As I said, a real web bullet is &#8226;.

A bullet can be written in valid HTML in the real world in half a dozen different ways, and that would not be the preferred one either, but enough of this.

$f = "/tmp/bullet.html" ;
open F, ">$f" ;
print F '<html>&bull;' ;
close F ;
`open -a safari $f` ;

JD
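Three of those spellings are character references, and they can be generated from the code point. A small sketch follows; `entity_spellings` is a name invented here, and the entity name `bull` is the standard HTML one for U+2022.

```perl
use strict;
use warnings;

# Return the named, decimal and hexadecimal character references
# for a code point. (entity_spellings is an illustrative helper.)
sub entity_spellings {
    my ($code_point, $name) = @_;
    return (
        "&$name;",
        sprintf('&#%d;',  $code_point),
        sprintf('&#x%X;', $code_point),
    );
}

print "$_\n" for entity_spellings(0x2022, 'bull');
```

A further spelling is the literal character itself, stored in whatever encoding the page declares.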
confusing bullets
Now and then I copy data from the web and paste it into a perl script after __END__ or __DATA__. I plan to take the data apart with perl. The file is generally a BBEdit text file with unix line feeds.

Sometimes there are bullets in the data. According to HexEdit these bullets are \xca characters, but when I try to spot these characters with m/\xca/ I find none.

So I look at a line that contains just a bullet, nothing more. HexEdit shows that line as CA 0A, a bullet and a new line. On the other hand perl says that the line has length 3. Using ord I see that the line consists of C2 A0 0A. To prove its point perl has no trouble finding all bullets with the pattern m/\xc2\xa0/.

What is going on here? HexEdit sees one byte for each bullet and perl sees two. I thought hex stuff was unambiguous, but, as a mathematician, I am pretty certain that 1 is not equal to 2.

Regards,

Vic

-- 
*---* mailto:[EMAIL PROTECTED]
| Victor Thane Norton, Jr.
| Mathematician and Motorcyclist
| phone: 419-353-3399
*---* http://vic.norton.name
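Both views of that line can be reproduced with the core Encode module. This sketch assumes, as later replies in the thread suggest, that the file on disk is MacRoman and that the text reaches perl re-encoded as UTF-8; `hex_bytes` is a helper invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Format a byte string as space-separated upper-case hex (illustrative helper).
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $on_disk = "\xCA\x0A";                     # the two bytes HexEdit shows
my $chars   = decode('MacRoman', $on_disk);   # bytes -> characters
my $as_utf8 = encode('UTF-8', $chars);        # characters -> UTF-8 bytes

printf "code points: %s\n",
    join ' ', map { sprintf('U+%04X', ord($_)) } split //, $chars;
printf "UTF-8 bytes: %s (length %d)\n", hex_bytes($as_utf8), length $as_utf8;
```

Under that assumption CA is MacRoman's no-break space, U+00A0, which UTF-8 stores as the two bytes C2 A0. The character count is the same in both views; only the byte count differs.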
Re: confusing bullets
At 11:22 am -0500 10/1/04, Vic Norton wrote:

> What is going on here? HexEdit sees one byte for each bullet and perl sees two. I thought hex stuff was unambiguous, but, as a mathematician, I am pretty certain that 1 is not equal to 2.

Perl talks UTF-8. The bullet in Unicode is chr(8226), i.e. \x{2022}.

perldoc -X encoding | more

TMTOWTDI, but it sounds as though you'd like to work as though Unicode didn't exist, and something like this might be simplest.

binmode(STDOUT, ':encoding(MacRoman)') ;
my $display_in_dumb_editor = 1 ;
my $f = '/tmp/bullet.txt' ;
open F, ">$f" ;
print F "Here's a bullet •\r" ;
close F ;   # close first so the file is complete before the editor opens it
`open -a 'simpletext' $f` if $display_in_dumb_editor ;
open F, $f ;
for (<F>) { /•/ and print "Got one !" or print ":-" }

PS. For anyone rash enough, like me, to have installed 5.8.3 and having problems finding ConfigLocal.pm, this will solve the problem:

enc2xs -C

JD
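An alternative that embraces Unicode rather than avoiding it is to name the file's encoding when opening it and match on the code point. The sketch below is one possible approach, not the method from the message above: `count_bullets` is an invented name, the temporary file stands in for real data, and the `:encoding(...)` layer assumes Perl 5.8 or later.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Count U+2022 BULLET characters in a file, decoding it with the named charset.
# (count_bullets is an illustrative helper.)
sub count_bullets {
    my ($path, $encoding) = @_;
    open my $in, "<:encoding($encoding)", $path
        or die "Cannot open $path: $!";
    my $count = 0;
    while (my $line = <$in>) {
        $count++ while $line =~ /\x{2022}/g;
    }
    close $in;
    return $count;
}

# Write one MacRoman-encoded line: the byte A5 is the MacRoman bullet.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "Here's a bullet \xA5\x0A";
close $fh;

printf "as MacRoman:   %d bullet(s)\n", count_bullets($path, 'MacRoman');
printf "as iso-8859-1: %d bullet(s)\n", count_bullets($path, 'iso-8859-1');
```

The pattern never mentions a byte value, so the same match works whatever encoding the file uses, provided the right name is passed to open.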
Re: confusing bullets
I'm sorry, John. I was talking figuratively. I didn't mean real bullets.

How come Perl sees C2 A0 whenever HexEdit sees CA and vice versa? I don't care what kind of characters we are talking here. To paraphrase Gertrude Stein, a byte is a byte is a byte. At least that's what I thought until now.

Regards,

Vic

At 5:33 PM + 1/10/04, John Delacour wrote:

> At 11:22 am -0500 10/1/04, Vic Norton wrote:
>
> > What is going on here? HexEdit sees one byte for each bullet and perl sees two. I thought hex stuff was unambiguous, but, as a mathematician, I am pretty certain that 1 is not equal to 2.
>
> Perl talks UTF-8. The bullet in Unicode is chr(8226), i.e. \x{2022}.
>
> perldoc -X encoding | more
>
> TMTOWTDI, but it sounds as though you'd like to work as though Unicode didn't exist, and something like this might be simplest.
>
> binmode(STDOUT, ':encoding(MacRoman)') ;
> my $display_in_dumb_editor = 1 ;
> my $f = '/tmp/bullet.txt' ;
> open F, ">$f" ;
> print F "Here's a bullet •\r" ;
> close F ;   # close first so the file is complete before the editor opens it
> `open -a 'simpletext' $f` if $display_in_dumb_editor ;
> open F, $f ;
> for (<F>) { /•/ and print "Got one !" or print ":-" }
>
> PS. For anyone rash enough, like me, to have installed 5.8.3 and having problems finding ConfigLocal.pm, this will solve the problem:
>
> enc2xs -C
>
> JD
Re: confusing bullets
On Jan 10, 2004, at 9:26 PM, Vic Norton wrote:

> How come Perl sees C2 A0 whenever HexEdit sees CA and vice versa? I don't care what kind of characters we are talking here. To paraphrase Gertrude Stein, a byte is a byte is a byte. At least that's what I thought until now.

Like John said - text encoding. The file you're viewing with HexEdit is most likely encoded using MacRoman, or possibly ISO 8859-1. Internally, Perl uses UTF-8 encoding.

Try this: Create a new text file in BBEdit, and enter a bullet (opt-8). Save it using the default text encoding. HexEdit shows a single byte in the file: A5.

Now, open the file again, and save a copy of it using UTF-8 encoding with no byte-order mark. HexEdit now shows *three* bytes: E2 80 A2. And, you have to tell BBEdit what encoding the file uses when you open it - without the byte-order mark, BBEdit can't tell it's UTF-8.

Just for grins, save it again, this time *with* the byte-order mark. HexEdit now reports *six* bytes in the file: EF BB BF E2 80 A2.

In other words, yes - a byte is a byte is a byte. But you're not working with bytes, you're working with text. A character is not always a byte. It can be several bytes, depending on how it's encoded.

sherm--
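The byte counts from that experiment can be reproduced without an editor. Here is a short sketch using the core Encode module, with `hex_bytes` as a made-up helper and U+FEFF prepended by hand to stand in for the byte-order mark.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Format a byte string as space-separated upper-case hex (illustrative helper).
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $bullet = "\x{2022}";    # one character: U+2022 BULLET

my %saved_as = (
    'MacRoman'       => encode('MacRoman', $bullet),
    'UTF-8, no BOM'  => encode('UTF-8', $bullet),
    'UTF-8 with BOM' => encode('UTF-8', "\x{FEFF}" . $bullet),
);

for my $label ('MacRoman', 'UTF-8, no BOM', 'UTF-8 with BOM') {
    printf "%-15s %d byte(s): %s\n",
        $label, length $saved_as{$label}, hex_bytes($saved_as{$label});
}
```

In each case `$bullet` is a single character; only its encoded form changes length.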