On 9/20/07, P Kishor <[EMAIL PROTECTED]> wrote:
> On 9/20/07, Trevor Talbot <[EMAIL PROTECTED]> wrote:
> > On 9/20/07, P Kishor <[EMAIL PROTECTED]> wrote:
> > > Lucknow:~/Data/ecoservices punkish$ less foo.csv
> > > "the first record"
> > > "\351 \347 \361 \356"
> > > "more from 3rd row"
> > > "row four"
> > > "these \223volunteered\224 activities"
> > > "<\341 \370 \343 \374 \356 & others>"
> > > foo.csv (END)
> > > -----------------------------
> >
> > Note that this is *not* UTF-8. If you're still using this as test
> > data, you need to get rid of it and use UTF-8 encoded data instead.

> this is where I lost you... when you say "this" is not UTF8, what is
> "this"?

The data in the file shown by less, and since sqlite3 exported that
data exactly as it was stored, the data in the db as well.

> All I want is that I want (1) the user to be able to type ç in
> the web form, and (2) I want to be able to save ç in the db. (3) Then
> when I look at that data, either on the command line, but definitely
> back on the web, I want it to appear as ç. (4) If I export it, I
> should still be able to see it as ç and not something else.
>
> Seems like I was able to do 1, 2, and 3 with my test case, but not 4
> (I got \347 instead of ç).
>
> Also, in my production case, 1, 2, and 3 are not very reliable. Are you
> saying my data above are not UTF8? If so, I would like to know how you
> can tell that, so I can recognize it in the future myself. Also, I
> would like to know how I can do what you are suggesting I should do,
> that is, how can I ensure that I "use UTF8 encoded data"?

Okay, first a quick primer on character sets and encodings.

A byte can hold one of 256 different values (0-255), and most
processing tends to happen on bytes, so it makes sense that individual
characters should be stored as individual bytes.

First we have US-ASCII, the character encoding standard that defines
128 characters, including the basic English alphabet, numbers, and
some punctuation (www.asciitable.com). However, this obviously doesn't
cover all the symbols in common use, or characters from other
languages, so more definitions are needed.
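To make that boundary concrete, here's a small Python sketch (Python
isn't part of this thread; I'm just using it as a convenient tool for
poking at bytes):

```python
# US-ASCII defines exactly 128 characters, values 0-127.
assert "A".encode("ascii") == b"\x41"   # 'A' is value 65, well inside ASCII

# A character like ç has no ASCII value at all, so encoding it fails.
try:
    "ç".encode("ascii")
except UnicodeEncodeError:
    print("ç is not an ASCII character")
```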
Since ASCII takes up only half of a byte's 256 values, that leaves 128
values for other purposes. Many other character sets keep the bottom
half as ASCII, and assign different characters to the top 128 values.
The ISO-8859 family of standards works this way.

ISO-8859-1 is also known as Latin-1, and is most common for languages
that use characters similar to English, Spanish, etc. It adds a few
more symbols (copyright, paragraph, etc) and some common characters
with diacritical marks (like é ç ñ î). The data you posted above was
entered into your database using this encoding (or Windows-1252, which
is identical except for adding some characters in places 8859-1 does
not use).

ISO-8859-2 is also known as Latin-2, and covers another set of
European languages (such as Romanian). It contains a different set of
symbols and characters with diacritical marks needed for these
languages, characters that don't fit in 8859-1.

It keeps going, of course (Wikipedia has info:
http://en.wikipedia.org/wiki/Category:ISO_8859). There are many other
encodings that work this way, and collectively they're known as
single-byte encodings: they all represent a character as a single
byte, but the actual meaning of that byte depends on the character set
in use.

This situation is ripe for confusion, since interpreting a sequence of
bytes as being in a different encoding than the one it was stored in
will lead to strange results. This is exactly what you saw in your
Cocoa editor, since it defaulted to using the classic MacRoman
encoding, which uses those same byte values to store uppercase
characters instead.

It gets worse: there are multi-byte encodings too. You typically see
these in the East Asian languages, since they don't use the same
alphabetic writing system, and instead have thousands of characters to
encode. A byte only supports a mere 256 values, so more than one byte
is needed to represent a single character.
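Here's that confusion in miniature, sketched in Python (my
illustration, not something from the thread): the exact bytes from
your CSV dump, decoded as Latin-1 versus MacRoman:

```python
# The four octal escapes less showed (\351 \347 \361 \356) are these bytes:
data = b"\xe9\xe7\xf1\xee"

# Decoded as ISO-8859-1 (Latin-1), the encoding the data was stored in:
print(data.decode("latin-1"))     # éçñî

# The very same bytes decoded as MacRoman come out as different
# characters, which is what the Cocoa editor was doing by default.
print(data.decode("mac_roman"))
```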
By now you can see how this can spiral into an unmaintainable mess:
you have to worry about this encoding and that encoding, and you can
store the encoding with the text, but what do you do if someone
requests data in another encoding, and what if they are using a
specific encoding but that text only contains ASCII characters and
therefore everyone should see it anyway, and how do you tell the
difference and *brain asplode*

Enter Unicode, which has the goal of putting all the world's commonly
used language characters and symbols into one single character set. By
using Unicode, you don't have to worry about which character set your
data is in, and you can move on to other more interesting issues.

Of course, it's a very large character set, supporting just over 1
million characters. Obviously these don't all fit in one byte, so
there are also several standard encodings. UTF-8 is one of those, and
the most common one for Internet use.

UTF-8 was designed to be ASCII-friendly: the first 128 byte values
(0-127) are identical to ASCII. All Unicode characters beyond those
first 128 are represented using 2 or more bytes in a row, with each
byte having a value in the range 128-255.

The tool you used to look at the CSV file you wrote out, less, is
designed for ASCII use. When it encounters a byte it can't print as an
ASCII character, it displays the value using an octal escape sequence
instead (\351 etc). (Many other unixy tools accept such escape
sequences as input, so even if you can't print or type such characters
you can still work with them using your keyboard.)

With the above two items in mind, that's how I knew it wasn't UTF-8
right away: you showed 4 characters that were not in the ASCII range.
In UTF-8, they would require at least 2 bytes each, for a total of 8+
bytes, but the output from less showed only 4 bytes. Therefore, it's
some single-byte encoding, and not UTF-8.
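You can check that byte math yourself; a quick Python sketch (again my
illustration, not from the thread):

```python
# One character, two encodings, different byte counts:
assert "ç".encode("latin-1") == b"\xe7"      # 1 byte: octal \347, what less showed
assert "ç".encode("utf-8") == b"\xc3\xa7"    # 2 bytes, both in the 128-255 range

# The flip side makes a handy validity test: single-byte Latin-1 data
# for these characters is simply not a legal UTF-8 byte sequence.
try:
    b"\xe9\xe7\xf1\xee".decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```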
(Familiarity with the issues and some staring at code charts on
Wikipedia let me reach the conclusions about ISO-8859-1/Win-1252 and
MacRoman, but you don't need to go that far just to check for UTF-8.)

Okay, so now you want to make sure you always use UTF-8. I'll defer to
others for the web situation, as I simply don't have enough experience
there. They're probably right that at this point you need to look at
the Perl CGI side of things.

As for working with the database directly, probably the easiest thing
you can do is change the encoding Terminal.app uses. I'm on Panther
and don't have a Tiger machine to test with, but as far as I'm aware
it works the same way. Under the Terminal -> Window Settings menu,
Display section, you can set the encoding to UTF-8. This will cause it
to translate byte sequences on both input and display.

With that change made, try starting sqlite3 and inserting those
characters above directly on the command line. Do the CSV output
again, then you can repeat the viewing with less and your Cocoa editor
to see if you get different results.

There's another easy check you can do though, with the file utility:

  file foo.csv

It should say something like "foo.csv: UTF-8 Unicode text" if those
characters were output as UTF-8, and "foo.csv: ISO-8859 text" for the
file you originally made.

Once you have some sample data stored in sqlite that way, you'll be
able to test the storage and display parts of your web app separately,
which should help narrow down problems you encounter there. There are
also some third-party GUI database management tools; perhaps someone
here can recommend a good one for OS X that supports UTF-8? I've yet
to try any myself.
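If it helps to test the storage layer in isolation, here's a sketch
using Python's sqlite3 module (my illustration; the table and file
names are made up). SQLite stores text as UTF-8 by default, and
Python hands text in and out as proper strings, so a clean round trip
looks like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # throwaway in-memory database
conn.execute("CREATE TABLE notes (txt TEXT)")
conn.execute("INSERT INTO notes VALUES (?)", ("é ç ñ î",))

(stored,) = conn.execute("SELECT txt FROM notes").fetchone()
assert stored == "é ç ñ î"                # round-trips intact

# Export it the way a CSV dump would, writing UTF-8 explicitly:
with open("foo.csv", "w", encoding="utf-8") as f:
    f.write('"%s"\n' % stored)
# `file foo.csv` should now report it as UTF-8 Unicode text.
```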