Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support
On Tue, Feb 3, 2009 at 9:02 AM, Vincent Snijders vsnijd...@vodafonevast.nl wrote:
> I am a Lazarus developer, and I don't think I said it like that.

I wasn't pointing fingers at you, Vincent. :-) I summarized what a few people have said.

> If you want to show strings loaded by LoadFromFile in a LCL control, you need to make sure they are valid UTF8 strings. And honestly, only you can make sure that they are, because you know the initial encoding.

The problem is as follows. Even though I am a long-time developer, I often have no clue what encoding a file is in when I look at it in the Nautilus file manager. I often open the file in my preferred text editor, check whether it displays correctly, then look in the status bar for the encoding the editor detected (at least my editor does that nicely).

So even though you are using something as simple as the TMemo in LCL, and LCL always wants UTF-8, how do you know what encoding to convert from to UTF-8? Suppose I give you various text files, each using one of the following schemes: UTF-16, UTF-16BE, UTF-16LE, UTF-32 and whatever else I can find. If you load the file into a TStringList and then do UTF8Decode on each line, will it display correctly in the TMemo?

Now what if the memo content is changed and then saved? How does the TMemo know which encoding to use? (I would prefer the same encoding as before, not necessarily UTF-8.) So if the file was originally UTF-32, I don't want it to be UTF-8 afterwards.

If TStringList.LoadFromFile(...) took an encoding parameter, it could store that encoding internally, so that a later call to .SaveToFile('somefile.txt') could use the same encoding as LoadFromFile(), and otherwise default to something like UTF-8 if no encoding was specified anywhere.

Regards,
- Graeme -

___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/

___
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support
Graeme Geldenhuys wrote:
> The problem is as follows. Even though I am a long-time developer, I often have no clue what encoding a file is in when I look at it in the Nautilus file manager. I often open the file in my preferred text editor, check whether it displays correctly, then look in the status bar for the encoding the editor detected (at least my editor does that nicely).

The LCL does not have this feature. It can only handle UTF8. Period.

> So even though you are using something as simple as the TMemo in LCL, and LCL always wants UTF-8, how do you know what encoding to convert from to UTF-8?

If you don't know, you cannot process it. Simple.

> Suppose I give you various text files, each using one of the following schemes: UTF-16, UTF-16BE, UTF-16LE, UTF-32 and whatever else I can find. If you load the file into a TStringList and then do UTF8Decode on each line, will it display correctly in the TMemo?

For each of these encodings, you would first have to translate it to UTF8 before you give it to the LCL. Note that it is not wise to load UTF16* and UTF32 encoded files into a byte-indexed ansistring.

> Now what if the memo content is changed and then saved? How does the TMemo know which encoding to use? (I would prefer the same encoding as before, not necessarily UTF-8.) So if the file was originally UTF-32, I don't want it to be UTF-8 afterwards.

If you want it to be the same, then you have to convert it back. You know what it was in the first place, because you translated it to UTF8 before giving it to the LCL.

> If TStringList.LoadFromFile(...) took an encoding parameter, it could store that encoding internally, so that a later call to .SaveToFile('somefile.txt') could use the same encoding as LoadFromFile(), and otherwise default to something like UTF-8 if no encoding was specified anywhere.

Maybe. I leave that suggestion to the RTL developers. See also Marco's mail.

Vincent
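The proposal quoted above (remember the encoding at load time and reuse it when saving) can be sketched roughly as follows. This is only an illustration under stated assumptions, not an RTL API: TFileEncoding, TEncodedStringList, LoadEncoded and SaveEncoded are hypothetical names, only UTF-8 and UTF-16LE are handled, byte order marks are ignored, and the code assumes a recent FPC (3.x, for UnicodeString) on a little-endian CPU so that a UnicodeString in memory is laid out as UTF-16LE.

```pascal
program RememberEncodingSketch;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

type
  // Hypothetical: just enough encodings to illustrate the idea.
  TFileEncoding = (feUtf8, feUtf16LE);

  // Hypothetical descendant; LoadEncoded/SaveEncoded are not RTL methods.
  TEncodedStringList = class(TStringList)
  private
    FFileEncoding: TFileEncoding;
  public
    procedure LoadEncoded(const FileName: string; AEncoding: TFileEncoding);
    procedure SaveEncoded(const FileName: string);
    property FileEncoding: TFileEncoding read FFileEncoding write FFileEncoding;
  end;

procedure TEncodedStringList.LoadEncoded(const FileName: string;
  AEncoding: TFileEncoding);
var
  Stream: TFileStream;
  Wide: UnicodeString;
begin
  FFileEncoding := AEncoding;          // remembered for SaveEncoded
  if AEncoding = feUtf8 then
  begin
    LoadFromFile(FileName);            // bytes are taken as-is
    Exit;
  end;
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Wide, Stream.Size div SizeOf(WideChar));
    if Length(Wide) > 0 then
      Stream.ReadBuffer(Wide[1], Length(Wide) * SizeOf(WideChar));
  finally
    Stream.Free;
  end;
  Text := UTF8Encode(Wide);            // the list itself always holds UTF-8
end;

procedure TEncodedStringList.SaveEncoded(const FileName: string);
var
  Stream: TFileStream;
  Wide: UnicodeString;
begin
  if FFileEncoding = feUtf8 then
  begin
    SaveToFile(FileName);
    Exit;
  end;
  Wide := UTF8Decode(Text);            // convert back to the original encoding
  Stream := TFileStream.Create(FileName, fmCreate);
  try
    if Length(Wide) > 0 then
      Stream.WriteBuffer(Wide[1], Length(Wide) * SizeOf(WideChar));
  finally
    Stream.Free;
  end;
end;

var
  List: TEncodedStringList;
begin
  List := TEncodedStringList.Create;
  try
    List.Add('first line');
    List.FileEncoding := feUtf16LE;
    List.SaveEncoded('demo-utf16.txt');             // written as UTF-16LE
    List.Clear;
    List.LoadEncoded('demo-utf16.txt', feUtf16LE);
    List.Add('second line');
    List.SaveEncoded('demo-utf16.txt');             // still UTF-16LE
    WriteLn(List.Count);
  finally
    List.Free;
  end;
end.
```

The point of the sketch is only that the caller states the encoding once, as Vincent describes, and the container carries it along; existing calls to LoadFromFile and SaveToFile keep their current behaviour.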
Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support
On Tue, 3 Feb 2009, Vincent Snijders wrote:
> Graeme Geldenhuys wrote:
>> So even though you are using something as simple as the TMemo in LCL, and LCL always wants UTF-8, how do you know what encoding to convert from to UTF-8?
>
> If you don't know, you cannot process it. Simple.

This is why many editors and mail programs have a menu option 'Encoding': because they also don't know, and they cannot know without external means, what the encoding is.

Michael.
Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support
On Tue, 3 Feb 2009 09:39:58 +0100 (CET) Michael Van Canneyt mich...@freepascal.org wrote:
> On Tue, 3 Feb 2009, Vincent Snijders wrote:
>> If you don't know, you cannot process it. Simple.
>
> This is why many editors and mail programs have a menu option 'Encoding': because they also don't know, and they cannot know without external means, what the encoding is.

Let's add that to TMemo. ;)

Mattias
Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support
On Tue, Feb 3, 2009 at 10:44 AM, Mattias Gaertner nc-gaert...@netcologne.de wrote:
>> This is why many editors and mail programs have a menu option 'Encoding': because they also don't know, and they cannot know without external means, what the encoding is.
>
> Let's add that to TMemo. ;)

The point is that you will probably have a File Open dialog that gives the filename to TMemo.LoadFromFile. The file dialog could collect the filename and an optional encoding to pass on to LoadFromFile.

Even if TStringList gets the optional encoding parameter, prior source code should still work as-is without it. No code would be broken.

I agree with Marco that auto-detecting encodings is probably not a good idea in the RTL, but at least enable the option of an encoding parameter in TStringList, which could help things along, as in the case of the bug report.

Regards,
- Graeme -
Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support
On Tue, Feb 3, 2009 at 1:33 PM, Sergei Gorelkin sergei_gorel...@mail.ru wrote:
>   Strings.LoadFromStream(TDecodingStream.Create(TFileStream.Create('myfile', fmOpenRead), 'cp866', 'utf-8'));
>
> This approach isn't limited to decoding; you can do decrypting, compressing, etc.

That's actually a very clever idea. :)

Regards,
- Graeme -
Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support
On Tue, Feb 3, 2009 at 1:00 PM, Michael Schnell mschn...@lumino.de wrote:
> Would it not be necessary to define both the encoding of the file and the encoding you want the strings to have when accessing them?

I guess that would be taken care of once Free Pascal has a fully working UnicodeString type. I don't know what the status of that is in the 2.3.x code.

Regards,
- Graeme -
Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support
> I guess that would be taken care of once Free Pascal has a fully working UnicodeString type. I don't know what the status of that is in the 2.3.x code.

Of course (as already discussed several times).

Even if the state is not known, do we know the final goal? Will there be _only_one_ UnicodeString type, or will there still be things like ANSIString holding UTF8? Did the powers agree on a single white paper on this?

-Michael
Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support
On Tue, 3 Feb 2009, Sergei Gorelkin wrote:
> There is no need for a TStrings.LoadFromFile method at all. The container is one matter, and its serialization is a separate one. Just introduce a decoding stream and you can write e.g.:
>
>   Strings.LoadFromStream(TDecodingStream.Create(TFileStream.Create('myfile', fmOpenRead), 'cp866', 'utf-8'));
>
> This approach isn't limited to decoding; you can do decrypting, compressing, etc. In reality, of course, you have to finalize all the stuff - that means typing more than one line. The C++ language, which automatically finalizes on-stack objects, is more friendly in this respect.

That's easily done:
- Make TDecodingStream a descendant of TOwnerStream (which exists), and the TFileStream will be freed.
- Add a second parameter to LoadFromStream(AStream : TStream; FreeStream : Boolean = False)

and your call becomes

  Strings.LoadFromStream(TDecodingStream.Create(TFileStream.Create('myfile', fmOpenRead), 'cp866', 'utf-8'), True);

Michael.
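A much-simplified sketch of the decoding-stream idea discussed above. The assumptions are significant: TUtf16LEToUtf8Stream is a hypothetical class (TDecodingStream does not exist in the RTL), it handles only UTF-16LE to UTF-8 instead of arbitrary encoding names such as 'cp866', and it converts the whole source eagerly into memory rather than decoding on the fly as a TOwnerStream descendant would. It also assumes FPC 3.x and a little-endian CPU. What it does show is the ownership point: the wrapper takes care of freeing the source stream, so the caller only has to free the wrapper.

```pascal
program DecodingStreamSketch;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

type
  // Hypothetical helper, not an RTL class: reads a UTF-16LE source eagerly
  // and exposes the converted UTF-8 bytes as an in-memory stream.
  TUtf16LEToUtf8Stream = class(TMemoryStream)
  public
    constructor Create(ASource: TStream; AOwnsSource: Boolean);
  end;

constructor TUtf16LEToUtf8Stream.Create(ASource: TStream; AOwnsSource: Boolean);
var
  Wide: UnicodeString;
  Utf8: UTF8String;
  CharCount: SizeInt;
begin
  inherited Create;
  try
    CharCount := (ASource.Size - ASource.Position) div SizeOf(WideChar);
    SetLength(Wide, CharCount);
    if CharCount > 0 then
      ASource.ReadBuffer(Wide[1], CharCount * SizeOf(WideChar));
    // Drop a leading byte order mark (U+FEFF) if present.
    if (Length(Wide) > 0) and (Wide[1] = WideChar($FEFF)) then
      Delete(Wide, 1, 1);
    Utf8 := UTF8Encode(Wide);
    if Length(Utf8) > 0 then
      WriteBuffer(Utf8[1], Length(Utf8));
    Position := 0;                      // ready for LoadFromStream
  finally
    if AOwnsSource then
      ASource.Free;                     // source is no longer needed
  end;
end;

var
  Source: TMemoryStream;
  Decoded: TStream;
  Lines: TStringList;
  Sample: UnicodeString;
  Bom: WideChar;
begin
  // Build a small UTF-16LE "file" in memory, BOM included.
  Sample := 'one' + LineEnding + 'two' + LineEnding;
  Bom := WideChar($FEFF);
  Source := TMemoryStream.Create;
  Source.WriteBuffer(Bom, SizeOf(Bom));
  Source.WriteBuffer(Sample[1], Length(Sample) * SizeOf(WideChar));
  Source.Position := 0;

  Lines := TStringList.Create;
  Decoded := TUtf16LEToUtf8Stream.Create(Source, True);  // Source freed here
  try
    Lines.LoadFromStream(Decoded);
    WriteLn(Lines.Count);
  finally
    Decoded.Free;
    Lines.Free;
  end;
end.
```

With the FreeStream parameter Michael suggests for LoadFromStream, the try/finally around the wrapper would shrink to a single call; that parameter is a proposal in this thread and is not assumed here.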
Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support
Would it not be necessary to define both the encoding of the file and the encoding you want the strings to have when accessing them?

-Michael
Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support
Graeme Geldenhuys wrote:
> The point is that you will probably have a File Open dialog that gives the filename to TMemo.LoadFromFile. The file dialog could collect the filename and an optional encoding to pass on to LoadFromFile.
>
> Even if TStringList gets the optional encoding parameter, prior source code should still work as-is without it. No code would be broken.
>
> I agree with Marco that auto-detecting encodings is probably not a good idea in the RTL, but at least enable the option of an encoding parameter in TStringList, which could help things along, as in the case of the bug report.

There is no need for a TStrings.LoadFromFile method at all. The container is one matter, and its serialization is a separate one. Just introduce a decoding stream and you can write e.g.:

  Strings.LoadFromStream(TDecodingStream.Create(TFileStream.Create('myfile', fmOpenRead), 'cp866', 'utf-8'));

This approach isn't limited to decoding; you can do decrypting, compressing, etc.

In reality, of course, you have to finalize all the stuff - that means typing more than one line. The C++ language, which automatically finalizes on-stack objects, is more friendly in this respect.

Regards,
Sergei
Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support
Graeme Geldenhuys wrote:
> Hi,
>
> I just read all the comments about the following bug report filed under the Lazarus project.
>
> http://bugs.freepascal.org/view.php?id=12676
>
> The comments posted don't seem sufficient to me. If a user selects a file to be loaded, they have no clue if that file is ANSI, UTF-8, UTF-16 etc. encoded. The suggestion by the Lazarus developers is to ALWAYS assume the file is in UTF-8 (just because LCL uses UTF-8 internally) and to do a UTF8Encode on each line of the file. So what happens if you do a .SavetoFile(...)? Must you UTF8Decode each line again??

I am a Lazarus developer, and I don't think I said it like that.

What I mean is: if you load a file using LoadFromFile, the lines of the file are loaded into ansistrings. No conversion is done by the RTL, so the encoding remains the same as in the file.

Now, the LCL is very picky about its encoding: it always wants UTF8 encoded strings. It is not a chameleon like the RTL, which changes its encoding according to the system's settings.

If you want to show strings loaded by LoadFromFile in a LCL control, you need to make sure they are valid UTF8 strings. And honestly, only you can make sure that they are, because you know the initial encoding.

Vincent
Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support
In our previous episode, Graeme Geldenhuys said:
> This supposed solution fails horribly in practice. What if the file was UTF-16 encoded?

Then you can't load it into an ansistring tstringlist, since ansistring is one byte per char by default.

> I believe Delphi 2009 extended the .LoadFromFile(...) and .SaveToFile(...) methods with an optional encoding parameter.

Delphi (and 2.4 in the future) are totally separate things, since they actually have UTF-8 support. Currently 2.2.x only supports UTF-16 (widestring) and ansistring (in the native encoding); the rest is bolted on at best.

> Could something like this be added to TStringList etc?

In time, when D2009 compatibility is added, yes. But not for a crur

> I guess we would also need some auto encoding detection in place.

Never for a basic RTL primitive. That is something for editor programs, not library routines; in other words, this must be handled at a different level than the base RTL.

> How do other text editors manage to auto-detect the file encodings - to a degree of accuracy?

Start decoding the various types and count errors would be my first guess. Maybe heuristics are involved.

> Also, if the .LoadFromFile(...) and .SaveToFile(...) methods were extended, then we (GUI toolkit developers) could extend File Open and File Save dialogs like Qt has done. If the auto encoding detection didn't work, the user can use the combobox in the file open/save dialog to specify an encoding to use. Web browsers have a similar feature when displaying HTML.

Autodetection is the job of the toolkit developer, not of the base RTL.
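Marco's guess ("start decoding the various types and count errors") can be sketched at toolkit level along these lines. This is an illustration only, with hypothetical function names (StartsWithBytes, EncodingFromBom, CountUtf8Errors, GuessTextEncoding): it checks for a byte order mark first, then counts malformed UTF-8 sequences with a simplified validator that does not reject overlong forms or surrogates. It makes no attempt to tell single-byte code pages apart, which is exactly the case where the user has to be asked.

```pascal
program GuessEncodingSketch;
{$mode objfpc}{$H+}

// True if Data begins with exactly the given byte values.
function StartsWithBytes(const Data: RawByteString;
  const Bytes: array of Byte): Boolean;
var
  I: Integer;
begin
  Result := Length(Data) >= Length(Bytes);
  if not Result then
    Exit;
  for I := 0 to High(Bytes) do
    if Ord(Data[I + 1]) <> Bytes[I] then
      Exit(False);
end;

// Returns the encoding named by a leading byte order mark, or '' if none.
function EncodingFromBom(const Data: RawByteString): string;
begin
  // UTF-32LE must be tested before UTF-16LE: both start with FF FE.
  if StartsWithBytes(Data, [$FF, $FE, $00, $00]) then Result := 'UTF-32LE'
  else if StartsWithBytes(Data, [$00, $00, $FE, $FF]) then Result := 'UTF-32BE'
  else if StartsWithBytes(Data, [$EF, $BB, $BF]) then Result := 'UTF-8'
  else if StartsWithBytes(Data, [$FF, $FE]) then Result := 'UTF-16LE'
  else if StartsWithBytes(Data, [$FE, $FF]) then Result := 'UTF-16BE'
  else Result := '';
end;

// Counts byte sequences that are not well-formed UTF-8 (simplified check).
function CountUtf8Errors(const Data: RawByteString): Integer;
var
  I, Need, Got: Integer;
  B: Byte;
begin
  Result := 0;
  I := 1;
  while I <= Length(Data) do
  begin
    B := Ord(Data[I]);
    Inc(I);
    if B < $80 then Need := 0
    else if (B and $E0) = $C0 then Need := 1
    else if (B and $F0) = $E0 then Need := 2
    else if (B and $F8) = $F0 then Need := 3
    else
    begin
      Inc(Result);        // stray continuation byte or invalid lead byte
      Continue;
    end;
    // Consume the continuation bytes (10xxxxxx) the lead byte promised.
    Got := 0;
    while (Got < Need) and (I <= Length(Data)) and
          ((Ord(Data[I]) and $C0) = $80) do
    begin
      Inc(I);
      Inc(Got);
    end;
    if Got < Need then
      Inc(Result);        // sequence was cut short
  end;
end;

// BOM first, then the error count; 'unknown' means "ask the user".
function GuessTextEncoding(const Data: RawByteString): string;
begin
  Result := EncodingFromBom(Data);
  if Result <> '' then
    Exit;
  if CountUtf8Errors(Data) = 0 then
    Result := 'UTF-8'     // also the answer for plain ASCII
  else
    Result := 'unknown';
end;

begin
  WriteLn(GuessTextEncoding('plain ascii text'));
  WriteLn(GuessTextEncoding(Chr($E9) + 'a'));   // lone Latin-1 byte
end.
```

A result of 'unknown' is where the encoding combobox in the file dialog, as discussed in the thread, would come in.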