Even,

Thanks for bringing this to my attention. We're winding down a major release here (hence my relative absence from the mailing list, except as a lurker with occasional comments), but this issue is one I wanted to revisit this summer. Snips and comments below.

On May 25, 2010, at 7:31 PM, Even Rouault wrote:

> Alexander,
>
> I'm cc'ing Gaige Paulsen as he proposed in
> http://trac.osgeo.org/gdal/ticket/3403 a patch with a similar approach to
> yours, that is to say provide a method at the OGRLayer level to return the
> encoding.
>
> The more I think about this issue, the more I recognize that "UTF-8
> everywhere internally" is probably not practical in all situations, or at
> least doesn't leave enough control to the user. UTF-8 as a pivot is OK,
> conceptually, for the read part, but it doesn't help for the write part
> when a driver doesn't support UTF-8 (or if, for compatibility reasons with
> other software, we must write data in a certain encoding).
>
> My main remark about your patch is that I don't believe the enum approach
> to listing the encodings is the best one. I'd rather be in favor of using a
> string, and possibly sticking to the ones returned by 'iconv -l' so that we
> can easily use the return of GetEncoding() to feed it into the converter
> through CPLRecode(). I experimented with it some time ago and have some
> changes ready in cpl_recode_stub.cpp & configure to plug iconv support into
> it, in order to extend its scope beyond the current hardcoded support for
> UTF8 and ISO-8859-1.

I agree with using strings instead of enums for this.

> We could imagine -s_encoding, -t_encoding and -a_encoding switches to
> ogr2ogr to let the user define the transcoding or encoding assignment. One
> of the difficulties raised by Gaige in #3403 is the meaning of the width
> attribute of an OGRFieldDefn object (number of bytes or number of
> characters in a given encoding), and how/if it will be affected by an
> encoding change.
>
> The other issues raised by Gaige in his last comment are still worth
> considering. For the read part, what do we want?
> 1) that the driver returns the data in its "raw" encoding and mentions the
> encoding --> matches the approach of your proposal
> 2) that we ask it to return the data in UTF-8 when we don't care about the
> data in its source encoding
> 3) that we can override its encoding when the source encoding is believed
> to be incorrect so that 2) can work properly

I still think that UTF-8 as a pivot makes sense and works well for most cases (we tend to use it internally as well). Mostly, I prefer option 3. I was specifically looking at these different problems:

- Situations where a format handles multiple encodings, but the encoding in the file being read is either ambiguous or incorrect.
- Situations where there is a need to determine which of multiple encodings to use (sometimes necessary for compatibility reasons).
- Situations where storage space is very tight and field width must be intelligently degraded.

> Approaches 1) and 2) clearly follow two different tracks. One way to
> reconcile both would be to provide some configuration/opening option to
> choose which behaviour is preferred. RFC23 currently chooses 2) as it
> mandates that "Any driver which knows it's encoding should convert to
> UTF-8." Well, probably not a big deal, since any change related to how we
> deal with encoding is likely to cause RFC23 to be amended anyway.
>
> Personally, I'm not sure which one is best. I'm wondering what the use
> cases for 1) are: when do we really want the data to be returned in its
> source encoding --> will it not be converted later to UTF-8 at the
> application level after the user has potentially selected/overridden the
> source encoding? In which case 3) would solve the problem. Just thinking
> out loud...

I think option 3 could well solve the problem here. The only downside that I see is that it might make implementation of auto-detection difficult if someone were to attempt to do that in a manner that not everyone agreed on. However, the cases that I listed above would be resolved by option 3.

> For the write part, an OGRSFDriver::GetSupportedEncodings() and
> OGRLayer::SetEncoding() could make sense (for the latter, whether it must
> be exposed at the datasource or layer level is an open point and a slight
> difference between yours and Gaige's approach)

Is there a need for a per-layer approach to this? I've yet to see a format that allowed different encodings in different layers. Although, thinking about it, it might be a problem when using some of the virtual data sets, since they hide some of this.

-Gaige
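
P.S. To make the width question concrete, here is a rough sketch in Python (not OGR code) of the bytes-versus-characters problem, and of one way a field value could be degraded to fit a byte width without splitting a multi-byte character. The helper name `truncate_to_byte_width` is hypothetical and only for illustration, and it assumes the field width counts bytes in the target encoding, which is precisely the question raised in #3403.

```python
def truncate_to_byte_width(text: str, width: int, encoding: str = "utf-8") -> bytes:
    """Encode `text`, keeping at most `width` bytes and never splitting a character.

    Hypothetical helper for illustration; assumes `width` is a byte count.
    """
    if width < 0:
        raise ValueError("width must be non-negative")
    encoded = text.encode(encoding)
    if len(encoded) <= width:
        return encoded
    # Cut at the byte limit, then drop any incomplete trailing character:
    # decoding with errors="ignore" discards the partial sequence left by the cut.
    return encoded[:width].decode(encoding, errors="ignore").encode(encoding)


name = "Z\u00fcrich"  # six characters, one of them non-ASCII

# The same six characters need a different number of bytes per encoding.
utf8_len = len(name.encode("utf-8"))         # 7 bytes
latin1_len = len(name.encode("iso-8859-1"))  # 6 bytes

# A 2-byte limit in UTF-8 would fall in the middle of the two-byte "u umlaut",
# so only "Z" survives.
short = truncate_to_byte_width(name, 2)
```

With a width of 6, the ISO-8859-1 form fits untouched while the UTF-8 form loses its last character, so a width carried over unchanged across an encoding change can silently truncate data.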