On December 28, 2002 at 20:51, Nick Ing-Simmons wrote:

> >BTW, in the t/fallbacks.t test case of Encode, 8-bit characters are
> >used for the ascii test, and entity references are generated for the
> >8-bit characters.
> >
> >As I stated in my original post, the problem is that t/fallbacks.t
> >tests an undocumented (or poorly documented) Encode interface, and
> >it does not test the well-documented interface.
>
> Whether un(der)?documented or not the object style used in t/fallback.t
> is the way the internals work.
But t/fallback.t fails to properly test what is clearly documented as
the API in the docs. t/fallback.t *should* also test the documented API
functions and not just how the internals work. IMO, any experienced
test engineer will agree with this assessment.

> You say "... it is impractical to maintain unique
> conversion tables between all types of character encodings." - it is even
> more impractical to _test_ them that way.

Agreed, but in testing you can have cases that represent a whole class
of cases -- in this case, a conversion from one set to another where
the original set contains octets/bytes/characters that are undefined
(e.g. ISO-8859-3). There is no need to test all possible combinations.

> >So why doesn't the from_to() usage generate the same results?
>
> Because the ->decode side has removed the non-representable octets
> and replaced them with 4-chars each: \xHH.
> So there are no hi-bit chars to cause entity refs.

This is the explanation I was looking for. I.e. from_to() is not
"atomic"; it is really a two-step process. (This is obvious to the
technically inclined when thinking about how the internals may work,
but the fallback flags are also affected by the two-step process.)
This should be documented, since the semantics of the fallback flags
as documented are not preserved across the from_to() process. If it
were "atomic", the ->decode side would _not_ remove the
non-representable octets and replace them with 4-chars, but would
"pass them through" to the ->encode side so the fallback flags would
have the predicted effects. I realize doing this may complicate the
implementation of from_to(). Therefore, the semantics of the fallback
flags should either be documented for from_to() or not supported at
all, maybe by issuing a warning.

> You can get that (I believe) by passing appropriate fallback options to
> ->decode of ASCII. I personally dislike fallback to '?' as it looses
> information in a way that is hard to back-track - which is why default
> fallback is \xHH.

Reasonable. Note that this behavior highlights the confusion for the
user of from_to(): when FB_XMLCREF is specified and all of a sudden
\xHH's show up, it implies that FB_PERLQQ was being used.

> >Maybe I am misunderatanding Encode's conversion operations, so
> >maybe it is a problem with the documentation not being clear about
> >this behavior. But IMHO, what I am getting appears to be incorrect.
>
> And IMHO you are getting what I "designed" it to produce ;-)

As I like to say, "works as coded." ;-)

> I strongly recommend doing conversions in two steps explcitly - that way
> you can get whatever you want.

I find from_to() much more convenient code-wise. I think the
limitations of from_to() should be documented, or its use deprecated,
since it appears to be just a wrapper around ->decode/->encode. Note
that some may think calling from_to() is slightly more efficient than
doing the ->decode/->encode directly (i.e. from_to() could be an XS
routine and/or short-cut some steps). If this is not the case, why
even bother having from_to()?

> I am also willing to concede that documentation could be improved :-)

Of course, no one reads the documentation :-)

--ewh
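[Editor's note: the two conversion paths discussed in this thread can be sketched roughly as follows. This is a minimal illustration against the public Encode API (from_to, decode, encode, and the FB_PERLQQ/FB_XMLCREF fallback flags, all of which are real); the sample string is made up, and the exact fallback text may vary by Encode version.]

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

# Octets that are not valid US-ASCII: \xE9 is a hi-bit byte.
my $octets = "caf\xE9";

# One-shot path. Even though FB_XMLCREF is requested, the internal
# decode("ascii", ...) half handles the bad octet first (typically
# replacing it with the 4-char perlqq-style escape "\xE9"), so the
# encode half sees only plain ASCII and never emits an &#xE9;
# entity reference -- the surprise described above.
my $one_shot = $octets;
from_to($one_shot, "ascii", "iso-8859-1", Encode::FB_XMLCREF);

# Explicit two-step path, where you choose which fallback applies
# at which step:
my $string = decode("ascii", $octets, Encode::FB_PERLQQ);
my $result = encode("iso-8859-1", $string, Encode::FB_XMLCREF);

print "from_to:  $one_shot\n";
print "two-step: $result\n";
```

Running both paths side by side makes it easy to see where each fallback flag actually takes effect.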
