Re: TT and UTF8?
On 31 Jan 2010, at 22:57, Tomas Doran wrote:
> On 29 Jan 2010, at 23:10, Dave Hodgkinson wrote:
>> I've turned off caching in TT, inserted "Motörhead múm" into the
>> template as static text and a BOM (od -x 000 bbef 3cbf...) as
>> the first octets. Works first time, fails second.
>
> Are you using Cache::Memcached from Template::Plugin::Cache?

No :)

--
Dave Hodgkinson
MSN: daveh...@hotmail.com
Site: http://www.davehodgkinson.com
UK: +44 7768 490620
Blog: http://www.davehodgkinson.com/blog
Photos: http://www.flickr.com/photos/davehodg
Re: TT and UTF8?
2010/1/31 Tomas Doran:
> Anyway - Cache::Memcached borks everything by not correctly storing the
> utf8 flag...

I wonder whether this is a legacy of its having been developed at LiveJournal -- which uses UTF-8 for entries etc. but treats it all as bytes in Perl (I've even seen bits of code that explicitly strip the UTF-8 flag, I think). So things such as 'ö' would be not just two bytes but also two (Perl) characters inside LiveJournal's innards.

Cheers,
Philip
--
Philip Newton
Re: TT and UTF8?
On 29 Jan 2010, at 23:10, Dave Hodgkinson wrote:
> I've turned off caching in TT, inserted "Motörhead múm" into the
> template as static text and a BOM (od -x 000 bbef 3cbf...) as
> the first octets. Works first time, fails second.

Are you using Cache::Memcached from Template::Plugin::Cache? As I was bitten by this the other day.. :)

(I'd consider doing this a code smell - but I have some legacy crud at $ork with too much logic in TT, so it's useful..)

Anyway - Cache::Memcached borks everything by not correctly storing the utf8 flag...

You can fix this by hacking Template::Plugin::Cache to store a 1 item list (and retrieve that item), which will cause Cache::Memcached to serialize everything through Storable, and ergo avoid the issue.. Which is gross, but works.

I'm also told that if you use CHI, or one of the other clients - they're more likely to get it right, but I haven't tried this myself..

Cheers
t0m
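A minimal sketch of why wrapping the value in a one-item list can help, using only core modules. It does not talk to memcached at all: `roundtrip_as_bytes` is a hypothetical stand-in for a cache client that stores a plain scalar as bare octets and forgets the utf8 flag, and `roundtrip_via_storable` mimics what happens once the value is a reference and so gets serialised by Storable. Both helper names are made up for illustration and are not part of Template::Plugin::Cache or Cache::Memcached.

```perl
use strict;
use warnings;
use utf8;                          # this source file is itself UTF-8
use Encode   qw(encode_utf8);
use Storable qw(nfreeze thaw);

# Hypothetical stand-in for a cache that keeps only the octets of a plain
# scalar and hands them back without the utf8 flag.
sub roundtrip_as_bytes {
    my ($text) = @_;
    return encode_utf8($text);     # comes back as undecoded octets
}

# Wrap the value in a one-item list so that it is a reference and is
# serialised through Storable, which records the utf8 flag.
sub roundtrip_via_storable {
    my ($text) = @_;
    my $frozen = nfreeze([$text]);
    return thaw($frozen)->[0];
}

my $text  = "Motörhead múm";
my $bytes = roundtrip_as_bytes($text);
my $safe  = roundtrip_via_storable($text);

printf "original: %d chars\n", length $text;
printf "as bytes: %d chars, same string? %s\n",
    length $bytes, ($bytes eq $text ? "yes" : "no");
printf "storable: %d chars, same string? %s\n",
    length $safe, ($safe eq $text ? "yes" : "no");
```

The byte round trip comes back longer than the original because each accented character has turned into two; the Storable round trip returns the same character string that went in.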
Re: TT and UTF8?
On 29/01/10 23:10, Dave Hodgkinson wrote:
> On 29 Jan 2010, at 19:07, Dave Cross wrote:
>> There's a Perlanet fork that has a hack for dealing correctly with
>> Templates that contain UTF-9 whether or not they contain a BOM.
>>
>> http://github.com/kappa/perlanet/blob/master/lib/Perlanet.pm
>>
>> It does it by overriding the Template::Provider::_decode_unicode
>> subroutine.
>
> So you're saying this is a problem you've had?

No. I'm saying it's a problem that the people running the Russian Planet Perl had. They have a lot of UTF-8. It's their fork. I haven't merged it back into the production version yet.

Dave...
Re: TT and UTF8?
On Fri, Jan 29, 2010 at 11:10 PM, Dave Hodgkinson wrote:
> As far as I'm concerned it's getting mangled *after* I've unleashed it
> to apache.

Mark's observation on i18n data exchange: There's always *another* UTF-8 bug lurking somewhere. You just haven't found it yet.

I suggest checking that TT is actually sending the right darn thing to apache. May I suggest sticking a

    use Devel::Peek;

at the top of the file and then doing something like

    $tt->process($template, $things, \$output);
    Dump $output;
    ...hand $output to apache...

This will print an ASCII representation of $output to STDERR, with enough info to see the underlying bytes and utf8 flag settings that perl is using to represent your output. Then you can tell what's going on.

Mark.
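A small standalone sketch of the Devel::Peek suggestion, for anyone who wants to see what the dump looks like before wiring it into a handler. It does not involve TT or apache; the two variables are illustrative only. It dumps the same text once as a character string and once as UTF-8 octets, so the two dumps can be compared side by side on STDERR. The things to look at are whether `UTF8` appears in the FLAGS line and what the PV line shows.

```perl
use strict;
use warnings;
use utf8;                         # this source file is itself UTF-8
use Devel::Peek qw(Dump);
use Encode      qw(encode_utf8);

my $chars  = "Motörhead";            # character string
my $octets = encode_utf8($chars);    # the same text as UTF-8 octets

# Dump writes to STDERR, so label each dump there as well.
print STDERR "--- character string ---\n";
Dump($chars);
print STDERR "--- encoded octets ---\n";
Dump($octets);

# A cheaper check when reading a full dump is more than is needed.
printf "chars:  is_utf8=%d length=%d\n",
    (utf8::is_utf8($chars)  ? 1 : 0), length $chars;
printf "octets: is_utf8=%d length=%d\n",
    (utf8::is_utf8($octets) ? 1 : 0), length $octets;
```

If the string handed to apache looks like the second dump on one request and like the first on another, that difference is a likely place to start looking.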
Re: TT and UTF8?
On 29 Jan 2010, at 19:07, Dave Cross wrote:
> There's a Perlanet fork that has a hack for dealing correctly with Templates
> that contain UTF-9 whether or not they contain a BOM.
>
> http://github.com/kappa/perlanet/blob/master/lib/Perlanet.pm
>
> It does it by overriding the Template::Provider::_decode_unicode subroutine.

So you're saying this is a problem you've had?

I've turned off caching in TT, inserted "Motörhead múm" into the template as static text and a BOM (od -x 000 bbef 3cbf...) as the first octets. Works first time, fails second.

I parse this out of the final string, and it is also printed using an octet unpicker:

[Fri Jan 29 22:55:48 2010] -e: Motörhead
[Fri Jan 29 22:55:48 2010] -e: 4d6f74c3b67268656164

Looks about right. When it fails:

[Fri Jan 29 22:55:52 2010] -e: Motörhead
[Fri Jan 29 22:55:52 2010] -e: 4d6f74c3b67268656164

As far as I'm concerned it's getting mangled *after* I've unleashed it to apache.

It's a low traffic app. I'm sorely tempted to load up a startup.pl and set MaxRequestsPerChild to 1.

--
Dave Hodgkinson
MSN: daveh...@hotmail.com
Site: http://www.davehodgkinson.com
UK: +44 7768 490620
Blog: http://www.davehodgkinson.com/blog
Photos: http://www.flickr.com/photos/davehodg
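For reference, a sketch of an "octet unpicker" along the lines described above. The helper name `hex_as` is hypothetical. It encodes the text explicitly before calling `unpack "H*"`, so the hex shows the octets that would go on the wire in a given encoding rather than whatever perl happens to hold internally.

```perl
use strict;
use warnings;
use utf8;                       # this source file is itself UTF-8
use Encode qw(encode);

# Hypothetical helper: hex of the octets of $text in the named encoding.
sub hex_as {
    my ($encoding, $text) = @_;
    return unpack "H*", encode($encoding, $text);
}

my $text = "Motörhead";

print "UTF-8:   ", hex_as('UTF-8',      $text), "\n";   # 4d6f74c3b67268656164
print "Latin-1: ", hex_as('iso-8859-1', $text), "\n";   # 4d6f74f67268656164
```

The UTF-8 line matches the hex in the log above, with `c3b6` for the ö. The Latin-1 line has the single octet `f6` in that position, which would be consistent with (though not proof of) a response where tcpdump shows only one byte for that character.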
Re: TT and UTF8?
On Fri, Jan 29, 2010 at 08:12:15PM +, Peter Edwards wrote:
> On 29 January 2010 19:20, Dave Cross wrote:
> > On 01/29/2010 07:07 PM, Dave Cross wrote:
> >> Templates that contain UTF-9
> >
> > These Template are one more encoded!
>
> /me waits for inevitable UTF-11 gag. Oh! Too late.

So when are Spinal Tap writing the RFC?

( compare with http://www.ietf.org/rfc/rfc4042.txt )

Nicholas Clark
Re: TT and UTF8?
On 29 January 2010 19:20, Dave Cross wrote: > On 01/29/2010 07:07 PM, Dave Cross wrote: > >> Templates that contain UTF-9 >> > > These Template are one more encoded! > > /me waits for inevitable UTF-11 gag. Oh! Too late.
Re: TT and UTF8?
On 01/29/2010 07:07 PM, Dave Cross wrote: Templates that contain UTF-9 These Template are one more encoded!
Re: TT and UTF8?
On 01/29/2010 02:30 PM, Dave Hodgkinson wrote:
> Anyone had issues with TT and UTF8?
>
> sheriff and theorbtwo have got me a long way down the line but...
>
> I have a string which is_utf8() and contains weird characters. I
> restart apache and Motörhead displays fine. Next time through it's
> Mot�rhead. Printing unpack(H*) shows the right octets in the string.
>
> A tcpdump shows two bytes being sent on the first hit and only one
> on the second.
>
> Any quick suggestions before I spend tomorrow swearing at this? Am
> I being misled by unpack? Any tools I can use to look to see what
> TT is doing with my apparently perfectly formed UTF8?

There's a Perlanet fork that has a hack for dealing correctly with Templates that contain UTF-9 whether or not they contain a BOM.

http://github.com/kappa/perlanet/blob/master/lib/Perlanet.pm

It does it by overriding the Template::Provider::_decode_unicode subroutine.

Might be useful to you.

Dave...
Re: TT and UTF8?
Joel Bernstein wrote:
> On 29 January 2010 16:59, Matt Lawrence wrote:
>> Joel Bernstein wrote:
>>> On 29 January 2010 15:25, Dave Hodgkinson wrote:
>>
>> IIRC, you can say ":set bomb" in vim to do this.
>
> Someone set up us the &^&^!^ytNO CARRIER

I first encountered BOMs when dealing with XML files that had been saved as unicode from Notepad. It automatically adds a BOM, but the libxml (or was it XML::Parser?) of that time blew up (hur hur) when it encountered it.

It took a while to discover why, because U+FEFF is a zero-width non-breaking space, so anything that understands unicode displays absolutely nothing. At least in vim you can set the encoding to something else and see the bytes, in notepad the very presence of the BOM prevents it from being displayed.

Matt
Re: TT and UTF8?
On 29 January 2010 16:59, Matt Lawrence wrote: > Joel Bernstein wrote: >> >> On 29 January 2010 15:25, Dave Hodgkinson wrote: >> >>> >>> On 29 Jan 2010, at 14:48, Ash Berlin wrote: >>> >>> 2) stick a BOM in the .tt file >>> >>> BOM? >>> >> >> U+FEFF - unicode codepoint used to indicate endianness in encodings >> where word length is not single octet multiples. >> >> http://en.wikipedia.org/wiki/Byte_order_mark >> >> > > IIRC, you can say ":set bomb" in vim to do this. Someone set up us the &^&^!^ytNO CARRIER
Re: TT and UTF8?
Joel Bernstein wrote: On 29 January 2010 15:25, Dave Hodgkinson wrote: On 29 Jan 2010, at 14:48, Ash Berlin wrote: 2) stick a BOM in the .tt file BOM? U+FEFF - unicode codepoint used to indicate endianness in encodings where word length is not single octet multiples. http://en.wikipedia.org/wiki/Byte_order_mark IIRC, you can say ":set bomb" in vim to do this. Matt
Re: TT and UTF8?
Dave Hodgkinson wrote: On 29 Jan 2010, at 14:48, Ash Berlin wrote: 2) stick a BOM in the .tt file BOM? Byte-Order Mark - http://en.wikipedia.org/wiki/Byte_order_mark
Re: TT and UTF8?
On 29 January 2010 15:25, Dave Hodgkinson wrote:
> On 29 Jan 2010, at 14:48, Ash Berlin wrote:
>> 2) stick a BOM in the .tt file
>
> BOM?

U+FEFF - unicode codepoint used to indicate endianness in encodings whose code units are wider than a single octet (UTF-16, UTF-32).

http://en.wikipedia.org/wiki/Byte_order_mark

Even if your template content is UTF-8 (which has no endianness issues) it's still the simplest way to definitively mark your content as Unicode.

/joel
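In UTF-8, U+FEFF is encoded as the three octets EF BB BF. A short sketch of checking for and adding that marker on an octet string, such as the raw contents of a .tt file; the helper names `has_utf8_bom` and `add_utf8_bom` and the sample template text are invented for illustration.

```perl
use strict;
use warnings;

my $UTF8_BOM = "\xEF\xBB\xBF";    # U+FEFF encoded as UTF-8

# True if the octet string starts with the UTF-8 BOM.
sub has_utf8_bom {
    my ($octets) = @_;
    return substr($octets, 0, 3) eq $UTF8_BOM ? 1 : 0;
}

# Prepend the BOM unless it is already there.
sub add_utf8_bom {
    my ($octets) = @_;
    return has_utf8_bom($octets) ? $octets : $UTF8_BOM . $octets;
}

my $template = "<p>Mot\xC3\xB6rhead</p>\n";    # UTF-8 octets, no BOM yet
my $marked   = add_utf8_bom($template);

printf "before: %s\n", has_utf8_bom($template) ? "BOM" : "no BOM";
printf "after:  %s, first octets %s\n",
    (has_utf8_bom($marked) ? "BOM" : "no BOM"),
    unpack("H*", substr($marked, 0, 4));        # efbbbf3c
```

`od -x` prints 16-bit words, so on a little-endian machine the same four octets show up as `bbef 3cbf`.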
Re: TT and UTF8?
On Fri, Jan 29, 2010 at 03:25:14PM +, Dave Hodgkinson wrote: > On 29 Jan 2010, at 14:48, Ash Berlin wrote: > > 2) stick a BOM in the .tt file > BOM? Byte Order Mark http://en.wikipedia.org/wiki/Byte_order_mark -- David Cantrell | Cake Smuggler Extraordinaire Immigration: making Britain great since AD43
Re: TT and UTF8?
On 29 Jan 2010, at 15:25, Dave Hodgkinson wrote:
> On 29 Jan 2010, at 14:48, Ash Berlin wrote:
>> 2) stick a BOM in the .tt file
>
> BOM?

Byte Order Mark. It signals the endianness of the data.

--
David Dorward
http://dorward.me.uk
Re: TT and UTF8?
On 29 Jan 2010, at 14:48, Ash Berlin wrote:
> 2) stick a BOM in the .tt file

BOM?

--
Dave Hodgkinson
MSN: daveh...@hotmail.com
Site: http://www.davehodgkinson.com
UK: +44 7768 490620
Blog: http://www.davehodgkinson.com/blog
Photos: http://www.flickr.com/photos/davehodg
Re: TT and UTF8?
On 29 Jan 2010, at 14:30, Dave Hodgkinson wrote:
> Anyone had issues with TT and UTF8?
>
> sheriff and theorbtwo have got me a long way down the line but...
>
> I have a string which is_utf8() and contains weird characters. I
> restart apache and Motörhead displays fine. Next time through it's
> Mot�rhead. Printing unpack(H*) shows the right octets in the string.
>
> A tcpdump shows two bytes being sent on the first hit and only one
> on the second.
>
> Any quick suggestions before I spend tomorrow swearing at this? Am
> I being misled by unpack? Any tools I can use to look to see what
> TT is doing with my apparently perfectly formed UTF8?

General rules i've used in the past:

1) obv make sure your stash data is utf8, not bytes (which it looks like you have)
2) stick a BOM in the .tt file

That seemed to do it for me. There is an ENCODING config var but i never had much luck with it doing anything.

As for why it changes from request to request: absolutely no clue on that one. One possible thing to try is to disable any caching that TT is doing and see if it's a weird TT/apache clash.

-ash
TT and UTF8?
Anyone had issues with TT and UTF8?

sheriff and theorbtwo have got me a long way down the line but...

I have a string which is_utf8() and contains weird characters. I restart apache and Motörhead displays fine. Next time through it's Mot�rhead. Printing unpack(H*) shows the right octets in the string.

A tcpdump shows two bytes being sent on the first hit and only one on the second.

Any quick suggestions before I spend tomorrow swearing at this? Am I being misled by unpack? Any tools I can use to look to see what TT is doing with my apparently perfectly formed UTF8?

--
Dave Hodgkinson
MSN: daveh...@hotmail.com
Site: http://www.davehodgkinson.com
UK: +44 7768 490620
Blog: http://www.davehodgkinson.com/blog
Photos: http://www.flickr.com/photos/davehodg