Re: TT and UTF8?

2010-02-01 Thread Dave Hodgkinson

On 31 Jan 2010, at 22:57, Tomas Doran wrote:

> 
> On 29 Jan 2010, at 23:10, Dave Hodgkinson wrote:
>> I've turned off caching in TT, inserted "Motörhead múm" into the
>> template as static text and a BOM (od -x 000 bbef 3cbf...) as
>> the first octets. Works first time, fails second.
> 
> 
> Are you using Cache::Memcached from Template::Plugin::Cache? 

No :)

-- 
Dave Hodgkinson  MSN: daveh...@hotmail.com
Site: http://www.davehodgkinson.com  UK: +44 7768 490620
Blog: http://www.davehodgkinson.com/blog
Photos: http://www.flickr.com/photos/davehodg

Re: TT and UTF8?

2010-01-31 Thread Philip Newton
2010/1/31 Tomas Doran :
>
> Anyway - Cache::Memcached borks everything by not correctly storing the 
> utf8 flag...

I wonder whether this is a legacy of its having been developed at
LiveJournal -- which uses UTF-8 for entries etc. but treats it all as
bytes in Perl (I've even seen bits of code that explicitly strip the
UTF-8 flag, I think). So things such as 'ö' would be not just two
bytes but also two (Perl) characters inside LiveJournal's innards.
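
To make the bytes-versus-characters distinction concrete, here is a small
sketch using only the core Encode module (the variable names are purely
illustrative, and this is not LiveJournal's code). It shows the same 'ö'
counting as one character when Perl knows it is text, and as two when the
UTF-8 octets are handled as a plain byte string:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# One character: LATIN SMALL LETTER O WITH DIAERESIS (U+00F6).
my $char = "\x{f6}";

# The same character as UTF-8 octets (0xC3 0xB6). Perl sees a
# two-element byte string and has no idea the bytes belong together.
my $octets = encode_utf8($char);

printf "as text:   %d character(s)\n", length $char;
printf "as octets: %d character(s)\n", length $octets;

# Decoding turns the two octets back into the single character.
my $round_trip = decode_utf8($octets);
print "round trip ok\n" if $round_trip eq $char;
```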

Cheers,
Philip
-- 
Philip Newton 



Re: TT and UTF8?

2010-01-31 Thread Tomas Doran


On 29 Jan 2010, at 23:10, Dave Hodgkinson wrote:

> I've turned off caching in TT, inserted "Motörhead múm" into the
> template as static text and a BOM (od -x 000 bbef 3cbf...) as
> the first octets. Works first time, fails second.



Are you using Cache::Memcached from Template::Plugin::Cache? As I was  
bitten by this the other day.. :)


(I'd consider doing this a code smell - but I have some legacy crud  
at $ork with too much logic in TT, so it's useful..)


Anyway - Cache::Memcached borks everything by not correctly  
storing the utf8 flag...


You can fix this by hacking Template::Plugin::Cache to store a one-item
list (and retrieve that item), which will cause Cache::Memcached to
serialize everything through Storable, and ergo avoid the issue.
Which is gross, but works.
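
A rough sketch of why the one-item list helps, under loud assumptions: it
does not talk to memcached or patch Template::Plugin::Cache at all. It only
compares a stand-in for "store the raw octets and forget the flag" with the
Storable round trip the workaround relies on. The variable names are made up
for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use Storable qw(nfreeze thaw);

# A character string with the utf8 flag switched on.
my $text = "Mot\x{f6}rhead m\x{fa}m";
utf8::upgrade($text);

# Stand-in for a cache that keeps only the octets and forgets the flag:
# what comes back is a byte string, longer than the original text.
my $flattened = encode_utf8($text);
printf "flattened:    %d chars, flag %s\n",
    length $flattened, utf8::is_utf8($flattened) ? 'on' : 'off';

# The workaround: wrap the value in a one-item list so that it gets
# serialized (here directly with Storable) instead of stored raw.
my $frozen     = nfreeze([ $text ]);
my ($restored) = @{ thaw($frozen) };
printf "via Storable: %d chars, flag %s\n",
    length $restored, utf8::is_utf8($restored) ? 'on' : 'off';
```

The serialized copy comes back as the same character string, flag included,
which is the whole point of the hack.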


I'm also told that if you use CHI, or one of the other clients -  
they're more likely to get it right, but I haven't tried this myself..


Cheers
t0m




Re: TT and UTF8?

2010-01-30 Thread Dave Cross
On 29/01/10 23:10, Dave Hodgkinson wrote:
> 
> On 29 Jan 2010, at 19:07, Dave Cross wrote:
>>
>> There's a Perlanet fork that has a hack for dealing correctly with 
>> Templates that contain UTF-9 whether or not they contain a BOM.
>>
>> http://github.com/kappa/perlanet/blob/master/lib/Perlanet.pm
>>
>> It does it by overriding the Template::Provider::_decode_unicode 
>> subroutine.
> 
> So you're saying this is a problem you've had?

No. I'm saying it's a problem that the people running the Russian Planet
Perl had. They have a lot of UTF-8.

It's their fork. I haven't merged it back into the production version yet.

Dave...


Re: TT and UTF8?

2010-01-30 Thread Mark Fowler
On Fri, Jan 29, 2010 at 11:10 PM, Dave Hodgkinson  wrote:

> As far as I'm concerned it's getting mangled *after* I've unleashed it
> to apache.

Mark's observation on i18n data exchange: There's always *another*
UTF-8 bug lurking somewhere. You just haven't found it yet.

I suggest checking that TT is actually sending the right darn thing to
apache.  May I suggest sticking a

use Devel::Peek;

At the top of the file and then doing something like

$tt->process($template, $things, \$output);
Dump $output;
...hand $output to apache...

This will print an ASCII representation of $output to STDERR, with
enough info to see the underlying bytes and utf8 flag settings
that perl is using to represent your output.

Then you can tell what's going on.
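
For anyone who wants to try Dump without a TT setup, a self-contained sketch
follows. It assumes nothing about the real application: $output is a
hand-made stand-in for whatever $tt->process() produced, and $encoded is the
same text after an explicit UTF-8 encode.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);
use Encode qw(encode_utf8);

# Stand-in for what $tt->process() left in $output (TT is not used here).
my $output = "Mot\x{f6}rhead";
utf8::upgrade($output);          # character string, utf8 flag on

# The same text, already encoded to UTF-8 octets (flag off).
my $encoded = encode_utf8($output);

# Both dumps go to STDERR. Compare the FLAGS line (is UTF8 listed?)
# and the PV line (which bytes are really stored?).
Dump($output);
Dump($encoded);
```

The two strings hold the same octets internally; only the flag differs,
which is exactly the kind of thing a plain hex dump cannot show.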

Mark.


Re: TT and UTF8?

2010-01-29 Thread Dave Hodgkinson

On 29 Jan 2010, at 19:07, Dave Cross wrote:
> 
> There's a Perlanet fork that has a hack for dealing correctly with Templates 
> that contain UTF-9 whether or not they contain a BOM.
> 
> http://github.com/kappa/perlanet/blob/master/lib/Perlanet.pm
> 
> It does it by overriding the Template::Provider::_decode_unicode subroutine.

So you're saying this is a problem you've had?

I've turned off caching in TT, inserted "Motörhead múm" into the
template as static text and a BOM (od -x 000 bbef 3cbf...) as
the first octets. Works first time, fails second.

I parse this out of the final string and also print it using an octet
unpicker:

[Fri Jan 29 22:55:48 2010] -e: Motörhead
[Fri Jan 29 22:55:48 2010] -e: 4d6f74c3b67268656164 

Looks about right. When it fails:

[Fri Jan 29 22:55:52 2010] -e: Motörhead 
[Fri Jan 29 22:55:52 2010] -e: 4d6f74c3b67268656164

As far as I'm concerned it's getting mangled *after* I've unleashed it
to apache. It's a low traffic app. I'm sorely tempted to load up a
startup.pl and set MaxRequestsPerChild to 1.
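
One way to take the guesswork out of hex dumps like the ones above is to
encode explicitly before unpacking, so the dump shows the octets that would
actually go down the wire. The sketch below is only an illustration with
made-up variable names, not a diagnosis of this app: it shows that 'ö' is two
octets as UTF-8 and a single octet if something falls back to Latin-1.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $name = "Mot\x{f6}rhead";     # character string

# Encode explicitly so the dump shows exactly the octets being sent.
my $as_utf8   = encode('UTF-8',      $name);
my $as_latin1 = encode('ISO-8859-1', $name);

print unpack('H*', $as_utf8),   "\n";   # 4d6f74c3b67268656164
print unpack('H*', $as_latin1), "\n";   # 4d6f74f67268656164
```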


Re: TT and UTF8?

2010-01-29 Thread Nicholas Clark
On Fri, Jan 29, 2010 at 08:12:15PM +, Peter Edwards wrote:
> On 29 January 2010 19:20, Dave Cross  wrote:
> 
> > On 01/29/2010 07:07 PM, Dave Cross wrote:
> >
> >> Templates that contain UTF-9
> >>
> >
> > These Template are one more encoded!
> >
> >
> /me waits for inevitable UTF-11 gag. Oh! Too late.

So when are Spinal Tap writing the RFC?
( compare with http://www.ietf.org/rfc/rfc4042.txt )

Nicholas Clark


Re: TT and UTF8?

2010-01-29 Thread Peter Edwards
On 29 January 2010 19:20, Dave Cross  wrote:

> On 01/29/2010 07:07 PM, Dave Cross wrote:
>
>> Templates that contain UTF-9
>>
>
> These Template are one more encoded!
>
>
/me waits for inevitable UTF-11 gag. Oh! Too late.


Re: TT and UTF8?

2010-01-29 Thread Dave Cross

On 01/29/2010 07:07 PM, Dave Cross wrote:
> Templates that contain UTF-9

These Template are one more encoded!



Re: TT and UTF8?

2010-01-29 Thread Dave Cross

On 01/29/2010 02:30 PM, Dave Hodgkinson wrote:
> Anyone had issues with TT and UTF8?
>
> sheriff and theorbtwo have got me a long way down the line but...
>
> I have a string which is_utf8() and contains weird characters. I
> restart apache and Motörhead displays fine. Next time through it's
> Mot�rhead. Printing unpack(H*) shows the right octets in the string.
>
> A tcpdump shows two bytes being sent on the first hit and only one
> on the second.
>
> Any quick suggestions before I spend tomorrow swearing at this? Am
> I being misled by unpack? Any tools I can use to look to see what
> TT is doing with my apparently perfectly formed UTF8?


There's a Perlanet fork that has a hack for dealing correctly with 
Templates that contain UTF-9 whether or not they contain a BOM.


http://github.com/kappa/perlanet/blob/master/lib/Perlanet.pm

It does it by overriding the Template::Provider::_decode_unicode subroutine.

Might be useful to you.

Dave...


Re: TT and UTF8?

2010-01-29 Thread Matt Lawrence

Joel Bernstein wrote:
> On 29 January 2010 16:59, Matt Lawrence  wrote:
>> Joel Bernstein wrote:
>>> On 29 January 2010 15:25, Dave Hodgkinson  wrote:
>>
>> IIRC, you can say ":set bomb" in vim to do this.
>
> Someone set up us the &^&^!^ytNO CARRIER
  
I first encountered BOMs when dealing with XML files that had been saved 
as unicode from Notepad. It automatically adds a BOM, but the libxml (or 
was it XML::Parser?) of that time blew up (hur hur) when it encountered 
it. It took a while to discover why, because U+FEFF is a zero-width 
non-breaking space, so anything that understands unicode displays 
absolutely nothing. At least in vim you can set the encoding to 
something else and see the bytes, in notepad the very presence of the 
BOM prevents it from being displayed.


Matt



Re: TT and UTF8?

2010-01-29 Thread Joel Bernstein
On 29 January 2010 16:59, Matt Lawrence  wrote:
> Joel Bernstein wrote:
>>
>> On 29 January 2010 15:25, Dave Hodgkinson  wrote:
>>
>>>
>>> On 29 Jan 2010, at 14:48, Ash Berlin wrote:
>>>
>>>> 2) stick a BOM in the .tt file
>>>>
>>>
>>> BOM?
>>>
>>
>> U+FEFF - unicode codepoint used to indicate endianness in encodings
>> where word length is not single octet multiples.
>>
>> http://en.wikipedia.org/wiki/Byte_order_mark
>>
>>
>
> IIRC, you can say ":set bomb" in vim to do this.

Someone set up us the &^&^!^ytNO CARRIER


Re: TT and UTF8?

2010-01-29 Thread Matt Lawrence

Joel Bernstein wrote:
> On 29 January 2010 15:25, Dave Hodgkinson  wrote:
>> On 29 Jan 2010, at 14:48, Ash Berlin wrote:
>>> 2) stick a BOM in the .tt file
>>
>> BOM?
>
> U+FEFF - unicode codepoint used to indicate endianness in encodings
> where word length is not single octet multiples.
>
> http://en.wikipedia.org/wiki/Byte_order_mark

IIRC, you can say ":set bomb" in vim to do this.

Matt


Re: TT and UTF8?

2010-01-29 Thread David Precious

Dave Hodgkinson wrote:
> On 29 Jan 2010, at 14:48, Ash Berlin wrote:
>> 2) stick a BOM in the .tt file
>
> BOM?

Byte-Order Mark - http://en.wikipedia.org/wiki/Byte_order_mark



Re: TT and UTF8?

2010-01-29 Thread Joel Bernstein
On 29 January 2010 15:25, Dave Hodgkinson  wrote:
>
> On 29 Jan 2010, at 14:48, Ash Berlin wrote:
>
>> 2) stick a BOM in the .tt file
>>
>
> BOM?

U+FEFF - unicode codepoint used to indicate endianness in encodings
where word length is not single octet multiples.

http://en.wikipedia.org/wiki/Byte_order_mark

Even if your template content is UTF-8 (which has no endianness
issues) it's still the simplest way to definitively mark your content
as Unicode.
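
A quick way to see both halves of that (byte order for the 16-bit forms, a
fixed marker for UTF-8) is to encode U+FEFF yourself. This sketch uses only
the core Encode module; the %bom_as hash is just an illustrative name.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bom = "\x{FEFF}";

# Encode the same code point three ways and dump the resulting octets.
my %bom_as;
for my $encoding (qw(UTF-16BE UTF-16LE UTF-8)) {
    $bom_as{$encoding} = unpack 'H*', encode($encoding, $bom);
    printf "%-8s %s\n", $encoding, $bom_as{$encoding};
}
```

The two UTF-16 results are byte-swapped versions of each other, which is how
a reader can tell the endianness; the UTF-8 result is always the same three
octets, so there it only acts as a signature.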

/joel


Re: TT and UTF8?

2010-01-29 Thread David Cantrell
On Fri, Jan 29, 2010 at 03:25:14PM +, Dave Hodgkinson wrote:
> On 29 Jan 2010, at 14:48, Ash Berlin wrote:
> > 2) stick a BOM in the .tt file
> BOM?

Byte Order Mark

http://en.wikipedia.org/wiki/Byte_order_mark

-- 
David Cantrell | Cake Smuggler Extraordinaire

Immigration: making Britain great since AD43


Re: TT and UTF8?

2010-01-29 Thread David Dorward

On 29 Jan 2010, at 15:25, Dave Hodgkinson wrote:

> 
> On 29 Jan 2010, at 14:48, Ash Berlin wrote:
> 
>> 2) stick a BOM in the .tt file
>> 
> 
> BOM?


Byte Order Mark. It signals the endianness of the data.

-- 
David Dorward
http://dorward.me.uk



Re: TT and UTF8?

2010-01-29 Thread Dave Hodgkinson

On 29 Jan 2010, at 14:48, Ash Berlin wrote:

> 2) stick a BOM in the .tt file
> 

BOM?


Re: TT and UTF8?

2010-01-29 Thread Ash Berlin

On 29 Jan 2010, at 14:30, Dave Hodgkinson wrote:

> 
> Anyone had issues with TT and UTF8?
> 
> sheriff and theorbtwo have got me a long way down the line but...
> 
> I have a string which is_utf8() and contains weird characters. I
> restart apache and Motörhead displays fine. Next time through it's
> Mot�rhead. Printing unpack(H*) shows the right octets in the string.
>
> A tcpdump shows two bytes being sent on the first hit and only one
> on the second.
>
> Any quick suggestions before I spend tomorrow swearing at this? Am
> I being misled by unpack? Any tools I can use to look to see what
> TT is doing with my apparently perfectly formed UTF8?

General rules I've used in the past:

1) obv make sure your stash data is utf8, not bytes (which it looks like you
   have)
2) stick a BOM in the .tt file

That seemed to do it for me. There is an ENCODING config var but I never had
much luck with it doing anything. As for why it changes from request to
request: absolutely no clue on that one. One possible thing to try is to
disable any caching that TT is doing and see if it's a weird TT/apache clash.
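
If you'd rather do rule 2 from a script than from an editor, something along
these lines should work. It is a sketch only: with_bom() is a made-up helper,
the template body is a placeholder, and a temp file stands in for the real
.tt file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $UTF8_BOM = "\xEF\xBB\xBF";

# Return the octets with a UTF-8 BOM in front, unless one is already there.
sub with_bom {
    my ($octets) = @_;
    return index($octets, $UTF8_BOM) == 0 ? $octets : $UTF8_BOM . $octets;
}

# Hypothetical template body, already encoded as UTF-8 octets.
my $template = "<p>Mot\xC3\xB6rhead</p>\n";

my ($fh, $path) = tempfile(SUFFIX => '.tt', UNLINK => 1);
binmode $fh;                       # raw octets, no layers
print {$fh} with_bom($template);
close $fh or die "close $path: $!";

# Read the first three octets back to confirm the marker is in place.
open my $in, '<:raw', $path or die "open $path: $!";
read $in, my $head, 3;
close $in;
print unpack('H*', $head), "\n";   # efbbbf
```

Calling with_bom() twice on the same data is harmless, so it is safe to run
over files that may already carry the marker.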

-ash


TT and UTF8?

2010-01-29 Thread Dave Hodgkinson

Anyone had issues with TT and UTF8?

sheriff and theorbtwo have got me a long way down the line but...

I have a string which is_utf8() and contains weird characters. I
restart apache and Motörhead displays fine. Next time through it's
Mot�rhead. Printing unpack(H*) shows the right octets in the string.

A tcpdump shows two bytes being sent on the first hit and only one
on the second.

Any quick suggestions before I spend tomorrow swearing at this? Am
I being misled by unpack? Any tools I can use to look to see what
TT is doing with my apparently perfectly formed UTF8?

