Nick Kew wrote:
On Wed, 08 Nov 2006 00:48:39 -0500
mickg <[EMAIL PROTECTED]> wrote:
Just to put my money where my mouth is, I have implemented a (stupid)
prototype that does the following: if the detected charset is not one
that libxml2 handles natively, a recompiled version of mod_proxy_html
now uses iconv (ultimately via the xmlFindCharEncodingHandler function)
to convert from the source encoding to UTF-8.
Interesting. You've gone one up on my aliasing proposal, for
what looks like rather less work than I thought that would take.
I might snarf the basic idea for Version 3.
Do you want the full working code once I clean up the memory problem?
It is, after all, GPL, so it would be in good spirit for me to release
the modified source. :)
Although, to be truly honest, what the thing is doing IS somewhat backwards.
The dataflow would be as follows (I am more familiar with Python code, as the
next snippet will show):
data comes in:
    if ctxt.encoder is None:
        obtain charset
        if need iconv to convert charset:
            ctxt.encoder = charset
            return enc = UTF-8
        else:
            return enc
prior to processing buf:
    if ctxt.encoder is not None:
        convert(buf)
That is, convert only if the encoder is set (non-null).
This guarantees that the data is either in an encoding known to libxml2, or
was UTF-8 to begin with, or was converted to UTF-8, or the conversion failed
miserably (and the miserable failure was logged).
If no encoding info is specified, it assumes windows-1251 (yes,
stupid, but still).
But not stupid if we make it a configurable default!
Yeah, preferably via a directive such as HTMLSourceDefaultEnc windows-1251
or some such.
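A minimal sketch of what such a configurable default could look like, again in Python rather than module code. The function name and the DEFAULT_SOURCE_ENC constant are hypothetical; the constant stands in for the value the proposed HTMLSourceDefaultEnc directive would supply. The sketch only looks at the Content-Type header; a real filter might also want to check a META tag.

```python
from email.message import Message

# Stand-in for the value of the proposed HTMLSourceDefaultEnc directive.
DEFAULT_SOURCE_ENC = "windows-1251"


def source_charset(content_type, default=DEFAULT_SOURCE_ENC):
    """Return the charset named in a Content-Type value, or the default."""
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # lowercased, or None if absent
        if charset:
            return charset
    return default
```

With this shape, the windows-1251 assumption stops being hard-coded: a site serving mostly KOI8-R pages would simply configure a different default.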
It does work on my _own_ website, where it quite happily converts
win-1251 to utf-8. Once I fix the memory leak (any help appreciated),
I'll be happy.
See http://www.apachetutor.org/dev/pools for an easy way to
deal with the memory.
And a great many thanks to Nick Kew for getting me off my lazy ... to
start coding (which, honestly, I am better at than administering
systems).
:-)
BTW, I still have no clue why I cannot do this with mod_charset_lite.
Neither do I. But a closer look at mod_charset_lite has been on
my TODO list for so long it's probably on a permanent back-burner.
Did you also look at the full mod_charset? AIUI it was written by
Russian developers, so cyrillic was presumably important to them.
The thing about mod_charset is that it assumes no iconv and does all
translation internally, with translation settings and weird maps where
needed. This seems a bit insane to me, unless it is really needed.
I believe the reason was that we had:
win1251 read as koi8, transcoded into LATIN1
Now, we need to make sense of *that*.
Also, they do not cleanly support utf8 translation (they do not support
translation back from utf8). iconv does.
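To illustrate what translation in both directions means, here is a small Python sketch. Python's built-in codecs stand in for iconv, and the two helper names are hypothetical.

```python
def to_utf8(data, source_enc):
    """Transcode bytes from source_enc to UTF-8."""
    return data.decode(source_enc).encode("utf-8")


def from_utf8(data, target_enc):
    """Transcode UTF-8 bytes back to target_enc."""
    return data.decode("utf-8").encode(target_enc)


original = "Привет, мир".encode("windows-1251")
as_utf8 = to_utf8(original, "windows-1251")
restored = from_utf8(as_utf8, "windows-1251")
```

For text that the target charset can represent, restored equals original; characters outside the target charset would raise UnicodeEncodeError in this sketch.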
Honestly, remaking mod_proxy_html into mod_proxy_charset_convert would
be trivial now, IMO.
And maybe that's the better idea. Although that does duplicate
mod_charset_lite, at least I know it'll work.
And it would use libxml2 where possible, not iconv.
mickg
---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.