Just to put my money where my mouth is, I have implemented a (stupid) prototype
that does this: if the detected charset is not one native to libxml2, a
recompiled version of mod_proxy_html now uses iconv (ultimately via the
xmlFindCharEncodingHandler function) to convert from the source encoding to
UTF-8.

If no encoding info is specified, it assumes windows-1251 (yes, stupid, but
still).

The main work is done by adding a field
        const char *enc_from
to ctxt; it names the source encoding in iconv-compatible terms.

sniff_encoding is modified to return 0 when it encounters a non-native encoding,
and to set ctxt->enc_from (ctxt is added to it as a parameter).
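
For readers who want to see the shape of that decision in isolation, here is a
minimal standalone sketch. It is not the real sniff_encoding: sketch_ctxt,
sketch_sniff and the native_charsets table are hypothetical names invented for
this illustration, the list of "native" charsets is an assumption, and the
return value is simplified to 1/0 rather than an xmlCharEncoding.

```c
#include <stddef.h>
#include <strings.h>

/* Hypothetical stand-in for the real saxctxt; only the new field is shown. */
typedef struct {
    const char *enc_from;   /* iconv-style name of the source encoding, or NULL */
} sketch_ctxt;

/* Assumed list of charsets handled without iconv; illustrative only. */
static const char *const native_charsets[] = {
    "UTF-8", "UTF-16", "ISO-8859-1", "US-ASCII", NULL
};

/* Returns 1 if `charset` is in the native list (enc_from stays NULL), or 0
 * after recording the name to hand to iconv. A missing charset falls back
 * to windows-1251, as the prototype does. */
static int sketch_sniff(const char *charset, sketch_ctxt *ctxt)
{
    size_t i;

    ctxt->enc_from = NULL;
    if (charset == NULL)
        charset = "WINDOWS-1251";
    for (i = 0; native_charsets[i] != NULL; i++) {
        if (strcasecmp(charset, native_charsets[i]) == 0)
            return 1;
    }
    ctxt->enc_from = charset;
    return 0;
}
```

A caller would test the return value and, on 0, feed ctxt->enc_from to the
converter before handing the data to the parser as UTF-8.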

The function:
size_t ConvertCtxtBuffer(const char *buf, char **newbuf, size_t bytes,
                         saxctxt *ctxt, ap_filter_t *f) {
        size_t len = 0;
        xmlCharEncodingHandlerPtr handler;

        /* Default: hand the input back unchanged. */
        *newbuf = (char *) buf;
        if (!ctxt->enc_from)
                return bytes;

        handler = xmlFindCharEncodingHandler(ctxt->enc_from);
        if (!handler) {
                ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, f->r,
                        "ConvertInput: no encoding handler found for '%s'",
                        ctxt->enc_from);
                return bytes;
        }
        /* This lookup was only a probe; ConvertInput looks the handler up
           again, so release this one (iconv-based handlers are allocated). */
        xmlCharEncCloseFunc(handler);

        ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,
                "ConvertInput: bytes: %" APR_SIZE_T_FMT, bytes);
        len = ConvertInput(buf, newbuf, (int) bytes, f->r, ctxt->enc_from);
        ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,
                "ConvertInput: len: %" APR_SIZE_T_FMT, len);
        /* ConvertInput reports failure as (size_t)-1; size_t is unsigned,
           so a "len < 0" test can never be true. */
        if (len == (size_t) -1) {
                ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, f->r,
                        "ConvertInput: conversion failed from '%s'",
                        ctxt->enc_from);
                *newbuf = (char *) buf;
                return bytes;
        }
        ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,
                "ConvertInput: converted from '%s'", ctxt->enc_from);
        return len;
}

calls the actual conversion.

The function
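size_t
ConvertInput(const char *in, char **newbuf, int size, request_rec *r,
             const char *encoding)
{
  xmlChar *out;
  xmlChar *shrunk;
  int ret;
  int out_size;
  int written = -1;   /* bytes produced; -1 signals failure */
  int temp = 0;       /* bytes of input consumed */
  xmlCharEncodingHandlerPtr handler;

  if (in == 0)
    return 0;
  ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z1");

  handler = xmlFindCharEncodingHandler(encoding);
  /* Check for NULL before touching any of the handler's fields. */
  if (!handler) {
    ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, r,
                  "ConvertInput: no encoding handler found for '%s'",
                  encoding ? encoding : "");
    return 0;
  }
  ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
                "z2 input=%d output=%d iconv_in=%d", handler->input != NULL,
                handler->output != NULL, handler->iconv_in != NULL);

  /* UTF-8 needs at most 4 bytes per character, which is enough for a
     single-byte source encoding such as windows-1251; one extra byte is
     kept for the terminating NUL. */
  out_size = size * 4 + 1;
  out = (xmlChar *) xmlMalloc((size_t) out_size);
  ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z4 %d %d %s",
                size, out_size, encoding);
  if (out != 0) {
    if (handler->input) {
      /* libxml2 converters take int lengths; on return, written holds the
         bytes produced and temp the bytes consumed. */
      ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z5");
      temp = size;
      written = out_size - 1;
      ret = handler->input(out, &written, (const unsigned char *) in, &temp);
    } else {
      /* iconv wants char ** and size_t * arguments and counts down the
         space left, so use locals of the right types. */
      char *src = (char *) in;
      char *dst = (char *) out;
      size_t in_left = (size_t) size;
      size_t out_left = (size_t) out_size - 1;
      size_t rc;
      ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z5a");
      rc = iconv(handler->iconv_in, &src, &in_left, &dst, &out_left);
      ret = (rc == (size_t) -1) ? -1 : 0;
      temp = size - (int) in_left;
      written = out_size - 1 - (int) out_left;
    }
    ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z6 %d %d %d",
                  ret, temp, written);
    if (ret < 0) {
      ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, r,
                    "ConvertInput: conversion wasn't successful. "
                    "Converted %d octets.", temp);
      xmlFree(out);
      out = 0;
      written = -1;
    } else {
      /* Shrink to fit; keep the original block if realloc fails. */
      shrunk = (xmlChar *) xmlRealloc(out, (size_t) written + 1);
      if (shrunk != 0)
        out = shrunk;
      out[written] = 0;  /* null terminating out */
      ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "len(OUT): %d", written);
    }
  } else {
    ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, r, "No memory!");
  }
  /* Release the handler; iconv-based handlers are allocated per lookup. */
  xmlCharEncCloseFunc(handler);
  /* NOTE: nothing frees out after this point (the suspected leak). Copying
     it into r->pool with apr_pmemdup() and calling xmlFree() here would be
     one way to tie it to the request lifetime. */
  *newbuf = (char *) out;
  return (size_t) written;  /* (size_t)-1 on failure */
}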

does the actual conversion. It currently outputs a bit too much log info, and I
suspect a memory leak from xmlMalloc. I honestly do not know enough about Apache
to figure out when to free it (especially at 1AM).
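
For comparison, here is a minimal standalone sketch of the same kind of
conversion done by calling iconv directly, outside Apache and libxml2. The
helper name to_utf8 is hypothetical and not part of mod_proxy_html; the sketch
assumes a glibc-style iconv() prototype (char ** for the input pointer) and a
stateless source encoding such as windows-1251. It shows the two details that
are easy to get wrong: the byte counters must be size_t, and a full output
buffer (E2BIG) is a reason to grow the buffer and continue rather than to fail.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical helper: convert `len` bytes of `in` from `from_enc` to UTF-8.
 * Returns a malloc'd, NUL-terminated buffer (caller frees) and stores its
 * length in *out_len; returns NULL on any failure. */
static char *to_utf8(const char *in, size_t len, const char *from_enc,
                     size_t *out_len)
{
    iconv_t cd = iconv_open("UTF-8", from_enc);
    size_t cap = len * 2 + 16;      /* initial guess; grown on E2BIG */
    size_t used = 0;
    size_t in_left = len;
    char *src = (char *) in;        /* iconv does not modify the input bytes */
    char *out;

    if (cd == (iconv_t) -1)
        return NULL;                /* unknown encoding name */

    out = malloc(cap + 1);          /* +1 for the terminating NUL */
    while (out != NULL) {
        char *dst = out + used;
        size_t out_left = cap - used;
        size_t rc = iconv(cd, &src, &in_left, &dst, &out_left);
        char *bigger;

        used = cap - out_left;
        if (rc != (size_t) -1)
            break;                  /* all input consumed */
        if (errno != E2BIG) {       /* invalid or truncated input */
            free(out);
            out = NULL;
            break;
        }
        cap *= 2;                   /* output buffer full: grow and continue */
        bigger = realloc(out, cap + 1);
        if (bigger == NULL)
            free(out);
        out = bigger;
    }
    iconv_close(cd);
    if (out != NULL) {
        out[used] = '\0';
        *out_len = used;
    }
    return out;
}
```

Called as to_utf8(buf, bytes, "WINDOWS-1251", &n), this returns the converted
text with its length in n. Inside a filter the result would still need an
owner, for example a copy in the request pool, which is the open question
above.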

Oh, also, the proxy_html_filter function is modified at 4 points: each one calls
bytes=ConvertCtxtBuffer(buf,&buf,bytes,ctxt,f);
so that the conversion actually takes place, and when sniff_... returns 0, the
return value is replaced with XML_CHAR_ENCODING_UTF8.



******************************************************************************
*              !!!THIS CODE IS *NOT* PRODUCTION QUALITY!!!                   *
*IT HAS AT LEAST ONE MEMORY LEAK, AND LOGS WAY TOO MUCH TO THE ERROR LOG.    *
*Also, I am not sure of the security implications of passing the decoding off*
*to iconv (Are there any buffer overflows in it? Could it be exploited by a  *
*specially crafted file in a particular encoding?)                           *
******************************************************************************

Also, I am not sure what this code will do to get&put method data.

It does work on my _own_ website, where it quite happily converts win-1251 to
utf-8. Once I fix the memory leak (any help appreciated), I'll be happy.


And a great many thanks to Nick Kew for getting me off my lazy ... to start
coding  (which, honestly, I am better at than administering systems).

Hopefully this helps someone.


BTW, I still have no clue why I cannot do this with mod_charset_lite.



mickg wrote:
Nick Kew wrote:
On Tue, 07 Nov 2006 17:49:25 -0500
mickg <[EMAIL PROTECTED]> wrote:


2 questions:
I think I'd have to play with that hands-on to figure it out
with your attempted configuration.
Was that an offer :) If yes, please say so, and shell account will be
provided. (As the system is a VM, I will just clone it, and give
access to that, so, if you mess it up, no problem).

Well it could be, if you have the budget for my time.
That's your most expensive option.

Understood :)

It might be worth trying
mod_line_edit instead of mod_proxy_html.  You sacrifice the
markup support, but in your case the markup isn't properly
supported anyway, and you probably benefit from the fact that
it is also unaware of charsets.

Hmm. Did not know about that module. Any idea where I can get
the .so ?

Same place you get the mod_proxy_html.so.  Except I guess you
got that from a third-party package.  I supply binaries and
basic support to registered users.

Or an ubuntu package?

Or how to compile the source, given a development environment?

Read the apache docs on apxs.  You'll probably need an apache-dev
package on ubuntu.  It's simpler than mod_proxy_html, because it
doesn't rely on additional libraries.

Understood, will do. Thank you!

I should add that today's correspondence has prompted me to blog
about mod_proxy_html 3.0, which will enable you to fix that
charset problem by aliasing an unsupported charset to a similar
supported one (windows cyrillic is probably similar enough to
ISO cyrillic - aka ISO-8859-5 - for that to work).  I'm inviting
blog comments from anyone with great ideas for the next major
release of mod_proxy_html.

Actually, I think the characters are different in the upper register.

What about letting mod_proxy do its own transcoding, via iconv or
some such?
Maybe even a filter architecture of its own?
As in, given a match, apply this filter to it?
Although that may be overkill for a simple matcher.



mickg


---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: [EMAIL PROTECTED]
  "   from the digest: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

 (Solved!)

