[HTTPClient 3.0.1] Bug: Multipart posts with files named using UTF-8 characters

2006-10-19 Thread Tumidajewicz, Przemyslaw

Hello everyone,

First post here, hope I'm doing it right ;)

I've been having problems with sending multipart posts containing files 
named using UTF-8 characters - all non-ASCII characters are turned into 
question marks. I've tried to specify the charset when creating the 
FilePart like this


FilePart fp = new FilePart(name, file, null, "UTF-8");

as well as setting the charset later on like this

fp.setCharSet("UTF-8");

with no result. So I took a deeper look at the HttpClient code (thank 
god for open source!) and found that the loss of special characters 
happens in the FilePart.sendDispositionHeader method, at line


out.write(EncodingUtil.getAsciiBytes(filename));

where the filename is forced to fit into the US-ASCII charset.
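The question marks can be reproduced with the JDK alone. This is a minimal sketch, assuming EncodingUtil.getAsciiBytes behaves like String.getBytes with the US-ASCII charset; the class and method names are made up for illustration and are not part of HttpClient.

```java
import java.nio.charset.StandardCharsets;

public class AsciiFilenameDemo {

    // Encodes the name as US-ASCII and decodes it again, which is roughly
    // what a receiver sees after an ASCII-only write of the filename.
    static String throughAscii(String filename) {
        byte[] bytes = filename.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file itself pure ASCII.
        String name = "za\u017c\u00f3\u0142\u0107.txt";
        // Each character that US-ASCII cannot represent becomes '?'.
        System.out.println(throughAscii(name)); // prints "za????.txt"
    }
}
```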

My workaround for this problem is to substitute the above line with a 
charset-aware version:


out.write(EncodingUtil.getBytes(filename, getCharSet()));

but I'm not sure if it's the correct way to do it.
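To show what the charset-aware write puts on the wire, here is a rough sketch that builds a disposition header with the fixed parts in US-ASCII and only the filename in a caller-supplied charset. It is not HttpClient's code: DispositionHeaderSketch and its methods are hypothetical, and the header layout is the usual multipart/form-data shape.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DispositionHeaderSketch {

    // Appends raw bytes to the buffer (this overload throws no checked exception).
    private static void append(ByteArrayOutputStream out, byte[] bytes) {
        out.write(bytes, 0, bytes.length);
    }

    // Fixed header text is written as US-ASCII; only the filename uses 'charset'.
    static byte[] dispositionHeader(String fieldName, String filename, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        append(out, "Content-Disposition: form-data; name=\"".getBytes(StandardCharsets.US_ASCII));
        append(out, fieldName.getBytes(StandardCharsets.US_ASCII));
        append(out, "\"; filename=\"".getBytes(StandardCharsets.US_ASCII));
        append(out, filename.getBytes(charset));
        append(out, "\"".getBytes(StandardCharsets.US_ASCII));
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] header = dispositionHeader("file", "\u017c.txt", StandardCharsets.UTF_8);
        // The single non-ASCII character occupies two bytes in UTF-8.
        System.out.println(header.length + " bytes written");
    }
}
```

The receiving side has to guess or be told that the filename bytes are UTF-8, since the header itself carries no charset label.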

What I'm quite sure of at this point is that it works for UTF-8 and 
results are consistent with what I get out of IE6 when posting the same 
file using a form like this:


<form action="http://localhost:1235" method="POST"
enctype="multipart/form-data" accept-charset="UTF-8">

<input type="file" name="file"></input>
<input type="submit"></input>
</form>

It's also parsed correctly by FileUpload 1.1.

I've had a look at the HTTPClient 3.1-alpha1 source and the problematic 
line in FilePart looks the same. That means either my fix is heresy and 
there is a better way of doing this, or this bug has not been reported 
before (I failed to find anything on this in the archive).


Please let me know if this is the right way of fixing this problem and, 
if so, whether this fix will make it into HTTPClient 3.1.


Thanks and best regards!
--Przemek

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [HTTPClient 3.0.1] Bug: Multipart posts with files named using UTF-8 characters

2006-10-19 Thread Ortwin Glück

Guys,

Look at RFC 2047 which updates RFC 1521. This method is quite popular in 
E-Mail traffic. Maybe real-world HTTP servers and clients support it?
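For reference, an RFC 2047 encoded-word in its Base64 ("B") form can be produced with the JDK alone. The sketch below is illustrative only: EncodedWordSketch is a hypothetical name, the input is assumed short enough to fit in a single encoded-word, and RFC 2047 itself restricts encoded-words in Content-Disposition parameters, so whether a given server decodes this in a multipart filename is exactly the open question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EncodedWordSketch {

    // Wraps the UTF-8 bytes of 'text' as =?charset?B?base64?=
    static String encodeB(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return "=?UTF-8?B?" + Base64.getEncoder().encodeToString(utf8) + "?=";
    }

    public static void main(String[] args) {
        System.out.println(encodeB("\u017c.txt")); // prints "=?UTF-8?B?xbwudHh0?="
    }
}
```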


Odi


Oleg Kalnichevski wrote:

On Thu, 2006-10-19 at 14:29 +0200, Tumidajewicz, Przemyslaw wrote:

Hello everyone,

First post here, hope I'm doing it right ;)

I've been having problems with sending multipart posts containing files 
named using UTF-8 characters - all non-ASCII characters are turned into 
question marks. I've tried to specify the charset when creating the 
FilePart like this


FilePart fp = new FilePart(name, file, null, "UTF-8");

as well as setting the charset later on like this

fp.setCharSet("UTF-8");

with no result. So I took a deeper look at the HttpClient code (thank 
god for open source!) and found that the loss of special characters 
happens in the FilePart.sendDispositionHeader method, at line


out.write(EncodingUtil.getAsciiBytes(filename));

where the filename is forced to fit into the US-ASCII charset.



Przemyslaw,

This behavior is in line with the requirements of the MIME specification
as outlined in RFC 1521 and RFC 1522. The use of non-ASCII characters in
MIME headers is not permitted. One is supposed to escape non-ASCII
characters using BASE64 or Quoted-Printable encoding. 
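The Quoted-Printable style of escaping mentioned here corresponds to the "Q" form of a MIME encoded-word. A minimal sketch follows, assuming UTF-8 as the charset and escaping every byte that is not an ASCII letter or digit, which is more conservative than the specification requires; QEncodingSketch is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

public class QEncodingSketch {

    // Produces =?UTF-8?Q?...?= with letters and digits left as-is,
    // space written as '_' and every other byte as =XX (uppercase hex).
    static String encodeQ(String text) {
        StringBuilder sb = new StringBuilder("=?UTF-8?Q?");
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            boolean alnum = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9');
            if (alnum) {
                sb.append((char) v);
            } else if (v == ' ') {
                sb.append('_');
            } else {
                sb.append('=').append(String.format("%02X", v));
            }
        }
        return sb.append("?=").toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeQ("\u017c.txt")); // prints "=?UTF-8?Q?=C5=BC=2Etxt?="
    }
}
```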


See this feature request for details

https://issues.apache.org/jira/browse/HTTPCLIENT-293  


Oleg




--
[web]  http://www.odi.ch/
[blog] http://www.odi.ch/weblog/
[pgp]  key 0x81CF3416
   finger print F2B1 B21F F056 D53E 5D79 A5AF 02BE 70F5 81CF 3416




Re: [HTTPClient 3.0.1] Bug: Multipart posts with files named using UTF-8 characters

2006-10-19 Thread Roland Weber

Hi Odi,

> Look at RFC 2047 which updates RFC 1521. This method is quite popular in
> E-Mail traffic. Maybe real-world HTTP servers and clients support it?

Maybe, but MIME encoding is not really our line of work. If somebody
is willing to come up with a patch, I would suggest implementing
something similar to the non-ASCII HTTP headers we already have,
to be used at the application developer's risk.

http://jakarta.apache.org/commons/httpclient/apidocs/org/apache/commons/httpclient/params/HttpMethodParams.html#HTTP_ELEMENT_CHARSET

cheers,
  Roland




Re: [HTTPClient 3.0.1] Bug: Multipart posts with files named using UTF-8 characters

2006-10-19 Thread Michael Becke

I agree this is the way to go.  We can add a mechanism to change the
default encoding, but leave things as they are by default.
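One possible shape for such an opt-in mechanism is sketched below. It is purely illustrative: FilenameEncoder and setFilenameCharset are hypothetical names, not an existing or planned HttpClient API. The default stays US-ASCII, and an application that accepts the risk can switch the charset explicitly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FilenameEncoder {

    // Default is unchanged: filenames are written as US-ASCII.
    private Charset filenameCharset = StandardCharsets.US_ASCII;

    // Opt-in switch, to be used at the application developer's risk.
    public void setFilenameCharset(Charset charset) {
        this.filenameCharset = charset;
    }

    // Returns the bytes that would be written for the filename parameter.
    public byte[] encode(String filename) {
        return filename.getBytes(filenameCharset);
    }

    public static void main(String[] args) {
        FilenameEncoder encoder = new FilenameEncoder();
        System.out.println(encoder.encode("\u017c").length); // prints "1" (a single '?')
        encoder.setFilenameCharset(StandardCharsets.UTF_8);
        System.out.println(encoder.encode("\u017c").length); // prints "2"
    }
}
```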

Mike
