On 24/05/2021 10:58, Scott,Tim wrote:
Hi experts,
First-time poster here, so I know I risk not providing enough of the
right information. Please let me know what I can send to help you help
me further through this.
How are you reading the uploaded file? Please provide the code that does
this.
The only way the default encoding should impact things is if the file
bytes are being converted to String at some point. That shouldn't
normally happen for an uploaded file.
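
To illustrate the kind of code to look for, here is a minimal sketch, not
taken from the application in question. The class and method names are
hypothetical. copyAsBytes() never leaves the byte domain, while
copyAsChars() pushes the same data through a Reader/Writer pair, which is
the sort of bytes-to-String conversion that makes the charset matter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: two ways of copying an uploaded stream.
public class UploadCopySketch {

    // Byte-safe copy: the data is never decoded, so no charset is involved.
    public static byte[] copyAsBytes(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lossy copy: bytes are decoded to chars and encoded again. Any byte
    // sequence that is malformed in the given charset is replaced on decode.
    public static byte[] copyAsChars(InputStream in, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Reader r = new InputStreamReader(in, cs);
             Writer w = new OutputStreamWriter(out, cs)) {
            char[] buf = new char[8192];
            int n;
            while ((n = r.read(buf)) != -1) {
                w.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer has been closed (and therefore flushed) at this point.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Start of a ZIP local file header, including two non-UTF-8 bytes.
        byte[] raw = {0x50, 0x4b, 0x03, 0x04, (byte) 0x89, 0x60, (byte) 0xb3, 0x52};
        byte[] safe = copyAsBytes(new ByteArrayInputStream(raw));
        byte[] lossy = copyAsChars(new ByteArrayInputStream(raw), StandardCharsets.UTF_8);
        System.out.println("byte copy intact: " + Arrays.equals(raw, safe));
        System.out.println("char copy intact: " + Arrays.equals(raw, lossy));
    }
}
```

If anything in the upload path resembles copyAsChars() with the charset
left at the platform default, that would be the first place to look.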
Mark
I’m using separate deployments of Tomcat 9 on Linux (RedHat 7) and
Windows for the same mature .war application.
Around Jan 2020 I found that uploads of ZIP files to the Linux Tomcat
were getting corrupted. The Windows upload worked fine. After much
digging I found this appears to relate to the file.encoding property.
Launching the Tomcat 9 service on Windows with “-Dfile.encoding=UTF-8”
(overriding the default of Cp1252) causes the Windows upload to corrupt
the data.
It would appear, therefore, that file.encoding is affecting binary file
uploads, and I do not think it should. With this set to UTF-8, I am
observing that bytes which are not valid UTF-8 are being replaced with
“ef bf bd” (the UTF-8 encoding of U+FFFD, the “unknown character”
replacement character).
Is there a way to address this?
I believe source .jsp files are utf-8 encoded and I deal with utf-8 in
many parts of the application. I would rather add this encoding to the
Windows deployments than use, e.g., -Dfile.encoding=ISO-8859-1 on Linux.
Note also “If the draft JEP discussed in this post is implemented, the
default charset for file contents will be changed to UTF-8 even for
Windows.”
Ref:
https://dzone.com/articles/java-may-use-utf-8-as-its-default-charset
(March 1st, 2018)
I’ve put some details / “evidence” below should you wish to read further.
Thank you,
Tim
This morning, with Tomcat 9.0.45, I again captured a tcpdump to show
that the browser is sending the correct data. The temp file which Tomcat
created prior to passing the stream to my application is corrupted.
Part of the tcpdump submission is:
------WebKitFormBoundary37kBaouQxD4aoug5
Content-Disposition: form-data; name="file.ob_filename"; filename="MEP.zip"
Content-Type: application/x-zip-compressed
PK.........`.R................tbl_Evidence.csv.Zks.H..........[.=y.Do/..a.`......
.T......i..{..$c......3X.Q..<y.d..&.|:.....&|..Q"....y(r...( ....O....G....
;..Q,.q..e.&......P$.X..0*.3<T.K....O.........m<..8..b....|%.E...2...e^.......H}.F.|;.W+.....(
Captured with -X, this reads:
0x0230: 6e61 6d65 3d22 4d45 502e 7a69 7022 0d0a name="MEP.zip"..
0x0240: 436f 6e74 656e 742d 5479 7065 3a20 6170 Content-Type:.ap
0x0250: 706c 6963 6174 696f 6e2f 782d 7a69 702d plication/x-zip-
0x0260: 636f 6d70 7265 7373 6564 0d0a 0d0a 504b compressed....PK
0x0270: 0304 1400 0808 0800 8960 b352 0000 0000 .........`.R....
0x0280: 0000 0000 0000 0000 1000 0000 7462 6c5f ............tbl_
0x0290: 4576 6964 656e 6365 2e63 7376 bd5a 6b73 Evidence.csv.Zks
0x02a0: e248 b2fd bebf a2c2 11b7 db8e 5b06 3d79 .H..........[.=y
0x02b0: f444 6f2f c6b8 61c6 6016 b9c7 b113 8e20 .Do/..a.`.......
The temp file shows:
$ od -t x1 upload_5e216399_71ab_4273_b38b_0410583a4edb_00000024.tmp | head
0000000 50 4b 03 04 14 00 08 08 08 00 ef bf bd 60 ef bf
0000020 bd 52 00 00 00 00 00 00 00 00 00 00 00 00 10 00
0000040 00 00 74 62 6c 5f 45 76 69 64 65 6e 63 65 2e 63
0000060 73 76 ef bf bd 5a 6b 73 ef bf bd 48 ef bf bd ef
0000100 bf bd ef bf bd ef bf bd ef bf bd ef bf bd 11 ef
As you may notice comparing this line with the first line of the od output:
0x0270: 0304 1400 0808 0800 8960 b352 0000 0000 .........`.R....
The “89” and “b3” (no doubt invalid UTF-8 bytes) have been replaced with
“ef bf bd”. This is repeated later for each subsequent invalid UTF-8
byte.
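
The substitution above can be reproduced outside Tomcat. The following is
a minimal sketch (the class and method names are made up for
illustration) that decodes the four bytes “89 60 b3 52” from the capture
to a String and encodes them back, once as UTF-8 and once as ISO-8859-1.
It only demonstrates the decode/encode effect; it does not show where in
the upload path the conversion happens.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: what a bytes -> String -> bytes round trip does.
public class RoundTripDemo {

    // Decode raw bytes to a String, then encode back with the same charset.
    public static byte[] roundTrip(byte[] raw, Charset cs) {
        return new String(raw, cs).getBytes(cs);
    }

    // Format bytes as space-separated lowercase hex, like "od -t x1".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Bytes at offsets 10..13 of the ZIP header in the tcpdump capture.
        byte[] raw = {(byte) 0x89, 0x60, (byte) 0xb3, 0x52};

        // 0x89 and 0xb3 are stray continuation bytes in UTF-8, so each is
        // decoded to U+FFFD and re-encoded as ef bf bd.
        System.out.println(hex(roundTrip(raw, StandardCharsets.UTF_8)));
        // prints: ef bf bd 60 ef bf bd 52

        // ISO-8859-1 maps every byte value to a character, so it is lossless.
        System.out.println(hex(roundTrip(raw, StandardCharsets.ISO_8859_1)));
        // prints: 89 60 b3 52
    }
}
```

The UTF-8 result matches bytes 10 onwards of the first od line above,
which is consistent with the data passing through a String somewhere.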
In case this is relevant, I’m using Amazon’s Corretto JDK 11.0.4
(64-bit) on Linux (11.0.7 now on Windows) but I’ve observed this problem
with JDK8 and I can’t say when it started. I know it worked a few years
ago on Linux and Windows, but can’t dig out the version information for
then.
NB: Just updated to JDK 11.0.11 and it made no difference.
My extensive, repeated and varied searches merely confirm that my HTML
is OK and that the form submission is as intended. Maybe my process for
reading the data is out of date, but it works fine on Windows (Java is
meant to be a “write once, run anywhere” language) and all the debugging
I do shows that the data is corrupt before my application sees it.
My JVM property file.encoding = UTF-8 on Linux and was Cp1252 on Windows.
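
For anyone wanting to confirm what a given JVM is actually using, a small
check along these lines may help. This is a sketch with a hypothetical
class name. Calls such as new String(bytes), String.getBytes() and
new InputStreamReader(in), when given no explicit charset, fall back to
the value reported as defaultCharset here.

```java
import java.nio.charset.Charset;

// Hypothetical helper: report the charset settings of the running JVM.
public class DefaultCharsetCheck {

    public static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```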
--
Tim Scott
*OCLC* · Senior Software Engineer / Technical Product Manager
CityGate, 8 St. Mary’s Gate, Sheffield S1 4LW, UK
cc: IT file