On 24/05/2021 10:58, Scott,Tim wrote:
Hi experts,

First-time poster here, so I know I risk not providing enough of the right information. Please let me know what I can send to help you help me through this.

How are you reading the uploaded file? Please provide the code that does this.

The only way the default encoding should impact things is if the file bytes are being converted to String at some point. That shouldn't normally happen for an uploaded file.
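
A minimal sketch of that point, using the four bytes 89 60 b3 52 that appear at offset 0x0278 of the tcpdump capture further down. The class and method names here are made up for illustration and are not taken from Tomcat or from the application. It decodes the bytes to a String and encodes them back with the same charset, which is the kind of conversion that would have to be happening somewhere for file.encoding to matter:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StringRoundTrip {
    // Decode the bytes to a String, then encode that String back to bytes,
    // using the same charset in both directions.
    static byte[] roundTrip(byte[] data, Charset cs) {
        return new String(data, cs).getBytes(cs);
    }

    // Format bytes as space-separated lowercase hex, like "od -t x1".
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Four bytes from the ZIP local file header in the capture.
        byte[] original = {(byte) 0x89, 0x60, (byte) 0xb3, 0x52};

        // UTF-8: 0x89 and 0xb3 are not valid on their own, so each becomes
        // U+FFFD, which encodes back as ef bf bd.
        System.out.println(hex(roundTrip(original, StandardCharsets.UTF_8)));

        // ISO-8859-1 maps every byte to one char and back, so nothing changes.
        System.out.println(hex(roundTrip(original, StandardCharsets.ISO_8859_1)));
    }
}
```

The UTF-8 round trip yields ef bf bd 60 ef bf bd 52, which matches the od output of the temp file; the ISO-8859-1 round trip returns the input unchanged. This only shows that a bytes-to-String conversion would explain the symptom, not where in the stack it happens.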

Mark


I’m using separate deployments of Tomcat 9 on Linux (RedHat 7) and Windows for the same mature .war application.

Around Jan 2020 I found that uploads of ZIP files to the Linux Tomcat were getting corrupted. The Windows upload worked fine. After much digging I found this appears to relate to the file.encoding property.

Launching the Tomcat 9 service on Windows with “-Dfile.encoding=UTF-8” (overriding the default of Cp1252) causes the Windows upload to corrupt the data.

It would appear, therefore, that file.encoding is affecting binary file uploads, and I do not think it should. With this set to UTF-8, I am observing that invalid UTF-8 bytes are being replaced with “ef bf bd” (the UTF-8 encoding of U+FFFD, the replacement or “unknown character” character; this is not the BOM, which is “ef bb bf”).

Is there a way to address this?

I believe source .jsp files are utf-8 encoded and I deal with utf-8 in many parts of the application. I would rather add this encoding to the Windows deployments than use, e.g., -Dfile.encoding=ISO-8859-1 on Linux.

Note also “If the draft JEP discussed in this post is implemented, the default charset for file contents will be changed to UTF-8 even for Windows.”

               Ref: https://dzone.com/articles/java-may-use-utf-8-as-its-default-charset (March 1st, 2018)

I’ve put some details / “evidence” below should you wish to read further.

Thank you,

Tim

This morning, with Tomcat 9.0.45, I again captured a tcpdump to show that the browser is sending the correct data. The temp file which Tomcat created prior to passing the stream to my application is corrupted.

Part of the tcpdump submission is:

------WebKitFormBoundary37kBaouQxD4aoug5

Content-Disposition: form-data; name="file.ob_filename"; filename="MEP.zip"

Content-Type: application/x-zip-compressed

PK.........`.R................tbl_Evidence.csv.Zks.H..........[.=y.Do/..a.`...... .T......i..{..$c......3X.Q..<y.d..&.|:.....&|..Q"....y(r...(  ....O....G....

;..Q,.q..e.&......P$.X..0*.3<T.K....O.........m<..8..b....|%.E...2...e^.......H}.F.|;.W+.....(

Captured with -X, this reads:

         0x0230:  6e61 6d65 3d22 4d45 502e 7a69 7022 0d0a  name="MEP.zip"..
         0x0240:  436f 6e74 656e 742d 5479 7065 3a20 6170  Content-Type:.ap
         0x0250:  706c 6963 6174 696f 6e2f 782d 7a69 702d  plication/x-zip-
         0x0260:  636f 6d70 7265 7373 6564 0d0a 0d0a 504b  compressed....PK
         0x0270:  0304 1400 0808 0800 8960 b352 0000 0000  .........`.R....
         0x0280:  0000 0000 0000 0000 1000 0000 7462 6c5f  ............tbl_
         0x0290:  4576 6964 656e 6365 2e63 7376 bd5a 6b73  Evidence.csv.Zks
         0x02a0:  e248 b2fd bebf a2c2 11b7 db8e 5b06 3d79  .H..........[.=y
         0x02b0:  f444 6f2f c6b8 61c6 6016 b9c7 b113 8e20  .Do/..a.`.......

The temp file shows:

$ od -t x1 upload_5e216399_71ab_4273_b38b_0410583a4edb_00000024.tmp | head

0000000 50 4b 03 04 14 00 08 08 08 00 ef bf bd 60 ef bf
0000020 bd 52 00 00 00 00 00 00 00 00 00 00 00 00 10 00
0000040 00 00 74 62 6c 5f 45 76 69 64 65 6e 63 65 2e 63
0000060 73 76 ef bf bd 5a 6b 73 ef bf bd 48 ef bf bd ef
0000100 bf bd ef bf bd ef bf bd ef bf bd ef bf bd 11 ef

As you may notice comparing this line with the first line of the od output:

         0x0270:  0304 1400 0808 0800 8960 b352 0000 0000  .........`.R....

The “89” and “b3” bytes (neither of which is valid UTF-8 on its own) have each been replaced with “ef bf bd”. This is repeated later for each subsequent invalid UTF-8 byte.

In case this is relevant, I’m using Amazon’s Corretto JDK 11.0.4 (64-bit) on Linux (11.0.7 now on Windows) but I’ve observed this problem with JDK8 and I can’t say when it started. I know it worked a few years ago on Linux and Windows, but can’t dig out the version information for then.

                NB: Just updated to JDK 11.0.11 and it made no difference.

My extensive, repeated and varied searches merely confirm that my HTML is OK and the form submission is as intended. Maybe the process for reading the data is out of date, but it works fine on Windows (Java is meant to be a “write once, run anywhere” language) and all the debugging I do shows that the data is corrupt before my application sees it.
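
For comparison, here is a hedged sketch of the two general ways an upload stream can be copied in Java. The application's actual reading code is not shown in this thread, so the class and method names below are hypothetical and in-memory streams stand in for the upload. The first method copies raw bytes; the second goes through a Reader and Writer, which decodes and re-encodes the data with a charset:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UploadCopy {
    // Byte-for-byte copy: no charset is involved, so binary data is preserved.
    static void copyBytes(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }

    // Character-based copy: the input is decoded with cs and encoded again
    // with cs. Byte sequences that are invalid in cs are replaced on decoding.
    static void copyChars(InputStream in, OutputStream out, Charset cs) throws IOException {
        Reader reader = new InputStreamReader(in, cs);
        Writer writer = new OutputStreamWriter(out, cs);
        char[] buf = new char[8192];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // First 14 bytes of the ZIP payload as seen in the tcpdump capture.
        byte[] zipStart = {0x50, 0x4b, 0x03, 0x04, 0x14, 0x00, 0x08, 0x08, 0x08, 0x00,
                           (byte) 0x89, 0x60, (byte) 0xb3, 0x52};

        ByteArrayOutputStream viaBytes = new ByteArrayOutputStream();
        copyBytes(new ByteArrayInputStream(zipStart), viaBytes);

        ByteArrayOutputStream viaChars = new ByteArrayOutputStream();
        copyChars(new ByteArrayInputStream(zipStart), viaChars, StandardCharsets.UTF_8);

        System.out.println("byte copy length: " + viaBytes.size());
        System.out.println("char copy length: " + viaChars.size());
    }
}
```

With UTF-8, the character-based copy grows from 14 to 18 bytes because each of the two invalid bytes becomes the three-byte sequence ef bf bd. The charset is passed explicitly here for clarity; the one-argument InputStreamReader and OutputStreamWriter constructors use the JVM default charset instead, which is one place where file.encoding could come into play.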

My JVM property file.encoding = UTF-8 on Linux and was Cp1252 on Windows.
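
To confirm what a given JVM is actually using, a small check like the following can be run under the same JDK and launch options as Tomcat. The class name is made up for illustration. On the JDK versions mentioned in this thread (8 and 11), the default charset is derived from file.encoding at startup, and it is what charset-less APIs such as new String(byte[]), String.getBytes() and InputStreamReader(InputStream) fall back to:

```java
import java.nio.charset.Charset;

class ShowDefaultCharset {
    // Report both the raw system property and the charset the JVM resolved.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once with and once without -Dfile.encoding=UTF-8 shows whether the option is reaching the JVM.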

--

Tim Scott

*OCLC* · Senior Software Engineer / Technical Product Manager

CityGate, 8 St. Mary’s Gate, Sheffield S1 4LW, UK

cc: IT file
