[ https://issues.apache.org/jira/browse/HDDS-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838478#comment-17838478 ]
Tanvi Penumudy edited comment on HDDS-10688 at 4/18/24 6:19 AM:
----------------------------------------------------------------

Sharing the details of the discussion:
* When we copy a file larger than 5 MB, goofys performs the copy as a multipart upload. This is where the reported issue occurs, which is why we saw the INVALID_PART error only for larger files (> 5 MB) with Chinese-character filenames.
* When we perform a multipart upload via aws s3api, the user supplies a compiled file with the part number and ETag information during the complete-multipart-upload step (so this works fine). In the goofys copy operation (for files > 5 MB), this step is handled by the underlying implementation.
* goofys internally compiles this part information (for the complete-multipart-upload step) from the output of each individual upload-part step.
* Functionally, the upload-part operation works as expected (even with Chinese-character filenames), but the ETag in the upload-part response does not contain the Chinese characters, as shown in Eg. 2 below.

Examples of the ETag information returned by the upload-part step:

Eg 1: [Older Ozone version] Filenames with pure English characters:
{code:java}
aws s3api create-multipart-upload --bucket multipart --key sample_en --endpoint <endpoint-URL>:9879 --no-verify-ssl
{
    "Bucket": "multipart",
    "Key": "sample_en",
    "UploadId": "41d1a3ab-291a-4c2c-a9a0-d43e8784d554-112284974522564610"
}
{code}
{code:java}
aws s3api upload-part --bucket multipart --endpoint <endpoint-URL>:9879 --key sample_en --part-number 1 --body xaa --upload-id 41d1a3ab-291a-4c2c-a9a0-d43e8784d554-112284974522564610 --no-verify-ssl
{
    "ETag": "/s3v/multipart/sample_en-41d1a3ab-291a-4c2c-a9a0-d43e8784d554-112284974522564610-1"
}
{code}
For the upload-part operation, the ETag name is typically compiled as /<vol>/<bucket>/<multipart-key>-<upload-id>-<part-num>.
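To make the part-name layout above concrete, here is a minimal Python sketch. It is not Ozone code: {{part_name}} and {{drop_non_ascii}} are hypothetical helpers, and dropping non-ASCII characters is only a stand-in for whatever step loses the characters in the upload-part response (the actual cause is not identified in this comment). It shows why a client that echoes back the returned name cannot match the name kept on the server side.

```python
def part_name(volume: str, bucket: str, key: str, upload_id: str, part_number: int) -> str:
    """Compose a name in the older '/<vol>/<bucket>/<key>-<upload-id>-<part-num>' layout."""
    return f"/{volume}/{bucket}/{key}-{upload_id}-{part_number}"


def drop_non_ascii(text: str) -> str:
    """Hypothetical stand-in for the lossy step: silently discard non-ASCII characters."""
    return text.encode("ascii", errors="ignore").decode("ascii")


upload_id = "f5525b81-ccb4-453a-aeba-6fc7ab93348e-112284983095001094"

# Name kept on the server side versus the name a client would see in the response.
stored = part_name("s3v", "multipart", "测试三", upload_id, 1)
returned = drop_non_ascii(stored)

# An ASCII-only key survives the same lossy step unchanged.
english = part_name("s3v", "multipart", "sample_en", upload_id, 1)

print(returned == stored)                  # False: the Chinese key no longer matches
print(drop_non_ascii(english) == english)  # True: the English key still matches
```

A client such as goofys that feeds the returned value into the complete-multipart-upload step would therefore present a part name that differs from the stored one, which is consistent with the INVALID_PART failure reported in this issue.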
Eg 2: [Older Ozone version] Filenames with one or more Chinese characters:
{code:java}
aws s3api create-multipart-upload --bucket multipart --key 测试三 --endpoint <endpoint-URL>:9879 --no-verify-ssl
{
    "Bucket": "multipart",
    "Key": "测试三",
    "UploadId": "f5525b81-ccb4-453a-aeba-6fc7ab93348e-112284983095001094"
}
{code}
{code:java}
aws s3api upload-part --bucket multipart --endpoint <endpoint-URL>:9879 --key 测试三 --part-number 1 --body xaa --upload-id f5525b81-ccb4-453a-aeba-6fc7ab93348e-112284983095001094 --no-verify-ssl
{
    "ETag": "/s3v/multipart/ -f5525b81-ccb4-453a-aeba-6fc7ab93348e-112284983095001094-1"
}
{code}
The ETag name is compiled the same way as in Eg 1, except that the Chinese characters present in the filename are missing.

Eg 3: [Latest Ozone version] Filenames with pure English characters:
{code:java}
aws s3api create-multipart-upload --bucket multipart --key sample_eng --endpoint <endpoint-URL>:9879 --no-verify-ssl
{
    "Bucket": "multipart",
    "Key": "sample_eng",
    "UploadId": "15788792-d3cc-40cc-ae2b-ecb11df75d31-112284850768117806"
}
{code}
{code:java}
aws s3api upload-part --bucket multipart --endpoint <endpoint-URL>:9879 --key sample_eng --part-number 1 --body xaa --upload-id 15788792-d3cc-40cc-ae2b-ecb11df75d31-112284850768117806 --no-verify-ssl
{
    "ETag": "4034379ecc54213fc9a51785a9d0e8e2"
}
{code}
After the changes for HDDS-9115 (Ticket: HDDS-9114, PR: [https://github.com/apache/ozone/pull/5162]) were checked in, the ETag of each individual upload-part is now the MD5 hash of that part's body.

Eg 4:
[Latest Ozone version] Filenames with one or more Chinese characters:
{code:java}
aws s3api create-multipart-upload --bucket multipart --key 客客 --endpoint <endpoint-URL>:9879 --no-verify-ssl
{
    "Bucket": "multipart",
    "Key": "客客",
    "UploadId": "dc94b109-cd48-435f-9a16-5ade0edbfb65-112284958083645514"
}
{code}
{code:java}
aws s3api upload-part --bucket multipart --endpoint <endpoint-URL>:9879 --key 客客 --part-number 1 --body xaa --upload-id dc94b109-cd48-435f-9a16-5ade0edbfb65-112284958083645514 --no-verify-ssl
{
    "ETag": "4034379ecc54213fc9a51785a9d0e8e2"
}
{code}
The ETag for Chinese-character filenames is calculated the same way as in Eg. 3.

With this change in the latest version of Ozone, goofys picks up the MD5 hash of each part's body as the upload-part ETag when it compiles the information for the complete-multipart-upload step. An MD5 hash cannot lose Chinese characters (or any other non-encoded characters), so this issue is no longer seen in the latest Ozone code.
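The effect of the MD5-based ETag can be illustrated with a short Python sketch. It is an illustration under assumptions, not goofys or Ozone code: {{md5_etag}} and {{build_parts_manifest}} are hypothetical helper names and the part bodies are made-up sample data. The manifest shape (a "Parts" list of "PartNumber"/"ETag" entries) is the JSON structure that aws s3api complete-multipart-upload accepts via --multipart-upload.

```python
import hashlib
import json


def md5_etag(part_body: bytes) -> str:
    """ETag in the newer scheme: the hex MD5 digest of the part's body only."""
    return hashlib.md5(part_body).hexdigest()


def build_parts_manifest(part_bodies: list[bytes]) -> dict:
    """Collect (part number, ETag) pairs the way a client would for the complete step."""
    return {
        "Parts": [
            {"PartNumber": number, "ETag": md5_etag(body)}
            for number, body in enumerate(part_bodies, start=1)
        ]
    }


# Made-up sample part bodies; a real client would read chunks of the file being copied.
bodies = [b"first part payload", b"second part payload"]

# The object key (for example 客客 or sample_eng) is never an input to the ETag,
# so the manifest is identical whichever key the parts are uploaded under.
manifest = build_parts_manifest(bodies)
print(json.dumps(manifest, indent=2))
```

Because the key name never enters the calculation, every ETag is a 32-character hexadecimal string, and there is nothing in it for a lossy encoding step to drop.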
> S3 multipart upload failed for Chinese filename with s3 fuse clients
> --------------------------------------------------------------------
>
> Key: HDDS-10688
> URL: https://issues.apache.org/jira/browse/HDDS-10688
> Project: Apache Ozone
> Issue Type: Bug
> Components: S3, s3gateway
> Affects Versions: 1.4.0, 1.5.0
> Reporter: Soumitra Sulav
> Assignee: Tanvi Penumudy
> Priority: Critical
>
> * Issue is seen only with s3 fuse clients which internally do the copy via multipart upload mechanism. Basically, the client initiates an MPU, creates individual parts uploads, and finally runs the COMPLETE_MULTIPART_UPLOAD using individual parts. The issue is observed at the last layer where it is trying to merge the file and failing to do so.
> * This issue is only seen with Chinese (Non-English charset) characters.
> * Upload-part API misses the non-English character in the response even after setting proper encoding and locale variables.
> * List-Part API response is proper.
>
> Below are the repro steps :
> # Install goofys fuse client [https://github.com/kahing/goofys/releases/download/v0.24.0/goofys]
> # Mount the ozone s3 endpoint via goofys
> {code:java}
> goofys --debug_fuse --debug_s3 --endpoint http://<OzoneS3GHost>:<OzoneS3GPort> <BucketName> <LocalPath>
> {code}
> # Create a file of size > 5MB and a name containing Chinese characters.
> # Copy the file from the local filesystem to the mounted path.
> {code:java}
> cp 测试三.txt /mnt/test-goofys/
> cp: failed to close '/mnt/test-goofys/测试三.txt': Invalid argument
> {code}
> Error stacktrace
> {code:java}
> 2024-04-08 09:29:36,145 | INFO | S3GAudit | user=o...@root.comops.site | ip=10.129.77.95 | op=INIT_MULTIPART_UPLOAD {bucket=[buckettest], path=[测试三.txt], uploads=[]} | ret=SUCCESS |
> 2024-04-08 09:29:36,288 | INFO | S3GAudit | user=o...@root.comops.site | ip=10.129.77.95 | op=CREATE_MULTIPART_KEY {bucket=[buckettest], path=[测试三.txt], uploadId=[669be17b-6c05-4066-9398-13a3586c65b1-112234894206109441], partNumber=[1]} | ret=SUCCESS |
> 2024-04-08 09:29:36,432 | INFO | S3GAudit | user=o...@root.comops.site | ip=10.129.77.95 | op=CREATE_MULTIPART_KEY {bucket=[buckettest], path=[测试三.txt], uploadId=[669be17b-6c05-4066-9398-13a3586c65b1-112234894206109441], partNumber=[2]} | ret=SUCCESS |
> 2024-04-08 09:29:36,455 | ERROR | S3GAudit | user=o...@root.comops.site | ip=10.129.77.95 | op=COMPLETE_MULTIPART_UPLOAD {bucket=[buckettest], path=[测试三.txt], uploadId=[669be17b-6c05-4066-9398-13a3586c65b1-112234894206109441]} | ret=FAILURE | INVALID_PART org.apache.hadoop.ozone.om.exceptions.OMException: Complete Multipart Upload Failed: volume: s3v bucket: buckettest key: 测试三.txt. Provided Part info is { /s3v/buckettest/ .txt-669be17b-6c05-4066-9398-13a3586c65b1-112234894206109441-1, 1}, whereas OM has partName /s3v/buckettest/测试三.txt-669be17b-6c05-4066-9398-13a3586c65b1-112234894206109441-1
>     at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:728)
>     at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB.java:1587)
> {code}