[ 
https://issues.apache.org/jira/browse/HDDS-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838478#comment-17838478
 ] 

Tanvi Penumudy edited comment on HDDS-10688 at 4/18/24 6:19 AM:
----------------------------------------------------------------

Sharing the details from the discussion:
 * When we copy a file larger than 5 MB, goofys performs the operation via 
multipart upload. This is where the issue occurs, which is why we saw the 
INVALID_PART error only for larger files (> 5 MB) with Chinese-character 
filenames.
 * When we perform a multipart upload via aws s3api, the user has to provide a 
compiled file with the part number and ETag information for the 
complete-multipart-upload step (so this works fine). In the goofys copy 
operation (for files > 5 MB), this step is handled by the underlying 
implementation.
 * goofys internally compiles this part information (for the 
complete-multipart-upload step) from the output of each individual 
upload-part step.
 * Although the upload-part operation itself works as expected (even with 
Chinese-character filenames), the ETag in the upload-part response does not 
contain the Chinese characters, as shown in Eg. 2 below.
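
As a rough illustration of the compilation step described above, the sketch below builds the part list that the complete-multipart-upload step expects from the individual upload-part responses. It is not goofys code: the function name build_parts_manifest and the placeholder ETag for part 2 are hypothetical, and only the ETag of part 1 is taken from the examples in this comment.

```python
import json


def build_parts_manifest(upload_part_responses):
    """Compile the part list from {part_number: upload-part response} pairs.

    Hypothetical helper: it echoes back whatever ETag each upload-part
    response contained, ordered by part number.
    """
    parts = [
        {"ETag": response["ETag"], "PartNumber": part_number}
        for part_number, response in sorted(upload_part_responses.items())
    ]
    return {"Parts": parts}


# Responses collected out of order; the second ETag is a placeholder.
responses = {
    2: {"ETag": "<etag-of-part-2>"},
    1: {"ETag": "4034379ecc54213fc9a51785a9d0e8e2"},
}
manifest = build_parts_manifest(responses)
print(json.dumps(manifest, indent=4))
```

Because the client only echoes the ETag it received, any character lost in the upload-part response ends up in the part list sent with complete-multipart-upload.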

 

Examples of the ETag information returned by the upload-part step:

Eg 1: [Older Ozone version] Filenames with pure English characters:
{code:java}
aws s3api create-multipart-upload --bucket multipart --key sample_en --endpoint 
<endpoint-URL>:9879 --no-verify-ssl

{
    "Bucket": "multipart",
    "Key": "sample_en",
    "UploadId": "41d1a3ab-291a-4c2c-a9a0-d43e8784d554-112284974522564610"
}
{code}
{code:java}
aws s3api upload-part --bucket multipart --endpoint <endpoint-URL>:9879 --key 
sample_en --part-number 1 --body xaa --upload-id 
41d1a3ab-291a-4c2c-a9a0-d43e8784d554-112284974522564610 --no-verify-ssl

{
    "ETag": 
"/s3v/multipart/sample_en-41d1a3ab-291a-4c2c-a9a0-d43e8784d554-112284974522564610-1"
}
{code}
The ETag name is typically composed as 
/<vol>/<bucket>/<multipart-key>-<upload-id>-<part-num> for the upload-part 
operation.

 

Eg 2: [Older Ozone version] Filenames with one or more Chinese characters:
{code:java}
aws s3api create-multipart-upload --bucket multipart --key 测试三 --endpoint 
<endpoint-URL>:9879 --no-verify-ssl

{
    "Bucket": "multipart",
    "Key": "测试三",
    "UploadId": "f5525b81-ccb4-453a-aeba-6fc7ab93348e-112284983095001094"
}
{code}
{code:java}
aws s3api upload-part --bucket multipart --endpoint <endpoint-URL>:9879 --key 
测试三 --part-number 1 --body xaa --upload-id 
f5525b81-ccb4-453a-aeba-6fc7ab93348e-112284983095001094 --no-verify-ssl

{
    "ETag": "/s3v/multipart/   
-f5525b81-ccb4-453a-aeba-6fc7ab93348e-112284983095001094-1"
}
{code}
The ETag name is compiled the same way as Eg 1, except that the Chinese 
characters present in the filename are missing.
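
The mismatch can be reproduced in isolation with the sketch below. It composes the part name in the format seen in Eg 1 and then imitates the lossy response from Eg 2. Both helper names are hypothetical, and replacing each non-ASCII character with a space is only an assumption chosen to mimic the observed output; the actual cause of the loss inside Ozone is not shown here.

```python
def part_name(volume, bucket, key, upload_id, part_number):
    """Compose the older-style part name: /<vol>/<bucket>/<key>-<upload-id>-<part-num>."""
    return f"/{volume}/{bucket}/{key}-{upload_id}-{part_number}"


def simulate_lossy_response(name):
    """ASSUMPTION: imitate the observed response by blanking non-ASCII characters."""
    return "".join(ch if ch.isascii() else " " for ch in name)


upload_id = "f5525b81-ccb4-453a-aeba-6fc7ab93348e-112284983095001094"

# Name as recorded on the server side versus name as seen by the client.
stored = part_name("s3v", "multipart", "测试三", upload_id, 1)
returned = simulate_lossy_response(stored)

print(stored)
print(returned)
print("names match:", stored == returned)
```

A client that sends the returned name back during complete-multipart-upload presents a part name that differs from the stored one, which corresponds to the INVALID_PART failure in the issue description.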

 
Eg 3: [Latest Ozone version] Filenames with pure English characters:
{code:java}
aws s3api create-multipart-upload --bucket multipart --key sample_eng 
--endpoint <endpoint-URL>:9879 --no-verify-ssl

{
    "Bucket": "multipart",
    "Key": "sample_eng",
    "UploadId": "15788792-d3cc-40cc-ae2b-ecb11df75d31-112284850768117806"
}
{code}
{code:java}
aws s3api upload-part --bucket multipart --endpoint <endpoint-URL>:9879 --key 
sample_eng --part-number 1 --body xaa --upload-id 
15788792-d3cc-40cc-ae2b-ecb11df75d31-112284850768117806 --no-verify-ssl

{
    "ETag": "4034379ecc54213fc9a51785a9d0e8e2"
}
{code}
After the changes from HDDS-9115 (Ticket: HDDS-9114, PR: 
[https://github.com/apache/ozone/pull/5162]) were checked in, the ETag of each 
individual upload-part is the MD5 hash of that part's body.
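
A minimal sketch of such an ETag, assuming it is the plain hexadecimal MD5 digest of the part's bytes (which the 32-character value above suggests). The function name part_etag and the sample body are hypothetical; this is not the Ozone implementation.

```python
import hashlib


def part_etag(body: bytes) -> str:
    """Return the hex MD5 digest of a part's body (assumed ETag format)."""
    return hashlib.md5(body).hexdigest()


body = b"hello"

# The key name is not an input, so an English key and a Chinese key
# uploading the same bytes would get the same ETag.
etag_for_english_key = part_etag(body)
etag_for_chinese_key = part_etag(body)

print(etag_for_english_key)
print(etag_for_english_key == etag_for_chinese_key)
```

Since the digest is made of ASCII hex digits only, it cannot lose characters the way a name derived from the key can.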

 

Eg 4: [Latest Ozone version] Filenames with one or more Chinese characters:
{code:java}
aws s3api create-multipart-upload --bucket multipart --key 客客 --endpoint 
<endpoint-URL>:9879 --no-verify-ssl

{
    "Bucket": "multipart",
    "Key": "客客",
    "UploadId": "dc94b109-cd48-435f-9a16-5ade0edbfb65-112284958083645514"
}
{code}
{code:java}
aws s3api upload-part --bucket multipart --endpoint <endpoint-URL>:9879 --key 
客客 --part-number 1 --body xaa --upload-id 
dc94b109-cd48-435f-9a16-5ade0edbfb65-112284958083645514 --no-verify-ssl

{
    "ETag": "4034379ecc54213fc9a51785a9d0e8e2"
}
{code}
The ETag value is calculated the same way as in Eg. 3, even for 
Chinese-character filenames.

 

Due to this change in the latest version of Ozone, goofys now picks up the MD5 
hash of the part's body as the ETag of every upload-part while compiling the 
information for the complete-multipart-upload step. Since this value does not 
depend on the filename, there is no scope for missing Chinese characters or any 
non-encoded characters, and the issue is no longer seen in the latest Ozone 
code.



> S3 multipart upload failed for Chinese filename with s3 fuse clients
> --------------------------------------------------------------------
>
>                 Key: HDDS-10688
>                 URL: https://issues.apache.org/jira/browse/HDDS-10688
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: S3, s3gateway
>    Affects Versions: 1.4.0, 1.5.0
>            Reporter: Soumitra Sulav
>            Assignee: Tanvi Penumudy
>            Priority: Critical
>
>  * The issue is seen only with S3 FUSE clients that internally perform the 
> copy via the multipart upload mechanism. The client initiates an MPU, uploads 
> the individual parts, and finally runs COMPLETE_MULTIPART_UPLOAD with those 
> parts. The failure occurs in this last step, where the parts are merged into 
> the final file.
>  * The issue is seen only with Chinese (non-English charset) characters.
>  * The upload-part API response omits the non-English characters even after 
> the proper encoding and locale variables are set.
>  * The list-parts API response is correct.
> Below are the repro steps:
>  # Install goofys fuse client
> [https://github.com/kahing/goofys/releases/download/v0.24.0/goofys]
>  # Mount the ozone s3 endpoint via goofys
> {code:java}
> goofys --debug_fuse --debug_s3 --endpoint 
> http://<OzoneS3GHost>:<OzoneS3GPort> <BucketName> <LocalPath>
> {code}
>  # Create a file of size > 5MB and a name containing Chinese characters.
>  # Copy the file from the local filesystem to the mounted path.
> {code:java}
> cp 测试三.txt /mnt/test-goofys/
> cp: failed to close '/mnt/test-goofys/测试三.txt': Invalid argument
> {code}
> Error stacktrace
> {code:java}
> 2024-04-08 09:29:36,145 | INFO  | S3GAudit | user=o...@root.comops.site | 
> ip=10.129.77.95 | op=INIT_MULTIPART_UPLOAD {bucket=[buckettest], 
> path=[测试三.txt], uploads=[]} | ret=SUCCESS |
> 2024-04-08 09:29:36,288 | INFO  | S3GAudit | user=o...@root.comops.site | 
> ip=10.129.77.95 | op=CREATE_MULTIPART_KEY {bucket=[buckettest], 
> path=[测试三.txt], 
> uploadId=[669be17b-6c05-4066-9398-13a3586c65b1-112234894206109441], 
> partNumber=[1]} | ret=SUCCESS |
> 2024-04-08 09:29:36,432 | INFO  | S3GAudit | user=o...@root.comops.site | 
> ip=10.129.77.95 | op=CREATE_MULTIPART_KEY {bucket=[buckettest], 
> path=[测试三.txt], 
> uploadId=[669be17b-6c05-4066-9398-13a3586c65b1-112234894206109441], 
> partNumber=[2]} | ret=SUCCESS |
> 2024-04-08 09:29:36,455 | ERROR | S3GAudit | user=o...@root.comops.site | 
> ip=10.129.77.95 | op=COMPLETE_MULTIPART_UPLOAD {bucket=[buckettest], 
> path=[测试三.txt], 
> uploadId=[669be17b-6c05-4066-9398-13a3586c65b1-112234894206109441]} | 
> ret=FAILURE | INVALID_PART org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: s3v bucket: buckettest key: 
> 测试三.txt. Provided Part info is { /s3v/buckettest/   
> .txt-669be17b-6c05-4066-9398-13a3586c65b1-112234894206109441-1, 1}, whereas 
> OM has partName 
> /s3v/buckettest/测试三.txt-669be17b-6c05-4066-9398-13a3586c65b1-112234894206109441-1
>         at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:728)
>         at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB.java:1587)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
