[ https://issues.apache.org/jira/browse/HADOOP-16260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820447#comment-16820447 ]
Gabor Bota edited comment on HADOOP-16260 at 4/17/19 7:52 PM:
--------------------------------------------------------------

Have you checked this: http://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html#DistCp_and_Object_Stores

This is about using {{-direct}} with distcp, so it will skip temporary file rename operations when the destination is an object store. Maybe it will be useful for your case.

was (Author: gabor.bota):
Have you checked this: https://aajisaka.github.io/hadoop-document/hadoop-project/hadoop-distcp/DistCp.html#DistCp_and_Object_Stores

This is about using {{-direct}} with distcp, so it will skip temporary file rename operations when the destination is an object store. Maybe it will be useful for your case.

> Allow Distcp to create a new tempTarget file per File
> -----------------------------------------------------
>
>                 Key: HADOOP-16260
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16260
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Arun Suresh
>            Priority: Major
>
> We use distcp to copy entire HDFS clusters to GCS.
> In the process, we hit the following error:
> {noformat}
> INFO: Encountered status code 410 when accessing URL
> https://www.googleapis.com/upload/storage/v1/b/app/o?ifGenerationMatch=0&name=analytics/.distcp.tmp.attempt_local1083459072_0001_m_000000_0&uploadType=resumable&upload_id=AEnB2Uq4mZeZxXgs2Mhx0uskNpZ4Cka8pT4aCcd7v6UC4TDQx-h0uEFWoPpdOO4pWEdmaKnhTjxVva5Ow4vXbTe6_JScIU5fsQSaIwNkF3D84DHjtuhKSCU.
> Delegating to response handler for possible retry.
> Apr 14, 2019 5:53:17 AM com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation call
> SEVERE: Exception not convertible into handled response
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
> {
>   "code" : 429,
>   "errors" : [ {
>     "domain" : "usageLimits",
>     "message" : "The total number of changes to the object app/folder/.distcp.tmp.attempt_local1083459072_0001_m_000000_0 exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
>     "reason" : "rateLimitExceeded"
>   } ],
>   "message" : "The total number of changes to the object app/folder/.distcp.tmp.attempt_local1083459072_0001_m_000000_0 exceeds the rate limit. Please reduce the rate of create, update, and delete requests."
> }
> at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
> at
com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:301)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
>
> Looking at the code, a distCp mapper gets a list of files to copy from the source to the target filesystem. The mapper handles each file in its list sequentially: it first creates/overwrites a temp file (*.distcp.tmp.attempt_local1083459072_0001_m_000000_0*), then copies the src file to the temp file, and finally renames the temp file to the actual target file.
> The temp file name (which contains the task ID) is reused for all the files in the mapper's batch. GCS enforces a rate limit on the number of operations per second on any single object, and even though we are actually creating a new file and renaming it to the final target, GCS treats this as repeated changes to the same object.
> While it is possible to play around with the number of maps, split size, etc., it is hard to derive those values from a rate limit.
> Thus, we propose adding a flag to allow the DistCp mapper to use a different temp file PER file.
> Thoughts? (cc/[~steve_l], [~benoyantony])
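The naming change proposed above can be illustrated with a minimal standalone sketch. This is not distcp's actual code (in current Hadoop sources the temp path is built inside distcp's copy command, e.g. {{RetriableFileCopyCommand}}); the class and method names below are hypothetical, chosen only to contrast the reused per-attempt temp name with a unique per-file one.

```java
import java.util.UUID;

// Hypothetical sketch of the proposed per-file temp-target naming.
// Today distcp derives one temp name from the task attempt ID and reuses
// it for every file in the mapper's batch; the proposal is to make the
// name unique per copied file, so an object store never sees repeated
// create/rename churn against the same object key.
public class TempTargetSketch {

    // Current behavior (simplified): one temp name per task attempt,
    // shared by every file the mapper copies.
    static String perAttemptTmpName(String attemptId) {
        return ".distcp.tmp." + attemptId;
    }

    // Proposed behavior (sketch): a fresh temp name for each file, here
    // via a random suffix. A real patch would presumably gate this
    // behind a new DistCp option.
    static String perFileTmpName(String attemptId) {
        return ".distcp.tmp." + attemptId + "." + UUID.randomUUID();
    }

    public static void main(String[] args) {
        String attempt = "attempt_local1083459072_0001_m_000000_0";
        // Same name for every file -> same object key -> GCS per-object
        // rate limit (the 429 "rateLimitExceeded" above).
        System.out.println(perAttemptTmpName(attempt));
        // Unique name per file -> distinct object keys, so the limit no
        // longer applies to the batch as a whole.
        System.out.println(perFileTmpName(attempt));
        System.out.println(perFileTmpName(attempt));
    }
}
```

The random suffix is only one way to get uniqueness; deriving the suffix from the target file's path would be another, and would keep temp names deterministic across task retries.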