[ 
https://issues.apache.org/jira/browse/HADOOP-16260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820447#comment-16820447
 ] 

Gabor Bota commented on HADOOP-16260:
-------------------------------------

Have you checked this: 
https://aajisaka.github.io/hadoop-document/hadoop-project/hadoop-distcp/DistCp.html#DistCp_and_Object_Stores
That page covers using {{-direct}} with distcp, which writes directly to the 
target and skips the temporary-file rename step when the destination is an 
object store.
It may be useful for your case.
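To make that concrete, a direct-write run would look something like the sketch below. The paths are placeholders for your setup, and {{-direct}} is only available in Hadoop releases whose DistCp supports the no-rename copy option:

```shell
# Copy straight to the object store, skipping the temp-file
# create/rename cycle that triggers the per-object rate limit.
# Source and destination paths below are placeholders.
hadoop distcp -direct hdfs://namenode/analytics gs://app/analytics
```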

> Allow Distcp to create a new tempTarget file per File
> -----------------------------------------------------
>
>                 Key: HADOOP-16260
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16260
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Arun Suresh
>            Priority: Major
>
> We use distcp to copy entire HDFS clusters to GCS.
>  In the process, we hit the following error:
> {noformat}
> INFO: Encountered status code 410 when accessing URL 
> https://www.googleapis.com/upload/storage/v1/b/app/o?ifGenerationMatch=0&name=analytics/.distcp.tmp.attempt_local1083459072_0001_m_000000_0&uploadType=resumable&upload_id=AEnB2Uq4mZeZxXgs2Mhx0uskNpZ4Cka8pT4aCcd7v6UC4TDQx-h0uEFWoPpdOO4pWEdmaKnhTjxVva5Ow4vXbTe6_JScIU5fsQSaIwNkF3D84DHjtuhKSCU.
>  Delegating to response handler for possible retry.
> Apr 14, 2019 5:53:17 AM 
> com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation
>  call
> SEVERE: Exception not convertible into handled response
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException:
>  410 Gone
> {
>   "code" : 429,
>   "errors" : [ {
>     "domain" : "usageLimits",
>     "message" : "The total number of changes to the object 
> app/folder/.distcp.tmp.attempt_local1083459072_0001_m_000000_0 exceeds the 
> rate limit. Please reduce the rate of create, update, and delete requests.",
>     "reason" : "rateLimitExceeded"
>   } ],
>   "message" : "The total number of changes to the object 
> app/folder/.distcp.tmp.attempt_local1083459072_0001_m_000000_0 exceeds the 
> rate limit. Please reduce the rate of create, update, and delete requests."
> }
>        at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
>         at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
>         at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
>         at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
>         at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
>         at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
>         at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:301)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>  
> {noformat}
> Looking at the code, a DistCp mapper gets a list of files to copy from the 
> source to the target filesystem. The mapper handles each file in its list 
> sequentially: it first creates/overwrites a temp file 
> (*.distcp.tmp.attempt_local1083459072_0001_m_000000_0*), then copies the 
> source file into the temp file, and finally renames the temp file to the 
> actual target file.
>  The temp file name (which contains the task ID) is reused for all the files 
> in the mapper's batch. It looks like GCS enforces a rate limit on the number 
> of operations per second on any single object; even though we are actually 
> creating a new file each time and renaming it to the final target, GCS 
> treats these as repeated changes to the same object.
> While it is possible to tune the number of maps, the split size, etc., it is 
> hard to derive those values from any given rate limit.
> Thus, we propose adding a flag to allow the DistCp mapper to use a different 
> temp file PER file.
> Thoughts ? (cc/[~steve_l], [~benoyantony])
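The proposed per-file temp name could be sketched as follows (Python for brevity; DistCp itself is Java, and the helper name here is hypothetical, not an existing DistCp API):

```python
import uuid

def per_file_temp_path(target_dir, attempt_id):
    # Hypothetical helper: instead of one shared
    # .distcp.tmp.attempt_<id> object name reused for every file in
    # the mapper's batch, derive a fresh temp name per copied file,
    # so no single object is created/renamed repeatedly.
    return f"{target_dir}/.distcp.tmp.{attempt_id}.{uuid.uuid4().hex}"

# Two copies in the same task attempt now target distinct temp
# objects, so per-object mutation rate limits no longer accumulate
# on one name.
a = per_file_temp_path("gs://app/folder",
                       "attempt_local1083459072_0001_m_000000_0")
b = per_file_temp_path("gs://app/folder",
                       "attempt_local1083459072_0001_m_000000_0")
```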



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
