Hello Sebastian,

This is an interesting finding.  Thank you for reporting it.

Are you able to share a bit more about your deployment architecture?  Are these 
EC2 VMs?  If so, are they co-located in the same AWS region as the S3 bucket?  
If the cluster is not running in EC2 (e.g. on-premises physical hardware), then 
are there any notable differences on nodes that experienced this problem (e.g. 
smaller capacity on the outbound NIC)?

This is just a theory, but if your bandwidth to the S3 service is 
intermittently saturated, throttled, or otherwise compromised, then I could 
see how longer timeouts and more retries might increase overall job time.  
With the shorter settings, individual task attempts would fail sooner.  If 
the next attempt then gets scheduled on a different node with better 
bandwidth to S3, it would start making progress sooner, and the effect on 
overall job execution could be faster completion.
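As a rough back-of-envelope illustration (assuming, as a simplification, that the connection timeout applies to each retry attempt independently, and ignoring backoff), the worst case for how long one read can stay blocked is roughly timeout x attempts:

```python
def worst_case_minutes(timeout_ms: int, attempts: int) -> float:
    """Rough upper bound on how long a single read can stay blocked,
    assuming the connection timeout applies per retry attempt
    (a simplification of the real retry/backoff behavior)."""
    return timeout_ms * attempts / 60_000

# Defaults after HADOOP-12346: 200000 ms timeout, 20 attempts
print(worst_case_minutes(200_000, 20))  # ~66.7 minutes

# Your tuned values: 30000 ms timeout, 5 attempts
print(worst_case_minutes(30_000, 5))    # 2.5 minutes
```

Interestingly, the ~67 minutes from the defaults lines up with the "up to one hour" hangs you reported, which is at least consistent with the theory.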

--Chris Nauroth

On 8/7/16, 12:12 PM, "Sebastian Nagel" <wastl.na...@googlemail.com> wrote:

    Hi,
    
    recently, after upgrading to CDH 5.8.0, I've run into a performance
    issue when reading data from AWS S3 (via s3a).
    
    A job [1] reads 10,000s of files ("objects") from S3 and writes
    extracted data back to S3. Every file/object is about 1 GB in size, processing
    is CPU-intensive and takes a couple of minutes per file/object. Each
    file/object is processed by one task using FilenameInputFormat.
    
    After the upgrade to CDH 5.8.0, the job showed slow progress, 5-6
    times slower overall than in previous runs. A significant number
    of tasks hung up without progress for up to one hour. These tasks were
    dominating and most nodes in the cluster showed little or no CPU
    utilization. Tasks are not killed/restarted because the task timeout
    is set to a very large value (because S3 is known to be slow
    sometimes). Attaching to a couple of the hung tasks with jstack
    showed that these tasks hang when reading from S3 [3].
    
    The problem was finally fixed by setting
      fs.s3a.connection.timeout = 30000  (default: 200000 ms)
      fs.s3a.attempts.maximum = 5        (default: 20)
    Tasks now take 20 min. in the worst case; the majority finishes within
    minutes.
    
    Is this the correct way to fix the problem?
    These settings have been increased recently in HADOOP-12346 [2].
    What could be the drawbacks of a lower timeout?
    
    Thanks,
    Sebastian
    
    [1] https://github.com/commoncrawl/ia-hadoop-tools/blob/master/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java
    
    [2] https://issues.apache.org/jira/browse/HADOOP-12346
    
    [3] "main" prio=10 tid=0x00007fad64013000 nid=0x4ab5 runnable [0x00007fad6b274000]
       java.lang.Thread.State: RUNNABLE
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.read(SocketInputStream.java:152)
            at java.net.SocketInputStream.read(SocketInputStream.java:122)
            at com.cloudera.org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:204)
            at com.cloudera.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:182)
            at com.cloudera.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.cloudera.com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.cloudera.com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at com.cloudera.com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:108)
            at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
            at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:160)
            - locked <0x00000007765604f8> (a org.apache.hadoop.fs.s3a.S3AInputStream)
            at java.io.DataInputStream.read(DataInputStream.java:149)
            ...
    
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
    For additional commands, e-mail: user-h...@hadoop.apache.org
    
    
    
