Re: Slow read from S3 on CDH 5.8.0 (includes HADOOP-12346)

2016-08-20 Thread max scalf
Just out of curiosity, have you enabled an S3 endpoint for this? Hopefully you
are running this cluster inside a VPC; if so, a VPC endpoint would help, as the
S3 traffic would then not go out over the public Internet.
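
As a rough sketch (the VPC, route table, and region below are placeholders,
not taken from your setup), a gateway endpoint for S3 can be created with the
AWS CLI:

  # create an S3 gateway endpoint so S3 traffic stays on the AWS network
  aws ec2 create-vpc-endpoint \
      --vpc-id vpc-0123456789abcdef0 \
      --service-name com.amazonaws.us-east-1.s3 \
      --route-table-ids rtb-0123456789abcdef0

After that, traffic from subnets using that route table reaches S3 without
going through the Internet gateway or NAT.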

Also, have any new policies been put in place on your S3 bucket? Others have
mentioned something about throttling.

On Wed, Aug 17, 2016, 3:22 PM Sebastian Nagel 
wrote:

> Hi Dheeren, hi Chris,
>
>
> >> Are you able to share a bit more about your deployment architecture?
> Are these EC2 VMs?  If so,
> are they co-located in the same AWS region as the S3 bucket?
>
> Running a cluster of 100 m1.xlarge EC2 instances with Ubuntu 14.04
> (ami-41a20f2a).
> The cluster is running in a single availability zone (us-east-1d), the S3
> bucket
> is in the same region (us-east-1).
>
> % lsb_release -d
> Description:Ubuntu 14.04.3 LTS
>
> % uname -a
> Linux ip-10-91-235-121 3.13.0-61-generic #100-Ubuntu SMP Wed Jul 29
> 11:21:34 UTC 2015 x86_64 x86_64
> x86_64 GNU/Linux
>
> > Did you change the Java JDK version as well, as part of the upgrade?
>
> Java is taken as provided by Ubuntu:
>
> % java -version
> java version "1.7.0_111"
> OpenJDK Runtime Environment (IcedTea 2.6.7) (7u111-2.6.7-0ubuntu0.14.04.3)
> OpenJDK 64-Bit Server VM (build 24.111-b01, mixed mode)
>
> Cloudera CDH is installed from
>
> http://archive.cloudera.com/cdh5/one-click-install/trusty/amd64/cdh5-repository_1.0_all.deb
>
> After the jobs are done, the cluster is shut down and later bootstrapped anew
> on demand (bash + cloud-init).
> A new launch of the cluster may, of course, include updates of
>  - the underlying Amazon machine image
>  - Ubuntu packages
>  - Cloudera packages
>
> And the real reason for the problem may come from any of these changes.
> The update to Cloudera CDH 5.8.0 is just the most obvious candidate, since
> the problems appeared around the same time (first seen 2016-08-01).
>
> >> If the cluster is not running in EC2 (e.g. on-premises physical
> hardware), then are there any
> notable differences on nodes that experienced this problem (e.g. smaller
> capacity on the outbound NIC)?
>
> Probably not, although I cannot exclude it. Over the last few days I have run
> into problems which could be related: a few tasks are slow or even seem to
> hang, e.g., reducers during the copy phase. But that also looks more like a
> Hadoop (configuration) problem. Network throughput between nodes, measured
> with iperf, is not great but generally ok (5-20 MBit/s).
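>
> For reference, a minimal sketch of how I measure it (hostnames below are
> placeholders): run an iperf server on one node and a client on another:
>
>   # on node A
>   iperf -s
>   # on node B, measure throughput to node A for 30 seconds
>   iperf -c node-a.internal -t 30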
>
 >> This is just a theory, but if your bandwidth to the S3 service is
> intermittently saturated or
> throttled or somehow compromised, then I could see how longer timeouts and
> more retries might
> increase overall job time.  With the shorter settings, it might cause
> individual task attempts to
> fail sooner.  Then, if the next attempt gets scheduled to a different node
> with better bandwidth to
> S3, it would start making progress faster in the second attempt.  Then,
> the effect on overall job
> execution might be faster.
>
> That's also my assumption. When connecting to S3, a server is selected that
> is fast at that moment. While copying 1 GB, which takes a couple of minutes
> simply because of general network throughput, that server may become more
> loaded. On reconnecting, a better server is chosen.
>
> Btw., tasks do not fail when a moderate timeout is chosen - 30 sec. is ok;
> with lower values (a few seconds) the file uploads frequently fail.
>
> I've seen this behavior with a simple distcp from S3: with the default
> values, it took 1 day to copy
> 300 GB from S3 to HDFS. After choosing a shorter timeout the job finished
> within 5 hours.
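>
> For reference, a sketch of the s3a settings I mean (the values below are only
> illustrative, roughly matching the 30 sec. mentioned above; they go into
> core-site.xml):
>
>   <property>
>     <!-- socket timeout for S3 connections, in milliseconds -->
>     <name>fs.s3a.connection.timeout</name>
>     <value>30000</value>
>   </property>
>   <property>
>     <!-- connection setup timeout, in milliseconds -->
>     <name>fs.s3a.connection.establish.timeout</name>
>     <value>30000</value>
>   </property>
>   <property>
>     <!-- maximum number of retry attempts on transient errors -->
>     <name>fs.s3a.attempts.maximum</name>
>     <value>10</value>
>   </property>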
>
> Thanks,
> Sebastian
>
> On 08/16/2016 09:11 PM, Dheeren Bebortha wrote:
> > Did you change the Java JDK version as well, as part of the upgrade?
> > Dheeren
> >
> >> On Aug 16, 2016, at 11:59 AM, Chris Nauroth 
> wrote:
> >>
> >> Hello Sebastian,
> >>
> >> This is an interesting finding.  Thank you for reporting it.
> >>
> >> Are you able to share a bit more about your deployment architecture?
> Are these EC2 VMs?  If so, are they co-located in the same AWS region as
> the S3 bucket?  If the cluster is not running in EC2 (e.g. on-premises
> physical hardware), then are there any notable differences on nodes that
> experienced this problem (e.g. smaller capacity on the outbound NIC)?
> >>
> >> This is just a theory, but if your bandwidth to the S3 service is
> intermittently saturated or throttled or somehow compromised, then I could
> see how longer timeouts and more retries might increase overall job time.
> With the shorter settings, it might cause individual task attempts to fail
> sooner.  Then, if the next attempt gets scheduled to a different node with
> better bandwidth to S3, it would start making progress faster in the second
> attempt.  Then, the effect on overall job execution might be faster.
> >>
> >> --Chris Nauroth
> >>
> >> On 8/7/16, 12:12 PM, "Sebastian Nagel" 
> wrote:
> >>
> >>Hi,
> >>
> >>  

Re: ACCEPTED: waiting for AM container to be allocated, launched and register with RM

2016-08-20 Thread Sunil Govind
Hi Ram

From the console log, as Rohith said, the AM is looking for the RM at port
8030, so please confirm the RM scheduler port.
Could you please also share the AM and RM logs?

Thanks
Sunil

On Sat, Aug 20, 2016 at 10:36 AM rammohan ganapavarapu <
rammohanga...@gmail.com> wrote:

> Yes, I did configure it.
>
> On Aug 19, 2016 7:22 PM, "Rohith Sharma K S" 
> wrote:
>
>> Hi
>>
>> From the discussion below and the AM logs, I see that the AM container has
>> launched but is not able to connect to the RM.
>>
>> This looks like a configuration issue. Would you check in your job.xml
>> whether *yarn.resourcemanager.scheduler.address* has been configured?
>>
>> Essentially, this address is required by the MRAppMaster for connecting to
>> the RM for heartbeats. If you do not configure it, the default value will be
>> used, i.e. port 8030.
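>>
>> For example (a sketch only; "rm-host.example.com" is a placeholder for your
>> actual ResourceManager host), in yarn-site.xml:
>>
>>   <property>
>>     <!-- address the AM uses to talk to the RM scheduler -->
>>     <name>yarn.resourcemanager.scheduler.address</name>
>>     <value>rm-host.example.com:8030</value>
>>   </property>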
>>
>>
>> Thanks & Regards
>> Rohith Sharma K S
>>
>> On Aug 20, 2016, at 7:02 AM, rammohan ganapavarapu <
>> rammohanga...@gmail.com> wrote:
>>
>> Even if the cluster doesn't have enough resources, it shouldn't be
>> connecting to "/0.0.0.0:8030", right? It should connect to my configured RM
>> address, so I am not sure why it is trying to connect to 0.0.0.0:8030.
>>
>> I have verified the config and removed all traces of 0.0.0.0, but still no luck.
>>
>> org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at 
>> /0.0.0.0:8030
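>>
>> One way I double-check (assuming the standard config directory; the path is
>> a guess, adjust it for your install) is to grep the yarn-site.xml the client
>> actually picks up:
>>
>>   grep -A1 "yarn.resourcemanager.scheduler.address" /etc/hadoop/conf/yarn-site.xml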
>>
>> If anyone has any clue, please share.
>>
>> Thanks,
>>
>> Ram
>>
>>
>>
>> On Fri, Aug 19, 2016 at 2:32 PM, rammohan ganapavarapu <
>> rammohanga...@gmail.com> wrote:
>>
>>> When I submit a job directly using yarn it seems to work; it only fails
>>> when submitted through Oozie, I guess, and I am not sure what is missing.
>>>
>>> yarn jar
>>> /uap/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi
>>> 20 1000
>>> Number of Maps  = 20
>>> Samples per Map = 1000
>>> .
>>> .
>>> .
>>> Job Finished in 19.622 seconds
>>> Estimated value of Pi is 3.1428
>>>
>>> Ram
>>>
>>> On Fri, Aug 19, 2016 at 11:46 AM, rammohan ganapavarapu <
>>> rammohanga...@gmail.com> wrote:
>>>
 Ok, I have used yarn-utils.py to get the correct values for my cluster,
 updated those properties, and restarted the RM and NM, but still no luck. Not
 sure what I am missing; any other insights would help.

 Below are my properties from yarn-site.xml and mapred-site.xml.

 python yarn-utils.py -c 24 -m 63 -d 3 -k False
  Using cores=24 memory=63GB disks=3 hbase=False
  Profile: cores=24 memory=63488MB reserved=1GB usableMem=62GB disks=3
  Num Container=6
  Container Ram=10240MB
  Used Ram=60GB
  Unused Ram=1GB
  yarn.scheduler.minimum-allocation-mb=10240
  yarn.scheduler.maximum-allocation-mb=61440
  yarn.nodemanager.resource.memory-mb=61440
  mapreduce.map.memory.mb=5120
  mapreduce.map.java.opts=-Xmx4096m
  mapreduce.reduce.memory.mb=10240
  mapreduce.reduce.java.opts=-Xmx8192m
  yarn.app.mapreduce.am.resource.mb=5120
  yarn.app.mapreduce.am.command-opts=-Xmx4096m
  mapreduce.task.io.sort.mb=1024
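
 (For what it's worth, those numbers are internally consistent: 6 containers x
 10240 MB = 61440 MB, which matches yarn.nodemanager.resource.memory-mb and
 yarn.scheduler.maximum-allocation-mb below.)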


 
 <property>
   <name>mapreduce.map.memory.mb</name>
   <value>5120</value>
 </property>
 <property>
   <name>mapreduce.map.java.opts</name>
   <value>-Xmx4096m</value>
 </property>
 <property>
   <name>mapreduce.reduce.memory.mb</name>
   <value>10240</value>
 </property>
 <property>
   <name>mapreduce.reduce.java.opts</name>
   <value>-Xmx8192m</value>
 </property>
 <property>
   <name>yarn.app.mapreduce.am.resource.mb</name>
   <value>5120</value>
 </property>
 <property>
   <name>yarn.app.mapreduce.am.command-opts</name>
   <value>-Xmx4096m</value>
 </property>
 <property>
   <name>mapreduce.task.io.sort.mb</name>
   <value>1024</value>
 </property>



  
 <property>
   <name>yarn.scheduler.minimum-allocation-mb</name>
   <value>10240</value>
 </property>

 <property>
   <name>yarn.scheduler.maximum-allocation-mb</name>
   <value>61440</value>
 </property>

 <property>
   <name>yarn.nodemanager.resource.memory-mb</name>
   <value>61440</value>
 </property>


 Ram

 On Thu, Aug 18, 2016 at 11:14 PM, tkg_cangkul 
 wrote:

> Maybe this link can serve as a reference for tuning the cluster:
>
>
> http://jason4zhu.blogspot.co.id/2014/10/memory-configuration-in-hadoop.html
>
>
> On 19/08/16 11:13, rammohan ganapavarapu wrote:
>
> Do you know what properties to tune?
>
> Thanks,
> Ram
>
> On Thu, Aug 18, 2016 at 9:11 PM, tkg_cangkul 
> wrote:
>
>> I think that's because you don't have enough resources. You can tune your
>> cluster config to make better use of your resources.
>>
>>
>> On 19/08/16 11:03, rammohan ganapavarapu wrote:
>>
>> I don't see anything odd except this; not sure if I have to worry
>> about it or not.
>>
>> 2016-08-19 03:29:26,621 INFO [main]
>> org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at /
>> 0.0.0.0:8030
>> 2016-08-19 03:29:27,646 INFO [main] org.apache.hadoop.ipc.Client:
>> Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 0
>> time(s); retry policy is RetryUpToMaximumCo