Implementing security in hadoop

2014-08-12 Thread Chhaya Vishwakarma
Hi,

I'm trying to implement security for my Hadoop data. I'm using Cloudera Hadoop.

Below are the two specific things I'm looking for:

1. Role-based authorization and authentication

2. Encryption of data residing in HDFS

I have looked into Kerberos, but it doesn't provide encryption for data already
residing in HDFS.

Are there any other security tools I can use? Has anyone implemented the above two
security features in Cloudera Hadoop?

Please suggest


Regards,
Chhaya Vishwakarma
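One generic direction for the second requirement, sketched below only as a minimal example: encrypt the stream on the client before the data ever reaches HDFS, using plain javax.crypto plus the HDFS client API. The key handling, target path, and class name are placeholders, not a feature of any particular distribution.

import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EncryptedHdfsWrite {
  public static void main(String[] args) throws Exception {
    // Hypothetical 16-byte AES key; in practice this would come from a key store.
    byte[] keyBytes = new byte[16];
    new SecureRandom().nextBytes(keyBytes);
    SecretKeySpec key = new SecretKeySpec(keyBytes, "AES");

    byte[] iv = new byte[16];
    new SecureRandom().nextBytes(iv);
    Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
    cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));

    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical target path.
    FSDataOutputStream raw = fs.create(new Path("/secure/data.enc"), true);
    raw.write(iv); // store the IV in the clear at the head of the file
    try (CipherOutputStream out = new CipherOutputStream(raw, cipher)) {
      out.write("sensitive record".getBytes("UTF-8"));
    }
  }
}

Data already in HDFS would still have to be rewritten through such a stream; Kerberos itself only addresses authentication, not encryption at rest.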





RE: fair scheduler not working as intended

2014-08-12 Thread Henry Hung
Hi Yehia,

Oh? I thought that by using maxResources = 15360 mb (3072 mb * 5), vcores = 5,
and maxMaps = 5, I was already restricting the job to use only 5 maps at most.

The reason is that my long-running job has 841 maps, and each map processes data for
almost 2 hours.
In the meantime there will be some short jobs that only need a couple of minutes
to complete.
That is why I use the fair scheduler to split resources into 2 groups, one default
and the other longrun.
I want to make sure there are always resources available to be used by the
short jobs.

If your explanation is correct, then the current fair scheduler behavior is not what I
want.
So is there any other way to set up YARN resources to accommodate the short and
long-running jobs?
Or do I need to create 2 separate YARN clusters? (I have been thinking about
this approach.)

Best regards,
Henry
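
A small sketch of how the queue's usage could be watched from code while the long job runs, assuming the standard YarnClient API and the queue name "longrun" from the fair-scheduler.xml quoted below:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.QueueInfo;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class QueueWatch {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new Configuration());
    yarn.start();
    try {
      // "longrun" is the queue defined in fair-scheduler.xml.
      QueueInfo q = yarn.getQueueInfo("longrun");
      System.out.printf("configured capacity: %.2f, current capacity: %.2f%n",
          q.getCapacity(), q.getCurrentCapacity());
    } finally {
      yarn.stop();
    }
  }
}

Comparing getCapacity() with getCurrentCapacity() over time shows whether the queue really stays within the share you expect.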

From: Yehia Elshater [mailto:y.z.elsha...@gmail.com]
Sent: Wednesday, August 13, 2014 11:27 AM
To: user@hadoop.apache.org
Subject: Re: fair scheduler not working as intended

Hi Henry,

Are there any applications (on queues other than the longrun queue)
running at the same time? I think FairScheduler is going to assign more
resources to your "longrun" queue as long as there are no other applications running
in the other queues.

Thanks
Yehia

On 12 August 2014 20:30, Henry Hung <ythu...@winbond.com> wrote:
Hi Everyone,

I’m using Hadoop-2.2.0 with fair scheduler in my YARN cluster, but something is 
wrong with the fair scheduler.

Here is what my fair-scheduler.xml looks like:


  
15360 mb, 5 vcores
0.5
2
5
1
  


I created a "longrun" queue to ensure that the huge MR application can only use 5
containers. In my YARN setup, each container's memory is 3072 MB:

  
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>3072</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>3072</value>
</property>

When the huge application started, it worked just fine and the scheduler restricted it
to running only 5 maps in parallel.
But after running for some time, the application ran 10 maps in parallel.
The scheduler page shows that the "longrun" queue used 66%, exceeding its fair
share of 30%.

Can anyone tell me why the application can get more than it deserves?
Is the problem in my configuration, or is there a bug?

Best regards,
Henry Hung




Re: MR AppMaster unable to load native libs

2014-08-12 Thread Susheel Kumar Gadalay
I have also gotten this message when running 2.4.1.

I found that the native libraries in $HADOOP_HOME/lib/native are 32
bit, not 64 bit.

You can recompile and build 64-bit shared objects, but it is a
lengthy exercise.
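
If the 64-bit libraries are in place but the task JVMs still report the warning, one thing worth trying (a hedged sketch; the property names below are the usual MR2 ones, and the native directory path is a placeholder) is to pass java.library.path explicitly to the AM and task JVMs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NativeLibJobSetup {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    String nativeDir = "/opt/hadoop/lib/native"; // hypothetical path to the 64-bit libs
    // Pass the native library directory to the MR ApplicationMaster and task JVMs.
    conf.set("yarn.app.mapreduce.am.command-opts",
        "-Xmx1024m -Djava.library.path=" + nativeDir);
    conf.set("mapreduce.map.java.opts",
        "-Xmx1024m -Djava.library.path=" + nativeDir);
    conf.set("mapreduce.reduce.java.opts",
        "-Xmx1024m -Djava.library.path=" + nativeDir);
    return Job.getInstance(conf, "native-lib-check");
  }
}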

On 8/13/14, Subroto Sanyal  wrote:
> Hi,
>
> I am running a single node hadoop cluster 2.4.1.
> When I submit a MR job it logs a warning:
> 2014-08-12 21:38:22,173 WARN [main] org.apache.hadoop.util.NativeCodeLoader:
> Unable to load native-hadoop library for your platform... using builtin-java
> classes where applicable
>
> The problem doesn’t comes up when starting any hadoop daemons.
> Do I need to pass any specific configuration so that the child jvm is able
> to pick up the native lib folder?
>
> Cheers,
> Subroto Sanyal
>


Re: fair scheduler not working as intended

2014-08-12 Thread Yehia Elshater
Hi Henry,

Are there any applications (on queues other than the longrun queue)
running at the same time? I think FairScheduler is going to assign
more resources to your "longrun" queue as long as there are no other applications
running in the other queues.

Thanks
Yehia


On 12 August 2014 20:30, Henry Hung  wrote:

>  Hi Everyone,
>
>
>
> I’m using Hadoop-2.2.0 with fair scheduler in my YARN cluster, but
> something is wrong with the fair scheduler.
>
>
>
> Here is my fair-scheduler.xml looks like:
>
>
>
> 
>
>   
>
> 15360 mb, 5 vcores
>
> 0.5
>
> 2
>
> 5
>
> 1
>
>   
>
> 
>
>
>
> I create a “longrun” queue to ensure that huge MR application can only use
> 5 resources. My YARN setup for each resource memory is 3072 MB:
>
>
>
>   
>
> mapreduce.map.memory.mb
>
> 3072
>
>   
>
>   
>
> mapreduce.reduce.memory.mb
>
> 3072
>
>   
>
>
>
> When the huge application started, it works just fine and scheduler
> restrict it to only run 5 maps in parallel.
>
> But after running for some time, the application run 10 maps in parallel.
>
> The scheduler page show that the “longrun” queue used 66%, exceed the fair
> share 30%.
>
>
>
> Can anyone tell me why the application can get more than it deserved?
>
> Is the problem with my configuration? Or there is a bug?
>
>
>
> Best regards,
>
> Henry Hung
>
>


Re: How to use docker in Hadoop, with patch of YARN-1964?

2014-08-12 Thread sam liu
After applying this patch, I added the following config in yarn-site.xml:
  
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor</value>
</property>

Then I can start the NodeManager with DockerContainerExecutor enabled. But it
failed to execute a simple MR job, and the exception is 'Cannot connect to
the Docker daemon. Is 'docker -d' running on this host?'. Below is the
full exception info; could anybody give me some hints?

14/08/12 20:06:14 INFO mapreduce.Job: Job job_1407899030909_0002 running in
uber mode : false
14/08/12 20:06:14 INFO mapreduce.Job:  map 0% reduce 0%
14/08/12 20:06:14 INFO mapreduce.Job: Job job_1407899030909_0002 failed
with state FAILED due to: Application application_1407899030909_0002
failed 2 times due to AM Container for appattempt_1407899030909_0002_02
exited with  exitCode: 1 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException: Warning: '-rm' is
deprecated, it will be replaced by '--rm' soon. See usage.
Warning: '-name' is deprecated, it will be replaced by '--name' soon. See
usage.
2014/08/12 20:06:13 Cannot connect to the Docker daemon. Is 'docker -d'
running on this host?

at org.apache.hadoop.util.Shell.runCommand(Shell.java:444)
at org.apache.hadoop.util.Shell.run(Shell.java:359)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:569)
at
org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor.launchContainer(DockerContainerExecutor.java:174)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:319)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:314)
at java.util.concurrent.FutureTask.run(FutureTask.java:149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:897)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)
at java.lang.Thread.run(Thread.java:738)

Thanks!


2014-08-12 2:18 GMT-07:00 sam liu :

> Hi Experts,
>
> I am very interesting that Hadoop could work with Docker and doing some
> trial on patch of YARN-1964.
>
> I applied patch yarn-1964-branch-2.2.0-docker.patch of jira YARN-1964 on
> branch 2.2 and am going to install a Hadoop cluster using the new generated
> tarball including the patch.
>
> Then, I think I can use DockerContainerExecutor, but I do not know much
> details on the usage and have following questions:
>
>
>
> *1. After installation, What's the detailed config steps to adopt
> DockerContainerExecutor? *
>
>
> *2. How to verify whether a MR task is really launched in Docker container
> not Yarn container?*
> *3. Which hadoop branch will officially include Docker support?*
>
> Thanks a lot!
>


Re: Synchronization among Mappers in map-reduce task

2014-08-12 Thread Wangda Tan
Hi Saurabh,

>> am not sure making overwrite=false , will solve the problem. As per java
doc by making overwrite=false , it will throw an exception if the file
already exists. So, for all the remaining mappers it will throw an
exception.
You can catch the exception and wait.
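
A minimal sketch of that create-as-lock idea (the flag-file name, paths, and retry interval are placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateLock {
  // Returns true if this task won the "lock" and wrote the file.
  public static boolean writeIfFirst(FileSystem fs, Path file, byte[] data)
      throws IOException, InterruptedException {
    Path done = new Path(file.getParent(), "completed"); // flag file
    try (FSDataOutputStream out = fs.create(file, false)) { // overwrite = false
      out.write(data);
    } catch (IOException alreadyExists) {
      // Typically FileAlreadyExistsException: another mapper is writing.
      // Wait for its "completed" flag instead of failing.
      while (!fs.exists(done)) {
        Thread.sleep(1000);
      }
      return false;
    }
    fs.create(done, false).close(); // publish the flag once the write is finished
    return true;
  }
}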

>> Can you please refer to me some source or link on ZK , that can help me
in solving the problem.
You can check this: http://zookeeper.apache.org/doc/r3.4.6/recipes.html

Thanks,
Wangda



On Wed, Aug 13, 2014 at 9:34 AM, saurabh jain  wrote:

> Hi Wangda ,
>
> I am not sure making overwrite=false , will solve the problem. As per java
> doc by making overwrite=false , it will throw an exception if the file
> already exists. So, for all the remaining mappers it will throw an
> exception.
>
> Also I am very new to ZK and have very basic knowledge of it , I am not
> sure if it can solve the problem and if yes how. I am still going through
> available sources on the ZK.
>
> Can you please refer to me some source or link on ZK , that can help me in
> solving the problem.
>
> Best
> Saurabh
>
> On Tue, Aug 12, 2014 at 3:08 AM, Wangda Tan  wrote:
>
>> Hi Saurabh,
>> It's an interesting topic,
>>
>> >> So , here is the question , is it possible to make sure that when one
>> of
>> the mapper tasks is writing to a file , other should wait until the first
>> one is finished. ? I read that all the mappers task don't interact with
>> each other
>>
>> A simple way to do this is using HDFS namespace:
>> Create file using "public FSDataOutputStream create(Path f, boolean
>> overwrite)", overwrite=false. Only one mapper can successfully create
>> file.
>>
>> After write completed, the mapper will create a flag file like "completed"
>> in the same folder. Other mappers can wait for the "completed" file
>> created.
>>
>> >> Is there any way to have synchronization between two independent map
>> reduce jobs?
>> I think ZK can do some complex synchronization here. Like mutex, master
>> election, etc.
>>
>> Hope this helps,
>>
>> Wangda Tan
>>
>>
>>
>>
>> On Tue, Aug 12, 2014 at 10:43 AM, saurabh jain 
>> wrote:
>>
>> > Hi Folks ,
>> >
>> > I have been writing a map-reduce application where I am having an input
>> > file containing records and every field in the record is separated by
>> some
>> > delimiter.
>> >
>> > In addition to this user will also provide a list of columns that he
>> wants
>> > to lookup in a master properties file (stored in HDFS). If this columns
>> > (lets say it a key) is present in master properties file then get the
>> > corresponding value and update the key with this value and if the key is
>> > not present it in the master properties file then it will create a new
>> > value for this key and will write to this property file and will also
>> > update in the record.
>> >
>> > I have written this application , tested it and everything worked fine
>> > till now.
>> >
>> > *e.g :* *I/P Record :* This | is | the | test | record
>> >
>> > *Columns :* 2,4 (that means code will look up only field *"is" and
>> "test"* in
>> > the master properties file.)
>> >
>> > Here , I have a question.
>> >
>> > *Q 1:* In the case when my input file is huge and it is splitted across
>> > the multiple mappers , I was getting the below mentioned exception where
>> > all the other mappers tasks were failing. *Also initially when I started
>> > the job my master properties file is empty.* In my code I have a check
>> if
>> > this file (master properties) doesn't exist create a new empty file
>> before
>> > submitting the job itself.
>> >
>> > e.g : If i have 4 splits of data , then 3 map tasks are failing. But
>> after
>> > this all the failed map tasks restarts and finally the job become
>> > successful.
>> >
>> > So , *here is the question , is it possible to make sure that when one
>> of
>> > the mapper tasks is writing to a file , other should wait until the
>> first
>> > one is finished. ?* I read that all the mappers task don't interact with
>> > each other.
>> >
>> > Also what will happen in the scenario when I start multiple parallel
>> > map-reduce jobs and all of them working on the same properties files.
>> *Is
>> > there any way to have synchronization between two independent map reduce
>> > jobs*?
>> >
>> > I have also read that ZooKeeper can be used in such scenarios , Is that
>> > correct ?
>> >
>> >
>> > Error:
>> com.techidiocy.hadoop.filesystem.api.exceptions.HDFSFileSystemException:
>> IOException - failed while appending data to the file ->Failed to create
>> file [/user/cloudera/lob/master/bank.properties] for
>> [DFSClient_attempt_1407778869492_0032_m_02_0_1618418105_1] on client
>> [10.X.X.17], because this file is already being created by
>> > [DFSClient_attempt_1407778869492_0032_m_05_0_-949968337_1] on
>> [10.X.X.17]
>> > at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2548)
>> > at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.a

Re: Making All datanode down

2014-08-12 Thread Gordon Wang
Did you try to close the file and reopen it for writing after the datanodes
restart?

I think if you close the file and reopen it, the exception might disappear.
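
A rough sketch of that suggestion: close the failed stream, then keep retrying an append once the datanodes have re-registered (the retry counts and the assumption that append is enabled on your cluster should be verified):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReopenOnDatanodeRecovery {
  public static FSDataOutputStream reopen(FileSystem fs, FSDataOutputStream old, Path file)
      throws IOException, InterruptedException {
    try {
      old.close(); // give up the old pipeline; may itself fail if no datanodes are alive
    } catch (IOException ignored) {
      // the stream is unusable anyway
    }
    IOException last = null;
    for (int attempt = 0; attempt < 10; attempt++) { // retry while datanodes re-register
      try {
        return fs.append(file); // requires append support on the cluster
      } catch (IOException e) {
        last = e;
        Thread.sleep(5000);
      }
    }
    throw last;
  }
}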


On Wed, Aug 13, 2014 at 2:21 AM, Satyam Singh 
wrote:

> Hi Users,
>
>
>
> In my cluster setup i am doing one test case of making only all datanodes
> down and keep namenode running.
>
> In this case my application gets error with remoteException:
>  could only be replicated to 0 nodes instead of minReplication (=1).
>  There are 0 datanode(s) running and no node(s) are excluded in this
> operation
>
> Then i make all datanodes up, but i still faces above exception looks like
> datanode came up but not sync with namenode.
>
> I have to then restart my application for writing files to hdfs. After
> restart it starts processing fine but i don't want to kill my application
> in this case.
>
>
>
> Prompt help is highly appreciated!!
>
> Regards,
> Satyam
>



-- 
Regards
Gordon Wang


Re: Synchronization among Mappers in map-reduce task

2014-08-12 Thread saurabh jain
Hi Wangda ,

I am not sure that making overwrite=false will solve the problem. As per the
Javadoc, with overwrite=false it will throw an exception if the file
already exists. So it will throw an exception for all the remaining
mappers.

Also, I am very new to ZK and have only basic knowledge of it. I am not
sure if it can solve the problem and, if yes, how. I am still going through
the available sources on ZK.

Can you please refer me to some source or link on ZK that can help me in
solving the problem?

Best
Saurabh

On Tue, Aug 12, 2014 at 3:08 AM, Wangda Tan  wrote:

> Hi Saurabh,
> It's an interesting topic,
>
> >> So , here is the question , is it possible to make sure that when one of
> the mapper tasks is writing to a file , other should wait until the first
> one is finished. ? I read that all the mappers task don't interact with
> each other
>
> A simple way to do this is using HDFS namespace:
> Create file using "public FSDataOutputStream create(Path f, boolean
> overwrite)", overwrite=false. Only one mapper can successfully create file.
>
> After write completed, the mapper will create a flag file like "completed"
> in the same folder. Other mappers can wait for the "completed" file
> created.
>
> >> Is there any way to have synchronization between two independent map
> reduce jobs?
> I think ZK can do some complex synchronization here. Like mutex, master
> election, etc.
>
> Hope this helps,
>
> Wangda Tan
>
>
>
>
> On Tue, Aug 12, 2014 at 10:43 AM, saurabh jain 
> wrote:
>
> > Hi Folks ,
> >
> > I have been writing a map-reduce application where I am having an input
> > file containing records and every field in the record is separated by
> some
> > delimiter.
> >
> > In addition to this user will also provide a list of columns that he
> wants
> > to lookup in a master properties file (stored in HDFS). If this columns
> > (lets say it a key) is present in master properties file then get the
> > corresponding value and update the key with this value and if the key is
> > not present it in the master properties file then it will create a new
> > value for this key and will write to this property file and will also
> > update in the record.
> >
> > I have written this application , tested it and everything worked fine
> > till now.
> >
> > *e.g :* *I/P Record :* This | is | the | test | record
> >
> > *Columns :* 2,4 (that means code will look up only field *"is" and
> "test"* in
> > the master properties file.)
> >
> > Here , I have a question.
> >
> > *Q 1:* In the case when my input file is huge and it is splitted across
> > the multiple mappers , I was getting the below mentioned exception where
> > all the other mappers tasks were failing. *Also initially when I started
> > the job my master properties file is empty.* In my code I have a check if
> > this file (master properties) doesn't exist create a new empty file
> before
> > submitting the job itself.
> >
> > e.g : If i have 4 splits of data , then 3 map tasks are failing. But
> after
> > this all the failed map tasks restarts and finally the job become
> > successful.
> >
> > So , *here is the question , is it possible to make sure that when one of
> > the mapper tasks is writing to a file , other should wait until the first
> > one is finished. ?* I read that all the mappers task don't interact with
> > each other.
> >
> > Also what will happen in the scenario when I start multiple parallel
> > map-reduce jobs and all of them working on the same properties files. *Is
> > there any way to have synchronization between two independent map reduce
> > jobs*?
> >
> > I have also read that ZooKeeper can be used in such scenarios , Is that
> > correct ?
> >
> >
> > Error:
> com.techidiocy.hadoop.filesystem.api.exceptions.HDFSFileSystemException:
> IOException - failed while appending data to the file ->Failed to create
> file [/user/cloudera/lob/master/bank.properties] for
> [DFSClient_attempt_1407778869492_0032_m_02_0_1618418105_1] on client
> [10.X.X.17], because this file is already being created by
> > [DFSClient_attempt_1407778869492_0032_m_05_0_-949968337_1] on
> [10.X.X.17]
> > at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2548)
> > at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2377)
> > at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2612)
> > at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2575)
> > at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:522)
> > at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:373)
> > at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.

fair scheduler not working as intended

2014-08-12 Thread Henry Hung
Hi Everyone,

I'm using Hadoop-2.2.0 with fair scheduler in my YARN cluster, but something is 
wrong with the fair scheduler.

Here is what my fair-scheduler.xml looks like:


  
15360 mb, 5 vcores
0.5
2
5
1
  


I created a "longrun" queue to ensure that the huge MR application can only use 5
containers. In my YARN setup, each container's memory is 3072 MB:

  
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>3072</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>3072</value>
</property>

When the huge application started, it worked just fine and the scheduler restricted it
to running only 5 maps in parallel.
But after running for some time, the application ran 10 maps in parallel.
The scheduler page shows that the "longrun" queue used 66%, exceeding its fair
share of 30%.

Can anyone tell me why the application can get more than it deserves?
Is the problem in my configuration, or is there a bug?

Best regards,
Henry Hung




Hadoop 2.4 failed to launch job on aws s3n

2014-08-12 Thread Yue Cheng
Hi,

I deployed Hadoop 2.4 on AWS EC2 using the S3 native file system (s3n) as a
replacement for HDFS. I tried several example apps; all gave me the
following stack trace messages (an older thread from Jul 24 hung there without
being resolved, so I attach the DEBUG info here):

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar
wordcount s3n://mybkt/wc/ s3n://mybkt/out

14/08/12 21:57:35 DEBUG util.Shell: setsid exited with exit code 0
14/08/12 21:57:36 DEBUG lib.MutableMetricsFactory: field
org.apache.hadoop.metrics2.lib.MutableRate
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess
with annotation
@org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, value=[Rate
of successful kerberos logins and latency (milliseconds)], about=,
type=DEFAULT, always=false, sampleName=Ops)
14/08/12 21:57:36 DEBUG lib.MutableMetricsFactory: field
org.apache.hadoop.metrics2.lib.MutableRate
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure
with annotation
@org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, value=[Rate
of failed kerberos logins and latency (milliseconds)], about=,
type=DEFAULT, always=false, sampleName=Ops)
14/08/12 21:57:36 DEBUG lib.MutableMetricsFactory: field
org.apache.hadoop.metrics2.lib.MutableRate
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with
annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time,
value=[GetGroups], about=, type=DEFAULT, always=false, sampleName=Ops)
14/08/12 21:57:36 DEBUG impl.MetricsSystemImpl: UgiMetrics, User and group
related metrics
14/08/12 21:57:36 DEBUG util.KerberosName: Kerberos krb5 configuration not
found, setting default realm to empty
14/08/12 21:57:36 DEBUG security.Groups:  Creating new Groups object
14/08/12 21:57:36 DEBUG util.NativeCodeLoader: Trying to load the
custom-built native-hadoop library...
14/08/12 21:57:36 DEBUG util.NativeCodeLoader: Failed to load native-hadoop
with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
14/08/12 21:57:36 DEBUG util.NativeCodeLoader:
java.library.path=/home/ubuntu/hadoop-2.4.0/lib
14/08/12 21:57:36 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/08/12 21:57:36 DEBUG security.JniBasedUnixGroupsMappingWithFallback:
Falling back to shell based
14/08/12 21:57:36 DEBUG security.JniBasedUnixGroupsMappingWithFallback:
Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
14/08/12 21:57:36 DEBUG security.Groups: Group mapping
impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback;
cacheTimeout=30; warningDeltaMs=5000
14/08/12 21:57:36 DEBUG security.UserGroupInformation: hadoop login
14/08/12 21:57:36 DEBUG security.UserGroupInformation: hadoop login commit
14/08/12 21:57:36 DEBUG security.UserGroupInformation: using local
user:UnixPrincipal: ubuntu
14/08/12 21:57:36 DEBUG security.UserGroupInformation: UGI loginUser:ubuntu
(auth:SIMPLE)
14/08/12 21:57:36 DEBUG service.Jets3tProperties: s3service.https-only=true
14/08/12 21:57:36 DEBUG service.Jets3tProperties:
storage-service.internal-error-retry-max=5
14/08/12 21:57:36 DEBUG service.Jets3tProperties:
http.connection-manager.factory-class-name=org.jets3t.service.utils.RestUtils$ConnManagerFactory
14/08/12 21:57:36 DEBUG service.Jets3tProperties:
httpclient.connection-timeout-ms=6
14/08/12 21:57:36 DEBUG service.Jets3tProperties:
httpclient.socket-timeout-ms=6
14/08/12 21:57:36 DEBUG service.Jets3tProperties:
httpclient.stale-checking-enabled=true
14/08/12 21:57:36 DEBUG service.Jets3tProperties: httpclient.useragent=null
14/08/12 21:57:36 DEBUG utils.RestUtils: Setting user agent string:
JetS3t/0.9.0 (Linux/3.13.0-29-generic; amd64; en; JVM 1.7.0_55)
14/08/12 21:57:36 DEBUG service.Jets3tProperties:
http.protocol.expect-continue=true
14/08/12 21:57:36 DEBUG service.Jets3tProperties:
httpclient.connection-manager-timeout=0
14/08/12 21:57:36 DEBUG service.Jets3tProperties: httpclient.retry-max=5
14/08/12 21:57:36 DEBUG service.Jets3tProperties:
httpclient.proxy-autodetect=true
14/08/12 21:57:36 DEBUG service.Jets3tProperties: s3service.s3-endpoint=
s3.amazonaws.com
14/08/12 21:57:36 DEBUG proxy.PluginProxyUtil: About to attempt auto proxy
detection under Java version:1.7.0_55-b14
14/08/12 21:57:36 DEBUG proxy.PluginProxyUtil: Sun Plugin reported java
version not 1.3.X, 1.4.X, 1.5.X or 1.6.X - trying failover detection...
14/08/12 21:57:36 DEBUG proxy.PluginProxyUtil: Using failover proxy
detection...
14/08/12 21:57:36 DEBUG proxy.PluginProxyUtil: Plugin Proxy Config List
Property:null
14/08/12 21:57:36 DEBUG proxy.PluginProxyUtil: No configured plugin proxy
list
14/08/12 21:57:36 DEBUG service.Jets3tProperties:
s3service.default-storage-class=null
14/08/12 21:57:36 DEBUG service.Jets3tProperties:
s3service.server-side-encryption=null
14/08/12 21:57:36 DEBUG service.Jets3tProperties:
http.connection-manager.factory-class-name=

Re: Hadoop 2.4.1 Verifying Automatic Failover Failed: ResourceManager

2014-08-12 Thread Xuan Gong
Hey, Arthur:

   Could you show me the error message for rm2, please?


Thanks

Xuan Gong


On Mon, Aug 11, 2014 at 10:17 PM, arthur.hk.c...@gmail.com <
arthur.hk.c...@gmail.com> wrote:

> Hi,
>
> Thank you very much!
>
> At the moment, if I run ./sbin/start-yarn.sh on rm1, the standby
> ResourceManager
> on rm2 is not started accordingly. Please advise what could be wrong?
> Thanks
>
> Regards
> Arthur
>
>
>
>
> On 12 Aug, 2014, at 1:13 pm, Xuan Gong  wrote:
>
> Some questions:
> Q1)  I need start yarn in EACH master separately, is this normal? Is there
> a way that I just run ./sbin/start-yarn.sh in rm1 and get the
> STANDBY ResourceManager in rm2 started as well?
>
> No, you need to start the multiple RMs separately.
>
> Q2) How to get alerts (e.g. by email) if the ACTIVE ResourceManager is
> down in an auto-failover env? or how do you monitor the status of
> ACTIVE/STANDBY ResourceManager?
>
> Interesting question. But one of the design goals of auto-failover is that the
> downtime of the RM is invisible to end users. The end users can submit
> applications normally even if a failover happens.
>
> We can monitor the status of the RMs by using the command line (as you did
> previously) or from the webUI/webService
> (rm_address:portnumber/cluster/cluster). We can get the current status from
> there.
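
A small sketch of the web-service option (it assumes the default RM web port 8088 and that the cluster-info response carries the HA state; verify both against your version):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RmHaStateCheck {
  // Returns the raw cluster-info JSON from one ResourceManager's web service.
  public static String clusterInfo(String rmHostPort) throws Exception {
    URL url = new URL("http://" + rmHostPort + "/ws/v1/cluster/info");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setConnectTimeout(3000);
    conn.setReadTimeout(3000);
    StringBuilder body = new StringBuilder();
    try (BufferedReader in =
             new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) {
        body.append(line);
      }
    }
    return body.toString(); // e.g. look for "haState":"ACTIVE" in the response
  }

  public static void main(String[] args) throws Exception {
    // 8088 is the default RM web port; rm addresses are the two masters from this thread.
    for (String rm : new String[] {"192.168.1.1:8088", "192.168.1.2:8088"}) {
      try {
        System.out.println(rm + " -> " + clusterInfo(rm));
      } catch (Exception e) {
        System.out.println(rm + " -> unreachable: " + e.getMessage());
      }
    }
  }
}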
>
> Thanks
>
> Xuan Gong
>
>
> On Mon, Aug 11, 2014 at 5:12 PM, arthur.hk.c...@gmail.com <
> arthur.hk.c...@gmail.com> wrote:
>
>> Hi,
>>
>> it is a multiple-node cluster, two master nodes (rm1 and rm2), below is
>> my yarn-site.xml.
>>
>> At the moment, the ResourceManager HA works if:
>>
>> 1) at rm1, run ./sbin/start-yarn.sh
>>
>> yarn rmadmin -getServiceState rm1
>> active
>>
>> yarn rmadmin -getServiceState rm2
>> 14/08/12 07:47:59 INFO ipc.Client: Retrying connect to server: rm1/
>> 192.168.1.1:23142. Already tried 0 time(s); retry policy is
>> RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
>> MILLISECONDS)
>> Operation failed: Call From rm2/192.168.1.2 to rm2:23142 failed on
>> connection exception: java.net.ConnectException: Connection refused; For
>> more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
>>
>>
>> 2) at rm2, run ./sbin/start-yarn.sh
>>
>> yarn rmadmin -getServiceState rm1
>> standby
>>
>>
>> Some questions:
>> Q1)  I need start yarn in EACH master separately, is this normal? Is
>> there a way that I just run ./sbin/start-yarn.sh in rm1 and get the
>> STANDBY ResourceManager in rm2 started as well?
>>
>> Q2) How to get alerts (e.g. by email) if the ACTIVE ResourceManager is
>> down in an auto-failover env? or how do you monitor the status of
>> ACTIVE/STANDBY ResourceManager?
>>
>>
>> Regards
>> Arthur
>>
>>
>> 
>> 
>>
>> 
>>
>>
>>   yarn.nodemanager.aux-services
>>   mapreduce_shuffle
>>
>>
>>
>>   yarn.resourcemanager.address
>>   192.168.1.1:8032
>>
>>
>>
>>yarn.resourcemanager.resource-tracker.address
>>192.168.1.1:8031
>>
>>
>>
>>yarn.resourcemanager.admin.address
>>192.168.1.1:8033
>>
>>
>>
>>yarn.resourcemanager.scheduler.address
>>192.168.1.1:8030
>>
>>
>>
>>   yarn.nodemanager.loacl-dirs
>>/edh/hadoop_data/mapred/nodemanager
>>true
>>
>>
>>
>>yarn.web-proxy.address
>>192.168.1.1:
>>
>>
>>
>>   yarn.nodemanager.aux-services.mapreduce.shuffle.class
>>   org.apache.hadoop.mapred.ShuffleHandler
>>
>>
>>
>>
>>
>>
>>   yarn.nodemanager.resource.memory-mb
>>   18432
>>
>>
>>
>>   yarn.scheduler.minimum-allocation-mb
>>   9216
>>
>>
>>
>>   yarn.scheduler.maximum-allocation-mb
>>   18432
>>
>>
>>
>>
>>   
>> yarn.resourcemanager.connect.retry-interval.ms
>> 2000
>>   
>>   
>> yarn.resourcemanager.ha.enabled
>> true
>>   
>>   
>> yarn.resourcemanager.ha.automatic-failover.enabled
>> true
>>   
>>   
>> yarn.resourcemanager.ha.automatic-failover.embedded
>> true
>>   
>>   
>> yarn.resourcemanager.cluster-id
>> cluster_rm
>>   
>>   
>> yarn.resourcemanager.ha.rm-ids
>> rm1,rm2
>>   
>>   
>> yarn.resourcemanager.hostname.rm1
>> 192.168.1.1
>>   
>>   
>> yarn.resourcemanager.hostname.rm2
>> 192.168.1.2
>>   
>>   
>> yarn.resourcemanager.scheduler.class
>>
>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
>>   
>>   
>> yarn.resourcemanager.recovery.enabled
>> true
>>   
>>   
>> yarn.resourcemanager.store.class
>>
>> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
>>   
>>   
>>   yarn.resourcemanager.zk-address
>>   rm1:2181,m135:2181,m137:2181
>>   
>>   
>>
>> yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms
>> 5000
>>   
>>
>>   
>>   
>> yarn.resourcemanager.address.rm1
>> 192.168.1.1:23140
>>   
>>   
>> yarn.resourcemanager.scheduler.address.rm1
>> 192.168.1.1:23130
>>   
>>  

Re: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode?

2014-08-12 Thread Adaryl "Bob" Wakefield, MBA
Is this up to date?

http://www.mapr.com/products/product-overview/overview


Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Aaron Eng 
Sent: Tuesday, August 12, 2014 4:31 PM
To: user@hadoop.apache.org 
Subject: Re: Started learning Hadoop. Which distribution is best for native 
install in pseudo distributed mode?

On that note, 2 is also misleading/incomplete.  You might want to explain which 
specific features you are referencing so the original poster can figure out if 
those features are relevant.  The inverse of 2 is also true, things like 
consistent snapshots and full random read/write over NFS are in MapR and not in 
HDFS.




On Tue, Aug 12, 2014 at 2:10 PM, Kai Voigt  wrote:

  3. seems a biased and incomplete statement. 

  Cloudera’s distribution CDH is fully open source. The proprietary „stuff" you 
refer to is most likely Cloudera Manager, an additional tool to make 
deployment, configuration and monitoring easy.

  Nobody is required to use it to run a Hadoop cluster.

  Kai (a Cloudera Employee)

On 12.08.2014 at 21:56, Adaryl Bob Wakefield, MBA wrote:


Hortonworks. Here is my reasoning:
1. Hortonwork is 100% open source.
2. MapR has stuff on their roadmap that Hortonworks has already 
accomplished and has moved on to other things.
3. Cloudera has proprietary stuff in their stack. No.
4. Hortonworks makes training super accessible and there is a community 
around it.
5. Who the heck is BigInsights? (Which should tell you something.)

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: mani kandan 
Sent: Tuesday, August 12, 2014 3:12 PM
To: user@hadoop.apache.org 
Subject: Started learning Hadoop. Which distribution is best for native 
install in pseudo distributed mode?

Which distribution are you people using? Cloudera vs Hortonworks vs 
Biginsights? 



--
  Kai Voigt Am Germaniahafen 1 k...@123.org
  24143 Kiel +49 160 96683050
  Germany @KaiVoigt



Re: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode?

2014-08-12 Thread Jay Vyas
Also, consider Apache Bigtop. That is the Apache upstream Hadoop initiative,
and it comes with smoke tests plus Puppet recipes for setting up your own Hadoop
distro from scratch.

IMHO, if you are learning or building your own tooling around Hadoop, Bigtop is
ideal. If you are interested in purchasing support, then the vendor distros are a
good gateway.

> On Aug 12, 2014, at 5:31 PM, Aaron Eng  wrote:
> 
> On that note, 2 is also misleading/incomplete.  You might want to explain 
> which specific features you are referencing so the original poster can figure 
> out if those features are relevant.  The inverse of 2 is also true, things 
> like consistent snapshots and full random read/write over NFS are in MapR and 
> not in HDFS.
> 
> 
>> On Tue, Aug 12, 2014 at 2:10 PM, Kai Voigt  wrote:
>> 3. seems a biased and incomplete statement.
>> 
>> Cloudera’s distribution CDH is fully open source. The proprietary „stuff" 
>> you refer to is most likely Cloudera Manager, an additional tool to make 
>> deployment, configuration and monitoring easy.
>> 
>> Nobody is required to use it to run a Hadoop cluster.
>> 
>> Kai (a Cloudera Employee)
>> 
>>> On 12.08.2014 at 21:56, Adaryl Bob Wakefield, MBA wrote:
>>> 
>>> Hortonworks. Here is my reasoning:
>>> 1. Hortonwork is 100% open source.
>>> 2. MapR has stuff on their roadmap that Hortonworks has already 
>>> accomplished and has moved on to other things.
>>> 3. Cloudera has proprietary stuff in their stack. No.
>>> 4. Hortonworks makes training super accessible and there is a community 
>>> around it.
>>> 5. Who the heck is BigInsights? (Which should tell you something.)
>>>  
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics
>>> 913.938.6685
>>> www.linkedin.com/in/bobwakefieldmba
>>> Twitter: @BobLovesData
>>>  
>>> From: mani kandan
>>> Sent: Tuesday, August 12, 2014 3:12 PM
>>> To: user@hadoop.apache.org
>>> Subject: Started learning Hadoop. Which distribution is best for native 
>>> install in pseudo distributed mode?
>>>  
>>> Which distribution are you people using? Cloudera vs Hortonworks vs 
>>> Biginsights?
>>> 
>> 
>> Kai VoigtAm Germaniahafen 1  
>> k...@123.org
>>  24143 Kiel  
>> +49 160 96683050
>>  Germany 
>> @KaiVoigt
> 


Re: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode?

2014-08-12 Thread Aaron Eng
On that note, 2 is also misleading/incomplete.  You might want to explain
which specific features you are referencing so the original poster can
figure out if those features are relevant.  The inverse of 2 is also true,
things like consistent snapshots and full random read/write over NFS are in
MapR and not in HDFS.


On Tue, Aug 12, 2014 at 2:10 PM, Kai Voigt  wrote:

> 3. seems a biased and incomplete statement.
>
> Cloudera’s distribution CDH is fully open source. The proprietary „stuff"
> you refer to is most likely Cloudera Manager, an additional tool to make
> deployment, configuration and monitoring easy.
>
> Nobody is required to use it to run a Hadoop cluster.
>
> Kai (a Cloudera Employee)
>
> On 12.08.2014 at 21:56, Adaryl Bob Wakefield, MBA <
> adaryl.wakefi...@hotmail.com> wrote:
>
>   Hortonworks. Here is my reasoning:
> 1. Hortonwork is 100% open source.
> 2. MapR has stuff on their roadmap that Hortonworks has already
> accomplished and has moved on to other things.
> 3. Cloudera has proprietary stuff in their stack. No.
> 4. Hortonworks makes training super accessible and there is a community
> around it.
> 5. Who the heck is BigInsights? (Which should tell you something.)
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* mani kandan 
> *Sent:* Tuesday, August 12, 2014 3:12 PM
> *To:* user@hadoop.apache.org
> *Subject:* Started learning Hadoop. Which distribution is best for native
> install in pseudo distributed mode?
>
>
> Which distribution are you people using? Cloudera vs Hortonworks vs
> Biginsights?
>
>
> --
> *Kai Voigt* Am Germaniahafen 1 k...@123.org
> 24143 Kiel +49 160 96683050
> Germany @KaiVoigt
>
>


Re: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode?

2014-08-12 Thread Adaryl "Bob" Wakefield, MBA
You fell into my trap sir. I was hoping someone would clear that up. :)

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Kai Voigt 
Sent: Tuesday, August 12, 2014 4:10 PM
To: user@hadoop.apache.org 
Subject: Re: Started learning Hadoop. Which distribution is best for native 
install in pseudo distributed mode?

3. seems a biased and incomplete statement. 

Cloudera’s distribution CDH is fully open source. The proprietary „stuff" you 
refer to is most likely Cloudera Manager, an additional tool to make 
deployment, configuration and monitoring easy.

Nobody is required to use it to run a Hadoop cluster.

Kai (a Cloudera Employee)

On 12.08.2014 at 21:56, Adaryl Bob Wakefield, MBA wrote:


  Hortonworks. Here is my reasoning:
  1. Hortonwork is 100% open source.
  2. MapR has stuff on their roadmap that Hortonworks has already accomplished 
and has moved on to other things.
  3. Cloudera has proprietary stuff in their stack. No.
  4. Hortonworks makes training super accessible and there is a community 
around it.
  5. Who the heck is BigInsights? (Which should tell you something.)

  Adaryl "Bob" Wakefield, MBA
  Principal
  Mass Street Analytics
  913.938.6685
  www.linkedin.com/in/bobwakefieldmba
  Twitter: @BobLovesData

  From: mani kandan 
  Sent: Tuesday, August 12, 2014 3:12 PM
  To: user@hadoop.apache.org 
  Subject: Started learning Hadoop. Which distribution is best for native 
install in pseudo distributed mode?

  Which distribution are you people using? Cloudera vs Hortonworks vs 
Biginsights? 




Kai Voigt Am Germaniahafen 1 k...@123.org
24143 Kiel +49 160 96683050
Germany @KaiVoigt


Re: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode?

2014-08-12 Thread Kai Voigt
3. seems a biased and incomplete statement.

Cloudera’s distribution CDH is fully open source. The proprietary „stuff" you 
refer to is most likely Cloudera Manager, an additional tool to make 
deployment, configuration and monitoring easy.

Nobody is required to use it to run a Hadoop cluster.

Kai (a Cloudera Employee)

On 12.08.2014 at 21:56, Adaryl Bob Wakefield, MBA wrote:

> Hortonworks. Here is my reasoning:
> 1. Hortonwork is 100% open source.
> 2. MapR has stuff on their roadmap that Hortonworks has already accomplished 
> and has moved on to other things.
> 3. Cloudera has proprietary stuff in their stack. No.
> 4. Hortonworks makes training super accessible and there is a community 
> around it.
> 5. Who the heck is BigInsights? (Which should tell you something.)
>  
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>  
> From: mani kandan
> Sent: Tuesday, August 12, 2014 3:12 PM
> To: user@hadoop.apache.org
> Subject: Started learning Hadoop. Which distribution is best for native 
> install in pseudo distributed mode?
>  
> Which distribution are you people using? Cloudera vs Hortonworks vs 
> Biginsights?
> 

Kai Voigt   Am Germaniahafen 1  
k...@123.org
24143 Kiel  
+49 160 96683050
Germany 
@KaiVoigt



Re: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode?

2014-08-12 Thread Adaryl "Bob" Wakefield, MBA
Hortonworks. Here is my reasoning:
1. Hortonworks is 100% open source.
2. MapR has stuff on their roadmap that Hortonworks has already accomplished 
and has moved on to other things.
3. Cloudera has proprietary stuff in their stack. No.
4. Hortonworks makes training super accessible and there is a community around 
it.
5. Who the heck is BigInsights? (Which should tell you something.)

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: mani kandan 
Sent: Tuesday, August 12, 2014 3:12 PM
To: user@hadoop.apache.org 
Subject: Started learning Hadoop. Which distribution is best for native install 
in pseudo distributed mode?

Which distribution are you people using? Cloudera vs Hortonworks vs 
Biginsights? 


I was wondering what could make these 2 variables different: HADOOP_CONF_DIR vs YARN_CONF_DIR

2014-08-12 Thread REYANE OUKPEDJO
Can someone explain what makes the above variables different? Most of the time
they are set pointing to the same directory.

Thanks 


Reyane OUKPEDJO


Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode?

2014-08-12 Thread mani kandan
Which distribution are you people using? Cloudera vs Hortonworks vs
Biginsights?


MR AppMaster unable to load native libs

2014-08-12 Thread Subroto Sanyal
Hi,

I am running a single node hadoop cluster 2.4.1.
When I submit a MR job it logs a warning:
2014-08-12 21:38:22,173 WARN [main] org.apache.hadoop.util.NativeCodeLoader: 
Unable to load native-hadoop library for your platform... using builtin-java 
classes where applicable

The problem doesn't come up when starting any of the Hadoop daemons.
Do I need to pass any specific configuration so that the child JVM is able to
pick up the native lib folder?

Cheers,
Subroto Sanyal




hadoop/yarn and task parallelization on non-hdfs filesystems

2014-08-12 Thread Calvin
Hi all,

I've instantiated a Hadoop 2.4.1 cluster and I've found that running
MapReduce applications will parallelize differently depending on what
kind of filesystem the input data is on.

Using HDFS, a MapReduce job will spawn enough containers to maximize
use of all available memory. For example, a 3-node cluster with 172GB
of memory with each map task allocating 2GB, about 86 application
containers will be created.

On a filesystem that isn't HDFS (like NFS or in my use case, a
parallel filesystem), a MapReduce job will only allocate a subset of
available tasks (e.g., with the same 3-node cluster, about 25-40
containers are created). Since I'm using a parallel filesystem, I'm
not as concerned with the bottlenecks one would find if one were to
use NFS.

Is there a YARN (yarn-site.xml) or MapReduce (mapred-site.xml)
configuration that will allow me to effectively maximize resource
utilization?

Thanks,
Calvin
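
One knob that often drives the number of map containers is the split size that FileInputFormat computes from the input; on a non-HDFS filesystem the reported block size may not be what you want. A hedged sketch of bounding the split size, and hence fanning out into more map tasks (the 128 MB and 16 MB figures are only examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuning {
  public static Job withBoundedSplits() throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "bounded-splits");
    // Cap each split at 128 MB so a large input file fans out into more map tasks.
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    // Optionally also set a floor, e.g. to avoid a flood of tiny splits.
    FileInputFormat.setMinInputSplitSize(job, 16L * 1024 * 1024);
    return job;
  }
}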


Re: ulimit for Hive

2014-08-12 Thread Chris MacKenzie
Hi Zhijie,

ulimit covers both hard and soft limits.

The hard limit can only be set by a sysadmin; it guards against things like a fork
bomb DoS attack.
The sysadmin can set the hard ulimit per user, e.g. for hadoop_user.

A user can add a line to their .profile file setting a soft ulimit up to
the hard limit. You can Google how to do that.

You can check the ulimits like so:

ulimit -H -a // hard limit
ulimit -S -a // soft limit

The max value for the hard limit is "unlimited". I currently have mine set
to this as I was running out of processes (nproc).

I don't know about restarting; I think so.
I don't know about Hive.



Warm regards.

Chris

telephone: 0131 332 6967
email: stu...@chrismackenziephotography.co.uk
corporate: www.chrismackenziephotography.co.uk






From:  Zhijie Shen 
Reply-To:  
Date:  Tuesday, 12 August 2014 18:33
To:  , 
Subject:  Re: ulimit for Hive


+ Hive user mailing list
It should be a better place for your questions.



On Mon, Aug 11, 2014 at 3:17 PM, Ana Gillan  wrote:

Hi,

I've been reading a lot of posts about needing to set a high ulimit for
file descriptors in Hadoop and I think it's probably the cause of a lot of
the errors I've been having when trying to run queries on larger data sets
in Hive. However, I'm really confused about how and where to set the
limit, so I have a number of questions:

1. How high is it recommended to set the ulimit?
2. What is the difference between soft and hard limits? Which one needs to
be set to the value from question 1?
3. For which user(s) do I set the ulimit? If I am running the Hive query
with my login, do I set my own ulimit to the high value?
4. Do I need to set this limit for these users on all the machines in the
cluster? (we have one master node and 6 slave nodes)
5. Do I need to restart anything after configuring the ulimit?

Thanks in advance,
Ana







-- 
Zhijie Shen
Hortonworks Inc.
http://hortonworks.com/







Making All datanode down

2014-08-12 Thread Satyam Singh

Hi Users,



In my cluster setup I am running a test case where I bring all the
datanodes down while keeping the namenode running.


In this case my application gets an error with a remoteException:
 could only be replicated to 0 nodes instead of minReplication (=1).
There are 0 datanode(s) running and no node(s) are excluded in this
operation


Then I bring all the datanodes up, but I still face the above exception; it looks
like the datanodes came up but are not in sync with the namenode.


I then have to restart my application that writes files to HDFS. After the
restart it processes fine, but I don't want to have to kill my
application in this case.




Prompt help is highly appreciated!!

Regards,
Satyam


Re: ulimit for Hive

2014-08-12 Thread Zhijie Shen
+ Hive user mailing list

It should be a better place for your questions.


On Mon, Aug 11, 2014 at 3:17 PM, Ana Gillan  wrote:

> Hi,
>
> I’ve been reading a lot of posts about needing to set a high ulimit for
> file descriptors in Hadoop and I think it’s probably the cause of a lot of
> the errors I’ve been having when trying to run queries on larger data sets
> in Hive. However, I’m really confused about how and where to set the limit,
> so I have a number of questions:
>
>1. How high is it recommended to set the ulimit?
>2. What is the difference between soft and hard limits? Which one
>needs to be set to the value from question 1?
>3. For which user(s) do I set the ulimit? If I am running the Hive
>query with my login, do I set my own ulimit to the high value?
>4. Do I need to set this limit for these users on all the machines in
>the cluster? (we have one master node and 6 slave nodes)
>5. Do I need to restart anything after configuring the ulimit?
>
> Thanks in advance,
> Ana
>



-- 
Zhijie Shen
Hortonworks Inc.
http://hortonworks.com/



Re: Pseudo -distributed mode

2014-08-12 Thread sindhu hosamane
I have read that "By default, Hadoop is configured to run in a non-distributed
mode, as a single Java process".

But if my Hadoop is in pseudo-distributed mode, why does it still run as a
single Java process and utilize only 1 CPU core even if there are many
more?


On Tue, Aug 12, 2014 at 4:32 PM, Sergey Murylev 
wrote:

> Yes :)
>
> Pseudo-distributed mode is such configuration when we have some Hadoop
> environment on single computer.
>
> On 12/08/14 18:25, sindhu hosamane wrote:
> > Can Setting up 2 datanodes on same machine  be considered as
> > pseudo-distributed mode hadoop  ?
> >
> > Thanks,
> > Sindhu
>
>
>


Re: Pseudo -distributed mode

2014-08-12 Thread Sergey Murylev
Yes :)

Pseudo-distributed mode is a configuration in which we run the full Hadoop
environment on a single computer.

On 12/08/14 18:25, sindhu hosamane wrote:
> Can Setting up 2 datanodes on same machine  be considered as
> pseudo-distributed mode hadoop  ?
>
> Thanks,
> Sindhu






Pseudo -distributed mode

2014-08-12 Thread sindhu hosamane
Can setting up 2 datanodes on the same machine be considered
pseudo-distributed mode Hadoop?

Thanks,
Sindhu


Why 2 different approach for deleting localized resources and aggregated logs?

2014-08-12 Thread Rohith Sharma K S
Hi

I see two different approaches for deleting localized resources and
aggregated logs.

1.   Localized resources are deleted based on the size of the localizer cache,
per local directory.

2.   Aggregated logs are deleted based on time (if enabled).

   Is there any specific reasoning behind the two different implementations?

   Can aggregated logs also be deleted based on size?

Thanks & Regards
Rohith Sharma K S
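
For reference, these appear to be the two knobs being contrasted (a hedged sketch; verify the property names and defaults against your release):

import org.apache.hadoop.conf.Configuration;

public class RetentionSettings {
  public static Configuration retention() {
    Configuration conf = new Configuration();
    // 1. Localized resources: cleaned when the per-directory cache exceeds this size (MB).
    conf.setLong("yarn.nodemanager.localizer.cache.target-size-mb", 10240);
    // 2. Aggregated logs: deleted by age, not size (a negative value keeps them forever).
    conf.setLong("yarn.log-aggregation.retain-seconds", 7L * 24 * 3600);
    return conf;
  }
}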




org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=EXECUTE

2014-08-12 Thread Ana Gillan
Hi,

I ran a job in Hive and it got to this stage: Stage-1 map = 100%,  reduce =
29%, seemed to start cleaning up the containers and stuff successfully, and
then I got this series of errors:

2014-08-12 03:58:55,718 ERROR
org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
as:yarn (auth:SIMPLE)
cause:org.apache.hadoop.security.AccessControlException: Permission denied:
user=yarn, access=EXECUTE,
inode="/tmp/hadoop-yarn/staging/zslf023":zslf023:mapred:drwx--
2014-08-12 03:58:55,718 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 8020, call
org.apache.hadoop.hdfs.protocol.ClientProtocol.delete from 10.0.0.42:58610:
error: org.apache.hadoop.security.AccessControlException: Permission denied:
user=yarn, access=EXECUTE,
inode="/tmp/hadoop-yarn/staging/zslf023":zslf023:mapred:drwx--
2014-08-12 03:58:55,726 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySch
eduler: Null container completed...
2014-08-12 03:58:55,732 ERROR
org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
as:zslf023 (auth:SIMPLE)
cause:org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl:
Application doesn't exist in cache appattempt_1403771939632_0456_01
2014-08-12 03:58:55,732 ERROR
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Application doesn't exist in cache appattempt_1403771939632_0456_01
2014-08-12 03:58:55,925 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerI
mpl: Event EventType: KILL_CONTAINER sent to absent container
container_1403771939632_0456_01_20

Does the yarn user need to have some sort of permissions set? Does my user
need some special permissions?

Thanks,
Ana
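
As a quick diagnostic (a hedged sketch; the path is copied from the exception above), the owner and mode of the staging directory can be dumped with the plain FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StagingDirCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Path taken from the AccessControlException above.
    FileStatus st = fs.getFileStatus(new Path("/tmp/hadoop-yarn/staging/zslf023"));
    System.out.println(st.getOwner() + ":" + st.getGroup() + " " + st.getPermission());
  }
}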







How to use docker in Hadoop, with patch of YARN-1964?

2014-08-12 Thread sam liu
Hi Experts,

I am very interested in having Hadoop work with Docker and am doing some
trials with the patch from YARN-1964.

I applied the patch yarn-1964-branch-2.2.0-docker.patch of JIRA YARN-1964 on
branch 2.2 and am going to install a Hadoop cluster using the newly generated
tarball including the patch.

Then, I think I can use DockerContainerExecutor, but I do not know much
detail about the usage and have the following questions:

1. After installation, what are the detailed config steps to adopt
DockerContainerExecutor?

2. How can I verify whether an MR task is really launched in a Docker container
and not a YARN container?

3. Which Hadoop branch will officially include Docker support?

Thanks a lot!


Re: Negative value given by getVirtualCores() or getAvailableResources()

2014-08-12 Thread Wangda Tan
By default, vcore = 1 for each resource request. If you don't like this
behavior, you can set yarn.scheduler.minimum-allocation-vcores=0
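
A short sketch of making the vcore ask explicit on each request instead of relying on the default (the memory and priority values are placeholders):

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ExplicitVcoreRequest {
  public static ContainerRequest request() {
    // Ask for 1024 MB and exactly 1 vcore rather than relying on the default.
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    return new ContainerRequest(capability, null, null, priority);
  }
}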

Hope this helps,
Wangda Tan



On Thu, Aug 7, 2014 at 7:13 PM, Krishna Kishore Bonagiri <
write2kish...@gmail.com> wrote:

> Hi,
>   I am calling getAvailableResources() on AMRMClientAsync and getting -ve
> value for the number virtual cores as below. Is there something wrong?
>
> .
>
> I have set the vcores in my yarn-site.xml like this, and just ran an
> application that requires two containers other than the Application
> Master's container. In the ContainerRequest setup from my
> ApplicationMaster, I haven't set anything for virtual cores, means I didn't
> call setVirtualCores() at all.
>
> So, I think it shouldn't be showing a -ve value for the vcores, when I
> call getAvailableResources(), am I wrong?
>
>
>  Number of CPU cores that can be allocated for containers.
> 
>  yarn.nodemanager.resource.cpu-vcores 
>  4 
> 
>
> Thanks,
> Kishore
>


Re: 100% CPU consumption by Resource Manager process

2014-08-12 Thread Wangda Tan
Hi Krishna,
To get a better understanding of the problem, could you please share the
following information:
1) Number of nodes and running app in the cluster
2) What's the version of your Hadoop?
3) Have you set
"yarn.scheduler.capacity.schedule-asynchronously.enable"=true?
4) What's the "yarn.resourcemanager.nodemanagers.heartbeat-interval-ms" in
your configuration?
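
(For reference, and as an assumption about a typical setup rather than
something stated in this thread: the asynchronous-scheduling flag normally
lives in capacity-scheduler.xml and the heartbeat interval in yarn-site.xml;
the values shown below are the usual defaults.)

<!-- capacity-scheduler.xml -->
<property>
  <name>yarn.scheduler.capacity.schedule-asynchronously.enable</name>
  <value>false</value>
</property>

<!-- yarn-site.xml -->
<property>
  <name>yarn.resourcemanager.nodemanagers.heartbeat-interval-ms</name>
  <value>1000</value>
</property>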

Thanks,
Wangda Tan



On Sun, Aug 10, 2014 at 11:29 PM, Krishna Kishore Bonagiri <
write2kish...@gmail.com> wrote:

> Hi,
>   My YARN ResourceManager is consuming 100% CPU when I am running an
> application that runs for about 10 hours, requesting as many as 27000
> containers. The CPU consumption was very low at the start of my
> application and gradually went up to over 100%. Is this a known issue,
> or are we doing something wrong?
>
> Every dump of the Event Processor thread shows it running
> LeafQueue::assignContainers(), specifically the for loop below from
> LeafQueue.java, and it seems to be looping through some priority list.
>
> // Try to assign containers to applications in order
> for (FiCaSchedulerApp application : activeApplications) {
> ...
> // Schedule in priority order
> for (Priority priority : application.getPriorities()) {
>
> 3XMTHREADINFO  "ResourceManager Event Processor"
> J9VMThread:0x01D08600, j9thread_t:0x7F032D2FAA00,
> java/lang/Thread:0x8341D9A0, state:CW, prio=5
> 3XMJAVALTHREAD(java/lang/Thread getId:0x1E, isDaemon:false)
> 3XMTHREADINFO1(native thread ID:0x4B64, native priority:0x5,
> native policy:UNKNOWN)
> 3XMTHREADINFO2(native stack address range
> from:0x7F0313DF8000, to:0x7F0313E39000, size:0x41000)
> 3XMCPUTIME   *CPU usage total: 42334.614623696 secs*
> 3XMHEAPALLOC Heap bytes allocated since last GC cycle=20456
> (0x4FE8)
> 3XMTHREADINFO3   Java callstack:
> 4XESTACKTRACEat
> org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.assignContainers(LeafQueue.java:850(Compiled
> Code))
> 5XESTACKTRACE   (entered lock:
> org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp@0x8360DFE0,
> entry count: 1)
> 5XESTACKTRACE   (entered lock:
> org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue@0x833B9280,
> entry count: 1)
> 4XESTACKTRACEat
> org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.assignContainersToChildQueues(ParentQueue.java:655(Compiled
> Code))
> 5XESTACKTRACE   (entered lock:
> org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue@0x83360A80,
> entry count: 2)
> 4XESTACKTRACEat
> org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.assignContainers(ParentQueue.java:569(Compiled
> Code))
> 5XESTACKTRACE   (entered lock:
> org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue@0x83360A80,
> entry count: 1)
> 4XESTACKTRACEat
> org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:831(Compiled
> Code))
> 5XESTACKTRACE   (entered lock:
> org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler@0x834037C8,
> entry count: 1)
> 4XESTACKTRACEat
> org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.handle(CapacityScheduler.java:878(Compiled
> Code))
> 4XESTACKTRACEat
> org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.handle(CapacityScheduler.java:100(Compiled
> Code))
> 4XESTACKTRACEat
> org/apache/hadoop/yarn/server/resourcemanager/ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591)
> 4XESTACKTRACEat java/lang/Thread.run(Thread.java:853)
>
> 3XMTHREADINFO  "ResourceManager Event Processor"
> J9VMThread:0x01D08600, j9thread_t:0x7F032D2FAA00,
> java/lang/Thread:0x8341D9A0, state:CW, prio=5
> 3XMJAVALTHREAD(java/lang/Thread getId:0x1E, isDaemon:false)
> 3XMTHREADINFO1(native thread ID:0x4B64, native priority:0x5,
> native policy:UNKNOWN)
> 3XMTHREADINFO2(native stack address range
> from:0x7F0313DF8000, to:0x7F0313E39000, size:0x41000)
> 3XMCPUTIME   CPU usage total: 42379.604203548 secs
> 3XMHEAPALLOC Heap bytes allocated since last GC cycle=57280
> (0xDFC0)
> 3XMTHREADINFO3   Java callstack:
> 4XESTACKTRACEat
> org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.assignContainers(LeafQueue.java:841(Compiled
> Code))
> 5XESTACKTRACE   (entered lock:
> org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaS

Re: Synchronization among Mappers in map-reduce task

2014-08-12 Thread Wangda Tan
Hi Saurabh,
It's an interesting topic,

>> So, here is the question: is it possible to make sure that when one of
the mapper tasks is writing to a file, the others wait until the first
one is finished? I have read that mapper tasks don't interact with
each other.

A simple way to do this is to use the HDFS namespace:
Create the file using "public FSDataOutputStream create(Path f, boolean
overwrite)" with overwrite=false. Only one mapper can successfully create the file.

After the write completes, that mapper creates a flag file such as "completed"
in the same folder. The other mappers can wait until the "completed" file appears.
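
A rough Java sketch of this pattern (the method name, the 1-second polling
interval, and the broad IOException catch are my own illustration, not from
this thread):

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleWriterSketch {

  // Only the mapper that wins the create() race writes the shared file;
  // the losers wait for a "completed" flag dropped by the winner.
  public static void writeOnce(FileSystem fs, Path dataFile, Path completedFlag,
      String contents) throws IOException, InterruptedException {
    FSDataOutputStream out;
    try {
      // overwrite=false: exactly one caller can create the file, the rest fail.
      out = fs.create(dataFile, false);
    } catch (IOException alreadyBeingCreated) {
      // Another mapper is the writer; wait until it signals completion.
      while (!fs.exists(completedFlag)) {
        Thread.sleep(1000L);
      }
      return;
    }
    try {
      out.write(contents.getBytes(StandardCharsets.UTF_8));
    } finally {
      out.close();
    }
    // Signal the waiters that the data file is fully written.
    fs.create(completedFlag, false).close();
  }
}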

>> Is there any way to have synchronization between two independent map-reduce
jobs?
I think ZooKeeper can do the more complex synchronization here, like mutexes,
master election, etc.
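
For the cross-job case, a distributed mutex via Apache Curator is one option
(Curator is my suggestion for illustration; the thread only mentions ZooKeeper
in general, and the connect string and lock path below are made up):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkMutexSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical ZooKeeper quorum and lock path.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zkhost:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    InterProcessMutex lock = new InterProcessMutex(client, "/locks/master-properties");
    lock.acquire();   // blocks until this process owns the lock
    try {
      // Critical section: read and update the shared properties file in HDFS.
    } finally {
      lock.release(); // let the next job or mapper proceed
    }
    client.close();
  }
}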

Hope this helps,

Wangda Tan




On Tue, Aug 12, 2014 at 10:43 AM, saurabh jain 
wrote:

> Hi Folks ,
>
> I have been writing a map-reduce application that takes an input
> file containing records, where every field in a record is separated by some
> delimiter.
>
> In addition to this, the user will also provide a list of columns that he wants
> to look up in a master properties file (stored in HDFS). If such a column
> (let's call it a key) is present in the master properties file, the job gets the
> corresponding value and updates the key with this value; if the key is
> not present in the master properties file, the job creates a new
> value for this key, writes it to the properties file, and also
> updates it in the record.
>
> I have written this application and tested it, and everything has worked fine
> so far.
>
> *e.g.:* *Input record:* This | is | the | test | record
>
> *Columns:* 2,4 (meaning the code will look up only the fields *"is"* and *"test"* in
> the master properties file.)
>
> Here I have a question.
>
> *Q 1:* When my input file is huge and is split across
> multiple mappers, I was getting the exception below, where
> all the other mapper tasks were failing. *Also, initially, when I started
> the job, my master properties file was empty.* In my code I have a check that,
> if this file (master properties) doesn't exist, creates a new empty file before
> submitting the job itself.
>
> e.g.: If I have 4 splits of data, then 3 map tasks fail. But after
> this, all the failed map tasks restart and the job finally becomes
> successful.
>
> So, *here is the question: is it possible to make sure that when one of
> the mapper tasks is writing to a file, the others wait until the first
> one is finished?* I have read that mapper tasks don't interact with
> each other.
>
> Also, what will happen in the scenario where I start multiple parallel
> map-reduce jobs and all of them work on the same properties file? *Is
> there any way to have synchronization between two independent map-reduce
> jobs*?
>
> I have also read that ZooKeeper can be used in such scenarios. Is that
> correct?
>
>
> Error: 
> com.techidiocy.hadoop.filesystem.api.exceptions.HDFSFileSystemException: 
> IOException - failed while appending data to the file ->Failed to create file 
> [/user/cloudera/lob/master/bank.properties] for 
> [DFSClient_attempt_1407778869492_0032_m_02_0_1618418105_1] on client 
> [10.X.X.17], because this file is already being created by
> [DFSClient_attempt_1407778869492_0032_m_05_0_-949968337_1] on [10.X.X.17]
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2548)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2377)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2612)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2575)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:522)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:373)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at 
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
> at 
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
> at org.apache.hadoop.ipc.Server$Handler.run(S