Re: Failed to start namenode.

2016-06-09 Thread Rakesh Radhakrishnan
Hi,

Could you please check that the Kerberos principal name is specified
correctly in "hdfs-site.xml"? It is used to authenticate against Kerberos.
If you use the _HOST variable in hdfs-site.xml, ensure that the hostname
resolves and that it matches the principal name.

If the keytab file defined in "hdfs-site.xml" is not present, you will see
this error. So, please verify that the path and the keytab filename are
configured correctly.

Also, did you verify a manual kinit using the principal name and keytab?
Is that working for you?
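
For reference, the NameNode principal/keytab settings in hdfs-site.xml
normally look along these lines (the realm name and keytab path below are
only placeholders, not your actual values):

  <property>
    <name>dfs.namenode.kerberos.principal</name>
    <value>nn/_HOST@YOUR-REALM</value>
  </property>
  <property>
    <name>dfs.namenode.keytab.file</name>
    <value>/etc/security/keytabs/nn.service.keytab</value>
  </property>

and a manual check could be something like:

  kinit -kt /etc/security/keytabs/nn.service.keytab nn/$(hostname -f)@YOUR-REALM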

Please share your "hdfs-site.xml" config file so we can learn more about
your configuration.

Regards,
Rakesh

On Thu, Jun 9, 2016 at 6:21 PM, Hafiz Mujadid 
wrote:

> Hi,
>
> I have setup kerbores with hadoop and I am facing following exception when
> i start hadoop.
>
> ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start
> namenode.
> java.io.IOException: Login failure for admin/admin@queryiorealm from
> keytab /usr/local/var/krb5kdc/kadm5.keytab:
> javax.security.auth.login.LoginException: Unable to obtain password from
> user
>
> at
> org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:962)
> at
> org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:246)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.loginAsNameNodeUser(NameNode.java:613)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:632)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:811)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:795)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1488)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
> Caused by: javax.security.auth.login.LoginException: Unable to obtain
> password from user
>
>
> Can anybody please help me how to get rid of this issue?
>
> Thanks
>


Re: Failed to start namenode.

2016-06-09 Thread Rakesh Radhakrishnan
Good to hear the issue is resolved and you are able to continue with your setup!

Best Regards,
Rakesh

On Fri, Jun 10, 2016 at 9:34 AM, Hafiz Mujadid 
wrote:

> Thanks Anu and Rakesh for your response. The problem was that principal
> name was not added to database, I was trying to connect to hadoop through.
> The keytab file permissions were not enough.So by changing the permission
> and adding relative principal name solved the issue.
>
> Thanks
>
> On Thu, Jun 9, 2016 at 10:19 PM, Anu Engineer 
> wrote:
>
>> Hi Hafiz,
>>
>>
>>
>> All suggestions from Rakesh are great ways to debug your current
>> situation. However, it is hard to answer this question without specifics of
>> the distro. The issue is that different vendors and apache seems to have
>> slightly different recommendation.
>>
>>
>>
>> If you are working with apache – here are the instructions on how to
>> setup the Kerberos
>> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SecureMode.html
>>
>> If you are working with Hortonworks -
>> https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/ch_security_for_manual_installs_chapter.html
>>
>> If you are working with Cloudera -
>> http://www.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_sg_cdh5_hadoop_security.html#topic_3
>>
>> If you are working with Ambari -
>> https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_Ambari_Security_Guide/content/ch_amb_sec_guide.html
>>
>>
>>
>> It looks like you are running your Namenode without having correct
>> Kerberos configuration – it could be an issue with keytab or Kerberos
>> principal that is configured.
>>
>> I would start by running the klist command in the apache page, confirm
>> that you have right tickets, and then verify that Namenode is configured to
>> use the correct Kerberos principal.
>>
>>
>>
>> Not to scare you, but if you are completely new Kerberos and Hadoop – You
>> can read this  --
>> https://www.gitbook.com/book/steveloughran/kerberos_and_hadoop/details
>> -- Even though Steve makes is sound painful and scary -- once you set it
>> up,  you will feel that it was not that hard.
>>
>>
>>
>> If you have some background in Kerberos –understanding that Kerberos is
>> used slightly differently in Hadoop is also useful – especially if you have
>> to debug your cluster – The PDF attached to this JIRA gives you some
>> background. https://issues.apache.org/jira/browse/HADOOP-4487
>>
>>
>>
>> After the setup of your cluster, if you are still having issues with
>> other services or HDFS – This is a diagnostic tool that can help you.
>> https://github.com/steveloughran/kdiag
>>
>>
>>
>> Thanks
>>
>> Anu
>>
>>
>>
>>
>>
>> *From: *Rakesh Radhakrishnan 
>> *Date: *Thursday, June 9, 2016 at 10:11 AM
>> *To: *Hafiz Mujadid 
>> *Cc: *"user@hadoop.apache.org" 
>> *Subject: *Re: Failed to start namenode.
>>
>>
>>
>> Hi,
>>
>>
>>
>> Could you please check kerberos principal name is specified correctly in
>> "hdfs-site.xml", which is used to authenticate against Kerberos. If using
>> _HOST variable in hdfs-site.xml, ensure that hostname is getting resolved
>> and it matches with the principal name.
>>
>>
>>
>> If keytab file defined in "hdfs-site.xml" is not present you will see
>> this error. So, please verify the path and the keytab filename correctly
>> configured.
>>
>>
>>
>> Also, did you verify manual kinit using the principal name and keytab. Is
>> that working for you?
>>
>>
>>
>> Please share "hdfs-site.xml" config file to know more about your
>> configurations.
>>
>>
>>
>> Regards,
>>
>> Rakesh
>>
>>
>>
>> On Thu, Jun 9, 2016 at 6:21 PM, Hafiz Mujadid 
>> wrote:
>>
>> Hi,
>>
>> I have setup kerbores with hadoop and I am facing following exception
>> when i start hadoop.
>>
>> ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start
>> namenode.
>>
>> java.io.IOException: Login failure for admin/admin@queryiorealm from
>> keytab /usr/local/var/krb5kdc/kadm5.keytab:
>> javax.security.auth.login.LoginException: Unable to obtain password from
>> user
>>
>>
>>
>> at
>> org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupIn

Re: Setting up secure Multi-Node cluster

2016-06-27 Thread Rakesh Radhakrishnan
Hi Aneela,

IIUC, the NameNode and DataNode use the _HOST pattern in their principals,
and you need to create separate principals for the NN and DN if they run on
different machines. I hope the explanation below will help you.

"dfs.namenode.kerberos.principal" is typically set to nn/_HOST@REALM. Each
NameNode will substitute the _HOST with its own fully qualified hostname at
startup. The _HOST placeholder allows using the same configuration setting
on both Active and Standby NameNodes in an HA setup.

Similarly, "dfs.datanode.kerberos.principal" is set to dn/_HOST@REALM. Each
DataNode will substitute _HOST with its own fully qualified hostname at
startup. The _HOST placeholder allows using the same configuration setting
on all DataNodes.

Again, if you are using an HA setup with QJM,
"dfs.journalnode.kerberos.principal" is set to jn/_HOST@REALM.

>Do i need to copy all the kerberos configuration files like kdc.conf
and krb5.conf etc on every node in default locations?
Yes, you need to place these files in the appropriate paths on all the machines.

Regards,
Rakesh

On Tue, Jun 28, 2016 at 3:15 AM, Aneela Saleem 
wrote:

> Hi all,
>
> I have configured Kerberos for single node cluster successfully. I used
> this
> 
>  documentation
> for configurations. Now i'm enabling security for multi node cluster and i
> have some confusions about that. Like
>
> How principals would be managed for namenode and data node? because till
> now i had only one principal *hdfs/_HOST@platalyticsrealm *used for both
> namenode as well as for datanode? Do i need to add separate principals for
> both namenode and datanode having different hostname? for example:
> if my namenode hostname is *hadoop-master* then there should be principal
> added *nn/hadoop-master@platalyticsrealm *(with appropriate keytab file)
> if my datanode hostname is *hadoop-slave *then there should be principal
> added *dn/hadoop-slave@platalyticsrealm* (with appropriate keytab file)
>
> Do i need to copy all the kerberos configuration files like kdc.conf and
> krb5.conf etc on every node in default locations?
>
> A little guidance would be highly appreciated. Thanks
>


Re: Error in Hbase backup from secure to normal cluster.

2016-07-11 Thread Rakesh Radhakrishnan
Hi,

I hope you are executing the 'distcp' command from the secured cluster. Are
you executing the command as a non-super user? Please share the exact
command you are running, so I can understand how you are entering the
"super user credentials" and which -D command line args you pass.

Also, please share your hdfs-site.xml and core-site.xml configurations.

Have you modified any of these configurations, or passed them as -D command
line args?

In core-site.xml on the secure cluster side:

  <property>
    <name>ipc.client.fallback-to-simple-auth-allowed</name>
  </property>

In hdfs-site.xml on the secure cluster side:

  <property>
    <name>dfs.permissions.superusergroup</name>
    <description>This is by default set to an arbitrary string "superuser",
    which is mostly a non existing group on most (all?) environments.
    Changing this and/or creating such a group name on the NN machine will
    let you permit more users to act as superusers, if needed.</description>
  </property>

Again, I'd suggest posting this problem to the HBase user mailing list as
well, to learn about any specific configurations needed on the HBase side.

Regards,
Rakesh

On Mon, Jul 11, 2016 at 3:45 PM, mathes waran 
wrote:

> Hi,
>
> I tried distcp method in hbase backup from secure to normal cluster,In my
> backup job the data is transferred successfully from secure to normal
> cluster,but in resource manager job id is shown as failed due to error of
> Non super user cannot change owner.
>
> However, table and snapshot is created successfully and restoring table
> from normal to secure is executed successfully.
>
> But in HBase backup due to this error ,application Id is shown as failed.
>
> Please find the error details as below:
>
> *Caused by:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
> Non-super user cannot change owner*
>
> Even I entered super user credentials while executing distcp job from
> secure to normal cluster,this exception is occurred.
>
> please note that,this error is occurred only on backup from secure to
> normal,and its not occur on restoring table.
>
> Can u pls tell me is there any changes in configuration file or how could
> i solve this..?
>
> Looking back.!
>
> Thanks,
>
> Matheskrishna
>


Re: Error in Hbase backup from secure to normal cluster.

2016-07-11 Thread Rakesh Radhakrishnan
Hi Matheskrishna,

Adding one more thought to my above comments.

Since you are describing a job execution failure during distcp, I think it
would be good to analyse the failure logs of the job, to see whether the
problem is inside the staging directory in HDFS, or at least to get some
hint about the cause.

Regards,
Rakesh

On Mon, Jul 11, 2016 at 6:13 PM, Rakesh Radhakrishnan 
wrote:

> Hi,
>
> Hope you are executing 'distcp' command from the secured cluster. Are you
> executing the command from a non-super user? Please explain me the
> command/way you are executing to understand, how you are entering "entered
> super user credentials" and -D command line args.
>
> Also, please share your hdfs-site.xml, core-site.xml configurations.
>
> Have you modified any of these configurations or passed as -D command line
> args.
>
> In core-site.xml on the secure cluster side:
> 
>   ipc.client.fallback-to-simple-auth-allowed
> 
>
> In hdfs-site.xml on the secure cluster side:
> 
>   dfs.permissions.superusergroup
>   This is by default set to an arbitrary string "superuser",
> which is
> mostly a non existing group on most (all?) environments. Changing this
> and/or creating such a group name on the NN machine will let you
> permit more users to act as superusers, if needed.
>  
> 
>
> Again, I'd suggest you to post this problem to the hbase user mailing list
> to know any specific configurations from HBase side.
>
> Regards,
> Rakesh
>
> On Mon, Jul 11, 2016 at 3:45 PM, mathes waran 
> wrote:
>
>> Hi,
>>
>> I tried distcp method in hbase backup from secure to normal cluster,In my
>> backup job the data is transferred successfully from secure to normal
>> cluster,but in resource manager job id is shown as failed due to error of
>> Non super user cannot change owner.
>>
>> However, table and snapshot is created successfully and restoring table
>> from normal to secure is executed successfully.
>>
>> But in HBase backup due to this error ,application Id is shown as failed.
>>
>> Please find the error details as below:
>>
>> *Caused by:
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
>> Non-super user cannot change owner*
>>
>> Even I entered super user credentials while executing distcp job from
>> secure to normal cluster,this exception is occurred.
>>
>> please note that,this error is occurred only on backup from secure to
>> normal,and its not occur on restoring table.
>>
>> Can u pls tell me is there any changes in configuration file or how could
>> i solve this..?
>>
>> Looking back.!
>>
>> Thanks,
>>
>> Matheskrishna
>>
>
>


Re: Error in Hbase backup from secure to normal cluster.

2016-07-12 Thread Rakesh Radhakrishnan
>>>->Now I executed the hbase backup command with super user credentials
,job are submitted in resource manager,while executing it throws an error
of "Non super user cannot change owner" ,at same time backup is
completed,table is stored in destination cluster along with error
status.(Job Id is also faied in resource manager)

Could you please share the job execution logs? They will probably give us
some hints about the failure. Also, please share the complete command you
are executing for the HBase backup.

Thanks,
Rakesh

On Tue, Jul 12, 2016 at 11:42 AM, mathes waran 
wrote:

> Hi Rakesh Radhakrishnan,
>
>yes,I am executed this command from only secure cluster ,and I
> executed job with super user credentials.
> please find the details below which I followed:
> ->set ipc.client.fallback-to-simple-auth-allowed as true in HBase-site.xml
> for the purpose of allowing of data from secure to normal cluster.
> ->set property hdfs-site.xml on the secure cluster side:
>
>  
>  dfs.permissions.superusergroup
>  supergroupname
>  
>
> ->Now I executed the hbase backup command with super user credentials ,job
> are submitted in resource manager,while executing it throws an error of
> "Non super user cannot change owner" ,at same time backup is
> completed,table is stored in destination cluster along with error
> status.(Job Id is also faied in resource manager)
>
> ->For your information,while executing restore command the table is
> restored successfully without this kind of issues.
>
> So please could you explain how to solve this exception.
>
> Looking back for your response,
>
> Thanks,
> Matheskrishna
>
>
> On Mon, Jul 11, 2016 at 6:32 PM, Rakesh Radhakrishnan 
> wrote:
>
>> Hi Matheskrishna,
>>
>> Adding one more thought to my above comments.
>>
>> Since you are telling about the job execution failure on distcp, I think
>> it would be good to analyse the failure logs of the job to see any problem
>> is that inside the staging directory in hdfs or we may get some hint about
>> the cause.
>>
>> Regards,
>> Rakesh
>>
>> On Mon, Jul 11, 2016 at 6:13 PM, Rakesh Radhakrishnan > > wrote:
>>
>>> Hi,
>>>
>>> Hope you are executing 'distcp' command from the secured cluster. Are
>>> you executing the command from a non-super user? Please explain me the
>>> command/way you are executing to understand, how you are entering "entered
>>> super user credentials" and -D command line args.
>>>
>>> Also, please share your hdfs-site.xml, core-site.xml configurations.
>>>
>>> Have you modified any of these configurations or passed as -D command
>>> line args.
>>>
>>> In core-site.xml on the secure cluster side:
>>> 
>>>   ipc.client.fallback-to-simple-auth-allowed
>>> 
>>>
>>> In hdfs-site.xml on the secure cluster side:
>>> 
>>>   dfs.permissions.superusergroup
>>>   This is by default set to an arbitrary string
>>> "superuser", which is
>>> mostly a non existing group on most (all?) environments. Changing
>>> this
>>> and/or creating such a group name on the NN machine will let you
>>> permit more users to act as superusers, if needed.
>>>  
>>> 
>>>
>>> Again, I'd suggest you to post this problem to the hbase user mailing
>>> list to know any specific configurations from HBase side.
>>>
>>> Regards,
>>> Rakesh
>>>
>>> On Mon, Jul 11, 2016 at 3:45 PM, mathes waran 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I tried distcp method in hbase backup from secure to normal cluster,In
>>>> my backup job the data is transferred successfully from secure to normal
>>>> cluster,but in resource manager job id is shown as failed due to error of
>>>> Non super user cannot change owner.
>>>>
>>>> However, table and snapshot is created successfully and restoring table
>>>> from normal to secure is executed successfully.
>>>>
>>>> But in HBase backup due to this error ,application Id is shown as
>>>> failed.
>>>>
>>>> Please find the error details as below:
>>>>
>>>> *Caused by:
>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
>>>> Non-super user cannot change owner*
>>>>
>>>> Even I entered super user credentials while executing distcp job from
>>>> secure to normal cluster,this exception is occurred.
>>>>
>>>> please note that,this error is occurred only on backup from secure to
>>>> normal,and its not occur on restoring table.
>>>>
>>>> Can u pls tell me is there any changes in configuration file or how
>>>> could i solve this..?
>>>>
>>>> Looking back.!
>>>>
>>>> Thanks,
>>>>
>>>> Matheskrishna
>>>>
>>>
>>>
>>
>


Re: Subcribe

2016-07-17 Thread Rakesh Radhakrishnan
Hi Sandeep,

Please go through the web page
https://hadoop.apache.org/mailing_lists.html and you can subscribe by
following the steps described there.

Regards,
Rakesh

On Mon, Jul 18, 2016 at 8:32 AM, sandeep vura  wrote:

> Hi Team,
>
> please add my email id in subscribe list.
>
> Regards,
> Sandeep.v
>


Re: Standby Namenode getting RPC latency alerts

2016-07-17 Thread Rakesh Radhakrishnan
Hi Sandeep,

This alert could be triggered if the NameNode RPC latency exceeds a certain
threshold value. Sometimes an increase in the RPC processing time increases
the length of the call queue and results in this situation. Could you
please provide more details about the client operations you are performing
that could cause too many NameNode operations? Perhaps you can check your
client applications and their logs to get any info/hint. Also, do you see
any heavy CPU utilization? Could you please share the logs of both
NameNodes, the clients, etc.
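
If it helps, the RPC metrics are exposed through the NameNode's JMX
servlet, so a request along these lines (assuming the default HTTP port
50070 and RPC port 8020; adjust for your setup) shows the current values of
metrics such as CallQueueLength, RpcQueueTimeAvgTime and
RpcProcessingTimeAvgTime:

  curl 'http://<standby-nn-host>:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020'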

Regards,
Rakesh

On Mon, Jul 18, 2016 at 8:35 AM, sandeep vura  wrote:

> Hi Team,
>
> We are getting rpc latency alerts from the standby namenode. What does it
> means? Where to check the logs for the root cause?
>
>
> I have already checked standby namenode logs but didn't find any specific
> error.
>
>
> Regards,
> Sandeep.v
>
>


Re: Hadoop Installation on Windows 7 in 64 bit

2016-07-17 Thread Rakesh Radhakrishnan
>>>I couldn't find folder* conf in *hadoop home.

Could you check the %HADOOP_HOME%/etc/hadoop/hadoop-env.cmd path? It may be
at U:/Desktop/hadoop-2.7.2/etc/hadoop/hadoop-env.cmd. In Hadoop 2.x there
is no "conf" folder; the configuration lives under etc/hadoop.

Typically HADOOP_CONF_DIR is set to %HADOOP_HOME%/etc/hadoop. Could you
check the value of the "HADOOP_CONF_DIR" env variable, i.e. the location of
the Hadoop cluster configuration?
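
Just in case it helps: on Windows this error is often caused by a space in
the JDK path. Setting JAVA_HOME in etc/hadoop/hadoop-env.cmd with the 8.3
short path usually fixes it (the JDK folder name below is only an example,
use your installed version):

  set JAVA_HOME=C:\PROGRA~1\Java\jdk1.8.0_92

Here PROGRA~1 is the short name for "Program Files", which avoids the space
in the path.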

Regards,
Rakesh


On Mon, Jul 18, 2016 at 10:47 AM, Vinodh Nagaraj 
wrote:

> Hi All,
>
> I tried to install Hadoop hadoop-2.7.2 on Windows 7 in 64 bit machine.
> Java version 1.8.set path variables. It works fine.
>
> Trying to execute* start-all.cmd.got. *Got the below error*.* I couldn't
> find folder* conf in *hadoop home.
>
> U:\Desktop\hadoop-2.7.2\sbin>start-all.cmd
> This script is Deprecated. Instead use start-dfs.cmd and start-yarn.cmd
> Error: JAVA_HOME is incorrectly set.
>Please update U:\Desktop\hadoop-2.7.2\*conf\*hadoop-env.cmd
>
>
> Any Suggestions.
>
>
> Thanks,
> Vinodh.N
>


Re: About Archival Storage

2016-07-19 Thread Rakesh Radhakrishnan
>>>>Is that mean I should config dfs.replication with 1 ?  if more than one
I should not use *Lazy_Persist*  policies ?

The idea of the Lazy_Persist policy is that, while writing blocks, one
replica is placed in memory first and then lazily persisted to DISK. It
doesn't mean that you are not allowed to configure dfs.replication > 1. If
'dfs.replication' is configured > 1, then the first replica will be placed
in RAM_DISK and all the other (n-1) replicas will be written to DISK. Here
the (n-1) replicas will have the overhead of pipeline replication over the
network and the DISK write latency on the hot write path, so you will not
get better performance results.

IIUC, to get the memory latency benefits, it is recommended to use
replication=1. In this way, applications should be able to perform single
replica writes to a local DN with low latency. HDFS will store the block
data in memory and lazily save it to disk, avoiding disk write latency on
the hot path. By writing to local memory we can also avoid checksum
computation on the hot path.
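
For reference, using Lazy_Persist also requires a RAM_DISK storage
directory on the datanodes; a minimal sketch (the tmpfs mount point and
disk directory below are only examples) looks like:

  <property>
    <name>dfs.datanode.data.dir</name>
    <value>[RAM_DISK]/mnt/dn-tmpfs,[DISK]/data/dn</value>
  </property>

and the policy itself can then be applied with something like
"hdfs storagepolicies -setStoragePolicy -path <dir> -policy LAZY_PERSIST".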

Regards,
Rakesh

On Tue, Jul 19, 2016 at 3:25 PM, kevin  wrote:

> I don't quite understand :"Note that the Lazy_Persist policy is useful
> only for single replica blocks. For blocks with more than one replicas, all
> the replicas will be written to DISK since writing only one of the replicas
> to RAM_DISK does not improve the overall performance."
>
> Is that mean I should config dfs.replication with 1 ?  if more than one I
> should not use *Lazy_Persist*  policies ?
>


Re: ZKFC do not work in Hadoop HA

2016-07-19 Thread Rakesh Radhakrishnan
Hi Alexandr,

I can see the following warning message in your logs, and it is the reason
for the unsuccessful fencing. Could you please check whether the 'fuser'
command can be executed on your system?

2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser:
command not found
2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc: 127
2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Disconnecting from hadoopActiveMaster port 22
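
If 'fuser' is missing, it is usually provided by the 'psmisc' package, so
installing that on the NameNode hosts should fix it, e.g. (depending on
your distro):

  yum install psmisc        # RHEL/CentOS
  apt-get install psmisc    # Debian/Ubuntu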

Also, I'd suggest visiting
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
to understand more about the fencing logic. On that page you can search for
the "*dfs.ha.fencing.methods*" configuration.

Regards,
Rakesh

On Tue, Jul 19, 2016 at 7:22 PM, Alexandr Porunov <
alexandr.poru...@gmail.com> wrote:

> Hello,
>
> I have a problem with ZKFC.
> I have configured High Availability for Hadoop with QJM.
> The problem is that when I turn off the active master node (or kill the
> namenode process) standby node does not want to change its status from
> standby to active. So it continues to be the standby node.
>
> I was watching the log file of ZKFC when I turned off the active node. It
> started trying to connect to the active node (which already died) to change
> its status from active to standby.
> But the active node already died, so it is impossible to connect to the
> dead active master node.
> Then I turned on the active master node. After that my standby node
> connected to the old active master node and changed the status of the
> active node from active to standby and the status of standby node from
> standby to active.
>
> It is really strange. After the crash of the active node the ZKFC wants to
> connect to the dead node. Before connection is established ZKFC doesn't
> want to change the status of standby node to active.
>
> Why is it happens?
>
> Here my log from zkfc (I cut it because it repeats all the time. After
> this part of logs it logger writes the same thing):
>
> 2016-07-19 14:43:21,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Checking for any old active which needs to be fenced...
> 2016-07-19 14:43:21,957 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Old node exists: 0a0a68612d636c757374657212036e6e311a12686164
> 6f6f704163746976654d617374657220d43e28d33e
> 2016-07-19 14:43:21,978 INFO org.apache.hadoop.ha.ZKFailoverController:
> Should fence: NameNode at hadoopActiveMaster/192.168.0.80:8020
> 2016-07-19 14:43:22,995 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: hadoopActiveMaster/192.168.0.80:8020. Already tried 0
> time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1,
> sleepTime=1000 MILLISECONDS)
> 2016-07-19 14:43:23,001 WARN org.apache.hadoop.ha.FailoverController:
> Unable to gracefully make NameNode at hadoopActiveMaster/192.168.0.80:8020
> standby (unable to connect)
> java.net.ConnectException: Call From hadoopStandby/192.168.0.81 to
> hadoopActiveMaster:8020 failed on connection exception:
> java.net.ConnectException: Connection refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at
> org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
> at org.apache.hadoop.ipc.Client.call(Client.java:1479)
> at org.apache.hadoop.ipc.Client.call(Client.java:1412)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
> at
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
> at
> org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
> at
> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
> at
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
> at
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
> at
> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
> at
> org.apache.hadoop.ha.ActiveStan

Re: ZKFC do not work in Hadoop HA

2016-07-19 Thread Rakesh Radhakrishnan
Good to hear the problem is resolved and you are able to continue.

Regards,
Rakesh

On Tue, Jul 19, 2016 at 10:31 PM, Alexandr Porunov <
alexandr.poru...@gmail.com> wrote:

> Rakesh,
>
> Thank you very much. I missed it. I hadn't "fuser" command on my nodes.
> I've just installed it. ZKFC became work properly!
>
> Best regards,
> Alexandr
>
> On Tue, Jul 19, 2016 at 5:29 PM, Rakesh Radhakrishnan 
> wrote:
>
>> Hi Alexandr,
>>
>> I could see the following warning message in your logs and is the reason
>> for unsuccessful fencing. Could you please check 'fuser' command
>> execution in your system.
>>
>> 2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
>> PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser:
>> command not found
>> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc:
>> 127
>> 2016-07-19 14:43:23,706 INFO
>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Disconnecting from
>> hadoopActiveMaster port 22
>>
>> Also, I'd suggest to visit
>> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
>> page to understand more about the fencing logic. In this page you can
>> search for "*dfs.ha.fencing.methods*" configuration.
>>
>> Regards,
>> Rakesh
>>
>> On Tue, Jul 19, 2016 at 7:22 PM, Alexandr Porunov <
>> alexandr.poru...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have a problem with ZKFC.
>>> I have configured High Availability for Hadoop with QJM.
>>> The problem is that when I turn off the active master node (or kill the
>>> namenode process) standby node does not want to change its status from
>>> standby to active. So it continues to be the standby node.
>>>
>>> I was watching the log file of ZKFC when I turned off the active node.
>>> It started trying to connect to the active node (which already died) to
>>> change its status from active to standby.
>>> But the active node already died, so it is impossible to connect to the
>>> dead active master node.
>>> Then I turned on the active master node. After that my standby node
>>> connected to the old active master node and changed the status of the
>>> active node from active to standby and the status of standby node from
>>> standby to active.
>>>
>>> It is really strange. After the crash of the active node the ZKFC wants
>>> to connect to the dead node. Before connection is established ZKFC doesn't
>>> want to change the status of standby node to active.
>>>
>>> Why is it happens?
>>>
>>> Here my log from zkfc (I cut it because it repeats all the time. After
>>> this part of logs it logger writes the same thing):
>>>
>>> 2016-07-19 14:43:21,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>>> Checking for any old active which needs to be fenced...
>>> 2016-07-19 14:43:21,957 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>>> Old node exists: 0a0a68612d636c757374657212036e6e311a12686164
>>> 6f6f704163746976654d617374657220d43e28d33e
>>> 2016-07-19 14:43:21,978 INFO org.apache.hadoop.ha.ZKFailoverController:
>>> Should fence: NameNode at hadoopActiveMaster/192.168.0.80:8020
>>> 2016-07-19 14:43:22,995 INFO org.apache.hadoop.ipc.Client: Retrying
>>> connect to server: hadoopActiveMaster/192.168.0.80:8020. Already tried
>>> 0 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
>>> MILLISECONDS)
>>> 2016-07-19 14:43:23,001 WARN org.apache.hadoop.ha.FailoverController:
>>> Unable to gracefully make NameNode at hadoopActiveMaster/
>>> 192.168.0.80:8020 standby (unable to connect)
>>> java.net.ConnectException: Call From hadoopStandby/192.168.0.81 to
>>> hadoopActiveMaster:8020 failed on connection exception:
>>> java.net.ConnectException: Connection refused; For more details see:
>>> http://wiki.apache.org/hadoop/ConnectionRefused
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>> Method)
>>> at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>> at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> at
>>> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>>

Re: ZKFC fencing problem after the active node crash

2016-07-19 Thread Rakesh Radhakrishnan
Hi Alexandr,

Since you powered off the Active NN machine, during failover the standby NN
timed out trying to connect to that machine and fencing failed. Typically,
fencing methods are configured so that multiple writers cannot write to the
same shared storage. It looks like you are using QJM, which supports the
fencing feature on its own, i.e. it won't allow multiple writers at a time,
so I think the external fencing method can be skipped. AFAIK, to improve
the availability of the system in the event that the fencing mechanisms
fail, it is advisable to configure a fencing method which is guaranteed to
return success. You can remove the SSH fencing method from the
configuration on both machines. Please try the shell-based fence method
below just to skip SSH fencing and restart the cluster; then failover will
happen successfully.


  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>shell(/bin/true)</value>
  </property>


*Reference:-*
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
"*JournalNodes will only ever allow a single NameNode to be a writer at a
time. During a failover, the NameNode which is to become active will simply
take over the role of writing to the JournalNodes, which will effectively
prevent the other NameNode from continuing in the Active state, allowing
the new Active to safely proceed with failover*."

Regards,
Rakesh

On Wed, Jul 20, 2016 at 12:52 AM, Alexandr Porunov <
alexandr.poru...@gmail.com> wrote:

> Hello,
>
> I have configured Hadoop HA cluster. It works like in tutorials. If I kill
> Namenode process with command "kill -9 NameNodeProcessId" my standby node
> changes its state to active. But if I power off active node then standby
> node can't change its state to active because it trys to connect to the
> crashed node by using SSH.
>
> This parameter doesn't work:
> 
> dfs.ha.fencing.ssh.connect-timeout
> 3000
> 
>
> I read from the documentation that it is 5 second by default. But even
> after 5 minutes standby node continue try to connect to crashed node. I set
> it manually for 3 second but it still doesn't work. So, if we just kill
> namenode process our cluster works but if we crash active node our cluster
> become unavailable.
>
> *Here is the part of ZKFC logs (After the crash the logger writes the same
> information infinitely)*:
> 2016-07-19 20:56:24,139 INFO org.apache.hadoop.ha.NodeFencer: ==
> Beginning Service Fencing Process... ==
> 2016-07-19 20:56:24,139 INFO org.apache.hadoop.ha.NodeFencer: Trying
> method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
> 2016-07-19 20:56:24,141 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
> Connecting to hadoopActiveMaster...
> 2016-07-19 20:56:24,141 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Connecting to hadoopActiveMaster port 22
> 2016-07-19 20:56:27,148 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
> Unable to connect to hadoopActiveMaster as user hadoop
> com.jcraft.jsch.JSchException: timeout: socket is not established
> at com.jcraft.jsch.Util.createSocket(Util.java:386)
> at com.jcraft.jsch.Session.connect(Session.java:182)
> at
> org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
> at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
> at
> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:532)
> at
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
> at
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
> at
> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2016-07-19 20:56:27,149 WARN org.apache.hadoop.ha.NodeFencer: Fencing
> method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsucce
> ssful.
> 2016-07-19 20:56:27,149 ERROR org.apache.hadoop.ha.NodeFencer: Unable to
> fence service by any configured method.
> 2016-07-19 20:56:27,150 WARN org.apache.hadoop.ha.ActiveStandbyElector:
> Exception handling the winning of election
> java.lang.RuntimeException: Unable to fence NameNode at hadoopActiveMaster/
> 192.168.0.80:8020
> at
> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
> at
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
> at
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
> at
> org.apache.hadoop.ha.ZKFailoverController$Ele

Re: About Archival Storage

2016-07-19 Thread Rakesh Radhakrishnan
>>>I have another question is , hdfs mover (A New Data Migration Tool )
know when to move data from hot to cold  automatically ?
When you run the tool, it reads the arguments and gets the list of HDFS
files/dirs to migrate. It then periodically scans these files in HDFS to
check whether the block placement satisfies the storage policy; if not, it
moves replicas to a different storage type in order to fulfill the storage
policy requirement. This cycle continues until it hits an error, there are
no more blocks to move, etc. Could you please tell me what you mean by
"automatically"? FYI, HDFS-10285 proposes introducing a daemon thread in
the Namenode to track the storage movements requested by clients through
APIs. This daemon thread, named StoragePolicySatisfier (SPS), serves a
purpose similar to the ReplicationMonitor. If interested, you can read the
https://goo.gl/NA5EY0 proposal/idea; feedback is welcome.

The sleep time between each cycle is ('dfs.heartbeat.interval' * 2000) +
('dfs.namenode.replication.interval' * 1000) milliseconds.
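For example, with the default values dfs.heartbeat.interval=3 and
dfs.namenode.replication.interval=3 (both in seconds), that works out to
3 * 2000 + 3 * 1000 = 9000 milliseconds, i.e. the Mover re-scans roughly
every 9 seconds.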

>>>It use algorithm like LRU、LFU ?
It simply iterates over the lists in the order of the files/dirs given to
the tool as arguments; AFAIK, it just maintains the order specified by the
user, there is no LRU/LFU-style algorithm.

Regards,
Rakesh


On Wed, Jul 20, 2016 at 7:05 AM, kevin  wrote:

> Thanks a lot Rakesh.
>
> I have another question is , hdfs mover (A New Data Migration Tool ) know
> when to move data from hot to cold  automatically ? It use algorithm
> like LRU、LFU ?
>
> 2016-07-19 19:55 GMT+08:00 Rakesh Radhakrishnan :
>
>> >>>>Is that mean I should config dfs.replication with 1 ?  if more than
>> one I should not use *Lazy_Persist*  policies ?
>>
>> The idea of Lazy_Persist policy is, while writing blocks, one replica
>> will be placed in memory first and then it is lazily persisted into DISK.
>> It doesn't means that, you are not allowed to configure dfs.replication >
>> 1. If 'dfs.replication' is configured > 1 then the first replica will be
>> placed in RAM_DISK and all the other replicas (n-1) will be written to the
>> DISK. Here the (n-1) replicas will have the overhead of pipeline
>> replication over the network and the DISK write latency on the write hot
>> path. So you will not get better performance results.
>>
>> IIUC, for getting memory latency benefits, it is recommended to use
>> replication=1. In this way, applications should be able to perform single
>> replica writes to a local DN with low latency. HDFS will store block data
>> in memory and lazily save it to disk avoiding incurring disk write latency
>> on the hot path. By writing to local memory we can also avoid checksum
>> computation on the hot path.
>>
>> Regards,
>> Rakesh
>>
>> On Tue, Jul 19, 2016 at 3:25 PM, kevin  wrote:
>>
>>> I don't quite understand :"Note that the Lazy_Persist policy is useful
>>> only for single replica blocks. For blocks with more than one replicas, all
>>> the replicas will be written to DISK since writing only one of the replicas
>>> to RAM_DISK does not improve the overall performance."
>>>
>>> Is that mean I should config dfs.replication with 1 ?  if more than one
>>> I should not use *Lazy_Persist*  policies ?
>>>
>>
>>
>


Re: About Archival Storage

2016-07-19 Thread Rakesh Radhakrishnan
Based on the storage policy, data will be moved from hot storage to cold
storage. The storage policy defines the number of replicas to be located on
each storage type. It is possible to change the storage policy on a
directory (for example: HOT to COLD) and then invoke the 'Mover' tool on
that directory to make the policy effective. One can set/change the storage
policy via the HDFS command "hdfs storagepolicies -setStoragePolicy -path
<path> -policy <policy>". After setting the new policy, you need to run the
tool; it identifies the replicas to be moved based on the storage policy
information and schedules the movement between source and destination
datanodes to satisfy the policy. Internally, the tool compares the 'storage
type' of a block in order to fulfill the 'storage policy' requirement.
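
Just to illustrate the flow (the directory name below is only an example):

  hdfs storagepolicies -setStoragePolicy -path /archive -policy COLD
  hdfs mover -p /archive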

You can probably refer to
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
to learn more about storage types, storage policies and the related HDFS
commands. Hope this helps.

Rakesh

On Wed, Jul 20, 2016 at 10:30 AM, kevin  wrote:

> Thanks again. "automatically" what I mean is the hdfs mover knows the hot
> data have come to cold , I don't need to tell it what exactly files/dirs
> need to be move now ?
> Of course I should tell it what files/dirs need to monitoring.
>
> 2016-07-20 12:35 GMT+08:00 Rakesh Radhakrishnan :
>
>> >>>I have another question is , hdfs mover (A New Data Migration Tool )
>> know when to move data from hot to cold  automatically ?
>> While running the tool, it reads the argument and get the separated list
>> of hdfs files/dirs to migrate. Then it periodically scans these files in
>> HDFS to check if the block placement satisfies the storage policy, if not
>> satisfied it moves the replicas to a different storage type in order to
>> fulfill the storage policy requirement. This cycle continues until it hits
>> an error or no blocks to move etc. Could you please tell me, what do you
>> meant by "automatically" ? FYI, HDFS-10285 is proposing an idea to
>> introduce a daemon thread in Namenode to track the storage movements set by
>> APIs from clients. This Daemon thread named as StoragePolicySatisfier(SPS)
>> serves something similar to ReplicationMonitor. If interested you can read
>> the https://goo.gl/NA5EY0 proposal/idea and welcome feedback.
>>
>> Sleep time between each cycle is, ('dfs.heartbeat.interval' * 2000) +
>> ('dfs.namenode.replication.interval' * 1000) milliseconds;
>>
>> >>>It use algorithm like LRU、LFU ?
>> It will simply iterating over the lists in the order of files/dirs given
>> to this tool as an argument. afaik, its just maintains the order mentioned
>> by the user.
>>
>> Regards,
>> Rakesh
>>
>>
>> On Wed, Jul 20, 2016 at 7:05 AM, kevin  wrote:
>>
>>> Thanks a lot Rakesh.
>>>
>>> I have another question is , hdfs mover (A New Data Migration Tool )
>>> know when to move data from hot to cold  automatically ? It
>>> use algorithm like LRU、LFU ?
>>>
>>> 2016-07-19 19:55 GMT+08:00 Rakesh Radhakrishnan :
>>>
>>>> >>>>Is that mean I should config dfs.replication with 1 ?  if more
>>>> than one I should not use *Lazy_Persist*  policies ?
>>>>
>>>> The idea of Lazy_Persist policy is, while writing blocks, one replica
>>>> will be placed in memory first and then it is lazily persisted into DISK.
>>>> It doesn't means that, you are not allowed to configure dfs.replication >
>>>> 1. If 'dfs.replication' is configured > 1 then the first replica will be
>>>> placed in RAM_DISK and all the other replicas (n-1) will be written to the
>>>> DISK. Here the (n-1) replicas will have the overhead of pipeline
>>>> replication over the network and the DISK write latency on the write hot
>>>> path. So you will not get better performance results.
>>>>
>>>> IIUC, for getting memory latency benefits, it is recommended to use
>>>> replication=1. In this way, applications should be able to perform single
>>>> replica writes to a local DN with low latency. HDFS will store block data
>>>> in memory and lazily save it to disk avoiding incurring disk write latency
>>>> on the hot path. By writing to local memory we can also avoid checksum
>>>> computation on the hot path.
>>>>
>>>> Regards,
>>>> Rakesh
>>>>
>>>> On Tue, Jul 19, 2016 at 3:25 PM, kevin  wrote:
>>>>
>>>>> I don't quite understand :"Note that the Lazy_Persist policy is useful
>>>>> only for single replica blocks. For blocks with more than one replicas, 
>>>>> all
>>>>> the replicas will be written to DISK since writing only one of the 
>>>>> replicas
>>>>> to RAM_DISK does not improve the overall performance."
>>>>>
>>>>> Is that mean I should config dfs.replication with 1 ?  if more than
>>>>> one I should not use *Lazy_Persist*  policies ?
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Start client side daemon

2016-07-22 Thread Rakesh Radhakrishnan
Hi Kun,

HDFS won't start any client-side daemon. Client-side objects (for example,
DistributedFileSystem) are created inside the user application itself: user
applications access the file system using the HDFS client, a library that
exports the HDFS file system interface. Perhaps you can visit the API docs,
https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FileSystem.html
.
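
As a rough illustration of what "client side" means here (a minimal sketch;
the namenode URI below is just a placeholder), an application simply
creates a FileSystem object in its own JVM and talks to the
NameNode/DataNodes over RPC, with no separate client daemon:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // The client object lives inside this application process.
      FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
      // List the root directory, similar to "hadoop fs -ls /".
      for (FileStatus status : fs.listStatus(new Path("/"))) {
        System.out.println(status.getPath());
      }
      fs.close();
    }
  }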

The Namenode has an RPC server that listens to requests from datanodes and
clients. The Datanode has an RPC server which is used for inter-datanode
communication. I think it's worth reading the following to get more
information.
information.
https://en.wikipedia.org/wiki/Apache_Hadoop
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#The_Communication_Protocols

Regards,
Rakesh
Intel

On Fri, Jul 22, 2016 at 7:21 PM, Kun Ren  wrote:

> Hi Genius,
>
> I understand that we use the command to start namenode and datanode. But I
> don't know how HDFS starts client side and creates the Client side
> object(Like DistributedFileSystem), and client side RPC server? Could you
> please point it out how HDFS start the client side dameon?
> If the client side uses the same RPC server with server side, Can I
> understand that the client side has to be located at either Namenode or
> Datanode?
>
> Thanks so much.
> Kun
>


Re: Start client side daemon

2016-07-22 Thread Rakesh Radhakrishnan
Sorry, could you please tell me what you mean by "located"?

Like I mentioned earlier, user applications talk to HDFS via the client
library. For example, operations like create/open/write/read/delete on
files go through the client.

Thanks,
Rakesh


On Fri, Jul 22, 2016 at 8:38 PM, Kun Ren  wrote:

> Thanks for your reply. So The clients can be located at any machine that
> has the HDFS client library, correct?
>
> On Fri, Jul 22, 2016 at 10:50 AM, Rakesh Radhakrishnan  > wrote:
>
>> Hi Kun,
>>
>> HDFS won't start any client side object(for example,
>> DistributedFileSystem). I can say, HDFS Client -> user applications access
>> the file system using the HDFS client, a library that exports the HDFS file
>> system interface. Perhaps, you can visit api docs,
>> https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FileSystem.html
>> .
>>
>> Namenode has RPC server that listens to requests from data nodes,
>> clients. Datanode has RPC Server which will be used for inter data node
>> communications. I think, its worth reading the following to get more
>> information.
>> https://en.wikipedia.org/wiki/Apache_Hadoop
>>
>> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#The_Communication_Protocols
>>
>> Regards,
>> Rakesh
>> Intel
>>
>> On Fri, Jul 22, 2016 at 7:21 PM, Kun Ren  wrote:
>>
>>> Hi Genius,
>>>
>>> I understand that we use the command to start namenode and datanode. But
>>> I don't know how HDFS starts client side and creates the Client side
>>> object(Like DistributedFileSystem), and client side RPC server? Could you
>>> please point it out how HDFS start the client side dameon?
>>> If the client side uses the same RPC server with server side, Can I
>>> understand that the client side has to be located at either Namenode or
>>> Datanode?
>>>
>>> Thanks so much.
>>> Kun
>>>
>>
>>
>


Re: Improving recovery performance for degraded reads

2016-07-27 Thread Rakesh Radhakrishnan
Hi Roy,

>>>> (a) In your last email, I am sure you meant => "... submitting read
requests to fetch "any" (instead of all) the 'k' chunk (out of k+m-x
surviving chunks)  ?
>>>> Do you have any optimization in place to decide which data-nodes will
be part of those "k" ?

Answer:-
I hope you know the write path; I'm just adding a few details here to
support the explanation of the read path. While writing to an EC file, the
DFS client writes data stripes (e.g. with a 64KB cell size) to multiple
datanodes. For a (k, m) schema, the client writes the data blocks to the
first k datanodes and the parity blocks to the remaining m datanodes. One
stripe is (k * cellSize + m * cellSize) of data. While reading, the client
fetches in the same order, stripe by stripe. The datanodes with data blocks
are contacted before the datanodes with parity blocks, because that needs
less EC block reconstruction work. Internally, the DFS client reads the
stripes one by one and contacts the k data-block datanodes in parallel for
each stripe. If there are any failures, it contacts the parity datanodes
and does the reconstruction on the fly. 'DFSStripedInputStream' supports
both positional reads and reading an entire buffer (e.g. a file-sized
buffer).

>>>> (b) Are there any caching being done (as proposed for QFS in the
previously attached "PPR" paper) ?
Answer:-
There is an open jira, HDFS-9879, to discuss caching of striped blocks at
the datanode. Perhaps caching logic similar to QFS could be utilized, and
during reconstruction the datanodes that have already cached the data in
memory could be chosen. This is an open improvement task as of now.

>>>> (c) When you mentioned stripping is being done, I assume it is
probably to reduce the chunk sizes and hence k*c ?
Answer:-
Yes, striping is done by dividing the block into several chunks; the chunk
size is called the cellSize (e.g. 64KB). (k * c + m * c) is one stripe. A
block group comprises several stripes. I'd suggest reading the blog
http://blog.cloudera.com/blog/2015/09/introduction-to-hdfs-erasure-coding-in-apache-hadoop/
to understand more about the stripe, cell and block group terminology
before reading the answer below.
   blk_0  blk_1   blk_2
 | ||
 vv   v
  +--+   +--+   +--+
  |cell_0|   |cell_1|   |cell_2|
  +--+   +--+   +--+
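
In concrete numbers (taking a (6, 3) schema with 64KB cells purely as an
illustration): one stripe carries 6 * 64KB = 384KB of data plus 3 * 64KB =
192KB of parity, so the client talks to 6 datanodes in parallel for every
384KB of user data it reads.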

>>>> Now, if my object sizes are large (e.g. super HD images) where I would
have to get data from multiple stripes to rebuild the images before I can
display to the
>>>> client, do you think stripping would still help ?
>>>> Is there a possibility that since I know that all the segments of the
HD image would always be read together, by stripping and distributing it on
different nodes, I am ignoring
>>>> its special/temporal locality and further increase any associated
delays ?

Answer:-
Since the client contacts all k datanodes for each stripe, slow or dead
datanodes in a data block stripe will affect the read performance. AFAIK,
for a large file the contiguous layout is more suitable; this will be
supported in phase-2 and design discussions are still going on, please see
the HDFS-8030 jira. On the other side, in theory the striping layout has
the benefit of enabling the client to work with multiple datanodes in
parallel, greatly enhancing the aggregate throughput (assuming that all
datanodes are good servers). But this needs to be tested in your cluster to
understand the impact.


Thanks,
Rakesh
Intel

On Sun, Jul 24, 2016 at 12:00 PM, Roy Leonard 
wrote:

> Hi Rakesh,
>
> Thanks for sharing your thoughts and updates.
>
> (a) In your last email, I am sure you meant => "... submitting read
> requests to fetch "any" (instead of all) the 'k' chunk (out of k+m-x
> surviving chunks)  ?
> Do you have any optimization in place to decide which data-nodes will be
> part of those "k" ?
>
> (b) Are there any caching being done (as proposed for QFS in the previously
> attached "PPR" paper) ?
>
> (c) When you mentioned stripping is being done, I assume it is probably to
> reduce the chunk sizes and hence k*c ?
> Now, if my object sizes are large (e.g. super HD images) where I would have
> to get data from multiple stripes to rebuild the images before I can
> display to the client, do you think stripping would still help ?
> Is there a possibility that since I know that all the segments of the HD
> image would always be read together, by stripping and distributing it on
> different nodes, I am ignoring its special/temporal locality and further
> increase any associated delays ?
>
> Just wanted to know your thoughts.
> I am looking forward to the future performance improvements in HDFS.
>
> Regards,
> R.
>

Re: [DISCUSS] Retire BKJM from trunk?

2016-07-27 Thread Rakesh Radhakrishnan
If I remember correctly, Huawei also adopted the QJM component. I hope
@Vinay has discussed this internally in Huawei before starting this e-mail
discussion thread. I'm +1 for removing the bkjm contrib from the trunk
code.

Also, there are quite a few open sub-tasks under the HDFS-3399 umbrella
jira, which was used at the time of the BKJM implementation. How about
closing these jiras by marking them as "Won't Fix"?

Thanks,
Rakesh
Intel

On Thu, Jul 28, 2016 at 1:53 AM, Sijie Guo  wrote:

> + Rakesh and Uma
>
> Rakesh and Uma might have a better idea on this. I think Huawei was using
> it when Rakesh and Uma worked there.
>
> - Sijie
>
> On Wed, Jul 27, 2016 at 12:06 PM, Chris Nauroth 
> wrote:
>
> > I recommend including the BookKeeper community in this discussion.  I’ve
> > added their user@ and dev@ lists to this thread.
> >
> > I do not see BKJM being used in practice.  Removing it from trunk would
> be
> > attractive in terms of less code for Hadoop to maintain and build, but if
> > we find existing users that want to keep it, I wouldn’t object.
> >
> > --Chris Nauroth
> >
> > On 7/26/16, 11:14 PM, "Vinayakumar B"  wrote:
> >
> > Hi All,
> >
> >BKJM was Active and made much stable when the NameNode HA was
> > implemented and there was no QJM implemented.
> >Now QJM is present and is much stable which is adopted by many
> > production environment.
> >I wonder whether it would be a good time to retire BKJM from
> trunk?
> >
> >Are there any users of BKJM exists?
> >
> > -Vinay
> >
> >
> >
>


Re: Teradata into hadoop Migration

2016-08-01 Thread Rakesh Radhakrishnan
Hi Bhagaban,

Perhaps you can try "Apache Sqoop" to transfer data from Teradata to
Hadoop. Apache Sqoop provides an efficient approach for transferring large
data sets between Hadoop-related systems and structured data stores. It
allows support for a data store to be added as a so-called connector, and
it can connect to various databases including Teradata, Oracle, etc.
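
Just as a rough sketch of what a Sqoop import from Teradata can look like
(the host, database, table, credentials and target directory below are
placeholders, and the exact connection-manager/driver options depend on the
connector you pick):

  sqoop import \
    --connect jdbc:teradata://td-host/DATABASE=sales \
    --driver com.teradata.jdbc.TeraDriver \
    --username td_user -P \
    --table ORDERS \
    --target-dir /user/bhagaban/orders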

I hope the below links will be helpful to you,
http://sqoop.apache.org/
http://blog.cloudera.com/blog/2012/01/cloudera-connector-for-teradata-1-0-0/
http://hortonworks.com/blog/round-trip-data-enrichment-teradata-hadoop/
http://dataconomy.com/wp-content/uploads/2014/06/Syncsort-A-123ApproachtoTeradataOffloadwithHadoop.pdf

Below are a few data ingestion tools; you can probably dig more into them:
https://www.datatorrent.com/product/datatorrent-ingestion/
https://www.datatorrent.com/dtingest-unified-streaming-batch-data-ingestion-hadoop/

Thanks,
Rakesh

On Mon, Aug 1, 2016 at 4:54 PM, Bhagaban Khatai 
wrote:

> Hi Guys-
>
> I need a quick help if anybody done any migration project in TD into
> hadoop.
> We have very tight deadline and I am trying to find any tool (online or
> paid) for quick development.
>
> Please help us here and guide me if any other way is available to do the
> development fast.
>
> Bhagaban
>


Re: Teradata into hadoop Migration

2016-08-04 Thread Rakesh Radhakrishnan
Sorry, I don't have much insight into this apart from basic Sqoop. AFAIK,
it is more vendor specific; you may need to dig further in that direction.

Thanks,
Rakesh

On Mon, Aug 1, 2016 at 11:38 PM, Bhagaban Khatai 
wrote:

> Thanks Rakesh for the useful information. But we are using sqoop for data
> transfer but all TD logic we are implementing thru Hive.
> But it's taking time by using mapping provided by TD team and the same
> logic we are implementing.
>
> What I want some tool or ready-made framework so that development effort
> would be less.
>
> Thanks in advance for your help.
>
> Bhagaban
>
> On Mon, Aug 1, 2016 at 6:07 PM, Rakesh Radhakrishnan 
> wrote:
>
>> Hi Bhagaban,
>>
>> Perhaps, you can try "Apache Sqoop" to transfer data to Hadoop from
>> Teradata. Apache Sqoop provides an efficient approach for transferring
>> large data between Hadoop related systems and structured data stores. It
>> allows support for a data store to be added as a so-called connector and
>> can connect to various databases including Oracle etc.
>>
>> I hope the below links will be helpful to you,
>> http://sqoop.apache.org/
>> http://blog.cloudera.com/blog/2012/01/cloudera-connector-
>> for-teradata-1-0-0/
>> http://hortonworks.com/blog/round-trip-data-enrichment-teradata-hadoop/
>> http://dataconomy.com/wp-content/uploads/2014/06/Syncsort-A-
>> 123ApproachtoTeradataOffloadwithHadoop.pdf
>>
>> Below are few data ingestion tools, probably you can dig more into it,
>> https://www.datatorrent.com/product/datatorrent-ingestion/
>> https://www.datatorrent.com/dtingest-unified-streaming-
>> batch-data-ingestion-hadoop/
>>
>> Thanks,
>> Rakesh
>>
>> On Mon, Aug 1, 2016 at 4:54 PM, Bhagaban Khatai > > wrote:
>>
>>> Hi Guys-
>>>
>>> I need a quick help if anybody done any migration project in TD into
>>> hadoop.
>>> We have very tight deadline and I am trying to find any tool (online or
>>> paid) for quick development.
>>>
>>> Please help us here and guide me if any other way is available to do the
>>> development fast.
>>>
>>> Bhagaban
>>>
>>
>>
>


Re: issue starting regionserver with SASL authentication failed

2016-08-06 Thread Rakesh Radhakrishnan
Hey Aneela,

I've filtered the output below from your log messages. It looks like you
have a "/ranger" directory under the root directory, and directory listing
is working fine.

*Found 1 items*
*drwxr-xr-x   - hdfs supergroup  0 2016-08-02 14:44 /ranger*

I think it is putting all the log messages on the console because some log
configuration may be missing; you may need to check the log configuration
on both the Kerberos and the Hadoop client side. Perhaps you can refer to
the HADOOP_LOG_DIR section in https://wiki.apache.org/hadoop/HowToConfigure,
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/ClusterSetup.html#Logging.
Also, for Kerberos you can try passing "-Dsun.security.krb5.debug=false"
when starting the JVM.
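
For instance, a minimal sketch of how that flag could be passed, assuming
you launch the shell commands through the standard scripts (hadoop-env.sh
is the usual place for this):

  # in hadoop-env.sh, or exported in the shell before running the command
  export HADOOP_OPTS="$HADOOP_OPTS -Dsun.security.krb5.debug=false"
  hadoop fs -ls /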

Thanks,
Rakesh
Intel

On Tue, Aug 2, 2016 at 10:35 PM, Aneela Saleem 
wrote:

> Hi all,
>
> I'm facing issue starting region server in HBase. I have enabled Kerberos
> debugging in Hadoop command line, so when i run the "hadoop fs -ls /"
> command, i get following output, I can't interpret this. Can anyone please
> tell me is something wrong with Kerberos configuration or everything is
> fine ?
>
>
> 16/08/02 18:34:10 DEBUG util.Shell: setsid exited with exit code 0
> 16/08/02 18:34:10 DEBUG conf.Configuration: parsing URL
> jar:file:/usr/local/hadoop/share/hadoop/common/hadoop-
> common-2.7.2.jar!/core-default.xml
> 16/08/02 18:34:10 DEBUG conf.Configuration: parsing input stream
> sun.net.www.protocol.jar.JarURLConnection$JarURLInputStream@4fbc7b65
> 16/08/02 18:34:10 DEBUG conf.Configuration: parsing URL
> file:/usr/local/hadoop/etc/hadoop/core-site.xml
> 16/08/02 18:34:10 DEBUG conf.Configuration: parsing input stream
> java.io.BufferedInputStream@69c1adfa
> 16/08/02 18:34:11 DEBUG lib.MutableMetricsFactory: field
> org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.
> UserGroupInformation$UgiMetrics.loginSuccess with annotation
> @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, value=[Rate
> of successful kerberos logins and latency (milliseconds)], about=,
> always=false, type=DEFAULT, sampleName=Ops)
> 16/08/02 18:34:11 DEBUG lib.MutableMetricsFactory: field
> org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.
> UserGroupInformation$UgiMetrics.loginFailure with annotation
> @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, value=[Rate
> of failed kerberos logins and latency (milliseconds)], about=,
> always=false, type=DEFAULT, sampleName=Ops)
> 16/08/02 18:34:11 DEBUG lib.MutableMetricsFactory: field
> org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.
> UserGroupInformation$UgiMetrics.getGroups with annotation
> @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time,
> value=[GetGroups], about=, always=false, type=DEFAULT, sampleName=Ops)
> 16/08/02 18:34:11 DEBUG impl.MetricsSystemImpl: UgiMetrics, User and group
> related metrics
> Java config name: null
> Native config name: /etc/krb5.conf
> Loaded from native config
> 16/08/02 18:34:11 DEBUG security.Groups:  Creating new Groups object
> 16/08/02 18:34:11 DEBUG security.Groups: Group mapping
> impl=org.apache.hadoop.security.LdapGroupsMapping; cacheTimeout=30;
> warningDeltaMs=5000
> >>>KinitOptions cache name is /tmp/krb5cc_0
> >>>DEBUG   client principal is nn/hadoop-master@
> platalyticsrealm
> >>>DEBUG  server principal is krbtgt/platalyticsrealm@
> platalyticsrealm
> >>>DEBUG  key type: 16
> >>>DEBUG  auth time: Tue Aug 02 18:23:59 PKT 2016
> >>>DEBUG  start time: Tue Aug 02 18:23:59 PKT 2016
> >>>DEBUG  end time: Wed Aug 03 06:23:59 PKT 2016
> >>>DEBUG  renew_till time: Tue Aug 09 18:23:59 PKT 2016
> >>> CCacheInputStream: readFlags()  FORWARDABLE; RENEWABLE; INITIAL;
> >>>DEBUG   client principal is nn/hadoop-master@
> platalyticsrealm
> >>>DEBUG  server principal is
> X-CACHECONF:/krb5_ccache_conf_data/fast_avail/krbtgt/platalyticsrealm@
> platalyticsrealm
> >>>DEBUG  key type: 0
> >>>DEBUG  auth time: Thu Jan 01 05:00:00 PKT 1970
> >>>DEBUG  start time: null
> >>>DEBUG  end time: Thu Jan 01 05:00:00 PKT 1970
> >>>DEBUG  renew_till time: null
> >>> CCacheInputStream: readFlags()
> 16/08/02 18:34:11 DEBUG security.UserGroupInformation: hadoop login
> 16/08/02 18:34:11 DEBUG security.UserGroupInformation: hadoop login commit
> 16/08/02 18:34:11 DEBUG security.UserGroupInformation: using kerberos
> user:nn/hadoop-master@platalyticsrealm
> 16/08/02 18:34:11 DEBUG security.UserGroupInformation: Using user:
> "nn/hadoop-master@platalyticsrealm" with name nn/hadoop-master@
> platalyticsrealm
> 16/08/02 18:34:11 DEBUG security.UserGroupInformation: User entry:
> "nn/hadoop-master@platalyticsrealm"
> 16/08/02 18:34:11 DEBUG security.UserGroupInformation: UGI
> loginUser:nn/hadoop-master@platalyticsrealm (auth:KERBEROS)
> 16/08/02 18:34:12 DEBUG security.UserGroupInformation: Found tgt Ticket
> (hex) =
> : 61 82 01 72 30 82 01 6E   A0 03 02 01 05 A1 12 1B  a..r0..n
> 0010: 10 70 6C 61 74 61 6C 79   74 69 63 73 72 6

Re: Cannot run Hadoop on Windows

2016-08-08 Thread Rakesh Radhakrishnan
Hi Atri,

I suspect the problem is due to the space in the path -> "Program Files".
Instead of C:\Program Files\Java\jdk1.8.0_101, please copy the JDK dir to
C:\java\jdk1.8.0_101 and try once more.
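
A minimal sketch of what etc\hadoop\hadoop-env.cmd could contain after
moving the JDK (the 8.3 short name C:\PROGRA~1 is a possible alternative
if short names are enabled on your system, but the space-free copy is the
safer bet):

  set JAVA_HOME=C:\java\jdk1.8.0_101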

Rakesh
Intel

On Mon, Aug 8, 2016 at 4:34 PM, Atri Sharma  wrote:

> Hi All,
>
> I am trying to run a compiled Hadoop jar on Windows but ran into the
> following error when running hdfs-format:
>
> JAVA_HOME is incorrectly set.
>
> I echoed the path being set in etc/Hadoop-env.cmd and it echoes the
> correct path:
>
> C:\Program Files\Java\jdk1.8.0_101
>
> Please advise.
>
> Regards,
>
> Atri
>
>


Re: Connecting JConsole to ResourceManager

2016-08-09 Thread Rakesh Radhakrishnan
Hi Atri,

Do you mean something like: jconsole [processID]?

AFAIK, local JMX attach uses the local filesystem. Please make sure both
processes are running under the same user so that there are no permission
issues. Also, could you please check the %TEMP% and %TMP% environment
variables and make sure the user name in those paths is exactly the same
(I mean, case-sensitively)? I have observed problematic cases where the
directory was created with lowercase letters while the username had
uppercase letters.
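
If local attach still fails, explicitly opening a JMX port (as Rohith
describes below) is another option. A rough sketch, assuming your start
scripts honor YARN_RESOURCEMANAGER_OPTS and that port 8001 is free:

  set YARN_RESOURCEMANAGER_OPTS=-Dcom.sun.management.jmxremote.port=8001 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false

Then restart the ResourceManager and point jconsole at localhost:8001.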

Thanks,
Rakesh

On Tue, Aug 9, 2016 at 1:06 PM, Atri Sharma  wrote:

> Thanks Rohith.
>
> I am running JConsole locally on the same machine as I have my single node
> cluster.Do I still need to enable remote access?
>
> On 9 Aug 2016 12:44 p.m., "Rohith Sharma K S" 
> wrote:
>
>> Hi
>>
>> Have you enabled JMX remote connections parameters for RM start up? If
>> you are trying to remote connection, these parameter supposed to passed in
>> hadoop opts
>>
>> You need to enable remote by configuring these parameter in RM jam start up.
>>
>> -Dcom.sun.management.jmxremote.port= \
>>  -Dcom.sun.management.jmxremote.authenticate=false \
>>  -Dcom.sun.management.jmxremote.ssl=false
>>
>>
>>
>>
>>
>> -Regards
>> Rohith Sharma K S
>>
>> On Aug 9, 2016, at 12:32 PM, Atri Sharma  wrote:
>>
>> Hi All,
>>
>> I am trying to connect to a running ResourceManager process on Windows. I
>> ran jconsole and it shows the ResourceManager process. When I try
>> connecting, it immediately fails saying that it cannot connect.
>>
>> I verified that the cluster is running fine by running the wordcount
>> example.
>>
>> Please advise.
>>
>> Regards,
>>
>> Atri
>>
>>
>>


Re: Journal nodes in HA

2016-08-12 Thread Rakesh Radhakrishnan
Hi Konstantinos,

The typical deployment is three JournalNodes (JNs), and you can collocate
two of the three JNs on the same machines where the Namenodes (2 NNs) are
running. The third one can be deployed to a machine where a ZK server is
running (assuming the ZK cluster has 3 nodes). I'd recommend having a
dedicated disk for each JN server to use for the edit log path, as edit
logs will be written continuously.
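
For example, a sketch of pointing each JN at a dedicated disk via
hdfs-site.xml (the mount point /data/jn-disk below is just a placeholder):

<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/data/jn-disk/journal</value>
</property>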

It would be helpful if you could give more details of your Hadoop cluster
size and components including ZK service etc.

Thanks,
Rakesh

On Fri, Aug 12, 2016 at 3:12 PM, Konstantinos Tsakalozos <
kos.tsakalo...@canonical.com> wrote:

> Hi everyone,
>
> In an HA setup do you tend to co-host the journal service with other
> services instead of having them on separate dedicated machines? If so, what
> services do you pack together?
>
> Thank you,
> Konstantinos
>


Re: Journal nodes in HA

2016-08-12 Thread Rakesh Radhakrishnan
Hi Konstantinos,

Nice documentation! Wish you all the success for expanding to Hadoop-HA
mode.

I'd say the JournalNodes should be co-located on machines with other Hadoop
master daemons, for example the Namenodes, YARN ResourceManager, etc. Those
machines are attractive because they are already well provisioned, see
little unpredictable user activity, and the master daemons are generally
light on disk usage compared to worker nodes (Datanode, Nodemanager, etc.).
In general, dedicating a disk drive on each of those machines for use by
the JournalNode helps avoid disk spindle competition with other processes.
Sorry, I don't have any reports with me now. Perhaps other folks can pitch
in and add more about any performance benchmark results, if any. For the
ZooKeeper server, you can refer to the
http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html and
https://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview pages.

Thanks,
Rakesh

On Fri, Aug 12, 2016 at 5:56 PM, Konstantinos Tsakalozos <
kos.tsakalo...@canonical.com> wrote:

> + the hadoop list
>
> On Fri, Aug 12, 2016 at 3:25 PM, Konstantinos Tsakalozos <
> kos.tsakalo...@canonical.com> wrote:
>
>> Hi Rakesh,
>>
>> Thank you for your prompt reply.
>>
>> In the Juju big data team we bundle Hadoop and a set of "peripheral"
>> helper services so that any interested user can easily deploy the full
>> environment in an automated way.
>> The deployment bundle looks like this: 
>> https://jujucharms.com/hadoop-processing/
>> . On the right side of the bundle you see a client service that can be
>> replaced with any other service the user wishes (eg Hive, Pig etc). We
>> also decided to go with ganglia and rsyslog for monitoring. Would you
>> prefer to see anything more there? In the next release we will be adding
>> Apache Zookeeper that will give us HA and this is why I am asking where
>> would it be best to place the journal nodes.
>>
>> In our case it would be preferable to "waste" one more "namenode"
>> machine (machine=unit in juju terminology) to place the third journal
>> service by itself. The deployment would be cleaner and easier to reach.
>> Also, appreciate very much your advice on dedicated storage. Are there any
>> performance benchmarks showing what bandwidth we can sustain with shared vs
>> dedicated storage for the journal nodes?
>>
>> Thank you,
>> Konstantinos
>>
>>
>>
>>
>> On Fri, Aug 12, 2016 at 2:26 PM, Rakesh Radhakrishnan > > wrote:
>>
>>> Hi Konstantinos,
>>>
>>> The typical deployment is, three Journal Nodes(JNs) and can collocate
>>> two of the three JNs on the same machine where Namenodes(2 NNs) are
>>> running. The third one can be deployed to the machine where ZK server is
>>> running(assume ZK cluster has 3 nodes). I'd recommend to have a dedicated
>>> disk for each JN server to use for edit log path as edit logs will be
>>> writing continuously.
>>>
>>> It would be helpful if you could give more details of your Hadoop
>>> cluster size and components including ZK service etc.
>>>
>>> Thanks,
>>> Rakesh
>>>
>>> On Fri, Aug 12, 2016 at 3:12 PM, Konstantinos Tsakalozos <
>>> kos.tsakalo...@canonical.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> In an HA setup do you tend to co-host the journal service with other
>>>> services instead of having them on separate dedicated machines? If so, what
>>>> services do you pack together?
>>>>
>>>> Thank you,
>>>> Konstantinos
>>>>
>>>
>>>
>>
>


Re: Is it possible to configure hdfs in a federation mode and in an HA mode in the same time?

2016-08-18 Thread Rakesh Radhakrishnan
Yes, it is possible to enable HA mode and Automatic Failover in a
federated namespace. Following are some quick references; I feel it's
worth reading these blogs to get more insight into this. I think you can
start prototyping a test cluster with this (a rough config sketch follows
the links below) and post your queries to this forum if you face any
issues while setting up; some of us will help you.

http://pe-kay.blogspot.in/2016/02/configuring-federated-hdfs-cluster-with.html
http://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_hag_hdfs_ha_enabling.html
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_hadoop-ha/content/ha-nn-config-cluster.html
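
Just as a rough sketch of how the pieces fit together (the nameservice ids,
namenode ids, hosts and ports below are placeholders; each nameservice gets
its own HA pair and failover proxy provider, exactly as in the
non-federated HA guides above):

  dfs.nameservices = ns1,ns2
  dfs.ha.namenodes.ns1 = nn1,nn2
  dfs.ha.namenodes.ns2 = nn3,nn4
  dfs.namenode.rpc-address.ns1.nn1 = host1.example.com:8020
  dfs.namenode.rpc-address.ns1.nn2 = host2.example.com:8020
  dfs.namenode.rpc-address.ns2.nn3 = host3.example.com:8020
  dfs.namenode.rpc-address.ns2.nn4 = host4.example.com:8020
  dfs.client.failover.proxy.provider.ns1 = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
  dfs.client.failover.proxy.provider.ns2 = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider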

Thanks,
Rakesh
Intel

On Tue, Aug 16, 2016 at 11:19 AM, Alexandr Porunov <
alexandr.poru...@gmail.com> wrote:

> Hello all,
>
> I don't understand if it possible to configure HDFS in both modes in the
> same time. Does it make sense? Can somebody show a simple configuration of
> HDFS in both modes? (nameNode1, nameNode2, nameNodeStandby1,
> nameNodeStandby2)
>
> Sincerely,
> Alexandr
>


Re: Namenode Unable to Authenticate to QJM in Secure mode.

2016-08-19 Thread Rakesh Radhakrishnan
Hi Akash,

In general, "GSSException: No valid credentials provided" means you don't
have valid Kerberos credentials. I suspect some issue related to SPNEGO;
could you please revisit all of your Kerberos-related configurations,
probably starting from the configuration below. Please share the *-site.xml
configurations of the JN and NNs. Also, please check for any unexpected
exceptions in the KDC server logs.

I've filtered out "REQUEST /getJournal on org.mortbay.jetty.HttpConnection"
in your "qjm.log" log file and I could see that these came immediately
after your restart; a few succeeded and a few others failed with this
exception.

2016-08-19 10:34:14,345 DEBUG org.mortbay.log: RESPONSE /getJournal  401
2016-08-19 10:34:14,374 DEBUG org.mortbay.log: RESPONSE /getJournal  403
2016-08-19 10:34:14,382 DEBUG org.mortbay.log: RESPONSE /getJournal  401
2016-08-19 10:34:14,398 DEBUG org.mortbay.log: RESPONSE /getJournal  403
2016-08-19 10:34:49,679 DEBUG org.mortbay.log: RESPONSE /getJournal  401


<property>
  <name>dfs.journalnode.kerberos.internal.spnego.principal</name>
  <value></value>
  <description>
    The server principal used by the JournalNode HTTP Server for
    SPNEGO authentication when Kerberos security is enabled. This is
    typically set to HTTP/_h...@realm.tld. The SPNEGO server principal
    begins with the prefix HTTP/ by convention.

    If the value is '*', the web server will attempt to login with
    every principal specified in the keytab file
    dfs.web.authentication.kerberos.keytab.

    For most deployments this can be set to
    ${dfs.web.authentication.kerberos.principal}
    i.e use the value of dfs.web.authentication.kerberos.principal.
  </description>
</property>

Rakesh,
Intel

On Fri, Aug 19, 2016 at 4:15 PM, Akash Mishra 
wrote:

> Hi *,
>
> I am trying to run Hadoop cluster [ 2.7.1] in Secure mode. In my cluster
> Namenode is failing while restart with
>
> 2016-08-19 10:34:49,754 DEBUG org.apache.hadoop.security.
> authentication.client.KerberosAuthenticator: Using fallback authenticator
> sequence.
> 2016-08-19 10:34:49,774 DEBUG org.apache.hadoop.security.UserGroupInformation:
> PrivilegedActionException as:hdfs/hadoopdev1.m...@hadoopdev.mlan
> (auth:KERBEROS) cause:java.io.IOException: org.apache.hadoop.security.
> authentication.client.AuthenticationException: Authentication failed,
> status: 403, message: GSSException: No valid credentials provided
> (Mechanism level: Failed to find any Kerberos credentails)
> 2016-08-19 10:34:49,775 ERROR 
> org.apache.hadoop.hdfs.server.namenode.EditLogInputStream:
> caught exception initializing http://hadoopdev1:8480/
> getJournal?jid=hadoopdev&segmentTxId=2275460&storageInfo=-63%3A1455401088%
> 3A1444912570574%3ACID-f748dfef-c174-4d19-8d18-43b74552c8e6
> java.io.IOException: 
> org.apache.hadoop.security.authentication.client.AuthenticationException:
> Authentication failed, status: 403, message: GSSException: No valid
> credentials provided (Mechanism level: Failed to find any Kerberos
> credentails)
> at org.apache.hadoop.hdfs.server.namenode.
> EditLogFileInputStream$URLLog$1.run(EditLogFileInputStream.java:464)
> at org.apache.hadoop.hdfs.server.namenode.
> EditLogFileInputStream$URLLog$1.run(EditLogFileInputStream.java:456)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1657)
> at org.apache.hadoop.security.SecurityUtil.doAsUser(
> SecurityUtil.java:448)
> at org.apache.hadoop.security.SecurityUtil.doAsCurrentUser(
> SecurityUtil.java:442)
> at org.apache.hadoop.hdfs.server.namenode.
> EditLogFileInputStream$URLLog.getInputStream(EditLogFileInputStream.java:
> 455)
> at org.apache.hadoop.hdfs.server.namenode.
> EditLogFileInputStream.init(EditLogFileInputStream.java:141)
> at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.
> nextOpImpl(EditLogFileInputStream.java:192)
> at org.apache.hadoop.hdfs.server.namenode.
> EditLogFileInputStream.nextOp(EditLogFileInputStream.java:250)
> at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.
> readOp(EditLogInputStream.java:85)
> at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.
> skipUntil(EditLogInputStream.java:151)
> at org.apache.hadoop.hdfs.server.namenode.
> RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
> at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.
> readOp(EditLogInputStream.java:85)
> at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.
> skipUntil(EditLogInputStream.java:151)
> at org.apache.hadoop.hdfs.server.namenode.
> RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
> at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.
> readOp(EditLogInputStream.java:85)
>
>
> I am using MIT 5 Kerberos. I am able to successfully kinit using keytab
> file. 

Re: java.lang.NoSuchFieldError: HADOOP_CLASSPATH

2016-08-26 Thread Rakesh Radhakrishnan
Hi Senthil,

There might be a case of including the wrong version of a jar file. Could
you please check for the "Environment.HADOOP_CLASSPATH" enum variable in
the "org.apache.hadoop.yarn.api.ApplicationConstants.java" class inside
your hadoop jar file? I think it is throwing "NoSuchFieldError" because it
is not seeing the "HADOOP_CLASSPATH" enum variable. Also, please ensure
that the hadoop jars are properly available on the classpath while running
the job.
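
As a quick way to see which jar the class is actually being loaded from at
runtime, a minimal sketch like this could be run with the same classpath as
the job client (it only assumes hadoop-yarn-api is on the classpath):

import org.apache.hadoop.yarn.api.ApplicationConstants;

public class WhichJar {
  public static void main(String[] args) {
    // Prints the jar (or directory) the ApplicationConstants.Environment
    // class was loaded from, so it can be compared against the expected
    // hadoop-yarn-api version.
    System.out.println(ApplicationConstants.Environment.class
        .getProtectionDomain().getCodeSource().getLocation());
  }
}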

Thanks,
Rakesh

On Fri, Aug 26, 2016 at 4:53 PM, kumar, Senthil(AWF) 
wrote:

> Dear All ,   Facing  No Such Field Error when I run Map Reduce Job.. I
> have correct HADOOP_CLASSPATH in cluster .. Not sure what causing issue
> here ..
>
>
>
> java.lang.NoSuchFieldError: HADOOP_CLASSPATH
>
> at org.apache.hadoop.mapreduce.v2.util.MRApps.setClasspath(
> MRApps.java:248)
>
> at org.apache.hadoop.mapred.YARNRunner.
> createApplicationSubmissionContext(YARNRunner.java:458)
>
> at org.apache.hadoop.mapred.YARNRunner.submitJob(
> YARNRunner.java:285)
>
> at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(
> JobSubmitter.java:432)
>
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
>
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:422)
>
> at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1709)
>
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
>
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.
> java:1303)
>
>
>
>
>
> Version Info:
>
>
>
> *Hadoop 2.7.1.2.4.2.0-258*
>
> Subversion g...@github.com:hortonworks/hadoop.git -r
> 13debf893a605e8a88df18a7d8d214f571e05289
>
> Compiled by jenkins on 2016-04-25T05:46Z
>
> Compiled with protoc 2.5.0
>
> From source with checksum 2a2d95f05ec6c3ac547ed58cab713ac
>
>
>
> Did anyone face this issue ??
>
>
>
> --Senthil
>


Re: Running a HA HDFS cluster on alpine linux

2016-08-28 Thread Rakesh Radhakrishnan
Hi Francis,

There could be connection fluctuations between the ZKFC and the ZK server;
I've observed the following message in your logs. I'd suggest you start by
analyzing all of your ZooKeeper servers' log messages and checking the
ZooKeeper cluster status during this period (a quick check is sketched
after the log excerpt below). BTW, could you tell me the ZK cluster size?

16/08/25 22:41:40 INFO zookeeper.ClientCnxn: Unable to read additional data
from server sessionid 0x156c3dc679a0003, likely server has closed socket,
closing socket connection and attempting reconnect
16/08/25 22:41:40 INFO ha.ActiveStandbyElector: Session disconnected.
Entering neutral mode...
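
As a quick check against the ZooKeeper server (host and port taken from the
log excerpt in your message; 'ruok' and 'stat' are the standard four-letter
admin commands):

  echo ruok | nc m9edd51-zookeeper.m9edd51 2181
  echo stat | nc m9edd51-zookeeper.m9edd51 2181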

Regards,
Rakesh

On Sat, Aug 27, 2016 at 5:16 PM, F21  wrote:

> Hi all,
>
> I am currently experimenting with running a HA HDFS cluster in docker
> containers. I have successfully created an HA cluster using Ubuntu as my
> base image for running the namenode, datanode and journalnodes. The
> zookeeper instance runs on an image built using Alpine linux as the base
> and works pretty well.
>
> I attempted to get the namenode, datanode and journalnodes running using
> Alpine linux as the base image. The datanode and journalnodes seem to work
> fine. However, while the namenodes start correctly, they seem to disconnect
> from Zookeeper quite often and will transition into neutral mode. This
> results in the namenodes being in "startup mode" forever.
>
> These are the logs from the active namenode:
> 16/08/25 22:40:28 INFO blockmanagement.CacheReplicationMonitor: Starting
> CacheReplicationMonitor with interval 3 milliseconds
> 16/08/25 22:40:28 INFO fs.TrashPolicyDefault: Namenode trash
> configuration: Deletion interval = 1440 minutes, Emptier interval = 0
> minutes.
> 16/08/25 22:40:28 INFO fs.TrashPolicyDefault: The configured checkpoint
> interval is 0 minutes. Using an interval of 1440 minutes that is used for
> deletion instead
> 16/08/25 22:40:28 INFO blockmanagement.BlockManager: Total number of
> blocks= 0
> 16/08/25 22:40:28 INFO blockmanagement.BlockManager: Number of invalid
> blocks  = 0
> 16/08/25 22:40:28 INFO blockmanagement.BlockManager: Number of
> under-replicated blocks = 0
> 16/08/25 22:40:28 INFO blockmanagement.BlockManager: Number of
> over-replicated blocks = 0
> 16/08/25 22:40:28 INFO blockmanagement.BlockManager: Number of blocks
> being written= 0
> 16/08/25 22:40:28 INFO hdfs.StateChange: STATE* Replication Queue
> initialization scan for invalid, over- and under-replicated blocks
> completed in 687 msec
> 16/08/25 22:40:28 INFO ha.ZKFailoverController: Successfully transitioned
> NameNode at m9edd51-nn1.m9edd51/172.18.0.7:8020 to active state
> 16/08/25 22:40:30 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> mkdir: `/tmp': File exists
> 16/08/25 22:40:32 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 16/08/25 22:41:36 INFO window.RollingWindowManager: topN size for command
> getfileinfo is: 0
> 16/08/25 22:41:36 INFO window.RollingWindowManager: topN size for command
> mkdirs is: 0
> 16/08/25 22:41:36 INFO window.RollingWindowManager: topN size for command
> listStatus is: 0
> 16/08/25 22:41:36 INFO window.RollingWindowManager: topN size for command
> * is: 0
> 16/08/25 22:41:36 INFO window.RollingWindowManager: topN size for command
> setPermission is: 0
> 16/08/25 22:41:36 INFO window.RollingWindowManager: topN size for command
> getfileinfo is: 1
> 16/08/25 22:41:36 INFO window.RollingWindowManager: topN size for command
> mkdirs is: 1
> 16/08/25 22:41:36 INFO window.RollingWindowManager: topN size for command
> listStatus is: 1
> 16/08/25 22:41:36 INFO window.RollingWindowManager: topN size for command
> * is: 1
> 16/08/25 22:41:36 INFO window.RollingWindowManager: topN size for command
> setPermission is: 1
> 16/08/25 22:41:36 INFO window.RollingWindowManager: topN size for command
> getfileinfo is: 1
> 16/08/25 22:41:36 INFO window.RollingWindowManager: topN size for command
> mkdirs is: 1
> 16/08/25 22:41:36 INFO window.RollingWindowManager: topN size for command
> listStatus is: 1
> 16/08/25 22:41:36 INFO window.RollingWindowManager: topN size for command
> * is: 1
> 16/08/25 22:41:36 INFO window.RollingWindowManager: topN size for command
> setPermission is: 1
> 16/08/25 22:41:40 INFO zookeeper.ClientCnxn: Unable to read additional
> data from server sessionid 0x156c3dc679a0003, likely server has closed
> socket, closing socket connection and attempting reconnect
> 16/08/25 22:41:40 INFO ha.ActiveStandbyElector: Session disconnected.
> Entering neutral mode...
> 16/08/25 22:41:41 INFO zookeeper.ClientCnxn: Opening socket connection to
> server m9edd51-zookeeper.m9edd51/172.18.0.2:2181. Will not attempt to
> authenticate using SASL (unknown error)
> 16/08/25 22:41:41 INFO zookeeper.ClientCnxn: Socket connection established
> to m9edd51-zookeeper.m9edd51/172.18.0.2:2181, init

Re: java.lang.NoSuchFieldError: HADOOP_CLASSPATH

2016-08-29 Thread Rakesh Radhakrishnan
Hi Senthil,

IIUC, the root cause is that, while executing the following statement, it
is not finding the "Environment.HADOOP_CLASSPATH" enum variable in the
ApplicationConstants$Environment.class file and is throwing
NoSuchFieldError. Could you please cross-check the jars on your classpath
against that line? Also, I'd suggest you refer to the MAPREDUCE-6454 jira,
which made a few changes in these classes. I hope this helps in debugging
your env and solving it soon.

https://github.com/apache/hadoop/blob/branch-2.7/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapreduce/v2/util/MRApps.java#L248

Regards,
Rakesh
Intel

On Mon, Aug 29, 2016 at 11:46 AM, kumar, Senthil(AWF) 
wrote:

> Thanks Rakesh..
>
>
>
> Hadoop 2.7.1.2.4.2.0-258
>
> Subversion g...@github.com:hortonworks/hadoop.git -r
> 13debf893a605e8a88df18a7d8d214f571e05289
>
> Compiled by jenkins on 2016-04-25T05:46Z
>
>
>
> https://github.com/apache/hadoop/blob/release-2.7.1/
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/
> java/org/apache/hadoop/yarn/api/ApplicationConstants.java
>
>
>
> I don’t see the ENUM HADOOP_CLASSPATH in Yarn API ..
>
>
>
> --Senthil
>
> *From:* Rakesh Radhakrishnan [mailto:rake...@apache.org]
> *Sent:* Friday, August 26, 2016 8:26 PM
> *To:* kumar, Senthil(AWF) 
> *Cc:* user.hadoop 
> *Subject:* Re: java.lang.NoSuchFieldError: HADOOP_CLASSPATH
>
>
>
> Hi Senthil,
>
>
>
> There might be case of including the wrong version of a jar file, could
> you please check "Environment.HADOOP_CLASSPATH" enum variable in
> "org.apache.hadoop.yarn.api.ApplicationConstants.java" class in your
> hadoop jar file?. I think it is throwing "NoSuchFieldError" as its not
> seeing the "HADOOP_CLASSPATH" enum variable. Also, please ensure that the
> hadoop jars are properly available in the classpath while running the job.
>
>
>
> Thanks,
>
> Rakesh
>
>
>
> On Fri, Aug 26, 2016 at 4:53 PM, kumar, Senthil(AWF) 
> wrote:
>
> Dear All ,   Facing  No Such Field Error when I run Map Reduce Job.. I
> have correct HADOOP_CLASSPATH in cluster .. Not sure what causing issue
> here ..
>
>
>
> java.lang.NoSuchFieldError: HADOOP_CLASSPATH
>
> at org.apache.hadoop.mapreduce.v2.util.MRApps.setClasspath(
> MRApps.java:248)
>
> at org.apache.hadoop.mapred.YARNRunner.
> createApplicationSubmissionContext(YARNRunner.java:458)
>
> at org.apache.hadoop.mapred.YARNRunner.submitJob(
> YARNRunner.java:285)
>
> at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(
> JobSubmitter.java:432)
>
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
>
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:422)
>
> at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1709)
>
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
>
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.
> java:1303)
>
>
>
>
>
> Version Info:
>
>
>
> *Hadoop 2.7.1.2.4.2.0-258*
>
> Subversion g...@github.com:hortonworks/hadoop.git -r
> 13debf893a605e8a88df18a7d8d214f571e05289
>
> Compiled by jenkins on 2016-04-25T05:46Z
>
> Compiled with protoc 2.5.0
>
> From source with checksum 2a2d95f05ec6c3ac547ed58cab713ac
>
>
>
> Did anyone face this issue ??
>
>
>
> --Senthil
>
>
>


Re: HDFS Balancer Stuck after 10 Minz

2016-09-08 Thread Rakesh Radhakrishnan
Have you taken multiple thread dumps (jstack) and observed which
operations are being performed during this period of time? There is a high
chance the balancer is busy searching for data blocks that it can move
around to balance the cluster.

Could you tell me the used space and available space values? Have you
tried changing the threshold to a lower value, maybe 10 or 5, and seeing
what happens with that value? Also, since there seem to be no log messages
during the 15-minute period, is there any possibility of enabling debug
log priority and trying to dig more into the problem?
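
Something along these lines (pid 1003 taken from your jps output) would
give a couple of dumps to compare:

  jstack 1003 > /tmp/balancer-dump-1.txt
  sleep 60
  jstack 1003 > /tmp/balancer-dump-2.txt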


Rakesh

On Thu, Sep 8, 2016 at 7:44 PM, Rakesh Radhakrishnan 
wrote:

> Have you taken multiple thread dumps (jstack) and observed the operations
> which are performing during this period of time. Perhaps there could be
> high chance of searching for data blocks which it can move around to
> balance the cluster.
>
> Could you tell me the used space and available space values. Have you
> tried changing the threshold to a lower value, may be 10 or 5 and what
> happens with this value. Also, I think there is no log messages during 15
> mins time period, any possibility of enabling debug log priority and try to
> dig more about the problem.
>
> Rakesh
>
> On Thu, Sep 8, 2016 at 6:15 PM, Senthil Kumar 
> wrote:
>
>> Hi All ,  We are in the situation to balance the cluster data since median
>> reached 98% .. I started balancer as below
>>
>> Hadoop Version: Hadoop 2.4.1
>>
>>
>> /apache/hadoop/sbin/start-balancer.sh   -threshold  30
>>
>>
>> Once i start balancer it goes will for first 8-10 minutes of time..
>> Balancer was moving so quickly first 10 minutes.. Not sure whats happening
>> in the cluster after sometime ( say 10 minz ) , balancer is almost stuck .
>>
>> Log excerpts :
>>
>> 2016-09-08 04:58:15,653 INFO
>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>> blk_-5830766563502877304_1279767737 with size=134217728 from
>> 10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.27:1004
>>
>> 2016-09-08 04:59:14,426 INFO
>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>> blk_2601479900_1104500421142 with size=268435456 from 10.103.84.51:1004
>> to
>> 10.142.18.27:1004 through 10.103.84.16:1004
>>
>> 2016-09-08 05:01:15,037 INFO
>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>> blk_3073791211_1104972921837 with size=268435456 from 10.103.21.27:1004
>> to
>> 10.142.21.56:1004 through 10.103.21.42:1004
>>
>>
>>
>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ date
>>
>> Thu Sep  8 05:16:53 GMT+7 2016
>>
>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ jps
>>
>> 1003 Balancer
>>
>> 20388 Jps
>>
>>
>>
>> Last Block Mover Timestamp : 05:01
>>
>> Current Timestamp: 05:16
>>
>>
>> Almost 15 minz no blocks moved by balancer ..  What could be the issue
>> here
>> ??  Restart would help us start moving again..
>>
>>
>>
>> It’s not event passing iteration 1 ..
>>
>>
>> I found one thread discussing about the same issue:
>>
>> http://lucene.472066.n3.nabble.com/A-question-about-Balancer
>> -in-HDFS-td4118505.html
>>
>>
>> Pls suggest here to balance cluster ..
>>
>>
>> --Senthil
>>
>
>


Re: HDFS ACL | Unable to define ACL automatically for child folders

2016-09-18 Thread Rakesh Radhakrishnan
It looks like '/user/test3' is owned by "hdfs", and access is denied while
performing operations as the "shashi" user. One idea is to recursively set
the ACL on sub-directories and files as follows:

 hdfs dfs -setfacl -R -m default:user:shashi:rwx /user

The -R option can be used to apply the operation to all files and
directories recursively. Note that "default:" entries only take effect for
children created after they are set, so you may also want to add a plain
access entry (e.g. "-m user:shashi:rwx") on the existing directories.

Regards,
Rakesh

On Sun, Sep 18, 2016 at 8:53 PM, Shashi Vishwakarma <
shashi.vish...@gmail.com> wrote:

> I have following scenario. There is parent folder /user with five child
> folder as test1 , test2, test3 etc in HDFS.
>
> /user/test1
> /user/test2
> /user/test3
>
> I applied acl on parent folder to make sure user has automatically access
> to child folder.
>
>  hdfs dfs -setfacl -m default:user:shashi:rwx /user
>
>
> but when i try to put some file , it is giving permission denied exception
>
> hadoop fs -put test.txt  /user/test3
> put: Permission denied: user=shashi, access=WRITE,
> inode="/user/test3":hdfs:supergroup:drwxr-xr-x
>
> **getfacl output**
>
> hadoop fs -getfacl /user/test3
> # file: /user/test3
> # owner: hdfs
> # group: supergroup
> user::rwx
> group::r-x
> other::r-x
>
> Any pointers on this?
>
> Thanks
> Shashi
>


Re: HDFS ACL | Unable to define ACL automatically for child folders

2016-09-19 Thread Rakesh Radhakrishnan
AFAIK, there is no recursive Java API available. Perhaps you could do a
recursive directory listing for the path and invoke the #setAcl (or
#modifyAclEntries) Java API on each entry; a rough sketch follows.
https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#setAcl(org.apache.hadoop.fs.Path,
java.util.List)
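
A minimal sketch of that idea, assuming the "shashi" user and the /user
path from your earlier mail; it uses modifyAclEntries instead of setAcl so
that the existing base entries don't have to be re-specified:

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;

public class RecursiveAcl {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // The ACCESS entry grants "shashi" rwx on each existing path; the
    // DEFAULT entry makes directories pass it on to children created later.
    List<AclEntry> aclSpec = Arrays.asList(
        new AclEntry.Builder().setScope(AclEntryScope.ACCESS)
            .setType(AclEntryType.USER).setName("shashi")
            .setPermission(FsAction.ALL).build(),
        new AclEntry.Builder().setScope(AclEntryScope.DEFAULT)
            .setType(AclEntryType.USER).setName("shashi")
            .setPermission(FsAction.ALL).build());
    applyRecursively(fs, new Path("/user"), aclSpec);
  }

  private static void applyRecursively(FileSystem fs, Path dir,
      List<AclEntry> aclSpec) throws Exception {
    // modifyAclEntries adds/updates the given entries without requiring the
    // full ACL spec that setAcl expects.
    fs.modifyAclEntries(dir, aclSpec);
    for (FileStatus status : fs.listStatus(dir)) {
      if (status.isDirectory()) {
        applyRecursively(fs, status.getPath(), aclSpec);
      } else {
        // DEFAULT entries are not allowed on files, so only add the ACCESS
        // entry (index 0 of the spec above).
        fs.modifyAclEntries(status.getPath(), aclSpec.subList(0, 1));
      }
    }
  }
}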

Rakesh

On Mon, Sep 19, 2016 at 11:22 AM, Shashi Vishwakarma <
shashi.vish...@gmail.com> wrote:

> Thanks Rakesh.
>
> Just last question, is there any Java API available for recursively
> applying ACL or I need to iterate on all folders of dir and apply acl for
> each?
>
> Thanks
> Shashi
>
> On 19 Sep 2016 9:56 am, "Rakesh Radhakrishnan"  wrote:
>
>> It looks like '/user/test3' has owner '"hdfs" and denying the
>> access while performing operations via "shashi" user. One idea is to
>> recursively set ACL to sub-directories and files as follows:
>>
>>  hdfs dfs -setfacl -R -m default:user:shashi:rwx /user
>>
>> -R, option can be used to apply operations to all files and
>> directories recursively.
>>
>> Regards,
>> Rakesh
>>
>> On Sun, Sep 18, 2016 at 8:53 PM, Shashi Vishwakarma <
>> shashi.vish...@gmail.com> wrote:
>>
>>> I have following scenario. There is parent folder /user with five child
>>> folder as test1 , test2, test3 etc in HDFS.
>>>
>>> /user/test1
>>> /user/test2
>>> /user/test3
>>>
>>> I applied acl on parent folder to make sure user has automatically
>>> access to child folder.
>>>
>>>  hdfs dfs -setfacl -m default:user:shashi:rwx /user
>>>
>>>
>>> but when i try to put some file , it is giving permission denied
>>> exception
>>>
>>> hadoop fs -put test.txt  /user/test3
>>> put: Permission denied: user=shashi, access=WRITE,
>>> inode="/user/test3":hdfs:supergroup:drwxr-xr-x
>>>
>>> **getfacl output**
>>>
>>> hadoop fs -getfacl /user/test3
>>> # file: /user/test3
>>> # owner: hdfs
>>> # group: supergroup
>>> user::rwx
>>> group::r-x
>>> other::r-x
>>>
>>> Any pointers on this?
>>>
>>> Thanks
>>> Shashi
>>>
>>
>>


Re: hdfs2.7.3 kerberos can not startup

2016-09-20 Thread Rakesh Radhakrishnan
>>Caused by: javax.security.auth.login.LoginException: Unable to obtain
password from user

Could you please check that the kerberos principal name is specified
correctly in "hdfs-site.xml", which is used to authenticate against
Kerberos.

If the keytab file defined in "hdfs-site.xml" does not exist or its path is
wrong, you will see this error. So, please verify that the path and the
keytab filename are configured correctly.

I hope hadoop discussion thread, https://goo.gl/M6l3vv may help you.


>>>2016-09-20 00:54:06,665 INFO org.apache.hadoop.http.HttpServer2:
HttpServer.start() threw a non Bind IOException
java.io.IOException: !JsseListener: java.lang.NullPointerException

This is probably due to some missing configuration.
Could you please re-check ssl-server.xml and the keystore and truststore
properties below (a rough sketch follows the list):

ssl.server.keystore.location
ssl.server.keystore.keypassword
ssl.client.truststore.location
ssl.client.truststore.password
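
A rough sketch of the ssl-server.xml entries (the paths and passwords below
are placeholders; the ssl.client.truststore.* properties normally live in
ssl-client.xml):

<property>
  <name>ssl.server.keystore.location</name>
  <value>/etc/hadoop/conf/keystore.jks</value>
</property>
<property>
  <name>ssl.server.keystore.password</name>
  <value>your-keystore-password</value>
</property>
<property>
  <name>ssl.server.keystore.keypassword</name>
  <value>your-key-password</value>
</property>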

Rakesh

On Tue, Sep 20, 2016 at 10:53 AM, kevin  wrote:

> *hi,all:*
> *My environment : Centos7.2 hadoop2.7.3 jdk1.8*
> *after I config hdfs with kerberos ,I can't start up with
> sbin/start-dfs.sh*
>
> *::namenode log as below  *
>
> *STARTUP_MSG:   build = Unknown -r Unknown; compiled by 'root' on
> 2016-09-18T09:05Z*
> *STARTUP_MSG:   java = 1.8.0_102*
> */*
> *2016-09-20 00:54:05,822 INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: registered UNIX signal
> handlers for [TERM, HUP, INT]*
> *2016-09-20 00:54:05,825 INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: createNameNode []*
> *2016-09-20 00:54:06,078 INFO
> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
> hadoop-metrics2.properties*
> *2016-09-20 00:54:06,149 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
> period at 10 second(s).*
> *2016-09-20 00:54:06,149 INFO
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system
> started*
> *2016-09-20 00:54:06,151 INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: fs.defaultFS is
> hdfs://dmp1.example.com:9000 *
> *2016-09-20 00:54:06,152 INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: Clients are to use
> dmp1.example.com:9000  to access this
> namenode/service.*
> *2016-09-20 00:54:06,446 INFO
> org.apache.hadoop.security.UserGroupInformation: Login successful for user
> hadoop/dmp1.example@example.com  using
> keytab file /etc/hadoop/conf/hdfs.keytab*
> *2016-09-20 00:54:06,472 INFO org.apache.hadoop.hdfs.DFSUtil: Starting web
> server as: HTTP/dmp1.example@example.com *
> *2016-09-20 00:54:06,475 INFO org.apache.hadoop.hdfs.DFSUtil: Starting
> Web-server for hdfs at: https://dmp1.example.com:50470
> *
> *2016-09-20 00:54:06,517 INFO org.mortbay.log: Logging to
> org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
> org.mortbay.log.Slf4jLog*
> *2016-09-20 00:54:06,533 INFO
> org.apache.hadoop.security.authentication.server.AuthenticationFilter:
> Unable to initialize FileSignerSecretProvider, falling back to use random
> secrets.*
> *2016-09-20 00:54:06,542 INFO org.apache.hadoop.http.HttpRequestLog: Http
> request log for http.requests.namenode is not defined*
> *2016-09-20 00:54:06,546 INFO org.apache.hadoop.http.HttpServer2: Added
> global filter 'safety'
> (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)*
> *2016-09-20 00:54:06,548 INFO org.apache.hadoop.http.HttpServer2: Added
> filter static_user_filter
> (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to
> context hdfs*
> *2016-09-20 00:54:06,548 INFO org.apache.hadoop.http.HttpServer2: Added
> filter static_user_filter
> (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to
> context static*
> *2016-09-20 00:54:06,548 INFO org.apache.hadoop.http.HttpServer2: Added
> filter static_user_filter
> (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to
> context logs*
> *2016-09-20 00:54:06,653 INFO org.apache.hadoop.http.HttpServer2: Added
> filter 'org.apache.hadoop.hdfs.web.Au
> thFilter'
> (class=org.apache.hadoop.hdfs.web.AuthFilter)*
> *2016-09-20 00:54:06,654 INFO org.apache.hadoop.http.HttpServer2:
> addJerseyResourcePackage:
> packageName=org.apache.hadoop.hdfs.server.namenode.web.resources;org.apache.hadoop.hdfs.web.resources,
> pathSpec=/webhdfs/v1/**
> *2016-09-20 00:54:06,657 INFO org.apache.hadoop.http.HttpServer2: Adding
> Kerberos (SPNEGO) filter to getDelegationToken*
> *2016-09-20 00:54:06,658 INFO org.apache.hadoop.http.HttpServer2: Adding
> Kerberos (SPNEGO) filter to renewDelegationToken*
> *2016-09-20 00:54:06,658 INFO org.apache.hadoop.http.HttpServer2: Adding
> Kerberos (SPNEGO) filter to cancelDelegationToken*
> *2016-09-20 00:54:06,659 INFO org.apache.hadoop.http.HttpServer2: Adding
> Kerberos (SPNEGO) filter to fsck*
> *2016-09-20 00:54:06

Re: hdfs2.7.3 kerberos can not startup

2016-09-21 Thread Rakesh Radhakrishnan
I could see "Ticket cache: KEYRING:persistent:1004:1004" in your env.

Maybe the KEYRING persistent cache setting is causing trouble: the
Kerberos libraries store the krb cache in the kernel keyring, a location
the Hadoop (Java) libraries can't seem to access. (A possible krb5.conf
tweak is sketched after the links below.)

Please refer these links,
https://community.hortonworks.com/questions/818/ipa-kerberos-not-liking-my-kinit-ticket.html
https://community.hortonworks.com/articles/11291/kerberos-cache-in-ipa-redhat-idm-keyring-solved-1.html
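
One commonly suggested workaround (discussed in links like the ones above)
is to switch the default cache back to a plain file in /etc/krb5.conf and
then kinit again; roughly:

  [libdefaults]
    # replace the KEYRING setting, or comment it out entirely
    default_ccache_name = FILE:/tmp/krb5cc_%{uid}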


Rakesh
Intel

On Wed, Sep 21, 2016 at 2:21 PM, kevin  wrote:

> [hadoop@dmp1 ~]$ hdfs dfs -ls /
> 16/09/20 15:00:44 WARN ipc.Client: Exception encountered while connecting
> to the server : javax.security.sasl.SaslException: GSS initiate failed
> [Caused by GSSException: No valid credentials provided (Mechanism level:
> Failed to find any Kerberos tgt)]
> ls: Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException:
> GSS initiate failed [Caused by GSSException: No valid credentials provided
> (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local
> host is: "dmp1.youedata.com/192.168.249.129"; destination host is: "
> dmp1.youedata.com":9000;
> [hadoop@dmp1 ~]$ klist
> Ticket cache: KEYRING:persistent:1004:1004
> Default principal: had...@example.com
>
> Valid starting   Expires  Service principal
> 09/20/2016 14:57:34  09/21/2016 14:57:31  krbtgt/example@example.com
> renew until 09/27/2016 14:57:31
> [hadoop@dmp1 ~]$
>
> I have run kinit had...@example.com before .
>
> 2016-09-21 10:14 GMT+08:00 Wei-Chiu Chuang :
>
>> You need to run kinit command to authenticate before running hdfs dfs -ls
>> command.
>>
>> Wei-Chiu Chuang
>>
>> On Sep 20, 2016, at 6:59 PM, kevin  wrote:
>>
>> Thank you Brahma Reddy Battula.
>> It's because of my problerm of the hdfs-site config file and https
>> ca configuration.
>> now I can startup namenode and I can see the datanodes from the web.
>> but When I try hdfs dfs -ls /:
>>
>> *[hadoop@dmp1 hadoop-2.7.3]$ hdfs dfs -ls /*
>> *16/09/20 07:56:48 WARN ipc.Client: Exception encountered while
>> connecting to the server : javax.security.sasl.SaslException: GSS initiate
>> failed [Caused by GSSException: No valid credentials provided (Mechanism
>> level: Failed to find any Kerberos tgt)]*
>> *ls: Failed on local exception: java.io.IOException:
>> javax.security.sasl.SaslException: GSS initiate failed [Caused by
>> GSSException: No valid credentials provided (Mechanism level: Failed to
>> find any Kerberos tgt)]; Host Details : local host is:
>> "dmp1.example.com/192.168.249.129
>> <http://dmp1.example.com/192.168.249.129>"; destination host is: "dmp1.*
>> *example**.com":9000; *
>>
>> current user is hadoop which startup hdfs , and I have add addprinc
>> hadoop with commond :
>> kadmin.local -q "addprinc hadoop"
>>
>>
>> 2016-09-20 17:33 GMT+08:00 Brahma Reddy Battula <
>> brahmareddy.batt...@huawei.com>:
>>
>>> Seems to be property problem.. it should be *principal* ( “l” is
>>> missed).
>>>
>>>
>>>
>>> **
>>>
>>> *  dfs.secondary.namenode.kerberos.principa*
>>>
>>> *  hadoop/_h...@example.com *
>>>
>>> **
>>>
>>>
>>>
>>>
>>>
>>> For namenode httpserver start fail, please check rakesh comments..
>>>
>>>
>>>
>>> This is probably due to some missing configuration.
>>>
>>> Could you please re-check the ssl-server.xml, keystore and truststore
>>> properties:
>>>
>>>
>>>
>>> ssl.server.keystore.location
>>>
>>> ssl.server.keystore.keypassword
>>>
>>> ssl.client.truststore.location
>>>
>>> ssl.client.truststore.password
>>>
>>>
>>>
>>>
>>>
>>> --Brahma Reddy Battula
>>>
>>>
>>>
>>> *From:* kevin [mailto:kiss.kevin...@gmail.com]
>>> *Sent:* 20 September 2016 16:53
>>> *To:* Rakesh Radhakrishnan
>>> *Cc:* user.hadoop
>>> *Subject:* Re: hdfs2.7.3 kerberos can not startup
>>>
>>>
>>>
>>> thanks, but my issue is name node could  *Login successful,but second
>>> namenode couldn't. and name node got a HttpServer.start() threw a non Bind
>>> IOException:*
>>>
>>>
>>>
>>> hdfs-site.xml:
>>>
>>>
>>>
>>> **
>>>

Re: hdfs2.7.3 not work with kerberos

2016-09-21 Thread Rakesh Radhakrishnan
I could see "Ticket cache: KEYRING:persistent:1004:1004" in your env.

Maybe the KEYRING persistent cache setting is causing trouble: the
Kerberos libraries store the krb cache in the kernel keyring, a location
the Hadoop (Java) libraries can't seem to access.

Please refer these links,
https://community.hortonworks.com/questions/818/ipa-kerberos-not-liking-my-kinit-ticket.html
https://community.hortonworks.com/articles/11291/kerberos-cache-in-ipa-redhat-idm-keyring-solved-1.html

Rakesh
Intel

On Wed, Sep 21, 2016 at 3:01 PM, lk_hadoop  wrote:

> hi,all:
> *My environment : Centos7.2 hadoop2.7.3 jdk1.8 kerberos 1.13.2-12.el7_2*
> *my hadoop user is hadoop,now I can startup hdfs with sbin/start-dfs.sh
> and I can see data node from the web.*
> *but when I try to hdfs dfs -ls / ,I got error:*
>
> [hadoop@dmp1 ~]$ hdfs dfs -ls /
> 16/09/20 15:00:44 WARN ipc.Client: Exception encountered while connecting
> to the server : javax.security.sasl.SaslException: GSS initiate failed
> [Caused by GSSException: No valid credentials provided (Mechanism level:
> Failed to find any Kerberos tgt)]
> ls: Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException:
> GSS initiate failed [Caused by GSSException: No valid credentials provided
> (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local
> host is: "dmp1.example.com/192.168.249.129"; destination host is: "
> dmp1.example.com":9000;
> [hadoop@dmp1 ~]$ klist
> Ticket cache: KEYRING:persistent:1004:1004
> Default principal: had...@example.com
>
> Valid starting   Expires  Service principal
> 09/20/2016 14:57:34  09/21/2016 14:57:31  krbtgt/example@example.com
>  renew until 09/27/2016 14:57:31
> [hadoop@dmp1 ~]$
>
> It's because of my jdk is *1.8 ?*
>
> 2016-09-21
> --
> lk_hadoop
>


Re: LeaseExpiredException: No lease on /user/biadmin/analytic‐root/SX5XPWPPDPQH/.

2016-10-16 Thread Rakesh Radhakrishnan
Hi Jian Feng,

Could you please check your code for any possibility of simultaneous
access to the same file? This situation mostly happens when multiple
clients try to write to the same file, or when the file is deleted or
replaced while the original writer still holds it open.

Code Reference:- https://github.com/apache/hadoop/blob/branch-2.2/hadoop-
hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/
hdfs/server/namenode/FSNamesystem.java#L2737

Best Regards,
Rakesh
Intel

On Mon, Oct 17, 2016 at 7:16 AM, Zhang Jianfeng  wrote:

> Hi ,
>
> I hit an wired error. On our hadoop cluster (2.2.0), occasionally a
> LeaseExpiredException is thrown.
>
> The stacktrace is as below:
>
>
> *org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on /user/biadmin/analytic‐root/SX5XPWPPDPQH/.executions/.at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2737)*
>
> *at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:2801)*
>
> *at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2783)*
>
> *at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:611)*
>
> *at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:428)*
>
> *at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59586)*
>
> *at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)*
>
> *at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)*
>
> *at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)*
>
> *at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)*
>
> *at java.security.AccessController.doPrivileged(AccessController.java:310)*
>
> *at javax.security.auth.Subject.doAs(Subject.java:573)*
>
> *at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502)*
>
> *at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)*
>
> *at org.apache.hadoop.ipc.Client.call(Client.java:1347)*
>
> *at org.apache.hadoop.ipc.Client.call(Client.java:1300)*
>
> *at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)*
>
> *at $Proxy7.complete(Unknown Source)*
>
> *at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)*
>
> *at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)*
>
> *at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)*
>
> at java.lang.reflect.Method.invoke(Method.java:611)
>
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(
> RetryInvocationHandler.java:186)
>
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(
> RetryInvocationHandler.java:102)
>
> at $Proxy7.complete(Unknown Source)
>
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslat
> orPB.complete(ClientNamenodeProtocolTranslatorPB.java:371)
>
> at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(
> DFSOutputStream.java:1894)
>
> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:1881)
>
> at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(
> FSDataOutputStream.java:71)
>
> at org.apache.hadoop.fs.FSDataOutputStream.close(
> FSDataOutputStream.java:104)
>
> at java.io.FilterOutputStream.close(FilterOutputStream.java:154)
>
> Any help will be appreciated!
>
> --
> Best Regards,
> Jian Feng
>


Re: Connecting Hadoop HA cluster via java client

2016-10-18 Thread Rakesh Radhakrishnan
Hi,

"dfs.namenode.http-address" is the fully-qualified HTTP address for each
NameNode to listen on. Similarly to the rpc-address configuration, set the
addresses that both NameNode HTTP servers (Web UIs) listen on; you can then
browse the status of the Active/Standby NN in a web browser. HDFS also
supports a secure HTTP server address and port; use
"dfs.namenode.https-address" for that.

For example:-
I assume dfs.nameservices(the logical name for your nameservice) config
item is configured as "mycluster"


<property>
  <name>dfs.namenode.http-address.mycluster.nn1</name>
  <value>machine1.example.com:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.mycluster.nn2</name>
  <value>machine2.example.com:50070</value>
</property>
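
And a bare-bones Java client sketch using those properties (the hosts,
ports and the "mycluster" nameservice id are the same placeholders as
above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://mycluster");
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1",
        "machine1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2",
        "machine2.example.com:8020");
    // The client-side proxy provider is what makes the logical URI resolve
    // to whichever NameNode is currently active.
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

    FileSystem fs = FileSystem.get(conf);
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath());
    }
  }
}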


Regards,
Rakesh

On Tue, Oct 18, 2016 at 7:32 PM, Pushparaj Motamari 
wrote:

> Hi,
>
> Following are not required I guess. I am able to connect to cluster
> without these. Is there any reason to include them?
>
> dfs.namenode.http-address.${dfs.nameservices}.nn1
>
> dfs.namenode.http-address.${dfs.nameservices}.nn2
>
> Regards
>
> Pushparaj
>
>
>
> On Wed, Oct 12, 2016 at 6:39 AM, 권병창  wrote:
>
>> Hi.
>>
>>
>>
>> 1. minimal configuration to connect HA namenode is below properties.
>>
>> zookeeper information does not necessary.
>>
>>
>>
>> dfs.nameservices
>>
>> dfs.ha.namenodes.${dfs.nameservices}
>>
>> dfs.namenode.rpc-address.${dfs.nameservices}.nn1
>>
>> dfs.namenode.rpc-address.${dfs.nameservices}.nn2
>>
>> dfs.namenode.http-address.${dfs.nameservices}.nn1
>>
>> dfs.namenode.http-address.${dfs.nameservices}.nn2
>> dfs.client.failover.proxy.provider.c3=org.apache.hadoop.hdfs
>> .server.namenode.ha.ConfiguredFailoverProxyProvider
>>
>>
>>
>>
>>
>> 2. client use round robin manner for selecting active namenode.
>>
>>
>>
>>
>>
>> -Original Message-
>> *From:* "Pushparaj Motamari"
>> *To:* ;
>> *Cc:*
>> *Sent:* 2016-10-12 (수) 03:20:53
>> *Subject:* Connecting Hadoop HA cluster via java client
>>
>> Hi,
>>
>> I have two questions pertaining to accessing the hadoop ha cluster from
>> java client.
>>
>> 1. Is  it necessary to supply
>>
>> conf.set("dfs.ha.automatic-failover.enabled",true);
>>
>> and
>>
>> conf.set("ha.zookeeper.quorum","zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
>>
>> in addition to the other properties set in the code below?
>>
>> private Configuration initHAConf(URI journalURI, Configuration conf) {
>>   conf.set(DFSConfigKeys.DFS_NAMENODE_SHARED_EDITS_DIR_KEY,
>>   journalURI.toString());
>>
>>   String address1 = "127.0.0.1:" + NN1_IPC_PORT;
>>   String address2 = "127.0.0.1:" + NN2_IPC_PORT;
>>   conf.set(DFSUtil.addKeySuffixes(DFS_NAMENODE_RPC_ADDRESS_KEY,
>>   NAMESERVICE, NN1), address1);
>>   conf.set(DFSUtil.addKeySuffixes(DFS_NAMENODE_RPC_ADDRESS_KEY,
>>   NAMESERVICE, NN2), address2);
>>   conf.set(DFSConfigKeys.DFS_NAMESERVICES, NAMESERVICE);
>>   conf.set(DFSUtil.addKeySuffixes(DFS_HA_NAMENODES_KEY_PREFIX, NAMESERVICE),
>>   NN1 + "," + NN2);
>>   conf.set(DFS_CLIENT_FAILOVER_PROXY_PROVIDER_KEY_PREFIX + "." + NAMESERVICE,
>>   ConfiguredFailoverProxyProvider.class.getName());
>>   conf.set("fs.defaultFS", "hdfs://" + NAMESERVICE);
>>
>>   return conf;}
>>
>> 2. If we supply zookeeper configuration details as mentioned in the question 
>> 1 is it necessary to set the primary and secondary namenode addresses as 
>> mentioned in the code above? Since we have
>> given zookeeper connection details the client should be able to figure out 
>> the active namenode connection details.
>>
>>
>> Regards
>>
>> Pushparaj
>>
>>
>


Re: Erasure Coding Policies

2019-07-01 Thread Rakesh Radhakrishnan
Hi,

RS Legacy is a pure Java-based implementation. You can look at the
encoding/decoding logic in the github repo:
https://github.com/Jerry-Xin/hadoop/blob/master/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/erasurecode/rawcoder/RSRawEncoderLegacy.java

https://github.com/Jerry-Xin/hadoop/blob/master/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/erasurecode/rawcoder/RSRawDecoderLegacy.java

The RS codec uses a native implementation which leverages the Intel ISA-L
library to improve the performance of the codec. Please go through the
links below to get more details on the ISA-L part (a few commands for
trying out the policies are sketched after the links).
https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html#Enable_Intel_ISA-L

https://blog.cloudera.com/blog/2019/06/hdfs-erasure-coding-in-production/
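
For a hands-on comparison, the policies can be listed and applied per
directory with the ec subcommand (the directory below is a placeholder;
policies that are not enabled by default need "hdfs ec -enablePolicy"
first):

  hdfs ec -listPolicies
  hdfs ec -setPolicy -path /data/ec-test -policy RS-6-3-1024k
  hdfs ec -getPolicy -path /data/ec-test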

Rakesh

On Mon, Jul 1, 2019 at 5:56 PM Nazerke S  wrote:

> Hi,
>
>
> What is the difference between  LEGACY-RS-6-3-1024k and RS-6-3-1024k EC
> policies?
>
>
>