hadoop missing file?

2013-07-29 Thread ch huang
One of my workmates told me that some of his files are missing. I ran fsck and found
the following info. How can I prevent files from going missing? Can anyone help me?

Status: HEALTHY
 Total size:272020850157 B (Total open files size: 652056 B)
 Total dirs:1143
 Total files:   1886 (Files currently being written: 2)
 Total blocks (validated):  5651 (avg. block size 48136763 B) (Total
open file blocks (not validated): 1)
 Minimally replicated blocks:   5651 (100.0 %)
 Over-replicated blocks:0 (0.0 %)
 Under-replicated blocks:   129 (2.2827818 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor:3
 Average block replication: 3.0
 Corrupt blocks:0
 Missing replicas:  903 (5.0571237 %)
 Number of data-nodes:  3
 Number of racks:   1
FSCK ended at Tue Jul 30 14:38:01 CST 2013 in 462 milliseconds


Re: Cannot write the output of the reducer to a sequence file

2013-07-29 Thread Pavan Sudheendra
Hi,
This is the output message which I got when it failed:

WARN hdfs.DFSClient: DataStreamer Exception:
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
on /sequenceOutput/_temporary/_attempt_local_0001_r_00_0/part-r-0
File does not exist. Holder DFSClient_NONMAPREDUCE_-7901_1 does
not have any open files.
13/07/29 17:04:20 WARN hdfs.DFSClient: Error Recovery for block null
bad datanode[0] nodes == null
13/07/29 17:04:20 WARN hdfs.DFSClient: Could not get block locations.
Source file 
"/sequenceOutput/_temporary/_attempt_local_0001_r_00_0/part-r-0"
- Aborting...
13/07/29 17:04:20 ERROR hdfs.DFSClient: Failed to close file
/sequenceOutput/_temporary/_attempt_local_0001_r_00_0/part-r-0


On Mon, Jul 29, 2013 at 9:34 PM, Harsh J  wrote:
> Hi,
>
> Can you explain the problem you actually face in trying to run the
> above setup? Do you also set your reducer output types?
>
> On Mon, Jul 29, 2013 at 4:48 PM, Pavan Sudheendra  wrote:
>> I have a Map function and a Reduce function outputting key-value pairs
>> of class Text and IntWritable. This is just the gist of the Map part
>> in the Main function:
>>
>> TableMapReduceUtil.initTableMapperJob(
>>   tablename,// input HBase table name
>>   scan, // Scan instance to control CF and attribute selection
>>   AnalyzeMapper.class,   // mapper
>>   Text.class, // mapper output key
>>   IntWritable.class, // mapper output value
>>   job);
>>
>> And here's my Reducer part in the Main function which writes the output to 
>> HDFS
>>
>> job.setReducerClass(AnalyzeReducerFile.class);
>> job.setNumReduceTasks(1);
>> FileOutputFormat.setOutputPath(job, new
>> Path("hdfs://localhost:54310/output_file"));
>>
>> How do I make the reducer write to a Sequence File instead?
>>
>> I've tried the following code, but it doesn't work:
>>
>> job.setReducerClass(AnalyzeReducerFile.class);
>> job.setNumReduceTasks(1);
>> job.setOutputFormatClass(SequenceFileOutputFormat.class);
>> SequenceFileOutputFormat.setOutputPath(job, new
>> Path("hdfs://localhost:54310/sequenceOutput"));
>>
>> Any help appreciated!
>>
>>
>>
>>
>> --
>> Regards-
>> Pavan
>
>
>
> --
> Harsh J



-- 
Regards-
Pavan
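
For illustration, a minimal driver sketch of the kind of setup Harsh is hinting at above: declaring the reducer output types explicitly next to the SequenceFile output format. The helper class/method names and the Text/IntWritable output types are assumptions based on the mapper configuration quoted in the thread, not a confirmed fix.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceOutputSetup {
    // Wires the reducer, its output key/value types, and the SequenceFile
    // output format onto an already-created Job. The mapper side is assumed
    // to be configured elsewhere (e.g. TableMapReduceUtil.initTableMapperJob,
    // as in the code quoted above).
    public static void configureReducerOutput(Job job,
            Class<? extends Reducer> reducerClass,
            String outputPath) {
        job.setReducerClass(reducerClass);
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);          // reducer output key type
        job.setOutputValueClass(IntWritable.class); // reducer output value type
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputPath(job, new Path(outputPath));
    }
}

Usage in the driver would be something like configureReducerOutput(job, AnalyzeReducerFile.class, "hdfs://localhost:54310/sequenceOutput"), called after the mapper-side setup.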


Re: dncp_block_verification

2013-07-29 Thread 闫昆
Viji R, thank you very much.


2013/7/26 Viji R 

> Hi,
>
> These are used to keep the block IDs that are being verified, i.e., the DN
> periodically matches blocks with stored checksums to root out
> corrupted or rotted data. They are removed once verification
> completes.
>
> Regards,
> Viji
>
> On Fri, Jul 26, 2013 at 11:39 AM, 闫昆  wrote:
> > Hi All
> > the datanode directory currently contains the following:
> > dncp_block_verification.log.prev
> > dncp_block_verification.log.curr
> > What is the function of these logs?
>


Re: Suspecting Namenode Filesystem Corrupt!!

2013-07-29 Thread Chris Embree
My foundation is more Linux than Hadoop, so I'll support Harsh (like he
needs it) in asking, "What's the problem?"  If you can't run df -h, this is
probably a "lower than Hadoop" issue, and while most Hadoop folks are
willing to help (see the fact that Harsh responded), this is 99.9% likely to
be an EXT4, XFS or other FS issue.

In summary, what are you seeing, specifically?


On Mon, Jul 29, 2013 at 10:42 PM, Harsh J  wrote:

> What do you mean by you "can't get back a result"? The command hangs,
> errors out, gives an incorrect result, what does it do? Can you post your
> error please?
> On Jul 30, 2013 1:31 AM, "Sathish Kumar"  wrote:
>
>> Hi All,
>>
>> When I issue the "df -h" command on the namenode, I am not able to get back
>> the result. Is it an issue with the filesystem?
>>
>> Regards
>> Sathish
>>
>


can kerberos cause performance impact on hadoop cluster?

2013-07-29 Thread ch huang
I have a production Hadoop cluster, and I want to know if Kerberos will
cause a big performance impact on my cluster. Thanks all


RE: objects as key/values

2013-07-29 Thread Devaraj k
You can write custom key/value classes by implementing the
org.apache.hadoop.io.Writable interface for your Job.

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/Writable.html

Thanks
Devaraj k

From: jamal sasha [mailto:jamalsha...@gmail.com]
Sent: 30 July 2013 10:27
To: user@hadoop.apache.org
Subject: objects as key/values

Ok.
  A very basic (stupid) question.
I am trying to compute mean using hadoop.

So my implementation is like this:

public class Mean
 public static class Pair{
  //simple class to create object
}
 public class MeanMapper
   emit(text,pair) //where pair is (local sum, count)

 public class MeanReducer
emit (text, mean)

Unfortunately, such an approach of creating custom class types is not working,
since in the job I have to set the output types for the mapper/reducer...
How are custom key/value pairs implemented in Hadoop?
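
A minimal sketch of such a custom value class along the lines Devaraj suggests: a Pair carrying a partial sum and a count, implementing org.apache.hadoop.io.Writable. The class and field names are only illustrative assumptions, not code from the thread.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Custom value type for the mean computation: carries a local sum and the
// number of values that contributed to it.
public class Pair implements Writable {
    private double sum;
    private long count;

    public Pair() {}  // Hadoop needs a no-argument constructor for deserialization

    public Pair(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeDouble(sum);   // serialize the fields in a fixed order
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sum = in.readDouble();  // read them back in the same order
        count = in.readLong();
    }

    public double getSum() { return sum; }
    public long getCount() { return count; }
}

The driver then declares it with job.setMapOutputValueClass(Pair.class) (keys stay Text), and the reducer adds up the partial sums and counts before emitting the mean. If the class were used as a key rather than a value, it would additionally need to implement WritableComparable.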




Re: Suspecting Namenode Filesystem Corrupt!!

2013-07-29 Thread Harsh J
What do you mean by you "can't get back a result"? The command hangs,
errors out, gives an incorrect result, what does it do? Can you post your
error please?
On Jul 30, 2013 1:31 AM, "Sathish Kumar"  wrote:

> Hi All,
>
> When I issue the "df -h" command on the namenode, I am not able to get back
> the result. Is it an issue with the filesystem?
>
> Regards
> Sathish
>


objects as key/values

2013-07-29 Thread jamal sasha
Ok.
  A very basic (stupid) question.
I am trying to compute mean using hadoop.

So my implementation is like this:

public class Mean
 public static class Pair{
  //simple class to create object
}
 public class MeanMapper
   emit(text,pair) //where pair is (local sum, count)

 public class MeanReducer
emit (text, mean)

Unfortunately, such an approach of creating custom class types is not working,
since in the job I have to set the output types for the mapper/reducer...
How are custom key/value pairs implemented in Hadoop?


Suspecting Namenode Filesystem Corrupt!!

2013-07-29 Thread Sathish Kumar
Hi All,

When I issue the "df -h" command on the namenode, I am not able to get back the
result. Is it an issue with the filesystem?

Regards
Sathish


Re: Error on running a hadoop job

2013-07-29 Thread Harsh J
You could perhaps see this if you had removed /wordcount_raw/ while the
reducer was running. Otherwise, the path has an attempt ID, so I am
not seeing how it could have had two writers conflicting to cause
this. What version are you on, if you didn't remove /wordcount_raw?

On Mon, Jul 29, 2013 at 11:32 PM, Pavan Sudheendra  wrote:
> I'm getting the exact same error. Only thing is I'm trying to write to a
> sequence file.
>
> Regards,
> Pavan
>
> On Jul 29, 2013 11:29 PM, "jamal sasha"  wrote:
>>
>> Hi,
>>  I am getting a weird error:
>>
>> 13/07/29 10:50:58 INFO mapred.JobClient: Task Id :
>> attempt_201307102216_0145_r_16_0, Status : FAILED
>> org.apache.hadoop.ipc.RemoteException:
>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
>> /wordcount_raw/_temporary/_attempt_201307102216_0145_r_16_0/part-r-00016
>> File does not exist. Holder DFSClient_attempt_201307102216_0145_r_16_0
>> does not have any open files.
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1629)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1620)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1536)
>> at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
>> at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> at java.lang.reflect.Method.invoke(Method.java:597)
>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:396)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
>>
>> at org.apache.hadoop.ipc.Client.call(Client.java:1066)
>> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>> at $Proxy2.addBlock(Unknown Source)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> at java.lang.reflect.Method.invoke(Method.java:597)
>> at
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>> at
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>> at $Proxy2.addBlock(Unknown Source)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3507)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3370)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2700(DFSClient.java:2586)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2826)
>>
>> I am guessing that it is because, in the intermediate step (after the map
>> phase), there were some temp files which might have got deleted while the
>> reducer was trying to read them.
>> How do I resolve this?
>> I am trying to run a simple word count example, but on the complete wiki dump.
>> Thanks
>>
>>
>



-- 
Harsh J


Re: Error on running a hadoop job

2013-07-29 Thread Pavan Sudheendra
I'm getting the exact same error. Only thing is I'm trying to write to a
sequence file.

Regards,
Pavan
On Jul 29, 2013 11:29 PM, "jamal sasha"  wrote:

> Hi,
>  I am getting a weird error:
>
> 13/07/29 10:50:58 INFO mapred.JobClient: Task Id :
> attempt_201307102216_0145_r_16_0, Status : FAILED
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
> /wordcount_raw/_temporary/_attempt_201307102216_0145_r_16_0/part-r-00016
> File does not exist. Holder DFSClient_attempt_201307102216_0145_r_16_0
> does not have any open files.
>  at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1629)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1620)
>  at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1536)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
>  at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
>  at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
>  at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1066)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>  at $Proxy2.addBlock(Unknown Source)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>  at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
>  at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>  at $Proxy2.addBlock(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3507)
>  at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3370)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2700(DFSClient.java:2586)
>  at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2826)
>
> I am guessing that it is because, in the intermediate step (after the map
> phase), there were some temp files which might have got deleted while the
> reducer was trying to read them.
> How do I resolve this?
> I am trying to run a simple word count example, but on the complete wiki dump.
> Thanks
>
>
>


Error on running a hadoop job

2013-07-29 Thread jamal sasha
Hi,
 I am getting a weird error:

13/07/29 10:50:58 INFO mapred.JobClient: Task Id :
attempt_201307102216_0145_r_16_0, Status : FAILED
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/wordcount_raw/_temporary/_attempt_201307102216_0145_r_16_0/part-r-00016
File does not exist. Holder DFSClient_attempt_201307102216_0145_r_16_0
does not have any open files.
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1629)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1620)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1536)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

at org.apache.hadoop.ipc.Client.call(Client.java:1066)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at $Proxy2.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy2.addBlock(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3507)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3370)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2700(DFSClient.java:2586)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2826)

I am guessing that it is because, in the intermediate step (after the map phase),
there were some temp files which might have got deleted while the reducer was
trying to read them.
How do I resolve this?
I am trying to run a simple word count example, but on the complete wiki dump.
Thanks


Re: Cannot write the output of the reducer to a sequence file

2013-07-29 Thread Harsh J
Hi,

Can you explain the problem you actually face in trying to run the
above setup? Do you also set your reducer output types?

On Mon, Jul 29, 2013 at 4:48 PM, Pavan Sudheendra  wrote:
> I have a Map function and a Reduce function outputting key-value pairs
> of class Text and IntWritable. This is just the gist of the Map part
> in the Main function:
>
> TableMapReduceUtil.initTableMapperJob(
>   tablename,// input HBase table name
>   scan, // Scan instance to control CF and attribute selection
>   AnalyzeMapper.class,   // mapper
>   Text.class, // mapper output key
>   IntWritable.class, // mapper output value
>   job);
>
> And here's my Reducer part in the Main function which writes the output to 
> HDFS
>
> job.setReducerClass(AnalyzeReducerFile.class);
> job.setNumReduceTasks(1);
> FileOutputFormat.setOutputPath(job, new
> Path("hdfs://localhost:54310/output_file"));
>
> How do I make the reducer write to a Sequence File instead?
>
> I've tried the following code, but it doesn't work:
>
> job.setReducerClass(AnalyzeReducerFile.class);
> job.setNumReduceTasks(1);
> job.setOutputFormatClass(SequenceFileOutputFormat.class);
> SequenceFileOutputFormat.setOutputPath(job, new
> Path("hdfs://localhost:54310/sequenceOutput"));
>
> Any help appreciated!
>
>
>
>
> --
> Regards-
> Pavan



-- 
Harsh J


Re: Abstraction layer to support both YARN and Mesos

2013-07-29 Thread Tsuyoshi OZAWA
Harsh, yes, I know what you mean :-) Never mind. We should discuss
this topic with MR users.

On Tue, Jul 30, 2013 at 12:08 AM, Michael Segel
 wrote:
> Actually,
> I am interested.
>
> Lots of different Apache top level projects seem to overlap and it can be 
> confusing.
> It's very easy for a good technology to get starved because no one asks how to
> combine these features into the framework.
>
> On Jul 29, 2013, at 10:06 AM, Michael Segel  wrote:
>
>> Actually,
>> I am interested.
>>
>> Lots of different Apache top level projects seem to overlap and it can be 
>> confusing.
>> It's very easy for a good technology to get starved because no one asks how
>> to combine these features into the framework.
>>
>>
>> On Jul 29, 2013, at 9:58 AM, Tsuyoshi OZAWA  wrote:
>>
>>> I thought some high availability and resource isolation features in
>>> Mesos are more mature. If no one is interested in this topic, MR
>>> should go with YARN.
>>>
>>> On Fri, Jul 26, 2013 at 7:14 PM, Harsh J  wrote:
 Do we have a good reason to prefer Mesos over YARN for scheduling MR
 specifically? At what times would one prefer the other?

 On Fri, Jul 26, 2013 at 11:43 AM, Tsuyoshi OZAWA
  wrote:
> Hi,
>
> Now, Apache Mesos, a distributed resource manager, is a top-level
> Apache project. Meanwhile, as you know, Hadoop has its own resource
> manager - YARN. IMHO, we should make the resource manager pluggable in
> MRv2, because there are environments where users of MapReduce would
> like to use their own. I think this work is useful for MapReduce
> users. On the other hand, this work can also be large, because MRv2's
> code base is tightly coupled with YARN currently. Thoughts?
>
> - Tsuyoshi



 --
 Harsh J
>>>
>>>
>>>
>>> --
>>> - Tsuyoshi
>>>
>>
>



-- 
- Tsuyoshi


Re: Why Hadoop force using DNS?

2013-07-29 Thread Greg Bledsoe
But even if you have permission to change /etc/hosts, /etc/hosts resolution 
seems to introduce instability for the reverse lookup leading to unpredictable 
results.  DNS gets used, and if this doesn't match your /etc/hosts file, you 
have problems.  Or am I missing something?

Greg

From: Chris Embree <cemb...@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>, "ch...@embree.us" <ch...@embree.us>
Date: Mon, 29 Jul 2013 09:45:22 -0500
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Why Hadoop force using DNS?

Just for clarity,  DNS as a service is NOT Required.  Name resolution is.  I 
use /etc/hosts files to identify all nodes in my clusters.

One of the reasons for using names over IPs is ease of use.  I would much 
rather use a hostname in my XML to identify NN, JT, etc. vs. some random string 
of numbers.




On Mon, Jul 29, 2013 at 10:40 AM, Greg Bledsoe 
<g...@personal.com> wrote:
I can third this concern.  What purpose does this complexity-increasing 
requirement serve?  Why not remove it?

Greg Bledsoe

From: 武泽胜 <wuzesh...@xiaomi.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Mon, 29 Jul 2013 08:21:51 -0500
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Why Hadoop force using DNS?

I have the same confusion; any reply to this will be much appreciated.

From: Elazar Leibovich <elaz...@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Thursday, July 25, 2013 3:51 AM
To: user <user@hadoop.apache.org>
Subject: Why Hadoop force using DNS?

Looking at the Hadoop source you can see that Hadoop relies on the fact that each
node has a resolvable name.

For example, the Hadoop 2 namenode does a reverse lookup of each node that connects
to it. Also, there's no way to tell a datanode to advertise an IP as its
address. Setting datanode.network.interface to, say, eth1, would cause Hadoop
to reverse look up the IPs on eth1 and advertise the result.

Why is that? Using plain IPs is simple to set up, and I can't see a reason not
to support them.



Re: Abstraction layer to support both YARN and Mesos

2013-07-29 Thread Michael Segel
Actually, 
I am interested. 

Lots of different Apache top level projects seem to overlap and it can be 
confusing. 
It's very easy for a good technology to get starved because no one asks how to 
combine these features into the framework.


On Jul 29, 2013, at 9:58 AM, Tsuyoshi OZAWA  wrote:

> I thought some high availability and resource isolation features in
> Mesos are more mature. If no one is interested in this topic, MR
> should go with YARN.
> 
> On Fri, Jul 26, 2013 at 7:14 PM, Harsh J  wrote:
>> Do we have a good reason to prefer Mesos over YARN for scheduling MR
>> specifically? At what times would one prefer the other?
>> 
>> On Fri, Jul 26, 2013 at 11:43 AM, Tsuyoshi OZAWA
>>  wrote:
>>> Hi,
>>> 
>>> Now, Apache Mesos, a distributed resource manager, is a top-level
>>> Apache project. Meanwhile, as you know, Hadoop has its own resource
>>> manager - YARN. IMHO, we should make the resource manager pluggable in
>>> MRv2, because there are environments where users of MapReduce would
>>> like to use their own. I think this work is useful for MapReduce
>>> users. On the other hand, this work can also be large, because MRv2's
>>> code base is tightly coupled with YARN currently. Thoughts?
>>> 
>>> - Tsuyoshi
>> 
>> 
>> 
>> --
>> Harsh J
> 
> 
> 
> -- 
> - Tsuyoshi
> 



Re: Why Hadoop force using DNS?

2013-07-29 Thread Elazar Leibovich
This is a reason to force reverse resolution of IPs if they do not appear
in the dfs.allow. If the IP appears in dfs.allow, there's no reason to
reverse-resolve it.


On Mon, Jul 29, 2013 at 4:48 PM, Daryn Sharp  wrote:

>  One reason is that the lists to accept or reject DNs accept hostnames.  If DNS
> temporarily can't resolve an IP, then an unauthorized DN might slip back
> into the cluster, or a decommissioning node might go back into service.
>
>  Daryn
>
>
>  On Jul 29, 2013, at 8:21 AM, 武泽胜 wrote:
>
>  I have the same confusion; any reply to this will be much appreciated.
>
>   From: Elazar Leibovich 
> Reply-To: "user@hadoop.apache.org" 
> Date: Thursday, July 25, 2013 3:51 AM
> To: user 
> Subject: Why Hadoop force using DNS?
>
>  Looking at the Hadoop source you can see that Hadoop relies on the fact
> that each node has a resolvable name.
>
>  For example, the Hadoop 2 namenode does a reverse lookup of each node that
> connects to it. Also, there's no way to tell a datanode to advertise an
> IP as its address. Setting datanode.network.interface to, say, eth1, would
> cause Hadoop to reverse look up the IPs on eth1 and advertise the result.
>
>  Why is that? Using plain IPs is simple to set up, and I can't see a
> reason not to support them.
>
>
>


Re: Abstraction layer to support both YARN and Mesos

2013-07-29 Thread Michael Segel
Actually,
I am interested.

Lots of different Apache top level projects seem to overlap and it can be 
confusing.
It's very easy for a good technology to get starved because no one asks how to 
combine these features into the framework.

On Jul 29, 2013, at 10:06 AM, Michael Segel  wrote:

> Actually, 
> I am interested. 
> 
> Lots of different Apache top level projects seem to overlap and it can be 
> confusing. 
> It's very easy for a good technology to get starved because no one asks how to 
> combine these features into the framework.
> 
> 
> On Jul 29, 2013, at 9:58 AM, Tsuyoshi OZAWA  wrote:
> 
>> I thought some high availability and resource isolation features in
>> Mesos are more mature. If no one is interested in this topic, MR
>> should go with YARN.
>> 
>> On Fri, Jul 26, 2013 at 7:14 PM, Harsh J  wrote:
>>> Do we have a good reason to prefer Mesos over YARN for scheduling MR
>>> specifically? At what times would one prefer the other?
>>> 
>>> On Fri, Jul 26, 2013 at 11:43 AM, Tsuyoshi OZAWA
>>>  wrote:
 Hi,
 
 Now, Apache Mesos, a distributed resource manager, is a top-level
 Apache project. Meanwhile, as you know, Hadoop has its own resource
 manager - YARN. IMHO, we should make the resource manager pluggable in
 MRv2, because there are environments where users of MapReduce would
 like to use their own. I think this work is useful for MapReduce
 users. On the other hand, this work can also be large, because MRv2's
 code base is tightly coupled with YARN currently. Thoughts?
 
 - Tsuyoshi
>>> 
>>> 
>>> 
>>> --
>>> Harsh J
>> 
>> 
>> 
>> -- 
>> - Tsuyoshi
>> 
> 



Re: Abstraction layer to support both YARN and Mesos

2013-07-29 Thread Tsuyoshi OZAWA
I thought some high availability and resource isolation features in
Mesos are more mature. If no one is interested in this topic, MR
should go with YARN.

On Fri, Jul 26, 2013 at 7:14 PM, Harsh J  wrote:
> Do we have a good reason to prefer Mesos over YARN for scheduling MR
> specifically? At what times would one prefer the other?
>
> On Fri, Jul 26, 2013 at 11:43 AM, Tsuyoshi OZAWA
>  wrote:
>> Hi,
>>
>> Now, Apache Mesos, a distributed resource manager, is a top-level
>> Apache project. Meanwhile, as you know, Hadoop has its own resource
>> manager - YARN. IMHO, we should make the resource manager pluggable in
>> MRv2, because there are environments where users of MapReduce would
>> like to use their own. I think this work is useful for MapReduce
>> users. On the other hand, this work can also be large, because MRv2's
>> code base is tightly coupled with YARN currently. Thoughts?
>>
>> - Tsuyoshi
>
>
>
> --
> Harsh J



-- 
- Tsuyoshi


Re: Why Hadoop force using DNS?

2013-07-29 Thread Elazar Leibovich
Ease of use is a reason to support names, not to intentionally disallow raw
IPs. Not using names is convenient if you want to erect a temporary cluster
on a group of machines you don't own.

You have user access, but name resolution is not always defined. As a
user you cannot change /etc/hosts.
On Jul 29, 2013 5:46 PM, "Chris Embree"  wrote:

> Just for clarity,  DNS as a service is NOT Required.  Name resolution is.
>  I use /etc/hosts files to identify all nodes in my clusters.
>
> One of the reasons for using names over IPs is ease of use.  I would much
> rather use a hostname in my XML to identify NN, JT, etc. vs. some random
> string of numbers.
>
>
>
>
> On Mon, Jul 29, 2013 at 10:40 AM, Greg Bledsoe  wrote:
>
>> I can third this concern.  What purpose does this complexity-increasing
>> requirement serve?  Why not remove it?
>>
>> Greg Bledsoe
>>
>> From: 武泽胜 
>> Reply-To: "user@hadoop.apache.org" 
>> Date: Mon, 29 Jul 2013 08:21:51 -0500
>> To: "user@hadoop.apache.org" 
>> Subject: Re: Why Hadoop force using DNS?
>>
>> I have the same confusion; any reply to this will be much appreciated.
>>
>> From: Elazar Leibovich 
>> Reply-To: "user@hadoop.apache.org" 
>> Date: Thursday, July 25, 2013 3:51 AM
>> To: user 
>> Subject: Why Hadoop force using DNS?
>>
>> Looking at the Hadoop source you can see that Hadoop relies on the fact that
>> each node has a resolvable name.
>>
>> For example, the Hadoop 2 namenode does a reverse lookup of each node that
>> connects to it. Also, there's no way to tell a datanode to advertise an
>> IP as its address. Setting datanode.network.interface to, say, eth1, would
>> cause Hadoop to reverse look up the IPs on eth1 and advertise the result.
>>
>> Why is that? Using plain IPs is simple to set up, and I can't see a
>> reason not to support them.
>>
>
>


Re: Why Hadoop force using DNS?

2013-07-29 Thread Chris Embree
Just for clarity,  DNS as a service is NOT Required.  Name resolution is.
 I use /etc/hosts files to identify all nodes in my clusters.

One of the reasons for using names over IPs is ease of use.  I would much
rather use a hostname in my XML to identify NN, JT, etc. vs. some random
string of numbers.
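
As a small illustration of the point about hostnames in the configuration XML, a hypothetical core-site.xml fragment; the host name and port are placeholders:

<configuration>
  <property>
    <!-- NameNode referenced by hostname (resolvable via DNS or /etc/hosts), not by raw IP -->
    <!-- fs.default.name is the equivalent property name on older releases -->
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host.example.com:8020</value>
  </property>
</configuration>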




On Mon, Jul 29, 2013 at 10:40 AM, Greg Bledsoe  wrote:

> I can third this concern.  What purpose does this complexity-increasing
> requirement serve?  Why not remove it?
>
> Greg Bledsoe
>
> From: 武泽胜 
> Reply-To: "user@hadoop.apache.org" 
> Date: Mon, 29 Jul 2013 08:21:51 -0500
> To: "user@hadoop.apache.org" 
> Subject: Re: Why Hadoop force using DNS?
>
> I have the same confusion; any reply to this will be much appreciated.
>
> From: Elazar Leibovich 
> Reply-To: "user@hadoop.apache.org" 
> Date: Thursday, July 25, 2013 3:51 AM
> To: user 
> Subject: Why Hadoop force using DNS?
>
> Looking at the Hadoop source you can see that Hadoop relies on the fact that
> each node has a resolvable name.
>
> For example, the Hadoop 2 namenode does a reverse lookup of each node that
> connects to it. Also, there's no way to tell a datanode to advertise an
> IP as its address. Setting datanode.network.interface to, say, eth1, would
> cause Hadoop to reverse look up the IPs on eth1 and advertise the result.
>
> Why is that? Using plain IPs is simple to set up, and I can't see a reason
> not to support them.
>


Re: Why Hadoop force using DNS?

2013-07-29 Thread Greg Bledsoe
I can third this concern.  What purpose does this complexity-increasing 
requirement serve?  Why not remove it?

Greg Bledsoe

From: 武泽胜 <wuzesh...@xiaomi.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Mon, 29 Jul 2013 08:21:51 -0500
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Why Hadoop force using DNS?

I have the same confusion; any reply to this will be much appreciated.

From: Elazar Leibovich <elaz...@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Thursday, July 25, 2013 3:51 AM
To: user <user@hadoop.apache.org>
Subject: Why Hadoop force using DNS?

Looking at the Hadoop source you can see that Hadoop relies on the fact that each
node has a resolvable name.

For example, the Hadoop 2 namenode does a reverse lookup of each node that connects
to it. Also, there's no way to tell a datanode to advertise an IP as its
address. Setting datanode.network.interface to, say, eth1, would cause Hadoop
to reverse look up the IPs on eth1 and advertise the result.

Why is that? Using plain IPs is simple to set up, and I can't see a reason not
to support them.


Re: Why Hadoop force using DNS?

2013-07-29 Thread Daryn Sharp
One reason is that the lists to accept or reject DNs accept hostnames.  If DNS 
temporarily can't resolve an IP, then an unauthorized DN might slip back into 
the cluster, or a decommissioning node might go back into service.

Daryn
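
Presumably this refers to the include/exclude host files; a hypothetical hdfs-site.xml sketch, assuming the standard dfs.hosts / dfs.hosts.exclude properties, with placeholder file paths:

<configuration>
  <property>
    <!-- file listing hostnames permitted to register as DataNodes -->
    <name>dfs.hosts</name>
    <value>/etc/hadoop/conf/dfs.include</value>
  </property>
  <property>
    <!-- file listing hostnames being decommissioned or rejected -->
    <name>dfs.hosts.exclude</name>
    <value>/etc/hadoop/conf/dfs.exclude</value>
  </property>
</configuration>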

On Jul 29, 2013, at 8:21 AM, 武泽胜 wrote:

I have the same confusion; any reply to this will be much appreciated.

From: Elazar Leibovich <elaz...@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Thursday, July 25, 2013 3:51 AM
To: user <user@hadoop.apache.org>
Subject: Why Hadoop force using DNS?

Looking at the Hadoop source you can see that Hadoop relies on the fact that each
node has a resolvable name.

For example, the Hadoop 2 namenode does a reverse lookup of each node that connects
to it. Also, there's no way to tell a datanode to advertise an IP as its
address. Setting datanode.network.interface to, say, eth1, would cause Hadoop
to reverse look up the IPs on eth1 and advertise the result.

Why is that? Using plain IPs is simple to set up, and I can't see a reason not
to support them.



Re: Why Hadoop force using DNS?

2013-07-29 Thread 武泽胜
I have the same confusion; any reply to this will be much appreciated.

From: Elazar Leibovich <elaz...@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Thursday, July 25, 2013 3:51 AM
To: user <user@hadoop.apache.org>
Subject: Why Hadoop force using DNS?

Looking at the Hadoop source you can see that Hadoop relies on the fact that each
node has a resolvable name.

For example, the Hadoop 2 namenode does a reverse lookup of each node that connects
to it. Also, there's no way to tell a datanode to advertise an IP as its
address. Setting datanode.network.interface to, say, eth1, would cause Hadoop
to reverse look up the IPs on eth1 and advertise the result.

Why is that? Using plain IPs is simple to set up, and I can't see a reason not
to support them.


Unknown method getClusterMetrics [...] on ClientRMProtocol$ClientRMProtocolService$BlockingInterface protocol.

2013-07-29 Thread Johannes Schaback
Dear all,

I am trying to run the distributed shell client on a Windows machine to
submit it to a YARN resource manager on a Linux box. I am stuck with the
following client error message ...

Unknown method getClusterMetrics called on interface
org.apache.hadoop.yarn.proto.ClientRMProtocol$ClientRMProtocolService$BlockingInterface
protocol.

... which I get right after the

13/07/29 14:22:55 DEBUG ipc.RPC: Call: getClusterMetrics 115

... call. The distributed shell example works as expected when executed
either on the Linux box or on the Windows box in a single-node setup.

A bit more context:
- hadoop version 0.23.9
- both machines run in the same network
- both machines use the same JDK 6
- user names differ between Windows client and Linux resource manager

The client log reads:

13/07/29 14:22:54 INFO service.AbstractService:
Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/07/29 14:22:54 DEBUG ipc.Client: The ping interval is 6 ms.
13/07/29 14:22:54 DEBUG ipc.Client: Use SIMPLE authentication for protocol
BlockingInterface
13/07/29 14:22:54 DEBUG ipc.Client: Connecting to /192.168.3.152:8033
13/07/29 14:22:55 DEBUG ipc.Client: IPC Client (766488133) connection to /
192.168.3.152:8033 from jSchaback sending #0
13/07/29 14:22:55 DEBUG ipc.Client: IPC Client (766488133) connection to /
192.168.3.152:8033 from jSchaback: starting, having connections 1
13/07/29 14:22:55 DEBUG ipc.Client: IPC Client (766488133) connection to /
192.168.3.152:8033 from jSchaback got value #0
13/07/29 14:22:55 DEBUG ipc.RPC: Call: getClusterMetrics 115
13/07/29 14:22:55 FATAL visualmeta.MyClient: Error running CLient
RemoteTrace:
java.io.IOException: Unknown method getClusterMetrics called on interface
org.apache.hadoop.yarn.proto.ClientRMProtocol$ClientRMProtocolService$BlockingInterface
protocol.
at
org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:345)

The remote resource manager log reads:

2013-07-29 14:40:45,204 WARN org.apache.hadoop.ipc.Server: Unknown method
getClusterMetrics called on interface
org.apache.hadoop.yarn.proto.ClientRMProtocol$ClientRMProtocolService$BlockingInterface
protocol.
2013-07-29 14:40:45,273 INFO org.apache.hadoop.ipc.Server: IPC Server
listener on 8033: readAndProcess threw exception java.io.IOException:
Connection reset by peer from client 192.168.3.246. Count of bytes read: 0
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
at sun.nio.ch.IOUtil.read(IOUtil.java:171)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
at org.apache.hadoop.ipc.Server.channelRead(Server.java:1981)
at org.apache.hadoop.ipc.Server.access$2600(Server.java:100)
at
org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1227)
at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:651)
at
org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:450)
at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:425)


What exactly leads to this problem? Is it a known issue that RPC interfaces
are inconsistent across different OSes? After all, it's all Java, so I don't
really expect that to be an issue, but I can't think of any other reason.

Any help is very much appreciated.

Thanks and regards,

Johannes


Cannot write the output of the reducer to a sequence file

2013-07-29 Thread Pavan Sudheendra
I have a Map function and a Reduce function outputting key-value pairs
of class Text and IntWritable. This is just the gist of the Map part
in the Main function:

TableMapReduceUtil.initTableMapperJob(
  tablename,// input HBase table name
  scan, // Scan instance to control CF and attribute selection
  AnalyzeMapper.class,   // mapper
  Text.class, // mapper output key
  IntWritable.class, // mapper output value
  job);

And here's my Reducer part in the Main function which writes the output to HDFS

job.setReducerClass(AnalyzeReducerFile.class);
job.setNumReduceTasks(1);
FileOutputFormat.setOutputPath(job, new
Path("hdfs://localhost:54310/output_file"));

How do I make the reducer write to a Sequence File instead?

I've tried the following code, but it doesn't work:

job.setReducerClass(AnalyzeReducerFile.class);
job.setNumReduceTasks(1);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputPath(job, new
Path("hdfs://localhost:54310/sequenceOutput"));

Any help appreciated!




-- 
Regards-
Pavan