Re: Best Linux Operating system used for Hadoop

2012-01-27 Thread alo alt
I suggest CentOS 5.7 / RHEL 5.7

CentOS 6.2 also runs stably.

- Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 27, 2012, at 10:15 AM, Sujit Dhamale wrote:

 Hi All,
 I am new to Hadoop.
 Can anyone tell me which is the best Linux operating system for
 installing and running Hadoop?
 These days I am using Ubuntu 11.04; I installed Hadoop on it, but it
 crashes a number of times.
 
 Can someone please help me out?
 
 
 Kind regards
 Sujit Dhamale



Re: Best Linux Operating system used for Hadoop

2012-01-27 Thread Sujit Dhamale
Thanks a lot, Alex.
I will install Linux RHEL today itself.

--Sujit Dhamale

On Fri, Jan 27, 2012 at 2:49 PM, alo alt wget.n...@googlemail.com wrote:

 I suggest CentOS 5.7 / RHEL 5.7

 CentOS 6.2 also runs stably.

 - Alex

 --
 Alexander Lorenz
 http://mapredit.blogspot.com

 On Jan 27, 2012, at 10:15 AM, Sujit Dhamale wrote:

  Hi All,
  I am new to Hadoop.
  Can anyone tell me which is the best Linux operating system for
  installing and running Hadoop?
  These days I am using Ubuntu 11.04; I installed Hadoop on it, but it
  crashes a number of times.
 
  Can someone please help me out?
 
 
  Kind regards
  Sujit Dhamale




Re: NoSuchElementException while Reduce step

2012-01-27 Thread hadoop hive
Hey, there must be some problem with the key or value; the reducer didn't find
the expected value.
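
Looking at your reduce loop, one likely cause: values.next() is called twice per
hasNext() check (once for getCount() and once for getString()), so the second
call can step past the end of the iterator and throw NoSuchElementException.
A minimal sketch of a corrected reduce(), reusing the InvertedStruct class from
your posted code:

 public void reduce(Text key, Iterator<InvertedStruct> values,
     OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
   int sum = 0;
   StringBuilder locations = new StringBuilder();
   while (values.hasNext()) {
     // advance the iterator only once per loop iteration
     InvertedStruct value = values.next();
     sum += value.getCount();
     locations.append(value.getString());
   }
   output.collect(key, new Text(sum + locations.toString()));
 }

Separately, the write()/readFields() pair may not round-trip the String field
cleanly (writeChars() on write vs. readLine() on read); something like
WritableUtils.writeString()/readString() from org.apache.hadoop.io is a more
usual way to serialize a String in a Writable, e.g.:

 public void write(DataOutput out) throws IOException {
   out.writeInt(count);
   WritableUtils.writeString(out, location);  // length-prefixed UTF-8
 }

 public void readFields(DataInput in) throws IOException {
   count = in.readInt();
   location = WritableUtils.readString(in);
 }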

On Fri, Jan 27, 2012 at 1:23 AM, Rajesh Sai T tsairaj...@gmail.com wrote:

 Hi,

 I'm new to Hadoop. I'm trying to write a custom Writable data type, so that
 the Map class emits my structure as the value for a key and the Reduce class
 works on the list of these structure values. Below is my program; please guide
 me on what needs to be done to overcome this. It passes the Map phase, but
 during the last iteration of the Reduce phase it throws an exception and the
 job terminates.

 import java.io.IOException;
 import java.util.*;
 import java.io.DataOutput;
 import java.io.DataInput;

 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.*;
 import org.apache.hadoop.mapred.*;

 public class InvertedGroupIndex {

   public static class InvertedStruct implements Writable {
     public String location;
     public int count;

     public InvertedStruct(int _count, String _str2) {
       this.count = _count;
       this.location = _str2;
     }

     public InvertedStruct() {
       this(0, null);
     }

     /*public void set (String _str1, String _str2) {
       this.word = _str1;
       this.location = _str2;
     }*/

     public void write(DataOutput out) throws IOException {
       out.writeInt(this.count);
       out.writeChars(this.location);
     }

     public void readFields(DataInput in) throws IOException {
       count = in.readInt();
       location = in.readLine();
     }

     public String toString() {
       return count + ";" + location;
     }

     public int getCount() {
       return count;
     }

     public String getString() {
       return location;
     }
   }

   public static class InvertedMap extends MapReduceBase implements
       Mapper<LongWritable, Text, Text, InvertedStruct> {
     private final static IntWritable count = new IntWritable(1);
     private final static Text word = new Text();

     public void map(LongWritable key, Text val,
         OutputCollector<Text, InvertedStruct> output, Reporter report)
         throws IOException {
       FileSplit filesplit = (FileSplit) report.getInputSplit();
       String fileName = filesplit.getPath().getName();
       //location.set(fileName);
       //InvertedStruct result = new InvertedStruct(1, fileName);
       String line = val.toString();
       StringTokenizer token = new StringTokenizer(line.toLowerCase());
       while (token.hasMoreTokens()) {
         word.set(token.nextToken());
         output.collect(word, new InvertedStruct(1, fileName));
       }
     }
   }

   public static class InvertedReducer extends MapReduceBase implements
       Reducer<Text, InvertedStruct, Text, Text> {
     public void reduce(Text key, Iterator<InvertedStruct> values,
         OutputCollector<Text, Text> output, Reporter reporter)
         throws IOException, NoSuchElementException {
       int sum = 0;
       StringBuilder toReturn = new StringBuilder();
       while (values.hasNext()) {
         sum += values.next().getCount();
         toReturn.append(values.next().getString());
       }
       String s = String.valueOf(sum) + toReturn.toString();
       output.collect(key, new Text(s));
     }
   }

   public static void main(String[] args) throws IOException {
     //JobClient client = new JobClient();
     JobConf conf = new JobConf(InvertedGroupIndex.class);
     conf.setJobName("InvertedGroupIndex");
     conf.setMapperClass(InvertedMap.class);
     //conf.setCombinerClass(InvertedReducer.class);
     conf.setReducerClass(InvertedReducer.class);
     conf.setMapOutputValueClass(InvertedStruct.class);
     conf.setOutputKeyClass(Text.class);
     conf.setOutputValueClass(Text.class);
     FileInputFormat.setInputPaths(conf, new Path(args[0]));
     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
     JobClient.runJob(conf);
   }
 }

 Thanks,
 Sai



jobtracker url(Critical)

2012-01-27 Thread hadoop hive
Hey folks,

I am facing a problem with the JobTracker URL. I added a node to the
cluster and after some time I restarted the cluster. Then I found that the
JobTracker is showing the recently added node under *nodes*, but the rest of
the nodes are not available, not even under *blacklist*.

Does anyone have any idea why this is happening?


Thanks and regards
Vikas Srivastava


Re: jobtracker url(Critical)

2012-01-27 Thread Harsh J
Vikas,

Have you ensured your non-appearing tasktracker services are
started/alive and carry no communication errors in their logs? Did you
perhaps bring up a firewall accidentally, that was not present before?

On Fri, Jan 27, 2012 at 4:47 PM, hadoop hive hadooph...@gmail.com wrote:
 Hey folks,

 I am facing a problem with the JobTracker URL. I added a node to the
 cluster and after some time I restarted the cluster. Then I found that the
 JobTracker is showing the recently added node under *nodes*, but the rest of
 the nodes are not available, not even under *blacklist*.

 Does anyone have any idea why this is happening?


 Thanks and regards
 Vikas Srivastava



-- 
Harsh J
Customer Ops. Engineer, Cloudera


Re: jobtracker url(Critical)

2012-01-27 Thread hadoop hive
Hey Harsh,

but after some time they become available one by one in the jobtracker URL.

Any idea why they show up only slowly like that?

regards
Vikas

On Fri, Jan 27, 2012 at 5:05 PM, Harsh J ha...@cloudera.com wrote:

 Vikas,

 Have you ensured your non-appearing tasktracker services are
 started/alive and carry no communication errors in their logs? Did you
 perhaps bring up a firewall accidentally, that was not present before?

 On Fri, Jan 27, 2012 at 4:47 PM, hadoop hive hadooph...@gmail.com wrote:
  Hey folks,
 
  I am facing a problem with the JobTracker URL. I added a node to the
  cluster and after some time I restarted the cluster. Then I found that the
  JobTracker is showing the recently added node under *nodes*, but the rest
  of the nodes are not available, not even under *blacklist*.
 
  Does anyone have any idea why this is happening?
 
 
  Thanks and regards
  Vikas Srivastava



 --
 Harsh J
 Customer Ops. Engineer, Cloudera



Re: jobtracker url(Critical)

2012-01-27 Thread Edward Capriolo
TaskTrackers sometimes do not clean up their mapred temp directories well;
when that is the case, the TT on startup can spend many minutes deleting
files. I use find to delete files older than a couple of days.

On Friday, January 27, 2012, hadoop hive hadooph...@gmail.com wrote:
 Hey Harsh,

 but after some time they become available one by one in the jobtracker URL.

 Any idea why they show up only slowly like that?

 regards
 Vikas

 On Fri, Jan 27, 2012 at 5:05 PM, Harsh J ha...@cloudera.com wrote:

 Vikas,

 Have you ensured your non-appearing tasktracker services are
 started/alive and carry no communication errors in their logs? Did you
 perhaps bring up a firewall accidentally, that was not present before?

 On Fri, Jan 27, 2012 at 4:47 PM, hadoop hive hadooph...@gmail.com
wrote:
  Hey folks,
 
  I am facing a problem with the JobTracker URL. I added a node to the
  cluster and after some time I restarted the cluster. Then I found that the
  JobTracker is showing the recently added node under *nodes*, but the rest
  of the nodes are not available, not even under *blacklist*.
 
  Does anyone have any idea why this is happening?
 
 
  Thanks and regards
  Vikas Srivastava



 --
 Harsh J
 Customer Ops. Engineer, Cloudera




Re: Too many open files Error

2012-01-27 Thread Mark question
Hi Harsh and Idris ... so the only drawback to increasing the value of
xcievers is the memory issue? In that case I'll set it to a value that
doesn't fill the memory ... maybe something like the snippet below.
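Staying in the 2048/4096 range Harsh recommends below, rather than 1M, would
look roughly like this in hdfs-site.xml (the value here is only an
illustration, not a tested setting):

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>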
Thanks,
Mark

On Thu, Jan 26, 2012 at 10:37 PM, Idris Ali psychid...@gmail.com wrote:

 Hi Mark,

 As Harsh pointed out, it is not a good idea to increase the Xceiver count to
 an arbitrarily high value; I suggested increasing the xceiver count just to
 unblock execution of your program temporarily.

 Thanks,
 -Idris

 On Fri, Jan 27, 2012 at 10:39 AM, Harsh J ha...@cloudera.com wrote:

  You are technically allowing the DN to run 1 million block transfer
  (in/out) threads by doing that. It does not take up resources by
  default, sure, but now it can be abused with requests that make your DN
  run out of memory and crash, because it is not bound to proper limits now.
 
  On Fri, Jan 27, 2012 at 5:49 AM, Mark question markq2...@gmail.com
  wrote:
   Harsh, could you explain briefly why the 1M setting for xceivers is bad?
   The job is working now ...
   About the ulimit -u, it shows 200703, so is that why the connection is
   reset by peer? How come it's working with the xceiver modification?
  
   Thanks,
   Mark
  
  
   On Thu, Jan 26, 2012 at 12:21 PM, Harsh J ha...@cloudera.com wrote:
  
   Agree with Raj V here - Your problem should not be the # of transfer
   threads nor the number of open files given that stacktrace.
  
   And the values you've set for the transfer threads are far beyond
   recommendations of 4k/8k - I would not recommend doing that. Default
   in 1.0.0 is 256, but set it to 2048/4096, which are good values to have
   when noticing increased HDFS load, or when running services like
   HBase.
  
   You should instead focus on why it's this particular job (or even
   particular task, which is important to notice) that fails, and not
   other jobs (or other task attempts).
  
   On Fri, Jan 27, 2012 at 1:10 AM, Raj V rajv...@yahoo.com wrote:
Mark
   
 You have this "Connection reset by peer". Why do you think this
 problem is related to too many open files?
   
Raj
   
   
   
   
From: Mark question markq2...@gmail.com
   To: common-user@hadoop.apache.org
   Sent: Thursday, January 26, 2012 11:10 AM
   Subject: Re: Too many open files Error
   
   Hi again,
   I've tried :
 <property>
   <name>dfs.datanode.max.xcievers</name>
   <value>1048576</value>
 </property>
   but I'm still getting the same error ... how high can I go??
   
   Thanks,
   Mark
   
   
   
   On Thu, Jan 26, 2012 at 9:29 AM, Mark question markq2...@gmail.com
 
   wrote:
   
 Thanks for the reply. I have nothing about dfs.datanode.max.xceivers in
 my hdfs-site.xml, so hopefully this would solve the problem. About
 ulimit -n, I'm running on an NFS cluster, so usually I just start Hadoop
 with a single bin/start-all.sh ... Do you think I can add it by
 bin/Datanode -ulimit n ?
   
Mark
   
   
On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn 
  mapred.le...@gmail.com
   wrote:
   
 You need to set ulimit -n to a bigger value on the datanodes and restart
    the datanodes.
   
Sent from my iPhone
   
On Jan 26, 2012, at 6:06 AM, Idris Ali psychid...@gmail.com
  wrote:
   
 Hi Mark,

  On a lighter note, what is the count of xceivers? The
  dfs.datanode.max.xceivers property in hdfs-site.xml?

 Thanks,
 -idris

 On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel 
 michael_se...@hotmail.com wrote:

 Sorry, going from memory...
 As user Hadoop or mapred or hdfs, what do you see when you do a
 ulimit -a?
 That should give you the number of open files allowed by a single
 user...


 Sent from a remote device. Please excuse any typos...

 Mike Segel

 On Jan 26, 2012, at 5:13 AM, Mark question 
 markq2...@gmail.com
  
wrote:

 Hi guys,

  I get this error from a job trying to process 3 million records.

 java.io.IOException: Bad connect ack with firstBadLink 192.168.1.20:50010
   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903)
   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)

 When I checked the logfile of the datanode-20, I see :

 2012-01-26 03:00:11,827 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.20:50010, storageID=DS-97608578-192.168.1.20-50010-1327575205369, infoPort=50075, ipcPort=50020):DataXceiver
 java.io.IOException: Connection reset by peer
   at sun.nio.ch.FileDispatcher.read0(Native 

Re: Best Linux Operating system used for Hadoop

2012-01-27 Thread Masoud

Hi,

I suggest Fedora; in my opinion it is more powerful than the other
distributions.

I have run Hadoop on it without any problem.

Good luck

On 01/27/2012 06:15 PM, Sujit Dhamale wrote:

Hi All,
I am new to Hadoop.
Can anyone tell me which is the best Linux operating system for
installing and running Hadoop?
These days I am using Ubuntu 11.04; I installed Hadoop on it, but it
crashes a number of times.

Can someone please help me out?


Kind regards
Sujit Dhamale