Re: hadoop knowledge gaining

2011-10-10 Thread Steve Loughran

On 07/10/11 15:25, Jignesh Patel wrote:

Guys,
I am able to deploy the first program, word count, using Hadoop. I am interested in 
exploring more about Hadoop and HBase and don't know the best way to 
grasp both of them.

I have Hadoop in Action, but it covers the older API.


Actually the API covered in the 2nd edition is pretty much the one in 
widest use. The newer API is better, but is only complete in Hadoop 
0.21 and later, which aren't yet in wide use.



I do also have Hbase definitive guide which I have not started exploring.


Think of a problem, get some data, and go through the books. Learning more 
about statistics and data mining is what you really need, more 
than just the Hadoop APIs.


-steve




Re: ways to expand hadoop.tmp.dir capacity?

2011-10-10 Thread Marcos Luis Ortiz Valmaseda
2011/10/9 Harsh J ha...@cloudera.com

 Hello Meng,

 On Wed, Oct 5, 2011 at 11:02 AM, Meng Mao meng...@gmail.com wrote:
  Currently, we've got defined:
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/hadoop/hadoop-metadata/cache/</value>
    </property>
 
  In our experiments with SOLR, the intermediate files are so large that
 they
  tend to blow out disk space and fail (and annoyingly leave behind their
 huge
  failed attempts). We've had issues with it in the past, but we're having
  real problems with SOLR if we can't comfortably get more space out of
  hadoop.tmp.dir somehow.
 
  1) It seems we never set *mapred.system.dir* to anything special, so it's
  defaulting to ${hadoop.tmp.dir}/mapred/system.
  Is this a problem? The docs seem to recommend against it when
 hadoop.tmp.dir
  had ${user.name} in it, which ours doesn't.

 The {mapred.system.dir} is an HDFS location, and you shouldn't really
 need to worry about it much.

  1b) The doc says mapred.system.dir is the in-HDFS path to shared
 MapReduce
  system files. To me, that means there must be 1 single path for
  mapred.system.dir, which sort of forces hadoop.tmp.dir to be 1 path.
  Otherwise, one might imagine that you could specify multiple paths to
 store
  hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct
  interpretation? -- hadoop.tmp.dir could live on multiple paths/disks if
  there were more mapping/lookup between mapred.system.dir and
 hadoop.tmp.dir?

 {hadoop.tmp.dir} is indeed reused for {mapred.system.dir}, although the
 latter is on HDFS, which is confusing; but there should just be one
 mapred.system.dir, yes.

 Also, the config {hadoop.tmp.dir} doesn't support > 1 path. What you
 need here is a proper {mapred.local.dir} configuration.

  2) IIRC, there's a -D switch for supplying config name/value pairs into
  individual jobs. Does such a switch exist? Googling for single letters is
  fruitless. If we had a path on our workers with more space (in our case,
  another hard disk), could we simply pass that path in as hadoop.tmp.dir
 for
  our SOLR jobs? Without incurring any consistency issues on future jobs
 that
  might use the SOLR output on HDFS?

 Only a few parameters of a job are user-configurable. Stuff like
 hadoop.tmp.dir and mapred.local.dir are not override-able by user set
 parameters as they are server side configurations (static).

  Given that the default value is ${hadoop.tmp.dir}/mapred/local, would the
  expanded capacity we're looking for be as easily accomplished as by
 defining
  mapred.local.dir to span multiple disks? Setting aside the issue of temp
  files so big that they could still fill a whole disk.

 1. You can set mapred.local.dir independent of hadoop.tmp.dir
 2. mapred.local.dir can have comma separated values in it, spanning
 multiple disks
 3. Intermediate outputs may spread across these disks but shall not
 consume > 1 disk at a time. So if your largest configured disk is 500
 GB while the total set of them may be 2 TB, then your intermediate
 output size can't really exceed 500 GB, cause only one disk is
 consumed by one task -- the multiple disks are for better I/O
 parallelism between tasks.

 Know that hadoop.tmp.dir is a convenience property, for quickly
 starting up dev clusters and such. For a proper configuration, you
 need to remove dependency on it (almost nothing uses hadoop.tmp.dir on
 the server side, once the right properties are configured - ex:
 dfs.data.dir, dfs.name.dir, fs.checkpoint.dir, mapred.local.dir, etc.)
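 
 For example, a minimal mapred-site.xml sketch along those lines (the /data1 and
 /data2 paths below are only placeholders for your actual disk mounts):
 
   <property>
     <name>mapred.local.dir</name>
     <value>/data1/mapred/local,/data2/mapred/local</value>
   </property>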

 --
 Harsh J


Here is an excellent explanation of how to install Apache Hadoop manually;
Lars explains it very well.

http://blog.lars-francke.de/2011/01/26/setting-up-a-hadoop-cluster-part-1-manual-installation/

Regards

-- 
Marcos Luis Ortíz Valmaseda
 Linux Infrastructure Engineer
 Linux User # 418229
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186


Developing MapReduce

2011-10-10 Thread Mohit Anchlia
I use eclipse. Is this http://wiki.apache.org/hadoop/EclipsePlugIn
still the best way to develop mapreduce programs in hadoop? Just want
to make sure before I go down this path.

Or should I just add hadoop jars in my classpath of eclipse and create
my own MapReduce programs.

Thanks


Re: Developing MapReduce

2011-10-10 Thread Jignesh Patel
When you download Hadoop, there is a related plugin in its dist folder (I don't 
remember the exact name). Go and get it from there. 
On Oct 10, 2011, at 10:34 AM, Mohit Anchlia wrote:

 I use eclipse. Is this http://wiki.apache.org/hadoop/EclipsePlugIn
 still the best way to develop mapreduce programs in hadoop? Just want
 to make sure before I go down this path.
 
 Or should I just add hadoop jars in my classpath of eclipse and create
 my own MapReduce programs.
 
 Thanks



How to iterate over a hdfs folder with hadoop

2011-10-10 Thread Raimon Bosch
Hi,

I'm wondering how I can browse an HDFS folder using the classes
in the org.apache.hadoop.fs package. The operation that I'm looking for is
'hadoop dfs -ls'.

The standard file system equivalent would be:

File f = new File(outputPath);
if(f.isDirectory()){
  String files[] = f.list();
  for(String file : files){
//Do your logic
  }
}

Thanks in advance,
Raimon Bosch.


Re: How to iterate over a hdfs folder with hadoop

2011-10-10 Thread John Conwell
FileStatus[] files = fs.listStatus(new Path(path));

for (FileStatus fileStatus : files) {
  //...do stuff here
}

On Mon, Oct 10, 2011 at 8:03 AM, Raimon Bosch raimon.bo...@gmail.comwrote:

 Hi,

 I'm wondering how can I browse an hdfs folder using the classes
 in org.apache.hadoop.fs package. The operation that I'm looking for is
 'hadoop dfs -ls'

 The standard file system equivalent would be:

 File f = new File(outputPath);
 if(f.isDirectory()){
  String files[] = f.list();
  for(String file : files){
//Do your logic
  }
 }

 Thanks in advance,
 Raimon Bosch.




-- 

Thanks,
John C


Re: hadoop input buffer size

2011-10-10 Thread Uma Maheswara Rao G 72686
I think below can give you more info about it.
http://developer.yahoo.com/blogs/hadoop/posts/2009/08/the_anatomy_of_hadoop_io_pipel/
Nice explanation by Owen here.
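
As Yang explains in the mail quoted below, TextInputFormat's LineReader pulls 
io.file.buffer.size bytes (default 4096) into its buffer on each read. A minimal 
sketch of raising it via the job Configuration (the 64 KB value is only an example):

Configuration conf = new Configuration();
// io.file.buffer.size sets the read buffer used by LineReader and other
// stream readers; the default is 4096 bytes.
conf.setInt("io.file.buffer.size", 64 * 1024);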

Regards,
Uma

- Original Message -
From: Yang Xiaoliang yangxiaoliang2...@gmail.com
Date: Wednesday, October 5, 2011 4:27 pm
Subject: Re: hadoop input buffer size
To: common-user@hadoop.apache.org

 Hi,
 
  Hadoop neither reads one line each time, nor fetches 
  dfs.block.size worth of lines
  into a buffer.
  Actually, for the TextInputFormat, it reads io.file.buffer.size 
  bytes of text
  into a buffer each time;
  this can be seen from the hadoop source file LineReader.java
 
 
 
 2011/10/5 Mark question markq2...@gmail.com
 
  Hello,
 
   Correct me if I'm wrong, but when a program opens n-files at 
 the same time
  to read from, and start reading from each file at a time 1 line 
 at a time.
  Isn't hadoop actually fetching dfs.block.size of lines into a 
 buffer? and
  not actually one line.
 
   If this is correct, I set up my dfs.block.size = 3MB and each 
 line takes
  about 650 bytes only, then I would assume the performance for 
 reading 1-4000
  lines would be the same, but it isn't !  Do you know a way to 
 find #n of
  lines to be read at once?
 
  Thank you,
  Mark
 
 


Custom InputFormat for Multiline Input File Hive/Hadoop

2011-10-10 Thread Mike Sukmanowsky
Hi all,

Sending this to core-u...@hadoop.apache.org and d...@hive.apache.org.

Trying to process Omniture's data log files with Hadoop/Hive. The file
format is tab delimited and while being pretty simple for the most part,
they do allow you to have multiple new lines and tabs within a field that
are escaped by a backslash (\\n and \\t). As a result I've opted to create
my own InputFormat to handle the multiple newlines and convert those tabs to
spaces when Hive is going to try to do a split on the tabs.

I've found a fairly good reference for doing this using the newer
InputFormat API at http://blog.rguha.net/?p=293 but unfortunately my version
of Hive (0.7.0) still uses the old InputFormat API.

I haven't been able to find many tutorials on writing a custom InputFile
using the older API so I'm looking to see if I can get some guidance as to
what may be wrong with the following two classes:

https://gist.github.com/3141e9d27d4e07f5f9ed
https://gist.github.com/79fdab227950a0776616

The SELECT statements within hive currently return nothing and my other
variations returned nothing but NULL values.

This issue is also available on StackOverflow at
http://stackoverflow.com/questions/7692994/custom-inputformat-with-hive.

If there's a resource someone can point me to that'd also be great.

Many thanks in advance,
Mike


Re: How to iterate over a hdfs folder with hadoop

2011-10-10 Thread Uma Maheswara Rao G 72686

Yes, the FileStatus class would be the equivalent of list().
FileStatus has the APIs isDir() and getPath(); both of these can satisfy 
your further usage. :-)

I think one small difference would be that FileStatus will ensure sorted order.
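
For instance, a rough sketch (untested; the path is hypothetical) combining 
listStatus() with those two APIs, mirroring your local File example:

FileSystem fs = FileSystem.get(new Configuration());
for (FileStatus status : fs.listStatus(new Path("/user/hadoop-user/output"))) {
  if (status.isDir()) {
    // recurse into sub-directories, or skip them, as your logic needs
  } else {
    System.out.println(status.getPath().getName());
  }
}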

Regards,
Uma
- Original Message -
From: John Conwell j...@iamjohn.me
Date: Monday, October 10, 2011 8:40 pm
Subject: Re: How to iterate over a hdfs folder with hadoop
To: common-user@hadoop.apache.org

 FileStatus[] files = fs.listStatus(new Path(path));
 
 for (FileStatus fileStatus : files)
 
 {
 
 //...do stuff ehre
 
 }
 
 On Mon, Oct 10, 2011 at 8:03 AM, Raimon Bosch 
 raimon.bo...@gmail.comwrote:
  Hi,
 
  I'm wondering how can I browse an hdfs folder using the classes
  in org.apache.hadoop.fs package. The operation that I'm looking 
 for is
  'hadoop dfs -ls'
 
  The standard file system equivalent would be:
 
  File f = new File(outputPath);
  if(f.isDirectory()){
   String files[] = f.list();
   for(String file : files){
 //Do your logic
   }
  }
 
  Thanks in advance,
  Raimon Bosch.
 
 
 
 
 -- 
 
 Thanks,
 John C
 


Re: How to iterate over a hdfs folder with hadoop

2011-10-10 Thread Raimon Bosch
Thanks John!

Here is the complete solution:


// Requires org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.{FileSystem, FileStatus, Path}
Configuration jc = new Configuration();
List<String> files_in_hdfs = new ArrayList<String>();

FileSystem fs = FileSystem.get(jc);
FileStatus[] file_status = fs.listStatus(new Path(outputPath));
for (FileStatus fileStatus : file_status) {
  // collect just the file names; use getPath() directly if you need full paths
  files_in_hdfs.add(fileStatus.getPath().getName());
}

String[] files = files_in_hdfs.toArray(new String[0]);

2011/10/10 John Conwell j...@iamjohn.me

 FileStatus[] files = fs.listStatus(new Path(path));

 for (FileStatus fileStatus : files)

 {

 //...do stuff ehre

 }

 On Mon, Oct 10, 2011 at 8:03 AM, Raimon Bosch raimon.bo...@gmail.com
 wrote:

  Hi,
 
  I'm wondering how can I browse an hdfs folder using the classes
  in org.apache.hadoop.fs package. The operation that I'm looking for is
  'hadoop dfs -ls'
 
  The standard file system equivalent would be:
 
  File f = new File(outputPath);
  if(f.isDirectory()){
   String files[] = f.list();
   for(String file : files){
 //Do your logic
   }
  }
 
  Thanks in advance,
  Raimon Bosch.
 



 --

 Thanks,
 John C



Re: hdfs directory location

2011-10-10 Thread bejoy . hadoop
Jignesh
   You are creating a dir in hdfs with that command. The dir won't be in your 
local file system but in hdfs. Issue a command like
hadoop fs -ls /user/hadoop-user/citation/
and you can see the dir you created in hdfs.

If you want to create a dir on the local unix fs, use a simple linux command:
mkdir /user/hadoop-user/citation/input


--Original Message--
From: Jignesh Patel
To: common-user@hadoop.apache.org
ReplyTo: common-user@hadoop.apache.org
Subject: hdfs directory location
Sent: Oct 10, 2011 23:45

I am using the following command to create a directory on my Unix (i.e. Mac) system. 

bin/hadoop fs -mkdir /user/hadoop-user/citation/input

While it creates the directory I need, I am struggling to figure out the exact 
location of the folder on my local box.





Regards
Bejoy K S


Re: hdfs directory location

2011-10-10 Thread Jignesh Patel
Bejoy,

If I create a directory on the unix box, then how can I link it with the HDFS directory 
structure?

-Jignesh
On Oct 10, 2011, at 2:59 PM, bejoy.had...@gmail.com wrote:

 Jignesh
   You are creating a dir in hdfs by that command. The dir won't be in 
 your local file system but it hdfs. Issue a command like
 hadoop fs -ls /user/hadoop-user/citation/
 You can see the dir you created in hdfs
 
 If you want to create a die on local unix use a simple linux command
 mkdir /user/hadoop-user/citation/input
 
 
 --Original Message--
 From: Jignesh Patel
 To: common-user@hadoop.apache.org
 ReplyTo: common-user@hadoop.apache.org
 Subject: hdfs directory location
 Sent: Oct 10, 2011 23:45
 
 I am using following command to create a file in Unix(i.e. mac) system. 
 
 bin/hadoop fs -mkdir /user/hadoop-user/citation/input
 
 While it creates the directory I need, I am struggling to figure out exact 
 location of the folder in my local box.
 
 
 
 
 
 Regards
 Bejoy K S



Re: Developing MapReduce

2011-10-10 Thread bejoy . hadoop
Hi Mohit
I'm really not sure how many MapReduce developers use the MapReduce Eclipse 
plugin; AFAIK the majority don't. As Jignesh mentioned, you can get it from the 
Hadoop distribution folder as soon as you unzip it.
If you are on Windows, my suggested approach would be to test run your 
MapReduce code in one of two ways:
- Set up Cygwin on Windows, on top of which you can set up Hadoop and related 
tools. It is a little messy.
- Use a Linux VM image. I'd recommend the Cloudera test VM, as it comes 
pre-configured with the whole Hadoop technology stack. It really shields the 
developer from the hassles of installing the Hadoop tools and getting them up 
and running.

On Linux or Mac you can just add the Hadoop jars to your classpath and run the 
driver class within Eclipse just as you would run any Java class (Hadoop would 
be in standalone mode here); see the sketch below.
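
For example, a bare-bones driver sketch (old 0.20 mapred API; the job name and the 
input/output paths are only placeholders) that can be run from Eclipse in standalone 
mode once the Hadoop jars are on the classpath:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MyDriver.class);
    conf.setJobName("my-job");

    // Identity map/reduce by default; plug in your own Mapper/Reducer classes here.
    // With the default TextInputFormat the identity mapper passes through
    // (file offset, line) records, so the output classes below match that.
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path("input"));    // local dir in standalone mode
    FileOutputFormat.setOutputPath(conf, new Path("output"));  // must not already exist

    JobClient.runJob(conf);
  }
}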

Hope it helps!...


--Original Message--
From: Jignesh Patel
To: common-user@hadoop.apache.org
ReplyTo: common-user@hadoop.apache.org
Subject: Re: Developing MapReduce
Sent: Oct 10, 2011 20:31

When you download the hadoop in its dist(i don't remember  exact name) there is 
a related plugin. Go and get it from there. 
On Oct 10, 2011, at 10:34 AM, Mohit Anchlia wrote:

 I use eclipse. Is this http://wiki.apache.org/hadoop/EclipsePlugIn
 still the best way to develop mapreduce programs in hadoop? Just want
 to make sure before I go down this path.
 
 Or should I just add hadoop jars in my classpath of eclipse and create
 my own MapReduce programs.
 
 Thanks



Regards
Bejoy K S

Re: hdfs directory location

2011-10-10 Thread bejoy . hadoop
Jignesh
        Sorry, I didn't get your query: 'how can I link it with the HDFS 
directory structure?'

You mean putting your unix dir contents into hdfs? If so, use
hadoop fs -copyFromLocal <src> <dest>
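
For example, using the directory names from earlier in this thread (adjust to your 
own paths):
hadoop fs -copyFromLocal /user/hadoop-user/citation/input /user/hadoop-user/citation/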
--Original Message--
From: Jignesh Patel
To: common-user@hadoop.apache.org
To: bejoy.had...@gmail.com
Subject: Re: hdfs directory location
Sent: Oct 11, 2011 01:18

Bejoy,

If I create a directory in unix box then how I can link it with HDFS directory 
structure?

-Jignesh
On Oct 10, 2011, at 2:59 PM, bejoy.had...@gmail.com wrote:

 Jignesh
   You are creating a dir in hdfs by that command. The dir won't be in 
 your local file system but it hdfs. Issue a command like
 hadoop fs -ls /user/hadoop-user/citation/
 You can see the dir you created in hdfs
 
 If you want to create a die on local unix use a simple linux command
 mkdir /user/hadoop-user/citation/input
 
 
 --Original Message--
 From: Jignesh Patel
 To: common-user@hadoop.apache.org
 ReplyTo: common-user@hadoop.apache.org
 Subject: hdfs directory location
 Sent: Oct 10, 2011 23:45
 
 I am using following command to create a file in Unix(i.e. mac) system. 
 
 bin/hadoop fs -mkdir /user/hadoop-user/citation/input
 
 While it creates the directory I need, I am struggling to figure out exact 
 location of the folder in my local box.
 
 
 
 
 
 Regards
 Bejoy K S



Regards
Bejoy K S

Re: hdfs directory location

2011-10-10 Thread Jignesh Patel
Bejoy,

copyToLocal makes sense, it worked. But I am still wondering: if HDFS has a 
directory created on my local box, it should exist physically somewhere, but I 
couldn't manage to locate it.

Is the HDFS directory structure a virtual structure that doesn't exist physically?

-Jignesh
On Oct 10, 2011, at 3:53 PM, bejoy.had...@gmail.com wrote:

 Jignesh
Sorry I didn't get your query, 'how I can link it with HDFS 
 directory structure?
 ' 
 
 You mean putting your unix dir contents into hdfs? If so use hadoop fs 
 -copyFromLocal src destn 
 --Original Message--
 From: Jignesh Patel
 To: common-user@hadoop.apache.org
 To: bejoy.had...@gmail.com
 Subject: Re: hdfs directory location
 Sent: Oct 11, 2011 01:18
 
 Bejoy,
 
 If I create a directory in unix box then how I can link it with HDFS 
 directory structure?
 
 -Jignesh
 On Oct 10, 2011, at 2:59 PM, bejoy.had...@gmail.com wrote:
 
 Jignesh
  You are creating a dir in hdfs by that command. The dir won't be in 
 your local file system but it hdfs. Issue a command like
 hadoop fs -ls /user/hadoop-user/citation/
 You can see the dir you created in hdfs
 
 If you want to create a die on local unix use a simple linux command
 mkdir /user/hadoop-user/citation/input
 
 
 --Original Message--
 From: Jignesh Patel
 To: common-user@hadoop.apache.org
 ReplyTo: common-user@hadoop.apache.org
 Subject: hdfs directory location
 Sent: Oct 10, 2011 23:45
 
 I am using following command to create a file in Unix(i.e. mac) system. 
 
 bin/hadoop fs -mkdir /user/hadoop-user/citation/input
 
 While it creates the directory I need, I am struggling to figure out exact 
 location of the folder in my local box.
 
 
 
 
 
 Regards
 Bejoy K S
 
 
 
 Regards
 Bejoy K S



Re: hdfs directory location

2011-10-10 Thread bejoy . hadoop
Jignesh
 You are absolutely right. In hdfs a directory doesn't exist physically; it is 
just metadata on the namenode. I don't think such a dir structure would be there 
on the namenode's local fs either, as it is just metadata, and hence no physical dir 
structure is created.

Regards
Bejoy K S

-Original Message-
From: Jignesh Patel jign...@websoft.com
Date: Mon, 10 Oct 2011 16:02:53 
To: bejoy.had...@gmail.com
Cc: common-user@hadoop.apache.org
Subject: Re: hdfs directory location

Bejoy,

copyToLocal makes sense, it worked. But I am still wondering if HDFS has a 
directory created on local box, somewhere it exist physically but couldn't able 
to locate it.

Is HDFS directory structure is a virtual structure, doesn't exist physically?

-Jignesh
On Oct 10, 2011, at 3:53 PM, bejoy.had...@gmail.com wrote:

 Jignesh
Sorry I didn't get your query, 'how I can link it with HDFS 
 directory structure?
 ' 
 
 You mean putting your unix dir contents into hdfs? If so use hadoop fs 
 -copyFromLocal src destn 
 --Original Message--
 From: Jignesh Patel
 To: common-user@hadoop.apache.org
 To: bejoy.had...@gmail.com
 Subject: Re: hdfs directory location
 Sent: Oct 11, 2011 01:18
 
 Bejoy,
 
 If I create a directory in unix box then how I can link it with HDFS 
 directory structure?
 
 -Jignesh
 On Oct 10, 2011, at 2:59 PM, bejoy.had...@gmail.com wrote:
 
 Jignesh
  You are creating a dir in hdfs by that command. The dir won't be in 
 your local file system but it hdfs. Issue a command like
 hadoop fs -ls /user/hadoop-user/citation/
 You can see the dir you created in hdfs
 
 If you want to create a die on local unix use a simple linux command
 mkdir /user/hadoop-user/citation/input
 
 
 --Original Message--
 From: Jignesh Patel
 To: common-user@hadoop.apache.org
 ReplyTo: common-user@hadoop.apache.org
 Subject: hdfs directory location
 Sent: Oct 10, 2011 23:45
 
 I am using following command to create a file in Unix(i.e. mac) system. 
 
 bin/hadoop fs -mkdir /user/hadoop-user/citation/input
 
 While it creates the directory I need, I am struggling to figure out exact 
 location of the folder in my local box.
 
 
 
 
 
 Regards
 Bejoy K S
 
 
 
 Regards
 Bejoy K S



Re: hdfs directory location

2011-10-10 Thread Arko Provo Mukherjee
Hi,

I guess what you want is to see your HDFS directory through normal
file system commands like ls, or by browsing your directory structure.

This is not possible, as none of your commands, nor Finder (on Mac), have the
ability to read / write HDFS, so they don't have the capability to show HDFS
directories.

Hence, the HDFS directory structure must be viewed using the HDFS tools and
not the operating system FS commands.

Hope this helps!

Warm regards
Arko

On Mon, Oct 10, 2011 at 3:08 PM, bejoy.had...@gmail.com wrote:

 Jignesh
  You are absolutely right. In hdfs directory doesn't exist physically. It
 is just meta data on name node. I don't think such a dir structure would be
 there in name node lfs as well as it just meta data and hence no physical
 dir structure is  created.

 Regards
 Bejoy K S

 -Original Message-
 From: Jignesh Patel jign...@websoft.com
 Date: Mon, 10 Oct 2011 16:02:53
 To: bejoy.had...@gmail.com
 Cc: common-user@hadoop.apache.org
 Subject: Re: hdfs directory location

 Bejoy,

 copyToLocal makes sense, it worked. But I am still wondering if HDFS has a
 directory created on local box, somewhere it exist physically but couldn't
 able to locate it.

 Is HDFS directory structure is a virtual structure, doesn't exist
 physically?

 -Jignesh
 On Oct 10, 2011, at 3:53 PM, bejoy.had...@gmail.com wrote:

  Jignesh
 Sorry I didn't get your query, 'how I can link it with HDFS
  directory structure?
  '
 
  You mean putting your unix dir contents into hdfs? If so use hadoop fs
 -copyFromLocal src destn
  --Original Message--
  From: Jignesh Patel
  To: common-user@hadoop.apache.org
  To: bejoy.had...@gmail.com
  Subject: Re: hdfs directory location
  Sent: Oct 11, 2011 01:18
 
  Bejoy,
 
  If I create a directory in unix box then how I can link it with HDFS
 directory structure?
 
  -Jignesh
  On Oct 10, 2011, at 2:59 PM, bejoy.had...@gmail.com wrote:
 
  Jignesh
   You are creating a dir in hdfs by that command. The dir won't be in
 your local file system but it hdfs. Issue a command like
  hadoop fs -ls /user/hadoop-user/citation/
  You can see the dir you created in hdfs
 
  If you want to create a die on local unix use a simple linux command
  mkdir /user/hadoop-user/citation/input
 
 
  --Original Message--
  From: Jignesh Patel
  To: common-user@hadoop.apache.org
  ReplyTo: common-user@hadoop.apache.org
  Subject: hdfs directory location
  Sent: Oct 10, 2011 23:45
 
  I am using following command to create a file in Unix(i.e. mac) system.
 
  bin/hadoop fs -mkdir /user/hadoop-user/citation/input
 
  While it creates the directory I need, I am struggling to figure out
 exact location of the folder in my local box.
 
 
 
 
 
  Regards
  Bejoy K S
 
 
 
  Regards
  Bejoy K S




ssh setup stop working

2011-10-10 Thread Jignesh Patel
I have created a private key setup on my local box, and until this weekend everything 
was working great. 


But today when I tried jps I found that none of the services were running, and when I 
tried to do ssh localhost it started asking for a password.

When I tried ssh-keygen -t rsa, the message appeared:
/Users/hadoop-user/.ssh/id_rsa already exists

What went wrong? Do I need to recreate the key?

-Jignesh

Re: ssh setup stop working

2011-10-10 Thread Jignesh Patel
Nope, it still works. I have a Mac system.
On Oct 10, 2011, at 4:40 PM, Ilker Ozkaymak wrote:

 Has your user account's password been expired??
 
 Best regards,
 IO
 
 On Mon, Oct 10, 2011 at 3:35 PM, Jignesh Patel jign...@websoft.com wrote:
 
 I have created private key setup on local box and till this week end
 everything was working great.
 
 
 But when today I tried JPS I found none of the service works as well as
 when I tried to do ssh localhost it started asking for password.
 
 when I tried ssh-keygen -t rsa the message appeared
 /Users/hadoop-user/.ssh/id_rsa already exists
 
 What went wrong? Do  I need to recreate the key?
 
 -Jignesh



Re: Secondary namenode fsimage concept

2011-10-10 Thread Shouguo Li
hey Patrick,

I wanted to configure my cluster to write namenode metadata to multiple
directories as well:
  <property>
    <name>dfs.name.dir</name>
    <value>/hadoop/var/name,/mnt/hadoop/var/name</value>
  </property>

In my case, /hadoop/var/name is a local directory and /mnt/hadoop/var/name is an NFS
volume. I took down the cluster first, then copied over the files from
/hadoop/var/name to /mnt/hadoop/var/name, and then tried to start up the
cluster. But the cluster won't start up properly...
here's the namenode log: http://pastebin.com/gmu0B7yd

any ideas why it wouldn't start up?
thx


On Thu, Oct 6, 2011 at 6:58 PM, patrick sang silvianhad...@gmail.comwrote:

 I would say have your namenode write metadata to the local fs (where your secondary
 namenode will pull files from), and to an NFS mount.

   <property>
     <name>dfs.name.dir</name>
     <value>/hadoop/name,/hadoop/nfs_server_name</value>
   </property>


 my 0.02$
 P

 On Thu, Oct 6, 2011 at 12:04 AM, shanmuganathan.r 
 shanmuganatha...@zohocorp.com wrote:

  Hi Kai,
 
   There is no datas stored  in the secondarynamenode related to the
  Hadoop cluster . Am I correct?
  If it correct means If we run the secondaryname node in separate machine
  then fetching , merging and transferring time is increased if the cluster
  has large data in the namenode fsimage file . At the time if fail over
  occurs , then how can we recover the nearly one hour changes in the HDFS
  file ? (default check point time is one hour)
 
  Thanks R.Shanmuganathan
 
 
 
 
 
 
   On Thu, 06 Oct 2011 12:20:28 +0530 Kai Voigt <k...@123.org> wrote 
  
 
 
  Hi,
 
  the secondary namenode only fetches the two files when a checkpointing is
  needed.
 
  Kai
 
  On 06.10.2011 at 08:45, shanmuganathan.r wrote:

  > Hi Kai,
  >
  > In the Second part I meant
  >
  >
  > Is the secondary namenode also contain the FSImage file or the two
  > files (FSImage and EdiltLog) are transferred from the namenode at the
  > checkpoint time.
  >
  >
  > Thanks
  > Shanmuganathan
  >
  >
  >
  >  On Thu, 06 Oct 2011 11:37:50 +0530 Kai Voigt <k...@123.org> wrote 
  >
  >
  > Hi,
  >
  > you're correct when saying the namenode hosts the fsimage file and the
  > edits log file.
  >
  > The fsimage file contains a snapshot of the HDFS metadata (a filename
  > to blocks list mapping). Whenever there is a change to HDFS, it will be
  > appended to the edits file. Think of it as a database transaction log, where
  > changes will not be applied to the datafile, but appended to a log.
  >
  > To prevent the edits file growing infinitely, the secondary namenode
  > periodically pulls these two files, and the namenode starts writing changes
  > to a new edits file. Then, the secondary namenode merges the changes from
  > the edits file with the old snapshot from the fsimage file and creates an
  > updated fsimage file. This updated fsimage file is then copied to the
  > namenode.
  >
  > Then, the entire cycle starts again. To answer your question: The
  > namenode has both files, even if the secondary namenode is running on a
  > different machine.
  >
  > Kai
  >
  > On 06.10.2011 at 07:57, shanmuganathan.r wrote:
  >
  > >
  > > Hi All,
  > >
  > > I have a doubt in hadoop secondary namenode concept. Please
  > > correct if the following statements are wrong.
  > >
  > >
  > > The namenode hosts the fsimage and edit log files. The
  > > secondary namenode hosts the fsimage file only. At the time of checkpoint
  > > the edit log file is transferred to the secondary namenode and the both
  > > files are merged, Then the updated fsimage file is transferred to the
  > > namenode. Is it correct?
  > >
  > >
  > > If we run the secondary namenode in separate machine, then
  > > both machines contain the fsimage file. Namenode only contains the editlog
  > > file. Is it true?
  > >
  > >
  > > Thanks R.Shanmuganathan
  > >
  > >
  >
  > --
  > Kai Voigt
  > k...@123.org
  >
  >
 
  --
  Kai Voigt
  k...@123.org
 
 
 
 
 
 
 



Re: ssh setup stop working

2011-10-10 Thread Jignesh Patel
In fact I have created a passphraseless key again and it still asks me for a 
password.
On Oct 10, 2011, at 4:51 PM, Jignesh Patel wrote:

 nope they works. I have a mac system
 On Oct 10, 2011, at 4:40 PM, Ilker Ozkaymak wrote:
 
 Has your user account's password been expired??
 
 Best regards,
 IO
 
 On Mon, Oct 10, 2011 at 3:35 PM, Jignesh Patel jign...@websoft.com wrote:
 
 I have created private key setup on local box and till this week end
 everything was working great.
 
 
 But when today I tried JPS I found none of the service works as well as
 when I tried to do ssh localhost it started asking for password.
 
 when I tried ssh-keygen -t rsa the message appeared
 /Users/hadoop-user/.ssh/id_rsa already exists
 
 What went wrong? Do  I need to recreate the key?
 
 -Jignesh
 



Re: ssh setup stop working

2011-10-10 Thread Ilker Ozkaymak
The key requires specific permissions: 700 for the .ssh directory and 600 for the
authorized_keys file; with anything more permissive it won't work.
However, you said it worked before. I usually experience this problem when the
password ages; the key also doesn't work until the password is reset.

Anyhow, your case might be a little different.
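
For example (assuming the home directory from your earlier mail):
chmod 700 /Users/hadoop-user/.ssh
chmod 600 /Users/hadoop-user/.ssh/authorized_keys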

Best regards,

On Mon, Oct 10, 2011 at 4:10 PM, Jignesh Patel jign...@websoft.com wrote:

 Infect I have created passphraseless key again and still it asks me for
 password.
 On Oct 10, 2011, at 4:51 PM, Jignesh Patel wrote:

  nope they works. I have a mac system
  On Oct 10, 2011, at 4:40 PM, Ilker Ozkaymak wrote:
 
  Has your user account's password been expired??
 
  Best regards,
  IO
 
  On Mon, Oct 10, 2011 at 3:35 PM, Jignesh Patel jign...@websoft.com
 wrote:
 
  I have created private key setup on local box and till this week end
  everything was working great.
 
 
  But when today I tried JPS I found none of the service works as well as
  when I tried to do ssh localhost it started asking for password.
 
  when I tried ssh-keygen -t rsa the message appeared
  /Users/hadoop-user/.ssh/id_rsa already exists
 
  What went wrong? Do  I need to recreate the key?
 
  -Jignesh
 




Subscribe to list

2011-10-10 Thread Joan Figuerola hurtado
Hi,
 I want to know your improvement subscribing to this list.

Many thanks :)


problem in running program

2011-10-10 Thread Jignesh Patel

I'm trying to run the attached program. My input path is the file 
/user/hadoop-user/input/cite65_77.txt.

But it doesn't do anything. It doesn't read the file and doesn't create the output 
directory.





Re: ssh setup stop working

2011-10-10 Thread Jignesh Patel
You are right, I had a problem with the access rights. Now it works.
On Oct 10, 2011, at 5:36 PM, Ilker Ozkaymak wrote:

 Key requires a specific permissions for .ssh directory 700 and
 authorized_keys file 600 anything more it won't work.
 However you said it worked before, I usually experience problem when
 password ages the key also doesn't work until the password is reset.
 
 Anyhow it might be little different.
 
 Best regards,
 
 On Mon, Oct 10, 2011 at 4:10 PM, Jignesh Patel jign...@websoft.com wrote:
 
 Infect I have created passphraseless key again and still it asks me for
 password.
 On Oct 10, 2011, at 4:51 PM, Jignesh Patel wrote:
 
 nope they works. I have a mac system
 On Oct 10, 2011, at 4:40 PM, Ilker Ozkaymak wrote:
 
 Has your user account's password been expired??
 
 Best regards,
 IO
 
 On Mon, Oct 10, 2011 at 3:35 PM, Jignesh Patel jign...@websoft.com
 wrote:
 
 I have created private key setup on local box and till this week end
 everything was working great.
 
 
 But when today I tried JPS I found none of the service works as well as
 when I tried to do ssh localhost it started asking for password.
 
 when I tried ssh-keygen -t rsa the message appeared
 /Users/hadoop-user/.ssh/id_rsa already exists
 
 What went wrong? Do  I need to recreate the key?
 
 -Jignesh
 
 
 



Re: ways to expand hadoop.tmp.dir capacity?

2011-10-10 Thread Meng Mao
So the only way we can expand to multiple mapred.local.dir paths is to
config our site.xml and to restart the DataNode?

On Mon, Oct 10, 2011 at 9:36 AM, Marcos Luis Ortiz Valmaseda 
marcosluis2...@googlemail.com wrote:

 2011/10/9 Harsh J ha...@cloudera.com

  Hello Meng,
 
  On Wed, Oct 5, 2011 at 11:02 AM, Meng Mao meng...@gmail.com wrote:
   Currently, we've got defined:
     <property>
       <name>hadoop.tmp.dir</name>
       <value>/hadoop/hadoop-metadata/cache/</value>
     </property>
  
   In our experiments with SOLR, the intermediate files are so large that
  they
   tend to blow out disk space and fail (and annoyingly leave behind their
  huge
   failed attempts). We've had issues with it in the past, but we're
 having
   real problems with SOLR if we can't comfortably get more space out of
   hadoop.tmp.dir somehow.
  
   1) It seems we never set *mapred.system.dir* to anything special, so
 it's
   defaulting to ${hadoop.tmp.dir}/mapred/system.
   Is this a problem? The docs seem to recommend against it when
  hadoop.tmp.dir
   had ${user.name} in it, which ours doesn't.
 
  The {mapred.system.dir} is a HDFS location, and you shouldn't really
  be worried about it as much.
 
   1b) The doc says mapred.system.dir is the in-HDFS path to shared
  MapReduce
   system files. To me, that means there's must be 1 single path for
   mapred.system.dir, which sort of forces hadoop.tmp.dir to be 1 path.
   Otherwise, one might imagine that you could specify multiple paths to
  store
   hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct
   interpretation? -- hadoop.tmp.dir could live on multiple paths/disks if
   there were more mapping/lookup between mapred.system.dir and
  hadoop.tmp.dir?
 
  {hadoop.tmp.dir} is indeed reused for {mapred.system.dir}, although it
  is on HDFS, and hence is confusing, but there should just be one
  mapred.system.dir, yes.
 
  Also, the config {hadoop.tmp.dir} doesn't support > 1 path. What you
  need here is a proper {mapred.local.dir} configuration.
 
   2) IIRC, there's a -D switch for supplying config name/value pairs into
   indivdiual jobs. Does such a switch exist? Googling for single letters
 is
   fruitless. If we had a path on our workers with more space (in our
 case,
   another hard disk), could we simply pass that path in as hadoop.tmp.dir
  for
   our SOLR jobs? Without incurring any consistency issues on future jobs
  that
   might use the SOLR output on HDFS?
 
  Only a few parameters of a job are user-configurable. Stuff like
  hadoop.tmp.dir and mapred.local.dir are not override-able by user set
  parameters as they are server side configurations (static).
 
   Given that the default value is ${hadoop.tmp.dir}/mapred/local, would
 the
   expanded capacity we're looking for be as easily accomplished as by
  defining
   mapred.local.dir to span multiple disks? Setting aside the issue of
 temp
   files so big that they could still fill a whole disk.
 
  1. You can set mapred.local.dir independent of hadoop.tmp.dir
  2. mapred.local.dir can have comma separated values in it, spanning
  multiple disks
  3. Intermediate outputs may spread across these disks but shall not
  consume > 1 disk at a time. So if your largest configured disk is 500
  GB while the total set of them may be 2 TB, then your intermediate
  output size can't really exceed 500 GB, cause only one disk is
  consumed by one task -- the multiple disks are for better I/O
  parallelism between tasks.
 
  Know that hadoop.tmp.dir is a convenience property, for quickly
  starting up dev clusters and such. For a proper configuration, you
  need to remove dependency on it (almost nothing uses hadoop.tmp.dir on
  the server side, once the right properties are configured - ex:
  dfs.data.dir, dfs.name.dir, fs.checkpoint.dir, mapred.local.dir, etc.)
 
  --
  Harsh J
 

 Here it's a excellent explanation how to install Apache Hadoop manually,
 and
 Lars explains this very good.


 http://blog.lars-francke.de/2011/01/26/setting-up-a-hadoop-cluster-part-1-manual-installation/

 Regards

 --
 Marcos Luis Ortíz Valmaseda
  Linux Infrastructure Engineer
  Linux User # 418229
  http://marcosluis2186.posterous.com
  http://www.linkedin.com/in/marcosluis2186
  Twitter: @marcosluis2186



Re: hdfs directory location

2011-10-10 Thread Harsh J
Jignesh,

Can be done. Use the fuse-dfs feature of HDFS to have your DFS as a
'physical' mount point on Linux. Instructions may be found here:
http://wiki.apache.org/hadoop/MountableHDFS and on other resources
across the web (search around for fuse hdfs).

On Tue, Oct 11, 2011 at 1:32 AM, Jignesh Patel jign...@websoft.com wrote:
 Bejoy,

 copyToLocal makes sense, it worked. But I am still wondering if HDFS has a 
 directory created on local box, somewhere it exist physically but couldn't 
 able to locate it.

 Is HDFS directory structure is a virtual structure, doesn't exist physically?

 -Jignesh
 On Oct 10, 2011, at 3:53 PM, bejoy.had...@gmail.com wrote:

 Jignesh
        Sorry I didn't get your query, 'how I can link it with HDFS
 directory structure?
 '

 You mean putting your unix dir contents into hdfs? If so use hadoop fs 
 -copyFromLocal src destn
 --Original Message--
 From: Jignesh Patel
 To: common-user@hadoop.apache.org
 To: bejoy.had...@gmail.com
 Subject: Re: hdfs directory location
 Sent: Oct 11, 2011 01:18

 Bejoy,

 If I create a directory in unix box then how I can link it with HDFS 
 directory structure?

 -Jignesh
 On Oct 10, 2011, at 2:59 PM, bejoy.had...@gmail.com wrote:

 Jignesh
      You are creating a dir in hdfs by that command. The dir won't be in 
 your local file system but it hdfs. Issue a command like
 hadoop fs -ls /user/hadoop-user/citation/
 You can see the dir you created in hdfs

 If you want to create a die on local unix use a simple linux command
 mkdir /user/hadoop-user/citation/input


 --Original Message--
 From: Jignesh Patel
 To: common-user@hadoop.apache.org
 ReplyTo: common-user@hadoop.apache.org
 Subject: hdfs directory location
 Sent: Oct 10, 2011 23:45

 I am using following command to create a file in Unix(i.e. mac) system.

 bin/hadoop fs -mkdir /user/hadoop-user/citation/input

 While it creates the directory I need, I am struggling to figure out exact 
 location of the folder in my local box.





 Regards
 Bejoy K S



 Regards
 Bejoy K S





-- 
Harsh J


Re: problem in running program

2011-10-10 Thread Harsh J
Jignesh,

Please do not attach files to the mailing list. They are stripped away
and the community will never receive them. Instead, if its small
enough, paste it along in the mail, or paste it at services like
pastebin.com and pass along the public links.

On Tue, Oct 11, 2011 at 3:35 AM, Jignesh Patel jign...@websoft.com wrote:

 I m trying to run attached program. My input directory structure  is 
 /user/hadoop-user/input/cite65_77.txt file.

 But it doesn't do anything. It doesn't read the file and not creates output 
 directory.









-- 
Harsh J


Re: ways to expand hadoop.tmp.dir capacity?

2011-10-10 Thread Harsh J
Meng,

Yes, configure the mapred-site.xml (mapred.local.dir) to add the
property and roll-restart your TaskTrackers. If you'd like to expand
your DataNode to multiple disks as well (helps HDFS I/O greatly), do
the same with hdfs-site.xml (dfs.data.dir) and perform the same
rolling restart of DataNodes.

Ensure that for each service, the directories you create are owned by
the same user as the one running the process. This will help avoid
permission nightmares.

On Tue, Oct 11, 2011 at 3:58 AM, Meng Mao meng...@gmail.com wrote:
 So the only way we can expand to multiple mapred.local.dir paths is to
 config our site.xml and to restart the DataNode?

 On Mon, Oct 10, 2011 at 9:36 AM, Marcos Luis Ortiz Valmaseda 
 marcosluis2...@googlemail.com wrote:

 2011/10/9 Harsh J ha...@cloudera.com

  Hello Meng,
 
  On Wed, Oct 5, 2011 at 11:02 AM, Meng Mao meng...@gmail.com wrote:
   Currently, we've got defined:
    <property>
       <name>hadoop.tmp.dir</name>
       <value>/hadoop/hadoop-metadata/cache/</value>
    </property>
  
   In our experiments with SOLR, the intermediate files are so large that
  they
   tend to blow out disk space and fail (and annoyingly leave behind their
  huge
   failed attempts). We've had issues with it in the past, but we're
 having
   real problems with SOLR if we can't comfortably get more space out of
   hadoop.tmp.dir somehow.
  
   1) It seems we never set *mapred.system.dir* to anything special, so
 it's
   defaulting to ${hadoop.tmp.dir}/mapred/system.
   Is this a problem? The docs seem to recommend against it when
  hadoop.tmp.dir
   had ${user.name} in it, which ours doesn't.
 
  The {mapred.system.dir} is a HDFS location, and you shouldn't really
  be worried about it as much.
 
   1b) The doc says mapred.system.dir is the in-HDFS path to shared
  MapReduce
   system files. To me, that means there's must be 1 single path for
   mapred.system.dir, which sort of forces hadoop.tmp.dir to be 1 path.
   Otherwise, one might imagine that you could specify multiple paths to
  store
   hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct
   interpretation? -- hadoop.tmp.dir could live on multiple paths/disks if
   there were more mapping/lookup between mapred.system.dir and
  hadoop.tmp.dir?
 
  {hadoop.tmp.dir} is indeed reused for {mapred.system.dir}, although it
  is on HDFS, and hence is confusing, but there should just be one
  mapred.system.dir, yes.
 
   Also, the config {hadoop.tmp.dir} doesn't support > 1 path. What you
  need here is a proper {mapred.local.dir} configuration.
 
   2) IIRC, there's a -D switch for supplying config name/value pairs into
   indivdiual jobs. Does such a switch exist? Googling for single letters
 is
   fruitless. If we had a path on our workers with more space (in our
 case,
   another hard disk), could we simply pass that path in as hadoop.tmp.dir
  for
   our SOLR jobs? Without incurring any consistency issues on future jobs
  that
   might use the SOLR output on HDFS?
 
  Only a few parameters of a job are user-configurable. Stuff like
  hadoop.tmp.dir and mapred.local.dir are not override-able by user set
  parameters as they are server side configurations (static).
 
   Given that the default value is ${hadoop.tmp.dir}/mapred/local, would
 the
   expanded capacity we're looking for be as easily accomplished as by
  defining
   mapred.local.dir to span multiple disks? Setting aside the issue of
 temp
   files so big that they could still fill a whole disk.
 
  1. You can set mapred.local.dir independent of hadoop.tmp.dir
  2. mapred.local.dir can have comma separated values in it, spanning
  multiple disks
  3. Intermediate outputs may spread across these disks but shall not
   consume > 1 disk at a time. So if your largest configured disk is 500
  GB while the total set of them may be 2 TB, then your intermediate
  output size can't really exceed 500 GB, cause only one disk is
  consumed by one task -- the multiple disks are for better I/O
  parallelism between tasks.
 
  Know that hadoop.tmp.dir is a convenience property, for quickly
  starting up dev clusters and such. For a proper configuration, you
  need to remove dependency on it (almost nothing uses hadoop.tmp.dir on
  the server side, once the right properties are configured - ex:
  dfs.data.dir, dfs.name.dir, fs.checkpoint.dir, mapred.local.dir, etc.)
 
  --
  Harsh J
 

 Here it's a excellent explanation how to install Apache Hadoop manually,
 and
 Lars explains this very good.


 http://blog.lars-francke.de/2011/01/26/setting-up-a-hadoop-cluster-part-1-manual-installation/

 Regards

 --
 Marcos Luis Ortíz Valmaseda
  Linux Infrastructure Engineer
  Linux User # 418229
  http://marcosluis2186.posterous.com
  http://www.linkedin.com/in/marcosluis2186
  Twitter: @marcosluis2186





-- 
Harsh J


Re: Secondary namenode fsimage concept

2011-10-10 Thread Uma Maheswara Rao G 72686
Hi,

It looks to me that the problem is with your NFS; it is not supporting locks. Which 
version of NFS are you using? 
Please check your NFS locking support by writing a simple program for file 
locking.

I think NFSv4 supports locking (I have not tried it).

http://nfs.sourceforge.net/

  A6. What are the main new features in version 4 of the NFS protocol?
  *NFS Versions 2 and 3 are stateless protocols, but NFS Version 4 introduces 
state. An NFS Version 4 client uses state to notify an NFS Version 4 server of 
its intentions on a file: locking, reading, writing, and so on. An NFS Version 
4 server can return information to a client about what other clients have 
intentions on a file to allow a client to cache file data more aggressively via 
delegation. To help keep state consistent, more sophisticated client and server 
reboot recovery mechanisms are built in to the NFS Version 4 protocol.
 *NFS Version 4 introduces support for byte-range locking and share 
reservation. Locking in NFS Version 4 is lease-based, so an NFS Version 4 
client must maintain contact with an NFS Version 4 server to continue extending 
its open and lock leases. 
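
For instance, a minimal sketch of such a lock test (the /mnt/hadoop path below is only 
an example; point it at a file on your NFS mount):

import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

public class NfsLockTest {
  public static void main(String[] args) throws Exception {
    RandomAccessFile raf = new RandomAccessFile("/mnt/hadoop/lock-test", "rw");
    // On a mount without working lock support this typically fails:
    // tryLock() throws an IOException or returns null.
    FileLock lock = raf.getChannel().tryLock();
    System.out.println(lock != null ? "lock acquired" : "lock NOT acquired");
    if (lock != null) {
      lock.release();
    }
    raf.close();
  }
}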


Regards,
Uma
- Original Message -
From: Shouguo Li the1plum...@gmail.com
Date: Tuesday, October 11, 2011 2:31 am
Subject: Re: Secondary namenode fsimage concept
To: common-user@hadoop.apache.org

 hey parick
 
 i wanted to configure my cluster to write namenode metadata to 
 multipledirectories as well:
    <property>
      <name>dfs.name.dir</name>
      <value>/hadoop/var/name,/mnt/hadoop/var/name</value>
    </property>
 
 in my case, /hadoop/var/name is local directory, 
 /mnt/hadoop/var/name is NFS
 volume. i took down the cluster first, then copied over files from
 /hadoop/var/name to /mnt/hadoop/var/name, and then tried to start 
 up the
 cluster. but the cluster won't start up properly...
 here's the namenode log: http://pastebin.com/gmu0B7yd
 
 any ideas why it wouldn't start up?
 thx
 
 
 On Thu, Oct 6, 2011 at 6:58 PM, patrick sang 
 silvianhad...@gmail.comwrote:
  I would say your namenode write metadata in local fs (where your 
 secondary namenode will pull files), and NFS mount.
 
    <property>
      <name>dfs.name.dir</name>
      <value>/hadoop/name,/hadoop/nfs_server_name</value>
    </property>
 
 
  my 0.02$
  P
 
  On Thu, Oct 6, 2011 at 12:04 AM, shanmuganathan.r 
  shanmuganatha...@zohocorp.com wrote:
 
   Hi Kai,
  
There is no datas stored  in the secondarynamenode related 
 to the
   Hadoop cluster . Am I correct?
   If it correct means If we run the secondaryname node in 
 separate machine
   then fetching , merging and transferring time is increased if 
 the cluster
   has large data in the namenode fsimage file . At the time if 
 fail over
   occurs , then how can we recover the nearly one hour changes in 
 the HDFS
   file ? (default check point time is one hour)
  
   Thanks R.Shanmuganathan
  
  
  
  
  
  
    On Thu, 06 Oct 2011 12:20:28 +0530 Kai Voigt <k...@123.org> wrote  
  
  
   Hi,
  
   the secondary namenode only fetches the two files when a 
 checkpointing is
   needed.
  
   Kai
  
   On 06.10.2011 at 08:45, shanmuganathan.r wrote:
  
   > Hi Kai,
   >
   > In the Second part I meant
   >
   >
   > Is the secondary namenode also contain the FSImage file or the two
   > files (FSImage and EdiltLog) are transferred from the namenode at the
   > checkpoint time.
   >
   >
   > Thanks
   > Shanmuganathan
   >
   >
   >
   >  On Thu, 06 Oct 2011 11:37:50 +0530 Kai Voigt <k...@123.org> wrote 
   >
   >
   > Hi,
   >
   > you're correct when saying the namenode hosts the fsimage file and the
   > edits log file.
   >
   > The fsimage file contains a snapshot of the HDFS metadata (a filename
   > to blocks list mapping). Whenever there is a change to HDFS, it will be
   > appended to the edits file. Think of it as a database transaction log, where
   > changes will not be applied to the datafile, but appended to a log.
   >
   > To prevent the edits file growing infinitely, the secondary namenode
   > periodically pulls these two files, and the namenode starts writing changes
   > to a new edits file. Then, the secondary namenode merges the changes from
   > the edits file with the old snapshot from the fsimage file and creates an
   > updated fsimage file. This updated fsimage file is then copied to the
   > namenode.
   >
   > Then, the entire cycle starts again. To answer your question: The
   > namenode has both files, even if the secondary namenode is running on a
   > different machine.
   >
   > Kai
   >
   > On 06.10.2011 at 07:57, shanmuganathan.r wrote:
   >
   > >
   > > Hi All,
   > >
   > > I have a doubt in hadoop secondary namenode concept. Please
   > > correct if the following statements are wrong.
   > >
   > >
   > > The namenode hosts the fsimage and edit log files. 

Re: Secondary namenode fsimage concept

2011-10-10 Thread Harsh J
Generally you just gotta ensure that your rpc.lockd service is up and
running on both ends, to allow for locking over NFS.
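
One quick way to check on a Linux box (assuming the standard rpc tools are installed):
rpcinfo -p | grep -E 'nlockmgr|status'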

On Tue, Oct 11, 2011 at 8:16 AM, Uma Maheswara Rao G 72686
mahesw...@huawei.com wrote:
 Hi,

 It looks to me that, problem with your NFS. It is not supporting locks. Which 
 version of NFS are you using?
 Please check your NFS locking support by writing simple program for file 
 locking.

 I think NFS4 supports locking ( i did not tried).

 http://nfs.sourceforge.net/

  A6. What are the main new features in version 4 of the NFS protocol?
  *NFS Versions 2 and 3 are stateless protocols, but NFS Version 4 introduces 
 state. An NFS Version 4 client uses state to notify an NFS Version 4 server 
 of its intentions on a file: locking, reading, writing, and so on. An NFS 
 Version 4 server can return information to a client about what other clients 
 have intentions on a file to allow a client to cache file data more 
 aggressively via delegation. To help keep state consistent, more 
 sophisticated client and server reboot recovery mechanisms are built in to 
 the NFS Version 4 protocol.
  *NFS Version 4 introduces support for byte-range locking and share 
 reservation. Locking in NFS Version 4 is lease-based, so an NFS Version 4 
 client must maintain contact with an NFS Version 4 server to continue 
 extending its open and lock leases.


 Regards,
 Uma
 - Original Message -
 From: Shouguo Li the1plum...@gmail.com
 Date: Tuesday, October 11, 2011 2:31 am
 Subject: Re: Secondary namenode fsimage concept
 To: common-user@hadoop.apache.org

 hey parick

 i wanted to configure my cluster to write namenode metadata to
 multipledirectories as well:
  <property>
     <name>dfs.name.dir</name>
     <value>/hadoop/var/name,/mnt/hadoop/var/name</value>
  </property>

 in my case, /hadoop/var/name is local directory,
 /mnt/hadoop/var/name is NFS
 volume. i took down the cluster first, then copied over files from
 /hadoop/var/name to /mnt/hadoop/var/name, and then tried to start
 up the
 cluster. but the cluster won't start up properly...
 here's the namenode log: http://pastebin.com/gmu0B7yd

 any ideas why it wouldn't start up?
 thx


 On Thu, Oct 6, 2011 at 6:58 PM, patrick sang
 silvianhad...@gmail.comwrote:
  I would say your namenode write metadata in local fs (where your
 secondary namenode will pull files), and NFS mount.
 
    <property>
      <name>dfs.name.dir</name>
      <value>/hadoop/name,/hadoop/nfs_server_name</value>
    </property>
 
 
  my 0.02$
  P
 
  On Thu, Oct 6, 2011 at 12:04 AM, shanmuganathan.r 
  shanmuganatha...@zohocorp.com wrote:
 
   Hi Kai,
  
        There is no datas stored  in the secondarynamenode related
 to the
   Hadoop cluster . Am I correct?
   If it correct means If we run the secondaryname node in
 separate machine
   then fetching , merging and transferring time is increased if
 the cluster
   has large data in the namenode fsimage file . At the time if
 fail over
   occurs , then how can we recover the nearly one hour changes in
 the HDFS
   file ? (default check point time is one hour)
  
   Thanks R.Shanmuganathan
  
  
  
  
  
  
     On Thu, 06 Oct 2011 12:20:28 +0530 Kai Voigt <k...@123.org> wrote  
  
  
   Hi,
  
   the secondary namenode only fetches the two files when a
 checkpointing is
   needed.
  
   Kai
  
   On 06.10.2011 at 08:45, shanmuganathan.r wrote:
  
   > Hi Kai,
   >
   > In the Second part I meant
   >
   >
   > Is the secondary namenode also contain the FSImage file or the two
   > files (FSImage and EdiltLog) are transferred from the namenode at the
   > checkpoint time.
   >
   >
   > Thanks
   > Shanmuganathan
   >
   >
   >
   >  On Thu, 06 Oct 2011 11:37:50 +0530 Kai Voigt <k...@123.org> wrote 
   >
   >
   > Hi,
   >
   > you're correct when saying the namenode hosts the fsimage file and the
   > edits log file.
   >
   > The fsimage file contains a snapshot of the HDFS metadata (a filename
   > to blocks list mapping). Whenever there is a change to HDFS, it will be
   > appended to the edits file. Think of it as a database transaction log, where
   > changes will not be applied to the datafile, but appended to a log.
   >
   > To prevent the edits file growing infinitely, the secondary namenode
   > periodically pulls these two files, and the namenode starts writing changes
   > to a new edits file. Then, the secondary namenode merges the changes from
   > the edits file with the old snapshot from the fsimage file and creates an
   > updated fsimage file. This updated fsimage file is then copied to the
   > namenode.
   >
   > Then, the entire cycle starts again. To answer your question: The
   > namenode has both files, even if the secondary namenode is running on a
   > different machine.
   >
   > Kai
   >
   > On 06.10.2011 at 07:57, shanmuganathan.r wrote:
   >
   > >
   > > Hi All,
   > >
   > > I have a 

Re: Is it possible to run multiple MapReduce against the same HDFS?

2011-10-10 Thread Zhenhua (Gerald) Guo
Thanks, Robert.  I will look into hod.

When MapReduce framework accesses data stored in HDFS, which account
is used, the account which MapReduce daemons (e.g. job tracker) run as
or the account of the user who submits the job?  If HDFS and MapReduce
clusters are run with different accounts, can MapReduce cluster be
able to access HDFS directories and files (if authentication in HDFS
is enabled)?

Thanks!

Gerald

On Mon, Oct 10, 2011 at 12:36 PM, Robert Evans ev...@yahoo-inc.com wrote:
 It should be possible to use multiple map/reduce clusters sharing the same 
 HDFS, you can look at hod where it launches a JT on demand.  The only chance
 of collision that I can think of would be if by some odd chance both Job 
 Trackers were started at exactly the same millisecond.   The JT uses the time 
 it was started as part of the job id for all jobs.  Those job ids are assumed 
 to be unique and used to create files/directories in HDFS to store data for 
 that job.

 --Bobby Evans

 On 10/7/11 12:09 PM, Zhenhua (Gerald) Guo jen...@gmail.com wrote:

 I plan to deploy a HDFS cluster which will be shared by multiple
 MapReduce clusters.
 I wonder whether this is possible.  Will it incur any conflicts among
 MapReduce (e.g. different MapReduce clusters try to use the same temp
 directory in HDFS)?
 If it is possible, how should the security parameters be set up (e.g.
 user identity, file permission)?

 Thanks,

 Gerald




Re: hadoop input buffer size

2011-10-10 Thread Mark question
Thanks for the clarifications guys :)
Mark

On Mon, Oct 10, 2011 at 8:27 AM, Uma Maheswara Rao G 72686 
mahesw...@huawei.com wrote:

 I think below can give you more info about it.

 http://developer.yahoo.com/blogs/hadoop/posts/2009/08/the_anatomy_of_hadoop_io_pipel/
 Nice explanation by Owen here.

 Regards,
 Uma

 - Original Message -
 From: Yang Xiaoliang yangxiaoliang2...@gmail.com
 Date: Wednesday, October 5, 2011 4:27 pm
 Subject: Re: hadoop input buffer size
 To: common-user@hadoop.apache.org

  Hi,
 
  Hadoop neither read one line each time, nor fetching
  dfs.block.size of lines
  into a buffer,
  Actually, for the TextInputFormat, it read io.file.buffer.size
  bytes of text
  into a buffer each time,
  this can be seen from the hadoop source file LineReader.java
 
 
 
  2011/10/5 Mark question markq2...@gmail.com
 
   Hello,
  
Correct me if I'm wrong, but when a program opens n-files at
  the same time
   to read from, and start reading from each file at a time 1 line
  at a time.
   Isn't hadoop actually fetching dfs.block.size of lines into a
  buffer? and
   not actually one line.
  
If this is correct, I set up my dfs.block.size = 3MB and each
  line takes
   about 650 bytes only, then I would assume the performance for
  reading 1-4000
   lines would be the same, but it isn't !  Do you know a way to
  find #n of
   lines to be read at once?
  
   Thank you,
   Mark
  
 



Re: Error using hadoop distcp

2011-10-10 Thread Uma Maheswara Rao G 72686
DistCp runs as a MapReduce job.
Here the TaskTrackers require the hostname mappings to contact the other nodes.
Please configure the mapping correctly on both machines and try again.
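
For example, the /etc/hosts entry for ub16 on each node would look something like 
this (the IP address below is only a placeholder):
192.168.1.16   ub16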
Regards,
Uma

- Original Message -
From: trang van anh anh...@vtc.vn
Date: Wednesday, October 5, 2011 1:41 pm
Subject: Re: Error using hadoop distcp
To: common-user@hadoop.apache.org

 Which host runs the task that throws the exception? Ensure that each 
 data node knows the other data nodes in the hadoop cluster - add a ub16 entry 
 in /etc/hosts on the node where the task is running.
 On 10/5/2011 12:15 PM, praveenesh kumar wrote:
  I am trying to use distcp to copy a file from one HDFS to another.
 
  But while copying I am getting the following exception :
 
  hadoop distcp hdfs://ub13:54310/user/hadoop/weblog
  hdfs://ub16:54310/user/hadoop/weblog
 
  11/10/05 10:41:01 INFO mapred.JobClient: Task Id :
  attempt_201110031447_0005_m_07_0, Status : FAILED
  java.net.UnknownHostException: unknown host: ub16
      at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
      at org.apache.hadoop.ipc.Client.getConnection(Client.java:850)
      at org.apache.hadoop.ipc.Client.call(Client.java:720)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
      at $Proxy1.getProtocolVersion(Unknown Source)
      at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
      at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:113)
      at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:215)
      at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
      at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
      at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
      at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
      at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
      at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
      at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
      at org.apache.hadoop.mapred.FileOutputCommitter.setupJob(FileOutputCommitter.java:48)
      at org.apache.hadoop.mapred.OutputCommitter.setupJob(OutputCommitter.java:124)
      at org.apache.hadoop.mapred.Task.runJobSetupTask(Task.java:835)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:296)
      at org.apache.hadoop.mapred.Child.main(Child.java:170)
 
  Its saying its not finding ub16. But the entry is there in 
 /etc/hosts files.
  I am able to ssh both the machines. Do I need password less ssh 
 between these two NNs ?
  What can be the issue ? Any thing I am missing before using 
 distcp ?
 
  Thanks,
  Praveenesh