doubt about reduce tasks and block writes

2012-08-24 Thread Marc Sturlese
Hey there,
I have a doubt about reduce tasks and block writes. Does a reduce task always
write first to HDFS on the node where it is placed (with these
blocks then replicated to other nodes)?
If so: say I have a cluster of 5 nodes, 4 of them running a DN and a TT, and one
(node A) running just a DN. When running MR jobs, would map tasks never read from
node A? The reasoning: maps have data locality, and if reduce
tasks write first to the node where they live, one replica of each block
would always be on a node that has a TT. Node A would just contain blocks
created by the framework's replication, as no reduce task would write
there directly. Is this correct?
Thanks in advance





Re: doubt about reduce tasks and block writes

2012-08-24 Thread Minh Duc Nguyen
Marc, see my inline comments.

On Fri, Aug 24, 2012 at 4:09 PM, Marc Sturlese marc.sturl...@gmail.comwrote:

 Hey there,
 I have a doubt about reduce tasks and block writes. Does a reduce task always
 write first to HDFS on the node where it is placed (with these
 blocks then replicated to other nodes)?


Yes, if there is a DN running on that server (it's possible to be running
TT without a DN).


 If so: say I have a cluster of 5 nodes, 4 of them running a DN and a TT, and
 one
 (node A) running just a DN. When running MR jobs, would map tasks never read
 from node A? The reasoning: maps have data locality, and if reduce
 tasks write first to the node where they live, one replica of each block
 would always be on a node that has a TT. Node A would just contain blocks
 created by the framework's replication, as no reduce task would write
 there directly. Is this correct?


I believe that it's possible that a map task would read from node A's DN.
 Yes, the JobTracker tries to schedule map tasks on nodes where the data
would be local, but it can't always do so.  If there's a node with a free
map slot, but that node doesn't have the data blocks locally, the
JobTracker will assign the map task to that free map slot.  Some work done
(albeit slower than the ideal case because of the increased network IO) is
better than no work done.





Re: doubt about reduce tasks and block writes

2012-08-24 Thread Bertrand Dechoux
Assuming that node A only contains replicas, there is no guarantee that its
data will never be read.
First, you might lose a replica. The copy on node A could be used
to recreate the missing replica.
Second, data locality is best effort. If all the map slots are occupied
except one on a node without a replica of the data, then node A is as
likely as any other to be chosen as a source.

Regards

Bertrand





-- 
Bertrand Dechoux


Re: doubt about reduce tasks and block writes

2012-08-24 Thread Raj Vishwanathan
But since node A has no TT running, it will not run map or reduce tasks. When
the reducer node writes the output file, the first block replica will be written
on the local node and never on node A.

So, to answer the question, node A will contain copies of blocks of all output
files. It won't contain copy 0 of any output file.

I am reasonably sure about this, but there could be corner cases around
node failure and the like. I'd need to look into the code.
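
(For what it's worth, one can check where the replicas of a job's output
actually landed with fsck; the path below is illustrative:

  hadoop fsck /user/marc/job-output -files -blocks -locations

The locations listed for each block show whether node A ever appears,
though the printed order is not a guaranteed indicator of which replica
was written first.)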


Raj






Re: Reading multiple lines from a microsoft doc in hadoop

2012-08-24 Thread Bertrand Dechoux
And that would help you with performance too.
Were you originally planning to have one file per Word document?
What is the average size of your Word documents?
It probably isn't much, and in that case I am afraid your map startup time
won't be negligible.

Regards

Bertrand

On Fri, Aug 24, 2012 at 8:07 AM, Håvard Wahl Kongsgård 
haavard.kongsga...@gmail.com wrote:

 It's much easier if you convert the documents to text first

 use
 http://tika.apache.org/

 or some other doc parser
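
 (A minimal, illustrative sketch of such a conversion with Tika's
 AutoDetectParser; treat the exact API as an assumption, since it varies
 a bit between Tika releases:

   import java.io.FileInputStream;
   import java.io.InputStream;

   import org.apache.tika.metadata.Metadata;
   import org.apache.tika.parser.AutoDetectParser;
   import org.apache.tika.sax.BodyContentHandler;

   public class DocToText {
     public static void main(String[] args) throws Exception {
       // args[0]: path to a .doc/.docx file (illustrative usage)
       InputStream in = new FileInputStream(args[0]);
       BodyContentHandler handler = new BodyContentHandler(-1); // -1: no size limit
       Metadata metadata = new Metadata();
       // Detect the file type and extract its plain text
       new AutoDetectParser().parse(in, handler, metadata);
       System.out.println(handler.toString());
       in.close();
     }
   }

 The resulting plain text can then be loaded into HDFS as usual.)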


 -Håvard

 On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
 siddharth.tiw...@live.com wrote:
  hi,
  I have doc files in MS Word doc and docx format. These have entries which are
  separated by an empty line. Is it possible for me to read
  these blank-line-separated entries one at a time? Also, which input format
  shall I use to read doc/docx? Please help
 
  **
  Cheers !!!
  Siddharth Tiwari
  Have a refreshing day !!!
  “Every duty is holy, and devotion to duty is the highest form of worship of
  God.”
  Maybe other people will try to limit me but I don't limit myself



 --
 Håvard Wahl Kongsgård
 Faculty of Medicine 
 Department of Mathematical Sciences
 NTNU

 http://havard.security-review.net/




-- 
Bertrand Dechoux


Re: namenode not starting

2012-08-24 Thread Nitin Pawar
Did you run the command bin/hadoop namenode -format before starting
the namenode?

On Fri, Aug 24, 2012 at 12:58 PM, Abhay Ratnaparkhi
abhay.ratnapar...@gmail.com wrote:
 Hello,

 I had a running hadoop cluster.
 I restarted it, and after that the namenode is unable to start. I am getting
 an error saying that it's not formatted. :(
 Is it possible to recover the data on HDFS?

 2012-08-24 03:17:55,378 ERROR
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
 initialization failed.
 java.io.IOException: NameNode is not formatted.
 at
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:434)
 at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:110)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:291)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:270)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:271)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:303)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:433)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:421)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1359)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368)
 2012-08-24 03:17:55,380 ERROR
 org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException:
 NameNode is not formatted.
 at
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:434)
 at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:110)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:291)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:270)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:271)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:303)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:433)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:421)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1359)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368)

 Regards,
 Abhay





-- 
Nitin Pawar


RE: Reading multiple lines from a microsoft doc in hadoop

2012-08-24 Thread Siddharth Tiwari
Hi,
Thank you for the suggestion. Actually I was using POI to extract text, but
since I now have so many documents, I thought I would use hadoop directly to
parse them as well. The average size of each document is around 120 kb. Also,
I want to read multiple lines from the text until I find a blank line. I do
not have any idea about how to design a custom input format and record
reader. Please help with some tutorial, code or resource around it. I am
struggling with the issue. I will be highly grateful. Thank you so much once
again


Re: namenode not starting

2012-08-24 Thread vivek
hi,
Have you run the command namenode -format?
Thanks & regards,
Vivek

On Fri, Aug 24, 2012 at 12:58 PM, Abhay Ratnaparkhi 
abhay.ratnapar...@gmail.com wrote:

 [quoted message and stack trace snipped]





-- 
Thanks and Regards,

VIVEK KOUL


Re: namenode not starting

2012-08-24 Thread Bejoy KS
Hi Abhay

What is the value of hadoop.tmp.dir or dfs.name.dir? If it was set to /tmp,
the contents would be deleted on an OS restart. You need to change this
location before you start your NN.
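
(For illustration, a typical override keeps the metadata on persistent
storage; property name per Hadoop 1.x, paths illustrative:

  <!-- hdfs-site.xml: namenode metadata directories, local disk plus NFS -->
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/name,/mnt/nfs/hadoop/name</value>
  </property>

dfs.name.dir accepts a comma-separated list, and the namenode writes its
image to every listed directory, which gives you redundancy for free.)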
Regards
Bejoy KS

Sent from handheld, please excuse typos.

-Original Message-
From: Abhay Ratnaparkhi abhay.ratnapar...@gmail.com
Date: Fri, 24 Aug 2012 12:58:41 
To: user@hadoop.apache.org
Reply-To: user@hadoop.apache.org
Subject: namenode not starting

Hello,

I had a running hadoop cluster.
I restarted it and after that namenode is unable to start. I am getting
error saying that it's not formatted. :(
Is it possible to recover the data on HDFS?

2012-08-24 03:17:55,378 ERROR
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
initialization failed.
java.io.IOException: NameNode is not formatted.
at
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:434)
at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:110)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:291)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:270)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:271)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:303)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:433)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:421)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1359)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368)
2012-08-24 03:17:55,380 ERROR
org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException:
NameNode is not formatted.
at
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:434)
at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:110)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:291)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:270)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:271)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:303)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:433)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:421)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1359)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368)

Regards,
Abhay



Re: namenode not starting

2012-08-24 Thread Abhay Ratnaparkhi
Hello,

I was using the cluster for a long time and had not formatted the namenode.
I only ran the bin/stop-all.sh and bin/start-all.sh scripts.

I am using NFS for dfs.name.dir.
hadoop.tmp.dir is a /tmp directory. I've not restarted the OS. Is there any
way to recover the data?

Thanks,
Abhay

On Fri, Aug 24, 2012 at 1:01 PM, Bejoy KS bejoy.had...@gmail.com wrote:

  [quoted message and stack trace snipped]





Re: Reading multiple lines from a microsoft doc in hadoop

2012-08-24 Thread Mohammad Tariq
Hello Siddharth,

   You can tweak the NLineInputFormat as per your requirements and use
it. It allows us to read a specified number of lines at a time,
unlike TextInputFormat. There is a good post by Boris and Michael on
custom record readers; a rough sketch follows below. I would also suggest
combining similar files into one bigger file if feasible, as your
files are very small.
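
(A minimal sketch of such a record reader, returning one
blank-line-separated paragraph per record by wrapping LineRecordReader.
The class name ParagraphRecordReader is hypothetical, and the sketch
ignores the subtlety of paragraphs that cross split boundaries:

  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.RecordReader;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;
  import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

  public class ParagraphRecordReader extends RecordReader<LongWritable, Text> {

    private final LineRecordReader lineReader = new LineRecordReader();
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException {
      lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
      StringBuilder paragraph = new StringBuilder();
      boolean gotAnything = false;
      while (lineReader.nextKeyValue()) {
        String line = lineReader.getCurrentValue().toString();
        if (line.trim().isEmpty()) {
          if (gotAnything) {
            break;              // a blank line ends a non-empty paragraph
          }
          continue;             // skip leading blank lines
        }
        if (!gotAnything) {
          key.set(lineReader.getCurrentKey().get()); // offset of first line
          gotAnything = true;
        } else {
          paragraph.append('\n');
        }
        paragraph.append(line);
      }
      if (!gotAnything) {
        return false;           // no more paragraphs in this split
      }
      value.set(paragraph.toString());
      return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() throws IOException, InterruptedException {
      return lineReader.getProgress();
    }

    @Override
    public void close() throws IOException {
      lineReader.close();
    }
  }

An input format whose createRecordReader() returns this reader would then
hand each map() call one whole paragraph as its value.)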

Regards,
Mohammad Tariq



On Fri, Aug 24, 2012 at 1:00 PM, Siddharth Tiwari siddharth.tiw...@live.com
 wrote:

 Hi,
 Thank you for the suggestion. Actually I was using POI to extract text,
 but since I now have so many documents, I thought I would use hadoop
 directly to parse them as well. The average size of each document is around
 120 kb. Also, I want to read multiple lines from the text until I find a
 blank line. I do not have any idea about how to design a custom input format
 and record reader. Please help with some tutorial, code or resource around
 it. I am struggling with the issue. I will be highly grateful. Thank you so
 much once again




Re: About many user accounts in hadoop platform

2012-08-24 Thread Li Shengmei
Hi, Sonal

   Thanks for your information.
   Some users want to modify the hadoop source code, so these
users want to install their own hadoop version on the same cluster. After
they modify their hadoop version, they may compile and install the modified
version, and they don't want to affect the other users.

   Does your method work for this case?

Thanks a lot,

May

 

 

---

Hi,

 

Do your users want different versions of Hadoop? Or can they share the same
hadoop cluster and schedule their jobs? If the latter, Hadoop can be
configured to run for multiple users, and each user can submit their data
and jobs to the same cluster. Hence you can maintain a single cluster and
utilize your resources more efficiently. You can read more here:

 

http://www.ibm.com/developerworks/linux/library/os-hadoop-scheduling/index.html

 

http://www.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/

 

Best Regards,
Sonal
Crux: Reporting for HBase https://github.com/sonalgoyal/crux 
Nube Technologies http://www.nubetech.co  








On Fri, Aug 24, 2012 at 9:13 AM, Li Shengmei lisheng...@ict.ac.cn wrote:

Hi, all

 There are many users on the hadoop platform. Can they each install their own
hadoop version on the same cluster?

I tried to do this but failed. There existed a user account which had its own
hadoop installed. I created another account and installed another hadoop. The
logs displayed "ERROR org.apache.hadoop.hdfs.server.namenode.NameNode:
java.net.BindException: Problem binding to hadoop01/10.3.1.91:9000 : Address
already in use". So I changed the port to 8000, but it still failed.

When I run start-all.sh, the namenode can't start; the logs display "ERROR
org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
as:lismhadoop cause:java.net.BindException: Address already in use"

Can anyone give some suggestions? 

Thanks,

May

 



Re: About many user accounts in hadoop platform

2012-08-24 Thread Nitin Pawar
Hi Li,

The approach of everyone having their own version of hadoop on the same
cluster is far too complicated.

A better approach would be for all of them to test the patches they are
compiling for hadoop in pseudo-distributed mode on their personal
laptops/desktops, and then merge the features or patches into a single
version deployed to the cluster.

It is never recommended to run multiple versions of hadoop on a
single hardware cluster.
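
(For completeness: the BindException reported above comes from the second
instance trying to bind ports that the first instance already owns. If two
instances really must coexist for testing, every daemon of the second one
needs its own ports and directories, not just the namenode RPC port. An
illustrative core-site.xml fragment for the second user, property names
per Hadoop 1.x:

  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop01:8000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/user2/hadoop-tmp</value>
  </property>

The jobtracker address (mapred.job.tracker) and the various 50xxx web UI
and datanode ports would all have to be overridden too.)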

Thanks,
nitin

On Fri, Aug 24, 2012 at 1:42 PM, Li Shengmei lisheng...@ict.ac.cn wrote:
 [quoted message snipped]





-- 
Nitin Pawar


Re: Hadoop on EC2 Managing Internal/External IPs

2012-08-24 Thread Håvard Wahl Kongsgård
Hi, a VPN, or simply uploading the files to an EC2 node first, is the best
option.

An alternative is to use the external interface/IP instead of the
internal one in the hadoop config; I assume this will be slower and more
costly...
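
(A related trick, with the caveat that it relies on EC2's split-horizon
DNS: an instance's public DNS name, e.g.
ec2-1-2-3-4.compute-1.amazonaws.com, resolves to the private address from
inside EC2 and to the public address from outside. Listing such names
(hostnames below are illustrative) in fs.default.name and conf/slaves,
instead of raw IPs, can therefore give internal traffic internal routing
while external clients still get reachable addresses:

  # conf/slaves (illustrative)
  ec2-1-2-3-4.compute-1.amazonaws.com
  ec2-5-6-7-8.compute-1.amazonaws.com
)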

-Håvard

On Fri, Aug 24, 2012 at 4:54 AM, igor Finkelshteyn iefin...@gmail.com wrote:
 I've seen a bunch of people with this exact same question all over Google
 with no answers. I know people run successful non-temporary clusters in EC2.
 Has no one really needed to deal with making EC2 expose external
 addresses instead of internal addresses before? This seems like it should be
 a common thing.

 On Aug 23, 2012, at 12:34 PM, igor Finkelshteyn wrote:

 Hi,
 I'm currently setting up a Hadoop cluster on EC2, and everything works just 
 fine when accessing the cluster from inside EC2, but as soon as I try to do 
 something like upload a file from an external client, I get timeout errors 
 like:

 12/08/23 12:06:16 ERROR hdfs.DFSClient: Failed to close file 
 /user/some_file._COPYING_
 java.net.SocketTimeoutException: 65000 millis timeout while waiting for 
 channel to be ready for connect. ch : 
 java.nio.channels.SocketChannel[connection-pending remote=/10.123.x.x:50010]

 What's clearly happening is that my NameNode is resolving my DataNodes' IPs
 to their internal EC2 values instead of their external values, and then
 sending the internal IP along to my external client, which is obviously
 unable to reach it. I'm thinking this must be a common problem. How do other
 people deal with it? Is there a way to just force my name node to send along
 my DataNodes' hostnames instead of IPs, so that the hostnames can be resolved
 properly from whatever box will be sending files?

 Eli




-- 
Håvard Wahl Kongsgård
Faculty of Medicine 
Department of Mathematical Sciences
NTNU

http://havard.security-review.net/


Re: namenode not starting

2012-08-24 Thread Håvard Wahl Kongsgård
You should start with a reboot of the system.

A lesson to everyone: this is exactly why you should have a secondary
name node
(http://wiki.apache.org/hadoop/FAQ#What_is_the_purpose_of_the_secondary_name-node.3F)
and run the namenode on a mirrored RAID-5/10 disk.


-Håvard



On Fri, Aug 24, 2012 at 9:40 AM, Abhay Ratnaparkhi
abhay.ratnapar...@gmail.com wrote:
 Hello,

 I was using cluster for long time and not formatted the namenode.
 I ran bin/stop-all.sh and bin/start-all.sh scripts only.

 I am using NFS for dfs.name.dir.
 hadoop.tmp.dir is a /tmp directory. I've not restarted the OS.  Any way to
 recover the data?

 Thanks,
 Abhay


  [earlier messages and stack trace snipped]






-- 
Håvard Wahl Kongsgård
Faculty of Medicine 
Department of Mathematical Sciences
NTNU

http://havard.security-review.net/


hadoop download path missing

2012-08-24 Thread Steven Willis
All the links at: http://www.apache.org/dyn/closer.cgi/hadoop/common/ are 
returning 404s, even the backup site at: 
http://www.us.apache.org/dist/hadoop/common/. However, the eu site: 
http://www.eu.apache.org/dist/hadoop/common/ does work.

-Steven Willis


Re: hadoop download path missing

2012-08-24 Thread Sonal Goyal
I just tried and could go to
http://apache.techartifact.com/mirror/hadoop/common/hadoop-2.0.1-alpha/

Is this still happening for you?

Best Regards,
Sonal
Crux: Reporting for HBase https://github.com/sonalgoyal/crux
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Fri, Aug 24, 2012 at 8:59 PM, Steven Willis swil...@compete.com wrote:

 All the links at: http://www.apache.org/dyn/closer.cgi/hadoop/common/ are
 returning 404s, even the backup site at:
 http://www.us.apache.org/dist/hadoop/common/. However, the eu site:
 http://www.eu.apache.org/dist/hadoop/common/ does work.

 -Steven Willis



RE: namenode not starting

2012-08-24 Thread Siddharth Tiwari

Hi Abhay,

I totally concur with Bejoy. Can you paste your mapred-site.xml and
hdfs-site.xml content here?

**

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
“Every duty is holy, and devotion to duty is the highest form of worship of
God.”

Maybe other people will try to limit me but I don't limit myself


 From: lle...@ddn.com
 To: user@hadoop.apache.org
 Subject: RE: namenode not starting
 Date: Fri, 24 Aug 2012 16:38:01 +
 
  Abhay,
    Sounds like your namenode cannot find the metadata information it needs to
  start (the path/current | image | checkpoints etc.).
  
    Basically, if you cannot locate that data locally or on your NFS server,
  your cluster is busted.
  
    But let's be optimistic about this.
  
    There is a chance that your NFS server is down or the mounted path is lost.
  
    If it is NFS mounted (as you suggested), check that your host still has
  that path mounted (from the proper NFS server).
    ( [shell] mount ) can tell.
    * Obviously, if you originally mounted from foo:/mydata and now mount
  bar:/mydata, you'll need to do some digging to find which NFS server it
  was writing to before.
  
    If you fail to locate your namenode metadata (locally or on any of your
  NFS servers), whether because the NFS server decided to become a black hole
  or because someone or something removed it, and you don't have a backup of
  your namenode (tape or a Secondary Namenode), then I think you are in a
  world of hurt there.
  
    In theory you can read the blocks on the DNs and try to recover some of
  your data (assuming it is not encoded / compressed).
  Humm.. anyone know about recovery services? (^^)
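  
  (Illustrative: the raw block files sit under each datanode's
  dfs.data.dir, e.g.
  
    ls /data/1/dfs/dn/current/blk_*
  
  so uncompressed text could in principle be carved out of them by hand;
  the path above is an assumption about your layout.)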
 
 
 
  [earlier messages and stack trace snipped]
  

RE: How do we view the blocks of a file in HDFS

2012-08-24 Thread Siddharth Tiwari

Hi Abhishek,

You can use fsck for this purpose

hadoop fsck <HDFS directory> -files -blocks -locations  --- displays what you
want
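
(For example, with an illustrative path; the block IDs and addresses below
are made up, but the output has roughly this shape:

  hadoop fsck /user/abhishek/myfile.txt -files -blocks -locations

  /user/abhishek/myfile.txt 134217728 bytes, 2 block(s):  OK
  0. blk_73426761 len=67108864 repl=3 [10.0.0.2:50010, 10.0.0.3:50010, 10.0.0.5:50010]
  1. blk_98243711 len=67108864 repl=3 [10.0.0.3:50010, 10.0.0.4:50010, 10.0.0.5:50010]

Each bracketed list names the datanodes holding a replica of that block.)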

**

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
“Every duty is holy, and devotion to duty is the highest form of worship of
God.”

Maybe other people will try to limit me but I don't limit myself


From: abhisheksgum...@gmail.com
Date: Fri, 24 Aug 2012 22:10:37 +0530
Subject: How do we view the blocks of a file in HDFS
To: user@hadoop.apache.org

hi,
   If I push a file into HDFS on a 4-node cluster with 1 namenode and
3 datanodes, how can I view where on the datanodes the blocks of this file
are? I would like to view the blocks and their replicas individually. How can
I do this?

The answer is very critical for my current task, which is halted :) A detailed
answer will be highly appreciated. Thank you!

With Regards,
Abhishek S

  


easy mv or heavy mv

2012-08-24 Thread Yue Guan

Hi, there

I just want to know: for hadoop dfs -mv, does 'mv' just change the
metadata, or does it really copy data around on HDFS? Thank you very much!


Thanks


Re: questions about CDH Version 4.0.1

2012-08-24 Thread Arun C Murthy
Please email the CDH lists.

On Aug 24, 2012, at 2:34 AM, jing wang wrote:

 Hi,
 
   I'm curious about what the release notes say:
 http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0+91.releasenotes.html
  
   Is cdh4 based on ‘Release 2.0.0-alpha’?
 http://hadoop.apache.org/common/releases.html#23+May%2C+2012%3A+Release+2.0.0-alpha+available
  
   Are there limitations on the cdh packages?
  
  
   Any advice will be appreciated!
  
  
  Thanks & Best Regards
  Jing Wang
 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




RE: Reading multiple lines from a microsoft doc in hadoop

2012-08-24 Thread Siddharth Tiwari

Any help on the below would be really appreciated. I am stuck on it.

**

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
“Every duty is holy, and devotion to duty is the highest form of worship of
God.”

Maybe other people will try to limit me but I don't limit myself


From: siddharth.tiw...@live.com
To: user@hadoop.apache.org; bejoy.had...@gmail.com; bejoy...@yahoo.com
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Fri, 24 Aug 2012 20:23:45 +





Hi,

Can anyone please help?

Thank you in advance

**

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
“Every duty is holy, and devotion to duty is the highest form of worship of
God.”

Maybe other people will try to limit me but I don't limit myself


From: siddharth.tiw...@live.com
To: user@hadoop.apache.org; bejoy.had...@gmail.com; bejoy...@yahoo.com
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Fri, 24 Aug 2012 16:22:57 +





Hi Team,

Thanks a lot for so many good suggestions. I wrote a custom input format for
reading one paragraph at a time. But when I use it, I get individual lines
back. Can you please suggest what changes I must make to read one paragraph
at a time, separated by blank lines?
Below is the code I wrote:


import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.util.LineReader;

/**
 * @author 460615
 *
 * FileInputFormat is the base class for all file-based InputFormats.
 * This one makes one split per paragraph, where paragraphs are separated
 * by blank lines.
 */
public class ParaInputFormat extends FileInputFormat<LongWritable, Text> {

    // Matches a blank (empty or whitespace-only) line.
    private final Pattern nullRegex = Pattern.compile("^\\s*$");

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit genericSplit,
            TaskAttemptContext context) throws IOException {
        context.setStatus(genericSplit.toString());
        return new LineRecordReader();
    }

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        Configuration conf = job.getConfiguration();
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus status : listStatus(job)) {
            Path fileName = status.getPath();
            if (status.isDir()) {
                throw new IOException("Not a file: " + fileName);
            }
            FileSystem fs = fileName.getFileSystem(conf);
            LineReader lr = null;
            try {
                FSDataInputStream in = fs.open(fileName);
                lr = new LineReader(in, conf);
                Text line = new Text();
                long begin = 0; // start offset of the current paragraph
                long pos = 0;   // offset just past the line most recently read
                int num;
                while ((num = lr.readLine(line)) > 0) {
                    pos += num;
                    // A blank line closes the current paragraph.
                    if (nullRegex.matcher(line.toString()).matches()) {
                        if (pos - num > begin) {
                            splits.add(new FileSplit(fileName, begin,
                                    pos - num - begin, new String[] {}));
                        }
                        begin = pos;
                    }
                }
                // Trailing paragraph, if the file does not end with a blank line.
                if (pos > begin) {
                    splits.add(new FileSplit(fileName, begin, pos - begin,
                            new String[] {}));
                }
            } finally {
                if (lr != null) {
                    lr.close();
                }
            }
        }
        return splits;
    }
}




**

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
“Every duty is holy, and devotion to duty is the highest form of worship of
God.”

Maybe other people will try to limit me but I don't limit myself


 Date: Fri, 24 Aug 2012 09:54:10 +0200
 Subject: Re: Reading multiple lines from a microsoft doc in hadoop
 From: haavard.kongsga...@gmail.com
 To: user@hadoop.apache.org
 
  Hi, maybe you should check out the old nutch project
  http://nutch.apache.org/ (hadoop was developed for nutch).
  It's a web crawler and indexer, but the mailing lists hold much info on
  doc/pdf parsing, which also relates to hadoop.
  
  I have never parsed many docx or doc files, but it should be
  straightforward. Generally, for text analysis, preprocessing is the
  KEY! (For example, collapsing double line breaks, \r\n\r\n or \n\n, is a
  simple trick.)
 
 
 -Håvard
 
 On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari
 siddharth.tiw...@live.com wrote:
  Hi,
  Thank you for