Re: When applying a patch, which attachment should I use?

2011-01-13 Thread edward choi
Dear Adarsh,

My situation is somewhat different from yours as I am only running Hadoop
and Hbase (as opposed to Hadoop/Hive/Hbase).

But I hope my experience could be of help to you somehow.

I applied the hdfs-630-0.20-append.patch to every single Hadoop node
(master and slaves included).
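Roughly what I ran on each node, as a sketch from memory - treat the -p level
and the rebuild step as assumptions about your setup rather than an exact recipe:

  cd $HADOOP_HOME
  # -p0 assumes the patch was generated from the project root (svn diff style)
  patch -p0 < hdfs-630-0.20-append.patch
  # rebuild the core jar so the patched classes are picked up, then restart the daemons
  ant jar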
Then I followed exactly what they told me to do on
http://hbase.apache.org/docs/current/api/overview-summary.html#overview_description.

I didn't get a single error message and successfully started HBase in a
fully distributed mode.

I am not using Hive so I can't tell what caused the
MasterNotRunningException, but the patch above is meant to allow DFSClients
to pass the NameNode lists of known dead Datanodes.
I doubt that the patch has anything to do with the MasterNotRunningException.

Hope this helps.

Regards,
Ed

2011/1/13 Adarsh Sharma adarsh.sha...@orkash.com


 I am also facing some issues and I think applying

 hdfs-630-0.20-append.patch
 (https://issues.apache.org/jira/secure/attachment/12446812/hdfs-630-0.20-append.patch)

 would solve my problem.

 I am trying to run the Hadoop/Hive/HBase integration in fully distributed mode.

 But I am facing the MasterNotRunningException mentioned in
 http://wiki.apache.org/hadoop/Hive/HBaseIntegration.

 My versions: Hadoop 0.20.2, Hive 0.6.0, HBase 0.20.6.

 What do you think, Edward?


 Thanks, Adarsh






 edward choi wrote:

 I am not familiar with this whole svn and patch stuff, so please bear with my
 asking.

 I was going to apply hdfs-630-0.20-append.patch
 (https://issues.apache.org/jira/secure/attachment/12446812/hdfs-630-0.20-append.patch)
 only because I wanted to install HBase and the installation guide told me to.
 The append branch you mentioned, does that include hdfs-630-0.20-append.patch
 as well?
 Is it like the latest patch with all the good stuff packed in one?

 Regards,
 Ed

 2011/1/12 Ted Dunning tdunn...@maprtech.com



 You may also be interested in the append branch:

 http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/

 On Tue, Jan 11, 2011 at 3:12 AM, edward choi mp2...@gmail.com wrote:



 Thanks for the info.
 I am currently using Hadoop 0.20.2, so I guess I only need to apply
 hdfs-630-0.20-append.patch
 (https://issues.apache.org/jira/secure/attachment/12446812/hdfs-630-0.20-append.patch).
 I wasn't familiar with the term trunk. I guess it means the latest
 development.
 Thanks again.

 Best Regards,
 Ed

 2011/1/11 Konstantin Boudnik c...@apache.org



 Yeah, that's pretty crazy all right. In your case it looks like the 3
 patches at the top are the latest for the 0.20-append branch, the 0.21 branch
 and trunk (which is perhaps the 0.22 branch at the moment). It doesn't look
 like you need to apply all of them - just try the latest for your
 particular branch.

 The mess is caused by the fact that people are using different names for
 consecutive patches (as in file.1.patch, file.2.patch, etc.). This is
 _very_ confusing indeed, especially when different contributors work
 on the same fix/feature.
 --
  Take care,
 Konstantin (Cos) Boudnik


 On Mon, Jan 10, 2011 at 01:10, edward choi mp2...@gmail.com wrote:


 Hi,
 For the first time I am about to apply a patch to HDFS.

 https://issues.apache.org/jira/browse/HDFS-630

 Above is the one that I am trying to do.
 But there are like 15 patches and I don't know which one to use.

 Could anyone tell me if I need to apply them all or just the one at the top?


 The whole patching process is just so confusing :-(

 Ed










Why Hadoop uses HTTP for file transmission between Map and Reduce?

2011-01-13 Thread Nan Zhu
Hi, all

I have a question about the file transmission between the Map and Reduce stages.
In the current implementation, the Reducers get the results generated by the
Mappers through HTTP GET. I don't understand why HTTP was selected - why not FTP,
or a self-developed protocol?

Is it just because HTTP is simple?

thanks

Nan


Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?

2011-01-13 Thread li ping
That is also my concern. Is it efficient for data transmission?

On Thu, Jan 13, 2011 at 4:27 PM, Nan Zhu zhunans...@gmail.com wrote:

 Hi, all

 I have a question about the file transmission between the Map and Reduce stages.
 In the current implementation, the Reducers get the results generated by the
 Mappers through HTTP GET. I don't understand why HTTP was selected - why not FTP,
 or a self-developed protocol?

 Is it just because HTTP is simple?

 thanks

 Nan




-- 
-李平


Re: TeraSort question.

2011-01-13 Thread Steve Loughran

On 11/01/11 16:40, Raj V wrote:

Ted


Thanks. I have all the graphs I need, including the map/reduce timeline and system
activity for all the nodes while the sort was running. I will publish them once
I have them in some presentable format.

For legal reasons, I really don't want to send the complete job history files.

My question is still this: when running TeraSort, would the CPU, disk and
network utilization of all the nodes be more or less similar, or completely
different?


They can be different. The JT pushes out work to machines when they 
report in; some may get more work than others, and so generate more local 
data. This will have follow-on consequences. In a live system things are 
different, as the work tends to follow the data, so machines with (or 
near) the data you need get the work.


It's a really hard thing to say whether the cluster is working right; when 
bringing it up, everyone is really guessing about expected performance.


-Steve


Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?

2011-01-13 Thread Steve Loughran

On 13/01/11 08:34, li ping wrote:

That is also my concern. Is it efficient for data transmission?


It's long-lived TCP connections, reasonably efficient for bulk data 
xfer, has all the throttling of TCP built in, and comes with some 
excellently debugged client and server code in the form of Jetty and 
HttpClient. In maintenance costs alone, those libraries justify HTTP 
unless you have a vastly superior option *and are willing to maintain it 
forever*.
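
For illustration only - the shuffle fetch really is a plain HTTP GET against 
the TaskTracker's embedded Jetty. The servlet path and parameter names below 
are my recollection of the 0.20 code, so treat them as a sketch, not gospel 
(50060 is the default TaskTracker HTTP port):

  curl "http://tt-host:50060/mapOutput?job=job_201101130000_0001&map=attempt_201101130000_0001_m_000000_0&reduce=0"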


FTP's limits are well known (security), NFS's limits are well known (security, 
the UDP version doesn't throttle), and self-developed protocols will have 
whatever problems you want.


There are better protocols for long-haul data transfer over fat pipes, 
such as GridFTP and PhEDEx 
( http://www.gridpp.ac.uk/papers/ah05_phedex.pdf ), which use multiple TCP 
channels in parallel to reduce the impact of a single lost packet, but 
within a datacentre you shouldn't have to worry about this. If you do 
find lots of packets get lost, raise the issue with the networking team.


-Steve



On Thu, Jan 13, 2011 at 4:27 PM, Nan Zhu zhunans...@gmail.com wrote:


Hi, all

I have a question about the file transmission between the Map and Reduce stages.
In the current implementation, the Reducers get the results generated by the
Mappers through HTTP GET. I don't understand why HTTP was selected - why not FTP,
or a self-developed protocol?

Is it just because HTTP is simple?

thanks

Nan









Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?

2011-01-13 Thread He Chen
Actually, PhEDEx uses GridFTP for its data transfers.

On Thu, Jan 13, 2011 at 5:34 AM, Steve Loughran ste...@apache.org wrote:

 On 13/01/11 08:34, li ping wrote:

 That is also my concerns. Is it efficient for data transmission.


 It's long lived TCP connections, reasonably efficient for bulk data xfer,
 has all the throttling of TCP built in, and comes with some excellently
 debugged client and server code in the form of jetty and httpclient. In
 maintenance costs alone, those libraries justify HTTP unless you have a
 vastly superior option *and are willing to maintain it forever*

 FTPs limits are well known (security), NFS limits well known (security, UDP
 version doesn't throttle), self developed protocols will have whatever
 problems you want.

 There are better protocols for long-haul data transfer over fat pipes, such
 as GridFTP , PhedEX ( http://www.gridpp.ac.uk/papers/ah05_phedex.pdf ),
 which use multiple TCP channels in parallel to reduce the impact of a single
 lost packet, but within a datacentre, you shouldn't have to worry about
 this. If you do find lots of packets get lost, raise the issue with the
 networking team.

 -Steve



 On Thu, Jan 13, 2011 at 4:27 PM, Nan Zhu zhunans...@gmail.com wrote:

  Hi, all

 I have a question about the file transmission between Map and Reduce
 stage,
 in current implementation, the Reducers get the results generated by
 Mappers
 through HTTP Get, I don't understand why HTTP is selected, why not FTP,
 or
 a
 self-developed protocal?

 Just for HTTP's simple?

 thanks

 Nan








About hadoop-..-examples.jar

2011-01-13 Thread Bo Sang
Hi, guys:

Does anyone know where I can get the package hadoop-..-examples.jar? I want to
use TeraSort from it. It seems this package is not included in the Hadoop source
code, and I have also failed to find download links on its homepage.
-- 
Best Regards!

Sincerely
Bo Sang


Re: About hadoop-..-examples.jar

2011-01-13 Thread Tsz Wo (Nicholas), Sze
The examples package is in the MapReduce trunk.  Note that it is under a 
different source directory, src/examples, not src/java.

See also 
http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/examples/org/apache/hadoop/examples/terasort/
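
If you just want to run TeraSort on a 0.20.x release, the compiled examples jar
normally ships in the release tarball itself; a rough sketch (the jar name
depends on your version, and the paths are examples):

  hadoop jar hadoop-0.20.2-examples.jar teragen 1000000 /user/you/tera-in
  hadoop jar hadoop-0.20.2-examples.jar terasort /user/you/tera-in /user/you/tera-out
  hadoop jar hadoop-0.20.2-examples.jar teravalidate /user/you/tera-out /user/you/tera-report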


Nicholas






From: Bo Sang sampl...@gmail.com
To: Hadoop user mail list common-user@hadoop.apache.org
Sent: Thu, January 13, 2011 11:23:44 AM
Subject: About hadoop-..-examples.jar

Hi, guys:

Does anyone know where I can get the package hadoop-..-examples.jar? I want to
use TeraSort from it. It seems this package is not included in the Hadoop source
code, and I have also failed to find download links on its homepage.
-- 
Best Regards!

Sincerely
Bo Sang


Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?

2011-01-13 Thread Owen O'Malley
At some point, we'll replace Jetty in the shuffle, because it imposes too
much overhead, and go to Netty or some other lower-level library. I don't
think that using HTTP adds that much overhead, although it would be
interesting to measure that.

-- Owen


Re: MultipleOutputs Performance?

2011-01-13 Thread David Rosenstrauch

On 12/10/2010 02:16 PM, Harsh J wrote:

Hi,

On Thu, Dec 2, 2010 at 10:40 PM, Matt Tanquary matt.tanqu...@gmail.com wrote:

I am using MultipleOutputs to split a mapper input into about 20
different files. Adding this split has had an extremely adverse effect
on performance. Is MultipleOutputs known for performing slowly?


There was a bug in MultipleOutputs which could've led to this. It has
been fixed in MAPREDUCE-1853. Should be in the next 0.21 maintenance
release as well as 0.22.

(And in next CDH3, if you are using that).


Is there any workaround to this issue for those of us who are still 
running 0.20?


I have a job that very much lends itself to using the MultipleOutputs 
functionality, but this bug is absolutely crushing the job's performance.


Are there any ways to fix or work around this issue without having to (a) 
upgrade our cluster to 0.21, or (b) completely rewrite my job?


Thanks,

DR


Re: When applying a patch, which attachment should I use?

2011-01-13 Thread edward choi
Dear Adarsh,

I have a single machine running the Namenode/JobTracker/HBase Master.
There are 17 machines running a Datanode/TaskTracker.
Among those 17 machines, 14 are running HBase RegionServers.
The other 3 machines are running Zookeeper.

And about Zookeeper:
HBase comes with its own Zookeeper, so you don't need to install a new
Zookeeper (except for one special case, which I'll explain later).
I assigned 14 machines as regionservers using
$HBASE_HOME/conf/regionservers.
I assigned 3 machines as Zookeeper nodes using the hbase.zookeeper.quorum property
in $HBASE_HOME/conf/hbase-site.xml.
Don't forget to set export HBASE_MANAGES_ZK=true
in $HBASE_HOME/conf/hbase-env.sh. (This is where you announce that you
will be using the Zookeeper that comes with HBase.)
This way, when you execute $HBASE_HOME/bin/start-hbase.sh, HBase will
automatically start Zookeeper first, then start the HBase daemons.
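
To make that concrete, a minimal sketch of the three files involved (the host
names are made up):

  $HBASE_HOME/conf/regionservers (one regionserver host per line):
    rs01
    rs02
    ...

  $HBASE_HOME/conf/hbase-site.xml:
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>zk01,zk02,zk03</value>
    </property>

  $HBASE_HOME/conf/hbase-env.sh:
    export HBASE_MANAGES_ZK=true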

Also, you can install your own Zookeeper and tell HBase to use it instead of
its own.
I read on the internet that the Zookeeper that comes with HBase does not work
properly on Windows 7 64-bit
(http://alans.se/blog/2010/hadoop-hbase-cygwin-windows-7-x64/).
So in that case you need to install your own Zookeeper, set it up properly,
and tell HBase to use it instead of its own.
All you need to do is configure zoo.cfg and add it to the HBase classpath.
And don't forget to set export HBASE_MANAGES_ZK=false
in $HBASE_HOME/conf/hbase-env.sh.
This way, HBase will not start Zookeeper automatically.
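
For that setup, the relevant bits are roughly (the zoo.cfg location is just an
example):

  $HBASE_HOME/conf/hbase-env.sh:
    export HBASE_MANAGES_ZK=false
    # directory containing your zoo.cfg, so HBase can find it on its classpath
    export HBASE_CLASSPATH=/path/to/zookeeper/conf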

About the separation of Zookeepers from regionservers,
Yes, it is recommended to separate Zookeepers from regionservers.
But that won't be necessary unless your clusters are very heavily loaded.
They also suggest that you give Zookeeper its own hard disk. But I haven't
done that myself yet. (Hard disks cost money you know)
So I'd say your cluster seems fine.
But when you want to expand your cluster, you'd need some changes. I suggest
you take a look at Hadoop: The Definitive Guide.

Regards,
Edward

2011/1/13 Adarsh Sharma adarsh.sha...@orkash.com

 Thanks Edward,

 Can you describe the architecture used in your configuration?

 For e.g., I have a cluster of 10 servers:

 1 node acts as (Namenode, Jobtracker, HMaster).
 The remaining 9 nodes act as slaves (Datanodes, Tasktrackers, HRegionServers).
 Among these 9 nodes I also set 3 nodes in the zookeeper.quorum property.

 I want to know whether it is necessary to configure Zookeeper separately with
 the zookeeper-3.2.2 package, or whether it is enough to have some IPs listed in
 the zookeeper.quorum property and let HBase take care of it.

 Can we specify the IPs of the HRegionServers used before as Zookeeper servers
 (HQuorumPeer), or must we have separate servers for it?

 My problem arises in running Zookeeper. My HBase is up and running in
 fully distributed mode too.




 With Best Regards

 Adarsh Sharma








 edward choi wrote:

 Dear Adarsh,

 My situation is somewhat different from yours as I am only running Hadoop
 and Hbase (as opposed to Hadoop/Hive/Hbase).

 But I hope my experience could be of help to you somehow.

 I applied the hdfs-630-0.20-append.patch to every single Hadoop node.
 (including master and slaves)
 Then I followed exactly what they told me to do on

 http://hbase.apache.org/docs/current/api/overview-summary.html#overview_description
 .

 I didn't get a single error message and successfully started HBase in a
 fully distributed mode.

 I am not using Hive so I can't tell what caused the
 MasterNotRunningException, but the patch above is meant to  allow
 DFSClients
 pass NameNode lists of known dead Datanodes.
 I doubt that the patch has anything to do with MasterNotRunningException.

 Hope this helps.

 Regards,
 Ed

 2011/1/13 Adarsh Sharma adarsh.sha...@orkash.com



 I am also facing some issues  and i think applying

 hdfs-630-0.20-append.patch

 https://issues.apache.org/jira/secure/attachment/12446812/hdfs-630-0.20-append.patch
would solve my problem.

 I try to run Hadoop/Hive/Hbase integration in fully Distributed mode.

 But I am facing master Not Running Exception mentioned in

 http://wiki.apache.org/hadoop/Hive/HBaseIntegration.

 My Hadoop Version= 0.20.2, Hive =0.6.0 , Hbase=0.20.6.

 What you think Edward.


 Thanks  Adarsh






 edward choi wrote:



 I am not familiar with this whole svn and patch stuff, so please
 understand
 my asking.

 I was going to apply
 hdfs-630-0.20-append.patch

 https://issues.apache.org/jira/secure/attachment/12446812/hdfs-630-0.20-append.patch
  only
 because I wanted to install HBase and the installation guide told me to.
 The append branch you mentioned, does that include
 hdfs-630-0.20-append.patch

 https://issues.apache.org/jira/secure/attachment/12446812/hdfs-630-0.20-append.patch
  as
 well?
 Is it like the latest patch with all the good stuff packed in one?

 Regards,
 Ed

 2011/1/12 Ted Dunning tdunn...@maprtech.com





 You may also be interested in the append branch:

 

cannot connect from slaves

2011-01-13 Thread Mark Kerzner
Hi,

my list file command

hadoop fs -ls hdfs://master-url/

works locally on the master, but cannot connect from any of the slaves.

What should I check for?

Thank you,
Mark


found an inconsistent entry in 0.21 API

2011-01-13 Thread Yang Sun
I searched for the MultipleOutputs class on Google and found a 0.21 API
documentation page that describes the class in the new version of Hadoop.
But the downloaded jar file doesn't include this class. There are also a few
errors in the example on the MultipleOutputs API documentation page.


Re: cannot connect from slaves

2011-01-13 Thread Zhang Jianfeng
Can you connect to the other slaves by ssh or ping?

On Fri, Jan 14, 2011 at 9:02 AM, Mark Kerzner markkerz...@gmail.com wrote:

 Hi,

 my list file command

 hadoop fs -ls hdfs://master-url/

 works locally on the master, but cannot connect from any of the slaves.

 What should I check for?

 Thank you,
 Mark



Re: cannot connect from slaves

2011-01-13 Thread Mark Kerzner
I did this, and tried both ways:

hadoop fs -ls /
11/01/14 02:45:25 INFO ipc.Client: Retrying connect to server: /
10.113.118.244:8020. Already tried 0 time(s).
11/01/14 02:45:29 INFO ipc.Client: Retrying connect to server: /
10.113.118.244:8020. Already tried 1 time(s).

and

hadoop fs -ls hdfs://10.113.118.244/
11/01/14 02:48:56 INFO ipc.Client: Retrying connect to server: /
10.113.118.244:8020. Already tried 0 time(s).

I suspect port 8020 - how do I test it outside of Hadoop?

Thank you,
Mark

On Thu, Jan 13, 2011 at 8:29 PM, George Datskos 
george.dats...@jp.fujitsu.com wrote:

 Hello


 On 2011/01/14 10:02, Mark Kerzner wrote:

 hadoop fs -ls hdfs://master-url/

 works locally on the master, but cannot connect from any of the slaves.


 Make sure to replicate conf/core-site.xml to each of the slaves.  The
 fs.default.name property should point to the master node.  That way the
 slaves know how to reach the NameNode.

  <name>fs.default.name</name>
  <value>hdfs://master:8020</value>


 (adjust the host name and port as necessary)
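
 For reference, a minimal core-site.xml sketch (host name and port are examples):

   <configuration>
     <property>
       <name>fs.default.name</name>
       <value>hdfs://master:8020</value>
     </property>
   </configuration>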


 George




Re: cannot connect from slaves

2011-01-13 Thread rishi pathak
Hello,
It can be a firewall issue. Try telnet from master and from
slaves and see:

(master)# telnet master 8020

and
(slave)# telnet master 8020
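
If telnet isn't installed on the nodes, a netcat probe is an equivalent sketch:

(slave)# nc -vz master 8020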


On Fri, Jan 14, 2011 at 8:19 AM, Mark Kerzner markkerz...@gmail.com wrote:

 I did this, and tried both ways:

 hadoop fs -ls /
 11/01/14 02:45:25 INFO ipc.Client: Retrying connect to server: /
 10.113.118.244:8020. Already tried 0 time(s).
 11/01/14 02:45:29 INFO ipc.Client: Retrying connect to server: /
 10.113.118.244:8020. Already tried 1 time(s).

 and

 hadoop fs -ls hdfs://10.113.118.244/
 11/01/14 02:48:56 INFO ipc.Client: Retrying connect to server: /
 10.113.118.244:8020. Already tried 0 time(s).

 I am suspecting the port 8020 - how do I test it outside of hadoop?

 Thank you,
 Mark

 On Thu, Jan 13, 2011 at 8:29 PM, George Datskos 
 george.dats...@jp.fujitsu.com wrote:

  Hello
 
 
  On 2011/01/14 10:02, Mark Kerzner wrote:
 
  hadoop fs -ls hdfs://master-url/
 
  works locally on the master, but cannot connect from any of the slaves.
 
 
  Make sure to replicate conf/core-site.xml to each of the slaves.  The
  fs.default.name property should point to the master node.  That way the
  slaves know how to reach the NameNode.
 
   <name>fs.default.name</name>
   <value>hdfs://master:8020</value>
 
 
  (adjust the host name and port as necessary)
 
 
  George
 
 




-- 
Regards--
Rishi Pathak
National PARAM Supercomputing Facility
Center for Development of Advanced Computing(C-DAC)
Pune University Campus,Ganesh Khind Road
Pune-Maharastra


Re: cannot connect from slaves

2011-01-13 Thread Mark Kerzner
Probably that's what it was, since after the network change it finally
connects.

On Thu, Jan 13, 2011 at 10:44 PM, rishi pathak mailmaverick...@gmail.com wrote:

 Hello,
It can be a firewall issue. Try telnet from master and from
 slaves and see:

 (master)# telnet master 8020

 and
 (slave)# telnet master 8020


 On Fri, Jan 14, 2011 at 8:19 AM, Mark Kerzner markkerz...@gmail.com
 wrote:

  I did this, and tried both ways:
 
  hadoop fs -ls /
  11/01/14 02:45:25 INFO ipc.Client: Retrying connect to server: /
  10.113.118.244:8020. Already tried 0 time(s).
  11/01/14 02:45:29 INFO ipc.Client: Retrying connect to server: /
  10.113.118.244:8020. Already tried 1 time(s).
 
  and
 
  hadoop fs -ls hdfs://10.113.118.244/
  11/01/14 02:48:56 INFO ipc.Client: Retrying connect to server: /
  10.113.118.244:8020. Already tried 0 time(s).
 
  I am suspecting the port 8020 - how do I test it outside of hadoop?
 
  Thank you,
  Mark
 
  On Thu, Jan 13, 2011 at 8:29 PM, George Datskos 
  george.dats...@jp.fujitsu.com wrote:
 
   Hello
  
  
   On 2011/01/14 10:02, Mark Kerzner wrote:
  
   hadoop fs -ls hdfs://master-url/
  
   works locally on the master, but cannot connect from any of the
 slaves.
  
  
   Make sure to replicate conf/core-site.xml to each of the slaves.  The
   fs.default.name property should point to the master node.  That way
 the
   slaves know how to reach the NameNode.
  
    <name>fs.default.name</name>
    <value>hdfs://master:8020</value>
  
  
   (adjust the host name and port as necessary)
  
  
   George
  
  
 



 --
 Regards--
 Rishi Pathak
 National PARAM Supercomputing Facility
 Center for Development of Advanced Computing(C-DAC)
 Pune University Campus,Ganesh Khind Road
 Pune-Maharastra