RE: libhdfs install dep

2012-09-25 Thread Leo Leung
Rodrigo,
  Assuming you are asking about Hadoop 1.x:

  You are missing the hadoop-*libhdfs* rpm.
  Build it yourself, or get it from the vendor you got your Hadoop from.
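
  A quick way to check what is actually installed (a hedged sketch; the exact
  package and capability names vary by vendor, and <hadoop-rpm-name> is a placeholder):

  $ rpm -qa | grep -i hadoop                     # which hadoop rpms are installed
  $ rpm -q --whatprovides 'libhdfs.so()(64bit)'  # does any installed rpm claim to provide libhdfs?
  $ rpm -ql <hadoop-rpm-name> | grep libhdfs     # where the library actually landed on disk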

 

-Original Message-
From: Pastrana, Rodrigo (RIS-BCT) [mailto:rodrigo.pastr...@lexisnexis.com] 
Sent: Monday, September 24, 2012 8:20 PM
To: 'core-u...@hadoop.apache.org'
Subject: libhdfs install dep

Does anybody know why libhdfs.so is not found by package managers on CentOS 64 and
openSUSE 64?

I have an rpm which declares Hadoop as a dependency, but the package managers
(KPackageKit, zypper, etc.) report libhdfs.so as a missing dependency even though
Hadoop has been installed via an rpm package and libhdfs.so is installed as well.

Thanks, Rodrigo.



RE: hadoop dfs -ls

2012-07-13 Thread Leo Leung
Hi Nitin,



Normally your conf should reside in /etc/hadoop/conf (if you don't have one,
copy it from the namenode and keep it in sync).

The hadoop script by default depends on hadoop-setup.sh, which in turn depends on
hadoop-env.sh in /etc/hadoop/conf.



Or, at runtime, specify the config dir explicitly, i.e.:

[hdfs]$  hadoop [--config <path to your config dir>] <command>
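
For example (the paths here are only illustrative):

[hdfs]$  export HADOOP_CONF_DIR=/etc/hadoop/conf          # picked up by the hadoop script
[hdfs]$  hadoop --config /etc/hadoop/conf fs -ls /        # or pass the dir per invocation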





P.S. Some useful links:

http://wiki.apache.org/hadoop/FAQ

http://wiki.apache.org/hadoop/FrontPage

http://wiki.apache.org/hadoop/

http://hadoop.apache.org/common/docs/r1.0.3/









-Original Message-
From: d...@paraliatech.com [mailto:d...@paraliatech.com] On Behalf Of Dave Beech
Sent: Friday, July 13, 2012 6:18 AM
To: common-user@hadoop.apache.org
Subject: Re: hadoop dfs -ls



Hi Nitin



It's likely that your hadoop command isn't finding the right configuration.

In particular, it doesn't know where your namenode is (the fs.default.name setting
in core-site.xml).



Maybe you need to set the HADOOP_CONF_DIR environment variable to point to your 
conf directory.
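
For reference, the relevant property in core-site.xml looks like this (the
host:port below is only a placeholder, not taken from this thread):

  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:8020</value>  <!-- placeholder; use your NN's host:port -->
  </property>

If the setting is missing (or left at the default file:///), hadoop dfs -ls /
lists the local Linux filesystem, which matches what you are seeing.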



Dave



On 13 July 2012 14:11, Nitin Pawar nitinpawar...@gmail.com wrote:



 Hi,



 I have done this setup numerous times, but this time I did it after some break.



 I managed to get the cluster up and running fine, but when I do hadoop dfs -ls /



 it actually shows me the contents of the Linux filesystem.



 I am using hadoop-1.0.3 on RHEL 5.6.



 Can anyone suggest what I must have done wrong?



 --

 Nitin Pawar




RE: JAVA_HOME is not set

2012-07-05 Thread Leo Leung
I don't think OpenJDK is supported. There were a lot of problems with it.

But feel free to give it a try; if you run into JVM crashes, use the Oracle (Sun)
JDK 6 (NOT 7).
Harsh had a good post before regarding the JVM(s).



-Original Message-
From: Simon [mailto:gsmst...@gmail.com] 
Sent: Thursday, July 05, 2012 9:53 AM
To: common-user@hadoop.apache.org
Cc: huangyi...@gmail.com
Subject: Re: JAVA_HOME is not set

I think you should set JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre

JAVA_HOME is the base location of Java, i.e. the directory where the java
executable can be found at $JAVA_HOME/bin/java.
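
A minimal check, using the path from this thread:

$ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre
$ $JAVA_HOME/bin/java -version    # should print the JVM version if JAVA_HOME is the base dir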

Regards,
Simon


On Thu, Jul 5, 2012 at 12:42 PM, Ying Huang huangyi...@gmail.com wrote:

 Hello,
 I am installing hadoop according to this page:
 https://cwiki.apache.org/BIGTOP/how-to-install-hadoop-distribution-from-bigtop.html
 I think I have successfully installed hadoop on my Ubuntu 12.04 x64.
 Then I go to the Running Hadoop step; below are my steps. Why does it
 prompt that my JAVA_HOME is not set?
 ------------------------------------------------------------
 root@ubuntu32:/usr/lib/hadoop# export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre/bin/java
 root@ubuntu32:/usr/lib/hadoop# sudo -u hdfs hadoop namenode -format
 Error: JAVA_HOME is not set.
 root@ubuntu32:/usr/lib/hadoop# ls $JAVA_HOME -al
 -rwxr-xr-x 1 root root 5588 May  2 20:14 /usr/lib/jvm/java-7-openjdk-i386/jre/bin/java
 root@ubuntu32:/usr/lib/hadoop#
 ------------------------------------------------------------


 --


 Best Regards
 Ying Huang




RE: 8021 failed on connection exception: java.net.ConnectException: Connection refused at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)

2012-05-16 Thread Leo Leung
Hmm...
Waqas, I think you are using a pre-1.6.0_20 build, and it is likely OpenJDK.

Please try the Sun/Oracle JDK 1.6.0_26+ (and, as Harsh said, stay away from 1.7).

And if I'm reading the logs right, you are using the older release of HDP (v1.0.0)?
If this is the case, you may also want to check with the Hortonworks team.

Their distro's datanode uses a 32-bit JDK and the NN uses a 64-bit one.  You'll have
to be careful to run the right JVM for each node type on the same host.

BTW, I also think you are starting the JT as waqas; try running it under mapred.
Something like
$ su - mapred -c "<path to your hadoop-daemon.sh> start jobtracker"    (once you
fix the JVM versions)
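
A quick sanity check on each host (the hadoop-env.sh path is assumed; adjust for
your install):

$ java -version                                   # should report Sun/Oracle 1.6.0_26 or later
$ grep JAVA_HOME /etc/hadoop/conf/hadoop-env.sh   # point each node type at the right 32/64-bit JDK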

Cheers



-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Wednesday, May 16, 2012 8:59 AM
To: common-user@hadoop.apache.org
Subject: Re: 8021 failed on connection exception: java.net.ConnectException: 
Connection refused at 
org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)

Hi,

On Wed, May 16, 2012 at 9:17 PM, waqas latif waqas...@gmail.com wrote:
 Hi, I tried it with Java 6 but with no success. Here are the links for the
 log and out files of the jobtracker with Java 6. Logfile link:
 http://pastebin.com/bvWZRt0A

 The outfile link is here, which is a bit different from the Java 7 one:
 http://pastebin.com/4YCZhQGh

Now this does look odd (and is again the same thing). What exactly is your
Java 6 version, i.e. the java -version output?

Do you also get the same error if you run "hadoop jobtracker" directly on the CLI?

 Also, please keep in mind that I can run Hadoop 0.20 with the Java home
 path set to Java 7.

You may be able to, but none of us presently test it with that configuration. 
So if you run into bugs or odd behavior with that, you'll pretty much be alone 
:)

--
Harsh J


RE: Question on MapReduce

2012-05-11 Thread Leo Leung
Nope, you must tune the config on that specific super node to have more M/R
slots (this is for 1.0.x); see the sketch below.
This does not mean the JobTracker will be eager to stuff that super node with
all the M/R jobs at hand.
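
On 1.0.x the slot counts are set per TaskTracker in mapred-site.xml on that node;
something like the following (the values are only an example, size them to the
node's cores and memory):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>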

It still goes through the scheduler; the Capacity Scheduler is most likely what
you have (check your config).

IMO, if the data locality is not going to be there, your cluster is going to
suffer from network I/O.


-Original Message-
From: Satheesh Kumar [mailto:nks...@gmail.com] 
Sent: Friday, May 11, 2012 9:51 AM
To: common-user@hadoop.apache.org
Subject: Question on MapReduce

Hi,

I am a newbie on Hadoop and have a quick question on optimal compute vs.
storage resources for MapReduce.

If I have a multiprocessor node with 4 processors, will Hadoop schedule a higher
number of Map or Reduce tasks on that system than on a uni-processor system? In
other words, does Hadoop detect denser systems and schedule tasks more densely on
multiprocessor systems?

If yes, does that imply that it makes sense to attach higher-capacity storage to
store a larger number of blocks on systems with dense compute?

Any insights will be very useful.

Thanks,
Satheesh


RE: Question on MapReduce

2012-05-11 Thread Leo Leung

This may be dated material.

Cloudera and HDP folks, please correct with updates :)

http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/

http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/

Hope this helps.



-Original Message-
From: Satheesh Kumar [mailto:nks...@gmail.com] 
Sent: Friday, May 11, 2012 12:48 PM
To: common-user@hadoop.apache.org
Subject: Re: Question on MapReduce

Thanks, Leo. What is the config of a typical data node in a Hadoop cluster
- cores, storage capacity, and connectivity (SATA?)? How many tasktrackers are
scheduled per core in general?

Is there a best practices guide somewhere?

Thanks,
Satheesh

On Fri, May 11, 2012 at 10:48 AM, Leo Leung lle...@ddn.com wrote:

 Nope, you must tune the config on that specific super node to have 
 more M/R slots (this is for 1.0.x) This does not mean the JobTracker 
 will be eager to stuff that super node with all the M/R jobs at hand.

 It still goes through the scheduler,  Capacity Scheduler is most 
 likely what you have.  (check your config)

 IMO, If the data locality is not going to be there, your cluster is 
 going to suffer from Network I/O.


 -Original Message-
 From: Satheesh Kumar [mailto:nks...@gmail.com]
 Sent: Friday, May 11, 2012 9:51 AM
 To: common-user@hadoop.apache.org
 Subject: Question on MapReduce

 Hi,

 I am a newbie on Hadoop and have a quick question on optimal compute vs.
 storage resources for MapReduce.

 If I have a multiprocessor node with 4 processors, will Hadoop 
 schedule higher number of Map or Reduce tasks on the system than on a 
 uni-processor system? In other words, does Hadoop detect denser 
 systems and schedule denser tasks on multiprocessor systems?

 If yes, will that imply that it makes sense to attach higher capacity 
 storage to store more number of blocks on systems with dense compute?

 Any insights will be very useful.

 Thanks,
 Satheesh



RE: Why is hadoop build I generated from a release branch different from release build?

2012-03-08 Thread Leo Leung
Hi Pawan,

  ant -p (not for 0.23+) will list the available build targets.

  Use mvn (maven) for 0.23 or newer.
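
  For reference (a rough sketch; the exact profiles and flags can vary by branch):

  $ ant -p                                  # 0.20/1.x: list available build targets
  $ mvn package -Pdist -DskipTests -Dtar    # 0.23+: build a binary distribution tarball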



-Original Message-
From: Matt Foley [mailto:mfo...@hortonworks.com] 
Sent: Thursday, March 08, 2012 3:52 PM
To: common-user@hadoop.apache.org
Subject: Re: Why is hadoop build I generated from a release branch different 
from release build?

Hi Pawan,
The complete way releases are built (for v0.20/v1.0) is documented at
http://wiki.apache.org/hadoop/HowToRelease#Building
However, that does a bunch of stuff you don't need, like generating the
documentation and doing a ton of cross-checks.

The full set of ant build targets is defined in build.xml at the top level of
the source code tree.
"binary" may be the target you want.
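
For the 0.20-security branch that would be something like (the output location is
an assumption based on the directory listing below):

$ ant binary    # should produce a release-style layout (bin/, conf/ and the jars) under build/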

--Matt

On Thu, Mar 8, 2012 at 3:35 PM, Pawan Agarwal pawan.agar...@gmail.com wrote:

 Hi,

 I am trying to generate Hadoop binaries from source and run Hadoop from
 the build I generate. I am able to build; however, I am seeing that the
 *bin* folder which comes with the Hadoop installation is not generated in
 my build. Can someone tell me how to do a build that is equivalent to the
 Hadoop release build and which can be used directly to run Hadoop?

 Here are the details.
 Desktop: Ubuntu Server 11.10
 Hadoop version for installation: 0.20.203.0  (link:
 http://mirrors.gigenet.com/apache//hadoop/common/hadoop-0.20.203.0/)
 Hadoop branch used to build: branch-0.20-security-203
 Build command used: ant maven-install

 Here's the directory structures from build I generated vs hadoop 
 official release build.

 *Hadoop directory which I generated:*
 pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls -1
 ant
 c++
 classes
 contrib
 examples
 hadoop-0.20-security-203-pawan
 hadoop-ant-0.20-security-203-pawan.jar
 hadoop-core-0.20-security-203-pawan.jar
 hadoop-examples-0.20-security-203-pawan.jar
 hadoop-test-0.20-security-203-pawan.jar
 hadoop-tools-0.20-security-203-pawan.jar
 ivy
 jsvc
 src
 test
 tools
 webapps

 *Official Hadoop build installation*
 pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls /hadoop -1
 bin
 build.xml
 c++
 CHANGES.txt
 conf
 contrib
 docs
 hadoop-ant-0.20.203.0.jar
 hadoop-core-0.20.203.0.jar
 hadoop-examples-0.20.203.0.jar
 hadoop-test-0.20.203.0.jar
 hadoop-tools-0.20.203.0.jar
 input
 ivy
 ivy.xml
 lib
 librecordio
 LICENSE.txt
 logs
 NOTICE.txt
 README.txt
 src
 webapps



 Any pointers for help are greatly appreciated.

 Also, if there are any other resources for understanding the Hadoop build
 system, pointers to those would also be helpful.

 Thanks
 Pawan



RE: Hadoop and Hibernate

2012-03-02 Thread Leo Leung
Geoffry,

 Hadoop's DistributedCache (as of now) is used to cache M/R application-specific
files.
 These files are used by the M/R app only, not the framework (normally as a
side-lookup).

 You can certainly try to use Hibernate to query your SQL-based back-end within
the M/R code.
 But think of what happens when a few hundred or a few thousand M/R tasks do that
concurrently.
 Your back-end is going to cry (if it can, before it dies).

 So IMO, prepping your M/R job with DistributedCache files (pulling the data down
first) is a better approach.
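
 A rough sketch of that approach via the generic options (the jar, class, and file
 names below are made up, and it assumes the driver uses ToolRunner so -files is
 parsed):

 $ hadoop jar myjob.jar com.example.MyDriver \
       -files /local/path/lookup-data.dat \
       /input /output

 Each task can then read lookup-data.dat from its working directory instead of
 having hundreds of concurrent mappers hit the SQL back-end.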

 Also, MPI is pretty much out of the question (not baked into the framework).
 You'll likely have to roll your own (and try to trick the JobTracker into not
starting the same task).

 Anyone have a better solution for Geoffry?



-Original Message-
From: Geoffry Roberts [mailto:geoffry.robe...@gmail.com] 
Sent: Friday, March 02, 2012 9:42 AM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop and Hibernate

This is a tardy response.  I'm spread pretty thinly right now.

DistributedCache (http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
is apparently deprecated.  Is there a replacement?  I didn't see anything about
this in the documentation, but then I am still using 0.21.0. I have to for
performance reasons.  1.0.1 is too slow and the client won't have it.

Also, the DistributedCache approach
(http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
seems only to work from within a Hadoop job, i.e. from within a Mapper or a
Reducer, but not from within a Driver.  I have libraries that I must access from
both places.  I take it that I am stuck keeping two copies of these libraries in
sync - correct?  It's either that, or copy them into HDFS, replacing them all at
the beginning of each job run.

Looking for best practices.

Thanks

On 28 February 2012 10:17, Owen O'Malley omal...@apache.org wrote:

 On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts 
 geoffry.robe...@gmail.com wrote:

  If I create an executable jar file that contains all dependencies required
  by the MR job, do all said dependencies get distributed to all nodes?

 You can make a single jar and that will be distributed to all of the 
 machines that run the task, but it is better in most cases to use the 
 distributed cache.

 See
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache

  If I specify but one reducer, which node in the cluster will the 
  reducer run on?

 The scheduling is done by the JobTracker and it isn't possible to 
 control the location of the reducers.

 -- Owen




--
Geoffry Roberts