Re: How to verify all my master/slave name/data nodes have been configured correctly?

2012-03-08 Thread madhu phatak
Hi,
 Use the JobTracker web UI at master:50030 and the NameNode web UI at
master:50070.
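
If you want to check it programmatically as well, a rough sketch like the one
below (it assumes your core-site/hdfs-site config, with fs.default.name
pointing at the master, is on the classpath; the class name is just for
illustration) prints the datanodes the NameNode currently reports:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Rough sketch: list the datanodes the NameNode reports.
public class ListDataNodes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    if (fs instanceof DistributedFileSystem) {
      DatanodeInfo[] nodes = ((DistributedFileSystem) fs).getDataNodeStats();
      System.out.println("Datanodes reported by the NameNode: " + nodes.length);
      for (DatanodeInfo node : nodes) {
        System.out.println("  " + node.getName());
      }
    } else {
      System.out.println("Not talking to HDFS, current fs is: " + fs.getUri());
    }
  }
}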

On Fri, Feb 10, 2012 at 9:03 AM, Wq Az  wrote:

> Hi,
> Is there a quick way to check this?
> Thanks ahead,
> Will
>



-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: Regression on Hadoop ?

2012-03-08 Thread madhu phatak
Hi,
 You can look here https://github.com/zinnia-phatak-dev/Nectar

On Thu, Feb 9, 2012 at 12:09 PM, praveenesh kumar wrote:

> Guys,
>
> Is there any regression API/tool that is developed on top of hadoop *(APART
> from mahout) *?
>
> Thanks,
> Praveenesh
>



-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: Standalone operation - file permission, Pseudo-Distributed operation - no output

2012-03-08 Thread madhu phatak
Hi,
Just make sure both the TaskTracker and DataNode are up. Go to localhost:50030
and check whether it shows the number of nodes equal to 1.

On Thu, Feb 9, 2012 at 9:18 AM, Kyong-Ho Min wrote:

> Hello,
>
> I am a hadoop newbie and I have 2 questions.
>
> I followed Hadoop standalone mode testing.
> I got error message from Cygwin terminal  like file permission error.
> I checked out mailing list and changed the part in RawLocalFileSystem.java
> but not working.
> Still I have file permission error in the directory:
> c:/tmp/hadoop../mapred/staging...
>
>
> I followed instruction about Pseudo-Distributed operation.
> Ssh is OK and namenode -format is OK.
> But it did not return any results and the processing is just halted.
> The Cygwin console scripts are
>
> -
> $ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
> 12/02/09 14:25:44 INFO mapred.FileInputFormat: Total input paths to
> process : 17
> 12/02/09 14:25:44 INFO mapred.JobClient: Running job: job_201202091423_0001
> 12/02/09 14:25:45 INFO mapred.JobClient:  map 0% reduce 0%
> -
>
> Any help pls.
> Thanks.
>
> Kyongho Min
>



-- 
https://github.com/zinnia-phatak-dev/Nectar


Re: Can I start a Hadoop job from an EJB?

2012-03-08 Thread madhu phatak
Yes, you can. Please make sure all the Hadoop jars and the conf directory are
on the classpath.
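
As a rough sketch (the bean, paths and job name here are hypothetical, not a
tested recipe), a message-driven bean that kicks off a job with the old
mapred API could look like this:

import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical MDB that launches a Hadoop job when a "file arrived" message
// comes in. It assumes the Hadoop jars and conf directory are on the app
// server's classpath, as noted above.
@MessageDriven
public class HadoopJobLauncherBean implements MessageListener {
  public void onMessage(Message msg) {
    try {
      JobConf conf = new JobConf(HadoopJobLauncherBean.class);
      conf.setJobName("triggered-from-ejb");
      // Mapper/reducer classes would be set here as in any other driver.
      FileInputFormat.setInputPaths(conf, new Path("/incoming"));
      FileOutputFormat.setOutputPath(conf, new Path("/processed"));
      // runJob blocks until the job finishes; submitJob returns immediately.
      JobClient.runJob(conf);
    } catch (Exception e) {
      throw new RuntimeException("Hadoop job failed", e);
    }
  }
}

Whether to block in the MDB (runJob) or fire-and-forget (submitJob) depends
on how long your jobs run and how your MDB pool is sized.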

On Thu, Feb 9, 2012 at 7:02 AM, Sanjeev Verma wrote:

> This is based on my understanding and no real life experience, so going to
> go out on a limb here :-)...assuming that you are planning on kicking off
> this map-reduce job based on a event of sorts (a file arrived and is ready
> to be processed?), and no direct "user wait" is involved, then yes, I would
> imagine you should be able to do something like this from inside a MDB
> (asynchronous so no one is held up in queue). Some random thoughts:
>
> 1. The user under which the app server is running will need to be a setup
> as a hadoop client user - this is rather obvious, just wanted to list it
> for completeness.
> 2. Hadoop, AFAIK, does not support transactions, and no XA. I assume you
> have no need for any of that stuff either.
> 3. Your MDB could potentially log job start/end times, but that info is
> available from Hadoop's monitoring infrastructure also.
>
> I would be very interested in hearing what senior members on the list have
> to say...
>
> HTH
>
> Sanjeev
>
> On Wed, Feb 8, 2012 at 2:18 PM, Andy Doddington 
> wrote:
>
> > OK, I have a working Hadoop application that I would like to integrate
> > into an application
> > server environment. So, the question arises: can I do this? E.g. can I
> > create a JobClient
> > instance inside an EJB and run it in the normal way, or is something more
> > complex
> > required? In addition, are there any unpleasant interactions between the
> > application
> > server and the hadoop runtime?
> >
> > Thanks for any guidance.
> >
> >Andy D.
>



-- 
https://github.com/zinnia-phatak-dev/Nectar


Re: Standalone operation - file permission, Pseudo-Distributed operation - no output

2012-03-08 Thread Jagat
Hello

Can you please tell us which version of Hadoop you are using, and also:

Does your error match the message below?

Failed to set permissions of path:
file:/tmp/hadoop-jj/mapred/staging/jj-1931875024/.staging to 0700

Thanks
Jagat


On Thu, Mar 8, 2012 at 5:10 PM, madhu phatak  wrote:

> Hi,
> Just make sure both task tracker and data node is up. Go to localhost:50030
> and see is it shows no.of nodes equal to 1?
>
> On Thu, Feb 9, 2012 at 9:18 AM, Kyong-Ho Min  >wrote:
>
> > Hello,
> >
> > I am a hadoop newbie and I have 2 questions.
> >
> > I followed Hadoop standalone mode testing.
> > I got error message from Cygwin terminal  like file permission error.
> > I checked out mailing list and changed the part in
> RawLocalFileSystem.java
> > but not working.
> > Still I have file permission error in the directory:
> > c:/tmp/hadoop../mapred/staging...
> >
> >
> > I followed instruction about Pseudo-Distributed operation.
> > Ssh is OK and namenode -format is OK.
> > But it did not return any results and the processing is just halted.
> > The Cygwin console scripts are
> >
> > -
> > $ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
> > 12/02/09 14:25:44 INFO mapred.FileInputFormat: Total input paths to
> > process : 17
> > 12/02/09 14:25:44 INFO mapred.JobClient: Running job:
> job_201202091423_0001
> > 12/02/09 14:25:45 INFO mapred.JobClient:  map 0% reduce 0%
> > -
> >
> > Any help pls.
> > Thanks.
> >
> > Kyongho Min
> >
>
>
>
> --
> https://github.com/zinnia-phatak-dev/Nectar
>


hadoop & DNS : a short tutorial and a tool to check DNS in a cluster

2012-03-08 Thread Sujee Maniyam
HI all,

I have a handy utility that will verify DNS settings in a cluster.
https://github.com/sujee/hadoop-dns-checker

- It is written in pure Java and doesn't use any third-party libraries, so it
is very easy to compile and run.
- It does both forward IP lookup and reverse DNS lookup (a minimal sketch of
these checks follows below).
- It will also check whether the machine's own hostname resolves correctly.
- It can run on a single machine.
- It can run on machines across a cluster (using password-less ssh).
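
The core of those lookups boils down to something like this (a simplified
sketch using only java.net.InetAddress, not the actual tool code):

import java.net.InetAddress;

// Simplified DNS sanity check: forward lookup, reverse lookup, and a
// round-trip comparison. Illustration only.
public class DnsSanityCheck {
  public static void main(String[] args) throws Exception {
    String host = args.length > 0 ? args[0]
        : InetAddress.getLocalHost().getHostName();

    // forward lookup: hostname -> IP
    InetAddress addr = InetAddress.getByName(host);
    System.out.println(host + " resolves to " + addr.getHostAddress());

    // reverse lookup: IP -> canonical hostname
    String reverse = addr.getCanonicalHostName();
    System.out.println(addr.getHostAddress() + " reverse-resolves to " + reverse);

    // a healthy Hadoop node should round-trip to the same name
    if (!reverse.equalsIgnoreCase(host)) {
      System.out.println("WARNING: forward and reverse lookups do not match");
    }
  }
}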

would love to get some feedback from others...

Also have a small tutorial on this subject here :
http://sujee.net/tech/articles/hadoop/hadoop-dns/

regards
Sujee
http://sujee.net


Convergence on File Format?

2012-03-08 Thread Michal Klos
Hi,

It seems that Avro is poised to become "the" file format. Is that still the
case?

We've looked at Text, RCFile and Avro. Text is nice, but we'd really need to
extend it. RCFile is great for Hive, but it has been a challenge to use it
outside of Hive. Avro has a great feature set, but in our testing it is
significantly slower and larger on disk than RCFile; still, if it has the
highest rate of development, it may be the right choice.
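
(To make the "outside of Hive" part concrete: standalone Avro writes with the
generic-record API are only a few lines. The sketch below assumes Avro 1.x
and uses a made-up schema, purely to show the shape of the API.)

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Minimal standalone Avro writer; schema and field names are made up.
public class AvroWriteSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"word\",\"type\":\"string\"},"
      + "{\"name\":\"count\",\"type\":\"int\"}]}");

    GenericRecord rec = new GenericData.Record(schema);
    rec.put("word", "hadoop");
    rec.put("count", 42);

    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("events.avro"));
    writer.append(rec);
    writer.close();
  }
}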

If you were choosing a file format today to build a general-purpose cluster
(general purpose in the sense of using all the Hadoop tools, not just Hive),
what would you choose? (One of the choices being development of a custom
format.)

Thanks,

Mike



Re: Convergence on File Format?

2012-03-08 Thread Serge Blazhievsky
We started using Avro a few months ago and the results are great!

Easy to use, reliable, feature-rich, and great integration with MapReduce.

On 3/8/12 3:07 PM, "Michal Klos"  wrote:

>Hi,
>
>It seems that  Avro is poised to become "the" file format, is that still
>the case?
>
>We've looked at Text, RCFile and Avro. Text is nice, but we'd really need
>to extend it. RCFile is great for Hive, but it has been a challenge using
>it outside of Hive. Avro has a great feature set, but is comparably (to
>RCFile) significantly slower and larger on disk in our testing, but if it
>has the highest rate of development, it may be the right choice.
>
>If you were choosing a File Format today to build a general purpose
>cluster (general purpose in the sense of using all the Hadoop tools, not
>just Hive), what would you choose? (one of the choices being development
>of a Custom format)
>
>Thanks,
>
>Mike
>



Re: Profiling Hadoop Job

2012-03-08 Thread Leonardo Urbina
Does anyone have any idea how to solve this problem? Regardless of whether
I'm using plain HPROF or profiling through Starfish, I am getting the same
error:

Exception in thread "main" java.io.FileNotFoundException:
attempt_201203071311_0004_m_
00_0.profile (Permission denied)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:194)
at java.io.FileOutputStream.(FileOutputStream.java:84)
at
org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226)
at
org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251)
at
com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at
com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

But I can't find what permissions to change to fix this issue. Any ideas?
Thanks in advance,

Best,
-Leo


On Wed, Mar 7, 2012 at 3:52 PM, Leonardo Urbina  wrote:

> Thanks,
> -Leo
>
>
> On Wed, Mar 7, 2012 at 3:47 PM, Jie Li  wrote:
>
>> Hi Leo,
>>
>> Thanks for pointing out the outdated README file.  Glad to tell you that
>> we
>> do support the old API in the latest version. See here:
>>
>> http://www.cs.duke.edu/starfish/previous.html
>>
>> Welcome to join our mailing list and your questions will reach more of our
>> group members.
>>
>> Jie
>>
>> On Wed, Mar 7, 2012 at 3:37 PM, Leonardo Urbina  wrote:
>>
>> > Hi Jie,
>> >
>> > According to the Starfish README, the hadoop programs must be written
>> using
>> > the new Hadoop API. This is not my case (I am using MultipleInputs among
>> > other non-new API supported features). Is there any way around this?
>> > Thanks,
>> >
>> > -Leo
>> >
>> > On Wed, Mar 7, 2012 at 3:19 PM, Jie Li  wrote:
>> >
>> > > Hi Leonardo,
>> > >
>> > > You might want to try Starfish which supports the memory profiling as
>> > well
>> > > as cpu/disk/network profiling for the performance tuning.
>> > >
>> > > Jie
>> > > --
>> > > Starfish is an intelligent performance tuning tool for Hadoop.
>> > > Homepage: www.cs.duke.edu/starfish/
>> > > Mailing list: http://groups.google.com/group/hadoop-starfish
>> > >
>> > >
>> > > On Wed, Mar 7, 2012 at 2:36 PM, Leonardo Urbina 
>> wrote:
>> > >
>> > > > Hello everyone,
>> > > >
>> > > > I have a Hadoop job that I run on several GBs of data that I am
>> trying
>> > to
>> > > > optimize in order to reduce the memory consumption as well as
>> improve
>> > the
>> > > > speed. I am following the steps outlined in Tom White's "Hadoop: The
>> > > > Definitive Guide" for profiling using HPROF (p161), by setting the
>> > > > following properties in the JobConf:
>> > > >
>> > > >job.setProfileEnabled(true);
>> > > >
>> > > >
>> job.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites,depth=6,"
>> > +
>> > > >"force=n,thread=y,verbose=n,file=%s");
>> > > >job.setProfileTaskRange(true, "0-2");
>> > > >job.setProfileTaskRange(false, "0-2");
>> > > >
>> > > > I am trying to run this locally on a single pseudo-distributed
>> install
>> > of
>> > > > hadoop (0.20.2) and it gives the following error:
>> > > >
>> > > > Exception in thread "main" java.io.FileNotFoundException:
>> > > > attempt_201203071311_0004_m_00_0.profile (Permission denied)
>> > > >at java.io.FileOutputStream.open(Native Method)
>> > > >at java.io.FileOutputStream.(FileOutputStream.java:194)
>> > > >at java.io.FileOutputStream.(FileOutputStream.java:84)
>> > > >at
>> > > >
>> org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226)
>> > > >at
>> > > >
>> > >
>> >
>> org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302)
>> > > >at
>> > org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251)
>> > > >at
>> > > >
>> > > >
>> > >
>> >
>> com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89)
>> > > >at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> > > >at
>> > > >
>> > > >
>> > >
>> >
>> com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94)
>> > > >at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>> Method)
>> > > >at
>> > > >
>> > > >
>> > >
>> >
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> > > >at
>> > > >
>> > > >
>> > >
>> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke

Re: Why is hadoop build I generated from a release branch different from release build?

2012-03-08 Thread Matt Foley
Hi Pawan,
The complete way releases are built (for v0.20/v1.0) is documented at
http://wiki.apache.org/hadoop/HowToRelease#Building
However, that does a bunch of stuff you don't need, like generating the
documentation and doing a ton of cross-checks.

The full set of ant build targets is defined in build.xml at the top level
of the source code tree.
"binary" may be the target you want.

--Matt

On Thu, Mar 8, 2012 at 3:35 PM, Pawan Agarwal wrote:

> Hi,
>
> I am trying to generate hadoop binaries from source and execute hadoop from
> the build I generate. I am able to build, however I am seeing that as part
> of build *bin* folder which comes with hadoop installation is not generated
> in my build. Can someone tell me how to do a build so that I can generate
> build equivalent to hadoop release build and which can be used directly to
> run hadoop.
>
> Here's the details.
> Desktop: Ubuntu Server 11.10
> Hadoop version for installation: 0.20.203.0  (link:
> http://mirrors.gigenet.com/apache//hadoop/common/hadoop-0.20.203.0/)
> Hadoop Branch used build: branch-0.20-security-203
> Build Command used: "Ant maven-install"
>
> Here's the directory structures from build I generated vs hadoop official
> release build.
>
> *Hadoop directory which I generated:*
> pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls -1
> ant
> c++
> classes
> contrib
> examples
> hadoop-0.20-security-203-pawan
> hadoop-ant-0.20-security-203-pawan.jar
> hadoop-core-0.20-security-203-pawan.jar
> hadoop-examples-0.20-security-203-pawan.jar
> hadoop-test-0.20-security-203-pawan.jar
> hadoop-tools-0.20-security-203-pawan.jar
> ivy
> jsvc
> src
> test
> tools
> webapps
>
> *Official Hadoop build installation*
> pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls /hadoop -1
> bin
> build.xml
> c++
> CHANGES.txt
> conf
> contrib
> docs
> hadoop-ant-0.20.203.0.jar
> hadoop-core-0.20.203.0.jar
> hadoop-examples-0.20.203.0.jar
> hadoop-test-0.20.203.0.jar
> hadoop-tools-0.20.203.0.jar
> input
> ivy
> ivy.xml
> lib
> librecordio
> LICENSE.txt
> logs
> NOTICE.txt
> README.txt
> src
> webapps
>
>
>
> Any pointers for help are greatly appreciated?
>
> Also, if there are any other resources for understanding hadoop build
> system, pointers to that would be also helpful.
>
> Thanks
> Pawan
>


Re: Convergence on File Format?

2012-03-08 Thread Russell Jurney
Avro support in Pig will be fairly mature in 0.10.

Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com

On Mar 8, 2012, at 3:10 PM, Serge Blazhievsky
 wrote:

> We started using Avro few month ago and results are great!
>
> Easy to use, reliable, feature rich, great integration with MapReduce
>
> On 3/8/12 3:07 PM, "Michal Klos"  wrote:
>
>> Hi,
>>
>> It seems that  Avro is poised to become "the" file format, is that still
>> the case?
>>
>> We've looked at Text, RCFile and Avro. Text is nice, but we'd really need
>> to extend it. RCFile is great for Hive, but it has been a challenge using
>> it outside of Hive. Avro has a great feature set, but is comparably (to
>> RCFile) significantly slower and larger on disk in our testing, but if it
>> has the highest rate of development, it may be the right choice.
>>
>> If you were choosing a File Format today to build a general purpose
>> cluster (general purpose in the sense of using all the Hadoop tools, not
>> just Hive), what would you choose? (one of the choices being development
>> of a Custom format)
>>
>> Thanks,
>>
>> Mike
>>
>


Re: Profiling Hadoop Job

2012-03-08 Thread Mohit Anchlia
Can you check which user you are running this process as and compare it
with the ownership on the directory?

On Thu, Mar 8, 2012 at 3:13 PM, Leonardo Urbina  wrote:

> Does anyone have any idea how to solve this problem? Regardless of whether
> I'm using plain HPROF or profiling through Starfish, I am getting the same
> error:
>
> Exception in thread "main" java.io.FileNotFoundException:
> attempt_201203071311_0004_m_
> 00_0.profile (Permission denied)
>at java.io.FileOutputStream.open(Native Method)
>at java.io.FileOutputStream.(FileOutputStream.java:194)
>at java.io.FileOutputStream.(FileOutputStream.java:84)
>at
> org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226)
>at
> org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302)
>at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251)
>at
>
> com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>at
>
> com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> But I can't find what permissions to change to fix this issue. Any ideas?
> Thanks in advance,
>
> Best,
> -Leo
>
>
>  On Wed, Mar 7, 2012 at 3:52 PM, Leonardo Urbina  wrote:
>
> > Thanks,
> > -Leo
> >
> >
> > On Wed, Mar 7, 2012 at 3:47 PM, Jie Li  wrote:
> >
> >> Hi Leo,
> >>
> >> Thanks for pointing out the outdated README file.  Glad to tell you that
> >> we
> >> do support the old API in the latest version. See here:
> >>
> >> http://www.cs.duke.edu/starfish/previous.html
> >>
> >> Welcome to join our mailing list and your questions will reach more of
> our
> >> group members.
> >>
> >> Jie
> >>
> >> On Wed, Mar 7, 2012 at 3:37 PM, Leonardo Urbina 
> wrote:
> >>
> >> > Hi Jie,
> >> >
> >> > According to the Starfish README, the hadoop programs must be written
> >> using
> >> > the new Hadoop API. This is not my case (I am using MultipleInputs
> among
> >> > other non-new API supported features). Is there any way around this?
> >> > Thanks,
> >> >
> >> > -Leo
> >> >
> >> > On Wed, Mar 7, 2012 at 3:19 PM, Jie Li  wrote:
> >> >
> >> > > Hi Leonardo,
> >> > >
> >> > > You might want to try Starfish which supports the memory profiling
> as
> >> > well
> >> > > as cpu/disk/network profiling for the performance tuning.
> >> > >
> >> > > Jie
> >> > > --
> >> > > Starfish is an intelligent performance tuning tool for Hadoop.
> >> > > Homepage: www.cs.duke.edu/starfish/
> >> > > Mailing list: http://groups.google.com/group/hadoop-starfish
> >> > >
> >> > >
> >> > > On Wed, Mar 7, 2012 at 2:36 PM, Leonardo Urbina 
> >> wrote:
> >> > >
> >> > > > Hello everyone,
> >> > > >
> >> > > > I have a Hadoop job that I run on several GBs of data that I am
> >> trying
> >> > to
> >> > > > optimize in order to reduce the memory consumption as well as
> >> improve
> >> > the
> >> > > > speed. I am following the steps outlined in Tom White's "Hadoop:
> The
> >> > > > Definitive Guide" for profiling using HPROF (p161), by setting the
> >> > > > following properties in the JobConf:
> >> > > >
> >> > > >job.setProfileEnabled(true);
> >> > > >
> >> > > >
> >> job.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites,depth=6,"
> >> > +
> >> > > >"force=n,thread=y,verbose=n,file=%s");
> >> > > >job.setProfileTaskRange(true, "0-2");
> >> > > >job.setProfileTaskRange(false, "0-2");
> >> > > >
> >> > > > I am trying to run this locally on a single pseudo-distributed
> >> install
> >> > of
> >> > > > hadoop (0.20.2) and it gives the following error:
> >> > > >
> >> > > > Exception in thread "main" java.io.FileNotFoundException:
> >> > > > attempt_201203071311_0004_m_00_0.profile (Permission denied)
> >> > > >at java.io.FileOutputStream.open(Native Method)
> >> > > >at
> java.io.FileOutputStream.(FileOutputStream.java:194)
> >> > > >at
> java.io.FileOutputStream.(FileOutputStream.java:84)
> >> > > >at
> >> > > >
> >> org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226)
> >> > > >at
> >> > > >
> >> > >
> >> >
> >>
> org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302)
> >> > > >at
> >> > org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251)
> >> > > >at
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89)
> >> > > >at
> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:6

RE: Why is hadoop build I generated from a release branch different from release build?

2012-03-08 Thread Leo Leung
Hi Pawan,

  ant -p (not for 0.23+) will tell you the available build targets.

  Use mvn (maven) for 0.23 or newer



-Original Message-
From: Matt Foley [mailto:mfo...@hortonworks.com] 
Sent: Thursday, March 08, 2012 3:52 PM
To: common-user@hadoop.apache.org
Subject: Re: Why is hadoop build I generated from a release branch different 
from release build?

Hi Pawan,
The complete way releases are built (for v0.20/v1.0) is documented at
http://wiki.apache.org/hadoop/HowToRelease#Building
However, that does a bunch of stuff you don't need, like generate the 
documentation and do a ton of cross-checks.

The full set of ant build targets are defined in build.xml in the top level of 
the source code tree.
"binary" may be the target you want.

--Matt

On Thu, Mar 8, 2012 at 3:35 PM, Pawan Agarwal wrote:

> Hi,
>
> I am trying to generate hadoop binaries from source and execute hadoop 
> from the build I generate. I am able to build, however I am seeing 
> that as part of build *bin* folder which comes with hadoop 
> installation is not generated in my build. Can someone tell me how to 
> do a build so that I can generate build equivalent to hadoop release 
> build and which can be used directly to run hadoop.
>
> Here's the details.
> Desktop: Ubuntu Server 11.10
> Hadoop version for installation: 0.20.203.0  (link:
> http://mirrors.gigenet.com/apache//hadoop/common/hadoop-0.20.203.0/)
> Hadoop Branch used build: branch-0.20-security-203 Build Command used: 
> "Ant maven-install"
>
> Here's the directory structures from build I generated vs hadoop 
> official release build.
>
> *Hadoop directory which I generated:*
> pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls -1 ant
> c++
> classes
> contrib
> examples
> hadoop-0.20-security-203-pawan
> hadoop-ant-0.20-security-203-pawan.jar
> hadoop-core-0.20-security-203-pawan.jar
> hadoop-examples-0.20-security-203-pawan.jar
> hadoop-test-0.20-security-203-pawan.jar
> hadoop-tools-0.20-security-203-pawan.jar
> ivy
> jsvc
> src
> test
> tools
> webapps
>
> *Official Hadoop build installation*
> pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls /hadoop -1 
> bin build.xml
> c++
> CHANGES.txt
> conf
> contrib
> docs
> hadoop-ant-0.20.203.0.jar
> hadoop-core-0.20.203.0.jar
> hadoop-examples-0.20.203.0.jar
> hadoop-test-0.20.203.0.jar
> hadoop-tools-0.20.203.0.jar
> input
> ivy
> ivy.xml
> lib
> librecordio
> LICENSE.txt
> logs
> NOTICE.txt
> README.txt
> src
> webapps
>
>
>
> Any pointers for help are greatly appreciated?
>
> Also, if there are any other resources for understanding hadoop build 
> system, pointers to that would be also helpful.
>
> Thanks
> Pawan
>


Re: Profiling Hadoop Job

2012-03-08 Thread Vinod Kumar Vavilapalli
The JobClient is trying to download the profile output to the local
directory. It seems like you don't have write permissions in the
current working directory where you are running the JobClient. Please
check that.
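
A quick way to confirm it (trivial sketch, nothing Hadoop-specific) is to
check whether the JVM's working directory is writable:

import java.io.File;

// The profile files get written to the directory the JobClient runs in,
// so that directory must be writable by the user launching the job.
public class CwdWriteCheck {
  public static void main(String[] args) {
    File cwd = new File(System.getProperty("user.dir"));
    System.out.println("Working directory: " + cwd.getAbsolutePath());
    System.out.println("Writable: " + cwd.canWrite());
  }
}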

HTH.

+Vinod
Hortonworks Inc.
http://hortonworks.com/


On Thu, Mar 8, 2012 at 3:13 PM, Leonardo Urbina  wrote:
> Does anyone have any idea how to solve this problem? Regardless of whether
> I'm using plain HPROF or profiling through Starfish, I am getting the same
> error:
>
> Exception in thread "main" java.io.FileNotFoundException:
> attempt_201203071311_0004_m_
> 00_0.profile (Permission denied)
>        at java.io.FileOutputStream.open(Native Method)
>        at java.io.FileOutputStream.(FileOutputStream.java:194)
>        at java.io.FileOutputStream.(FileOutputStream.java:84)
>        at
> org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226)
>        at
> org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302)
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251)
>        at
> com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at
> com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> But I can't find what permissions to change to fix this issue. Any ideas?
> Thanks in advance,
>
> Best,
> -Leo
>
>
> On Wed, Mar 7, 2012 at 3:52 PM, Leonardo Urbina  wrote:
>
>> Thanks,
>> -Leo
>>
>>
>> On Wed, Mar 7, 2012 at 3:47 PM, Jie Li  wrote:
>>
>>> Hi Leo,
>>>
>>> Thanks for pointing out the outdated README file.  Glad to tell you that
>>> we
>>> do support the old API in the latest version. See here:
>>>
>>> http://www.cs.duke.edu/starfish/previous.html
>>>
>>> Welcome to join our mailing list and your questions will reach more of our
>>> group members.
>>>
>>> Jie
>>>
>>> On Wed, Mar 7, 2012 at 3:37 PM, Leonardo Urbina  wrote:
>>>
>>> > Hi Jie,
>>> >
>>> > According to the Starfish README, the hadoop programs must be written
>>> using
>>> > the new Hadoop API. This is not my case (I am using MultipleInputs among
>>> > other non-new API supported features). Is there any way around this?
>>> > Thanks,
>>> >
>>> > -Leo
>>> >
>>> > On Wed, Mar 7, 2012 at 3:19 PM, Jie Li  wrote:
>>> >
>>> > > Hi Leonardo,
>>> > >
>>> > > You might want to try Starfish which supports the memory profiling as
>>> > well
>>> > > as cpu/disk/network profiling for the performance tuning.
>>> > >
>>> > > Jie
>>> > > --
>>> > > Starfish is an intelligent performance tuning tool for Hadoop.
>>> > > Homepage: www.cs.duke.edu/starfish/
>>> > > Mailing list: http://groups.google.com/group/hadoop-starfish
>>> > >
>>> > >
>>> > > On Wed, Mar 7, 2012 at 2:36 PM, Leonardo Urbina 
>>> wrote:
>>> > >
>>> > > > Hello everyone,
>>> > > >
>>> > > > I have a Hadoop job that I run on several GBs of data that I am
>>> trying
>>> > to
>>> > > > optimize in order to reduce the memory consumption as well as
>>> improve
>>> > the
>>> > > > speed. I am following the steps outlined in Tom White's "Hadoop: The
>>> > > > Definitive Guide" for profiling using HPROF (p161), by setting the
>>> > > > following properties in the JobConf:
>>> > > >
>>> > > >        job.setProfileEnabled(true);
>>> > > >
>>> > > >
>>> job.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites,depth=6,"
>>> > +
>>> > > >                "force=n,thread=y,verbose=n,file=%s");
>>> > > >        job.setProfileTaskRange(true, "0-2");
>>> > > >        job.setProfileTaskRange(false, "0-2");
>>> > > >
>>> > > > I am trying to run this locally on a single pseudo-distributed
>>> install
>>> > of
>>> > > > hadoop (0.20.2) and it gives the following error:
>>> > > >
>>> > > > Exception in thread "main" java.io.FileNotFoundException:
>>> > > > attempt_201203071311_0004_m_00_0.profile (Permission denied)
>>> > > >        at java.io.FileOutputStream.open(Native Method)
>>> > > >        at java.io.FileOutputStream.(FileOutputStream.java:194)
>>> > > >        at java.io.FileOutputStream.(FileOutputStream.java:84)
>>> > > >        at
>>> > > >
>>> org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226)
>>> > > >        at
>>> > > >
>>> > >
>>> >
>>> org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302)
>>> > > >        at
>>> > org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251)
>>> > > >        at
>>> > > >
>>> > > >
>>> > >
>>> >
>>> com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89)
>>> > > >        at org.apache.hadoop.ut

does hadoop always respect setNumReduceTasks?

2012-03-08 Thread Jane Wayne
i am wondering if hadoop always respects Job.setNumReduceTasks(int)?

as i am emitting items from the mapper, i expect/desire only 1 reducer to
get these items, because i want to assign each key of the key-value input
pair a unique integer id. if i had 1 reducer, i could just keep a local
counter (with respect to the reducer instance) and increment it.

on my local hadoop cluster, i noticed that most, if not all, my jobs have
only 1 reducer, regardless of whether or not i set
Job.setNumReduceTasks(int).

however, as soon as i moved the code onto amazon's elastic mapreduce (emr),
i noticed that there are multiple reducers. if i set the number of reduce
tasks to 1, is this always guaranteed? i ask because i don't know if there
is a gotcha like the combiner (where it may or may not run at all).

also, it looks like it might not be a good idea to have just 1 reducer (it
won't scale). it is most likely better to have more than 1 reducer, but in
that case, i lose the ability to assign unique numbers to the key-value
pairs coming in. is there a design pattern out there that addresses this
issue?

my mapper/reducer key-value pair signatures look something like the
following.

mapper(Text, Text, Text, IntWritable)
reducer(Text, IntWritable, IntWritable, Text)

the mapper reads a sequence file whose key-value pairs are of type Text and
Text. i then emit Text (let's say a word) and IntWritable (let's say
frequency of the word).

the reducer gets the word and its frequencies, and then assigns the word an
integer id. it emits IntWritable (the id) and Text (the word).

i remember seeing code from mahout's API where they assign integer ids to
items. the items were already given an id of type long. the conversion they
make is as follows.

public static int idToIndex(long id) {
  return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
}

is there something equivalent for Text or a "word"? i was thinking about
simply taking the hash value of the string/word, but of course, different
strings can map to the same hash value.
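
for concreteness, the single-reducer counter scheme i describe above boils
down to something like this (a rough sketch against the new mapreduce API;
it only works if the job really runs a single reduce task):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: assigns consecutive ids from a counter local to the one
// reducer instance. With more than one reducer the ids would collide.
public class IdAssigningReducer
    extends Reducer<Text, IntWritable, IntWritable, Text> {

  private int nextId = 0;

  @Override
  protected void reduce(Text word, Iterable<IntWritable> frequencies,
                        Context context) throws IOException, InterruptedException {
    context.write(new IntWritable(nextId++), word);
  }
}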


Re: does hadoop always respect setNumReduceTasks?

2012-03-08 Thread Lance Norskog
Instead of String.hashCode(), you can use an MD5-based hash.
This has not, "in the wild", been known to produce a duplicate. (MD5 has been
broken cryptographically, but that's not relevant here.)

http://snippets.dzone.com/posts/show/3686

I think the Partitioner class guarantees that you will have multiple reducers.
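
A rough sketch of folding an MD5 digest of the word into a non-negative int
(illustration only; once you truncate to 31 bits, collisions remain possible,
so treat it as a hash, not a guaranteed unique id; names are made up):

import java.math.BigInteger;
import java.security.MessageDigest;

// Sketch: map a word to a non-negative int via MD5.
public class WordId {
  public static int idFor(String word) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] digest = md5.digest(word.getBytes("UTF-8"));
    return new BigInteger(1, digest).intValue() & 0x7FFFFFFF;
  }
}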

On Thu, Mar 8, 2012 at 6:30 PM, Jane Wayne  wrote:
> i am wondering if hadoop always respect Job.setNumReduceTasks(int)?
>
> as i am emitting items from the mapper, i expect/desire only 1 reducer to
> get these items because i want to assign each key of the key-value input
> pair a unique integer id. if i had 1 reducer, i can just keep a local
> counter (with respect to the reducer instance) and increment it.
>
> on my local hadoop cluster, i noticed that most, if not all, my jobs have
> only 1 reducer, regardless of whether or not i set
> Job.setNumReduceTasks(int).
>
> however, as soon as i moved the code unto amazon's elastic mapreduce (emr),
> i notice that there are multiple reducers. if i set the number of reduce
> tasks to 1, is this always guaranteed? i ask because i don't know if there
> is a gotcha like the combiner (where it may or may not run at all).
>
> also, it looks like this might not be a good idea just having 1 reducer (it
> won't scale). it is most likely better if there are +1 reducers, but in
> that case, i lose the ability to assign unique numbers to the key-value
> pairs coming in. is there a design pattern out there that addresses this
> issue?
>
> my mapper/reducer key-value pair signatures looks something like the
> following.
>
> mapper(Text, Text, Text, IntWritable)
> reducer(Text, IntWritable, IntWritable, Text)
>
> the mapper reads a sequence file whose key-value pairs are of type Text and
> Text. i then emit Text (let's say a word) and IntWritable (let's say
> frequency of the word).
>
> the reducer gets the word and its frequencies, and then assigns the word an
> integer id. it emits IntWritable (the id) and Text (the word).
>
> i remember seeing code from mahout's API where they assign integer ids to
> items. the items were already given an id of type long. the conversion they
> make is as follows.
>
> public static int idToIndex(long id) {
>  return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
> }
>
> is there something equivalent for Text or a "word"? i was thinking about
> simply taking the hash value of the string/word, but of course, different
> strings can map to the same hash value.



-- 
Lance Norskog
goks...@gmail.com


Best way for setting up a large cluster

2012-03-08 Thread Masoud

Hi all,

I installed Hadoop on a pilot cluster with 3 machines and am now going to
build our actual cluster with 32 nodes.
As you know, setting up Hadoop separately on every node is time
consuming and not an ideal approach.

What's the best way or tool to set up a Hadoop cluster (except Cloudera)?

Thanks,
B.S


Re: Best way for setting up a large cluster

2012-03-08 Thread Joey Echeverria
Something like Puppet is a good choice. There are example Puppet
manifests available for most Hadoop-related projects in Apache BigTop,
for example:

https://svn.apache.org/repos/asf/incubator/bigtop/branches/branch-0.2/bigtop-deploy/puppet/

-Joey

On Thu, Mar 8, 2012 at 9:42 PM, Masoud  wrote:
> Hi all,
>
> I installed hadoop in a pilot cluster with 3 machines and now going to make
> our actual cluster with 32 nodes.
> as you know setting up hadoop separately in every nodes is time consuming
> and not perfect way.
> whats the best way or tool to setup hadoop cluster (expect cloudera)?
>
> Thanks,
> B.S



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Getting different results every time I run the same job on the cluster

2012-03-08 Thread Mark Kerzner
Hi,

I have to admit, I am lost. My code is stable on a
pseudo-distributed cluster, but every time I run it on a 4-slave
cluster, I get different results, ranging from 100 output lines to 4,000
output lines, whereas the real answer on my standalone setup is about 2,000.

I look at the logs and see no exceptions, so I am totally lost. Where
should I look?

Thank you,
Mark


Re: Why is hadoop build I generated from a release branch different from release build?

2012-03-08 Thread Pawan Agarwal
Thanks for all the replies. It turns out that the build generated by ant has
the bin, conf, etc. folders one level above. I also looked at the hadoop
scripts, and apparently they look for the right jars in both the root
directory and the root/build/ directory, so I think I am covered for now.

Thanks again!

On Thu, Mar 8, 2012 at 4:15 PM, Leo Leung  wrote:

> Hi Pawan,
>
>  ant -p (not for 0.23+) will tell you the available build targets.
>
>  Use mvn (maven) for 0.23 or newer
>
>
>
> -Original Message-
> From: Matt Foley [mailto:mfo...@hortonworks.com]
> Sent: Thursday, March 08, 2012 3:52 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Why is hadoop build I generated from a release branch
> different from release build?
>
> Hi Pawan,
> The complete way releases are built (for v0.20/v1.0) is documented at
>http://wiki.apache.org/hadoop/HowToRelease#Building
> However, that does a bunch of stuff you don't need, like generate the
> documentation and do a ton of cross-checks.
>
> The full set of ant build targets are defined in build.xml in the top
> level of the source code tree.
> "binary" may be the target you want.
>
> --Matt
>
> On Thu, Mar 8, 2012 at 3:35 PM, Pawan Agarwal  >wrote:
>
> > Hi,
> >
> > I am trying to generate hadoop binaries from source and execute hadoop
> > from the build I generate. I am able to build, however I am seeing
> > that as part of build *bin* folder which comes with hadoop
> > installation is not generated in my build. Can someone tell me how to
> > do a build so that I can generate build equivalent to hadoop release
> > build and which can be used directly to run hadoop.
> >
> > Here's the details.
> > Desktop: Ubuntu Server 11.10
> > Hadoop version for installation: 0.20.203.0  (link:
> > http://mirrors.gigenet.com/apache//hadoop/common/hadoop-0.20.203.0/)
> > Hadoop Branch used build: branch-0.20-security-203 Build Command used:
> > "Ant maven-install"
> >
> > Here's the directory structures from build I generated vs hadoop
> > official release build.
> >
> > *Hadoop directory which I generated:*
> > pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls -1 ant
> > c++
> > classes
> > contrib
> > examples
> > hadoop-0.20-security-203-pawan
> > hadoop-ant-0.20-security-203-pawan.jar
> > hadoop-core-0.20-security-203-pawan.jar
> > hadoop-examples-0.20-security-203-pawan.jar
> > hadoop-test-0.20-security-203-pawan.jar
> > hadoop-tools-0.20-security-203-pawan.jar
> > ivy
> > jsvc
> > src
> > test
> > tools
> > webapps
> >
> > *Official Hadoop build installation*
> > pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls /hadoop -1
> > bin build.xml
> > c++
> > CHANGES.txt
> > conf
> > contrib
> > docs
> > hadoop-ant-0.20.203.0.jar
> > hadoop-core-0.20.203.0.jar
> > hadoop-examples-0.20.203.0.jar
> > hadoop-test-0.20.203.0.jar
> > hadoop-tools-0.20.203.0.jar
> > input
> > ivy
> > ivy.xml
> > lib
> > librecordio
> > LICENSE.txt
> > logs
> > NOTICE.txt
> > README.txt
> > src
> > webapps
> >
> >
> >
> > Any pointers for help are greatly appreciated?
> >
> > Also, if there are any other resources for understanding hadoop build
> > system, pointers to that would be also helpful.
> >
> > Thanks
> > Pawan
> >
>


Hadoop-Pig setup question

2012-03-08 Thread Atul Thapliyal
Hi Hadoop users,

I am a new member; please let me know if this is not the correct format to
ask questions.

I am trying to set up a small Hadoop cluster where I will run Pig queries.
The Hadoop cluster is running fine, but when I run a Pig query it just hangs.

Note - Pig runs fine in local mode.

So I narrowed the errors down to the following:

I have a secondary namenode on a different machine (e.g. node 2).

Point 1.
When I execute start-mapred.sh on node 2, I get an
"ssh_exchange_identification closed by remote host" message, BUT the
secondary namenode starts with no error messages in the log.
I can even access it through port 50030.

So far, no errors.

Point 2.
When I try to run a map-reduce job, I get a "java.net.ConnectException:
Connection refused" error in the secondary namenode log files.

Are point 1 and point 2 related?

Any hints/pointers on how to solve this? Also, the ssh timeout is set to 20,
so I am assuming that error is not because of this.

Thanks for reading

-- 
Warm Regards
Atul


Hadoop node name problem

2012-03-08 Thread 韶隆吴
Hi All:
   I'm trying to use Hadoop, ZooKeeper and HBase to build a NoSQL
database, but after getting Hadoop and ZooKeeper working well and moving on
to install HBase, it reports an exception:
BindException: Problem binding to /202.106.199.37:60020: Cannot assign
requested address
My PC's IP and hostname are 192.168.1.91 / slave1.
Then I browsed to http://192.168.1.90:50070/dfsnodelist.jsp?whatNodes=LIVE
on the master and saw something like this:
Node
web30
bt-199-036
202.106.199.37
I want to know why the node names look like this and how to fix it.


Re: Java Heap space error

2012-03-08 Thread hadoopman
I'm curious whether you have been able to track down the cause of the error.
We've seen similar problems with loading data, and I've discovered that if I
presort my data before the load, things go a LOT more smoothly.


When running queries against our data, we've sometimes seen the jobtracker
just freeze. I've seen heap out-of-memory errors when I cranked up
jobtracker logging to debug. Still working on figuring this one out;
should be an interesting ride :D
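
On the heap-dump question quoted below: one thing that might work (assuming
the 0.20-era property name mapred.child.java.opts; adjust the heap size to
whatever your tasks actually use) is passing the HotSpot flag through the
child JVM options, e.g.:

import org.apache.hadoop.mapred.JobConf;

// Sketch: ask each child task JVM to dump its heap on OutOfMemoryError.
public class HeapDumpOpts {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.set("mapred.child.java.opts",
        "-Xmx512m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp");
    // ... set up and submit the job as usual ...
  }
}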




On 03/06/2012 11:10 AM, Mohit Anchlia wrote:

I am still trying to see how to narrow this down. Is it possible to set
heapdumponoutofmemoryerror option on these individual tasks?

On Mon, Mar 5, 2012 at 5:49 PM, Mohit Anchlia wrote:







state of HOD

2012-03-08 Thread Stijn De Weirdt
(my apologies to those who have received this already. i posted this
mail a few days back on the common-dev list, as this is more of a
development-related mail; but one of the original authors/maintainers
suggested also posting it here)


hi all,

i am a system administrator/user support person/... for the HPC team at 
Ghent University (Ghent, Flanders, Belgium).


recently we have been asked to look into support for hadoop. for the 
moment we are holding off on a dedicated cluster (esp dedicated hdfs setup).


but as all our systems are torque/pbs based, we looked into HOD to help 
out our users.
we have started from the HOD code that was part of the hadoop 1.0.0 
release (in the contrib part).
at first it was not working, but we have been patching and cleaning up 
the code for a few weeks and now have a version that works for us (we 
had to add some features besides fixing a few things).
it looks sufficient for now, although we will add some more features 
soon to get the users started.



my question is the following: what is the state of HOD atm? is it still 
maintained/supported? are there forks somewhere that have more 
up-to-date code?
what we are now missing most is the documentation (eg 
http://hadoop.apache.org/common/docs/r0.16.4/hod.html) so we can update 
this with our extra features. is the source available somewhere?


i could contribute back all patches, but a few of them are indentation
fixes (to use 4-space indentation throughout the code) and other
cosmetic changes, so this messes up the patches a lot.
i have also shuffled the options around a bit (renamed and/or moved them to
other sections), so there is no 100% backwards compatibility with the
current HOD code.


current main improvements:
- works with python 2.5 and up (we have been testing with 2.7.2)
- set options through environment variables
- better default values (we can now run with empty hodrc file)
- support for mail and nodes:ppn for pbs
- no deprecation warnings from hadoop (nearly finished)
- host-mask to bind xrs addr on non-default ip (in case you have 
non-standard network on the compute nodes)

- more debug statements
- gradual code cleanup (using pylint)

on the todo list:
- further tuning of hadoop parameters (i'm not a hadoop user myself, so 
this will take some time)

- 0.23.X support



many thanks,

stijn