Re: Hadoop Cookbook?

2010-05-04 Thread Steve Loughran

Mark Kerzner wrote:

Hi, guys,

I think that there is a need for a collection of Hadoop exercises. The great
books out there teach you how to use Hadoop, but the Hadoop Cookbook is
missing. If people can submit their solutions, I can become an editor - or a
group of editors can do it - but there are lots of people out there who have
designed interesting solutions that they could share.

Cheers,
Mark



Would be good on the Apache Hadoop wiki.


Re: Hadoop Cookbook?

2010-05-04 Thread Mark Kerzner
Thank you

On Tue, May 4, 2010 at 4:52 AM, Steve Loughran ste...@apache.org wrote:

 Mark Kerzner wrote:

 Hi, guys,

 I think that there is a need for a collection of Hadoop exercises. The
 great
 books out there teach you how to use Hadoop, but the Hadoop Cookbook is
 missing, If people can submit their solutions, I can become an editor - or
 a
 group of editors can do it - but there are lots of people out there who
 have
 designed interesting solutions that they could share.

 Cheers,
 Mark


 would be good on the apache hadoop wiki



Doubt: Using PBS to run mapreduce jobs.

2010-05-04 Thread Udaya Lakshmi
Hi,
   I have been given an account on a cluster that uses OpenPBS as the cluster
management software. The only way I can run a job is by submitting it to
OpenPBS. How can I run MapReduce programs on it? Is there any possible
workaround?

Thanks,
Udaya.


Need a Jira?

2010-05-04 Thread Michael Segel

Hi,

Came across something ugly.

I'm using the latest Hadoop version in Cloudera's CDH2: Hadoop 0.20.1+169.68
(at least I think it's the latest version in CDH2).

I noticed that when I instantiate a JobClient, passing in a Configuration
object, I have to cast it to the deprecated class (JobConf).

Is this something that should be updated, or is this fixed in the next Cloudera
(CDH3) release?

Thx

-Mike


  
_
Hotmail has tools for the New Busy. Search, chat and e-mail from your inbox.
http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_1

Re: having a directory as input split

2010-05-04 Thread Sonal Goyal
One way to do this will be:

Create a DirectoryInputFormat which accepts the list of directories as
inputs and emits each directory path in one split. Your custom RecordReader
can then read this split and generate appropriate input for your mapper.

Thanks and Regards,
Sonal
www.meghsoft.com


On Fri, Apr 30, 2010 at 11:48 AM, akhil1988 akhilan...@gmail.com wrote:


 How can I make a directory an InputSplit rather than a file? I want the
 input split available to a map task to be a directory and not a file. I
 will implement my own record reader, which will read the appropriate data
 from the directory and thus give the records to the map tasks.

 To explain it in other words: I have a list of directories distributed over
 HDFS, and I know that each of these directories is small enough to be
 present on a single node. I want one directory to be given to each map task
 rather than the files present in it. How can I do this?

 Thanks,
  Akhil
 --
 View this message in context:
 http://old.nabble.com/having-a-directory-as-input-split-tp28408886p28408886.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.
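
A rough sketch of the approach Sonal describes, against the 0.20 mapred API
(DirectoryInputFormat and its one-record-per-file reader below are made-up
illustrative names, not existing Hadoop classes, and this is untested):

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class DirectoryInputFormat implements InputFormat<Text, NullWritable> {

  // One split per configured input directory; the record reader lists the
  // directory itself, so the split only needs to carry the directory Path.
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    Path[] dirs = FileInputFormat.getInputPaths(job);
    InputSplit[] splits = new InputSplit[dirs.length];
    for (int i = 0; i < dirs.length; i++) {
      splits[i] = new FileSplit(dirs[i], 0, 0, new String[0]);
    }
    return splits;
  }

  public RecordReader<Text, NullWritable> getRecordReader(InputSplit split,
      JobConf job, Reporter reporter) throws IOException {
    return new DirectoryRecordReader((FileSplit) split, job);
  }

  // Emits one record per file in the directory: key = file path, value = null.
  static class DirectoryRecordReader implements RecordReader<Text, NullWritable> {
    private final FileStatus[] files;
    private int pos = 0;

    DirectoryRecordReader(FileSplit split, JobConf job) throws IOException {
      FileSystem fs = split.getPath().getFileSystem(job);
      files = fs.listStatus(split.getPath());
    }

    public boolean next(Text key, NullWritable value) {
      if (pos >= files.length) return false;
      key.set(files[pos++].getPath().toString());
      return true;
    }

    public Text createKey() { return new Text(); }
    public NullWritable createValue() { return NullWritable.get(); }
    public long getPos() { return pos; }
    public float getProgress() { return files.length == 0 ? 1.0f : (float) pos / files.length; }
    public void close() { }
  }
}

The reader here only hands each map task the paths of the files in its
directory; a real implementation would parse the directory contents however
the job requires.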




Re: Need a Jira?

2010-05-04 Thread Eric Sammer
On Tue, May 4, 2010 at 10:50 AM, Michael Segel
michael_se...@hotmail.com wrote:

 Hi,

 Came across something ugly.

 I'm using the latest Hadoop version in Cloudera's CH2 :Hadoop 0.20.1+169.68
 (At least I think its the latest version in CH2)

 Noticed that when I instantiate a JobClient() passing in a Configuration 
 object, I have to cast it to the deprecated class (JobConf).

 Is this something that should be updated, or is this fixed in the next 
 Cloudera (CH3) release?

The reason for this is that JobClient is from the old (0.18-era)
API and thus has no understanding of Configuration. You can initialize
a JobConf from a Configuration instead, which avoids the cast:

JobConf conf = new JobConf(new Configuration());

This isn't a bug so much as confusion between the new and old
APIs. As the new API becomes more feature-complete (probably at or
around 0.21), the recommendation will be to prefer it. There has
been discussion around un-deprecating the old APIs.
-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
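
For anyone hitting the same deprecation warning, here is a minimal sketch of
what Eric suggests, against the 0.20-era mapred API (the class name and the
input/output paths below are illustrative placeholders, not from the thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitWithJobConf {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // the Configuration you already have
    JobConf jobConf = new JobConf(conf);          // wrap it -- no cast needed
    jobConf.setJobName("submit-with-jobconf");
    FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
    FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

    JobClient jobClient = new JobClient(jobConf); // JobClient only understands JobConf
    RunningJob running = jobClient.submitJob(jobConf);
    running.waitForCompletion();
  }
}

The JobConf(Configuration) constructor inherits the settings of the original
Configuration, so nothing already set on conf is lost.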


RE: Need a Jira?

2010-05-04 Thread Michael Segel



 Date: Tue, 4 May 2010 11:03:48 -0400
 Subject: Re: Need a Jira?
 From: esam...@cloudera.com
 To: common-user@hadoop.apache.org

 The reason / problem here is because JobClient is from the old (0.18)
 API and thus has no understanding of Configuration. You can initialize
 a JobConf from a Configuration rather than casting it which avoids the
 cast.
 
 JobConf conf = new JobConf(new Configuration())
 
 This isn't a bug as much as it is confusion between the new and old
 APIs. As the new APIs become more feature complete (probably at or
 around 0.21) the recommendation will be to prefer those. There has
 been discussion around un-deprecating the old APIs.

Well that's why I asked about creating a Jira.
Here's the code ...
jc = new JobClient(new JobConf(conf));

conf is actually an instance of Configuration, which is what we are *supposed*
to use. ;-)


Of course JobConf has an ugly 'strikeout' through it. And that's what I meant 
by ugly. 

I wonder if there's a better interface to JobTracker than JobClient planned? 
(Not that I'm complaining. It does what I need...)

I would hope that JobClient gets refactored to know about Configuration... :-)

Thx

-Mike

  
_
Hotmail is redefining busy with tools for the New Busy. Get more from your 
inbox.
http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_2

Applying HDFS-630 patch to hadoop-0.20.2 tarball release?

2010-05-04 Thread Joseph Chiu
I am currently testing out a rollout of HBase 0.20.3 on top of Hadoop
0.20.2.  The HBase docs recommend that the HDFS-630 patch be applied.

I realize this is a newbieish question, but has anyone done this to the
tarball Hadoop 0.20.2 release?  Since this is a specific recommendation by
the HBase release, I think a walk-through would be quite useful for anyone
else similarly coming up the Hadoop + HBase learning curve.

(I'm afraid I've been away from the Linux / DB / Systems world for far too
long, nearly a decade, and I've come back to work to a very changed
landscape.  But I digress...)

Thanks in advance.

Joseph


Re: Doubt: Using PBS to run mapreduce jobs.

2010-05-04 Thread Craig Macdonald
HOD supports a PBS environment, namely Torque. Torque is the vastly
improved fork of OpenPBS. You may be able to get HOD working on OpenPBS,
or, better still, persuade your cluster admins to upgrade to a more recent
version of Torque (e.g. at least 2.1.x).


Craig


On 04/05/2010, Udaya Lakshmi wrote:

Hi,
I am given an account on a cluster which uses OpenPBS as the cluster
management software. The only way I can run a job is by submitting it to
OpenPBS. How to run mapreduce programs on it? Is there any possible work
around?

Thanks,
Udaya.
   




Re: Applying HDFS-630 patch to hadoop-0.20.2 tarball release?

2010-05-04 Thread Todd Lipcon
Hi Joseph,

You'll have to apply the patch with patch -p0 < foo.patch and then recompile
using ant.

If you want to avoid this you can grab the CDH2 tarball here:
http://archive.cloudera.com/cdh/2/ - it includes the HDFS-630 patch.

Thanks
-Todd

On Tue, May 4, 2010 at 9:38 AM, Joseph Chiu joec...@joechiu.com wrote:

 I am currently testing out a rollout of HBase 0.20.3 on top of Hadoop
 0.20.2.  The HBase doc recommends HDFS-630 patch be applied.

 I realize this is a newbieish question, but has anyone done this to the
 tarball Hadoop-0.20.2 release?  Since this is a specific recommendation by
 the HBase release, I think a walk-through would be quite useful for
 anyone
 else similary coming up the Hadoop + HBase learning curve.

 (I'm afraid I've been away from the Linux / DB / Systems world for far too
 long, nearly a decade, and I've come back to work to a very changed
 landscape.  But I digress...)

 Thanks in advance.

 Joseph




-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Applying HDFS-630 patch to hadoop-0.20.2 tarball release?

2010-05-04 Thread Joseph Chiu
Thanks, Todd. Where I really need help is in getting up to speed on the
process of recompiling (and re-installing the build outputs) with ant.

Cheers,
Joseph

On Tue, May 4, 2010 at 9:48 AM, Todd Lipcon t...@cloudera.com wrote:

 Hi Joseph,

 You'll have to apply the patch with patch -p0 < foo.patch and then
 recompile
 using ant.

 If you want to avoid this you can grab the CDH2 tarball here:
 http://archive.cloudera.com/cdh/2/ - it includes the HDFS-630 patch.

 Thanks
 -Todd

 On Tue, May 4, 2010 at 9:38 AM, Joseph Chiu joec...@joechiu.com wrote:

  I am currently testing out a rollout of HBase 0.20.3 on top of Hadoop
  0.20.2.  The HBase doc recommends HDFS-630 patch be applied.
 
  I realize this is a newbieish question, but has anyone done this to the
  tarball Hadoop-0.20.2 release?  Since this is a specific recommendation
 by
  the HBase release, I think a walk-through would be quite useful for
  anyone
  else similary coming up the Hadoop + HBase learning curve.
 
  (I'm afraid I've been away from the Linux / DB / Systems world for far
 too
  long, nearly a decade, and I've come back to work to a very changed
  landscape.  But I digress...)
 
  Thanks in advance.
 
  Joseph
 



 --
 Todd Lipcon
 Software Engineer, Cloudera



Re: Doubt: Using PBS to run mapreduce jobs.

2010-05-04 Thread Udaya Lakshmi
Thank you, Craig. My cluster has Torque. Can you please point me to
something with a detailed explanation of using HOD on Torque?

On Tue, May 4, 2010 at 10:17 PM, Craig Macdonald cra...@dcs.gla.ac.uk wrote:

 HOD supports a PBS environment, namely Torque. Torque is the vastly
 improved fork of OpenPBS. You may be able to get HOD working on OpenPBS, or
 better still persuade your cluster admins to upgrade to a more recent
 version of Torque (e.g. at least 2.1.x)

 Craig



 On 04/05/2010, Udaya Lakshmi wrote:

 Hi,
I am given an account on a cluster which uses OpenPBS as the cluster
 management software. The only way I can run a job is by submitting it to
 OpenPBS. How to run mapreduce programs on it? Is there any possible work
 around?

 Thanks,
 Udaya.






Re: Doubt: Using PBS to run mapreduce jobs.

2010-05-04 Thread Peeyush Bishnoi
Udaya,

The following link will help you with HOD on Torque.
http://hadoop.apache.org/common/docs/r0.20.0/hod_user_guide.html


Thanks,

---
Peeyush

On Tue, 2010-05-04 at 22:49 +0530, Udaya Lakshmi wrote:

 Thank you Craig. My cluster has got Torque. Can you please point me
 something which will have detailed explanation about using HOD on Torque.
 
 On Tue, May 4, 2010 at 10:17 PM, Craig Macdonald cra...@dcs.gla.ac.uk wrote:
 
  HOD supports a PBS environment, namely Torque. Torque is the vastly
  improved fork of OpenPBS. You may be able to get HOD working on OpenPBS, or
  better still persuade your cluster admins to upgrade to a more recent
  version of Torque (e.g. at least 2.1.x)
 
  Craig
 
 
 
  On 04/05/2010, Udaya Lakshmi wrote:
 
  Hi,
 I am given an account on a cluster which uses OpenPBS as the cluster
  management software. The only way I can run a job is by submitting it to
  OpenPBS. How to run mapreduce programs on it? Is there any possible work
  around?
 
  Thanks,
  Udaya.
 
 
 
 


Re: Applying HDFS-630 patch to hadoop-0.20.2 tarball release?

2010-05-04 Thread Owen O'Malley
On Tue, May 4, 2010 at 10:03 AM, Joseph Chiu joec...@joechiu.com wrote:
 Thanks Todd.    Where I really need help is to get up to speed on that
 process of recompiling (and re-installing the build outputs) with ant.

The place to look is in the wiki:

http://wiki.apache.org/hadoop/HowToRelease

It walks through the build process very well.

-- Owen


Re: Applying HDFS-630 patch to hadoop-0.20.2 tarball release?

2010-05-04 Thread Joseph Chiu
Thanks!

On Tue, May 4, 2010 at 11:14 AM, Owen O'Malley owen.omal...@gmail.com wrote:

 On Tue, May 4, 2010 at 10:03 AM, Joseph Chiu joec...@joechiu.com wrote:
  Thanks Todd.Where I really need help is to get up to speed on that
  process of recompiling (and re-installing the build outputs) with ant.

 The place to look is in the wiki:

 http://wiki.apache.org/hadoop/HowToRelease

 It walks through the build process very well.

 -- Owen



RE: Hadoop User Group - May 19th at Yahoo!

2010-05-04 Thread Dekel Tankel
Hi

The agenda is available for the upcoming HUG.

Hope to see you all there.

http://www.meetup.com/hadoop/calendar/13048582/

thanks

Dekel






Register today for Hadoop Summit 2010 June 29th, Hyatt, Santa Clara, CA
http://hadoopsummit2010.eventbrite.com/

Presentation submission deadline extended until May 10th
http://developer.yahoo.com/events/hadoopsummit2010/presentationguidelines.html







Re: Doubt: Using PBS to run mapreduce jobs.

2010-05-04 Thread Allen Wittenauer

On May 4, 2010, at 7:46 AM, Udaya Lakshmi wrote:

 Hi,
   I am given an account on a cluster which uses OpenPBS as the cluster
 management software. The only way I can run a job is by submitting it to
 OpenPBS. How to run mapreduce programs on it? Is there any possible work
 around?


Take a look at Hadoop on Demand.  It was built with Torque in mind, but any PBS 
system should work with few changes.




Re: Doubt: Using PBS to run mapreduce jobs.

2010-05-04 Thread Udaya Lakshmi
Thank you.
Udaya.

On Wed, May 5, 2010 at 12:23 AM, Allen Wittenauer
awittena...@linkedin.com wrote:


 On May 4, 2010, at 7:46 AM, Udaya Lakshmi wrote:

  Hi,
I am given an account on a cluster which uses OpenPBS as the cluster
  management software. The only way I can run a job is by submitting it to
  OpenPBS. How to run mapreduce programs on it? Is there any possible work
  around?


 Take a look at Hadoop on Demand.  It was built with Torque in mind, but any
 PBS system should work with few changes.





about CombineFileInputFormat

2010-05-04 Thread Zhenyu Zhong
Hi,

I tried to use CombineFileInputFormat in 0.20.2. It seems I need to extend
it because it is an abstract class.
However, I need to implement the getRecordReader method in the extended class.

May I ask how to implement this getRecordReader method?

I tried to do something like this:

public RecordReader getRecordReader(InputSplit genericSplit, JobConf job,
    Reporter reporter) throws IOException {
  // TODO Auto-generated method stub
  reporter.setStatus(genericSplit.toString());
  return new CombineFileRecordReader(job, (CombineFileSplit) genericSplit,
      reporter, CombineFileRecordReader.class);
}

It doesn't seem to be working. I would really appreciate it if someone could
shed some light on this.

thanks
zhenyu


new to hadoop

2010-05-04 Thread jamborta

Hi,

I am trying to set up a small Hadoop cluster with 6 machines. The problem I
have now is that if I set the memory allocated to a task low (e.g. -Xmx512m),
the application does not run; if I set it higher, some of the machines in the
cluster do not have much memory (1 or 2 GB), and when the computation gets
intensive Hadoop creates so many tasks and sends them to these weaker
machines that it brings the whole cluster down.
My question is whether it is possible to specify -Xmx for each machine in
the cluster and to specify how many tasks can run on a machine, or what the
optimal setting is in this situation?

thanks for your help

Tom

-- 
View this message in context: 
http://old.nabble.com/new-to-hadoop-tp28454028p28454028.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Accepting contributions for the Hadoop in Practice book

2010-05-04 Thread Mark Kerzner
Hi, guys,

I am working on this book for Manning http://www.manning.com/, and I need
your solutions. If you had a specific problem that you solved with Hadoop,
and you can share your solution, even in general terms, I will accept it
from you and put it in the book. You will be mentioned as the person/company
who contributed this specific solution. Contributions about Pig, Hive,
Scaling, etc. are also welcome.

It does not have to be a formal documented description; a few written ideas
are enough, a phone conversation where you will explain the problem and the
solution will also be good, and if you can point me to something already out
on the web, that will be great. Like, for example, dealing with many small
files http://www.cloudera.com/blog/2009/02/the-small-files-problem/.

Thank you. Sincerely,
Mark


Re: new to hadoop

2010-05-04 Thread Ravi Phulari
How much RAM?
With 6-8GB RAM you can go for 4 mappers and 2 reducers (this is my personal 
guess).

-
Ravi

On 5/4/10 4:33 PM, Tamas Jambor jambo...@googlemail.com wrote:

thank you. so what would be the optimal setting for mapred.map.tasks and 
mapred.reduce.tasks, say, on a dual-core machine?

Tom

On 05/05/2010 00:12, Ravi Phulari wrote:
You can configure (conf/hadoop-env.sh) configuration files on
each node to specify -Xmx values.
You can use conf/mapred-site.xml to configure default mappers and reducers 
running on a node.

<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.
  Ignored when mapred.job.tracker is local.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>The default number of reduce tasks per job. Typically set to 99%
  of the cluster's reduce capacity, so that if a node fails the reduces can
  still be executed in a single wave.
  Ignored when mapred.job.tracker is local.
  </description>
</property>


-
Ravi

On 5/4/10 3:54 PM, jamborta jambo...@gmail.com wrote:




Hi,

I am tring to set up a small hadoop cluster with 6 machines. the problem I
have now is that if I set the memory allocated to a task low (e.g -Xmx512m)
the application does not run, if I set it higher some machines in the
cluster only have not got too much memory (1 or 2GB) and when the
computation gets intensive hadoop create so many jobs and send them to these
weaker machines, which brings the whole cluster down.
my question is whether it is possible to specify -Xmx for each machine in
the cluster and specify how many task can run on a machine. or what is the
optimal setting in this situation?

thanks for your help

Tom

--
View this message in context: 
http://old.nabble.com/new-to-hadoop-tp28454028p28454028.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.




Ravi
--



Re: new to hadoop

2010-05-04 Thread Tamas Jambor
Thank you. So what would be the optimal setting for mapred.map.tasks and
mapred.reduce.tasks, say, on a dual-core machine?


Tom

On 05/05/2010 00:12, Ravi Phulari wrote:
You can configure (conf/hadoop-env.sh) configuration files on each 
node to specify -Xmx values.
You can use conf/mapred-site.xml to configure default mappers and 
reducers running on a node.


<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.
  Ignored when mapred.job.tracker is local.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>The default number of reduce tasks per job. Typically set to 99%
  of the cluster's reduce capacity, so that if a node fails the reduces can
  still be executed in a single wave.
  Ignored when mapred.job.tracker is local.
  </description>
</property>


-
Ravi

On 5/4/10 3:54 PM, jamborta jambo...@gmail.com wrote:



Hi,

I am tring to set up a small hadoop cluster with 6 machines. the
problem I
have now is that if I set the memory allocated to a task low (e.g
-Xmx512m)
the application does not run, if I set it higher some machines in the
cluster only have not got too much memory (1 or 2GB) and when the
computation gets intensive hadoop create so many jobs and send
them to these
weaker machines, which brings the whole cluster down.
my question is whether it is possible to specify -Xmx for each
machine in
the cluster and specify how many task can run on a machine. or
what is the
optimal setting in this situation?

thanks for your help

Tom

--
View this message in context:
http://old.nabble.com/new-to-hadoop-tp28454028p28454028.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Ravi
--



Re: new to hadoop

2010-05-04 Thread Tamas Jambor

Great, thank you. I'll set it up that way.

Tom

On 05/05/2010 00:37, Ravi Phulari wrote:

How much RAM ?
With 6-8GB RAM you can go for 4 mappers and 2 reducers (this is my 
personal guess).


-
Ravi

On 5/4/10 4:33 PM, Tamas Jambor jambo...@googlemail.com wrote:

thank you. so what would be the optimal setting for
mapred.map.tasks and mapred.reduce.tasks, say, on a dual-core machine?

Tom

On 05/05/2010 00:12, Ravi Phulari wrote:

You can configure (conf/hadoop-env.sh)
configuration files on each node to specify -Xmx values.
You can use conf/mapred-site.xml to configure default mappers
and reducers running on a node.

<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.
  Ignored when mapred.job.tracker is local.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>The default number of reduce tasks per job. Typically set to 99%
  of the cluster's reduce capacity, so that if a node fails the reduces can
  still be executed in a single wave.
  Ignored when mapred.job.tracker is local.
  </description>
</property>


-
Ravi

On 5/4/10 3:54 PM, jamborta jambo...@gmail.com wrote:




Hi,

I am tring to set up a small hadoop cluster with 6
machines. the problem I
have now is that if I set the memory allocated to a task
low (e.g -Xmx512m)
the application does not run, if I set it higher some
machines in the
cluster only have not got too much memory (1 or 2GB) and
when the
computation gets intensive hadoop create so many jobs and
send them to these
weaker machines, which brings the whole cluster down.
my question is whether it is possible to specify -Xmx for
each machine in
the cluster and specify how many task can run on a
machine. or what is the
optimal setting in this situation?

thanks for your help

Tom

--
View this message in context:
http://old.nabble.com/new-to-hadoop-tp28454028p28454028.html
Sent from the Hadoop core-user mailing list archive at
Nabble.com.




Ravi
--





Re: about CombineFileInputFormat

2010-05-04 Thread Amareshwari Sri Ramadasu

See the patch on https://issues.apache.org/jira/browse/MAPREDUCE-364 as an example.

-Amareshwari

On 5/5/10 1:52 AM, Zhenyu Zhong zhongresea...@gmail.com wrote:

Hi,

I tried to use CombineFileInputFormat in 0.20.2. It seems I need to extend
it because it is an abstract class.
However, I need to implement getRecordReader method in the extended class.

May I ask how to implement this getRecordReader method?

I tried to do something like this:

public RecordReader getRecordReader(InputSplit genericSplit, JobConf job,
    Reporter reporter) throws IOException {
  // TODO Auto-generated method stub
  reporter.setStatus(genericSplit.toString());
  return new CombineFileRecordReader(job, (CombineFileSplit) genericSplit,
      reporter, CombineFileRecordReader.class);
}

It doesn't seem to be working. I would be very appreciated if someone can
shed a light on this.

thanks
zhenyu
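
For readers following this thread, a rough old-API sketch along the lines of
the MAPREDUCE-364 example that Amareshwari points to (MyCombineInputFormat and
MyLineReader are made-up names, and this is untested against 0.20.2):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

public class MyCombineInputFormat extends CombineFileInputFormat<LongWritable, Text> {

  @SuppressWarnings("unchecked")
  public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
      JobConf job, Reporter reporter) throws IOException {
    // CombineFileRecordReader is only the wrapper that walks the files packed
    // into the CombineFileSplit; the class passed as the last argument is the
    // per-file reader it instantiates for each of those files.
    return new CombineFileRecordReader<LongWritable, Text>(
        job, (CombineFileSplit) split, reporter, (Class) MyLineReader.class);
  }

  // The per-file reader must have exactly this constructor signature:
  // (CombineFileSplit, Configuration, Reporter, Integer index-of-file-in-split).
  public static class MyLineReader implements RecordReader<LongWritable, Text> {
    private final LineRecordReader reader;

    public MyLineReader(CombineFileSplit split, Configuration conf,
                        Reporter reporter, Integer index) throws IOException {
      FileSplit fileSplit = new FileSplit(split.getPath(index),
          split.getOffset(index), split.getLength(index), split.getLocations());
      reader = new LineRecordReader(conf, fileSplit);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      return reader.next(key, value);
    }
    public LongWritable createKey() { return reader.createKey(); }
    public Text createValue() { return reader.createValue(); }
    public long getPos() throws IOException { return reader.getPos(); }
    public float getProgress() throws IOException { return reader.getProgress(); }
    public void close() throws IOException { reader.close(); }
  }
}

One likely culprit in the snippet quoted above is passing
CombineFileRecordReader.class itself as the fourth argument: that class is
just the wrapper, and the argument should name a per-file reader exposing the
(CombineFileSplit, Configuration, Reporter, Integer) constructor.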