Using different file systems for Map Reduce job input and output

2008-10-06 Thread Naama Kraus
Hi,

I wanted to know if it is possible to use different file systems for Map
Reduce job input and output.
I.e., have the M/R job input reside on one file system and the M/R output be
written to another file system (e.g. input on HDFS, output on KFS; input on
HDFS, output on the local file system; or anything else ...).

Is it possible to somehow specify that through
FileInputFormat#setInputPaths(), FileOutputFormat#setOutputPath() ?
Or by any other mechanism ?

Thanks, Naama

-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)


Re: Using different file systems for Map Reduce job input and output

2008-10-06 Thread Amareshwari Sriramadasu

Hi Naama,

Yes. It is possible to specify this using the APIs
FileInputFormat#setInputPaths() and FileOutputFormat#setOutputPath().
You can specify the FileSystem URI in the path.
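
For example, something like the following should work (a rough sketch only;
the namenode host/port and the paths below are made-up examples):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// Rough sketch: job input read from HDFS, job output written to the local
// file system. The namenode address and the paths are hypothetical.
JobConf conf = new JobConf();
FileInputFormat.setInputPaths(conf, new Path("hdfs://namenode:9000/user/naama/input"));
FileOutputFormat.setOutputPath(conf, new Path("file:///tmp/job-output"));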

Thanks,
Amareshwari
Naama Kraus wrote:

Hi,

I wanted to know if it is possible to use different file systems for Map
Reduce job input and output.
I.e. have a M/R job input reside on one file system and the M/R output be
written to another file system (e.g. input on HDFS, output on KFS. Input on
HDFS output on local file system, or anything else ...).

Is it possible to somehow specify that through
FileInputFormat#setInputPaths(), FileOutputFormat#setOutputPath() ?
Or by any other mechanism ?

Thanks, Naama

  




A scalable gallery with hadoop?

2008-10-06 Thread Alberto Cusinato
Hi, I am a new user. I need to develop a huge media gallery. My requirements in a
nutshell are high scalability in the number of users, reliability of
users' data (photos, videos, docs, etc. uploaded by users) and an internal
search engine.
I've seen some posts about the applicability of Hadoop to web apps, mainly
with negative responses (i.e.:
http://www.nabble.com/Hadoop-also-applicable-in-a-web-app-environment--to18836915.html#a18836915
 )
I know about Amazon AWS and I think that maybe S3+EC2 could be a solution
(even if I still don't know how to integrate the search engine), but I would
like to keep open the option of using my own hardware in the future.
I've seen that Hadoop provides an API for using S3 in place of HDFS, and so I
thought this (the hadoop framework) was the right layer on which my app should
be based (allowing me to store data locally or on S3 without changing the
application layer).
Now that I've configured a clustered environment with Hadoop, I'm still not
so sure about that.
I am a perfect newbie on this topic, so any suggestion is welcome! (Maybe
the right framework would be HBase?)

Thanks,
Alberto.


Re: Using different file systems for Map Reduce job input and output

2008-10-06 Thread Naama Kraus
Thanks ! Naama

On Mon, Oct 6, 2008 at 10:27 AM, Amareshwari Sriramadasu <
[EMAIL PROTECTED]> wrote:

> Hi Naama,
>
> Yes. It is possible to specify using the apis
>
> FileInputFormat#setInputPaths(), FileOutputFormat#setOutputPath().
> You can specify the FileSystem uri for the path.
>
> Thanks,
> Amareshwari
>
> Naama Kraus wrote:
>
>> Hi,
>>
>> I wanted to know if it is possible to use different file systems for Map
>> Reduce job input and output.
>> I.e. have a M/R job input reside on one file system and the M/R output be
>> written to another file system (e.g. input on HDFS, output on KFS. Input
>> on
>> HDFS output on local file system, or anything else ...).
>>
>> Is it possible to somehow specify that through
>> FileInputFormat#setInputPaths(), FileOutputFormat#setOutputPath() ?
>> Or by any other mechanism ?
>>
>> Thanks, Naama
>>
>>
>>
>
>


-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)


Re: Hadoop and security.

2008-10-06 Thread Steve Loughran

Dmitry Pushkarev wrote:
Dear hadoop users, 

 


I'm lucky to work in an academic environment where information security is not
an issue. However, I'm sure that most hadoop users aren't.

 


Here is the question: how secure is hadoop?  (or, let's say, how foolproof?)


Right now hadoop is about as secure as NFS. When deployed onto private 
datacentres with good physical security and well-set-up networks, you 
can control who gets at the data. Without that, you are sharing your 
state with anyone who can issue HTTP and hadoop IPC requests.




 


Here is the answer:
http://www.google.com/search?client=opera&rls=en&q=Hadoop+Map/Reduce+Administration&sourceid=opera&ie=utf-8&oe=utf-8

not quite.



see also http://www.google.com/search?q=axis+happiness+page ; pages 
that we add for the benefit of the ops team end up sneaking out onto the big 
net.


 


What we're seeing here is an open hadoop cluster, where anyone capable of
installing hadoop and changing his username to webcrawl can use their
cluster and read their data, even though the firewall is perfectly installed
and ports like ssh are filtered to outsiders. After you've played enough with
the data, you can observe that you can submit jobs as well, and these jobs can
execute shell commands. Which is very, very sad.

 


In my view, this significantly limits distributed hadoop applications, where
part of your cluster may reside on EC2 or another distant datacenter, since
you always need to have certain ports open to an array of IP addresses (if
your instances are dynamic), which isn't acceptable if anyone from that IP
range can connect to your cluster.


well, maybe that's a fault of EC2's architecture, in which a deployment 
request doesn't include a declaration of the network configuration?




 


Can we propose that the developers introduce some basic user management and
access controls to help hadoop take one step further towards being a
production-quality system?




Being an open source project, you can do more than propose: you can help 
build some basic user management and access controls. As to "production 
quality", it is ready for production, albeit in locked-down datacentres, 
which is the primary deployment infrastructure of many of the active 
developers. As in most community-contributed open source projects, if 
you have specific needs beyond what the active developers need, you end 
up implementing them yourself.


The big issue with security is that it is all or nothing. Right now it 
is blatantly insecure, so you should not be surprised that anyone has 
access to your files. To actually lock it down, you would need to 
authenticate and possibly encrypt all communications; this adds a lot of 
overhead, which is why it will be avoided in the big datacentres. You 
also need to go to a lot of effort to make sure it is secure across the 
board, with no JSP pages providing accidental privilege escalation and no 
API calls letting you see stuff you shouldn't. It's not like a normal 
feature defect where you can say "don't do that"; it's not so easy to 
validate using functional tests that test the expected uses of the code. 
This is why securing an application is such a hard thing to do.




--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: Hadoop and security.

2008-10-06 Thread Edward Capriolo
You bring up some valid points. This would be a great topic for a
white paper. The first line of defense should be to apply inbound and
outbound iptables rules. Only source IPs that have a direct need to
interact with the cluster should be allowed to. The same is true for
web access: only a range of source IPs should be allowed to
access the web interfaces. You can do this through SSH tunneling.

Preventing exec commands can be handled with the security manager and
the sandbox. I was thinking of only allowing the execution of signed jars
myself, but I never implemented it.


Re: architecture diagram

2008-10-06 Thread Terrence A. Pietrondi
Can you explain "The location of these splits is semi-arbitrary"? What if the 
example was...

AAA|BBB|CCC|DDD
EEE|FFF|GGG|HHH


Does this mean the split might be between CCC such that it results in AAA|BBB|C 
and C|DDD for the first line? Is there a way to control this behavior to split 
on my delimiter?


Terrence A. Pietrondi


--- On Sun, 10/5/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> From: Alex Loddengaard <[EMAIL PROTECTED]>
> Subject: Re: architecture diagram
> To: core-user@hadoop.apache.org
> Date: Sunday, October 5, 2008, 9:26 PM
> Let's say you have one very large input file of the
> form:
> 
> A|B|C|D
> E|F|G|H
> ...
> |1|2|3|4
> 
> This input file will be broken up into N pieces, where N is
> the number of
> mappers that run.  The location of these splits is
> semi-arbitrary.  This
> means that unless you have one mapper, you won't be
> able to see the entire
> contents of a column in your mapper.  Given that you would
> need one mapper
> to be able to see the entirety of a column, you've now
> essentially reduced
> your problem to a single machine.
> 
> You may want to play with the following idea: collect key
> => column_number
> and value => column_contents in your map step.  This
> means that you would be
> able to see the entirety of a column in your reduce step,
> though you're
> still faced with the tasks of shuffling and re-pivoting.
> 
> Does this clear up your confusion?  Let me know if
> you'd like me to clarify
> more.
> 
> Alex
> 
> On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi
> <[EMAIL PROTECTED]
> > wrote:
> 
> > I am not sure why this doesn't fit, maybe you can
> help me understand. Your
> > previous comment was...
> >
> > "The reason I'm making this claim is because
> in order to do the pivot
> > operation you must know about every row. Your input
> files will be split at
> > semi-arbitrary places, essentially making it
> impossible for each mapper to
> > know every single row."
> >
> > Are you saying that my row segments might not actually
> be the entire row so
> > I will get a bad key index? If so, would the row
> segments be determined? I
> > based my initial work off of the word count example,
> where the lines are
> > tokenized. Does this mean in this example the row
> tokens may not be the
> > complete row?
> >
> > Thanks.
> >
> > Terrence A. Pietrondi
> >
> >
> > --- On Fri, 10/3/08, Alex Loddengaard
> <[EMAIL PROTECTED]> wrote:
> >
> > > From: Alex Loddengaard
> <[EMAIL PROTECTED]>
> > > Subject: Re: architecture diagram
> > > To: core-user@hadoop.apache.org
> > > Date: Friday, October 3, 2008, 7:14 PM
> > > The approach that you've described does not
> fit well in
> > > to the MapReduce
> > > paradigm.  You may want to consider randomizing
> your data
> > > in a different
> > > way.
> > >
> > > Unfortunately some things can't be solved
> well with
> > > MapReduce, and I think
> > > this is one of them.
> > >
> > > Can someone else say more?
> > >
> > > Alex
> > >
> > > On Fri, Oct 3, 2008 at 8:15 AM, Terrence A.
> Pietrondi
> > > <[EMAIL PROTECTED]
> > > > wrote:
> > >
> > > > Sorry for the confusion, I did make some
> typos. My
> > > example should have
> > > > looked like...
> > > >
> > > > > A|B|C
> > > > > D|E|G
> > > > >
> > > > > pivots too...
> > > > >
> > > > > D|A
> > > > > E|B
> > > > > G|C
> > > > >
> > > > > Then for each row, shuffle the contents
> around
> > > randomly...
> > > > >
> > > > > D|A
> > > > > B|E
> > > > > C|G
> > > > >
> > > > > Then pivot the data back...
> > > > >
> > > > > A|E|G
> > > > > D|B|C
> > > >
> > > > The general goal is to shuffle the elements
> in each
> > > column in the input
> > > > data. Meaning, the ordering of the elements
> in each
> > > column will not be the
> > > > same as in input.
> > > >
> > > > If you look at the initial input and compare
> to the
> > > final output, you'll
> > > > see that during the shuffling, B and E are
> swapped,
> > > and G and C are swapped,
> > > > while A and D were shuffled back into their
> > > originating positions in the
> > > > column.
> > > >
> > > > Once again, sorry for the typos and
> confusion.
> > > >
> > > > Terrence A. Pietrondi
> > > >
> > > > --- On Fri, 10/3/08, Alex Loddengaard
> > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > From: Alex Loddengaard
> > > <[EMAIL PROTECTED]>
> > > > > Subject: Re: architecture diagram
> > > > > To: core-user@hadoop.apache.org
> > > > > Date: Friday, October 3, 2008, 11:01 AM
> > > > > Can you confirm that the example
> you've
> > > presented is
> > > > > accurate?  I think you
> > > > > may have made some typos, because the
> letter
> > > "G"
> > > > > isn't in the final result;
> > > > > I also think your first pivot
> accidentally
> > > swapped C and G.
> > > > >  I'm having a
> > > > > hard time understanding what you want
> to do,
> > > because it
> > > > > seems like your
> > > > > operations differ from your example.
> > > > >
> > > > > With that said, at first glance, this
> problem may
> > > not fit
> > > > > 

Re: Hadoop and security.

2008-10-06 Thread Allen Wittenauer



On 10/6/08 6:39 AM, "Steve Loughran" <[EMAIL PROTECTED]> wrote:

> Edward Capriolo wrote:
>> You bring up some valid points. This would be a great topic for a
>> white paper. 
> 
> -a wiki page would be a start too

I was thinking about doing "Deploying Hadoop Securely" for an ApacheCon EU
talk, as by that time some of the basic Kerberos stuff should be in
place... This whole conversation is starting to reinforce the idea.




Re: Hadoop and security.

2008-10-06 Thread Steve Loughran

Edward Capriolo wrote:

You bring up some valid points. This would be a great topic for a
white paper. 


-a wiki page would be a start too


The first line of defense should be to apply inbound and

outbound iptables rules. Only source IPs that have a direct need to
interact with the cluster should be allowed to. The same is true with
the   web access. Only a range of source IP's should be allowed to
access the web interfaces. You can do this through SSH tunneling.

Preventing exec commands can be handled with the security manager and
the sandbox. I was thinking to only allow the execution of signed jars
myself but I never implemented it.



--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: Hadoop and security.

2008-10-06 Thread Steve Loughran

Allen Wittenauer wrote:



On 10/6/08 6:39 AM, "Steve Loughran" <[EMAIL PROTECTED]> wrote:


Edward Capriolo wrote:

You bring up some valid points. This would be a great topic for a
white paper. 

-a wiki page would be a start too


I was thinking about doing "Deploying Hadoop Securely" for a ApacheCon EU
talk, as by that time, some of the basic Kerberos stuff should be in
place... This whole conversation is starting to reinforce the idea




-Start with an ApacheCon US fastfeather talk on the current state of 
affairs "a hadoop cluster is as secure as a farm of machines running 
NFS". Just to let people know where things stand.


for the EU one, I will probably put in for one on deploying/managing 
using our toolset.


I'm also thinking of a talk "datamining a city" that looks at what data 
sources a city is already instrumented with, if only you could get at 
them. The hard part is getting at them. I have my eye on our local speed 
camera/red light cameras, that track the speed of every vehicle passing 
and time of day; you could build up a map of traffic velocity, where the 
jams are, when, etc. Getting the machine-parsed number plate data would 
be even more interesting, but governments tend to restrict that data to 
state security, rather than useful things like analysing and predicting 
traffic flow.


Re: architecture diagram

2008-10-06 Thread Alex Loddengaard
As far as I know, splits will never be made within a line, only between
rows.  To answer your question about ways to control the splits, see below:


<
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
>

Alex

On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]
> wrote:

> Can you explain "The location of these splits is semi-arbitrary"? What if
> the example was...
>
> AAA|BBB|CCC|DDD
> EEE|FFF|GGG|HHH
>
>
> Does this mean the split might be between CCC such that it results in
> AAA|BBB|C and C|DDD for the first line? Is there a way to control this
> behavior to split on my delimiter?
>
>
> Terrence A. Pietrondi
>
>
> --- On Sun, 10/5/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
>
> > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Sunday, October 5, 2008, 9:26 PM
> > Let's say you have one very large input file of the
> > form:
> >
> > A|B|C|D
> > E|F|G|H
> > ...
> > |1|2|3|4
> >
> > This input file will be broken up into N pieces, where N is
> > the number of
> > mappers that run.  The location of these splits is
> > semi-arbitrary.  This
> > means that unless you have one mapper, you won't be
> > able to see the entire
> > contents of a column in your mapper.  Given that you would
> > need one mapper
> > to be able to see the entirety of a column, you've now
> > essentially reduced
> > your problem to a single machine.
> >
> > You may want to play with the following idea: collect key
> > => column_number
> > and value => column_contents in your map step.  This
> > means that you would be
> > able to see the entirety of a column in your reduce step,
> > though you're
> > still faced with the tasks of shuffling and re-pivoting.
> >
> > Does this clear up your confusion?  Let me know if
> > you'd like me to clarify
> > more.
> >
> > Alex
> >
> > On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi
> > <[EMAIL PROTECTED]
> > > wrote:
> >
> > > I am not sure why this doesn't fit, maybe you can
> > help me understand. Your
> > > previous comment was...
> > >
> > > "The reason I'm making this claim is because
> > in order to do the pivot
> > > operation you must know about every row. Your input
> > files will be split at
> > > semi-arbitrary places, essentially making it
> > impossible for each mapper to
> > > know every single row."
> > >
> > > Are you saying that my row segments might not actually
> > be the entire row so
> > > I will get a bad key index? If so, would the row
> > segments be determined? I
> > > based my initial work off of the word count example,
> > where the lines are
> > > tokenized. Does this mean in this example the row
> > tokens may not be the
> > > complete row?
> > >
> > > Thanks.
> > >
> > > Terrence A. Pietrondi
> > >
> > >
> > > --- On Fri, 10/3/08, Alex Loddengaard
> > <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: Alex Loddengaard
> > <[EMAIL PROTECTED]>
> > > > Subject: Re: architecture diagram
> > > > To: core-user@hadoop.apache.org
> > > > Date: Friday, October 3, 2008, 7:14 PM
> > > > The approach that you've described does not
> > fit well in
> > > > to the MapReduce
> > > > paradigm.  You may want to consider randomizing
> > your data
> > > > in a different
> > > > way.
> > > >
> > > > Unfortunately some things can't be solved
> > well with
> > > > MapReduce, and I think
> > > > this is one of them.
> > > >
> > > > Can someone else say more?
> > > >
> > > > Alex
> > > >
> > > > On Fri, Oct 3, 2008 at 8:15 AM, Terrence A.
> > Pietrondi
> > > > <[EMAIL PROTECTED]
> > > > > wrote:
> > > >
> > > > > Sorry for the confusion, I did make some
> > typos. My
> > > > example should have
> > > > > looked like...
> > > > >
> > > > > > A|B|C
> > > > > > D|E|G
> > > > > >
> > > > > > pivots too...
> > > > > >
> > > > > > D|A
> > > > > > E|B
> > > > > > G|C
> > > > > >
> > > > > > Then for each row, shuffle the contents
> > around
> > > > randomly...
> > > > > >
> > > > > > D|A
> > > > > > B|E
> > > > > > C|G
> > > > > >
> > > > > > Then pivot the data back...
> > > > > >
> > > > > > A|E|G
> > > > > > D|B|C
> > > > >
> > > > > The general goal is to shuffle the elements
> > in each
> > > > column in the input
> > > > > data. Meaning, the ordering of the elements
> > in each
> > > > column will not be the
> > > > > same as in input.
> > > > >
> > > > > If you look at the initial input and compare
> > to the
> > > > final output, you'll
> > > > > see that during the shuffling, B and E are
> > swapped,
> > > > and G and C are swapped,
> > > > > while A and D were shuffled back into their
> > > > originating positions in the
> > > > > column.
> > > > >
> > > > > Once again, sorry for the typos and
> > confusion.
> > > > >
> > > > > Terrence A. Pietrondi
> > > > >
> > > > > --- On Fri, 10/3/08, Alex Loddengaard
> > > > <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > From: Alex Loddengaard
> > > >

nagios to monitor hadoop datanodes!

2008-10-06 Thread Gerardo Velez
Hi Everyone!


I would like to implement Nagios health monitoring of a Hadoop grid.

Some of you may have experience here; do you have any approach or advice I
could use?

At this time I've only been playing with the JSP pages that hadoop has
built in, so I'm not sure whether it would be a good idea for
nagios to request monitoring info from these JSPs?


Thanks in advance!


-- Gerardo


Searching Lucene Index built using Hadoop

2008-10-06 Thread Saranath

I'm trying to index a large dataset using Hadoop+Lucene. I used the example
under hadoop/trunk/src/contrib/index/ for indexing. I'm unable to find a way
to search the index that was successfully built.

I tried copying over the index to one machine and merging them using
IndexWriter.addIndexesNoOptimize().
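
For reference, such a single-machine merge might look roughly like the sketch
below (written against the Lucene 2.x API of that era; the shard and output
directory paths are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Rough sketch: merge shard indexes (already copied out of HDFS) into one
// local index. Paths are hypothetical; error handling is omitted.
public class MergeShards {
  public static void main(String[] args) throws Exception {
    Directory merged = FSDirectory.getDirectory("/data/index-merged");
    IndexWriter writer = new IndexWriter(merged, new StandardAnalyzer(), true);
    Directory[] shards = new Directory[] {
        FSDirectory.getDirectory("/data/shard-00000"),
        FSDirectory.getDirectory("/data/shard-00001")
    };
    writer.addIndexesNoOptimize(shards); // merge without optimizing each shard first
    writer.optimize();                   // optional: optimize the merged index
    writer.close();
  }
}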

I would like to hear your input on the best way to index+search large datasets.

Thanks,
Saranath
-- 
View this message in context: 
http://www.nabble.com/Searching-Lucene-Index-built-using-Hadoop-tp19842438p19842438.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



Re: How to GET row name/column name in HBase using JAVA API

2008-10-06 Thread Jean-Daniel Cryans
Please use the HBase mailing list for HBase-related questions:
http://hadoop.apache.org/hbase/mailing_lists.html#Users

Regarding your question, have you looked at
http://wiki.apache.org/hadoop/Hbase/HbaseRest ?

J-D

On Mon, Oct 6, 2008 at 12:05 AM, Trinh Tuan Cuong <[EMAIL PROTECTED]
> wrote:

> Hello guys,
>
>
>
> I'm trying to use Java to manipulate HBase using its API. At the moment, I'm
> trying to do some simple CRUD activities with it, and from here a little
> problem arises.
>
>
>
> What I was trying to do is: given an existing table, display/get the existing
> row names (in order to update and manipulate data) and column names (as String
> ? – or bytes ?). I could see that HStoreKey in the HBase API allows getting the
> row values, but it won't specify which table(s) it is working on ??
>
>
>
> So if any of you could please show me how to get the row and column family
> names of a given table, I thank you in advance.
>
>
>
> Best Regards,
>
>
>
> Trịnh Tuấn Cường
>
>
>
> Luvina Software Company
>
> Website : www.luvina.net
>
>
>
> Address : 1001 Hoang Quoc Viet Street
>
> Email : [EMAIL PROTECTED],[EMAIL PROTECTED]
>
> Mobile : 097 4574 457
>
>
>
>


Re: Searching Lucene Index built using Hadoop

2008-10-06 Thread Stefan Groschupf

Hi,
you might find http://katta.wiki.sourceforge.net/ interesting. If you  
have any katta related questions please use the katta mailing list.

Stefan

~~~
101tec Inc., Menlo Park, California
web:  http://www.101tec.com
blog: http://www.find23.net



On Oct 6, 2008, at 10:26 AM, Saranath wrote:



I'm trying to index a large dataset using Hadoop+Lucene. I used the  
example
under hadoop/trunk/src/conrib/index/ for indexing. I'm unable to  
find a way

to search the index that was successfully built.

I tried copying over the index to one machine and merging them using
IndexWriter.addIndexesNoOptimize().

I would like hear your input on the best way to index+search large  
datasets.


Thanks,
Saranath
--
View this message in context: 
http://www.nabble.com/Searching-Lucene-Index-built-using-Hadoop-tp19842438p19842438.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.






Weird problem running wordcount example from within Eclipse

2008-10-06 Thread Ski Gh3
Hi all,

I have a weird problem regarding running the wordcount example from eclipse.

I was able to run the wordcount example from the command line like:
$ ...MyHadoop/bin/hadoop jar ../MyHadoop/hadoop-xx-examples.jar wordcount
myinputdir myoutputdir

However, if I try to run the wordcount program from Eclipse (supplying the same
two args: myinputdir myoutputdir),
I get the following error message:

Exception in thread "main" java.lang.RuntimeException: java.io.IOException:
No FileSystem for scheme: file
at
org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:356)
at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:331)
at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:304)
at org.apache.hadoop.examples.WordCount.run(WordCount.java:149)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:161)
Caused by: java.io.IOException: No FileSystem for scheme: file
at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1277)
at org.apache.hadoop.fs.FileSystem.access$1(FileSystem.java:1273)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1291)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:108)
at
org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:352)
... 5 more

It seems that from within Eclipse, the program does not know how to interpret
myinputdir as a hadoop path?

Can someone please tell me how I can fix this?

Thanks a lot!!!


Questions regarding adding resource via Configuration

2008-10-06 Thread Tarandeep Singh
Hi,

I have a configuration file (similar to hadoop-site.xml) and I want to
include this file as a resource while running Map-Reduce jobs. Similarly, I
want to add a jar file that is required by Mappers and Reducers

ToolRunner.run( ...) allows me to do this easily; my question is, can I add
these files permanently? I am running a lot of different Map-Reduce jobs in
a loop, so is there a way I can add these files once so that subsequent jobs
need not add them?

Also, if I don't implement Tool and don't use ToolRunner.run, but call
Configuration.addResource( ), will the parameters defined in my
configuration file be available in mappers and reducers?
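
For context, the second approach being asked about would look roughly like the
sketch below (the resource file name is made up; whether the parameters then
reach the tasks is exactly the open question above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Rough sketch: load an extra resource into a Configuration and build the
// JobConf from it. The file path is hypothetical.
Configuration conf = new Configuration();
conf.addResource(new Path("/path/to/my-conf.xml"));
JobConf job = new JobConf(conf);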

Thanks,
Taran


Re: Turning off FileSystem statistics during MapReduce

2008-10-06 Thread Nathan Marz
We see this on Maps and only on incrementBytesRead (not on  
incrementBytesWritten). It is on HDFS where we are seeing the time  
spent. It seems that this is because incrementBytesRead is called  
every time a record is read, while incrementBytesWritten is only  
called when a buffer is spilled. We would benefit a lot from being  
able to turn this off.




On Oct 3, 2008, at 6:19 PM, Arun C Murthy wrote:


Nathan,

On Oct 3, 2008, at 5:18 PM, Nathan Marz wrote:


Hello,

We have been doing some profiling of our MapReduce jobs, and we are  
seeing about 20% of the time of our jobs is spent calling  
"FileSystem$Statistics.incrementBytesRead" when we interact with  
the FileSystem. Is there a way to turn this stats-collection off?




This is interesting... could you provide more details? Are you  
seeing this on Maps or Reduces? Which FileSystem exhibited this i.e.  
HDFS or LocalFS? Any details on about your application?


To answer your original question - no, there isn't a way to disable  
this. However, if this turns out to be a systemic problem we  
definitely should consider having an option to allow users to switch  
it off.


So any information you can provide helps - thanks!

Arun



Thanks,
Nathan Marz
Rapleaf







Add jar file via -libjars - giving errors

2008-10-06 Thread Tarandeep Singh
Hi,

I want to add a jar file (that is required by mappers and reducers) to the
classpath. Initially I had copied the jar file to all the slave nodes in the
$HADOOP_HOME/lib directory and it was working fine.

However when I tried the libjars option to add jar files -

$HADOOP_HOME/bin/hadoop  jar myApp.jar -conf $MY_CONF_FILE -libjars jdom.jar


I got this error-

java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder

Can someone please tell me what needs to be fixed here ?

Thanks,
Taran


Re: Add jar file via -libjars - giving errors

2008-10-06 Thread Mahadev Konar
Hi Tarandeep,
 the libjars option does not add the jar on the client side. There is an
open jira for that (I don't remember which one)...

You have to add the jar to the HADOOP_CLASSPATH on the client side so that it
gets picked up on the client side as well.


mahadev


On 10/6/08 2:30 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> I want to add a jar file (that is required by mappers and reducers) to the
> classpath. Initially I had copied the jar file to all the slave nodes in the
> $HADOOP_HOME/lib directory and it was working fine.
> 
> However when I tried the libjars option to add jar files -
> 
> $HADOOP_HOME/bin/hadoop  jar myApp.jar -conf $MY_CONF_FILE -libjars jdom.jar
> 
> 
> I got this error-
> 
> java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder
> 
> Can someone please tell me what needs to be fixed here ?
> 
> Thanks,
> Taran



Re: architecture diagram

2008-10-06 Thread Terrence A. Pietrondi
So looking at the following mapper...

http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup

On line 32, you can see the row split via a delimiter. On line 43, you can see 
that the field index (the column index) is the map key, and the map value is 
the field contents. How is this incorrect? I think this follows your earlier 
suggestion of:

"You may want to play with the following idea: collect key => column_number and 
value => column_contents in your map step."
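
(Roughly, the idea in that mapper looks like the sketch below. This is not the
exact code from the repository; the '|' delimiter and the class name are
assumptions for illustration.)

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of the idea: emit key = column index, value = field contents.
public class PivotMapperSketch extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split("\\|");
    for (int i = 0; i < fields.length; i++) {
      // column number as the key, field contents as the value
      output.collect(new Text(Integer.toString(i)), new Text(fields[i]));
    }
  }
}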

Terrence A. Pietrondi


--- On Mon, 10/6/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> From: Alex Loddengaard <[EMAIL PROTECTED]>
> Subject: Re: architecture diagram
> To: core-user@hadoop.apache.org
> Date: Monday, October 6, 2008, 12:55 PM
> As far as I know, splits will never be made within a line,
> only between
> rows.  To answer your question about ways to control the
> splits, see below:
> 
> 
> <
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
> >
> 
> Alex
> 
> On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi
> <[EMAIL PROTECTED]
> > wrote:
> 
> > Can you explain "The location of these splits is
> semi-arbitrary"? What if
> > the example was...
> >
> > AAA|BBB|CCC|DDD
> > EEE|FFF|GGG|HHH
> >
> >
> > Does this mean the split might be between CCC such
> that it results in
> > AAA|BBB|C and C|DDD for the first line? Is there a way
> to control this
> > behavior to split on my delimiter?
> >
> >
> > Terrence A. Pietrondi
> >
> >
> > --- On Sun, 10/5/08, Alex Loddengaard
> <[EMAIL PROTECTED]> wrote:
> >
> > > From: Alex Loddengaard
> <[EMAIL PROTECTED]>
> > > Subject: Re: architecture diagram
> > > To: core-user@hadoop.apache.org
> > > Date: Sunday, October 5, 2008, 9:26 PM
> > > Let's say you have one very large input file
> of the
> > > form:
> > >
> > > A|B|C|D
> > > E|F|G|H
> > > ...
> > > |1|2|3|4
> > >
> > > This input file will be broken up into N pieces,
> where N is
> > > the number of
> > > mappers that run.  The location of these splits
> is
> > > semi-arbitrary.  This
> > > means that unless you have one mapper, you
> won't be
> > > able to see the entire
> > > contents of a column in your mapper.  Given that
> you would
> > > need one mapper
> > > to be able to see the entirety of a column,
> you've now
> > > essentially reduced
> > > your problem to a single machine.
> > >
> > > You may want to play with the following idea:
> collect key
> > > => column_number
> > > and value => column_contents in your map step.
>  This
> > > means that you would be
> > > able to see the entirety of a column in your
> reduce step,
> > > though you're
> > > still faced with the tasks of shuffling and
> re-pivoting.
> > >
> > > Does this clear up your confusion?  Let me know
> if
> > > you'd like me to clarify
> > > more.
> > >
> > > Alex
> > >
> > > On Sun, Oct 5, 2008 at 3:54 PM, Terrence A.
> Pietrondi
> > > <[EMAIL PROTECTED]
> > > > wrote:
> > >
> > > > I am not sure why this doesn't fit,
> maybe you can
> > > help me understand. Your
> > > > previous comment was...
> > > >
> > > > "The reason I'm making this claim
> is because
> > > in order to do the pivot
> > > > operation you must know about every row.
> Your input
> > > files will be split at
> > > > semi-arbitrary places, essentially making it
> > > impossible for each mapper to
> > > > know every single row."
> > > >
> > > > Are you saying that my row segments might
> not actually
> > > be the entire row so
> > > > I will get a bad key index? If so, would the
> row
> > > segments be determined? I
> > > > based my initial work off of the word count
> example,
> > > where the lines are
> > > > tokenized. Does this mean in this example
> the row
> > > tokens may not be the
> > > > complete row?
> > > >
> > > > Thanks.
> > > >
> > > > Terrence A. Pietrondi
> > > >
> > > >
> > > > --- On Fri, 10/3/08, Alex Loddengaard
> > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > From: Alex Loddengaard
> > > <[EMAIL PROTECTED]>
> > > > > Subject: Re: architecture diagram
> > > > > To: core-user@hadoop.apache.org
> > > > > Date: Friday, October 3, 2008, 7:14 PM
> > > > > The approach that you've described
> does not
> > > fit well in
> > > > > to the MapReduce
> > > > > paradigm.  You may want to consider
> randomizing
> > > your data
> > > > > in a different
> > > > > way.
> > > > >
> > > > > Unfortunately some things can't be
> solved
> > > well with
> > > > > MapReduce, and I think
> > > > > this is one of them.
> > > > >
> > > > > Can someone else say more?
> > > > >
> > > > > Alex
> > > > >
> > > > > On Fri, Oct 3, 2008 at 8:15 AM,
> Terrence A.
> > > Pietrondi
> > > > > <[EMAIL PROTECTED]
> > > > > > wrote:
> > > > >
> > > > > > Sorry for the confusion, I did
> make some
> > > typos. My
> > > > > example should have
> > > > > > looked like...
> > > > > >
> > > > > > > A|B|C
> > > > > > > D|E|G
> > > > > > >
> > 

Re: Add jar file via -libjars - giving errors

2008-10-06 Thread Tarandeep Singh
thanks Mahadev for the reply.
So that means I have to copy my jar file in the $HADOOP_HOME/lib folder on
all slave machines like before.

One more question - I am adding a conf file (just like hadoop-site.xml) via
the -conf option and I am able to query parameters in mappers/reducers. But is
there a way I can query the parameters in my job driver class -

public class jobDriver extends Configured
{
   someMethod( )
   {
  ToolRunner.run( new MyJob( ), commandLineArgs);
  // I want to query parameters present in my conf file here
   }
}

public class MyJob extends Configured implements Tool
{
}

Thanks,
Taran

On Mon, Oct 6, 2008 at 2:46 PM, Mahadev Konar <[EMAIL PROTECTED]> wrote:

> HI Tarandeep,
>  the libjars options does not add the jar on the client side. Their is an
> open jira for that ( id ont remember which one)...
>
> Oyu have to add the jar to the
>
> HADOOP_CLASSPATH on the client side so that it gets picked up on the client
> side as well.
>
>
> mahadev
>
>
> On 10/6/08 2:30 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I want to add a jar file (that is required by mappers and reducers) to
> the
> > classpath. Initially I had copied the jar file to all the slave nodes in
> the
> > $HADOOP_HOME/lib directory and it was working fine.
> >
> > However when I tried the libjars option to add jar files -
> >
> > $HADOOP_HOME/bin/hadoop  jar myApp.jar -conf $MY_CONF_FILE -libjars
> jdom.jar
> >
> >
> > I got this error-
> >
> > java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder
> >
> > Can someone please tell me what needs to be fixed here ?
> >
> > Thanks,
> > Taran
>
>


Re: architecture diagram

2008-10-06 Thread Alex Loddengaard
This mapper does follow my original suggestion, though I'm not familiar with
how the delimiter works in this example.  Anyone else?

Alex

On Mon, Oct 6, 2008 at 2:55 PM, Terrence A. Pietrondi <[EMAIL PROTECTED]
> wrote:

> So looking at the following mapper...
>
>
> http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup
>
> On line 32, you can see the row split via a delimiter. On line 43, you can
> see that the field index (the column index) is the map key, and the map
> value is the field contents. How is this incorrect? I think this follows
> your earlier suggestion of:
>
> "You may want to play with the following idea: collect key => column_number
> and value => column_contents in your map step."
>
> Terrence A. Pietrondi
>
>
> --- On Mon, 10/6/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
>
> > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Monday, October 6, 2008, 12:55 PM
> > As far as I know, splits will never be made within a line,
> > only between
> > rows.  To answer your question about ways to control the
> > splits, see below:
> >
> > 
> > <
> >
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
> > >
> >
> > Alex
> >
> > On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi
> > <[EMAIL PROTECTED]
> > > wrote:
> >
> > > Can you explain "The location of these splits is
> > semi-arbitrary"? What if
> > > the example was...
> > >
> > > AAA|BBB|CCC|DDD
> > > EEE|FFF|GGG|HHH
> > >
> > >
> > > Does this mean the split might be between CCC such
> > that it results in
> > > AAA|BBB|C and C|DDD for the first line? Is there a way
> > to control this
> > > behavior to split on my delimiter?
> > >
> > >
> > > Terrence A. Pietrondi
> > >
> > >
> > > --- On Sun, 10/5/08, Alex Loddengaard
> > <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: Alex Loddengaard
> > <[EMAIL PROTECTED]>
> > > > Subject: Re: architecture diagram
> > > > To: core-user@hadoop.apache.org
> > > > Date: Sunday, October 5, 2008, 9:26 PM
> > > > Let's say you have one very large input file
> > of the
> > > > form:
> > > >
> > > > A|B|C|D
> > > > E|F|G|H
> > > > ...
> > > > |1|2|3|4
> > > >
> > > > This input file will be broken up into N pieces,
> > where N is
> > > > the number of
> > > > mappers that run.  The location of these splits
> > is
> > > > semi-arbitrary.  This
> > > > means that unless you have one mapper, you
> > won't be
> > > > able to see the entire
> > > > contents of a column in your mapper.  Given that
> > you would
> > > > need one mapper
> > > > to be able to see the entirety of a column,
> > you've now
> > > > essentially reduced
> > > > your problem to a single machine.
> > > >
> > > > You may want to play with the following idea:
> > collect key
> > > > => column_number
> > > > and value => column_contents in your map step.
> >  This
> > > > means that you would be
> > > > able to see the entirety of a column in your
> > reduce step,
> > > > though you're
> > > > still faced with the tasks of shuffling and
> > re-pivoting.
> > > >
> > > > Does this clear up your confusion?  Let me know
> > if
> > > > you'd like me to clarify
> > > > more.
> > > >
> > > > Alex
> > > >
> > > > On Sun, Oct 5, 2008 at 3:54 PM, Terrence A.
> > Pietrondi
> > > > <[EMAIL PROTECTED]
> > > > > wrote:
> > > >
> > > > > I am not sure why this doesn't fit,
> > maybe you can
> > > > help me understand. Your
> > > > > previous comment was...
> > > > >
> > > > > "The reason I'm making this claim
> > is because
> > > > in order to do the pivot
> > > > > operation you must know about every row.
> > Your input
> > > > files will be split at
> > > > > semi-arbitrary places, essentially making it
> > > > impossible for each mapper to
> > > > > know every single row."
> > > > >
> > > > > Are you saying that my row segments might
> > not actually
> > > > be the entire row so
> > > > > I will get a bad key index? If so, would the
> > row
> > > > segments be determined? I
> > > > > based my initial work off of the word count
> > example,
> > > > where the lines are
> > > > > tokenized. Does this mean in this example
> > the row
> > > > tokens may not be the
> > > > > complete row?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > Terrence A. Pietrondi
> > > > >
> > > > >
> > > > > --- On Fri, 10/3/08, Alex Loddengaard
> > > > <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > From: Alex Loddengaard
> > > > <[EMAIL PROTECTED]>
> > > > > > Subject: Re: architecture diagram
> > > > > > To: core-user@hadoop.apache.org
> > > > > > Date: Friday, October 3, 2008, 7:14 PM
> > > > > > The approach that you've described
> > does not
> > > > fit well in
> > > > > > to the MapReduce
> > > > > > paradigm.  You may want to consider
> > randomizing
> > > > your data
> > > > > > in a different
> > > > > > way.
> > > > > >
> >

Re: is 12 minutes ok for dfs chown -R on 45000 files ?

2008-10-06 Thread Allen Wittenauer



On 10/2/08 11:33 PM, "Frank Singleton" <[EMAIL PROTECTED]> wrote:

> Just to clarify, this is for when the chown will modify all files owner
> attributes
> 
> eg: toggle all from frank:frank to hadoop:hadoop (see below)

When we converted from 0.15 to 0.16, we chown'ed all of our files.  The
local dev team wrote the code in
https://issues.apache.org/jira/browse/HADOOP-3052 , but it wasn't committed
as a standard feature as they viewed this as a one off. :(

Needless to say, running a large chown as a MR job should be
significantly faster.
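
A very rough sketch of that idea (this is not the HADOOP-3052 code; the
owner/group names and the one-path-per-line input are assumptions):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch: each map task reads paths (one per line) and chowns them in parallel.
public class ChownMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private FileSystem fs;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text pathLine,
                  OutputCollector<NullWritable, NullWritable> out, Reporter reporter)
      throws IOException {
    // hypothetical target owner and group
    fs.setOwner(new Path(pathLine.toString().trim()), "hadoop", "hadoop");
  }
}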



Why is super user privilege required for FS statistics?

2008-10-06 Thread Brian Bockelman

Hey all,

I noticed something really funny about fuse-dfs: because super-user  
privileges are required to run the getStats function in  
FSNamesystem.java, my file systems show up as having 16 exabytes total  
and 0 bytes free.  If I mount fuse-dfs as root, then I get the correct  
results from df.


Is this an oversight?  Is there any good reason I shouldn't file a bug  
to make the getStats command (which only returns the used, free, and  
total space in the file system) not require superuser privilege?


Brian



Map and Reduce numbers are not restricted by setNumMapTasks and setNumReduceTasks, JobConf related?

2008-10-06 Thread Andy Li
Dears,

Sorry, I did not mean to cross post.  But the previous article was
accidentally posted to the HBase user list.  I would like to bring it back
to the Hadoop user since it is confusing me a lot and it is mainly MapReduce
related.

Currently running version hadoop-0.18.1 on 25 nodes.  Map and Reduce Task
Capacity is 92.  When I do this in my MapReduce program:

= SAMPLE CODE =
JobConf jconf = new JobConf(conf, TestTask.class);
jconf.setJobName("my.test.TestTask");
jconf.setOutputKeyClass(Text.class);
jconf.setOutputValueClass(Text.class);
jconf.setOutputFormat(TextOutputFormat.class);
jconf.setMapperClass(MyMapper.class);
jconf.setCombinerClass(MyReducer.class);
jconf.setReducerClass(MyReducer.class);
jconf.setInputFormat(TextInputFormat.class);
try {
jconf.setNumMapTasks(5);
jconf.setNumReduceTasks(3);
JobClient.runJob(jconf);
} catch (Exception e) {
e.printStackTrace();
}
= = =

When I run the job, I'm always getting 300 mappers and 1 reducer from the
JobTracker webpage running on the default port 50030.
No matter how I configure the numbers in methods "setNumMapTasks" and
"setNumReduceTasks", I get the same result.
Does anyone know why this is happening?
Am I missing something or misunderstand something in the picture?  =(

Here's a reference to the parameters we have overridden in "hadoop-site.xml".
===
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
===
other parameters are default from hadoop-default.xml.

Any idea how this is happening?

Any inputs are appreciated.

Thanks,
-Andy


Re: architecture diagram

2008-10-06 Thread Samuel Guo
I think what Alex talked about with 'split' is the mapreduce system's action.
What you said about 'split' is your mapper's action.

I guess that your map/reduce application uses *TextInputFormat* to read
your input file.

Your input file will first be split into a few file splits. These splits are
roughly <file, start offset, length> descriptions. What Alex said about 'The
location of these splits is semi-arbitrary' means that each file split's
offset in your input file is semi-arbitrary. Am I right, Alex?
Then *TextInputFormat* will translate these file splits into a sequence of
lines, where the offset is treated as the key and the line is treated as the
value.

Because these file splits are made by byte offset, some lines in your file may
be split across different file splits. The *LineRecordReader* used by
*TextInputFormat* skips the partial line at the start of a split to
make sure that every mapper gets complete lines, one by one.

For example:

a file as below:

AAA BBB CCC DDD
EEE FFF GGG HHH
AAA BBB CCC DDD


it may be split into two file splits (we assume that there are two
mappers):
split one:

AAA BBB CCC

split two:
DDD
EEE FFF GGG HHH
AAA BBB CCC DDD


Take split two as an example:
TextInputFormat will use LineRecordReader to translate split two into a
sequence of <offset, line> pairs, and it will skip the first half-baked line
"DDD", so the sequence will be:

<offset1, "EEE FFF GGG HHH">
<offset2, "AAA BBB CCC DDD">

Then what to do with the lines depends on your job.


On Tue, Oct 7, 2008 at 5:55 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]
> wrote:

> So looking at the following mapper...
>
>
> http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup
>
> On line 32, you can see the row split via a delimiter. On line 43, you can
> see that the field index (the column index) is the map key, and the map
> value is the field contents. How is this incorrect? I think this follows
> your earlier suggestion of:
>
> "You may want to play with the following idea: collect key => column_number
> and value => column_contents in your map step."
>
> Terrence A. Pietrondi
>
>
> --- On Mon, 10/6/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
>
> > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Monday, October 6, 2008, 12:55 PM
> > As far as I know, splits will never be made within a line,
> > only between
> > rows.  To answer your question about ways to control the
> > splits, see below:
> >
> > 
> > <
> >
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
> > >
> >
> > Alex
> >
> > On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi
> > <[EMAIL PROTECTED]
> > > wrote:
> >
> > > Can you explain "The location of these splits is
> > semi-arbitrary"? What if
> > > the example was...
> > >
> > > AAA|BBB|CCC|DDD
> > > EEE|FFF|GGG|HHH
> > >
> > >
> > > Does this mean the split might be between CCC such
> > that it results in
> > > AAA|BBB|C and C|DDD for the first line? Is there a way
> > to control this
> > > behavior to split on my delimiter?
> > >
> > >
> > > Terrence A. Pietrondi
> > >
> > >
> > > --- On Sun, 10/5/08, Alex Loddengaard
> > <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: Alex Loddengaard
> > <[EMAIL PROTECTED]>
> > > > Subject: Re: architecture diagram
> > > > To: core-user@hadoop.apache.org
> > > > Date: Sunday, October 5, 2008, 9:26 PM
> > > > Let's say you have one very large input file
> > of the
> > > > form:
> > > >
> > > > A|B|C|D
> > > > E|F|G|H
> > > > ...
> > > > |1|2|3|4
> > > >
> > > > This input file will be broken up into N pieces,
> > where N is
> > > > the number of
> > > > mappers that run.  The location of these splits
> > is
> > > > semi-arbitrary.  This
> > > > means that unless you have one mapper, you
> > won't be
> > > > able to see the entire
> > > > contents of a column in your mapper.  Given that
> > you would
> > > > need one mapper
> > > > to be able to see the entirety of a column,
> > you've now
> > > > essentially reduced
> > > > your problem to a single machine.
> > > >
> > > > You may want to play with the following idea:
> > collect key
> > > > => column_number
> > > > and value => column_contents in your map step.
> >  This
> > > > means that you would be
> > > > able to see the entirety of a column in your
> > reduce step,
> > > > though you're
> > > > still faced with the tasks of shuffling and
> > re-pivoting.
> > > >
> > > > Does this clear up your confusion?  Let me know
> > if
> > > > you'd like me to clarify
> > > > more.
> > > >
> > > > Alex
> > > >
> > > > On Sun, Oct 5, 2008 at 3:54 PM, Terrence A.
> > Pietrondi
> > > > <[EMAIL PROTECTED]
> > > > > wrote:
> > > >
> > > > > I am not sure why this doesn't fit,
> > maybe you can
> > > > help me understand. Your
> > > > > previous comment was...
> > > > >
> > > > > "The reason I'm making this claim
> > is because
> > > > in order to do the pivot
> > > > > operatio

Re: Map and Reduce numbers are not restricted by setNumMapTasks and setNumReduceTasks, JobConf related?

2008-10-06 Thread Samuel Guo
The number of mappers depends on your InputFormat.
The default InputFormat tries to treat every file block of a file as an
InputSplit, and you will get the same number of mappers as the number of your
InputSplits.
Try configuring "mapred.min.split.size" to reduce the number of your mappers
if you want to.
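
For example, something along these lines (a sketch only; the 256 MB value is an
arbitrary example, and TestTask is the class from your sample code):

import org.apache.hadoop.mapred.JobConf;

// Rough sketch: a larger minimum split size means fewer, larger splits and
// therefore fewer map tasks. The reduce count is taken directly from the job.
JobConf jconf = new JobConf(TestTask.class);
jconf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
jconf.setNumReduceTasks(3);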

And I don't know why your reducer count is just one. Does anyone know?

On Tue, Oct 7, 2008 at 9:06 AM, Andy Li <[EMAIL PROTECTED]> wrote:

> Dears,
>
> Sorry, I did not mean to cross post.  But the previous article was
> accidentally posted to the HBase user list.  I would like to bring it back
> to the Hadoop user since it is confusing me a lot and it is mainly
> MapReduce
> related.
>
> Currently running version hadoop-0.18.1 on 25 nodes.  Map and Reduce Task
> Capacity is 92.  When I do this in my MapReduce program:
>
> = SAMPLE CODE =
>JobConf jconf = new JobConf(conf, TestTask.class);
>jconf.setJobName("my.test.TestTask");
>jconf.setOutputKeyClass(Text.class);
>jconf.setOutputValueClass(Text.class);
>jconf.setOutputFormat(TextOutputFormat.class);
>jconf.setMapperClass(MyMapper.class);
>jconf.setCombinerClass(MyReducer.class);
>jconf.setReducerClass(MyReducer.class);
>jconf.setInputFormat(TextInputFormat.class);
>try {
>jconf.setNumMapTasks(5);
>jconf.setNumReduceTasks(3);
>JobClient.runJob(jconf);
>} catch (Exception e) {
>e.printStackTrace();
>}
> = = =
>
> When I run the job, I'm always getting 300 mappers and 1 reducers from the
> JobTracker webpage running on the default port 50030.
> No matter how I configure the numbers in methods "setNumMapTasks" and
> "setNumReduceTasks", I get the same result.
> Does anyone know why this is happening?
> Am I missing something or misunderstand something in the picture?  =(
>
> Here's a reference to the parameters we have override in "hadoop-site.xml".
> ===
> <property>
>  <name>mapred.tasktracker.map.tasks.maximum</name>
>  <value>4</value>
> </property>
>
> <property>
>  <name>mapred.tasktracker.reduce.tasks.maximum</name>
>  <value>4</value>
> </property>
> ===
> other parameters are default from hadoop-default.xml.
>
> Any idea how this is happening?
>
> Any inputs are appreciated.
>
> Thanks,
> -Andy
>


Re: Add jar file via -libjars - giving errors

2008-10-06 Thread Taeho Kang
Adding your jar files in the $HADOOP_HOME/lib folder works, but you would
have to restart all your tasktrackers to have your jar files loaded.

If you repackage your map-reduce jar file (e.g. hadoop-0.18.0-examples.jar)
with your jar file and run your job with the newly repackaged jar file, it
would work, too.

On Tue, Oct 7, 2008 at 6:55 AM, Tarandeep Singh <[EMAIL PROTECTED]> wrote:

> thanks Mahadev for the reply.
> So that means I have to copy my jar file in the $HADOOP_HOME/lib folder on
> all slave machines like before.
>
> One more question- I am adding a conf file (just like HADOOP_SITE.xml) via
> -conf option and I am able to query parameters in mapper/reducers. But is
> there a way I can query the parameters in my job driver class -
>
> public class jobDriver extends Configured
> {
>   someMethod( )
>   {
>  ToolRunner.run( new MyJob( ), commandLineArgs);
>  // I want to query parameters present in my conf file here
>   }
> }
>
> public class MyJob extends Configured implements Tool
> {
> }
>
> Thanks,
> Taran
>
> On Mon, Oct 6, 2008 at 2:46 PM, Mahadev Konar <[EMAIL PROTECTED]>
> wrote:
>
> > HI Tarandeep,
> >  the libjars options does not add the jar on the client side. Their is an
> > open jira for that ( id ont remember which one)...
> >
> > Oyu have to add the jar to the
> >
> > HADOOP_CLASSPATH on the client side so that it gets picked up on the
> client
> > side as well.
> >
> >
> > mahadev
> >
> >
> > On 10/6/08 2:30 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >
> > > I want to add a jar file (that is required by mappers and reducers) to
> > the
> > > classpath. Initially I had copied the jar file to all the slave nodes
> in
> > the
> > > $HADOOP_HOME/lib directory and it was working fine.
> > >
> > > However when I tried the libjars option to add jar files -
> > >
> > > $HADOOP_HOME/bin/hadoop  jar myApp.jar -conf $MY_CONF_FILE -libjars
> > jdom.jar
> > >
> > >
> > > I got this error-
> > >
> > > java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder
> > >
> > > Can someone please tell me what needs to be fixed here ?
> > >
> > > Thanks,
> > > Taran
> >
> >
>


Re: nagios to monitor hadoop datanodes!

2008-10-06 Thread Taeho Kang
The easiest approach I can think of is to write a simple Nagios plugin that
checks whether the datanode JVM process is alive. Or you may
write a Nagios plugin that checks for error or warning messages in the datanode
logs. (I am sure you can find quite a few log-checking Nagios plugins on
nagiosplugin.org.)
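
As a very rough illustration of the first idea, a Nagios-style check could be
as small as the sketch below. The host and port are assumptions (50075 is the
usual datanode web UI port); exit code 0 means OK and 2 means CRITICAL, per
the Nagios plugin convention.

import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: exit 0 (OK) if the datanode's web UI answers, 2 (CRITICAL) otherwise.
public class CheckDatanode {
  public static void main(String[] args) {
    String host = args.length > 0 ? args[0] : "localhost";
    int port = args.length > 1 ? Integer.parseInt(args[1]) : 50075;
    try {
      URL url = new URL("http://" + host + ":" + port + "/");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setConnectTimeout(5000);
      conn.setReadTimeout(5000);
      int code = conn.getResponseCode();
      System.out.println("DATANODE OK - HTTP " + code);
      System.exit(code < 500 ? 0 : 2);
    } catch (Exception e) {
      System.out.println("DATANODE CRITICAL - " + e.getMessage());
      System.exit(2);
    }
  }
}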

If you are unsure of how to write a nagios plugin, I suggest you read
"Leverage Nagios with plug-ins you write"
(http://www.ibm.com/developerworks/aix/library/au-nagios/) as it has good
explanations and examples of how to write nagios plugins.

Or if you've got time to burn, you might want to read Nagios documentation,
too.

Let me know if you need help on this matter.

/Taeho



On Tue, Oct 7, 2008 at 2:05 AM, Gerardo Velez <[EMAIL PROTECTED]>wrote:

> Hi Everyone!
>
>
> I would like to implement Nagios health monitoring of a Hadoop grid.
>
> Some of you have some experience here, do you hace any approach or advice I
> could use.
>
> At this time I've been only playing with jsp's files that hadoop has
> integrated into it. so I;m not sure if it could be a good idea that
> nagios monitor request info to these jsp?
>
>
> Thanks in advance!
>
>
> -- Gerardo
>


Re: Add jar file via -libjars - giving errors

2008-10-06 Thread Mahadev Konar
You can just add the jar to the env variable HADOOP_CLASSPATH.

If using bash, just do this:

export HADOOP_CLASSPATH=<path to your jar on the client>

And then use the -libjars option.

mahadev


On 10/6/08 2:55 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:

> thanks Mahadev for the reply.
> So that means I have to copy my jar file in the $HADOOP_HOME/lib folder on
> all slave machines like before.
> 
> One more question- I am adding a conf file (just like HADOOP_SITE.xml) via
> -conf option and I am able to query parameters in mapper/reducers. But is
> there a way I can query the parameters in my job driver class -
> 
> public class jobDriver extends Configured
> {
>someMethod( )
>{
>   ToolRunner.run( new MyJob( ), commandLineArgs);
>   // I want to query parameters present in my conf file here
>}
> }
> 
> public class MyJob extends Configured implements Tool
> {
> }
> 
> Thanks,
> Taran
> 
> On Mon, Oct 6, 2008 at 2:46 PM, Mahadev Konar <[EMAIL PROTECTED]> wrote:
> 
>> HI Tarandeep,
>>  the libjars options does not add the jar on the client side. Their is an
>> open jira for that ( id ont remember which one)...
>> 
>> Oyu have to add the jar to the
>> 
>> HADOOP_CLASSPATH on the client side so that it gets picked up on the client
>> side as well.
>> 
>> 
>> mahadev
>> 
>> 
>> On 10/6/08 2:30 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
>> 
>>> Hi,
>>> 
>>> I want to add a jar file (that is required by mappers and reducers) to
>> the
>>> classpath. Initially I had copied the jar file to all the slave nodes in
>> the
>>> $HADOOP_HOME/lib directory and it was working fine.
>>> 
>>> However when I tried the libjars option to add jar files -
>>> 
>>> $HADOOP_HOME/bin/hadoop  jar myApp.jar -conf $MY_CONF_FILE -libjars
>> jdom.jar
>>> 
>>> 
>>> I got this error-
>>> 
>>> java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder
>>> 
>>> Can someone please tell me what needs to be fixed here ?
>>> 
>>> Thanks,
>>> Taran
>> 
>> 



Re: Add jar file via -libjars - giving errors

2008-10-06 Thread Amareshwari Sriramadasu

Hi,

From 0.19, the jars added using -libjars are available on the client 
classpath also, fixed by HADOOP-3570.


Thanks
Amareshwari

Mahadev Konar wrote:

HI Tarandeep,
 the libjars options does not add the jar on the client side. Their is an
open jira for that ( id ont remember which one)...

Oyu have to add the jar to the

HADOOP_CLASSPATH on the client side so that it gets picked up on the client
side as well.


mahadev


On 10/6/08 2:30 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:

  

Hi,

I want to add a jar file (that is required by mappers and reducers) to the
classpath. Initially I had copied the jar file to all the slave nodes in the
$HADOOP_HOME/lib directory and it was working fine.

However when I tried the libjars option to add jar files -

$HADOOP_HOME/bin/hadoop  jar myApp.jar -conf $MY_CONF_FILE -libjars jdom.jar


I got this error-

java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder

Can someone please tell me what needs to be fixed here ?

Thanks,
Taran



  




Re: Map and Reduce numbers are not restricted by setNumMapTasks and setNumReduceTasks, JobConf related?

2008-10-06 Thread Andy Li
Thanks Samuel.  I have tried to look for answers and done some trial-and-error
in my own program.

The only way I know so far to enforce this is to assign each file to
one Mapper with your own customized InputFormat and RecordReader, and to
override isSplitable() to always return false.

In another mail thread, someone posted about this.  I found this link:
http://www.nabble.com/1-file-per-record-td19644985.html
which shows you how to prevent the file from being split.
Hadoop Wiki FAQ 10 also shows several ways to assign one file to one
Mapper.
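
For example, a minimal sketch of the isSplitable() approach (the class name is
made up; set it on the job with jconf.setInputFormat(...)):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch: an input format whose files are never split, so each input file is
// handled by exactly one mapper.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}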

I think there should be some documentation indicating that "setNumMapTasks"
and "setNumReduceTasks"
can be overridden by the splitting.  It was misleading the first time
I used them.  I expected that the files would be
split into blocks/chunks, that each block/chunk would be assigned to a Mapper,
and that the maximum Mapper count would be controlled
by the numbers specified in "setNumMapTasks" and "setNumReduceTasks".

Unfortunately, that expectation, based on the method names, did not hold.
=(
Does anyone know if this is the correct answer to this problem, or are they
actually 2 different things?

Thanks,
-Andy

On Mon, Oct 6, 2008 at 7:02 PM, Samuel Guo <[EMAIL PROTECTED]> wrote:

> Mapper's Number depends on your inputformat.
> Default Inputformat try to treat every file block of a file as a
> InputSplit.
> And you will get the same number of mappers as the number of your
> inputsplits.
> try to configure "mapred.min.split.size" to reduce the number of your
> mapper
> if you want to.
>
> And I don't know why your reducer is just one. Anyone knows?
>
> On Tue, Oct 7, 2008 at 9:06 AM, Andy Li <[EMAIL PROTECTED]> wrote:
>
> > Dears,
> >
> > Sorry, I did not mean to cross post.  But the previous article was
> > accidentally posted to the HBase user list.  I would like to bring it
> back
> > to the Hadoop user since it is confusing me a lot and it is mainly
> > MapReduce
> > related.
> >
> > Currently running version hadoop-0.18.1 on 25 nodes.  Map and Reduce Task
> > Capacity is 92.  When I do this in my MapReduce program:
> >
> > = SAMPLE CODE =
> >JobConf jconf = new JobConf(conf, TestTask.class);
> >jconf.setJobName("my.test.TestTask");
> >jconf.setOutputKeyClass(Text.class);
> >jconf.setOutputValueClass(Text.class);
> >jconf.setOutputFormat(TextOutputFormat.class);
> >jconf.setMapperClass(MyMapper.class);
> >jconf.setCombinerClass(MyReducer.class);
> >jconf.setReducerClass(MyReducer.class);
> >jconf.setInputFormat(TextInputFormat.class);
> >try {
> >jconf.setNumMapTasks(5);
> >jconf.setNumReduceTasks(3);
> >JobClient.runJob(jconf);
> >} catch (Exception e) {
> >e.printStackTrace();
> >}
> > = = =
> >
> > When I run the job, I'm always getting 300 mappers and 1 reducers from
> the
> > JobTracker webpage running on the default port 50030.
> > No matter how I configure the numbers in methods "setNumMapTasks" and
> > "setNumReduceTasks", I get the same result.
> > Does anyone know why this is happening?
> > Am I missing something or misunderstand something in the picture?  =(
> >
> > Here's a reference to the parameters we have override in
> "hadoop-site.xml".
> > ===
> > <property>
> >  <name>mapred.tasktracker.map.tasks.maximum</name>
> >  <value>4</value>
> > </property>
> >
> > <property>
> >  <name>mapred.tasktracker.reduce.tasks.maximum</name>
> >  <value>4</value>
> > </property>
> > ===
> > other parameters are default from hadoop-default.xml.
> >
> > Any idea how this is happening?
> >
> > Any inputs are appreciated.
> >
> > Thanks,
> > -Andy
> >
>


Re: Map and Reduce numbers are not restricted by setNumMapTasks and setNumReduceTasks, JobConf related?

2008-10-06 Thread Samuel Guo
On Tue, Oct 7, 2008 at 1:12 PM, Andy Li <[EMAIL PROTECTED]> wrote:

> I think there should be some documents where it indicates that the
> "setNumMapTaks" and "setNumReduceTaks"
> will be override by the splitting.  It was misleading at the first time
> when
> I use them.  I expect that the files will be
> split into blocks/chunks and each block/chunk will be assign to a Mapper.
> The maximum Mapper count will be controlled
> by the number specified in "setNumMapTaks" and "setNumReduceTaks".
>

check the document:
http://hadoop.apache.org/core/docs/r0.18.1/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)
http://hadoop.apache.org/core/docs/r0.18.1/api/org/apache/hadoop/mapred/JobConf.html#setNumReduceTasks(int)


>
> Unfortunately, this was not the exact expectation base on the method name.
> =(
> Anyone know if this is the correct answer to this problem? or they are
> actually 2 different things?
>
> Thanks,
> -Andy
>