Re: client connect as different username?

2008-06-11 Thread s29752-hadoopuser
This information can be found in 
http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html
Nicholas


- Original Message 
> From: Chris Collins <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, June 11, 2008 9:31:18 PM
> Subject: Re: client connect as different username?
> 
> Thanks Doug, should this be added to the permissions doc or to the  
> faq?  See you in Sonoma.
> 
> C
> On Jun 11, 2008, at 9:15 PM, Doug Cutting wrote:
> 
> > Chris Collins wrote:
> >> You are referring to creating a directory in hdfs?  Because if I am
> >> user chris and the hdfs only has user foo, then I can't create a
> >> directory because I don't have perms; in fact I can't even connect.
> >
> > Today, users and groups are declared by the client.  The namenode  
> > only records and checks against user and group names provided by the  
> > client.  So if someone named "foo" writes a file, then that file is  
> > owned by someone named "foo" and anyone named "foo" is the owner of  
> > that file. No "foo" account need exist on the namenode.
> >
> > The one (important) exception is the "superuser".  Whatever user  
> > name starts the namenode is the superuser for that filesystem.  And  
> > if "/" is not world writable, a new filesystem will not contain a  
> > home directory (or anywhere else) writable by other users.  So, in a  
> > multiuser Hadoop installation, the superuser needs to create home  
> > directories and project directories for other users and set their  
> > protections accordingly before other users can do anything.  Perhaps  
> > this is what you've run into?
> >
> > Doug



Re: client connect as different username?

2008-06-11 Thread Chris Collins
Thanks Doug, should this be added to the permissions doc or to the  
faq?  See you in Sonoma.


C
On Jun 11, 2008, at 9:15 PM, Doug Cutting wrote:


Chris Collins wrote:
You are referring to creating a directory in hdfs?  Because if I am
user chris and the hdfs only has user foo, then I can't create a
directory because I don't have perms; in fact I can't even connect.


Today, users and groups are declared by the client.  The namenode  
only records and checks against user and group names provided by the  
client.  So if someone named "foo" writes a file, then that file is  
owned by someone named "foo" and anyone named "foo" is the owner of  
that file. No "foo" account need exist on the namenode.


The one (important) exception is the "superuser".  Whatever user  
name starts the namenode is the superuser for that filesystem.  And  
if "/" is not world writable, a new filesystem will not contain a  
home directory (or anywhere else) writable by other users.  So, in a  
multiuser Hadoop installation, the superuser needs to create home  
directories and project directories for other users and set their  
protections accordingly before other users can do anything.  Perhaps  
this is what you've run into?


Doug




Re: client connect as different username?

2008-06-11 Thread Doug Cutting

Chris Collins wrote:
You are referring to creating a directory in hdfs?  Because if I am user
chris and the hdfs only has user foo, then I can't create a directory
because I don't have perms; in fact I can't even connect.


Today, users and groups are declared by the client.  The namenode only 
records and checks against user and group names provided by the client. 
 So if someone named "foo" writes a file, then that file is owned by 
someone named "foo" and anyone named "foo" is the owner of that file. 
No "foo" account need exist on the namenode.


The one (important) exception is the "superuser".  Whatever user name 
starts the namenode is the superuser for that filesystem.  And if "/" is 
not world writable, a new filesystem will not contain a home directory 
(or anywhere else) writable by other users.  So, in a multiuser Hadoop 
installation, the superuser needs to create home directories and project 
directories for other users and set their protections accordingly before 
other users can do anything.  Perhaps this is what you've run into?


Doug
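
[A minimal sketch of the superuser setup Doug describes, using the
0.17-era dfs shell. The user names chris and ana, the group
"engineering", and the /projects path are made up for illustration.]

# run as the user that started the namenode, i.e. the HDFS superuser
bin/hadoop dfs -mkdir /user/chris
bin/hadoop dfs -chown chris /user/chris
bin/hadoop dfs -mkdir /user/ana
bin/hadoop dfs -chown ana /user/ana
# a shared project area, writable by everyone in one group
bin/hadoop dfs -mkdir /projects/search
bin/hadoop dfs -chown chris:engineering /projects/search
bin/hadoop dfs -chmod 775 /projects/search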


Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Chris K Wensel
However, for continuous production data processing, hadoop+cascading  
sounds like a good option.



This will be especially true with stream assertions and traps (as  
mentioned previously, and available in trunk). 


I've written workloads for clients that render down to ~60 unique  
Hadoop map/reduce jobs, all inter-related, from ~10 unique units of  
work (internally lots of joins, sorts and math). I can't imagine  
having written them by hand.


ckw

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/







Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Chris K Wensel

Thanks Ted..

Couple quick comments.

At one level Cascading is a MapReduce query planner, just like PIG,
except that the Cascading API is meant for public consumption and is
fully extensible, whereas in PIG you typically interact with the PigLatin
syntax. Consequently, with Cascading you can layer your own syntax on top
of the API. Currently there is Groovy support (Groovy is used to assemble
the work; it does not run on the mappers or reducers). I hear rumors
about Jython elsewhere.


A couple groovy examples (note these are obviously trivial, the dsl  
can absorb tremendous complexity if need be)...

http://code.google.com/p/cascading/source/browse/trunk/cascading.groovy/sample/wordcount.groovy
http://code.google.com/p/cascading/source/browse/trunk/cascading.groovy/sample/widefinder.groovy

Since Cascading is in part a 'planner', it actually builds internally  
a new representation from what the developer assembled and renders  
out  the necessary map/reduce jobs (and transparently links them) at  
runtime. As Hadoop evolves, the planner will incorporate the new  
features and leverage them transparently. Plus there are opportunities  
for identifying patterns and applying different strategies  
(hypothetically map side vs reduce side joins, for one). It is also  
conceivable (but untried) that different planners can exist to target  
different systems other than Hadoop (making your code/libraries  
portable). Much of this is true for PIG as well.

http://www.cascading.org/documentation/overview.html

Also, Cascading will at some point provide a PIG adapter, allowing  
PigLatin queries to participate in a larger Cascading 'Cascade' (the  
topological scheduler). Cascading is great with integration,  
connecting things outside Hadoop with stuff to be done inside Hadoop.  
And PIG looks like a great way to concisely represent a complex  
solution and execute it. There isn't any reason they can't work  
together (it has always been the intention).


The takeaway is that with Cascading and PIG, users do not think in  
MapReduce. With PIG, you think in PigLatin. With Cascading, you can  
use the pipe/filter based API, or use your favorite scripting language  
and build a DSL for your problem domain.


Many companies have done similar things internally, but they tend to  
be nothing more than a scriptable way to write a map/reduce job and  
glue them together. You still think in MapReduce, which in my opinion  
doesn't scale well.


My (biased) recommendation is this.

Build out your application in Cascading. If part of the problem is  
best represented in PIG, no worries use PIG and feed and clean up  
after PIG with Cascading. And if you see a solvable bottleneck, and we  
can't convince the planner to recognize the pattern and plan better,  
replace that piece of the process with a custom MapReduce job (or more).


Solve your problem first, then optimize the solution, if need be.

ckw

On Jun 11, 2008, at 5:00 PM, Ted Dunning wrote:

Pig is much more ambitious than cascading.  Because of the  
ambitions, simple
things got overlooked.  For instance, something as simple as  
computing a

file name to load is not possible in pig, nor is it possible to write
functions in pig.  You can hook to Java functions (for some things),  
but you
can't really write programs in pig.  On the other hand, pig may  
eventually

provide really incredible capabilities including program rewriting and
optimization that would be incredibly hard to write directly in Java.

The point of cascading was simply to make life easier for a normal
Java/map-reduce programmer.  It provides an abstraction for gluing  
together
several map-reduce programs and for doing a few common things like  
joins.
Because you are still writing Java (or Groovy) code, you have all of  
the
functionality you always had.  But, this same benefit costs you the  
future

in terms of what optimizations are likely to ever be possible.

The summary for us (especially 4-6 months ago when we were deciding)  
is that
cascading is good enough to use now and pig will probably be more  
useful

later.

On Wed, Jun 11, 2008 at 4:19 PM, Haijun Cao <[EMAIL PROTECTED]>  
wrote:




I find cascading very similar to pig, do you care to provide your comment
here? If map reduce programmers are to go to the next level
(scripting/query language), which way to go?





--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/







RE: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Haijun Cao
Thanks for sharing. We need to expose our hadoop cluster to 'casual' users for 
ad-hoc queries. I find it difficult to ask them to write map reduce programs; pig 
latin comes in very handy in this case. However, for continuous production data 
processing, hadoop+cascading sounds like a good option. 

Haijun

-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 11, 2008 5:01 PM
To: core-user@hadoop.apache.org
Subject: Re: does anyone have idea on how to run multiple sequential jobs with 
bash script

Pig is much more ambitious than cascading.  Because of the ambitions, simple
things got overlooked.  For instance, something as simple as computing a
file name to load is not possible in pig, nor is it possible to write
functions in pig.  You can hook to Java functions (for some things), but you
can't really write programs in pig.  On the other hand, pig may eventually
provide really incredible capabilities including program rewriting and
optimization that would be incredibly hard to write directly in Java.

The point of cascading was simply to make life easier for a normal
Java/map-reduce programmer.  It provides an abstraction for gluing together
several map-reduce programs and for doing a few common things like joins.
Because you are still writing Java (or Groovy) code, you have all of the
functionality you always had.  But, this same benefit costs you the future
in terms of what optimizations are likely to ever be possible.

The summary for us (especially 4-6 months ago when we were deciding) is that
cascading is good enough to use now and pig will probably be more useful
later.

On Wed, Jun 11, 2008 at 4:19 PM, Haijun Cao <[EMAIL PROTECTED]> wrote:

>
> I find cascading very similar to pig, do you care to provide your comment
> here? If map reduce programmers are to go to the next level (scripting/query
> language), which way to go?
>
>
>


Re: client connect as different username?

2008-06-11 Thread Chris Collins
We know whoami is called, thanks; I found that out painfully the first day
I played with this, because in dev my IDE is started not from a shell.
Therefore the path is not inherited to include /usr/bin.  The HDFS
client hides the fact that ProcessBuilder barfs with a file not
found, reporting instead a "login exception", "whoami".  Not as clear as I
would have liked :-}|


You are referring to creating a directory in hdfs?  Because if I am
user chris and the hdfs only has user foo, then I can't create a
directory because I don't have perms; in fact I can't even connect.  I
believe another emailer holds the answer, which was blindly dumb on my
part for not trying: adding a user in unix and creating a
group that those users belong to.


Thanks

Chris
On Jun 11, 2008, at 5:36 PM, Allen Wittenauer wrote:





On 6/11/08 5:17 PM, "Chris Collins" <[EMAIL PROTECTED]> wrote:

The finer point to this is that in development you may be logged in as
user x and have a shared hdfs instance that a number of people are
using.  In that mode it's not practical to sudo, as you have all your
development tools set up for user x.  hdfs is set up with a single user;
what is the procedure to add users to that hdfs instance?  It has to
support it, surely?  It's really not obvious; looking in the hdfs docs
that come with the distro, nothing springs out.  The hadoop command
line tool doesn't have anything that vaguely looks like a way to create
a user.


   User information is sent from the client.  The code literally  
does a

'whoami' and 'groups' and sends that information to the server.

   Shared data should be handled just like you would in UNIX:

   - create a directory
   - set permissions to be insecure
   - go crazy







Re: client connect as different username?

2008-06-11 Thread Allen Wittenauer



On 6/11/08 5:17 PM, "Chris Collins" <[EMAIL PROTECTED]> wrote:

> The finer point to this is that in development you may be logged in as
> user x and have a shared hdfs instance that a number of people are
> using.  In that mode it's not practical to sudo, as you have all your
> development tools set up for user x.  hdfs is set up with a single user;
> what is the procedure to add users to that hdfs instance?  It has to
> support it, surely?  It's really not obvious; looking in the hdfs docs
> that come with the distro, nothing springs out.  The hadoop command
> line tool doesn't have anything that vaguely looks like a way to create
> a user.

User information is sent from the client.  The code literally does a
'whoami' and 'groups' and sends that information to the server.

Shared data should be handled just like you would in UNIX:

- create a directory
- set permissions to be insecure
- go crazy
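
[A rough shell rendering of the recipe above; the /shared path is made up,
and "insecure" is taken to mean world-writable.]

bin/hadoop dfs -mkdir /shared
bin/hadoop dfs -chmod 777 /shared
# any client-declared user and group can now write under /shared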

  



RE: client connect as different username?

2008-06-11 Thread Xavier Stevens
This is how I've done it before:

1.) Create a hadoop user/group.  
2.) Make the local filesystem dfs directories writable by the hadoop
group and set the sticky bit.  
3.) Run hadoop as the hadoop user.
4.) Then add all of your users to the hadoop group.  

I also changed the dfs.permissions.supergroup property to "hadoop" in
$HADOOP_HOME/conf/hadoop-site.xml.

This works pretty well for us.  Hope it helps.

Cheers,

-Xavier 
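
[A rough shell sketch of steps 1-4 above. The /data/hadoop path and the
user name "alice" are invented, and g+s is used here for the
group-inheritance effect the sticky-bit step is aiming at.]

groupadd hadoop
useradd -g hadoop hadoop
chgrp -R hadoop /data/hadoop       # local dfs.name.dir / dfs.data.dir
chmod -R g+w /data/hadoop
chmod g+s /data/hadoop             # new files keep the hadoop group
usermod -a -G hadoop alice         # repeat for each user
# then start the daemons as the hadoop user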


-Original Message-
From: Chris Collins [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 11, 2008 5:18 PM
To: core-user@hadoop.apache.org
Subject: Re: client connect as different username?

The finer point to this is that in development you may be logged in as
user x and have a shared hdfs instance that a number of people are
using.  In that mode it's not practical to sudo, as you have all your
development tools set up for user x.  hdfs is set up with a single user;
what is the procedure to add users to that hdfs instance?  It has to
support it, surely?  It's really not obvious; looking in the hdfs docs
that come with the distro, nothing springs out.  The hadoop command line
tool doesn't have anything that vaguely looks like a way to create a
user.

Help is greatly appreciated.  I am sure it's somewhere so blindingly
obvious.

How are other people doing this, other than sudoing to one single user name?

Thanks

ChRiS


On Jun 11, 2008, at 5:11 PM, [EMAIL PROTECTED] wrote:

> The best way is to use sudo command to execute hadoop client.  Does it

> work for you?
>
> Nicholas
>
>
> - Original Message 
>> From: Bob Remeika <[EMAIL PROTECTED]>
>> To: core-user@hadoop.apache.org
>> Sent: Wednesday, June 11, 2008 12:56:14 PM
>> Subject: client connect as different username?
>>
>> Apologies if this is an RTM response, but I looked and wasn't able to

>> find anything concrete.  Is it possible to connect to HDFS via the 
>> HDFS client under a different username than I am currently logged in 
>> as?
>>
>> Here is our situation, I am user bobr on the client machine.  I need 
>> to add something to the HDFS cluster as the user "companyuser".  Is 
>> this possible with the current set of APIs or do I have to upload and

>> "chown"?
>>
>> Thanks,
>> Bob
>





Re: client connect as different username?

2008-06-11 Thread Chris Collins
The finer point to this is that in development you may be logged in as
user x and have a shared hdfs instance that a number of people are
using.  In that mode it's not practical to sudo, as you have all your
development tools set up for user x.  hdfs is set up with a single user;
what is the procedure to add users to that hdfs instance?  It has to
support it, surely?  It's really not obvious; looking in the hdfs docs
that come with the distro, nothing springs out.  The hadoop command
line tool doesn't have anything that vaguely looks like a way to create
a user.


Help is greatly appreciated.  I am sure it's somewhere so blindingly
obvious.


How are other people doing this, other than sudoing to one single user name?

Thanks

ChRiS


On Jun 11, 2008, at 5:11 PM, [EMAIL PROTECTED] wrote:

The best way is to use sudo command to execute hadoop client.  Does  
it work for you?


Nicholas


- Original Message 

From: Bob Remeika <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, June 11, 2008 12:56:14 PM
Subject: client connect as different username?

Apologies if this is an RTM response, but I looked and wasn't able  
to find
anything concrete.  Is it possible to connect to HDFS via the HDFS  
client

under a different username than I am currently logged in as?

Here is our situation, I am user bobr on the client machine.  I  
need to add
something to the HDFS cluster as the user "companyuser".  Is this  
possible

with the current set of APIs or do I have to upload and "chown"?

Thanks,
Bob






Re: client connect as different username?

2008-06-11 Thread s29752-hadoopuser
The best way is to use sudo command to execute hadoop client.  Does it work for 
you?

Nicholas
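
[For example, assuming a local "companyuser" account exists on the client
machine; the file and directory names are made up.]

sudo -u companyuser bin/hadoop dfs -put localfile /user/companyuser/localfile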


- Original Message 
> From: Bob Remeika <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, June 11, 2008 12:56:14 PM
> Subject: client connect as different username?
> 
> Apologies if this is an RTM response, but I looked and wasn't able to find
> anything concrete.  Is it possible to connect to HDFS via the HDFS client
> under a different username than I am currently logged in as?
> 
> Here is our situation, I am user bobr on the client machine.  I need to add
> something to the HDFS cluster as the user "companyuser".  Is this possible
> with the current set of APIs or do I have to upload and "chown"?
> 
> Thanks,
> Bob



Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Ted Dunning
Pig is much more ambitious than cascading.  Because of the ambitions, simple
things got overlooked.  For instance, something as simple as computing a
file name to load is not possible in pig, nor is it possible to write
functions in pig.  You can hook to Java functions (for some things), but you
can't really write programs in pig.  On the other hand, pig may eventually
provide really incredible capabilities including program rewriting and
optimization that would be incredibly hard to write directly in Java.

The point of cascading was simply to make life easier for a normal
Java/map-reduce programmer.  It provides an abstraction for gluing together
several map-reduce programs and for doing a few common things like joins.
Because you are still writing Java (or Groovy) code, you have all of the
functionality you always had.  But, this same benefit costs you the future
in terms of what optimizations are likely to ever be possible.

The summary for us (especially 4-6 months ago when we were deciding) is that
cascading is good enough to use now and pig will probably be more useful
later.

On Wed, Jun 11, 2008 at 4:19 PM, Haijun Cao <[EMAIL PROTECTED]> wrote:

>
> I find cascading very similar to pig, do you care to provide your comment
> here? If map reduce programmers are to go to the next level (scripting/query
> language), which way to go?
>
>
>


Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Arun C Murthy


On Jun 10, 2008, at 2:48 PM, Meng Mao wrote:

I'm interested in the same thing -- is there a recommended way to  
batch

Hadoop jobs together?



Hadoop Map-Reduce JobControl:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Job+Control

and
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#JobControl


Arun
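
[For the plain-bash route quoted below, a minimal reworking of the script:
bin/hadoop jar already blocks until the job finishes, so the wait calls can
be dropped, and the sleep is a crude, arbitrary delay to let the JobTracker
finish initializing.]

#!/bin/bash
sh bin/start-all.sh
echo cluster started
sleep 60   # give the NameNode/JobTracker time to finish initializing
bin/hadoop jar hadoop-0.17.0-examples.jar randomwriter -D test.randomwrite.bytes_per_map=107374182 rand
bin/hadoop jar hadoop-0.17.0-examples.jar randomtextwriter -D test.randomtextwrite.total_bytes=107374182 rand-text
bin/stop-all.sh
echo finished hdfs randomwriter experiment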

On Tue, Jun 10, 2008 at 5:45 PM, Richard Zhang  
<[EMAIL PROTECTED]>

wrote:


Hello folks:
I am running several hadoop applications on hdfs. To save the effort of
issuing the set of commands every time, I am trying to use a bash script to
run the several applications sequentially. To let each job finish before
proceeding to the next job, I am using wait in the script like below.


sh bin/start-all.sh
wait
echo cluster start
(bin/hadoop jar hadoop-0.17.0-examples.jar randomwriter -D
test.randomwrite.bytes_per_map=107374182 rand)
wait
bin/hadoop jar hadoop-0.17.0-examples.jar randomtextwriter  -D
test.randomtextwrite.total_bytes=107374182 rand-text
bin/stop-all.sh
echo finished hdfs randomwriter experiment


However, it always gives the error below. Does anyone have a better idea
on how to run multiple sequential jobs with a bash script?

HadoopScript.sh: line 39: wait: pid 10 is not a child of this shell

org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.mapred.JobTracker$IllegalStateException: Job tracker still initializing
    at org.apache.hadoop.mapred.JobTracker.ensureRunning(JobTracker.java:1722)
    at org.apache.hadoop.mapred.JobTracker.getNewJobId(JobTracker.java:1730)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

    at org.apache.hadoop.ipc.Client.call(Client.java:557)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
    at $Proxy1.getNewJobId(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy1.getNewJobId(Unknown Source)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:696)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
    at org.apache.hadoop.examples.RandomWriter.run(RandomWriter.java:276)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.examples.RandomWriter.main(RandomWriter.java:287)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)





--
hustlin, hustlin, everyday I'm hustlin




RE: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Haijun Cao
Ted,

I find cascading very similar to pig, do you care to provide your comment here? 
If map reduce programmers are to go to the next level (scripting/query 
language), which way to go?

Thanks
Haijun 
 

-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 11, 2008 2:16 PM
To: core-user@hadoop.apache.org
Subject: Re: does anyone have idea on how to run multiple sequential jobs with 
bash script

Just a quick plug for Cascading.  Our team uses cascading quite a bit and
found it to be a simpler way to write map reduce jobs.  The guys using it
find it very helpful.

On Wed, Jun 11, 2008 at 1:31 PM, Chris K Wensel <[EMAIL PROTECTED]> wrote:

>
> Depending on the nature of your jobs, Cascading has built in a topological
> scheduler. It will schedule all your work as their dependencies are
> satisfied. Dependencies being source data and inter-job intermediate data.
>
> http://www.cascading.org
>
>
>


-- 
ted


Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Ted Dunning
Just a quick plug for Cascading.  Our team uses cascading quite a bit and
found it to be a simpler way to write map reduce jobs.  The guys using it
find it very helpful.

On Wed, Jun 11, 2008 at 1:31 PM, Chris K Wensel <[EMAIL PROTECTED]> wrote:

>
> Depending on the nature of your jobs, Cascading has built in a topological
> scheduler. It will schedule all your work as their dependencies are
> satisfied. Dependencies being source data and inter-job intermediate data.
>
> http://www.cascading.org
>
>
>


-- 
ted


RE: hadoop benchmarked, too slow to use

2008-06-11 Thread Ashish Thusoo
good to know... this puppy does scale :) and hadoop is awesome for what
it does...

Ashish

-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 11, 2008 11:54 AM
To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use


we concatenated the files to bring them close to and less than 64mb and
the difference was huge without changing anything else we went from 214
minutes to 3 minutes !

Elia Mazzawi wrote:
> Thanks for the suggestions,
>
> I'm going to rerun the same test with close to < 64Mb files and 7 then
> 14 reducers.
>
>
> we've done another test to see if more servers would speed up the 
> cluster,
>
> with 2 nodes down took 322 minutes on the 10X data thats 5.3 hours vs 
> 214 minutes with all nodes online.
> started the test after hdfs marked the nodes as dead, and there were 
> no timeouts.
>
> 332/214 = 55% more time with 5/7 = 71%  servers.
>
> so our conclusion is that more servers will make the cluster faster.
>
>
>
> Ashish Thusoo wrote:
>> Try by first just reducing the number of files and increasing the 
>> data in each file so you have close to 64MB of data per file. So in 
>> your case that would amount to about 700-800 files in the 10X test 
>> case (instead of 35000 that you have). See if that give substantially

>> better results on your larger test case. For the smaller one, I don't

>> think you will be able to do better than the unix  command - the data
set is too small.
>>
>> Ashish
>> -Original Message-
>> From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent: 
>> Tuesday, June 10, 2008 5:00 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: hadoop benchmarked, too slow to use
>>
>> so it would make sense for me to configure hadoop for smaller chunks?
>>
>> Elia Mazzawi wrote:
>>  
>>> yes chunk size was 64mb, and each file has some data it used 7 
>>> mappers
>>> 
>>
>>  
>>> and 1 reducer.
>>>
>>> 10X the data took 214 minutes
>>> vs 26 minutes for the smaller set
>>>
>>> i uploaded the same data 10 times in different directories ( so more

>>> files, same size )
>>>
>>>
>>> Ashish Thusoo wrote:
>>>
 Apart from the setup times, the fact that you have 3500 files means

 that you are going after around 220GB of data as each file would 
 have
   
>>
>>  
 atleast one chunk (this calculation is assuming a chunk size of 
 64MB and this assumes that each file has atleast some data).
 Mappers would
   
>>
>>  
 probably need to read up this amount of data and with 7 nodes you 
 may
   
>>
>>  
 just have
 14 map slots. I may be wrong here, but just out of curiosity how 
 many
   
>>
>>  
 mappers does your job use.

 Don't know why the 10X data was not better though if the bad 
 performance of the smaller test case was due to fragmentation. For 
 that test did you also increase the number of files, or did you 
 simply increase the amount of data in each file.

 Plus on small sets (of the order of 2-3 GB) of data unix commands 
 can't really be beaten :)

 Ashish
 -Original Message-
 From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent: 
 Tuesday, June 10, 2008 3:56 PM
 To: core-user@hadoop.apache.org
 Subject: hadoop benchmarked, too slow to use

 Hello,

 we were considering using hadoop to process some data, we have it 
 set
   
>>
>>  
 up on 8 nodes ( 1 master + 7 slaves)

 we filled the cluster up with files that contain tab delimited
data.
 string \tab string etc
 then we ran the example grep with a regular expression to count the

 number of each unique starting string.
 we had 3500 files containing 3,015,294 lines totaling 5 GB.

 to benchmark it we ran
 bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output 
 '^[a-zA-Z]+\t'
 it took 26 minutes

 then to compare, we ran this bash command on one of the nodes, 
 which produced the same output out of the data:

 cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out (sed regexpr is

 tab not spaces)

 which took 2.5 minutes

 Then we added 10X the data into the cluster and reran Hadoop, it 
 took
 214 minutes which is less than 10X the time, but still not that 
 impressive.


 so we are seeing a 10X performance penalty for using Hadoop vs the 
 system commands, is that expected?
 we were expecting hadoop to be faster since it is distributed?
 perhaps there is too much overhead involved here?
 is the data too small?
 
>>
>>   
>



Re: hadoop benchmarked, too slow to use

2008-06-11 Thread Ted Dunning
Yes.  That does count as huge.

Congratulations!

On Wed, Jun 11, 2008 at 11:53 AM, Elia Mazzawi <[EMAIL PROTECTED]>
wrote:

>
> we concatenated the files to bring them close to and less than 64mb and the
> difference was huge without changing anything else
> we went from 214 minutes to 3 minutes !
>
>


-- 
ted


Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Chris K Wensel


Depending on the nature of your jobs, Cascading has built in a  
topological scheduler. It will schedule all your work as their  
dependencies are satisfied. Dependencies being source data and inter- 
job intermediate data.


http://www.cascading.org

The first catch is that you will still need bash to start/stop your  
cluster and to start the cascading job (per your example below).


The second catch is that you currently must use the cascading api  (or  
the groovy api) to assemble your data processing flows. Hopefully in  
the next couple weeks we will have a means to support custom/raw  
hadoop jobs as members of a set of dependent jobs.


This feature is being delayed by our adding support for stream  
assertions, the ability to validate data during runtime but have the  
assertions 'planned' out of the process flow on demand, ie. for  
production runs.


And for stream traps, built in support for siphoning off bad data into  
side files so long running (or low fidelity) jobs can continue running  
without losing any data.


can read more about these features here
http://groups.google.com/group/cascading-user

ckw

On Jun 10, 2008, at 2:48 PM, Meng Mao wrote:

I'm interested in the same thing -- is there a recommended way to  
batch

Hadoop jobs together?

On Tue, Jun 10, 2008 at 5:45 PM, Richard Zhang <[EMAIL PROTECTED] 
>

wrote:


Hello folks:
I am running several hadoop applications on hdfs. To save the effort of
issuing the set of commands every time, I am trying to use a bash script to
run the several applications sequentially. To let each job finish before
proceeding to the next job, I am using wait in the script like below.


sh bin/start-all.sh
wait
echo cluster start
(bin/hadoop jar hadoop-0.17.0-examples.jar randomwriter -D
test.randomwrite.bytes_per_map=107374182 rand)
wait
bin/hadoop jar hadoop-0.17.0-examples.jar randomtextwriter  -D
test.randomtextwrite.total_bytes=107374182 rand-text
bin/stop-all.sh
echo finished hdfs randomwriter experiment


However, it always gives the error below. Does anyone have a better idea
on how to run multiple sequential jobs with a bash script?

HadoopScript.sh: line 39: wait: pid 10 is not a child of this shell

org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.mapred.JobTracker$IllegalStateException: Job tracker still initializing
  at org.apache.hadoop.mapred.JobTracker.ensureRunning(JobTracker.java:1722)
  at org.apache.hadoop.mapred.JobTracker.getNewJobId(JobTracker.java:1730)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

  at org.apache.hadoop.ipc.Client.call(Client.java:557)
  at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
  at $Proxy1.getNewJobId(Unknown Source)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
  at $Proxy1.getNewJobId(Unknown Source)
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:696)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
  at org.apache.hadoop.examples.RandomWriter.run(RandomWriter.java:276)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.hadoop.examples.RandomWriter.main(RandomWriter.java:287)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
  at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
  at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:53)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:

Re: hadoop benchmarked, too slow to use

2008-06-11 Thread Elia Mazzawi
That was with 7 reducers, but I meant to run it with 1. I'll re-run to
compare.


Arun C Murthy wrote:


On Jun 11, 2008, at 11:53 AM, Elia Mazzawi wrote:



we concatenated the files to bring them close to and less than 64mb 
and the difference was huge without changing anything else

we went from 214 minutes to 3 minutes !



*smile*

How many reduces are you running now? 1 or more?

Arun


Elia Mazzawi wrote:

Thanks for the suggestions,

I'm going to rerun the same test with close to < 64Mb files and 7 
then 14 reducers.



we've done another test to see if more servers would speed up the 
cluster,


with 2 nodes down took 322 minutes on the 10X data thats 5.3 hours
vs 214 minutes with all nodes online.
started the test after hdfs marked the nodes as dead, and there were 
no timeouts.


332/214 = 55% more time with 5/7 = 71%  servers.

so our conclusion is that more servers will make the cluster faster.



Ashish Thusoo wrote:

Try by first just reducing the number of files and increasing the data
in each file so you have close to 64MB of data per file. So in your 
case

that would amount to about 700-800 files in the 10X test case (instead
of 35000 that you have). See if that give substantially better results
on your larger test case. For the smaller one, I don't think you 
will be

able to do better than the unix  command - the data set is too small.

Ashish
-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent: 
Tuesday, June 10, 2008 5:00 PM

To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use

so it would make sense for me to configure hadoop for smaller chunks?

Elia Mazzawi wrote:

yes chunk size was 64mb, and each file has some data it used 7 
mappers






and 1 reducer.

10X the data took 214 minutes
vs 26 minutes for the smaller set

i uploaded the same data 10 times in different directories ( so 
more files, same size )



Ashish Thusoo wrote:

Apart from the setup times, the fact that you have 3500 files 
means that you are going after around 220GB of data as each file 
would have





atleast one chunk (this calculation is assuming a chunk size of 
64MB and this assumes that each file has atleast some data). 
Mappers would





probably need to read up this amount of data and with 7 nodes you 
may






just have
14 map slots. I may be wrong here, but just out of curiosity how 
many






mappers does your job use.

Don't know why the 10X data was not better though if the bad 
performance of the smaller test case was due to fragmentation. 
For that test did you also increase the number of files, or did 
you simply increase the amount of data in each file.


Plus on small sets (of the order of 2-3 GB) of data unix commands 
can't really be beaten :)


Ashish
-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent: 
Tuesday, June 10, 2008 3:56 PM

To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use

Hello,

we were considering using hadoop to process some data, we have it 
set






up on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited data.
string \tab string etc
then we ran the example grep with a regular expression to count 
the number of each unique starting string.

we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output 
'^[a-zA-Z]+\t'

it took 26 minutes

then to compare, we ran this bash command on one of the nodes, 
which produced the same output out of the data:


cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out (sed regexpr 
is tab not spaces)


which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop, it 
took
214 minutes which is less than 10X the time, but still not that 
impressive.



so we are seeing a 10X performance penalty for using Hadoop vs 
the system commands, is that expected?

we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?














Re: hadoop benchmarked, too slow to use

2008-06-11 Thread Arun C Murthy


On Jun 11, 2008, at 11:53 AM, Elia Mazzawi wrote:



we concatenated the files to bring them close to and less than 64mb  
and the difference was huge without changing anything else

we went from 214 minutes to 3 minutes !



*smile*

How many reduces are you running now? 1 or more?

Arun


Elia Mazzawi wrote:

Thanks for the suggestions,

I'm going to rerun the same test with close to < 64Mb files and 7  
then 14 reducers.



we've done another test to see if more servers would speed up the  
cluster,


with 2 nodes down took 322 minutes on the 10X data thats 5.3 hours
vs 214 minutes with all nodes online.
started the test after hdfs marked the nodes as dead, and there  
were no timeouts.


332/214 = 55% more time with 5/7 = 71%  servers.

so our conclusion is that more servers will make the cluster faster.



Ashish Thusoo wrote:
Try by first just reducing the number of files and increasing the  
data
in each file so you have close to 64MB of data per file. So in  
your case
that would amount to about 700-800 files in the 10X test case  
(instead
of 35000 that you have). See if that give substantially better  
results
on your larger test case. For the smaller one, I don't think you  
will be
able to do better than the unix  command - the data set is too  
small.


Ashish
-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent:  
Tuesday, June 10, 2008 5:00 PM

To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use

so it would make sense for me to configure hadoop for smaller  
chunks?


Elia Mazzawi wrote:

yes chunk size was 64mb, and each file has some data it used 7  
mappers






and 1 reducer.

10X the data took 214 minutes
vs 26 minutes for the smaller set

i uploaded the same data 10 times in different directories ( so  
more files, same size )



Ashish Thusoo wrote:

Apart from the setup times, the fact that you have 3500 files  
means that you are going after around 220GB of data as each  
file would have





atleast one chunk (this calculation is assuming a chunk size of  
64MB and this assumes that each file has atleast some data).  
Mappers would





probably need to read up this amount of data and with 7 nodes  
you may






just have
14 map slots. I may be wrong here, but just out of curiosity  
how many






mappers does your job use.

Don't know why the 10X data was not better though if the bad  
performance of the smaller test case was due to fragmentation.  
For that test did you also increase the number of files, or did  
you simply increase the amount of data in each file.


Plus on small sets (of the order of 2-3 GB) of data unix  
commands can't really be beaten :)


Ashish
-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent:  
Tuesday, June 10, 2008 3:56 PM

To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use

Hello,

we were considering using hadoop to process some data, we have  
it set






up on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited  
data.

string \tab string etc
then we ran the example grep with a regular expression to count  
the number of each unique starting string.

we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output '^ 
[a-zA-Z]+\t'

it took 26 minutes

then to compare, we ran this bash command on one of the nodes,  
which produced the same output out of the data:


cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out (sed  
regexpr is tab not spaces)


which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop,  
it took
214 minutes which is less than 10X the time, but still not that  
impressive.



so we are seeing a 10X performance penalty for using Hadoop vs  
the system commands, is that expected?

we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?












client connect as different username?

2008-06-11 Thread Bob Remeika
Apologies if this is an RTM response, but I looked and wasn't able to find
anything concrete.  Is it possible to connect to HDFS via the HDFS client
under a different username than I am currently logged in as?

Here is our situation, I am user bobr on the client machine.  I need to add
something to the HDFS cluster as the user "companyuser".  Is this possible
with the current set of APIs or do I have to upload and "chown"?

Thanks,
Bob


Re: hadoop benchmarked, too slow to use

2008-06-11 Thread Elia Mazzawi


we concatenated the files to bring them close to and less than 64mb and 
the difference was huge without changing anything else

we went from 214 minutes to 3 minutes !
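
[One way to do the concatenation before loading, as a sketch: it assumes
GNU split and made-up local paths; -C 64m packs whole lines into pieces of
at most 64MB.]

mkdir merged
cat data/* | split -C 64m - merged/part-
bin/hadoop dfs -put merged /user/elia/merged-data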

Elia Mazzawi wrote:

Thanks for the suggestions,

I'm going to rerun the same test with close to < 64Mb files and 7 then 
14 reducers.



we've done another test to see if more servers would speed up the 
cluster,


with 2 nodes down took 322 minutes on the 10X data thats 5.3 hours
vs 214 minutes with all nodes online.
started the test after hdfs marked the nodes as dead, and there were 
no timeouts.


332/214 = 55% more time with 5/7 = 71%  servers.

so our conclusion is that more servers will make the cluster faster.



Ashish Thusoo wrote:

Try by first just reducing the number of files and increasing the data
in each file so you have close to 64MB of data per file. So in your case
that would amount to about 700-800 files in the 10X test case (instead
of 35000 that you have). See if that give substantially better results
on your larger test case. For the smaller one, I don't think you will be
able to do better than the unix  command - the data set is too small.

Ashish
-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent: 
Tuesday, June 10, 2008 5:00 PM

To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use

so it would make sense for me to configure hadoop for smaller chunks?

Elia Mazzawi wrote:
 

yes chunk size was 64mb, and each file has some data it used 7 mappers



 

and 1 reducer.

10X the data took 214 minutes
vs 26 minutes for the smaller set

i uploaded the same data 10 times in different directories ( so more 
files, same size )



Ashish Thusoo wrote:
   
Apart from the setup times, the fact that you have 3500 files means 
that you are going after around 220GB of data as each file would have
  


 
atleast one chunk (this calculation is assuming a chunk size of 
64MB and this assumes that each file has atleast some data). 
Mappers would
  


 

probably need to read up this amount of data and with 7 nodes you may
  


 

just have
14 map slots. I may be wrong here, but just out of curiosity how many
  


 

mappers does your job use.

Don't know why the 10X data was not better though if the bad 
performance of the smaller test case was due to fragmentation. For 
that test did you also increase the number of files, or did you 
simply increase the amount of data in each file.


Plus on small sets (of the order of 2-3 GB) of data unix commands 
can't really be beaten :)


Ashish
-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent: 
Tuesday, June 10, 2008 3:56 PM

To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use

Hello,

we were considering using hadoop to process some data, we have it set
  


 

up on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited data.
string \tab string etc
then we ran the example grep with a regular expression to count the 
number of each unique starting string.

we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output 
'^[a-zA-Z]+\t'

it took 26 minutes

then to compare, we ran this bash command on one of the nodes, 
which produced the same output out of the data:


cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out (sed regexpr is 
tab not spaces)


which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop, it took
214 minutes which is less than 10X the time, but still not that 
impressive.



so we are seeing a 10X performance penalty for using Hadoop vs the 
system commands, is that expected?

we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?



  






Re: hadoop benchmarked, too slow to use

2008-06-11 Thread Elia Mazzawi

Thanks for the suggestions,

I'm going to rerun the same test with close to < 64Mb files and 7 then 
14 reducers.



we've done another test to see if more servers would speed up the cluster:

with 2 nodes down it took 322 minutes on the 10X data, that's 5.3 hours,
vs 214 minutes with all nodes online.
we started the test after hdfs marked the nodes as dead, and there were no 
timeouts.


322/214 ≈ 1.5, so about 50% more time with 5/7 ≈ 71% of the servers.

so our conclusion is that more servers will make the cluster faster.



Ashish Thusoo wrote:

Try by first just reducing the number of files and increasing the data
in each file so you have close to 64MB of data per file. So in your case
that would amount to about 700-800 files in the 10X test case (instead
of 35000 that you have). See if that give substantially better results
on your larger test case. For the smaller one, I don't think you will be
able to do better than the unix  command - the data set is too small.

Ashish 


-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 10, 2008 5:00 PM

To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use

so it would make sense for me to configure hadoop for smaller chunks?

Elia Mazzawi wrote:
  

yes chunk size was 64mb, and each file has some data it used 7 mappers



  

and 1 reducer.

10X the data took 214 minutes
vs 26 minutes for the smaller set

i uploaded the same data 10 times in different directories ( so more 
files, same size )



Ashish Thusoo wrote:

Apart from the setup times, the fact that you have 3500 files means 
that you are going after around 220GB of data as each file would have
  


  
atleast one chunk (this calculation is assuming a chunk size of 64MB 
and this assumes that each file has atleast some data). Mappers would
  


  

probably need to read up this amount of data and with 7 nodes you may
  


  

just have
14 map slots. I may be wrong here, but just out of curiosity how many
  


  

mappers does your job use.

Don't know why the 10X data was not better though if the bad 
performance of the smaller test case was due to fragmentation. For 
that test did you also increase the number of files, or did you 
simply increase the amount of data in each file.


Plus on small sets (of the order of 2-3 GB) of data unix commands 
can't really be beaten :)


Ashish
-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent: 
Tuesday, June 10, 2008 3:56 PM

To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use

Hello,

we were considering using hadoop to process some data, we have it set
  


  

up on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited data.
string \tab string etc
then we ran the example grep with a regular expression to count the 
number of each unique starting string.

we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output 
'^[a-zA-Z]+\t'

it took 26 minutes

then to compare, we ran this bash command on one of the nodes, which 
produced the same output out of the data:


cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out (sed regexpr is 
tab not spaces)


which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop, it took
214 minutes which is less than 10X the time, but still not that 
impressive.



so we are seeing a 10X performance penalty for using Hadoop vs the 
system commands, is that expected?

we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?
  
  


  




Re: HDFS crash recovery -- "The directory is already locked."

2008-06-11 Thread Ben Slusky

Never mind. The storage in question was on an NFS share, and the locking
problem seems to have resolved itself overnight. Fracking NFS.

Thanks anyway,
-
-Ben

Ben Slusky wrote:

Greetings,

We had a hard crash due to hardware failure in our Hadoop namenode host,
and now the namenode won't start because it thinks its storage is still
locked. Also, I'm apparently too stupid to find any documentation on
crash recovery. Could someone please enlighten me?

Thanks
-
-Ben



--
Ben Slusky <[EMAIL PROTECTED]>


Re: Streaming --counters question

2008-06-11 Thread Miles Osborne
great!  looking forwards to 0.18

Miles

2008/6/11 Arun C Murthy <[EMAIL PROTECTED]>:

>
> On Jun 10, 2008, at 3:16 PM, Miles Osborne wrote:
>
>  Is there support for counters in streaming?  In particular, it would be
>> nice
>> to be able to access these after a job has run.
>>
>>
> Yes. Streaming applications can update counters in hadoop-0.18:
> http://issues.apache.org/jira/browse/HADOOP-1328
>
> Arun
>
>
>  Thanks!
>>
>> Miles
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in Scotland,
>> with registration number SC005336.
>>
>
>


-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
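
[For reference once 0.18 is available: a streaming task updates a counter by
writing a reporter line to stderr. Below is a sketch of an identity mapper
doing so; the group and counter names are made up.]

#!/bin/bash
# identity mapper that also bumps a counter per input line
while read line; do
  echo "reporter:counter:MyGroup,LinesSeen,1" >&2
  echo "$line"
done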