Re: client connect as different username?
This information can be found in http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html

Nicholas
Re: client connect as different username?
Thanks Doug, should this be added to the permissions doc or to the faq? See you in Sonoma.

C
Re: client connect as different username?
Chris Collins wrote:
> You are referring to creating a directory in hdfs? Because if I am user chris and the hdfs only has user foo, then I can't create a directory because I don't have perms; in fact I can't even connect.

Today, users and groups are declared by the client. The namenode only records and checks against user and group names provided by the client. So if someone named "foo" writes a file, then that file is owned by someone named "foo", and anyone named "foo" is the owner of that file. No "foo" account need exist on the namenode.

The one (important) exception is the "superuser". Whatever user name starts the namenode is the superuser for that filesystem. And if "/" is not world-writable, a new filesystem will not contain a home directory (or anywhere else) writable by other users. So, in a multiuser Hadoop installation, the superuser needs to create home directories and project directories for other users and set their protections accordingly before other users can do anything. Perhaps this is what you've run into?

Doug
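To make the superuser setup Doug describes concrete, here is a rough sketch of the commands involved. It assumes, purely for illustration, that the namenode was started by an account named "hadoop" (making it the superuser) and that per-user home directories live under /user:

  # run as the account that started the namenode (the HDFS superuser)
  bin/hadoop fs -mkdir /user/chris
  bin/hadoop fs -chown chris:chris /user/chris
  bin/hadoop fs -chmod 755 /user/chris

After that, a client running as "chris" can create files under /user/chris even though no "chris" account exists on the namenode host.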
Re: does anyone have idea on how to run multiple sequential jobs with bash script
> However, for continuous production data processing, hadoop+cascading sounds like a good option.

This will be especially true with stream assertions and traps (as mentioned previously, and available in trunk).

I've written workloads for clients that render down to ~60 unique Hadoop map/reduce jobs, all inter-related, from ~10 unique units of work (internally lots of joins, sorts and math). I can't imagine having written them by hand.

ckw

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/
Re: does anyone have idea on how to run multiple sequential jobs with bash script
Thanks Ted. A couple of quick comments.

At one level Cascading is a MapReduce query planner, just like PIG, except that the API is for public consumption and fully extensible; in PIG you typically interact with the PigLatin syntax. Subsequently, with Cascading, you can layer your own syntax on top of the API. Currently there is Groovy support (Groovy is used to assemble the work; it does not run on the mappers or reducers). I hear rumors about Jython elsewhere.

A couple of Groovy examples (note these are obviously trivial; the DSL can absorb tremendous complexity if need be)...
http://code.google.com/p/cascading/source/browse/trunk/cascading.groovy/sample/wordcount.groovy
http://code.google.com/p/cascading/source/browse/trunk/cascading.groovy/sample/widefinder.groovy

Since Cascading is in part a 'planner', it actually builds internally a new representation from what the developer assembled and renders out the necessary map/reduce jobs (and transparently links them) at runtime. As Hadoop evolves, the planner will incorporate the new features and leverage them transparently. Plus there are opportunities for identifying patterns and applying different strategies (hypothetically map-side vs reduce-side joins, for one). It is also conceivable (but untried) that different planners could exist to target systems other than Hadoop (making your code/libraries portable). Much of this is true for PIG as well.

http://www.cascading.org/documentation/overview.html

Also, Cascading will at some point provide a PIG adapter, allowing PigLatin queries to participate in a larger Cascading 'Cascade' (the topological scheduler). Cascading is great with integration, connecting things outside Hadoop with stuff to be done inside Hadoop. And PIG looks like a great way to concisely represent a complex solution and execute it. There isn't any reason they can't work together (it has always been the intention).

The takeaway is that with Cascading and PIG, users do not think in MapReduce. With PIG, you think in PigLatin. With Cascading, you can use the pipe/filter based API, or use your favorite scripting language and build a DSL for your problem domain. Many companies have done similar things internally, but they tend to be nothing more than a scriptable way to write a map/reduce job and glue them together. You still think in MapReduce, which in my opinion doesn't scale well.

My (biased) recommendation is this. Build out your application in Cascading. If part of the problem is best represented in PIG, no worries: use PIG, and feed and clean up after PIG with Cascading. And if you see a solvable bottleneck, and we can't convince the planner to recognize the pattern and plan better, replace that piece of the process with a custom MapReduce job (or more). Solve your problem first, then optimize the solution, if need be.

ckw

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/
RE: does anyone have idea on how to run multiple sequential jobs with bash script
Thanks for sharing.

We have a need to expose a hadoop cluster to 'casual' users for ad-hoc queries, and I find it difficult to ask them to write a map reduce program, so pig latin comes in very handy in this case. However, for continuous production data processing, hadoop+cascading sounds like a good option.

Haijun
Re: client connect as different username?
We know whoami is called, thanks; I found that out painfully the first day I played with this, because in dev my IDE is started not from a shell, and therefore its PATH is not inherited and does not include /usr/bin. The HDFS client hides the fact that ProcessBuilder barfs with a file-not-found behind a "login exception", "whoami". Not as clear as I would have liked :-}|

You are referring to creating a directory in hdfs? Because if I am user chris and the hdfs only has user foo, then I can't create a directory because I don't have perms; in fact I can't even connect.

I believe another emailer holds the answer, which was blindly dumb on my part for not trying: adding a user in unix and creating a group that those users belong to.

Thanks

Chris
Re: client connect as different username?
On 6/11/08 5:17 PM, "Chris Collins" <[EMAIL PROTECTED]> wrote: > The finer point to this is that in development you may be logged in as > user x and have a shared hdfs instance that a number of people are > using. In that mode its not practical to sudo as you have all your > development tools setup for userx. hdfs is setup with a single user, > what is the procedure to add users to that hdfs instance? It has to > support it surely? Its really not obvious, looking in the hdfs docs > that come with the distro nothing springs out. the hadoop command > line tool doesnt have anything that vaguely looks like a way to create > a user. User information is sent from the client. The code literally does a 'whoami' and 'groups' and sends that information to the server. Shared data should be handled just like you would in UNIX: - create a directory - set permissions to be insecure - go crazy
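As a rough illustration of Allen's recipe (the directory name and mode are made up for the example; this would be run by whichever account is the HDFS superuser):

  # create a shared area and open it up so any declared user can write there
  bin/hadoop fs -mkdir /shared
  bin/hadoop fs -chmod 777 /shared

Since the client simply reports the output of whoami and groups, any user who can reach the cluster can then read and write under /shared without further administration on the namenode.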
RE: client connect as different username?
This is how I've done it before:

1.) Create a hadoop user/group.
2.) Make the local filesystem dfs directories writable by the hadoop group and set the sticky bit.
3.) Run hadoop as the hadoop user.
4.) Then add all of your users to the hadoop group.

I also changed the dfs.permissions.supergroup property to "hadoop" in $HADOOP_HOME/conf/hadoop-site.xml. This works pretty well for us. Hope it helps.

Cheers,

-Xavier
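A sketch of what Xavier's steps might look like on each node; the account names, paths, and exact commands are illustrative, and the local dfs directories should match whatever dfs.name.dir and dfs.data.dir point at in your configuration:

  # create the group and the account that will run the daemons
  groupadd hadoop
  useradd -g hadoop hadoop

  # make the local dfs directories group-writable, with the setgid bit
  # so files created there keep the hadoop group
  chgrp -R hadoop /path/to/dfs
  chmod -R g+w /path/to/dfs
  chmod g+s /path/to/dfs

  # add each developer to the group
  usermod -a -G hadoop chris

With dfs.permissions.supergroup set to "hadoop" in hadoop-site.xml, members of that group are then treated as superusers by the namenode.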
Re: client connect as different username?
The finer point to this is that in development you may be logged in as user x and have a shared hdfs instance that a number of people are using. In that mode it's not practical to sudo, as you have all your development tools set up for user x. hdfs is set up with a single user; what is the procedure to add users to that hdfs instance? It has to support it, surely? It's really not obvious: looking in the hdfs docs that come with the distro, nothing springs out, and the hadoop command line tool doesn't have anything that vaguely looks like a way to create a user.

Help is greatly appreciated. I am sure it's somewhere so blindingly obvious. How are other people doing this, other than sudoing to one single user name?

Thanks

ChRiS

On Jun 11, 2008, at 5:11 PM, [EMAIL PROTECTED] wrote:
> The best way is to use sudo command to execute hadoop client. Does it work for you?
>
> Nicholas
Re: client connect as different username?
The best way is to use the sudo command to execute the hadoop client. Does it work for you?

Nicholas

- Original Message -
> From: Bob Remeika <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, June 11, 2008 12:56:14 PM
> Subject: client connect as different username?
>
> Apologies if this is an RTM response, but I looked and wasn't able to find anything concrete. Is it possible to connect to HDFS via the HDFS client under a different username than I am currently logged in as?
>
> Here is our situation, I am user bobr on the client machine. I need to add something to the HDFS cluster as the user "companyuser". Is this possible with the current set of APIs or do I have to upload and "chown"?
>
> Thanks,
> Bob
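As an illustration of Nicholas's suggestion (the user, file, and path names are made up; it assumes a local "companyuser" unix account exists and that bobr has sudo rights to it):

  # run the HDFS client as the target user; the file then lands owned by "companyuser"
  sudo -u companyuser bin/hadoop fs -put localfile.txt /user/companyuser/

Because the client reports whatever whoami returns, the upload is recorded under "companyuser" and no chown is needed afterwards.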
Re: does anyone have idea on how to run multiple sequential jobs with bash script
Pig is much more ambitious than cascading. Because of the ambitions, simple things got overlooked. For instance, something as simple as computing a file name to load is not possible in pig, nor is it possible to write functions in pig. You can hook to Java functions (for some things), but you can't really write programs in pig. On the other hand, pig may eventually provide really incredible capabilities including program rewriting and optimization that would be incredibly hard to write directly in Java. The point of cascading was simply to make life easier for a normal Java/map-reduce programmer. It provides an abstraction for gluing together several map-reduce programs and for doing a few common things like joins. Because you are still writing Java (or Groovy) code, you have all of the functionality you always had. But, this same benefit costs you the future in terms of what optimizations are likely to ever be possible. The summary for us (especially 4-6 months ago when we were deciding) is that cascading is good enough to use now and pig will probably be more useful later. On Wed, Jun 11, 2008 at 4:19 PM, Haijun Cao <[EMAIL PROTECTED]> wrote: > > I find cascading very similar to pig, do you care to provide your comment > here? If map reduce programmers are to go to the next level (scripting/query > language), which way to go? > > >
Re: does anyone have idea on how to run multiple sequential jobs with bash script
On Jun 10, 2008, at 2:48 PM, Meng Mao wrote:
> I'm interested in the same thing -- is there a recommended way to batch Hadoop jobs together?

Hadoop Map-Reduce JobControl:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Job+Control
and
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#JobControl

Arun

On Tue, Jun 10, 2008 at 5:45 PM, Richard Zhang <[EMAIL PROTECTED]> wrote:

Hello folks:

I am running several hadoop applications on hdfs. To save the effort of issuing the set of commands every time, I am trying to use a bash script to run the several applications sequentially. To let each job finish before proceeding to the next job, I am using wait in the script like below.

sh bin/start-all.sh
wait
echo cluster start
(bin/hadoop jar hadoop-0.17.0-examples.jar randomwriter -D test.randomwrite.bytes_per_map=107374182 rand)
wait
bin/hadoop jar hadoop-0.17.0-examples.jar randomtextwriter -D test.randomtextwrite.total_bytes=107374182 rand-text
bin/stop-all.sh
echo finished hdfs randomwriter experiment

However, it always gives an error like the one below. Does anyone have a better idea on how to run multiple sequential jobs with a bash script?

HadoopScript.sh: line 39: wait: pid 10 is not a child of this shell
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.mapred.JobTracker$IllegalStateException: Job tracker still initializing
    at org.apache.hadoop.mapred.JobTracker.ensureRunning(JobTracker.java:1722)
    at org.apache.hadoop.mapred.JobTracker.getNewJobId(JobTracker.java:1730)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
    at org.apache.hadoop.ipc.Client.call(Client.java:557)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
    at $Proxy1.getNewJobId(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy1.getNewJobId(Unknown Source)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:696)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
    at org.apache.hadoop.examples.RandomWriter.run(RandomWriter.java:276)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.examples.RandomWriter.main(RandomWriter.java:287)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)

--
hustlin, hustlin, everyday I'm hustlin
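Two observations, offered as a sketch rather than a definitive fix: each bin/hadoop jar invocation above already blocks until its job completes, so the wait built-in isn't needed at all, and the "Job tracker still initializing" error suggests the first job was submitted before the JobTracker had finished starting up. A reworked script along these lines (reusing the job names and parameters from Richard's mail) may behave better:

  #!/bin/bash
  bin/start-all.sh
  echo "cluster started"

  # wait until the JobTracker answers before submitting anything;
  # a plain "sleep 60" is a cruder alternative
  until bin/hadoop job -list > /dev/null 2>&1; do
      sleep 5
  done

  # each job runs in the foreground and returns only when the job finishes,
  # so the next command naturally starts after the previous job is done
  bin/hadoop jar hadoop-0.17.0-examples.jar randomwriter \
      -D test.randomwrite.bytes_per_map=107374182 rand

  bin/hadoop jar hadoop-0.17.0-examples.jar randomtextwriter \
      -D test.randomtextwrite.total_bytes=107374182 rand-text

  bin/stop-all.sh
  echo "finished hdfs randomwriter experiment"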
RE: does anyone have idea on how to run multiple sequential jobs with bash script
Ted, I find cascading very similar to pig, do you care to provide your comment here? If map reduce programmers are to go to the next level (scripting/query language), which way to go? Thanks Haijun -Original Message- From: Ted Dunning [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 11, 2008 2:16 PM To: core-user@hadoop.apache.org Subject: Re: does anyone have idea on how to run multiple sequential jobs with bash script Just a quick plug for Cascading. Our team uses cascading quite a bit and found it to be a simpler way to write map reduce jobs. The guys using it find it very helpful. On Wed, Jun 11, 2008 at 1:31 PM, Chris K Wensel <[EMAIL PROTECTED]> wrote: > > Depending on the nature of your jobs, Cascading has built in a topological > scheduler. It will schedule all your work as their dependencies are > satisfied. Dependencies being source data and inter-job intermediate data. > > http://www.cascading.org > > > -- ted
Re: does anyone have idea on how to run multiple sequential jobs with bash script
Just a quick plug for Cascading. Our team uses cascading quite a bit and found it to be a simpler way to write map reduce jobs. The guys using it find it very helpful. On Wed, Jun 11, 2008 at 1:31 PM, Chris K Wensel <[EMAIL PROTECTED]> wrote: > > Depending on the nature of your jobs, Cascading has built in a topological > scheduler. It will schedule all your work as their dependencies are > satisfied. Dependencies being source data and inter-job intermediate data. > > http://www.cascading.org > > > -- ted
RE: hadoop benchmarked, too slow to use
good to know... this puppy does scale :) and hadoop is awesome for what it does...

Ashish

-----Original Message-----
From: Elia Mazzawi [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 11, 2008 11:54 AM
To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use

we concatenated the files to bring them close to and less than 64mb and the difference was huge
without changing anything else we went from 214 minutes to 3 minutes !
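For anyone wanting to reproduce the concatenation step, here is a rough sketch. The directory names are made up, the size accounting uses GNU stat (so the exact command may differ on other systems), and the chunks come out at roughly, not exactly, 64 MB:

  #!/bin/bash
  # merge many small input files into ~64 MB chunks before loading them into HDFS
  mkdir -p merged
  i=0
  size=0
  limit=$((64 * 1024 * 1024))
  out="merged/part-$i"
  for f in data/*; do
      cat "$f" >> "$out"
      size=$((size + $(stat -c %s "$f")))
      if [ "$size" -ge "$limit" ]; then
          i=$((i + 1))
          out="merged/part-$i"
          size=0
      fi
  done

  # then upload the merged chunks instead of the original small files
  bin/hadoop fs -put merged /user/elia/merged

Fewer, larger files mean fewer map tasks and far less per-task startup overhead, which is presumably where the 214-to-3-minute difference comes from.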
Re: hadoop benchmarked, too slow to use
Yes. That does count as huge. Congratulations! On Wed, Jun 11, 2008 at 11:53 AM, Elia Mazzawi <[EMAIL PROTECTED]> wrote: > > we concatenated the files to bring them close to and less than 64mb and the > difference was huge without changing anything else > we went from 214 minutes to 3 minutes ! > > -- ted
Re: does anyone have idea on how to run multiple sequential jobs with bash script
Depending on the nature of your jobs, Cascading has a topological scheduler built in. It will schedule all your work as the dependencies are satisfied, the dependencies being source data and inter-job intermediate data.

http://www.cascading.org

The first catch is that you will still need bash to start/stop your cluster and to start the cascading job (per your example below). The second catch is that you currently must use the cascading api (or the groovy api) to assemble your data processing flows. Hopefully in the next couple of weeks we will have a means to support custom/raw hadoop jobs as members of a set of dependent jobs.

This feature is being delayed by our adding support for stream assertions (the ability to validate data during runtime but have the assertions 'planned' out of the process flow on demand, i.e. for production runs) and for stream traps (built-in support for siphoning off bad data into side files so long-running, or low-fidelity, jobs can continue running without losing any data).

You can read more about these features here: http://groups.google.com/group/cascading-user

ckw

On Jun 10, 2008, at 2:48 PM, Meng Mao wrote:
> I'm interested in the same thing -- is there a recommended way to batch Hadoop jobs together?
Re: hadoop benchmarked, too slow to use
That was with 7 reducers, but I meant to run it with 1. I'll re-run to compare.

Arun C Murthy wrote:
> On Jun 11, 2008, at 11:53 AM, Elia Mazzawi wrote:
>> we concatenated the files to bring them close to and less than 64mb and the difference was huge
>> without changing anything else we went from 214 minutes to 3 minutes !
>
> *smile* How many reduces are you running now? 1 or more?
>
> Arun
Re: hadoop benchmarked, too slow to use
On Jun 11, 2008, at 11:53 AM, Elia Mazzawi wrote:
> we concatenated the files to bring them close to and less than 64mb and the difference was huge
> without changing anything else we went from 214 minutes to 3 minutes !

*smile*

How many reduces are you running now? 1 or more?

Arun
client connect as different username?
Apologies if this is an RTM response, but I looked and wasn't able to find anything concrete. Is it possible to connect to HDFS via the HDFS client under a different username than I am currently logged in as? Here is our situation, I am user bobr on the client machine. I need to add something to the HDFS cluster as the user "companyuser". Is this possible with the current set of APIs or do I have to upload and "chown"? Thanks, Bob
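The replies above settle on sudo, which requires a matching unix account. On pre-security Hadoop releases of this vintage there was also a client-side configuration property, hadoop.job.ugi, that declared the user and group names the client sends to the namenode; whether and how it is honoured depends on the exact version, so treat the following strictly as an illustrative sketch with made-up names:

  # declare the identity the client should report, e.g. in the client's hadoop-site.xml,
  # or, if the shell accepts generic options, on the command line:
  bin/hadoop fs -D hadoop.job.ugi=companyuser,companygroup -put localfile.txt /user/companyuser/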
Re: hadoop benchmarked, too slow to use
we concatenated the files to bring them close to and less than 64mb and the difference was huge
without changing anything else we went from 214 minutes to 3 minutes !
Re: hadoop benchmarked, too slow to use
Thanks for the suggestions,

I'm going to rerun the same test with close to < 64Mb files and 7 then 14 reducers.

we've done another test to see if more servers would speed up the cluster: with 2 nodes down it took 322 minutes on the 10X data, that's 5.3 hours, vs 214 minutes with all nodes online. started the test after hdfs marked the nodes as dead, and there were no timeouts.

332/214 = 55% more time with 5/7 = 71% servers.

so our conclusion is that more servers will make the cluster faster.

Ashish Thusoo wrote:

Try by first just reducing the number of files and increasing the data in each file so you have close to 64MB of data per file. So in your case that would amount to about 700-800 files in the 10X test case (instead of the 35000 that you have). See if that gives substantially better results on your larger test case. For the smaller one, I don't think you will be able to do better than the unix command - the data set is too small.

Ashish

-----Original Message-----
From: Elia Mazzawi [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 10, 2008 5:00 PM
To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use

so it would make sense for me to configure hadoop for smaller chunks?

Elia Mazzawi wrote:

yes, chunk size was 64mb, and each file has some data. it used 7 mappers and 1 reducer.

10X the data took 214 minutes vs 26 minutes for the smaller set.

i uploaded the same data 10 times in different directories ( so more files, same size )

Ashish Thusoo wrote:

Apart from the setup times, the fact that you have 3500 files means that you are going after around 220GB of data, as each file would have at least one chunk (this calculation assumes a chunk size of 64MB and that each file has at least some data). Mappers would probably need to read up this amount of data, and with 7 nodes you may just have 14 map slots. I may be wrong here, but just out of curiosity, how many mappers does your job use?

Don't know why the 10X data was not better though, if the bad performance of the smaller test case was due to fragmentation. For that test did you also increase the number of files, or did you simply increase the amount of data in each file?

Plus, on small sets (of the order of 2-3 GB) of data, unix commands can't really be beaten :)

Ashish

-----Original Message-----
From: Elia Mazzawi [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 10, 2008 3:56 PM
To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use

Hello,

we were considering using hadoop to process some data. we have it set up on 8 nodes ( 1 master + 7 slaves )

we filled the cluster up with files that contain tab delimited data:
string \tab string etc

then we ran the example grep with a regular expression to count the number of each unique starting string. we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran

bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'

it took 26 minutes.

then to compare, we ran this bash command on one of the nodes, which produced the same output out of the data:

cat * | sed -e s/\ .*// | sort | uniq -c > /tmp/out

(sed regexpr is tab not spaces)

which took 2.5 minutes.

Then we added 10X the data into the cluster and reran Hadoop; it took 214 minutes, which is less than 10X the time, but still not that impressive.

so we are seeing a 10X performance penalty for using Hadoop vs the system commands. is that expected? we were expecting hadoop to be faster since it is distributed? perhaps there is too much overhead involved here? is the data too small?
Re: HDFS crash recovery -- "The directory is already locked."
Never mind. The storage in question was on an NFS share, and the locking problem seems to have resolved itself overnight. Fracking NFS.

Thanks anyway,
-Ben

Ben Slusky wrote:
> Greetings,
>
> We had a hard crash due to hardware failure in our Hadoop namenode host, and now the namenode won't start because it thinks its storage is still locked. Also, I'm apparently too stupid to find any documentation on crash recovery. Could someone please enlighten me?
>
> Thanks
> -Ben

--
Ben Slusky <[EMAIL PROTECTED]>
Re: Streaming --counters question
great! looking forwards to 0.18 Miles 2008/6/11 Arun C Murthy <[EMAIL PROTECTED]>: > > On Jun 10, 2008, at 3:16 PM, Miles Osborne wrote: > > Is there support for counters in streaming? In particular, it would be >> nice >> to be able to access these after a job has run. >> >> > Yes. Streaming applications can update counters in hadoop-0.18: > http://issues.apache.org/jira/browse/HADOOP-1328 > > Arun > > > Thanks! >> >> Miles >> >> >> -- >> The University of Edinburgh is a charitable body, registered in Scotland, >> with registration number SC005336. >> > > -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
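For reference, the way streaming counters work once HADOOP-1328 is in is that the mapper or reducer writes specially formatted lines to stderr. A minimal sketch of a streaming mapper written in bash (the group and counter names are made up):

  #!/bin/bash
  # pass each input line through unchanged, and bump a counter per line;
  # counter updates go to stderr as reporter:counter:<group>,<counter>,<amount>
  while IFS= read -r line; do
      echo "$line"
      echo "reporter:counter:MyGroup,LinesSeen,1" >&2
  done

The accumulated counters then show up alongside the job's other counters (in the job web UI and the job client output), so they can be read after the job has run.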