Kerberos and Delegation Tokens
Hi, According to 'Hadoop - The Definitive Guide': In a distributed system like HDFS or MapReduce, there are many client-server interactions, each of which must be authenticated. For example, an HDFS read operation will involve multiple calls to the namenode and calls to one or more datanodes. Instead of using the three-step Kerberos ticket exchange protocol to authenticate each call, which would present a high load on the KDC on a busy cluster, Hadoop uses delegation tokens to allow later authenticated access without having to contact the KDC again.

My understanding is that once authentication is established between the client and the NameNode, there is no need to contact the KDC (Key Distribution Center) again for NameNode queries until the ticket expires. So I don't see how delegation tokens reduce the number of times the KDC has to be contacted. Could someone please explain to me how delegation tokens help? Praveen
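For what it's worth, here is a rough, illustrative sketch of the pattern the book describes: one Kerberos-authenticated call fetches a delegation token, which is then reused for later RPCs. This is not from the book; the renewer name is a placeholder and the exact FileSystem#getDelegationToken signature varies across releases, so treat it as an assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;

public class DelegationTokenSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The first FileSystem call authenticates to the NameNode via Kerberos
    // (this is the only interaction that needs the KDC).
    FileSystem fs = FileSystem.get(conf);
    // Ask the NameNode for a delegation token; "mapred" is a hypothetical renewer.
    Token<?> token = fs.getDelegationToken("mapred");
    // Attach the token to the current user. Later RPCs (for example from map and
    // reduce tasks that received the token with the job) authenticate with the
    // token itself, so they never have to go back to the KDC.
    UserGroupInformation.getCurrentUser().addToken(token);
  }
}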
Re: Security at file level in Hadoop
According to this (http://goo.gl/rfwy4), prior to 0.22 Hadoop used the 'whoami' and 'id' commands to determine the user and groups of the running process. How does this work now? Praveen

On Wed, Feb 22, 2012 at 6:03 PM, Joey Echeverria j...@cloudera.com wrote: HDFS supports POSIX-style file and directory permissions (read, write, execute) for the owner, group and world. You can change the permissions with hadoop fs -chmod permissions path -Joey

On Feb 22, 2012, at 5:32, shreya@cognizant.com wrote: Hi, I want to implement security at file level in Hadoop, essentially restricting certain data to certain users. For example: File A can be accessed only by user X; File B can be accessed only by user X and user Y. Is this possible in Hadoop, and how do we do it? At what level are these permissions applied (before copying to HDFS or after putting in HDFS)? When the file gets replicated, does it retain these permissions? Thanks, Shreya
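As a side note, the permissions Joey mentions can also be set programmatically through the FileSystem API. A minimal sketch follows; the path, user and group names are invented for illustration, and setOwner normally has to run as the HDFS superuser.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class RestrictFileAccess {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path fileA = new Path("/data/fileA");                 // hypothetical path
    // Owner read/write only, nothing for group and others (same as chmod 600).
    fs.setPermission(fileA, new FsPermission((short) 0600));
    // Make user X the owner so only X can read it (names are examples).
    fs.setOwner(fileA, "userX", "groupX");
  }
}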
Re: Can't achieve load distribution
I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing).

Use the NLineInputFormat class. http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.html Praveen

On Thu, Feb 2, 2012 at 9:43 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Thanks! Mark

On Wed, Feb 1, 2012 at 7:44 PM, Anil Gupta anilgupt...@gmail.com wrote: Yes, if your block size is 64 MB. By the way, block size is configurable in Hadoop. Best Regards, Anil

On Feb 1, 2012, at 5:06 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Anil, do you mean one block of HDFS, like 64 MB? Mark

On Wed, Feb 1, 2012 at 7:03 PM, Anil Gupta anilgupt...@gmail.com wrote: Do you have enough data to start more than one mapper? If the entire data is less than a block size then only 1 mapper will run. Best Regards, Anil

On Feb 1, 2012, at 4:21 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing). Each line is 100 characters long, and I tell Hadoop to read only 100 bytes: job.getConfiguration().setInt("mapreduce.input.linerecordreader.line.maxlength", 100); I see that this part works - it reads only one line at a time, and if I change this parameter, it listens. However, on a cluster only one node receives all the map tasks. Only one map task is started. The others never get anything; they just wait. I've added a 100-second wait to the mapper - no change! Any advice? Thank you. Sincerely, Mark
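A minimal driver sketch of that suggestion, using the stock NLineInputFormat from the new API. The identity Mapper is only a placeholder for your own; setNumLinesPerSplit is available in newer releases, while the old mapred API uses the mapred.line.input.format.linespermap property instead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OneLinePerMapper {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "one-line-per-mapper");
    job.setJarByClass(OneLinePerMapper.class);
    // Every input split, and therefore every map task, gets exactly one line.
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1);
    NLineInputFormat.addInputPath(job, new Path(args[0]));
    job.setMapperClass(Mapper.class);              // identity mapper as a placeholder
    job.setOutputKeyClass(LongWritable.class);     // NLineInputFormat keys are byte offsets
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}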
Re: Can't achieve load distribution
Mark, NLineInputFormat is not something that was introduced in 0.21; I just sent the reference to the 0.21 URL FYI. It is also in the 0.20.205, 1.0.0 and 0.23 releases. Praveen

On Fri, Feb 3, 2012 at 1:25 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Praveen, this seems just like the right thing, but it's API 0.21 (I googled about the problems with it), so I have to use either the next Cloudera release, or Hortonworks, or something, am I right? Mark
Re: connection between slaves and master
Mark,

[mark@node67 ~]$ telnet node77

You need to specify the port number along with the server name, like `telnet node77 1234`.

2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s).

The slaves are not able to connect to the master. The configuration properties `fs.default.name` and `mapred.job.tracker` should point to the master and not to localhost when the master and slaves are on different machines.

Praveen

On Mon, Jan 9, 2012 at 11:41 PM, Mark question markq2...@gmail.com wrote: Hello guys, I'm requesting from a PBS scheduler a number of machines to run Hadoop, and even though all Hadoop daemons start normally on the master and slaves, the slaves don't have worker tasks in them. Digging into that, there seems to be some blocking between nodes (?) - I don't know how to describe it except that on a slave, if I telnet the master node it should be able to connect, but I get this error:

[mark@node67 ~]$ telnet node77
Trying 192.168.1.77...
telnet: connect to address 192.168.1.77: Connection refused
telnet: Unable to connect to remote host: Connection refused

The log at the slave nodes shows the same thing, even though it has the datanode and tasktracker started from the master (?):

2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s).
2012-01-09 10:04:04,439 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 1 time(s).
2012-01-09 10:04:05,442 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 2 time(s).
2012-01-09 10:04:06,444 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 3 time(s).
2012-01-09 10:04:07,446 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 4 time(s).
2012-01-09 10:04:08,448 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 5 time(s).
2012-01-09 10:04:09,450 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 6 time(s).
2012-01-09 10:04:10,452 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 7 time(s).
2012-01-09 10:04:11,454 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 8 time(s).
2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 9 time(s).
2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.RPC: Server at localhost/127.0.0.1:12123 not available yet, Z...

Any suggestions of what I can do? Thanks, Mark
Re: Allowing multiple users to submit jobs in hadoop 0.20.205 ?
By default `security.job.submission.protocol.acl` is set to * in hadoop-policy.xml, so it will allow any/multiple users to submit jobs and query job status. Check (1) for more details.

<property>
  <name>security.job.submission.protocol.acl</name>
  <value>*</value>
  <description>ACL for JobSubmissionProtocol, used by job clients to communicate with the jobtracker for job submission, querying job status etc. The ACL is a comma-separated list of user and group names. The user and group list is separated by a blank. For e.g. "alice,bob users,wheel". A special value of "*" means all users are allowed.</description>
</property>

(1) http://hadoop.apache.org/common/docs/r0.20.2/service_level_auth.html

Praveen

On Tue, Jan 3, 2012 at 10:46 AM, praveenesh kumar praveen...@gmail.com wrote: Hi, How can I allow multiple users to submit jobs in hadoop 0.20.205? Thanks, Praveenesh
Re: Hive starting error
http://hive.apache.org/releases.html#21+June%2C+2011%3A+release+0.7.1+available

21 June, 2011: release 0.7.1 available. This release is the latest release of Hive and it works with Hadoop 0.20.1 and 0.20.2.

I don't see the method thrown in the exception in 0.20.205. Praveen

On Fri, Dec 30, 2011 at 3:09 PM, praveenesh kumar praveen...@gmail.com wrote: Hi, I am using Hive 0.7.1 on Hadoop 0.20.205. While running Hive, it's giving me the following error:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.security.UserGroupInformation.login(Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/security/UserGroupInformation;
at org.apache.hadoop.hive.shims.Hadoop20Shims.getUGIForConf(Hadoop20Shims.java:448)
at org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.setConf(HadoopDefaultAuthenticator.java:51)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthenticator(HiveUtils.java:222)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:219)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:417)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Any idea on how to resolve this issue? Thanks, Praveenesh
Re: Remote access to namenode is not allowed despite the services are already started.
Changing the VM settings won't help. Change the value of fs.default.name to hdfs://106.77.211.187:9000 from hdfs://localhost:9000 in core-site.xml, for both the client and the NameNode. Replace the IP address with the IP address (or hostname) of the node on which the NameNode is running. Praveen

2012/1/2 Harsh J ha...@cloudera.com wrote: Woraphol, Yes, you'd need to tweak some settings in your VMs such that they allow remote connections. It could also be a firewall running inside of your NameNode instance preventing this. Once you get the telnet working after troubleshooting your network settings (I don't know the bullseye spot here, sorry), it should be fine after-on.

2012/1/1 s4510...@hotmail.com wrote: Dear all, I successfully installed and ran Hadoop on a single machine whose IP is 192.168.1.109 (in fact it is an Ubuntu instance running on VirtualBox). Typing jps shows:

2473 DataNode
2765 TaskTracker
3373 Jps
2361 NameNode
2588 SecondaryNameNode
2655 JobTracker

This should mean that Hadoop is up and running. Running commands like ./hadoop fs -ls is fine and produces the expected result. But if I try to connect to it from my Windows box, whose IP is 192.168.1.80, by writing Java code with the HDFS API as follows:

Configuration conf = new Configuration();
FileSystem hdfs = null;
Path filenamePath = new Path(FILE_NAME);
hdfs = FileSystem.get(conf); // -- the problem occurred at this line

when I run the code, the error is displayed as follows:

11/12/07 20:37:24 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 0 time(s).
11/12/07 20:37:26 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 1 time(s).
11/12/07 20:37:28 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 2 time(s).
11/12/07 20:37:30 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 3 time(s).
11/12/07 20:37:32 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 4 time(s).
11/12/07 20:37:33 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 5 time(s).
11/12/07 20:37:35 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 6 time(s).
11/12/07 20:37:37 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 7 time(s).
11/12/07 20:37:39 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 8 time(s).
11/12/07 20:37:41 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 9 time(s).
java.net.ConnectException: Call to /192.168.1.109:9000 failed on connection exception: java.net.ConnectException: Connection refused: no further information

To make sure the socket is already opened and waiting for incoming connections on the Hadoop server, I ran netstat on the Ubuntu box; the result shows as follows:

tcp 0 0 localhost:51201 *:* LISTEN 2765/java
tcp 0 0 *:50020 *:* LISTEN 2473/java
tcp 0 0 localhost:9000 *:* LISTEN 2361/java
tcp 0 0 localhost:9001 *:* LISTEN 2655/java
tcp 0 0 *:mysql *:* LISTEN -
tcp 0 0 *:50090 *:* LISTEN 2588/java
tcp 0 0 *:11211 *:* LISTEN -
tcp 0 0 *:40843 *:* LISTEN 2473/java
tcp 0 0 *:58699 *:* LISTEN -
tcp 0 0 *:50060 *:* LISTEN 2765/java
tcp 0 0 *:50030 *:* LISTEN 2655/java
tcp 0 0 *:53966 *:* LISTEN 2655/java
tcp 0 0 *:www *:* LISTEN -
tcp 0 0 *:epmd *:* LISTEN -
tcp 0 0 *:55826 *:* LISTEN 2588/java
tcp 0 0 *:ftp *:* LISTEN -
tcp 0 0 *:50070 *:* LISTEN 2361/java
tcp 0 0 *:52822 *:* LISTEN 2361/java
tcp 0 0 *:ssh *:* LISTEN -
tcp 0 0 *:55672 *:* LISTEN -
tcp 0 0 *:50010 *:* LISTEN 2473/java
tcp 0 0 *:50075 *:* LISTEN 2473/java

I noticed that if the local address column is something like localhost:9000
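To make the client-side half of that change concrete, here is a small sketch that sets the NameNode address on the client Configuration instead of relying on a local core-site.xml. The IP and port are the ones from the thread; fs.default.name is the pre-2.x property name (later releases call it fs.defaultFS), and the NameNode itself still has to bind to a non-localhost address for this to work.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteHdfsClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the client at the NameNode's real address instead of localhost.
    conf.set("fs.default.name", "hdfs://192.168.1.109:9000");
    FileSystem hdfs = FileSystem.get(conf);
    System.out.println(hdfs.exists(new Path("/")));
  }
}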
Re: Hadoop MySQL database access
Check the `mapreduce.job.reduce.slowstart.completedmaps` parameter. The reducers cannot start processing the data from the mappers until all the map tasks are complete, but the reducers can start fetching the data from the nodes on which the map tasks have completed. Praveen

On Thu, Dec 29, 2011 at 12:44 AM, Prashant Kommireddi prash1...@gmail.com wrote: By design, reduce would start only after all the maps finish. There is no way for the reduce to begin grouping/merging by key unless all the maps have finished. Sent from my iPhone

On Dec 28, 2011, at 8:53 AM, JAGANADH G jagana...@gmail.com wrote: Hi All, I wrote a map reduce program to fetch data from MySQL and process the data (word count). The program executes successfully, but I noticed that the reduce task starts only after the map task finishes. Is there any way to run the map and reduce in parallel? The program fetches data from MySQL and writes the processed output to HDFS. I am using Hadoop in pseudo-distributed mode. -- JAGANADH G http://jaganadhg.in ILUGCBE http://ilugcbe.org.in
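For illustration, the threshold can be set per job before submission. A sketch assuming the newer property name quoted above; older releases use mapred.reduce.slowstart.completed.maps instead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowstartExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Reducers may be scheduled (and start copying map output) once 80% of the
    // map tasks have finished; the reduce() calls themselves still wait for all maps.
    conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);
    Job job = new Job(conf, "slowstart-example");
    // ... configure the formats, paths, mapper and reducer, then submit as usual ...
    System.out.println(job.getConfiguration()
        .getFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.05f));
  }
}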
Re: Automate Hadoop installation
Also, check out Ambari (http://incubator.apache.org/ambari/), which is still in Incubator status. How do Ambari and Puppet compare? Regards, Praveen

On Tue, Dec 6, 2011 at 1:00 PM, alo alt wget.n...@googlemail.com wrote: Hi, to deploy software I suggest pulp: https://fedorahosted.org/pulp/wiki/HowTo For a package-based distro (Debian, Red Hat, CentOS) you can build Apache's Hadoop, pack it and deploy. Configs, as Cos says, over Puppet. If you use Red Hat / CentOS, take a look at Spacewalk. best, Alex

On Mon, Dec 5, 2011 at 8:20 PM, Konstantin Boudnik c...@apache.org wrote: There's that great project called BigTop (in the Apache Incubator) which provides for building the Hadoop stack. Part of what it provides is a set of Puppet recipes which will allow you to do exactly what you're looking for, with perhaps some minor corrections. Seriously, look at Puppet - otherwise it will be a living nightmare of configuration mismanagement. Cos

On Mon, Dec 05, 2011 at 04:02PM, praveenesh kumar wrote: Hi all, Can anyone guide me on how to automate the Hadoop installation/configuration process? I want to install Hadoop on 10-20 nodes, which may even grow to 50-100 nodes. I know we can use configuration tools like Puppet or shell scripts. Has anyone done it? How can we do Hadoop installations on so many machines in parallel? What are the best practices for this? Thanks, Praveenesh -- Alexander Lorenz http://mapredit.blogspot.com
Re: Multiple Mappers for Multiple Tables
MultipleInputs takes multiple Paths (files) and not a DB as input. As mentioned earlier, export the tables into HDFS either using Sqoop or a native DB export tool and then do the processing. Sqoop is configured to use the native DB export tool whenever possible. Regards, Praveen

On Tue, Dec 6, 2011 at 3:44 AM, Justin Vincent justi...@gmail.com wrote: Thanks Bejoy, I was looking at DBInputFormat with MultipleInputs. MultipleInputs takes a Path parameter. Are these paths just ignored here?

On Mon, Dec 5, 2011 at 2:31 PM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Justin, Just to add on to my response. If you need to fetch data from an RDBMS in your mapper using your custom mapreduce code, you can use the DBInputFormat in your mapper class with MultipleInputs. You have to be careful with the number of mappers for your application, as DBs are constrained by a limit on maximum simultaneous connections. You also need to ensure that the same query is not executed n times in n mappers all fetching the same data; that would just be a waste of network. Sqoop + Hive would be my recommendation and a good combination for such use cases. If you have Pig competency you can also look into Pig instead of Hive. Hope it helps! Regards, Bejoy.K.S

On Tue, Dec 6, 2011 at 1:36 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Justin, If I get your requirement right, you need to get in data from multiple RDBMS sources, do a join on the same, and maybe some more custom operations on top of this. For this you don't need to write custom mapreduce code unless it is really required. You can achieve the same in two easy steps:
- Import data from the RDBMS into Hive using Sqoop (import)
- Use Hive to do the join and processing on this data
Hope it helps! Regards, Bejoy.K.S

On Tue, Dec 6, 2011 at 12:13 AM, Justin Vincent justi...@gmail.com wrote: I would like to join some DB tables, possibly from different databases, in an MR job. I would essentially like to use MultipleInputs, but that seems file oriented. I need a different mapper for each DB table. Suggestions? Thanks! Justin Vincent
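If a direct read from the database is still needed, DBInputFormat can be configured per job rather than through MultipleInputs. A rough sketch under assumed connection details - the JDBC driver, URL, credentials, table and column names are all placeholders.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DbTableRead {

  // Hypothetical record type for a table with a single VARCHAR column "name".
  public static class NameRecord implements Writable, DBWritable {
    private String name;
    public void readFields(ResultSet rs) throws SQLException { name = rs.getString("name"); }
    public void write(PreparedStatement ps) throws SQLException { ps.setString(1, name); }
    public void readFields(DataInput in) throws IOException { name = in.readUTF(); }
    public void write(DataOutput out) throws IOException { out.writeUTF(name); }
    public String toString() { return name; }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder JDBC settings - substitute the real driver, URL and credentials.
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/mydb", "dbuser", "dbpassword");
    Job job = new Job(conf, "db-table-read");
    job.setInputFormatClass(DBInputFormat.class);
    // Read the "name" column of a hypothetical "employees" table.
    DBInputFormat.setInput(job, NameRecord.class, "employees", null, null, "name");
    // ... set the mapper, output format and output path as usual, then submit ...
  }
}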
Re: Running a job continuously
If the requirement is real-time data processing, using Flume will not suffice, as there is a time lag between the collection of files by Flume and the processing done by Hadoop. Consider frameworks like S4, Storm (from Twitter), HStreaming etc., which suit real-time processing. Regards, Praveen

On Tue, Dec 6, 2011 at 10:39 AM, Ravi teja ch n v raviteja.c...@huawei.com wrote: Hi Burak, "Bejoy Ks, i have a continuous inflow of data but i think i need a near real-time system." Just to add to Bejoy's point: with Oozie, you can specify the data dependency for running your job. When a specific amount of data is in, you can configure Oozie to run your job. I think this will suffice your requirement. Regards, Ravi Teja

From: burakkk [burak.isi...@gmail.com] Sent: 06 December 2011 04:03:59 To: mapreduce-u...@hadoop.apache.org Cc: common-user@hadoop.apache.org Subject: Re: Running a job continuously

Athanasios Papaoikonomou, a cron job isn't useful for me, because I want to execute the MR job with the same algorithm but different files arrive with different velocity. Both Storm and Facebook's Hadoop are designed for that, but I want to use the Apache distribution. Bejoy Ks, I have a continuous inflow of data but I think I need a near real-time system. Mike Spreitzer, both output and input are continuous. Output isn't relevant to the input. All I want is that all the incoming files are processed by the same job and the same algorithm. For example, think about the wordcount problem. When you want to run wordcount, you implement this: http://wiki.apache.org/hadoop/WordCount But when the program reaches the line job.waitForCompletion(true);, the job will somehow end. When you want to make it continuous, what would you do in Hadoop without other tools? One more thing is your assumption that the input file's name is filename_timestamp (filename_20111206_0030).

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "wordcount");
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(Map.class);
  job.setReducerClass(Reduce.class);
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);
}

On Mon, Dec 5, 2011 at 11:19 PM, Bejoy Ks bejoy.had...@gmail.com wrote: Burak, If you have a continuous inflow of data, you can choose Flume to aggregate the files into larger sequence files or so if they are small. When you have a substantial chunk of data (equal to the HDFS block size), you can push that data on to HDFS. Based on your SLAs, you need to schedule your jobs using Oozie or a simple shell script. In very simple terms:
- push input data (could be from a Flume collector) into a staging HDFS dir
- before triggering the job (hadoop jar), copy the input from staging to the main input dir
- execute the job
- archive the input and output into archive dirs (or any other dirs)
- the output archive dir could be the source of output data
- delete the output dir and empty the input dir
Hope it helps! Regards, Bejoy.K.S

On Tue, Dec 6, 2011 at 2:19 AM, burakkk burak.isi...@gmail.com wrote: Hi everyone, I want to run an MR job continuously, because I have streaming data and I try to analyze it all the time in my way (algorithm). For example, say you want to solve the wordcount problem - it's the simplest one :) If you have multiple files and new files keep coming, how do you handle it? You could execute an MR job per file, but you would have to do it repeatedly. So what do you think? Thanks, Best regards... -- BURAK ISIKLI | http://burakisikli.wordpress.com
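Purely as an illustration of Bejoy's staging-directory recipe, a bare-bones driver loop might look like the sketch below. The directory names and polling interval are made up, the job is left as the default identity job, and in practice Oozie or cron would normally drive this rather than a while(true) loop.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ContinuousJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path staging = new Path("/wc/staging");   // hypothetical dirs, assumed to exist
    Path input = new Path("/wc/input");
    while (true) {
      FileStatus[] ready = fs.listStatus(staging);
      if (ready != null && ready.length > 0) {
        fs.mkdirs(input);
        // Move whatever has landed in staging into this cycle's input dir.
        for (FileStatus f : ready) {
          fs.rename(f.getPath(), new Path(input, f.getPath().getName()));
        }
        Job job = new Job(conf, "wordcount-" + System.currentTimeMillis());
        // ... set the mapper/reducer/formats as in the word count example above ...
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, new Path("/wc/output-" + System.currentTimeMillis()));
        job.waitForCompletion(true);
        fs.delete(input, true);               // or move it to an archive dir instead
      }
      Thread.sleep(60000);                    // poll once a minute
    }
  }
}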
Re: Availability of Job traces or logs
Arun,

I want to control the split placements.

InputSplits are logical divisions of the input data; there is no placement of the InputSplits as such. InputSplits are calculated on the client by the InputFormat class when a job is submitted, and the InputSplit metadata is put in HDFS to be fetched later. Each InputSplit is processed by a map task. The Hadoop framework makes sure that the task and the InputSplit it processes are as close as possible to avoid any overhead. MAPREDUCE-207 is for moving the calculation of the InputSplits from the client to the cluster, but I don't see any progress on it. BTW, what is the new scheduler about? Regards, Praveen

On Sun, Dec 4, 2011 at 10:19 AM, ArunKumar arunk...@gmail.com wrote: Amar, I am attempting to write a new scheduler for Hadoop and test it using Mumak. 1. I want to test its behaviour under different sizes of job traces (meaning number of jobs, say 5, 10, 25, 50, 100) and different numbers of nodes. Till now I was using only the test data given by Mumak, which has 19 jobs and a 1529-node topology. I don't have many nodes with me to run some programs, collect logs and use Rumen to generate traces. 2. I want to control the split placements, so I need to modify the preferred locations for task attempts in the trace, but the trace for even 19 jobs is huge. So I was thinking whether I can get small, medium and large job traces with corresponding topology traces so that modifying will be easier. Arun

On Sat, Dec 3, 2011 at 1:15 PM, Amar Kamat [via Lucene] ml-node+s472066n3556710...@n3.nabble.com wrote: Arun, You can very well run synthetic workloads like large-scale sort, wordcount etc., or more realistic workloads like PigMix (https://cwiki.apache.org/confluence/display/PIG/PigMix). On a decent enough cluster, these workloads work pretty well. Is there a specific reason why you want traces of varied sizes from various organizations? "How can i make sure that the rumen generates only say 25 jobs, 50 jobs or so" - Do you want to get 25/50 jobs based on some filtering criterion? I recently faced a similar situation where I wanted to extract jobs from a Rumen trace based on job IDs. I will be happy to share these filtering tools. Amar

On 12/1/11 8:48 AM, ArunKumar wrote: Hi guys! Apart from generating the job traces from Rumen, can I get logs or job traces of varied sizes from some organizations? How can I make sure that Rumen generates only say 25 jobs, 50 jobs or so? Thanks, Arun
Re: How do I programmatically get total job execution time?
Hi, I ran a job using the new MR API in standalone mode on 0.21. Both Job#getFinishTime and Job#getStartTime are returning 0. Not sure if this is a bug. Thanks, Praveen

On Sat, Dec 3, 2011 at 6:14 AM, Raj V rajv...@yahoo.com wrote: As Harsh said, I don't think there is a simple way to find when the job ended, especially after the job is completed. But can't you just wait for your job to complete and log the time when the job completed? Raj

From: Harsh J ha...@cloudera.com To: common-user@hadoop.apache.org Sent: Friday, December 2, 2011 12:53 PM Subject: Re: How do I programmatically get total job execution time? I remember hitting this once in 0.20 - seems like an API limitation. The resolution we took back then was to get a list of all tasks, and get the end time from the last ended task's completion time (sort and pick). There may be other ways though - others can comment on that perhaps (metrics? job-history?)

On 02-Dec-2011, at 11:27 PM, W.P. McNeill wrote: After my Hadoop job has successfully completed I'd like to log the total amount of time it took. This is the "Finished in" statistic in the web UI. How do I get this number programmatically? Is there some way I can query the Job object? I didn't see anything in the API documentation.
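A small sketch combining the two suggestions above - wall-clock timing in the driver as a reliable fallback, plus the Job accessors, which, as noted, may come back as 0 in some releases or in local mode. The job itself is left as the default identity job so the example stays self-contained.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobDuration {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "timed-job");
    job.setJarByClass(JobDuration.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // any small input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    long wallStart = System.currentTimeMillis();
    job.waitForCompletion(true);                               // identity map/reduce by default
    long wallMillis = System.currentTimeMillis() - wallStart;
    // Where the cluster reports them, these mirror the "Finished in" figure
    // shown in the web UI.
    long reportedMillis = job.getFinishTime() - job.getStartTime();
    System.out.println("wall clock ms: " + wallMillis + ", reported ms: " + reportedMillis);
  }
}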
Re: Binary content
Mohit, Hadoop: The Definitive Guide (Chapter 3 - Hadoop I/O) has a section on SequenceFile and is worth reading. http://oreilly.com/catalog/9780596521981 Thanks, Praveen On Thu, Sep 1, 2011 at 9:15 PM, Owen O'Malley o...@hortonworks.com wrote: On Thu, Sep 1, 2011 at 8:37 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Thanks! Is there a specific tutorial I can focus on to see how it could be done? Take the word count example and change its output format to be SequenceFileOutputFormat. job.setOutputFormatClass(SequenceFileOutputFormat.class); and it will generate SequenceFiles instead of text. There is SequenceFileInputFormat for reading. -- Owen
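Roughly what Owen's suggestion looks like as a complete driver. The tokenizing mapper below is a generic word count mapper written here for illustration; IntSumReducer is the stock one shipped with Hadoop, and SequenceFileInputFormat can then read the output back in a downstream job.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountToSequenceFile {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "wordcount-seqfile");
    job.setJarByClass(WordCountToSequenceFile.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // The only change from the plain text version: emit binary SequenceFiles.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}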
Eclipse Hadoop Plugin Error creating New Hadoop location ....
Hi, I am trying to run Hadoop from Eclipse using the Eclipse Hadoop plugin and am stuck with the following problem. I first copied hadoop-0.21.0-eclipse-plugin.jar to the Eclipse plugins folder, started Eclipse and switched to the Map/Reduce perspective. In the Map/Reduce Locations view, when I try to add a new Hadoop location the following error appears in the Eclipse error log. The version of Eclipse is Helios Service Release 2.

Message: Unhandled event loop exception

Exception Stack Trace:
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
at org.apache.hadoop.eclipse.server.HadoopServer.init(HadoopServer.java:223)
at org.apache.hadoop.eclipse.servers.HadoopLocationWizard.init(HadoopLocationWizard.java:88)
at org.apache.hadoop.eclipse.actions.NewLocationAction$1.init(NewLocationAction.java:41)
at org.apache.hadoop.eclipse.actions.NewLocationAction.run(NewLocationAction.java:40)
at org.eclipse.jface.action.Action.runWithEvent(Action.java:498)
at org.eclipse.jface.action.ActionContributionItem.handleWidgetSelection(ActionContributionItem.java:584)
at org.eclipse.jface.action.ActionContributionItem.access$2(ActionContributionItem.java:501)
at org.eclipse.jface.action.ActionContributionItem$5.handleEvent(ActionContributionItem.java:411)
at org.eclipse.swt.widgets.EventTable.sendEvent(EventTable.java:84)
at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1053)
at org.eclipse.swt.widgets.Display.runDeferredEvents(Display.java:4066)
at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3657)
at org.eclipse.ui.internal.Workbench.runEventLoop(Workbench.java:2640)
at org.eclipse.ui.internal.Workbench.runUI(Workbench.java:2604)
at org.eclipse.ui.internal.Workbench.access$4(Workbench.java:2438)
at org.eclipse.ui.internal.Workbench$7.run(Workbench.java:671)
at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:332)
at org.eclipse.ui.internal.Workbench.createAndRunWorkbench(Workbench.java:664)
at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:149)
at org.eclipse.ui.internal.ide.application.IDEApplication.start(IDEApplication.java:115)
at org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:196)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:110)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:79)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:369)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:179)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:620)
at org.eclipse.equinox.launcher.Main.basicRun(Main.java:575)
at org.eclipse.equinox.launcher.Main.run(Main.java:1408)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:506)
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:422)
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:410)
at org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.loadClass(DefaultClassLoader.java:107)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 32 more

When I used hadoop-0.20.2-eclipse-plugin.jar and hadoop-eclipse-plugin-0.20.203.0.jar, none of the views appear in the Map/Reduce perspective, and there are no corresponding views under Window -> Show View -> Other either. Thanks, Praveen
Hadoop Jar Files
Hi, I have extracted the hadoop-0.20.2, hadoop-0.20.203.0 and hadoop-0.21.0 files. In the hadoop-0.21.0 folder the hadoop-hdfs-0.21.0.jar, hadoop-mapred-0.21.0.jar and the hadoop-common-0.21.0.jar files are there. But in the hadoop-0.20.2 and the hadoop-0.20.203.0 releases the same files are missing. Have the jar files been packaged differently in the 0.20.2 and 0.20.203.0 releases or should I get these jars from some other projects? Thanks, Praveen