Re: Questions about hadoop-metrics2.properties
1. File and Ganglia are the only bundled sinks, though there are socket/JSON (for Chukwa) and Graphite sink patches in the works.
2. Hadoop metrics (and metrics2) is mostly designed for system/process metrics, which means you'd need to attach jconsole to your map/reduce task processes to see task metrics instrumented via the metrics framework. What you actually want is probably custom job counters.
3. You don't need any configuration to use JMX to access metrics2, as JMX is currently on by default. The configuration in hadoop-metrics2.properties is mostly for optional sink configuration and metrics filtering.

__Luke

On Wed, Oct 23, 2013 at 4:21 PM, Benyi Wang bewang.t...@gmail.com wrote:
1. Does hadoop metrics2 only support File and Ganglia sinks?
2. Can I expose metrics via JMX, especially customized metrics? I created some metrics in my mapreduce job and could successfully output them using a FileSink. But if I use jconsole to access the YARN nodemanager, I can only see Hadoop metrics, e.g. Hadoop/NodeManager/NodeManagerMetrics etc., not mine with the maptask prefix. How do I set things up to see metrics with the maptask/reducetask prefix?
3. Is there an example using JMX? I could not find one. The configuration syntax is: [prefix].[source|sink|jmx|].[instance].[option] http://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html
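To illustrate the custom job counters suggested in point 2, here is a minimal sketch against the new mapreduce API (the mapper class and counter names are made up); counters defined this way show up in the job web UI, the job history, and the client output rather than in the task JVM's MBeans:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ClickMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  // hypothetical counter names, grouped under the enum's class name
  enum MyCounters { RECORDS_SEEN, BAD_RECORDS }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.getCounter(MyCounters.RECORDS_SEEN).increment(1);
    if (value.getLength() == 0) {
      context.getCounter(MyCounters.BAD_RECORDS).increment(1);
      return;
    }
    context.write(value, new LongWritable(1));
  }
}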
Re: What happens when you have fewer input files than mapper slots?
Short version: let's say you have 20 nodes, and each node has 10 mapper slots. You start a job with 20 very small input files. How is the work distributed to the cluster? Will it be even, with each node spawning one mapper task? Is there any way of predicting or controlling how the work will be distributed?

You're right in expecting that the tasks of the small job will likely be evenly distributed among the 20 nodes, provided the 20 files are evenly distributed among the nodes and there are free slots on every node.

Long version: My cluster is currently used for two different jobs. The cluster is currently optimized for Job A, so each node has a maximum of 18 mapper slots. However, I also need to run Job B. Job B is VERY cpu-intensive, so we really only want one mapper to run on a node at any given time. I've done a bunch of research, and it doesn't seem like Hadoop gives you any way to set the maximum number of mappers per node on a per-job basis. I'm at my wit's end here, and considering some rather egregious workarounds. If you can think of anything that can help me, I'd very much appreciate it.

Are you seeing that Job B tasks are not being evenly distributed across nodes? You can check the locations of the files with hadoop fsck. If evenness is the goal, you can also write your own input format that returns empty locations for each split and read the small files directly in the map task (see the sketch below). If you're using Hadoop 1.0.x and the fair scheduler, you might need to set mapred.fairscheduler.assignmultiple to false in mapred-site.xml (JT restart required) to work around a bug in the fair scheduler (MAPREDUCE-2905) that causes tasks to be assigned unevenly. The bug is fixed in Hadoop 1.1+.

__Luke
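A minimal sketch of the empty-locations idea mentioned above (assuming the Hadoop 1.x org.apache.hadoop.mapred API; the class name is made up): it reuses TextInputFormat's splits but strips the preferred hosts, so the scheduler has no locality hint and tends to spread the tasks across free slots.

import java.io.IOException;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class NoLocalityTextInputFormat extends TextInputFormat {
  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    InputSplit[] splits = super.getSplits(job, numSplits);
    InputSplit[] stripped = new InputSplit[splits.length];
    for (int i = 0; i < splits.length; i++) {
      FileSplit s = (FileSplit) splits[i];
      // same file, offset and length, but with an empty host list
      stripped[i] = new FileSplit(s.getPath(), s.getStart(), s.getLength(), new String[0]);
    }
    return stripped;
  }
}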
Re: how to resolve conflicts with jar dependencies
The problem is resolved in the next release of hadoop (2.0.3-alpha, cf. MAPREDUCE-1700). For hadoop 1.x based releases/distributions, put -Dmapreduce.user.classpath.first=true on the hadoop command line and/or in the client config.

On Tue, Mar 12, 2013 at 6:49 AM, Jane Wayne jane.wayne2...@gmail.com wrote: hi, i need to know how to resolve conflicts with jar dependencies.
* first, my job requires Jackson JSON-processor v1.9.11.
* second, the hadoop cluster has Jackson JSON-processor v1.5.2. the jars are installed in $HADOOP_HOME/lib.
according to this link, http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/ , there are 3 ways to include 3rd party libraries in a map/reduce (mr) job:
* use the -libjars flag
* include the dependent libraries in the executing jar file's /lib directory
* put the jars in the $HADOOP_HOME/lib directory
i can report that using -libjars and including the libraries in my jar's /lib directory do not work (in my case of jar conflicts). i still get a NoSuchMethodException. the only way to get my job to run is the last option, placing the newer jars in $HADOOP_HOME/lib. the last option is fine on a sandbox or development instance, but there are some political difficulties (not only technical) in modifying our production environment. my questions/concerns are:
1. how come the -libjars and /lib options do not work? how does class loading work in mr tasks?
2. is there another option available that i am not aware of to try and get dependent jars used by the job to overwrite what's in $HADOOP_HOME/lib at runtime of the mr tasks?
any help is appreciated. thank you all.
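If it is more convenient than the command-line flag, the same setting can be put into the job configuration from the driver; a small sketch, assuming the property name Luke mentions applies to your 1.x distribution:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class ClasspathFirstDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // equivalent to passing -Dmapreduce.user.classpath.first=true on the command line
    conf.setBoolean("mapreduce.user.classpath.first", true);
    JobConf job = new JobConf(conf);
    // ... set mapper/reducer/input/output and submit as usual ...
  }
}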
Re: Introducing Parquet: efficient columnar storage for Hadoop.
IMO, it'll be enlightening to Hadoop users to compare Parquet with Trevni and ORCFile, all of which are columnar formats for Hadoop that are relatively new. Do we really need 3 columnar formats?

On Tue, Mar 12, 2013 at 8:45 AM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Fellow Hadoopers, We'd like to introduce a joint project between Twitter and Cloudera engineers -- a new columnar storage format for Hadoop called Parquet ( http://parquet.github.com). We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. Parquet is built from the ground up with complex nested data structures in mind. We adopted the repetition/definition level approach to encoding such data structures, as described in Google's Dremel paper; we have found this to be a very efficient method of encoding data in non-trivial object schemas. Parquet is built to support very efficient compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. We separate the concepts of encoding and compression, allowing Parquet consumers to implement operators that work directly on encoded data without paying a decompression and decoding penalty when possible. Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult-to-set-up dependencies. The initial code, available at https://github.com/Parquet, defines the file format, provides Java building blocks for processing columnar data, and implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an example of a complex integration -- Input/Output formats that can convert Parquet-stored data directly to and from Thrift objects. A preview version of Parquet support will be available in Cloudera's Impala 0.7. Twitter is starting to convert some of its major data sources to Parquet in order to take advantage of the compression and deserialization savings. Parquet is currently under heavy development. Parquet's near-term roadmap includes:
* Hive SerDes (Criteo)
* Cascading Taps (Criteo)
* Support for dictionary encoding, zigzag encoding, and RLE encoding of data (Cloudera and Twitter)
* Further improvements to Pig support (Twitter)
Company names in parentheses indicate whose engineers signed up to do the work -- others can feel free to jump in too, of course. We've also heard requests to provide an Avro container layer, similar to what we do with Thrift. Seeking volunteers! We welcome all feedback, patches, and ideas; to foster community development, we plan to contribute Parquet to the Apache Incubator when the development is farther along. Regards, Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy, Jonathan Coveney, and friends.
Re: Hadoop cluster hangs on big hive job
You mean HDFS-4479? The log seems to indicate the infamous jetty hang issue (MAPREDUCE-2386) though.

On Mon, Mar 11, 2013 at 1:52 PM, Suresh Srinivas sur...@hortonworks.com wrote: I have seen one such problem related to big hive jobs that open a lot of files. See HDFS-4496 for more details. Snippet from the description: The following issue was observed in a cluster that was running a Hive job and was writing to 100,000 temporary files (each task is writing to 1000s of files). When this job is killed, a large number of files are left open for write. Eventually when the lease for open files expires, lease recovery is started for all these files in a very short duration of time. This causes a large number of commitBlockSynchronization where logSync is performed with the FSNamesystem lock held. This overloads the namenode, resulting in slowdown. Could this be the cause? Can you check the namenode log to see if you have lease recovery activity? If not, can you send some information about what is happening in the namenode logs at the time of this slowdown?

On Mon, Mar 11, 2013 at 1:32 PM, Daning Wang dan...@netseer.com wrote: [hive@mr3-033 ~]$ hadoop version Hadoop 1.0.4 Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290 Compiled by hortonfo on Wed Oct 3 05:13:58 UTC 2012

On Sun, Mar 10, 2013 at 8:16 AM, Suresh Srinivas sur...@hortonworks.com wrote: What is the version of hadoop? Sent from phone

On Mar 7, 2013, at 11:53 AM, Daning Wang dan...@netseer.com wrote: We have a hive query processing zipped csv files. The query was scanning 10 days of data (partitioned by date), around 130G per day. The problem is not consistent: if you run it again, it might go through, but the problem has never happened on smaller jobs (like processing only one day's data). We don't have a space issue. I have attached the log file from when the problem happens. It is stuck like the following (just search for "19706 of 49964"):
2013-03-05 15:13:51,587 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_19_0 0.131468% reduce copy (19706 of 49964 at 0.00 MB/s)
2013-03-05 15:13:51,811 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_39_0 0.131468% reduce copy (19706 of 49964 at 0.00 MB/s)
2013-03-05 15:13:52,551 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_32_0 0.131468% reduce copy (19706 of 49964 at 0.00 MB/s)
2013-03-05 15:13:52,760 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_00_0 0.131468% reduce copy (19706 of 49964 at 0.00 MB/s)
2013-03-05 15:13:52,946 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_24_0 0.131468% reduce copy (19706 of 49964 at 0.00 MB/s)
2013-03-05 15:13:54,742 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_08_0 0.131468% reduce copy (19706 of 49964 at 0.00 MB/s)
Thanks, Daning

On Thu, Mar 7, 2013 at 12:21 AM, Håvard Wahl Kongsgård haavard.kongsga...@gmail.com wrote: hadoop logs?

On 6 March 2013 21:04, Daning Wang dan...@netseer.com wrote: We have a 5-node cluster (Hadoop 1.0.4). It hung a couple of times while running big jobs. Basically all the nodes are dead; from the tasktracker's log it looks like it went into some kind of loop forever. All the log entries look like this when the problem happened. Any idea how to debug the issue? Thanks in advance.
2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_12_0 0.131468% reduce copy (19706 of 49964 at 0.00 MB/s)
2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_28_0 0.131468% reduce copy (19706 of 49964 at 0.00 MB/s)
2013-03-05 15:13:20,858 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_36_0 0.131468% reduce copy (19706 of 49964 at 0.00 MB/s)
2013-03-05 15:13:21,141 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_16_0 0.131468% reduce copy (19706 of 49964 at 0.00 MB/s)
2013-03-05 15:13:21,486 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_19_0 0.131468% reduce copy (19706 of 49964 at 0.00 MB/s)
2013-03-05 15:13:21,692 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_39_0 0.131468% reduce copy (19706 of 49964 at 0.00 MB/s)
2013-03-05 15:13:22,448 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_32_0 0.131468% reduce copy (19706 of 49964 at 0.00 MB/s)
2013-03-05 15:13:22,643 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_00_0 0.131468% reduce copy (19706 of 49964 at 0.00 MB/s)
2013-03-05 15:13:22,840 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_24_0 0.131468% reduce copy (19706 of 49964 at 0.00 MB/s)
2013-03-05 15:13:24,628 INFO org.apache.hadoop.mapred.TaskTracker:
Re: issue with permissions of mapred.system.dir
It's a known issue for the fair scheduler in Hadoop 1.x, see MAPREDUCE-4398. A workaround is to submit 4 or more jobs as the user of the jobtracker, and everything will work fine afterwards. BTW, the IBM BigInsights community version (open source) contains the right fix (properly initialize job init threads) since BigInsights 1.3.0.1. Unfortunately IBM devs are too busy to port/submit the patches to Apache right now :)

__Luke

On Wed, Oct 10, 2012 at 9:32 AM, Goldstone, Robin J. goldsto...@llnl.gov wrote: There is no /hadoop1 directory. It is //hadoop1, which is the name of the server running the name node daemon: <value>hdfs://hadoop1/mapred</value>. Per offline conversations with Arpit, it appears this problem is related to the fact that I am using the fair scheduler. The fair scheduler is designed to run map reduce jobs as the user, rather than under the mapred username. Apparently there are some issues with this scheduler related to permissions on certain directories not allowing other users to execute/write in places that are necessary for the job to run. I haven't yet tried Arpit's suggestion to switch to the task scheduler but I imagine it will resolve my issue, at least for now. Ultimately I do want to use the fair scheduler, as multi-tenancy is a key requirement for our Hadoop deployment.

From: Manu S manupk...@gmail.com Reply-To: user@hadoop.apache.org user@hadoop.apache.org Date: Wednesday, October 10, 2012 3:34 AM To: user@hadoop.apache.org user@hadoop.apache.org Subject: Re: issue with permissions of mapred.system.dir
What is the permission for the /hadoop1 dir in HDFS? Does the mapred user have permission on the same directory? Thanks, Manu S

On Wed, Oct 10, 2012 at 5:52 AM, Arpit Gupta ar...@hortonworks.com wrote: What is your mapreduce.jobtracker.staging.root.dir set to? This is a directory that needs to be writable by the user, and it is recommended to be set to /user so it writes in the appropriate user's home directory. -- Arpit Gupta Hortonworks Inc. http://hortonworks.com/

On Oct 9, 2012, at 4:44 PM, Goldstone, Robin J. goldsto...@llnl.gov wrote: I am bringing up a Hadoop cluster for the first time (but am an experienced sysadmin with lots of cluster experience) and running into an issue with permissions on mapred.system.dir. It has generally been a chore to figure out all the various directories that need to be created to get Hadoop working, some on the local FS, others within HDFS, getting the right ownership and permissions, etc. I think I am mostly there but can't seem to get past my current issue with mapred.system.dir. Some general info first:
OS: RHEL6
Hadoop version: hadoop-1.0.3-1.x86_64
20 node cluster configured as follows:
1 node as primary namenode
1 node as secondary namenode + job tracker
18 nodes as datanode + tasktracker
I have HDFS up and running and have the following in mapred-site.xml:
<property>
  <name>mapred.system.dir</name>
  <value>hdfs://hadoop1/mapred</value>
  <description>Shared data for JT - this must be in HDFS</description>
</property>
I have created this directory in HDFS, owner mapred:hadoop, permissions 700, which seems to be the most common recommendation amongst multiple, often conflicting articles about how to set up Hadoop.
Here is the top level of my filesystem:
hyperion-hdp4@hdfs:hadoop fs -ls /
Found 3 items
drwx-- - mapred hadoop 0 2012-10-09 12:58 /mapred
drwxrwxrwx - hdfs hadoop 0 2012-10-09 13:00 /tmp
drwxr-xr-x - hdfs hadoop 0 2012-10-09 12:51 /user
Note, it doesn't seem to really matter what permissions I set on /mapred since when the Jobtracker starts up it changes them to 700. However, when I try to run the hadoop example teragen program as a regular user I am getting this error:
hyperion-hdp4@robing:hadoop jar /usr/share/hadoop/hadoop-examples*.jar teragen -D dfs.block.size=536870912 100 /user/robing/terasort-input
Generating 100 using 2 maps with step of 50
12/10/09 16:27:02 INFO mapred.JobClient: Running job: job_201210072045_0003
12/10/09 16:27:03 INFO mapred.JobClient: map 0% reduce 0%
12/10/09 16:27:03 INFO mapred.JobClient: Job complete: job_201210072045_0003
12/10/09 16:27:03 INFO mapred.JobClient: Counters: 0
12/10/09 16:27:03 INFO mapred.JobClient: Job Failed: Job initialization failed:
org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=robing, access=EXECUTE, inode=mapred:mapred:hadoop:rwx--
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
at
Re: Writing click stream data to hadoop
SequenceFile.Writer#syncFs is in Hadoop 1.0.0 (actually since 0.20.205), which calls the underlying FSDataOutputStream#sync which is actually hflush semantically (data not durable in case of data center wide power outage). hsync implementation is not yet in 2.0. HDFS-744 just brought hsync in trunk. __Luke On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote: Mohit, Not if you call sync (or hflush/hsync in 2.0) periodically to persist your changes to the file. SequenceFile doesn't currently have a sync-API inbuilt in it (in 1.0 at least), but you can call sync on the underlying output stream instead at the moment. This is possible to do in 1.0 (just own the output stream). Your use case also sounds like you may want to simply use Apache Flume (Incubating) [http://incubator.apache.org/flume/] that already does provide these features and the WAL-kinda reliability you seek. On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We get click data through API calls. I now need to send this data to our hadoop environment. I am wondering if I could open one sequence file and write to it until it's of certain size. Once it's over the specified size I can close that file and open a new one. Is this a good approach? Only thing I worry about is what happens if the server crashes before I am able to cleanly close the file. Would I lose all previous data? -- Harsh J
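A minimal sketch of the rolling-file-with-periodic-sync approach being discussed (assuming the 1.0-era API; the path, key/value types, and sync interval are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ClickLogWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/clickstream/current.seq"); // hypothetical location
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, LongWritable.class, Text.class);
    long written = 0;
    for (String event : new String[] {"click-1", "click-2", "click-3"}) { // stand-in for API calls
      writer.append(new LongWritable(System.currentTimeMillis()), new Text(event));
      if (++written % 1000 == 0) {
        writer.syncFs(); // hflush semantics per the note above: visible to readers, not power-loss durable
      }
    }
    writer.close(); // roll to a new file once this one reaches the target size
  }
}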
Re: JMX Monitoring of DataNode?
JMX is available via a web service in JSON format at node:port/jmx in 0.20.204+ and 0.23+.

On Mon, Oct 24, 2011 at 5:29 PM, Time Less timelessn...@gmail.com wrote: I set up JMX monitoring of the NameNode, and it worked fine. Tried to do the same for the DataNode, and it fails. The datanode listens on the port I specify, but VisualVM can't connect (I'm specifying no user/password/SSL). For troubleshooting purposes, I'm totally willing to try different tools than VisualVM, so if anyone knows some tool that works particularly well, I'd love to hear it. -- Tim Ellis Data Architect, Riot Games
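For a quick check without a JMX client, the /jmx servlet mentioned above can simply be read over HTTP; a small sketch (the hostname is hypothetical, and 50075 is the default datanode web port unless you've changed it):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class JmxJsonDump {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://datanode-host:50075/jmx"); // hypothetical host
    try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // JSON dump of the registered MBeans
      }
    }
  }
}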
Re: How do I add Hadoop dependency to a Maven project?
Pre-0.21 (sustaining releases, large-scale tested) hadoop:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.203.0</version>
</dependency>
Pre-0.23 (small scale tested) hadoop:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapred</artifactId>
  <version>...</version>
</dependency>
Trunk (currently targeting 0.23.0, large-scale tested) hadoop WILL be:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce</artifactId>
  <version>...</version>
</dependency>

On Fri, Aug 12, 2011 at 2:20 PM, W.P. McNeill bill...@gmail.com wrote: I'm building a Hadoop project using Maven. I want to add Maven dependencies to my project. What do I do? I think the answer is I add a <dependency></dependency> section to my .POM file, but I'm not sure what the contents of this section (groupId, artifactId etc.) should be. Googling does not turn up a clear answer. Is there a canonical Hadoop Maven dependency specification?
Re: How do I add Hadoop dependency to a Maven project?
There is a reason I capitalized WILL (SHALL) :) The current trunk mapreduce code is in flux. Once MR2 (MAPREDUCE-279) is merged into trunk (soon!), we'll be producing hadoop-mapreduce-0.23.0-SNAPSHOT, which depends on hadoop-hdfs-0.23.0-SNAPSHOT, which depends on hadoop-common-0.23.0-SNAPSHOT. If you just want to play with the new API, you can use the 0.22.0-SNAPSHOT artifacts. 0.23.0 is supposedly source compatible with previous hadoop versions including 0.20.x (for the legacy API).

On Fri, Aug 12, 2011 at 4:08 PM, W.P. McNeill bill...@gmail.com wrote: I want the latest version of Hadoop (with the new API). I guess that's the trunk version, but I don't see the hadoop-mapreduce artifact listed on https://repository.apache.org/index.html#nexus-search;quick~hadoop

On Fri, Aug 12, 2011 at 2:47 PM, Luke Lu l...@apache.org wrote:
Pre-0.21 (sustaining releases, large-scale tested) hadoop:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.203.0</version>
</dependency>
Pre-0.23 (small scale tested) hadoop:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapred</artifactId>
  <version>...</version>
</dependency>
Trunk (currently targeting 0.23.0, large-scale tested) hadoop WILL be:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce</artifactId>
  <version>...</version>
</dependency>

On Fri, Aug 12, 2011 at 2:20 PM, W.P. McNeill bill...@gmail.com wrote: I'm building a Hadoop project using Maven. I want to add Maven dependencies to my project. What do I do? I think the answer is I add a <dependency></dependency> section to my .POM file, but I'm not sure what the contents of this section (groupId, artifactId etc.) should be. Googling does not turn up a clear answer. Is there a canonical Hadoop Maven dependency specification?
Re: one question about hadoop
Hadoop embeds Jetty directly into the hadoop servers with the org.apache.hadoop.http.HttpServer class for servlets. For JSP, web.xml is auto-generated with the Jasper compiler during the build phase. The new web framework for mapreduce 2.0 (MAPREDUCE-2399) wraps the hadoop HttpServer and doesn't need web.xml and/or JSP support either.

On Thu, May 26, 2011 at 12:14 AM, 王晓峰 sanlang2...@gmail.com wrote: hi, admin: I'm a newcomer from China. I want to know how Jetty is integrated with Hadoop. I can't find the web.xml file that would normally exist in a system that embeds Jetty. I'll be very happy to receive your answer. If you have any question, please feel free to contact me. Best Regards, Jack
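As a rough illustration of the embedding described above, a servlet can be registered on the wrapped Jetty server along these lines (a sketch against the 1.x-era org.apache.hadoop.http.HttpServer; the service name, port, and servlet are made up, and the exact constructor signature varies between versions):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.http.HttpServer;

public class EmbeddedHttpExample {
  // hypothetical servlet, just to show registration without any web.xml
  public static class StatusServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
      resp.getWriter().println("ok");
    }
  }

  public static void main(String[] args) throws Exception {
    // name, bind address, port, findPort (try the next port if this one is taken)
    HttpServer server = new HttpServer("example", "0.0.0.0", 8042, true);
    server.addServlet("status", "/status", StatusServlet.class);
    server.start();
  }
}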
Re: Configuring jvm metrics in hadoop-0.20.203.0
On Fri, May 20, 2011 at 9:02 AM, Matyas Markovics markovics.mat...@gmail.com wrote: Hi, I am trying to get jvm metrics from the new version of hadoop. I have read the migration instructions and come up with the following content for hadoop-metrics2.properties:
*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
jvm.sink.file.period=2
jvm.sink.file.filename=/home/ec2-user/jvmmetrics.log

The (documented) syntax is [lowercased-service].sink.[sink-name].[option], so for the jobtracker it would be jobtracker.sink.file... This will get all metrics from all the contexts (unlike metrics1, where you're required to configure each context). If you want to restrict the sink to only jvm metrics, do this:
jobtracker.sink.jvmfile.class=${*.sink.file.class}
jobtracker.sink.jvmfile.context=jvm
jobtracker.sink.jvmfile.filename=/path/to/namenode-jvm-metrics.out

Any help would be appreciated even if you have a different approach to get memory usage from reducers.

reducetask.sink.file.filename=/path/to/reducetask-metrics.out

__Luke
Re: metrics2 ganglia monitoring
The Ganglia plugin is not yet ported to metrics v2 (because Y! doesn't use Ganglia; see also the discussion links on HADOOP-6728). It shouldn't be hard to do a port though, as the new sink interface is actually simpler.

On Wed, May 18, 2011 at 4:07 AM, Eric Berkowitz eberkow...@roosevelt.edu wrote: We have a 2-rack hadoop cluster with ganglia 3.0 monitoring on all stations, both on the native os and within hadoop. We want to upgrade to hadoop 0.20.203, but with the migration to metrics2 we need help configuring the metrics to continue ganglia monitoring. All tasktrackers/datanodes push unicast udp upstream to a central gmond daemon on their rack that is then polled by a single gmetad daemon for the cluster. The current metrics file includes entries similar to the following for all contexts:
# Configuration of the dfs context for ganglia
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=25
dfs.servers=xsrv1.cs.roosevelt.edu:8670
We have reviewed the package documentation for metrics2 but the examples and explanation are not helpful. Any assistance in the proper configuration of the hadoop-metrics2.properties file to support our current ganglia configuration would be appreciated. Eric
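For reference on the porting effort mentioned above, the metrics2 sink contract is roughly the following (a sketch against the 0.20.203-era org.apache.hadoop.metrics2 API; the class name and the body of putMetrics are hypothetical, and the actual Ganglia wire protocol is left out):

import org.apache.commons.configuration.SubsetConfiguration;
import org.apache.hadoop.metrics2.Metric;
import org.apache.hadoop.metrics2.MetricsRecord;
import org.apache.hadoop.metrics2.MetricsSink;

public class SimpleGangliaSink implements MetricsSink {
  private String servers;

  @Override
  public void init(SubsetConfiguration conf) {
    // picks up options such as *.sink.ganglia.servers=... from hadoop-metrics2.properties
    servers = conf.getString("servers");
  }

  @Override
  public void putMetrics(MetricsRecord record) {
    for (Metric metric : record.metrics()) {
      // a real port would encode these as gmetric packets and send them to each server
      System.out.println(record.name() + "." + metric.name() + "=" + metric.value());
    }
  }

  @Override
  public void flush() {
    // nothing buffered in this sketch
  }
}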
Re: Memory mapped resources
You can use the distributed cache for memory mapped files (they're local to the node the tasks run on): http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata

On Tue, Apr 12, 2011 at 10:40 AM, Benson Margulies bimargul...@gmail.com wrote: Here's the OP again. I want to make it clear that my question here has to do with the problem of distributing 'the program' around the cluster, not 'the data'. In the case at hand, the issue is a system that has a large data resource that it needs to do its work. Every instance of the code needs the entire model. Not just some blocks or pieces. Memory mapping is a very attractive tactic for this kind of data resource. The data is read-only. Memory-mapping it allows the operating system to ensure that only one copy of the thing ends up in physical memory. If we force the model into a conventional file (storable in HDFS) and read it into the JVM in a conventional way, then we get as many copies in memory as we have JVMs. On a big machine with a lot of cores, this begins to add up. For people who are running a cluster of relatively conventional systems, just putting copies on all the nodes in a conventional place is adequate.
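A minimal sketch of the distributed-cache route suggested above (Hadoop 1.x-era API; the HDFS path and class are made up): the driver registers the model file, and each task memory-maps the node-local copy so the OS keeps a single physical copy shared by all task JVMs on the node.

import java.io.RandomAccessFile;
import java.net.URI;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class MappedModel {
  // driver side: ship the read-only model once per node
  public static void addModel(Configuration conf) throws Exception {
    DistributedCache.addCacheFile(new URI("/models/model.bin"), conf); // hypothetical HDFS path
  }

  // task side (e.g. in the mapper's configure()/setup()): mmap the local copy
  public static MappedByteBuffer mapModel(Configuration conf) throws Exception {
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    FileChannel channel = new RandomAccessFile(localFiles[0].toString(), "r").getChannel();
    return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
  }
}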
Re: location for mapreduce next generation branch
MR-279 is the branch you're looking for. On Mon, Mar 28, 2011 at 3:59 PM, Stephen Boesch java...@gmail.com wrote: Looking under http://svn.apache.org/repos/asf/hadoop/mapreduce/branches/ it does not seem to be present. pointers to correct location appreciated.
Re: monitor the hadoop cluster
The job detail page from the jobtracker shows a lot of information about any given job: the start/finish times of each task and various counters (like time spent in various phases, the input/output bytes/records, etc.). For monitoring the aggregate performance of a cluster, the hadoop metrics system can send a lot of information to standard monitoring tools (ganglia etc.) to graph and monitor various aggregate metrics (running/waiting maps/reduces etc.).

__Luke

On Thu, Nov 11, 2010 at 12:23 PM, Da Zheng zhengda1...@gmail.com wrote: Hello,

On 11/11/2010 03:00 PM, David Rosenstrauch wrote:
On 11/11/2010 02:52 PM, Da Zheng wrote: Hello, I wrote a MapReduce program and ran it on a 3-node hadoop cluster, but its running time varies a lot, from 2 minutes to 3 minutes. I want to understand how time is used by the map phase and the reduce phase, and hope to find the place to improve the performance. Also the current input data is sorted, so I wrote a customized partitioner to reduce the data shuffling across the network. I need some means to help me observe the data movement. I know the hadoop community developed chukwa for monitoring, but it seems very immature right now. I wonder how people monitor hadoop clusters right now. Is there a good way to solve my problems listed above? Thanks, Da

Just my $0.02, but IMO you're working on some faulty assumptions here. Hadoop is explicitly *not* a real-time system, and so it's not reasonable for you to expect to have such fine-grained control over its processing speed. It's a distributed system, where many things can affect how long a job takes, such as: how many nodes are in the cluster, how many other jobs are running, the technical specs of each node, whether/how Hadoop implements speculative execution during your job, whether your job has any task failures/retries, whether you have any hardware failures during your job, etc. You can have control over performance on a Hadoop cluster, via things like adding nodes, tweaking some config parms, etc. But you're much more likely to be able to make performance improvements like cutting a job down from 3 hours to 2 hours, not from 3 minutes to 2 minutes. You're just not going to get that kind of fine-grained control with Hadoop. Nor should you be looking for it, IMO. If that's what you want, then Hadoop is probably the wrong tool for your job.

I don't really try to cut the time from 3 minutes to 2 minutes. I was asking whether I can have some tools to monitor the hadoop cluster and possibly find the spot for performance improvement. I'm very new to hadoop, and I hope to have a good view of how time is used by each mapper and reducer, so I'll have more confidence to run it on a much larger dataset. More importantly, I want to see how much data shuffling can be saved if I use the customized partitioner. Best, Da
Re: is there any way to set the niceness of map/reduce jobs
nice(1) only changes cpu scheduling priority. It doesn't really help if you have tasks (and their child processes) that use too much memory, which causes swapping, which is probably the real culprit behind servers freezing. Decreasing kernel swappiness probably helps. Another thing to try is ionice (on Linux, if you have a reasonably recent kernel with cfq as the io scheduler, the default for rhel5) if the freeze is caused by io contention (assuming no swapping). You can write a simple script to periodically renice(1) and ionice(1) these processes to see if they actually work for you.

On Tue, Nov 2, 2010 at 4:51 PM, Jinsong Hu jinsong...@hotmail.com wrote: Hi there: I have a cluster that is used for both hadoop mapreduce and hbase. What I found is that when I am running map/reduce jobs, the job can be very memory/cpu intensive, and cause hbase or data nodes to freeze. In hbase's case, the region server may shut itself down. In order to avoid this, I made a very conservative configuration of the maximum number of mappers and reducers. However, I wonder if hadoop allows me to start map/reduce with the command nice so that those jobs get lower priority than datanode/tasktracker/hbase regionserver. That way, if there is enough resource, the jobs can fully utilize them, but if not, those jobs will yield to other processes. Jimmy.
Re: load a serialized object in hadoop
On Wed, Oct 13, 2010 at 12:27 PM, Shi Yu sh...@uchicago.edu wrote: I haven't implemented anything in map/reduce yet for this issue. I just try to invoke the same java class using bin/hadoop command. The thing is a very simple program could be executed in Java, but not doable in bin/hadoop command.

If you are just trying to use bin/hadoop jar your.jar command, your code runs in a local client jvm and mapred.child.java.opts has no effect. You should run it with HADOOP_CLIENT_OPTS=-Xmx1000m bin/hadoop jar your.jar

I think if I couldn't get through the first stage, even I had a map/reduce program it would also fail. I am using Hadoop 0.19.2. Thanks. Best Regards, Shi

On 2010-10-13 14:15, Luke Lu wrote: Can you post your mapper/reducer implementation? or are you using hadoop streaming? for which mapred.child.java.opts doesn't apply to the jvm you care about. BTW, what's the hadoop version you're using?

On Wed, Oct 13, 2010 at 11:45 AM, Shi Yu sh...@uchicago.edu wrote: Here is my code. There is no Map/Reduce in it. I could run this code using java -Xmx1000m, however, when using bin/hadoop -D mapred.child.java.opts=-Xmx3000M it has heap space not enough error. I have tried other program in Hadoop with the same settings so the memory is available in my machines.

public static void main(String[] args) {
  try {
    String myFile = "xxx.dat";
    FileInputStream fin = new FileInputStream(myFile);
    ois = new ObjectInputStream(fin);
    margintagMap = ois.readObject();
    ois.close();
    fin.close();
  } catch (Exception e) {
    // ...
  }
}

On 2010-10-13 13:30, Luke Lu wrote:
On Wed, Oct 13, 2010 at 8:04 AM, Shi Yu sh...@uchicago.edu wrote: As a coming-up to the my own question, I think to invoke the JVM in Hadoop requires much more memory than an ordinary JVM.

That's simply not true. The default mapreduce task Xmx is 200M, which is much smaller than the standard jvm default 512M and most users don't need to increase it. Please post the code reading the object (in hdfs?) in your tasks.

I found that instead of serialization the object, maybe I could create a MapFile as an index to permit lookups by key in Hadoop. I have also compared the performance of MongoDB and Memcache. I will let you know the result after I try the MapFile approach. Shi

On 2010-10-12 21:59, M. C. Srivas wrote:
On Tue, Oct 12, 2010 at 4:50 AM, Shi Yu sh...@uchicago.edu wrote: Hi, I want to load a serialized HashMap object in hadoop. The file of stored object is 200M. I could read that object efficiently in JAVA by setting -Xmx as 1000M. However, in hadoop I could never load it into memory. The code is very simple (just read the ObjectInputStream) and there is yet no map/reduce implemented. I set the mapred.child.java.opts=-Xmx3000M, still get the java.lang.OutOfMemoryError: Java heap space Could anyone explain a little bit how memory is allocate to JVM in hadoop. Why hadoop takes up so much memory? If a program requires 1G memory on a single node, how much memory it requires (generally) in Hadoop?

The JVM reserves swap space in advance, at the time of launching the process. If your swap is too low (or do not have any swap configured), you will hit this. Or, you are on a 32-bit machine, in which case 3G is not possible in the JVM. -Srivas.

Thanks. Shi -- -- Postdoctoral Scholar Institute for Genomics and Systems Biology Department of Medicine, the University of Chicago Knapp Center for Biomedical Discovery 900 E. 57th St.
Room 10148 Chicago, IL 60637, US Tel: 773-702-6799 -- Postdoctoral Scholar Institute for Genomics and Systems Biology Department of Medicine, the University of Chicago Knapp Center for Biomedical Discovery 900 E. 57th St. Room 10148 Chicago, IL 60637, US Tel: 773-702-6799
Re: load a serialized object in hadoop
On Wed, Oct 13, 2010 at 2:21 PM, Shi Yu sh...@uchicago.edu wrote: Hi, thanks for the advice. I tried with your settings, $ bin/hadoop jar Test.jar OOloadtest -D HADOOP_CLIENT_OPTS=-Xmx4000m still no effect. Or is this a system variable? Should I export it? How to configure it?

HADOOP_CLIENT_OPTS is an environment variable, so you should run it as HADOOP_CLIENT_OPTS=-Xmx1000m bin/hadoop jar Test.jar OOloadtest if you use sh derivative shells (bash, ksh etc.); prepend env for other shells.

__Luke

Shi

java -Xms3G -Xmx3G -classpath .:WordCount.jar:hadoop-0.19.2-core.jar:lib/log4j-1.2.15.jar:lib/commons-collections-3.2.1.jar:lib/stanford-postagger-2010-05-26.jar OOloadtest

On 2010-10-13 15:28, Luke Lu wrote:
On Wed, Oct 13, 2010 at 12:27 PM, Shi Yu sh...@uchicago.edu wrote: I haven't implemented anything in map/reduce yet for this issue. I just try to invoke the same java class using bin/hadoop command. The thing is a very simple program could be executed in Java, but not doable in bin/hadoop command.

If you are just trying to use bin/hadoop jar your.jar command, your code runs in a local client jvm and mapred.child.java.opts has no effect. You should run it with HADOOP_CLIENT_OPTS=-Xmx1000m bin/hadoop jar your.jar

I think if I couldn't get through the first stage, even I had a map/reduce program it would also fail. I am using Hadoop 0.19.2. Thanks. Best Regards, Shi

On 2010-10-13 14:15, Luke Lu wrote: Can you post your mapper/reducer implementation? or are you using hadoop streaming? for which mapred.child.java.opts doesn't apply to the jvm you care about. BTW, what's the hadoop version you're using?

On Wed, Oct 13, 2010 at 11:45 AM, Shi Yu sh...@uchicago.edu wrote: Here is my code. There is no Map/Reduce in it. I could run this code using java -Xmx1000m, however, when using bin/hadoop -D mapred.child.java.opts=-Xmx3000M it has heap space not enough error. I have tried other program in Hadoop with the same settings so the memory is available in my machines.

public static void main(String[] args) {
  try {
    String myFile = "xxx.dat";
    FileInputStream fin = new FileInputStream(myFile);
    ois = new ObjectInputStream(fin);
    margintagMap = ois.readObject();
    ois.close();
    fin.close();
  } catch (Exception e) {
    // ...
  }
}

On 2010-10-13 13:30, Luke Lu wrote:
On Wed, Oct 13, 2010 at 8:04 AM, Shi Yu sh...@uchicago.edu wrote: As a coming-up to the my own question, I think to invoke the JVM in Hadoop requires much more memory than an ordinary JVM.

That's simply not true. The default mapreduce task Xmx is 200M, which is much smaller than the standard jvm default 512M and most users don't need to increase it. Please post the code reading the object (in hdfs?) in your tasks.

I found that instead of serialization the object, maybe I could create a MapFile as an index to permit lookups by key in Hadoop. I have also compared the performance of MongoDB and Memcache. I will let you know the result after I try the MapFile approach. Shi

On 2010-10-12 21:59, M. C. Srivas wrote:
On Tue, Oct 12, 2010 at 4:50 AM, Shi Yu sh...@uchicago.edu wrote: Hi, I want to load a serialized HashMap object in hadoop. The file of stored object is 200M. I could read that object efficiently in JAVA by setting -Xmx as 1000M. However, in hadoop I could never load it into memory. The code is very simple (just read the ObjectInputStream) and there is yet no map/reduce implemented. I set the mapred.child.java.opts=-Xmx3000M, still get the java.lang.OutOfMemoryError: Java heap space Could anyone explain a little bit how memory is allocate to JVM in hadoop.
Why hadoop takes up so much memory? If a program requires 1G memory on a single node, how much memory it requires (generally) in Hadoop? The JVM reserves swap space in advance, at the time of launching the process. If your swap is too low (or do not have any swap configured), you will hit this. Or, you are on a 32-bit machine, in which case 3G is not possible in the JVM. -Srivas. Thanks. Shi -- -- Postdoctoral Scholar Institute for Genomics and Systems Biology Department of Medicine, the University of Chicago Knapp Center for Biomedical Discovery 900 E. 57th St. Room 10148 Chicago, IL 60637, US Tel: 773-702-6799 -- Postdoctoral Scholar Institute for Genomics and Systems Biology Department of Medicine, the University of Chicago Knapp Center for Biomedical Discovery 900 E. 57th St. Room 10148 Chicago, IL 60637, US Tel: 773-702-6799
Re: load a serialized object in hadoop
Just took a look at the bin/hadoop of your particular version (http://svn.apache.org/viewvc/hadoop/common/tags/release-0.19.2/bin/hadoop?revision=796970&view=markup). It looks like HADOOP_CLIENT_OPTS doesn't work with the jar command, which is fixed in a later version. So try HADOOP_OPTS=-Xmx1000M bin/hadoop ... instead. It would work because it just translates to the same java command line that worked for you :)

__Luke

On Wed, Oct 13, 2010 at 4:18 PM, Shi Yu sh...@uchicago.edu wrote: Hi, I tried the following five ways:

Approach 1: in the command line
HADOOP_CLIENT_OPTS=-Xmx4000m bin/hadoop jar WordCount.jar OOloadtest

Approach 2: I added to the hadoop-site.xml file the following element. Each time I changed it, I stopped and restarted hadoop on all the nodes.
...
<property>
<name>HADOOP_CLIENT_OPTS</name>
<value>-Xmx4000m</value>
</property>
Then run the command: $ bin/hadoop jar WordCount.jar OOloadtest

Approach 3: I changed it like this
...
<property>
<name>HADOOP_CLIENT_OPTS</name>
<value>4000m</value>
</property>
Then run the command: $ bin/hadoop jar WordCount.jar OOloadtest

Approach 4: To make sure, I changed the m to numbers, that was
...
<property>
<name>HADOOP_CLIENT_OPTS</name>
<value>40</value>
</property>
Then run the command: $ bin/hadoop jar WordCount.jar OOloadtest

All these four approaches come to the same Java heap space error:
java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
at java.lang.StringBuilder.<init>(StringBuilder.java:68)
at java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java:2997)
at java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:2818)
at java.io.ObjectInputStream.readString(ObjectInputStream.java:1599)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1320)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
at java.util.HashMap.readObject(HashMap.java:1028)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1846)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
at ObjectManager.loadObject(ObjectManager.java:42)
at OOloadtest.main(OOloadtest.java:21)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

Approach 5: In comparison, I called the Java command directly as follows (there is a counter showing how much time it costs if the serialized object is successfully loaded):
$ java -Xms3G -Xmx3G -classpath .:WordCount.jar:hadoop-0.19.2-core.jar:lib/log4j-1.2.15.jar OOloadtest
return: object loaded, timing (hms): 0
hour(s) 1 minute(s) 12 second(s) 162millisecond(s) What was the problem in my command? Where can I find the documentation about HADOOP_CLIENT_OPTS? Have you tried the same thing and found it works? Shi On 2010-10-13 16:28, Luke Lu wrote: On Wed, Oct 13, 2010 at 2:21 PM, Shi Yush...@uchicago.edu wrote: Hi, thanks for the advice. I tried with your settings, $ bin/hadoop jar Test.jar OOloadtest -D HADOOP_CLIENT_OPTS=-Xmx4000m still no effect. Or this is a system variable? Should I export it? How to configure it? HADOOP_CLIENT_OPTS is an environment variable so you should run it as HADOOP_CLIENT_OPTS=-Xmx1000m bin/hadoop jar Test.jar OOloadtest if you use sh derivative shells (bash, ksh etc.) prepend env for other shells. __Luke Shi java -Xms3G -Xmx3G -classpath .:WordCount.jar:hadoop-0.19.2-core.jar:lib/log4j-1.2.15.jar:lib/commons-collections-3.2.1.jar:lib/stanford-postagger-2010-05-26.jar OOloadtest On 2010-10-13 15:28, Luke Lu wrote: On Wed, Oct 13, 2010 at 12:27 PM, Shi Yush...@uchicago.edu wrote: I haven't implemented anything in map/reduce yet for this issue. I