Re: hadoop1.2.1 speedup model
How many times did you run the experiment at each setting? What is the standard deviation for each of these settings? It could be that you are simply running into the error bounds of Hadoop. Hadoop is far from consistent in its performance. For our benchmarking we typically run the test 5 times, throw out the top and bottom results as possible outliers, and then average the other runs. Even with that we have to be very careful to weed out bad nodes, or the numbers are useless for comparison.

The other thing to look at is where all of the time was spent for each of these settings. The map portion should be very close to linear with the number of tasks, assuming that there is no disk or network contention. The shuffle is far from linear, as the number of fetches is a function of the number of maps and the number of reducers. The reduce phase itself should be close to linear, assuming that there isn't much skew in your data.

--Bobby

On 9/7/13 3:33 AM, 牛兆捷 nzjem...@gmail.com wrote:

But I still want to find the most efficient assignment and scale both data and nodes as you said. For example, in my result, 2 is the best, and 8 is better than 4. Why is it sub-linear from 2 to 4, but super-linear from 4 to 8? I find it hard to model this result. Can you give me some hint about this kind of trend?

2013/9/7 Vinod Kumar Vavilapalli vino...@hortonworks.com

Clearly your input size isn't changing. And depending on how they are distributed on the nodes, there could be Datanode/disk contention. The better way to model this is by scaling the input data linearly as well. More nodes should process more data in the same amount of time.

Thanks,
+Vinod

On Sep 6, 2013, at 8:27 AM, 牛兆捷 wrote:

Hi all:

I vary the computational nodes of the cluster and get the speedup result in the attachment. In my mind, there are three types of speedup model: linear, sub-linear and super-linear. However, the curve of my result seems a little strange. I have attached it.
speedup.png

This is sort in example.jar; actually it is done using only the default map-reduce mechanism of Hadoop. I use hadoop-1.2.1, with 8 map slots and 8 reduce slots per node (12 CPUs, 20 GB memory), io.sort.mb = 512, block size = 512 MB, heap size = 1024 MB, reduce.slowstart = 0.05; the others are default.

Input data: 20 GB, divided into 64 files
Sort example: 64 map tasks, 64 reduce tasks
Computational nodes: varying from 2 to 9

Why is the speedup like this? How can I model it properly?

Thanks~

--
Sincerely,
Zhaojie
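Bobby's benchmarking procedure ("run 5 times, throw out the top and bottom result, average the rest") and his point about shuffle fetches growing with maps × reducers can be sketched numerically. This is an illustrative sketch only; the class and method names are made up and are not part of Hadoop.

```java
import java.util.Arrays;

public class BenchmarkStats {
    // "Run the test 5 times, throw out the top and bottom result as
    // possible outliers, and average the other runs." Assumes >= 3 runs.
    static double trimmedMean(double[] runs) {
        double[] sorted = runs.clone();
        Arrays.sort(sorted);
        double sum = 0;
        for (int i = 1; i < sorted.length - 1; i++) sum += sorted[i];
        return sum / (sorted.length - 2);
    }

    // The number of shuffle fetches is a function of both the number of
    // maps and the number of reducers, which is why the shuffle phase is
    // far from linear as the cluster grows.
    static long shuffleFetches(int maps, int reducers) {
        return (long) maps * reducers;
    }

    public static void main(String[] args) {
        // Five hypothetical job runtimes in seconds; 300 is an outlier.
        System.out.println(trimmedMean(new double[]{110, 95, 100, 102, 300})); // 104.0
        // The sort job in this thread: 64 maps, 64 reducers.
        System.out.println(shuffleFetches(64, 64)); // 4096
    }
}
```

With the outlier trimmed, the mean reflects typical performance rather than one bad node or one lucky run.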
Re: [VOTE] Release Apache Hadoop 0.23.9
+1. Downloaded the release, ran a couple of simple jobs, and everything worked.

On 7/1/13 12:20 PM, Thomas Graves tgra...@yahoo-inc.com wrote:

I've created a release candidate (RC0) for hadoop-0.23.9 that I would like to release.

The RC is available at: http://people.apache.org/~tgraves/hadoop-0.23.9-candidate-0/
The RC tag in svn is here: http://svn.apache.org/viewvc/hadoop/common/tags/release-0.23.9-rc0/
The maven artifacts are available via repository.apache.org.

Please try the release and vote; the vote will run for the usual 7 days, until July 8th. I am +1 (binding).

thanks,
Tom Graves
Re: InputFormat to regroup splits of underlying InputFormat to control number of map tasks
This sounds similar to MultiFileInputFormat: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MultiFileInputFormat.java?revision=1239482&view=markup

It would be nice if you could take a look at it and see if there is something we can do here to improve it or combine the two.

--Bobby

On 6/19/13 2:53 AM, Nicolae Marasoiu nmara...@adobe.com wrote:

Hi,

When running map-reduce with many splits, it would be nice from a performance perspective to have fewer splits while maintaining data locality, so that the overhead of running a map task (JVM creation, map executor ramp-up, e.g. Spring context, etc.) is less impactful when frequently running map-reduces with low data processing.

I created such an AggregatingInputFormat that simply groups input splits into composite ones with the same location and creates a record reader that iterates over the record readers created by the underlying InputFormat for the underlying raw splits.

Currently we intend to use it for HBase sharding, but I would like to also implement an optimal algorithm to ensure both fair distribution and locality, which I can describe if you find it useful to apply in multi-location setups such as replicated Kafka or HDFS.

Thanks, waiting for your feedback,
Nicu Marasoiu
Adobe
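The core of the grouping idea Nicolae describes, composite splits that share a location, can be sketched outside the Hadoop API roughly as follows. This is an illustrative sketch only: real InputSplits report several replica hosts each, whereas this simplification assumes one preferred host per split, and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SplitGrouper {
    // Group raw split ids by their (assumed single) preferred host, so that
    // each host's splits can become one composite split: locality is kept,
    // but the number of map tasks (and JVM start-up costs) drops.
    static Map<String, List<String>> groupByHost(Map<String, String> splitToHost) {
        Map<String, List<String>> grouped = new HashMap<>();
        for (Map.Entry<String, String> e : splitToHost.entrySet()) {
            grouped.computeIfAbsent(e.getValue(), h -> new ArrayList<>())
                   .add(e.getKey());
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, String> splits = new HashMap<>();
        splits.put("split-0", "nodeA");
        splits.put("split-1", "nodeB");
        splits.put("split-2", "nodeA");
        // Two composite splits (nodeA, nodeB) instead of three map tasks.
        System.out.println(groupByHost(splits));
    }
}
```

A real implementation would also need a record reader that chains the underlying readers for each raw split, as the message describes, and a fallback for splits whose hosts are already fully loaded.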
Re: mapred.child.ulimit in MR2
Sandy,

I think it was something that was missed in the port to YARN, and the dead code was cleaned up as part of HADOOP-8288. If you have a use case for it or are worried about backwards compatibility, we can add it back in. It is not that hard: all it did was add 'ulimit -v number' to the shell script that launched the task, except on Windows.

--Bobby

On 6/18/13 3:56 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

Hi yarn-dev/mapreduce-dev,

Is there a reason that mapred.child.ulimit no longer has an effect in MR2? Should it be added back in?

thanks for any help,
-Sandy
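What Bobby describes, prepending a ulimit call to the task launch command, can be sketched as a small helper. This is illustrative only, not the actual MR1 code; the class and method names are hypothetical.

```java
public class UlimitWrapper {
    // Sketch of MR1's behavior: when mapred.child.ulimit was set (a value
    // in KB), the shell command that launched the task JVM was prefixed
    // with "ulimit -v <value>;" to cap the child's virtual memory.
    static String wrapCommand(String taskCommand, long ulimitKb) {
        if (ulimitKb <= 0) {
            return taskCommand;  // feature disabled / unset
        }
        return "ulimit -v " + ulimitKb + "; " + taskCommand;
    }

    public static void main(String[] args) {
        System.out.println(wrapCommand("java -Xmx512m Child", 1048576));
        // ulimit -v 1048576; java -Xmx512m Child
    }
}
```

On Windows there is no ulimit, which is why the original code skipped it there; a YARN-era equivalent would more likely rely on the container's resource enforcement instead.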
Re: Visual debugging tools for hadoop
Yes, data flow visualizations definitely sound like something that would be good for Ambari. If you are interested in debugging Hadoop jobs there is also the Hadoop Development Tools project: http://incubator.apache.org/projects/hdt.html

It is taking the Eclipse plugin for Hadoop and really improving it. I know that there has been some work to try to get a debugger working over there where you could walk through parts of your MR job line by line.

--Bobby

On 6/14/13 12:40 PM, Chris Nauroth cnaur...@hortonworks.com wrote:

Hi Saikat,

You might want to investigate contributing to Apache Ambari, which has features for visualization of jobs and end-to-end flows consisting of multiple dependent jobs: http://incubator.apache.org/ambari/

Chris Nauroth
Hortonworks
http://hortonworks.com/

On Fri, Jun 14, 2013 at 8:20 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:

Hi Folks,

I was wondering if anyone is currently working on or thinking about visual debugging tools for mapreduce jobs. I was thinking about starting an effort to build an end-to-end visual tool that shows all the steps in the mapreduce workflow and data flows, and variable content changing, to speed up debugging of jobs. Please ignore this if something like it already exists; if not, I'd love to collaborate with folks to build something.

Regards
Re: [VOTE] Release Apache Hadoop 0.23.8
+1. Downloaded the release and ran a few basic tests.

--Bobby

On 5/28/13 11:00 AM, Thomas Graves tgra...@yahoo-inc.com wrote:

I've created a release candidate (RC0) for hadoop-0.23.8 that I would like to release. This release is a sustaining release with several important bug fixes in it. The most critical one is MAPREDUCE-5211.

The RC is available at: http://people.apache.org/~tgraves/hadoop-0.23.8-candidate-0/
The RC tag in svn is here: http://svn.apache.org/viewvc/hadoop/common/tags/release-0.23.8-rc0/
The maven artifacts are available via repository.apache.org.

Please try the release and vote; the vote will run for the usual 7 days. I am +1 (binding).

thanks,
Tom Graves
Re: [VOTE] Plan to create release candidate for 0.23.8
+1

On 5/17/13 4:10 PM, Thomas Graves tgra...@yahoo-inc.com wrote:

Hello all,

We've had a few critical issues come up in 0.23.7 that I think warrant a 0.23.8 release. The main one is MAPREDUCE-5211. There are a couple of other issues that I want finished up and in before we spin it. Those include HDFS-3875, HDFS-4805, and HDFS-4835. I think those are on track to finish up early next week, so I hope to spin 0.23.8 soon after this vote completes.

Please vote '+1' to approve this plan. Voting will close on Friday May 24th at 2:00pm PDT.

Thanks,
Tom Graves
Re: Heads up - 2.0.5-beta
I agree that destructive is not the correct word to describe features like snapshots and Windows support. However, I also agree with Konstantin that any large feature will have a destabilizing effect on the code base, even if it is done on a branch and thoroughly tested before being merged in. HDFS HA, from what I have seen and heard, is rock solid, but it took a while to get there even after it was merged into branch-2. And we all know how long YARN and MRv2 have taken to stabilize.

I also agree that no one individual is able to police all of Hadoop. We have to rely on the committers to make sure that what is placed in a branch is appropriate for that branch in preparation for a release. As a community we need to decide what the goals of a branch are, so that I as a committer can know what is and is not appropriate to be placed in that branch. This is the reason why we are discussing API and binary compatibility. This is the reason why I support having a vote for a release plan.

The question for the community comes down to this: do we want to release quickly and often off of trunk, trying hard to maintain compatibility between releases, or do we want to follow what we have done up to now, where a single branch goes into stabilization, trunk gets anything that is not compatible with that branch, and it takes a huge effort to switch momentum from one branch to another? Up to this point we have almost successfully done this switch once, from 1.0 to 2.0. I have a hard time believing that we are going to do this again for another 5 years. There is nothing preventing the community from letting each organization decide what they want to do, so that we end up with both. But this results in fragmentation of the community, and makes it difficult for those trying to stabilize a release because there is no critical mass of individuals using and testing that branch.
It also results in the scrambling we are seeing now to try and revert the incompatibilities between 1.0 and 2.0 that were introduced in the years between these releases. If we are going to do the same and make 3.0 compatible with 2.0 when the switch comes, why do we even allow any incompatible changes in at all? It just feels like trunk is a place to put tech debt that we are going to try and revert later.

I personally like the Linux and BSD models, where there is a new-feature merge window in which any new features can come in, and then the entire community works together to stabilize the release before going on to the next merge window. If the release does not stabilize quickly, the next merge window gets pushed back. I realize this is very different from the current model and is not likely to receive a lot of support, but it has worked for them for a long time, and they have code bases just as large as Hadoop's and even larger and more diverse communities.

I am +1 for Konstantin's release plan and will vote as such on that thread.

--Bobby

On 5/3/13 3:06 AM, Konstantin Shvachko shv.had...@gmail.com wrote:

Hi Arun and Suresh,

I am glad my choice of words attracted your attention. I consider this important for the project, otherwise I wouldn't waste everybody's time. You tend to react to the latest message taken out of context, which does not reveal the full picture. I'll try here to summarize my proposal and motivation expressed earlier in these two threads:
http://s.apache.org/fs
http://s.apache.org/Streamlining

I am advocating:
1. to make 2.0.5 a release that will
   a) make any necessary changes so that Hadoop APIs can be fixed after that
   b) fix bugs: internal ones and those important for stabilizing downstream projects
2. Release 2.1.0 stable, i.e. both with stable APIs and a stable code base.
3. Produce a series of feature releases, potentially catching up with the state of trunk.
4. Release from trunk afterwards.
The main motivation to minimize changes in 2.0.5 is to let Hadoop users and the downstream projects, that is, the Hadoop community, start adapting to the new APIs ASAP. This will provide certainty that people can build their products on top of the 2.0.5 APIs with minimal risk that the next release will break them. Thus Bobby in http://goo.gl/jm5am is saying that the meaning of beta for him is locked-down APIs for wire and binary compatibility. For Hadoop, Yahoo using 2.x is an opportunity to have it tested at very large scale, which in turn will bring other users on board.

I agree with Arun that we are not disagreeing on much, just on the order of execution: what goes first, stability or features. I am not challenging any features, the implementations, or the developers. But putting all the changes together is destructive for the stability of the release. Adding a 500 KB patch invalidates prior testing solely because it is a big change that needs testing not only by itself but with upstream applications. With 2.0.3 and 2.0.4 tested thoroughly and widely in many organizations and several distributions, it seems like a perfect base for the stable release. We could be just
Re: JVM vs container memory configs
For us we typically leave a 500 MB difference between the heap and the container size. I think we can make this smaller, but we have not really tried.

--Bobby

On 5/3/13 11:20 AM, Karthik Kambatla ka...@cloudera.com wrote:

Hi,

While looking into MAPREDUCE-5207 (adding defaults for mapreduce.{map|reduce}.memory.mb), I was wondering how much headroom should be left on top of mapred.child.java.opts (or other similar JVM opts) for the container memory itself.

Currently, mapred.child.java.opts (per mapred-default.xml) is set to 200 MB by default. The default for mapreduce.{map|reduce}.memory.mb is 1024 in the code, which is significantly higher than the 200 MB value. Do we need more than 100 MB for non-JVM memory per container? If so, does it make sense to make that a config property in itself, and is the code to verify all 3 values clear enough?

Thanks
Karthik
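As a hedged illustration of the rule of thumb above, the container size handed to YARN is the JVM heap plus headroom for thread stacks, direct buffers, and other non-heap memory. The values below are examples only, not recommendations; the property names are the MR2 ones from mapred-default.xml.

```xml
<!-- Illustrative example: container size = JVM heap + ~500 MB headroom. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1536</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```

Here the heap is roughly 500 MB below the container size, matching the difference Bobby describes; if the JVM's total footprint exceeds the container size, the NodeManager may kill the task.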
Re: Versions - Confusion
It is kind of complex. Up until 0.20 everything was fairly regular, like you would expect. In 0.20 there was a split where security was added in on a branch and started to be numbered as 0.20.20X. But the other releases went on without security and became 0.21 and 0.22. 0.23 was created when YARN was introduced, and it also had security merged in. To be fair, 0.22 had security in it, but it was never officially supported in a release.

At about this same time the community decided that we needed to do something better with numbering and renamed 0.20.20X to be 1.0, and started releasing more versions from this line. This is the current stable line. 0.23 was renamed 2.0 and there have been a few releases, but the code is still being stabilized. To make things even more confusing, some people kept 0.23 alive and stabilized it, so there have been some releases of 0.23 in parallel with 2.0. The difference between the two is that 2.0 has HDFS HA in it whereas 0.23 does not.

--Bobby Evans

On 4/26/13 12:39 AM, Suresh S suresh...@gmail.com wrote:

Hello,

I was confused by Hadoop versioning. I found that some people are working on versions starting with 0, while others are working on versions starting with 2. I was also confused about branches. Which version is really the current version?

Regards,
S. Suresh,
Research Scholar,
Department of Computer Applications,
National Institute of Technology,
Tiruchirappalli - 620015.
+91-9941506562
Re: [VOTE] Release Apache Hadoop 2.0.4-alpha
+1 (binding). Downloaded the tar ball and ran some simple jobs.

--Bobby Evans

On 4/17/13 2:01 PM, Siddharth Seth seth.siddha...@gmail.com wrote:

+1 (binding). Verified checksums and signatures. Built from the source tar, deployed a single-node cluster and tested a couple of simple MR jobs.

- Sid

On Fri, Apr 12, 2013 at 2:56 PM, Arun C Murthy a...@hortonworks.com wrote:

Folks,

I've created a release candidate (RC2) for hadoop-2.0.4-alpha that I would like to release.

The RC is available at: http://people.apache.org/~acmurthy/hadoop-2.0.4-alpha-rc2/
The RC tag in svn is here: http://svn.apache.org/repos/asf/hadoop/common/tags/release-2.0.4-alpha-rc2
The maven artifacts are available via repository.apache.org.

Please try the release and vote; the vote will run for the usual 7 days.

thanks,
Arun

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
Re: Help on submitting a patch for an unassigned bug
Also be aware that sometimes committers don't notice that a patch is not in Patch Available state, so if you need a review and no one has started reviewing it, please send an e-mail to the dev list and we will do our best to take a look at it.

--Bobby

On 3/26/13 5:28 AM, Harsh J ha...@cloudera.com wrote:

Hi Niranjan,

You can work on these by submitting a patch directly to the ticket. A committer/reviewer will assign the issue to you on your first contribution to each project, and from then on you can assign them to yourself as you work on more.

On Tue, Mar 26, 2013 at 9:26 AM, maisnam ns maisnam...@gmail.com wrote:

Hello,

I recently started looking into the bugs in issues.apache.org/jira related to HADOOP/MAPREDUCE/HDFS which have not been assigned to anyone and have the status unresolved. My intention is to fix those bugs. Can you please let me know if I can assign those bugs to myself and submit a patch with a unit test case, or do I have to send a mail to a committer asking for approval before assigning the bug to myself? Could somebody help me out?

If somebody can elaborate a little on the process it would be helpful. I have read 'How to Contribute to Hadoop' on the wiki but couldn't find anything related to assigning a bug, or maybe I missed it somewhere.

Thanks in advance.

Regards,
Niranjan

--
Harsh J
Re: [Vote] Merge branch-trunk-win to trunk
> add Windows support for new features that are platform specific

Is it assumed that Windows development will either lag, or will people actively work on keeping Windows up with the latest? And vice versa in case Windows support is implemented first.

Is there a jira for resolving the outstanding TODOs in the code base (similar to HDFS-2148)? Looks like this merge doesn't introduce many, which is great (just did a quick diff and grep).

Thanks,
Eli

On Wed, Feb 27, 2013 at 8:17 AM, Robert Evans ev...@yahoo-inc.com wrote:

After this is merged in, is Windows still going to be a second-class citizen that happens to work for more than just development, or is it a fully supported platform where if something breaks it can block a release? How do we as a community intend to keep Windows support from breaking? We don't have any Jenkins slaves to be able to run nightly tests to validate that everything still compiles and runs. This is not a blocker for me, because we often rely on individuals and groups to test Hadoop, but I do think we need to have this discussion before we put it in.

--Bobby

On 2/26/13 4:55 PM, Suresh Srinivas sur...@hortonworks.com wrote:

I had posted a heads-up about merging branch-trunk-win to trunk on Feb 8th. I am happy to announce that we are ready for the merge.

Here is a brief recap of the highlights of the work done:
- Command-line scripts for the Hadoop surface area
- Mapping the HDFS permissions model to Windows
- Abstracted and reconciled mismatches around differences in Path semantics in Java and Windows
- Native Task Controller for Windows
- Implementation of a Block Placement Policy to support cloud environments, more specifically Azure
- Implementation of Hadoop native libraries for Windows (compression codecs, native I/O)
- Several reliability issues, including race conditions, intermittent test failures, resource leaks
- Several new unit test cases written for the above changes

Please find the details of the work in CHANGES.branch-trunk-win.txt - Common changes http://bit.ly/Xe7Ynv, HDFS changes http://bit.ly/13QOSo9, and YARN and MapReduce changes http://bit.ly/128zzMt. This is the work ported from branch-1-win to a branch based on trunk. For details of the testing done, please see the thread http://bit.ly/WpavJ4. The merge patch for this is available on HADOOP-8562: https://issues.apache.org/jira/browse/HADOOP-8562

This was a large undertaking that involved developing code and testing the entire Hadoop stack, including scale tests. This was made possible only with contributions from many, many folks in the community. The following people contributed to this work: Ivan Mitic, Chuan Liu, Ramya Sunil, Bikas Saha, Kanna Karanam, John Gordon, Brandon Li, Chris Nauroth, David Lao, Sumadhur Reddy Bolli, Arpit Agarwal, Ahmed El Baz, Mike Liddell, Jing Zhao, Thejas Nair, Steve Maine, Ganeshan Iyer, Raja Aluri, Giridharan Kesavan, Ramya Bharathi Nimmagadda, Daryn Sharp, Arun Murthy, Tsz-Wo Nicholas Sze, Suresh Srinivas and Sanjay Radia. There are many others who contributed as well, providing feedback and comments on numerous jiras.

The vote will run for seven days and will end on March 5, 6:00PM PST.

Regards,
Suresh

On Thu, Feb 7, 2013 at 6:41 PM, Mahadevan Venkatraman mah...@microsoft.com wrote:

It is super exciting to look at the prospect of these changes being merged to trunk. Having Windows as one of the supported Hadoop platforms is a fantastic opportunity both for the Hadoop project and Microsoft customers.

This work began around a year back when a few of us started with a basic port of Hadoop on Windows.
Ever since, the Hadoop team at Microsoft has made significant progress in the following areas (PS: some of these items are already included in Suresh's email, but including them again for completeness):

- Command-line scripts for the Hadoop surface area
- Mapping the HDFS permissions model to Windows
- Abstracted and reconciled mismatches around differences in Path semantics in Java and Windows
- Native Task Controller for Windows
- Implementation of a Block Placement Policy to support cloud environments, more specifically Azure
- Implementation of Hadoop native libraries for Windows (compression codecs, native I/O)
- Several reliability issues, including race conditions, intermittent test failures, resource leaks
- Several new unit test cases written for the above changes

In the process, we have closely engaged with the Apache open source community and have received great support and assistance from the community in terms of contributing fixes, code review comments
Re: [Vote] Merge branch-trunk-win to trunk
After this is merged in, is Windows still going to be a second-class citizen that happens to work for more than just development, or is it a fully supported platform where if something breaks it can block a release? How do we as a community intend to keep Windows support from breaking? We don't have any Jenkins slaves to be able to run nightly tests to validate that everything still compiles and runs. This is not a blocker for me, because we often rely on individuals and groups to test Hadoop, but I do think we need to have this discussion before we put it in.

--Bobby

On 2/26/13 4:55 PM, Suresh Srinivas sur...@hortonworks.com wrote:

I had posted a heads-up about merging branch-trunk-win to trunk on Feb 8th. I am happy to announce that we are ready for the merge.

Here is a brief recap of the highlights of the work done:
- Command-line scripts for the Hadoop surface area
- Mapping the HDFS permissions model to Windows
- Abstracted and reconciled mismatches around differences in Path semantics in Java and Windows
- Native Task Controller for Windows
- Implementation of a Block Placement Policy to support cloud environments, more specifically Azure
- Implementation of Hadoop native libraries for Windows (compression codecs, native I/O)
- Several reliability issues, including race conditions, intermittent test failures, resource leaks
- Several new unit test cases written for the above changes

Please find the details of the work in CHANGES.branch-trunk-win.txt - Common changes http://bit.ly/Xe7Ynv, HDFS changes http://bit.ly/13QOSo9, and YARN and MapReduce changes http://bit.ly/128zzMt. This is the work ported from branch-1-win to a branch based on trunk. For details of the testing done, please see the thread http://bit.ly/WpavJ4. The merge patch for this is available on HADOOP-8562: https://issues.apache.org/jira/browse/HADOOP-8562

This was a large undertaking that involved developing code and testing the entire Hadoop stack, including scale tests.
This was made possible only with contributions from many, many folks in the community. The following people contributed to this work: Ivan Mitic, Chuan Liu, Ramya Sunil, Bikas Saha, Kanna Karanam, John Gordon, Brandon Li, Chris Nauroth, David Lao, Sumadhur Reddy Bolli, Arpit Agarwal, Ahmed El Baz, Mike Liddell, Jing Zhao, Thejas Nair, Steve Maine, Ganeshan Iyer, Raja Aluri, Giridharan Kesavan, Ramya Bharathi Nimmagadda, Daryn Sharp, Arun Murthy, Tsz-Wo Nicholas Sze, Suresh Srinivas and Sanjay Radia. There are many others who contributed as well, providing feedback and comments on numerous jiras.

The vote will run for seven days and will end on March 5, 6:00PM PST.

Regards,
Suresh

On Thu, Feb 7, 2013 at 6:41 PM, Mahadevan Venkatraman mah...@microsoft.com wrote:

It is super exciting to look at the prospect of these changes being merged to trunk. Having Windows as one of the supported Hadoop platforms is a fantastic opportunity both for the Hadoop project and Microsoft customers.

This work began around a year back when a few of us started with a basic port of Hadoop on Windows. Ever since, the Hadoop team at Microsoft has made significant progress in the following areas (PS: some of these items are already included in Suresh's email, but including them again for completeness):

- Command-line scripts for the Hadoop surface area
- Mapping the HDFS permissions model to Windows
- Abstracted and reconciled mismatches around differences in Path semantics in Java and Windows
- Native Task Controller for Windows
- Implementation of a Block Placement Policy to support cloud environments, more specifically Azure
- Implementation of Hadoop native libraries for Windows (compression codecs, native I/O)
- Several reliability issues, including race conditions, intermittent test failures, resource leaks
- Several new unit test cases written for the above changes

In the process, we have closely engaged with the Apache open source community and have received great support and assistance from the community in terms of contributing fixes, code review comments and commits. In addition, the Hadoop team at Microsoft has also made good progress in other projects including Hive, Pig, Sqoop, Oozie, HCat and HBase. Many of these changes have already been committed to the respective trunks with help from various committers and contributors. It is great to see the commitment of the community to supporting multiple platforms, and we look forward to the day when a developer/customer is able to successfully deploy a complete solution stack based on Apache Hadoop releases.

Next Steps:

All of the above changes are part of the Windows Azure HDInsight and HDInsight Server products from Microsoft. We have successfully on-boarded several internal customers and have been running production workloads on Windows Azure HDInsight. Our vision is to create a big data platform based on Hadoop, and we are committed to helping make Hadoop a world-class solution that anyone can use to solve
Re: tests in mapreduce.lib excluded in jenkins?
All of the pre-commit builds only run tests for the projects that had changes. This is a known issue, but it was done because the pre-commit builds were taking a very long time. There have been a few proposals to improve the situation, like having any change in map/reduce run all of the map/reduce tests instead of just a subset of them (sorry, JIRA is acting up right now so I don't have a reference to the JIRA number). But none of them have gone in yet.

--Bobby

On 2/25/13 6:45 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

A recent patch of mine (https://issues.apache.org/jira/browse/MAPREDUCE-4994) broke a couple of tests, but the Hadoop QA build (https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3321/testReport/) didn't catch anything wrong. It looks like the tests under mapred.lib and mapreduce.lib, such as TestChainMapReduce and TestLineRecordReader, aren't running. Is this intentional?

thanks,
Sandy
timeout is now requested to be on all tests
Sorry about cross-posting, but this will impact all developers and I wanted to give you all a heads-up. HADOOP-9112 (https://issues.apache.org/jira/browse/HADOOP-9112) was just checked in. This means that the pre-commit build will now give a -1 for any patch with junit tests that do not include a timeout option. See http://junit.sourceforge.net/javadoc/org/junit/Test.html for more info on that. This is to avoid surefire timing out junit when it gets stuck and not giving any real feedback on which test failed.

--Bobby
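For context, the JUnit 4 timeout option is written as `@Test(timeout = 50000)` (milliseconds); JUnit runs the test body under that deadline, so a hung test fails with a clear message instead of stalling the whole surefire run. The sketch below mimics that behavior with plain java.util.concurrent. It is illustrative only and is not JUnit's actual implementation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutDemo {
    // Run a "test" body under a deadline, roughly what @Test(timeout = ...)
    // provides: returns true if the body finished in time, false otherwise.
    static boolean runWithTimeout(Runnable testBody, long millis) {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        Future<?> f = ex.submit(testBody);
        try {
            f.get(millis, TimeUnit.MILLISECONDS);
            return true;              // test finished within the deadline
        } catch (TimeoutException e) {
            f.cancel(true);           // interrupt the stuck test
            return false;             // report a timeout failure
        } catch (Exception e) {
            return false;             // test threw: also a failure
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithTimeout(() -> { }, 1000));   // fast test passes
        System.out.println(runWithTimeout(() -> {
            try { Thread.sleep(5000); } catch (InterruptedException ie) { }
        }, 100));                                              // hung test fails
    }
}
```

The key point for the new pre-commit rule is simply that every test method declares such a deadline, so the build always knows which test got stuck.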
Re: Doubt about map reduce version 2
Suresh,

The 1.0 line is still the stable line, and improvements there can have a large impact on existing users. That being said, I think there will be a lot of movement to YARN/MRv2 starting in the second half of this year and all of next year. Also, YARN scheduling is a larger area for study because it doesn't just run Map/Reduce: it allows you to explore how to effectively schedule other workloads in a multi-tenant environment. There has already been a lot of discussion about the scheduler and its protocol recently, because it is still a very new area to explore and no one really knows how well the current solutions work for other workloads.

As for speculative execution, in MRv2 it is completely pluggable by the user. This should make it very easy for you to explore and compare different speculation schemes.

--Bobby

On 2/7/13 11:39 PM, Suresh S suresh...@gmail.com wrote:

Hello Friends,

I am working to propose an improved Hadoop scheduling algorithm or speculative execution algorithm as part of my PhD research work. Now the new version of Hadoop, YARN/MRv2, is available. I have the following doubts: are the algorithms (particularly scheduling and speculation algorithms) proposed for the old Hadoop version applicable to the new version of Hadoop (YARN) or not? Is it worthwhile and useful to propose an algorithm for the old Hadoop version now? Can the user community support and discuss issues related to the old version?

Thanks in advance.

Regards,
S. Suresh,
Research Scholar,
Department of Computer Applications,
National Institute of Technology,
Tiruchirappalli - 620015.
+91-9941506562
Re: [VOTE] Release hadoop-2.0.3-alpha
I downloaded the binary package and ran a few example jobs on a 3-node cluster. Everything seems to be working OK on it. I did see

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

for every shell command, but just like with 0.23.6 I don't think it is a blocker.

+1 (Binding)

--Bobby

On 2/6/13 9:59 PM, Arun C Murthy a...@hortonworks.com wrote:

Folks,

I've created a release candidate (rc0) for hadoop-2.0.3-alpha that I would like to release. This release contains several major enhancements such as QJM for HDFS HA, multi-resource scheduling for YARN, YARN ResourceManager restart etc. Also YARN has achieved significant stability at scale (more details from Y! folks here: http://s.apache.org/VYO).

The RC is available at: http://people.apache.org/~acmurthy/hadoop-2.0.3-alpha-rc0/
The RC tag in svn is here: http://svn.apache.org/viewvc/hadoop/common/tags/release-2.0.3-alpha-rc0/
The maven artifacts are available via repository.apache.org.

Please try the release and vote; the vote will run for the usual 7 days.

thanks,
Arun

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
Re: One output file per node
Tejay, The way the scheduler works you are not guaranteed to get one reducer per node. Reducers are not scheduled based on locality of any kind, and even if they were, the scheduler typically treats rack-local the same as node-local. The partitioner interface only allows you to say what numeric partition an entry should go to, nothing else. There is no way to map that numeric partition to a particular machine. You could try to play games, but they would be very difficult to get right, especially for the corner cases where a task can fail and may be rerun. If your partitioner is not exactly deterministic, you could lose some data, and double count other data, in the case of a failure. Why don't you want to send all of the data over the wire? When you write it out to HDFS it will all be sent over the wire. How do you plan on using these indexes after they are generated? Do you plan to read from all of the indexes in parallel to search for a single entry, or do you want to merge them together again before actually using them? You could do your original proposal by simulating the combiner within the map itself. If your data is small enough you could aggregate the data within the mapper and then only output the aggregate when all the entries have been processed. If it is too big to fit into memory you could look at having a disk-backed data structure with in-memory caching, or even simulate Map/Reduce itself and write all of the data out to a local file, sort the data and read it back in already partitioned. --Bobby On 12/13/12 1:47 AM, Aloke Ghoshal alghos...@gmail.com wrote: Hi Tejay, Building a consolidated index file for all your source files (for terms within the source files) may not be doable this way. On the other hand, building one index file per node is doable if you run a Reducer per node and use a Partitioner.
- Run one Reducer per node - Let Mapper output carry *NodeHostName:Term* as the key - Use a Partitioner based on the NodeHostName portion of the key (KeyFieldBasedPartitioner) and a GroupingComparator based on the Term portion. Regards, Aloke On Wed, Dec 12, 2012 at 11:32 PM, Cardon, Tejay E tejay.e.car...@lmco.com wrote: First, I hope I'm posting this to the right list. I wasn't sure if developer questions belonged here or on the user list. Second, thanks for your thoughts. So I have a situation in which I'm building an index across many files. I don't want to send ALL the data across the wire by using reducers, so I'd like to use a map-only job. However, I don't want one file per mapper; I'd like to consolidate them to only one file per node. Effectively, I'd like to have the output of the combiner go to file, but I know I can't trust the combiner to always run on all outputs for the map. Is this possible? Perhaps some crafty partitioner that somehow sends all records to a reducer on the local node? (I don't see this working) Thanks, Tejay
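Aloke's recipe above can be sketched without the Hadoop classes: derive the partition only from the host portion of the composite key, so every record tagged with the same node name lands in the same partition. The class name and key layout below are illustrative, not Hadoop's API; a real job would wire this logic into a Partitioner (or configure KeyFieldBasedPartitioner with ':' as the separator).

```java
// Toy sketch of partitioning a "NodeHostName:Term" composite key by its
// host portion only, so all terms from one node go to one reducer.
public class HostPartitioner {
    // Route a composite "host:term" key to the partition for that host.
    static int partitionFor(String compositeKey, int numReducers) {
        int sep = compositeKey.indexOf(':');
        String host = sep >= 0 ? compositeKey.substring(0, sep) : compositeKey;
        // Same non-negative hash trick Hadoop's HashPartitioner uses.
        return (host.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("node1:apple", 4);
        int p2 = partitionFor("node1:pear", 4);
        // Keys from the same node always land in the same partition.
        System.out.println(p1 + " " + p2);
    }
}
```

Note the caveat from the thread, though: nothing forces that reducer to actually run on the named node, and two hosts can hash to the same partition unless the reducer count and partition function are chosen to avoid collisions.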
Re: Shuffle phase: fine-grained control of data flow
Jiwei, Ok, so you are specifically looking at reducing the overall network bandwidth of skewed map outputs, not all map outputs. That would very much mean that #1 and #3 are off base. But as you point out, it would only really be a performance win if the data fits into memory. It seems like an interesting idea. If the goal is to reduce bandwidth and not improve individual job performance then it seems more plausible. Do you have a benchmark (gridmix run etc.) that really taxes the network that you could use to measure the impact such a change would have? Something like this really needs some hard numbers for a proper evaluation. --Bobby Evans On 11/7/12 11:32 PM, Jiwei Li cxm...@gmail.com wrote: Hi Bobby, Thank you a lot for your suggestions. My whole idea is to minimize the aggregate network bandwidth during the shuffle phase, that is, to limit the hops to a minimum when transmitting data from a map node to a reduce node. Usually the Partitioner creates skew, so the JobTracker allocates different amounts of map output to the participating reduce nodes. Placing reduce tasks near the map outputs holding their largest partitions can reduce the aggregate network bandwidth. For #1, there is no need to schedule map tasks to be close to one another, since it will only congest links within the cluster. For #2, the location and size of each partition in each map output can be sent to the JobTracker along with the processing of the InputSplit. After collecting enough such information (not necessarily waiting for map tasks to finish), the JobTracker starts to schedule reduce tasks to fetch map output data. #3 is the same as #1. Now the tricky part is that if all map outputs are spilled to disk, network bandwidth may not be a bottleneck, because the time consumed in disk seeks outnumbers that in data transmission. If map outputs fit in memory, then the network must be taken seriously. Also note that for evenly distributed map outputs, the current scheduling policy works just fine.
Jiwei On Wed, Nov 7, 2012 at 11:45 PM, Robert Evans ev...@yahoo-inc.com wrote: Jiwei, I think you could use that knowledge to launch reducers closer to the map output, but I am not sure that it would make much difference. It may even slow things down. It is a question of several things: 1) Can we get enough map tasks close to one another that it will make a difference? 2) Does the reduced shuffle time offset the overhead of waiting for the map location data before launching and fetching data early? 3) And do the time savings also offset the overhead of getting the map tasks to be close to one another? For #2 you might be able to deal with this by using speculative execution, and launching some reduce tasks later if you see a clustering of map output. For #1 it will require changes to how we schedule tasks, which, depending on how well it is implemented, will impact #3 as well. Additionally for #1 any job that approaches the same order of size as the cluster will almost require the map tasks to be evenly distributed around the cluster. If you can come up with a patch I would love to see some performance numbers. Personally I think spending time reducing the size of the data sent to the reducers is a much bigger win. Can you use a combiner? Do you really need all of the data, or can you sample the data to get a statistically significant picture of what is in the data? Have you enabled compression between the maps and the reducers? --Bobby On 11/7/12 8:05 AM, Harsh J ha...@cloudera.com wrote: Hi Jiwei, In trunk (i.e. MR2), the completion events selection + scheduling logic lies under the EventFetcher class's getMapCompletionEvents() method, viewable at http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/EventFetcher.java?view=markup This EventFetcher thread is used by the Shuffle (reduce package) class to continually do the shuffling.
The Shuffle class is then itself used by the ReduceTask class (look in the mapred package of the same maven module). I guess you can start there, to see if a better selection+scheduling logic would yield better results. On Wed, Nov 7, 2012 at 12:26 PM, Jiwei Li cxm...@gmail.com wrote: Dear all, For jobs like Sort, massive amounts of network traffic happen during the shuffle phase. The simple mechanism in Hadoop 1.0.4 for choosing reduce nodes does not help reduce network traffic. If the JobTracker is fully aware of the locations of every map output, why not take advantage of this topology knowledge? So, does anyone know where to start developing such code? Many thanks. Regards. -- Jiwei -- Harsh J -- Jiwei Li
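Jiwei's proposal can be illustrated with a toy calculation (plain Java, not JobTracker code; all names here are made up): given the per-node sizes of one skewed partition, placing the reducer on the node holding the largest share minimizes the bytes that must cross the network for that partition.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of locality-aware reducer placement for one skewed
// partition: pick the node that already holds the most of the partition.
public class ReducerPlacement {
    // Returns the node that minimizes remote shuffle bytes for one partition.
    static String bestNode(Map<String, Long> bytesPerNode) {
        String best = null;
        long max = -1;
        for (Map.Entry<String, Long> e : bytesPerNode.entrySet()) {
            if (e.getValue() > max) { max = e.getValue(); best = e.getKey(); }
        }
        return best;
    }

    // Bytes that still travel over the wire once the reducer is placed there.
    static long remoteBytes(Map<String, Long> bytesPerNode, String reducerNode) {
        long total = 0;
        for (Map.Entry<String, Long> e : bytesPerNode.entrySet()) {
            if (!e.getKey().equals(reducerNode)) total += e.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> skewed = new HashMap<>();
        skewed.put("nodeA", 900L);  // heavy skew toward nodeA
        skewed.put("nodeB", 60L);
        skewed.put("nodeC", 40L);
        String pick = bestNode(skewed);
        System.out.println(pick + " -> " + remoteBytes(skewed, pick) + " remote bytes");
        // prints "nodeA -> 100 remote bytes"
    }
}
```

As the thread notes, this only pays off when the partition sizes are skewed; for evenly distributed map outputs every placement moves about the same number of bytes.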
Re: division by zero in getLocalPathForWrite()
It looks like you are running with an older version of 2.0, even though it does not make much of a difference in this case. The issue shows up when getLocalPathForWrite thinks there is no space to write to on any of the disks it has configured. This could be because you do not have any directories configured. I really don't know for sure exactly what is happening. It might be disk fail-in-place removing disks for you because of other issues. Either way we should file a JIRA against Hadoop to make it so we never get the "/ by zero" error and provide a better way to handle the possible causes. --Bobby Evans On 10/24/12 11:54 PM, Ted Yu yuzhih...@gmail.com wrote: Hi, HBase has a Jenkins build against hadoop 2.0. I was checking why TestRowCounter sometimes failed: https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/231/testReport/org.apache.hadoop.hbase.mapreduce/TestRowCounter/testRowCounterExclusiveColumn/ I think the following could be the cause: 2012-10-22 23:46:32,571 WARN [AsyncDispatcher event handler] resourcemanager.RMAuditLogger(255): USER=jenkins OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1350949562159_0002 failed 1 times due to AM Container for appattempt_1350949562159_0002_01 exited with exitCode: -1000 due to: java.lang.ArithmeticException: / by zero at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:355) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:115) at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForWrite(LocalDirsHandlerService.java:257) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:849) However, I don't see where in getLocalPathForWrite() the division by zero could have arisen. Comments / hints are welcome. Thanks
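The stack trace suggests the kind of arithmetic involved: a weighted choice over the configured local dirs divides (or mods) by the total available space, which is zero when the dir list is empty or every disk is reported full. The following is a simplified sketch of that failure mode and the guard Bobby proposes, not the actual LocalDirAllocator logic.

```java
import java.io.IOException;
import java.util.Random;

// Simplified sketch (not the real LocalDirAllocator) of choosing a local dir
// with probability proportional to its free space. When the total free space
// is zero, the modulo below is the "/ by zero" seen in the trace, so we
// guard it and raise a meaningful error instead.
public class DirChooser {
    static int chooseDir(long[] freeSpace, Random rng) throws IOException {
        long total = 0;
        for (long f : freeSpace) total += f;
        if (total <= 0) {
            // Without this guard, total == 0 surfaces as ArithmeticException.
            throw new IOException("No configured local dir has space available");
        }
        long pick = (rng.nextLong() & Long.MAX_VALUE) % total;
        for (int i = 0; i < freeSpace.length; i++) {
            pick -= freeSpace[i];
            if (pick < 0) return i;
        }
        return freeSpace.length - 1;
    }

    public static void main(String[] args) throws IOException {
        // Only dir 1 has space, so it is always chosen.
        System.out.println("chose dir " + chooseDir(new long[]{0L, 500L, 0L}, new Random(42)));
        try {
            chooseDir(new long[]{0L, 0L}, new Random(42));
        } catch (IOException expected) {
            System.out.println("full disks reported cleanly: " + expected.getMessage());
        }
    }
}
```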
Re: pluggable resources
I agree that having it be pluggable opens up a lot of new possibilities. +1 for the idea. Although I think in the short term we are having enough problems as it is with just CPU and memory that it may be a little while before we get to a pluggable solution. Once YARN-2 goes in, if you can get an initial proof-of-concept patch for a generic solution I would be happy to review it and push for it to go in. --Bobby On 10/22/12 5:41 AM, Radim Kolar h...@filez.com wrote: I have a proposal for improved resource scheduling. https://issues.apache.org/jira/browse/MAPREDUCE-4256 As I see it, development seems to go the other way, for example in https://issues.apache.org/jira/browse/YARN-2: for every added kind of resource there has to be significant rework. Do you not see the benefits of having a framework able to handle custom resource types? It's not all about memory and cores. You need to schedule jobs based on other factors (network capacity, availability of GPU cores, data locality). And every cluster might have special considerations, for example do not overload the central SQL database. We usually have a few hundred submitted jobs, so proper resource sharing is essential. There is no point in running a job which needs a GPU that is in use by another mapper; better to run other jobs until the GPU becomes available again.
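The generic-resource idea from MAPREDUCE-4256 can be sketched in a few lines (the class and field names below are made up, not YARN's API): model a resource as an open-ended map from type name to quantity, so a scheduler can check any request, whether it's GPUs, network capacity, or database connections, against a node without per-resource code changes.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of pluggable resource types: a request fits on a node if, for every
// resource type the request names, the node has at least that much available
// (types the node does not advertise count as zero).
public class ResourceVector {
    static boolean fitsIn(Map<String, Long> request, Map<String, Long> available) {
        for (Map.Entry<String, Long> e : request.entrySet()) {
            if (available.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Long> node = new HashMap<>();
        node.put("memoryMB", 8192L);
        node.put("vcores", 8L);
        node.put("gpu", 1L);

        Map<String, Long> gpuJob = new HashMap<>();
        gpuJob.put("memoryMB", 2048L);
        gpuJob.put("gpu", 1L);

        Map<String, Long> dbJob = new HashMap<>();
        dbJob.put("sqlConnections", 5L);  // node advertises none -> reject

        System.out.println(fitsIn(gpuJob, node));  // true
        System.out.println(fitsIn(dbJob, node));   // false
    }
}
```

This captures Radim's point: adding a new resource kind is just a new key, not a scheduler rework. A real scheduler would of course also need fairness and accounting per type.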
Re: Fix versions for commits branch-0.23
I don't see much of a reason to have the same JIRA listed under both 0.23 and 2.0. I can see some advantage of being able to see what went into 0.23.X by looking at a 2.0.X CHANGES.txt, but unless the two are released at exactly the same time they will be out of date with each other in the best cases. I personally think the only way to truly know what is in 0.23.X is to look at the CHANGES.txt on 0.23.X and similarly for 2.X. Having JIRA be in sync is a huge help and we should definitely push for that. I just don't see much value in trying very hard to have the CHANGES.txt stay in sync. --Bobby On 10/8/12 10:21 PM, Siddharth Seth seth.siddha...@gmail.com wrote: Along with fix versions, does it make sense to add JIRAs under 0.23 as well as branch-2 in CHANGES.txt, if they're committed to both branches. CHANGES.txt tends to get out of sync with the different release schedules of the 2 branches. Thanks - Sid On Sat, Sep 29, 2012 at 10:33 PM, Arun C Murthy a...@hortonworks.com wrote: Guys, A request - can everyone please set fix-version to both 2.* and 0.23.*? I found some with only 0.23.*, makes generating release-notes very hard. thanks, Arun
Re: Commits breaking compilation of MR 'classic' tests
That is fine, we may want to then mark it so that MAPREDUCE-4687 depends on the JIRA to port the tests, so the tests don't disappear before we are done. --Bobby From: Arun C Murthy a...@hortonworks.com Date: Wednesday, September 26, 2012 12:31 PM To: hdfs-...@hadoop.apache.org, Yahoo! Inc. ev...@yahoo-inc.com Cc: common-...@hadoop.apache.org, yarn-...@hadoop.apache.org, mapreduce-dev@hadoop.apache.org Subject: Re: Commits breaking compilation of MR 'classic' tests Fair, however there are still tests which need to be ported over. We can remove them after the port. On Sep 26, 2012, at 9:54 AM, Robert Evans wrote: As per my comment on the bug, I thought we were going to remove them. MAPREDUCE-4266 only needs a little bit more work, changing a patch to a script, before they disappear entirely. I would much rather see dead code die than be maintained for a few tests that are mostly testing the dead code itself. --Bobby On 9/26/12 9:39 AM, Arun C Murthy a...@hortonworks.com wrote: Point. I've opened https://issues.apache.org/jira/browse/MAPREDUCE-4687 to track this. On Sep 25, 2012, at 9:33 PM, Eli Collins wrote: How about adding this step to the MR PreCommit Jenkins job so it's run as part of test-patch? On Tue, Sep 25, 2012 at 7:48 PM, Arun C Murthy a...@hortonworks.com wrote: Committers, As most people are aware, the MapReduce 'classic' tests (in hadoop-mapreduce-project/src/test) still need to be built using ant since they aren't mavenized yet.
I've seen several commits (and 2 within the last hour, i.e. MAPREDUCE-3681 and MAPREDUCE-3682) which lead me to believe developers/committers aren't checking for this. Henceforth, with all changes, before committing, please do run:

$ mvn install
$ cd hadoop-mapreduce-project
$ ant veryclean all-jars -Dresolvers=internal

These instructions were already in http://wiki.apache.org/hadoop/HowToReleasePostMavenization and I've just updated http://wiki.apache.org/hadoop/HowToContribute. thanks, Arun -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/ -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: Speculative Execution...
Under YARN (branch-2, branch-0.23, and trunk) the speculative execution decision is pluggable, and can be replaced by a user. If you could come up with a better solution to speculative execution that would be great. We have known for a while that it is not very good (most of the time we run a speculative task it is just wasted). In branch-2 we have a new version that we think is better (more conservative), but I am not sure how much of a study has been done on exactly how much better it is or what else can be done to improve it further. I would look at adding your ideas into that plugin rather than using the config to turn speculation on or off dynamically, because there are some map/reduce applications that abuse map/reduce somewhat and will not run correctly if speculative execution is enabled. --Bobby Evans On 9/13/12 1:08 AM, Suresh S suresh...@gmail.com wrote: Hello Sir/Madam, I am doing a PhD. I am interested in doing research on Hadoop with a view to publishing papers. I know a little about speculative execution of slow tasks. I know it is possible to enable or disable speculative execution. But has any idea been published already for dynamically enabling or disabling speculative execution depending on the application, cluster load and other runtime parameters? Is it worth doing research in this direction? Is this contribution worth publishing as conference or journal paper(s)? *Regards* *S.Suresh,* *Research Scholar,* *Department of Computer Applications,* *National Institute of Technology,* *Tiruchirappalli - 620015.* *India* *Mobile: +91-9941506562*
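As a concrete starting point, here is a toy heuristic of the kind a replacement speculator plugin could implement. This is not the actual branch-2 speculator and every name is made up: estimate each task's total runtime from its progress rate, and only speculate a task whose estimate is well beyond its peers', which is one way to keep wasted speculative launches rare.

```java
// Toy speculation heuristic: estimate total runtime from progress rate and
// speculate only the clear straggler. Not Hadoop's DefaultSpeculator.
public class SpeculationSketch {
    // progress in (0,1], runtime in seconds; estimated total runtime.
    static double estimatedTotal(double progress, double runtimeSoFar) {
        if (progress <= 0) return Double.POSITIVE_INFINITY;
        return runtimeSoFar / progress;
    }

    // Index of the task worth speculating, or -1 if none is slow enough.
    // slowdownFactor controls conservatism: a task must be estimated at more
    // than slowdownFactor times the mean before we pay for a second attempt.
    static int pickCandidate(double[] progress, double[] runtime, double slowdownFactor) {
        double sum = 0;
        for (int i = 0; i < progress.length; i++) sum += estimatedTotal(progress[i], runtime[i]);
        double mean = sum / progress.length;
        int worst = -1;
        double bar = mean * slowdownFactor;  // must exceed this to qualify
        for (int i = 0; i < progress.length; i++) {
            double est = estimatedTotal(progress[i], runtime[i]);
            if (est > bar) { bar = est; worst = i; }
        }
        return worst;
    }

    public static void main(String[] args) {
        double[] progress = {0.9, 0.85, 0.1};
        double[] runtime  = {90, 85, 90};  // task 2 is crawling
        System.out.println(pickCandidate(progress, runtime, 1.5));  // 2
    }
}
```

Raising slowdownFactor makes the policy more conservative, in the spirit of the branch-2 change Bobby describes; a study of the right factor per workload would be exactly the kind of research Suresh is asking about.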
Re: On the topic of task scheduling
The other thing to point out is that in order to solve this problem perfectly you literally have to solve the halting problem. You have to predict whether the maps are going to finish quickly or slowly. If they finish quickly then you want to launch reduces quickly to start fetching data from the mappers; if they are going to finish very slowly, then you have a lot of reducers taking up resources not doing anything. That is why there is the config parameter that can be set on a per-job basis to tell the AM when to start launch maps. We have actually been experimenting with setting this to 100% because it improves utilization of the cluster a lot. But be careful: there are a lot of bugs that you might run into if you do this. I think we have fixed all of them, but I don't know how many have been merged into 2.1 and how many are still sitting on 2.2. --Bobby On 9/2/12 1:46 PM, Arun C Murthy a...@hortonworks.com wrote: Vasco, Welcome to Hadoop! Your observations are all correct - in the simplest case you launch all reduces up front (we used to do that initially) and get a good 'pipeline' between maps, shuffle (i.e. moving map-outputs to reduces) and the reduce itself. However, one thing to remember is that keeping reduces up and running without sufficient maps being completed is a waste of resources in the cluster. As a result, we have a simple heuristic in hadoop-1 i.e. do not launch reduces until a certain percentage of the job's maps are complete - by default it's set to 5%. However, there still is a flaw with it (regardless of what you set it to be, i.e. 5% or 50%). If it's too high, you lose the 'pipeline'; too low (5%), and reduces still spin waiting for all maps to complete, wasting resources in the cluster. Given that, we've implemented the heuristic you've described below for hadoop-2, which is better at balancing resource-utilization v/s pipelining or job latency. However, as you've pointed out there are several improvements which are feasible.
But, remember that the complexity involved depends on a number of factors you've already mentioned: # Job size (a job with 100m/10r v/s 10m/1r) # Skew for reduces # Resource availability i.e. other active jobs/shuffles in the system, network bandwidth etc. If you look at an ideal shuffle it will look like so (pardon my primitive scribble): http://people.apache.org/~acmurthy/ideal-shuffle.png From that graph: # X i.e. when to launch reduces depends on resource availability, job size and maps' completion rate. # Slope of shuffles (red worm) depends on network b/w, skew etc. None of your points are invalid - I'm just pointing out the possibilities and complexities. Your points about aggregation are also valid; look at http://code.google.com/p/sailfish/ for e.g. One of the advantages of hadoop-2 is that anyone can play with these heuristics and implement your own - I'd love to help if you are interested in playing with them. Related jiras: https://issues.apache.org/jira/browse/MAPREDUCE-4584 hth, Arun On Sep 2, 2012, at 9:34 AM, Vasco Visser wrote: Hi, I am new to the list. I am working with Hadoop in the context of my MSc graduation project (which has nothing to do with task scheduling per se). I came across task scheduling because I ran into the fifo starvation bug (MAPREDUCE-4613). Now I am running the 2.1.0 branch, where the fifo starvation issue is solved. The task scheduling behavior I observe in this branch is as follows. It begins with all containers allocated to mappers. Pretty quickly reducers start to be scheduled. In a linear way more containers are given to reducers, until about 50% (does anybody know why 50%?) of available containers are reducers (this point is reached when ~50% of the mappers are finished). It stays ~50-50 until all mappers are scheduled. Only then is the proportion of containers allocated to reducers increased beyond 50%. I don't think this is in general quite the optimal (in terms of total job completion time) scheduling behavior.
The reason being that the last reducer can only be scheduled when a free container becomes available after all mappers are scheduled. Thus, in order to shorten total job completion time the last reducer must be scheduled as early as possible. For the following gedankenexperiment, assume the number of reducers is set to 99% of capacity, as suggested somewhere in the hadoop docs, and that each reducer will process roughly the same amount of work. I am going to schedule as in 2.1.0, but instead of allocating reducers slowly up to 50% of capacity, I am just going to take away containers. Thus, the amount of map work is the same as in 2.1.0, only no reduce work will be done. At the point where the proportion of reducers would be increased to more than 50% of the containers (i.e., near the end of the map phase), I schedule all reducers in the containers I took away, making sure that the last reducer is scheduled at the same moment as it would be in 2.1.0. My claim
Re: On the topic of task scheduling
You are correct about my typo, it should be launching reducers, not maps. We do want a solution that is good in most cases, and preferably automatic, because most users are not going to change any default values. But I think you also want to give administrators of a cluster, and individual users as well, the knobs to adjust whether resources are better spent on improving the overall throughput of the cluster or whether the run time of a job is a higher priority. On our clusters some jobs have a tight SLA. We ideally want to do what we can to meet their SLA, even if it requires using more resources. On the other hand, running on the same cluster will be jobs with either no SLA or a very lenient one. In those cases we want to use the resources as wisely as possible so as many jobs as possible can complete in the given time frame. This has bigger ramifications for the RM's scheduling, but ideally the AM would also adjust the timing of its requests so both work together toward a common goal. --Bobby Evans On 9/4/12 8:59 AM, Vasco Visser vasco.vis...@gmail.com wrote: On Tue, Sep 4, 2012 at 3:11 PM, Robert Evans ev...@yahoo-inc.com wrote: The other thing to point out is that in order to solve this problem perfectly you literally have to solve the halting problem. You have to predict whether the maps are going to finish quickly or slowly. If they finish quickly then you want to launch reduces quickly to start fetching data from the mappers; if they are going to finish very slowly, then you have a lot of reducers taking up resources not doing anything. I agree with you that a perfect solution is not going to be feasible. The aim should probably be a solution that is good in many cases. That is why there is the config parameter that can be set on a per job basis to tell the AM when to start launch maps. I assume you mean start launching reducers. We have actually been experimenting with setting this to 100% because it improves utilization of the cluster a lot.
thanks for pointing this out, I didn't know about this config option. That the utilization of the cluster improves by setting this to 1 doesn't surprise me. Maybe it is a good idea to introduce a concept like job container time that captures how much resource a job uses in its lifetime. For example, if a job uses 10 mappers each for a minute and 10 reducers also each for a minute, then the container time would be 20 minutes. Having idle reducers will increase container time. A conceptually simple method to optimize the container time of a job is to let the AM monitor, for each scheduled reducer, how much of the time it is waiting for mappers to produce intermediate data (maybe embed this in the heartbeat?). If the average waiting over all scheduled reducers is above a certain proportion (say, waiting more than 25% of the time or something), then the AM can decide to discard some/all reducers and give the freed resources to mappers. This is just an idea, I don't know about its feasibility. Also I didn't think about the relationship between optimizing container time for a single job and optimizing it for all jobs running on the cluster. It might be that minimizing it for each job gives the minimal overall, but I'm not sure. On 9/2/12 1:46 PM, Arun C Murthy a...@hortonworks.com wrote: Vasco, Welcome to Hadoop! Your observations are all correct - in the simplest case you launch all reduces up front (we used to do that initially) and get a good 'pipeline' between maps, shuffle (i.e. moving map-outputs to reduces) and the reduce itself. However, one thing to remember is that keeping reduces up and running without sufficient maps being completed is a waste of resources in the cluster. As a result, we have a simple heuristic in hadoop-1 i.e. do not launch reduces until a certain percentage of the job's maps are complete - by default it's set to 5%. However, there still is a flaw with it (regardless of what you set it to be, i.e. 5% or 50%).
If it's too high, you lose the 'pipeline'; too low (5%), and reduces still spin waiting for all maps to complete, wasting resources in the cluster. Given that, we've implemented the heuristic you've described below for hadoop-2, which is better at balancing resource-utilization v/s pipelining or job latency. However, as you've pointed out there are several improvements which are feasible. But, remember that the complexity involved depends on a number of factors you've already mentioned: # Job size (a job with 100m/10r v/s 10m/1r) # Skew for reduces # Resource availability i.e. other active jobs/shuffles in the system, network bandwidth etc. If you look at an ideal shuffle it will look like so (pardon my primitive scribble): http://people.apache.org/~acmurthy/ideal-shuffle.png From that graph: # X i.e. when to launch reduces depends on resource availability, job size and maps' completion rate. # Slope of shuffles (red worm) depends on network b/w, skew etc. None of your points are invalid - I'm just pointing out the possibilities
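Vasco's "container time" proposal above reduces to simple bookkeeping. The sketch below (all names made up, not AM code) reproduces the 20-minute example and the 25% wait threshold from the email.

```java
// Toy model of the "container time" metric: charge a job for every second
// each container is held, working or idle, and flag jobs whose reducers
// spend too large a fraction of that time waiting on map output.
public class ContainerTime {
    // Total container-seconds for a group of identical tasks.
    static long containerSeconds(int tasks, long secondsEach) {
        return (long) tasks * secondsEach;
    }

    // Fraction of reducer container time spent idle, waiting for map output.
    static double reducerWaitRatio(long reducerSeconds, long idleSeconds) {
        return (double) idleSeconds / reducerSeconds;
    }

    public static void main(String[] args) {
        // 10 mappers * 60s + 10 reducers * 60s = 1200 container-seconds,
        // the "20 minutes" in the email's example.
        long total = containerSeconds(10, 60) + containerSeconds(10, 60);
        System.out.println(total + " container-seconds");  // 1200

        // Reducers held 600s in total but idle for 200s of it: 33% > 25%,
        // so under this scheme the AM would hand containers back to mappers.
        boolean preemptReducers = reducerWaitRatio(600, 200) > 0.25;
        System.out.println("preempt reducers: " + preemptReducers);  // true
    }
}
```

A real implementation would, as the thread notes, need the per-reducer wait times reported in heartbeats and would have to weigh single-job container time against cluster-wide throughput.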
Re: Cannot create a new Jira issue for MapReduce
It is a bit worse than that, though. I found that it did create the JIRA, but it is in a bad state where you cannot put it in patch available or close it. So we may need to do some cleanup of these JIRAs later. --Bobby On 8/9/12 3:19 PM, Ted Yu yuzhih...@gmail.com wrote: This has been reported by HBase developers as well. See https://issues.apache.org/jira/browse/INFRA-5131 On Thu, Aug 9, 2012 at 1:10 PM, Benoy Antony bant...@gmail.com wrote: Hi, I am getting the following error when I try to create a Jira issue: Error creating issue: com.atlassian.jira.util.RuntimeIOException: java.io.IOException: read past EOF Has anyone else faced the same problem? Thanks, Benoy
Re: Multi-level aggregation with combining the result of maps per node/rack
Tsuyoshi, There has been a lot of work happening in the shuffle phase. It is being made pluggable in both 1.0 and 2.0/trunk (MAPREDUCE-4049). There is also some work being done to reuse containers in trunk/2.0 (MAPREDUCE-3902). This should have a similar, although perhaps more limited, result, because when different map tasks run in the same container their outputs also go through the same combiner. I have heard that it is showing some good results for both small and large jobs. There was also some work to try and pull in Sailfish (no JIRA, just ramblings on the mailing list), which moves the shuffle phase to a separate process. I have not seen much happen on that front recently, but it saw some large gains on big jobs, though it is worse on small jobs. I think that this is something very interesting and I would encourage you to file a JIRA and pursue it. I don't know anything about your design, so please feel free to disregard my comments if they do not apply. I would encourage you to think about security on this. When you run the combiner you need to be sure that it runs as the user that owns the data. This should probably not be too difficult if you hijack a mapper task that has just finished to try and combine the data from others on the same node. To do this you will probably need some sort of coordination system in the AM to tell that mapper what other mappers to try and combine data from. It would be nice to coordinate this with the container reuse work, which currently just tells the container to run another split through. It could be another option to tell it to combine with the map output from container X. Another thing to be aware of is small jobs. It would be great to see how this impacts small jobs, and if it has a negative impact we should look for an automated way to turn it off or on.
Thanks for your work, Bobby Evans On 7/30/12 8:11 PM, Tsuyoshi OZAWA ozawa.tsuyo...@gmail.com wrote: Hi, We consider the shuffle cost to be a main concern in MapReduce, in particular for aggregation processing. The shuffle cost is also expensive in Hadoop in spite of the existence of the combiner, because the scope of combining is limited to a single MapTask. To solve this problem, I've implemented a prototype that combines the results of multiple maps per node[1]. This is the first step to make Hadoop faster with a multi-level aggregation technique like Google Dremel[2]. I took a benchmark with the prototype. We used a WordCount program with in-mapper combining optimization as the benchmark. The benchmark was taken on 40 nodes [3]. The input data sets are 300GB, 500GB, 1TB, and 2TB of text generated by the default RandomTextWriter. Reducer count is configured as 1 on the assumption that some workloads force 1 reducer, as in Google Dremel. The result is as follows:

                           | 300GB | 500GB |   1TB |   2TB
  Normal (sec)             |  4004 |  5551 | 12177 | 27608
  Combining per node (sec) |  3678 |  3844 |  7440 | 15591

Note that a MapTask runs the combiner per node every 3 minutes in the current prototype, so the aggregation rate is very limited. Normal is the result of current Hadoop, and Combining per node is the result with my optimization. Regardless of the 3-minute restriction, the prototype is 1.7 times faster than normal Hadoop in the 2TB case. Another benchmark also shows that the shuffle cost is cut down by 50%. I want to know from you guys: do you think this is a useful feature? If yes, I will work on contributing it. You are also welcome to tell me which benchmarks you want me to run with my prototype. 
Regards, Tsuyoshi [1] The idea is also described in the Hadoop wiki: http://wiki.apache.org/hadoop/HadoopResearchProjects [2] The Dremel paper is available at: http://research.google.com/pubs/pub36632.html [3] The specification of each node is as follows: CPU Core(TM)2 Duo CPU E7400 2.80GHz x 2, Memory 8 GB, Network 1 GbE
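The in-mapper combining optimization used in the benchmark above can be illustrated with a small sketch. This is not the benchmark code; the class and method names are illustrative stand-ins for Hadoop's Mapper.map()/cleanup(), showing how partial counts are buffered per task so one (word, n) pair is emitted per distinct word instead of one (word, 1) pair per token:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "in-mapper combining" pattern: buffer partial counts
// in memory and emit one combined pair per distinct word at cleanup,
// rather than emitting (word, 1) for every token.
class InMapperCombiner {
    private final Map<String, Long> buffer = new HashMap<String, Long>();

    // Called once per input token (stands in for Mapper.map()).
    public void map(String word) {
        Long count = buffer.get(word);
        buffer.put(word, count == null ? 1L : count + 1L);
    }

    // Called at the end of the task (stands in for Mapper.cleanup());
    // returns the combined partial counts that would be emitted.
    public Map<String, Long> flush() {
        Map<String, Long> out = new HashMap<String, Long>(buffer);
        buffer.clear();
        return out;
    }
}
```

The node-level aggregation in the prototype extends the same idea one step further: the per-task buffers are merged across the maps on one node before the shuffle.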
Re: Can we use String.intern inside WritableUtils#readString()?
Yes, I filed a JIRA for something like this a while ago: MAPREDUCE-4303. I have not done anything with it for this very reason. There are some potential fixes for this: we could keep a somewhat small weak reference cache of these strings so that if a string is read multiple times it is deduped, and if it is collected we don't force it to stay around too long, and it is not placed in the permgen space. But that is not a small change. If you want to take over that JIRA feel free, otherwise I will get around to it eventually. --Bobby Evans On 7/12/12 1:27 PM, Ramkumar Vadali ramkumar.vad...@gmail.com wrote: String.intern() should be used with caution. The intern'ed strings go to the perm gen space in the java process, which is limited. You could easily run out of that space and get OOM errors even when the total usage is well below the Xmx value. A better way would be to have a Map<String, String> that de-duplicates string objects Ramkumar On Thu, Jul 12, 2012 at 6:02 AM, Bhallamudi Venkata Siva Kamesh kames...@imaginea.com wrote: Hi All, I noticed that WritableUtils.readString(), while deserializing the strings, creates a string object every time. But there may be applications which serialize a small number of strings a huge number of times. So while deserializing them, this may lead to OOMs sometimes. I think using intern() will reduce the number of String objects created. Please correct me if my understanding is wrong. -- Thanks & Regards, Bh.V.S.Kamesh, +91-9652725948
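The weak-reference cache Bobby describes could look roughly like the sketch below. This is an illustration of the idea, not the MAPREDUCE-4303 patch; WeakHashMap keys by equality, so an equal string already in the cache is returned, and entries vanish once no strong references to the string remain, avoiding the permgen pinning of String.intern():

```java
import java.lang.ref.WeakReference;
import java.util.WeakHashMap;

// Sketch of a weak-reference string dedup cache (illustrative name).
// Unlike String.intern(), nothing here prevents the strings from being
// garbage collected: both the key and the value hold the string weakly.
class WeakStringCache {
    private final WeakHashMap<String, WeakReference<String>> cache =
            new WeakHashMap<String, WeakReference<String>>();

    public synchronized String dedup(String s) {
        WeakReference<String> ref = cache.get(s);
        if (ref != null) {
            String cached = ref.get();
            if (cached != null) {
                return cached; // reuse the existing instance
            }
        }
        cache.put(s, new WeakReference<String>(s));
        return s;
    }
}
```

A deserializer like WritableUtils.readString() could route every decoded string through dedup() so repeated values share one instance.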
Re: Cyclic dependency in JobControl job DAG
I personally think it is useful. I would say contribute it. (Moved common-dev to bcc, we try not to cross post on these lists) --Bobby Evans On 6/25/12 3:37 AM, madhu phatak phatak@gmail.com wrote: Hi, In the current implementation of JobControl, whenever there is a cyclic dependency between the jobs it throws a StackOverflowError. For example, ControlledJob job1 = new ControlledJob(new Configuration()); job1.setJobName("job1"); ControlledJob job2 = new ControlledJob(new Configuration()); job2.setJobName("job2"); job1.addDependingJob(job2); job2.addDependingJob(job1); JobControl jobControl = new JobControl("jobcontrol"); jobControl.addJob(job1); jobControl.addJob(job2); jobControl.run(); throws java.lang.StackOverflowError at java.util.ArrayList.get(ArrayList.java:322) at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.checkState(ControlledJob.java:295) Whenever we write a complex application, there is always the possibility of cyclic dependencies. I have written a method which checks for a cyclic dependency upfront and informs the user. I want to know from you guys, do you think it is a useful feature? If yes I can contribute it as a patch. Regards, Madhukara Phatak -- https://github.com/zinnia-phatak-dev/Nectar
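The upfront check described above can be sketched as a depth-first search for back edges in the dependency graph. This is an illustration, not the proposed patch: jobs are plain strings here rather than ControlledJob instances, and the class name is hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of an upfront cycle check over a job-dependency graph.
// A DFS keeps the set of jobs on the current path; revisiting one of
// them means a back edge, i.e. a cycle.
class DependencyChecker {
    private final Map<String, List<String>> deps = new HashMap<String, List<String>>();

    void addDependency(String job, String dependsOn) {
        List<String> l = deps.get(job);
        if (l == null) {
            l = new ArrayList<String>();
            deps.put(job, l);
        }
        l.add(dependsOn);
    }

    boolean hasCycle() {
        Set<String> done = new HashSet<String>();
        for (String job : deps.keySet()) {
            if (dfs(job, new HashSet<String>(), done)) {
                return true;
            }
        }
        return false;
    }

    private boolean dfs(String job, Set<String> onPath, Set<String> done) {
        if (onPath.contains(job)) return true;  // back edge => cycle
        if (done.contains(job)) return false;   // already proven acyclic
        onPath.add(job);
        List<String> next = deps.get(job);
        if (next != null) {
            for (String n : next) {
                if (dfs(n, onPath, done)) return true;
            }
        }
        onPath.remove(job);
        done.add(job);
        return false;
    }
}
```

Running such a check before JobControl.run() would let the user get a clear error instead of a StackOverflowError deep inside checkState().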
Re: try to fix hadoop streaming bug
It looks like your jar's MANIFEST file is missing the Main-Class attribute. It may have something to do with how you created the updated jar you are using. Hadoop is trying to run the jar, and because it did not find the Main-Class in the jar's manifest it thinks you are supplying it as the next argument, and so it looks for a class named -mapper, which obviously does not exist. You can either update the MANIFEST when you build the jar, or you can supply the main class on the command line like hadoop jar path/hadoop-streaming.jar org.apache.hadoop.streaming.HadoopStreaming -mapper ... --Bobby Evans On 6/14/12 5:01 AM, HU Wenjing A wenjing.a...@alcatel-sbell.com.cn wrote: Hi all, I tried to fix the hadoop streaming bug for the version 0.21.0 (streaming overrides user given output key and value types). I saw some useful information about this issue on https://issues.apache.org/jira/browse/MAPREDUCE-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel and modified some code following the patch file. I modified and compiled the code. It seems only about thirteen .java files need to be modified. But when I tried to replace the old .classes files using the new ones, I could only find StreamJob.class in ${hadoop_home}/ /root/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar. And the other twelve modified files couldn't be found in any jar files in the ${hadoop_home} directory. 
Then I executed the command bin/hadoop jar mapred/contrib/streaming/hadoop-0.21.0-streaming.jar -mapper org.apache.hadoop.mapred.lib.IdentityMapper -reducer NONE -input input -output output with the modified streaming jar and just received some error information: Exception in thread "main" java.lang.ClassNotFoundException: -mapper at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.util.RunJar.main(RunJar.java:185) And I think this error should have something to do with the modification of StreamJob.java. But I saw someone say they have fixed the streaming override issue using the patch. So, could anyone give me some suggestion about this issue? Or just give me another way to fix the bug? Thanks in advance! : ) Thanks best regards, Wenjing
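For reference, the missing manifest attribute Bobby mentions is a single line in the jar's META-INF/MANIFEST.MF. This is a sketch of what it would look like for the streaming jar, using the driver class from the command above:

```
Main-Class: org.apache.hadoop.streaming.HadoopStreaming
```

When rebuilding the jar by hand, the manifest file containing this line has to be passed to the jar tool (e.g. with its m/e options) or the attribute will be absent, which reproduces exactly the ClassNotFoundException: -mapper behavior described here.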
Re: Hadoop optimization for Lustre FS
Zam, http://wiki.apache.org/hadoop/HowToContribute is a wiki that can tell you in more detail the steps you need to do for this. In general though, to push the patch upstream you want to file a Map/Reduce JIRA and attach your patch. After that several people from the community are likely to comment on the JIRA. If you don't get feedback you can bug us on the dev mailing list about it. As part of this you are also going to need to do a port to trunk, as we do not want to have new features go into any line without having them go into trunk as well. This may sound complex because trunk uses YARN instead of the previous Map/Reduce-specific framework, but both 1.0 and trunk are in the process of getting a pluggable shuffle service (MAPREDUCE-4049). It would probably be best to port your patch to be a plugin for this. Then hopefully the porting between trunk and 1.0 will be relatively simple. If this is the route you want to go you should put 1.1 and 3.0.0 as the target versions of the JIRA. 3.0.0 corresponds to trunk, and 1.1 is the next release of the 1 line that is accepting new major feature work. You probably also want to link your JIRA to the MAPREDUCE-4049 JIRA as a dependency, if you are making it a plugin. In addition, because this is an optimization, it would be nice to have some information in the JIRA showing the benchmarks you ran and the performance improvements you got. Ultimately we are also going to want to have some documentation about this as well, but that is something that can come later after you lock down the code more. --Bobby Evans On 5/16/12 3:34 AM, Alexander Zarochentsev alexander_zarochent...@xyratex.com wrote: Hello, there is an optimization for Hadoop on Lustre FS, or any high-performance distributed filesystem. 
The research paper with test results can be found here http://www.xyratex.com/pdfs/whitepapers/Xyratex_white_paper_MapReduce_1-4.pdf and a presentation for LUG 2011: http://www.olcf.ornl.gov/wp-content/events/lug2011/4-12-2011/1100-1130_Nathan_Rutman_MapReduce_Lug_2011.pptx Basically the optimization replaces the HTTP transport in the shuffle phase by simply linking the target file to the source one. I attached a draft patch against hadoop-1.0.0 to illustrate the idea. How to push this patch upstream? Thanks, -- Alexander Zam Zarochentsev alexander_zarochent...@xyratex.com
Re: Building first time
http://wiki.apache.org/hadoop/HowToContribute is the best place to start. Checking the code in through git will not trigger a Jenkins build, unless you have a special setup beyond what Apache provides. You do not need to compile the entire tree to get Map/Reduce, but typically it is not a big deal to compile everything. --Bobby Evans On 5/8/12 11:50 PM, Radim Kolar h...@filez.com wrote: I am interested in working on the mapreduce package, so not sure if I need to compile the whole tree. I work on branch-0.23. It can be just imported into SpringToolsSuite, then click on Run - Maven - type in 'compile' target. It compiles the module; it just fails on Avro stuff. But it is good enough that you can edit it in Eclipse with some comfort. Then just commit to git and let Jenkins on Unix build it for you.
Mixed Mode Environments
I just noticed that HADOOP-7484 and MAPREDUCE-3500 recently got committed to trunk and 0.23. I missed them before they were committed. I am curious if we are dropping support for running Hadoop in mixed mode environments? Meaning I want Hadoop to run as 32-bit by default, because that is faster than 64-bit, but if one of my users wants to launch a mapper or reducer in a 64-bit JVM to have access to more memory they can, and the native libraries should be able to work with them. --Bobby Evans
Re: Status of the completed containers (0.23)
Praveen, Looking at the code, it does not appear to currently be used outside of testing. I really don't know. Perhaps in the future if it is extended then it might be used more. Or perhaps the author of the API added it in for completeness. Just speculating. --Bobby Evans On 1/9/12 7:42 AM, Praveen Sripati praveensrip...@gmail.com wrote: Hi, Documentation says that the NM sends the status of the completed containers to the RM and the RM sends it to the AM. This is the interface (1) below. What is the purpose of the interface (2)? 1) AMRMProtocol has the below method. The AllocateResponse has the list of completed containers. public AllocateResponse allocate(AllocateRequest request) throws YarnRemoteException; 2) ContainerManager has the below method. The GetContainerStatusResponse has the status of the container. GetContainerStatusResponse getContainerStatus( GetContainerStatusRequest request) throws YarnRemoteException; Regards, Praveen
Re: Reduce output is strange
It looks mostly correct to me. I am not an expert on sequence files, and I have not checked the text against the spec nor have I checked the binary numbers in it to be sure they add up to the correct lengths etc, but it looks good from a first glance. I can see the SEQ tag at the beginning to mark it as a sequence file and the org.apache.hadoop.io.Text as the type for both the keys and the values. --Bobby Evans On 12/19/11 7:51 AM, Pedro Costa psdc1...@gmail.com wrote: Hi, In the hadoop MapReduce, I've executed the webdatascan example, and the reduce output is in a SequenceFile. The result is shown here ( http://paste.lisp.org/display/126572). What's the trash (random characters), like u 265 100 330 320 252 \n # ; 374 5 211 V ' 340 376 in the output? Is the output correct? 000 S E Q 006 031 o r g . a p a c h e . 020 h a d o o p . i o . T e x t 031 o 040 r g . a p a c h e . h a d o o p 060 . i o . T e x t \0 \0 \0 \0 \0 \0 u 265 100 330 320 252 \n # ; 374 5 211 V ' 340 376 \0 \0 120 \0 X \0 \0 \0 037 a p p l e a p p 140 l e b a n a n a a p p l e 160 a p p l e 7 c a r r o t c a 200 r r o t c a r r o t c a r r 220 o t a p p l e b a n a n a 240 c a r r o t b a n a n a 256 -- Thanks,
Re: Reduce output is strange
Oh I forgot to say that part of the random characters are actually random characters. Sequence files store a set of random characters as sync points within the file. This allows for splitting the file easily without a high risk that the random sequence appears inside the data itself just by chance. --Bobby Evans
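The readable part of the dump can be explained with a small sketch that parses just the front of the header. This is a simplified reconstruction, assuming the class names are shorter than 128 bytes so their vint length prefix fits in a single byte; the authoritative format (including the sync marker) is defined by org.apache.hadoop.io.SequenceFile:

```java
// Sketch: read the start of a SequenceFile header to show where the
// "SEQ" tag and the org.apache.hadoop.io.Text class names in the dump
// come from. Simplified: assumes single-byte vint length prefixes and
// ASCII class names; not a full SequenceFile parser.
class SeqHeaderReader {
    static String[] readHeader(byte[] data) {
        if (data.length < 4 || data[0] != 'S' || data[1] != 'E' || data[2] != 'Q') {
            throw new IllegalArgumentException("not a SequenceFile");
        }
        int pos = 4; // skip the 3-byte "SEQ" magic and the 1-byte version
        String keyClass = readShortString(data, pos);
        pos += 1 + keyClass.length();
        String valueClass = readShortString(data, pos);
        return new String[] { keyClass, valueClass };
    }

    private static String readShortString(byte[] data, int pos) {
        int len = data[pos]; // single-byte vint for short strings
        return new String(data, pos + 1, len,
                java.nio.charset.StandardCharsets.UTF_8);
    }
}
```

In the dump above, the octal 031 immediately before each class name is exactly that length byte: 031 octal is 25, the length of "org.apache.hadoop.io.Text".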
Re: Multiple resource requests for a given node (or all nodes)?
Arun, I am saying that I don't know what the correct solution is to updating the scheduler interface. Perhaps the correct solution is no change, I have not taken the time to think about it much. What I am saying is that there are a number of new features that are likely going to be going into the scheduler, and if we are going to change the interface, I want to be sure that we think about these use cases before we change it. That is all I am saying. I am not advocating for a particular interface at this point, as I said I have not taken the time to think about it in depth. --Bobby Evans On 12/13/11 12:42 AM, Arun C Murthy a...@hortonworks.com wrote: I'd argue that Robert is complaining that the interface *is not* MR-centric enough. IAC, priorities is fairly generic. MR AM uses it to get constraints to stick. Arun On Dec 12, 2011, at 7:50 PM, Patrick Wendell wrote: Todd - that's a good question and I haven't looked closely into whether simply adding a multimap is enough or if there are more deeply seeded issues (at least to address this specific case). If it's the former I'll probably just submit a patch. Arun - that seems like a hack but I guess it is a sufficient workaround for current applications. I'm finishing up a bare-bones version of the Fair Scheduler right now (going to throw something up for review soon) but I haven't yet added preemption. How this is going to work well with various types of applications is unclear. In the MR case we can probably just preempt based on priorities, since they are essentially just ordering constraints right now. As Robert points out, this interface is very MR-Centric right now - i'm not sure this generalizes well to other applications depending on how they use priorities. - Patrick On Mon, Dec 12, 2011 at 1:27 PM, Arun C Murthy a...@hortonworks.com wrote: Use priorities to ask for different resource types. 
Arun On Dec 10, 2011, at 12:23 PM, Patrick Wendell wrote: If you look at how resource requests are stored now, they use a map keyed on the node hostname. == AppSchedulingInfo.java == final Map<Priority, Map<String, ResourceRequest>> requests = new HashMap<Priority, Map<String, ResourceRequest>>(); What happens if an application wants to request multiple container types on a given node. E.g. say I need 10 2GB containers and 10 1GB containers, and I don't care which node they are on (i.e. RMNode.ANY). I really want to store 2 resource requests under RMNode.ANY in this case... don't I? Is the model just that an AM would ask for these in series? - Patrick
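The multimap change being discussed amounts to keeping a list per (priority, host) key instead of a single entry. A minimal sketch with a stand-in Request class (not the real YARN ResourceRequest API; names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: several outstanding requests per (priority, host), so that
// 10x2GB and 10x1GB containers can both be requested under RMNode.ANY
// at the same priority.
class RequestTable {
    static class Request {
        final int memoryMb;
        final int numContainers;
        Request(int memoryMb, int numContainers) {
            this.memoryMb = memoryMb;
            this.numContainers = numContainers;
        }
    }

    // priority -> host -> list of requests (the multimap part)
    private final Map<Integer, Map<String, List<Request>>> requests =
            new HashMap<Integer, Map<String, List<Request>>>();

    void add(int priority, String host, Request r) {
        Map<String, List<Request>> byHost = requests.get(priority);
        if (byHost == null) {
            byHost = new HashMap<String, List<Request>>();
            requests.put(priority, byHost);
        }
        List<Request> l = byHost.get(host);
        if (l == null) {
            l = new ArrayList<Request>();
            byHost.put(host, l);
        }
        l.add(r); // a plain Map would have overwritten the earlier request here
    }

    List<Request> get(int priority, String host) {
        Map<String, List<Request>> byHost = requests.get(priority);
        return byHost == null ? null : byHost.get(host);
    }
}
```

The workaround Arun suggests, by contrast, keeps the single-valued map and uses distinct priority keys to separate the two container types.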
Re: Multiple resource requests for a given node (or all nodes)?
I think there may be some need for a bigger redesign in how requests are made to the scheduler because the only use case really was map/reduce at the time it was designed. It works very well for that purpose but has missed a few other use cases. For example there could be something like HBase where it wants a specific number of nodes with no overlap on the same physical machines (Yes you can do it now but it may take many iterations to get it right). Or perhaps like with MPI or Storm where they don't really care where the nodes are so long as they are all relatively close to one another in the network topology. Or things like with MPI where it cannot start any processing until all of the containers are ready (gang scheduling). It gets even more complicated if we want to support preemption like with the fair scheduler. Which imo is needed even more once MPI and other potentially very long lived jobs start to coexist with shorter jobs with tight SLAs. In order to make a good decision about what to preempt the scheduler needs to know that if it preempts a mapper, even though it may have been running a lot shorter time than some reducer in the same application, it is likely to slow things down further than if it preempts that reducer. Or if it preempts an MPI node it might as well kill the entire application and start over, unless we somehow give the scheduler the ability to tell MPI that it is going to be preempted and it needs to save its state away. But even then the scheduler needs to know that preempting an MPI node will cause all progress on the job, and all of the containers it is holding, to stop. Even if we are not putting any of these scheduling features in now we need to think about them when designing the interface to not limit ourselves and force us to change things drastically later on. I am just saying that I am not sure just switching to a multimap is enough. 
-- Bobby Evans On 12/10/11 6:21 PM, Todd Lipcon t...@cloudera.com wrote: On Sat, Dec 10, 2011 at 12:23 PM, Patrick Wendell pwend...@eecs.berkeley.edu wrote: What happens if an application wants to request multiple container types on a given node. E.g. say I need 10 2GB containers and 10 1GB containers, and I don't care which node they are on (i.e. RMNode.ANY). I really want to store 2 resource requests under RMNode.ANY in this case... don't I? Is the model just that an AM would ask for these in series? My hunch is that this was overlooked because the resource sizes for MR are basically set on a per-task-type level. That is, maps need X MB and reduces need Y MB. Since maps and reduces are set at different 'priorities', they haven't conflicted. Does it seem straightforward to change it to a multimap? Guava has a nice implementation. -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: Incremental builds in 0.23 using Maven
Praveen, One thing to be aware of with removing the clean is that I have run into situations, in both hadoop and in other projects, where an API changed as part of the update or something and maven did not realize it and did not rebuild something that depended on it. I then got a runtime error, and after doing a clean I got a compile error. I have also seen situations where the tar.gz package for hadoop was not properly updated and after deploying, it did not have my fix in it. It took me a long time to figure out why my fix did not work. After doing a clean and rebuilding/redeploying everything worked just fine. I have not had the time to dig through why that is happening or file any JIRAs on them. Also it is rather rare, but if I want to be sure something has my changes in it I always do a clean first. --Bobby Evans On 12/7/11 7:42 AM, Harsh J ha...@cloudera.com wrote: Praveen, Obviously, a clean target will wipe out all your existing build directories, and hence the other things start from scratch. That is your slowdown-causer. Just remove the clean from that command and you're good to go. On 07-Dec-2011, at 6:37 PM, Praveen Sripati wrote: Alejandro, Here is the command I use for branch-0.23 mvn clean install package -Pdist -Dtar -DskipTests -Dmaven.javadoc.skip=true Regards, Praveen On Wed, Dec 7, 2011 at 11:24 AM, Alejandro Abdelnur t...@cloudera.comwrote: what is your 'do a build' command in both cases? On Tue, Dec 6, 2011 at 6:06 PM, Praveen Sripati praveensrip...@gmail.com wrote: Alejandro, Here is the sequence 1. 'svn get ' 2. do a build 3. 'svn up' with no changes 4. do a build Tasks (2) and (4) are taking almost equal time. I expected task (4) to be much faster. Regards, Praveen On Tue, Dec 6, 2011 at 11:08 PM, Alejandro Abdelnur t...@cloudera.com wrote: Maven does incremental builds. taking time as in? Thanks. Alejandro On Tue, Dec 6, 2011 at 6:31 AM, Praveen Sripati praveensrip...@gmail.com wrote: Could someone please respond to the below query? 
Regards, Praveen On Tue, Nov 22, 2011 at 11:43 AM, Praveen Sripati praveensrip...@gmail.comwrote: Hi, Does Maven support incremental builds? After `svn up', the build is taking time even without any updates from svn. Thanks, Praveen
Re: Automatically Documenting Apache Hadoop Configuration
From my work on yarn trying to document the configs there and to standardize them, writing anything that is going to automatically detect config values through static analysis is going to be very difficult. This is because most of the configs in yarn are now built up using static string concatenation. public static String BASE = "yarn.base."; public static String CONF = BASE + "config"; I am not sure that there is a good way around this short of using a full java parser to trace out all method calls, and try to resolve the parameters. I know this is possible, just not that simple to do. I am +1 for anything that will clean up configs and improve the documentation of them. Even if we have to rewire or rewrite a lot of the Configuration class to make things work properly. --Bobby Evans On 12/5/11 11:54 AM, Harsh J ha...@cloudera.com wrote: Praveen, (Inline.) On 05-Dec-2011, at 10:14 PM, Praveen Sripati wrote: Hi, Recently there was a query about the Hadoop framework being tolerant of map/reduce task failure towards job completion. And the solution was to set the 'mapreduce.map.failures.maxpercent' and 'mapreduce.reduce.failures.maxpercent' properties. Although this feature was introduced a couple of years back, it was not documented. Had a similar experience with the 0.23 release also. I do not know if we recommend using config strings directly when there's an API in Job/JobConf supporting setting the same thing. Just saying - that there was javadoc already available on this. But of course, it would be better if the tutorial covered this too. Doc-patches welcome! It would be really good for Hadoop adoption to automatically dig and document all the existing configurable properties in Hadoop and also to identify newly added properties in a particular release during the build processes. Documentation would also lead to fewer queries in the forums. Cloudera has done something similar [1], though it's not 100% accurate, it would definitely help to some extent. I'm +1 for this. 
We do request and consistently add entries to *-default.xml files if we find them undocumented today. I think we should also enforce it at the review level, so that patches do not go in undocumented -- at minimum the configuration tweaks at least.
Re: Start Nodemanager with webapp disabled.
The simplest way is to use ephemeral ports. Set the port number to 0 in the config and the node manager will pick a free port to listen on. It will then heartbeat back into the Resource Manager with the port it is listening on and the RM can pass that info off to whoever else needs it. I am not positive that this will work in all cases as I have not tried it myself. There is some work to enable this in the mini yarn cluster. --Bobby Evans On 10/5/11 3:24 AM, Prashant Sharma prashant.ii...@gmail.com wrote: Hi all, Is it possible to start the NM daemon with the webapp disabled? Overriding the port address in yarn-site is not an option, since I want more than one NM to be started. I tried passing the property -Dyarn.nodemanager.webapp.address=localhost:port in the command line options; unfortunately that does not seem to override the default. Thanks Prashant.
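As a sketch, the ephemeral-port setting Bobby describes would go into yarn-site.xml roughly like this. The property name is the one from Prashant's command line; the 0.0.0.0 bind host is an illustrative choice, and Bobby's caveat that this is untested applies here too:

```xml
<property>
  <name>yarn.nodemanager.webapp.address</name>
  <!-- port 0: let each NodeManager pick its own free port,
       so several NMs can share one host without colliding -->
  <value>0.0.0.0:0</value>
</property>
```

With this in place, the actual port each NM chose is reported back to the RM via the heartbeat, as described above, rather than being fixed in the config.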
Re: Regarding 'branch-0.20-security'
It is kind of a long history and I will try to leave out all of the politics involved to make it shorter. For a long time 0.20 has been the stable release of Hadoop. It is supposedly in sustaining releases now, but many new features keep going in because that is what most people use in production and they do not want to wait several years for a new interesting feature. One of the very big features that went in is security. There was a separate branch created for it which is branch-0.20-security. Branch-0.20-security has essentially replaced branch-0.20 as the 0.20 release branch. All features that go into branch-0.20-security or any other release branch are also supposed to go into trunk first, if they are not specific to branch-0.20-security. So in theory everything in any release has also been applied to trunk. --Bobby Evans On 9/28/11 9:01 AM, Praveen Sripati praveensrip...@gmail.com wrote: Hi, There seems to be continuous changes to the 'branch-0.20-security' and also there are references to it once in a while in the mailing list. What is the significance of the 'branch-0.20-security'? Do all the security related features go into this branch and then ported to others? Thanks, Praveen
Re: RecommenderJob Mahout Creating a data model
This should probably be directed more toward the Mahout list than the Hadoop Map/Reduce one. mahout-u...@apache.org --Bobby Evans On 9/14/11 6:28 AM, Amit Sangroya sangroyaa...@gmail.com wrote: Hi all, I am trying to run the example from https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering , with the following command bin/mahout org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=input -Dmapred.output.dir=output --itemsFile itemfile --tempDir tempDir The algorithm estimates the preference of a user towards an item which he/she has not yet seen. Once an algorithm can predict preferences it can also be used to do Top-N recommendation, where the task is to find the N items a given user might like best. It is mentioned that given a DataModel, it can produce recommendations. The algorithm takes approx. 5 minutes to generate top 5 recommendations for one user on a 10 node hadoop cluster. The size of input is shortened to only 200 users from the 1 Million MovieLens Dataset from Grouplens.org. I have few questions: 1) I want to know if it is possible to isolate the data model building step from generating recommendations. 2) Can we use the model, once generated using the training data, for generating recommendations for a range of users. 3) To be specific, if I want to provide an on-line service that generates recommendations for users, can I minimize the cost of MapReduce interactions each time. I am not a data mining expert. Please help me to understand this in a better way. Thanks and Regards, Amit
500 error in review board
Whenever I try to post a new patch to review board I get a 500 error. Something broke! (Error 500) It appears something broke when you tried to go to here. This is either a bug in Review Board or a server configuration error. Please report this to your administrator. Who should I talk to/report this to? Thanks, Bobby Evans
MAPREDUCE-2864 Has been merged to trunk and 0.23
MAPREDUCE-2864 was an effort to rename and reorganize the YARN configuration parameters to make them consistent. If you are setting anything in your yarn-site.xml then you will need to update your configuration. The patch did not provide backwards compatible mappings because there has never been a release with these configs in it. I have provided a script that should hopefully do the conversion for you. https://issues.apache.org/jira/secure/attachment/12492495/update.pl I have not fully tested it so please double check the results when it is run. This script is a bit of a hack, but it should take a config file name on the command line as input and update all of the configs in it to use the newly renamed ones. The original file will be saved with .orig at the end. If you do have any problems with this please feel free to respond to this e-mail and I will do my best to help you out. --Bobby Evans
Re: MAPREDUCE-2864 Has been merged to trunk and 0.23
A quick update. I found a bug in the script, and it has now been fixed. Please use this script instead. https://issues.apache.org/jira/secure/attachment/12493787/update.pl --Bobby Evans
Re: MRv1 in 0.23+
There is a MiniYarnCluster and a MiniMRYarnCluster, it is just that the tests have not been ported over to use them yet. --Bobby On 9/7/11 2:01 PM, Eli Collins e...@cloudera.com wrote: My understanding is that the MR1 code is currently needed to run the tests because there is no Mini MR cluster for MR2. So the code is needed until the tests can run against MR2 (not sure if there's an effort underway). However, see MR-2736: if we remove the ability to run the daemons I don't think we need to maintain, e.g., the code for security patches. I.e., it seems like 23 and trunk should be able to ignore the LTC fixes. Thanks, Eli On Wed, Sep 7, 2011 at 11:22 AM, milind.bhandar...@emc.com wrote: Folks, Has the community decided how long MRv1 will remain part of the codebase after 0.23? The reason I am asking is, for those who are working on forward porting LinuxTaskController fixes (from 0.20.2xx) to 0.22, will they have to patch 0.23 and trunk as well? Or should these branches be left alone? - Milind --- Milind Bhandarkar Greenplum Labs, EMC (Disclaimer: Opinions expressed in this email are those of the author, and do not necessarily represent the views of any organization, past or present, the author might be affiliated with.)
Re: Get Hadoop 0.24.0-SNAPSHOT ready for Eclipse fails on retrieve hadoop-yarn-common jar
I believe that if you take off the -e then it will work. If not run mvn eclipse:clean and then mvn eclipse:eclipse. It worked for me yesterday. --Bobby On 9/2/11 4:53 AM, Mario Pastorelli pastorelli.ma...@gmail.com wrote: Hi all, I'm trying to download and prepare Hadoop trunk to be used on Eclipse using https://wiki.apache.org/hadoop/EclipseEnvironment but I'm having problems with Yarn. In particular the command mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true -e output this (this is just the end, other goals compile): [INFO] [INFO] Building hadoop-yarn-api 0.24.0-SNAPSHOT [INFO] [INFO] [INFO] maven-eclipse-plugin:2.8:eclipse (default-cli) @ hadoop-yarn-api [INFO] [INFO] --- maven-antrun-plugin:1.6:run (create-protobuf-generated-sources-directory) @ hadoop-yarn-api --- [INFO] Executing tasks main: [INFO] Executed tasks [INFO] [INFO] --- exec-maven-plugin:1.2:exec (generate-sources) @ hadoop-yarn-api --- [INFO] [INFO] --- build-helper-maven-plugin:1.5:add-source (add-source) @ hadoop-yarn-api --- [INFO] Source directory: /home/rief/Programmazione/Java/Hadoop/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/target/generated-sources/proto added. [INFO] [INFO] maven-eclipse-plugin:2.8:eclipse (default-cli) @ hadoop-yarn-api [INFO] [INFO] --- maven-eclipse-plugin:2.8:eclipse (default-cli) @ hadoop-yarn-api --- [INFO] Using Eclipse Workspace: null [INFO] Adding default classpath container: org.eclipse.jdt.launching.JRE_CONTAINER [INFO] File /home/rief/Programmazione/Java/Hadoop/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/.project already exists. Additional settings will be preserved, run mvn eclipse:clean if you want old settings to be removed. [INFO] Wrote Eclipse project for hadoop-yarn-api to /home/rief/Programmazione/Java/Hadoop/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api. 
[INFO] [INFO] [INFO] [INFO] Building hadoop-yarn-common 0.24.0-SNAPSHOT [INFO] [INFO] [INFO] maven-eclipse-plugin:2.8:eclipse (default-cli) @ hadoop-yarn-common [INFO] [INFO] Reactor Summary: [INFO] [INFO] Apache Hadoop Project POM . SUCCESS [1.437s] [INFO] Apache Hadoop Annotations . SUCCESS [0.153s] [INFO] Apache Hadoop Project Dist POM SUCCESS [0.032s] [INFO] Apache Hadoop Assemblies .. SUCCESS [0.059s] [INFO] Apache Hadoop Auth SUCCESS [0.222s] [INFO] Apache Hadoop Auth Examples ... SUCCESS [0.138s] [INFO] Apache Hadoop Common .. SUCCESS [2.172s] [INFO] Apache Hadoop Common Project .. SUCCESS [0.017s] [INFO] Apache Hadoop HDFS SUCCESS [2.305s] [INFO] Apache Hadoop HDFS Project SUCCESS [0.016s] [INFO] hadoop-yarn-api ... SUCCESS [1.964s] [INFO] hadoop-yarn-common FAILURE [0.234s] [INFO] hadoop-yarn-server-common . SKIPPED [INFO] hadoop-yarn-server-nodemanager SKIPPED [INFO] hadoop-yarn-server-resourcemanager SKIPPED [INFO] hadoop-yarn-server-tests .. SKIPPED [INFO] hadoop-yarn-server SKIPPED [INFO] hadoop-yarn ... SKIPPED [INFO] hadoop-mapreduce-client-core .. SKIPPED [INFO] hadoop-mapreduce-client-common SKIPPED [INFO] hadoop-mapreduce-client-shuffle ... SKIPPED [INFO] hadoop-mapreduce-client-app ... SKIPPED [INFO] hadoop-mapreduce-client-hs SKIPPED [INFO] hadoop-mapreduce-client-jobclient . SKIPPED [INFO] hadoop-mapreduce-client ... SKIPPED [INFO] hadoop-mapreduce .. SKIPPED [INFO] Apache Hadoop Main SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 11.370s [INFO] Finished at: Fri Sep 02 10:58:42 CEST 2011 [INFO] Final Memory: 26M/269M [INFO] [ERROR] Failed to execute goal on project hadoop-yarn-common: Could not resolve dependencies for project org.apache.hadoop:hadoop-yarn-common:jar:0.24.0-SNAPSHOT: Failure to find org.apache.hadoop:hadoop-yarn-api:jar:0.24.0-SNAPSHOT in
Re: Jenkins's Links to FindBugs warnings not useful
You can do mvn findbugs:gui and then open up each of the findbugsXml.xml files manually. Or you should be able to run mvn site to generate HTML. You may need to modify the pom.xml file to include findbugs in the report section though. On 9/2/11 9:38 AM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: Oh, I also just found this working link https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/lastSuccessfulBuild/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html on https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/ . Seems that the artifacts are there only for the lastSuccessfulBuild though. +Vinod On Fri, Sep 2, 2011 at 8:03 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: None of the links to the warnings related to FindBugs by Jenkins on submitting a patch are working. You can see any of the JIRAs being built at https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/. OTOH, I ran ~/Applications/full-packages/apache-maven-3.0.3/bin/mvn clean test findbugs:findbugs -DskipTests -DHadoopPatchProcess to generate the warnings on my local box. I do see a bunch of findBugsXml.xml files which seem to indicate warnings, but they are hardly readable. Does anyone know how to generate HTML reports locally? Giri? Thanks, +Vinod
Trunk and 0.23 build failing with clean .m2 directory
I am getting the following errors when I try to build either trunk or 0.23 with a clean maven cache. I don't get any errors if I use my old cache. [INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ hadoop-yarn-common --- [INFO] Compiling 2 source files to /home/evans/src/hadoop-git/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn- common/target/classes [INFO] [INFO] [INFO] Building hadoop-yarn-server-common 0.24.0-SNAPSHOT [INFO] [INFO] [INFO] Reactor Summary: [INFO] [INFO] Apache Hadoop Project POM . SUCCESS [0.714s] [INFO] Apache Hadoop Annotations . SUCCESS [0.323s] [INFO] Apache Hadoop Project Dist POM SUCCESS [0.001s] [INFO] Apache Hadoop Assemblies .. SUCCESS [0.025s] [INFO] Apache Hadoop Alfredo . SUCCESS [0.067s] [INFO] Apache Hadoop Common .. SUCCESS [2.117s] [INFO] Apache Hadoop Common Project .. SUCCESS [0.001s] [INFO] Apache Hadoop HDFS SUCCESS [1.419s] [INFO] Apache Hadoop HDFS Project SUCCESS [0.001s] [INFO] hadoop-yarn-api ... SUCCESS [7.019s] [INFO] hadoop-yarn-common SUCCESS [2.181s] [INFO] hadoop-yarn-server-common . FAILURE [0.058s] [INFO] hadoop-yarn-server-nodemanager SKIPPED [INFO] hadoop-yarn-server-resourcemanager SKIPPED [INFO] hadoop-yarn-server-tests .. SKIPPED [INFO] hadoop-yarn-server SKIPPED [INFO] hadoop-yarn ... SKIPPED [INFO] hadoop-mapreduce-client-core .. SKIPPED [INFO] hadoop-mapreduce-client-common SKIPPED [INFO] hadoop-mapreduce-client-shuffle ... SKIPPED [INFO] hadoop-mapreduce-client-app ... SKIPPED [INFO] hadoop-mapreduce-client-hs SKIPPED [INFO] hadoop-mapreduce-client-jobclient . SKIPPED [INFO] hadoop-mapreduce-client ... SKIPPED [INFO] hadoop-mapreduce .. 
SKIPPED [INFO] Apache Hadoop Main SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 14.938s [INFO] Finished at: Mon Aug 29 11:18:06 CDT 2011 [INFO] Final Memory: 29M/207M [INFO] [ERROR] Failed to execute goal on project hadoop-yarn-server-common: Could not resolve dependencies for project org.apache.hadoop:hadoop-yarn-server-common:jar:0.24.0-SNAPSHOT: Failure to find org.apache.hadoop:hadoop-yarn-common:jar:tests:0.24.0-SNAPSHOT in http://ymaven.corp.yahoo.com:/proximity/repository/apache.snapshot was cached in the local repository, resolution will not be reattempted until the update interval of local apache.snapshot mirror has elapsed or updates are forced - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionExcepti on [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn goals -rf :hadoop-yarn-server-common Is anyone looking into this yet? --Bobby
Re: Trunk and 0.23 build failing with clean .m2 directory
Wow, this is odd: install works just fine, but compile fails unless I do an install first (I found this trying to run test-patch). $mvn --version Apache Maven 3.0.3 (r1075438; 2011-02-28 11:31:09-0600) Maven home: /home/evans/bin/maven Java version: 1.6.0_22, vendor: Sun Microsystems Inc. Java home: /home/evans/bin/jdk1.6.0/jre Default locale: en_US, platform encoding: UTF-8 OS name: linux, version: 2.6.18-238.12.1.el5, arch: i386, family: unix Has anyone else seen this, or is there something messed up with my machine? Thanks, Bobby On 8/29/11 11:18 AM, Robert Evans ev...@yahoo-inc.com wrote: [quoted build-failure log, identical to the original message above, trimmed]
Re: Trunk and 0.23 build failing with clean .m2 directory
Thanks Alejandro, that really clears things up. Is there a JIRA you know of to change test-patch to do mvn test -DskipTests instead of mvn compile? If not I can file one and do the work. Test-patch failed for me because of this. --Bobby On 8/29/11 12:21 PM, Alejandro Abdelnur t...@cloudera.com wrote: The reason for this failure is how Maven reactor/dependency resolution works (IMO a bug). Maven reactor/dependency resolution is smart enough to create the classpath using the classes from all modules being built. However, this smartness falls short just a bit. The dependencies are resolved using the deepest Maven phase used by the current mvn invocation. If you are doing 'mvn compile' you don't get to the test compile phase. This means that the TEST classes are not resolved from the build but from the cache/repo. The solution is to run 'mvn test -DskipTests' instead of 'mvn compile'. This will include the TEST classes from the build. The same when creating the eclipse profile: run 'mvn test -DskipTests eclipse:eclipse'. Thanks. Alejandro On Mon, Aug 29, 2011 at 9:59 AM, Ravi Prakash ravihad...@gmail.com wrote: Yeah, I've seen this before. Sometimes I had to descend into child directories to mvn install them before I could mvn install the parents. I'm hoping/guessing that issue is fixed now. On Mon, Aug 29, 2011 at 11:39 AM, Robert Evans ev...@yahoo-inc.com wrote: Wow this is odd install works just fine, but compile fails unless I do an install first (I found this trying to run test-patch). $mvn --version Apache Maven 3.0.3 (r1075438; 2011-02-28 11:31:09-0600) Maven home: /home/evans/bin/maven Java version: 1.6.0_22, vendor: Sun Microsystems Inc. Java home: /home/evans/bin/jdk1.6.0/jre Default locale: en_US, platform encoding: UTF-8 OS name: linux, version: 2.6.18-238.12.1.el5, arch: i386, family: unix Has anyone else seen this, or is there something messed up with my machine?
Thanks, Bobby On 8/29/11 11:18 AM, Robert Evans ev...@yahoo-inc.com wrote: [quoted build-failure log, identical to the original message above, trimmed]
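Alejandro's explanation in this thread — that sibling-module TEST classes come from the reactor only when the invoked lifecycle phase reaches test-compile — can be sketched as a toy model. This is a simplification with hypothetical function names, not Maven's actual resolution code:

```python
# Ordered subset of the Maven default lifecycle phases relevant here.
PHASES = ["validate", "compile", "test-compile", "test", "package", "install"]

def resolve_sibling_test_classes(invoked_phase):
    """Toy model of the quirk: a sibling module's TEST-jar dependency is
    satisfied from the in-memory reactor only if the invocation reaches
    test-compile; a plain 'mvn compile' stops earlier, so Maven falls
    back to the local repo/cache (which is empty on a clean .m2)."""
    if PHASES.index(invoked_phase) >= PHASES.index("test-compile"):
        return "reactor"
    return "local-repo"
```

This models why 'mvn compile' fails on a clean .m2 while 'mvn test -DskipTests' (which reaches the test phase without running tests) and 'mvn install' both succeed.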
Re: which Eclipse plugin to use for Maven?
Jim, the m2 plugin replaces the normal Eclipse build system with Maven. If you want to use M2E then you don't need to run mvn eclipse:eclipse at all. What mvn eclipse:eclipse does is generate source code and produce a .project and .classpath so that Eclipse can use its normal build system. The two approaches are not really compatible with each other. --Bobby On 8/29/11 11:52 AM, Jim Falgout jim.falg...@pervasive.com wrote: Using the latest trunk code, I used the mvn eclipse:eclipse target to build the Eclipse project files. I've got the M2E plugin for Maven installed. After some trouble with lifecycle errors (Plugin execution not covered by lifecycle configuration error messages) I noticed this comment in the .project file: NO_M2ECLIPSE_SUPPORT: Project files created with the maven-eclipse-plugin are not supported in M2Eclipse. Is there another recommendation for Maven integration using an Eclipse plugin that will work out of the box? Thanks!
Re: Trunk and 0.23 build failing with clean .m2 directory
DONE. I filed HADOOP-7589 and uploaded my patch to it. Alejandro, could you take a quick look at the patch, because you appear to be the Maven expert. Thanks, Bobby Evans On 8/29/11 12:39 PM, Mahadev Konar maha...@hortonworks.com wrote: Bobby, you are right. The test-patch uses mvn compile. Please file a jira. It should be a minor change. thanks mahadev On Mon, Aug 29, 2011 at 10:34 AM, Robert Evans ev...@yahoo-inc.com wrote: [earlier messages in this thread quoted in full, trimmed]
Re: DistCpV2 in 0.23
I agree with Mithun. They are related, but this goes beyond distcpv2 and should not block distcpv2 from going in. It would be very nice, however, to get the layout settled soon so that we all know where to find something when we want to work on it. Also, +1 for Alejandro's "I would prefer to keep tools at the trunk level." Even though HDFS, Common, and MapReduce, and perhaps soon tools, are separate modules right now, there is still tight coupling between the different pieces, especially with tests. IMO until we can reduce that coupling we should treat building and testing Hadoop as a single project instead of trying to keep them separate. --Bobby On 8/26/11 7:45 AM, Mithun Radhakrishnan mithun.radhakrish...@yahoo.com wrote: Would it be acceptable if retooling of tools/ were taken up separately? It sounds to me like this might be a distinct (albeit related) task. Mithun From: Giridharan Kesavan gkesa...@hortonworks.com To: mapreduce-dev@hadoop.apache.org Sent: Friday, August 26, 2011 12:04 PM Subject: Re: DistCpV2 in 0.23 +1 to Alejandro's. I prefer to keep the hadoop-tools at trunk level. -Giri On Thu, Aug 25, 2011 at 9:15 PM, Alejandro Abdelnur t...@cloudera.com wrote: I'd suggest putting hadoop-tools either at trunk/ level or having a tools aggregator module for hdfs and another for common. I personally would prefer at trunk/. Thanks. Alejandro On Thu, Aug 25, 2011 at 9:06 PM, Amareshwari Sri Ramadasu amar...@yahoo-inc.com wrote: Agree. It should be a separate maven module (and the patch puts it as a separate maven module now). And a top level for hadoop tools is nice to have, but it becomes hard to maintain until the patch automation runs the tests under tools. Currently we often see changes in HDFS affecting RAID tests in MapReduce. So, I'm fine putting the tools under hadoop-mapreduce.
I propose we can have something like the following:
trunk/
  - hadoop-mapreduce
    - hadoop-mr-client
    - hadoop-yarn
  - hadoop-tools
    - hadoop-streaming
    - hadoop-archives
    - hadoop-distcp
Thoughts? @Eli and @JD, we did not replace the old legacy distcp because this is really a complete rewrite, and we did not want to remove it until users are familiarized with the new one. On 8/26/11 12:51 AM, Todd Lipcon t...@cloudera.com wrote: Maybe a separate toplevel for hadoop-tools? Stuff like RAID could go in there as well - i.e. tools that are downstream of MR and/or HDFS. On Thu, Aug 25, 2011 at 12:09 PM, Mahadev Konar maha...@hortonworks.com wrote: +1 for a separate module in hadoop-mapreduce-project. I think hadoop-mapreduce-client might not be the right place for it. We might have to pick a new maven module under hadoop-mapreduce-project that could host streaming/distcp/hadoop archives. thanks mahadev On Thu, Aug 25, 2011 at 11:04 AM, Alejandro Abdelnur t...@cloudera.com wrote: Agree, it should be a separate maven module. And it should be under hadoop-mapreduce-client, right? And now that we are on the topic, the same should go for streaming, no? Thanks. Alejandro On Thu, Aug 25, 2011 at 10:58 AM, Todd Lipcon t...@cloudera.com wrote: On Thu, Aug 25, 2011 at 10:36 AM, Eli Collins e...@cloudera.com wrote: Nice work! I definitely think this should go in 23 and 20x. Agree with JD that it should be in the core code, not contrib. If it's going to be maintained then we should put it in the core code. Now that we're all mavenized, though, a separate maven module and artifact does make sense IMO - i.e. hadoop jar hadoop-distcp-0.23.0-SNAPSHOT rather than hadoop distcp -Todd -- Todd Lipcon Software Engineer, Cloudera -- -Giri
Re: Picking up local common changes in mr
One thing to be aware of is that with -SNAPSHOT at the end of the version, Maven will start looking at dates. So if you have a 0.23.0-SNAPSHOT that you personally modified/built in your .m2 repository and you go to build something that depends on it, and the nightly build has pushed a newer snapshot to the apache repo after you built your version, Maven might download the newer version, replacing your changes. If your changes impact multiple components then your choices are to always build the entire project (or at least the subset that has dependent changes) or to always build with -o after your initial build/install. --Bobby On 8/19/11 11:41 AM, Matt Foley mfo...@hortonworks.com wrote: Thanks for the nice clear statement, Alejandro. --Matt On Thu, Aug 18, 2011 at 4:40 PM, Alejandro Abdelnur t...@cloudera.com wrote: This is handled by the Maven reactor. When you run Maven in a multimodule project (like we have), all modules that are part of the build (from the dir where you are) down are used for the build/test/packaging; all modules that are not part of the build are picked up from .m2/repo. For example: cd trunk/hadoop-mapreduce; mvn compile uses hadoop-common and hadoop-hdfs from the .m2/repo; cd trunk; mvn compile uses hadoop-common, hadoop-hdfs, and hadoop-mapreduce from the build. HTH Thxs. Alejandro On Thu, Aug 18, 2011 at 4:35 PM, Matt Foley mfo...@hortonworks.com wrote: Since we put all the effort into un-splitting the components, shouldn't we have a switch that causes, e.g., the MAPREDUCE build to pick up artifacts from COMMON and HDFS builds in specified sibling directories, without using .m2 as an intermediary?
Of course it should respect dependencies (via Maven) so that if HDFS source has been modified, the HDFS artifacts will also be rebuilt before MAPREDUCE uses them :-) --Matt On Thu, Aug 18, 2011 at 3:30 PM, Giridharan Kesavan gkesa...@hortonworks.com wrote: Hello, It's the same -Dresolvers=internal for the ant build system. For the maven/yarn build system, as long as you have the latest common jar in the m2 cache it's going to resolve common from the maven cache, if not from the apache maven repo. You can force the builds to use the cache by adding the -o option (offline builds). Thanks, Giri On Thu, Aug 18, 2011 at 3:19 PM, Eli Collins e...@cloudera.com wrote: Hey gang, What's the new equivalent of resolvers=true in the new MR build? I.e. how do you get a local common change to get picked up by mr? Thanks, Eli
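Bobby's warning about -SNAPSHOT resolution can be modeled with a small sketch. The function below is hypothetical and ignores repository metadata and update policies; it only captures the date comparison he describes and the effect of building with -o (offline):

```python
from datetime import datetime

def pick_snapshot(local_built, remote_deployed, offline=False):
    """Toy model of Maven's -SNAPSHOT update check: with -o (offline),
    the locally installed artifact always wins; otherwise a remote
    snapshot deployed after your local build replaces your changes."""
    if offline:
        return "local"
    return "remote" if remote_deployed > local_built else "local"

# A nightly snapshot deployed after your local morning build silently
# wins unless you stay offline after the initial build/install:
local = datetime(2011, 8, 19, 9, 0)
nightly = datetime(2011, 8, 19, 23, 0)
```

This is why the advice is to rebuild the whole dependent subtree, or to pass -o on every build after the first install.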
Re: Notes for working on mapreduce trunk after the MR-279 merge.
It looks like git has not seen the changes yet, even though the last change was over 90 mins ago. Is there any way to kick git to pull in the changes sooner so I can rebase? Thanks, Bobby Evans On 8/18/11 7:49 AM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: The MR-279 branch is merged into mapreduce trunk and this changes things a bit for developing on mapreduce. You can get all the help that is needed from the INSTALL file at http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce/INSTALL. Reproducing some of those contents here for short-term lookup.
Checking out source code: svn checkout http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce
-- Directory structure --
trunk/
- hadoop-mapreduce (was mapreduce before)
trunk/hadoop-mapreduce - Classic code. JT/TT reside here
- build.xml
- src
trunk/hadoop-mapreduce/ - New code related to yarn resides here
- assembly
- pom.xml
- hadoop-mr-client
- hadoop-yarn - Yarn APIs, libraries, and server code
-- hadoop-yarn-api
-- hadoop-yarn-common
-- hadoop-yarn-server - Server code: ResourceManager, NodeManager, server libraries, and tests
--- hadoop-yarn-server-common
--- hadoop-yarn-server-nodemanager
--- hadoop-yarn-server-resourcemanager
--- hadoop-yarn-server-tests
- hadoop-mr-client - MapReduce server and client code
-- hadoop-mapreduce-client-app
-- hadoop-mapreduce-client-core
-- hadoop-mapreduce-client-jobclient
-- hadoop-mapreduce-client-common
-- hadoop-mapreduce-client-hs
-- hadoop-mapreduce-client-shuffle
--- Building ---
Building yarn code and installing into the local maven cache: mvn clean install. In case you want to skip the tests, run: mvn clean install -DskipTests
Building classic code once yarn code is built: ant veryclean jar jar-test -Dresolvers=internal
-- Eclipse ---
1) For hacking on the new yarn+MR code in eclipse, you should run mvn eclipse:eclipse and then import the checked-out source root as a maven project.
2) For developing on classic JT/TT code, running ant eclipse and importing as a Java project should continue to work. Hope that helps. If you run into issues, please send an email or create a JIRA issue. Thanks, +Vinod
Re: Problem while running eclipse-files for Next Gen Mapreduce branch
The mapreduce/INSTALL file also has some important information in it. Be aware that you no longer need to install the avro plugin by hand; Maven can download and install it automatically now, but the README was never updated. Also be sure to install Protocol Buffers, since the build will fail without it. --Bobby

On 7/8/11 9:04 AM, Josh Wills jwi...@cloudera.com wrote:
You want to generate them using mvn instead. See the mapreduce/yarn/README file for how to do it.

On Fri, Jul 8, 2011 at 7:00 AM, Devaraj K devara...@huawei.com wrote:
Hi, I am getting the errors below when I try to generate eclipse files using the eclipse-files target. Can anybody help me?

Buildfile: D:\svn\nextgenmapreduce\mapreduce\build.xml
ivy-download:
  [get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.2.0/ivy-2.2.0.jar
  [get] To: D:\svn\nextgenmapreduce\mapreduce\ivy\ivy-2.2.0.jar
  [get] Not modified - so not downloaded
ivy-init-dirs:
ivy-probe-antlib:
ivy-init-antlib:
ivy-init:
[ivy:configure] :: Ivy non official version - :: http://ant.apache.org/ivy/ ::
[ivy:configure] :: loading settings :: file = D:\svn\nextgenmapreduce\mapreduce\ivy\ivysettings.xml
ivy-resolve-common:
[ivy:resolve] :: problems summary ::
[ivy:resolve] WARNINGS
[ivy:resolve] module not found: org.apache.hadoop#yarn-server-common;1.0-SNAPSHOT
[ivy:resolve]   apache-snapshot: tried
[ivy:resolve]   https://repository.apache.org/content/repositories/snapshots/org/apache/hadoop/yarn-server-common/1.0-SNAPSHOT/yarn-server-common-1.0-SNAPSHOT.pom
[ivy:resolve]   -- artifact org.apache.hadoop#yarn-server-common;1.0-SNAPSHOT!yarn-server-common.jar:
[ivy:resolve]   https://repository.apache.org/content/repositories/snapshots/org/apache/hadoop/yarn-server-common/1.0-SNAPSHOT/yarn-server-common-1.0-SNAPSHOT.jar
[ivy:resolve]   maven2: tried
[ivy:resolve]   http://repo1.maven.org/maven2/org/apache/hadoop/yarn-server-common/1.0-SNAPSHOT/yarn-server-common-1.0-SNAPSHOT.pom
[ivy:resolve]   -- artifact org.apache.hadoop#yarn-server-common;1.0-SNAPSHOT!yarn-server-common.jar:
[ivy:resolve]   http://repo1.maven.org/maven2/org/apache/hadoop/yarn-server-common/1.0-SNAPSHOT/yarn-server-common-1.0-SNAPSHOT.jar
[ivy:resolve] module not found: org.apache.hadoop#hadoop-mapreduce-client-core;1.0-SNAPSHOT
[ivy:resolve]   apache-snapshot: tried
[ivy:resolve]   https://repository.apache.org/content/repositories/snapshots/org/apache/hadoop/hadoop-mapreduce-client-core/1.0-SNAPSHOT/hadoop-mapreduce-client-core-1.0-SNAPSHOT.pom
[ivy:resolve]   -- artifact org.apache.hadoop#hadoop-mapreduce-client-core;1.0-SNAPSHOT!hadoop-mapreduce-client-core.jar:
[ivy:resolve]   https://repository.apache.org/content/repositories/snapshots/org/apache/hadoop/hadoop-mapreduce-client-core/1.0-SNAPSHOT/hadoop-mapreduce-client-core-1.0-SNAPSHOT.jar
[ivy:resolve]   maven2: tried
[ivy:resolve]   http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-core/1.0-SNAPSHOT/hadoop-mapreduce-client-core-1.0-SNAPSHOT.pom
[ivy:resolve]   -- artifact org.apache.hadoop#hadoop-mapreduce-client-core;1.0-SNAPSHOT!hadoop-mapreduce-client-core.jar:
[ivy:resolve]   http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-core/1.0-SNAPSHOT/hadoop-mapreduce-client-core-1.0-SNAPSHOT.jar
[ivy:resolve] module not found: org.apache.hadoop#yarn-common;1.0-SNAPSHOT
[ivy:resolve]   apache-snapshot: tried
[ivy:resolve]
Re: Reg ChainReducer usage
Moving to mapreduce user.

Ravi, the issue is with the shuffle. The ChainReducer cannot re-shuffle the output of a previous reducer. If you want that, then you need to run a second reduce-only job. Instead, the chain reducer would usually have a single reducer followed by zero or more mappers that can process the output of that reducer. --Bobby

On 6/2/11 5:25 AM, Ravi Teja ravit...@huawei.com wrote:
Hi, I had some queries about the usage of ChainReducer.
1) Only one reducer can be set. If we try to set a second reducer on the chain, an IllegalArgumentException is thrown. Why then is it called a ChainReducer?
2) We have an option, chain.reducer.byValue, which decides whether a key/value pair is passed by value to the next Mapper/Reducer. But why is this property significant, given that the reducer is called last in the chain no matter what the order is, and there is nothing after it to pass output to?
Regards, Ravi Teja
*** This e-mail and attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!
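To make Bobby's answer concrete, here is a minimal sketch of wiring up a ChainReducer job with Hadoop 1.x's old mapred API: a single reducer sits at the head of the chain (fed by the one and only shuffle), and zero or more mappers then post-process its output in the same reduce task, with no second shuffle in between. The WordCountReducer and UpperCaseMapper classes are hypothetical placeholders you would supply yourself; only the ChainReducer.setReducer/addMapper calls and the byValue flag reflect the actual API. This is an illustrative sketch, not a complete runnable job.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainExample {

  public static void configureChain(JobConf job) {
    // Exactly one reducer may head the chain; trying to set a second one
    // throws IllegalArgumentException. The shuffle happens only once,
    // feeding this reducer.
    ChainReducer.setReducer(job, WordCountReducer.class, // hypothetical class
        Text.class, LongWritable.class,   // reducer input key/value (post-shuffle)
        Text.class, LongWritable.class,   // reducer output key/value
        true,                             // byValue: pass pairs by value, so the
                                          // reducer may safely reuse its objects
        new JobConf(false));              // per-stage private configuration

    // Zero or more mappers then transform the reducer's output in memory.
    // There is no re-shuffle between stages; if you need the reducer output
    // re-grouped by key, run a second reduce-only job instead.
    ChainReducer.addMapper(job, UpperCaseMapper.class,  // hypothetical class
        Text.class, LongWritable.class,
        Text.class, LongWritable.class,
        true, new JobConf(false));
  }
}
```

The byValue flag from Ravi's second question matters here: the reducer is not last in the chain, so its output does flow on to the chained mappers, and byValue controls whether each stage gets a defensive copy of the key/value objects or a shared reference.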