Custom error message and action for authentication failure
Hello All, We have a requirement to display a custom error message and do some bookkeeping when a user hits an authentication error while making a Hadoop call. We use Hadoop 1. How do we go about accomplishing this? benoy
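No answer followed on the list; as a rough illustration of one way to do this on Hadoop 1, the client call can be wrapped and the exception inspected before showing a custom message. This is a sketch only: the path and the recordFailure() bookkeeping hook are hypothetical, and whether an authentication failure surfaces as an AccessControlException or a plain IOException depends on the cluster's security configuration.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.AccessControlException;

    public class GuardedHdfsCall {
        public static void main(String[] args) {
            try {
                FileSystem fs = FileSystem.get(new Configuration());
                fs.getFileStatus(new Path("/user/benoy/data")); // any Hadoop call (example path)
            } catch (AccessControlException ace) {
                recordFailure(ace); // bookkeeping on an authorization/authentication denial
                System.err.println("Access to the cluster was denied; please contact the Hadoop admin team.");
            } catch (IOException ioe) {
                recordFailure(ioe); // other RPC/authentication errors also arrive as IOExceptions
                System.err.println("Could not talk to the cluster: " + ioe.getMessage());
            }
        }
        private static void recordFailure(Exception e) {
            // hypothetical hook: replace with the poster's own logging/metrics
        }
    }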
Re: How to update the timestamp of a file in HDFS
Hi , Try this touchz hadoop command. hadoop -touchz filename Thanks and Regards, Adi Reddy Murali On Thu, Sep 5, 2013 at 11:06 AM, Harsh J ha...@cloudera.com wrote: There's no shell command (equivalent to Linux's touch) but you can use the Java API: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#setTimes(org.apache.hadoop.fs.Path,%20long,%20long) On Thu, Sep 5, 2013 at 10:58 AM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, Can you please help on to update the date timestamp of a file in HDFS. regards, Rams -- Harsh J
Re: How to update the timestamp of a file in HDFS
The right usage of the command is: hadoop fs -touchz filename On Thu, Sep 5, 2013 at 12:14 PM, murali adireddy murali.adire...@gmail.com wrote: Hi, Try this touchz hadoop command. hadoop -touchz filename Thanks and Regards, Adi Reddy Murali On Thu, Sep 5, 2013 at 11:06 AM, Harsh J ha...@cloudera.com wrote: There's no shell command (equivalent to Linux's touch) but you can use the Java API: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#setTimes(org.apache.hadoop.fs.Path,%20long,%20long) On Thu, Sep 5, 2013 at 10:58 AM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, Can you please help on how to update the date timestamp of a file in HDFS. regards, Rams -- Harsh J
how to observe spill operation during shuffle in mapreduce?
Hi all, is there any MR spill count metric?
Re: how to observe spill operation during shuffle in mapreduce?
Hi, You can look at the job metrics from your JobTracker Web UI. The Spilled Records counter under the Map-Reduce Framework group displays the number of records spilled in both map and reduce tasks. Regards Ravi Magham. On Thu, Sep 5, 2013 at 12:23 PM, ch huang justlo...@gmail.com wrote: Hi all, is there any MR spill count metric?
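To read the same counter programmatically rather than from the web UI, something along these lines works against the new (org.apache.hadoop.mapreduce) API once the job has finished. This is a sketch: the TaskCounter enum shown is where the counter lives in Hadoop 2; on Hadoop 1 the same Spilled Records counter appears under the Map-Reduce Framework group of the job's counters.

    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;

    // assuming 'job' is an org.apache.hadoop.mapreduce.Job that has completed,
    // e.g. after job.waitForCompletion(true)
    Counter spilled = job.getCounters().findCounter(TaskCounter.SPILLED_RECORDS);
    System.out.println("Spilled Records = " + spilled.getValue());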
Re: Multidata center support
Hi friends hello baskar i think rack awareness and data center awareness are different and similarly nodes and data centers are different things from hadoops perspective but ideally it shud be same i mean nodes can be in different data centers right but i think hadoop doesnt not replicate data across data centers i am not sure abt this (can anyone please comment on this). federation can provide different namenodes so you can create independent clusters for example one cluster at one data center and another cluster at a different data center. but if hadoop can replicate across data centers then we need only one federation cluster for all data centers :)...are any of you guys using a single federation cluster across multiple data centers in production for example CASE--1 one cluster federation/data centers at US/Europe---(if hadoop can replicate across data centers ) NN1 --US DN1 US NN2 --Europe DN2 -Europe In this case data can be replicated to DN1 and DN2 -- CASE--2 two independent cluster federation/data centers at US/Europe--(if hadoop cannot replicate across data centers ) cluster 1 cluster 2 NN1 --US DN1 US NN2 --Europe DN2 -Europe In this case data cannot be replicated to DN2 or vice versa *Can anyone clarify which will be the right and optimal case for hadoop -:)* On Wed, Sep 4, 2013 at 2:20 AM, Baskar Duraikannu baskar.duraika...@outlook.com wrote: Rahul Are you talking about rack-awareness script? I did go through rack awareness. Here are the problems with rack awareness w.r.to my (given) business requirment 1. Hadoop , default places two copies on the same rack and 1 copy on some other rack. This would work as long as we have two data centers. if business wants to have three data centers, then data would not be spread across. Separately there is a question around whether it is the right thing to do or not. I have been promised by business that they would buy enough bandwidth such that each data center will be few milliseconds apart (in latency). 2. I believe Hadoop automatically re-replicates data if one or more node is down. Assume when one out of 2 data center goes down. There will be a massive data flow to create additional copies. When I say data center support, I should be able to configure hadoop to say a) Maintain 1 copy per data center b) If any data center goes down, dont create additional copies. Above requirements that I am pointing will essentially move hadoop from strongly consistent to a week/eventual consistent model. Since this changes fundamental architecture, it will probably break all sort of things... Might not be possible ever in Hadoop. Thoughts? Sadak Is there a way to implement above requirement via Federation? Thanks Baskar -- Date: Sun, 1 Sep 2013 00:20:04 +0530 Subject: Re: Multidata center support From: visioner.sa...@gmail.com To: user@hadoop.apache.org What do you think friends I think hadoop clusters can run on multiple data centers using FEDERATION On Sat, Aug 31, 2013 at 8:39 PM, Visioner Sadak visioner.sa...@gmail.comwrote: The only problem i guess hadoop wont be able to duplicate data from one data center to another but i guess i can identify data nodes or namenodes from another data center correct me if i am wrong On Sat, Aug 31, 2013 at 7:00 PM, Visioner Sadak visioner.sa...@gmail.comwrote: lets say that you have some machines in europe and some in US I think you just need the ips and configure them in your cluster set up it will work... 
On Sat, Aug 31, 2013 at 7:52 AM, Jun Ping Du j...@vmware.com wrote: Hi, Although you can set datacenter layer on your network topology, it is never enabled in hadoop as lacking of replica placement and task scheduling support. There are some work to add layers other than rack and node under HADOOP-8848 but may not suit for your case. Agree with Adam that a cluster spanning multiple data centers seems not make sense even for DR case. Do you have other cases to do such a deployment? Thanks, Junping -- *From: *Adam Muise amu...@hortonworks.com *To: *user@hadoop.apache.org *Sent: *Friday, August 30, 2013 6:26:54 PM *Subject: *Re: Multidata center support Nothing has changed. DR best practice is still one (or more) clusters per site and replication is handled via distributed copy or some variation of it. A cluster spanning
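For readers following the rack-awareness part of this thread: the script Baskar refers to is wired in via the topology script property in core-site.xml, and it maps each host name or IP passed as an argument to a network path such as /dc1/rack1. A minimal sketch of the configuration; the script path is an example, the property is named topology.script.file.name on Hadoop 1 and net.topology.script.file.name on Hadoop 2, and, as Jun Ping notes, stock Hadoop only acts on the rack level of whatever path the script prints.

    <!-- core-site.xml -->
    <property>
      <name>topology.script.file.name</name>
      <value>/etc/hadoop/conf/topology.sh</value>
    </property>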
Re: M/R API and Writable semantics in reducer
Hi, is there anyone interested in this topic? Basically, what I'm trying to find out is whether it is 'safe' to rely on the side effect of the key being updated while iterating over the values. I believe that there must be someone who is also interested in this; the secondary sort pattern is very common (at least in our jobs). So far, we have been emulating the GroupingComparator by holding state in the Reducer class and therefore being able to keep track of 'groups' of keys among several calls to the reduce() method. This approach seems quite safe in terms of the API, but the code is not as pretty (and is vulnerable to ugly bugs if you forget to reset the state correctly, for instance). On the other hand, if the way the key gets updated while iterating the values is to be considered a contract of the MapReduce API, I think it should be implemented in MRUnit (or you basically cannot use MRUnit to unit-test your job), and if it isn't, then it is probably a bug. If this is internal behavior and might be subject to change anytime, then it clearly seems that keeping the state in the Reducer is the only option. Does anyone else have similar considerations? How do others implement the secondary sort? Thanks, Jan On 09/02/2013 03:29 PM, Jan Lukavský wrote: Hi all, some time ago I wrote a note to this list that it would be nice if it were possible to get the *real* key emitted from mapper to reducer when using the GroupingComparator. I got the answer that it is possible, because of the Writable semantics, and that currently the following holds: @Override protected void reduce(Key key, Iterable<Value> values, Context context) { for (Value v : values) { // The key MIGHT change its value in this cycle, because readFields() will be called on it. // When using a GroupingComparator that groups only by some part of the key, // many different keys might be considered a single group, so the *real* data matters. } } When you use a GroupingComparator the contents of the key can matter, because if you cannot access it, you have to duplicate the data in the value (which means more network traffic in the shuffle phase, and more I/O generally). Now, the question is how much this is a reliable part of the API, or how likely it is that relying on this feature might break in future versions. To me, it seems more like a side effect that is not guaranteed to be maintained in the future. There already exists a suggestion that this is probably very fragile, because MRUnit seems not to update the key during the iteration. Does anyone have any suggested way around this? Is the 'official' preferred way of accessing the original key to call context.getCurrentKey()? Isn't this the same case? Wouldn't it be nice if the API itself had some guarantees or suggestions about how it works? I can imagine a modified reduce() method with a signature like protected void reduce(Key key, Iterable<Pair<Key, Value>> keyValues, Context context); This seems easily transformable to the old call (which could be the default implementation of this method). Any opinion on this? Thanks, Jan
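For anyone who finds this thread while looking for the secondary-sort pattern Jan describes, the usual driver wiring (independent of the key-mutation question raised above) is a composite key plus separate partitioner, sort comparator and grouping comparator. A sketch, where the four classes are the user's own and the names are placeholders:

    // new (org.apache.hadoop.mapreduce) API, on a Job instance named 'job'
    job.setMapOutputKeyClass(CompositeKey.class);             // natural key + secondary sort field
    job.setPartitionerClass(NaturalKeyPartitioner.class);     // partition on the natural key only
    job.setSortComparatorClass(FullKeyComparator.class);      // order by natural key, then secondary field
    job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); // one reduce() call per natural key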
Re: How to update the timestamp of a file in HDFS
Murali, The touchz creates a zero sized file. It does not allow modifying a timestamp like Linux's touch command does, which is what the OP seems to be asking about. On Thu, Sep 5, 2013 at 12:14 PM, murali adireddy murali.adire...@gmail.com wrote: right usage of command is: hadoop fs - touchz filename On Thu, Sep 5, 2013 at 12:14 PM, murali adireddy murali.adire...@gmail.com wrote: Hi , Try this touchz hadoop command. hadoop -touchz filename Thanks and Regards, Adi Reddy Murali On Thu, Sep 5, 2013 at 11:06 AM, Harsh J ha...@cloudera.com wrote: There's no shell command (equivalent to Linux's touch) but you can use the Java API: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#setTimes(org.apache.hadoop.fs.Path,%20long,%20long) On Thu, Sep 5, 2013 at 10:58 AM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, Can you please help on to update the date timestamp of a file in HDFS. regards, Rams -- Harsh J -- Harsh J
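For reference, a minimal sketch of the setTimes() call from the javadoc Harsh links to; the path is just an example, and per the javadoc a value of -1 leaves the corresponding time unchanged:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsTouch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            long now = System.currentTimeMillis();
            // setTimes(path, modificationTime, accessTime)
            fs.setTimes(new Path("/user/rams/somefile"), now, now);
        }
    }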
Re: what is the difference between mapper and identity mapper, reducer and identity reducer?
Identity Mapper and Reducer are just like the concept of the identity function in mathematics, i.e. they do not transform the input and return it as-is in the output. The Identity Mapper takes the input key/value pair and emits it without any processing. The case of the identity reducer is a bit different. It does not mean that the reduce step will not take place. It will take place, and the related sorting and shuffling will also be performed, but there will be no aggregation. So you can use an identity reducer if you want to sort the data coming from the map but don't care for any grouping, and are also fine with multiple reducer output files (unlike using 1 reducer). Regards, Shahab On Thu, Sep 5, 2013 at 9:43 AM, mallik arjun mallik.cl...@gmail.com wrote: hi all, please tell me what is the difference between mapper and identity mapper, reducer and identity reducer. thanks in advance.
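To make the contrast concrete: the old (org.apache.hadoop.mapred) API ships explicit identity classes, while in the new (org.apache.hadoop.mapreduce) API the base Mapper and Reducer classes already behave as identities. A small sketch of both, assuming a JobConf named jobConf for the old API:

    // old API: explicit identity classes shipped with Hadoop
    jobConf.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);
    jobConf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);

    // new API: simply not calling job.setMapperClass()/job.setReducerClass()
    // leaves the base Mapper/Reducer in place, which pass records through unchanged
    // while the usual sort and shuffle still happen between map and reduce.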
Re: Symbolic Link in Hadoop 1.0.4
The FileContext API and symlink functionality are not available in 1.0. They are only available in the 0.23 and 2.x releases. On Thu, Sep 5, 2013 at 8:06 AM, Gobilliard, Olivier olivier.gobilli...@cartesian.com wrote: Hi, I am using Hadoop 1.0.4 and need to create a symbolic link in HDFS. This feature was added in Hadoop 0.21.0 (https://issues.apache.org/jira/browse/HDFS-245) in the new FileContext API (http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileContext.html). However, I cannot find the FileContext API in the 1.0.4 release (http://archive.apache.org/dist/hadoop/core/hadoop-1.0.4/). I cannot find it in any of the 1.X releases, actually. Has this functionality been moved to another class? Many thanks, Olivier
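For completeness, on the releases where FileContext is present (0.23 / 2.x), creating the symlink looks roughly like this; the two paths are examples only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileContext;
    import org.apache.hadoop.fs.Path;

    // not available on Hadoop 1.0.4, as noted above
    FileContext fc = FileContext.getFileContext(new Configuration());
    fc.createSymlink(new Path("/data/actual/file"),   // existing target
                     new Path("/data/link-to-file"),  // symlink to create
                     false);                          // createParent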
Question related to resource allocation in Yarn!
Hi, I am trying to build a small PoC on top of YARN. Within the launched application master, I am trying to request 50 containers and launch the same task on those allocated containers. My config: the AM registration response shows minimumCapability { memory: 1024, virtual_cores: 1 }, maximumCapability { memory: 8192, virtual_cores: 32 }. 1) I am asking the RM for containers of 1 GB memory and 1 core. Ideally the RM should return me 6-7 containers, but the response always returns only 2 containers. Why is that? 2) So, when 2 containers are returned on the first ask, I then ask the RM again for 50 - 2 = 48 containers. I keep getting 0 containers, even after the previously started containers have finished. Why is that? Any link explaining the RM's allocate request would be very helpful. Thanks, Rahul
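No detailed answer followed on the list, but for reference, a request for many small containers through the 2.1.x client library looks roughly like the sketch below, assuming an already registered AMRMClient<ContainerRequest> named amRmClient. Containers are handed out incrementally across successive allocate() heartbeats rather than all in one response, which is consistent with what the poster sees on the first call.

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    // ask for 50 containers of 1 GB / 1 vcore each
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    for (int i = 0; i < 50; i++) {
        amRmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
    }
    // keep calling amRmClient.allocate(progress) periodically; allocated containers
    // arrive spread over several of those calls.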
Symbolic Link in Hadoop 1.0.4
Hi, I am using Hadoop 1.0.4 and need to create a symbolic link in HDFS. This feature was added in Hadoop 0.21.0 (https://issues.apache.org/jira/browse/HDFS-245) in the new FileContext API (http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileContext.html). However, I cannot find the FileContext API in the 1.0.4 release (http://archive.apache.org/dist/hadoop/core/hadoop-1.0.4/). I cannot find it in any of the 1.X releases, actually. Has this functionality been moved to another class? Many thanks, Olivier
RE: Multidata center support
Thanks Mike. I am assuming that it is a poor idea due to network bandwidth constraints across data centers (backplane speed of a TOR switch is typically greater than data center connectivity). From: michael_se...@hotmail.com Subject: Re: Multidata center support Date: Wed, 4 Sep 2013 20:15:08 -0500 To: user@hadoop.apache.org Sorry, it's a poor idea period. It's one thing for something like Cleversafe to span a data center, but you also have a unit of work in terms of map/reduce. Think about all of the bad things that can happen when you have to deal with a sort/shuffle stage across data centers... (It's not a pretty sight.) As Adam points out... DR and copies across data centers are one thing. Running a single cluster spanning data centers... I would hate to be you when you have to face your devOps team. Does the expression BOFH ring a bell? ;-) HTH -Mike On Aug 30, 2013, at 5:26 AM, Adam Muise amu...@hortonworks.com wrote: Nothing has changed. DR best practice is still one (or more) clusters per site and replication is handled via distributed copy or some variation of it. A cluster spanning multiple data centers is a poor idea right now. On Fri, Aug 30, 2013 at 12:35 AM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: My take on this. Why hadoop has to know about data center thing. I think it can be installed across multiple data centers , however topology configuration would be required to tell which node belongs to which data center and switch for block placement. Thanks, Rahul On Fri, Aug 30, 2013 at 12:42 AM, Baskar Duraikannu baskar.duraika...@outlook.com wrote: We have a need to setup hadoop across data centers. Does hadoop support multi data center configuration? I searched through archives and have found that hadoop did not support multi data center configuration some time back. Just wanted to see whether situation has changed. Please help. -- Adam Muise, Solution Engineer, Hortonworks, amuise@hortonworks.com, 416-417-4037 The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. Use at your own risk. Michael Segel michael_segel (AT) hotmail.com
Re: Disc not equally utilized in hdfs data nodes
Please share your hdfs-site.xml. HDFS needs to be configured to use all 4 disk mounts - it does not auto-discover and use all drives today. On Thu, Sep 5, 2013 at 10:48 PM, Viswanathan J jayamviswanat...@gmail.com wrote: Hi, The data which are storing in data nodes are not equally utilized in all the data directories. We having 4x1 TB drives, but huge data storing in single disc only at all the nodes. How to balance for utilize all the drives. This causes the hdfs storage size becomes high very soon even though we have available space. Thanks, Viswa.J -- Harsh J
Disc not equally utilized in hdfs data nodes
Hi, The data being stored on the data nodes is not spread equally across all the data directories. We have 4x1 TB drives, but huge amounts of data are being stored on a single disc only, on all the nodes. How do we balance this so that all the drives are utilized? This causes the HDFS storage usage to grow very quickly even though we have space available. Thanks, Viswa.J
RE: yarn-site.xml and aux-services
Harsh, Thanks as usual for your sage advice. I was hoping to avoid actually installing anything on individual Hadoop nodes and finessing the service by spawning it from a task using LocalResources, but this is probably fraught with trouble. FWIW, I would vote to be able to load YARN services from HDFS. What is the appropriate forum to file a request like that? Thanks John -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: Wednesday, September 04, 2013 12:05 AM To: user@hadoop.apache.org Subject: Re: yarn-site.xml and aux-services Thanks for the clarification. I would find it very convenient in this case to have my custom jars available in HDFS, but I can see the added complexity needed for YARN to maintain cache those to local disk. We could class-load directly from HDFS, like HBase Co-Processors do. Consider a scenario analogous to the MR shuffle, where the persistent service serves up mapper output files to the reducers across the network: Isn't this more complex than just running a dedicated service all the time, and/or implementing a way to spawn/end a dedicated service temporarily? I'd pick trying to implement such a thing than have my containers implement more logic. On Fri, Aug 23, 2013 at 11:17 PM, John Lilley john.lil...@redpoint.net wrote: Harsh, Thanks for the clarification. I would find it very convenient in this case to have my custom jars available in HDFS, but I can see the added complexity needed for YARN to maintain cache those to local disk. What about having the tasks themselves start the per-node service as a child process? I've been told that the NM kills the process group, but won't setgrp() circumvent that? Even given that, would the child process of one task have proper environment and permission to act on behalf of other tasks? Consider a scenario analogous to the MR shuffle, where the persistent service serves up mapper output files to the reducers across the network: 1) AM spawns mapper-like tasks around the cluster 2) Each mapper-like task on a given node launches a persistent service child, but only if one is not already running. 3) Each mapper-like task writes one or more output files, and informs the service of those files (along with AM-id, Task-id etc). 4) AM spawns reducer-like tasks around the cluster. 5) Each reducer-like task is told which nodes contain mapper result data, and connects to services on those nodes to read the data. There are some details missing, like how the lifetime of the temporary files is controlled to extend beyond the mapper-like task lifetime but still be cleaned up on AM exit, and how the reducer-like tasks are informed of which nodes have data. John -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: Friday, August 23, 2013 11:00 AM To: user@hadoop.apache.org Subject: Re: yarn-site.xml and aux-services The general practice is to install your deps into a custom location such as /opt/john-jars, and extend YARN_CLASSPATH to include the jars, while also configuring the classes under the aux-services list. You need to take care of deploying jar versions to /opt/john-jars/ contents across the cluster though. I think it may be a neat idea to have jars be placed on HDFS or any other DFS, and the yarn-site.xml indicating the location plus class to load. Similar to HBase co-processors. But I'll defer to Vinod on if this would be a good thing to do. (I know the right next thing with such an ability people will ask for is hot-code-upgrades...) 
On Fri, Aug 23, 2013 at 10:11 PM, John Lilley john.lil...@redpoint.net wrote: Are there recommended conventions for adding additional code to a stock Hadoop install? It would be nice if we could piggyback on whatever mechanisms are used to distribute hadoop itself around the cluster. john From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com] Sent: Thursday, August 22, 2013 6:25 PM To: user@hadoop.apache.org Subject: Re: yarn-site.xml and aux-services Auxiliary services are essentially administer-configured services. So, they have to be set up at install time - before NM is started. +Vinod On Thu, Aug 22, 2013 at 1:38 PM, John Lilley john.lil...@redpoint.net wrote: Following up on this, how exactly does one *install* the jar(s) for auxiliary service? Can it be shipped out with the LocalResources of an AM? MapReduce's aux-service is presumably installed with Hadoop and is just sitting there in the right place, but if one wanted to make a whole new aux-service that belonged with an AM, how would one do it? John -Original Message- From: John Lilley [mailto:john.lil...@redpoint.net] Sent: Wednesday, June 05, 2013 11:41 AM To: user@hadoop.apache.org Subject: RE: yarn-site.xml and aux-services Wow, thanks. Is this documented anywhere other than the code? I hate to waste y'alls time on things that
Re: SNN not writing data fs.checkpoint.dir location
Please share your Hadoop version and hdfs-site.xml conf. Also, I'm assuming that you already restarted your cluster after changing fs.checkpoint.dir. Thanks On 9/5/13, Munna munnava...@gmail.com wrote: Hi, I have configured fs.checkpoint.dir in hdfs-site.xml, but it is still writing to the /tmp location. Please give me a solution to get checkpointing onto the configured location. -- Regards, Munna
RE: Multidata center support
Currently there is no relation betweeen weak consistency and hadoop. I just spent more time thinking about the requirement (as outlined below) a) Maintain total of 3 data centers b) Maintain 1 copy per data center c) If any data center goes down, dont create additional copies. Above is not a valid model, especially requirement (c). Because this will take away Strong Consistency model supported by Hadoop. Hope this explains. I believe we can give up on requirement (c). I more currently exploring to see whether anyway to achieve (a) and (b). Requirement (b) can also be relaxed to have more copies per data center if needed From: rahul.rec@gmail.com Date: Wed, 4 Sep 2013 10:04:49 +0530 Subject: Re: Multidata center support To: user@hadoop.apache.org Under replicated blocks are also consistent from a consumers point. Care of explain the relation to weak consistency to hadoop. Thanks, Rahul On Wed, Sep 4, 2013 at 9:56 AM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Adam's response makes more sense to me to offline replicate generated data from one cluster to another across data centers. Not sure if configurable block placement block placement policy is supported in Hadoop.If yes , then alone side with rack awareness , you should be able to achieve the same. I could not follow your question related to weak consistency. Thanks, Rahul On Wed, Sep 4, 2013 at 2:20 AM, Baskar Duraikannu baskar.duraika...@outlook.com wrote: RahulAre you talking about rack-awareness script? I did go through rack awareness. Here are the problems with rack awareness w.r.to my (given) business requirment 1. Hadoop , default places two copies on the same rack and 1 copy on some other rack. This would work as long as we have two data centers. if business wants to have three data centers, then data would not be spread across. Separately there is a question around whether it is the right thing to do or not. I have been promised by business that they would buy enough bandwidth such that each data center will be few milliseconds apart (in latency). 2. I believe Hadoop automatically re-replicates data if one or more node is down. Assume when one out of 2 data center goes down. There will be a massive data flow to create additional copies. When I say data center support, I should be able to configure hadoop to say a) Maintain 1 copy per data center b) If any data center goes down, dont create additional copies. Above requirements that I am pointing will essentially move hadoop from strongly consistent to a week/eventual consistent model. Since this changes fundamental architecture, it will probably break all sort of things... Might not be possible ever in Hadoop. Thoughts? SadakIs there a way to implement above requirement via Federation? ThanksBaskar Date: Sun, 1 Sep 2013 00:20:04 +0530 Subject: Re: Multidata center support From: visioner.sa...@gmail.com To: user@hadoop.apache.org What do you think friends I think hadoop clusters can run on multiple data centers using FEDERATION On Sat, Aug 31, 2013 at 8:39 PM, Visioner Sadak visioner.sa...@gmail.com wrote: The only problem i guess hadoop wont be able to duplicate data from one data center to another but i guess i can identify data nodes or namenodes from another data center correct me if i am wrong On Sat, Aug 31, 2013 at 7:00 PM, Visioner Sadak visioner.sa...@gmail.com wrote: lets say that you have some machines in europe and some in US I think you just need the ips and configure them in your cluster set upit will work... 
On Sat, Aug 31, 2013 at 7:52 AM, Jun Ping Du j...@vmware.com wrote: Hi,Although you can set datacenter layer on your network topology, it is never enabled in hadoop as lacking of replica placement and task scheduling support. There are some work to add layers other than rack and node under HADOOP-8848 but may not suit for your case. Agree with Adam that a cluster spanning multiple data centers seems not make sense even for DR case. Do you have other cases to do such a deployment? Thanks, Junping From: Adam Muise amu...@hortonworks.com To: user@hadoop.apache.org Sent: Friday, August 30, 2013 6:26:54 PM Subject: Re: Multidata center support Nothing has changed. DR best practice is still one (or more) clusters per site and replication is handled via distributed copy or some variation of it. A cluster spanning multiple data centers is a poor idea right now. On Fri, Aug 30, 2013 at 12:35 AM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: My take on this. Why hadoop has to know about data center thing. I think it can be installed across multiple data centers , however topology configuration would be required to tell which node belongs to which data center and switch for block placement. Thanks, Rahul On Fri, Aug 30, 2013 at 12:42 AM, Baskar Duraikannu
Re: SNN not writing data fs.checkpoint.dir location
These configs need to be present at SNN, not at just the NN. On Thu, Sep 5, 2013 at 11:54 PM, Munna munnava...@gmail.com wrote: Hi Yadav, We are using CDH3 and I restarted after changing configuration. <property> <name>fs.checkpoint.dir</name> <value>/data/1/dfs/snn,/nfsmount/dfs/snn</value> <final>true</final> </property> <property> <name>fs.checkpoint.period</name> <value>3600</value> <description>The number of seconds between two periodic checkpoints</description> </property> I have entered these changes in Namenode only. On Thu, Sep 5, 2013 at 11:47 PM, Jitendra Yadav jeetuyadav200...@gmail.com wrote: Please share your Hadoop version and hdfs-site.xml conf also I'm assuming that you already restarted your cluster after changing fs.checkpoint.dir. Thanks On 9/5/13, Munna munnava...@gmail.com wrote: Hi, I have configured fs.checkpoint.dir in hdfs-site.xml, but still it was writing in /tmp location. Please give me some solution for checkpointing on respective location. -- Regards, Munna -- Regards Munna -- Harsh J
Re: SNN not writing data fs.checkpoint.dir location
Hi, Well I think you should only restart your SNN after the change. Also refer the checkpoint directory for any 'in_use.lock' file. Thanks Jitendra On 9/6/13, Munna munnava...@gmail.com wrote: Thank you Jitendar. After chage these perameter on SNN, is it require to restart NN also? please confirm... On Fri, Sep 6, 2013 at 12:10 AM, Jitendra Yadav jeetuyadav200...@gmail.comwrote: Hi, If you are running SNN on same node as NN then it's ok otherwise you should add these properties at SNN side too. Thanks Jitendra On 9/6/13, Munna munnava...@gmail.com wrote: you mean that same configurations are required as NN in SNN On Thu, Sep 5, 2013 at 11:58 PM, Harsh J ha...@cloudera.com wrote: These configs need to be present at SNN, not at just the NN. On Thu, Sep 5, 2013 at 11:54 PM, Munna munnava...@gmail.com wrote: Hi Yadav, We are using CDH3 and I restarted after changing configuration. property namefs.checkpoint.dir/name value/data/1/dfs/snn,/nfsmount/dfs/snn/value finaltrue/final /property property namefs.checkpoint.period/name value3600/value descriptionThe number of seconds between two periodic checkpoints/description /property I have entered these changes in Namenode only. On Thu, Sep 5, 2013 at 11:47 PM, Jitendra Yadav jeetuyadav200...@gmail.com wrote: Please share your Hadoop version and hdfs-site.xml conf also I'm assuming that you already restarted your cluster after changing fs.checkpoint.dir. Thanks On 9/5/13, Munna munnava...@gmail.com wrote: Hi, I have configured fs.checkpoint.dir in hdfs-site.xml, but still it was writing in /tmp location. Please give me some solution for checkpointing on respective location. -- *Regards* * * *Munna* -- Regards Munna -- Harsh J -- *Regards* * * *Munna* -- *Regards* * * *Munna*
Re: SNN not writing data fs.checkpoint.dir location
in_use.lock ? On Fri, Sep 6, 2013 at 12:26 AM, Jitendra Yadav jeetuyadav200...@gmail.comwrote: Hi, Well I think you should only restart your SNN after the change. Also refer the checkpoint directory for any 'in_use.lock' file. Thanks Jitendra On 9/6/13, Munna munnava...@gmail.com wrote: Thank you Jitendar. After chage these perameter on SNN, is it require to restart NN also? please confirm... On Fri, Sep 6, 2013 at 12:10 AM, Jitendra Yadav jeetuyadav200...@gmail.comwrote: Hi, If you are running SNN on same node as NN then it's ok otherwise you should add these properties at SNN side too. Thanks Jitendra On 9/6/13, Munna munnava...@gmail.com wrote: you mean that same configurations are required as NN in SNN On Thu, Sep 5, 2013 at 11:58 PM, Harsh J ha...@cloudera.com wrote: These configs need to be present at SNN, not at just the NN. On Thu, Sep 5, 2013 at 11:54 PM, Munna munnava...@gmail.com wrote: Hi Yadav, We are using CDH3 and I restarted after changing configuration. property namefs.checkpoint.dir/name value/data/1/dfs/snn,/nfsmount/dfs/snn/value finaltrue/final /property property namefs.checkpoint.period/name value3600/value descriptionThe number of seconds between two periodic checkpoints/description /property I have entered these changes in Namenode only. On Thu, Sep 5, 2013 at 11:47 PM, Jitendra Yadav jeetuyadav200...@gmail.com wrote: Please share your Hadoop version and hdfs-site.xml conf also I'm assuming that you already restarted your cluster after changing fs.checkpoint.dir. Thanks On 9/5/13, Munna munnava...@gmail.com wrote: Hi, I have configured fs.checkpoint.dir in hdfs-site.xml, but still it was writing in /tmp location. Please give me some solution for checkpointing on respective location. -- *Regards* * * *Munna* -- Regards Munna -- Harsh J -- *Regards* * * *Munna* -- *Regards* * * *Munna* -- *Regards* * * *Munna*
Re: SNN not writing data fs.checkpoint.dir location
Hi, This means that your specified checkpoint directory has been locked by SNN for use. Thanks Jitendra On 9/6/13, Munna munnava...@gmail.com wrote: in_use.lock ? On Fri, Sep 6, 2013 at 12:26 AM, Jitendra Yadav jeetuyadav200...@gmail.comwrote: Hi, Well I think you should only restart your SNN after the change. Also refer the checkpoint directory for any 'in_use.lock' file. Thanks Jitendra On 9/6/13, Munna munnava...@gmail.com wrote: Thank you Jitendar. After chage these perameter on SNN, is it require to restart NN also? please confirm... On Fri, Sep 6, 2013 at 12:10 AM, Jitendra Yadav jeetuyadav200...@gmail.comwrote: Hi, If you are running SNN on same node as NN then it's ok otherwise you should add these properties at SNN side too. Thanks Jitendra On 9/6/13, Munna munnava...@gmail.com wrote: you mean that same configurations are required as NN in SNN On Thu, Sep 5, 2013 at 11:58 PM, Harsh J ha...@cloudera.com wrote: These configs need to be present at SNN, not at just the NN. On Thu, Sep 5, 2013 at 11:54 PM, Munna munnava...@gmail.com wrote: Hi Yadav, We are using CDH3 and I restarted after changing configuration. property namefs.checkpoint.dir/name value/data/1/dfs/snn,/nfsmount/dfs/snn/value finaltrue/final /property property namefs.checkpoint.period/name value3600/value descriptionThe number of seconds between two periodic checkpoints/description /property I have entered these changes in Namenode only. On Thu, Sep 5, 2013 at 11:47 PM, Jitendra Yadav jeetuyadav200...@gmail.com wrote: Please share your Hadoop version and hdfs-site.xml conf also I'm assuming that you already restarted your cluster after changing fs.checkpoint.dir. Thanks On 9/5/13, Munna munnava...@gmail.com wrote: Hi, I have configured fs.checkpoint.dir in hdfs-site.xml, but still it was writing in /tmp location. Please give me some solution for checkpointing on respective location. -- *Regards* * * *Munna* -- Regards Munna -- Harsh J -- *Regards* * * *Munna* -- *Regards* * * *Munna* -- *Regards* * * *Munna*
Re: ContainerLaunchContext in 2.1.x
Good question... There was a security problem earlier and to address that we removed it from ContainerLaunchContext. Today, if you check the payload, we are sending a Container which contains a ContainerToken. The ContainerToken is the secured channel for the RM to tell the NM about 1) ContainerId 2) Resource 3) User 4) NodeId. It is present there by default (irrespective of security). I hope it answers your doubt. Thanks, Omkar Joshi *Hortonworks Inc.* http://www.hortonworks.com On Wed, Sep 4, 2013 at 2:51 AM, Janne Valkealahti janne.valkeala...@gmail.com wrote: With 2.0.x ContainerId was part of the ContainerLaunchContext and I assume the container id was then used to identify what the node manager would actually start. With 2.1.x ContainerId was removed from ContainerLaunchContext. ContainerManagementProtocol only uses a list of StartContainerRequests which have a ContainerLaunchContext and a Token. My first question is: if you have different ContainerLaunchContexts (i.e. command, env variables, etc.), how do you know which container is launched with which launch context? My second question is: how is the node manager associating an allocated container (which you requested from the resource manager) to a ContainerLaunchContext?
Re: ContainerLaunchContext in 2.1.x
Other than that, you can find all API incompatible changes from 2.0.x to 2.1.x in this link: http://hortonworks.com/blog/stabilizing-yarn-apis-for-apache-hadoop-2-beta-and-beyond/ Jian On Thu, Sep 5, 2013 at 10:44 AM, Omkar Joshi ojo...@hortonworks.com wrote: Good question... There was a security problem earlier and to address that we removed it from ContainerLaunchContext. Today if you check the payload we are sending Container which contains ContainerToken. ContainerToken is the secured channel for RM to tell NM about 1) ContainerId 2) Resource 3) User 4) NodeId It is present there by default (irrespective of security). I hope it answers your doubt. Thanks, Omkar Joshi *Hortonworks Inc.* http://www.hortonworks.com On Wed, Sep 4, 2013 at 2:51 AM, Janne Valkealahti janne.valkeala...@gmail.com wrote: With 2.0.x ContainerId was part of the ContainerLaunchContext and I assume container id was then used to identify what node manager would actually start. With 2.1.x ContainerId was removed from ContainerLaunchContext. ContainerManagementProtocol is only using a list of StartContainerRequest which have ContainerLaunchContext and Token. My first question is that if you have different ContainerLaunchContext (i.e. command, env variables, etc), how do you know which container is launched with which launch context? My second question is how node manager is associating allocated container (which you requested from resource manager) to ContainerLaunchContext?
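As a concrete illustration of the association Omkar describes, in 2.1.x an AM using the client libraries pairs a launch context with a specific allocated Container (and its token) at start time. A rough sketch, where localResources, environment, commands, the allocated container and the started NMClient are assumed to exist already:

    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

    // the Container object returned by the RM carries the ContainerToken
    ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
        localResources, environment, commands,
        null,   // service data
        null,   // tokens
        null);  // ACLs
    nmClient.startContainer(container, ctx); // the context is tied to this specific container here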
Re: SNN not writing data fs.checkpoint.dir location
Hi, If you are running SNN on same node as NN then it's ok otherwise you should add these properties at SNN side too. Thanks Jitendra On 9/6/13, Munna munnava...@gmail.com wrote: you mean that same configurations are required as NN in SNN On Thu, Sep 5, 2013 at 11:58 PM, Harsh J ha...@cloudera.com wrote: These configs need to be present at SNN, not at just the NN. On Thu, Sep 5, 2013 at 11:54 PM, Munna munnava...@gmail.com wrote: Hi Yadav, We are using CDH3 and I restarted after changing configuration. property namefs.checkpoint.dir/name value/data/1/dfs/snn,/nfsmount/dfs/snn/value finaltrue/final /property property namefs.checkpoint.period/name value3600/value descriptionThe number of seconds between two periodic checkpoints/description /property I have entered these changes in Namenode only. On Thu, Sep 5, 2013 at 11:47 PM, Jitendra Yadav jeetuyadav200...@gmail.com wrote: Please share your Hadoop version and hdfs-site.xml conf also I'm assuming that you already restarted your cluster after changing fs.checkpoint.dir. Thanks On 9/5/13, Munna munnava...@gmail.com wrote: Hi, I have configured fs.checkpoint.dir in hdfs-site.xml, but still it was writing in /tmp location. Please give me some solution for checkpointing on respective location. -- *Regards* * * *Munna* -- Regards Munna -- Harsh J -- *Regards* * * *Munna*
Re: SNN not writing data fs.checkpoint.dir location
Thank you Jitendar. After chage these perameter on SNN, is it require to restart NN also? please confirm... On Fri, Sep 6, 2013 at 12:10 AM, Jitendra Yadav jeetuyadav200...@gmail.comwrote: Hi, If you are running SNN on same node as NN then it's ok otherwise you should add these properties at SNN side too. Thanks Jitendra On 9/6/13, Munna munnava...@gmail.com wrote: you mean that same configurations are required as NN in SNN On Thu, Sep 5, 2013 at 11:58 PM, Harsh J ha...@cloudera.com wrote: These configs need to be present at SNN, not at just the NN. On Thu, Sep 5, 2013 at 11:54 PM, Munna munnava...@gmail.com wrote: Hi Yadav, We are using CDH3 and I restarted after changing configuration. property namefs.checkpoint.dir/name value/data/1/dfs/snn,/nfsmount/dfs/snn/value finaltrue/final /property property namefs.checkpoint.period/name value3600/value descriptionThe number of seconds between two periodic checkpoints/description /property I have entered these changes in Namenode only. On Thu, Sep 5, 2013 at 11:47 PM, Jitendra Yadav jeetuyadav200...@gmail.com wrote: Please share your Hadoop version and hdfs-site.xml conf also I'm assuming that you already restarted your cluster after changing fs.checkpoint.dir. Thanks On 9/5/13, Munna munnava...@gmail.com wrote: Hi, I have configured fs.checkpoint.dir in hdfs-site.xml, but still it was writing in /tmp location. Please give me some solution for checkpointing on respective location. -- *Regards* * * *Munna* -- Regards Munna -- Harsh J -- *Regards* * * *Munna* -- *Regards* * * *Munna*
RE: yarn-site.xml and aux-services
https://issues.apache.org/jira/browse/YARN-1151 --john -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: Thursday, September 05, 2013 12:14 PM To: user@hadoop.apache.org Subject: Re: yarn-site.xml and aux-services Please log a JIRA on https://issues.apache.org/jira/browse/YARN (do let the thread know the ID as well, in spirit of http://xkcd.com/979/) :) On Thu, Sep 5, 2013 at 11:41 PM, John Lilley john.lil...@redpoint.net wrote: Harsh, Thanks as usual for your sage advice. I was hoping to avoid actually installing anything on individual Hadoop nodes and finessing the service by spawning it from a task using LocalResources, but this is probably fraught with trouble. FWIW, I would vote to be able to load YARN services from HDFS. What is the appropriate forum to file a request like that? Thanks John -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: Wednesday, September 04, 2013 12:05 AM To: user@hadoop.apache.org Subject: Re: yarn-site.xml and aux-services Thanks for the clarification. I would find it very convenient in this case to have my custom jars available in HDFS, but I can see the added complexity needed for YARN to maintain cache those to local disk. We could class-load directly from HDFS, like HBase Co-Processors do. Consider a scenario analogous to the MR shuffle, where the persistent service serves up mapper output files to the reducers across the network: Isn't this more complex than just running a dedicated service all the time, and/or implementing a way to spawn/end a dedicated service temporarily? I'd pick trying to implement such a thing than have my containers implement more logic. On Fri, Aug 23, 2013 at 11:17 PM, John Lilley john.lil...@redpoint.net wrote: Harsh, Thanks for the clarification. I would find it very convenient in this case to have my custom jars available in HDFS, but I can see the added complexity needed for YARN to maintain cache those to local disk. What about having the tasks themselves start the per-node service as a child process? I've been told that the NM kills the process group, but won't setgrp() circumvent that? Even given that, would the child process of one task have proper environment and permission to act on behalf of other tasks? Consider a scenario analogous to the MR shuffle, where the persistent service serves up mapper output files to the reducers across the network: 1) AM spawns mapper-like tasks around the cluster 2) Each mapper-like task on a given node launches a persistent service child, but only if one is not already running. 3) Each mapper-like task writes one or more output files, and informs the service of those files (along with AM-id, Task-id etc). 4) AM spawns reducer-like tasks around the cluster. 5) Each reducer-like task is told which nodes contain mapper result data, and connects to services on those nodes to read the data. There are some details missing, like how the lifetime of the temporary files is controlled to extend beyond the mapper-like task lifetime but still be cleaned up on AM exit, and how the reducer-like tasks are informed of which nodes have data. John -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: Friday, August 23, 2013 11:00 AM To: user@hadoop.apache.org Subject: Re: yarn-site.xml and aux-services The general practice is to install your deps into a custom location such as /opt/john-jars, and extend YARN_CLASSPATH to include the jars, while also configuring the classes under the aux-services list. 
You need to take care of deploying jar versions to /opt/john-jars/ contents across the cluster though. I think it may be a neat idea to have jars be placed on HDFS or any other DFS, and the yarn-site.xml indicating the location plus class to load. Similar to HBase co-processors. But I'll defer to Vinod on if this would be a good thing to do. (I know the right next thing with such an ability people will ask for is hot-code-upgrades...) On Fri, Aug 23, 2013 at 10:11 PM, John Lilley john.lil...@redpoint.net wrote: Are there recommended conventions for adding additional code to a stock Hadoop install? It would be nice if we could piggyback on whatever mechanisms are used to distribute hadoop itself around the cluster. john From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com] Sent: Thursday, August 22, 2013 6:25 PM To: user@hadoop.apache.org Subject: Re: yarn-site.xml and aux-services Auxiliary services are essentially administer-configured services. So, they have to be set up at install time - before NM is started. +Vinod On Thu, Aug 22, 2013 at 1:38 PM, John Lilley john.lil...@redpoint.net wrote: Following up on this, how exactly does one *install* the jar(s) for auxiliary service? Can it be shipped out with the LocalResources of an AM? MapReduce's aux-service is presumably
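For reference, the administrator-configured registration discussed in this thread is done in yarn-site.xml along these lines; the first entry is the stock MapReduce shuffle (service name and class as used by the 2.1.x line, though the name has changed between releases), and the second entry is a hypothetical custom service whose jar would have to be on the NodeManager classpath, e.g. under a directory like /opt/john-jars as mentioned above:

    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle,john_service</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.john_service.class</name>
      <value>com.example.JohnAuxService</value>
    </property>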
Re: Disc not equally utilized in hdfs data nodes
The spaces may be a problem if you are using the older 1.x releases. Please try to specify the list without spaces, and also check if all of these paths exist and have some DN owned directories under them. Please also keep the lists in CC/TO when replying. Clicking Reply-to-all usually helps do this automatically. On Thu, Sep 5, 2013 at 11:16 PM, Viswanathan J jayamviswanat...@gmail.com wrote: Hi Harsh, dfs.data.dir property we defined the values as in comma separated, /mnt/hadoop0/hdfs, /mnt/hadoop1/hdfs, /mnt/hadoop2/hdfs, /mnt/hadoop3/hdfs The above values are different devices. Thanks, V On Sep 5, 2013 10:53 PM, Harsh J ha...@cloudera.com wrote: Please share your hdfs-site.xml. HDFS needs to be configured to use all 4 disk mounts - it does not auto-discover and use all drives today. On Thu, Sep 5, 2013 at 10:48 PM, Viswanathan J jayamviswanat...@gmail.com wrote: Hi, The data which are storing in data nodes are not equally utilized in all the data directories. We having 4x1 TB drives, but huge data storing in single disc only at all the nodes. How to balance for utilize all the drives. This causes the hdfs storage size becomes high very soon even though we have available space. Thanks, Viswa.J -- Harsh J -- Harsh J
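In other words, on the 1.x line the value should be one comma-separated list with no spaces, using the mount points from the earlier mail:

    <property>
      <name>dfs.data.dir</name>
      <value>/mnt/hadoop0/hdfs,/mnt/hadoop1/hdfs,/mnt/hadoop2/hdfs,/mnt/hadoop3/hdfs</value>
    </property>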
How to support the (HDFS) FileSystem API of various Hadoop Distributions?
Hi, I have started to write a small ncdu clone to browse HDFS on the CLI (http://nchadoop.org/). Currently I'm testing it against CDH4, but I would like to make it available to a wider group of users (Hortonworks, ...). Is it enough to build against different vanilla versions (for IPC 5, 7)? Best Regards, Christian.
How to speed up Hadoop?
Hi all, I am looking for ways to configure Hadoop in order to speed up data processing. Assuming all my nodes are highly fault tolerant, will making the data replication factor 1 speed up processing? Is there some way to disable the failure monitoring done by Hadoop? Thank you for your time. -Sundeep
Re: How to speed up Hadoop?
I think you just went backwards. more replicas (generally speaking) are better. I'd take 60 cheap, 1 U servers over 20 highly fault tolerant ones for almost every problem. I'd get them for the same or less $ too. On Thu, Sep 5, 2013 at 8:41 PM, Sundeep Kambhampati kambh...@cse.ohio-state.edu wrote: Hi all, I am looking for ways to configure Hadoop inorder to speed up data processing. Assuming all my nodes are highly fault tolerant, will making data replication factor 1 speed up the processing? Are there some way to disable failure monitoring done by Hadoop? Thank you for your time. -Sundeep
Re: How to speed up Hadoop?
Solution 1: Throw more hardware at the cluster. That's the whole point of hadoop. Solution 2: Try to optimize the mapreduce jobs. It depends on what kind of jobs you are running. I wouldn't suggest decreasing the number of replications as it kind of defeats the purpose of using Hadoop. You could do this if you can't get more hardware, are running experimental non-critical non-production data. What kind of Hadoop monitoring are you talking about? Regards, Vinayak. On Thu, Sep 5, 2013 at 7:51 PM, Chris Embree cemb...@gmail.com wrote: I think you just went backwards. more replicas (generally speaking) are better. I'd take 60 cheap, 1 U servers over 20 highly fault tolerant ones for almost every problem. I'd get them for the same or less $ too. On Thu, Sep 5, 2013 at 8:41 PM, Sundeep Kambhampati kambh...@cse.ohio-state.edu wrote: Hi all, I am looking for ways to configure Hadoop inorder to speed up data processing. Assuming all my nodes are highly fault tolerant, will making data replication factor 1 speed up the processing? Are there some way to disable failure monitoring done by Hadoop? Thank you for your time. -Sundeep
Re: Disc not equally utilized in hdfs data nodes
Thanks Harsh. Hope I don't have space in my list which I specified in the last mail. Thanks, V On Sep 5, 2013 11:20 PM, Harsh J ha...@cloudera.com wrote: The spaces may be a problem if you are using the older 1.x releases. Please try to specify the list without spaces, and also check if all of these paths exist and have some DN owned directories under them. Please also keep the lists in CC/TO when replying. Clicking Reply-to-all usually helps do this automatically. On Thu, Sep 5, 2013 at 11:16 PM, Viswanathan J jayamviswanat...@gmail.com wrote: Hi Harsh, dfs.data.dir property we defined the values as in comma separated, /mnt/hadoop0/hdfs, /mnt/hadoop1/hdfs, /mnt/hadoop2/hdfs, /mnt/hadoop3/hdfs The above values are different devices. Thanks, V On Sep 5, 2013 10:53 PM, Harsh J ha...@cloudera.com wrote: Please share your hdfs-site.xml. HDFS needs to be configured to use all 4 disk mounts - it does not auto-discover and use all drives today. On Thu, Sep 5, 2013 at 10:48 PM, Viswanathan J jayamviswanat...@gmail.com wrote: Hi, The data which are storing in data nodes are not equally utilized in all the data directories. We having 4x1 TB drives, but huge data storing in single disc only at all the nodes. How to balance for utilize all the drives. This causes the hdfs storage size becomes high very soon even though we have available space. Thanks, Viswa.J -- Harsh J -- Harsh J
Re: How to speed up Hadoop?
How about this: http://hadoop.apache.org/docs/stable/vaidya.html I've never tried it myself, i was just reading about it today. On Thu, Sep 5, 2013 at 5:57 PM, Preethi Vinayak Ponangi vinayakpona...@gmail.com wrote: Solution 1: Throw more hardware at the cluster. That's the whole point of hadoop. Solution 2: Try to optimize the mapreduce jobs. It depends on what kind of jobs you are running. I wouldn't suggest decreasing the number of replications as it kind of defeats the purpose of using Hadoop. You could do this if you can't get more hardware, are running experimental non-critical non-production data. What kind of Hadoop monitoring are you talking about? Regards, Vinayak. On Thu, Sep 5, 2013 at 7:51 PM, Chris Embree cemb...@gmail.com wrote: I think you just went backwards. more replicas (generally speaking) are better. I'd take 60 cheap, 1 U servers over 20 highly fault tolerant ones for almost every problem. I'd get them for the same or less $ too. On Thu, Sep 5, 2013 at 8:41 PM, Sundeep Kambhampati kambh...@cse.ohio-state.edu wrote: Hi all, I am looking for ways to configure Hadoop inorder to speed up data processing. Assuming all my nodes are highly fault tolerant, will making data replication factor 1 speed up the processing? Are there some way to disable failure monitoring done by Hadoop? Thank you for your time. -Sundeep
Re: How to speed up Hadoop?
On 9/5/2013 8:57 PM, Preethi Vinayak Ponangi wrote: Solution 1: Throw more hardware at the cluster. That's the whole point of hadoop. Solution 2: Try to optimize the mapreduce jobs. It depends on what kind of jobs you are running. I wouldn't suggest decreasing the number of replications as it kind of defeats the purpose of using Hadoop. You could do this if you can't get more hardware, are running experimental non-critical non-production data. What kind of Hadoop monitoring are you talking about? Regards, Vinayak. On Thu, Sep 5, 2013 at 7:51 PM, Chris Embree cemb...@gmail.com mailto:cemb...@gmail.com wrote: I think you just went backwards. more replicas (generally speaking) are better. I'd take 60 cheap, 1 U servers over 20 highly fault tolerant ones for almost every problem. I'd get them for the same or less $ too. On Thu, Sep 5, 2013 at 8:41 PM, Sundeep Kambhampati kambh...@cse.ohio-state.edu mailto:kambh...@cse.ohio-state.edu wrote: Hi all, I am looking for ways to configure Hadoop inorder to speed up data processing. Assuming all my nodes are highly fault tolerant, will making data replication factor 1 speed up the processing? Are there some way to disable failure monitoring done by Hadoop? Thank you for your time. -Sundeep Thank you for your inputs. I can't currently add more hardware. By monitoring I mean something like speculative execution. Regards, Sundeep
Re: How to support the (HDFS) FileSystem API of various Hadoop Distributions?
Hello, There are a few additions to the FileSystem that may bite you across versions, but if you pick an old stable version such as Apache Hadoop 0.20.2 and stick to only its offered APIs, it will work better across different version dependencies, as we try to keep FileSystem as stable an interface as we can (there has also been more recent work to ensure its stabilization). I looked over your current code state and it seemed to have pretty stable calls that I think have existed across several versions and exist today, but I did notice you had to remove an isRoot as part of a previous commit, which may have led to this question? If that doesn't work for you, you can also switch to using sub-modules carrying code specific to a build version type (such as what HBase does at https://github.com/apache/hbase/tree/trunk/ ; see the hbase-hadoop-compat directories). On Fri, Sep 6, 2013 at 2:59 AM, Christian Schneider cschneiderpub...@gmail.com wrote: Hi, I started to write a small ncdu clone to browse HDFS on the CLI (http://nchadoop.org/). Currently I'm testing it against CDH4, but I'd like to make it available to a wider group of users (Hortonworks, ...). Is it enough to pick different vanilla versions (for IPC 5, 7)? Best Regards, Christian. -- Harsh J
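[Editorial note: to illustrate the "stable subset" idea, here is a minimal sketch of a disk-usage walk that sticks to FileSystem calls that have existed since the 0.20 line (FileSystem.get, listStatus, FileStatus.isDir/getLen). It is only an illustration, not how nchadoop itself is implemented.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DuWalker {
  // Recursively sum file sizes under a path using only long-stable calls.
  static long usage(FileSystem fs, Path p) throws java.io.IOException {
    long total = 0;
    FileStatus[] entries = fs.listStatus(p);
    if (entries == null) {
      return 0; // path does not exist (older versions return null here)
    }
    for (FileStatus s : entries) {
      total += s.isDir() ? usage(fs, s.getPath()) : s.getLen();
    }
    return total;
  }

  public static void main(String[] args) throws Exception {
    // Uses whichever fs.default.name / fs.defaultFS the client config points at.
    FileSystem fs = FileSystem.get(new Configuration());
    System.out.println(usage(fs, new Path(args[0])));
  }
}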
Re: How to support the (HDFS) FileSystem API of various Hadoop Distributions?
Oh and btw, nice utility! :) On Fri, Sep 6, 2013 at 7:50 AM, Harsh J ha...@cloudera.com wrote: Hello, There are a few additions to the FileSystem that may bite you across versions, but if you pick an old stable version such as Apache Hadoop 0.20.2 and stick to only its offered APIs, it will work better across different version dependencies, as we try to keep FileSystem as stable an interface as we can (there has also been more recent work to ensure its stabilization). I looked over your current code state and it seemed to have pretty stable calls that I think have existed across several versions and exist today, but I did notice you had to remove an isRoot as part of a previous commit, which may have led to this question? If that doesn't work for you, you can also switch to using sub-modules carrying code specific to a build version type (such as what HBase does at https://github.com/apache/hbase/tree/trunk/ ; see the hbase-hadoop-compat directories). On Fri, Sep 6, 2013 at 2:59 AM, Christian Schneider cschneiderpub...@gmail.com wrote: Hi, I started to write a small ncdu clone to browse HDFS on the CLI (http://nchadoop.org/). Currently I'm testing it against CDH4, but I'd like to make it available to a wider group of users (Hortonworks, ...). Is it enough to pick different vanilla versions (for IPC 5, 7)? Best Regards, Christian. -- Harsh J -- Harsh J
Re: How to speed up Hadoop?
I'd recommend reading Eric Sammer's Hadoop Operations (O'Reilly) book. It goes over a lot of this - building, monitoring, tuning, optimizing, etc. If your goal is just speed and quicker results, and not retention or safety, then by all means set the replication factor to 1. Note that it's difficult for us to suggest configs unless you also share your use case (in brief) or goals. While the software is highly tunable, a lot of the tweaks depend on what you are planning to do. On Fri, Sep 6, 2013 at 6:11 AM, Sundeep Kambhampati kambh...@cse.ohio-state.edu wrote: Hi all, I am looking for ways to configure Hadoop in order to speed up data processing. Assuming all my nodes are highly fault tolerant, will setting the data replication factor to 1 speed up the processing? Is there some way to disable the failure monitoring done by Hadoop? Thank you for your time. -Sundeep -- Harsh J
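[Editorial note: if you do go the single-replica route, it can be set per client or per file. A minimal sketch, assuming the ordinary FileSystem API; the path argument is just a placeholder, and the usual caveat applies that a single replica means any disk or node loss loses the data.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleReplica {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // New files created through this client default to a single replica.
    conf.setInt("dfs.replication", 1);
    FileSystem fs = FileSystem.get(conf);

    // Existing files can be dropped to one replica explicitly
    // (same effect as `hadoop fs -setrep 1 <path>` on the shell).
    fs.setReplication(new Path(args[0]), (short) 1);
  }
}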
RE: Question related to resource allocation in Yarn!
Hi Rahul, Could you tell me what version you are using? If you want a container, you need to issue 3 resource requests (one node-local, one rack-local and one Any (*)). If you are using 2.1.0-beta or later versions, you can set the Relax Locality flag to false to get allocations only on the specified host. Can you also share the code showing how you are requesting containers, so that we can help you better? Thanks, Devaraj k From: Rahul Bhattacharjee [mailto:rahul.rec@gmail.com] Sent: 06 September 2013 09:43 To: user@hadoop.apache.org Subject: Re: Question related to resource allocation in Yarn! I could progress a bit on this. I was not setting responseId while asking for containers. Still I have one question: why am I only being allocated two containers when the node manager can run more? Response while registering the application master: minimumCapability { memory: 1024, virtual_cores: 1 }, maximumCapability { memory: 8192, virtual_cores: 32 }. Thanks, Rahul On Thu, Sep 5, 2013 at 8:33 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hi, I am trying to make a small PoC on top of YARN. Within the launched application master, I am trying to request 50 containers and launch the same task on those allocated containers. My config: AM registration response minimumCapability { memory: 1024, virtual_cores: 1 }, maximumCapability { memory: 8192, virtual_cores: 32 }. 1) I am asking the RM for containers with 1 GB of memory and 1 core. Ideally the RM should return me 6-7 containers, but the response always returns only 2. Why is that? 2) When the first ask returns 2 containers, I then ask the RM again for 50 - 2 = 48 containers. I keep getting 0 containers, even after the previously started containers have finished. Why is that? Any link explaining the RM's allocate request would be very helpful. Thanks, Rahul
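[Editorial note: to make the locality point concrete, here is a rough sketch against the AMRMClient helper as it exists in 2.1.0-beta and later releases. On 2.0.4-alpha the client classes and constructors differ, so treat the exact names and signatures here as assumptions rather than a drop-in answer; the host name "node1" is a placeholder.]

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LocalityAsk {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(new YarnConfiguration());
    rm.start();
    rm.registerApplicationMaster("", 0, "");

    Resource capability = Resource.newInstance(1024, 1); // 1 GB, 1 core

    // Default behaviour: prefer node1, but let the scheduler relax the ask
    // to rack-local or any node (the three implied resource requests).
    rm.addContainerRequest(new ContainerRequest(
        capability, new String[]{"node1"}, null, Priority.newInstance(0)));

    // relaxLocality = false: only ever allocate on node1. Requests at the
    // same priority must all agree on the relaxLocality setting, hence the
    // separate priority here.
    rm.addContainerRequest(new ContainerRequest(
        capability, new String[]{"node1"}, null, Priority.newInstance(1), false));

    // Allocated containers come back from subsequent rm.allocate(progress)
    // calls; keep calling allocate() as a heartbeat until you have enough.
  }
}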
Re: Question related to resource allocation in Yarn!
Hi Devaraj, I am on Hadoop 2.0.4. I am able to get containers now and my YARN app runs properly. I am setting the hostname as * while requesting containers. There is no problem as of now; the only thing is that I am allocated only 2 containers at a time, though I believe the node manager can run more containers. 2013-09-06 09:53:38,433 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Assigned container container_1378441324025_0001_01_01 of capacity memory:100, vCores:1 on host storyacid-lm:55407, which currently has 1 containers, memory:100, vCores:1 used and memory:8092, vCores:15 available I am requesting containers with 100 MB memory and 1 core. If I knew more about how the capacity is calculated per node, or how the allocation is done, that would be useful. Thanks for the help! Rahul On Fri, Sep 6, 2013 at 10:31 AM, Devaraj k devara...@huawei.com wrote: Hi Rahul, Could you tell me what version you are using? If you want a container, you need to issue 3 resource requests (one node-local, one rack-local and one Any (*)). If you are using 2.1.0-beta or later versions, you can set the Relax Locality flag to false to get allocations only on the specified host. Can you also share the code showing how you are requesting containers, so that we can help you better? Thanks, Devaraj k From: Rahul Bhattacharjee [mailto:rahul.rec@gmail.com] Sent: 06 September 2013 09:43 To: user@hadoop.apache.org Subject: Re: Question related to resource allocation in Yarn! I could progress a bit on this. I was not setting responseId while asking for containers. Still I have one question: why am I only being allocated two containers when the node manager can run more? Response while registering the application master: minimumCapability { memory: 1024, virtual_cores: 1 }, maximumCapability { memory: 8192, virtual_cores: 32 }. Thanks, Rahul On Thu, Sep 5, 2013 at 8:33 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hi, I am trying to make a small PoC on top of YARN. Within the launched application master, I am trying to request 50 containers and launch the same task on those allocated containers. My config: AM registration response minimumCapability { memory: 1024, virtual_cores: 1 }, maximumCapability { memory: 8192, virtual_cores: 32 }. 1) I am asking the RM for containers with 1 GB of memory and 1 core. Ideally the RM should return me 6-7 containers, but the response always returns only 2. Why is that? 2) When the first ask returns 2 containers, I then ask the RM again for 50 - 2 = 48 containers. I keep getting 0 containers, even after the previously started containers have finished. Why is that? Any link explaining the RM's allocate request would be very helpful. Thanks, Rahul