Custom error message and action for authentication failure

2013-09-05 Thread Benoy Antony
Hello All,

We have a requirement to display a custom error message and do some
bookkeeping if a user faces an authentication error when making a hadoop
call.

We use Hadoop 1.  How do we go about accomplishing this ?

benoy


Re: How to update the timestamp of a file in HDFS

2013-09-05 Thread murali adireddy
Hi ,

Try this touchz hadoop command.

hadoop -touchz filename


Thanks and Regards,
Adi Reddy Murali


On Thu, Sep 5, 2013 at 11:06 AM, Harsh J ha...@cloudera.com wrote:

 There's no shell command (equivalent to Linux's touch) but you can use
 the Java API:
 http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#setTimes(org.apache.hadoop.fs.Path,%20long,%20long)

 On Thu, Sep 5, 2013 at 10:58 AM, Ramasubramanian Narayanan
 ramasubramanian.naraya...@gmail.com wrote:
  Hi,
 
  Can you please help on how to update the date & timestamp of a file in HDFS.
 
  regards,
  Rams



 --
 Harsh J



Re: How to update the timestamp of a file in HDFS

2013-09-05 Thread murali adireddy
The right usage of the command is:

hadoop fs -touchz filename


On Thu, Sep 5, 2013 at 12:14 PM, murali adireddy
murali.adire...@gmail.comwrote:

 Hi ,

 Try this touchz hadoop command.

 hadoop -touchz filename


 Thanks and Regards,
 Adi Reddy Murali


 On Thu, Sep 5, 2013 at 11:06 AM, Harsh J ha...@cloudera.com wrote:

 There's no shell command (equivalent to Linux's touch) but you can use
 the Java API:
 http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#setTimes(org.apache.hadoop.fs.Path,%20long,%20long)

 On Thu, Sep 5, 2013 at 10:58 AM, Ramasubramanian Narayanan
 ramasubramanian.naraya...@gmail.com wrote:
  Hi,
 
  Can you please help on how to update the date & timestamp of a file in HDFS.
 
  regards,
  Rams



 --
 Harsh J





how to observe spill operation during shuffle in mapreduce?

2013-09-05 Thread ch huang
hi,all:
   is there any MR spill count metric?


Re: how to observe spill operation during shuffle in mapreduce?

2013-09-05 Thread Ravi Kiran
Hi ,

   You can look at the job counters in your JobTracker web UI. The
Spilled Records counter under the Map-Reduce Framework group displays
the number of records spilled in both map and reduce tasks.

Regards
Ravi Magham.
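
A minimal sketch of reading the same counter programmatically, assuming the
Hadoop 2.x mapreduce client API and a job id passed on the command line (the
class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.TaskCounter;

public class SpillCounterCheck {
  public static void main(String[] args) throws Exception {
    // Connect to the cluster and look the job up by its id, e.g. job_201309050000_0001.
    Cluster cluster = new Cluster(new Configuration());
    Job job = cluster.getJob(JobID.forName(args[0]));
    // SPILLED_RECORDS is the framework counter shown as "Spilled Records" in the UI.
    long spilled = job.getCounters()
                      .findCounter(TaskCounter.SPILLED_RECORDS)
                      .getValue();
    System.out.println("Spilled records: " + spilled);
  }
}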


On Thu, Sep 5, 2013 at 12:23 PM, ch huang justlo...@gmail.com wrote:

 hi,all:
is there any MR spill count metric?



Re: Multidata center support

2013-09-05 Thread Visioner Sadak
Hi friends

Hello Baskar, I think rack awareness and data center awareness are different,
and similarly nodes and data centers are different things from Hadoop's
perspective. Ideally it should be the same, since nodes can sit in different
data centers, but I don't think Hadoop replicates data across data centers. I
am not sure about this (can anyone please comment on this?).

Federation can provide different namenodes, so you can create independent
clusters, for example one cluster at one data center and another cluster at a
different data center. But if Hadoop can replicate across data centers, then we
need only one federated cluster for all data centers :)... Are any of you using
a single federated cluster across multiple data centers in production? For
example:

CASE--1

One federated cluster, data centers in the US and Europe (if Hadoop can
replicate across data centers):

NN1 -- US       DN1 -- US
NN2 -- Europe   DN2 -- Europe

In this case data can be replicated to DN1 and DN2

--
CASE--2

Two independent federated clusters, data centers in the US and Europe (if
Hadoop cannot replicate across data centers):

cluster 1:  NN1 -- US       DN1 -- US
cluster 2:  NN2 -- Europe   DN2 -- Europe


In this case data cannot be replicated to DN2 or vice versa


*Can anyone clarify which would be the right and optimal case for Hadoop? :-)*





On Wed, Sep 4, 2013 at 2:20 AM, Baskar Duraikannu 
baskar.duraika...@outlook.com wrote:

 Rahul
 Are you talking about rack-awareness script?

 I did go through rack awareness. Here are the problems with rack awareness
 w.r.to my (given) business requirment

 1. Hadoop, by default, places two copies on the same rack and 1 copy on some
 other rack. This would work as long as we have two data centers. If the
 business wants to have three data centers, then the data would not be spread
 across all of them. Separately, there is a question around whether that is the
 right thing to do or not. The business has promised me that they would buy
 enough bandwidth such that the data centers will be a few milliseconds apart
 (in latency).

 2. I believe Hadoop automatically re-replicates data if one or more nodes are
 down. Assume one out of 2 data centers goes down: there will be a massive data
 flow to create additional copies. When I say data center support, I should be
 able to configure Hadoop to say
  a) Maintain 1 copy per data center
  b) If any data center goes down, don't create additional copies.

 The requirements I am pointing to will essentially move Hadoop from a
 strongly consistent to a weak/eventually consistent model. Since this changes
 the fundamental architecture, it will probably break all sorts of things... It
 might never be possible in Hadoop.

 Thoughts?

 Sadak
 Is there a way to implement above requirement via Federation?

 Thanks
 Baskar


 --
 Date: Sun, 1 Sep 2013 00:20:04 +0530

 Subject: Re: Multidata center support
 From: visioner.sa...@gmail.com
 To: user@hadoop.apache.org


 What do you think, friends? I think Hadoop clusters can run on multiple data
 centers using FEDERATION.


 On Sat, Aug 31, 2013 at 8:39 PM, Visioner Sadak 
 visioner.sa...@gmail.comwrote:

 The only problem, I guess, is that Hadoop won't be able to duplicate data
 from one data center to another, but I guess I can identify data nodes or
 namenodes from another data center. Correct me if I am wrong.


 On Sat, Aug 31, 2013 at 7:00 PM, Visioner Sadak 
 visioner.sa...@gmail.comwrote:

 Let's say that you have some machines in Europe and some in the US. I think
 you just need the IPs and configure them in your cluster setup; it will
 work...


 On Sat, Aug 31, 2013 at 7:52 AM, Jun Ping Du j...@vmware.com wrote:

 Hi,
 Although you can set a datacenter layer in your network topology, it is never
 enabled in Hadoop because of the lack of replica placement and task scheduling
 support. There is some work to add layers other than rack and node under
 HADOOP-8848, but it may not suit your case. I agree with Adam that a cluster
 spanning multiple data centers does not seem to make sense even for the DR
 case. Do you have other cases that call for such a deployment?

 Thanks,

 Junping

 --
 *From: *Adam Muise amu...@hortonworks.com
 *To: *user@hadoop.apache.org
 *Sent: *Friday, August 30, 2013 6:26:54 PM
 *Subject: *Re: Multidata center support


 Nothing has changed. DR best practice is still one (or more) clusters per
 site and replication is handled via distributed copy or some variation of
 it. A cluster spanning 

Re: M/R API and Writable semantics in reducer

2013-09-05 Thread Jan Lukavský

Hi,

is there anyone interested in this topic? Basically, what I'm trying to find
out is whether it is 'safe' to rely on the side effect of the key being updated
while iterating over the values. I believe there must be someone else
interested in this; the secondary sort pattern is very common (at least in our
jobs). So far, we have been emulating the GroupingComparator by holding state
in the Reducer class, and therefore being able to keep track of 'groups' of
keys across several calls to the reduce() method. This approach seems quite
safe with respect to the API, but the code is not as pretty (and is vulnerable
to ugly bugs if you forget to reset the state correctly, for instance).


On the other hand, if the way the key gets updated while iterating the values
is to be considered part of the contract of the MapReduce API, I think it
should be implemented in MRUnit (or you basically cannot use MRUnit to unit
test your job), and if it isn't, then it is probably a bug. If this is internal
behavior and might be subject to change at any time, then it clearly seems that
keeping the state in the Reducer is the only option.


Does anyone else have similar considerations? How do others implement 
the secondary sort?


Thanks,
 Jan

On 09/02/2013 03:29 PM, Jan Lukavský wrote:

Hi all,

some time ago, I wrote a note to this list that it would be nice if it were
possible to get the *real* key emitted from the mapper in the reducer when
using the GroupingComparator. I got the answer that it is possible because of
the Writable semantics, and that currently the following holds:


@Override
protected void reduce(Key key, Iterable<Value> values, Context context) {
  for (Value v : values) {
    // The key MIGHT change its value in this cycle, because readFields()
    // will be called on it. When using a GroupingComparator that groups only
    // by some part of the key, many different keys might be considered a
    // single group, so the *real* data matters.
  }
}

When you use GroupingComparator the contents of the key can matter, 
because if you cannot access it, you have to duplicate the data in 
value (which means more network traffic in shuffle phase, and more I/O 
generally).
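
As a point of reference, a minimal sketch of the usual wiring for this pattern,
using a plain Text key of the form "group#order" (the class name and the '#'
convention are illustrative, not from this thread): the sort order covers the
whole composite key, while the grouping comparator groups on only the prefix,
so several distinct keys reach one reduce() call.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;

public class SecondarySortWiring {

  // Groups Text keys by everything before the first '#', so e.g.
  // "user1#2013-09-05" and "user1#2013-09-06" reach the same reduce()
  // call while remaining distinct keys.
  public static class GroupPartComparator extends WritableComparator {
    protected GroupPartComparator() { super(Text.class, true); }
    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
      String ga = a.toString().split("#", 2)[0];
      String gb = b.toString().split("#", 2)[0];
      return ga.compareTo(gb);
    }
  }

  public static Job configure(Configuration conf) throws Exception {
    Job job = Job.getInstance(conf, "secondary-sort");
    job.setMapOutputKeyClass(Text.class);
    // Default Text ordering sorts on the whole "group#order" key,
    // while grouping uses only the prefix.
    job.setGroupingComparatorClass(GroupPartComparator.class);
    return job;
  }
}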


Now, the question is how much this is a reliable part of the API, and how
likely it is that relying on this feature might break in future versions. To
me, it seems more like a side effect that is not guaranteed to be maintained in
the future. There is already a hint that this is probably very fragile, because
MRUnit seems not to update the key during the iteration.


Does anyone have a suggested way around this? Is the 'official' preferred way
of accessing the original key to call context.getCurrentKey()? Isn't that the
same case? Wouldn't it be nice if the API itself had some guarantees or
suggestions about how it works? I can imagine a modified reduce() method with a
signature like


protected void reduce(Key key, Iterable<Pair<Key, Value>> keyValues,
Context context);


This seems easily transformable to the old call (which could be 
default implementation of this method).


Any opinion on this?

Thanks,
 Jan





Re: How to update the timestamp of a file in HDFS

2013-09-05 Thread Harsh J
Murali,

The touchz command creates a zero-sized file. It does not allow modifying a
timestamp like Linux's touch command does, which is what the OP seems to be
asking about.
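
A minimal sketch of the Java API route, assuming the default configured
filesystem and a path passed on the command line (the class name is
illustrative); setTimes takes modification and access times in milliseconds,
and -1 leaves a field unchanged:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsTouch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    long now = System.currentTimeMillis();
    // setTimes(path, modificationTime, accessTime); -1 keeps the existing value.
    fs.setTimes(new Path(args[0]), now, -1);
  }
}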

On Thu, Sep 5, 2013 at 12:14 PM, murali adireddy
murali.adire...@gmail.com wrote:
 The right usage of the command is:

 hadoop fs -touchz filename


 On Thu, Sep 5, 2013 at 12:14 PM, murali adireddy murali.adire...@gmail.com
 wrote:

 Hi ,

 Try this touchz hadoop command.

 hadoop -touchz filename


 Thanks and Regards,
 Adi Reddy Murali


 On Thu, Sep 5, 2013 at 11:06 AM, Harsh J ha...@cloudera.com wrote:

 There's no shell command (equivalent to Linux's touch) but you can use
 the Java API:
 http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#setTimes(org.apache.hadoop.fs.Path,%20long,%20long)

 On Thu, Sep 5, 2013 at 10:58 AM, Ramasubramanian Narayanan
 ramasubramanian.naraya...@gmail.com wrote:
  Hi,
 
  Can you please help on how to update the date & timestamp of a file in
  HDFS.
 
  regards,
  Rams



 --
 Harsh J






-- 
Harsh J


Re: what is the difference between mapper and identity mapper, reducer and identity reducer?

2013-09-05 Thread Shahab Yunus
Identity Mapper and Reducer are just like the concept of the identity function
in mathematics, i.e. they do not transform the input and return it as-is in the
output. The Identity Mapper takes the input key/value pair and spits it out
without any processing.

The case of the identity reducer is a bit different. It does not mean that the
reduce step will not take place. It will take place, and the related sorting
and shuffling will also be performed, but there will be no aggregation. So you
can use the identity reducer if you want to sort the data coming from the map
but don't care about any grouping, and are also fine with multiple reducer
output files (unlike using a single reducer).
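
A minimal sketch of the two, assuming the new (org.apache.hadoop.mapreduce) API
and illustrative class names; in this API the base Mapper and Reducer classes
already behave this way:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class IdentityExample {
  // Identity mapper: emits every input record unchanged.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value);
    }
  }

  // Identity reducer: data is still partitioned, sorted and shuffled,
  // but nothing is aggregated; every value is written back out.
  public static class PassThroughReducer
      extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values) {
        context.write(key, value);
      }
    }
  }
}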

Regards,
Shahab



On Thu, Sep 5, 2013 at 9:43 AM, mallik arjun mallik.cl...@gmail.com wrote:

 hi  all,

 please tell me what is the difference between a mapper and the identity
 mapper, and between a reducer and the identity reducer.

 thanks in advance.



Re: Symbolic Link in Hadoop 1.0.4

2013-09-05 Thread Suresh Srinivas
The FileContext API and the symlink functionality are not available in 1.0.
They are only available in the 0.23 and 2.x releases.
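
A minimal sketch of the call on 2.x (not possible on 1.0.4); the paths and
class name are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class SymlinkSketch {
  public static void main(String[] args) throws Exception {
    FileContext fc = FileContext.getFileContext(new Configuration());
    // createSymlink(target, link, createParent)
    fc.createSymlink(new Path("/data/current"),
                     new Path("/user/olivier/current-link"), true);
  }
}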


On Thu, Sep 5, 2013 at 8:06 AM, Gobilliard, Olivier 
olivier.gobilli...@cartesian.com wrote:

  Hi,



 I am using Hadoop 1.0.4 and need to create a symbolic link in HDFS.

 This feature has been added in Hadoop 0.21.0 (
 https://issues.apache.org/jira/browse/HDFS-245) in the new FileContext
 API (
 http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileContext.html
 ).

 However, I cannot find the FileContext API in the 1.0.4 release (
 http://archive.apache.org/dist/hadoop/core/hadoop-1.0.4/). I cannot find
 it in any of the 1.X releases actually.



 Has this functionality been moved to another Class?



 Many thanks,

 Olivier


 __
 This email and any attachments are confidential. If you have received this
 email in error please notify the sender immediately
 by replying to this email and then delete from your computer without
 copying or distributing in any other way.

 Cartesian Limited - Registered in England and Wales with number 3230513
 Registered office: Descartes House, 8 Gate Street, London, WC2A 3HP
 www.cartesian.com




-- 
http://hortonworks.com/download/



Question related to resource allocation in Yarn!

2013-09-05 Thread Rahul Bhattacharjee
Hi,

I am trying to make a small poc on top of yarn.

Within the launched application master , I am trying to request for 50
containers and launch  a same task on those allocated containers.

My config: the AM registration response reports minimumCapability { memory:
1024, virtual_cores: 1 } and maximumCapability { memory: 8192, virtual_cores:
32 }.

1) I am asking the RM for containers with 1 GB of memory and 1 core. Ideally
the RM should return me 6-7 containers, but the response always returns only 2
containers.

Why is that ?

2) So, when 2 containers are returned in the first ask, I then ask the RM
again for 50 - 2 = 48 containers. I keep getting 0 containers, even after the
previously started containers have finished.

Why is that ?

Any link explaining the allocate request of RM would be very helpful.

Thanks,
Rahul
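
A stripped-down sketch of the usual allocation loop, assuming the 2.1.x
AMRMClient API (a real AM would also unregister, handle failures and launch the
granted containers); the key point is that allocate() is a heartbeat, so
containers arrive incrementally across several calls rather than all at once:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerAskSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> amrmClient = AMRMClient.createAMRMClient();
    amrmClient.init(new Configuration());
    amrmClient.start();
    amrmClient.registerApplicationMaster("", 0, "");

    Resource capability = Resource.newInstance(1024, 1);  // 1 GB, 1 vcore
    Priority priority = Priority.newInstance(0);
    for (int i = 0; i < 50; i++) {
      // One ContainerRequest per container wanted.
      amrmClient.addContainerRequest(
          new ContainerRequest(capability, null, null, priority));
    }

    int granted = 0;
    while (granted < 50) {
      // Heartbeat to the RM; whatever has been granted so far comes back here.
      List<Container> allocated =
          amrmClient.allocate(0.1f).getAllocatedContainers();
      granted += allocated.size();
      Thread.sleep(1000);
    }
    amrmClient.stop();
  }
}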


Symbolic Link in Hadoop 1.0.4

2013-09-05 Thread Gobilliard, Olivier
Hi,

I am using Hadoop 1.0.4 and need to create a symbolic link in HDFS.
This feature has been added in Hadoop 0.21.0 
(https://issues.apache.org/jira/browse/HDFS-245) in the new FileContext API 
(http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileContext.html).
However, I cannot find the FileContext API in the 1.0.4 release 
(http://archive.apache.org/dist/hadoop/core/hadoop-1.0.4/). I cannot find it in 
any of the 1.X releases actually.

Has this functionality been moved to another Class?

Many thanks,
Olivier

__
This email and any attachments are confidential. If you have received this 
email in error please notify the sender immediately
by replying to this email and then delete from your computer without copying or 
distributing in any other way.

Cartesian Limited - Registered in England and Wales with number 3230513
Registered office: Descartes House, 8 Gate Street, London, WC2A 3HP
www.cartesian.comhttp://www.cartesian.com


RE: Multidata center support

2013-09-05 Thread Baskar Duraikannu
Thanks Mike. I am assuming that it is a poor idea due to network bandwidth
constraints across data centers (the backplane speed of a TOR switch is
typically greater than inter-data-center connectivity).
From: michael_se...@hotmail.com
Subject: Re: Multidata center support
Date: Wed, 4 Sep 2013 20:15:08 -0500
To: user@hadoop.apache.org

Sorry, it's a poor idea, period.
It's one thing for something like Cleversafe to span a data center, but you
also have units of work in terms of map/reduce.
Think about all of the bad things that can happen when you have to deal with a
sort/shuffle stage across data centers... (It's not a pretty sight.)
As Adam points out, DR and copies across data centers are one thing. Running a
single cluster spanning data centers is another...
I would hate to be you when you have to face your devOps team. Does the
expression BOFH ring a bell? ;-)
HTH
-Mike

On Aug 30, 2013, at 5:26 AM, Adam Muise amu...@hortonworks.com wrote:

Nothing has changed. DR best practice is still one (or more) clusters per site
and replication is handled via distributed copy or some variation of it. A
cluster spanning multiple data centers is a poor idea right now.




On Fri, Aug 30, 2013 at 12:35 AM, Rahul Bhattacharjee rahul.rec@gmail.com 
wrote:

My take on this.


Why does Hadoop have to know about the data center at all? I think it can be
installed across multiple data centers; however, topology configuration would
be required to tell which node belongs to which data center and switch for
block placement.




Thanks,
Rahul
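
For reference, a hedged sketch of that topology hook in core-site.xml (Hadoop
1.x property name; the script path is a placeholder for a site-specific script
that maps each host or IP to a path such as /us-east/rack1):

<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/topology.sh</value>
</property>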


On Fri, Aug 30, 2013 at 12:42 AM, Baskar Duraikannu 
baskar.duraika...@outlook.com wrote:






We have a need to set up Hadoop across data centers. Does Hadoop support a
multi-data-center configuration? I searched through the archives and found that
Hadoop did not support a multi-data-center configuration some time back. I just
wanted to see whether the situation has changed.



Please help.  




-- 



Adam Muise
Solution Engineer, Hortonworks
amuise@hortonworks.com | 416-417-4037







The opinions expressed here are mine; while they may reflect a cognitive
thought, that is purely accidental. Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com






  

Re: Disc not equally utilized in hdfs data nodes

2013-09-05 Thread Harsh J
Please share your hdfs-site.xml. HDFS needs to be configured to use
all 4 disk mounts - it does not auto-discover and use all drives
today.

On Thu, Sep 5, 2013 at 10:48 PM, Viswanathan J
jayamviswanat...@gmail.com wrote:
 Hi,

 The data being stored on the data nodes is not spread equally across all the
 data directories.

 We have 4x1 TB drives, but a huge amount of data is being stored on a single
 disk only, on all the nodes. How can we balance usage across all the drives?

 This causes the HDFS storage to look full very soon even though we have
 available space.

 Thanks,
 Viswa.J



-- 
Harsh J


Disc not equally utilized in hdfs data nodes

2013-09-05 Thread Viswanathan J
Hi,

The data being stored on the data nodes is not spread equally across all the
data directories.

We have 4x1 TB drives, but a huge amount of data is being stored on a single
disk only, on all the nodes. How can we balance usage across all the drives?

This causes the HDFS storage to look full very soon even though we have
available space.

Thanks,
Viswa.J


RE: yarn-site.xml and aux-services

2013-09-05 Thread John Lilley
Harsh,

Thanks as usual for your sage advice.  I was hoping to avoid actually 
installing anything on individual Hadoop nodes and finessing the service by 
spawning it from a task using LocalResources, but this is probably fraught with 
trouble.

FWIW, I would vote to be able to load YARN services from HDFS.  What is the 
appropriate forum to file a request like that?

Thanks
John

-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Wednesday, September 04, 2013 12:05 AM
To: user@hadoop.apache.org
Subject: Re: yarn-site.xml and aux-services

 Thanks for the clarification. I would find it very convenient in this case to
 have my custom jars available in HDFS, but I can see the added complexity
 needed for YARN to maintain and cache those on local disk.

We could class-load directly from HDFS, like HBase Co-Processors do.

 Consider a scenario analogous to the MR shuffle, where the persistent service 
 serves up mapper output files to the reducers across the network:

Isn't this more complex than just running a dedicated service all the time,
and/or implementing a way to spawn/end a dedicated service temporarily? I'd
rather implement such a thing than have my containers implement more logic.

On Fri, Aug 23, 2013 at 11:17 PM, John Lilley john.lil...@redpoint.net wrote:
 Harsh,

 Thanks for the clarification.  I would find it very convenient in this case 
 to have my custom jars available in HDFS, but I can see the added complexity 
 needed for YARN to maintain cache those to local disk.

 What about having the tasks themselves start the per-node service as a child 
 process?   I've been told that the NM kills the process group, but won't 
 setgrp() circumvent that?

 Even given that, would the child process of one task have proper environment 
 and permission to act on behalf of other tasks?  Consider a scenario 
 analogous to the MR shuffle, where the persistent service serves up mapper 
 output files to the reducers across the network:
 1) AM spawns mapper-like tasks around the cluster
 2) Each mapper-like task on a given node launches a persistent service 
 child, but only if one is not already running.
 3) Each mapper-like task writes one or more output files, and informs the 
 service of those files (along with AM-id, Task-id etc).
 4) AM spawns reducer-like tasks around the cluster.
 5) Each reducer-like task is told which nodes contain mapper result data, 
 and connects to services on those nodes to read the data.

 There are some details missing, like how the lifetime of the temporary files 
 is controlled to extend beyond the mapper-like task lifetime but still be 
 cleaned up on AM exit, and how the reducer-like tasks are informed of which 
 nodes have data.

 John


 -Original Message-
 From: Harsh J [mailto:ha...@cloudera.com]
 Sent: Friday, August 23, 2013 11:00 AM
 To: user@hadoop.apache.org
 Subject: Re: yarn-site.xml and aux-services

 The general practice is to install your deps into a custom location such as 
 /opt/john-jars, and extend YARN_CLASSPATH to include the jars, while also 
 configuring the classes under the aux-services list. You need to take care of 
 deploying jar versions to /opt/john-jars/ contents across the cluster though.
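
A hedged sketch of the yarn-site.xml side of that, assuming a custom service
name (myapp_shuffle) and class (com.example.MyAuxService) as placeholders,
listed alongside the stock MapReduce shuffle handler (whose exact name varies
slightly between releases); the jar itself still has to be on the NodeManager
classpath as described above:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,myapp_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.myapp_shuffle.class</name>
  <value>com.example.MyAuxService</value>
</property>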

 I think it may be a neat idea to have jars be placed on HDFS or any other 
 DFS, and the yarn-site.xml indicating the location plus class to load. 
 Similar to HBase co-processors. But I'll defer to Vinod on if this would be a 
 good thing to do.

 (I know the right next thing with such an ability people will ask for 
 is hot-code-upgrades...)

 On Fri, Aug 23, 2013 at 10:11 PM, John Lilley john.lil...@redpoint.net 
 wrote:
 Are there recommended conventions for adding additional code to a 
 stock Hadoop install?

 It would be nice if we could piggyback on whatever mechanisms are 
 used to distribute hadoop itself around the cluster.

 john



 From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com]
 Sent: Thursday, August 22, 2013 6:25 PM


 To: user@hadoop.apache.org
 Subject: Re: yarn-site.xml and aux-services





 Auxiliary services are essentially administrator-configured services.
 So, they have to be set up at install time - before the NM is started.



 +Vinod



 On Thu, Aug 22, 2013 at 1:38 PM, John Lilley 
 john.lil...@redpoint.net
 wrote:

 Following up on this, how exactly does one *install* the jar(s) for 
 auxiliary service?  Can it be shipped out with the LocalResources of an AM?
 MapReduce's aux-service is presumably installed with Hadoop and is 
 just sitting there in the right place, but if one wanted to make a 
 whole new aux-service that belonged with an AM, how would one do it?

 John


 -Original Message-
 From: John Lilley [mailto:john.lil...@redpoint.net]
 Sent: Wednesday, June 05, 2013 11:41 AM
 To: user@hadoop.apache.org
 Subject: RE: yarn-site.xml and aux-services

 Wow, thanks.  Is this documented anywhere other than the code?  I 
 hate to waste y'alls time on things that 

Re: SNN not writing data fs.checkpoint.dir location

2013-09-05 Thread Jitendra Yadav
Please share your Hadoop version and hdfs-site.xml conf. Also, I'm assuming
that you already restarted your cluster after changing fs.checkpoint.dir.

Thanks
On 9/5/13, Munna munnava...@gmail.com wrote:
 Hi,

 I have configured fs.checkpoint.dir in hdfs-site.xml, but it is still writing
 to the /tmp location. Please give me a solution so checkpointing happens in
 the configured location.

 --
 *Regards*
 *
 *
 *Munna*



RE: Multidata center support

2013-09-05 Thread Baskar Duraikannu
Currently there is no relation between weak consistency and Hadoop. I just
spent more time thinking about the requirement (as outlined below):
 a) Maintain a total of 3 data centers
 b) Maintain 1 copy per data center
 c) If any data center goes down, don't create additional copies.
The above is not a valid model, especially requirement (c), because it would
take away the strong consistency model supported by Hadoop. Hope this explains
it.
I believe we can give up on requirement (c). I am currently exploring whether
there is any way to achieve (a) and (b). Requirement (b) can also be relaxed to
have more copies per data center if needed.
From: rahul.rec@gmail.com
Date: Wed, 4 Sep 2013 10:04:49 +0530
Subject: Re: Multidata center support
To: user@hadoop.apache.org

Under-replicated blocks are also consistent from a consumer's point of view.
Care to explain the relation of weak consistency to Hadoop?



Thanks,
Rahul


On Wed, Sep 4, 2013 at 9:56 AM, Rahul Bhattacharjee rahul.rec@gmail.com 
wrote:


Adam's response makes more sense to me: replicate generated data offline from
one cluster to another across data centers.




I am not sure whether a configurable block placement policy is supported in
Hadoop. If yes, then along with rack awareness, you should be able to achieve
the same.




I could not follow your question related to weak consistency.


Thanks,
Rahul






On Wed, Sep 4, 2013 at 2:20 AM, Baskar Duraikannu 
baskar.duraika...@outlook.com wrote:






Rahul, are you talking about the rack-awareness script?
I did go through rack awareness. Here are the problems with rack awareness
w.r.t. my (given) business requirement:



1. Hadoop, by default, places two copies on the same rack and 1 copy on some
other rack. This would work as long as we have two data centers. If the
business wants to have three data centers, then the data would not be spread
across all of them. Separately, there is a question around whether that is the
right thing to do or not. The business has promised me that they would buy
enough bandwidth such that the data centers will be a few milliseconds apart
(in latency).



2. I believe Hadoop automatically re-replicates data if one or more nodes are
down. Assume one out of 2 data centers goes down: there will be a massive data
flow to create additional copies. When I say data center support, I should be
able to configure Hadoop to say


 a) Maintain 1 copy per data center
 b) If any data center goes down, don't create additional copies.
The requirements I am pointing to will essentially move Hadoop from a strongly
consistent to a weak/eventually consistent model. Since this changes the
fundamental architecture, it will probably break all sorts of things... It
might never be possible in Hadoop.



Thoughts?
Sadak, is there a way to implement the above requirement via Federation?
Thanks,
Baskar

Date: Sun, 1 Sep 2013 00:20:04 +0530



Subject: Re: Multidata center support
From: visioner.sa...@gmail.com
To: user@hadoop.apache.org




What do you think, friends? I think Hadoop clusters can run on multiple data
centers using FEDERATION.

On Sat, Aug 31, 2013 at 8:39 PM, Visioner Sadak visioner.sa...@gmail.com 
wrote:




The only problem, I guess, is that Hadoop won't be able to duplicate data from
one data center to another, but I guess I can identify data nodes or namenodes
from another data center. Correct me if I am wrong.






On Sat, Aug 31, 2013 at 7:00 PM, Visioner Sadak visioner.sa...@gmail.com 
wrote:





Let's say that you have some machines in Europe and some in the US. I think
you just need the IPs and configure them in your cluster setup; it will
work...



On Sat, Aug 31, 2013 at 7:52 AM, Jun Ping Du j...@vmware.com wrote:






Hi,
Although you can set a datacenter layer in your network topology, it is never
enabled in Hadoop because of the lack of replica placement and task scheduling
support. There is some work to add layers other than rack and node under
HADOOP-8848, but it may not suit your case. I agree with Adam that a cluster
spanning multiple data centers does not seem to make sense even for the DR
case. Do you have other cases that call for such a deployment?






Thanks,
Junping
From: Adam Muise amu...@hortonworks.com






To: user@hadoop.apache.org
Sent: Friday, August 30, 2013 6:26:54 PM
Subject: Re: Multidata center support


Nothing has changed. DR best practice is still one (or more) clusters per site 
and replication is handled via distributed copy or some variation of it. A 
cluster spanning multiple data centers is a poor idea right now.










On Fri, Aug 30, 2013 at 12:35 AM, Rahul Bhattacharjee rahul.rec@gmail.com 
wrote:







My take on this.





Why does Hadoop have to know about the data center at all? I think it can be
installed across multiple data centers; however, topology configuration would
be required to tell which node belongs to which data center and switch for
block placement.










Thanks,
Rahul


On Fri, Aug 30, 2013 at 12:42 AM, Baskar Duraikannu 

Re: SNN not writing data fs.checkpoint.dir location

2013-09-05 Thread Harsh J
These configs need to be present on the SNN, not just on the NN.

On Thu, Sep 5, 2013 at 11:54 PM, Munna munnava...@gmail.com wrote:
 Hi Yadav,

 We are using CDH3 and I restarted after changing configuration.
   <property>
     <name>fs.checkpoint.dir</name>
     <value>/data/1/dfs/snn,/nfsmount/dfs/snn</value>
     <final>true</final>
   </property>
   <property>
     <name>fs.checkpoint.period</name>
     <value>3600</value>
     <description>The number of seconds between two periodic checkpoints</description>
   </property>

 I have entered these changes in Namenode only.


 On Thu, Sep 5, 2013 at 11:47 PM, Jitendra Yadav jeetuyadav200...@gmail.com
 wrote:

 Please share your Hadoop version and hdfs-site.xml conf also I'm
 assuming that you already restarted your cluster after changing
 fs.checkpoint.dir.

 Thanks
 On 9/5/13, Munna munnava...@gmail.com wrote:
  Hi,
 
  I have configured fs.checkpoint.dir in hdfs-site.xml, but still it was
  writing in /tmp location. Please give me some solution for checkpointing
  on
  respective location.
 
  --
  *Regards*
  *
  *
  *Munna*
 




 --
 Regards

 Munna



-- 
Harsh J


Re: SNN not writing data fs.checkpoint.dir location

2013-09-05 Thread Jitendra Yadav
Hi,

Well, I think you should only need to restart your SNN after the change. Also
check the checkpoint directory for an 'in_use.lock' file.

Thanks
Jitendra
On 9/6/13, Munna munnava...@gmail.com wrote:
 Thank you Jitendar.

 After chage these perameter on SNN, is it require to restart NN also?
 please confirm...


 On Fri, Sep 6, 2013 at 12:10 AM, Jitendra Yadav
 jeetuyadav200...@gmail.comwrote:

 Hi,

 If you are running SNN on same node as NN then it's ok otherwise you
 should add these  properties at SNN side too.


 Thanks
 Jitendra
 On 9/6/13, Munna munnava...@gmail.com wrote:
  You mean that the same configurations as on the NN are required on the SNN?
 
 
  On Thu, Sep 5, 2013 at 11:58 PM, Harsh J ha...@cloudera.com wrote:
 
  These configs need to be present at SNN, not at just the NN.
 
  On Thu, Sep 5, 2013 at 11:54 PM, Munna munnava...@gmail.com wrote:
   Hi Yadav,
  
   We are using CDH3 and I restarted after changing configuration.
    <property>
      <name>fs.checkpoint.dir</name>
      <value>/data/1/dfs/snn,/nfsmount/dfs/snn</value>
      <final>true</final>
    </property>
    <property>
      <name>fs.checkpoint.period</name>
      <value>3600</value>
      <description>The number of seconds between two periodic checkpoints</description>
    </property>
  
   I have entered these changes in Namenode only.
  
  
   On Thu, Sep 5, 2013 at 11:47 PM, Jitendra Yadav 
  jeetuyadav200...@gmail.com
   wrote:
  
   Please share your Hadoop version and hdfs-site.xml conf also I'm
   assuming that you already restarted your cluster after changing
   fs.checkpoint.dir.
  
   Thanks
   On 9/5/13, Munna munnava...@gmail.com wrote:
Hi,
   
I have configured fs.checkpoint.dir in hdfs-site.xml, but still
it
was
writing in /tmp location. Please give me some solution for
  checkpointing
on
respective location.
   
--
*Regards*
*
*
*Munna*
   
  
  
  
  
   --
   Regards
  
   Munna
 
 
 
  --
  Harsh J
 
 
 
 
  --
  *Regards*
  *
  *
  *Munna*
 




 --
 *Regards*
 *
 *
 *Munna*



Re: SNN not writing data fs.checkpoint.dir location

2013-09-05 Thread Munna
in_use.lock ?


On Fri, Sep 6, 2013 at 12:26 AM, Jitendra Yadav
jeetuyadav200...@gmail.comwrote:

 Hi,

 Well I think you should only restart your SNN after the change. Also
 refer the checkpoint directory for any 'in_use.lock' file.

 Thanks
 Jitendra
 On 9/6/13, Munna munnava...@gmail.com wrote:
  Thank you Jitendar.
 
  After chage these perameter on SNN, is it require to restart NN also?
  please confirm...
 
 
  On Fri, Sep 6, 2013 at 12:10 AM, Jitendra Yadav
  jeetuyadav200...@gmail.comwrote:
 
  Hi,
 
  If you are running SNN on same node as NN then it's ok otherwise you
  should add these  properties at SNN side too.
 
 
  Thanks
  Jitendra
  On 9/6/13, Munna munnava...@gmail.com wrote:
   you mean that same configurations are required as NN in SNN
  
  
   On Thu, Sep 5, 2013 at 11:58 PM, Harsh J ha...@cloudera.com wrote:
  
   These configs need to be present at SNN, not at just the NN.
  
   On Thu, Sep 5, 2013 at 11:54 PM, Munna munnava...@gmail.com wrote:
Hi Yadav,
   
We are using CDH3 and I restarted after changing configuration.
    <property>
      <name>fs.checkpoint.dir</name>
      <value>/data/1/dfs/snn,/nfsmount/dfs/snn</value>
      <final>true</final>
    </property>
    <property>
      <name>fs.checkpoint.period</name>
      <value>3600</value>
      <description>The number of seconds between two periodic checkpoints</description>
    </property>
   
I have entered these changes in Namenode only.
   
   
On Thu, Sep 5, 2013 at 11:47 PM, Jitendra Yadav 
   jeetuyadav200...@gmail.com
wrote:
   
Please share your Hadoop version and hdfs-site.xml conf also I'm
assuming that you already restarted your cluster after changing
fs.checkpoint.dir.
   
Thanks
On 9/5/13, Munna munnava...@gmail.com wrote:
 Hi,

 I have configured fs.checkpoint.dir in hdfs-site.xml, but still
 it
 was
 writing in /tmp location. Please give me some solution for
   checkpointing
 on
 respective location.

 --
 *Regards*
 *
 *
 *Munna*

   
   
   
   
--
Regards
   
Munna
  
  
  
   --
   Harsh J
  
  
  
  
   --
   *Regards*
   *
   *
   *Munna*
  
 
 
 
 
  --
  *Regards*
  *
  *
  *Munna*
 




-- 
*Regards*
*
*
*Munna*


Re: SNN not writing data fs.checkpoint.dir location

2013-09-05 Thread Jitendra Yadav
Hi,

This means that your specified checkpoint directory has been locked by
SNN for use.

Thanks
Jitendra
On 9/6/13, Munna munnava...@gmail.com wrote:
 in_use.lock ?


 On Fri, Sep 6, 2013 at 12:26 AM, Jitendra Yadav
 jeetuyadav200...@gmail.comwrote:

 Hi,

 Well I think you should only restart your SNN after the change. Also
 refer the checkpoint directory for any 'in_use.lock' file.

 Thanks
 Jitendra
 On 9/6/13, Munna munnava...@gmail.com wrote:
  Thank you Jitendar.
 
  After chage these perameter on SNN, is it require to restart NN also?
  please confirm...
 
 
  On Fri, Sep 6, 2013 at 12:10 AM, Jitendra Yadav
  jeetuyadav200...@gmail.comwrote:
 
  Hi,
 
  If you are running SNN on same node as NN then it's ok otherwise you
  should add these  properties at SNN side too.
 
 
  Thanks
  Jitendra
  On 9/6/13, Munna munnava...@gmail.com wrote:
   you mean that same configurations are required as NN in SNN
  
  
   On Thu, Sep 5, 2013 at 11:58 PM, Harsh J ha...@cloudera.com wrote:
  
   These configs need to be present at SNN, not at just the NN.
  
   On Thu, Sep 5, 2013 at 11:54 PM, Munna munnava...@gmail.com
   wrote:
Hi Yadav,
   
We are using CDH3 and I restarted after changing configuration.
    <property>
      <name>fs.checkpoint.dir</name>
      <value>/data/1/dfs/snn,/nfsmount/dfs/snn</value>
      <final>true</final>
    </property>
    <property>
      <name>fs.checkpoint.period</name>
      <value>3600</value>
      <description>The number of seconds between two periodic checkpoints</description>
    </property>
   
I have entered these changes in Namenode only.
   
   
On Thu, Sep 5, 2013 at 11:47 PM, Jitendra Yadav 
   jeetuyadav200...@gmail.com
wrote:
   
Please share your Hadoop version and hdfs-site.xml conf also I'm
assuming that you already restarted your cluster after changing
fs.checkpoint.dir.
   
Thanks
On 9/5/13, Munna munnava...@gmail.com wrote:
 Hi,

 I have configured fs.checkpoint.dir in hdfs-site.xml, but
 still
 it
 was
 writing in /tmp location. Please give me some solution for
   checkpointing
 on
 respective location.

 --
 *Regards*
 *
 *
 *Munna*

   
   
   
   
--
Regards
   
Munna
  
  
  
   --
   Harsh J
  
  
  
  
   --
   *Regards*
   *
   *
   *Munna*
  
 
 
 
 
  --
  *Regards*
  *
  *
  *Munna*
 




 --
 *Regards*
 *
 *
 *Munna*



Re: ContainerLaunchContext in 2.1.x

2013-09-05 Thread Omkar Joshi
Good question... There was a security problem earlier and to address that
we removed it from ContainerLaunchContext.

Today, if you check the payload, we are sending a Container which contains a
ContainerToken. The ContainerToken is the secure channel for the RM to tell the
NM about:
1) ContainerId
2) Resource
3) User
4) NodeId

It is present there by default (irrespective of security). I hope it
answers your doubt.
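
A minimal sketch of that association from the client side, assuming the 2.1.x
NMClient API (the wrapper class name is illustrative): the allocated Container
carries the token, and it is handed to the NodeManager together with the
matching ContainerLaunchContext, which is how each context gets tied to exactly
one container.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;

public class LaunchSketch {
  public static void launch(Container container, ContainerLaunchContext ctx)
      throws Exception {
    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(new Configuration());
    nmClient.start();
    // The container (with its token) and the launch context travel together.
    nmClient.startContainer(container, ctx);
  }
}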

Thanks,
Omkar Joshi
*Hortonworks Inc.* http://www.hortonworks.com


On Wed, Sep 4, 2013 at 2:51 AM, Janne Valkealahti 
janne.valkeala...@gmail.com wrote:

 With 2.0.x, ContainerId was part of the ContainerLaunchContext, and I assume
 the container id was then used to identify what the node manager would
 actually start.

 With 2.1.x, ContainerId was removed from ContainerLaunchContext.
 ContainerManagementProtocol only uses a list of StartContainerRequest, which
 have a ContainerLaunchContext and a Token.

 My first question is: if you have different ContainerLaunchContexts (i.e.
 command, env variables, etc.), how do you know which container is launched
 with which launch context?

 My second question is: how is the node manager associating an allocated
 container (which you requested from the resource manager) with a
 ContainerLaunchContext?




Re: ContainerLaunchContext in 2.1.x

2013-09-05 Thread Jian He
Other than that, you can find all API incompatible changes from 2.0.x to
2.1.x in this link:

http://hortonworks.com/blog/stabilizing-yarn-apis-for-apache-hadoop-2-beta-and-beyond/

Jian


On Thu, Sep 5, 2013 at 10:44 AM, Omkar Joshi ojo...@hortonworks.com wrote:

 Good question... There was a security problem earlier and to address that
 we removed it from ContainerLaunchContext.

 Today if you check the payload we are sending Container which contains
 ContainerToken. ContainerToken is the secured channel for RM to tell NM
 about
 1) ContainerId
 2) Resource
 3) User
 4) NodeId

 It is present there by default (irrespective of security). I hope it
 answers your doubt.

 Thanks,
 Omkar Joshi
 *Hortonworks Inc.* http://www.hortonworks.com


 On Wed, Sep 4, 2013 at 2:51 AM, Janne Valkealahti 
 janne.valkeala...@gmail.com wrote:

 With 2.0.x ContainerId was part of the ContainerLaunchContext and I assume
 container id was then used to identify what node manager would actually
 start.

 With 2.1.x ContainerId was removed from ContainerLaunchContext.
 ContainerManagementProtocol is only using a list of StartContainerRequest
 which have ContainerLaunchContext and Token.

 My first question is that if you have different
 ContainerLaunchContext(i.e.
 command, env variables, etc), how do you know which container is launched
 with which launch context?

 My second question is how node manager is assosiating allocated
 container(which you requested from resource manager) to
 ContainerLaunchContext?






Re: SNN not writing data fs.checkpoint.dir location

2013-09-05 Thread Jitendra Yadav
Hi,

If you are running the SNN on the same node as the NN then it's OK; otherwise
you should add these properties on the SNN side too.


Thanks
Jitendra
On 9/6/13, Munna munnava...@gmail.com wrote:
 you mean that same configurations are required as NN in SNN


 On Thu, Sep 5, 2013 at 11:58 PM, Harsh J ha...@cloudera.com wrote:

 These configs need to be present at SNN, not at just the NN.

 On Thu, Sep 5, 2013 at 11:54 PM, Munna munnava...@gmail.com wrote:
  Hi Yadav,
 
  We are using CDH3 and I restarted after changing configuration.
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/data/1/dfs/snn,/nfsmount/dfs/snn</value>
    <final>true</final>
  </property>
  <property>
    <name>fs.checkpoint.period</name>
    <value>3600</value>
    <description>The number of seconds between two periodic checkpoints</description>
  </property>
 
  I have entered these changes in Namenode only.
 
 
  On Thu, Sep 5, 2013 at 11:47 PM, Jitendra Yadav 
 jeetuyadav200...@gmail.com
  wrote:
 
  Please share your Hadoop version and hdfs-site.xml conf also I'm
  assuming that you already restarted your cluster after changing
  fs.checkpoint.dir.
 
  Thanks
  On 9/5/13, Munna munnava...@gmail.com wrote:
   Hi,
  
   I have configured fs.checkpoint.dir in hdfs-site.xml, but still it
   was
   writing in /tmp location. Please give me some solution for
 checkpointing
   on
   respective location.
  
   --
   *Regards*
   *
   *
   *Munna*
  
 
 
 
 
  --
  Regards
 
  Munna



 --
 Harsh J




 --
 *Regards*
 *
 *
 *Munna*



Re: SNN not writing data fs.checkpoint.dir location

2013-09-05 Thread Munna
Thank you, Jitendra.

After changing these parameters on the SNN, is it required to restart the NN
also? Please confirm...


On Fri, Sep 6, 2013 at 12:10 AM, Jitendra Yadav
jeetuyadav200...@gmail.comwrote:

 Hi,

 If you are running SNN on same node as NN then it's ok otherwise you
 should add these  properties at SNN side too.


 Thanks
 Jitendra
 On 9/6/13, Munna munnava...@gmail.com wrote:
  you mean that same configurations are required as NN in SNN
 
 
  On Thu, Sep 5, 2013 at 11:58 PM, Harsh J ha...@cloudera.com wrote:
 
  These configs need to be present at SNN, not at just the NN.
 
  On Thu, Sep 5, 2013 at 11:54 PM, Munna munnava...@gmail.com wrote:
   Hi Yadav,
  
   We are using CDH3 and I restarted after changing configuration.
   <property>
     <name>fs.checkpoint.dir</name>
     <value>/data/1/dfs/snn,/nfsmount/dfs/snn</value>
     <final>true</final>
   </property>
   <property>
     <name>fs.checkpoint.period</name>
     <value>3600</value>
     <description>The number of seconds between two periodic checkpoints</description>
   </property>
  
   I have entered these changes in Namenode only.
  
  
   On Thu, Sep 5, 2013 at 11:47 PM, Jitendra Yadav 
  jeetuyadav200...@gmail.com
   wrote:
  
   Please share your Hadoop version and hdfs-site.xml conf also I'm
   assuming that you already restarted your cluster after changing
   fs.checkpoint.dir.
  
   Thanks
   On 9/5/13, Munna munnava...@gmail.com wrote:
Hi,
   
I have configured fs.checkpoint.dir in hdfs-site.xml, but still it
was
writing in /tmp location. Please give me some solution for
  checkpointing
on
respective location.
   
--
*Regards*
*
*
*Munna*
   
  
  
  
  
   --
   Regards
  
   Munna
 
 
 
  --
  Harsh J
 
 
 
 
  --
  *Regards*
  *
  *
  *Munna*
 




-- 
*Regards*
*
*
*Munna*


RE: yarn-site.xml and aux-services

2013-09-05 Thread John Lilley
https://issues.apache.org/jira/browse/YARN-1151
--john

-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Thursday, September 05, 2013 12:14 PM
To: user@hadoop.apache.org
Subject: Re: yarn-site.xml and aux-services

Please log a JIRA on https://issues.apache.org/jira/browse/YARN (do let the 
thread know the ID as well, in spirit of http://xkcd.com/979/)
:)

On Thu, Sep 5, 2013 at 11:41 PM, John Lilley john.lil...@redpoint.net wrote:
 Harsh,

 Thanks as usual for your sage advice.  I was hoping to avoid actually 
 installing anything on individual Hadoop nodes and finessing the service by 
 spawning it from a task using LocalResources, but this is probably fraught 
 with trouble.

 FWIW, I would vote to be able to load YARN services from HDFS.  What is the 
 appropriate forum to file a request like that?

 Thanks
 John

 -Original Message-
 From: Harsh J [mailto:ha...@cloudera.com]
 Sent: Wednesday, September 04, 2013 12:05 AM
 To: user@hadoop.apache.org
 Subject: Re: yarn-site.xml and aux-services

 Thanks for the clarification.  I would find it very convenient in this case 
 to have my custom jars available in HDFS, but I can see the added complexity 
 needed for YARN to maintain cache those to local disk.

 We could class-load directly from HDFS, like HBase Co-Processors do.

 Consider a scenario analogous to the MR shuffle, where the persistent 
 service serves up mapper output files to the reducers across the network:

 Isn't this more complex than just running a dedicated service all the time, 
 and/or implementing a way to spawn/end a dedicated service temporarily? I'd 
 pick trying to implement such a thing than have my containers implement more 
 logic.

 On Fri, Aug 23, 2013 at 11:17 PM, John Lilley john.lil...@redpoint.net 
 wrote:
 Harsh,

 Thanks for the clarification.  I would find it very convenient in this case 
 to have my custom jars available in HDFS, but I can see the added complexity 
 needed for YARN to maintain cache those to local disk.

 What about having the tasks themselves start the per-node service as a child 
 process?   I've been told that the NM kills the process group, but won't 
 setgrp() circumvent that?

 Even given that, would the child process of one task have proper environment 
 and permission to act on behalf of other tasks?  Consider a scenario 
 analogous to the MR shuffle, where the persistent service serves up mapper 
 output files to the reducers across the network:
 1) AM spawns mapper-like tasks around the cluster
 2) Each mapper-like task on a given node launches a persistent service 
 child, but only if one is not already running.
 3) Each mapper-like task writes one or more output files, and informs the 
 service of those files (along with AM-id, Task-id etc).
 4) AM spawns reducer-like tasks around the cluster.
 5) Each reducer-like task is told which nodes contain mapper result data, 
 and connects to services on those nodes to read the data.

 There are some details missing, like how the lifetime of the temporary files 
 is controlled to extend beyond the mapper-like task lifetime but still be 
 cleaned up on AM exit, and how the reducer-like tasks are informed of which 
 nodes have data.

 John


 -Original Message-
 From: Harsh J [mailto:ha...@cloudera.com]
 Sent: Friday, August 23, 2013 11:00 AM
 To: user@hadoop.apache.org
 Subject: Re: yarn-site.xml and aux-services

 The general practice is to install your deps into a custom location such as 
 /opt/john-jars, and extend YARN_CLASSPATH to include the jars, while also 
 configuring the classes under the aux-services list. You need to take care 
 of deploying jar versions to /opt/john-jars/ contents across the cluster 
 though.

 I think it may be a neat idea to have jars be placed on HDFS or any other 
 DFS, and the yarn-site.xml indicating the location plus class to load. 
 Similar to HBase co-processors. But I'll defer to Vinod on if this would be 
 a good thing to do.

 (I know the right next thing with such an ability people will ask for 
 is hot-code-upgrades...)

 On Fri, Aug 23, 2013 at 10:11 PM, John Lilley john.lil...@redpoint.net 
 wrote:
 Are there recommended conventions for adding additional code to a 
 stock Hadoop install?

 It would be nice if we could piggyback on whatever mechanisms are 
 used to distribute hadoop itself around the cluster.

 john



 From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com]
 Sent: Thursday, August 22, 2013 6:25 PM


 To: user@hadoop.apache.org
 Subject: Re: yarn-site.xml and aux-services





 Auxiliary services are essentially administrator-configured services.
 So, they have to be set up at install time - before the NM is started.



 +Vinod



 On Thu, Aug 22, 2013 at 1:38 PM, John Lilley 
 john.lil...@redpoint.net
 wrote:

 Following up on this, how exactly does one *install* the jar(s) for 
 auxiliary service?  Can it be shipped out with the LocalResources of an AM?
 MapReduce's aux-service is presumably 

Re: Disc not equally utilized in hdfs data nodes

2013-09-05 Thread Harsh J
The spaces may be a problem if you are using the older 1.x releases.
Please try to specify the list without spaces, and also check if all
of these paths exist and have some DN owned directories under them.

Please also keep the lists in CC/TO when replying. Clicking
Reply-to-all usually helps do this automatically.
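
For reference, a hedged sketch of the hdfs-site.xml entry using the mounts
quoted below, written as a single comma-separated list with no spaces (1.x
property name):

<property>
  <name>dfs.data.dir</name>
  <value>/mnt/hadoop0/hdfs,/mnt/hadoop1/hdfs,/mnt/hadoop2/hdfs,/mnt/hadoop3/hdfs</value>
</property>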

On Thu, Sep 5, 2013 at 11:16 PM, Viswanathan J
jayamviswanat...@gmail.com wrote:
 Hi Harsh,

 dfs.data.dir property we defined the values as in comma separated,

 /mnt/hadoop0/hdfs, /mnt/hadoop1/hdfs, /mnt/hadoop2/hdfs, /mnt/hadoop3/hdfs

 The above values are different devices.

 Thanks,
 V

 On Sep 5, 2013 10:53 PM, Harsh J ha...@cloudera.com wrote:

 Please share your hdfs-site.xml. HDFS needs to be configured to use
 all 4 disk mounts - it does not auto-discover and use all drives
 today.

 On Thu, Sep 5, 2013 at 10:48 PM, Viswanathan J
 jayamviswanat...@gmail.com wrote:
  Hi,
 
  The data which are storing in data nodes are not equally utilized in all
  the
  data directories.
 
  We having 4x1 TB drives, but huge data storing in single disc only at
  all
  the nodes. How to balance for utilize all the drives.
 
  This causes the hdfs storage size becomes high very soon even though we
  have
  available space.
 
  Thanks,
  Viswa.J



 --
 Harsh J



-- 
Harsh J


How to support the (HDFS) FileSystem API of various Hadoop Distributions?

2013-09-05 Thread Christian Schneider
Hi,
I started writing a small ncdu clone to browse HDFS on the CLI (
http://nchadoop.org/). Currently I'm testing it against CDH4, but I would like
to make it available to a wider group of users (Hortonworks, ...).

Is it enough to build against the corresponding vanilla versions (for IPC 5 and 7)?

Best Regards,
Christian.


How to speed up Hadoop?

2013-09-05 Thread Sundeep Kambhampati

Hi all,

I am looking for ways to configure Hadoop in order to speed up data
processing. Assuming all my nodes are highly fault tolerant, will setting the
data replication factor to 1 speed up the processing? Is there some way to
disable the failure monitoring done by Hadoop?


Thank you for your time.

-Sundeep


Re: How to speed up Hadoop?

2013-09-05 Thread Chris Embree
I think you just went backwards: more replicas (generally speaking) are
better.

I'd take 60 cheap 1U servers over 20 highly fault-tolerant ones for almost
every problem. I'd get them for the same or less $ too.




On Thu, Sep 5, 2013 at 8:41 PM, Sundeep Kambhampati 
kambh...@cse.ohio-state.edu wrote:

 Hi all,

 I am looking for ways to configure Hadoop inorder to speed up data
 processing. Assuming all my nodes are highly fault tolerant, will making
 data replication factor 1 speed up the processing? Are there some way to
 disable failure monitoring done by Hadoop?

 Thank you for your time.

 -Sundeep



Re: How to speed up Hadoop?

2013-09-05 Thread Preethi Vinayak Ponangi
Solution 1: Throw more hardware at the cluster. That's the whole point of
Hadoop.
Solution 2: Try to optimize the MapReduce jobs. It depends on what kind of jobs
you are running.

I wouldn't suggest decreasing the replication factor, as it kind of defeats the
purpose of using Hadoop. You could do this if you can't get more hardware and
are running experimental, non-critical, non-production data.

What kind of Hadoop monitoring are you talking about?

Regards,
Vinayak.


On Thu, Sep 5, 2013 at 7:51 PM, Chris Embree cemb...@gmail.com wrote:

 I think you just went backwards.   more replicas (generally speaking) are
 better.

 I'd take 60 cheap, 1 U servers over 20 highly fault tolerant ones for
 almost every problem.  I'd get them for the same or less $ too.




 On Thu, Sep 5, 2013 at 8:41 PM, Sundeep Kambhampati 
 kambh...@cse.ohio-state.edu wrote:

 Hi all,

 I am looking for ways to configure Hadoop inorder to speed up data
 processing. Assuming all my nodes are highly fault tolerant, will making
 data replication factor 1 speed up the processing? Are there some way to
 disable failure monitoring done by Hadoop?

 Thank you for your time.

 -Sundeep





Re: Disc not equally utilized in hdfs data nodes

2013-09-05 Thread Viswanathan J
Thanks Harsh. I hope I don't have spaces in the list I specified in the last
mail.

Thanks,
V
On Sep 5, 2013 11:20 PM, Harsh J ha...@cloudera.com wrote:

 The spaces may be a problem if you are using the older 1.x releases.
 Please try to specify the list without spaces, and also check if all
 of these paths exist and have some DN owned directories under them.

 Please also keep the lists in CC/TO when replying. Clicking
 Reply-to-all usually helps do this automatically.

 On Thu, Sep 5, 2013 at 11:16 PM, Viswanathan J
 jayamviswanat...@gmail.com wrote:
  Hi Harsh,
 
  dfs.data.dir property we defined the values as in comma separated,
 
  /mnt/hadoop0/hdfs, /mnt/hadoop1/hdfs, /mnt/hadoop2/hdfs,
 /mnt/hadoop3/hdfs
 
  The above values are different devices.
 
  Thanks,
  V
 
  On Sep 5, 2013 10:53 PM, Harsh J ha...@cloudera.com wrote:
 
  Please share your hdfs-site.xml. HDFS needs to be configured to use
  all 4 disk mounts - it does not auto-discover and use all drives
  today.
 
  On Thu, Sep 5, 2013 at 10:48 PM, Viswanathan J
  jayamviswanat...@gmail.com wrote:
   Hi,
  
   The data being stored on the data nodes is not spread equally across
   all the data directories.
  
   We have 4x1 TB drives, but most of the data is being written to a single
   disk on each of the nodes. How can we balance it to utilize all the drives?
  
   This causes the HDFS storage on that disk to fill up very soon even though
   we have space available elsewhere.
  
   Thanks,
   Viswa.J
 
 
 
  --
  Harsh J



 --
 Harsh J
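
For illustration, a minimal hdfs-site.xml sketch of the space-free,
comma-separated form Harsh suggests, using the mount points from the
earlier mail (a datanode only round-robins new blocks across the listed
directories; blocks that were already written are not moved between disks):

  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/hadoop0/hdfs,/mnt/hadoop1/hdfs,/mnt/hadoop2/hdfs,/mnt/hadoop3/hdfs</value>
  </property>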



Re: How to speed up Hadoop?

2013-09-05 Thread Peyman Mohajerian
How about this: http://hadoop.apache.org/docs/stable/vaidya.html
I've never tried it myself, I was just reading about it today.


On Thu, Sep 5, 2013 at 5:57 PM, Preethi Vinayak Ponangi 
vinayakpona...@gmail.com wrote:

 Solution 1: Throw more hardware at the cluster. That's the whole point of
 hadoop.
 Solution 2: Try to optimize the mapreduce jobs. It depends on what kind of
 jobs you are running.

 I wouldn't suggest decreasing the number of replications as it kind of
 defeats the purpose of using Hadoop. You could do this if you can't get
 more hardware, are running experimental non-critical non-production data.

 What kind of Hadoop monitoring are you talking about?

 Regards,
 Vinayak.


 On Thu, Sep 5, 2013 at 7:51 PM, Chris Embree cemb...@gmail.com wrote:

 I think you just went backwards.   more replicas (generally speaking) are
 better.

 I'd take 60 cheap, 1 U servers over 20 highly fault tolerant ones for
 almost every problem.  I'd get them for the same or less $ too.




 On Thu, Sep 5, 2013 at 8:41 PM, Sundeep Kambhampati 
 kambh...@cse.ohio-state.edu wrote:

 Hi all,

 I am looking for ways to configure Hadoop in order to speed up data
 processing. Assuming all my nodes are highly fault tolerant, will making
 the data replication factor 1 speed up the processing? Is there some way to
 disable the failure monitoring done by Hadoop?

 Thank you for your time.

 -Sundeep






Re: How to speed up Hadoop?

2013-09-05 Thread Sundeep Kambhampati

On 9/5/2013 8:57 PM, Preethi Vinayak Ponangi wrote:
Solution 1: Throw more hardware at the cluster. That's the whole point 
of hadoop.
Solution 2: Try to optimize the mapreduce jobs. It depends on what 
kind of jobs you are running.


I wouldn't suggest decreasing the number of replications as it kind of 
defeats the purpose of using Hadoop. You could do this if you can't 
get more hardware, are running experimental non-critical 
non-production data.


What kind of Hadoop monitoring are you talking about?

Regards,
Vinayak.


On Thu, Sep 5, 2013 at 7:51 PM, Chris Embree cemb...@gmail.com wrote:


I think you just went backwards.   more replicas (generally
speaking) are better.

I'd take 60 cheap, 1 U servers over 20 highly fault tolerant
ones for almost every problem.  I'd get them for the same or less
$ too.




On Thu, Sep 5, 2013 at 8:41 PM, Sundeep Kambhampati
kambh...@cse.ohio-state.edu wrote:

Hi all,

I am looking for ways to configure Hadoop in order to speed
up data processing. Assuming all my nodes are highly fault
tolerant, will making the data replication factor 1 speed up the
processing? Is there some way to disable the failure monitoring
done by Hadoop?

Thank you for your time.

-Sundeep




Thank you for your inputs. I can't currently add more hardware.

By monitoring I mean something like speculative execution.

Regards,
Sundeep
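
Since speculative execution came up: it can be switched off per job. A
minimal sketch against the Hadoop 1.x mapred API (the class name is
illustrative; on Hadoop 2 the properties are mapreduce.map.speculative and
mapreduce.reduce.speculative instead):

  import org.apache.hadoop.mapred.JobConf;

  public class NoSpeculationJob {
      public static void main(String[] args) {
          JobConf conf = new JobConf(NoSpeculationJob.class);
          // Stop the framework from launching duplicate "backup" attempts
          // of slow map and reduce tasks.
          conf.setMapSpeculativeExecution(false);
          conf.setReduceSpeculativeExecution(false);
          // Equivalent raw property names in Hadoop 1.x:
          conf.setBoolean("mapred.map.tasks.speculative.execution", false);
          conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
          // ... set mapper/reducer, input/output paths, then JobClient.runJob(conf)
      }
  }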


Re: How to support the (HDFS) FileSystem API of various Hadoop Distributions?

2013-09-05 Thread Harsh J
Hello,

There are a few additions to the FileSystem that may bite you across
versions, but if you pick an old stable version such as Apache Hadoop
0.20.2, and stick to only its offered APIs, it would work better
across different version dependencies, as we try to maintain FileSystem
as a stable interface as much as we can (there was also more recent
work to ensure its stabilization). I looked over your current code
state and it seemed to have pretty stable calls that I think have
existed across several versions and exist today, but I did notice you
had to remove an isRoot as part of a previous commit, which may have
led to this question?

If that doesn't work for you, you can also switch out to using
sub-modules carrying code specific to a build version type (such as
what HBase does at https://github.com/apache/hbase/tree/trunk/ (see
the hbase-hadoop-compat directories)).

On Fri, Sep 6, 2013 at 2:59 AM, Christian Schneider
cschneiderpub...@gmail.com wrote:
 Hi,
 I started to write a small ncdu clone to browse HDFS on the CLI
 (http://nchadoop.org/). Currently I'm testing it against CDH4, but I'd like
 to make it available for a wider group of users (Hortonworks, ..).

 Is it enough to pick different vanilla versions (for IPC 5, 7)?

 Best Regards,
 Christian.




-- 
Harsh J
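
For illustration, a rough sketch of a directory listing that sticks to
FileSystem calls which have been available since 0.20.x (FileSystem.get,
listStatus, getContentSummary); the class name is made up and error
handling is omitted:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.ContentSummary;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class DuSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
          FileSystem fs = FileSystem.get(conf);          // the configured default filesystem
          Path dir = new Path(args.length > 0 ? args[0] : "/");
          for (FileStatus status : fs.listStatus(dir)) { // one directory level, like ncdu's top view
              ContentSummary cs = fs.getContentSummary(status.getPath());
              System.out.printf("%12d  %s%n", cs.getLength(), status.getPath().getName());
          }
          fs.close();
      }
  }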


Re: How to support the (HDFS) FileSystem API of various Hadoop Distributions?

2013-09-05 Thread Harsh J
Oh and btw, nice utility! :)

On Fri, Sep 6, 2013 at 7:50 AM, Harsh J ha...@cloudera.com wrote:
 Hello,

 There are a few additions to the FileSystem that may bite you across
 versions, but if you pick an old stable version such as Apache Hadoop
 0.20.2, and stick to only its offered APIs, it would work better
 across different version dependencies as we try to maintain FileSystem
 as a stable interface as much as we can (there was also more recent
 work to ensure the stabilization). I looked over your current code
 state and it seemed to have pretty stable calls that I think have
 existed across several versions and exists today, but I did notice you
 had to remove an isRoot as part of a previous commit, which may have
 lead to this question?

 If that doesn't work for you, you can also switch out to using
 sub-modules carrying code specific to a build version type (such as
 what HBase does at https://github.com/apache/hbase/tree/trunk/ (see
 the hbase-hadoop-compat directories)).

 On Fri, Sep 6, 2013 at 2:59 AM, Christian Schneider
 cschneiderpub...@gmail.com wrote:
 Hi,
 I start to write a small ncdu clone to browse HDFS on the CLI
 (http://nchadoop.org/). Currently i'm testing it against CDH4, - but I like
 to make it available for a wider group of users (Hortonworks, ..).

 Is it enough to pick different vanilla Versions (for IPC 5, 7)?

 Best Regards,
 Christian.




 --
 Harsh J



-- 
Harsh J


Re: How to speed up Hadoop?

2013-09-05 Thread Harsh J
I'd recommend reading Eric Sammer's Hadoop Operations (O'Reilly)
book. It goes over a lot of this stuff - building, monitoring, tuning,
optimizing, etc..

If your goal is just speed and quicker results, and not retention or
safety, by all means use a replication factor of 1. Note that it's
difficult for us to suggest configs unless you also share your
use-case (in brief) or goals. While the software is highly tunable, a
lot of tweaks depend on what you are planning to do.

On Fri, Sep 6, 2013 at 6:11 AM, Sundeep Kambhampati
kambh...@cse.ohio-state.edu wrote:
 Hi all,

 I am looking for ways to configure Hadoop in order to speed up data
 processing. Assuming all my nodes are highly fault tolerant, will making
 the data replication factor 1 speed up the processing? Is there some way to
 disable the failure monitoring done by Hadoop?

 Thank you for your time.

 -Sundeep



-- 
Harsh J
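
For illustration, a minimal client-side sketch of using replication factor
1 (the cluster-wide default is dfs.replication in hdfs-site.xml, and
existing files can also be changed with hadoop fs -setrep -R 1 path); the
path below is made up:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SingleReplicaSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          conf.setInt("dfs.replication", 1);   // files created by this client get one replica
          FileSystem fs = FileSystem.get(conf);
          // Drop the replication of data that is already in HDFS:
          fs.setReplication(new Path("/user/sundeep/data"), (short) 1);
          fs.close();
      }
  }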


RE: Question related to resource allocation in Yarn!

2013-09-05 Thread Devaraj k
Hi Rahul,

Could you tell me, what is the version you are using?

- If you want a container, you need to issue 3 resource requests
(1 node-local, 1 rack-local and 1 Any(*)). If you are using 2.1.0-beta
or later versions, you can set the Relax Locality flag to false to get
containers only on the specified host.
Can you also share the code showing how you are requesting containers, so that
we can help you better?

Thanks
Devaraj k
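
For illustration, a rough sketch of the ask described above, written
against the 2.1.0-beta AMRMClient API (method names differ a bit on
2.0.4-alpha; the host name and priority are made up, and RM registration is
shown only minimally):

  import org.apache.hadoop.yarn.api.records.Priority;
  import org.apache.hadoop.yarn.api.records.Resource;
  import org.apache.hadoop.yarn.client.api.AMRMClient;
  import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
  import org.apache.hadoop.yarn.conf.YarnConfiguration;
  import org.apache.hadoop.yarn.util.Records;

  public class AskSketch {
      public static void main(String[] args) throws Exception {
          AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
          rm.init(new YarnConfiguration());
          rm.start();
          rm.registerApplicationMaster("", 0, "");   // AM host / port / tracking URL

          Resource capability = Records.newRecord(Resource.class);
          capability.setMemory(1024);                // 1 GB per container
          capability.setVirtualCores(1);
          Priority priority = Records.newRecord(Priority.class);
          priority.setPriority(0);

          // relaxLocality=false: containers only on the named host, no rack/any fallback.
          ContainerRequest ask = new ContainerRequest(
                  capability, new String[] { "node1.example.com" }, null, priority, false);
          rm.addContainerRequest(ask);

          // Granted containers arrive over subsequent heartbeats, not all at once:
          // rm.allocate(progress).getAllocatedContainers()
      }
  }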

From: Rahul Bhattacharjee [mailto:rahul.rec@gmail.com]
Sent: 06 September 2013 09:43
To: user@hadoop.apache.org
Subject: Re: Question related to resource allocation in Yarn!

I could progress a bit on this.

I was not setting responseId while asking for containers.
Still, I have one question: why am I only being allocated two containers
when the node manager can run more?

Response while registering the application master -
AM registration response minimumCapability {, memory: 1024, virtual_cores: 1, 
}, maximumCapability {, memory: 8192, virtual_cores: 32, },
Thanks,
Rahul

On Thu, Sep 5, 2013 at 8:33 PM, Rahul Bhattacharjee 
rahul.rec@gmail.com wrote:
Hi,
I am trying to make a small POC on top of YARN.
Within the launched application master, I am trying to request 50 
containers and launch the same task on each of the allocated containers.
My config : AM registration response minimumCapability {, memory: 1024, 
virtual_cores: 1, }, maximumCapability {, memory: 8192, virtual_cores: 32, },
1) I am asking the RM for 1 GB mem and 1 core per container. Ideally the RM should 
return me 6 - 7 containers, but the response always returns only 2 
containers.
Why is that?
2) So, when the first ask returns 2 containers, I then ask the RM again for 
50 - 2 = 48 containers. I keep getting 0 containers, even after the 
previously started containers have finished.
Why is that?
Any link explaining the allocate request of RM would be very helpful.

Thanks,
Rahul



Re: Question related to resource allocation in Yarn!

2013-09-05 Thread Rahul Bhattacharjee
Hi Devaraj,

I am on Hadoop 2.0.4. I am able to get containers now and my YARN app runs
properly. I am setting the hostname as *, while requesting containers.

There is no problem as of now; the only thing is that I am allocated only 2
containers at a time, although I believe the node manager can run
more containers.

2013-09-06 09:53:38,433 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode:
Assigned container container_1378441324025_0001_01_01 of capacity
memory:100, vCores:1 on host storyacid-lm:55407, which currently has 1
containers, memory:100, vCores:1 used and memory:8092, vCores:15
available

I am requesting containers with 100 MB mem and 1 core. If I knew more
about how the capacity is calculated per node, or how the allocation is
done, it would be useful.

Thanks for the help!
Rahul
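
On the per-node capacity question: the NodeManager advertises whatever
resources it is configured with, and the scheduler typically rounds each
request up to its minimum allocation. A yarn-site.xml sketch of the
relevant knobs (property names are from the 2.1 line and the values are
illustrative, not necessarily your cluster's):

  <!-- memory and vcores the NodeManager offers to containers -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>16</value>
  </property>
  <!-- requests are typically rounded up to a multiple of the minimum -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>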







On Fri, Sep 6, 2013 at 10:31 AM, Devaraj k devara...@huawei.com wrote:

  Hi Rahul,


 Could you tell me, what is the version you are using?

  

 - If you want a container, you need to issue 3 resource
 requests (1 node-local, 1 rack-local and 1 Any(*)). If you are using
 2.1.0-beta or later versions, you can set the Relax Locality flag to false
 to get containers only on the specified host.

 Can you also share the code showing how you are requesting containers, so that
 we can help you better?


 Thanks

 Devaraj k


 From: Rahul Bhattacharjee [mailto:rahul.rec@gmail.com]
 Sent: 06 September 2013 09:43
 To: user@hadoop.apache.org
 Subject: Re: Question related to resource allocation in Yarn!


 I could progress a bit on this.

 I was not setting responseId while asking for containers.

 Still, I have one question: why am I only being allocated two containers
 when the node manager can run more?

 Response while registering the application master -
 AM registration response minimumCapability {, memory: 1024, virtual_cores:
 1, }, maximumCapability {, memory: 8192, virtual_cores: 32, }, 

 Thanks,

 Rahul


 On Thu, Sep 5, 2013 at 8:33 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Hi,

 I am trying to make a small poc on top of yarn.

 Within the launched application master, I am trying to request 50
 containers and launch the same task on each of the allocated containers.

 My config : AM registration response minimumCapability {, memory: 1024,
 virtual_cores: 1, }, maximumCapability {, memory: 8192, virtual_cores: 32,
 }, 

 1) I am asking the RM for 1 GB mem and 1 core per container. Ideally the RM
 should return me 6 - 7 containers, but the response always returns
 only 2 containers. 

 Why is that ?

 2) So, when the first ask returns 2 containers, I then ask the RM again
 for 50 - 2 = 48 containers. I keep getting 0 containers,
 even after the previously started containers have finished.

 Why is that ?

 Any link explaining the allocate request of RM would be very helpful.


 Thanks,
 Rahul
