Where is the output of mappers saved?

2014-12-11 Thread Abdul Navaz
Hello,


I am interested in efficiently managing the Hadoop shuffle traffic and
utilizing the network bandwidth effectively. To do this I want to know how
much shuffle traffic is generated by each DataNode. Shuffle traffic is
nothing but the output of the mappers, so where is this mapper output saved?
How can I get the size of the mapper output from each DataNode in real time?
Appreciate your help.
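
For context, the intermediate map output is written to the local disks of the node running each
map task (under the directories given by mapreduce.cluster.local.dir), not to HDFS. Below is a
minimal sketch, assuming the standard MapReduce counters API, that reads the aggregate shuffle
size of a finished job; the class name is hypothetical, and per-DataNode, real-time figures are
not exposed through this route.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class ShuffleSize {
  // Prints the total number of bytes emitted by all mappers (the data that gets shuffled).
  public static void printMapOutputBytes(Job job) throws Exception {
    long mapOutputBytes = job.getCounters()
        .findCounter(TaskCounter.MAP_OUTPUT_BYTES).getValue();
    System.out.println("Total map output (shuffle) bytes: " + mapOutputBytes);
  }
}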

Thanks & Regards,

Abdul Navaz





Re: DistributedCache

2014-12-11 Thread unmesha sreeveni
On Fri, Dec 12, 2014 at 9:55 AM, Shahab Yunus wrote:
>
> job.addCacheFiles


Yes, you can use job.addCacheFiles to cache the file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path cachefile = new Path("path/to/file");
// Expand the (possibly glob) path and register every match in the distributed cache
FileStatus[] list = fs.globStatus(cachefile);
for (FileStatus status : list) {
  DistributedCache.addCacheFile(status.getPath().toUri(), conf);
}

Hope this link helps
[1]
http://unmeshasreeveni.blogspot.in/2014/10/how-to-load-file-in-distributedcache-in.html
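
To read the cached files back inside a task, here is a hedged sketch; the mapper class and
key/value types are hypothetical, and context.getCacheFiles() is the non-deprecated accessor
in the Hadoop 2.x mapreduce API.

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // URIs of all files that were registered for the distributed cache
    URI[] cacheFiles = context.getCacheFiles();
    if (cacheFiles != null) {
      for (URI uri : cacheFiles) {
        // Each file has been localized on the node running this task
        System.err.println("Cached file available: " + uri);
      }
    }
  }
}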


-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: DistributedCache

2014-12-11 Thread Shahab Yunus
Look at this thread. It has alternatives to DistributedCache.
http://stackoverflow.com/questions/21239722/hadoop-distributedcache-is-deprecated-what-is-the-preferred-api

Basically you can use the new method job.addCacheFiles to pass on stuff to
the individual tasks.
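
A minimal sketch of that route; note that in the Hadoop 2.x Job class the method appears as
addCacheFile, taking one URI per call, and the class name, job name and path below are
placeholders.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "cache-example");
    // Register an HDFS file; each task gets a localized copy (the '#' fragment sets the link name)
    job.addCacheFile(new URI("/path/to/file#linkname"));
    // ... configure mapper/reducer, input/output, then job.waitForCompletion(true)
  }
}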

Regards,
Shahab

On Thu, Dec 11, 2014 at 9:07 PM, Srinivas Chamarthi <
srinivas.chamar...@gmail.com> wrote:
>
> Hi,
>
> I want to cache map/reduce temporary output files so that I can compare
> two map results coming from two different nodes to verify their integrity.
>
> I am simulating this use case with speculative execution by rescheduling
> the first task as soon as it is started and running.
>
> Now I want to compare the output files coming from the speculative attempt
> and the prior attempt so that I can calculate a credit score for each node.
>
> I want to use DistributedCache to cache the local file system files in the
> CommitPending stage from TaskImpl. But DistributedCache is deprecated. Is
> there any other way I can do this?
>
> I think I could use HDFS to save the temporary output files so that other
> nodes can see them, but is there an in-memory solution I can use?
>
> Any pointers are greatly appreciated.
>
> thx & rgds,
> srinivas chamarthi
>


Re: adding node(s) to Hadoop cluster

2014-12-11 Thread Vinod Kumar Vavilapalli

I may be mistaken, but let me try again with an example to see if we are on the same page.

Principals
 - NameNode: nn/nn-h...@cluster.com
 - DataNode: dn/_h...@cluster.com

Auth to local mappings
 - nn/nn-h...@cluster.com -> hdfs
 - dn/.*@cluster.com -> hdfs

The combination of the above lets you block any user other than hdfs from posing as a datanode.

Purposes
 - _HOST: Lets you deploy all datanodes with the same principal value in all
their configs.
 - Auth-to-local mapping: Maps Kerberos principals to unix login names to close
the loop on identity.
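
As a sketch only (assuming a hypothetical realm CLUSTER.COM; the rule syntax is the standard
hadoop.security.auth_to_local format in core-site.xml), the mapping above could be written as:

<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[2:$1@$0](nn@CLUSTER\.COM)s/.*/hdfs/
    RULE:[2:$1@$0](dn@CLUSTER\.COM)s/.*/hdfs/
    DEFAULT
  </value>
</property>

Here [2:$1@$0] collapses a two-component principal such as dn/somehost@CLUSTER.COM to
dn@CLUSTER.COM before the regex is applied, so only the named services map to the hdfs login.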

I don't think your example of "somebody on an untrusted client can disguise as
hdfs/nodename@REALM" is possible at all with Kerberos. Any references to such
possibilities? If it were possible, all security would be toast anyway, no?

+Vinod


> Thanks, I may be mistaken, but I suspect you missed the point:
> 
> for me, auth_to_local's role is to protect the server(s). For example,  
> somebody on an untrusted "client" can disguise as hdfs/nodename@REALM and 
> hence take over hdfs through a careless principal->id translation. A 
> well-configured auth_to_local will deflect that rogue "hdfs" to "nobody" or 
> something, so a malicious client cannot do a "hdfs dfs -chown ..." for 
> example.
> 
> The _HOST construct makes using the same config files throughout the cluster 
> easier indeed, but as far as I see it mainly applies to the "client".
> 
> On the server, I see no way other than auth_to_local with a list/pattern of 
> trusted node names (on namenode and every datanode in the hdfs case) to 
> prevent the scenario above. Would there be?





DistributedCache

2014-12-11 Thread Srinivas Chamarthi
Hi,

I want to cache map/reduce temporary output files so that I can compare
two map results coming from two different nodes to verify their integrity.

I am simulating this use case with speculative execution by rescheduling
the first task as soon as it is started and running.

Now I want to compare the output files coming from the speculative attempt
and the prior attempt so that I can calculate a credit score for each node.

I want to use DistributedCache to cache the local file system files in the
CommitPending stage from TaskImpl. But DistributedCache is deprecated. Is
there any other way I can do this?

I think I could use HDFS to save the temporary output files so that other
nodes can see them, but is there an in-memory solution I can use?

Any pointers are greatly appreciated.

thx & rgds,
srinivas chamarthi


Re: run yarn container as specific user

2014-12-11 Thread Hitesh Shah
Is your app code running within the container also being run within a UGI.doAs()?

You can use the following in your code to create a UGI for the “actual” user 
and run all the logic within that: 


UserGroupInformation actualUserUGI = UserGroupInformation.createRemoteUser(
    System.getenv(ApplicationConstants.Environment.USER.toString()));
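
A self-contained sketch of that pattern follows; the class name and the HDFS path are
hypothetical, and ApplicationConstants.Environment.USER is the environment variable YARN
sets for each container.

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.ApplicationConstants;

public class RunAsSubmitter {
  public static void main(String[] args) throws Exception {
    // Re-create the submitting user's identity from the env var YARN sets in every container
    UserGroupInformation actualUserUGI = UserGroupInformation.createRemoteUser(
        System.getenv(ApplicationConstants.Environment.USER.toString()));

    // Anything executed inside doAs() is attributed to that user,
    // even though the OS-level process still runs as 'yarn' on a non-secure cluster.
    actualUserUGI.doAs(new PrivilegedExceptionAction<Void>() {
      @Override
      public Void run() throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.mkdirs(new Path("/user/some/output"));   // hypothetical output path
        return null;
      }
    });
  }
}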


Your other option is to try and get the LinuxContainerExecutor working on a 
non-secure cluster ( not sure if that is trivial to do ).

— Hitesh

On Dec 11, 2014, at 12:04 PM, Tim Williams  wrote:

> I'm able to use the UGI.doAs(..) to launch a yarn app and, through the
> ResourceManager, both the ApplicationMaster and Containers are
> associated with the correct user.  But the process on the node itself
> really runs as the yarn user.  The problem is that the yarn app writes
> data to DFS and it's being written as yarn, since that's what the real
> process is. This is a non-secure cluster. I've yet to stumble upon
> a solution that doesn't feel icky.  What's the right way to achieve
> this?
> 
> Thanks,
> --tim



run yarn container as specific user

2014-12-11 Thread Tim Williams
I'm able to use the UGI.doAs(..) to launch a yarn app and, through the
ResourceManager, both the ApplicationMaster and Containers are
associated with the correct user.  But the process on the node itself
really runs as the yarn user.  The problem is that the yarn app writes
data to DFS and it's being written as yarn, since that's what the real
process is. This is a non-secure cluster. I've yet to stumble upon
a solution that doesn't feel icky.  What's the right way to achieve
this?

Thanks,
--tim


Re: Hadoop 2.4 + Hive 0.14 + Hbase 0.98.3 + snappy not working

2014-12-11 Thread peterm_second

Hi Hanish,
Thanks for the link, it did help. Long story short: always recompile the
native libraries for your machine :)
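
For anyone hitting the same failure, a minimal sketch (assuming Hadoop's
org.apache.hadoop.util.NativeCodeLoader utility, the class the stack trace below calls into;
the class name here is hypothetical) to check whether libhadoop and its Snappy support are
actually visible to the JVM:

import org.apache.hadoop.util.NativeCodeLoader;

public class SnappyCheck {
  public static void main(String[] args) {
    // True only if libhadoop.so was found on java.library.path at class-load time
    boolean loaded = NativeCodeLoader.isNativeCodeLoaded();
    System.out.println("native hadoop loaded: " + loaded);
    if (loaded) {
      // Reports whether libhadoop was built with Snappy support
      System.out.println("snappy supported: " + NativeCodeLoader.buildSupportsSnappy());
    }
  }
}

Run it with the same java.library.path the tasks use, e.g.
-Djava.library.path=$HADOOP_HOME/lib/native.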


Thanks,
Peter

On 11.12.2014 05:46, Hanish Bansal wrote:

Hope this may help you:

http://blogs.impetus.com/big_data/big_data_technologies/SnappyCompressionInHBase.do

On Thu, Dec 11, 2014 at 7:25 AM, Fabio wrote:


Plain Apache Hadoop 2.5.0.
Too bad it didn't work, hope someone can help.


On 12/10/2014 06:22 PM, peterm_second wrote:

Hi Fabio ,
Thanks for the reply, but unfortunately it didn't work. I am
using vanilla hadoop 2.4 with vanilla hive 0.14 and so on; I
am using the vanilla distros.
I did set HADOOP_COMMON_LIB_NATIVE_DIR but that didn't
make any change. What version were you using?

Peter


On 10.12.2014 16:23, Fabio wrote:

Not sure it will help, but if the problem is native
library loading, I spent a long time trying anything to
make it work.
I may suggest to try also:
export JAVA_LIBRARY_PATH=/opt/yarn/hadoop-2.5.0/lib/native
export HADOOP_COMMON_LIB_NATIVE_DIR=/opt/yarn/hadoop-2.5.0/lib
I have this both in the bash "init" script
(/etc/profile.p/...) and in
/opt/yarn/hadoop-2.5.0/etc/hadoop/hadoop-env.sh; quite
sure it's redundant, but as long as it works I don't
change it.
I see here I commented out my attempts to set HADOOP_OPTS,
so maybe it's not necessary.
I don't see anything in my .xml config files.
Also, someone says to compile the libraries under your 64
bit system, since the ones in Hadoop are for a 32bit
architecture.

Good luck

Fabio

On 12/10/2014 02:57 PM, peterm_second wrote:

Hi guys,
I have a hadoop + hbase + hive application.
For some reason my cluster is unable to find the snappy native library.
Here is the exception:
 org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
    at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method)
    at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)
    at org.apache.hadoop.io.compress.SnappyCodec.getCompressorType(SnappyCodec.java:132)
    at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:148)
    at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:163)
    at org.apache.hadoop.mapred.IFile$Writer.<init>(IFile.java:115)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1583)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1462)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)


I am working with 64-bit Ubuntu 14.04 LTS. I've
installed snappy on my OS and copied the
libs to hadoop_home/lib/native.
I've also added the libs to the JRE, but it still
fails as if nothing is present.
I've added
HADOOP_OPTS="-Djava.net.preferIPv4Stack=true $GC_DEBUG_OPTS -Djava.library.path=/usr/local/hadoop-2.4.0/lib/native $HADOOP_OPTS"
in my yarn xml I have

<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>LD_LIBRARY_PATH=$HADOOP_HOME/lib/native</value>
</property>

and in my mapred-site.xml I have

<property>
  <name>mapred.child.java.opts</name>
  <value>-Djava.library.path=/usr/local/hadoop-2.4.0/lib/native</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Djava.library.path=/usr/local/hadoop-2.4.0/lib/native</value>
</property>

The l

Re: adding node(s) to Hadoop cluster

2014-12-11 Thread Rainer Toebbicke

On 10 Dec 2014, at 20:08, Vinod Kumar Vavilapalli wrote:

> You don't need patterns for host-names, did you see the support for _HOST in
> the principal names? You can specify the datanode principal to be, say,
> datanodeUser@_HOST@realm, and Hadoop libraries interpret and replace _HOST on
> each machine with the real host-name.

Thanks, I may be mistaken, but I suspect you missed the point:

for me, auth_to_local's role is to protect the server(s). For example,  
somebody on an untrusted "client" can disguise as hdfs/nodename@REALM and hence 
take over hdfs through a careless principal->id translation. A well-configured 
auth_to_local will deflect that rogue "hdfs" to "nobody" or something, so a 
malicious client cannot do a "hdfs dfs -chown ..." for example.

The _HOST construct makes using the same config files throughout the cluster 
easier indeed, but as far as I see it mainly applies to the "client".

On the server, I see no way other than auth_to_local with a list/pattern of 
trusted node names (on namenode and every datanode in the hdfs case) to prevent 
the scenario above. Would there be?

Thanks, Rainer