Re: Use of CombineFileInputFormat

2012-09-28 Thread Harsh J
It combines multiple InputSplits per Mapper (into a CombineFileSplit), read in serial. This reduces the number of mappers for inputs that carry several (usually small) files/blocks. On Fri, Sep 28, 2012 at 6:54 AM, Jay Vyas jayunit...@gmail.com wrote: It's not clear to me what the CombineInputFormat really is? Can
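
A minimal new-API sketch of what Harsh describes, assuming a hypothetical per-file reader class MyRecordReader (with the (CombineFileSplit, TaskAttemptContext, Integer) constructor CombineFileRecordReader expects) and an arbitrary 128 MB cap on the combined split size:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.*;

    // The input format hands each CombineFileSplit to a
    // CombineFileRecordReader, which reads the packed files one after
    // another -- the "read in serial" part.
    public class MyCombineInputFormat
        extends CombineFileInputFormat<LongWritable, Text> {
      @Override
      public RecordReader<LongWritable, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context) throws IOException {
        return new CombineFileRecordReader<LongWritable, Text>(
            (CombineFileSplit) split, context, MyRecordReader.class);
      }
    }

    // Driver side: cap the combined split size so a single mapper
    // doesn't swallow the entire input.
    job.setInputFormatClass(MyCombineInputFormat.class);
    FileInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024);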

Securing cluster from access

2012-09-28 Thread Shin Chan
Hello, We have a 15-node cluster and right now we don't have Kerberos implemented, but we urgently want to secure the cluster. Right now anyone who knows the IP of the Namenode can just download the Hadoop jar, configure the xml files, and say hadoop fs -ls / and see the data. How to
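
For a sense of how little this takes, a sketch of such a client under simple (non-Kerberos) auth; the hostname and port are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Nothing here authenticates the caller: with simple auth, knowing
    // the NameNode address is enough to browse the namespace.
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus stat : fs.listStatus(new Path("/"))) {
      System.out.println(stat.getPath());
    }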

Re: Securing cluster from access

2012-09-28 Thread Bertrand Dechoux
In the end, what you are looking for is not related to Hadoop; it is how to restrict requests on a network. 'Firewall' is a broad term. iptables can let you do this quickly: you drop everything and then accept traffic only from a set of IPs. You may receive answers using this mailing list but its

Re: Securing cluster from access

2012-09-28 Thread Shin Chan
Hello Bertrand, Thanks for your reply, and apologies if this was confusing. Yes, iptables is one way to go, but my question is more whether there is configuration within the hadoop xml files to say that only a given user is allowed to see HDFS. I can see that we can do something for Map reduce

Re: Securing cluster from access

2012-09-28 Thread Harsh J
You need a stronger authentication method (Kerberos), period. It isn't just fs -ls / you should be scared about. Read Natty's post here, on what it means to run an insecure cluster when you have secure requirements: http://www.cloudera.com/blog/2012/03/authorization-and-authentication-in-hadoop/.

Re: Securing cluster from access

2012-09-28 Thread Harsh J
ACLs are a good way to control users' roles, but in insecure mode users can easily be impersonated, rendering ACLs useless as a 'secure' measure. On Fri, Sep 28, 2012 at 3:15 PM, Shin Chan had...@gmx.com wrote: Hello Bertrand, Thanks for your reply, and apologies if this was confusing. Yes, iptables
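
A sketch of why that is, assuming simple auth; the username below is arbitrary. The client just asserts an identity and nothing on the cluster verifies it:

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.security.UserGroupInformation;

    // The client simply claims to be "hdfs"; without Kerberos the
    // cluster takes this at face value, so ACL checks pass.
    UserGroupInformation ugi = UserGroupInformation.createRemoteUser("hdfs");
    ugi.doAs(new PrivilegedExceptionAction<Void>() {
      public Void run() throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // ... act as "hdfs" for any permission-checked operation ...
        return null;
      }
    });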

Re: Securing cluster from access

2012-09-28 Thread Bertrand Dechoux
Harsh is right. It is important to know the difference between authorization and authentication. However, if you do not want anybody from outside to write to your cluster, then a firewall might be enough. You block everything but allow access to the web interfaces (without private actions

Re: MultipleOutputs side effects

2012-09-28 Thread Harsh J
Yes - for the new API MultipleOutputs, use LazyOutputFormat as the job's output format, and for the old API, use NullOutputFormat as the jobconf's output format. On Fri, Sep 28, 2012 at 5:14 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: This came up recently on the forums, IIRC. The answer was to
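
A driver-side sketch of both settings; wrapping TextOutputFormat is just an example, and job/jobConf are whatever driver objects you already have:

    import org.apache.hadoop.mapred.lib.NullOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    // New API: wrap the real output format so part files are only
    // created on first write, e.g. when all output goes through
    // MultipleOutputs.
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

    // Old API: suppress the default output entirely.
    jobConf.setOutputFormat(NullOutputFormat.class);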

Re: dfs.name.dir replication and disk not available

2012-09-28 Thread Bertrand Dechoux
That's definitely clearer (and it makes sense). Thanks a lot. Bertrand On Fri, Sep 28, 2012 at 11:56 AM, Harsh J ha...@cloudera.com wrote: I don't know how much of this is 1.x compatible: - When a transaction is logged and sync'd, and a single edits storage location fails during write,

Re: Usefulness of ChainMapper/ChainReducer

2012-09-28 Thread Harsh J
Hi, Modularity! I'd always had the same question myself, but Tom White put that thought to rest: It’s possible to make map and reduce functions even more composable than we have done. A mapper commonly performs input format parsing, projection (selecting the relevant fields), and
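
For reference, the old-API chaining call looks roughly like this; MyDriver, ParseMapper, and ProjectMapper are hypothetical placeholders echoing the parse/project stages in the quote:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.ChainMapper;

    JobConf job = new JobConf(MyDriver.class);
    // Each stage declares its own key/value classes; byValue=true copies
    // records between stages so one mapper can't mutate another's input.
    ChainMapper.addMapper(job, ParseMapper.class,
        LongWritable.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));
    ChainMapper.addMapper(job, ProjectMapper.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));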

Re: Usefulness of ChainMapper/ChainReducer

2012-09-28 Thread John Armstrong
On Fri 28 Sep 2012 09:39:13 AM EDT, Harsh J wrote: Modularity! Exactly! Write a mapper that operates as a filter on something about your keys, then use it in whatever jobs you want. Your job needs to operate on data subset A? Chain it with the filter mapper that picks out A. Your next one
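
A minimal old-API sketch of such a reusable filter mapper; the "subset A" test (a key prefix) is invented for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Passes through only the records belonging to "subset A"; chain it
    // in front of whichever job needs that subset.
    public class SubsetAFilterMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {
      public void map(Text key, Text value,
          OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        if (key.toString().startsWith("A:")) {
          out.collect(key, value);
        }
      }
    }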

Re: Securing cluster from access

2012-09-28 Thread Yongzhi Wang
This document has a clear description, although I don't know if it applies to Hadoop 2.0. http://hadoop.apache.org/docs/r1.0.3/hdfs_permissions_guide.html I quote some text from this document; hopefully it helps you. Overview The Hadoop Distributed File System (HDFS) implements a permissions
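
The same model is scriptable through the FileSystem API; a sketch with an invented path, owner, and group (setOwner needs superuser rights):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/data/reports");
    fs.setOwner(p, "alice", "analysts");                  // owner, group
    fs.setPermission(p, new FsPermission((short) 0750));  // rwxr-x---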

context.write() Vs FSDataOutputStream.writeBytes()

2012-09-28 Thread Ranjithkumar Gampa
Hi, we are using FSDataOutputStream.writeBytes() from map/reduce to write to the Hive table path directly instead of context.write(), which is working fine, and so far we have had no problems with this approach. We make sure the file names are distinct by appending the taskAttemptId to them, and we use speculative
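
A sketch of that pattern from inside a new-API task; the warehouse path and record layout are placeholders:

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Inside map() or reduce(): one file per task attempt, so
    // speculative duplicates never collide on the same name.
    FileSystem fs = FileSystem.get(context.getConfiguration());
    String attempt = context.getTaskAttemptID().toString();
    Path out = new Path("/user/hive/warehouse/mytable/part-" + attempt);
    FSDataOutputStream stream = fs.create(out, false); // no overwrite
    String record = key + "\t" + value;  // whatever line the task emits
    stream.writeBytes(record + "\n");
    stream.close();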