[ 
https://issues.apache.org/jira/browse/MAPREDUCE-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744467#action_12744467
 ] 

Devaraj Das commented on MAPREDUCE-181:
---------------------------------------

I wonder whether it makes sense to have the jobclient write two files per 
split file:

1) the splits info (the actual bytes), written to a secure location on HDFS 
(with permissions 700)
2) the split metadata, which is a set of entries like 
{<map-id>:<location_1><location_2>..<location_n>, 
<start-offset-in-split-file><length>}, one for each map-id. This is serialized 
over RPC, and the JobTracker writes it to the well-known 
mapred-system-directory (which the JobTracker owns with perms 700).
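
To make the metadata entry in (2) concrete, here is a minimal Python sketch of 
one such entry and a compact binary encoding for it. The class name, field 
names, and wire layout are all assumptions made for illustration; this is not 
Hadoop's actual format.

```python
import struct
from dataclasses import dataclass
from typing import Tuple

# Fixed-size header: map-id (int32), start offset (int64), length (int64),
# number of locations (uint16). Big-endian, no padding. Hypothetical layout.
_HEADER = ">iqqH"


@dataclass(frozen=True)
class SplitMetaEntry:
    map_id: int
    locations: Tuple[str, ...]  # hosts holding the split's input data
    start_offset: int           # offset of this split inside the split file
    length: int                 # number of bytes of this split in that file


def encode_entry(entry: SplitMetaEntry) -> bytes:
    """Serialize one metadata entry (assumed layout, see _HEADER)."""
    parts = [struct.pack(_HEADER, entry.map_id, entry.start_offset,
                         entry.length, len(entry.locations))]
    for host in entry.locations:
        raw = host.encode("utf-8")
        parts.append(struct.pack(">H", len(raw)))  # length-prefixed host name
        parts.append(raw)
    return b"".join(parts)


def decode_entry(buf: bytes, pos: int = 0) -> Tuple[SplitMetaEntry, int]:
    """Decode one entry starting at pos; return it and the next position."""
    map_id, start, length, n_locs = struct.unpack_from(_HEADER, buf, pos)
    pos += struct.calcsize(_HEADER)
    hosts = []
    for _ in range(n_locs):
        (host_len,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        hosts.append(buf[pos:pos + host_len].decode("utf-8"))
        pos += host_len
    return SplitMetaEntry(map_id, tuple(hosts), start, length), pos
```

Note that the entry carries only locations and a byte range, never the split 
bytes themselves, so its size depends on the number of locations rather than 
on what the user's InputFormat put into the split.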

The JobTracker just reads/loads the metadata, and creates the TIP cache.

The TaskTracker is handed a split object that looks something like 
{<start-offset-in-split-file><length>}. As part of task localization, the TT 
copies the specific bytes from the split file (securely) and launches the 
task, which then reads the split. Alternatively, the TT could simply stream 
the bytes over RPC to the child. The replication factor for the splits info 
file could be set to a high number.
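
The byte-range copy during task localization could look roughly like the 
sketch below. It reads from any seekable binary stream, which stands in here 
for an open HDFS file; the function name and the strict short-read check are 
assumptions, not existing TaskTracker code.

```python
import io
from typing import BinaryIO


def read_split_bytes(split_file: BinaryIO, start_offset: int,
                     length: int) -> bytes:
    """Return exactly `length` bytes starting at `start_offset`.

    `split_file` is a stand-in for the secured splits info file; the
    offset and length come from the split object handed to the TT.
    """
    if start_offset < 0 or length < 0:
        raise ValueError("start_offset and length must be non-negative")
    split_file.seek(start_offset)
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = split_file.read(remaining)
        if not chunk:  # hit end of file before reading the full range
            raise EOFError("split file is shorter than offset + length")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    # Three splits concatenated in one in-memory "split file".
    demo = io.BytesIO(b"split-0split-1split-2")
    print(read_split_bytes(demo, 7, 7))
```

Only the range named by the split object is read, so a task never needs 
access to the other maps' split bytes in the same file.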

Doing it this way should considerably reduce the size of the split file 
information (and we can cap the metadata size as well), and also secure the 
content of the user-generated split files.

For the JobConf, passing only the basic, minimum info to the JobTracker, as 
Hong suggested on MAPREDUCE-841, seems to make sense. For all other conf 
properties, the Task can load them directly from HDFS. The max size (in 
bytes) of the basic information could be easily derived, and we could cap it 
for the RPC communication.
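
As a rough illustration of capping the basic JobConf info sent over RPC, the 
sketch below keeps only a whitelist of properties and rejects payloads above 
a byte limit. Which keys count as "basic", the 16 KB cap, and the key=value 
encoding are all assumptions chosen for the example.

```python
from typing import Mapping

# Assumed cap on the serialized basic info; the real value would be derived
# from the maximum sizes of the whitelisted properties.
MAX_BASIC_CONF_BYTES = 16 * 1024

# Hypothetical whitelist of properties the JobTracker itself needs.
BASIC_KEYS = ("mapred.job.name", "mapred.job.queue.name", "user.name")


def basic_conf_payload(conf: Mapping[str, str],
                       max_bytes: int = MAX_BASIC_CONF_BYTES) -> bytes:
    """Build the RPC payload from whitelisted keys, enforcing the cap."""
    basic = {k: conf[k] for k in BASIC_KEYS if k in conf}
    lines = [f"{key}={value}" for key, value in sorted(basic.items())]
    payload = "\n".join(lines).encode("utf-8")
    if len(payload) > max_bytes:
        raise ValueError(
            f"basic conf is {len(payload)} bytes, cap is {max_bytes}")
    return payload
```

Everything outside the whitelist stays in the job's conf file, which the Task 
loads directly from HDFS.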

Thoughts?

> mapred.system.dir should be accessible only to hadoop daemons 
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-181
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-181
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>         Attachments: hadoop-3578-branch-20-example-2.patch, 
> hadoop-3578-branch-20-example.patch, HADOOP-3578-v2.6.patch, 
> HADOOP-3578-v2.7.patch
>
>
> Currently the jobclient accesses the {{mapred.system.dir}} to add job 
> details. Hence the {{mapred.system.dir}} has the permissions of 
> {{rwx-wx-wx}}. This could be a security loophole where the job files might 
> get overwritten/tampered after the job submission. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
