[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-10-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926107#action_12926107
 ] 

Hudson commented on MAPREDUCE-1288:
---

Integrated in Hadoop-Mapreduce-trunk-Commit #523 (See 
[https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/523/])


> DistributedCache localizes only once per cache URI
> --
>
> Key: MAPREDUCE-1288
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1288
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: distributed-cache, security, tasktracker
>Affects Versions: 0.21.0
>Reporter: Devaraj Das
>Assignee: Devaraj Das
>Priority: Critical
> Fix For: 0.22.0
>
> Attachments: MR-1288-bp20-1.patch, MR-1288-bp20-2.patch, 
> MR-1288-bp20-3.patch, MR-1288-trunk-1.patch
>
>
> As part of the file localization the distributed cache localizer creates a 
> copy of the file in the corresponding user's private directory. The 
> localization in DistributedCache assumes the key as the URI of the cachefile 
> and if it already exists in the map, the localization is not done again. This 
> means that another user cannot access the same distributed cache file. We 
> should change the key to include the username so that localization is done 
> for every user.
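
For illustration only, a minimal Java sketch of the proposed change: keying
localized files by (URI, username) instead of by URI alone. The class names,
the in-memory map, and the local path layout are hypothetical and are not
taken from the Hadoop source or from the attached patches.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical key type: a cache file is identified by its URI *and* the user.
final class CacheKey {
    private final String uri;   // location of the cache file on the DFS
    private final String user;  // user on whose behalf the file is localized

    CacheKey(String uri, String user) {
        this.uri = Objects.requireNonNull(uri);
        this.user = Objects.requireNonNull(user);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CacheKey)) return false;
        CacheKey other = (CacheKey) o;
        return uri.equals(other.uri) && user.equals(other.user);
    }

    @Override
    public int hashCode() {
        return Objects.hash(uri, user);
    }
}

// Hypothetical bookkeeping map; a real tracker would also copy the file.
final class LocalizedFiles {
    private final Map<CacheKey, String> localPaths = new ConcurrentHashMap<>();

    // Localizes once per (URI, user) pair and returns the (made-up) local path.
    String localize(String uri, String user) {
        return localPaths.computeIfAbsent(new CacheKey(uri, user),
            k -> "/local/" + user + "/" + Integer.toHexString(uri.hashCode()));
    }

    int size() {
        return localPaths.size();
    }
}
```

With a URI-only key, the second user's request would hit the first user's
entry; with the composite key, each user gets a separate entry.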

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-08-02 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894620#action_12894620
 ] 

Owen O'Malley commented on MAPREDUCE-1288:
--

Looks good. +1




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-07-27 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892971#action_12892971
 ] 

Owen O'Malley commented on MAPREDUCE-1288:
--

{quote}
(2) introduce the concept of group sharing of distributed cache files so as to 
avoid repetitive downloads for group shared files also. This may be a complex 
solution after all.
{quote}

This would be quite complex to get right. In particular, it is difficult to 
determine which group should have access. If we want to improve it, I'd suggest 
that we use hardlinks to give each user access to a single copy of the file. 
Of course, you need to ensure that they do in fact have read access to the 
original file. *smile*
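
A rough Java sketch of the hardlink idea, using only java.nio.file. The class
and method names are hypothetical. Note two assumptions: Files.isReadable
checks access for the JVM's own user, whereas a real TaskTracker would have to
check the job user's permission on the original DFS file; and hard links only
work when both paths are on the same local file system.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of Hadoop.
final class HardLinkSharing {
    // Links the single shared copy into the user's directory instead of copying it.
    static Path linkForUser(Path shared, Path userDir) throws IOException {
        // Stand-in for the real permission check (see the assumptions above).
        if (!Files.isReadable(shared)) {
            throw new IOException("No read access to " + shared);
        }
        Files.createDirectories(userDir);
        Path link = userDir.resolve(shared.getFileName());
        // Files.createLink(link, existing) creates a hard link to `existing`.
        return Files.createLink(link, shared);
    }
}
```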

> DistributedCache localizes only once per cache URI
> --
>
> Key: MAPREDUCE-1288
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1288
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: distributed-cache, security, tasktracker
>Affects Versions: 0.21.0
>Reporter: Devaraj Das
>Priority: Critical
> Attachments: MR-1288-bp20-1.patch, MR-1288-bp20-2.patch, 
> MR-1288-bp20-3.patch



[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-07-27 Thread Devaraj Das (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892948#action_12892948
 ] 

Devaraj Das commented on MAPREDUCE-1288:


bq. Devaraj, this corner case is exactly what Hemanth was trying to explain 
earlier on this ticket, starting with comment #4 above

Yeah, I realized that. That's the reason I stuck to this JIRA rather than 
opening a new one :-)

bq. As for the approach, we have two options: (1) (this seems to be what the 
patch is doing) for group shared files, localize them separately for each user. 
This is a simple solution, but sacrifices the optimization (may not be too 
bad?)

Yes, I am going with this for now. If needed (after we deploy this patch on our 
clusters and observe), we can look at proposal (2) in your comment.




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-07-27 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892916#action_12892916
 ] 

Owen O'Malley commented on MAPREDUCE-1288:
--

It looks good. I'd suggest:
1. change DistributedCache.releaseCache to pass in the current user to 
TrackerDistributedCacheManager.releaseCache rather than creating a new method.
2. it looks like the constructor for CacheFile can easily throw IOException 
instead of wrapping it in a RuntimeException.
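
To illustrate point 2, a minimal Java sketch of a constructor that declares
IOException rather than wrapping it. CacheFileSketch and TimestampSource are
hypothetical stand-ins; the fields and arguments of the real CacheFile class
are not shown in this thread and are not reproduced here.

```java
import java.io.IOException;

// Hypothetical source of DFS modification times; may fail with IOException.
interface TimestampSource {
    long modificationTime(String uri) throws IOException;
}

// Hypothetical stand-in for a cache-file descriptor.
final class CacheFileSketch {
    final String uri;
    final long timestamp;

    // Wrapping variant (what the suggestion argues against):
    //   try { ... } catch (IOException e) { throw new RuntimeException(e); }
    // Declaring variant: the checked exception propagates to the caller.
    CacheFileSketch(String uri, TimestampSource source) throws IOException {
        this.uri = uri;
        this.timestamp = source.modificationTime(uri);
    }
}
```

Callers then have to handle or declare the IOException, which keeps the
failure visible in method signatures instead of surfacing as an unchecked
exception at run time.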


> DistributedCache localizes only once per cache URI
> --
>
> Key: MAPREDUCE-1288
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1288
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: distributed-cache, security, tasktracker
>Affects Versions: 0.21.0
>Reporter: Devaraj Das
>Priority: Critical
> Attachments: MR-1288-bp20-1.patch, MR-1288-bp20-2.patch



[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-07-27 Thread Vinod K V (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892724#action_12892724
 ] 

Vinod K V commented on MAPREDUCE-1288:
--

Devaraj, this corner case is exactly what Hemanth was trying to explain earlier 
on this ticket, starting with comment #4 above :)

As for the approach, we have two options:
 (1) (this seems to be what the patch is doing) for group shared files, 
localize them separately for each user. This is a simple solution, but 
sacrifices the optimization (may not be too bad?)
 (2) introduce the concept of group sharing of distributed cache files so as to 
avoid repetitive downloads for group shared files also. This may be a complex 
solution after all.




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-04-27 Thread Vinod K V (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861369#action_12861369
 ] 

Vinod K V commented on MAPREDUCE-1288:
--

The only corner case that isn't handled yet is when a file on DFS is originally 
a private file, gets localized in private dirs on the TT, and then subsequently 
becomes a public file on DFS itself. Although this is still possible, it is 
very much a corner case. I am +1 for moving this one out of the blocker queue 
for 0.21.

> DistributedCache localizes only once per cache URI
> --
>
> Key: MAPREDUCE-1288
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1288
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: security, tasktracker
>Affects Versions: 0.21.0
>Reporter: Devaraj Das
>Priority: Blocker
> Fix For: 0.21.0



[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-04-27 Thread Hemanth Yamijala (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861363#action_12861363
 ] 

Hemanth Yamijala commented on MAPREDUCE-1288:
-

The original intent of this JIRA is certainly different from what we ended up 
discussing in the end. So, I request we move discussion about the in-flight 
change to MAPREDUCE-1729.

And I ask the original question again: with the availability of public and 
private distributed cache options, can we assume this issue is not a blocker 
anymore?




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-04-26 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861231#action_12861231
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1288:


bq. I raised MAPREDUCE-1288 for the discussion on "in flight" archive. 
Sorry, I meant MAPREDUCE-1729.




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-04-26 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861230#action_12861230
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1288:


I raised MAPREDUCE-1288 for the discussion on "in flight" archive.




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-04-26 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861226#action_12861226
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1288:


Allen, I understand your use-case and I agree that it should be tunable from 
the job. But I think this is not a blocker anymore, because the distributed 
cache has always failed the job if a cache file gets modified on the fly.




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-04-26 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861028#action_12861028
 ] 

Allen Wittenauer commented on MAPREDUCE-1288:
-

Why would a task from an already running job not be able to find version-0?  
Why is the task tracker removing content from the cache of a running job?  If 
the content moved/is different, shouldn't the job tracker be able to reschedule 
tasks onto a task tracker that has a copy?   Why can't the task tracker copy 
the dcache from another task tracker that does have a copy?

That said, I'm not convinced that, in the majority of cases, the result of 
mixing version-0 and version-1 is undefined. From what I've seen, most of the 
time different versions of a dcache are backward compatible. As much as folks 
hate tunables, perhaps that is the answer here: mapred.job.dcache.failonupdate.
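
A sketch of how such a tunable might be consulted. Note that
mapred.job.dcache.failonupdate is only the name proposed in this comment, not
an existing Hadoop setting. java.util.Properties stands in for the job
configuration, and the default of "true" is an assumption chosen to preserve
fail-on-change as the default behavior.

```java
import java.util.Properties;

// Hypothetical policy helper; not part of Hadoop.
final class DcacheUpdatePolicy {
    // Property name as proposed in this thread (hypothetical).
    static final String FAIL_ON_UPDATE = "mapred.job.dcache.failonupdate";

    // Assumed default: true, i.e. a changed cache file fails the task.
    static boolean failOnUpdate(Properties jobConf) {
        return Boolean.parseBoolean(jobConf.getProperty(FAIL_ON_UPDATE, "true"));
    }

    // A task may proceed if the file is unchanged since submission,
    // or if the job has opted out of failing on updates.
    static boolean mayProceed(Properties jobConf, long submitTs, long currentTs) {
        return submitTs == currentTs || !failOnUpdate(jobConf);
    }
}
```

A job that knows its cache versions are compatible would set the property to
"false"; all other jobs would keep the stricter default.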






[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-04-26 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860880#action_12860880
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1288:


bq. Why should an old job fail because of what is, essentially, an external 
event?
The current behavior is that the job will fail if a distributed cache file (in 
use) gets modified on the DFS. Even if the task is localized on a new tracker, 
we should fail the task; that is being done through MAPREDUCE-1225.

Allen, what should the behavior be for the following use-case? 
Suppose some tasks have downloaded a version-0 file and run successfully, while 
other tasks cannot find version-0, can find only the version-1 file, use it, 
and also run successfully. The final job output would be undefined, right? The 
output generated by the tasks that used the version-0 file would be entirely 
different from the output generated by the tasks that used the version-1 file. 
So I think the current behavior, failing the job if a file added to the 
distributed cache changes after job submission, is correct.





[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-04-25 Thread Hemanth Yamijala (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860786#action_12860786
 ] 

Hemanth Yamijala commented on MAPREDUCE-1288:
-

bq. Why should an old job fail because of what is, essentially, an external 
event? 

The job failing is unlikely, right? Please note that I said tasks fail. I hope 
someone can clarify (given the two statuses we have, i.e. tasks failed vs 
killed) whether this condition can lead Hadoop to abort the job after a 
sufficient number of failures. Even if it does, at least one task's attempts 
would have to be scheduled on 4 such nodes and fail on all four. I am 
thinking this is unlikely. But let's hope someone (alias Amarsri *smile*) can 
clarify this.




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-04-23 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860339#action_12860339
 ] 

Allen Wittenauer commented on MAPREDUCE-1288:
-

That sounds like really bad behavior.

Why should an old job fail because of what is, essentially, an external event?  

This still sounds like a blocker to me.




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-04-22 Thread Hemanth Yamijala (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860119#action_12860119
 ] 

Hemanth Yamijala commented on MAPREDUCE-1288:
-

bq. What happens in the case that the archive file changes in flight. For 
example, I submit a job using that archive. While my job is running, I notice a 
bug, remove the old cache file, push a new one to hdfs, and then launch a new 
invocation of my job. Would the new job get the old cache file because the old 
job is still running? 

Allen, the key that identifies a cache file on a tasktracker node is a 
combination of the URL and the DFS timestamp that is determined when the job is 
submitted. Hence, the new job would get a new key and hence be localized 
afresh. This is irrespective of whether the old file was ever localized on the 
same node or not. I am assuming here that a file upload to DFS to the same URL 
would modify the timestamp.

Further, when this happens, new tasks of the old job that are running on nodes 
where the localization of the invalid file has already happened, will fail 
because the localization process for the new tasks will detect the file has 
changed in-flight.

Hope this is correct.
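
A small Java sketch of the behavior described above, under stated assumptions:
the key is the URL combined with the DFS timestamp captured at job submission,
and localization fails when the timestamp currently reported by the DFS
differs from the submitted one. The class, the key format, and the local paths
are hypothetical and do not reproduce the TaskTracker code.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a per-node cache keyed by URL and submit-time timestamp.
final class TimestampedCache {
    private final Map<String, String> localized = new HashMap<>();
    private int nextId;  // used only to invent distinct local paths

    // Key: URL plus the DFS modification time recorded when the job was submitted.
    static String key(String url, long submitTimestamp) {
        return url + "@" + submitTimestamp;
    }

    // currentTimestamp is what the DFS reports at the time the task is localized.
    String localize(String url, long submitTimestamp, long currentTimestamp)
            throws IOException {
        if (currentTimestamp != submitTimestamp) {
            // The file changed in flight: tasks of the older job fail here.
            throw new IOException("Cache file " + url + " changed after job submission");
        }
        return localized.computeIfAbsent(key(url, submitTimestamp),
            k -> "/local/cache/" + (nextId++));
    }
}
```

A job submitted after the re-upload carries the new timestamp, so it maps to a
new key and is localized afresh, while new tasks of the old job fail the
timestamp check.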




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-04-22 Thread Devaraj Das (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860040#action_12860040
 ] 

Devaraj Das commented on MAPREDUCE-1288:


In my earlier analysis, I missed a code path. What I said is not true.




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-04-22 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860007#action_12860007
 ] 

Allen Wittenauer commented on MAPREDUCE-1288:
-

I think the 'in flight' archive is still a problem without this fixed.  Correct?




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2010-04-22 Thread Hemanth Yamijala (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859695#action_12859695
 ] 

Hemanth Yamijala commented on MAPREDUCE-1288:
-

I am trying to revisit whether this bug continues to be a blocker, in light of 
the decision we have taken to rebase trunk as the next new release.

For the default configuration (DefaultTaskController), even though the files 
are localized only once irrespective of the number of users accessing them, 
since all the files are localized as the TT user and can be accessed by the TT 
user, I suppose this is not a problem. For the LinuxTaskController, it seems 
like a problem. However, since we are rebasing from trunk, maybe the feature of 
public distributed cache files in MAPREDUCE-744 will cover cases that are 
otherwise not covered due to this bug. To explain more, either users want to 
share files, or they don't. If they do, they can use the public distributed 
cache feature which works fine. If they don't want to share, this bug becomes a 
non-issue.

Based on this, I am tending to think this is not a blocker. Thoughts?



> DistributedCache localizes only once per cache URI
> --
>
> Key: MAPREDUCE-1288
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1288
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: security, tasktracker
>Affects Versions: 0.21.0
>Reporter: Devaraj Das
>Priority: Blocker
> Fix For: 0.21.0
>
>
> As part of the file localization the distributed cache localizer creates a 
> copy of the file in the corresponding user's private directory. The 
> localization in DistributedCache assumes the key as the URI of the cachefile 
> and if it already exists in the map, the localization is not done again. This 
> means that another user cannot access the same distributed cache file. We 
> should change the key to include the username so that localization is done 
> for every user.




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2009-12-11 Thread Devaraj Das (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789696#action_12789696
 ] 

Devaraj Das commented on MAPREDUCE-1288:


If I am reading the code right, the tasks of the new job would fail on those 
nodes that localized the old archive and still have a copy of it. The 
TaskTracker would detect that the archive has changed and, assuming that the 
change happened while the new job was running, would fail the tasks. This will 
continue until the archive is purged from the cache and re-localized.
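
To make the failure mode above concrete, here is a minimal sketch. It assumes 
the check compares the modification time recorded when the copy was localized 
against the archive's current modification time on DFS. The class and method 
names are hypothetical and this is not the actual TaskTracker code.

```java
import java.io.IOException;

// Hypothetical illustration of a "has the archive changed?" check.
class StaleArchiveCheck {
    // Fails the caller when the timestamp recorded for the localized copy
    // no longer matches the archive's current timestamp on DFS.
    static void checkUnchanged(String uri, long localizedMtime, long currentDfsMtime)
            throws IOException {
        if (localizedMtime != currentDfsMtime) {
            throw new IOException("Archive " + uri + " changed after it was localized");
        }
    }
}
```

Under this reading, every task of the new job that lands on a node still 
holding the old copy hits the mismatch, which matches the "fail until purged" 
behaviour described above.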




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2009-12-11 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789390#action_12789390
 ] 

Allen Wittenauer commented on MAPREDUCE-1288:
-

What happens if the archive file changes in flight? For example, I submit a 
job using that archive. While my job is running, I notice a bug, remove the 
old cache file, push a new one to HDFS, and then launch a new invocation of my 
job. Would the new job get the old cache file because the old job is still 
running?




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2009-12-10 Thread Hemanth Yamijala (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789160#action_12789160
 ] 

Hemanth Yamijala commented on MAPREDUCE-1288:
-

Just to be clear, I am *not* disagreeing at all that there's a bug. Or with the 
assessment that this is a blocker. +1 on both. *smile*





[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2009-12-10 Thread Devaraj Das (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789153#action_12789153
 ] 

Devaraj Das commented on MAPREDUCE-1288:


All I am saying is that irrespective of the file being public or not, in the 
current codebase we localize the file exactly once per TaskTracker. On a given 
TaskTracker, users cannot share the same HDFS file as a distributed cache 
file. 
What I thought earlier was that the same file would be localized twice in such 
a case (once in each user's private directory).




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2009-12-10 Thread Hemanth Yamijala (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789145#action_12789145
 ] 

Hemanth Yamijala commented on MAPREDUCE-1288:
-

bq. Even if the entire path were accessible to everyone,

If the entire path were accessible to everyone on DFS, there's really no great 
security for that file. I was just trying to point out that such a case may not 
even be valid in the context of how MAPREDUCE-856 was approached (i.e., we 
wanted to secure localized files for users). But I concur that one could 
theoretically construct a case where the URI is accessible to a group of users 
on DFS, and since there's no way to securely localize it per group on the TT, 
this bug is still valid.




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2009-12-10 Thread Devaraj Das (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789134#action_12789134
 ] 

Devaraj Das commented on MAPREDUCE-1288:


Look at this scenario: the URI is hdfs://:/foo/bar/file.txt. Even 
if the entire path were accessible to everyone, the TaskTracker would localize 
it exactly once, and in a user's private directory. A second job wishing to 
access the same file wouldn't be able to do so, since the TT wouldn't localize 
it again. Does that make sense?
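
The scenario can be sketched roughly as follows. This is an illustration 
only: UriKeyedCache, the /local/... path layout, the host name in the example 
URI and the user names are all made up here, and none of it is the real 
TaskTracker code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the TaskTracker's localization bookkeeping.
class UriKeyedCache {
    // Cache URI -> local path of the single localized copy.
    private final Map<String, String> localized = new HashMap<>();

    // Returns the local path that a job run by `user` is pointed at for `uri`.
    String localize(String uri, String user) {
        // The copy is made only on the first request and lands in that
        // requester's private directory; later requests reuse the entry.
        return localized.computeIfAbsent(
            uri, u -> "/local/" + user + "/private/" + u.substring(u.lastIndexOf('/') + 1));
    }

    public static void main(String[] args) {
        UriKeyedCache cache = new UriKeyedCache();
        String uri = "hdfs://nn.example:8020/foo/bar/file.txt"; // placeholder URI
        System.out.println(cache.localize(uri, "alice")); // /local/alice/private/file.txt
        System.out.println(cache.localize(uri, "bob"));   // /local/alice/private/file.txt
    }
}
```

The second caller is handed a path inside the first user's private directory, 
which it cannot read when tasks run as the submitting user.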




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2009-12-10 Thread Hemanth Yamijala (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789128#action_12789128
 ] 

Hemanth Yamijala commented on MAPREDUCE-1288:
-

I suppose one could argue that if two different users can access the same set 
of files on the DFS for localization, they are 'public'. But then, you could 
theoretically construct a use case where there's 'group' access for some 
files on DFS and these are localized per user on the task tracker. Is that what 
we're trying to address?




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2009-12-10 Thread Vinod K V (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789122#action_12789122
 ] 

Vinod K V commented on MAPREDUCE-1288:
--

+1 for putting the username also as part of the key.
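
For illustration, a minimal sketch of what a (URI, username) key could look 
like. UserKeyedCache, Key and the /local/... path layout are hypothetical 
names chosen for this example; the attached patches may structure this quite 
differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch of a cache keyed by (URI, user) instead of URI alone.
class UserKeyedCache {
    // Composite key: the same URI requested by two users yields two entries.
    static final class Key {
        private final String uri;
        private final String user;

        Key(String uri, String user) {
            this.uri = uri;
            this.user = user;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key other = (Key) o;
            return uri.equals(other.uri) && user.equals(other.user);
        }

        @Override
        public int hashCode() {
            return Objects.hash(uri, user);
        }
    }

    private final Map<Key, String> localized = new HashMap<>();

    // Localizes once per (uri, user) pair, into that user's private directory.
    String localize(String uri, String user) {
        return localized.computeIfAbsent(
            new Key(uri, user),
            k -> "/local/" + user + "/private/" + uri.substring(uri.lastIndexOf('/') + 1));
    }

    int entries() {
        return localized.size();
    }
}
```

With this key, two users requesting the same URI each get a copy in their own 
directory, at the cost of one extra localization per user.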




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2009-12-10 Thread Devaraj Das (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789115#action_12789115
 ] 

Devaraj Das commented on MAPREDUCE-1288:


This issue is about how localization happens in 0.21/trunk. There are no 
"public" or "private" cache files currently. That distinction is being 
introduced in MAPREDUCE-744.




[jira] Commented: (MAPREDUCE-1288) DistributedCache localizes only once per cache URI

2009-12-10 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789095#action_12789095
 ] 

Arun C Murthy commented on MAPREDUCE-1288:
--

I'm assuming this is done only for 'private' cache files? i.e. public cache 
files should probably use the 'username' of the TT itself?
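
One possible reading of this suggestion, as a sketch: the username that goes 
into the key depends on the file's visibility. CacheKeyOwner, keyOwner and the 
"mapred" TT user are hypothetical names used only for this example.

```java
// Hypothetical helper: picks which username becomes part of the cache key.
class CacheKeyOwner {
    // Public files share one entry owned by the TaskTracker's own user;
    // private files get one entry per submitting user.
    static String keyOwner(boolean isPublic, String jobUser, String ttUser) {
        return isPublic ? ttUser : jobUser;
    }

    public static void main(String[] args) {
        System.out.println(keyOwner(true, "alice", "mapred"));  // mapred
        System.out.println(keyOwner(true, "bob", "mapred"));    // mapred
        System.out.println(keyOwner(false, "alice", "mapred")); // alice
        System.out.println(keyOwner(false, "bob", "mapred"));   // bob
    }
}
```

Under this reading, a public file is still localized only once per 
TaskTracker, while a private file is localized once per user.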
