[
https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738811#action_12738811
]
Vinod K V commented on MAPREDUCE-476:
-------------------------------------
Replies in-line.
bq. Is there anything blocking MAPREDUCE-711 that prevents it from being
committed?
No, I've asked Owen to commit it.
bq. I had a very clever bug in there (caused by not thinking enough while
resolving a merge conflict) that deleted the current working directory,
recursively, in one of the tests. (TaskRunner is hard-coded to delete the
current working directory, which is OK, since it's typically a child process;
not OK for LocalJobRunner.)
Nice catch! I too burnt myself once trying to refactor the code in Child.java
out into a thread, and ended up deleting my whole workspace because of the same
issue :(
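Since this bit both of us, here is a minimal sketch of the kind of guard that would have caught it. The class and method names are illustrative, not the actual patch code: the point is that recursively deleting a directory that is (or contains) the current working directory is harmless in a forked child process (the TaskRunner case) but destroys the workspace when run in-process (the LocalJobRunner case).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class SafeDelete {
    /**
     * Returns false if recursively deleting {@code target} would also remove
     * the process's current working directory, i.e. when cwd is equal to, or
     * nested inside, the target. Illustrative guard only.
     */
    static boolean isSafeToDelete(Path target) {
        Path cwd = Paths.get("").toAbsolutePath().normalize();
        Path t = target.toAbsolutePath().normalize();
        // cwd.startsWith(t) means deleting t would take cwd down with it
        return !cwd.startsWith(t);
    }

    public static void main(String[] args) {
        System.out.println(isSafeToDelete(Paths.get(".")));              // false
        System.out.println(isSafeToDelete(Paths.get("some", "subdir"))); // true
    }
}
```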
bq. The tricky bit is always fixing only the lines I've changed, and not all
the lines in a given file, to preserve history and keep reviewing sane.
That is correct.
bq. This class should also have the variable-number-of-arguments getLocalCache()
methods so that the corresponding methods in DistributedCache can be
deprecated. Also, each method in DistributedCache should call the corresponding
method in the DistributedCacheManager class.
bq. Don't think I agree here. We can deprecate the getLocalCache methods in
DistributedCache right away. They delegate to each other, and one of them
delegates to TrackerDistributedCacheManager. Ideally, I'd remove these
altogether — Hadoop internally does not use these methods with this patch, and
there's no sensible reason why someone else would, but since it's public, it's
getting deprecated. But it's not being deprecated with a pointer to use
something else; it's getting deprecated so that you don't use it at all.
Okay, I thought those methods were used internally at least. But if, as you've
said, they are not used internally at all, +1 for deprecating them altogether.
bq. Using the .equals method at +150: if (cacheFile.type ==
CacheFile.FileType.ARCHIVE).
bq. If you feel strongly about this, I'm happy to change it, but I think == is
more consistent.
I am fine, no problem with this.
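For the record, == is the usual recommendation for enum comparisons; a generic sketch (FileType here is a stand-in enum, not the actual patch code) shows why:

```java
public class EnumCompare {
    enum FileType { REGULAR, ARCHIVE }

    // == on enum constants is identity comparison, which for enums is the
    // same as equality (each constant is a singleton). Unlike equals(), it
    // is null-safe, and a comparison across unrelated enum types fails to
    // compile instead of silently returning false.
    static boolean isArchive(FileType type) {
        return type == FileType.ARCHIVE;
    }

    public static void main(String[] args) {
        System.out.println(isArchive(FileType.ARCHIVE)); // true
        System.out.println(isArchive(null));             // false, no NullPointerException
    }
}
```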
bq. TaskTracker.initialize() A new DistributedCacheManager is created every
time, so old files will not be deleted by the subsequent purgeCache. Either we
should have only one cacheManager for a TaskTracker or
DistributedCacheManager.cachedArchives should be static. The same problem
exists with the deprecated purgeCache() method in DistributedCache.java
bq. If my understanding is correct, in the course of normal operations,
TaskTracker.initialize() is only called once, by TaskTracker.run(). At startup
time (and whenever the TaskTracker decides to lose all of its state and reset
itself, which is essentially the same), the TaskTracker initializes the
TrackerDistributedCacheManager object. This is also the only time it's allowed
to "clear" the cache, since there are certainly no tasks running at
initialization time that depend on the cached files.
Ah, I was under the impression that distributed cache files are purged
EXPLICITLY when a TT reinitializes, but it looks like we do a full delete of
the TaskTracker's local files during re-init. So the code in your patch will
work for now, but it won't in the future, when we might need an explicit purge
of distributed cache files once they are owned by the users themselves
(HADOOP-4493). We will change it then; +1 for the current changes.
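The lifecycle being agreed on above can be sketched generically (the class and method names below are illustrative, not the actual Hadoop code): the manager is constructed once per TaskTracker (re)initialization, and purging is only legal at that point, before any task holds a reference to localized files.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the invariant: purge is only allowed while no
// task is running, i.e. at (re)initialization time.
public class CacheManagerSketch {
    private final AtomicInteger activeTasks = new AtomicInteger(0);

    public void purgeCache() {
        if (activeTasks.get() != 0) {
            throw new IllegalStateException(
                "purge only allowed at initialization, before tasks run");
        }
        // delete previously localized files here
    }

    public void taskStarted()  { activeTasks.incrementAndGet(); }
    public void taskFinished() { activeTasks.decrementAndGet(); }
}
```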
bq. JobClient.java What happens to the code in here? Should this be refactored
too, or are we going to do something about it?
bq. Good question. I do think that JobClient should be using some methods in the
filecache package, instead of hard-coding a lot of logic. That said, I chose to
stop somewhere to avoid making this patch even harder to review. I think it's
ripe for a future JIRA.
Okay.
bq. I think most of the getter/setter methods can be deprecated and moved to
DistributedCacheManager. Or at least we should give some thought to it. For
most of them, I do see from the javadoc that they are only for internal use
anyway.
bq. I agree. A few of them are used to manage the Configuration object. (In my
mind, we're serializing and de-serializing a set of requirements for the
distributed cache into the text configuration, and doing so a bit haphazardly.)
I was very tempted to remove all the ones that are only meant to be internal,
but Tom advised me that I need to keep them deprecated for a version. Again, I
think moving those methods into a more private place is a good task to do along
with changing how JobClient calls into this stuff.
+1. So are you planning to do this in the next version or in this patch itself?
bq. I don't use Pipes typically, and it doesn't seem to compile on my Mac. I'll
try it on a Linux machine, but if it's easy/handy for you, it'd be great to
verify that bug.
Will do so.
bq. TestDistributedCacheManager * Code related to setFileStamps in
JobClient.java (+636 to +656) and testManagerFlow() (+71 to +74) can be
refactored into an internal-only method in DistributedCacheManager.
bq. I extracted the method and moved it to TrackerDistributedCacheManager. Now
that that has "Tracker" in the name, it's a little misplaced, but it's not too
bad.
I think that's agreeable for now. We can move it to a more appropriate place in
the future if we find one.
bq. Another point. We should consolidate TestMRWithDistributedCache,
TestMiniMRLocalFS and TestMiniMRDFSCaching. That's a lot of code in three
different files testing mostly common code. Maybe another JIRA issue, if you
feel so.
bq. I'd prefer a separate JIRA.
Fine. +1.
> extend DistributedCache to work locally (LocalJobRunner)
> --------------------------------------------------------
>
> Key: MAPREDUCE-476
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-476
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: sam rash
> Assignee: Philip Zeyliger
> Priority: Minor
> Attachments: HADOOP-2914-v1-full.patch,
> HADOOP-2914-v1-since-4041.patch, HADOOP-2914-v2.patch, HADOOP-2914-v3.patch,
> MAPREDUCE-476-v2-vs-v3.patch, MAPREDUCE-476-v2-vs-v3.try2.patch,
> MAPREDUCE-476-v2-vs-v4.txt, MAPREDUCE-476-v2.patch, MAPREDUCE-476-v3.patch,
> MAPREDUCE-476-v3.try2.patch, MAPREDUCE-476-v4-requires-MR711.patch,
> MAPREDUCE-476.patch
>
>
> The DistributedCache does not work locally when using the recipe outlined at
> http://hadoop.apache.org/core/docs/r0.16.0/api/org/apache/hadoop/filecache/DistributedCache.html
>
> Ideally, LocalJobRunner would take care of populating the JobConf and copying
> remote files to the local file system (http, assume hdfs = default fs = local
> fs when doing local development).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.