[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-08-16 Thread M. C. Srivas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899296#action_12899296
 ] 

M. C. Srivas commented on MAPREDUCE-1901:
-

>[ from dhruba ]
> It means that a central authority stores the mapping of active callbacks and 
> their associated clients (and files). if a client dies prematurely, the 
> central authority should have the option to recover that callback and hand 
> it over to a newly requesting client. are you proposing that the NN and/or JT 
> be this central authority?

Well, the mtime is a poor-man's version number that is getting checked on every 
access to see if the file at the server is newer.  Adding a callback should 
reduce this load significantly.

To the point of the question: yes, the NN should be able to revoke the callback 
whenever it chooses, at which point the client must re-acquire the callback 
before reusing items in its cache. The client, on a reboot (of itself or of the 
NN), must re-establish the callbacks it cares about. Note that the callback is 
not a lock but a notification mechanism -- many clients can hold callbacks on 
the same file -- so it is not necessary for the NN to revoke a callback from one 
client in order to hand out a callback for the same file to another client.  
When a file changes, all outstanding callbacks for it are revoked so that 
clients can discard/refresh their caches.
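
For concreteness, here is a minimal sketch of the callback contract being 
described. The interface and method names are hypothetical - no such API exists 
in HDFS today - and this is only meant to make the semantics above concrete:

{code:java}
import java.io.IOException;

// Hypothetical sketch only; HDFS has no such API. A client registers interest
// in a path, and the NN promises to notify it (revoke the callback) if the
// file changes, is deleted, or the NN decides to reclaim the callback. Many
// clients may hold callbacks on the same path at once.
public interface CacheCallbackProtocol {

  /** Register (or re-establish after a reboot) a callback on a path.
   *  Returns an expiry time, e.g. one week from now; the client may persist
   *  the callback on local disk until then. */
  long registerCallback(String path, String clientId) throws IOException;

  /** Voluntarily return a callback, keeping NN memory usage low. */
  void returnCallback(String path, String clientId) throws IOException;
}
{code}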

But the above is moot. Why would a "bulk-mtime" call not work, especially given 
the manner in which the "bulk-get-md5-signatures" call is supposed to work in 
Joydeep's proposal? They seem equally onerous (or not).

> Jobs should not submit the same jar files over and over again
> -
>
> Key: MAPREDUCE-1901
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: Joydeep Sen Sarma
> Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources 
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in 
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars 
> and files (because they are using a framework like Hive/Pig) - the same jars 
> keep getting uploaded and downloaded repeatedly. The overhead of this 
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and, conversely, large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in 
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not 
> submit the same files over and again. Identifying and caching execution 
> resources by a content signature (md5/sha) would be a good alternative to 
> have available.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-08-16 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898990#action_12898990
 ] 

Koji Noguchi commented on MAPREDUCE-1901:
-

{quote}
The TaskTracker, on being requested to run a task requiring CAR resource md5_F, 
checks whether md5_F is localized.

* If md5_F is already localized - then nothing more needs to be done. the 
localized version is used by the Task
* If md5_F is not localized - then it's fetched from the CAR repository
{quote}
What are we gaining by using md5_F on the TaskTracker side?
Can we use the existing 'cacheStatus.mtime == confFileStamp' check and change 
the order of the check so that no unnecessary getFileStatus call is made 
(MAPREDUCE-2011)? 
Otherwise, this can only be used for dist files loaded by this framework and 
would require two separate code paths on the TaskTracker side.
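
A rough sketch of the reordering being suggested, assuming a much-simplified 
view of the TaskTracker's bookkeeping (the CacheStatus class and method below 
are illustrative, not the actual TaskTracker code):

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalizationCheck {

  /** Simplified stand-in for the TaskTracker's per-resource record. */
  public static class CacheStatus {
    long mtime;  // mtime recorded when the resource was localized
  }

  /** Returns true if the resource must be (re)localized. The local check runs
   *  first, so a matching localized copy avoids the getFileStatus call. */
  public static boolean needsLocalization(CacheStatus cacheStatus,
      long confFileStamp, FileSystem fs, Path resource) throws IOException {
    if (cacheStatus != null && cacheStatus.mtime == confFileStamp) {
      return false;  // already localized at the expected timestamp
    }
    // Only now ask the NameNode for the current modification time.
    long dfsMtime = fs.getFileStatus(resource).getModificationTime();
    return cacheStatus == null || cacheStatus.mtime != dfsMtime;
  }
}
{code}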




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-08-16 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898983#action_12898983
 ] 

dhruba borthakur commented on MAPREDUCE-1901:
-

> use a model like AFS with callbacks to implement an on-disk cache

It means that a central authority stores  the mapping of active callbacks and 
their associated clients (and files). if a client dies prematurely, the central 
authority should have the option to recover that callback and hand it over to a 
newly requesting client. are you proposing that the NN and/or JT be this 
central authority? 




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-08-16 Thread M. C. Srivas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898956#action_12898956
 ] 

M. C. Srivas commented on MAPREDUCE-1901:
-

Content-addressable is one way to solve this problem, and it seems like an 
extremely heavy-weight approach:
   1. more processing to do whenever a file is added to the file-system
   2. reliability issues getting the signature to match the contents across 
failures/re-replication/etc.
   3. a repository of signatures in HDFS is yet another single point of 
failure, and yet another database that needs to be maintained (recovery code to 
handle "no-data-corruption" on a reboot, scaling it as more files are added, 
backup/restore, HA, etc.)

There are a variety of simpler approaches possible; a few that come to mind 
immediately are listed below in increasing order of complexity.

  1. use distcp or something similar to copy the files onto local disk whenever 
a new version of Hive is released, and set pathnames to that. That is, 
different versions of a set of files are kept in different directories, and 
pathnames are used to distinguish them. For example, we do not do an md5 check 
of "/bin/ls" every time we need to run it. We set our pathname appropriately. 
If there is a different version of "ls" we prefer to use, say, in 
"/my/local/bin", then we get that by setting /my/local/bin ahead of other 
paths in our pathname.

  2. instead of implementing a bulk "getSignatures" call to replace several 
"get_mtime" calls, why not implement a bulk get_mtime instead? (a rough sketch 
of what that could look like follows this list)

  3. use a model like AFS with callbacks to implement an on-disk cache that 
survives reboots (Dhruba knows AFS very well). In other words, the client 
acquires a callback from the name-node for each file it has cached, and HDFS 
guarantees it will notify the client when the file is deleted or changed (at 
which point, the callback is revoked and the client must re-fetch the file). 
The callback lasts for, say, 1 week, and can be persisted on disk. On a 
name-node reboot, the client is responsible for re-establishing the callbacks 
it already has (akin to a block-report). The client can also choose to return 
callbacks, in order to keep the memory requirements on the name-node to a 
minimum. No repository of signatures is needed.
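
As a strawman for item 2: today, without any new name-node RPC, a client could 
get a "bulk mtime" for a set of jars in one call provided they sit under a 
single repository directory, since listStatus already returns all their mtimes 
at once (a true batched call for arbitrary paths would need a protocol change). 
A sketch under that assumption:

{code:java}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BulkMtime {

  /** One NameNode call returns the mtime of every file under repoDir,
   *  instead of one getFileStatus call per jar. */
  public static Map<String, Long> mtimesUnder(Configuration conf, Path repoDir)
      throws IOException {
    FileSystem fs = repoDir.getFileSystem(conf);
    Map<String, Long> mtimes = new HashMap<String, Long>();
    for (FileStatus stat : fs.listStatus(repoDir)) {
      if (!stat.isDir()) {
        mtimes.put(stat.getPath().getName(), stat.getModificationTime());
      }
    }
    return mtimes;
  }
}
{code}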





[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-08-16 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898835#action_12898835
 ] 

Joydeep Sen Sarma commented on MAPREDUCE-1901:
--

the proposal is below - it rephrases some of the discussion above, addresses 
some of the comments around race conditions, and points out limitations. Junjie 
will post a patch tomorrow (which will probably need some more work).

h4. Background

Hadoop map-reduce jobs commonly require jars, executables, archives and other 
resources for task execution on hadoop cluster nodes. A common deployment 
pattern for Hadoop applications is that the required resources are deployed 
centrally by administrators (either on a shared file system or deployed on 
standard local file system paths by package management tools). Users launch 
Hadoop jobs from these installation points. Applications use the options 
provided by Hadoop (-libjars/-files/-archives) to upload resources from the 
installation point so that they are made available for task execution. This 
behavior makes deployment of Hadoop applications very easy (just use standard 
package management tools).

As an example, Facebook has a few different Hive installations (of different 
versions) deployed on an NFS filer. Each has a multitude of jar files - with 
only some differing across Hive versions. Users also maintain a repository of 
map-reduce scripts and jar files containing Hive extensions (user-defined 
functions) on an NFS filer. Any installation of Hive can be used to execute 
jobs against any of several map-reduce clusters. Most of the jar files are also 
required locally (by the Hive client) - either for query compilation or for 
local execution (either in hadoop local mode or for some special types of 
queries).

h4. Problems

With the above arrangement - each (non local-mode) Hadoop job will upload all 
the required jar files into HDFS. TaskTrackers will download these jars from 
HDFS (at most once per job) and check modification times of the downloaded 
files (from the second task onwards). The following overheads are observed:

- Job submission latency is impacted because of the need to serially upload 
multiple jar files into HDFS. At Facebook we typically see 5-6 seconds of 
pause in this stage (depending on how responsive DFS is on a given day).
- There is some latency in setting up the first task as resources must be 
downloaded from HDFS. We have typically observed this to be around 2-3 seconds 
at Facebook. 
- For subsequent tasks - the latency impact is not as high - but the mtime 
check adds to general Namenode pressure.

h4. Observations

- jars and other resources are shared across different jobs and users. there 
are, in fact, hardly any resources that are not shared.
- these resources are meant to be immutable 

We would like to use these properties to solve some of the overheads in the 
current protocol while retaining the simplicity of the deployment model that 
exists today.

h4. General Approach

We would like to introduce (for lack of a better term) the notion of Content 
Addressable Resources (CAR) that are stored in a central repository in HDFS:

# CAR jars/files/archives are identified by their content (for example - 
named using their md5 checksum).\\
\\
This allows different jobs to share resources. Each job can find out 
whether the resources it requires are already available in HDFS (by 
comparing the md5 signatures of its resources against the contents of the CAR 
repository).\\
\\
# Content Addressable resources (once uploaded) are immutable. They can only be 
garbage collected (by server-side daemons).
This allows TaskTrackers to skip mtime checks on such resources.

The CAR functionality is exposed to clients in two ways:

- a boolean configuration option (defaulting to false) to indicate that 
resources added via the -libjars/files/archives options are content addressable
- enhancing the Distributed Cache API to mark specific files/archives as CAR 
(similar to how specific files can be marked public)

h4. Protocol

Assume a jobclient has a CAR file _F_ on local disk to be uploaded for task 
execution. Here's approximately a trace of what happens from the beginning of 
the job to its end (a rough client-side sketch follows the steps):

# Client computes the md5 signature of F (= _md5_F_)
#- One can additionally provide an option to skip this step - the md5 can be 
precomputed and stored in a file named _F.md5_ alongside _F_. The client can 
look for and use the contents of this file as the md5 sum.\\
\\
# The client fetches (in a single filesystem call) the list of md5 signatures 
(and their 'atime' attribute among other things) of the CAR repository\\
\\
# If the _md5_F_ already exists in the CAR repository - then the client simply 
uses the URI of the existing copy as the resource to be downloaded on the 
TaskTrackers\\
#- If the atime of _md5_F_ is older than 1 day, then the client updates the 
atime
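
A rough client-side sketch of steps 1-3, assuming the CAR repository is a flat 
HDFS directory whose file names encode the md5; the repository path and the 
naming scheme here are illustrative, not necessarily what the patch does:

{code:java}
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MD5Hash;

public class CarClientSketch {

  /** Returns the repository path for local file f, uploading it only if no
   *  copy with the same content signature is already present. */
  public static Path shareResource(Configuration conf, Path carRepo, File f)
      throws IOException {
    // Step 1: compute md5_F (or read a precomputed F.md5 sidecar file).
    FileInputStream in = new FileInputStream(f);
    MD5Hash md5;
    try {
      md5 = MD5Hash.digest(in);
    } finally {
      in.close();
    }
    String carName = md5.toString() + "_" + f.getName();

    // Step 2: a single listStatus call fetches the whole repository listing
    // (content-derived names plus atime/mtime) instead of per-file lookups.
    FileSystem fs = carRepo.getFileSystem(conf);
    for (FileStatus stat : fs.listStatus(carRepo)) {
      if (stat.getPath().getName().equals(carName)) {
        // Step 3: reuse the existing copy. The proposal would also refresh
        // its atime here when it is older than a day (omitted in this sketch).
        return stat.getPath();
      }
    }
    // Not present: upload once; later jobs with identical content reuse it.
    Path shared = new Path(carRepo, carName);
    fs.copyFromLocalFile(new Path(f.getAbsolutePath()), shared);
    return shared;
  }
}
{code}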

[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-08-13 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898484#action_12898484
 ] 

Koji Noguchi commented on MAPREDUCE-1901:
-

bq. For me, that's not a worry. It may delay individual job submissions, but 
the overall load on hdfs isn't much.
bq. (at least compared to the later phase of hundreds and thousands of 
tasktrackers looking up the mtime of 'all those jars'.)

Since my problem is just about the lookup of mtime, I created a new jira, 
MAPREDUCE-2011.




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-08-13 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898376#action_12898376
 ] 

Joydeep Sen Sarma commented on MAPREDUCE-1901:
--

> if the jar changed resulting in different tasks of the same job getting 
> different data rendering debugging impossible!

prior to the security-related functionality (and even with it, if the user is 
capricious enough) - i think this was very much possible. the only contract 
that the TTs seem to follow today is that the launched task sees the latest 
possible version of a required resource (jar/file etc.). there's no contract 
that says that each task sees an identical version of each resource. the mtime 
check only seems to reinforce this notion - that resources could be changing 
underneath as the job is running.

that said - there will likely be security holes in our current scheme. we will 
post a spec and trunk patch by later today (or over the weekend) hopefully.




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-08-13 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898342#action_12898342
 ] 

Arun C Murthy commented on MAPREDUCE-1901:
--

bq. sorry for the endless confusion - i will try to write up a detailed doc 
tomorrow covering use cases and design/gaps etc.

Can you please attach one? 

bq. the changes to the distributed cache (of which there are few - i think most 
changes are in the jobclient and taskrunner) are concerned with making the 
assumption that the shared objects are immutable (in which case mtime checks 
can be bypassed).

This seems just wrong - *assuming* immutability is just not useful - the cost 
of the mtime check is trivial compared to doing the I/O, which we are saving 
anyway. It will essentially introduce randomness *if* the jar changed, 
resulting in different tasks of the same job getting different data and 
rendering debugging impossible! 

-1




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-08-13 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898300#action_12898300
 ] 

Koji Noguchi commented on MAPREDUCE-1901:
-

bq. you would get a trace of all those jars being uploaded to hdfs. it's 
ridiculous.

For me, that's not a worry. It may delay individual job submissions, but the 
overall load on hdfs isn't much (at least compared to the later phase of 
hundreds and thousands of tasktrackers looking up the mtime of 'all those 
jars').




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-08-12 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898082#action_12898082
 ] 

Joydeep Sen Sarma commented on MAPREDUCE-1901:
--

it's _almost_ in production :-) - the patch posted here had some bugs. we have 
a final patch for 0.20 available - but have been trying to get the one for 
trunk into shape. should post soon.

try running hadoop in debug log4j mode. you would get a trace of all those jars 
being uploaded to hdfs. it's ridiculous.




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-08-12 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897833#action_12897833
 ] 

Koji Noguchi commented on MAPREDUCE-1901:
-

bq. we have started testing this patch internally and this would become 
production in a couple of weeks.

Joydeep, is this being tested in your production environment? What does the 
load look like?
I don't know the details, but I like the "part of the goal here is to not have 
to look up mtimes again and again" part.

We certainly have applications with many small tasks having multiple 
libjars/distributed-cache files, resulting in too many getfileinfo calls to the 
namenode.




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-07-26 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892206#action_12892206
 ] 

Joydeep Sen Sarma commented on MAPREDUCE-1901:
--

sorry for the endless confusion - i will try to write up a detailed doc 
tomorrow covering use cases and design/gaps etc.

the use case involves libjars being added from local file systems (since that's 
where software packages are deployed). it's really not possible to deploy 
software packages on hdfs - in certain cases we wish to execute the software 
locally without interacting with hdfs at all (see for example HIVE-1408).

the changes to the distributed cache (of which there are few - i think most 
changes are in the jobclient and taskrunner) are concerned with making the 
assumption that the shared objects are immutable (in which case mtime checks 
can be bypassed).




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-07-26 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892198#action_12892198
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1901:


bq. Currently, files loaded through the hadoop libjars/files/archives mechanism 
are copied onto HDFS and removed on every job.
JobClient copies the files/archives passed through the -files, -libjars and 
-archives options if they are in the local file system. If they are already in 
HDFS, no copying is done. Then the hdfs paths are added to the DistributedCache. 
Thus, as Arun pointed out in an earlier comment, DistributedCache does not copy 
any files to dfs.

I think this issue would involve picking up the job jar from a dfs path 
directly instead of the local fs.
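
For reference, a minimal sketch of the existing behavior described above: when 
the jar already lives in HDFS, adding it through the DistributedCache API 
involves no upload at all (the paths below are made up for illustration):

{code:java}
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class HdfsResidentJars {

  public static void configure(JobConf job) throws Exception {
    // A jar that is already in HDFS: no copy is made, the TaskTrackers simply
    // localize it from this path and put it on the task classpath.
    DistributedCache.addFileToClassPath(
        new Path("/share/hive/v1/hive_exec.jar"), job);

    // A plain side file, likewise referenced in place.
    DistributedCache.addCacheFile(new URI("hdfs:///share/udfs/my_udfs.jar"), job);
  }
}
{code}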




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-07-23 Thread Junjie Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891836#action_12891836
 ] 

Junjie Liang commented on MAPREDUCE-1901:
-

To supplement Joydeep's comment:

We are trying to reduce the number of calls to the NameNode through the 
following optimizations:

1) Currently, files loaded through the hadoop libjars/files/archives mechanism 
are copied onto HDFS and removed on every job. This is inefficient if most jobs 
are submitted from only 3-4 versions of hive, because rightfully the files 
should persist in HDFS to be reused. Hence the idea of decoupling files from 
their jobId to make them shareable across jobs.

2) If files are identified by their md5 checksums, we no longer need to verify 
the file modification time in the TT. This saves another call to the NameNode 
to get the FileStatus object. (A small sketch of this naming idea follows 
below.)

The reduction in the number of calls to the NameNode is small, but over a large 
number of jobs we believe it will be a noticeable difference.
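
A small sketch of the naming idea in point 2, using Hadoop's MD5Hash utility; 
the "<md5>_<filename>" format is just one way a content-derived name could be 
built:

{code:java}
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.io.MD5Hash;

public class ContentDerivedName {

  /** Derive a content-addressed name for a local jar. Two files with the same
   *  bytes always map to the same name, so a TaskTracker that has this name
   *  localized already can reuse it without consulting the NameNode. */
  public static String carName(File jar) throws IOException {
    FileInputStream in = new FileInputStream(jar);
    try {
      return MD5Hash.digest(in).toString() + "_" + jar.getName();
    } finally {
      in.close();
    }
  }
}
{code}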




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-07-23 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891824#action_12891824
 ] 

Joydeep Sen Sarma commented on MAPREDUCE-1901:
--

> The DistributedCache already tracks mtimes for files

that's what i am saying. if you consider objects as immutable - then you 
don't have to track and look up mtimes. part of the goal here is to not have to 
look up mtimes again and again. if you have an object with a matching md5 
localized - you are done. (but we can't use the names alone for that. names can 
collide. md5 cannot (or nearly so). so we name objects based on their content 
signature (md5) - which is what a content addressable store/cache does).

> Admin installs pig/hive on hdfs:
> /share/hive/v1/hive.jar
> /share/hive/v2/hive.jar

that's not how hive works (or how hadoop streaming works). people deploy hive 
on NFS filers or local disks. users run hive jobs from these installation 
points. there's no hdfs involvement anywhere. people add jars to hive or hadoop 
streaming from their personal or shared folders. when people run hive jobs - 
they are not writing java. there's no .setRemoteJar() code they are writing.

hive loads the required jars (from the install directory) to hadoop via hadoop 
libjars/files/archives functionality. different hive clients are not aware of 
each other (ditto for hadoop streaming). most of the hive clients are running 
from common install points - but people may be running from personal install 
points with altered builds.

with what we have done in this patch - all these uncoordinated clients 
automatically share jars with each other, because the name for the shared 
object is now derived from the content of the object. we are still leveraging 
the distributed cache - but we are naming objects based on their contents. 
Junjie tells me we can leverage the 'shared' objects namespace from trunk (in 
0.20 we added our own shared namespace).

because the names are based on a strong content signature - we can make the 
assumption of immutability. as i have tried to point out many times - when 
objects are immutable - one can make optimizations and skip timestamp-based 
validation. the latter requires hdfs lookups and creates load and latency.

note that we need zero application changes for this sharing and zero admin 
overhead. so all sorts of hadoop users will automatically start getting the 
benefit of shared jars without writing any code and without any special admin 
recipe.

isn't that good?





[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-07-23 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891812#action_12891812
 ] 

Arun C Murthy commented on MAPREDUCE-1901:
--

Joydeep - Maybe we are talking past each other, yet ...

The DistributedCache _already_ tracks mtimes for files. Each TT, via the 
DistributedCache, localizes the file based on its path and mtime.

This seems sufficient for the use case as I understand it.

Here is the flow:

Admin installs pig/hive on hdfs:
/share/hive/v1/hive.jar
/share/hive/v2/hive.jar

The pig/hive framework - in fact, any MR job - then does:

JobConf job = new JobConf();
job.setRemoteJar(new Path("/share/hive/v1/hive.jar"));
JobClient.runJob(job);


That's it. The JobClient has the smarts to use 
DistributedCache.addArchiveToClassPath as the implementation of 
JobConf.setRemoteJar.
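
A hedged sketch of how the proposed setRemoteJar could be layered on the 
existing API (setRemoteJar does not exist today; 
DistributedCache.addArchiveToClassPath does):

{code:java}
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class RemoteJar {

  /** Hypothetical helper standing in for the proposed JobConf.setRemoteJar:
   *  the jar stays in HDFS and is only localized by the TaskTrackers. */
  public static void setRemoteJar(JobConf job, Path jarInHdfs)
      throws IOException {
    DistributedCache.addArchiveToClassPath(jarInHdfs, job);
  }
}
{code}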

If you want a new version of hive.jar, you change hive to use 
/share/hive/v2/hive.jar.

What am I missing here?




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-07-23 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891674#action_12891674
 ] 

Joydeep Sen Sarma commented on MAPREDUCE-1901:
--

@Arun - you are right - this is a layer above the distributed cache for the 
most part. Take a look at our use case (at the bottom of my previous comments). 
Essentially we are extending the Distributed Cache a bit to be a content 
addressable cache. I do not think our use case is directly supported by Hadoop 
for this purpose - and we are hoping to make the change in the framework 
(instead of Hive) because there's nothing Hive-specific here and whatever we 
are doing will be directly leveraged by other apps.

Sharing != Content addressable. An NFS filer can be globally shared - but it's 
not content addressable. An EMC Centera (amongst others) is. Sorry - terrible 
examples - trying to come up with something quickly.

Will address Vinod's comments later - we have taken race considerations into 
account.




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-07-23 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891640#action_12891640
 ] 

Arun C Murthy commented on MAPREDUCE-1901:
--

To re-iterate: 
Pre-security - Artifacts in DistributedCache are _already_ shared across jobs, 
no changes needed.
Post-security - MAPREDUCE-744 allows for a shared distributed cache across jobs 
too.




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-07-23 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891633#action_12891633
 ] 

Arun C Murthy commented on MAPREDUCE-1901:
--

bq. I'm proposing a change in the way files are stored in HDFS. Instead of 
storing files in /jobid/files or /jobid/archives, we store them directly in 
{mapred.system.dir}/files and {mapred.system.dir}/archives. This removes the 
association between a file and the job ID, so that files can be persistent 
across jobs.

I'm confused here. The distributed cache does not write any files to HDFS; it 
is merely configured with a set of files to be copied from HDFS to the compute 
node. Why are we making these changes?




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-07-23 Thread Vinod K V (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891529#action_12891529
 ] 

Vinod K V commented on MAPREDUCE-1901:
--

bq. Currently, auxiliary files added through DistributedCache.addCacheFiles  
and DistributedCache.addCacheArchive end up in {mapred.system.dir}/job_id/files 
or {mapred.system.dir}/job_id/archives. The /job_id directory is then removed 
after every job, which is why files cannot be reused across jobs.
That is only true for private distributed cache files. Artifacts which are 
already public on the DFS don't go to the mapred system directory at all and 
are reusable across users/jobs.

bq. 2. it treats shared objects as immutable. meaning that we never look up the 
timestamp of the backing object in hdfs during task localization/validation. 
this saves time during task setup.
bq. 3. reasonable effort has been put to bypass as many hdfs calls as possible 
in step 1. the client gets a listing of all shared objects and their md5 
signatures in one shot. because of the immutability assumption - individual 
file stamps are never required and save hdfs calls.
I think this is orthogonal. If md5 checksums are preferred over timestamp-based 
checks for the sake of lessening DFS accesses, that can be done separately 
within the current design, no? Distributed cache files originally did rely on 
the md5 checksums of the files/jars that HDFS itself used to have. However, 
that changed via HADOOP-1084, when those checksums paved the way for 
block-level CRCs.

bq. 4. finally - there is inbuilt code to do garbage collection of the shared 
namespace (in hdfs) by deleting old shared objects that have not been recently 
accessed.
This is where I think it gets tricky. First, garbage collection of the dfs 
namespace should be accompanied by the same on individual TTs - more complexity.

There are race conditions too. It's not clear how the JobTracker is prevented 
from expiring shared cache files/jars when some JobClient has already marked or 
is in the process of marking those artifacts for usage by the job. Guaranteeing 
such synchronization across the JobTracker and JobClients is difficult and, at 
best, brittle. Leaving the synchronization issues unsolved would only mean 
leaving the tasks/job to fail later, which is not desirable.

bq. the difference here is that all applications (like Hive) using libjars etc. 
options provided in hadoop automatically share jars with each other (when they 
set this option). the applications don't have to do anything special (like 
figuring out the right global identifier in hdfs for their jars).
That seems like a valid use-case. But as I mentioned above, because of the 
complexity and race conditions this seems like the wrong place to develop it.

I think the core problem is trying to perform a service (sharing of files) that 
strictly belongs to the layer above mapreduce - maintaining the share list 
doesn't seem like the JT's responsibility. The current way of leaving it to the 
users to decide which files are public (and hence shareable) and which are not, 
and how and when they are purged, keeps things saner from the mapreduce 
framework point of view. What do you think?

bq. if you can look at the patch a bit - that might help understand the 
differences as well
I looked at the patch. And I am still not convinced. Yet, that is.




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-07-22 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891471#action_12891471
 ] 

Joydeep Sen Sarma commented on MAPREDUCE-1901:
--

thanks for taking a look. i think there are some differences (and potentially 
some overlap as well) with what we are trying to do here:

1. the jobclient in this approach computes the md5 of jars/files/archives (when 
a special option is enabled) and then automatically submits these jars as 
shared objects by putting them in a global namespace - where (md5, file-name) 
identifies the shared object (instead of (jobid, file-name, file-timestamp)).

2. it treats shared objects as immutable. meaning that we never look up the 
timestamp of the backing object in hdfs during task localization/validation. 
this saves time during task setup. 

3. reasonable effort has been put in to bypass as many hdfs calls as possible 
in step 1. the client gets a listing of all shared objects and their md5 
signatures in one shot. because of the immutability assumption - individual 
file timestamps are never required, which saves hdfs calls.

4. finally - there is inbuilt code to do garbage collection of the shared 
namespace (in hdfs) by deleting old shared objects that have not been recently 
accessed. (a rough sketch of this follows below.)
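
A rough sketch of the kind of garbage collection point 4 describes, assuming 
the shared namespace is a single HDFS directory and that access times are 
maintained; the directory layout and idle threshold are illustrative:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharedNamespaceGc {

  /** Delete shared objects that have not been accessed for maxIdleMillis.
   *  Deleting without any coordination with running JobClients is where the
   *  race conditions discussed elsewhere in this thread come from. */
  public static void collect(Configuration conf, Path sharedDir,
      long maxIdleMillis) throws IOException {
    FileSystem fs = sharedDir.getFileSystem(conf);
    long cutoff = System.currentTimeMillis() - maxIdleMillis;
    for (FileStatus stat : fs.listStatus(sharedDir)) {
      if (!stat.isDir() && stat.getAccessTime() < cutoff) {
        fs.delete(stat.getPath(), false);
      }
    }
  }
}
{code}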

so i believe the scope of this effort is somewhat different (based on looking 
at the last patch for MAPREDUCE-744).

the difference here is that all applications (like Hive) using libjars etc. 
options provided in hadoop automatically share jars with each other (when they 
set this option). the applications don't have to do anything special (like 
figuring out the right global identifier in hdfs for their jars).

Our primary use case is for Hive. Hive submits multiple jars for each Hadoop 
job. Users can add more. At any given time - we have at least 4-5 official 
versions of Hive being used to submit jobs. in addition - hive developers are 
developing custom builds and submitting jobs using them. the total number of 
jobs submitted per day is in the tens of thousands.

with this patch - we automatically get sharing of jars and zero administration 
overhead of managing a global namespace amongst many versions of our software 
libraries. I believe there's nothing Hive specific here. We use hadoop jar/file 
resources just like hadoop-streaming and other map-reduce jobs.

Before embarking on this venture, we looked at the Hadoop code and tried to 
find out whether a similar facility already existed. We noticed an md5 class, 
but no uses for it. If there is existing functionality to the above effect, we 
would love to pick it up (less work for us); otherwise, I think this is very 
useful functionality that would be good to have in the Hadoop framework.

If you can look at the patch a bit, that might also help clarify the 
differences.




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-07-22 Thread Vinod K V (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891461#action_12891461
 ] 

Vinod K V commented on MAPREDUCE-1901:
--

Apologies for not looking at this issue before.

The distributed cache already has support for sharing files/archives via 
MAPREDUCE-744. It went into 0.21; maybe all you need is a back-port.

The requirements for this issue can be met simply by making the job jar files 
on DFS public and adding them to the distributed cache as files/archives to be 
put on the task's classpath. I don't see anything else needed.
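For illustration, a minimal sketch of this approach, assuming MAPREDUCE-744 
semantics (0.21+) and that the jar has already been uploaded once to a 
world-readable DFS path; the path and class name below are made up.

{code}
// Sketch only: relies on the MAPREDUCE-744 public cache, i.e. files that are
// world-readable on DFS are localized once per node and shared across jobs/users.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class PublicJarExample {
  public static void configure(Configuration conf) throws IOException {
    // Illustrative location of a jar that was uploaded once and made world-readable.
    Path sharedJar = new Path("hdfs:///apps/hive/lib/hive-exec.jar");

    // Adds the jar to the distributed cache and to the task classpath.
    DistributedCache.addFileToClassPath(sharedJar, conf);
  }
}
{code}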




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-07-22 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891393#action_12891393
 ] 

Joydeep Sen Sarma commented on MAPREDUCE-1901:
--

Arun and other Hadoopers: it might take JJ some time to get the patch for trunk 
ready. If you have some cycles, it would be good to vet the general approach by 
looking at the patch for 0.20. I think the code for trunk differs primarily in 
security-related aspects (from a quick glance).

We have started testing this patch internally, and it should be in production 
within a couple of weeks.




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-07-12 Thread Junjie Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887614#action_12887614
 ] 

Junjie Liang commented on MAPREDUCE-1901:
-

Here are some more details on how I intend to make the changes. Please comment 
and give suggestions as you see fit.

Currently, auxiliary files added through {{DistributedCache.addCacheFile}} and 
{{DistributedCache.addCacheArchive}} end up in {mapred.system.dir}/job_id/files 
or {mapred.system.dir}/job_id/archives. The /job_id directory is then removed 
after every job, which is why files cannot be reused across jobs.

I'm proposing a change in the way files are stored in HDFS. Instead of storing 
files in /job_id/files or /job_id/archives, we store them directly in 
{mapred.system.dir}/files and {mapred.system.dir}/archives. This removes the 
association between a file and the job ID, so that files can persist across 
jobs.

Two new functions, {{DistributedCache.addSharedCacheFiles()}} and 
{{DistributedCache.addSharedCacheArchives()}}, are added for users to register 
files that can be shared across jobs. Files added through the original 
functions {{addCacheFile()}} and {{addCacheArchive()}} are not affected; they 
go through the same code path as before.

The "shared" files are stored in {mapred.system.dir}/files and 
{mapred.system.dir}/archives (note the job_id is removed from the path). To 
prevent files with the same filename from colliding, a prefix which is the md5 
of the file is added to the filename of each file, so for example, test.txt 
becomes ab876d86389d76c9e692fffd51bb2acde_test.txt. We use both the md5 
checksum and filename to identify a file so there is no confusion between files 
with the same filename but have different contents, and files with the same 
contents but with different filenames.

The TaskRunner no longer needs to use timestamps to decide whether a file is up 
to date, since the file will have a different md5 checksum if it is modified.
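A rough sketch of what that TaskTracker-side check could look like; the class 
and method names are illustrative, not the actual patch code.

{code}
// With immutable shared objects, localization is just an existence check:
// no getFileStatus()/timestamp comparison against HDFS is needed.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharedLocalizer {
  /** Returns the local copy of a shared file, downloading it only on first use. */
  public static Path localizeShared(Configuration conf, Path localCacheRoot,
                                    String sharedName) throws IOException {
    // sharedName is the md5-prefixed name, e.g. "<md5>_test.txt".
    Path remote = new Path(conf.get("mapred.system.dir") + "/files", sharedName);
    Path local = new Path(localCacheRoot, sharedName);

    FileSystem localFs = FileSystem.getLocal(conf);
    if (!localFs.exists(local)) {
      FileSystem.get(conf).copyToLocalFile(remote, local);  // download once per node
    }
    return local;  // later tasks on this node reuse the already-localized copy
  }
}
{code}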

Files that need to be changed: JobClient.java, DistributedCache.java, and 
TaskRunner.java have the most changes, since resources move from the client to 
HDFS to the TaskTracker nodes through this code.

Thanks!




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-06-30 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884190#action_12884190
 ] 

Joydeep Sen Sarma commented on MAPREDUCE-1901:
--

It is certainly true that we could do this at the Hive layer. Two issues:

- It is not generic (meaning it wouldn't work for streaming, for example).
- We would need to repeat some of the classpath management that the 
JobClient/TT already take care of.

Currently Hive leverages the Hadoop-provided facilities for distributing jars 
and files, and we will try to extend this functionality.




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-06-30 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884162#action_12884162
 ] 

dhruba borthakur commented on MAPREDUCE-1901:
-

+1

From what I have learnt, the files in the distributed cache are persisted even 
across map-reduce jobs. So the Hive client can upload the relevant jars to some 
location in HDFS and then point the distributed cache at those HDFS URIs. If we 
do that, the TT will download those URIs to local disk only once, and all tasks 
(across multiple jobs) on that TaskTracker will continue to use the same jars.
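A minimal sketch of that upload-once pattern, with illustrative paths (not an 
official location):

{code}
// Sketch: the client uploads the jar to a fixed HDFS location once; every
// subsequent job just points the distributed cache at that URI.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadOnceExample {
  public static void configure(Configuration conf) throws IOException {
    FileSystem dfs = FileSystem.get(conf);
    Path local = new Path("file:///opt/hive/lib/hive-exec.jar");   // illustrative
    Path shared = new Path("/user/hive/aux-jars/hive-exec.jar");   // illustrative

    if (!dfs.exists(shared)) {
      dfs.copyFromLocalFile(local, shared);  // upload once
    }
    // The TT localizes this URI once and reuses the local copy for later jobs.
    DistributedCache.addCacheFile(dfs.makeQualified(shared).toUri(), conf);
  }
}
{code}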




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-06-30 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884056#action_12884056
 ] 

Joydeep Sen Sarma commented on MAPREDUCE-1901:
--

Yes, we have an intern (Junjie Liang) working on this; he is reusing the 
DistributedCache and should be posting some code/design soon.




[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

2010-06-30 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884049#action_12884049
 ] 

Arun C Murthy commented on MAPREDUCE-1901:
--

+1

A straightforward way is to use the DistributedCache directly; an easy change 
would be to have the JobSubmissionProtocol use either a custom jar (as today) 
or simply refer to the jars already in the DistributedCache.
