Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
That is correct.  However, it is a bit more complicated than that.  The
TaskTracker's in-memory index of the distributed cache is keyed off of the path
of the file and the HDFS creation time of the file.  So if you delete the
original file off of HDFS and then recreate it with a new timestamp, the
distributed cache will start downloading the new file.
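
For example, here is a rough, untested sketch of forcing a re-download that
way: copy the file aside, delete the original, and rename the copy back into
place, which gives the original path a new timestamp.  The paths and class
name below are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class RefreshCacheFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path orig = new Path("/data/lookup.txt");      // hypothetical cache file
    Path tmp  = new Path("/data/lookup.txt.tmp");
    // Copy aside, delete the original, then rename the copy back into place.
    // The file at the original path now has a new timestamp, so TaskTrackers
    // should treat it as a new cache entry and download it again.
    FileUtil.copy(fs, orig, fs, tmp, false, conf);
    fs.delete(orig, false);
    fs.rename(tmp, orig);
  }
}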

Also, when the distributed cache on a disk fills up, unused entries in it are
deleted.

--Bobby Evans

On 9/27/11 2:32 PM, "Meng Mao"  wrote:

So the proper description of how DistributedCache normally works is:

1. have files to be cached sitting around in HDFS
2. Run Job A, which specifies those files to be put into DistributedCache
space. Each worker node copies the to-be-cached files from HDFS to local
disk, but more importantly, the TaskTracker acknowledges this distribution
and marks somewhere the fact that these files are cached for the first (and
only) time
3. job runs fine
4. Run Job A some time later. TaskTracker simply assumes (by looking at its
memory) that the files are still cached. The tasks on the workers, on this
second call to addCacheFile, don't actually copy the files from HDFS to
local disk again, but instead accept TaskTracker's word that they're still
there. Because the files actually still exist, the workers run fine and the
job finishes normally.

Is that a correct interpretation? If so, the caution, then, must be that if
you accidentally delete the local disk copies of the cache files, you must
either repopulate them (as well as their crc checksums) or restart the
TaskTracker?



On Tue, Sep 27, 2011 at 3:03 PM, Robert Evans  wrote:

> Yes, all of the state for the task tracker is in memory.  It never looks at
> the disk to see what is there, it only maintains the state in memory.
>
> --bobby Evans
>
>
> On 9/27/11 1:00 PM, "Meng Mao"  wrote:
>
> I'm not concerned about disk space usage -- the script we used that deleted
> the taskTracker cache path has been fixed not to do so.
>
> I'm curious about the exact behavior of jobs that use DistributedCache
> files. Again, it seems safe from your description to delete files between
> completed runs. How could the job or the taskTracker distinguish between
> the
> files having been deleted and their not having been downloaded from a
> previous run of the job? Is it state in memory that the taskTracker
> maintains?
>
>
> On Tue, Sep 27, 2011 at 1:44 PM, Robert Evans  wrote:
>
> > If you are never ever going to use that file again for any map/reduce
> task
> > in the future then yes you can delete it, but I would not recommend it.
>  If
> > you want to reduce the amount of space that is used by the distributed
> cache
> > there is a config parameter for that.
> >
> > "local.cache.size"  it is the number of bytes per drive that will be used
> > for storing data in the distributed cache.   This is in 0.20 for hadoop I
> am
> > not sure if it has changed at all for trunk.  It is not documented as far
> as
> > I can tell, and it defaults to 10GB.
> >
> > --Bobby Evans
> >
> >
> > On 9/27/11 12:04 PM, "Meng Mao"  wrote:
> >
> > From that interpretation, it then seems like it would be safe to delete
> the
> > files between completed runs? How could it distinguish between the files
> > having been deleted and their not having been downloaded from a previous
> > run?
> >
> > On Tue, Sep 27, 2011 at 12:25 PM, Robert Evans 
> > wrote:
> >
> > > addCacheFile sets a config value in your jobConf that indicates which
> > files
> > > your particular job depends on.  When the TaskTracker is assigned to
> run
> > > part of your job (map task or reduce task), it will download your
> > jobConf,
> > > read it in, and then download the files listed in the conf, if it has
> not
> > > already downloaded them from a previous run.  Then it will set up the
> > > directory structure for your job, possibly adding in symbolic links to
> > these
> > > files in the working directory for your task.  After that it will
> launch
> > > your task.
> > >
> > > --Bobby Evans
> > >
> > > On 9/27/11 11:17 AM, "Meng Mao"  wrote:
> > >
> > > Who is in charge of getting the files there for the first time? The
> > > addCacheFile call in the mapreduce job? Or a manual setup by the
> > > user/operator?
> > >
> > > On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans 
> > > wrote:
> > >
> > > > The problem is the step 4 in the breaking sequence.  Currently the
> > > > TaskTracker never looks at the disk to know if a file is in the
> > > distributed
> > > > cache or not.  It assumes that if it downloaded the file and did not
> > > delete
> > > > that file itself then the file is still there in its original form.
>  It
> > > does
> > > > not know that you deleted those files, or if wrote to the files, or
> in
> > > any
> > > > way altered those files.  In general you should not be modifying
> those
> > > > files.  This is not only because it messes up the tracking of those
> > > files,
> > > > but because other jobs running concurrently with your t

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Meng Mao
So the proper description of how DistributedCache normally works is:

1. have files to be cached sitting around in HDFS
2. Run Job A, which specifies those files to be put into DistributedCache
space. Each worker node copies the to-be-cached files from HDFS to local
disk, but more importantly, the TaskTracker acknowledges this distribution
and marks somewhere the fact that these files are cached for the first (and
only) time
3. job runs fine
4. Run Job A some time later. TaskTracker simply assumes (by looking at its
memory) that the files are still cached. The tasks on the workers, on this
second call to addCacheFile, don't actually copy the files from HDFS to
local disk again, but instead accept TaskTracker's word that they're still
there. Because the files actually still exist, the workers run fine and the
job finishes normally.

Is that a correct interpretation? If so, the caution, then, must be that if
you accidentally delete the local disk copies of the cache files, you must
either repopulate them (as well as their crc checksums) or restart the
TaskTracker?



On Tue, Sep 27, 2011 at 3:03 PM, Robert Evans  wrote:

> Yes, all of the state for the task tracker is in memory.  It never looks at
> the disk to see what is there, it only maintains the state in memory.
>
> --bobby Evans
>
>
> On 9/27/11 1:00 PM, "Meng Mao"  wrote:
>
> I'm not concerned about disk space usage -- the script we used that deleted
> the taskTracker cache path has been fixed not to do so.
>
> I'm curious about the exact behavior of jobs that use DistributedCache
> files. Again, it seems safe from your description to delete files between
> completed runs. How could the job or the taskTracker distinguish between
> the
> files having been deleted and their not having been downloaded from a
> previous run of the job? Is it state in memory that the taskTracker
> maintains?
>
>
> On Tue, Sep 27, 2011 at 1:44 PM, Robert Evans  wrote:
>
> > If you are never ever going to use that file again for any map/reduce
> task
> > in the future then yes you can delete it, but I would not recommend it.
>  If
> > you want to reduce the amount of space that is used by the distributed
> cache
> > there is a config parameter for that.
> >
> > "local.cache.size"  it is the number of bytes per drive that will be used
> > for storing data in the distributed cache.   This is in 0.20 for hadoop I
> am
> > not sure if it has changed at all for trunk.  It is not documented as far
> as
> > I can tell, and it defaults to 10GB.
> >
> > --Bobby Evans
> >
> >
> > On 9/27/11 12:04 PM, "Meng Mao"  wrote:
> >
> > From that interpretation, it then seems like it would be safe to delete
> the
> > files between completed runs? How could it distinguish between the files
> > having been deleted and their not having been downloaded from a previous
> > run?
> >
> > On Tue, Sep 27, 2011 at 12:25 PM, Robert Evans 
> > wrote:
> >
> > > addCacheFile sets a config value in your jobConf that indicates which
> > files
> > > your particular job depends on.  When the TaskTracker is assigned to
> run
> > > part of your job (map task or reduce task), it will download your
> > jobConf,
> > > read it in, and then download the files listed in the conf, if it has
> not
> > > already downloaded them from a previous run.  Then it will set up the
> > > directory structure for your job, possibly adding in symbolic links to
> > these
> > > files in the working directory for your task.  After that it will
> launch
> > > your task.
> > >
> > > --Bobby Evans
> > >
> > > On 9/27/11 11:17 AM, "Meng Mao"  wrote:
> > >
> > > Who is in charge of getting the files there for the first time? The
> > > addCacheFile call in the mapreduce job? Or a manual setup by the
> > > user/operator?
> > >
> > > On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans 
> > > wrote:
> > >
> > > > The problem is the step 4 in the breaking sequence.  Currently the
> > > > TaskTracker never looks at the disk to know if a file is in the
> > > distributed
> > > > cache or not.  It assumes that if it downloaded the file and did not
> > > delete
> > > > that file itself then the file is still there in its original form.
>  It
> > > does
> > > > not know that you deleted those files, or if wrote to the files, or
> in
> > > any
> > > > way altered those files.  In general you should not be modifying
> those
> > > > files.  This is not only because it messes up the tracking of those
> > > files,
> > > > but because other jobs running concurrently with your task may also
> be
> > > using
> > > > those files.
> > > >
> > > > --Bobby Evans
> > > >
> > > >
> > > > On 9/26/11 4:40 PM, "Meng Mao"  wrote:
> > > >
> > > > Let's frame the issue in another way. I'll describe a sequence of
> > Hadoop
> > > > operations that I think should work, and then I'll get into what we
> did
> > > and
> > > > how it failed.
> > > >
> > > > Normal sequence:
> > > > 1. have files to be cached in HDFS
> > > > 2. Run Job A, which specifies those files to be put into
> > Distri

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
Yes, all of the state for the TaskTracker is in memory.  It never looks at the
disk to see what is there; it only maintains the state in memory.

--bobby Evans


On 9/27/11 1:00 PM, "Meng Mao"  wrote:

I'm not concerned about disk space usage -- the script we used that deleted
the taskTracker cache path has been fixed not to do so.

I'm curious about the exact behavior of jobs that use DistributedCache
files. Again, it seems safe from your description to delete files between
completed runs. How could the job or the taskTracker distinguish between the
files having been deleted and their not having been downloaded from a
previous run of the job? Is it state in memory that the taskTracker
maintains?


On Tue, Sep 27, 2011 at 1:44 PM, Robert Evans  wrote:

> If you are never ever going to use that file again for any map/reduce task
> in the future then yes you can delete it, but I would not recommend it.  If
> you want to reduce the amount of space that is used by the distributed cache
> there is a config parameter for that.
>
> "local.cache.size"  it is the number of bytes per drive that will be used
> for storing data in the distributed cache.   This is in 0.20 for hadoop I am
> not sure if it has changed at all for trunk.  It is not documented as far as
> I can tell, and it defaults to 10GB.
>
> --Bobby Evans
>
>
> On 9/27/11 12:04 PM, "Meng Mao"  wrote:
>
> From that interpretation, it then seems like it would be safe to delete the
> files between completed runs? How could it distinguish between the files
> having been deleted and their not having been downloaded from a previous
> run?
>
> On Tue, Sep 27, 2011 at 12:25 PM, Robert Evans 
> wrote:
>
> > addCacheFile sets a config value in your jobConf that indicates which
> files
> > your particular job depends on.  When the TaskTracker is assigned to run
> > part of your job (map task or reduce task), it will download your
> jobConf,
> > read it in, and then download the files listed in the conf, if it has not
> > already downloaded them from a previous run.  Then it will set up the
> > directory structure for your job, possibly adding in symbolic links to
> these
> > files in the working directory for your task.  After that it will launch
> > your task.
> >
> > --Bobby Evans
> >
> > On 9/27/11 11:17 AM, "Meng Mao"  wrote:
> >
> > Who is in charge of getting the files there for the first time? The
> > addCacheFile call in the mapreduce job? Or a manual setup by the
> > user/operator?
> >
> > On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans 
> > wrote:
> >
> > > The problem is the step 4 in the breaking sequence.  Currently the
> > > TaskTracker never looks at the disk to know if a file is in the
> > distributed
> > > cache or not.  It assumes that if it downloaded the file and did not
> > delete
> > > that file itself then the file is still there in its original form.  It
> > does
> > > not know that you deleted those files, or if wrote to the files, or in
> > any
> > > way altered those files.  In general you should not be modifying those
> > > files.  This is not only because it messes up the tracking of those
> > files,
> > > but because other jobs running concurrently with your task may also be
> > using
> > > those files.
> > >
> > > --Bobby Evans
> > >
> > >
> > > On 9/26/11 4:40 PM, "Meng Mao"  wrote:
> > >
> > > Let's frame the issue in another way. I'll describe a sequence of
> Hadoop
> > > operations that I think should work, and then I'll get into what we did
> > and
> > > how it failed.
> > >
> > > Normal sequence:
> > > 1. have files to be cached in HDFS
> > > 2. Run Job A, which specifies those files to be put into
> DistributedCache
> > > space
> > > 3. job runs fine
> > > 4. Run Job A some time later. job runs fine again.
> > >
> > > Breaking sequence:
> > > 1. have files to be cached in HDFS
> > > 2. Run Job A, which specifies those files to be put into
> DistributedCache
> > > space
> > > 3. job runs fine
> > > 4. Manually delete cached files out of local disk on worker nodes
> > > 5. Run Job A again, expect it to push out cache copies as needed.
> > > 6. job fails because the cache copies didn't get distributed
> > >
> > > Should this second sequence have broken?
> > >
> > > On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao  wrote:
> > >
> > > > Hmm, I must have really missed an important piece somewhere. This is
> > from
> > > > the MapRed tutorial text:
> > > >
> > > > "DistributedCache is a facility provided by the Map/Reduce framework
> to
> > > > cache files (text, archives, jars and so on) needed by applications.
> > > >
> > > > Applications specify the files to be cached via urls (hdfs://) in the
> > > > JobConf. The DistributedCache* assumes that the files specified via
> > > > hdfs:// urls are already present on the FileSystem.*
> > > >
> > > > *The framework will copy the necessary files to the slave node before
> > any
> > > > tasks for the job are executed on that node*. Its efficiency stems
> from
> > > > the fact that the files are 

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Meng Mao
I'm not concerned about disk space usage -- the script we used that deleted
the taskTracker cache path has been fixed not to do so.

I'm curious about the exact behavior of jobs that use DistributedCache
files. Again, it seems safe from your description to delete files between
completed runs. How could the job or the taskTracker distinguish between the
files having been deleted and their not having been downloaded from a
previous run of the job? Is it state in memory that the taskTracker
maintains?


On Tue, Sep 27, 2011 at 1:44 PM, Robert Evans  wrote:

> If you are never ever going to use that file again for any map/reduce task
> in the future then yes you can delete it, but I would not recommend it.  If
> you want to reduce the amount of space that is used by the distributed cache
> there is a config parameter for that.
>
> "local.cache.size"  it is the number of bytes per drive that will be used
> for storing data in the distributed cache.   This is in 0.20 for hadoop I am
> not sure if it has changed at all for trunk.  It is not documented as far as
> I can tell, and it defaults to 10GB.
>
> --Bobby Evans
>
>
> On 9/27/11 12:04 PM, "Meng Mao"  wrote:
>
> From that interpretation, it then seems like it would be safe to delete the
> files between completed runs? How could it distinguish between the files
> having been deleted and their not having been downloaded from a previous
> run?
>
> On Tue, Sep 27, 2011 at 12:25 PM, Robert Evans 
> wrote:
>
> > addCacheFile sets a config value in your jobConf that indicates which
> files
> > your particular job depends on.  When the TaskTracker is assigned to run
> > part of your job (map task or reduce task), it will download your
> jobConf,
> > read it in, and then download the files listed in the conf, if it has not
> > already downloaded them from a previous run.  Then it will set up the
> > directory structure for your job, possibly adding in symbolic links to
> these
> > files in the working directory for your task.  After that it will launch
> > your task.
> >
> > --Bobby Evans
> >
> > On 9/27/11 11:17 AM, "Meng Mao"  wrote:
> >
> > Who is in charge of getting the files there for the first time? The
> > addCacheFile call in the mapreduce job? Or a manual setup by the
> > user/operator?
> >
> > On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans 
> > wrote:
> >
> > > The problem is the step 4 in the breaking sequence.  Currently the
> > > TaskTracker never looks at the disk to know if a file is in the
> > distributed
> > > cache or not.  It assumes that if it downloaded the file and did not
> > delete
> > > that file itself then the file is still there in its original form.  It
> > does
> > > not know that you deleted those files, or if wrote to the files, or in
> > any
> > > way altered those files.  In general you should not be modifying those
> > > files.  This is not only because it messes up the tracking of those
> > files,
> > > but because other jobs running concurrently with your task may also be
> > using
> > > those files.
> > >
> > > --Bobby Evans
> > >
> > >
> > > On 9/26/11 4:40 PM, "Meng Mao"  wrote:
> > >
> > > Let's frame the issue in another way. I'll describe a sequence of
> Hadoop
> > > operations that I think should work, and then I'll get into what we did
> > and
> > > how it failed.
> > >
> > > Normal sequence:
> > > 1. have files to be cached in HDFS
> > > 2. Run Job A, which specifies those files to be put into
> DistributedCache
> > > space
> > > 3. job runs fine
> > > 4. Run Job A some time later. job runs fine again.
> > >
> > > Breaking sequence:
> > > 1. have files to be cached in HDFS
> > > 2. Run Job A, which specifies those files to be put into
> DistributedCache
> > > space
> > > 3. job runs fine
> > > 4. Manually delete cached files out of local disk on worker nodes
> > > 5. Run Job A again, expect it to push out cache copies as needed.
> > > 6. job fails because the cache copies didn't get distributed
> > >
> > > Should this second sequence have broken?
> > >
> > > On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao  wrote:
> > >
> > > > Hmm, I must have really missed an important piece somewhere. This is
> > from
> > > > the MapRed tutorial text:
> > > >
> > > > "DistributedCache is a facility provided by the Map/Reduce framework
> to
> > > > cache files (text, archives, jars and so on) needed by applications.
> > > >
> > > > Applications specify the files to be cached via urls (hdfs://) in the
> > > > JobConf. The DistributedCache* assumes that the files specified via
> > > > hdfs:// urls are already present on the FileSystem.*
> > > >
> > > > *The framework will copy the necessary files to the slave node before
> > any
> > > > tasks for the job are executed on that node*. Its efficiency stems
> from
> > > > the fact that the files are only copied once per job and the ability
> to
> > > > cache archives which are un-archived on the slaves."
> > > >
> > > >
> > > > After some close reading, the two bolded pieces seem to be in
> > > contr

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
If you are never ever going to use that file again for any map/reduce task in
the future, then yes, you can delete it, but I would not recommend it.  If you
want to reduce the amount of space that is used by the distributed cache, there
is a config parameter for that.

"local.cache.size" is the number of bytes per drive that will be used for
storing data in the distributed cache.  This is in 0.20 for Hadoop; I am not
sure if it has changed at all for trunk.  It is not documented as far as I can
tell, and it defaults to 10GB.
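
For example, to cap it at roughly 5GB (the value below is just an example), I
believe you would set it in mapred-site.xml on each TaskTracker and restart the
TaskTracker, since the setting is read by the daemon rather than by your job:

<property>
  <name>local.cache.size</name>
  <!-- size in bytes per drive; this example is 5GB -->
  <value>5368709120</value>
</property>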

--Bobby Evans


On 9/27/11 12:04 PM, "Meng Mao"  wrote:

From that interpretation, it then seems like it would be safe to delete the
files between completed runs? How could it distinguish between the files
having been deleted and their not having been downloaded from a previous
run?

On Tue, Sep 27, 2011 at 12:25 PM, Robert Evans  wrote:

> addCacheFile sets a config value in your jobConf that indicates which files
> your particular job depends on.  When the TaskTracker is assigned to run
> part of your job (map task or reduce task), it will download your jobConf,
> read it in, and then download the files listed in the conf, if it has not
> already downloaded them from a previous run.  Then it will set up the
> directory structure for your job, possibly adding in symbolic links to these
> files in the working directory for your task.  After that it will launch
> your task.
>
> --Bobby Evans
>
> On 9/27/11 11:17 AM, "Meng Mao"  wrote:
>
> Who is in charge of getting the files there for the first time? The
> addCacheFile call in the mapreduce job? Or a manual setup by the
> user/operator?
>
> On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans 
> wrote:
>
> > The problem is the step 4 in the breaking sequence.  Currently the
> > TaskTracker never looks at the disk to know if a file is in the
> distributed
> > cache or not.  It assumes that if it downloaded the file and did not
> delete
> > that file itself then the file is still there in its original form.  It
> does
> > not know that you deleted those files, or if wrote to the files, or in
> any
> > way altered those files.  In general you should not be modifying those
> > files.  This is not only because it messes up the tracking of those
> files,
> > but because other jobs running concurrently with your task may also be
> using
> > those files.
> >
> > --Bobby Evans
> >
> >
> > On 9/26/11 4:40 PM, "Meng Mao"  wrote:
> >
> > Let's frame the issue in another way. I'll describe a sequence of Hadoop
> > operations that I think should work, and then I'll get into what we did
> and
> > how it failed.
> >
> > Normal sequence:
> > 1. have files to be cached in HDFS
> > 2. Run Job A, which specifies those files to be put into DistributedCache
> > space
> > 3. job runs fine
> > 4. Run Job A some time later. job runs fine again.
> >
> > Breaking sequence:
> > 1. have files to be cached in HDFS
> > 2. Run Job A, which specifies those files to be put into DistributedCache
> > space
> > 3. job runs fine
> > 4. Manually delete cached files out of local disk on worker nodes
> > 5. Run Job A again, expect it to push out cache copies as needed.
> > 6. job fails because the cache copies didn't get distributed
> >
> > Should this second sequence have broken?
> >
> > On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao  wrote:
> >
> > > Hmm, I must have really missed an important piece somewhere. This is
> from
> > > the MapRed tutorial text:
> > >
> > > "DistributedCache is a facility provided by the Map/Reduce framework to
> > > cache files (text, archives, jars and so on) needed by applications.
> > >
> > > Applications specify the files to be cached via urls (hdfs://) in the
> > > JobConf. The DistributedCache* assumes that the files specified via
> > > hdfs:// urls are already present on the FileSystem.*
> > >
> > > *The framework will copy the necessary files to the slave node before
> any
> > > tasks for the job are executed on that node*. Its efficiency stems from
> > > the fact that the files are only copied once per job and the ability to
> > > cache archives which are un-archived on the slaves."
> > >
> > >
> > > After some close reading, the two bolded pieces seem to be in
> > contradiction
> > > of each other? I'd always that addCacheFile() would perform the 2nd
> > bolded
> > > statement. If that sentence is true, then I still don't have an
> > explanation
> > > of why our job didn't correctly push out new versions of the cache
> files
> > > upon the startup and execution of JobConfiguration. We deleted them
> > before
> > > our job started, not during.
> > >
> > > On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans 
> > wrote:
> > >
> > >> Meng Mao,
> > >>
> > >> The way the distributed cache is currently written, it does not verify
> > the
> > >> integrity of the cache files at all after they are downloaded.  It
> just
> > >> assumes that if they were downloaded once they are still there and in
> > the
> > >> proper shape.  It might be good to file a J

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Meng Mao
From that interpretation, it then seems like it would be safe to delete the
files between completed runs? How could it distinguish between the files
having been deleted and their not having been downloaded from a previous
run?

On Tue, Sep 27, 2011 at 12:25 PM, Robert Evans  wrote:

> addCacheFile sets a config value in your jobConf that indicates which files
> your particular job depends on.  When the TaskTracker is assigned to run
> part of your job (map task or reduce task), it will download your jobConf,
> read it in, and then download the files listed in the conf, if it has not
> already downloaded them from a previous run.  Then it will set up the
> directory structure for your job, possibly adding in symbolic links to these
> files in the working directory for your task.  After that it will launch
> your task.
>
> --Bobby Evans
>
> On 9/27/11 11:17 AM, "Meng Mao"  wrote:
>
> Who is in charge of getting the files there for the first time? The
> addCacheFile call in the mapreduce job? Or a manual setup by the
> user/operator?
>
> On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans 
> wrote:
>
> > The problem is the step 4 in the breaking sequence.  Currently the
> > TaskTracker never looks at the disk to know if a file is in the
> distributed
> > cache or not.  It assumes that if it downloaded the file and did not
> delete
> > that file itself then the file is still there in its original form.  It
> does
> > not know that you deleted those files, or if wrote to the files, or in
> any
> > way altered those files.  In general you should not be modifying those
> > files.  This is not only because it messes up the tracking of those
> files,
> > but because other jobs running concurrently with your task may also be
> using
> > those files.
> >
> > --Bobby Evans
> >
> >
> > On 9/26/11 4:40 PM, "Meng Mao"  wrote:
> >
> > Let's frame the issue in another way. I'll describe a sequence of Hadoop
> > operations that I think should work, and then I'll get into what we did
> and
> > how it failed.
> >
> > Normal sequence:
> > 1. have files to be cached in HDFS
> > 2. Run Job A, which specifies those files to be put into DistributedCache
> > space
> > 3. job runs fine
> > 4. Run Job A some time later. job runs fine again.
> >
> > Breaking sequence:
> > 1. have files to be cached in HDFS
> > 2. Run Job A, which specifies those files to be put into DistributedCache
> > space
> > 3. job runs fine
> > 4. Manually delete cached files out of local disk on worker nodes
> > 5. Run Job A again, expect it to push out cache copies as needed.
> > 6. job fails because the cache copies didn't get distributed
> >
> > Should this second sequence have broken?
> >
> > On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao  wrote:
> >
> > > Hmm, I must have really missed an important piece somewhere. This is
> from
> > > the MapRed tutorial text:
> > >
> > > "DistributedCache is a facility provided by the Map/Reduce framework to
> > > cache files (text, archives, jars and so on) needed by applications.
> > >
> > > Applications specify the files to be cached via urls (hdfs://) in the
> > > JobConf. The DistributedCache* assumes that the files specified via
> > > hdfs:// urls are already present on the FileSystem.*
> > >
> > > *The framework will copy the necessary files to the slave node before
> any
> > > tasks for the job are executed on that node*. Its efficiency stems from
> > > the fact that the files are only copied once per job and the ability to
> > > cache archives which are un-archived on the slaves."
> > >
> > >
> > > After some close reading, the two bolded pieces seem to be in
> > contradiction
> > > of each other? I'd always that addCacheFile() would perform the 2nd
> > bolded
> > > statement. If that sentence is true, then I still don't have an
> > explanation
> > > of why our job didn't correctly push out new versions of the cache
> files
> > > upon the startup and execution of JobConfiguration. We deleted them
> > before
> > > our job started, not during.
> > >
> > > On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans 
> > wrote:
> > >
> > >> Meng Mao,
> > >>
> > >> The way the distributed cache is currently written, it does not verify
> > the
> > >> integrity of the cache files at all after they are downloaded.  It
> just
> > >> assumes that if they were downloaded once they are still there and in
> > the
> > >> proper shape.  It might be good to file a JIRA to add in some sort of
> > check.
> > >>  Another thing to do is that the distributed cache also includes the
> > time
> > >> stamp of the original file, just incase you delete the file and then
> use
> > a
> > >> different version.  So if you want it to force a download again you
> can
> > copy
> > >> it delete the original and then move it back to what it was before.
> > >>
> > >> --Bobby Evans
> > >>
> > >> On 9/23/11 1:57 AM, "Meng Mao"  wrote:
> > >>
> > >> We use the DistributedCache class to distribute a few lookup files for
> > our
> > >> jobs. We have been aggressively deletin

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
addCacheFile sets a config value in your jobConf that indicates which files 
your particular job depends on.  When the TaskTracker is assigned to run part 
of your job (map task or reduce task), it will download your jobConf, read it 
in, and then download the files listed in the conf, if it has not already 
downloaded them from a previous run.  Then it will set up the directory 
structure for your job, possibly adding in symbolic links to these files in the 
working directory for your task.  After that it will launch your task.
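
As a rough illustration of that flow (an untested sketch; the paths, class
names, and the lookup logic are made up), a 0.20-style job that ships a lookup
file through the distributed cache and reads it via a symlink might look like:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class LookupJob {

  public static class LookupMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private final Set<String> lookup = new HashSet<String>();

    public void configure(JobConf job) {
      try {
        // The TaskTracker has already downloaded the cache file to local disk
        // and symlinked it into the task's working directory under the name
        // given after '#' in the cache URI.
        BufferedReader in = new BufferedReader(new FileReader("lookup"));
        String line;
        while ((line = in.readLine()) != null) {
          lookup.add(line.trim());
        }
        in.close();
      } catch (IOException e) {
        throw new RuntimeException("could not read cached lookup file", e);
      }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      if (lookup.contains(value.toString())) {
        out.collect(value, new Text("matched"));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LookupJob.class);
    conf.setJobName("lookup-join");
    conf.setMapperClass(LookupMapper.class);
    conf.setNumReduceTasks(0);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // Record the cache file (plus a symlink name) in the job conf; the
    // actual download happens on each TaskTracker at task-setup time.
    DistributedCache.addCacheFile(new URI("/data/lookup.txt#lookup"), conf);
    DistributedCache.createSymlink(conf);
    JobClient.runJob(conf);
  }
}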

--Bobby Evans

On 9/27/11 11:17 AM, "Meng Mao"  wrote:

Who is in charge of getting the files there for the first time? The
addCacheFile call in the mapreduce job? Or a manual setup by the
user/operator?

On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans  wrote:

> The problem is the step 4 in the breaking sequence.  Currently the
> TaskTracker never looks at the disk to know if a file is in the distributed
> cache or not.  It assumes that if it downloaded the file and did not delete
> that file itself then the file is still there in its original form.  It does
> not know that you deleted those files, or if wrote to the files, or in any
> way altered those files.  In general you should not be modifying those
> files.  This is not only because it messes up the tracking of those files,
> but because other jobs running concurrently with your task may also be using
> those files.
>
> --Bobby Evans
>
>
> On 9/26/11 4:40 PM, "Meng Mao"  wrote:
>
> Let's frame the issue in another way. I'll describe a sequence of Hadoop
> operations that I think should work, and then I'll get into what we did and
> how it failed.
>
> Normal sequence:
> 1. have files to be cached in HDFS
> 2. Run Job A, which specifies those files to be put into DistributedCache
> space
> 3. job runs fine
> 4. Run Job A some time later. job runs fine again.
>
> Breaking sequence:
> 1. have files to be cached in HDFS
> 2. Run Job A, which specifies those files to be put into DistributedCache
> space
> 3. job runs fine
> 4. Manually delete cached files out of local disk on worker nodes
> 5. Run Job A again, expect it to push out cache copies as needed.
> 6. job fails because the cache copies didn't get distributed
>
> Should this second sequence have broken?
>
> On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao  wrote:
>
> > Hmm, I must have really missed an important piece somewhere. This is from
> > the MapRed tutorial text:
> >
> > "DistributedCache is a facility provided by the Map/Reduce framework to
> > cache files (text, archives, jars and so on) needed by applications.
> >
> > Applications specify the files to be cached via urls (hdfs://) in the
> > JobConf. The DistributedCache* assumes that the files specified via
> > hdfs:// urls are already present on the FileSystem.*
> >
> > *The framework will copy the necessary files to the slave node before any
> > tasks for the job are executed on that node*. Its efficiency stems from
> > the fact that the files are only copied once per job and the ability to
> > cache archives which are un-archived on the slaves."
> >
> >
> > After some close reading, the two bolded pieces seem to be in
> contradiction
> > of each other? I'd always that addCacheFile() would perform the 2nd
> bolded
> > statement. If that sentence is true, then I still don't have an
> explanation
> > of why our job didn't correctly push out new versions of the cache files
> > upon the startup and execution of JobConfiguration. We deleted them
> before
> > our job started, not during.
> >
> > On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans 
> wrote:
> >
> >> Meng Mao,
> >>
> >> The way the distributed cache is currently written, it does not verify
> the
> >> integrity of the cache files at all after they are downloaded.  It just
> >> assumes that if they were downloaded once they are still there and in
> the
> >> proper shape.  It might be good to file a JIRA to add in some sort of
> check.
> >>  Another thing to do is that the distributed cache also includes the
> time
> >> stamp of the original file, just incase you delete the file and then use
> a
> >> different version.  So if you want it to force a download again you can
> copy
> >> it delete the original and then move it back to what it was before.
> >>
> >> --Bobby Evans
> >>
> >> On 9/23/11 1:57 AM, "Meng Mao"  wrote:
> >>
> >> We use the DistributedCache class to distribute a few lookup files for
> our
> >> jobs. We have been aggressively deleting failed task attempts' leftover
> >> data
> >> , and our script accidentally deleted the path to our distributed cache
> >> files.
> >>
> >> Our task attempt leftover data was here [per node]:
> >> /hadoop/hadoop-metadata/cache/mapred/local/
> >> and our distributed cache path was:
> >> hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/
> >> We deleted this path by accident.
> >>
> >> Does this latter path look normal? I'm not that familiar with
> >> DistributedCache but I'm up right now investigating the is

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Meng Mao
Who is in charge of getting the files there for the first time? The
addCacheFile call in the mapreduce job? Or a manual setup by the
user/operator?

On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans  wrote:

> The problem is the step 4 in the breaking sequence.  Currently the
> TaskTracker never looks at the disk to know if a file is in the distributed
> cache or not.  It assumes that if it downloaded the file and did not delete
> that file itself then the file is still there in its original form.  It does
> not know that you deleted those files, or if wrote to the files, or in any
> way altered those files.  In general you should not be modifying those
> files.  This is not only because it messes up the tracking of those files,
> but because other jobs running concurrently with your task may also be using
> those files.
>
> --Bobby Evans
>
>
> On 9/26/11 4:40 PM, "Meng Mao"  wrote:
>
> Let's frame the issue in another way. I'll describe a sequence of Hadoop
> operations that I think should work, and then I'll get into what we did and
> how it failed.
>
> Normal sequence:
> 1. have files to be cached in HDFS
> 2. Run Job A, which specifies those files to be put into DistributedCache
> space
> 3. job runs fine
> 4. Run Job A some time later. job runs fine again.
>
> Breaking sequence:
> 1. have files to be cached in HDFS
> 2. Run Job A, which specifies those files to be put into DistributedCache
> space
> 3. job runs fine
> 4. Manually delete cached files out of local disk on worker nodes
> 5. Run Job A again, expect it to push out cache copies as needed.
> 6. job fails because the cache copies didn't get distributed
>
> Should this second sequence have broken?
>
> On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao  wrote:
>
> > Hmm, I must have really missed an important piece somewhere. This is from
> > the MapRed tutorial text:
> >
> > "DistributedCache is a facility provided by the Map/Reduce framework to
> > cache files (text, archives, jars and so on) needed by applications.
> >
> > Applications specify the files to be cached via urls (hdfs://) in the
> > JobConf. The DistributedCache* assumes that the files specified via
> > hdfs:// urls are already present on the FileSystem.*
> >
> > *The framework will copy the necessary files to the slave node before any
> > tasks for the job are executed on that node*. Its efficiency stems from
> > the fact that the files are only copied once per job and the ability to
> > cache archives which are un-archived on the slaves."
> >
> >
> > After some close reading, the two bolded pieces seem to be in
> contradiction
> > of each other? I'd always that addCacheFile() would perform the 2nd
> bolded
> > statement. If that sentence is true, then I still don't have an
> explanation
> > of why our job didn't correctly push out new versions of the cache files
> > upon the startup and execution of JobConfiguration. We deleted them
> before
> > our job started, not during.
> >
> > On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans 
> wrote:
> >
> >> Meng Mao,
> >>
> >> The way the distributed cache is currently written, it does not verify
> the
> >> integrity of the cache files at all after they are downloaded.  It just
> >> assumes that if they were downloaded once they are still there and in
> the
> >> proper shape.  It might be good to file a JIRA to add in some sort of
> check.
> >>  Another thing to do is that the distributed cache also includes the
> time
> >> stamp of the original file, just incase you delete the file and then use
> a
> >> different version.  So if you want it to force a download again you can
> copy
> >> it delete the original and then move it back to what it was before.
> >>
> >> --Bobby Evans
> >>
> >> On 9/23/11 1:57 AM, "Meng Mao"  wrote:
> >>
> >> We use the DistributedCache class to distribute a few lookup files for
> our
> >> jobs. We have been aggressively deleting failed task attempts' leftover
> >> data
> >> , and our script accidentally deleted the path to our distributed cache
> >> files.
> >>
> >> Our task attempt leftover data was here [per node]:
> >> /hadoop/hadoop-metadata/cache/mapred/local/
> >> and our distributed cache path was:
> >> hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/
> >> We deleted this path by accident.
> >>
> >> Does this latter path look normal? I'm not that familiar with
> >> DistributedCache but I'm up right now investigating the issue so I
> thought
> >> I'd ask.
> >>
> >> After that deletion, the first 2 jobs to run (which are use the
> >> addCacheFile
> >> method to distribute their files) didn't seem to push the files out to
> the
> >> cache path, except on one node. Is this expected behavior? Shouldn't
> >> addCacheFile check to see if the files are missing, and if so,
> repopulate
> >> them as needed?
> >>
> >> I'm trying to get a handle on whether it's safe to delete the
> distributed
> >> cache path when the grid is quiet and no jobs are running. That is, if
> >> addCacheFile is designed to be robust agai

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
The problem is step 4 in the breaking sequence.  Currently the TaskTracker
never looks at the disk to know if a file is in the distributed cache or not.
It assumes that if it downloaded the file and did not delete that file itself,
then the file is still there in its original form.  It does not know that you
deleted those files, or if you wrote to the files, or in any way altered those
files.  In general you should not be modifying those files.  This is not only
because it messes up the tracking of those files, but because other jobs
running concurrently with your task may also be using those files.
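
If you want to guard against that case from the job side, the task itself can
sanity-check its local copies before using them.  A rough, untested sketch (a
hypothetical helper, not something the framework provides), called from a
task's configure():

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class CacheSanityCheck {
  // Throws if any local cache file recorded in the conf is missing on disk.
  public static void verifyLocalCacheFiles(Configuration conf)
      throws IOException {
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    if (localFiles == null) {
      return;  // nothing was cached for this job
    }
    for (Path p : localFiles) {
      if (!new File(p.toString()).exists()) {
        throw new IOException("Distributed cache file missing on local disk: " + p);
      }
    }
  }
}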

--Bobby Evans


On 9/26/11 4:40 PM, "Meng Mao"  wrote:

Let's frame the issue in another way. I'll describe a sequence of Hadoop
operations that I think should work, and then I'll get into what we did and
how it failed.

Normal sequence:
1. have files to be cached in HDFS
2. Run Job A, which specifies those files to be put into DistributedCache
space
3. job runs fine
4. Run Job A some time later. job runs fine again.

Breaking sequence:
1. have files to be cached in HDFS
2. Run Job A, which specifies those files to be put into DistributedCache
space
3. job runs fine
4. Manually delete cached files out of local disk on worker nodes
5. Run Job A again, expect it to push out cache copies as needed.
6. job fails because the cache copies didn't get distributed

Should this second sequence have broken?

On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao  wrote:

> Hmm, I must have really missed an important piece somewhere. This is from
> the MapRed tutorial text:
>
> "DistributedCache is a facility provided by the Map/Reduce framework to
> cache files (text, archives, jars and so on) needed by applications.
>
> Applications specify the files to be cached via urls (hdfs://) in the
> JobConf. The DistributedCache* assumes that the files specified via
> hdfs:// urls are already present on the FileSystem.*
>
> *The framework will copy the necessary files to the slave node before any
> tasks for the job are executed on that node*. Its efficiency stems from
> the fact that the files are only copied once per job and the ability to
> cache archives which are un-archived on the slaves."
>
>
> After some close reading, the two bolded pieces seem to be in contradiction
> of each other? I'd always that addCacheFile() would perform the 2nd bolded
> statement. If that sentence is true, then I still don't have an explanation
> of why our job didn't correctly push out new versions of the cache files
> upon the startup and execution of JobConfiguration. We deleted them before
> our job started, not during.
>
> On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans  wrote:
>
>> Meng Mao,
>>
>> The way the distributed cache is currently written, it does not verify the
>> integrity of the cache files at all after they are downloaded.  It just
>> assumes that if they were downloaded once they are still there and in the
>> proper shape.  It might be good to file a JIRA to add in some sort of check.
>>  Another thing to do is that the distributed cache also includes the time
>> stamp of the original file, just incase you delete the file and then use a
>> different version.  So if you want it to force a download again you can copy
>> it delete the original and then move it back to what it was before.
>>
>> --Bobby Evans
>>
>> On 9/23/11 1:57 AM, "Meng Mao"  wrote:
>>
>> We use the DistributedCache class to distribute a few lookup files for our
>> jobs. We have been aggressively deleting failed task attempts' leftover
>> data
>> , and our script accidentally deleted the path to our distributed cache
>> files.
>>
>> Our task attempt leftover data was here [per node]:
>> /hadoop/hadoop-metadata/cache/mapred/local/
>> and our distributed cache path was:
>> hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/
>> We deleted this path by accident.
>>
>> Does this latter path look normal? I'm not that familiar with
>> DistributedCache but I'm up right now investigating the issue so I thought
>> I'd ask.
>>
>> After that deletion, the first 2 jobs to run (which are use the
>> addCacheFile
>> method to distribute their files) didn't seem to push the files out to the
>> cache path, except on one node. Is this expected behavior? Shouldn't
>> addCacheFile check to see if the files are missing, and if so, repopulate
>> them as needed?
>>
>> I'm trying to get a handle on whether it's safe to delete the distributed
>> cache path when the grid is quiet and no jobs are running. That is, if
>> addCacheFile is designed to be robust against the files it's caching not
>> being at each job start.
>>
>>
>



Re: operation of DistributedCache following manual deletion of cached files?

2011-09-26 Thread Meng Mao
Let's frame the issue in another way. I'll describe a sequence of Hadoop
operations that I think should work, and then I'll get into what we did and
how it failed.

Normal sequence:
1. have files to be cached in HDFS
2. Run Job A, which specifies those files to be put into DistributedCache
space
3. job runs fine
4. Run Job A some time later. job runs fine again.

Breaking sequence:
1. have files to be cached in HDFS
2. Run Job A, which specifies those files to be put into DistributedCache
space
3. job runs fine
4. Manually delete cached files out of local disk on worker nodes
5. Run Job A again, expect it to push out cache copies as needed.
6. job fails because the cache copies didn't get distributed

Should this second sequence have broken?

On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao  wrote:

> Hmm, I must have really missed an important piece somewhere. This is from
> the MapRed tutorial text:
>
> "DistributedCache is a facility provided by the Map/Reduce framework to
> cache files (text, archives, jars and so on) needed by applications.
>
> Applications specify the files to be cached via urls (hdfs://) in the
> JobConf. The DistributedCache* assumes that the files specified via
> hdfs:// urls are already present on the FileSystem.*
>
> *The framework will copy the necessary files to the slave node before any
> tasks for the job are executed on that node*. Its efficiency stems from
> the fact that the files are only copied once per job and the ability to
> cache archives which are un-archived on the slaves."
>
>
> After some close reading, the two bolded pieces seem to be in contradiction
> of each other? I'd always that addCacheFile() would perform the 2nd bolded
> statement. If that sentence is true, then I still don't have an explanation
> of why our job didn't correctly push out new versions of the cache files
> upon the startup and execution of JobConfiguration. We deleted them before
> our job started, not during.
>
> On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans  wrote:
>
>> Meng Mao,
>>
>> The way the distributed cache is currently written, it does not verify the
>> integrity of the cache files at all after they are downloaded.  It just
>> assumes that if they were downloaded once they are still there and in the
>> proper shape.  It might be good to file a JIRA to add in some sort of check.
>>  Another thing to do is that the distributed cache also includes the time
>> stamp of the original file, just incase you delete the file and then use a
>> different version.  So if you want it to force a download again you can copy
>> it delete the original and then move it back to what it was before.
>>
>> --Bobby Evans
>>
>> On 9/23/11 1:57 AM, "Meng Mao"  wrote:
>>
>> We use the DistributedCache class to distribute a few lookup files for our
>> jobs. We have been aggressively deleting failed task attempts' leftover
>> data
>> , and our script accidentally deleted the path to our distributed cache
>> files.
>>
>> Our task attempt leftover data was here [per node]:
>> /hadoop/hadoop-metadata/cache/mapred/local/
>> and our distributed cache path was:
>> hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/
>> We deleted this path by accident.
>>
>> Does this latter path look normal? I'm not that familiar with
>> DistributedCache but I'm up right now investigating the issue so I thought
>> I'd ask.
>>
>> After that deletion, the first 2 jobs to run (which are use the
>> addCacheFile
>> method to distribute their files) didn't seem to push the files out to the
>> cache path, except on one node. Is this expected behavior? Shouldn't
>> addCacheFile check to see if the files are missing, and if so, repopulate
>> them as needed?
>>
>> I'm trying to get a handle on whether it's safe to delete the distributed
>> cache path when the grid is quiet and no jobs are running. That is, if
>> addCacheFile is designed to be robust against the files it's caching not
>> being at each job start.
>>
>>
>


Re: operation of DistributedCache following manual deletion of cached files?

2011-09-23 Thread Meng Mao
Hmm, I must have really missed an important piece somewhere. This is from
the MapRed tutorial text:

"DistributedCache is a facility provided by the Map/Reduce framework to
cache files (text, archives, jars and so on) needed by applications.

Applications specify the files to be cached via urls (hdfs://) in the
JobConf. The DistributedCache* assumes that the files specified via hdfs://
urls are already present on the FileSystem.*

*The framework will copy the necessary files to the slave node before any
tasks for the job are executed on that node*. Its efficiency stems from the
fact that the files are only copied once per job and the ability to cache
archives which are un-archived on the slaves."


After some close reading, the two bolded pieces seem to contradict each other?
I'd always assumed that addCacheFile() would perform the 2nd bolded statement.
If that sentence is true, then I still don't have an explanation of why our job
didn't correctly push out new versions of the cache files upon the startup and
execution of JobConfiguration. We deleted them before our job started, not
during.

On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans  wrote:

> Meng Mao,
>
> The way the distributed cache is currently written, it does not verify the
> integrity of the cache files at all after they are downloaded.  It just
> assumes that if they were downloaded once they are still there and in the
> proper shape.  It might be good to file a JIRA to add in some sort of check.
>  Another thing to do is that the distributed cache also includes the time
> stamp of the original file, just incase you delete the file and then use a
> different version.  So if you want it to force a download again you can copy
> it delete the original and then move it back to what it was before.
>
> --Bobby Evans
>
> On 9/23/11 1:57 AM, "Meng Mao"  wrote:
>
> We use the DistributedCache class to distribute a few lookup files for our
> jobs. We have been aggressively deleting failed task attempts' leftover
> data
> , and our script accidentally deleted the path to our distributed cache
> files.
>
> Our task attempt leftover data was here [per node]:
> /hadoop/hadoop-metadata/cache/mapred/local/
> and our distributed cache path was:
> hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/
> We deleted this path by accident.
>
> Does this latter path look normal? I'm not that familiar with
> DistributedCache but I'm up right now investigating the issue so I thought
> I'd ask.
>
> After that deletion, the first 2 jobs to run (which are use the
> addCacheFile
> method to distribute their files) didn't seem to push the files out to the
> cache path, except on one node. Is this expected behavior? Shouldn't
> addCacheFile check to see if the files are missing, and if so, repopulate
> them as needed?
>
> I'm trying to get a handle on whether it's safe to delete the distributed
> cache path when the grid is quiet and no jobs are running. That is, if
> addCacheFile is designed to be robust against the files it's caching not
> being at each job start.
>
>


Re: operation of DistributedCache following manual deletion of cached files?

2011-09-23 Thread Robert Evans
Meng Mao,

The way the distributed cache is currently written, it does not verify the
integrity of the cache files at all after they are downloaded.  It just assumes
that if they were downloaded once, they are still there and in the proper
shape.  It might be good to file a JIRA to add in some sort of check.  Another
thing to note is that the distributed cache also includes the time stamp of the
original file, in case you delete the file and then use a different version.
So if you want to force a download again, you can copy the file, delete the
original, and then move the copy back to where it was before.

--Bobby Evans

On 9/23/11 1:57 AM, "Meng Mao"  wrote:

We use the DistributedCache class to distribute a few lookup files for our
jobs. We have been aggressively deleting failed task attempts' leftover data,
and our script accidentally deleted the path to our distributed cache
files.

Our task attempt leftover data was here [per node]:
/hadoop/hadoop-metadata/cache/mapred/local/
and our distributed cache path was:
hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/
We deleted this path by accident.

Does this latter path look normal? I'm not that familiar with
DistributedCache but I'm up right now investigating the issue so I thought
I'd ask.

After that deletion, the first 2 jobs to run (which use the addCacheFile
method to distribute their files) didn't seem to push the files out to the
cache path, except on one node. Is this expected behavior? Shouldn't
addCacheFile check to see if the files are missing, and if so, repopulate
them as needed?

I'm trying to get a handle on whether it's safe to delete the distributed
cache path when the grid is quiet and no jobs are running. That is, if
addCacheFile is designed to be robust against the files it's caching not
being present at each job start.