If we follow that rolling-restart procedure, will we get a completely quiet
migration? That is, if no jobs are running during the rolling restart of the
TaskTrackers, will we end up with expanded capacity and no risk of data
inconsistency in the cache paths?

Our DataNodes already use multiple disks for storage. It was an early lack
of foresight that brought us to the present day, where mapred.local.dir
isn't similarly "distributed."

That said, one of our problems is that the SOLR index files we're building
are just plain huge. Even with expanded disk capacity, I think we'd still
run into disk space issues. Is this something that's been generally reported
for SOLR Hadoop jobs?

On Mon, Oct 10, 2011 at 10:08 PM, Harsh J <ha...@cloudera.com> wrote:

> Meng,
>
> Yes, configure the mapred-site.xml (mapred.local.dir) to add the
> property and roll-restart your TaskTrackers. If you'd like to expand
> your DataNode to multiple disks as well (helps HDFS I/O greatly), do
> the same with hdfs-site.xml (dfs.data.dir) and perform the same
> rolling restart of DataNodes.
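>
> For example, in mapred-site.xml on each TaskTracker (a sketch only --
> the /data/N mount points are placeholders for your actual disks):
>
>  <property>
>     <name>mapred.local.dir</name>
>     <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local</value>
>  </property>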
>
> Ensure that for each service, the directories you create are owned by
> the same user as the one running the process. This will help avoid
> permission nightmares.
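>
> For instance (a sketch -- the mapred:hadoop and hdfs:hadoop owners are
> assumptions; match whatever users actually run your daemons):
>
>  chown -R mapred:hadoop /data/1/mapred/local /data/2/mapred/local
>  chown -R hdfs:hadoop /data/1/dfs/dn /data/2/dfs/dn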
>
> On Tue, Oct 11, 2011 at 3:58 AM, Meng Mao <meng...@gmail.com> wrote:
> > So the only way we can expand to multiple mapred.local.dir paths is to
> > configure our site.xml and restart the DataNode?
> >
> > On Mon, Oct 10, 2011 at 9:36 AM, Marcos Luis Ortiz Valmaseda <
> > marcosluis2...@googlemail.com> wrote:
> >
> >> 2011/10/9 Harsh J <ha...@cloudera.com>
> >>
> >> > Hello Meng,
> >> >
> >> > On Wed, Oct 5, 2011 at 11:02 AM, Meng Mao <meng...@gmail.com> wrote:
> >> > > Currently, we've got defined:
> >> > >  <property>
> >> > >     <name>hadoop.tmp.dir</name>
> >> > >     <value>/hadoop/hadoop-metadata/cache/</value>
> >> > >  </property>
> >> > >
> >> > > In our experiments with SOLR, the intermediate files are so large
> >> > > that they tend to blow out disk space and fail (and annoyingly
> >> > > leave behind their huge failed attempts). We've had issues with it
> >> > > in the past, but we're having real problems with SOLR if we can't
> >> > > comfortably get more space out of hadoop.tmp.dir somehow.
> >> > >
> >> > > 1) It seems we never set *mapred.system.dir* to anything special,
> >> > > so it's defaulting to ${hadoop.tmp.dir}/mapred/system. Is this a
> >> > > problem? The docs seem to recommend against that when
> >> > > hadoop.tmp.dir has ${user.name} in it, which ours doesn't.
> >> >
> >> > The {mapred.system.dir} is an HDFS location, so you shouldn't
> >> > really need to worry about it as much.
> >> >
> >> > > 1b) The doc says mapred.system.dir is "the in-HDFS path to shared
> >> > > MapReduce system files." To me, that means there must be one
> >> > > single path for mapred.system.dir, which sort of forces
> >> > > hadoop.tmp.dir to be one path. Otherwise, one might imagine that
> >> > > you could specify multiple paths to store hadoop.tmp.dir, like you
> >> > > can for dfs.data.dir. Is this a correct interpretation? --
> >> > > hadoop.tmp.dir could live on multiple paths/disks if there were
> >> > > more mapping/lookup between mapred.system.dir and hadoop.tmp.dir?
> >> >
> >> > {hadoop.tmp.dir} is indeed reused for {mapred.system.dir} --
> >> > confusingly, since the latter lives on HDFS -- but yes, there should
> >> > be just one mapred.system.dir.
> >> >
> >> > Also, the config {hadoop.tmp.dir} doesn't support more than one
> >> > path. What you need here is a proper {mapred.local.dir}
> >> > configuration.
> >> >
> >> > > 2) IIRC, there's a -D switch for supplying config name/value
> >> > > pairs to individual jobs. Does such a switch exist? Googling for
> >> > > single letters is fruitless. If we had a path on our workers with
> >> > > more space (in our case, another hard disk), could we simply pass
> >> > > that path in as hadoop.tmp.dir for our SOLR jobs, without
> >> > > incurring any consistency issues on future jobs that might use the
> >> > > SOLR output on HDFS?
> >> >
> >> > Only a few parameters of a job are user-configurable. Settings like
> >> > hadoop.tmp.dir and mapred.local.dir are not overridable by user-set
> >> > parameters, as they are static, server-side configurations.
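> >> >
> >> > (The -D switch itself does exist, via Hadoop's GenericOptionsParser,
> >> > for properties that are job-level -- a sketch, with a hypothetical
> >> > jar and driver class, assuming the driver goes through ToolRunner:
> >> >
> >> >  hadoop jar myjob.jar com.example.SolrIndexer \
> >> >      -D mapred.reduce.tasks=20 in out
> >> >
> >> > -- but server-side settings like mapred.local.dir simply won't take
> >> > effect when passed this way.)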
> >> >
> >> > > Given that the default value is ${hadoop.tmp.dir}/mapred/local,
> >> > > would the expanded capacity we're looking for be accomplished as
> >> > > easily as by defining mapred.local.dir to span multiple disks?
> >> > > Setting aside the issue of temp files so big that they could still
> >> > > fill a whole disk.
> >> >
> >> > 1. You can set mapred.local.dir independently of hadoop.tmp.dir.
> >> > 2. mapred.local.dir can have comma-separated values in it, spanning
> >> > multiple disks.
> >> > 3. Intermediate outputs may spread across these disks but shall not
> >> > consume more than one disk at a time. So if your largest configured
> >> > disk is 500 GB while the total set of them may be 2 TB, then your
> >> > intermediate output size can't really exceed 500 GB, because only
> >> > one disk is consumed by one task -- the multiple disks are for
> >> > better I/O parallelism between tasks.
> >> >
> >> > Know that hadoop.tmp.dir is a convenience property, for quickly
> >> > starting up dev clusters and such. For a proper configuration, you
> >> > need to remove the dependency on it (almost nothing uses
> >> > hadoop.tmp.dir on the server side once the right properties are
> >> > configured -- e.g. dfs.data.dir, dfs.name.dir, fs.checkpoint.dir,
> >> > mapred.local.dir, etc.)
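> >> >
> >> > For instance, a minimal hdfs-site.xml sketch (the paths are
> >> > placeholders for your actual mount points):
> >> >
> >> >  <property>
> >> >     <name>dfs.name.dir</name>
> >> >     <value>/data/1/dfs/nn,/data/2/dfs/nn</value>
> >> >  </property>
> >> >  <property>
> >> >     <name>dfs.data.dir</name>
> >> >     <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value>
> >> >  </property>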
> >> >
> >> > --
> >> > Harsh J
> >> >
> >>
> >> Here is an excellent explanation of how to install Apache Hadoop
> >> manually; Lars explains it very well:
> >>
> >> http://blog.lars-francke.de/2011/01/26/setting-up-a-hadoop-cluster-part-1-manual-installation/
> >>
> >> Regards
> >>
> >> --
> >> Marcos Luis Ortíz Valmaseda
> >>  Linux Infrastructure Engineer
> >>  Linux User # 418229
> >>  http://marcosluis2186.posterous.com
> >>  http://www.linkedin.com/in/marcosluis2186
> >>  Twitter: @marcosluis2186
> >>
> >
>
>
>
> --
> Harsh J
>
