Re: [RFR] templates://hadoop/{hadoop-namenoded.templates}

2010-03-31 Thread Christian PERRIER
Quoting Justin B Rye (j...@edlug.org.uk):
 Justin B Rye wrote:
  comments but no actual patch attached.
 
 Second thoughts and patch.


As usual, all suggestions adopted..:)






Re: [RFR] templates://hadoop/{hadoop-namenoded.templates}

2010-03-30 Thread Thomas Koch
Thank you Christian for the thorough review. It feels very good to get the 
help of older DDs.
I'm very comfortable with your improvements and would like to apply them 
without further changes.

Best regards, Thomas


Christian PERRIER:
 Please find, for review, the debconf templates and packages descriptions
  for the hadoop source package.
 
 This review will last from Tuesday, March 30, 2010 to Friday, April 09,
  2010.
 
 Please send reviews as unified diffs (diff -u) against the original
 files. Comments about your proposed changes will be appreciated.
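
 (For example, against the original templates file -- the output file name
 here is purely illustrative:)

   diff -u hadoop-namenoded.templates.orig hadoop-namenoded.templates > hadoop.templates.diff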
 
 Your review should be sent as an answer to this mail.
 
 When appropriate, I will send intermediate requests for review, with
 [RFRn] (n>=2) as a subject tag.
 
 When we reach a consensus, I will send a "Last Chance For
 Comments" mail with [LCFC] as a subject tag.
 
 Finally, the reviewed templates will be sent to the package maintainer
 as a bug report, and a mail will be sent to this list with [BTS] as
 a subject tag.
 
 Rationale:
 --- hadoop.old/debian/hadoop-namenoded.templates  2010-03-22 09:56:11.717948376 +0100
 +++ hadoop/debian/hadoop-namenoded.templates  2010-03-30 07:22:12.123757400 +0200
 @@ -1,17 +1,17 @@
  Template: hadoop-namenoded/format
  Type: boolean
  Default: false
 -_Description: Should the namenode's filesystem be formatted now?
 +_Description: Should the namenode's file system be formatted?
 
 In other packages, we standardized on "file system". Applying this
 throughout this review.
 
   The namenode manages the Hadoop Distributed FileSystem (HDFS). Like a
 - normal filesystem, it needs to be formatted prior to first use. If the
 - HDFS filesystem is not formatted, the namenode daemon will fail to
 + normal file system, it needs to be formatted prior to first use. If the
 + HDFS file system is not formatted, the namenode daemon will fail to
   start.
   .
 - This operation does not affect the normal filesystem on this
 - computer. If you're using HDFS for the first time and don't have data
 - from previous installations on this computer, it should be save to
 - proceed with yes.
 + This operation does not affect other file systems on this
 + computer. You can safely choose to format the file system if you're
 + using HDFS for the first time and don't have data from previous
 + installations on this computer.
 
 I guess that the main point is to warn users that all other FS are
 not at risk here. So, let's mention this slightly differently (they're
 not more normal than anything else...and there might be more than
 one file system on the system, of course).
 
 "Proceed with yes" is highly discouraged as it makes reference to the
 way the question is shown in *some* debconf interfaces (a yes/no
 question) and, anyway, it's always tricky for translators to know
 whether they should translate the "yes" or not (the answer being "it
 depends"..:-)).
 
   .
 - You can later on format the filesystem yourself with
 - .
 - su -c"hadoop namenode -format" hadoop
 + If you choose not to format the file system right now, you can do it
 + later by executing "hadoop namenode -format" with the hadoop user
 + privileges.
 
 Don't waste space by splitting this into two paragraphs. That will anyway
 look ugly in some interfaces. I rephrased the paragraph so that it
 doesn't depend on using su or not (which is not the point, as the
 point is executing the command as "hadoop").
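
 (For illustration, a minimal sketch of what that amounts to in practice,
 assuming the daemon packages create a "hadoop" system user as discussed
 further down in this thread:)

   # run the format step with the hadoop user's privileges
   su -c"hadoop namenode -format" hadoop
   # or, equivalently, on systems with sudo
   sudo -u hadoop hadoop namenode -format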
 
 
 --- hadoop.old/debian/control 2010-03-22 09:56:11.717948376 +0100
 +++ hadoop/debian/control 2010-03-26 18:30:25.615052315 +0100
 @@ -44,14 +44,54 @@
   libslf4j-java,
   libxmlenc-java
  Suggests: libhsqldb-java
 -Description: software platform for processing vast amounts of data
 - This package contains the core java libraries.
 +Description: platform for processing vast amounts of data - Java libraries
 
 I standardized all binary package descriptions as
 "general desc - specific desc".

 The general desc drops "software". After all, this is all about software
 anyway? :-)

 Proper(?) capitalization of "Java".
 
 + Hadoop is a software platform that lets one easily write and
 + run applications that process vast amounts of data.
 + .
 + Here's what makes Hadoop especially useful:
 +  * Scalable: Hadoop can reliably store and process petabytes.
 +  * Economical: It distributes the data and processing across clusters
 +of commonly available computers. These clusters can number
 +into the thousands of nodes.
 +  * Efficient: By distributing the data, Hadoop can process it in parallel
 +   on the nodes where the data is located. This makes it
 +   extremely rapid.
 +  * Reliable: Hadoop automatically maintains multiple copies of data and
 +  automatically redeploys computing tasks based on failures.
 + .
 + Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS).
 + MapReduce divides applications into many small blocks of work. HDFS creates
 + multiple replicas of data blocks for reliability, 

Re: [RFR] templates://hadoop/{hadoop-namenoded.templates}

2010-03-30 Thread Justin B Rye
Christian PERRIER wrote:
 Your review should be sent as an answer to this mail.

Sorry, I'm running late, comments but no actual patch attached.

  Template: hadoop-namenoded/format
[...]
 +_Description: Should the namenode's file system be formatted?
   The namenode manages the Hadoop Distributed FileSystem (HDFS). Like a
 + normal file system, it needs to be formatted prior to first use. If the
 + HDFS file system is not formatted, the namenode daemon will fail to
   start.

s/FileSystem/File System/.  We could save some verbiage here:

The namenode manages the Hadoop Distributed File System (HDFS). Like a
normal file system, it needs to be formatted before use; otherwise
the namenode daemon will not start.

   .
 + This operation does not affect other file systems on this
 + computer. You can safely choose to format the file system if you're
 + using HDFS for the first time and don't have data from previous
 + installations on this computer.
   .
 + If you choose not to format the file system right now, you can do it
 + later by executing hadoop namenode -format with the hadoop user
 + privileges.

I want to change that last phrase, but I'm not sure what to.  Maybe:

later by executing "hadoop namenode -format" as the user "hadoop".
 
 --- hadoop.old/debian/control 2010-03-22 09:56:11.717948376 +0100
 +++ hadoop/debian/control 2010-03-26 18:30:25.615052315 +0100
 @@ -44,14 +44,54 @@
   libslf4j-java,
   libxmlenc-java
  Suggests: libhsqldb-java
 -Description: software platform for processing vast amounts of data
 - This package contains the core java libraries.
 +Description: platform for processing vast amounts of data - Java libraries

This doesn't strike me as conveying what Hadoop is; after all, you can
process vast amounts of data on any machine as long as you're allowed to
take vast amounts of time.  Hadoop's suite description should have the
words "cluster" or "distributed" or "parallel" in it somewhere.
Unfortunately there isn't much room, but how about:

   Description: data-intensive clustering framework - Java libraries
 
 + Hadoop is a software platform that lets one easily write and
 + run applications that process vast amounts of data.

The pronoun "one" is just that bit too formal:

Hadoop is a software platform for writing and running applications
that process vast amounts of data.

And it might make sense to insert: "on a distributed file system".

 + .
 + Here's what makes Hadoop especially useful:
 +  * Scalable: Hadoop can reliably store and process petabytes.
 +  * Economical: It distributes the data and processing across clusters
 +of commonly available computers. These clusters can number
 +into the thousands of nodes.
 +  * Efficient: By distributing the data, Hadoop can process it in parallel
 +   on the nodes where the data is located. This makes it
 +   extremely rapid.
 +  * Reliable: Hadoop automatically maintains multiple copies of data and
 +  automatically redeploys computing tasks based on failures.

I'm not sure I like this layout, but it's all good material.

 + .
 + Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS).
 + MapReduce divides applications into many small blocks of work. HDFS creates
 + multiple replicas of data blocks for reliability, placing them on compute
 + nodes around the cluster. MapReduce can then process the data where it is
 + located.
 + .
 + This package contains the core Java libraries.

I'm not sure they should all carry all three boilerplate paragraphs; maybe
since hadoop-bin is a common dependency it makes sense for it to carry the
long version.

  Package: libhadoop-index-java
  Architecture: all
  Depends: ${misc:Depends}, libhadoop-java (= ${binary:Version}),
   liblucene2-java
 -Description: Hadoop contrib to create lucene indexes
 +Description: platform for processing vast amounts of data - create Lucene indexes
 
 The original synopsis was quite odd (a verb sentence). Keep the "create
 foo" style, but I'd actually maybe prefer "Lucene index creation".

I think it was claiming to be (a) "contrib", meaning a third-party Java
library, but we want to keep that misleading keyword out of the way;
"create Lucene indexes" *is* a verb-based, um, phrasal constituent of some
sort.  I'd suggest:
   Description: data-intensive clustering framework - Lucene index support

 + Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS).
 + MapReduce divides applications into many small blocks of work. HDFS creates
 + multiple replicas of data blocks for reliability, placing them on compute
 + nodes around the cluster. MapReduce can then process the data where it is
 + located.
 + .
   This contrib package provides a utility to build or update an index
   using Map/Reduce.

This replaces what was originally a package-specific discussion of
MapReduce (and Lucene and shards) with something generic.

 Package: hadoop-bin
[...]
 

Re: [RFR] templates://hadoop/{hadoop-namenoded.templates}

2010-03-30 Thread Justin B Rye
Justin B Rye wrote:
 comments but no actual patch attached.

Second thoughts and patch.
 
  Template: hadoop-namenoded/format
 [...]
The namenode manages the Hadoop Distributed File System (HDFS). Like a
normal file system, it needs to be formatted before use; otherwise
the namenode daemon will not start.

For consistency with the package descriptions maybe it should be
 jobtrackerd  = the Job Tracker (daemon)
 tasktrackerd = the Task Tracker (daemon)
 namenoded    = the Name Node (daemon)
so:

 _Description: Should namenoded's file system be formatted?
  The Name Node daemon manages the Hadoop Distributed File System (HDFS).
  Like a normal file system, it needs to be formatted prior to first use.
  If the HDFS file system is not formatted, the Name Node will fail to
  start.

(Or for extra terseness we could push the Description in the
direction of "Format HDFS"?)

The package description boilerplate:
 I'm not sure they should all carry all three boilerplate paragraphs; maybe
 since hadoop-bin is a common dependency it makes sense for it to carry the
 long version.

I was going to take out the third paragraph, but then I noticed it
wasn't in most of the package descriptions anyway.

  Package: libhadoop-index-java
Description: data-intensive clustering framework - Lucene index support
[...] 
 + Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS).
 + MapReduce divides applications into many small blocks of work. HDFS creates
 + multiple replicas of data blocks for reliability, placing them on compute
 + nodes around the cluster. MapReduce can then process the data where it is
 + located.
   .
   This contrib package provides a utility to build or update an index
   using Map/Reduce.
   .
   A distributed index is partitioned into shards. Each shard corresponds
   to a Lucene instance. org.apache.hadoop.contrib.index.main.UpdateIndex
   contains the main() method which uses a Map/Reduce job to analyze documents
   and update Lucene instances in parallel.

Drop the first explanation of MapReduce and the misleading phrase
"contrib package"; merge the remainder into one package-specific
paragraph, and make it clearer exactly what it's describing.

 The org.apache.hadoop.contrib.index.main.UpdateIndex library provides
 support for managing an index using MapReduce. A distributed index is
 partitioned into shards, each corresponding to a Lucene instance.
 This library's main() method uses a MapReduce job to analyze documents
 and update Lucene instances in parallel.

(Is it canonically MapReduce, not Map/Reduce?)
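
(For what that would look like in use -- an illustrative invocation only;
the jar path is a placeholder and the class's own arguments are omitted,
since I haven't checked them:)

  # run the UpdateIndex main() through the hadoop wrapper script
  hadoop jar /usr/share/java/hadoop-index.jar \
      org.apache.hadoop.contrib.index.main.UpdateIndex ...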

 Package: hadoop-bin

Hang on, I hadn't noticed this contains /usr/bin/hadoop (which is an
executable, but not a binary one; it's a shell wrapper script).  So:

 This package provides the hadoop command line interface. See the hadoop-.*d
 packages for the Hadoop daemons.
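
(As an aside, a couple of example invocations of that interface -- these
should hold for the 0.20-era wrapper script, but treat them as a sketch
rather than documentation:)

  hadoop version                  # report the Hadoop version in use
  hadoop fs -ls /                 # list the root of the configured file system
  hadoop fs -put notes.txt /tmp   # copy a local file into HDFS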

 Package: hadoop-daemons-common
 [...]
 + This package prepares some common things for all hadoop daemon packages:
* creates the user hadoop
* creates data and log directories owned by the hadoop user
* manages the update-alternatives mechanism for hadoop configuration

Rephrase as:
 This package provides infrastructure for the Hadoop daemon packages,
 creating the hadoop user (with data and log directories) and maintaining
 the update-alternatives mechanism for Hadoop configuration.
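
(For concreteness, a minimal postinst-style sketch of that infrastructure;
the user name, directories, and the "hadoop-conf" alternative name are
assumptions for illustration, not taken from the actual package:)

  #!/bin/sh
  set -e
  # create a dedicated system user and group for the daemons
  adduser --system --group --home /var/lib/hadoop --no-create-home hadoop
  # data and log directories owned by that user
  install -d -o hadoop -g hadoop /var/lib/hadoop /var/log/hadoop
  # register a default configuration directory via update-alternatives
  update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.empty 10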

-- 
JBR with qualifications in linguistics, experience as a Debian
sysadmin, and probably no clue about this particular package
--- ../hadoop-0.20.2+dfsg1.pristine/debian/hadoop-namenoded.templates  2010-03-05 17:54:15.0 +
+++ debian/hadoop-namenoded.templates   2010-03-30 17:44:29.0 +0100
@@ -1,17 +1,16 @@
 Template: hadoop-namenoded/format
 Type: boolean
 Default: false
-_Description: Should the namenode's filesystem be formatted now?
- The namenode manages the Hadoop Distributed FileSystem (HDFS). Like a
- normal filesystem, it needs to be formatted prior to first use. If the
- HDFS filesystem is not formatted, the namenode daemon will fail to
+_Description: Should namenoded's file system be formatted?
+ The Name Node daemon manages the Hadoop Distributed File System (HDFS).
+ Like a normal file system, it needs to be formatted prior to first use.
+ If the HDFS file system is not formatted, the Name Node will fail to
  start.
  .
- This operation does not affect the normal filesystem on this
- computer. If you're using HDFS for the first time and don't have data
- from previous installations on this computer, it should be save to
- proceed with yes.
+ This operation does not affect other file systems on this
+ computer. You can safely choose to format the file system if you're
+ using HDFS for the first time and don't have data from previous
+ installations on this computer.
  .
- You can later on format the filesystem yourself with
- . 
- su -c"hadoop namenode -format" hadoop
+ If you choose not to format the file system 

Re: [RFR] templates://hadoop/{hadoop-namenoded.templates}

2010-03-30 Thread Christian PERRIER
Quoting Justin B Rye (j...@edlug.org.uk):
 Christian PERRIER wrote:
  Your review should be sent as an answer to this mail.
 
 Sorry, I'm running late, comments but no actual patch attached.


No problem. Based on your comments and proposals, I cooked the
attached review.


Source: hadoop
Section: java
Priority: optional
Maintainer: Debian Java Maintainers <pkg-java-maintainers@lists.alioth.debian.org>
Uploaders: Thomas Koch <thomas.k...@ymc.ch>
Homepage: http://hadoop.apache.org
Vcs-Browser: http://git.debian.org/?p=pkg-java/hadoop.git
Vcs-Git: git://git.debian.org/pkg-java/hadoop.git
Standards-Version: 3.8.4
Build-Depends: debhelper (>= 7.4.11), default-jdk, ant (>= 1.6.0), javahelper (>= 0.28),
 po-debconf,
 libcommons-cli-java,
 libcommons-codec-java,
 libcommons-el-java,
 libcommons-httpclient-java,
 libcommons-io-java,
 libcommons-logging-java,
 libcommons-net-java,
 libtomcat6-java,
 libjetty-java (6),
 libservlet2.5-java,
 liblog4j1.2-java,
 libslf4j-java,
 libxmlenc-java,
 liblucene2-java,
 libhsqldb-java,
 ant-optional,
 javacc

Package: libhadoop-java
Architecture: all
Depends: ${misc:Depends}, 
 libcommons-cli-java,
 libcommons-codec-java,
 libcommons-el-java,
 libcommons-httpclient-java,
 libcommons-io-java,
 libcommons-logging-java,
 libcommons-net-java,
 libtomcat6-java,
 libjetty-java (6),
 libservlet2.5-java,
 liblog4j1.2-java,
 libslf4j-java,
 libxmlenc-java
Suggests: libhsqldb-java
Description: data-intensive clustering framework - Java libraries
 Hadoop is a software platform for writing and running applications
 that process vast amounts of data on a distributed file system.
 .
 Here's what makes Hadoop especially useful:
  * Scalable: Hadoop can reliably store and process petabytes.
  * Economical: It distributes the data and processing across clusters
of commonly available computers. These clusters can number
into the thousands of nodes.
  * Efficient: By distributing the data, Hadoop can process it in parallel
   on the nodes where the data is located. This makes it
   extremely rapid.
  * Reliable: Hadoop automatically maintains multiple copies of data and
  automatically redeploys computing tasks based on failures.
 .
 Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS).
 MapReduce divides applications into many small blocks of work. HDFS creates
 multiple replicas of data blocks for reliability, placing them on compute
 nodes around the cluster. MapReduce can then process the data where it is
 located.
 .
 This package contains the core Java libraries.

Package: libhadoop-index-java
Architecture: all
Depends: ${misc:Depends}, libhadoop-java (= ${binary:Version}),
 liblucene2-java
Description: data-intensive clustering framework - Lucene index support
 Hadoop is a software platform for writing and running applications
 that process vast amounts of data on a distributed file system.
 .
 Here's what makes Hadoop especially useful:
  * Scalable: Hadoop can reliably store and process petabytes.
  * Economical: It distributes the data and processing across clusters
of commonly available computers. These clusters can number
into the thousands of nodes.
  * Efficient: By distributing the data, Hadoop can process it in parallel
   on the nodes where the data is located. This makes it
   extremely rapid.
  * Reliable: Hadoop automatically maintains multiple copies of data and
  automatically redeploys computing tasks based on failures.
 .
 Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS).
 MapReduce divides applications into many small blocks of work. HDFS creates
 multiple replicas of data blocks for reliability, placing them on compute
 nodes around the cluster. MapReduce can then process the data where it is
 located.
 .
 This contrib package provides a utility to build or update an index
 using Map/Reduce.
 .
 A distributed index is partitioned into shards. Each shard corresponds
 to a Lucene instance. org.apache.hadoop.contrib.index.main.UpdateIndex
 contains the main() method which uses a Map/Reduce job to analyze documents
 and update Lucene instances in parallel.

Package: hadoop-bin
Section: misc
Architecture: all
Depends: ${misc:Depends}, libhadoop-java (= ${binary:Version}),
 default-jre-headless | java6-runtime-headless
Description: data-intensive clustering framework - tools
 Hadoop is a software platform for writing and running applications
 that process vast amounts of data on a distributed file system.
 .
 Here's what makes Hadoop especially useful:
  * Scalable: Hadoop can reliably store and process petabytes.
  * Economical: It distributes the data and processing across clusters
of commonly available computers. These clusters can number
into the thousands of nodes.
  * Efficient: By distributing the data, Hadoop can process it in parallel
   on