advice requested: wenglish may be non free

David Coe 23 Dec 1999 15:31:46 -0000

I've noticed a possible commercial-use restriction while adopting the
wenglish package [a /usr/share/dict/words list of English words, in
main/text].  I searched the -devel and -legal archives, but found no
previous discussion of this.


The upstream README.linux.words file describes the "non-copyright"
status of the word lists that were used to construct this word list,
but its description of one of those component lists (a README within
the README) says:

   To the best of my knowledge, all the files I used to build these
   wordlists were available for public distribution and use, at least
   for non-commercial purposes.  I have confirmed this assumption with
   the authors of the lists, whenever they were known.
   
   Therefore, it is safe to assume that the wordlists in this package
   can also be freely copied, distributed, modified, and used for
   personal, educational, and research purposes.  (Use of these files
   in commercial products may require written permission from DEC
   and/or the authors of the original lists.)


The upstream README.linux.words has until now been (unintentionally?)
excluded from the Debian package.  The previous Debian maintainers
whom I've heard from have no knowledge of why this was left out, or of
this package's DFSG compliance.

A perhaps important point: this is not 'software' (in the generally
accepted sense); it is a plain-text alphabetical list of
English words, which were extracted from other lists of English words,
which were (as described in the README.linux.words) created from
various apparently free sources.

So my *feeling* is that DEC and the other authors of those original
lists can't place inherited commercial-use restrictions on a new word
list that was constructed by copying (most of) the words from their
lists and merging them with other lists.  [Hmmmm, if I took a
copyrighted novel and published an alphabetical list of the words
extracted from its text, would I be violating the author's copyright?
I doubt it.]

Here is the entire upstream README.linux.words file.  I've numbered
the lines (with 'nl'); in all other respects this is unaltered.  Lines
160-169 are what I'm worried about, but you probably have to read them
in context.

The simple question is: is the resultant list DFSG-compliant?  Thanks.

     1  #!/bin/sh -xe
     2  # README.linux.words - file used to create linux.words
     3  # Created: Wed Mar 10 09:12:49 1993 by [EMAIL PROTECTED] (Rik Faith)
     4  # Revised: Sat Mar 13 17:02:08 1993 by [EMAIL PROTECTED]
     5  #
     6  # Care was taken to be sure that the linux.words list was free of
     7  # copyright.  This makes linux.words a suitable /usr/dict/words
     8  # replacement for the Linux community.
     9  #
    10  # Since the majority of the words are from Tanenbaum's minix.dict file,
    11  # the notice from Barry Brachman, included below, should accompany any
    12  # redistribution of this list.
       
    13  # Here is a detailed explaination of how I created the linux.words file.
    14  #
    15  # This README.words file is actually a shell script that you can use to
    16  # recreate the linux.words file from original sources.
    17  #
    18  # First, I started with minix.dict
    19  # from cs.ubc.ca:/pub/local/src/sp-1.5/wordlists-1.0.tar.Z
    20  #
    21  # The following is from the NOTES file in wordlists-1.0.tar.Z:
       
    22  # NOTES> These word lists were collected by Barry Brachman
    23  # NOTES> <[EMAIL PROTECTED]> at the University of British Columbia.  They
    24  # NOTES> may be freely distributed as long as this notice accompanies them.
    25  # NOTES> 
    26  # NOTES> ==================================================================
    27  # NOTES> Info for minix.dict:
    28  # NOTES> 
    29  # NOTES> Article 1997 of comp.os.minix:
    30  # NOTES> From: [EMAIL PROTECTED]
    31  # NOTES> Subject: A spelling checker for MINIX
    32  # NOTES> Date: 6 Jan 88 22:28:22 GMT
    33  # NOTES> Reply-To: [EMAIL PROTECTED] (Andy Tanenbaum)
    34  # NOTES> Organization: VU Informatica, Amsterdam
    35  # NOTES> 
    36  # NOTES> This dictionary is NOT based on the UNIX dictionary so it is free
    37  # NOTES> of AT&T copyright.  I built the dictionary from three sources.
    38  # NOTES> First, I started by sorting and uniq'ing some public domain
    39  # NOTES> dictionaries.  Second, as some of you probably know, I have
    40  # NOTES> written somewhere between 3 and 6 books (depending on precisely
    41  # NOTES> what you count) and an additional 50 published papers on operating
    42  # NOTES> systems, networks, compilers, languages, etc.  This data base,
    43  # NOTES> which is online, is nonnegligible :-) Finally, I added a number of
    44  # NOTES> words that I thought ought to be in the dictionary including all
    45  # NOTES> the U.S. states, all the European and some other major countries,
    46  # NOTES> principal U.S. and world cities, and a bunch of technical terms.
    47  # NOTES> I don't want my spelling checker to barf on arpanet, diskless,
    48  # NOTES> modem, login, internetwork, subdirectory, superuser, vlsi, or
    49  # NOTES> winchester just because Webster wouldn't approve of them.  All in
    50  # NOTES> all, the dictionary is over 40,000 words.  If you have any
    51  # NOTES> suggestions for additions or deletions, please post them.  But
    52  # NOTES> please be sure you are not infringing on anyone's copyright in
    53  # NOTES> doing so.
    54  # NOTES> 
    55  # NOTES> Andy Tanenbaum ([EMAIL PROTECTED])
       
    56  # The main problem with minix.dict is that many proper names are not
    57  # capitalized.  So, I got english.tar.Z from ftp.uu.net:/doc/dictionaries,
    58  # which is a mirror of nic.funet.fi:/pub/unix/security/dictionaries.
    59  #
    60  # Here is part of the README file for english.tar.Z:
       
    61  # README> 
    62  # README> FILE: english.words
    63  # README> VERSION: DEC-SRC-92-04-05
    64  # README> 
    65  # README> EDITOR
    66  # README> 
    67  # README>     Jorge Stolfi <[EMAIL PROTECTED]>
    68  # README>     DEC Systems Research Center
    69  # README>   
    70  # README> AUTHORS OF ORIGIONAL WORDLISTS
    71  # README> 
    72  # README>     Andy Tanenbaum <[EMAIL PROTECTED]>
    73  # README>     Barry Brachman <[EMAIL PROTECTED]>
    74  # README>     Geoff Kuenning <[EMAIL PROTECTED]>
    75  # README>     Henk Smit <[EMAIL PROTECTED]>
    76  # README>     Walt Buehring <[EMAIL PROTECTED]>
    77  #
    78  # [stuff seleted]
    79  #
    80  # README> AUXILIARY LISTS
    81  # README> 
    82  # README>     In the same directory as englis.words there are a few
    83  # README>     complementary word lists, all derived from the same sources
    84  # README>     [1--8] as the main list:
    85  # README> 
    86  # README>     english.names
    87  # README> 
    88  # README>         A list of common English proper names and their derivatives.
    89  # README>         The list includes: person names ("John", "Abigail",
    90  # README>         "Barrymore"); countries, nations, and cities ("Germany",
    91  # README>         "Gypsies", "Moscow"); historical, biblical and mythological
    92  # README>         figures ("Columbus", "Isaiah", "Ulysses"); important
    93  # README>         trademarked products ("Xerox", "Teflon"); biological genera
    94  # README>         ("Aerobacter"); and some of their derivatives ("Germans",
    95  # README>         "Xeroxed", "Newtonian").
    96  # README>     
    97  # README>     misc.names
    98  # README> 
    99  # README>         A list of foreign-sounding names of persons and places
   100  # README>         ("Antonio", "Albuquerque", "Balzac", "Stravinski"), extracted
   101  # README>         from the lists [1--8].  (The distinction betweeen
   102  # README>         "English-sounding" and "foreign-sounding" is of course rather
   103  # README>         arbitrary).
   104  # README> 
   105  # README>     org.names
   106  # README> 
   107  # README>         A short lists names of corporations and other institutions
   108  # README>         ("Pepsico", "Amtrak", "Medicare"), and a few derivatives.
   109  # README> 
   110  # README>         The file also includes some initialisms --- acronyms and
   111  # README>         abbreviations that are generally pronounced as words rather
   112  # README>         than spelled out ("NASA", "UNESCO").
   113  # README> 
   114  # README>     english.abbrs
   115  # README> 
   116  # README>         A list of common abbreviations ("etc.", "Dr.", "Wed."),
   117  # README>         acronyms ("A&M", "CPU", "IEEE"), and measurement symbols
   118  # README>         ("ft", "cm", "ns", "kHz").
   119  # README> 
   120  # README>     english.trash
   121  # README>                 
   122  # README>         A list of words from the original wordlists
   123  # README>         that I decided were either wrong or unsuitable for inclusion
   124  # README>         in the file english.words or any of the other auxiliary
   125  # README>         lists. It includes
   126  # README>         
   127  # README>           typos ("accupy", "aquariia", "automatontons")
   128  # README>           spelling errors ("abcissa", "alleviater", "analagous")
   129  # README>           bogus derived forms ("homeown", "unfavorablies", "catched")
   130  # README>           uncapitalized proper names ("afghanistan",
   131  # README>                                       "algol", "decnet")
   132  # README>           uncapitalized acronyms ("apl", "ccw", "ibm")
   133  # README>           unpunctuated abbreviations ("amp", "approx", "etc")
   134  # README>           British spellings ("advertize", "archaeology")
   135  # README>           archaic words ("bedight")
   136  # README>           rare variants ("babirousa")
   137  # README>           unassimilated foreign words ("bambino", "oui", "caballero")
   138  # README>           mis-hyphenated compounds ("babylike", "backarrows")
   139  # README>           computer keywords and slang ("lconvert", "noecho", "prog")
   140  # README> 
   141  # README>         (I apologize for excluding British spellings.  I should have
   142  # README>         split the list in three sublists--- common English, British,
   143  # README>         American---as ispell does.  But there are only so many hours
   144  # README>         in a day...)
   145  # README> 
   146  # README>     english.maybe
   147  # README> 
   148  # README>         A list of about 5,000 lowercase words from the "mts.dict"
   149  # README>         wordlist [6] that weren't included in english.words.
   150  # README> 
   151  # README>         This list seems to include lots of "trash", like
   152  # README>         uncapitalized proper names and weird words.  It would
   153  # README>         take me several days to sort this mess, so I decided to
   154  # README>         leave it as a separate file.  Use at your own risk...
   155  #
   156  # [stuff deleted]
   157  #
   158  # README> (NON-)COPYRIGHT STATUS
   159  # README> 
   160  # README>   To the best of my knowledge, all the files I used to build these
   161  # README>   wordlists were available for public distribution and use, at least
   162  # README>   for non-commercial purposes.  I have confirmed this assumption with
   163  # README>   the authors of the lists, whenever they were known.
   164  # README>   
   165  # README>   Therefore, it is safe to assume that the wordlists in this
   166  # README>   package can also be freely copied, distributed, modified, and
   167  # README>   used for personal, educational, and research purposes.  (Use of
   168  # README>   these files in commercial products may require written
   169  # README>   permission from DEC and/or the authors of the original lists.)
   170  # README>   
   171  # README>   Whenever you distribute any of these wordlists, please distribute
   172  # README>   also the accompanying README file.  If you distribute a modified
   173  # README>   copy of one of these wordlists, please include the original README
   174  # README>   file with a note explaining your modifications.  Your users will
   175  # README>   surely appreciate that.
   176  # README> 
   177  # README> (NO-)WARRANTY DISCLAIMER
   178  # README> 
   179  # README>   These files, like the original wordlists on which they are
   180  # README>   based, are still very incomplete, uneven, and inconsitent, and
   181  # README>   probably contain many errors.  They are offered "as is" without
   182  # README>   any warranty of correctness or fitness for any particular
   183  # README>   purpose.  Neither I nor my employer can be held responsible for
   184  # README>   any losses or damages that may result from their use.
       
   185  # subtract english.trash
   186  cat minix.dict english.trash english.trash | sort | uniq -u > dict.1
   187  # subtract english.maybe
   188  cat dict.1 english.maybe english.maybe | sort | uniq -u > dict.2
       
   189  # build subtraction list of proper names and abbreviations
   190  cat english.names misc.names org.names computer.names english.abbrs > sub.1
   191  tr 'A-Z' 'a-z' < sub.1 | sort | uniq -u > sub.2
       
   192  # subtract proper names with incorrect capitalization
   193  cat dict.2 sub.2 sub.2 | sort | uniq -u > dict.3
       
   194  # build proper name list without possessives
   195  cat english.names misc.names org.names computer.names | fgrep -v \'s > names.1
       
   196  # add in proper names (use sort twice to get uppercase before lowercase)
   197  cat dict.3 names.1 | sort | sort -df | uniq > linux.words
       
   198  # clean up
   199  rm dict.[123] sub.[12] names.1
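
For readers unfamiliar with the "subtract" idiom used on lines 186, 188
and 193 of the script above, here is a minimal sketch of how it works.
The file names and sample words are hypothetical, chosen only for
illustration; they are not part of the upstream README.

```shell
#!/bin/sh
# Sketch of list subtraction with "uniq -u".
# The file names and sample words are made up for this example.
set -e
tmp=$(mktemp -d)

printf '%s\n' apple ibm pear zebra > "$tmp/words"
printf '%s\n' ibm zebra nasa       > "$tmp/trash"

# Feeding the trash list in twice means every trash word occurs at
# least twice in the sorted stream, so "uniq -u" (print only lines
# that are not repeated) drops it.  Words that appear once, and only
# in the main list, survive.
result=$(cat "$tmp/words" "$tmp/trash" "$tmp/trash" | sort | uniq -u)

printf '%s\n' "$result"
rm -r "$tmp"
```

One consequence of this approach is that any word duplicated within
the main list itself is also dropped, so it behaves as a set
difference only when the input list has no repeated lines.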
