http://mail-archives.apache.org/mod_mbox/www-legal-discuss/201011.mbox/%[email protected]%3E
On Tue, Feb 1, 2011 at 4:45 PM, Grant Ingersoll <[email protected]> wrote:
>
> On Feb 1, 2011, at 11:20 AM, Benson Margulies wrote:
>
>> With somewhat mixed feelings, I've been following this discussion. In
>> the interests of full disclosure, I'll explain the mixed feelings in a
>> moment. I warmed up legal-discuss for you during the incubator
>> discussion and learned some things.
>
> What's the thread for this one?
>
>>
>> Based on my legal understanding, I feel fairly confident that models
>> derived from textual corpora are not 'derived works' subject to the
>> copyrights and licenses of the corpora. However, IANAL, and this needs
>> to be explored. Some remarks on legal-discuss suggest that, in Europe,
>> I may be completely wrong. Still, this is probably the *good* news.
>>
>> The less-good news is that, as a general principle, the ASF would not
>> want a release to contain a binary artifact derived from sources that
>> cannot be released under the Apache license, or even obtained under
>> the Apache license or something remotely like it. An even stronger
>> principle is that the source materials must be available, period
>> (e.g. not available only to LDC members or something).
>
> This is the single most frustrating issue facing open source text tools to
> date. It's why I started the Open Relevance Project, but until we have
> enough of us willing to band together and work on it, we will be stuck.
>
>>
>> The less bad news is that there is a precedent here: SpamAssassin. To
>> train spam models, SpamAssassin has to collect and maintain large
>> collections of materials that have restrictive licenses. The
>> Foundation has decided that this is tolerable if these materials are
>> kept on a Foundation server, and access to that granted to legitimate
>> members of the development community, one by one. This avoids the
>> spectre of 'publication' but permits open participation.
>
> This is OK, but it discourages newbies from participating.
>
>>
>> The bottom line of the legal-discuss discussion was that this path
>> was, broadly, available to OpenNLP. However, legal-discuss hates to
>> discuss hypotheticals, so you won't get a definitive ruling until you
>> ask a specific question. I recommend opening a JIRA on legal-discuss
>> as a way to clarify that you need a clear and definitive ruling and
>> not just an email food-fight.
>
> Yes, we should start assembling a list of corpora, if only so that we at
> least have it for others who come later and want to reproduce them. In the
> meantime, I would agree that we can just keep the models elsewhere. We don't
> have to provide models. They are a convenience for all involved, but not a
> requirement in order to run. I wonder how many people actually train their
> own. (BTW, we should update our website to point to older models, too. They
> are really hard to find unless you do some URL rewriting.)
>
>
> -Grant
