With somewhat mixed feelings, I've been following this discussion. In
the interests of full disclosure, I'll explain the mixed feelings in a
moment. I warmed up legal-discuss for you during the incubator
discussion and learned some things.

Based on my legal understanding, I feel fairly confident that models
derived from textual corpora are not 'derived works' subject to the
copyrights and licenses of the corpora. However, IANAL, and this needs
to be explored. Some remarks on legal-discuss suggest that, in Europe,
I may be completely wrong. Still, this is probably the *good* news.

The less-good news is that, as a general principle, the ASF would not
want a release to contain a binary artifact derived from sources that
cannot be released under the Apache license, or even obtained under
the Apache license or something remotely like it. An even stronger
principle is that the source materials must be available, period
(e.g. not available only to LDC members or something).

The less-bad news is that there is a precedent here: SpamAssassin. To
train spam models, SpamAssassin has to collect and maintain large
collections of materials that have restrictive licenses. The
Foundation has decided that this is tolerable if these materials are
kept on a Foundation server, with access granted to legitimate
members of the development community, one by one. This avoids the
spectre of 'publication' but permits open participation.

The bottom line of the legal-discuss discussion was that this path
was, broadly, available to OpenNLP. However, legal-discuss hates to
discuss hypotheticals, so you won't get a definitive ruling until you
ask a specific question. I recommend opening a JIRA on legal-discuss
as a way to clarify that you need a clear and definitive ruling and
not just an email food-fight.

Now, the full disclosure dept. My paycheck depends, in part, on the
success of a closed source set of NLP modules that have some overlap
with the contents of OpenNLP. It is my belief that, over the long run,
our willingness to spend large sums of money to acquire, clean,
organize, and annotate volumes of data that are directly relevant to
our customers' genres-of-interest will preserve that paycheck,
regardless of the success of OpenNLP. In fact, if I could get 5 spare
days, I'd compare your perceptron model to my perceptron model (mine
includes, for example, "Passive Aggressive" training and is tuned to
run like a bat out of hell) and contemplate mushing them together at
OpenNLP. Still, I thought it only fair to disclose that I could be
viewed as a sort of fox in the henhouse here.

--benson


On Tue, Feb 1, 2011 at 11:07 AM, Jörn Kottmann <[email protected]> wrote:
> On 2/1/11 4:57 PM, Grant Ingersoll wrote:
>>
>> Your timing is great, as I was just about to suggest the same thing.
>>
>> On Feb 1, 2011, at 6:51 AM, Jörn Kottmann wrote:
>>
>>> Hi all,
>>>
>>> I would like to go ahead and get our first release out. The release is
>>> backward compatible with the models we had over at SourceForge.
>>> Which means we do not need to release new models right now.
>>>
>>> The logic to train most of the models is already included in OpenNLP
>>> and enables our users to train the models themselves or even
>>> mix in their own data.
>>>
>>> To release the models at Apache we have to go through a series of legal
>>> issues which I believe should not postpone our first release for
>>> weeks or months.
>>
>> Can you summarize here the issues?  The last thread is mountainous.  To
>> some extent, there is no time like the present to address the legal issues.
>>  The ASF has legal counsel, if you can summarize what we do to make the
>> models and what the concerns are, we can take it over to legal-discuss@ and
>> start working on it.  It may not be as big a deal as one might think.
>
> The concerns are, that our models are trained on various closed or free
> corpora which almost all have different licenses.
> We would have to discuss if the trained model from each corpora is allowed
> to be distributed under AL 2.0.
>
> I believe in most cases we do not violate any copyright, because statistics
> about text are not protected by copyright. We would, for example, generate
> bigram or trigram features over the whole corpus.
>
> In my opinion we at least need to provide a list of corpora and licenses to
> start a discussion over at the legal list. Digging that out alone will take
> some time, at least for the English training data.
> We also have training data where we are unsure about the license.
>
> On the other hand, we gain no advantage from doing it now as part of the
> release.
>
> In my opinion we should try to get the process started, maybe put together a
> wiki page, and as soon as we have all the information the legal people need,
> we start talking to them.
>
> Jörn
>
>
