2011/7/19 Fabian Christ <[email protected]>:
> 2011/7/19 Olivier Grisel <[email protected]>:
>> 2011/7/19 Fabian Christ <[email protected]>:
>> Also the stable launcher includes opennlp models that have no clear
>> license status (statistical models derived from copyrighted documents)
>> and hence are not hostable / distributable on the Apache maven repo.
>> Right now the models are downloaded from
>> http://opennlp.sourceforge.net/models-1.5/ and included in the
>> defaultdata bundle that is then uploaded to the maven repo of Nuxeo.
>> The OpenNLP project is working on building models out of pure open
>> data corpus (wikipedia / wikinews / dbpedia) but this is still work in
>> progress.
>>
>> I don't know what to do in the short term. Maybe the first release of
>> Stanbol could be done without the launcher bundle? Maybe we could
>> build a launcher without the defaultdata bundle and tell people in the
>> readme to download it manually from the nuxeo maven repo [1] so as to
>> put it in the right sling/ folder after the first start?
>
> So everything in /data and /defaultdata is not clear in terms of
> license? At least we can not redistribute it under Apache, right?

The DBpedia stuff is fine (we probably just need to add a header /
NOTICE / README file to say that it is dual licensed under CC BY-SA
3.0 / GFDL), the opennlp stuff is not (yet). Joern Kottmann and the
other OpenNLP developers are currently working on an annotated corpus
infrastructure to be able to train models that are 100% free with a
clear license and thus directly embeddable in Apache distributed
jars. But this work is likely to take at least a couple of months,
hence we cannot delay the Stanbol release for it.

Maybe we could work on extending the DataFileProvider to make the
defaultdata provider only provide download URLs for the existing
gray-licensed opennlp 1.5 models from
http://opennlp.sourceforge.net/models-1.5/ and let the
DataFileProvider download them from there automatically the first time
they are required. The issue then is that every integration test job
would re-download the same data from sourceforge over and over
again... That would slow down the builds / tests, waste bandwidth for
nothing, and add a new way for the builds and tests to fail
(dependency on the network / sourceforge availability).
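To make the lazy-download idea concrete, here is a minimal sketch of a
download-on-first-use provider (the class and method names are
hypothetical and the real Stanbol DataFileProvider interface may look
different): it resolves a model name against a local cache directory and
only hits the remote URL when the file is not cached yet, which would
also let CI machines reuse a shared cache instead of re-downloading.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: not the actual Stanbol DataFileProvider API.
public class CachingDataFileProvider {

    private final Path cacheDir;   // e.g. sling/datafiles or a per-user cache
    private final String baseUrl;  // e.g. http://opennlp.sourceforge.net/models-1.5/

    public CachingDataFileProvider(Path cacheDir, String baseUrl) {
        this.cacheDir = cacheDir;
        this.baseUrl = baseUrl;
    }

    /** Return the model file, downloading it only on the first request. */
    public Path getDataFile(String name) throws IOException {
        Path cached = cacheDir.resolve(name);
        if (Files.exists(cached)) {
            return cached; // already fetched: no network access at all
        }
        Files.createDirectories(cacheDir);
        // download to a temp file first so readers never see a partial model
        Path tmp = Files.createTempFile(cacheDir, name, ".part");
        try (InputStream in = new URL(baseUrl + name).openStream()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        Files.move(tmp, cached, StandardCopyOption.REPLACE_EXISTING);
        return cached;
    }
}
```

Pointing `cacheDir` at a directory that survives between builds would
address the re-download concern for the integration tests, at the cost
of the first build on a fresh machine still depending on sourceforge
being up.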

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel