On Friday 03 April 2009 21:54:04 Jussi Pakkanen wrote:
> Let's not allow this to fall into limbo again. I have not heard from
> Cognitive people but as far as I know:
>
> - there is no (publicly) available source for the dat files
> - they were in the original source release, which was under BSD, so they
> should be BSD as well
The problem currently is that these files are blobs. Debian main should be
made from free material which can be edited, compiled and so on (see the
firmware discussion on the Debian blogs/mailing lists). All executables and
libraries must have readable sources from which they are compiled (so they
must not be precompiled in orig.tar.gz and then copied over, unless it is an
interpreted script in a readable form).
Images/fonts/sounds are somewhat different. It should be possible to edit
them. Ask a maintainer of an artwork/ttf/game package in what form they have
the raw data in the source package. I doubt that many of them have the raw
tiff/bmp/pcm/psd/xcf/... inside the package, but the files are still editable
with tools inside Debian. A -doc package which installs a PDF should also
contain the original (tex/docbook/...) file in the source package so it can
be changed without too much hassle.
The data files from cuneiform are a mystery to me: a bunch of bytes which
seem to come from nowhere. I don't know how to generate them, I don't know
how to edit them, and there is no documentation of what is inside them (OK,
somewhere there must be code which reads them, but someone would first have
to write a tool which can extract or regenerate the data). If there is such a
tool somewhere, then please document it. My first guess is that these files
hold the recognition patterns used by the OCR. This is something for which we
definitely want the source.
There was a discussion some time ago about statistical data generated from
web pages. In the end most people agreed that it is not possible to redo such
an analysis each time the package is built, but that it has to be done in a
way that lets everyone recreate the data with the same high quality.
... maybe you should ask the tesseract guys how they handle it. I looked at
the tesseract-deu-2.00 files and these are binary blobs too, but they are in
main. Some training data can be found on the tesseract homepage, but there is
no information on whether the training data was or can be used to recreate
the language-specific data. This should definitely be added to the source
package in a Debian-specific README. Recreating the files from the training
data in the source package would be nice too, but I am not sure whether it is
too CPU intensive or why it hasn't been done yet. The training data seems to
be made by hand, and I can create a training page with tools in Debian. I
think they should qualify as source files.
--
Robert Wohlrab
--
To UNSUBSCRIBE, email to debian-wnpp-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org