Bug#913951: tesseract-ocr: Legacy engine not directly usable because of missing files

2018-12-03 Thread Janusz S. Bień
On Mon, Dec 03 2018 at 11:21 -0800, Jeff Breidenbach wrote:
> Hi Janusz,
>
> Tesseract 4 uses tesseract-ocr-deu and tesseract-ocr-script-frak. 
> Tesseract 3 uses tesseract-ocr-deu-frak

I don't use Tesseract actively at the moment but subscribe to the
tesseract issues. I don't read them carefully but have the impression
that, at least at the moment, Tesseract 4 training data are not
necessarily better than the legacy ones.

[...]

> We'll discuss this with upstream, but in the meantime I have a question for
> you: What is your best guess for how many people are like you, and want to 
> use the Tesseract 3 engine in Debian?

To tell the truth, I'm aware of only one other person. He became confused
by the different paths to training data on Ubuntu and reported his
problem as an issue, which was closed almost immediately;
unfortunately I'm unable to quickly find the issue number.

Best regards

Janusz

-- 
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien



Bug#913951: tesseract-ocr: Legacy engine not directly usable because of missing files

2018-12-03 Thread Jeff Breidenbach
Hi Janusz,

Tesseract 4 uses tesseract-ocr-deu and tesseract-ocr-script-frak.
Tesseract 3 uses tesseract-ocr-deu-frak

I am worried about confusing users. If we include both sets of language
data in Debian, there will be a huge number of choices, and some users
might feel overwhelmed. However, Alexander thinks it could work with
careful package naming, like this:

  https://mentors.debian.net/package/tesseract-lang-legacy

I am also a little worried about the support costs of exposing lots of
users to the legacy engine. It will make it harder to remove the legacy
engine completely from future Tesseract, especially if other Debian
packages start to have dependencies on it. Also, bug reports against the
legacy engine will not get much attention from upstream.

We'll discuss this with upstream, but in the meantime I have a question for
you: What is your best guess for how many people are like you, and want to
use the Tesseract 3 engine in Debian?

Thanks,
Jeff


Bug#913951: tesseract-ocr: Legacy engine not directly usable because of missing files

2018-11-17 Thread Janusz S. Bień
Package: tesseract-ocr
Version: 4.0.0-1+b1
Severity: normal

The legacy engine provides some data which are not available with LSTM,
e.g. font information, so it is not obsolete and a user should be able
to use it regularly. The data for the legacy engine are in a different
directory, so there is no conflict. However, the current package is
configured to see a conflict here, e.g.

--8<---cut here---start->8---
The following packages have unmet dependencies:
 tesseract-ocr : Depends: tesseract-ocr-osd (>= 4.00~) but 3.04.00-1 is installed

apt --fix-broken install
Reading package lists... Done
Building dependency tree   
Reading state information... Done
Correcting dependencies... Done
The following additional packages will be installed:
  tesseract-ocr-osd
The following packages will be upgraded:
  tesseract-ocr-osd
1 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.
Need to get 0 B/2,991 kB of archives.
After this operation, 5,120 B of additional disk space will be used.
Do you want to continue? [Y/n] 
Reading changelogs... Done
(Reading database ... 503032 files and directories currently installed.)
Preparing to unpack .../tesseract-ocr-osd_1%3a4.00~git30-7274cfa-1_all.deb ...
Unpacking tesseract-ocr-osd (1:4.00~git30-7274cfa-1) over (3.04.00-1) ...
Setting up tesseract-ocr-osd (1:4.00~git30-7274cfa-1) ...
--8<---cut here---end--->8---

Instead of just installing the missing package, apt removed the legacy
version. The problem can of course be circumvented by an advanced user
with TESSDATA_PREFIX, but this is a completely unnecessary complication.
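
For illustration, the TESSDATA_PREFIX workaround could be sketched roughly
as follows. This is only a sketch under assumptions: the directory in
LEGACY_TESSDATA is a guess at where the legacy data might live (the report
only says it is a different directory), the helper names are hypothetical,
and whether TESSDATA_PREFIX should name the tessdata directory itself or
its parent differs between Tesseract releases, so it should be checked
against the installed version. The `--oem 0` option selects the legacy
engine in Tesseract 4.

```python
import os
import subprocess

# Assumed location of the legacy (Tesseract 3) traineddata files.
# Hypothetical: adjust to wherever the legacy data is actually installed.
LEGACY_TESSDATA = "/usr/share/tesseract-ocr/tessdata"


def legacy_ocr_command(image, output_base, lang="eng", tessdata=LEGACY_TESSDATA):
    """Build (argv, env) for one tesseract run using the legacy engine."""
    env = dict(os.environ)            # copy, so the caller's environment is untouched
    env["TESSDATA_PREFIX"] = tessdata  # point tesseract at the legacy data
    argv = ["tesseract", image, output_base, "--oem", "0", "-l", lang]
    return argv, env


def run_legacy_ocr(image, output_base, lang="eng", tessdata=LEGACY_TESSDATA):
    """Run tesseract with the legacy engine; raises if tesseract fails."""
    argv, env = legacy_ocr_command(image, output_base, lang, tessdata)
    subprocess.run(argv, env=env, check=True)
```

Calling run_legacy_ocr("page.png", "page") would then write the recognized
text to page.txt, provided legacy traineddata for the chosen language is
present in the assumed directory.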

Best regards

Janusz

-- System Information:
Debian Release: buster/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 4.18.0-2-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US:en (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages tesseract-ocr depends on:
ii  libc6                2.27-8
ii  libcairo2            1.16.0-1
ii  libfontconfig1       2.13.1-2
ii  libgcc1              1:8.2.0-9
ii  libglib2.0-0         2.58.1-2
ii  libgomp1             8.2.0-9
ii  libicu63             63.1-4
ii  liblept5             1.76.0-1
ii  libpango-1.0-0       1.42.4-3
ii  libpangocairo-1.0-0  1.42.4-3
ii  libpangoft2-1.0-0    1.42.4-3
ii  libstdc++6           8.2.0-9
ii  libtesseract4        4.0.0-1+b1
ii  tesseract-ocr-eng    1:4.00~git30-7274cfa-1
ii  tesseract-ocr-osd    3.04.00-1

tesseract-ocr recommends no packages.

tesseract-ocr suggests no packages.

-- no debconf information

-- 
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien