Bug#906242: cannot export OCR'ed Russian text

2018-08-17 Thread Alberto Garcia
Control: tags -1 moreinfo

On Thu, Aug 16, 2018 at 01:08:54AM +0300, Dmitry Eremin-Solenikov wrote:
> Package: ocrfeeder
> Version: 0.8.1-4
> Severity: important
> 
> After ocrfeeder has successfully OCR'ed Russian text, it is unable to
> export it to any of the formats, dumping the following errors to the
> console:

I seem to be able to export Cyrillic text to ODT just fine. Can you
share one image that you are using to reproduce this problem?

Berto



Bug#906242: cannot export OCR'ed Russian text

2018-08-17 Thread Alberto Garcia
On Fri, Aug 17, 2018 at 05:13:02PM +0300, Dmitry Eremin-Solenikov wrote:

> The issue is not with OCR'ing itself, just with exporting. I found that
> running ocrfeeder with a Russian locale (LANG=ru_RU.UTF-8 ocrfeeder)
> allows me to export text without issues. However, with my default locale
> (en_GB.utf8) I can see the recognized text in the GUI, but cannot export
> it to a file.

I understand. In my case I tried with some random text and it wasn't
recognized correctly, but if I replace the text in the box with some
Russian text, I can export it fine.

Berto



Bug#906242: cannot export OCR'ed Russian text

2018-08-17 Thread Dmitry Eremin-Solenikov
Hello,

The issue is not with OCR'ing itself, just with exporting. I found that
running ocrfeeder with a Russian locale (LANG=ru_RU.UTF-8 ocrfeeder) allows
me to export text without issues. However, with my default locale
(en_GB.utf8) I can see the recognized text in the GUI, but cannot export it
to a file.


-- 
With best wishes
Dmitry



Bug#906242: cannot export OCR'ed Russian text

2018-08-15 Thread Dmitry Eremin-Solenikov
Package: ocrfeeder
Version: 0.8.1-4
Severity: important

After ocrfeeder has successfully OCR'ed Russian text, it is unable to
export it to any of the formats, dumping the following errors to the
console:

Export to ODT
=============
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/studio/studioBuilder.py", line 284, in exportToOdt
    self.exportToFormat('ODT', 'ODT')
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/studio/studioBuilder.py", line 281, in exportToFormat
    name)
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/studio/widgetModeler.py", line 605, in exportPagesWithGenerator
    document_generator.addPage(page)
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/feeder/documentGeneration.py", line 293, in addPage
    self.addBoxes(page_data.data_boxes)
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/feeder/documentGeneration.py", line 78, in addBoxes
    self.addBox(data_box)
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/feeder/documentGeneration.py", line 66, in addBox
    self.addText(data_box)
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/feeder/documentGeneration.py", line 251, in addText
    text = data_box.getText().decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)


Export to HTML
==============
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/studio/studioBuilder.py", line 298, in exportDialog
    self.EXPORT_FORMATS[format][1])
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/studio/studioBuilder.py", line 281, in exportToFormat
    name)
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/studio/widgetModeler.py", line 606, in exportPagesWithGenerator
    document_generator.save()
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/feeder/documentGeneration.py", line 207, in save
    ''' % {'title': self.name, 'body': self.bodies[i], 'previous_page': previous_page, 'next_page': next_page}
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 137: ordinal not in range(128)
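
Both messages above are what Python 2 produces when it converts between
byte strings and unicode implicitly through the ASCII codec. A minimal
sketch of those two implicit steps, written for Python 3 (where they have
to be spelled out by hand) and not taken from ocrfeeder's code; the sample
string and all function names are hypothetical:

```python
# Python 3 sketch: the ASCII steps that Python 2 performs implicitly are
# written out explicitly here. Nothing below is ocrfeeder code.

sample = "Привет"  # hypothetical sample text, not from the scanned image


def decode_already_decoded(text):
    """Mimic Python 2's unicode.decode('utf-8'): the text is first
    encoded with the default (ASCII) codec, then decoded as UTF-8."""
    return text.encode("ascii").decode("utf-8")


def format_bytes_into_text(raw):
    """Mimic Python 2 mixing a UTF-8 byte string into unicode string
    formatting: the bytes are first decoded with the ASCII codec."""
    return "<body>%s</body>" % raw.decode("ascii")


def error_name(func, arg):
    """Return the name of the Unicode error raised by func(arg), or None."""
    try:
        func(arg)
    except UnicodeError as exc:
        return type(exc).__name__
    return None


# Text that is already decoded fails in the implicit *encode* step.
print(error_name(decode_already_decoded, sample))

# UTF-8 bytes fail in the implicit *decode* step.
print(error_name(format_bytes_into_text, sample.encode("utf-8")))
```

The first call fails while encoding and the second while decoding, which
matches the UnicodeEncodeError in the ODT export and the UnicodeDecodeError
in the HTML export. Byte 0xd0 is the lead byte of many Cyrillic letters in
UTF-8, so seeing it in the HTML message is consistent with Russian text.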


Export to TXT
=============
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/studio/studioBuilder.py", line 298, in exportDialog
    self.EXPORT_FORMATS[format][1])
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/studio/studioBuilder.py", line 281, in exportToFormat
    name)
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/studio/widgetModeler.py", line 605, in exportPagesWithGenerator
    document_generator.addPage(page)
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/feeder/documentGeneration.py", line 364, in addPage
    self.addText(page.getTextFromBoxes())
  File "/usr/lib/python2.7/dist-packages/ocrfeeder/feeder/documentGeneration.py", line 361, in addText
    self.text += unicode(newText, 'utf-8')
TypeError: decoding Unicode is not supported
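
Python 2 raises this TypeError when unicode() is asked to decode a value
that is already unicode, which suggests that getTextFromBoxes() returned
text rather than bytes here. One defensive pattern is to decode only when
the value really is bytes. A minimal sketch in Python 3, assuming a
hypothetical helper named to_text (it is not part of ocrfeeder, and this
is not necessarily how the package should be fixed):

```python
def to_text(value, encoding="utf-8"):
    """Return text whether `value` is raw bytes or already-decoded text.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


russian = "Привет"  # hypothetical sample text

# Both inputs normalise to the same text, so callers no longer need to
# know which of the two types they were handed.
print(to_text(russian.encode("utf-8")) == russian)
print(to_text(russian) == russian)

# Python 3 analogue of the failure above: asking str() to decode a value
# that is already text raises TypeError as well.
try:
    str(russian, "utf-8")
except TypeError as exc:
    print(type(exc).__name__)
```

With such a helper, the same code path works regardless of whether the
text arrives as UTF-8 bytes or as already-decoded text.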



-- System Information:
Debian Release: buster/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 4.18.0-rc4-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_GB.utf8, LC_CTYPE=en_GB.utf8 (charmap=UTF-8), LANGUAGE=en_GB:en (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages ocrfeeder depends on:
ii  cuneiform             1.1.0+dfsg-7
ii  ghostscript           9.22~dfsg-2.1
ii  gir1.2-goocanvas-2.0  2.0.4-1
ii  gir1.2-gtk-3.0        3.22.30-2
ii  gir1.2-gtkspell3-3.0  3.0.9-2
ii  iso-codes             3.79-1
ii  python                2.7.15-3
ii  python-enchant        2.0.0-1
ii  python-gi             3.28.2-1+b1
ii  python-lxml           4.2.3-1
ii  python-pil            5.2.0-2
ii  python-reportlab      3.5.2-1
ii  python-sane           2.8.3-1+b2
ii  tesseract-ocr         4.00~git2844-607e8fd8-2

Versions of packages ocrfeeder recommends:
ii  unpaper  6.1-2+b2
pn  yelp     <none>

ocrfeeder suggests no packages.

-- no debconf information