Ninovolador created this task.
Ninovolador added projects: Pywikibot, Wikimedia OCR.
Restricted Application added subscribers: pywikibot-bugs-list, Aklapper.
Restricted Application added a project: Community-Tech.

TASK DESCRIPTION
  As the title says: the proofreadpage module has a function,
  `url_image(self)`, that generates the URL of the image to send to the OCR
  web service. It is quite a bad function: it scrapes the URL and ends up
  with a lower-than-optimal resolution, resulting in lower-quality OCR.
  
  **Steps to replicate the issue**
  
  - Use the pywikibot proofreadpage module to do OCR on a page. The OCR web service gets this URL:
https://ocr.wmcloud.org/api.php?engine=tesseract&langs[]=es&image=https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu/page141-987px-Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu.jpg&uselang=es
  - Use the in-Wikisource OCR button, and the OCR web service gets this URL: 
https://ocr.wmcloud.org/api.php?engine=tesseract&langs[]=es&image=https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/Origen_de_las_especies_por_medio_de_la_selecci%25C3%25B3n_natural.djvu/page141-1974px-Origen_de_las_especies_por_medio_de_la_selecci%25C3%25B3n_natural.djvu.jpg&line_id=&uselang=es
  
  **What happens?**:
  In this particular case, pywikibot requests a 987px-wide image where the
  Wikisource button requests a 1974px-wide one. That is about 4x fewer
  pixels, so the quality of the OCR is a lot worse.
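
  A quick sketch to check that figure, using the image URLs from the two
  requests above. The helper name `thumb_width` is hypothetical, and the 4x
  number assumes both thumbnails keep the same aspect ratio, so the pixel
  count scales with the square of the width.

```python
import re


def thumb_width(url: str) -> int:
    """Return the width encoded in a 'page<N>-<W>px-' thumbnail URL segment."""
    match = re.search(r"page\d+-(\d+)px-", url)
    if match is None:
        raise ValueError("no 'page<N>-<W>px-' segment found in URL")
    return int(match.group(1))


# Image URL sent by pywikibot (from the first step above).
pywikibot_url = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/"
    "Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu/"
    "page141-987px-"
    "Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu.jpg"
)
# Image URL sent by the in-Wikisource OCR button (from the second step above).
wikisource_url = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/"
    "Origen_de_las_especies_por_medio_de_la_selecci%25C3%25B3n_natural.djvu/"
    "page141-1974px-"
    "Origen_de_las_especies_por_medio_de_la_selecci%25C3%25B3n_natural.djvu.jpg"
)

width_ratio = thumb_width(wikisource_url) / thumb_width(pywikibot_url)
# Assumption: same aspect ratio, so area scales with width squared.
pixel_ratio = width_ratio ** 2
print(width_ratio, pixel_ratio)
```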
  
  **What should have happened instead?**:
  pywikibot should have a better way of obtaining the page image's URL.
  Honestly, using BeautifulSoup to scrape the URL is quite a bad idea.
  
  Maybe someone at the Wikimedia OCR project can tell us how they manage to
  get the full-resolution image for every page.
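
  In the meantime, one possible stopgap (only a sketch, and not necessarily
  what Wikimedia OCR does) would be to rewrite the width in the scraped
  thumbnail URL before sending it to the OCR service. The function name
  `with_thumb_width` is hypothetical, and this assumes the thumbnail URL
  always contains a `page<N>-<W>px-` segment and that the server will
  render the requested width.

```python
import re


def with_thumb_width(url: str, width: int) -> str:
    """Return `url` with the <W> in its 'page<N>-<W>px-' segment set to `width`."""
    new_url, count = re.subn(
        r"(page\d+-)\d+(px-)",      # keep 'page<N>-' and 'px-', swap the digits
        rf"\g<1>{width}\g<2>",
        url,
        count=1,
    )
    if count == 0:
        raise ValueError("no 'page<N>-<W>px-' segment found in URL")
    return new_url


# The low-resolution image URL that pywikibot currently sends.
low_res = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/"
    "Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu/"
    "page141-987px-"
    "Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu.jpg"
)
# 1974 is the width the in-Wikisource OCR button requested for this page.
high_res = with_thumb_width(low_res, 1974)
print(high_res)
```

  This still depends on the scraped URL, so it does not fix the underlying
  problem; it only shows that the width is a plain parameter in the URL.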

TASK DETAIL
  https://phabricator.wikimedia.org/T352524

To: Ninovolador
Cc: Aklapper, pywikibot-bugs-list, Ninovolador, mevo, KLawal-WMF, PMenon-WMF, 
KSiebert, NRodriguez, PotsdamLamb, Osps7, Jyoo1011, JohnsonLee01, SHEKH, 
Dijkstra, Khutuck, Zkhalido, HMonroy, Viztor, Wenyi, Inductiveload, dmaza, 
Xover, Tbscho, MayS, Mdupont, JJMC89, B20180, Dvorapa, Bodhisattwa, 
Altostratus, TheresNoTime, Avicennasis, Samwilson, Nakon, mys_721tx, 
MusikAnimal, Xqt, jayvdb, Ricordisamoa, -jem-, Thurs, Masti, Alchimista, Krenair