Ninovolador created this task. Ninovolador added projects: Pywikibot, Wikimedia OCR. Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. Restricted Application added a project: Community-Tech.
TASK DESCRIPTION

As I said in the title: the proofreadpage module has a function, `url_image(self)`, that generates the URL of the image to send to the OCR web service. It is quite a bad function: it scrapes the URL and gets a lower-than-optimal resolution, which results in lower-quality OCR.

**Steps to replicate the issue**

- Use the pywikibot proofreadpage module to do OCR on a page. The OCR web service gets this URL: https://ocr.wmcloud.org/api.php?engine=tesseract&langs[]=es&image=https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu/page141-987px-Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu.jpg&uselang=es
- Use the in-Wikisource OCR button, and the OCR web service gets this URL: https://ocr.wmcloud.org/api.php?engine=tesseract&langs[]=es&image=https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/Origen_de_las_especies_por_medio_de_la_selecci%25C3%25B3n_natural.djvu/page141-1974px-Origen_de_las_especies_por_medio_de_la_selecci%25C3%25B3n_natural.djvu.jpg&line_id=&uselang=es

**What happens?**: In this particular case, pywikibot requests a 987px-wide image instead of a 1974px-wide one, i.e. an image with a quarter of the pixels, so the quality of the OCR is a lot worse.

**What should have happened instead?**: pywikibot should have a better way of obtaining the page image URL. Honestly, using beautifulsoup to scrape the URL is quite a bad idea.
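
To make the difference between the two requests concrete, here is a minimal sketch in Python. It assumes that thumbnail URLs follow the `page<N>-<W>px-` pattern seen in the two URLs above. `with_thumb_width` and `pixel_ratio` are hypothetical helpers written for illustration, not part of pywikibot or of the OCR tool, and rewriting the width this way is only a possible workaround, not necessarily how the Wikisource button obtains its URL.

```python
import re

# Matches the "page<N>-<W>px-" segment seen in the thumbnail URLs above.
# Assumption: this pattern holds for the DjVu page thumbnails in question.
_THUMB_WIDTH = re.compile(r"(/page\d+-)(\d+)(px-)")


def with_thumb_width(url: str, width: int) -> str:
    """Return `url` with its thumbnail width replaced by `width` (hypothetical helper)."""
    new_url, count = _THUMB_WIDTH.subn(rf"\g<1>{width}\g<3>", url, count=1)
    if count == 0:
        raise ValueError("no page<N>-<W>px- segment found in URL")
    return new_url


def pixel_ratio(small_width: int, large_width: int) -> float:
    """Ratio of pixel counts for two thumbnails of the same page (same aspect ratio)."""
    return (large_width / small_width) ** 2


# Image URL that pywikibot currently sends, copied from the first step above.
scraped = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/"
    "Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu/"
    "page141-987px-Origen_de_las_especies_por_medio_de_la_selecci%C3%B3n_natural.djvu.jpg"
)

print(with_thumb_width(scraped, 1974))  # same URL, but with page141-1974px-
print(pixel_ratio(987, 1974))           # 4.0
```

Doubling the width from 987px to 1974px quadruples the pixel count, which is where the "quarter of the pixels" figure comes from.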
Maybe someone at the Wikimedia OCR project can tell us how they manage to get the full-resolution image for every page.

TASK DETAIL
https://phabricator.wikimedia.org/T352524
