Just to add to Thomas's message, you can also get a list of the works on a Wikisource using the OAI-PMH interface.

Example for the Italian Wikisource:
https://it.wikisource.org/wiki/Special:ProofreadIndexOai?verb=ListRecords&metadataPrefix=prp_qdc&set=edizioni_wikisource

Documentation:
https://www.mediawiki.org/wiki/Extension:Proofread_Page#OAI-PMH

We are migrating our metadata to Wikidata, but that is an ongoing process that hasn't finished yet.
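A minimal sketch of harvesting that list, in Python. The XML fragment below is illustrative, not real output from the it.wikisource endpoint; a live harvester would fetch the ListRecords URL above with any HTTP client and feed the response bytes to the same parser.

```python
# Sketch: parse an OAI-PMH ListRecords response to enumerate works.
# SAMPLE is a hypothetical fragment standing in for the real endpoint output.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:it.wikisource.org:example-work-1</identifier></header>
    </record>
    <record>
      <header><identifier>oai:it.wikisource.org:example-work-2</identifier></header>
    </record>
  </ListRecords>
</OAI-PMH>"""

def list_identifiers(xml_text):
    """Return the record identifiers from a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(OAI + "identifier")]

if __name__ == "__main__":
    print(list_identifiers(SAMPLE))
```

A real harvester would also have to follow the OAI-PMH resumptionToken to page through large result sets.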
It would be interesting to know what kind of text you expect for training Tesseract, for example:

- Does it need markup-stripped text? What about page layout, bold text, italics, etc.?
- Most of the time we keep neither line breaks nor end-of-line hyphenation. Should training data include them?
- References are transcribed in-line and transcluded as footers when rendered. Should they be removed?

TBH, the biggest problem I have found in Latin-script languages is the lack of old-orthography dictionaries: for us, every time the OCR uses modern orthography instead of the printed orthography, that counts as an error, which means we have to "fix" well-recognized words. Or more mundane problems, like how to generate output without end-of-line hyphenation :)

And thanks for the compliments! I also think Tesseract is an amazing project, and I think many in our community would be very glad to help provide the best, freely reusable training data in the world :)

Cheers,
Micru

On Wed, Aug 13, 2014 at 10:40 PM, Thomas Tanon <thoma...@hotmail.fr> wrote:
>
> Thanks a lot for this nice answer,
>
> A technical answer to the question:
>
> Are there programmatic ways of getting at the data, for example
> downloading all page images and corresponding text that is marked as green,
> for a specific language / script?
>
> Yes, you can get the list of Page: pages (the pages that contain the
> wikitext for a given scan image) using this API request:
> https://en.wikisource.org/w/api.php?action=query&generator=allpages&gapnamespace=104&prop=proofread&format=json
> for en.wikisource, where the Page: namespace id is 104 (this id is not the
> same in all Wikisources) (doc: https://www.mediawiki.org/wiki/API:Allpages ).
>
> Then you can just retrieve the content of a "green" page (the ones with
> "quality": 4) using this API request:
> https://en.wikisource.org/w/api.php?action=query&prop=revisions&titles=Page:%22%27Keep%20%27em%20Flying%27%20is%20Our%20Battle%20Cry^%20First%20Class%20Fighting%20Men%20Needed.%22%20-%20NARA%20-%20513526.jpg&rvprop=content
> (doc: https://www.mediawiki.org/wiki/API:Properties#revisions_/_rv ).
>
> To get the image of a given Page: page, just use this API request:
> https://en.wikisource.org/w/api.php?action=query&titles=Image:Albert%20Einstein%20Head.jpg&prop=imageinfo&format=json&iiprop=url
> which retrieves the url of a file from its title (Page: pages have the page
> title "Page:NAME_OF_THE_FILE", sometimes followed by
> "/PAGE_NUMBER_IN_A_MULTIPAGE_FILE", so NAME_OF_THE_FILE gives you the name
> of the image to use).
>
> Thanks again,
>
> Thomas
>
>
> From: Nick White <nick.wh...@durham.ac.uk>
> Date: Tue, Aug 12, 2014 at 6:25 PM
> Subject: Re: [tesseract-ocr] Outreach from the Wikisource community
> To: tesseract-...@googlegroups.com
> Cc: "discussion list for Wikisource, the free library"
> <wikisource-l@lists.wikimedia.org>, David Cuenca <dacu...@gmail.com>
>
> Dear Wikisourcerers,
>
> It's good to hear from you. Wikisource is awesome, as far as I am
> concerned.
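The three API calls Thomas describes can be sketched as URL builders, together with the title-to-file-name rule he gives at the end. This is a minimal sketch with no network access; pointing these URLs at en.wikisource.org with any HTTP client should behave as described above (note the namespace id 104 is specific to en.wikisource).

```python
# Sketch of the three en.wikisource API requests described above.
from urllib.parse import urlencode

API = "https://en.wikisource.org/w/api.php"

def allpages_url(namespace=104):
    # 1) List Page: pages with their proofread status
    #    (the Page: namespace id is 104 on en.wikisource only).
    return API + "?" + urlencode({
        "action": "query", "generator": "allpages",
        "gapnamespace": namespace, "prop": "proofread", "format": "json",
    })

def revisions_url(title):
    # 2) Fetch the wikitext of one Page: page; a harvester would keep
    #    only the pages whose proofread status is "quality": 4 (green).
    return API + "?" + urlencode({
        "action": "query", "prop": "revisions", "titles": title,
        "rvprop": "content", "format": "json",
    })

def imageinfo_url(page_title):
    # 3) Resolve the scan image URL: the file name is the part of the
    #    Page: title before any "/PAGE_NUMBER" suffix.
    file_name = page_title.split(":", 1)[1].split("/")[0]
    return API + "?" + urlencode({
        "action": "query", "titles": "Image:" + file_name,
        "prop": "imageinfo", "iiprop": "url", "format": "json",
    })

if __name__ == "__main__":
    print(imageinfo_url("Page:Example.djvu/12"))
```

`urlencode` takes care of percent-escaping the titles, which matters for file names like the NARA poster in Thomas's example.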
>
> > One of the most serious issues was raised by the Belarusian community,
> > which uses 2 different scripts with no commercial OCR support. This
> > means that the volunteers have to type each word manually. We wondered
> > if it would be possible to train Tesseract to recognize these old texts
> > using the text that has been already typed.
>
> Actually, Tesseract should already have support for Russian and
> Belarusian "out of the box"; see the 'rus' and 'bel' training data.
>
> > We would like to know if you would be interested in exploring
> > collaboration possibilities. I imagine that with your guidance we
> > could prepare training data
>
> The first thing to do would be to take a look at the results you get
> from Tesseract with the rus and bel training sets already available,
> and let us know if they aren't appropriate.
>
> > not only in different languages, but also from different time
> > periods, scripts, etc.
>
> As to training for specific scripts, time periods, etc., in theory
> that is super cool; in practice probably one training set should be
> able to cover more or less everything (except very different
> scripts, like Fraktur). That has been my experience with training
> Ancient Greek (for which I have been interested in recognising
> printing from a variety of time periods).
>
> So give Tesseract a whirl, and if it isn't appropriate, or doesn't
> work for specific scripts, let us know and we can try to figure out
> a plan.
>
> > At the moment it is not very clear how to achieve this.
>
> My plan is to rewrite the training documentation very soon, so
> things should hopefully become clearer on that front.
>
> One thing that Wikisource could potentially do for us would be to
> provide loads of proofread, freely reusable "ground truth" data to
> test Tesseract with.
> Are there programmatic ways of getting at the data, for example
> downloading all page images and corresponding text that is marked as
> green, for a specific language / script?
>
> Thanks for getting in touch!
>
> Nick
>
> --
> Etiamsi omnes, ego non
>
> _______________________________________________
> Wikisource-l mailing list
> Wikisource-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l

--
Etiamsi omnes, ego non
_______________________________________________
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l