Here you go: https://github.com/pvanhoof/tracker-gnome/tree/wip/pvanhoof/ocr-pdf-support

Note that this doesn't yet do automatic rotating. I also think that,
instead of using pdftoppm, I could use Poppler's API to convert a page
into a temporary PPM file.

Carlos: I tried pushing this to a branch on git.gnome.org, but
apparently that fails nowadays *.

Kind regards,

Philip

* Attempt to push to git.gnome.org:

pvanhoof@lars:~/repos/gnome/tracker-miners$ git push origin wip/pvanhoof/ocr-pdf-support:wip/pvanhoof/ocr-pdf-support
Counting objects: 6, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 2.02 KiB | 0 bytes/s, done.
Total 6 (delta 5), reused 0 (delta 0)
remote: hooks/pre-receive: line 125: syntax error in conditional expression: unexpected token `;'
remote: hooks/pre-receive: line 125: syntax error near `;'
remote: hooks/pre-receive: line 125: ` if [[ $basedir = '/var/opt/gitlab/git-data/repositories/GNOME' || $basedir = '/git' || $basedir = '/var/opt/gitlab/git-data/repositories/Infrastructure']]; then'
To ssh://git.gnome.org/git/tracker-miners
 ! [remote rejected] wip/pvanhoof/ocr-pdf-support -> wip/pvanhoof/ocr-pdf-support (pre-receive hook declined)
error: failed to push some refs to 'ssh://pvanh...@git.gnome.org/git/tracker-miners'
pvanhoof@lars:~/repos/gnome/tracker-miners$ git remote add github g...@github.com:pvanhoof/tracker-gnome.git
pvanhoof@lars:~/repos/gnome/tracker-miners$ git push github wip/pvanhoof/ocr-pdf-support:wip/pvanhoof/ocr-pdf-support
Counting objects: 83907, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (17095/17095), done.
Writing objects: 100% (83907/83907), 32.94 MiB | 2.50 MiB/s, done.
Total 83907 (delta 66373), reused 83894 (delta 66363)
remote: Resolving deltas: 100% (66373/66373), done.
To github.com:pvanhoof/tracker-gnome.git
 * [new branch] wip/pvanhoof/ocr-pdf-support -> wip/pvanhoof/ocr-pdf-support
pvanhoof@lars:~/repos/gnome/tracker-miners$ git push github master:master
Total 0 (delta 0), reused 0 (delta 0)
To github.com:pvanhoof/tracker-gnome.git
 * [new branch] master -> master
pvanhoof@lars:~/repos/gnome/tracker-miners$

On Sat, 2018-02-24 at 12:33 +0100, Philip Van Hoof wrote:
> Hey us, the people who made Tracker,
>
> First time in my life I actually used the same software I worked on a
> few years ago. I bought myself a Medion laptop and cheap as it is, it
> of course spontaneously broke. So I had to find the invoice and other
> documents for it, so that I could bring it back to the store for RMA.
>
> Tried to search my PDFs (I'm one of those 'Yes we scan'-people, who
> scans all his documents before putting them in folders or bringing
> them to the accountant). Of course that didn't work, because the PDFs
> didn't have OCR applied to them by my dead-tree scanner apparatus.
>
> However, I made a little script that does that for me:
>
> pvanhoof@lars:~/Documents$ cat /usr/local/bin/fixpdfs.sh
> for a in *pdf; do pdftk $a cat 1-endwest output ROT-$a; pdfocr -i ROT-$a -o OCR-$a; rm ROT-$a; mv OCR-$a $a; done
> pvanhoof@lars:~/Documents$
>
> Now I was wondering: couldn't we add non-intrusive OCR to Tracker's
> PDF extractor? By that I mean we could let it do an OCR first, extract
> it that way, but don't write that to the original PDF (as our
> extractors should not modify the files). I guess we could use
> tracker-writeback (if that still works) to write the OCR into PDF
> files in case the user wants that.
>
> Given that not forgetting to run that damn script on my recently
> scanned PDFs is probably more time consuming over the span of one year
> than just adding it to tracker-extractor's PDF extractor, I might
> actually just do this myself. If somebody wants to beat me to it or
> join the fun, let me know.
>
> Thoughts?
>
> I think we'll
>
> a) See if the PDF already has text embedded or not
>
> b) Detect orientation and rotate the PDF to a temporary file. Else OCR
>    will not detect anything
>
> c) Link with an OCR library and enrich-first and/or extract the
>    detected text
>
> d) SPARQL-insert the text as nie:plainTextContent or something.
>
> Kind regards,
>
> Philip
>
> _______________________________________________
> tracker-list mailing list
> tracker-list@gnome.org
> https://mail.gnome.org/mailman/listinfo/tracker-list
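
Steps a) to c) above can be sketched as a small shell function. This is
only an illustration under stated assumptions, not the code on the
branch: it assumes pdftotext and pdftoppm from poppler-utils and the
tesseract command-line tool are installed, the names has_text and
extract_pdf_text are made up for this sketch, the 300 DPI render
resolution is an arbitrary choice, and step b) (orientation detection
and rotation) is left out. It only reads the PDF and writes to a
temporary directory, so the original file is not modified.

```shell
#!/bin/sh
# Sketch of non-intrusive OCR text extraction for one PDF (steps a and c).
# Assumes pdftotext, pdftoppm (poppler-utils) and tesseract are installed.
# has_text and extract_pdf_text are hypothetical names, not Tracker API.

# Succeed if the argument contains at least one non-whitespace character.
has_text() {
    [ -n "$(printf '%s' "$1" | tr -d '[:space:]')" ]
}

# Print the plain text of the PDF given as $1 on stdout.
extract_pdf_text() {
    pdf=$1

    # a) Use the embedded text layer when the PDF already has one.
    embedded=$(pdftotext "$pdf" - 2>/dev/null)
    if has_text "$embedded"; then
        printf '%s\n' "$embedded"
        return 0
    fi

    # c) Otherwise render each page to a temporary PPM file and OCR it.
    #    300 DPI is an assumed resolution, not a value from this thread.
    tmpdir=$(mktemp -d) || return 1
    if ! pdftoppm -r 300 "$pdf" "$tmpdir/page"; then
        rm -rf "$tmpdir"
        return 1
    fi
    for page in "$tmpdir"/page*.ppm; do
        [ -e "$page" ] || continue
        # "stdout" as the output base makes tesseract print the text.
        tesseract "$page" stdout 2>/dev/null
    done
    rm -rf "$tmpdir"
}
```

Usage would look like `extract_pdf_text invoice.pdf > invoice.txt`,
where invoice.pdf is a placeholder name. For a scanned PDF, pdftotext
typically prints nothing but page separators, which has_text treats as
empty, so the OCR path is taken.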