Re: [Tracker] The PDF extractor and OCR

Philip Van Hoof Tue, 27 Feb 2018 06:04:09 -0800

Here you go:

https://github.com/pvanhoof/tracker-gnome/tree/wip/pvanhoof/ocr-pdf-support

Note that this doesn't do automatic rotation yet. Also, instead of
calling pdftoppm, I think I could use Poppler's API to convert a page
into a temporary PPM file.
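
For reference, the pdftoppm route can be sketched roughly like this. It
is only an illustration, not the code in the branch: page_to_ppm is a
made-up helper name, and the 300 DPI resolution is an assumption.

```shell
#!/bin/sh
# Sketch: render one page of a PDF into a temporary PPM file with pdftoppm.
# page_to_ppm is a hypothetical helper, not part of Tracker.
page_to_ppm() {
    pdf=$1
    page=$2
    tmpdir=$(mktemp -d) || return 1
    # -f/-l select the first and last page to render, -r sets the DPI
    # (300 is an assumed value), and -singlefile makes pdftoppm write
    # exactly "$tmpdir/page.ppm" without a page-number suffix.
    pdftoppm -f "$page" -l "$page" -r 300 -singlefile "$pdf" "$tmpdir/page" || return 1
    # Print the path of the generated image for the caller.
    printf '%s\n' "$tmpdir/page.ppm"
}
```

Usage would be something like: ppm=$(page_to_ppm scan.pdf 1), after
which the caller hands "$ppm" to the OCR step and removes the temporary
directory.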

Carlos: I tried pushing this to a branch on git.gnome.org, but
apparently that fails nowadays *.

Kind regards,

Philip

* Attempt to push to git.gnome.org:

pvanhoof@lars:~/repos/gnome/tracker-miners$ git push origin wip/pvanhoof/ocr-pdf-support:wip/pvanhoof/ocr-pdf-support
Counting objects: 6, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 2.02 KiB | 0 bytes/s, done.
Total 6 (delta 5), reused 0 (delta 0)
remote: hooks/pre-receive: line 125: syntax error in conditional expression: unexpected token `;'
remote: hooks/pre-receive: line 125: syntax error near `;'
remote: hooks/pre-receive: line 125: `    if [[ $basedir = '/var/opt/gitlab/git-data/repositories/GNOME' || $basedir = '/git' || $basedir = '/var/opt/gitlab/git-data/repositories/Infrastructure']]; then'
To ssh://git.gnome.org/git/tracker-miners
 ! [remote rejected]     wip/pvanhoof/ocr-pdf-support -> wip/pvanhoof/ocr-pdf-support (pre-receive hook declined)
error: failed to push some refs to 'ssh://pvanh...@git.gnome.org/git/tracker-miners'
pvanhoof@lars:~/repos/gnome/tracker-miners$ git remote add github g...@github.com:pvanhoof/tracker-gnome.git
pvanhoof@lars:~/repos/gnome/tracker-miners$ git push github wip/pvanhoof/ocr-pdf-support:wip/pvanhoof/ocr-pdf-support
Counting objects: 83907, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (17095/17095), done.
Writing objects: 100% (83907/83907), 32.94 MiB | 2.50 MiB/s, done.
Total 83907 (delta 66373), reused 83894 (delta 66363)
remote: Resolving deltas: 100% (66373/66373), done.
To github.com:pvanhoof/tracker-gnome.git
 * [new branch]          wip/pvanhoof/ocr-pdf-support -> wip/pvanhoof/ocr-pdf-support
pvanhoof@lars:~/repos/gnome/tracker-miners$ git push github master:master
Total 0 (delta 0), reused 0 (delta 0)
To github.com:pvanhoof/tracker-gnome.git
 * [new branch]          master -> master
pvanhoof@lars:~/repos/gnome/tracker-miners$
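
Judging only from the error text above, the hook seems to trip over the
missing space before the closing "]]": bash requires "]]" to be a word
of its own, so 'Infrastructure']] is read as one token and the ";" that
follows is unexpected. I can't see the hook itself, so the following is
a guess at an equivalent check, not the real code. is_known_basedir is
a made-up name, and the case statement is a portable stand-in for the
bash-only "[[ ... ]]" form.

```shell
#!/bin/sh
# Hypothetical reconstruction of the check on line 125 of hooks/pre-receive.
# In bash the original would presumably parse once "]]" is separated by a space:
#   if [[ $basedir = '...' || $basedir = '/git' || $basedir = '...' ]]; then
# The same three-way comparison written as a portable case statement:
is_known_basedir() {
    case "$1" in
        /var/opt/gitlab/git-data/repositories/GNOME|/git|/var/opt/gitlab/git-data/repositories/Infrastructure)
            return 0 ;;   # one of the three expected base directories
        *)
            return 1 ;;   # anything else
    esac
}
```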

On Sat, 2018-02-24 at 12:33 +0100, Philip Van Hoof wrote:
> Hey us, the people who made Tracker,
> 
> First time in my life I actually used the same software I worked on a
> few years ago. I bought myself a Medion laptop and cheap as it is, it
> of course spontaneously broke. So I had to find the invoice and other
> documents for it, so that I could bring it back to the store for RMA.
> 
> Tried to search my PDFs (I'm one of those 'Yes we scan'-people, who
> scans all his documents before putting them in folders or bringing
> them to the accountant). Of course that didn't work, because the PDFs
> didn't have OCR applied to them by my dead-tree scanner apparatus.
> 
> However, I made a little script that does that for me:
> 
> pvanhoof@lars:~/Documents$ cat /usr/local/bin/fixpdfs.sh 
> for a in *pdf; do
>   pdftk "$a" cat 1-endwest output "ROT-$a"
>   pdfocr -i "ROT-$a" -o "OCR-$a"
>   rm "ROT-$a"; mv "OCR-$a" "$a"
> done
> pvanhoof@lars:~/Documents$
> 
> Now I was wondering: couldn't we add non-intrusive OCR to Tracker's
> PDF extractor? By that I mean we could let it run OCR first and
> extract the text that way, but not write the result to the original
> PDF (as our extractors should not modify the files). I guess we could
> use tracker-writeback (if that still works) to write the OCR text into
> PDF files in case the user wants that.
> 
> Given that remembering to run that damn script on my recently scanned
> PDFs probably costs more time over the span of a year than just adding
> this to tracker-extractor's PDF extractor, I might actually just do it
> myself. If somebody wants to beat me to it or join the fun, let me know.
> 
> Thoughts?
> 
> I think we'll
> 
> a) See if the PDF already has text embedded or not
> 
> b) Detect orientation and rotate the PDF to a temporary file.
> Otherwise OCR will not detect anything
> 
> c) Link with an OCR library and enrich-first and/or extract the
> detected text
> 
> d) SPARQL-insert the text as nie:plainTextContent or something.
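
As a rough illustration of step (a) and the decision it gates, here is a
minimal sketch that assumes the pdftotext tool from poppler-utils is
installed. pdf_has_text and extraction_mode are made-up helper names,
and treating "any alphanumeric character" as proof of embedded text is
an assumption, not how the extractor necessarily decides.

```shell
#!/bin/sh
# Sketch of step (a): does the PDF already carry extractable text?
# pdf_has_text and extraction_mode are hypothetical helpers.
pdf_has_text() {
    # "-" makes pdftotext write the text to stdout; grep -q succeeds
    # as soon as it sees at least one alphanumeric character.
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Decide which path a file takes: use the embedded text directly, or
# go through steps (b) and (c) (rotate, then OCR) first.
extraction_mode() {
    if pdf_has_text "$1"; then
        echo "embedded"
    else
        echo "needs-ocr"
    fi
}
```

A scanned PDF without a text layer typically yields only form feeds or
whitespace from pdftotext, so it would end up in the "needs-ocr" branch.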
> 
> Kind regards,
> 
> Philip
> 
> 
> 
> _______________________________________________
> tracker-list mailing list
> tracker-list@gnome.org
> https://mail.gnome.org/mailman/listinfo/tracker-list


