Re: [mySociety:public] diffing the DEbill

stef Mon, 26 Apr 2010 14:49:48 -0700

On Mon, Apr 26, 2010 at 10:29:55PM +0100, Richard Boulton wrote:
> On 26 April 2010 22:19, Francis Davey <[email protected]> wrote:
> > Ah, now I see exactly what you want and why you might want to automate
> > it. What you really want to do is automate the attribution of changes
> > to particular national negotiating groups, but that information
> > appears as PDF footnotes, and I am not sufficiently familiar with
> > tools to manipulate PDF's to help you there. Someone else on this list
> > may know though.
> 
> PDFs vary massively in how hard it is to automatically extract
> structure from them.


indeed. the leak was a scanned pdf, so we had to transcribe it first, and
we now have a text-only version of it. the second version was a text pdf.
for diffing we had to consolidate those two, which is quite hard.

> I've been playing with this quite a bit lately (really must write a blog
> post about it soon, and document some sample code I've been working on).
> Meanwhile, if you can point me at a few example PDFs and tell me what you'd
> like to automatically extract from them, I'd be happy to have a good go at
> them.

it's not only about extracting, but also about comparing to previous
versions. so the extracted text should have the same shape as the
earlier versions.
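
to illustrate what "similar" means here, a minimal sketch, assuming both
versions are already available as plain text. the function names
(normalize, diff_versions) and the "leak"/"release" labels are made up
for this example, and this is not the process we actually used. the idea
is just to strip layout noise (line wrapping, end-of-line hyphenation,
extra whitespace) so the diff only shows real changes in wording:

```python
import difflib
import re


def normalize(text):
    """Reduce layout noise so two extractions become comparable."""
    # Rejoin words hyphenated across a line break ("co-\noperate" -> "cooperate").
    # Assumption: a hyphen at end of line is a wrap artifact; this can
    # wrongly join genuinely hyphenated words.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat blank lines as paragraph separators.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse all runs of whitespace inside each paragraph to one space.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


def diff_versions(old_text, new_text):
    """Return a paragraph-level unified diff of two text versions."""
    old = normalize(old_text)
    new = normalize(new_text)
    return list(
        difflib.unified_diff(old, new, fromfile="leak", tofile="release", lineterm="")
    )


if __name__ == "__main__":
    old = "The parties shall co-\noperate on\nenforcement.\n\nArticle 2 stays."
    new = "The parties shall cooperate   on enforcement.\n\nArticle 2 is changed."
    print("\n".join(diff_versions(old, new)))
```

diffing per paragraph rather than per line is a choice made for this
sketch: a transcribed scan and a text pdf will almost never wrap lines in
the same places, so a line-level diff would mostly report noise.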

-- 
gpg: https://www.ctrlc.hu/~stef/stef.gpg
gpg fp: F617 AC77 6E86 5830 08B8  BB96 E7A4 C6CF A84A 7140

_______________________________________________
Mailing list [email protected]
Archive, settings, or unsubscribe:
https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public
