On Mon, Apr 26, 2010 at 10:29:55PM +0100, Richard Boulton wrote:
> On 26 April 2010 22:19, Francis Davey <[email protected]> wrote:
> > Ah, now I see exactly what you want and why you might want to automate
> > it. What you really want to do is automate the attribution of changes
> > to particular national negotiating groups, but that information
> > appears as PDF footnotes, and I am not sufficiently familiar with
> > tools to manipulate PDF's to help you there. Someone else on this list
> > may know though.
>
> PDFs vary massively in how hard it is to automatically extract
> structure from them.
indeed, the leak was a scanned pdf, so we had to transcribe it first, which gave us a text-only version. the second version was a text pdf. for diffing we had to consolidate those two, which is quite hard.

> I've been playing with this quite a bit lately (really must write a blog
> post about it soon, and document some sample code I've been working on).
> Meanwhile, if you can point me at a few example PDFs and tell me what you'd
> like to automatically extract from them, I'd be happy to have a good go at
> them.

it's not only about extracting, but also about comparing to previous versions, so the extracted text should be consistent with what we have for the earlier versions.

--
gpg: https://www.ctrlc.hu/~stef/stef.gpg
gpg fp: F617 AC77 6E86 5830 08B8 BB96 E7A4 C6CF A84A 7140
_______________________________________________
Mailing list [email protected]
Archive, settings, or unsubscribe:
https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public
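
a rough sketch of the consolidation step mentioned above, assuming both versions are already available as plain text (one transcribed by hand, one extracted from the text pdf). the function names and the paragraph-splitting rule are made up for illustration and are not the code actually used. the idea is to normalise line wrapping and whitespace first, so the diff only reports real changes in wording:

```python
import difflib
import re


def normalize(text):
    """Split text into paragraphs and collapse all whitespace inside each one.

    Assumption: paragraphs are separated by at least one blank line.
    Differences in line wrapping between the two sources then disappear.
    """
    paragraphs = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


def diff_versions(old_text, new_text):
    """Return a unified diff (as a list of lines) between two normalised texts."""
    return list(
        difflib.unified_diff(
            normalize(old_text),
            normalize(new_text),
            fromfile="transcribed",
            tofile="extracted",
            lineterm="",
        )
    )


if __name__ == "__main__":
    old = "The parties\nshall agree.\n\nArticle 2 stays."
    new = "The parties shall agree.\n\nArticle 2 changes."
    for line in diff_versions(old, new):
        print(line)
```

this only handles wrapping and spacing. transcription errors, footnotes and page headers would still show up as differences and need their own cleanup.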
