Re: PDF to CSV?
Dave Hodgkinson davehodg at gmail.com writes: I'm about to hit CPAN, but any wisdom from you lovely people would be nice! I've got bank statements in PDF from Barclays. Would it be easy to produce a CSV of the statement parts from them? What's the go-to PDF module?

I know this thread is a bit old, but I'm posting here in case this is helpful to anyone. I've just made a tool in node.js to help my girlfriend convert 18 months' worth of Barclays business account statement PDFs into CSV. If you want to give it a go, it's up on GitHub here: https://github.com/penrosestudio/barclays-bank-pdf-to-csv

Cheers, Jonathan
PDF to CSV?
I'm about to hit CPAN, but any wisdom from you lovely people would be nice! I've got bank statements in PDF from Barclays. Would it be easy to produce a CSV of the statement parts from them? What's the go-to PDF module?
Re: PDF to CSV?
I've got some code somewhere for doing this for HSBC's HTML statements. I tried it on their PDFs (the only format available for their credit card), but the PDF formatting was such a pain that I gave up. I thought Barclays let you export as CSV in any case? It might just be the last X months, I guess.

On 12 December 2013 10:47, Dave Hodgkinson daveh...@gmail.com wrote: I'm about to hit CPAN, but any wisdom from you lovely people would be nice! I've got bank statements in PDF from Barclays. Would it be easy to produce a CSV of the statement parts from them? What's the go-to PDF module?
Re: PDF to CSV?
pdftotext (from poppler-utils) does a good job of extracting text from PDFs; the rest should be text munching :) Ideally you'd want to target the information directly in the PDF structure, but I've got the feeling that's not easily done.

J.

On 12 December 2013 10:47, Dave Hodgkinson daveh...@gmail.com wrote: I'm about to hit CPAN, but any wisdom from you lovely people would be nice! I've got bank statements in PDF from Barclays. Would it be easy to produce a CSV of the statement parts from them? What's the go-to PDF module?

-- Jerome Eteve +44(0)7738864546 http://www.eteve.net/
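For anyone trying the pdftotext route, a minimal sketch of the extraction step might look like the following. Python is used purely for illustration. It assumes poppler-utils is installed and on the PATH; `pdf_to_text` and `non_blank_lines` are names made up for this example. The `-layout` option asks pdftotext to preserve the physical layout, which tends to keep each statement row on one line, and `-` sends the text to standard output.

```python
import shutil
import subprocess


def pdf_to_text(pdf_path):
    """Run pdftotext on one file and return the extracted text.

    Assumes the pdftotext binary (poppler-utils) is available on the PATH.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" writes to stdout
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def non_blank_lines(text):
    """Drop empty lines and trailing whitespace, keeping leading indentation."""
    return [line.rstrip() for line in text.splitlines() if line.strip()]
```

The "text munching" then starts from `non_blank_lines(pdf_to_text("statement.pdf"))`, where `statement.pdf` is a placeholder file name.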
Re: PDF to CSV?
1) xpdf's pdftotext CLI utility
2) regexp

On Thu, Dec 12, 2013 at 11:47 AM, Dave Hodgkinson daveh...@gmail.com wrote: I'm about to hit CPAN, but any wisdom from you lovely people would be nice! I've got bank statements in PDF from Barclays. Would it be easy to produce a CSV of the statement parts from them? What's the go-to PDF module?
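The regexp half of that recipe could be sketched like this, again in Python for illustration. The row layout is an assumption, not the real Barclays format: each transaction line is taken to be a date such as "12 Dec", a description, then at least two spaces and an amount. `ROW` and `lines_to_csv` are hypothetical names, and the pattern would need adjusting against real pdftotext output.

```python
import csv
import io
import re

# Assumed row shape: "<day> <Mon>  <description>   <amount>".
# This is a guess for illustration, not the actual statement layout.
ROW = re.compile(
    r"^\s*(?P<date>\d{1,2} [A-Z][a-z]{2})\s+"
    r"(?P<desc>.+?)\s{2,}"
    r"(?P<amount>[\d,]+\.\d{2})\s*$"
)


def lines_to_csv(lines):
    """Turn matching text lines into CSV; lines that do not match are skipped."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "description", "amount"])
    for line in lines:
        m = ROW.match(line)
        if m:
            # Strip thousands separators so the amount parses as a number.
            writer.writerow([m["date"], m["desc"], m["amount"].replace(",", "")])
    return out.getvalue()
```

Skipping non-matching lines silently is convenient for headers and footers, but it also hides rows the pattern misses, so it is worth counting the skipped lines on a real statement.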
Re: PDF to CSV?
Indeed, it's PDF only once you go further back in time. CAM::PDF has getpdftext.pl, which is where I'm currently positioning my yak.

On 12 Dec 2013, at 11:07, Leo Lapworth l...@cuckoo.org wrote: I've got some code somewhere for doing this for HSBC's HTML statements. I tried it on their PDFs (the only format available for their credit card), but the PDF formatting was such a pain that I gave up. I thought Barclays let you export as CSV in any case? It might just be the last X months, I guess.
Re: PDF to CSV?
OK, it puts each column on a new line, but that's not the end of the world.

On 12 Dec 2013, at 11:21, DAVID HODGKINSON daveh...@me.com wrote: Indeed, only PDF going back in time. CAM::PDF has getpdftext.pl which is where I'm currently positioning my yak.
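When the extractor emits one column per line like this, the fields can be regrouped into rows afterwards. The sketch below (Python, for illustration) assumes that every transaction starts with a line that looks like a date such as "12 Dec"; that assumption, and the names `DATE` and `group_fields`, are hypothetical and not taken from getpdftext.pl's actual output.

```python
import re

# Assumed marker for the start of a row: a bare date like "12 Dec".
DATE = re.compile(r"^\d{1,2} [A-Z][a-z]{2}$")


def group_fields(lines):
    """Collect one-field-per-line output into rows, starting a row at each date."""
    rows = []
    for line in lines:
        field = line.strip()
        if not field:
            continue  # ignore blank lines
        if DATE.match(field):
            rows.append([field])       # a date opens a new row
        elif rows:
            rows[-1].append(field)     # anything else belongs to the current row
        # fields seen before the first date (page headers etc.) are dropped
    return rows
```

Rows with an unexpected number of fields are a useful signal that the date heuristic has misfired.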
Re: PDF to CSV?
PDF is where data goes to die. I've been peripherally involved in extracting data from tables in scientific papers. It is fairly easy to extract text from a PDF, but not the formatting, which is liable to get *horribly* scrambled. If I were actually given the job, I'd be inclined to convert the table to an image, use OCR to extract the text and formatting, and then use the text extracted directly from the PDF to correct the misreads. Either that or look at getting the Mechanical Turk to do it.

You'll probably be able to hack something up to work with the bank statements, since they will all be in the same format, generated by the same program. But that format is liable to break regularly when the statement layout is altered, the program is updated or changed, or the stars are wrong.

-- Michael

On Thu, Dec 12, 2013 at 10:47 AM, Dave Hodgkinson daveh...@gmail.com wrote: I'm about to hit CPAN, but any wisdom from you lovely people would be nice! I've got bank statements in PDF from Barclays. Would it be easy to produce a CSV of the statement parts from them? What's the go-to PDF module?
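The "correct the OCR misreads using the PDF's own text" step could be approximated with fuzzy matching. This is only a sketch under stated assumptions: the OCR output and the PDF text are both already available as lists of tokens, `correct_tokens` is a made-up name, and Python's standard `difflib` stands in for whatever matching a real pipeline would use.

```python
import difflib


def correct_tokens(ocr_tokens, pdf_words, cutoff=0.8):
    """Replace each OCR token with its closest word from the PDF text, if any.

    Tokens with no sufficiently similar PDF word are left unchanged.
    """
    vocabulary = sorted(set(pdf_words))
    corrected = []
    for token in ocr_tokens:
        match = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else token)
    return corrected
```

One caveat: similar-looking amounts (12.50 and 12.60, say) are close matches too, so numeric fields probably want exact matching or a much higher cutoff than words do.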
Re: PDF to CSV?
Hi David,

http://search.cpan.org/~audreyt/Template-Extract-0.41/lib/Template/Extract.pm could maybe work better for extracting formatted text like this.

A
Re: PDF to CSV?
pdftotext++

I've had lots of success with it for a variety of use cases. I wouldn't bother with a more robust library-based solution for personal data-mangling problems.

On 12/12/2013, at 10:17 PM, Stanislaw Pusep wrote: 1) xpdf's pdftotext CLI utility 2) regexp
Re: PDF to CSV?
Sadly, that failed on a Barclays statement.

On 12 Dec 2013, at 11:50, Kieren Diment dim...@gmail.com wrote: pdftotext++ I've had lots of success with that for a variety of use-cases. I wouldn't bother with a more robust library based solution for personal data mangling problems.
Re: PDF to CSV?
Not sure what you're trying to tell me here. It can read PDF? What?

On 12 Dec 2013, at 11:49, Aaron Trevena aaron.trev...@gmail.com wrote: Hi David, http://search.cpan.org/~audreyt/Template-Extract-0.41/lib/Template/Extract.pm could work better for extracting formatted text like this maybe A
Re: PDF to CSV?
https://www.pdftoexcel.org/ seems to have done a halfway passable job.

On 12 Dec 2013, at 10:47, Dave Hodgkinson daveh...@gmail.com wrote: I'm about to hit CPAN, but any wisdom from you lovely people would be nice! I've got bank statements in PDF from Barclays. Would it be easy to produce a CSV of the statement parts from them? What's the go-to PDF module?
Re: PDF to CSV?
On 12 Dec 2013, at 12:41, DAVID HODGKINSON wrote: Not sure what you're trying to tell me here. It can read PDF? What?

PDF files do have plain text in them; it's just wrapped in markup, control characters and binary blobs (for things like embedded images and fonts). It's possible that the data you want can be extracted by finding the appropriate bit of text in the file and using the code around it as a match in Template::Extract.

-- David Dorward http://dorward.co.uk/
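To see what that wrapped text looks like in practice, here is a rough Python sketch that pulls literal strings out of a PDF's content streams. It makes several simplifying assumptions that many real PDFs will not satisfy: streams are either uncompressed or Flate (zlib) compressed, text is drawn with the simple `(string) Tj` operator, and string bytes map directly to characters. It ignores `TJ` arrays, hex strings and custom font encodings, and `extract_strings` is a name made up for this example.

```python
import re
import zlib

# A stream's raw bytes sit between the "stream" and "endstream" keywords.
STREAM = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
# "(some text) Tj" shows a literal string; backslash escapes are allowed inside.
TEXT_OP = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj", re.DOTALL)


def _inflate(raw):
    """Return zlib-decompressed bytes, or the input unchanged if it is not zlib data."""
    try:
        return zlib.decompressobj().decompress(raw)
    except zlib.error:
        return raw


def extract_strings(pdf_bytes):
    """List the literal strings shown with Tj, in the order they appear in the file."""
    found = []
    for match in STREAM.finditer(pdf_bytes):
        data = _inflate(match.group(1))
        for text in TEXT_OP.findall(data):
            # Escape sequences such as \( are left undecoded in this sketch.
            found.append(text.decode("latin-1"))
    return found
```

The strings come back in file order, which is not necessarily reading order, so the output would still need matching or reordering before it resembles a table.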
Re: PDF to CSV?
On 12 Dec 2013, at 13:39, David Dorward da...@dorward.me.uk wrote: PDF files do have plain text in them; it's just wrapped in markup, control characters and binary blobs (for things like embedded images and fonts). It's possible that the data you want can be extracted by finding the appropriate bit of text in the file and using the code around it as a match in Template::Extract.

My life is diminishing too rapidly for that.
Re: PDF to CSV?
On Thu, Dec 12, 2013 at 11:38:02AM +, Michael Lush wrote: PDF is where data goes to die. I've been peripherally involved in extracting data from tables in scientific papers. It is fairly easy to extract text from a PDF, but not the formatting, which is liable to get *horribly* scrambled.

I have tried extracting words from a 4-part vocal item and ended up with We / We / We / We / wish / wish / wish / wish / you / you / you / you / (where / represents a newline). The problem is that the order of the text elements seems to be at the whim of the program producing the PDF, though I haven't investigated in detail.
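If the extractor can also report where each piece of text sits on the page, the scrambled order can be repaired by sorting on position rather than trusting file order. A minimal Python sketch, assuming each fragment comes with x and y coordinates and that y grows down the page; the `Fragment` type, `reading_order` and the tolerance value are all hypothetical.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float    # horizontal position of the fragment
    y: float    # vertical position; assumed to increase down the page
    text: str


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by similar y, then read each line left to right."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # A fragment joins the current line if its y is close to that line's first fragment.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]
```

With native PDF coordinates, where y grows up the page, the vertical sort would need reversing.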