Re: PDF to CSV?

2014-09-16 Thread Jonathan Lister
Dave Hodgkinson davehodg at gmail.com writes:

 
 I'm about to hit CPAN, but any wisdom from you lovely people
 would be nice!
 
 I've got bank statements in PDF from Barclays. Would it be easy
 to produce a CSV of the statement parts from them?
 
 What's the go-to PDF module?
 
 


I know this thread is a bit old, but I'm posting here in
case this is helpful to anyone...

I've just made a tool in node.js to help my girlfriend convert
18 months' worth of Barclays business account statement
PDFs into CSV.

If you want to give it a go, it's up on GitHub here:
https://github.com/penrosestudio/barclays-bank-pdf-to-csv.

Cheers,


Jonathan



PDF to CSV?

2013-12-12 Thread Dave Hodgkinson
I'm about to hit CPAN, but any wisdom from you lovely people
would be nice!

I've got bank statements in PDF from Barclays. Would it be easy
to produce a CSV of the statement parts from them?

What's the go-to PDF module?


Re: PDF to CSV?

2013-12-12 Thread Leo Lapworth
I've got some code somewhere for doing this for HSBC's HTML statements.

I tried for their PDFs (which is the only available format for their
credit card) but the formatting (of the PDF) was such a pain that I
gave up.

I thought Barclays let you export as CSV in any case? It might just be
the last X months, I guess.

On 12 December 2013 10:47, Dave Hodgkinson daveh...@gmail.com wrote:
 I'm about to hit CPAN, but any wisdom from you lovely people
 would be nice!

 I've got bank statements in PDF from Barclays. Would it be easy
 to produce a CSV of the statement parts from them?

 What's the go-to PDF module?


Re: PDF to CSV?

2013-12-12 Thread Jérôme Étévé
pdftotext (from poppler-utils) does a good job at extracting text from
PDFs, the rest should be text munching :)

Ideally you'd want to target information directly in the PDF
structure. I've got the feeling that's not easily done.

J.

On 12 December 2013 10:47, Dave Hodgkinson daveh...@gmail.com wrote:
 I'm about to hit CPAN, but any wisdom from you lovely people
 would be nice!

 I've got bank statements in PDF from Barclays. Would it be easy
 to produce a CSV of the statement parts from them?

 What's the go-to PDF module?



-- 
Jerome Eteve
+44(0)7738864546
http://www.eteve.net/


Re: PDF to CSV?

2013-12-12 Thread Stanislaw Pusep
1) xpdf's pdftotext CLI utility
2) regexp
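
A rough sketch of those two steps, in Python rather than Perl. The row
layout in the regex (date, description, amount, balance) is a guess for
illustration and not the real Barclays format, and the name
statement_text_to_csv is made up. It assumes the text has already been
produced with something like `pdftotext -layout statement.pdf -`.

```python
import csv
import io
import re

# Hypothetical row layout, e.g.:
#   "12 Dec  Card payment TESCO      23.45    1,234.56"
# Real statements will differ, so expect to adjust this pattern.
ROW = re.compile(
    r"^\s*(?P<date>\d{1,2} [A-Z][a-z]{2})\s+"
    r"(?P<desc>.+?)\s{2,}"
    r"(?P<amount>[\d,]+\.\d{2})\s+"
    r"(?P<balance>[\d,]+\.\d{2})\s*$"
)


def statement_text_to_csv(text):
    """Turn layout-preserving statement text into CSV, one row per match."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "description", "amount", "balance"])
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # headers, footers, page numbers and so on
        writer.writerow([
            m["date"],
            m["desc"].strip(),
            m["amount"].replace(",", ""),   # drop thousands separators
            m["balance"].replace(",", ""),
        ])
    return out.getvalue()
```

Lines that do not match are silently skipped, so it is worth counting the
rows against the statement by eye the first time round.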



On Thu, Dec 12, 2013 at 11:47 AM, Dave Hodgkinson daveh...@gmail.com wrote:

 I'm about to hit CPAN, but any wisdom from you lovely people
 would be nice!

 I've got bank statements in PDF from Barclays. Would it be easy
 to produce a CSV of the statement parts from them?

 What's the go-to PDF module?



Re: PDF to CSV?

2013-12-12 Thread DAVID HODGKINSON
Indeed, only PDF going back in time.

CAM::PDF has getpdftext.pl which is where I'm currently positioning
my yak.


On 12 Dec 2013, at 11:07, Leo Lapworth l...@cuckoo.org wrote:

 I've got some code somewhere for doing this for HSBC's HTML statements
 
 I tried for their PDF's (which is the only available format for their
 credit card) but the formatting (of the PDF) was such a pain that I
 gave up.
 
 I thought Barclays let you export as csv in any case? - might just be
 last X months I guess
 
 On 12 December 2013 10:47, Dave Hodgkinson daveh...@gmail.com wrote:
 I'm about to hit CPAN, but any wisdom from you lovely people
 would be nice!
 
 I've got bank statements in PDF from Barclays. Would it be easy
 to produce a CSV of the statement parts from them?
 
 What's the go-to PDF module?



Re: PDF to CSV?

2013-12-12 Thread DAVID HODGKINSON
OK, it puts each column on a new line, but that's not the end of the
world.
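
One way to stitch one-field-per-line output back into rows, sketched in
Python. It assumes every transaction starts with a line that looks like a
date such as "12 Dec"; that pattern and the name regroup are illustrative
guesses, not something taken from a real statement.

```python
import re

# Assumption: each transaction starts with a date-like line, e.g. "12 Dec".
DATE = re.compile(r"^\d{1,2} [A-Z][a-z]{2}$")


def regroup(lines):
    """Collect one-field-per-line text into rows, splitting at date lines."""
    rows = []
    for raw in lines:
        field = raw.strip()
        if not field:
            continue                  # ignore blank lines
        if DATE.match(field):
            rows.append([field])      # a date opens a new row
        elif rows:
            rows[-1].append(field)    # anything else joins the current row
        # fields seen before the first date (page headers) are dropped
    return rows
```

Rows may come out with different lengths if a description wraps onto a
second line, so some tidying up afterwards is likely.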


On 12 Dec 2013, at 11:21, DAVID HODGKINSON daveh...@me.com wrote:

 Indeed, only PDF going back in time.
 
 CAM::PDF has getpdftext.pl which is where I'm currently positioning
 my yak.
 
 
 On 12 Dec 2013, at 11:07, Leo Lapworth l...@cuckoo.org wrote:
 
 I've got some code somewhere for doing this for HSBC's HTML statements
 
 I tried for their PDF's (which is the only available format for their
 credit card) but the formatting (of the PDF) was such a pain that I
 gave up.
 
 I thought Barclays let you export as csv in any case? - might just be
 last X months I guess
 
 On 12 December 2013 10:47, Dave Hodgkinson daveh...@gmail.com wrote:
 I'm about to hit CPAN, but any wisdom from you lovely people
 would be nice!
 
 I've got bank statements in PDF from Barclays. Would it be easy
 to produce a CSV of the statement parts from them?
 
 What's the go-to PDF module?
 



Re: PDF to CSV?

2013-12-12 Thread Michael Lush
PDF is where data goes to die.

I've been peripherally involved in extracting data from tables in
scientific papers. It is fairly easy to extract text from a PDF, but not
the formatting, which is liable to get *horribly* scrambled.

If I were actually given the job I'd be inclined to convert the table to an
image and use OCR to extract the text and formatting, and then use the text
directly extracted from the PDF to correct the misreads. Either that or
look at getting the Mechanical Turk to do it.

You'll probably be able to hack something up to work with the bank
statements, since they will all be in the same format generated by the same
program, but that format is liable to break regularly when the statement
layout is altered, or the program is updated/changed, or the stars are wrong.

--
Michael

On Thu, Dec 12, 2013 at 10:47 AM, Dave Hodgkinson daveh...@gmail.com wrote:

 I'm about to hit CPAN, but any wisdom from you lovely people
 would be nice!

 I've got bank statements in PDF from Barclays. Would it be easy
 to produce a CSV of the statement parts from them?

 What's the go-to PDF module?



Re: PDF to CSV?

2013-12-12 Thread Aaron Trevena
Hi David,

http://search.cpan.org/~audreyt/Template-Extract-0.41/lib/Template/Extract.pm
could work better for extracting formatted text like this maybe

A


Re: PDF to CSV?

2013-12-12 Thread Kieren Diment
pdftotext++. I've had lots of success with that for a variety of use cases. I
wouldn't bother with a more robust library-based solution for personal
data-mangling problems.

On 12/12/2013, at 10:17 PM, Stanislaw Pusep wrote:

 1) xpdf's pdftotext CLI utility
 2) regexp
 
 
 
 On Thu, Dec 12, 2013 at 11:47 AM, Dave Hodgkinson daveh...@gmail.com wrote:
 
 I'm about to hit CPAN, but any wisdom from you lovely people
 would be nice!
 
 I've got bank statements in PDF from Barclays. Would it be easy
 to produce a CSV of the statement parts from them?
 
 What's the go-to PDF module?
 




Re: PDF to CSV?

2013-12-12 Thread DAVID HODGKINSON
Sadly, that failed on a Barclays statement.


On 12 Dec 2013, at 11:50, Kieren Diment dim...@gmail.com wrote:

 pdftotext++ I've had lots of success with that for a variety of use-cases.  I 
 wouldn't bother with a more robust library based solution for personal data 
 mangling problems.
 
 On 12/12/2013, at 10:17 PM, Stanislaw Pusep wrote:
 
 1) xpdf's pdftotext CLI utility
 2) regexp
 
 
 
 On Thu, Dec 12, 2013 at 11:47 AM, Dave Hodgkinson daveh...@gmail.com wrote:
 
 I'm about to hit CPAN, but any wisdom from you lovely people
 would be nice!
 
 I've got bank statements in PDF from Barclays. Would it be easy
 to produce a CSV of the statement parts from them?
 
 What's the go-to PDF module?
 
 
 




Re: PDF to CSV?

2013-12-12 Thread DAVID HODGKINSON
Not sure what you're trying to tell me here. It can read PDF? What?


On 12 Dec 2013, at 11:49, Aaron Trevena aaron.trev...@gmail.com wrote:

 Hi David,
 
 http://search.cpan.org/~audreyt/Template-Extract-0.41/lib/Template/Extract.pm
 could work better for extracting formatted text like this maybe
 
 A




Re: PDF to CSV?

2013-12-12 Thread Dave Hodgkinson
https://www.pdftoexcel.org/ seems to have done a halfway passable job.


On 12 Dec 2013, at 10:47, Dave Hodgkinson daveh...@gmail.com wrote:

 
 I'm about to hit CPAN, but any wisdom from you lovely people
 would be nice!
 
 I've got bank statements in PDF from Barclays. Would it be easy
 to produce a CSV of the statement parts from them?
 
 What's the go-to PDF module?



Re: PDF to CSV?

2013-12-12 Thread David Dorward

On 12 Dec 2013, at 12:41, DAVID HODGKINSON wrote:


Not sure what you're trying to tell me here. It can read PDF? What?


PDF files do have plain text in them; it is just wrapped in markup,
control characters and binary blobs (for things like embedded images and 
fonts).


It's possible that the data you want can be extracted from them by 
finding the appropriate bit of text in the file and using the code 
around it as a match in Template::Extract.
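
To illustrate the idea (this is a Python sketch of template-style
matching, not the Template::Extract API): a template with named slots is
compiled into a regex, and the literal text around each slot acts as the
anchor. The "BT (...) Tj ... ET" template in the usage note is only a
plausible-looking example, and it assumes the text in the file is
readable as-is, which will not hold if the page content is compressed.

```python
import re


def template_to_regex(template):
    """Compile 'literal {name} literal' into a regex with named groups."""
    pattern = ""
    # re.split with a capturing group alternates literal text and slot names
    for i, part in enumerate(re.split(r"\{(\w+)\}", template)):
        if i % 2:
            pattern += "(?P<%s>.+?)" % part   # slot: capture lazily
        else:
            pattern += re.escape(part)        # literal: match exactly
    return re.compile(pattern)


def extract(template, text):
    """Return one dict of slot values per place the template matches."""
    return [m.groupdict() for m in template_to_regex(template).finditer(text)]
```

For example, extract("BT ({date}) Tj ({amount}) Tj ET", raw_text) would
return one dict per matching run of text, keyed by "date" and "amount".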



--
David Dorward
http://dorward.co.uk/


Re: PDF to CSV?

2013-12-12 Thread DAVID HODGKINSON

On 12 Dec 2013, at 13:39, David Dorward da...@dorward.me.uk wrote:

 On 12 Dec 2013, at 12:41, DAVID HODGKINSON wrote:
 
 Not sure what you're trying to tell me here. It can read PDF? What?
 
 PDF files do have plain text in them; it is just wrapped in markup, control 
 characters and binary blobs (for things like embedded images and fonts).
 
 It's possible that the data you want can be extracted from them by finding 
 the appropriate bit of text in the file and using the code around it as a 
 match in Template::Extract.

My life is diminishing too rapidly for that.


Re: PDF to CSV?

2013-12-12 Thread andrew-perl08
On Thu, Dec 12, 2013 at 11:38:02AM +, Michael Lush wrote:
 PDF is where data goes to die.
 
 I've been peripherally involved in extracting data from tables in
 scientific papers. It is fairly easy to extract text from a PDF, but not
 the formatting, which is liable to get *horribly* scrambled.

I have tried extracting words from a 4-part vocal item and ended up with

We / We / We / We / wish / wish / wish / wish / you / you / you / you /

(where / represents a newline). The problem is that the order of the text
elements seems to be at the whim of the program producing the PDF, though I
haven't investigated in detail.
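
If the extraction tool can report where each text element sits on the
page, the reading order can be rebuilt by position instead of file order.
A minimal Python sketch, assuming fragments arrive as (x, y, text) tuples
with y growing down the page; the tuple shape, the tolerance value and the
name reading_order are all assumptions for illustration.

```python
from typing import List, Tuple

Fragment = Tuple[float, float, str]  # (x, y, text); y grows down the page


def reading_order(fragments: List[Fragment], tolerance: float = 2.0) -> List[str]:
    """Group fragments into lines by similar y, then order each line by x."""
    lines = []  # each entry: [y of the line, list of (x, text)]
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(lines[-1][0] - y) <= tolerance:
            lines[-1][1].append((x, text))   # same line as the previous fragment
        else:
            lines.append([y, [(x, text)]])   # start a new line
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]
```

The tolerance is there because fragments on the same visual line rarely
share exactly the same baseline.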