Re: How to logically read text from a PDF table?

Dane Bezuidenhout Tue, 18 Jul 2017 08:36:10 -0700

Hi Manuel,

Thank you for the fast response. I will investigate Tabula.


Regards,

Dane

Dane Bezuidenhout
SprintHive <https://sprinthive.com/>

M: +27 82 562 7850


vCard <http://www.sprinthive.com/files/dane.vcf>

On Tue, Jul 18, 2017 at 5:31 PM, Manuel Aristarán <[email protected]>
wrote:

> Hi Dane,
>
> As you might know, there's no such thing as tables in PDF files. The only
> way to extract them is to try to reconstruct the tabular arrangement from
> the characters' positions, ruling lines, and so on. I'm one of the
> maintainers of Tabula [1], which is a tool based on PDFBox that implements
> a number of algorithms to attempt that. We have a GUI tool [2], and a Java
> library [3]. Both are open source (MIT license).
>
> Best,
>
> [1] http://tabula.technology
> [2] https://github.com/tabulapdf/tabula
> [3] https://github.com/tabulapdf/tabula-java
>
> --
> Manuel Aristarán
> jazzido.com
>
>
>
> On Tue, Jul 18, 2017 at 9:28 AM, Dane Bezuidenhout <
> [email protected]> wrote:
>
> > The examples available are clear on constructing a table, but there is
> > little info on reading a table. I've investigated a few solutions to this,
> > but feel that they are "hacky" in that they rely on establishing column and
> > row regions to read text from.
> >
> > Surely there is a canonical way to traverse the PDDocument table elements
> > and access table cells with reference to row and columns?
> >
> > Any advice would be appreciated.
> >
> >
> > Dane Bezuidenhout
> > SprintHive <https://sprinthive.com/>
> >
> > M: +27 82 562 7850
> >
> >
> > vCard <http://www.sprinthive.com/files/dane.vcf>
> >
>
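
A minimal sketch of the position-based reconstruction described above, in
plain Java. It is not Tabula's algorithm and not a PDFBox API: the Fragment
class, the method names, and the tolerance values are all hypothetical. It
assumes each text fragment already carries a left x and a y coordinate (in
PDFBox these would presumably come from the TextPosition objects that
PDFTextStripper passes to writeString), that y grows downward, and that
cells in the same column are roughly left-aligned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TableSketch {

    /** Hypothetical positioned text fragment: left x, y, and its text. */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Sorts the values and splits them into clusters: a gap larger than
     * tol starts a new cluster. Returns the first value of each cluster.
     */
    static List<Double> clusterStarts(List<Double> values, double tol) {
        List<Double> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.naturalOrder());
        List<Double> starts = new ArrayList<>();
        double previous = 0;
        for (double v : sorted) {
            if (starts.isEmpty() || v - previous > tol) {
                starts.add(v);
            }
            previous = v;
        }
        return starts;
    }

    /** Index of the anchor closest to value. */
    static int nearest(List<Double> anchors, double value) {
        int best = 0;
        for (int i = 1; i < anchors.size(); i++) {
            if (Math.abs(anchors.get(i) - value) < Math.abs(anchors.get(best) - value)) {
                best = i;
            }
        }
        return best;
    }

    /** Rebuilds a row-by-column grid of cell text from fragment positions. */
    static List<List<String>> toGrid(List<Fragment> fragments, double rowTol, double colTol) {
        List<Double> xs = new ArrayList<>();
        List<Double> ys = new ArrayList<>();
        for (Fragment f : fragments) {
            xs.add(f.x);
            ys.add(f.y);
        }
        List<Double> cols = clusterStarts(xs, colTol);
        List<Double> rows = clusterStarts(ys, rowTol);

        // Start with an empty grid so that missing cells stay as "".
        List<List<String>> grid = new ArrayList<>();
        for (int r = 0; r < rows.size(); r++) {
            List<String> row = new ArrayList<>();
            for (int c = 0; c < cols.size(); c++) {
                row.add("");
            }
            grid.add(row);
        }
        // Drop each fragment into the nearest row and column.
        for (Fragment f : fragments) {
            List<String> row = grid.get(nearest(rows, f.y));
            int c = nearest(cols, f.x);
            String existing = row.get(c);
            row.set(c, existing.isEmpty() ? f.text : existing + " " + f.text);
        }
        return grid;
    }

    public static void main(String[] args) {
        // Made-up coordinates for a 3x2 table with one empty cell.
        List<Fragment> fragments = Arrays.asList(
                new Fragment(50.0, 100.0, "Name"),
                new Fragment(200.0, 100.0, "Qty"),
                new Fragment(50.4, 120.0, "Apple"),
                new Fragment(201.0, 120.5, "3"),
                new Fragment(50.0, 140.0, "Pear"));
        for (List<String> row : toGrid(fragments, 3.0, 5.0)) {
            System.out.println(row);
        }
    }
}
```

Real documents break these assumptions quickly (right-aligned numbers,
cells spanning several lines, merged cells), which is presumably why a tool
such as Tabula also looks at ruling lines and uses more than one algorithm.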
