Hi Tim,

How does one configure PDFParserConfig in tika-config.xml? Maybe as one
of the PDFParser properties?
PDFParser.setSortByPosition (and the other simple setters) are deprecated, so
setting 'sortByPosition' as one of the PDFParser properties goes via the
deprecated call path (probably not a big deal though :-)).
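For reference, a minimal sketch of what such an entry might look like. It
assumes the <params> syntax that Tika 1.x uses for parser settings in
tika-config.xml and has not been verified against this setup; the
DefaultParser exclusion is there only so that the PDFParser is not
registered twice:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parsers, but leave out the stock PDFParser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Register the PDFParser again with sortByPosition turned on -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```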
I also looked at the source and I'm still not sure which ContentHandler you
used to get the HTML tags added.
(I may experiment with a custom one sitting on top of it, maybe adding the
table tags...)
Sergey

On Thu, Jul 11, 2019 at 9:52 PM Sergey Beryozkin <sberyoz...@gmail.com>
wrote:

> Hi Tim
>
> Thanks, I'm going to experiment with a few sufficiently complex PDFs
> in order to figure out how to enhance the Quarkus Tika extension, what to
> let users customize, etc. (I'll link to it in a follow-up email).
> Your output looks better :-). Which ContentHandler did you use?
>
> Sergey
>
> On Thu, Jul 11, 2019 at 7:23 PM Tim Allison <talli...@apache.org> wrote:
>
>> Might not need to break out the neural nets just yet...try turning on
>> sortByPosition via the PDFParserConfig and/or tika-config.xml.
>>
>> This is what you get:
>>
>>
>>
>> <title>PDF Invoice Example</title>
>> </head>
>> <body><div class="page"><p />
>> <p>Invoice
>> </p>
>> <p>From: Invoice Number INV-3337
>> </p>
>> <p>DEMO - Sliced Invoices Order Number 12345
>> Suite 5A-1204 Invoice Date January 25, 2016
>> 123 Somewhere Street Due Date January 31, 2016
>> Your City AZ 12345
>> ad...@slicedinvoices.com Total Due $93.50
>> </p>
>> <p>To:
>> Test Business
>> 123 Somewhere St
>> Melbourne, VIC 3000
>> t...@test.com
>> </p>
>> <p>Hrs/Qty Service Rate/Price Adjust Sub Total
>> </p>
>> <p>1.00 Web DesignThis is a sample description... $85.00 0.00% $85.00
>> </p>
>> <p>Pa
>> idSub Total $85.00
>> </p>
>> <p>Tax $8.50
>> Total $93.50
>> </p>
>> <p>ANZ Bank
>> ACC # 1234 1234
>> BSB # 4321 432
>> </p>
>> <p>Payment is due within 30 days from date of invoice. Late payment is
>> subject to fees of 5% per month.
>> Thanks for choosing DEMO - Sliced Invoices | ad...@slicedinvoices.com
>> Page 1/1</p>
>> <p />
>> <div class="annotation"><a
>> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo
>> </a></div>
>> <div class="annotation"><a
>> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo
>> </a></div>
>> <div class="annotation"><a
>> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo
>> </a></div>
>> <div class="annotation"><a
>> href="mailto:ad...@slicedinvoices.com">mailto:ad...@slicedinvoices.com
>> </a></div>
>> </div>
>> </body></html>
>>
>> On Thu, Jul 11, 2019 at 1:25 PM Sergey Beryozkin <sberyoz...@gmail.com>
>> wrote:
>> >
>> > Hi
>> >
>> > I've used Tika to parse this invoice PDF:
>> >
>> > https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
>> >
>> > (AutoDetectParser, ToTextContentHandler), see below what is returned.
>> > The numbers like (1), (2) were added by me; they show the preferred
>> > order (approximately).
>> >
>> > Is it possible to hint somehow to Tika how to report the content ?
>> >
>> > Thanks Sergey
>> >
>> > PDF Invoice Example
>> > Invoice
>> >
>> > (5)Payment is due within 30 days from date of invoice. Late payment is
>> > subject to fees of 5% per month.
>> >
>> > Thanks for choosing DEMO - Sliced Invoices | ad...@slicedinvoices.com
>> >
>> > Page 1/1
>> >
>> > (2)From:
>> >
>> > DEMO - Sliced Invoices
>> >
>> > Suite 5A-1204
>> >
>> > 123 Somewhere Street
>> >
>> > Your City AZ 12345
>> >
>> > ad...@slicedinvoices.com
>> >
>> > (1)Invoice Number INV-3337
>> >
>> > Order Number 12345
>> >
>> > Invoice Date January 25, 2016
>> >
>> > Due Date January 31, 2016
>> >
>> > Total Due $93.50
>> >
>> > (3)To:
>> >
>> > Test Business
>> >
>> > 123 Somewhere St
>> >
>> > Melbourne, VIC 3000
>> >
>> > t...@test.com
>> >
>> > (4) Hrs/Qty Service Rate/Price Adjust Sub Total
>> >
>> > 1.00
>> > Web Design
>> > This is a sample description...
>> >
>> > $85.00 0.00% $85.00
>> >
>> > Sub Total $85.00
>> >
>> > Tax $8.50
>> >
>> > Total $93.50
>> >
>> > (5) ANZ Bank
>> >
>> > ACC # 1234 1234
>> >
>> > BSB # 4321 432 Pa
>> > id
>>
>
