Hi Tim,

How does one configure PDFParserConfig in tika-config.xml? Maybe as one of the PDFParser properties? PDFParser.setSortByPosition (and the other simple setters) are deprecated, so setting 'sortByPosition' as one of the PDFParser properties goes via the deprecated call path (probably not a big deal though :-)).

I also looked at the source and I'm still not sure which ContentHandler you used to get the HTML tags added. (I may experiment with a custom one sitting on top of it, maybe adding the table tags...)

Sergey
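
For reference, a minimal sketch of what the "PDFParser property" route might look like in tika-config.xml. It assumes the Tika 1.x `<params>` syntax for per-parser settings, and that the `sortByPosition` parameter name maps onto the PDFParser setter of the same name (which is the deprecated call path mentioned above). Nothing in this thread confirms this layout, so treat it as an illustration rather than the recommended configuration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parsers, but leave PDF handling to the entry below. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Assumption: each param is applied through the matching PDFParser setter. -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

A file like this would typically be loaded with `new TikaConfig(...)` and passed to `AutoDetectParser`. The programmatic alternative is to put a `PDFParserConfig` instance into the `ParseContext`.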
On Thu, Jul 11, 2019 at 9:52 PM Sergey Beryozkin <sberyoz...@gmail.com> wrote:

> Hi Tim
>
> Thanks, I'm going to try to experiment with different complex enough PDFs
> in order to figure out how to enhance the Quarkus Tika extension, what to
> let customize, etc (I'll link to it in a follow up email).
> Your output looks better :-), and which ContentHandler did you use ?
>
> Sergey
>
> On Thu, Jul 11, 2019 at 7:23 PM Tim Allison <talli...@apache.org> wrote:
>
>> Might not need to break out the neural nets just yet...try turning on
>> sortByPosition via the PDFParserConfig and/or tika_config.xml.
>>
>> This is what you get:
>>
>>
>>
>> <title>PDF Invoice Example</title>
>> </head>
>> <body><div class="page"><p />
>> <p>Invoice
>> </p>
>> <p>From: Invoice Number INV-3337
>> </p>
>> <p>DEMO - Sliced Invoices Order Number 12345
>> Suite 5A-1204 Invoice Date January 25, 2016
>> 123 Somewhere Street Due Date January 31, 2016
>> Your City AZ 12345
>> ad...@slicedinvoices.com Total Due $93.50
>> </p>
>> <p>To:
>> Test Business
>> 123 Somewhere St
>> Melbourne, VIC 3000
>> t...@test.com
>> </p>
>> <p>Hrs/Qty Service Rate/Price Adjust Sub Total
>> </p>
>> <p>1.00 Web DesignThis is a sample description... $85.00 0.00% $85.00
>> </p>
>> <p>Pa
>> idSub Total $85.00
>> </p>
>> <p>Tax $8.50
>> Total $93.50
>> </p>
>> <p>ANZ Bank
>> ACC # 1234 1234
>> BSB # 4321 432
>> </p>
>> <p>Payment is due within 30 days from date of invoice. Late payment is
>> subject to fees of 5% per month.
>> Thanks for choosing DEMO - Sliced Invoices | ad...@slicedinvoices.com
>> Page 1/1</p>
>> <p />
>> <div class="annotation"><a
>> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo
>> </a></div>
>> <div class="annotation"><a
>> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo
>> </a></div>
>> <div class="annotation"><a
>> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo
>> </a></div>
>> <div class="annotation"><a
>> href="mailto:ad...@slicedinvoices.com">mailto:ad...@slicedinvoices.com
>> </a></div>
>> </div>
>> </body></html>
>>
>> On Thu, Jul 11, 2019 at 1:25 PM Sergey Beryozkin <sberyoz...@gmail.com> wrote:
>> >
>> > Hi
>> >
>> > I've used Tika to parse this invoice PDF:
>> >
>> > https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
>> >
>> > (AutoDetectParser, ToTextContentHandler), see below what is returned.
>> > The numbers like (1), (2) are added by myself, this is the preferred order (approximately).
>> >
>> > Is it possible to hint somehow to Tika how to report the content ?
>> >
>> > Thanks Sergey
>> >
>> > PDF Invoice Example
>> > Invoice
>> >
>> > (5)Payment is due within 30 days from date of invoice. Late payment is subject to fees of 5% per month.
>> >
>> > Thanks for choosing DEMO - Sliced Invoices | ad...@slicedinvoices.com
>> >
>> > Page 1/1
>> >
>> > (2)From:
>> >
>> > DEMO - Sliced Invoices
>> >
>> > Suite 5A-1204
>> >
>> > 123 Somewhere Street
>> >
>> > Your City AZ 12345
>> >
>> > ad...@slicedinvoices.com
>> >
>> > (1)Invoice Number INV-3337
>> >
>> > Order Number 12345
>> >
>> > Invoice Date January 25, 2016
>> >
>> > Due Date January 31, 2016
>> >
>> > Total Due $93.50
>> >
>> > (3)To:
>> >
>> > Test Business
>> >
>> > 123 Somewhere St
>> >
>> > Melbourne, VIC 3000
>> >
>> > t...@test.com
>> >
>> > (4) Hrs/Qty Service Rate/Price Adjust Sub Total
>> >
>> > 1.00
>> > Web Design
>> > This is a sample description...
>> >
>> > $85.00 0.00% $85.00
>> >
>> > Sub Total $85.00
>> >
>> > Tax $8.50
>> >
>> > Total $93.50
>> >
>> > (5) ANZ Bank
>> >
>> > ACC # 1234 1234
>> >
>> > BSB # 4321 432 Pa
>> > id
>> >