Python Tika parsing order issue

2020-05-12 Thread Gourang Gaurav
Hi, I was using the following to get the data from a PDF: from tika import parser; parser.from_file(file_path, xmlContent=True). The location of the text segments changes compared to the original PDF page (in my case, a table at the beginning of the page comes out at the bottom). So, I tried using custom c
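
One setting that may be relevant to this kind of reordering is PDFBox's sort-by-position option, which Tika exposes as sortByPosition on its PDF parser configuration. The following is a minimal sketch, not a confirmed fix: it assumes a tika-server that honours X-Tika-PDF-prefixed request headers and a tika-python version whose from_file accepts a headers argument. The helper names pdf_headers and parse_pdf are hypothetical, and whether the option restores a particular table depends on the PDF itself.

```python
def pdf_headers(sort_by_position=True):
    """Build request headers asking tika-server to sort PDF text by position.

    Assumption: the server maps X-Tika-PDF<name> headers onto the matching
    PDFParserConfig setter (here, setSortByPosition).
    """
    return {"X-Tika-PDFsortByPosition": "true" if sort_by_position else "false"}


def parse_pdf(file_path, sort_by_position=True):
    """Parse a PDF to XHTML through tika-python (needs a reachable Tika server)."""
    # Imported lazily so the header helper can be used without tika installed.
    from tika import parser

    return parser.from_file(
        file_path,
        xmlContent=True,
        headers=pdf_headers(sort_by_position),
    )
```

With a server available, parse_pdf(file_path)["content"] would hold the XHTML output, which can be compared against the result obtained with sort_by_position=False.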

Re: Parsing order issue

2020-01-07 Thread Lu Sun
it? Thanks so much in advance. Regards, Luke. On Fri, 20 Dec 2019 at 18:06, Lu Sun wrote: Dear PDFBox Dev Team, Hope this message finds you well. Just wanted to raise this for your attention. Please can you provide any solutions on the parsing order issue? Attached is

Re: Parsing order issue

2020-01-06 Thread Tilman Hausherr
18:06, Lu Sun wrote: Dear PDFBox Dev Team, Hope this message finds you well. Just wanted to raise this for your attention. Please can you provide any solutions on the parsing order issue? Attached is my config file, an example of pdf file and my parsing results. Thanks so much in advance. W

Re: Parsing order issue

2019-12-20 Thread Tilman Hausherr
provide any solutions on the parsing order issue? Attached is my config file, an example of pdf file and my parsing results. Thanks so much in advance. Wish you and your team a Merry Christmas and Happy New Year. Regards, Luke On Tue, 17 Dec 2019 at 12:34, Tim Allison <mailto:talli...@apache.

Re: Parsing order issue

2019-12-20 Thread Lu Sun
Dear PDFBox Dev Team, Hope this message finds you well. Just wanted to raise this for your attention. Please can you provide any solutions on the parsing order issue? Attached is my config file, an example of pdf file and my parsing results. Thanks so much in advance. Wish you and your team a

Re: Parsing order issue

2019-12-17 Thread Tim Allison
Tilman, That isn’t correct. I’ll find the link that might help... On Tue, Dec 17, 2019 at 1:02 PM Tilman Hausherr wrote: I already answered... we need the PDF. But... about the config: image/jpeg application/pdf class="org.apache

Re: Parsing order issue

2019-12-17 Thread Tilman Hausherr
I already answered... we need the PDF. But... about the config: image/jpeg application/pdf class="org.apache.tika.parser.executable.ExecutableParser"/> application/pdf Is this a correct setting for PDFs in Tika? I notice that

Re: Parsing order issue

2019-12-17 Thread Maruan Sahyoun
Hi Tim, unfortunately the image didn't make it to the mailing list. What is the issue here? Is the extracted text not in the right order? Order of PDF parsing and visual order of text are not related. BR Maruan. PDFBox Colleagues, Any recommendations? On Mon, Dec 16, 2019 at 7:05 A
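
To illustrate the point that the order in which text is stored in a PDF need not match its visual order, here is a minimal sketch in plain Python. The Fragment type, the coordinates, and the tolerance parameter are all hypothetical. This is not PDFBox's or Tika's algorithm, only a simple top-to-bottom, left-to-right sort over made-up data.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


def visual_order(fragments: List[Fragment], tolerance: float = 2.0) -> List[Fragment]:
    """Sort fragments top-to-bottom, then left-to-right within a line.

    Fragments whose y values fall into the same tolerance-sized band are
    treated as one line. This bucketing is crude: two fragments that sit
    just either side of a band boundary end up on different "lines".
    """
    return sorted(fragments, key=lambda f: (int(f.y // tolerance), f.x))


# Fragments in the order they might appear in a content stream (made-up data):
stream_order = [
    Fragment("Footer note", 50, 700),
    Fragment("Table cell B", 200, 100),
    Fragment("Table cell A", 50, 101),
]

for fragment in visual_order(stream_order):
    print(fragment.text)
```

In this made-up input the footer is stored first even though it sits at the bottom of the page, which is the kind of mismatch described in the thread.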

Re: Parsing order issue

2019-12-17 Thread Tim Allison
PDFBox Colleagues, Any recommendations? On Mon, Dec 16, 2019 at 7:05 AM Lu Sun wrote: Dear Tika Dev Team, Hope this email finds you well. I have been actively using Tika for reading PDF files. One issue I found is the parsing order. As shown in attached image, the parsing or

Re: Parsing order issue

2019-12-16 Thread Tilman Hausherr
Please upload the PDF to a sharehoster. Tilman. On 15.12.2019 at 23:21, Lu Sun wrote: Dear Tika Dev Team, Hope this email finds you well. I have been actively using Tika for reading PDF files. One issue I found is the parsing order. As shown in attached image, the parsing order of pdf file

Parsing order issue

2019-12-16 Thread Lu Sun
Dear Tika Dev Team, Hope this email finds you well. I have been actively using Tika for reading PDF files. One issue I found is the parsing order. As shown in the attached image, the parsing order of the PDF file is not based on the position of the text. As suggested in this GitHub link