Dear PDFBox Dev Team,

After searching online
<https://stackoverflow.com/search?page=5&tab=Relevance&q=pdfbox%20order>,
I am certain that using setSortByPosition(true) would help. However, I am
struggling to get the config file right. Can you please provide any advice
on it?
Thanks so much in advance.

Regards,
Luke

On Fri, 20 Dec 2019 at 18:06, Lu Sun <vistax...@gmail.com> wrote:

> Dear PDFBox Dev Team,
>
> Hope this message finds you well.
>
> Just wanted to raise this for your attention. Please can you provide any
> solutions on the parsing order issue? Attached is my config file, an
> example of pdf file and my parsing results.
>
> Thanks so much in advance. Wish you and your team a Merry Christmas and
> Happy New Year.
>
> Regards,
> Luke
>
> On Tue, 17 Dec 2019 at 12:34, Tim Allison <talli...@apache.org> wrote:
>
>> PDFBox Colleagues,
>> Any recommendations?
>>
>> On Mon, Dec 16, 2019 at 7:05 AM Lu Sun <vistax...@gmail.com> wrote:
>>
>>> Dear Tika Dev Team,
>>>
>>> Hope this email finds you well.
>>>
>>> I have been actively using Tika for pdf file reading. One issue I found
>>> is the parsing order. As shown in attached image, the parsing order of pdf
>>> file is not based on position of texts.
>>>
>>> As suggested in this github link
>>> <https://github.com/chrismattmann/tika-python/issues/266>, I used a
>>> customized config file (see attached), hoping to solve the issue. But this
>>> has not worked out. If any chance, can you please review this issue, and
>>> provide any insights or solutions?
>>>
>>> Thanks so much in advance.
>>>
>>> Regards,
>>> Luke
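
To illustrate what ordering text by position means in the question above, here is a minimal Python sketch. It is not PDFBox's implementation of setSortByPosition(true). The TextChunk type, the sort_by_position function, the line_tolerance parameter, and the sample coordinates are all hypothetical, chosen only to show how fragments that arrive in content-stream order could be regrouped top-to-bottom and then left-to-right.

```python
from typing import List, NamedTuple


class TextChunk(NamedTuple):
    """Hypothetical text fragment with its position on the page."""
    x: float   # left edge of the fragment
    y: float   # distance from the top of the page
    text: str


def sort_by_position(chunks: List[TextChunk],
                     line_tolerance: float = 2.0) -> List[TextChunk]:
    """Order chunks top-to-bottom, then left-to-right within each line.

    Chunks whose y values differ by at most line_tolerance from the first
    chunk of the current line are treated as being on the same line.
    """
    ordered = sorted(chunks, key=lambda c: (c.y, c.x))
    lines: List[List[TextChunk]] = []
    for chunk in ordered:
        if lines and abs(chunk.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(chunk)   # same visual line
        else:
            lines.append([chunk])     # start a new line
    # Within each line, read from left to right.
    return [c for line in lines for c in sorted(line, key=lambda c: c.x)]


# Made-up fragments in the order they might appear in a content stream,
# which need not match the visual reading order.
stream_order = [
    TextChunk(300.0, 100.0, "Total"),
    TextChunk(50.0, 120.0, "Alice"),
    TextChunk(50.0, 100.5, "Name"),
    TextChunk(300.0, 119.5, "42"),
]

reading_order = [c.text for c in sort_by_position(stream_order)]
print(reading_order)
```

The tolerance is there because fragments on the same visual line rarely share an identical baseline. A plain sort on (y, x) alone would place "Total" before "Name" in this sample, since their y values differ by half a unit.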