> On 15 Jul 2015, at 04:52, Allison, Timothy B. <[email protected]> wrote:
>
> All,
>  Raymond Wu recently opened TIKA-1679 and recommended that we switch to
> per-page processing so that if there's an exception on one page, we'll still
> be able to extract contents from other pages.
>
> The proposed fix is along these lines:
>
>    int nop = document.getNumberOfPages();
>    for(int i=1;i<=nop;i++) {
>        PDF2XHTML pdf2XHTML = new PDF2XHTML(handler, metadata,
>                extractAnnotationText, enableAutoSpace,
>                suppressDuplicateOverlappingText, sortByPosition);
>        try {
>            pdf2XHTML.setStartPage(i);
>            pdf2XHTML.setEndPage(i);
>            pdf2XHTML.writeText(document, dummyWriter);
>        } catch(Exception e) {
>            // TODO ...
>        }
>    }
>
> Does this seem reasonable? Any gut reaction/estimates on the performance
> hit? Perhaps we should make this mode configurable?
>
Looks fine to me. A quick look at the source of PDFTextStripper doesn’t indicate any performance issues.

John

> Thank you.
>
> Best,
>
> Tim
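
On the "configurable" question, here is a rough, self-contained sketch of how a per-page mode with a switch could behave. It deliberately does not use the real Tika or PDFBox classes: PageExtractor, Result, extractAll and the perPage flag are all hypothetical names made up for illustration, and PageExtractor only stands in for the setStartPage/setEndPage/writeText call in the quoted snippet. Treat it as one possible shape, not as what TIKA-1679 should or will look like.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PerPageExtraction {

    /** Hypothetical stand-in for the per-page extraction call; not a Tika or PDFBox type. */
    interface PageExtractor {
        String extract(int pageNumber) throws Exception;
    }

    /** Text of the pages that succeeded, plus the exception for each page that failed. */
    static final class Result {
        final List<String> pageTexts = new ArrayList<>();
        final Map<Integer, Exception> failures = new LinkedHashMap<>();
    }

    /**
     * Walks pages 1..numberOfPages. When perPage is true, a failing page is
     * recorded and skipped; when false, the first exception aborts the run,
     * which mimics the all-or-nothing behaviour.
     */
    static Result extractAll(int numberOfPages, PageExtractor extractor, boolean perPage)
            throws Exception {
        Result result = new Result();
        for (int page = 1; page <= numberOfPages; page++) {
            try {
                result.pageTexts.add(extractor.extract(page));
            } catch (Exception e) {
                if (!perPage) {
                    throw e; // assumed default: fail the whole document
                }
                result.failures.put(page, e); // keep going, remember what broke
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Simulated document in which page 2 cannot be processed.
        PageExtractor flaky = page -> {
            if (page == 2) {
                throw new IllegalStateException("bad page " + page);
            }
            return "text of page " + page;
        };
        Result r = extractAll(3, flaky, true);
        System.out.println("extracted: " + r.pageTexts);
        System.out.println("failed pages: " + r.failures.keySet());
    }
}
```

Keeping the per-page exceptions rather than swallowing them would let the caller decide what to do with them afterwards, which is one way to fill in the TODO in the catch block above.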

