Trailing Space and Final CRLF Added

2022-03-16 Thread flywire
Can text be extracted without adding trailing space?
*Text.txt*
def hello_world():
    print("Hello World!")
hello_world()
*File ends line above with no CRLF*
java -jar pdfbox-app-2.0.25.jar TextToPDF -standardFont Courier test.pdf test.txt
java -jar pdfbox-app-2.0.25.jar ExtractText test.pdf te
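One possible workaround, independent of PDFBox itself, is to post-process the extracted text file. The sketch below is only an illustration: the function name `strip_trailing_whitespace` and the sample string are assumptions, not part of the thread, and it simply removes trailing spaces or tabs from each line and drops any final line break.

```python
def strip_trailing_whitespace(text: str) -> str:
    """Remove trailing spaces/tabs on every line and any final line break.

    Hypothetical helper for cleaning text after extraction; it does not
    change how PDFBox extracts the text in the first place.
    """
    # Normalise Windows line endings so the split below sees one separator.
    normalised = text.replace("\r\n", "\n")
    # Strip trailing spaces and tabs from each line individually.
    cleaned = [line.rstrip(" \t") for line in normalised.split("\n")]
    # Re-join and drop line breaks left at the very end of the text.
    return "\n".join(cleaned).rstrip("\n")


if __name__ == "__main__":
    # Assumed sample: each line ends with a space and the text ends with CRLF.
    sample = 'def hello_world(): \r\n    print("Hello World!") \r\nhello_world() \r\n'
    print(repr(strip_trailing_whitespace(sample)))
```

The result uses `\n` line endings; if CRLF is needed on Win10, the joined text can be converted back when writing the file.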

Searchable Pdf

2022-02-10 Thread flywire
Can PDFBox make a scanned text document PDF text-searchable?

pdfbox-app-2 ExtractText [Text file] / ExtractImages

2021-10-22 Thread flywire
Is it possible to specify an output path for the extracted files, allowing the app to manage filenames? ExtractText has only one output file, which could be generated as part of a batch process, but that's not really possible with ExtractImages on Win10.

PDFText2HTML.java Working Example

2021-08-26 Thread flywire
I couldn't find a working PDFText2HTML.java example. Can you show me one, preferably as an app?

Re: PDF2MD - Images

2021-08-26 Thread flywire
Figures, tables, etc. often have a unique caption line, e.g. Figure N: Description... After extracting text, I used this workaround to post-process the markdown files on Win10 with GNU sed (hence ^^): === display proposed changes for %f in (*.md) do sed -n 's/\(^^Figure \)\([0-9]\+\)\(\: .*\)/\n![]
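The same idea can be sketched in Python for readers without GNU sed. This is a rough illustration, not the thread's actual command: the sed replacement text is cut off in the snippet above, so the output layout here (an empty-alt image link on its own line before the caption, named `<stem>-<N>.png`) is an assumption, as are the names `add_image_links` and `stem`. It also assumes the figure number matches the extracted image number.

```python
import re

# Caption lines of the form "Figure N: Description", anchored at line start.
CAPTION = re.compile(r"^Figure ([0-9]+): .*$", re.MULTILINE)


def add_image_links(markdown: str, stem: str) -> str:
    """Insert a markdown image link before each figure caption line.

    `stem` is a hypothetical base name for the image files; the link
    layout is assumed, not taken from the original sed command.
    """
    def with_link(match: re.Match) -> str:
        number = match.group(1)
        # Blank line, image link, then the untouched caption line.
        return f"\n![]({stem}-{number}.png)\n{match.group(0)}"

    return CAPTION.sub(with_link, markdown)


if __name__ == "__main__":
    text = "Intro\nFigure 2: A diagram\nMore text"
    print(add_image_links(text, "chapter"))
```

Anchoring the pattern at the start of the line keeps in-sentence mentions such as "see Figure 2: ..." from being rewritten.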

Re: PDF2MD - Codeblocks

2021-08-23 Thread flywire
With a bit of customisation, PDFBox should be able to parse pdf to md. This probably involves a process like PDFText2HTML.java, possibly

PDF2MD - Paragraphs

2021-08-23 Thread flywire
https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf shows a clear break between paragraphs. I'm on Win10 using: java -jar pdfbox-app-2.0.24.jar ExtractText %1 Each line is extracted but there is no newline for the paragraph. How can I insert one during text extraction? I've read about

PDF2MD - Codeblocks

2021-08-23 Thread flywire
https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf contains codeblocks identified by a change of font and no other fonts on those lines. I'd like to insert control codes before and after them while I'm extracting text. I'm on Win10 using: java -jar pdfbox-app-2.0.24.jar ExtractText %1

PDF2MD - Images

2021-08-23 Thread flywire
https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf contains images and I'd like to replace them with code while I'm extracting text. I'm on Win10 using: java -jar pdfbox-app-2.0.24.jar ExtractText %1 Required code is: %newline%[](%filename%-%image-no%.png)%newline% %filename% is wit

ExtractImages Ignoring Textboxes

2021-08-23 Thread flywire
https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf contains textboxes which are extracted as images containing a solid black box. How can I ignore those textboxes while extracting images and not increment the image number contained in the filename? They always occur as the last two images