Removing ALMOST all text from a pdf

Nick Westerly Sat, 01 Dec 2018 15:04:07 -0800

I'm using the approach from the PDFBox RemoveAllText example to remove text
from a document:

http://www.docjar.com/html/api/org/apache/pdfbox/examples/util/RemoveAllText.java.html


I then render the page to an image.

I'd like to keep doing exactly that, except leave certain pieces of text in
place if they match a regex pattern (I'm looking for sequences of dashes).

For this part of the parsing, I'd like to implement a method that checks
the text of prevToken (the operand of Tj/TJ) and only removes it if it
doesn't match my pattern. Are there any helper methods to get the text
from a token like this (possibly in PDFTextStripper or elsewhere)? Or do I
have to parse the text manually?

for (Object token : tokens) {
    if (token instanceof Operator) {
        Operator op = (Operator) token;
        if (op.getName().equals("TJ") || op.getName().equals("Tj")) {
            // The single operand of Tj/TJ is the last token added so far
            Object prevToken = newTokens.get(newTokens.size() - 1);
            if (!matchesMyString(prevToken)) {
                // Drop the operand and skip the operator as well
                newTokens.remove(newTokens.size() - 1);
                continue;
            }
            // On a match, fall through so the operator stays with its operand
        }
    }
    newTokens.add(token);
}
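
A minimal sketch of what a check like matchesMyString could look like, under
some loud assumptions: DashOperandMatcher and decode are hypothetical names,
not PDFBox API, and the operand is modeled with plain Java types (byte[] for
a Tj string, a List mixing byte[] and Number for a TJ array). With PDFBox the
operand would presumably be a COSString or a COSArray, and the raw bytes
could be taken from COSString.getBytes(). The sketch also assumes a simple
single-byte font whose character codes equal Latin-1 code points; fonts with
custom or multi-byte encodings would need decoding through the font instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: operands are modeled with plain Java types.
final class DashOperandMatcher {
    // Assumed threshold: two or more consecutive hyphens count as a dash sequence.
    private static final Pattern DASHES = Pattern.compile("-{2,}");

    // operand: byte[] for a Tj string, or a List of byte[] and Number for a TJ array.
    static boolean matchesMyString(Object operand) {
        return DASHES.matcher(decode(operand)).find();
    }

    static String decode(Object operand) {
        if (operand instanceof byte[]) {
            // Assumption: one byte per glyph, codes equal to Latin-1 code points.
            return new String((byte[]) operand, StandardCharsets.ISO_8859_1);
        }
        if (operand instanceof List) {
            StringBuilder sb = new StringBuilder();
            for (Object item : (List<?>) operand) {
                // Numbers in a TJ array are positioning adjustments, not text; skip them.
                if (item instanceof byte[]) {
                    sb.append(decode(item));
                }
            }
            return sb.toString();
        }
        // Any other token type carries no text in this model.
        return "";
    }
}
```

Concatenating the string pieces of a TJ array before matching matters because
a run of dashes may be split into several strings separated by kerning
numbers.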

Thanks

Nick
