Re: Extracting text from a PDF

Paul Dupuis via use-livecode Mon, 09 Mar 2026 09:25:46 -0700

I have no idea if this will help, as you are using the PDF Widget and this was for the XPDF External, but the Widget is based on Google PDFium just like the older External. The XPDF External had a problem with hyphenations in PDFs where the hyphen was actually a 2-byte Unicode character. The following takes the text returned (which may include such a hyphen) and "fixes" it to include a normal hyphen:


command Rehyphenate @xText
  -- This handler is a workaround for the following bug:
  -- http://quality.livecode.com/show_bug.cgi?id=18442

  -- This bug is fundamentally an issue in the PDFium PDF library where certain
  -- hyphenated strings (such as URLs) with line breaks are returned with a
  -- Unicode BOM (xFFFE) instead of a hyphen. Rehyphenate replaces xFFFE with a
  -- hyphen wherever it is returned between non-whitespace characters.
  --
  -- The intended usage is:
  -- XPDFViewer_GetSelectionUnicode "Document1", "tUnicode"
  -- put textDecode(tUnicode, "UTF16") into tUnicode
  -- Rehyphenate tUnicode
  -- put tUnicode into ...


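  local tStart, tEnd, tBadUnicodeChar
  -- U+FFFE (65534) as a single character, the same code point the regex below matches
  put numToCodepoint(65534) into tBadUnicodeChar
  if xText contains tBadUnicodeChar then
    -- tStart and tEnd receive the position of the captured xFFFE character
    repeat while matchChunk(xText, "[^\s]*(\x{FFFE})[^\s]*", tStart, tEnd)
      put "-" into char tStart to tEnd of xText
    end repeat
  end if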
end Rehyphenate

At the very least, there may be a similar bug in the widget (because the bug was in the underlying PDFium library) that requires some sort of similar workaround.
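
To check whether the widget returns the same character, a minimal sketch along these lines might help. It lists the code point of every character that is not printable ASCII or a linefeed. The handler names and the widget name "PDFViewer" are hypothetical, and it assumes "the hilitedRangeText" returns an already-decoded Unicode string:

```livecode
-- Sketch only. Lists the code point of every character in pText that is
-- not printable ASCII or a linefeed, one number per line.
function ListOddCodepoints pText
  local tCodepoint, tCode, tResult
  repeat for each codepoint tCodepoint in pText
    put codepointToNum(tCodepoint) into tCode
    if tCode > 126 or (tCode < 32 and tCode is not 10) then
      put tCode & return after tResult
    end if
  end repeat
  return tResult
end ListOddCodepoints

-- Hypothetical usage; the widget name "PDFViewer" is an assumption.
command ShowOddCodepoints
  local tText
  put the hilitedRangeText of widget "PDFViewer" into tText
  answer ListOddCodepoints(tText)
end ShowOddCodepoints
```

If the list shows 65534 (xFFFE), Rehyphenate should apply to the widget text as well. If it shows some other value, that value can be passed to numToCodepoint() in a replace command instead. As far as I understand it, charToNum works in the native single-byte encoding and reports 63 ("?") for any character it cannot represent, which might be why replacing numToChar(63) has no effect.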


On 3/9/2026 11:40 AM, David Epstein via use-livecode wrote:

Does anyone have experience trying to clean up the text that can be extracted 
from a PDF shown in the PDF widget by getting “the hilitedRangeText” of the 
widget?

In the case I’m working with there is an invisible numToChar(10) at the end of 
each visible line; and to obtain text that will wrap freely in a LiveCode field 
I can “replace numToChar(10) with space” in the text I’ve extracted.  This 
works.

When a word is divided at the end of the line, the visible hyphen is a 
numToChar(63).  But a command to “replace numToChar(63) with empty” does not 
work, and the character remains in place (showing up, in a field whose font is 
Palatino, as a boxed question mark).

My impression is that not all PDF documents work the same way, and that there 
are other problems trying to extract their text.  But why does this 
numToChar(63) character not get replaced?

David Epstein

_______________________________________________
use-livecode mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
