[jira] Commented: (PDFBOX-55) Invalid character while extracting text from a chinese pdf

Tom Jackson (JIRA) Thu, 22 Jan 2009 14:12:24 -0800

    [ https://issues.apache.org/jira/browse/PDFBOX-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666318#action_12666318 ]


Tom Jackson commented on PDFBOX-55:
-----------------------------------

I also ran into cases where an illegal character (0x00) would slip into the 
text output I was writing to XML.  

XML 1.0 does not accept most control characters.  Below 0x20, only tab (0x09), 
line feed (0x0A) and carriage return (0x0D) are allowed, and 0xFFFE, 0xFFFF and 
unpaired surrogates are rejected as well.  Characters above 127, including 
Chinese text, are fine.  That's a limitation of XML, not PDFBox.  Even so, to 
work around it, here is the C# code I use.  It accepts a string and returns a 
scrubbed copy containing only the characters XML 1.0 allows.  Call it after you 
get the string PDFBox returns, but before you send your text to XML via 
XmlWriter.  You would invoke a line similar to string 
myContent = charScrubber(pdfboxContent), and then write "myContent" into your 
XML file, to use this function:

        // Requires: using System.Text;

        /// <summary>
        /// This will parse a string and only return characters
        /// as a concatenated string if XML 1.0 allows them.
        ///  Author:  Tom Jackson
        /// </summary>
        /// <param name="content">A string variable provided for parsing.</param>
        /// <returns>The same string variable, minus characters that 
        /// XML 1.0 does not allow.</returns>
        private static string charScrubber(string content)
        {
            StringBuilder sbTemp = new StringBuilder(content.Length);
            for (int i = 0; i < content.Length; i++)
            {
                char currentChar = content[i];
                // Tab, line feed, carriage return, and the two BMP ranges.
                if (currentChar == 0x9 || currentChar == 0xA || currentChar == 0xD ||
                    (currentChar >= 0x20 && currentChar <= 0xD7FF) ||
                    (currentChar >= 0xE000 && currentChar <= 0xFFFD))
                {
                    sbTemp.Append(currentChar);
                }
                else if (char.IsHighSurrogate(currentChar) && i + 1 < content.Length
                         && char.IsLowSurrogate(content[i + 1]))
                {
                    // Keep a valid surrogate pair (a character above U+FFFF).
                    sbTemp.Append(currentChar).Append(content[++i]);
                }
            }

            return sbTemp.ToString();
        }
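
For anyone calling PDFBox from Java directly, as in the original report, a 
scrubbing step of this kind might look roughly like the sketch below.  It 
assumes the XML 1.0 character rule (tab, line feed, carriage return, 
0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF), so Chinese text is kept and only 
the disallowed control characters and unpaired surrogates are dropped.  The 
class name XmlCharScrubber and the methods isXmlChar and scrub are made up for 
illustration; they are not part of PDFBox.

```java
public class XmlCharScrubber {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with disallowed code points dropped. */
    static String scrub(String content) {
        StringBuilder sb = new StringBuilder(content.length());
        for (int i = 0; i < content.length(); ) {
            // codePointAt returns the lone surrogate itself when it is
            // unpaired, and isXmlChar then rejects it.
            int cp = content.codePointAt(i);
            if (isXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The control characters listed in the report, mixed with text.
        String raw = "Lorem\u0000 ipsum\u0001\u0005\u0006\u0018 \u4E2D\u6587";
        System.out.println(scrub(raw));
    }
}
```

You would pass the extracted text through scrub(...) before handing it to 
whatever writes the XML.  Dropping the characters silently is a design choice; 
replacing them with a space or U+FFFD would work just as well if you want to 
see where something was removed.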


Enjoy.

-Tom

> Invalid character while extracting text from a chinese pdf
> ----------------------------------------------------------
>
>                 Key: PDFBOX-55
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-55
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1185058
> Originally submitted by seblaunay on 2005-04-18 01:59.
> First, thanks for this wonderful api.
> I have a problem extracting text from a PDF document
> provided with Adobe Acrobat Reader: ENUtxt.pdf.
> The PDF contains text in Chinese fonts, which cannot
> be extracted.
> But it also contains this text (extracted with xpdf or
> Acrobat Reader):
> ---------------------------------------
> Lorem ipsum dolor
> ad minim
> ---------------------------------------
> The problem is that on my Writer, with
> PDFTextStripper.writeText, I get something like this:
> ---------------------------------------
> -PSFNJQTVNEPMPS
> BENJOJNWFSOJBNôH
> ---------------------------------------
> Between these valid characters there are these
> invalid characters:
> 0x0, 0x1, 0x5, 0x6, 0x18.
> Because I write the content of the document to XML through SAX,
> the resulting XML is not valid, since it contains
> these invalid characters.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1185058&file_id=130664
> ENUtxt.pdf (application/pdf), 7582 bytes
> The pdf used
> [comment on SourceForge]
> Originally sent by seblaunay.
> Document to test added.

