[jira] Updated: (PDFBOX-433) parse Unicode glyph names

Timo Boehme (JIRA) Mon, 23 Feb 2009 02:13:28 -0800

     [ https://issues.apache.org/jira/browse/PDFBOX-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Timo Boehme updated PDFBOX-433:
-------------------------------

    Description: 
Adobe has specified (http://www.adobe.com/devnet/opentype/archives/glyph.html) 
how glyph names should be constructed so that they can easily be converted to 
Unicode. What is currently missing in PDFBox is the handling of suffixes 
(NAME.SUFFIX) and Unicode names (uniXXXX). I have therefore attached an updated 
method getCharacter( COSName name ) for class org.apache.pdfbox.encoding.Encoding.
It first strips off any suffix and then, if the base name is not found in the 
lookup table, tests for names starting with 'uni'.

Timo

    /**
     * This will get the character from the name.
     *
     * @param name The name of the character.
     *
     * @return The printable character for the code.
     */
    public static String getCharacter( COSName name )
    {
        COSName baseName = name;
        String  nameStr  = baseName.getName();

        // test if we have a suffix and if so remove it
        if ( nameStr.indexOf('.') > 0 ) {
            nameStr  = nameStr.substring( 0, nameStr.indexOf('.') );
            baseName = COSName.getPDFName( nameStr );
        }

        String character = (String)NAME_TO_CHARACTER.get( baseName );
        if( character == null )
        {
            // test for Unicode name
            // (uniXXXX... - the number of hex digits must be a multiple of four;
            //  each group of four represents a hexadecimal Unicode code point)
            if ( nameStr.startsWith( "uni" ) )
            {
                StringBuffer uniStr = new StringBuffer();

                for ( int chPos = 3; chPos + 4 <= nameStr.length(); chPos += 4 ) {

                    try {

                        int characterCode = Integer.parseInt( nameStr.substring( chPos, chPos + 4 ), 16 );

                        // surrogate code points (0xD800 - 0xDFFF) are not allowed
                        if ( ( characterCode > 0xD7FF ) && ( characterCode < 0xE000 ) )
                            Logger.getLogger(Encoding.class.getName()).log( Level.WARNING,
                                "Unicode character name with not allowed code area: " + nameStr );
                        else
                            uniStr.append( (char) characterCode );

                    } catch (NumberFormatException nfe) {
                        Logger.getLogger(Encoding.class.getName()).log( Level.WARNING,
                            "Not a number in Unicode character name: " + nameStr );
                    }
                }
                character = uniStr.toString();
            }
            else
                character = nameStr;
        }
        return character;
    }
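
For readers who want to try the two steps outside PDFBox, here is a minimal standalone sketch of suffix stripping followed by uniXXXX decoding. It is not the attached patch: the class GlyphNameSketch, its helper methods, and the two-entry lookup table are hypothetical stand-ins for illustration. It also deviates from the patch on purpose. A malformed uni name (a digit count that is not a multiple of four, a non-hex character, or a surrogate code point) is rejected as a whole and the base name is returned, whereas the patch logs a warning and skips only the offending group. Accepting only uppercase hex digits follows my reading of the Adobe specification and is likewise an assumption.

```java
import java.util.HashMap;
import java.util.Map;

class GlyphNameSketch {

    // Hypothetical stand-in for the NAME_TO_CHARACTER table in Encoding.
    private static final Map<String, String> NAME_TO_CHARACTER = new HashMap<>();
    static {
        NAME_TO_CHARACTER.put("A", "A");
        NAME_TO_CHARACTER.put("space", " ");
    }

    /** Drops everything from the first period on (NAME.SUFFIX -> NAME). */
    static String stripSuffix(String glyphName) {
        int dot = glyphName.indexOf('.');
        return dot > 0 ? glyphName.substring(0, dot) : glyphName;
    }

    /** Assumption: only 0-9 and A-F count as hex digits in glyph names. */
    private static boolean isUpperHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'F');
    }

    /** Decodes "uni" + groups of four hex digits; returns null if malformed. */
    static String decodeUniName(String baseName) {
        if (!baseName.startsWith("uni")) {
            return null;
        }
        String hex = baseName.substring(3);
        if (hex.isEmpty() || hex.length() % 4 != 0) {
            return null;
        }
        for (int i = 0; i < hex.length(); i++) {
            if (!isUpperHex(hex.charAt(i))) {
                return null;
            }
        }
        StringBuilder result = new StringBuilder();
        for (int pos = 0; pos < hex.length(); pos += 4) {
            int code = Integer.parseInt(hex.substring(pos, pos + 4), 16);
            if (code >= 0xD800 && code <= 0xDFFF) {
                return null; // surrogate code points are not allowed
            }
            result.append((char) code);
        }
        return result.toString();
    }

    /** Table lookup first, then uniXXXX decoding, then the base name itself. */
    static String toUnicode(String glyphName) {
        String baseName = stripSuffix(glyphName);
        String character = NAME_TO_CHARACTER.get(baseName);
        if (character == null) {
            character = decodeUniName(baseName);
        }
        return character != null ? character : baseName;
    }

    public static void main(String[] args) {
        System.out.println(toUnicode("A.swash"));
        System.out.println(toUnicode("uni00410042.alt"));
        System.out.println(toUnicode("uniD800"));
    }
}
```

Returning the base name for anything that cannot be decoded mirrors the final else branch of the method above, so callers always get a non-null string.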


  was:
Adobe has specified (http://www.adobe.com/devnet/opentype/archives/glyph.html) 
how glyph names should be constructed to easily convert them (to Unicode). What 
is currently missing in PDFBox is the handling of suffixes (NAME.SUFFIX) and 
Unicode names (uniXXXX). I have therefore attached an updated method 
getCharacter( COSName name ) for class org.apache.pdfbox.encoding.Encoding.
It first strips off suffix and tests later on for names starting with 'uni'.

Timo


> parse Unicode glyph names
> -------------------------
>
>                 Key: PDFBOX-433
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-433
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Timo Boehme
>            Priority: Minor

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
