org.pdfbox.filter.ASCIIHexFilter does not skip Whitespace
---------------------------------------------------------

                 Key: PDFBOX-390
                 URL: https://issues.apache.org/jira/browse/PDFBOX-390
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 0.8.0-incubator
            Reporter: Mathias Bosch
             Fix For: 0.8.0-incubator


org.pdfbox.filter.ASCIIHexFilter does not skip Whitespace

According to the specification (pdf_reference_1-7.pdf), all whitespace
characters between the ASCII hex digits must be skipped (see section 3.3.1,
ASCIIHexDecode Filter).

The 0.8.0-incubator source instead tries to decode those whitespace
characters, so the byte values come out wrong: every character outside
[0-9a-f] maps to -1, but processing continues anyway.
The result is an invalid byte stream.
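
The failure described above can be reproduced with a small standalone sketch.
It is not the 0.8.0-incubator source: NaiveHexDemo, its REVERSE_HEX table and
naiveDecode are hypothetical stand-ins that only assume the behaviour reported
here (non-hex characters map to -1, characters are paired without skipping
whitespace).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class NaiveHexDemo {
    // Hypothetical stand-in for the filter's lookup table: -1 for non-hex characters.
    static final int[] REVERSE_HEX = new int[256];
    static {
        Arrays.fill(REVERSE_HEX, -1);
        for (int i = 0; i < 10; i++) {
            REVERSE_HEX['0' + i] = i;
        }
        for (int i = 0; i < 6; i++) {
            REVERSE_HEX['a' + i] = 10 + i;
            REVERSE_HEX['A' + i] = 10 + i;
        }
    }

    // Pairs up characters blindly, whitespace included (assumed old behaviour).
    static byte[] naiveDecode(String text) {
        ByteArrayInputStream in =
            new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int first;
        while ((first = in.read()) != -1) {
            int value = REVERSE_HEX[first] * 16;
            int second = in.read();
            if (second >= 0) {
                value += REVERSE_HEX[second];
            }
            out.write(value); // only the low 8 bits end up in the output
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "FF CE" should decode to FF CE, but the space shifts the pairing.
        for (byte b : naiveDecode("FF CE")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // prints "FF FC E0 "
    }
}
```

The space is paired with 'C' (-16 + 12 = -4, written as 0xFC), and every
later digit is shifted by one position.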

The ASCIIHexDecode Filter section also defines '>' as the EOD (end-of-data)
marker of the byte stream, which might ease the parsing of inline images.
(In an inline image, the EI operator should follow the EOD.)

Example of an ASCII-hex encoded value, copied from the spec:
FF CE A3 7C 5B 3F 28 16 0A 02 00 02 0A 16 28 3F 5B 7C A3 CE FF >
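
As a cross-check, the example above can be decoded with a minimal standalone
sketch of the rules quoted from the spec: skip whitespace, stop at '>', and
treat a trailing single digit as if it were followed by 0. AsciiHexSketch and
its decode method are hypothetical names for illustration and are not part of
PDFBox.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class AsciiHexSketch {
    // Whitespace characters as listed in the PDF reference.
    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    static byte[] decode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high digit, or -1 if none is pending
        for (byte b : text.getBytes(StandardCharsets.US_ASCII)) {
            int c = b & 0xFF;
            if (c == '>') {
                break; // EOD: ignore anything that follows
            }
            if (isWhitespace(c)) {
                continue; // whitespace may appear anywhere, even inside a pair
            }
            int digit = Character.digit(c, 16);
            if (digit < 0) {
                throw new IllegalArgumentException("Invalid hex character: " + c);
            }
            if (high < 0) {
                high = digit;
            } else {
                out.write(high * 16 + digit);
                high = -1;
            }
        }
        if (high >= 0) {
            out.write(high * 16); // odd digit count: last digit is padded with 0
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = decode(
            "FF CE A3 7C 5B 3F 28 16 0A 02 00 02 0A 16 28 3F 5B 7C A3 CE FF >");
        System.out.println(data.length); // prints 21
    }
}
```

Unlike the hint code below, this sketch rejects invalid characters with an
exception instead of printing a message; that is a design choice of the
sketch, not something the spec text quoted here prescribes.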


I fixed the problem locally so that I could continue with my work.
I am pasting the changed code here as a hint that might help to fix the bug.

public class ASCIIHexFilter
  implements Filter
{

 /**
  * Whitespace
  *   0  0x00  Null (NUL)
  *   9  0x09  Tab (HT)
  *  10  0x0A  Line feed (LF)
  *  12  0x0C  Form feed (FF)
  *  13  0x0D  Carriage return (CR)
  *  32  0x20  Space (SP)  
  */
  protected boolean isWhitespace(int c) {
    return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
  }
  
  protected boolean isEOD(int c) {
    return (c == 62); // '>' - EOD
  }

  /**
   * {@inheritDoc}
   */
  public void decode(InputStream compressedData, OutputStream result,
      COSDictionary options, int filterIndex) throws IOException {
    int value = 0;
    int firstByte = 0;
    int secondByte = 0;
    while ((firstByte = compressedData.read()) != -1) {

      // skip whitespace before the first hex digit
      while (isWhitespace(firstByte))
        firstByte = compressedData.read();

      // stop at EOD, or if the stream ended while skipping whitespace
      if (firstByte == -1 || isEOD(firstByte))
        break;

      if (REVERSE_HEX[firstByte] == -1)
        System.out.println("Invalid Hex Code; int: " + firstByte
            + " char: " + (char) firstByte);

      value = REVERSE_HEX[firstByte] * 16;

      // whitespace may also separate the two digits of a pair
      do {
        secondByte = compressedData.read();
      } while (isWhitespace(secondByte));

      if (secondByte == -1 || isEOD(secondByte)) {
        // a missing second digit behaves like 0
        result.write(value);
        break;
      }

      if (REVERSE_HEX[secondByte] == -1)
        System.out.println("Invalid Hex Code; int: " + secondByte
            + " char: " + (char) secondByte);

      value += REVERSE_HEX[secondByte];
      result.write(value);
    }
    
    result.flush();
  }

// .....................................................
// other code remains unchanged





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
