Re: Bug or known limitation?

George Van Treeck Tue, 15 Dec 2009 10:31:44 -0800

Thanks. I'll modify my local sources to ignore the "PS" subtype.

Also, I recommend the following code changes to fix problems that I have run 
into with pdfbox:


Line 970 in pdfparser.BaseParser to avoid an exception where the PDF contains 
an invalid value, like "t", etc.
<< if (trueString.equals("true"))
>> if ("true".startsWith(trueString))
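
To make the effect of that change concrete, here is a small standalone sketch 
(not PDFBox code; the class and method names are hypothetical) that compares 
the two checks on a few tokens. It assumes the token has already been read 
into a String:

```java
// Hypothetical helper, not part of PDFBox: compares the strict and the
// lenient check on a token that was read from a PDF.
public class BooleanTokenCheck {
    // Original check: only the complete keyword is accepted.
    public static boolean strictMatch(String token) {
        return token.equals("true");
    }

    // Proposed check: any prefix of "true" is accepted, e.g. "t" or "tr".
    public static boolean lenientMatch(String token) {
        return "true".startsWith(token);
    }

    public static void main(String[] args) {
        for (String token : new String[] {"true", "t", "tr", "", "trux"}) {
            System.out.println("'" + token + "' strict=" + strictMatch(token)
                + " lenient=" + lenientMatch(token));
        }
    }
}
```

Note that the lenient check also accepts an empty token, because every string 
starts with the empty string, so a length check may be wanted as well.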

Insert below line 240 in cos.COSString to guess at the appropriate character 
type. I know the code is not "right". But, it seems to avoid an exception in 
some PDFs that I crawl:
>> else if( data[0] >= (byte)0xC0 && data[0] <= (byte)0xFD )
>> {
>>   encoding = "UTF-8";
>>   start=2;
>> }
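
Java bytes are signed, so the range test above only works because both bounds 
are cast to byte (they become -64 and -3). The same test can be written with 
an unsigned mask, which may be easier to read. This is a standalone sketch 
with hypothetical names, not code from COSString:

```java
// Hypothetical helper, not part of PDFBox: tests whether the first byte of
// a string falls in the range 0xC0..0xFD used by the check above.
public class LeadByteCheck {
    public static boolean looksLikeMultiByteLead(byte[] data) {
        if (data.length == 0) {
            return false; // nothing to inspect
        }
        int first = data[0] & 0xFF; // convert the signed byte to 0..255
        return first >= 0xC0 && first <= 0xFD;
    }

    public static void main(String[] args) {
        byte[] accented = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 bytes of U+00E9
        byte[] ascii = {0x41};                        // 'A'
        byte[] utf16Bom = {(byte) 0xFE, (byte) 0xFF}; // UTF-16 byte order mark
        System.out.println(looksLikeMultiByteLead(accented));
        System.out.println(looksLikeMultiByteLead(ascii));
        System.out.println(looksLikeMultiByteLead(utf16Bom));
    }
}
```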

A lot of people have asked for this feature and some "experts" have replied 
that it's not possible. So, here is some code to implement the impossible. 
Below is some code to unscramble text so that each line of text in one column 
is joined with text that is roughly aligned with the lines in an adjacent 
column. I know next to nothing about PDFs, so I'm sure there are many use 
cases this does not cover. But, 10% of a feature is better than 0%.... Knowing 
a lot more about PDFs than I, you might want to inspect these changes to 
util.PDFTextStripper and rewrite them in a more appropriate manner.

Insert at line 438
    /**
     * This sorting handles text aligned into columns by using
     * column-based alignment to determine the text ordering.
     * Specifically, vertically adjacent items are grouped into sets,
     * where each set contains adjacent items with the same x (left horizontal)
     * coordinate. If a horizontally left-adjacent text item is part of a set
     * containing other vertically adjacent text items at the same x coordinate,
     * then the items in the first set form a separate column and are all added
     * to the list first, followed by the horizontally adjacent set.
     * 
     * @param textList the text positions to sort in place
     */
    @SuppressWarnings("unchecked")
    protected void sortByPosition(List<TextPosition> textList) {
      /**
       * A map from x coordinate to a set, each set containing a sublist of
       * text items all starting at the same column border.
       */
      final HashMap<Float, ArrayList<TextPosition>> set_map =
        new HashMap<Float, ArrayList<TextPosition>>();
      
      final int TEXT_LIST_SIZE = textList.size();
      if (TEXT_LIST_SIZE <= 1)
        return; // nothing to sort
      
      // Group into sets.
      Iterator<TextPosition> textIter = textList.iterator();
      while( textIter.hasNext() )
      {
          TextPosition position = textIter.next();
          float positionX = position.getXDirAdj();
          ArrayList<TextPosition> set = set_map.get( positionX );
          if (set == null)
          {
            set = new ArrayList<TextPosition>();
            set_map.put( positionX, set );
          }
          set.add( position );
      }
      
      // Sort each set
      final int MAP_SIZE = set_map.size();
      if (MAP_SIZE > 0) {
        // First, sort the sets.
        Iterator<Float> mapIter = set_map.keySet().iterator();
        final ArrayList<Float> map_index = new ArrayList<Float>(MAP_SIZE);
        while ( mapIter.hasNext() )
          map_index.add( mapIter.next() );
        // Sort by x coordinate of column margin.
        Collections.sort(map_index);
        // Second, sort within each set.
        for (int i = 0; i < MAP_SIZE; i++)
        {
          ArrayList<TextPosition> set = set_map.get( map_index.get(i) );
          if (set.size() > 1)
          {
            TextPositionComparator comparator = new TextPositionComparator();
            Collections.sort( set, comparator );
          }
        }
        // Third, coalesce horizontally adjacent text items.
        // Fourth, re-order the textList. Clear it first so that the items
        // are not appended a second time after the originals.
        textList.clear();
        for (int i = 0; i < MAP_SIZE; i++)
        {
          ArrayList<TextPosition> set = set_map.get( map_index.get(i) );
          Iterator<TextPosition> setIter = set.iterator();
          while ( setIter.hasNext() )
            textList.add( setIter.next() );
        }
      }
    }

Lines 462, 463:
<< TextPositionComparator comparator = new TextPositionComparator();
<< Collections.sort( textList, comparator );
>> sortByPosition(textList);
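
The method above keys the map on the exact float x coordinate, so items whose 
left edges differ slightly end up in different sets. One possible variation is 
to bucket the x coordinate with a tolerance. The following standalone sketch 
shows the idea; it does not use PDFBox, the Item class and the tolerance 
parameter are hypothetical, and it assumes that a smaller y means higher on 
the page. It needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not PDFBox code: orders items column by column.
public class ColumnSortSketch {
    // Stand-in for a positioned piece of text.
    public static final class Item {
        public final float x;
        public final float y;
        public final String text;

        public Item(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Reorders the list in place: columns from left to right, and within
    // each column by ascending y. Items whose x values round to the same
    // multiple of the tolerance are treated as one column.
    public static void sortByColumn(List<Item> items, float tolerance) {
        if (items.size() <= 1) {
            return; // nothing to sort
        }
        // TreeMap keeps the column buckets ordered by x.
        Map<Long, List<Item>> columns = new TreeMap<>();
        for (Item item : items) {
            long bucket = Math.round((double) item.x / tolerance);
            columns.computeIfAbsent(bucket, k -> new ArrayList<>()).add(item);
        }
        items.clear(); // avoid appending after the original entries
        for (List<Item> column : columns.values()) {
            column.sort(Comparator.comparingDouble((Item i) -> i.y));
            items.addAll(column);
        }
    }

    public static void main(String[] args) {
        List<Item> items = new ArrayList<>();
        items.add(new Item(72f, 100f, "L1"));
        items.add(new Item(300f, 100f, "R1"));
        items.add(new Item(72.3f, 112f, "L2"));
        items.add(new Item(300f, 112f, "R2"));
        sortByColumn(items, 2f);
        for (Item item : items) {
            System.out.println(item.text);
        }
    }
}
```

Rounding to fixed buckets can still split two nearby values that fall on 
either side of a bucket boundary, so this is only a rough approximation of 
column detection.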


Thanks,
George Van Treeck



----- Original Message ----
From: Andreas Lehmkühler <[email protected]>
To: George Van Treeck <[email protected]>; [email protected]
Sent: Tue, December 15, 2009 4:34:45 AM
Subject: Re: Bug or known limitation?

Hi,

Sent: Tue, 15 Dec 2009 From: George Van Treeck<[email protected]>

> I ran into the exception below when using an older 0.8 version. So, I did a
> build using HEAD from subversion. And the exception persists. The following
> is output from a little web crawler I wrote.
> 
> ERROR: Unable to load PDF document:
> http://www.polaroid.com/media/document/a932manualEN20091019.pdf
> java.io.IOException: Unknown xobject subtype 'PS'
> at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject.createXObject(PDXObject.java:165)
> at org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:161)
> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:226)
> at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:206)
> at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
> at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
> at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
> at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
> at webcrawler.WebCrawler.getContent(WebCrawler.java:1444)
> 
PDFBox doesn't support that kind of subtype for XObjects. Referring to the PDF 
reference manual (v1.7, chapter 4.7.1, PostScript XObjects), it's rarely used 
and shouldn't have any effect when viewing the document. It could only be used 
when printing on a PS-enabled printer. This feature is likely to be removed 
from PDF in a future version.

PDFBox should ignore those PS XObjects in future.

> -George
> 

BR
Andreas Lehmkühler
