Thanks. I'll modify my local sources to ignore the "PS" subtype.
Also, I recommend the following code changes to fix problems that I have run
into with pdfbox:
Line 970 in pdfparser.BaseParser, to avoid an exception when the PDF contains
an invalid value such as "t":
<< if (trueString.equals("true"))
>> if ("true".startsWith(trueString))
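One caveat on the startsWith form: in Java every string starts with the empty
string, so an empty token would also be read as true. The following is only a
sketch of the same lenient check with that guard added; the names
LenientBoolean and isTruePrefix are made up for illustration and are not part
of pdfbox.

```java
class LenientBoolean {
    /**
     * Returns true when token is a non-empty prefix of "true"
     * ("t", "tr", "tru" or "true"). The null/empty guard is an addition;
     * without it, "true".startsWith("") would return true.
     */
    public static boolean isTruePrefix(String token) {
        return token != null && token.length() > 0 && "true".startsWith(token);
    }

    public static void main(String[] args) {
        String[] samples = { "true", "t", "", "false" };
        for (String s : samples) {
            System.out.println("'" + s + "' -> " + isTruePrefix(s));
        }
    }
}
```

Whether an empty token should count as true or raise an error is a judgment
call for the parser; the sketch simply treats it as not true.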
Insert below line 240 in cos.COSString to guess at the appropriate character
encoding. I know the code is not "right", but it seems to avoid an exception in
some PDFs that I crawl:
>> else if( data[0] >= (byte)0xC0 && data[0] <= (byte)0xFD )
>> {
>> encoding = "UTF-8";
>> start=2;
>> }
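A hedged alternative to looking only at the first byte: ask the JDK's strict
UTF-8 decoder whether the whole byte array is well formed, and fall back to a
single-byte charset otherwise. This is a different technique from the patch
above, and not what COSString does; EncodingGuess, looksLikeUtf8 and
guessEncoding are hypothetical names, and ISO-8859-1 as the fallback is an
assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class EncodingGuess {
    /** Returns true if the whole byte array decodes as well-formed UTF-8. */
    public static boolean looksLikeUtf8(byte[] data) {
        try {
            Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // Malformed or truncated sequence somewhere in the data.
            return false;
        }
    }

    /** Hypothetical helper: picks a charset name, falling back to ISO-8859-1. */
    public static String guessEncoding(byte[] data) {
        return looksLikeUtf8(data) ? "UTF-8" : "ISO-8859-1";
    }

    public static void main(String[] args) {
        byte[] eAcuteUtf8 = { (byte) 0xC3, (byte) 0xA9 };   // "é" in UTF-8
        byte[] eAcuteLatin1 = { (byte) 0xE9 };              // "é" in ISO-8859-1
        System.out.println(guessEncoding(eAcuteUtf8));
        System.out.println(guessEncoding(eAcuteLatin1));
    }
}
```

Note that 0xC0 and 0xC1 never appear in well-formed UTF-8, so a strict decoder
rejects some inputs that the first-byte range check accepts.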
A lot of people have asked for this feature, and some "experts" have replied
that it's not possible. So, here is some code to implement the impossible.
Below is some code to unscramble text so that each line of text in one column
is joined with text that is roughly aligned with the lines in an adjacent column.
I know next to nothing about PDFs, so I'm sure there are many use cases this does
not cover. But 10% of a feature is better than 0%... Since you know a lot more
about PDFs than I do, you might want to inspect these changes to
util.PDFTextStripper and rewrite them in a more appropriate manner.
Insert at line 438
/**
* This sorting handles text aligned into columns by using
* column-based alignment to determine the text ordering.
* Specifically, vertically adjacent items are grouped into sets,
* where each set contains adjacent items with the same x (left horizontal)
* coordinate. If a horizontally left-adjacent text item is part of a set
* containing other vertically adjacent text items at the same x coordinate,
* then the items in the first set are a separate column and are all added to
* the list first, followed by the horizontally adjacent set.
*
* @param textList the text positions, reordered in place
*/
@SuppressWarnings("unchecked")
protected void sortByPosition(List<TextPosition> textList) {
    /*
     * A map from x coordinate to the set of text items
     * all starting at that column border.
     */
    final HashMap<Float, ArrayList<TextPosition>> set_map =
        new HashMap<Float, ArrayList<TextPosition>>();
    final int TEXT_LIST_SIZE = textList.size();
    if (TEXT_LIST_SIZE <= 1)
        return; // nothing to sort
    // Group into sets.
    Iterator<TextPosition> textIter = textList.iterator();
    while( textIter.hasNext() )
    {
        TextPosition position = textIter.next();
        float positionX = position.getXDirAdj();
        ArrayList<TextPosition> set = set_map.get( positionX );
        if (set == null)
        {
            set = new ArrayList<TextPosition>();
            set_map.put( positionX, set );
        }
        set.add( position );
    }
    // Sort each set
    final int MAP_SIZE = set_map.size();
    if (MAP_SIZE > 0) {
        // First, sort the sets.
        Iterator<Float> mapIter = set_map.keySet().iterator();
        final ArrayList<Float> map_index = new ArrayList<Float>(MAP_SIZE);
        while ( mapIter.hasNext() )
            map_index.add( mapIter.next() );
        // Sort by x coordinate of column margin.
        Collections.sort(map_index);
        // Second, sort within each set.
        for (int i = 0; i < MAP_SIZE; i++)
        {
            ArrayList<TextPosition> set = set_map.get( map_index.get(i) );
            if (set.size() > 1)
            {
                TextPositionComparator comparator = new TextPositionComparator();
                Collections.sort( set, comparator );
            }
        }
        // Third, coalesce horizontally adjacent text items (not implemented yet).
        // Fourth, re-order the textList. Clear it first; otherwise the sorted
        // items are appended after the originals and every item appears twice.
        textList.clear();
        for (int i = 0; i < MAP_SIZE; i++)
        {
            ArrayList<TextPosition> set = set_map.get( map_index.get(i) );
            Iterator<TextPosition> setIter = set.iterator();
            while ( setIter.hasNext() )
                textList.add( setIter.next() );
        }
    }
}
Lines 462, 463:
<< TextPositionComparator comparator = new TextPositionComparator();
<< Collections.sort( textList, comparator );
>> sortByPosition(textList);
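To show the idea in isolation, here is a minimal standalone sketch of the same
grouping, using a made-up Item class in place of TextPosition (the names
ColumnSortSketch, Item, sortByColumn and join are hypothetical). It assumes y
grows downward and, like the code above, only treats items with exactly equal
x as one column.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ColumnSortSketch {
    /** Hypothetical stand-in for TextPosition: text plus left x and y. */
    static final class Item {
        final String text;
        final float x;
        final float y;
        Item(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Reorders items in place: columns left to right, top to bottom within each. */
    public static void sortByColumn(List<Item> items) {
        if (items.size() <= 1) {
            return; // nothing to sort
        }
        // TreeMap keeps the column keys sorted by x coordinate.
        Map<Float, List<Item>> columns = new TreeMap<Float, List<Item>>();
        for (Item item : items) {
            List<Item> column = columns.get(item.x);
            if (column == null) {
                column = new ArrayList<Item>();
                columns.put(item.x, column);
            }
            column.add(item);
        }
        // Rebuild the list from scratch so no item appears twice.
        items.clear();
        for (List<Item> column : columns.values()) {
            Collections.sort(column, new Comparator<Item>() {
                public int compare(Item a, Item b) {
                    return Float.compare(a.y, b.y);
                }
            });
            items.addAll(column);
        }
    }

    /** Joins the item texts with single spaces, in list order. */
    public static String join(List<Item> items) {
        StringBuilder sb = new StringBuilder();
        for (Item item : items) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(item.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Item> items = new ArrayList<Item>();
        items.add(new Item("A1", 10f, 100f));
        items.add(new Item("B1", 200f, 100f));
        items.add(new Item("A2", 10f, 120f));
        items.add(new Item("B2", 200f, 120f));
        sortByColumn(items);
        System.out.println(join(items)); // prints "A1 A2 B1 B2"
    }
}
```

Keying on the exact float value is fragile for real documents; grouping x
values within some tolerance would probably be needed, which this sketch does
not attempt.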
Thanks,
George Van Treeck
----- Original Message ----
From: Andreas Lehmkühler <[email protected]>
To: George Van Treeck <[email protected]>; [email protected]
Sent: Tue, December 15, 2009 4:34:45 AM
Subject: Re: Bug or known limitation?
Hi,
Sent: Tue, Dec 15, 2009 From: George Van Treeck<[email protected]>
> I ran into the exception below when using an older 0.8 version. So, I did a
> build using HEAD from subversion. And the exception persists. The following
> is output from a little web crawler I wrote.
>
> ERROR: Unable to load PDF document:
> http://www.polaroid.com/media/document/a932manualEN20091019.pdf
> java.io.IOException: Unknown xobject subtype 'PS'
> at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject.createXObject(PDXObject.java:165)
> at org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:161)
> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:226)
> at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:206)
> at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
> at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
> at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
> at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
> at webcrawler.WebCrawler.getContent(WebCrawler.java:1444)
>
PDFBox doesn't support that kind of subtype for XObjects. According to the PDF
reference manual (v1.7, chapter 4.7.1 "PostScript XObjects"), it's rarely used and
shouldn't have any effect when viewing the document. It could only be used when
printing on a PostScript-enabled printer. This feature is likely to be removed
from PDF in a future version.
PDFBox should ignore those PS XObjects in the future.
> -George
>
BR
Andreas Lehmkühler