But I remember that you were complaining about the performance, and I explained a way to do this that would require some work.

Alternatively, try the code below, which gets the dictionary without using the preflight library by extending the parser and using old-style document loading. What the code does is load "orphan" objects, which the main parser doesn't do.

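import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSObject;
import org.apache.pdfbox.cos.COSObjectKey;
import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;

public class LinearizedCheck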
{
    public static void main(String[] args) throws IOException
    {
        RandomAccessBufferedFileInputStream raf = new RandomAccessBufferedFileInputStream(new File("test_medium.pdf"));
        PDFParser parser = new MyPDFParser(raf);
        parser.parse();
        PDDocument document = parser.getPDDocument();
        System.out.println(getLinearizedDictionary(document));
        document.close();
    }

    // from preflight
    static COSDictionary getLinearizedDictionary(PDDocument document)
    {
        // ---- Get Ref to obj
        COSDocument cDoc = document.getDocument();
        List<?> lObj = cDoc.getObjects();
        for (Object object : lObj)
        {
            COSBase curObj = ((COSObject) object).getObject();
            if (curObj instanceof COSDictionary
                    && ((COSDictionary) curObj).keySet().contains(COSName.getPDFName("Linearized")))
            {
                return (COSDictionary) curObj;
            }
        }
        return null;
    }

    private static class MyPDFParser extends PDFParser
    {
        MyPDFParser(RandomAccessBufferedFileInputStream raf) throws IOException
        {
            super(raf);
        }

        // from preflight
        @Override
        protected void initialParse() throws InvalidPasswordException, IOException
        {
            super.initialParse();
            // For each ObjectKey, we check if the object has been loaded
            // useful for linearized PDFs
            Map<COSObjectKey, Long> xrefTable = document.getXrefTable();
            for (Map.Entry<COSObjectKey, Long> entry : xrefTable.entrySet())
            {
                COSObject co = document.getObjectFromPool(entry.getKey());
                if (co.getObject() == null)
                {
                    // object isn't loaded - parse the object to load its content
                    parseObjectDynamically(co, true);
                }
            }
        }
    }
}


According to the PDF specification: "The linearization parameter dictionary shall be entirely contained within the first 1024 bytes of the PDF file. This limits the amount of data a conforming reader must read before deciding whether the file is linearized."

So if I were you, I'd download the first 1024 bytes, search for "/Linearized" with the Knuth-Morris-Pratt algorithm,
https://stackoverflow.com/questions/1507780/searching-for-a-sequence-of-bytes-in-a-binary-file-with-java
and only if you get a hit use the code above.
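
A minimal sketch of that pre-check, in case it helps. It uses only the JDK, not PDFBox. The class and method names (LinearizedSniffer, readHead, indexOf, mightBeLinearized) are made up for illustration, and I'm assuming you can get an InputStream on the file. A hit only means the file is probably linearized, so you'd still run the parser code afterwards to get the actual dictionary.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox.
final class LinearizedSniffer
{
    private static final byte[] KEY = "/Linearized".getBytes(StandardCharsets.US_ASCII);
    private static final int HEAD_SIZE = 1024;

    // Reads at most the first 1024 bytes; returns fewer if the stream is shorter.
    static byte[] readHead(InputStream in) throws IOException
    {
        byte[] buf = new byte[HEAD_SIZE];
        int total = 0;
        while (total < HEAD_SIZE)
        {
            int n = in.read(buf, total, HEAD_SIZE - total);
            if (n < 0)
            {
                break; // end of stream
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    // Knuth-Morris-Pratt search: index of the first match of pattern in data, or -1.
    static int indexOf(byte[] data, byte[] pattern)
    {
        if (pattern.length == 0)
        {
            return 0;
        }
        // failure[i] = length of the longest proper prefix of pattern[0..i]
        // that is also a suffix of it
        int[] failure = new int[pattern.length];
        for (int i = 1, k = 0; i < pattern.length; i++)
        {
            while (k > 0 && pattern[k] != pattern[i])
            {
                k = failure[k - 1];
            }
            if (pattern[k] == pattern[i])
            {
                k++;
            }
            failure[i] = k;
        }
        for (int i = 0, k = 0; i < data.length; i++)
        {
            while (k > 0 && pattern[k] != data[i])
            {
                k = failure[k - 1];
            }
            if (pattern[k] == data[i])
            {
                k++;
            }
            if (k == pattern.length)
            {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    // True if "/Linearized" occurs within the first 1024 bytes of the stream.
    static boolean mightBeLinearized(InputStream in) throws IOException
    {
        return indexOf(readHead(in), KEY) >= 0;
    }
}
```

You would call LinearizedSniffer.mightBeLinearized(...) on the stream first and only start the parser when it returns true.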

Tilman



On 25.07.2017 at 10:34, karthick g wrote:
Hi team,

Based on my analysis, I have found one thing regarding linearized PDFs in
PDFBox 2.0 and above.

COSDocument cDoc = pdDoc.getDocument();
List<COSObject> lObj = cDoc.getObjects();
for (COSObject object : lObj)
{
    System.out.println(object.getObjectNumber());
}

With this code I retrieve the COSObject numbers of the PDDocument, and it
prints the COSObjects sequentially.
PDFBox 1.8.2 and 2.0.6 behave the same, except that the COSObject pointing
to the Linearized dictionary is not added in 2.0.6.

748 0 obj <</Linearized 1/L 1829691/O 752/E 171783/N 9/T 1814683/H [ 3196
824]>> endobj

Object 748 0, which is present in 1.8.2, is not present in 2.0.6. Is this
finding correct, and can you guide me on how to fix it?
If it is fixed, I will be able to retrieve the Linearized dictionary without
using the preflight jar.

PDFBox 1.8.2
===========
COSObject{1, 0}
---------------------
------------------------------
---------------------------
COSObject{747, 0}
COSObject{748, 0}
COSObject{749, 0}
---------

PDFBox 2.0.6
===========
COSObject{1, 0}
---------------------
------------------------------
---------------------------
COSObject{747, 0}
COSObject{749, 0}
---------

Regards,
Karthick G

On Thu, Jul 13, 2017 at 12:02 PM, karthick g <[email protected]>
wrote:

Hi Team,

In our project we want to read the linearization dictionary. Before the 2.0
versions we were able to get that dictionary with ordinary workarounds,
without loading a preflight document. Since 2.0 we have to load the
preflight document to get the linearized property, which requires an
additional workaround and costs the project performance. Will there be a
workaround in the next release, so that the linearized property can be
retrieved without loading the preflight document?

Regards,
Karthick G



