[jira] [Commented] (PDFBOX-2896) XMPBox not creating valid title entry in DublinCoreSchema in trunk
[ https://issues.apache.org/jira/browse/PDFBOX-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634745#comment-14634745 ]

Hudson commented on PDFBOX-2896:

SUCCESS: Integrated in tika-trunk-jdk1.7 #796 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/796/])
TIKA-1678 -- initial commit. Need to wait for fix to PDFBOX-2896 to generate test file. (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1692042)
* /tika/trunk/tika-parsers/src/main/java/org/apache/pdfbox
* /tika/trunk/tika-parsers/src/main/java/org/apache/pdfbox/pdfparser
* /tika/trunk/tika-parsers/src/main/java/org/apache/pdfbox/pdfparser/PDFOctalUnicodeDecoder.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java

XMPBox not creating valid title entry in DublinCoreSchema in trunk

Key: PDFBOX-2896
URL: https://issues.apache.org/jira/browse/PDFBOX-2896
Project: PDFBox
Issue Type: Bug
Components: XmpBox
Affects Versions: 2.0.0
Reporter: Tim Allison
Priority: Minor

On TIKA-1678, I was trying to generate a test PDF that had a dc:title in the XMP with XMPBox from PDFBox's trunk. I modified the code from CreatePDFA by adding this:

{code}
DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
dc.setTitle("this is the title");
{code}

The generated PDF doesn't appear to have a compliant dc:title entry in the XMP. [~tilman] noted the divergence from the standard [here|https://issues.apache.org/jira/browse/TIKA-1678?focusedCommentId=14634045&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14634045].

What PDFBox does:

{code}
<dc:title>
  <rdf:Alt>
    <dc:li>this is the title</dc:li>
  </rdf:Alt>
</dc:title>
{code}

It should be:

{code}
<dc:title>
  <rdf:Alt>
    <rdf:li xml:lang="x-default">this is the title</rdf:li>
  </rdf:Alt>
</dc:title>
{code}

Error message from the PDF-Tools validator:

{quote}
'dc:li' is not allowed in arrays. The elements must be rdf:li or rdf:_N, where N is a positive number.
There is only one RDF resource allowed in XMP.
{quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
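The array rule quoted from the validator in PDFBOX-2896 can be illustrated with a small check. This is only a sketch using the JDK's DOM parser, not the PDF-Tools validator or anything from XMPBox; the class name `RdfArrayCheck` and the method `arrayItemsAreValid` are hypothetical, and the check covers only the "items must be rdf:li or rdf:_N" rule, not the rest of XMP validation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: checks that every element inside an rdf:Alt, rdf:Bag
// or rdf:Seq container is rdf:li or rdf:_N (N a positive number).
final class RdfArrayCheck {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    static boolean arrayItemsAreValid(String xml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to compare namespace URIs
            doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        for (String container : new String[] {"Alt", "Bag", "Seq"}) {
            NodeList arrays = doc.getElementsByTagNameNS(RDF_NS, container);
            for (int i = 0; i < arrays.getLength(); i++) {
                NodeList children = arrays.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (child.getNodeType() != Node.ELEMENT_NODE) {
                        continue; // skip whitespace and other non-element nodes
                    }
                    String name = child.getLocalName();
                    boolean inRdfNamespace = RDF_NS.equals(child.getNamespaceURI());
                    boolean itemName = name.equals("li") || name.matches("_[1-9][0-9]*");
                    if (!inRdfNamespace || !itemName) {
                        return false; // e.g. dc:li inside rdf:Alt
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String ns = "xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:rdf=\"" + RDF_NS + "\"";
        String reported = "<dc:title " + ns + "><rdf:Alt><dc:li>this is the title</dc:li>"
                + "</rdf:Alt></dc:title>";
        String expected = "<dc:title " + ns + "><rdf:Alt><rdf:li xml:lang=\"x-default\">"
                + "this is the title</rdf:li></rdf:Alt></dc:title>";
        System.out.println("reported form valid: " + arrayItemsAreValid(reported));
        System.out.println("expected form valid: " + arrayItemsAreValid(expected));
    }
}
```

The namespace declarations in `main` are added so the fragments parse on their own; in a real XMP packet they would normally sit on the enclosing rdf:RDF or rdf:Description element.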
[jira] [Commented] (PDFBOX-1130) ExtractText -html doesn't always close the p tags it opens
[ https://issues.apache.org/jira/browse/PDFBOX-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343741#comment-14343741 ]

Hudson commented on PDFBOX-1130:

SUCCESS: Integrated in tika-trunk-jdk1.7 #524 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/524/])
TIKA-758 clean up after remembering PDFBOX-1130 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1663424)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java

ExtractText -html doesn't always close the p tags it opens

Key: PDFBOX-1130
URL: https://issues.apache.org/jira/browse/PDFBOX-1130
Project: PDFBox
Issue Type: Bug
Reporter: Michael McCandless
Assignee: Andreas Lehmkühler
Priority: Minor
Fix For: 1.8.0
Attachments: 86.pdf, PDFBOX-1130.patch

I have a test document (same one on PDFBOX-1129), which when run through ExtractText -html, extracts the page number for each page, however in each case the page number looks like:

<p>N<p>Text of page N...

I.e., the p tag for the page number wasn't closed.

Maybe related: if I run ExtractText without -html, there is no space after the page number and before the next word, i.e. I see words like 1Massachusetts, 2Course, 3also, 4the.
[jira] [Commented] (PDFBOX-2122) FontBox's TTFDataStream doesn't set timezone in readInternationalDate
[ https://issues.apache.org/jira/browse/PDFBOX-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025435#comment-14025435 ]

Hudson commented on PDFBOX-2122:

SUCCESS: Integrated in tika-trunk-jdk1.7 #36 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/36/])
TIKA-1325: small workaround until we can integrate PDFBOX-2122. Default timezone is now set and then unset for ttf test in FontParsers test. (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1601444)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/font/FontParsersTest.java

FontBox's TTFDataStream doesn't set timezone in readInternationalDate

Key: PDFBOX-2122
URL: https://issues.apache.org/jira/browse/PDFBOX-2122
Project: PDFBox
Issue Type: Bug
Components: FontBox
Affects Versions: 1.8.5, 1.8.6, 2.0.0
Reporter: Tim Allison
Assignee: Tilman Hausherr
Priority: Trivial
Fix For: 1.8.6, 2.0.0
Attachments: PDFBOX-2122.patch

TTFDataStream doesn't set the timezone for the calendar. GregorianCalendar defaults to the system's timezone. This means that people in different timezones will get slightly different dates. (TIKA-1325).

One TTF Spec (https://developer.apple.com/fonts/TTRefMan/RM06/Chap6.html) doesn't specify the timezone, but my guess would be UTC...except that it is Apple, so maybe it's Cupertino. :)

--
This message was sent by Atlassian JIRA
(v6.2#6252)
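The timezone effect described in PDFBOX-2122 can be reproduced with the JDK alone. The sketch below is not the FontBox code or the attached patch; `TtfDateSketch` and `toCalendar` are hypothetical names. It assumes the TTF date is a count of seconds since midnight, January 1, 1904, and follows the reporter's guess of UTC as the reference zone. `GMT-08:00` is used as a fixed-offset stand-in for a machine whose default timezone is not UTC.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical sketch: converts a TTF-style "seconds since 1904-01-01" value
// to a Calendar, with the timezone of the 1904 epoch passed in explicitly.
final class TtfDateSketch {

    static Calendar toCalendar(long secondsSince1904, TimeZone zone) {
        GregorianCalendar cal = new GregorianCalendar(zone);
        cal.clear(); // reset all fields, including milliseconds
        cal.set(1904, Calendar.JANUARY, 1, 0, 0, 0); // epoch, interpreted in 'zone'
        long millis = cal.getTimeInMillis() + secondsSince1904 * 1000L;
        cal.setTimeInMillis(millis);
        return cal;
    }

    public static void main(String[] args) {
        long seconds = 3_000_000_000L; // arbitrary example value, not from a real font

        Calendar utc = toCalendar(seconds, TimeZone.getTimeZone("UTC"));
        Calendar west = toCalendar(seconds, TimeZone.getTimeZone("GMT-08:00"));

        // The same stored value maps to two different instants, because the
        // 1904 epoch itself was built in two different zones.
        long diffHours = (west.getTimeInMillis() - utc.getTimeInMillis()) / 3_600_000L;
        System.out.println("UTC instant (ms):       " + utc.getTimeInMillis());
        System.out.println("GMT-08:00 instant (ms): " + west.getTimeInMillis());
        System.out.println("difference (hours):     " + diffHours);
    }
}
```

Passing the zone explicitly, rather than relying on `new GregorianCalendar(1904, 0, 1)` with the JVM default, makes the result independent of the machine the parser runs on, which is the property the workaround in the FontParsers test had to simulate by setting and unsetting the default timezone.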
[jira] [Commented] (PDFBOX-2122) FontBox's TTFDataStream doesn't set timezone in readInternationalDate
[ https://issues.apache.org/jira/browse/PDFBOX-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025463#comment-14025463 ]

Hudson commented on PDFBOX-2122:

SUCCESS: Integrated in tika-trunk-jdk1.6 #36 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/36/])
TIKA-1325: small workaround until we can integrate PDFBOX-2122. Default timezone is now set and then unset for ttf test in FontParsers test. (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1601444)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/font/FontParsersTest.java

FontBox's TTFDataStream doesn't set timezone in readInternationalDate

Key: PDFBOX-2122
URL: https://issues.apache.org/jira/browse/PDFBOX-2122
Project: PDFBox
Issue Type: Bug
Components: FontBox
Affects Versions: 1.8.5, 1.8.6, 2.0.0
Reporter: Tim Allison
Assignee: Tilman Hausherr
Priority: Trivial
Fix For: 1.8.6, 2.0.0
Attachments: PDFBOX-2122.patch

TTFDataStream doesn't set the timezone for the calendar. GregorianCalendar defaults to the system's timezone. This means that people in different timezones will get slightly different dates. (TIKA-1325).

One TTF Spec (https://developer.apple.com/fonts/TTRefMan/RM06/Chap6.html) doesn't specify the timezone, but my guess would be UTC...except that it is Apple, so maybe it's Cupertino. :)