[jira] [Commented] (PDFBOX-2896) XMPBox not creating valid "title" entry in DublinCoreSchema in trunk
[ https://issues.apache.org/jira/browse/PDFBOX-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634745#comment-14634745 ]

Hudson commented on PDFBOX-2896:

SUCCESS: Integrated in tika-trunk-jdk1.7 #796 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/796/])
TIKA-1678 -- initial commit. Need to wait for fix to PDFBOX-2896 to generate test file. (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1692042)
* /tika/trunk/tika-parsers/src/main/java/org/apache/pdfbox
* /tika/trunk/tika-parsers/src/main/java/org/apache/pdfbox/pdfparser
* /tika/trunk/tika-parsers/src/main/java/org/apache/pdfbox/pdfparser/PDFOctalUnicodeDecoder.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java

> XMPBox not creating valid "title" entry in DublinCoreSchema in trunk
>
> Key: PDFBOX-2896
> URL: https://issues.apache.org/jira/browse/PDFBOX-2896
> Project: PDFBox
> Issue Type: Bug
> Components: XmpBox
> Affects Versions: 2.0.0
> Reporter: Tim Allison
> Priority: Minor
>
> On TIKA-1678, I was trying to generate a test PDF that had a dc:title in the
> XMP with XMPBox from PDFBox's trunk. I modified the code from CreatePDFA by
> adding this:
> {code}
> DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
> dc.setTitle("this is the title");
> {code}
> The generated PDF doesn't appear to have a compliant dc:title entry in the
> XMP.
> [~tilman] noted the divergence from the standard
> [here|https://issues.apache.org/jira/browse/TIKA-1678?focusedCommentId=14634045&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14634045].
> What PDFBox does:
> {code}
>
>
> this is the title
>
>
> {code}
> It should be:
> {code}
>
>
> this is the title
>
>
> {code}
> Error message from the PDF-Tools validator:
> {quote}
> 'dc:li' is not allowed in arrays. The elements must be rdf:li or rdf:_N,
> where N is a positive number.
> There is only one RDF resource allowed in XMP.
> {quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
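The validator message quoted in PDFBOX-2896 says that array items under dc:title must be rdf:li, not dc:li. The markup in the report's two {code} blocks did not survive in this archive, so the following is only a rough sketch of that rule, not the PDFBox fix. It uses the JDK's DOM parser; the class DcTitleCheckSketch, its methods, and the sample XMP fragment are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of PDFBox or XMPBox.
final class DcTitleCheckSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Builds an assumed XMP fragment whose array item uses the given tag name.
    static String sample(String itemTag) {
        return "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\" xmlns:dc=\"" + DC_NS + "\">"
            + "<rdf:Description rdf:about=\"\">"
            + "<dc:title><rdf:Alt>"
            + "<" + itemTag + " xml:lang=\"x-default\">this is the title</" + itemTag + ">"
            + "</rdf:Alt></dc:title>"
            + "</rdf:Description></rdf:RDF>";
    }

    // Returns true when every element directly inside dc:title/rdf:Alt is rdf:li.
    // Simplification: the rdf:_N form that the validator also accepts is rejected here.
    static boolean titleItemsAreRdfLi(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
            for (int i = 0; i < titles.getLength(); i++) {
                NodeList alts = ((Element) titles.item(i)).getElementsByTagNameNS(RDF_NS, "Alt");
                for (int j = 0; j < alts.getLength(); j++) {
                    for (Node c = alts.item(j).getFirstChild(); c != null; c = c.getNextSibling()) {
                        boolean isElement = c.getNodeType() == Node.ELEMENT_NODE;
                        boolean isRdfLi = RDF_NS.equals(c.getNamespaceURI())
                            && "li".equals(c.getLocalName());
                        if (isElement && !isRdfLi) {
                            return false;
                        }
                    }
                }
            }
            return true;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("dc:li  -> " + titleItemsAreRdfLi(sample("dc:li")));
        System.out.println("rdf:li -> " + titleItemsAreRdfLi(sample("rdf:li")));
    }
}
```

A check like this could be run against the XMP packet of a generated test PDF before committing it as a test file. It compares namespace URIs rather than prefixes, so it does not depend on the prefix a writer happens to choose.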
[jira] [Commented] (PDFBOX-1130) ExtractText -html doesn't always close the tags it opens
[ https://issues.apache.org/jira/browse/PDFBOX-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343741#comment-14343741 ]

Hudson commented on PDFBOX-1130:

SUCCESS: Integrated in tika-trunk-jdk1.7 #524 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/524/])
TIKA-758 clean up after remembering PDFBOX-1130 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1663424)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java

> ExtractText -html doesn't always close the tags it opens
>
> Key: PDFBOX-1130
> URL: https://issues.apache.org/jira/browse/PDFBOX-1130
> Project: PDFBox
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Andreas Lehmkühler
> Priority: Minor
> Fix For: 1.8.0
>
> Attachments: 86.pdf, PDFBOX-1130.patch
>
> I have a test document (same one on PDFBOX-1129), which when run through
> ExtractText -html, extracts the page number for each page, however in each
> case the page number looks like:
> NText of page N...
> I.e., the tag for the page number wasn't closed.
> Maybe related: if I run ExtractText without html, there is no space after
> the page number and before the next word, i.e., I see words like 1Massachusetts,
> 2Course, 3also, 4the.
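For anyone wanting to test for the unclosed-tag symptom described in PDFBOX-1130, here is a rough stack-based sketch. It is not how ExtractText or PDF2XHTML work. The class TagBalanceSketch is hypothetical, and the sample strings in main are assumed, since the markup in the report's example was lost in this archive.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
final class TagBalanceSketch {
    // Matches <name ...>, </name> and <name ... />; does not handle comments or CDATA.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>");

    // Returns the name of the first tag found unclosed or closed out of order,
    // or null when every opened tag is closed in order.
    static String firstUnbalancedTag(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (selfClosing) {
                continue; // <br/> opens and closes itself
            }
            if (!closing) {
                open.push(name);
            } else if (open.isEmpty() || !open.pop().equals(name)) {
                return name;
            }
        }
        return open.isEmpty() ? null : open.peek();
    }

    public static void main(String[] args) {
        // Assumed sample inputs; the real ExtractText output may differ.
        System.out.println(firstUnbalancedTag("<p>1<p>Text of page 1...</p>"));
        System.out.println(firstUnbalancedTag("<p>1</p><p>Text of page 1...</p>"));
    }
}
```

Real XHTML output is probably better validated with an XML parser; the regex here only keeps the sketch short.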
[jira] [Commented] (PDFBOX-2383) PDFBox tests include copyright files
[ https://issues.apache.org/jira/browse/PDFBOX-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309876#comment-14309876 ]

Hudson commented on PDFBOX-2383:

SUCCESS: Integrated in tika-trunk-jdk1.7 #474 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/474/])
TIKA-1542 substitute Apache friendly TTF test file for our current copyrighted file, take 2. See PDFBOX-2383 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1657952)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/font/FontParsersTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testTrueType2.ttf
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testTrueType3.ttf

> PDFBox tests include copyright files
>
> Key: PDFBOX-2383
> URL: https://issues.apache.org/jira/browse/PDFBOX-2383
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 1.8.7, 2.0.0
> Reporter: John Hewson
> Assignee: Tilman Hausherr
> Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: Aclonica.ttf
>
> The test files for PDFBox, FontBox, and Preflight include several files under
> copyright which we probably don't have permission to redistribute, and need
> to be removed (or preferably replaced):
> pdfbox/src/test/resources/org/apache/pdfbox/
> - -ttf/ArialMT.ttf (This is actually Bitstream Vera Sans - the license on
> this might be ok though?)-
> - -pdfparser/gdb-refcard.pdf (GPL licensed)-
> - -pdmodel/page_label.pdf (Edited by Foxit PDF for Evaluation Only)-
> - -pdmodel/font/256.pdf (Copyright 2004 Journal of Combinatorics)-
> fontbox/src/test/resources/ttf/
> - -testTrueType.ttf (NewBaskerville, Copyright © 2002 Veronika Elsner)-
> preflight/src/test/resources/org/apache/padaf/preflight/font/
> - -true_type.ttf (Subset of Microsoft Arial)-
[jira] [Commented] (PDFBOX-2122) FontBox's TTFDataStream doesn't set timezone in readInternationalDate
[ https://issues.apache.org/jira/browse/PDFBOX-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025463#comment-14025463 ]

Hudson commented on PDFBOX-2122:

SUCCESS: Integrated in tika-trunk-jdk1.6 #36 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/36/])
TIKA-1325: small workaround until we can integrate PDFBOX-2122. Default timezone is now set and then unset for ttf test in FontParsers test. (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1601444)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/font/FontParsersTest.java

> FontBox's TTFDataStream doesn't set timezone in readInternationalDate
>
> Key: PDFBOX-2122
> URL: https://issues.apache.org/jira/browse/PDFBOX-2122
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 1.8.5, 1.8.6, 2.0.0
> Reporter: Tim Allison
> Assignee: Tilman Hausherr
> Priority: Trivial
> Fix For: 1.8.6, 2.0.0
>
> Attachments: PDFBOX-2122.patch
>
> TTFDataStream doesn't set the timezone for the calendar. GregorianCalendar
> defaults to the system's timezone. This means that people in different
> timezones will get slightly different dates. (TIKA-1325).
> One TTF Spec (https://developer.apple.com/fonts/TTRefMan/RM06/Chap6.html)
> doesn't specify the timezone, but my guess would be UTC...except that it is
> Apple, so maybe it's Cupertino. :)

--
This message was sent by Atlassian JIRA
(v6.2#6252)
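To make the timezone effect described in PDFBOX-2122 concrete, here is a minimal sketch, not the attached patch. It converts a TrueType timestamp (seconds since midnight, January 1, 1904) to a Calendar and shows that the instant depends on the zone the calendar was built with. The class TtfDateSketch and its method are hypothetical, and using UTC is only the reporter's guess, since the spec cited in the report does not name a timezone.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical helper for illustration; not FontBox's TTFDataStream.
final class TtfDateSketch {
    // Interprets "seconds since 1904-01-01 00:00:00" in the given zone.
    // Passing the zone explicitly avoids GregorianCalendar's system default.
    static Calendar fromSecondsSince1904(long seconds, TimeZone zone) {
        Calendar cal = new GregorianCalendar(zone);
        cal.clear(); // reset all fields, including milliseconds
        cal.set(1904, Calendar.JANUARY, 1, 0, 0, 0);
        cal.setTimeInMillis(cal.getTimeInMillis() + seconds * 1000L);
        return cal;
    }

    public static void main(String[] args) {
        long utc = fromSecondsSince1904(0L, TimeZone.getTimeZone("UTC")).getTimeInMillis();
        long west = fromSecondsSince1904(0L, TimeZone.getTimeZone("GMT-08:00")).getTimeInMillis();
        // The same stored value maps to instants eight hours apart.
        System.out.println("UTC epoch millis:       " + utc);
        System.out.println("GMT-08:00 epoch millis: " + west);
        System.out.println("difference in hours:    " + (west - utc) / 3_600_000L);
    }
}
```

Fixing the zone in the reader, whichever zone is chosen, would presumably make a test such as FontParsersTest independent of the machine's default timezone, so the set-and-unset workaround mentioned in the commit message would no longer be needed.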
[jira] [Commented] (PDFBOX-2122) FontBox's TTFDataStream doesn't set timezone in readInternationalDate
[ https://issues.apache.org/jira/browse/PDFBOX-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025435#comment-14025435 ]

Hudson commented on PDFBOX-2122:

SUCCESS: Integrated in tika-trunk-jdk1.7 #36 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/36/])
TIKA-1325: small workaround until we can integrate PDFBOX-2122. Default timezone is now set and then unset for ttf test in FontParsers test. (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1601444)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/font/FontParsersTest.java

> FontBox's TTFDataStream doesn't set timezone in readInternationalDate
>
> Key: PDFBOX-2122
> URL: https://issues.apache.org/jira/browse/PDFBOX-2122
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 1.8.5, 1.8.6, 2.0.0
> Reporter: Tim Allison
> Assignee: Tilman Hausherr
> Priority: Trivial
> Fix For: 1.8.6, 2.0.0
>
> Attachments: PDFBOX-2122.patch
>
> TTFDataStream doesn't set the timezone for the calendar. GregorianCalendar
> defaults to the system's timezone. This means that people in different
> timezones will get slightly different dates. (TIKA-1325).
> One TTF Spec (https://developer.apple.com/fonts/TTRefMan/RM06/Chap6.html)
> doesn't specify the timezone, but my guess would be UTC...except that it is
> Apple, so maybe it's Cupertino. :)