[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

Andrew Jackson (JIRA) Wed, 15 Jul 2015 04:34:49 -0700

    [ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627913#comment-14627913 ]


Andrew Jackson commented on TIKA-1678:
--------------------------------------

I'm seeing this in about 220,000 out of 21,204,351 PDFs crawled from 2013 
onwards (roughly 1%), so it's a lot in absolute terms, but a small percentage. 
I thought it might be down to one or two implementations, but I'm seeing a 
fairly broad range of software IDs:

{noformat}
     "generator": [
        "Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not registered)",
        30,
        "GPL Ghostscript 8.54 PDF Writer",
        30,
        "PDFCreator Version 0.9.5",
        30,
        "PDFCreator Version 1.1.0",
        30,
        "Neevia Document Converter 5.2",
        29,
        "pdfcreator Version 0.9.9",
        29,
        "AFPL Ghostscript 8.54 PDF Writer",
        28,
        "PDFCreator Version 1.0.0",
        28,
        "GPL Ghostscript 8.64 ps2pdf.com",
        27,
...
{noformat}

That octal-escaped UTF-16 BE BOM (\376\377) is pretty specific, so I think 
writing a handler to catch it is unlikely to cause problems elsewhere. But I'm 
not really sure how to fix this either.
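
For illustration only, here is a minimal sketch of what such a handler might 
look like. It is not Tika or PDFBox code: the class name OctalUtf16Decoder and 
the method decodeIfEscapedUtf16 are hypothetical, and it assumes the metadata 
value arrives as a Java String that still contains the literal backslash 
escapes. Values that do not start with the escaped BOM are returned untouched.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; not part of Tika or PDFBox. */
public class OctalUtf16Decoder {

    /** Literal text of the octal-escaped UTF-16 BE BOM (bytes 0xFE 0xFF). */
    private static final String ESCAPED_BOM = "\\376\\377";

    /**
     * Decodes a value such as \376\377\000T\000e... into "Te...".
     * Returns the input unchanged if it does not start with the escaped BOM.
     */
    public static String decodeIfEscapedUtf16(String value) {
        if (value == null || !value.startsWith(ESCAPED_BOM)) {
            return value;
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            if (c == '\\' && i + 1 < value.length()) {
                char next = value.charAt(i + 1);
                if (isOctal(next)) {
                    // Read up to three octal digits as a single byte.
                    int code = 0;
                    int j = i + 1;
                    while (j < value.length() && j < i + 4 && isOctal(value.charAt(j))) {
                        code = code * 8 + (value.charAt(j) - '0');
                        j++;
                    }
                    bytes.write(code & 0xFF);
                    i = j;
                } else {
                    // Escaped literal such as \( or \): keep the character itself.
                    bytes.write(next & 0xFF);
                    i += 2;
                }
            } else {
                // Unescaped character (e.g. a space): one raw byte.
                bytes.write(c & 0xFF);
                i++;
            }
        }
        byte[] raw = bytes.toByteArray();
        // Skip the two BOM bytes and decode the remainder as UTF-16 BE.
        return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
    }

    private static boolean isOctal(char c) {
        return c >= '0' && c <= '7';
    }
}
```

The line breaks inside the escaped values quoted in this issue come from email 
wrapping, so the sketch assumes the unwrapped value is passed in. Whether such 
a check would belong in Tika itself or further down in the PDF parsing layer 
is a question this sketch does not try to answer.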

In case it helps, here are some more (randomly chosen) URLs that seem to 
display the same issue (if they've not disappeared from the live web already!):

{noformat}
http://www.girlsb.org.uk/media/060a6d49/After_McDonaldization_Chapter_1.pdf
http://www.uniswales.ac.uk/wp/media/2011-March-The-Impact-of-International-and-EU-Students-in-Wales.pdf
http://www.youthworkwales.org.uk/creo_files/upload/files/gd_in_yw_conceptual_model_2009_1_.pdf
http://www.transitionchepstow.org.uk/wp-content/uploads/2014/09/Living-with-climate-change-poster.pdf
http://www.staustelltowncouncil.com/St-Austell-Town-Council/UserFiles/Files/Committees/Community/Agendas/2010/community%20agenda%206%20Sept%2010.pdf
http://community.stroud.gov.uk/_documents/79_SmartWater-NW-Kit-Leaflet3.pdf
http://www.recycleformerthyr.co.uk/media/9365/dowlais%20juniors%20school%20photos.pdf
http://www.visitmerthyr.co.uk/media/24663/volunteering_poster.psd_welsh.pdf
http://merthyrcynon.foodbank.org.uk/resources/documents/Get%20Involved/Gift-Aid-Form/Gift-Aid-form.pdf
http://www.basquechildren.org/-/docs/clarion
http://www.biicl.org/files/3776_5_-_richard_happ.pdf
http://www.artscouncil-ni.org/images/uploads/publications-documents/ArtsandHealth.pdf
http://www.lawsoc-ni.org/download/fs/doc/LEXCEL_APPLICATION_&_STATUS_ENQUIRY_FORMS_2010%5b1%5d/pdf/
http://www.templechurch.com/wp-content/uploads/2012/08/Olympic-poster.pdf
http://www.llennatur.com/files/u1/Cylchgrawn32.pdf
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/315413/cau101.pdf
https://www.networks.nhs.uk/nhs-networks/common-assessment-framework-for-adults-learning/archived-material-from-caf-network-website-pre-april-2012/barnsley-ig-toolkit-aug-2010/FINAL_BMBC_ASSD_NHS_Numb
http://www.ccfgb.co.uk/images/Visit.pdf
http://www.wihb.scot.nhs.uk/hairt-reports/policies/key-infection-prevention-policies?task=document.viewdoc&id=22
http://www.ed.ac.uk/polopoly_fs/1.94783!/fileManager/martha%20hamilton%20trust%20app%20form12.pdf
http://stophs2.org/wp-content/uploads/2010/11/EHS_booklet.pdf
http://www.brookes.ac.uk/Documents/Regulations/Current/Core/A1/Technolgy--Design---Environment-Prizes/
http://www.nus.org.uk/PageFiles/4011/ACTSA_Events_December.pdf
https://www.ids.ac.uk/files/dmfile/GCSTDemocracyandSecurity34_WP8.pdf
http://www.theigc.org/wp-content/uploads/2015/02/Chaudhry-Woodruff-2013-Working-Paper.pdf
{noformat}

> PDF metadata extraction fails to spot UTF-16 encoded title
> ----------------------------------------------------------
>
>                 Key: TIKA-1678
>                 URL: https://issues.apache.org/jira/browse/TIKA-1678
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata
>    Affects Versions: 1.9
>            Reporter: Andrew Jackson
>            Priority: Minor
>
> When extracting metadata from PDFs, we see some odd behaviour for a minority 
> of the documents. The PDF metadata can be encoded as UTF-16 octets, but is 
> not always being decoded as such.
> A specific example is here: 
> http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf
> Which contains this (literal file content):
> {noformat}
> 443 0 obj
> <</Type/Metadata
> /Subtype/XML/Length 1978>>stream
> <?xpacket begin='ï»¿' id='W5M0MpCehiHzreSzNTczkc9d'?>
> <?adobe-xap-filters esc="CRLF"?>
> <x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 
> 1.6'>
> <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' 
> xmlns:iX='http://ns.adobe.com/iX/1.0/'>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:pdf='http://ns.adobe.com/pdf/1.3/' 
> pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 
> \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2012-07-18T15:38:01+01:00</xmp:ModifyDate>
> <xmp:CreateDate>2012-07-18T15:38:01+01:00</xmp:CreateDate>
> <xmp:CreatorTool>UnknownApplication</xmp:CreatorTool></rdf:Description>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' 
> xapMM:DocumentID='ac9f232e-d341-11e1-0000-ba905bfc4694'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:dc='http://purl.org/dc/elements/1.1/' 
> dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li 
> xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000
>  \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>\376\377\000T\000e\000t\000t\000i</rdf:li></rdf:Seq></dc:creator></rdf:Description>
> </rdf:RDF>
> </x:xmpmeta>
> <?xpacket end='w'?>
> endstream
> endobj
> 2 0 obj
> <</Producer(\376\377\000B\000u\000l\000l\000z\000i\000p\000 
> \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n)
> /CreationDate(D:20120718153801+01'00')
> /ModDate(D:20120718153801+01'00')
> /Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
> /Author(\376\377\000T\000e\000t\000t\000i)>>endobj
> {noformat} 
> Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an 
> error, but the ones encoded in the actual PDF metadata fields should be 
> extracted accurately.
> When extracted, we get:
> {noformat}
> ...
> dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> meta:author: \376\377\000T\000e\000t\000t\000i
> meta:author: Tetti
> ...
> {noformat}
> So, the author appears to be decoded correctly once, but the title is not. Is 
> the XML dc:title being used to override the PDF title field? Or is one of the 
> title fields being decoded incorrectly?
> (I accept that although this is a real PDF document from the web, it is also 
> a malformed one, so maybe there is not much to be done here.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
