[jira] [Commented] (TIKA-2175) Enable extraction of inlined jp2/jpx from PDF

Tim Allison (JIRA) Mon, 28 Nov 2016 04:25:25 -0800

    [ https://issues.apache.org/jira/browse/TIKA-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701842#comment-15701842 ]


Tim Allison commented on TIKA-2175:
-----------------------------------

Hmmm... This is working for me (at least in our test suite):
{noformat}
    @Test
    public void testjp2() throws Exception {
        PDFParserConfig config = new PDFParserConfig();
        config.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);
        System.out.println(getXML("pdf-with-jp2-images.pdf", context).xml);
    }

{noformat}
yields:

{noformat}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2015-12-28T14:25:23Z" />
<meta name="pdf:PDFVersion" content="1.7" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser" />
<meta name="X-Parsed-By" content="class org.apache.tika.parser.ocr.TesseractOCRParser" />
<meta name="xmp:CreatorTool" content="Nitro Pro" />
<meta name="access_permission:modify_annotations" content="true" />
<meta name="access_permission:can_print_degraded" content="true" />
<meta name="access_permission:extract_for_accessibility" content="true" />
<meta name="access_permission:assemble_document" content="true" />
<meta name="xmpTPg:NPages" content="2" />
<meta name="Last-Modified" content="2015-12-28T14:25:23Z" />
<meta name="dcterms:modified" content="2015-12-28T14:25:23Z" />
<meta name="dc:format" content="application/pdf; version=1.7" />
<meta name="access_permission:extract_content" content="true" />
<meta name="Last-Save-Date" content="2015-12-28T14:25:23Z" />
<meta name="access_permission:can_print" content="true" />
<meta name="pdf:docinfo:creator_tool" content="Nitro Pro" />
<meta name="access_permission:fill_in_form" content="true" />
<meta name="pdf:docinfo:modified" content="2015-12-28T14:25:23Z" />
<meta name="meta:save-date" content="2015-12-28T14:25:23Z" />
<meta name="pdf:encrypted" content="false" />
<meta name="modified" content="2015-12-28T14:25:23Z" />
<meta name="access_permission:can_modify" content="true" />
<meta name="Content-Type" content="application/pdf" />
<title></title>
</head>
<body><div class="page"><p />
<div class="ocr">r13.3mm] ﬁe G’hile

CERTIFICADO

El Banco dc Chile. oﬁcinn QUILLOTAA Conﬁrm que cl Sr. Algiandm Rodrigo Pnlmn 
Perez.
Rul: 9,582.807-8. cs lilular dc la Cucmu Curricula MIN asigmda con el N" 
1404810008
vigcnlc dcsdc nl ox dc Fuhrcm dc I991. Bien llevada.

Dames lu pneseme conﬁrmacion. a pcdidn dcl inlencsado sin ulterior 
responxubiﬂdﬂd para el
Banco (I: Chile.

   
 

e
Bani“ it I ma
Fl“)
ENR‘OUE ‘22:“:

aumou

Samiago. ()3 dc Oczubrc dc 21114.

</div>
</div>
<div class="page"><p />
<div class="ocr">W
BANK OF CHILE

CERTIFICATE

The Bank ofChilc. ofﬁce in QUILLOTAV hereby conﬁrms that Mr. Alejandro Rodrigo 
Palma Pen; with
Tax Payer Regisu'ation No. 9.582.807-8 is Ill: holder of: current mun! No. 
Ida—48200438, nclive since
3"“ February I991 showing a sound performance.

We issue this ocniﬁcalicm a! the request oflhc inleteslcd puny and it emails 
any liabilily for Bank of
Chile.

(Signature illegible)
For: Bank of Chile
The seal of Bank ofC'hileV ENRIQUE MARFIL ILABACA has been slumped herein)

Santiago. 3" October 20 Hr

 

 

 

 

 

</div>
</div>
</body></html>{noformat}
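
For anyone who would rather check output like this programmatically than by eye, here is a rough sketch using only the JDK's DOM parser. It assumes the XHTML string is well-formed XML (as it is above) and that OCR text lands in {{<div class="ocr">}} elements, one per page. The class and method names ({{OcrDivCheck}}, {{ocrText}}) are made up for illustration and are not part of Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika: collects the text of every
// <div class="ocr"> element found in an XHTML string.
class OcrDivCheck {

    // Returns the trimmed text content of each <div class="ocr">, in document order.
    static List<String> ocrText(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // The output declares the XHTML namespace, so parse namespace-aware.
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // "*" matches div elements in any namespace.
        NodeList divs = doc.getElementsByTagNameNS("*", "div");
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if ("ocr".equals(div.getAttribute("class"))) {
                texts.add(div.getTextContent().trim());
            }
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the XHTML string a parse would return; shortened for the example.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"page\"><p /><div class=\"ocr\">CERTIFICADO</div></div>"
                + "<div class=\"page\"><p /><div class=\"ocr\">CERTIFICATE</div></div>"
                + "</body></html>";
        List<String> texts = ocrText(xhtml);
        System.out.println(texts.size() + " OCR blocks");
        for (String text : texts) {
            System.out.println(text.isEmpty() ? "(empty)" : text);
        }
    }
}
```

In a test, one could assert that the list has one entry per page and that none of the entries is empty. An empty list would suggest the inline images never reached the OCR parser.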

> Enable extraction of inlined jp2/jpx from PDF
> ---------------------------------------------
>
>                 Key: TIKA-2175
>                 URL: https://issues.apache.org/jira/browse/TIKA-2175
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>         Attachments: pdf-with-jp2-images.pdf
>
>
> On TIKA-2174, [~mcaruanagalizia] reported that inline jp2 images in PDFs were 
> not being OCR'd.  TIKA-2174 added that file type to our Tesseract parser, but 
> our code in the PDFParser wasn't extracting the inline images as well.  
> Let's fix that. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
