[ 
https://issues.apache.org/jira/browse/TIKA-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059582#comment-17059582
 ] 

Carina Antunes commented on TIKA-3069:
--------------------------------------

Thank you so much for the details! 
{quote} If you're not looking for the literal bytes of the embedded files, 
/unpack is not for you. Perhaps we could look into compressing /rmeta?
{quote}
That would be great, please look into it! Until then, /unpack still seems the
best option for us.

> Unpack with header X-Tika-PDFextractInlineImages does not extract content 
> from image
> ------------------------------------------------------------------------------------
>
>                 Key: TIKA-3069
>                 URL: https://issues.apache.org/jira/browse/TIKA-3069
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.23
>         Environment: Docker image *apache/tika:1.23-full*
>            Reporter: Carina Antunes
>            Priority: Major
>         Attachments: file.pdf, output.zip, parser.json
>
>
> Expected content to be extracted from a PDF with an image using Tesseract,
> i.e. the same behaviour as _/rmeta/text_, but instead no content is extracted.
> Response from */unpack/all*:
> {code:java}
> $ curl -T file.pdf http://localhost:9998/unpack/all \
>     --header "X-Tika-PDFextractInlineImages: true" > output.zip
> __TEXT__
>  [image: image0.jpg]
> __METADATA__
> "pdf:unmappedUnicodeCharsPerPage","0"
> "pdf:PDFVersion","1.4"
> "X-Parsed-By","org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"
> "pdf:hasXFA","false"
> "access_permission:modify_annotations","true"
> "access_permission:can_print_degraded","true"
> "access_permission:extract_for_accessibility","true"
> "access_permission:assemble_document","true"
> "xmpTPg:NPages","1"
> "pdf:hasXMP","false"
> "dc:format","application/pdf; version=1.4"
> "pdf:charsPerPage","0"
> "access_permission:extract_content","true"
> "access_permission:can_print","true"
> "access_permission:fill_in_form","true"
> "pdf:encrypted","false"
> "access_permission:can_modify","true"
> "Content-Type","application/pdf"
> {code}
>  
> Expected response, similar to */rmeta/text*:
> {code:java}
> $ curl -T file.pdf http://localhost:9998/rmeta/text \
>     --header "X-Tika-PDFextractInlineImages: true"
> {
>   "Content-Type": "application/pdf",
>   "X-Parsed-By": [
>     "org.apache.tika.parser.DefaultParser",
>     "org.apache.tika.parser.pdf.PDFParser"
>   ],
>   "X-TIKA:embedded_depth": "0",
>   "X-TIKA:parse_time_millis": "4112",
>   "access_permission:assemble_document": "true",
>   "access_permission:can_modify": "true",
>   "access_permission:can_print": "true",
>   "access_permission:can_print_degraded": "true",
>   "access_permission:extract_content": "true",
>   "access_permission:extract_for_accessibility": "true",
>   "access_permission:fill_in_form": "true",
>   "access_permission:modify_annotations": "true",
>   "dc:format": "application/pdf; version\u003d1.4",
>   "pdf:PDFVersion": "1.4",
>   "pdf:charsPerPage": "0",
>   "pdf:encrypted": "false",
>   "pdf:hasXFA": "false",
>   "pdf:hasXMP": "false",
>   "pdf:unmappedUnicodeCharsPerPage": "0",
>   "xmpTPg:NPages": "1"
> },
> {
>   "Component 1": "Y component: Quantization table 0, Sampling factors 2 horiz/2 vert",
>   "Component 2": "Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert",
>   "Component 3": "Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert",
>   "Compression Type": "Baseline",
>   "Content-Type": "image/jpeg",
>   "Data Precision": "8 bits",
>   "File Modified Date": "Wed Mar 11 19:28:01 +00:00 2020",
>   "File Name": "apache-tika-16610492346701338708.tmp",
>   "File Size": "319936 bytes",
>   "Image Height": "1554 pixels",
>   "Image Width": "1206 pixels",
>   "Number of Components": "3",
>   "Number of Tables": "4 Huffman tables",
>   "X-Parsed-By": [
>     "org.apache.tika.parser.DefaultParser",
>     "org.apache.tika.parser.ocr.TesseractOCRParser",
>     "org.apache.tika.parser.jpeg.JpegParser"
>   ],
>   "X-TIKA:content": 
> "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLorem 
> Ipsum\n\n\"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, 
> consectetur, adipisci velit...\"\n\nLorem ipsum dolor sit amet, 
> consectetur\nadipiscing elit. Etiam at posuere mauris.\nInterdum et malesuada 
> fames ac ante ipsum\nprimis in faucibus. Suspendisse potenti. Donec\nut 
> dapibus lectus. Aenean neque mauris,\nconvallis quis eros nec, molestie 
> rhoncus\nlectus. Aliquam dui mauris, sagittis ut posuere\nquis, tempor id 
> tellus. Nunc id varius dolor.\nFusce in elementum enim. Vestibulum\nimperdiet 
> pretium est et rhoncus. Nam in urna\nmauris. Nulla facilisi. Nullam sed 
> sapien libero.\nSed ligula arcu, auctor non nunc sed, viverra\nvehicula sem. 
> Vestibulum orci felis, tristique at\norci id, interdum sodales lectus. Donec 
> sed\nrhoncus massa. Donec laoreet sodales velit at\nfaucibus.\n\nAenean sit 
> amet velit eros. Nam congue\nplacerat eros, vitae mattis turpis ultricies 
> ac.\nPraesent vestibulum, tortor tempor tristique\nsagittis, mi risus semper 
> neque, vel vehicula\ntortor sapien in lorem. Sed sit amet mattis 
> leo.\nPraesent euismod lacinia sapien, nec cursus\ndolor dignissim pharetra. 
> Mauris eleifend\npellentesque erat fermentum tempus. Nulla\ncommodo dolor 
> urna, quis tincidunt diam\nconvallis vel.\n\nAenean ornare imperdiet nibh, 
> sed gravida ante\nsagittis et. Fusce dignissim lectus vitae\nullamcorper 
> malesuada. Donec ultricies ornare\nquam a placerat. Donec euismod nibh 
> vitae\nfacilisis consectetur. Nunc in interdum neque,\nvarius vehicula massa. 
> Ut fermentum lorem id\nante porta mattis. Praesent quis nulla ut 
> lectus\nsodales ultricies. Sed sodales mollis ex, a\nsemper metus faucibus 
> ac. Nulla tempor, ipsum\nvel egestas venenatis, enim est gravida mauris,\na 
> lacinia justo quam eget felis. Maecenas\ncommodo, arcu sit amet aliquam 
> molestie, urna\neros rutrum enim, et blandit nisi magna sit amet\n\nlorem. 
> Suspendisse accumsan nulla vitae\naugue tempus, sed fermentum metus 
> viverra.\nEtiam dapibus tellus eget venenatis rhoncus.\nVivamus eu dolor 
> faucibus, malesuada tellus sit\namet, vulputate orci.\n\nNunc at diam eu nisi 
> sollicitudin varius. Sed a\ntincidunt arcu. Integer vitae fermentum 
> libero,\nac semper justo. Nunc dapibus in magna\ntempus aliquet. Proin 
> interdum lorem eget\nsuscipit ullamcorper. Nulla vitae tincidunt\naugue. Cras 
> turpis elit, dignissim eget metus\nnec, fermentum scelerisque ante. 
> Suspendisse\naliquam tortor in eros rhoncus, eget elementum\nvelit sagittis. 
> Donec et tellus ac dui interdum\nmattis. Duis condimentum quis velit 
> et\ncommodo. Sed congue quam vitae neque\nvolutpat viverra.\n\nProin finibus 
> nunc vel elit iaculis vestibulum.\nNulla et mattis magna. Nunc a ligula 
> leo.\nAliquam bibendum semper tellus at molestie.\nCurabitur pellentesque 
> ullamcorper dolor, at\nfinibus elit iaculis ac. Aliquam vestibulum sit\namet 
> diam sit amet condimentum. Donec\nrhoncus, nisi eu dapibus elementum, tellus 
> ex\nornare dui, nec molestie nulla nulla eget nulla.\nUt sem massa, tristique 
> ac commodo id, rutrum\nat massa. Donec enim velit, luctus ac nisi 
> ac,\nbibendum tempus elit. Proin posuere ex odio,\nsed faucibus elit volutpat 
> in. Suspendisse\nscelerisque mauris nunc, ut tincidunt velit\nvulputate quis. 
> Integer efficitur diam vel urna\ndignissim, a  sodales magna _ 
> eleifend.\nVestibulum malesuada ornare diam, faucibus\nmaximus tellus aliquam 
> et. Sed sed libero\negestas, varius sapien faucibus, interdum\nquam. Nullam a 
> accumsan dui. Vivamus\nscelerisque justo in metus ornare interdum.\n\n",
>   "X-TIKA:content_handler": "ToTextContentHandler",
>   "X-TIKA:embedded_depth": "1",
>   "X-TIKA:embedded_resource_path": "/image0.jpg",
>   "X-TIKA:parse_time_millis": "4023",
>   "embeddedResourceType": "INLINE",
>   "pdf:hasXMP": "false",
>   "resourceName": "image0.jpg",
>   "tiff:BitsPerSample": "8",
>   "tiff:ImageLength": "1554",
>   "tiff:ImageWidth": "1206"
> }
> {code}
>  
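
For anyone who only needs the OCR text rather than the literal embedded bytes, here is a minimal sketch of pulling it out of an /rmeta/text response on the client side. It assumes the response body is a JSON array of metadata objects (the quoted output above shows the objects but not the enclosing brackets). The function name `embedded_text` is hypothetical, and the `"/"` key used for the container document is an illustrative choice, since the container object in the quoted output carries no `X-TIKA:embedded_resource_path`.

```python
import json


def embedded_text(rmeta_json: str) -> dict:
    """Map each resource path to its extracted text, skipping empty entries.

    Assumes rmeta_json is a JSON array of metadata objects, one per
    document (container first, then embedded resources).
    """
    texts = {}
    for record in json.loads(rmeta_json):
        # Keys below are taken from the quoted /rmeta/text output.
        content = record.get("X-TIKA:content", "").strip()
        if not content:
            continue
        # Assumption: fall back to "/" when no embedded path is present.
        path = record.get("X-TIKA:embedded_resource_path", "/")
        texts[path] = content
    return texts
```

Fed the response quoted above (wrapped in brackets), this would return a single entry keyed by `/image0.jpg` holding the Lorem Ipsum text, because the container PDF has no extracted content of its own.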



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
