[
https://issues.apache.org/jira/browse/TIKA-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057696#comment-17057696
]
Carina edited comment on TIKA-3069 at 3/12/20, 8:27 AM:
--------------------------------------------------------
Thanks for the explanation.
Are there any plans to make it recursive, like the /rmeta endpoint?
It seems a bit inconsistent that sending the header
_X-Tika-PDFextractInlineImages: true_ to one endpoint gives a different
outcome than sending it to this endpoint. And I'm not even talking about a zip
inside a zip, which seems a bit more complex, but about extracting text from a
simple scanned PDF document.
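To make the comparison concrete, here is a minimal Python sketch that sends the same file and header to both endpoints. The host and port are taken from the curl calls in the issue description. The helper names `build_request` and `fetch` are made up for illustration, and the explicit `Content-Type: application/pdf` is an assumption on my side (urllib would otherwise add a form-encoded content type by default, which `curl -T` does not send).

```python
import urllib.request

# Server address as used in the curl calls of the issue description.
TIKA_BASE = "http://localhost:9998"

HEADERS = {
    "X-Tika-PDFextractInlineImages": "true",
    # Assumption: set explicitly so urllib does not default to
    # application/x-www-form-urlencoded for the request body.
    "Content-Type": "application/pdf",
}


def build_request(endpoint: str, payload: bytes) -> urllib.request.Request:
    """Build a PUT request roughly equivalent to `curl -T file --header ...`."""
    return urllib.request.Request(
        url=f"{TIKA_BASE}{endpoint}",
        data=payload,
        headers=HEADERS,
        method="PUT",
    )


def fetch(endpoint: str, payload: bytes, timeout: float = 120.0) -> bytes:
    """Send the request and return the raw response body (needs a running server)."""
    with urllib.request.urlopen(build_request(endpoint, payload), timeout=timeout) as resp:
        return resp.read()
```

With a server running, the results of `fetch("/rmeta/text", data)` and `fetch("/unpack/all", data)` for the same `data` can then be compared side by side.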
The main problem is that the /rmeta endpoint is not a solution for intensive
extraction of bigger documents, i.e. because of the higher network load and
extraction times.
It was my understanding that behind the scenes these endpoints would be doing
the same thing, except that unpack would reduce the wire load by returning a
zip or tar.
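For anyone who wants to check an /unpack/all result programmatically, a rough sketch follows. It assumes the returned zip contains an entry named `__TEXT__` (as the listing in the issue description suggests) and that the entry is UTF-8 text. The function names and the placeholder pattern are illustrative only, not part of Tika.

```python
import io
import re
import zipfile

# Matches placeholders such as "[image: image0.jpg]" seen in the __TEXT__ entry.
IMAGE_PLACEHOLDER = re.compile(r"\[image: [^\]]*\]")


def unpack_text(archive_bytes: bytes) -> str:
    """Return the content of the __TEXT__ entry of an /unpack/all zip.

    Assumption: the entry exists and is UTF-8 encoded.
    """
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read("__TEXT__").decode("utf-8", errors="replace")


def has_only_image_placeholders(text: str) -> bool:
    """True when nothing but image placeholders and whitespace was extracted."""
    return IMAGE_PLACEHOLDER.sub("", text).strip() == ""
```

For the attached output.zip, `has_only_image_placeholders(unpack_text(data))` would be expected to return `True` if the entry really contains only the placeholder line shown in the issue description.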
> Unpack with header X-Tika-PDFextractInlineImages does not extract content
> from image
> ------------------------------------------------------------------------------------
>
> Key: TIKA-3069
> URL: https://issues.apache.org/jira/browse/TIKA-3069
> Project: Tika
> Issue Type: Bug
> Components: server
> Affects Versions: 1.23
> Environment: Docker image *apache/tika:1.23-full*
> Reporter: Carina
> Priority: Major
> Attachments: file.pdf, output.zip, parser.json
>
>
> Expected content to be extracted from a PDF containing an image using
> Tesseract, i.e. the same behaviour as _/rmeta/text_, but instead no content
> is extracted.
> Response from */unpack/all*:
> {code:java}
> $ curl -T file.pdf http://localhost:9998/unpack/all \
>   --header "X-Tika-PDFextractInlineImages: true" > output.zip
> __TEXT__
> [image: image0.jpg]
> __METADATA__
> "pdf:unmappedUnicodeCharsPerPage","0" "pdf:PDFVersion","1.4"
> "X-Parsed-By","org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"
> "pdf:hasXFA","false" "access_permission:modify_annotations","true"
> "access_permission:can_print_degraded","true"
> "access_permission:extract_for_accessibility","true"
> "access_permission:assemble_document","true" "xmpTPg:NPages","1"
> "pdf:hasXMP","false" "dc:format","application/pdf; version=1.4"
> "pdf:charsPerPage","0" "access_permission:extract_content","true"
> "access_permission:can_print","true" "access_permission:fill_in_form","true"
> "pdf:encrypted","false" "access_permission:can_modify","true"
> "Content-Type","application/pdf"
> {code}
>
> Expected a response similar to */rmeta/text*:
> {code:java}
> $ curl -T file.pdf http://localhost:9998/rmeta/text \
>   --header "X-Tika-PDFextractInlineImages: true"
> {
> "Content-Type": "application/pdf",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.pdf.PDFParser"
> ],
> "X-TIKA:embedded_depth": "0",
> "X-TIKA:parse_time_millis": "4112",
> "access_permission:assemble_document": "true",
> "access_permission:can_modify": "true",
> "access_permission:can_print": "true",
> "access_permission:can_print_degraded": "true",
> "access_permission:extract_content": "true",
> "access_permission:extract_for_accessibility": "true",
> "access_permission:fill_in_form": "true",
> "access_permission:modify_annotations": "true",
> "dc:format": "application/pdf; version\u003d1.4",
> "pdf:PDFVersion": "1.4",
> "pdf:charsPerPage": "0",
> "pdf:encrypted": "false",
> "pdf:hasXFA": "false",
> "pdf:hasXMP": "false",
> "pdf:unmappedUnicodeCharsPerPage": "0",
> "xmpTPg:NPages": "1"
> },
> {
> "Component 1": "Y component: Quantization table 0, Sampling factors 2 horiz/2 vert",
> "Component 2": "Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert",
> "Component 3": "Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert",
> "Compression Type": "Baseline",
> "Content-Type": "image/jpeg",
> "Data Precision": "8 bits",
> "File Modified Date": "Wed Mar 11 19:28:01 +00:00 2020",
> "File Name": "apache-tika-16610492346701338708.tmp",
> "File Size": "319936 bytes",
> "Image Height": "1554 pixels",
> "Image Width": "1206 pixels",
> "Number of Components": "3",
> "Number of Tables": "4 Huffman tables",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.ocr.TesseractOCRParser",
> "org.apache.tika.parser.jpeg.JpegParser"
> ],
> "X-TIKA:content":
> "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLorem
> Ipsum\n\n\"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet,
> consectetur, adipisci velit...\"\n\nLorem ipsum dolor sit amet,
> consectetur\nadipiscing elit. Etiam at posuere mauris.\nInterdum et malesuada
> fames ac ante ipsum\nprimis in faucibus. Suspendisse potenti. Donec\nut
> dapibus lectus. Aenean neque mauris,\nconvallis quis eros nec, molestie
> rhoncus\nlectus. Aliquam dui mauris, sagittis ut posuere\nquis, tempor id
> tellus. Nunc id varius dolor.\nFusce in elementum enim. Vestibulum\nimperdiet
> pretium est et rhoncus. Nam in urna\nmauris. Nulla facilisi. Nullam sed
> sapien libero.\nSed ligula arcu, auctor non nunc sed, viverra\nvehicula sem.
> Vestibulum orci felis, tristique at\norci id, interdum sodales lectus. Donec
> sed\nrhoncus massa. Donec laoreet sodales velit at\nfaucibus.\n\nAenean sit
> amet velit eros. Nam congue\nplacerat eros, vitae mattis turpis ultricies
> ac.\nPraesent vestibulum, tortor tempor tristique\nsagittis, mi risus semper
> neque, vel vehicula\ntortor sapien in lorem. Sed sit amet mattis
> leo.\nPraesent euismod lacinia sapien, nec cursus\ndolor dignissim pharetra.
> Mauris eleifend\npellentesque erat fermentum tempus. Nulla\ncommodo dolor
> urna, quis tincidunt diam\nconvallis vel.\n\nAenean ornare imperdiet nibh,
> sed gravida ante\nsagittis et. Fusce dignissim lectus vitae\nullamcorper
> malesuada. Donec ultricies ornare\nquam a placerat. Donec euismod nibh
> vitae\nfacilisis consectetur. Nunc in interdum neque,\nvarius vehicula massa.
> Ut fermentum lorem id\nante porta mattis. Praesent quis nulla ut
> lectus\nsodales ultricies. Sed sodales mollis ex, a\nsemper metus faucibus
> ac. Nulla tempor, ipsum\nvel egestas venenatis, enim est gravida mauris,\na
> lacinia justo quam eget felis. Maecenas\ncommodo, arcu sit amet aliquam
> molestie, urna\neros rutrum enim, et blandit nisi magna sit amet\n\nlorem.
> Suspendisse accumsan nulla vitae\naugue tempus, sed fermentum metus
> viverra.\nEtiam dapibus tellus eget venenatis rhoncus.\nVivamus eu dolor
> faucibus, malesuada tellus sit\namet, vulputate orci.\n\nNunc at diam eu nisi
> sollicitudin varius. Sed a\ntincidunt arcu. Integer vitae fermentum
> libero,\nac semper justo. Nunc dapibus in magna\ntempus aliquet. Proin
> interdum lorem eget\nsuscipit ullamcorper. Nulla vitae tincidunt\naugue. Cras
> turpis elit, dignissim eget metus\nnec, fermentum scelerisque ante.
> Suspendisse\naliquam tortor in eros rhoncus, eget elementum\nvelit sagittis.
> Donec et tellus ac dui interdum\nmattis. Duis condimentum quis velit
> et\ncommodo. Sed congue quam vitae neque\nvolutpat viverra.\n\nProin finibus
> nunc vel elit iaculis vestibulum.\nNulla et mattis magna. Nunc a ligula
> leo.\nAliquam bibendum semper tellus at molestie.\nCurabitur pellentesque
> ullamcorper dolor, at\nfinibus elit iaculis ac. Aliquam vestibulum sit\namet
> diam sit amet condimentum. Donec\nrhoncus, nisi eu dapibus elementum, tellus
> ex\nornare dui, nec molestie nulla nulla eget nulla.\nUt sem massa, tristique
> ac commodo id, rutrum\nat massa. Donec enim velit, luctus ac nisi
> ac,\nbibendum tempus elit. Proin posuere ex odio,\nsed faucibus elit volutpat
> in. Suspendisse\nscelerisque mauris nunc, ut tincidunt velit\nvulputate quis.
> Integer efficitur diam vel urna\ndignissim, a sodales magna _
> eleifend.\nVestibulum malesuada ornare diam, faucibus\nmaximus tellus aliquam
> et. Sed sed libero\negestas, varius sapien faucibus, interdum\nquam. Nullam a
> accumsan dui. Vivamus\nscelerisque justo in metus ornare interdum.\n\n",
> "X-TIKA:content_handler": "ToTextContentHandler",
> "X-TIKA:embedded_depth": "1",
> "X-TIKA:embedded_resource_path": "/image0.jpg",
> "X-TIKA:parse_time_millis": "4023",
> "embeddedResourceType": "INLINE",
> "pdf:hasXMP": "false",
> "resourceName": "image0.jpg",
> "tiff:BitsPerSample": "8",
> "tiff:ImageLength": "1554",
> "tiff:ImageWidth": "1206"
> }
> {code}
>
>
>
>