[ https://issues.apache.org/jira/browse/NUTCH-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195439#comment-15195439 ]
Longuemare commented on NUTCH-2138:
-----------------------------------

Hello,

OCR for images in PDFs is still not working with Nutch 1.11 (lib/tika-core-1.11.jar, plugins/parse-tika/tika-parsers-1.11.jar).

tesseract -v
tesseract 3.03
 leptonica-1.70
  libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

bin/nutch parsechecker -dumpText http://domain.tld/file.pdf
fetching: http://domain.tld/file.pdf
robots.txt whitelist not configured.
parsing: http://domain.tld/file.pdf
contentType: application/pdf
signature: af00322e75c5eb43085df668f2faca2f
---------
Url
---------------
http://domain.tld/file.pdf
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata:
  nutch.fetch.time=1458053976974
  Age=0
  Content-Language=fr-FR
  Served-by=www.nord.gouv.fr
  Content-Length=5052242
  Content-Transfer-Encoding=binary
  Expires=Tue, 15 Mar 2016 15:09:37 GMT
  Last-Modified=Fri, 12 Jun 2015 14:58:13 GMT
  Set-Cookie=eZSESSID=6ns8c06tnu40kd3ohfpl6vnrj5; path=/
  Connection=close
  X-Cache=Miss from Varnish
  Server=nginx
  X-Powered-By=eZ Publish
  Cache-Control=
  Pragma=
  X-Varnish=1186703160
  Date=Tue, 15 Mar 2016 14:59:37 GMT
  Content-Disposition=inline; filename="file.pdf"
  nutch.crawl.score=0.0
  Via=1.1 varnish
  Accept-Ranges=bytes
  Content-Type=application/pdf
Parse Metadata:
  access_permission:extract_for_accessibility=true
  meta:save-date=2015-06-12T14:47:32Z
  dcterms:created=2015-06-12T14:47:32Z
  date=2015-06-12T14:47:32Z
  access_permission:can_modify=true
  access_permission:modify_annotations=true
  Creation-Date=2015-06-12T14:47:32Z
  created=Fri Jun 12 16:47:32 CEST 2015
  access_permission:fill_in_form=true
  access_permission:can_print=true
  dc:format=application/pdf; version=1.4
  xmp:CreatorTool=RICOH MP 3353
  Last-Save-Date=2015-06-12T14:47:32Z
  access_permission:assemble_document=true
  meta:creation-date=2015-06-12T14:47:32Z
  dcterms:modified=2015-06-12T14:47:32Z
  Last-Modified=2015-06-12T14:47:32Z
  pdf:PDFVersion=1.4
  modified=2015-06-12T14:47:32Z
  xmpTPg:NPages=45
  access_permission:can_print_degraded=true
  pdf:encrypted=false
  access_permission:extract_content=true
  producer=RICOH MP 3353
  Content-Type=application/pdf
---------
ParseText
---------

Thanks,
Eric

https://tika.apache.org/1.11/gettingstarted.html

> Tika cannot OCR embedded images from PDF
> ----------------------------------------
>
>                 Key: NUTCH-2138
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2138
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.10
>         Environment: Nutch v1.10
> openjdk version "1.8.0_60-internal"
> Debian 7.8
> Tika 1.8 or Tika 1.10
>            Reporter: jean blue
>
> Tika 1.10 is able to OCR embedded images if PDFParser.properties is modified
> accordingly in tika-app-1.10.jar but parse-tika doesn't if same modifications
> are made in runtime/local/plugins/parse-tika/tika-parsers-1.10.jar

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)