[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169090#comment-14169090 ]

Hong-Thai Nguyen commented on TIKA-1445:

Interesting question! In my view, parser selection and parser priority should be decided at runtime by configuration, not inside a parser. The image parser is an interesting case of competing parsers (Tesseract vs. the classical image parsers). We have two problems here:
1. When several parsers can handle the same MIME type, which one is selected?
2. When we have several parsers, can we apply all of them and merge the results (metadata handler)?

* For case 1, if we use an override configuration of parsers at runtime, we can declare several parsers with a matching MIME type, and the later one in the list will be selected. We may extend the CLI/WebService to inject this kind of configuration.
* For case 2, we don't have a solution for now. We may extend CompositeParser to accept a 'many' parsers mode and call the matching parsers in a chain. Merging the results is another problem: we could accept that a metadata name set by one parser is overridden by another parser. The perfect solution is (again) a nested structure in our metadata that can store each parser's result.

Figure out how to add Image metadata extraction to Tesseract parser
---
Key: TIKA-1445
URL: https://issues.apache.org/jira/browse/TIKA-1445
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.7
Attachments: TIKA-1445.Mattmann.101214.patch.txt

Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
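The two cases in the comment above (the last matching parser wins; all matching parsers run in a chain and their metadata is merged) can be sketched in plain Java. This is only an illustration under stated assumptions: `ToyParser`, `selectLast`, `parseAndMerge` and `parseNested` are hypothetical names, not part of Tika's `Parser` or `CompositeParser` API, and metadata is reduced to a string map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative stand-ins only: nothing here is a real Tika class or method.
class ParserChainSketch {

    /** A toy parser: declares the MIME types it handles and returns metadata. */
    interface ToyParser {
        String name();
        Set<String> supportedTypes();
        Map<String, String> parse(byte[] content);
    }

    /** Case 1: of all parsers matching the type, the one declared last wins. */
    static ToyParser selectLast(List<ToyParser> parsers, String mimeType) {
        ToyParser selected = null;
        for (ToyParser p : parsers) {
            if (p.supportedTypes().contains(mimeType)) {
                selected = p; // a later declaration overrides an earlier one
            }
        }
        return selected; // null when no parser matches
    }

    /** Case 2, flat merge: run every matching parser; later values overwrite earlier ones. */
    static Map<String, String> parseAndMerge(List<ToyParser> parsers, String mimeType, byte[] content) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (ToyParser p : parsers) {
            if (p.supportedTypes().contains(mimeType)) {
                merged.putAll(p.parse(content));
            }
        }
        return merged;
    }

    /** Case 2, nested merge: keep each parser's result under its own name, so nothing is lost. */
    static Map<String, Map<String, String>> parseNested(List<ToyParser> parsers, String mimeType, byte[] content) {
        Map<String, Map<String, String>> nested = new LinkedHashMap<>();
        for (ToyParser p : parsers) {
            if (p.supportedTypes().contains(mimeType)) {
                nested.put(p.name(), p.parse(content));
            }
        }
        return nested;
    }

    /** Builds a toy parser that always returns the given metadata. */
    static ToyParser toy(String name, Set<String> types, Map<String, String> result) {
        return new ToyParser() {
            public String name() { return name; }
            public Set<String> supportedTypes() { return types; }
            public Map<String, String> parse(byte[] content) { return result; }
        };
    }

    public static void main(String[] args) {
        List<ToyParser> parsers = new ArrayList<>();
        parsers.add(toy("image", Set.of("image/png"), Map.of("width", "640", "source", "image")));
        parsers.add(toy("ocr", Set.of("image/png"), Map.of("text", "hello", "source", "ocr")));

        System.out.println(selectLast(parsers, "image/png").name());                               // prints "ocr"
        System.out.println(parseAndMerge(parsers, "image/png", new byte[0]).get("source"));        // prints "ocr"
        System.out.println(parseNested(parsers, "image/png", new byte[0]).get("image").get("source")); // prints "image"
    }
}
```

The flat merge silently loses the image parser's `source` value, which is the overriding behaviour the comment says could be accepted; the nested variant keeps both results at the cost of a more complex metadata shape.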
[jira] [Commented] (TIKA-1176) ChmDirectoryListingSet does not correctly enumerate directory entries
[ https://issues.apache.org/jira/browse/TIKA-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169146#comment-14169146 ]

Hong-Thai Nguyen commented on TIKA-1176:

Hi [~mdgeek], thanks for offering code and a test file. Unfortunately, this check raised another exception on this file:
{code}
The full exception stack trace is included below:
org.apache.tika.exception.TikaException
	at org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:355)
	at org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:70)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:326)
	at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:285)
	at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
	at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
	at javax.swing.TransferHandler.importData(TransferHandler.java:755)
	at javax.swing.TransferHandler$DropHandler.drop(TransferHandler.java:1478)
	at java.awt.dnd.DropTarget.drop(DropTarget.java:434)
	at javax.swing.TransferHandler$SwingDropTarget.drop(TransferHandler.java:1203)
	at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(SunDropTargetContextPeer.java:519)
	at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(SunDropTargetContextPeer.java:832)
	at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(SunDropTargetContextPeer.java:756)
	at sun.awt.dnd.SunDropTargetEvent.dispatch(SunDropTargetEvent.java:30)
	at java.awt.Component.dispatchEventImpl(Component.java:4517)
	at java.awt.Container.dispatchEventImpl(Container.java:2097)
	at java.awt.Component.dispatchEvent(Component.java:4488)
	at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4575)
	at java.awt.LightweightDispatcher.processDropTargetEvent(Container.java:4310)
	at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4161)
	at java.awt.Container.dispatchEventImpl(Container.java:2083)
	at java.awt.Window.dispatchEventImpl(Window.java:2489)
	at java.awt.Component.dispatchEvent(Component.java:4488)
	at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:674)
	at java.awt.EventQueue.access$400(EventQueue.java:81)
	at java.awt.EventQueue$2.run(EventQueue.java:633)
	at java.awt.EventQueue$2.run(EventQueue.java:631)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
	at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
	at java.awt.EventQueue$3.run(EventQueue.java:647)
	at java.awt.EventQueue$3.run(EventQueue.java:645)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
	at java.awt.EventQueue.dispatchEvent(EventQueue.java:644)
	at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
	at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
	at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:174)
	at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:169)
	at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:161)
	at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)
Caused by: java.lang.ArrayIndexOutOfBoundsException
	at java.lang.System.arraycopy(Native Method)
	at org.apache.tika.parser.chm.core.ChmCommons.copyOfRange(ChmCommons.java:342)
	at org.apache.tika.parser.chm.core.ChmCommons.getChmBlockSegment(ChmCommons.java:108)
	at org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:337)
	... 43 more
{code}
Our CHM parser is quite complex. Could you provide a full fix, together with a test that checks for the expected content in the output for your file?

Thanks,

ChmDirectoryListingSet does not correctly enumerate directory entries
-
Key: TIKA-1176
URL: https://issues.apache.org/jira/browse/TIKA-1176
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: Doug Martin
Attachments: HelpStudioSample.chm
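The root cause in the trace above is an `ArrayIndexOutOfBoundsException` thrown by `System.arraycopy` inside a range-copy helper. As a minimal sketch of how such a helper can validate its arguments before copying, the class below checks the requested range first and reports it in the exception message. The class name, the `(source, offset, length)` signature and the choice of `IllegalArgumentException` are assumptions for illustration; this is not the actual `ChmCommons.copyOfRange` code and not a proposed fix for the parser.

```java
// Hypothetical helper for illustration; not Tika's ChmCommons implementation.
class SafeRangeCopy {

    /**
     * Copies {@code length} bytes starting at {@code offset}.
     * Rejects ranges that fall outside the source array instead of letting
     * System.arraycopy fail with an exception that carries no range details.
     */
    static byte[] copyOfRange(byte[] source, int offset, int length) {
        if (source == null) {
            throw new IllegalArgumentException("source is null");
        }
        // Written as a subtraction so that offset + length cannot overflow.
        if (offset < 0 || length < 0 || offset > source.length - length) {
            throw new IllegalArgumentException(
                    "range offset=" + offset + ", length=" + length
                    + " is outside an array of " + source.length + " bytes");
        }
        byte[] copy = new byte[length];
        System.arraycopy(source, offset, copy, 0, length);
        return copy;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 5};
        System.out.println(copyOfRange(data, 1, 3).length); // prints "3"
        try {
            copyOfRange(data, 3, 5); // would read past the end of the array
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A check like this does not explain why the CHM entry produces an out-of-range segment; it only turns the failure into a message that names the offending offset and length, which can help when writing the test requested in the comment.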
[jira] [Resolved] (TIKA-1444) Detection for VirtualPC VHD files
[ https://issues.apache.org/jira/browse/TIKA-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-1444.
--
Resolution: Fixed
Fix Version/s: 1.7

I don't think we can remove the .vhd extension from VHDL, as it seems to be very widely used for that.

I've added an expanded entry for Virtual PC VHD files in r1631329, without the glob extension.

Detection for VirtualPC VHD files
-
Key: TIKA-1444
URL: https://issues.apache.org/jira/browse/TIKA-1444
Project: Tika
Issue Type: Improvement
Components: mime
Affects Versions: 1.6
Reporter: Luis Filipe Nassif
Priority: Minor
Fix For: 1.7

Please remove the <glob pattern="*.vhd"/> entry from the text/x-vhdl mimetype definition and add the following:
{code}
<mime-type type="application/x-vhd">
  <glob pattern="*.vhd"/>
  <magic priority="50">
    <match value="conectix" type="string" offset="0"/>
  </magic>
</mime-type>
{code}
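The outcome of this issue (the `.vhd` glob stays with VHDL, while Virtual PC disks are recognised by the `conectix` string at offset 0) can be illustrated with a small stand-alone check. This is a sketch under stated assumptions: `VhdMagicSketch` and its methods are hypothetical, the fallback type is chosen for the example, and the logic is far simpler than how Tika actually combines magic and glob matches.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration of magic-before-glob detection; not Tika's detector.
class VhdMagicSketch {

    // The magic string from the requested definition, expected at offset 0.
    private static final byte[] MAGIC = "conectix".getBytes(StandardCharsets.US_ASCII);

    /** Returns true when the header begins with the "conectix" magic bytes. */
    static boolean startsWithMagic(byte[] header) {
        if (header == null || header.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Magic match first; otherwise the .vhd extension is still treated as VHDL. */
    static String detect(byte[] header, String fileName) {
        if (startsWithMagic(header)) {
            return "application/x-vhd";
        }
        if (fileName != null && fileName.toLowerCase(Locale.ROOT).endsWith(".vhd")) {
            return "text/x-vhdl";
        }
        return "application/octet-stream"; // assumed fallback for this sketch
    }

    public static void main(String[] args) {
        byte[] disk = "conectix\0\0\0\2".getBytes(StandardCharsets.US_ASCII);
        byte[] source = "library ieee;".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(disk, "disk.vhd"));    // prints "application/x-vhd"
        System.out.println(detect(source, "adder.vhd")); // prints "text/x-vhdl"
    }
}
```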
[jira] [Commented] (TIKA-1444) Detection for VirtualPC VHD files
[ https://issues.apache.org/jira/browse/TIKA-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169171#comment-14169171 ]

Hudson commented on TIKA-1444:
--
SUCCESS: Integrated in tika-trunk-jdk1.7 #261 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/261/])
TIKA-1444 Virtual PC Virtual Hard Disk mimetype (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1631329)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
[jira] [Commented] (TIKA-1444) Detection for VirtualPC VHD files
[ https://issues.apache.org/jira/browse/TIKA-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169183#comment-14169183 ]

Hudson commented on TIKA-1444:
--
SUCCESS: Integrated in tika-trunk-jdk1.6 #241 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/241/])
TIKA-1444 Virtual PC Virtual Hard Disk mimetype (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1631329)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
buildbot success in ASF Buildbot on tika-trunk
The Buildbot has detected a restored build on builder tika-trunk while building ASF Buildbot.
Full details are available at:
 http://ci.apache.org/builders/tika-trunk/builds/226

Buildbot URL: http://ci.apache.org/

Buildslave for this Build: lares_ubuntu

Build Reason: scheduler
Build Source Stamp: [branch tika/trunk] 1631329
Blamelist: nick

Build succeeded!

sincerely,
 -The Buildbot
[jira] [Commented] (TIKA-1444) Detection for VirtualPC VHD files
[ https://issues.apache.org/jira/browse/TIKA-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169298#comment-14169298 ]

Luis Filipe Nassif commented on TIKA-1444:
--
Thank you [~gagravarr]. Currently I am adding that definition in custom-mimetypes.xml, but I cannot add the glob pattern because Tika complains about an extension conflict. Is it possible to implement glob pattern overriding within a custom-mimetypes.xml? Should I open a new issue for that?
[jira] [Created] (TIKA-1449) Extract Images from PDF at Correct Location
James Baker created TIKA-1449:
-
Summary: Extract Images from PDF at Correct Location
Key: TIKA-1449
URL: https://issues.apache.org/jira/browse/TIKA-1449
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.7
Reporter: James Baker

The structured view of a PDF document shows inline images extracted at the bottom of each page. They should be shown at the location where they appear in the document, as they are with Word documents (etc.).

For more information, also see TIKA-1427.
[GitHub] tika pull request: Rome 1.5 retry
GitHub user jotomo opened a pull request:

https://github.com/apache/tika/pull/19

Rome 1.5 retry

Well, after repeatedly shooting myself in the foot by trying to run the tests with Java 8 (yields a NoClassDefFound: org.slf4j.LoggerFactory error. Fancy), I seem to have fixed this.

The dependency from _netcdf_ to _jdom_, which is declared optional in _netcdf_'s POM, was (inadvertently) satisfied by _rome_, until _rome_ upgraded to _jdom2_ with the upgrade to _rome_ 1.5. This fix makes _netcdf_'s dependency on _jdom_ (1), which apparently is not optional, explicit and required, resulting in all tests being green again.

I chose this approach over the more aggressive one where _netcdf_ is updated, which would then require the same version of _jdom_ as _rome_ (and then also declares that dependency as non-optional); however, I can't predict the impact that upgrade would have. Furthermore, that upgrade adds dependencies on libs not available on sonatype/maven.org.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jotomo/tika rome-1.5-retry

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/19.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #19

commit 32aadd083759b9022d152869f8f2990e5359bb8c
Author: Johannes Mockenhaupt g...@jotomo.de
Date: 2014-10-06T12:58:54Z

Revert Revert TIKA-1435 until we figure out the Rome/JDOM/HDFParser issue merge 1629338:1629337

This reverts commit be824cc499eee3e975003ecc3a7ae1e91d86c195.

commit fb9df6d51aeee2fc8ee7e2877cf974c8f266457b
Author: Johannes Mockenhaupt g...@jotomo.de
Date: 2014-10-06T15:30:18Z

Make netcdf's dependency on jdom explicit (netcdf declares it with scope provided).

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
[jira] [Commented] (TIKA-1443) Add a junk text detector to Tika
[ https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170462#comment-14170462 ]

Chris A. Mattmann commented on TIKA-1443:
-
#love

Add a junk text detector to Tika

Key: TIKA-1443
URL: https://issues.apache.org/jira/browse/TIKA-1443
Project: Tika
Issue Type: Wish
Reporter: Tim Allison
Priority: Minor

It would be helpful to have a detector that flags documents whose extracted text is junk. This could be used as a component of TIKA-1332 or as a standalone detector. See TIKA-1332 for some initial ideas of what statistics we might use for such a detector.

Two use cases:
* Parser developers could quickly see whether changes in code lead to less junky documents or more junky documents. This would also aid in prioritizing manual review of output comparison (see discussion in TIKA-1419).
* Search system integrators could use that information to set document-specific relevancy rankings or to avoid indexing a document.
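To make the idea of a junk text score concrete, here is a minimal sketch of one possible statistic: the share of code points in the extracted text that are letters, digits or whitespace. The statistic, the class and method names, and the threshold are all assumptions made for this example; the issue does not specify which statistics to use and defers to TIKA-1332 for ideas.

```java
// Hypothetical scoring heuristic for illustration; not a Tika detector.
class JunkTextSketch {

    /** Fraction of code points that are letters, digits or whitespace (0.0 for empty text). */
    static double plausibleCharRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int plausible = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint); // step over surrogate pairs correctly
            total++;
            if (Character.isLetterOrDigit(codePoint) || Character.isWhitespace(codePoint)) {
                plausible++;
            }
        }
        return (double) plausible / total;
    }

    /** Flags text whose ratio falls below the caller-supplied threshold. */
    static boolean looksLikeJunk(String text, double threshold) {
        return plausibleCharRatio(text) < threshold;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeJunk("Extracted text that reads normally", 0.8)); // prints "false"
        System.out.println(looksLikeJunk("\u0000\u0001#@!$%^&*", 0.8));              // prints "true"
    }
}
```

A single ratio like this would misjudge legitimate punctuation-heavy content such as source code or tables, so a usable detector would likely combine several statistics; the sketch only shows how a per-document score could feed the two use cases listed above.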