[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-13 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169090#comment-14169090
 ] 

Hong-Thai Nguyen commented on TIKA-1445:


Interesting question!
In my view, parser selection and parser priority should be decided at 
runtime through configuration, not inside a parser.
The image parser is an interesting case of competing parsers (Tesseract vs. 
the classic image parsers). We have two problems here:
1. When several parsers can handle the same MIME type, which one is selected?
2. When we have several parsers, can we apply all of them and merge the results 
(metadata and handler)?

* For case 1, if we use an override configuration of parsers at runtime, we can 
declare several parsers with a matching MIME type, and the last one in the list 
will be selected. We may extend the CLI/web service to inject this kind of 
configuration.
* For case 2, we don't have a solution for now. We may extend CompositeParser 
to accept a 'many' parsers mode and call the matching parsers in a chain. 
Merging the results is another problem: we could accept that a metadata name 
set by one parser is overridden by another. The ideal solution is (again) a 
nested structure in our metadata that can store each parser's result.
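
To make the two cases concrete, here is a minimal sketch in plain Java. It does not use Tika's real `Parser` or `CompositeParser` API: `SketchParser`, `selectLast`, `parseWithAll`, and the parser names `image` and `ocr` are all hypothetical, chosen only to illustrate "the last declared parser wins" (case 1) and "run every matching parser and keep each result in its own nested entry" (case 2).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parser; this is NOT Tika's Parser interface.
interface SketchParser {
    String name();
    boolean supports(String mimeType);
    Map<String, String> parse(String input);
}

final class ParserSelectionSketch {

    // Case 1: among the parsers matching the MIME type, the one declared last wins.
    static SketchParser selectLast(List<SketchParser> declared, String mimeType) {
        SketchParser chosen = null;
        for (SketchParser p : declared) {
            if (p.supports(mimeType)) {
                chosen = p; // a later declaration overrides an earlier one
            }
        }
        return chosen; // null when nothing matches
    }

    // Case 2: run every matching parser and store each result under the parser's
    // name, so two parsers emitting the same metadata key do not overwrite each other.
    static Map<String, Map<String, String>> parseWithAll(
            List<SketchParser> declared, String mimeType, String input) {
        Map<String, Map<String, String>> nested = new LinkedHashMap<>();
        for (SketchParser p : declared) {
            if (p.supports(mimeType)) {
                nested.put(p.name(), p.parse(input));
            }
        }
        return nested;
    }

    // Demo factory: a parser that returns one fixed metadata entry.
    static SketchParser fixed(String name, String mimeType, String key, String value) {
        return new SketchParser() {
            public String name() { return name; }
            public boolean supports(String m) { return mimeType.equals(m); }
            public Map<String, String> parse(String input) {
                Map<String, String> md = new LinkedHashMap<>();
                md.put(key, value);
                return md;
            }
        };
    }

    public static void main(String[] args) {
        List<SketchParser> declared = new ArrayList<>();
        declared.add(fixed("image", "image/png", "width", "640"));
        declared.add(fixed("ocr", "image/png", "text", "hello"));
        System.out.println(selectLast(declared, "image/png").name());
        System.out.println(parseWithAll(declared, "image/png", "ignored"));
    }
}
```

Keying the merged result by parser name sidesteps the override question entirely, at the cost of callers having to know which parser produced the value they want.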

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7

 Attachments: TIKA-1445.Mattmann.101214.patch.txt


 Now that Tesseract is the default image parser in Tika for many image types, 
 consider how to add back in the metadata extraction capabilities by the other 
 Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1176) ChmDirectoryListingSet does not correctly enumerate directory entries

2014-10-13 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169146#comment-14169146
 ] 

Hong-Thai Nguyen commented on TIKA-1176:


Hi [~mdgeek], thanks for providing the code and test file. Unfortunately, this 
check raised another exception on this file:
{code}
The full exception stack trace is included below:

org.apache.tika.exception.TikaException
at org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:355)
at org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:70)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:326)
at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:285)
at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
at javax.swing.TransferHandler.importData(TransferHandler.java:755)
at javax.swing.TransferHandler$DropHandler.drop(TransferHandler.java:1478)
at java.awt.dnd.DropTarget.drop(DropTarget.java:434)
at javax.swing.TransferHandler$SwingDropTarget.drop(TransferHandler.java:1203)
at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(SunDropTargetContextPeer.java:519)
at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(SunDropTargetContextPeer.java:832)
at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(SunDropTargetContextPeer.java:756)
at sun.awt.dnd.SunDropTargetEvent.dispatch(SunDropTargetEvent.java:30)
at java.awt.Component.dispatchEventImpl(Component.java:4517)
at java.awt.Container.dispatchEventImpl(Container.java:2097)
at java.awt.Component.dispatchEvent(Component.java:4488)
at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4575)
at java.awt.LightweightDispatcher.processDropTargetEvent(Container.java:4310)
at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4161)
at java.awt.Container.dispatchEventImpl(Container.java:2083)
at java.awt.Window.dispatchEventImpl(Window.java:2489)
at java.awt.Component.dispatchEvent(Component.java:4488)
at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:674)
at java.awt.EventQueue.access$400(EventQueue.java:81)
at java.awt.EventQueue$2.run(EventQueue.java:633)
at java.awt.EventQueue$2.run(EventQueue.java:631)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
at java.awt.EventQueue$3.run(EventQueue.java:647)
at java.awt.EventQueue$3.run(EventQueue.java:645)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
at java.awt.EventQueue.dispatchEvent(EventQueue.java:644)
at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:174)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:169)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:161)
at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)
Caused by: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.tika.parser.chm.core.ChmCommons.copyOfRange(ChmCommons.java:342)
at org.apache.tika.parser.chm.core.ChmCommons.getChmBlockSegment(ChmCommons.java:108)
at org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:337)
... 43 more
{code} 

Our CHM parser is quite complex. Could you provide a full fix, plus a test 
that checks the expected output content for your file?
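
The root cause in the trace is an `ArrayIndexOutOfBoundsException` raised by `System.arraycopy` during a range copy. As a purely illustrative sketch (it is not the code in `ChmCommons`, whose actual signature is not shown here, and it is not a fix for the directory enumeration itself), a range copy could validate its offsets first so that a bad block boundary is reported with the offending values. The class and method names below are hypothetical.

```java
import java.util.Arrays;

final class SafeRangeCopySketch {

    // Copies bytes [from, to) out of data. Ranges that fall outside the array are
    // rejected with a descriptive message instead of a bare ArrayIndexOutOfBoundsException.
    static byte[] copyOfRangeChecked(byte[] data, int from, int to) {
        if (data == null) {
            throw new IllegalArgumentException("data is null");
        }
        if (from < 0 || to < from || to > data.length) {
            throw new IllegalArgumentException(
                "invalid range [" + from + ", " + to + ") for array of length " + data.length);
        }
        return Arrays.copyOfRange(data, from, to);
    }

    public static void main(String[] args) {
        byte[] block = {1, 2, 3, 4, 5};
        System.out.println(Arrays.toString(copyOfRangeChecked(block, 1, 4)));
        try {
            copyOfRangeChecked(block, 3, 9); // end offset past the array
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A check like this only makes the failure easier to diagnose; the offsets passed in would still need to be computed correctly for the attached file to parse.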

Thanks,

 ChmDirectoryListingSet does not correctly enumerate directory entries
 -

 Key: TIKA-1176
 URL: https://issues.apache.org/jira/browse/TIKA-1176
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Doug Martin
 Attachments: HelpStudioSample.chm


 

[jira] [Resolved] (TIKA-1444) Detection for VirtualPC VHD files

2014-10-13 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1444.
--
   Resolution: Fixed
Fix Version/s: 1.7

I don't think we can remove the .vhd extension from VHDL, as it seems to be 
very widely used for that.

I've added an expanded entry for Virtual PC VHD files in r1631329, without the 
glob extension.

 Detection for VirtualPC VHD files
 -

 Key: TIKA-1444
 URL: https://issues.apache.org/jira/browse/TIKA-1444
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.6
Reporter: Luis Filipe Nassif
Priority: Minor
 Fix For: 1.7


 Please, remove the <glob pattern="*.vhd"/> entry from text/x-vhdl mimetype 
 definition and add the following:
 {code}
 <mime-type type="application/x-vhd">
   <glob pattern="*.vhd"/>
   <magic priority="50">
     <match value="conectix" type="string" offset="0"/>
   </magic>
 </mime-type>
 {code}
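
 The magic rule proposed above matches the string conectix at offset 0. A minimal sketch of that check in plain Java follows; it does not go through Tika's detector API, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class VhdMagicSketch {

    // The string the proposed match rule looks for at offset 0.
    private static final byte[] COOKIE = "conectix".getBytes(StandardCharsets.US_ASCII);

    // Returns true when the first bytes of the given prefix equal the cookie.
    static boolean startsWithCookie(byte[] prefix) {
        if (prefix == null || prefix.length < COOKIE.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOfRange(prefix, 0, COOKIE.length), COOKIE);
    }

    public static void main(String[] args) {
        byte[] vhdLike = "conectix plus more bytes".getBytes(StandardCharsets.US_ASCII);
        byte[] vhdlSource = "library ieee;".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithCookie(vhdLike));
        System.out.println(startsWithCookie(vhdlSource));
    }
}
```

 A content check of this kind is what lets the two types share the .vhd extension: a VHDL source file fails the check even though its name matches the glob.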





[jira] [Commented] (TIKA-1444) Detection for VirtualPC VHD files

2014-10-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169171#comment-14169171
 ] 

Hudson commented on TIKA-1444:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #261 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/261/])
TIKA-1444 Virtual PC Virtual Hard Disk mimetype (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1631329)
* 
/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml




[jira] [Commented] (TIKA-1444) Detection for VirtualPC VHD files

2014-10-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169183#comment-14169183
 ] 

Hudson commented on TIKA-1444:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #241 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/241/])
TIKA-1444 Virtual PC Virtual Hard Disk mimetype (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1631329)
* 
/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml




buildbot success in ASF Buildbot on tika-trunk

2014-10-13 Thread buildbot
The Buildbot has detected a restored build on builder tika-trunk while building 
ASF Buildbot.
Full details are available at:
 http://ci.apache.org/builders/tika-trunk/builds/226

Buildbot URL: http://ci.apache.org/

Buildslave for this Build: lares_ubuntu

Build Reason: scheduler
Build Source Stamp: [branch tika/trunk] 1631329
Blamelist: nick

Build succeeded!

sincerely,
 -The Buildbot





[jira] [Commented] (TIKA-1444) Detection for VirtualPC VHD files

2014-10-13 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169298#comment-14169298
 ] 

Luis Filipe Nassif commented on TIKA-1444:
--

Thank you [~gagravarr]. Currently I am adding that definition in 
custom-mimetypes.xml, but I cannot add the glob pattern because Tika complains 
about an extension conflict. Is it possible to implement glob pattern 
overriding within a custom-mimetypes.xml? Should I open a new issue for that?
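
To illustrate what such an override could mean, here is a small sketch in plain Java. It is a hypothetical model, not Tika's MIME registry: `GlobRegistrySketch`, `register`, and the `allowOverride` flag are invented for this example. Without the flag, registering a glob already owned by another type fails, as in the conflict described above; with it, the later registration replaces the earlier one.

```java
import java.util.HashMap;
import java.util.Map;

final class GlobRegistrySketch {

    // Maps a glob pattern such as "*.vhd" to the MIME type that owns it.
    private final Map<String, String> typeByGlob = new HashMap<>();

    // Registers a glob for a type. Unless allowOverride is set, a glob already
    // owned by a different type is rejected as a conflict.
    void register(String glob, String mimeType, boolean allowOverride) {
        String existing = typeByGlob.get(glob);
        if (existing != null && !existing.equals(mimeType) && !allowOverride) {
            throw new IllegalStateException(
                "glob " + glob + " is already registered for " + existing);
        }
        typeByGlob.put(glob, mimeType);
    }

    // Returns the owning MIME type, or null when the glob is unknown.
    String lookup(String glob) {
        return typeByGlob.get(glob);
    }

    public static void main(String[] args) {
        GlobRegistrySketch registry = new GlobRegistrySketch();
        registry.register("*.vhd", "text/x-vhdl", false);
        try {
            registry.register("*.vhd", "application/x-vhd", false);
        } catch (IllegalStateException e) {
            System.out.println("conflict: " + e.getMessage());
        }
        registry.register("*.vhd", "application/x-vhd", true);
        System.out.println(registry.lookup("*.vhd"));
    }
}
```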



[jira] [Created] (TIKA-1449) Extract Images from PDF at Correct Location

2014-10-13 Thread James Baker (JIRA)
James Baker created TIKA-1449:
-

 Summary: Extract Images from PDF at Correct Location
 Key: TIKA-1449
 URL: https://issues.apache.org/jira/browse/TIKA-1449
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.7
Reporter: James Baker


The structured view of a PDF document shows inline images extracted at the 
bottom of each page. They should be shown at the location they appear in the 
document, as they do with Word documents (etc.)

For more information, also see TIKA-1427.





[GitHub] tika pull request: Rome 1.5 retry

2014-10-13 Thread jotomo
GitHub user jotomo opened a pull request:

https://github.com/apache/tika/pull/19

Rome 1.5 retry

Well, after repeatedly shooting myself in the foot by trying to run the 
tests with Java 8 (which yields a NoClassDefFound: org.slf4j.LoggerFactory 
error. Fancy), I seem to have fixed this. 
The dependency from _netcdf_ to _jdom_, which is declared optional in 
_netcdf_'s POM, was (inadvertently) satisfied by _rome_ until _rome_ moved 
to _jdom2_ in the upgrade to _rome_ 1.5. This fix makes _netcdf_'s dependency 
on _jdom_ (1), which apparently is not optional, explicit and required, 
with the result that all tests are green again. I chose this approach over the 
more aggressive one where _netcdf_ is updated, which would then require the 
same version of _jdom_ as _rome_ (and would also declare that dependency as 
non-optional), because I can't predict the impact that upgrade would have. 
Furthermore, that upgrade adds dependencies on libs not available on 
sonatype/maven.org.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jotomo/tika rome-1.5-retry

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/19.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19


commit 32aadd083759b9022d152869f8f2990e5359bb8c
Author: Johannes Mockenhaupt g...@jotomo.de
Date:   2014-10-06T12:58:54Z

Revert Revert TIKA-1435 until we figure out the Rome/JDOM/HDFParser issue 
merge 1629338:1629337

This reverts commit be824cc499eee3e975003ecc3a7ae1e91d86c195.

commit fb9df6d51aeee2fc8ee7e2877cf974c8f266457b
Author: Johannes Mockenhaupt g...@jotomo.de
Date:   2014-10-06T15:30:18Z

Make netcdf's dependency on jdom explicit (netcdf declares it with scope 
provided).




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (TIKA-1443) Add a junk text detector to Tika

2014-10-13 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170462#comment-14170462
 ] 

Chris A. Mattmann commented on TIKA-1443:
-

#love

 Add a junk text detector to Tika
 

 Key: TIKA-1443
 URL: https://issues.apache.org/jira/browse/TIKA-1443
 Project: Tika
  Issue Type: Wish
Reporter: Tim Allison
Priority: Minor

 It would be helpful to have a detector that flags documents whose extracted 
 text is junk.  This could be used as a component of TIKA-1332 or as a 
 standalone detector.  See TIKA-1332 for some initial ideas of what statistics 
 we might use for such a detector.
 Two use cases:
 * Parser developers could quickly see whether changes in code lead to less 
 junky documents or more junky documents.  This would also aid in 
 prioritizing manual review of output comparison (see discussion in TIKA-1419).
 * Search system integrators could use that information to set document 
 specific relevancy rankings or to avoid indexing a document
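
 As a rough illustration of the kind of statistic such a detector might compute, here is a small sketch in plain Java. The statistic (the share of code points that are not letters, digits, whitespace, or common punctuation), the threshold, and all names are assumptions made for this example; they are not taken from TIKA-1332.

```java
final class JunkTextScoreSketch {

    // Punctuation treated as normal text in this sketch (an arbitrary choice).
    private static final String COMMON_PUNCTUATION = ".,;:!?'\"()-";

    // Fraction of code points that are neither letters, digits, whitespace,
    // nor common punctuation. Returns 0.0 for null or empty input.
    static double suspiciousRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            boolean ordinary = Character.isLetterOrDigit(cp)
                    || Character.isWhitespace(cp)
                    || COMMON_PUNCTUATION.indexOf(cp) >= 0;
            if (!ordinary) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    // Flags text whose suspicious ratio exceeds the caller's threshold.
    static boolean looksLikeJunk(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }

    public static void main(String[] args) {
        System.out.println(suspiciousRatio("Hello, world."));
        System.out.println(looksLikeJunk("\uFFFD\uFFFD\uFFFD\uFFFDab", 0.5));
    }
}
```

 A single ratio like this would misjudge many legitimate documents (source code, tables of symbols), so it is best read as a placeholder for whatever statistics are eventually chosen.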


