[jira] [Commented] (TIKA-1316) Old Site Code in Trunk

2014-06-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019606#comment-14019606
 ] 

Hudson commented on TIKA-1316:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #23 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/23/])
fix for TIKA-1316 identified by Tyler Palsulich. (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1600811)
* /tika/trunk/src/site


> Old Site Code in Trunk
> --
>
> Key: TIKA-1316
> URL: https://issues.apache.org/jira/browse/TIKA-1316
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: easyfix
> Fix For: 1.6
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The \{tika trunk\}/src/site directory seems to be old and unused. It does not 
> correspond to the site currently on tika.apache.org 
> (http://svn.apache.org/repos/asf/tika/site/).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1316) Old Site Code in Trunk

2014-06-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019601#comment-14019601
 ] 

Hudson commented on TIKA-1316:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #22 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/22/])
fix for TIKA-1316 identified by Tyler Palsulich. (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1600811)
* /tika/trunk/src/site


> Old Site Code in Trunk
> --
>
> Key: TIKA-1316
> URL: https://issues.apache.org/jira/browse/TIKA-1316
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: easyfix
> Fix For: 1.6
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The \{tika trunk\}/src/site directory seems to be old and unused. It does not 
> correspond to the site currently on tika.apache.org 
> (http://svn.apache.org/repos/asf/tika/site/).





[DISCUSS] 1.6 Release?

2014-06-05 Thread Mattmann, Chris A (3980)
Hey Guys,

So there's been lots of great activity lately between Nick, Tim, Annie,
Tyler, Lewis, Paul R., me, and others. We've got ~44 issues fixed in
JIRA. I moved all unfixed issues to 1.7 and would like to roll a 1.6 RC
no later than Monday. Please let me know if this sounds kosher to folks.
LOVE the activity lately around Tika!

Cheers,
Chris


++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






Re: Review Request 22246: New parser for Matlab .mat files

2014-06-05 Thread Mattmann, Chris A (3980)
Hi Annie,

Can you please create a JIRA issue for this, and also please create
a diff against the Tika trunk by doing the following:

0. create JIRA issue for Matlab parser
1. svn co http://svn.apache.org/repos/asf/tika/trunk tika
2. cd tika
3. drop your Matlab parser files in e.g.,
tika-parsers/src/main/java/org/apache/tika/parser/matlab
4. update file packages, etc.
5. svn status (files look ok?)
6. svn diff > TIKA-xxx.aburgess.yyMMdd.patch.txt (where xxx is the JIRA
issue id from 0.)

Then if you attach the diff to ReviewBoard I can annotate the lines etc.
with comments. Thanks! Also, once you create the JIRA issue I will help
get it into the sources.

Cheers,
Chris







-Original Message-
From: Ann Burgess
Reply-To: "dev@tika.apache.org", "Bryant, Ann C (398J-Affiliate)"
Date: Thursday, June 5, 2014 11:37 AM
To: Chris Mattmann
Cc: Matthias Krueger, tika, "Bryant, Ann C (398J-Affiliate)", Nick Burch
Subject: Re: Review Request 22246: New parser for Matlab .mat files

>
>
>> On June 4, 2014, 11:25 p.m., Matthias Krueger wrote:
>> > The Matlab MIME types used seem to be application/x-matlab-data or
>>application/matlab-mat.
>> > 
>> > Would it make sense to add them to the mime XML for detection?
>> > 
>> > 
>> >   MATLAB data file
>> >   
>> >   
>> > 
>> >   
>> >   
>> > 
>> > 
>> >
>> 
>> Chris Mattmann wrote:
>> +1 this makes a ton of sense to add IMO.
>> 
>> Nick Burch wrote:
>> There's some odd whitespace going on - we normally use 4 spaces and
>>no tabs.
>> 
>> When outputting the variables, it would probably make sense to put
>>each one into either a paragraph or a list, so that we get helpful
>>output in html mode as well as text mode
>> 
>> With that in place, it would then be possible to have a unit test
>>that checked the html output, as well as the current text one
>> 
>> Also on testing, I think at least some of the tests have an
>>implementation of assertContains, which generally gives a more helpful
>>failure message than assertTrue(s.contains(...)) does, might be worth
>>looking into that?
>
>Great input - thank you! I will integrate both and upload the diff.
>
>
>- Ann
>
>
>---
>This is an automatically generated e-mail. To reply, visit:
>https://reviews.apache.org/r/22246/#review44773
>---
>
>
>On June 4, 2014, 10:23 p.m., Ann Burgess wrote:
>> 
>> ---
>> This is an automatically generated e-mail. To reply, visit:
>> https://reviews.apache.org/r/22246/
>> ---
>> 
>> (Updated June 4, 2014, 10:23 p.m.)
>> 
>> 
>> Review request for tika and Chris Mattmann.
>> 
>> 
>> Repository: tika
>> 
>> 
>> Description
>> ---
>> 
>> This is a new parser for Matlab .mat files.  The parser utilizes the
>>JmatIO, Matlab's MAT-file I/O API in JAVA. JmatIO is available through
>>Maven Central.  The text output from this parser provides variable names
>>and dimensions that are both inside and outside of data structures, but
>>does NOT provide the actual data values within each .mat file.
>> 
>> 
>> Diffs
>> -
>> 
>> 
>> Diff: https://reviews.apache.org/r/22246/diff/
>> 
>> 
>> Testing
>> ---
>> 
>> Successfully ran a basic unit test that checks both --text and
>>--metadata parser output.
>> 
>> 
>> File Attachments
>> 
>> 
>> Parser File
>>   https://reviews.apache.org/media/uploaded/files/2014/06/04/cb39636d-ec53-4fbc-b348-6a4db8907f6b__MatParser.java
>> Unit Test
>>   https://reviews.apache.org/media/uploaded/files/2014/06/04/bbff8c6b-caa1-4830-b441-532c28c3c78e__MatParserTest.java
>> 
>> 
>> Thanks,
>> 
>> Ann Burgess
>> 
>>
>
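
Nick's point in the review above about assertContains versus
assertTrue(s.contains(...)) can be illustrated with a small stand-alone
helper. This is only a sketch: the class name and message format are
hypothetical, and it is not the helper that exists in the Tika test
sources. The idea is that a failing check reports both the expected
substring and the text that was searched, rather than a bare "false".

```java
// Hypothetical sketch; not the assertContains used in the Tika tests.
final class AssertContainsSketch {

    private AssertContainsSketch() {
    }

    /**
     * Fails with a message naming both the expected substring and the
     * text that was searched, so the test log shows what was produced.
     */
    static void assertContains(String needle, String haystack) {
        if (haystack == null || !haystack.contains(needle)) {
            throw new AssertionError(
                    "expected to find <" + needle + "> in:\n" + haystack);
        }
    }
}
```

With assertTrue(s.contains("x")) a failure only says the condition was
false; with a helper like this the parser output is printed along with
the missing text.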



[jira] [Resolved] (TIKA-1316) Old Site Code in Trunk

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1316.
-

Resolution: Fixed

{noformat}
[mattmann-0420740:~/tmp/tika] mattmann% svn rm https://svn.apache.org/repos/asf/tika/trunk/src/site -m "fix for TIKA-1316 identified by Tyler Palsulich."

Committed revision 1600811.
{noformat}

Nice catch, [~tpalsulich]

> Old Site Code in Trunk
> --
>
> Key: TIKA-1316
> URL: https://issues.apache.org/jira/browse/TIKA-1316
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: easyfix
> Fix For: 1.6
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The \{tika trunk\}/src/site directory seems to be old and unused. It does not 
> correspond to the site currently on tika.apache.org 
> (http://svn.apache.org/repos/asf/tika/site/).





[jira] [Updated] (TIKA-1316) Old Site Code in Trunk

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1316:


Affects Version/s: (was: 1.6)
Fix Version/s: 1.6

> Old Site Code in Trunk
> --
>
> Key: TIKA-1316
> URL: https://issues.apache.org/jira/browse/TIKA-1316
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: easyfix
> Fix For: 1.6
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The \{tika trunk\}/src/ directory seems to be old and unused. It does not 
> correspond to the site currently on tika.apache.org 
> (http://svn.apache.org/repos/asf/tika/site/).





[jira] [Updated] (TIKA-1304) Implement Metadata Property with PropertyType ALT

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1304:


Component/s: metadata

> Implement Metadata Property with PropertyType ALT
> -
>
> Key: TIKA-1304
> URL: https://issues.apache.org/jira/browse/TIKA-1304
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Tim Allison
>Priority: Trivial
>
> PropertyType Alt has been available for a while, but it doesn't appear to 
> have been implemented.  I'd like to implement it to fix TIKA-1295.
> If I've missed the implementation or if there is a preferred workaround, 
> please let me know, and I'll close this issue and use that to fix TIKA-1295.





[jira] [Updated] (TIKA-1316) Old Site Code in Trunk

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1316:


Component/s: general

> Old Site Code in Trunk
> --
>
> Key: TIKA-1316
> URL: https://issues.apache.org/jira/browse/TIKA-1316
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Tyler Palsulich
>Priority: Trivial
>  Labels: easyfix
> Fix For: 1.6
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The \{tika trunk\}/src/ directory seems to be old and unused. It does not 
> correspond to the site currently on tika.apache.org 
> (http://svn.apache.org/repos/asf/tika/site/).





[jira] [Updated] (TIKA-1316) Old Site Code in Trunk

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1316:


Description: The \{tika trunk\}/src/site directory seems to be old and unused. 
It does not correspond to the site currently on tika.apache.org 
(http://svn.apache.org/repos/asf/tika/site/).  (was: The \{tika trunk\}/src/ 
directory seems to be old and unused. It does not correspond to the site currently 
on tika.apache.org (http://svn.apache.org/repos/asf/tika/site/).)

> Old Site Code in Trunk
> --
>
> Key: TIKA-1316
> URL: https://issues.apache.org/jira/browse/TIKA-1316
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: easyfix
> Fix For: 1.6
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The \{tika trunk\}/src/site directory seems to be old and unused. It does not 
> correspond to the site currently on tika.apache.org 
> (http://svn.apache.org/repos/asf/tika/site/).





[jira] [Assigned] (TIKA-1316) Old Site Code in Trunk

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-1316:
---

Assignee: Chris A. Mattmann

> Old Site Code in Trunk
> --
>
> Key: TIKA-1316
> URL: https://issues.apache.org/jira/browse/TIKA-1316
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: easyfix
> Fix For: 1.6
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The \{tika trunk\}/src/ directory seems to be old and unused. It does not 
> correspond to the site currently on tika.apache.org 
> (http://svn.apache.org/repos/asf/tika/site/).





[jira] [Updated] (TIKA-1238) Update OutlookExtractor to handle codepage identification more rigorously

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1238:


Component/s: parser

> Update OutlookExtractor to handle codepage identification more rigorously
> -
>
> Key: TIKA-1238
> URL: https://issues.apache.org/jira/browse/TIKA-1238
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.7
>
>
> Since OutlookExtractor's codepage detection chunk was written, POI's HSMF has 
> added more robust capabilities for identifying codepages in Outlook .msg 
> files.  As a first step to integrating those improvements, I'll copy and 
> paste some of POI's code into OutlookExtractor.  As a second step, I'll 
> expose more of HSMF's capabilities within POI and then factor out the 
> duplicate code in Tika.





[jira] [Updated] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1300:


Component/s: parser

> Switch default PDFBox parser to NonSequentialParser
> ---
>
> Key: TIKA-1300
> URL: https://issues.apache.org/jira/browse/TIKA-1300
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.7
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the 
> NonSequentialParser. We added a parameter to use the NonSequentialParser in 
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?





[jira] [Updated] (TIKA-1295) Make some Dublin Core items multi-valued

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1295:


Component/s: metadata

> Make some Dublin Core items multi-valued
> 
>
> Key: TIKA-1295
> URL: https://issues.apache.org/jira/browse/TIKA-1295
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.7
>
>
> According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, 
> dc:title, dc:description and dc:rights should allow multiple values because 
> of language alternatives.  Unless anyone objects in the next few days, I'll 
> switch those to Property.toInternalTextBag() from Property.toInternalText().  
> I'll also modify PDFParser to extract dc:rights.
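
The difference between a single-valued text property and a text bag,
which is what the description above proposes for dc:title,
dc:description and dc:rights, can be sketched without any Tika classes.
Everything below is hypothetical (class, method names, storage); it only
shows why language alternatives need bag semantics, and it does not
reproduce Tika's Metadata or Property API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of single-valued vs. multi-valued metadata storage.
final class MultiValuedMetadataSketch {

    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Single-valued semantics: a later value replaces the earlier one. */
    void set(String key, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(key, single);
    }

    /** Bag semantics: every value is kept, in insertion order. */
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Returns the stored values, or an empty list if the key is unknown. */
    List<String> get(String key) {
        return values.getOrDefault(key, new ArrayList<>());
    }
}
```

With set(), storing an English and then a German title leaves only the
German one; with add(), both alternatives survive.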





[jira] [Updated] (TIKA-1323) Improve exception reporting in JAX-RS server

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1323:


Component/s: server

> Improve exception reporting in JAX-RS server
> 
>
> Key: TIKA-1323
> URL: https://issues.apache.org/jira/browse/TIKA-1323
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Reporter: Tim Allison
>Priority: Minor
>
> I'd like to use tika-server for TIKA-1302.  As part of that, I'd like to 
> record exception stacktraces per document.  I see two options: transmit the 
> info back to the client (assuming a doc didn't bring the server down :) ) 
> along with the current error code or log the document id and stacktrace via 
> the server.  Given my current design thoughts, I'd prefer the first option.
> Any objections or recommendations?
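
For the first option above (sending the stack trace back to the client),
the trace first has to be turned into text. A minimal sketch using only
the JDK follows; the class and method names are hypothetical, and how
tika-server would attach the text to the response is not something this
thread specifies.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper: renders a Throwable, including its causes, as text.
final class StackTraceSketch {

    private StackTraceSketch() {
    }

    static String stackTraceOf(Throwable t) {
        StringWriter buffer = new StringWriter();
        PrintWriter writer = new PrintWriter(buffer);
        // printStackTrace also writes any "Caused by:" sections.
        t.printStackTrace(writer);
        writer.flush();
        return buffer.toString();
    }
}
```

The resulting string could be placed in the body of an error response
or written to a per-document log, depending on which option is chosen.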





[jira] [Updated] (TIKA-1298) testEmbeddedPDFEmbeddingAnotherDocument fails with PDFBox 1.8.5 and java 1.6

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1298:


Component/s: parser

> testEmbeddedPDFEmbeddingAnotherDocument fails with PDFBox 1.8.5 and java 1.6
> 
>
> Key: TIKA-1298
> URL: https://issues.apache.org/jira/browse/TIKA-1298
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>
> Not sure why this is happening.  Test works with PDFBox 1.8.5 and Java 1.7; 
> and it works with PDFBox 1.8.4 and either Java 1.6 or Java 1.7.  I'll look 
> into this now.





[jira] [Updated] (TIKA-1321) Add experimental Stax/Streaming XWPF/docx extractor

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1321:


Component/s: parser

> Add experimental Stax/Streaming XWPF/docx extractor
> ---
>
> Key: TIKA-1321
> URL: https://issues.apache.org/jira/browse/TIKA-1321
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>
> I'd like to contribute an experimental streaming extractor for docx.  I 
> should have something ready for committing in a few weeks.  I'll attach 
> drafts as they're ready.
> At least for a couple of releases, I'd like to keep it in 
> o.a.t.parser.microsoft.ooxml.experimental if that makes sense.





[jira] [Updated] (TIKA-1254) No warning when Tika does not find a parser.

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1254:


Component/s: parser

> No warning when Tika does not find a parser.
> 
>
> Key: TIKA-1254
> URL: https://issues.apache.org/jira/browse/TIKA-1254
> Project: Tika
>  Issue Type: Wish
>  Components: parser
>Reporter: Ankit Gupta
>Priority: Minor
>
> When using Tika via Gradle or Maven, if the dependency is specified only on 
> tika-core and not on tika-parsers, then there is no warning to let you know 
> that a library is missing, and the function returns an empty string.
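
The behaviour wished for above can be sketched as a lookup that logs a
warning when nothing is registered for a media type instead of silently
returning an empty string. This is a stand-alone illustration: the
registry, its method names and the log wording are all hypothetical and
do not reflect how Tika resolves parsers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.logging.Logger;

// Hypothetical registry; not Tika's parser lookup.
final class ParserLookupSketch {

    private static final Logger LOG =
            Logger.getLogger(ParserLookupSketch.class.getName());

    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();
    private int warnings = 0;

    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    /** Returns extracted text, or "" with a logged warning if no parser exists. */
    String parse(String mediaType, byte[] content) {
        Function<byte[], String> parser = parsers.get(mediaType);
        if (parser == null) {
            warnings++;
            LOG.warning("No parser registered for " + mediaType
                    + "; is the parser module on the classpath?");
            return "";
        }
        return parser.apply(content);
    }

    int warningCount() {
        return warnings;
    }
}
```

The return value is unchanged for callers that ignore the log, but the
missing dependency now leaves a visible trace.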





[jira] [Updated] (TIKA-1235) empty docx creates exception

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1235:


Component/s: parser

> empty docx creates exception
> 
>
> Key: TIKA-1235
> URL: https://issues.apache.org/jira/browse/TIKA-1235
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Lutz Theurer
>Priority: Minor
>
> Using an empty docx file as input results in an exception. Trace:
> Apache Tika was unable to parse the document
> at F:\Microsoft Word-Dokument (neu) (2).docx.
> The full exception stack trace is included below:
> org.apache.tika.exception.TikaException: Error creating OOXML extractor
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:128)
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
>   at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
>   at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
>   at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
>   at javax.swing.TransferHandler.importData(Unknown Source)
>   at javax.swing.TransferHandler$DropHandler.drop(Unknown Source)
>   at java.awt.dnd.DropTarget.drop(Unknown Source)
>   at javax.swing.TransferHandler$SwingDropTarget.drop(Unknown Source)
>   at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(Unknown Source)
>   at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(Unknown Source)
>   at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(Unknown Source)
>   at sun.awt.dnd.SunDropTargetEvent.dispatch(Unknown Source)
>   at java.awt.Component.dispatchEventImpl(Unknown Source)
>   at java.awt.Container.dispatchEventImpl(Unknown Source)
>   at java.awt.Component.dispatchEvent(Unknown Source)
>   at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
>   at java.awt.LightweightDispatcher.processDropTargetEvent(Unknown Source)
>   at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
>   at java.awt.Container.dispatchEventImpl(Unknown Source)
>   at java.awt.Window.dispatchEventImpl(Unknown Source)
>   at java.awt.Component.dispatchEvent(Unknown Source)
>   at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
>   at java.awt.EventQueue.access$200(Unknown Source)
>   at java.awt.EventQueue$3.run(Unknown Source)
>   at java.awt.EventQueue$3.run(Unknown Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
>   at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
>   at java.awt.EventQueue$4.run(Unknown Source)
>   at java.awt.EventQueue$4.run(Unknown Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
>   at java.awt.EventQueue.dispatchEvent(Unknown Source)
>   at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
>   at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
>   at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
>   at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
>   at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
>   at java.awt.EventDispatchThread.run(Unknown Source)
> Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: Package should contain a content type part [M1.13]
>   at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:178)
>   at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:662)
>   at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:269)
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:74)
>   ... 43 more
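
The trace above ends with POI rejecting the package because it has no
content type part. OOXML files are ZIP packages, so one cheap guard is
to check for the ZIP local-file-header signature before handing the
bytes to an OOXML extractor. The sketch below assumes the "empty" file
is zero-length or otherwise not a ZIP archive, which the report does not
actually state; the class and method names are hypothetical and this is
not how Tika handles the case.

```java
// Hypothetical pre-check; assumes the caller has the first bytes of the file.
final class ZipMagicSketch {

    private ZipMagicSketch() {
    }

    /**
     * Returns true if the given prefix starts with the ZIP local file
     * header signature "PK\003\004". A zero-length file returns false.
     */
    static boolean looksLikeZip(byte[] head) {
        return head != null
                && head.length >= 4
                && head[0] == 'P'
                && head[1] == 'K'
                && head[2] == 3
                && head[3] == 4;
    }
}
```

A caller could use a check like this to report "not an OOXML package"
up front rather than surfacing a nested InvalidFormatException.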





[jira] [Updated] (TIKA-1182) Out of memory exception when parsing TTF file

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1182:


Component/s: parser

> Out of memory exception when parsing TTF file
> -
>
> Key: TIKA-1182
> URL: https://issues.apache.org/jira/browse/TIKA-1182
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: Ubuntu
> java version "1.7.0_40"
> Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
> Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)
>Reporter: Erik Hetzner
> Attachments: 16A4FF_8.ttf, TIKA-1182-fix1.patch, TIKA_1182.java
>
>
> When parsing the attached file using tika-app-1.4.jar, CPU usage is high and 
> it never seems to finish.
> When parsing using the attached Java code, I get an out of memory exception.
> Let me know what other information I can provide.
> Thank you!





[jira] [Resolved] (TIKA-1234) empty docx creates exception

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1234.
-

Resolution: Duplicate

> empty docx creates exception
> 
>
> Key: TIKA-1234
> URL: https://issues.apache.org/jira/browse/TIKA-1234
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.4
>Reporter: Lutz Theurer
>Priority: Minor
>





[jira] [Updated] (TIKA-1174) Invalid characters in filtered PDF output

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1174:


Component/s: parser

> Invalid characters in filtered PDF output
> -
>
> Key: TIKA-1174
> URL: https://issues.apache.org/jira/browse/TIKA-1174
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Environment: Mac OS X 10.8.5, Java 1.7u40 (but also seen on CentOS5)
>Reporter: Matt Sheppard
>Priority: Minor
> Attachments: map_sp_1c_a4.pdf
>
>
> The PDF document at 
> http://www.logan.qld.gov.au/__data/assets/pdf_file/0010/9496/map_sp_1a_a4.pdf 
> produces invalid characters in the output when filtered by Tika 1.4.
> {noformat}
> >
> /opt/funnelback/mbin/java/bin/java -jar tika-app-1.4.jar map_sp_1c_a4.pdf | head -n 40
> ERROR - Error: Could not parse predefined CMAP file for 'nullžf 
> °-ˇžl,¡ì$1-UCS2'
> <html xmlns="http://www.w3.org/1999/xhtml">
> 
> [snip]
> Cycle network
> 
> 
> 
> HILEY
> 
> {noformat}
> Is there any proper way to avoid this, or is the best approach to strip such 
> characters from Tika's output?
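
If stripping is the route taken, one common definition of "invalid" is
any code point outside the XML 1.0 Char production (tab, LF, CR,
U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The sketch
below filters on that definition. The class and method names are
hypothetical, and whether this would remove the particular characters
shown in the output above is not established in this thread, since
mis-decoded but legal characters would pass through unchanged.

```java
// Hypothetical post-filter for extracted text; not part of Tika.
final class XmlCharFilterSketch {

    private XmlCharFilterSketch() {
    }

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that XML 1.0 does not allow. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Unpaired surrogates are dropped as well, because codePointAt returns
them as values in the excluded U+D800 to U+DFFF range.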





[jira] [Commented] (TIKA-1269) Self-hosted documentation for the JAX-RS Server

2014-06-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019434#comment-14019434
 ] 

Hudson commented on TIKA-1269:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #22 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/22/])
TIKA-1269 Some endpoints may lack a produces annotation (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1600793)
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaWelcome.java


> Self-hosted documentation for the JAX-RS Server
> ---
>
> Key: TIKA-1269
> URL: https://issues.apache.org/jira/browse/TIKA-1269
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.5
>Reporter: Nick Burch
> Fix For: 1.7
>
> Attachments: enable-enunciate.patch
>
>
> Currently, if you fire up the JAX-RS Tika Server, and go to the root of the 
> server in a web browser, you get an empty page back. You have to know to head 
> over to https://wiki.apache.org/tika/TikaJAXRS find out what the available 
> URLs are
> We should self-host some simple documentation on the server at the root of 
> it, so that people can discover what it offers. Ideally, this should be 
> largely auto-generated based on the endpoints, so that we don't risk missing 
> things when we add new features
> This will also allow us to potentially offer a sample running version of the 
> server for people to discover Tika with
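
The idea of generating the listing from the endpoints themselves can be sketched with plain reflection. This is only an illustration under stated assumptions: the Route annotation and DemoResource class are hypothetical stand-ins (the real server uses JAX-RS annotations such as @Path), chosen so the example runs without any dependencies.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a JAX-RS style annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Route {
    String verb();
    String path();
}

// Hypothetical resource class, present only so there is something to scan.
class DemoResource {
    @Route(verb = "PUT", path = "/tika")
    public String extractText() { return ""; }

    @Route(verb = "GET", path = "/version")
    public String version() { return ""; }

    public String helper() { return ""; } // not annotated, so never listed
}

final class EndpointLister {
    private EndpointLister() {}

    // Returns one "VERB /path" line per annotated method, sorted for stable output.
    static List<String> describe(Class<?> resource) {
        List<String> lines = new ArrayList<String>();
        for (Method m : resource.getDeclaredMethods()) {
            Route route = m.getAnnotation(Route.class);
            if (route != null) {
                lines.add(route.verb() + " " + route.path());
            }
        }
        Collections.sort(lines); // getDeclaredMethods() has no guaranteed order
        return lines;
    }
}
```

Because the listing is derived from the annotations, a newly added endpoint would show up on the welcome page without anyone having to remember to document it.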





[jira] [Commented] (TIKA-1324) Use a common path for the Tika Server unpacker resources

2014-06-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019433#comment-14019433
 ] 

Hudson commented on TIKA-1324:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #22 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/22/])
TIKA-1324 As discussed on the mailing lists, use a common url prefix for the 
unpacker resources (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1600791)
* /tika/trunk/CHANGES.txt
* 
/tika/trunk/tika-server/src/main/java/org/apache/tika/server/UnpackerResource.java
* 
/tika/trunk/tika-server/src/test/java/org/apache/tika/server/UnpackerResourceTest.java


> Use a common path for the Tika Server unpacker resources
> 
>
> Key: TIKA-1324
> URL: https://issues.apache.org/jira/browse/TIKA-1324
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.5
>Reporter: Nick Burch
> Fix For: 1.6
>
>
> Currently, the two different methods of the Tika Server unpacker endpoint 
> don't share a common url prefix, which causes them to clash with the new 
> welcome endpoint
> As discussed on the mailing list, we should change these two to have a common 
> prefix, so that the urls are then:
>  * /unpack/{id}
>  * /unpack/all/{id}
> After making the change, the changelog and release notes need to be updated 
> for it, as it is a breaking change for the (handful of) users of the endpoint
> This will help with TIKA-1269





[jira] [Commented] (TIKA-1269) Self-hosted documentation for the JAX-RS Server

2014-06-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019415#comment-14019415
 ] 

Hudson commented on TIKA-1269:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #21 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/21/])
TIKA-1269 Some endpoints may lack a produces annotation (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1600793)
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaWelcome.java


> Self-hosted documentation for the JAX-RS Server
> ---
>
> Key: TIKA-1269
> URL: https://issues.apache.org/jira/browse/TIKA-1269
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.5
>Reporter: Nick Burch
> Fix For: 1.7
>
> Attachments: enable-enunciate.patch
>
>
> Currently, if you fire up the JAX-RS Tika Server, and go to the root of the 
> server in a web browser, you get an empty page back. You have to know to head 
> over to https://wiki.apache.org/tika/TikaJAXRS to find out what the available 
> URLs are
> We should self-host some simple documentation on the server at the root of 
> it, so that people can discover what it offers. Ideally, this should be 
> largely auto-generated based on the endpoints, so that we don't risk missing 
> things when we add new features
> This will also allow us to potentially offer a sample running version of the 
> server for people to discover Tika with





[jira] [Commented] (TIKA-1324) Use a common path for the Tika Server unpacker resources

2014-06-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019414#comment-14019414
 ] 

Hudson commented on TIKA-1324:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #21 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/21/])
TIKA-1324 As discussed on the mailing lists, use a common url prefix for the 
unpacker resources (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1600791)
* /tika/trunk/CHANGES.txt
* 
/tika/trunk/tika-server/src/main/java/org/apache/tika/server/UnpackerResource.java
* 
/tika/trunk/tika-server/src/test/java/org/apache/tika/server/UnpackerResourceTest.java


> Use a common path for the Tika Server unpacker resources
> 
>
> Key: TIKA-1324
> URL: https://issues.apache.org/jira/browse/TIKA-1324
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.5
>Reporter: Nick Burch
> Fix For: 1.6
>
>
> Currently, the two different methods of the Tika Server unpacker endpoint 
> don't share a common url prefix, which causes them to clash with the new 
> welcome endpoint
> As discussed on the mailing list, we should change these two to have a common 
> prefix, so that the urls are then:
>  * /unpack/{id}
>  * /unpack/all/{id}
> After making the change, the changelog and release notes need to be updated 
> for it, as it is a breaking change for the (handful of) users of the endpoint
> This will help with TIKA-1269





[jira] [Commented] (TIKA-1319) Translation

2014-06-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019416#comment-14019416
 ] 

Hudson commented on TIKA-1319:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #21 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/21/])
contribution for TIKA-1319: Translation module contributed by Tyler Palsulich. 
(mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1600787)
* /tika/trunk/CHANGES.txt
* /tika/trunk/pom.xml
* /tika/trunk/tika-core/src/main/java/org/apache/tika/Tika.java
* /tika/trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java
* /tika/trunk/tika-core/src/main/java/org/apache/tika/language/translate
* 
/tika/trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java
* 
/tika/trunk/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java
* /tika/trunk/tika-translate
* /tika/trunk/tika-translate/pom.xml
* /tika/trunk/tika-translate/src
* /tika/trunk/tika-translate/src/main
* /tika/trunk/tika-translate/src/main/java
* /tika/trunk/tika-translate/src/main/java/org
* /tika/trunk/tika-translate/src/main/java/org/apache
* /tika/trunk/tika-translate/src/main/java/org/apache/tika
* /tika/trunk/tika-translate/src/main/java/org/apache/tika/language
* /tika/trunk/tika-translate/src/main/java/org/apache/tika/language/translate
* 
/tika/trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java
* /tika/trunk/tika-translate/src/main/resources
* /tika/trunk/tika-translate/src/main/resources/META-INF
* /tika/trunk/tika-translate/src/main/resources/META-INF/services
* 
/tika/trunk/tika-translate/src/main/resources/META-INF/services/org.apache.tika.language.translate.Translator
* /tika/trunk/tika-translate/src/main/resources/org
* /tika/trunk/tika-translate/src/main/resources/org/apache
* /tika/trunk/tika-translate/src/main/resources/org/apache/tika
* /tika/trunk/tika-translate/src/main/resources/org/apache/tika/language
* 
/tika/trunk/tika-translate/src/main/resources/org/apache/tika/language/translator.microsoft.properties
* /tika/trunk/tika-translate/src/test
* /tika/trunk/tika-translate/src/test/java
* /tika/trunk/tika-translate/src/test/java/org
* /tika/trunk/tika-translate/src/test/java/org/apache
* /tika/trunk/tika-translate/src/test/java/org/apache/tika
* /tika/trunk/tika-translate/src/test/java/org/apache/tika/language
* /tika/trunk/tika-translate/src/test/java/org/apache/tika/language/translate
* 
/tika/trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java


> Translation
> ---
>
> Key: TIKA-1319
> URL: https://issues.apache.org/jira/browse/TIKA-1319
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.6
>
>
> I just opened up a review on reviews.apache.org -- 
> https://reviews.apache.org/r/22219/. I copied the description below. 
> This patch adds basic language translation functionality to Tika. Translation 
> is provided by a Microsoft API, but accessed through Apache 2 licensed 
> com.memetix.microsoft-translator-java-api 
> (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants 
> to use the translation feature, they have to add a client id and client 
> secret to the 
> tika-core/src/main/resources/org/apache/tika/language/translator.properties 
> file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
> com.memetix as a dependency in tika-core. I put the Translator class in 
> org.apache.tika.language. There is no integration with the server or CLI, 
> yet. Further, only Strings are translated right now -- if you pass in a full 
> document with xml tags, the structure will be mangled. But, I think that 
> would be a cool feature -- translate the body, title, subtitle, etc, but not 
> the structural elements. 
> There is still more work to do, but I wanted some more eyes on this to make 
> sure I'm heading in the right direction and this is a desired feature. Let me 
> know what you think!
> There are two simple unit tests for now which translate "hello" to French 
> ("salut"). One for inputting the source and target languages, one for 
> inputting just the target language (and detecting the source language 
> automatically).
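
The two call styles described above (explicit source language, or target language only) can be sketched as follows. The interface and the dictionary-backed implementation are hypothetical and exist only for illustration; the real Translator interface lives in org.apache.tika.language.translate and may differ, and the real implementation calls a remote service rather than a lookup table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface for illustration only.
interface SketchTranslator {
    String translate(String text, String sourceLanguage, String targetLanguage);
    String translate(String text, String targetLanguage);
}

// Toy implementation backed by a lookup table instead of a remote service.
final class DictionaryTranslator implements SketchTranslator {
    private final Map<String, String> table = new HashMap<String, String>();

    DictionaryTranslator() {
        table.put("en|fr|hello", "salut");
    }

    public String translate(String text, String sourceLanguage, String targetLanguage) {
        String hit = table.get(sourceLanguage + "|" + targetLanguage + "|" + text);
        return hit != null ? hit : text; // fall back to the input when no entry exists
    }

    public String translate(String text, String targetLanguage) {
        // A real implementation would detect the source language;
        // this sketch simply assumes English.
        return translate(text, "en", targetLanguage);
    }
}
```

Keeping the service behind a small interface like this is what would allow another backend to be swapped in later without touching callers.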





[jira] [Commented] (TIKA-1269) Self-hosted documentation for the JAX-RS Server

2014-06-05 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019407#comment-14019407
 ] 

Nick Burch commented on TIKA-1269:
--

With the change in TIKA-1324, the very basic self-hosted welcome page is now 
displaying. 

At some point fairly soon we should switch this for something more fully 
featured, I'd vote for something that makes it very easy to not only discover 
what the APIs are, but also to try them out. How easy it'll be to get some of 
the very nice documentation + trying it out frameworks that are out there, 
integrated with the Tika Server and its list of endpoints, is another matter...

> Self-hosted documentation for the JAX-RS Server
> ---
>
> Key: TIKA-1269
> URL: https://issues.apache.org/jira/browse/TIKA-1269
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.5
>Reporter: Nick Burch
> Fix For: 1.7
>
> Attachments: enable-enunciate.patch
>
>
> Currently, if you fire up the JAX-RS Tika Server, and go to the root of the 
> server in a web browser, you get an empty page back. You have to know to head 
> over to https://wiki.apache.org/tika/TikaJAXRS to find out what the available 
> URLs are
> We should self-host some simple documentation on the server at the root of 
> it, so that people can discover what it offers. Ideally, this should be 
> largely auto-generated based on the endpoints, so that we don't risk missing 
> things when we add new features
> This will also allow us to potentially offer a sample running version of the 
> server for people to discover Tika with





[jira] [Commented] (TIKA-1324) Use a common path for the Tika Server unpacker resources

2014-06-05 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019398#comment-14019398
 ] 

Nick Burch commented on TIKA-1324:
--

Change made in r1600791.

Leaving open for now, as whoever writes the 1.6 release notes needs to mention 
this in them

> Use a common path for the Tika Server unpacker resources
> 
>
> Key: TIKA-1324
> URL: https://issues.apache.org/jira/browse/TIKA-1324
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.5
>Reporter: Nick Burch
> Fix For: 1.6
>
>
> Currently, the two different methods of the Tika Server unpacker endpoint 
> don't share a common url prefix, which causes them to clash with the new 
> welcome endpoint
> As discussed on the mailing list, we should change these two to have a common 
> prefix, so that the urls are then:
>  * /unpack/{id}
>  * /unpack/all/{id}
> After making the change, the changelog and release notes need to be updated 
> for it, as it is a breaking change for the (handful of) users of the endpoint
> This will help with TIKA-1269





Re: Review Request 22246: New parser for Matlab .mat files

2014-06-05 Thread Ann Burgess


> On June 4, 2014, 11:25 p.m., Matthias Krueger wrote:
> > The Matlab MIME types used seem to be application/x-matlab-data or 
> > application/matlab-mat.
> > 
> > Would it make sense to add them to the mime XML for detection?
> > 
> > 
> >   MATLAB data file
> >   
> >   
> > 
> >   
> >   
> > 
> > 
> >
> 
> Chris Mattmann wrote:
> +1 this makes a ton of sense to add IMO.
> 
> Nick Burch wrote:
> There's some odd whitespace going on - we normally use 4 spaces and no 
> tabs.
> 
> When outputting the variables, it would probably make sense to put each 
> one into either a paragraph or a list, so that we get helpful output in html 
> mode as well as text mode
> 
> With that in place, it would then be possible to have a unit test that 
> checked the html output, as well as the current text one
> 
> Also on testing, I think at least some of the tests have an 
> implementation of assertContains, which generally gives a more helpful 
> failure message than assertTrue(s.contains(...)) does, might be worth looking 
> into that?
> 
> Ann Burgess wrote:
> Great input - thank you! I will integrate both and upload the diff.
> 
> Matthias Krueger wrote:
> This is on the right track; some quick additional comments:
> * I tested with the files in 
> https://github.com/scipy/scipy/tree/master/scipy/io/matlab/tests/data. JMatIO 
> only supports MATLAB 5 files. This could be added as a quick comment or 
> javadoc.
> * I think Tika is based on JDK 1.6. I don't see a reason for the test to 
> special-case JDK 1.5 and simply return as passing there.

+1 Matthias. 


- Ann
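
For the assertContains suggestion in the thread above, a helper of roughly this shape shows why the failure message is more useful than assertTrue(s.contains(...)). This is a hypothetical sketch, not the implementation that exists in the Tika tests.

```java
// Hypothetical test helper, not Tika's own implementation.
final class AssertSketch {
    private AssertSketch() {}

    // Fails with a message that shows both the expected fragment and the
    // text that was actually searched, which assertTrue cannot do.
    static void assertContains(String needle, String haystack) {
        if (haystack == null || !haystack.contains(needle)) {
            throw new AssertionError(
                    "Expected to find '" + needle + "' in:\n" + haystack);
        }
    }
}
```

With assertTrue the failure only reports that a boolean was false; here the failing output itself appears in the test report.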


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22246/#review44773
---


On June 4, 2014, 10:23 p.m., Ann Burgess wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22246/
> ---
> 
> (Updated June 4, 2014, 10:23 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> ---
> 
> This is a new parser for Matlab .mat files.  The parser uses JMatIO, a 
> Matlab MAT-file I/O API in Java. JMatIO is available through Maven Central. 
>  The text output from this parser provides variable names and dimensions that 
> are both inside and outside of data structures, but does NOT provide the 
> actual data values within each .mat file. 
> 
> 
> Diffs
> -
> 
> 
> Diff: https://reviews.apache.org/r/22246/diff/
> 
> 
> Testing
> ---
> 
> Successfully ran a basic unit test that checks both --text and --metadata 
> parser output.  
> 
> 
> File Attachments
> 
> 
> Parser File
>   
> https://reviews.apache.org/media/uploaded/files/2014/06/04/cb39636d-ec53-4fbc-b348-6a4db8907f6b__MatParser.java
> Unit Test
>   
> https://reviews.apache.org/media/uploaded/files/2014/06/04/bbff8c6b-caa1-4830-b441-532c28c3c78e__MatParserTest.java
> 
> 
> Thanks,
> 
> Ann Burgess
> 
>



Re: svn commit: r1600791 - in /tika/trunk: CHANGES.txt tika-server/src/main/java/org/apache/tika/server/UnpackerResource.java tika-server/src/test/java/org/apache/tika/server/UnpackerResourceTest.java

2014-06-05 Thread Mattmann, Chris A (3980)
Thanks nick 

Sent from my iPhone

> On Jun 5, 2014, at 4:19 PM, "n...@apache.org"  wrote:
> 
> Author: nick
> Date: Thu Jun  5 23:19:21 2014
> New Revision: 1600791
> 
> URL: http://svn.apache.org/r1600791
> Log:
> TIKA-1324 As discussed on the mailing lists, use a common url prefix for the 
> unpacker resources
> 
> Modified:
>tika/trunk/CHANGES.txt
>
> tika/trunk/tika-server/src/main/java/org/apache/tika/server/UnpackerResource.java
>
> tika/trunk/tika-server/src/test/java/org/apache/tika/server/UnpackerResourceTest.java
> 
> Modified: tika/trunk/CHANGES.txt
> URL: 
> http://svn.apache.org/viewvc/tika/trunk/CHANGES.txt?rev=1600791&r1=1600790&r2=1600791&view=diff
> ==
> --- tika/trunk/CHANGES.txt (original)
> +++ tika/trunk/CHANGES.txt Thu Jun  5 23:19:21 2014
> @@ -1,5 +1,10 @@
> Release 1.6 - ??/??/2014
> 
> +  * The Tika Server URLs for the unpacker resources have been changed,
> +to bring them under a common prefix (TIKA-1324). The mapping is
> +/unpacker/{id} -> /unpack/{id}
> +/all/{id}  -> /unpack/all/{id}
> +
>   * Added module and core Tika interface for translating text between
> languages and added a default implementation that call's Microsoft's
> translate service (TIKA-1319)
> @@ -18,8 +23,8 @@ Release 1.6 - ??/??/2014
> based (TIKA-1204, TIKA-1221)
> 
>   * Added a user facing welcome page to the Tika Server, which
> -says what it is, and a very brief summary of what is available.
> -(Not working yet though...!) (TIKA-1269)
> +says what it is, and a very brief summary of what is available. 
> +(TIKA-1269)
> 
>   * Added Tika Server endpoints to list the available mime types,
> Parsers and Detectors, similar to the --list- methods on
> 
> Modified: 
> tika/trunk/tika-server/src/main/java/org/apache/tika/server/UnpackerResource.java
> URL: 
> http://svn.apache.org/viewvc/tika/trunk/tika-server/src/main/java/org/apache/tika/server/UnpackerResource.java?rev=1600791&r1=1600790&r2=1600791&view=diff
> ==
> --- 
> tika/trunk/tika-server/src/main/java/org/apache/tika/server/UnpackerResource.java
>  (original)
> +++ 
> tika/trunk/tika-server/src/main/java/org/apache/tika/server/UnpackerResource.java
>  Thu Jun  5 23:19:21 2014
> @@ -60,6 +60,7 @@ import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> import org.xml.sax.helpers.DefaultHandler;
> 
> +@Path("/unpack")
> public class UnpackerResource {
>   private static final Log logger = LogFactory.getLog(UnpackerResource.class);
>   public static final String TEXT_FILENAME = "__TEXT__";
> @@ -70,7 +71,7 @@ public class UnpackerResource {
>   this.tikaConfig = tikaConfig;
>   }
> 
> -  @Path("/unpacker{id:(/.*)?}")
> +  @Path("/{id:(/.*)?}")
>   @PUT
>   @Produces({"application/zip", "application/x-tar"})
>   public Map unpack(
> 
> Modified: 
> tika/trunk/tika-server/src/test/java/org/apache/tika/server/UnpackerResourceTest.java
> URL: 
> http://svn.apache.org/viewvc/tika/trunk/tika-server/src/test/java/org/apache/tika/server/UnpackerResourceTest.java?rev=1600791&r1=1600790&r2=1600791&view=diff
> ==
> --- 
> tika/trunk/tika-server/src/test/java/org/apache/tika/server/UnpackerResourceTest.java
>  (original)
> +++ 
> tika/trunk/tika-server/src/test/java/org/apache/tika/server/UnpackerResourceTest.java
>  Thu Jun  5 23:19:21 2014
> @@ -36,8 +36,9 @@ import org.apache.cxf.jaxrs.lifecycle.Si
> import org.junit.Test;
> 
> public class UnpackerResourceTest extends CXFTestBase {
> -private static final String UNPACKER_PATH = "/unpacker";
> -private static final String ALL_PATH = "/all";
> +private static final String BASE_PATH = "/unpack";
> +private static final String UNPACKER_PATH = BASE_PATH + "";
> +private static final String ALL_PATH = BASE_PATH + "/all";
> 
>private static final String TEST_DOC_WAV = "Doc1_ole.doc";
>private static final String WAV1_MD5 = "bdd0a78a54968e362445364f95d8dc96";
> 
> 


[jira] [Created] (TIKA-1324) Use a common path for the Tika Server unpacker resources

2014-06-05 Thread Nick Burch (JIRA)
Nick Burch created TIKA-1324:


 Summary: Use a common path for the Tika Server unpacker resources
 Key: TIKA-1324
 URL: https://issues.apache.org/jira/browse/TIKA-1324
 Project: Tika
  Issue Type: Improvement
  Components: server
Affects Versions: 1.5
Reporter: Nick Burch
 Fix For: 1.6


Currently, the two different methods of the Tika Server unpacker endpoint don't 
share a common url prefix, which causes them to clash with the new welcome 
endpoint

As discussed on the mailing list, we should change these two to have a common 
prefix, so that the urls are then:
 * /unpack/{id}
 * /unpack/all/{id}

After making the change, the changelog and release notes need to be updated for 
it, as it is a breaking change for the (handful of) users of the endpoint

This will help with TIKA-1269
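
Since this is a breaking change, clients that still build the old URLs need a mapping. A minimal sketch of that mapping follows; the class name is hypothetical, and the old paths (/unpacker and /all) are taken from the CHANGES.txt entry in the r1600791 commit mail rather than from this issue text.

```java
// Hypothetical client-side helper for moving callers to the new URLs.
final class UnpackPathMigration {
    private UnpackPathMigration() {}

    // Maps /unpacker/{id} -> /unpack/{id} and /all/{id} -> /unpack/all/{id};
    // any other path is returned unchanged.
    static String toNewPath(String oldPath) {
        if (oldPath.equals("/unpacker") || oldPath.startsWith("/unpacker/")) {
            return "/unpack" + oldPath.substring("/unpacker".length());
        }
        if (oldPath.equals("/all") || oldPath.startsWith("/all/")) {
            return "/unpack" + oldPath;
        }
        return oldPath;
    }
}
```

The prefix checks deliberately require either an exact match or a following slash, so an unrelated path such as /allowed would not be rewritten.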





[jira] [Commented] (TIKA-1319) Translation

2014-06-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019388#comment-14019388
 ] 

Hudson commented on TIKA-1319:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #21 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/21/])
contribution for TIKA-1319: Translation module contributed by Tyler Palsulich. 
(mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1600787)
* /tika/trunk/CHANGES.txt
* /tika/trunk/pom.xml
* /tika/trunk/tika-core/src/main/java/org/apache/tika/Tika.java
* /tika/trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java
* /tika/trunk/tika-core/src/main/java/org/apache/tika/language/translate
* 
/tika/trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java
* 
/tika/trunk/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java
* /tika/trunk/tika-translate
* /tika/trunk/tika-translate/pom.xml
* /tika/trunk/tika-translate/src
* /tika/trunk/tika-translate/src/main
* /tika/trunk/tika-translate/src/main/java
* /tika/trunk/tika-translate/src/main/java/org
* /tika/trunk/tika-translate/src/main/java/org/apache
* /tika/trunk/tika-translate/src/main/java/org/apache/tika
* /tika/trunk/tika-translate/src/main/java/org/apache/tika/language
* /tika/trunk/tika-translate/src/main/java/org/apache/tika/language/translate
* 
/tika/trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java
* /tika/trunk/tika-translate/src/main/resources
* /tika/trunk/tika-translate/src/main/resources/META-INF
* /tika/trunk/tika-translate/src/main/resources/META-INF/services
* 
/tika/trunk/tika-translate/src/main/resources/META-INF/services/org.apache.tika.language.translate.Translator
* /tika/trunk/tika-translate/src/main/resources/org
* /tika/trunk/tika-translate/src/main/resources/org/apache
* /tika/trunk/tika-translate/src/main/resources/org/apache/tika
* /tika/trunk/tika-translate/src/main/resources/org/apache/tika/language
* 
/tika/trunk/tika-translate/src/main/resources/org/apache/tika/language/translator.microsoft.properties
* /tika/trunk/tika-translate/src/test
* /tika/trunk/tika-translate/src/test/java
* /tika/trunk/tika-translate/src/test/java/org
* /tika/trunk/tika-translate/src/test/java/org/apache
* /tika/trunk/tika-translate/src/test/java/org/apache/tika
* /tika/trunk/tika-translate/src/test/java/org/apache/tika/language
* /tika/trunk/tika-translate/src/test/java/org/apache/tika/language/translate
* 
/tika/trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java


> Translation
> ---
>
> Key: TIKA-1319
> URL: https://issues.apache.org/jira/browse/TIKA-1319
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.6
>
>
> I just opened up a review on reviews.apache.org -- 
> https://reviews.apache.org/r/22219/. I copied the description below. 
> This patch adds basic language translation functionality to Tika. Translation 
> is provided by a Microsoft API, but accessed through Apache 2 licensed 
> com.memetix.microsoft-translator-java-api 
> (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants 
> to use the translation feature, they have to add a client id and client 
> secret to the 
> tika-core/src/main/resources/org/apache/tika/language/translator.properties 
> file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
> com.memetix as a dependency in tika-core. I put the Translator class in 
> org.apache.tika.language. There is no integration with the server or CLI, 
> yet. Further, only Strings are translated right now -- if you pass in a full 
> document with xml tags, the structure will be mangled. But, I think that 
> would be a cool feature -- translate the body, title, subtitle, etc, but not 
> the structural elements. 
> There is still more work to do, but I wanted some more eyes on this to make 
> sure I'm heading in the right direction and this is a desired feature. Let me 
> know what you think!
> There are two simple unit tests for now which translate "hello" to French 
> ("salut"). One for inputting the source and target languages, one for 
> inputting just the target language (and detecting the source language 
> automatically).





[jira] [Updated] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1302:


Component/s: server
 general
 cli

> Let's run Tika against a large batch of docs nightly
> 
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
>  Issue Type: Improvement
>  Components: cli, general, server
>Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 





[jira] [Updated] (TIKA-1275) Upgrade Commons compress to 1.8.1

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1275:


Component/s: general

> Upgrade Commons compress to 1.8.1
> -
>
> Key: TIKA-1275
> URL: https://issues.apache.org/jira/browse/TIKA-1275
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Reporter: Fabian Lange
>
> Hi,
> I am also using Tika to detect content in archives. But because the raw 
> input stream is a CipherInputStream I ran into 
> https://issues.apache.org/jira/browse/COMPRESS-277
> which compress kindly solved for me.
> To be able to use Tika without patching my stack, I would like to see an 
> upgrade of commons compress to 1.8.1 as soon as it is out.
> This may or may not fall within the 1.6 timeframe.
> Thanks!





[jira] [Updated] (TIKA-1137) Wasted work in WontBeSerializedError.writeObject()

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1137:


Component/s: parser

> Wasted work in WontBeSerializedError.writeObject()
> --
>
> Key: TIKA-1137
> URL: https://issues.apache.org/jira/browse/TIKA-1137
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
> Environment: any
>Reporter: Adrian Nistor
>  Labels: patch, perfomance
> Attachments: patch.diff
>
>
> The problem appears in version 1.3 and in revision 1494353.  I
> attached a one-line patch that fixes it.
> In method "WontBeSerializedError.writeObject", the loop over
> "e.getStackTrace()" should break immediately after "found" is set to
> "true".  All the iterations after "found" is set to "true" do not
> perform any useful work, at best they just set "found" again to
> "true".
> Method "embedInTempFile" in class "ExternalEmbedderTest" has a similar
> loop (over "embeddedMetadata.getValues(metadataName)"), and this loop
> breaks immediately after "foundExpectedValue" is set to "true", just
> like in the proposed patch.
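
The pattern being proposed can be sketched in isolation as below. This is not the actual WontBeSerializedError code or the attached patch; the class and method names are hypothetical and only show a search loop that stops at the first match.

```java
// Hypothetical illustration of an early-exit search over a stack trace.
final class StackSearch {
    private StackSearch() {}

    // Returns true if any frame in the trace belongs to the given class.
    static boolean calledFrom(StackTraceElement[] trace, String className) {
        boolean found = false;
        for (StackTraceElement frame : trace) {
            if (className.equals(frame.getClassName())) {
                found = true;
                break; // later frames cannot change the answer
            }
        }
        return found;
    }
}
```

Without the break the result is the same, but every remaining frame is still compared, which is the wasted work the report describes.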





[jira] [Updated] (TIKA-1167) Embedded object not extracted

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1167:


Component/s: parser

> Embedded object not extracted
> -
>
> Key: TIKA-1167
> URL: https://issues.apache.org/jira/browse/TIKA-1167
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Daniel Bonniot de Ruisselet
>Priority: Critical
> Fix For: 1.7
>
> Attachments: Doc w Structure that wont extract.docx
>
>
> For the attached docx, tika seems to detect the embedded object, as shown by 
> this tag:
> {{}}
> However, extraction itself (using -z on the command line, or using the API) 
> does not seem to work for this object:
> {{java -jar tika-app-1.4.jar -z Doc\ w\ Structure\ that\ wont\ extract.docx}}
> {{Extracting 'rId9_image1.wmf' (application/x-msmetafile) to 
> /tmp/tika/rId9_image1.wmf}}





[jira] [Updated] (TIKA-1287) Update NetCDF .jar file on Maven Central

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1287:


Component/s: parser

> Update NetCDF .jar file on Maven Central
> 
>
> Key: TIKA-1287
> URL: https://issues.apache.org/jira/browse/TIKA-1287
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Ann Burgess
>  Labels: jar, maven, netcdf, tika, unit-test, update
>
> I am working to update the NetCDFParser file.  When using the most-recent 
> .jar file available from http://www.unidata.ucar.edu/ at the command line I 
> receive a note about a deprecated API: 
> javac -classpath 
> ../../../../tika-core/target/tika-core-1.6-SNAPSHOT.jar:../../../../toolsUI-4.3.jar
>  org/apache/tika/parser/netcdf/NetCDFParser.java
> Note: org/apache/tika/parser/netcdf/NetCDFParser.java uses or overrides a 
> deprecated API.
> Note: Recompile with -Xlint:deprecation for details.
> After updating the NetCDFParser file with non-deprecated methods (e.g. 
> changing "dimension.getName()" to "dimension.getFullName()") however, I get 
> failed unit tests in maven, which I assume is because the Maven Central Repo 
> has an outdated version of the .jar file needed for NetCDF files (
> http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22edu.ucar%22%20AND%20a%3A%22netcdf%22)
>  .
> Can anyone provide insight into how I get the updated .jar file into the 
> Maven Central Repository? Is there an alternative method to update Tika so I 
> can run my unit tests in Maven?





[jira] [Updated] (TIKA-1155) Number Format is converted with an error

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1155:


Component/s: parser

> Number Format is converted with an error
> 
>
> Key: TIKA-1155
> URL: https://issues.apache.org/jira/browse/TIKA-1155
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Evgeniy Buyanov
> Attachments: test-Excel.csv, test.xlsx, test.xml
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> {code:Title=Source data}
> 
>
> 
> 10
> -10
> {code}
> java -jar tika-app-1.4.jar test.xlsx > test.xml
> {code:Title=Result}
>   * 10 _F
>   -10 _F
> {code}
> related ASF Bugzilla – Bug 
> [52592|https://issues.apache.org/bugzilla/show_bug.cgi?id=52592]





[jira] [Updated] (TIKA-1207) Parent task for integration of Any23 into Tika

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1207:


Component/s: general

> Parent task for integration of Any23 into Tika
> --
>
> Key: TIKA-1207
> URL: https://issues.apache.org/jira/browse/TIKA-1207
> Project: Tika
>  Issue Type: Task
>  Components: general
> Environment: Any23 trunk
> Tika trunk
>Reporter: Lewis John McGibbney
> Attachments: the_initiative.txt
>
>
> This issue should act as the parent for all issues relating to the integration of 
> Any23 into Tika. A document should be maintained here which details the 
> execution plan for the migration of code. 





[jira] [Updated] (TIKA-1163) NPE thrown by TikaConfig.getDefaultConfig()

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1163:


Component/s: config

> NPE thrown by TikaConfig.getDefaultConfig() 
> 
>
> Key: TIKA-1163
> URL: https://issues.apache.org/jira/browse/TIKA-1163
> Project: Tika
>  Issue Type: Bug
>  Components: config
>Affects Versions: 1.4
> Environment: OS-X, JDK 1.7
>Reporter: Derrick Johnson
> Attachments: TCT.java
>
>
> The below exception gets thrown every time I execute 
> TikaConfig.getDefaultConfig(). Similarly, invoking `Tika t = new Tika()` throws 
> the same exception, since code inside this constructor invokes 
> TikaConfig.getDefaultConfig().
> This problem is non-existent when I use Tika-core and Tika-parsers 1.0. But 
> when I bump the version numbers to 1.2 (in order to get around a bug in 
> PDFBOX), the problem shows up. I'm using Maven. I've carefully ensured that 
> I'm not pulling in the wrong version of Tika, using things like `mvn 
> dependency:tree` and making sure there were no sneaky problematic transitive 
> dependencies. 
> java.lang.NullPointerException
>   at 
> org.apache.tika.mime.MimeTypesReader$ClauseRecord.stop(MimeTypesReader.java:245)
>   at 
> org.apache.tika.mime.MimeTypesReader.endElement(MimeTypesReader.java:203)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:606)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractXMLDocumentParser.emptyElement(AbstractXMLDocumentParser.java:183)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanStartElement(XMLDocumentFragmentScannerImpl.java:1303)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2717)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:607)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:489)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:835)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:568)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:302)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:115)
>   at 
> org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:64)
>   at 
> org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:93)
>   at 
> org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:149)
>   at 
> org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:479)
>   at 
> org.apache.tika.config.TikaConfig.getDefaultMimeTypes(TikaConfig.java:60)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:169)
>   at 
> org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:268)
> at 
> my.method.which.invokes `new Tika()`





[jira] [Updated] (TIKA-1315) Basic list support in WordExtractor

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1315:


Fix Version/s: (was: 1.6)
   1.7

> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Priority: Minor
> Fix For: 1.7
>
> Attachments: ListUtils.java, WordExtractor.java.patch, 
> WordParserTest.java.patch
>
>
> Hello guys, I am really sorry to post an issue like this, but I have no other 
> way of contacting you and I don't quite understand how you manage forks and 
> pull requests (I don't think you do that). Plus I don't know your coding 
> style and conventions.
> In my project I needed Tika to parse numbered lists from Word .doc 
> documents, but Tika doesn't support it. So I looked for a solution and found 
> one here: 
> http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
>  . So I adapted this solution to Apache Tika with a few fixes and improvements. 
> Anyway, feel free to use any of it so it can help people who struggle with 
> lists in Tika like I did.
> Attached files are:
> Updated test
> Fixed WordExtractor
> Added ListUtils





[jira] [Updated] (TIKA-1318) Use of Deprecated Word6Extractor.getParagraphText() Method

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1318:


Fix Version/s: (was: 1.6)
   1.7

> Use of Deprecated Word6Extractor.getParagraphText() Method
> --
>
> Key: TIKA-1318
> URL: https://issues.apache.org/jira/browse/TIKA-1318
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Tyler Palsulich
>Priority: Minor
>  Labels: deprecation
> Fix For: 1.7
>
>
> org.apache.tika.parser.microsoft.WordExtractor.parseWord6() uses the 
> deprecated Word6Extractor.getParagraphText() method. getParagraphText() is 
> supposed to return a String[] with an element for each paragraph in the text. 
> The replacement is getText(), which lets paragraph, cell, etc separation be 
> implementation specific. I'm not sure, at this point, how the POI 
> WordExtractor separates them.
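
A caller that still needs one element per paragraph could split the single string returned by getText(). The sketch below assumes paragraphs are separated by line breaks, which the report above explicitly leaves open; ParagraphSplitter is a hypothetical helper and not part of POI or Tika.

```java
// Hypothetical helper: turns one block of extracted text into an array
// with one element per line, ASSUMING line breaks separate paragraphs.
final class ParagraphSplitter {
    static String[] toParagraphs(String text) {
        if (text.isEmpty()) {
            return new String[0]; // "".split(...) would yield one empty element
        }
        // \R matches any line break (\n, \r\n, \r, ...); trailing empty
        // strings are dropped by String.split.
        return text.split("\\R");
    }
}
```

Consecutive line breaks produce empty elements in the middle of the array; whether those should be kept depends on how the extractor actually separates paragraphs and cells.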





[jira] [Updated] (TIKA-1253) SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible with [1.6, 1.7]

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1253:


Component/s: general

> SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible 
> with [1.6, 1.7]
> --
>
> Key: TIKA-1253
> URL: https://issues.apache.org/jira/browse/TIKA-1253
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Reporter: sudheshna iyer
>Priority: Blocker
>
> I am receiving the following error with Tika 1.4
> SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible 
> with [1.6, 1.7]
> pom.xml file entry:
>   <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-app</artifactId>
>   <version>1.4</version>
>   </dependency>
> I have to incorporate tika project with other projects which use 1.7 of 
> SLF4J. Since Tika is not compatible with 1.7, I am not able to run my Tika 
> service.
> Why is Tika using lower versions of SLF4J? What is the workaround? 





[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---

Fix Version/s: (was: 1.6)
   1.7

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.10
>Reporter: Michael McCandless
> Fix For: 1.7
>
> Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented out code to
> SafeContentHandler, to verify that the SAX events produced by the
> parser have valid (matched) tags.  Ie, each startElement("foo") is
> matched by the closing endElement("foo").
> I only did basic nesting test, plus checking that  is never
> embedded inside another ; we could strengthen this further to check
> that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> It could be these are relatively minor offenses (eg closing a table
> w/o closing the tr) and we need not do anything here... but I think
> it'd be cleaner if all our parsers produced matched, well-formed XHTML
> events.
> I haven't looked into any of these... it could be they are easy to fix.
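
The check described above can be sketched with a stack of open element names. This is an illustration only, not the SafeContentHandler code from TIKA-683: the class name TagBalanceChecker and the check() helper are hypothetical, and the error strings simply mirror the ones in the failures reported below.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical checker: tracks open elements on a stack and fails on an
// unmatched or mismatched end tag, or on elements left open at the end.
final class TagBalanceChecker extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty()) {
            throw new IllegalStateException("end tag=" + localName + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(localName)) {
            throw new IllegalStateException(
                "mismatched elements open=" + expected + " close=" + localName);
        }
    }

    @Override
    public void endDocument() {
        if (!open.isEmpty()) {
            throw new IllegalStateException("unclosed elements: " + open);
        }
    }

    /** Replays events ("+name" opens, "-name" closes); returns null if well-formed, else the error. */
    static String check(String... events) {
        TagBalanceChecker checker = new TagBalanceChecker();
        try {
            for (String event : events) {
                String name = event.substring(1);
                if (event.charAt(0) == '+') {
                    checker.startElement("", name, name, null);
                } else {
                    checker.endElement("", name, name);
                }
            }
            checker.endDocument();
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }
}
```

For example, check("+table", "+tr", "-table") reports the same kind of mismatch as the Keynote failure (open=tr close=table). Checking that tags only appear in valid parents would need a table of allowed parent/child pairs on top of this.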
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
> Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>   at 
> org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
> 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>   at 
> org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>   at 
> org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
>   at 
> org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time e

[jira] [Updated] (TIKA-1306) ClassCastException WARN [main] (COSDocument.java:303) - java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSName in o

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1306:


Fix Version/s: (was: 1.6)
   1.7

> ClassCastException  WARN [main] (COSDocument.java:303) - 
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast 
> to org.apache.pdfbox.cos.COSName in o.a.t.parser.pdf.PDFParserTest
> 
>
> Key: TIKA-1306
> URL: https://issues.apache.org/jira/browse/TIKA-1306
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.7
>
>
> The below is a stack trace highlighted by setting up the nightly builds.
> Annie Burgess and I were also able to confirm this Exception in a recent 
> fresh checkout and mvn clean install of Tika trunk 1.6-SNAPSHOT.
> We should address this as it _may_ be a problem with the main code.
> Running org.apache.tika.parser.pdf.PDFParserTest
> ERROR [main] (NonSequentialPDFParser.java:1887) - Can't find the object xref 
> at offset 116
> ERROR [main] (NonSequentialPDFParser.java:1887) - Can't find the object xref 
> at offset 26441
> ERROR [main] (NonSequentialPDFParser.java:1887) - Can't find the object xref 
> at offset 2314576
>  WARN [main] (COSDocument.java:303) - java.lang.ClassCastException: 
> org.apache.pdfbox.cos.COSString cannot be cast to 
> org.apache.pdfbox.cos.COSName
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast 
> to org.apache.pdfbox.cos.COSName
>   at 
> org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:295)
>   at 
> org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:657)
>   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1239)
>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:118)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.TikaTest.getText(TikaTest.java:125)
>   at org.apache.tika.TikaTest.getText(TikaTest.java:133)
>   at 
> org.apache.tika.parser.pdf.PDFParserTest.testSequentialParser(PDFParserTest.java:552)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:236)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:134)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:113)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithAr

[jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1276:


Fix Version/s: (was: 1.6)
   1.7

> Missing embedded dependencies in tika-bundle
> 
>
> Key: TIKA-1276
> URL: https://issues.apache.org/jira/browse/TIKA-1276
> Project: Tika
>  Issue Type: Bug
>  Components: packaging
>Affects Versions: 1.5
> Environment: OSGI, Apache Felix via Apache Sling Launcher
>Reporter: Rupert Westenthaler
> Fix For: 1.7
>
> Attachments: TIKA-1276_20140423_rwesten.diff, 
> TIKA-1276_20140428_2_rwesten.diff, TIKA-1276_20140428_3_rwesten.diff, 
> TIKA-1276_20140428_rwesten.diff
>
>
> While updating from Tika 1.2 to 1.5 I noticed that the 
> `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.
> 1. `com.uwyn:jhighlight:1.0` is not embedded
> Because of that installing the bundle results in the following exception
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
> [103.0] osgi.wiring.package; 
> (osgi.wiring.package=com.uwyn.jhighlight.renderer))
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
> [103.0] osgi.wiring.package; 
> (osgi.wiring.package=com.uwyn.jhighlight.renderer)
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> 2. `org.ow2.asm:asm:4.1` is not embedded because 
> `org.apache.tika:tika-core:1.5` uses `org.ow2.asm-debug-all:asm:4.1` and 
> therefore the `Embed-Dependency` directive `asm` does not match any 
> dependency. 
> Because of that one does get the following exception (after fixing (1))
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; 
> (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; 
> (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0)))
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> There are two possibilities to fix this: (a) change the `Embed-Dependency` to 
> `asm-debug-all`, or (b) add a dependency on `org.ow2.asm:asm:4.1` to the 
> tika-bundle pom file.
> 3. `edu.ucar:netcdf:4.2-min` is not embedded
> Because of that one does get the following exception (after fixing (1) and 
> (2))
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> 4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime
> After fixing the above issues the tika-bundle was started successfully. 
> However, when extracting EXIF metadata from a JPEG image I got the following 
> exception.
> {code}
> java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
>   at 
> com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
>   at 
> com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
>   at 
> org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
>  

[jira] [Updated] (TIKA-1218) Unable to parse a mp3 file on 1.5 getting a exception

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1218:


Component/s: parser

> Unable to parse a mp3 file on 1.5 getting a exception
> -
>
> Key: TIKA-1218
> URL: https://issues.apache.org/jira/browse/TIKA-1218
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
> Environment: Win 7, Java 1.7
>Reporter: Sumeet Gorab
>Priority: Blocker
> Attachments: Save-the-World-Knife-Party-Remix.mp3
>
>
> Unable to parse an mp3 file on 1.5; getting the following exception:
> Exception in thread "main" java.lang.NegativeArraySizeException
>   at 
> org.apache.tika.parser.mp3.ID3v2Frame$RawTag.<init>(ID3v2Frame.java:417)
>   at 
> org.apache.tika.parser.mp3.ID3v2Frame$RawTag.<init>(ID3v2Frame.java:382)
>   at 
> org.apache.tika.parser.mp3.ID3v2Frame$RawTagIterator.next(ID3v2Frame.java:371)
>   at 
> org.apache.tika.parser.mp3.ID3v24Handler.<init>(ID3v24Handler.java:49)
>   at 
> org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:174)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)





[jira] [Updated] (TIKA-1108) Represent individual slides in pptx

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1108:


Fix Version/s: (was: 1.6)
   1.7

> Represent individual slides in pptx
> ---
>
> Key: TIKA-1108
> URL: https://issues.apache.org/jira/browse/TIKA-1108
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Daniel Bonniot de Ruisselet
> Fix For: 1.7
>
>
> When parsing ppt, tika produces for each slide:
> 
> However for pptx these seem to be missing, all the text is directly under 
> .





[jira] [Updated] (TIKA-995) XHTMLContentHandler doesn't pass attributes of body element

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-995:
---

Fix Version/s: (was: 1.6)
   1.7

> XHTMLContentHandler doesn't pass attributes of body element
> ---
>
> Key: TIKA-995
> URL: https://issues.apache.org/jira/browse/TIKA-995
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.2
>Reporter: Markus Jelsma
> Fix For: 1.7
>
> Attachments: TIKA-995-1.3-1.patch, TIKA-995-unit.patch
>
>
> XHTMLContentHandler.startElement() uses lazyHead() for the body element 
> because it's defined in the AUTO Set. As a consequence, attributes of the 
> body element are not passed to downstream content handlers. 





[jira] [Updated] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1072:


Fix Version/s: (was: 1.6)
   1.7

> AIOOBE when handling embedded document in .doc file
> ---
>
> Key: TIKA-1072
> URL: https://issues.apache.org/jira/browse/TIKA-1072
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.7
>
> Attachments: 20-Force-on-a-current-S00.doc, Ole10NativeEntry.bin
>
>
> I have a Word (.doc) document that hits an exception when I run:
> {noformat}
> java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar 
> /x/tmp/20-Force-on-a-current-S00.doc 
> {noformat}
> Here's the exception:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
>   at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
>   at 
> org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
>   at 
> org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> {noformat}
> It happens when we try to parse an OLE10 embedded object ... the code
> that does this parsing captures and ignores Ole10NativeException and
> skips the entry ... so I'm wondering if we should also catch AIOOBE
> and skip the entry?  Ie, maybe this entry really is not OLE10, and the
> Ole10Native code is failing to throw Ole10NativeException for it?
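
The proposal above amounts to treating an out-of-bounds read the same way as the existing format exception. A minimal sketch of that pattern follows; EntryParser, FormatException and toyParse are hypothetical stand-ins for the Ole10Native parsing code and Ole10NativeException, not the real POI or Tika classes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: skips entries that either fail format validation or make
// the parser read past the end of a too-short buffer.
final class SkipBadEntries {
    static class FormatException extends Exception {
        FormatException(String message) {
            super(message);
        }
    }

    interface EntryParser {
        String parse(byte[] data) throws FormatException;
    }

    /** Parses every entry, silently skipping the ones that cannot be parsed. */
    static List<String> parseAll(List<byte[]> entries, EntryParser parser) {
        List<String> parsed = new ArrayList<>();
        for (byte[] entry : entries) {
            try {
                parsed.add(parser.parse(entry));
            } catch (FormatException | ArrayIndexOutOfBoundsException e) {
                // Either the parser rejected the entry, or the entry was too
                // short for the expected layout; in both cases skip it.
            }
        }
        return parsed;
    }

    /** Toy parser: reads a 2-byte little-endian length without bounds checks. */
    static String toyParse(byte[] data) throws FormatException {
        int length = (data[0] & 0xFF) | ((data[1] & 0xFF) << 8);
        if (length == 0) {
            throw new FormatException("empty entry");
        }
        return "entry of length " + length;
    }
}
```

Catching the unchecked exception at the call site hides the symptom; if the entry really is not OLE10, the cleaner long-term fix would be for the parser itself to validate lengths and throw its own format exception.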





[jira] [Updated] (TIKA-980) MicrodataContentHandler for Apache Tika

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-980:
---

Fix Version/s: (was: 1.6)
   1.7

> MicrodataContentHandler for Apache Tika
> ---
>
> Key: TIKA-980
> URL: https://issues.apache.org/jira/browse/TIKA-980
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Ken Krugler
> Fix For: 1.7
>
> Attachments: TIKA-980-1.3-1.patch, TIKA-980-1.3-2.patch, 
> TIKA-980-1.3-3.patch, TIKA-980-1.3-4.patch, TIKA-980-1.3-5.patch
>
>
> ContentHandler for Apache Tika capable of building a data structure 
> containing Microdata item scopes and item properties. The Item* classes are 
> borrowed from the Apache Any23 project and are slightly modified to 
> accommodate this SAX-based extractor vs the original DOM-based extractor.
> The provided unit test outputs two item scopes about the Europe and NA 
> ApacheCon events and each has a nested property.





[jira] [Updated] (TIKA-1273) old tika-server jar artifact contains no manifest so not able to invoke from shell

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1273:


Fix Version/s: (was: 1.6)
   1.7

> old tika-server jar artifact contains no manifest so not able to invoke from 
> shell
> --
>
> Key: TIKA-1273
> URL: https://issues.apache.org/jira/browse/TIKA-1273
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.7
>
>
> I've never ever used the old tika-server artifact which is generated when one 
> installs the server module. It needs to contain a manifest otherwise it 
> cannot be invoked from the shell.
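
Running a jar with `java -jar` relies on a Main-Class entry in its manifest. The sketch below uses the standard java.util.jar API to show what such a check looks like; ManifestCheck and the main class name passed to it are hypothetical and say nothing about what the tika-server build actually produces.

```java
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper: builds a manifest with a Main-Class entry and tests
// whether a manifest would allow "java -jar" to find an entry point.
final class ManifestCheck {
    static Manifest withMainClass(String mainClass) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.put(Attributes.Name.MAIN_CLASS, mainClass);
        return manifest;
    }

    /** A null manifest models a jar that has no manifest at all. */
    static boolean isRunnable(Manifest manifest) {
        return manifest != null
            && manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS) != null;
    }
}
```

For a jar on disk, the manifest to test can be obtained with java.util.jar.JarFile.getManifest(), which returns null when the jar has none.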





[jira] [Updated] (TIKA-776) ExifTool Embedder

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-776:
---

Fix Version/s: (was: 1.6)
   1.7

> ExifTool Embedder
> -
>
> Key: TIKA-776
> URL: https://issues.apache.org/jira/browse/TIKA-776
> Project: Tika
>  Issue Type: New Feature
>  Components: metadata
>Affects Versions: 1.0
> Environment: ExifTool is required 
> (http://www.sno.phy.queensu.ca/~phil/exiftool/)
>Reporter: Ray Gauss II
>  Labels: embed, exiftool, patch
> Fix For: 1.7
>
> Attachments: tika-parsers-exiftool-embed-patch.txt
>
>
> This patch adds an ExifTool ExternalEmbedder which builds upon the work in 
> issue TIKA-774 and TIKA-775.
> In the tika-parsers an ExiftoolExternalEmbedder is added which extends 
> ExternalEmbedder to programmatically create an Embedder which calls the 
> ExifTool command line to embed tika metadata into a file stream and an 
> ExiftoolExternalEmbedderTest unit test is added which embeds several IPTC and 
> XMP fields then parses the resulting file stream to verify the operation.





[jira] [Updated] (TIKA-1301) Establish TikaServer on Apache hosted VM

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1301:


Fix Version/s: (was: 1.6)
   1.7

> Establish TikaServer on Apache hosted VM
> 
>
> Key: TIKA-1301
> URL: https://issues.apache.org/jira/browse/TIKA-1301
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Reporter: Lewis John McGibbney
> Fix For: 1.7
>
>
> Over in Any23, Infra recently provisioned us with a nice shiny new VM to run 
> our service on
> http://any23.org
> I would like to do the same for Tika. I have some scripts on the Any23 VM 
> which will pull stable nightly tika-server snapshots and deploy them to the 
> VM. This is really nice for both devs and users alike.





[jira] [Updated] (TIKA-1208) Migrate Any23 mime contributions to Tika

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1208:


Fix Version/s: (was: 1.6)
   1.7

> Migrate Any23 mime contributions to Tika
> 
>
> Key: TIKA-1208
> URL: https://issues.apache.org/jira/browse/TIKA-1208
> Project: Tika
>  Issue Type: Sub-task
>  Components: mime
>Reporter: Lewis John McGibbney
> Fix For: 1.7
>
> Attachments: TIKA-1208.patch
>
>
> We begin with one of the most obvious areas in which there
> is overlap.
> In short, the appeal of this package is the addition of detection 
> for the following types:
>  - text/n3
>  - text/rdf+n3
>  - application/n3
>  - text/x-nquads
>  - text/rdf+nq
>  - text/nq
>  - application/nq
>  - text/turtle
>  - application/x-turtle
>  - application/turtle
>  - application/trix
>  
> Therefore, although both Tika and Any23 perform Mimetype-related
> tasks, there is a contribution to be made. This involves the transferral of
> code pertaining to pattern recognition, Mimetype XML definitions within 
> tika-mimetypes.xml and a Purifier implementation that removes any 
> blank characters at the head of a file that might 
> prevent its MIME type detection.
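
The purifier idea described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name `LeadingBlankSkipper` is hypothetical (it is neither the Any23 Purifier nor a Tika API), and only ASCII space, tab, CR and LF are treated as blank.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not the Any23 Purifier or a Tika class. */
final class LeadingBlankSkipper {

    /**
     * Returns a stream positioned at the first non-blank byte of the input.
     * Assumption: only ASCII space, tab, CR and LF count as blank.
     */
    static InputStream skipLeadingBlanks(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 1);
        int b;
        while ((b = pushback.read()) != -1) {
            if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                pushback.unread(b); // put the first meaningful byte back
                break;
            }
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "\n\n  @prefix ex: <http://example.org/> ."
                .getBytes(StandardCharsets.US_ASCII);
        InputStream clean = skipLeadingBlanks(new ByteArrayInputStream(raw));
        System.out.println((char) clean.read()); // prints "@"
    }
}
```

A detector that matches byte patterns at offset 0 would then see the first meaningful byte rather than the leading blanks.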





[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-819:
---

Fix Version/s: (was: 1.6)
   1.7

> Make Option to Exclude Embedded Files' Text for Text Content
> 
>
> Key: TIKA-819
> URL: https://issues.apache.org/jira/browse/TIKA-819
> Project: Tika
>  Issue Type: New Feature
>  Components: general
>Affects Versions: 1.0
> Environment: Windows-7 + JDK 1.6 u26
>Reporter: Albert L.
> Fix For: 1.7
>
>
> It would be nice to be able to disable text content from embedded files.
> For example, if I have a DOCX with an embedded PPTX, then I would like the 
> option to disable text from the PPTX from showing up when asking for the text 
> content from DOCX.  In other words, it would be nice to have the option to 
> get text content *only* from the DOCX instead of the DOCX+PPTX.





[jira] [Updated] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1059:


Fix Version/s: (was: 1.6)
   1.7

> Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
> --
>
> Key: TIKA-1059
> URL: https://issues.apache.org/jira/browse/TIKA-1059
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3
>Reporter: Ray Gauss II
> Fix For: 1.7
>
>
> The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch 
> {{InterruptedException}} and ignore it.
> The methods should either call {{interrupt()}} on the current thread or 
> re-throw the exception, possibly wrapped in a {{TikaException}}.
> See TIKA-775 for a previous discussion.
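
A minimal sketch of the handling suggested above, in plain Java. The names `ExternalToolException`, `BlockingCall` and `InterruptAware` are hypothetical stand-ins (the first plays the role of a checked exception such as TikaException); this is not the code in ExternalParser or ExternalEmbedder. The issue proposes either restoring the interrupt or re-throwing; the sketch does both.

```java
/** Hypothetical stand-in for a checked exception such as TikaException. */
class ExternalToolException extends Exception {
    ExternalToolException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical functional interface for a call that may block. */
interface BlockingCall<T> {
    T call() throws InterruptedException;
}

final class InterruptAware {

    /**
     * Runs a blocking call. On interruption, restores the thread's interrupt
     * flag and re-throws the exception wrapped, instead of swallowing it.
     */
    static <T> T run(BlockingCall<T> call) throws ExternalToolException {
        try {
            return call.call();
        } catch (InterruptedException e) {
            // Restore the flag so callers further up the stack can see it.
            Thread.currentThread().interrupt();
            throw new ExternalToolException("Interrupted while waiting for external tool", e);
        }
    }
}
```

With a `Process` in hand, usage might look like `InterruptAware.run(() -> process.waitFor())`.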





[jira] [Updated] (TIKA-1269) Self-hosted documentation for the JAX-RS Server

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1269:


Fix Version/s: (was: 1.6)
   1.7

> Self-hosted documentation for the JAX-RS Server
> ---
>
> Key: TIKA-1269
> URL: https://issues.apache.org/jira/browse/TIKA-1269
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.5
>Reporter: Nick Burch
> Fix For: 1.7
>
> Attachments: enable-enunciate.patch
>
>
> Currently, if you fire up the JAX-RS Tika Server and go to the root of the 
> server in a web browser, you get an empty page back. You have to know to head 
> over to https://wiki.apache.org/tika/TikaJAXRS to find out what the available 
> URLs are.
> We should self-host some simple documentation on the server at the root of 
> it, so that people can discover what it offers. Ideally, this should be 
> largely auto-generated based on the endpoints, so that we don't risk missing 
> things when we add new features.
> This will also allow us to potentially offer a sample running version of the 
> server for people to discover Tika with.





[jira] [Updated] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1308:


Fix Version/s: (was: 1.6)
   1.7

> Support in memory parse mode(don't create temp file): to support run Tika in 
> GAE
> 
>
> Key: TIKA-1308
> URL: https://issues.apache.org/jira/browse/TIKA-1308
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: yuanyun.cn
>  Labels: gae
> Fix For: 1.7
>
>
> I am trying to use Tika in GAE and wrote a simple servlet to extract 
> metadata from a JPEG:
> String urlStr = req.getParameter("imageUrl");
> byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
> ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
> Metadata metadata = new Metadata();
> BodyContentHandler ch = new BodyContentHandler();
> AutoDetectParser parser = new AutoDetectParser();
> parser.parse(bais, ch, metadata, new ParseContext());
> bais.close();
> This fails with the exception:
> Caused by: java.lang.SecurityException: Unable to create temporary file
>   at java.io.File.createTempFile(File.java:1986)
>   at 
> org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
>   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
>   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242
> I checked the code: 
> org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, 
> Metadata, ParseContext) creates a temp file from the input stream.
> I can understand why Tika creates a temp file from the stream: so that it can 
> parse the content multiple times.
> But as GAE and other cloud platforms are getting more popular, is it possible 
> to avoid creating a temp file? Instead, we could copy the original stream to a 
> byte array stream, so Tika can still parse it multiple times.
> -- This would put a limit on the file size, as Tika keeps the whole file in 
> memory, but it would make Tika work in GAE and maybe other cloud platforms.
> We could add a parameter to parser.parse to indicate whether to do an in-memory 
> parse only.
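
The proposal above (buffer the stream in memory so it can be read more than once, with a size cap) can be sketched in plain Java. The class `InMemoryBuffer` and its `maxBytes` parameter are hypothetical names chosen for illustration; this is not an existing Tika class or option.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration; not part of Tika. */
final class InMemoryBuffer {
    private final byte[] data;

    private InMemoryBuffer(byte[] data) {
        this.data = data;
    }

    /** Copies the whole stream into memory, failing once maxBytes is exceeded. */
    static InMemoryBuffer read(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (out.size() + n > maxBytes) {
                throw new IOException("Input exceeds in-memory limit of " + maxBytes + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return new InMemoryBuffer(out.toByteArray());
    }

    /** Each call returns an independent stream over the same bytes. */
    InputStream newStream() {
        return new ByteArrayInputStream(data);
    }

    int size() {
        return data.length;
    }
}
```

Each consumer (for example, one metadata extractor after another) would call `newStream()` and read from the start, with no temporary file involved. The cap makes the memory trade-off mentioned in the description explicit.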
>  





[jira] [Updated] (TIKA-985) Support for HTML5 elements

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-985:
---

Fix Version/s: (was: 1.6)
   1.7

> Support for HTML5 elements
> --
>
> Key: TIKA-985
> URL: https://issues.apache.org/jira/browse/TIKA-985
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.2
>Reporter: Markus Jelsma
> Fix For: 1.7
>
> Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch, 
> TIKA-985-1.3-3.patch, TIKA-985-1.5.patch
>
>
> TagSoup's schema.tssl does not include some HTML5 elements (e.g. article, 
> section). This prevents some custom ContentHandlers from reading expected 
> elements and/or attributes.





[jira] [Updated] (TIKA-1106) CLAVIN Integration

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1106:


Fix Version/s: (was: 1.6)
   1.7

> CLAVIN Integration
> --
>
> Key: TIKA-1106
> URL: https://issues.apache.org/jira/browse/TIKA-1106
> Project: Tika
>  Issue Type: Wish
>  Components: general
>Affects Versions: 1.3
> Environment: All
>Reporter: Adam Estrada
>Priority: Minor
>  Labels: entity, geospatial
> Fix For: 1.7
>
>
> I've been evaluating CLAVIN as a way to extract location information from 
> unstructured text. It seems like meshing it with Tika in some way would make 
> a lot of sense. From CLAVIN website...
> {quote}
> CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source 
> software package for document geotagging and geoparsing that employs 
> context-based geographic entity resolution. It combines a variety of open 
> source tools with natural language processing techniques to extract location 
> names from unstructured text documents and resolve them against gazetteer 
> records. Importantly, CLAVIN does not simply "look up" location names; 
> rather, it uses intelligent heuristics in an attempt to identify precisely 
> which "Springfield" (for example) was intended by the author, based on the 
> context of the document. CLAVIN also employs fuzzy search to handle 
> incorrectly-spelled location names, and it recognizes alternative names 
> (e.g., "Ivory Coast" and "Côte d'Ivoire") as referring to the same geographic 
> entity. By enriching text documents with structured geo data, CLAVIN enables 
> hierarchical geospatial search and advanced geospatial analytics on 
> unstructured data.
> {quote}
> There was only one other instance of the word "clavin" mentioned in the ASF 
> jira site so I thought it was definitely worth posting here.
> https://github.com/Berico-Technologies/CLAVIN





[jira] [Updated] (TIKA-987) Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-987:
---

Fix Version/s: (was: 1.6)
   1.7

> Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
> 
>
> Key: TIKA-987
> URL: https://issues.apache.org/jira/browse/TIKA-987
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.7
>
> Attachments: picture.doc, picture_3.doc
>
>
> I have two Word docs, both containing the same drawing, but one has
> text added.
> In one case (picture.doc) the extraction is correct: it contains only
> an embedded image.wmf; when I view the image it's correct.
> In the second case (picture_3.doc) the picture is extracted as image
> (no extension), and is 0 bytes, and there is an invalid character
> (mapped to unicode replacement char) inserted before the image:
> {noformat}
> 
> 
> �
> 
> 
> vehicle
> 
> {noformat}
> (Though, the text "vehicle" is extracted correctly).
> I dug a bit, and with the 2nd doc there is an embedded {SHAPE *
> MERGEFORMAT} field, which we invoke
> WordExtractor.handleSpecialCharacterRuns on, and somehow it extracts
> the 0-byte no-extension image as well as the invalid character.  With
> the first doc there is no field (at least not one that's handled with
> handleSpecialCharacterRuns...).  Otherwise I'm not sure how to
> fix... it could be something is going wrong in how POI parses the
> Pictures from PictureSource.





[jira] [Updated] (TIKA-1167) Embedded object not extracted

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1167:


Fix Version/s: (was: 1.6)
   1.7

> Embedded object not extracted
> -
>
> Key: TIKA-1167
> URL: https://issues.apache.org/jira/browse/TIKA-1167
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.4
>Reporter: Daniel Bonniot de Ruisselet
>Priority: Critical
> Fix For: 1.7
>
> Attachments: Doc w Structure that wont extract.docx
>
>
> For the attached docx, tika seems to detect the embedded object, as shown by 
> this tag:
> {{}}
> However, extraction itself (using -z on the command line, or using the API) 
> does not seem to work for this object:
> {{java -jar tika-app-1.4.jar -z Doc\ w\ Structure\ that\ wont\ extract.docx}}
> {{Extracting 'rId9_image1.wmf' (application/x-msmetafile) to 
> /tmp/tika/rId9_image1.wmf}}





[jira] [Updated] (TIKA-93) OCR support

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-93:
--

Fix Version/s: (was: 1.6)
   1.7

> OCR support
> ---
>
> Key: TIKA-93
> URL: https://issues.apache.org/jira/browse/TIKA-93
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.7
>
> Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
> TesseractOCR_Tyler.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.
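
A rough sketch of how a command-line OCR tool could be invoked from Java, as suggested above. It assumes a `tesseract` executable is on the PATH and that it is called as `tesseract <image> <outputbase>`, writing its text to `<outputbase>.txt`; the class name `TesseractInvoker` is hypothetical and this is not the code in any of the attached patches.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; assumes "tesseract" is on the PATH. */
final class TesseractInvoker {

    /** Builds the command line: tesseract <image> <outputbase>. */
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    /** Runs the tool and returns the recognised text. */
    static String ocr(Path image) throws IOException, InterruptedException {
        Path outputBase = Files.createTempFile("ocr", "");
        // Assumption: the tool appends ".txt" to the output base name.
        Path textFile = Paths.get(outputBase.toString() + ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase))
                    .inheritIO() // let the tool write diagnostics to our console
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }
}
```

Keeping the command construction in its own method makes it possible to check the arguments without having the tool installed.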





[jira] [Updated] (TIKA-1242) Update CXF version to 3.0.0-milestone2

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1242:


Fix Version/s: (was: 1.6)
   1.7

> Update CXF version to 3.0.0-milestone2
> --
>
> Key: TIKA-1242
> URL: https://issues.apache.org/jira/browse/TIKA-1242
> Project: Tika
>  Issue Type: Task
>  Components: server
>Reporter: Sergey Beryozkin
>Priority: Minor
> Fix For: 1.7
>
>
> The CXF 3.0.0 Milestone 2 JAX-RS front-end offers complete JAX-RS 2.0 support, 
> has fewer dependencies and is smaller than the CXF 2.7.x one. It is also 
> backward-compatible with applications written against JAX-RS 1.1.
> So it should be a safe upgrade.





[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-539:
---

Fix Version/s: (was: 1.6)
   1.7

> Encoding detection is too biased by encoding in meta tag
> 
>
> Key: TIKA-539
> URL: https://issues.apache.org/jira/browse/TIKA-539
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, parser
>Affects Versions: 0.8, 0.9, 0.10
>Reporter: Reinhard Schwab
>Assignee: Ken Krugler
> Fix For: 1.7
>
> Attachments: TIKA-539.patch, TIKA-539_2.patch
>
>
> If the encoding in the meta tag is wrong, that encoding is still detected,
> even if the right encoding was set in the metadata beforehand (which can come from 
> the HTTP response header).
> Test code to reproduce:
> static String content = "\n"
>   + " content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>   + "Über den Wolken\n";
>   /**
>* @param args
>* @throws IOException
>* @throws TikaException
>* @throws SAXException
>*/
>   public static void main(String[] args) throws IOException, SAXException,
>   TikaException {
>   Metadata metadata = new Metadata();
>   metadata.set(Metadata.CONTENT_TYPE, "text/html");
>   metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>   System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>   InputStream in = new 
> ByteArrayInputStream(content.getBytes("UTF-8"));
>   AutoDetectParser parser = new AutoDetectParser();
>   BodyContentHandler h = new BodyContentHandler(1);
>   parser.parse(in, h, metadata, new ParseContext());
>   System.out.print(h.toString());
>   System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>   }
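
To show why a wrong charset declaration matters, here is a small plain-Java sketch that is independent of Tika. It decodes the UTF-8 bytes of the sample text once as UTF-8 and once as ISO-8859-1, which is what happens when the meta tag wins over the correct hint. The class name `CharsetMismatchDemo` is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class; it does not use or reproduce Tika's detection logic. */
final class CharsetMismatchDemo {

    /** Decodes the given bytes with the given charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "\u00DC" is the letter U with diaeresis, written as an escape so the
        // source file encoding does not matter.
        byte[] utf8 = "\u00DCber den Wolken".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        // The two UTF-8 bytes 0xC3 0x9C become two separate characters.
        System.out.println(right.length()); // prints 15
        System.out.println(wrong.length()); // prints 16
    }
}
```

The first letter is encoded as two bytes in UTF-8, so reading those bytes as ISO-8859-1 yields two unrelated characters and the text is corrupted from the first character on.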





[jira] [Updated] (TIKA-1079) Word document hits AIOOBE in SummaryExtractor.parseSummaries

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1079:


Fix Version/s: (was: 1.6)
   1.7

> Word document hits AIOOBE in SummaryExtractor.parseSummaries
> 
>
> Key: TIKA-1079
> URL: https://issues.apache.org/jira/browse/TIKA-1079
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.7
>
> Attachments: guide_to_daips_(id_3152_ver_1.0.0).doc
>
>
> I'm not yet sure if this is a corrupted document (though MS Word opens it 
> just fine) or a bug in POI ... but I hit this exception when running it through 
> TikaCLI:
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: -1
>   at org.apache.poi.hpsf.CodePageString.<init>(CodePageString.java:161)
>   at 
> org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:158)
>   at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:163)
>   at org.apache.poi.hpsf.Property.<init>(Property.java:164)
>   at org.apache.poi.hpsf.Section.<init>(Section.java:277)
>   at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:451)
>   at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:246)
>   at 
> org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:78)
>   at 
> org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:69)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
> {noformat}





[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-891:
---

Fix Version/s: (was: 1.6)
   1.7

> Use POST in addition to PUT on method calls in tika-server
> --
>
> Key: TIKA-891
> URL: https://issues.apache.org/jira/browse/TIKA-891
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.7
>
>
> Per Jukka's email:
> http://s.apache.org/uR
> It would be a better use of REST/HTTP "verbs" to use POST to put content to a 
> resource where we don't intend to store that content (which is the 
> implication of PUT). Max suggested adding:
> {code}
> @POST
> {code}
> annotations to the methods we are currently exposing using PUT to take care 
> of this. 





[jira] [Updated] (TIKA-1220) Parser implementation for IFC files

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1220:


Fix Version/s: (was: 1.6)
   1.7

> Parser implementation for IFC files
> 
>
> Key: TIKA-1220
> URL: https://issues.apache.org/jira/browse/TIKA-1220
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.7
>
> Attachments: 2012-03-23-Duplex-Programming.ifc
>
>
> The Industry Foundation Classes (IFC) [0] data model is intended to describe 
> building and construction industry data. For the sake of argument, it can be 
> considered as a more intelligent successor to the .dwg data models used 
> within CAD models.
> I've tracked down a potential 3rd party library [1] which we may be able to 
> wrap and use within Tika; however, the provided software packages are licensed 
> under: http://creativecommons.org/licenses/by-nc-sa/3.0/de/ so I am currently 
> over on legal-discuss@ in an attempt to see if it is possible to wrap some 
> code and contribute it to tika-parsers.
> When I get feedback from legal-discuss, and if this is a go-ahead, I'll need 
> to help the developers package the code as a Maven artifact(s), then I will 
> progress with writing the implementation.  
> [0] http://en.wikipedia.org/wiki/Industry_Foundation_Classes
> [1] http://www.ifctoolsproject.com/





[jira] [Updated] (TIKA-1238) Update OutlookExtractor to handle codepage identification more rigorously

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1238:


Fix Version/s: (was: 1.6)
   1.7

> Update OutlookExtractor to handle codepage identification more rigorously
> -
>
> Key: TIKA-1238
> URL: https://issues.apache.org/jira/browse/TIKA-1238
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.7
>
>
> Since OutlookExtractor's codepage detection chunk was written, POI's HSMF has 
> added more robust capabilities for identifying codepages in Outlook .msg 
> files.  As a first step to integrating those improvements, I'll copy and 
> paste some of POI's code into OutlookExtractor.  As a second step, I'll 
> expose more of HSMF's capabilities within POI and then factor out the 
> duplicate code in Tika.





[jira] [Updated] (TIKA-774) ExifTool Parser

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-774:
---

Fix Version/s: (was: 1.6)
   1.7

> ExifTool Parser
> ---
>
> Key: TIKA-774
> URL: https://issues.apache.org/jira/browse/TIKA-774
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.0
> Environment: Requires ExifTool to be installed 
> (http://www.sno.phy.queensu.ca/~phil/exiftool/)
>Reporter: Ray Gauss II
>  Labels: features, newbie, patch,
> Fix For: 1.7
>
> Attachments: testJPEG_IPTC_EXT.jpg, 
> tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt
>
>
> Adds an external parser that calls ExifTool to extract extended metadata 
> fields from images and other content types.
> In the core project:
> An ExifTool interface is added which contains Property objects that define 
> the metadata fields available.
> An additional Property constructor for internalTextBag type.
> In the parsers project:
> An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
> on the command line and mapping the response to tika metadata fields.  This 
> extractor could be called instead of or in addition to the existing 
> ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
> JpegParser but those have not been changed at this time.
> An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
> An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
> metadata fields to existing tika and Drew Noakes metadata fields if enabled.
> An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
> implementations in XML files.
> An ExifToolParserTest is added which tests several expected XMP and IPTC 
> metadata values in testJPEG_IPTC_EXT.jpg.
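
The step of mapping the command-line response to metadata fields can be sketched as follows. This assumes the tool prints one `Name : value` pair per line, which is how ExifTool's default text output is commonly seen; the class `ExifToolOutputParser` is a hypothetical name and this is not the ExiftoolMetadataExtractor from the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for illustration; assumes "Name : value" lines. */
final class ExifToolOutputParser {

    /** Splits each line at the first colon and trims both sides. */
    static Map<String, String> parse(String output) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : output.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // skip blank or malformed lines
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            fields.put(name, value);
        }
        return fields;
    }
}
```

Splitting at the first colon only keeps values that themselves contain colons, such as timestamps, intact. A real mapper would then translate these names to Tika property names, and multi-valued fields would need a list-valued map instead.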





[jira] [Updated] (TIKA-605) Tika GDAL parser

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-605:
---

Fix Version/s: (was: 1.6)
   1.7

> Tika GDAL parser
> 
>
> Key: TIKA-605
> URL: https://issues.apache.org/jira/browse/TIKA-605
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
> Environment: indep. of env.
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>  Labels: gdal, gsoc2013, integration, mentor, tika
> Fix For: 1.7
>
> Attachments: 0001-TIKA-605-Tika-GDAL-parser.patch, 
> TIKA-605.Mattmann.092511.patch.txt
>
>
> Leverage the GDAL toolkit and its Java SWIG bindings to create a Tika parser 
> around GDAL. See here: 
> http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/java/apps/gdalinfo.java





[jira] [Updated] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1300:


Fix Version/s: (was: 1.6)
   1.7

> Switch default PDFBox parser to NonSequentialParser
> ---
>
> Key: TIKA-1300
> URL: https://issues.apache.org/jira/browse/TIKA-1300
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.7
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the 
> NonSequentialParser. We added a parameter to use the NonSequentialParser in 
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?





[jira] [Updated] (TIKA-988) We don't extract a placeholder for a Word document embedded in an Excel document

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-988:
---

Fix Version/s: (was: 1.6)
   1.7

> We don't extract a placeholder for a Word document embedded in an Excel 
> document
> 
>
> Key: TIKA-988
> URL: https://issues.apache.org/jira/browse/TIKA-988
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.7
>
> Attachments: bug31373.xls
>
>
> In TIKA-956 we fixed the Word parser so that at the point where an embedded 
> document appears, we output a  tag.
> It would be nice to do this for documents embedded in Excel too.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (TIKA-1295) Make some Dublin Core items multi-valued

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1295:


Fix Version/s: (was: 1.6)
   1.7

> Make some Dublin Core items multi-valued
> 
>
> Key: TIKA-1295
> URL: https://issues.apache.org/jira/browse/TIKA-1295
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.7
>
>
> According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, 
> dc:title, dc:description and dc:rights should allow multiple values because 
> of language alternatives.  Unless anyone objects in the next few days, I'll 
> switch those to Property.toInternalTextBag() from Property.toInternalText().  
> I'll also modify PDFParser to extract dc:rights.





[jira] [Updated] (TIKA-1307) Jenkins Java7 job requires a profile in order to build 'tika-java7' module.

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1307:


Fix Version/s: (was: 1.6)
   1.7

> Jenkins Java7 job requires a profile in order to build 'tika-java7' module.
> ---
>
> Key: TIKA-1307
> URL: https://issues.apache.org/jira/browse/TIKA-1307
> Project: Tika
>  Issue Type: Bug
>  Components: packaging
>Affects Versions: 1.5
>Reporter: Lewis John McGibbney
> Fix For: 1.7
>
>
> N.B. Can someone please create a *build* tag in the Admin area, then assign it to 
> this issue?
> This issue was flagged up by Hong-Thai during the DISCUSS nightly builds 
> thread recently:
> http://www.mail-archive.com/dev%40tika.apache.org/msg07963.html





[jira] [Comment Edited] (TIKA-1319) Translation

2014-06-05 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019380#comment-14019380
 ] 

Chris A. Mattmann edited comment on TIKA-1319 at 6/5/14 10:54 PM:
--

- I applied this patch in r1600787 with some slight modifications from 
https://reviews.apache.org/r/22219/ to simply add in an ALv2 header to the 
DefaultTranslator. Great work [~tpalsulich] and thanks to everyone's input (Tim 
Allison, Paul Ramirez, Lewis John McGibbney, Nick) et al!


was (Author: chrismattmann):
- I applied this patch in r1600787 with some slight modifications from 
https://reviews.apache.org/r/22219/ to simply add in an ALv2 header to the 
DefaultTranslator. Great work [~tpalsulich] and thanks to everyone's input 
(Nick, Paul Ramirez, Lewis John McGibbney] et al!

> Translation
> ---
>
> Key: TIKA-1319
> URL: https://issues.apache.org/jira/browse/TIKA-1319
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.6
>
>
> I just opened up a review on reviews.apache.org -- 
> https://reviews.apache.org/r/22219/. I copied the description below. 
> This patch adds basic language translation functionality to Tika. Translation 
> is provided by a Microsoft API, but accessed through Apache 2 licensed 
> com.memetix.microsoft-translator-java-api 
> (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants 
> to use the translation feature, they have to add a client id and client 
> secret to the 
> tika-core/src/main/resources/org/apache/tika/language/translator.properties 
> file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
> com.memetix as a dependency in tika-core. I put the Translator class in 
> org.apache.tika.language. There is no integration with the server or CLI, 
> yet. Further, only Strings are translated right now -- if you pass in a full 
> document with xml tags, the structure will be mangled. But, I think that 
> would be a cool feature -- translate the body, title, subtitle, etc, but not 
> the structural elements. 
> There is still more work to do, but I wanted some more eyes on this to make 
> sure I'm heading in the right direction and this is a desired feature. Let me 
> know what you think!
> There are two simple unit tests for now which translate "hello" to French 
> ("salut"). One for inputting the source and target languages, one for 
> inputting just the target language (and detecting the source language 
> automatically).





[jira] [Resolved] (TIKA-1319) Translation

2014-06-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1319.
-

   Resolution: Fixed
Fix Version/s: 1.6

- I applied this patch in r1600787 with some slight modifications from 
https://reviews.apache.org/r/22219/ to simply add in an ALv2 header to the 
DefaultTranslator. Great work [~tpalsulich] and thanks to everyone's input 
(Nick, Paul Ramirez, Lewis John McGibbney) et al!

> Translation
> ---
>
> Key: TIKA-1319
> URL: https://issues.apache.org/jira/browse/TIKA-1319
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.6
>
>
> I just opened up a review on reviews.apache.org -- 
> https://reviews.apache.org/r/22219/. I copied the description below. 
> This patch adds basic language translation functionality to Tika. Translation 
> is provided by a Microsoft API, but accessed through Apache 2 licensed 
> com.memetix.microsoft-translator-java-api 
> (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants 
> to use the translation feature, they have to add a client id and client 
> secret to the 
> tika-core/src/main/resources/org/apache/tika/language/translator.properties 
> file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
> com.memetix as a dependency in tika-core. I put the Translator class in 
> org.apache.tika.language. There is no integration with the server or CLI, 
> yet. Further, only Strings are translated right now -- if you pass in a full 
> document with xml tags, the structure will be mangled. But, I think that 
> would be a cool feature -- translate the body, title, subtitle, etc, but not 
> the structural elements. 
> There is still more work to do, but I wanted some more eyes on this to make 
> sure I'm heading in the right direction and this is a desired feature. Let me 
> know what you think!
> There are two simple unit tests for now which translate "hello" to French 
> ("salut"). One for inputting the source and target languages, one for 
> inputting just the target language (and detecting the source language 
> automatically).





Re: Review Request 22246: New parser for Matlab .mat files

2014-06-05 Thread Matthias Krueger


> On June 4, 2014, 11:25 p.m., Matthias Krueger wrote:
> > The Matlab MIME types used seem to be application/x-matlab-data or 
> > application/matlab-mat.
> > 
> > Would it make sense to add them to the mime XML for detection?
> > 
> > 
> >   MATLAB data file
> >   
> >   
> > 
> >   
> >   
> > 
> > 
> >
> 
> Chris Mattmann wrote:
> +1 this makes a ton of sense to add IMO.
> 
> Nick Burch wrote:
> There's some odd whitespace going on - we normally use 4 spaces and no 
> tabs.
> 
> When outputting the variables, it would probably make sense to put each 
> one into either a paragraph or a list, so that we get helpful output in html 
> mode as well as text mode
> 
> With that in place, it would then be possible to have a unit test that 
> checked the html output, as well as the current text one
> 
> Also on testing, I think at least some of the tests have an 
> implementation of assertContains, which generally gives a more helpful 
> failure message than assertTrue(s.contains(...)) does, might be worth looking 
> into that?
> 
> Ann Burgess wrote:
> Great input - thank you! I will integrate both and upload the diff.

This is on the right track; some quick additional comments:
* I tested with the files in 
https://github.com/scipy/scipy/tree/master/scipy/io/matlab/tests/data. JMatIO 
only supports MATLAB 5 files. This could be added as a quick comment or javadoc.
* I think Tika is based on JDK 1.6. I don't see a reason for the test to 
special-case JDK 1.5 and simply return as passing there.
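
The MATLAB 5 limitation could also be checked up front. Below is a minimal sketch, assuming that Level 5 MAT-files start with the descriptive header text "MATLAB 5.0 MAT-file"; the class and method names (MatHeader, looksLikeLevel5) are hypothetical and not part of the patch or of JMatIO.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of the MatParser patch. */
final class MatHeader {
    // Assumption: Level 5 MAT-files begin with this ASCII text.
    static final String LEVEL5_PREFIX = "MATLAB 5.0 MAT-file";

    /** Returns true if the given leading bytes start with the Level 5 header text. */
    static boolean looksLikeLevel5(byte[] header) {
        byte[] prefix = LEVEL5_PREFIX.getBytes(StandardCharsets.US_ASCII);
        if (header == null || header.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (header[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A parser could call such a check on the first bytes of the stream and report a clear "unsupported MAT-file version" message instead of failing deep inside the library.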


- Matthias


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22246/#review44773
---


On June 4, 2014, 10:23 p.m., Ann Burgess wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22246/
> ---
> 
> (Updated June 4, 2014, 10:23 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> ---
> 
> This is a new parser for Matlab .mat files. The parser uses JMatIO, a Java 
> API for Matlab's MAT-file I/O. JMatIO is available through Maven Central. 
> The text output from this parser provides variable names and dimensions that 
> are both inside and outside of data structures, but does NOT provide the 
> actual data values within each .mat file. 
> 
> 
> Diffs
> -
> 
> 
> Diff: https://reviews.apache.org/r/22246/diff/
> 
> 
> Testing
> ---
> 
> Successfully ran a basic unit test that checks both --text and --metadata 
> parser output.  
> 
> 
> File Attachments
> 
> 
> Parser File
>   
> https://reviews.apache.org/media/uploaded/files/2014/06/04/cb39636d-ec53-4fbc-b348-6a4db8907f6b__MatParser.java
> Unit Test
>   
> https://reviews.apache.org/media/uploaded/files/2014/06/04/bbff8c6b-caa1-4830-b441-532c28c3c78e__MatParserTest.java
> 
> 
> Thanks,
> 
> Ann Burgess
> 
>



[jira] [Commented] (TIKA-1319) Translation

2014-06-05 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019374#comment-14019374
 ] 

Chris A. Mattmann commented on TIKA-1319:
-

OK, integrated the patch from Review Board into trunk, built, and tested:

{noformat}
[INFO] 
[INFO] --- maven-site-plugin:3.0:attach-descriptor (attach-descriptor) @ tika ---
[INFO] 
[INFO] --- maven-install-plugin:2.3.1:install (default-install) @ tika ---
[INFO] Installing /Users/mattmann/tmp/tika/pom.xml to /Users/mattmann/.m2/repository/org/apache/tika/tika/1.6-SNAPSHOT/tika-1.6-SNAPSHOT.pom
[INFO] Installing /Users/mattmann/tmp/tika/target/tika-1.6-SNAPSHOT-site.xml to /Users/mattmann/.m2/repository/org/apache/tika/tika/1.6-SNAPSHOT/tika-1.6-SNAPSHOT-site.xml
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent  SUCCESS [4.679s]
[INFO] Apache Tika core .. SUCCESS [42.186s]
[INFO] Apache Tika parsers ... SUCCESS [2:55.816s]
[INFO] Apache Tika XMP ... SUCCESS [10.876s]
[INFO] Apache Tika serialization . SUCCESS [8.565s]
[INFO] Apache Tika application ... SUCCESS [45.837s]
[INFO] Apache Tika OSGi bundle ... SUCCESS [50.357s]
[INFO] Apache Tika server  SUCCESS [51.133s]
[INFO] Apache Tika translate . SUCCESS [6.250s]
[INFO] Apache Tika Java-7 Components . SUCCESS [7.767s]
[INFO] Apache Tika ... SUCCESS [0.055s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 6:45.795s
[INFO] Finished at: Thu Jun 05 13:56:12 PDT 2014
[INFO] Final Memory: 76M/247M
[INFO] 
{noformat}

All tests passed, going forward to commit now.


> Translation
> ---
>
> Key: TIKA-1319
> URL: https://issues.apache.org/jira/browse/TIKA-1319
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Minor
>
> I just opened up a review on reviews.apache.org -- 
> https://reviews.apache.org/r/22219/. I copied the description below. 
> This patch adds basic language translation functionality to Tika. Translation 
> is provided by a Microsoft API, but accessed through Apache 2 licensed 
> com.memetix.microsoft-translator-java-api 
> (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants 
> to use the translation feature, they have to add a client id and client 
> secret to the 
> tika-core/src/main/resources/org/apache/tika/language/translator.properties 
> file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
> com.memetix as a dependency in tika-core. I put the Translator class in 
> org.apache.tika.language. There is no integration with the server or CLI, 
> yet. Further, only Strings are translated right now -- if you pass in a full 
> document with xml tags, the structure will be mangled. But, I think that 
> would be a cool feature -- translate the body, title, subtitle, etc, but not 
> the structural elements. 
> There is still more work to do, but I wanted some more eyes on this to make 
> sure I'm heading in the right direction and this is a desired feature. Let me 
> know what you think!
> There are two simple unit tests for now which translate "hello" to French 
> ("salut"). One for inputting the source and target languages, one for 
> inputting just the target language (and detecting the source language 
> automatically).
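
The structure-preserving idea in the description above (translate the text, leave the markup alone) could be sketched roughly as follows. This is only an illustration: TextOnlyTranslation and translateTextOnly are hypothetical names, the translator is passed in as a plain function rather than the Translator interface from the patch, and a regex is used as a stand-in for a real XML/SAX-based approach (comments, CDATA and attributes are not handled).

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: apply a translator to text segments only, copying tags verbatim. */
final class TextOnlyTranslation {
    // Naive tag matcher; a real implementation would use an XML parser instead.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static String translateTextOnly(String markup, UnaryOperator<String> translator) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(markup);
        int last = 0;
        while (m.find()) {
            appendTranslated(out, markup.substring(last, m.start()), translator);
            out.append(m.group()); // structural element: copied unchanged
            last = m.end();
        }
        appendTranslated(out, markup.substring(last), translator);
        return out.toString();
    }

    private static void appendTranslated(StringBuilder out, String text,
                                         UnaryOperator<String> translator) {
        // Whitespace-only segments are passed through without calling the translator.
        out.append(text.trim().isEmpty() ? text : translator.apply(text));
    }
}
```

Passing the translator as a function keeps the sketch independent of any particular backend; any String-to-String translation call could be plugged in.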





[jira] [Commented] (TIKA-1182) Out of memory exception when parsing TTF file

2014-06-05 Thread Matthias Krueger (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019351#comment-14019351
 ] 

Matthias Krueger commented on TIKA-1182:


I retested with the current Tika trunk and can confirm that this file is now 
handled in FontBox properly (throwing a java.io.EOFException). We should revert 
https://github.com/apache/tika/commit/bbd065b7070651d939a84e043b4f6f22f80269d9 
to remove the temporary workaround; then we can resolve this ticket.

The revert is trivial and would be good to have, as the workaround has its own issues:

{code}
Font.createFont(Font.TRUETYPE_FONT, tis.getFile());
{code}
Not 100% sure but I think the JDK's FontManager will permanently keep a file 
handle open for any font created that way.

{code}
tis.mark(0);
Font.createFont(Font.TRUETYPE_FONT, stream);
tis.reset();
{code}
This never really worked, as a mark(0) with subsequent reads from the stream 
will cause reset to throw an IOException.
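
The mark(0) problem can be reproduced outside Tika. This is a minimal sketch, assuming a plain java.io.BufferedInputStream with a deliberately tiny buffer instead of the TikaInputStream used in the workaround; MarkResetDemo and resetSucceeds are illustrative names only.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class MarkResetDemo {
    /** Marks with the given read limit, reads bytesToRead bytes, then attempts reset(). */
    static boolean resetSucceeds(int readLimit, int bytesToRead) {
        // 4-byte buffer so the mark is invalidated after only a few reads.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(new byte[16]), 4);
        in.mark(readLimit);
        try {
            for (int i = 0; i < bytesToRead; i++) {
                in.read();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            in.reset();
            return true;
        } catch (IOException e) {
            return false; // mark was invalidated by reading past the read limit
        }
    }

    public static void main(String[] args) {
        System.out.println("mark(0):  reset ok = " + resetSucceeds(0, 5));
        System.out.println("mark(16): reset ok = " + resetSucceeds(16, 5));
    }
}
```

With a read limit of 0 the buffered stream is free to drop the mark as soon as it refills its buffer, so the reset fails; a read limit that covers the bytes consumed keeps the mark valid.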


> Out of memory exception when parsing TTF file
> -
>
> Key: TIKA-1182
> URL: https://issues.apache.org/jira/browse/TIKA-1182
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.4
> Environment: Ubuntu
> java version "1.7.0_40"
> Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
> Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)
>Reporter: Erik Hetzner
> Attachments: 16A4FF_8.ttf, TIKA-1182-fix1.patch, TIKA_1182.java
>
>
>When parsing attached file using tika-app-1.4.jar, CPU usage is high and 
> it never seems to finish.
> When parsing using attached java code, I get an out of memory exception.
> Let me know what other information I can provide.
> Thank you!





Re: Review Request 22219: Add Translation to Tika

2014-06-05 Thread Chris Mattmann


> On June 5, 2014, 8:35 p.m., Matthias Krueger wrote:
> > I'm new to Tika and late to this feature so I don't have any background on 
> > the targeted use case. Are Tika users looking to machine-translate 
> > metadata, parsed content or both? Do they really benefit by having this in 
> > Tika and not use their existing libraries (they need to squeeze them into a 
> > Translator interface impl anyway)? Also, org.apache.tika.Tika handles 
> > detection and parsing of files/binary streams, it seems a bit awkward to 
> > now also have a method to translate Strings from one language to another.
> > 
> > The implementation looks ok. Would it make sense to have more fine grained 
> > exception handling for translations? The generic Exception throwing in the 
> > Translator interface and the catch(Exception) and rethrow 
> > IllegalStateException in Tika#translate seems a bit weird.

Thanks for the comments. One of the goals in Tika has long been to support 
Language "detection". It's useful of course for text extraction and metadata 
extraction, but also has a number of other impactful use cases. As does 
language "translation", which is always something I personally have wanted to 
support since it has equal possible benefits (imagine adding metadata in 
parsers in multiple languages by turning on a "switch" in the ParseContext 
object, etc.) This adds basic support for doing those things, to the core of 
Tika, and an example language translator. Doesn't have to be the only one and 
ideally I'd like to have a default language translator that doesn't necessary 
rely on an external service, but this is a great start. Now we can start to 
build out functionality on top of this work and I think it will really benefit 
the project.

That said, if you have some concrete ideas for benefitting the code and its 
exception handling, please open up an issue and throw up a patch on 
ReviewBoard, etc., and I'd be happy to comment and review it. This patch is 
about to be committed and has met the bar IMO to enter into the sources.

Cheers.


- Chris


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/#review44846
---


On June 5, 2014, 4:19 p.m., Tyler Palsulich wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22219/
> ---
> 
> (Updated June 5, 2014, 4:19 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> ---
> 
> This patch adds basic language translation functionality to Tika. Translation 
> is provided by a Microsoft API, but accessed through Apache 2 licensed 
> com.memetix.microsoft-translator-java-api 
> (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants 
> to use the translation feature, they have to add a client id and client 
> secret to the 
> tika-core/src/main/resources/org/apache/tika/language/translator.properties 
> file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
> com.memetix as a dependency in tika-core. I put the Translator class in 
> org.apache.tika.language. There is no integration with the server or CLI, 
> yet. Further, only Strings are translated right now -- if you pass in a full 
> document with xml tags, the structure will be mangled. But, I think that 
> would be a cool feature -- translate the body, title, subtitle, etc, but not 
> the structural elements. 
> 
> There is still more work to do, but I wanted some more eyes on this to make 
> sure I'm heading in the right direction and this is a desired feature. Let me 
> know what you think!
> 
> 
> Diffs
> -
> 
>   trunk/pom.xml 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java 
> 1600565 
>   
> trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java
>  PRE-CREATION 
>   
> trunk/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java
>  PRE-CREATION 
>   trunk/tika-translate/pom.xml PRE-CREATION 
>   
> trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java
>  PRE-CREATION 
>   
> trunk/tika-translate/src/main/resources/META-INF/services/org.apache.tika.language.translate.Translator
>  PRE-CREATION 
>   
> trunk/tika-translate/src/main/resources/org/apache/tika/language/translator.microsoft.properties
>  PRE-CREATION 
>   
> trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java
>  PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22219/diff/
> 
> 
> Testing
> ---
> 
> There are two simple unit tests for now which translate "hello" to French 
> ("salut"). One for 

Re: Review Request 22219: Add Translation to Tika

2014-06-05 Thread Matthias Krueger

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/#review44846
---


I'm new to Tika and late to this feature, so I don't have any background on the 
targeted use case. Are Tika users looking to machine-translate metadata, parsed 
content, or both? Do they really benefit from having this in Tika rather than 
using their existing libraries (they need to squeeze them into a Translator 
interface impl anyway)? Also, org.apache.tika.Tika handles detection and parsing 
of files/binary streams; it seems a bit awkward to now also have a method to 
translate Strings from one language to another.

The implementation looks ok. Would it make sense to have more fine-grained 
exception handling for translations? The generic Exception throwing in the 
Translator interface and the catch(Exception) and rethrow of IllegalStateException 
in Tika#translate seem a bit weird.
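
To make the exception-handling suggestion concrete, here is a rough sketch of what a narrower contract might look like. Everything in it is hypothetical: TranslationException, SketchTranslator and FailingBackend are illustrative names, and the interface is a simplified stand-in rather than the Translator interface from the patch.

```java
import java.io.IOException;

/** Hypothetical checked exception dedicated to translation failures. */
class TranslationException extends Exception {
    TranslationException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Simplified stand-in interface that declares one specific exception type. */
interface SketchTranslator {
    String translate(String text, String sourceLanguage, String targetLanguage)
            throws TranslationException;
}

/** Example backend that wraps a low-level failure while keeping the original cause. */
final class FailingBackend implements SketchTranslator {
    @Override
    public String translate(String text, String sourceLanguage, String targetLanguage)
            throws TranslationException {
        try {
            // Stand-in for a remote call that fails.
            throw new IOException("service unreachable");
        } catch (IOException e) {
            throw new TranslationException(
                    "Translation " + sourceLanguage + " -> " + targetLanguage + " failed", e);
        }
    }
}
```

Callers can then catch one translation-specific type and still inspect the underlying cause, instead of catching Exception and rethrowing an unchecked IllegalStateException.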

- Matthias Krueger


On June 5, 2014, 4:19 p.m., Tyler Palsulich wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22219/
> ---
> 
> (Updated June 5, 2014, 4:19 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> ---
> 
> This patch adds basic language translation functionality to Tika. Translation 
> is provided by a Microsoft API, but accessed through Apache 2 licensed 
> com.memetix.microsoft-translator-java-api 
> (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants 
> to use the translation feature, they have to add a client id and client 
> secret to the 
> tika-core/src/main/resources/org/apache/tika/language/translator.properties 
> file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
> com.memetix as a dependency in tika-core. I put the Translator class in 
> org.apache.tika.language. There is no integration with the server or CLI, 
> yet. Further, only Strings are translated right now -- if you pass in a full 
> document with xml tags, the structure will be mangled. But, I think that 
> would be a cool feature -- translate the body, title, subtitle, etc, but not 
> the structural elements. 
> 
> There is still more work to do, but I wanted some more eyes on this to make 
> sure I'm heading in the right direction and this is a desired feature. Let me 
> know what you think!
> 
> 
> Diffs
> -
> 
>   trunk/pom.xml 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java 
> 1600565 
>   
> trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java
>  PRE-CREATION 
>   
> trunk/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java
>  PRE-CREATION 
>   trunk/tika-translate/pom.xml PRE-CREATION 
>   
> trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java
>  PRE-CREATION 
>   
> trunk/tika-translate/src/main/resources/META-INF/services/org.apache.tika.language.translate.Translator
>  PRE-CREATION 
>   
> trunk/tika-translate/src/main/resources/org/apache/tika/language/translator.microsoft.properties
>  PRE-CREATION 
>   
> trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java
>  PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22219/diff/
> 
> 
> Testing
> ---
> 
> There are two simple unit tests for now which translate "hello" to French 
> ("salut"). One for inputting the source and target languages, one for 
> inputting just the target language (and detecting the source language 
> automatically).
> 
> 
> Thanks,
> 
> Tyler Palsulich
> 
>



[jira] [Commented] (TIKA-1323) Improve exception reporting in JAX-RS server

2014-06-05 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019202#comment-14019202
 ] 

Sergey Beryozkin commented on TIKA-1323:


Hi Tim, sure, I was only commenting because in a pure web service server world 
it is good practice to limit the exposure of exception information to the 
client for security reasons. But in this case it is not a 'pure' server fault 
that could expose sensitive info; it is more about presenting the 'fault' info 
about a given file Tika cannot process to the client, this is how I see it at 
least :-). So yes, if A works for you then it is +1...

Cheers, Sergey  

> Improve exception reporting in JAX-RS server
> 
>
> Key: TIKA-1323
> URL: https://issues.apache.org/jira/browse/TIKA-1323
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
>
> I'd like to use tika-server for TIKA-1302.  As part of that, I'd like to 
> record exception stacktraces per document.  I see two options: transmit the 
> info back to the client (assuming a doc didn't bring the server down :) ) 
> along with the current error code or log the document id and stacktrace via 
> the server.  Given my current design thoughts, I'd prefer the first option.
> Any objections or recommendations?





[jira] [Commented] (TIKA-1319) Translation

2014-06-05 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019154#comment-14019154
 ] 

Chris A. Mattmann commented on TIKA-1319:
-

Tyler, per my comments here: https://reviews.apache.org/r/22219/ this is 
excellent and ready to ship. I found a file that you forgot to put the ALv2 
header on, but no biggie there. I will add it. I will commit this in the next 
few hours. Thanks!

> Translation
> ---
>
> Key: TIKA-1319
> URL: https://issues.apache.org/jira/browse/TIKA-1319
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Minor
>
> I just opened up a review on reviews.apache.org -- 
> https://reviews.apache.org/r/22219/. I copied the description below. 
> This patch adds basic language translation functionality to Tika. Translation 
> is provided by a Microsoft API, but accessed through Apache 2 licensed 
> com.memetix.microsoft-translator-java-api 
> (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants 
> to use the translation feature, they have to add a client id and client 
> secret to the 
> tika-core/src/main/resources/org/apache/tika/language/translator.properties 
> file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
> com.memetix as a dependency in tika-core. I put the Translator class in 
> org.apache.tika.language. There is no integration with the server or CLI, 
> yet. Further, only Strings are translated right now -- if you pass in a full 
> document with xml tags, the structure will be mangled. But, I think that 
> would be a cool feature -- translate the body, title, subtitle, etc, but not 
> the structural elements. 
> There is still more work to do, but I wanted some more eyes on this to make 
> sure I'm heading in the right direction and this is a desired feature. Let me 
> know what you think!
> There are two simple unit tests for now which translate "hello" to French 
> ("salut"). One for inputting the source and target languages, one for 
> inputting just the target language (and detecting the source language 
> automatically).





Re: Review Request 22219: Add Translation to Tika

2014-06-05 Thread Chris Mattmann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/#review44842
---

Ship it!



- Chris Mattmann


On June 5, 2014, 4:19 p.m., Tyler Palsulich wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22219/
> ---
> 
> (Updated June 5, 2014, 4:19 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> ---
> 
> This patch adds basic language translation functionality to Tika. Translation 
> is provided by a Microsoft API, but accessed through Apache 2 licensed 
> com.memetix.microsoft-translator-java-api 
> (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants 
> to use the translation feature, they have to add a client id and client 
> secret to the 
> tika-core/src/main/resources/org/apache/tika/language/translator.properties 
> file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
> com.memetix as a dependency in tika-core. I put the Translator class in 
> org.apache.tika.language. There is no integration with the server or CLI, 
> yet. Further, only Strings are translated right now -- if you pass in a full 
> document with xml tags, the structure will be mangled. But, I think that 
> would be a cool feature -- translate the body, title, subtitle, etc, but not 
> the structural elements. 
> 
> There is still more work to do, but I wanted some more eyes on this to make 
> sure I'm heading in the right direction and this is a desired feature. Let me 
> know what you think!
> 
> 
> Diffs
> -
> 
>   trunk/pom.xml 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java 
> 1600565 
>   
> trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java
>  PRE-CREATION 
>   
> trunk/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java
>  PRE-CREATION 
>   trunk/tika-translate/pom.xml PRE-CREATION 
>   
> trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java
>  PRE-CREATION 
>   
> trunk/tika-translate/src/main/resources/META-INF/services/org.apache.tika.language.translate.Translator
>  PRE-CREATION 
>   
> trunk/tika-translate/src/main/resources/org/apache/tika/language/translator.microsoft.properties
>  PRE-CREATION 
>   
> trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java
>  PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22219/diff/
> 
> 
> Testing
> ---
> 
> There are two simple unit tests for now which translate "hello" to French 
> ("salut"). One for inputting the source and target languages, one for 
> inputting just the target language (and detecting the source language 
> automatically).
> 
> 
> Thanks,
> 
> Tyler Palsulich
> 
>



Re: Review Request 22219: Add Translation to Tika

2014-06-05 Thread Chris Mattmann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/#review44841
---



trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java


Need Apache license here. I will add it.


- Chris Mattmann


On June 5, 2014, 4:19 p.m., Tyler Palsulich wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22219/
> ---
> 
> (Updated June 5, 2014, 4:19 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> ---
> 
> This patch adds basic language translation functionality to Tika. Translation 
> is provided by a Microsoft API, but accessed through Apache 2 licensed 
> com.memetix.microsoft-translator-java-api 
> (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants 
> to use the translation feature, they have to add a client id and client 
> secret to the 
> tika-core/src/main/resources/org/apache/tika/language/translator.properties 
> file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
> com.memetix as a dependency in tika-core. I put the Translator class in 
> org.apache.tika.language. There is no integration with the server or CLI, 
> yet. Further, only Strings are translated right now -- if you pass in a full 
> document with xml tags, the structure will be mangled. But, I think that 
> would be a cool feature -- translate the body, title, subtitle, etc, but not 
> the structural elements. 
> 
> There is still more work to do, but I wanted some more eyes on this to make 
> sure I'm heading in the right direction and this is a desired feature. Let me 
> know what you think!
> 
> 
> Diffs
> -
> 
>   trunk/pom.xml 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java 
> 1600565 
>   
> trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java
>  PRE-CREATION 
>   
> trunk/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java
>  PRE-CREATION 
>   trunk/tika-translate/pom.xml PRE-CREATION 
>   
> trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java
>  PRE-CREATION 
>   
> trunk/tika-translate/src/main/resources/META-INF/services/org.apache.tika.language.translate.Translator
>  PRE-CREATION 
>   
> trunk/tika-translate/src/main/resources/org/apache/tika/language/translator.microsoft.properties
>  PRE-CREATION 
>   
> trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java
>  PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22219/diff/
> 
> 
> Testing
> ---
> 
> There are two simple unit tests for now which translate "hello" to French 
> ("salut"). One for inputting the source and target languages, one for 
> inputting just the target language (and detecting the source language 
> automatically).
> 
> 
> Thanks,
> 
> Tyler Palsulich
> 
>



Re: Review Request 22246: New parser for Matlab .mat files

2014-06-05 Thread Ann Burgess


> On June 4, 2014, 11:25 p.m., Matthias Krueger wrote:
> > The Matlab MIME types used seem to be application/x-matlab-data or 
> > application/matlab-mat.
> > 
> > Would it make sense to add them to the mime XML for detection?
> > 
> > 
> >   MATLAB data file
> >   
> >   
> > 
> >   
> >   
> > 
> > 
> >
> 
> Chris Mattmann wrote:
> +1 this makes a ton of sense to add IMO.
> 
> Nick Burch wrote:
> There's some odd whitespace going on - we normally use 4 spaces and no 
> tabs.
> 
> When outputting the variables, it would probably make sense to put each 
> one into either a paragraph or a list, so that we get helpful output in html 
> mode as well as text mode
> 
> With that in place, it would then be possible to have a unit test that 
> checked the html output, as well as the current text one
> 
> Also on testing, I think at least some of the tests have an 
> implementation of assertContains, which generally gives a more helpful 
> failure message than assertTrue(s.contains(...)) does, might be worth looking 
> into that?

Great input - thank you! I will integrate both and upload the diff.  
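
For reference, the assertContains idea from the quoted comments could look roughly like the sketch below. It is a stand-alone illustration: TestAssertions is a hypothetical class name, and the helper throws a plain AssertionError rather than relying on whichever implementation already exists in the Tika test code.

```java
/** Hypothetical test helper illustrating why assertContains gives better failures. */
final class TestAssertions {
    /** Fails with a message that shows both the expected fragment and the actual text. */
    static void assertContains(String needle, String haystack) {
        if (haystack == null || !haystack.contains(needle)) {
            throw new AssertionError(
                    "Expected to find <" + needle + "> in: " + haystack);
        }
    }
}
```

Compared with assertTrue(s.contains(...)), a failure here reports what was searched for and what the parser actually produced, which makes a broken test much quicker to diagnose.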


- Ann


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22246/#review44773
---


On June 4, 2014, 10:23 p.m., Ann Burgess wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22246/
> ---
> 
> (Updated June 4, 2014, 10:23 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> ---
> 
> This is a new parser for Matlab .mat files. The parser uses JMatIO, a Java 
> API for Matlab's MAT-file I/O. JMatIO is available through Maven Central. 
> The text output from this parser provides variable names and dimensions that 
> are both inside and outside of data structures, but does NOT provide the 
> actual data values within each .mat file. 
> 
> 
> Diffs
> -
> 
> 
> Diff: https://reviews.apache.org/r/22246/diff/
> 
> 
> Testing
> ---
> 
> Successfully ran a basic unit test that checks both --text and --metadata 
> parser output.  
> 
> 
> File Attachments
> 
> 
> Parser File
>   
> https://reviews.apache.org/media/uploaded/files/2014/06/04/cb39636d-ec53-4fbc-b348-6a4db8907f6b__MatParser.java
> Unit Test
>   
> https://reviews.apache.org/media/uploaded/files/2014/06/04/bbff8c6b-caa1-4830-b441-532c28c3c78e__MatParserTest.java
> 
> 
> Thanks,
> 
> Ann Burgess
> 
>



Re: Review Request 22219: Add Translation to Tika

2014-06-05 Thread Paul Ramirez

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/#review44825
---

Ship it!


Ship It!

- Paul Ramirez


On June 5, 2014, 4:19 p.m., Tyler Palsulich wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22219/
> ---
> 
> (Updated June 5, 2014, 4:19 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> ---
> 
> This patch adds basic language translation functionality to Tika. Translation 
> is provided by a Microsoft API, but accessed through Apache 2 licensed 
> com.memetix.microsoft-translator-java-api 
> (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants 
> to use the translation feature, they have to add a client id and client 
> secret to the 
> tika-core/src/main/resources/org/apache/tika/language/translator.properties 
> file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
> com.memetix as a dependency in tika-core. I put the Translator class in 
> org.apache.tika.language. There is no integration with the server or CLI, 
> yet. Further, only Strings are translated right now -- if you pass in a full 
> document with xml tags, the structure will be mangled. But, I think that 
> would be a cool feature -- translate the body, title, subtitle, etc, but not 
> the structural elements. 
> 
> There is still more work to do, but I wanted some more eyes on this to make 
> sure I'm heading in the right direction and this is a desired feature. Let me 
> know what you think!
> 
> 
> Diffs
> -
> 
>   trunk/pom.xml 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java 
> 1600565 
>   
> trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java
>  PRE-CREATION 
>   
> trunk/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java
>  PRE-CREATION 
>   trunk/tika-translate/pom.xml PRE-CREATION 
>   
> trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java
>  PRE-CREATION 
>   
> trunk/tika-translate/src/main/resources/META-INF/services/org.apache.tika.language.translate.Translator
>  PRE-CREATION 
>   
> trunk/tika-translate/src/main/resources/org/apache/tika/language/translator.microsoft.properties
>  PRE-CREATION 
>   
> trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java
>  PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22219/diff/
> 
> 
> Testing
> ---
> 
> There are two simple unit tests for now which translate "hello" to French 
> ("salut"). One for inputting the source and target languages, one for 
> inputting just the target language (and detecting the source language 
> automatically).
> 
> 
> Thanks,
> 
> Tyler Palsulich
> 
>



[jira] [Commented] (TIKA-1319) Translation

2014-06-05 Thread Paul Ramirez (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018949#comment-14018949
 ] 

Paul Ramirez commented on TIKA-1319:


The refactored interfaces look good. +1

> Translation
> ---
>
> Key: TIKA-1319
> URL: https://issues.apache.org/jira/browse/TIKA-1319
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Minor
>
> I just opened up a review on reviews.apache.org -- 
> https://reviews.apache.org/r/22219/. I copied the description below. 
> This patch adds basic language translation functionality to Tika. Translation 
> is provided by a Microsoft API, but accessed through Apache 2 licensed 
> com.memetix.microsoft-translator-java-api 
> (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants 
> to use the translation feature, they have to add a client id and client 
> secret to the 
> tika-core/src/main/resources/org/apache/tika/language/translator.properties 
> file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
> com.memetix as a dependency in tika-core. I put the Translator class in 
> org.apache.tika.language. There is no integration with the server or CLI, 
> yet. Further, only Strings are translated right now -- if you pass in a full 
> document with xml tags, the structure will be mangled. But, I think that 
> would be a cool feature -- translate the body, title, subtitle, etc, but not 
> the structural elements. 
> There is still more work to do, but I wanted some more eyes on this to make 
> sure I'm heading in the right direction and this is a desired feature. Let me 
> know what you think!
> There are two simple unit tests for now which translate "hello" to French 
> ("salut"). One for inputting the source and target languages, one for 
> inputting just the target language (and detecting the source language 
> automatically).





[jira] [Updated] (TIKA-1323) Improve exception reporting in JAX-RS server

2014-06-05 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1323:
--

Summary: Improve exception reporting in JAX-RS server  (was: Improve 
logging in JAX-RS server)

> Improve exception reporting in JAX-RS server
> 
>
> Key: TIKA-1323
> URL: https://issues.apache.org/jira/browse/TIKA-1323
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
>
> I'd like to use tika-server for TIKA-1302.  As part of that, I'd like to 
> record exception stacktraces per document.  I see two options: transmit the 
> info back to the client (assuming a doc didn't bring the server down :) ) 
> along with the current error code or log the document id and stacktrace via 
> the server.  Given my current design thoughts, I'd prefer the first option.
> Any objections or recommendations?





[jira] [Commented] (TIKA-1323) Improve logging in JAX-RS server

2014-06-05 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018931#comment-14018931
 ] 

Tim Allison commented on TIKA-1323:
---

Hi Sergey,

For TIKA-1302, I'd like to use tika-server, and I'd like to be able to record 
exceptions at a per-file level so that we can say, e.g., with Tika 1.5 we had 
515 exceptions on docx files, but with Tika-1.6-SNAPSHOT we had 1025, or 
something similar.  I'd also like to be able to say: we had an exception on 
file 12345.docx with Tika 1.5, but we're not getting an exception with 
Tika-1.6-SNAPSHOT.  We can do that now with tika-server on the client side.  If 
my client receives a 422 or 500, I know that something went wrong, and I can 
log it.

However, what I'd also like to be able to do is identify frequency of 
stacktrace elements so that we can sort the most frequent exceptions per 
document type.  To do this, we need to be able to record the stacktrace, and 
I'd also like to be able to link the stacktrace back to the document that 
caused the problem. 

If I run Tika directly via Java code (what I've been doing), I can easily catch 
the exceptions and log the information on a per-file basis.  So, my preference 
(plan A) would be to have tika-server return the stacktrace as the body content 
for exceptions.  We can parameterize this functionality on the command line, of 
course.  The other option (plan B) would be to pass the file name to 
tika-server, and have tika-server log the file name in conjunction with the 
stacktrace, but that is not as appealing to me.  The third option, of course, 
is to set up a different service for evaluation, but I'd much prefer to use our 
base code as much as possible.

So, is plan A reasonable?
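
To make plan A a little more concrete, here is a minimal sketch of the two 
pieces it relies on: turning an exception into text that could be sent as a 
response body, and tallying failures by a simple signature so the most frequent 
ones can be sorted. StackTraceReport and its methods are hypothetical names 
used only for illustration; nothing here is tika-server code, and the choice of 
"exception class plus top frame" as the signature is an assumption.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of tika-server.
final class StackTraceReport {
    private StackTraceReport() {}

    /** Renders the full stack trace as text, e.g. for use as an error response body. */
    static String render(Throwable t) {
        StringWriter buffer = new StringWriter();
        PrintWriter writer = new PrintWriter(buffer);
        t.printStackTrace(writer);
        writer.flush();
        return buffer.toString();
    }

    /** Assumed grouping key: exception class name plus the topmost stack frame. */
    static String signature(Throwable t) {
        StackTraceElement[] frames = t.getStackTrace();
        String top = frames.length > 0 ? frames[0].toString() : "<no frames>";
        return t.getClass().getName() + " at " + top;
    }

    /** Counts how often each signature occurs across a set of failures. */
    static Map<String, Integer> tally(Iterable<? extends Throwable> failures) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (Throwable t : failures) {
            String key = signature(t);
            Integer seen = counts.get(key);
            counts.put(key, seen == null ? 1 : seen + 1);
        }
        return counts;
    }
}
```

A client that receives the rendered text alongside a 422 or 500 could store it 
next to the file name and run a tally like this afterwards.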



> Improve logging in JAX-RS server
> 
>
> Key: TIKA-1323
> URL: https://issues.apache.org/jira/browse/TIKA-1323
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
>
> I'd like to use tika-server for TIKA-1302.  As part of that, I'd like to 
> record exception stacktraces per document.  I see two options: transmit the 
> info back to the client (assuming a doc didn't bring the server down :) ) 
> along with the current error code or log the document id and stacktrace via 
> the server.  Given my current design thoughts, I'd prefer the first option.
> Any objections or recommendations?





[jira] [Commented] (TIKA-1319) Translation

2014-06-05 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018919#comment-14018919
 ] 

Tyler Palsulich commented on TIKA-1319:
---

Thank you all for the help! And, thanks to [~lewismc] for the idea. I just 
updated the patch. Let me know if there is anything else I should update.

> Translation
> ---
>
> Key: TIKA-1319
> URL: https://issues.apache.org/jira/browse/TIKA-1319
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Minor
>
> I just opened up a review on reviews.apache.org -- 
> https://reviews.apache.org/r/22219/. I copied the description below. 
> This patch adds basic language translation functionality to Tika. Translation 
> is provided by a Microsoft API, but accessed through Apache 2 licensed 
> com.memetix.microsoft-translator-java-api 
> (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants 
> to use the translation feature, they have to add a client id and client 
> secret to the 
> tika-core/src/main/resources/org/apache/tika/language/translator.properties 
> file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
> com.memetix as a dependency in tika-core. I put the Translator class in 
> org.apache.tika.language. There is no integration with the server or CLI, 
> yet. Further, only Strings are translated right now -- if you pass in a full 
> document with xml tags, the structure will be mangled. But, I think that 
> would be a cool feature -- translate the body, title, subtitle, etc, but not 
> the structural elements. 
> There is still more work to do, but I wanted some more eyes on this to make 
> sure I'm heading in the right direction and this is a desired feature. Let me 
> know what you think!
> There are two simple unit tests for now which translate "hello" to French 
> ("salut"). One for inputting the source and target languages, one for 
> inputting just the target language (and detecting the source language 
> automatically).





Re: Review Request 22219: Add Translation to Tika

2014-06-05 Thread Tyler Palsulich

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/
---

(Updated June 5, 2014, 4:19 p.m.)


Review request for tika and Chris Mattmann.


Changes
---

I updated the patch (off of r1600565). I created the tika-translate module. The 
Translator interface is still in tika-core. MicrosoftTranslator (in 
tika-translate) is the only implementation of the Translator interface. 
tika-core/Tika uses SPI (the Java service provider mechanism) to load the 
Translators from the 
META-INF/services/org.apache.tika.language.translate.Translator file in 
tika-translate, so tika-core does not depend on tika-translate. A notable 
result of this is that I added a Translator field to TikaConfig, so users can 
specify a translator in a DOM, get a DefaultTranslator, etc. I updated some of 
the Javadoc, too. 
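
For readers unfamiliar with the META-INF/services mechanism, the loading step 
can be sketched roughly as below. This is not the code in the patch: the 
Translator interface here is a hypothetical stand-in whose method signature is 
assumed, and NoOpTranslator and TranslatorLoader are made-up names. Only 
java.util.ServiceLoader is the real JDK API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical stand-in for the Translator interface; the real signature may differ. */
interface Translator {
    String translate(String text, String sourceLanguage, String targetLanguage);
}

/** Assumed fallback for when no provider is registered: returns the text unchanged. */
final class NoOpTranslator implements Translator {
    @Override
    public String translate(String text, String sourceLanguage, String targetLanguage) {
        return text;
    }
}

final class TranslatorLoader {
    private TranslatorLoader() {}

    /**
     * Collects every implementation named in a META-INF/services file on the
     * classpath. The file is named after the interface's fully qualified name
     * and lists one implementation class per line.
     */
    static List<Translator> loadAll() {
        List<Translator> found = new ArrayList<Translator>();
        for (Translator translator : ServiceLoader.load(Translator.class)) {
            found.add(translator);
        }
        return found;
    }

    /** Returns the first registered implementation, or the no-op fallback. */
    static Translator loadFirstOrDefault() {
        List<Translator> found = loadAll();
        return found.isEmpty() ? new NoOpTranslator() : found.get(0);
    }
}
```

Because the lookup goes through the services file at runtime, the module that 
declares the interface needs no compile-time dependency on the module that 
provides the implementation.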


Repository: tika


Description
---

This patch adds basic language translation functionality to Tika. Translation 
is provided by a Microsoft API, but accessed through Apache 2 licensed 
com.memetix.microsoft-translator-java-api 
(https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants to 
use the translation feature, they have to add a client id and client secret to 
the tika-core/src/main/resources/org/apache/tika/language/translator.properties 
file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
com.memetix as a dependency in tika-core. I put the Translator class in 
org.apache.tika.language. There is no integration with the server or CLI, yet. 
Further, only Strings are translated right now -- if you pass in a full 
document with xml tags, the structure will be mangled. But, I think that would 
be a cool feature -- translate the body, title, subtitle, etc, but not the 
structural elements. 

There is still more work to do, but I wanted some more eyes on this to make 
sure I'm heading in the right direction and this is a desired feature. Let me 
know what you think!


Diffs (updated)
-

  trunk/pom.xml 1600565 
  trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1600565 
  trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java 1600565 
  
trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java
 PRE-CREATION 
  
trunk/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java
 PRE-CREATION 
  trunk/tika-translate/pom.xml PRE-CREATION 
  
trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java
 PRE-CREATION 
  
trunk/tika-translate/src/main/resources/META-INF/services/org.apache.tika.language.translate.Translator
 PRE-CREATION 
  
trunk/tika-translate/src/main/resources/org/apache/tika/language/translator.microsoft.properties
 PRE-CREATION 
  
trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java
 PRE-CREATION 

Diff: https://reviews.apache.org/r/22219/diff/


Testing
---

There are two simple unit tests for now which translate "hello" to French 
("salut"): one that supplies both the source and target languages, and one that 
supplies just the target language (and detects the source language 
automatically).


Thanks,

Tyler Palsulich



Re: Review Request 22246: New parser for Matlab .mat files

2014-06-05 Thread Nick Burch


> On June 4, 2014, 11:25 p.m., Matthias Krueger wrote:
> > The Matlab MIME types used seem to be application/x-matlab-data or 
> > application/matlab-mat.
> > 
> > Would it make sense to add them to the mime XML for detection?
> > 
> > 
> >   MATLAB data file
> >   
> >   
> > 
> >   
> >   
> > 
> > 
> >
> 
> Chris Mattmann wrote:
> +1 this makes a ton of sense to add IMO.

There's some odd whitespace going on - we normally use 4 spaces and no tabs.

When outputting the variables, it would probably make sense to put each one 
into either a paragraph or a list, so that we get helpful output in html mode 
as well as text mode.

With that in place, it would then be possible to have a unit test that checks 
the html output, as well as the current text one.

Also on testing, I think at least some of the tests have an implementation of 
assertContains, which generally gives a more helpful failure message than 
assertTrue(s.contains(...)) does. It might be worth looking into that.
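
To illustrate the difference: assertTrue(s.contains(...)) can only report that 
a condition was false, whereas a contains-style assertion can show what was 
searched for and what was actually produced. The helper below is a minimal 
sketch written for this note; the existing assertContains in the Tika tests may 
have a different signature and message, and the sample output string is made up.

```java
// Hypothetical helper for illustration; not the existing Tika test utility.
final class ContainsAssertions {
    private ContainsAssertions() {}

    /** Fails with a message that shows both the expected fragment and the actual text. */
    static void assertContains(String needle, String haystack) {
        if (haystack == null || !haystack.contains(needle)) {
            throw new AssertionError("Expected to find '" + needle + "' in:\n" + haystack);
        }
    }

    public static void main(String[] args) {
        String output = "someVariable: 2x2"; // made-up parser output
        assertContains("someVariable", output); // passes silently
        try {
            assertContains("otherVariable", output);
        } catch (AssertionError e) {
            // The failure message names the missing fragment and the text searched.
            System.out.println(e.getMessage());
        }
    }
}
```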


- Nick


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22246/#review44773
---


On June 4, 2014, 10:23 p.m., Ann Burgess wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22246/
> ---
> 
> (Updated June 4, 2014, 10:23 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> ---
> 
> This is a new parser for Matlab .mat files.  The parser uses JMatIO, 
> Matlab's MAT-file I/O API in Java. JMatIO is available through Maven Central. 
>  The text output from this parser provides variable names and dimensions that 
> are both inside and outside of data structures, but does NOT provide the 
> actual data values within each .mat file. 
> 
> 
> Diffs
> -
> 
> 
> Diff: https://reviews.apache.org/r/22246/diff/
> 
> 
> Testing
> ---
> 
> Successfully ran a basic unit test that checks both --text and --metadata 
> parser output.  
> 
> 
> File Attachments
> 
> 
> Parser File
>   
> https://reviews.apache.org/media/uploaded/files/2014/06/04/cb39636d-ec53-4fbc-b348-6a4db8907f6b__MatParser.java
> Unit Test
>   
> https://reviews.apache.org/media/uploaded/files/2014/06/04/bbff8c6b-caa1-4830-b441-532c28c3c78e__MatParserTest.java
> 
> 
> Thanks,
> 
> Ann Burgess
> 
>



[jira] [Commented] (TIKA-1323) Improve logging in JAX-RS server

2014-06-05 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018612#comment-14018612
 ] 

Sergey Beryozkin commented on TIKA-1323:


Hi, maybe do the logging at the server and optionally report all the details to 
the client? I'm not sure really :-)

> Improve logging in JAX-RS server
> 
>
> Key: TIKA-1323
> URL: https://issues.apache.org/jira/browse/TIKA-1323
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
>
> I'd like to use tika-server for TIKA-1302.  As part of that, I'd like to 
> record exception stacktraces per document.  I see two options: transmit the 
> info back to the client (assuming a doc didn't bring the server down :) ) 
> along with the current error code or log the document id and stacktrace via 
> the server.  Given my current design thoughts, I'd prefer the first option.
> Any objections or recommendations?


