date:20160314

[jira] [Created] (TIKA-1901) tika detect consumes stream when streams contains msoffice file

2016-03-14 Thread Gerard van der Hoorn (JIRA)

Gerard van der Hoorn created TIKA-1901:
--

 Summary: tika detect consumes stream when streams contains 
msoffice file
 Key: TIKA-1901
 URL: https://issues.apache.org/jira/browse/TIKA-1901
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.12
Reporter: Gerard van der Hoorn


When tika.detect is used on an MS Office file (Word or Excel 2003), the stream 
is consumed, which is not expected. According to the documentation, when the 
stream supports marking, the position in the stream will be returned to the 
original position.

A test case is attached.
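
The expectation described in this report can be illustrated with a small, self-contained sketch that uses only java.io. It is not Tika's detector code: the class MarkResetSketch and the methods sniff and positionPreserved are hypothetical names, and the check covers only the first four bytes of the OLE2 signature. It shows the mark/reset contract a caller relies on: after detection, the next read should return the first byte of the file again.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetSketch {
    // First four bytes of the OLE2 compound-document signature (.doc/.xls 2003).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0};

    /** Hypothetical detector: peeks at the header, then rewinds the stream. */
    static String sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(OLE2_MAGIC.length);
        try {
            for (int expected : OLE2_MAGIC) {
                if (in.read() != expected) {
                    return "application/octet-stream";
                }
            }
            // Type label chosen for illustration only.
            return "application/x-tika-msoffice";
        } finally {
            in.reset(); // always restore the caller's position
        }
    }

    /** True if the first byte can still be read after detection has run. */
    static boolean positionPreserved(byte[] data) throws IOException {
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        sniff(in);
        return in.read() == (data[0] & 0xFF);
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, 0x01, 0x02};
        System.out.println(positionPreserved(doc));
    }
}
```

A check like positionPreserved, run against the real tika.detect call and the attached test files, would be one way to confirm whether the stream position is restored.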



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (TIKA-1901) tika detect consumes stream when streams contains msoffice file

2016-03-14 Thread Gerard van der Hoorn (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerard van der Hoorn updated TIKA-1901:
---
Attachment: test.xls
test.pdf
test.doc
TikaStreamConsumingIssue.java

Added a junit test with test files.


[jira] [Commented] (TIKA-1894) Add XMPMM metadata extraction to JempboxExtractor

2016-03-14 Thread Ray Gauss II (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193622#comment-15193622
 ] 

Ray Gauss II commented on TIKA-1894:


The {{tika-xmp}} project deals with converting a populated Tika {{Metadata}} 
object into XMP.

Perhaps that project should be renamed to something more specific at some 
point, but regardless, I don't think it's the right spot for this sort of 
shared parser code.

I'd vote for the simpler shared util jar, but I think it can still live next to 
the modules, something like {{/tika-parsers-modules/tika-parser-xmp-commons}}?

> Add XMPMM metadata extraction to JempboxExtractor
> -
>
> Key: TIKA-1894
> URL: https://issues.apache.org/jira/browse/TIKA-1894
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Minor
>
> The XMP Media Management (XMPMM) section of xmp carries some useful 
> information.  We currently have keys for many of the important attributes in 
> tika-core's o.a.t.metadata.XMPMM, and JempBox extracts the XMPMM schema, but 
> the wiring between the two has not yet been installed. 




Re: [DISCUSS] options for XMP parsing?

2016-03-14 Thread Ray Gauss

Hi Tim,

Consolidated handling of XMP would be great; I'm glad you're taking a look at it 
and I'll try to help out where I can.

> You've been happy with it at Alfresco? 

It's been a while since I looked at it but I don't recall any difficulties.

> I'd be interested to hear more about what happens with InDesign files.

It stores things in 'pages' [1].

Regards,

Ray


[1] http://stackoverflow.com/a/22661992


> On Mar 10, 2016, at 9:38 AM, Allison, Timothy B.  wrote:
> 
> Hi Ray,
>  Got it.  Thank you.
> 
> That'd be great.  In follow up discussion with PDFBox devs, they mentioned 
> that it is not a design feature/restriction on XMPBox that it doesn't handle 
> non PDF/A files...only a matter of patching and building out their current 
> code base.   The downside is there's quite a bit to do, the upside is that it 
> is a living code base.
> 
> I'll experiment with Adobe's xmp-core.  If you have any pointers/examples, 
> let me know...I'll be starting with: 
> https://indisnip.wordpress.com/2010/08/17/extract-metadata-with-adobe-xmp-part-2/.
>  You've been happy with it at Alfresco? 
> 
> No matter which package we use, it would be nice to build out uniform 
> extraction of XMP for all image and PDF files for the common elements -- with 
> special handling by file type if necessary.  As you mentioned, it would also 
> be great to add or modify our XMPScanner to extract all XMP packets from a 
> file...I've started dabbling with this here: 
> https://github.com/tballison/tika/tree/xmp_scanner .  I'd be interested to 
> hear more about what happens with InDesign files. In our own test set, we 
> have a PDF file with two packets containing conflicting authorship info IIRC! 
> :)  It would be nice to expose both the canonical XMP info (with proper 
> processing of "later-xmp-overrides-earlier") as well as all of the info that 
> can be scraped from the XMP (packet1: authorXYZ packet2: authorQRS)...two 
> different use cases.
> 
> Thank you, again.
> 
> Cheers,
> 
>   Tim 
> 
> 
> 
> 
> -Original Message-
> From: Ray Gauss [mailto:ray.ga...@alfresco.com] 
> Sent: Tuesday, March 08, 2016 2:34 PM
> To: dev@tika.apache.org
> Subject: Re: [DISCUSS] options for XMP parsing?
> 
> To clarify... the 'we' in my third sentence was referring to Alfresco, not 
> Tika.
> 
> I'm not sure how much of that code would be useful but I may be able to 
> contribute some of it.
> 
> Regards,
> 
> Ray
> 
> 
>> On Mar 8, 2016, at 2:07 PM, Allison, Timothy B.  wrote:
>> 
>> Thank you.  Will take a look.
>> 
>> -Original Message-
>> From: Ray Gauss [mailto:ray.ga...@alfresco.com]
>> Sent: Tuesday, March 08, 2016 1:55 PM
>> To: dev@tika.apache.org
>> Subject: Re: [DISCUSS] options for XMP parsing?
>> 
>> Hi Tim,
>> 
>> We're already using Adobe's xmpcore in tika-xmp, which works fine for parsing 
>> XMP (though it has not seen updates in a while), but getting the XMP packets 
>> out of the files is trickier.  
>> 
>> We have XMPPacketScanner which works for many cases, but not all.  InDesign 
>> files for example do some strange things.
>> 
>> In the past we've used different packet scanners depending on the file type 
>> (including Exiftool command-line) to get the XMP out then used xmpcore to 
>> parse into simple flattened properties.
>> 
>> Regards,
>> 
>> Ray
>> 
>> 
>>> On Mar 8, 2016, at 12:50 PM, Allison, Timothy B.  wrote:
>>> 
>>> All,
>>> 
>>> PDFBox 2.0 is soon to be released.  In the course of its development, the 
>>> project has migrated from Jempbox (which we're now using) to XmpBox; and 
>>> Jempbox is now on its last legs.  
>>> 
>>> XmpBox was "written for PDF/A checking," not for robust processing of 
>>> common variants of XMPs in the wild; I found that it fails on roughly 40% 
>>> of XMPs I pulled out of PDFs from govdocs1/commoncrawl.
>>> 
>>> In short, we can't migrate to XmpBox, and Jempbox is at the end of its life.
>>> 
>>> Has anyone had any luck with an Apache-friendly XMP parser?  Are there 
>>> better options than copying and pasting jempbox into Tika and maintaining 
>>> it ourselves (yuk!)?
>>> 
>>>Best,
>>> 
>>>   Tim
>>> 
>>> -Original Message-
>>> From: Tilman Hausherr [mailto:thaush...@t-online.de]
>>> Sent: Tuesday, March 08, 2016 12:13 PM
>>> To: d...@pdfbox.apache.org
>>> Subject: Re: roadmap for XMPBox?
>>> 
>>> I think the problem is that XmpBox was written for PDF/A checking, so it 
>>> fails with XMPs that are not PDF/A. For example, file 000142.pdf has the 
>>> schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
>>> http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf
>>> 
>>> And no, there are no plans for anything on XMP at this time...
>>> 
>>> Tilman
>>> 
>>> 
>>> Am 07.03.2016 um 19:31 schrieb Allison, Timothy B.:
>>>> All,
>>>> 
>>>> When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch 
>>>> from our current re
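
The thread above discusses pulling every XMP packet out of a file before handing it to a parser such as xmpcore. As a rough sketch only, the following scans a byte array for the standard xpacket begin/end processing instructions and returns each packet in file order. It is not Tika's XMPPacketScanner or the xmp_scanner branch mentioned above; the class and method names are hypothetical, and it assumes 8-bit packet encodings (a UTF-16 packet would not match these byte patterns).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class XmpPacketScanSketch {
    private static final byte[] BEGIN = "<?xpacket begin=".getBytes(StandardCharsets.ISO_8859_1);
    private static final byte[] END = "<?xpacket end=".getBytes(StandardCharsets.ISO_8859_1);
    private static final byte[] CLOSE = "?>".getBytes(StandardCharsets.ISO_8859_1);

    /** Index of the first occurrence of needle at or after from, or -1. */
    static int indexOf(byte[] data, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** Returns every complete packet, in file order, including its xpacket wrappers. */
    static List<byte[]> scan(byte[] data) {
        List<byte[]> packets = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = indexOf(data, BEGIN, pos);
            if (start < 0) break;
            int end = indexOf(data, END, start + BEGIN.length);
            if (end < 0) break;
            int close = indexOf(data, CLOSE, end + END.length);
            if (close < 0) break;
            int stop = close + CLOSE.length;
            packets.add(Arrays.copyOfRange(data, start, stop));
            pos = stop; // continue after this packet so later packets are found too
        }
        return packets;
    }

    public static void main(String[] args) {
        String packet = "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>"
                + "<x:xmpmeta>authorXYZ</x:xmpmeta><?xpacket end=\"w\"?>";
        byte[] data = ("%PDF..." + packet + "...").getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(scan(data).size());
    }
}
```

Returning the packets as an ordered list keeps both use cases from the discussion open: a caller can apply "later overrides earlier" to get a canonical view, or inspect each packet separately.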

[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-14 Thread Ray Gauss II (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193845#comment-15193845
 ] 

Ray Gauss II commented on TIKA-1607:


Have we already considered treating the XMP packets more like embedded 
resources and making it easier for the advanced users described above to get at 
those resources, perhaps providing an {{EmbeddedResourceHandler}} 
implementation they could use without resorting to extracting them to files?

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the  Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
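
The shape proposed in the quoted description, a key mapped to a collection of per-value property maps, can be sketched in plain Java as follows. This is only an illustration under assumptions: StructuredMetadataSketch and its methods are hypothetical and are not part of Tika's Metadata API, and the property names ("number", "LibPN-CountryCode") simply mirror the example in the issue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class StructuredMetadataSketch {
    // key -> list of values; each value is itself a small property map
    private final Map<String, List<Map<String, String>>> values = new LinkedHashMap<>();

    /** Appends one structured value under the given key. */
    void add(String key, Map<String, String> value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** All structured values for a key, or an empty list if the key is unknown. */
    List<Map<String, String>> get(String key) {
        return values.getOrDefault(key, Collections.emptyList());
    }

    /** Flattened String[] view of one property, similar to the existing multivalued style. */
    String[] getValues(String key, String property) {
        return get(key).stream()
                .map(m -> m.get(property))
                .filter(Objects::nonNull)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        StructuredMetadataSketch metadata = new StructuredMetadataSketch();
        Map<String, String> first = new LinkedHashMap<>();
        first.put("number", "+162648743476");
        first.put("LibPN-CountryCode", "US");
        metadata.add("phonenumbers", first);
        System.out.println(String.join(", ", metadata.getValues("phonenumbers", "number")));
    }
}
```

The getValues method hints at one way the backwards compatibility concern could be approached: existing callers keep a String[] view while new callers read the richer maps.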




[jira] [Created] (TIKA-1902) Error while parsing a file using ContentHandler object (initialized using the BodyContentHandler object) for some files

2016-03-14 Thread Harsh Fatepuria (JIRA)

Harsh Fatepuria created TIKA-1902:
-

 Summary: Error while parsing a file using ContentHandler object 
(initialized using the BodyContentHandler object) for some files
 Key: TIKA-1902
 URL: https://issues.apache.org/jira/browse/TIKA-1902
 Project: Tika
  Issue Type: Bug
  Components: handler, parser
Affects Versions: 1.12
 Environment: Java
Reporter: Harsh Fatepuria


Java Code:

public static String parseBodyToHTML(String filePath)
        throws IOException, SAXException, TikaException {
    ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());

    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (FileInputStream stream = new FileInputStream(new File(filePath))) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}


While using this function for some files, I get the following error:

Exception in thread "main" org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
    at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:250)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:291)
    at org.apache.tika.parser.pdf.PDF2XHTML.startPage(PDF2XHTML.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:437)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
    at TTR.TTRAnalysis.parseBodyToHTML(TTRAnalysis.java:39)




[jira] [Updated] (TIKA-1902) Error while parsing some files using ContentHandler object (initialized using the BodyContentHandler object)

2016-03-14 Thread Harsh Fatepuria (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh Fatepuria updated TIKA-1902:
--
Summary: Error while parsing some files using ContentHandler object 
(initialized using the BodyContentHandler object)   (was: Error while parsing a 
file using ContentHandler object (initialized using the BodyContentHandler 
object) for some files)

> Error while parsing some files using ContentHandler object (initialized using 
> the BodyContentHandler object) 
> -
>
> Key: TIKA-1902
> URL: https://issues.apache.org/jira/browse/TIKA-1902
> Project: Tika
>  Issue Type: Bug
>  Components: handler, parser
>Affects Versions: 1.12
> Environment: Java
>Reporter: Harsh Fatepuria
>  Labels: handler, java, parser, tika

[jira] [Commented] (TIKA-1902) Error while parsing some files using ContentHandler object (initialized using the BodyContentHandler object)

2016-03-14 Thread Harsh Fatepuria (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194148#comment-15194148
 ] 

Harsh Fatepuria commented on TIKA-1902:
---

I am using the BodyContentHandler object to get just the <body> part of the 
extracted XHTML.

If I modify the code to: ContentHandler handler = new ToXMLContentHandler();
I get the full XHTML code with the metadata.


[jira] [Commented] (TIKA-1898) backslashes in mime-type : application/vnd.mif are wrong

2016-03-14 Thread Nick Burch (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194363#comment-15194363
 ] 

Nick Burch commented on TIKA-1898:
--

I've just tried with your test file, and Tika is able to detect the file 
correctly with the data only (no filename). That makes me think that the 
mimetype is correct:

{code}
$ java -jar tika-app-1.13-SNAPSHOT.jar --detect < test.mif 
application/vnd.mif
{code}

Are you able to produce a junit unit test that shows your detection issue, and 
ideally shows your proposed patch fixes it? (Bonus marks if it's as a Github 
Pull Request or a Patch attached to the JIRA!)

> backslashes in mime-type : application/vnd.mif are wrong 
> -
>
> Key: TIKA-1898
> URL: https://issues.apache.org/jira/browse/TIKA-1898
> Project: Tika
>  Issue Type: Bug
>  Components: config, core
> Environment: Win64, Eclipse
>Reporter: Steffen Netz
>Priority: Minor
>  Labels: easyfix, patch
> Attachments: test.doc, test.fm, test.mif, tika-bug.log
>
>
> In
> tika-core\src\main\resources\org\apache\tika\mime\tika-mimetypes.xml  
> there are the lines:
> 
>   
>   
>   
>   
>   
>   
>   wrong.
> the backslashes must be removed.



