[jira] [Updated] (TIKA-1390) Create tika-example module

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1390:

Fix Version/s: (was: 1.11)
   1.12

> Create tika-example module
> --
>
> Key: TIKA-1390
> URL: https://issues.apache.org/jira/browse/TIKA-1390
> Project: Tika
>  Issue Type: Bug
>  Components: example
>Reporter: Tyler Palsulich
> Fix For: 1.12
>
>
> This issue will track the initial creation of the tika-example module. 
> Subtasks will be used for the first few examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---
Fix Version/s: (was: 1.11)
   1.12

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.10
>Reporter: Michael McCandless
>  Labels: newbie
> Fix For: 1.12
>
> Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented-out code to
> SafeContentHandler, to verify that the SAX events produced by the
> parser have valid (matched) tags.  I.e., each startElement("foo") is
> matched by the closing endElement("foo").
> I only did a basic nesting test, plus checking that  is never
> embedded inside another ; we could strengthen this further to check
> that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> It could be these are relatively minor offenses (e.g. closing a table
> without closing the tr) and we need not do anything here... but I think
> it'd be cleaner if all our parsers produced matched, well-formed XHTML
> events.
> I haven't looked into any of these... it could be they are easy to fix.
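The matched-tag check described above can be sketched with a plain stack of open element names. This is only an illustration under stated assumptions: MatchedTagChecker is a hypothetical class name, it is not the code committed to SafeContentHandler, and the error messages simply mirror the wording of the assertion failures quoted in this issue.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical checker (not Tika's SafeContentHandler) for matched SAX tags. */
class MatchedTagChecker extends DefaultHandler {
    // Names of the elements that have been started but not yet ended.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty()) {
            throw new AssertionError("end tag=" + qName + " with no startElement");
        }
        // The element being closed must be the most recently opened one.
        String expected = open.pop();
        if (!expected.equals(qName)) {
            throw new AssertionError("mismatched elements open=" + expected + " close=" + qName);
        }
    }

    @Override
    public void endDocument() {
        if (!open.isEmpty()) {
            throw new AssertionError("unclosed elements at endDocument: " + open);
        }
    }
}
```

Wrapping a parser's output handler in a checker like this would flag the "closing a table without closing the tr" case as soon as the offending endElement arrives.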
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
> Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>   at 
> org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
> 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>   at 
> org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>   at 
> org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
>   at 
> org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
> 

[jira] [Updated] (TIKA-985) Support for HTML5 elements

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-985:
---
Fix Version/s: (was: 1.11)
   1.12

> Support for HTML5 elements
> --
>
> Key: TIKA-985
> URL: https://issues.apache.org/jira/browse/TIKA-985
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.2
>Reporter: Markus Jelsma
> Fix For: 1.12
>
> Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch, 
> TIKA-985-1.3-3.patch, TIKA-985-1.5.patch
>
>
> TagSoup's schema.tssl does not include some HTML5 elements (e.g. article, 
> section). This prevents some custom ContentHandlers from reading expected 
> elements and/or attributes.





[jira] [Updated] (TIKA-1108) Represent individual slides in pptx

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1108:

Fix Version/s: (was: 1.11)
   1.12

> Represent individual slides in pptx
> ---
>
> Key: TIKA-1108
> URL: https://issues.apache.org/jira/browse/TIKA-1108
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Daniel Bonniot de Ruisselet
> Fix For: 1.12
>
>
> When parsing ppt, tika produces for each slide:
> 
> However for pptx these seem to be missing, all the text is directly under 
> .





[jira] [Updated] (TIKA-1329) Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1329:

Fix Version/s: (was: 1.11)
   1.12

> Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser
> ---
>
> Key: TIKA-1329
> URL: https://issues.apache.org/jira/browse/TIKA-1329
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.12
>
> Attachments: TIKA-1329v2.patch, test_recursive_embedded.docx
>
>
> Jukka and Nick have a great demo of parsing metadata recursively on the 
> [wiki|http://wiki.apache.org/tika/RecursiveMetadata].  For TIKA-1302, I'd 
> like to use something similar, and I think that others may find it useful for 
> tika-app and tika-server.
> I took the code from the wiki and made some modifications.  I'm not sure if 
> we should put this in parsers or in a new module for "examples."  Given that 
> I think this would be useful for tika-app and tika-server, I'd prefer 
> parsers, but I'm open to any input...including "let's not."





[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-539:
---
Fix Version/s: (was: 1.11)
   1.12

> Encoding detection is too biased by encoding in meta tag
> 
>
> Key: TIKA-539
> URL: https://issues.apache.org/jira/browse/TIKA-539
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata, parser
>Affects Versions: 0.8, 0.9, 0.10
>Reporter: Reinhard Schwab
>Assignee: Ken Krugler
>Priority: Minor
> Fix For: 1.12
>
> Attachments: TIKA-539.patch, TIKA-539_2.patch
>
>
> If the encoding declared in the meta tag is wrong, that encoding is still
> detected, even if the right encoding was set in the metadata beforehand
> (for example, from the HTTP response header).
> Test code to reproduce:
> static String content = "\n"
>   + " content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>   + "Über den Wolken\n";
>   /**
>    * @param args
>    * @throws IOException
>    * @throws TikaException
>    * @throws SAXException
>    */
>   public static void main(String[] args) throws IOException, SAXException,
>           TikaException {
>       Metadata metadata = new Metadata();
>       metadata.set(Metadata.CONTENT_TYPE, "text/html");
>       metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>       System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>       InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
>       AutoDetectParser parser = new AutoDetectParser();
>       BodyContentHandler h = new BodyContentHandler(1);
>       parser.parse(in, h, metadata, new ParseContext());
>       System.out.print(h.toString());
>       System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>   }
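To see why trusting a wrong charset in the meta tag matters, here is a minimal JDK-only sketch (no Tika classes involved; CharsetMismatchDemo and decode are hypothetical names). It decodes the same UTF-8 bytes once with the right charset and once with the ISO-8859-1 charset that the meta tag wrongly declares.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    /** Decodes raw bytes with whichever charset a detector settled on. */
    static String decode(byte[] raw, Charset detected) {
        return new String(raw, detected);
    }

    public static void main(String[] args) {
        // The U-umlaut is written as an escape so the source file encoding does not matter.
        byte[] raw = "\u00DCber den Wolken".getBytes(StandardCharsets.UTF_8);

        // Correct charset: the original text comes back unchanged.
        String right = decode(raw, StandardCharsets.UTF_8);
        // Wrong charset: the two-byte UTF-8 sequence for the umlaut becomes two separate characters.
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println(right.length() + " chars with UTF-8");
        System.out.println(wrong.length() + " chars with ISO-8859-1");
    }
}
```

The text decoded with the wrong charset is one character longer and no longer starts with the umlaut, which is the kind of corruption the reporter is describing.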





[jira] [Updated] (TIKA-1577) NetCDF Data Extraction

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1577:

Fix Version/s: (was: 1.11)
   1.12

> NetCDF Data Extraction
> --
>
> Key: TIKA-1577
> URL: https://issues.apache.org/jira/browse/TIKA-1577
> Project: Tika
>  Issue Type: Improvement
>  Components: handler, parser
>Affects Versions: 1.7
>Reporter: Ann Burgess
>Assignee: Ann Burgess
>  Labels: features, handler
> Fix For: 1.12
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> A netCDF classic or 64-bit offset dataset is stored as a single file 
> comprising two parts:
>  - a header, containing all the information about dimensions, attributes, and 
> variables except for the variable data;
>  - a data part, comprising fixed-size data, containing the data for variables 
> that don't have an unlimited dimension; and variable-size data, containing 
> the data for variables that have an unlimited dimension.
> The NetCDFParser currently extracts the "header part".  
>  -- text extracts file Dimensions and Variables
>  -- metadata extracts Global Attributes
> We want the option to extract the "data part" of NetCDF files.  
> Let's use the NetCDF test file for our dev testing:  
> tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc
>  





[jira] [Updated] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1696:

Fix Version/s: (was: 1.11)
   1.12

> Language Identification with Text Processing Toolkit from MITLL
> ---
>
> Key: TIKA-1696
> URL: https://issues.apache.org/jira/browse/TIKA-1696
> Project: Tika
>  Issue Type: New Feature
>  Components: languageidentifier
>Reporter: Paul Ramirez
> Fix For: 1.12
>
>
> The aim here is to extend the methods for language identification within 
> text. MIT Lincoln Labs has an open source library [1] written in Julia. 
> Having spoken with the MITLL guys, I understand there may be a Scala 
> version of this library, which would make it easier to package with Tika. 
> At this point I'm not quite sure how many languages this library supports by 
> default, but it can be extended when provided with some training data.
> [1] https://github.com/mit-nlp/Text.jl





[jira] [Updated] (TIKA-1379) error in Tika().detect for xml files with xades signature

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1379:

Fix Version/s: (was: 1.11)
   1.12

> error in Tika().detect for xml files with xades signature
> -
>
> Key: TIKA-1379
> URL: https://issues.apache.org/jira/browse/TIKA-1379
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.4
>Reporter: Alessandro De Angelis
>  Labels: new-parser
> Fix For: 1.12
>
>
> We tried to get the MIME type of an XML file with an embedded XAdES 
> signature. The result is "text/html" and not the expected "text/xml" or 
> "application/xml".
> Here is an example of the XML file:
> {code}
> 
> 
>   00094853 0003 2
>   2013-09-23
>   2013-09-23
>   D69017
>   FILOSOFIA DELLA SCIENZA
>   D69
>   TEATRO E ARTI VISIVE
>   
>   1233456
>   PAOLINO
>   PAPERINO
>   23.0
>   23
>   
>   
>   
>   2012
>   6.0
>   
>   9
>   جامعة البندقية - TEST
>   Verbale_3
>   QUI QUO QUA
>   D69017
>   FILOSOFIA DELLA SCIENZA
>   D69
>   TEATRO E ARTI VISIVE
>   QUI QUO QUA
> 26-09-2013 09:55:53 CEST(+0200)
> 
>   3
>   11.09.03
> 
> http://www.w3.org/2000/09/xmldsig#; 
> Id="sig08744308748201048377">
> 
>  Algorithm="http://www.w3.org/2006/12/xml-c14n11;>
>  Algorithm="http://www.w3.org/2001/04/xmldsig-more#rsa-sha256;>
> 
> 
> http://www.w3.org/2002/06/xmldsig-filter2;>
>  xmlns:dsig-xpath="http://www.w3.org/2002/06/xmldsig-filter2; 
> Filter="subtract">/descendant::ds:Signature
> 
> http://www.w3.org/TR/1999/REC-xslt-19991116;>
> http://www.kion.it/webesse3/multilingua; 
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform; 
> exclude-result-prefixes="kion" version="1.0">
>   
>   
>   
>select="/VERBALI/VERBALE">
>select="/VERBALI/VERBALE/SOSTITUZIONE_DOCUMENTO">
>select="/VERBALI/VERBALE/RAGGRUPPAMENTO">
>select="/VERBALI/VERBALE/COMMISSIONE">
>   
>   
>   
>   
>http-equiv="Content-Type">
>
>test="$sostituzione_root">
>   Dichiarazione 
> conformità Verbale Esame
>   
>   
>   Verbalizzazione 
> esame
>   
>   
>   
>td  {font-family: Arial; font-size:10pt;} 
>div {font-family: Arial; font-size:10pt;}
>pre {font-family: Arial; font-size:10pt;} 
>   
>   
>   
>   
>
>test="$sostituzione_root">
>colspan="2"> select="$verbale_root/ATENEO_DES">
>colspan="2">DICHIARAZIONE DI 
> CONFORMITÀ
>colspan="2">Il sottoscritto  select="$verbale_root/TITOLARE_PROCEDIMENTO">, docente di 
> 
>  
>   
>   
>     
>   
>test="$sostituzione_root/MOTIVAZIONE">
>   
> PREMESSO CHE
>   
>  
>   
>  select="$sostituzione_root/MOTIVAZIONE">
>   
>  
>   
> 
>   
>   
>   
>   
> DICHIARA
>    
> 
>  

[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-891:
---
Fix Version/s: (was: 1.11)
   1.12

> Use POST in addition to PUT on method calls in tika-server
> --
>
> Key: TIKA-891
> URL: https://issues.apache.org/jira/browse/TIKA-891
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: newbie
> Fix For: 1.12
>
>
> Per Jukka's email:
> http://s.apache.org/uR
> It would be a better use of REST/HTTP "verbs" to use POST to put content to a 
> resource where we don't intend to store that content (which is the 
> implication of PUT). Max suggested adding:
> {code}
> @POST
> {code}
> annotations to the methods we are currently exposing using PUT to take care 
> of this. 





[jira] [Updated] (TIKA-1705) Update ASM dependency to 5.0.4

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1705:

Fix Version/s: (was: 1.11)
   1.12

> Update ASM dependency to 5.0.4
> --
>
> Key: TIKA-1705
> URL: https://issues.apache.org/jira/browse/TIKA-1705
> Project: Tika
>  Issue Type: Task
>Affects Versions: 1.7
>Reporter: Uwe Schindler
>Assignee: Dave Meikle
> Fix For: 1.12
>
> Attachments: TIKA-1705-2.patch, TIKA-1705.patch
>
>
> Currently the class file parser uses ASM 4.1. This older version cannot read 
> Java 8 / Java 9 class files (it fails with an exception).
> The upgrade to ASM 5.0.4 is very simple: just a Maven dependency change. The 
> only code change is to update the visitor version so that new Java 8 
> features like lambdas are reported; this is not strictly required, but should 
> be done for full support.
> FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 
> 5, too.
> You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no 
> problem with Lucene using a newer version). Since ASM 4.x, updates are 
> easier (no visitor interfaces anymore, abstract classes instead), so nothing 
> breaks if you just replace the JAR file. So just see this as a 
> recommendation, not urgent! Solr/Lucene will also work without this patch 
> (it just replaces the shipped ASM with a newer version in our packaging).
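For anyone checking which class files are affected, the class file format version can be read straight from the file header without ASM. The sketch below is JDK-only and makes no claim about how ASM itself performs its version check; ClassFileVersion and majorVersion are hypothetical names. Major version 51 corresponds to Java 7, 52 to Java 8 and 53 to Java 9.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

class ClassFileVersion {
    /** Returns the major version from a class file header (e.g. 52 for Java 8). */
    static int majorVersion(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        // Every class file starts with the 4-byte magic number 0xCAFEBABE.
        if (data.readInt() != 0xCAFEBABE) {
            throw new IOException("not a class file");
        }
        data.readUnsignedShort();        // minor version, unused here
        return data.readUnsignedShort(); // major version
    }

    public static void main(String[] args) throws IOException {
        // Hand-built header: magic, minor version 0, major version 52.
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        System.out.println(majorVersion(new ByteArrayInputStream(header)));
    }
}
```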





[jira] [Updated] (TIKA-987) Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-987:
---
Fix Version/s: (was: 1.11)
   1.12

> Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
> 
>
> Key: TIKA-987
> URL: https://issues.apache.org/jira/browse/TIKA-987
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.12
>
> Attachments: picture.doc, picture_3.doc
>
>
> I have two Word docs, both containing the same drawing, but one has
> text added.
> In one case (picture.doc) the extraction is correct: it contains only
> an embedded image.wmf; when I view the image it's correct.
> In the second case (picture_3.doc) the picture is extracted as image
> (no extension), and is 0 bytes, and there is an invalid character
> (mapped to unicode replacement char) inserted before the image:
> {noformat}
> 
> 
> �
> 
> 
> vehicle
> 
> {noformat}
> (Though, the text "vehicle" is extracted correctly).
> I dug a bit, and with the 2nd doc there is an embedded {SHAPE *
> MERGEFORMAT} field, which we invoke
> WordExtractor.handleSpecialCharacterRuns on, and somehow it extracts
> the 0-byte no-extension image as well as the invalid character.  With
> the first doc there is no field (at least not one that's handled by
> handleSpecialCharacterRuns...).  Otherwise I'm not sure how to
> fix... it could be something is going wrong in how POI parses the
> Pictures from PictureSource.





[jira] [Updated] (TIKA-1395) Create embedded image extraction example

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1395:

Fix Version/s: (was: 1.11)
   1.12

> Create embedded image extraction example
> 
>
> Key: TIKA-1395
> URL: https://issues.apache.org/jira/browse/TIKA-1395
> Project: Tika
>  Issue Type: Sub-task
>  Components: example
>Reporter: Tyler Palsulich
>Priority: Minor
> Fix For: 1.12
>
>
> Create an example of how to do embedded image extraction and parsing.





[jira] [Updated] (TIKA-1295) Make some Dublin Core items multi-valued

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1295:

Fix Version/s: (was: 1.11)
   1.12

> Make some Dublin Core items multi-valued
> 
>
> Key: TIKA-1295
> URL: https://issues.apache.org/jira/browse/TIKA-1295
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.12
>
>
> According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, 
> dc:title, dc:description and dc:rights should allow multiple values because 
> of language alternatives.  Unless anyone objects in the next few days, I'll 
> switch those to Property.toInternalTextBag() from Property.toInternalText().  
> I'll also modify PDFParser to extract dc:rights.
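The difference between a single-valued text property and a text bag can be sketched with plain collections. This is not Tika's Metadata or Property API; BagMetadata and its methods are hypothetical names used only to show why language alternatives need a multi-valued field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a metadata store; not Tika's Metadata class. */
class BagMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Single-valued behaviour: a new value replaces whatever was there. */
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    /** Bag behaviour: a new value is appended, so alternatives survive. */
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    List<String> getValues(String name) {
        return values.getOrDefault(name, Collections.emptyList());
    }
}
```

With set, a German dc:title would overwrite an English one; with add, both are kept.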





[jira] [Updated] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1059:

Fix Version/s: (was: 1.11)
   1.12

> Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
> --
>
> Key: TIKA-1059
> URL: https://issues.apache.org/jira/browse/TIKA-1059
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3
>Reporter: Ray Gauss II
> Fix For: 1.12
>
>
> The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch 
> {{InterruptedException}} and ignore it.
> The methods should either call {{interrupt()}} on the current thread or 
> re-throw the exception, possibly wrapped in a {{TikaException}}.
> See TIKA-775 for a previous discussion.
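The two options named above can be sketched as follows. The names are hypothetical: ParseFailure merely stands in for TikaException, and Blocking stands in for whatever blocking call the external process wrapper makes. This is not the code in ExternalParser or ExternalEmbedder.

```java
class InterruptHandling {
    /** Hypothetical stand-in for TikaException. */
    static class ParseFailure extends Exception {
        ParseFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical stand-in for a blocking call such as waiting on a process. */
    interface Blocking {
        void run() throws InterruptedException;
    }

    /** Option 1: restore the interrupt flag so callers can still see it. */
    static boolean runRestoringFlag(Blocking task) {
        try {
            task.run();
            return true;
        } catch (InterruptedException e) {
            // Catching the exception clears the flag, so set it again.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Option 2: rethrow, wrapped in the parser's own checked exception. */
    static void runOrThrow(Blocking task) throws ParseFailure {
        try {
            task.run();
        } catch (InterruptedException e) {
            throw new ParseFailure("interrupted while waiting", e);
        }
    }
}
```

Either way the interruption stays visible to the caller instead of being swallowed.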





[jira] [Updated] (TIKA-1688) Tika Version in Metadata

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1688:

Fix Version/s: (was: 1.11)
   1.12

> Tika Version in Metadata
> 
>
> Key: TIKA-1688
> URL: https://issues.apache.org/jira/browse/TIKA-1688
> Project: Tika
>  Issue Type: Improvement
>Reporter: Paul Ramirez
>Priority: Minor
> Fix For: 1.12
>
>
> Could this be added as X-Tika:version? That way, downstream consumers could 
> trace an extraction back to the Tika version that produced it.





[jira] [Updated] (TIKA-1540) New Tika plugin for image based feature extraction using computer vision techniques

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1540:

Fix Version/s: (was: 1.11)
   1.12

> New Tika plugin for image based feature extraction using computer vision 
> techniques
> ---
>
> Key: TIKA-1540
> URL: https://issues.apache.org/jira/browse/TIKA-1540
> Project: Tika
>  Issue Type: New Feature
> Environment: cross platform
>Reporter: Aashish Chaudhary
>Assignee: Lewis John McGibbney
>  Labels: gsoc2015
> Fix For: 1.12
>
> Attachments: TIKA-vision.achaudhary.150209.patch.txt
>
>
> This will be a web-service client based parser to perform image feature 
> extraction using Computer Vision techniques. 





[jira] [Updated] (TIKA-1456) Visual Sentiment API parser

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1456:

Fix Version/s: (was: 1.11)
   1.12

> Visual Sentiment API parser
> ---
>
> Key: TIKA-1456
> URL: https://issues.apache.org/jira/browse/TIKA-1456
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>  Labels: gsoc2015
> Fix For: 1.12
>
>
> Integrate the Visual Sentibank API as a parser for images. We can use 
> Aperture from CMU, it's released under the MIT license:
> https://github.com/d8w/aperture





[jira] [Updated] (TIKA-1106) CLAVIN Integration

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1106:

Fix Version/s: (was: 1.11)
   1.12

> CLAVIN Integration
> --
>
> Key: TIKA-1106
> URL: https://issues.apache.org/jira/browse/TIKA-1106
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.3
> Environment: All
>Reporter: Adam Estrada
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: entity, geospatial, new-parser
> Fix For: 1.12
>
>
> I've been evaluating CLAVIN as a way to extract location information from 
> unstructured text. It seems like meshing it with Tika in some way would make 
> a lot of sense. From CLAVIN website...
> {quote}
> CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source 
> software package for document geotagging and geoparsing that employs 
> context-based geographic entity resolution. It combines a variety of open 
> source tools with natural language processing techniques to extract location 
> names from unstructured text documents and resolve them against gazetteer 
> records. Importantly, CLAVIN does not simply "look up" location names; 
> rather, it uses intelligent heuristics in an attempt to identify precisely 
> which "Springfield" (for example) was intended by the author, based on the 
> context of the document. CLAVIN also employs fuzzy search to handle 
> incorrectly-spelled location names, and it recognizes alternative names 
> (e.g., "Ivory Coast" and "Côte d'Ivoire") as referring to the same geographic 
> entity. By enriching text documents with structured geo data, CLAVIN enables 
> hierarchical geospatial search and advanced geospatial analytics on 
> unstructured data.
> {quote}
> There was only one other instance of the word "clavin" mentioned in the ASF 
> jira site so I thought it was definitely worth posting here.
> https://github.com/Berico-Technologies/CLAVIN





[jira] [Updated] (TIKA-1328) Translate Metadata and Content

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1328:

Fix Version/s: (was: 1.11)
   1.12

> Translate Metadata and Content
> --
>
> Key: TIKA-1328
> URL: https://issues.apache.org/jira/browse/TIKA-1328
> Project: Tika
>  Issue Type: New Feature
>  Components: translation
>Reporter: Tyler Palsulich
> Fix For: 1.12
>
>
> Right now, Translation is only done on Strings. Ideally, users would be able 
> to "turn on" translation while parsing. I can think of a couple options:
> - Make a TranslateAutoDetectParser. Automatically detect the file type, parse 
> it, then translate the content.
> - Make a Context switch. When true, translate the content regardless of the 
> parser used. I'm not sure of the best way to go about this method, but I prefer 
> it over another Parser.
> Regardless, we need a black or white list for translation. I think black list 
> would be the way to go -- which fields should not be translated (dates, 
> versions, ...) Any ideas? Also, somewhat unrelated, does anyone know of any 
> other open source translation libraries? If we were really lucky, it wouldn't 
> depend on an online service.
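The black list idea could look roughly like the sketch below: every metadata value is passed through a translator unless its field name is on the do-not-translate list. All names here are hypothetical (MetadataTranslation, translate, the doNotTranslate set), a plain Map stands in for Tika's Metadata, and the translator is just a function from String to String rather than any real translation service.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

class MetadataTranslation {
    /**
     * Returns a copy of the metadata in which every value is translated,
     * except for fields named in doNotTranslate (dates, versions, ...).
     */
    static Map<String, String> translate(Map<String, String> metadata,
                                         Set<String> doNotTranslate,
                                         UnaryOperator<String> translator) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String value = doNotTranslate.contains(entry.getKey())
                    ? entry.getValue()                    // black-listed: keep as is
                    : translator.apply(entry.getValue()); // everything else: translate
            out.put(entry.getKey(), value);
        }
        return out;
    }
}
```

A white list would be the same loop with the condition inverted; the black list has the advantage that newly added text fields are translated by default.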





[jira] [Updated] (TIKA-1516) Downgrade Rome dependency to 0.9 to avoid nasty NPE

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1516:

Fix Version/s: (was: 1.11)
   1.12

> Downgrade Rome dependency to 0.9 to avoid nasty NPE
> ---
>
> Key: TIKA-1516
> URL: https://issues.apache.org/jira/browse/TIKA-1516
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
> Attachments: TIKA-1516.patch
>
>
> As documented [in this 
> thread|http://www.mail-archive.com/dev%40nutch.apache.org/msg15755.html], 
> Nutch's 
> [parse-tika|https://github.com/apache/nutch/blob/trunk/src/plugin/parse-tika/plugin.xml#L56]
>  uses Rome 1.0, which is inherited directly from the Tika pom.xml for the 
> [same 
> dependency|https://github.com/apache/tika/blob/trunk/tika-parsers/pom.xml#L184].
> A downgrade is required.
> {code}
> java.lang.Exception: java.lang.ExceptionInInitializerError
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
> Caused by: java.lang.ExceptionInInitializerError
> at com.sun.syndication.io.SyndFeedInput.build(SyndFeedInput.java:136)
> at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:70)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:105)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:101)
> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.NullPointerException
> at java.util.Properties$LineReader.readLine(Properties.java:418)
> at java.util.Properties.load0(Properties.java:337)
> at java.util.Properties.load(Properties.java:325)
> at 
> com.sun.syndication.io.impl.PropertiesLoader.(PropertiesLoader.java:74)
> at 
> com.sun.syndication.io.impl.PropertiesLoader.getPropertiesLoader(PropertiesLoader.java:46)
> at 
> com.sun.syndication.io.impl.PluginManager.(PluginManager.java:54)
> at 
> com.sun.syndication.io.impl.PluginManager.(PluginManager.java:46)
> at 
> com.sun.syndication.feed.synd.impl.Converters.(Converters.java:40)
> at 
> com.sun.syndication.feed.synd.SyndFeedImpl.(SyndFeedImpl.java:59)
> ... 16 more
> {code}





[jira] [Updated] (TIKA-1672) Integrate tika-java7 component

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1672:

Fix Version/s: (was: 1.11)
   1.12

> Integrate tika-java7 component
> --
>
> Key: TIKA-1672
> URL: https://issues.apache.org/jira/browse/TIKA-1672
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tyler Palsulich
> Fix For: 1.12
>
>
> Code requiring Java 7 doesn't need to be in a separate module now that 
> TIKA-1536 (upgrade to Java 7) is done.





[jira] [Updated] (TIKA-1513) Add mime detection and parsing for dbf files

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1513:

Fix Version/s: (was: 1.11)
   1.12

> Add mime detection and parsing for dbf files
> 
>
> Key: TIKA-1513
> URL: https://issues.apache.org/jira/browse/TIKA-1513
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.12
>
>
> I just came across an Apache licensed dbf parser that is available on 
> [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
> Let's add dbf parsing to Tika.
> Any other recommendations for alternate parsers?





[jira] [Updated] (TIKA-1674) Add example to show how to extract embedded files

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1674:

Fix Version/s: (was: 1.11)
   1.12

> Add example to show how to extract embedded files
> -
>
> Key: TIKA-1674
> URL: https://issues.apache.org/jira/browse/TIKA-1674
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.12
>
>
> On tika-user, we received a question on how to extract embedded files.  Let's 
> add an example.





[jira] [Updated] (TIKA-1746) modify TikaFileTypeDetector to use new detect method accepting java.nio.file.Path

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1746:

Fix Version/s: (was: 1.11)
   1.12

> modify TikaFileTypeDetector to use new detect method accepting 
> java.nio.file.Path
> -
>
> Key: TIKA-1746
> URL: https://issues.apache.org/jira/browse/TIKA-1746
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>Reporter: Yaniv Kunda
>Priority: Minor
>  Labels: java7
> Fix For: 1.12
>
> Attachments: TIKA-1746.patch
>
>
> Utilize the new org.apache.tika.Tika.detect(Path) method





[jira] [Commented] (TIKA-1772) Mimetype of VTT files

2015-10-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962582#comment-14962582
 ] 

Hudson commented on TIKA-1772:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #872 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/872/])
Fix for TIKA-1772: Mimetype of VTT files contributed by Alexander Widera 
 this closes #59. (mattmann: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev=1709302])
* trunk/CHANGES.txt


> Mimetype of VTT files
> -
>
> Key: TIKA-1772
> URL: https://issues.apache.org/jira/browse/TIKA-1772
> Project: Tika
>  Issue Type: Improvement
>Reporter: Alexander Widera
>Priority: Minor
> Fix For: 1.11
>
> Attachments: upc-video-subtitles-en.vtt
>
>
> Files with extension "vtt" are "WebVTT: The Web Video Text Tracks Format" 
> files.
> The mimetype resolved by tika is currently text/plain.
> The correct mimetype should be text/vtt.
> see: https://w3c.github.io/webvtt/
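
The extension-to-type mapping requested above can be illustrated with the JDK's
own java.nio.file.spi.FileTypeDetector hook. This is only a sketch of a
name-based lookup: the class name VttFileTypeDetector is made up for the
example, and it is not how Tika resolves types (Tika consults its own
mime-type registry).

```java
import java.nio.file.Path;
import java.nio.file.spi.FileTypeDetector;
import java.util.Locale;

// Hypothetical detector for illustration only; name-based, no content sniffing.
final class VttFileTypeDetector extends FileTypeDetector {

    @Override
    public String probeContentType(Path path) {
        Path name = path.getFileName();
        if (name == null) {
            return null; // e.g. a root path has no file name
        }
        String lower = name.toString().toLowerCase(Locale.ROOT);
        // Returning null means "unknown" to callers of this detector.
        return lower.endsWith(".vtt") ? "text/vtt" : null;
    }
}
```

A file named like the attachment on this issue (upc-video-subtitles-en.vtt)
would map to text/vtt; any other name yields null.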





[jira] [Commented] (TIKA-1672) Integrate tika-java7 component

2015-10-18 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962596#comment-14962596
 ] 

Yaniv Kunda commented on TIKA-1672:
---

Here are some names I suggested:
- tika-java7-spi
- tika-java7-filetypedetector
- tika-java7-detector-spi


> Integrate tika-java7 component
> --
>
> Key: TIKA-1672
> URL: https://issues.apache.org/jira/browse/TIKA-1672
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tyler Palsulich
> Fix For: 1.12
>
>
> Code requiring Java 7 doesn't need to be in a separate module now that 
> TIKA-1536 (upgrade to Java 7) is done.





[DISCUSS] 1.11 RC #1 today

2015-10-18 Thread Mattmann, Chris A (3980)
Hey Folks,

For real this time I’m going to cut a 1.11 RC #1 before end of
day today. 24 issues fixed.

Will send out the VOTE shortly. Also of note going for a Nutch 1.11
RC #1 today too.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





[jira] [Updated] (TIKA-1724) Create parser for .obo file format.

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1724:

Fix Version/s: (was: 1.11)
   1.12

> Create parser for .obo file format.
> ---
>
> Key: TIKA-1724
> URL: https://issues.apache.org/jira/browse/TIKA-1724
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
> Attachments: TIKA-1724.patch
>
>
> This parser implementation caters for files of the [OBO Flat File Format 
> Guide, version 1.4|http://purl.obolibrary.org/obo/oboformat/spec.html] 
> MimeType.
> The OBO format is the text file format used by OBO-Edit, the open source, 
> platform-independent application for viewing and editing ontologies. This 
> file format is used heavily within the clinical and biomedical fields as a 
> particular flat file serialization for ontologies. .obo files are 'typically' 
> accompanied by corresponding .owl serializations as this is also another file 
> format used pervasively within the clinical and biomedical fields.
> I would sincerely appreciate code review. Thanks.





[jira] [Updated] (TIKA-776) ExifTool Embedder

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-776:
---
Fix Version/s: (was: 1.11)
   1.12

> ExifTool Embedder
> -
>
> Key: TIKA-776
> URL: https://issues.apache.org/jira/browse/TIKA-776
> Project: Tika
>  Issue Type: New Feature
>  Components: metadata
>Affects Versions: 1.0
> Environment: ExifTool is required 
> (http://www.sno.phy.queensu.ca/~phil/exiftool/)
>Reporter: Ray Gauss II
>  Labels: embed, exiftool, patch
> Fix For: 1.12
>
> Attachments: tika-parsers-exiftool-embed-patch.txt
>
>
> This patch adds an ExifTool ExternalEmbedder which builds upon the work in 
> issues TIKA-774 and TIKA-775.
> In tika-parsers, an ExiftoolExternalEmbedder is added. It extends 
> ExternalEmbedder to programmatically create an Embedder which calls the 
> ExifTool command line to embed Tika metadata into a file stream. An 
> ExiftoolExternalEmbedderTest unit test is also added; it embeds several IPTC 
> and XMP fields, then parses the resulting file stream to verify the operation.





[jira] [Updated] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1308:

Fix Version/s: (was: 1.11)
   1.12

> Support in memory parse mode(don't create temp file): to support run Tika in 
> GAE
> 
>
> Key: TIKA-1308
> URL: https://issues.apache.org/jira/browse/TIKA-1308
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: jefferyyuan
>  Labels: gae
> Fix For: 1.12
>
>
> I am trying to use Tika in GAE and have written a simple servlet to extract 
> metadata from a JPEG:
> {code}
> String urlStr = req.getParameter("imageUrl");
> byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
> ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
> Metadata metadata = new Metadata();
> BodyContentHandler ch = new BodyContentHandler();
> AutoDetectParser parser = new AutoDetectParser();
> parser.parse(bais, ch, metadata, new ParseContext());
> bais.close();
> {code}
> This fails with an exception:
> {code}
> Caused by: java.lang.SecurityException: Unable to create temporary file
>   at java.io.File.createTempFile(File.java:1986)
>   at 
> org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
>   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
>   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> {code}
> I checked the code: 
> org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, 
> Metadata, ParseContext) creates a temp file from the input stream.
> I can understand why Tika creates a temp file from the stream: so that it can 
> parse the content multiple times.
> But as GAE and other cloud servers are getting more popular, is it possible 
> to avoid creating the temp file? Instead, we could copy the original stream 
> to a byte array stream, so Tika can still parse it multiple times.
> -- This will put a limit on the file size, as Tika keeps the whole file in 
> memory, but it would let Tika work in GAE and maybe on other cloud servers.
> We could add a parameter to parser.parse to indicate whether to parse in 
> memory only.
>  
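
The in-memory idea from the description above can be sketched with the JDK
alone: drain the stream into a byte array once, then hand out a fresh
ByteArrayInputStream for every pass. The class InMemorySource and its maxBytes
limit are assumptions made for this example; neither is part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not a Tika class: buffers a stream in memory so it
// can be read several times without touching the file system.
final class InMemorySource {
    private final byte[] data;

    private InMemorySource(byte[] data) {
        this.data = data;
    }

    // Drains the stream into a byte array, refusing inputs above maxBytes.
    static InMemorySource buffer(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (out.size() + n > maxBytes) {
                throw new IOException("input exceeds in-memory limit of " + maxBytes + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return new InMemorySource(out.toByteArray());
    }

    // Each call returns an independent stream positioned at the start.
    InputStream newStream() {
        return new ByteArrayInputStream(data);
    }

    int size() {
        return data.length;
    }
}
```

The explicit size limit reflects the trade-off noted in the description: the
whole file sits in memory, so large inputs have to be rejected or handled some
other way.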





[jira] [Updated] (TIKA-988) We don't extract a placeholder for a Word document embedded in an Excel document

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-988:
---
Fix Version/s: (was: 1.11)
   1.12

> We don't extract a placeholder for a Word document embedded in an Excel 
> document
> 
>
> Key: TIKA-988
> URL: https://issues.apache.org/jira/browse/TIKA-988
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.12
>
> Attachments: bug31373.xls
>
>
> In TIKA-956 we fixed the Word parser so that at the point where an embedded 
> document appears, we output a  tag.
> It would be nice to do this for documents embedded in Excel too.





[jira] [Updated] (TIKA-1465) Implement extraction of non-global variables from netCDF3 and netCDF4

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1465:

Fix Version/s: (was: 1.11)
   1.12

> Implement extraction of non-global variables from netCDF3 and netCDF4
> -
>
> Key: TIKA-1465
> URL: https://issues.apache.org/jira/browse/TIKA-1465
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
>
> Speaking to Eric Nienhouse at the ongoing NSF-funded Polar 
> Cyberinfrastructure hackathon in NYC, we became aware that variable 
> parameters contained within netCDF3 and netCDF4 are just as valuable (if not 
> more valuable) as global attribute values.
> AFAIK, right now we only extract global attributes; however, we could extend 
> the support to cater for the above observation.





[jira] [Updated] (TIKA-980) MicrodataContentHandler for Apache Tika

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-980:
---
Fix Version/s: (was: 1.11)
   1.12

> MicrodataContentHandler for Apache Tika
> ---
>
> Key: TIKA-980
> URL: https://issues.apache.org/jira/browse/TIKA-980
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Ken Krugler
> Fix For: 1.12
>
> Attachments: TIKA-980-1.3-1.patch, TIKA-980-1.3-2.patch, 
> TIKA-980-1.3-3.patch, TIKA-980-1.3-4.patch, TIKA-980-1.3-5.patch
>
>
> ContentHandler for Apache Tika capable of building a data structure 
> containing Microdata item scopes and item properties. The Item* classes are 
> borrowed from the Apache Any23 project and are slightly modified to 
> accommodate this SAX-based extractor vs the original DOM-based extractor.
> The provided unit test outputs two item scopes about the Europe and NA 
> ApacheCon events and each has a nested property.





[jira] [Updated] (TIKA-1505) chmparser breaks down when extracting from file of CHM format v3

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1505:

Fix Version/s: (was: 1.11)
   1.12

> chmparser breaks down when extracting from file of CHM format v3
> 
>
> Key: TIKA-1505
> URL: https://issues.apache.org/jira/browse/TIKA-1505
> Project: Tika
>  Issue Type: Bug
>Reporter: Bin Hawking
> Fix For: 1.12
>
>
> chmparser throws exception or returns faulty text when:
> 1. extracting from file of CHM format version 3
> 2. chm file with lzx reset interval > 2
> 3. chm file with >5000 objects
> I am making the fix now.





[jira] [Updated] (TIKA-1751) Use java.nio.file.Path in TikaConfig

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1751:

Fix Version/s: (was: 1.11)
   1.12

> Use java.nio.file.Path in TikaConfig
> 
>
> Key: TIKA-1751
> URL: https://issues.apache.org/jira/browse/TIKA-1751
> Project: Tika
>  Issue Type: Sub-task
>  Components: config
>Reporter: Yaniv Kunda
>Priority: Minor
>  Labels: java7
> Fix For: 1.12
>
> Attachments: TIKA-1751.patch
>
>
> Provide constructors accepting java.nio.file.Path





[jira] [Updated] (TIKA-1609) Leverage Google's LibPhonenumber for enhanced phone number extraction and metadata modeling

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1609:

Fix Version/s: (was: 1.11)
   1.12

> Leverage Google's LibPhonenumber for enhanced phone number extraction and 
> metadata modeling
> ---
>
> Key: TIKA-1609
> URL: https://issues.apache.org/jira/browse/TIKA-1609
> Project: Tika
>  Issue Type: New Feature
>  Components: core
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
>
> Google's Libphonenumber can provide us with comprehensive support for 
> modeling Phone number metadata properly in Tika.
> During the development of this patch I realized three things, namely
>  * This is not a parser as such, since phone numbers are not mapped to any 
> particular Mimetype
>  * In addition, there can be many phone numbers per document, so this is most 
> likely a Content Handler of sorts
>  * Tika's Metadata support is currently too restrictive to allow us to 
> persist many complex objects e.g. String, Object. We need to expand Metadata 
> support over and above String, String[].
> https://github.com/googlei18n/libphonenumber/
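
The "Content Handler of sorts" idea above can be sketched as a SAX handler that
buffers character data and collects every phone-number-like match. Everything
named here is an assumption for illustration: the class PhoneNumberCollector is
hypothetical, and a deliberately naive regex stands in for libphonenumber,
whose matching is far richer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: a naive regex stands in for libphonenumber here.
final class PhoneNumberCollector extends DefaultHandler {
    // Assumption: "+" followed by 8 to 15 digits, with no separators.
    private static final Pattern PHONE = Pattern.compile("\\+\\d{8,15}");

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may split text across calls, so buffer before matching.
        text.append(ch, start, length);
    }

    // Returns every match found in the text seen so far, in document order.
    List<String> phoneNumbers() {
        List<String> found = new ArrayList<String>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

Because the handler returns a list rather than a single value, it also shows
why a document-level result does not fit neatly into a single String entry.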





[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-819:
---
Fix Version/s: (was: 1.11)
   1.12

> Make Option to Exclude Embedded Files' Text for Text Content
> 
>
> Key: TIKA-819
> URL: https://issues.apache.org/jira/browse/TIKA-819
> Project: Tika
>  Issue Type: New Feature
>  Components: general
>Affects Versions: 1.0
> Environment: Windows-7 + JDK 1.6 u26
>Reporter: Albert L.
> Fix For: 1.12
>
>
> It would be nice to be able to disable text content from embedded files.
> For example, if I have a DOCX with an embedded PPTX, then I would like the 
> option to disable text from the PPTX from showing up when asking for the text 
> content from DOCX.  In other words, it would be nice to have the option to 
> get text content *only* from the DOCX instead of the DOCX+PPTX.





[jira] [Updated] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1726:

Fix Version/s: (was: 1.11)
   1.12

> Augment public methods that use a java.io.File with methods that use a 
> java.nio.file.Path
> -
>
> Key: TIKA-1726
> URL: https://issues.apache.org/jira/browse/TIKA-1726
> Project: Tika
>  Issue Type: Improvement
>  Components: batch, core, gui, parser, translation
>Reporter: Yaniv Kunda
>Priority: Minor
>  Labels: java7
> Fix For: 1.12
>
>
> With Java 7 already EOL, it's high time we add support for the new 
> java.nio.file.Path class introduced with it, which, together with support 
> methods in java.nio.file.Files and others, provides a better file I/O 
> framework than java.io.File.
> In just two cases, we have public methods in tika that only return a File 
> object, and cannot be overloaded, so a different name for the new method must 
> be created:
> - {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
> _Suggestions:_
> -- addTemporaryFile
> -- addTempFile
> -- createTempFile
> -- createTemporaryPath
> - {{org.apache.tika.io.TikaInputStream#getFile()}}
> _Suggestions:_
> -- asFile
> -- toPath
> -- getPath
> In other cases, the methods accept a File as an argument, and should remain 
> as tika users might be using them - so an overloaded method that accepts a 
> Path instead should be added, referencing the new method from the old one 
> (using the @see tag) until java.io.File itself is deprecated or otherwise 
> becomes obsolete.
> Here is the full list of other methods:
> _tika-app:_
> - {{org.apache.tika.gui.TikaGUI#openFile(File)}}
> _tika-batch:_
> - {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
> HANDLE_EXISTING, String)}}
> - {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
> - {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
> - 
> {{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
> - {{org.apache.tika.batch.fs.FSFileResource}} constructor
> - {{org.apache.tika.batch.fs.FSListCrawler}} constructor
> - {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
> - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
> File)}}
> - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
> - {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor
> _tika-core:_
> - {{org.apache.tika.Tika#detect(File)}}
> - {{org.apache.tika.Tika#parse(File)}}
> - {{org.apache.tika.Tika#parseToString(File)}}
> - {{org.apache.tika.config.TikaConfig}} constructors
> - {{org.apache.tika.detect.NNExampleModelDetector}} constructor
> - {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}}
> - {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}}
> - {{org.apache.tika.io.TikaInputStream#get(File)}}
> - {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}}
> - {{org.apache.tika.parser.ParsingReader}} constructor
> _tika-parsers:_
> - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}}
> - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}}
> - {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor
> _tika-translate:_
> - 
> {{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String,
>  String[], File)}}
> Due to lack of evidence, all public methods in public non-test classes (and 
> not in tika-example) are deemed part of a public API - although there's no 
> formal definition of such.
> If anyone knows of a public method which isn't accessed publicly and can be 
> defined as package-private, or for another reason, please comment.
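
The overload pattern proposed above (keep the File method, add a Path method,
and point from the old one to the new one with @see) might look like the
following. TextLoader is a hypothetical class used only to show the shape; it
is not one of the Tika types listed in the issue.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class, not a Tika type: shows the overload pattern only.
final class TextLoader {

    /**
     * Legacy entry point, kept so existing callers continue to compile.
     *
     * @see #load(Path)
     */
    static String load(File file) throws IOException {
        return load(file.toPath());
    }

    /** New entry point; all of the real work happens here. */
    static String load(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }
}
```

Delegating from the File overload to the Path overload keeps a single
implementation, so the two entry points cannot drift apart.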





[jira] [Updated] (TIKA-1640) Make ExternalParser support aliases for key names in extracted metadata

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1640:

Fix Version/s: (was: 1.11)
   1.12

> Make ExternalParser support aliases for key names in extracted metadata
> ---
>
> Key: TIKA-1640
> URL: https://issues.apache.org/jira/browse/TIKA-1640
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
>
> Over in TIKA-1639, we were discussing the work outside of Tika that [~rgauss] 
> did (per [~gagravarr]) on the EXIFTool parsing. I added support in TIKA-1639 
> for this, but one thing Ray's code-based work did that my config oriented 
> work didn't is allow for renaming extracted metadata key names to better 
> support having consistent metadata across parsers.
> Here's one way to do it:
> ExternalParser could have a config section like so:
> {code:xml}
> 
>   
>   
> 
> {code}
> Then this could be used to rename metadata keys.
> I'll implement that in this issue.
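
The renaming step described above could be sketched as a small lookup applied
to the extracted key/value pairs. This is not the ExternalParser
implementation: the class MetadataKeyAliases and the example key names are
assumptions for illustration, and the sketch says nothing about how the aliases
would be read from the config section.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not the ExternalParser implementation.
final class MetadataKeyAliases {
    private final Map<String, String> aliases = new LinkedHashMap<String, String>();

    // Registers that keys named 'from' should be emitted as 'to'.
    MetadataKeyAliases alias(String from, String to) {
        aliases.put(from, to);
        return this;
    }

    // Returns a copy of 'extracted' with aliased keys renamed; other keys
    // pass through unchanged and insertion order is preserved.
    Map<String, String> apply(Map<String, String> extracted) {
        Map<String, String> renamed = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            String key = aliases.containsKey(e.getKey()) ? aliases.get(e.getKey()) : e.getKey();
            renamed.put(key, e.getValue());
        }
        return renamed;
    }
}
```

For example, aliasing an external tool's "Image Width" key to "tiff:ImageWidth"
would let its output line up with keys produced by other parsers.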





[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1607:

Fix Version/s: (was: 1.11)
   1.12

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.12
>
> Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
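
The String-to-Object proposal above can be sketched as a small container that
stores arbitrary values per key while still offering a String[] view for older
callers. ObjectMetadata is a hypothetical class made up for this example; it is
not Tika's Metadata class and does not reflect any of the attached patches.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container, not Tika's Metadata class.
final class ObjectMetadata {
    private final Map<String, List<Object>> values = new LinkedHashMap<String, List<Object>>();

    // Appends a value of any type under the given key.
    void add(String key, Object value) {
        List<Object> list = values.get(key);
        if (list == null) {
            list = new ArrayList<Object>();
            values.put(key, list);
        }
        list.add(value);
    }

    // Returns all values for a key, or an empty list if none were added.
    List<Object> get(String key) {
        List<Object> list = values.get(key);
        return list == null ? new ArrayList<Object>() : list;
    }

    // Backwards-compatible view: every value rendered via String.valueOf().
    String[] getStrings(String key) {
        List<Object> list = get(key);
        String[] out = new String[list.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = String.valueOf(list.get(i));
        }
        return out;
    }
}
```

A phone number could then be stored as a Map of its attributes (country code,
number type, and so on) under the "phonenumbers" key, while getStrings keeps a
path open for code that still expects String[].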





[jira] [Updated] (TIKA-1435) Update rome dependency to 1.5

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1435:

Fix Version/s: (was: 1.11)
   1.12

> Update rome dependency to 1.5
> -
>
> Key: TIKA-1435
> URL: https://issues.apache.org/jira/browse/TIKA-1435
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Johannes Mockenhaupt
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.12
>
> Attachments: netcdf-deps-changes.diff
>
>
> Rome 1.5 has been released to Sonatype 
> (https://github.com/rometools/rome/issues/183), though the website 
> (http://rometools.github.io/rome/) is blissfully ignorant of that. The update 
> is mostly maintenance, adopting slf4j and generics as well as moving the 
> namespace from _com.sun.syndication_ to _com.rometools_. PR upcoming.





[jira] [Updated] (TIKA-1598) Parser Implementation for Streaming Video

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1598:

Fix Version/s: (was: 1.11)
   1.12

> Parser Implementation for Streaming Video
> -
>
> Key: TIKA-1598
> URL: https://issues.apache.org/jira/browse/TIKA-1598
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: memex
> Fix For: 1.12
>
>
> A number of us have been discussing a Tika implementation which could, for 
> example, bind to a live multimedia stream and parse content from the stream 
> until it finished.
> An excellent example would be watching Bonnie Scotland beating R. of Ireland 
> in the upcoming European Championship Qualifying - Group D on Sat 13 Jun @ 
> 17:00 GMT :)
> I located a JMF Wrapper for ffmpeg which 'may' enable us to do this
> http://sourceforge.net/projects/jffmpeg/
> I am not sure... plus it is not licensed liberally enough for us to include 
> so if there are other implementations then please post them here.
> I 'may' be able to have a crack at implementing this next week.





[jira] [Updated] (TIKA-1366) Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1366:

Fix Version/s: (was: 1.11)
   1.12

> Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse 
> 
>
> Key: TIKA-1366
> URL: https://issues.apache.org/jira/browse/TIKA-1366
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Reporter: Sergey Beryozkin
>Priority: Minor
> Fix For: 1.12
>
>
> Some of Tika Server services will benefit from optionally supporting JAX-RS 
> 2.0 AsyncResponse





[jira] [Updated] (TIKA-1745) Add methods accepting java.nio.file.Path to org.apache.tika.Tika and org.apache.tika.parser.ParsingReader

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1745:

Fix Version/s: (was: 1.11)
   1.12

> Add methods accepting java.nio.file.Path to org.apache.tika.Tika and 
> org.apache.tika.parser.ParsingReader
> -
>
> Key: TIKA-1745
> URL: https://issues.apache.org/jira/browse/TIKA-1745
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
>  Labels: java7
> Fix For: 1.12
>
> Attachments: TIKA-1745.patch
>
>
> Add methods accepting java.nio.file.Path to complement those accepting 
> java.io.File, using the new methods in TikaInputStream or java.nio.file.Files





[jira] [Updated] (TIKA-1508) Add uniformity to parser parameter configuration

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1508:

Fix Version/s: (was: 1.11)
   1.12

> Add uniformity to parser parameter configuration
> 
>
> Key: TIKA-1508
> URL: https://issues.apache.org/jira/browse/TIKA-1508
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
> Fix For: 1.12
>
>
> We can currently configure parsers by the following means:
> 1) programmatically by direct calls to the parsers or their config objects
> 2) sending in a config object through the ParseContext
> 3) modifying .properties files for specific parsers (e.g. PDFParser)
> Rather than scattering the landscape with .properties files for each parser, 
> it would be great if we could specify parser parameters in the main config 
> file, something along the lines of this:
> {noformat}
> 
>   
> 2
> something or other
>   
>   audio/basic
>   audio/x-aiff
>   audio/x-wav
> 
> {noformat}





[jira] [Updated] (TIKA-1318) Use of Deprecated Word6Extractor.getParagraphText() Method

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1318:

Fix Version/s: (was: 1.11)
   1.12

> Use of Deprecated Word6Extractor.getParagraphText() Method
> --
>
> Key: TIKA-1318
> URL: https://issues.apache.org/jira/browse/TIKA-1318
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Tyler Palsulich
>Priority: Minor
>  Labels: deprecation
> Fix For: 1.12
>
>
> org.apache.tika.parser.microsoft.WordExtractor.parseWord6() uses the 
> deprecated Word6Extractor.getParagraphText() method. getParagraphText() is 
> supposed to return a String[] with an element for each paragraph in the text. 
> The replacement is getText(), which lets paragraph, cell, etc separation be 
> implementation specific. I'm not sure, at this point, how the POI 
> WordExtractor separates them.





[jira] [Updated] (TIKA-1301) Establish TikaServer on Apache hosted VM

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1301:

Fix Version/s: (was: 1.11)
   1.12

> Establish TikaServer on Apache hosted VM
> 
>
> Key: TIKA-1301
> URL: https://issues.apache.org/jira/browse/TIKA-1301
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
>
> Over in Any23, Infra recently provisioned us with a nice shiny new VM to run 
> our service on
> http://any23.org
> I would like to do the same for Tika. I have some scripts on the Any23 VM 
> which will pull stable nightly tika-server snapshots and deploy them to the 
> VM. This is really nice for both devs and users alike.





[jira] [Updated] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1367:

Fix Version/s: (was: 1.11)
   1.12

> Tika documentation should list tika-parsers parser dependencies
> ---
>
> Key: TIKA-1367
> URL: https://issues.apache.org/jira/browse/TIKA-1367
> Project: Tika
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Sergey Beryozkin
> Fix For: 1.12
>
>
> tika-parsers module has many strong transitive parser dependencies. Maven 
> users of tika-parsers have to exclude all the transitive dependencies 
> manually. Documenting the list of the existing transitive dependencies and 
> keeping the list up to date will help developers exclude the libraries not 
> needed for a given project.





[jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1276:

Fix Version/s: (was: 1.11)
   1.12

> Missing embedded dependencies in tika-bundle
> 
>
> Key: TIKA-1276
> URL: https://issues.apache.org/jira/browse/TIKA-1276
> Project: Tika
>  Issue Type: Bug
>  Components: packaging
>Affects Versions: 1.5
> Environment: OSGI, Apache Felix via Apache Sling Launcher
>Reporter: Rupert Westenthaler
> Fix For: 1.12
>
> Attachments: TIKA-1276_20140423_rwesten.diff, 
> TIKA-1276_20140428_2_rwesten.diff, TIKA-1276_20140428_3_rwesten.diff, 
> TIKA-1276_20140428_rwesten.diff
>
>
> While updating from tika 1.2 to 1.5 I noticed that the 
> `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.
> 1. `com.uwyn:jhighlight:1.0` is not embedded
> Because of that installing the bundle results in the following exception
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
> [103.0] osgi.wiring.package; 
> (osgi.wiring.package=com.uwyn.jhighlight.renderer))
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
> [103.0] osgi.wiring.package; 
> (osgi.wiring.package=com.uwyn.jhighlight.renderer)
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> 2. `org.ow2.asm:asm:4.1` is not embedded because 
> `org.apache.tika:tika-core:1.5` uses `org.ow2.asm:asm-debug-all:4.1` and 
> therefore the `Embed-Dependency` directive `asm` does not match any 
> dependency. 
> Because of that one does get the following exception (after fixing (1))
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; 
> (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; 
> (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0)))
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> There are two possibilities to fix this: (a) change the `Embed-Dependency` to 
> `asm-debug-all`, or (b) add a dependency on `org.ow2.asm:asm:4.1` to the 
> tika-bundle pom file.
> 3. `edu.ucar:netcdf:4.2-min` is not embedded
> Because of that one does get the following exception (after fixing (1) and 
> (2))
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> 4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime
> After fixing the above issues the tika-bundle was started successfully. 
> However when extracting EXIF metadata from a jpeg image I got the following 
> exception.
> {code}
> java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
>   at 
> com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
>   at 
> com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
>   at 
> org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
>

[jira] [Updated] (TIKA-774) ExifTool Parser

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-774:
---
Fix Version/s: (was: 1.11)
   1.12

> ExifTool Parser
> ---
>
> Key: TIKA-774
> URL: https://issues.apache.org/jira/browse/TIKA-774
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.0
> Environment: Requires ExifTool be installed 
> (http://www.sno.phy.queensu.ca/~phil/exiftool/)
>Reporter: Ray Gauss II
>  Labels: features, new-parser, newbie, patch
> Fix For: 1.12
>
> Attachments: testJPEG_IPTC_EXT.jpg, 
> tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt
>
>
> Adds an external parser that calls ExifTool to extract extended metadata 
> fields from images and other content types.
> In the core project:
> An ExifTool interface is added which contains Property objects that define 
> the metadata fields available.
> An additional Property constructor for internalTextBag type.
> In the parsers project:
> An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
> on the command line and mapping the response to tika metadata fields.  This 
> extractor could be called instead of or in addition to the existing 
> ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
> JpegParser but those have not been changed at this time.
> An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
> An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
> metadata fields to existing tika and Drew Noakes metadata fields if enabled.
> An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
> implementations in XML files.
> An ExifToolParserTest is added which tests several expected XMP and IPTC 
> metadata values in testJPEG_IPTC_EXT.jpg.
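
The "call ExifTool on the command line and map the response" step described above can be sketched with JDK classes only. This is a rough illustration under stated assumptions, not the code from the attached patches: the class ExifToolSketch and its methods are hypothetical, the exiftool binary is assumed to be on the PATH, and the output is assumed to be plain text with one "Tag Name : value" pair per line (the patch may well use a different output format).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the ExiftoolMetadataExtractor from the patch.
class ExifToolSketch {

    // Splits "Tag Name : value" lines at the first colon and keeps them in order.
    // Lines without a colon are ignored.
    static Map<String, String> parseLines(Iterable<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(':');
            if (sep <= 0) {
                continue;
            }
            fields.put(line.substring(0, sep).trim(), line.substring(sep + 1).trim());
        }
        return fields;
    }

    // Runs "exiftool <file>" (assumed to be installed) and parses its output.
    static Map<String, String> extract(Path image) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("exiftool", image.toString())
                .redirectErrorStream(true)
                .start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        process.waitFor();
        return parseLines(lines);
    }
}
```

Keeping the parsing separate from the process call means the mapping logic can be tested without ExifTool installed; a real mapper would then translate these raw tag names into Tika metadata properties.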





[jira] [Updated] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-894:
---
Fix Version/s: (was: 1.11)
   1.12

> Add webapp mode for Tika Server, simplifies deployment
> --
>
> Key: TIKA-894
> URL: https://issues.apache.org/jira/browse/TIKA-894
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Affects Versions: 1.1, 1.2
>Reporter: Chris Wilson
>  Labels: maven, newbie, patch
> Fix For: 1.12
>
> Attachments: tika-server-webapp.patch
>
>
> For use in production services, Tika Server should really be deployed as a 
> WAR file, under a reliable servlet container that knows how to run as a 
> system service, for example Tomcat or JBoss.
> This is especially important on Windows, where I wasted an entire day trying 
> to make TikaServerCli run as some kind of a service. 
> Maven makes building a webapp pretty trivial. With the attached patch 
> applied, "mvn war:war" should work. It seems to run fine in Tomcat, which 
> makes Windows deployment much simpler. Just install Tomcat and drop the WAR 
> file into tomcat's webapps directory and you're away.





[jira] [Updated] (TIKA-1220) Parser implementration for IFC files

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1220:

Fix Version/s: (was: 1.11)
   1.12

> Parser implementration for IFC files
> 
>
> Key: TIKA-1220
> URL: https://issues.apache.org/jira/browse/TIKA-1220
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: new-parser
> Fix For: 1.12
>
> Attachments: 2012-03-23-Duplex-Programming.ifc
>
>
> The Industry Foundation Classes (IFC) [0] data model is intended to describe 
> building and construction industry data. For the sake of argument, it can be 
> considered as a more intelligent successor to the .dwg data models used 
> within CAD models.
> I've tracked down a potential 3rd party library [1] which we may be able to 
> wrap and use within Tika however the provided software packages are licensed 
> under: http://creativecommons.org/licenses/by-nc-sa/3.0/de/ so I am currently 
> over on legal-discuss@ in an attempt to see if it is possible to wrap some 
> code and contribute it to tika-parsers.
> When I get feedback from legal-discuss, and if this is a go-ahead, I'll need 
> to help the developers package the code as a Maven artifact(s), then I will 
> progress with writing the implementation.  
> [0] http://en.wikipedia.org/wiki/Industry_Foundation_Classes
> [1] http://www.ifctoolsproject.com/





[jira] [Updated] (TIKA-1343) Create a Tika Translator implementation that uses JoshuaDecoder

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1343:

Fix Version/s: (was: 1.11)
   1.12

> Create a Tika Translator implementation that uses JoshuaDecoder
> ---
>
> Key: TIKA-1343
> URL: https://issues.apache.org/jira/browse/TIKA-1343
> Project: Tika
>  Issue Type: New Feature
>  Components: translation
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
>
> The Joshua Decoder toolkit is a BSD licensed Java-based statistical machine 
> translation system hosted at Github:
> http://joshua-decoder.org/
> Joshua takes in corpuses and trains models that can then be used to do 
> language translation. Currently there is support for e.g., Spanish->English, 
> Indian dialects->English, Chinese->English, and a few others. 
> https://github.com/joshua-decoder/joshua/
> It would be nice to build a Tika Translator on top of Joshua. There are of 
> course several issues with this:
> * the models are huge - so we'll need a separate package or Maven module, 
> maybe tika-translate-joshua or something to release the models and we'll need 
> to build the models. I just went through the process of building the 
> Spanish->English one, and it still needs to be rebuilt b/c I did it wrong, 
> but it took over a day
> * there is a configuration for Joshua, and so we need some way of passing 
> that config into the Translator. Not sure of the best way to do this.
> * Joshua isn't in the Central repository. I've started a discussion on the 
> Joshua lists about this: 
> https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0
> Anyhoo, I've got a working patch right now with hard-coded stuff, and a manual 
> install into my Maven repo for brave souls out there that want to try it.





[jira] [Updated] (TIKA-1657) Allow easier XML serialization of TikaConfig

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1657:

Fix Version/s: (was: 1.11)
   1.12

> Allow easier XML serialization of TikaConfig
> 
>
> Key: TIKA-1657
> URL: https://issues.apache.org/jira/browse/TIKA-1657
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.12
>
> Attachments: TIKA-1558-blacklist-effective.xml, TIKA-1657v1.patch
>
>
> In TIKA-1418, we added an example for how to dump the config file so that 
> users could easily modify it.  I think we should go further and make this an 
> option at the tika-core level with hooks for tika-app and tika-server.  I 
> propose adding a main() to TikaConfig that will print the xml config file 
> that Tika is currently using to stdout.
> I'd like to put this into core so that e.g. Solr's DIH users can get by 
> without having to download tika-app separately.  
> There's every chance that I've not accounted for issues with dynamic loading 
> etc.  Also, I'd be ok with only having this available in tika-app and 
> tika-server if there are good reasons.
> Feedback?





[jira] [Updated] (TIKA-1616) Tika Parser for GIBS Metadata

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1616:

Fix Version/s: (was: 1.11)
   1.12

> Tika Parser for GIBS Metadata
> -
>
> Key: TIKA-1616
> URL: https://issues.apache.org/jira/browse/TIKA-1616
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
>
> [GIBS|https://earthdata.nasa.gov/about-eosdis/science-system-description/eosdis-components/global-imagery-browse-services-gibs]
>  metadata currently consists of simple stuff in the WMTS GetCapabilities 
> request (e.g. 
> http://map1.vis.earthdata.nasa.gov/wmts-arctic/1.0.0/WMTSCapabilities.xml) 
> which includes available layers, extents, time ranges, map projections, color 
> maps, etc. We will eventually have more detailed visualization metadata 
> available in ECHO/CMR which will include linkages to data products, 
> provenance, etc. 
> Some investigation and a Tika parser would be excellent to extract and 
> assimilate GIBS Metadata.





Re: [jira] [Assigned] (TIKA-1746) modify TikaFileTypeDetector to use new detect method accepting java.nio.file.Path

2015-10-18 Thread Ritesh Chandna
Unsubscribe

Regards,
Ritesh Chandna
Master's Student
University of Southern California
Phone: +1 2132949596

On Sun, Oct 18, 2015 at 10:28 PM, Chris A. Mattmann (JIRA) 
wrote:

>
>  [
> https://issues.apache.org/jira/browse/TIKA-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Chris A. Mattmann reassigned TIKA-1746:
> ---
>
> Assignee: Chris A. Mattmann
>
> > modify TikaFileTypeDetector to use new detect method accepting
> java.nio.file.Path
> >
> -
> >
> > Key: TIKA-1746
> > URL: https://issues.apache.org/jira/browse/TIKA-1746
> > Project: Tika
> >  Issue Type: Sub-task
> >  Components: detector
> >Reporter: Yaniv Kunda
> >Assignee: Chris A. Mattmann
> >Priority: Minor
> >  Labels: java7
> > Fix For: 1.12
> >
> > Attachments: TIKA-1746.patch
> >
> >
> > Utilize the new org.apache.tika.Tika.detect(Path) method
>
>
>
>


Re: [jira] [Updated] (TIKA-1745) Add methods accepting java.nio.file.Path to org.apache.tika.Tika and org.apache.tika.parser.ParsingReader

2015-10-18 Thread Mattmann, Chris A (3980)
Took care of everything but TIKA-1706. That can get taken care of
later.

1.11 RC #1 coming up.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From: Yaniv Kunda 
Reply-To: "dev@tika.apache.org" 
Date: Sunday, October 18, 2015 at 12:56 PM
To: "dev@tika.apache.org" 
Subject: RE: [jira] [Updated] (TIKA-1745) Add methods accepting
java.nio.file.Path to org.apache.tika.Tika and
org.apache.tika.parser.ParsingReader

>This (and https://issues.apache.org/jira/browse/TIKA-1746 and
>https://issues.apache.org/jira/browse/TIKA-1751) are part of
>https://issues.apache.org/jira/browse/TIKA-1726 and already have
>relatively
>simple patches ready to be committed.
>
>I think they'd be better off committed together with their
>already-committed
>siblings, for putting all API additions in 1.11.
>
>(I'd also like to see https://issues.apache.org/jira/browse/TIKA-1706 in
>1.11, which I have prepared patches for according to [~grossws]'s
>suggestion, but that's another story...)
>
>-Original Message-
>From: Chris A. Mattmann (JIRA) [mailto:j...@apache.org]
>Sent: Sunday, October 18, 2015 22:44
>To: dev@tika.apache.org
>Subject: [jira] [Updated] (TIKA-1745) Add methods accepting
>java.nio.file.Path to org.apache.tika.Tika and
>org.apache.tika.parser.ParsingReader
>
>
> [
>https://issues.apache.org/jira/browse/TIKA-1745?page=com.atlassian.jira.pl
>ugin.system.issuetabpanels:all-tabpanel
>]
>
>Chris A. Mattmann updated TIKA-1745:
>
>Fix Version/s: (was: 1.11)
>   1.12
>
>> Add methods accepting java.nio.file.Path to org.apache.tika.Tika and
>> org.apache.tika.parser.ParsingReader
>> 
>>-
>>
>>
>> Key: TIKA-1745
>> URL: https://issues.apache.org/jira/browse/TIKA-1745
>> Project: Tika
>>  Issue Type: Sub-task
>>  Components: core
>>Reporter: Yaniv Kunda
>>Priority: Minor
>>  Labels: java7
>> Fix For: 1.12
>>
>> Attachments: TIKA-1745.patch
>>
>>
>> Add methods accepting java.nio.file.Path to complement those accepting
>> java.io.File, using the new methods in TikaInputStream or
>> java.nio.file.Files
>
>
>



[jira] [Closed] (TIKA-1775) Failed to load Main-Class manifest attribute from tika-app-1.10.jar

2015-10-18 Thread QiaoMan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

QiaoMan closed TIKA-1775.
-
Resolution: Fixed

I made a mistake when moving the tika.jar file.

> Failed to load Main-Class manifest attribute from tika-app-1.10.jar
> ---
>
> Key: TIKA-1775
> URL: https://issues.apache.org/jira/browse/TIKA-1775
> Project: Tika
>  Issue Type: Bug
> Environment: GNU/Linux
>Reporter: QiaoMan
>
> Hi,
> I am new to Java. I downloaded the Tika .jar file from 
> http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.10.jar. Running it fails 
> with "Failed to load Main-Class manifest attribute". I found a similar 
> issue, https://issues.apache.org/jira/browse/TIKA-1273, which said "old 
> tika-server jar artifact contains no manifest". But I found that version 1.10 
> still has no manifest file. Is there a problem with my download? 
>  Here is my command: "gij -jar tika-app-1.10.jar".
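
For anyone hitting the same message: it indicates that the launcher could not read a Main-Class entry from the jar's META-INF/MANIFEST.MF, which also happens when the jar file itself is incomplete or misplaced. A minimal sketch for checking that entry with the JDK's java.util.jar API is below; the class name ManifestCheck and its method are hypothetical and not part of Tika.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical helper, not part of Tika: reports the Main-Class entry of a jar.
class ManifestCheck {

    // Returns the Main-Class attribute, or empty if the jar has no manifest
    // or the manifest has no such attribute.
    static Optional<String> mainClassOf(Path jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            Manifest manifest = jarFile.getManifest(); // null when there is no manifest
            if (manifest == null) {
                return Optional.empty();
            }
            String mainClass = manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
            return Optional.ofNullable(mainClass);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ManifestCheck <path-to-jar>");
            return;
        }
        Path jar = Paths.get(args[0]);
        System.out.println(mainClassOf(jar).orElse("(no Main-Class attribute)"));
    }
}
```

If the file is truncated or is not a jar at all, opening it is likely to fail with an exception rather than return an empty result, which is itself a useful hint that the download or copy went wrong.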





[jira] [Resolved] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1726.
-
   Resolution: Fixed
Fix Version/s: (was: 1.12)
   1.11

all the sub-tasks of this are done.

> Augment public methods that use a java.io.File with methods that use a 
> java.nio.file.Path
> -
>
> Key: TIKA-1726
> URL: https://issues.apache.org/jira/browse/TIKA-1726
> Project: Tika
>  Issue Type: Improvement
>  Components: batch, core, gui, parser, translation
>Reporter: Yaniv Kunda
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
>
> In light of Java 7 already EOL, it's high time we add support for the new 
> java.nio.file.Path class introduced with it, which, together with support 
> methods in java.nio.file.Files and others, provide a better file I/O 
> framework than java.io.File.
> In just two cases, we have public methods in tika that only return a File 
> object, and cannot be overloaded, so a different name for the new method must 
> be created:
> - {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
> _Suggestions:_
> -- addTemporaryFile
> -- addTempFile
> -- createTempFile
> -- createTemporaryPath
> - {{org.apache.tika.io.TikaInputStream#getFile()}}
> _Suggestions:_
> -- asFile
> -- toPath
> -- getPath
> In other cases, the methods accept a File as an argument, and should remain 
> as tika users might be using them - so an overloaded method that accepts a 
> Path instead should be added, referencing the new method from the old one 
> (using the @see tag) until java.io.File itself is deprecated or otherwise 
> becomes obsolete.
> Here is the full list of other methods:
> _tika-app:_
> - {{org.apache.tika.gui.TikaGUI#openFile(File)}}
> _tika-batch:_
> - {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
> HANDLE_EXISTING, String)}}
> - {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
> - {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
> - 
> {{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
> - {{org.apache.tika.batch.fs.FSFileResource}} constructor
> - {{org.apache.tika.batch.fs.FSListCrawler}} constructor
> - {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
> - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
> File)}}
> - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
> - {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor
> _tika-core:_
> - {{org.apache.tika.Tika#detect(File)}}
> - {{org.apache.tika.Tika#parse(File)}}
> - {{org.apache.tika.Tika#parseToString(File)}}
> - {{org.apache.tika.config.TikaConfig}} constructors
> - {{org.apache.tika.detect.NNExampleModelDetector}} constructor
> - {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}}
> - {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}}
> - {{org.apache.tika.io.TikaInputStream#get(File)}}
> - {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}}
> - {{org.apache.tika.parser.ParsingReader}} constructor
> _tika-parsers:_
> - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}}
> - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}}
> - {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor
> _tika-translate:_
> - 
> {{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String,
>  String[], File)}}
> Due to lack of evidence, all public methods in public non-test classes (and 
> not in tika-example) are deemed part of a public API - although there's no 
> formal definition of such.
> If anyone knows of a public method which isn't accessed publicly and can be 
> defined as package-private, or for another reason, please comment.
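
The two situations in the description (a method that takes a File and gains a Path overload, and a method that returns a File and therefore needs a differently named Path variant) can be sketched with plain JDK types. ExampleDetector and ExampleResource are hypothetical stand-ins, not Tika classes, and having the File method delegate to the Path one is just one possible choice; the description itself only asks for the @see reference.

```java
import java.io.File;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical stand-in for a class whose public method accepts a File.
class ExampleDetector {

    /**
     * Existing File-based method, kept for current callers.
     *
     * @see #detect(Path)
     */
    static String detect(File file) {
        return detect(file.toPath());
    }

    /** New overload accepting a java.nio.file.Path. */
    static String detect(Path path) {
        // Toy logic for illustration only: classify by file name extension.
        Path name = path.getFileName();
        String fileName = name == null ? "" : name.toString().toLowerCase(Locale.ROOT);
        return fileName.endsWith(".txt") ? "text/plain" : "application/octet-stream";
    }
}

// Hypothetical stand-in for a class whose public method returns a File.
class ExampleResource {
    private final Path path;

    ExampleResource(Path path) {
        this.path = path;
    }

    /** Existing accessor; Java cannot overload on return type alone. */
    File getFile() {
        return path.toFile();
    }

    /** New accessor, which therefore needs its own name. */
    Path getPath() {
        return path;
    }
}
```

Callers holding a File keep compiling unchanged, while new code can pass a Path directly and avoid the round trip through java.io.File.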





[jira] [Commented] (TIKA-1746) modify TikaFileTypeDetector to use new detect method accepting java.nio.file.Path

2015-10-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962880#comment-14962880
 ] 

Hudson commented on TIKA-1746:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #873 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/873/])
Fix for TIKA-1746: modify TikaFileTypeDetector to use new detect method 
accepting java.nio.file.Path contributed by Yaniv Kunda. (mattmann: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev=1709350])
* trunk/CHANGES.txt
* 
trunk/tika-java7/src/main/java/org/apache/tika/filetypedetector/TikaFileTypeDetector.java


> modify TikaFileTypeDetector to use new detect method accepting 
> java.nio.file.Path
> -
>
> Key: TIKA-1746
> URL: https://issues.apache.org/jira/browse/TIKA-1746
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>Reporter: Yaniv Kunda
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1746.patch
>
>
> Utilize the new org.apache.tika.Tika.detect(Path) method
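
As background, a detector plugged into java.nio.file.Files.probeContentType receives a Path, so a Path-accepting detect method lets it skip any conversion to java.io.File. The sketch below shows only that shape using the JDK's FileTypeDetector base class. SketchFileTypeDetector and detectFromPath are hypothetical; the magic-byte check is a toy stand-in for the real detection call and is not how Tika detects types.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.spi.FileTypeDetector;
import java.util.Arrays;

// Hypothetical sketch, not the TikaFileTypeDetector source.
class SketchFileTypeDetector extends FileTypeDetector {

    @Override
    public String probeContentType(Path path) throws IOException {
        // The Path is handed straight to the detection routine.
        return detectFromPath(path);
    }

    // Toy stand-in for a Path-accepting detect method: recognizes only the
    // "%PDF" magic bytes and returns null otherwise, which is how a
    // FileTypeDetector signals "not recognized".
    static String detectFromPath(Path path) throws IOException {
        byte[] magic = "%PDF".getBytes(StandardCharsets.US_ASCII);
        byte[] head = new byte[magic.length];
        int total = 0;
        try (InputStream in = Files.newInputStream(path)) {
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break; // file is shorter than the magic sequence
                }
                total += n;
            }
        }
        if (total < head.length) {
            return null;
        }
        return Arrays.equals(head, magic) ? "application/pdf" : null;
    }
}
```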





[jira] [Commented] (TIKA-1745) Add methods accepting java.nio.file.Path to org.apache.tika.Tika and org.apache.tika.parser.ParsingReader

2015-10-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962881#comment-14962881
 ] 

Hudson commented on TIKA-1745:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #873 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/873/])
Fix for TIKA-1745 Add methods accepting java.nio.file.Path to 
org.apache.tika.Tika and org.apache.tika.parser.ParsingReader contributed by  
Yaniv Kunda. (mattmann: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev=1709349])
* trunk/CHANGES.txt
* trunk/tika-core/src/main/java/org/apache/tika/Tika.java
* trunk/tika-core/src/main/java/org/apache/tika/parser/ParsingReader.java


> Add methods accepting java.nio.file.Path to org.apache.tika.Tika and 
> org.apache.tika.parser.ParsingReader
> -
>
> Key: TIKA-1745
> URL: https://issues.apache.org/jira/browse/TIKA-1745
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1745.patch
>
>
> Add methods accepting java.nio.file.Path to complement those accepting 
> java.io.File, using the new methods in TikaInputStream or java.nio.file.Files





[jira] [Commented] (TIKA-1751) Use java.nio.file.Path in TikaConfig

2015-10-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962882#comment-14962882
 ] 

Hudson commented on TIKA-1751:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #873 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/873/])
Fix for TIKA-1751: Use java.nio.file.Path in TikaConfig contributed by Yaniv 
Kunda. (mattmann: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev=1709351])
* trunk/CHANGES.txt
* trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java
* trunk/tika-core/src/test/java/org/apache/tika/config/TikaConfigTest.java


> Use java.nio.file.Path in TikaConfig
> 
>
> Key: TIKA-1751
> URL: https://issues.apache.org/jira/browse/TIKA-1751
> Project: Tika
>  Issue Type: Sub-task
>  Components: config
>Reporter: Yaniv Kunda
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1751.patch
>
>
> Provide constructors accepting java.nio.file.Path





[jira] [Assigned] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-1726:
---

Assignee: Chris A. Mattmann

> Augment public methods that use a java.io.File with methods that use a 
> java.nio.file.Path
> -
>
> Key: TIKA-1726
> URL: https://issues.apache.org/jira/browse/TIKA-1726
> Project: Tika
>  Issue Type: Improvement
>  Components: batch, core, gui, parser, translation
>Reporter: Yaniv Kunda
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: java7
> Fix For: 1.12
>
>
> In light of Java 7 already EOL, it's high time we add support for the new 
> java.nio.file.Path class introduced with it, which, together with support 
> methods in java.nio.file.Files and others, provide a better file I/O 
> framework than java.io.File.
> In just two cases, we have public methods in tika that only return a File 
> object, and cannot be overloaded, so a different name for the new method must 
> be created:
> - {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
> _Suggestions:_
> -- addTemporaryFile
> -- addTempFile
> -- createTempFile
> -- createTemporaryPath
> - {{org.apache.tika.io.TikaInputStream#getFile()}}
> _Suggestions:_
> -- asFile
> -- toPath
> -- getPath
> In other cases, the methods accept a File as an argument, and should remain 
> as tika users might be using them - so an overloaded method that accepts a 
> Path instead should be added, referencing the new method from the old one 
> (using the @see tag) until java.io.File itself is deprecated or otherwise 
> becomes obsolete.
> Here is the full list of other methods:
> _tika-app:_
> - {{org.apache.tika.gui.TikaGUI#openFile(File)}}
> _tika-batch:_
> - {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
> HANDLE_EXISTING, String)}}
> - {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
> - {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
> - 
> {{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
> - {{org.apache.tika.batch.fs.FSFileResource}} constructor
> - {{org.apache.tika.batch.fs.FSListCrawler}} constructor
> - {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
> - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
> File)}}
> - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
> - {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor
> _tika-core:_
> - {{org.apache.tika.Tika#detect(File)}}
> - {{org.apache.tika.Tika#parse(File)}}
> - {{org.apache.tika.Tika#parseToString(File)}}
> - {{org.apache.tika.config.TikaConfig}} constructors
> - {{org.apache.tika.detect.NNExampleModelDetector}} constructor
> - {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}}
> - {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}}
> - {{org.apache.tika.io.TikaInputStream#get(File)}}
> - {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}}
> - {{org.apache.tika.parser.ParsingReader}} constructor
> _tika-parsers:_
> - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}}
> - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}}
> - {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor
> _tika-translate:_
> - 
> {{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String,
>  String[], File)}}
> Due to lack of evidence, all public methods in public non-test classes (and 
> not in tika-example) are deemed part of a public API - although there's no 
> formal definition of such.
> If anyone knows of a public method that isn't used outside its package and 
> could be made package-private, or that should be left out of this list for 
> another reason, please comment.
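
The overload approach described in the issue (keep the File method, add a Path method, and point from the old one to the new one with @see) could look roughly like the sketch below. PathOverloadSketch and readText are hypothetical names used only for illustration, not Tika classes or methods, and having the File method delegate to the Path one is an assumption rather than something the issue prescribes.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not a Tika class): a File method kept next to a new Path overload. */
public class PathOverloadSketch {

    /**
     * Existing File-based entry point, kept for current callers.
     *
     * @see #readText(Path)
     */
    public static String readText(File file) throws IOException {
        // Assumption: delegate to the Path overload so there is a single implementation.
        return readText(file.toPath());
    }

    /** New overload accepting java.nio.file.Path. */
    public static String readText(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sketch", ".txt");
        try {
            Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
            System.out.println(readText(tmp));          // Path overload
            System.out.println(readText(tmp.toFile())); // legacy File overload
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Methods that only return a File cannot be overloaded this way, which is why the issue asks for new names in those two cases.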



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1751) Use java.nio.file.Path in TikaConfig

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1751.
-
   Resolution: Fixed
Fix Version/s: (was: 1.12)
   1.11

Thanks!

{noformat}
[chipotle:~/tmp/tika1.11] mattmann% svn commit -m "Fix for TIKA-1751: Use 
java.nio.file.Path in TikaConfig contributed by Yaniv Kunda."
Sending        CHANGES.txt
Sending        tika-core/src/main/java/org/apache/tika/config/TikaConfig.java
Sending        tika-core/src/test/java/org/apache/tika/config/TikaConfigTest.java
Transmitting file data ...
Committed revision 1709351.
[chipotle:~/tmp/tika1.11] mattmann% 
{noformat}


> Use java.nio.file.Path in TikaConfig
> 
>
> Key: TIKA-1751
> URL: https://issues.apache.org/jira/browse/TIKA-1751
> Project: Tika
>  Issue Type: Sub-task
>  Components: config
>Reporter: Yaniv Kunda
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1751.patch
>
>
> Provide constructors accepting java.nio.file.Path





[jira] [Assigned] (TIKA-1751) Use java.nio.file.Path in TikaConfig

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-1751:
---

Assignee: Chris A. Mattmann

> Use java.nio.file.Path in TikaConfig
> 
>
> Key: TIKA-1751
> URL: https://issues.apache.org/jira/browse/TIKA-1751
> Project: Tika
>  Issue Type: Sub-task
>  Components: config
>Reporter: Yaniv Kunda
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: java7
> Fix For: 1.12
>
> Attachments: TIKA-1751.patch
>
>
> Provide constructors accepting java.nio.file.Path





[jira] [Resolved] (TIKA-1746) modify TikaFileTypeDetector to use new detect method accepting java.nio.file.Path

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1746.
-
   Resolution: Fixed
Fix Version/s: (was: 1.12)
   1.11

- Fixed thanks!

{noformat}
[chipotle:~/tmp/tika1.11] mattmann% svn commit -m "Fix for TIKA-1746: modify 
TikaFileTypeDetector to use new detect method accepting java.nio.file.Path 
contributed by Yaniv Kunda."
Sending        CHANGES.txt
Sending        tika-java7/src/main/java/org/apache/tika/filetypedetector/TikaFileTypeDetector.java
Transmitting file data ..
Committed revision 1709350.
[chipotle:~/tmp/tika1.11] mattmann% 

{noformat}


> modify TikaFileTypeDetector to use new detect method accepting 
> java.nio.file.Path
> -
>
> Key: TIKA-1746
> URL: https://issues.apache.org/jira/browse/TIKA-1746
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>Reporter: Yaniv Kunda
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1746.patch
>
>
> Utilize the new org.apache.tika.Tika.detect(Path) method





[jira] [Created] (TIKA-1775) Failed to load Main-Class manifest attribute from tika-app-1.10.jar

2015-10-18 Thread QiaoMan (JIRA)
QiaoMan created TIKA-1775:
-

 Summary: Failed to load Main-Class manifest attribute from 
tika-app-1.10.jar
 Key: TIKA-1775
 URL: https://issues.apache.org/jira/browse/TIKA-1775
 Project: Tika
  Issue Type: Bug
 Environment: GNU/Linux
Reporter: QiaoMan


Hi,
I am new to Java. I downloaded the Tika .jar file from 
http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.10.jar. Running it fails 
with "Failed to load Main-Class manifest attribute". I found a similar issue, 
https://issues.apache.org/jira/browse/TIKA-1273, which says "old tika-server jar 
artifact contains no manifest". But version 1.10 still appears to have no 
manifest file. Is there a problem with my download? Here is my command: 
"gij -jar tika-app-1.10.jar".
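
One way to check whether a downloaded file really is a jar with a Main-Class attribute is to read its manifest with the JDK's java.util.jar classes. The sketch below is only an illustration: MainClassCheck and demoJar are hypothetical names, and the demo jar it builds stands in for the real download.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class MainClassCheck {

    /** Returns the Main-Class attribute, or null if the jar has no manifest or no such attribute. */
    public static String mainClassOf(Path jar) throws IOException {
        try (JarFile jf = new JarFile(jar.toFile())) {
            Manifest mf = jf.getManifest();
            return mf == null ? null : mf.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }

    /** Demo only: writes a jar containing nothing but a manifest that names the given main class. */
    public static Path demoJar(String mainClass) throws IOException {
        Manifest mf = new Manifest();
        mf.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        mf.getMainAttributes().put(Attributes.Name.MAIN_CLASS, mainClass);
        Path jar = Files.createTempFile("demo", ".jar");
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jos = new JarOutputStream(out, mf)) {
            // No further entries; the constructor has already written the manifest.
        }
        return jar;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of the jar to inspect, or run without arguments to use the demo jar.
        Path jar = args.length > 0 ? Paths.get(args[0]) : demoJar("com.example.Main");
        System.out.println("Main-Class: " + mainClassOf(jar));
    }
}
```

If the file is not a valid archive at all, the JarFile constructor throws an IOException, which is itself a useful hint that the download did not produce the expected jar.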





[jira] [Resolved] (TIKA-1745) Add methods accepting java.nio.file.Path to org.apache.tika.Tika and org.apache.tika.parser.ParsingReader

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1745.
-
   Resolution: Fixed
Fix Version/s: (was: 1.12)
   1.11

Thanks!

{noformat}
[chipotle:~/tmp/tika1.11] mattmann% svn commit -m "Fix for TIKA-1745 Add 
methods accepting java.nio.file.Path to org.apache.tika.Tika and 
org.apache.tika.parser.ParsingReader contributed by  Yaniv Kunda."
Sending        CHANGES.txt
Sending        tika-core/src/main/java/org/apache/tika/Tika.java
Sending        tika-core/src/main/java/org/apache/tika/parser/ParsingReader.java
Transmitting file data ...
Committed revision 1709349.
[chipotle:~/tmp/tika1.11] mattmann% 
{noformat}


> Add methods accepting java.nio.file.Path to org.apache.tika.Tika and 
> org.apache.tika.parser.ParsingReader
> -
>
> Key: TIKA-1745
> URL: https://issues.apache.org/jira/browse/TIKA-1745
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1745.patch
>
>
> Add methods accepting java.nio.file.Path to complement those accepting 
> java.io.File, using the new methods in TikaInputStream or java.nio.file.Files





[jira] [Assigned] (TIKA-1746) modify TikaFileTypeDetector to use new detect method accepting java.nio.file.Path

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-1746:
---

Assignee: Chris A. Mattmann

> modify TikaFileTypeDetector to use new detect method accepting 
> java.nio.file.Path
> -
>
> Key: TIKA-1746
> URL: https://issues.apache.org/jira/browse/TIKA-1746
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>Reporter: Yaniv Kunda
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: java7
> Fix For: 1.12
>
> Attachments: TIKA-1746.patch
>
>
> Utilize the new org.apache.tika.Tika.detect(Path) method



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TIKA-1745) Add methods accepting java.nio.file.Path to org.apache.tika.Tika and org.apache.tika.parser.ParsingReader

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-1745:
---

Assignee: Chris A. Mattmann

> Add methods accepting java.nio.file.Path to org.apache.tika.Tika and 
> org.apache.tika.parser.ParsingReader
> -
>
> Key: TIKA-1745
> URL: https://issues.apache.org/jira/browse/TIKA-1745
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: java7
> Fix For: 1.12
>
> Attachments: TIKA-1745.patch
>
>
> Add methods accepting java.nio.file.Path to complement those accepting 
> java.io.File, using the new methods in TikaInputStream or java.nio.file.Files





RE: [jira] [Updated] (TIKA-1745) Add methods accepting java.nio.file.Path to org.apache.tika.Tika and org.apache.tika.parser.ParsingReader

2015-10-18 Thread Yaniv Kunda
This (and https://issues.apache.org/jira/browse/TIKA-1746 and
https://issues.apache.org/jira/browse/TIKA-1751) are part of
https://issues.apache.org/jira/browse/TIKA-1726 and already have relatively
simple patches ready to be committed.

I think they'd be better off committed together with their already-committed
siblings, for putting all API additions in 1.11.

(I'd also like to see https://issues.apache.org/jira/browse/TIKA-1706 in
1.11, which I have prepared patches for according to [~grossws]'s
suggestion, but that's another story...)

-Original Message-
From: Chris A. Mattmann (JIRA) [mailto:j...@apache.org]
Sent: Sunday, October 18, 2015 22:44
To: dev@tika.apache.org
Subject: [jira] [Updated] (TIKA-1745) Add methods accepting
java.nio.file.Path to org.apache.tika.Tika and
org.apache.tika.parser.ParsingReader


 [
https://issues.apache.org/jira/browse/TIKA-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chris A. Mattmann updated TIKA-1745:

Fix Version/s: (was: 1.11)
   1.12

> Add methods accepting java.nio.file.Path to org.apache.tika.Tika and
> org.apache.tika.parser.ParsingReader
> -
>
> Key: TIKA-1745
> URL: https://issues.apache.org/jira/browse/TIKA-1745
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
>  Labels: java7
> Fix For: 1.12
>
> Attachments: TIKA-1745.patch
>
>
> Add methods accepting java.nio.file.Path to complement those accepting
> java.io.File, using the new methods in TikaInputStream or
> java.nio.file.Files





[jira] [Assigned] (TIKA-1771) lower magic priority xhtml magic priority to ensure emails detected as message/rfc822

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-1771:
---

Assignee: Chris A. Mattmann

> lower magic priority xhtml magic priority to ensure emails detected as 
> message/rfc822
> -
>
> Key: TIKA-1771
> URL: https://issues.apache.org/jira/browse/TIKA-1771
> Project: Tika
>  Issue Type: Improvement
>  Components: detector
>Reporter: Jeremy B. Merrill
>Assignee: Chris A. Mattmann
>Priority: Critical
>
> Emails I have (happy to share if you want) contain XHTML, as one part of a 
> multipart email. Prior to this pull request, the priority on the 
> application/xhtml+xml magic detector was 50, equal to the priority on the 
> message/rfc822 detector. Because of the relative position of the two 
> detectors in tika-mimetypes.xml, the emails were incorrectly detected as 
> XHTML documents.
> With this PR, by downgrading the priority of application/xhtml+xml to 40, the 
> more-sensitive email magic detectors take precedence, causing the emails to 
> be properly detected as message/rfc822.
> I have not run this thru the govdocs tester or anything other than my own 
> documents, so, full disclosure, this could cause false negative 
> xhtml-detections elsewhere.
> I should note this occurs on trunk, from Github, up-to-date as of Tuesday-ish.
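
The effect described in the issue, where two rules with equal priority are decided by their position and lowering one priority lets the other win, can be sketched as follows. This is not Tika's detector code: the Rule class, the marker strings, and the assumption that the XHTML rule is declared before the email rule are all hypothetical and chosen only to reproduce the behaviour the reporter describes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class MagicPrioritySketch {

    /** Hypothetical stand-in for one magic rule: a type, a priority, and a marker to look for. */
    static final class Rule {
        final String type;
        final int priority;
        final String marker;

        Rule(String type, int priority, String marker) {
            this.type = type;
            this.priority = priority;
            this.marker = marker;
        }
    }

    /** Returns the type of the highest-priority matching rule; ties keep declaration order. */
    static String detect(List<Rule> declared, String content) {
        List<Rule> sorted = new ArrayList<>(declared);
        // Collections.sort is stable, so rules with equal priority stay in declaration order.
        Collections.sort(sorted, new Comparator<Rule>() {
            @Override
            public int compare(Rule a, Rule b) {
                return Integer.compare(b.priority, a.priority);
            }
        });
        for (Rule r : sorted) {
            if (content.contains(r.marker)) {
                return r.type;
            }
        }
        return "application/octet-stream";
    }

    /** Detects a sample multipart-style email with the XHTML rule at the given priority. */
    public static String detectWithXhtmlPriority(int xhtmlPriority) {
        List<Rule> rules = new ArrayList<>();
        // Assumed declaration order: XHTML rule first, email rule second.
        rules.add(new Rule("application/xhtml+xml", xhtmlPriority, "<html xmlns="));
        rules.add(new Rule("message/rfc822", 50, "Received:"));
        String email = "Received: from example\n\n<html xmlns=\"http://www.w3.org/1999/xhtml\">";
        return detect(rules, email);
    }

    public static void main(String[] args) {
        System.out.println(detectWithXhtmlPriority(50)); // application/xhtml+xml
        System.out.println(detectWithXhtmlPriority(40)); // message/rfc822
    }
}
```

The same sketch also shows the risk the reporter mentions: any document that matches both rules will now be reported as email, so genuine XHTML files containing email-like markers could be misdetected.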





[GitHub] tika pull request: lower priority on magic for application/xhtml+x...

2015-10-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/58


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Resolved] (TIKA-1771) lower magic priority xhtml magic priority to ensure emails detected as message/rfc822

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1771.
-
   Resolution: Fixed
Fix Version/s: 1.11

Thanks [~jeremybmerrill]! 
{noformat}
[chipotle:~/tmp/tika1.11] mattmann% svn commit -m "Fix for TIKA-1771 lower 
magic priority xhtml magic priority to ensure emails detected as message/rfc822 
contributed by Jeremy B. Merrill  this closes #58."
Sending        CHANGES.txt
Sending        tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Transmitting file data ..
Committed revision 1709301.
[chipotle:~/tmp/tika1.11] mattmann% 
{noformat}


> lower magic priority xhtml magic priority to ensure emails detected as 
> message/rfc822
> -
>
> Key: TIKA-1771
> URL: https://issues.apache.org/jira/browse/TIKA-1771
> Project: Tika
>  Issue Type: Improvement
>  Components: detector
>Reporter: Jeremy B. Merrill
>Assignee: Chris A. Mattmann
>Priority: Critical
> Fix For: 1.11
>
>
> Emails I have (happy to share if you want) contain XHTML, as one part of a 
> multipart email. Prior to this pull request, the priority on the 
> application/xhtml+xml magic detector was 50, equal to the priority on the 
> message/rfc822 detector. Because of the relative position of the two 
> detectors in tika-mimetypes.xml, the emails were incorrectly detected as 
> XHTML documents.
> With this PR, by downgrading the priority of application/xhtml+xml to 40, the 
> more-sensitive email magic detectors take precedence, causing the emails to 
> be properly detected as message/rfc822.
> I have not run this thru the govdocs tester or anything other than my own 
> documents, so, full disclosure, this could cause false negative 
> xhtml-detections elsewhere.
> I should note this occurs on trunk, from Github, up-to-date as of Tuesday-ish.





[jira] [Commented] (TIKA-1772) Mimetype of VTT files

2015-10-18 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962568#comment-14962568
 ] 

Chris A. Mattmann commented on TIKA-1772:
-

Nick please ref the Github PR # and it will close automatically.

> Mimetype of VTT files
> -
>
> Key: TIKA-1772
> URL: https://issues.apache.org/jira/browse/TIKA-1772
> Project: Tika
>  Issue Type: Improvement
>Reporter: Alexander Widera
>Priority: Minor
> Fix For: 1.11
>
> Attachments: upc-video-subtitles-en.vtt
>
>
> Files with extension "vtt" are "WebVTT: The Web Video Text Tracks Format" 
> files.
> The mimetype resolved by tika is currently text/plain.
> The correct mimetype should be text/vtt.
> see: https://w3c.github.io/webvtt/
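
The requested behaviour amounts to an extension-based lookup in which "vtt" maps to text/vtt instead of falling through to text/plain. A minimal sketch of such a lookup is shown below; ExtensionTypeSketch and its table are hypothetical and do not reflect how Tika stores its glob patterns.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ExtensionTypeSketch {

    // Hypothetical lookup table; only the vtt entry comes from the issue.
    private static final Map<String, String> TYPES = new HashMap<>();

    static {
        TYPES.put("txt", "text/plain");
        TYPES.put("vtt", "text/vtt");
    }

    /** Returns the media type for the file name's extension, or a generic fallback. */
    public static String typeFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String type = TYPES.get(ext);
        return type != null ? type : "application/octet-stream";
    }

    public static void main(String[] args) {
        System.out.println(typeFor("upc-video-subtitles-en.vtt")); // text/vtt
    }
}
```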





[GitHub] tika pull request: fix for TIKA-1772 contributed by wiedsche

2015-10-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/59




[jira] [Commented] (TIKA-1772) Mimetype of VTT files

2015-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962569#comment-14962569
 ] 

ASF GitHub Bot commented on TIKA-1772:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/59


> Mimetype of VTT files
> -
>
> Key: TIKA-1772
> URL: https://issues.apache.org/jira/browse/TIKA-1772
> Project: Tika
>  Issue Type: Improvement
>Reporter: Alexander Widera
>Priority: Minor
> Fix For: 1.11
>
> Attachments: upc-video-subtitles-en.vtt
>
>
> Files with extension "vtt" are "WebVTT: The Web Video Text Tracks Format" 
> files.
> The mimetype resolved by tika is currently text/plain.
> The correct mimetype should be text/vtt.
> see: https://w3c.github.io/webvtt/





[jira] [Commented] (TIKA-1774) org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared

2015-10-18 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962489#comment-14962489
 ] 

Nick Burch commented on TIKA-1774:
--

This looks like a duplicate of TIKA-1215. See [this comment for why that 
combination isn't 
supported|https://issues.apache.org/jira/browse/TIKA-1215?focusedCommentId=13869693=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13869693]

> org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
> -
>
> Key: TIKA-1774
> URL: https://issues.apache.org/jira/browse/TIKA-1774
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.10
> Environment: Windows 10
> Java 1.8_45
> Apache Tika 1.10
> MS Word 2013
>Reporter: Steve K
>Priority: Minor
>
> Create a test document with MS Word 2013. Just a few paragraphs (lines of 
> text), table, etc.
> Code example:
> ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());
> File inputFile = new File("c:\\temp\\test.docx");
> InputStream stream = TikaInputStream.get(inputFile);
> AutoDetectParser parser = new AutoDetectParser();
> Metadata metadata = new Metadata();
> parser.parse(stream, handler, metadata);
> System.out.println(handler.toString());
> This will lead to the following Exception:
> Exception in thread "main" org.xml.sax.SAXException: Namespace 
> http://www.w3.org/1999/xhtml not declared
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
>   at 
> org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:250)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:163)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
>   at com.test.TikaTest.main(TikaTest.java:28)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
> This exception occurs when using "ToXMLContentHandler" in combination with 
> the BodyContentHandler. Using "ToXMLContentHandler" alone works.
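
The stack trace passes through MatchingContentHandler before ToXMLContentHandler fails in getPrefix, which is consistent with a body-only filter dropping the namespace declaration that was issued before the body element. That reading is an assumption, not something confirmed in this thread. The sketch below reproduces the mechanism with plain JDK SAX classes only; StrictSink and BodyOnly are hypothetical stand-ins and are not Tika's handlers.

```java
import java.util.HashSet;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class BodyFilterSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Strict sink: rejects elements whose namespace was never declared via startPrefixMapping. */
    static class StrictSink extends DefaultHandler {
        private final Set<String> declared = new HashSet<>();
        final StringBuilder out = new StringBuilder();

        @Override public void startPrefixMapping(String prefix, String uri) { declared.add(uri); }

        @Override public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            if (!declared.contains(uri)) {
                throw new SAXException("Namespace " + uri + " not declared");
            }
            out.append('<').append(local).append('>');
        }

        @Override public void endElement(String uri, String local, String qName) {
            out.append("</").append(local).append('>');
        }
    }

    /** Forwards only events nested inside body; earlier events, including prefix mappings, are dropped. */
    static class BodyOnly extends DefaultHandler {
        private final ContentHandler target;
        private int depth = 0; // 0 = outside body, 1 = directly inside body, and so on

        BodyOnly(ContentHandler target) { this.target = target; }

        @Override public void startPrefixMapping(String prefix, String uri) throws SAXException {
            if (depth > 0) target.startPrefixMapping(prefix, uri);
        }

        @Override public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            if (depth > 0) { target.startElement(uri, local, qName, atts); depth++; }
            else if ("body".equals(local)) depth = 1;
        }

        @Override public void endElement(String uri, String local, String qName) throws SAXException {
            if (depth > 1) { target.endElement(uri, local, qName); depth--; }
            else if (depth == 1) depth = 0;
        }
    }

    /** Emits html > body > p with the namespace mapping declared before the html element. */
    static void emit(ContentHandler h) throws SAXException {
        Attributes none = new AttributesImpl();
        h.startDocument();
        h.startPrefixMapping("", XHTML);
        h.startElement(XHTML, "html", "html", none);
        h.startElement(XHTML, "body", "body", none);
        h.startElement(XHTML, "p", "p", none);
        h.endElement(XHTML, "p", "p");
        h.endElement(XHTML, "body", "body");
        h.endElement(XHTML, "html", "html");
        h.endPrefixMapping("");
        h.endDocument();
    }

    /** Runs the event stream into the strict sink, optionally through the body-only filter. */
    public static String run(boolean filtered) {
        StrictSink sink = new StrictSink();
        try {
            emit(filtered ? new BodyOnly(sink) : sink);
            return sink.out.toString();
        } catch (SAXException e) {
            return "SAXException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```

In this sketch the unfiltered run serializes the whole document, while the filtered run fails on the first element inside the body, mirroring the report that the XML handler works alone but not behind the body handler.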





[jira] [Commented] (TIKA-1774) org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared

2015-10-18 Thread Steve K (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962496#comment-14962496
 ] 

Steve K commented on TIKA-1774:
---

Thanks Nick. Overlooked this. That makes sense, but then the example on:
https://tika.apache.org/1.10/examples.html#Parsing_to_XHTML
should be removed as it suggests the chaining of BodyContentHandler and 
ToXMLContentHandler.
This suggests that well-formed XML parts are supported whereas it seems only a 
full well-formed complete XML document is supported.



