[jira] [Updated] (TIKA-291) Adobe InDesign support

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-291:
-
Labels: new-parser  (was: )

 Adobe InDesign support
 --

 Key: TIKA-291
 URL: https://issues.apache.org/jira/browse/TIKA-291
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Jukka Zitting
Priority: Minor
  Labels: new-parser
 Attachments: simple_test-1.indd


 It would be great if Tika could extract content from Adobe InDesign documents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-94) Speech recognition

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-94:

Labels: new-parser  (was: )

 Speech recognition
 --

 Key: TIKA-94
 URL: https://issues.apache.org/jira/browse/TIKA-94
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Priority: Minor
  Labels: new-parser

 Like OCR for image files (TIKA-93), we could try using speech recognition to 
 extract text content (where available) from audio (and video!) files.
 The CMU Sphinx engine (http://cmusphinx.sourceforge.net/) looks promising and 
 comes with a friendly license.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-289) Add magic byte patterns from file(1)

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-289:
-
Labels: new-parser  (was: )

 Add magic byte patterns from file(1)
 

 Key: TIKA-289
 URL: https://issues.apache.org/jira/browse/TIKA-289
 Project: Tika
  Issue Type: Improvement
  Components: mime
Reporter: Jukka Zitting
Priority: Minor
  Labels: new-parser
 Attachments: file-has-magic-tika-missing.txt, file-mimes-missing.txt


 As discussed in TIKA-285, the file(1) command comes with a pretty 
 comprehensive set of magic byte patterns. It would be nice to get those 
 patterns included in Tika as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-617) Series of exceptions from PDFBox

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-617.

Resolution: Won't Fix

The underlying exception is 
{code}
Caused by: java.util.zip.DataFormatException: invalid distance too far back
at java.util.zip.Inflater.inflateBytes(Native Method)
{code}

So, I'm closing this as Won't Fix. If anyone objects, please reopen.

 Series of exceptions from PDFBox
 

 Key: TIKA-617
 URL: https://issues.apache.org/jira/browse/TIKA-617
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Erik Hetzner

 Hi,
 I am getting the following exception from PDFBox. Thank you!
 (If I should file these upstream at PDFBox first, please let me know.)
 {noformat}
 $ java -jar tika-app-1.0-SNAPSHOT.jar 
 http://www.arb.ca.gov/research/apr/past/01-340.pdf > /dev/null
 ERROR - Stop reading corrupt stream
 INFO - unsupported/disabled operation: f24.481
 INFO - unsupported/disabled operation: ree)n.
 WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot 
 be cast to org.apache.pdfbox.cos.COSArray
 java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast 
 to org.apache.pdfbox.cos.COSArray
   at 
 org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
   at 
 org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
   at 
 org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
   at 
 org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
 INFO - unsupported/disabled operation: i-
 INFO - unsupported/disabled operation: R4%
 INFO - unsupported/disabled operation: )
 INFO - unsupported/disabled operation: Re.8
 INFO - unsupported/disabled operation: e.
 INFO - unsupported/disabled operation: FE)-
 WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot 
 be cast to org.apache.pdfbox.cos.COSArray
 java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast 
 to org.apache.pdfbox.cos.COSArray
   at 
 org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
   at 
 org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
   at 
 org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
   at 
 org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
   at 
 org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
 INFO - unsupported/disabled operation: R3%
 INFO - unsupported/disabled operation: T
 Exception in thread main org.apache.tika.exception.TikaException: 
 Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5809fdee
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 

[jira] [Updated] (TIKA-627) Support X12 files

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-627:
-
Labels: new-parser  (was: )

 Support X12 files
 -

 Key: TIKA-627
 URL: https://issues.apache.org/jira/browse/TIKA-627
 Project: Tika
  Issue Type: New Feature
  Components: mime, parser
Reporter: Jukka Zitting
Priority: Minor
  Labels: new-parser

 X12 [1] is a standardized data interchange format. It would be nice if Tika 
 could understand such files.
 [1] http://en.wikipedia.org/wiki/ASC_X12



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-669) Backup plan for parsing

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-669.

Resolution: Duplicate

 Backup plan for parsing
 ---

 Key: TIKA-669
 URL: https://issues.apache.org/jira/browse/TIKA-669
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting

 Currently once a document type has been detected we direct the document to 
 the one parser that best matches the detected type. In practice there are 
 cases where that parser finds that it in fact cannot parse this document, for 
 example when something that looked like XML turns out to have syntax errors. 
 For such cases it would be nice if the CompositeParser could then retry 
 parsing the document with a more generic backup parser, like the plain text 
 parser for malformed XML.
 Implementing this would require some level of buffering and redirection of 
 both parser input and output. Input buffering is easy, but for output 
 buffering we'd probably need to implement new ContentHandler and Metadata 
 layers.
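 A minimal sketch of the fallback idea (an illustration only, not an existing Tika API; the class and method names are made up). It buffers the input bytes, parses them into a throwaway handler first, and only parses into the caller's handler once the primary parser is known to succeed, which stands in for the proper ContentHandler/Metadata buffering layers described above; on failure it falls back to the plain text parser:
{code}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class FallbackParsing {

    public static void parseWithFallback(Parser primary, InputStream in,
            ContentHandler handler, Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        // Input buffering: keep the whole stream so it can be re-read.
        byte[] buffered = IOUtils.toByteArray(in);

        boolean primaryOk;
        try {
            // First attempt goes into a throwaway handler and metadata, so a
            // failed parse leaves no partial output with the caller.
            primary.parse(new ByteArrayInputStream(buffered),
                    new BodyContentHandler(-1), new Metadata(), context);
            primaryOk = true;
        } catch (TikaException | SAXException e) {
            primaryOk = false;
        }

        if (primaryOk) {
            // The primary parser handled the document cleanly: parse again for real.
            primary.parse(new ByteArrayInputStream(buffered), handler, metadata, context);
        } else {
            // Backup plan: treat the document as plain text.
            new TXTParser().parse(new ByteArrayInputStream(buffered), handler, metadata, context);
        }
    }
}
{code}
 Parsing twice is obviously wasteful; a real implementation would buffer the SAX events and metadata of the first attempt instead, which is exactly the part the issue calls out as the hard bit.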



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-663) JSP files data extraction failed

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-663.
--
Resolution: Fixed

 JSP files data extraction failed
 

 Key: TIKA-663
 URL: https://issues.apache.org/jira/browse/TIKA-663
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.9
 Environment: Windows, Java 6
Reporter: samraj
 Attachments: File_1.jsp, File_2.jsp, File_3.jsp


 We have been working with Tika extraction. In 0.8, JSP file contents were 
 extracted well, but in 0.9 the same files are not extracted correctly. Please advise on a solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-651) Unescaped attribute value generated

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342499#comment-14342499
 ] 

Tyler Palsulich commented on TIKA-651:
--

Is there any update on the XML processing libraries that we use? Do we still want 
to change our dependencies? If not, I'll close this issue this week.

 Unescaped attribute value generated
 ---

 Key: TIKA-651
 URL: https://issues.apache.org/jira/browse/TIKA-651
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.9
Reporter: Raimund Merkert
Assignee: Jukka Zitting
 Attachments: XHTMLSerializer.java


 I've converted a Word document that contains hyperlinks with a complex query 
 component. The '&' character is not escaped, and Mozilla complains about that 
 when I write out the XHTML via a content handler that I wrote.
 It's not clear to me whether my ContentHandler should assume 
 attribute values are already properly escaped or not.
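 For what it's worth, the escaping a serializing ContentHandler has to do for attribute values is small. A sketch (the helper below is illustrative, not part of Tika):
{code}
// Escape the characters that are unsafe inside an XML/XHTML attribute value.
public final class AttributeEscaper {

    public static String escapeAttribute(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;   // e.g. query-string separators
                case '<':  out.append("&lt;");   break;
                case '"':  out.append("&quot;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
{code}
 With SAX, attribute values are passed around unescaped, so escaping them on output is the serializer's responsibility rather than something the events can be assumed to have done already.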



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-291) Adobe InDesign support

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-291.

Resolution: Duplicate

 Adobe InDesign support
 --

 Key: TIKA-291
 URL: https://issues.apache.org/jira/browse/TIKA-291
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Jukka Zitting
Priority: Minor
  Labels: new-parser
 Attachments: simple_test-1.indd


 It would be great if Tika could extract content from Adobe InDesign documents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Curating Issues

2015-03-01 Thread Mattmann, Chris A (3980)
You da man

Sent from my iPhone

 On Mar 1, 2015, at 2:36 PM, Tyler Palsulich tpalsul...@gmail.com wrote:
 
 Alright. I'm up to TIKA-694 and still goin'. :)
 
 I've started labeling some issues as "new-parser" and "newbie". I think
 these should be helpful for organization. Please let me know if there is
 another label we've already been using for those. I put "new-parser" on any
 requests to support a new filetype, even if it doesn't require a full-on
 Parser (e.g. just magic).
 
 "newbie" should be used for new contributors.
 
 I'll take no offense if someone reopens/closes anything after I've touched
 it.
 
 Tyler
 
 On Sat, Feb 28, 2015 at 11:59 PM, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 
 Hey Tyler if you want to take a whack, here are some criteria
 I tend to use:
 
 1. Bug report from 1+ years old.
  - Close it - either it's not reproducible, it was fixed in a later
 version and never revisited, or it's not as bad a bug anymore since
 it’s not a blocker.
 
 2. Feature request from 1+ years old that no one has acted upon.
 - Good candidate for closing - if it was important, someone would
 have acted upon it.
 
 3. Issue from 1+ years old with lots of discussion on it
  - Poke the issue - see if a consensus can be reached; if not,
 move forward and close.
 
 4. Issue that is your own that you aren’t interested in anymore
 and that is 1+ years old
  - Close it - you didn’t work on it then, may not get back to it,
 and no one else has.
 
 5. Issue that is 2+ years old
  - Close, regardless, unless it has a patch.
 
 6. Issue that is 1+ years old, with a patch, uncommitted
  - Try to apply the patch, or put in minimal effort to bring it
 current with trunk and apply it.
  - If it's too much work, ask for help.
  - If 1+ weeks pass and no one replies, close it and move forward.
 
 There are more, but that’s a start. I’ll check out this article -
 thanks for sending it.
 
 Cheers,
 Chris
 
 
 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++
 
 
 
 
 
 
 -Original Message-
 From: Tyler Palsulich tpalsul...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Saturday, February 28, 2015 at 8:53 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Curating Issues
 
 Hi Folks,
 
 I just read an article [0] about managing a large project's issues list.
 Tika currently has 331 open issues. Do we know if all of these have been
 triaged? At what point do we want to label an issue as stale and close
 it
 off? What is our preferred split between when to make an issue and when to
 send a message to the mailing list?
 
 Have a good weekend,
 Tyler
 
 [0] http://words.steveklabnik.com/how-to-be-an-open-source-gardener?r=1
 
 


[jira] [Commented] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342522#comment-14342522
 ] 

Tyler Palsulich commented on TIKA-715:
--

This seems like it's worth looking into. It would be awesome if someone could 
put together a list of Parsers which generate invalid XHTML and need attention.
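A rough way to build that list (a sketch, not an existing Tika utility; it assumes that re-parsing the serialized output with a strict SAX parser is a good enough well-formedness check):
{code}
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.SAXParserFactory;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormednessCheck {

    /** Returns true if the document's extracted XHTML is well formed. */
    public static boolean producesWellFormedXhtml(InputStream doc) {
        try {
            // Serialize the SAX events the parser produces ...
            ToXMLContentHandler xhtml = new ToXMLContentHandler();
            new AutoDetectParser().parse(doc, xhtml, new Metadata(), new ParseContext());
            byte[] serialized = xhtml.toString().getBytes(StandardCharsets.UTF_8);

            // ... and re-parse them with a strict, namespace-aware SAX parser.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new ByteArrayInputStream(serialized),
                    new DefaultHandler());
            return true;
        } catch (Exception e) {
            // Note: this conflates "couldn't parse the document at all"
            // with "parser emitted malformed XHTML".
            return false;
        }
    }
}
{code}
Running each parser's test documents through a check like this would flag the offenders without having to eyeball the SAX events by hand.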

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
 Fix For: 1.8

 Attachments: TIKA-715.patch


 With TIKA-683 I committed simple, commented-out code to
 SafeContentHandler to verify that the SAX events produced by the
 parser have valid (matched) tags.  I.e., each startElement(foo) is
 matched by the closing endElement(foo).
 I only did a basic nesting test, plus checking that a <p> is never
 embedded inside another <p>; we could strengthen this further to check
 that all tags only appear in valid parents...
 I was able to use this to fix issues with the new RTF parser
 (TIKA-683), but I was surprised that some other parsers failed the new
 asserts.
 It could be that these are relatively minor offenses (e.g. closing a <table>
 without closing the <tr>) and we need not do anything here... but I think
 it'd be cleaner if all our parsers produced matched, well-formed XHTML
 events.
 I haven't looked into any of these... it could be they are easy to fix.
 Failures:
 {noformat}
 testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
 Time elapsed: 0.032 sec   ERROR!
 java.lang.AssertionError: end tag=body with no startElement
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
   at 
 org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
 testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
 0.116 sec   ERROR!
 java.lang.AssertionError: mismatched elements open=tr close=table
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
   at 
 org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
   at 
 com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
   at 
 com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
   at 
 org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
   at 
 org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
 

[jira] [Commented] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2015-03-01 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342524#comment-14342524
 ] 

Ken Krugler commented on TIKA-539:
--

Hi Tyler - I see you closed this as fixed, but I don't remember the change that 
resolved it...do you have details?

 Encoding detection is too biased by encoding in meta tag
 

 Key: TIKA-539
 URL: https://issues.apache.org/jira/browse/TIKA-539
 Project: Tika
  Issue Type: Bug
  Components: metadata, parser
Affects Versions: 0.8, 0.9, 0.10
Reporter: Reinhard Schwab
Assignee: Ken Krugler
 Fix For: 1.8

 Attachments: TIKA-539.patch, TIKA-539_2.patch


 If the encoding in the meta tag is wrong, that encoding is detected,
 even if the right encoding was set in the metadata beforehand (which can,
 for example, come from the HTTP response header).
 Test code to reproduce:
 static String content = "<html><head>\n"
         + "<meta http-equiv=\"content-type\" "
         + "content=\"application/xhtml+xml; charset=iso-8859-1\" />"
         + "</head><body>Über den Wolken\n</body></html>";

 /**
  * @param args
  * @throws IOException
  * @throws TikaException
  * @throws SAXException
  */
 public static void main(String[] args) throws IOException, SAXException,
         TikaException {
     Metadata metadata = new Metadata();
     metadata.set(Metadata.CONTENT_TYPE, "text/html");
     metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
     InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
     AutoDetectParser parser = new AutoDetectParser();
     BodyContentHandler h = new BodyContentHandler(-1);
     parser.parse(in, h, metadata, new ParseContext());
     System.out.print(h.toString());
     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
 }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-727) Improve the outputted XHTML by HSLFExtractor

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342542#comment-14342542
 ] 

Tyler Palsulich commented on TIKA-727:
--

[~gagravarr], if you applied the above patch, is this issue good to close as 
Fixed?

 Improve the outputted XHTML by HSLFExtractor
 ---

 Key: TIKA-727
 URL: https://issues.apache.org/jira/browse/TIKA-727
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 0.10
Reporter: Pablo Queixalos
Priority: Minor
 Attachments: HSLFExtractor.java, HSLFExtractor.patch


 The XHTML output of the HSLFExtractor parser is not pure XHTML; it only inserts 
 the full text into a p[aragraph] tag (including non-HTML carriage returns).  
 This behavior comes from the poor capabilities that the POI 
 PowerPointExtractor offers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-740) SAX parser used for HTML

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-740.

Resolution: Won't Fix

 SAX parser used for HTML
 

 Key: TIKA-740
 URL: https://issues.apache.org/jira/browse/TIKA-740
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0
Reporter: Erik Hetzner
 Attachments: a221657.html


 {noformat}
 egh@gales[510] 1 :~/d/software/tika-trunk
 $ java  -jar tika-app/target/tika-app-1.0-SNAPSHOT.jar -v 
 http://www.almasry-alyoum.com/article2.aspx?ArticleID=221657 > /dev/null
 Exception in thread main org.apache.tika.exception.TikaException: XML parse 
 error
   at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:71)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:126)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:367)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:97)
 Caused by: org.xml.sax.SAXParseException: The element type "td" must be 
 terminated by the matching end-tag "</td>".
   at 
 com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:195)
   at 
 com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:174)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:388)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1414)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1749)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
   at 
 com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
   at 
 com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
   at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:65)
   ... 6 more
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-758) Address TODOs when we upgrade to next PDFBox release

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342589#comment-14342589
 ] 

Tyler Palsulich commented on TIKA-758:
--

[~talli...@apache.org], now that we're at PDFBox 1.8.8, can we remove the 
workaround? I removed it locally and all tests pass.

 Address TODOs when we upgrade to next PDFBox release
 

 Key: TIKA-758
 URL: https://issues.apache.org/jira/browse/TIKA-758
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
 Attachments: TIKA-758.Palsulich.061714.patch


 Like TIKA-757 for POI, I'm opening this blanket issue to address any TODOs in 
 the code when we next upgrade PDFBox.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342616#comment-14342616
 ] 

Tyler Palsulich commented on TIKA-819:
--

Is there still interest in this cursory option? It shouldn't be difficult to 
add, if so!

 Make Option to Exclude Embedded Files' Text for Text Content
 

 Key: TIKA-819
 URL: https://issues.apache.org/jira/browse/TIKA-819
 Project: Tika
  Issue Type: New Feature
  Components: general
Affects Versions: 1.0
 Environment: Windows-7 + JDK 1.6 u26
Reporter: Albert L.
 Fix For: 1.8


 It would be nice to be able to disable text content from embedded files.
 For example, if I have a DOCX with an embedded PPTX, then I would like the 
 option to disable text from the PPTX from showing up when asking for the text 
 content from DOCX.  In other words, it would be nice to have the option to 
 get text content *only* from the DOCX instead of the DOCX+PPTX.
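 For what it's worth, something close to this is already possible if the embedded document handling consults a DocumentSelector registered in the ParseContext (which I believe the ParsingEmbeddedDocumentExtractor does); a sketch, with the file name being a placeholder:
{code}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class OuterDocumentOnly {

    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        // Reject every embedded document, so only the container's own text is kept.
        context.set(DocumentSelector.class, new DocumentSelector() {
            public boolean select(Metadata metadata) {
                return false;
            }
        });

        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get("outer.docx"))) {
            parser.parse(in, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}
{code}
 A selector that returns true for some embedded types and false for others would give finer control than the blanket on/off switch requested here.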



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-807) PHP version of Tika

2015-03-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-807.

Resolution: Fixed

I think this is old enough to close, especially with an actively developed 
downstream library. 

 PHP version of Tika
 ---

 Key: TIKA-807
 URL: https://issues.apache.org/jira/browse/TIKA-807
 Project: Tika
  Issue Type: New Feature
  Components: packaging
Reporter: Ingo Renner
Priority: Minor
  Labels: PHP

 Inspired by TIKA-773 the outcome of this issue should be a PHP 
 library/wrapper to easily work with Tika in PHP applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-821) Support detecting old Microsoft Works Word Processor formats

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-821:
-
Labels: new-parser  (was: )

 Support detecting old Microsoft Works Word Processor formats
 

 Key: TIKA-821
 URL: https://issues.apache.org/jira/browse/TIKA-821
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.1
Reporter: Antoni Mylka
Assignee: Antoni Mylka
  Labels: new-parser

 An issue similar to TIKA-812. This time it's about old Works Word Processor 
 formats. They use an OLE2 structure, but the top-level entry is called 
 "MatOST", and they are not supported by the OfficeParser. I would like to:
  # Add a magic to tika-mimetypes.xml to mark the file as ms-works if "MatOST" 
 is found. (After TIKA-806 we officially like those.)
  # Add an 'if' to POIFSContainerDetector to look for "MatOST".
 I'm not creating a separate media type for this (like I did in TIKA-812) 
 because no parser supports it anyway. In TIKA-812 it was necessary, because 
 ExcelParser can't work with all vnd.ms-works files but can work with 7.0 
 spreadsheets. In this case there is no gain in a separate mime type.
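 The detection check itself is small. A hypothetical standalone version of the 'if' described above, using POI to look for the "MatOST" top-level entry (the real change would live inside POIFSContainerDetector):
{code}
import java.io.File;
import java.io.IOException;

import org.apache.poi.poifs.filesystem.NPOIFSFileSystem;

public class WorksDetectionSketch {

    /** True if the OLE2 container has the "MatOST" top-level entry of old Works docs. */
    public static boolean looksLikeOldWorksWordProcessor(File file) throws IOException {
        NPOIFSFileSystem fs = new NPOIFSFileSystem(file, true);   // read-only
        try {
            return fs.getRoot().hasEntry("MatOST");
        } finally {
            fs.close();
        }
    }
}
{code}
 In POIFSContainerDetector the same test would be an entry-name lookup on the directory it has already read, so no extra I/O would be needed.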



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-676) Boilerpipe fails

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-676.
--
Resolution: Fixed

No exception is thrown for the file with Tika 1.8-SNAPSHOT, so I'm closing this 
as fixed. Please open a new issue for upgrading the dependency if relevant.

 Boilerpipe fails
 

 Key: TIKA-676
 URL: https://issues.apache.org/jira/browse/TIKA-676
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Gabriele Kahlout
Priority: Minor

 This is apparently a 
 [boilerpipe issue|http://code.google.com/p/boilerpipe/issues/detail?id=24], which they fixed in the 
 [Web API edition|http://boilerpipe-web.appspot.com/]. 
 {code}
 $ curl --fail -L http://thisrecording.com/the-past | java -jar 
 tika-app-0.9.jar -T
   % Total% Received % Xferd  Average Speed   TimeTime Time  
 Current
  Dload  Upload   Total   SpentLeft  Speed
 100 656880 656880 0  17650  0 --:--:--  0:00:03 --:--:-- 
 18698Exception in thread main org.xml.sax.SAXException: SAX input contains 
 nested A elements -- You have probably hit a bug in your HTML parser (e.g., 
 NekoHTML bug #2909310). Please clean the HTML externally and feed it to 
 boilerpipe again
 100  128k0  128k0 0  32019  0 --:--:--  0:00:04 --:--:-- 33735
   at 
 de.l3s.boilerpipe.sax.CommonTagActions$2.start(CommonTagActions.java:108)
   at 
 de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.startElement(BoilerpipeHTMLContentHandler.java:169)
   at 
 org.apache.tika.parser.html.BoilerpipeContentHandler.startElement(BoilerpipeContentHandler.java:195)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:279)
   at 
 org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:197)
   at 
 org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
   at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
   at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
   at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
   at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:565)
   at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
   at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-676) Boilerpipe fails

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342513#comment-14342513
 ] 

Tyler Palsulich commented on TIKA-676:
--

There is a 
[fork|http://search.maven.org/#artifactdetails%7Ccom.robbypond%7Cboilerpipe%7C1.2.3%7Cjar]
 of Boilerpipe available on Maven Central. Should we switch to that? I'd prefer 
to stay with the main project, but it doesn't appear to be available in Central.

 Boilerpipe fails
 

 Key: TIKA-676
 URL: https://issues.apache.org/jira/browse/TIKA-676
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Gabriele Kahlout
Priority: Minor

 This is apparently a 
 [boilerpipe issue|http://code.google.com/p/boilerpipe/issues/detail?id=24], which they fixed in the 
 [Web API edition|http://boilerpipe-web.appspot.com/]. 
 {code}
 $ curl --fail -L http://thisrecording.com/the-past | java -jar 
 tika-app-0.9.jar -T
   % Total% Received % Xferd  Average Speed   TimeTime Time  
 Current
  Dload  Upload   Total   SpentLeft  Speed
 100 656880 656880 0  17650  0 --:--:--  0:00:03 --:--:-- 
 18698Exception in thread main org.xml.sax.SAXException: SAX input contains 
 nested A elements -- You have probably hit a bug in your HTML parser (e.g., 
 NekoHTML bug #2909310). Please clean the HTML externally and feed it to 
 boilerpipe again
 100  128k0  128k0 0  32019  0 --:--:--  0:00:04 --:--:-- 33735
   at 
 de.l3s.boilerpipe.sax.CommonTagActions$2.start(CommonTagActions.java:108)
   at 
 de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.startElement(BoilerpipeHTMLContentHandler.java:169)
   at 
 org.apache.tika.parser.html.BoilerpipeContentHandler.startElement(BoilerpipeContentHandler.java:195)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:279)
   at 
 org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:197)
   at 
 org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
   at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
   at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
   at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
   at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:565)
   at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
   at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Curating Issues

2015-03-01 Thread Tyler Palsulich
Alright. I'm up to TIKA-694 and still goin'. :)

I've started labeling some issues as "new-parser" and "newbie". I think
these should be helpful for organization. Please let me know if there is
another label we've already been using for those. I put "new-parser" on any
requests to support a new filetype, even if it doesn't require a full-on
Parser (e.g. just magic).

"newbie" should be used for new contributors.

I'll take no offense if someone reopens/closes anything after I've touched
it.

Tyler

On Sat, Feb 28, 2015 at 11:59 PM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hey Tyler if you want to take a whack, here are some criteria
 I tend to use:

 1. Bug report from 1+ years old.
  - Close it - either it's not reproducible, it was fixed in a later
 version and never revisited, or it's not as bad a bug anymore since
 it’s not a blocker.
 
 2. Feature request from 1+ years old that no one has acted upon.
 - Good candidate for closing - if it was important, someone would
 have acted upon it.
 
 3. Issue from 1+ years old with lots of discussion on it
  - Poke the issue - see if a consensus can be reached; if not,
 move forward and close.
 
 4. Issue that is your own that you aren’t interested in anymore
 and that is 1+ years old
  - Close it - you didn’t work on it then, may not get back to it,
 and no one else has.
 
 5. Issue that is 2+ years old
  - Close, regardless, unless it has a patch.
 
 6. Issue that is 1+ years old, with a patch, uncommitted
  - Try to apply the patch, or put in minimal effort to bring it
 current with trunk and apply it.
  - If it's too much work, ask for help.
  - If 1+ weeks pass and no one replies, close it and move forward.

 There are more, but that’s a start. I’ll check out this article -
 thanks for sending it.

 Cheers,
 Chris


 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Tyler Palsulich tpalsul...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Saturday, February 28, 2015 at 8:53 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Curating Issues

 Hi Folks,
 
 I just read an article [0] about managing a large project's issues list.
 Tika currently has 331 open issues. Do we know if all of these have been
 triaged? At what point do we want to label an issue as stale and close
 it
 off? What is our preferred split between when to make an issue and when to
 send a message to the mailing list?
 
 Have a good weekend,
 Tyler
 
 [0] http://words.steveklabnik.com/how-to-be-an-open-source-gardener?r=1




[jira] [Closed] (TIKA-694) On extraction, get properties AND / OR content extraction

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-694.

Resolution: Won't Fix

 On extraction, get properties AND / OR content extraction
 -

 Key: TIKA-694
 URL: https://issues.apache.org/jira/browse/TIKA-694
 Project: Tika
  Issue Type: Wish
  Components: parser
Affects Versions: 1.0
 Environment: All OS
Reporter: Etienne Jouvin
Priority: Minor
 Attachments: Tika-1.0.zip


 I use Tika to extract properties, and only properties, from Office files.
 The parser goes through the document content, which is not necessary and slows 
 down the process.
 It would be nice to have the choice of extracting only properties or not.
 What I did was the following:
 I extended AutoDetectParser to override the parse method.
 Then, in the ParseContext instance, I put a boolean flag set to true to say "only 
 extract the properties".
 And, for example, for Office files I extended the OfficeParser class. During the parse 
 method, I check the flag, and if it equals true, I skip all extraction 
 of the content.
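 For reference, the "properties only" wish can be approximated with the existing API: metadata ends up in the Metadata object regardless of the ContentHandler, so passing a handler that discards every SAX event skips content accumulation (the parser still walks the document, though, so the speed concern above remains). A sketch, with the file name as a placeholder:
{code}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.helpers.DefaultHandler;

public class PropertiesOnly {

    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get("report.doc"))) {
            // DefaultHandler ignores every SAX event, so no text content is kept.
            new AutoDetectParser().parse(in, new DefaultHandler(), metadata, new ParseContext());
        }
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
{code}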



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-354) ProfilingHandler should take a length-limiting parameter

2015-03-01 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342529#comment-14342529
 ] 

Ken Krugler commented on TIKA-354:
--

Better speed is still important; the 2x improvement from TIKA-1549 is good, but 
it means that now only 45% of the web crawl time is spent determining the 
language, versus 90% before. However, the right way to do this (with a new detector 
library) is to sample internally until the target confidence is reached, rather than the 
caller having to decide how much text to analyze.

So net-net, yes, I think this can be closed.

 ProfilingHandler should take a length-limiting parameter
 

 Key: TIKA-354
 URL: https://issues.apache.org/jira/browse/TIKA-354
 Project: Tika
  Issue Type: Improvement
  Components: languageidentifier
Affects Versions: 0.5
Reporter: Vivek Magotra
Assignee: Ken Krugler
 Attachments: TIKA-354-2.patch, TIKA-354.patch


 ProfilingHandler currently parses the entire document (thereby analyzing 
 n-grams for the entire doc).
 ProfilingHandler should take a length-limiting parameter that allows a user 
 to specify the amount of data that should get analyzed.
 In fact, by default that limit should be set to something like 8K.
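 A sketch of the length-limiting idea as a handler decorator (not existing Tika API; the class is made up): stop forwarding character events once a cap is reached, so the wrapped handler only ever sees the first N characters.
{code}
import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class LengthLimitingHandler extends ContentHandlerDecorator {

    private final int maxChars;   // e.g. 8 * 1024, as suggested above
    private int seen = 0;

    public LengthLimitingHandler(ContentHandler delegate, int maxChars) {
        super(delegate);
        this.maxChars = maxChars;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (seen >= maxChars) {
            return;                                   // cap reached: drop further text
        }
        int allowed = Math.min(length, maxChars - seen);
        seen += allowed;
        super.characters(ch, start, allowed);         // forward only the allowed prefix
    }
}
{code}
 Wrapping a ProfilingHandler in such a decorator with an 8 KB cap would approximate the default suggested above.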



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-369) Improve accuracy of language detection

2015-03-01 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342531#comment-14342531
 ] 

Ken Krugler commented on TIKA-369:
--

Hi Tyler - detection speed is an issue, but Tika also suffered on accuracy. 
In Mike McCandless's tests, Tika was both 10x slower than language-detection 
and had about a 3.5x higher error rate IIRC (2.8% error rate vs. 0.8%).

I think this issue should be left open, as it has interesting details on 
possible replacements for the current code that I don't think we want to lose.

 Improve accuracy of language detection
 --

 Key: TIKA-369
 URL: https://issues.apache.org/jira/browse/TIKA-369
 Project: Tika
  Issue Type: Improvement
  Components: languageidentifier
Affects Versions: 0.6
Reporter: Ken Krugler
Assignee: Ken Krugler
 Attachments: Surprise and Coincidence.pdf, lingdet-mccs.pdf, 
 textcat.pdf


 Currently the LanguageProfile code uses 3-grams to find the best language 
 profile using Pearson's chi-square test. This has three issues:
 1. The results aren't very good for short runs of text. Ted Dunning's paper 
 (attached) indicates that a log-likelihood ratio (LLR) test works much 
 better, which would then make language detection faster due to less text 
 needing to be processed. (The two statistics are sketched after this list.)
 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact 
 value as a threshold for certainty. This is very sensitive to the amount of 
 text being processed, and thus gives false negative results for short runs of 
 text.
 3. Certainty should also be based on how much better the result is for 
 language X, compared to the next best language. If two languages both had 
 identical sum-of-squares values, and this value was below the threshold, then 
 the result is still not very certain.
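 For reference (generic forms, not taken from the issue), with O_i the observed count of 3-gram i in the text and E_i the count expected under a candidate language profile, the two statistics being contrasted are:
{noformat}
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
\qquad
G^2 = 2 \sum_i O_i \ln\frac{O_i}{E_i}
{noformat}
 Dunning's paper works with the G^2 (log-likelihood ratio) form, applied to per-n-gram contingency counts; its practical advantage is that it stays well behaved for the small counts produced by short runs of text, which is exactly the weakness described in point 1.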



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-756) XMP output from Tika CLI

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342590#comment-14342590
 ] 

Tyler Palsulich commented on TIKA-756:
--

The only blocker on this is tika-xmp having a dependency on tika-parsers, right?

 XMP output from Tika CLI
 

 Key: TIKA-756
 URL: https://issues.apache.org/jira/browse/TIKA-756
 Project: Tika
  Issue Type: New Feature
  Components: cli, metadata
Reporter: Jukka Zitting
Assignee: Jörg Ehrlich
  Labels: metadata, xmp
 Attachments: tika-xmp.patch, tika-xmp_styleAndHeader.patch


 It would be great if the Tika CLI could output metadata also in the XMP 
 format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-788) DWG parser infinite loop on possibly corrupt file

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342609#comment-14342609
 ] 

Tyler Palsulich commented on TIKA-788:
--

[~seegler], it looks like your stack trace is related to parsing an mp3 file.

Does anyone have a dwg file that triggers this error? Ideally, they would also 
have the set of Metadata values extracted by AutoCAD.

 DWG parser infinite loop on possibly corrupt file
 -

 Key: TIKA-788
 URL: https://issues.apache.org/jira/browse/TIKA-788
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0
Reporter: Stas Shaposhnikov

 When parsing some dwg items, it is possible that the parser may cause itself 
 to go into an infinite loop.
 Attached is the file causing the problem.
 Here is a possible patch that will at least proceed until an error is thrown.
 {noformat}
 === modified file 
 'tika-parsers/src/main/java/org/apache/tika/parser/dwg/DWGParser.java'
 --- tika-parsers/src/main/java/org/apache/tika/parser/dwg/DWGParser.java  
   2011-11-24 11:30:33 +
 +++ tika-parsers/src/main/java/org/apache/tika/parser/dwg/DWGParser.java  
   2011-11-25 05:27:41 +
 @@ -274,8 +274,10 @@
  return false;
  }
   while (toSkip > 0) {
 -byte[] skip = new byte[Math.min((int) toSkip, 0x4000)];
 -IOUtils.readFully(stream, skip);
 +byte[] skip = new byte[(int) Math.min(toSkip, 0x4000)];
 +if (IOUtils.readFully(stream, skip) == -1) {
 +   return false; //invalid skip
 +}
  toSkip -= skip.length;
  }
  return true;
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-634) Command Line Parser for Metadata Extraction

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342496#comment-14342496
 ] 

Tyler Palsulich commented on TIKA-634:
--

[~gagravarr], is this issue still relevant?

 Command Line Parser for Metadata Extraction
 ---

 Key: TIKA-634
 URL: https://issues.apache.org/jira/browse/TIKA-634
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 0.9
Reporter: Nick Burch
Assignee: Nick Burch
Priority: Minor

 As discussed on the mailing list:
 http://mail-archives.apache.org/mod_mbox/tika-dev/201104.mbox/%3calpine.deb.2.00.1104052028380.29...@urchin.earth.li%3E
 This issue is to track improvements in the ExternalParser support to handle 
 metadata extraction, and probably easier configuration of an external parser 
 too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-682) Creative Suite formats support

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-682:
-
Labels: new-parser  (was: )

 Creative Suite formats support
 --

 Key: TIKA-682
 URL: https://issues.apache.org/jira/browse/TIKA-682
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.8
Reporter: Vivian Li
  Labels: new-parser
 Attachments: Untitled-1.indd, myfile.psd, myfile.xmp


 Is it possible to support Creative Suite formats, such as PSD, InDesign, 
 etc.? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-682) Creative Suite formats support

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-682:
-
Affects Version/s: (was: 0.9)
   1.8

 Creative Suite formats support
 --

 Key: TIKA-682
 URL: https://issues.apache.org/jira/browse/TIKA-682
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.8
Reporter: Vivian Li
  Labels: new-parser
 Attachments: Untitled-1.indd, myfile.psd, myfile.xmp


 Is it possible to support Creative Suite formats, such as PSD, InDesign, 
 etc.? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-682) Creative Suite formats support

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-682:
-
Component/s: (was: metadata)
 parser

 Creative Suite formats support
 --

 Key: TIKA-682
 URL: https://issues.apache.org/jira/browse/TIKA-682
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.8
Reporter: Vivian Li
  Labels: new-parser
 Attachments: Untitled-1.indd, myfile.psd, myfile.xmp


 Is it possible to support Creative Suite formats, such as PSD, InDesign, 
 etc.? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-712) Master slide text isn't extracted

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342518#comment-14342518
 ] 

Tyler Palsulich commented on TIKA-712:
--

Is there any update on this? Otherwise, I'll close it as Won't Fix later this 
week.

 Master slide text isn't extracted
 -

 Key: TIKA-712
 URL: https://issues.apache.org/jira/browse/TIKA-712
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Attachments: TIKA-712-master-slide.xml, TIKA-712.patch, 
 TIKA-712.patch, testPPT_masterFooter.ppt, testPPT_masterFooter.pptx, 
 testPPT_masterFooter2.ppt, testPPT_masterFooter2.pptx


 It looks like we are not getting text from the master slide for PPT
 and PPTX.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-713) Tika can not parse all of the persian pdf files

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-713.
--
Resolution: Fixed

 Tika can not parse all of the persian pdf files
 ---

 Key: TIKA-713
 URL: https://issues.apache.org/jira/browse/TIKA-713
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.9
Reporter: Ahmad Ajiloo
 Attachments: Complex.pdf, Simple2.pdf, Simple3.pdf, ebrat.pdf


 Hello
 I used Tika (via Nutch) to parse some Persian PDF files. Some of the 
 files were converted cleanly to plain text, but for some of them the output was 
 corrupted. I used the ICU4J v4 library and the text changed to right-to-left 
 mode, but that did not resolve the problem: Tika cannot 
 recognize any character of the input Persian PDF file!
 {quote}
 I copied this text from my PDF file via Document Viewer in Linux; it is 
 clearly Persian text:
 --
 ‫هر روز پس از نماز صبح، سوره مباركه الرحمن را تا فباي آلاء ربكما تكذبان 
 بخواند.‬
 ‫) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط عثمانطه تقريبا يك نصف 
 صفحه است. (‬
 ‫همچنين در روايات از حضرت رسول )ص( و ائمه اطهار )ع( آمده كه چند چيز براي قوت 
 حافظه مفيد است:‬
 ‫1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا آيه الكرسي‬
 ‫4- خوردن عسل‬ ‫5- خوردن عدس 6- خوردن گوشت نزديک گردن
 --
 Tika returns this output:
 --
  92   @A   8 * B
C9D  !D   ) (?)   =/

  
  () ,8 ;  
  8 #
+  9!: 
  L
   #)4   M() * 0
  * -3IA J  
   - 2   (+   G
  H  -1
  (+ J 5#+C 0T J (+  O - 6R . (+  O - 5 PH. (+  O -4
 --
 {quote}
 thanks a lot



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342530#comment-14342530
 ] 

Tyler Palsulich commented on TIKA-539:
--

Hi [~kkrugler]. I didn't have a specific fix in mind when I closed it, but I 
saw the two related issues had been resolved with no recent commentary.

Apologies if the closure was premature.

 Encoding detection is too biased by encoding in meta tag
 

 Key: TIKA-539
 URL: https://issues.apache.org/jira/browse/TIKA-539
 Project: Tika
  Issue Type: Bug
  Components: metadata, parser
Affects Versions: 0.8, 0.9, 0.10
Reporter: Reinhard Schwab
Assignee: Ken Krugler
 Fix For: 1.8

 Attachments: TIKA-539.patch, TIKA-539_2.patch


 If the encoding in the meta tag is wrong, that encoding is detected,
 even if the right encoding was set in the metadata beforehand (which can,
 for example, come from the HTTP response header).
 Test code to reproduce:
 static String content = "<html><head>\n"
         + "<meta http-equiv=\"content-type\" "
         + "content=\"application/xhtml+xml; charset=iso-8859-1\" />"
         + "</head><body>Über den Wolken\n</body></html>";

 /**
  * @param args
  * @throws IOException
  * @throws TikaException
  * @throws SAXException
  */
 public static void main(String[] args) throws IOException, SAXException,
         TikaException {
     Metadata metadata = new Metadata();
     metadata.set(Metadata.CONTENT_TYPE, "text/html");
     metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
     InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
     AutoDetectParser parser = new AutoDetectParser();
     BodyContentHandler h = new BodyContentHandler(-1);
     parser.parse(in, h, metadata, new ParseContext());
     System.out.print(h.toString());
     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
 }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-768) Parser for EDF files

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-768:
-
Labels: edf new-parser  (was: edf)

 Parser for EDF files
 

 Key: TIKA-768
 URL: https://issues.apache.org/jira/browse/TIKA-768
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Priority: Minor
  Labels: edf, new-parser

 In my spare time I'm occasionally working on biological signal processing, 
 and now I have a case where being able to extract normalized metadata from 
 EDF files (European Data Format, http://www.edfplus.info/) would be useful. 
 Thus it would be nice to add a simple Tika parser that understands this 
 format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-766) Trim down the NetCDF dependency

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342602#comment-14342602
 ] 

Tyler Palsulich commented on TIKA-766:
--

Do we need to look into this more? Now with GRIB support and the new ucar 
dependencies, we are using more of the functionality.

But, does anyone know if there are still licensing issues?

The size of tika-app is getting unwieldy, so issues like this are worth 
investigating.

 Trim down the NetCDF dependency
 ---

 Key: TIKA-766
 URL: https://issues.apache.org/jira/browse/TIKA-766
 Project: Tika
  Issue Type: Improvement
  Components: packaging, parser
Reporter: Jukka Zitting
Priority: Minor

 As noted in TIKA-763, the NetCDF dependency contains a few LGPL classes that 
 we should get rid of, ideally without the workaround added for TIKA-763.
 Additionally, with 4.2MB the NetCDF jar is pretty large and includes lots of 
 stuff that isn't really related to parsing NetCDF and HDF files.
 It would be nice if the NetCDF project could produce a separately packaged 
 read-only parser library that only contains the stuff needed by Tika.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-770) New ODF metadata keys

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342603#comment-14342603
 ] 

Tyler Palsulich commented on TIKA-770:
--

[~gagravarr], 3 years later, is it time?

 New ODF metadata keys
 -

 Key: TIKA-770
 URL: https://issues.apache.org/jira/browse/TIKA-770
 Project: Tika
  Issue Type: Improvement
  Components: metadata, parser
Reporter: Jukka Zitting
Priority: Minor
  Labels: odf

 Followup from TIKA-764.
 {quote}
 The 2nd step is to add a few extra common keys for the stats that ODF has 
 that aren't covered, then remove the non standard keys
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-821) Support detecting old Microsoft Works Word Processor formats

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-821.
--
Resolution: Fixed

Marking fixed based on committed comment.

 Support detecting old Microsoft Works Word Processor formats
 

 Key: TIKA-821
 URL: https://issues.apache.org/jira/browse/TIKA-821
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.1
Reporter: Antoni Mylka
Assignee: Antoni Mylka
  Labels: new-parser

 An issue similar to TIKA-812. This time it's about old Works Word Processor 
 formats. They use an OLE2 structure, but the top-level entry is called 
 MatOST, and they are not supported by the OfficeParser. I would like to:
  # Add a magic to tika-mimetypes.xml to mark the file as ms-works if MatOST 
 is found. (After TIKA-806 we officially like those).
  # Add an 'if' to POIFSContainerDetector to look for MatOST.
 I'm not creating a separate media type for this (like I did in TIKA-812) 
 because no parser supports it anyway. In TIKA-812 it was necessary, because 
 ExcelParser can't work with all vnd.ms-works files but can work with 7.0 
 spreadsheets. In this case there is no gain in a separate mime type.
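
A minimal sketch of the check step 2 describes, assuming the detector already has the set of top-level OLE2 entry names available (as POIFSContainerDetector does internally); the constant name below is illustrative:
{code}
import java.util.Set;
import org.apache.tika.mime.MediaType;

class WorksDetectionSketch {
    // TIKA-806 settled on application/vnd.ms-works for these files.
    private static final MediaType MS_WORKS = MediaType.application("vnd.ms-works");

    // Given the names of the top-level entries of an OLE2 container,
    // report ms-works when the MatOST entry is present.
    static MediaType detectFromEntryNames(Set<String> topLevelNames) {
        if (topLevelNames.contains("MatOST")) {
            return MS_WORKS;
        }
        return MediaType.OCTET_STREAM;
    }
}
{code}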



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-630) Dealing with PDF documents from scanning programs

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-630.
--
Resolution: Fixed

 Dealing with PDF documents from scanning programs
 -

 Key: TIKA-630
 URL: https://issues.apache.org/jira/browse/TIKA-630
 Project: Tika
  Issue Type: Improvement
  Components: general
Affects Versions: 0.10
Reporter: Joseph Vychtrle
Priority: Minor
  Labels: ocr, pdf,

 Hey,
 sorry I didn't post this to the mailing list; I never got the confirmation.
 The issue is that people often don't even realize there is a difference between 
 PDF documents exported from OpenOffice/MS Office and PDFs produced by scanner 
 software. If Tika processes such a document, it detects the PDF content 
 type, but there are only images in there. I don't know how to deal with that. 
 There should be a function that decides on the type of PDF document, so that I 
 can take a scanned PDF and run it through OCR software.
 If there is a way to do that, could anybody please explain how to do it?
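
Until Tika offers such a function, one workable heuristic is to treat a PDF that yields no extractable text as scanned and route it to OCR. A rough sketch of that check, nothing more:
{code}
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

class ScannedPdfCheck {
    // Returns true when Tika extracts no visible text from the document,
    // which usually means the pages are images only and OCR is needed.
    static boolean looksScanned(InputStream pdf) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1);
        new AutoDetectParser().parse(pdf, handler, new Metadata(), new ParseContext());
        return handler.toString().trim().isEmpty();
    }
}
{code}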



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-675) PackageExtractor should track names of recursively nested resources

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342507#comment-14342507
 ] 

Tyler Palsulich commented on TIKA-675:
--

Is this still worth implementing? [~gagravarr], if you decide on metadata keys, 
I can take a crack at implementing this. But, I'm not sure it'd be quick.

 PackageExtractor should track names of recursively nested resources
 ---

 Key: TIKA-675
 URL: https://issues.apache.org/jira/browse/TIKA-675
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 0.10
Reporter: Andrzej Bialecki 

 When parsing archive formats the hierarchy of names is not tracked; only the 
 current embedded component's name is preserved under 
 Metadata.RESOURCE_NAME_KEY. In a way similar to the VFS model, it would be 
 nice to build pseudo-URLs for nested resources. In the case of the Tika API that uses 
 streams this could look like 
 {code}tar:gz:stream://example.tar.gz!/example.tar!/example.html{code} ...or 
 otherwise track the parent-child relationship - e.g. some applications need 
 this information to indicate what composite documents to delete from the 
 index after a container archive has been deleted.
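
A hedged sketch of how an application could track such pseudo-URLs itself today while handling embedded documents; the metadata key name is made up, since defining a real one is exactly what this issue asks for:
{code}
import org.apache.tika.metadata.Metadata;

class NestedNameTracker {
    // Hypothetical key; Tika does not define one for this yet.
    static final String EMBEDDED_PATH = "X-Embedded-Resource-Path";

    // Append the child's name to the parent's pseudo-URL, e.g.
    // example.tar.gz!/example.tar!/example.html
    static void recordChild(Metadata parent, Metadata child, String childName) {
        String parentPath = parent.get(EMBEDDED_PATH);
        if (parentPath == null) {
            parentPath = parent.get(Metadata.RESOURCE_NAME_KEY);
        }
        if (parentPath == null) {
            parentPath = "";
        }
        child.set(EMBEDDED_PATH, parentPath + "!/" + childName);
    }
}
{code}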



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-369) Improve accuracy of language detection

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342533#comment-14342533
 ] 

Tyler Palsulich commented on TIKA-369:
--

Thanks, Ken! In that case, I definitely agree.

 Improve accuracy of language detection
 --

 Key: TIKA-369
 URL: https://issues.apache.org/jira/browse/TIKA-369
 Project: Tika
  Issue Type: Improvement
  Components: languageidentifier
Affects Versions: 0.6
Reporter: Ken Krugler
Assignee: Ken Krugler
 Attachments: Surprise and Coincidence.pdf, lingdet-mccs.pdf, 
 textcat.pdf


 Currently the LanguageProfile code uses 3-grams to find the best language 
 profile using Pearson's chi-square test. This has three issues:
 1. The results aren't very good for short runs of text. Ted Dunning's paper 
 (attached) indicates that a log-likelihood ratio (LLR) test works much 
 better, which would then make language detection faster due to less text 
 needing to be processed.
 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact 
 value as a threshold for certainty. This is very sensitive to the amount of 
 text being processed, and thus gives false negative results for short runs of 
 text.
 3. Certainty should also be based on how much better the result is for 
 language X, compared to the next best language. If two languages both had 
 identical sum-of-squares values, and this value was below the threshold, then 
 the result is still not very certain.
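
For context, this is the API under discussion; the fixed-threshold check criticized in point 2 is the boolean at the end:
{code}
import org.apache.tika.language.LanguageIdentifier;

class LangIdExample {
    public static void main(String[] args) {
        String text = "Ceci est un petit exemple de texte en français.";
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        // getLanguage() returns the best-matching ISO 639 code, e.g. "fr"
        System.out.println(identifier.getLanguage());
        // isReasonablyCertain() compares the profile distance to a fixed threshold
        System.out.println(identifier.isReasonablyCertain());
    }
}
{code}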



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-774) ExifTool Parser

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342605#comment-14342605
 ] 

Tyler Palsulich commented on TIKA-774:
--

Do we still want to integrate this? Is this a semi duplicate of TIKA-762? I 
agree that we should create another conflicting Parser for image types.

 ExifTool Parser
 ---

 Key: TIKA-774
 URL: https://issues.apache.org/jira/browse/TIKA-774
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.0
 Environment: Requires ExifTool to be installed 
 (http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
  Labels: features, newbie, patch,
 Fix For: 1.8

 Attachments: testJPEG_IPTC_EXT.jpg, 
 tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt


 Adds an external parser that calls ExifTool to extract extended metadata 
 fields from images and other content types.
 In the core project:
 An ExifTool interface is added which contains Property objects that define 
 the metadata fields available.
 An additional Property constructor for internalTextBag type.
 In the parsers project:
 An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
 on the command line and mapping the response to tika metadata fields.  This 
 extractor could be called instead of or in addition to the existing 
 ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
 JpegParser but those have not been changed at this time.
 An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
 An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
 metadata fields to existing tika and Drew Noakes metadata fields if enabled.
 An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
 implementations in XML files.
 An ExifToolParserTest is added which tests several expected XMP and IPTC 
 metadata values in testJPEG_IPTC_EXT.jpg.
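
Purely to illustrate the approach the patch takes (run ExifTool on the command line and map its output to Tika metadata), a simplified sketch assuming ExifTool's -S short output of one "Tag: value" pair per line; the real patch maps onto Property objects rather than raw keys:
{code}
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.tika.metadata.Metadata;

class ExiftoolSketch {
    static void extract(File image, Metadata metadata) throws Exception {
        Process p = new ProcessBuilder("exiftool", "-S", image.getAbsolutePath()).start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    // Prefix the raw ExifTool tag name to keep it apart from
                    // Tika's own keys; the prefix is purely illustrative.
                    metadata.add("exiftool:" + line.substring(0, colon).trim(),
                            line.substring(colon + 1).trim());
                }
            }
        }
        p.waitFor();
    }
}
{code}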



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-807) PHP version of Tika

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342610#comment-14342610
 ] 

Tyler Palsulich commented on TIKA-807:
--

[Here|https://github.com/pierroweb/PhpTikaWrapper] is one project which aims to 
do this. Should we leave this open, in case we want to integrate something 
within the Tika project?

 PHP version of Tika
 ---

 Key: TIKA-807
 URL: https://issues.apache.org/jira/browse/TIKA-807
 Project: Tika
  Issue Type: New Feature
  Components: packaging
Reporter: Ingo Renner
Priority: Minor
  Labels: PHP

 Inspired by TIKA-773 the outcome of this issue should be a PHP 
 library/wrapper to easily work with Tika in PHP applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-648) Parsing HTML anchors with embedded div faulty

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-648.

Resolution: Won't Fix

 Parsing HTML anchors with embedded div faulty
 -

 Key: TIKA-648
 URL: https://issues.apache.org/jira/browse/TIKA-648
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.9
Reporter: Markus Jelsma

 Using Nutch with Tika 0.9 I cannot extract both outlinks from a given page 
 [1]. This is because Tika doesn't return the document with the anchor text 
 embedded, and Nutch skips empty anchors when collecting outlinks.
 The raw HTML is:
 <a href="#"><div>bla 1</div></a>
 <a href="#">bla 2</a>
 But the parsed HTML with tika-app-1.0-SNAPSHOT.jar -h test.html is:
 <a shape="rect" href="#"/>bla 1
 <a shape="rect" href="#">bla 2</a>
 [1]: http://people.apache.org/~markus/test.html
 Also described on the Tika user list:
 http://search.lucidimagination.com/search/document/e74d7e72fd61543a/parsing_html_anchors_with_embedded_div_faulty



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-591) Separate launcher process for forking JVMs

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342521#comment-14342521
 ] 

Tyler Palsulich commented on TIKA-591:
--

I bring up tika-batch (from [~talli...@apache.org]) because it's meant to 
provide a way to reliably run Tika on a large collection of documents -- 
killing the processing when Tika seems to be hanging indefinitely. But, I'm not 
sure if it's in an entirely different JVM, or just a different thread -- or if 
that even matters in regards to this issue.

 Separate launcher process for forking JVMs
 -

 Key: TIKA-591
 URL: https://issues.apache.org/jira/browse/TIKA-591
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Jukka Zitting
Assignee: Jukka Zitting
Priority: Minor

 As a followup to TIKA-416, it would be good to implement at least optional 
 support for a separate launcher process for the ForkParser feature. The need 
 for such an extra process came up in JCR-2864 where a reference to 
 http://developers.sun.com/solaris/articles/subprocess/subprocess.html  was 
 made.
 To summarize, the problem is that the ProcessBuilder.start() call can result 
 in a temporary duplication of the memory space of the parent JVM. Even with 
 copy-on-write semantics this can be a fairly expensive operation and prone to 
 out-of-memory issues especially in large-scale deployments where the parent 
 JVM already uses the majority of the available RAM on a computer.
 A similar problem is also being discussed at HADOOP-5059.
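
For reference, a minimal usage sketch of the existing ForkParser feature that this issue builds on (Tika 1.x API); the proposed launcher process would sit between this call and the forked JVM:
{code}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

class ForkParserExample {
    public static void main(String[] args) throws Exception {
        // The parsing itself happens in a separate JVM forked by ForkParser.
        ForkParser parser = new ForkParser(
                ForkParserExample.class.getClassLoader(), new AutoDetectParser());
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            BodyContentHandler handler = new BodyContentHandler(-1);
            parser.parse(in, handler, new Metadata(), new ParseContext());
            System.out.println(handler.toString());
        } finally {
            parser.close();
        }
    }
}
{code}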



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-715:
-
Labels: newbie  (was: )

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
  Labels: newbie
 Fix For: 1.8

 Attachments: TIKA-715.patch


 With TIKA-683 I committed simple, commented out code to
 SafeContentHandler, to verify that the SAX events produced by the
 parser have valid (matched) tags.  Ie, each startElement(foo) is
 matched by the closing endElement(foo).
 I only did basic nesting test, plus checking that p is never
 embedded inside another p; we could strengthen this further to check
 that all tags only appear in valid parents...
 I was able to use this to fix issues with the new RTF parser
 (TIKA-683), but I was surprised that some other parsers failed the new
 asserts.
 It could be these are relatively minor offenses (eg closing a table
 w/o closing the tr) and we need not do anything here... but I think
 it'd be cleaner if all our parsers produced matched, well-formed XHTML
 events.
 I haven't looked into any of these... it could be they are easy to fix.
 Failures:
 {noformat}
 testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
 Time elapsed: 0.032 sec   ERROR!
 java.lang.AssertionError: end tag=body with no startElement
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
   at 
 org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
 testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
 0.116 sec   ERROR!
 java.lang.AssertionError: mismatched elements open=tr close=table
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
   at 
 org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
   at 
 com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
   at 
 com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
   at 
 org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
   at 
 org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
 0.025 sec   ERROR!
 java.lang.AssertionError: p inside p
   at 
 

[jira] [Commented] (TIKA-465) LanguageIdentifier API enhancements

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342528#comment-14342528
 ] 

Tyler Palsulich commented on TIKA-465:
--

[~kkrugler], I commented in case someone else had more context. So, if you're 
happy to close, I am too.

 LanguageIdentifier API enhancements
 ---

 Key: TIKA-465
 URL: https://issues.apache.org/jira/browse/TIKA-465
 Project: Tika
  Issue Type: Improvement
  Components: languageidentifier
Reporter: Chris A. Mattmann
Assignee: Ken Krugler
Priority: Minor

 As originally reported by Jerome Charron in NUTCH-86, Jerome identified a set 
 of improvements for the LanguageIdentifier that we should consider in Tika:
 {quote}
 More informations can be found on the following thread on Nutch-Dev mailing 
 list:
 http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html
 Summary:
 1. LanguageIdentifier API changes. The similarity methods should return an 
 ordered array of language-code/score pairs instead of a simple String 
 containing the language-code.
 2. Ensure consistency between LanguageIdentifier scoring and 
 NGramProfile.getSimilarity().
 {quote}
 I just wanted to capture the issue here in Tika, since I'm about to close it 
 out in Nutch since LanguageIdentification is something that can happen in 
 Tika-ville...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-774) ExifTool Parser

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-774:
-
Labels: features new-parser newbie patch  (was: features newbie patch,)

 ExifTool Parser
 ---

 Key: TIKA-774
 URL: https://issues.apache.org/jira/browse/TIKA-774
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.0
 Environment: Requires ExifTool to be installed 
 (http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
  Labels: features, new-parser, newbie, patch
 Fix For: 1.8

 Attachments: testJPEG_IPTC_EXT.jpg, 
 tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt


 Adds an external parser that calls ExifTool to extract extended metadata 
 fields from images and other content types.
 In the core project:
 An ExifTool interface is added which contains Property objects that define 
 the metadata fields available.
 An additional Property constructor for internalTextBag type.
 In the parsers project:
 An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
 on the command line and mapping the response to tika metadata fields.  This 
 extractor could be called instead of or in addition to the existing 
 ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
 JpegParser but those have not been changed at this time.
 An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
 An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
 metadata fields to existing tika and Drew Noakes metadata fields if enabled.
 An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
 implementations in XML files.
 An ExifToolParserTest is added which tests several expected XMP and IPTC 
 metadata values in testJPEG_IPTC_EXT.jpg.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-774) ExifTool Parser

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342605#comment-14342605
 ] 

Tyler Palsulich edited comment on TIKA-774 at 3/2/15 1:45 AM:
--

Do we still want to integrate this? Is this a semi duplicate of TIKA-762? I 
agree that we should create another conflicting Parser for image types.

Our decision on this is related to TIKA-776.


was (Author: tpalsulich):
Do we still want to integrate this? Is this a semi duplicate of TIKA-762? I 
agree that we should create another conflicting Parser for image types.

 ExifTool Parser
 ---

 Key: TIKA-774
 URL: https://issues.apache.org/jira/browse/TIKA-774
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.0
 Environment: Requires ExifTool to be installed 
 (http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
  Labels: features, new-parser, newbie, patch
 Fix For: 1.8

 Attachments: testJPEG_IPTC_EXT.jpg, 
 tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt


 Adds an external parser that calls ExifTool to extract extended metadata 
 fields from images and other content types.
 In the core project:
 An ExifTool interface is added which contains Property objects that define 
 the metadata fields available.
 An additional Property constructor for internalTextBag type.
 In the parsers project:
 An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
 on the command line and mapping the response to tika metadata fields.  This 
 extractor could be called instead of or in addition to the existing 
 ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
 JpegParser but those have not been changed at this time.
 An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
 An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
 metadata fields to existing tika and Drew Noakes metadata fields if enabled.
 An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
 implementations in XML files.
 An ExifToolParserTest is added which tests several expected XMP and IPTC 
 metadata values in testJPEG_IPTC_EXT.jpg.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-289) Add magic byte patterns from file(1)

2015-03-01 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch updated TIKA-289:

Attachment: file-mimes-missing.txt

Attached is the list of mime types extracted from the file(1) magic directory 
(found from grepping for {{!:mime}}) which aren't found in the Tika Mimetypes 
file

Some of these will be aliases for ones we already have, some will be aliases 
where we also need to bring over magic, and some will be new ones

This list doesn't include any for which we have a mime type, but don't 
currently have any magic

 Add magic byte patterns from file(1)
 

 Key: TIKA-289
 URL: https://issues.apache.org/jira/browse/TIKA-289
 Project: Tika
  Issue Type: Improvement
  Components: mime
Reporter: Jukka Zitting
Priority: Minor
 Attachments: file-mimes-missing.txt


 As discussed in TIKA-285, the file(1) command comes with a pretty 
 comprehensive set of magic byte patterns. It would be nice to get those 
 patterns included also in Tika.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-289) Add magic byte patterns from file(1)

2015-03-01 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch updated TIKA-289:

Attachment: file-has-magic-tika-missing.txt

{{file-has-magic-tika-missing.txt}} is the list of mime types where file(1) has 
magic but Tika does not, where both know about the same mime type. Note that 
there may be some false positives on this list, eg where Tika has the magic on 
a parent type

 Add magic byte patterns from file(1)
 

 Key: TIKA-289
 URL: https://issues.apache.org/jira/browse/TIKA-289
 Project: Tika
  Issue Type: Improvement
  Components: mime
Reporter: Jukka Zitting
Priority: Minor
 Attachments: file-has-magic-tika-missing.txt, file-mimes-missing.txt


 As discussed in TIKA-285, the file(1) command comes with a pretty 
 comprehensive set of magic byte patterns. It would be nice to get those 
 patterns included also in Tika.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-591) Separate launcher process for forking JVMs

2015-03-01 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342232#comment-14342232
 ] 

Luis Filipe Nassif commented on TIKA-591:
-

I think this is very important. We are having problems on Linux that I think 
are related to this while running the TesseractOCRParser. Sometimes the trace 
is similar to those posted in HADOOP-5059, sometimes it is outside of 
TesseractOCRParser, but I think it is related to a memory corruption caused by 
an early fork/exec. Reducing the max heap of the JVM helps a bit, but does not 
solve the issue. I don't know the tika-batch code, is it possible to use 
CompositeParser directly with tika-batch?

 Separate launcher process for forking JVMs
 -

 Key: TIKA-591
 URL: https://issues.apache.org/jira/browse/TIKA-591
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Jukka Zitting
Assignee: Jukka Zitting
Priority: Minor

 As a followup to TIKA-416, it would be good to implement at least optional 
 support for a separate launcher process for the ForkParser feature. The need 
 for such an extra process came up in JCR-2864 where a reference to 
 http://developers.sun.com/solaris/articles/subprocess/subprocess.html  was 
 made.
 To summarize, the problem is that the ProcessBuilder.start() call can result 
 in a temporary duplication of the memory space of the parent JVM. Even with 
 copy-on-write semantics this can be a fairly expensive operation and prone to 
 out-of-memory issues especially in large-scale deployments where the parent 
 JVM already uses the majority of the available RAM on a computer.
 A similar problem is also being discussed at HADOOP-5059.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: FeedBack required for Geographic Parser

2015-03-01 Thread Nick Burch

On Sat, 28 Feb 2015, Gautham Shankar wrote:

My progress has been updated on the below link.

https://wiki.apache.org/tika/TikaGeographicInformationParser

I would like you guys to comment on the Key Names that i have come up for
customized Meta data, this could certainly be shortened.


Ideally, we try not to invent our own metadata keys, but instead re-use 
definitions/standards from elsewhere. We also try to map format-specific 
keys onto general ones, to keep things consistent between different file 
types


From a quick glance, it looks like a few of the metadata entries you have 
are ones which could be mapped onto an existing key, and a few could be 
mapped onto new metadata properties from external standards


It might also be worth looking at some of the other scientific formats, 
and see if any commonality can be found with those / they can be changed 
to be common. Where there's a concept that's the same, the different 
formats should try to use the same metadata key.


(As an example, as a user, you don't need to know if a file format uses 
Created On, Created At, First Created At, Created, or anything like that, 
you just fetch dc:created and it's the same thing across all formats, and 
you can go look up the Dublin Core specification if you want to check what 
it means semantically)
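
A small illustration of that point with the Tika 1.x property classes (nothing geographic-specific here):
{code}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

class DcCreatedExample {
    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(in, new BodyContentHandler(-1), metadata,
                    new ParseContext());
        }
        // Whatever the file format calls its creation date internally,
        // the parser is expected to map it onto the shared dc:created property.
        System.out.println(metadata.get(TikaCoreProperties.CREATED));
    }
}
{code}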


Nick


[jira] [Updated] (TIKA-443) Geographic Information Parser

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-443:
-
Labels: new-parser  (was: )

 Geographic Information Parser
 -

 Key: TIKA-443
 URL: https://issues.apache.org/jira/browse/TIKA-443
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Arturo Beltran
Assignee: Chris A. Mattmann
  Labels: new-parser
 Attachments: getFDOMetadata.xml


 I'm working in the automatic description of geospatial resources, and I think 
 that might be interesting to incorporate new parser/s to Tika in order to 
 manage and describe some geo-formats. These geo-formats include files, 
 services and databases.
 If anyone is interested in this issue or want to collaborate do not hesitate 
 to contact me. Any help is welcome.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342526#comment-14342526
 ] 

Tyler Palsulich commented on TIKA-715:
--

List of parser tests that fail after applying the patch:
{code}
  
AutoDetectParserTest.testKeynote:164-assertAutoDetect:148-assertAutoDetect:132-assertAutoDetect:99
 null
  
AutoDetectParserTest.testPages:169-assertAutoDetect:148-assertAutoDetect:132-assertAutoDetect:99
 mismatched elements open=div close=body
  AutoDetectParserTest.testZipBombPrevention:271 mismatched elements open=p 
close=div
  iBooksParserTest.testiBooksParser:40 mismatched elements open=title close=head
  IWorkParserTest.testKeynoteBulletPoints:115 null
  IWorkParserTest.testKeynoteMasterSlideTable:140 mismatched elements open=tr 
close=table
  IWorkParserTest.testKeynoteTables:127 null
  IWorkParserTest.testKeynoteTextBoxes:103 null
  IWorkParserTest.testPagesLayoutMode:204 mismatched elements open=div 
close=body
  IWorkParserTest.testParseKeynote:57 null
  IWorkParserTest.testParsePages:154 mismatched elements open=div close=body
  IWorkParserTest.testParsePagesHeadersAlphaLower:406 mismatched elements 
open=p close=div
  IWorkParserTest.testParsePagesHeadersAlphaUpper:385 mismatched elements 
open=p close=div
  IWorkParserTest.testParsePagesHeadersFootersFootnotes:316 mismatched elements 
open=p close=div
  IWorkParserTest.testParsePagesHeadersFootersRomanLower:364 mismatched 
elements open=p close=div
  IWorkParserTest.testParsePagesHeadersFootersRomanUpper:343 mismatched 
elements open=p close=div
  RFC822ParserTest.testEncryptedZipAttachment:277 null
  RFC822ParserTest.testMultipart:93 null
  RFC822ParserTest.testNormalZipAttachment:332 null
  RFC822ParserTest.testUnusualFromAddress:197 null
  MboxParserTest.testComplex:150 null
  ExcelParserTest.testExcel95:380 end tag=body with no startElement
  
WordParserTest.testControlCharacter:383-TikaTest.getXML:114-TikaTest.getXML:123
 mismatched elements open=a close=b
  
OOXMLParserTest.testTextInsideTextBox:971-TikaTest.getXML:114-TikaTest.getXML:123
 null
  ODFParserTest.testFromFile:342 null
  ODFParserTest.testOO3:58 null
  ODFParserTest.testOO3Metadata:218 null
{code}
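
The check the patch enables boils down to something like the following simplified sketch (not the actual SafeContentHandler code):
{code}
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

class BalancedTagChecker extends ContentHandlerDecorator {
    private final Deque<String> open = new ArrayDeque<>();

    BalancedTagChecker(ContentHandler handler) {
        super(handler);
    }

    @Override
    public void startElement(String uri, String local, String name, Attributes atts)
            throws SAXException {
        open.push(local);
        super.startElement(uri, local, name, atts);
    }

    @Override
    public void endElement(String uri, String local, String name) throws SAXException {
        if (open.isEmpty()) {
            throw new SAXException("end tag=" + local + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(local)) {
            throw new SAXException("mismatched elements open=" + expected + " close=" + local);
        }
        super.endElement(uri, local, name);
    }
}
{code}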

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
 Fix For: 1.8

 Attachments: TIKA-715.patch


 With TIKA-683 I committed simple, commented out code to
 SafeContentHandler, to verify that the SAX events produced by the
 parser have valid (matched) tags.  Ie, each startElement(foo) is
 matched by the closing endElement(foo).
 I only did basic nesting test, plus checking that p is never
 embedded inside another p; we could strengthen this further to check
 that all tags only appear in valid parents...
 I was able to use this to fix issues with the new RTF parser
 (TIKA-683), but I was surprised that some other parsers failed the new
 asserts.
 It could be these are relatively minor offenses (eg closing a table
 w/o closing the tr) and we need not do anything here... but I think
 it'd be cleaner if all our parsers produced matched, well-formed XHTML
 events.
 I haven't looked into any of these... it could be they are easy to fix.
 Failures:
 {noformat}
 testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
 Time elapsed: 0.032 sec   ERROR!
 java.lang.AssertionError: end tag=body with no startElement
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
   at 
 org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
 testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
 0.116 sec   ERROR!
 java.lang.AssertionError: mismatched elements open=tr close=table
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
   at 
 

[jira] [Commented] (TIKA-465) LanguageIdentifier API enhancements

2015-03-01 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342525#comment-14342525
 ] 

Ken Krugler commented on TIKA-465:
--

I'm actually working on a new language detector, so I think this can be closed.

 LanguageIdentifier API enhancements
 ---

 Key: TIKA-465
 URL: https://issues.apache.org/jira/browse/TIKA-465
 Project: Tika
  Issue Type: Improvement
  Components: languageidentifier
Reporter: Chris A. Mattmann
Assignee: Ken Krugler
Priority: Minor

 As originally reported by Jerome Charron in NUTCH-86, Jerome identified a set 
 of improvements for the LanguageIdentifier that we should consider in Tika:
 {quote}
 More informations can be found on the following thread on Nutch-Dev mailing 
 list:
 http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html
 Summary:
 1. LanguageIdentifier API changes. The similarity methods should return an 
 ordered array of language-code/score pairs instead of a simple String 
 containing the language-code.
 2. Ensure consistency between LanguageIdentifier scoring and 
 NGramProfile.getSimilarity().
 {quote}
 I just wanted to capture the issue here in Tika, since I'm about to close it 
 out in Nutch since LanguageIdentification is something that can happen in 
 Tika-ville...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-465) LanguageIdentifier API enhancements

2015-03-01 Thread Ken Krugler (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Krugler closed TIKA-465.

Resolution: Won't Fix

The change to the API to return more information about the detected languages 
is still interesting, but I think it makes more sense to look at using a 
different detector (e.g. language-detector/detection) versus improving the 
internal version that was ported from Nutch back in the day.

 LanguageIdentifier API enhancements
 ---

 Key: TIKA-465
 URL: https://issues.apache.org/jira/browse/TIKA-465
 Project: Tika
  Issue Type: Improvement
  Components: languageidentifier
Reporter: Chris A. Mattmann
Assignee: Ken Krugler
Priority: Minor

 As originally reported by Jerome Charron in NUTCH-86, Jerome identified a set 
 of improvements for the LanguageIdentifier that we should consider in Tika:
 {quote}
 More informations can be found on the following thread on Nutch-Dev mailing 
 list:
 http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html
 Summary:
 1. LanguageIdentifier API changes. The similarity methods should return an 
 ordered array of language-code/score pairs instead of a simple String 
 containing the language-code.
 2. Ensure consistency between LanguageIdentifier scoring and 
 NGramProfile.getSimilarity().
 {quote}
 I just wanted to capture the issue here in Tika, since I'm about to close it 
 out in Nutch since LanguageIdentification is something that can happen in 
 Tika-ville...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2015-03-01 Thread Ken Krugler (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Krugler updated TIKA-539:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

 Encoding detection is too biased by encoding in meta tag
 

 Key: TIKA-539
 URL: https://issues.apache.org/jira/browse/TIKA-539
 Project: Tika
  Issue Type: Improvement
  Components: metadata, parser
Affects Versions: 0.8, 0.9, 0.10
Reporter: Reinhard Schwab
Assignee: Ken Krugler
Priority: Minor
 Fix For: 1.8

 Attachments: TIKA-539.patch, TIKA-539_2.patch


 if the encoding in the meta tag is wrong, this encoding is detected,
 even if the right encoding was set in the metadata beforehand (which can be from the
 HTTP response header).
 test code to reproduce:
 static String content = "<html><head>\n"
         + "<meta http-equiv=\"content-type\" "
         + "content=\"application/xhtml+xml; charset=iso-8859-1\" />"
         + "</head><body>Über den Wolken\n</body></html>";
 /**
  * @param args
  * @throws IOException
  * @throws TikaException
  * @throws SAXException
  */
 public static void main(String[] args) throws IOException, SAXException,
         TikaException {
     Metadata metadata = new Metadata();
     metadata.set(Metadata.CONTENT_TYPE, "text/html");
     metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
     InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
     AutoDetectParser parser = new AutoDetectParser();
     BodyContentHandler h = new BodyContentHandler(-1);
     parser.parse(in, h, metadata, new ParseContext());
     System.out.print(h.toString());
     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
 }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-723) Rotated text isn't extracted correctly from PDFs

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342532#comment-14342532
 ] 

Tyler Palsulich commented on TIKA-723:
--

The default behavior of Tika still prints out a letter or two per <p> tag. 
But, did we decide that this isn't a problem, since users can turn on sort by 
position?
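
For reference, "sort by position" means passing a PDFParserConfig through the ParseContext, roughly like this:
{code}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

class SortByPositionExample {
    public static void main(String[] args) throws Exception {
        PDFParserConfig config = new PDFParserConfig();
        config.setSortByPosition(true);   // re-order text runs by page position

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);

        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(in, handler, new Metadata(), context);
        }
        System.out.println(handler.toString());
    }
}
{code}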

 Rotated text isn't extracted correctly from PDFs
 

 Key: TIKA-723
 URL: https://issues.apache.org/jira/browse/TIKA-723
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Priority: Minor
 Attachments: rotated.pdf


 I have an example PDF with 90 degree rotation; Tika produces the
 characters one line at a time.  Ie, the doc has "Some rotated text,
 here!" but Tika produces this:
 {noformat}
 <body><div class="page"><p>So
 m
 e
  
 r
 o
 t
 a
 t
 e
 d
  
 t
 e
 x
 t
 ,
  
 h
 e
 r
 e
 !</p>
 {noformat}
 I'm able to copy/paste the text out correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2015-03-01 Thread Ken Krugler (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Krugler reopened TIKA-539:
--

 Encoding detection is too biased by encoding in meta tag
 

 Key: TIKA-539
 URL: https://issues.apache.org/jira/browse/TIKA-539
 Project: Tika
  Issue Type: Bug
  Components: metadata, parser
Affects Versions: 0.8, 0.9, 0.10
Reporter: Reinhard Schwab
Assignee: Ken Krugler
 Fix For: 1.8

 Attachments: TIKA-539.patch, TIKA-539_2.patch


 if the encoding in the meta tag is wrong, this encoding is detected,
 even if the right encoding was set in the metadata beforehand (which can be from the
 HTTP response header).
 test code to reproduce:
 static String content = "<html><head>\n"
         + "<meta http-equiv=\"content-type\" "
         + "content=\"application/xhtml+xml; charset=iso-8859-1\" />"
         + "</head><body>Über den Wolken\n</body></html>";
 /**
  * @param args
  * @throws IOException
  * @throws TikaException
  * @throws SAXException
  */
 public static void main(String[] args) throws IOException, SAXException,
         TikaException {
     Metadata metadata = new Metadata();
     metadata.set(Metadata.CONTENT_TYPE, "text/html");
     metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
     InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
     AutoDetectParser parser = new AutoDetectParser();
     BodyContentHandler h = new BodyContentHandler(-1);
     parser.parse(in, h, metadata, new ParseContext());
     System.out.print(h.toString());
     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
 }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-634) Command Line Parser for Metadata Extraction

2015-03-01 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342547#comment-14342547
 ] 

Ray Gauss II commented on TIKA-634:
---

Also see the [tika-ffmpeg project|https://github.com/AlfrescoLabs/tika-ffmpeg].

There we recently had to patch {{ExternalParser}} for some stream parsing 
concurrency problems which should be raised in a separate issue here shortly.

 Command Line Parser for Metadata Extraction
 ---

 Key: TIKA-634
 URL: https://issues.apache.org/jira/browse/TIKA-634
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 0.9
Reporter: Nick Burch
Assignee: Nick Burch
Priority: Minor

 As discussed on the mailing list:
 http://mail-archives.apache.org/mod_mbox/tika-dev/201104.mbox/%3calpine.deb.2.00.1104052028380.29...@urchin.earth.li%3E
 This issue is to track improvements in the ExternalParser support to handle 
 metadata extraction, and probably easier configuration of an external parser 
 too.
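
As a rough illustration of the direction discussed there, ExternalParser can be pointed at a command and given regex patterns for pulling metadata out of its output; the command and pattern below are only examples:
{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.external.ExternalParser;

class ExternalParserSketch {
    static ExternalParser exiftoolParser() {
        ExternalParser parser = new ExternalParser();
        parser.setSupportedTypes(Collections.singleton(MediaType.image("jpeg")));
        // INPUT_FILE_TOKEN is replaced by the path of the temp file Tika writes.
        parser.setCommand("exiftool", ExternalParser.INPUT_FILE_TOKEN);
        // Map a line like "Image Width : 1024" onto a metadata key.
        Map<Pattern, String> patterns = new HashMap<>();
        patterns.put(Pattern.compile("Image Width\\s*:\\s*(\\d+)"), "Image Width");
        parser.setMetadataExtractionPatterns(patterns);
        return parser;
    }
}
{code}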



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-765) add icu dependency

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-765.

Resolution: Won't Fix

Closing as Won't Fix since the Persian character issues seem to be solved.

 add icu dependency
 --

 Key: TIKA-765
 URL: https://issues.apache.org/jira/browse/TIKA-765
 Project: Tika
  Issue Type: Improvement
  Components: general
Affects Versions: 0.10
Reporter: Robert Muir

 Spinoff of TIKA-713.
 In PDFBox, reflection is used to detect if ICU is available in the classpath: 
 if it is, then it can use ICU BiDi support
 to properly extract right-to-left text. Otherwise, the text is returned 
 backwards. This is because the JDK does not
 provide the functionality needed to do this inverse BiDI reordering / 
 arabic-unshaping.
 it would be nice to properly depend on this, so that these languages work out 
 of the box... we do this in Apache Solr's
 tika integration (contrib/extraction) for example.
 Unlike the charset detection code from ICU that tika includes, including 
 BiDi support would be trickier, because it uses
 datafiles built from Unicode (these change over time and would be a hassle to 
 maintain).
 Additionally as a note: Tika has some forked charset code from ICU... long 
 term it would be great to get those changes 
 into ICU as well.
 Finally, as an optimization it's possible to reduce the icu4j jar size if 
 needed with http://apps.icu-project.org/datacustom/,
 but maybe as a start we could just depend upon the 'whole' icu?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-852) Quicktime / MP4 Metadata Parser

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-852:
-
Labels: new-parser  (was: )

 Quicktime / MP4 Metadata Parser
 ---

 Key: TIKA-852
 URL: https://issues.apache.org/jira/browse/TIKA-852
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0
Reporter: Nick Burch
Assignee: Nick Burch
  Labels: new-parser
 Attachments: TIKA-852.patch


 From the investigations done for TIKA-851, it looks like a parser for the 
 Quicktime format, and MP4 (which is an extension to it) shouldn't be too hard 
 to do. This should be able to output some of the media metadata, such as 
 duration, dimensions, and MP4 audio tags.
 Information resources on the format are linked from TIKA-851
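
A sketch of what the caller-facing side could look like, assuming duration gets mapped onto the existing XMPDM property (the exact keys were still to be decided at this point):
{code}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.XMPDM;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

class Mp4MetadataExample {
    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(in, new BodyContentHandler(-1), metadata,
                    new ParseContext());
        }
        System.out.println("duration: " + metadata.get(XMPDM.DURATION));
        // Dump everything the parser produced, whatever keys it chose.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
{code}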



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-879:
-
Labels: new-parser  (was: )

 Detection problem: message/rfc822 file is detected as text/plain.
 -

 Key: TIKA-879
 URL: https://issues.apache.org/jira/browse/TIKA-879
 Project: Tika
  Issue Type: Bug
  Components: metadata, mime
Affects Versions: 1.0, 1.1, 1.2
 Environment: linux 3.2.9
 oracle jdk7, openjdk7, sun jdk6
Reporter: Konstantin Gribov
  Labels: new-parser
 Attachments: TIKA-879-thunderbird.eml


 When using {{DefaultDetector}}, the mime type detected for {{.eml}} files differs (you 
 can test it on {{testRFC822}} and {{testRFC822_base64}} in 
 {{tika-parsers/src/test/resources/test-documents/}}).
 The main reason for this behavior is that only the magic detector really works for 
 such files, even if you set {{CONTENT_TYPE}} in the metadata or an {{.eml}} 
 file name in {{RESOURCE_NAME_KEY}}.
 As I found, {{MediaTypeRegistry.isSpecializationOf(message/rfc822, 
 text/plain)}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
 works only by magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342670#comment-14342670
 ] 

Tyler Palsulich commented on TIKA-879:
--

[~lfcnassif], that seems like a reasonable solution. [~gagravarr], any 
objections to widening the range of the offset for magic detection?

 Detection problem: message/rfc822 file is detected as text/plain.
 -

 Key: TIKA-879
 URL: https://issues.apache.org/jira/browse/TIKA-879
 Project: Tika
  Issue Type: Bug
  Components: metadata, mime
Affects Versions: 1.0, 1.1, 1.2
 Environment: linux 3.2.9
 oracle jdk7, openjdk7, sun jdk6
Reporter: Konstantin Gribov
  Labels: new-parser
 Attachments: TIKA-879-thunderbird.eml


 When using {{DefaultDetector}}, the mime type detected for {{.eml}} files differs (you 
 can test it on {{testRFC822}} and {{testRFC822_base64}} in 
 {{tika-parsers/src/test/resources/test-documents/}}).
 The main reason for this behavior is that only the magic detector really works for 
 such files, even if you set {{CONTENT_TYPE}} in the metadata or an {{.eml}} 
 file name in {{RESOURCE_NAME_KEY}}.
 As I found, {{MediaTypeRegistry.isSpecializationOf(message/rfc822, 
 text/plain)}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
 works only by magic.
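
A short reproduction sketch of the behaviour described above, passing both hints to DefaultDetector:
{code}
import java.io.File;
import java.io.InputStream;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

class EmlDetectionCheck {
    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        // Hints that, per this issue, do not influence the result.
        metadata.set(Metadata.RESOURCE_NAME_KEY, "message.eml");
        metadata.set(Metadata.CONTENT_TYPE, "message/rfc822");
        try (InputStream in = TikaInputStream.get(new File(args[0]))) {
            MediaType type = new DefaultDetector().detect(in, metadata);
            System.out.println(type);   // text/plain for the problem files
        }
    }
}
{code}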



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-893) Tika-server bundle includes wrong META-INF/services/org.apache.tika.parser.Parser, doesn't work

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342705#comment-14342705
 ] 

Tyler Palsulich commented on TIKA-893:
--

Is this still an issue? From what I understand, all service files are parsed so 
that all services are loaded?

 Tika-server bundle includes wrong 
 META-INF/services/org.apache.tika.parser.Parser, doesn't work
 ---

 Key: TIKA-893
 URL: https://issues.apache.org/jira/browse/TIKA-893
 Project: Tika
  Issue Type: Bug
  Components: packaging
Affects Versions: 1.1, 1.2
 Environment: Apache Maven 2.2.1 (rdebian-6)
 Java version: 1.6.0_26
 Java home: /usr/lib/jvm/java-6-sun-1.6.0.26/jre
 Default locale: en_GB, platform encoding: UTF-8
 OS name: linux version: 3.0.0-17-generic-pae arch: i386 Family: unix
Reporter: Chris Wilson
  Labels: maven, patch

 Both vorbis-java-tika-0.1.jar and tika-parsers-1.1-SNAPSHOT.jar include 
 different copies of META-INF/services/org.apache.tika.parser.Parser, which 
 the auto-detecting parser needs to configure itself.
 AFAIK, only one of these can be included in a standalone OSGi JAR, as they 
 both have the same filename.
 On my system at least, the vorbis one gets included in the JAR, and not the 
 tika-parsers one.
 This means that the Tika server is capable of auto-detecting Vorbis files, 
 but not Microsoft Office files, which is completely broken from my POV.
 Unless the (undocumented) Bnd contains some way to merge these files, I 
 suggest either:
 * excluding the one from vorbis-java-tika (easy but removes Vorbis 
 auto-detection);
 * bundling vorbis-java-tika as an embedded JAR instead of inlined (might 
 work?);
 * including a manually combined copy of both manifests in 
 tika-server/src/main/resources (ugly, requires maintenance);
 * finding or writing a maven plugin to merge these files (outside my 
 maven-fu).
 My simple workaround, which probably removes Vorbis support completely, is 
 this patch:
 {code:xml|title=tika-server/pom.xml.patch}
 @@ -163,7 +168,7 @@
instructions
  Export-Packageorg.apache.tika.*/Export-Package
  Embed-Dependency
 -
 !jersey-server;scope=compile;inline=META-INF/services/**|au/**|javax/**|org/**|com/**|Resources/**|font_metrics.properties|repackage/**|schema*/**,
 +
 !jersey-server;artifactId=!vorbis-java-tika;scope=compile;inline=META-INF/services/**|au/**|javax/**|org/**|com/**|Resources/**|font_metrics.properties|repackage/**|schema*/**,
  jersey-server;scope=compile;inline=com/** 
 |META-INF/services/com.sun*|META-INF/services/javax.ws.rs.ext.RuntimeDelegate
  /Embed-Dependency
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-891:
-
Labels: newbie  (was: )

 Use POST in addition to PUT on method calls in tika-server
 --

 Key: TIKA-891
 URL: https://issues.apache.org/jira/browse/TIKA-891
 Project: Tika
  Issue Type: Improvement
  Components: general
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: newbie
 Fix For: 1.8


 Per Jukka's email:
 http://s.apache.org/uR
 It would be a better use of REST/HTTP verbs to use POST to put content to a 
 resource where we don't intend to store that content (which is the 
 implication of PUT). Max suggested adding:
 {code}
 @POST
 {code}
 annotations to the methods we are currently exposing using PUT to take care 
 of this. 
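
A hedged sketch of what that looks like on a JAX-RS resource method; the class name, path, and method bodies are illustrative, not the actual tika-server code:
{code}
import java.io.InputStream;
import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.PUT;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;

@Path("/tika")
public class TikaResourceSketch {
    @PUT
    @Consumes("*/*")
    @Produces("text/plain")
    public String putText(InputStream body) throws Exception {
        return extract(body);
    }

    // Same behaviour exposed under POST, which fits better since nothing is stored.
    @POST
    @Consumes("*/*")
    @Produces("text/plain")
    public String postText(InputStream body) throws Exception {
        return extract(body);
    }

    private String extract(InputStream body) throws Exception {
        // ... hand the stream to Tika and return the extracted text ...
        return "";
    }
}
{code}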



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-903) NPE thrown with password protected Pages file

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-903.

Resolution: Fixed

No exception is thrown with Tika 1.8-SNAPSHOT. So, closing as fixed.

 NPE thrown with password protected Pages file
 -

 Key: TIKA-903
 URL: https://issues.apache.org/jira/browse/TIKA-903
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0
 Environment: Windows 7
Reporter: Gabriel Valencia
  Labels: iWork, nullpointerexception
 Attachments: testPagesVariousPwdProtected.pages


 When trying to view a password-protected Pages file in Tika GUI, you get an 
 NPE:
 org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
 org.apache.tika.parser.iwork.IWorkPackageParser@30583058
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
   at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
   at 
 org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
   at 
 org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
   at javax.swing.TransferHandler.importData(TransferHandler.java:756)
   at 
 javax.swing.TransferHandler$DropHandler.drop(TransferHandler.java:1479)
   at java.awt.dnd.DropTarget.drop(DropTarget.java:445)
   at 
 javax.swing.TransferHandler$SwingDropTarget.drop(TransferHandler.java:1204)
   at 
 sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(SunDropTargetContextPeer.java:531)
   at 
 sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(SunDropTargetContextPeer.java:844)
   at 
 sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(SunDropTargetContextPeer.java:768)
   at sun.awt.dnd.SunDropTargetEvent.dispatch(SunDropTargetEvent.java:42)
   at java.awt.Component.dispatchEventImpl(Component.java:4498)
   at java.awt.Container.dispatchEventImpl(Container.java:2110)
   at java.awt.Component.dispatchEvent(Component.java:4471)
   at 
 java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4588)
   at 
 java.awt.LightweightDispatcher.processDropTargetEvent(Container.java:4323)
   at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4174)
   at java.awt.Container.dispatchEventImpl(Container.java:2096)
   at java.awt.Window.dispatchEventImpl(Window.java:2490)
   at java.awt.Component.dispatchEvent(Component.java:4471)
   at java.awt.EventQueue.dispatchEvent(EventQueue.java:610)
   at 
 java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:280)
   at 
 java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:195)
   at 
 java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:185)
   at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:180)
   at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:172)
   at java.awt.EventDispatchThread.run(EventDispatchThread.java:133)
 Caused by: java.lang.NullPointerException
   at 
 org.apache.tika.parser.iwork.IWorkPackageParser$IWORKDocumentType.detectType(IWorkPackageParser.java:125)
   at 
 org.apache.tika.parser.iwork.IWorkPackageParser$IWORKDocumentType.access$000(IWorkPackageParser.java:71)
   at 
 org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:166)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   ... 30 more
 I tried viewing the contents in 7-zip, but it tells me it can't understand 
 the compression format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-01 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342718#comment-14342718
 ] 

Chris A. Mattmann commented on TIKA-891:


I think it would be nice to convert the other PUT ones to POST where it makes 
sense. Do you have a list in mind?

 Use POST in addition to PUT on method calls in tika-server
 --

 Key: TIKA-891
 URL: https://issues.apache.org/jira/browse/TIKA-891
 Project: Tika
  Issue Type: Improvement
  Components: general
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: newbie
 Fix For: 1.8


 Per Jukka's email:
 http://s.apache.org/uR
 It would be a better use of REST/HTTP verbs to use POST to put content to a 
 resource where we don't intend to store that content (which is the 
 implication of PUT). Max suggested adding:
 {code}
 @POST
 {code}
 annotations to the methods we are currently exposing using PUT to take care 
 of this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-836) parsing really slow on some documents

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-836.

Resolution: Cannot Reproduce

We can't reproduce this without the problem files. If you still have them, 
please upload them and reopen!

 parsing really slow on some documents
 -

 Key: TIKA-836
 URL: https://issues.apache.org/jira/browse/TIKA-836
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0
 Environment: CentOS 4.x/5.x/6.x
Reporter: Rob Tulloh

 We are seeing that tika sometimes takes a very long time to parse some 
 content (likely PDF). For example, with the following EML file that contains 
 4 documents (2 PDF, 1 MS Excel, 1 text):
 {noformat}
 fgrep --binary-file=text Content-Type: XXX.eml
 Content-Type: multipart/mixed;
 Content-Type: multipart/alternative;
 Content-Type: text/plain;
 Content-Type: text/html;
 Content-Type: application/octet-stream;
 Content-Type: application/octet-stream;
 Content-Type: application/vnd.ms-excel;
 du -sh XXX.eml
 6.0M    XXX.eml
 {noformat}
 Note that it takes tika nearly 30 minutes to process this content even though 
 the source is only 6M in size:
 {noformat}
 time java -Xmx2G -jar ../../tika-app-1.0.jar -m XXX.eml > meta.out
 WARN - Did not found XRef object at specified startxref position 230521
 WARN - Did not found XRef object at specified startxref position 3742379
 real    29m16.913s
 user    18m17.050s
 sys     0m19.465s
 {noformat}
 Is there any way to configure tika (in particular via solr) to process files 
 more quickly?
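 One caller-side option, independent of any Tika or Solr configuration, is to 
 time-box each parse. A rough sketch using plain java.util.concurrent (the 
 10-minute budget is an arbitrary example, and cancellation is best-effort):
 {code}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.*;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TimeBoxedParse {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> text = pool.submit(() -> {
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                BodyContentHandler handler = new BodyContentHandler(-1);
                new AutoDetectParser().parse(in, handler, new Metadata());
                return handler.toString();
            }
        });
        try {
            // Give up on pathological documents instead of waiting ~30 minutes.
            System.out.println(text.get(10, TimeUnit.MINUTES).length() + " chars extracted");
        } catch (TimeoutException e) {
            text.cancel(true);  // interrupt is best-effort; the parse may not stop promptly
            System.err.println("Parse exceeded the time budget");
        } finally {
            pool.shutdownNow();
        }
    }
}
 {code}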



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-862) JPSS HDF5 files not being detected appropriately

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-862.
--
Resolution: Fixed

Marking as fixed. The output from the above file is
{code}
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Mission_Name" content="NPP"/>
<meta name="Content-Length" content="20888"/>
<meta name="Distributor" content="noaa"/>
<meta name="N_HDF_Creation_Date" content="2022"/>
<meta name="N_HDF_Creation_Time" content="203300.301515Z"/>
<meta name="N_Collection_Short_Name" content="SPACECRAFT-DIARY-RDR"/>
<meta name="Instrument_Short_Name" content="SPACECRAFT"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.hdf.HDFParser"/>
<meta name="Platform_Short_Name" content="NPP"/>
<meta name="N_Dataset_Source" content="noaa"/>
<meta name="N_Dataset_Type_Tag" content="RDR"/>
<meta name="N_Processing_Domain" content="ops"/>
<meta name="Content-Type" content="application/x-hdf"/>
<meta name="resourceName" content="test.h5"/>
<title/>
</head>
<body/></html>
{code}

 JPSS HDF5 files not being detected appropriately
 

 Key: TIKA-862
 URL: https://issues.apache.org/jira/browse/TIKA-862
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0
Reporter: Richard Yu
Assignee: Chris A. Mattmann
 Attachments: 
 ASF.LICENSE.NOT.GRANTED--RNSCA-ROLPS_npp_d20120202_t1841338_e1842112_b01382_c20120202203730692328_noaa_ops.h5,
  
 ASF.LICENSE.NOT.GRANTED--RNSCA-ROLPS_npp_d20120202_t1841338_e1842112_b01382_c20120202203730692328_noaa_ops.h5,
  
 RNSCA_npp_d2021_t1935200_e1935400_b00346_c2022203300301515_noaa_ops.h5


 As commented in TIKA-614, JPSS HDF 5 files are not being properly detected by 
 Tika. See this:
 from [~minfing]:
 {quote}
 We were trying to extract metadata from our h5 file (i.e. with JPSS 
 extension). We ran the following command line:
 {noformat}
 [ryu@localhost hdf5extractor]$ java -jar tika-app-1.0.jar -m \
  /usr/local/staging/products/h5/SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5
 Content-Encoding: windows-1252
 Content-Length: 22187952
 Content-Type: text/plain
 resourceName: 
 SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5
 [ryu@localhost hdf5extractor]$
 {noformat}
 We noticed that the content type is text/plain and that there are only 4 lines of 
 output (i.e. we expected a lot of metadata).
 Let me know if more information is needed. Thanks!
 Richard
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-849) Identify and parse the Apple iBooks format

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-849:
-
Labels: new-parser  (was: )

 Identify and parse the Apple iBooks format
 --

 Key: TIKA-849
 URL: https://issues.apache.org/jira/browse/TIKA-849
 Project: Tika
  Issue Type: New Feature
  Components: mime, parser
Affects Versions: 1.1
Reporter: Andrew Jackson
  Labels: new-parser
 Attachments: ibooks-support.patch


 With the release of iBooks Author 1.0, Apple have created a new eBook format 
 very similar to ePub. Tika could be extended to identify and parse this new 
 format, re-using the existing ePub code wherever possible.
 I have created an initial patch, which I will attach to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-858:
-
Labels: new-parser  (was: )

 Tika add parsing support for ANPA-1312 news wire feeds
 --

 Key: TIKA-858
 URL: https://issues.apache.org/jira/browse/TIKA-858
 Project: Tika
  Issue Type: New Feature
  Components: mime, parser
Affects Versions: 0.10
Reporter: Craig Stires
  Labels: new-parser
 Attachments: 7901V5.pdf, IptcAnpaParser.java, 
 org.apache.tika.parser.Parser_ANPA.patch, tika-mimetypes_ANPA.patch


 This submission adds support for ANPA-1312 news wire feeds.
 Those feeds are the formats used by AP, AFP, NYT, Reuters in their daily news 
 wire broadcasts.
 This was a pretty significant development effort, so I am happy to share it back 
 as a thank-you to the TIKA community. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342664#comment-14342664
 ] 

Tyler Palsulich commented on TIKA-858:
--

Does anyone have an ANPA file we can use to test?

 Tika add parsing support for ANPA-1312 news wire feeds
 --

 Key: TIKA-858
 URL: https://issues.apache.org/jira/browse/TIKA-858
 Project: Tika
  Issue Type: New Feature
  Components: mime, parser
Affects Versions: 0.10
Reporter: Craig Stires
  Labels: new-parser
 Attachments: 7901V5.pdf, IptcAnpaParser.java, 
 org.apache.tika.parser.Parser_ANPA.patch, tika-mimetypes_ANPA.patch


 This submission adds support for ANPA-1312 news wire feeds.
 Those feeds are the formats used by AP, AFP, NYT, Reuters in their daily news 
 wire broadcasts.
 This was a pretty significant development effort, so I am happy to share it back 
 as a thank-you to the TIKA community. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-880) while integrating microsoft parser it is giving error

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342672#comment-14342672
 ] 

Tyler Palsulich commented on TIKA-880:
--

Hi [~som.mukhopadhyay]. Thank you for raising this issue, and apologies that it 
hasn't gotten any attention. How exactly were you integrating the parsers? Were you 
ever able to resolve this? If not, I'm going to close this as Can't Reproduce later 
this week.

 while integrating microsoft parser it is giving error
 -

 Key: TIKA-880
 URL: https://issues.apache.org/jira/browse/TIKA-880
 Project: Tika
  Issue Type: Wish
  Components: parser
Affects Versions: 1.0
 Environment: Android
Reporter: Somenath Mukhopadhyay
  Labels: newbie
   Original Estimate: 12h
  Remaining Estimate: 12h

 I don't know if I should raise this problem as an issue in Jira, but I 
 have reached a roadblock.
 I am using Apache Tika for my Android development. I was successful in 
 integrating most of the parsers. However, when I try to integrate 
 Microsoft and Microsoft.ooxml, it gives the most dreaded "Conversion to 
 Dalvik format failed with error 1". If someone can help me out in resolving 
 this issue, that would be really fantastic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-887) Tika fails to parse some MP3 tags correctly and produces null characters in value

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-887.
--
Resolution: Fixed

There were no objections, and the linked file seemed to have valid metadata. So, I'm 
marking this as fixed.

 Tika fails to parse some MP3 tags correctly and produces null characters in 
 value
 -

 Key: TIKA-887
 URL: https://issues.apache.org/jira/browse/TIKA-887
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0, 1.1
Reporter: Jens Hübel
Priority: Minor

 I have a problem when extracting the comment tag from an MP3 file. It 
 contains an invalid prefix, then a '\0' character, and then the real value of 
 the tag. This happens with files downloaded from www.jamendo.com, for 
 example this one:
 http://storage.newjamendo.com/download/track/450545/mp32/Swansong.mp3
 It may be that the tags are not created properly on this site, but at least 
 tools like mp3tag display them correctly.
 The extracted value looks like this: eng http://www.jamendo.com 
 Attribution-Noncommercial-Share Alike 3.0
 At position 3 there is a null character. The tag value should start with 
 http...
 Here is the byte sequence at the beginning of this file:
 49 44 33 04 00 00 00 01 18 32 54 49 54 32 00 00 
 00 09 00 00 03 53 77 61 6E 73 6F 6E 67 54 50 45 
 31 00 00 00 0E 00 00 03 4A 6F 73 68 20 57 6F 6F 
 64 77 61 72 64 54 41 4C 42 00 00 00 0C 00 00 03 
 42 72 65 61 64 63 72 75 6D 62 73 54 44 52 4C 00 
 00 00 05 00 00 03 32 30 30 39 43 4F 4D 4D 00 00 
 00 22 00 00 03 65 6E 67 49 44 33 20 76 31 20 43 
 6F 6D 6D 65 6E 74 00 41 74 74 72 69 62 75 74 69 
 6F 6E 20 33 2E 30 54 43 4F 4E 00 00 00 06 00 00 
 03 28 32 35 35 29 54 50 55 42 00 00 00 08 00 00 
 03 4A 61 6D 65 6E 64 6F 43 4F 4D 4D 00 00 00 2C 
 00 00 03 65 6E 67 00 68 74 74 70 3A 2F 2F 77 77 
 77 2E 6A 61 6D 65 6E 64 6F 2E 63 6F 6D 20 41 74 
 74 72 69 62 75 74 69 6F 6E 20 33 2E 30 20 54 43 
 4F 50 00 00 01 1F 00 00 03 32 30 30 39 2D 31 30 
 2D 32 31 54 31 31 3A 31 31 3A 32 30 2B 30 31 3A 
 30 30 20 4A 6F 73 68 20 57 6F 6F 64 77 61 72 64 
 2E 20 4C 69 63 65 6E 73 65 64 20 74 6F 20 74 68
 ID3..2TIT2...SwansongTPE1...Josh 
 WoodwardTALB...BreadcrumbsTDRL...2009COMM..engID3 v1 
 Comment.Attribution 
 3.0TCON...(255)TPUB...JamendoCOMM...,...eng.http://www.jamendo.com 
 Attribution 3.0 TCOP...2009-10-21T11:11:20+01:00 Josh Woodward. Licensed 
 to th
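 For context, the bytes above follow the standard ID3v2 COMM frame layout: a text 
 encoding byte, a 3-byte language code, a null-terminated short description, and 
 then the comment text. The reported "eng" plus '\0' prefix is therefore the 
 language code and description terminator leaking into the value. A minimal sketch 
 of splitting a COMM frame body along those lines (illustrative only, not Tika's 
 or any tag library's actual code):
 {code}
import java.nio.charset.StandardCharsets;

// Sketch of the ID3v2 COMM frame body layout: [text encoding (1 byte)]
// [language (3 bytes)] [short description, null-terminated] [comment text].
// Illustrative only; it handles just the single-byte encodings (0x00 and 0x03).
public class CommFrameSplit {

    public static String commentText(byte[] body) {
        int i = 4;                         // skip encoding byte + 3-byte language ("eng")
        while (i < body.length && body[i] != 0) {
            i++;                           // skip the short description ("ID3 v1 Comment")
        }
        i++;                               // skip the null terminator
        return new String(body, i, body.length - i,
                body[0] == 0x03 ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Frame body mirroring the first COMM frame in the dump above.
        byte[] demo = ("\u0003eng" + "ID3 v1 Comment" + "\0" + "Attribution 3.0")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(commentText(demo));   // prints "Attribution 3.0"
    }
}
 {code}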



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-888) NetCDF parser uses Java 6 JAR file and test/compilation fails with Java 1.5, although TIKA is Java 1.5

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-888.

Resolution: Fixed

Tika now requires Java 1.6 (with talk of moving to 1.7), and there have been Java 1.5 
compatibility updates that made the tests pass. Marking as Fixed.

 NetCDF parser uses Java 6 JAR file and test/compilation fails with Java 1.5, 
 although TIKA is Java 1.5
 --

 Key: TIKA-888
 URL: https://issues.apache.org/jira/browse/TIKA-888
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0
Reporter: Uwe Schindler
Assignee: Chris A. Mattmann

 Lucene/Solr developers ran this tool before releasing Lucene/Solr 3.6 (Solr 
 3.6 is still required to run on Java 1.5, see SOLR-3295): 
 http://code.google.com/p/versioncheck/
 {noformat}
 Major.Minor Version : 50.0 JAVA compatibility : Java 1.6 
 platform: 45.3-50.0
 Number of classes : 60
 Classes are: 
 c:\Work\lucene-solr\.\solr\contrib\extraction\lib\netcdf-4.2-min.jar [:] 
 ucar/unidata/geoloc/Bearing.class
 ...
 {noformat}
 TIKA should use a 1.5-compatible version of this JAR and especially do some Java 5 
 tests before releasing (as its build dependencies say, the minimum is Java 5). 
 I tried to compile and run the TIKA tests with Java 1.5 - crash (Invalid class 
 file format).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342703#comment-14342703
 ] 

Tyler Palsulich commented on TIKA-891:
--

I made a couple of changes related to this for TIKA-1547 (use POST for forms). 
Should we still convert the other PUT resources to POST?

 Use POST in addition to PUT on method calls in tika-server
 --

 Key: TIKA-891
 URL: https://issues.apache.org/jira/browse/TIKA-891
 Project: Tika
  Issue Type: Improvement
  Components: general
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Trivial
 Fix For: 1.8


 Per Jukka's email:
 http://s.apache.org/uR
 It would be a better use of REST/HTTP verbs to use POST to put content to a 
 resource where we don't intend to store that content (which is the 
 implication of PUT). Max suggested adding:
 {code}
 @POST
 {code}
 annotations to the methods we are currently exposing using PUT to take care 
 of this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Curating Issues

2015-03-01 Thread Nick Burch

On Sun, 1 Mar 2015, Tyler Palsulich wrote:
I've started labeling some issues as new-parser and newbie. I think 
these should be helpful for organization. Please let me know if there is 
another label we've already been using for those. I put new-parser on 
any requests to support a new filetype, even if it doesn't require a 
full on Parser (e.g. just magic).


I don't know if anyone has the time to mentor, but there's just about 
still time to get something into GSoC for 2015. If we do have someone who 
could mentor a student in the summer, then it could be worth tagging any 
summer sized issues with gsoc2015. http://community.apache.org/gsoc.html 
has some more info for anyone new to gsoc


Nick


[jira] [Commented] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342709#comment-14342709
 ] 

Tyler Palsulich commented on TIKA-894:
--

[~lewismc], if you have the time, this would be great to have.

 Add webapp mode for Tika Server, simplifies deployment
 --

 Key: TIKA-894
 URL: https://issues.apache.org/jira/browse/TIKA-894
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.1, 1.2
Reporter: Chris Wilson
  Labels: maven, newbie, patch
 Fix For: 1.8

 Attachments: tika-server-webapp.patch


 For use in production services, Tika Server should really be deployed as a 
 WAR file, under a reliable servlet container that knows how to run as a 
 system service, for example Tomcat or JBoss.
 This is especially important on Windows, where I wasted an entire day trying 
 to make TikaServerCli run as some kind of service. 
 Maven makes building a webapp pretty trivial. With the attached patch 
 applied, {{mvn war:war}} should work. It seems to run fine in Tomcat, which 
 makes Windows deployment much simpler. Just install Tomcat, drop the WAR 
 file into Tomcat's webapps directory, and you're away.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-897) UTF-8 encoded XML is detected as text/plain because of UTF-8 BOM

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-897.

Resolution: Fixed

Closing as fixed per Nick's comment above. We can open a new issue if someone 
wants UTF-32 XML detection support.

 UTF-8 encoded XML is detected as text/plain because of UTF-8 BOM
 

 Key: TIKA-897
 URL: https://issues.apache.org/jira/browse/TIKA-897
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.1
Reporter: Wade Taylor
Priority: Minor

 Detection of XML fails when encoded as UTF-8. The UTF-8 BOM: 0xEF,0xBB,0xBF 
 causes the XML detector to fail when trying to match <?xml at the beginning 
 of the input stream.
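 A minimal sketch of a caller-side workaround (not the fix applied inside Tika 
 itself): strip the BOM before detection, assuming commons-io is available on the 
 classpath:
 {code}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.io.input.BOMInputStream;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class BomAwareDetect {
    public static void main(String[] args) throws Exception {
        try (InputStream raw = Files.newInputStream(Paths.get(args[0]));
             // BOMInputStream drops a leading UTF-8 BOM so the magic match
             // for "<?xml" sees the declaration at offset 0.
             InputStream bomFree = new BOMInputStream(raw);
             TikaInputStream tis = TikaInputStream.get(bomFree)) {
            MediaType type = new DefaultDetector().detect(tis, new Metadata());
            System.out.println(type);   // should report an XML type, not text/plain
        }
    }
}
 {code}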
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-911:
-
Affects Version/s: (was: 1.1)
   1.8

 Converted PDF document contains question marks in place of spaces and 
 inconsistent case
 ---

 Key: TIKA-911
 URL: https://issues.apache.org/jira/browse/TIKA-911
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.8
Reporter: Matt Sheppard
 Attachments: Rust Biosecurity Brochure.pdf, Rust Biosecurity 
 Brochure.pdf.html


 The PDF document at 
 http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf, 
 when converted with tika v1.1 using
 {code}
 $ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
 {code}
 Produces substantially worse output than xpdf's pdftotext program.
 Specifically, we see...
 Some 'spaces' replaced with question marks
 {noformat}
 ...
 <body><div class="page"><p/>
 <p>How can I help?
 When you're overseas:
 • ?wherever?possible,?don't?visit?crops?—?contact?with?
 </p>
 <p>growing?crops?greatly?increases?the?risk?of?contaminating?
 footwear?or?clothing;?
 ...
 {noformat}
 and some odd case conversions
 {noformat}
 <p>stem rust in wheat.  
  (soURce: BRAd collIs)</p>
 <p/>
 </div>
 {noformat}
 (The original document seems to contain SOURCE: BRAD COLLIS all in upper 
 case.)
 To compare that with pdftotext
 {code}
 $ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ 
 Brochure.pdf
 {code}
 This does not output the question marks, and produces Source: BRAD COLLIS 
 at the end there, both of which seem to be improvements. Note that it does, 
 however, produce a number of ^G characters which are not desirable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342716#comment-14342716
 ] 

Tyler Palsulich commented on TIKA-911:
--

Still seeing this issue (question marks instead of spaces) on a Mac with Tika 
1.8-SNAPSHOT.

{{mvn -version}}:
{code}
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 
2014-08-11T16:58:10-04:00)
Maven home: /usr/local/Cellar/maven/3.2.3/libexec
Java version: 1.7.0_71, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_71.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: mac os x, version: 10.10.2, arch: x86_64, family: mac
{code}

 Converted PDF document contains question marks in place of spaces and 
 inconsistent case
 ---

 Key: TIKA-911
 URL: https://issues.apache.org/jira/browse/TIKA-911
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.8
Reporter: Matt Sheppard
 Attachments: Rust Biosecurity Brochure.pdf, Rust Biosecurity 
 Brochure.pdf.html


 The PDF document at 
 http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf, 
 when converted with tika v1.1 using
 {code}
 $ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
 {code}
 Produces substantially worse output than xpdf's pdftotext program.
 Specifically, we see...
 Some 'spaces' replaced with question marks
 {noformat}
 ...
 <body><div class="page"><p/>
 <p>How can I help?
 When you're overseas:
 • ?wherever?possible,?don't?visit?crops?—?contact?with?
 </p>
 <p>growing?crops?greatly?increases?the?risk?of?contaminating?
 footwear?or?clothing;?
 ...
 {noformat}
 and some odd case conversions
 {noformat}
 <p>stem rust in wheat.  
  (soURce: BRAd collIs)</p>
 <p/>
 </div>
 {noformat}
 (The original document seems to contain SOURCE: BRAD COLLIS all in upper 
 case.)
 To compare that with pdftotext
 {code}
 $ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ 
 Brochure.pdf
 {code}
 This does not output the question marks, and produces Source: BRAD COLLIS 
 at the end there, both of which seem to be improvements. Note that it does, 
 however, produce a number of ^G characters which are not desirable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-885) Possible ConcurrentModificationException while accessing Metadata produced by ParsingReader

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342675#comment-14342675
 ] 

Tyler Palsulich commented on TIKA-885:
--

[~lfcnassif], is this issue superseded by TIKA-1007? Or, should we keep this 
open?

 Possible ConcurrentModificationException while accessing Metadata produced by 
 ParsingReader
 ---

 Key: TIKA-885
 URL: https://issues.apache.org/jira/browse/TIKA-885
 Project: Tika
  Issue Type: Improvement
  Components: metadata, parser
Affects Versions: 1.0
 Environment: jre 1.6_25 x64 and Windows7 Enterprise x64
Reporter: Luis Filipe Nassif
Priority: Minor
  Labels: patch

 Oracle's PipedReader and PipedWriter classes have a bug that does not allow them 
 to execute concurrently, because they notify each other only when the pipe is 
 full or empty, and not after a char is read from or written to the pipe. So I 
 modified ParsingReader to use modified versions of PipedReader and 
 PipedWriter, similar to the GNU versions of them, that work concurrently. 
 However, sometimes, and with certain files, I get the following error:
 java.util.ConcurrentModificationException
 at java.util.HashMap$HashIterator.nextEntry(Unknown Source)
 at java.util.HashMap$KeyIterator.next(Unknown Source)
 at java.util.AbstractCollection.toArray(Unknown Source)
 at org.apache.tika.metadata.Metadata.names(Metadata.java:146)
 This happens because the ParsingReader.ParsingTask thread writes metadata while 
 it is being read by the ParsingReader thread, for files containing metadata 
 beyond their initial bytes. It will not occur with the current implementation, 
 because the Java PipedReader and PipedWriter block each other, which is a 
 performance bug that affects ParsingReader, but that could be fixed in a 
 future Java release. I think it would be a defensive approach to make access 
 to the private Metadata.metadata Map synchronized, which could avoid a 
 possible future problem when using ParsingReader.
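 A minimal sketch of that defensive idea, i.e. guarding reads and writes with one 
 lock around a plain HashMap (names here are illustrative, not Tika's actual 
 Metadata internals):
 {code}
import java.util.HashMap;
import java.util.Map;

// Illustrative only: guards the name/value map with a single lock so a
// producer thread adding metadata cannot race a consumer calling names().
public class SynchronizedMetadata {
    private final Map<String, String[]> metadata = new HashMap<String, String[]>();

    public synchronized void set(String name, String value) {
        metadata.put(name, new String[] { value });
    }

    public synchronized String[] names() {
        // toArray() runs under the same lock, so no ConcurrentModificationException
        return metadata.keySet().toArray(new String[metadata.size()]);
    }

    public synchronized String get(String name) {
        String[] values = metadata.get(name);
        return (values == null || values.length == 0) ? null : values[0];
    }
}
 {code}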



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-899) [Jackrabbit 2.4 - Tika Parser 1.0] How to configure AutoDetectParser for not detecting content when using files without extension

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-899.

Resolution: Duplicate

 [Jackrabbit 2.4 - Tika Parser 1.0] How to configure AutoDetectParser for not 
 detecting content when using files without extension
 -

 Key: TIKA-899
 URL: https://issues.apache.org/jira/browse/TIKA-899
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0
Reporter: Claudiu Muresan
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-898) [Jackrabbit 2.4 - Tika Parser 1.0] How to configure AutoDetectParser for not detecting content when using files without extension

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-898.

Resolution: Cannot Reproduce

There are a few ways to configure the available Parsers. You can use the new 
blacklist feature, the configuration added in TIKA-1509, or exclude the underlying 
dependencies. Closing this as Cannot Reproduce, since I'm not sure what the 
exact issue is.

 [Jackrabbit 2.4 - Tika Parser 1.0] How to configure AutoDetectParser for not 
 detecting content when using files without extension
 -

 Key: TIKA-898
 URL: https://issues.apache.org/jira/browse/TIKA-898
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0
Reporter: Claudiu Muresan
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342722#comment-14342722
 ] 

Tyler Palsulich commented on TIKA-891:
--

There are only 3 -- getText, getXML, getHTML. So, easy list.
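
For illustration, the change would roughly amount to exposing the same extraction 
logic under both verbs. A simplified JAX-RS sketch (class name, paths, and method 
bodies are placeholders, not tika-server's actual resource code):

{code}
import java.io.InputStream;

import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.PUT;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;

@Path("/tika")
public class ExampleTikaResource {

    // PUT stays for backwards compatibility...
    @PUT
    @Consumes("*/*")
    @Produces("text/plain")
    public String putText(InputStream document) throws Exception {
        return extractPlainText(document);
    }

    // ...and POST is exposed on the same path, delegating to the same logic,
    // since the request body is parsed and returned rather than stored.
    @POST
    @Consumes("*/*")
    @Produces("text/plain")
    public String postText(InputStream document) throws Exception {
        return extractPlainText(document);
    }

    private String extractPlainText(InputStream document) {
        // ... parse with Tika and return the extracted text ...
        return "";
    }
}
{code}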

 Use POST in addition to PUT on method calls in tika-server
 --

 Key: TIKA-891
 URL: https://issues.apache.org/jira/browse/TIKA-891
 Project: Tika
  Issue Type: Improvement
  Components: general
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: newbie
 Fix For: 1.8


 Per Jukka's email:
 http://s.apache.org/uR
 It would be a better use of REST/HTTP verbs to use POST to put content to a 
 resource where we don't intend to store that content (which is the 
 implication of PUT). Max suggested adding:
 {code}
 @POST
 {code}
 annotations to the methods we are currently exposing using PUT to take care 
 of this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-01 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342725#comment-14342725
 ] 

Chris A. Mattmann commented on TIKA-891:


Ahh, if it's "get anything", I would recommend making them GET methods with 
@GET 

 Use POST in addition to PUT on method calls in tika-server
 --

 Key: TIKA-891
 URL: https://issues.apache.org/jira/browse/TIKA-891
 Project: Tika
  Issue Type: Improvement
  Components: general
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: newbie
 Fix For: 1.8


 Per Jukka's email:
 http://s.apache.org/uR
 It would be a better use of REST/HTTP verbs to use POST to put content to a 
 resource where we don't intend to store that content (which is the 
 implication of PUT). Max suggested adding:
 {code}
 @POST
 {code}
 annotations to the methods we are currently exposing using PUT to take care 
 of this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Curating Issues

2015-03-01 Thread Mattmann, Chris A (3980)
Good idea, Nick. The vision parser I threw up I labeled with
gsoc2015 - if there are any takers, please send them my way!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Nick Burch apa...@gagravarr.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Sunday, March 1, 2015 at 8:14 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Re: Curating Issues

On Sun, 1 Mar 2015, Tyler Palsulich wrote:
 I've started labeling some issues as new-parser and newbie. I think
 these should be helpful for organization. Please let me know if there
is 
 another label we've already been using for those. I put new-parser on
 any requests to support a new filetype, even if it doesn't require a
 full on Parser (e.g. just magic).

I don't know if anyone has the time to mentor, but there's just about
still time to get something into GSoC for 2015. If we do have someone who
could mentor a student in the summer, then it could be worth tagging any
summer sized issues with gsoc2015.
http://community.apache.org/gsoc.html
has some more info for anyone new to gsoc

Nick



[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-03-01 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342738#comment-14342738
 ] 

Nick Burch commented on TIKA-879:
-

It might be good to try the widened versions with Tika Batch, to see whether, on a 
wide range of files, they cause any noticeable slowdown or false positives.

I still think this isn't a file format that can be fully reliably detected with 
mime magic alone; ideally we do need a dedicated detector for it, as mentioned 
above, to fully solve this and related (e.g. multipart/signed) detection issues.

 Detection problem: message/rfc822 file is detected as text/plain.
 -

 Key: TIKA-879
 URL: https://issues.apache.org/jira/browse/TIKA-879
 Project: Tika
  Issue Type: Bug
  Components: metadata, mime
Affects Versions: 1.0, 1.1, 1.2
 Environment: linux 3.2.9
 oracle jdk7, openjdk7, sun jdk6
Reporter: Konstantin Gribov
  Labels: new-parser
 Attachments: TIKA-879-thunderbird.eml


 When using {{DefaultDetector}}, the mime type detected for {{.eml}} files differs 
 (you can test it on {{testRFC822}} and {{testRFC822_base64}} in 
 {{tika-parsers/src/test/resources/test-documents/}}).
 The main reason for this behavior is that only the magic detector really works for 
 such files, even if you set {{CONTENT_TYPE}} in the metadata or an {{.eml}} 
 file name in {{RESOURCE_NAME_KEY}}.
 As I found, {{MediaTypeRegistry.isSpecializationOf(message/rfc822, 
 text/plain)}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
 works only by magic.
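 A minimal sketch of the kind of check behind that observation, exercising detection 
 with a resource-name hint plus the stream magic (a plain illustration, not Tika's 
 test code):
 {code}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class EmlDetectionCheck {
    public static void main(String[] args) throws Exception {
        DefaultDetector detector = new DefaultDetector();
        Metadata metadata = new Metadata();
        // The name hint alone is not enough here; detection effectively falls
        // back to the magic bytes, which is why borderline .eml files can come
        // out as text/plain.
        metadata.set(Metadata.RESOURCE_NAME_KEY, "testRFC822");
        try (InputStream in = Files.newInputStream(
                Paths.get("tika-parsers/src/test/resources/test-documents/testRFC822"));
             TikaInputStream tis = TikaInputStream.get(in)) {
            MediaType type = detector.detect(tis, metadata);
            System.out.println(type);   // message/rfc822 when the magic matches
        }
    }
}
 {code}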



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-634) Command Line Parser for Metadata Extraction

2015-03-01 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch updated TIKA-634:

Labels: new-parser  (was: )

 Command Line Parser for Metadata Extraction
 ---

 Key: TIKA-634
 URL: https://issues.apache.org/jira/browse/TIKA-634
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 0.9
Reporter: Nick Burch
Assignee: Nick Burch
Priority: Minor
  Labels: new-parser

 As discussed on the mailing list:
 http://mail-archives.apache.org/mod_mbox/tika-dev/201104.mbox/%3calpine.deb.2.00.1104052028380.29...@urchin.earth.li%3E
 This issue is to track improvements in the ExternalParser support to handle 
 metadata extraction, and probably easier configuration of an external parser 
 too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-634) Command Line Parser for Metadata Extraction

2015-03-01 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342743#comment-14342743
 ] 

Nick Burch commented on TIKA-634:
-

We still seem to lack proper unit tests for {{ExternalParser}} in the Tika Core 
module, so I think it needs to stay open until some are added, and until Ray is 
happy it's all working fine for ffmpeg as well!

 Command Line Parser for Metadata Extraction
 ---

 Key: TIKA-634
 URL: https://issues.apache.org/jira/browse/TIKA-634
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 0.9
Reporter: Nick Burch
Assignee: Nick Burch
Priority: Minor
  Labels: new-parser

 As discussed on the mailing list:
 http://mail-archives.apache.org/mod_mbox/tika-dev/201104.mbox/%3calpine.deb.2.00.1104052028380.29...@urchin.earth.li%3E
 This issue is to track improvements in the ExternalParser support to handle 
 metadata extraction, and probably easier configuration of an external parser 
 too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-675) PackageExtractor should track names of recursively nested resources

2015-03-01 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342749#comment-14342749
 ] 

Nick Burch commented on TIKA-675:
-

I think this is already handled by the RecursiveParserWrapper, via the 
EMBEDDED_RESOURCE_PATH metadata key?

 PackageExtractor should track names of recursively nested resources
 ---

 Key: TIKA-675
 URL: https://issues.apache.org/jira/browse/TIKA-675
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 0.10
Reporter: Andrzej Bialecki 

 When parsing archive formats the hierarchy of names is not tracked, only the 
 current embedded component's name is preserved under 
 Metadata.RESOURCE_NAME_KEY. In a way similar to the VFS model it would be 
 nice to build pseudo-urls for nested resources. In case of Tika API that uses 
 streams this could look like 
 {code}tar:gz:stream://example.tar.gz!/example.tar!/example.html{code} ...or 
 otherwise track the parent-child relationship - e.g. some applications need 
 this information to indicate what composite documents to delete from the 
 index after a container archive has been deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-712) Master slide text isn't extracted

2015-03-01 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342750#comment-14342750
 ] 

Nick Burch commented on TIKA-712:
-

I think it might already be as fixed as it can be? It isn't perfect, as POI's 
HSLF can't detect the boilerplate text there yet, but otherwise I think it's 
pretty much there. [~mikemccand] can hopefully confirm?

 Master slide text isn't extracted
 -

 Key: TIKA-712
 URL: https://issues.apache.org/jira/browse/TIKA-712
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Attachments: TIKA-712-master-slide.xml, TIKA-712.patch, 
 TIKA-712.patch, testPPT_masterFooter.ppt, testPPT_masterFooter.pptx, 
 testPPT_masterFooter2.ppt, testPPT_masterFooter2.pptx


 It looks like we are not getting text from the master slide for PPT
 and PPTX.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-727) Improve the outputed XHTML by HSLFExtractor

2015-03-01 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-727.
-
Resolution: Fixed

I believe this has been fixed for some time, so I'm closing it. If you still 
have this problem, please re-open the bug and attach a small test file which 
shows the problem!

 Improve the outputed XHTML by HSLFExtractor
 ---

 Key: TIKA-727
 URL: https://issues.apache.org/jira/browse/TIKA-727
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 0.10
Reporter: Pablo Queixalos
Priority: Minor
 Attachments: HSLFExtractor.java, HSLFExtractor.patch


 The XHTML output of the HSLFExtractor parser is not pure XHTML; it only inserts 
 the full text into a P[aragraph] tag (including non-HTML carriage returns).  
 This behavior comes from the limited capabilities that the POI 
 PowerPointExtractor offers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-770) New ODF metadata keys

2015-03-01 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342764#comment-14342764
 ] 

Nick Burch commented on TIKA-770:
-

I think this probably wants to be a Tika 2.0 fix. We have some other metadata 
keys in there which have also been deprecated, so it's probably best to remove 
them all in one go in Tika 2.0, rather than remove a few now and the rest 
later, to avoid confusion.

 New ODF metadata keys
 -

 Key: TIKA-770
 URL: https://issues.apache.org/jira/browse/TIKA-770
 Project: Tika
  Issue Type: Improvement
  Components: metadata, parser
Reporter: Jukka Zitting
Priority: Minor
  Labels: odf

 Followup from TIKA-764.
 {quote}
 The 2nd step is to add a few extra common keys for the stats that ODF has 
 that aren't covered, then remove the non standard keys
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1531) Upgrade to POI 3.12-beta1 when available

2015-03-01 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342768#comment-14342768
 ] 

Nick Burch commented on TIKA-1531:
--

Apache POI 3.12 beta 1 was released over the weekend, in case anyone wants to 
tackle this!

 Upgrade to POI 3.12-beta1 when available
 

 Key: TIKA-1531
 URL: https://issues.apache.org/jira/browse/TIKA-1531
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor

 Opening this issue to track integration items with POI 3.12-beta1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Curating Issues

2015-03-01 Thread Tyler Palsulich
I'll keep GSOC in mind. We should also start labeling issues with 2.0.

Tyler
On Mar 1, 2015 11:39 PM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Good idea, Nick. The vision parser I threw up I labeled with
 gsoc2015 - if there are any takers, please send them my way!

 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Nick Burch apa...@gagravarr.org
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Sunday, March 1, 2015 at 8:14 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Re: Curating Issues

 On Sun, 1 Mar 2015, Tyler Palsulich wrote:
  I've started labeling some issues as new-parser and newbie. I think
  these should be helpful for organization. Please let me know if there
 is
  another label we've already been using for those. I put new-parser on
  any requests to support a new filetype, even if it doesn't require a
  full on Parser (e.g. just magic).
 
 I don't know if anyone has the time to mentor, but there's just about
 still time to get something into GSoC for 2015. If we do have someone who
 could mentor a student in the summer, then it could be worth tagging any
 summer sized issues with gsoc2015.
 http://community.apache.org/gsoc.html
 has some more info for anyone new to gsoc
 
 Nick




Re: Curating Issues

2015-03-01 Thread Nick Burch

On Mon, 2 Mar 2015, Tyler Palsulich wrote:

I'll keep GSOC in mind. We should also start labeling issues with 2.0.


I think we only have a few issues for that currently, mostly around 
metadata keys, but it may grow!


As a reminder for everyone, don't forget we've got a wiki page at
https://wiki.apache.org/tika/Tika2_0RoadMap to track the things we need a 
major version number bump to change/break


Nick


[jira] [Updated] (TIKA-912) Response charset encoding not declared, and depends on host OS (Windows/Linux)

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-912:
-
Attachment: TIKA-912.palsulich.patch

Attached an updated patch which adds charset info to each {{@Produces}} 
annotation. If no one objects, I'll commit it this week.
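
Roughly, the change is just a charset parameter on the media type in each 
annotation. A simplified sketch of the shape of it (the resource and method here 
are placeholders, not the exact tika-server code):

{code}
import java.io.InputStream;

import javax.ws.rs.Consumes;
import javax.ws.rs.PUT;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;

@Path("/tika")
public class ExampleResource {

    // Declaring the charset in @Produces makes the response advertise
    // "Content-Type: text/plain; charset=UTF-8", and the standard String
    // writer should encode with it, so clients no longer have to guess
    // between UTF-8 (Linux default) and windows-1252 (Windows default).
    @PUT
    @Consumes("*/*")
    @Produces("text/plain; charset=UTF-8")
    public String getText(InputStream document) throws Exception {
        // ... parse with Tika and return the extracted text ...
        return "";
    }
}
{code}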

 Response charset encoding not declared, and depends on host OS (Windows/Linux)
 --

 Key: TIKA-912
 URL: https://issues.apache.org/jira/browse/TIKA-912
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.1
 Environment: java version 1.6.0_26
 Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
 Java HotSpot(TM) Server VM (build 20.1-b02, mixed mode)
 java version 1.6.0_31
 Java(TM) SE Runtime Environment (build 1.6.0_31-b05)
 Java HotSpot(TM) Client VM (build 20.6-b01, mixed mode, sharing)
Reporter: Chris Wilson
  Labels: newbie, patch
 Attachments: TIKA-912.palsulich.patch, 
 TikaResource-utf8-response.patch, TikaResource.java.patch


 When the response to the /tika servlet contains non-ASCII characters, Tika 
 doesn't tell us what encoding it's using, and the encoding differs depending 
 on which OS the server is running on.
 This is a server running on Tomcat on Linux:
 {code}
 chris@lap-x201:~/projects/atamis-intranet/django/intranet$ curl -i -T 
 documents/fixtures/smartquote-bullet.docx http://localhost:8080/tika/tika | 
 hexdump -C
   48 54 54 50 2f 31 2e 31  20 31 30 30 20 43 6f 6e  |HTTP/1.1 100 Con|
 0010  74 69 6e 75 65 0d 0a 0d  0a 48 54 54 50 2f 31 2e  |tinueHTTP/1.|
 0020  31 20 32 30 30 20 4f 4b  0d 0a 53 65 72 76 65 72  |1 200 OK..Server|
 0030  3a 20 41 70 61 63 68 65  2d 43 6f 79 6f 74 65 2f  |: Apache-Coyote/|
 0040  31 2e 31 0d 0a 43 6f 6e  74 65 6e 74 2d 54 79 70  |1.1..Content-Typ|
 0050  65 3a 20 74 65 78 74 2f  70 6c 61 69 6e 0d 0a 54  |e: text/plain..T|
 0060  72 61 6e 73 66 65 72 2d  45 6e 63 6f 64 69 6e 67  |ransfer-Encoding|
 0070  3a 20 63 68 75 6e 6b 65  64 0d 0a 44 61 74 65 3a  |: chunked..Date:|
 0080  20 46 72 69 2c 20 30 34  20 4d 61 79 20 32 30 31  | Fri, 04 May 201|
 0090  32 20 31 39 3a 34 30 3a  35 34 20 47 4d 54 0d 0a  |2 19:40:54 GMT..|
 00a0  0d 0a e2 80 99 0a e2 80  a2 09 0a |...|
 00ab
 {code}
 And this is a server running on Tomcat on Windows:
 {code}
 chris@lap-x201:~/projects/atamis-intranet/django/intranet$ curl -i -T 
 documents/fixtures/smartquote-bullet.docx http://localhost:9080/tika/tika | 
 hexdump -C
   48 54 54 50 2f 31 2e 31  20 31 30 30 20 43 6f 6e  |HTTP/1.1 100 Con|
 0010  74 69 6e 75 65 0d 0a 0d  0a 48 54 54 50 2f 31 2e  |tinueHTTP/1.|
 0020  31 20 32 30 30 20 4f 4b  0d 0a 53 65 72 76 65 72  |1 200 OK..Server|
 0030  3a 20 41 70 61 63 68 65  2d 43 6f 79 6f 74 65 2f  |: Apache-Coyote/|
 0040  31 2e 31 0d 0a 43 6f 6e  74 65 6e 74 2d 54 79 70  |1.1..Content-Typ|
 0050  65 3a 20 74 65 78 74 2f  70 6c 61 69 6e 0d 0a 54  |e: text/plain..T|
 0060  72 61 6e 73 66 65 72 2d  45 6e 63 6f 64 69 6e 67  |ransfer-Encoding|
 0070  3a 20 63 68 75 6e 6b 65  64 0d 0a 44 61 74 65 3a  |: chunked..Date:|
 0080  20 46 72 69 2c 20 30 34  20 4d 61 79 20 32 30 31  | Fri, 04 May 201|
 0090  32 20 31 39 3a 33 39 3a  35 32 20 47 4d 54 0d 0a  |2 19:39:52 GMT..|
 00a0  0d 0a 92 0a 95 09 0a  |...|
 00a7
 {code}
 As you can see, the data (last few bytes) is encoded differently. The Linux 
 server encodes it as UTF-8, while Windows is using something strange, 
 probably Windows-1252, where 0x92 is a curly quote and 0x95 is a bullet point.
 A client can't know what encoding the server used, because the Content-Type 
 is just text/plain with no encoding.
 Ideally I would like it to use UTF-8 always, so that the client doesn't have 
 to do extra work to decode it. The attached patch does that, and declares it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-613) PDF parser is changing letters positions

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-613.

Resolution: Fixed

Significant PDF updates within Tika and PDFBox since this issue. Can reopen if 
it's still a problem.

 PDF parser is changing letters positions
 

 Key: TIKA-613
 URL: https://issues.apache.org/jira/browse/TIKA-613
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.9
 Environment: running Tika inside VB.NET 2010 with IKVM 
Reporter: Alex
  Labels: parser, pdf

 The pdf parser is changing the position of some letters and adding spaces 
 inside the text. 
 For example:
 Parsed text
 O fluox  de caixa e os ganhos econmô icos referentes à estocagem dos RSD no 
 aterro sanitário
 Original
 O fluxo de caixa e os ganhos econômicos referentes à estocagem dos RSD no 
 aterro sanitário
 I've parsed the same text with iTextSharp and the result was OK.
 The original pdf file is here:
 http://www.teses.usp.br/teses/disponiveis/8/8135/tde-04072008-113118/publico/DISSERTACAO_JOSE_EDUARDO_ABBAS.pdf
 [UPDATE]
 It looks like the changed letter positions are fixed in the new version (1.0), 
 but there are some extra spaces in the text:
 Parsed
 Os processos econômicos e polít icos causadores da atual forma de geração 
 dos RSD
 Original
 Os processos econômicos e políticos causadores da atual forma de geração dos 
 RSD



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

