RE: Tika 1.15.1? -> 1.16

2017-06-30 Thread Allison, Timothy B.
Y, I was thinking that I may have already pushed us over this threshold with 
the * below.  1.16 it is then?

Chris, let us know when the age detection is good to go or if 1.17 is a better 
target.


  * Allow extraction of scripts as embedded "MACRO". Users
    must turn this on via TikaConfig (TIKA-2391).

  * Allow users to turn off extraction of headers and footers
    from .doc, .docx, .xls, .xlsx, .xlsb (TIKA-2362).

  * Extract text from charts in .docx, .pptx, .xlsx and .xlsb
    (TIKA-2254).

  * Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb
    (TIKA-1945).

  * Enable base32 encoding of digests and enable BouncyCastle implementations
    of digest algorithms (TIKA-2386).

-Original Message-
From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com] 
Sent: Thursday, June 29, 2017 4:12 PM
To: dev@tika.apache.org
Subject: Re: Tika 1.15.1?

Agreed.

Luis


2017-06-29 15:45 GMT-03:00 Bob Paulin :

> If we're adding features does it make sense just to bump to 1.16 
> rather than 1.15.1?  Traditionally point releases would be bug fixes only [1].
>
>
> - Bob
>
> [1] http://semver.org/
> On 6/29/2017 1:18 PM, Allison, Timothy B. wrote:
> > K.
> >
> > -Original Message-
> > From: Mattmann, Chris A (3010) 
> > [mailto:chris.a.mattm...@jpl.nasa.gov]
> > Sent: Thursday, June 29, 2017 1:59 PM
> > To: dev@tika.apache.org
> > Subject: Re: Tika 1.15.1?
> >
> > Hey Tim, I’d like to try and get in:
> >
> > https://issues.apache.org/jira/browse/TIKA-1988
> >
> > today for 15.1. I am working on integrating it now and adding some docs
> > to the wiki.
> >
> > I’ll keep you posted.
> >
> > Cheers,
> > Chris
> >
> >
> > ++
> > Chris Mattmann, Ph.D.
> > Principal Data Scientist, Engineering Administrative Office (3010)
> > Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 180-503E, Mailstop: 180-503
> > Email: chris.a.mattm...@nasa.gov
> > WWW:  http://sunset.usc.edu/~mattmann/
> > ++
> > Director, Information Retrieval and Data Science Group (IRDS)
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > WWW: http://irds.usc.edu/
> > ++
> >
> >
> > On 6/28/17, 12:24 PM, "Allison, Timothy B."  wrote:
> >
> > POI is available on maven, and I just upgraded.
> >
> > Unless there are objections, I'll change our
> >
> > org.apache.tika.parser.sentiment.analysis.SentimentParser
> >
> > to
> >
> > 
> > org.apache.tika.parser.sentiment.analysis.SentimentAnalysisParser
> >
> > and we should be good to go for 1.15.1?
> >
> > Let me know if you'd like to hold off for a bit, but there's always
> > 1.15.2.   :)
> >
> > Cheers,
> >
> >   Tim
> >
> > -Original Message-
> > From: Mattmann, Chris A (3010) 
> > [mailto:chris.a.mattm...@jpl.nasa.gov]
> > Sent: Friday, June 23, 2017 3:39 PM
> > To: dev@tika.apache.org
> > Subject: Re: Tika 1.15.1?
> >
> > Let me get back to you. I’d like to see if we can get some progress
> > on the Age Detector Parser.
> >
> >
> >
> > On 6/23/17, 10:01 AM, "Allison, Timothy B." wrote:
> >
> > All,
> >   With the exception of the SentimentParser (which we have a path
> > forward on), I think we're good to go.  It looks like POI is about to
> > kick off the release process for 3.17-beta1, and the batch results look
> > good.  I propose waiting a week or so to incorporate that.
> >   Anything else we need to get in for 1.15.1?
> >
> >  Cheers,
> >
> >   Tim
> >
> > -Original Message-
> > From: Chris Mattmann [mailto:mattm...@apache.org]
> > Sent: Friday, June 16, 2017 2:43 PM
> > To: dev@tika.apache.org
> > Subject: Re: Tika 1.15.1?
> >
> 

[jira] [Updated] (TIKA-2407) Tika crashed while parsing corrupt PDF

2017-06-30 Thread Jorge Spinsanti (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Spinsanti updated TIKA-2407:
--
Description: 
Tika throws an exception when trying to parse a corrupt PDF file to extract text 
content (see attached file):
{code}
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@d71dc5e
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 16 more
Caused by: java.io.IOException: Error reading stream, expected='endstream' actual='' at offset 116070
    at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1013)
    at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:781)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:742)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:673)
    at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:633)
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:241)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:276)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1132)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1066)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:141)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 23 more
{code}
Can you throw a specific exception to allow better error handling? Something 
like BadInputException or WrongFileException?

  was:
Tika throws an exception when trying to parse a corrupt PDF file:
{code}
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@d71dc5e
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 16 more
Caused by: java.io.IOException: Error reading stream, expected='endstream' actual='' at offset 116070
    at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1013)
    at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:781)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:742)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:673)
    at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:633)
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:241)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:276)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1132)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1066)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:141)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 23 more
{code}
Can you throw a specific exception to allow better error handling? Something 
like BadInputException or WrongFileException?



[jira] [Created] (TIKA-2407) Tika crashed while parsing corrupt PDF

2017-06-30 Thread Jorge Spinsanti (JIRA)
Jorge Spinsanti created TIKA-2407:
-

 Summary: Tika crashed while parsing corrupt PDF
 Key: TIKA-2407
 URL: https://issues.apache.org/jira/browse/TIKA-2407
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.15
Reporter: Jorge Spinsanti
 Attachments: IOException.pdf

Tika throws an exception when trying to parse a corrupt PDF file:
{code}
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@d71dc5e
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 16 more
Caused by: java.io.IOException: Error reading stream, expected='endstream' actual='' at offset 116070
    at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1013)
    at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:781)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:742)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:673)
    at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:633)
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:241)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:276)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1132)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1066)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:141)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 23 more
{code}
Can you throw a specific exception to allow better error handling? Something 
like BadInputException or WrongFileException?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TIKA-2407) Tika crashed while parsing corrupt PDF

2017-06-30 Thread Jorge Spinsanti (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Spinsanti updated TIKA-2407:
--
Attachment: IOException.pdf



[jira] [Commented] (TIKA-2407) Tika crashed while parsing corrupt PDF

2017-06-30 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070055#comment-16070055
 ] 

Nick Burch commented on TIKA-2407:
--

You'd be best off reporting this to the Apache PDFBox project, which is the 
library that Tika uses to process PDF files. That's the right place to get this 
fixed, or a more appropriate error thrown. You can report it as the PDFBOX 
project here, see https://issues.apache.org/jira/projects/PDFBOX
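
In the meantime, a caller can separate "bad input" from other failures by looking at the root cause of the wrapping exception. What follows is only a minimal sketch using plain JDK classes; `ParseFailures`, `rootCause` and `looksLikeBadInput` are hypothetical names for illustration and are not part of Tika or PDFBox:

```java
import java.io.IOException;

/** Hypothetical caller-side helper; not part of Tika or PDFBox. */
public class ParseFailures {

    /** Follows getCause() until the deepest cause in the chain is reached. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /**
     * Heuristic (an assumption of this sketch): if the deepest cause is an
     * IOException, e.g. "Error reading stream" or a ZipException, the input
     * file is probably corrupt rather than the parser being misconfigured.
     */
    public static boolean looksLikeBadInput(Throwable t) {
        return rootCause(t) instanceof IOException;
    }
}
```

A caller would wrap the parse call in a try/catch and pass the caught exception to `looksLikeBadInput`. Treating every IOException as bad input is only a heuristic: a genuine I/O failure while reading a healthy file would be classified the same way.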



[jira] [Commented] (TIKA-2407) Tika crashed while parsing corrupt PDF

2017-06-30 Thread Jorge Spinsanti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070069#comment-16070069
 ] 

Jorge Spinsanti commented on TIKA-2407:
---

Issue created in PDFBox project: 
https://issues.apache.org/jira/browse/PDFBOX-3849



[jira] [Created] (TIKA-2408) ZipException in text extraction from DOCX file

2017-06-30 Thread Jorge Spinsanti (JIRA)
Jorge Spinsanti created TIKA-2408:
-

 Summary: ZipException in text extraction from DOCX file
 Key: TIKA-2408
 URL: https://issues.apache.org/jira/browse/TIKA-2408
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.15
Reporter: Jorge Spinsanti
 Attachments: ZipException.docx

I got a ZipException when trying to extract text from a DOCX file (attached):
{code}
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 16 more
Caused by: java.util.zip.ZipException: invalid literal/lengths set
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:122)
    at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:210)
    at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
    at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
    at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
    at org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown Source)
    at org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
    at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
    at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 23 more
{code}
OpenOffice extracts text successfully.







[jira] [Updated] (TIKA-2408) ZipException in text extraction from DOCX file

2017-06-30 Thread Jorge Spinsanti (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Spinsanti updated TIKA-2408:
--
Attachment: ZipException.docx



[jira] [Commented] (TIKA-2402) Support all image formats in Object Recognition REST Parser

2017-06-30 Thread Thejan Wijesinghe (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070128#comment-16070128
 ] 

Thejan Wijesinghe commented on TIKA-2402:
-

Hi [~lfcnassif] :), this is for the Object Recognition REST parser. For Tika-DL 
we could use this ability of DL4J to support multiple image formats. Here, I'm 
planning to do the conversion in the Python back end. In fact, the conversion 
is quite fast: the image isn't saved to disk during or after the conversion; it 
all happens in memory. Are you suggesting using NativeImageLoader to load the 
image on the client side? That wouldn't help if the Python server only knows 
how to decode JPEG image bytes, right? [~tgow...@gmail.com], what do you think?


> Support all image formats in Object Recognition REST Parser
> ---
>
> Key: TIKA-2402
> URL: https://issues.apache.org/jira/browse/TIKA-2402
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Reporter: Thejan Wijesinghe
>Priority: Minor
> Fix For: 1.16
>
>
> Currently the Object Recognition REST parser only supports parsing the JPEG 
> image type. The objective of this task is to add support for all image 
> formats by converting any image into JPEG format at the server's end.





[jira] [Commented] (TIKA-2402) Support all image formats in Object Recognition REST Parser

2017-06-30 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070154#comment-16070154
 ] 

Luis Filipe Nassif commented on TIKA-2402:
--

Duh... sorry, I did not see the "REST" in the title; the suggestion was for the 
DL4J parser.



[jira] [Commented] (TIKA-2408) ZipException in text extraction from DOCX file

2017-06-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070271#comment-16070271
 ] 

Tim Allison commented on TIKA-2408:
---

[~Giorgy], thank you for opening this and sharing a triggering document!

MSWord also fails to open the document because of numbering.xml, and WinZip 
also has a problem with numbering.xml.  There really does appear to be a problem 
with the zipped numbering.xml.

But wait, there's good news!  Our new, experimental SAX-based docx parser 
ignores problems with numbering and extracts text from this document...the 
extracted numbers that rely on numbering.xml are nearly guaranteed to be bad, 
but you at least get something.

To tell Tika to use that parser instead of our legacy DOM-based parser, do 
something like this:

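{noformat}
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParserConfig;

// Opt in to the experimental SAX-based docx extractor.
ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
// Pass this parseContext to parser.parse(...) so the setting takes effect.
parseContext.set(OfficeParserConfig.class, officeParserConfig);
{noformat}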

Longer term, I'm not sure if we want to move the SAX parser into POI or leave 
it in Tika.  IMHO, POI is right to throw an exception and stop because POI's 
XWPFDocument is read/write, and I'm not sure there's a correct behavior for 
writing to a corrupt document.  However, if your goal is to extract as much as 
you can even if there are problems, then our new SAX parser is for you!

Let me know if you need help turning on that parser via tika-config.xml.  I 
should update our wiki...probably...if I haven't.
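
To confirm which part of a .docx is damaged (numbering.xml in this case), each zip entry can be inflated in turn. This is a rough sketch using only java.util.zip; `ZipEntryCheck` and `firstUnreadableEntry` are hypothetical names for illustration, not Tika or POI API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical diagnostic; not part of Tika or POI. */
public class ZipEntryCheck {

    /**
     * Streams through the archive and fully inflates each entry.
     * Returns the name of the first entry whose data cannot be read,
     * or null if every entry inflates cleanly. Entries after the first
     * failure are not checked, because the stream cannot be resumed.
     */
    public static String firstUnreadableEntry(byte[] zipBytes) {
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                try {
                    while (zip.read(buffer) != -1) {
                        // Discard the bytes; only successful inflation matters here.
                    }
                } catch (IOException e) {
                    // ZipException (bad deflate data or CRC mismatch) lands here too.
                    return entry.getName();
                }
            }
            return null;
        } catch (IOException e) {
            // The archive structure itself (entry headers) could not be read.
            throw new UncheckedIOException(e);
        }
    }
}
```

The bytes could come from something like `Files.readAllBytes(path)` on the suspect document. Because the check stops at the first bad entry, it says nothing about whether later entries are intact.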



[jira] [Commented] (TIKA-2405) SAXParseException in text extraction from DOCX file

2017-06-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070289#comment-16070289
 ] 

Tim Allison commented on TIKA-2405:
---

The footer does have a hdr element in it.   See: 
https://issues.apache.org/jira/browse/TIKA-2408?focusedCommentId=16070271&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16070271

Again, POI, IMHO, is right to throw an exception with this.  Other tools may 
choose to ignore it.  The new SAX-based docx parser has no problems with this 
document.

> SAXParseException in text extraction from DOCX file
> ---
>
> Key: TIKA-2405
> URL: https://issues.apache.org/jira/browse/TIKA-2405
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Jorge Spinsanti
> Attachments: SAXParseException.docx
>
>
> I got a SAXParseException during text extraction from a DOCX file (see attachment):
> {code}
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
> IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.io.IOException: Unable to parse xml bean
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:146)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 37; 
> The encoding declaration is required in the text declaration.
>   at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
>   at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.scanXMLDeclOrTextDecl(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>   at 
> org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
>   ... 33 more
> {code}
> Text extraction using OpenOffice is successful.





[jira] [Commented] (TIKA-2404) XMLException in DOCX->TXT conversion

2017-06-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070293#comment-16070293
 ] 

Tim Allison commented on TIKA-2404:
---

The footer does have a hdr element in it. See: 
https://issues.apache.org/jira/browse/TIKA-2408?focusedCommentId=16070271&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16070271
Again, POI, IMHO, is right to throw an exception with this. Other tools may 
choose to ignore it. The new SAX-based docx parser has no problems with this 
document.
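
For anyone who wants to confirm the mismatch without POI, here is a minimal 
sketch using only the JDK's StAX API. It reads the root element of an XML 
part so it can be compared against the expected ftr. The class and method 
names are hypothetical and are not part of Tika or POI; the inline XML only 
stands in for the footer part of the attached file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class FooterRootCheck {
    // WordprocessingML main namespace, as it appears in the exception message.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the local name of the first (root) element, or null if there is none. */
    static String rootLocalName(InputStream xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // DTDs are not needed for OOXML parts, so turn them off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getLocalName();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a footer part that wrongly contains a header root element.
        String part = "<w:hdr xmlns:w=\"" + W_NS + "\"><w:p/></w:hdr>";
        String root = rootLocalName(
                new ByteArrayInputStream(part.getBytes(StandardCharsets.UTF_8)));
        System.out.println("root element: " + root
                + ", is ftr: " + "ftr".equals(root));
    }
}
```

To check a real file, the same method could be pointed at the footer entry 
of the docx (it is a zip archive) instead of the inline string.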

> XMLException in DOCX->TXT conversion
> 
>
> Key: TIKA-2404
> URL: https://issues.apache.org/jira/browse/TIKA-2404
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.15
> Reporter: Jorge Spinsanti
> Attachments: XmlException.docx
>
>
> I got an XMLException when trying to extract text from a DOCX file (see attached 
> file):
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: org.apache.poi.POIXMLException: org.apache.xmlbeans.XmlException: 
> Element hdr@http://schemas.openxmlformats.org/wordprocessingml/2006/main is 
> not a valid ftr@http://schemas.openxmlformats.org/wordprocessingml/2006/main 
> document or a valid substitution.
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:121)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:196)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> Caused by: org.apache.xmlbeans.XmlException: Element 
> hdr@http://schemas.openxmlformats.org/wordprocessingml/2006/main is not a 
> valid ftr@http://schemas.openxmlformats.org/wordprocessingml/2006/main 
> document or a valid substitution.
>   at 
> org.apache.xmlbeans.impl.store.Locale.autoTypeDocument(Locale.java:322)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1384)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1363)
>   at 
> org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:370)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:144)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.FtrDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:96)
> {code}
> If I use OpenOffice, the text can be extracted successfully.





[jira] [Issue Comment Deleted] (TIKA-2405) SAXParseException in text extraction from DOCX file

2017-06-30 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2405:
--
Comment: was deleted

(was: The footer does have a hdr element in it.   See: 
https://issues.apache.org/jira/browse/TIKA-2408?focusedCommentId=16070271&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16070271

Again, POI, IMHO, is right to throw an exception with this.  Other tools may 
choose to ignore it.  The new SAX-based docx parser has no problems with this 
document.)

> SAXParseException in text extraction from DOCX file
> ---
>
> Key: TIKA-2405
> URL: https://issues.apache.org/jira/browse/TIKA-2405
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.15
> Reporter: Jorge Spinsanti
> Attachments: SAXParseException.docx
>
>
> I got a SAXParseException during text extraction from a DOCX file (see attachment):
> {code}
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
> IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.io.IOException: Unable to parse xml bean
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:146)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 37; 
> The encoding declaration is required in the text declaration.
>   at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
>   at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.scanXMLDeclOrTextDecl(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>   at 
> org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
>   ... 33 more
> {code}
> Text extraction using OpenOffice is successful.





[jira] [Commented] (TIKA-2405) SAXParseException in text extraction from DOCX file

2017-06-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070298#comment-16070298
 ] 

Tim Allison commented on TIKA-2405:
---

Again, thank you.  

The new SAX-based docx parser handles this document with no problem. Winzip 
can't extract numbering.xml (bad checksum).  There really is something wrong 
with it.

See: 
https://issues.apache.org/jira/browse/TIKA-2408?focusedCommentId=16070271&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16070271
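
A rough way to reproduce the WinZip finding with only java.util.zip: read 
every entry to the end and report the first one that fails to inflate or 
fails its CRC check. This is a sketch, not what Tika or POI do internally. 
DocxEntryCheck, firstUnreadableEntry and storedZip are hypothetical names, 
and the demo archive is built in memory with a STORED entry so that a single 
flipped byte reliably triggers the CRC error.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxEntryCheck {

    /**
     * Reads each entry fully. Returns null if all entries are readable,
     * otherwise "name: message" for the first failure. The failure is
     * attributed to the last entry whose header was read successfully.
     */
    static String firstUnreadableEntry(InputStream docx) {
        String current = null;
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            byte[] buf = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                current = entry.getName();
                while (zip.read(buf) != -1) {
                    // Discard the data; reading to the end triggers the CRC check.
                }
            }
            return null;
        } catch (IOException e) {
            return (current == null ? "(archive header)" : current)
                    + ": " + e.getMessage();
        }
    }

    /** Builds a one-entry archive with an uncompressed (STORED) entry, for the demo only. */
    static byte[] storedZip(String name, byte[] data) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(data);
        ZipEntry entry = new ZipEntry(name);
        entry.setMethod(ZipEntry.STORED);
        entry.setSize(data.length);
        entry.setCompressedSize(data.length);
        entry.setCrc(crc.getValue());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(entry);
            zip.write(data);
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String name = "word/numbering.xml";
        byte[] good = storedZip(name, "<numbering/>".getBytes(StandardCharsets.UTF_8));
        byte[] bad = good.clone();
        // First data byte: 30-byte local file header followed by the entry name.
        bad[30 + name.length()] ^= 0x01;
        System.out.println("intact:    " + firstUnreadableEntry(new ByteArrayInputStream(good)));
        System.out.println("corrupted: " + firstUnreadableEntry(new ByteArrayInputStream(bad)));
    }
}
```

For a real document, pass a FileInputStream over the .docx to 
firstUnreadableEntry. ZipInputStream cannot resume after a corrupt entry, so 
only the first failure is reported.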



> SAXParseException in text extraction from DOCX file
> ---
>
> Key: TIKA-2405
> URL: https://issues.apache.org/jira/browse/TIKA-2405
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.15
> Reporter: Jorge Spinsanti
> Attachments: SAXParseException.docx
>
>
> I got a SAXParseException during text extraction from a DOCX file (see attachment):
> {code}
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
> IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.io.IOException: Unable to parse xml bean
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:146)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 37; 
> The encoding declaration is required in the text declaration.
>   at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
>   at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.scanXMLDeclOrTextDecl(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>   at 
> org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
>   ... 33 more
> {code}
> Text extraction using OpenOffice is successful.





[jira] [Updated] (TIKA-2404) XMLException in DOCX->TXT conversion

2017-06-30 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2404:
--
Labels: experimental_sax  (was: )

> XMLException in DOCX->TXT conversion
> 
>
> Key: TIKA-2404
> URL: https://issues.apache.org/jira/browse/TIKA-2404
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.15
> Reporter: Jorge Spinsanti
> Labels: experimental_sax
> Attachments: XmlException.docx
>
>
> I got an XMLException when trying to extract text from a DOCX file (see attached 
> file):
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: org.apache.poi.POIXMLException: org.apache.xmlbeans.XmlException: 
> Element hdr@http://schemas.openxmlformats.org/wordprocessingml/2006/main is 
> not a valid ftr@http://schemas.openxmlformats.org/wordprocessingml/2006/main 
> document or a valid substitution.
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:121)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:196)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> Caused by: org.apache.xmlbeans.XmlException: Element 
> hdr@http://schemas.openxmlformats.org/wordprocessingml/2006/main is not a 
> valid ftr@http://schemas.openxmlformats.org/wordprocessingml/2006/main 
> document or a valid substitution.
>   at 
> org.apache.xmlbeans.impl.store.Locale.autoTypeDocument(Locale.java:322)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1384)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1363)
>   at 
> org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:370)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:144)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.FtrDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:96)
> {code}
> If I use OpenOffice, the text can be extracted successfully.





[jira] [Updated] (TIKA-2408) ZipException in text extraction from DOCX file

2017-06-30 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2408:
--
Labels: experimental_sax  (was: )

> ZipException in text extraction from DOCX file
> --
>
> Key: TIKA-2408
> URL: https://issues.apache.org/jira/browse/TIKA-2408
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.15
> Reporter: Jorge Spinsanti
> Labels: experimental_sax
> Attachments: ZipException.docx
>
>
> I got a ZipException when trying to extract text from a DOCX file (attached):
> {code}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.util.zip.ZipException: invalid literal/lengths set
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:122)
>   at 
> org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:210)
>   at 
> org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown 
> Source)
>   at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>   at 
> org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> {code}
> OpenOffice extracts text successfully.





[jira] [Updated] (TIKA-2405) SAXParseException in text extraction from DOCX file

2017-06-30 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2405:
--
Labels: experimental_sax  (was: )

> SAXParseException in text extraction from DOCX file
> ---
>
> Key: TIKA-2405
> URL: https://issues.apache.org/jira/browse/TIKA-2405
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.15
> Reporter: Jorge Spinsanti
> Labels: experimental_sax
> Attachments: SAXParseException.docx
>
>
> I got a SAXParseException during text extraction from a DOCX file (see attachment):
> {code}
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
> IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.io.IOException: Unable to parse xml bean
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:146)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 37; 
> The encoding declaration is required in the text declaration.
>   at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
>   at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.scanXMLDeclOrTextDecl(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>   at 
> org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
>   ... 33 more
> {code}
> Text extraction using OpenOffice is successful.





[jira] [Updated] (TIKA-2404) XMLException in DOCX->TXT conversion

2017-06-30 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2404:
--
Labels: sax_docx_fixes  (was: experimental_sax)

> XMLException in DOCX->TXT conversion
> 
>
> Key: TIKA-2404
> URL: https://issues.apache.org/jira/browse/TIKA-2404
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.15
> Reporter: Jorge Spinsanti
> Labels: sax_docx_fixes
> Attachments: XmlException.docx
>
>
> I got an XMLException when trying to extract text from a DOCX file (see attached 
> file):
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: org.apache.poi.POIXMLException: org.apache.xmlbeans.XmlException: 
> Element hdr@http://schemas.openxmlformats.org/wordprocessingml/2006/main is 
> not a valid ftr@http://schemas.openxmlformats.org/wordprocessingml/2006/main 
> document or a valid substitution.
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:121)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:196)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> Caused by: org.apache.xmlbeans.XmlException: Element 
> hdr@http://schemas.openxmlformats.org/wordprocessingml/2006/main is not a 
> valid ftr@http://schemas.openxmlformats.org/wordprocessingml/2006/main 
> document or a valid substitution.
>   at 
> org.apache.xmlbeans.impl.store.Locale.autoTypeDocument(Locale.java:322)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1384)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1363)
>   at 
> org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:370)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:144)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.FtrDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:96)
> {code}
> If I use OpenOffice, the text can be extracted successfully.





[jira] [Updated] (TIKA-2408) ZipException in text extraction from DOCX file

2017-06-30 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2408:
--
Labels: sax_docx_fixes  (was: experimental_sax)

> ZipException in text extraction from DOCX file
> --
>
> Key: TIKA-2408
> URL: https://issues.apache.org/jira/browse/TIKA-2408
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.15
> Reporter: Jorge Spinsanti
> Labels: sax_docx_fixes
> Attachments: ZipException.docx
>
>
> I got a ZipException when trying to extract text from a DOCX file (attached):
> {code}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.util.zip.ZipException: invalid literal/lengths set
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:122)
>   at 
> org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:210)
>   at 
> org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown 
> Source)
>   at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>   at 
> org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> {code}
> OpenOffice extracts text successfully.





[jira] [Updated] (TIKA-2405) SAXParseException in text extraction from DOCX file

2017-06-30 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2405:
--
Labels: sax_docx_fixes  (was: experimental_sax)

> SAXParseException in text extraction from DOCX file
> ---
>
> Key: TIKA-2405
> URL: https://issues.apache.org/jira/browse/TIKA-2405
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.15
> Reporter: Jorge Spinsanti
> Labels: sax_docx_fixes
> Attachments: SAXParseException.docx
>
>
> I got a SAXParseException during text extraction from a DOCX file (see attachment):
> {code}
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
> IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.io.IOException: Unable to parse xml bean
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:146)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 37; 
> The encoding declaration is required in the text declaration.
>   at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
>   at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.scanXMLDeclOrTextDecl(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>   at 
> org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
>   ... 33 more
> {code}
> Text extraction using OpenOffice is successful.





[jira] [Updated] (TIKA-2201) OutOfMemoryError on a reasonably sized document

2017-06-30 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2201:
--
Labels: sax_pptx_fixes  (was: )

> OutOfMemoryError on a reasonably sized document
> ---
>
> Key: TIKA-2201
> URL: https://issues.apache.org/jira/browse/TIKA-2201
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
>  Labels: sax_pptx_fixes
>
> The following document, which is not particularly big, causes an OOM in the 
> Tika parser:
> https://dl.dropboxusercontent.com/u/92341073/Certificates-9-20-2013.pptx
> Java memory limit is 4GB.





[jira] [Updated] (TIKA-2109) OutOfMemory when parsing 5MB word document

2017-06-30 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2109:
--
Labels: sax_docx_fixes  (was: )

> OutOfMemory when parsing 5MB word document
> --
>
> Key: TIKA-2109
> URL: https://issues.apache.org/jira/browse/TIKA-2109
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.13
> Environment: openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-0ubuntu4~14.04-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Julian
>  Labels: sax_docx_fixes
> Fix For: 2.0, 1.15
>
> Attachments: zafar-bug-9.docx
>
>
> When I run the following command to extract text from the attached 5MB word 
> document, I get the OOM error below.
> java -jar tika-app-1.13.jar --text '/vagrant/zafar-bug-9.docx'
> The problem goes away if I set -Xms2G -Xmx2G, but I'm reluctant to specify 
> such a high setting for my use case for what seems like a small file? Also I 
> don't see this error with other files of similar size.
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>   at 
> com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl.getNodeObject(DeferredDocumentImpl.java:972)
>   at 
> com.sun.org.apache.xerces.internal.dom.DeferredElementNSImpl.synchronizeData(DeferredElementNSImpl.java:126)
>   at 
> com.sun.org.apache.xerces.internal.dom.ElementNSImpl.getNamespaceURI(ElementNSImpl.java:250)
>   at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1420)
>   at 
> org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
>   at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
>   at 
> org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
>   at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
>   at 
> org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
>   at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
>   at 
> org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
>   at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
>   at 
> org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
>   at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1385)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1370)
>   at 
> org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:370)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:117)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:164)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)





[jira] [Commented] (TIKA-2408) ZipException in text extraction from DOCX file

2017-06-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070309#comment-16070309
 ] 

Tim Allison commented on TIKA-2408:
---

See: 
https://issues.apache.org/jira/browse/TIKA-2109?focusedCommentId=15828228&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15828228

for how to configure tika-config.xml

> ZipException in text extraction from DOCX file
> --
>
> Key: TIKA-2408
> URL: https://issues.apache.org/jira/browse/TIKA-2408
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Jorge Spinsanti
>  Labels: sax_docx_fixes
> Attachments: ZipException.docx
>
>
> I got a ZipException when trying to extract text from a DOCX file (attached):
> {code}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.util.zip.ZipException: invalid literal/lengths set
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:122)
>   at 
> org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:210)
>   at 
> org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown 
> Source)
>   at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>   at 
> org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> {code}
> OpenOffice extracts text successfully.
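
The trace above fails inside InflaterInputStream while POI is loading the numbering part, which suggests that one zip entry of the DOCX is damaged rather than the whole container. A minimal JDK-only sketch for listing the entries that do not inflate cleanly follows. The class and method names are hypothetical and are not part of Tika or POI, and the check only exercises decompression (ZipFile does not verify CRCs).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of Tika or POI.
final class ZipEntryCheck {

    // Returns "entryName: message" for every entry that cannot be read to the end.
    static List<String> unreadableEntries(String path) throws IOException {
        List<String> bad = new ArrayList<>();
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            byte[] buf = new byte[8192];
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                try (InputStream in = zip.getInputStream(entry)) {
                    // Drain the entry; corrupt deflate data surfaces as a ZipException.
                    while (in.read(buf) != -1) {
                        // nothing to do, we only care whether reading succeeds
                    }
                } catch (IOException e) {
                    bad.add(entry.getName() + ": " + e.getMessage());
                }
            }
        }
        return bad;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ZipEntryCheck <file.docx>");
            return;
        }
        for (String line : unreadableEntries(args[0])) {
            System.out.println(line);
        }
    }
}
```

If only a secondary part such as the numbering definitions shows up in the list, that would be consistent with an extractor that does not depend on that part still producing text.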





[jira] [Commented] (TIKA-2408) ZipException in text extraction from DOCX file

2017-06-30 Thread Jorge Spinsanti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070322#comment-16070322
 ] 

Jorge Spinsanti commented on TIKA-2408:
---

Thank you for your reply.

Yes, I need help with the tika-config.xml configuration to force the use of the 
SAX-based docx parser. 

> ZipException in text extraction from DOCX file
> --
>
> Key: TIKA-2408
> URL: https://issues.apache.org/jira/browse/TIKA-2408
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Jorge Spinsanti
>  Labels: sax_docx_fixes
> Attachments: ZipException.docx
>
>
> I got a ZipException when trying to extract text from a DOCX file (attached):
> {code}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.util.zip.ZipException: invalid literal/lengths set
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:122)
>   at 
> org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:210)
>   at 
> org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown 
> Source)
>   at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>   at 
> org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> {code}
> OpenOffice extracts text successfully.





[jira] [Issue Comment Deleted] (TIKA-2408) ZipException in text extraction from DOCX file

2017-06-30 Thread Jorge Spinsanti (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Spinsanti updated TIKA-2408:
--
Comment: was deleted

(was: Thank you for your reply.

Yes, I need help with the tika-config.xml configuration to force the use of the 
SAX-based docx parser.)

> ZipException in text extraction from DOCX file
> --
>
> Key: TIKA-2408
> URL: https://issues.apache.org/jira/browse/TIKA-2408
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Jorge Spinsanti
>  Labels: sax_docx_fixes
> Attachments: ZipException.docx
>
>
> I got a ZipException when trying to extract text from a DOCX file (attached):
> {code}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.util.zip.ZipException: invalid literal/lengths set
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:122)
>   at 
> org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:210)
>   at 
> org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown 
> Source)
>   at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>   at 
> org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> {code}
> OpenOffice extracts text successfully.





[jira] [Commented] (TIKA-2406) IllegalArgumentException in text extraction from PDF file

2017-06-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070332#comment-16070332
 ] 

Tim Allison commented on TIKA-2406:
---

Thank you for sharing.  This file is corrupt, as you noted.  Please do the same 
thing with this that you did with TIKA-2407.  I wasn't able to get anything out 
of this file even with the legacy 1.8.x branch.  Is your request for a clearer 
exception?  Or, how should this document be handled?

> IllegalArgumentException in text extraction from PDF file
> -
>
> Key: TIKA-2406
> URL: https://issues.apache.org/jira/browse/TIKA-2406
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Jorge Spinsanti
> Attachments: IllegalArgumentException.pdf
>
>
> I got an IllegalArgumentException in text extraction from PDF file (attached):
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from org.apache.tika.parser.pdf.PDFParser@d71dc5e
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.lang.IllegalArgumentException: root cannot be null
>   at org.apache.pdfbox.pdmodel.PDPageTree.<init>(PDPageTree.java:75)
>   at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:129)
>   at 
> org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:1381)
>   at 
> org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:235)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> {code}





[jira] [Commented] (TIKA-2405) SAXParseException in text extraction from DOCX file

2017-06-30 Thread Jorge Spinsanti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070379#comment-16070379
 ] 

Jorge Spinsanti commented on TIKA-2405:
---

Thanks! As you commented, the issue is not reproduced with the SAX-based docx 
parser.

Do you foresee any problems if I migrate to the SAX-based docx parser for all 
docx documents?

> SAXParseException in text extraction from DOCX file
> ---
>
> Key: TIKA-2405
> URL: https://issues.apache.org/jira/browse/TIKA-2405
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Jorge Spinsanti
>  Labels: sax_docx_fixes
> Attachments: SAXParseException.docx
>
>
> I got a SAXParseException during text extraction from a DOCX file (see attachment):
> {code}
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
> IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.io.IOException: Unable to parse xml bean
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:146)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 37; 
> The encoding declaration is required in the text declaration.
>   at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
>   at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.scanXMLDeclOrTextDecl(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>   at 
> org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
>   ... 33 more
> {code}
> Text extraction using OpenOffice is successful.





[jira] [Commented] (TIKA-2408) ZipException in text extraction from DOCX file

2017-06-30 Thread Jorge Spinsanti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070381#comment-16070381
 ] 

Jorge Spinsanti commented on TIKA-2408:
---

Thanks a lot! The issue is not reproducible using the SAX-based docx parser.

> ZipException in text extraction from DOCX file
> --
>
> Key: TIKA-2408
> URL: https://issues.apache.org/jira/browse/TIKA-2408
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Jorge Spinsanti
>  Labels: sax_docx_fixes
> Attachments: ZipException.docx
>
>
> I got a ZipException when trying to extract text from a DOCX file (attached):
> {code}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.util.zip.ZipException: invalid literal/lengths set
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:122)
>   at 
> org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:210)
>   at 
> org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown 
> Source)
>   at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>   at 
> org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> {code}
> OpenOffice extracts text successfully.





[jira] [Commented] (TIKA-2405) SAXParseException in text extraction from DOCX file

2017-06-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070391#comment-16070391
 ] 

Tim Allison commented on TIKA-2405:
---

I regret I haven't had a chance to do a formal evaluation between the legacy DOM 
parser and the new SAX parser.  IIRC, I didn't get around to some formatting 
stuff (putting footnotes in the right locations?) in the new SAX parser, but it 
will be more robust on docs like the ones you shared with us, and it will be 
more robust at extracting text: it makes _no assumptions_ about where text 
should be (e.g. TIKA-1130) and extracts everything in the document.xml.  It 
will also likely use far less memory (really only a problem in practice with 
huge docs).

In short, yes, I'd move everything over to the new SAX parser.  I also added a 
SAX parser for pptx for the same reasons...with the same caveats.

If you have the time, you could run tika-app.jar against your docx files with 
and without SAX and then run tika-eval's Compare to see if there are any 
degradations in extracted content, increases in exceptions, etc.  See the 
[wiki|https://wiki.apache.org/tika/TikaEval], 
[slides|http://events.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf]
 and/or [youtube|https://www.youtube.com/watch?v=vRPTPMwI53k].

I'd be more than happy to walk you through that process.  You have the rare 
opportunity to be the second person in the world to run it. :)
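
The streaming approach described above (collect text wherever it appears in document.xml, without building a DOM) can be illustrated with the JDK's SAX API alone. This is only a sketch under stated assumptions: the class name is hypothetical, it is not Tika's actual SAX docx extractor, and it assumes body text lives in w:t elements of the WordprocessingML main namespace, with w:p marking paragraph ends.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika's implementation.
final class SaxTextSketch {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Streams the XML once and keeps only the characters found inside w:t.
    static String extractText(InputStream xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        final StringBuilder out = new StringBuilder();
        parser.parse(xml, new DefaultHandler() {
            private boolean inText = false;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (W_NS.equals(uri) && "t".equals(localName)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(localName)) {
                    inText = false;
                } else if ("p".equals(localName)) {
                    out.append('\n'); // one line per paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p><w:r>"
                + "<w:t>Hello</w:t></w:r></w:p></w:body></w:document>";
        System.out.print(extractText(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Because the handler holds only the output buffer and a flag, memory use grows with the extracted text rather than with the full element tree, which is the contrast with the DOM-based path seen in the traces.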

> SAXParseException in text extraction from DOCX file
> ---
>
> Key: TIKA-2405
> URL: https://issues.apache.org/jira/browse/TIKA-2405
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Jorge Spinsanti
>  Labels: sax_docx_fixes
> Attachments: SAXParseException.docx
>
>
> I got a SAXParseException during text extraction from a DOCX file (see attachment):
> {code}
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
> IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.io.IOException: Unable to parse xml bean
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:146)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 37; 
> The encoding declaration is required in the text declaration.
>   at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
>   at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.scanXMLDeclOrTextDecl(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>   at 
> org.apache.poi.util.DocumentHelper.read

[jira] [Updated] (TIKA-2147) ClassCastException on a valid Word template

2017-06-30 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2147:
--
Labels: sax_docx_fixes  (was: )

> ClassCastException on a valid Word template
> ---
>
> Key: TIKA-2147
> URL: https://issues.apache.org/jira/browse/TIKA-2147
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
>  Labels: sax_docx_fixes
> Attachments: basicresume.docx, Forefront Fax.dotx
>
>
> On the attached document template, which opens fine in Word, the Tika parser 
> throws the following error:
> java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be 
> cast to org.apache.poi.xwpf.usermodel.XWPFDocument
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
>   at 
> org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (TIKA-1432) some docx files creates exception

2017-06-30 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1432.
---
Resolution: Fixed

This seems to have been fixed at some point.  I'm not able to reproduce this 
with 1.15 in the legacy DOM extractor or with the experimental SAX parser.

> some docx files creates exception
> -
>
> Key: TIKA-1432
> URL: https://issues.apache.org/jira/browse/TIKA-1432
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6
> Environment: Linux Mint 17
> Java 1.7.0_67 64-bits 
>Reporter: Marco Machado
>Priority: Minor
> Attachments: java.docx, ListaQuestoes2014.docx
>
>
> Using some docx files (attached) as input throws an exception. 
> Trace:
> Exception in thread "main" org.apache.tika.exception.TikaException: Error 
> creating OOXML extractor
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:125)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
> Caused by: java.lang.IllegalArgumentException: Value for parameter 'id' was 
> out of bounds
>   at 
> org.apache.poi.util.IdentifierManager.reserve(IdentifierManager.java:80)
>   at org.apache.poi.xwpf.usermodel.XWPFRun.<init>(XWPFRun.java:110)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFParagraph.buildRunsInOrderFromXml(XWPFParagraph.java:126)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFParagraph.<init>(XWPFParagraph.java:79)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:146)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:116)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:53)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:180)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:87)
>   ... 7 more





[jira] [Updated] (TIKA-2239) Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser

2017-06-30 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2239:
--
Labels: sax_docx_fixes  (was: )

> Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser
> ---
>
> Key: TIKA-2239
> URL: https://issues.apache.org/jira/browse/TIKA-2239
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>Reporter: Jorge Spinsanti
>  Labels: sax_docx_fixes
> Attachments: tika2239.docx
>
>
> I got an exception when extracting text from a DOCX file, caused by a 
> SAXParseException in Apache POI. See the stack trace:
> {code}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@51a94303
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1114)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1050)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:199)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.eclipse.jetty.server.Server.handle(Server.java:462)
>   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:281)
>   at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:232)
>   at 
> org.eclipse.jetty.io.AbstractConnection$1.run(AbstractConnection.java:505)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:607)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:536)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
> IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@51a94303
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.io.IOException: Unable to parse xml bean
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:118)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:87)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:204)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 22 more
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 37; 
> The encoding declaration is required in the text declaration.
>   at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
>   at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.scanXMLDeclOrTextDecl(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>   at 
> org.apache.poi.util.DocumentHelpe

[jira] [Commented] (TIKA-2404) XMLException in DOCX->TXT conversion

2017-06-30 Thread Jorge Spinsanti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070440#comment-16070440
 ] 

Jorge Spinsanti commented on TIKA-2404:
---

Yes, you are right again. We are applying your suggestion and moving to the 
SAX-based docx parser.

> XMLException in DOCX->TXT conversion
> 
>
> Key: TIKA-2404
> URL: https://issues.apache.org/jira/browse/TIKA-2404
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Jorge Spinsanti
>  Labels: sax_docx_fixes
> Attachments: XmlException.docx
>
>
> I got an XMLException when trying to extract text from a DOCX file (see 
> attached file):
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: org.apache.poi.POIXMLException: org.apache.xmlbeans.XmlException: 
> Element hdr@http://schemas.openxmlformats.org/wordprocessingml/2006/main is 
> not a valid ftr@http://schemas.openxmlformats.org/wordprocessingml/2006/main 
> document or a valid substitution.
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:121)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:196)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> Caused by: org.apache.xmlbeans.XmlException: Element 
> hdr@http://schemas.openxmlformats.org/wordprocessingml/2006/main is not a 
> valid ftr@http://schemas.openxmlformats.org/wordprocessingml/2006/main 
> document or a valid substitution.
>   at 
> org.apache.xmlbeans.impl.store.Locale.autoTypeDocument(Locale.java:322)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1384)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1363)
>   at 
> org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:370)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:144)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.FtrDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:96)
> {code}
> If I use OpenOffice, the text can be extracted successfully.





[jira] [Commented] (TIKA-2404) XMLException in DOCX->TXT conversion

2017-06-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070446#comment-16070446
 ] 

Tim Allison commented on TIKA-2404:
---

https://issues.apache.org/jira/browse/TIKA-2408?jql=labels%20%3D%20sax_docx_fixes

:)

Seriously, though, if you want to evaluate with tika-eval before making the 
switch, I'm happy to help.



[jira] [Comment Edited] (TIKA-2404) XMLException in DOCX->TXT conversion

2017-06-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070446#comment-16070446
 ] 

Tim Allison edited comment on TIKA-2404 at 6/30/17 5:23 PM:


https://issues.apache.org/jira/browse/TIKA-2408?jql=labels%20%3D%20sax_docx_fixes

https://issues.apache.org/jira/browse/TIKA-2408?jql=labels%20%3D%20sax_pptx_fixes

:)

I'm pretty sure there are other pptx fixes, but I haven't found them yet.

Seriously, though, if you want to evaluate with tika-eval before making the 
switch, I'm happy to help.


was (Author: talli...@mitre.org):
https://issues.apache.org/jira/browse/TIKA-2408?jql=labels%20%3D%20sax_docx_fixes

:)

Seriously, though, if you want to evaluate with tika-eval before making the 
switch, I'm happy to help.



[jira] [Commented] (TIKA-2406) IllegalArgumentException in text extraction from PDF file

2017-06-30 Thread Jorge Spinsanti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070455#comment-16070455
 ] 

Jorge Spinsanti commented on TIKA-2406:
---

IMHO, bad inputs (corrupt files) should be reported with something more 
specific than a plain TikaException: maybe a subclass of TikaException that 
carries the current message (e.g. CorruptFileException).

An app that consumes your service could then catch CorruptFileException and 
follow a different flow than it does for a generic TikaException.

Does that make sense?
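
To make the proposal concrete, here is a minimal sketch of the idea. Everything 
in it is an assumption for illustration: the nested TikaException is only a 
stand-in for org.apache.tika.exception.TikaException so the example compiles on 
its own, CorruptFileException does not exist in Tika, and extract() is a 
hypothetical parse step.

```java
class CorruptFileSketch {
    // Stand-in for org.apache.tika.exception.TikaException (assumption:
    // defined locally only so this sketch is self-contained).
    static class TikaException extends Exception {
        TikaException(String msg, Throwable cause) {
            super(msg, cause);
        }
    }

    // Hypothetical subclass proposed in this comment; not part of Tika.
    static class CorruptFileException extends TikaException {
        CorruptFileException(String msg, Throwable cause) {
            super(msg, cause);
        }
    }

    // Hypothetical parse step: fails with the specific subclass on bad input.
    static String extract(boolean corrupt) throws TikaException {
        if (corrupt) {
            throw new CorruptFileException("root cannot be null",
                    new IllegalArgumentException("root cannot be null"));
        }
        return "text";
    }

    // Caller: the more specific catch must come first, otherwise the
    // generic TikaException handler would swallow it.
    public static String handle(boolean corrupt) {
        try {
            return extract(corrupt);
        } catch (CorruptFileException e) {
            return "SKIPPED_CORRUPT";
        } catch (TikaException e) {
            return "FAILED";
        }
    }

    public static void main(String[] args) {
        System.out.println(handle(false));
        System.out.println(handle(true));
    }
}
```

Because the subclass still extends TikaException, existing callers that only 
catch TikaException would keep working unchanged.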

> IllegalArgumentException in text extraction from PDF file
> -
>
> Key: TIKA-2406
> URL: https://issues.apache.org/jira/browse/TIKA-2406
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Jorge Spinsanti
> Attachments: IllegalArgumentException.pdf
>
>
> I got an IllegalArgumentException in text extraction from PDF file (attached):
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from org.apache.tika.parser.pdf.PDFParser@d71dc5e
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.lang.IllegalArgumentException: root cannot be null
>   at org.apache.pdfbox.pdmodel.PDPageTree.<init>(PDPageTree.java:75)
>   at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:129)
>   at 
> org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:1381)
>   at 
> org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:235)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> {code}





[jira] [Commented] (TIKA-2407) Tika crashed while parsing corrupt PDF

2017-06-30 Thread Jorge Spinsanti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070465#comment-16070465
 ] 

Jorge Spinsanti commented on TIKA-2407:
---

https://issues.apache.org/jira/browse/PDFBOX-3849?focusedCommentId=16070356&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16070356

> Tika crashed while parsing corrupt PDF
> --
>
> Key: TIKA-2407
> URL: https://issues.apache.org/jira/browse/TIKA-2407
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Jorge Spinsanti
> Attachments: IOException.pdf
>
>
> Tika throws an exception when trying to parse a corrupt PDF file to extract 
> its text content (see attached file):
> {code}
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
> IOException from org.apache.tika.parser.pdf.PDFParser@d71dc5e
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.io.IOException: Error reading stream, expected='endstream' 
> actual='' at offset 116070
>   at 
> org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1013)
>   at 
> org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:781)
>   at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:742)
>   at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:673)
>   at 
> org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:633)
>   at 
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:241)
>   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:276)
>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1132)
>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1066)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:141)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> {code}
> Can you throw a specific exception to allow better error handling? Something 
> like BadInputException or WrongFileException?
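
Until something like that exists, a caller can approximate it by inspecting the 
cause chain of the exception it receives. The sketch below uses only the JDK; 
the class name, both method names, and the rule "an IOException root cause 
means bad input" are assumptions made for illustration, not Tika behaviour.

```java
import java.io.IOException;

class CauseChainSketch {
    // Walks getCause() until the deepest throwable in the chain is reached.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // The second check guards against a throwable that is its own cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    // Hypothetical heuristic: treat an IOException at the root as bad input.
    public static boolean looksLikeBadInput(Throwable t) {
        return rootCause(t) instanceof IOException;
    }

    public static void main(String[] args) {
        // Mirrors the shape of the trace above: a wrapper around an IOException.
        Exception wrapped = new Exception(
                "TIKA-198: Illegal IOException from PDFParser",
                new IOException("Error reading stream, expected='endstream'"));
        System.out.println(looksLikeBadInput(wrapped));
        System.out.println(looksLikeBadInput(
                new Exception("other", new IllegalStateException("bug"))));
    }
}
```

This is only a heuristic: an IOException can also come from a genuine I/O 
failure (disk, network), which is one reason a dedicated exception type would 
be clearer.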





[jira] [Commented] (TIKA-2405) SAXParseException in text extraction from DOCX file

2017-06-30 Thread Jorge Spinsanti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070522#comment-16070522
 ] 

Jorge Spinsanti commented on TIKA-2405:
---

Sure, we can share more details about our use of Tika :D

> SAXParseException in text extraction from DOCX file
> ---
>
> Key: TIKA-2405
> URL: https://issues.apache.org/jira/browse/TIKA-2405
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.15
>Reporter: Jorge Spinsanti
>  Labels: sax_docx_fixes
> Attachments: SAXParseException.docx
>
>
> I got SAXParseException in text extraction from DOCX file (see attachment):
> {code}
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
> IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 16 more
> Caused by: java.io.IOException: Unable to parse xml bean
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:146)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 23 more
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 37; 
> The encoding declaration is required in the text declaration.
>   at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
>   at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.scanXMLDeclOrTextDecl(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(Unknown
>  Source)
>   at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>   at 
> org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
>   at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
>   ... 33 more
> {code}
> Text extraction using OpenOffice is successful.





[jira] [Created] (TIKA-2409) Tar has different mime type by name vs contents

2017-06-30 Thread Collin Peters (JIRA)
Collin Peters created TIKA-2409:
---

 Summary: Tar has different mime type by name vs contents
 Key: TIKA-2409
 URL: https://issues.apache.org/jira/browse/TIKA-2409
 Project: Tika
  Issue Type: Bug
  Components: mime
Reporter: Collin Peters


[TestMimeTypes.java#L360|https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java#L360]
 has the following:

{code}
assertTypeByName("application/x-tar",  "test.tar");
assertTypeByData("application/x-gtar",  "test-documents.tar"); // GNU TAR
{code}

The {{tar}} extension is detected as {{application/x-tar}} by name, but as 
{{application/x-gtar}} by contents. These two results don't seem to match up.
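
For context, here is a rough sketch of how name-based and content-based 
detection can disagree for the same file. It is not Tika's detector: the class 
and method names are made up, and the magic values are an assumption based on 
the commonly documented tar header layout (POSIX ustar writes "ustar\0" plus 
version "00" at offset 257, GNU tar writes "ustar" followed by two spaces and a 
NUL).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

class TarMagicSketch {
    static final int MAGIC_OFFSET = 257;
    // POSIX ustar: magic "ustar\0" followed by version "00".
    static final byte[] POSIX_MAGIC =
            ("ustar\0" + "00").getBytes(StandardCharsets.US_ASCII);
    // GNU tar: "ustar", two spaces, then NUL.
    static final byte[] GNU_MAGIC =
            "ustar  \0".getBytes(StandardCharsets.US_ASCII);

    // Name-based guess: the extension alone cannot tell the variants apart.
    public static String typeByName(String name) {
        return name.toLowerCase(Locale.ROOT).endsWith(".tar")
                ? "application/x-tar" : "application/octet-stream";
    }

    // Content-based guess: compare the 8 bytes at offset 257.
    public static String typeByData(byte[] header) {
        if (header.length < MAGIC_OFFSET + 8) {
            return "application/octet-stream";
        }
        byte[] magic = Arrays.copyOfRange(header, MAGIC_OFFSET, MAGIC_OFFSET + 8);
        if (Arrays.equals(magic, GNU_MAGIC)) {
            return "application/x-gtar";
        }
        if (Arrays.equals(magic, POSIX_MAGIC)) {
            return "application/x-tar";
        }
        return "application/octet-stream";
    }

    // Builds an otherwise empty 512-byte header block with the given magic.
    public static byte[] headerWith(byte[] magic) {
        byte[] header = new byte[512];
        System.arraycopy(magic, 0, header, MAGIC_OFFSET, magic.length);
        return header;
    }

    public static void main(String[] args) {
        System.out.println(typeByName("test.tar"));
        System.out.println(typeByData(headerWith(GNU_MAGIC)));
        System.out.println(typeByData(headerWith(POSIX_MAGIC)));
    }
}
```

Under these assumptions a GNU-format archive named test.tar gets a different 
type from each method, which matches the two assertions quoted above.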


