[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870573#comment-13870573 ]

Hong-Thai Nguyen commented on TIKA-1215:

Great catch. Thanks, [~jukkaz].

Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
Key: TIKA-1215
URL: https://issues.apache.org/jira/browse/TIKA-1215
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Critical
Attachments: Centres 080805@0650 RTBF Matin Première - A propos des rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch, tika-1215-without-wildcard.patch

With the attached file, 1.5 raises this exception on parsing. The same file parses without problems on 1.4.

{code}
...
Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
    at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
    at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
    ... 15 more
{code}

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
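For readers unfamiliar with this failure mode, the sketch below may help. It is not Tika code and not the attached patch: NamespaceCheckingHandler is a hypothetical stand-in, built only on the JDK's SAX classes, for a serializer that needs a prefix for every namespace it is asked to write. It assumes the error arises when a startElement event for the XHTML namespace arrives without a preceding startPrefixMapping event, which is what the exception message suggests.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a serializer that needs a prefix for every namespace. */
public class NamespaceCheckingHandler extends DefaultHandler {
    // Namespace URI -> prefix, filled in by startPrefixMapping events.
    private final Map<String, String> prefixes = new HashMap<String, String>();

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        prefixes.put(uri, prefix);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Reject elements whose namespace was never declared upstream.
        if (uri != null && !uri.isEmpty() && !prefixes.containsKey(uri)) {
            throw new SAXException("Namespace " + uri + " not declared");
        }
    }

    /** Returns true if the element is accepted, false if the namespace check fails. */
    public static boolean accepts(boolean declareFirst) {
        String xhtml = "http://www.w3.org/1999/xhtml";
        NamespaceCheckingHandler handler = new NamespaceCheckingHandler();
        try {
            if (declareFirst) {
                handler.startPrefixMapping("", xhtml);
            }
            handler.startElement(xhtml, "p", "p", new AttributesImpl());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("declared first: " + accepts(true));
        System.out.println("not declared:   " + accepts(false));
    }
}
```

If this reading is right, any decorator sitting between the parser and the serializer has to forward prefix-mapping events as well as element events, otherwise the downstream handler sees elements in a namespace it was never told about.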
[jira] [Resolved] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save
[ https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-1078.
Resolution: Fixed

Thanks Stefano, I made one small change (added generics: HashSet<Character>) and committed.

TikaCLI: invalid characters in embedded document name causes FNFE when trying to save
Key: TIKA-1078
URL: https://issues.apache.org/jira/browse/TIKA-1078
Project: Tika
Issue Type: Bug
Components: cli, parser
Reporter: Michael McCandless
Fix For: 1.5
Attachments: T-DS_Excel2003-PPT2003_1.xls, tika-1078-2.patch, tika-1078.patch

Attached document hits this on Windows:

{noformat}
C:\java.exe -jar tika-app-1.3.jar -z -x c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
Extracting 'file0.png' (image/png) to .\file0.png
Extracting 'file1.emf' (application/x-emf) to .\file1.emf
Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
Extracting 'file3.emf' (application/x-emf) to .\file3.emf
Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to .\MBD0016BDE4\?£☺.bin
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@75f875f8
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The filename, directory name, or volume label syntax is incorrect.)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
    at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
{noformat}

TikaCLI manages to create the sub-directory, but because the embedded fileName has invalid (for Windows) characters, it fails. On Linux it runs fine.

I think somehow ... we have to sanitize the embedded file name ...
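As a rough illustration of the sanitizing idea raised in this issue, here is a minimal sketch. It is not the committed patch: the class name EmbeddedNameSanitizer, the chosen character list, and the underscore replacement are all assumptions made for the example. The only detail borrowed from the thread is the use of a set of characters, echoing the generics remark in the resolution comment.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper; the character list and replacement policy are assumptions. */
public class EmbeddedNameSanitizer {
    // Characters that Windows rejects inside a single file name component.
    // Path separators are deliberately left alone so sub-directories still work.
    private static final Set<Character> INVALID =
            new HashSet<Character>(Arrays.asList('<', '>', ':', '"', '|', '?', '*'));

    /** Replaces reserved and control characters in one name component with '_'. */
    public static String sanitize(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            out.append(c < 0x20 || INVALID.contains(c) ? '_' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("report?draft*.bin"));
    }
}
```

Replacing rather than dropping characters keeps the name length stable, but two different embedded names can then collide after sanitizing; a real fix would likely need to handle that case as well.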
[jira] [Updated] (TIKA-1219) Add .svn to .gitignore
[ https://issues.apache.org/jira/browse/TIKA-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated TIKA-1219:
Attachment: TIKA-1219.patch

Patch for trunk. I've generated this with --no-prefix so I hope it is OK to apply to trunk codebase. Please say if it is not and I will re-generate. Thank you

Add .svn to .gitignore
Key: TIKA-1219
URL: https://issues.apache.org/jira/browse/TIKA-1219
Project: Tika
Issue Type: Improvement
Components: general
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
Fix For: 1.5
Attachments: TIKA-1219.patch

This is for folks who may be working on TIKA issues on their own Git branches. It is an extremely trivial change.
[jira] [Created] (TIKA-1219) Add .svn to .gitignore
Lewis John McGibbney created TIKA-1219:

Summary: Add .svn to .gitignore
Key: TIKA-1219
URL: https://issues.apache.org/jira/browse/TIKA-1219
Project: Tika
Issue Type: Improvement
Components: general
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
Fix For: 1.5

This is for folks who may be working on TIKA issues on their own Git branches. It is an extremely trivial change.
[jira] [Commented] (TIKA-1219) Add .svn to .gitignore
[ https://issues.apache.org/jira/browse/TIKA-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870826#comment-13870826 ]

Jukka Zitting commented on TIKA-1219:

Why would you have a {{.svn}} directory if you're using a Git clone?
[jira] [Commented] (TIKA-1219) Add .svn to .gitignore
[ https://issues.apache.org/jira/browse/TIKA-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870836#comment-13870836 ]

Lewis John McGibbney commented on TIKA-1219:

I am always working with Tika trunk from svn, however I'll be maintaining my TIKA-1208 branch and working on it when I can. When I do 'git status' to see changes, I get every .svn directory as an untracked change to the codebase. If you see a better way to do this [~jukkaz], please tell me :)

Thank you
[jira] [Comment Edited] (TIKA-1219) Add .svn to .gitignore
[ https://issues.apache.org/jira/browse/TIKA-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870836#comment-13870836 ]

Lewis John McGibbney edited comment on TIKA-1219 at 1/14/14 3:58 PM:

I am always working with Tika trunk from svn, however I'll be maintaining my TIKA-1208 branch and working on it when I can. So basically, after checking out the code from svn, I effectively did a 'git init', so the codebase is both svn and git compatible. When I do 'git status' to see changes in my branch, I get every .svn directory as an untracked change to the codebase. If you see a better way to do this [~jukkaz], please tell me :)

Thank you

was (Author: lewismc):

I am always working with Tika trunk from svn, however I'll be maintaining my TIKA-1208 branch and working on it when I can. When I do 'git status' to see changes, I get every .svn directory as an untracked change to the codebase. If you see a better way to do this [~jukkaz], please tell me :)

Thank you
[jira] [Commented] (TIKA-1219) Add .svn to .gitignore
[ https://issues.apache.org/jira/browse/TIKA-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870844#comment-13870844 ]

Jukka Zitting commented on TIKA-1219:

There actually is a better way. :-) You can clone Tika from https://github.com/apache/tika and use it to keep track of the latest trunk, as the GitHub mirror is automatically kept up to date. See also http://git.apache.org/.
[jira] [Commented] (TIKA-1219) Add .svn to .gitignore
[ https://issues.apache.org/jira/browse/TIKA-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870846#comment-13870846 ]

Lewis John McGibbney commented on TIKA-1219:

OK doke. Feel free to close this one off [~jukkaz].

Thanks
Lewis
[jira] [Resolved] (TIKA-1219) Add .svn to .gitignore
[ https://issues.apache.org/jira/browse/TIKA-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-1219.
Resolution: Not A Problem
Fix Version/s: (was: 1.5)
Re: [DISCUSS] Prepare Release 1.5?
Hi Dave,

I am fairly new to the community, but I'll provide my feedback anyway :)

Currently, Tika 1.4 has a serious bug that makes it hang on partial mp3 files, so it can be quite bad in production. Tika 1.5 fixes it, but I do understand that TIKA-1198 is a bad regression, so it is a blocker for me too. I am not familiar with WS, so I do not know how much work it would be to fix. However, if no one commits to fixing it, is rolling back an option? We could roll back the CXF fix and then be ready to release.

Thoughts?

Ste

On Thu, Jan 9, 2014 at 12:45 PM, Chris Mattmann mattm...@apache.org wrote:

Hey Dave,

I kind of got bogged down and haven't had time to release. If someone else does have time and wants to pick this up, +1 for it!

Cheers,
Chris

-Original Message-
From: David Meikle loo...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Thursday, January 9, 2014 3:46 AM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Re: [DISCUSS] Prepare Release 1.5?

Hi,

On 29 Dec 2013, at 11:41, David Meikle loo...@gmail.com wrote:

Hi Guys,

There have been some questions popping up around when a new 1.5 release will be available. I have some free cycles over the next couple of weeks to prepare one, and I believe Chris has some too. In preparation for that, what do we need to do to make the current trunk releasable as version 1.5?

For me the following issue needs to be fixed before release:

TIKA-1198 - the change to using multi-parts appears to have broken our current guidance on usage significantly.

Is there anything else others think is a must before rolling a release?

I was also thinking we could do some quick work to include the following issues:

TIKA-1059
TIKA-985, TIKA-980

I don't want to hold things up, so if we sort people's mandatories I think we should roll a release.

@Chris - I know you had free cycles and volunteered, so I will defer to you on the release management side of things. That said, I am happy to take it on if that helps.

Cheers,
Dave

Conscious it was the festive period of late, so wondering if anyone has had further thoughts on this?

Cheers,
Dave
[jira] [Created] (TIKA-1220) Parser implementation for IFC files
Lewis John McGibbney created TIKA-1220:

Summary: Parser implementation for IFC files
Key: TIKA-1220
URL: https://issues.apache.org/jira/browse/TIKA-1220
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Lewis John McGibbney
Priority: Minor
Fix For: 1.5

The Industry Foundation Classes (IFC) [0] data model is intended to describe building and construction industry data. For the sake of argument, it can be considered a more intelligent successor to the .dwg data models used within CAD models.

I've tracked down a potential 3rd party library [1] which we may be able to wrap and use within Tika. However, the provided software packages are licensed under http://creativecommons.org/licenses/by-nc-sa/3.0/de/, so I am currently over on legal-discuss@ in an attempt to see if it is possible to wrap some code and contribute it to tika-parsers.

When I get feedback from legal-discuss, and if this is a go-ahead, I'll need to help the developers package the code as Maven artifact(s), then I will progress with writing the implementation.

[0] http://en.wikipedia.org/wiki/Industry_Foundation_Classes
[1] http://www.ifctoolsproject.com/
[jira] [Updated] (TIKA-1220) Parser implementation for IFC files
[ https://issues.apache.org/jira/browse/TIKA-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated TIKA-1220:
Attachment: 2012-03-23-Duplex-Programming.ifc

Sample .ifc data model
[jira] [Commented] (TIKA-1220) Parser implementation for IFC files
[ https://issues.apache.org/jira/browse/TIKA-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13871362#comment-13871362 ]

Nick Burch commented on TIKA-1220:

My hunch is that a -nc- license won't be deemed Apache License v2 compatible. However, it's also a rather odd choice for a software license, so it might be worth contacting the authors of the library to see if they might be willing to re-license or dual license.

If the license isn't compatible, a parser can still be written and listed on https://wiki.apache.org/tika/3rd%20party%20parser%20plugins - end users can then make their own decision on whether they can abide by the more restrictive licenses listed there, and include the relevant jars if they wish to. (The auto detect service loader system is used to auto-load these extra parsers and detectors if they're found at runtime.)
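The service loader remark can be illustrated with the JDK's own java.util.ServiceLoader. The sketch below is generic and does not use Tika's classes: PluginDiscovery and DocumentParserPlugin are hypothetical names, and the assumption is that a separately licensed jar would register its implementation through a META-INF/services provider file and be picked up when that jar is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical example of runtime plugin discovery with the JDK ServiceLoader. */
public class PluginDiscovery {
    /** Hypothetical plugin interface standing in for a parser SPI. */
    public interface DocumentParserPlugin {
        String formatName();
    }

    /**
     * Lists the formats offered by every provider found on the classpath.
     * A provider jar registers itself with a file named
     * META-INF/services/PluginDiscovery$DocumentParserPlugin
     * containing the implementation's fully qualified class name.
     */
    public static List<String> discoverFormats() {
        List<String> names = new ArrayList<String>();
        for (DocumentParserPlugin plugin : ServiceLoader.load(DocumentParserPlugin.class)) {
            names.add(plugin.formatName());
        }
        return names;
    }

    public static void main(String[] args) {
        // With no provider jars present, the list is simply empty.
        System.out.println(discoverFormats());
    }
}
```

The appeal of this arrangement for license-restricted code is that the core distribution never references the plugin at compile time; whether a given parser is available depends only on which jars the end user chooses to add.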
[jira] [Commented] (TIKA-1220) Parser implementation for IFC files
[ https://issues.apache.org/jira/browse/TIKA-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13871365#comment-13871365 ]

Lewis John McGibbney commented on TIKA-1220:

Hi [~gagravarr],

bq. However, it's also a rather odd choice for a software license, so it might be worth you contacting the authors of the library to see if they might be willing to re-license or dual license

I'm in the process of doing this right now. I'm also looking for 3rd party parsers for .ifc models, preferably licensed under ASLv2.0; there are a few out there, so we may be in luck. Thanks for the wiki link anyway, I didn't realize that it existed.