[jira] [Commented] (TIKA-1126) text/html producer for tika-server

2013-05-27 Thread Ali Mosavian (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667707#comment-13667707
 ] 

Ali Mosavian commented on TIKA-1126:


Thanks Dave!

 text/html producer for tika-server
 --

 Key: TIKA-1126
 URL: https://issues.apache.org/jira/browse/TIKA-1126
 Project: Tika
  Issue Type: Improvement
  Components: server
Affects Versions: 1.4
Reporter: Ali Mosavian
Priority: Trivial
 Fix For: 1.4

 Attachments: tika_server_html_output.patch


 The /tika resource handler of tika-server can only produce text/plain. This 
 patch adds support for producing text/html.
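
 For illustration, a client might ask for the HTML variant as sketched below. This is only a sketch under assumptions: the issue text does not say how the output type is selected, so the use of an Accept header, the PUT method, and the class and method names are all assumptions made for this example. The request is built but not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical client-side helper; the class and method names are illustrative only.
final class TikaHtmlRequest {
    // Builds (but does not send) a PUT to the /tika resource asking for text/html.
    static HttpRequest build(URI tikaEndpoint, byte[] document) {
        return HttpRequest.newBuilder(tikaEndpoint)
                .header("Accept", "text/html") // assumed way of selecting the producer
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }
}
```

 Sending the request would then be a matter of passing it to HttpClient.newHttpClient().send(...) with a string body handler; the endpoint URI depends on where the server is running.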

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


tika pull request: Similar to TIKA-1126, this commit adds the ability to pr...

2013-05-27 Thread stdexcept
GitHub user stdexcept opened a pull request:

https://github.com/apache/tika/pull/3

Similar to TIKA-1126, this commit adds the ability to produce text/xml t...

...o tika-server.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stdexcept/tika trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/3.patch


commit 59db45256f9c2c791894e0b9d11e7c6d4f7ce78d
Author: alim ali.mosav...@euroling.se
Date:   2013-05-27T14:22:49Z

Similar to TIKA-1126, this commit adds the ability to produce text/xml to 
tika-server.





[jira] [Updated] (TIKA-1046) Get java.util.zip.ZipException: unknown compression method when indexing ppt97 file containing wmf-image

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1046:


Component/s: parser

 Get java.util.zip.ZipException: unknown compression method when indexing 
 ppt97 file containing wmf-image
 --

 Key: TIKA-1046
 URL: https://issues.apache.org/jira/browse/TIKA-1046
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Olof Jonasson
 Attachments: ppt2000_working.ppt, ppt2010_working.ppt, 
 ppt97_failing.ppt


 With Solr 4.0 and Tika 1.2 we get an exception when trying to index a PowerPoint 
 file that contains a specific .wmf image.
 The PowerPoint file apparently has to be created in Office 97 (or older?) to 
 trigger the error, since re-saving the file in Office 2000 or Office 2010 
 makes the problem go away.
 Full stack trace from the Solr server below:
 2012-dec-19 14:39:46 org.apache.solr.common.SolrException log
 ALLVARLIG: org.apache.solr.common.SolrException: 
 org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
 org.apache.tika.parser.microsoft.OfficeParser@12f195
   at 
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
   at 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
   at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
   at 
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
   at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
   at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
   at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
   at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
   at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
   at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
   at 
 org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:563)
   at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
   at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
   at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
   at 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
   at 
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
   at 
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
   at 
 org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
   at java.lang.Thread.run(Thread.java:662)
 Caused by: org.apache.tika.exception.TikaException: Unexpected 
 RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@12f195
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at 
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
   ... 19 more
 Caused by: org.apache.poi.hslf.exceptions.HSLFException: 
 java.util.zip.ZipException: unknown compression method
   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:65)
   at 
 org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:204)
   at 
 org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:162)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:189)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   ... 23 more
 Caused by: java.util.zip.ZipException: unknown compression method
   at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:147)
   at java.io.FilterInputStream.read(FilterInputStream.java:90)
   at org.apache.poi.hslf.blip.WMF.getData(WMF.java:59)
   ... 28 more
 2012-dec-19 14:39:46 org.apache.solr.common.SolrException log
 ALLVARLIG: null:org.apache.solr.common.SolrException: 
 org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
 

[jira] [Updated] (TIKA-1079) Word document hits AIOOBE in SummaryExtractor.parseSummaries

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1079:


Component/s: parser

 Word document hits AIOOBE in SummaryExtractor.parseSummaries
 

 Key: TIKA-1079
 URL: https://issues.apache.org/jira/browse/TIKA-1079
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.4

 Attachments: guide_to_daips_(id_3152_ver_1.0.0).doc


 I'm not yet sure if this is a corrupted document (though MS Word opens it 
 just fine) or a bug in POI ... but I hit this exception when running it through 
 TikaCLI:
 {noformat}
 java.lang.ArrayIndexOutOfBoundsException: -1
   at org.apache.poi.hpsf.CodePageString.<init>(CodePageString.java:161)
   at 
 org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:158)
   at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:163)
   at org.apache.poi.hpsf.Property.<init>(Property.java:164)
   at org.apache.poi.hpsf.Section.<init>(Section.java:277)
   at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:451)
   at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:246)
   at 
 org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:78)
   at 
 org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:69)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
 {noformat}



[jira] [Updated] (TIKA-1067) Tika extracts non-existent asterisks (*) from .ppt files

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1067:


Component/s: parser

 Tika extracts non-existent asterisks (*) from .ppt files
 

 Key: TIKA-1067
 URL: https://issues.apache.org/jira/browse/TIKA-1067
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless

 I created a new blank presentation, put in title + subtitle, saved it as 
 .ppt, and then ran TikaCLI -t:
 {noformat}
 <body><div class="slideShow"><div class="slide"><p 
 class="slide-master-content">*<br/>
 *<br/>
 </p>
 <p class="slide-content">Testing<br/>
 testing<br/>
 </p>
 </div>
 </div>
 <div class="slideNotes"/>
 {noformat}
 The two extra *'s seem to be coming from the master slide, but I'm not sure 
 which text runs they are and how to stop them ...



[jira] [Updated] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1078:


Component/s: parser
 cli

 TikaCLI: invalid characters in embedded document name causes FNFE when trying 
 to save
 -

 Key: TIKA-1078
 URL: https://issues.apache.org/jira/browse/TIKA-1078
 Project: Tika
  Issue Type: Bug
  Components: cli, parser
Reporter: Michael McCandless
 Fix For: 1.4

 Attachments: T-DS_Excel2003-PPT2003_1.xls


 Attached document hits this on Windows:
 {noformat}
 C:\java.exe -jar tika-app-1.3.jar -z -x 
 c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
 Extracting 'file0.png' (image/png) to .\file0.png
 Extracting 'file1.emf' (application/x-emf) to .\file1.emf
 Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
 Extracting 'file3.emf' (application/x-emf) to .\file3.emf
 Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
 Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to 
 .\MBD0016BDE4\?£☺.bin
 Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: 
 Illegal IOException from 
 org.apache.tika.parser.microsoft.OfficeParser@75f875f8
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
 at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
 at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
 Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The 
 filename, directory name, or volume label syntax is incorrect.)
 at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
 at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
 at 
 org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
 at 
 org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
 at 
 org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
 at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
 at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 ... 5 more
 {noformat}
 TikaCLI manages to create the sub-directory, but because the embedded 
 fileName has invalid (for Windows) characters, it fails.
 On Linux it runs fine.
 I think somehow ... we have to sanitize the embedded file name ...
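
 One possible shape for such sanitizing is sketched below. It is not the fix that went into TikaCLI; the class and method names are hypothetical, and the set of rejected characters is the commonly documented Windows list plus ASCII control characters. Each path segment is cleaned separately so that the sub-directory part of the embedded name is preserved.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Tika.
final class EmbeddedNameSanitizer {
    // Characters Windows rejects in a file name, plus ASCII control characters.
    private static final Pattern ILLEGAL = Pattern.compile("[<>:\"/\\\\|?*\\x00-\\x1F]");

    // Cleans a single path segment; returns the fallback if nothing usable remains.
    static String sanitizeSegment(String segment, String fallback) {
        if (segment == null) {
            return fallback;
        }
        String cleaned = ILLEGAL.matcher(segment).replaceAll("_");
        // Windows also rejects names that end in a dot or a space.
        cleaned = cleaned.replaceAll("[. ]+$", "");
        return cleaned.isEmpty() ? fallback : cleaned;
    }

    // Cleans every '/'-separated segment of an embedded resource name.
    static String sanitizePath(String path, String fallback) {
        StringBuilder out = new StringBuilder();
        for (String segment : path.split("/")) {
            if (out.length() > 0) {
                out.append('/');
            }
            out.append(sanitizeSegment(segment, fallback));
        }
        return out.toString();
    }
}
```

 Replacing rather than dropping the offending characters keeps the name length stable, at the cost of possible collisions between two embedded names that differ only in illegal characters; a caller would still need to handle that case.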



[jira] [Updated] (TIKA-1111) Class loading issues when running in OSGi environment

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1111:


Component/s: packaging

 Class loading issues when running in OSGi environment
 -

 Key: TIKA-1111
 URL: https://issues.apache.org/jira/browse/TIKA-1111
 Project: Tika
  Issue Type: Bug
  Components: packaging
Affects Versions: 1.3
 Environment: Tika 1.3 (tika-core and tika-bundle OSGi bundles)
 Felix 2.0.5
Reporter: Niels Beekman

 When dom4j is on the system classpath, a class loading error occurs during 
 detection of Office Open XML files:
 java.lang.ExceptionInInitializerError
   at 
 org.apache.poi.openxml4j.opc.internal.unmarshallers.PackagePropertiesUnmarshaller.<clinit>(PackagePropertiesUnmarshaller.java:49)
   at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:154)
   at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:141)
   at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:54)
   at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:99)
   at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:207)
   at 
 org.apache.tika.parser.pkg.ZipContainerDetector.detectOfficeOpenXML(ZipContainerDetector.java:194)
   at 
 org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:134)
   at 
 org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:77)
   at 
 org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
   at 
 org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:113)
   at 
 org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:221)
   at java.lang.Thread.run(Thread.java:662)
 Caused by: java.lang.ClassCastException: org.dom4j.DocumentFactory cannot be 
 cast to org.dom4j.DocumentFactory
   at org.dom4j.DocumentFactory.getInstance(DocumentFactory.java:97)
   at org.dom4j.tree.AbstractNode.<clinit>(AbstractNode.java:39)
   ... 14 more
 As a workaround (maybe a solution), I modified the context classloader when 
 running the detection (wrapped the detector and parser). This appears to be 
 the common fix for dom4j, as it uses the context classloader during 
 initialization. Ideally, the detectors and parsers would be running with 
 their original loader (from ServiceLoader) as context class loader.
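
 The workaround described above might look roughly like the following. This is a minimal sketch, not the reporter's actual wrapper: the class and method names are hypothetical, and which loader to pass in (for example the one that loaded the detector or parser class) is left to the caller.

```java
import java.util.function.Supplier;

// Hypothetical wrapper for illustration; not part of Tika.
final class ContextClassLoaderSwap {
    // Runs the task with the given context class loader and always restores the previous one.
    static <T> T callWith(ClassLoader loader, Supplier<T> task) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            return task.get();
        } finally {
            // Restore even if the task throws, so other code on this thread is unaffected.
            current.setContextClassLoader(previous);
        }
    }
}
```

 Restoring the loader in a finally block matters in a container, where the same thread is typically reused for unrelated work afterwards.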



[jira] [Updated] (TIKA-1045) Unsupported AutoCAD drawing version: AC1014

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1045:


Component/s: parser

 Unsupported AutoCAD drawing version: AC1014
 ---

 Key: TIKA-1045
 URL: https://issues.apache.org/jira/browse/TIKA-1045
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
Reporter: Szasz Tamas
 Attachments: autocad_example


 [#|2012-12-19T15:35:24.297+0100|SEVERE|glassfish3.1|org.apache.solr.core.SolrCore|_ThreadID=38;_ThreadName=Thread-1;|org.apache.solr.common.SolrException:
  org.apache.tika.exception.TikaException: Unsupported AutoCAD drawing 
 version: AC1014
 at 
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
 at 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
 at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at 
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:215)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:279)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
 at 
 org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:655)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:595)
 at com.sun.enterprise.web.WebPipeline.invoke(WebPipeline.java:98)
 at 
 com.sun.enterprise.web.PESessionLockingStandardPipeline.invoke(PESessionLockingStandardPipeline.java:91)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:162)
 at 
 org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:326)
 at 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:227)
 at 
 com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:170)
 at 
 com.sun.grizzly.http.ProcessorTask.invokeAdapter(ProcessorTask.java:822)
 at 
 com.sun.grizzly.http.ProcessorTask.doProcess(ProcessorTask.java:719)
 at com.sun.grizzly.http.ProcessorTask.process(ProcessorTask.java:1013)
 at 
 com.sun.grizzly.http.DefaultProtocolFilter.execute(DefaultProtocolFilter.java:225)
 at 
 com.sun.grizzly.DefaultProtocolChain.executeProtocolFilter(DefaultProtocolChain.java:137)
 at 
 com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:104)
 at 
 com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:90)
 at 
 com.sun.grizzly.http.HttpProtocolChain.execute(HttpProtocolChain.java:79)
 at 
 com.sun.grizzly.ProtocolChainContextTask.doCall(ProtocolChainContextTask.java:54)
 at 
 com.sun.grizzly.SelectionKeyContextTask.call(SelectionKeyContextTask.java:59)
 at com.sun.grizzly.ContextTask.run(ContextTask.java:71)
 at 
 com.sun.grizzly.util.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:532)
 at 
 com.sun.grizzly.util.AbstractThreadPool$Worker.run(AbstractThreadPool.java:513)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: org.apache.tika.exception.TikaException: Unsupported AutoCAD 
 drawing version: AC1014
 at org.apache.tika.parser.dwg.DWGParser.parse(DWGParser.java:126)
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at 
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
 ... 32 more



[jira] [Updated] (TIKA-988) We don't extract a placeholder for a Word document embedded in an Excel document

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-988:
---

Component/s: parser

 We don't extract a placeholder for a Word document embedded in an Excel 
 document
 

 Key: TIKA-988
 URL: https://issues.apache.org/jira/browse/TIKA-988
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.4

 Attachments: bug31373.xls


 In TIKA-956 we fixed the Word parser so that at the point where an embedded 
 document appears, we output a <div class="embedded" id="_XXX"/> tag.
 It would be nice to do this for documents embedded in Excel too.



[jira] [Updated] (TIKA-1102) Can we add <div> to the list of heuristics for bad html fragments?

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1102:


Component/s: parser

 Can we add <div> to the list of heuristics for bad html fragments?
 --

 Key: TIKA-1102
 URL: https://issues.apache.org/jira/browse/TIKA-1102
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.2, 1.3
 Environment: I'm using Solr 4.0 final with tika v1.2 and ManifoldCF 
 v1.2dev all on tomcat 7.0.37
Reporter: David Morana

 Good morning,
 Crawling legacy sites with poorly written HTML fragments causes severe Solr 
 XML parse errors, which in turn cause ManifoldCF to abort.
 Can we add <div> to the list of heuristics so that the HTML parser is used 
 instead of the XML parser?
 See this ticket for further information: 
 [TIKA-1101|https://issues.apache.org/jira/browse/TIKA-1101]
 Thank you,



[jira] [Updated] (TIKA-1108) Represent individual slides in pptx

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1108:


Component/s: parser

 Represent individual slides in pptx
 ---

 Key: TIKA-1108
 URL: https://issues.apache.org/jira/browse/TIKA-1108
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Daniel Bonniot de Ruisselet
 Fix For: 1.4


 When parsing ppt, Tika produces for each slide:
 <div class="slide">
 However, for pptx these seem to be missing; all the text is directly under 
 <body>.



[jira] [Updated] (TIKA-1107) Can't parse velocity file

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1107:


Component/s: parser

 Can't parse velocity file
 -

 Key: TIKA-1107
 URL: https://issues.apache.org/jira/browse/TIKA-1107
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: openjdk-1.7.0_17
Reporter: Jorge Urdaneta
 Attachments: events-detail.vtl


 When I parse some VTL (Velocity) files I get an error:
 2013-04-15 22:39:56,488 ERROR com.dotcms.tika.TikaUtils - Could not parse 
 file metadata for file : 
 /home/jorgeu/dotcms/dotcms/tomcat/webapps/../../dotCMS/assets/5/a/5a533adc-818f-4f55-a448-622bb90b576c/fileAsset/events-detail.vtl
 org.apache.tika.exception.TikaException: XML parse error
   at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at com.dotcms.tika.TikaUtils.getMetaDataMap(TikaUtils.java:41)
   at com.dotcms.tika.TikaUtils.getMetaDataMap(TikaUtils.java:85)
   at 
 com.dotmarketing.portlets.fileassets.business.FileAssetAPIImpl.getMetaDataMap(FileAssetAPIImpl.java:202)
   at 
 com.dotcms.content.elasticsearch.business.ESContentletAPIImpl.checkin(ESContentletAPIImpl.java:2409)
   at 
 com.dotcms.content.elasticsearch.business.ESContentletAPIImpl.checkin(ESContentletAPIImpl.java:1946)
   at 
 com.dotmarketing.portlets.contentlet.business.ContentletAPIInterceptor.checkin(ContentletAPIInterceptor.java:169)
   at 
 com.dotmarketing.portlets.contentlet.business.web.ContentletWebAPIImpl._saveWebAsset(ContentletWebAPIImpl.java:495)
   at 
 com.dotmarketing.portlets.contentlet.business.web.ContentletWebAPIImpl.saveContent(ContentletWebAPIImpl.java:129)
   at 
 com.dotmarketing.portlets.contentlet.ajax.ContentletAjax.saveContent(ContentletAjax.java:1321)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:601)
   at 
 org.directwebremoting.impl.CreatorModule$1.doFilter(CreatorModule.java:229)
   at 
 org.directwebremoting.impl.CreatorModule.executeMethod(CreatorModule.java:241)
   at 
 org.directwebremoting.impl.DefaultRemoter.execute(DefaultRemoter.java:379)
   at 
 org.directwebremoting.impl.DefaultRemoter.execute(DefaultRemoter.java:332)
   at 
 org.directwebremoting.dwrp.BaseCallHandler.handle(BaseCallHandler.java:104)
   at 
 org.directwebremoting.servlet.UrlProcessor.handle(UrlProcessor.java:120)
   at org.directwebremoting.servlet.DwrServlet.doPost(DwrServlet.java:141)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:637)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
   at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
   at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
   at 
 org.tuckey.web.filters.urlrewrite.UrlRewriteFilter.doFilter(UrlRewriteFilter.java:404)
   at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
   at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
   at com.dotmarketing.filters.CMSFilter.doFilter(CMSFilter.java:122)
   at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
   at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
   at 
 com.dotmarketing.filters.AutoLoginFilter.doFilter(AutoLoginFilter.java:61)
   at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
   at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
   at 
 com.dotmarketing.filters.CacheImagesFilter.doFilter(CacheImagesFilter.java:47)
   at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
   at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
   at 
 com.dotmarketing.cms.urlmap.filters.URLMapFilter.doFilter(URLMapFilter.java:87)
   at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
   at 
 

[jira] [Updated] (TIKA-1057) document content property Status is not extracted for *.doc files

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1057:


Component/s: parser

 document content property Status is not extracted for *.doc files
 ---

 Key: TIKA-1057
 URL: https://issues.apache.org/jira/browse/TIKA-1057
 Project: Tika
  Issue Type: Bug
  Components: parser
 Environment: java 1.5/1.6 / Windows 7
Reporter: Thomas Stroeter
Priority: Minor

 I would like to use Tika to extract the document property "Status" from a 
 Word 97-2003 *.doc file.

 Tika dumps the document status property correctly from XML-based *.docx files 
 as Content-Status and cp:contentStatus, but I cannot extract this 
 metadata from *.doc Word documents using Tika. 
 Word 2010, however, has no problem setting and reading that document 
 metadata in a *.doc file.
 Is there a way to extract this information with Tika for *.doc files, too?



[jira] [Updated] (TIKA-978) OSGi bundle build fails if space exists in build path

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-978:
---

Component/s: packaging

 OSGi bundle build fails if space exists in build path
 -

 Key: TIKA-978
 URL: https://issues.apache.org/jira/browse/TIKA-978
 Project: Tika
  Issue Type: Bug
  Components: packaging
Reporter: Ken Krugler
Priority: Minor

 While trying to replicate TIKA-997, I copied the Tika 1.2 source release to 
 /Volumes/Ken Backup/. Tika parent/core/parsers/XMP/application built fine, 
 but the OSGi bundle build failed a test - something doesn't like a space in 
 the path to the tika-core.jar file:
 Running org.apache.tika.bundle.BundleIT
 35 [main] INFO org.ops4j.pax.exam.spi.DefaultExamSystem - Pax Exam System 
 (Version: 2.2.0) created.
 Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.152 sec <<< 
 FAILURE!
 Results :
 Tests in error: 
   initializationError(org.apache.tika.bundle.BundleIT): Illegal character in 
 path at index 17: file:/Volumes/Ken 
 Backup/tika-1.2/tika-bundle/target/test-bundles/tika-core.jar
 Tests run: 1, Failures: 0, Errors: 1, Skipped: 0
 [INFO] 
 [INFO] --- maven-failsafe-plugin:2.10:verify (default) @ tika-bundle ---
 [INFO] Failsafe report directory: /Volumes/Ken 
 Backup/tika-1.2/tika-bundle/target/failsafe-reports
 [INFO] 
 
 [INFO] Reactor Summary:
 [INFO] 
 [INFO] Apache Tika parent  SUCCESS [2.218s]
 [INFO] Apache Tika core .. SUCCESS [19.498s]
 [INFO] Apache Tika parsers ... SUCCESS [1:00.914s]
 [INFO] Apache Tika XMP ... SUCCESS [1.895s]
 [INFO] Apache Tika application ... SUCCESS [13.102s]
 [INFO] Apache Tika OSGi bundle ... FAILURE [18.073s]
 [INFO] Apache Tika server  SKIPPED
 [INFO] Apache Tika ... SKIPPED
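
 The "Illegal character in path at index 17" message is what java.net.URI reports when a path containing a raw space is parsed as a URI. The sketch below reproduces that class of error and shows one common way around it. It is an illustration only: where the build actually constructs the URI is not identified in this report, and the class and method names are hypothetical.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

// Illustration only; this is not code from the Tika build or its test framework.
final class SpaceInPathDemo {
    // Index of the first illegal character when "file:" + path is parsed as a URI, or -1 if it parses.
    static int naiveUriErrorIndex(String path) {
        try {
            new URI("file:" + path);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    // File.toURI() percent-encodes the space, so the result is a valid URI.
    static URI encodedUri(String path) {
        return new File(path).toURI();
    }
}
```

 With the path from the report, the naive concatenation fails at the space after "Ken", while the File-based conversion yields a URI containing "Ken%20Backup".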



[jira] [Updated] (TIKA-993) Language Detection Fault

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-993:
---

Component/s: languageidentifier

 Language Detection Fault
 

 Key: TIKA-993
 URL: https://issues.apache.org/jira/browse/TIKA-993
 Project: Tika
  Issue Type: Bug
  Components: languageidentifier
Reporter: Iman Reihanian
 Attachments: DetectorImpl.java


 The language of this text is English, but it is detected as Italian.



[jira] [Updated] (TIKA-1076) Upgrade to Apache POI 3.9

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1076:


Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Upgrade to Apache POI 3.9
 -

 Key: TIKA-1076
 URL: https://issues.apache.org/jira/browse/TIKA-1076
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Nick Burch
 Fix For: 1.5


 We should upgrade to Apache POI 3.9, which is the latest version.



[jira] [Updated] (TIKA-961) No whitespace added if BoilerpipeContentHandler.setIncludeMarkup(true)

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-961:
---

Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 No whitespace added if BoilerpipeContentHandler.setIncludeMarkup(true)
 --

 Key: TIKA-961
 URL: https://issues.apache.org/jira/browse/TIKA-961
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
Reporter: Markus Jelsma
Assignee: Ken Krugler
 Fix For: 1.5

 Attachments: TIKA-961-1.3-1.patch, TIKA-961-1.3-2.patch, 
 TIKA-961-1.3-3.patch


 ignorableWhitespace is not properly added when using the 
 BoilerpipeContentHandler with markup included.



[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-539:
---

Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Encoding detection is too biased by encoding in meta tag
 

 Key: TIKA-539
 URL: https://issues.apache.org/jira/browse/TIKA-539
 Project: Tika
  Issue Type: Bug
  Components: metadata, parser
Affects Versions: 0.8, 0.9, 0.10
Reporter: Reinhard Schwab
Assignee: Ken Krugler
 Fix For: 1.5

 Attachments: TIKA-539_2.patch, TIKA-539.patch


 If the encoding declared in the meta tag is wrong, that encoding is still detected,
 even if the correct encoding was set in the metadata beforehand (for example, from 
 the HTTP response header).
 Test code to reproduce:
 // The meta tag claims iso-8859-1, but the bytes passed to the parser are UTF-8.
 static String content = "<html><head>\n"
   + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
   + "</head><body>Über den Wolken\n</body></html>";
   /**
    * @param args
    * @throws IOException
    * @throws TikaException
    * @throws SAXException
    */
   public static void main(String[] args) throws IOException, SAXException,
   TikaException {
   Metadata metadata = new Metadata();
   metadata.set(Metadata.CONTENT_TYPE, "text/html");
   // Encoding supplied up front, as it would be from an HTTP response header.
   metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
   System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
   InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
   AutoDetectParser parser = new AutoDetectParser();
   BodyContentHandler h = new BodyContentHandler(1);
   parser.parse(in, h, metadata, new ParseContext());
   System.out.print(h.toString());
   // Prints the encoding the parser settled on after parsing.
   System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
   }
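
The behaviour the reporter asks for amounts to a precedence rule: an encoding supplied by the caller (for example from the HTTP response header) should win over the one declared in the meta tag. The sketch below illustrates that rule with the JDK only. The class, the method names and the UTF-8 fallback are assumptions made for illustration; this is not how Tika's detector is implemented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetPrecedenceSketch {

    /**
     * Hypothetical policy: prefer the charset supplied in the metadata
     * (e.g. from an HTTP header) over the one declared in the meta tag.
     * Names are assumed to be syntactically legal charset names.
     */
    public static Charset choose(String fromMetadata, String fromMetaTag) {
        if (fromMetadata != null && Charset.isSupported(fromMetadata)) {
            return Charset.forName(fromMetadata);
        }
        if (fromMetaTag != null && Charset.isSupported(fromMetaTag)) {
            return Charset.forName(fromMetaTag);
        }
        return StandardCharsets.UTF_8; // assumed fallback, not from the issue
    }

    /** Decodes the raw bytes with the given charset. */
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "\u00DC" is the letter U with diaeresis from the test string in the issue.
        String text = "\u00DCber den Wolken";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // Trusting the wrong meta tag garbles the first character.
        System.out.println(text.equals(decode(utf8, StandardCharsets.ISO_8859_1)));
        // Preferring the caller-supplied encoding round-trips the text.
        System.out.println(text.equals(decode(utf8, choose("UTF-8", "iso-8859-1"))));
    }
}
```

A real detector would presumably also weigh byte-level evidence; the sketch only covers the ordering of the two declared sources.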



[jira] [Updated] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1109:


Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Metadata not extracted before the context in OOXML (pptx)
 -

 Key: TIKA-1109
 URL: https://issues.apache.org/jira/browse/TIKA-1109
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Daniel Bonniot de Ruisselet
Priority: Critical
 Fix For: 1.5


 It seems that when processing OOXML documents, the metadata is only read 
 after the text. This means it's impossible to use the metadata while processing 
 the text. I think it would be more useful to have the metadata populated 
 first.
 As a symptom:
 java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
 outputs only the following metadata:
 <meta name="Content-Length" content="36518"/>
 <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
 <meta name="resourceName" content="testPPT.pptx"/>
 while there is more metadata in the file (e.g. <dc:title>Attachment 
 Test</dc:title>).



[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---

Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
 Fix For: 1.5

 Attachments: TIKA-715.patch


 With TIKA-683 I committed simple, commented-out code to
 SafeContentHandler, to verify that the SAX events produced by the
 parser have valid (matched) tags.  Ie, each startElement("foo") is
 matched by the closing endElement("foo").
 I only did a basic nesting test, plus checking that <p> is never
 embedded inside another <p>; we could strengthen this further to check
 that all tags only appear in valid parents...
 I was able to use this to fix issues with the new RTF parser
 (TIKA-683), but I was surprised that some other parsers failed the new
 asserts.
 It could be these are relatively minor offenses (eg closing a table
 w/o closing the tr) and we need not do anything here... but I think
 it'd be cleaner if all our parsers produced matched, well-formed XHTML
 events.
 I haven't looked into any of these... it could be they are easy to fix.
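
The matched-tag check described above can be sketched with a stack of open element names. This is a minimal illustration using the JDK's SAX classes, not the code that was committed to SafeContentHandler: the class name and the firstError helper are hypothetical, and it throws SAXException where the original used assertions. The error messages are modelled on the ones in the failure log.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class MatchedTagChecker extends DefaultHandler {

    // Names of the elements that have been opened but not yet closed.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if ("p".equals(qName) && open.contains("p")) {
            throw new SAXException("<p> nested inside <p>");
        }
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (open.isEmpty()) {
            throw new SAXException("end tag=" + qName + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(qName)) {
            throw new SAXException(
                "mismatched elements open=" + expected + " close=" + qName);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("unclosed element: " + open.peek());
        }
    }

    /**
     * Hypothetical helper for trying the checker out: "+name" opens an
     * element, "-name" closes one. Returns the first error message, or
     * null if the event sequence is well formed.
     */
    public static String firstError(String... events) {
        MatchedTagChecker checker = new MatchedTagChecker();
        try {
            for (String event : events) {
                String name = event.substring(1);
                if (event.charAt(0) == '+') {
                    checker.startElement("", name, name, null);
                } else {
                    checker.endElement("", name, name);
                }
            }
            checker.endDocument();
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }
}
```

For example, firstError("+table", "+tr", "-table") reports the same kind of mismatch as the Keynote failure listed in this issue, while a properly nested sequence returns null.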
 Failures:
 {noformat}
 testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
 Time elapsed: 0.032 sec   ERROR!
 java.lang.AssertionError: end tag=body with no startElement
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
   at 
 org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
 testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
 0.116 sec   ERROR!
 java.lang.AssertionError: mismatched elements open=tr close=table
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
   at 
 org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
   at 
 com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
   at 
 com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
   at 
 org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
   at 
 org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
 0.025 sec   ERROR!
 

[jira] [Updated] (TIKA-980) MicrodataContentHandler for Apache Tika

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-980:
---

Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 MicrodataContentHandler for Apache Tika
 ---

 Key: TIKA-980
 URL: https://issues.apache.org/jira/browse/TIKA-980
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Markus Jelsma
Assignee: Ken Krugler
 Fix For: 1.5

 Attachments: TIKA-980-1.3-1.patch, TIKA-980-1.3-2.patch, 
 TIKA-980-1.3-3.patch, TIKA-980-1.3-4.patch, TIKA-980-1.3-5.patch


 ContentHandler for Apache Tika capable of building a data structure 
 containing Microdata item scopes and item properties. The Item* classes are 
 borrowed from the Apache Any23 project and are slightly modified to 
 accommodate this SAX-based extractor vs the original DOM-based extractor.
 The provided unit test outputs two item scopes about the Europe and NA 
 ApacheCon events and each has a nested property.



[jira] [Updated] (TIKA-817) (PPT/PPTX) Missing date/time in text content.

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-817:
---

Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 (PPT/PPTX) Missing date/time in text content.
 -

 Key: TIKA-817
 URL: https://issues.apache.org/jira/browse/TIKA-817
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.0
 Environment: Win7-64 + java version 1.6.0_26
Reporter: Albert L.
 Fix For: 1.5


 Missing date/time text in text content for PPT and PPTX files.
 The date and time are missing from the text content.  This occurs when one 
 chooses the following with MS-PowerPoint 2010:
 1) Insert
 2) Date & Time
 3) Update automatically
 4) save to PPT or PPTX



[jira] [Updated] (TIKA-820) Locator is unset for HTML parser

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-820:
---

Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Locator is unset for HTML parser
 

 Key: TIKA-820
 URL: https://issues.apache.org/jira/browse/TIKA-820
 Project: Tika
  Issue Type: Bug
  Components: general, parser
Affects Versions: 1.0
Reporter: Daniel Bonniot de Ruisselet
Assignee: Ken Krugler
  Labels: patch
 Fix For: 1.5

 Attachments: text-locator.patch


 The HtmlParser does not call setDocumentLocator(Locator locator) on the 
 user's content handler.
 Patch and unit test attached.



[jira] [Updated] (TIKA-1106) CLAVIN Integration

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1106:


Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 CLAVIN Integration
 --

 Key: TIKA-1106
 URL: https://issues.apache.org/jira/browse/TIKA-1106
 Project: Tika
  Issue Type: Wish
  Components: general
Affects Versions: 1.3
 Environment: All
Reporter: Adam Estrada
Priority: Minor
  Labels: entity, geospatial
 Fix For: 1.5


 I've been evaluating CLAVIN as a way to extract location information from 
 unstructured text. It seems like meshing it with Tika in some way would make 
 a lot of sense. From CLAVIN website...
 {quote}
 CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source 
 software package for document geotagging and geoparsing that employs 
 context-based geographic entity resolution. It combines a variety of open 
 source tools with natural language processing techniques to extract location 
 names from unstructured text documents and resolve them against gazetteer 
 records. Importantly, CLAVIN does not simply look up location names; 
 rather, it uses intelligent heuristics in an attempt to identify precisely 
 which Springfield (for example) was intended by the author, based on the 
 context of the document. CLAVIN also employs fuzzy search to handle 
 incorrectly-spelled location names, and it recognizes alternative names 
 (e.g., Ivory Coast and Côte d'Ivoire) as referring to the same geographic 
 entity. By enriching text documents with structured geo data, CLAVIN enables 
 hierarchical geospatial search and advanced geospatial analytics on 
 unstructured data.
 {quote}
 There was only one other instance of the word clavin mentioned in the ASF 
 jira site so I thought it was definitely worth posting here.
 https://github.com/Berico-Technologies/CLAVIN



[jira] [Updated] (TIKA-1108) Represent individual slides in pptx

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1108:


Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Represent individual slides in pptx
 ---

 Key: TIKA-1108
 URL: https://issues.apache.org/jira/browse/TIKA-1108
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Daniel Bonniot de Ruisselet
 Fix For: 1.5


 When parsing ppt, tika produces for each slide:
 <div class="slide">
 However, for pptx these seem to be missing; all the text is directly under 
 <body>.



[jira] [Updated] (TIKA-776) ExifTool Embedder

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-776:
---

Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 ExifTool Embedder
 -

 Key: TIKA-776
 URL: https://issues.apache.org/jira/browse/TIKA-776
 Project: Tika
  Issue Type: New Feature
  Components: metadata
Affects Versions: 1.0
 Environment: ExifTool is required 
 (http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
  Labels: embed, exiftool, patch
 Fix For: 1.5

 Attachments: tika-parsers-exiftool-embed-patch.txt


 This patch adds an ExifTool ExternalEmbedder which builds upon the work in 
 issue TIKA-774 and TIKA-775.
 In the tika-parsers an ExiftoolExternalEmbedder is added which extends 
 ExternalEmbedder to programmatically create an Embedder which calls the 
 ExifTool command line to embed tika metadata into a file stream and an 
 ExiftoolExternalEmbedderTest unit test is added which embeds several IPTC and 
 XMP fields then parses the resulting file stream to verify the operation.



[jira] [Created] (TIKA-1127) text/xml for tika-server

2013-05-27 Thread Chris A. Mattmann (JIRA)
Chris A. Mattmann created TIKA-1127:
---

 Summary: text/xml for tika-server
 Key: TIKA-1127
 URL: https://issues.apache.org/jira/browse/TIKA-1127
 Project: Tika
  Issue Type: Bug
  Components: server
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.4


[~amosavian] contributed this patch from Github to provide text/xml to 
tika-server:

https://github.com/apache/tika/pull/3.patch



[jira] [Updated] (TIKA-988) We don't extract a placeholder for a Word document embedded in an Excel document

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-988:
---

Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 We don't extract a placeholder for a Word document embedded in an Excel 
 document
 

 Key: TIKA-988
 URL: https://issues.apache.org/jira/browse/TIKA-988
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.5

 Attachments: bug31373.xls


 In TIKA-956 we fixed the Word parser so that at the point where an embedded 
 document appears, we output a <div class="embedded" id="_XXX"/> tag.
 It would be nice to do this for documents embedded in Excel too.



[jira] [Updated] (TIKA-1122) Tika fails to parse chm files

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1122:


Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Tika fails to parse chm files
 -

 Key: TIKA-1122
 URL: https://issues.apache.org/jira/browse/TIKA-1122
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Tejas Patil
Priority: Minor
 Fix For: 1.5


 (Reported by Jan Riewe on the Nutch user list; see 
 http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-td3999735.html)
 Nutch fails to parse chm files with
 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type 
 application/vnd.ms-htmlhelp
 Even when running tika-app standalone (i.e. not via Nutch), not a single chm 
 file was parsed (I tried 10-15 different chm files of various sizes).



[jira] [Updated] (TIKA-985) Support for HTML5 elements

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-985:
---

Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Support for HTML5 elements
 --

 Key: TIKA-985
 URL: https://issues.apache.org/jira/browse/TIKA-985
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.2
Reporter: Markus Jelsma
 Fix For: 1.5

 Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch, 
 TIKA-985-1.3-3.patch


 TagSoup's schema.tssl does not include some HTML5 elements (e.g. article, 
 section). This prevents some custom ContentHandlers from reading expected 
 elements and/or attributes.



[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-891:
---

Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Use POST in addition to PUT on method calls in tika-server
 --

 Key: TIKA-891
 URL: https://issues.apache.org/jira/browse/TIKA-891
 Project: Tika
  Issue Type: Improvement
  Components: general
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Trivial
 Fix For: 1.5


 Per Jukka's email:
 http://s.apache.org/uR
 It would be a better use of REST/HTTP verbs to use POST to put content to a 
 resource where we don't intend to store that content (which is the 
 implication of PUT). Max suggested adding:
 {code}
 @POST
 {code}
 annotations to the methods we are currently exposing using PUT to take care 
 of this. 



[jira] [Updated] (TIKA-1079) Word document hits AIOOBE in SummaryExtractor.parseSummaries

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1079:


Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Word document hits AIOOBE in SummaryExtractor.parseSummaries
 

 Key: TIKA-1079
 URL: https://issues.apache.org/jira/browse/TIKA-1079
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.5

 Attachments: guide_to_daips_(id_3152_ver_1.0.0).doc


 I'm not yet sure if this is a corrupted document (though, MS Word opens it 
 just fine) or a bug in POI ... but I hit this exc when running it through 
 TikaCLI:
 {noformat}
 java.lang.ArrayIndexOutOfBoundsException: -1
   at org.apache.poi.hpsf.CodePageString.<init>(CodePageString.java:161)
   at 
 org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:158)
   at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:163)
   at org.apache.poi.hpsf.Property.<init>(Property.java:164)
   at org.apache.poi.hpsf.Section.<init>(Section.java:277)
   at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:451)
   at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:246)
   at 
 org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:78)
   at 
 org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:69)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
 {noformat}



[jira] [Updated] (TIKA-1110) Incorrectly declared SUPPORTED_TYPES in ChmParser.

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1110:


Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Incorrectly declared SUPPORTED_TYPES in ChmParser.
 --

 Key: TIKA-1110
 URL: https://issues.apache.org/jira/browse/TIKA-1110
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3, 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.5


 [This 
 link|http://www.iana.org/assignments/media-types/application/vnd.ms-htmlhelp] 
 assigns the official mime type for these files to 
 application/vnd.ms-htmlhelp. In the wild there are also two other types 
 used:
 * application/chm
 * application/x-chm
 tika-mimetypes.xml uses the correct official mime type, but ChmParser 
 declares that it supports only application/chm. For this reason content 
 that uses the official mime type (e.g. coming via Detector or parsed using 
 AutoDetectParser, or simply declared in metadata) fails to parse due to 
 unknown mime type.
 The fix seems simple: ChmParser should also declare all of the above types 
 in its SUPPORTED_TYPES.
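
The proposed fix can be sketched as a set that holds all three types named in this issue. The real ChmParser declares its supported types through Tika's own media type class; the version below uses plain strings so that it runs with the JDK alone, and the class name and the supports helper are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ChmTypesSketch {

    // The official IANA type plus the two variants seen in the wild,
    // as listed in the issue description.
    public static final Set<String> SUPPORTED_TYPES =
        Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
            "application/vnd.ms-htmlhelp",
            "application/chm",
            "application/x-chm")));

    /** Returns true if the given MIME type is one of the declared CHM types. */
    public static boolean supports(String mimeType) {
        return mimeType != null
            && SUPPORTED_TYPES.contains(mimeType.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(supports("application/vnd.ms-htmlhelp"));
        System.out.println(supports("application/pdf"));
    }
}
```

With only application/chm declared, the first lookup in main would fail, which matches the "unknown mime type" symptom described above.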



[jira] [Updated] (TIKA-995) XHTMLContentHandler doesn't pass attributes of body element

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-995:
---

Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 XHTMLContentHandler doesn't pass attributes of body element
 ---

 Key: TIKA-995
 URL: https://issues.apache.org/jira/browse/TIKA-995
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
Reporter: Markus Jelsma
 Fix For: 1.5

 Attachments: TIKA-995-1.3-1.patch, TIKA-995-unit.patch


 XHTMLContentHandler.startElement() uses lazyHead() for the body element 
 because it's defined in the AUTO Set. As a consequence, attributes of the 
 body element are not passed to downstream content handlers. 



[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-819:
---

Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Make Option to Exclude Embedded Files' Text for Text Content
 

 Key: TIKA-819
 URL: https://issues.apache.org/jira/browse/TIKA-819
 Project: Tika
  Issue Type: New Feature
  Components: general
Affects Versions: 1.0
 Environment: Windows-7 + JDK 1.6 u26
Reporter: Albert L.
 Fix For: 1.5


 It would be nice to be able to disable text content from embedded files.
 For example, if I have a DOCX with an embedded PPTX, then I would like the 
 option to disable text from the PPTX from showing up when asking for the text 
 content from DOCX.  In other words, it would be nice to have the option to 
 get text content *only* from the DOCX instead of the DOCX+PPTX.



[jira] [Updated] (TIKA-1086) Tika-bundle 1.3 does not import org.w3c.dom package

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1086:


Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Tika-bundle 1.3 does not import org.w3c.dom package
 ---

 Key: TIKA-1086
 URL: https://issues.apache.org/jira/browse/TIKA-1086
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Gaurav
 Fix For: 1.5

 Attachments: TIKA-1086.svn.diff


 The tika-bundle 1.3 version does not import the org.w3c.dom package; as a result 
 it is not able to parse DOM-based documents such as Microsoft Word (docx) 
 documents.
 This issue does not occur in version 1.2, which does import the necessary 
 package, so parsing of these documents works fine.
 Can someone please look into the issue, as Microsoft Word is a very popular 
 document format.



[jira] [Updated] (TIKA-605) Tika GDAL parser

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-605:
---

Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Tika GDAL parser
 

 Key: TIKA-605
 URL: https://issues.apache.org/jira/browse/TIKA-605
 Project: Tika
  Issue Type: New Feature
  Components: parser
 Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
  Labels: gdal, gsoc2013, integration, mentor, tika
 Fix For: 1.5

 Attachments: 0001-TIKA-605-Tika-GDAL-parser.patch, 
 TIKA-605.Mattmann.092511.patch.txt


 Leverage the GDAL toolkit and its Java SWIG bindings to create a Tika parser 
 around GDAL. See here: 
 http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/java/apps/gdalinfo.java



[jira] [Updated] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1059:


Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
 --

 Key: TIKA-1059
 URL: https://issues.apache.org/jira/browse/TIKA-1059
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
 Fix For: 1.5


 The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch 
 {{InterruptedException}} and ignore it.
 The methods should either call {{interrupt()}} on the current thread or 
 re-throw the exception, possibly wrapped in a {{TikaException}}.
 See TIKA-775 for a previous discussion.
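 The two options named above can be combined, as in the minimal sketch below. It does not use the real ExternalParser code: ParseFailure is a hypothetical stand-in for TikaException, and BlockingStep stands in for any blocking call such as Process.waitFor().

```java
// Sketch only; the names below are hypothetical and not Tika APIs.
class InterruptHandlingSketch {

    /** Hypothetical stand-in for a checked parser exception. */
    static class ParseFailure extends Exception {
        ParseFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Any blocking call that may be interrupted, e.g. waiting on a process. */
    interface BlockingStep {
        int run() throws InterruptedException;
    }

    static int runStep(BlockingStep step) throws ParseFailure {
        try {
            return step.run();
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers further up can see it,
            // then report the failure instead of silently ignoring it.
            Thread.currentThread().interrupt();
            throw new ParseFailure("Interrupted while waiting for external command", e);
        }
    }
}
```

 Restoring the flag matters because catching InterruptedException clears it; without the interrupt() call, code further up the stack cannot tell that cancellation was requested.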



[jira] [Updated] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1072:


Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 AIOOBE when handling embedded document in .doc file
 ---

 Key: TIKA-1072
 URL: https://issues.apache.org/jira/browse/TIKA-1072
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.5

 Attachments: 20-Force-on-a-current-S00.doc, Ole10NativeEntry.bin


 I have a Word (.doc) document that hits an exception when I run:
 {noformat}
 java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar 
 /x/tmp/20-Force-on-a-current-S00.doc 
 {noformat}
 Here's the exception:
 {noformat}
 Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
   at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
   at 
 org.apache.poi.poifs.filesystem.Ole10Native.init(Ole10Native.java:139)
   at 
 org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
   at 
 org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
   at 
 org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 {noformat}
 It happens when we try to parse an OLE10 embedded object ... the code
 that does this parsing captures and ignores Ole10NativeException and
 skips the entry ... so I'm wondering if we should also catch AIOOBE
 and skip the entry?  I.e., maybe this entry really is not OLE10, and the
 Ole10Native code is failing to throw Ole10NativeException for it?
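 The proposal can be sketched as follows. This is not the POI or Tika code: the toy entry format, NotOle10Exception, and every method name here are hypothetical, chosen only to show a truncated entry raising ArrayIndexOutOfBoundsException and being skipped in the same way as a declared format exception.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only; the format and names are hypothetical, not POI or Tika APIs.
class EmbeddedEntrySketch {

    /** Hypothetical stand-in for a checked "wrong format" exception. */
    static class NotOle10Exception extends Exception {
        NotOle10Exception(String message) {
            super(message);
        }
    }

    /** Reads an unsigned little-endian 16-bit value; no bounds check, like a low-level reader. */
    static int getShort(byte[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    /** Toy format: a 2-byte header equal to 2, followed by a 2-byte value. */
    static int parseEntry(byte[] data) throws NotOle10Exception {
        if (data.length < 2 || getShort(data, 0) != 2) {
            throw new NotOle10Exception("unexpected header");
        }
        return getShort(data, 2); // throws AIOOBE when the entry is truncated
    }

    /** Parses every entry, skipping those that fail either way. */
    static List<Integer> parseAll(List<byte[]> entries) {
        List<Integer> parsed = new ArrayList<>();
        for (byte[] entry : entries) {
            try {
                parsed.add(parseEntry(entry));
            } catch (NotOle10Exception | ArrayIndexOutOfBoundsException e) {
                // Skip the entry: either it is not in the expected format,
                // or it is too short for the reader.
            }
        }
        return parsed;
    }
}
```

 Catching the runtime exception at the call site is a workaround; the alternative raised in the report is to fix the reader so that it throws the declared exception for malformed input.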



[jira] [Updated] (TIKA-774) ExifTool Parser

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-774:
---

Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 ExifTool Parser
 ---

 Key: TIKA-774
 URL: https://issues.apache.org/jira/browse/TIKA-774
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.0
 Environment: Requires ExifTool to be installed 
 (http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
  Labels: features, newbie, patch
 Fix For: 1.5

 Attachments: testJPEG_IPTC_EXT.jpg, 
 tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt


 Adds an external parser that calls ExifTool to extract extended metadata 
 fields from images and other content types.
 In the core project:
 An ExifTool interface is added which contains Property objects that define 
 the metadata fields available.
 An additional Property constructor is added for the internalTextBag type.
 In the parsers project:
 An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
 on the command line and mapping the response to tika metadata fields.  This 
 extractor could be called instead of or in addition to the existing 
 ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
 JpegParser but those have not been changed at this time.
 An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
 An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
 metadata fields to existing tika and Drew Noakes metadata fields if enabled.
 An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
 implementations in XML files.
 An ExifToolParserTest is added which tests several expected XMP and IPTC 
 metadata values in testJPEG_IPTC_EXT.jpg.
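 The "mapping the response" step can be pictured with the sketch below. It is not the attached patch: it assumes ExifTool's plain-text output of "Name : Value" lines, whereas the patch may well use another output format, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only; hypothetical names, and the plain-text output format is an assumption.
class ExifToolOutputSketch {

    /** Turns "Name : Value" lines into an ordered name-to-value map. */
    static Map<String, String> parse(String output) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            // Split at the first colon only: values such as dates
            // ("2013:05:27 14:22:49") contain colons themselves.
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a field line
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (!name.isEmpty()) {
                fields.put(name, value);
            }
        }
        return fields;
    }
}
```

 A real mapper would then translate these ExifTool names to the corresponding Tika metadata properties, which is the role the description gives to ExiftoolTikaMapper.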



[jira] [Resolved] (TIKA-1127) text/xml for tika-server

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1127.
-

Resolution: Fixed

- patch applied in r1486665.

 text/xml for tika-server
 

 Key: TIKA-1127
 URL: https://issues.apache.org/jira/browse/TIKA-1127
 Project: Tika
  Issue Type: Bug
  Components: server
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.4


 [~amosavian] contributed this patch from Github to provide text/xml to 
 tika-server:
 https://github.com/apache/tika/pull/3.patch
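 From a client's point of view, the new producer would be selected through the Accept header. The sketch below only builds such a request with the Java 11+ java.net.http API and does not send it. The use of PUT and the base URL passed in are assumptions about the server setup, and the class and method names are hypothetical; only the /tika resource path comes from this thread.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only; hypothetical helper, and PUT to <baseUrl>/tika is an assumption.
class TikaServerRequestSketch {

    /** Builds a request asking the /tika resource for the given media type. */
    static HttpRequest buildRequest(String baseUrl, byte[] document, String accept) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
                .header("Accept", accept) // e.g. text/plain, text/html or text/xml
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }
}
```

 For example, buildRequest("http://localhost:9998", bytes, "text/xml") describes a request for XML output; the host and port there are placeholders, not values taken from the issue.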



[jira] [Updated] (TIKA-1086) Tika-bundle 1.3 does not import org.w3c.dom package

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1086:


Fix Version/s: (was: 1.2)
   1.4

 Tika-bundle 1.3 does not import org.w3c.dom package
 ---

 Key: TIKA-1086
 URL: https://issues.apache.org/jira/browse/TIKA-1086
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Gaurav
 Fix For: 1.4

 Attachments: TIKA-1086.svn.diff


 The tika-bundle 1.3 version does not import the org.w3c.dom package; as a result, 
 it is not able to parse DOM-based documents such as Microsoft Word (docx) 
 documents.
 This issue does not occur in version 1.2, as it does import the necessary 
 package and therefore the parsing of the documents works fine.
 Can someone please look into the issue, as Microsoft Word is a very popular 
 document format.



Re: [DISCUSS] Apache Tika 1.4 RC?

2013-05-27 Thread Michael McCandless
+1, thanks Chris!

Mike McCandless

http://blog.mikemccandless.com


On Mon, May 27, 2013 at 1:06 PM, Mattmann, Chris A (398J)
chris.a.mattm...@jpl.nasa.gov wrote:
 Hey Guys,

 I have some free cycles this week -- and the energy to produce a Tika 1.4
 RC. Sound good? I cleaned up JIRA and got all resolved (22) issues done
 and scheduled for 1.4. Did I miss anything?

 If I don't hear any objections expect an RC #1 for 1.4 by the end of the
 week.

 Cheers,
 Chris

 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






[jira] [Commented] (TIKA-1127) text/xml for tika-server

2013-05-27 Thread Ali Mosavian (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13667934#comment-13667934
 ] 

Ali Mosavian commented on TIKA-1127:


Ghansk Chris!

 text/xml for tika-server
 

 Key: TIKA-1127
 URL: https://issues.apache.org/jira/browse/TIKA-1127
 Project: Tika
  Issue Type: Bug
  Components: server
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.4


 [~amosavian] contributed this patch from Github to provide text/xml to 
 tika-server:
 https://github.com/apache/tika/pull/3.patch



[jira] [Issue Comment Deleted] (TIKA-1127) text/xml for tika-server

2013-05-27 Thread Ali Mosavian (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Mosavian updated TIKA-1127:
---

Comment: was deleted

(was: Ghansk Chris!)

 text/xml for tika-server
 

 Key: TIKA-1127
 URL: https://issues.apache.org/jira/browse/TIKA-1127
 Project: Tika
  Issue Type: Bug
  Components: server
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.4


 [~amosavian] contributed this patch from Github to provide text/xml to 
 tika-server:
 https://github.com/apache/tika/pull/3.patch



[jira] [Commented] (TIKA-1127) text/xml for tika-server

2013-05-27 Thread Ali Mosavian (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13667935#comment-13667935
 ] 

Ali Mosavian commented on TIKA-1127:


Thanks Chris!

 text/xml for tika-server
 

 Key: TIKA-1127
 URL: https://issues.apache.org/jira/browse/TIKA-1127
 Project: Tika
  Issue Type: Bug
  Components: server
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.4


 [~amosavian] contributed this patch from Github to provide text/xml to 
 tika-server:
 https://github.com/apache/tika/pull/3.patch



[jira] [Commented] (TIKA-1127) text/xml for tika-server

2013-05-27 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13667965#comment-13667965
 ] 

Chris A. Mattmann commented on TIKA-1127:
-

np probs, thanks to you, Ali!

 text/xml for tika-server
 

 Key: TIKA-1127
 URL: https://issues.apache.org/jira/browse/TIKA-1127
 Project: Tika
  Issue Type: Bug
  Components: server
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.4


 [~amosavian] contributed this patch from Github to provide text/xml to 
 tika-server:
 https://github.com/apache/tika/pull/3.patch
