date:20121213

[jira] [Commented] (TIKA-775) Embed Capabilities

2012-12-13 Thread Nick Burch (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532033#comment-13532033
 ] 

Nick Burch commented on TIKA-775:
-

Could you maybe add a simple dummy parser for testing with?

Another option is to have the test in the parsers package, even though the main 
code is in core. We have quite a few examples of that, e.g. some of the mime 
magic stuff is tested in parsers because that's where the test files live.

> Embed Capabilities
> ------------------
>
> Key: TIKA-775
> URL: https://issues.apache.org/jira/browse/TIKA-775
> Project: Tika
> Issue Type: Improvement
> Components: general, metadata
> Affects Versions: 1.0
> Environment: The default ExternalEmbedder requires that sed be installed.
> Reporter: Ray Gauss II
> Labels: embed, patch
> Fix For: 1.3
>
> Attachments: embed_20121029.diff, embed.diff, 
> tika-core-embed-patch.txt, tika-parsers-embed-patch.txt
>
>
> This patch defines and implements the concept of embedding tika metadata into 
> a file stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed 
> ExternalEmbedder implementation meant to be extended or configured are added. 
>  These classes are essentially a reverse flow of the existing Parser and 
> ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which 
> uses the default ExternalEmbedder (calls sed) to embed a value placed in 
> Metadata.DESCRIPTION then verify the operation by parsing the resulting 
> stream.
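
To make the embed-then-verify round trip concrete, here is a minimal, self-contained sketch. It is not the ExternalEmbedder from the patch and does not call sed or any Tika class: the EmbedRoundTrip class, the "description: " marker line and both method names are hypothetical, chosen only to show a value being written into a stream's bytes and then recovered by reading them back.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the embed/parse round trip; not Tika code.
final class EmbedRoundTrip {
    // Assumed marker line; the real patch delegates to an external command instead.
    static final String PREFIX = "description: ";

    // "Embed": prepend a marker line carrying the description to the original bytes.
    static byte[] embed(String description, byte[] original) {
        byte[] header = (PREFIX + description + "\n").getBytes(StandardCharsets.UTF_8);
        byte[] result = new byte[header.length + original.length];
        System.arraycopy(header, 0, result, 0, header.length);
        System.arraycopy(original, 0, result, header.length, original.length);
        return result;
    }

    // "Parse": read the first line back and return the description, or null if absent.
    static String extract(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_8);
        int end = text.indexOf('\n');
        String firstLine = end >= 0 ? text.substring(0, end) : text;
        return firstLine.startsWith(PREFIX) ? firstLine.substring(PREFIX.length()) : null;
    }

    public static void main(String[] args) {
        byte[] original = "plain text body\n".getBytes(StandardCharsets.UTF_8);
        byte[] embedded = embed("sample description", original);
        System.out.println(extract(embedded));
    }
}
```

The sketch assumes the description contains no newline. A test built on the same idea would embed a known value and assert that the parsing side returns exactly that value.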

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (TIKA-1044) Can't parse Word files with no format set

2012-12-13 Thread Nick Burch (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1044.
--

   Resolution: Fixed
Fix Version/s: 1.3

Fixed in r1421646, along with a unit test based on your files, thanks!

> Can't parse Word files with no format set
> -----------------------------------------
>
> Key: TIKA-1044
> URL: https://issues.apache.org/jira/browse/TIKA-1044
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.0
> Reporter: Jonas Wilhelmsson
> Priority: Trivial
> Fix For: 1.3
>
> Attachments: test2.doc, test.docx
>
>
> When we were using Solr for indexing we came across this Tika bug.
> While parsing a doc or docx file that contains text without any format set 
> (format inside Microsoft Word) the parser will throw exceptions.
> By setting a format to the text the file can be correctly parsed without 
> unexpected errors.


[jira] [Commented] (TIKA-775) Embed Capabilities

2012-12-13 Thread Ray Gauss II (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531938#comment-13531938
 ] 

Ray Gauss II commented on TIKA-775:
---

According to a few posts on the subject, including one on developerWorks [1], it 
looks like it's more appropriate to reassert the thread's interrupt status with:

{code}
...
} catch (InterruptedException ignore) {
    // Restore the interrupt flag so code further up the stack can still see it
    Thread.currentThread().interrupt();
}
...
{code}

rather than refactoring {{ExternalParser}} and {{ExternalEmbedder}} to re-throw 
it or wrap in a {{TikaException}}.
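
As a rough, self-contained illustration of that pattern (not the actual {{ExternalParser}} or {{ExternalEmbedder}} code), the sketch below blocks on a CountDownLatch as a stand-in for whatever blocking call the real classes make. The class and method names are hypothetical. The point is only that the catch block sets the flag again, so the caller can still observe the interruption even though no exception propagates.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical helper showing how to reassert the interrupt status.
final class InterruptStatusDemo {
    // Waits on the latch; returns true if it was released, false if interrupted.
    static boolean awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
            return true;
        } catch (InterruptedException e) {
            // await() cleared the flag when it threw, so set it again for the caller
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt(); // simulate an interrupt that is already pending
        boolean finished = awaitQuietly(new CountDownLatch(1));
        System.out.println("finished=" + finished
                + ", interrupted=" + Thread.currentThread().isInterrupted());
    }
}
```

If the catch block were left empty, the second value printed would be false and the caller would have no way to tell that an interrupt had been requested.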


I too would prefer {{ExternalEmbedderTest}} to be in core, but I do feel that 
we want to confirm the embedding with a known working parser.  Would anyone 
have an issue with moving {{TXTParser}} and its test into core? There don't 
seem to be any dependency issues when trying it.


[1] http://www.ibm.com/developerworks/java/library/j-jtp05236/index.html



[jira] [Commented] (TIKA-1043) Tika parser v1.2 fails on legacy power point documents

2012-12-13 Thread Nick Burch (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531691#comment-13531691
 ] 

Nick Burch commented on TIKA-1043:
--

Could you post the full stacktrace? That one is missing the interesting bit of 
why the parser broke...

Also, do you have an example file that shows the problem?

> Tika parser v1.2 fails on legacy power point documents
> ------------------------------------------------------
>
> Key: TIKA-1043
> URL: https://issues.apache.org/jira/browse/TIKA-1043
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.2
> Environment: Solr 4.0 on Tomcat 7 with manifoldcf v1.1 dev
> Reporter: David Morana
> Fix For: 1.2
>
>
> I can't index "older" PowerPoint documents.
> I did some research and the current "fix" is to open the legacy ppt and save 
> it as a newer version of PowerPoint.
> I have over 3000 ppt docs in my development environment alone so that's not 
> an option.
> Here's the error in Solr:
> SEVERE: null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@e86b202
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:215)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:244)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:240)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
>     at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:541)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:383)
>     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:243)
>     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:188)
>     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:166)
>     at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:288)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:722)


[jira] [Resolved] (TIKA-990) Mp3Parser extracts wrong number of channels

2012-12-13 Thread Ray Gauss II (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-990.
---

   Resolution: Fixed
Fix Version/s: 1.3

Resolved in r1421584.

> Mp3Parser extracts wrong number of channels
> -------------------------------------------
>
> Key: TIKA-990
> URL: https://issues.apache.org/jira/browse/TIKA-990
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.2
> Reporter: Oliver Heger
> Assignee: Ray Gauss II
> Fix For: 1.3
>
>
> In class {{AudioFrame}} the last two bits of the MPEG frame header are used 
> to determine the number of channels. According to my documentation, this 
> information is encoded in bits 7 and 6.
> I did a cross check with the ID3 tag editor tool ID3-TagIT 
> (http://www.id3-tagit.de/). The unit tests expect that the test MP3 files 
> have 2 channels. However, the tool reports that the files are mono.


[jira] [Commented] (TIKA-990) Mp3Parser extracts wrong number of channels

2012-12-13 Thread Ray Gauss II (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531588#comment-13531588
 ] 

Ray Gauss II commented on TIKA-990:
---

After investigating this:

* Confirmed the test files are in fact mono with the exception of 
{{testMP3lyrics.mp3}}
* Confirmed the MPEG spec does define the channels in bits 7,6
* After creating my own test files confirmed Tika does report the channels 
incorrectly

Committing a fix shortly.
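
For reference, a minimal sketch of reading the channel mode from bits 7 and 6 of the 32-bit MPEG audio frame header. This is not the committed change to {{AudioFrame}}; the class and method names are hypothetical. It relies on the header layout in which the channel mode value 3 means single channel (mono) and the other three values (stereo, joint stereo, dual channel) all carry two channels. The last two bits of the header hold the emphasis field instead.

```java
// Hypothetical helper; illustrates the bit positions only, not Tika's AudioFrame.
final class Mp3ChannelMode {
    // Channel mode is stored in bits 7 and 6 of the 32-bit frame header.
    static int channelMode(int header) {
        return (header >>> 6) & 0x3;
    }

    // Mode 3 is single channel; stereo, joint stereo and dual channel have two.
    static int channelCount(int header) {
        return channelMode(header) == 3 ? 1 : 2;
    }

    // What reading the last two bits yields instead: the emphasis field.
    static int emphasis(int header) {
        return header & 0x3;
    }

    public static void main(String[] args) {
        int monoHeader = 0xFFFB90C4;        // low byte 1100 0100: channel mode 11
        int jointStereoHeader = 0xFFFB9064; // low byte 0110 0100: channel mode 01
        System.out.println(channelCount(monoHeader));
        System.out.println(channelCount(jointStereoHeader));
    }
}
```

Both sample headers have an emphasis field of 0, so reading the last two bits could not tell the mono frame apart from the joint stereo one.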



[jira] [Assigned] (TIKA-990) Mp3Parser extracts wrong number of channels

2012-12-13 Thread Ray Gauss II (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reassigned TIKA-990:
-

Assignee: Ray Gauss II



[jira] [Created] (TIKA-1044) Can't parse Word files with no format set

2012-12-13 Thread Jonas Wilhelmsson (JIRA)

Jonas Wilhelmsson created TIKA-1044:
---

 Summary: Can't parse Word files with no format set
 Key: TIKA-1044
 URL: https://issues.apache.org/jira/browse/TIKA-1044
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0
Reporter: Jonas Wilhelmsson
Priority: Trivial
 Attachments: test2.doc, test.docx

When we were using Solr for indexing we came across this Tika bug.
While parsing a doc or docx file that contains text without any format set 
(format inside Microsoft Word) the parser will throw exceptions.
By setting a format to the text the file can be correctly parsed without 
unexpected errors.


[jira] [Updated] (TIKA-1044) Can't parse Word files with no format set

2012-12-13 Thread Jonas Wilhelmsson (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonas Wilhelmsson updated TIKA-1044:


Attachment: test2.doc
test.docx

Stacktrace while parsing test.docx:
2012-dec-13 15:51:39 org.apache.solr.common.SolrException log
ALLVARLIG: org.apache.solr.common.SolrException
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:233)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:563)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@299629
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:227)
    ... 19 more
Caused by: java.lang.NullPointerException
    at org.apache.tika.parser.microsoft.WordExtractor.buildParagraphTagAndStyle(WordExtractor.java:463)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:108)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:76)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:63)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:97)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:69)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 23 more

Stacktrace while parsing test2.doc:
2012-dec-13 15:51:52 org.apache.solr.common.SolrException log
ALLVARLIG: org.apache.solr.common.SolrException
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:233)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:

[jira] [Created] (TIKA-1043) Tika parser v1.2 fails on legacy power point documents

2012-12-13 Thread David Morana (JIRA)

David Morana created TIKA-1043:
--

 Summary: Tika parser v1.2 fails on legacy power point documents
 Key: TIKA-1043
 URL: https://issues.apache.org/jira/browse/TIKA-1043
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.2
 Environment: Solr 4.0 on Tomcat 7 with manifoldcf v1.1 dev
Reporter: David Morana
 Fix For: 1.2


I can't index "older" PowerPoint documents.
I did some research and the current "fix" is to open the legacy ppt and save it 
as a newer version of PowerPoint.
I have over 3000 ppt docs in my development environment alone so that's not an 
option.
Here's the error in Solr:
SEVERE: null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@e86b202
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:215)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:244)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:240)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:541)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:383)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:243)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:188)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:166)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:288)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)


[jira] [Resolved] (TIKA-1041) Tika 1.2 universalcharset errors

2012-12-13 Thread Jukka Zitting (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1041.
-

   Resolution: Fixed
Fix Version/s: (was: 1.2)
 Assignee: Jukka Zitting

I fixed this in revision 1421141 by catching the NoClassDefFoundError and just 
ignoring the missing functionality when the required dependency is not present. 
A deployment can pass in a ServiceLoader with a custom LoadErrorHandler through 
the ParseContext to log or otherwise handle such dependency issues.
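
A minimal sketch of the general idea, assuming nothing about the actual change in revision 1421141: the code that touches the optional dependency is wrapped so that a NoClassDefFoundError disables just that feature, and a caller-supplied callback decides whether to log or ignore the problem. The OptionalFeature class, the tryLoad method and the BiConsumer callback are hypothetical and only play the role that a LoadErrorHandler plays in Tika. No Tika classes are used here.

```java
import java.util.Optional;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

// Hypothetical helper; shows the catch-and-degrade pattern, not Tika's implementation.
final class OptionalFeature {
    // Runs the factory; if a class it needs is missing, reports the error and returns empty.
    static <T> Optional<T> tryLoad(String name, Supplier<T> factory,
                                   BiConsumer<String, Throwable> onError) {
        try {
            return Optional.of(factory.get());
        } catch (NoClassDefFoundError e) {
            // The dependency is absent at runtime: skip the feature instead of failing
            onError.accept(name, e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Simulate the missing class from the report instead of depending on a real jar
        Optional<String> detector = OptionalFeature.<String>tryLoad(
                "universal charset detection",
                () -> { throw new NoClassDefFoundError("org/mozilla/universalchardet/CharsetListener"); },
                (name, error) -> System.err.println("Skipping " + name + ": " + error));
        System.out.println("detector available: " + detector.isPresent());
    }
}
```

Passing a callback that does nothing reproduces the "just ignore it" behaviour, while a logging callback makes the missing dependency visible to whoever deploys the application.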

> Tika 1.2 universalcharset errors
> --------------------------------
>
> Key: TIKA-1041
> URL: https://issues.apache.org/jira/browse/TIKA-1041
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.2
> Environment: I'm running solr 4.0 with tika 1.2 on tomcat 7.0.8 with 
> manifoldcf v1.1dev 
> Reporter: David Morana
> Assignee: Jukka Zitting
> Fix For: 1.3
>
>
> This is somewhat confusing and frustrating. I successfully crawled Opentext 
> using all of the above. Then I recrawled and it aborted almost immediately.
> It choked on images, so I excluded them for now. 
> But now it's choking on txt files! 
> Sometimes I get this error:
> SEVERE: null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: 
> org/mozilla/universalchardet/CharsetListener
> and sometimes I get this one:
> SEVERE: null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: 
> org/apache/tika/parser/txt/UniversalEncodingListener

