[
https://issues.apache.org/jira/browse/TIKA-4517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030149#comment-18030149
]
Tilman Hausherr commented on TIKA-4517:
---------------------------------------
I have another one:
org.apache.tika.exception.TikaException: Illegal char <"> at index 36: testPST.pst-embed/00000002-putstatic".msg
at org.apache.tika.parser.microsoft.pst.OutlookPSTParser.parse(OutlookPSTParser.java:95)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:168)
at org.apache.tika.pipes.core.PipesServer.parseRecursive(PipesServer.java:717)
at org.apache.tika.pipes.core.PipesServer.parseWithStream(PipesServer.java:604)
at org.apache.tika.pipes.core.PipesServer.parseFromTuple(PipesServer.java:545)
at org.apache.tika.pipes.core.PipesServer.actuallyParse(PipesServer.java:435)
at org.apache.tika.pipes.core.PipesServer.parseOne(PipesServer.java:380)
at org.apache.tika.pipes.core.PipesServer.processRequests(PipesServer.java:249)
at org.apache.tika.pipes.core.PipesServer.main(PipesServer.java:183)
Caused by: java.nio.file.InvalidPathException: Illegal char <"> at index 36: testPST.pst-embed/00000002-putstatic".msg
at java.base/sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:204)
at java.base/sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:175)
at java.base/sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77)
at java.base/sun.nio.fs.WindowsPath.parse(WindowsPath.java:92)
at java.base/sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:203)
at java.base/java.nio.file.Path.resolve(Path.java:513)
at org.apache.tika.pipes.emitter.fs.FileSystemEmitter.emit(FileSystemEmitter.java:150)
at org.apache.tika.pipes.core.extractor.EmittingEmbeddedDocumentBytesHandler.add(EmittingEmbeddedDocumentBytesHandler.java:65)
at org.apache.tika.extractor.RUnpackExtractor.storeEmbeddedBytes(RUnpackExtractor.java:175)
at org.apache.tika.extractor.RUnpackExtractor.parseWithBytes(RUnpackExtractor.java:137)
at org.apache.tika.extractor.RUnpackExtractor.parseEmbedded(RUnpackExtractor.java:93)
at org.apache.tika.parser.microsoft.pst.OutlookPSTParser.parseFolder(OutlookPSTParser.java:120)
at org.apache.tika.parser.microsoft.pst.OutlookPSTParser.parseFolder(OutlookPSTParser.java:132)
at org.apache.tika.parser.microsoft.pst.OutlookPSTParser.parse(OutlookPSTParser.java:90)
... 12 more
Tests run: 40, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 43.80 s <<< FAILURE! -- in org.apache.tika.cli.TikaCLITest
org.apache.tika.cli.TikaCLITest.testPSTRUnpack -- Time elapsed: 5.568 s <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:158)
at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:139)
at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:69)
at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:41)
at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:35)
at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:195)
at org.apache.tika.cli.TikaCLITest.testRecursiveUnpack(TikaCLITest.java:411)
at org.apache.tika.cli.TikaCLITest.testPSTRUnpack(TikaCLITest.java:306)
This happens because " is an illegal character in a Windows filename. I
think the PST file itself contains these "bad" filenames (even though PST
is a Windows format?!).
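
One way to avoid the InvalidPathException would be to sanitize each embedded
file name before it is resolved against the output directory. The following
is only a minimal sketch of that idea, not Tika's actual fix: the class name
EmbeddedNameSanitizer and its sanitize method are hypothetical, and it
assumes the method is applied to a single name segment (for example
00000002-putstatic".msg), not to the whole emit key, because it also
replaces path separators.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: replaces characters that Windows
// does not allow in a file name with an underscore.
final class EmbeddedNameSanitizer {

    // < > : " / \ | ? * and the ASCII control characters 0x00-0x1F
    private static final Pattern ILLEGAL =
            Pattern.compile("[<>:\"/\\\\|?*\\x00-\\x1F]");

    private EmbeddedNameSanitizer() {
    }

    // Sanitizes one file name segment. Path separators are replaced too,
    // so do not pass a full relative path.
    static String sanitize(String name) {
        String cleaned = ILLEGAL.matcher(name).replaceAll("_");
        // Never return an empty name
        return cleaned.isEmpty() ? "_" : cleaned;
    }
}
```

With this sketch, EmbeddedNameSanitizer.sanitize("00000002-putstatic\".msg")
yields 00000002-putstatic_.msg. It does not handle reserved Windows device
names such as CON or NUL, and two different embedded names could map to the
same sanitized name, so a real fix would need to deal with collisions.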
> Improve async cli
> -----------------
>
> Key: TIKA-4517
> URL: https://issues.apache.org/jira/browse/TIKA-4517
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
> Fix For: 4.0.0
>
>
> Improve documentation and handling of file names as non-options.
> Add xml vs text for content extraction.