[
https://issues.apache.org/jira/browse/TIKA-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946321#comment-17946321
]
Tilman Hausherr edited comment on TIKA-4398 at 4/22/25 8:02 AM:
----------------------------------------------------------------
Please check the test code.
{code:java}
// Parsers to leave out of the default parser set (audio, video, archives).
List<Class<? extends Parser>> excludeParsers = Arrays.asList(
        MP4Parser.class,
        AudioParser.class,
        Mp3Parser.class,
        MidiParser.class,
        FLVParser.class,
        CompressorParser.class,
        RarParser.class
);
TikaConfig config = TikaConfig.getDefaultConfig();
// DefaultParser loaded via the service loader, minus the excluded parsers.
Parser myParser = new DefaultParser(config.getMediaTypeRegistry(),
        new ServiceLoader(), excludeParsers);
Parser parser = new AutoDetectParser(config.getDetector(), myParser);
// try-with-resources closes the input stream after parsing.
try (InputStream stream = new FileInputStream("output/01.docx")) {
    ContentHandler contentHandler = new BodyContentHandler();
    Metadata meta = new Metadata();
    ParseContext context = new ParseContext();
    // Register the same parser for embedded documents.
    context.set(Parser.class, parser);
    parser.parse(stream, contentHandler, meta, context);
    System.out.println(contentHandler.toString());
    System.out.println(meta.toString());
} catch (Throwable e) {
    e.printStackTrace();
}
{code}
It detects the content type as application/zip.
Metadata:
{noformat}
X-TIKA:Parsed-By=org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By=org.apache.tika.parser.pkg.PackageParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.pkg.PackageParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.xml.DcXMLParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.image.ImageParser
X-TIKA:detectedEncoding=ISO-8859-1
X-TIKA:encodingDetector=UniversalEncodingDetector
Content-Type=application/zip
{noformat}
My original code only limited the parsing of embedded media, but with it I can't get the XML parser.
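A .docx file is itself a ZIP container, so a result of application/zip means the container type was never refined to the OOXML type. As a quick sanity check on the input file, independent of Tika, one can confirm that the archive carries the "[Content_Types].xml" entry that OOXML packages contain. The sketch below uses only the JDK; the class and method names are hypothetical, and it is not a description of how Tika performs detection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, for illustration only; not part of Tika.
public class OoxmlContainerCheck {

    /** Returns true if the ZIP stream contains an entry named "[Content_Types].xml". */
    public static boolean hasContentTypesEntry(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("[Content_Types].xml".equals(entry.getName())) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Builds a small in-memory ZIP with the given entry names (test data only). */
    public static byte[] zipWithEntries(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] docxLike = zipWithEntries("[Content_Types].xml", "word/document.xml");
        byte[] plainZip = zipWithEntries("notes.txt");
        // A package with the marker entry versus an ordinary archive.
        System.out.println(hasContentTypesEntry(new ByteArrayInputStream(docxLike)));
        System.out.println(hasContentTypesEntry(new ByteArrayInputStream(plainZip)));
    }
}
```

To run the same check against the attached file, pass `new FileInputStream("output/01.docx")` to `hasContentTypesEntry`. If it returns true, the file is a regular OOXML package and the application/zip result most likely comes from the detector or parser setup rather than from the file.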
> When extracting a docx file with Tika 3.1.0, the package parser was detected
> instead of the OOXML parser
> --------------------------------------------------------------------------------------------------------
>
> Key: TIKA-4398
> URL: https://issues.apache.org/jira/browse/TIKA-4398
> Project: Tika
> Issue Type: Bug
> Components: tika-core
> Affects Versions: 3.1.0
> Environment: java17
> Reporter: mannixli
> Priority: Major
> Attachments: 01.docx, image-2025-04-16-20-46-07-228.png,
> image-2025-04-22-11-26-09-936.png, image-2025-04-22-11-27-33-655.png,
> image-2025-04-22-11-37-15-401.png
>
>
> Tika 3.0.0 detected the OOXML parser.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)