Pluggable magic header detectors
--------------------------------
Key: TIKA-95
URL: https://issues.apache.org/jira/browse/TIKA-95
Project: Tika
Issue Type: New Feature
Reporter: Jukka Zitting
Priority: Minor
Some file formats like MS Office files or specific XML schemas don't have
simple magic marker bytes that could be used to easily identify the type of the
document. However, it would in many cases be possible to detect such formats
with more complex parsing logic.
Also, there are some external libraries (like Sanselan as mentioned in TIKA-92)
that contain their own magic header rules. Instead of duplicating such rules in
Tika, it would be better if Tika could just invoke the existing external
functionality.
To support these cases Tika should provide a mechanism to plug in custom magic
header detector components in addition to the traditional configured magic
patterns.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.