Sebastian Nagel created TIKA-2422:
-------------------------------------
Summary: Improve detection of Graphviz *.dot format
Key: TIKA-2422
URL: https://issues.apache.org/jira/browse/TIKA-2422
Project: Tika
Issue Type: Improvement
Components: detector, mime
Reporter: Sebastian Nagel
Priority: Minor
Detection of Graphviz documents could be improved by adding either
- *.dot as a glob pattern (conflicts with the more frequent MS Word
templates), or
- a magic pattern which catches the [.dot
language|http://www.graphviz.org/content/dot-language] grammar, e.g.
{{^\s*(?:strict\s+)?(?:di)?graph\b}}
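
To illustrate the second option, here is a minimal Python sketch (not Tika code) that applies the regex from the description to the start of a document. The names DOT_MAGIC and looks_like_dot are made up for this illustration, and the sketch assumes the leading bytes have already been decoded to text.

```python
import re

# Regex copied from the issue description: optional leading whitespace,
# an optional "strict" keyword, then "graph" or "digraph" as a whole word.
DOT_MAGIC = re.compile(r"^\s*(?:strict\s+)?(?:di)?graph\b")


def looks_like_dot(head: str) -> bool:
    """Return True if the start of the text matches the proposed pattern."""
    # match() only tries at position 0, so this checks the document head.
    return DOT_MAGIC.match(head) is not None


if __name__ == "__main__":
    samples = [
        "digraph G { a -> b; }",
        "  strict graph {\n  a -- b\n}",
        "graphics are not graphs",
        "// a leading comment\ndigraph G {}",
    ]
    for text in samples:
        print(repr(text[:24]), looks_like_dot(text))
```

The word boundary keeps text such as "graphics" from matching. As written, the pattern would miss a file that starts with a comment before the graph keyword, so a real magic rule might need to allow for that.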
Seen with Common Crawl data (see also discussions on
[user@tika|https://lists.apache.org/thread.html/1e4f4b6c249618a446f2e92f56ef90e6bfa0dfe51ce10197461df3d9@%3Cuser.tika.apache.org%3E]
and
[dev@poi|https://lists.apache.org/thread.html/7e0c25a389a03011eabce81e933f17a6093390138f4890fa77c36a59@%3Cdev.poi.apache.org%3E]):
the web server sends "text/vnd.graphviz" (often wrong) and Tika detects
"application/msword" (sometimes wrong); see this [WARC
file|https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/test/tika_dot_graphviz_msword.warc.gz].