Sebastian Nagel created TIKA-2422:
-------------------------------------

             Summary: Improve detection of Graphviz *.dot format
                 Key: TIKA-2422
                 URL: https://issues.apache.org/jira/browse/TIKA-2422
             Project: Tika
          Issue Type: Improvement
          Components: detector, mime
            Reporter: Sebastian Nagel
            Priority: Minor


Detection of the Graphviz document format could be improved by adding either
- *.dot as a glob pattern (this conflicts with the more frequent MS Word 
templates), or
- a magic pattern that matches the start of the [DOT 
language|http://www.graphviz.org/content/dot-language] grammar, e.g. 
{{^\s*(?:strict\s+)?(?:di)?graph\b}}
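
To illustrate what the proposed magic pattern would and would not match, here is a minimal sketch in Python. It is not Tika code: the helper name {{looks_like_dot}} and the sample strings are assumptions made up for this example, and only the regular expression itself is taken from the proposal above.

```python
import re

# Regular expression copied verbatim from the proposal above.
DOT_MAGIC = re.compile(r"^\s*(?:strict\s+)?(?:di)?graph\b")


def looks_like_dot(text: str) -> bool:
    """Hypothetical helper: True if the text starts like a DOT graph definition."""
    return DOT_MAGIC.match(text) is not None


# Made-up sample inputs, for illustration only.
samples = [
    "digraph G { a -> b; }",         # directed graph
    "  strict graph {\n  a -- b\n}",  # leading whitespace and 'strict'
    "graphics are not graphs",       # 'graph' is only a prefix here
    "\xd0\xcf\x11\xe0 binary data",  # starts like an OLE2 (MS Word) file
]

for sample in samples:
    print(repr(sample[:25]), looks_like_dot(sample))
```

The trailing {{\b}} keeps a word such as "graphics" from matching. A file that begins with a comment before the graph keyword would not be caught by this pattern as written.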

Seen with Common Crawl data (see also the discussions on 
[user@tika|https://lists.apache.org/thread.html/1e4f4b6c249618a446f2e92f56ef90e6bfa0dfe51ce10197461df3d9@%3Cuser.tika.apache.org%3E]
 and 
[dev@poi|https://lists.apache.org/thread.html/7e0c25a389a03011eabce81e933f17a6093390138f4890fa77c36a59@%3Cdev.poi.apache.org%3E]):
 the web server sends "text/vnd.graphviz" (often wrong) and Tika detects 
"application/msword" (sometimes wrong); see the [WARC 
file|https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/test/tika_dot_graphviz_msword.warc.gz].
 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
