[ 
https://issues.apache.org/jira/browse/JCR-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469546
 ] 

Jukka Zitting commented on JCR-728:
-----------------------------------

I've looked at jmimemagic too, but as you mentioned, it's a bit limited. It's 
also licensed under the LGPL, which makes it a bit troublesome for us.

There's a recent codebase at 
http://hedges.net/archives/2006/11/08/java-shared-mime-info/ that seems pretty 
good, but the code is under the GPL.

I recently talked with some people from Apache Nutch about a project to 
implement the shared mime info standard from freedesktop.org 
(http://www.freedesktop.org/wiki/Standards_2fshared_2dmime_2dinfo_2dspec), and 
apparently someone already had some Apache-licensed code for that, but I haven't 
seen it yet.

I've been planning to propose an implementation project for the mime info 
standard in Apache Labs (http://labs.apache.org/), but if there's more interest 
within the Jackrabbit community we could also start working on it within the 
jackrabbit-text-extractors component.

> Automatic MIME type detection
> -----------------------------
>
>                 Key: JCR-728
>                 URL: https://issues.apache.org/jira/browse/JCR-728
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: indexing
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> Currently only the jcr:mimeType property is used to determine the MIME type 
> and thus the applicable text extractor to use for indexing a document. If the 
> jcr:mimeType property is not available or is set to a generic value like 
> "application/octet-stream", then the indexer could also use some heuristics 
> based on the node name or magic numbers within the binary stream to determine 
> the type of the document.
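
As a rough illustration of the fallback described in the issue, here is a
minimal Java sketch. The class name MimeTypeGuesser, its methods, and the small
tables of signatures and extensions are all hypothetical; this is not
Jackrabbit code and not an implementation of the shared mime info standard. It
only assumes that the caller can supply the declared jcr:mimeType value, the
node name, and the first few bytes of the binary stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper: declared type first, then magic numbers, then node name. */
final class MimeTypeGuesser {

    private static final String GENERIC = "application/octet-stream";

    /** Keeps the declared type unless it is missing or generic. */
    static String guess(String declared, String nodeName, byte[] head) {
        if (declared != null && !declared.isEmpty() && !GENERIC.equals(declared)) {
            return declared;
        }
        String byMagic = fromMagic(head);
        if (byMagic != null) {
            return byMagic;
        }
        String byName = fromName(nodeName);
        return byName != null ? byName : GENERIC;
    }

    /** Checks a few well-known file signatures (illustrative, far from complete). */
    static String fromMagic(byte[] head) {
        if (head == null) {
            return null;
        }
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        if (startsWith(head, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        return null;
    }

    /** Maps a handful of file name extensions to types (illustrative only). */
    static String fromName(String nodeName) {
        if (nodeName == null) {
            return null;
        }
        int dot = nodeName.lastIndexOf('.');
        if (dot < 0 || dot == nodeName.length() - 1) {
            return null;
        }
        switch (nodeName.substring(dot + 1).toLowerCase(Locale.ROOT)) {
            case "pdf": return "application/pdf";
            case "txt": return "text/plain";
            case "htm":
            case "html": return "text/html";
            case "xml": return "application/xml";
            default: return null;
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch the magic number check runs before the name check, on the
assumption that content is a more reliable signal than a file name. A real
detector would load its rules from a proper database rather than hard-coding
them.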

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
