[ https://issues.apache.org/jira/browse/TIKA-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661997#action_12661997 ]
Andrzej Rusin commented on TIKA-154:
------------------------------------
I implemented a simple, perhaps somewhat naive, but working check for
text files:
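import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

import org.apache.oro.text.perl.Perl5Util;

// Returns true if the file content is consistent with the given MIME type.
public boolean isProperFile(File file, String mimeTypeName) throws IOException {
    // we check only text types here
    if (!mimeTypeName.startsWith("text"))
        return true;
    byte[] data = getFileSample(file);
    if (data == null)
        // empty file, can assume it is text
        return true;
    String s = new String(data, "UTF-8");
    // the sample is accepted only if no character falls outside
    // the ASCII and whitespace classes
    Perl5Util util = new Perl5Util();
    return !util.match("/[^[:ascii:][:space:]]/", s);
}

// Reads up to SAMPLE_SIZE bytes from the start of the file;
// returns null if the file is empty.
protected byte[] getFileSample(File file) throws IOException {
    byte[] data = new byte[SAMPLE_SIZE];
    FileInputStream fs = null;
    try {
        fs = new FileInputStream(file);
        int read = fs.read(data);
        if (read < 0)
            return null;
        data = Arrays.copyOfRange(data, 0, read);
    } finally {
        if (fs != null)
            fs.close();
    }
    return data;
}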
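One limitation of the check above, as far as I can tell, is that ASCII control characters such as NUL still belong to the [:ascii:] class, so a binary sample made up only of 7-bit bytes would pass. A minimal sketch of a byte-level variant follows. It assumes that any control byte other than tab, LF, CR and form feed marks the sample as binary, and, unlike the regex above, it lets bytes >= 0x80 through so that UTF-8 text is not rejected. The names TextSampleCheck and looksLikeText are hypothetical and not part of Tika.

```java
final class TextSampleCheck {
    private TextSampleCheck() {}

    /**
     * Returns true when the sample contains no control bytes other than
     * tab, line feed, carriage return and form feed.
     * Assumption: such control bytes do not occur in ordinary text files.
     */
    static boolean looksLikeText(byte[] sample) {
        for (byte b : sample) {
            int c = b & 0xFF; // treat the byte as unsigned
            boolean whitespace = c == '\t' || c == '\n' || c == '\r' || c == '\f';
            if ((c < 0x20 && !whitespace) || c == 0x7F) {
                return false; // control byte found: assume binary
            }
        }
        return true; // an empty sample counts as text, as in the check above
    }
}
```

With this sketch, a ZIP local file header (the bytes 'P', 'K', 0x03, 0x04) would be reported as binary because of the two control bytes after PK, while a sample of plain ASCII or UTF-8 text would be accepted.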
> Better detection of plain text versus binary formats with a text header
> -----------------------------------------------------------------------
>
> Key: TIKA-154
> URL: https://issues.apache.org/jira/browse/TIKA-154
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Reporter: Jukka Zitting
> Priority: Minor
>
> Antoni Mylka noted on the mailing list:
> Many binary formats begin with magic byte sequences composed of ASCII
> characters, e.g.
> ZIP files begin with PK
> PDFs begin with %PDF-
> CHM help files begin with ITSF
> etc.
> Tika should do a better job of detecting such cases.
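The cases listed in the issue description can be illustrated with a small prefix check. This is only a sketch under stated assumptions: the prefix list holds just the three examples quoted above, and the names AsciiMagicCheck and startsWithKnownMagic are hypothetical, not existing Tika API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class AsciiMagicCheck {
    // Only the prefixes quoted in the issue text; a real detector
    // would need a much longer list.
    private static final List<String> MAGIC_PREFIXES =
            Arrays.asList("PK", "%PDF-", "ITSF");

    private AsciiMagicCheck() {}

    /** Returns true if the sample starts with one of the known ASCII magic prefixes. */
    static boolean startsWithKnownMagic(byte[] sample) {
        for (String magic : MAGIC_PREFIXES) {
            byte[] prefix = magic.getBytes(StandardCharsets.US_ASCII);
            // Compare only the leading bytes of the sample against the prefix.
            if (sample.length >= prefix.length
                    && Arrays.equals(Arrays.copyOf(sample, prefix.length), prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

A prefix match alone cannot tell a ZIP archive from a text file that happens to start with the letters PK, so a check like this would probably have to be combined with a look at the bytes that follow the prefix.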