TIKA CharsetDetector not detecting UTF-16BE/UTF-16LE encodings
--------------------------------------------------------------
Key: TIKA-729
URL: https://issues.apache.org/jira/browse/TIKA-729
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.9
Reporter: Abhishek Jain
I came across this bug while trying to convert Unicode files to UTF-16. For
files written in UTF-16BE or UTF-16LE, CharsetDetector reports the encoding as
"ISO-8859-1". The following program reproduces the problem:
{code}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;
import org.xml.sax.SAXException;

public class TikaTextConverter {

    public static void main(String[] args) throws IOException, SAXException,
            TikaException {
        String inputPath = "/tmp/input.csv";

        // Write a small test file encoded as UTF-16LE.
        Writer writer = new OutputStreamWriter(
                new FileOutputStream(inputPath), "UTF-16LE");
        try {
            writer.write("Line1, Some text, Some more text");
        } finally {
            writer.close();
        }

        // Read the file back and ask the detector for all candidate charsets.
        InputStream inputStream = TikaInputStream.get(
                new File(inputPath).toURI().toURL(), new Metadata());
        try {
            CharsetDetector detector = new CharsetDetector();
            detector.setText(inputStream);
            CharsetMatch[] matches = detector.detectAll();

            // Print the name of each candidate charset.
            for (CharsetMatch match : matches) {
                System.out.println(match.getName());
            }
        } finally {
            inputStream.close();
        }
    }
}
{code}
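
One detail that may be relevant, offered as an assumption about the cause and not something confirmed in this report: Java's "UTF-16LE" and "UTF-16BE" encoders write no byte order mark, so the file produced above contains only the raw code units and a detector has to rely on the byte pattern alone. The sketch below uses only the JDK (the class name Utf16BomCheck and the helper hex are hypothetical, introduced for illustration) and prints the bytes each encoder produces for a two-character string.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Tika: shows which Java UTF-16
// encoders emit a byte order mark (BOM) and which do not.
class Utf16BomCheck {

    // Render a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Li";
        // Explicit-endian encoders: no BOM is written.
        System.out.println("UTF-16LE: " + hex(text.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println("UTF-16BE: " + hex(text.getBytes(StandardCharsets.UTF_16BE)));
        // Plain "UTF-16" encoder: writes a big-endian BOM (FE FF) first.
        System.out.println("UTF-16:   " + hex(text.getBytes(StandardCharsets.UTF_16)));
    }
}
```

If the missing BOM is indeed what matters here, writing the test file with "UTF-16" instead of "UTF-16LE" would be a quick way to check whether detection behaves differently when a BOM is present.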