Proposal for PRT Parser
-----------------------

Key: TIKA-679
URL: https://issues.apache.org/jira/browse/TIKA-679
Project: Tika
Issue Type: Improvement
Components: mime, parser
Affects Versions: 0.9
Reporter: Troy Witthoeft
Priority: Minor

It would be nice if Tika had support for PRT CAD files. A preliminary PRT text extractor has been created.

{code:title=PRTParser.java|borderStyle=solid}
package org.apache.tika.parser.prt;

import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Set;

import org.apache.poi.util.IOUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * Description: PRT (CAD Drawing) parser. This is a very basic parser.
 * It currently extracts text only and sets no metadata.
 */
public class PRTParser implements Parser {

    public static final String PRT_MIME_TYPE = "application/prt";

    private static final Set<MediaType> SUPPORTED_TYPES =
            Collections.singleton(MediaType.application("prt"));

    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
    }

    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        // One XHTML document per file, one <p> per text entry found
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();

        // The two bytes searched for immediately before each text entry
        byte[] prefix = new byte[] {0x01, 0x1F};
        int pos = 0;
        int read;
        // Read one byte at a time until EOF (read() returns -1)
        while ((read = stream.read()) > -1) {
            if (read == prefix[pos]) {
                // The byte matches the next expected prefix byte
                pos++;
                if (pos == prefix.length) {
                    // Found it! Next come the length byte and one unknown byte
                    pos = 0;
                    int length = stream.read();
                    int unknown = stream.read();
                    if (length < 0 || unknown < 0) {
                        break; // file ended inside the entry
                    }
                    byte[] text = new byte[length];
                    if (IOUtils.readFully(stream, text) < length) {
                        break; // file ended inside the text
                    }
                    // Turn it into a string, removing the null termination.
                    // Assumes the text is UTF-8.
                    int end = length;
                    if (end > 0 && text[end - 1] == 0) {
                        end--;
                    }
                    String str = new String(text, 0, end, "UTF-8");
                    xhtml.element("p", str);
                }
            } else {
                // Mismatch: restart, allowing this byte to begin a new prefix
                pos = (read == prefix[0]) ? 1 : 0;
            }
        }
        xhtml.endDocument();
    }

    /**
     * @deprecated This method will be removed in Apache Tika 1.0.
     */
    public void parse(
            InputStream stream, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException, TikaException {
        parse(stream, handler, metadata, new ParseContext());
    }
}
{code}

I am looking for assistance in improving this code. I am in the process of picking apart the PRT file structure. Here are my findings.

The file header contains a MIME magic identifier, the file creation date, and a file description.

The MIME magic can be identified with <match value="0M3C" type="string" offset="8" />

If present, the file creation date follows the identifier. It is in the format YYYYMMDDhhmm and is always at the same address, 0x001E-0x002A (decimal offsets 30-42, the 31st-43rd bytes).

If present, the user-entered file description IMMEDIATELY follows the date. The maximum length is 498 characters. It is always at the same address, 0x002B-0x021C (decimal offsets 43-540, the 44th-541st bytes), and is terminated with [00][01][C8].

The goal is to extract the user-entered text. User text is marked by a prefix of 42 bytes. The newest entries are at the top of the file. The prefix is always marked by the presence of six [33] bytes and [E3][3F], followed by 10 variable bytes, then a byte signifying the length of the user input text + 1, and a null.

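The fixed-offset header fields described above (magic, creation date, description) could be read along the following lines. This is only a sketch under stated assumptions: the class `PrtHeaderSketch` and its method names are hypothetical and not part of Tika, the 12 date characters are assumed to start at 0x1E (the stated range is 13 bytes wide, so the last byte is ignored here), the description is assumed to contain no null bytes so that the first [00] ends it, and UTF-8 is assumed as in the posted parser.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Tika): reads the fixed-offset header fields. */
class PrtHeaderSketch {
    static final int MAGIC_OFFSET = 8;     // from the <match> rule
    static final int DATE_OFFSET = 0x1E;   // start of YYYYMMDDhhmm
    static final int DATE_LENGTH = 12;     // assumption: 12 ASCII digits
    static final int DESC_OFFSET = 0x2B;   // description follows the date
    static final int DESC_MAX = 498;       // maximum description length

    /** True if the bytes "0M3C" appear at offset 8. */
    static boolean hasMagic(byte[] file) {
        byte[] magic = "0M3C".getBytes(StandardCharsets.US_ASCII);
        if (file.length < MAGIC_OFFSET + magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (file[MAGIC_OFFSET + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the 12-digit creation date, or null if it is absent or not all digits. */
    static String readDate(byte[] file) {
        if (file.length < DATE_OFFSET + DATE_LENGTH) {
            return null;
        }
        String s = new String(file, DATE_OFFSET, DATE_LENGTH, StandardCharsets.US_ASCII);
        return s.matches("\\d{12}") ? s : null;
    }

    /** Returns the description up to the first null byte, capped at DESC_MAX bytes. */
    static String readDescription(byte[] file) {
        if (file.length <= DESC_OFFSET) {
            return "";
        }
        int limit = Math.min(file.length, DESC_OFFSET + DESC_MAX);
        int len = 0;
        while (DESC_OFFSET + len < limit && file[DESC_OFFSET + len] != 0) {
            len++;
        }
        return new String(file, DESC_OFFSET, len, StandardCharsets.UTF_8);
    }
}
```

In a parser, the two returned strings could be stored as metadata rather than emitted as body text; which metadata keys to use is left open here.
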
GUIDE
[33][33][33][33][33][33][E3][3F][0#][00][00][0#][00][00][0#][0#][0#][1F][ln][00][USERINPUT TEXT][00][xx]

EXAMPLE
[33][33][33][33][33][33][E3][3F][00][00][00][00][00][00][00][02][01][1F][05][00][54][49][4B][41][00][0B] = TIKA

Any pointers on how to improve the code are appreciated.
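As a starting point, the record layout in the GUIDE could be matched on the full eight-byte marker instead of only [01][1F]. The sketch below works on a byte array held in memory and is checked only against the EXAMPLE record. The class `PrtTextScanSketch` and its members are hypothetical names, UTF-8 is an assumption carried over from the posted parser, and candidates whose length byte is zero or whose following byte is not null are simply skipped.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not part of Tika): finds user text entries by their marker. */
class PrtTextScanSketch {
    // Six 0x33 bytes followed by E3 3F, as in the GUIDE
    static final byte[] MARKER = {0x33, 0x33, 0x33, 0x33, 0x33, 0x33, (byte) 0xE3, 0x3F};
    // Variable bytes between the marker and the length byte (the last is 1F in the GUIDE)
    static final int VARIABLE_BYTES = 10;

    static boolean matchesAt(byte[] data, int offset) {
        for (int i = 0; i < MARKER.length; i++) {
            if (data[offset + i] != MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    static List<String> extract(byte[] data) {
        List<String> entries = new ArrayList<>();
        int i = 0;
        while (i + MARKER.length <= data.length) {
            if (!matchesAt(data, i)) {
                i++;
                continue;
            }
            int lenPos = i + MARKER.length + VARIABLE_BYTES;
            if (lenPos + 1 >= data.length) {
                break; // no room left for the length byte and the null
            }
            int len = data[lenPos] & 0xFF; // text length + 1, read as unsigned
            int textStart = lenPos + 2;    // skip the length byte and the null
            boolean plausible = len > 0
                    && data[lenPos + 1] == 0
                    && textStart + len - 1 <= data.length;
            if (!plausible) {
                i++;
                continue;
            }
            entries.add(new String(data, textStart, len - 1, StandardCharsets.UTF_8));
            i = textStart + len - 1; // resume after the text
        }
        return entries;
    }

    public static void main(String[] args) {
        byte[] example = {
            0x33, 0x33, 0x33, 0x33, 0x33, 0x33, (byte) 0xE3, 0x3F,
            0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02, 0x01, 0x1F,
            0x05, 0x00, 0x54, 0x49, 0x4B, 0x41, 0x00, 0x0B
        };
        System.out.println(extract(example)); // prints [TIKA]
    }
}
```

Matching the longer marker should produce fewer false hits than a two-byte prefix, at the cost of buffering the data. A streaming version would need a small sliding window over the last eight bytes.
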