[jira] [Commented] (TIKA-679) Proposal for PRT Parser

Troy Witthoeft (JIRA) Thu, 14 Jul 2011 14:51:30 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065559#comment-13065559
 ]


Troy Witthoeft commented on TIKA-679:
-------------------------------------


>Quite a few of the view names in your sample files had 5*00 before them, 
>rather than 5*33. Which rule is the blackfeather one incorrectly triggering? 
>And what bit in the text types comment have we got wrong?  For the view names 
>vs text, maybe we should put different classes on them or something like that?

I'm not suggesting any changes to the byte rules on view detection (yet). But, 
I just don't see the payoff in detecting view names. It's tough to pin down the 
prefixes on the last views. I think simply detecting note text, date, 
description, is sufficient.
  
>I suspect we'll need to understand the file format better first!

I've compared my collection of files using www.fairdell.com
The first 8,000 bytes of PRT files are fairly static.  
Bytes don't vary dramatically.  Specifically, the first 8 views are evenly 
spaced, and always in the same location, in every PRT file.  

>see if we can figure out the rules for getting from the start of the file to 
>the first view entry.

For instance, you will always find the text for the first view, "Top View," is 
always at 7031st Byte or 0x01b77.  

> Once we can answer at least some of those, we'll be more along the way of 
> cracking the format's structure!

The Top View is followed by Front, Back, Bottom, Right, Left, Isometric, and 
Axonometric Views which are always in the same location.
No need for byte pattern matching, we know exactly where they are.
Next, as we approach the 8Kb mark, System view may be present multiple times, 
or not at all.  That is were it starts to get dicey.
For instance, ultra complex blackfeather has 240 system views with different 
prefixes!
But we can determine where the System View section ends because it is always 
suffixed by what I call the "ascending byte matrix marker"
It always starts with the same few bytes..

[FF FF 10 00 00 00 11 .. .. .. 12 .. .. .. .. 13 ..]
[.. .. 14 .. .. .. 15 .. .. .. 16 .. .. .. .. 17 ..]
[.. .. 18 .. .. .. 19 .. .. .. 1A .. .. .. 1B .. ]

Immediately Following the "ascending byte matrix" 
is a section dedicated to whichever CAD programs have touched the file.
You should always find the CADKEY identifier "CK_TTFTABLE"

Does this help?
 

> Proposal for PRT Parser
> -----------------------
>
>                 Key: TIKA-679
>                 URL: https://issues.apache.org/jira/browse/TIKA-679
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>            Reporter: Troy Witthoeft
>            Priority: Minor
>              Labels: CAD, Mime, Parser, Prt, Tika
>         Attachments: PRTParser.patch, PRTParser.patch, TikaTest.prt, 
> TikaTest2.prt, TikaTest2.prt.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> It would be nice if Tika had support for prt CAD files.
> A preliminary prt text extractor has been created.
> Any assistance further developing this code is appreciated.
> {code:title=PRTParser.java|borderStyle=solid}
> package org.apache.tika.parser.prt;
> import java.io.BufferedInputStream;
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import java.io.Reader;
> import java.io.UnsupportedEncodingException;
> import java.nio.charset.Charset;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.poi.util.IOUtils;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.XHTMLContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> /**
>  * Description: PRT (CAD Drawing) parser. This is a very basic parser.   
>  * Searches for specific byte prefix, and outputs text from note entities.
>  * Does not support special characters.
>  */
>  
> public class PRTParser implements Parser {
>     private static final Set<MediaType> SUPPORTED_TYPES = 
> Collections.singleton(MediaType.application("prt"));
>     public static final String PRT_MIME_TYPE = "application/prt";
>               
>     public Set<MediaType> getSupportedTypes(ParseContext context) {
>         return SUPPORTED_TYPES;
>         }
>               
>     public void parse(
>               InputStream stream, ContentHandler handler,
>               Metadata metadata, ParseContext context)
>               throws IOException, SAXException, TikaException {
>               XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, 
> metadata);
>               
>               int[] prefix = new int[] {227, 63};                             
> //Looking for a prefix set of bytes {E3, 3F} 
>               int pos = 0;                                                    
>                         //position inside the prefix
>               int read;
>               while( (read = stream.read()) > -1) {                           
>         // stream.read() moves to the next byte, and returns an integer value 
> of the byte.  a value of -1 signals the EOF
>                       if(read == prefix[pos]) {                               
>                                 // is the last byte read the same as the 
> first byte in the prefix?
>                               pos++;                                          
>                                                         
>                                       if(pos == prefix.length) {              
>                                                 //Are we at the last position 
> of the prefix?
>                                               stream.skip(11);                
>                                                                 // skip the 
> 11 bytes of the prefix which can vary.
>                                               int lengthbyte = stream.read(); 
>                                                 // Set the next byte equal to 
> the length of text in the user input field, see PRT schema
>                                               stream.skip(1);                 
>                                                                 
>                                               byte[] text = new 
> byte[lengthbyte];                                             // a new byte 
> array called text is created.  It should contain an array of integer values 
> of the user inputted text.
>                                               IOUtils.readFully(stream, 
> text);                                                
>                                               String str = new String(text, 
> 0, text.length, "Cp437"); // Cp437 turn it into a string, but does not remove 
> null termination, assumes it's found to be MS-DOS Encoding
>                                               str = 
> str.replace("\u03C6","\u00D8");                                   // Note: 
> Substitute CP437's lowercase "phi" for Nordic "O with slash" to represent 
> diameter symbol. 
>                                               metadata.add("Content",str);
>                                               xhtml.startElement("p");        
>                                               xhtml.characters(str);
>                                               xhtml.endElement("p");
>                                               pos = 0;                        
>                                 
>                                       }
>                       } 
>                       else {
>                               //Did not find the prefix. Reset the position 
> counter.
>                               pos = 0;
>                       }
>               }
>       //Reached the end of file
>       //System.out.println("Finished searching the file");    
>       }
>               
>       /**
>     * @deprecated This method will be removed in Apache Tika 1.0.
>     */
>     public void parse(
>                    InputStream stream, ContentHandler handler, Metadata 
> metadata)
>                    throws IOException, SAXException, TikaException {
>                 parse(stream, handler, metadata, new ParseContext());
>     }
> }{code}   
>  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-679) Proposal for PRT Parser

Reply via email to