[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234276#comment-15234276
 ] 

Nick C commented on TIKA-1513:
------------------------------

I wrote the detector from scratch a couple of months ago because matching on 
the 0x03 signature byte alone caused too many false positives. For the parser I 
ended up using jdbf, but I found some bugs. One was that the parser would error 
if inputStream.read(...) returned fewer bytes than requested (the code needs to 
use something like IOUtils.readFully).
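
To illustrate the short-read problem: InputStream.read(byte[], int, int) may 
return fewer bytes than requested without being at end of stream, so a single 
call is not enough. The following is a minimal plain-JDK sketch of the 
"read fully" loop, not the jdbf code or the actual fix; the class and method 
names are hypothetical.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, for illustration only.
final class ReadFullySketch {

    // Keeps calling read() until buf is full, or throws if the stream ends early.
    static void readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            // read() may legally return fewer bytes than asked for
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                throw new EOFException(
                        "stream ended after " + off + " of " + buf.length + " bytes");
            }
            off += n;
        }
    }
}
```

In practice a library routine such as IOUtils.readFully from Commons IO, or 
DataInputStream.readFully from the JDK, does the same job.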

The logic I used was:
- Validate the signature.
- Validate the header's last update date (is the month between 1 and 12, and is 
the day valid for that month?).
- Validate the header size by dividing by 32 and making sure there aren't more 
than 255 fields.
- Calculate the file size using the record count, header length and record 
length from the header, making sure it's less than 4GB. If I can get the input 
stream length without reading the entire stream (TikaInputStream.hasLength or 
metadata.content_length), I make sure the calculated size matches (or is within 
2 bytes).
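
The checks above can be sketched roughly as follows. This is not the detector 
code referred to in this comment. It assumes the commonly documented dBASE III 
header layout (byte 0 signature, bytes 1-3 last update as YY MM DD with the 
year offset from 1900, bytes 4-7 record count, bytes 8-9 header length, bytes 
10-11 record length, all little-endian). The class name, the method name and 
the accepted signature values are illustrative assumptions.

```java
import java.time.YearMonth;

// Hypothetical sketch of the header checks; not the actual Tika detector.
final class DbfHeaderSketch {

    private static final long FOUR_GB = 1L << 32;

    // header: first 32 bytes of the file; streamLength: known length, or -1 if unknown
    static boolean looksLikeDbf(byte[] header, long streamLength) {
        if (header == null || header.length < 32) {
            return false;
        }
        // 1. Signature (assumed set: 0x03 = no memo, 0x83 = with memo)
        int signature = header[0] & 0xFF;
        if (signature != 0x03 && signature != 0x83) {
            return false;
        }
        // 2. Last update date: month in 1..12 and day valid for that month
        int year = 1900 + (header[1] & 0xFF);
        int month = header[2] & 0xFF;
        int day = header[3] & 0xFF;
        if (month < 1 || month > 12) {
            return false;
        }
        if (day < 1 || day > YearMonth.of(year, month).lengthOfMonth()) {
            return false;
        }
        // Little-endian fields from the fixed part of the header
        long recordCount = (header[4] & 0xFFL) | ((header[5] & 0xFFL) << 8)
                | ((header[6] & 0xFFL) << 16) | ((header[7] & 0xFFL) << 24);
        int headerLength = (header[8] & 0xFF) | ((header[9] & 0xFF) << 8);
        int recordLength = (header[10] & 0xFF) | ((header[11] & 0xFF) << 8);
        // 3. Header size: 32 fixed bytes, then 32 bytes per field, at most 255 fields
        int fieldCount = (headerLength - 32) / 32;
        if (fieldCount < 1 || fieldCount > 255) {
            return false;
        }
        // 4. Calculated file size must stay under 4GB
        long expectedSize = headerLength + recordCount * recordLength;
        if (expectedSize >= FOUR_GB) {
            return false;
        }
        // If the stream length is known, allow a difference of up to 2 bytes
        if (streamLength >= 0 && Math.abs(streamLength - expectedSize) > 2) {
            return false;
        }
        return true;
    }
}
```

A caller would pass -1 for streamLength when the length cannot be obtained 
without reading the whole stream, in which case only the header checks apply.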

I'll put the code up on github tomorrow and get a list of the jdbf bugs.

> Add mime detection and parsing for dbf files
> --------------------------------------------
>
>                 Key: TIKA-1513
>                 URL: https://issues.apache.org/jira/browse/TIKA-1513
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>             Fix For: 1.13
>
>
> I just came across an Apache licensed dbf parser that is available on 
> [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
> Let's add dbf parsing to Tika.
> Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
