[
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234276#comment-15234276
]
Nick C commented on TIKA-1513:
------------------------------
I wrote the detector from scratch a couple of months ago because matching on
the 0x03 signature byte alone caused too many false positives. For the parser I
ended up using jdbf, but found some bugs. One was that the parser would error if
inputStream.read(...) returned fewer bytes than required (the code needs to use
something like IOUtils.readFully).
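To illustrate the short-read problem, here is a minimal JDK-only sketch of the read-until-full loop that a readFully-style helper performs. It is not jdbf's code; the class and method names (ReadFullySketch, readFully) are made up for this example.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not taken from jdbf or Tika.
final class ReadFullySketch {

    /**
     * Reads exactly len bytes from the stream, or throws EOFException if the
     * stream ends first. A single InputStream.read call may legally return
     * fewer bytes than requested, so the call has to be repeated.
     */
    static byte[] readFully(InputStream in, int len) throws IOException {
        byte[] buf = new byte[len];
        int off = 0;
        while (off < len) {
            int n = in.read(buf, off, len - off);
            if (n < 0) {
                throw new EOFException("expected " + len + " bytes, got " + off);
            }
            off += n;
        }
        return buf;
    }
}
```

A parser that calls read(...) once and treats a short count as an error will fail on streams that deliver data in small chunks (network or compressed streams, for example), even though the data is all there.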
The logic I used was:
- Validate the signature
- Validate the header last update date (is the month between 1 and 12, and is
the day valid for that month?)
- Validate the header size by dividing by 32 and making sure there aren't more
than 255 fields
- Calculate the file size using the record count, header length and record
length from the header, making sure it's less than 4GB. If I can get the input
stream length without reading the entire stream (TikaInputStream.hasLength or
metadata.content_length), I make sure the calculated size matches (or is within
2 bytes).
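The checks above can be sketched roughly as follows. This is not the actual detector code. It assumes the standard dBASE III header layout (byte 0 signature, bytes 1-3 date as YY MM DD, bytes 4-7 record count, bytes 8-9 header length, bytes 10-11 record length, all little-endian), accepts only 0x03 as the signature, treats the year byte as an offset from 1900, and takes the stream length as a plain long with -1 meaning "unknown". The class and method names are made up for the example.

```java
import java.time.YearMonth;

// Hypothetical sketch of the checks listed above; names and details are assumptions.
final class DbfHeaderCheck {
    static final long MAX_FILE_SIZE = 0xFFFFFFFFL; // just under 4GB

    static int u8(byte[] b, int i) { return b[i] & 0xFF; }
    static int u16le(byte[] b, int i) { return u8(b, i) | (u8(b, i + 1) << 8); }
    static long u32le(byte[] b, int i) { return u16le(b, i) | ((long) u16le(b, i + 2) << 16); }

    /** h holds the first 32 bytes of the file; streamLength is -1 if unknown. */
    static boolean looksLikeDbf(byte[] h, long streamLength) {
        if (h == null || h.length < 32) return false;

        // 1. Signature (assumption: only plain dBASE III, 0x03).
        if (u8(h, 0) != 0x03) return false;

        // 2. Last update date: month 1..12, day valid for that month.
        int year = 1900 + u8(h, 1); // assumption: year stored as offset from 1900
        int month = u8(h, 2);
        int day = u8(h, 3);
        if (month < 1 || month > 12) return false;
        if (day < 1 || day > YearMonth.of(year, month).lengthOfMonth()) return false;

        // 3. Header size: 32-byte header plus 32 bytes per field descriptor.
        int headerLen = u16le(h, 8);
        int fields = (headerLen - 32) / 32; // integer division drops the terminator byte
        if (fields < 1 || fields > 255) return false;

        // 4. Calculated file size must stay under 4GB.
        long recordCount = u32le(h, 4);
        int recordLen = u16le(h, 10);
        long calculated = headerLen + recordCount * recordLen;
        if (calculated > MAX_FILE_SIZE) return false;

        // 5. If the real length is known, it must match within 2 bytes.
        if (streamLength >= 0 && Math.abs(streamLength - calculated) > 2) return false;
        return true;
    }
}
```

The 2-byte tolerance in the last step leaves room for an optional end-of-file marker after the last record.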
I'll put the code up on GitHub tomorrow and put together a list of the jdbf bugs.
> Add mime detection and parsing for dbf files
> --------------------------------------------
>
> Key: TIKA-1513
> URL: https://issues.apache.org/jira/browse/TIKA-1513
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 1.13
>
>
> I just came across an Apache licensed dbf parser that is available on
> [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
> Let's add dbf parsing to Tika.
> Any other recommendations for alternate parsers?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)