[ 
https://issues.apache.org/jira/browse/TIKA-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603946#comment-14603946
 ] 

Tim Allison edited comment on TIKA-1601 at 6/30/15 12:51 AM:
-------------------------------------------------------------

Not anywhere near committing, but this is a rough start.

Some TODOs:
* -Figure out how to get non-ASCII text out correctly-
* Figure out how to grab attachments from the accdb file
* Figure out if there's a flag for text cells that contain HTML markup so 
that we can strip the markup [0]
* Figure out if there's a way to prevent Jackcess from trying to open linked 
files [0]
* Add unit tests :)

I used [~centic]'s code [1] to pull ~3k mdb files from CommonCrawl for testing. 
Running the parser over those files was invaluable for identifying a 
potentially serious security issue: the default behavior of the table iterator 
was to try to load linked files.  Our code is now configured to skip linked 
tables.
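
To illustrate the idea of skipping linked tables, here is a minimal, self-contained sketch. It does not use Jackcess: `TableEntry`, `linkedFilePath`, and `LinkedTableFilter` are hypothetical names that only model the distinction between a local table and a table that points at an external file. In Jackcess 2.x the corresponding switch appears to be on the table iterable builder (`setIncludeLinkedTables(false)`), but treat that as an assumption and check the Jackcess API docs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a table entry in an Access database catalog.
// This is NOT a Jackcess type; it only models the local/linked distinction.
final class TableEntry {
    final String name;
    final String linkedFilePath; // null for a local table

    TableEntry(String name, String linkedFilePath) {
        this.name = name;
        this.linkedFilePath = linkedFilePath;
    }

    boolean isLinked() {
        return linkedFilePath != null;
    }
}

final class LinkedTableFilter {
    // Returns the names of local tables only. The linked file path is never
    // read or opened, so nothing outside the document being parsed is touched.
    static List<String> localTableNames(List<TableEntry> catalog) {
        List<String> names = new ArrayList<>();
        for (TableEntry entry : catalog) {
            if (entry.isLinked()) {
                continue; // skip: following the link would open an external file
            }
            names.add(entry.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<TableEntry> catalog = Arrays.asList(
            new TableEntry("Customers", null),
            new TableEntry("Payroll", "\\\\fileserver\\share\\payroll.mdb"),
            new TableEntry("Orders", null));
        System.out.println(localTableNames(catalog)); // prints [Customers, Orders]
    }
}
```

The point of filtering before iteration is that an untrusted file cannot direct the parser to open an arbitrary path or network share.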

Many thanks, again, to James Ahlborn for his patience in answering the 
questions above.


[0]: https://sourceforge.net/p/jackcess/discussion/456474/thread/038878e6/
[1]: https://github.com/centic9/CommonCrawlDocumentDownload



> Integrate Jackcess to handle MSAccess files
> -------------------------------------------
>
>                 Key: TIKA-1601
>                 URL: https://issues.apache.org/jira/browse/TIKA-1601
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>         Attachments: jackcess_nocommit_v1.patch, testAccess2.zip
>
>
> Recently, James Ahlborn, the current maintainer of 
> [Jackcess|http://jackcess.sourceforge.net/], kindly agreed to relicense 
> Jackcess to Apache 2.0.  [~boneill], the CTO at [Health Market Science, a 
> LexisNexis® Company|https://www.healthmarketscience.com/], also agreed with 
> this relicensing and led the charge to obtain all necessary corporate 
> approval to deliver a 
> [CCLA|https://www.apache.org/licenses/cla-corporate.txt] for Jackcess to 
> Apache.  As anyone who has tried to get corporate approval for anything 
> knows, this can sometimes require no small amount of effort.
> If I may speak on behalf of Tika and the larger Apache community, I offer a 
> sincere thanks to James, Brian and the other developers and contributors to 
> Jackcess!!!
> Once the licensing info has been changed in Jackcess and the new release is 
> available in Maven, we can integrate Jackcess into Tika and add a capability 
> to process MSAccess files.
> As a side note, I reached out to the developers and contributors to determine 
> if there were any objections.  I couldn't find addresses for everyone, and 
> not everyone replied, but those who did offered their support to this move. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
