Update MSWord Media Filter to use Apache POI (like PPT Filter) and also support
.docx
-------------------------------------------------------------------------------------
Key: DS-1140
URL: https://jira.duraspace.org/browse/DS-1140
Project: DSpace
Issue Type: Improvement
Components: DSpace API
Reporter: Tim Donohue
Fix For: 3.0
The Microsoft Word Media Filter (org.dspace.app.mediafilter.WordFilter) relies
on obsolete third-party software, specifically the "text-mining" tools at:
http://code.google.com/p/text-mining/
Better options now exist, notably Apache POI:
http://poi.apache.org/text-extraction.html
Apache POI can also extract text from docx, xls, xlsx, and even Publisher and
Visio files.
We may even be able to create a single "MSFilter" that extracts text from doc,
docx, ppt, pptx, xls, xlsx, etc., all using POI.
Any volunteers to implement this? It looks like it could be implemented much
like the current PPT Filter (org.dspace.app.mediafilter.PowerPointFilter),
which already uses POI. See also DS-714.
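
To make the single "MSFilter" idea a bit more concrete, here is a rough sketch
of the extension-to-extractor lookup such a filter might start from. The class
name MsFilterSketch and its methods are hypothetical and not part of DSpace.
The extractor class names are the ones POI 3.x ships as far as I know, and
they may differ in other POI releases. The sketch only does the lookup and
deliberately does not call POI, so it runs with the JDK alone.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: choose a POI text extractor by file extension. */
class MsFilterSketch {

    // Extension -> fully qualified POI extractor class name.
    // Assumption: POI 3.x naming; verify against the POI version in use.
    private static final Map<String, String> EXTRACTORS =
            new LinkedHashMap<String, String>();

    static {
        EXTRACTORS.put("doc",  "org.apache.poi.hwpf.extractor.WordExtractor");
        EXTRACTORS.put("docx", "org.apache.poi.xwpf.extractor.XWPFWordExtractor");
        EXTRACTORS.put("ppt",  "org.apache.poi.hslf.extractor.PowerPointExtractor");
        EXTRACTORS.put("pptx", "org.apache.poi.xslf.extractor.XSLFPowerPointExtractor");
        EXTRACTORS.put("xls",  "org.apache.poi.hssf.extractor.ExcelExtractor");
        EXTRACTORS.put("xlsx", "org.apache.poi.xssf.extractor.XSSFExcelExtractor");
    }

    /** Returns the lower-cased extension, or "" when the name has none. */
    static String extensionOf(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return "";
        }
        return filename.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Returns the extractor class name for the file, or null if unsupported. */
    static String extractorFor(String filename) {
        return EXTRACTORS.get(extensionOf(filename));
    }

    /** True when the sketch knows an extractor for this file's extension. */
    static boolean isSupported(String filename) {
        return extractorFor(filename) != null;
    }
}
```

For example, MsFilterSketch.extractorFor("Thesis.DOCX") yields the
XWPFWordExtractor class name, while an unknown extension yields null so the
filter could skip the bitstream. POI also provides
org.apache.poi.extractor.ExtractorFactory, which picks an extractor from the
file content itself and may make a table like this unnecessary.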
_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel