Update MSWord Media Filter to use Apache POI (like PPT Filter) and also support 
.docx
-------------------------------------------------------------------------------------

                 Key: DS-1140
                 URL: https://jira.duraspace.org/browse/DS-1140
             Project: DSpace
          Issue Type: Improvement
          Components: DSpace API
            Reporter: Tim Donohue
             Fix For: 3.0


The Microsoft Word Media Filter (org.dspace.app.mediafilter.WordFilter) relies on 
obsolete third-party software, specifically the "text-mining" tools 
at: http://code.google.com/p/text-mining/

However, better options are now available, most notably Apache POI:

http://poi.apache.org/text-extraction.html

Apache POI also has the benefit of being able to extract text from docx, xls, 
xlsx and even Publisher and Visio files.

We may even be able to create a single "MSFilter" that extracts text from doc, 
docx, ppt, pptx, xls, xlsx, etc., all using POI.
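
As a rough illustration of how a single "MSFilter" might decide which family 
of POI extractors to use, the sketch below sniffs the leading bytes of a file. 
Legacy binary formats (doc, ppt, xls) are stored in OLE2 containers, while 
docx, pptx and xlsx are ZIP packages. The class name MsFormatSniffer and its 
Container enum are hypothetical and not part of DSpace or POI; the sketch uses 
only the Java standard library and does not call POI itself.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper (not part of DSpace or POI): guesses an Office file's container type. */
final class MsFormatSniffer {

    /** OLE2 suggests doc/ppt/xls; ZIP suggests docx/pptx/xlsx, but any ZIP file matches. */
    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature at the start of every OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature "PK\3\4" at the start of a ZIP archive.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    private MsFormatSniffer() { }

    /** Classifies a file from its leading bytes. */
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    /**
     * Classifies a stream without consuming it. The stream must support
     * mark/reset (for example a BufferedInputStream) so that an extractor
     * can still read it from the beginning afterwards.
     */
    static Container sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(OLE2_MAGIC.length);
        byte[] header = new byte[OLE2_MAGIC.length];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) {
                break; // stream ended before a full header was read
            }
            read += n;
        }
        in.reset();
        return sniff(Arrays.copyOf(header, read));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A filter built on this idea could hand OLE2 input to POI's binary-format 
extractors (such as org.apache.poi.hwpf.extractor.WordExtractor for doc) and 
ZIP input to the OOXML ones (such as 
org.apache.poi.xwpf.extractor.XWPFWordExtractor for docx). Distinguishing 
Word from PowerPoint or Excel inside each container would still need a 
further check, which this sketch leaves out.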

Any volunteers to implement this?  It looks like we could implement it 
similarly to the current PPT Filter (org.dspace.app.mediafilter.PowerPointFilter), 
which already uses POI.  See also DS-714.

