[ 
https://jira.duraspace.org/browse/DS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=24416#comment-24416
 ] 

Richard Rodgers commented on DS-1140:
-------------------------------------

There is already a reimplementation of text extraction in a curation task 
suite. It uses the Apache Tika framework, which updates all the extractor 
libraries and adds support for dozens of new formats (OpenDocument, etc.).
See the GitHub project: 
https://github.com/richardrodgers/ctask/tree/master/mediafilter 
and Tika: 
http://tika.apache.org/
So I guess that's volunteering...
                
> Update MSWord Media Filter to use Apache POI (like PPT Filter) and also 
> support .docx
> -------------------------------------------------------------------------------------
>
>                 Key: DS-1140
>                 URL: https://jira.duraspace.org/browse/DS-1140
>             Project: DSpace
>          Issue Type: Improvement
>          Components: DSpace API
>            Reporter: Tim Donohue
>             Fix For: 3.0
>
>
> The Microsoft Word Media Filter (org.dspace.app.mediafilter.WordFilter) uses 
> outdated, obsolete third party software, specifically the "text-mining" tools 
> at: http://code.google.com/p/text-mining/
> However, there are now better options out there, especially Apache POI.
> http://poi.apache.org/text-extraction.html
> Apache POI also has the benefit of being able to extract text from docx, xls, 
> xlsx and even Publisher and Visio files.
> We may even be able to create a single "MSFilter" which can just extract doc, 
> docx, ppt, pptx, xls, xlsx, etc. all using POI.
> Any volunteers to implement?  Looks like we should be able to implement it 
> similar to the current PPT Filter 
> (org.dspace.app.mediafilter.PowerPointFilter) which already uses POI.  See 
> also DS-714.
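
For context on why .docx needs a different extractor than legacy .doc: a 
.docx file is a ZIP container, and the main body text lives in the 
word/document.xml entry. The sketch below is only an illustration of that 
container format using the Java standard library. It is not POI, not Tika, 
and not the DSpace filter; the class name DocxTextSketch and its method names 
are hypothetical. A real filter would presumably delegate to POI or Tika, 
which also handle headers, footnotes, tables, and text boxes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; not part of DSpace, POI, or Tika.
final class DocxTextSketch {

    // XML namespace used by WordprocessingML elements such as w:p and w:t.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Reads the ZIP stream, finds word/document.xml, and returns its text
    // with one line per paragraph.
    static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // readAllBytes stops at the end of the current entry.
                    return textOf(zip.readAllBytes());
                }
            }
        }
        throw new IOException("no word/document.xml entry found");
    }

    // Concatenates the w:t text runs of each w:p paragraph.
    // Assumption: paragraphs are not nested (text boxes would be duplicated).
    static String textOf(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations to avoid external-entity attacks.
        factory.setFeature(
            "http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml));

        StringBuilder out = new StringBuilder();
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element paragraph = (Element) paragraphs.item(i);
            NodeList runs = paragraph.getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                out.append(runs.item(j).getTextContent());
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

Calling DocxTextSketch.extractText(new FileInputStream("example.docx")) on a 
file of your own would return the body text, one paragraph per line. Legacy 
.doc files use a binary OLE2 container instead, so this approach does not 
apply to them.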

--
This message is automatically generated by JIRA.

        

_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel
