[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595007#action_12595007
 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

I'm not sure this patch entirely reinvents the wheel, as it does most of the 
heavy lifting with preexisting components, namely PDFBox, POI, and Solr's own 
HTMLStripReader. It also has the advantage of already existing, whereas tying 
Solr to Tika or Aperture would take additional effort.

Tika and Aperture do look really nice, though. The most obvious advantage these 
projects have over this patch is that they can already extract text from more 
file formats, and that their developers will probably continue to add more 
over time. Are you thinking of additional advantages on 
top of this, Grant? Do you have any cool ideas about how Tika/Aperture's 
metadata extraction facilities might be integrated into Solr? Is there a 
potentially interesting interface between Aperture's crawling facilities and 
Solr?

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
