[ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509676 ]

Ryan McKinley commented on SOLR-284:
------------------------------------

I haven't run this patch, but have a few questions...

What is the *general* approach to extract a lucene document (list of fields) from a PDF? Word? Powerpoint? Is this just access to a few common fields like author, keywords, text, etc? Is this something that realistically would need to be custom for each case?

Perhaps it makes sense to add a contrib section for this sort of stuff. It seems weird to add 10 library dependencies to the core distribution. How does nutch handle this?

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into
> Solr.
> I am attaching a patch file with the code changes, and if this looks good,
> will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.