[ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509676 ]
Ryan McKinley commented on SOLR-284:
------------------------------------
I haven't run this patch, but have a few questions...
What is the *general* approach to extracting a Lucene document (a list of
fields) from a PDF? Word? PowerPoint?
Is this just access to a few common fields like author, keywords, text, etc.?
Is this something that realistically would need to be customized for each case?
Perhaps it makes sense to add a contrib section for this sort of stuff. It
seems weird to add 10 library dependencies to the core distribution. How does
Nutch handle this?
> Parsing Rich Document Types
> ---------------------------
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
> Issue Type: New Feature
> Components: update
> Affects Versions: 1.3
> Reporter: Eric Pugh
> Fix For: 1.3
>
> Attachments: rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler
> that supports streaming a PDF, Word, PowerPoint, or Excel document into
> Solr.
> I am attaching a patch file with the code changes, and if this looks good,
> will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>