[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509676
 ] 

Ryan McKinley commented on SOLR-284:
------------------------------------

I haven't run this patch, but have a few questions...

What is the *general* approach to extracting a Lucene document (a list of fields) 
from a PDF? Word? PowerPoint?

Is this just access to a few common fields like author, keywords, text, etc.?  
Is this something that realistically would need to be customized for each case?  

Perhaps it makes sense to add a contrib section for this sort of stuff.  It 
seems weird to add 10 library dependencies to the core distribution.  How does 
Nutch handle this?
 


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, or Excel document into 
> Solr.
> I am attaching a patch file with the code changes, and if this looks good, 
> will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
