[jira] Commented: (SOLR-284) Parsing Rich Document Types

Eric Pugh (JIRA) Mon, 02 Jul 2007 14:31:25 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509683
 ]


Eric Pugh commented on SOLR-284:
--------------------------------

So, I was not attempting to "boil the ocean" and provide the ultimate solution.
Our need was just to take all the raw text of a document and index it in a
single field, and to pass in a bunch of other data fields to be indexed
alongside it.

We are parsing a large number of unstructured documents that may or may not
have common metadata fields (such as author) populated, but fortunately we
don't really need those fields.  Our users aren't searching by author, but by
content.

I think there are only 5 additional libraries, and it may be possible to
remove one of them (poi-scratchpad)...

Yonik also mentioned using Tika as a framework for creating a common interface
to these types of rich documents, but Tika is still in incubation and has no
code in it yet!

I originally had a separate handler for each document type, and that was
really icky, so I condensed them into the RichDocumentRequestHandler.  I could
also merge the CSVRequestHandler into it, by just taking out the logic for
parsing CSV and putting it into a CSVParser.  However, the CSVRequestHandler
has very complex and rich semantics that these unstructured documents don't
really need.
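
To make the single-handler idea concrete, here is a rough sketch of the shape
I mean.  It is not the code in rich.patch: RichDocumentParser,
PlainTextParser, RichDocumentDispatcher and buildFields are hypothetical names
made up for illustration, and the plain-text parser only stands in for real
PDF/Word/Excel parsers.  One dispatcher picks a parser by document type, puts
the extracted raw text into one field, and copies the other passed-in fields
along unchanged.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical interface: turns the raw bytes of one document type into plain text. */
interface RichDocumentParser {
    String extractText(byte[] content) throws IOException;
}

/** Stand-in parser for illustration only; a real one would wrap a PDF or Office library. */
class PlainTextParser implements RichDocumentParser {
    @Override
    public String extractText(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }
}

/** One entry point for all types, instead of one handler per type. */
class RichDocumentDispatcher {
    private final Map<String, RichDocumentParser> parsers = new HashMap<>();

    /** Registers a parser under a case-insensitive type name such as "pdf" or "doc". */
    void register(String type, RichDocumentParser parser) {
        parsers.put(type.toLowerCase(Locale.ROOT), parser);
    }

    /**
     * Builds the field map to index: the caller-supplied fields plus the
     * extracted raw text stored under textField.
     */
    Map<String, String> buildFields(String type, byte[] content, String textField,
                                    Map<String, String> extraFields) throws IOException {
        RichDocumentParser parser = parsers.get(type.toLowerCase(Locale.ROOT));
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for type: " + type);
        }
        Map<String, String> fields = new LinkedHashMap<>(extraFields);
        fields.put(textField, parser.extractText(content));
        return fields;
    }
}
```

With this shape, supporting another format would mean registering one more
parser rather than writing another handler, which is also why CSV could in
principle become just another parser if its richer semantics were not needed.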



> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> I am attaching a patch file with the code changes, and if this looks good, 
> will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
