[jira] Commented: (SOLR-284) Parsing Rich Document Types

Erik Hatcher (JIRA) Mon, 24 Nov 2008 17:09:41 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650431#action_12650431
 ]


Erik Hatcher commented on SOLR-284:
-----------------------------------

bq. I'm not familiar with the state of the patch, but I'm assuming that (by 
default) all of the metadata fields produced by Tika have a common naming 
convention, either a common prefix or a common suffix. In that case people 
can always make a dynamicField declaration to ignore all metadata fields not 
already explicitly declared.

Tika doesn't need to do this explicitly: you know that all fields coming out 
of your call to the Tika API will be Tika fields.  So in Solar Cell (I'm on 
board with that nickname, Grant; now you're catching on :) we could map all 
Tika output fields to tika_*, where * is the field name Tika outputs.  Field 
name mapping would then override this default, say tika_title mapped to 
"title".  Just some off-the-cuff thoughts.
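
To make the idea concrete, here is a rough sketch of the default-prefix plus 
override mapping described above. It is illustration only, not code from the 
patch: the function name map_tika_fields, its parameters, and the sample 
field names and values are all hypothetical.

```python
def map_tika_fields(metadata, overrides=None, prefix="tika_"):
    """Rename Tika metadata fields to prefixed Solr field names.

    Hypothetical helper: every field gets the default name prefix + name,
    unless `overrides` maps that default name to something else.
    """
    overrides = overrides or {}
    mapped = {}
    for name, value in metadata.items():
        default_name = prefix + name
        # An explicit mapping wins over the default prefixed name.
        solr_name = overrides.get(default_name, default_name)
        mapped[solr_name] = value
    return mapped


# Made-up metadata standing in for whatever Tika returns for a document.
sample = {"title": "Some Title", "creator": "Some Author"}
fields = map_tika_fields(sample, overrides={"tika_title": "title"})
```

With that in place, anything not explicitly mapped keeps its tika_ prefix, so 
a single dynamicField declaration on tika_* could catch (or ignore) the rest.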

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, 
> SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, 
> test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
