[ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yonik Seeley updated SOLR-284:
------------------------------

    Attachment: SOLR-284.patch

OK, here's my first crack at cleaning things up a little before release. Changes:
- there were no tests for XML attribute indexing
- capture had no unit tests
- boost had no unit tests
- ignoring unknown fields had no unit test
- metadata prefix had no unit test
- logging ignored fields at the INFO level for each document loaded is too verbose
- removed handling of undeclared fields and let downstream components handle this
- avoid the String concatenation code for single-valued fields when Tika only produces a single value (for performance)
- removed multiple-literal detection handling for single-valued fields - let a downstream component handle it
- map literal values just as one would with generated metadata, since the user may just be supplying the extra metadata; also apply transforms (date formatting currently)
- fixed a bug where null field values were being added (and later dropped by Solr... hence it was never caught)
- avoid catching previously thrown SolrExceptions... let them fly through
- removed some unused code (id generation, etc.)
- added lowernames option to map field names to lowercase/underscores
- switched builderStack from a synchronized Stack to a LinkedList
- fixed a bug that caused content to be appended with no whitespace in between
- made the extracting request handler lazy-loading in the example config
- added ignored_ and attr_ dynamic fields in the example schema

Interface:
{code}
The default field is always "content" - use map to change it to something else
lowernames=true/false          // if true, map names like Content-Type to content_type
map.<fname>=<target_field>
boost.<fname>=<boost>
literal.<fname>=<literal_value>
xpath=<xpath_expr>             // only generate content for the matching xpath expr
extractOnly=true/false         // if true, just return the extracted content
capture=<xml_element_name>     // separate out these elements
captureAttr=<xml_element_name> // separate out the attributes for these elements
uprefix=<prefix>               // unknown field prefix - any unknown fields will be prepended with this value
stream.type
resource.name
{code}

To try to make things more uniform, all fields - whether "content", metadata, attributes, or literals - go through the same process:
1) map to lowercase if lowernames=true
2) apply the map.<fname> rules
3) if the resulting field is unknown, prefix it with uprefix

Hopefully people will agree that this is an improvement in general. I think in the future we'll need more advanced options, especially around dealing with links in HTML and more powerful xpath constructs, but that's for after 1.4 IMO.
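To illustrate the three mapping steps above, here is a minimal sketch in Python. The function name, signature, and the underscore-substitution rule are assumptions for illustration only - they are not Solr's actual API, just the order of operations described: lowernames first, then map rules, then the uprefix fallback for unknown fields.

```python
import re

def map_field_name(name, lowernames=False, field_map=None,
                   known_fields=None, uprefix=None):
    """Sketch of the uniform field-mapping process (illustrative only):
    1) lowercase/underscore the name if lowernames is set,
    2) apply explicit map.<fname> rules,
    3) prefix still-unknown fields with uprefix."""
    if lowernames:
        # e.g. "Content-Type" -> "content_type"
        name = re.sub(r'[^a-z0-9_]', '_', name.lower())
    if field_map and name in field_map:
        name = field_map[name]
    if known_fields is not None and name not in known_fields and uprefix:
        name = uprefix + name
    return name

# e.g. an unknown metadata field picked up by the attr_ dynamic field:
# map_field_name("Author", lowernames=True,
#                known_fields={"title"}, uprefix="attr_") -> "attr_author"
```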
> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch,
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch,
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch,
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch,
> solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip,
> un-hardcode-id.diff
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler
> that supports streaming a PDF, Word, Powerpoint, or Excel document into Solr.
> There is a wiki page with information here:
> http://wiki.apache.org/solr/UpdateRichDocuments

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.