Chris Hostetter wrote:

: The design issue for this is to be clear about the schema and how
: documents are mapped into the schema. If all document types are
: mapped into the same schema, then one type of query will work
: for all. If the documents have different schemas (in the search
: index), then the query needs an expansion specific to each
: document type.

Right, the only way to provide a general purpose solution is to make sure
any out of the box "UpdateParsers" (using the interface names from my
previous email) can be configured in the solrconfig.xml to map the native
concepts in the document format to user defined schema fields.

(people writing their own custom UpdateParsers could always hardcode
their schema fields)

I don't know anything about PDF structure

http://en.wikipedia.org/wiki/Extensible_Metadata_Platform
http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf

but using your RFC-2822 email
as an example, the configuration for an Rfc2822UpdateParser would need to
be able to specify which Headers map to which fields, and what to do with
body text -- in theory, it could also be configured with references to
other UpdateParser instances for dealing with multi-part MIME messages.
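
As a rough illustration of the header-to-field mapping described above, here is a minimal Python sketch using the standard `email` module. The mapping table, the field names (`author`, `title`, `date`, `text`) and the function name are all hypothetical; in the proposal they would come from configuration rather than being hardcoded, and multi-part messages would be handed on to other parsers instead of being skipped as they are here.

```python
from email import message_from_string

# Hypothetical mapping from RFC 2822 header names to schema field names.
# In the proposal this would be read from configuration, not hardcoded.
HEADER_TO_FIELD = {
    "From": "author",
    "Subject": "title",
    "Date": "date",
}
BODY_FIELD = "text"  # hypothetical schema field for the message body


def parse_rfc2822(raw, header_to_field=HEADER_TO_FIELD, body_field=BODY_FIELD):
    """Return a list of (field, value) pairs for one message."""
    msg = message_from_string(raw)
    fields = []
    for header, field in header_to_field.items():
        # A header may occur more than once; emit one field per occurrence.
        for value in msg.get_all(header, []):
            fields.append((field, value))
    # Multi-part bodies are out of scope for this sketch and are skipped.
    if not msg.is_multipart():
        fields.append((body_field, msg.get_payload()))
    return fields


raw = "From: alice@example.com\nSubject: Hello\n\nBody text here.\n"
fields = parse_rfc2822(raw)
```

Headers that are absent from the message (here, `Date`) simply produce no field.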

There are two cases I can think of:

1. The document is already decomposed into fields before the insert/update, but one or more of the fields requires special handling. For example, when indexing source code you could get the author, date, revision, etc. from the SCMS, but you might want to process the code itself to extract identifiers and ignore keywords. You might want different handlers for different languages, but with the resulting tokens all stored in the same field, irrespective of language.

2. The document contains both metadata and content. PDF is a good example of such a document type.

You therefore need to be able to specify two types of preprocessing - either at the whole-document level, or at the individual field level. And for both of these you'd need to be able to specify the mapping between the data/metadata in the source document and the corresponding Solr schema fields. I'm not sure if you'd want this in the solrconfig.xml file or in the indexing request itself. Doing it in solrconfig.xml means you could change the disposition of the indexed data without changing the clients submitting the content.

That was the reasoning behind my initial suggestion:

| Extend the <doc> and <field> element with the following attributes:
|
| mime-type Mime type of the document, e.g. application/pdf, text/html
| and so on.
|
| encoding Encoding of the document, with base64 being the standard
| implementation.
|
| href The URL of any documents that can be accessed over HTTP, instead
| of embedding them in the indexing request.  The indexer would fetch
| the document using the specified URL.
|
| There would then be entries in the configuration file that map each
| MIME type to a handler that is capable of dealing with that document
| type.
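
To make the MIME-type-to-handler idea concrete, here is a minimal Python sketch of such a dispatch table. It is only an illustration under assumptions: the handler functions, the `HANDLERS` table and `extract_text` are hypothetical names, the table stands in for whatever the configuration file would declare, and only the `base64` encoding from the proposal is handled.

```python
import base64
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the character data of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def handle_plain(data):
    return data.decode("utf-8")


def handle_html(data):
    parser = _TextExtractor()
    parser.feed(data.decode("utf-8"))
    parser.close()
    return "".join(parser.chunks)


# Hypothetical stand-in for the configuration entries that map
# each MIME type to a handler.
HANDLERS = {
    "text/plain": handle_plain,
    "text/html": handle_html,
}


def extract_text(content, mime_type, encoding=None):
    """Decode the content if needed, then dispatch on its MIME type."""
    if encoding == "base64":
        data = base64.b64decode(content)
    else:
        data = content.encode("utf-8")
    if mime_type not in HANDLERS:
        raise ValueError("no handler configured for " + mime_type)
    return HANDLERS[mime_type](data)


text = extract_text("<p>Hello <b>world</b></p>", "text/html")
```

Supporting a new document type would then mean adding one entry to the table, with no change to the clients.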

So for case 1 where the source is locally accessible you might have something like this:

<add>
  <doc>
    <field name="author">Alan Burlison</field>
    <field name="revision">1.2</field>
    <field name="date">08-Jan-2007</field>
    <field name="source" mime-type="text/java"
      href="file:///source/org/apache/foo/bar.java">
    </field>
  </doc>
</add>
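
A field-level handler for the `source` field in this example might look roughly like the Python sketch below, which pulls identifiers out of Java source and drops keywords. It is an assumption-laden illustration: the keyword list is deliberately partial, the function name is hypothetical, and a real handler would also need to skip comments and string literals.

```python
import re

# Partial list of Java keywords, for illustration only.
JAVA_KEYWORDS = {
    "public", "private", "protected", "class", "static", "void",
    "int", "return", "new", "if", "else", "for", "while",
    "import", "package",
}

# Java identifiers: a letter, underscore or dollar sign, followed by
# letters, digits, underscores or dollar signs (ASCII only here).
IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def extract_identifiers(source):
    """Return the identifiers in the source, in order, without keywords."""
    return [tok for tok in IDENTIFIER.findall(source)
            if tok not in JAVA_KEYWORDS]


tokens = extract_identifiers("public class Bar { private int count; }")
```

Handlers for other languages could differ only in their keyword list and token pattern while still writing their tokens to the same field.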

And for case 2 where the file can't be directly accessed you might have something like this:

<add>
  <doc encoding="base64" mime-type="application/pdf">
[base64-encoded version of the PDF file]
  </doc>
</add>
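
On the client side, building a request like the one above could be sketched in Python as follows. Note that the `encoding` and `mime-type` attributes are the extension proposed in this message, not something the update format is known to accept, and `build_add_request` is a hypothetical helper name.

```python
import base64
import xml.etree.ElementTree as ET


def build_add_request(data, mime_type):
    """Wrap raw document bytes in an <add><doc> request, base64-encoded."""
    add = ET.Element("add")
    # "encoding" and "mime-type" are the proposed (hypothetical) attributes.
    doc = ET.SubElement(add, "doc",
                        {"encoding": "base64", "mime-type": mime_type})
    doc.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(add, encoding="unicode")


# Placeholder bytes standing in for the contents of a real PDF file.
request = build_add_request(b"%PDF-1.4 ...", "application/pdf")
```

Base64 keeps the binary content safe inside the XML, at the cost of roughly a third more bytes on the wire.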

--
Alan Burlison
--
