Re: ExtractRH: How to strip metadata

Jack Krupansky Wed, 02 May 2012 09:54:51 -0700

I did some testing, and evidently the "meta" field is treated specially by the ERH.

I copied the example schema, added both "meta" and "metax" fields, and set "fmap.content=metax", and lo and behold only the doc content appears in "metax", but all the doc metadata appears in "meta".

Although, I did get 400 errors with Solr complaining that "meta" was not a multivalued field. This is with Solr 3.6. What release of Solr are you using?

I was not aware of this undocumented feature. I haven't checked the code yet.
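
For illustration, the test setup described above might look roughly like the sketch below. The field names "meta" and "metax" and the "fmap.content=metax" mapping come from the test itself; the "text_en" type and the multiValued="true" attribute on "meta" are assumptions (the latter inferred only from the 400 error), and the "uprefix" setting assumes an "ignored_*" dynamic field exists in the schema. This is an untested sketch, not a confirmed fix.

```xml
<!-- schema.xml (sketch): "meta" collects the document metadata, "metax" the body.
     multiValued="true" on "meta" is an assumption inferred from the 400 error. -->
<fields>
  <field name="meta"  type="text_en" indexed="true" stored="true" multiValued="true" />
  <field name="metax" type="text_en" indexed="true" stored="true" />
  <!-- Assumed: swallows unmapped Tika fields prefixed by uprefix below. -->
  <dynamicField name="ignored_*" type="ignored" multiValued="true" />
</fields>

<!-- solrconfig.xml (sketch): map the extracted body to "metax" instead of "meta". -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">metax</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

With a setup like this, a query would return the body text from "metax" and leave "meta" to whatever the handler puts there.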


-- Jack Krupansky

-----Original Message----- From: Joseph Hagerty
Sent: Wednesday, May 02, 2012 11:10 AM
To: solr-user@lucene.apache.org
Subject: Re: ExtractRH: How to strip metadata

I do not. I commented out all of the copyFields provided in the default
schema.xml that ships with 3.5. My schema is rather minimal. Here is my
fields block, if this helps:

<fields>
  <field name="cust"  type="string"  indexed="true" stored="true" required="true" />
  <field name="asset" type="string"  indexed="true" stored="true" required="true" />
  <field name="ent"   type="string"  indexed="true" stored="true" required="true" />
  <field name="meta"  type="text_en" indexed="true" stored="true" required="true" />
  <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
  <!--field name="modified" type="dateTime" indexed="true" stored="true" required="false" /-->
</fields>

On Wed, May 2, 2012 at 10:59 AM, Jack Krupansky <j...@basetechnology.com> wrote:

Check to see if you have a copyField with a wildcard source pattern that copies to
"meta", which would copy all of the Tika-generated fields to "meta."

-- Jack Krupansky

-----Original Message----- From: Joseph Hagerty
Sent: Wednesday, May 02, 2012 9:56 AM
To: solr-user@lucene.apache.org
Subject: ExtractRH: How to strip metadata

Greetings Solr folk,

How can I instruct the extract request handler to ignore metadata/headers
etc. when it constructs the "content" of the document I send to it?

For example, I created an MS Word document containing just the word
"SEARCHWORD" and nothing else. However, when I ship this doc to my solr
server, here's what's thrown in the index:

<str name="meta">
Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus
Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 108600000000 Creation-Date
2008-11-05T20:19:00Z stream_content_type application/octet-stream Character
Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords
Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD
</str>

All I want is the body of the document, in this case the word "SEARCHWORD."


For further reference, here's my extraction handler:

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">meta</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

(Ironically, "meta" is the field in the solr schema to which I'm attempting
to extract the body of the document. Don't ask).

Thanks in advance for any pointers you can provide me.

--
- Joe
