Re: ExtractRH: How to strip metadata

Joseph Hagerty Wed, 02 May 2012 08:11:20 -0700

I do not. I commented out all of the copyFields provided in the default
schema.xml that ships with 3.5. My schema is rather minimal. Here is my
fields block, if this helps:


 <fields>
   <field name="cust"  type="string"  indexed="true" stored="true" required="true" />
   <field name="asset" type="string"  indexed="true" stored="true" required="true" />
   <field name="ent"   type="string"  indexed="true" stored="true" required="true" />
   <field name="meta"  type="text_en" indexed="true" stored="true" required="true" />
   <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
   <!--field name="modified" type="dateTime" indexed="true" stored="true" required="false" /-->
 </fields>
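
For comparison, the kind of wildcard copyField Jack describes would look roughly like the sketch below. The "*" source pattern is only an assumed illustration; nothing like it is active in my schema.xml.

```xml
<!-- Illustrative only: a catch-all copyField like this would copy every
     incoming field, including the Tika-generated metadata fields, into "meta".
     The "*" source pattern is an assumption, not taken from my schema. -->
<copyField source="*" dest="meta"/>
```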


On Wed, May 2, 2012 at 10:59 AM, Jack Krupansky <j...@basetechnology.com> wrote:

> Check to see if you have a CopyField for a wildcard pattern that copies to
> "meta", which would copy all of the Tika-generated fields to "meta."
>
> -- Jack Krupansky
>
> -----Original Message----- From: Joseph Hagerty
> Sent: Wednesday, May 02, 2012 9:56 AM
> To: solr-user@lucene.apache.org
> Subject: ExtractRH: How to strip metadata
>
>
> Greetings Solr folk,
>
> How can I instruct the extract request handler to ignore metadata/headers
> etc. when it constructs the "content" of the document I send to it?
>
> For example, I created an MS Word document containing just the word
> "SEARCHWORD" and nothing else. However, when I ship this doc to my solr
> server, here's what's thrown in the index:
>
> <str name="meta">
> Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
> stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
> Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus
> Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 108600000000 Creation-Date
> 2008-11-05T20:19:00Z stream_content_type application/octet-stream Character
> Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
> Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords
> Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD
> </str>
>
> All I want is the body of the document, in this case the word "SEARCHWORD."
>
> For further reference, here's my extraction handler:
>
> <requestHandler name="/update/extract"
>                 startup="lazy"
>                 class="solr.extraction.ExtractingRequestHandler" >
>   <lst name="defaults">
>     <!-- All the main content goes into "text"... if you need to return
>          the extracted text or do highlighting, use a stored field. -->
>     <str name="fmap.content">meta</str>
>     <str name="lowernames">true</str>
>     <str name="uprefix">ignored_</str>
>   </lst>
>  </requestHandler>
>
> (Ironically, "meta" is the field in the solr schema to which I'm attempting
> to extract the body of the document. Don't ask).
>
> Thanks in advance for any pointers you can provide me.
>
> --
> - Joe
>



-- 
- Joe
