I do not. I commented out all of the copyFields provided in the default schema.xml that ships with 3.5. My schema is rather minimal. Here is my fields block, if this helps:
<fields>
  <field name="cust" type="string" indexed="true" stored="true" required="true" />
  <field name="asset" type="string" indexed="true" stored="true" required="true" />
  <field name="ent" type="string" indexed="true" stored="true" required="true" />
  <field name="meta" type="text_en" indexed="true" stored="true" required="true" />
  <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
  <!--field name="modified" type="dateTime" indexed="true" stored="true" required="false" /-->
</fields>

On Wed, May 2, 2012 at 10:59 AM, Jack Krupansky <j...@basetechnology.com> wrote:

> Check to see if you have a CopyField for a wildcard pattern that copies to
> "meta", which would copy all of the Tika-generated fields to "meta."
>
> -- Jack Krupansky
>
> -----Original Message----- From: Joseph Hagerty
> Sent: Wednesday, May 02, 2012 9:56 AM
> To: solr-user@lucene.apache.org
> Subject: ExtractRH: How to strip metadata
>
> Greetings Solr folk,
>
> How can I instruct the extract request handler to ignore metadata/headers
> etc. when it constructs the "content" of the document I send to it?
>
> For example, I created an MS Word document containing just the word
> "SEARCHWORD" and nothing else. However, when I ship this doc to my solr
> server, here's what's thrown in the index:
>
> <str name="meta">
> Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
> stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
> Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus
> Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 108600000000 Creation-Date
> 2008-11-05T20:19:00Z stream_content_type application/octet-stream Character
> Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
> Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords
> Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD
> </str>
>
> All I want is the body of the document, in this case the word "SEARCHWORD."
>
> For further reference, here's my extraction handler:
>
> <requestHandler name="/update/extract"
>                 startup="lazy"
>                 class="solr.extraction.ExtractingRequestHandler" >
>   <lst name="defaults">
>     <!-- All the main content goes into "text"... if you need to return
>          the extracted text or do highlighting, use a stored field. -->
>     <str name="fmap.content">meta</str>
>     <str name="lowernames">true</str>
>     <str name="uprefix">ignored_</str>
>   </lst>
> </requestHandler>
>
> (Ironically, "meta" is the field in the solr schema to which I'm attempting
> to extract the body of the document. Don't ask).
>
> Thanks in advance for any pointers you can provide me.
>
> --
> - Joe

--
- Joe
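
For reference, the kind of wildcard copyField that Jack asks about would look roughly like the fragment below in schema.xml. This is a hypothetical illustration only: no such directive is present in the schema quoted in this thread, and the source pattern shown is an assumption.

```xml
<!-- Hypothetical example, not taken from the schema in this thread:
     a wildcard copyField that copies every incoming field, including
     the Tika-generated metadata fields, into "meta". -->
<copyField source="*" dest="meta"/>
```

A copyField like this is declared at the schema level, outside the fields block, which is why the fields block alone cannot rule it out.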