How interesting! You know, I did at one point consider that perhaps the fieldname "meta" may be treated specially, but I talked myself out of it. I reasoned that a field name in my local schema should have no bearing on how a plugin such as solr-cell/Tika behaves. I should have tested my hypothesis; even if this phenomenon turns out to be undocumented behavior, I consider myself a victim of my own assumptions.
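For anyone who lands on this thread with the same problem, here is a minimal, untested sketch of the workaround the observation suggests: stop using "meta" as the content field and point fmap.content at a differently named field. The field name "body" is a hypothetical placeholder, not something from the stock schema. The sketch assumes the behavior Jack reports (metadata lands in a schema field named "meta" when one exists) and that, with no "meta" field defined, the metadata falls through to the "ignored_" prefix like any other unknown field.

```xml
<!-- schema.xml fragment: "body" is a hypothetical name replacing the "meta" field -->
<field name="body" type="text_en" indexed="true" stored="true" required="true" />
<!-- catch-all for everything else Tika emits (unchanged from the original schema) -->
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>

<!-- solrconfig.xml fragment: the handler from the original post, with only fmap.content changed -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- send the extracted document body to "body" instead of "meta" -->
    <str name="fmap.content">body</str>
    <str name="lowernames">true</str>
    <!-- unknown fields get this prefix and match the ignored_* dynamicField -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

Whether this fully strips the metadata on 3.5 is an assumption until someone confirms it against the ERH code.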
I am running version 3.5. You may have gotten the multivalue errors due to the way your test schema and/or extracting request handler is laid out (my bad). I am using the "ignored" fieldtype and a dynamicField called "ignored_*" as a catch-all for extraneous fields delivered by Tika.

Thanks for your help! Please keep me posted on any further insights/revelations, and I'll do the same.

On Wed, May 2, 2012 at 12:54 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> I did some testing, and evidently the "meta" field is treated specially
> by the ERH.
>
> I copied the example schema, and added both "meta" and "metax" fields and
> set "fmap.content=metax", and lo and behold only the doc content appears in
> "metax", but all the doc metadata appears in "meta".
>
> Although, I did get 400 errors with Solr complaining that "meta" was not a
> multivalued field. This is with Solr 3.6. What release of Solr are you
> using?
>
> I was not aware of this undocumented feature. I haven't checked the code
> yet.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Joseph Hagerty
> Sent: Wednesday, May 02, 2012 11:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtractRH: How to strip metadata
>
> I do not. I commented out all of the copyFields provided in the default
> schema.xml that ships with 3.5. My schema is rather minimal.
> Here is my fields block, if this helps:
>
> <fields>
>   <field name="cust" type="string" indexed="true" stored="true"
>          required="true" />
>   <field name="asset" type="string" indexed="true" stored="true"
>          required="true" />
>   <field name="ent" type="string" indexed="true" stored="true"
>          required="true" />
>   <field name="meta" type="text_en" indexed="true" stored="true"
>          required="true" />
>   <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
>   <!--field name="modified" type="dateTime" indexed="true"
>          stored="true" required="false" /-->
> </fields>
>
> On Wed, May 2, 2012 at 10:59 AM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> Check to see if you have a CopyField for a wildcard pattern that copies to
>> "meta", which would copy all of the Tika-generated fields to "meta."
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Joseph Hagerty
>> Sent: Wednesday, May 02, 2012 9:56 AM
>> To: solr-user@lucene.apache.org
>> Subject: ExtractRH: How to strip metadata
>>
>> Greetings Solr folk,
>>
>> How can I instruct the extract request handler to ignore metadata/headers
>> etc. when it constructs the "content" of the document I send to it?
>>
>> For example, I created an MS Word document containing just the word
>> "SEARCHWORD" and nothing else.
>> However, when I ship this doc to my solr
>> server, here's what's thrown in the index:
>>
>> <str name="meta">
>> Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
>> stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
>> Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus
>> Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 108600000000 Creation-Date
>> 2008-11-05T20:19:00Z stream_content_type application/octet-stream Character
>> Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
>> Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords
>> Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD
>> </str>
>>
>> All I want is the body of the document, in this case the word
>> "SEARCHWORD."
>>
>> For further reference, here's my extraction handler:
>>
>> <requestHandler name="/update/extract"
>>                 startup="lazy"
>>                 class="solr.extraction.ExtractingRequestHandler" >
>>   <lst name="defaults">
>>     <!-- All the main content goes into "text"... if you need to return
>>          the extracted text or do highlighting, use a stored field. -->
>>     <str name="fmap.content">meta</str>
>>     <str name="lowernames">true</str>
>>     <str name="uprefix">ignored_</str>
>>   </lst>
>> </requestHandler>
>>
>> (Ironically, "meta" is the field in the solr schema to which I'm attempting
>> to extract the body of the document. Don't ask).
>>
>> Thanks in advance for any pointers you can provide me.
>>
>> --
>> - Joe
>
> --
> - Joe

--
- Joe