Re: ExtractRH: How to strip metadata

Joseph Hagerty Wed, 02 May 2012 11:06:02 -0700

How interesting! You know, I did at one point consider that perhaps the
fieldname "meta" may be treated specially, but I talked myself out of it. I
reasoned that a field name in my local schema should have no bearing on how
a plugin such as solr-cell/Tika behaves. I should have tested my
hypothesis; even if this phenomenon turns out to be undocumented behavior,
I consider myself a victim of my own assumptions.
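A minimal sketch of one way to sidestep this, assuming the behavior Jack reports in the quoted test below holds (content mapped to a field not named "meta" receives only the document body). The field name "body" is hypothetical and does not appear in my original config; the other settings are copied from my handler.

```xml
<!-- Sketch only: "body" is a hypothetical field name chosen for illustration. -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Send the extracted document text to "body" instead of "meta". -->
    <str name="fmap.content">body</str>
    <str name="lowernames">true</str>
    <!-- Fields not declared in the schema get this prefix and match ignored_*. -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

For this to work, the schema would also need a field named "body" (for example a stored text_en field) in place of, or alongside, "meta".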


I am running version 3.5. You may have gotten the multivalue errors due to
the way your test schema and/or extracting request handler is laid out (my
bad). I am using the "ignored" fieldtype and a dynamicField called
"ignored_*" as a catch-all for extraneous fields delivered by Tika.

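For reference, a sketch of what that catch-all typically looks like in schema.xml. The fieldtype line is an assumption modeled on the example schema that ships with Solr 3.x, not copied from my file; the dynamicField line matches the fields block quoted further down.

```xml
<!-- Assumed definition, as in the stock example schema: neither indexed nor stored. -->
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true"
           class="solr.StrField" />

<!-- Any field whose name starts with "ignored_" is accepted and silently discarded. -->
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
```

Combined with uprefix=ignored_ in the handler defaults, any Tika field that is not declared in the schema is renamed to ignored_<name> and dropped rather than causing an unknown-field error.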
Thanks for your help! Please keep me posted on any further
insights/revelations, and I'll do the same.

On Wed, May 2, 2012 at 12:54 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> I did some testing, and evidently the "meta" field is treated specially
> by the ERH.
>
> I copied the example schema, and added both "meta" and "metax" fields and
> set "fmap.content=metax", and lo and behold only the doc content appears in
> "metax", but all the doc metadata appears in "meta".
>
> Although, I did get 400 errors with Solr complaining that "meta" was not a
> multivalued field. This is with Solr 3.6. What release of Solr are you
> using?
>
> I was not aware of this undocumented feature. I haven't checked the code
> yet.
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: Joseph Hagerty
> Sent: Wednesday, May 02, 2012 11:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtractRH: How to strip metadata
>
>
> I do not. I commented out all of the copyFields provided in the default
> schema.xml that ships with 3.5. My schema is rather minimal. Here is my
> fields block, if this helps:
>
> <fields>
>  <field name="cust"     type="string"    indexed="true"  stored="true"  required="true"  />
>  <field name="asset"    type="string"    indexed="true"  stored="true"  required="true"  />
>  <field name="ent"      type="string"    indexed="true"  stored="true"  required="true"  />
>  <field name="meta"     type="text_en"   indexed="true"  stored="true"  required="true"  />
>  <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
>  <!--field name="modified"  type="dateTime"  indexed="true"  stored="true"  required="false" /-->
> </fields>
>
>
> On Wed, May 2, 2012 at 10:59 AM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> Check to see if you have a CopyField for a wildcard pattern that copies to
>> "meta", which would copy all of the Tika-generated fields to "meta."
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Joseph Hagerty
>> Sent: Wednesday, May 02, 2012 9:56 AM
>> To: solr-user@lucene.apache.org
>> Subject: ExtractRH: How to strip metadata
>>
>>
>> Greetings Solr folk,
>>
>> How can I instruct the extract request handler to ignore metadata/headers
>> etc. when it constructs the "content" of the document I send to it?
>>
>> For example, I created an MS Word document containing just the word
>> "SEARCHWORD" and nothing else. However, when I ship this doc to my solr
>> server, here's what's thrown in the index:
>>
>> <str name="meta">
>> Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
>> stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
>> Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus
>> Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 108600000000 Creation-Date
>> 2008-11-05T20:19:00Z stream_content_type application/octet-stream Character
>> Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
>> Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords
>> Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD
>> </str>
>>
>> All I want is the body of the document, in this case the word
>> "SEARCHWORD."
>>
>> For further reference, here's my extraction handler:
>>
>> <requestHandler name="/update/extract"
>>                startup="lazy"
>>                class="solr.extraction.ExtractingRequestHandler" >
>>
>>  <lst name="defaults">
>>    <!-- All the main content goes into "text"... if you need to return
>>         the extracted text or do highlighting, use a stored field. -->
>>    <str name="fmap.content">meta</str>
>>    <str name="lowernames">true</str>
>>    <str name="uprefix">ignored_</str>
>>  </lst>
>>  </requestHandler>
>>
>> (Ironically, "meta" is the field in the solr schema to which I'm attempting
>> to extract the body of the document. Don't ask).
>>
>> Thanks in advance for any pointers you can provide me.
>>
>> --
>> - Joe
>>
>>
>
>
> --
> - Joe
>



-- 
- Joe
