Hi,
I didn't know where else to post this so apologies in advance...
Here's my quandary:
I'm using ManifoldCF v1.1.1 to crawl non-standard (IBM) RSS feeds and custom
RSS feeds.
There's additional metadata in each item that we need to capture.
I added the additional fields to the Solr schema (Solr 4.0 final), but they
are nowhere to be found in the indexed documents.
I used Fiddler to confirm that ManifoldCF is indeed sending all the data to
Solr.
I can only assume that Tika is ignoring or removing it.
I enabled <str name="uprefix">attr_</str> in solrconfig.xml, but
that didn't help either.
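
To make the setup concrete, the schema additions are along these lines. This is an illustrative sketch, not a copy of my schema.xml: the field names come from the feed, while the types and attributes are assumptions based on the stock Solr 4.0 example schema.

```xml
<!-- Illustrative only: field types and attributes are assumptions -->
<field name="link"        type="string"       indexed="true" stored="true"/>
<field name="authoremail" type="string"       indexed="true" stored="true"/>
<field name="summary"     type="text_general" indexed="true" stored="true"/>
<field name="comments"    type="text_general" indexed="true" stored="true"/>

<!-- uprefix=attr_ only helps if a matching dynamic field exists to receive
     the fields that are not otherwise defined in the schema -->
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true"
              multiValued="true"/>
```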

Can anyone tell me how to configure Solr and/or Tika to accept the additional
fields from the feed?
I looked into the tika.config file option, but I couldn't find any examples,
and I found one post that says it's obsolete...
I also tried putting the additional metadata in the content field, but the
XML tags were stripped out, leaving only the data. So I used a double pipe as a
delimiter, but that gave mixed results.

Here's what the extraction handler in my solrconfig.xml looks like for the RSS
feed:
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">content</str>
    <str name="fmap.title">solr.title</str>
    <str name="fmap.name">solr.name</str>
    <str name="link">link</str>
    <str name="pubdateiso">pubdateiso</str>
    <str name="summary">summary</str>
    <str name="description">comments</str>
    <str name="authoremail">authoremail</str>
    <str name="modifier">modifier</str>
    <str name="modifieremail">modifieremail</str>
    <str name="published">published</str>
    <str name="updated">updated</str>
    <str name="modified">modified</str>
    <str name="created">created</str>
    <str name="fmap.Last-Modified">last_modified</str>
    <str name="uprefix">attr_</str>
    <str name="lowernames">true</str>
    <str name="fmap.div">ignored_</str>
  </lst>
  <lst name="date.formats">
    <str>yyyy-MM-dd</str>
  </lst>
</requestHandler>
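
One thing I have not ruled out: as far as I can tell from the ExtractingRequestHandler documentation, only parameters with the fmap. prefix are treated as field mappings, so the bare entries in the handler above (link, summary, description and so on) may not be doing anything. A minimal sketch of what a prefixed mapping would look like, assuming the feed metadata actually reaches the handler under these source names (which I have not verified):

```xml
<lst name="defaults">
  <!-- fmap.SOURCE: rename the incoming field SOURCE to the schema field
       given in the element body -->
  <str name="fmap.description">comments</str>
  <str name="fmap.Last-Modified">last_modified</str>

  <!-- Incoming fields whose names already match a schema field
       (link, summary, authoremail, ...) should need no mapping at all -->

  <!-- Fields not defined in the schema get the attr_ prefix; this needs a
       matching attr_* dynamic field in schema.xml -->
  <str name="uprefix">attr_</str>
  <str name="lowernames">true</str>
</lst>
```

If this reading is right, the only mapping in my list that actually renames anything is description to comments; the rest would rely on the incoming names matching the schema exactly.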

Please advise...

Thanks,



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-add-more-metadata-to-tika-extraction-tp4043417.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
