Re: How to exclude a mimetype from being indexed in solr using tika?
Thanks for the reply. I'll investigate the EmbeddedDocumentExtractor. The solr community told me it is a tika issue, and the tika community told me it's a solr issue... Oh boy... :(

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-exclude-a-mimetype-from-being-indexed-in-solr-using-tika-tp4127767p4128188.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
How to exclude a mimetype from being indexed in solr using tika?
Good afternoon,

I already asked this question in the solr-user forum and didn't get anywhere. They suggested I ask the tika community...

I'm using solr 4.0 Final. I have movies "hidden" in zip files that need to be excluded from the index. I can't filter the movies at the crawler because then I would have to exclude all zip files. I was told I can have tika skip the movies, but the details are escaping me at this point.

How do I exclude a file in the tika configuration? I assume it's something I add to the update/extract handler, but I'm not sure.

Thanks,
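The decision being asked for here (keep the zip, skip only the movies inside it) can be illustrated outside tika with a minimal sketch. This is not tika or solr configuration; it is plain Python using only the standard library, and the names `EXCLUDED_PREFIXES`, `should_index` and `indexable_entries` are hypothetical. It assumes a "movie" is any entry whose type, guessed from the file name, starts with `video/`.

```python
import io
import mimetypes
import zipfile

# Assumption: a "movie" is any entry whose guessed mimetype starts with video/.
EXCLUDED_PREFIXES = ("video/",)


def should_index(entry_name):
    """Return False when the mimetype guessed from the name is excluded."""
    guessed, _ = mimetypes.guess_type(entry_name)
    if guessed is None:
        # Unknown types are kept in this sketch; a stricter policy could drop them.
        return True
    return not guessed.startswith(EXCLUDED_PREFIXES)


def indexable_entries(zip_bytes):
    """Return the names of the zip entries that would be passed on for indexing."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [
            info.filename
            for info in archive.infolist()
            if not info.is_dir() and should_index(info.filename)
        ]
```

Guessing from the file extension is the weak point of this sketch: a movie stored under a misleading name would slip through, whereas a filter that runs after content detection would not have that problem.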
Re: how to add more metadata to tika extraction?
Ok, I figured it out. I manually ran the tika-app --gui and dropped the rss feed into it. Here's the metadata it output:

Content-Length: 615913
Content-Type: application/rss+xml
dc:description: This is an IBM C3 Public Files feed generated by a Java application.
dc:title: IBM - C3 Public Files RSS feed
description: This is an IBM C3 Public Files feed generated by a Java application.
title: IBM - C3 Public Files RSS feed

That's not what I was expecting. Where are the items? The items are in the xml but tika isn't showing them...

I tried using it on the original IBM feed but it failed with SSL errors, so I saved the feed as an XML file and gave it to tika, and it had even less metadata:

Content-Length: 2068565
Content-Type: application/xml
resourceName: c3files-2-6-2013.xml

Please advise...

Thanks,
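To confirm that the per-item fields really are present in a saved feed, independent of what tika reports, the items can be read directly. The following is a minimal sketch using only the Python standard library. The sample feed, its element names (`authoremail`, `modifier`) and the function `item_fields` are made up for illustration and are not the IBM feed's actual structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real feed would be read from the saved XML file instead.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <description>Placeholder channel description.</description>
    <item>
      <title>First file</title>
      <link>http://example.com/files/1</link>
      <authoremail>alice@example.com</authoremail>
    </item>
    <item>
      <title>Second file</title>
      <link>http://example.com/files/2</link>
      <modifier>bob</modifier>
    </item>
  </channel>
</rss>
"""


def item_fields(feed_xml):
    """Return one dict per <item>, mapping each child element name to its text."""
    root = ET.fromstring(feed_xml)
    items = []
    # Note: this matches un-namespaced <item> elements only.
    for item in root.iter("item"):
        fields = {}
        for child in item:
            # Drop a "{namespace}" prefix from child tags, if there is one.
            tag = child.tag.rsplit("}", 1)[-1]
            fields[tag] = (child.text or "").strip()
        items.append(fields)
    return items
```

Calling `item_fields` on the saved file's contents gives a quick list of which custom fields each item carries, which can then be compared against the fields that actually arrive in the index.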
Re: how to add more metadata to tika extraction?
Hi Nick,

Sorry, but can you tell me how to do that exactly? Thanks for the reply, I greatly appreciate it.
how to add more metadata to tika extraction?
Hi,

I didn't know where else to post this, so apologies in advance...

Here's my quandary: I'm using manifoldcf v1.1.1 to crawl non-standard (IBM) RSS feeds and custom RSS feeds. There's additional metadata in each item that we need to capture. I added the additional fields to the Solr schema (4.0 final), but the additional fields are nowhere to be found. I used fiddler to confirm that manifoldcf is indeed sending all the data to solr, so I can only assume that tika is ignoring or removing it. I turned on the attr_ in the solrconfig.xml but that didn't work either.

Can anyone tell me how to modify solr and/or tika to accept the additional fields from the feed? I looked into the tika.config file option but I couldn't find any examples, and I found one post that says it's obsolete...

I also tried putting the additional metadata in the content field, but the xml was stripped out, leaving only the data. So I used a double pipe as a delimiter, but that had mixed results.

Here's what my solrconfig.xml extraction handler looks like for the RSS feed:

content solr.title solr.name link pubdateiso summary comments authoremail modifier modifieremail authoremail published updated modified created last_modified attr_ true ignored_ -MM-dd

Please advise...

Thanks,
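One route by which extra per-item values can reach solr regardless of what tika extracts is the extract handler's `literal.<fieldname>` request parameters, with `uprefix` catching fields the schema does not define. The sketch below only builds such a request URL and sends nothing. The base URL, the document id, the field values and the helper name `build_extract_url` are assumptions for illustration, and whether manifoldcf is set up to send the fields this way is not established by this thread.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_extract_url(base_url, doc_id, extra_fields):
    """Build an extract-handler URL that passes extra_fields as literal.* params."""
    params = [
        ("literal.id", doc_id),
        # Assumption: unknown extracted fields are prefixed with attr_, as in the post.
        ("uprefix", "attr_"),
    ]
    for name, value in sorted(extra_fields.items()):
        # Each literal.<name> must correspond to a field defined in the schema.
        params.append(("literal." + name, value))
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical endpoint and values, for illustration only.
    print(build_extract_url(
        "http://localhost:8983/solr/update/extract",
        "doc-1",
        {"authoremail": "alice@example.com", "modifier": "bob"},
    ))
```

Comparing a URL built this way against the request captured in fiddler is one way to see whether the extra fields are arriving as literal parameters or only inside the document body.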