Re: How to exclude a mimetype from being indexed in solr using tika?

2014-03-31 Thread eShard
Thanks for the reply,
I'll investigate the EmbeddedDocumentExtractor.

The Solr community told me it is a Tika issue, and the Tika community told me
it's a Solr issue...
Oh boy... :(



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-exclude-a-mimetype-from-being-indexed-in-solr-using-tika-tp4127767p4128188.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.


How to exclude a mimetype from being indexed in solr using tika?

2014-03-28 Thread eShard
Good afternoon,
I already asked this question in the Solr user forum and didn't get
anywhere.
They suggested I ask the Tika community...
I'm using Solr 4.0 Final.
I have movies "hidden" in zip files that need to be excluded from the index.
I can't filter movies on the crawler because then I would have to exclude
all zip files.

I was told I can have Tika skip the movies, but
the details are escaping me at this point.

How do I exclude a file in the Tika configuration?
I assume it's something I add in the update/extract handler, but I'm not
sure.
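
A minimal sketch of the filtering decision itself, using only the JDK so it can be run without Tika or Solr. It walks the entries of a zip archive and keeps every entry whose name does not end in a movie extension. The class name, the helper methods, and the extension list are all hypothetical. In Tika, a check like this would most likely sit in a DocumentSelector (or a custom EmbeddedDocumentExtractor) handed to the parser through the ParseContext; whether the Solr extract handler lets you plug one in through configuration alone is an assumption that has not been verified here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Tika or Solr.
final class EmbeddedMovieFilter {

    // Assumed list of movie extensions; adjust to what the crawler really meets.
    static final Set<String> MOVIE_EXTENSIONS = new HashSet<>(
            Arrays.asList("avi", "mov", "mp4", "mpg", "mpeg", "wmv", "flv", "mkv"));

    // Returns false for entry names that end in a known movie extension.
    static boolean shouldIndex(String entryName) {
        if (entryName == null) {
            return true;
        }
        int dot = entryName.lastIndexOf('.');
        if (dot < 0 || dot == entryName.length() - 1) {
            return true; // no extension, so nothing to match against
        }
        String ext = entryName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return !MOVIE_EXTENSIONS.contains(ext);
    }

    // Lists the names of the non-directory zip entries that pass shouldIndex.
    static List<String> indexableEntries(byte[] zipBytes) {
        List<String> kept = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && shouldIndex(entry.getName())) {
                    kept.add(entry.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kept;
    }

    // Builds a small in-memory zip so the example needs no files on disk.
    static byte[] sampleZip(String... names) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(name.getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] zip = sampleZip("docs/report.pdf", "movies/clip.MP4", "notes.txt");
        System.out.println(indexableEntries(zip));
    }
}
```

Matching on the entry name rather than the detected mimetype is a deliberate simplification: inside an archive the name is usually the only thing known before the entry is actually parsed.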

Thanks, 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-exclude-a-mimetype-form-being-indexed-in-solr-using-tika-tp4127767.html


Re: how to add more metadata to tika extraction?

2013-02-27 Thread eShard
Ok,
I figured it out. 
I ran tika-app with --gui manually and dropped the RSS feed into it.
Here's the metadata output:

Content-Length: 615913
Content-Type: application/rss+xml
dc:description: This is an IBM C3 Public Files feed generated by a Java application.
dc:title: IBM - C3 Public Files RSS feed
description: This is an IBM C3 Public Files feed generated by a Java application.
title: IBM - C3 Public Files RSS feed

That's not what I was expecting. Where are the items?
The items are in the XML, but Tika isn't showing them...

I tried using it on the original IBM feed, but it failed with SSL errors.
So I saved the feed as an XML file and gave it to Tika, and it had even less
metadata:
Content-Length: 2068565
Content-Type: application/xml
resourceName: c3files-2-6-2013.xml
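
Since the items are in the XML but absent from the metadata above, one way to confirm what each item carries is to read the feed directly. The sketch below does not use Tika at all; it uses the JDK's DOM parser to collect the child elements of every item element into a map. The class name and the sample values are hypothetical, and the field names authoremail and pubdateiso are borrowed from the field list in the original question rather than from the real IBM feed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika, Solr or ManifoldCF.
final class RssItemFields {

    // Returns one map per <item>, keyed by child element name.
    // If a name repeats inside an item, the last value wins.
    static List<Map<String, String>> readItems(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations so untrusted feeds cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));

            List<Map<String, String>> items = new ArrayList<>();
            NodeList itemNodes = doc.getElementsByTagName("item");
            for (int i = 0; i < itemNodes.getLength(); i++) {
                Map<String, String> fields = new LinkedHashMap<>();
                NodeList children = itemNodes.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        fields.put(child.getNodeName(), child.getTextContent().trim());
                    }
                }
                items.add(fields);
            }
            return items;
        } catch (Exception e) {
            throw new IllegalArgumentException("feed could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up feed; only the channel title is taken from the output above.
        String feed =
                "<rss version=\"2.0\"><channel>"
              + "<title>IBM - C3 Public Files RSS feed</title>"
              + "<item><title>First file</title>"
              + "<authoremail>a@example.com</authoremail>"
              + "<pubdateiso>2013-02-06</pubdateiso></item>"
              + "<item><title>Second file</title>"
              + "<authoremail>b@example.com</authoremail></item>"
              + "</channel></rss>";
        for (Map<String, String> item : readItems(feed)) {
            System.out.println(item);
        }
    }
}
```

Running this against the saved feed file shows which per-item element names actually exist, which is the list any Solr field mapping would need to cover.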

Please advise...

Thanks,


--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-add-more-metadata-to-tika-extraction-tp4043417p4043466.html


Re: how to add more metadata to tika extraction?

2013-02-27 Thread eShard
Hi Nick,
Sorry, but can you tell me how to do that exactly?

Thanks for the reply, I greatly appreciate it.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-add-more-metadata-to-tika-extraction-tp4043417p4043456.html


how to add more metadata to tika extraction?

2013-02-27 Thread eShard
Hi,
I didn't know where else to post this, so apologies in advance...
Here's my quandary:
I'm using ManifoldCF v1.1.1 to crawl non-standard (IBM) RSS feeds and custom
RSS feeds.
There's additional metadata in each item that we need to capture.
I added the additional fields to the Solr schema (4.0 Final), but they
are nowhere to be found.
I used Fiddler to confirm that ManifoldCF is indeed sending all the data to
Solr.
I can only assume that Tika is ignoring or removing it.
I turned on the attr_ prefix in solrconfig.xml, but
that didn't work either.

Can anyone tell me how to modify Solr and/or Tika to accept the additional
fields from the feed?
I looked into the tika.config file option, but I couldn't find any examples,
and I found one post that says it's obsolete...
I also tried putting the additional metadata in the content field, but the
XML was stripped out, leaving only the data. So I used a double pipe as a
delimiter, but that had mixed results.
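
For what it's worth, a double-pipe delimiter has a few traps in Java that could explain mixed results: String.split treats its argument as a regular expression, so a bare "||" is not matched literally, and trailing empty values are dropped unless a negative limit is passed. The sketch below packs fields as name=value pairs joined by "||" and unpacks them again. The name=value layout and the class name are hypothetical, since the exact format used in the content field is not given in this message.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper; the real packing format is not known.
final class PipeDelimitedFields {

    static final String DELIMITER = "||";

    // Joins name=value pairs with "||". Any '|' inside a name or value is rejected,
    // because it would make the packed string ambiguous. Names must not contain '='.
    static String pack(Map<String, String> fields) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String name = e.getKey();
            String value = e.getValue();
            if (name.indexOf('|') >= 0 || value.indexOf('|') >= 0 || name.indexOf('=') >= 0) {
                throw new IllegalArgumentException("reserved character in field: " + name);
            }
            if (out.length() > 0) {
                out.append(DELIMITER);
            }
            out.append(name).append('=').append(value);
        }
        return out.toString();
    }

    // Splits on the literal "||" and rebuilds the map in the original order.
    static Map<String, String> unpack(String packed) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (packed.isEmpty()) {
            return fields;
        }
        // Pattern.quote makes the split literal; the limit of -1 keeps trailing empty values.
        for (String token : packed.split(Pattern.quote(DELIMITER), -1)) {
            int eq = token.indexOf('=');
            if (eq < 0) {
                continue; // token without a name, ignore it
            }
            fields.put(token.substring(0, eq), token.substring(eq + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("authoremail", "a@example.com");
        fields.put("comments", "");
        String packed = pack(fields);
        System.out.println(packed);
        System.out.println(unpack(packed));
    }
}
```

Even with these fixes, a packed string inside the content field will still be tokenized by Solr as ordinary text, so this only helps code that post-processes the stored value.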

Here's what my solrconfig.xml extraction handler looks like for the RSS
feed:


  [XML markup stripped by the list archive; only the element values survive]

  content
  solr.title
  solr.name
  link
  pubdateiso
  summary
  comments
  authoremail
  modifier
  modifieremail
  authoremail
  published
  updated
  modified
  created
  last_modified
  attr_
  true
  ignored_
  -MM-dd

Please advise...

Thanks,



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-add-more-metadata-to-tika-extraction-tp4043417.html