Re: SolrJ/Tika custom indexer not indexing CERTAIN .doc text?

Alexandre Rafalovitch Mon, 27 Jul 2015 12:14:01 -0700

Thank you for the update.

The MS Word format changed significantly from .doc to .docx, so I
suspect the two go through different parsers. I would not be surprised
if the old binary-format parser missed something exotic in the
documents (e.g. content of text boxes or frames).
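
If it helps anyone triage a batch of files before indexing, here is a
minimal sketch (plain JDK, no Tika calls) that tells the two container
formats apart by their leading bytes: old binary .doc files are OLE2
compound files, while .docx files are ZIP archives. The class and
method names (WordContainerSniffer, sniff) are made up for
illustration and are not part of Tika or SolrJ. Note that it only
identifies the container, not whether the file really is a Word
document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Tika or SolrJ.
class WordContainerSniffer {

    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature at the start of every OLE2 compound file (old .doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature at the start of a normal ZIP archive (.docx etc.).
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // Classify a file from its first few bytes.
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    // Read up to 8 bytes from the file and classify them.
    static Container sniff(String path) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            byte[] buf = new byte[8];
            int n = in.readNBytes(buf, 0, buf.length);
            return sniff(Arrays.copyOf(buf, n));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java WordContainerSniffer file1.doc file2.docx ...
        for (String arg : args) {
            System.out.println(arg + " -> " + sniff(arg));
        }
    }
}
```

Files reported as OLE2 are the ones that would go down the old
binary-format path, so those are the ones worth spot-checking for
empty extracted text.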


Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 27 July 2015 at 14:50, Paden <rumsey...@gmail.com> wrote:
> Pretty old thread, I know. But in the end it wasn't Solr. I'm fairly
> certain that it was Tika. The autoparser wasn't pulling any of the ".doc"
> file text. It came out as just blank. The documents were 1997-2003. When I
> opened them in Word 2010 and RESAVED them as 2010 documents they indexed
> just fine.
>
> So I wanted to put this here in case anybody has a problem creating their
> own custom SolrJ indexer. I think the current version of Tika has some
> compatibility issues with 2003 Word docs.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrJ-Tika-custom-indexer-not-indexing-CERTAIN-doc-text-tp4216541p4219341.html
> Sent from the Solr - User mailing list archive at Nabble.com.
