Thank you for attempting to answer. I will try it out with SolrJ and with standalone Java using the Tika parser. I completely understand that a bad document could cause this; however, when I opened the document I couldn't find anything suspicious except for some binary images/pictures embedded in it.
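Before going through Tika at all, one quick check is whether the XML parts inside the .docx are valid UTF-8, since the innermost cause in the trace below is a CharConversionException about byte 0xb7. The following is only a minimal sketch using the JDK alone: the class name DocxUtf8Check and the method firstInvalidUtf8Offset are made up for illustration, it assumes Java 9+ (for readAllBytes), and it assumes every .xml/.rels part is meant to be UTF-8, which may not hold if a part declares another encoding.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not part of Solr, Tika or POI): scans the XML parts
// of a .docx (which is a zip archive) for bytes that are not valid UTF-8.
final class DocxUtf8Check {

    /** Returns the offset of the first malformed UTF-8 sequence, or -1 if none. */
    static long firstInvalidUtf8Offset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(8192);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // The malformed bytes begin at the input buffer's position.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without errors
            }
            out.clear(); // overflow: discard decoded chars and keep going
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxUtf8Check <file.docx>");
            return;
        }
        try (ZipFile zip = new ZipFile(args[0])) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                // Assumption: only .xml and .rels parts are text; images etc. are skipped.
                if (entry.isDirectory() || !(name.endsWith(".xml") || name.endsWith(".rels"))) {
                    continue;
                }
                try (InputStream stream = zip.getInputStream(entry)) {
                    long offset = firstInvalidUtf8Offset(stream.readAllBytes());
                    if (offset >= 0) {
                        System.out.println(name + ": invalid UTF-8 at byte offset " + offset);
                    }
                }
            }
        }
    }
}
```

Running it as `java DocxUtf8Check failing.docx` (file name is a placeholder) lists any part with a bad byte and its offset. If word/document.xml shows up, the problem is in the document body rather than in the embedded pictures. For what it's worth, 0xb7 on its own is the middle dot in ISO-8859-1/Windows-1252, so one possibility is that some text was written into the file in a single-byte encoding, but that is only a guess.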
-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, November 04, 2015 4:33 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: tikaparser docx file fails with exception

Possibly a corrupt file? Tika does its best, but bad data is...bad data.

You can experiment a bit with using Tika in Java, that might give you a better idea of what's really going on, here's a SolrJ example:

https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

Best,
Erick

On Wed, Nov 4, 2015 at 3:49 PM, Aswath Srinivasan (TMS) <aswath.sriniva...@toyota.com> wrote:
>
> Trying to index a document. A docx file. Ending up with the below exception. Not sure why it is erroring out. When I opened the docx I was able to see lots of binary data like embedded pictures etc., Is there a possible solution to this or is it a bug? Only one such file fails. Rest of the files are smoothly indexed.
>
> 2015-11-04 23:16:11.549 INFO (coreLoadExecutor-6-thread-1) [ x:tika] o.a.s.c.CoreContainer registering core: tika
> 2015-11-04 23:16:11.549 INFO (searcherExecutor-7-thread-1-processing-x:tika) [ x:tika] o.a.s.c.SolrCore QuerySenderListener sending requests to Searcher@1eb69b2[tika] main{ExitableDirectoryReader(UninvertingDirectoryReader())}
> 2015-11-04 23:16:11.585 INFO (searcherExecutor-7-thread-1-processing-x:tika) [ x:tika] o.a.s.c.S.Request [tika] webapp=null path=null params={q=static+firstSearcher+warming+in+solrconfig.xml&distrib=false&event=firstSearcher} hits=0 status=0 QTime=34
> 2015-11-04 23:16:11.586 INFO (searcherExecutor-7-thread-1-processing-x:tika) [ x:tika] o.a.s.c.SolrCore QuerySenderListener done.
> 2015-11-04 23:16:11.586 INFO (searcherExecutor-7-thread-1-processing-x:tika) [ x:tika] o.a.s.h.c.SpellCheckComponent Loading spell index for spellchecker: default
> 2015-11-04 23:16:11.586 INFO (searcherExecutor-7-thread-1-processing-x:tika) [ x:tika] o.a.s.h.c.SpellCheckComponent Loading spell index for spellchecker: wordbreak
> 2015-11-04 23:16:11.586 INFO (searcherExecutor-7-thread-1-processing-x:tika) [ x:tika] o.a.s.h.c.SuggestComponent buildOnStartup: mySuggester
> 2015-11-04 23:16:11.586 INFO (searcherExecutor-7-thread-1-processing-x:tika) [ x:tika] o.a.s.s.s.SolrSuggester SolrSuggester.build(mySuggester)
> 2015-11-04 23:16:11.605 INFO (searcherExecutor-7-thread-1-processing-x:tika) [ x:tika] o.a.s.c.SolrCore [tika] Registered new searcher Searcher@1eb69b2[tika] main{ExitableDirectoryReader(UninvertingDirectoryReader())}
> 2015-11-04 23:16:25.923 INFO (qtp7980742-16) [ x:tika] o.a.s.h.d.DataImporter Loading DIH Configuration: tika-data-config.xml
> 2015-11-04 23:16:25.937 INFO (qtp7980742-16) [ x:tika] o.a.s.h.d.DataImporter Data Configuration loaded successfully
> 2015-11-04 23:16:25.947 INFO (qtp7980742-16) [ x:tika] o.a.s.c.S.Request [tika] webapp=/solr path=/dataimport params={debug=false&optimize=false&indent=true&commit=true&clean=true&wt=json&command=full-import&verbose=false} status=0 QTime=28
> 2015-11-04 23:16:25.948 INFO (Thread-17) [ x:tika] o.a.s.h.d.DataImporter Starting Full Import
> 2015-11-04 23:16:25.961 INFO (Thread-17) [ x:tika] o.a.s.h.d.SimplePropertiesWriter Read dataimport.properties
> 2015-11-04 23:16:25.966 INFO (qtp7980742-14) [ x:tika] o.a.s.c.S.Request [tika] webapp=/solr path=/dataimport params={indent=true&wt=json&command=status&_=1446678985952} status=0 QTime=1
> 2015-11-04 23:16:25.998 INFO (Thread-17) [ x:tika] o.a.s.c.SolrCore [tika] REMOVING ALL DOCUMENTS FROM INDEX
> 2015-11-04 23:16:26.728 ERROR (Thread-17) [ x:tika] o.a.s.h.d.EntityProcessorWrapper Exception in entity : documentImport:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
>     at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:70)
>     at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:168)
>     at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:514)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
>     at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
>     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
>     at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
>     at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1b3e0a6
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:262)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:162)
>     ... 9 more
> Caused by: java.io.CharConversionException: Characters larger than 4 bytes are not supported: byte 0xb7 implies a length of more than 4 bytes
>     at org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:162)
>     at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader$FastStreamDecoder.read(XMLStreamReader.java:762)
>     at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader.read(XMLStreamReader.java:162)
>     at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yy_refill(PiccoloLexer.java:3477)
>     at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:3962)
>     at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
>     at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
>     at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
>     at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
>     at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
>     at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
>     at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
>     at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:136)
>     at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:166)
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:118)
>     at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:59)
>     at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:181)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>     ... 12 more
>
> 2015-11-04 23:16:26.729 INFO (Thread-17) [ x:tika] o.a.s.h.d.DocBuilder Import completed successfully