I tested one of the files that is missing from the Solr index, and Solr responded as below:
...
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">1</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-02-10 00:03:23</str>
<str name="">
Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
</str>
..

I looked at Tomcat's log file and found this:

Exception in entity : tika-test:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:130)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:591)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:617)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:267)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:186)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:353)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:411)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@190725e
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
    ... 8 more
Caused by: java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
    at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:943)
    at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:108)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 10 more

I think this is because Tika can't read the PDF file, or because the file's format has some error. But I can open this PDF file in Adobe Reader.

Regards,
Rong Kang

At 2012-02-09 23:49:28,"Michael Kuhlmann" <k...@solarier.de> wrote:
>I'd suggest that you check which documents *exactly* are missing in Solr
>index. Or find at least one that's missing, and try to figure out how
>this document differs from the other ones that can be found in Solr.
>
>Maybe we can then find out what exact problem there is.
>
>Greetings,
>-Kuli
>
>On 09.02.2012 16:37, Rong Kang wrote:
>>
>> Yes, I put all files in one directory and I have tested the file names using
>> code.
>>
>>
>> At 2012-02-09 20:45:49,"Jan Høydahl"<jan....@cominvent.com> wrote:
>>> Hi,
>>>
>>> Are you 100% sure that the filename is globally unique, since you use it as
>>> the uniqueKey?
>>>
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> Solr Training - www.solrtraining.com
>>>
>>> On 9. feb. 2012, at 08:30, 荣康 wrote:
>>>
>>>> Hey,
>>>> I am using Solr as my search engine to search my PDF files. I have 18219
>>>> files (with different file names) and all the files are in the same
>>>> directory. But when I use Solr to import the files into the index using
>>>> the DataImport method, Solr reports that it imported only 17233 files. It's very strange.
>>>> This problem has stopped our project for a few days. I can't handle it.
>>>>
>>>>
>>>> please help me!
>>>>
>>>> Schema.xml
>>>>
>>>> <fields>
>>>>   <field name="text" type="text" indexed="true" multiValued="true"
>>>>          termVectors="true" termPositions="true" termOffsets="true"/>
>>>>   <field name="filename" type="filenametext" indexed="true"
>>>>          required="true" termVectors="true" termPositions="true"
>>>>          termOffsets="true"/>
>>>>   <field name="id" type="string" stored="true"/>
>>>> </fields>
>>>> <uniqueKey>id</uniqueKey>
>>>> <copyField source="filename" dest="text"/>
>>>>
>>>> and
>>>>
>>>> <dataConfig>
>>>>   <dataSource type="BinFileDataSource" name="bin"/>
>>>>   <document>
>>>>     <entity name="f" processor="FileListEntityProcessor" recursive="true"
>>>>             rootEntity="false"
>>>>             dataSource="null" baseDir="H:/pdf/cls_1_16800_OCRed/1"
>>>>             fileName=".*\.(PDF)|(pdf)|(Pdf)|(pDf)|(pdF)|(PDf)|(PdF)|(pDF)"
>>>>             onError="skip">
>>>>
>>>>       <entity name="tika-test" processor="TikaEntityProcessor"
>>>>               url="${f.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">
>>>>         <field column="text" name="text"/>
>>>>       </entity>
>>>>       <field column="file" name="id"/>
>>>>       <field column="file" name="filename"/>
>>>>     </entity>
>>>>   </document>
>>>> </dataConfig>
>>>>
>>>>
>>>> Sincerely,
>>>> Rong Kang
>>>>