Re:Re: Help:Solr can't put all pdf files into index

Rong Kang Thu, 09 Feb 2012 08:14:08 -0800

I tested one file that is missing from the Solr index. Solr responded as follows:

...
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">1</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-02-10 00:03:23</str>
<str name="">
Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
</str>
..

I checked Tomcat's log file and found this:

Exception in entity : tika-test:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:130)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:591)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:617)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:267)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:186)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:353)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:411)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@190725e
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
    ... 8 more
Caused by: java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
    at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:943)
    at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:108)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 10 more

The NullPointerException is thrown inside PDFBox (PDPageNode.getCount), so I 
think either Tika can't read this PDF file or the file's format has some 
error. But I can open this PDF file in Adobe Reader.
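
To find the other missing files, a minimal sketch like the one below could compare the file names in the directory with the ids that ended up in the index. It is only a sketch under some assumptions: the class name MissingFiles, the method missing, and the text file of indexed ids (one id per line, exported from Solr by whatever means is convenient) are all hypothetical, and the case-insensitive pattern (?i).*\.pdf is my own shorthand for "any name ending in .pdf", not the fileName pattern from the config.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class MissingFiles {
    // Assumption: any name ending in ".pdf", in any letter case, counts as a PDF.
    private static final Pattern PDF = Pattern.compile("(?i).*\\.pdf");

    /** Returns, in sorted order, the PDF file names that are not among the indexed ids. */
    public static SortedSet<String> missing(Collection<String> fileNames, Set<String> indexedIds) {
        SortedSet<String> result = new TreeSet<>();
        for (String name : fileNames) {
            if (PDF.matcher(name).matches() && !indexedIds.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java MissingFiles <pdf-directory> <indexed-ids-file>");
            return;
        }
        // args[1] is a hypothetical export of the indexed ids, one per line (UTF-8).
        Set<String> indexed = new HashSet<>(Files.readAllLines(Paths.get(args[1])));

        // Collect the names of the regular files directly inside the directory (not recursive).
        List<String> names = new ArrayList<>();
        try (Stream<Path> files = Files.list(Paths.get(args[0]))) {
            files.filter(Files::isRegularFile)
                 .forEach(p -> names.add(p.getFileName().toString()));
        }

        // Print every PDF that has no matching id in the index.
        missing(names, indexed).forEach(System.out::println);
    }
}
```

Since the id field is filled from the file name in my config, the names printed here should be the files that were skipped, and each of them can then be checked against the Tomcat log.
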
Regards,

Rong Kang
At 2012-02-09 23:49:28,"Michael Kuhlmann" <k...@solarier.de> wrote:
>I'd suggest that you check which documents *exactly* are missing in Solr 
>index. Or find at least one that's missing, and try to figure out how 
>this document differs from the other ones that can be found in Solr.
>
>Maybe we can then find out what exact problem there is.
>
>Greetings,
>-Kuli
>
>On 09.02.2012 16:37, Rong Kang wrote:
>>
>> Yes, I put all the files in one directory, and I have checked the file 
>> names with code.
>>
>>
>>
>>
>> At 2012-02-09 20:45:49,"Jan Høydahl"<jan....@cominvent.com>  wrote:
>>> Hi,
>>>
>>> Are you 100% sure that the filename is globally unique, since you use it as 
>>> the uniqueKey?
>>>
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> Solr Training - www.solrtraining.com
>>>
>>> On 9. feb. 2012, at 08:30, 荣康 wrote:
>>>
>>>> Hey,
>>>> I am using Solr as my search engine to search my PDF files. I have 18219 
>>>> files (all with different file names), and all the files are in the same 
>>>> directory. But when I use Solr to import the files into the index using 
>>>> the DataImport method, Solr reports that it imported only 17233 files. It's 
>>>> very strange. This problem has stopped our project for a few days. I 
>>>> can't handle it.
>>>>
>>>>
>>>> Please help me!
>>>>
>>>>
>>>> Schema.xml
>>>>
>>>>
>>>> <fields>
>>>>    <field name="text" type="text" indexed="true" multiValued="true" 
>>>> termVectors="true" termPositions="true" termOffsets="true"/>
>>>>    <field name="filename" type="filenametext" indexed="true" 
>>>> required="true" termVectors="true" termPositions="true" 
>>>> termOffsets="true"/>
>>>>    <field name="id" type="string" stored="true"/>
>>>> </fields>
>>>> <uniqueKey>id</uniqueKey>
>>>> <copyField source="filename" dest="text"/>
>>>>
>>>>
>>>> and
>>>> <dataConfig>
>>>>   <dataSource type="BinFileDataSource" name="bin"/>
>>>>   <document>
>>>>     <entity name="f" processor="FileListEntityProcessor" recursive="true"
>>>>             rootEntity="false" dataSource="null"
>>>>             baseDir="H:/pdf/cls_1_16800_OCRed/1"
>>>>             fileName=".*\.(PDF)|(pdf)|(Pdf)|(pDf)|(pdF)|(PDf)|(PdF)|(pDF)"
>>>>             onError="skip">
>>>>       <entity name="tika-test" processor="TikaEntityProcessor"
>>>>               url="${f.fileAbsolutePath}" format="text" dataSource="bin"
>>>>               onError="skip">
>>>>         <field column="text" name="text"/>
>>>>       </entity>
>>>>       <field column="file" name="id"/>
>>>>       <field column="file" name="filename"/>
>>>>     </entity>
>>>>   </document>
>>>> </dataConfig>
>>>>
>>>>
>>>>
>>>>
>>>> Sincerely,
>>>> Rong Kang
>>>>
>>>>
>>>>
>>>
>
