Hi, I'm new to Solr and trying to get it to index PDFs, but I'm having trouble 
getting started. I'm following the examples on the ExtractingRequestHandler 
wiki page <http://wiki.apache.org/solr/ExtractingRequestHandler>.

I got Solr running and it indexes html, xml & txt files just fine, but when I 
try to feed it a .pdf it barfs back an "Error 500 Could not initialize class 
org.apache.pdfbox.pdmodel.PDPage" error:

  $ curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@index.pdf"
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
  <title>Error 500 Could not initialize class org.apache.pdfbox.pdmodel.PDPage

  java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.PDPage
  ...

I thought maybe it's because Tika isn't installed/included so I tried 
downloading and installing Tika separately...but even the Tika install fails 
with: 

  
  -------------------------------------------------------------------------------
  Test set: org.apache.tika.parser.pdf.PDFParserTest
  -------------------------------------------------------------------------------
  Tests run: 5, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 0.63 sec <<< FAILURE!
  testVarious(org.apache.tika.parser.pdf.PDFParserTest)  Time elapsed: 0.165 sec  <<< ERROR!
  java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.PDPage

I don't know Java (and hopefully won't need to just to get basic indexing up 
and running, since the ultimate goal is to query this via Sunspot from a Rails 
app), so go easy on me.

Let me know if you want/need more of the error dump.

Any help would be greatly appreciated!
-Mike
