Re: Pdf in Lucene?

Kalani Ruwanpathirana Thu, 04 Dec 2008 01:50:12 -0800

Hi,

In my case I used PDFBox only to extract the text from the PDF document, and
then created the Lucene document myself from the extracted text. (I didn't
use the Lucene search engine classes built into PDFBox.) So I didn't run into
any incompatibility problems.
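
A minimal sketch of that approach, assuming PDFBox 0.7.3 (the org.pdfbox
packages) and a Lucene 2.x jar on the classpath. The class name
PdfTextIndexer, the helper names, and the field names "path" and "contents"
are illustrative choices, not taken from PDFBox or from the blog post:

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

// Hypothetical helper class: PDFBox is used for text extraction only,
// and the Lucene Document is built by our own code.
final class PdfTextIndexer {

    /** Extracts the plain text of a PDF. No Lucene classes are involved here. */
    static String extractText(File pdfFile) throws IOException {
        PDDocument pdf = PDDocument.load(pdfFile.getPath());
        try {
            return new PDFTextStripper().getText(pdf);
        } finally {
            pdf.close(); // always release the PDF, even if extraction fails
        }
    }

    /** Builds the Lucene Document directly, so PDFBox's Lucene classes are never touched. */
    static Document toLuceneDocument(String path, String text) {
        Document doc = new Document();
        // Stored, not analyzed: lets a search hit be traced back to its file.
        doc.add(new Field("path", path, Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Analyzed, not stored: the searchable body text.
        doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }

    public static void main(String[] args) throws IOException {
        File pdfFile = new File(args[0]); // path to a PDF, given on the command line
        Directory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
        try {
            writer.addDocument(toLuceneDocument(pdfFile.getPath(), extractText(pdfFile)));
        } finally {
            writer.close();
        }
    }
}
```

Because the only call into PDFBox is the text extraction, the Document is
built entirely against whichever Lucene jar is actually on the classpath.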


This blog post shows the way.
http://kalanir.blogspot.com/2008/08/indexing-pdf-documents-with-lucene.html

It worked perfectly for me.

Thanks.

On Tue, Dec 2, 2008 at 2:33 PM, tiziano bernardi <[EMAIL PROTECTED]> wrote:

>
>
> This is the exception:
> Exception in thread "main" java.lang.NoSuchMethodError: org.apache.lucene.document.Document.add(Lorg/apache/lucene/document/Field;)V
>     at org.pdfbox.searchengine.lucene.LucenePDFDocument.addUnindexedField(LucenePDFDocument.java:224)
>     at org.pdfbox.searchengine.lucene.LucenePDFDocument.convertDocument(LucenePDFDocument.java:265)
>     at org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocument.java:377)
>     at SimplePdfSearch.main(SimplePdfSearch.java:30)
>
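
One way to check the version-mismatch guess behind this NoSuchMethodError is
to ask the running JVM which jar the Lucene Document class comes from, and
whether it has an add method taking exactly a Field. A rough sketch using only
the standard library; ClasspathCheck and its two helper names are
hypothetical, made up for illustration:

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper. It uses reflection by class name, so it
// compiles and runs even when Lucene is not on the classpath.
final class ClasspathCheck {

    /**
     * Returns true if className can be loaded and has a public method with
     * this name and exactly these parameter types (no widening to supertypes).
     */
    static boolean hasMethod(String className, String methodName, String... parameterTypeNames) {
        try {
            Class<?> owner = Class.forName(className);
            Class<?>[] parameterTypes = new Class<?>[parameterTypeNames.length];
            for (int i = 0; i < parameterTypeNames.length; i++) {
                parameterTypes[i] = Class.forName(parameterTypeNames[i]);
            }
            owner.getMethod(methodName, parameterTypes);
            return true;
        } catch (ClassNotFoundException e) {
            return false; // the class or one of the parameter types is missing
        } catch (NoSuchMethodException e) {
            return false; // the class exists, but not with this exact signature
        }
    }

    /** Describes where a class was loaded from, usually the jar's location. */
    static String loadedFrom(String className) {
        try {
            CodeSource source = Class.forName(className).getProtectionDomain().getCodeSource();
            return source == null ? "(bootstrap or unknown)" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        String document = "org.apache.lucene.document.Document";
        String field = "org.apache.lucene.document.Field";
        System.out.println("Document loaded from: " + loadedFrom(document));
        System.out.println("Document.add(Field) present: " + hasMethod(document, "add", field));
    }
}
```

If the second line prints false, the Lucene jar found at run time does not
offer the add(Field) signature that the PDFBox classes were compiled against.
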
> I thank you for the time you spent
> > From: [EMAIL PROTECTED]
> > To: java-user@lucene.apache.org
> > Subject: Re: Pdf in Lucene?
> > Date: Mon, 1 Dec 2008 17:40:12 -0500
> >
> > I certainly don't either, since you haven't said what the actual
> > exception is. If I had to guess, though, I would say it is the line
> > Document document = LucenePDFDocument.getDocument
> >
> > And that the Lucene library expected by PDFBox is not the same version
> > of Lucene you are using. I would suggest not relying on PDFBox to
> > create your document, and instead look at the PDFBox calls that you
> > need to make to then create your Document.
> >
> > On Dec 1, 2008, at 9:18 AM, tiziano bernardi wrote:
> >
> > > this is my class, I use eclipse and I haven't any errors.Do not
> > > understand where the problem ....
> > >
> > > import java.io.File;
> > > import java.io.IOException;
> > >
> > > import org.apache.lucene.analysis.Analyzer;
> > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > import org.apache.lucene.document.Document;
> > > import org.apache.lucene.index.IndexWriter;
> > > import org.apache.lucene.index.Term;
> > > import org.apache.lucene.search.Hits;
> > > import org.apache.lucene.search.IndexSearcher;
> > > import org.apache.lucene.search.Query;
> > > import org.apache.lucene.search.TermQuery;
> > > import org.apache.lucene.store.Directory;
> > > import org.apache.lucene.store.RAMDirectory;
> > > import org.pdfbox.searchengine.lucene.LucenePDFDocument;
> > >
> > > public final class SimplePdfSearch
> > > {
> > >     private static final String PDF_FILE_PATH =
> > >         "C:\\Users\\Tiziano\\Desktop\\doc_di_prova\\prova.pdf";
> > >     private static final String SEARCH_TERM = "prova";
> > >
> > >     public static final void main(String[] args) throws IOException
> > >     {
> > >         Directory directory = null;
> > >
> > >         try
> > >         {
> > >             File pdfFile = new File(PDF_FILE_PATH);
> > >             Document document = LucenePDFDocument.getDocument(pdfFile);
> > >
> > >             directory = new RAMDirectory();
> > >
> > >             IndexWriter indexWriter = null;
> > >
> > >             try
> > >             {
> > >                 Analyzer analyzer = new StandardAnalyzer();
> > >                 indexWriter = new IndexWriter(directory, analyzer, true);
> > >
> > >                 indexWriter.addDocument(document);
> > >             }
> > >             finally
> > >             {
> > >                 if (indexWriter != null)
> > >                 {
> > >                     try
> > >                     {
> > >                         indexWriter.close();
> > >                     }
> > >                     catch (IOException ignore)
> > >                     {
> > >                         // Ignore
> > >                     }
> > >
> > >                     indexWriter = null;
> > >                 }
> > >             }
> > >
> > >             IndexSearcher indexSearcher = null;
> > >
> > >             try
> > >             {
> > >                 indexSearcher = new IndexSearcher(directory);
> > >
> > >                 Term term = new Term("contents", SEARCH_TERM);
> > >                 Query query = new TermQuery(term);
> > >
> > >                 Hits hits = indexSearcher.search(query);
> > >
> > >                 System.out.println((hits.length() != 0) ? "Found" : "Not Found");
> > >             }
> > >             finally
> > >             {
> > >                 if (indexSearcher != null)
> > >                 {
> > >                     try
> > >                     {
> > >                         indexSearcher.close();
> > >                     }
> > >                     catch (IOException ignore)
> > >                     {
> > >                         // Ignore
> > >                     }
> > >
> > >                     indexSearcher = null;
> > >                 }
> > >             }
> > >         }
> > >         finally
> > >         {
> > >             if (directory != null)
> > >             {
> > >                 try
> > >                 {
> > >                     directory.close();
> > >                 }
> > >                 catch (IOException ignore)
> > >                 {
> > >                     // Ignore
> > >                 }
> > >
> > >                 directory = null;
> > >             }
> > >         }
> > >     }
> > > }
> > >
> > > > From: [EMAIL PROTECTED]
> > > > To: java-user@lucene.apache.org
> > > > Subject: Re: Pdf in Lucene?
> > > > Date: Mon, 1 Dec 2008 08:22:58 -0500
> > > >
> > > > On Dec 1, 2008, at 8:01 AM, tiziano bernardi wrote:
> > > >
> > > > > I tried to use pdfbox but gives me an error.
> > > > > That the version of lucene and the pdfbox are incompatible.
> > > >
> > > > Lucene knows nothing about PDFBox, so I don't see how they could be
> > > > incompatible, unless your are referring to PDFBox's Lucene Document
> > > > creator, in which case, you should ask on the PDFBox mailing list. I
> > > > think, however, that it's pretty straightforward to create a Lucene
> > > > document from PDFBox, so you shouldn't need to rely on their version.
> > > >
> > > > Personally, I'd have a look at Tika (http://lucene.apache.org/tika),
> > > > which wraps PDFBox (and other extraction libraries) and gives you
> > > > back SAX-like events via a ContentHandler, which you can then use to
> > > > create Lucene documents. Else, I've been working on SOLR-284, which
> > > > integrates Tika into Solr, see
> > > > https://issues.apache.org/jira/browse/SOLR-284
> > > >
> > > > -Grant
> > > >
> > > > > I use pdf box 0.7.3 and lucene 2.1.0
> > > > >
> > > > > > Date: Mon, 1 Dec 2008 11:43:00 +0000
> > > > > > From: [EMAIL PROTECTED]
> > > > > > To: java-user@lucene.apache.org
> > > > > > Subject: Re: Pdf in Lucene?
> > > > > >
> > > > > > Hi
> > > > > >
> > > > > > Lucene only indexes text so you'll have to get the text out of the PDF
> > > > > > and feed it to lucene.
> > > > > >
> > > > > > Google for lucene pdf, or go straight to http://www.pdfbox.org/
> > > > > >
> > > > > > --
> > > > > > Ian.
> > > > > >
> > > > > > 2008/12/1 tiziano bernardi <[EMAIL PROTECTED]>:
> > > > > > > Hi,
> > > > > > > I want to index PDF files with lucene is possible?
> > > > > > > What like?
> > > > > > > Thanks Tiziano Bernardi
> > > >
> > > > --------------------------
> > > > Grant Ingersoll
> > > >
> > > > Lucene Helpful Hints:
> > > > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > > > http://wiki.apache.org/lucene-java/LuceneFAQ
>



-- 
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa
