I don't know if this will help you, but here is my code for extracting
the text of a PDF document:
/**
 * Extracts text from a pdf document
 *
 * @param in The InputStream representing the pdf file.
 * @return The text in the file, or null if extraction failed
 */
public String extractText(InputStream in)
{
    String s = null;
    try
    {
        PDFTextStripper _stripper = new PDFTextStripper();
        PDFParser parser = new PDFParser(in);
        parser.parse();
        s = _stripper.getText(parser.getDocument());
    }
    catch (Throwable t)
    {
        // Any failure is only logged; the caller then receives null.
        t.printStackTrace();
    }
    return s;
}
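One more thing that may be worth checking, though I have not run your
program and cannot confirm it is the cause: the PdfDocument class quoted
below keeps its Lucene Document in a static field, so every call to
PdfDocument.Document(file) appears to add fields to the same object. If
that is what happens, the entry indexed for a later file would also carry
the content of the earlier files, which could explain hits on files that
do not contain the keyword. The sketch below reproduces only that pattern
in plain Java. Doc, indexWithSharedDoc and indexWithFreshDocs are
hypothetical stand-ins I made up for illustration; they are not Lucene or
PDFBox classes.

```java
import java.util.ArrayList;
import java.util.List;

public class SharedDocumentDemo {

    /** Hypothetical stand-in for a Lucene Document: a list of "name=value" fields. */
    static class Doc {
        final List<String> fields = new ArrayList<>();

        void add(String name, String value) {
            fields.add(name + "=" + value);
        }
    }

    /** Reuses ONE Doc for every file, the way a static field would. */
    static List<List<String>> indexWithSharedDoc(String[] fileNames) {
        List<List<String>> index = new ArrayList<>();
        Doc shared = new Doc();
        for (String name : fileNames) {
            shared.add("path", name);
            shared.add("content", "text of " + name);
            // Snapshot of the fields that would be indexed for this file.
            index.add(new ArrayList<>(shared.fields));
        }
        return index;
    }

    /** Creates a new Doc for every file, so fields never carry over. */
    static List<List<String>> indexWithFreshDocs(String[] fileNames) {
        List<List<String>> index = new ArrayList<>();
        for (String name : fileNames) {
            Doc doc = new Doc();
            doc.add("path", name);
            doc.add("content", "text of " + name);
            index.add(new ArrayList<>(doc.fields));
        }
        return index;
    }

    public static void main(String[] args) {
        String[] files = {"a.pdf", "b.pdf", "c.pdf"};
        // With the shared Doc, the entry for c.pdf also holds the fields of a.pdf and b.pdf.
        System.out.println(indexWithSharedDoc(files));
        // With a fresh Doc per file, each entry holds only its own two fields.
        System.out.println(indexWithFreshDocs(files));
    }
}
```

If this does turn out to be the problem, creating the Document inside
the method rather than in a static field would be the change to try.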
On Wednesday, August 20, 2003, at 08:59 AM, Yang Sun wrote:
Hi,
I am a newbie to Lucene. I want to index the entire contents of my
hard disk for searching, including HTML files, PDF files, Word files,
etc. But I have encountered a problem when indexing PDF files, and
I need your help.
My environment is lucene-1.3-rc (I have also tried lucene-1.2),
jdk1.4.02, and pdfbox-0.62. I try to index all my PDFs. There seem to
be no errors during indexing (I use StandardAnalyzer; you can
refer to my sources in the attachment). But when I search with a
keyword, I get a lot of useless results: the PDFs returned do not
contain the content I want. Can you help me with this problem?
In my attachment, I put my source files and the test PDF files.
After I use my program to index these three PDF files, everything
seems fine. But when I search the resulting index for the keyword
cisco, I get three hits. Two of those results do not contain the
keyword cisco, so they are useless. I wondered whether pdfbox was at
fault, so I printed out the indexed content; it does not contain the
keyword cisco either. I have used both Luke and my own searcher program
as the search client, and the problem does not seem to lie there.
Can anyone help me, or offer any comments on this problem? Everyone is
welcome. My email is [EMAIL PROTECTED]
suny
PS: Sorry, I cannot attach the files; it seems this mailing list does
not accept attachments, so I have to put my source code here. My three
test PDF files total 100k. If someone would like to help me test it, I
would be very grateful.
IndexPDF.java
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import java.io.File;
import java.util.Date;
/**
* FileName:
* User: Administrator
* Date: 2003-8-19
* Time: 23:18:30
* Functions:
*/
public class IndexPDF {
    File indexFiles; // the file or directory we want to index

    public static void indexDocs(IndexWriter writer, File file) throws Exception {
        if (file.isDirectory()) {
            // recurse into every entry of the directory
            String[] files = file.list();
            for (int i = 0; i < files.length; i++) {
                indexDocs(writer, new File(file, files[i]));
            }
        } else if (file.getPath().endsWith(".pdf")) {
            System.out.println("adding " + file);
            writer.addDocument(PdfDocument.Document(file));
        } else {
            System.out.println("Ignoring " + file);
        }
    }

    public static void main(String args[]) throws Exception {
        if (args.length != 1) {
            System.out.println("Usage: IndexPDF file/directory");
            return;
        }
        try {
            Date start = new Date();
            IndexWriter writer = new IndexWriter("E:/Index", new StandardAnalyzer(), true);
            indexDocs(writer, new File(args[0]));
            writer.optimize();
            writer.close();
            Date end = new Date();
            System.out.println(end.getTime() - start.getTime());
            System.out.println(" total milliseconds");
        } catch (Exception e) {
            System.out.println(" caught a " + e.getClass() +
                    "\n with message: " + e.getMessage());
        }
    }
}
PdfDocument.java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;
import java.io.File;
import java.io.FileInputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import net.vicp.resshare.weblucene.document.PDFDocument;
/**
* FileName:
* User: Administrator
* Date: 2003-8-19
* Time: 23:20:54
* Functions:
*/
public class PdfDocument {
    /* lucene object which represent a single data file */
    static Document doc = new Document();

    public static Document Document(File f) {
        // set relative path to the path field in lucene
        doc.add(Field.UnIndexed("path", f.getPath()));
        System.out.println("Path is " + f.getPath());
        // use 1 as the limit for temporary use
        doc.add(Field.Text("content", getPDFContent(f)));
        doc.add(Field.UnIndexed("filetype", "pdf"));
        doc.add(Field.UnIndexed("title", f.getName()));
        return doc;
    }

    /**
     * get the text content from the specified pdf file.
     * @param f the pdf we should extract the content
     * @return the string contains the pdf content
     */
    private static String getPDFContent(File f) {
        byte[] contents = null;
        try {
            FileInputStream is = new FileInputStream(f);
            PDFParser parser