Re: Question on Lucene when indexing big pdf files

2003-08-20 Thread Damien Lust
I don't know if this will help, but here is my code for extracting the 
text of a PDF document:

  /**
   * Extracts text from a pdf document
   *
   * @param in The InputStream representing the pdf file.
   * @return The text in the file
   */
  public String extractText(InputStream in)
  {
    String s = null;
    try
    {
      PDFTextStripper _stripper = new PDFTextStripper();
      PDFParser parser = new PDFParser(in);
      parser.parse();
      s = _stripper.getText(parser.getDocument());
    }
    catch (Throwable t)
    {
      t.printStackTrace();
    }
    return s;
  }




On Wednesday, August 20, 2003, at 08:59 AM, Yang Sun wrote:

Hi,
I am new to Lucene. I want to index the entire contents of my hard 
disk for searching, including HTML, PDF, Word, and other files. I have 
run into a problem indexing PDF files and need your help.
My environment is lucene-1.3-rc (I have also tried lucene-1.2), 
jdk1.4.02, and pdfbox-0.62. I try to index all my PDFs. No errors 
appear during indexing (I use StandardAnalyzer; you can refer to my 
sources in the attachment). But when I search for a keyword, I get 
many useless results: the PDFs returned do not contain the content I 
want. Can you help me with this problem?
My attachment contains my source files and the test PDF files. 
After I index these three PDF files with my program, everything seems 
fine. But when I search the index for the keyword cisco, I get three 
hits, and two of them do not contain the keyword cisco, so they are 
useless. I wondered whether pdfbox was at fault, so I printed out the 
indexed content; it does not contain the keyword cisco either. I use 
Luke and my own searcher program as search clients, and they seem to 
work without problems.
Can anyone help me? Any comments on this problem are welcome. 
My email is [EMAIL PROTECTED]
suny

PS: Sorry, I cannot attach the files; apparently this mailing list does 
not accept attachments, so I have put my source code here. My three 
test PDF files total 100k. If someone would like to help me test this, 
I would very much appreciate it.

IndexPDF.java

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import java.io.File;
import java.util.Date;
/**
 * FileName:
 * User: Administrator
 * Date: 2003-8-19
 * Time: 23:18:30
 * Functions:
 */
public class IndexPDF {
  File indexFiles;//the file or directory we want to index
  public static void indexDocs(IndexWriter writer, File file) throws Exception {
    if (file.isDirectory()) {
      String[] files = file.list();
      for (int i = 0; i < files.length; i++) {
        indexDocs(writer, new File(file, files[i]));
      }
    } else if (file.getPath().endsWith(".pdf")) {
      System.out.println("adding " + file);
      writer.addDocument(PdfDocument.Document(file));
    } else {
      System.out.println("Ignoring " + file);
    }
  }
  public static void main(String args[]) throws Exception {
    if (args.length != 1) {
      System.out.println("Usage: IndexPDF file/directory");
      return;
    }
    try {
      Date start = new Date();
      IndexWriter writer = new IndexWriter("E:/Index", new StandardAnalyzer(), true);
      indexDocs(writer, new File(args[0]));
      writer.optimize();
      writer.close();
      Date end = new Date();
      System.out.println(end.getTime() - start.getTime());
      System.out.println(" total milliseconds");
    } catch (Exception e) {
      System.out.println(" caught a " + e.getClass() +
          "\n with message: " + e.getMessage());
    }
  }
}

PdfDocument.java

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;
import java.io.File;
import java.io.FileInputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import net.vicp.resshare.weblucene.document.PDFDocument;
/**
 * FileName:
 * User: Administrator
 * Date: 2003-8-19
 * Time: 23:20:54
 * Functions:
 */
public class PdfDocument {
  /* lucene object which represent a single data file */
  static Document doc = new Document();
  public static Document Document(File f ) {
    //set relative path to the path field in lucene
    doc.add(Field.UnIndexed("path", f.getPath()));
    System.out.println("Path is " + f.getPath());
    // use 1 as the limit for temporary use
    doc.add(Field.Text("content", getPDFContent(f)));
    doc.add(Field.UnIndexed("filetype", "pdf"));
    doc.add(Field.UnIndexed("title", f.getName()));
    return doc;
  }
  /**
   * get the text content from the specified pdf file.
   * @param f the pdf we should extract the content
   * @return  the string contains the pdf content
   */
  private static String getPDFContent(File f) {
    byte[] contents = null;
    try {
      FileInputStream is = new FileInputStream(f);
      PDFParser parser 
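
(The posted source is cut off at this point.)

One thing that may be worth checking, going only by the PdfDocument code as posted: doc is a single static instance, and every call to Document(File) adds fields to it and returns it. If so, the object handed to addDocument for a later file could still carry the content fields of earlier files, which might produce hits on files that do not contain the keyword. This is an assumption; nothing in the thread confirms it. The sketch below does not use Lucene at all: it uses a plain list as a hypothetical stand-in for a document, and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in: a "document" is just a list of content strings. */
class SharedDocumentDemo {
    // Mirrors the posted pattern: one static instance reused for every file.
    static List<String> shared = new ArrayList<String>();

    /** Adds to the shared instance, so content from earlier calls is still present. */
    static List<String> buildShared(String content) {
        shared.add(content);
        return shared;
    }

    /** Creates a new instance per call, so the result holds only this file's content. */
    static List<String> buildFresh(String content) {
        List<String> doc = new ArrayList<String>();
        doc.add(content);
        return doc;
    }

    public static void main(String[] args) {
        buildShared("cisco router manual");
        List<String> second = buildShared("unrelated report");
        // The second "document" still contains the first file's text.
        System.out.println(second.size());                          // prints 2
        System.out.println(buildFresh("unrelated report").size());  // prints 1
    }
}
```

If the same effect applies to the real code, creating the Document inside the Document(File) method instead of in a static field would be the corresponding change to try.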

Re: Question on Lucene when indexing big pdf files

2003-08-20 Thread Yang Sun
Hi,
When I use Luke to look at my index, it seems all right. The content in the index 
is fine: all the text has been extracted from the PDF files. I copied the PDF content 
(the content field) and searched it for the keyword, but I could not find the keyword 
either. I think there is nothing wrong with pdfbox.

Would you please help me test this? I have three PDF files (100k in total); 
after I index them, I get useless results when I use cisco as the 
keyword. If you are willing to help, I will send you my test source files and the 
three PDF files. I would very much appreciate your help.
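
The manual check described above (copy the content field, look for the keyword) can also be done in code. The following is a minimal sketch, not Lucene's StandardAnalyzer: it only approximates analysis by lowercasing and splitting on non-alphanumeric characters, and the class and method names are hypothetical.

```java
import java.util.Locale;

/** Hypothetical helper for checking extracted text outside of Lucene. */
class KeywordCheck {
    /** Returns true if any lowercased token of the text equals the lowercased keyword. */
    static boolean containsToken(String text, String keyword) {
        if (text == null || keyword == null || keyword.isEmpty()) {
            return false;
        }
        String wanted = keyword.toLowerCase(Locale.ROOT);
        // Split on runs of characters that are neither letters nor digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.equals(wanted)) {
                return true;
            }
        }
        return false;
    }
}
```

Matching whole tokens rather than substrings means a word such as "francisco" does not count as a match for cisco, which is closer to how a tokenized index behaves.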

Ben Litchfield [EMAIL PROTECTED] wrote:


 cisco. I use Luke and my searcher program as the searching client,
 it seems no problem. Can anyone help me? Or any comments on this

When you use luke to look at your index does it show the correct contents
for those documents?

Ben
