Re: Similar Document Search
Hi all,

it seems there are quite a few people looking for similar features, i.e. (a) document identity and (b) forward indexing. So far we have hacked (a) by using a wrapper implementing equals/hashCode based on a unique field, but of course that assumes maintaining a unique field in the index. (b) is something we haven't tackled yet, but plan to.

The source code for Mark's thesis seems to be part of the Haystack distribution. The comments in the files put it under the Apache license, which seems to make it a good candidate for inclusion at least in the Lucene sandbox -- although I haven't tried it myself yet, it sounds like a good candidate for us to use.

Since the Haystack source is a bit larger and I actually couldn't get the download at the moment, here is a copy of the relevant bit grabbed from one of my colleague's machines: http://www.itee.uq.edu.au/~pbecker/luceneHaystack.tar.gz (22kb). Note that this is just a tarball of src/org/apache/lucene out of some Haystack source. Untested, unmodified. I'd love to see something like this supported in the Lucene context where people might actually find it :-)

Peter

Gregor Heinrich wrote:

Hello Terry,

Lucene can do forward indexing, as Mark Rosen outlines in his Master's thesis: http://citeseer.nj.nec.com/rosen03email.html. We use a similar approach for (probabilistic) latent semantic analysis and vector space searches. However, the solution is not completely settled yet, therefore no code at this time...

Best regards,

Gregor

-----Original Message-----
From: Peter Becker [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 19, 2003 3:06 AM
To: Lucene Users List
Subject: Re: Similar Document Search

Hi Terry,

we have been thinking about the same problem, and in the end we decided that most likely the only good solution is to keep a non-inverted index, i.e. a map from the documents to their terms.
Then you can look up the top terms for a document and query for other documents matching some of them (where you get the usual question of what is actually interesting: high frequency, low frequency, or the mid range). Indexing would probably be quite expensive, since Lucene doesn't seem to support in-place changes to the index, and the index for the terms would change all the time. We haven't implemented it yet, but it shouldn't be hard to code. I just wouldn't expect good performance when indexing large collections.

Peter

Terry Steichen wrote:
> Is it possible without extensive additional coding to use Lucene to conduct a search based on a document rather than a query? (One use of this would be to refine a search by selecting one of the hits returned from the initial query and subsequently retrieving other documents "like" the selected one.)
>
> Regards,
>
> Terry

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
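The non-inverted ("forward") index idea from this thread can be sketched with plain JDK collections standing in for Lucene. All class and method names below are invented for illustration: each document maps to its term frequencies, and "more like this" is answered by ranking other documents by how many terms they share with the source document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy forward index: document id -> term -> frequency.
// Similar documents are found by counting overlapping terms.
public class ForwardIndex {
    private final Map<String, Map<String, Integer>> docTerms = new HashMap<>();

    public void add(String docId, String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) freqs.merge(term, 1, Integer::sum);
        }
        docTerms.put(docId, freqs);
    }

    // Rank every other document by the number of terms it shares with docId.
    public List<String> similarTo(String docId) {
        Map<String, Integer> source = docTerms.get(docId);
        Map<String, Integer> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : docTerms.entrySet()) {
            if (e.getKey().equals(docId)) continue;
            int shared = 0;
            for (String term : source.keySet()) {
                if (e.getValue().containsKey(term)) shared++;
            }
            if (shared > 0) scores.put(e.getKey(), shared);
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> scores.get(b) - scores.get(a));
        return ranked;
    }

    public static void main(String[] args) {
        ForwardIndex idx = new ForwardIndex();
        idx.add("a", "lucene index search");
        idx.add("b", "lucene search engine");
        idx.add("c", "cooking recipes");
        System.out.println(idx.similarTo("a")); // "b" shares two terms, "c" none
    }
}
```

As the thread notes, the interesting design question is which shared terms to count: raw overlap favors high-frequency terms, which is usually the least informative choice.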
Re: Fastest batch indexing with 1.3-rc1
Leo Galambos wrote:
> Isn't it better for Dan to skip the optimization phase before merging? I am not sure, but he could save some time on this (if he has enough file handles for that, of course).

It depends. If you have ten machines, each with a single disk, that you use for indexing in parallel, and you copy all of the indexes to a single machine for the final merge, then you're probably better off optimizing each index before copying it and merging it with the others, in order to maximize the amount of work done in parallel, using all disk spindles. However, if instead you have one machine with ten processors and a filesystem striped across ten disks, then, in theory, optimizing before merging might not help much, since the single-threaded final merge could use all ten disks at once. Even then, though, the final merge would be doing serially some CPU work that would have been done in parallel in the first configuration. In general I think it's best to do as much work as possible in parallel.

> What strategy do you use in "nutch"?

Nutch builds optimized indexes for each fetched "segment" (n.b., a Nutch segment is different from a Lucene segment) and only merges segment indexes as the final step before deploying them for searching. Nutch has a rolling set of active segments: the oldest are periodically discarded and replaced with newly fetched segments. Before a new set of segments is deployed, duplicate-elimination processing must occur, which marks duplicates as deleted prior to merging the new production indexes.

Doug
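The divide-and-merge scheme Doug describes can be sketched with plain JDK threads, using a sorted term list as a stand-in for an optimized Lucene sub-index (all names here are illustrative, not Lucene API): each worker builds and "optimizes" (sorts) its own sub-index in parallel, and only the final merge runs serially.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Each worker indexes a slice of the documents into its own sub-index
// (here just a sorted list of terms); a single thread merges the results.
public class ParallelIndexer {
    static List<String> buildSubIndex(List<String> docs) {
        List<String> terms = new ArrayList<>();
        for (String doc : docs)
            for (String t : doc.toLowerCase().split("\\W+"))
                if (!t.isEmpty()) terms.add(t);
        Collections.sort(terms); // the "optimize" step, done in parallel
        return terms;
    }

    static List<String> merge(List<List<String>> subIndexes) {
        List<String> all = new ArrayList<>();
        for (List<String> sub : subIndexes) all.addAll(sub);
        Collections.sort(all); // the serial final merge
        return all;
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> slices = Arrays.asList(
            Arrays.asList("lucene index", "batch indexing"),
            Arrays.asList("merge segments", "optimize index"));
        ExecutorService pool = Executors.newFixedThreadPool(slices.size());
        List<Future<List<String>>> futures = new ArrayList<>();
        for (List<String> slice : slices)
            futures.add(pool.submit(() -> buildSubIndex(slice)));
        List<List<String>> subs = new ArrayList<>();
        for (Future<List<String>> f : futures) subs.add(f.get());
        pool.shutdown();
        System.out.println(merge(subs).size()); // 8 terms in total
    }
}
```

Doug's point maps onto this directly: the more work (tokenizing, sorting) that happens inside `buildSubIndex` on separate spindles, the less is left for the serial `merge` step.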
Re: Will failed optimize corrupt an index?
The index should be fine. Lucene index updates are atomic.

Doug

Dan Quaroni wrote:
> My index grew about 7 gigs larger than I projected it would, and it ran out of disk space during optimize. Does Lucene have transactions or anything that would prevent this from corrupting the index, or do I need to generate the index again? Thanks!
Re: Fastest batch indexing with 1.3-rc1
Isn't it better for Dan to skip the optimization phase before merging? I am not sure, but he could save some time on this (if he has enough file handles for that, of course).

What strategy do you use in "nutch"?

THX

-g-

Doug Cutting wrote:
> As the index grows, disk I/O becomes the bottleneck. The default indexing parameters do a pretty good job of optimizing this. But if you have lots of CPUs and lots of disks, you might try building several indexes in parallel, each containing a subset of the documents, then optimize each index and finally merge them all into a single index at the end. But you need lots of I/O capacity for this to pay off.
>
> Doug
>
> Dan Quaroni wrote:
>> Looks like I spoke too soon... As the index gets larger, the time to merge becomes prohibitively high. It appears to increase linearly. Oh well. I guess I'll just have to go with about 3 ms/doc.
Re: Fastest batch indexing with 1.3-rc1
As the index grows, disk I/O becomes the bottleneck. The default indexing parameters do a pretty good job of optimizing this. But if you have lots of CPUs and lots of disks, you might try building several indexes in parallel, each containing a subset of the documents, then optimize each index and finally merge them all into a single index at the end. But you need lots of I/O capacity for this to pay off.

Doug

Dan Quaroni wrote:
> Looks like I spoke too soon... As the index gets larger, the time to merge becomes prohibitively high. It appears to increase linearly. Oh well. I guess I'll just have to go with about 3 ms/doc.
RE: Fastest batch indexing with 1.3-rc1
Looks like I spoke too soon... As the index gets larger, the time to merge becomes prohibitively high. It appears to increase linearly. Oh well. I guess I'll just have to go with about 3 ms/doc.
Re: Lucene Index on NFS Server
I don't know the details of how lock files are unreliable over NFS, only that they are. The window of vulnerability, when the lock file is used, is when one JVM is opening all of the files in an index while another is completing an update at the same time. If the updating machine removes some files after the opening machine has read the 'segments' file but before it has opened all of the files, then the open will fail with a FileNotFound exception. If your application can guarantee that indexes are not opened while an update is completing (under IndexWriter.close(), or IndexReader.close() for deletions), then this will not be a problem.

Doug

Morus Walter wrote:
> Doug Cutting writes:
>>> Can I have a Lucene index on an NFS filesystem without problems (access is read-only)?
>> So long as all access is read-only, there should not be a problem. Keep in mind, however, that lock files are known to not work correctly over NFS.
>
> Hmm. Sorry, I was a bit imprecise (at least in the quoted part), so I'm not sure if I got that correctly. Access over NFS is read-only, but there would be write access on the NFS server itself (local filesystem). Is this OK? Or should I use an "update a copy of the index and exchange indexes afterwards" strategy?
>
> TIA
> Morus
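One defensive pattern suggested by the race Doug describes -- not part of Lucene itself, just a sketch -- is to retry the open when it collides with a completing update: if a file listed in 'segments' has disappeared by the time we try to open it, wait briefly and start over.

```java
import java.io.FileNotFoundException;
import java.util.concurrent.Callable;

// Retry an index-open that may race with a writer finishing an update:
// if a file named in 'segments' disappears mid-open, wait and try again.
public class RetryOpen {
    public static <T> T withRetries(Callable<T> open, int attempts, long sleepMs)
            throws Exception {
        FileNotFoundException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return open.call();
            } catch (FileNotFoundException e) {
                last = e;            // another process removed a file; retry
                Thread.sleep(sleepMs);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated open that fails once (file removed by an updater),
        // then succeeds on the retry.
        int[] calls = {0};
        String reader = withRetries(() -> {
            if (calls[0]++ == 0) throw new FileNotFoundException("_1.fnm");
            return "reader";
        }, 3, 10);
        System.out.println(reader);
    }
}
```

This only papers over the window; the robust fix, as Doug says, is to guarantee that opens and update-completions never overlap.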
Re: Searching while optimizing
That is an old FAQ item. Lucene has been thread safe for a while now.

Doug

Steve Rajavuori wrote:
> This seems to contradict an item from the Lucene FAQ:
>
> << 41. Can I modify the index while performing ongoing searches? Yes and no. At the time of writing this FAQ (June 2001), Lucene is not thread safe in this regard. Here is a quote from Doug Cutting, the creator of Lucene: The problems arise only when you add documents to or optimize an index, and then search with an IndexReader that was constructed before those changes to the index were made. A possible workaround is to perform the index updates in a parallel and separate index and switch to the new index when its updating is done. The switching may be done, for example, using a variable that points to the directory of the current active index. Since searches have a relatively short lifetime, you may discard (or reuse) the old index a short time after performing the switch (this grace period should be a little longer if you want to let all searches that involve paging through the hit list complete with consistent results). >>
>
> Can you explain further?

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 31, 2003 2:31 PM
To: Lucene Users List
Subject: Re: Searching while optimizing

Aviran Mordo wrote:
> Is it possible and safe to search an index while another thread adds documents or optimizes the same index?

Yes.
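The FAQ's index-switching workaround can be sketched with an atomic reference (the class and field names here are illustrative): searches take a snapshot of the "current index" pointer and keep using it for their lifetime, while the updater rebuilds a separate index and flips the pointer atomically when done.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the FAQ workaround: a variable points to the directory of the
// current active index; updates happen in a separate index, then switch.
public class IndexSwitch {
    private final AtomicReference<String> currentIndexDir;

    public IndexSwitch(String initialDir) {
        currentIndexDir = new AtomicReference<>(initialDir);
    }

    // A search takes a snapshot and uses that directory for its whole lifetime,
    // so in-flight searches are unaffected by a concurrent switch.
    public String snapshotForSearch() {
        return currentIndexDir.get();
    }

    // The updater builds newDir elsewhere, then flips the pointer. The old
    // directory can be discarded after a grace period for in-flight searches.
    public String switchTo(String newDir) {
        return currentIndexDir.getAndSet(newDir);
    }

    public static void main(String[] args) {
        IndexSwitch s = new IndexSwitch("index-a");
        String inFlight = s.snapshotForSearch(); // a search started on index-a
        String old = s.switchTo("index-b");      // an update completes
        System.out.println(inFlight + " " + old + " " + s.snapshotForSearch());
    }
}
```

As Doug notes, this workaround is no longer necessary for thread safety, but the pattern is still useful when you want searches to see a consistent index for paging.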
Fastest batch indexing with 1.3-rc1
Hey there. What's the fastest way to do a batch index with Lucene 1.3-rc1 on a dual- or quad-processor box? The files I'm indexing are very easy to divide among multiple threads. Here's what I've done at this point:

- Each thread has its own IndexWriter writing to its own RAMDirectory.
- Every so many documents, I mergeIndexes the thread's index into the main disk index.
- The thread writers have a mergeFactor of 50. The disk IndexWriter has a mergeFactor of 30.
- I call optimize only on the main disk index, and only once at the very end.

Just doing this has shown great improvements for me, but I want to squeeze out every bit of performance I can. What's the fastest way to mergeIndexes? Should I use a low mergeFactor when working with RAMDirectorys? Should I optimize a thread's index before I merge it into the main one? Thanks!
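The setup described above can be sketched with JDK-only stand-ins: a plain list plays the role of each thread's RAMDirectory buffer, and a synchronized list plays the role of the main disk index. None of the names below are Lucene API; it only shows the buffering-and-periodic-merge shape.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Each worker buffers documents in memory (its "RAMDirectory") and
// periodically merges the buffer into the shared main index.
public class BatchIndexer {
    static final int FLUSH_EVERY = 100; // docs buffered per thread before a merge
    final List<String> mainIndex = Collections.synchronizedList(new ArrayList<>());

    class Worker implements Runnable {
        final List<String> docs;
        Worker(List<String> docs) { this.docs = docs; }
        public void run() {
            List<String> ramBuffer = new ArrayList<>();
            for (String doc : docs) {
                ramBuffer.add(doc);               // "addDocument" to the RAM buffer
                if (ramBuffer.size() >= FLUSH_EVERY) {
                    mainIndex.addAll(ramBuffer);  // "mergeIndexes" into the disk index
                    ramBuffer.clear();
                }
            }
            if (!ramBuffer.isEmpty()) mainIndex.addAll(ramBuffer); // final flush
        }
    }

    public static void main(String[] args) throws Exception {
        BatchIndexer b = new BatchIndexer();
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 250; i++) docs.add("doc" + i);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(b.new Worker(docs.subList(0, 125)));
        pool.submit(b.new Worker(docs.subList(125, 250)));
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(b.mainIndex.size()); // all 250 docs reach the main index
    }
}
```

The tuning questions in the mail map onto `FLUSH_EVERY`: flushing more often means smaller, cheaper merges but more contention on the shared index; flushing rarely means fewer, larger merges.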
Re: Question on Lucene when indexing big pdf files
Hi,

When I use Luke to look at my index, it seems all right. The content in the index looks fine; all of the content has been extracted from the PDF files. I copy text from the PDF content (namely the "content" field) and search for it as a keyword, but I cannot find the keyword either, so I think there is nothing wrong with the pdfbox side. Would you please help me test this situation? I have three PDF files (100k in total); after I index them, I get useless results when I use "cisco" as the keyword. If you would like to help me, I will send you my test source files and the three PDF files. I would really appreciate your help.

Ben Litchfield <[EMAIL PROTECTED]> wrote:
>> "cisco". I use Luke and my searcher program as the searching client,
>> it seems no problem. Can anyone help me? Or any comments on this
>
> When you use Luke to look at your index, does it show the correct contents for those documents?
>
> Ben
RE: Similar Document Search
Hello Terry,

Lucene can do forward indexing, as Mark Rosen outlines in his Master's thesis: http://citeseer.nj.nec.com/rosen03email.html. We use a similar approach for (probabilistic) latent semantic analysis and vector space searches. However, the solution is not completely settled yet, therefore no code at this time...

Best regards,

Gregor

-----Original Message-----
From: Peter Becker [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 19, 2003 3:06 AM
To: Lucene Users List
Subject: Re: Similar Document Search

Hi Terry,

we have been thinking about the same problem, and in the end we decided that most likely the only good solution is to keep a non-inverted index, i.e. a map from the documents to their terms. Then you can look up the top terms for a document and query for other documents matching some of them (where you get the usual question of what is actually interesting: high frequency, low frequency, or the mid range). Indexing would probably be quite expensive, since Lucene doesn't seem to support in-place changes to the index, and the index for the terms would change all the time. We haven't implemented it yet, but it shouldn't be hard to code. I just wouldn't expect good performance when indexing large collections.

Peter

Terry Steichen wrote:
> Is it possible without extensive additional coding to use Lucene to conduct a search based on a document rather than a query? (One use of this would be to refine a search by selecting one of the hits returned from the initial query and subsequently retrieving other documents "like" the selected one.)
>
> Regards,
>
> Terry
Re: Question on Lucene when indexing big pdf files
> "cisco". I use Luke and my searcher program as the searching client, > it seems no problem. Can anyone help me? Or any comments on this When you use luke to look at your index does it show the correct contents for those documents? Ben - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Question on Lucene when indexing big pdf files
I don't know if it can help you, but here is my code to extract the text of a PDF document:

    /**
     * Extracts text from a PDF document.
     *
     * @param in the InputStream representing the PDF file
     * @return the text in the file
     */
    public String extractText(InputStream in) {
        String s = null;
        try {
            PDFTextStripper stripper = new PDFTextStripper();
            PDFParser parser = new PDFParser(in);
            parser.parse();
            s = stripper.getText(parser.getDocument());
        } catch (Throwable t) {
            t.printStackTrace();
        }
        return s;
    }

On Wednesday, August 20, 2003, at 08:59 AM, Yang Sun wrote:

Hi,

I am a newbie with Lucene. I want to index all of my hard disk's contents for searching, including HTML, PDF, Word files, etc. But I have run into a problem when I try to index PDF files, and I need your help. My environment is lucene-1.3-rc (lucene-1.2 has also been tried), jdk1.4.02, pdfbox-0.62. I try to index all my PDFs. There seems to be no error when executing the indexing (I use StandardAnalyzer; you can refer to my sources in the attachment). But when I search using a keyword, I find a lot of useless results: the PDFs don't contain the content I want. Can you help me with this problem? In my attachment, I put my source files and the test PDF files. After I use my program to index these three PDF files, everything seems all right. But when I search for the keyword "cisco" against the index, I get three Hits as the result, and two of them do not contain the keyword "cisco" -- they are useless. I wondered if pdfbox was at fault, so I printed out the indexed content; it also does not contain the keyword "cisco". I use Luke and my searcher program as the searching client, and the client itself seems fine. Can anyone help me? Or any comments on this problem? Everyone is welcome. My email is [EMAIL PROTECTED]

suny

PS: sorry, I cannot attach the files -- this mailing list cannot hold attachments? So I have to put my source codes here.
My three test PDF files are 100k in total; if someone would like to help me test them, I would really appreciate it.

[IndexPDF.java and PdfDocument.java listings snipped]
updating a document
Hello,

I'm trying to update a document in my index. As far as I can tell from the FAQ and other documentation, the only way to do this is to delete the document and add it again. Now, I want to be able to add the document anew but avoid having to re-parse the original file. That is, I want to extract a document from the index (keeping a copy in memory), delete the document from the index, update a field on the in-memory doc, and add the doc to the index once again. I imagine it has to be done something like this:

1. Extract the desired document from the index with a function returning the document (not complete code):

    Document doc;
    String fileNameToGetFromIdx;
    String tmpName;
    for (int i = 0; i < numDocs; i++) {
        if (!indexreader.isDeleted(i)) {
            doc = indexreader.document(i);
            if (doc != null) {
                tmpName = doc.get("pathToFileOnDisk");
                if (tmpName.equals(fileNameToGetFromIdx)) {
                    indexreader.delete(i);
                    return doc;
                }
            }
        }
    }

This would leave me with the document in memory and the document deleted from the index -- right?

2. Update a field in the document by adding the field again:

    doc.add(Field.Text("someField", "value"));

The API for Document says that if multiple fields exist with the same name, the value of the last field added is returned when getting the value.

3. Add the document to the index again:

    indexwriter.addDocument(doc);

Is this a correct way of doing an update? I can't seem to get it to work properly. The reason for trying it this way is to avoid having to re-index the original file; I have many large PDF documents which take some time to index :-(

Bottom line: when I do a search and a list of results is displayed to the user, the user clicks the title of a document and the document is shown.
Before the document is shown, I execute an update function to increase the number of times the document has been visited -- hence I need to update the "visited" field of that particular document in the index. Uhm -- hope you get the idea :-)

Any suggestions and comments are very welcome. Thanks in advance.

BTW: does anyone know if an update function is planned to be added to Lucene? Would it be hard to write yourself?

/Lars Hammer
www.dezide.com
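The delete-and-re-add update described above can be sketched with a Map standing in for the index (all names are illustrative, not Lucene API). The point of the pattern is that the stored fields -- including the expensively parsed PDF text -- are reused, so only the "visited" field changes.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of update-by-delete-and-re-add: fetch the stored document,
// bump its "visited" field, delete it, and re-add it with the old content.
public class VisitCounter {
    final Map<String, Map<String, String>> index = new HashMap<>(); // path -> fields

    public void add(String path, String content, int visited) {
        Map<String, String> doc = new HashMap<>();
        doc.put("pathToFileOnDisk", path);
        doc.put("content", content);               // parsed once, reused forever
        doc.put("visited", Integer.toString(visited));
        index.put(path, doc);
    }

    public int recordVisit(String path) {
        Map<String, String> doc = index.remove(path);      // delete from the index
        int visited = Integer.parseInt(doc.get("visited")) + 1;
        add(path, doc.get("content"), visited);            // re-add, content reused
        return visited;
    }

    public static void main(String[] args) {
        VisitCounter idx = new VisitCounter();
        idx.add("/docs/a.pdf", "parsed pdf text", 0);
        idx.recordVisit("/docs/a.pdf");
        System.out.println(idx.recordVisit("/docs/a.pdf")); // visited twice -> 2
    }
}
```

One caveat the thread hints at: in real Lucene, re-adding a retrieved document only works for fields that were stored, since unstored field values cannot be read back out of the index.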
Question on Lucene when indexing big pdf files
Hi,

I am a newbie with Lucene. I want to index all of my hard disk's contents for searching, including HTML, PDF, Word files, etc. But I have run into a problem when I try to index PDF files, and I need your help. My environment is lucene-1.3-rc (lucene-1.2 has also been tried), jdk1.4.02, pdfbox-0.62.

I try to index all my PDFs. There seems to be no error when executing the indexing (I use StandardAnalyzer; you can refer to my sources in the attachment). But when I search using a keyword, I find a lot of useless results: the PDFs don't contain the content I want. Can you help me with this problem? In my attachment, I put my source files and the test PDF files. After I use my program to index these three PDF files, everything seems all right. But when I search for the keyword "cisco" against the index, I get three Hits as the result, and two of them do not contain the keyword "cisco" -- they are useless. I wondered if pdfbox was at fault, so I printed out the indexed content; it also does not contain the keyword "cisco". I use Luke and my searcher program as the searching client, and the client itself seems fine. Can anyone help me? Or any comments on this problem? Everyone is welcome. My email is [EMAIL PROTECTED]

suny

PS: sorry, I cannot attach the files -- this mailing list cannot hold attachments? So I have to put my source codes here. My three test PDF files are 100k in total; if someone would like to help me test them, I would really appreciate it.
IndexPDF.java:

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import java.io.File;
    import java.util.Date;

    /**
     * FileName:
     * User: Administrator
     * Date: 2003-8-19
     * Time: 23:18:30
     * Functions:
     */
    public class IndexPDF {
        File indexFiles; // the file or directory we want to index

        public static void indexDocs(IndexWriter writer, File file) throws Exception {
            if (file.isDirectory()) {
                String[] files = file.list();
                for (int i = 0; i < files.length; i++) {
                    indexDocs(writer, new File(file, files[i]));
                }
            } else if (file.getPath().endsWith(".pdf")) {
                System.out.println("adding " + file);
                writer.addDocument(PdfDocument.Document(file));
            } else {
                System.out.println("Ignoring " + file);
            }
        }

        public static void main(String args[]) throws Exception {
            if (args.length != 1) {
                System.out.println("Usage: IndexPDF ");
                return;
            }
            try {
                Date start = new Date();
                IndexWriter writer = new IndexWriter("E:/Index", new StandardAnalyzer(), true);
                indexDocs(writer, new File(args[0]));
                writer.optimize();
                writer.close();
                Date end = new Date();
                System.out.println(end.getTime() - start.getTime());
                System.out.println(" total milliseconds");
            } catch (Exception e) {
                System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
            }
        }
    }

PdfDocument.java:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.pdfbox.pdfparser.PDFParser;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.OutputStreamWriter;
    import net.vicp.resshare.weblucene.document.PDFDocument;

    /**
     * FileName:
     * User: Administrator
     * Date: 2003-8-19
     * Time: 23:20:54
     * Functions:
     */
    public class PdfDocument {

        public static Document Document(File f) {
            // Create a fresh Lucene Document per file. (A single shared static
            // instance would accumulate fields from every previously indexed
            // file, making unrelated documents match each other's keywords.)
            Document doc = new Document();
            // set relative path to the path field in lucene
            doc.add(Field.UnIndexed("path", f.getPath()));
            System.out.println("Path is " + f.getPath());
            // use 1 as the limit for temporary use
            doc.add(Field.Text("content", getPDFContent(f)));
            doc.add(Field.UnIndexed("filetype", "pdf"));
            doc.add(Field.UnIndexed("title", f.getName()));
            return doc;
        }

        /**
         * Get the text content from the specified PDF file.
         * @param f the PDF we should extract the content from
         * @return the string containing the PDF content
         */
        private static String getPDFContent(File f) {
            byte[] contents = null;
            try {
                FileInputStream is = new FileInputStream(f);
                PDFParser parser = new PDFParser(is);
                parser.parse();
                PDDocument pdDoc = parser.getPDDocument();
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                OutputStreamWriter writer = new OutputStreamWriter(out);
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.writeText(pdDoc.getDocument(), writer);
                writer.close();
                contents = out.toByteArray();
            } catch (Exception e) {
                e.printStackTrace();
                return "";
            }
            String ts = new String(contents);
            System.out.println("the string length is " + contents.length + "\n");
            return ts;
        }
    }