Re: Similar Document Search
Hi all,

it seems there are quite a few people looking for similar features, i.e. (a) document identity and (b) forward indexing. So far we have hacked (a) by using a wrapper implementing equals/hashCode based on a unique field, but of course that assumes maintaining a unique field in the index. (b) is something we haven't tackled yet, but plan to.

The source code for Mark's thesis seems to be part of the Haystack distribution. The comments in the files put it under the Apache license, which seems to make it a good candidate for inclusion at least in the Lucene sandbox -- although I haven't tried it myself yet, it sounds like a good candidate for us to use.

Since the Haystack source is a bit larger and I actually couldn't get the download at the moment, here is a copy of the relevant bit grabbed from one of my colleague's machines: http://www.itee.uq.edu.au/~pbecker/luceneHaystack.tar.gz (22kb). Note that this is just a tarball of src/org/apache/lucene out of some Haystack source. Untested, unmodified. I'd love to see something like this supported in the Lucene context where people might actually find it :-)

Peter

Gregor Heinrich wrote:

Hello Terry,

Lucene can do forward indexing, as Mark Rosen outlines in his Master's thesis: http://citeseer.nj.nec.com/rosen03email.html. We use a similar approach for (probabilistic) latent semantic analysis and vector space searches. However, the solution is not completely settled yet, therefore no code at this time...

Best regards,

Gregor

-----Original Message-----
From: Peter Becker [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 19, 2003 3:06 AM
To: Lucene Users List
Subject: Re: Similar Document Search

Hi Terry,

we have been thinking about the same problem, and in the end we decided that most likely the only good solution is to keep a non-inverted index, i.e. a map from the documents to their terms.
Then you can look up the top terms for a document and query for other documents matching some of them (where you get the usual question of what is actually interesting: high frequency, low frequency, or the mid range). Indexing would probably be quite expensive, since Lucene doesn't seem to support in-place changes to the index, and the index for the terms would change all the time. We haven't implemented it yet, but it shouldn't be hard to code. I just wouldn't expect good performance when indexing large collections.

Peter

Terry Steichen wrote:
> Is it possible without extensive additional coding to use Lucene to conduct a search based on a document rather than a query? (One use of this would be to refine a search by selecting one of the hits returned from the initial query and subsequently retrieving other documents "like" the selected one.)
>
> Regards,
>
> Terry

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
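The non-inverted ("forward") index idea from this thread can be sketched with plain JDK collections standing in for Lucene. All class and method names below are invented for illustration: each document maps to its term frequencies, and "more like this" is answered by ranking other documents by how many terms they share with the source document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy forward index: document id -> term -> frequency.
// Similar documents are found by counting overlapping terms.
public class ForwardIndex {
    private final Map<String, Map<String, Integer>> docTerms = new HashMap<>();

    public void add(String docId, String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) freqs.merge(term, 1, Integer::sum);
        }
        docTerms.put(docId, freqs);
    }

    // Rank every other document by the number of terms it shares with docId.
    public List<String> similarTo(String docId) {
        Map<String, Integer> source = docTerms.get(docId);
        Map<String, Integer> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : docTerms.entrySet()) {
            if (e.getKey().equals(docId)) continue;
            int shared = 0;
            for (String term : source.keySet()) {
                if (e.getValue().containsKey(term)) shared++;
            }
            if (shared > 0) scores.put(e.getKey(), shared);
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> scores.get(b) - scores.get(a));
        return ranked;
    }

    public static void main(String[] args) {
        ForwardIndex idx = new ForwardIndex();
        idx.add("a", "lucene index search");
        idx.add("b", "lucene search engine");
        idx.add("c", "cooking recipes");
        System.out.println(idx.similarTo("a")); // "b" shares two terms, "c" none
    }
}
```

As the thread notes, the interesting design question is which shared terms to count: raw overlap favors high-frequency terms, which is usually the least informative choice.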
Re: Fastest batch indexing with 1.3-rc1
Leo Galambos wrote:
> Isn't it better for Dan to skip the optimization phase before merging? I am not sure, but he could save some time on this (if he has enough file handles for that, of course).

It depends. If you have ten machines, each with a single disk, that you use for indexing in parallel, and you copy all of the indexes to a single machine for the final merge, then you're probably better off optimizing each index before copying it and merging it with the others, in order to maximize the amount of work done in parallel, using all disk spindles. However, if instead you have one machine with ten processors and a filesystem striped across ten disks, then, in theory, optimizing before merging might not help much, since the single-threaded final merge could use all ten disks at once. Even then, though, the final merge would be doing serially some CPU work that would have been done in parallel in the first configuration. In general I think it's best to do as much work as possible in parallel.

> What strategy do you use in "nutch"?

Nutch builds optimized indexes for each fetched "segment" (n.b., a Nutch segment is different from a Lucene segment) and only merges segment indexes as the final step before deploying them for searching. Nutch has a rolling set of active segments: the oldest are periodically discarded and replaced with newly fetched segments. Before a new set of segments is deployed, duplicate-elimination processing must occur, which marks duplicates as deleted prior to merging the new production indexes.

Doug
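The divide-and-merge scheme Doug describes can be sketched with plain JDK threads, using a sorted term list as a stand-in for an optimized Lucene sub-index (all names here are illustrative, not Lucene API): each worker builds and "optimizes" (sorts) its own sub-index in parallel, and only the final merge runs serially.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Each worker indexes a slice of the documents into its own sub-index
// (here just a sorted list of terms); a single thread merges the results.
public class ParallelIndexer {
    static List<String> buildSubIndex(List<String> docs) {
        List<String> terms = new ArrayList<>();
        for (String doc : docs)
            for (String t : doc.toLowerCase().split("\\W+"))
                if (!t.isEmpty()) terms.add(t);
        Collections.sort(terms); // the "optimize" step, done in parallel
        return terms;
    }

    static List<String> merge(List<List<String>> subIndexes) {
        List<String> all = new ArrayList<>();
        for (List<String> sub : subIndexes) all.addAll(sub);
        Collections.sort(all); // the serial final merge
        return all;
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> slices = Arrays.asList(
            Arrays.asList("lucene index", "batch indexing"),
            Arrays.asList("merge segments", "optimize index"));
        ExecutorService pool = Executors.newFixedThreadPool(slices.size());
        List<Future<List<String>>> futures = new ArrayList<>();
        for (List<String> slice : slices)
            futures.add(pool.submit(() -> buildSubIndex(slice)));
        List<List<String>> subs = new ArrayList<>();
        for (Future<List<String>> f : futures) subs.add(f.get());
        pool.shutdown();
        System.out.println(merge(subs).size()); // 8 terms in total
    }
}
```

Doug's point maps onto this directly: the more work (tokenizing, sorting) that happens inside `buildSubIndex` on separate spindles, the less is left for the serial `merge` step.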
Re: Will failed optimize corrupt an index?
The index should be fine. Lucene index updates are atomic.

Doug

Dan Quaroni wrote:
> My index grew about 7 gigs larger than I projected it would, and it ran out of disk space during optimize. Does Lucene have transactions or anything that would prevent this from corrupting the index, or do I need to generate the index again? Thanks!
Re: Fastest batch indexing with 1.3-rc1
Isn't it better for Dan to skip the optimization phase before merging? I am not sure, but he could save some time on this (if he has enough file handles for that, of course).

What strategy do you use in "nutch"?

THX

-g-

Doug Cutting wrote:
> As the index grows, disk I/O becomes the bottleneck. The default indexing parameters do a pretty good job of optimizing this. But if you have lots of CPUs and lots of disks, you might try building several indexes in parallel, each containing a subset of the documents, then optimize each index and finally merge them all into a single index at the end. But you need lots of I/O capacity for this to pay off.
>
> Doug
>
> Dan Quaroni wrote:
>> Looks like I spoke too soon... As the index gets larger, the time to merge becomes prohibitively high. It appears to increase linearly. Oh well. I guess I'll just have to go with about 3 ms/doc.
Re: Fastest batch indexing with 1.3-rc1
As the index grows, disk I/O becomes the bottleneck. The default indexing parameters do a pretty good job of optimizing this. But if you have lots of CPUs and lots of disks, you might try building several indexes in parallel, each containing a subset of the documents, then optimize each index and finally merge them all into a single index at the end. But you need lots of I/O capacity for this to pay off.

Doug

Dan Quaroni wrote:
> Looks like I spoke too soon... As the index gets larger, the time to merge becomes prohibitively high. It appears to increase linearly. Oh well. I guess I'll just have to go with about 3 ms/doc.
RE: Fastest batch indexing with 1.3-rc1
Looks like I spoke too soon... As the index gets larger, the time to merge becomes prohibitively high. It appears to increase linearly. Oh well. I guess I'll just have to go with about 3 ms/doc.
Re: Lucene Index on NFS Server
I don't know the details of how lock files are unreliable over NFS, only that they are. The window of vulnerability, when the lock file is used, is when one JVM is opening all of the files in an index while another is completing an update at the same time. If the updating machine removes some files after the opening machine has read the 'segments' file but before it has opened all of the files, then the open will fail with a FileNotFound exception. If your application can guarantee that indexes are not opened while an update is completing (under IndexWriter.close(), or IndexReader.close() for deletions), then this will not be a problem.

Doug

Morus Walter wrote:
> Doug Cutting writes:
>>> Can I have a Lucene index on an NFS filesystem without problems (access is read-only)?
>> So long as all access is read-only, there should not be a problem. Keep in mind, however, that lock files are known to not work correctly over NFS.
>
> Hmm. Sorry, I was a bit imprecise (at least in the quoted part), so I'm not sure if I got that correctly. Access over NFS is read-only, but there would be write access on the NFS server itself (local filesystem). Is this OK? Or should I use an "update a copy of the index and exchange indexes afterwards" strategy?
>
> TIA
> Morus
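One defensive pattern suggested by the race Doug describes -- not part of Lucene itself, just a sketch -- is to retry the open when it collides with a completing update: if a file listed in 'segments' has disappeared by the time we try to open it, wait briefly and start over.

```java
import java.io.FileNotFoundException;
import java.util.concurrent.Callable;

// Retry an index-open that may race with a writer finishing an update:
// if a file named in 'segments' disappears mid-open, wait and try again.
public class RetryOpen {
    public static <T> T withRetries(Callable<T> open, int attempts, long sleepMs)
            throws Exception {
        FileNotFoundException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return open.call();
            } catch (FileNotFoundException e) {
                last = e;            // another process removed a file; retry
                Thread.sleep(sleepMs);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated open that fails once (file removed by an updater),
        // then succeeds on the retry.
        int[] calls = {0};
        String reader = withRetries(() -> {
            if (calls[0]++ == 0) throw new FileNotFoundException("_1.fnm");
            return "reader";
        }, 3, 10);
        System.out.println(reader);
    }
}
```

This only papers over the window; the robust fix, as Doug says, is to guarantee that opens and update-completions never overlap.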
Re: Searching while optimizing
That is an old FAQ item. Lucene has been thread safe for a while now.

Doug

Steve Rajavuori wrote:
> This seems to contradict an item from the Lucene FAQ:
>
> << 41. Can I modify the index while performing ongoing searches? Yes and no. At the time of writing this FAQ (June 2001), Lucene is not thread safe in this regard. Here is a quote from Doug Cutting, the creator of Lucene: The problems arise only when you add documents to or optimize an index, and then search with an IndexReader that was constructed before those changes to the index were made. A possible workaround is to perform the index updates in a parallel and separate index and switch to the new index when its updating is done. The switching may be done, for example, using a variable that points to the directory of the current active index. Since searches have a relatively short lifetime, you may discard (or reuse) the old index a short time after performing the switch (this grace period should be a little longer if you want to let all searches that involve paging through the hit list complete with consistent results). >>
>
> Can you explain further?

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 31, 2003 2:31 PM
To: Lucene Users List
Subject: Re: Searching while optimizing

Aviran Mordo wrote:
> Is it possible and safe to search an index while another thread adds documents or optimizes the same index?

Yes.
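The FAQ's index-switching workaround can be sketched with an atomic reference (the class and field names here are illustrative): searches take a snapshot of the "current index" pointer and keep using it for their lifetime, while the updater rebuilds a separate index and flips the pointer atomically when done.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the FAQ workaround: a variable points to the directory of the
// current active index; updates happen in a separate index, then switch.
public class IndexSwitch {
    private final AtomicReference<String> currentIndexDir;

    public IndexSwitch(String initialDir) {
        currentIndexDir = new AtomicReference<>(initialDir);
    }

    // A search takes a snapshot and uses that directory for its whole lifetime,
    // so in-flight searches are unaffected by a concurrent switch.
    public String snapshotForSearch() {
        return currentIndexDir.get();
    }

    // The updater builds newDir elsewhere, then flips the pointer. The old
    // directory can be discarded after a grace period for in-flight searches.
    public String switchTo(String newDir) {
        return currentIndexDir.getAndSet(newDir);
    }

    public static void main(String[] args) {
        IndexSwitch s = new IndexSwitch("index-a");
        String inFlight = s.snapshotForSearch(); // a search started on index-a
        String old = s.switchTo("index-b");      // an update completes
        System.out.println(inFlight + " " + old + " " + s.snapshotForSearch());
    }
}
```

As Doug notes, this workaround is no longer necessary for thread safety, but the pattern is still useful when you want searches to see a consistent index for paging.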
Fastest batch indexing with 1.3-rc1
Hey there. What's the fastest way to do a batch index with Lucene 1.3-rc1 on a dual- or quad-processor box? The files I'm indexing are very easy to divide among multiple threads. Here's what I've done at this point:

- Each thread has its own IndexWriter writing to its own RAMDirectory.
- Every so many documents, I mergeIndexes the thread's index into the main disk index.
- The thread writers have a mergeFactor of 50. The disk IndexWriter has a mergeFactor of 30.
- I call optimize only on the main disk index, and only once at the very end.

Just doing this has shown great improvements for me, but I want to squeeze out every bit of performance I can. What's the fastest way to mergeIndexes? Should I use a low mergeFactor when working with RAMDirectorys? Should I optimize a thread's index before I merge it into the main one? Thanks!
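The setup described above can be sketched with JDK-only stand-ins: a plain list plays the role of each thread's RAMDirectory buffer, and a synchronized list plays the role of the main disk index. None of the names below are Lucene API; it only shows the buffering-and-periodic-merge shape.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Each worker buffers documents in memory (its "RAMDirectory") and
// periodically merges the buffer into the shared main index.
public class BatchIndexer {
    static final int FLUSH_EVERY = 100; // docs buffered per thread before a merge
    final List<String> mainIndex = Collections.synchronizedList(new ArrayList<>());

    class Worker implements Runnable {
        final List<String> docs;
        Worker(List<String> docs) { this.docs = docs; }
        public void run() {
            List<String> ramBuffer = new ArrayList<>();
            for (String doc : docs) {
                ramBuffer.add(doc);               // "addDocument" to the RAM buffer
                if (ramBuffer.size() >= FLUSH_EVERY) {
                    mainIndex.addAll(ramBuffer);  // "mergeIndexes" into the disk index
                    ramBuffer.clear();
                }
            }
            if (!ramBuffer.isEmpty()) mainIndex.addAll(ramBuffer); // final flush
        }
    }

    public static void main(String[] args) throws Exception {
        BatchIndexer b = new BatchIndexer();
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 250; i++) docs.add("doc" + i);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(b.new Worker(docs.subList(0, 125)));
        pool.submit(b.new Worker(docs.subList(125, 250)));
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(b.mainIndex.size()); // all 250 docs reach the main index
    }
}
```

The tuning questions in the mail map onto `FLUSH_EVERY`: flushing more often means smaller, cheaper merges but more contention on the shared index; flushing rarely means fewer, larger merges.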
Re: Question on Lucene when indexing big pdf files
Hi,

When I use Luke to look at my index, it seems all right. The content in the index looks fine; all of the content has been extracted from the PDF files. I copy text from the PDF content (namely the "content" field) and search for it as a keyword, but I cannot find the keyword either, so I think there is nothing wrong with the pdfbox side. Would you please help me test this situation? I have three PDF files (100k in total); after I index them, I get useless results when I use "cisco" as the keyword. If you would like to help me, I will send you my test source files and the three PDF files. I would really appreciate your help.

Ben Litchfield <[EMAIL PROTECTED]> wrote:
>> "cisco". I use Luke and my searcher program as the searching client,
>> it seems no problem. Can anyone help me? Or any comments on this
>
> When you use Luke to look at your index, does it show the correct contents for those documents?
>
> Ben
RE: Similar Document Search
Hello Terry,

Lucene can do forward indexing, as Mark Rosen outlines in his Master's thesis: http://citeseer.nj.nec.com/rosen03email.html. We use a similar approach for (probabilistic) latent semantic analysis and vector space searches. However, the solution is not completely settled yet, therefore no code at this time...

Best regards,

Gregor

-----Original Message-----
From: Peter Becker [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 19, 2003 3:06 AM
To: Lucene Users List
Subject: Re: Similar Document Search

Hi Terry,

we have been thinking about the same problem, and in the end we decided that most likely the only good solution is to keep a non-inverted index, i.e. a map from the documents to their terms. Then you can look up the top terms for a document and query for other documents matching some of them (where you get the usual question of what is actually interesting: high frequency, low frequency, or the mid range). Indexing would probably be quite expensive, since Lucene doesn't seem to support in-place changes to the index, and the index for the terms would change all the time. We haven't implemented it yet, but it shouldn't be hard to code. I just wouldn't expect good performance when indexing large collections.

Peter

Terry Steichen wrote:
> Is it possible without extensive additional coding to use Lucene to conduct a search based on a document rather than a query? (One use of this would be to refine a search by selecting one of the hits returned from the initial query and subsequently retrieving other documents "like" the selected one.)
>
> Regards,
>
> Terry
Re: Question on Lucene when indexing big pdf files
> "cisco". I use Luke and my searcher program as the searching client, > it seems no problem. Can anyone help me? Or any comments on this When you use luke to look at your index does it show the correct contents for those documents? Ben - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Question on Lucene when indexing big pdf files
I don't know if it can help you, but here is my code to extract the text of a PDF document:

    /**
     * Extracts text from a PDF document.
     *
     * @param in the InputStream representing the PDF file
     * @return the text in the file
     */
    public String extractText(InputStream in) {
        String s = null;
        try {
            PDFTextStripper stripper = new PDFTextStripper();
            PDFParser parser = new PDFParser(in);
            parser.parse();
            s = stripper.getText(parser.getDocument());
        } catch (Throwable t) {
            t.printStackTrace();
        }
        return s;
    }

On Wednesday, August 20, 2003, at 08:59 AM, Yang Sun wrote:

Hi,

I am a newbie with Lucene. I want to index all of my hard disk's contents for searching, including HTML, PDF, Word files, etc. But I have run into a problem when I try to index PDF files, and I need your help. My environment is lucene-1.3-rc (lucene-1.2 has also been tried), jdk1.4.02, pdfbox-0.62. I try to index all my PDFs. There seems to be no error when executing the indexing (I use StandardAnalyzer; you can refer to my sources in the attachment). But when I search using a keyword, I find a lot of useless results: the PDFs don't contain the content I want. Can you help me with this problem? In my attachment, I put my source files and the test PDF files. After I use my program to index these three PDF files, everything seems all right. But when I search for the keyword "cisco" against the index, I get three Hits as the result, and two of them do not contain the keyword "cisco" -- they are useless. I wondered if pdfbox was at fault, so I printed out the indexed content; it also does not contain the keyword "cisco". I use Luke and my searcher program as the searching client, and the client itself seems fine. Can anyone help me? Or any comments on this problem? Everyone is welcome. My email is [EMAIL PROTECTED]

suny

PS: sorry, I cannot attach the files -- this mailing list cannot hold attachments? So I have to put my source codes here.
My three test PDF files are 100k in total; if someone would like to help me test them, I would really appreciate it.

[IndexPDF.java and PdfDocument.java listings snipped]
updating a document
Hello,

I'm trying to update a document in my index. As far as I can tell from the FAQ and other documentation, the only way to do this is to delete the document and add it again. Now, I want to be able to add the document anew but avoid having to re-parse the original file. That is, I want to extract a document from the index (keeping a copy in memory), delete the document from the index, update a field on the in-memory doc, and add the doc to the index once again. I imagine it has to be done something like this:

1. Extract the desired document from the index with a function returning the document (not complete code):

    Document doc;
    String fileNameToGetFromIdx;
    String tmpName;
    for (int i = 0; i < numDocs; i++) {
        if (!indexreader.isDeleted(i)) {
            doc = indexreader.document(i);
            if (doc != null) {
                tmpName = doc.get("pathToFileOnDisk");
                if (tmpName.equals(fileNameToGetFromIdx)) {
                    indexreader.delete(i);
                    return doc;
                }
            }
        }
    }

This would leave me with the document in memory and the document deleted from the index -- right?

2. Update a field in the document by adding the field again:

    doc.add(Field.Text("someField", "value"));

The API for Document says that if multiple fields exist with the same name, the value of the last field added is returned when getting the value.

3. Add the document to the index again:

    indexwriter.addDocument(doc);

Is this a correct way of doing an update? I can't seem to get it to work properly. The reason for trying it this way is to avoid having to re-index the original file; I have many large PDF documents which take some time to index :-(

Bottom line: when I do a search and a list of results is displayed to the user, the user clicks the title of a document and the document is shown.
Before the document is shown, I execute an update function to increase the number of times the document has been visited -- hence I need to update the "visited" field of that particular document in the index. Uhm -- hope you get the idea :-)

Any suggestions and comments are very welcome. Thanks in advance.

BTW: does anyone know if an update function is planned to be added to Lucene? Would it be hard to write yourself?

/Lars Hammer
www.dezide.com
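The delete-and-re-add update described above can be sketched with a Map standing in for the index (all names are illustrative, not Lucene API). The point of the pattern is that the stored fields -- including the expensively parsed PDF text -- are reused, so only the "visited" field changes.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of update-by-delete-and-re-add: fetch the stored document,
// bump its "visited" field, delete it, and re-add it with the old content.
public class VisitCounter {
    final Map<String, Map<String, String>> index = new HashMap<>(); // path -> fields

    public void add(String path, String content, int visited) {
        Map<String, String> doc = new HashMap<>();
        doc.put("pathToFileOnDisk", path);
        doc.put("content", content);               // parsed once, reused forever
        doc.put("visited", Integer.toString(visited));
        index.put(path, doc);
    }

    public int recordVisit(String path) {
        Map<String, String> doc = index.remove(path);      // delete from the index
        int visited = Integer.parseInt(doc.get("visited")) + 1;
        add(path, doc.get("content"), visited);            // re-add, content reused
        return visited;
    }

    public static void main(String[] args) {
        VisitCounter idx = new VisitCounter();
        idx.add("/docs/a.pdf", "parsed pdf text", 0);
        idx.recordVisit("/docs/a.pdf");
        System.out.println(idx.recordVisit("/docs/a.pdf")); // visited twice -> 2
    }
}
```

One caveat the thread hints at: in real Lucene, re-adding a retrieved document only works for fields that were stored, since unstored field values cannot be read back out of the index.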
Question on Lucene when indexing big pdf files
Hi,

I am a newbie with Lucene. I want to index all of my hard disk's contents for searching, including HTML, PDF, Word files, etc. But I have run into a problem when I try to index PDF files, and I need your help. My environment is lucene-1.3-rc (lucene-1.2 has also been tried), jdk1.4.02, pdfbox-0.62.

I try to index all my PDFs. There seems to be no error when executing the indexing (I use StandardAnalyzer; you can refer to my sources in the attachment). But when I search using a keyword, I find a lot of useless results: the PDFs don't contain the content I want. Can you help me with this problem? In my attachment, I put my source files and the test PDF files. After I use my program to index these three PDF files, everything seems all right. But when I search for the keyword "cisco" against the index, I get three Hits as the result, and two of them do not contain the keyword "cisco" -- they are useless. I wondered if pdfbox was at fault, so I printed out the indexed content; it also does not contain the keyword "cisco". I use Luke and my searcher program as the searching client, and the client itself seems fine. Can anyone help me? Or any comments on this problem? Everyone is welcome. My email is [EMAIL PROTECTED]

suny

PS: sorry, I cannot attach the files -- this mailing list cannot hold attachments? So I have to put my source codes here. My three test PDF files are 100k in total; if someone would like to help me test them, I would really appreciate it.
IndexPDF.java:

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import java.io.File;
    import java.util.Date;

    /**
     * FileName:
     * User: Administrator
     * Date: 2003-8-19
     * Time: 23:18:30
     * Functions:
     */
    public class IndexPDF {
        File indexFiles; // the file or directory we want to index

        public static void indexDocs(IndexWriter writer, File file) throws Exception {
            if (file.isDirectory()) {
                String[] files = file.list();
                for (int i = 0; i < files.length; i++) {
                    indexDocs(writer, new File(file, files[i]));
                }
            } else if (file.getPath().endsWith(".pdf")) {
                System.out.println("adding " + file);
                writer.addDocument(PdfDocument.Document(file));
            } else {
                System.out.println("Ignoring " + file);
            }
        }

        public static void main(String args[]) throws Exception {
            if (args.length != 1) {
                System.out.println("Usage: IndexPDF ");
                return;
            }
            try {
                Date start = new Date();
                IndexWriter writer = new IndexWriter("E:/Index", new StandardAnalyzer(), true);
                indexDocs(writer, new File(args[0]));
                writer.optimize();
                writer.close();
                Date end = new Date();
                System.out.println(end.getTime() - start.getTime());
                System.out.println(" total milliseconds");
            } catch (Exception e) {
                System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
            }
        }
    }

PdfDocument.java:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.pdfbox.pdfparser.PDFParser;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.OutputStreamWriter;
    import net.vicp.resshare.weblucene.document.PDFDocument;

    /**
     * FileName:
     * User: Administrator
     * Date: 2003-8-19
     * Time: 23:20:54
     * Functions:
     */
    public class PdfDocument {

        public static Document Document(File f) {
            // Create a fresh Lucene Document per file. (A single shared static
            // instance would accumulate fields from every previously indexed
            // file, making unrelated documents match each other's keywords.)
            Document doc = new Document();
            // set relative path to the path field in lucene
            doc.add(Field.UnIndexed("path", f.getPath()));
            System.out.println("Path is " + f.getPath());
            // use 1 as the limit for temporary use
            doc.add(Field.Text("content", getPDFContent(f)));
            doc.add(Field.UnIndexed("filetype", "pdf"));
            doc.add(Field.UnIndexed("title", f.getName()));
            return doc;
        }

        /**
         * Get the text content from the specified PDF file.
         * @param f the PDF we should extract the content from
         * @return the string containing the PDF content
         */
        private static String getPDFContent(File f) {
            byte[] contents = null;
            try {
                FileInputStream is = new FileInputStream(f);
                PDFParser parser = new PDFParser(is);
                parser.parse();
                PDDocument pdDoc = parser.getPDDocument();
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                OutputStreamWriter writer = new OutputStreamWriter(out);
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.writeText(pdDoc.getDocument(), writer);
                writer.close();
                contents = out.toByteArray();
            } catch (Exception e) {
                e.printStackTrace();
                return "";
            }
            String ts = new String(contents);
            System.out.println("the string length is " + contents.length + "\n");
            return ts;
        }
    }