RE: Time to index documents
Hetan,

If you are using a corpus with multiple editors, I suggest that you run a cleaner like Tidy over the files first, as there may be weird stuff appearing in the HTML.

sv

On Thu, 26 Aug 2004, Karthik N S wrote:
> Hi Hetan,
>
> That's the major problem of non-standardized tags in the HTML documents
> you are indexing, resulting in the lag time taken by the indexing process.
>
> You can tweak the HTMLParser.jj file within lucene.zip under 'demo/html'
> [you have to have some knowledge of JavaCC for this].
>
> Karthik
>
> -----Original Message-----
> From: Hetan Shah [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 26, 2004 3:01 AM
> To: Lucene Users List
> Subject: Time to index documents
>
> Hello all,
>
> Is there a way to reduce the indexing time taken when the indexer is
> indexing about 30,000+ files? It is roughly taking around 6-7 hours to
> do this. I am using the IndexHTML class to create the index out of HTML
> files.
>
> Another issue that I see is that every once in a while I get the
> following output on the screen:
>
> adding ../31/1104852.html
> Parse Aborted: Encountered "\"" at line 7, column 1.
> Was expecting one of:
>  ...
>  "=" ...
>  ...
>
> Any suggestions on preventing this from happening?
>
> Thanks in advance.
> -H

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
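For reference, a minimal sketch of what "run Tidy first" can look like with the JTidy library (class `org.w3c.tidy.Tidy`). This is an assumption-laden example, not code from the thread: the cleaned copy with a `.clean` suffix is invented for illustration, and you would feed that copy to the indexer instead of the raw file.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.w3c.tidy.Tidy;  // JTidy: a Java port of HTML Tidy

public class CleanBeforeIndex {
    public static void main(String[] args) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);          // suppress per-file progress output
        tidy.setShowWarnings(false);  // non-standard tags are expected; don't log each one

        // Clean the raw file into a well-formed copy, then index the copy
        // (the ".clean" suffix is just for this example).
        FileInputStream in = new FileInputStream(args[0]);
        FileOutputStream out = new FileOutputStream(args[0] + ".clean");
        tidy.parse(in, out);
        in.close();
        out.close();
    }
}
```

This keeps the demo HTMLParser.jj untouched: instead of teaching the JavaCC grammar about every malformed tag, the input is normalized before it ever reaches the parser.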
RE: Time to index documents
Hi Hetan,

That's the major problem of non-standardized tags in the HTML documents you are indexing, resulting in the lag time taken by the indexing process.

You can tweak the HTMLParser.jj file within lucene.zip under 'demo/html' [you have to have some knowledge of JavaCC for this].

Karthik

-----Original Message-----
From: Hetan Shah [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 26, 2004 3:01 AM
To: Lucene Users List
Subject: Time to index documents

Hello all,

Is there a way to reduce the indexing time taken when the indexer is indexing about 30,000+ files? It is roughly taking around 6-7 hours to do this. I am using the IndexHTML class to create the index out of HTML files.

Another issue that I see is that every once in a while I get the following output on the screen:

adding ../31/1104852.html
Parse Aborted: Encountered "\"" at line 7, column 1.
Was expecting one of:
 ...
 "=" ...
 ...

Any suggestions on preventing this from happening?

Thanks in advance.
-H
Content from multiple folders in single index
Hi,

I suspect this is an easy one, but I didn't see a reference in the FAQs so I thought I'd ask. I have a file structure like this:

web
 - pages
 - downloads (PDF docs)
 - include

I want to index the HTML in pages and the PDFs in downloads, but not the HTML in include, so I don't want to start my index at web. I've modified the IndexHTML class in the demo to handle the PDFs. What is the best way to do this?

Thanks for your suggestions.

John
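One way to keep `web/include` out of the index is to walk only the folders you actually want and filter by extension. This sketch collects the candidate files with plain `java.io`; the `println` stands in for whatever IndexHTML/PDF indexing code you already have (the class and method names are invented for the example).

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class SelectiveIndexer {
    // Collect files with the given extension under a root, recursing into subfolders.
    static List<File> collect(File root, String ext, List<File> out) {
        File[] entries = root.listFiles();
        if (entries == null) return out;  // not a directory, or unreadable
        for (File f : entries) {
            if (f.isDirectory()) collect(f, ext, out);
            else if (f.getName().toLowerCase().endsWith(ext)) out.add(f);
        }
        return out;
    }

    public static void main(String[] args) {
        List<File> toIndex = new ArrayList<File>();
        // Only the folders we want -- web/include is simply never visited.
        collect(new File("web/pages"), ".html", toIndex);
        collect(new File("web/downloads"), ".pdf", toIndex);
        for (File f : toIndex) {
            System.out.println("indexing " + f);  // replace with your IndexHTML/PDF code
        }
    }
}
```

The design choice is to invert the demo's approach: instead of starting at `web` and trying to exclude `include`, you never descend into it in the first place, so no exclusion logic can leak.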
Re: Time to index documents
jGuru explanation: http://www.jguru.com/faq/view.jsp?EID=1074228

I have no sample code for Neko; I think Nutch uses it, though. For Tidy, you can look at the ant contribution in the sandbox:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/ant/src/main/org/apache/lucene/ant/HtmlDocument.java?rev=1.3&view=markup

HTH,
sv

On Wed, 25 Aug 2004, Hetan Shah wrote:
> Do you have any pointers to sample code for them?
> Would highly appreciate it.
> Thanks.
> -H
>
> Stephane James Vaucher wrote:
> > I don't think that the demo parser is meant as a production system
> > component. You can look at Tidy or NekoHTML. They clean up your HTML
> > and are probably optimised.
> >
> > sv
Re: Time to index documents
Do you have any pointers to sample code for them? Would highly appreciate it.

Thanks.
-H

Stephane James Vaucher wrote:
> I don't think that the demo parser is meant as a production system
> component. You can look at Tidy or NekoHTML. They clean up your HTML
> and are probably optimised.
>
> sv
Re: Time to index documents
I don't think that the demo parser is meant as a production system component. You can look at Tidy or NekoHTML. They clean up your HTML and are probably optimised.

sv

On Wed, 25 Aug 2004, Hetan Shah wrote:
> Hello all,
>
> Is there a way to reduce the indexing time taken when the indexer is
> indexing about 30,000+ files? It is roughly taking around 6-7 hours to
> do this. I am using the IndexHTML class to create the index out of HTML
> files.
>
> Another issue that I see is that every once in a while I get the
> following output on the screen:
>
> adding ../31/1104852.html
> Parse Aborted: Encountered "\"" at line 7, column 1.
> Was expecting one of:
>  ...
>  "=" ...
>  ...
>
> Any suggestions on preventing this from happening?
>
> Thanks in advance.
> -H
Time to index documents
Hello all,

Is there a way to reduce the indexing time taken when the indexer is indexing about 30,000+ files? It is roughly taking around 6-7 hours to do this. I am using the IndexHTML class to create the index out of HTML files.

Another issue that I see is that every once in a while I get the following output on the screen:

adding ../31/1104852.html
Parse Aborted: Encountered "\"" at line 7, column 1.
Was expecting one of:
 ...
 "=" ...
 ...

Any suggestions on preventing this from happening?

Thanks in advance.
-H
Re: How not to show results with the same score?
On Wednesday 25 August 2004 12:21, B. Grimm [Eastbeam GmbH] wrote:
> hi there,
>
> i browsed through the list and ran some different searches, but i can't
> find what i'm looking for.
>
> i have an index which is generated by a bot collecting websites. there
> are sites like www.domain.de/article/1 and www.domain.de/article/1?page=1.
> these different urls have the same content, and when you search for a
> matching word, both are returned, which is correct.
>
> they have exactly the same score because of their content and so on, so
> i would like to know if it's possible "to group by" (mysql, of course)
> the returned score, so that only the first match is collected into
> "Hits" and all following matches with the same score are ignored.
>
> it would be great if anyone has an idea how to do that.

You can implement your own HitCollector and pass it to IndexSearcher.search(). Have a look at the javadocs of the org.apache.lucene.search package; it's quite straightforward. The PriorityQueue from the util package is useful to collect results. For every distinct score you could store an int[] of document nrs in there while collecting the hits. Basically you'll end up implementing your own Hits class.

For URLs that have the same content, it's better to store multiple URLs for the same document. However, this merging is normally done by a crawler, because the same contents mean the same outgoing URLs. Crawlers also keep track of multiple host names resolving to the same IP address. In case you need to crawl and index an intranet or more, have a look at Nutch.

Regards,
Paul Elschot
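The grouping Paul describes can be prototyped in isolation: keep only the first document id seen for each distinct score. In Lucene 1.4 this logic would sit inside a `HitCollector.collect(int doc, float score)` callback; here it is a plain class (the name `FirstPerScore` is made up for the example) so the idea stands on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FirstPerScore {
    // Maps each distinct score to the first doc id collected with that score,
    // in collection order.
    private final Map<Float, Integer> firstDocByScore = new LinkedHashMap<Float, Integer>();

    // In Lucene this would be the body of HitCollector.collect(doc, score).
    public void collect(int doc, float score) {
        Float key = Float.valueOf(score);
        if (!firstDocByScore.containsKey(key)) {
            firstDocByScore.put(key, Integer.valueOf(doc));
        }
    }

    public Map<Float, Integer> results() {
        return firstDocByScore;
    }
}
```

Note that grouping on raw float scores is exact-match only; two near-duplicate pages that score 0.90001 and 0.90002 would both survive, so for real duplicate detection the URL-merging approach Paul mentions is more robust.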
Re: Advanced timestamp usage (or global value storage)
On Aug 25, 2004, at 12:25 PM, Grant Ingersoll wrote:
> I may be confused, as I understand it you said you were interested in
> the last document indexed,

Yes, I see what you meant. I'm sorry.

That's actually an interesting option. Is getting the timestamp of the last document indexed a good enough solution, or must I find the latest timestamp of all indexed documents? I'd have to ponder that for a while.

Avi

--
Avi 'rlwimi' Drissman
[EMAIL PROTECTED]
Argh! This darn mail server is trunca
Re: Advanced timestamp usage (or global value storage)
Avi,

I may be confused; as I understand it, you said you were interested in the last document indexed, and Bernhard's code does that. Lucene adds documents sequentially, so counting backwards from maxDoc() should get you the last indexed document pretty quickly. If all documents were deleted, then this would go through all documents; otherwise, it is going to find it pretty quickly. It doesn't have to traverse all of the documents, it just has to find the "first" document that is not deleted (since we are starting at the end of the list and going backward).

>>> [EMAIL PROTECTED] 8/25/2004 12:01:50 PM >>>
On Aug 25, 2004, at 11:57 AM, Grant Ingersoll wrote:
> You are right, in the worst case, this would be linear,

No, in _all_ cases this would be linear.

> I would bet, that on average, arguably nearly all cases, you would go
> through very few iterations before finding the doc you are interested in

Then you don't understand what I'm trying to do. I'm trying to find the document with the biggest value for the field. That would involve checking the field's value in every document to ensure this.

Avi
Re: Advanced timestamp usage (or global value storage)
On Aug 25, 2004, at 11:57 AM, Grant Ingersoll wrote:
> You are right, in the worst case, this would be linear,

No, in _all_ cases this would be linear.

> I would bet, that on average, arguably nearly all cases, you would go
> through very few iterations before finding the doc you are interested in

Then you don't understand what I'm trying to do. I'm trying to find the document with the biggest value for the field. That would involve checking the field's value in every document to ensure this.

Avi

--
Avi 'rlwimi' Drissman
[EMAIL PROTECTED]
Argh! This darn mail server is trunca
Re: Advanced timestamp usage (or global value storage)
>>> [EMAIL PROTECTED] 8/25/2004 11:50:01 AM >>>
On Aug 25, 2004, at 11:39 AM, Bernhard Messer wrote:
> If you already store the date/time when the doc was indexed, you could
> use the following trick to get the last document added to the index:
>
>    while (--maxDoc > 0) {

Yes, but that's a linear search :(

>>>

You are right; in the worst case this would be linear, but that would require you to delete a lot of documents. I would bet that on average, arguably in nearly all cases, you would go through very few iterations before finding the doc you are interested in.
Re: Advanced timestamp usage (or global value storage)
The more documents match, the slower the search; how long your particular search would take I cannot tell, though -- you should just test it out and see. I never needed to use the trick with a flag field in all documents, but I know others do it.

Otis

--- Avi Drissman <[EMAIL PROTECTED]> wrote:
> On Aug 25, 2004, at 11:39 AM, Bernhard Messer wrote:
> > If you already store the date/time when the doc was indexed, you could
> > use the following trick to get the last document added to the index:
> >
> >    while (--maxDoc > 0) {
>
> Yes, but that's a linear search :(
>
> On Aug 25, 2004, at 11:25 AM, Otis Gospodnetic wrote:
> > What if all Documents in your index contained some flag field + an
> > 'add date' field? Then you could make a query such as flag:1 and sort
> > it by the 'add date' field, taking only the very first hit as the most
> > recently added Document.
>
> That's a very clever approach. I'm currently using Lucene 1.3, so I
> hadn't thought about using the new sorting abilities. I'd need to move
> to 1.4, of course.
>
> A question, though: how efficient is it to make a query that matches
> all documents and then sort it? I'm looking for something as small as I
> can; after all, storing the last date in a file separate from the index
> is O(1)...
>
> Thanks!
>
> Avi
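Otis's flag-plus-sort idea, sketched against the Lucene 1.4 sorting API. The field names `flag` and `addDate` follow the discussion; for a STRING sort to order it correctly, the date would have to be stored in a lexicographically sortable form (e.g. yyyyMMddHHmmss). This is a sketch under those assumptions, not code from the thread.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;

public class LastIndexed {
    // Returns the 'addDate' of the most recently added document, or null
    // when the index is empty.
    public static String lastTimestamp(IndexSearcher searcher) throws Exception {
        // flag:1 matches every document (every doc carries the flag field).
        TermQuery all = new TermQuery(new Term("flag", "1"));
        // Sort descending on the stored date field; hit 0 is the newest doc.
        Sort byDateDesc = new Sort(new SortField("addDate", SortField.STRING, true));
        Hits hits = searcher.search(all, byDateDesc);
        return hits.length() == 0 ? null : hits.doc(0).get("addDate");
    }
}
```

As Otis says below, the cost grows with the number of matching documents, so on a large index this is a convenience, not an O(1) lookup.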
Re: Advanced timestamp usage (or global value storage)
On Aug 25, 2004, at 11:39 AM, Bernhard Messer wrote:
> If you already store the date/time when the doc was indexed, you could
> use the following trick to get the last document added to the index:
>
>    while (--maxDoc > 0) {

Yes, but that's a linear search :(

On Aug 25, 2004, at 11:25 AM, Otis Gospodnetic wrote:
> What if all Documents in your index contained some flag field + an 'add
> date' field? Then you could make a query such as flag:1 and sort it by
> the 'add date' field, taking only the very first hit as the most
> recently added Document.

That's a very clever approach. I'm currently using Lucene 1.3, so I hadn't thought about using the new sorting abilities. I'd need to move to 1.4, of course.

A question, though: how efficient is it to make a query that matches all documents and then sort it? I'm looking for something as small as I can; after all, storing the last date in a file separate from the index is O(1)...

Thanks!

Avi

--
Avi 'rlwimi' Drissman
[EMAIL PROTECTED]
Argh! This darn mail server is trunca
Introduction to Lucene [was Re: worddoucments search]
A collection of links to introductory-level Lucene articles (including one in simplified Chinese and one in Turkish) is available on the Lucene Wiki at:

http://wiki.apache.org/jakarta-lucene/IntroductionToLucene

Steve

Otis Gospodnetic wrote:
> That part you have to do yourself. It is easy: just create a new
> Document, create an appropriate Field, give it a name and the string
> value you got with the textmining.org library, then add the Field to
> your Document, and then add the Document to the index with IndexWriter.
>
> Look at one of the articles about Lucene to get started. I wrote one
> called something like "Introduction to Text Indexing with Lucene". You
> probably want to read that one to get going.
>
> Otis
>
> --- Santosh <[EMAIL PROTECTED]> wrote:
> > I have gone through textmining.org. I am able to extract text in
> > string format, but how can I get it as a Lucene Document?
> >
> > ----- Original Message -----
> > From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Tuesday, August 24, 2004 11:54 PM
> > Subject: Re: worddoucments search
> >
> > As I just answered in a separate email to Ryan -- we used the
> > textmining.org library, too, as an example of something that is easier
> > to use than POI. It's been a while since I wrote that chapter, so it
> > slipped my mind when I replied. Yes, use textmining.org first; you'll
> > be able to include it in your code in 2 minutes. Good stuff.
> >
> > Otis
Re: Advanced timestamp usage (or global value storage)
Avi,

I would prefer the second approach. If you already store the date/time when the doc was indexed, you could use the following trick to get the last document added to the index:

    IndexReader ir = IndexReader.open("/tmp/testindex");
    int maxDoc = ir.maxDoc();

    while (--maxDoc >= 0) {
        if (!ir.isDeleted(maxDoc)) {
            Document doc = ir.document(maxDoc);
            System.out.println(doc.getField("indexDate"));
            break;
        }
    }

What do you think about the implementation? No extra properties, nothing to worry about -- every piece of information is within your index.

regards
Bernhard

Avi Drissman wrote:
> I've used Lucene for a long time, but only in the most basic way. I
> have a custom analyzer and a slightly hacked query parser, but in
> general it's the basic add document/remove document/query documents
> cycle.
>
> In my system, I'm indexing a store of external documents, maintaining
> an index for full-text querying. However, I might be turned off when
> documents are added, and then when I'm restarted, I'm going to need to
> determine the timestamp of the last document added to the index so that
> I can pick up where I left off.
>
> There are three approaches to doing this, two using Lucene. I don't
> know how I would do the two Lucene approaches, or even if they're
> possible.
>
> 1. Just keep a file in parallel with the index, reading and writing the
> timestamp of the last indexed document in it. I know how to do this,
> but I don't like the idea of keeping a separate file.
>
> 2. Drop a timestamp onto each document as it's indexed. I've attached
> timestamp fields to documents in the past so that I could do range
> queries on them. However, I don't know how to do a query like "the
> document with the latest timestamp", or even if that's possible.
>
> 3. Create a dummy document (with some unique field identifier so you
> could quickly query for it) with a field "last timestamp". This is a
> "global value storage" approach, as you could just store any field with
> any value on it. But I'd be updating this timestamp field a lot, which
> means that every time I updated the index I'd have to remove this
> special document and reindex it. Is there any way to update the value
> of a field in a document directly in the index without removing and
> adding it again to the index? The field I'd want to update would just
> be stored, not indexed or tokenized.
>
> Thanks for your help in guiding my exploration into the capabilities of
> Lucene.
>
> Avi
Re: Advanced timestamp usage (or global value storage)
What if all Documents in your index contained some flag field + an 'add date' field? Then you could make a query such as flag:1 and sort it by the 'add date' field, taking only the very first hit as the most recently added Document.

Otis

--- Avi Drissman <[EMAIL PROTECTED]> wrote:
> I've used Lucene for a long time, but only in the most basic way. I
> have a custom analyzer and a slightly hacked query parser, but in
> general it's the basic add document/remove document/query documents
> cycle.
>
> In my system, I'm indexing a store of external documents, maintaining
> an index for full-text querying. However, I might be turned off when
> documents are added, and then when I'm restarted, I'm going to need to
> determine the timestamp of the last document added to the index so that
> I can pick up where I left off.
>
> There are three approaches to doing this, two using Lucene. I don't
> know how I would do the two Lucene approaches, or even if they're
> possible.
>
> 1. Just keep a file in parallel with the index, reading and writing the
> timestamp of the last indexed document in it. I know how to do this,
> but I don't like the idea of keeping a separate file.
>
> 2. Drop a timestamp onto each document as it's indexed. I've attached
> timestamp fields to documents in the past so that I could do range
> queries on them. However, I don't know how to do a query like "the
> document with the latest timestamp", or even if that's possible.
>
> 3. Create a dummy document (with some unique field identifier so you
> could quickly query for it) with a field "last timestamp". This is a
> "global value storage" approach, as you could just store any field with
> any value on it. But I'd be updating this timestamp field a lot, which
> means that every time I updated the index I'd have to remove this
> special document and reindex it. Is there any way to update the value
> of a field in a document directly in the index without removing and
> adding it again to the index? The field I'd want to update would just
> be stored, not indexed or tokenized.
>
> Thanks for your help in guiding my exploration into the capabilities
> of Lucene.
>
> Avi
Re: How to implement KWIC (KeyWord In Context) display
Hi, Otis,

Thank you very much. I'll try it.

Best,
Ying

----- Original Message -----
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, August 24, 2004 5:55 PM
Subject: Re: How to implement KWIC (KeyWord In Context) display

> Hello Ying,
>
> Take a look at the Lucene Highlighter in the Lucene Sandbox:
> http://jakarta.apache.org/lucene/docs/lucene-sandbox/
>
> Otis
>
> --- yinjin <[EMAIL PROTECTED]> wrote:
> > Hello all,
> >
> > Does anyone know how to implement KWIC display using Lucene? I'd like
> > to display the result similar to a Google search.
> >
> > Thanks for any help,
> > Ying
Re: Lucene Search Applet
Hi Jon,

I modified the three files exactly the way you said, using a separate declaration and static initializer block, but for IndexWriter I had to change 4 of the variables because they were final. Then I updated the Lucene JAR file with the three files in the appropriate directory. But I'm still getting the error:

java.security.AccessControlException: access denied (java.util.PropertyPermission user.dir read)

What am I doing wrong? The last mail you sent, I was unable to download the files you attached. Is it possible you could send them to my work address: [EMAIL PROTECTED]

Many thanks

Simon

----- Original Message -----
From: "Jon Schuster" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 6:25 PM
Subject: RE: Lucene Search Applet

> Hi all,
>
> The changes I made to get past the System.getProperty issues are
> essentially the same in the three files
> org.apache.lucene.index.IndexWriter, org.apache.lucene.store.FSDirectory,
> and org.apache.lucene.search.BooleanQuery.
>
> Change the static initializations from a form like this:
>
>     public static long WRITE_LOCK_TIMEOUT =
>         Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout", "1000"));
>
> to a separate declaration and static initializer block like this:
>
>     public static long WRITE_LOCK_TIMEOUT;
>     static
>     {
>         try
>         {
>             WRITE_LOCK_TIMEOUT =
>                 Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout", "1000"));
>         }
>         catch ( Exception e )
>         {
>             WRITE_LOCK_TIMEOUT = 1000;
>         }
>     };
>
> As before, the variables are initialized when the class is loaded, but
> if the System.getProperty fails, the variable still gets initialized to
> its default value in the catch block.
>
> You can use a separate static block for each variable, or put them all
> into a single static block. You could also add a setter for each
> variable if you want the ability to set the value separately from the
> class init.
>
> In the FSDirectory class, the variables DISABLE_LOCKS and LOCK_DIR are
> marked final, which I had to remove to do the initialization as
> described.
>
> I've also attached the three modified files if you want to just copy
> and paste.
>
> --Jon
>
> -----Original Message-----
> From: Simon mcIlwaine [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 23, 2004 7:37 AM
> To: Lucene Users List
> Subject: Re: Lucene Search Applet
>
> Hi,
>
> Just used the RODirectory and I'm now getting the following error:
> java.security.AccessControlException: access denied
> (java.util.PropertyPermission user.dir read). I'm reckoning that this
> is what Jon was on about with System.getProperty() within certain
> files, because I'm using an applet. Is this correct, and if so, can
> someone show me one of the hacked files so that I know what I need to
> modify?
>
> Many thanks
>
> Simon
>
> ----- Original Message -----
> From: "Simon mcIlwaine" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Monday, August 23, 2004 3:12 PM
> Subject: Re: Lucene Search Applet
>
> > Hi Stephane,
> >
> > A bit of a stupid question, but how do you mean set the system
> > property disableLuceneLocks=true? Can I do it with a call to the
> > FSDirectory API, or do I have to actually hack the code? Also, if I do
> > use RODirectory, how do I go about using it? Do I have to update the
> > Lucene JAR archive file with the RODirectory class included? I tried
> > using it and it's not recognising the class.
> >
> > Many thanks
> >
> > Simon
> >
> > ----- Original Message -----
> > From: "Stephane James Vaucher" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Monday, August 23, 2004 2:22 PM
> > Subject: Re: Lucene Search Applet
> >
> > > Hi Simon,
> > >
> > > Does this work? From the FSDirectory API:
> > >
> > > If the system property 'disableLuceneLocks' has the String value of
> > > "true", lock creation will be disabled.
> > >
> > > Otherwise, I think there was a read-only Directory hack:
> > >
> > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg05148.html
> > >
> > > HTH,
> > > sv
> > >
> > > On Mon, 23 Aug 2004, Simon mcIlwaine wrote:
> > >
> > > > Thanks Jon, that works by putting the jar file in the archive
> > > > attribute. Now I'm getting the disable-lock error because of the
> > > > unsigned applet. Do I just comment out the code anywhere
> > > > System.getProperty() appears in the files that you specified and
> > > > then update the JAR archive? Is it possible you could show me one
> > > > of the hacked files so that I know what I'm modifying? Does anyone
> > > > else know if there is another way of doing this without having to
> > > > hack the source code?
> > > >
> > > > Many thanks.
> > > >
> > > > Simon
> > > >
> > > > ----- Original Message -----
> > > > From: "Jon Schuster" <[EMAIL PROTECTED]>
> > > > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > > > Sent: Saturday, August 21, 2004 2:08 AM
> > > > Subject: R
Re: Advanced timestamp usage (or global value storage)
Avi Drissman wrote:
> In my system, I'm indexing a store of external documents, maintaining
> an index for full-text querying. However, I might be turned off when
> documents are added, and then when I'm restarted, I'm going to need to
> determine the timestamp of the last document added to the index so that
> I can pick up where I left off.
>
> 1. Just keep a file in parallel with the index, reading and writing the
> timestamp of the last indexed document in it. I know how to do this,
> but I don't like the idea of keeping a separate file.

This is similar to the way I chose (I used a property file for this, and stored certain data within it, in the index directory). I didn't like the idea at first either, but later I thought: why not? It is the simplest way. As long as the file name is not one used by Lucene, it should be safe.

Claes
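The property-file approach Claes describes can be as small as this. The file name `indexer.properties` and the key `lastIndexed` are invented for the sketch; any name Lucene itself does not use should be fine.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

public class IndexState {
    private final File file;

    public IndexState(File indexDir) {
        // Lives next to the Lucene files; the name just has to avoid Lucene's own.
        this.file = new File(indexDir, "indexer.properties");
    }

    // Record the timestamp of the most recently indexed document.
    public void saveLastIndexed(long timestamp) throws IOException {
        Properties p = new Properties();
        p.setProperty("lastIndexed", Long.toString(timestamp));
        FileOutputStream out = new FileOutputStream(file);
        try { p.store(out, "indexer state"); } finally { out.close(); }
    }

    // Returns -1 when no state has been written yet (first run).
    public long loadLastIndexed() throws IOException {
        if (!file.exists()) return -1L;
        Properties p = new Properties();
        FileInputStream in = new FileInputStream(file);
        try { p.load(in); } finally { in.close(); }
        return Long.parseLong(p.getProperty("lastIndexed", "-1"));
    }
}
```

On restart, a returned -1 means "index everything"; any other value is the point to resume from, which is exactly Avi's option 1.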
Advanced timestamp usage (or global value storage)
I've used Lucene for a long time, but only in the most basic way. I have a custom analyzer and a slightly hacked query parser, but in general it's the basic add document/remove document/query documents cycle.

In my system, I'm indexing a store of external documents, maintaining an index for full-text querying. However, I might be turned off when documents are added, and then when I'm restarted, I'm going to need to determine the timestamp of the last document added to the index so that I can pick up where I left off.

There are three approaches to doing this, two using Lucene. I don't know how I would do the two Lucene approaches, or even if they're possible.

1. Just keep a file in parallel with the index, reading and writing the timestamp of the last indexed document in it. I know how to do this, but I don't like the idea of keeping a separate file.

2. Drop a timestamp onto each document as it's indexed. I've attached timestamp fields to documents in the past so that I could do range queries on them. However, I don't know how to do a query like "the document with the latest timestamp", or even if that's possible.

3. Create a dummy document (with some unique field identifier so you could quickly query for it) with a field "last timestamp". This is a "global value storage" approach, as you could just store any field with any value on it. But I'd be updating this timestamp field a lot, which means that every time I updated the index I'd have to remove this special document and reindex it. Is there any way to update the value of a field in a document directly in the index without removing and adding it again to the index? The field I'd want to update would just be stored, not indexed or tokenized.

Thanks for your help in guiding my exploration into the capabilities of Lucene.

Avi

--
Avi 'rlwimi' Drissman
[EMAIL PROTECTED]
Argh! This darn mail server is trunca
lucene 1.4 in maven repository
Hi,

Can anyone tell me why there is no Lucene 1.4 jar in the Maven repository at http://www.ibiblio.org/maven/lucene/jars/ ? Who makes them available? It would be very convenient to be able to get the latest version from there (or anywhere else).

regards,
Michael Franken
Re: Lock handling
My suggestion was referring to a timestamp that could be obtained via java.io.File, not something provided by Lucene. Otis --- Claes Holmerson <[EMAIL PROTECTED]> wrote: > Yes, looking at the time of the lock was an idea I had but I could > not > find anything like a time stamp. Am I missing something obvious here? > > Claes > > Otis Gospodnetic wrote: > > >Hello, > > > >If you use Lucene incorrectly (e.g. 2 IndexWriters writing to the > same > >index), you will see this error. Lucene has no way of telling > whether > >the lock file was left over from a previous process, or whether it's > a > >valid lock file because another process is currently indexing > documents > >or some such. > >You could try adding some logic to your app, though. For instance, > you > >can look at lock's timestamp, and using IndexReader.unlock(...) > method > >to forcefully unlock the index. > > > >Otis > > > >--- Claes Holmerson <[EMAIL PROTECTED]> wrote: > > > > > > > >>Hello, > >> > >>I am interested to hear how people handle locked indexes, for > example > >> > >>when catching an IOException like below. > >> > >>java.io.IOException: Lock obtain timed out: > >>Lock@/tmp/lucene-0b978f2c0aa12e8dcdbd5b0df491bfc4-write.lock > >>at org.apache.lucene.store.Lock.obtain(Lock.java:58) > >>at > >>org.apache.lucene.index.IndexWriter.(IndexWriter.java:223) > >>at > >>org.apache.lucene.index.IndexWriter.(IndexWriter.java:213) > >> > >>As far as I can tell, there is no good way to tell whether the lock > >>is > >>only temporary (working as it should), or if it was created by a > >>process > >>that later died, and therefore can not remove it. How can I detect > >>the > >>latter case, and how should I best handle it? 
> >> > >>Thanks, > >>Claes > >> > >> > >>- > >>To unsubscribe, e-mail: [EMAIL PROTECTED] > >>For additional commands, e-mail: > [EMAIL PROTECTED] > >> > >> > >> > >> > > > > > >- > >To unsubscribe, e-mail: [EMAIL PROTECTED] > >For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > -- > Claes Holmerson > Polopoly - Cultivating the information garden > Kungsgatan 88, SE-112 27 Stockholm, SWEDEN > Direct: +46 8 506 782 59 > Mobile: +46 704 47 82 59 > Fax: +46 8 506 782 51 > [EMAIL PROTECTED], http://www.polopoly.com > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
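The timestamp idea Otis suggests above can be sketched with plain java.io.File: treat a lock file that has not been modified for longer than some threshold as left over from a dead process, and only then unlock forcefully (via IndexReader.unlock(...), as mentioned in the thread). The one-hour threshold here is an example value, not a Lucene default:

```java
import java.io.File;

// Sketch of the stale-lock heuristic: look at the lock file's
// modification time via java.io.File and treat a lock untouched for
// "too long" as abandoned. Actual unlocking would then be done with
// IndexReader.unlock(...). ONE_HOUR_MS is just an example threshold.
public class LockChecker {
    public static final long ONE_HOUR_MS = 60L * 60L * 1000L;

    // Core check on a raw modification time, kept separate so it is
    // easy to test without touching the filesystem.
    public static boolean isStale(long lastModifiedMs, long maxAgeMs) {
        return System.currentTimeMillis() - lastModifiedMs > maxAgeMs;
    }

    // Convenience overload for an actual lock file; a missing file is
    // never stale, since there is nothing to unlock.
    public static boolean isStale(File lockFile, long maxAgeMs) {
        return lockFile.exists() && isStale(lockFile.lastModified(), maxAgeMs);
    }
}
```

Note the caveat from the thread still applies: a lock younger than the threshold might also be abandoned, and a very long indexing run can legitimately hold a lock for hours, so the threshold must be chosen per application.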
Re: worddoucments search
Santosh, please read the Lucene API docs. Once you can extract a string from a Word doc using the textmining APIs, try writing it to a temporary file and indexing that. If you are able to index PDF and normal files, what trouble will you face indexing a string extracted from Word docs? Please also read/search the previous postings; they should help you understand Lucene better... - Original Message - From: "Karthik N S" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Wednesday, August 25, 2004 4:21 PM Subject: RE: worddoucments search > Hi > > Santosh > > Please . > > If u have Downloded the Lucene (zip )bundel , First try to read the > docs/index.html which is in the bundel, > if u are still in trouble, then approach the Form for Help [ Un > necessarily asking silly Questions will be ignored ] > > > Karthik > > > > > -Original Message- > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] > Sent: Wednesday, August 25, 2004 3:01 PM > To: Lucene Users List > Subject: Re: worddoucments search > > > that part you have to do yourself. It is easy, just create a new > Document, create an appropriate Field, give it a name and the string > value you got with textmining.org library, then add the Field to your > Document, and then add the Document to the index with IndexWriter. > > Look at one of the articles about Lucene to get started. I wrote one > called something like Introduction to Text Indexing with Lucene. You > probably want to read that one to get going. > > Otis > > --- Santosh <[EMAIL PROTECTED]> wrote: > > > I have gon through textmining.org, I am able to extract text in > > string > > format. 
but how can I get it as > > lucene document format > > - Original Message - > > From: "Otis Gospodnetic" <[EMAIL PROTECTED]> > > To: "Lucene Users List" <[EMAIL PROTECTED]> > > Sent: Tuesday, August 24, 2004 11:54 PM > > Subject: Re: worddoucments search > > > > > > As I just answered in a separate email to Ryan - we used > > textmining.orglibrary, too, as an example of something that is easier > > to use thanPOI. It's been a while since I wrote that chapter, so it > > slipped mymind when I replied. Yes, use textmining.org first, you'll > > be able toinclude it in your code in 2 minutes. Good stuff. > > > > Otis > > > > > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: what is wrong with query
That is correct... fuzzy searches are only on a per-term basis. If what you meant, though, was a phrase query ("full" near "name") you have to add an explicit slop factor like "full name"~5 Erik On Aug 25, 2004, at 2:19 AM, Stephane James Vaucher wrote: From: http://jakarta.apache.org/lucene/docs/queryparsersyntax.html Fuzzy Searches Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single word Term. I haven't used fuzzy searches, but it seems to indicate that it can only be used with single word terms. The query parser might have been written to support that (the output indicates that as well). HTH, sv On Wed, 25 Aug 2004, Alex Kiselevski wrote: I use QueryParser And I got an exception : org.apache.lucene.queryParser.ParseException: Encountered "~" at line 1, column 44. Was expecting one of: ... ... ... "+" ... "-" ... "(" ... ")" ... "^" ... ... ... ... ... ... "[" ... "{" ... ... at org.apache.lucene.queryParser.QueryParser.generateParseException(Query Pa rser.java:1045 at org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser .j ava:925) at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:562) at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500) at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108) at com.stp.corr.cv.search.CVSearcher.getMatchedResults(CVSearcher.java: 89) at com.stp.test.CVTest.main(CVTest.java:223) -Original Message- From: Stephane James Vaucher [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 25, 2004 10:07 AM To: Lucene Users List Subject: Re: what is wrong with query You'll have to give us more information than that... What is the problem you are seeing? I'll assume that you get no results. Tell us of the structure of your documents and how you index every field. 
Concerning your syntax, if you are using the distributed query parser, you don't need the + before name, nor the + before university as they will be added by the parser. sv On Wed, 25 Aug 2004, Alex Kiselevski wrote: Hi, pls, Tell me what is wrong with query: author:( +name AND "full name"~) AND book:( +university) Alex Kiselevsky Speech Technology Tel:972-9-776-43-46 R&D, Amdocs - IsraelMobile: 972-53-63 50 38 mailto:[EMAIL PROTECTED] The information contained in this message is proprietary of Amdocs, protected from disclosure, and may be privileged. The information is intended to be conveyed only to the designated recipient(s) of the message. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, use, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer. Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] The information contained in this message is proprietary of Amdocs, protected from disclosure, and may be privileged. The information is intended to be conveyed only to the designated recipient(s) of the message. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, use, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer. Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
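The docs quoted above describe fuzzy search as based on the Levenshtein (edit) distance. For readers unfamiliar with the metric, here is a self-contained sketch of the classic dynamic-programming computation; it only illustrates what the ~ operator measures and is not Lucene's actual implementation:

```java
// Classic two-row dynamic-programming Levenshtein distance: the
// minimum number of single-character insertions, deletions, or
// substitutions needed to turn one string into another. Shown only
// to illustrate the metric behind fuzzy (~) queries.
public class EditDistance {
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b's prefix from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting a's prefix entirely
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```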
Hebrew Analyzer
Hi, has anybody heard of a Hebrew analyzer? Alex Kiselevsky Speech Technology Tel: 972-9-776-43-46 R&D, Amdocs - Israel Mobile: 972-53-63 50 38 mailto:[EMAIL PROTECTED]
RE: worddoucments search
Hi Santosh, If you have downloaded the Lucene (zip) bundle, first try to read the docs/index.html that is in the bundle. If you are still in trouble, then approach the forum for help. [Unnecessarily asking silly questions will be ignored] Karthik -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 25, 2004 3:01 PM To: Lucene Users List Subject: Re: worddoucments search that part you have to do yourself. It is easy, just create a new Document, create an appropriate Field, give it a name and the string value you got with textmining.org library, then add the Field to your Document, and then add the Document to the index with IndexWriter. Look at one of the articles about Lucene to get started. I wrote one called something like Introduction to Text Indexing with Lucene. You probably want to read that one to get going. Otis --- Santosh <[EMAIL PROTECTED]> wrote: > I have gon through textmining.org, I am able to extract text in > string > format. but how can I get it as > lucene document format > - Original Message - > From: "Otis Gospodnetic" <[EMAIL PROTECTED]> > To: "Lucene Users List" <[EMAIL PROTECTED]> > Sent: Tuesday, August 24, 2004 11:54 PM > Subject: Re: worddoucments search > > > As I just answered in a separate email to Ryan - we used > textmining.orglibrary, too, as an example of something that is easier > to use thanPOI. It's been a while since I wrote that chapter, so it > slipped mymind when I replied. Yes, use textmining.org first, you'll > be able toinclude it in your code in 2 minutes. Good stuff. > > Otis
How not to show results with the same score?
hi there, i browsed through the list and tried some different searches, but i can't find what i'm looking for. i have an index generated by a bot that collects websites. there are pages like www.domain.de/article/1 and www.domain.de/article/1?page=1 - these different urls have the same content, so when you search for a matching word, both are returned, which is correct. because their content is identical they have exactly the same score, so i would like to know whether it's possible to "group by" (as in mysql) the returned score, so that only the first match is collected into "Hits" and all following matches with the same score are ignored. it would be great if anyone has an idea how to do that. thanks and have a nice day. bastian
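Lucene has no GROUP BY, but since Hits come back sorted by score, exact-duplicate pages (which score identically) sit next to each other, and the filtering can be done while walking the results: keep a hit only if its score differs from the previous one. A minimal sketch over plain (score, url) arrays, leaving the Hits API itself aside; note that comparing raw float scores for equality is fragile in general, but matches the exact-duplicate case described here:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of "group by score": assuming results are already sorted by
// score (as Hits are), keep only the first result of each run of
// equal scores, dropping the duplicate-content URLs that follow.
public class ScoreDeduper {
    public static List<String> firstPerScore(float[] scores, String[] urls) {
        List<String> kept = new ArrayList<>();
        float lastScore = Float.NaN; // NaN never compares equal, so the first hit is always kept
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] != lastScore) {
                kept.add(urls[i]);
                lastScore = scores[i];
            }
        }
        return kept;
    }
}
```

A more robust alternative, if the index stores a normalized URL or a content hash field, is to deduplicate on that field rather than on the score.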
Re: Lock handling
Yes, looking at the time of the lock was an idea I had but I could not find anything like a time stamp. Am I missing something obvious here? Claes Otis Gospodnetic wrote: Hello, If you use Lucene incorrectly (e.g. 2 IndexWriters writing to the same index), you will see this error. Lucene has no way of telling whether the lock file was left over from a previous process, or whether it's a valid lock file because another process is currently indexing documents or some such. You could try adding some logic to your app, though. For instance, you can look at lock's timestamp, and using IndexReader.unlock(...) method to forcefully unlock the index. Otis --- Claes Holmerson <[EMAIL PROTECTED]> wrote: Hello, I am interested to hear how people handle locked indexes, for example when catching an IOException like below. java.io.IOException: Lock obtain timed out: Lock@/tmp/lucene-0b978f2c0aa12e8dcdbd5b0df491bfc4-write.lock at org.apache.lucene.store.Lock.obtain(Lock.java:58) at org.apache.lucene.index.IndexWriter.(IndexWriter.java:223) at org.apache.lucene.index.IndexWriter.(IndexWriter.java:213) As far as I can tell, there is no good way to tell whether the lock is only temporary (working as it should), or if it was created by a process that later died, and therefore can not remove it. How can I detect the latter case, and how should I best handle it? Thanks, Claes - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Claes Holmerson Polopoly - Cultivating the information garden Kungsgatan 88, SE-112 27 Stockholm, SWEDEN Direct: +46 8 506 782 59 Mobile: +46 704 47 82 59 Fax: +46 8 506 782 51 [EMAIL PROTECTED], http://www.polopoly.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: worddoucments search
that part you have to do yourself. It is easy, just create a new Document, create an appropriate Field, give it a name and the string value you got with textmining.org library, then add the Field to your Document, and then add the Document to the index with IndexWriter. Look at one of the articles about Lucene to get started. I wrote one called something like Introduction to Text Indexing with Lucene. You probably want to read that one to get going. Otis --- Santosh <[EMAIL PROTECTED]> wrote: > I have gon through textmining.org, I am able to extract text in > string > format. but how can I get it as > lucene document format > - Original Message - > From: "Otis Gospodnetic" <[EMAIL PROTECTED]> > To: "Lucene Users List" <[EMAIL PROTECTED]> > Sent: Tuesday, August 24, 2004 11:54 PM > Subject: Re: worddoucments search > > > As I just answered in a separate email to Ryan - we used > textmining.orglibrary, too, as an example of something that is easier > to use thanPOI. It's been a while since I wrote that chapter, so it > slipped mymind when I replied. Yes, use textmining.org first, you'll > be able toinclude it in your code in 2 minutes. Good stuff. > > Otis > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene Search Applet
Hi Jon, Where do I go to get the attached files? Many Thanks Simon - Original Message - From: "Jon Schuster" <[EMAIL PROTECTED]> To: "'Lucene Users List'" <[EMAIL PROTECTED]> Sent: Monday, August 23, 2004 6:25 PM Subject: RE: Lucene Search Applet > Hi all, > > The changes I made to get past the System.getProperty issues are essentially > the same in the three files org.apache.lucene.index.IndexWriter, > org.apache.lucene.store.FSDirectory, and > org.apache.lucene.search.BooleanQuery. > > Change the static initializations from a form like this: > > public static long WRITE_LOCK_TIMEOUT = > > Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout", > "1000")); > > to a separate declaration and static initializer block like this: > >public static long WRITE_LOCK_TIMEOUT; >static >{ > try > { > WRITE_LOCK_TIMEOUT = > Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout", > "1000")); > } > catch ( Exception e ) > { > WRITE_LOCK_TIMEOUT = 1000; > } >}; > > As before, the variables are initialized when the class is loaded, but if > the System.getProperty fails, the variable still gets initialized to its > default value in the catch block. > > You can use a separate static block for each variable, or put them all into > a single static block. You could also add a setter for each variable if you > want the ability to set the value separately from the class init. > > In the FSDirectory class, the variables DISABLE_LOCKS and LOCK_DIR are > marked final, which I had to remove to do the initialization as described. > > I've also attached the three modified files if you want to just copy and > paste. 
> > --Jon > > -Original Message- > From: Simon mcIlwaine [mailto:[EMAIL PROTECTED] > Sent: Monday, August 23, 2004 7:37 AM > To: Lucene Users List > Subject: Re: Lucene Search Applet > > Hi, > > Just used the RODirectory and I'm now getting the following error: > java.security.AccessControlException: access denied > (java.util.PropertyPermission user.dir read) I'm reckoning that this is what > Jon was on about with System.getProperty() within certain files because im > using an applet. Is this correct and if so can someone show me one of the > hacked files so that I know what I need to modify. > > Many Thanks > > Simon > . > - Original Message - > From: "Simon mcIlwaine" <[EMAIL PROTECTED]> > To: "Lucene Users List" <[EMAIL PROTECTED]> > Sent: Monday, August 23, 2004 3:12 PM > Subject: Re: Lucene Search Applet > > > Hi Stephane, > > > > A bit of a stupid question but how do you mean set the system property > > disableLuceneLocks=true? Can I do it from a call from FSDirectory API or > do > > I have to actually hack the code? Also if I do use RODirectory how do I go > > about using it? Do I have to update the Lucene JAR archive file with > > RODirectory class included as I tried using it and its not recognising the > > class? > > > > Many Thanks > > > > Simon > > > > - Original Message - > > From: "Stephane James Vaucher" <[EMAIL PROTECTED]> > > To: "Lucene Users List" <[EMAIL PROTECTED]> > > Sent: Monday, August 23, 2004 2:22 PM > > Subject: Re: Lucene Search Applet > > > > > > > Hi Simon, > > > > > > Does this work? From FSDirectory api: > > > > > > If the system property 'disableLuceneLocks' has the String value of > > > "true", lock creation will be disabled. 
> > > > > > Otherwise, I think there was a Read-Only Directory hack: > > > > > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg05148.html > > > > > > HTH, > > > sv > > > > > > On Mon, 23 Aug 2004, Simon mcIlwaine wrote: > > > > > > > Thanks Jon that works by putting the jar file in the archive > attribute. > > Now > > > > im getting the disablelock error cause of the unsigned applet. Do I > just > > > > comment out the code anywhere where System.getProperty() appears in > the > > > > files that you specified and then update the JAR Archive?? Is it > > possible > > > > you could show me one of the hacked files so that I know what I'm > > modifying? > > > > Does anyone else know if there is another way of doing this without > > having > > > > to hack the source code? > > > > > > > > Many thanks. > > > > > > > > Simon > > > > > > > > - Original Message - > > > > From: "Jon Schuster" <[EMAIL PROTECTED]> > > > > To: "Lucene Users List" <[EMAIL PROTECTED]> > > > > Sent: Saturday, August 21, 2004 2:08 AM > > > > Subject: Re: Lucene Search Applet > > > > > > > > > > > > > I have Lucene working in an applet and I've seen this problem only > > when > > > > > the jar file really was not available (typo in the jar name), which > is > > > > > what you'd expect. It's possible that the classpath for your > > > > > application is not the same as the classpath for the applet; perhaps > > > > > they're using different VMs or JREs from different locations. > > > > > > > > > > Try referencing the Lucene jar file in the archive attribute of the > > > > > apple
Re: worddoucments search
I have gone through textmining.org and I am able to extract text in string format, but how can I get it into Lucene document format? - Original Message - From: "Otis Gospodnetic" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Tuesday, August 24, 2004 11:54 PM Subject: Re: worddoucments search As I just answered in a separate email to Ryan - we used the textmining.org library, too, as an example of something that is easier to use than POI. It's been a while since I wrote that chapter, so it slipped my mind when I replied. Yes, use textmining.org first, you'll be able to include it in your code in 2 minutes. Good stuff. Otis
Re: Lock handling
Hello, If you use Lucene incorrectly (e.g. 2 IndexWriters writing to the same index), you will see this error. Lucene has no way of telling whether the lock file was left over from a previous process, or whether it's a valid lock file because another process is currently indexing documents or some such. You could try adding some logic to your app, though. For instance, you can look at lock's timestamp, and using IndexReader.unlock(...) method to forcefully unlock the index. Otis --- Claes Holmerson <[EMAIL PROTECTED]> wrote: > Hello, > > I am interested to hear how people handle locked indexes, for example > > when catching an IOException like below. > > java.io.IOException: Lock obtain timed out: > Lock@/tmp/lucene-0b978f2c0aa12e8dcdbd5b0df491bfc4-write.lock > at org.apache.lucene.store.Lock.obtain(Lock.java:58) > at > org.apache.lucene.index.IndexWriter.(IndexWriter.java:223) > at > org.apache.lucene.index.IndexWriter.(IndexWriter.java:213) > > As far as I can tell, there is no good way to tell whether the lock > is > only temporary (working as it should), or if it was created by a > process > that later died, and therefore can not remove it. How can I detect > the > latter case, and how should I best handle it? > > Thanks, > Claes > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: what is wrong with query
From: http://jakarta.apache.org/lucene/docs/queryparsersyntax.html Fuzzy Searches Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single word Term. I haven't used fuzzy searches, but it seems to indicate that it can only be used with single word terms. The query parser might have been written to support that (the output indicates that as well). HTH, sv On Wed, 25 Aug 2004, Alex Kiselevski wrote: > > I use QueryParser > And I got an exception : > org.apache.lucene.queryParser.ParseException: Encountered "~" at line 1, > column 44. > Was expecting one of: > ... > ... > ... > "+" ... > "-" ... > "(" ... > ")" ... > "^" ... > ... > ... > ... > ... > ... > "[" ... > "{" ... > ... > > at > org.apache.lucene.queryParser.QueryParser.generateParseException(QueryPa > rser.java:1045 > at > org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.j > ava:925) > at > org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:562) > at > org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500) > at > org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108) > at > com.stp.corr.cv.search.CVSearcher.getMatchedResults(CVSearcher.java:89) > at com.stp.test.CVTest.main(CVTest.java:223) > > -Original Message- > From: Stephane James Vaucher [mailto:[EMAIL PROTECTED] > Sent: Wednesday, August 25, 2004 10:07 AM > To: Lucene Users List > Subject: Re: what is wrong with query > > > You'll have to give us more information than that... > > What is the problem you are seeing? I'll assume that you get no results. > > Tell us of the structure of your documents and how you index every > field. > > Concerning your syntax, if you are using the distributed query parser, > you don't need the + before name, nor the + before university as they > will be added by the parser. 
> > sv > > On Wed, 25 Aug 2004, Alex Kiselevski wrote: > > > > > Hi, pls, > > Tell me what is wrong with query: > > author:( +name AND "full name"~) AND book:( +university) > > > > > > Alex Kiselevsky > > Speech Technology Tel:972-9-776-43-46 > > R&D, Amdocs - IsraelMobile: 972-53-63 50 38 > > mailto:[EMAIL PROTECTED] > > > > > > > > > > The information contained in this message is proprietary of Amdocs, > > protected from disclosure, and may be privileged. The information is > > intended to be conveyed only to the designated recipient(s) of the > > message. If the reader of this message is not the intended recipient, > > you are hereby notified that any dissemination, use, distribution or > > copying of this communication is strictly prohibited and may be > > unlawful. If you have received this communication in error, please > > notify us immediately by replying to the message and deleting it from > > your computer. Thank you. > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > The information contained in this message is proprietary of Amdocs, > protected from disclosure, and may be privileged. > The information is intended to be conveyed only to the designated recipient(s) > of the message. If the reader of this message is not the intended recipient, > you are hereby notified that any dissemination, use, distribution or copying of > this communication is strictly prohibited and may be unlawful. > If you have received this communication in error, please notify us immediately > by replying to the message and deleting it from your computer. > Thank you. > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: what is wrong with query
I use QueryParser And I got an exception : org.apache.lucene.queryParser.ParseException: Encountered "~" at line 1, column 44. Was expecting one of: ... ... ... "+" ... "-" ... "(" ... ")" ... "^" ... ... ... ... ... ... "[" ... "{" ... ... at org.apache.lucene.queryParser.QueryParser.generateParseException(QueryPa rser.java:1045 at org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.j ava:925) at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:562) at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500) at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108) at com.stp.corr.cv.search.CVSearcher.getMatchedResults(CVSearcher.java:89) at com.stp.test.CVTest.main(CVTest.java:223) -Original Message- From: Stephane James Vaucher [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 25, 2004 10:07 AM To: Lucene Users List Subject: Re: what is wrong with query You'll have to give us more information than that... What is the problem you are seeing? I'll assume that you get no results. Tell us of the structure of your documents and how you index every field. Concerning your syntax, if you are using the distributed query parser, you don't need the + before name, nor the + before university as they will be added by the parser. sv On Wed, 25 Aug 2004, Alex Kiselevski wrote: > > Hi, pls, > Tell me what is wrong with query: > author:( +name AND "full name"~) AND book:( +university) > > > Alex Kiselevsky > Speech TechnologyTel:972-9-776-43-46 > R&D, Amdocs - Israel Mobile: 972-53-63 50 38 > mailto:[EMAIL PROTECTED] > > > > > The information contained in this message is proprietary of Amdocs, > protected from disclosure, and may be privileged. The information is > intended to be conveyed only to the designated recipient(s) of the > message. 
If the reader of this message is not the intended recipient, > you are hereby notified that any dissemination, use, distribution or > copying of this communication is strictly prohibited and may be > unlawful. If you have received this communication in error, please > notify us immediately by replying to the message and deleting it from > your computer. Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] The information contained in this message is proprietary of Amdocs, protected from disclosure, and may be privileged. The information is intended to be conveyed only to the designated recipient(s) of the message. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, use, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer. Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: what is wrong with query
You'll have to give us more information than that... What is the problem you are seeing? I'll assume that you get no results. Tell us of the structure of your documents and how you index every field. Concerning your syntax, if you are using the distributed query parser, you don't need the + before name, nor the + before university as they will be added by the parser. sv On Wed, 25 Aug 2004, Alex Kiselevski wrote: > > Hi, pls, > Tell me what is wrong with query: > author:( +name AND "full name"~) AND book:( +university) > > > Alex Kiselevsky > Speech TechnologyTel:972-9-776-43-46 > R&D, Amdocs - Israel Mobile: 972-53-63 50 38 > mailto:[EMAIL PROTECTED] > > > > > The information contained in this message is proprietary of Amdocs, > protected from disclosure, and may be privileged. > The information is intended to be conveyed only to the designated recipient(s) > of the message. If the reader of this message is not the intended recipient, > you are hereby notified that any dissemination, use, distribution or copying of > this communication is strictly prohibited and may be unlawful. > If you have received this communication in error, please notify us immediately > by replying to the message and deleting it from your computer. > Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]