Re: Optional terms in BooleanQuery
Peter Bloem wrote: [...] "+(A B) C D E" [...] In other words, Lucene considers all documents that have both A and B, and ranks them higher if they also have C, D, or E.

Hello Peter,

To my understanding, "+(A B) C D E" means at least one of the terms "A" or "B" must be contained, while the terms "C", "D", and "E" are optional. The following documents d are hits: d(A, B), d(A), d(B), d(A, C), ... Documents containing neither "A" nor "B" are not hits. To require both terms "A" and "B" in a document, the query should be "(+A +B) C D E" or "+A +B C D E".

Sören

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
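To make Sören's point concrete, here is a minimal plain-Java sketch (not Lucene code) of the hit logic for "+(A B) C D E": the required clause (A OR B) decides whether a document is a hit at all, while C, D, and E only influence ranking.

```java
import java.util.List;
import java.util.Set;

// Sketch of BooleanQuery semantics for "+(A B) C D E".
public class BooleanSemantics {
    // A document is a hit iff it contains at least one term of the
    // required clause (A OR B).
    public static boolean isHit(Set<String> doc) {
        return doc.contains("A") || doc.contains("B");
    }

    // Optional terms raise the score but never decide the hit.
    public static int optionalMatches(Set<String> doc) {
        int n = 0;
        for (String t : List.of("C", "D", "E")) {
            if (doc.contains(t)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(isHit(Set.of("A", "C")));      // true
        System.out.println(isHit(Set.of("C", "D", "E"))); // false: no A or B
    }
}
```

So d(C, D, E) is filtered out entirely, while d(A) and d(A, C) are both hits, with d(A, C) ranked higher.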
Re: search result problem
Hello, thanks for your reply. I used the explain method and now I understand why some documents are returned. I am using the same Analyzer for indexing and searching. I tried adding only the content of the page where that expression can be found (instead of the whole document), and then the search works. Do I have to split my PDF text into more fields? Or what could be the problem?

Grant Ingersoll wrote:
> Try using the explain() method to see why the documents that were returned scored the way they did. If I am understanding correctly, you are saying that Luke shows that those words aren't actually in your index? Can you elaborate on what your analysis process is? Are you searching using the same Analyzer as you are indexing with? I would try to isolate the problem down to some unit tests, if possible.
> Cheers, Grant
>
> On May 18, 2007, at 8:12 AM, Stefan Colella wrote:
>> Hello, my application works with PDF files, so I use Lucene with PDFBox to create a little search engine. I am new to Lucene. All seemed to work fine, but after some tests I saw that some expressions like "stock option" were never found (or returned the wrong documents) even though they exist in my PDF files. I searched the mail archive and found that I have to use the FrenchAnalyzer, but that didn't work either. I found that there is a tool named Luke to check the Lucene index. I could see that the original text contains those words, but nothing in the tokenizer. Anybody who can help or can explain where I can start to look? Thanks

-- Grant Ingersoll, Center for Natural Language Processing, http://www.cnlp.org/tech/lucene.asp Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
Very odd behaviour of FrenchAnalyzer with strings in capital letters
Hello, I tried org.apache.lucene.analysis.fr.FrenchAnalyzer and I got strange search results on strings in uppercase (example: VEHICLE). When I search the string in lower case, I get no result. I get results if I use "vehicle*" or "vehiclE", or "vehicLe", etc. What is odd is that it affects only some of the strings, not all of them. Has anyone ever experienced this problem? Thanks, Florian
-- View this message in context: http://www.nabble.com/Very-odd-behaviour-of-FrenchAnalyzer-with-strings-in-capital-letters-tf3789153.html#a10715673 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Upgrade 2.0 -> 2.1
Hi, I have tried to upgrade from 2.0 to 2.1 to overcome some NFS issues. It compiles just fine, but when I run the application and try to add a document, it throws an exception stating NoSuchMethod. This happens when I try to add an object of type Field to a newly created empty Document. I have erased all dependencies in my project as well as on the server, so it should be as clean as a whistle, but no luck. I'm running it on BEA 8.1 SP6 with the old 1.4 Java. Does anyone know where to look? Best regards, Svend Ole
Re: Very odd behaviour of FrenchAnalyzer with strings in capital letters
First, have you gotten a copy of Luke to examine your index to see what's actually indexed?

The default behavior is usually to lowercase everything, but I'm not entirely sure whether the French analyzer does this. I suspect so.

Searches are case sensitive. To get caseless searching, you need to put everything in the same case. This is usually done for you by any of the standard analyzers, but check specifically. Are you using the same analyzer at index AND search time?

Best, Erick

On 5/21/07, Jolinar13 <[EMAIL PROTECTED]> wrote: [...]
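Erick's point about caseless searching can be sketched in plain Java (this is a toy whitespace "analyzer", not the actual FrenchAnalyzer pipeline): case-insensitive matching works only because the SAME lowercasing step runs at index time and at query time.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Toy illustration: identical analysis on both sides makes "VEHICLE"
// findable via the query "vehicle".
public class CaselessMatch {
    // A minimal stand-in for an analyzer: whitespace split + lowercasing.
    public static List<String> analyze(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                     .map(t -> t.toLowerCase(Locale.FRENCH))
                     .collect(Collectors.toList());
    }

    // Both the indexed text and the query term go through analyze().
    public static boolean matches(String indexedText, String queryTerm) {
        return analyze(indexedText).contains(analyze(queryTerm).get(0));
    }

    public static void main(String[] args) {
        System.out.println(matches("TYPE DE VEHICLE", "vehicle")); // true
    }
}
```

If only one side were lowercased, the indexed term "VEHICLE" and the query term "vehicle" would never compare equal, which is exactly the symptom Florian describes.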
Re: Upgrade 2.0 -> 2.1
Hi, I saw this or something similar going from 2.0 to 2.1 when I hadn't recompiled all my Lucene-related code. It went away when everything was recompiled, so I'd guess you've got an old class file lurking somewhere. -- Ian.

On 5/21/07, Svend Ole Nielsen <[EMAIL PROTECTED]> wrote: [...]
Re: documents with large numbers of fields
Mike Klaas wrote:
> On 18-May-07, at 1:01 PM, charlie w wrote:
>> Is there an upper limit on the number of fields comprising a document, and if so what is it?
>
> There is not. They are relatively costless if omitNorms=False

Mike, I think you meant "relatively costless if omitNorms=True". Steve
Re: Very odd behaviour of FrenchAnalyzer with strings in capital letters
Hello, thank you for your quick answer. I use Luke to examine the index, but since I switched to FrenchAnalyzer, it says 'Not a Lucene index'. If I open the index files in a text viewer, the strings are in UPPER case. I do use the same analyzer to index and search. So, do I have to tell the FrenchAnalyzer not to be case sensitive? How do I do that? Thanks a lot, Florian

Erick Erickson wrote: [...]

-- View this message in context: http://www.nabble.com/Very-odd-behaviour-of-FrenchAnalyzer-with-strings-in-capital-letters-tf3789153.html#a10719413 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Upgrade 2.0 -> 2.1
Hi Ian, well, it worked. Thanks :) I wasn't aware that could have fixed it, but after your suggestion it seemed like the most logical solution. /Svend

On Mon, 21 05 2007 at 14:30 +0100, Ian Lea wrote: [...]
Re: How to Update the Index once it is created
If you are using Oracle and Lucene, check out http://www.hibernate.org/410.html ("Hibernate Search"); this will automatically update your Lucene index on any change to your database table.

Erick Erickson wrote:
> You have to delete the old document and add a new one. See the IndexModifier class. There is no ability to modify a document in place. Best, Erick
>
> On 5/14/07, Krishna Prasad Mekala <[EMAIL PROTECTED]> wrote:
>> Hi All, thanks for your response. I have one more doubt: how can I update an index once created from Oracle, instead of recreating the whole thing? Whenever there is a change in the Oracle table (insertion/update/deletion of a row), my application should update the index. Thanks in advance. Krishna Prasad M

-- View this message in context: http://www.nabble.com/How-to-Update-the-Index-once-it-is-created-tf3752208.html#a10724708 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Implement a tokenizer
Hi there, I was interested in changing the StandardTokenizer so it will not remove the "+" (plus) sign from my stream. Looking in the code and documentation, it reads: "If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer." I can't understand from this where I should jump in and make my change. Can someone point out where I should look in order to perform my change? Thanks in advance
-- View this message in context: http://www.nabble.com/Implement-a-tokenizer-tf3792172.html#a10724827 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
stop words, synonyms... what's in it for me?
Hi there, I started using Lucene not long ago, with plans to replace the current SQL queries in my application with it. As I wasn't aware of Lucene before, I have implemented some tools (filters) similar to the ones Lucene includes.

For example, I have implemented a "stop word" tool. In my case I have many more configuration options than Lucene, with the option to remove substrings in addition to complete tokens. I can configure the wanted location of the substring within the token, or even the location of the token within the phrase.

I have also implemented a synonym mechanism (substitution mechanism) that can be configured according to location within a phrase. It can also be configured to find synonyms taking spelling mistakes into account, although it doesn't expand but only transforms to one certain replacement. It can find replacements for substrings as well, so I can use it to separate words. For example, in German I have "strasse" => " strasse" (with a space in front), so words like "mainstrasse" will be split into "main" and "strasse".

I am wondering if I can use my "standardization" tools before calling the Lucene indexing, without implementing any custom analyzers, and achieve more or less the same results. What do I "lose" if I go this way? The stemming filters are really the one thing I didn't have, and I will use them. Is there any point for me to start creating custom analyzers with filters for stop words and synonyms, and implementing my own "substring" filter for separating tokens into "sub words" (like "mainstrasse" => "main", "strasse")?

Thanks in advance
-- View this message in context: http://www.nabble.com/stop-words%2C-synonyms...-what%27s-in-it-for-me--tf3792510.html#a10725950 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: stop words, synonyms... what's in it for me?
On Monday 21 May 2007 22:05, bhecht wrote:
> Is there any point for me to start creating custom analyzers with filter for stop words, synonyms, and implementing my own "sub string" filter, for separating tokens into "sub words" (like "mainstrasse"=> "main", "strasse")

Yes: I assume your document should be found both with "strasse" and with "mainstrasse". You will then need to put main, strasse, and mainstrasse at the same position (setPositionIncrement(0)). If you don't do that, phrase queries will no longer work as expected. Thus you need an analyzer; modifying the strings before they are put into Lucene is not enough. Regards, Daniel -- http://www.danielnaber.de
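The position-increment bookkeeping Daniel refers to can be sketched in plain Java (this is not a real Lucene TokenStream, just the arithmetic): a token emitted with increment 0 lands at the same position as the previous token, so a compound and its parts can be stacked. The particular layout shown (compound first, then its parts) is only one possible choice.

```java
import java.util.ArrayList;
import java.util.List;

// How Lucene turns per-token position increments into absolute positions.
public class PositionSketch {
    // Render (term, increment) pairs as "term@position" strings.
    public static List<String> positions(String[] terms, int[] increments) {
        List<String> out = new ArrayList<>();
        int pos = -1;                    // first increment of 1 yields position 0
        for (int i = 0; i < terms.length; i++) {
            pos += increments[i];        // increment 0 stays on the same position
            out.add(terms[i] + "@" + pos);
        }
        return out;
    }

    public static void main(String[] args) {
        // "mainstrasse" kept, with "main" stacked on it via increment 0:
        System.out.println(positions(
            new String[] {"mainstrasse", "main", "strasse"},
            new int[]    {1,             0,      1}));
        // -> [mainstrasse@0, main@0, strasse@1]
    }
}
```

With the stacked layout, a phrase query still sees "strasse" adjacent to whatever precedes the compound, which is what plain pre-splitting outside the analyzer cannot guarantee.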
Re: stop words, synonyms... what's in it for me?
Thanks Daniel, but when searching I will run my "standardization" tools again before querying Lucene, so what you mentioned will not be a problem. If someone searches for mainstrasse, my tools will split it again into main and strasse, and then Lucene will be able to find it.

Daniel Naber-5 wrote: [...]

-- View this message in context: http://www.nabble.com/stop-words%2C-synonyms...-what%27s-in-it-for-me--tf3792510.html#a10726812 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: stop words, synonyms... what's in it for me?
On Monday 21 May 2007 22:53, bhecht wrote:
> If someone searches for mainstrasse, my tools will split it again to main and strasse, and then lucene will be able to find it.

"strasse" will match "mainstrasse", but the phrase query "schöne strasse" will not match "schöne mainstrasse". However, this could be considered a feature. Anyway, it will be difficult to use features that rely on the term list, e.g. the spellchecker. It will not be able to suggest "mainstrasse", as that's not in the index. Regards, Daniel -- http://www.danielnaber.de
Re: How to Update the Index once it is created
Does it mandate that you pass data through Hibernate? This seems very similar to Compass' approach. I believe a more generic approach is to compare what's already indexed with what has changed or been deleted, so you can use any framework to work with Lucene. Simply selecting the data and creating the index can avoid some framework-specific limitations and is easier to scale. Re-indexing will also be easier. DBSight tracks changes through simple SQL, handles hard-deleted or soft-deleted content, and does it very efficiently even for large numbers of documents.

-- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.dbsight.com Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

On 5/21/07, bhecht <[EMAIL PROTECTED]> wrote: [...]
RE: Implement a tokenizer
What you need to do is create your own tokenizer. Just copy the code from the StandardTokenizer to your XYZTokenizer and make your changes. Then you need to create your own Analyzer class (again, copy the code from the StandardAnalyzer) and use your XYZTokenizer in the new XYZAnalyzer you created. HTH, Aviran http://www.aviransplace.com http://shaveh.co.il

-Original Message- From: bhecht [mailto:[EMAIL PROTECTED] Sent: Monday, May 21, 2007 2:59 PM To: java-user@lucene.apache.org Subject: Implement a tokenizer [...]
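To illustrate the behavior bhecht is after, here is a minimal hand-rolled tokenizer in plain Java that treats '+' as a word character. This is only a sketch of the idea; the real change, as Aviran describes, would be made in a copy of StandardTokenizer's grammar, and this class is not a Lucene Tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// A toy tokenizer that keeps '+' as part of a token, so "C++" survives intact.
public class PlusKeepingTokenizer {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c) || c == '+') {
                current.append(c);          // '+' is treated as a word character
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);       // any other character ends the token
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("C++ and Java")); // [C++, and, Java]
    }
}
```

In a real analyzer the same effect is achieved by changing which character classes the tokenizer's grammar accepts inside a token.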
Re: Implement a tokenizer
Actually, before you jump in, be warned that the "+" (plus) sign is also part of the query parser syntax. You cannot really/easily pass a query with the "+" sign through the query parser in order to get a match.

-- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.dbsight.com Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

On 5/21/07, bhecht <[EMAIL PROTECTED]> wrote: [...]
Re: stop words, synonyms... what's in it for me?
I will never have "mainstrasse" in my Lucene index, since strasse is always replaced with " strasse", causing "mainstrasse" to be split into "main strasse". So in the example you gave, "schöne strasse" will match "schöne mainstrasse", since in the Lucene index I have "schöne main strasse".

Daniel Naber-5 wrote: [...]

-- View this message in context: http://www.nabble.com/stop-words%2C-synonyms...-what%27s-in-it-for-me--tf3792510.html#a10727213 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: stop words, synonyms... what's in it for me?
No, a phrase search will NOT match. Phrase semantics require that the terms be adjacent (slop of 0). So, since "mainstrasse" was split into two tokens at index time, the test for "is schöne right next to strasse" will fail because of the intervening (introduced) term "main". Whether this is desired behavior or not is another question. You're right that a non-phrase search *will* work, though. Best, Erick

On 5/21/07, bhecht <[EMAIL PROTECTED]> wrote: [...]
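Erick's slop-0 adjacency test can be sketched in plain Java (this is not Lucene's phrase scorer, just the adjacency check it boils down to for a two-term phrase):

```java
import java.util.List;

// A two-term phrase query with slop 0 requires the terms at adjacent positions.
public class PhraseAdjacency {
    public static boolean phraseMatch(List<String> indexedTokens, String t1, String t2) {
        for (int i = 0; i + 1 < indexedTokens.size(); i++) {
            if (indexedTokens.get(i).equals(t1) && indexedTokens.get(i + 1).equals(t2)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "mainstrasse" was split at index time, so "main" intervenes:
        List<String> indexed = List.of("schöne", "main", "strasse");
        System.out.println(phraseMatch(indexed, "schöne", "strasse")); // false
    }
}
```

The introduced token "main" sits between "schöne" and "strasse", so the phrase "schöne strasse" fails, exactly as Erick says; a plain (non-phrase) boolean search for both terms would still succeed.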
Re: search result problem
Stefan Colella wrote:
> I tried to only add the content of the page where that expression can be found (instead of the whole document) and then the search works.
> Do i have to split my pdf text into more field? Or what could be the problem?

Perhaps IndexWriter's setMaxFieldLength() is relevant here.
getTermFreqVector atomicity
I'm interested in getting the term vector of a Lucene doc. The point is, it seems I have to give IndexReader.getTermFreqVector a doc ID, whereas I would like to know whether there is a way to get the term vector by a doc identifier (not the Lucene doc ID, but my own field). I know how to get the Lucene doc ID for the doc I'm interested in, but my concern is about the non-atomicity of getting an ID and passing it to another function. This is because I reload the index from time to time, and I'm worried about a loss of consistency if the new IndexReader remaps doc IDs (after deletions, for example) and I end up accessing a different doc, just because the reader could have been reloaded (and doc IDs changed) between "get the ID" and "get the term vector for that ID". Best, Walter
In memory MultiSearcher
Hello, I have been using a large, in-memory MultiSearcher that is reaching the limits of my hardware's RAM with this code:

try {
    IndexSearcher[] searcher_a = {
        new IndexSearcher(new RAMDirectory(index_one_path)),
        new IndexSearcher(new RAMDirectory(index_two_path)),
        new IndexSearcher(new RAMDirectory(index_three_path)),
        new IndexSearcher(new RAMDirectory(index_four_path)),
        new IndexSearcher(new RAMDirectory(index_n_path))
    };
    MultiSearcher searcher_ms = new MultiSearcher(searcher_a);
    ...
} catch (Exception e) {
    System.out.println(e);
}

For example, one of the several indexes is 768MB. Is there possibly a better way to do this? Regards, Peter W.
Re: getTermFreqVector atomicity
An IndexReader doesn't see changes in the index unless you close and reopen it, but if there is significant time between the time you fetch your doc ID and read its vector, that could be a problem. You can always use TermEnum/TermDocs to find the doc ID associated with a particular field value you have, but I suspect that suffers from the same problem. In fact, *anything* you do between fetching the doc ID and getting its term vector has this problem, and there's no way I know of to get term vectors by your own ID.

What might work is a "sanity check" sort of algorithm: fetch the doc ID, then fetch its term vector, then look at your custom field for that doc ID and see if it matches the original. If not, do it all over again. But that all seems too complicated to me. Why not just ensure that you use the *same* IndexReader both when you get the original doc ID and when you get its term vector? Even a temporary reference should hold things open long enough to ensure that atomicity.

Best, Erick

On 5/21/07, Walter Ferrara <[EMAIL PROTECTED]> wrote: [...]
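The "same reader for both calls" advice can be sketched as follows. This is plain Java with a hypothetical Reader interface and toy implementation standing in for Lucene's IndexReader; the point is the single snapshot of the volatile reference, so a background reload swapping in a new reader cannot split the doc-ID lookup from the term-vector fetch.

```java
// Sketch: snapshot the shared reader reference once, then use that
// snapshot for both steps of the lookup.
public class AtomicLookup {
    interface Reader {
        int docIdFor(String myId);
        String termVector(int docId);
    }

    // In a real application, a background reload thread swaps this reference.
    static volatile Reader current = new Reader() {
        public int docIdFor(String myId) { return myId.length(); }  // toy mapping
        public String termVector(int docId) { return "tv-" + docId; }
    };

    public static String vectorFor(String myId) {
        Reader r = current;            // one snapshot of the reference...
        int docId = r.docIdFor(myId);  // ...used for the doc-ID lookup
        return r.termVector(docId);    // ...and for the term-vector fetch
    }

    public static void main(String[] args) {
        System.out.println(vectorFor("doc-42")); // tv-6
    }
}
```

Because both calls go through the local variable r, a concurrent reassignment of current between them is harmless: the old reader stays reachable (and, with a real IndexReader, open) for the duration of the method.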
Re: In memory MultiSearcher
Why are you doing this in the first place? Do you actually have evidence that the default Lucene behavior (caching, etc.) is inadequate for your needs? I'd *strongly* recommend, if you haven't, just using regular FSDirectories rather than RAMDirectories, and only getting complex if that's too slow. I ask because I am searching FS-based indexes that are 4GB with no problem. The index *was* 8GB and had only a 10% performance hit. Best, Erick

On 5/21/07, Peter W. <[EMAIL PROTECTED]> wrote: [...]
Re: stop words, synonyms... what's in it for me?
Thanks Erick, that's what I thought. In my case no phrase queries are done, so it seems I am good to go. Any additional thoughts on the issue are welcome. Thanks

Erick Erickson wrote: [...]

-- View this message in context: http://www.nabble.com/stop-words%2C-synonyms...-what%27s-in-it-for-me--tf3792510.html#a10731811 Sent from the Lucene - Java Users mailing list archive at Nabble.com.