Re: Index keeps growing, then shrinks on restart
Telling us the version of Lucene and the OS you're running on is always a good idea. A guess here is that you aren't closing index readers, so the JVM will be holding on to deleted files until it exits. A combination of du, ls, and lsof should prove it, or just lsof: run it against the java process and look for deleted files. If you're on unix, that is. -- Ian.

On Mon, Nov 10, 2014 at 11:03 PM, Rob Nikander rob.nikan...@gmail.com wrote:
> Hi, I have an index that's about 700 MB, and over days it grows until it causes problems with disk space, at about 5 GB. If the JVM process ends, the index shrinks back to about 700 MB. I'm calling IndexWriter.commit() all the time. What else do you call to get it to compact its use of space? thank you, Rob

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How to improve the performance in Lucene when query is long?
Hi Harry, Maybe you can use the BooleanQuery#setMinimumNumberShouldMatch method. What happens when you set it to half of the numTerms? ahmet

On Tuesday, November 11, 2014 8:35 AM, Harry Yu 502437...@qq.com wrote:
> Hi everyone, I have been using Lucene to build a POI searching/geocoding system. In testing, I found that when the query is long (above 10 terms), searching is too slow, taking nearly 1s. I think the bottleneck is that I used OR to combine my BooleanQuery, which produces plenty of candidate documents and also consumes too much time scoring and ranking them. I changed to AND, but that decreased the accuracy of the hits. So I want to find a way to reduce the candidate documents without decreasing accuracy in this situation. Thanks for your help! -- Harry Yu, Institute of Remote Sensing and Geographic Information System, School of Earth and Space Sciences, Peking University; Beijing, China, 100871; Email: 502437...@qq.com OR harryyu1...@163.com
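Ahmet's suggestion can be sketched like this for Lucene 4.x (a minimal sketch; the field name and the "half the terms" threshold are illustrative, not from the thread):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Build an OR (SHOULD) query, but require at least half of the terms to
// match. This trims the candidate set without going to a strict AND,
// which is often a reasonable middle ground for long queries.
public class MinShouldMatchSketch {
    public static BooleanQuery build(String field, String[] terms) {
        BooleanQuery bq = new BooleanQuery();
        for (String t : terms) {
            bq.add(new TermQuery(new Term(field, t)), BooleanClause.Occur.SHOULD);
        }
        // At least ceil-ish half of the SHOULD clauses must match a doc.
        bq.setMinimumNumberShouldMatch(Math.max(1, terms.length / 2));
        return bq;
    }
}
```

Tuning the threshold (half, two-thirds, numTerms minus one) trades recall against speed.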
Re: Index keeps growing, then shrinks on restart
On Tue, Nov 11, 2014 at 4:26 AM, Ian Lea ian@gmail.com wrote:
> Telling us the version of lucene and the OS you're running on is always a good idea.

Oops, yes. Lucene 4.10.0, Linux.

> A guess here is that you aren't closing index readers, so the JVM will be holding on to deleted files until it exits.

That's probably it. I found a code path where I had assumed the reader's `close()` would be called by GC/finalize. Rob
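For reference, deterministic closing in Lucene 4.10 can look like this (a sketch; the helper name and path handling are illustrative, and a shared SearcherManager is the more idiomatic way to reuse and refresh readers in a long-running app):

```java
import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearchOnce {
    // Open a reader, search, and close it deterministically. Relying on
    // GC/finalize leaves file handles open, so segment files deleted by
    // merges keep consuming disk space until the JVM exits.
    static void searchOnce(File indexDir) throws Exception {
        Directory dir = FSDirectory.open(indexDir);
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // ... run queries with searcher ...
        } // reader.close() runs here, releasing the deleted files
        dir.close();
    }
}
```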
How to map lucene scores to range from 0~100?
Hi everyone, I've run into a new problem. In my system, we need to score each doc in the range 0 to 100. Are there any easy ways to map Lucene scores to this range? Thanks for your help~ Yu
Re: How to map lucene scores to range from 0~100?
Harry, basically, converting scores into the range 0 to 100 requires normalization (divide each score by the highest score and multiply by 100). But note that this normalized score doesn't represent a matching percentage.

On Tue, Nov 11, 2014 at 7:48 PM, Harry Yu 502437...@qq.com wrote:
> Hi everyone, I've run into a new problem. In my system, we need to score each doc in the range 0 to 100. Are there any easy ways to map Lucene scores to this range? Thanks for your help~ Yu

--
Launchship Technology respects your privacy. This email is intended only for the use of the party to which it is addressed and may contain information that is privileged, confidential, or protected by law. If you have received this message in error, or do not want to receive any further emails from us, please notify us immediately by replying to the message and deleting it from your computer.
Re: How to map lucene scores to range from 0~100?
Hi Rajendra, Thanks for your reply. Normalization is a good way to solve it, but there is a problem: with that formula the top-ranked doc always scores 100. Although it maps scores into 0~100, the score may not reflect the similarity between the query and the hit docs. My system searches POIs, and I want the final score to reflect text similarity, something like the inverse of the Levenshtein distance between the query and the hit docs. Do you have any ideas? Sincerely thank you~ Yu

-- Original Message --
From: Rajendra Rao <rajendra@launchship.com>
Sent: Tuesday, November 11, 2014, 10:55 PM
To: java-user <java-user@lucene.apache.org>
Subject: Re: How to map lucene scores to range from 0~100?

> Harry, basically, converting scores into the range 0 to 100 requires normalization (divide each score by the highest score and multiply by 100), but this score doesn't represent a matching percentage.
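Rajendra's max-normalization reduces to a few lines of plain Java (illustrative only; as discussed above, the output is relative to the current hit list, not an absolute similarity percentage):

```java
// Scale a result list's raw Lucene scores to 0..100 by dividing each by
// the top score. The top hit is always 100 by construction.
public class ScoreScaler {
    public static double[] toPercent(double[] scores) {
        if (scores.length == 0) return scores;
        double max = scores[0];
        for (double s : scores) max = Math.max(max, s);
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // Guard against an all-zero score list.
            out[i] = (max == 0.0) ? 0.0 : (scores[i] / max) * 100.0;
        }
        return out;
    }
}
```

If an absolute text-similarity measure is wanted (as Harry asks), a separate metric such as normalized edit distance has to be computed against each hit; Lucene's relevance score is not designed for that.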
RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2
In the end I edited the code of the StandardAnalyzer and the SnowballAnalyzer to disable the calls to the LowerCaseFilter. This seems to work.

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
Sent: 10 Nov 2014 15:19
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

Hi, Regarding Uwe's warning, NOTE: SnowballFilter expects lowercased text. [1]

[1] https://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/snowball/SnowballFilter.html

On Monday, November 10, 2014 4:43 PM, Uwe Schindler u...@thetaphi.de wrote:

> Hi Uwe, Thanks for the reply. Given that SnowballAnalyzer is made up of a series of filters, I was thinking about something like this, where I 'pipe' the output from one filter to the next:
>
>     standardTokenizer = new StandardTokenizer(...);
>     standardFilter = new StandardFilter(standardTokenizer, ...);
>     stopFilter = new StopFilter(standardFilter, ...);
>     snowballFilter = new SnowballFilter(stopFilter, ...);
>
> but ignore LowerCaseFilter. Does this make sense?

Exactly. Create a clone of SnowballAnalyzer (from the Lucene source package) in your own package and remove LowerCaseFilter. But be aware, it could be that Snowball needs lowercased terms to do stemming correctly!!! I don't know about this filter, I just want to make you aware. The same applies to the stop filter, but that one allows you to handle it: you should make the stop filter case insensitive (there is a boolean for this): StopFilter(boolean enablePositionIncrements, TokenStream input, Set<?> stopWords, boolean ignoreCase). Uwe

Martin O'Shea.

-----Original Message-----
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: 10 Nov 2014 14:06
To: java-user@lucene.apache.org
Subject: RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

Hi, In general you cannot change Analyzers; they are examples and can be seen as best practice. If you want to modify them, write your own Analyzer subclass which uses the Tokenizers and TokenFilters you want. You can, for example, clone the source code of the original and remove LowerCaseFilter. Analyzers are very simple; there is no logic in them, just some configuration (which Tokenizer and which TokenFilters). In later Lucene 3 and in Lucene 4 this is very simple: you just need to override createComponents in the Analyzer class and add your configuration there. If you use Apache Solr or Elasticsearch you can create your analyzers by XML or JSON configuration. Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Martin O'Shea [mailto:m.os...@dsl.pipex.com]
Sent: Monday, November 10, 2014 2:54 PM
To: java-user@lucene.apache.org
Subject: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

I realise that 3.0.2 is an old version of Lucene, but if I have Java code as follows:

    int nGramLength = 3;
    Set<String> stopWords = new HashSet<String>();
    stopWords.add("the");
    stopWords.add("and");
    ...
    SnowballAnalyzer snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
    ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);

which will generate the frequency of ngrams from a particular string of text without stop words, how can I disable the LowerCaseFilter which forms part of the SnowballAnalyzer? I want to preserve the case of the ngrams generated so that I can perform various counts according to the presence/absence of upper case characters in the ngrams. I am something of a Lucene newbie. And I should add that upgrading the version of Lucene is not an option here.
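Uwe's suggestion of cloning the analyzer without the lower-casing step can be sketched like this for Lucene 3.0.2 (a minimal, untested sketch from memory of the 3.0 API; the class name is illustrative, and per Uwe's warning Snowball stemming normally expects lowercased input):

```java
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Same chain as SnowballAnalyzer, minus LowerCaseFilter, so token case
// is preserved for the ngram counts.
public class CasePreservingSnowballAnalyzer extends Analyzer {
    private final Set<?> stopWords;

    public CasePreservingSnowballAnalyzer(Set<?> stopWords) {
        this.stopWords = stopWords;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(Version.LUCENE_30, reader);
        result = new StandardFilter(result);
        // ignoreCase=true so stop words still match despite mixed case
        result = new StopFilter(true, result, stopWords, true);
        return new SnowballFilter(result, "English");
    }
}
```

This can then be wrapped in the ShingleAnalyzerWrapper exactly as in the original code.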
Re: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2
Hi, With that analyser, your searches for the same word with different capitalisation could return different results. Ahmet

On Tuesday, November 11, 2014 6:57 PM, Martin O'Shea app...@dsl.pipex.com wrote:
> In the end I edited the code of the StandardAnalyzer and the SnowballAnalyzer to disable the calls to the LowerCaseFilter. This seems to work.
RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2
Ahmet, Yes, that is quite true. But as this is only a proof-of-concept application, I'm prepared for things to be 'imperfect'. Martin O'Shea.

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
Sent: 11 Nov 2014 18:26
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

> Hi, With that analyser, your searches for the same word with different capitalisation could return different results. Ahmet
Document Term matrix
Hi All, I have a Lucene index built with Lucene 4.9 for 584 text documents. I need to extract a document-term matrix and a document-document similarity matrix in order to cluster the documents. My questions:

1. How can I extract the matrix and compute the similarity between documents in Lucene?
2. Is there any Java-based code that can cluster the documents from a Lucene index?

Regards,
Shaimaa
Re: Document Term matrix
hi, While indexing the documents, store the term vectors for the content field. Then for each document you will have an array of terms and their corresponding frequencies in the document. Using the IndexReader you can retrieve these term vectors. Similarity between two documents can be computed as the similarity of their term vectors. Since tf-idf is well known and seems to give a better sense of similarity, simply multiply the idf of each term by its frequency to re-weight the vectors. Thanks, Parnab

On Tue, Nov 11, 2014 at 8:36 PM, Elshaimaa Ali elshaimaa@hotmail.com wrote:
> Hi All, I have a Lucene index built with Lucene 4.9 for 584 text documents. I need to extract a document-term matrix and a document-document similarity matrix in order to cluster the documents.
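Once the term frequencies and idf values have been read back from the index, Parnab's recipe reduces to a tf-idf cosine similarity. A plain-Java sketch (the map-based representation is an assumption for illustration, not a Lucene API):

```java
import java.util.Map;

// Cosine similarity of two documents given their term-frequency maps and
// a shared idf table. Each term's weight is tf * idf; similarity is
// dot(a, b) / (|a| * |b|), so identical vectors score 1.0.
public class TermVectorSimilarity {
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b,
                                Map<String, Double> idf) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double w = idf.getOrDefault(e.getKey(), 0.0);
            double wa = e.getValue() * w;
            normA += wa * wa;
            Integer fb = b.get(e.getKey());
            if (fb != null) dot += wa * (fb * w);  // shared term contributes
        }
        for (Map.Entry<String, Integer> e : b.entrySet()) {
            double wb = e.getValue() * idf.getOrDefault(e.getKey(), 0.0);
            normB += wb * wb;
        }
        return (normA == 0 || normB == 0)
                ? 0.0
                : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Computing this for every pair of the 584 documents yields the document-document similarity matrix, which clustering tools can then consume.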
Re: Document Term matrix
The semanticvectors project might do what you are looking for. paul

On 11 Nov 2014, at 22:37, parnab kumar parnab.2...@gmail.com wrote:
> While indexing the documents, store the term vectors for the content field. Similarity between two documents can then be computed as the similarity of their term vectors.
Re: Document Term matrix
Hi, Mahout and Carrot2 can cluster the documents from a Lucene index. ahmet

On Tuesday, November 11, 2014 10:37 PM, Elshaimaa Ali elshaimaa@hotmail.com wrote:
> Is there any Java-based code that can cluster the documents from a Lucene index?