Re: Index keeps growing, then shrinks on restart

2014-11-11 Thread Ian Lea
Telling us the version of lucene and the OS you're running on is
always a good idea.

A guess here is that you aren't closing index readers, so the JVM will
be holding on to deleted files until it exits.

A combination of du, ls, and lsof commands should prove it, or just
lsof: run it against the java process and look for deleted files.  If
you're on unix, that is.
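Ian's guess is easy to demonstrate in plain Java, without Lucene: hold a stream open, delete the file, and watch the unlinked-but-open file linger. This is a Linux-specific sketch (it reads /proc/self/fd, the same information lsof reports); an unclosed IndexReader pins deleted segment files in exactly this way. The class and file names below are mine.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.*;

public class DeletedButOpen {

    // True if this process holds any file descriptor whose file was deleted.
    // Linux-specific: /proc/self/fd symlink targets end in " (deleted)".
    static boolean holdsDeletedFile() throws IOException {
        try (DirectoryStream<Path> fds = Files.newDirectoryStream(Paths.get("/proc/self/fd"))) {
            for (Path fd : fds) {
                try {
                    if (Files.readSymbolicLink(fd).toString().endsWith("(deleted)")) {
                        return true;
                    }
                } catch (IOException ignored) {
                    // an fd may vanish between listing and reading it
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("fake-segment", ".bin");
        Files.write(tmp, new byte[10 * 1024]);

        // Stand-in for an unclosed IndexReader holding a segment file open.
        FileInputStream held = new FileInputStream(tmp.toFile());
        Files.delete(tmp); // gone from ls of the directory, but the space is still in use

        System.out.println("still held after delete: " + holdsDeletedFile());
        held.close(); // closing the reader is what actually frees the space
        System.out.println("still held after close:  " + holdsDeletedFile());
    }
}
```

Running this on Linux shows the file is still held after the delete and released only once the stream is closed, which is exactly the pattern lsof exposes on a leaking JVM.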


--
Ian.


On Mon, Nov 10, 2014 at 11:03 PM, Rob Nikander rob.nikan...@gmail.com wrote:
 Hi,

 I have an index that's about 700 MB, and it grows over days until it
 causes disk-space problems, at about 5GB.  If the JVM process ends, the
 index shrinks back to about 700MB.  I'm calling IndexWriter.commit() all the
 time.  What else do you call to get it to compact its use of space?

 thank you,
 Rob

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to improve the performance in Lucene when query is long?

2014-11-11 Thread Ahmet Arslan
Hi Harry,

Maybe you can use the BooleanQuery#setMinimumNumberShouldMatch method. What
happens when you set it to half of the number of query terms?

ahmet
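The effect can be sketched in plain Java (this shows the semantics, not Lucene's implementation; class and variable names are mine): with n optional SHOULD terms, a document survives only if it matches at least the threshold, which prunes the huge OR candidate set. In Lucene itself you would call setMinimumNumberShouldMatch on the BooleanQuery before searching.

```java
import java.util.*;

public class MinShouldMatch {

    // A doc qualifies only if it contains at least minMatch of the query terms.
    static boolean qualifies(Set<String> docTerms, List<String> queryTerms, int minMatch) {
        int matched = 0;
        for (String t : queryTerms) {
            if (docTerms.contains(t)) matched++;
        }
        return matched >= minMatch;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("peking", "university", "east", "gate", "station");
        int minMatch = query.size() / 2; // Ahmet's suggestion: half of numTerms

        Set<String> goodDoc = new HashSet<>(Arrays.asList("peking", "university", "east", "gate"));
        Set<String> noiseDoc = new HashSet<>(Collections.singletonList("station"));

        System.out.println(qualifies(goodDoc, query, minMatch));  // matches 4 of 5 terms
        System.out.println(qualifies(noiseDoc, query, minMatch)); // matches only 1 of 5
    }
}
```

Documents that share only one of ten terms with the query never reach scoring, which is where the OR query was losing its time.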


On Tuesday, November 11, 2014 8:35 AM, Harry Yu 502437...@qq.com wrote:
Hi everyone,



I have been using Lucene to build a POI searching and geocoding system. In
testing, I found that when the query is long (above 10 terms), searching
becomes too slow, taking close to 1s. I think the bottleneck is that I used OR
to generate my BooleanQuery: it yields plenty of candidate documents, and it
also takes too much time to score and rank them.

I changed to using AND to generate my BooleanQuery, but that decreases the
accuracy of the hits. So I want to find a way to reduce the candidate
documents without decreasing accuracy in this situation.

Thanks for your help!



--
Harry Yu
Institute of Remote Sensing and Geographic Information System,
School of Earth and Space Sciences, Peking University,
Beijing, China, 100871
Email: 502437...@qq.com OR harryyu1...@163.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index keeps growing, then shrinks on restart

2014-11-11 Thread Rob Nikander
On Tue, Nov 11, 2014 at 4:26 AM, Ian Lea ian@gmail.com wrote:

 Telling us the version of lucene and the OS you're running on is
 always a good idea.


Oops, yes.  Lucene 4.10.0, Linux.


A guess here is that you aren't closing index readers, so the JVM will
 be holding on to deleted files until it exits.


That's probably it. I found a code path where, it seems, I assumed the
reader's `close()` would be called by GC/finalization.

Rob


How to map lucene scores to range from 0~100?

2014-11-11 Thread Harry Yu
Hi everyone,


I've run into a new problem. In my system, we need to score docs on a range
from 0 to 100. Are there any easy ways to map Lucene scores to this range?
Thanks for your help~


Yu

Re: How to map lucene scores to range from 0~100?

2014-11-11 Thread Rajendra Rao
Harry,

Converting a score into the range 0 to 100 basically requires normalization
(dividing each score by the highest score and multiplying by 100), but the
result doesn't represent a matching percentage.
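That formula, sketched in plain Java (class and method names are mine), with the caveat above built in: 100 only means "best hit for this query", not a matching percentage.

```java
public class ScoreScaler {

    // Map raw Lucene scores to 0..100 by dividing each by the maximum score.
    static double[] toPercent(float[] scores) {
        float max = 0f;
        for (float s : scores) {
            if (s > max) max = s;
        }
        if (max == 0f) {
            return new double[scores.length]; // avoid divide-by-zero; all zeros
        }
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] / max * 100.0;
        }
        return out;
    }

    public static void main(String[] args) {
        // e.g. scores pulled from TopDocs.scoreDocs[i].score
        double[] pct = toPercent(new float[] {2.0f, 1.0f, 0.5f});
        System.out.println(java.util.Arrays.toString(pct)); // [100.0, 50.0, 25.0]
    }
}
```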


On Tue, Nov 11, 2014 at 7:48 PM, Harry Yu 502437...@qq.com wrote:

 Hi everyone,


 I met a new trouble. In my system, we should score the doc range from 0 to
 100. There are some easy ways to map lucene scores to this scope. Thanks
 for your help~


 Yu

-- 
Launchship Technology  respects your privacy. This email is intended only 
for the use of the party to which it is addressed and may contain 
information that is privileged, confidential, or protected by law. If you 
have received this message in error, or do not want to receive any further 
emails from us, please notify us immediately by replying to the message and 
deleting it from your computer.


Re: How to map lucene scores to range from 0~100?

2014-11-11 Thread Harry Yu
Hi Rajendra,


Thanks for your reply. Normalization is a good way to solve it, but there is a
problem: with that formula, the score of the top doc is always 100. Although
it maps scores into the 0~100 range, the score may not reflect the similarity
between the query and the hit docs.

My system searches POIs, and I want the final score to reflect text
similarity, something like the inverse of the Levenshtein distance between the
query and the hit docs.

Do you have any ideas? Sincere thanks~


Yu
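A sketch of the normalized-edit-distance idea in plain Java (names are mine; for real POI ranking you would apply it to the matched field's text): it yields a query-independent 0~100 similarity, where 100 means the strings are identical, independent of the other hits.

```java
public class EditSimilarity {

    // Classic dynamic-programming Levenshtein distance, two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // 100 = identical strings, 0 = nothing in common.
    static double similarity(String query, String hit) {
        int maxLen = Math.max(query.length(), hit.length());
        if (maxLen == 0) return 100.0;
        return 100.0 * (1.0 - (double) levenshtein(query, hit) / maxLen);
    }

    public static void main(String[] args) {
        System.out.println(similarity("Peking University", "Peking University")); // 100.0
        System.out.printf("%.1f%n", similarity("kitten", "sitting"));
    }
}
```

Unlike max-score normalization, this number does not change when other documents enter or leave the result set.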
 




-- Original Message --
From: Rajendra Rao rajendra@launchship.com
Sent: Tuesday, November 11, 2014 10:55 PM
To: java-user java-user@lucene.apache.org
Subject: Re: How to map lucene scores to range from 0~100?



Harry ,

basically converting score into range 0 to 100 require
normalization(dividing each score with highest record and multiply by .100)
.but this score does n't represent matching %.


On Tue, Nov 11, 2014 at 7:48 PM, Harry Yu 502437...@qq.com wrote:

 Hi everyone,


 I met a new trouble. In my system, we should score the doc range from 0 to
 100. There are some easy ways to map lucene scores to this scope. Thanks
 for your help~


 Yu


RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-11 Thread Martin O'Shea
In the end I edited the code of the StandardAnalyzer and the
SnowballAnalyzer to disable the calls to the LowerCaseFilter. This seems to
work.

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: 10 Nov 2014 15:19
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in
Lucene 3.0.2

Hi,

Regarding Uwe's warning, 

NOTE: SnowballFilter expects lowercased text. [1]

[1]
https://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/anal
ysis/snowball/SnowballFilter.html



On Monday, November 10, 2014 4:43 PM, Uwe Schindler u...@thetaphi.de wrote:
Hi,

 Uwe
 
 Thanks for the reply. Given that SnowBallAnalyzer is made up of a 
 series of filters, I was thinking about something like this where I 
 'pipe' output from one filter to the next:
 
 standardTokenizer = new StandardTokenizer(...);
 standardFilter = new StandardFilter(standardTokenizer, ...);
 stopFilter = new StopFilter(standardFilter, ...);
 snowballFilter = new SnowballFilter(stopFilter, ...);
 
 But ignore LowerCaseFilter. Does this make sense?

Exactly. Create a clone of SnowballAnalyzer (from Lucene source package) in
your own package and remove LowercaseFilter. But be aware, it could be that
snowball needs lowercased terms to correctly do stemming!!! I don't know
about this filter, I just want to make you aware.

The same applies to the stop filter, but that one lets you handle it: make
the stop filter case-insensitive (there is a boolean for this):
StopFilter(boolean enablePositionIncrements, TokenStream input, Set<?>
stopWords, boolean ignoreCase)

Uwe
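For reference, a case-preserving clone along the lines Uwe describes might look like this against the 3.0.2-era API (an untested sketch; the class name is mine, and the constructor signatures are worth double-checking against your jar). It is SnowballAnalyzer's chain minus the LowerCaseFilter, with the case-insensitive StopFilter variant Uwe mentions:

```java
import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public final class CasePreservingSnowballAnalyzer extends Analyzer {
    private final String stemmerName;
    private final Set<?> stopWords;

    public CasePreservingSnowballAnalyzer(String stemmerName, Set<?> stopWords) {
        this.stemmerName = stemmerName;
        this.stopWords = stopWords;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(Version.LUCENE_30, reader);
        result = new StandardFilter(result);
        // No LowerCaseFilter here; ignoreCase=true keeps stop-word removal working.
        result = new StopFilter(true, result, stopWords, true);
        // Caveat from this thread: Snowball may stem uppercased tokens poorly.
        return new SnowballFilter(result, stemmerName);
    }
}
```

As Uwe warns, skipping the LowerCaseFilter means the Snowball stemmer sees mixed-case tokens, so stemming quality should be verified on your data.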

 Martin O'Shea.
 -Original Message-
 From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: 10 Nov 2014 14:06
 To: java-user@lucene.apache.org
 Subject: RE: How to disable LowerCaseFilter when using 
 SnowballAnalyzer in Lucene 3.0.2
 
 Hi,
 
 In general, you cannot change Analyzers; they are examples and can
 be seen as best practice. If you want to modify them, write your own
 Analyzer subclass which uses the Tokenizers and TokenFilters you
 want. You can, for example, clone the source code of the original
 and remove LowercaseFilter. Analyzers are very simple; there is no
 logic in them, just some configuration (which Tokenizer and
 which TokenFilters). In later Lucene 3 and Lucene 4, this is very
 simple: you just need to override createComponents in the Analyzer class and
add your configuration there.
 
 If you use Apache Solr or Elasticsearch you can create your analyzers 
 by XML or JSON configuration.
 
 Uwe
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: Martin O'Shea [mailto:m.os...@dsl.pipex.com]
  Sent: Monday, November 10, 2014 2:54 PM
  To: java-user@lucene.apache.org
  Subject: How to disable LowerCaseFilter when using SnowballAnalyzer 
  in Lucene 3.0.2
 
  I realise that 3.0.2 is an old version of Lucene but if I have Java 
  code as
  follows:
 
 
 
  int nGramLength = 3;
 
  Set<String> stopWords = new HashSet<String>();

  stopWords.add("the");

  stopWords.add("and");

  ...

  SnowballAnalyzer snowballAnalyzer = new
  SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);

  ShingleAnalyzerWrapper shingleAnalyzer = new
  ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
 
 
 
  Which will generate the frequency of ngrams from a particular string of
  text without stop words. How can I disable the LowerCaseFilter which forms
  part of the SnowballAnalyzer? I want to preserve the case of the ngrams
  generated so that I can perform various counts according to the
  presence/absence of upper-case characters in the ngrams.
 
 
 
  I am something of a Lucene newbie. And I should add that upgrading 
  the version of Lucene is not an option here.
 
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-11 Thread Ahmet Arslan
Hi,

With that analyser, searches for the same word with different capitalisation
could return different results.

Ahmet


On Tuesday, November 11, 2014 6:57 PM, Martin O'Shea app...@dsl.pipex.com 
wrote:
In the end I edited the code of the StandardAnalyzer and the
SnowballAnalyzer to disable the calls to the LowerCaseFilter. This seems to
work.

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: 10 Nov 2014 15:19
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in
Lucene 3.0.2

Hi,

Regarding Uwe's warning, 

NOTE: SnowballFilter expects lowercased text. [1]

[1]
https://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/anal
ysis/snowball/SnowballFilter.html



On Monday, November 10, 2014 4:43 PM, Uwe Schindler u...@thetaphi.de wrote:
Hi,

 Uwe
 
 Thanks for the reply. Given that SnowBallAnalyzer is made up of a 
 series of filters, I was thinking about something like this where I 
 'pipe' output from one filter to the next:
 
 standardTokenizer = new StandardTokenizer(...);
 standardFilter = new StandardFilter(standardTokenizer, ...);
 stopFilter = new StopFilter(standardFilter, ...);
 snowballFilter = new SnowballFilter(stopFilter, ...);
 
 But ignore LowerCaseFilter. Does this make sense?

Exactly. Create a clone of SnowballAnalyzer (from Lucene source package) in
your own package and remove LowercaseFilter. But be aware, it could be that
snowball needs lowercased terms to correctly do stemming!!! I don't know
about this filter, I just want to make you aware.

The same applies to the stop filter, but that one lets you handle it: make
the stop filter case-insensitive (there is a boolean for this):
StopFilter(boolean enablePositionIncrements, TokenStream input, Set<?>
stopWords, boolean ignoreCase)

Uwe

 Martin O'Shea.
 -Original Message-
 From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: 10 Nov 2014 14:06
 To: java-user@lucene.apache.org
 Subject: RE: How to disable LowerCaseFilter when using 
 SnowballAnalyzer in Lucene 3.0.2
 
 Hi,
 
 In general, you cannot change Analyzers, they are examples and can 
 be seen as best practise. If you want to modify them, write your own 
 Analyzer subclass which uses the wanted Tokenizers and TokenFilters as 
 you like. You can for example clone the source code of the original 
 and remove LowercaseFilter. Analyzers are very simple, there is no 
 logic in them, it's just some configuration (which Tokenizer and 
 which TokenFilters). In later Lucene 3 and Lucene 4, this is very 
 simple: You just need to override createComponents in Analyzer class and
add your configuration there.
 
 If you use Apache Solr or Elasticsearch you can create your analyzers 
 by XML or JSON configuration.
 
 Uwe
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: Martin O'Shea [mailto:m.os...@dsl.pipex.com]
  Sent: Monday, November 10, 2014 2:54 PM
  To: java-user@lucene.apache.org
  Subject: How to disable LowerCaseFilter when using SnowballAnalyzer 
  in Lucene 3.0.2
 
  I realise that 3.0.2 is an old version of Lucene but if I have Java 
  code as
  follows:
 
 
 
  int nGramLength = 3;
 
  Set<String> stopWords = new HashSet<String>();

  stopWords.add("the");

  stopWords.add("and");

  ...

  SnowballAnalyzer snowballAnalyzer = new
  SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);

  ShingleAnalyzerWrapper shingleAnalyzer = new
  ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
 
 
 
  Which will generate the frequency of ngrams from a particular string of
  text without stop words. How can I disable the LowerCaseFilter which forms
  part of the SnowballAnalyzer? I want to preserve the case of the ngrams
  generated so that I can perform various counts according to the
  presence/absence of upper-case characters in the ngrams.
 
 
 
  I am something of a Lucene newbie. And I should add that upgrading 
  the version of Lucene is not an option here.
 
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org






 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, 

RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-11 Thread Martin O'Shea
Ahmet, 

Yes, that is quite true. But as this is only a proof-of-concept application,
I'm prepared for things to be 'imperfect'.

Martin O'Shea.

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: 11 Nov 2014 18:26
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in
Lucene 3.0.2

Hi,

With that analyser, searches for the same word with different capitalisation
could return different results.

Ahmet


On Tuesday, November 11, 2014 6:57 PM, Martin O'Shea app...@dsl.pipex.com
wrote:
In the end I edited the code of the StandardAnalyzer and the
SnowballAnalyzer to disable the calls to the LowerCaseFilter. This seems to
work.

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
Sent: 10 Nov 2014 15:19
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in
Lucene 3.0.2

Hi,

Regarding Uwe's warning, 

NOTE: SnowballFilter expects lowercased text. [1]

[1]
https://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/anal
ysis/snowball/SnowballFilter.html



On Monday, November 10, 2014 4:43 PM, Uwe Schindler u...@thetaphi.de wrote:
Hi,

 Uwe
 
 Thanks for the reply. Given that SnowBallAnalyzer is made up of a 
 series of filters, I was thinking about something like this where I 
 'pipe' output from one filter to the next:
 
 standardTokenizer = new StandardTokenizer(...);
 standardFilter = new StandardFilter(standardTokenizer, ...);
 stopFilter = new StopFilter(standardFilter, ...);
 snowballFilter = new SnowballFilter(stopFilter, ...);
 
 But ignore LowerCaseFilter. Does this make sense?

Exactly. Create a clone of SnowballAnalyzer (from Lucene source package) in
your own package and remove LowercaseFilter. But be aware, it could be that
snowball needs lowercased terms to correctly do stemming!!! I don't know
about this filter, I just want to make you aware.

The same applies to the stop filter, but that one lets you handle it: make
the stop filter case-insensitive (there is a boolean for this):
StopFilter(boolean enablePositionIncrements, TokenStream input, Set<?>
stopWords, boolean ignoreCase)

Uwe

 Martin O'Shea.
 -Original Message-
 From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: 10 Nov 2014 14:06
 To: java-user@lucene.apache.org
 Subject: RE: How to disable LowerCaseFilter when using 
 SnowballAnalyzer in Lucene 3.0.2
 
 Hi,
 
 In general, you cannot change Analyzers, they are examples and can 
 be seen as best practise. If you want to modify them, write your own 
 Analyzer subclass which uses the wanted Tokenizers and TokenFilters as 
 you like. You can for example clone the source code of the original 
 and remove LowercaseFilter. Analyzers are very simple, there is no 
 logic in them, it's just some configuration (which Tokenizer and 
 which TokenFilters). In later Lucene 3 and Lucene 4, this is very
 simple: You just need to override createComponents in Analyzer class 
 and
add your configuration there.
 
 If you use Apache Solr or Elasticsearch you can create your analyzers 
 by XML or JSON configuration.
 
 Uwe
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: Martin O'Shea [mailto:m.os...@dsl.pipex.com]
  Sent: Monday, November 10, 2014 2:54 PM
  To: java-user@lucene.apache.org
  Subject: How to disable LowerCaseFilter when using SnowballAnalyzer 
  in Lucene 3.0.2
 
  I realise that 3.0.2 is an old version of Lucene but if I have Java 
  code as
  follows:
 
 
 
  int nGramLength = 3;
 
  Set<String> stopWords = new HashSet<String>();

  stopWords.add("the");

  stopWords.add("and");

  ...

  SnowballAnalyzer snowballAnalyzer = new
  SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);

  ShingleAnalyzerWrapper shingleAnalyzer = new
  ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
 
 
 
  Which will generate the frequency of ngrams from a particular string of
  text without stop words. How can I disable the LowerCaseFilter which forms
  part of the SnowballAnalyzer? I want to preserve the case of the ngrams
  generated so that I can perform various counts according to the
  presence/absence of upper-case characters in the ngrams.
 
 
 
  I am something of a Lucene newbie. And I should add that upgrading 
  the version of Lucene is not an option here.
 
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org






 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: 

Document Term matrix

2014-11-11 Thread Elshaimaa Ali
Hi All,
I have a Lucene index built with Lucene 4.9 for 584 text documents. I need to
extract a document-term matrix and a document-document similarity matrix in
order to cluster the documents. My questions:
1. How can I extract the matrix and compute the similarity between documents
in Lucene?
2. Is there any Java-based code that can cluster the documents from a Lucene
index?
Regards,
Shaimaa

  

Re: Document Term matrix

2014-11-11 Thread parnab kumar
Hi,

While indexing the documents, store the term vectors for the content field.
Then, for each document, you will have an array of terms and their
corresponding frequencies. Using the IndexReader you can retrieve these term
vectors, and the similarity between two documents can be computed as the
similarity of their term vectors. Since tf-idf is well known and seems to
give a better sense of similarity, simply multiply the idf of each term by
its frequency to re-weight the vectors.

Thanks,
Parnab
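The arithmetic Parnab describes, sketched in plain Java with made-up numbers (names are mine): in practice the term/frequency pairs would come from the stored term vectors (e.g. IndexReader.getTermVector in Lucene 4.x) and the idf values from document frequencies.

```java
import java.util.HashMap;
import java.util.Map;

public class TermVectorSimilarity {

    // Re-weight each term's raw frequency by its idf, as Parnab suggests.
    static Map<String, Double> tfIdf(Map<String, Integer> termFreqs, Map<String, Double> idf) {
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            weighted.put(e.getKey(), e.getValue() * idf.getOrDefault(e.getKey(), 0.0));
        }
        return weighted;
    }

    // Cosine similarity of two sparse term vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0.0
                : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> idf = Map.of("lucene", 2.0, "index", 1.0, "cluster", 3.0);
        Map<String, Double> d1 = tfIdf(Map.of("lucene", 2, "index", 1), idf);
        Map<String, Double> d2 = tfIdf(Map.of("lucene", 1, "cluster", 1), idf);
        System.out.printf("%.3f%n", cosine(d1, d2));
    }
}
```

Filling the full 584x584 document-document matrix is then just this cosine over every pair of term vectors.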

On Tue, Nov 11, 2014 at 8:36 PM, Elshaimaa Ali elshaimaa@hotmail.com
wrote:

 Hi All,
 I have a Lucene index built with Lucene 4.9 for 584 text documents, I need
 to extract a Document-term matrix, and Document Document similarity matrix
 in-order to use it to cluster the documents. My questions:1- How can I
 extract the matrix and compute the similarity between documents in
 Lucene.2- Is there any java based code that can cluster the documents from
 Lucene index.
 RegardsShaimaa




Re: Document Term matrix

2014-11-11 Thread Paul Libbrecht
The SemanticVectors project might do what you are looking for.
paul


On 11 nov. 2014, at 22:37, parnab kumar parnab.2...@gmail.com wrote:

 hi,
 
 While indexing the documents , store the Term Vectors for the content
 field. Now for each document you will have an array of terms  and their
 corresponding frequency in the document. Using the Index Reader you can
 retrieve this term vectors. Similarity between two documents can be
 computed as the similarity of their term vectors. Since tf-idf is most well
 known and seems to give better sense of similarity, simply multiply the idf
 of the term with the frequency to re weight the vectors.
 
 Thanks,
 Parnab
 
 On Tue, Nov 11, 2014 at 8:36 PM, Elshaimaa Ali elshaimaa@hotmail.com
 wrote:
 
 Hi All,
 I have a Lucene index built with Lucene 4.9 for 584 text documents, I need
 to extract a Document-term matrix, and Document Document similarity matrix
 in-order to use it to cluster the documents. My questions:1- How can I
 extract the matrix and compute the similarity between documents in
 Lucene.2- Is there any java based code that can cluster the documents from
 Lucene index.
 RegardsShaimaa
 
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Document Term matrix

2014-11-11 Thread Ahmet Arslan
Hi,

Mahout and Carrot2 can cluster the documents from a Lucene index.

ahmet



On Tuesday, November 11, 2014 10:37 PM, Elshaimaa Ali 
elshaimaa@hotmail.com wrote:
Hi All,
I have a Lucene index built with Lucene 4.9 for 584 text documents, I need to 
extract a Document-term matrix, and Document Document similarity matrix 
in-order to use it to cluster the documents. My questions:1- How can I extract 
the matrix and compute the similarity between documents in Lucene.2- Is there 
any java based code that can cluster the documents from Lucene index.
RegardsShaimaa 

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org