[jira] Updated: (LUCENE-2189) Simple9 (de)compression

2010-01-04 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-2189:
-

Attachment: LUCENE-2189a.patch

Simple9 encoder/decoder and passing tests.
This 2189a patch still has a FIXME at the encoder, to not use more elements 
than given.
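
(For context, a minimal sketch of the Simple9 word layout as commonly described 
in the literature; the patch's exact conventions may differ:)

{code}
// Simple9 packs a 4-bit selector plus 28 data bits into each 32-bit word.
// The nine standard cases, as {count of values, bits per value}:
static final int[][] SIMPLE9_CASES = {
    {28, 1}, {14, 2}, {9, 3}, {7, 4}, {5, 5}, {4, 7}, {3, 9}, {2, 14}, {1, 28}
};
{code}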

 Simple9 (de)compression
 ---

 Key: LUCENE-2189
 URL: https://issues.apache.org/jira/browse/LUCENE-2189
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Paul Elschot
Priority: Minor
 Attachments: LUCENE-2189a.patch


 Simple9 is an alternative for VInt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2189) Simple9 (de)compression

2010-01-04 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796111#action_12796111
 ] 

Uwe Schindler commented on LUCENE-2189:
---

Just a comment on the switch:
As far as I know, Java switch statements are very fast if there are few cases 
and the case values are close together and therefore small numbers. I would 
suggest not switching on the raw ANDed status, but instead shifting the 
status >>> 28 (and removing the &), and then only listing the raw status 
values 0..9 in the switch statement.
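
(For illustration, a minimal sketch of a decoder arranged that way, assuming 
the 4-bit selector sits in the top bits of each word; a sketch, not the patch 
code:)

{code}
// Hypothetical sketch: shift the selector down to 0..8 so the JVM can
// compile the switch to a dense tableswitch.
static int decode(int word, int[] out, int pos) {
  switch (word >>> 28) {           // selector in top 4 bits; no & needed
    case 0:                        // 28 x 1-bit values
      for (int i = 0; i < 28; i++) out[pos + i] = (word >>> i) & 1;
      return 28;
    case 1:                        // 14 x 2-bit values
      for (int i = 0; i < 14; i++) out[pos + i] = (word >>> (2 * i)) & 3;
      return 14;
    // ... cases 2..7 analogous: 9x3, 7x4, 5x5, 4x7, 3x9, 2x14 bits ...
    case 8:                        // 1 x 28-bit value
      out[pos] = word & ((1 << 28) - 1);
      return 1;
    default:
      throw new IllegalArgumentException("invalid selector");
  }
}
{code}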

 Simple9 (de)compression
 ---

 Key: LUCENE-2189
 URL: https://issues.apache.org/jira/browse/LUCENE-2189
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Paul Elschot
Priority: Minor
 Attachments: LUCENE-2189a.patch


 Simple9 is an alternative for VInt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2189) Simple9 (de)compression

2010-01-04 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796111#action_12796111
 ] 

Uwe Schindler edited comment on LUCENE-2189 at 1/4/10 8:11 AM:
---

Just a comment on the switch:
As far as I know, Java switch statements are very fast if there are few cases 
and the case values are close together and therefore small numbers. I would 
suggest not switching on the raw ANDed status, but instead shifting the 
status >>> 28 (and removing the &), and then only listing the status values 
0..9 in the switch statement.

  was (Author: thetaphi):
Just a comment on the switch:
As far as I know, Java switch statements are very fast if there are few cases 
and the case values are close together and therefore small numbers. I would 
suggest not switching on the raw ANDed status, but instead shifting the 
status >>> 28 (and removing the &), and then only listing the raw status 
values 0..9 in the switch statement.
  
 Simple9 (de)compression
 ---

 Key: LUCENE-2189
 URL: https://issues.apache.org/jira/browse/LUCENE-2189
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Paul Elschot
Priority: Minor
 Attachments: LUCENE-2189a.patch


 Simple9 is an alternative for VInt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2189) Simple9 (de)compression

2010-01-04 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796112#action_12796112
 ] 

Uwe Schindler commented on LUCENE-2189:
---

Here is the explanation: 
[http://java.sun.com/docs/books/jvms/first_edition/html/Compiling.doc.html]
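(The short version of the linked section: a switch whose case values are dense 
compiles to a tableswitch, a constant-time jump table, while sparse case values 
compile to a lookupswitch, a binary search. For example:)

{code}
// Dense case values 0..3 compile to tableswitch (O(1) dispatch); sparse
// values such as 0, 1 << 28, 2 << 28 would compile to lookupswitch instead.
static int bitsFor(int selector) {
  switch (selector) {
    case 0: return 1;
    case 1: return 2;
    case 2: return 3;
    case 3: return 4;
    default: return -1;
  }
}
{code}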

 Simple9 (de)compression
 ---

 Key: LUCENE-2189
 URL: https://issues.apache.org/jira/browse/LUCENE-2189
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Paul Elschot
Priority: Minor
 Attachments: LUCENE-2189a.patch


 Simple9 is an alternative for VInt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2189) Simple9 (de)compression

2010-01-04 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796113#action_12796113
 ] 

Paul Elschot commented on LUCENE-2189:
--

About the switch: I had the down-shift in there initially, but then I left it 
out for decoding speed. I could move the status bits to the lower part so that 
the shift is not needed at all, if that does not affect data decoding. I'll 
have a look at it. Thanks.
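
(A hypothetical sketch of that variant, not the patch: with the selector in 
the low 4 bits, the switch argument needs only a mask instead of a shift, and 
the data bits shift down past the selector:)

{code}
// Hypothetical layout: selector in bits 0..3, data in bits 4..31.
static int decodeLowSelector(int word, int[] out, int pos) {
  switch (word & 15) {             // mask instead of shift; still dense 0..8
    case 0:                        // 28 x 1-bit values in bits 4..31
      for (int i = 0; i < 28; i++) out[pos + i] = (word >>> (4 + i)) & 1;
      return 28;
    // ... remaining cases as in the high-selector version ...
    default:
      throw new IllegalArgumentException("invalid selector");
  }
}
{code}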

 Simple9 (de)compression
 ---

 Key: LUCENE-2189
 URL: https://issues.apache.org/jira/browse/LUCENE-2189
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Paul Elschot
Priority: Minor
 Attachments: LUCENE-2189a.patch


 Simple9 is an alternative for VInt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)

2010-01-04 Thread Michael McCandless (JIRA)
CustomScoreQuery (function query) is broken (due to per-segment searching)
--

 Key: LUCENE-2190
 URL: https://issues.apache.org/jira/browse/LUCENE-2190
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 3.0, 2.9.1, 2.9, 3.0.1, 3.1
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9.2, 3.0.1, 3.1


Spinoff from here:

  http://lucene.markmail.org/message/psw2m3adzibaixbq

With the cutover to per-segment searching, CustomScoreQuery is not really 
usable anymore, because the per-doc custom scoring method (customScore) 
receives a per-segment docID, yet there is no way to figure out which segment 
you are currently searching.

I think to fix this we must also notify the subclass whenever a new segment is 
switched to.  I think if we copy Collector.setNextReader, that would be 
sufficient.  It would by default do nothing in CustomScoreQuery, but a subclass 
could override.
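
A sketch of what such a subclass might look like, assuming the proposed hook is 
added (the setNextReader method below is the proposal from this issue, not an 
existing API):

{code}
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.function.CustomScoreQuery;

public class MyCustomScoreQuery extends CustomScoreQuery {
  private int docBase;  // maps per-segment docIDs back to top-level docIDs

  public MyCustomScoreQuery(Query subQuery) {
    super(subQuery);
  }

  // proposed hook, mirroring Collector.setNextReader: called whenever
  // searching switches to a new segment (default impl would do nothing)
  public void setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase;
  }

  @Override
  public float customScore(int doc, float subQueryScore, float valSrcScore) {
    int topLevelDoc = docBase + doc;  // now usable with external per-doc data
    return subQueryScore * valSrcScore;
  }
}
{code}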

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2191) rename Tokenizer.reset(Reader) to Tokenizer.setReader(Reader)

2010-01-04 Thread Robert Muir (JIRA)
rename Tokenizer.reset(Reader) to Tokenizer.setReader(Reader)
-

 Key: LUCENE-2191
 URL: https://issues.apache.org/jira/browse/LUCENE-2191
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Robert Muir
Priority: Minor


In TokenStream there is a reset() method, but the method in Tokenizer used to 
set a new Reader is called reset(Reader).

In my opinion this name overloading creates a lot of confusion, and we see 
things like reset(Reader) calling reset() even in StandardTokenizer...

So I think this would take some work to satisfy all the backwards 
compatibility requirements, but it is worth it: when you look at the existing 
reset(Reader) and reset() code in various tokenizers, or the javadocs for 
Tokenizer, it's pretty confusing and inconsistent.
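
(A minimal sketch of the current shape of the problem under the 3.0 API; the 
class name is hypothetical:)

{code}
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;

public final class MyTokenizer extends Tokenizer {
  public MyTokenizer(Reader input) { super(input); }

  @Override
  public boolean incrementToken() { return false; }  // no-op, for illustration

  // TokenStream.reset(): rewind this stream so it can be consumed again
  @Override
  public void reset() throws IOException { super.reset(); }

  // Tokenizer.reset(Reader): swap in a completely new input -- same name,
  // entirely different meaning; the proposal renames this to setReader(Reader)
  @Override
  public void reset(Reader input) throws IOException { super.reset(input); }
}
{code}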


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Reopened: (LUCENE-2079) Further improvements to contrib/benchmark for testing NRT

2010-01-04 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened LUCENE-2079:



The BG thread priority is not finding its way down to the parallel threads, and 
is causing the nightly build to sometimes hang.

I've disabled the test case for now...

 Further improvements to contrib/benchmark for testing NRT
 -

 Key: LUCENE-2079
 URL: https://issues.apache.org/jira/browse/LUCENE-2079
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2079.patch


 Some small changes:
   * Allow specifying a priority for BG threads, after the &
 character; the priority increment is a + or - int that's added to the main
 thread's priority to set the child thread's.  For my NRT tests I make
 the reopen thread +2, the indexing threads +1, and leave searching
 threads at their default.
   * Added test case
   * NearRealTimeReopenTask now reports @ the end the full array of
 msec of each reopen latency
   * Added optional breakout of counts by time steps.  If you set
 log.time.step.msec to eg 1000 then reported counts for serial task
 sequence is broken out by 1 second windows.  EG you can use this
 to measure slowdown over time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2188) A handy utility class for tracking deprecated overridden methods

2010-01-04 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2188:
--

Attachment: (was: LUCENE-2188.patch)

 A handy utility class for tracking deprecated overridden methods
 

 Key: LUCENE-2188
 URL: https://issues.apache.org/jira/browse/LUCENE-2188
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: LUCENE-2188.patch, LUCENE-2188.patch, LUCENE-2188.patch, 
 LUCENE-2188.patch


 This issue provides a new handy utility class that keeps track of overridden 
 deprecated methods in non-final sub classes. This class can be used in new 
 deprecations.
 See the javadocs for an example.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2188) A handy utility class for tracking deprecated overridden methods

2010-01-04 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2188:
--

Attachment: LUCENE-2188.patch

New patch; the previous one had the compare method arguments in the wrong 
order. Fixed docs and Analyzer and tests. I always get totally confused when 
using compareTo() and compare() :-(
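
(For reference, the contract that is so easy to mix up:)

{code}
import java.util.Comparator;

// compare(a, b) must behave like a.compareTo(b): negative when a sorts
// before b. Swapping the two arguments silently reverses the sort order.
Comparator<String> byLength = new Comparator<String>() {
  public int compare(String a, String b) {
    return a.length() - b.length();  // a first, b second
  }
};
{code}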

 A handy utility class for tracking deprecated overridden methods
 

 Key: LUCENE-2188
 URL: https://issues.apache.org/jira/browse/LUCENE-2188
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: LUCENE-2188.patch, LUCENE-2188.patch, LUCENE-2188.patch, 
 LUCENE-2188.patch


 This issue provides a new handy utility class that keeps track of overridden 
 deprecated methods in non-final sub classes. This class can be used in new 
 deprecations.
 See the javadocs for an example.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Build failed in Hudson: Lucene-trunk #1052

2010-01-04 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1052/changes

Changes:

[rmuir] LUCENE-2185: add @Deprecated annotations

[rmuir] LUCENE-2084: remove Byte/CharBuffer wrapping for collation key 
generation

[rmuir] LUCENE-2034: Refactor analyzer reuse and stopword handling

--
[...truncated 29995 lines...]
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/regex/lucene-regex-2010-01-04_02-04-49-javadoc.jar
 [echo] Building remote...

javadocs:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-remote
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.lucene.search...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 1.5.0_14
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...
  [javadoc] Generating 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-remote/stylesheet.css...
  [javadoc] Note: Custom tags that could override future standard tags:  @todo. 
To avoid potential overrides, use at least one period character (.) in custom 
tag names.
  [javadoc] Note: Custom tags that were not seen:  @todo, @uml.property
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/remote/lucene-remote-2010-01-04_02-04-49-javadoc.jar
 [echo] Building snowball...

javadocs:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-snowball
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package 
org.apache.lucene.analysis.snowball...
  [javadoc] Loading source files for package org.tartarus.snowball...
  [javadoc] Loading source files for package org.tartarus.snowball.ext...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 1.5.0_14
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...
  [javadoc] Generating 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-snowball/stylesheet.css...
  [javadoc] Note: Custom tags that could override future standard tags:  @todo. 
To avoid potential overrides, use at least one period character (.) in custom 
tag names.
  [javadoc] Note: Custom tags that were not seen:  @todo, @uml.property
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/snowball/lucene-snowball-2010-01-04_02-04-49-javadoc.jar
 [echo] Building spatial...

javadocs:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-spatial
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package 
org.apache.lucene.spatial.geohash...
  [javadoc] Loading source files for package 
org.apache.lucene.spatial.geometry...
  [javadoc] Loading source files for package 
org.apache.lucene.spatial.geometry.shape...
  [javadoc] Loading source files for package org.apache.lucene.spatial.tier...
  [javadoc] Loading source files for package 
org.apache.lucene.spatial.tier.projections...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 1.5.0_14
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...
  [javadoc] Generating 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-spatial/stylesheet.css...
  [javadoc] Note: Custom tags that could override future standard tags:  @todo. 
To avoid potential overrides, use at least one period character (.) in custom 
tag names.
  [javadoc] Note: Custom tags that were not seen:  @todo, @uml.property
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/spatial/lucene-spatial-2010-01-04_02-04-49-javadoc.jar
 [echo] Building spellchecker...

javadocs:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-spellchecker
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.lucene.search.spell...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 1.5.0_14
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...
  [javadoc] Generating 

[jira] Created: (LUCENE-2192) Memory Leak

2010-01-04 Thread Ramazan VARLIKLI (JIRA)
Memory Leak 


 Key: LUCENE-2192
 URL: https://issues.apache.org/jira/browse/LUCENE-2192
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9
Reporter: Ramazan VARLIKLI



Hi All,

I have been working on a problem with Lucene and have now given up after 
trying many different possibilities, which gives me the feeling that there is 
a bug here.

The scenario: we have a CMS application to which we add new content every 
week. Instead of updating the index, which is a bit tricky, I prefer to delete 
all index documents and add them again, which is straightforward. The problem 
is that Lucene somehow doesn't delete the old data and increases the index 
size every time during the update. I also profiled it with Java tools and saw 
that even after I close the IndexWriter and release it to the garbage 
collector, it holds all the docs in memory.

Here is the code I use:

Directory directory = new SimpleFSDirectory(new File(path));
// create=false: append to the existing index rather than replacing it
writer = new IndexWriter(directory, analyzer,
    false, IndexWriter.MaxFieldLength.LIMITED);
writer.deleteAll();
// after adding docs, close the IndexWriter
writer.close();

The above code is invoked every time we need to update the index. I have tried 
many different scenarios to overcome the problem, including physically 
removing the index directory (see how desperate I am), optimizing, flushing, 
committing the IndexWriter, the create=true parameter, and so on.


Here are the index file sizes during creation. If I shut down the application 
and restart it, the index size starts at 2,458 KB, which is the correct size.

Any help will be appreciated.


_17.cfs   2,458 KB
_18.cfs   3,990 KB
_19.cfs  5,149 KB

Here are the Lucene logs from creating the index files three times in a row: 
IFD [http-8080-1]: setInfoStream 
deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@6649
IW 0 [http-8080-1]: setInfoStream: 
dir=org.apache.lucene.store.simplefsdirect...@c:\Documents and 
Settings\rvarlikli\workspace\.metadata\.plugins\org.eclipse.wst.server.core\tmp0\wtpwebapps\Clipbank3.5\lucene
 autoCommit=false 
mergepolicy=org.apache.lucene.index.logbytesizemergepol...@3b626c 
mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@baa6ba 
ramBufferSizeMB=16.0 maxBufferedDocs=-1 maxBuffereDeleteTerms=-1 
maxFieldLength=1 index=
IW 0 [http-8080-1]: now flush at close
IW 0 [http-8080-1]:   flush: segment=_17 docStoreSegment=_17 docStoreOffset=0 
flushDocs=true flushDeletes=true flushDocStores=true numDocs=2765 
numBufDelTerms=0
IW 0 [http-8080-1]:   index before flush 
IW 0 [http-8080-1]: DW: flush postings as segment _17 numDocs=2765
IW 0 [http-8080-1]: DW: closeDocStore: 2 files to flush to segment _17 
numDocs=2765
IW 0 [http-8080-1]: DW:   oldRAMSize=7485440 newFlushedSize=2472818 
docs/MB=1,172.473 new/old=33.035%
IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = false]
IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = false]
IFD [http-8080-1]: delete _17.fdx
IFD [http-8080-1]: delete _17.tis
IFD [http-8080-1]: delete _17.frq
IFD [http-8080-1]: delete _17.nrm
IFD [http-8080-1]: delete _17.fdt
IFD [http-8080-1]: delete _17.fnm
IFD [http-8080-1]: delete _17.tii
IFD [http-8080-1]: delete _17.prx
IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = false]
IW 0 [http-8080-1]: LMP: findMerges: 1 segments
IW 0 [http-8080-1]: LMP:   level 6.2247195 to 6.400742: 1 segments
IW 0 [http-8080-1]: CMS: now merge
IW 0 [http-8080-1]: CMS:   index: _17:c2765
IW 0 [http-8080-1]: CMS:   no more merges pending; now return
IW 0 [http-8080-1]: CMS: now merge
IW 0 [http-8080-1]: CMS:   index: _17:c2765
IW 0 [http-8080-1]: CMS:   no more merges pending; now return
IW 0 [http-8080-1]: now call final commit()
IW 0 [http-8080-1]: startCommit(): start sizeInBytes=0
IW 0 [http-8080-1]: startCommit index=_17:c2765 changeCount=5
IW 0 [http-8080-1]: now sync _17.cfs
IW 0 [http-8080-1]: done all syncs
IW 0 [http-8080-1]: commit: pendingCommit != null
IW 0 [http-8080-1]: commit: wrote segments file segments_1k
IFD [http-8080-1]: now checkpoint segments_1k [1 segments ; isCommit = true]
IFD [http-8080-1]: deleteCommits: now decRef commit segments_1j
IFD [http-8080-1]: delete _16.cfs
IFD [http-8080-1]: delete segments_1j
IW 0 [http-8080-1]: commit: done
IW 0 [http-8080-1]: at close: _17:c2765
IFD [http-8080-1]: setInfoStream 
deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@fb1ba7
IW 1 [http-8080-1]: setInfoStream: 
dir=org.apache.lucene.store.simplefsdirect...@c:\Documents and 
Settings\rvarlikli\workspace\.metadata\.plugins\org.eclipse.wst.server.core\tmp0\wtpwebapps\Clipbank3.5\lucene
 autoCommit=false 
mergepolicy=org.apache.lucene.index.logbytesizemergepol...@1d49559 
mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@1990e2d 
ramBufferSizeMB=16.0 maxBufferedDocs=-1 maxBuffereDeleteTerms=-1 
maxFieldLength=1 

[jira] Commented: (LUCENE-2181) benchmark for collation

2010-01-04 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796181#action_12796181
 ] 

Steven Rowe commented on LUCENE-2181:
-

{quote}
bq. ... these four files don't have Apache2 license declarations in them. We 
should put a README (or something like it) with these files to indicate the 
license.

Are they really apache license? or derived from wikipedia content?... I don't 
think we should be putting apache license headers in these files
{quote}

Hmm, I just assumed that since these files were not (anything even close to) 
verbatim copies that they were independently licensable new works, but it's 
definitely more complicated than that...

This looks like the place to start where licensing is concerned:

http://en.wikipedia.org/wiki/Wikipedia_Copyright

My (way non-expert) reading of this is that Wikipedia-derived works (and I'm 
pretty sure these frequency lists qualify as such) must be licensed under the 
[Creative Commons Attribution-Share Alike 3.0 Unported 
license|http://creativecommons.org/licenses/by-sa/3.0/], which does not appear 
to me to be entirely compatible with the Apache2 license.

So I agree with you :) - with the caveat that some form of attribution and a 
pointer to licensing info should be included with these files.


 benchmark for collation
 ---

 Key: LUCENE-2181
 URL: https://issues.apache.org/jira/browse/LUCENE-2181
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/benchmark
Reporter: Robert Muir
Assignee: Robert Muir
 Attachments: LUCENE-2181.patch.zip


 Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
 jdk and icu) under LUCENE-2084, along with some instructions to run it... 
 I think it would be nice if we could turn this into a committable patch and 
 add it to benchmark.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Build failed in Hudson: Lucene-trunk #1052

2010-01-04 Thread Michael McCandless
This was the build I killed, because it was hung in
contrib/benchmark's TestPerfTasksLogic.testBGSearchThreads.

Mike

On Mon, Jan 4, 2010 at 8:13 AM, Apache Hudson Server
hud...@hudson.zones.apache.org wrote:
 See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1052/changes

 Changes:

 [rmuir] LUCENE-2185: add @Deprecated annotations

 [rmuir] LUCENE-2084: remove Byte/CharBuffer wrapping for collation key 
 generation

 [rmuir] LUCENE-2034: Refactor analyzer reuse and stopword handling

 --
 [...truncated 29995 lines...]
      [jar] Building jar: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/regex/lucene-regex-2010-01-04_02-04-49-javadoc.jar
     [echo] Building remote...

 javadocs:
    [mkdir] Created dir: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-remote
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.lucene.search...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 1.5.0_14
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...
  [javadoc] Generating 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-remote/stylesheet.css...
  [javadoc] Note: Custom tags that could override future standard tags: 
 @todo. To avoid potential overrides, use at least one period character 
 (.) in custom tag names.
  [javadoc] Note: Custom tags that were not seen:  @todo, @uml.property
      [jar] Building jar: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/remote/lucene-remote-2010-01-04_02-04-49-javadoc.jar
     [echo] Building snowball...

 javadocs:
    [mkdir] Created dir: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-snowball
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package 
 org.apache.lucene.analysis.snowball...
  [javadoc] Loading source files for package org.tartarus.snowball...
  [javadoc] Loading source files for package org.tartarus.snowball.ext...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 1.5.0_14
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...
  [javadoc] Generating 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-snowball/stylesheet.css...
  [javadoc] Note: Custom tags that could override future standard tags: 
 @todo. To avoid potential overrides, use at least one period character 
 (.) in custom tag names.
  [javadoc] Note: Custom tags that were not seen:  @todo, @uml.property
      [jar] Building jar: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/snowball/lucene-snowball-2010-01-04_02-04-49-javadoc.jar
     [echo] Building spatial...

 javadocs:
    [mkdir] Created dir: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-spatial
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package 
 org.apache.lucene.spatial.geohash...
  [javadoc] Loading source files for package 
 org.apache.lucene.spatial.geometry...
  [javadoc] Loading source files for package 
 org.apache.lucene.spatial.geometry.shape...
  [javadoc] Loading source files for package org.apache.lucene.spatial.tier...
  [javadoc] Loading source files for package 
 org.apache.lucene.spatial.tier.projections...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 1.5.0_14
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...
  [javadoc] Generating 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-spatial/stylesheet.css...
  [javadoc] Note: Custom tags that could override future standard tags: 
 @todo. To avoid potential overrides, use at least one period character 
 (.) in custom tag names.
  [javadoc] Note: Custom tags that were not seen:  @todo, @uml.property
      [jar] Building jar: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/spatial/lucene-spatial-2010-01-04_02-04-49-javadoc.jar
     [echo] Building spellchecker...

 javadocs:
    [mkdir] Created dir: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-spellchecker
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.lucene.search.spell...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 

[jira] Updated: (LUCENE-1990) Add unsigned packed int impls in oal.util

2010-01-04 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated LUCENE-1990:
---

Attachment: ba.zip

I made some small tweaks to improve performance and added long[]-backed 
versions of Packed (optimal space) and Aligned (no values span underlying 
blocks), then ran the performance tests on 5 different computers. It seems very 
clear that level 2 cache (and presumably RAM speed, but I do not know how to 
determine that without root access on a Linux box) plays a bigger role for 
access speed than mere CPU speed. One 3GHz machine with 1MB of level 2 cache 
was about half as fast as a 1.8GHz laptop with 2MB of level 2 cache.

There is a whole lot of measurements and it is getting hard to digest. I've 
attached logs from the 5 computers, should anyone want to have a look. Some 
observations are:

1. The penalty of using long[] instead of int[] on my 32-bit laptop depends on 
the number of values in the array. For less than a million values, it is 
severe: the long[] version is 30-60% slower, depending on whether packed or 
aligned values are used. Above that, it was 10% slower for Aligned, 25% slower 
for Packed.
On the other hand, 64-bit machines do not seem to care much whether int[] 
or long[] is used: there was a 10% win for arrays below 1M for one machine, 50% 
for arrays below 100K (8% for 1M, 6% for 10M) for another, and a small loss of 
below 1% for all lengths above 10K for a third.

2. There's a fast drop-off in speed when the array reaches a certain size that 
is correlated to level 2 cache size. After that, the speed does not decrease 
much when the array grows. This also affects direct writes to an int[] and has 
the interesting implication that a packed array out-performs the direct access 
approach for writes in a number of cases. For reads, there's no contest: Direct 
access to int[] is blazingly fast.

3. The access speed of the different implementations converges as the number 
of values in the array rises (think 10M+ values): the slow round-trip to main 
memory dwarfs the logic used for value extraction. 

Observation #3 supports Mike McCandless's choice of going for the packed 
approach, and #1 suggests using int[] as the internal structure for now. Using 
int[] as the internal structure makes it unfeasible to accept longs as input 
(or rather: longs with more than 32 significant bits). I don't know if this is 
acceptable?
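
(For readers following along, a minimal sketch of the kind of value-extraction 
logic being measured here, assuming little-endian bit packing into a long[]; 
this is not the benchmarked code:)

{code}
// Read value number index from blocks, where every value occupies
// bitsPerValue bits (< 64) and values may straddle two adjacent longs.
static long get(long[] blocks, int bitsPerValue, long index) {
  long bitPos = index * bitsPerValue;
  int block  = (int) (bitPos >>> 6);   // bitPos / 64
  int offset = (int) (bitPos & 63);    // bit offset inside that block
  long value = blocks[block] >>> offset;
  if (offset + bitsPerValue > 64) {    // value spans into the next block
    value |= blocks[block + 1] << (64 - offset);
  }
  return value & ((1L << bitsPerValue) - 1L);
}
{code}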

 Add unsigned packed int impls in oal.util
 -

 Key: LUCENE-1990
 URL: https://issues.apache.org/jira/browse/LUCENE-1990
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Priority: Minor
 Attachments: ba.zip


 There are various places in Lucene that could take advantage of an
 efficient packed unsigned int/long impl.  EG the terms dict index in
 the standard codec in LUCENE-1458 could substantially reduce its RAM
 usage.  FieldCache.StringIndex could as well.  And I think load into
 RAM codecs like the one in TestExternalCodecs could use this too.
 I'm picturing something very basic like:
 {code}
 interface PackedUnsignedLongs  {
   long get(long index);
   void set(long index, long value);
 }
 {code}
 Plus maybe an iterator for getting and maybe also for setting.  If it
 helps, most of the usages of this inside Lucene will be write once
 so eg the set could make that an assumption/requirement.
 And a factory somewhere:
 {code}
   PackedUnsignedLongs create(int count, long maxValue);
 {code}
 I think we should simply autogen the code (we can start from the
 autogen code in LUCENE-1410), or, if there is a good existing impl
 that has a compatible license that'd be great.
 I don't have time near-term to do this... so if anyone has the itch,
 please jump!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2192) Memory Leak

2010-01-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796187#action_12796187
 ] 

Michael McCandless commented on LUCENE-2192:


Is there a reader open on the index, when you run the above code (calling 
IndexWriter.deleteAll)?

 Memory Leak 
 

 Key: LUCENE-2192
 URL: https://issues.apache.org/jira/browse/LUCENE-2192
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9
Reporter: Ramazan VARLIKLI

 Hi All ,
 I have been working on a problem with Lucene and now gave up after trying 
 many different possibilites which gives me a feeling that There is a bug on 
 this .
 The scenario is we have an CMS applicaton into which we add new content every 
 week , instead of updating the index which is a bit tricky, I prefer to 
 delete all index documents and add them again which is straightforward . The 
 problem is Lucene doesn't delete the old data somehow and increase the index 
 size every time during the update . I also profile it with java tools and see 
 that even if I close the IndexWriter class and sent it to Garbage Collector 
 it holds all the docs in the memory .
 Here is the code I use 
 Directory directory = new SimpleFSDirectory(new File(path));
 writer = new IndexWriter(directory, analyzer, 
 false,IndexWriter.MaxFieldLength.LIMITED);
 writer.deleteAll();
 //after adding docs close the indexwriter 
 writer.close();
 The above code invoked every time we need to update the index . I tried many 
 different scenario here to overcome the problem which includes physically 
 removing the index directory( see how desperate I am ) , optimizing , 
 flushing, commiting indexwriter, create=true parameter and so on . 
 Here is the index file size during creation. If I shutdown the application 
 and restart it , index size starts with 2,458 which is correct size.
 Any help will be appreciated
 _17.cfs   2,458 KB
 _18.cfs   3,990 KB
 _19.cfs  5,149 KB
 here is the Lucene logs during creationg of index files 3 times in a row 
 IFD [http-8080-1]: setInfoStream 
 deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@6649
 IW 0 [http-8080-1]: setInfoStream: 
 dir=org.apache.lucene.store.simplefsdirect...@c:\Documents and 
 Settings\rvarlikli\workspace\.metadata\.plugins\org.eclipse.wst.server.core\tmp0\wtpwebapps\Clipbank3.5\lucene
  autoCommit=false 
 mergepolicy=org.apache.lucene.index.logbytesizemergepol...@3b626c 
 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@baa6ba 
 ramBufferSizeMB=16.0 maxBufferedDocs=-1 maxBuffereDeleteTerms=-1 
 maxFieldLength=1 index=
 IW 0 [http-8080-1]: now flush at close
 IW 0 [http-8080-1]:   flush: segment=_17 docStoreSegment=_17 docStoreOffset=0 
 flushDocs=true flushDeletes=true flushDocStores=true numDocs=2765 
 numBufDelTerms=0
 IW 0 [http-8080-1]:   index before flush 
 IW 0 [http-8080-1]: DW: flush postings as segment _17 numDocs=2765
 IW 0 [http-8080-1]: DW: closeDocStore: 2 files to flush to segment _17 
 numDocs=2765
 IW 0 [http-8080-1]: DW:   oldRAMSize=7485440 newFlushedSize=2472818 
 docs/MB=1,172.473 new/old=33.035%
 IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = 
 false]
 IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = 
 false]
 IFD [http-8080-1]: delete _17.fdx
 IFD [http-8080-1]: delete _17.tis
 IFD [http-8080-1]: delete _17.frq
 IFD [http-8080-1]: delete _17.nrm
 IFD [http-8080-1]: delete _17.fdt
 IFD [http-8080-1]: delete _17.fnm
 IFD [http-8080-1]: delete _17.tii
 IFD [http-8080-1]: delete _17.prx
 IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = 
 false]
 IW 0 [http-8080-1]: LMP: findMerges: 1 segments
 IW 0 [http-8080-1]: LMP:   level 6.2247195 to 6.400742: 1 segments
 IW 0 [http-8080-1]: CMS: now merge
 IW 0 [http-8080-1]: CMS:   index: _17:c2765
 IW 0 [http-8080-1]: CMS:   no more merges pending; now return
 IW 0 [http-8080-1]: CMS: now merge
 IW 0 [http-8080-1]: CMS:   index: _17:c2765
 IW 0 [http-8080-1]: CMS:   no more merges pending; now return
 IW 0 [http-8080-1]: now call final commit()
 IW 0 [http-8080-1]: startCommit(): start sizeInBytes=0
 IW 0 [http-8080-1]: startCommit index=_17:c2765 changeCount=5
 IW 0 [http-8080-1]: now sync _17.cfs
 IW 0 [http-8080-1]: done all syncs
 IW 0 [http-8080-1]: commit: pendingCommit != null
 IW 0 [http-8080-1]: commit: wrote segments file segments_1k
 IFD [http-8080-1]: now checkpoint segments_1k [1 segments ; isCommit = true]
 IFD [http-8080-1]: deleteCommits: now decRef commit segments_1j
 IFD [http-8080-1]: delete _16.cfs
 IFD [http-8080-1]: delete segments_1j
 IW 0 [http-8080-1]: commit: done
 IW 0 [http-8080-1]: at close: _17:c2765
 IFD [http-8080-1]: setInfoStream 
 deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@fb1ba7
 IW 1 

[jira] Updated: (LUCENE-1990) Add unsigned packed int impls in oal.util

2010-01-04 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated LUCENE-1990:
---

Attachment: LUCENE-1990_PerformanceMeasurements20100104.zip

 Add unsigned packed int impls in oal.util
 -

 Key: LUCENE-1990
 URL: https://issues.apache.org/jira/browse/LUCENE-1990
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1990_PerformanceMeasurements20100104.zip


 There are various places in Lucene that could take advantage of an
 efficient packed unsigned int/long impl.  EG the terms dict index in
 the standard codec in LUCENE-1458 could substantially reduce its RAM
 usage.  FieldCache.StringIndex could as well.  And I think load into
 RAM codecs like the one in TestExternalCodecs could use this too.
 I'm picturing something very basic like:
 {code}
 interface PackedUnsignedLongs  {
   long get(long index);
   void set(long index, long value);
 }
 {code}
 Plus maybe an iterator for getting and maybe also for setting.  If it
 helps, most of the usages of this inside Lucene will be write once
 so eg the set could make that an assumption/requirement.
 And a factory somewhere:
 {code}
   PackedUnsignedLongs create(int count, long maxValue);
 {code}
 I think we should simply autogen the code (we can start from the
 autogen code in LUCENE-1410), or, if there is a good existing impl
 that has a compatible license that'd be great.
 I don't have time near-term to do this... so if anyone has the itch,
 please jump!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1990) Add unsigned packed int impls in oal.util

2010-01-04 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated LUCENE-1990:
---

Attachment: (was: ba.zip)

 Add unsigned packed int impls in oal.util
 -

 Key: LUCENE-1990
 URL: https://issues.apache.org/jira/browse/LUCENE-1990
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1990_PerformanceMeasurements20100104.zip


 There are various places in Lucene that could take advantage of an
 efficient packed unsigned int/long impl.  EG the terms dict index in
 the standard codec in LUCENE-1458 could substantially reduce its RAM
 usage.  FieldCache.StringIndex could as well.  And I think load into
 RAM codecs like the one in TestExternalCodecs could use this too.
 I'm picturing something very basic like:
 {code}
 interface PackedUnsignedLongs  {
   long get(long index);
   void set(long index, long value);
 }
 {code}
 Plus maybe an iterator for getting and maybe also for setting.  If it
 helps, most of the usages of this inside Lucene will be write once
 so eg the set could make that an assumption/requirement.
 And a factory somewhere:
 {code}
   PackedUnsignedLongs create(int count, long maxValue);
 {code}
 I think we should simply autogen the code (we can start from the
 autogen code in LUCENE-1410), or, if there is a good existing impl
 that has a compatible license that'd be great.
 I don't have time near-term to do this... so if anyone has the itch,
 please jump!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1313) Near Realtime Search (using a built in RAMDirectory)

2010-01-04 Thread Jingkei Ly (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796191#action_12796191
 ] 

Jingkei Ly commented on LUCENE-1313:


I've just tried applying this patch to my checked-out version of trunk 
(revision 895585) but it appears that the PrefixSwitchDirectory class is 
missing - is there another patch that is needed to get this working?

 Near Realtime Search (using a built in RAMDirectory)
 

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Enable near realtime search in Lucene without external
 dependencies. When RAM NRT is enabled, the implementation adds a
 RAMDirectory to IndexWriter. Flushes go to the ramdir unless
 there is no available space. Merges are completed in the ram
 dir until there is no more available ram. 
 IW.optimize and IW.commit flush the ramdir to the primary
 directory, all other operations try to keep segments in ram
 until there is no more space.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2192) Memory Leak

2010-01-04 Thread Ramazan VARLIKLI (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796199#action_12796199
 ] 

Ramazan VARLIKLI commented on LUCENE-2192:
--

No. 
Would it affect anything if one was open?

 Memory Leak 
 

 Key: LUCENE-2192
 URL: https://issues.apache.org/jira/browse/LUCENE-2192
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9
Reporter: Ramazan VARLIKLI

 Hi All ,
 I have been working on a problem with Lucene and now gave up after trying 
 many different possibilites which gives me a feeling that There is a bug on 
 this .
 The scenario is we have an CMS applicaton into which we add new content every 
 week , instead of updating the index which is a bit tricky, I prefer to 
 delete all index documents and add them again which is straightforward . The 
 problem is Lucene doesn't delete the old data somehow and increase the index 
 size every time during the update . I also profile it with java tools and see 
 that even if I close the IndexWriter class and sent it to Garbage Collector 
 it holds all the docs in the memory .
 Here is the code I use 
 Directory directory = new SimpleFSDirectory(new File(path));
 writer = new IndexWriter(directory, analyzer, 
 false,IndexWriter.MaxFieldLength.LIMITED);
 writer.deleteAll();
 //after adding docs close the indexwriter 
 writer.close();
 The above code invoked every time we need to update the index . I tried many 
 different scenario here to overcome the problem which includes physically 
 removing the index directory( see how desperate I am ) , optimizing , 
 flushing, commiting indexwriter, create=true parameter and so on . 
 Here is the index file size during creation. If I shutdown the application 
 and restart it , index size starts with 2,458 which is correct size.
 Any help will be appreciated
 _17.cfs   2,458 KB
 _18.cfs   3,990 KB
 _19.cfs  5,149 KB
 here is the Lucene logs during creationg of index files 3 times in a row 
 IFD [http-8080-1]: setInfoStream 
 deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@6649
 IW 0 [http-8080-1]: setInfoStream: 
 dir=org.apache.lucene.store.simplefsdirect...@c:\Documents and 
 Settings\rvarlikli\workspace\.metadata\.plugins\org.eclipse.wst.server.core\tmp0\wtpwebapps\Clipbank3.5\lucene
  autoCommit=false 
 mergepolicy=org.apache.lucene.index.logbytesizemergepol...@3b626c 
 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@baa6ba 
 ramBufferSizeMB=16.0 maxBufferedDocs=-1 maxBuffereDeleteTerms=-1 
 maxFieldLength=1 index=
 IW 0 [http-8080-1]: now flush at close
 IW 0 [http-8080-1]:   flush: segment=_17 docStoreSegment=_17 docStoreOffset=0 
 flushDocs=true flushDeletes=true flushDocStores=true numDocs=2765 
 numBufDelTerms=0
 IW 0 [http-8080-1]:   index before flush 
 IW 0 [http-8080-1]: DW: flush postings as segment _17 numDocs=2765
 IW 0 [http-8080-1]: DW: closeDocStore: 2 files to flush to segment _17 
 numDocs=2765
 IW 0 [http-8080-1]: DW:   oldRAMSize=7485440 newFlushedSize=2472818 
 docs/MB=1,172.473 new/old=33.035%
 IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = 
 false]
 IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = 
 false]
 IFD [http-8080-1]: delete _17.fdx
 IFD [http-8080-1]: delete _17.tis
 IFD [http-8080-1]: delete _17.frq
 IFD [http-8080-1]: delete _17.nrm
 IFD [http-8080-1]: delete _17.fdt
 IFD [http-8080-1]: delete _17.fnm
 IFD [http-8080-1]: delete _17.tii
 IFD [http-8080-1]: delete _17.prx
 IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = 
 false]
 IW 0 [http-8080-1]: LMP: findMerges: 1 segments
 IW 0 [http-8080-1]: LMP:   level 6.2247195 to 6.400742: 1 segments
 IW 0 [http-8080-1]: CMS: now merge
 IW 0 [http-8080-1]: CMS:   index: _17:c2765
 IW 0 [http-8080-1]: CMS:   no more merges pending; now return
 IW 0 [http-8080-1]: CMS: now merge
 IW 0 [http-8080-1]: CMS:   index: _17:c2765
 IW 0 [http-8080-1]: CMS:   no more merges pending; now return
 IW 0 [http-8080-1]: now call final commit()
 IW 0 [http-8080-1]: startCommit(): start sizeInBytes=0
 IW 0 [http-8080-1]: startCommit index=_17:c2765 changeCount=5
 IW 0 [http-8080-1]: now sync _17.cfs
 IW 0 [http-8080-1]: done all syncs
 IW 0 [http-8080-1]: commit: pendingCommit != null
 IW 0 [http-8080-1]: commit: wrote segments file segments_1k
 IFD [http-8080-1]: now checkpoint segments_1k [1 segments ; isCommit = true]
 IFD [http-8080-1]: deleteCommits: now decRef commit segments_1j
 IFD [http-8080-1]: delete _16.cfs
 IFD [http-8080-1]: delete segments_1j
 IW 0 [http-8080-1]: commit: done
 IW 0 [http-8080-1]: at close: _17:c2765
 IFD [http-8080-1]: setInfoStream 
 deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@fb1ba7
 IW 1 [http-8080-1]: setInfoStream: 
 

[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-01-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796200#action_12796200
 ] 

Michael McCandless commented on LUCENE-2186:


bq. Is this patch for flex, as it contains CodecUtils and so on?

Actually it's intended for trunk; I was thinking this should land
before flex (it's a much smaller change, and it's isolated from
flex), and so I wrote the CodecUtil/BytesRef basic infrastructure,
thinking flex would then cutover to them.

{quote}
Hmm, so random-access would obviously be the preferred approach for SSDs, but
with conventional disks I think the performance would be poor? In 1231
I implemented the var-sized CSF with a skip list, similar to a posting
list. I think we should add that here too and we can still keep the
additional index that stores the pointers? We could have two readers:
one that allows random-access and loads the pointers into RAM (or uses
MMAP as you mentioned), and a second one that doesn't load anything
into RAM, uses the skip lists and only allows iterator-based access?
{quote}

The intention here is for this (index values) to replace field
cache, but not aim (initially at least) to do much more.  Ie, it's
meant to be a RAM resident (either via explicit slurping-into-RAM or
via MMAP).  So the SSD or spinning magnets should not be hit on
retrieval.

If we add an iterator API, I think it should be simpler than the
postings API (ie, no seeking, dense (every doc is visited,
sequentially) iteration).

{quote}
It looks like ByteRef is very similar to Payload? Could you use that instead 
and extend it with the new String constructor and compare methods?
{quote}

Good point!  I agree.  Also, we should use BytesRef when reading the
payload from TermsEnum.  Actually I think Payload, BytesRef, TermRef
(in flex) should all eventually be merged; of the three names, I think
I like BytesRef the best.  With *Enum in flex we can switch to
BytesRef.  For analysis we should switch PayloadAttribute to BytesRef,
and deprecate the methods using Payload?  Hmmm... but PayloadAttribute
is an interface.

{quote}
So it looks like with your approach you want to support certain
primitive types out of the box, such as byte[], float, int, String?
{quote}

Actually, all primitive types (ie, byte/short/int/long are
included under int, as well as arbitrary bit precision between
those primitive types).  Because the API uses a method invocation (eg
IntSource.get) instead of direct array access, we can hide how many
bits are actually used, under the impl.  Same is true for float/double
(except we can't [easily] do arbitrary bit precision here... just 4 or
8 bytes).
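
(A sketch of that indirection, with hypothetical names: the accessor method 
hides the physical width of the stored values from callers:)

{code}
// Hypothetical shape of the API: callers always see longs, while the impl
// may store 8, 13, 31, ... bits per value without exposing an array.
abstract class IntSource {
  abstract long get(int docID);
}

final class ByteBackedIntSource extends IntSource {
  private final byte[] values;           // only 8 bits actually stored

  ByteBackedIntSource(byte[] values) { this.values = values; }

  @Override
  long get(int docID) { return values[docID] & 0xFFL; }  // widen on read
}
{code}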

{quote}
If someone has custom data types, then they have, similar as with
payloads today, the byte[] indirection?
{quote}

Right, byte[] is for String, but also for arbitrary (opaque to Lucene)
extensibility.  The six anonymous (separate package private classes)
concrete impls should give good efficiency to fit the different use
cases.

{quote}
The code I initially wrote for 1231 exposed IndexOutput, so that one
can call write*() directly, without having to convert to byte[]
first. I think we will also want to do that for 2125 (store attributes
in the index). So I'm wondering if this and 2125 should work
similarly?
{quote}

This is compelling (letting Attrs read/write directly), but, I have
some questions:

  * How would the random-access API work?  (Attrs are designed for
iteration).  Eg, just providing IndexInput/Output to the Attr
isn't quite enough -- the encoding is sometimes context dependent
(like frq writes the delta between docIDs, the symbol table needed
when reading/writing deref/sorted).  How would I build a random
access API on top of that?  captureState-per-doc is too costly.
What API would be used to write the shared state, ie, to tell the
Attr we now are writing the segment, so you need to dump the
symbol table.

  * How would the packed ints work?  EG say my ints only need 5 bits.
(Attrs are sort of designed for one-value-at-once).

  * How would the symbol table based encodings (deref, sorted) work?
I guess the attr would need to have some state associated with
it, and when I first create the attr I need to pass it segment
name, Directory, etc, so it opens the right files?

  * I'm thinking we should still directly support native types, ie,
Attrs are there for extensibility beyond native types?

  * Exposing single attr across a multi reader sounds tricky --
LUCENE-2154 (and, we need this for flex, which is worrying me!).
But it sounds like you and Uwe are making some progress on that
(using some under-the-hood Java reflection magic)... and this
doesn't directly affect this issue, assuming we don't expose this
API at the MultiReader level.

{quote}
Thinking out loud: Could we have then attributes with
serialize/deserialize methods for 

[jira] Commented: (LUCENE-2192) Memory Leak

2010-01-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796236#action_12796236
 ] 

Michael McCandless commented on LUCENE-2192:


An open reader would prevent deletion of the index files... but from your log 
above, it looks like that's not happening.

It's curious because from the log I can see that _17.cfs and _18.cfs are being 
deleted.

Can you run the oal.index.CheckIndex tool on your 3-segment index and post the 
output?
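
(For reference, a typical invocation of that tool; the jar name and index path 
below are placeholders:)

{code}
java -cp lucene-core-2.9.1.jar org.apache.lucene.index.CheckIndex /path/to/index
{code}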

 Memory Leak 
 

 Key: LUCENE-2192
 URL: https://issues.apache.org/jira/browse/LUCENE-2192
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9
Reporter: Ramazan VARLIKLI

 Hi All,
 I have been working on a problem with Lucene and have now given up after 
 trying many different possibilities, which gives me the feeling that there 
 is a bug here.
 The scenario: we have a CMS application to which we add new content every 
 week. Instead of updating the index, which is a bit tricky, I prefer to 
 delete all the index documents and add them again, which is straightforward. 
 The problem is that Lucene somehow doesn't delete the old data, and the 
 index grows every time during the update. I also profiled it with Java 
 tools and saw that even after I close the IndexWriter and it becomes 
 eligible for garbage collection, it holds all the docs in memory.
 Here is the code I use (the analyzer and the document-adding loop are 
 elided):
 Directory directory = new SimpleFSDirectory(new File(path));
 writer = new IndexWriter(directory, analyzer,
     false, IndexWriter.MaxFieldLength.LIMITED);
 writer.deleteAll();
 // after adding docs, close the IndexWriter
 writer.close();
 The above code is invoked every time we need to update the index. I tried 
 many different scenarios to overcome the problem, including physically 
 removing the index directory (see how desperate I am), optimizing, 
 flushing, committing the IndexWriter, the create=true parameter, and so on.
 Here are the index file sizes during creation. If I shut down the 
 application and restart it, the index size starts at 2,458 KB, which is the 
 correct size.
 Any help will be appreciated.
 _17.cfs   2,458 KB
 _18.cfs   3,990 KB
 _19.cfs  5,149 KB
 here are the Lucene logs during creation of the index files 3 times in a row:
 IFD [http-8080-1]: setInfoStream 
 deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@6649
 IW 0 [http-8080-1]: setInfoStream: 
 dir=org.apache.lucene.store.simplefsdirect...@c:\Documents and 
 Settings\rvarlikli\workspace\.metadata\.plugins\org.eclipse.wst.server.core\tmp0\wtpwebapps\Clipbank3.5\lucene
  autoCommit=false 
 mergepolicy=org.apache.lucene.index.logbytesizemergepol...@3b626c 
 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@baa6ba 
 ramBufferSizeMB=16.0 maxBufferedDocs=-1 maxBuffereDeleteTerms=-1 
 maxFieldLength=1 index=
 IW 0 [http-8080-1]: now flush at close
 IW 0 [http-8080-1]:   flush: segment=_17 docStoreSegment=_17 docStoreOffset=0 
 flushDocs=true flushDeletes=true flushDocStores=true numDocs=2765 
 numBufDelTerms=0
 IW 0 [http-8080-1]:   index before flush 
 IW 0 [http-8080-1]: DW: flush postings as segment _17 numDocs=2765
 IW 0 [http-8080-1]: DW: closeDocStore: 2 files to flush to segment _17 
 numDocs=2765
 IW 0 [http-8080-1]: DW:   oldRAMSize=7485440 newFlushedSize=2472818 
 docs/MB=1,172.473 new/old=33.035%
 IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = 
 false]
 IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = 
 false]
 IFD [http-8080-1]: delete _17.fdx
 IFD [http-8080-1]: delete _17.tis
 IFD [http-8080-1]: delete _17.frq
 IFD [http-8080-1]: delete _17.nrm
 IFD [http-8080-1]: delete _17.fdt
 IFD [http-8080-1]: delete _17.fnm
 IFD [http-8080-1]: delete _17.tii
 IFD [http-8080-1]: delete _17.prx
 IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = 
 false]
 IW 0 [http-8080-1]: LMP: findMerges: 1 segments
 IW 0 [http-8080-1]: LMP:   level 6.2247195 to 6.400742: 1 segments
 IW 0 [http-8080-1]: CMS: now merge
 IW 0 [http-8080-1]: CMS:   index: _17:c2765
 IW 0 [http-8080-1]: CMS:   no more merges pending; now return
 IW 0 [http-8080-1]: CMS: now merge
 IW 0 [http-8080-1]: CMS:   index: _17:c2765
 IW 0 [http-8080-1]: CMS:   no more merges pending; now return
 IW 0 [http-8080-1]: now call final commit()
 IW 0 [http-8080-1]: startCommit(): start sizeInBytes=0
 IW 0 [http-8080-1]: startCommit index=_17:c2765 changeCount=5
 IW 0 [http-8080-1]: now sync _17.cfs
 IW 0 [http-8080-1]: done all syncs
 IW 0 [http-8080-1]: commit: pendingCommit != null
 IW 0 [http-8080-1]: commit: wrote segments file segments_1k
 IFD [http-8080-1]: now checkpoint segments_1k [1 segments ; isCommit = true]
 IFD [http-8080-1]: deleteCommits: now decRef commit segments_1j
 IFD [http-8080-1]: delete _16.cfs
 IFD [http-8080-1]: delete segments_1j
 IW 

[jira] Commented: (LUCENE-2192) Memory Leak

2010-01-04 Thread Ramazan VARLIKLI (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796251#action_12796251
 ] 

Ramazan VARLIKLI commented on LUCENE-2192:
--

Now I am testing it with v3; the result is the same.

The first time I create the index files, the output is as follows:

Segments file=segments_1r numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
  1 of 1: name=_1e docCount=2765
compound=true
hasProx=true
numFiles=1
size (MB)=2.348
diagnostics = {os.version=5.1, os=Windows XP, lucene.version=3.0.0 883080 - 
2009-11-22 15:43:58, source=flush, os.arch=x86, java.version=1.6.0_12, 
java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.OK
test: fields..OK [8 fields]
test: field norms.OK [8 fields]
test: terms, freq, prox...OK [55843 terms; 505243 terms/docs pairs; 856135 
tokens]
test: stored fields...OK [2765 total field count; avg 1 fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq vector 
fields per doc]

No problems were detected with this index.


The second time:



Opening index @ C:\Documents and 
Settings\rvarlikli\workspace\.metadata\.plugins\org.eclipse.wst.server.core\tmp0\wtpwebapps\Clipbank3.5\lucene

Segments file=segments_1s numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
  1 of 1: name=_1f docCount=2765
compound=true
hasProx=true
numFiles=1
size (MB)=3.821
diagnostics = {os.version=5.1, os=Windows XP, lucene.version=3.0.0 883080 - 
2009-11-22 15:43:58, source=flush, os.arch=x86, java.version=1.6.0_12, 
java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.OK
test: fields..OK [8 fields]
test: field norms.OK [8 fields]
test: terms, freq, prox...OK [55843 terms; 505243 terms/docs pairs; 1712270 
tokens]
test: stored fields...OK [2765 total field count; avg 1 fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq vector 
fields per doc]

No problems were detected with this index.




[jira] Commented: (LUCENE-2192) Memory Leak

2010-01-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796268#action_12796268
 ] 

Michael McCandless commented on LUCENE-2192:


So there seem to be two problems:

  * The old _X.cfs files are not getting removed

  * Each _X.cfs file is growing in size, even though you sent it exactly 
the same docs

Is that right?


[jira] Commented: (LUCENE-2192) Memory Leak

2010-01-04 Thread Ramazan VARLIKLI (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796286#action_12796286
 ] 

Ramazan VARLIKLI commented on LUCENE-2192:
--

No, the old _X.cfs files are removed correctly, but the _X.cfs file is 
growing in size.

Actually, I tried to remove all the _X.cfs files with java.io commands, but 
it didn't work; Lucene keeps everything in memory and adds the new documents 
to it. Just to be clear, this problem only happens within one JVM instance; 
if you shut it down, it starts from scratch.


[jira] Commented: (LUCENE-2192) Memory Leak

2010-01-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796312#action_12796312
 ] 

Michael McCandless commented on LUCENE-2192:


OK, so it's only the 2nd problem.

From your CheckIndex output, the 2nd segment has precisely 2X the number of 
tokens of the first segment (and the same number of documents and the same 
number of unique terms).

Can you double-check how you create the Document that you pass to Lucene?  Is 
it possible the field in the Document is just getting added twice?  Can you 
post the code that constructs the document?
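
To illustrate the kind of bug I mean -- a hypothetical sketch, not the
reporter's code; the field name, text variable, and writer are made up:

{code}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Adding the same field twice doubles the token count for that field
// while leaving docCount and the number of unique terms unchanged --
// which would match the CheckIndex output above.
Document doc = new Document();
doc.add(new Field("content", text, Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("content", text, Field.Store.YES, Field.Index.ANALYZED)); // accidental second add
writer.addDocument(doc);
{code}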


[jira] Resolved: (LUCENE-2079) Further improvements to contrib/benchmark for testing NRT

2010-01-04 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2079.


Resolution: Fixed

Fixed, again.  Hopefully the nightly build no longer hangs on this test!

 Further improvements to contrib/benchmark for testing NRT
 -

 Key: LUCENE-2079
 URL: https://issues.apache.org/jira/browse/LUCENE-2079
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2079.patch


 Some small changes:
   * Allow specifying a priority for BG threads, after the & character; 
 the priority increment is a + or - int that's added to the main thread's 
 priority to set the child thread's.  For my NRT tests I make the reopen 
 thread +2, the indexing threads +1, and leave the searching threads at 
 their default.
   * Added a test case
   * NearRealTimeReopenTask now reports at the end the full array of 
 per-reopen latencies in msec
   * Added an optional breakout of counts by time steps.  If you set 
 log.time.step.msec to e.g. 1000, then the reported counts for a serial 
 task sequence are broken out into 1-second windows.  E.g. you can use 
 this to measure slowdown over time.  (A sketch follows below.)
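
A minimal .alg fragment using the new property -- illustrative only; the
property name is from this issue, the surrounding syntax is the usual
contrib/benchmark name=value style:

{code}
# break out reported counts into 1-second windows
log.time.step.msec=1000
{code}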

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2127) Improved large result handling

2010-01-04 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned LUCENE-2127:
---

Assignee: Grant Ingersoll

 Improved large result handling
 --

 Key: LUCENE-2127
 URL: https://issues.apache.org/jira/browse/LUCENE-2127
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor

 Per 
 http://search.lucidimagination.com/search/document/350c54fc90d257ed/lots_of_results#fbb84bd297d15dd5,
  it would be nice to offer some other Collectors that are better at 
 handling really large numbers of results.  This could be implemented in a 
 variety of ways via Collectors.  For instance, we could have a raw 
 collector that does no sorting and just returns the ScoreDocs (see the 
 sketch below), or we could do as Mike suggests and have Collectors with 
 heuristics about memory tradeoffs that only heapify when appropriate.
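
A minimal sketch of the raw-collector idea, against the 2.9/3.0 Collector
API -- the class name is hypothetical, and unbounded accumulation is the
whole point here, so memory use grows with the hit count:

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Scorer;

// Collects every hit with no sorting and no heap; just accumulates
// ScoreDocs as they arrive, per segment.
public class RawCollector extends Collector {
  private final List<ScoreDoc> hits = new ArrayList<ScoreDoc>();
  private Scorer scorer;
  private int docBase;

  @Override
  public void setScorer(Scorer scorer) {
    this.scorer = scorer;
  }

  @Override
  public void setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase; // rebase per-segment docIDs to global IDs
  }

  @Override
  public void collect(int doc) throws IOException {
    hits.add(new ScoreDoc(docBase + doc, scorer.score()));
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true; // no heap to maintain, so order doesn't matter
  }

  public List<ScoreDoc> getHits() {
    return hits;
  }
}
{code}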

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)

2010-01-04 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2190:
---

Attachment: LUCENE-2190.patch

Patch attached, adding setNextReader to CustomScoreQuery, and a test case.  
Also fixed a couple of latent test bugs when run on indexes with more than 
one segment.
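
A sketch of how a subclass might use the new hook -- the setNextReader
signature is assumed to mirror Collector.setNextReader, and the class name
and field-cache usage are illustrative only, not taken from the patch:

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.function.CustomScoreQuery;

// Hypothetical subclass: reload per-segment state whenever search moves
// to a new segment, so the per-segment docIDs passed to customScore
// resolve against the right arrays.
public class WeightedScoreQuery extends CustomScoreQuery {
  private float[] weights; // per-segment cached field values

  public WeightedScoreQuery(Query subQuery) {
    super(subQuery);
  }

  // Assumed hook added by the patch, mirroring Collector.setNextReader:
  protected void setNextReader(IndexReader reader, int docBase) throws IOException {
    weights = FieldCache.DEFAULT.getFloats(reader, "weight");
  }

  @Override
  public float customScore(int doc, float subQueryScore, float valSrcScore) {
    return subQueryScore * weights[doc]; // doc is a per-segment docID
  }
}
{code}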

 CustomScoreQuery (function query) is broken (due to per-segment searching)
 --

 Key: LUCENE-2190
 URL: https://issues.apache.org/jira/browse/LUCENE-2190
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9, 2.9.1, 3.0, 3.0.1, 3.1
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9.2, 3.0.1, 3.1

 Attachments: LUCENE-2190.patch


 Spinoff from here:
   http://lucene.markmail.org/message/psw2m3adzibaixbq
 With the cutover to per-segment searching, CustomScoreQuery is not really 
 usable anymore, because the per-doc custom scoring method (customScore) 
 receives a per-segment docID, yet there is no way to figure out which segment 
 you are currently searching.
 I think to fix this we must also notify the subclass whenever a new segment 
 is switched to.  I think if we copy Collector.setNextReader, that would be 
 sufficient.  By default it would do nothing in CustomScoreQuery, but a 
 subclass could override it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)

2010-01-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796343#action_12796343
 ] 

Michael McCandless commented on LUCENE-2190:


Patch applies to 2.9.x.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2187) improve lucene's similarity algorithm defaults

2010-01-04 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2187:


Attachment: scoring.pdf

Attaching an updated document with results for a 4th test collection, on 
English; for this one BM25 did not fare so well.

For the lazy, here are the MAP values:

StandardAnalyzer
Default Scoring: 0.3837
BM25 Scoring:  0.3580
Improved Scoring: 0.3994

StandardAnalyzer + Porter
Default Scoring: 0.4333
BM25 Scoring: 0.4131
Improved Scoring: 0.4515

StandardAnalyzer + Porter + MoreLikeThis (top 5 docs)
Default Scoring: 0.5234
BM25 Scoring: 0.5087
Improved Scoring: 0.5474

Note that 0.5572 was the highest performing MAP on this corpus (Microsoft 
Research) in FIRE 2008: 
http://www.isical.ac.in/~fire/paper/Udupa-mls-fire2008.pdf



 improve lucene's similarity algorithm defaults
 --

 Key: LUCENE-2187
 URL: https://issues.apache.org/jira/browse/LUCENE-2187
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Robert Muir
 Fix For: Flex Branch

 Attachments: scoring.pdf, scoring.pdf


 First things first: I am not an IR guy. The goal of this issue is to make 
 'surgical' tweaks to lucene's formula to bring its performance up to that of 
 more modern algorithms such as BM25.
 In my opinion, the concept of having some 'flexible' scoring with good speed 
 across the board is an interesting goal, but not practical in the short term.
 Instead I propose incorporating some work similar to lnu.ltc and friends, 
 but slightly different. I noticed this seems to be in line with the paper 
 published earlier about the TREC Million Queries track... 
 Here is what I propose in pseudocode (overriding DefaultSimilarity):
 {code}
   @Override
   public float tf(float freq) {
     return 1 + (float) Math.log(freq);
   }

   @Override
   public float lengthNorm(String fieldName, int numTerms) {
     return (float) (1 / ((1 - slope) * pivot + slope * numTerms));
   }
 {code}
 Here slope is a constant (I used 0.25 for all relevance evaluations: the 
 goal is to have a better default), and pivot is the average field length. 
 Obviously we shouldn't make the user provide this, but instead have the 
 system provide it.
 These two pieces do not improve lucene much independently, but together they 
 are competitive with BM25 scoring on the test collections I have run so far. 
 The idea here is that this logarithmic tf normalization is independent of 
 the tf / mean-TF that you see in some of these algorithms; in fact I 
 implemented lnu.ltc with cosine pivoted length normalization and 
 log(tf)/log(mean TF) stuff, and it did not fare as well as this method. 
 This is also simpler: we do not need to calculate the mean TF at all.
 The BM25-like binary pivot here works better on the test collections I have 
 run, but of course only with the tf modification.
 I am uploading a document with results from 3 test collections (Persian, 
 Hindi, and Indonesian). I will test at least 3 more languages... yes, 
 including English... across more collections and upload those results too, 
 but I need to process these corpora to run the tests with the benchmark 
 package, so this will take some time (maybe weeks).
 So, please rip it apart with scoring theory etc., but keep in mind 2 of 
 these 3 test collections are in the openrelevance svn, so if you think you 
 have a great idea, don't hesitate to test it and upload results; this is 
 what it is for. 
 Also keep in mind again that I am not a scoring or IR guy; the only thing I 
 can really bring to the table here is the willingness to do a lot of 
 relevance testing!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2187) improve lucene's similarity algorithm defaults

2010-01-04 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2187:


Attachment: scoring.pdf

Sorry -- corrected some transposed axis labels and some grammatical mistakes 
:)

 improve lucene's similarity algorithm defaults
 --

 Key: LUCENE-2187
 URL: https://issues.apache.org/jira/browse/LUCENE-2187
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Robert Muir
 Fix For: Flex Branch

 Attachments: scoring.pdf, scoring.pdf, scoring.pdf



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2187) improve lucene's similarity algorithm defaults

2010-01-04 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2187:


Attachment: LUCENE-2187.patch

Attached is a patch with the Similarity impl. Of course you have to manually 
supply the pivot value (avg doc length), for now.
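
For reference, a minimal self-contained sketch of such a Similarity,
following the pseudocode in the description -- the class name, constructor,
and the way pivot is supplied are assumptions, not necessarily what the
attached patch does:

{code}
import org.apache.lucene.search.DefaultSimilarity;

// Pivoted length normalization with logarithmic tf, per the proposal
// above.  pivot (the average field length) must be supplied manually
// for now, as noted in the comment.
public class PivotedSimilarity extends DefaultSimilarity {
  private final float slope;
  private final float pivot;

  public PivotedSimilarity(float slope, float pivot) {
    this.slope = slope;
    this.pivot = pivot;
  }

  @Override
  public float tf(float freq) {
    return 1 + (float) Math.log(freq);
  }

  @Override
  public float lengthNorm(String fieldName, int numTerms) {
    return (float) (1 / ((1 - slope) * pivot + slope * numTerms));
  }
}
{code}

Typical usage would be something like
searcher.setSimilarity(new PivotedSimilarity(0.25f, avgFieldLength)), with
the same Similarity set on the IndexWriter so norms are written consistently.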


 improve lucene's similarity algorithm defaults
 --

 Key: LUCENE-2187
 URL: https://issues.apache.org/jira/browse/LUCENE-2187
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Robert Muir
 Fix For: Flex Branch

 Attachments: LUCENE-2187.patch, scoring.pdf, scoring.pdf, scoring.pdf



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



back_compat folders in tags when I SVN update

2010-01-04 Thread George Aroush
Hi folks,

Why do I see \java\tags\lucene_*_back_compat_tests_2009*\ directories (well
over 100 so far) when I SVN update?

Thanks.

-- George


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Hudson build is back to normal: Lucene-trunk #1053

2010-01-04 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1053/changes



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org