Re: lucene and solr trunk

2010-03-18 Thread Ian Holsman
what other libraries do is have a 'core' or a 'common' bit.. which is 
what the lucene library really is.


looking at http://svn.apache.org/repos/asf/lucene/ today I see it nearly 
has that already, but it's called 'java'.
maybe just renaming 'java' to 'core' or 'common' (hadoop uses common) 
might make sense

and let ivy or maven be responsible for pulling the other parts.

as a weekend developer, I would just pull the bit I care about, and let 
ivy or maven get the other bits for me.


btw.. having a master 'pom.xml' in 
http://svn.apache.org/repos/asf/lucene/ could just include the module 
poms and build them

without having to have nightly jars etc.
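
something along these lines (module names below are just placeholders for 
whatever the layout ends up being, not a proposal for the real directories):

  <project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-parent</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>pom</packaging>
    <!-- each entry is a sub-directory with its own pom.xml -->
    <modules>
      <module>core</module>
      <module>contrib</module>
      <module>solr</module>
    </modules>
  </project>

maven would then build the listed modules in dependency order from the 
top-level checkout.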

as for the goal of doing single commits, I've noticed that most of the 
discussion has been in the format of


/lucene/XYZ/trunk/...
and /lucene/ABC/trunk

if this is one code base, would it make sense to have it:
/lucene/trunk/ABC
/lucene/trunk/XYZ

?
On 3/18/10 11:33 AM, Chris Hostetter wrote:

: build and nicely gets all dependencies to Lucene and Tika whenever I build
: or release, no problem there and certainly no need to have it merged into
: Lucene's svn!

The key distinction is that Solr is already in Lucene's svn -- The
question is how to reorg things in a way that makes it easier to build Solr
and Lucene-Java all at once, while still making it easy to build just
Lucene-Java.

: Professionally i work on a (world-class) geocoder that also nicely depends
: on Lucene by using maven, no problems there at all and no need to merge
: that code in Lucene's svn!

Unless maven has some features i'm not aware of, your "nicely depends"
works by pulling Lucene jars from a repository -- changing Solr to do
that (instead of having committed jars) would be fairly simple (with or
w/o maven), but that's not the goal.  The goal is to make it easy to build
both at once, have patches that update both, and (make it easy to) have
atomic svn commits that touch both.


-Hoss


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


   




[jira] Commented: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals

2010-03-18 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846778#action_12846778
 ] 

Uwe Schindler commented on LUCENE-2326:
---

What did you do for this to happen?

You can only reproduce this (and this was also possible with your previous 
setup) if you go into the data folder and update there. If you update from 
the top level (outside the data folder), it always works. Maybe the problem lies in 
the fact that you had the data already checked out before our reorganisation 
(from previous test runs). Can you simply delete the data folder with an OS rm 
and update again?

Maybe it was a problem with the svn server?

 Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards 
 branch and linking snowball tests by svn:externals
 ---

 Key: LUCENE-2326
 URL: https://issues.apache.org/jira/browse/LUCENE-2326
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: Flex Branch, 3.1

 Attachments: LUCENE-2326.patch, LUCENE-2326.patch


 As we often need to update the backwards tests together with trunk and always 
 have to update the branch first, record the rev no, and update build.xml, I would 
 simply like to do an svn copy/move of the backwards branch.
 After a release, this is also simply done:
 {code}
 svn rm backwards
 svn cp releasebranch backwards
 {code}
 This way we can commit in one pass and create patches in one pass.
 The snowball tests are currently downloaded by svn.exe, too. These need a 
 fixed revision for checkout. I would like to change this to use svn:externals. 
 Will provide a patch soon.
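 For illustration only, the externals definition could be set roughly like this 
 (the snowball URL below is a placeholder, the real one is in the patch; rev 500 
 is the revision the tests currently expect):
 {code}
 svn propset svn:externals "data -r 500 http://svn.example.org/snowball/trunk/data" \
   contrib/analyzers/common/src/test/org/apache/lucene/analysis/snowball
 {code}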

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846798#action_12846798
 ] 

Michael McCandless commented on LUCENE-2323:


{quote}
Until code in contrib reaches a certain degree of maturity, I feel we should 
organize it by functionality. It's easy for users, and it invites the sort of 
refactoring and cleanup that some of this code needs.
{quote}

+1



 reorganize contrib modules
 --

 Key: LUCENE-2323
 URL: https://issues.apache.org/jira/browse/LUCENE-2323
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Robert Muir

 it would be nice to reorganize the contrib modules so that they are bundled 
 together by functionality.
 For example:
 * the wikipedia contrib is a tokenizer; I think it really belongs in 
 contrib/analyzers
 * there are two highlighters; I think they could be one highlighter package.
 * there are many queryparsers and queries in different places in contrib

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846804#action_12846804
 ] 

Michael McCandless commented on LUCENE-2312:


{quote}
Can't we simply throw away the doc writer after a
successful segment flush (the IRs would refer to it, however
once they're closed, the DW would close as well)?
{quote}

I think that should be our first approach.  It means no pooling whatsoever.  
And it means that an app that doesn't aggressively close its old NRT readers 
will consume more RAM.

Though... the NRT readers will be able to search an active DW, right?  Ie, it's 
only when that DW needs to flush that the NRT readers would be tying up the 
RAM.

So, when a flush happens, existing NRT readers will hold a reference to that 
now-flushed DW, but when they reopen they will cutover to the on-disk segment.

I think this will be an OK limitation in practice.  Once NRT readers can search 
a live (still being written) DW, flushing of a DW will be a relatively rare 
event (unlike today where we must flush every time an NRT reader is opened).

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 3.0.1
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: 3.1


 In order to offer users near-realtime search without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable. 
 Today's Lucene-based NRT systems must incur the cost of merging
 segments, which can slow indexing. 
 Michael Busch has good suggestions regarding how to handle deletes using max 
 doc ids.  
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here: 
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene and solr trunk

2010-03-18 Thread Earwin Burrfoot
 Unless maven has some features i'm not aware of, your "nicely depends"
 works by pulling Lucene jars from a repository
The 'missing feature' is called multi-module projects.

On Thu, Mar 18, 2010 at 03:33, Chris Hostetter hossman_luc...@fucit.org wrote:
 : build and nicely gets all dependencies to Lucene and Tika whenever I build
 : or release, no problem there and certainly no need to have it merged into
 : Lucene's svn!

 The key distinction is that Solr is already in Lucene's svn -- The
 question is how to reorg things in a way that makes it easier to build Solr
 and Lucene-Java all at once, while still making it easy to build just
 Lucene-Java.

 : Professionally i work on a (world-class) geocoder that also nicely depends
 : on Lucene by using maven, no problems there at all and no need to merge
 : that code in Lucene's svn!

 Unless maven has some features i'm not aware of, your "nicely depends"
 works by pulling Lucene jars from a repository -- changing Solr to do
 that (instead of having committed jars) would be fairly simple (with or
 w/o maven), but that's not the goal.  The goal is to make it easy to build
 both at once, have patches that update both, and (make it easy to) have
 atomic svn commits that touch both.


 -Hoss


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846807#action_12846807
 ] 

Michael McCandless commented on LUCENE-2329:


This would be great!

But, note that term vectors today do not store the term char[] again -- they 
piggyback on the term char[] already stored for the postings.  Though, I 
believe they store int textStart (increments by term length per unique term), 
which is less compact than the termID would be (increments +1 per unique term), 
so if eg we someday use packed ints we'd be more RAM efficient by storing 
termIDs...

 Use parallel arrays instead of PostingList objects
 --

 Key: LUCENE-2329
 URL: https://issues.apache.org/jira/browse/LUCENE-2329
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1


 This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
 In order to avoid having very many long-living PostingList objects in 
 TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
 simply be an int[] which maps each term to a dense termID.
 All data that the PostingList classes currently hold will then be placed in 
 parallel arrays, where the termID is the index into the arrays.  This will 
 avoid the need for object pooling and will remove the overhead of object 
 initialization and garbage collection.  Especially garbage collection should 
 benefit significantly when the JVM runs out of memory, because in such a 
 situation the gc mark times can get very long if there is a large number of 
 long-lived objects in memory.
 Another benefit could be to build more efficient TermVectors.  We could avoid 
 the need to store the term string per document in the TermVector.  
 Instead we could just store the segment-wide termIDs.  This would reduce the 
 size and also make it easier to implement efficient algorithms that use 
 TermVectors, because no term mapping across documents in a segment would be 
 necessary.  We can make that improvement in a separate jira issue.
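 As a rough sketch of the layout only (names below are illustrative, not the 
 actual TermsHashPerField fields):
 {code}
 // Each unique term gets a dense termID; all per-term state lives in
 // parallel arrays indexed by that termID instead of in a PostingList object.
 class ParallelPostingsArray {
   final int[] textStarts;  // where the term's chars start in the shared char pool
   final int[] freqs;       // per-term frequency data
   final int[] lastDocIDs;  // last doc that contained the term

   ParallelPostingsArray(int size) {
     textStarts = new int[size];
     freqs = new int[size];
     lastDocIDs = new int[size];
   }
 }
 {code}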

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2328:
---

Fix Version/s: 3.1

Anyone wanna cons up a patch here...?

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application
 indexes and deletes a few files. This is repeated 60k times. Optimization
 is run every 2k indexed files. Index size is 50KB. I analyzed
 the heap dump file and realized that the IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have already been synced.
 There are two calls to addAll and one call to add on synced, but no remove or
 clear, throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer, synced contains 32618 entries which
 look like file names such as _e065_1.del or _e067.cfs.
 The index directory contains only 10 files.
 I guess synced is holding obsolete data.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: IndexWriter.synced field accumulates data

2010-03-18 Thread Michael McCandless
Thanks!

Mike

On Wed, Mar 17, 2010 at 3:16 PM, Gregor Kaczor gkac...@gmx.de wrote:
 followup in

 https://issues.apache.org/jira/browse/LUCENE-2328


  Original-Nachricht 
 Datum: Wed, 17 Mar 2010 14:30:25 -0500
 Von: Michael McCandless luc...@mikemccandless.com
 An: java-dev@lucene.apache.org
 Betreff: Re: IndexWriter.synced field accumulates data

 You're right!

 Really we should delete from sync'd when we delete the files.  We need
 to tie into IndexFileDeleter for that, maybe moving this set into
 there.

 Though in practice the amount of actual RAM used should rarely be an
 issue?  But we should fix it...

 Can you open an issue?

 Mike

 On Wed, Mar 17, 2010 at 1:15 PM, Gregor Kaczor gkac...@gmx.de wrote:
  I am running into a strange OutOfMemoryError. My small test application
 does index and delete some few files. This is repeated for 60k times.
  Optimization is run from every 2k times a file is indexed. Index size is 
 50KB.
 I did analyze the HeapDumpFile and realized that IndexWriter.synced field
 occupied more than half of the heap. That field is a private HashSet
 without a getter. Its task is to hold files which have been synced already.
 
  There are two calls to addAll and one call to add on synced but no
 remove or clear throughout the lifecycle of the IndexWriter instance.
 
  According to the Eclipse Memory Analyzer synced contains 32618 entries
 which look like file names _e065_1.del or _e067.cfs
 
  The index directory contains 10 files only.
 
  I guess synced is holding obsolete data
 
  -
  To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-dev-h...@lucene.apache.org
 
 

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-18 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2320:
---

Attachment: LUCENE-2320.patch

Fixed a copy-paste comment error in IndexWriter (introduced in LUCENE-2294).

 Add MergePolicy to IndexWriterConfig
 

 Key: LUCENE-2320
 URL: https://issues.apache.org/jira/browse/LUCENE-2320
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch, 
 LUCENE-2320.patch, LUCENE-2320.patch


 Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as 
 well. The change is not straightforward and so I've kept it for a separate 
 issue. MergePolicy requires an IndexWriter in its ctor, however none can be 
 passed to it before an IndexWriter actually exists. And today IW may create 
 an MP just for it to be overridden by the application one line afterwards. I 
 don't want to make the iw member of MP non-final, or settable by extending 
 classes, however it needs to remain protected so they can access it directly. 
 So the proposed changes are:
 * Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set 
 once (hence its name). It'll have the signature SetOnce<T> w/ *synchronized 
 set(T)* and *T get()*. T will be declared volatile, so that get() won't be 
 synchronized.
 * MP will define a *protected final SetOnce<IndexWriter> writer* instead of 
 the current writer. *NOTE: this is a bw break*. Any suggestions are welcome.
 * MP will offer a public default ctor, together with a set(IndexWriter).
 * IndexWriter will set itself on MP using set(this). Note that if set is 
 called more than once, it will throw an exception (AlreadySetException - or 
 does someone have a better suggestion, preferably an already existing Java 
 exception?).
 That's the core idea. I'd like to post a patch soon, so I'd appreciate your 
 review and proposals.
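 A minimal sketch of what such a class could look like (using 
 IllegalStateException as a stand-in for the yet-to-be-named AlreadySetException):
 {code}
 public final class SetOnce<T> {
   // volatile so get() does not need to be synchronized
   private volatile T obj;
   private boolean set;

   public synchronized void set(T obj) {
     if (set) {
       throw new IllegalStateException("object was already set");
     }
     this.obj = obj;
     set = true;
   }

   public T get() {
     return obj;
   }
 }
 {code}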

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2327) IndexOutOfBoundsException in FieldInfos.java

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846813#action_12846813
 ] 

Michael McCandless commented on LUCENE-2327:


This exception looks like index corruption... would be good to get to the root 
cause of how this happened.

Your terms dict, which records the field number and character data for each 
term, has somehow recorded a field number of 52 when in fact this segment 
appears to only have 4 fields.

Can you run CheckIndex on the index and post the result back?

Any prior exceptions when creating this index?

I don't think adding a bounds check to FieldInfos makes sense -- the best we 
could do is throw a FieldNumberOutOfBounds exception.
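
For reference, CheckIndex can be run from the command line with something like 
this (adjust the jar name/path to your setup):
{code}
java -cp lucene-core-3.0.1.jar org.apache.lucene.index.CheckIndex /path/to/index
{code}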

 IndexOutOfBoundsException in FieldInfos.java
 

 Key: LUCENE-2327
 URL: https://issues.apache.org/jira/browse/LUCENE-2327
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 3.0.1
 Environment: Fedora 12
Reporter: Shane
Priority: Minor

 When retrieving the scoreDocs from a multisearcher, the following exception 
 is thrown:
 java.lang.IndexOutOfBoundsException: Index: 52, Size: 4
 at java.util.ArrayList.rangeCheck(ArrayList.java:571)
 at java.util.ArrayList.get(ArrayList.java:349)
 at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:285)
 at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:274)
 at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:86)
 at 
 org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
 at 
 org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
 at 
 org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232)
 at 
 org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:179)
 at 
 org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:911)
 at 
 org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:644)
 The error is caused when the fieldNumber passed to FieldInfos.fieldInfo() is 
 greater than the size of the array list containing the FieldInfo values.  I am 
 not sure what the field number represents or why it would be larger than the 
 array list's size.  The quick fix would be to validate the bounds, but there 
 may be a bigger underlying problem.  The issue does appear to be directly 
 related to LUCENE-939.  I've only been able to duplicate this in my 
 production environment and so can't give a good test case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Changes in SVN (backwards-compatibility branch removed, Snowball test data)

2010-03-18 Thread Uwe Schindler
Hi all,

Yesterday morning I committed https://issues.apache.org/jira/browse/LUCENE-2326 
- if you currently have Lucene repositories checked out, I recommend doing the 
following:

(1) Check if you have changes in your backwards folder; if yes, create a patch 
(use svn diff inside the branch checkout, i.e. inside 
backwards/lucene_3_0_back_compatibility_tests).

(2a) If you have not updated svn to HEAD:
- run "ant clean-backwards"; if this fails you are already on HEAD and this 
task is gone, use (2b)
- rm -rf 
contrib/analyzers/common/src/test/org/apache/lucene/analysis/snowball/data
- svn up

(2b) If you are already on HEAD:
- rm -rf backwards/lucene*
- rm -rf 
contrib/analyzers/common/src/test/org/apache/lucene/analysis/snowball/data
- svn up

(3) If applicable, apply the patch with your changes to the folder using "patch 
-p0" inside backwards, not backwards/src.

(4) Check that everything is correct:
- backwards should only contain a readme and a src/ folder
- during svn up it should also print a message that the external snowball data 
is updated to rev 500 (currently).

In the future, there is no need to have revision numbers or separate commits for 
changing backwards tests. Just edit in your local checkout and commit in one 
go. It's also possible for changes to be included in patches, as it's now only 
one checkout.

After releasing a new Lucene version, proceed as described in the ReleaseToDo 
on Wiki, to update the backwards folder.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846821#action_12846821
 ] 

Shai Erera commented on LUCENE-2328:


Would that mean removing files from synced whenever 'deleter' (which is an 
IndexFileDeleter) calls delete*? Are there other places to look?
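
If so, a rough sketch of the change (hypothetical names only; the real hook 
point in IndexFileDeleter/IndexWriter may differ):
{code}
// hypothetical sketch -- wherever the deleter actually removes a file,
// also drop it from the writer's synced set so the set cannot outgrow
// the set of files that still exist:
void deleteFile(String fileName) throws IOException {
  directory.deleteFile(fileName);
  synced.remove(fileName);  // 'synced' shared with (or owned by) IndexWriter
}
{code}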

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-18 Thread Michael McCandless
On Mon, Mar 15, 2010 at 7:49 PM, Marvin Humphrey mar...@rectangular.com wrote:
 On Mon, Mar 15, 2010 at 05:28:33AM -0500, Michael McCandless wrote:
 I mean specifically one should not have to commit to the precise
 scoring model they will use for a given field, when they index that
 field.

 Yeah, I've never seen committing to a precise scoring model at index-time via
 Sim choice as a big deal.  In Lucy, per-field Similarity assignments are part
 of the the Schema, which has to be set at index-time.  And index-time Sim
 choice is the way things have always been done in Lucene.

OK.  It's new territory -- I haven't heard of users doing lots of
scoring experimentation with Lucene.  But, then, it's not easy to do
now, so... chicken & egg.

Also, will Lucy store the original stats?  Ie so the chosen Sim
can properly recompute all boost bytes (if it uses those), for scoring
models that pivot based on avg's of these stats?

 In any case, the proposal to start delaying Sim choice to search-time -- while
 a nice feature for Lucene -- is a non-starter for Lucy.   We can't do that
 because it would kill the cheap-Searcher model to generate boost bytes at
 Searcher construction time and cache them within the object.  We need those
 boost bytes written to disk so we can mmap them and share them amongst many
 cheap Searchers.

It'd seem like Lucy could re-gen the boost bytes if a different Sim
were selected, or the current Sim hadn't yet computed & cached its
bytes?  But then logically this means a reader needs write
permission to the index dir, which is not good...

 So... you're proposing shrinking Similarity's public API by removing
 functionality that Lucy can't live without.  If indeed that works out for
 Lucene, the role of Similarity within the two libraries will have to diverge.
 In Lucene, Similarity will get smaller; in Lucy it will expand a bit.

Yes.

 To my mind, these are all related data reduction tasks:

  * Omit doc-boost and field-boost, replacing them with a single float
docXfield multiplier -- because you never need doc-boost on its own.
  * Omit length-in-tokens, term-cardinality, doc-boost, and field-boost,
replacing them all with a single boost byte -- because for the kind of
scoring you want to do, you don't need all those raw stats.
  * Omit the boost byte, because you don't need to do scoring at all.
  * Omit positions because you don't need PhraseQueries, etc. to match.

I wouldn't group this one with the others -- I mean technically it is
data reduction -- but omitting positions means certain queries
(PhraseQuery) won't work even in match only searching.  Whereas the
rest of these examples affect how scoring is done (or whether it's
done).

  * Omit everything except doc-id, because you only need binary matching.

 What all those tasks have in common is that we can determine what stats are
 disposable based on how the user describes how they are going to use the
 field.

 For Lucy, the user is going to have to commit to a precise scoring model at
 index-time by specifying a Sim choice anyway.

Right.

 If that Sim turns out to be a MatchSimilarity, why on earth should
 we keep around the boost bytes?

Well maybe some queries do scoring on the field and some don't...

  And what class other than Similarity knows enough about the scoring algorithm
  to perform these data reduction tasks?  If it's not going to be Similarity
  itself, it has to be something that knows absolutely everything about the
  Similarity implementation's scoring model.

 I don't follow this...

 It will be Sim that computes the norm bytes.

 I meant that if you're writing out boost bytes, there's no sensible way to
 execute the lossy data reduction and reduce the index size other than having
 Sim do it.

Right, Sim is the right class to do this.  Heck, one could even use
boost nibbles... or use float.  This is an impl detail of the Sim
class.

   class MySim extends Similarity {
 public PostingCodec makePostingCodec() {
   StandardPostingCodec codec = new StandardPostingCodec();
   codec.setOmitBoostBytes(true);
   codec.setOmitPositions(true);
   return (PostingCodec)codec;
 }
   }

 This still feels like you are mixing two very different concepts --
 what's being written (boost bytes, positions, docTermFreqs) vs how it's
 encoded (codec).

 So StandardPostingCodec shouldn't have methods like setOmitBoostBytes()?
 Maybe that's right.  Guess I'll watch to see how flex pans out and what
 methods you put on those PostingCodec classes.

Yeah, I see that (setOmitBoostBytes) as part of the field's type.  It's
like precisionStep for a numeric field, or omitTF/P.  Any codec should
respect these.

 For now, I just want to make the no-boost-bytes and doc-id-only index
 optimizations available, and to achieve that, it's sufficient to implement
 format-follows-sim and publish MatchSimilarity and MinimalSimilarity.  The
 PostingCodec API can remain a private implementation detail until a later
 

[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846825#action_12846825
 ] 

Michael McCandless commented on LUCENE-2328:


Yes I think that's it.

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: How can I use QueryScorer() to find only perfect matches??

2010-03-18 Thread chris.stodola

Hi Erick,

I did as recommended and changed the query appropriately. But the result is
still the same.
On page 78 of the book "Lucene in Action" it is explained how scoring
works. Therefore I get more results than the exact match I was expecting.
But how can I highlight in a large document only the results identified by a
certain query like +contents:term +contents:query?
Are there any alternatives to the QueryScorer method? Any examples? Any
papers to read first?

thx
christian

Erick Erickson wrote:
 
 Try +contents:term +contents:query. By misplacing the
 '+' you're getting the default OR operator and the '+'
 is probably being thrown away by the analyzer.
 
 Luke will help here a lot.
 
 HTH
 Erick
 
 On Mon, Mar 15, 2010 at 9:46 AM, christian stadler
 stadler.christ...@web.de
 wrote:
 
 Hi there,

 I have an issue with the QueryScorer(query) method at the moment and I
 need
 some assistance.
 I was indexing my e-book lucene in action and based on this index-db I
 started to play around with some boolean queries like:
 (contents:+term contents:+query)
 As a result I'm expecting four hits as a perfect match for the phrase "term 
 query".

 But when I run my sample to highlight this phrase in the context then I get 
 a lot more results. It also finds all the matches for term and query 
 independently.

 I think the problem is the QueryScorer() which softens the former exact
 boolean
 query.
 Then I was trying the following:
 private static Highlighter GetHits(Query query, Formatter formatter)
 {
     string field = "contents";
     BooleanQuery termsQuery = new BooleanQuery();

     // extract the (field-specific) terms from the original query
     WeightedTerm[] terms = QueryTermExtractor.GetTerms(query, true, field);
     foreach (WeightedTerm term in terms)
     {
         TermQuery termQuery = new TermQuery(new Term(field, term.GetTerm()));
         termsQuery.Add(termQuery, BooleanClause.Occur.MUST);
     }

     // create query scorer based on term queries (field specific)
     QueryScorer scorer = new QueryScorer(termsQuery);

     Highlighter highlighter = new Highlighter(formatter, scorer);
     highlighter.SetTextFragmenter(new SimpleFragmenter(20));

     return highlighter;
 }
 to rewrite the query and set the term attribute from SHOULD to MUST

 But the result was the same.
 Do you have any example of how I can use QueryScorer() in exactly the same 
 way, so as to mimic a BooleanQuery search?

 thanks in advance
 Christian




 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org


 
 

-- 
View this message in context: 
http://old.nabble.com/How-can-I-use-QueryScorer%28%29-to-find-only-perfect-matches---tp27904831p27943914.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846835#action_12846835
 ] 

Earwin Burrfoot commented on LUCENE-2328:
-

A shot in the sky (didn't delve deep into the problem, could definitely miss 
stuff):

What about tracking 'syncedness' from within Directory?
There shouldn't be more than one writer anyway (unless your locking is broken), 
so that's a single set of 'files-to-be-synced' at any given moment in time. 
Might as well keep track of it inside the directory, and have a 
syncAllUnsyncedGuys() on it.

This will also remove the need to transfer that list around when transferring 
the write lock (IR hell).

And all round that sounds quite logical, as the need for and method of syncing depend 
solely on the directory. If you're working with RAMDirectory, you don't need to 
keep track of these files at all.
Probably the same for some of the DB impls.
Also, some filesystems sync everything when you ask to sync a single file, so 
if you're syncing a batch of them in a row, that's some overhead that you can 
theoretically work around with a special flag to FSDir.

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)

2010-03-18 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2302:
--

Attachment: LUCENE-2302.patch

Updated patch for current flex HEAD. Backwards still needs to be fixed.

How do we want to proceed?
- Name of the new Attribute?
- Is the new CharSeq/Appendable API fine?
- setEmpty()?

Thanks for reviewing!

 Replacement for TermAttribute+Impl with extended capabilities (byte[] 
 support, CharSequence, Appendable)
 

 Key: LUCENE-2302
 URL: https://issues.apache.org/jira/browse/LUCENE-2302
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: Flex Branch
Reporter: Uwe Schindler
 Fix For: Flex Branch

 Attachments: LUCENE-2302.patch, LUCENE-2302.patch, LUCENE-2302.patch, 
 LUCENE-2302.patch, LUCENE-2302.patch


 For flexible indexing, terms can be simple byte[] arrays, while the current 
 TermAttribute only supports char[]. This is fine for plain text, but e.g. 
 NumericTokenStream should work directly on the byte[] array.
 Also, TermAttribute lacks some interfaces that would make it simpler for 
 users to work with: Appendable and CharSequence.
 I propose to create a new interface CharTermAttribute with a clean new API 
 that concentrates on CharSequence and Appendable.
 The implementation class will simply support the old and new interface, 
 working on the same term buffer. DEFAULT_ATTRIBUTE_FACTORY will take care of 
 this. So if somebody adds a TermAttribute, he will get an implementation 
 class that can also be used as a CharTermAttribute. As both attributes create 
 the same impl instance, both calls to addAttribute are equal. So a TokenFilter 
 that adds CharTermAttribute to the source will work with the same instance as 
 the Tokenizer that requested the (deprecated) TermAttribute.
 To also support byte[]-only terms, as Collation or NumericField need, a 
 separate getter-only interface will be added that returns a reusable 
 BytesRef, e.g. BytesRefGetterAttribute. The default implementation class will 
 also support this interface. For backwards compatibility with old 
 self-made TermAttribute implementations, the indexer will check with 
 hasAttribute() if the BytesRef getter interface is there, and if not will 
 wrap an old-style TermAttribute (a deprecated wrapper class will be provided): 
 new BytesRefGetterAttributeWrapper(TermAttribute), which is then used by the 
 indexer.
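 As a sketch of the intent only (not the final signatures, which are still 
 under discussion), the new attribute could look something like:
 {code}
 // sketch only -- exact method set is still open
 public interface CharTermAttribute extends Attribute, CharSequence, Appendable {
   char[] buffer();               // direct access to the internal term buffer
   CharTermAttribute setEmpty();  // reset the term to length 0 for reuse
   CharTermAttribute append(CharSequence csq);
 }
 {code}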

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals

2010-03-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846852#action_12846852
 ] 

Robert Muir commented on LUCENE-2326:
-

bq. What did you do for this to happen? 

Uwe, the problem happened to Mark... and this test data has *always* been rev 
500.

svn.exe simply got the wrong revision. It's probably a bug in svn; I don't think 
you did anything wrong.

But at the same time, we don't want random test failures.

 Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards 
 branch and linking snowball tests by svn:externals
 ---

 Key: LUCENE-2326
 URL: https://issues.apache.org/jira/browse/LUCENE-2326
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: Flex Branch, 3.1

 Attachments: LUCENE-2326.patch, LUCENE-2326.patch


 As we often need to update backwards tests together with trunk and always 
 have to update the branch first, record rev no, and update build xml, I would 
 simply like to do a svn copy/move of the backwards branch.
 After a release, this is simply also done:
 {code}
 svn rm backwards
 svn cp releasebranch backwards
 {code}
 By this we can simply commit in one pass, create patches in one pass.
 The snowball tests are currently downloaded by svn.exe, too. These need a 
 fixed version for checkout. I would like to change this to use svn:externals. 
 Will provide patch, soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals

2010-03-18 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846860#action_12846860
 ] 

Uwe Schindler commented on LUCENE-2326:
---

Man, I reverted the snowball part.

Let's change to a zip file, as the tests will never change. This svn in build.xml 
is too dependent on your local installation of the svn tools. I don't like it.

 Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards 
 branch and linking snowball tests by svn:externals
 ---

 Key: LUCENE-2326
 URL: https://issues.apache.org/jira/browse/LUCENE-2326
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: Flex Branch, 3.1

 Attachments: LUCENE-2326.patch, LUCENE-2326.patch


 As we often need to update backwards tests together with trunk and always 
 have to update the branch first, record rev no, and update build xml, I would 
 simply like to do a svn copy/move of the backwards branch.
 After a release, this is simply also done:
 {code}
 svn rm backwards
 svn cp releasebranch backwards
 {code}
 By this we can simply commit in one pass, create patches in one pass.
 The snowball tests are currently downloaded by svn.exe, too. These need a 
 fixed version for checkout. I would like to change this to use svn:externals. 
 Will provide patch, soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: svn commit: r924731 - in /lucene/java/trunk/contrib/analyzers/common: build.xml src/test/org/apache/lucene/analysis/snowball/ src/test/org/apache/lucene/analysis/snowball/TestSnowballVocab.java

2010-03-18 Thread Yonik Seeley
E, let's strive for slightly better commit messages ;-)
-Yonik

On Thu, Mar 18, 2010 at 7:48 AM,  uschind...@apache.org wrote:
 Author: uschindler
 Date: Thu Mar 18 11:48:11 2010
 New Revision: 924731

 URL: http://svn.apache.org/viewvc?rev=924731view=rev
 Log:
 LUCENE-2326: As rmuir seems to bug me about that, i reverted the externals 
 def here. In future, lets use a zip file.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals

2010-03-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846863#action_12846863
 ] 

Robert Muir commented on LUCENE-2326:
-

bq. Lets change to a zip file as the tests will never change

I agree, but this zip file will be pretty large!

Thanks for temporarily changing it to do the checkout instead.

 Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards 
 branch and linking snowball tests by svn:externals
 ---

 Key: LUCENE-2326
 URL: https://issues.apache.org/jira/browse/LUCENE-2326
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: Flex Branch, 3.1

 Attachments: LUCENE-2326.patch, LUCENE-2326.patch


 As we often need to update backwards tests together with trunk and always 
 have to update the branch first, record rev no, and update build xml, I would 
 simply like to do a svn copy/move of the backwards branch.
 After a release, this is simply also done:
 {code}
 svn rm backwards
 svn cp releasebranch backwards
 {code}
 By this we can simply commit in one pass, create patches in one pass.
 The snowball tests are currently downloaded by svn.exe, too. These need a 
 fixed version for checkout. I would like to change this to use svn:externals. 
 Will provide patch, soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: svn commit: r924731 - in /lucene/java/trunk/contrib/analyzers/common: build.xml src/test/org/apache/lucene/analysis/snowball/ src/test/org/apache/lucene/analysis/snowball/TestSnowballVocab.java

2010-03-18 Thread Uwe Schindler
I am currently unhappy with lucene because of:
- LuSolr
- communication differences

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Yonik Seeley [mailto:ysee...@gmail.com]
 Sent: Thursday, March 18, 2010 12:51 PM
 To: java-dev@lucene.apache.org
 Subject: Re: svn commit: r924731 - in
 /lucene/java/trunk/contrib/analyzers/common: build.xml
 src/test/org/apache/lucene/analysis/snowball/
 src/test/org/apache/lucene/analysis/snowball/TestSnowballVocab.java
 
 E, let's strive for slightly better commit messages ;-)
 -Yonik
 
 On Thu, Mar 18, 2010 at 7:48 AM,  uschind...@apache.org wrote:
  Author: uschindler
  Date: Thu Mar 18 11:48:11 2010
  New Revision: 924731
 
  URL: http://svn.apache.org/viewvc?rev=924731view=rev
  Log:
  LUCENE-2326: As rmuir seems to bug me about that, i reverted the
 externals def here. In future, lets use a zip file.
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846872#action_12846872
 ] 

Michael McCandless commented on LUCENE-2328:


I like this idea!

But, we don't want to simply sync all new files.  When IW commits,
it's possibly a subset of all new files.  EG running merges (or any
still-open files) should not be sync'd.

Not necessarily all closed files should be sync'd either -- eg any
files that were opened & closed while we were syncing (since syncing
can take some time) should not then be sync'd.

Maybe we change Dir.sync to take a Collection<String>?

Then dir would be the one place that keeps track of what's already
been sync'd and what hasn't.

Or... I wonder if calling sync on a file that's already been sync'd is
really that wasteful... I mean it's technically a no-op, so it's just
the overhead of a no-op system call from way up in javaland.
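
Something along these lines, as an illustrative sketch only (not the real 
Directory class, just the idea):
{code}
import java.io.IOException;
import java.util.*;

// sketch: the directory remembers which files it has already fsync'd,
// and sync() takes the collection of files the writer wants durable.
public abstract class TrackingDirectory {
  private final Set<String> synced = Collections.synchronizedSet(new HashSet<String>());

  public void sync(Collection<String> names) throws IOException {
    for (String name : names) {
      if (synced.add(name)) {   // only fsync files we haven't sync'd yet
        fsync(name);
      }
    }
  }

  public void deleteFile(String name) throws IOException {
    synced.remove(name);        // forget deleted files so the set stays bounded
    // ... actually remove the file ...
  }

  protected abstract void fsync(String name) throws IOException;
}
{code}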


 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846877#action_12846877
 ] 

Michael McCandless commented on LUCENE-2302:


I like the name CharTermAttribute.

How about instead of TermToBytesRefAttribute we name it TermBytesAttribute?  
(Ie, drop the To and Ref).

 Replacement for TermAttribute+Impl with extended capabilities (byte[] 
 support, CharSequence, Appendable)
 

 Key: LUCENE-2302
 URL: https://issues.apache.org/jira/browse/LUCENE-2302
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: Flex Branch
Reporter: Uwe Schindler
 Fix For: Flex Branch

 Attachments: LUCENE-2302.patch, LUCENE-2302.patch, LUCENE-2302.patch, 
 LUCENE-2302.patch, LUCENE-2302.patch


 For flexible indexing terms can be simple byte[] arrays, while the current 
 TermAttribute only supports char[]. This is fine for plain text, but e.g 
 NumericTokenStream should directly work on the byte[] array.
 Also TermAttribute lacks of some interfaces that would make it simplier for 
 users to work with them: Appendable and CharSequence
 I propose to create a new interface CharTermAttribute with a clean new API 
 that concentrates on CharSequence and Appendable.
 The implementation class will simply support the old and new interface 
 working on the same term buffer. DEFAULT_ATTRIBUTE_FACTORY will take care of 
 this. So if somebody adds a TermAttribute, he will get an implementation 
 class that can be also used as CharTermAttribute. As both attributes create 
 the same impl instance both calls to addAttribute are equal. So a TokenFilter 
 that adds CharTermAttribute to the source will work with the same instance as 
 the Tokenizer that requested the (deprecated) TermAttribute.
 To also support byte[] only terms like Collation or NumericField needs, a 
 separate getter-only interface will be added, that returns a reusable 
 BytesRef, e.g. BytesRefGetterAttribute. The default implementation class will 
 also support this interface. For backwards compatibility with old 
 self-made-TermAttribute implementations, the indexer will check with 
 hasAttribute(), if the BytesRef getter interface is there and if not will 
 wrap a old-style TermAttribute (a deprecated wrapper class will be provided): 
 new BytesRefGetterAttributeWrapper(TermAttribute), that is used by the 
 indexer then.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846880#action_12846880
 ] 

Earwin Burrfoot commented on LUCENE-2328:
-

 EG running merges (or any still-open files) should not be sync'd.
Files that are still being written should not be synced, that's kinda obvious.

 Not necessarily all closed files should be sync'd either - eg any files that 
 were opened & closed while we were syncing (since syncing can take some time) 
 should not then be sync'd.
This one is not so obvious.
I assume that on calling syncEveryoneAndHisDog() you should sync all files that 
have been written to, and were closed, and not yet deleted.

 Maybe we change Dir.sync to take a Collection<String>?
What does that alone give us over the current situation? You can call 
Dir.sync() repeatedly, it's all the same.

 Or... I wonder if calling sync on a file that's already been sync'd is really 
 that wasteful... 
It can be on those systems that just sync everything down. I don't believe in 
people writing good software : }

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: How can I use QueryScorer() to find only perfect matches??

2010-03-18 Thread Michael McCandless
Unfortunately, highlighter (and I think also fast vector highlighter)
are able to return a set of fragments which do not match the
query (eg, they only show one of the two required terms).

I really don't like that they do this.

Ideally (to me) the entire excerpt (ie, all fragments appended
together) should match the original query.  Meaning I see at least one
occurrence of each required term (the occurrence of each could occur
in different fragments).

Progress has been made in general -- eg it used to be the case that if
you highlighted a phrase query, eg president obama, you could see
excerpts that only had one of the words.  That's been fixed by
defaulting to QueryScorer.

To really fix this for all queries is not easy...  there was a long
discussion, here:

  https://issues.apache.org/jira/browse/LUCENE-1522

I think we should improve the Scorer API so that it can optionally
provide positional details of all matches, probably by absorbing
Span*Query back into their non-span counterparts and enriching the
API.  But this is a biggish change.

Maybe as a stopgap you could pull many fragments from highlighter and
then pick a set of fragments that cover the most unique terms...?
Sort of like a coord factor, but for highlighting not BooleanQuery.  Is
it only required clauses you need to fix?
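
A rough sketch of that stopgap, assuming the 2.9/3.x contrib highlighter API
(Highlighter.getBestFragments, QueryTermExtractor); the greedy selection below
is only one way to pick a covering set:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.QueryTermExtractor;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.WeightedTerm;

public class CoverageFragments {
  /** Pull many fragments, then greedily keep only those adding unseen query terms. */
  public static List<String> pick(Query query, Analyzer analyzer,
                                  String field, String text) throws Exception {
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    highlighter.setTextFragmenter(new SimpleFragmenter(60));
    // Ask for far more fragments than we actually intend to show.
    String[] candidates = highlighter.getBestFragments(analyzer, field, text, 20);

    Set<String> uncovered = new HashSet<String>();
    for (WeightedTerm t : QueryTermExtractor.getTerms(query, false, field)) {
      uncovered.add(t.getTerm());
    }

    List<String> chosen = new ArrayList<String>();
    for (String frag : candidates) {
      String lower = frag.toLowerCase();
      boolean addsCoverage = false;
      for (Iterator<String> it = uncovered.iterator(); it.hasNext();) {
        if (lower.contains(it.next())) {
          it.remove();
          addsCoverage = true;
        }
      }
      if (addsCoverage) {
        chosen.add(frag);
      }
      if (uncovered.isEmpty()) {
        break;
      }
    }
    return chosen;
  }
}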

Mike

On Thu, Mar 18, 2010 at 5:43 AM, chris.stodola stadler.christ...@web.de wrote:

 Hi Erick,

 I did as recommended and changed the query appropriately. But the result is
 still the same.
 On page 78 of the book Lucene in Action it is explained how scoring
 works. Therefore I get more results than the exact match I was expecting.
 But how can I highlight in a large document only the results identified by a
 certain query like +contents:term +contents:query?
 Are there any alternatives to the QueryScore method? any examples? any
 papers to read first?

 thx
 christian

 Erick Erickson wrote:

 Try +contents:term +contents:query. By misplacing the
 '+' you're getting the default OR operator and the '+'
 is probably being thrown away by the analyzer.

 Luke will help here a lot.

 HTH
 Erick

 On Mon, Mar 15, 2010 at 9:46 AM, christian stadler
 stadler.christ...@web.de
 wrote:

 Hi there,

 I have an issue with the QueryScorer(query) method at the moment and I
 need
 some assistance.
 I was indexing my e-book lucene in action and based on this index-db I
 started to play around with some boolean queries like:
 (contents:+term contents:+query)
 As a result I'm expecting four hits as a perfect match for the phrase term query.

 But when I run my sample to highlight this phrase in the context then I get
 a lot more results. It also finds all the matches for term and query
 independently.

 I think the problem is the QueryScorer() which softens the former exact
 boolean
 query.
 Then I was trying the following:
 private static Highlighter GetHits(Query query, Formatter formatter)
 {
    string field = "contents";
    BooleanQuery termsQuery = new BooleanQuery();

    WeightedTerm[] terms = QueryTermExtractor.GetTerms(query, true,
 field);
    foreach (WeightedTerm term in terms)
    {
        TermQuery termQuery = new TermQuery(new Term(field,
 term.GetTerm()));
        termsQuery.Add(termQuery, BooleanClause.Occur.MUST);
    }

    // create query scorer based on term queries (field specific)
    QueryScorer scorer = new QueryScorer(termsQuery);

    Highlighter highlighter = new Highlighter(formatter, scorer);
    highlighter.SetTextFragmenter(new SimpleFragmenter(20));

    return highlighter;
 }
 to rewrite the query and set the term attribute from SHOULD to MUST

 But the result was the same.
 Do you have any example how I can use the QueryScorer() in exactly the
 same
 way
 as to mimic a BooleanSearch??

 thanks in advance
 Christian




 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





 --
 View this message in context: 
 http://old.nabble.com/How-can-I-use-QueryScorer%28%29-to-find-only-perfect-matches---tp27904831p27943914.html
 Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846890#action_12846890
 ] 

Shai Erera commented on LUCENE-2328:


ok so let me see if I understand this. Before Earwin suggested adding synced to 
Directory, the approach (as I understood it) was - whenever deleter deletes a 
file, remove it from synced as well.

After Earwin's suggestion, which I like very much, as it moves more stuff out 
of IW, which could use some simplification, I initially thought that we should 
do this: when dir.sync is called, add that file to dir.synced. Then when 
dir.delete is called, remove it from there. When dir.commit is called, add all 
changed/synced files to the set (probably all of them). Something very 
straightforward and simple.

However, the last two posts seem to try to complicate it ... and I don't 
understand why. So I'd appreciate it if you can explain what I am missing.

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2330) Allow easy extension of IndexWriter

2010-03-18 Thread Shai Erera (JIRA)
Allow easy extension of IndexWriter
---

 Key: LUCENE-2330
 URL: https://issues.apache.org/jira/browse/LUCENE-2330
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1


IndexWriter is not so easy to extend. It hides a lot of useful methods from 
extending classes as well as useful members (like infoStream). Most of this 
stuff is very straightforward and I believe it's not exposed for no particular 
reason. Over in LUCENE-1879 I plan to extend IndexWriter to provide a 
ParallelWriter which will support the parallel indexing requirements. For that 
I'll need access to several methods and members. I plan to contain in this 
issue some simple hooks, nothing fancy (and hopefully controversial). I'll 
leave the rest to specific issues. For now:
# Introduce a protected default constructor and init(Directory, 
IndexWriterConfig). That's required because ParallelWriter does not itself 
index anything, but instead delegates to its Slices. So that ctor is for 
convenience only, and I'll make it clear (through javadocs) that if one uses 
it, one needs to call init(). PQ has the same pattern.
# Expose some members and methods that are useful for extensions (such as 
config, infoStream etc.). Some candidates are package-private methods, but 
these will be reviewed and converted on a case by case basis.

I don't plan to do anything drastic here, just prepare IW for easier 
extendability.

I'll post a patch after LUCENE-2320 is committed.
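
As a rough illustration of point 1, a minimal sketch of the protected-constructor-plus-init() 
pattern could look like the code below; the class name is made up and this is not the 
actual patch:

{code}
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

// Sketch of the protected-ctor + init() extension pattern; names are
// illustrative only, this is not the proposed patch.
class ExtensibleWriter {
  private Directory dir;
  private IndexWriterConfig config;

  /** Normal users construct and initialize in one step. */
  public ExtensibleWriter(Directory dir, IndexWriterConfig config) {
    init(dir, config);
  }

  /** For subclasses (like a ParallelWriter) that delegate actual indexing. */
  protected ExtensibleWriter() {
  }

  protected final void init(Directory dir, IndexWriterConfig config) {
    this.dir = dir;
    this.config = config;
    // opening files, starting a merge scheduler, etc. would happen here
  }
}
{code}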

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException

2010-03-18 Thread Ritesh Nigam (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846892#action_12846892
 ] 

Ritesh Nigam commented on LUCENE-2280:
--

I installed a test setup with Lucene 3.0.0 and tried to reproduce the scenario 
with the NPE, but after the exception is thrown the main index file is not getting 
deleted; only optimize is failing, and I can see some small index files (.cfs) 
along with the main index file. One more thing: I am not using commit yet, but 
using close(). Does close() do the same thing as commit does?

By looking at the above behavior, is there a bug in the 2.3.2 version where this 
kind of situation is not handled properly?

Can you please have a look at the log which i got after turning on the 
infostream for IndexWriter(for lucene 2.3.2). Attached as lucene.zip.

 IndexWriter.optimize() throws NullPointerException
 --

 Key: LUCENE-2280
 URL: https://issues.apache.org/jira/browse/LUCENE-2280
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3.2
 Environment: Win 2003, lucene version 2.3.2, IBM JRE 1.6
Reporter: Ritesh Nigam
 Attachments: lucene.jar, lucene.zip


 I am using lucene 2.3.2 search APIs for my application, i am indexing 45GB 
 database which creates approax 200MB index file, after finishing the indexing 
 and while running optimize() i can see NullPointerExcception thrown in my log 
 and index file is getting corrupted, log says
 
 Caused by: 
 java.lang.NullPointerException
   at 
 org.apache.lucene.store.BufferedIndexOutput.writeBytes(BufferedIndexOutput.java:49)
   at org.apache.lucene.store.IndexOutput.writeBytes(IndexOutput.java:40)
   at 
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:566)
   at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:135)
   at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3273)
   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2968)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)
 
 and this is happening quite frequently, although I am not able to reproduce 
 it on demand, I saw an issue logged which is some what related to mine issue 
 (http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200809.mbox/%3c6e4a40db-5efc-42da-a857-d59f4ec34...@mikemccandless.com%3e)
  but the only difference here is I am not using Store.Compress for my fields, 
 i am using Store.NO instead. please note that I am using IBM JRE for my 
 application.
 Is this an issue with lucene?, if yes it is fixed in which version?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals

2010-03-18 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2326:
--

Attachment: TestVnowballVocabData.zip
LUCENE-2326-snowball-try2.patch

Here the patch without external references. The data dir was cleaned up 
(removed the large unneeded diff.txt files) and the zip compressed with -9.

 Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards 
 branch and linking snowball tests by svn:externals
 ---

 Key: LUCENE-2326
 URL: https://issues.apache.org/jira/browse/LUCENE-2326
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: Flex Branch, 3.1

 Attachments: LUCENE-2326-snowball-try2.patch, LUCENE-2326.patch, 
 LUCENE-2326.patch, TestVnowballVocabData.zip


 As we often need to update backwards tests together with trunk and always 
 have to update the branch first, record rev no, and update build xml, I would 
 simply like to do a svn copy/move of the backwards branch.
 After a release, this is simply also done:
 {code}
 svn rm backwards
 svn cp releasebranch backwards
 {code}
 By this we can simply commit in one pass, create patches in one pass.
 The snowball tests are currently downloaded by svn.exe, too. These need a 
 fixed version for checkout. I would like to change this to use svn:externals. 
 Will provide patch, soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals

2010-03-18 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846897#action_12846897
 ] 

Uwe Schindler commented on LUCENE-2326:
---

Sorry, ZIP file has wrong name. Fixed here locally (test+zip).

 Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards 
 branch and linking snowball tests by svn:externals
 ---

 Key: LUCENE-2326
 URL: https://issues.apache.org/jira/browse/LUCENE-2326
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: Flex Branch, 3.1

 Attachments: LUCENE-2326-snowball-try2.patch, LUCENE-2326.patch, 
 LUCENE-2326.patch, TestVnowballVocabData.zip


 As we often need to update backwards tests together with trunk and always 
 have to update the branch first, record rev no, and update build xml, I would 
 simply like to do a svn copy/move of the backwards branch.
 After a release, this is simply also done:
 {code}
 svn rm backwards
 svn cp releasebranch backwards
 {code}
 By this we can simply commit in one pass, create patches in one pass.
 The snowball tests are currently downloaded by svn.exe, too. These need a 
 fixed version for checkout. I would like to change this to use svn:externals. 
 Will provide patch, soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846899#action_12846899
 ] 

Earwin Burrfoot commented on LUCENE-2328:
-

I'm proposing something even more dead simple.

1. We remove Directory.sync(String) completely.
2. Each time you call IndexOutput.close(), Dir adds this file to its internal 
set (if it cares about it at all).
3. If you call Directory.delete(), it also removes file from the set (though 
not strictly necessary).
4. When you commit at IW, it calls Directory.sync() and everything in its 
internal set gets synced. 
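
A minimal sketch of that bookkeeping, kept outside of IndexWriter; the class and 
method names below are invented for illustration and are not existing Lucene API:

{code}
import java.util.HashSet;
import java.util.Set;

// Illustration of steps 2-4 above; names are invented, not Lucene API.
class TrackingDirectory {
  private final Set<String> pendingSync = new HashSet<String>();

  /** Step 2: an IndexOutput for this file was closed. */
  synchronized void stageForSync(String fileName) {
    pendingSync.add(fileName);
  }

  /** Step 3: a deleted file no longer needs syncing. */
  synchronized void onDelete(String fileName) {
    pendingSync.remove(fileName);
  }

  /** Step 4: commit syncs whatever is pending, then clears the set. */
  synchronized void syncPending() {
    for (String fileName : pendingSync) {
      fsync(fileName);
    }
    pendingSync.clear();
  }

  private void fsync(String fileName) {
    // a real Directory would fsync the underlying file descriptor here
  }
}
{code}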

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals

2010-03-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846900#action_12846900
 ] 

Robert Muir commented on LUCENE-2326:
-

Thanks Uwe, this simplifies our tests.

It's nice to remove a network connection (it seems reliable so far, but...)

 Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards 
 branch and linking snowball tests by svn:externals
 ---

 Key: LUCENE-2326
 URL: https://issues.apache.org/jira/browse/LUCENE-2326
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: Flex Branch, 3.1

 Attachments: LUCENE-2326-snowball-try2.patch, LUCENE-2326.patch, 
 LUCENE-2326.patch, TestVnowballVocabData.zip


 As we often need to update backwards tests together with trunk and always 
 have to update the branch first, record rev no, and update build xml, I would 
 simply like to do a svn copy/move of the backwards branch.
 After a release, this is simply also done:
 {code}
 svn rm backwards
 svn cp releasebranch backwards
 {code}
 By this we can simply commit in one pass, create patches in one pass.
 The snowball tests are currently downloaded by svn.exe, too. These need a 
 fixed version for checkout. I would like to change this to use svn:externals. 
 Will provide patch, soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846902#action_12846902
 ] 

Earwin Burrfoot commented on LUCENE-2328:
-

Btw, the initial problem stems from the fact that IW/IR keeps track of the files it 
*has already* synced, instead of the files it *has not yet* synced. Which is 
kinda upside down, and requires upkeep, unlike the straightforward approach in 
which this set gets cleared anew after each commit call.

I can conjure up a patch in a day or two.

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2330) Allow easy extension of IndexWriter

2010-03-18 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846903#action_12846903
 ] 

Earwin Burrfoot commented on LUCENE-2330:
-

Please, only open up something if you decorate it with @experimental 
@will.change.without.single.warning annotations like a christmas tree.

With luceneish freakyish back-compat policy you want to have as few things 
public as possible :)

 Allow easy extension of IndexWriter
 ---

 Key: LUCENE-2330
 URL: https://issues.apache.org/jira/browse/LUCENE-2330
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1


 IndexWriter is not so easy to extend. It hides a lot of useful methods from 
 extending classes as well as useful members (like infoStream). Most of this 
 stuff is very straightforward and I believe it's not exposed for no 
 particular reason. Over in LUCENE-1879 I plan extend IndexWriter to provide a 
 ParallelWriter which will support the parallel indexing requirements. For 
 that I'll need access to several methods and members. I plan to contain in 
 this issue some simple hooks, nothing fancy (and hopefully controversial). 
 I'll leave the rest to specific issues. For now:
 # Introduce a protected default constructor and init(Directory, 
 IndexWriterConfig). That's required because ParallelWriter does not itself 
 index anything, but instead delegates to its Slices. So that ctor is for 
 convenience only, and I'll make it clear (through javadocs) that if one uses 
 it, one needs to call init(). PQ has the same pattern.
 # Expose some members and methods that are useful for extensions (such as 
 config, infoStream etc.). Some candidates are package-private methods, but 
 these will be reviewed and converted on a case by case basis.
 I don't plan to do anything drastic here, just prepare IW for easier 
 extendability.
 I'll post a patch after LUCENE-2320 is committed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2327) IndexOutOfBoundsException in FieldInfos.java

2010-03-18 Thread Shane (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane updated LUCENE-2327:
--

Attachment: CheckIndex.txt

CheckIndex output generated by Luke v1.0.0.

 IndexOutOfBoundsException in FieldInfos.java
 

 Key: LUCENE-2327
 URL: https://issues.apache.org/jira/browse/LUCENE-2327
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 3.0.1
 Environment: Fedora 12
Reporter: Shane
Priority: Minor
 Attachments: CheckIndex.txt


 When retrieving the scoreDocs from a multisearcher, the following exception 
 is thrown:
 java.lang.IndexOutOfBoundsException: Index: 52, Size: 4
 at java.util.ArrayList.rangeCheck(ArrayList.java:571)
 at java.util.ArrayList.get(ArrayList.java:349)
 at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:285)
 at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:274)
 at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:86)
 at 
 org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
 at 
 org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
 at 
 org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232)
 at 
 org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:179)
 at 
 org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:911)
 at 
 org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:644)
 The error is caused when the fieldNumber passed to FieldInfos.fieldInfo() is 
 greater than the size of array list containing the FieldInfo values.  I am 
 not sure what the field number represents or why it would be larger than the 
 array list's size.  The quick fix would be to validate the bounds but there 
 may be a bigger underlying problem.  The issue does appear to be directly 
 related to LUCENE-939.  I've only been able to duplicate this in my 
 production environment and so can't give a good test case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2330) Allow easy extension of IndexWriter

2010-03-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846906#action_12846906
 ] 

Shai Erera commented on LUCENE-2330:


Sure, I'll annotate whatever is needed for PI (e.g. protected/public but still 
for internal use) as @lucene.experimental. After we see more than one extension 
of IW, we can decide whether those APIs need to be made 'public' in essence (i.e. 
w/o the annotation).

I've been burned plenty of times w/ bw policy :).

 Allow easy extension of IndexWriter
 ---

 Key: LUCENE-2330
 URL: https://issues.apache.org/jira/browse/LUCENE-2330
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1


 IndexWriter is not so easy to extend. It hides a lot of useful methods from 
 extending classes as well as useful members (like infoStream). Most of this 
 stuff is very straightforward and I believe it's not exposed for no 
 particular reason. Over in LUCENE-1879 I plan extend IndexWriter to provide a 
 ParallelWriter which will support the parallel indexing requirements. For 
 that I'll need access to several methods and members. I plan to contain in 
 this issue some simple hooks, nothing fancy (and hopefully controversial). 
 I'll leave the rest to specific issues. For now:
 # Introduce a protected default constructor and init(Directory, 
 IndexWriterConfig). That's required because ParallelWriter does not itself 
 index anything, but instead delegates to its Slices. So that ctor is for 
 convenience only, and I'll make it clear (through javadocs) that if one uses 
 it, one needs to call init(). PQ has the same pattern.
 # Expose some members and methods that are useful for extensions (such as 
 config, infoStream etc.). Some candidates are package-private methods, but 
 these will be reviewed and converted on a case by case basis.
 I don't plan to do anything drastic here, just prepare IW for easier 
 extendability.
 I'll post a patch after LUCENE-2320 is committed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2327) IndexOutOfBoundsException in FieldInfos.java

2010-03-18 Thread Shane (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846912#action_12846912
 ] 

Shane commented on LUCENE-2327:
---

The index is relatively old and doesn't appear to have been modified for a 
number of years.  I can't say for certain about prior exceptions.  If the 
CheckIndex results provides any more details, then great.  Regardless, I'm 
willing to chalk this up to a system specific error and close the ticket.  I 
was able to fix the index using Luke.

 IndexOutOfBoundsException in FieldInfos.java
 

 Key: LUCENE-2327
 URL: https://issues.apache.org/jira/browse/LUCENE-2327
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 3.0.1
 Environment: Fedora 12
Reporter: Shane
Priority: Minor
 Attachments: CheckIndex.txt


 When retrieving the scoreDocs from a multisearcher, the following exception 
 is thrown:
 java.lang.IndexOutOfBoundsException: Index: 52, Size: 4
 at java.util.ArrayList.rangeCheck(ArrayList.java:571)
 at java.util.ArrayList.get(ArrayList.java:349)
 at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:285)
 at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:274)
 at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:86)
 at 
 org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
 at 
 org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
 at 
 org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232)
 at 
 org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:179)
 at 
 org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:911)
 at 
 org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:644)
 The error is caused when the fieldNumber passed to FieldInfos.fieldInfo() is 
 greater than the size of array list containing the FieldInfo values.  I am 
 not sure what the field number represents or why it would be larger than the 
 array list's size.  The quick fix would be to validate the bounds but there 
 may be a bigger underlying problem.  The issue does appear to be directly 
 related to LUCENE-939.  I've only been able to duplicate this in my 
 production environment and so can't give a good test case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846913#action_12846913
 ] 

Shai Erera commented on LUCENE-2328:


How would IndexInput report back to the Directory when its close() was called? 
I've checked a couple of Directories and when they openInput, they don't pass 
themselves to the IndexInput. I think what you say makes sense, but I don't see 
how this can be implemented w/ the current implementations (and w/o relying on 
broken Directory impls out there). Broken in the sense that they don't expect 
to get any notification from IndexInput.close().

Other than that, I like that approach. Also, what you wrote about IW keeping 
track on already synced files - I guess you'll change that when it moves into 
Directory, so that it will track the files it hasn't synced yet?

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals

2010-03-18 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-2326.
---

Resolution: Fixed

Committed revision: 924781 (with correct zip file name)

 Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards 
 branch and linking snowball tests by svn:externals
 ---

 Key: LUCENE-2326
 URL: https://issues.apache.org/jira/browse/LUCENE-2326
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: Flex Branch, 3.1

 Attachments: LUCENE-2326-snowball-try2.patch, LUCENE-2326.patch, 
 LUCENE-2326.patch, TestVnowballVocabData.zip


 As we often need to update backwards tests together with trunk and always 
 have to update the branch first, record rev no, and update build xml, I would 
 simply like to do a svn copy/move of the backwards branch.
 After a release, this is simply also done:
 {code}
 svn rm backwards
 svn cp releasebranch backwards
 {code}
 By this we can simply commit in one pass, create patches in one pass.
 The snowball tests are currently downloaded by svn.exe, too. These need a 
 fixed version for checkout. I would like to change this to use svn:externals. 
 Will provide patch, soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2331) Add NoOpMergePolicy

2010-03-18 Thread Shai Erera (JIRA)
Add NoOpMergePolicy
---

 Key: LUCENE-2331
 URL: https://issues.apache.org/jira/browse/LUCENE-2331
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1


I'd like to add a simple and useful MP implementation which does *nothing*! 
:). I've come across many places where either the following is documented or 
implemented: if you want to prevent merges, set mergeFactor to a high enough 
value. I think a NoOpMergePolicy is just as good, and can REALLY allow you to 
disable merges (except for maybe setting mergeFactor to Int.MAX_VAL).

As such, NoOpMergePolicy will be introduced as a singleton, and can be used for 
convenience purposes only. Also, for Parallel Index it's important, because I'd 
like the slices to never do any merges, unless ParallelWriter decides so. So 
they should be set w/ that MP.

I have a patch ready. Waiting for LUCENE-2320 to go in, so that I don't need to 
change it afterwards.

About the name - I like the name, but suggestions are welcome. I thought of a 
NullMergePolicy, but I don't like 'Null' used for a NoOp.
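
For illustration only, the shape of such a policy might look roughly like the sketch 
below; it deliberately does not extend Lucene's real MergePolicy (whose exact method 
signatures are not reproduced here), it just shows the singleton, return-no-merges idea:

{code}
import java.util.Collections;
import java.util.List;

// Standalone sketch of the idea: a merge policy that never selects a merge.
// Not a subclass of Lucene's actual MergePolicy.
final class NoOpMergePolicySketch {
  /** Exposed as a singleton, since it carries no state. */
  static final NoOpMergePolicySketch INSTANCE = new NoOpMergePolicySketch();

  private NoOpMergePolicySketch() {
  }

  /** Never propose a merge, whatever the segment layout looks like. */
  List<String> findMerges(List<String> segmentNames) {
    return Collections.emptyList();
  }
}
{code}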

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2331) Add NoOpMergePolicy

2010-03-18 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846926#action_12846926
 ] 

Earwin Burrfoot commented on LUCENE-2331:
-

NoMergesPolicy - that's exactly what it is, a policy of no merges

 Add NoOpMergePolicy
 ---

 Key: LUCENE-2331
 URL: https://issues.apache.org/jira/browse/LUCENE-2331
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1


 I'd like to add a simple and useful MP implementation which does  nothing 
 ! :). I've came across many places where either the following is documented 
 or implemented: if you want to prevent merges, set mergeFactor to a high 
 enough value. I think a NoOpMergePolicy is just as good, and can REALLY 
 allow you disable merges (except for maybe set mergeFactor to Int.MAX_VAL).
 As such, NoOpMergePolicy will be introduced as a singleton, and can be used 
 for convenience purposes only. Also, for Parallel Index it's important, 
 because I'd like the slices to never do any merges, unless ParallelWriter 
 decides so. So they should be set w/ that MP.
 I have a patch ready. Waiting for LUCENE-2320 to go in, so that I don't need 
 to change it afterwards.
 About the name - I like the name, but suggestions are welcome. I thought of a 
 NullMergePolicy, but I don't like 'Null' used for a NoOp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)

2010-03-18 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846933#action_12846933
 ] 

Uwe Schindler commented on LUCENE-2302:
---

bq. How about instead of TermToBytesRefAttribute we name it TermBytesAttribute? 
(Ie, drop the To and Ref).

This attribute is special, it only has this getter for the bytesref.

If we need a real BytesTermAttribute it should be explicitly defined. Now 
open is NumericTokenStream and so on

 Replacement for TermAttribute+Impl with extended capabilities (byte[] 
 support, CharSequence, Appendable)
 

 Key: LUCENE-2302
 URL: https://issues.apache.org/jira/browse/LUCENE-2302
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: Flex Branch
Reporter: Uwe Schindler
 Fix For: Flex Branch

 Attachments: LUCENE-2302.patch, LUCENE-2302.patch, LUCENE-2302.patch, 
 LUCENE-2302.patch, LUCENE-2302.patch


 For flexible indexing terms can be simple byte[] arrays, while the current 
 TermAttribute only supports char[]. This is fine for plain text, but e.g 
 NumericTokenStream should directly work on the byte[] array.
 Also TermAttribute lacks some interfaces that would make it simpler for 
 users to work with: Appendable and CharSequence
 I propose to create a new interface CharTermAttribute with a clean new API 
 that concentrates on CharSequence and Appendable.
 The implementation class will simply support the old and new interface 
 working on the same term buffer. DEFAULT_ATTRIBUTE_FACTORY will take care of 
 this. So if somebody adds a TermAttribute, he will get an implementation 
 class that can be also used as CharTermAttribute. As both attributes create 
 the same impl instance both calls to addAttribute are equal. So a TokenFilter 
 that adds CharTermAttribute to the source will work with the same instance as 
 the Tokenizer that requested the (deprecated) TermAttribute.
 To also support byte[] only terms like Collation or NumericField needs, a 
 separate getter-only interface will be added, that returns a reusable 
 BytesRef, e.g. BytesRefGetterAttribute. The default implementation class will 
 also support this interface. For backwards compatibility with old 
 self-made-TermAttribute implementations, the indexer will check with 
 hasAttribute(), if the BytesRef getter interface is there and if not will 
 wrap a old-style TermAttribute (a deprecated wrapper class will be provided): 
 new BytesRefGetterAttributeWrapper(TermAttribute), that is used by the 
 indexer then.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)

2010-03-18 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-2302.
---

Resolution: Fixed

Was accidentally committed with a merge. Sorry.

Revision: 924791

 Replacement for TermAttribute+Impl with extended capabilities (byte[] 
 support, CharSequence, Appendable)
 

 Key: LUCENE-2302
 URL: https://issues.apache.org/jira/browse/LUCENE-2302
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: Flex Branch
Reporter: Uwe Schindler
 Fix For: Flex Branch

 Attachments: LUCENE-2302.patch, LUCENE-2302.patch, LUCENE-2302.patch, 
 LUCENE-2302.patch, LUCENE-2302.patch


 For flexible indexing terms can be simple byte[] arrays, while the current 
 TermAttribute only supports char[]. This is fine for plain text, but e.g 
 NumericTokenStream should directly work on the byte[] array.
 Also TermAttribute lacks some interfaces that would make it simpler for 
 users to work with: Appendable and CharSequence
 I propose to create a new interface CharTermAttribute with a clean new API 
 that concentrates on CharSequence and Appendable.
 The implementation class will simply support the old and new interface 
 working on the same term buffer. DEFAULT_ATTRIBUTE_FACTORY will take care of 
 this. So if somebody adds a TermAttribute, he will get an implementation 
 class that can be also used as CharTermAttribute. As both attributes create 
 the same impl instance both calls to addAttribute are equal. So a TokenFilter 
 that adds CharTermAttribute to the source will work with the same instance as 
 the Tokenizer that requested the (deprecated) TermAttribute.
 To also support byte[] only terms like Collation or NumericField needs, a 
 separate getter-only interface will be added, that returns a reusable 
 BytesRef, e.g. BytesRefGetterAttribute. The default implementation class will 
 also support this interface. For backwards compatibility with old 
 self-made-TermAttribute implementations, the indexer will check with 
 hasAttribute(), if the BytesRef getter interface is there and if not will 
 wrap a old-style TermAttribute (a deprecated wrapper class will be provided): 
 new BytesRefGetterAttributeWrapper(TermAttribute), that is used by the 
 indexer then.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846936#action_12846936
 ] 

Andrzej Bialecki  commented on LUCENE-2329:
---

Slightly off-topic ... Having a facility to obtain termID-s per segment (or 
better yet per index) would greatly benefit Solr's UnInverted field creation, 
which currently needs to assign term ids by linear scanning.

 Use parallel arrays instead of PostingList objects
 --

 Key: LUCENE-2329
 URL: https://issues.apache.org/jira/browse/LUCENE-2329
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1


 This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
 In order to avoid having very many long-living PostingList objects in 
 TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
 simply be an int[] which maps each term to dense termIDs.
 All data that the PostingList classes currently hold will then be placed in 
 parallel arrays, where the termID is the index into the arrays.  This will 
 avoid the need for object pooling, will remove the overhead of object 
 initialization and garbage collection.  Especially garbage collection should 
 benefit significantly when the JVM runs out of memory, because in such a 
 situation the gc mark times can get very long if there is a big number of 
 long-living objects in memory.
 Another benefit could be to build more efficient TermVectors.  We could avoid 
 the need of having to store the term string per document in the TermVector.  
 Instead we could just store the segment-wide termIDs.  This would reduce the 
 size and also make it easier to implement efficient algorithms that use 
 TermVectors, because no term mapping across documents in a segment would be 
 necessary.  Though this improvement we can make with a separate jira issue.
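
A toy sketch of the parallel-array layout described above; the field names are 
illustrative, not the actual indexer internals:

{code}
import java.util.Arrays;

// Toy illustration: per-term data addressed by a dense termID, one slot per
// term in each array, instead of one PostingList object per term.
class ParallelPostingsSketch {
  final int[] docFreqs;    // termID -> document frequency
  final int[] lastDocIDs;  // termID -> last document this term was seen in

  ParallelPostingsSketch(int maxTerms) {
    docFreqs = new int[maxTerms];
    lastDocIDs = new int[maxTerms];
    Arrays.fill(lastDocIDs, -1);  // no document seen yet
  }

  /** Record one occurrence of termID in docID. */
  void addOccurrence(int termID, int docID) {
    if (lastDocIDs[termID] != docID) {  // first occurrence in this document
      docFreqs[termID]++;
      lastDocIDs[termID] = docID;
    }
  }
}
{code}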

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846938#action_12846938
 ] 

Earwin Burrfoot commented on LUCENE-2328:
-

 How would IndexInput report back to the Directory when its close() was 
 called? I've checked a couple of Directories and when they openInput, they 
 don't pass themselves to the IndexInput.
Hmm. I guess I have to change IndexOutput impls?

 so that it will track the files it hasn't synced yet?
Sure
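
For illustration, such a notification might look roughly like the sketch below; the 
class is invented and builds on the TrackingDirectory sketch given earlier in the thread:

{code}
import java.io.IOException;

// Illustration only: an output that tells its directory when it is closed,
// so the directory can stage the file for syncing. Not Lucene API.
class NotifyingOutputSketch {
  private final TrackingDirectory owner;  // from the earlier sketch
  private final String fileName;
  private boolean closed;

  NotifyingOutputSketch(TrackingDirectory owner, String fileName) {
    this.owner = owner;
    this.fileName = fileName;
  }

  void close() throws IOException {
    if (!closed) {
      closed = true;
      // flush and close the underlying file here, then:
      owner.stageForSync(fileName);
    }
  }
}
{code}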

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2332) Merge CharTermAttribute and deprecations to trunk

2010-03-18 Thread Uwe Schindler (JIRA)
Merge CharTermAttribute and deprecations to trunk


 Key: LUCENE-2332
 URL: https://issues.apache.org/jira/browse/LUCENE-2332
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 3.1
Reporter: Uwe Schindler
Assignee: Uwe Schindler


This should be merged to trunk before flex lands, so the analyzers can be ported 
to the new API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2332) Merge CharTermAttribute and deprecations to trunk

2010-03-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846943#action_12846943
 ] 

Robert Muir commented on LUCENE-2332:
-

I agree.

This gives us a chance to make sure it really works the way we want,
by porting all our own analyzers to the attribute.

Also, we can hopefully simplify/improve some code (e.g. PatternReplaceFilter)
with the new capabilities.


 Merge CharTermAttribute and deprecations to trunk
 

 Key: LUCENE-2332
 URL: https://issues.apache.org/jira/browse/LUCENE-2332
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 3.1
Reporter: Uwe Schindler
Assignee: Uwe Schindler

 This should be merged to trunk before flex lands, so the analyzers can be 
 ported to the new API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846944#action_12846944
 ] 

Michael McCandless commented on LUCENE-2328:


Keeping track of not-yet-sync'd files instead of sync'd files is
better, but it still requires upkeep (ie when file is deleted you have
to remove it) because files can be opened, written to, closed, deleted
without ever being sync'd.

And I like moving this tracking under Dir -- that's where it belongs.

bq. I assume that on calling syncEveryoneAndHisDog() you should sync all files 
that have been written to, and were closed, and not yet deleted.

This will over-sync in some situations.

Ie, causing commit to take longer than it should.

EG say a merge has finished with the first set of files (say _X.fdx/t,
since it merges fields first) but is still working on postings, when
the user calls commit.  We should not then sync _X.fdx/t because they
are unreferenced by the segments_N we are committing.

Or the merge has finished (so _X.* has been created) but is now off
building the _X.cfs file -- we don't want to sync _X.*, only _X.cfs
when its done.

Another example: we don't do this today, but, addIndexes should really
run fully outside of IW's normal segments file, merging away, and then
only on final success alter IW's segmentInfos.  If we switch to that,
we don't want to sync all the files that addIndexes is temporarily
writing...

The knowledge of which files make up the transaction lives above
Directory... so I think we should retain the per-file control.

I proposed the bulk-sync API so that Dir impls could choose to do a
system-wide sync.  Or, more generally, any Dir which can be more
efficient if it knows the precise set of files that must be sync'd
right now.

If we stick with file-by-file API, doing a system-wide sync is
somewhat trickier... because you can't assume from one call to the
next that nothing had changed.

Also, bulk sync better matches the semantics IW/IR require: these
consumers don't care the order in which these files are sync'd.  They
just care that the requested set is sync'd.  So it exposes a degree of
freedom to the Dir impls that's otherwise hidden today.


 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846946#action_12846946
 ] 

Michael McCandless commented on LUCENE-2329:


This issue is just about how IndexWriter's RAM buffer stores its terms...

But, the flex API adds long ord() and seek(long ord) to the TermsEnum API.


 Use parallel arrays instead of PostingList objects
 --

 Key: LUCENE-2329
 URL: https://issues.apache.org/jira/browse/LUCENE-2329
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1


 This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
 In order to avoid having very many long-living PostingList objects in 
 TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
 simply be an int[] which maps each term to dense termIDs.
 All data that the PostingList classes currently hold will then be placed in 
 parallel arrays, where the termID is the index into the arrays.  This will 
 avoid the need for object pooling, will remove the overhead of object 
 initialization and garbage collection.  Especially garbage collection should 
 benefit significantly when the JVM runs out of memory, because in such a 
 situation the gc mark times can get very long if there is a big number of 
 long-living objects in memory.
 Another benefit could be to build more efficient TermVectors.  We could avoid 
 the need of having to store the term string per document in the TermVector.  
 Instead we could just store the segment-wide termIDs.  This would reduce the 
 size and also make it easier to implement efficient algorithms that use 
 TermVectors, because no term mapping across documents in a segment would be 
 necessary.  Though this improvement we can make with a separate jira issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2327) IndexOutOfBoundsException in FieldInfos.java

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846947#action_12846947
 ] 

Michael McCandless commented on LUCENE-2327:


Yikes -- you had 10 corrupted segments (of 23) and there are at least 4 different 
flavors of corruption across those segments!  Curious...  What storage device 
did you store the index on? ;)

Note that the fix just drops those segments from the index, so any docs that 
were in them are lost.

 IndexOutOfBoundsException in FieldInfos.java
 

 Key: LUCENE-2327
 URL: https://issues.apache.org/jira/browse/LUCENE-2327
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 3.0.1
 Environment: Fedora 12
Reporter: Shane
Priority: Minor
 Attachments: CheckIndex.txt


 When retrieving the scoreDocs from a multisearcher, the following exception 
 is thrown:
 java.lang.IndexOutOfBoundsException: Index: 52, Size: 4
 at java.util.ArrayList.rangeCheck(ArrayList.java:571)
 at java.util.ArrayList.get(ArrayList.java:349)
 at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:285)
 at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:274)
 at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:86)
 at 
 org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
 at 
 org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
 at 
 org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232)
 at 
 org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:179)
 at 
 org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:911)
 at 
 org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:644)
 The error is caused when the fieldNumber passed to FieldInfos.fieldInfo() is 
 greater than the size of the array list containing the FieldInfo values.  I am 
 not sure what the field number represents or why it would be larger than the 
 array list's size.  The quick fix would be to validate the bounds but there 
 may be a bigger underlying problem.  The issue does appear to be directly 
 related to LUCENE-939.  I've only been able to duplicate this in my 
 production environment and so can't give a good test case.
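
To make the "validate the bounds" suggestion concrete, a small hypothetical sketch (the list name is an assumption about FieldInfos' internals, and bounds checking would paper over, not explain, the underlying corruption):

{code:java}
import java.util.List;

// Sketch of a defensive lookup: return null for an out-of-range field number
// instead of letting ArrayList.get() throw IndexOutOfBoundsException.
class BoundsCheckedLookup<T> {
  private final List<T> byNumber;   // stands in for the list FieldInfos keeps internally

  BoundsCheckedLookup(List<T> byNumber) {
    this.byNumber = byNumber;
  }

  T fieldInfo(int fieldNumber) {
    if (fieldNumber < 0 || fieldNumber >= byNumber.size()) {
      return null;   // out of range: caller must treat the field as unknown
    }
    return byNumber.get(fieldNumber);
  }
}
{code}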

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Reopened: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)

2010-03-18 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reopened LUCENE-2302:
---

  Assignee: Uwe Schindler

Add a note to the backwards compatibility section:
- TermAttribute's toString() behaviour has now changed
- Token now implements CharSequence but violates its contract

 Replacement for TermAttribute+Impl with extended capabilities (byte[] 
 support, CharSequence, Appendable)
 

 Key: LUCENE-2302
 URL: https://issues.apache.org/jira/browse/LUCENE-2302
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: Flex Branch
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: Flex Branch

 Attachments: LUCENE-2302.patch, LUCENE-2302.patch, LUCENE-2302.patch, 
 LUCENE-2302.patch, LUCENE-2302.patch


 For flexible indexing, terms can be simple byte[] arrays, while the current 
 TermAttribute only supports char[]. This is fine for plain text, but e.g. 
 NumericTokenStream should work directly on the byte[] array.
 Also, TermAttribute lacks some interfaces that would make it simpler for 
 users to work with: Appendable and CharSequence.
 I propose to create a new interface CharTermAttribute with a clean new API 
 that concentrates on CharSequence and Appendable.
 The implementation class will simply support the old and new interface, 
 working on the same term buffer. DEFAULT_ATTRIBUTE_FACTORY will take care of 
 this. So if somebody adds a TermAttribute, he will get an implementation 
 class that can also be used as a CharTermAttribute. As both attributes create 
 the same impl instance, both calls to addAttribute are equal. So a TokenFilter 
 that adds CharTermAttribute to the source will work with the same instance as 
 the Tokenizer that requested the (deprecated) TermAttribute.
 To also support byte[]-only terms, as Collation or NumericField need, a 
 separate getter-only interface will be added that returns a reusable 
 BytesRef, e.g. BytesRefGetterAttribute. The default implementation class will 
 also support this interface. For backwards compatibility with old 
 self-made TermAttribute implementations, the indexer will check with 
 hasAttribute() whether the BytesRef getter interface is there, and if not will 
 wrap an old-style TermAttribute (a deprecated wrapper class will be provided: 
 new BytesRefGetterAttributeWrapper(TermAttribute)), which is then used by the 
 indexer.
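
A rough sketch of the interface shape being proposed; method names beyond the CharSequence/Appendable parts are assumptions, and the attached patches are authoritative:

{code:java}
import org.apache.lucene.util.Attribute;

// Sketch: a term attribute that concentrates on CharSequence and Appendable,
// as described above. The extra buffer-oriented methods are illustrative guesses.
public interface CharTermAttribute extends Attribute, CharSequence, Appendable {

  /** Copies the given chars into the internal term buffer. */
  void copyBuffer(char[] buffer, int offset, int length);

  /** Direct access to the internal term buffer. */
  char[] buffer();

  /** Sets the number of valid chars in the buffer. */
  CharTermAttribute setLength(int length);

  /** Clears the term (length 0), convenient for reuse in filters. */
  CharTermAttribute setEmpty();
}
{code}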

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2327) IndexOutOfBoundsException in FieldInfos.java

2010-03-18 Thread Shane (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846954#action_12846954
 ] 

Shane commented on LUCENE-2327:
---

I believe at the time we were storing the index on a NAS via NFS.  If my memory 
serves me well, there were known issues with running Lucene over NFS back then.  
We were also experiencing issues with the file system, so we have since moved 
to a different architecture.

Also, I was aware that the fix drops the segments, but thanks anyway. :)

 IndexOutOfBoundsException in FieldInfos.java
 

 Key: LUCENE-2327
 URL: https://issues.apache.org/jira/browse/LUCENE-2327
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 3.0.1
 Environment: Fedora 12
Reporter: Shane
Priority: Minor
 Attachments: CheckIndex.txt


 When retrieving the scoreDocs from a multisearcher, the following exception 
 is thrown:
 java.lang.IndexOutOfBoundsException: Index: 52, Size: 4
 at java.util.ArrayList.rangeCheck(ArrayList.java:571)
 at java.util.ArrayList.get(ArrayList.java:349)
 at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:285)
 at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:274)
 at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:86)
 at 
 org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
 at 
 org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
 at 
 org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232)
 at 
 org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:179)
 at 
 org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:911)
 at 
 org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:644)
 The error is caused when the fieldNumber passed to FieldInfos.fieldInfo() is 
 greater than the size of array list containing the FieldInfo values.  I am 
 not sure what the field number represents or why it would be larger than the 
 array list's size.  The quick fix would be to validate the bounds but there 
 may be a bigger underlying problem.  The issue does appear to be directly 
 related to LUCENE-939.  I've only been able to duplicate this in my 
 production environment and so can't give a good test case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846956#action_12846956
 ] 

Earwin Burrfoot commented on LUCENE-2328:
-

bq. Keeping track of not-yet-sync'd files instead of sync'd files is better, but 
it still requires upkeep (ie when file is deleted you have to remove it) 
because files can be opened, written to, closed, deleted without ever being 
sync'd.
You can just skip this and handle FileNotFound exception when syncing. Have to 
handle it anyway, no guarantees some file won't be snatched from under your 
nose.

bq. This will over-sync in some situations.
Don't feel this is a serious problem. If you over-sync (in fact sync some files 
a little bit earlier than strictly required), in a few seconds you will 
under-sync, so total time is still the same.

But I feel you're somewhat missing the point. System-wide sync is not the 
original aim, it's just a possible byproduct of what is the original aim - to 
move sync tracking code from IW to Directory. And I don't see at all how adding 
batch-syncs achieves this.
If you're calling sync(Collection<String>), damn, you should keep that 
collection somewhere :) and it is supposed to be inside!
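
A minimal sketch of the "just handle the exception" approach described above (hypothetical names; fsyncFile stands in for whatever per-file fsync the Directory performs):

{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Collection;

// Sketch: sync a batch of files while tolerating files deleted out from under us,
// instead of maintaining bookkeeping to remove deleted files from a tracking set.
class TolerantSyncer {
  void sync(Collection<String> fileNames) throws IOException {
    for (String name : fileNames) {
      try {
        fsyncFile(name);
      } catch (FileNotFoundException e) {
        // The file was deleted (e.g. merged away) before we got to it -- skip it.
      }
    }
  }

  // Hypothetical per-file fsync helper (open fd, FileChannel.force(true), close).
  void fsyncFile(String name) throws IOException {
  }
}
{code}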

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846960#action_12846960
 ] 

Michael McCandless commented on LUCENE-2328:


{quote}
bq. Keeping track of not-yet-sync'd files instead of sync'd files is better, 
but it still requires upkeep (ie when file is deleted you have to remove it) 
because files can be opened, written to, closed, deleted without ever being 
sync'd.
You can just skip this and handle FileNotFound exception when syncing. Have to 
handle it anyway, no guarantees some file won't be snatched from under your 
nose.
{quote}

IW & IR do in fact guarantee they will never ask for a deleted file to
be sync'd.  If they ever do that we have more serious problems ;)

{quote}
bq. This will over-sync in some situations.
Don't feel this is a serious problem. If you over-sync (in fact sync some files 
a little bit earlier than strictly required), in a few seconds you will 
under-sync, so total time is still the same.
{quote}

I think this is important -- commit is already slow enough -- why make
it slower?

Further, the extra files you sync'd may never have needed to be sync'd
(they will be merged away).  My examples above include such cases.

Turning this around... what's so bad about keeping the sync per file?

bq. System-wide sync is not the original aim, it's just a possible byproduct of 
what is the original aim

I know this is not the aim of this issue, rather just a nice
by-product if we switch to a global sync method.

bq. to move sync tracking code from IW to Directory.

Right, this is a great step forward, as long as we don't slow
commit by dumbing down the API :)

bq. And I don't see at all how adding batch-syncs achieves this.

You're right: this doesn't achieve / is not required for moving
sync'd file tracking down to Dir.  It's orthogonal, but, is another
way that we could allow Dir impls to do global sync.

I'm proposing this as a different change, to make the API better match
the needs of its consumers.  In fact, really the OS ought to allow for
this as well (but I know of none that do) since it'd give the IO
scheduler more freedom on which bytes need to be moved to disk.

We can open this one as a separate issue...


 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2327) IndexOutOfBoundsException in FieldInfos.java

2010-03-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2327.


Resolution: Invalid

OK I'm resolving as optimistically invalid :)

 IndexOutOfBoundsException in FieldInfos.java
 

 Key: LUCENE-2327
 URL: https://issues.apache.org/jira/browse/LUCENE-2327
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 3.0.1
 Environment: Fedora 12
Reporter: Shane
Priority: Minor
 Attachments: CheckIndex.txt


 When retrieving the scoreDocs from a multisearcher, the following exception 
 is thrown:
 java.lang.IndexOutOfBoundsException: Index: 52, Size: 4
 at java.util.ArrayList.rangeCheck(ArrayList.java:571)
 at java.util.ArrayList.get(ArrayList.java:349)
 at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:285)
 at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:274)
 at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:86)
 at 
 org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
 at 
 org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
 at 
 org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232)
 at 
 org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:179)
 at 
 org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:911)
 at 
 org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:644)
 The error is caused when the fieldNumber passed to FieldInfos.fieldInfo() is 
 greater than the size of array list containing the FieldInfo values.  I am 
 not sure what the field number represents or why it would be larger than the 
 array list's size.  The quick fix would be to validate the bounds but there 
 may be a bigger underlying problem.  The issue does appear to be directly 
 related to LUCENE-939.  I've only been able to duplicate this in my 
 production environment and so can't give a good test case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846984#action_12846984
 ] 

Michael McCandless commented on LUCENE-2280:


From the log I can see that you run fine for a long time, opening IW,
indexing a few docs, optimizing, then closing.  Then suddenly the
exceptions start happening on many (but not all) merges, and, merges
involving different segments.  JRE bug seems most likely I guess...

Since you see this only on Windows (not e.g. on Linux), I think this is 
likely not a bug in Lucene but rather something particular about your 
Windows env -- virus checker maybe?  Is there anything in the Windows 
event log that correlates to when the exceptions start?

Or it could be a JRE bug -- you really should try a different (Sun) 
JRE.


 IndexWriter.optimize() throws NullPointerException
 --

 Key: LUCENE-2280
 URL: https://issues.apache.org/jira/browse/LUCENE-2280
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3.2
 Environment: Win 2003, lucene version 2.3.2, IBM JRE 1.6
Reporter: Ritesh Nigam
 Attachments: lucene.jar, lucene.zip


 I am using the Lucene 2.3.2 search APIs for my application. I am indexing a 
 45GB database, which creates an approx. 200MB index file. After finishing the 
 indexing, and while running optimize(), I can see a NullPointerException 
 thrown in my log and the index file is getting corrupted; the log says:
 
 Caused by: 
 java.lang.NullPointerException
   at 
 org.apache.lucene.store.BufferedIndexOutput.writeBytes(BufferedIndexOutput.java:49)
   at org.apache.lucene.store.IndexOutput.writeBytes(IndexOutput.java:40)
   at 
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:566)
   at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:135)
   at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3273)
   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2968)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)
 
 This is happening quite frequently, although I am not able to reproduce it on 
 demand. I saw an issue logged which is somewhat related to my issue 
 (http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200809.mbox/%3c6e4a40db-5efc-42da-a857-d59f4ec34...@mikemccandless.com%3e)
 but the only difference here is that I am not using Store.Compress for my 
 fields; I am using Store.NO instead. Please note that I am using the IBM JRE 
 for my application.
 Is this an issue with Lucene? If yes, in which version is it fixed?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846985#action_12846985
 ] 

Michael McCandless commented on LUCENE-2280:


Yes close() does commit() internally.

Are you saying you see the same exception on 3.0, using the IBM JRE?  Can you 
try with the Sun JRE?

 IndexWriter.optimize() throws NullPointerException
 --

 Key: LUCENE-2280
 URL: https://issues.apache.org/jira/browse/LUCENE-2280
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3.2
 Environment: Win 2003, lucene version 2.3.2, IBM JRE 1.6
Reporter: Ritesh Nigam
 Attachments: lucene.jar, lucene.zip


 I am using lucene 2.3.2 search APIs for my application, i am indexing 45GB 
 database which creates approax 200MB index file, after finishing the indexing 
 and while running optimize() i can see NullPointerExcception thrown in my log 
 and index file is getting corrupted, log says
 
 Caused by: 
 java.lang.NullPointerException
   at 
 org.apache.lucene.store.BufferedIndexOutput.writeBytes(BufferedIndexOutput.java:49)
   at org.apache.lucene.store.IndexOutput.writeBytes(IndexOutput.java:40)
   at 
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:566)
   at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:135)
   at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3273)
   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2968)
   at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)
 
 and this is happening quite frequently, although I am not able to reproduce 
 it on demand, I saw an issue logged which is some what related to mine issue 
 (http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200809.mbox/%3c6e4a40db-5efc-42da-a857-d59f4ec34...@mikemccandless.com%3e)
  but the only difference here is I am not using Store.Compress for my fields, 
 i am using Store.NO instead. please note that I am using IBM JRE for my 
 application.
 Is this an issue with lucene?, if yes it is fixed in which version?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846991#action_12846991
 ] 

Earwin Burrfoot commented on LUCENE-2328:
-

Okay, summing up.

1. Directory gets a new method - sync(Collection<String>). It will become 
abstract in 4.0, but for now by default delegates to the current sync(String), 
which is deprecated.
2. FSDirectory tracks newly written, closed and not-yet-deleted files, by 
changing FSD.IndexOutput accordingly.
3. sync() semantics change from "sync this now" to "sync this now, if you 
think it's needed". Noop sync() impls like RAMDir continue to be noops; FSDir 
syncs only those files that exist in its tracking set and ignores all others.
4. IW/IR stop tracking synced files completely (lots of garbage code gone from 
IW), and instead call sync(Collection) on commit with a list of all files that 
constitute said commit.

These steps preserve back-compatibility (except for custom Directory impls for 
which calling sync on the same file repeatedly is costly; they will suffer 
performance degradation), ensure that for each commit only the strictly 
requested subset of files is synced (the thing Mike insisted on), and will 
completely remove sync-tracking code from IW and IR.

5. We open another issue to experiment with batch syncing and various 
filesystems. Some relevant fun data: 
http://www.humboldt.co.uk/2009/03/fsync-across-platforms.html
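
A compact sketch of point 1, trimmed to the two methods in question (the exact signatures belong to the eventual patch):

{code:java}
import java.io.IOException;
import java.util.Collection;

// Sketch: the proposed batch sync with a back-compatible default that simply
// delegates to the existing per-file sync. In 4.0 the batch form would become
// abstract and the per-file form would go away.
public abstract class Directory {

  /** Existing per-file sync; to be deprecated. */
  @Deprecated
  public void sync(String name) throws IOException {
  }

  /** New batch sync; default implementation falls back to per-file calls. */
  public void sync(Collection<String> names) throws IOException {
    for (String name : names) {
      sync(name);
    }
  }

  // ... the rest of the Directory API is unchanged and omitted here ...
}
{code}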


 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846996#action_12846996
 ] 

Shai Erera commented on LUCENE-2328:


bq.  changing FSD.IndexOutput accordingly

This worries me a bit. If only FSD.IndexOutput does that, I'm afraid other 
Directory implementations won't realize that they should do so as well (NIO?). 
I'd prefer if IndexOutput's contract said it is supposed to call back on the 
Directory upon close ... not sure - maybe just put some heavy documentation 
around createOutput? If we could enforce this API-wise, and let the Dirs that 
don't care simply ignore it, then it'd be better. It would also allow someone 
to extend FSD.createOutput, return his own IndexOutput and not worry (or do, 
but knowingly) about calling back to the Dir.

Other than that - this looks great.

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2331) Add NoOpMergePolicy

2010-03-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846998#action_12846998
 ] 

Shai Erera commented on LUCENE-2331:


I like NoMergesPolicy ... perhaps, like NoLockFactory, we can call it 
NoMergePolicy? so MP is preserved in the name (not that it's critical)?

 Add NoOpMergePolicy
 ---

 Key: LUCENE-2331
 URL: https://issues.apache.org/jira/browse/LUCENE-2331
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1


 I'd like to add a simple and useful MP implementation which does "nothing"! 
 :). I've come across many places where either the following is documented 
 or implemented: "if you want to prevent merges, set mergeFactor to a high 
 enough value". I think a NoOpMergePolicy is just as good, and can REALLY 
 allow you to disable merges (except for maybe setting mergeFactor to 
 Int.MAX_VAL).
 As such, NoOpMergePolicy will be introduced as a singleton, and can be used 
 for convenience purposes only. Also, for Parallel Index it's important, 
 because I'd like the slices to never do any merges unless ParallelWriter 
 decides so. So they should be set w/ that MP.
 I have a patch ready. Waiting for LUCENE-2320 to go in, so that I don't need 
 to change it afterwards.
 About the name - I like the name, but suggestions are welcome. I thought of a 
 NullMergePolicy, but I don't like 'Null' used for a NoOp.
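
For illustration, a sketch of what such a policy could look like once LUCENE-2320's no-arg MergePolicy constructor is in; the abstract method set and signatures here are recalled from the 3.x API and should be treated as assumptions, the attached patch being the real thing:

{code:java}
import java.util.Set;

import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.SegmentInfo;
import org.apache.lucene.index.SegmentInfos;

// Sketch: a MergePolicy that never selects merges, exposed as a singleton so it
// can be shared, e.g. by Parallel Index slices as described above.
public final class NoOpMergePolicy extends MergePolicy {

  public static final MergePolicy INSTANCE = new NoOpMergePolicy();

  private NoOpMergePolicy() {
    // singleton; nothing to configure
  }

  @Override
  public MergeSpecification findMerges(SegmentInfos infos) {
    return null;          // "no merges, ever"
  }

  @Override
  public MergeSpecification findMergesForOptimize(SegmentInfos infos,
      int maxSegmentCount, Set<SegmentInfo> segmentsToOptimize) {
    return null;
  }

  @Override
  public MergeSpecification findMergesToExpungeDeletes(SegmentInfos infos) {
    return null;
  }

  @Override
  public boolean useCompoundFile(SegmentInfos infos, SegmentInfo newSegment) {
    return false;
  }

  @Override
  public void close() {
    // nothing to release
  }
}
{code}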

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846997#action_12846997
 ] 

Shai Erera commented on LUCENE-2320:


Mike - are you reviewing it? I think I've addressed all of the comments mentioned.

 Add MergePolicy to IndexWriterConfig
 

 Key: LUCENE-2320
 URL: https://issues.apache.org/jira/browse/LUCENE-2320
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch, 
 LUCENE-2320.patch, LUCENE-2320.patch


 Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as 
 well. The change is not straightforward and so I've kept it for a separate 
 issue. MergePolicy requires an IndexWriter in its ctor, however none can be 
 passed to it before an IndexWriter actually exists. And today IW may create 
 an MP just for it to be overridden by the application one line afterwards. I 
 don't want to make the iw member of MP non-final, or settable by extending 
 classes, however it needs to remain protected so they can access it directly. 
 So the proposed changes are:
 * Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set 
 once (hence its name). It'll have the signature SetOnce<T> w/ *synchronized 
 set(T)* and *T get()*. The member will be declared volatile, so that get() 
 won't be synchronized.
 * MP will define a *protected final SetOnce<IndexWriter> writer* instead of 
 the current writer. *NOTE: this is a bw break*. Any suggestions are welcome.
 * MP will offer a public default ctor, together with a set(IndexWriter).
 * IndexWriter will set itself on MP using set(this). Note that if set is 
 called more than once, it will throw an exception (AlreadySetException - or 
 does someone have a better suggestion, preferably an already existing Java 
 exception?).
 That's the core idea. I'd like to post a patch soon, so I'd appreciate your 
 review and proposals.
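
A minimal sketch of the proposed SetOnce, following the description above (the exception name is the open question from the last bullet, so treat it as a placeholder):

{code:java}
// Sketch: synchronized set(T) that may only succeed once; get() stays
// unsynchronized because the wrapped reference is volatile.
public final class SetOnce<T> {

  /** Placeholder name for the "already set" failure discussed above. */
  public static final class AlreadySetException extends IllegalStateException {
    public AlreadySetException() {
      super("The object cannot be set twice!");
    }
  }

  private volatile T obj = null;
  private boolean set = false;

  public synchronized void set(T obj) {
    if (set) {
      throw new AlreadySetException();
    }
    this.obj = obj;
    this.set = true;
  }

  public T get() {
    return obj;   // no synchronization needed; 'obj' is volatile
  }
}
{code}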

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847008#action_12847008
 ] 

Michael McCandless commented on LUCENE-2320:


The patch looks great Shai -- I plan to commit in a day or two.

I added @lucene.experimental to SetOnce's jdocs, and also removed stale javadoc 
in MP and MS saying that you need access to package-private APIs (unrelated to 
this issue but spotted it ;).

 Add MergePolicy to IndexWriterConfig
 

 Key: LUCENE-2320
 URL: https://issues.apache.org/jira/browse/LUCENE-2320
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch, 
 LUCENE-2320.patch, LUCENE-2320.patch


 Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as 
 well. The change is not straightforward and so I've kept it for a separate 
 issue. MergePolicy requires in its ctor an IndexWriter, however none can be 
 passed to it before an IndexWriter actually exists. And today IW may create 
 an MP just for it to be overridden by the application one line afterwards. I 
 don't want to make iw member of MP non-final, or settable by extending 
 classes, however it needs to remain protected so they can access it directly. 
 So the proposed changes are:
 * Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set 
 once (hence its name). It'll have the signature SetOnce<T> w/ *synchronized 
 set(T)* and *T get()*. T will be declared volatile, so that get() won't be 
 synchronized.
 * MP will define a *protected final SetOnce<IndexWriter> writer* instead of 
 the current writer. *NOTE: this is a bw break*. any suggestions are welcomed.
 * MP will offer a public default ctor, together with a set(IndexWriter).
 * IndexWriter will set itself on MP using set(this). Note that if set will be 
 called more than once, it will throw an exception (AlreadySetException - or 
 does someone have a better suggestion, preferably an already existing Java 
 exception?).
 That's the core idea. I'd like to post a patch soon, so I'd appreciate your 
 review and proposals.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847010#action_12847010
 ] 

Earwin Burrfoot commented on LUCENE-2328:
-

Every Directory implementation decides how to handle sync() calls on its own. 
The fact that FSDir (and descendants) do this performance optimization is their 
implementation detail.
I don't want to bake this into the base class somehow. But I will note in 
sync()'s javadocs that clients may pass the same file over and over again, so 
you might want to optimize for this.
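
As a concrete illustration of that implementation detail, a sketch with hypothetical names, assuming the Collection-based sync proposed in this issue:

{code:java}
import java.io.IOException;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch: an FSDirectory-style "needs sync" set. Repeated sync() calls for the
// same file are harmless because a file is fsync'd at most once after each close.
class SyncTrackingDir {
  private final Set<String> staleFiles =
      Collections.synchronizedSet(new HashSet<String>());

  // Called by the directory's IndexOutput when a newly written file is closed.
  void fileClosed(String name) {
    staleFiles.add(name);
  }

  public void sync(Collection<String> names) throws IOException {
    for (String name : names) {
      if (staleFiles.remove(name)) {   // unknown or already-synced names are ignored
        fsyncFile(name);
      }
    }
  }

  // Hypothetical per-file fsync helper.
  void fsyncFile(String name) throws IOException {
  }
}
{code}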

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene and solr trunk

2010-03-18 Thread Mark Miller
Alright, so we have implemented Hoss' suggestion here on the lucene/solr 
merged dev branch at lucene/solr/branches/newtrunk.


Feel free to check it out and give some feedback.

We also roughly have Solr running on Lucene trunk - e.g. compiling Solr 
will first compile Lucene and run off those compiled class files. 
Running dist or example in Solr will grab Lucene's jars and put them in 
the war. This still needs further love, but it works.


There is also a top level build.xml with two targets: clean, and test. 
Clean will clean both Lucene and Solr, and test will run tests for both 
Lucene and Solr.


Thanks to everyone that contributed to getting all this working!

--
- Mark

http://www.lucidimagination.com



On 03/17/2010 12:40 PM, Mark Miller wrote:
Okay, so this looks good to me (a few others seemed to like it - 
though Lucene-Dev was somehow dropped earlier) - lets try this out on 
the branch? (then we can get rid of that horrible branch name ;) )


Anyone on the current branch object to having to do a quick svn switch?

On 03/16/2010 06:46 PM, Chris Hostetter wrote:
: Otis, yes, I think so, eventually.  But that's gonna take much more 
discussion.

:
: I don't think this initial cutover should try to solve how modules
: will be organized, yet... we'll get there, eventually.

But we should at least consider it, and not move in a direction that's
distinct from the ultimate goal of better refactoring (especially since
that was one of the main goals of unifying development efforts)

Here's my concrete suggestion that could be done today (for simplicity:
$svn = https://svn.apache.org/repos/asf/lucene)...

   svn mv $svn/java/trunk $svn/java/tmp-migration
   svn mkdir $svn/java/trunk
   svn mv $svn/solr/trunk $svn/java/trunk/solr
   svn mv $svn/java/tmp-migration $svn/java/trunk/core

At which point:

0. People who want to work only on Lucene-Java can start checking out
$svn/java/trunk/core (i'm pretty sure existing checkouts will 
continue to

work w/o any changes, the svn info should just update itself)

1. build files can be added to (the new) $svn/java/trunk to build ./core
followed by ./solr

2. the build files in $svn/java/trunk/solr can be modified to look at
../core/ to find lucene jars

3. people who care about Solr (including all committers) should start
checking out and building all of $svn/java/trunk

4. Long term, we could choose to branch all of $svn/java/trunk
for releases ... AND/OR we could choose to branch specific modules
(ie: solr) independently (with modifications to the build files on those
branches to pull in their dependencies from alternate locations)

5. Long term, we can start refactoring additional modules out of
$svn/java/trunk/solr and $svn/java/trunk/core (like
$svn/java/trunk/core/contrib) into their own directory in 
$svn/java/trunk


6. Long term, people who want to work on more than just core but don't
care about certain modules (like solr) can do a simple non-recursive
checkout of $svn/java/trunk and then do full checkouts of whatever 
modules

they care about


(Please note: I'm just trying to list things we *could* do if we go this
route, i'm not advocating that we *should* do any of these things)

I can't think of any objections people have raised to any of the 
previous
suggestions which apply to this suggestion.  Is there anything people 
can

think of that would be useful, but not possible, if we go this route?


-Hoss








-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene and solr trunk

2010-03-18 Thread Michael McCandless
All tests pass for me :)

Mike

On Thu, Mar 18, 2010 at 12:27 PM, Mark Miller markrmil...@gmail.com wrote:
 Alight, so we have implemented Hoss' suggestion here on the lucene/solr
 merged dev branch at lucene/solr/branches/newtrunk.

 Feel free to check it out and give some feedback.

 We also roughly have Solr running on Lucene trunk - eg compiling Solr will
 first compile lucene and run off those compiled class files. Running dist or
 example in Solr will grab Lucene's jars and put them in the war. This still
 needs further love, but it works.

 There is also a top level build.xml with two targets: clean, and test. Clean
 will clean both Lucene and Solr, and test will run tests for both Lucene and
 Solr.

 Thanks to everyone that contributed to getting all this working!

 --
 - Mark

 http://www.lucidimagination.com



 On 03/17/2010 12:40 PM, Mark Miller wrote:

 Okay, so this looks good to me (a few others seemed to like it - though
 Lucene-Dev was somehow dropped earlier) - lets try this out on the branch?
 (then we can get rid of that horrible branch name ;) )

 Anyone on the current branch object to having to do a quick svn switch?

 On 03/16/2010 06:46 PM, Chris Hostetter wrote:

 : Otis, yes, I think so, eventually.  But that's gonna take much more
 discussion.
 :
 : I don't think this initial cutover should try to solve how modules
 : will be organized, yet... we'll get there, eventually.

 But we should at least consider it, and not move in a direction that's
 distinct from the ultimate goal of better refactoring (especailly since
 that was one of the main goals of unifying development efforts)

 Here's my concrete suggestion that could be done today (for simplicity:
 $svn = https://svn.apache.org/repos/asf/lucene)...

   svn mv $svn/java/trunk $svn/java/tmp-migration
   svn mkdir $svn/java/trunk
   svn mv $svn/solr/trunk $svn/java/trunk/solr
   svn mv $svn/java/tmp-migration $svn/java/trunk/core

 At which point:

 0. People who want to work only on Lucene-Java can start checking out
 $svn/java/trunk/core (i'm pretty sure existing checkouts will continue to
 work w/o any changes, the svn info should just update itself)

 1. build files can be added to (the new) $svn/java/trunk to build ./core
 followed by ./solr

 2. the build files in $svn/java/trunk/solr can be modified to look at
 ../core/ to find lucene jars

 3. people who care about Solr (including all committers) should start
 checking out and building all of $svn/java/trunk

 4. Long term, we could choose to branch all of $svn/java/trunk
 for releases ... AND/OR we could choose to branch specific modules
 (ie: solr) independently (with modifications to the build files on those
 branches to pull in their dependencies from alternate locations

 5. Long term, we can start refactoring additional modules out of
 $svn/java/trunk/solr and $svn/java/trunk/core (like
 $svn/java/trunk/core/contrib) into their own directory in $svn/java/trunk

 6. Long term, people who want to work on more then just core but don't
 care about certain modules (like solr) can do a simple non-recursive
 checkout of $svn/java/trunk and then do full checkouts of whatever
 modules
 they care about


 (Please note: I'm just trying to list things we *could* do if we go this
 route, i'm not advocating that we *should* do any of these things)

 I can't think of any objections people have raised to any of the previous
 suggestions which apply to this suggestion.  Is there anything people can
 think of that would be useful, but not possible, if we go this route?


 -Hoss






 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847015#action_12847015
 ] 

Michael McCandless commented on LUCENE-2328:


Must the Dir insist the file is closed in order to sync it?

Why not enroll newly created files in the "to be sync'd" set?

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2331) Add NoOpMergePolicy

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847018#action_12847018
 ] 

Michael McCandless commented on LUCENE-2331:


+1 for NoMergePolicy

 Add NoOpMergePolicy
 ---

 Key: LUCENE-2331
 URL: https://issues.apache.org/jira/browse/LUCENE-2331
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1


 I'd like to add a simple and useful MP implementation which does  nothing 
 ! :). I've came across many places where either the following is documented 
 or implemented: if you want to prevent merges, set mergeFactor to a high 
 enough value. I think a NoOpMergePolicy is just as good, and can REALLY 
 allow you disable merges (except for maybe set mergeFactor to Int.MAX_VAL).
 As such, NoOpMergePolicy will be introduced as a singleton, and can be used 
 for convenience purposes only. Also, for Parallel Index it's important, 
 because I'd like the slices to never do any merges, unless ParallelWriter 
 decides so. So they should be set w/ that MP.
 I have a patch ready. Waiting for LUCENE-2320 to go in, so that I don't need 
 to change it afterwards.
 About the name - I like the name, but suggestions are welcome. I thought of a 
 NullMergePolicy, but I don't like 'Null' used for a NoOp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-18 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847024#action_12847024
 ] 

Michael Busch commented on LUCENE-2329:
---

bq. This issue is just about how IndexWriter's RAM buffer stores its terms... 

Actually, when I talked about the TermVectors I meant we should explore 
storing the termIDs on *disk*, rather than the strings.  It would help things 
like similarity search and facet counting.

{quote}
But, note that term vectors today do not store the term char[] again - they 
piggyback on the term char[] already stored for the postings.
{quote}

Yeah I think I'm familiar with that part (secondary entry point in 
TermsHashPerField, hashes based on termStart).  Haven't looked much into how 
the rest of the TermVector in-memory data structures are working.  

{quote}
Though, I believe they store int textStart (increments by term length per 
unique term), which is less compact than the termID would be (increments +1 per 
unique term)
{quote}

Actually we wouldn't need a second hashtable for the secondary TermsHash 
anymore, right?  It would just have, like the primary TermsHash, a parallel 
array with the things that the TermVectorsTermsWriter.Postinglist class 
currently contains (freq, lastOffset, lastPosition)?  And the index into that 
array would be the termID of course.

This would be a nice simplification, because no hash collisions, no hash table 
resizing based on load factor, etc. would be necessary for non-primary 
TermsHashes?

bq.  so if eg we someday use packed ints we'd be more RAM efficient by storing 
termIDs...

How does the read performance of packed ints compare to normal int[] arrays?  
I think nowadays RAM is less of an issue?  And with a searchable RAM buffer we 
might want to sacrifice a bit more RAM for higher search performance?  Oh man, 
will we need flexible indexing for the in-memory index too? :) 

 Use parallel arrays instead of PostingList objects
 --

 Key: LUCENE-2329
 URL: https://issues.apache.org/jira/browse/LUCENE-2329
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1


 This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
 In order to avoid having very many long-living PostingList objects in 
 TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
 simply be a int[] which maps each term to dense termIDs.
 All data that the PostingList classes currently hold will then we placed in 
 parallel arrays, where the termID is the index into the arrays.  This will 
 avoid the need for object pooling, will remove the overhead of object 
 initialization and garbage collection.  Especially garbage collection should 
 benefit significantly when the JVM runs out of memory, because in such a 
 situation the gc mark times can get very long if there is a big number of 
 long-living objects in memory.
 Another benefit could be to build more efficient TermVectors.  We could avoid 
 the need of having to store the term string per document in the TermVector.  
 Instead we could just store the segment-wide termIDs.  This would reduce the 
 size and also make it easier to implement efficient algorithms that use 
 TermVectors, because no term mapping across documents in a segment would be 
 necessary.  Though this improvement we can make with a separate jira issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847034#action_12847034
 ] 

Shai Erera commented on LUCENE-2320:


Thanks Mike !

 Add MergePolicy to IndexWriterConfig
 

 Key: LUCENE-2320
 URL: https://issues.apache.org/jira/browse/LUCENE-2320
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch, 
 LUCENE-2320.patch, LUCENE-2320.patch


 Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as 
 well. The change is not straightforward and so I've kept it for a separate 
 issue. MergePolicy requires in its ctor an IndexWriter, however none can be 
 passed to it before an IndexWriter actually exists. And today IW may create 
 an MP just for it to be overridden by the application one line afterwards. I 
 don't want to make iw member of MP non-final, or settable by extending 
 classes, however it needs to remain protected so they can access it directly. 
 So the proposed changes are:
 * Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set 
 once (hence its name). It'll have the signature SetOnceT w/ *synchronized 
 setT* and *T get()*. T will be declared volatile, so that get() won't be 
 synchronized.
 * MP will define a *protected final SetOnceIndexWriter writer* instead of 
 the current writer. *NOTE: this is a bw break*. any suggestions are welcomed.
 * MP will offer a public default ctor, together with a set(IndexWriter).
 * IndexWriter will set itself on MP using set(this). Note that if set will be 
 called more than once, it will throw an exception (AlreadySetException - or 
 does someone have a better suggestion, preferably an already existing Java 
 exception?).
 That's the core idea. I'd like to post a patch soon, so I'd appreciate your 
 review and proposals.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847036#action_12847036
 ] 

Shai Erera commented on LUCENE-2328:


Yeah, I guess I wasn't clear enough. So suppose someone sub-classes FSDir and 
overrides createOutput. How should he know his IndexOutput should call 
dir.sync()? How should he know he needs to pass the Dir to his IndexOutput? So 
I suggested either mentioning it in the Javadocs, or somehow making all of 
FSDir's outputs aware of that, API-wise ...

So today a file is closed only upon commit (?), and it's then that it's synced? 
If so, why would you want to sync a file that is still open? I guess it cannot 
harm, but what's the use case?

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application does
 index and delete some few files. This is repeated for 60k times. Optimization
 is run from every 2k times a file is indexed. Index size is 50KB. I did 
 analyze
 the HeapDumpFile and realized that IndexWriter.synced field occupied more than
 half of the heap. That field is a private HashSet without a getter. Its task 
 is
 to hold files which have been synced already.
 There are two calls to addAll and one call to add on synced but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer synced contains 32618 entries which
 look like file names _e065_1.del or _e067.cfs
 The index directory contains 10 files only.
 I guess synced is holding obsolete data 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847050#action_12847050
 ] 

Michael McCandless commented on LUCENE-2328:


In the current proposal, IndexOutput won't call dir.sync.  All it will do is 
notify the dir when it was closed so the dir will record that filename as 
eligible for commit.

Lucene today never syncs a file until after it's closed, but, conceivably some 
day it could.  Or others who use the Dir API to write their own files could.

At the OS level this is perfectly fine (in fact you have to pass an open fd to 
fsync).  It seems presumptuous of the directory to silently ignore a call to 
sync just because the file hadn't been closed yet...
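
For what it's worth, syncing a still-open file is straightforward in Java too; 
a tiny illustration (not Lucene code) using FileChannel.force, which is 
essentially fsync on the open descriptor:

{code}
import java.io.RandomAccessFile;

public class SyncOpenFileDemo {
  public static void main(String[] args) throws Exception {
    RandomAccessFile raf = new RandomAccessFile("demo.bin", "rw");
    try {
      raf.write(new byte[] {1, 2, 3});
      // fsync the file while it is still open; force(true) also flushes metadata
      raf.getChannel().force(true);
    } finally {
      raf.close();
    }
  }
}
{code}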

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application
 indexes and deletes a few files; this is repeated 60k times, with an optimize
 run after every 2k indexed files. The index size is 50KB. I analyzed the heap
 dump and realized that the IndexWriter.synced field occupied more than half of
 the heap. That field is a private HashSet without a getter; its task is to
 hold files which have been synced already.
 There are two calls to addAll and one call to add on synced, but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer, synced contains 32618 entries which
 look like file names (_e065_1.del or _e067.cfs), while the index directory
 contains only 10 files.
 I guess synced is holding obsolete data.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847058#action_12847058
 ] 

Michael McCandless commented on LUCENE-2329:


bq. Actually, when I talked about the TermVectors I meant we should explore to 
store the termIDs on disk, rather than the strings. It would help things like 
similarity search and facet counting.

Ah, that would be great!

bq. Actually we wouldn't need a second hashtable for the secondary TermsHash 
anymore, right? It would just have like the primary TermsHash a parallel array 
with the things that the TermVectorsTermsWriter.Postinglist class currently 
contains (freq, lastOffset, lastPosition)? And the index into that array would 
be the termID of course.

Hmm the challenge is that the tracking done for term vectors is just within a 
single doc.  Ie the hash used for term vectors only holds the terms for that 
one doc (so it's much smaller), vs the primary hash that holds terms for all 
docs in the current RAM buffer.  So we'd be burning up much more RAM if we also 
key into the term vector's parallel arrays using the primary term id?

But I do think we should cutover to parallel arrays for TVTW, too.

bq. How does the read performance of packed ints compare to normal int[] 
arrays? I think nowadays RAM is less of an issue? And with a searchable RAM 
buffer we might want to sacrifice a bit more RAM for higher search performance?

It's definitely slower to read/write to/from packed ints, and I agree, indexing 
and searching speed trumps RAM efficiency.

bq. Oh man, will we need flexible indexing for the in-memory index too?

EG custom attrs appearing in the TokenStream?  Yes we will need to... but 
hopefully once we get serialization working cleanly for the attrs this'll be 
easy?  With ByteSliceWriter/Reader you just .writeBytes and .readBytes...

I don't think we should allow Codecs to be used in the RAM buffer anytime soon 
though... ;)



 Use parallel arrays instead of PostingList objects
 --

 Key: LUCENE-2329
 URL: https://issues.apache.org/jira/browse/LUCENE-2329
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1


 This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
 In order to avoid having very many long-living PostingList objects in 
 TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
 simply be an int[] which maps each term to a dense termID.
 All data that the PostingList classes currently hold will then be placed in 
 parallel arrays, where the termID is the index into the arrays.  This will 
 avoid the need for object pooling and will remove the overhead of object 
 initialization and garbage collection.  Garbage collection especially should 
 benefit significantly when the JVM runs low on memory, because in such a 
 situation the GC mark times can get very long if there is a big number of 
 long-living objects in memory.
 Another benefit could be to build more efficient TermVectors.  We could avoid 
 having to store the term string per document in the TermVector; 
 instead we could just store the segment-wide termIDs.  This would reduce the 
 size and also make it easier to implement efficient algorithms that use 
 TermVectors, because no term mapping across documents in a segment would be 
 necessary.  We can make that improvement in a separate JIRA issue, though.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak

2010-03-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847063#action_12847063
 ] 

Michael McCandless commented on LUCENE-2328:


Yes please clean as you go Earwin -- those sound great.

{quote}
bq. Must the Dir insist the file is closed in order to sync it?
Well, no, this can be relaxed.
Because default Directory clients - IW+IR will never call sync() on a file they 
didn't close yet.
Also this client behaviour is guaranteed with current implementation - if 
someone calls current sync() on an open file, it will fail on 'new 
RandomAccessFile'?
{quote}

I'd like to allow for this to work in the future, even if current FSDir impls 
cannot sync an open file.  EG conceivably they could reach in and get the RAF 
that IndexOutput has open and sync it.

So I think we just note this as a limitation of FSDir impls today, but, the API 
allows for it?

 IndexWriter.synced  field accumulates data leading to a Memory Leak
 ---

 Key: LUCENE-2328
 URL: https://issues.apache.org/jira/browse/LUCENE-2328
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
 Environment: all
Reporter: Gregor Kaczor
Priority: Minor
 Fix For: 3.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 I am running into a strange OutOfMemoryError. My small test application
 indexes and deletes a few files; this is repeated 60k times, with an optimize
 run after every 2k indexed files. The index size is 50KB. I analyzed the heap
 dump and realized that the IndexWriter.synced field occupied more than half of
 the heap. That field is a private HashSet without a getter; its task is to
 hold files which have been synced already.
 There are two calls to addAll and one call to add on synced, but no remove or
 clear throughout the lifecycle of the IndexWriter instance.
 According to the Eclipse Memory Analyzer, synced contains 32618 entries which
 look like file names (_e065_1.del or _e067.cfs), while the index directory
 contains only 10 files.
 I guess synced is holding obsolete data.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-18 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847068#action_12847068
 ] 

Michael Busch commented on LUCENE-2329:
---

bq. Hmm the challenge is that the tracking done for term vectors is just within 
a single doc.

Duh! Of course you're right.


 Use parallel arrays instead of PostingList objects
 --

 Key: LUCENE-2329
 URL: https://issues.apache.org/jira/browse/LUCENE-2329
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1


 This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
 In order to avoid having very many long-living PostingList objects in 
 TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
 simply be an int[] which maps each term to a dense termID.
 All data that the PostingList classes currently hold will then be placed in 
 parallel arrays, where the termID is the index into the arrays.  This will 
 avoid the need for object pooling and will remove the overhead of object 
 initialization and garbage collection.  Garbage collection especially should 
 benefit significantly when the JVM runs low on memory, because in such a 
 situation the GC mark times can get very long if there is a big number of 
 long-living objects in memory.
 Another benefit could be to build more efficient TermVectors.  We could avoid 
 having to store the term string per document in the TermVector; 
 instead we could just store the segment-wide termIDs.  This would reduce the 
 size and also make it easier to implement efficient algorithms that use 
 TermVectors, because no term mapping across documents in a segment would be 
 necessary.  We can make that improvement in a separate JIRA issue, though.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2331) Add NoOpMergePolicy

2010-03-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847086#action_12847086
 ] 

Shai Erera commented on LUCENE-2331:


In the process, I'll also add a NoMergeScheduler with empty implementations of 
MergeScheduler's methods. That's somewhat redundant if one uses NoMP; however, 
it's nice to have for symmetry, and it avoids running unnecessary code (like 
CMS and its threads) just to discover the MP returned nothing.
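
A rough sketch of the no-op idea (the interface below is a made-up stand-in for 
illustration only; the real MergePolicy abstract class has different, richer 
signatures):

{code}
// Illustration of a no-op merge policy exposed as a singleton.
// "MergePolicyLike" is a hypothetical stand-in, not the real Lucene API.
interface MergePolicyLike {
  /** Return the merges to run now, or null if there are none. */
  Object findMerges();
  void close();
}

final class NoOpMergePolicySketch implements MergePolicyLike {
  public static final NoOpMergePolicySketch INSTANCE = new NoOpMergePolicySketch();

  private NoOpMergePolicySketch() {}    // singleton: no public construction

  public Object findMerges() {
    return null;                        // never selects any merges
  }

  public void close() {}                // nothing to release
}
{code}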

 Add NoOpMergePolicy
 ---

 Key: LUCENE-2331
 URL: https://issues.apache.org/jira/browse/LUCENE-2331
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1


 I'd like to add a simple and useful MP implementation which does nothing! :) 
 I've come across many places where the following is either documented or 
 implemented: if you want to prevent merges, set mergeFactor to a high 
 enough value. I think a NoOpMergePolicy is just as good, and can REALLY 
 allow you to disable merges (short of maybe setting mergeFactor to 
 Integer.MAX_VALUE).
 As such, NoOpMergePolicy will be introduced as a singleton, and can be used 
 for convenience purposes only. Also, it's important for Parallel Index, 
 because I'd like the slices to never do any merges unless ParallelWriter 
 decides so; they should therefore be set w/ that MP.
 I have a patch ready. I'm waiting for LUCENE-2320 to go in, so that I don't 
 need to change it afterwards.
 About the name - I like it, but suggestions are welcome. I thought of 
 NullMergePolicy, but I don't like 'Null' used for a NoOp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Contrib tests fail if core jar is not up to date

2010-03-18 Thread Shai Erera
Hi

I've noticed that sometimes, after I run test-core and test-contrib, and
then change core code, test-contrib fail on NoSuchMethodError and stuff like
that. I've noticed that core.jar exists under build, and I assumed it's used
by test-contrib, and probably is not recreated after core code has changed.

I verified it when looking in contrib-build.xml, which defines a property
lucene.jar.present which is set to true if the jar is ... well, present.
Which I believe is the reason for these failures. I've been thinking how to
resolve that, and I can think of two ways:

(1) have test-core always delete that file, but that has two issues:
(1.1) It's redundant if the code hasn't changed.
(1.2) It forces you to either jar-core or test-core before you test-contrib,
if you want to make sure you run w/ the latest jar.

or

(2) have test-contrib always call jar-core, which will first delete the file
and then re-create it by compiling first. Compiling should not do anything
if the code hasn't changed. So the only waste would be to create the .jar,
but I think that's quite fast?

Does anyone, with more Ant skills than me, know of a better way to detect
from test-contrib that core code has changed and only then rebuild the jar?

Shai


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-18 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847123#action_12847123
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

For the skip list, we could reuse what we have (ie,
DefaultSkipListReader), though we'd need to choose a default
number of docs per skip entry, pulled out of thin air, as there's
no way to guesstimate it per term beforehand. Or we can have a
single-level skip list (more like an index) and binary search it
to find the entry we're looking for (assuming we store an int
array instead of vints); a rough sketch of that variant is below.
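
As a rough illustration only (names and layout made up, not Lucene code): a 
single-level skip index over doc IDs that is binary-searched to find where to 
start scanning the postings:

{code}
import java.util.Arrays;

// Every Nth docID in a term's postings is recorded in skipDocs, and the
// position of that doc's entry in the postings is recorded in skipPointers.
class SingleLevelSkipSketch {
  /** Returns the postings position to start scanning from for targetDoc. */
  static int findStart(int[] skipDocs, int[] skipPointers, int targetDoc) {
    int idx = Arrays.binarySearch(skipDocs, targetDoc);
    if (idx < 0) {
      idx = -idx - 2;                        // last skip entry with doc <= targetDoc
    }
    return idx < 0 ? 0 : skipPointers[idx];  // no skip applies: start of postings
  }

  public static void main(String[] args) {
    int[] skipDocs = {16, 32, 48, 64};
    int[] skipPointers = {100, 210, 330, 455};
    System.out.println(findStart(skipDocs, skipPointers, 40)); // prints 210
  }
}
{code}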

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 3.0.1
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: 3.1


 In order to offer users near-realtime search without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable. 
 Today's Lucene-based NRT systems must incur the cost of merging
 segments, which can slow indexing. 
 Michael Busch has good suggestions regarding how to handle deletes using max 
 doc ids.  
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here: 
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Contrib tests fail if core jar is not up to date

2010-03-18 Thread Robert Muir
On Thu, Mar 18, 2010 at 5:33 PM, Shai Erera ser...@gmail.com wrote:
 Hi

 I've noticed that sometimes, after I run test-core and test-contrib, and
 then change core code, test-contrib fail on NoSuchMethodError and stuff like
 that. I've noticed that core.jar exists under build, and I assumed it's used
 by test-contrib, and probably is not recreated after core code has changed.

 I verified it when looking in contrib-build.xml, which defines a property
 lucene.jar.present which is set to true if the jar is ... well, present.
 Which I believe is the reason for these failures. I've been thinking how to
 resolve that, and I can think of two ways:

 (1) have test-core always delete that file, but that has two issues:
 (1.1) It's redundant if the code hasn't changed.
 (1.2) It forces you to either jar-core or test-core before you test-contrib,
 if you want to make sure you run w/ the latest jar.

 or

 (2) have test-contrib always call jar-core, which will first delete the file
 and then re-create it by compiling first. Compiling should not do anything
 if the code hasn't changed. So the only waste would be to create the .jar,
 but I think that's quite fast?

 Does anyone, with more Ant skills than me, know of a better way to detect
 from test-contrib that core code has changed and only then rebuild the jar?

 Shai


In addition to what Shai mentioned, I wanted to say that there are
other oddities about how the contrib tests run in ant. For example,
I'm not sure why we create the junitfailed.flag files (I think it has
something to do with detecting at the top level that a single contrib
failed).

I noticed this when working on
https://issues.apache.org/jira/browse/LUCENE-1709, as I guess we
should really fix it before doing that issue.

-- 
Robert Muir
rcm...@gmail.com

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Contrib tests fail if core jar is not up to date

2010-03-18 Thread Uwe Schindler
Hi Shai,

 

there is no way to do this (detect code changes) with ant; the ant script 
*always* builds the jar file. In this case, test-contrib is just missing the 
dependency on jar-core. Alternatively, test-contrib should not use the jar 
file at all and simply add build/classes/java to the classpath.

 

The fix is simple, can do that tomorrow.

 

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de/ 

eMail: u...@thetaphi.de

 

From: Shai Erera [mailto:ser...@gmail.com] 
Sent: Thursday, March 18, 2010 10:34 PM
To: java-dev@lucene.apache.org
Subject: Contrib tests fail if core jar is not up to date

 

Hi

I've noticed that sometimes, after I run test-core and test-contrib, and then 
change core code, test-contrib fail on NoSuchMethodError and stuff like that. 
I've noticed that core.jar exists under build, and I assumed it's used by 
test-contrib, and probably is not recreated after core code has changed.

I verified it when looking in contrib-build.xml, which defines a property 
lucene.jar.present which is set to true if the jar is ... well, present. Which 
I believe is the reason for these failures. I've been thinking how to resolve 
that, and I can think of two ways:

(1) have test-core always delete that file, but that has two issues:
(1.1) It's redundant if the code hasn't changed.
(1.2) It forces you to either jar-core or test-core before you test-contrib, if 
you want to make sure you run w/ the latest jar.

or

(2) have test-contrib always call jar-core, which will first delete the file 
and then re-create it by compiling first. Compiling should not do anything if 
the code hasn't changed. So the only waste would be to create the .jar, but I 
think that's quite fast?

Does anyone, with more Ant skills than me, know of a better way to detect from 
test-contrib that core code has changed and only then rebuild the jar?

Shai



Re: Contrib tests fail if core jar is not up to date

2010-03-18 Thread Chris Hostetter

: In addition to what Shai mentioned, I wanted to say that there are
: other oddities about how the contrib tests run in ant. For example,
: I'm not sure why we create the junitfailed.flag files (I think it has
: something to do with detecting top-level that a single contrib
: failed).

Correct ... even if one contrib fails, test-contrib attempts to run the 
tests for all the other contribs, and then fails if any junitfailed.flag 
files are found in any contribs.

The assumption was that if you were specifically testing a single contrib you'd 
be using the contrib-specific build from its own directory, and it would 
still fail fast -- it's only if you run test-contrib from the top level 
that it ignores when ant test fails for individual contribs, and then 
reports the failure at the end.

It's a hack, but it's a useful hack for getting nightly builds that can 
report on the tests for all contribs, even if the first one fails (it's 
less useful when one contrib depends on another, but that's a more complex 
issue)

-Hoss


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Contrib tests fail if core jar is not up to date

2010-03-18 Thread Robert Muir
On Thu, Mar 18, 2010 at 5:50 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 It's a hack, but it's a useful hack for getting nightly builds that can
 report on the tests for all contribs, even if the first one fails (it's
 less useful when one contrib depends on another, but that's a more complex
 issue)

 -Hoss


Hoss, thanks, that makes sense.



-- 
Robert Muir
rcm...@gmail.com

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-18 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847140#action_12847140
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

I'm hitting a weird error where after executing a ram buf term docs iteration, 
adding some docs, then closing the DWs and the writer, there's an exception 
which indicates some unknown (to me) state was modified because of the term 
docs iteration.  Or maybe it's obvious? :)

{code}
org.apache.lucene.index.CorruptIndexException: docs out of order (-2147483648 
<= 2147483647 )
at 
org.apache.lucene.index.FormatPostingsDocsWriter.addDoc(FormatPostingsDocsWriter.java:76)
at 
org.apache.lucene.index.FreqProxTermsWriter.appendPostings(FreqProxTermsWriter.java:209)
at 
org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:127)
at org.apache.lucene.index.TermsHash.flush(TermsHash.java:145)
at org.apache.lucene.index.DocInverter.flush(DocInverter.java:72)
at 
org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:64)
at 
org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:1185)
at 
org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:3824)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3733)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3712)
at 
org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1807)
{code}

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 3.0.1
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: 3.1


 In order to offer users near-realtime search without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable. 
 Today's Lucene-based NRT systems must incur the cost of merging
 segments, which can slow indexing. 
 Michael Busch has good suggestions regarding how to handle deletes using max 
 doc ids.  
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here: 
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2127) Improved large result handling

2010-03-18 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847146#action_12847146
 ] 

Jason Rutherglen commented on LUCENE-2127:
--

What's the status of this one?  I'm quasi-interested in getting it into Solr.

 Improved large result handling
 --

 Key: LUCENE-2127
 URL: https://issues.apache.org/jira/browse/LUCENE-2127
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: LUCENE-2127.patch, LUCENE-2127.patch


 Per 
 http://search.lucidimagination.com/search/document/350c54fc90d257ed/lots_of_results#fbb84bd297d15dd5,
  it would be nice to offer some other Collectors that are better at handling 
 really large numbers of results.  This could be implemented in a variety of 
 ways via Collectors.  For instance, we could have a raw collector that does 
 no sorting and just returns the ScoreDocs, or we could do as Mike suggests 
 and have Collectors that have heuristics about memory tradeoffs and only 
 heapify when appropriate.
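
As a rough illustration of the "raw collector" idea above -- collect matching 
doc IDs with no scoring, sorting, or heap -- here is a sketch; the callback 
names follow the 2.9/3.x Collector API, but treat the exact signatures as 
assumptions:

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Sketch only: records every matching global doc ID, nothing else.
public class RawDocIdCollector extends Collector {
  private final List<Integer> docIds = new ArrayList<Integer>();
  private int docBase;

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    // scores are ignored entirely
  }

  @Override
  public void collect(int doc) throws IOException {
    docIds.add(docBase + doc);          // convert to a global doc ID
  }

  @Override
  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    this.docBase = docBase;
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;                        // order doesn't matter, we keep them all
  }

  public List<Integer> getDocIds() {
    return docIds;
  }
}
{code}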

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-18 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2312:
-

Comment: was deleted

(was: I'm hitting a weird error where after executing a ram buf term docs 
iteration, adding some docs, then closing the DWs and the writer, there's 
an exception which indicates some unknown (to me) state was modified because of 
the term docs iteration.  Or maybe it's obvious? :)

{code}
org.apache.lucene.index.CorruptIndexException: docs out of order (-2147483648 
<= 2147483647 )
at 
org.apache.lucene.index.FormatPostingsDocsWriter.addDoc(FormatPostingsDocsWriter.java:76)
at 
org.apache.lucene.index.FreqProxTermsWriter.appendPostings(FreqProxTermsWriter.java:209)
at 
org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:127)
at org.apache.lucene.index.TermsHash.flush(TermsHash.java:145)
at org.apache.lucene.index.DocInverter.flush(DocInverter.java:72)
at 
org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:64)
at 
org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:1185)
at 
org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:3824)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3733)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3712)
at 
org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1807)
{code})

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 3.0.1
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: 3.1


 In order to offer users near-realtime search without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable. 
 Today's Lucene-based NRT systems must incur the cost of merging
 segments, which can slow indexing. 
 Michael Busch has good suggestions regarding how to handle deletes using max 
 doc ids.  
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here: 
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2323) reorganize contrib modules

2010-03-18 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned LUCENE-2323:
---

Assignee: Robert Muir

 reorganize contrib modules
 --

 Key: LUCENE-2323
 URL: https://issues.apache.org/jira/browse/LUCENE-2323
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
 Attachments: LUCENE-2323.patch


 it would be nice to reorganize contrib modules, so that they are bundled 
 together by functionality.
 For example:
 * the wikipedia contrib is a tokenizer; I think it really belongs in 
 contrib/analyzers
 * there are two highlighters; I think they could be one highlighters package.
 * there are many queryparsers and queries in different places in contrib

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Contrib tests fail if core jar is not up to date

2010-03-18 Thread Shai Erera
Uwe,

(1) the problem is not the missing dependency, but rather the use of
lucene.jar.present. So you'll need to remove it as well.
(2) Adding build/classes/java is not enough - you'll need to add a target
dependency on compile-core or something.

I guess you already know that. Just pointing it out :).

Thanks for taking care of this,

Shai

On Thu, Mar 18, 2010 at 11:51 PM, Robert Muir rcm...@gmail.com wrote:

 On Thu, Mar 18, 2010 at 5:50 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:

 
  It's a hack, but it's a useful hack for getting nightly builds that can
  report on the tests for all contribs, even if the first one fails (it's
  less useful when one contrib depends on another, but that's a more
 complex
  issue)
 
  -Hoss
 

 Hoss, thanks, that makes sense.



 --
 Robert Muir
 rcm...@gmail.com

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org