[jira] Commented: (LUCENE-600) ParallelWriter companion to ParallelReader

2009-08-31 Thread Chuck Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749450#action_12749450
 ] 

Chuck Williams commented on LUCENE-600:
---

I contributed the first patch to make flush-by-size possible; see LUCENE-709.  
There is no incompatibility with ParallelWriter, even with the early version 
contributed here 3 years ago.  We have been doing efficient updates of selected 
mutable fields for a long time now and have filed for a patent on the method; see 
published patent application 20090193406.


 ParallelWriter companion to ParallelReader
 --

 Key: LUCENE-600
 URL: https://issues.apache.org/jira/browse/LUCENE-600
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.1
Reporter: Chuck Williams
Priority: Minor
 Attachments: ParallelWriter.patch


 A new class ParallelWriter is provided that serves as a companion to 
 ParallelReader.  ParallelWriter meets all of the doc-id synchronization 
 requirements of ParallelReader, subject to:
 1.  ParallelWriter.addDocument() is synchronized, which might have an 
 adverse effect on performance.  The writes to the sub-indexes are, however, 
 done in parallel.
 2.  The application must ensure that the ParallelReader is never reopened 
 inside ParallelWriter.addDocument(), else it might find the sub-indexes out 
 of sync.
 3.  The application must deal with recovery from 
 ParallelWriter.addDocument() exceptions.  Recovery must restore the 
 synchronization of doc-ids, e.g. by deleting any trailing document(s) in one 
 sub-index that were not successfully added to all sub-indexes, and then 
 optimizing all sub-indexes.
 A new interface, Writable, is provided to abstract IndexWriter and 
 ParallelWriter.  This is in the same spirit as the existing Searchable and 
 Fieldable classes.
 This implementation uses java 1.5.  The patch applies against today's svn 
 head.  All tests pass, including the new TestParallelWriter.
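
To make the recovery in point (3) concrete, here is a rough sketch under the stated requirements; subIndexDirs and analyzer are hypothetical application objects, and this is not the attached ParallelWriter code:

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    // Hedged sketch: trim any sub-index that got ahead of the others after a
    // failed addDocument(), then optimize so doc-ids realign.
    static void resyncSubIndexes(Directory[] subIndexDirs, Analyzer analyzer) throws IOException {
      int minDocs = Integer.MAX_VALUE;
      for (int i = 0; i < subIndexDirs.length; i++) {
        IndexReader r = IndexReader.open(subIndexDirs[i]);
        minDocs = Math.min(minDocs, r.maxDoc());
        r.close();
      }
      for (int i = 0; i < subIndexDirs.length; i++) {
        IndexReader r = IndexReader.open(subIndexDirs[i]);
        for (int id = r.maxDoc() - 1; id >= minDocs; id--) {
          r.deleteDocument(id);        // remove trailing doc(s) not added to every sub-index
        }
        r.close();
        IndexWriter w = new IndexWriter(subIndexDirs[i], analyzer, false);
        w.optimize();                  // purge the deletions so maxDoc matches again
        w.close();
      }
    }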

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-600) ParallelWriter companion to ParallelReader

2009-08-31 Thread Chuck Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749656#action_12749656
 ] 

Chuck Williams commented on LUCENE-600:
---

The version attached here is from over 3 years ago.  Our version has evolved 
along with Lucene, and the whole apparatus is fully functional with the latest 
Lucene.

The fields in each subindex are disjoint.  A logical Document is the collection 
of all fields from the real Document with the same doc-id in each subindex 
(i.e., the model Doug started with ParallelReader).  There is no issue with 
deletion by query or term, as it deletes the whole logical Document.  Field 
updates in our scheme don't use deletion.
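
For illustration, a minimal sketch of that model; the writers, field names and values are hypothetical, and this is not the attached ParallelWriter itself:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // One logical document split into disjoint field sets, added at the same
    // position in each sub-index so the doc-ids stay aligned.
    static void addLogicalDocument(IndexWriter mainWriter, IndexWriter metaWriter,
                                   String bodyText, String tags) throws IOException {
      Document main = new Document();
      main.add(new Field("body", bodyText, Field.Store.NO, Field.Index.TOKENIZED));

      Document meta = new Document();
      meta.add(new Field("tags", tags, Field.Store.YES, Field.Index.UN_TOKENIZED));

      mainWriter.addDocument(main);   // size-dominant sub-index (document body)
      metaWriter.addDocument(meta);   // small, frequently updated sub-index
      // A ParallelReader over the two sub-indexes then presents row N of each
      // as a single logical Document.
    }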

Merge-by-size is only an issue if you allow it to be decided independently in 
each subindex.  In practice that is not very important since one subindex is 
size-dominant (the one containing the document body field).  One can 
merge-by-size that subindex and force the others to merge consistently.

The only reason for the corresponding-segment constraint is that deletion 
changes doc-id's by purging deleted documents.  I know some Lucene apps address 
this by never purging deleted documents, which is ok in some domains where 
deletion is rare.  I think there are other ways to resolve it as well.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-600) ParallelWriter companion to ParallelReader

2009-08-31 Thread Chuck Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749660#action_12749660
 ] 

Chuck Williams commented on LUCENE-600:
---

Erratum: for "deletion changes doc-id's by purging deleted documents", read 
"*merging* changes doc-id's by purging deleted documents".



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1052) Add an termInfosIndexDivisor to IndexReader

2007-11-20 Thread Chuck Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544055
 ] 

Chuck Williams commented on LUCENE-1052:


I agree a general configuration system would be much better.  Doug, we use a 
method similar to the one you described in our application.

TermInfosConfigurer is slightly different though since the desired config is a 
method that implements a formula, rather than just a value.  This could still 
be done more generally by allowing methods as well as properties or setters on 
a higher level configuration object.

I didn't want to take on the broader issue just for this feature.

Michael, I agree with both of your points.

I'd be happy to clean up this patch if you guys provide some guidance for what 
would make it acceptable to commit.


 Add an termInfosIndexDivisor to IndexReader
 -

 Key: LUCENE-1052
 URL: https://issues.apache.org/jira/browse/LUCENE-1052
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.2
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.3

 Attachments: LUCENE-1052.patch, termInfosConfigurer.patch


 The termIndexInterval, set at indexing time, lets you trade off
 how much RAM is used by a reader to load the indexed terms vs. the cost of
 seeking to the specific term you want to load.
 But the downside is you must set it at indexing time.
 This issue adds an indexDivisor to TermInfosReader so that on opening
 a reader you could further sub-sample the termIndexInterval to use
 less RAM.  E.g., a setting of 2 means only every 2 * termIndexInterval-th term is
 loaded into RAM.
 This is particularly useful if your index has a great many terms (e.g.,
 you accidentally indexed binary terms).
 Spinoff from this thread:
   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1052) Add an termInfosIndexDivisor to IndexReader

2007-11-20 Thread Chuck Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544136
 ] 

Chuck Williams commented on LUCENE-1052:


I can report that in our application having a formula is critical.  We have no 
control over the content our users index, nor in fact do they.  These are 
arbitrary documents.  We find a surprising number of them contain embedded 
encoded binary data.  When those are indexed, Lucene's memory consumption 
skyrockets, either bringing the whole app down with an OOM or slowing 
performance to a crawl due to excessive GC's reclaiming a tiny remaining 
working memory space.

Our users won't accept a solution like "wait until the problem occurs and then 
increment your termIndexDivisor."  They expect our app to manage this 
automatically.

I agree that making TermInfosReader, SegmentReader, etc. public classes is not 
a great solution.  The current patch does not do that.  It simply adds a 
configurable class that can be used to provide formula parameters as opposed to 
just value parameters.  At least for us, this special case is sufficiently 
important to outweigh any considerations of the complexity of an additional 
class.

A single configuration class could be used at the IndexReader level that 
provides for both static and dynamically-varying properties through getters, 
some of which take parameters.

Here is another possible solution.  My current thought is that the bound should 
always be a multiple of sqrt(numDocs).  E.g., see Heap's Law here:  
http://nlp.stanford.edu/IR-book/html/htmledition/heaps-law-estimating-the-number-of-terms-1.html

I'm currently using this formula in my TermInfosConfigurer:

int bound = (int) (1 + TERM_BOUNDING_MULTIPLIER * Math.sqrt(1 + segmentNumDocs) / TERM_INDEX_INTERVAL);

This has Heap's law as its foundation.  I provide TERM_BOUNDING_MULTIPLIER as the 
config parameter, with 0 meaning the bound is disabled.  I also provide a 
TERM_INDEX_DIVISOR_OVERRIDE that overrides the dynamic bounding with a manually 
specified constant amount.
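
Put together, a hedged sketch of how such a configurer method might look; the signature is approximated from the patch description, and the constants are hypothetical application config values:

    // Approximate sketch only -- not the attached patch.
    public class BoundedTermsConfig {
      static final double TERM_BOUNDING_MULTIPLIER = 10.0; // hypothetical; 0 disables the bound
      static final int TERM_INDEX_DIVISOR_OVERRIDE = 0;    // hypothetical; > 0 forces a fixed divisor
      static final int TERM_INDEX_INTERVAL = 128;          // IndexWriter default

      public int getMaxTermsCached(String segmentName, int segmentNumDocs, long segmentTotalTerms) {
        if (TERM_INDEX_DIVISOR_OVERRIDE > 0) {
          // One reading of the override: behave like a fixed index divisor.
          return (int) (segmentTotalTerms / (TERM_INDEX_INTERVAL * (long) TERM_INDEX_DIVISOR_OVERRIDE));
        }
        if (TERM_BOUNDING_MULTIPLIER == 0) {
          return Integer.MAX_VALUE;  // 0 means "don't bound"
        }
        // Heap's law: vocabulary grows roughly with sqrt(numDocs).
        return (int) (1 + TERM_BOUNDING_MULTIPLIER * Math.sqrt(1 + segmentNumDocs) / TERM_INDEX_INTERVAL);
      }
    }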

If that approach would be acceptable to Lucene in general, then we just need 
two static parameters.  However, I don't have enough experience with how well 
this formula works in our user base yet to know whether or not we'll tune it 
further.





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1052) Add an termInfosIndexDivisor to IndexReader

2007-11-19 Thread Chuck Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chuck Williams updated LUCENE-1052:
---

Attachment: termInfosConfigurer.patch

termInfosConfigurer.patch extends the termInfosIndexDivisor mechanism to allow 
dynamic management of this parameter.  A new interface, 
TermInfosConfigurer, allows specification of a method, getMaxTermsCached(), 
that bounds the size of the in-memory term infos as a function of the segment 
name, segment numDocs, and total segment terms.  This bound is then used to 
automatically set termInfosIndexDivisor whenever a TermInfosReader reads the 
term index.  This mechanism provides a simple way to ensure that the total 
amount of memory consumed by the term cache is bounded by, say, O(log(numDocs)).
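
As a rough approximation of the shape this interface takes (the real definition is in the attached patch; the signature here is assumed from the description above):

    // Approximation only -- see termInfosConfigurer.patch for the real definition.
    public interface TermInfosConfigurer {
      /** Bound on the number of term-index entries held in RAM for one segment. */
      int getMaxTermsCached(String segmentName, int segmentNumDocs, long segmentTotalTerms);
    }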

All Lucene core tests pass.  I'm using another version of this same patch in 
Lucene 2.1+ in an application that has indexes with binary term pollution, 
using the TermInfosConfigurer to dynamically bound the term cache in the 
polluted segments.

I tried to test contrib, but it appears gdata-server needs external libraries, 
which I don't have, in order to compile.

Michael, this patch applies cleanly to today's Lucene trunk.  I'd appreciate it 
if you could verify one thing.  Lucene 2.3 has the incremental reopen mechanism 
(can't wait to get that!), new since Lucene 2.1.  It appears that reopen of a 
segment reuses the same TermInfosReader and thus does not need to configure a 
new one.  I've implemented that part of the patch with this assumption.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1052) Add an termInfosIndexDivisor to IndexReader

2007-11-18 Thread Chuck Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543383
 ] 

Chuck Williams commented on LUCENE-1052:


I believe this needs to be a formula, as a reasonable bound on the number of 
terms is in general a function of the number of documents in the segment and 
the nature of the index (e.g., the types of fields).  A common thing to do would be 
to enforce that RAM usage for cached terms grows no faster than logarithmically 
in the number of documents.  The specific formula that is appropriate will 
depend on the index, i.e. on the application.  It might be of the form 
c*ln(numdocs+k), where c and k are constants dependent on the index.

One consequence of this approach, or any approach along these lines, is that 
the indexDivisor will vary across the segments, both in a single index and 
across indexes.  It seems to me from the code that this should work fine.

This leaves the issue of how to best specify an arbitrary formula.  This 
requires a method to compute the max cached terms allowed for a segment based 
on the number of docs in the segment, the number of terms in the segment's 
index, and possibly other factors.  The most direct way to do this is to 
introduce an interface, e.g. TermInfosConfigurer, to define the method 
signature, and to add setTermInfosConfigurer as an alternative to 
setTermInfosIndexDivisor.  It would need to be in all the same places.

A more general approach would be to introduce an IndexConfigurer class which 
over time could hold additional methods like this.  It could even replace the 
current setters on IndexReader (as well as IndexWriter, etc.) with a more 
general mechanism that would allow dynamic parameters used to configure any 
classes in the index structure.  Each constructor would be passed the 
IndexConfigurer and call getters or other methods on it to obtain its config.  
The methods could provide constant values or dynamic formulas.

I'm going to implement the straightforward solution at the moment in our older 
version of Lucene, then will sync up to whatever you guys decide is best for 
the trunk later.
 

 Add an termInfosIndexDivisor to IndexReader
 -

 Key: LUCENE-1052
 URL: https://issues.apache.org/jira/browse/LUCENE-1052
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.2
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.3

 Attachments: LUCENE-1052.patch


 The termIndexInterval, set at indexing time, lets you trade off
 how much RAM is used by a reader to load the indexed terms vs. the cost of
 seeking to the specific term you want to load.
 But the downside is you must set it at indexing time.
 This issue adds an indexDivisor to TermInfosReader so that on opening
 a reader you could further sub-sample the termIndexInterval to use
 less RAM.  E.g., a setting of 2 means only every 2 * termIndexInterval-th term is
 loaded into RAM.
 This is particularly useful if your index has a great many terms (e.g.,
 you accidentally indexed binary terms).
 Spinoff from this thread:
   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1052) Add an termInfosIndexDivisor to IndexReader

2007-11-17 Thread Chuck Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543306
 ] 

Chuck Williams commented on LUCENE-1052:


Michael, thanks for creating an excellent production version of this idea and 
committing it!

I'd like to take it one step further to eliminate the need to call 
IndexReader.setTermInfosIndexDivisor up front.  The idea is to instead specify 
a maximum number of index terms to cache in memory.  This could then allow 
TermInfosReader to set indexDivisor automatically to the smallest value that 
yields a cache size less than the maximum.

This seems a simple and extremely useful extension.  Unfortunately, I'm still 
on an older Lucene, but will post my update.  If you like this idea, you may 
want to just add the feature directly to your implementation in the trunk.
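
A minimal sketch of the divisor selection (hypothetical helper, not the committed code):

    // Pick the smallest divisor whose resulting cache size stays under the maximum.
    static int chooseIndexDivisor(long indexTermCount, long maxCachedTerms) {
      int divisor = 1;
      while (indexTermCount / divisor > maxCachedTerms) {
        divisor++;
      }
      return divisor;  // only every divisor-th indexed term is then kept in RAM
    }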


 Add an termInfosIndexDivisor to IndexReader
 -

 Key: LUCENE-1052
 URL: https://issues.apache.org/jira/browse/LUCENE-1052
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.2
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.3

 Attachments: LUCENE-1052.patch


 The termIndexInterval, set at indexing time, lets you trade off
 how much RAM is used by a reader to load the indexed terms vs. the cost of
 seeking to the specific term you want to load.
 But the downside is you must set it at indexing time.
 This issue adds an indexDivisor to TermInfosReader so that on opening
 a reader you could further sub-sample the termIndexInterval to use
 less RAM.  E.g., a setting of 2 means only every 2 * termIndexInterval-th term is
 loaded into RAM.
 This is particularly useful if your index has a great many terms (e.g.,
 you accidentally indexed binary terms).
 Spinoff from this thread:
   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Term pollution from binary data

2007-11-12 Thread Chuck Williams

Doug Cutting wrote on 11/07/2007 09:26 AM:
Hadoop's MapFile is similar to Lucene's term index, and supports a 
feature where only a subset of the index entries are loaded 
(determined by io.map.index.skip).  It would not be difficult to add 
such a feature to Lucene by changing TermInfosReader#ensureIndexIsRead().


Here's a (totally untested) patch.


Doug, thanks for this suggestion and your quick patch.

I fleshed this out in the version of Lucene we are using, a bit after 
2.1.  There was an off-by-1 bug plus a few missing pieces.  The attached 
patch is for 2.1+, but might be useful as it at least contains the 
corrections and missing elements.  It also contains extensions to the 
tests to exercise the patch.


I tried integrating this into 2.3, but enough has changed so that it was 
not straightforward (primarily for the test case extensions -- the 
implementation seems like it will apply with just a bit of manual merging).  
Unfortunately, I have so many local changes that it has become difficult 
to track the latest Lucene.  The task of syncing up will come soon.  
I'll post a proper patch against the trunk in jira at a future date if 
the issue is not already resolved before then.


Michael McCandless wrote on 11/08/2007 12:43 AM:

I'll open an issue and work through this patch.
  
Michael, I did not see the issue, or else I would have posted this there.  
Unfortunately, I'm pretty far behind on Lucene mail these days.

One thing is: I'd prefer to not use system property for this, since
it's so global, but I'm not sure how to better do it.
  


Agree strongly that this should not be global.  Whether via ctors or an 
index-specific properties object or whatever, it is important to be able 
to set this on some indexes and not others in a single application.


Thanks for picking this up!

Chuck

Index: src/test/org/apache/lucene/index/DocHelper.java
===
--- src/test/org/apache/lucene/index/DocHelper.java	(revision 2247)
+++ src/test/org/apache/lucene/index/DocHelper.java	(working copy)
@@ -254,10 +254,25 @@
*/ 
   public static void writeDoc(Directory dir, Analyzer analyzer, Similarity similarity, String segment, Document doc) throws IOException
   {
-DocumentWriter writer = new DocumentWriter(dir, analyzer, similarity, 50);
-writer.addDocument(segment, doc);
+writeDoc(dir, analyzer, similarity, segment, doc, IndexWriter.DEFAULT_TERM_INDEX_INTERVAL);
   }
 
+  /**
+   * Writes the document to the directory segment using the analyzer and the similarity score
+   * @param dir
+   * @param analyzer
+   * @param similarity
+   * @param segment
+   * @param doc
+   * @param termIndexInterval
+   * @throws IOException
+   */ 
+  public static void writeDoc(Directory dir, Analyzer analyzer, Similarity similarity, String segment, Document doc, int termIndexInterval) throws IOException
+  {
+DocumentWriter writer = new DocumentWriter(dir, analyzer, similarity, 50, termIndexInterval);
+writer.addDocument(segment, doc);
+  }
+  
   public static int numFields(Document doc) {
 return doc.getFields().size();
   }
Index: src/test/org/apache/lucene/index/TestSegmentTermDocs.java
===
--- src/test/org/apache/lucene/index/TestSegmentTermDocs.java	(revision 2247)
+++ src/test/org/apache/lucene/index/TestSegmentTermDocs.java	(working copy)
@@ -25,6 +25,7 @@
 import org.apache.lucene.document.Field;
 
 import java.io.IOException;
+import org.apache.lucene.search.Similarity;
 
 public class TestSegmentTermDocs extends TestCase {
   private Document testDoc = new Document();
@@ -212,6 +213,23 @@
 dir.close();
   }
   
+  public void testIndexDivisor() throws IOException {
+dir = new RAMDirectory();
+testDoc = new Document();
+DocHelper.setupDoc(testDoc);
+DocHelper.writeDoc(dir, new WhitespaceAnalyzer(), Similarity.getDefault(), "test", testDoc, 3);
+
+assertNull(System.getProperty("lucene.term.index.divisor"));
+System.setProperty("lucene.term.index.divisor", "2");
+try {
+  testTermDocs();
+  testBadSeek();
+  testSkipTo();
+} finally {
+  System.clearProperty("lucene.term.index.divisor");
+}
+  }
+  
   private void addDoc(IndexWriter writer, String value) throws IOException
   {
   Document doc = new Document();
Index: src/test/org/apache/lucene/index/TestSegmentReader.java
===
--- src/test/org/apache/lucene/index/TestSegmentReader.java	(revision 2247)
+++ src/test/org/apache/lucene/index/TestSegmentReader.java	(working copy)
@@ -23,10 +23,12 @@
 import java.util.List;
 
 import junit.framework.TestCase;
+import org.apache.lucene.analysis.WhitespaceAnalyzer;
 
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Fieldable;
 import org.apache.lucene.search.DefaultSimilarity;
+import org.apache.lucene.search.Similarity;
 import 

Term pollution from binary data

2007-11-06 Thread Chuck Williams

Hi All,

We are experiencing OOM's when binary data contained in text files 
(e.g., a base64 section of a text file) is indexed.  We have extensive 
recognition of file types but have encountered binary sections inside of 
otherwise normal text files.


We are using the default value of 128 for termIndexInterval.  The 
problem arises because binary data generates a large set of random 
tokens, leading to totalTerms/termIndexInterval terms stored in memory.  
Increasing the -Xmx is not viable as it is already maxed.
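
As a rough illustration of the arithmetic (the numbers are made up, not measured):

    long totalTerms = 50000000L;     // hypothetical unique-term count after base64 noise is indexed
    int termIndexInterval = 128;     // Lucene default
    long cachedIndexTerms = totalTerms / termIndexInterval;  // ~390,000 TermInfo entries per segment in RAM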


Does anybody know of a better solution to this problem than writing some 
kind of binary section recognizer/filter?


It appears that termIndexInterval is factored into the stored index and 
thus cannot be changed dynamically to work around the problem after an 
index has become polluted.  Other than identifying the documents 
containing binary data, deleting them, and then optimizing the whole 
index, has anybody found a better way to recover from this problem?


Thanks for any insights or suggestions,

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1037) Corrupt index: term out of order after forced stop during indexing

2007-10-28 Thread Chuck Williams (JIRA)
Corrupt index:  term out of order after forced stop during indexing
---

 Key: LUCENE-1037
 URL: https://issues.apache.org/jira/browse/LUCENE-1037
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.0.1
 Environment: Windows Server 2003
Reporter: Chuck Williams


In testing a reboot during active indexing, upon restart this exception 
occurred:

Caused by: java.io.IOException: term out of order 
(ancestorForwarders:.compareTo(descendantMoneyAmounts:$0.351) <= 0)
at org.apache.lucene.index.TermInfosWriter.add(TermInfosWriter.java:96)
at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:322)
at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:289)
at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:253)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1398)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:835)
at ...   (application code)

The "ancestorForwarders:" term has no text.  The application never creates such 
a term.  It seems the reboot occurred while this term was being written, but 
such a segment should not be linked into the index and so should not be visible 
after restart.

The application uses parallel subindexes accessed with ParallelReader.  This 
reboot caught the system in a state where the indexes were out of sync, i.e. a 
new document had parts indexed in one subindex but not yet indexed in another.  
The application detects this condition upon restart, uses 
IndexReader.deleteDocument() to delete the parts that were indexed from those 
subindexes, and then does optimize() on all the subindexes to bring the 
docid's back into sync.  The optimize() failed, presumably on a subindex that 
was being written at the time of the reboot.  This subindex would not have 
completed its document part and so no deleteDocument() would have been 
performed on it prior to the optimize().

The version of Lucene here is from January 2007.  I see one other reference to 
this exception in LUCENE-848.  There is a note there that the exception is 
likely a core problem, but I don't see any follow up to track it down.

Any ideas how this could happen?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene 2.1, soon

2007-01-18 Thread Chuck Williams
How about a direct solution with a reference count scheme?

Segments files could be reference-counted, as could individual
segments, either directly (possibly by interning SegmentInfo instances)
or indirectly by reference-counting all files via Directory.

The most recent checkpoint and snapshot would have an implicit reference
since they can be opened.  Each reader and writer creates a reference
when it opens a segments file.

This way segments files and each segment's files would be deleted
precisely when they are no longer used, which would both support NFS and
improve performance on Windows.
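
A bare-bones sketch of the counting itself (hypothetical class, not Lucene's IndexFileDeleter):

    import java.util.HashMap;
    import java.util.Map;

    class FileRefCounts {
      private final Map<String, Integer> counts = new HashMap<String, Integer>();

      synchronized void incRef(String fileName) {
        Integer c = counts.get(fileName);
        counts.put(fileName, c == null ? 1 : c + 1);   // opened by a reader, writer, or commit point
      }

      /** Returns true when the last reference is released and the file may be deleted.
          Assumes a matching incRef was done for this file name. */
      synchronized boolean decRef(String fileName) {
        int c = counts.get(fileName) - 1;
        if (c == 0) {
          counts.remove(fileName);
          return true;
        }
        counts.put(fileName, c);
        return false;
      }
    }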

Chuck


Marvin Humphrey wrote on 01/18/2007 11:40 AM:

 I wrote:
 I'd be cool with making it impossible to put an index on an NFS
 volume prior to version 4.

 Elaborating and clarifying...

 IndexReader attempts to establish a read lock on the relevant
 segments_N file.  It doesn't bother to see whether the locking attempt
 succeeds, though.

 IndexFileDeleter, before deleting any files, always touches a test
 file, attempts to lock it, and verifies that the lock succeeds.  If
 the locking test fails, it throws an exception rather than proceed.

 In addition, the locking test is run at index creation time, so that
 the user knows as soon as possible that their index is in a
 problematic location.

 I think the only way this would fail under NFS is if the client
 machine with the reader is using NFS version 3, while the machine with
 the writer is using version 4.  But before this issue arose I didn't
 have that much experience with the intricacies of NFS, so I could be
 off-base.

 This does bring back the permissions issue with IndexReader.  A search
 app may not have permission to establish a read lock on a file within
 the index directory, and in that case, an IndexFileDeleter could
 delete files out from under it.

 Marvin Humphrey
 Rectangular Research
 http://www.rectangular.com/



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: adding explicit commits to Lucene?

2007-01-17 Thread Chuck Williams
I don't see how to do commits without at least some new methods.

There needs to be some way to roll back changes rather than committing
them.  If the commit action is IndexWriter.close() (even if just an
interface) the user still needs another method to roll back.

There are reasons to close an IndexWriter other than committing changes,
such as to flush all the ram segments to disk to free memory or save
state.  We now have IndexWriter.flushRamSegments() for this case, but
are there others?

As was already pointed out, to delete documents you have to find them, 
which may require a reader accessing the current snapshot rather than
the current checkpoint.  There needs to be some way to specify this
distinction.

Chuck



Yonik Seeley wrote on 01/17/2007 06:48 AM:
 On 1/17/07, Michael McCandless [EMAIL PROTECTED] wrote:
 If this approach works well we could at some point deprecate the
 delete* methods on IndexReader and make package protected versions
 that IndexWriter calls.

 If we do API changes in the future, it would be nice to make the
 search side more efficient w.r.t. deleted documents... at least remove
 the synchronization for isDeleted for read-only readers, and perhaps
 even have a subclass that is a no-op for isDeleted for read-only
 readers.

 -Yonik
 http://incubator.apache.org/solr Solr, the open-source Lucene search
 server

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene 2.1, soon

2007-01-17 Thread Chuck Williams
Grant Ingersoll wrote on 01/17/2007 01:42 AM:
 Also, I'm curious as to how many people use NFS in live systems.


I've got the requirement to support large indexes and collections of
indexes on NAS devices, which from Linux pretty much means NFS or CIFS.

This doesn't seem unusual.

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-756) Maintain norms in a single file .nrm

2007-01-16 Thread Chuck Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465240
 ] 

Chuck Williams commented on LUCENE-756:
---

I may have the only app that will be broken by the 10-day backwards 
incompatibility, but the change seems worth it.  I need to create some large 
indexes to take on the road for demos.  Is the index format in the latest patch 
final?


 Maintain norms in a single file .nrm
 

 Key: LUCENE-756
 URL: https://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Doron Cohen
 Assigned To: Doron Cohen
Priority: Minor
 Attachments: index.premergednorms.cfs.zip, 
 index.premergednorms.nocfs.zip, LUCENE-756-Jan16.patch, 
 LUCENE-756-Jan16.Take2.patch, nrm.patch.2.txt, nrm.patch.3.txt, nrm.patch.txt


 Non-compound indexes are ~10% faster at indexing, and perform 50% of the IO activity 
 compared to compound indexes. But their file descriptor footprint is much 
 higher. 
 By maintaining all field norms in a single .nrm file, we can bound the number 
 of files used by non compound indexes, and possibly allow more applications 
 to use this format.
 More details on the motivation for this in: 
 http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
  (in particular 
 http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: adding explicit commits to Lucene?

2007-01-16 Thread Chuck Williams
Yonik Seeley wrote on 01/16/2007 11:29 AM:
 On 1/16/07, robert engels [EMAIL PROTECTED] wrote:
 You have the same problem if there is an existing reader open, so
 what is the difference? You can't remove the segments there either.

 The disk space for the segments is currently removed if no one has
 them open... this is quite a bit different than guaranteeing that a
 reader in the future will be able to open an index in the past.

To me the key benefit of explicit commits is that ongoing adds and their
associated merges update only the segments of the current snapshot.  The
current snapshot can be aborted, falling back to the last checkpoint
without having made any changes to its segments at all.  Once a commit
is done the committed snapshot becomes the new checkpoint.

Lucene does not have this desirable property now even for adding a
single document, since that document may cause a merge with consequences
arbitrarily deep into the index.

For the single-transaction use case it is only necessary that the
segments in the current checkpoint and those in the current snapshot are
maintained.  Revising the current snapshot can delete segments in the
prior snapshot, and committing can delete segments in the prior checkpoint.

Of course support for multiple parallel transactions would be even
better, but is also a huge can of worms as anyone who has spent time
chasing database deadlocks and understanding all the different types of
locks that modern databases use can attest.

The single-transaction case seems straightforward to implement per
Michael's suggestion and enables valuable use cases as the thread has
enumerated.

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: adding explicit commits to Lucene?

2007-01-15 Thread Chuck Williams
robert engels wrote on 01/15/2007 08:01 AM:
 Is your parallel adding code available?

There is an early version in LUCENE-600, but without the enhancements
described.  I didn't update that version because it didn't capture any 
interest and requires Java 1.5, so it seems it will not be committed.

I could update jira with the new version, but would have to create a
clean patch that applies against the Lucene head.  My local copy has 
diverged due to a number of uncommitted patches, so patches generated 
from it contain other stuff.

My use case for parallel subindexes is as an enabler for fast bulk
updates.  Only the subindexes containing changing fields need to be
updated, so long as the update algorithm does not change doc-ids.  Even
though this requires rewriting entire segments using techniques similar
to those used in merging (but not purging deleted docs), I'm still
getting 30x (when many fields change) to many hundreds of times (when only a 
few fields change) faster update performance than the batched 
delete-add method on very large indexes (millions of documents, some very 
large).

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: adding explicit commits to Lucene?

2007-01-15 Thread Chuck Williams
Ning Li wrote on 01/15/2007 06:29 PM:
 On 1/14/07, Michael McCandless [EMAIL PROTECTED] wrote:
   * The support deleteDocuments in IndexWriter (LUCENE-565) feature
 could have a more efficient implementation (just like Solr) when
 autoCommit is false, because deletes don't need to be flushed
 until commit() is called.  Whereas, now, they must be aggressively
 flushed on each checkpoint.

 If a reader can only open snapshots both for search and for
 modification, I think another change is needed besides the ones
 listed: assume the latest snapshot is segments_5 and the latest
 checkpoint is segmentsx_7 with 2 new segments, then a reader opens
 snapshot segments_5, performs a few deletes and writes a new
 checkpoint segmentsx_8. The summary file segmentsx_8 should include
 the 2 new segments which are in segmentsx_7 but not in segments_5.
 Such segments to include are easily identifiable only if they are not
 merged with segments in the latest snapshot... All these won't be
 necessary if a reader always opens the latest checkpoint for
 modification, which will also support deletion of non-committed
 documents.
This problem seems worse.  I don't see how a reader and a writer can
independently compute and write checkpoints.  The adds in the writer
don't just create new segments, they replace existing ones through
merging.  And the merging changes doc-ids by expunging deletes.  It
seems that all deletes must be based on the most recent checkpoint, or
merging of checkpoints to create the next snapshot will be considerably
more complex.

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: adding explicit commits to Lucene?

2007-01-15 Thread Chuck Williams
robert engels wrote on 01/15/2007 08:11 PM:
 If that is all you need, I think it is far simpler:

 If you have an OID, then all that is required is to write the operations
 (delete this OID, insert this document, etc...) to a separate disk file.

 Once the file is permanently on disk, it is simple to just keep
 playing the file back until it succeeds.
There is no guarantee a given operation will ever succeed so this
doesn't work.

 This is what we do in our search server.

 I am not completely familiar with parallel reader, but in reading the
 JavaDoc I don't see the benefit - since you have to write the
 documents to both indexes anyway??? Why is it of any benefit to break
 the document into multiple parts?
I'm sure Doug had reasons to write it.  My reason to use it is for fast
bulk updates, updating one subindex without having to update the others.

 If you have OIDs available, parallel reader can be accomplished in a
 far simpler and more efficient manner - we have a completely federated
 server implementation that was trivial - less than 100 lines of code. We
 did it simpler, and create a hash from the OID, and store the document
 into a different index depending on the hash, then run the query across
 all indexes in parallel, joining the results.
Lucene has this built in via MultiSearcher and RemoteSearchable.  It is
a bit more complex due to the necessity to normalize Weights, e.g. to
ensure the same docFreq's which reflect the union of all indexes are
used for the search in each.

Federated searching addresses different requirements than
ParallelReader.  Yes, I agree that ParallelReader could be done using
UID's, but believe it would be a considerably more expensive
representation to search.  The method used in federated search to
distribute the same query to each index is not applicable.  Breaking the
query up into parts that are applied against each parallel index, with
each query part referencing only the fields in a single parallel index,
would be a challenge with complex nested queries supporting all of the
operators, and much less efficient than ParallelReader.  Modifying all
the primitive Query subclasses to use UID's instead of doc-id's would 
be an alternative, but would be a lot of work and not nearly as
efficient as the existing Lucene index representation that sorts
postings by doc-id.

To illustrate this, consider the simple query, f:a AND g:b, where f and
g are in two different parallel indexes.  Performing the f  and g
queries separately on the different indexes to get possibly very long
lists of results and then joining those by UID will be much slower than
BooleanQuery operating on ParallelReader with doc-id sorted postings. 
The alternative of a UID-based BooleanQuery would have similar
challenges unless the postings were sorted by UID.  But hey, that's
permanent doc-ids.
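
For concreteness, the ParallelReader form of that query looks roughly like this (a sketch against the Lucene 2.x-era API; dir1/dir2 are assumed to be the Directory objects of the two sub-indexes):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.ParallelReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.Directory;

    static Hits searchBothSubIndexes(Directory dir1, Directory dir2) throws IOException {
      ParallelReader pr = new ParallelReader();
      pr.add(IndexReader.open(dir1));   // sub-index containing field f
      pr.add(IndexReader.open(dir2));   // sub-index containing field g

      BooleanQuery q = new BooleanQuery();
      q.add(new TermQuery(new Term("f", "a")), BooleanClause.Occur.MUST);
      q.add(new TermQuery(new Term("g", "b")), BooleanClause.Occur.MUST);

      IndexSearcher searcher = new IndexSearcher(pr);
      return searcher.search(q);        // the AND runs as one merge over doc-id-sorted postings
    }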

Chuck


 On Jan 15, 2007, at 11:49 PM, Chuck Williams wrote:

 My interest is transactions, not making doc-id's permanent.
 Specifically, the ability to ensure that a group of adds either all go
 into the index or none go into the index, and to ensure that if none go
 into the index that the index is not changed in any way.

 I have UID's but they cannot ensure the latter property, i.e. they
 cannot ensure side-effect-free rollbacks.

 Yes, if you have no reliance on internal Lucene structures like doc-id's
 and segments, then that shouldn't matter.  But many capabilities have
 such reliance for good reasons.  E.g., ParallelReader, which is a public
 supported class in Lucene, requires doc-id synchronization.  There are
 similar good reasons for an application to take advantage of doc-ids.

 Lucene uses doc-id's in many of its API's and so it is not surprising
 that many applications rely on them, and I'm sure some misuse them, not fully
 understanding the semantics and uncertainties of doc-id changes due to
 merging segments with deletes.

 Applications can use doc-ids for legitimate and beneficial purposes
 while remaining semantically valid.  Making such capabilities efficient
 and robust in all cases is facilitated by application control over when
 doc-id's and segment structure change at a granularity larger than the
 single Document.

 If I had a vote it would be +1 on the direction Michael has proposed,
 assuming it can be done robustly and without performance penalty.

 Chuck


 robert engels wrote on 01/15/2007 07:34 PM:
 I honestly think that having a unique OID as an indexed field and
 putting a layer on top of Lucene is the best solution to all of this.
 It makes it almost trivial, and you can implement transaction handling
 in a variety of ways.

 Attempting to make the doc ids permanent is a tough challenge,
 considering the original design called for them to be non-permanent.

 It seems doubtful that you cannot have some sort of primary key anyway
 and be this concerned about the transactional nature of Lucene.

 I vote -1 on all of this. I think it will detract from the simple and
 efficient storage

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

2007-01-11 Thread Chuck Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464055
 ] 

Chuck Williams commented on LUCENE-769:
---

Robert,

Could you attach your current implementation of reopen() as well?  The 
attachment did not come through in your java-dev message today, or the one from 
12/11.  I'd like to look at an incremental implementation of reopen() for 
FieldCache.

Thanks


 [PATCH] Performance improvement for some cases of sorted search
 ---

 Key: LUCENE-769
 URL: https://issues.apache.org/jira/browse/LUCENE-769
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Artem Vasiliev
 Attachments: DocCachingSorting.patch, DocCachingSorting.patch, 
 QueryFilter.java, StoredFieldSorting.patch


 It's a small addition to Lucene that significantly lowers memory consumption 
 and improves performance for the scenario of sorted searches with frequent 
 index updates and relatively big indexes (1mln docs). This solution currently 
 supports only single-field sorting (which seems to be quite a popular use case). 
 Multiple-field support can be added without much trouble.
 The solution is this: documents from the sorting set (instead of the given 
 field's values from the whole index - the current FieldCache approach) are cached 
 in a WeakHashMap so the cached items are candidates for GC.  Their field 
 values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

2007-01-10 Thread Chuck Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463729
 ] 

Chuck Williams commented on LUCENE-769:
---

The test case uses only tiny documents, and the reported timings for multiple 
searches with FieldCache make it appear that the version of Lucene used 
contains the bug that caused FieldCaches to be frequently recomputed 
unnecessarily.

I suggest trying the test with much larger documents, of realistic size, and 
using current Lucene source.  I'm sure the patch will make things much slower 
with the current implementation.  As Hoss suggests, performance would be 
improved considerably by using a FieldSelector to obtain just the sort field, 
but even so will be slow unless the sort field is arranged to be early in the 
documents, ideally the first field, and a LOAD_AND_BREAK FieldSelector is used.
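
Something along these lines, assuming a hypothetical stored field named "sortField" that is written early in the document:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.FieldSelectorResult;
    import org.apache.lucene.index.IndexReader;

    static String loadSortValue(IndexReader reader, int docId) throws IOException {
      FieldSelector sortFieldOnly = new FieldSelector() {
        public FieldSelectorResult accept(String fieldName) {
          return "sortField".equals(fieldName)
              ? FieldSelectorResult.LOAD_AND_BREAK   // load it and stop scanning the rest of the doc
              : FieldSelectorResult.NO_LOAD;
        }
      };
      Document doc = reader.document(docId, sortFieldOnly);
      return doc.get("sortField");
    }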

Another important performance variable will be the number of documents 
retrieved in the test query.  If the number of documents satisfying the query 
is a sizable percentage of the total collection size, I'm pretty sure the patch 
will be much slower than using FieldCache.


 [PATCH] Performance improvement for some cases of sorted search
 ---

 Key: LUCENE-769
 URL: https://issues.apache.org/jira/browse/LUCENE-769
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Artem Vasiliev
 Attachments: DocCachingSorting.patch, DocCachingSorting.patch


 It's a small addition to Lucene that significantly lowers memory consumption 
 and improves performance for the scenario of sorted searches with frequent 
 index updates and relatively big indexes (1mln docs). This solution currently 
 supports only single-field sorting (which seems to be quite a popular use case). 
 Multiple-field support can be added without much trouble.
 The solution is this: documents from the sorting set (instead of the given 
 field's values from the whole index - the current FieldCache approach) are cached 
 in a WeakHashMap so the cached items are candidates for GC.  Their field 
 values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

2007-01-09 Thread Chuck Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463322
 ] 

Chuck Williams commented on LUCENE-767:
---

Isn't maxDoc always the same as the docCount of the segment, which is stored?  
I.e., couldn't SegmentReader.maxDoc() be equivalently defined as:

  public int maxDoc() {
    return si.docCount;
  }

Since maxDoc==numDocs==docCount for a newly merged segment, and deletion with a 
reader never changes numDocs or maxDoc, it seems to me these values should 
always be the same.

All Lucene tests pass with this definition.  I have code that relies on this 
equivalence and so would appreciate knowledge of any case where this 
equivalence might not hold.


 maxDoc should be explicitly stored in the index, not derived from file length
 -

 Key: LUCENE-767
 URL: https://issues.apache.org/jira/browse/LUCENE-767
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 1.9, 2.0.0, 2.0.1, 2.1
Reporter: Michael McCandless
 Assigned To: Michael McCandless
Priority: Minor

 This is a spinoff of LUCENE-140
 In general we should rely on as little as possible from the file system.  
 Right now, maxDoc is derived by checking the file length of the FieldsReader 
 index file (.fdx) which makes me nervous.  I think we should explicitly store 
 it instead.
 Note that there are no known cases where this is actually causing a problem. 
 There was some speculation in the discussion of LUCENE-140 that it could be 
 one of the possible causes, but in digging / discussion there were no specifically 
 relevant JVM bugs found (yet!).  So this would be a defensive fix at this 
 point.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

2007-01-03 Thread Chuck Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462122
 ] 

Chuck Williams commented on LUCENE-510:
---

Has an improvement been made to eliminate the reported 20% indexing hit?  That 
would be a big price to pay.

To me the performance benefits in algorithms that scan for selected fields 
(e.g., FieldsReader.doc() with a FieldSelector) are much more important than 
standard UTF-8 compliance.

A 20% hit seems surprising.  The pre-scan over the string to be written 
shouldn't cost much compared to the cost of tokenizing and indexing that string 
(assuming it is in an indexed field).

In case it is relevant, I had a related issue in my bulk updater, a case where 
a vint required at the beginning of a record by the Lucene index format was not 
known until after the end.  I solved this with a fixed-length vint record that 
was estimated up front and revised if necessary after the whole record was 
processed.  The vint representation still works if more bytes than necessary 
are written.
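
For illustration, a padded ("fixed width") vint can be written like this; readVInt() decodes it to the same value provided the value fits in 7 * width bits (a sketch, not the bulk updater's actual code):

    import java.io.IOException;
    import org.apache.lucene.store.IndexOutput;

    static void writeFixedWidthVInt(IndexOutput out, int value, int width) throws IOException {
      // Caller must ensure value fits in 7 * width bits.
      for (int i = 0; i < width - 1; i++) {
        out.writeByte((byte) ((value & 0x7F) | 0x80));  // low 7 bits, continuation bit set
        value >>>= 7;
      }
      out.writeByte((byte) (value & 0x7F));             // final byte: continuation bit clear
    }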


 IndexOutput.writeString() should write length in bytes
 --

 Key: LUCENE-510
 URL: https://issues.apache.org/jira/browse/LUCENE-510
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 2.1
Reporter: Doug Cutting
 Assigned To: Grant Ingersoll
 Fix For: 2.1

 Attachments: SortExternal.java, strings.diff, TestSortExternal.java


 We should change the format of strings written to indexes so that the length 
 of the string is in bytes, not Java characters.  This issue has been 
 discussed at:
 http://www.mail-archive.com/java-dev@lucene.apache.org/msg01970.html
 We must increment the file format number to indicate this change.  At least 
 the format number in the segments file should change.
 I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until 
 after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 
 (other than removal of deprecated features).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-762) [PATCH] Efficiently retrieve sizes of field values

2006-12-29 Thread Chuck Williams (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-762?page=comments#action_12461460 ] 

Chuck Williams commented on LUCENE-762:
---

Hi Grant,

Maybe even better would be to have an appropriate method on 
FieldSelectorResult.  E.g.:

FieldSelectorResult.readField(doc, fieldsStream, fi, binary, compressed, 
tokenized)

This would eliminate the tests or map lookup in performance-critical code.






 [PATCH] Efficiently retrieve sizes of field values
 --

 Key: LUCENE-762
 URL: http://issues.apache.org/jira/browse/LUCENE-762
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.1
Reporter: Chuck Williams
 Assigned To: Grant Ingersoll
Priority: Minor
 Attachments: SizeFieldSelector.patch


 Sometimes an application would like to know how large a document is before 
 retrieving it.  This can be important for memory management or choosing 
 between algorithms, especially in cases where documents might be very large.
 This patch extends the existing FieldSelector mechanism with two new 
 FieldSelectorResults:  SIZE and SIZE_AND_BREAK.  SIZE creates fields on the 
 retrieved document that store field sizes instead of actual values.  
 SIZE_AND_BREAK is especially efficient if one field comprises the bulk of the 
 document size (e.g., the body field) and can thus be used as a reasonable 
 size approximation.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-762) [PATCH] Efficiently retrieve sizes of field values

2006-12-28 Thread Chuck Williams (JIRA)
[PATCH] Efficiently retrieve sizes of field values
--

 Key: LUCENE-762
 URL: http://issues.apache.org/jira/browse/LUCENE-762
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.1
Reporter: Chuck Williams


Sometimes an application would like to know how large a document is before 
retrieving it.  This can be important for memory management or choosing between 
algorithms, especially in cases where documents might be very large.

This patch extends the existing FieldSelector mechanism with two new 
FieldSelectorResults:  SIZE and SIZE_AND_BREAK.  SIZE creates fields on the 
retrieved document that store field sizes instead of actual values.  
SIZE_AND_BREAK is especially efficient if one field comprises the bulk of the 
document size (e.g., the body field) and can thus be used as a reasonable size 
approximation.
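
As a usage illustration only (the constant names follow this description rather 
than any committed API; "body" is an assumed field name and reader/docId are 
assumed to be an open IndexReader and a document id):

    // Load only the size of the body field, then stop reading further fields.
    FieldSelector sizeSelector = new FieldSelector() {
        public FieldSelectorResult accept(String fieldName) {
            return "body".equals(fieldName)
                ? FieldSelectorResult.SIZE_AND_BREAK   // store its size, then break
                : FieldSelectorResult.NO_LOAD;
        }
    };
    Document doc = reader.document(docId, sizeSelector);
    // With SIZE / SIZE_AND_BREAK the field's value holds the byte size, not the content.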


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-762) [PATCH] Efficiently retrieve sizes of field values

2006-12-28 Thread Chuck Williams (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-762?page=all ]

Chuck Williams updated LUCENE-762:
--

Attachment: SizeFieldSelector.patch

 [PATCH] Efficiently retrieve sizes of field values
 --

 Key: LUCENE-762
 URL: http://issues.apache.org/jira/browse/LUCENE-762
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.1
Reporter: Chuck Williams
 Attachments: SizeFieldSelector.patch


 Sometimes an application would like to know how large a document is before 
 retrieving it.  This can be important for memory management or choosing 
 between algorithms, especially in cases where documents might be very large.
 This patch extends the existing FieldSelector mechanism with two new 
 FieldSelectorResults:  SIZE and SIZE_AND_BREAK.  SIZE creates fields on the 
 retrieved document that store field sizes instead of actual values.  
 SIZE_AND_BREAK is especially efficient if one field comprises the bulk of the 
 document size (e.g., the body field) and can thus be used as a reasonable 
 size approximation.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-754) FieldCache keeps hard references to readers, doesn't prevent multiple threads from creating same instance

2006-12-19 Thread Chuck Williams (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-754?page=comments#action_12459763 ] 

Chuck Williams commented on LUCENE-754:
---

Cool!  This should solve at least part of my problem.  Trying this now (along 
with finalizer removal patch that is already installed here).  Will report back 
results.

Thanks!


 FieldCache keeps hard references to readers, doesn't prevent multiple threads 
 from creating same instance
 -

 Key: LUCENE-754
 URL: http://issues.apache.org/jira/browse/LUCENE-754
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Yonik Seeley
 Assigned To: Yonik Seeley
 Attachments: FieldCache.patch




-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-754) FieldCache keeps hard references to readers, doesn't prevent multiple threads from creating same instance

2006-12-19 Thread Chuck Williams (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-754?page=comments#action_12459791 ] 

Chuck Williams commented on LUCENE-754:
---

This patch, together with LUCENE-750 (already committed), solved our problem 
completely.  It sped up simultaneous multi-threaded searches with a new 
ParallelReader against a 1 million item investigation that has a unique id sort 
field (i.e., a 1 million entry FieldCache must be created) by a factor of 15x.

Thanks Yonik!  +1 to commit this.


 FieldCache keeps hard references to readers, doesn't prevent multiple threads 
 from creating same instance
 -

 Key: LUCENE-754
 URL: http://issues.apache.org/jira/browse/LUCENE-754
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Yonik Seeley
 Assigned To: Yonik Seeley
 Attachments: FieldCache.patch




-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: 15 minute hang in IndexInput.clone() involving finalizers

2006-12-16 Thread Chuck Williams
The problem appears to be this.  We have an approximately 1 million item
index.  It uses 6 parallel subindexes with ParallelReader, so each of
these subindexes has 1 million items.  Each subindex has the same
segment structure, with 15 segments in each at the moment.

I mentioned before that the issue arose just after a deleteAdd update
that closed the reader after the deletes, added with the writer, and
then reopened the reader.

We have been using a default sort that looks at score first and then id
of the item.  Each id is unique, with an integer sort field.  So the
query just after the IndexReader refresh has to create a new FieldCache
comparator for this integer field.  That generates a ParallelTermDocs
that iterates the id field, which is of course in only one of the
subindexes.  So we have to build a field cache with 1,000,000 entries,
which requires cloning the freqStream in the SegmentReader for each
segment.  This  should only be 15 clones as I interpret the code.

There were 4 threads doing this simultaneously, so make that 60 clones.

I can see reading 1 million terms and building the comparator taking a
while, although not the 15-20 minutes it does, and am baffled that
every thread dump across many trials of this issue ends up with every thread
inside clone()!  The clone just doesn't do much; the most expensive
thing is copying the 1024 byte buffer in BufferedIndexInput.

Applying the patch moved the issue somewhat, but not materially.  The
setup of the FieldCache comparator still takes the same amount of time
and all thread dumps still find the stack inside Object.clone() working
on finalizers.

I'll study this further and look for an optimization, submitting a patch
if I find one.  One interesting thing is that it appears that all 4
threads simultaneously doing this query are building a field cache.  It
seems the synchronization in FieldCacheImpl.get() with the
CreationPlaceholder should have prevented that, but for some reason it
does not.

Any further suggestions would be welcome!

For easy access, here is the thread dump again without the patch:

 == Thread Connection thread group.HttpConnection-26493-7 ===
 java.lang.ref.Finalizer.add(Unknown Source)
 java.lang.ref.Finalizer.init(Unknown Source)
 java.lang.ref.Finalizer.register(Unknown Source)
 java.lang.Object.clone(Native Method)
 org.apache.lucene.store.IndexInput.clone(IndexInput.java:175)

 org.apache.lucene.store.BufferedIndexInput.clone(BufferedIndexInput.java:128)
 org.apache.lucene.store.FSIndexInput.clone(FSDirectory.java:562)

 org.apache.lucene.index.SegmentTermDocs.init(SegmentTermDocs.java:45)

 org.apache.lucene.index.SegmentReader.termDocs(SegmentReader.java:333)

 org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:416)

 org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:409)
 org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:361)

 org.apache.lucene.index.ParallelReader$ParallelTermDocs.next(ParallelReader.java:353)

 org.apache.lucene.search.FieldCacheImpl$3.createValue(FieldCacheImpl.java:173)

 org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)

 org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:154)

 org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:148)

 org.apache.lucene.search.FieldSortedHitQueue.comparatorInt(FieldSortedHitQueue.java:204)

 org.apache.lucene.search.FieldSortedHitQueue$1.createValue(FieldSortedHitQueue.java:175)

 org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)

 org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:155)

 org.apache.lucene.search.FieldSortedHitQueue.init(FieldSortedHitQueue.java:56)

 org.apache.lucene.search.TopFieldDocCollector.init(TopFieldDocCollector.java:41)

And here is the top of the stack with the patch (rest is the same):

 == Thread Connection thread group.HttpConnection-26493-3 ===
 java.lang.ref.Finalizer.init(Unknown Source)
 java.lang.ref.Finalizer.register(Unknown Source)
 java.lang.Object.clone(Native Method)
 org.apache.lucene.store.IndexInput.clone(IndexInput.java:175)
 org.apache.lucene.store.BufferedIndexInput.clone(BufferedIndexInput.java:128)

 org.apache.lucene.store.FSIndexInput.clone(FSDirectory.java:564)
 org.apache.lucene.index.SegmentTermDocs.init(SegmentTermDocs.java:45) 


Thanks,

Chuck


Chuck Williams wrote on 12/15/2006 08:22 AM:
 Yonik and Robert, thanks for the suggestions and pointer to the patch!

 We've looked at the synchronization involved with finalizers and don't
 see how it could cause the issue as running the finalizers themselves is
 outside the lock.  The code inside the lock is simple fixed-time list
 manipulation, not even a loop.  On the other hand, we don't see how
 anything else could cause

15 minute hang in IndexInput.clone() involving finalizers

2006-12-15 Thread Chuck Williams
Hi All,

I've had a bizarre anomaly arise in an application and am wondering if
anybody has ever seen anything like this.  Certain queries, in cases that
are not easy to reproduce, take 15-20 minutes to execute rather than a few
seconds.  The same query is fast sometimes and anomalously slow at other
times.  This is on a 1,000,000 document collection, but the problem
seems independent of that.

I took a bunch of thread dumps during the anomaly period.  There are 4
threads executing the same query at the same time, and all 4 appear to
spend almost the entire time trying to register finalizers as part of
cloning an IndexInput within an application call to create a
TopFieldDocCollector into which the results will be collected.  The
actual search has not been launched yet, and will be reasonably quick
when it is.  All 4 threads show this unchanging stack trace during the
15-20 minutes:

 == Thread Connection thread group.HttpConnection-26493-11 ===
 java.lang.ref.Finalizer.add(Unknown Source)
 java.lang.ref.Finalizer.init(Unknown Source)
 java.lang.ref.Finalizer.register(Unknown Source)
 java.lang.Object.clone(Native Method)
 org.apache.lucene.store.IndexInput.clone(IndexInput.java:175)

 org.apache.lucene.store.BufferedIndexInput.clone(BufferedIndexInput.java:128)
 org.apache.lucene.store.FSIndexInput.clone(FSDirectory.java:562)

 org.apache.lucene.index.SegmentTermDocs.init(SegmentTermDocs.java:45)

 org.apache.lucene.index.SegmentReader.termDocs(SegmentReader.java:333)

 org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:416)

 org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:409)
 org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:361)

 org.apache.lucene.index.ParallelReader$ParallelTermDocs.next(ParallelReader.java:353)

 org.apache.lucene.search.FieldCacheImpl$3.createValue(FieldCacheImpl.java:173)

 org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)

 org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:154)

 org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:148)

 org.apache.lucene.search.FieldSortedHitQueue.comparatorInt(FieldSortedHitQueue.java:204)

 org.apache.lucene.search.FieldSortedHitQueue$1.createValue(FieldSortedHitQueue.java:175)

 org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)

 org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:155)

 org.apache.lucene.search.FieldSortedHitQueue.init(FieldSortedHitQueue.java:56)

 org.apache.lucene.search.TopFieldDocCollector.init(TopFieldDocCollector.java:41)
... application stack

Another factor appears to be that this anomaly usually (maybe always)
happens just after a series of deleteAdd updates, i.e. just after a
series of deleting with the IndexReader, closing it to add a modified
version of that document with the IndexWriter, and then reopening the
IndexReader.  A query just after reopening the IndexReader is most
likely to trigger this issue.

I have not seen this problem on any other collections with the same
application, and so it may be specific to this collection or to its size.

Any thoughts or ideas would be appreciated.

Thanks,

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Locale string compare: Java vs. C#

2006-12-13 Thread Chuck Williams
Surprising but it looks to me like a bug in Java's collation rules for
en-US.  According to
http://developer.mimer.com/collations/charts/UCA_latin.htm, \u00D8
(which is Latin Capital Letter O With Stroke) should be before U,
implying -1 is the correct result.  Java is returning 1 for all
strengths of the collator.  Maybe there is some other subtlety with this
character...

Chuck


George Aroush wrote on 12/13/2006 04:20 PM:
 Hi folks,

 Over at Lucene.Net, I have run into a NUnit test which is failing with
 Lucene.Net (C#) but is passing with Lucene (Java).  The two tests that fail
 are: TestInternationalMultiSearcherSort and TestInternationalSort

 After several hours of investigation, I narrowed the problem to what I
 believe is a difference in the way Java and .NET implement compare.

 The code in question is this method (found in FieldSortedHitQueue.java):

 public final int compare (final ScoreDoc i, final ScoreDoc j) {
 return collator.compare (index[i.doc], index[j.doc]);
 }

 To demonstrate the compare problem (Java vs. .NET) I created this simple code
 both in Java and C#:

 // Java code: you get back 1 for 'res'
 String s1 = "H\u00D8T";
 String s2 = "HUT";
 Collator collator = Collator.getInstance (Locale.US);
 int res = collator.compare(s1, s2);

 // C# code: you get back -1 for 'res'
 string s1 = "H\u00D8T";
 string s2 = "HUT";
 System.Globalization.CultureInfo locale = new
 System.Globalization.CultureInfo("en-US");
 System.Globalization.CompareInfo collator = locale.CompareInfo;
 int res = collator.Compare(s1, s2);

 Java will give me back a 1 while .NET gives me back -1.

 So, what I am trying to figure out is who is doing the right thing?  Or am I
 missing additional calls before I can compare?

 My goal is to understand why the difference exists so that, based on that
 understanding, I can judge how serious this issue is and either find a fix for
 it or just document it as a language difference between Java and .NET.

 Btw, this is based on Lucene 2.0 for both Java and C# Lucene.

 Regards,

 -- George Aroush


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]

   


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Efficiently expunging deletions of recently added documents

2006-12-05 Thread Chuck Williams
Thanks Ning.  This is all very helpful.  I'll make sure to be consistent
with the new merge policy and its invariant conditions.

Chuck


Ning Li wrote on 12/05/2006 08:01 AM:
 An old issue (http://issues.apache.org/jira/browse/LUCENE-325 new
 method expungeDeleted() added to IndexWriter) requested a similar
 functionality as described in the latter half of your email.

 The patch for that issue breaks the invariants of the new merge
 policy. An algorithm similar to that of addIndexesNoOptimize()
 (http://issues.apache.org/jira/browse/LUCENE-528 Optimization for
 IndexWriter.addIndexes()) would solve the problem.

 Ning

 On 12/5/06, Ning Li [EMAIL PROTECTED] wrote:
  I'd like to open up the API to mergeSegments() in IndexWriter and am
  wondering if there are potential problems with this.

 I'm worried that opening up mergeSegments() could easily break the
 invariants currently guaranteed by the new merge
 policy(http://issues.apache.org/jira/browse/LUCENE-672).

 The two invariants say that if M does not change and segment doc count
 is not reaching maxMergeDocs:
 With B for maxBufferedDocs and f(n) defined as ceil(log_M(ceil(n/B))):
 1: If i (left) and i+1 (right) are two consecutive segments of doc
 counts x and y, then f(x) >= f(y).
 2: The number of committed segments on the same level f(n) is <= M.

 Ning
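
Just to make the quoted level function concrete, here is a quick sketch of f(n) 
(B = maxBufferedDocs, M = mergeFactor; plain Java for illustration, not Lucene 
code):

    static int level(int n, int B, int M) {
        int groups = (n + B - 1) / B;                  // ceil(n/B)
        if (groups <= 1) return 0;                     // log_M(1) == 0
        return (int) Math.ceil(Math.log(groups) / Math.log(M));
    }

For example, with B = 10 and M = 10, a 10-doc segment is level 0, a 100-doc 
segment is level 1, and a 1,000-doc segment is level 2.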


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Attached proposed modifications to Lucene 2.0 to support Field.Store.Encrypted

2006-12-05 Thread Chuck Williams


Mike Klaas wrote on 12/05/2006 11:38 AM:
 On 12/5/06, negrinv [EMAIL PROTECTED] wrote:

 Chris Hostetter wrote:

  If the code was not already in the core, and someone asked about
 adding it
  I would argue against doing so on the grounds that some helpful
 utility
  methods (possibly in a contrib) would be just as useful, and would
 have
  no performance cost for people who don't care about compression.
 
 Perhaps, if you look at compression on its own; but once you see
 compression
 in the context of all the other field options it makes sense to have it
 added to Lucene.  It's about having everything in one place for ease of
 implementation, which offsets the performance issue, in my opinion.

 Note that built-in compression is deprecated, for similar reasons as
 are being given for the encrypted fields.

Built-in compression is also memory-hungry and slow due to the copying
it does.  External compression is much faster, especially if you extend
Field binary values to support a binary length parameter (which I
submitted a patch for a long time ago).
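
For concreteness, a rough sketch of the external-compression approach (the field 
name, the java.util.zip choice, and the variables text/doc are my assumptions, 
not code from any patch):

    // Compress in the application, then store the result as a binary field.
    byte[] raw = text.getBytes("UTF-8");
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DeflaterOutputStream zip = new DeflaterOutputStream(bos);
    zip.write(raw);
    zip.close();                                        // flushes the deflater
    doc.add(new Field("bodyZip", bos.toByteArray(), Field.Store.YES));
    // At query time the application inflates the stored bytes itself.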

Here is another argument against adding Field encryption to the lucene
core.  Changes in index format make life complex for any implementations
that deal with index files directly.  There are a number of Lucene
sister projects that do this, plus a number of applications.

I have a fast bulk updater that directly manipulates index files and am
busy upgrading it right now to the 2.1 index format with lockless
commits (which is not fully documented in the new index file formats, by
the way, e.g. the segmentN.sM separate norm files are missing).  It's
a pain.  In general, I think changes to Lucene index format should only
be driven by compelling benefits.  Moving encryption from external to
internal to get a minor application simplification is not sufficiently
compelling to me.

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Efficiently expunging deletions of recently added documents

2006-12-04 Thread Chuck Williams
Hi All,

I'd like to open up the API to mergeSegments() in IndexWriter and am
wondering if there are potential problems with this.

I use ParallelReader and ParallelWriter (in jira) extensively as these
provide the basis for fast bulk updates of small metadata fields. 
ParallelReader requires that the subindexes be strictly synchronized by
matching doc ids.  The thorniest problem arises when writing a new
document (with ParallelWriter) generates an exception in some of the
subindexes but not others, as this leaves the subindexes out of sync.

I have recovery for this now that works by deleting the successfully
added subdocuments that are parallel to any unsuccessful subdocument and
then optimizing to expunge the unsuccessful doc-id from those segments
where it had been added.  Optimization is prohibitively expensive for
large indexes, and unnecessary for this recovery.
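
A minimal sketch of that recovery path for one out-of-sync subindex (the 
directory, analyzer, and doc-id variables are illustrative; this is not the 
ParallelWriter code itself):

    // Delete the trailing sub-document that made it into this subindex...
    IndexReader reader = IndexReader.open(subIndexDir);
    reader.deleteDocument(trailingDocId);
    reader.close();
    // ...then optimize to expunge the deleted doc-id and restore doc-id alignment.
    IndexWriter writer = new IndexWriter(subIndexDir, analyzer, false);
    writer.optimize();        // prohibitively expensive on a large index
    writer.close();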

A much better solution is to have an API in IndexWriter to expunge a
given set of deleted doc ids.  This could merge only enough recent
segments to fully encompass the specified docs, which in this case is
not much since they will be recently added.  The result should be orders
of magnitude performance improvement to the recovery.

I'm planning to make this change and submit a patch for it unless I've
missed something that somebody can point out.  At the same time, I'll
update the ParallelWriter submission as there are a number of bug fixes
plus a substantial general (non-recovery-case) performance improvement
I've just identified and am about to implement.

Thanks for any thoughts, suggestions, or problems you can point out.

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Resolved: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-22 Thread Chuck Williams
Michael Busch wrote on 11/22/2006 08:47 AM:
 Ning Li wrote:
 A possible design could be:
 First, in addDocument(), compute the byte size of a ram segment after
 the ram segment is created. In the synchronized block, when the newly
 created segment is added to ramSegmentInfos, also add its byte size to
 the total byte size of ram segments.
 Then, in maybeFlushRamSegments(), either one of two conditions can
 trigger a flush: number of ram segments reaching maxBufferedDocs, and
 total byte size of ram segments exceeding a threshold.

There is a flaw in this approach as you exceed the threshold before
flushing.  With very large documents, that can cause an OOM.


 This is exactly how I implemented it in my private version a couple of
 weeks ago. It works good and I don't see performance problems with
 this design. I named the new parameter in IndexWriter:
 setMaxBufferSize(long).

I implemented it externally because I need to check the size before
adding a new document.  To make this work, I have a notion of size of
Document (via a Sized interface).

I agree that it would be better to do this in IndexWriter, but more
machinery would be needed.  Lucene would need to estimate the size of
the new ram segment and check the threshold prior to consuming the space.

The API that Yonik committed last night (thanks Yonik!) provides the
flexibility to address both use cases.  It's a tiny bit more work for
the app, but at least in my case, is necessary to tune for best
performance (by minimizing memory usage variance as a function of size
parameters) and avoid OOM's.
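
As a rough sketch of that external approach (assuming the accessors exposed by 
the LUCENE-709 patch, i.e. flushRamSegments() and ramSizeInBytes(), and an 
application-supplied per-document size estimate; not the actual code):

    // Flush before the buffered segments can outgrow the memory budget.
    synchronized void addWithSizeCheck(IndexWriter writer, Document doc,
                                       long estimatedDocBytes, long maxRamBytes)
            throws IOException {
        if (writer.ramSizeInBytes() + estimatedDocBytes > maxRamBytes) {
            writer.flushRamSegments();
        }
        writer.addDocument(doc);
    }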

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-21 Thread Chuck Williams (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-709?page=all ]

Chuck Williams updated LUCENE-709:
--

Attachment: ramDirSizeManagement.patch

This one should be golden as it addresses all the issues that have been raised 
and I believe the synchronization is fairly well optimized.

Size is now computed based on buffer size, and so is a more accurate accounting 
of actual memory usage.

I've added all the various checking and FileNotFoundExceptions that Doug 
suggested.

I've also changed RamFile.buffers to an ArrayList per Yonik's last suggestion.  
This is probably more than cosmetic since it does allow some unnecessary 
synchronization to be eliminated.

Unfortunately, my local Lucene now differs fairly substantially from the head 
-- wish you guys would commit more of my patches so merging wasn't so difficult 
:-) -- so I'm not using the version submitted here, but I did merge it into the 
head carefully and all tests pass, including the new RAMDirectory tests 
specifically for the functionality this patch provides.


 [PATCH] Enable application-level management of IndexWriter.ramDirectory size
 

 Key: LUCENE-709
 URL: http://issues.apache.org/jira/browse/LUCENE-709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.0.1
 Environment: All
Reporter: Chuck Williams
 Attachments: ramdir.patch, ramdir.patch, ramDirSizeManagement.patch, 
 ramDirSizeManagement.patch, ramDirSizeManagement.patch, 
 ramDirSizeManagement.patch


 IndexWriter currently only supports bounding of the in-memory index cache 
 using maxBufferedDocs, which limits it to a fixed number of documents.  When 
 document sizes vary substantially, especially when documents cannot be 
 truncated, this leads either to inefficiencies from a too-small value or 
 OutOfMemoryErrors from a too-large value.
 This simple patch exposes IndexWriter.flushRamSegments(), and provides access 
 to size information about IndexWriter.ramDirectory so that an application can 
 manage this based on the total number of bytes consumed by the in-memory cache, 
 thereby allowing a larger number of smaller documents or a smaller number of 
 larger documents.  This can lead to much better performance while eliminating 
 the possibility of OutOfMemoryErrors.
 The actual job of managing to a size constraint, or any other constraint, is 
 left up to the application.
 The addition of synchronized to flushRamSegments() is only for safety of an 
 external call.  It has no significant effect on internal calls since they all 
 come from a synchronized caller.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-723) QueryParser support for MatchAllDocs

2006-11-21 Thread Chuck Williams (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-723?page=comments#action_12451849 ] 

Chuck Williams commented on LUCENE-723:
---

+1

With this could also come negative-only queries, e.g.

-foo

as a shortcut for

*:* -foo


 QueryParser support for MatchAllDocs
 

 Key: LUCENE-723
 URL: http://issues.apache.org/jira/browse/LUCENE-723
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
Affects Versions: 2.0.0
Reporter: Yonik Seeley
 Assigned To: Yonik Seeley
Priority: Minor

 It seems like there really should be QueryParser support for 
 MatchAllDocsQuery.
 I propose *:* (brings back memories of DOS :-)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-17 Thread Chuck Williams (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-709?page=all ]

Chuck Williams updated LUCENE-709:
--

Attachment: ramDirSizeManagement.patch

I've just attached my version of this patch.  It includes a multi-threaded test 
case.  I believe it is sound.

A few notes:

  1.  Re. Yonik's comment about my synchronization scenario.  Synchronizing as 
described does resolve the issue.  No higher-level synchronization is required. 
 It doesn't matter how concurrent operations on the directory are ordered or 
interleaved, so long as any computation that does a loop sees some instance of 
the directory that corresponds to its actual content at some point in time.  The 
result of the loop will then be accurate for that instant.

2.  Lucene has this same synchronization bug today in RAMDirectory.list().  It 
can return a list of files that never comprised the contents of the directory.  
This is fixed in the attached.

3.  Also, the long synchronization bug exists in RAMDirectory.fileModified() as 
well as RAMDirectory.fileLength(), since both are public.  These are fixed in 
the attached.

4.  I moved the synchronization off of the Hashtable (replacing it with a 
HashMap) up to the RAMDirectory, as there are some operations that require 
synchronization at the directory level.  Using just one lock seems better.  As 
all Hashtable operations were already synchronized, I don't believe any material 
additional synchronization is added.

5.  Lucene currently makes the assumption that if a file is being written by a 
stream then no other streams are simultaneously reading or writing it.  I've 
maintained this assumption as an optimization, allowing the streams to access 
fields directly without synchronization.  This is documented in the comments, 
as is the locking order.

6.  sizeInBytes is now maintained incrementally, efficiently.

7.  Yonik, your version (which I just now saw) has a bug in 
RAMDirectory.renameFile().  The to file may already exist, in which case it is 
overwritten and its size must be subtracted (a sketch of this accounting appears 
below, after these notes).  I actually hit this in my test case for my 
implementation and fixed it (since Lucene renames a new version of the segments 
file).

All Lucene tests, including the new test, pass.  Some contrib tests fail; I 
believe none of these failures are in any way related to this patch.
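
A sketch of the renameFile size accounting described in the note above (field 
names such as files and sizeInBytes are assumed for illustration, not quoted 
from either patch):

    synchronized void renameFile(String from, String to) {
        RAMFile existing = (RAMFile) files.get(to);
        if (existing != null) {
            sizeInBytes -= existing.length;   // the target is overwritten, so drop its size
        }
        RAMFile file = (RAMFile) files.remove(from);
        files.put(to, file);
    }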




 [PATCH] Enable application-level management of IndexWriter.ramDirectory size
 

 Key: LUCENE-709
 URL: http://issues.apache.org/jira/browse/LUCENE-709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.0.1
 Environment: All
Reporter: Chuck Williams
 Attachments: ramdir.patch, ramdir.patch, ramDirSizeManagement.patch, 
 ramDirSizeManagement.patch, ramDirSizeManagement.patch


 IndexWriter currently only supports bounding of the in-memory index cache 
 using maxBufferedDocs, which limits it to a fixed number of documents.  When 
 document sizes vary substantially, especially when documents cannot be 
 truncated, this leads either to inefficiencies from a too-small value or 
 OutOfMemoryErrors from a too-large value.
 This simple patch exposes IndexWriter.flushRamSegments(), and provides access 
 to size information about IndexWriter.ramDirectory so that an application can 
 manage this based on the total number of bytes consumed by the in-memory cache, 
 thereby allowing a larger number of smaller documents or a smaller number of 
 larger documents.  This can lead to much better performance while eliminating 
 the possibility of OutOfMemoryErrors.
 The actual job of managing to a size constraint, or any other constraint, is 
 left up to the application.
 The addition of synchronized to flushRamSegments() is only for safety of an 
 external call.  It has no significant effect on internal calls since they all 
 come from a synchronized caller.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-17 Thread Chuck Williams (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-709?page=comments#action_12450894 ] 

Chuck Williams commented on LUCENE-709:
---

I didn't see Yonik's new version or comments until after my attachment.

Throwing IOExceptions when files that should exist don't is clearly a good 
thing.  I'll add that to mine if you guys decide it is the one you would like 
to use.

Counting buffer sizes rather than file length may be slightly more accurate, 
but at least for me it is not material.  There are other inaccuracies as well 
(non-file-storage space in the RAMFiles and RAMDirectory).

If you guys decide to go with Yonik's version, I think my test case should 
still be used, and that the other synchronization errors I've fixed should be 
fixed (e.g., RAMDirectory.list()).


 [PATCH] Enable application-level management of IndexWriter.ramDirectory size
 

 Key: LUCENE-709
 URL: http://issues.apache.org/jira/browse/LUCENE-709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.0.1
 Environment: All
Reporter: Chuck Williams
 Attachments: ramdir.patch, ramdir.patch, ramDirSizeManagement.patch, 
 ramDirSizeManagement.patch, ramDirSizeManagement.patch


 IndexWriter currently only supports bounding of the in-memory index cache 
 using maxBufferedDocs, which limits it to a fixed number of documents.  When 
 document sizes vary substantially, especially when documents cannot be 
 truncated, this leads either to inefficiencies from a too-small value or 
 OutOfMemoryErrors from a too-large value.
 This simple patch exposes IndexWriter.flushRamSegments(), and provides access 
 to size information about IndexWriter.ramDirectory so that an application can 
 manage this based on the total number of bytes consumed by the in-memory cache, 
 thereby allowing a larger number of smaller documents or a smaller number of 
 larger documents.  This can lead to much better performance while eliminating 
 the possibility of OutOfMemoryErrors.
 The actual job of managing to a size constraint, or any other constraint, is 
 left up to the application.
 The addition of synchronized to flushRamSegments() is only for safety of an 
 external call.  It has no significant effect on internal calls since they all 
 come from a synchronized caller.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-15 Thread Chuck Williams (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-709?page=comments#action_12450260 ] 

Chuck Williams commented on LUCENE-709:
---

Not synchronizing on the Hashtable, even if using an Enumerator, creates 
problems as the contents of the hash table may change during the sizeInBytes() 
iteration.  Files might be deleted and/or added to the directory concurrently, 
causing the size to be computed from an invalid intermediate state.  Using an 
Enumerator would cause the invalid value to be returned without an exception, 
while using an Iterator instead generates a ConcurrentModificationException.  
Synchronizing on files avoids the problem altogether without much cost as the 
loop is fast.

Hashtable uses a single class, Hashtable.Enumerator, for both its iterator and 
its enumerator.  There are a couple minor differences in the respective 
methods, such as the above, but not much.

The issue with RAMFile.length being a long is real, but this bug already 
exists in Lucene without sizeInBytes().  See RAMDirectory.fileLength(), which 
has the same problem now.

I'll submit another version of the patch that encapsulates RAMFile.length in 
a synchronized getter and setter.  It's only used in a few places (RAMDirectory, 
RAMInputStream and RAMOutputStream).
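
For illustration, the synchronized loop under discussion would look roughly like 
this (a sketch only; the getter name and the files field are assumed, with 
RAMFile.length accessed through the proposed synchronized getter):

    long sizeInBytes() {
        synchronized (files) {                 // block concurrent adds/deletes during the loop
            long total = 0;
            Iterator it = files.values().iterator();
            while (it.hasNext()) {
                total += ((RAMFile) it.next()).getLength();
            }
            return total;
        }
    }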


 [PATCH] Enable application-level management of IndexWriter.ramDirectory size
 

 Key: LUCENE-709
 URL: http://issues.apache.org/jira/browse/LUCENE-709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.0.1
 Environment: All
Reporter: Chuck Williams
 Attachments: ramDirSizeManagement.patch, ramDirSizeManagement.patch


 IndexWriter currently only supports bounding of the in-memory index cache 
 using maxBufferedDocs, which limits it to a fixed number of documents.  When 
 document sizes vary substantially, especially when documents cannot be 
 truncated, this leads either to inefficiencies from a too-small value or 
 OutOfMemoryErrors from a too-large value.
 This simple patch exposes IndexWriter.flushRamSegments(), and provides access 
 to size information about IndexWriter.ramDirectory so that an application can 
 manage this based on the total number of bytes consumed by the in-memory cache, 
 thereby allowing a larger number of smaller documents or a smaller number of 
 larger documents.  This can lead to much better performance while eliminating 
 the possibility of OutOfMemoryErrors.
 The actual job of managing to a size constraint, or any other constraint, is 
 left up to the application.
 The addition of synchronized to flushRamSegments() is only for safety of an 
 external call.  It has no significant effect on internal calls since they all 
 come from a synchronized caller.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-15 Thread Chuck Williams (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-709?page=comments#action_12450301 ] 

Chuck Williams commented on LUCENE-709:
---

I hadn't considered the case of such large values for maxBufferedDocs, and 
agree that the loop execution time is non-trivial in such cases.  Incremental 
management of the size seems most important, especially considering that this 
will also eliminate the cost of the synchronization.

I still think the synchronization adds safety since it guarantees that the loop 
sees a state of the directory that did exist at some time.  At that time, the 
directory did have the reported size.  Without the synchronization the loop may 
compute a size for a set of files that never comprised the contents of the 
directory at any instant.  Consider this case:

  1.  Thread 1 adds a new document, creating a new segment with new index 
files, leading to segment merging, that creates new larger segment index files, 
and then deletes all replaced segment index files.  Thread 1 then adds a second 
document, creating new segment index files.
  2.  Thread 2 is computing sizeInBytes and happens to see a state where all 
the new files from both the first and second documents are added, but the 
deletions are not seen.  This could happen if the deleted files happen to be 
earlier in the hash array than the added files for either document.

In this case sizeInBytes() without the synchronization computes a larger size 
for the directory than ever actually existed.

Re. RAMDirectory.fileLength(), it is not used within Lucene at all, but it is 
public, and the restriction that it is not valid when index operations are 
happening concurrently is not specified.  I think that is a bug.

I'll rethink the patch based on your observations, Yonik, and resubmit.  Thanks.


 [PATCH] Enable application-level management of IndexWriter.ramDirectory size
 

 Key: LUCENE-709
 URL: http://issues.apache.org/jira/browse/LUCENE-709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.0.1
 Environment: All
Reporter: Chuck Williams
 Attachments: ramDirSizeManagement.patch, ramDirSizeManagement.patch


 IndexWriter currently only supports bounding of the in-memory index cache 
 using maxBufferedDocs, which limits it to a fixed number of documents.  When 
 document sizes vary substantially, especially when documents cannot be 
 truncated, this leads either to inefficiencies from a too-small value or 
 OutOfMemoryErrors from a too-large value.
 This simple patch exposes IndexWriter.flushRamSegments(), and provides access 
 to size information about IndexWriter.ramDirectory so that an application can 
 manage this based on the total number of bytes consumed by the in-memory cache, 
 thereby allowing a larger number of smaller documents or a smaller number of 
 larger documents.  This can lead to much better performance while eliminating 
 the possibility of OutOfMemoryErrors.
 The actual job of managing to a size constraint, or any other constraint, is 
 left up to the application.
 The addition of synchronized to flushRamSegments() is only for safety of an 
 external call.  It has no significant effect on internal calls since they all 
 come from a synchronized caller.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: ParallelMultiSearcher reimplementation

2006-11-13 Thread Chuck Williams


Doug Cutting wrote on 11/13/2006 10:50 AM:
 Chuck Williams wrote:
 I followed this same logic in ParallelWriter and got burned.  My first
 implementation (still the version submitted as a patch in jira) used
 dynamic threads to add the subdocuments to the parallel subindexes
 simultaneously.  This hit a problem with abnormal native heap OOM's in
 the jvm.  At first I thought it was simply a thread stack size / java
 heap size configuration issue, but adjusting these did not resolve the
 issue.  This was on linux.  ps -L showed large numbers of defunct
 threads.  jconsole showed enormous growing total-ever-allocated thread
 counts.  I switched to a thread pool and the issue went away with the
 same config settings.

 Can you demonstrate the problem with a standalone program?

 Way back in the 90's I implemented a system at Excite that spawned one
 or more Java threads per request, and it ran for days on end, handling
 20 or more requests per second.  The thread spawning overhead was
 insignificant.  That was JDK 1.2 on Solaris.  Have things gotten that
 much worse in the interim?  Today Hadoop's RPC allocates a thread per
 connection, and we see good performance.  So I certainly have
 counterexamples.

Are you pushing memory to the limit?  In my case, we need a maximally
sized Java heap (about 2.5G on linux) and so carefully minimize the
thread stack and perm space sizes.  My suspicion is that it takes a
while after a thread is defunct before all resources are reclaimed.  We
are hitting our server with 50 simultaneous threads doing indexing, each
of which writes 6 parallel subindexes in a separate thread.  This yields
hundreds of threads created per second in tight total thread stack
space; the process continually bumped over the native heap limit.  With
the change to thread pools, and therefore no dynamic creation and
destruction of thread stacks, all works fine.
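
A bare-bones sketch of the pool-based approach (java.util.concurrent, available 
since Java 1.5; names like numSubIndexes, subWriters, and subDocs are 
illustrative, not the ParallelWriter code):

    ExecutorService pool = Executors.newFixedThreadPool(numSubIndexes);

    void addParallel(final IndexWriter[] subWriters, final Document[] subDocs)
            throws Exception {
        List<Future<?>> pending = new ArrayList<Future<?>>();
        for (int i = 0; i < subWriters.length; i++) {
            final int n = i;
            pending.add(pool.submit(new Callable<Object>() {
                public Object call() throws Exception {
                    subWriters[n].addDocument(subDocs[n]);  // one sub-document per subindex
                    return null;
                }
            }));
        }
        for (Future<?> f : pending) {
            f.get();                                        // wait; propagates any sub-add failure
        }
    }

The point of the design is that the pool's thread stacks are allocated once, so 
heavy indexing no longer creates and destroys hundreds of stacks per second in a 
tight native heap.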

Unless you are running with a maximal Java heap, you are unlikely to
have the issue as there is plenty of space left over for the native
heap, so a delay in thread stack reclamation would yield a larger
average process size, but would not cause OOM's.

Chuck



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-10 Thread Chuck Williams (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-709?page=comments#action_12448923 ] 

Chuck Williams commented on LUCENE-709:
---

Mea Culpa!  Bad bug on my part.  Thanks for spotting it!

I believe the solution is simple.  RAMDirectory.files is a Hashtable, i.e. it 
is synchronized.  Hashtable.values() tracks all changes to the ram directory as 
they occur.  The fail-fast iterator does not accept concurrent modifications.  
So, the answer is to stop concurrent modifications during sizeInBytes().  This 
is accomplished by synchronizing on the same object as the modifications 
already use, i.e. files.  I'm attaching a new version of the patch that I 
believe is correct.

Please embarrass me again if there is another mistake!


 [PATCH] Enable application-level management of IndexWriter.ramDirectory size
 

 Key: LUCENE-709
 URL: http://issues.apache.org/jira/browse/LUCENE-709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.0.1
 Environment: All
Reporter: Chuck Williams
 Attachments: ramDirSizeManagement.patch, ramDirSizeManagement.patch


 IndexWriter currently only supports bounding of the in-memory index cache 
 using maxBufferedDocs, which limits it to a fixed number of documents.  When 
 document sizes vary substantially, especially when documents cannot be 
 truncated, this leads either to inefficiencies from a too-small value or 
 OutOfMemoryErrors from a too-large value.
 This simple patch exposes IndexWriter.flushRamSegments(), and provides access 
 to size information about IndexWriter.ramDirectory so that an application can 
 manage this based on the total number of bytes consumed by the in-memory cache, 
 thereby allowing a larger number of smaller documents or a smaller number of 
 larger documents.  This can lead to much better performance while eliminating 
 the possibility of OutOfMemoryErrors.
 The actual job of managing to a size constraint, or any other constraint, is 
 left up to the application.
 The addition of synchronized to flushRamSegments() is only for safety of an 
 external call.  It has no significant effect on internal calls since they all 
 come from a synchronized caller.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-10 Thread Chuck Williams (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-709?page=all ]

Chuck Williams updated LUCENE-709:
--

Attachment: ramDirSizeManagement.patch

 [PATCH] Enable application-level management of IndexWriter.ramDirectory size
 

 Key: LUCENE-709
 URL: http://issues.apache.org/jira/browse/LUCENE-709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.0.1
 Environment: All
Reporter: Chuck Williams
 Attachments: ramDirSizeManagement.patch, ramDirSizeManagement.patch


 IndexWriter currently only supports bounding of the in-memory index cache 
 using maxBufferedDocs, which limits it to a fixed number of documents.  When 
 document sizes vary substantially, especially when documents cannot be 
 truncated, this leads either to inefficiencies from a too-small value or 
 OutOfMemoryErrors from a too-large value.
 This simple patch exposes IndexWriter.flushRamSegments(), and provides access 
 to size information about IndexWriter.ramDirectory so that an application can 
 manage this based on the total number of bytes consumed by the in-memory cache, 
 thereby allowing a larger number of smaller documents or a smaller number of 
 larger documents.  This can lead to much better performance while eliminating 
 the possibility of OutOfMemoryErrors.
 The actual job of managing to a size constraint, or any other constraint, is 
 left up to the application.
 The addition of synchronized to flushRamSegments() is only for safety of an 
 external call.  It has no significant effect on internal calls since they all 
 come from a synchronized caller.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Dynamically varying maxBufferedDocs

2006-11-09 Thread Chuck Williams
Hi All,

Does anybody have experience dynamically varying maxBufferedDocs?  In my
app, I can never truncate docs and so work with maxFieldLength set to
Integer.MAX_VALUE.  Some documents are large, over 100 MBytes.  Most
documents are tiny.  So a fixed value of maxBufferedDocs to avoid OOM's
is too small for good ongoing performance.

It appears to me that the merging code will work fine if the initial
segment sizes vary.  E.g., a simple solution is to make
IndexWriter.flushRamSegments() public and manage this externally (for
which I already have all the needed apparatus, including size
information, the necessary thread synchronization, etc.).

A better solution might be to build a size-management option into the
maxBufferedDocs mechanism in lucene, but at least for my purposes, that
doesn't appear necessary as a first step.

My main concern is that the mergeFactor escalation merging logic will
somehow behave poorly in the presence of dynamically varying initial
segment sizes.

I'm going to try this now, but am wondering if anybody has tried things
along these lines and might offer useful suggestions or admonitions.

Thanks for any advice,

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Dynamically varying maxBufferedDocs

2006-11-09 Thread Chuck Williams
Thanks Yonik!  Poor wording on my part.  I won't vary maxBufferedDocs,
just am making flushRamSegments() public and calling it externally
(properly synchronized), earlier than it would otherwise be called from
ongoing addDocument-driven merging.

Sounds like this should work.

Chuck


Yonik Seeley wrote on 11/09/2006 08:37 AM:
 On 11/9/06, Chuck Williams [EMAIL PROTECTED] wrote:
 My main concern is that the mergeFactor escalation merging logic will
 somehow behave poorly in the presence of dynamically varying initial
 segment sizes.

 Things will work as expected with varying segments sizes, but *not*
 varying maxBufferedDocuments.  The level of a segment is defined by
 maxBufferedDocuments.

 If there were a solution to flush early w/o maxBufferedDocuments
 changing, things would work fine.

 -Yonik
 http://incubator.apache.org/solr Solr, the open-source Lucene search
 server

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Dynamically varying maxBufferedDocs

2006-11-09 Thread Chuck Williams

Yonik Seeley wrote on 11/09/2006 08:50 AM:
 For best behavior, you probably want to be using the current
 (svn-trunk) version of Lucene with the new merge policy.  It ensures
 there are mergeFactor segments with size <= maxBufferedDocs before
 triggering a merge.  This makes for faster indexing in the presence of
 deleted docs or partially full segments.


I've got quite a few local patches unfortunately.  It will take a while
to sync up.  If I don't already have this new logic, can I pick it up by
just merging with the latest IndexWriter or are the changes more extensive?

Thanks again,

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Dynamically varying maxBufferedDocs

2006-11-09 Thread Chuck Williams


Chuck Williams wrote on 11/09/2006 08:55 AM:
 Yonik Seeley wrote on 11/09/2006 08:50 AM:
   
 For best behavior, you probably want to be using the current
 (svn-trunk) version of Lucene with the new merge policy.  It ensures
 there are mergeFactor segments with size <= maxBufferedDocs before
 triggering a merge.  This makes for faster indexing in the presence of
 deleted docs or partially full segments.

 

 I've got quite a few local patches unfortunately.  It will take a while
 to sync up.  If I don't already have this new logic, can I pick it up by
 just merging with the latest IndexWriter or are the changes more extensive?
   
I must already have the new merge logic as the only diff between my
IndexWriter and latest svn is the change just made to make
flushRamSegments public.

Yonik, thanks for your help.  This should work well!

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Dynamically varying maxBufferedDocs

2006-11-09 Thread Chuck Williams
This sounds good.  Michael, I'd love to see your patch,

Chuck


Michael Busch wrote on 11/09/2006 09:13 AM:
 I had the same problem with large documents causing memory problems. I
 solved this problem by introducing a new setting in IndexWriter
 setMaxBufferSize(long). Now a merge is either triggered when
 bufferedDocs==maxBufferedDocs *or* the size of the bufferedDocs >=
 maxBufferSize. I made these changes based on the new merge policy
 Yonik mentioned, so if anyone is interested I could open a Jira issue
 and submit a patch.

 - Michael


 Yonik Seeley wrote:
 On 11/9/06, Chuck Williams [EMAIL PROTECTED] wrote:
 Thanks Yonik!  Poor wording on my part.  I won't vary maxBufferedDocs,
 just am making flushRamSegments() public and calling it externally
 (properly synchronized), earlier than it would otherwise be called from
 ongoing addDocument-driven merging.

 Sounds like this should work.

 Yep.
 For best behavior, you probably want to be using the current
 (svn-trunk) version of Lucene with the new merge policy.  It ensures
 there are mergeFactor segments with size <= maxBufferedDocs before
 triggering a merge.  This makes for faster indexing in the presence of
 deleted docs or partially full segments.

 -Yonik
 http://incubator.apache.org/solr Solr, the open-source Lucene search
 server

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Dynamically varying maxBufferedDocs

2006-11-09 Thread Chuck Williams
Michael Busch wrote on 11/09/2006 09:56 AM:

 This sounds good.  Michael, I'd love to see your patch,

 Chuck

 Ok, I'll probably need a few days before I can submit it (have to code
 unit tests and check if it compiles with the current head), because
 I'm quite busy with other stuff right now. But you will get it soon :-)

I've just written my patch and will submit it too once it is fully
tested.  I took this approach:

   1. Add sizeInBytes() to RAMDirectory
   2. Make flushRamSegments() plus new numRamDocs() and ramSizeInBytes()
  public in IndexWriter


This does not provide the facility in IndexWriter, but it does provide a
nice api to manage this externally.  I didn't do it in IndexWriter for
two reasons:

   1. I use ParallelWriter, which has to manage this differently
   2. There is no general mechanism in lucene to size documents.  I have
  an interface for my readers in reader-valued fields to
  support this.


In general, there are things the application knows that lucene doesn't
know that help to manage the size bounds.
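
For illustration, here is a minimal sketch of the kind of external management
this enables.  It assumes the patched public IndexWriter methods described
above (flushRamSegments(), numRamDocs(), ramSizeInBytes()); the class name and
limits are arbitrary illustrative choices:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Sketch only: flush the buffered ram segments whenever the application's
// byte or document limits are exceeded, instead of relying solely on
// maxBufferedDocs.
public class SizeBoundedIndexer {
    private static final long MAX_RAM_BYTES = 32 * 1024 * 1024;  // app-chosen bound
    private static final int  MAX_RAM_DOCS  = 10000;             // app-chosen bound

    private final IndexWriter writer;

    public SizeBoundedIndexer(IndexWriter writer) {
        this.writer = writer;
    }

    public synchronized void add(Document doc) throws IOException {
        writer.addDocument(doc);
        if (writer.ramSizeInBytes() >= MAX_RAM_BYTES
                || writer.numRamDocs() >= MAX_RAM_DOCS) {
            writer.flushRamSegments();   // force the buffered documents to disk
        }
    }
}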

Chuck





[jira] Created: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-09 Thread Chuck Williams (JIRA)
[PATCH] Enable application-level management of IndexWriter.ramDirectory size


 Key: LUCENE-709
 URL: http://issues.apache.org/jira/browse/LUCENE-709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.0.1
 Environment: All
Reporter: Chuck Williams


IndexWriter currently only supports bounding of the in-memory index cache 
using maxBufferedDocs, which limits it to a fixed number of documents.  When 
document sizes vary substantially, especially when documents cannot be 
truncated, this leads either to inefficiencies from a too-small value or 
OutOfMemoryErrors from a too-large value.

This simple patch exposes IndexWriter.flushRamSegments(), and provides access 
to size information about IndexWriter.ramDirectory so that an application can 
manage this based on the total number of bytes consumed by the in-memory cache, 
thereby allowing a larger number of smaller documents or a smaller number of 
larger documents.  This can lead to much better performance while eliminating 
the possibility of OutOfMemoryErrors.

The actual job of managing to a size constraint, or any other constraint, is 
left up to the application.

The addition of synchronized to flushRamSegments() is only for safety of an 
external call.  It has no significant effect on internal calls since they all 
come from a synchronized caller.








[jira] Updated: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-09 Thread Chuck Williams (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-709?page=all ]

Chuck Williams updated LUCENE-709:
--

Attachment: ramDirSizeManagement.patch

 [PATCH] Enable application-level management of IndexWriter.ramDirectory size
 

 Key: LUCENE-709
 URL: http://issues.apache.org/jira/browse/LUCENE-709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.0.1
 Environment: All
Reporter: Chuck Williams
 Attachments: ramDirSizeManagement.patch


 IndexWriter currently only supports bounding of the in-memory index cache 
 using maxBufferedDocs, which limits it to a fixed number of documents.  When 
 document sizes vary substantially, especially when documents cannot be 
 truncated, this leads either to inefficiencies from a too-small value or 
 OutOfMemoryErrors from a too-large value.
 This simple patch exposes IndexWriter.flushRamSegments(), and provides access 
 to size information about IndexWriter.ramDirectory so that an application can 
 manage this based on the total number of bytes consumed by the in-memory cache, 
 thereby allowing a larger number of smaller documents or a smaller number of 
 larger documents.  This can lead to much better performance while eliminating 
 the possibility of OutOfMemoryErrors.
 The actual job of managing to a size constraint, or any other constraint, is 
 left up to the application.
 The addition of synchronized to flushRamSegments() is only for safety of an 
 external call.  It has no significant effect on internal calls since they all 
 come from a synchronized caller.







Re: ParallelMultiSearcher reimplementation

2006-11-05 Thread Chuck Williams
Doug Cutting wrote on 11/03/2006 12:18 PM:
 Chuck Williams wrote:
 Why would a thread pool be more controversial?  Dynamically creating and
 discarding threads has many downsides.

 The JVM already pools native threads, so mostly what's saved by thread
 pools is the allocation and initialization of new Thread instances. 
 There are also downsides to thread pools.  They alter ThreadLocal
 semantics and generally add complexity that may not be warranted.

 Like most optimizations, use of thread pools should be motivated by
 benchmarks.

I followed this same logic in ParallelWriter and got burned.  My first
implementation (still the version submitted as a patch in jira) used
dynamic threads to add the subdocuments to the parallel subindexes
simultaneously.  This hit a problem with abnormal native heap OOM's in
the jvm.  At first I thought it was simply a thread stack size / java
heap size configuration issue, but adjusting these did not resolve the
issue.  This was on linux.  ps -L showed large numbers of defunct
threads.  jconsole showed enormous growing total-ever-allocated thread
counts.  I switched to a thread pool and the issue went away with the
same config settings.
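
For context, here is a minimal sketch (not the actual ParallelWriter code) of
the thread-pool variant using java.util.concurrent; the pool size and class
name are arbitrary illustrative choices:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Sketch only: add one sub-document per sub-index in parallel on a shared
// pool, then block until every add has completed (or failed).
class PooledParallelAdder {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    void addDocument(IndexWriter[] subWriters, Document[] subDocs) throws Exception {
        List<Future<Object>> futures = new ArrayList<Future<Object>>();
        for (int i = 0; i < subWriters.length; i++) {
            final IndexWriter writer = subWriters[i];
            final Document doc = subDocs[i];
            futures.add(pool.submit(new Callable<Object>() {
                public Object call() throws Exception {
                    writer.addDocument(doc);   // each sub-index gets its slice of the fields
                    return null;
                }
            }));
        }
        for (Future<Object> future : futures) {
            future.get();                      // propagate any failure to the caller
        }
    }
}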

So, I'm not convinced the jvm does such a good job at pooling OS native
threads.

Re. ThreadLocals, I agree the semantics are different, but arguably they
are most useful with thread pools.  With dynamic threads, you get a
reallocation every time, while with thread pools you avoid constant
reallocations.

Chuck





Re: ParallelMultiSearcher reimplementation

2006-11-03 Thread Chuck Williams
Chris Hostetter wrote on 11/03/2006 09:40 AM:
 : Is there any timeline for when Java 1.5 packages will be allowed?

 I don't think i'll incite too much rioting to say no there is no
 timeline
 .. I may incite some rioting by saying my guess is 1.5 packages will be
 supported when the patches requiring them become highly desired.
   

Not being shy about inciting riots, the problem with this approach is
that people using Java 1.5 are discouraged from submitting patches to
begin with.


Doug Cutting wrote on 11/03/2006 08:39 AM:
 Please consider breaking these into separate patches, one to permit
 ParallelMultiSearcher w/ HitCollector to not be single-threaded, and
 another to re-implement things with a thread pool.  The latter is more
 controversial, and it would be a shame to have the former wait on it. 

Why would a thread pool be more controversial?  Dynamically creating and
discarding threads has many downsides.

Chuck





Re: Include BM25 in Lucene?

2006-10-17 Thread Chuck Williams
Vic Bancroft wrote on 10/17/2006 02:44 AM:
 In some of my group's usage of lucene over large document collections,
 we have split the documents across several machines.  This has led to
 a concern of whether the inverse document frequency was appropriate,
 since the score seems to be dependent on the partitioning of documents
 over indexing hosts.  We have not formulated an experiment to
 determine if it seriously affects our results, though it has been
 discussed.

What version of Lucene are you using?  Are you using
ParallelMultiSearcher to manage the distributed indexes or have you
implemented your own mechanism?  There was a bug a couple years ago, in
the 1.4.3 version as I recall, where ParallelMultiSearcher was not
computing df's appropriately, but that has been fixed for a long time
now.  The df's are the sum of the df's from each distributed index and
thus are independent of the partitioning.

Chuck





Re: Ferret's changes

2006-10-11 Thread Chuck Williams

David Balmain wrote on 10/10/2006 08:53 PM:
 On 10/11/06, Chuck Williams [EMAIL PROTECTED] wrote:

 I personally would always store term vectors since I use a
 StandardTokenizer and Stemming. In this case highlighting matches in
 small documents is not trivial. Ferret's highlighter matches even
 sloppy phrase queries and phrases with gaps between the terms
 correctly. I couldn't do this without the use of term vectors.

I use stemming as well, but am not yet matching phrases like that. 
Perhaps term vectors will be useful to achieve this, although they come
at a high cost and it doesn't seem difficult or expensive to do the
matching directly on the text of small items.

 I suppose it would be possible for the single conceptual field 'body' to
 be represented with two physical fields 'smallBody' and 'largeBody'
 where the former stores term vectors and the latter does not.

 If I really wanted to solve this problem I would use this solution. It
 is pretty easy to search multiple fields when I need to. Ferret's
 Query language even supports it:

smallBody|largeBody:phrase to search for

Couldn't agree more.  I have a number of extensions to Lucene's query
parser, including this for multiple fields:

{smallBody largeBody}:phrase to search for


 In the end, I think the benefits of my model far outweigh the costs.
 For me at least anyway.

Based on the performance figures so far, it seems they do!  I think
dynamic term vectors have a substantial benefit, but can easily be
implemented in a model where all field indexing properties are fixed.

Chuck





Re: Define end-of-paragraph

2006-10-03 Thread Chuck Williams

Reuven Ivgi wrote on 10/02/2006 09:32 PM:
 I want to divide a document to paragraphs, still having proximity search
 within each paragraph

 How can I do that?
   

Is your issue that you want the paragraphs to be in a single document,
but you want to limit proximity search to find matches only within a
single paragraph?  If so, you could parse your document into paragraphs
and when generating tokens for it place large gaps at the paragraph
boundaries.  Each Token in lucene has a startOffset and endOffset that
you can set as you generate Tokens inside TokenStream.next() for the
TokenStream returned by your Analyzer.  Those classes and methods are
all in org.apache.lucene.analysis.  Or alternatively, you could make
each paragraph a separate field value and use
Analyzer.getPositionIncrementGap() to achieve essentially the same thing
(except that your Documents could get unwieldy if you have that many
paragraphs).

If this is not what you are trying to do, then please explain your
objectives precisely.

Good luck,

Chuck





Re: Define end-of-paragraph

2006-10-03 Thread Chuck Williams
Hi Reuven,

In my haste last night, I pointed you at the wrong fields on Token. You
need to set the position to create inter-paragraph gaps, not the
offsets, so you want Token.setPositionIncrement() for that approach, or
Analyzer.getPositionIncrementGap() if you use the multi-field approach.
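
For reference, here is a minimal sketch of the getPositionIncrementGap()
variant, assuming each paragraph is added as a separate value of the same
field; the analyzer choice and the gap size of 1000 are arbitrary
illustrative choices:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Sketch only: insert a large position gap between successive field values so
// sloppy phrase/proximity queries cannot match across paragraph boundaries.
class ParagraphAnalyzer extends Analyzer {
    private final Analyzer delegate = new StandardAnalyzer();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return delegate.tokenStream(fieldName, reader);
    }

    public int getPositionIncrementGap(String fieldName) {
        return 1000;   // gap between paragraph values of the same field
    }
}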

You will likely have performance problems with Documents that have
thousands of fields, so I would not recommend that approach. Are you
only matching paragraphs rather than whole documents? If so, another
approach would be to make each paragraph a separate document. Then you
could store document and paragraph id's in separate fields and have all
the information you want.

If you need whole document matching, but want the paragraph number of
matches, one approach might be to use SpanQuery's together with a
position-encoding of paragraph numbers. E.g., place your paragraphs
starting at positions 0, 1, 2, 3, ... Then from the
positions on the spans you find, you can identify what paragraph you are in.

I'm sure you can come up with many other ways to represent this
information as well.

Hope this helps,

Chuck


Reuven Ivgi wrote on 10/02/2006 11:27 PM:
 Hello,
 To be more precise, the basic entity I am using is a document, each with
 paragraphs which may be up to a few thousand. I need the proximity search
 within a paragraph, yet I want to get as a search result the paragraph
 number also. Maybe defining each paragraph as a separate field is the
 best way.
 What do you think?
 Thanks in advance 

 Reuven Ivgi

 -Original Message-
 From: Chuck Williams [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, October 03, 2006 10:58 AM
 To: java-dev@lucene.apache.org
 Subject: Re: Define end-of-paragraph


 Reuven Ivgi wrote on 10/02/2006 09:32 PM:
   
 I want to divide a document to paragraphs, still having proximity
 
 search
   
 within each paragraph

 How can I do that?
   
 

 Is your issue that you want the paragraphs to be in a single document,
 but you want to limit proximity search to find matches only within a
 single paragraph?  If so, you could parse your document into paragraphs
 and when generating tokens for it place large gaps at the paragraph
 boundaries.  Each Token in lucene has a startOffset and endOffset that
 you can set as you generate Tokens inside TokenStream.next() for the
 TokenStream returned by your Analyzer.  Those classes and methods are
 all in org.apache.lucene.analysis.  Or alternatively, you could make
 each paragraph a separate field value and use
 Analyzer.getPositionIncrementGap() to achieve essentially the same thing
 (except that your Documents could get unwieldy if you have that many
 paragraphs).

 If this is not what you are trying to do, then please explain your
 objectives precisely.

 Good luck,

 Chuck




 __
 This email has been scanned by the MessageLabs Email Security System.
 For more information please visit http://www.messagelabs.com/email 
 __



   





Re: After kill -9 index was corrupt

2006-09-29 Thread Chuck Williams
Hi All,

I found this issue.  There is no problem in Lucene, and I'd like to
leave this thread with that assertion to avoid confusing future archive
searchers/readers.

The index was actually not corrupt at all.  I use ParallelReader and
ParallelWriter.  A kill -9 can leave the subindexes out of sync.  My
recovery code repairs this on restart by noticing the indexes are
out-of-sync, deleting the document(s) that were added to some
subindex(es) but not the other(s), then optimizing to resync the doc-ids.

The issue is that my bulk updater does not at present support compound
file format and the recovery code forgot to turn that off prior to the
optimize!  Thus a .cfs file was created, which confused the bulk updater
-- it did not see a segment that was inside the cfs.

Sorry for the false alarm and thanks to all who helped with the original
question/concern,

Chuck


Chuck Williams wrote on 09/11/2006 12:10 PM:
 I do have one module that does custom index operations.  This is my bulk
 updater.  It creates new index files for the segments it modifies and a
 new segments file, then uses the same commit mechanism as merging. 
 I.e., it copes its new segments file into segments with the commit
 lock only after all the new index files are closed.  In the problem
 scenario, I don't have any indication that the bulk updater was
 complicit but am of course fully exploring that possibility as well.

 The index was only reopened by the process after the kill -9 of the old
 process was completed, so there were not any threads still working on
 the old process.

 This remains a mystery.  Thanks for your analysis and suggestions.  If
 you have more ideas, please keep them coming!

 Chuck


 robert engels wrote on 09/11/2006 10:06 AM:
   
 I am not stating that you did not uncover a problem. I am only stating
 that it is not due to OS level caching.

 Maybe your sequence of events triggered a reread of the index, while
 some thread was still writing. The reread sees the 'unused segments'
 and deletes them, and then the other thread writes the updated
 'segments' file.

 From what you state, it seems that you are using some custom code for
 index writing? (Maybe the NewIndexModifier stuff)? Possibly there is
 an issue there. Do you maybe have your own cleanup code that attempts
 to remove unused segments from the directory? If so, that appears to
 be the likely culprit to me.

 On Sep 11, 2006, at 2:56 PM, Chuck Williams wrote:

 
 robert engels wrote on 09/11/2006 07:34 AM:
   
 A kill -9 should not affect the OS's writing of dirty buffers
 (including directory modifications). If this were the case, massive
 system corruption would almost always occur every time a kill -9 was
 used with any program.

 The only thing a kill -9 affects is user level buffering. The OS
 always maintains a consistent view of directory modifications and/or
 file modifications that were requested by programs.

 This entire discussion is pointless.

 
 Thanks everyone for your analysis.  It appears I do not have any
 explanation.  In my case, the process was in gc-limbo due to the memory
 leak and having butted up against its -Xmx.  The process was kill -9'd
 and then restarted.  The OS never crashed.  The server this is on is
 healthy; it has been used continually since this happened without being
 rebooted and no file system or any other issues.  When the process was
 killed, one thread was merging segments as part of flushing the ram
 buffer while closing the index, due to the prior kill -15.  When Lucene
 restarted, the segments file contained a segment name for which there
 were no corresponding index data files.

 Chuck



   

 



   





Re: After kill -9 index was corrupt

2006-09-11 Thread Chuck Williams


Paul Elschot wrote on 09/10/2006 09:15 PM:
 On Monday 11 September 2006 02:24, Chuck Williams wrote:
   
 Hi All,

 An application of ours under development had a memory leak that caused
 it to slow interminably.  On linux, the application did not respond to
 kill -15 in a reasonable time, so kill -9 was used to forcibly terminate
 it.  After this the segments file contained a reference to a segment
 whose index files were not present.  I.e., the index was corrupt and
 Lucene could not open it.

 A thread dump at the time of the kill -9 shows that Lucene was merging
 segments inside IndexWriter.close().  Since segment merging only commits
 (updates the segments file) after the newly merged segment(s) are
 complete, I expect this is not the actual problem.

 Could a kill -9 prevent data from reaching disk for files that were
 previously closed?  If so, then Lucene's index can become corrupt after
 kill -9.  In this case, it is possible that a prior merge created new
 segment index files, updated the segments file, closed everything, the
 segments file made it to disk, but the index data files and/or their
 directory entries did not.

 If this is the case, it seems to me that flush() and
 FileDescriptor.sync() are required on each index file prior to close()
 to guarantee no corruption.  Additionally a FileDescriptor.sync() is
 also probably required on the index directory to ensure the directory
 entries have been persisted.
 

 Shouldn't the sync be done after closing the files? I'm using sync in a
 (un*x) shell script after merges before backups. I'd prefer to have some
 more of this syncing built into Lucene because the shell sync syncs all
 disks which might be more than needed. So far I've had no problems,
 so there was no need to investigate further.
   
I believe FileDescriptor.sync() uses fsync and not sync on linux.  A
FileDescriptor is no longer valid after the stream is closed, so sync()
could not be done on a closed stream.  I think the correct protocol is
flush() the stream, sync() its FD, then close() it.
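
In code, the protocol amounts to something like this (sketch only; a real fix
would need to do this on every index output stream Lucene writes):

import java.io.FileOutputStream;
import java.io.IOException;

// Sketch only: flush user-level buffers, fsync via the FileDescriptor while
// the stream is still open, then close.
class DurableWrite {
    static void writeDurably(FileOutputStream out, byte[] bytes) throws IOException {
        out.write(bytes);
        out.flush();            // push any user-level buffering to the OS
        out.getFD().sync();     // ask the OS to force the file's data to the device
        out.close();            // the FileDescriptor is unusable after the stream is closed
    }
}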

Paul, do you know if kill -9 can create the situation where bytes from a
closed file never make it to disk in linux?  I think Lucene needs sync()
in any event to be robust with respect to OS crashes, but am wondering
if this explains my kill -9 problem as well.  It seems bogus to me that
a closed file's bytes would fail to be persisted unless the OS crashed,
but I can't find any other explanation and I can't find any definitive
information to affirm or refute this possible side effect of kill -9.

The issue I've got is that my index can never lose documents.  So I've
implemented journaling on top of Lucene where only the last
maxBufferedDocs documents are journaled and the whole journal is reset
after close().  My application has no way to know when the bytes make it
to disk, and so cannot manage its journal properly unless Lucene ensures
index integrity with sync()'s.

Chuck





Re: After kill -9 index was corrupt

2006-09-11 Thread Chuck Williams
robert engels wrote on 09/11/2006 07:34 AM:
 A kill -9 should not affect the OS's writing of dirty buffers
 (including directory modifications). If this were the case, massive
 system corruption would almost always occur every time a kill -9 was
 used with any program.

 The only thing a kill -9 affects is user level buffering. The OS
 always maintains a consistent view of directory modifications and/or
 file modifications that were requested by programs.

 This entire discussion is pointless.

Thanks everyone for your analysis.  It appears I do not have any
explanation.  In my case, the process was in gc-limbo due to the memory
leak and having butted up against its -Xmx.  The process was kill -9'd
and then restarted.  The OS never crashed.  The server this is on is
healthy; it has been used continually since this happened without being
rebooted and no file system or any other issues.  When the process was
killed, one thread was merging segments as part of flushing the ram
buffer while closing the index, due to the prior kill -15.  When Lucene
restarted, the segments file contained a segment name for which there
were no corresponding index data files.

Chuck





Re: After kill -9 index was corrupt

2006-09-11 Thread Chuck Williams
I do have one module that does custom index operations.  This is my bulk
updater.  It creates new index files for the segments it modifies and a
new segments file, then uses the same commit mechanism as merging. 
I.e., it copies its new segments file into segments with the commit
lock only after all the new index files are closed.  In the problem
scenario, I don't have any indication that the bulk updater was
complicit but am of course fully exploring that possibility as well.

The index was only reopened by the process after the kill -9 of the old
process was completed, so there were not any threads still working on
the old process.

This remains a mystery.  Thanks for your analysis and suggestions.  If
you have more ideas, please keep them coming!

Chuck


robert engels wrote on 09/11/2006 10:06 AM:
 I am not stating that you did not uncover a problem. I am only stating
 that it is not due to OS level caching.

 Maybe your sequence of events triggered a reread of the index, while
 some thread was still writing. The reread sees the 'unused segments'
 and deletes them, and then the other thread writes the updated
 'segments' file.

 From what you state, it seems that you are using some custom code for
 index writing? (Maybe the NewIndexModifier stuff)? Possibly there is
 an issue there. Do you maybe have your own cleanup code that attempts
 to remove unused segments from the directory? If so, that appears to
 be the likely culprit to me.

 On Sep 11, 2006, at 2:56 PM, Chuck Williams wrote:

 robert engels wrote on 09/11/2006 07:34 AM:
 A kill -9 should not affect the OS's writing of dirty buffers
 (including directory modifications). If this were the case, massive
 system corruption would almost always occur every time a kill -9 was
 used with any program.

 The only thing a kill -9 affects is user level buffering. The OS
 always maintains a consistent view of directory modifications and/or
 file modifications that were requested by programs.

 This entire discussion is pointless.

 Thanks everyone for your analysis.  It appears I do not have any
 explanation.  In my case, the process was in gc-limbo due to the memory
 leak and having butted up against its -Xmx.  The process was kill -9'd
 and then restarted.  The OS never crashed.  The server this is on is
 healthy; it has been used continually since this happened without being
 rebooted and no file system or any other issues.  When the process was
 killed, one thread was merging segments as part of flushing the ram
 buffer while closing the index, due to the prior kill -15.  When Lucene
 restarted, the segments file contained a segment name for which there
 were no corresponding index data files.

 Chuck











After kill -9 index was corrupt

2006-09-10 Thread Chuck Williams
Hi All,

An application of ours under development had a memory leak that caused
it to slow interminably.  On linux, the application did not respond to
kill -15 in a reasonable time, so kill -9 was used to forcibly terminate
it.  After this the segments file contained a reference to a segment
whose index files were not present.  I.e., the index was corrupt and
Lucene could not open it.

A thread dump at the time of the kill -9 shows that Lucene was merging
segments inside IndexWriter.close().  Since segment merging only commits
(updates the segments file) after the newly merged segment(s) are
complete, I expect this is not the actual problem.

Could a kill -9 prevent data from reaching disk for files that were
previously closed?  If so, then Lucene's index can become corrupt after
kill -9.  In this case, it is possible that a prior merge created new
segment index files, updated the segments file, closed everything, the
segments file made it to disk, but the index data files and/or their
directory entries did not.

If this is the case, it seems to me that flush() and
FileDescriptor.sync() are required on each index file prior to close()
to guarantee no corruption.  Additionally a FileDescriptor.sync() is
also probably required on the index directory to ensure the directory
entries have been persisted.

A power failure or other operating system crash could cause this, not
just kill -9.

Does this seem like a possible explanation and fix for what happened? 
Could the same kind of problem happen on Windows?

If this is the issue, then how would people feel about having Lucene do
sync()'s a) always? or b) as an index configuration option?

I need to fix whatever happened and so would submit a patch to resolve it.

Thanks for advice and suggestions,

Chuck





Re: Combining search steps without re-searching

2006-08-28 Thread Chuck Williams
I presume your search steps are anded, as in typical drill-downs?

From  a Lucene standpoint, each sequence of steps is a BooleanQuery of
required clauses, one for each step.  To add a step, you extend the
BooleanQuery with a new clause.  To not re-evaluate the full query,
you'd need some query that regenerated the results of the prior step
more efficiently than BooleanQuery.  For example, if you happened to
generate the entire result set for each step, presumably not feasible,
then the results might be cached for regeneration.  Assuming you cannot
generate the entire result set, it's not obvious to me how having
partially generated S1 and ... Sn-1 will help you generate S1 and ... Sn
any faster.
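
To make the BooleanQuery formulation concrete, here is a small sketch (field
and term names are purely illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;

// Sketch only: each drill-down step adds one required clause to the same
// BooleanQuery, and the combined query is re-executed for that step.
class DrillDown {
    static Hits step(Searcher searcher, BooleanQuery soFar,
                     String field, String value) throws Exception {
        soFar.add(new TermQuery(new Term(field, value)), BooleanClause.Occur.MUST);
        return searcher.search(soFar);   // re-evaluates the whole conjunction
    }
}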

You will already get the benefit of OS caching with Lucene as it
stands.  You might find further caching extensions to the query types you
use to be a performance gain that achieves what you want.  You might
also consider some kind of query optimization by extending the rewrite()
methods.

Chuck


Fernando Mato Mira wrote on 08/28/2006 12:21 AM:
 Hello,

  We think we would have a problem if we try to use lucene because we
 do search combinations which might have hundreds of steps, so creating
 a combined query and executing again each time might be a problem.
  What would it entail to overhaul Lucene to do search combinations by
 taking advantage of the results already generated?

 Thanks







Re: Combining search steps without re-searching

2006-08-28 Thread Chuck Williams


Andrzej Bialecki wrote on 08/28/2006 09:19 AM:
 Chuck Williams wrote:
 I presume your search steps are anded, as in typical drill-downs?

 From  a Lucene standpoint, each sequence of steps is a BooleanQuery of
 required clauses, one for each step.  To add a step, you extend the
 BooleanQuery with a new clause.  To not re-evaluate the full query,
   

 ... umm, guys, wouldn't a series of QueryFilter's work much better in
 this case? If some of the clauses are repeatable, then filtering
 results through a cached BitSet in such filtered query would work
 nicely, right?

If the possible initial steps comprise a small finite set, I could see
that as a winner.  In my app for instance, the drill-down selectors are
dynamic and drawn from a large set of possibilities.  It's hard to see
how any small set of filters would be much of a benefit.  A large set of
filters would consume too much space.  For a 10 million document node, at
1.25 megabytes per filter (one bit per document), even a couple hundred
filters add up to something significant.

As I understand things, filters take considerably more time to initially
create but then can more than make this up through repetitive use.  So
they are a winner iff there are a small number of specific steps that
are frequently and disproportionately used.
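
For completeness, here is a sketch of the cached-filter case where it does
win; field and term names are illustrative, and QueryFilter caches its BitSet
per reader, so the filter's cost is paid once per index version:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;

// Sketch only: keep one long-lived filter for a frequently used drill-down
// step and combine it with the varying user query.
class FrequentStep {
    static final Filter IN_STOCK =
            new QueryFilter(new TermQuery(new Term("inStock", "true")));

    static Hits search(Searcher searcher, Query userQuery) throws Exception {
        return searcher.search(userQuery, IN_STOCK);
    }
}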

Chuck





[jira] Created: (LUCENE-659) [PATCH] PerFieldAnalyzerWrapper fails to implement getPositionIncrementGap()

2006-08-17 Thread Chuck Williams (JIRA)
[PATCH] PerFieldAnalyzerWrapper fails to implement getPositionIncrementGap()


 Key: LUCENE-659
 URL: http://issues.apache.org/jira/browse/LUCENE-659
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.0.1, 2.1
 Environment: Any
Reporter: Chuck Williams
 Attachments: PerFieldAnalyzerWrapper.patch

The attached patch causes PerFieldAnalyzerWrapper to delegate calls to 
getPositionIncrementGap() to the analyzer that is appropriate for the field in 
question.  The current behavior without this patch is to always use the default 
value from Analyzer, which is a bug because PerFieldAnalyzerWrapper should 
behave just as if it was the analyzer for the selected field.
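
A sketch of the intended behavior, shown as a stand-alone wrapper rather than
the real PerFieldAnalyzerWrapper (whose internal field names may differ); see
the attached patch for the actual change:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;

// Sketch only: getPositionIncrementGap() must be routed to the per-field
// analyzer instead of falling back to Analyzer's default implementation.
abstract class DelegatingWrapperSketch extends Analyzer {
    protected final Map analyzerMap = new HashMap();   // field name -> Analyzer
    protected Analyzer defaultAnalyzer;

    public int getPositionIncrementGap(String fieldName) {
        Analyzer analyzer = (Analyzer) analyzerMap.get(fieldName);
        if (analyzer == null) {
            analyzer = defaultAnalyzer;
        }
        return analyzer.getPositionIncrementGap(fieldName);
    }
}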








Strange behavior of positionIncrementGap

2006-08-11 Thread Chuck Williams
Hi All,

There is a strange treatment of positionIncrementGap in
DocumentWriter.invertDocument().  The gap is inserted between all
values of a field, except it is not inserted between values if the
prefix of the value list up to that point has not yet generated a token.

For example, if a field F has values A, B and C the following example
cases arise:
  1.  A and B both generate no tokens == no positionIncrementGaps are
generated
  2.  A has no tokens but B does == just the gap between B and C
  3.  A has tokens but B and C do not == both gaps between A and B, and
between B and C are generated

So, empty fields are treated anomalously.  They are ignored for gap
purposes at the beginning of the field list, but included if they occur
later in the field list.

This issue caused a subtle bug in my bulk update operation because to 
modify values and update the postings it must reanalyze them with
precisely the same positions used when they were originally indexed. 
So, I had to match this previously unnoticed strange behavior.

I could post a patch to fix this, but am concerned it might introduce
upward incompatibilities in various implementations and applications
that are dependent on Lucene index format.  If that is not a concern in
this case, please let me know and I'll post a patch.  I at least wanted
to report it.

Chuck





Re: Strange behavior of positionIncrementGap

2006-08-11 Thread Chuck Williams

Chris Hostetter wrote on 08/11/2006 09:08 AM:
 (using lower case
 to indicate no tokens produced and upper case to indicate tokens were
 produced) ...

 1) a b C _gap_ D ...results in:  C _gap_ D
 2) a B _gap_ C _gap_ D   ...results in:  B _gap_ C _gap_ D
 3) A _gap_ b _gap_ c _gap_ D ...results in:  A _double_gap_ D

 ...is that the behavior you are seeing?
   
Almost.  The only difference is that case 3 has 3 gaps, so it's A
_triple_gap_ D.
 Only case #3 seems wrongish to me there. ... i started to explain why i
 thought it made sense to go ahead and fix this, where by fix i meant only
 insert one gap in case#3 ... and then realized i was actually arguing in
 favor of the current behavior for case#3, here is why...

based on the semi-frequently discussed usage of token gap sizes to
denote sentence/paragraph/page boundaries for the purpose of sloppy
phrase queries, it certainly seems worthwhile to fix to me (so that
queries like find Erik within 3 pages of Otis still work even if one
of those pages is blank ...

 ...that's when i realized the current behavior of case#3 is actually
 important for accurate matching, otherwise a search for two words within a
 certain number of pages would have a false match if those pages were
 blank.  case #1 seems fine, but case #2 seems like the wrong case to me
 now, because trying to find occurrences of B on page #1 using a
 SpanFirst query will have false positives ... it seems like the
 positionIncrementGap should always be called/used after any field value is
 added (even if the value results in no tokens) before the next value is
 added (even if that value results in no tokens)


 Does this jive with what you were expecting, and the patch you were
 considering?
   
Precisely.  The same concern about SpanFirstQuery also applies to case
1.  My bulk update code was always generating the positionIncrementGap
between all field values, so if there are 4 values it would always
generate 3 gaps independent of whether or not the values generate
tokens.  For your cases it generated:

1) a b C D ...results in:  _gap_ _gap_ C _gap_ D
2) a B C D ...results in:  _gap_ B _gap_ C _gap_ D
3) A b c D ...results in:  A _gap_ _gap_ _gap_ D


This seems a natural behavior and is consistent with the use cases you
describe (which are essentially the same reason I'm using gaps, and
presumably the main purpose of gaps).

Hoss, do you think it would be ok to fix given the potential upward
incompatibility for index-format-dependent implementations?

Chuck





Re: Using Lucene for Semantic search

2006-07-20 Thread Chuck Williams
I have built such a system, although not with Lucene at the time.  I
doubt you need to modify anything in Lucene to achieve this.

You may want to index words, stems and/or concepts from the ontology. 
Concepts from the ontology may relate to words or phrases.  Lucene's
token structure is flexible, supporting all of these.  E.g., you can
create your own Analyzer that looks up words and phrases in your
ontology and then generates appropriate concept tokens that supplement
the word/stem tokens.  Concept tokens can similarly span phrases. 
Presuming you want some kind of word sense disambiguation through
context, you can either integrate your model into the Analyzer or create
a separate pre-processor.
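
For instance, here is a minimal sketch of such an analysis hook as a
TokenFilter; the ontology lookup is reduced to a word-to-concept Map, and
everything here is illustrative rather than a proposed API:

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Sketch only: inject a concept token from the ontology at the same position
// as the word token it came from, so queries can match either the word or
// the concept.  'ontology' is an assumed word -> concept-id map.
class ConceptTokenFilter extends TokenFilter {
    private final Map ontology;      // e.g. "dog" -> "concept:canine"
    private Token pendingConcept;

    ConceptTokenFilter(TokenStream input, Map ontology) {
        super(input);
        this.ontology = ontology;
    }

    public Token next() throws IOException {
        if (pendingConcept != null) {            // emit a queued concept token first
            Token concept = pendingConcept;
            pendingConcept = null;
            return concept;
        }
        Token word = input.next();
        if (word == null) return null;
        String concept = (String) ontology.get(word.termText());
        if (concept != null) {
            pendingConcept = new Token(concept, word.startOffset(), word.endOffset());
            pendingConcept.setPositionIncrement(0);   // stack on top of the word token
        }
        return word;
    }
}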

The same Analyzer or a variant of it could be used to map the Query into
tokens to search.  This would support concept--concept searches, useful
for example in cross-language search.

Word sense disambiguation is generally more difficult in typically short
queries, so there are alternatives worth considering.  E.g., you could
expand queries (or index tokens) into the full set of possibilities
(synonym words or concepts).  If you have an a-priori or contextual
ranking of those possibilities, you can generate boosts in Lucene to
reflect that.

If all you want is ontologic search, there are your hooks.  If you want
more sophisticated query transformations, e.g. for natural language QA,
you probably want a custom query pre-processor to generate the specific
queries you want.

Hope these thoughts are useful,

Chuck


Chris Wildgoose wrote on 07/20/2006 11:19 AM:
 I have been working with Lucene for some time, and I have an interest in 
 developing a Semantic Search solution. I was looking into extending lucene 
 for this. I know this would involve some significant re-engineering of the 
 indexing procedure to support the ability to assign words to nodes within an 
 ontology. In addition the query would need to be modified. I was wondering 
 whether anyone out there had gone down this path? 


 Chris


   





Re: Lucene/Netbean Newbie looking for help

2006-07-11 Thread Chuck Williams
Hi Peter,

I'm also a Netbeans user, albeit a very happy one who would never
consider eclipse!

The following sequence of steps has worked for me in netbeans 4.0 and
5.0 (haven't upgraded to 5.5 quite yet).  The reason for the unusual
directory structure is that Lucene's interleaving of the core and the
various contribs within a single directory tree is incompatible with
netbeans standard assumptions.  This is worked around by having all the
project files external to the Lucene directory tree; each can point at
its build script, source package, etc., in the same directory tree.

   1. Create a parent directory for all of your projects, say Projects.
   2. Check lucene out of svn into Projects/LuceneTrunk.
   3. Create new netbeans projects for core and whatever contribs you use, all
  parallel to Projects/LuceneTrunk.  E.g., Projects/Lucene (the
  core), Projects/Highlighter, Projects/Snowball, etc..  For each
  project (e.g., Lucene), do:
 1. File - New Project - General - Java Project with Existing
Ant Script
 2. Set the project location:  Projects/LuceneTrunk
 3. Set the build script (defaults correctly): 
../LuceneTrunk/build.xml
 4. Set the project name:  Lucene
 5. Set the project location:  Projects/Lucene
 6. Update the ant targets (build == jar, not compile; rest are
correct; add custom targets for jar-demo, javacc, javadocs
and docs)
 7. Set the source package folders:  ../LuceneTrunk/src/java
 8. Set the test package folders:  ../LuceneTrunk/src/test and
../LuceneTrunk/src/demo
 9. Finish (no classpath settings)
10. Build the source (Lucene project context menu - Build)
11. Set the class path for src/demo (Lucene context menu -
Properties - Java Sources Classpath - select src/demo - Add
Jar/Folder LuceneTrunk/build/lucene-core-version-dev.jar
12. Build the demos (Lucene context menu - jar-demo)
13. Set the classpath for src/test (as above, add both the core
jar and the demo jar)
14. Now run the tests (Lucene context menu - Test Project)

All works great.  From here on, all netbeans features are available
(debugging, refactoring, code database, completion, ...)

You can also of course run ant from the command line, should you ever
want to.

Good luck,

Chuck


peter decrem wrote on 07/10/2006 07:05 PM:
 I am trying to contribute to the dot lucene port, but
 I am having no luck in getting the tests to compile
 and debug for the java version.  I tried eclipse and
 failed and now I am stuck in Netbean.

 More specifically I am using Netbean 5.5 (same
 problems with 5.0).  My understanding is that it comes
 with junit standard (3.8).  I did create a
 build.properties file for javacc.  It compiles but I
 get the following error when I run the tests:

 compile-core:
 compile-demo:
 common.compile-test:
 compile-test:
 test:
 C:\lucene-1.9.1\common-build.xml:169:
 C:\lucene-1.9.1\lib not found.
 BUILD FAILED (total time: 0 seconds)

 The relevant code in common-build.xml is:

   <target name="test" depends="compile-test"
           description="Runs unit tests">
     <fail unless="junit.present">
       ##
       JUnit not found.
       Please make sure junit.jar is in ANT_HOME/lib, or made available
       to Ant using other mechanisms like -lib or CLASSPATH.
       ##
     </fail>
     <mkdir dir="${junit.output.dir}"/>
     <junit printsummary="off" haltonfailure="no"
line 169 XX->  errorProperty="tests.failed" failureProperty="tests.failed">
       <classpath refid="junit.classpath"/>
       <!-- TODO: create propertyset for test properties,
            so each project can have its own set -->
       <sysproperty key="dataDir" file="src/test"/>
       <sysproperty key="tempDir" file="${build.dir}/test"/>


 Any suggestions?  Or any pointers to getting the tests
 to work in netbeans are appreciated.




   

-- 
*Chuck Williams*
Manawiz
Principal
V: (808)885-8688
C: (415)846-9018
[EMAIL PROTECTED]
Skype: manawiz
AIM: hawimanawiz
Yahoo: jcwxx




Re: Global field semantics

2006-07-10 Thread Chuck Williams
David Balmain wrote on 07/10/2006 01:04 AM:
 The only problem I could find with this solution is that
 fields are no longer in alphabetical order in the term dictionary but
 I couldn't think of a use-case where this is necessary although I'm
 sure there probably is one.

So presumably fields are still contiguous, you keep a pointer to where
each field starts, and terms within the field remain in alphabetical order?

Chuck






Re: Global field semantics

2006-07-10 Thread Chuck Williams
Chris Hostetter wrote on 07/10/2006 02:06 AM:
 As near as i can tell, the large issue can be summarized with the following
 sentiment:

   Performance gains could be realized if Field
   properties were made fixed and homogeneous for
   all Documents in an index.
   

This is certainly a large issue, as David says he has achieved a 5x
performance gain.

My interest in global field semantics originally sprang from
functionality considerations, not performance considerations.  I've got
many features that require reasoning about field semantics.  I
previously mentioned a very simple one:  validating fields in the query
parser.  More interesting examples are:

  1.  Multiple inheritance on the fields of documents that record the
sources of each inherited value to support efficient incremental maintenance
  2.  Record-valued fields that store facets with values (e.g., time
and user information for who set that value).  These cannot easily be
broken into multiple fields because the fields in question are multi-valued.
  3.  Join fields that reference id's of objects stored in separate
indices (supporting queries that reference the fields in the joined index)

Managing these kinds of rich semantic features in query parsing and
indexing is greatly facilitated by a global field model.  I've built
this into my app, and then started thinking about benefits in Lucene
generally from such a model.

   1) all Fields and their properties must be predeclared before any
  document is ever added to the index, and any Field not declared is
  illegal.
   2) a Field springs into existence the first time a Document is added
  with a value for it -- but after that all newly added Documents with
  a value for that field must conform to the Field properties initially
  used.

 (have I missed any general approaches?)
   

Yes.  Here is (an elaboration of) the global model with exceptions
idea we reached:

3) There is a global field model in Lucene that contains the list of
all known fields and their default semantics.  The class that contains
this model supports a number of implicit and explicit methods to
construct and query the model.  The model can be evolved.  The model is
used in many places in Lucene, in some cases according to
application-settable properties.  E.g.:
a) Creating a Field uses the properties of the model so they
need not be specified at each construction.  A global model property
determines whether or not field properties may be overridden, and
whether or not fields may be created that are not in the model (in which
case, they are automatically added to the model).
b) The query parser has hooks that affect Query generation based
on the model properties of the field (not just for certain special query
types like Term's and RangeQuery's).  The application can easily provide
methods to implement these hooks.  This is essential for features like
2 and 3 above (and beneficial for 1).

 How would something like this work?

   docA.add(new Field("f", "bar", Store.YES, Index.UN_TOKENIZED));
   docA.add(new Field("f", "foo", Store.NO,  Index.TOKENIZED));

   docB.add(new Field("f", "x y", Store.YES, Index.TOKENIZED));
   docB.add(new Field("f", "z",   Store.NO,  Index.UN_TOKENIZED));
   

The application could determine whether or not this kind of operation
was supported according to the global enforcement properties of the
model.  If this is needed, the ability to have exceptions at the Field
level would permit it.

Hoss, do you have a use case requiring Store and Index variance like this?

The impact of this flexibility on David's 5x is another question...

Chuck





Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-10 Thread Chuck Williams


Yonik Seeley wrote on 07/10/2006 09:27 AM:
 I'll rephrase my original question:
  When implementing NewIndexModifier, what type of efficiencies do we
 get by using the new protected methods of IndexWriter vs using the
 public APIs of IndexReader and IndexWriter?

I won't comment on Ning's implementation, but will comment wrt this
issue for related work I've done with bulk update.  I needed at least
package-level access to several of the private capabilities in the index
package (e.g., from SegmentMerger:  resetSkip(), bufferSkip(),
writeSkip(); from IndexWriter:  readDeletableFiles(),
writeDeletableFiles(); etc.).

I think the index package and its api's have not been designed from the
standpoint of update (batched delete/add or bulk), and are not nearly as
friendly to application-level specialization/customization as other
parts of Lucene.

As part of the new index representation being considered now, I hope
that these issues are addressed, and would be happy to participate in
addressing them (especially if gcj releases 1.5 and 1.5 code becomes
acceptable).

Chuck





[jira] Commented: (LUCENE-509) Performance optimization when retrieving a single field from a document

2006-07-09 Thread Chuck Williams (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-509?page=comments#action_12419926 ] 

Chuck Williams commented on LUCENE-509:
---

LUCENE-545 does resolve this in a more general way, although the code to get 
precisely one field value efficiently is slightly clunky, requiring something 
like this (for a single-valued field):

final String seekfield = retrievefield.intern();
String value = reader.document(doc, new FieldSelector() {
    public FieldSelectorResult accept(String field) {
        if (field == seekfield)            // field names are interned, so == works
            return FieldSelectorResult.LOAD_AND_BREAK;
        else
            return FieldSelectorResult.NO_LOAD;
    }
}).get(seekfield);

Even with this, a Document, a Field and a FieldSelector are created 
unnecessarily.

There are important cases where fast single-field access matters.  E.g., I 
have cases where it is necessary to obtain the id field for all results of a 
query, leading to (an obviously refactored version of) the above code in a 
HitCollector.

I think some special optimization for the single-field access case makes sense 
if benchmarks show it is material, but that it should be integrated with the 
mechanism of LUCENE-545.

$0.02,

Chuck


 Performance optimization when retrieving a single field from a document
 ---

  Key: LUCENE-509
  URL: http://issues.apache.org/jira/browse/LUCENE-509
  Project: Lucene - Java
 Type: Improvement

   Components: Index
 Versions: 1.9, 2.0.0
 Reporter: Steven Tamm
 Assignee: Otis Gospodnetic
  Attachments: DocField.patch, DocField_2.patch, DocField_3.patch, 
 DocField_4.patch, DocField_4b.patch

 If you just want to retrieve a single field from a Document, the only way to 
 do it is to retrieve all the fields from the Document and then search it.  
 This patch is an optimization that allows you retrieve a specific field from 
 a document without instantiating a lot of field and string objects.  This 
 reduces our memory consumption on a per query basis by around 20% when 
 a lot of documents are returned.
 I've added a lot of comments saying you should only call it if you only ever 
 need one field.  There's also a unit test.






Re: Global field semantics

2006-07-09 Thread Chuck Williams
Marvin Humphrey wrote on 07/08/2006 11:13 PM:

 On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:

 Many things would be cleaner in Lucene if fields had a global semantics,
 i.e., if properties like text vs. binary, Index, Store, TermVector, the
 appropriate Analyzer, the assignment of Directory in ParallelReader (or
 ParallelWriter), etc. were a function of just the field name and the
 index.

 In June, Dave Balmain and I discussed the issue extensively on the
 Ferret list.  It might have been nice to use the Lucy list, since a
 lot of the discussion was about Lucy, but the Lucy lists didn't exist
 at the time.

 http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html

I think there are a number of problems with that proposal and hope it
was not adopted.  As my earlier example showed, there is at least one
valid use case where storing a term vector is not an invariant property
of a field; specifically, when using term vectors to optimize excerpt
generation, it is best to store them only for fields that have long
values.  This is even a counter-example to Karl's proposal, since a
single Document may have multiple fields of the same name, some with
long values and others with short values; multiple fields of the same
name may legitimately have different TermVector settings even on a
single Document.

As another counter-example from my own app which I'd forgotten
yesterday, an important case where the Analyzer will vary across
documents is for i18n, where different languages require different
analyzers.  Refuting again my own argument about this not being
consistent with query parsing, the language of the query is a distinct
property from the languages of various documents in the collection.  In
my app, I let the user specify the language of the query, while the
language of each Document is determined automatically.  So, analyzers
vary for both queries and documents, but independently.

I haven't thought of cases where Index or Store would legitimately vary
across Fields or Documents, but am less convinced there aren't important
use cases for these as well.  Similarly, although it is important to
allow term vectors to be on or off at the field level, I don't see any
obvious need to vary the type of term vector (positions, offsets or both).

There are significant benefits to global semantics, as evidenced by the
fact that several of us independently came to desire this.  However,
deciding what can be global and what cannot is more subtle.

Perhaps the best thing at the Lucene level is to have a notion of
default semantics for a field name.  Whenever a Field of that name is
constructed, those semantics would be used unless the constructor
overrides them.  This would allow additional constructors on Field with
simpler signatures for the common case of invariant Field properties. 
It would also allow applications to access the class that holds the
default field information for an index.  The application will know which
properties it can rely on as invariant and whether or not the set of
fields is closed.
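
To make the default-semantics idea concrete, here is an entirely hypothetical
sketch of what such a per-index registry could look like at the application
level; none of these names exist in Lucene:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.document.Field;

// Hypothetical sketch only: a registry of default semantics per field name,
// consulted when constructing Fields so the Store/Index/TermVector settings
// need not be repeated (or accidentally varied) at every construction site.
class FieldDefaults {
    static final class Spec {
        final Field.Store store;
        final Field.Index index;
        final Field.TermVector termVector;
        Spec(Field.Store store, Field.Index index, Field.TermVector termVector) {
            this.store = store;
            this.index = index;
            this.termVector = termVector;
        }
    }

    private final Map specs = new HashMap();   // field name -> Spec

    void declare(String name, Spec spec) {
        specs.put(name, spec);
    }

    Field newField(String name, String value) {
        Spec spec = (Spec) specs.get(name);
        if (spec == null) {
            throw new IllegalArgumentException("undeclared field: " + name);
        }
        return new Field(name, value, spec.store, spec.index, spec.termVector);
    }
}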

This approach would preserve upward compatibility and provide, I
believe, most of the benefits we all seek.

Thoughts?

Chuck





Re: Global field semantics

2006-07-09 Thread Chuck Williams
David Balmain wrote on 07/09/2006 06:44 PM:
 On 7/10/06, Chuck Williams [EMAIL PROTECTED] wrote:
 Marvin Humphrey wrote on 07/08/2006 11:13 PM:
 
  On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:
 
  Many things would be cleaner in Lucene if fields had a global
 semantics,
  i.e., if properties like text vs. binary, Index, Store,
 TermVector, the
  appropriate Analyzer, the assignment of Directory in
 ParallelReader (or
  ParallelWriter), etc. were a function of just the field name and the
  index.
 
  In June, Dave Balmain and I discussed the issue extensively on the
  Ferret list.  It might have been nice to use the Lucy list, since a
  lot of the discussion was about Lucy, but the Lucy lists didn't exist
  at the time.
 
  http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html
 
 I think there are a number of problems with that proposal and hope it
 was not adopted.

 Hi Chuck,

 Actually, it was adopted and I'm quite happy with the solution. I'd be
 very interested to hear what the number of problems are, besides the
 example you've already given. Even if you never use Ferret, it can
 only help me improve my software.

Hi David,

Thanks for your reply.

I'm not aware of other problems beyond the ones I've already cited. 
After thinking of these, my confidence that there were not others waned.


 I'll start by covering your term-vector example. By adding fixed
 index-wide field properties to Ferret I was able to obtain up to a
 huge speed improvement during indexing.

This is very interesting.  Can you say how much?

 With the CPU time I gain in Ferret I could
 easily re-analyze large fields and build term vectors for them
 separately. It's a little more work for less common use cases like
 yours but in the end, everyone benefits in terms of performance.

Does Ferret work this way, or would that be up to the application?


 As my earlier example showed, there is at least one
 valid use case where storing a term vector is not an invariant property
 of a field; specifically, when using term vectors to optimize excerpt
 generation, it is best to store them only for fields that have long
 values.  This is even a counter-example to Karl's proposal, since a
 single Document may have multiple fields of the same name, some with
 long values and others with short values; multiple fields of the same
 name may legitimately have different TermVector settings even on a
 single Document.

 I think you'll find if you look at the DocumentWriter#writePostings
 method that it's one in, all in in terms of storing term vectors for
 a field. That is, if you have 5 content fields and only one of those
 is set to store term vectors, then all of the fields will store term
 vectors.

Right you are, and clearly necessarily so since the values of the
multiple fields are implicitly concatenated (with
positionIncrementGap).  So, Lucene already limits my term vector
optimization to the Document level.  As it happens, I only use it for
large body fields, of which each of my Documents has at most one.


 I haven't thought of cases where Index or Store would legitimately vary
 across Fields or Documents, but am less convinced there aren't important
 use cases for these as well.  Similarly, although it is important to
 allow term vectors to be on or off at the field level, I don't see any
 obvious need to vary the type of term vector (positions, offsets or
 both).

 I think Store could definitely legitimately vary across Fields or
 Documents for the same reason your term vectors do. Perhaps you are
 indexing pages from the web and you want to cache only the smaller
 pages.

That's an interesting example, but not as compelling an objection to me
(and seemingly not to you either!).  The app could always store an empty
string without much consequence in this scenario.


 There are significant benefits to global semantics, as evidenced by the
 fact that several of us independently came to desire this.  However,
 deciding what can be global and what cannot is more subtle.

 I agree. I can't see global field semantics making it into Lucene in
 the short term. It's a rather large change, particularly if you want
 to make full use of the performance benefits it affords.

Could you summarize where these derive from?


 Perhaps the best thing at the Lucene level is to have a notion of
 default semantics for a field name.  Whenever a Field of that name is
 constructed, those semantics would be used unless the constructor
 overrides them.  This would allow additional constructors on Field with
 simpler signatures for the common case of invariant Field properties.
 It would also allow applications to access the class that holds the
 default field information for an index.  The application will know which
 properties it can rely on as invariant and whether or not the set of
 fields is closed.

 This approach would preserve upward compatibility and provide, I
 believe, most of the benefits we all seek.

 Thoughts?

 If this is all you

Global field semantics

2006-07-08 Thread Chuck Williams
Many things would be cleaner in Lucene if fields had a global semantics,
i.e., if properties like text vs. binary, Index, Store, TermVector, the
appropriate Analyzer, the assignment of Directory in ParallelReader (or
ParallelWriter), etc. were a function of just the field name and the
index.  This approach would naturally admit a class, say IndexFieldSet,
that would hold global field semantics for an index.
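
A bare-bones sketch of what such a class might look like (IndexFieldSet is only named above as a suggestion; this code and its method names are illustrative, not existing Lucene API):

import java.util.HashSet;
import java.util.Set;

// Sketch only: records the (possibly closed) set of field names for an
// index so that, for example, QueryParser could validate field names
// before searching instead of silently matching nothing.
public class IndexFieldSet {
  private final Set fieldNames = new HashSet();
  private boolean closed = false;

  public void addField(String name) {
    if (closed) {
      throw new IllegalStateException("field set is closed, cannot add: " + name);
    }
    fieldNames.add(name);
  }

  // After this, unknown field names can be rejected at parse time.
  public void close() {
    closed = true;
  }

  public boolean isKnownField(String name) {
    return fieldNames.contains(name);
  }
}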

Lucene today allows many field properties to vary at the Field level. 
E.g., the same field name might be tokenized in one Field on a Document
while it is untokenized in another Field on the same or different
Document.  Does anybody know how often this flexibility is used?  Are
there interesting use cases for which it is important?  It seems to me
this functionality is already problematic and not fully supported; e.g.,
indexing can manage tokenization-variant fields, but query parsing
cannot.  Various extensions to Lucene exacerbate this kind of problem.

Perhaps more controversially, the notion of global field semantics would
be even stronger if the set of fields is closed.  This would allow, for
example, QueryParser to validate field names.  This has a number of
benefits, including, for example, avoiding false-negative empty results
due to a misspelled field name.

Has this been considered before?  Are there good reasons this path has
not been followed?

Thanks for any info,

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Java 1.5 (was Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided))

2006-07-08 Thread Chuck Williams
Doug Cutting wrote on 07/08/2006 09:41 AM:
 Chuck Williams wrote:
 I only work in 1.5 and use its features extensively.  I don't think
 about 1.4 at all, and so have no idea how heavily dependent the code in
 question is on 1.5.

 Unfortunately, I won't be able to contribute anything substantial to
 Lucene so long as it has a 1.4 requirement.

 The 1.5 decision requires a consensus.  You're making ultimatums, which
 does not help to build consensus.  By stating an inflexible position
 you've become a fact that informs the process.

My statement was not intended as an ultimatum at all.  Rather, it is
simply a fact.  I prefer to contribute to Lucene, but my workload simply
does not allow time to be spent on backporting.


 I think we should try to minimize the number of inconvenienced people.
 Both developers and users are people.  Some developers are happy to
 continue in 1.4, adding new features that users who are confined to 1.4
 JVMs will be able to use.  Other developers will only contribute 1.5
 code, perhaps (unless we find a technical workaround) excluding users
 confined to 1.4 JVMs.  But it is difficult to compare the inconvenience
 of a developer who refuses to code back-compatibly to a user who is
 deprived new features.

Doug, respectfully, this issue is inflammatory in its nature.  I've
found a couple of your comments to be inflammatory, although I suspect
you did not intend them that way.  Specifically, the term "refuses" above
and your prior comment about considering use of your veto power if the
committers were to vote to move to 1.5.

I'm not refusing to do anything.  I am overwhelmed in a crunch for the
next several months and simply informing the community that I have code
that others may find valuable that might be contributed, but that it
requires 1.5 and that I cannot backport it.  I cannot unilaterally
decide to contribute the code, needing the agreement of the company I'm
working for.  They are only interested in the contribution if there is
interest in having it in the core.  These are simply facts.  I suspect
I'm not the only person in this kind of situation.


 Since GCJ is effectively available on all platforms, we could say that
 we will start accepting 1.5 features when a GCJ release supports those
 features.  Does that seem reasonable?

Seems like a reasonable compromise to me.  If I had a vote on this it
would be +1.

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Global field semantics

2006-07-08 Thread Chuck Williams
karl wettin wrote on 07/08/2006 10:27 AM:
 On Sat, 2006-07-08 at 09:46 -0700, Chuck Williams wrote:
   
 Many things would be cleaner in Lucene if fields had a global semantics,
 

   
 Has this been considered before?  Are there good reasons this path has
 not been followed?
 

 I've been posting some advocacy about the current Field. Basically I
 would like to see a more normalized field setting per document (instead
 of normalizing it in the writer), and I've been talking about something
 like this:

 [Document]#--- {1..*} -[Value]--[Field +name +store +index +vector]
 A
 | {0..*}
 |
  [Index]

   

And what I'm after would look like this:

[Document]#--- {1..*} -[Value]
 A
 | {*..1}
 |
  [Field +store +index +vector +analyzer +directory]
 A
 | {1..1}
 |
[FieldName]
 A
 | {0..*}
 |
  [Index]



The key points are to have Index be a first-class object and to have
field names uniquely specify field properties.

Karl, do you have specific reasons or use cases to normalize fields at
Document rather than at Index?

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Global field semantics

2006-07-08 Thread Chuck Williams
karl wettin wrote on 07/08/2006 12:27 PM:
 On Sat, 2006-07-08 at 11:08 -0700, Chuck Williams wrote:
   
 Karl, do you have specific reasons or use cases to normalize fields at
 Document rather than at Index? 
 

 Nothing more than that the way the API looks, it implies features that
 do not exist: boost, store, index and vectors. I've learned, but I'm
 certain lots of newbies make the same assumptions as I did.
   

I forgot one of my own use cases!  My app uses term vectors as an
optimization for determining excerpts (aka summaries).  Term vectors
increase the index size.  For large documents, the performance benefits
of using term vectors to find excerpts are large, but for small
documents they are non-existent or negative.  So, to optimize
performance and minimize index size, I store term vectors on the
relevant fields only when their values are sufficiently large.  This is
a concrete example of using the same field name with different
Field.TermVector values on different Documents.
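
For reference, the per-document choice looks roughly like this with the current Field constructors (a sketch; the 1024-character threshold is an arbitrary stand-in for "sufficiently large"):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: add a body field, storing term vectors only when the value is
// long enough for the excerpt optimization to pay off.
static void addBody(Document doc, String body) {
  Field.TermVector tv = (body.length() > 1024)   // threshold is illustrative
      ? Field.TermVector.WITH_POSITIONS_OFFSETS
      : Field.TermVector.NO;
  doc.add(new Field("body", body, Field.Store.YES, Field.Index.TOKENIZED, tv));
}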

Are there any similar examples for Field.Index or Field.Store?

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Java 1.5 (was Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided))

2006-07-07 Thread Chuck Williams

DM Smith wrote on 07/07/2006 07:07 PM:
 Otis,
 First let me say, I don't want to rehash the arguments for or
 against Java 1.5.

This is an emotional issue for people on both sides.

 However, I think you have identified that the core people need to
 make a decision and the rest of us need to go with it.

It would be most helpful to have clarity on this issue.

 On Jul 7, 2006, at 1:17 PM, Otis Gospodnetic wrote:

 Hi Chuck,

 I think bulk update would be good (although I'm not sure how it would
 be different from batching deletes and adds, but I'm sure there is a
 difference, or else you wouldn't have done it).

Bulk update works by rewriting all segments that contain a document to
be modified in a single linear pass.  This is orders of magnitude faster
than delete/add if the set of documents to be updated is large,
especially if only a few small fields are mutable on Documents that have
many possibly large immutable fields.  E.g., on a somewhat slow
development machine I updated several fields on 1,000,000 large
documents in 43 seconds.

There is an existing patch in jira that takes this same approach
(LUCENE-382).  However the limitations in that patch are substantial: 
only optimized indexes, stored fields are not updated, updates are
independent of the existing field value, etc.  These limitations make
that implementation not suitable for many use cases.

My implementation eliminates all of those limitations, providing a fast
flexible solution for applying an arbitrary value transformation to
selected documents and fields in the index (doc.field.new_value = f(doc,
field.old_value, doc.other_field_values) for arbitrary f).  It also
works with ParallelReader (and the ParallelWriter I've already
contributed).  This allows the mutable fields to be segregated into a
separate subindex.  Only that subindex need be updated.  This alone is
an enormous advantage over a large number of delete/add's where the same
optimization is not possible due to the doc-id synchronization
requirements of ParallelReader.
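
Purely to illustrate the shape of that transformation, here is a hypothetical interface matching the signature of f above (this is not the contributed API, just a sketch):

import org.apache.lucene.document.Document;

// Hypothetical sketch of the per-field transformation described above:
// new value = f(doc, old value, other field values on the doc).  The
// Document argument gives access to the other field values when the
// new value depends on them.
public interface FieldTransformer {
  String transform(Document doc, String fieldName, String oldValue);
}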

There is a substantial amount of code required to do this, and it is
completely dependent on the index representation.  To simplify merge
issues with ongoing Lucene changes, I had to copy and edit certain
private methods out of the existing index code (and make extensive use
of the package-only api's).  Beyond normal benefits of open sourcing
code, my interest in contributing this is to see the index code
refactored to take bulk update into account.  This is increased by the
current focus on a new flexible index representation.  I would like to
see bulk update as one of the operations supported in the new
representation.

 So I think you should contribute your code.  This will give us a real
 example of having something possibly valuable, and written with 1.5
 features, so we can finalize 1.4 vs. 1.5 discussion, probably with a
 vote on lucene-dev.

I doubt any single contribution will change anyone's mind.  I would like
to have clarity on the 1.5 decision before deciding whether or not to
contribute this and other things.  My ParallelWriter contribution, which
also requires 1.5, is already sitting in jira.

I only work in 1.5 and use its features extensively.  I don't think
about 1.4 at all, and so have no idea how heavily dependent the code in
question is on 1.5.

Unfortunately, I won't be able to contribute anything substantial to
Lucene so long as it has a 1.4 requirement.

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-06 Thread Chuck Williams
robert engels wrote on 07/06/2006 12:24 PM:
 I guess we just chose a much simpler way to do this...

 Even with your code changes, to see the modification made using the
 IndexWriter, it must be closed, and a new IndexReader opened.

 So a far simpler way is to get the collection of updates first, then

 using opened indexreader,
 for each doc in collection
   delete document using key
 endfor

 open indexwriter
 for each doc in collection
   add document
 endfor

 open indexreader


 I don't see how your way is any faster. You must always flush to disk
 and open the indexreader to see the changes.

With the patch you can have ongoing writes and deletes happening
asynchronously with reads and searches.  Reopening the IndexReader to
refresh its view is an independent decision.  The IndexWriter need never
be closed.

Without the patch, you have to close the IndexWriter to do any deletes. 
If the requirements of your app prohibit batching updates for very long,
this could be a frequent occurrence.

So, it seems to me the patch has benefit for apps that do frequent
updates and need reasonably quick access to those changes.

Bulk updates however require yet another approach.  Sorry to change
topics here, but I'm wondering if there was a final decision on the
question of java 1.5 in the core.  If I submitted a bulk update
capability that required java 1.5, would it be eligible for inclusion in
the core or not?

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-06 Thread Chuck Williams
Robert,

Either you or I are missing something basic.  I'm not sure which.

As I understand things, an IndexWriter and an IndexReader cannot both
have the write lock at the same time (they use the same write lock file
name).  Only an IndexReader can delete and only an IndexWriter can add. 
So to update, you need to close the IndexWriter, have the IndexReader
delete, and then reopen the IndexWriter.  With the patch, you never need
to close the IndexWriter, as I said before.  This provides a benefit in
cases where updates cannot be combined into large batches.  In this case
without the patch the IndexWriter must be closed and reopened
frequently, whereas with the patch it does not.
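
For anyone following along, the cycle being described looks roughly like this against the current API (a sketch; the "key" field name and the surrounding class are assumed):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

// Sketch of the pre-patch update cycle: the IndexWriter must be closed
// (releasing the write lock) so an IndexReader can delete, and is then
// reopened to add the replacement document.
static IndexWriter update(IndexWriter writer, Directory dir, Analyzer analyzer,
                          String primaryKey, Document newVersion)
    throws IOException {
  writer.close();                                       // give up the write lock

  IndexReader reader = IndexReader.open(dir);
  reader.deleteDocuments(new Term("key", primaryKey));  // delete old version
  reader.close();                                       // commits the deletes

  writer = new IndexWriter(dir, analyzer, false);       // false = don't create
  writer.addDocument(newVersion);                       // add new version
  return writer;
}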

Have I got something wrong?

Chuck


robert engels wrote on 07/06/2006 03:08 PM:
 I think I finally see how this is supposed to optimize - basically
 because it remembers the terms, and then does the batch deletions.

 We avoid all of this messiness by just making sure each document has a
 primary key and we always remove/update by primary key and we can keep
 the operations in an ordered list (actually set since the keys are
 unique, and that way multiple updates to the same document in a batch
 can be coalesced).

 I guess I still don't see why the change is so involved though...

 I would just maintain an ordered list of operations (deletes and adds)
 on the buffered writer.
 When the buffered writer is closed:
 Create a RamDirectory.
 Perform all deletions in a batch on the main IndexReader.
 Perform ordered deletes and adds on the RamDirectory.
 Merge the RamDirectory with the main index.

 This could all be encapsulated in a BufferedIndexWriter class.


 On Jul 6, 2006, at 4:34 PM, robert engels wrote:

 I guess I don't see the difference...

 You need the write lock to use the indexWriter, and you also need the
 write lock to perform a deletion, so if you just get the write lock
 you can perform the deletion and the add, then close the writer.

 I have asked how this submission optimizes anything, and I still
 can't seem to get an answer?


 On Jul 6, 2006, at 4:27 PM, Otis Gospodnetic wrote:

 I think that patch is for a different scenario, the one where you
 can't wait to batch deletes and adds, and want/need to execute them
 more frequently and in order they really are happening, without
 grouping them.

 Otis

 - Original Message 
 From: robert engels [EMAIL PROTECTED]
 To: java-dev@lucene.apache.org
 Sent: Thursday, July 6, 2006 3:24:13 PM
 Subject: Re: [jira] Commented: (LUCENE-565) Supporting
 deleteDocuments in IndexWriter (Code and Performance Results Provided)

 I guess we just chose a much simpler way to do this...

 Even with your code changes, to see the modification made using the
 IndexWriter, it must be closed, and a new IndexReader opened.

 So a far simpler way is to get the collection of updates first, then

 using opened indexreader,
 for each doc in collection
delete document using key
 endfor

 open indexwriter
 for each doc in collection
add document
 endfor

 open indexreader


 I don't see how your way is any faster. You must always flush to disk
 and open the indexreader to see the changes.



 On Jul 6, 2006, at 2:07 PM, Ning Li wrote:

 Hi Otis and Robert,

 I added an overview of my changes in JIRA. Hope that helps.

 Anyway, my test did exercise the small batches, in that in our
 incremental updates we delete the documents with the unique term, and
 then add the new (which is what I assumed this was improving), and I
 saw no appreciable difference.

 Robert, could you describe a bit more how your test is set up? Or a
 short
 code snippet will help me explain.

 Without the patch, when inserts and deletes are interleaved in small
 batches, the performance can degrade dramatically because the
 ramDirectory
 is flushed to disk whenever an IndexWriter is closed, causing a lot of
 small segments to be created on disk, which eventually need to be
 merged.

 Is this how your test is set up? And, what are the maxBufferedDocs
 and the
 maxBufferedDeleteTerms in your test? You won't see a performance
 improvement
 if they are about the same as the small batch size. The patch works by
 internally buffering inserts and deletes into larger batches.

 Regards,
 Ning


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-06 Thread Chuck Williams
The need to close the IndexWriter is no different with the patch for
deletes than it already is for adds.  This is a separate issue that can
be managed asynchronously using the existing mechanism in the
application.  The patch ensures the proper order of operations, so the
benefit remains.  Applications can now freely add and delete without
worrying about delete's forcing a close of the IndexWriter.

I think we are all in agreement that delete really belongs in IndexWriter.

I agree with Otis that IndexModifier should be deprecated for several
reasons.  I use an IndexManager that coordinates all of search, read,
add, delete, update, etc.  It manages the refreshes, the batches, bulk
updates, etc.  And it does all of this more efficiently than IndexModifier.
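
The IndexManager itself is not shown here, but the general shape of such a coordinator can be sketched as follows (a simplified illustration only, not the actual class; it serializes writes and refreshes the searcher on demand):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

// Minimal coordinator sketch: all writes go through synchronized methods,
// and a fresh searcher is opened only when the application asks for one.
public class SimpleIndexManager {
  private final Directory dir;
  private IndexWriter writer;
  private IndexSearcher searcher;

  public SimpleIndexManager(Directory dir) throws IOException {
    this.dir = dir;
    this.writer = new IndexWriter(dir, new StandardAnalyzer(), false);
  }

  public synchronized void add(Document doc) throws IOException {
    writer.addDocument(doc);
  }

  // Flush buffered adds and open a searcher over the new index state.
  public synchronized IndexSearcher refresh() throws IOException {
    writer.close();
    writer = new IndexWriter(dir, new StandardAnalyzer(), false);
    if (searcher != null) searcher.close();
    searcher = new IndexSearcher(dir);
    return searcher;
  }
}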

Haven't heard an answer yet whether or not 1.5 code contributions would
be eligible for the core.

Chuck


robert engels wrote on 07/06/2006 08:01 PM:
 I think you still need to close the IndexWriter at some point, in
 order to search the new documents. In effect all of the changes using
 the buffered IndexWriter are meaningless until the IndexWriter is
 closed and a new IndexReader opened.

 Given that, it doesn't make much difference when you do the buffering...

 My statement about getting the lock once was not entirely correct as
 you point out, it needs to be grabbed in two stages, but a far more
 simple design (as I proposed) could be used - obviously some changes
 for lock management would be needed.

 I DO think that the deletion code should be moved to IndexWriter - it
 makes more sense there. The current design IS a bit goofy... I don't
 see why you would delete using an IndexReader - why be able to see
 deletions in the current IndexReader but not be able to see additions?
 What is the benefit?

 I really like the idea of the BufferedWriter - it is similar to what
 is proposed but I think the implementation would be far simpler and
 more straightforward.  It would be similar to IndexModifier without
 the warning that you should do all the deletions first, and then all
 the additions - the BufferedWriter would manage this for you.

 On Jul 6, 2006, at 9:16 PM, Chuck Williams wrote:

 Robert,

 Either you or I are missing something basic.  I'm not sure which.

 As I understand things, an IndexWriter and an IndexReader cannot both
 have the write lock at the same time (they use the same write lock file
 name).  Only an IndexReader can delete and only an IndexWriter can add.
 So to update, you need to close the IndexWriter, have the IndexReader
 delete, and then reopen the IndexWriter.  With the patch, you never need
 to close the IndexWriter, as I said before.  This provides a benefit in
 cases where updates cannot be combined into large batches.  In this case
 without the patch the IndexWriter must be closed and reopened
 frequently, whereas with the patch it does not.

 Have I got something wrong?

 Chuck


 robert engels wrote on 07/06/2006 03:08 PM:
 I think I finally see how this is supposed to optimize - basically
 because it remembers the terms, and then does the batch deletions.

 We avoid all of this messiness by just making sure each document has a
 primary key and we always remove/update by primary key and we can keep
 the operations in an ordered list (actually set since the keys are
 unique, and that way multiple updates to the same document in a batch
 can be coalesced).

 I guess I still don't see why the change is so involved though...

 I would just maintain an ordered list of operations (deletes and adds)
 on the buffered writer.
 When the buffered writer is closed:
 Create a RamDirectory.
 Perform all deletions in a batch on the main IndexReader.
 Perform ordered deletes and adds on the RamDirectory.
 Merge the RamDirectory with the main index.

 This could all be encapsulated in a BufferedIndexWriter class.


 On Jul 6, 2006, at 4:34 PM, robert engels wrote:

 I guess I don't see the difference...

 You need the write lock to use the indexWriter, and you also need the
 write lock to perform a deletion, so if you just get the write lock
 you can perform the deletion and the add, then close the writer.

 I have asked how this submission optimizes anything, and I still
 can't seem to get an answer?


 On Jul 6, 2006, at 4:27 PM, Otis Gospodnetic wrote:

 I think that patch is for a different scenario, the one where you
 can't wait to batch deletes and adds, and want/need to execute them
 more frequently and in order they really are happening, without
 grouping them.

 Otis

 - Original Message 
 From: robert engels [EMAIL PROTECTED]
 To: java-dev@lucene.apache.org
 Sent: Thursday, July 6, 2006 3:24:13 PM
 Subject: Re: [jira] Commented: (LUCENE-565) Supporting
 deleteDocuments in IndexWriter (Code and Performance Results
 Provided)

 I guess we just chose a much simpler way to do this...

 Even with your code changes, to see the modification made using the
 IndexWriter, it must be closed, and a new IndexReader opened.

 So a far

Re: Memory Leak IndexSearcher

2006-07-03 Thread Chuck Williams
I'd suggest forcing gc after each n iteration(s) of your loop to
eliminate the garbage factor.  Also, you can run a profiler to see which
objects are leaking (e.g., the netbeans profiler is excellent).  Those
steps should identify any issues quickly.
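
A sketch of the kind of test loop meant here (the index path and counts are placeholders):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;

// Reopen the searcher repeatedly and force gc every n iterations, so any
// remaining steady growth is a real leak rather than pending garbage.
static void leakTest() throws IOException {
  for (int i = 0; i < 1000; i++) {
    IndexSearcher searcher = new IndexSearcher("/database");  // placeholder path
    // ... run the sorted search here ...
    searcher.close();

    if (i % 10 == 0) {
      System.gc();
      long used = Runtime.getRuntime().totalMemory()
                - Runtime.getRuntime().freeMemory();
      System.out.println(i + ": " + used + " bytes in use");
    }
  }
}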

Chuck

robert engels wrote on 07/03/2006 07:40 AM:
 Did you try what was suggested? (-Xmx16m) and did you get an OOM? If
 not, there is no memory leak.

 On Jul 3, 2006, at 12:33 PM, Bruno Vieira wrote:

 Thanks for the answer, but I have isolated the cycle inside a loop in a
 static void main(String args[]) class to test this issue.  In this
 case there
 were no classes referencing the IndexSearcher and the problem still
 happened.


 2006/7/3, robert engels [EMAIL PROTECTED]:

 You may not have a memory leak at all. It could just be garbage
 waiting to be collected. I am fairly certain there are no memory
 leaks in the current Lucene code base (outside of the ThreadLocal
 issue).

 A simple way to verify this would be to add -Xmx16m on the command
 line. If there were a memory leak then it will eventually fail with
 an OOM.

 If there is a memory leak, then it is probably because your code is
 holding on to IndexReader references in some static var or map.


 On Jul 3, 2006, at 9:43 AM, Bruno Vieira wrote:

  Hi everyone,
 
  I am working on a project with around 35000 documents (8 text
  fields with
  256 chars at most for each field) on lucene. But unfortunately this
  index is
  updated at every moment and I need these new items to be in the
  results of
  my search as fast as possible.
 
  I have an IndexSearcher, then I do a search getting the last 10
  results with
  ordering by a name field and the memory allocated is 13mb, I close
 the
  IndexSearcher because the lucene database was updated by an external
  application and I create a new IndexSearcher, do the same search
 again
  wanting to get the last 10 results with ordering by a name field
  and the
  memory allocated is 15mb.  Every time I do this cycle the memory
  increases
  by 2mb, so eventually I have a memory leak.
 
  If the database is not updated and I do not create a new
  IndexSearcher I can
  do searches forever without a memory leak.
 
  Why, when I close an IndexSearcher (indexSearcher.close();
  indexSearcher = new IndexSearcher("/database/");) after some searches
  with ordering and open a new one, is the memory not freed?
 
  Thanks to any suggestions.


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Combining Hits and HitCollector

2006-06-27 Thread Chuck Williams
IMHO, Hits is the worst class in Lucene.  Its atrocities are numerous,
including the hardwired 50 and the strange normalization of dividing
all scores by the top score if the top score happens to be greater than
1.0 (which destroys any notion of score values having any absolute
meaning, although many apps erroneously assume they do).  It is quite
easy to use a TopDocsCollector or a TopFieldDocCollector and do a better
job than Hits does.

For faceted search I use a SamplingHitCollector to gather the
facet-determination sample.  It takes as one of its constructor
parameters, rankingCollector, an arbitrary HitCollector to gather the
top scoring or top sorted results.  Then it only takes one line of code
to combine the two collectors:  rankingCollector.collect(doc, score)
within SamplingHitCollector.collect().
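
Stripped to its essentials, that delegation looks like the following (SamplingHitCollector itself is not shown; the sampling logic is reduced to a bare counter for illustration):

import org.apache.lucene.search.HitCollector;

// Wraps an arbitrary ranking collector and also observes every hit, so a
// single search pass serves both ranking and facet-style counting.
public class CountingDelegatingCollector extends HitCollector {
  private final HitCollector rankingCollector;
  private int totalHits = 0;

  public CountingDelegatingCollector(HitCollector rankingCollector) {
    this.rankingCollector = rankingCollector;
  }

  public void collect(int doc, float score) {
    totalHits++;                            // stand-in for facet sampling
    rankingCollector.collect(doc, score);   // the one-line delegation
  }

  public int getTotalHits() { return totalHits; }
}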

This all notwithstanding, a built-in class that combined Hits with a
second HitCollector probably would be used by many people, although I
would recommend the approach above as a better alternative.

Chuck


Nadav Har'El wrote on 06/27/2006 09:08 AM:
 Hi,

 Searcher.search(Query) returns a Hits object, useful for the display of top
 results. Searcher.search(Query, HitCollector) runs a HitCollector for doing
 some sort of processing over all results.
 Unfortunately, there is currently no method to do both at the same time.

 For some uses, for example faceted search (that was discussed on this list
 a few times in the past), you need to do both: go over all results (and,
 for example, count how many results belong to each value), and at the same
 time build a Hits object (for displaying the top search results).

 Changing Searcher, and/or Hits to allow for doing both things at once should
 not be too hard, but before I go and do it (and submit the change as a patch),
 I was wondering if I'm not reinventing the wheel, and if perhaps someone has
 already done this, or there were already discussions on how or how not to do
 it.

 Thanks,
 Nadav.


   



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-609) Lazy field loading breaks backward compat

2006-06-21 Thread Chuck Williams (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-609?page=comments#action_12417188 ] 

Chuck Williams commented on LUCENE-609:
---

I'm late to the discussion and have only read the patch file, but it seems 
invalid to me.  Won't getField() get a class cast exception when it encounters 
a Fieldable that is not a Field?  The semantics of getField() would have to be 
something like, only get this field if it is a Field rather than some other 
kind of Fieldable, which means it would have to do type testing on the members 
of fields.

I think it is much better to remove this patch and leave Fieldable as is.  
Searchable was the same kind of thing.  IndexReader is an abstract super class 
for the different types of readers.  When I did ParallelWriter, I had the same 
problem and had to introduce Writable since IndexWriter is not an abstract 
class and ParallelWriter is a different implementation.  I think it is best to 
introduce all the abstract classes now for fundamental types that have multiple 
implementations.

Chuck




 Lazy field loading breaks backward compat
 -

  Key: LUCENE-609
  URL: http://issues.apache.org/jira/browse/LUCENE-609
  Project: Lucene - Java
 Type: Bug

   Components: Other
 Versions: 2.0.1
 Reporter: Yonik Seeley
 Assignee: Yonik Seeley
  Fix For: 2.0.1
  Attachments: fieldable_patch.diff

 Document.getField() and Document.getFields() have changed in a non backward 
 compatible manner.
 Simple code like the following no longer compiles:
  Field x = mydoc.getField("x");

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)

2006-06-20 Thread Chuck Williams
 that there are great new 
 features in 1.6 - that would improve the lucene core if they were used - 
   I think that that is when this issue gets revisited.

 This isn't the type of question that should be decided by a poll.  This 

 OG: The poll was about what do you use, and not what version of Java 
 should Lucene support.  I hope this wasn't misinterpreted by those who took 
 the poll.

 should be decided by thoughtfully looking at the consequences of each 
 choice.  For me - the negative consequences of choosing 1.5 - leaving 
 behind a lot of users - is much worse than the negative consequences of 
 staying at 1.4 - making a couple dozen highly skilled developers check 
 an extra box in their lucene development environments?

 OG: I don't think the checkbox will remove 1.5-style for loops or generics 
 and other stuff if that is already in the code.

 If any developers have actually read this far (sorry - it got kind of 
 long) - thanks again for all of your great work - Lucene is a great tool 
 - and a great community.

 OG: Thanks Dan, and please don't take my email(s) wrong.  I'm quite 
 clear-headed in this issue, and am trying to be objective.  I personally 
 wouldn't get hurt if we stayed with 1.4, I'd just be feeling bad and guilty 
 if we had to reject contributions that have 1.5 bits in it.

 OG: How about this.  I noticed the "significant number of people left behind" 
 statement in a few people's arguments.  How small of a percentage of 1.4 
 users do you think we should look for before we can move to 1.5?  What does 
 the 1.5:1.4 ratio need to be?
 This is not a question for Dan only. I would really be interested what others 
 think about this.  How small does the percentage of 1.4 users need to be, 
 before we can have 1.5 in Lucene?

 Thanks,
 Otis





 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]

   


-- 
*Chuck Williams*
Manawiz
Principal
V: (808)885-8688
C: (415)846-9018
[EMAIL PROTECTED] mailto:[EMAIL PROTECTED]
Skype: manawiz
AIM: hawimanawiz
Yahoo: jcwxx

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)

2006-06-19 Thread Chuck Williams


Ray Tsang wrote on 06/19/2006 09:06 AM:
 On 6/17/06, Chuck Williams [EMAIL PROTECTED] wrote:

 Ray Tsang wrote on 06/17/2006 06:29 AM:
  I think the problem right now isn't whether we are going to have 1.5
  code or not.  We will eventually have to have 1.5 code anyways.  But
  we need a sound plan that will make the transition easy.  I believe
  the transition from 1.4 to 1.5  is not an over night thing.

 I disagree.  1.5 was specifically designed to make transition easy,
 including the inclusion of non-features that ensure smooth
 interoperability (e.g., raw types and no runtime presence whatsoever of
 generics -- quite different from how it was done in .Net 2.0 for
 example).

 But will 1.4 jvm be able to run the new Lucene w/ 1.5 core?

If 1.5 features are fully embraced, no.



 
  Secondly can we specifically find places where some people _will_
  contribute code immediately if it's 1.5 is accepted?

 I already have.  That's what started this second round of debate.

 What is it?

ParallelWriter (see LUCENE-600).  I have quite a few more behind that. 
Whether or not various people will find them useful is tbd, but they are
all working well for me and essential to meet my requirements, and some
are for things often requested on the various lists (e.g., a general
purpose fast bulk index updater that supports arbitrary transformations
on the values of fields).

 Who else?  How many?  Do we have statistics?  We have
 statistics of number of users between 1.4 vs. 1.5 (which btw didn't
 present a significant polarization), but how about the actual number of
 potential contributions between the 2?

There has been a proposal to poll java-dev for this.  Wagers on the outcome?



 
  Like what I have suggested before, why not have contribution modules
  that act as a transition into 1.5 code?  Much like what other
  framework has a tiger module.  This module may have say, a 1.5
  compatible layer on top of 1.4 core, or other components of lucene
  that was made to be extensible, e.g. 1.5 version of QueryParser,
  Directory, etc.

 I think this would make it unnecessarily complex.

 How is it unnecessary or complex?  If it only means layering,
 extending classes, adding implementations, it should be relatively
 easy with the existing design.  It's something we do everyday
 regardless of what direction lucene takes.

Contributing to Lucene is a volunteer effort.  The more difficult you
make it, the fewer people will do it.  That's what this is all about. 
Accept 1.5 contributions and I believe you will get more high quality
contributions.  Of course, this comes at a high cost for those who
cannot transition to 1.5, since they would need to stick with Lucene 2.0.x.

If I had a vote on this, honestly I'm not sure how I would vote.  It's a
tough call either way.  Do you support a significant minority of users
and contributors who are stuck on an old java platform, or do you strike
forward with a more robust contributing community from the majority at
the cost of cutting out the minority from the latest and greatest?  My
first comment on this topic was something like, "why would somebody who
is on an old java platform expect to have the latest and greatest
lucene?"  I think if I were stuck on 1.4, I wouldn't be happy about a
1.5 decision for lucene 2.1+, but I would understand it, accept it, and
do whatever I could to speed my transition to 1.5.

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-561) ParallelReader fails on deletes and on seeks of previously unused fields

2006-06-19 Thread Chuck Williams (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-561?page=comments#action_12416790 ] 

Chuck Williams commented on LUCENE-561:
---

Christian,

That is a different bug than this one.  This bug has been fixed.

Chuck


 ParallelReader fails on deletes and on seeks of previously unused fields
 

  Key: LUCENE-561
  URL: http://issues.apache.org/jira/browse/LUCENE-561
  Project: Lucene - Java
 Type: Bug

   Components: Index
 Versions: 2.0.0
  Environment: All
 Reporter: Chuck Williams
 Assignee: Yonik Seeley
  Fix For: 2.0.0
  Attachments: ParallelReaderBugs.patch, ParallelReaderBugs.patch

 In using ParallelReader I've hit two bugs:
 1.  ParallelReader.doDelete() and doUndeleteAll() call doDelete() and 
 doUndeleteAll() on the subreaders, but these methods do not set hasChanges.  
 Thus the changes are lost when the readers are closed.  The fix is to call 
 deleteDocument() and undeleteAll() on the subreaders instead.
 2.  ParallelReader discovers the fields in each subindex by using 
 IndexReader.getFieldNames() which only finds fields that have occurred on at 
 least one document.  In general a parallel index is designed with assignments 
 of fields to sub-indexes and term seeks (including searches) may be done on 
 any of those fields, even if no documents in a particular state of the index 
 have yet had an assigned field.  Seeks/searches on fields that have not yet 
 been indexed generated an NPE in ParallelReader's various inner class seek() 
 and next() methods because fieldToReader.get() returns null on the unseen 
 field.  The fix is to extend the add() methods to supply the correct list of 
 fields for each subindex.
 Patch that corrects both of these issues attached.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-398) ParallelReader crashes when trying to merge into a new index

2006-06-19 Thread Chuck Williams (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-398?page=comments#action_12416837 ] 

Chuck Williams commented on LUCENE-398:
---

Christian,

I'm going to open a new issue on this in order to rename it, post a revised 
patch, and hopefully get the attention of a committer.

Chuck


 ParallelReader crashes when trying to merge into a new index
 

  Key: LUCENE-398
  URL: http://issues.apache.org/jira/browse/LUCENE-398
  Project: Lucene - Java
 Type: Bug

   Components: Index
 Versions: unspecified
  Environment: Operating System: All
 Platform: All
 Reporter: Sebastian Kirsch
 Assignee: Lucene Developers
  Attachments: ParallelReader.diff, ParallelReaderTest1.java, 
 parallelreader.diff, patch-next.diff

 ParallelReader causes a NullPointerException in
 org.apache.lucene.index.ParallelReader$ParallelTermPositions.seek(ParallelReader.java:318)
 when trying to merge into a new index.
 See test case and sample output:
 $ svn diff
 Index: src/test/org/apache/lucene/index/TestParallelReader.java
 ===
 --- src/test/org/apache/lucene/index/TestParallelReader.java(revision 
 179785)
 +++ src/test/org/apache/lucene/index/TestParallelReader.java(working copy)
 @@ -57,6 +57,13 @@
  
}
   
 +  public void testMerge() throws Exception {
 +Directory dir = new RAMDirectory();
 +IndexWriter w = new IndexWriter(dir, new StandardAnalyzer(), true);
 +w.addIndexes(new IndexReader[] { ((IndexSearcher)
 parallel).getIndexReader() });
 +w.close();
 +  }
 +
private void queryTest(Query query) throws IOException {
  Hits parallelHits = parallel.search(query);
  Hits singleHits = single.search(query);
 $ ant -Dtestcase=TestParallelReader test
 Buildfile: build.xml
 [...]
 test:
 [mkdir] Created dir:
 /Users/skirsch/text/lectures/da/thirdparty/lucene-trunk/build/test
 [junit] Testsuite: org.apache.lucene.index.TestParallelReader
 [junit] Tests run: 2, Failures: 0, Errors: 1, Time elapsed: 1.993 sec
 [junit] Testcase: testMerge(org.apache.lucene.index.TestParallelReader):  
 Caused an ERROR
 [junit] null
 [junit] java.lang.NullPointerException
 [junit] at
 org.apache.lucene.index.ParallelReader$ParallelTermPositions.seek(ParallelReader.java:318)
 [junit] at
 org.apache.lucene.index.ParallelReader$ParallelTermDocs.seek(ParallelReader.java:294)
 [junit] at
 org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325)
 [junit] at
 org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296)
 [junit] at
 org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270)
 [junit] at
 org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
 [junit] at
 org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
 [junit] at
 org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:596)
 [junit] at
 org.apache.lucene.index.TestParallelReader.testMerge(TestParallelReader.java:63)
 [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 [junit] at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 [junit] at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 [junit] Test org.apache.lucene.index.TestParallelReader FAILED
 BUILD FAILED
 /Users/skirsch/text/lectures/da/thirdparty/lucene-trunk/common-build.xml:188:
 Tests failed!
 Total time: 16 seconds
 $

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-607) ParallelTermEnum is BROKEN

2006-06-19 Thread Chuck Williams (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-607?page=all ]

Chuck Williams updated LUCENE-607:
--

Attachment: ParallelTermEnum.patch

 ParallelTermEnum is BROKEN
 --

  Key: LUCENE-607
  URL: http://issues.apache.org/jira/browse/LUCENE-607
  Project: Lucene - Java
 Type: Bug

   Components: Index
 Versions: 2.0.0
 Reporter: Chuck Williams
 Priority: Critical
  Attachments: ParallelTermEnum.patch

 ParallelTermEnum.next() fails to advance properly to new fields.  This is a 
 serious bug. 
 Christian Kohlschuetter diagnosed this as the root problem underlying 
 LUCENE-398 and posted a first patch there.
 I've addressed a couple of issues in the patch (close skipped field TermEnum's, 
 generate the field iterator only once, integrate Christian's test case as a 
 Lucene test) and packaged it all in the revised patch here.
 All Lucene tests pass, and I've further tested in this in my app, which makes 
 extensive use of ParallelReader.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-398) ParallelReader crashes when trying to merge into a new index

2006-06-19 Thread Chuck Williams (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-398?page=comments#action_12416838 ] 

Chuck Williams commented on LUCENE-398:
---

Revised patch posted in LUCENE-607


 ParallelReader crashes when trying to merge into a new index
 

  Key: LUCENE-398
  URL: http://issues.apache.org/jira/browse/LUCENE-398
  Project: Lucene - Java
 Type: Bug

   Components: Index
 Versions: unspecified
  Environment: Operating System: All
 Platform: All
 Reporter: Sebastian Kirsch
 Assignee: Lucene Developers
  Attachments: ParallelReader.diff, ParallelReaderTest1.java, 
 parallelreader.diff, patch-next.diff

 ParallelReader causes a NullPointerException in
 org.apache.lucene.index.ParallelReader$ParallelTermPositions.seek(ParallelReader.java:318)
 when trying to merge into a new index.
 See test case and sample output:
 $ svn diff
 Index: src/test/org/apache/lucene/index/TestParallelReader.java
 ===
 --- src/test/org/apache/lucene/index/TestParallelReader.java(revision 
 179785)
 +++ src/test/org/apache/lucene/index/TestParallelReader.java(working copy)
 @@ -57,6 +57,13 @@
  
}
   
 +  public void testMerge() throws Exception {
 +Directory dir = new RAMDirectory();
 +IndexWriter w = new IndexWriter(dir, new StandardAnalyzer(), true);
 +w.addIndexes(new IndexReader[] { ((IndexSearcher)
 parallel).getIndexReader() });
 +w.close();
 +  }
 +
private void queryTest(Query query) throws IOException {
  Hits parallelHits = parallel.search(query);
  Hits singleHits = single.search(query);
 $ ant -Dtestcase=TestParallelReader test
 Buildfile: build.xml
 [...]
 test:
 [mkdir] Created dir:
 /Users/skirsch/text/lectures/da/thirdparty/lucene-trunk/build/test
 [junit] Testsuite: org.apache.lucene.index.TestParallelReader
 [junit] Tests run: 2, Failures: 0, Errors: 1, Time elapsed: 1.993 sec
 [junit] Testcase: testMerge(org.apache.lucene.index.TestParallelReader):  
 Caused an ERROR
 [junit] null
 [junit] java.lang.NullPointerException
 [junit] at
 org.apache.lucene.index.ParallelReader$ParallelTermPositions.seek(ParallelReader.java:318)
 [junit] at
 org.apache.lucene.index.ParallelReader$ParallelTermDocs.seek(ParallelReader.java:294)
 [junit] at
 org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325)
 [junit] at
 org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296)
 [junit] at
 org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270)
 [junit] at
 org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
 [junit] at
 org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
 [junit] at
 org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:596)
 [junit] at
 org.apache.lucene.index.TestParallelReader.testMerge(TestParallelReader.java:63)
 [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 [junit] at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 [junit] at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 [junit] Test org.apache.lucene.index.TestParallelReader FAILED
 BUILD FAILED
 /Users/skirsch/text/lectures/da/thirdparty/lucene-trunk/common-build.xml:188:
 Tests failed!
 Total time: 16 seconds
 $

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Soccer-themed question: null fields?

2006-06-18 Thread Chuck Williams

JMA wrote on 06/17/2006 10:16 PM:
 1) Is there a way to find a document that has null fields?  
 For example, if I have two fields (FIRST_NAME, LAST_NAME) for World Cup 
 players:

 FIRST_NAME: Brian LAST_NAME: McBride
 FIRST_NAME: Agustin   LAST_NAME: Delgado
 FIRST_NAME: Zinha LAST_NAME: (null or blank)
 FIRST_NAME: Kaka  LAST_NAME: (null or blank)

 ... and so on

 What's the way to find all players that use only their first name?
   

By far the best way is to store a special token into null fields and
then just match on this.

One less-performant alternative if you have no control over the index is
to enable prefix wildcard queries and then write a query like this:

FIRST_NAME:* -LAST_NAME:*

To enable prefix wildcard queries, you need to regenerate
QueryParser.java from QueryParser.jj after replacing the wildcard
production (search for OG, as Otis has nicely included the appropriate
production as a comment).
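
A sketch of the special-token approach (not a complete program; the "_null_" sentinel is an arbitrary choice):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

// Index time: write a sentinel into the otherwise-empty field.
Document doc = new Document();
doc.add(new Field("FIRST_NAME", "Zinha",
                  Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("LAST_NAME", "_null_",
                  Field.Store.NO, Field.Index.UN_TOKENIZED));

// Search time: players who use only a first name are an exact term match.
TermQuery onlyFirstName = new TermQuery(new Term("LAST_NAME", "_null_"));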

 2) Is there a way to count field terms?  For example, if instead we have one 
 field...

 NAME: Brian McBride
 NAME: Agustin Delgado
 NAME: Zinha
 NAME: Kaka

 Can I answer the same question by finding all documents where the number of 
 terms
 in the NAME field is 1 and only 1?  Is there a way to do that?
   

You would need to write your own Query subclass, and I can't think of
any way to achieve this that would not be very slow.  Not recommended.

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)

2006-06-17 Thread Chuck Williams

Tatu Saloranta wrote on 06/17/2006 06:54 AM:
 And it's
 a bit curious as to what the current mad rush regarding
 migration is -- beyond the convenience and syntactic
 sugar, only the concurrency package seems like a
 tempting immediate reason?
   

The only people who keep bringing up these non-arguments are those on
the con side.  You should read the arguments on the pro side -- they are
not these.

 I hope it can be a practical decision made with
 cool minds.
   

Agreed.  I think a key part of this is to listen to what the other side
is saying.

This all boils down to people and the environments they use.  People
using 1.4 want the latest and greatest Lucene and don't understand why
it's important to use 1.5 anyway.  People using 1.5 are writing 1.5 code
everyday and want to be able to make contributions to Lucene without
backporting and retesting.  Also, they don't want to consciously write
code that might be a Lucene contribution in 1.4 because a) the cognitive
shift back to 1.4 is not easy once you are fully indoctrinated into 1.5
(primarily generics), and b) 1.4 code is not type-safe in the sense that
1.5 code is.

So, do 1.4 people live with Lucene 2.0.x until they move to 1.5, or do
1.5 people get limited or cut out from making contributions.  Neither
option is appealing, especially to those negatively affected.

Chuck


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-602) [PATCH] Filtering tokens for position and term vector storage

2006-06-15 Thread Chuck Williams (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-602?page=all ]

Chuck Williams updated LUCENE-602:
--

Attachment: TokenSelectorSoloAll.patch

TokenSelectorSoloAll.patch applies against today's svn head.  It only requires 
Java 1.4.



 [PATCH] Filtering tokens for position and term vector storage
 -

  Key: LUCENE-602
  URL: http://issues.apache.org/jira/browse/LUCENE-602
  Project: Lucene - Java
 Type: New Feature

   Components: Index
 Versions: 2.1
 Reporter: Chuck Williams
  Attachments: TokenSelectorSoloAll.patch

 This patch provides a new TokenSelector mechanism to select tokens of 
 interest and creates two new IndexWriter configuration parameters:  
 termVectorTokenSelector and positionsTokenSelector.
 termVectorTokenSelector, if non-null, selects which index tokens will be 
 stored in term vectors.  If positionsTokenSelector is non-null, then any 
 tokens it rejects will have only their first position stored in each document 
 (it is necessary to store one position to keep the doc freq properly to avoid 
 the token being garbage collected in merges).
 This mechanism provides a simple solution to the problem of minimizing index 
 size overhead caused by storing extra tokens that facilitate queries, in those 
 cases where the mere existence of the extra tokens is sufficient.  For 
 example, in my test data using reverse tokens to speed prefix wildcard 
 matching, I obtained the following index overheads:
   1.  With no TokenSelectors:  60% larger with reverse tokens than without
   2.  With termVectorTokenSelector rejecting reverse tokens:  36% larger
   3.  With both positionsTokenSelector and termVectorTokenSelector rejecting 
 reverse tokens:  25% larger
 It is possible to obtain the same effect by using a separate field that has 
 one occurrence of each reverse token and no term vectors, but this can be 
 hard or impossible to do and a performance problem as it requires either 
 rereading the content or storing all the tokens for subsequent processing.
 The solution with TokenSelectors is very easy to use and fast.
 Otis, thanks for leaving a comment in QueryParser.jj with the correct 
 production to enable prefix wildcards!  With this, it is a straightforward 
 matter to override the wildcard query factory method and use reverse tokens 
 effectively.
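
As a rough illustration of the idea, a selector that rejects reverse tokens might look like this (the interface and method names are assumptions for the sketch, not necessarily the API in the attached patch):

import org.apache.lucene.analysis.Token;

// Rough illustration of a token selector in the sense described above; the
// interface and method name are assumptions, not the patch's actual API.
public interface TokenSelector {
  boolean accept(Token token);
}

// Example: reject reverse tokens, here marked by a leading '\u0001' purely
// as an illustrative convention, so they are excluded from term vectors
// and from full position storage.
class ForwardTokensOnly implements TokenSelector {
  public boolean accept(Token token) {
    String text = token.termText();
    return text.length() == 0 || text.charAt(0) != '\u0001';
  }
}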

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-602) [PATCH] Filtering tokens for position and term vector storage

2006-06-15 Thread Chuck Williams (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-602?page=all ]

Chuck Williams updated LUCENE-602:
--

Attachment: TokenSelectorAllWithParallelWriter.patch

TokenSelectorAllWithParallelWriter.patch contains ParallelWriter as well 
(LUCENE-600) as it is also affected.


 [PATCH] Filtering tokens for position and term vector storage
 -

  Key: LUCENE-602
  URL: http://issues.apache.org/jira/browse/LUCENE-602
  Project: Lucene - Java
 Type: New Feature

   Components: Index
 Versions: 2.1
 Reporter: Chuck Williams
  Attachments: TokenSelectorAllWithParallelWriter.patch, 
 TokenSelectorSoloAll.patch

 This patch provides a new TokenSelector mechanism to select tokens of 
 interest and creates two new IndexWriter configuration parameters:  
 termVectorTokenSelector and positionsTokenSelector.
 termVectorTokenSelector, if non-null, selects which index tokens will be 
 stored in term vectors.  If positionsTokenSelector is non-null, then any 
 tokens it rejects will have only their first position stored in each document 
 (it is necessary to store one position to keep the doc freq properly to avoid 
 the token being garbage collected in merges).
 This mechanism provides a simple solution to the problem of minimizing index 
 size overhead caused by storing extra tokens that facilitate queries, in those 
 cases where the mere existence of the extra tokens is sufficient.  For 
 example, in my test data using reverse tokens to speed prefix wildcard 
 matching, I obtained the following index overheads:
   1.  With no TokenSelectors:  60% larger with reverse tokens than without
   2.  With termVectorTokenSelector rejecting reverse tokens:  36% larger
   3.  With both positionsTokenSelector and termVectorTokenSelector rejecting 
 reverse tokens:  25% larger
 It is possible to obtain the same effect by using a separate field that has 
 one occurrence of each reverse token and no term vectors, but this can be 
 hard or impossible to do and a performance problem as it requires either 
 rereading the content or storing all the tokens for subsequent processing.
 The solution with TokenSelectors is very easy to use and fast.
 Otis, thanks for leaving a comment in QueryParser.jj with the correct 
 production to enable prefix wildcards!  With this, it is a straightforward 
 matter to override the wildcard query factory method and use reverse tokens 
 effectively.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Java 1.5 was [jira] Updated: (LUCENE-600) ParallelWriter companion to ParallelReader

2006-06-13 Thread Chuck Williams
I think the last discussion ended with the main counter-argument being
lack of support by gcj.  Current top of GCJ News:

 *June 6, 2006* RMS approved the plan to use the Eclipse compiler as
 the new gcj front end. Work is being done on the |gcj-eclipse| branch;
 it can already build libgcj. This project will allow us to ship a 1.5
 compiler in the relatively near future. The old |gcjx| branch and
 project is now dead.

In addition to performance, productivity and functionality benefits, my
main argument for 1.5 is that it is used by the vast majority of Lucene
community members.  Everything I write is in 1.5 and I don't have time
to backport.  I have a significant body of code from which to extract
and contribute patches that others would likely find useful.  How many
others are in a similar position?

On the other side, not leaving valued community members behind is also important.

I think the pmc / committers just need to make a decision which will
impact one group or the other.

Chuck


Grant Ingersoll wrote on 06/13/2006 03:35 AM:
 Well, we have our first Java 1.5 patch...  Now that we have had a week
 or two to digest the comments, do we want to reopen the discussion?

 Chuck Williams (JIRA) wrote:
  [ http://issues.apache.org/jira/browse/LUCENE-600?page=all ]

 Chuck Williams updated LUCENE-600:
 --

 Attachment: ParallelWriter.patch

 Patch to create and integrate ParallelWriter, Writable and
 TestParallelWriter -- also modifies build to use java 1.5.


  
 ParallelWriter companion to ParallelReader
 --

  Key: LUCENE-600
  URL: http://issues.apache.org/jira/browse/LUCENE-600
  Project: Lucene - Java
 Type: Improvement
 

  
   Components: Index
 Versions: 2.1
 Reporter: Chuck Williams
  Attachments: ParallelWriter.patch

 A new class ParallelWriter is provided that serves as a companion to
 ParallelReader.  ParallelWriter meets all of the doc-id
 synchronization requirements of ParallelReader, subject to:
 1.  ParallelWriter.addDocument() is synchronized, which might
 have an adverse effect on performance.  The writes to the
 sub-indexes are, however, done in parallel.
 2.  The application must ensure that the ParallelReader is never
 reopened inside ParallelWriter.addDocument(), else it might find the
 sub-indexes out of sync.
 3.  The application must deal with recovery from
 ParallelWriter.addDocument() exceptions.  Recovery must restore the
 synchronization of doc-ids, e.g. by deleting any trailing
 document(s) in one sub-index that were not successfully added to all
 sub-indexes, and then optimizing all sub-indexes.
 A new interface, Writable, is provided to abstract IndexWriter and
 ParallelWriter.  This is in the same spirit as the existing
 Searchable and Fieldable classes.
 This implementation uses java 1.5.  The patch applies against
 today's svn head.  All tests pass, including the new
 TestParallelWriter.
 

   



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Fwd: How to combine results from several indices

2006-06-13 Thread Chuck Williams
You can try that approach, but I think you will find it more difficult. 
E.g., all of the primitive query classes are written specifically to use
doc-ids.  So, you either need to do your searches separately on each
subindex and then write your own routine to join the results, or you
would need to rewrite all the queries.

I use two different indexing combining techniques:

   1. ParallelReader/ParallelWriter for performance reasons in various
  circumstances; e.g., fast access to frequently used fields (in
  combination with lazy fields -- very useful for fast categorical
  analysis of large samples), fast bulk updates of mutable fields by
  copying a much smaller subindex, etc.
   2. Subindex query rewriting for accessing different types of objects
  in separate indices.  A query on the main index may contain a
  subquery that retrieves objects in a different index and rewrites
  itself into a disjunction of the uids of those objects.  This
  approach works well assuming you can arrange indexing of fields in
  the main index with subindex uid values, and the disjunction
  expansions are not too large.

Maybe approach 2 is more what you need?  It's pretty simple to do. 
E.g., take a look at MultiTermQuery for a non-primitive query that
rewrites itself based on the index it runs against.  You would need a
similar class that rewrites itself based on a different index.
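
Under some assumptions (a stored and indexed "uid" field, a result set small
enough to enumerate, and the later TopDocs-style 2.x/3.x search API rather
than the old Hits API), the rewriting step might look roughly like this:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // Rewrite a subquery against a separate object index into a disjunction of
    // uid terms that can be run against the main index.  The "uid" field name
    // and the maxUids cap are assumptions for illustration.
    public class SubindexRewriter {

        public static Query rewriteToUidDisjunction(IndexSearcher subSearcher,
                                                    Query subQuery,
                                                    int maxUids) throws IOException {
            TopDocs hits = subSearcher.search(subQuery, maxUids);
            BooleanQuery disjunction = new BooleanQuery();
            for (ScoreDoc sd : hits.scoreDocs) {
                String uid = subSearcher.doc(sd.doc).get("uid");
                if (uid != null) {
                    disjunction.add(new TermQuery(new Term("uid", uid)), BooleanClause.Occur.SHOULD);
                }
            }
            return disjunction;  // run this against the main index
        }
    }

Keeping maxUids well below BooleanQuery's maximum clause count is exactly the
"disjunction expansions are not too large" caveat in approach 2 above.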

Chuck


wu fox wrote on 06/13/2006 02:18 AM:
 Thank you very much, Chuck.  But I still wonder: is there any way that I
 can revise ParallelReader so that it does not need the same doc ids?
 Can IndexReader combine different docs according to some mapping rules?  For
 example, I could override the Document method to combine docs from the
 indices according to the same uuid, or override some other methods.  I think
 that would be much easier to do than a writer :)  Thank you for your help
 again.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Fwd: How to combine results from several indices

2006-06-12 Thread Chuck Williams
Hi Wu,

The simplest solution is to synchronize calls to a
ParallelWriter.addDocument() method that calls IndexWriter.addDocument()
for each sub-index.  This will work assuming there are no exceptions and
assuming you never refresh your IndexReader within
ParallelWriter.addDocument().  If an exception occurs while writing one of
the sub-indexes, then you need to recover.  The best approach I've found is
to delete the unequal trailing subdocuments and optimize all the subindexes
to restore equal doc ids.
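
A minimal sketch of that approach, assuming the caller has already split each
logical document into one sub-document per sub-index; unlike the real
ParallelWriter, the sub-writes here are sequential and recovery is left
entirely to the caller:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    // Minimal sketch only: writes each sub-document to its sub-index under a
    // single lock so doc-ids stay aligned across sub-indexes.
    public class SimpleParallelWriter {

        private final IndexWriter[] subWriters;

        public SimpleParallelWriter(IndexWriter... subWriters) {
            this.subWriters = subWriters;
        }

        // subDocs[i] is the sub-document destined for subWriters[i].  If an add
        // fails partway through, the caller must delete the trailing documents
        // in the sub-indexes that succeeded and optimize to realign doc-ids.
        public synchronized void addDocument(Document[] subDocs) throws IOException {
            for (int i = 0; i < subWriters.length; i++) {
                subWriters[i].addDocument(subDocs[i]);
            }
        }
    }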

This approach has the consequence of single-threading all index
writing.  I'm working on a solution to avoid this, but it may require
deeper integration into the higher level IndexManager mechanism (which
does reader reopening, journaling, recovery, and a lot of other things).

If you can get by with single threading, I have a ParallelWriter class
now that I could contribute.  If not, I'm considering the more general
solution now, but will only be able to contribute it if it can be kept
separate from the much larger IndexManager mechanism (which is more
specific to my app and thus not likely a fit for your app anyway).

Chuck


wu fox wrote on 06/12/2006 02:43 AM:
 Hi Chuck:
  I am still looking forward to a solution which ensures I can meet the
 constraints of ParallelReader so that I can use it for my search program.
 I have tried a lot of methods, but none of them is good enough for me
 because of obvious bugs.  Can you help me?  Thanks in advance.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


