[jira] Commented: (LUCENE-672) new merge policy

2006-09-15 Thread Ning Li (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-672?page=comments#action_12435174 ] 

Ning Li commented on LUCENE-672:


A small fix named KeepDocCount0Segment.Sept15.patch is attached to LUCENE-565 
(can't attach here).

In mergeSegments(...), if the doc count of a merged segment is 0, the segment 
is not added to the index (it should be properly cleaned up). Before 
LUCENE-672, a merged segment was always added to the index, and callers of 
mergeSegments(...), e.g. addIndexes(Directory[]), assumed that behaviour. For 
code simplicity, this fix restores the old behaviour of always adding the 
merged segment to the index. This does NOT break any of the good properties 
of the new merge policy.
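A minimal sketch of the behavioural difference (illustrative only; the names
follow IndexWriter's internals, and the real change is in the attached patch):

    package org.apache.lucene.index;  // SegmentInfos/SegmentInfo live here

    import org.apache.lucene.store.Directory;

    class MergeRegistrationSketch {
      static void registerMerged(SegmentInfos segmentInfos, String mergedName,
                                 int mergedDocCount, Directory directory) {
        // LUCENE-672 behaviour: skip (and clean up) a merged segment with 0 docs:
        //   if (mergedDocCount > 0)
        //     segmentInfos.addElement(
        //         new SegmentInfo(mergedName, mergedDocCount, directory));

        // Restored behaviour: always add the merged segment, as callers such as
        // addIndexes(Directory[]) assume.
        segmentInfos.addElement(
            new SegmentInfo(mergedName, mergedDocCount, directory));
      }
    }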

TestIndexWriterMergePolicy is slightly modified to fix a bug and to check that 
segments are properly cleaned up. The patch passes all the tests.

> new merge policy
> 
>
> Key: LUCENE-672
> URL: http://issues.apache.org/jira/browse/LUCENE-672
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.0.0
>Reporter: Yonik Seeley
> Assigned To: Yonik Seeley
> Fix For: 2.1
>
>
> New merge policy developed in the course of 
> http://issues.apache.org/jira/browse/LUCENE-565
> http://issues.apache.org/jira/secure/attachment/12340475/newMergePolicy.Sept08.patch




[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-15 Thread Anonymous (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]


Attachment: KeepDocCount0Segment.Sept15.patch

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, 
> IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, 
> NewIndexModifier.July09.patch, NewIndexWriter.Aug23.patch, 
> NewIndexWriter.July18.patch, newMergePolicy.Sept08.patch, perf-test-res.JPG, 
> perf-test-res2.JPG, perfres.log, TestBufferedDeletesPerf.java, 
> TestWriterDelete.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We
> are aware that this kind of change has come up in discussions before; see
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> ---
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that, with this change, it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert when updating a document.
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
> Coding Changes
> --
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document is properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines of coding changes are needed. All
> changes are commented by "CHANGE". We have also attached a modified version
> of an example from Chapter 2.2 of Lucene in Action.
> Performance Results
> ---
> To test the performance of our proposed changes, we ran some experiments
> using the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz
> Intel Xeon server running Linux. The disk storage was configured as a RAID0
> array with 5 drives. Before the indexes were built, the input documents were
> parsed to remove the HTML from them (i.e., only the text was indexed). This
> was done to minimize the impact of parsing on performance. A simple
> WhitespaceAnalyzer was used during the index build.
> We experimented with three workloads:
>   - Insert only. 1.6M documents were inserted and the final
> index size was 2.3GB.
>   - Insert/delete (big batches). The same documents were
> inserted, but 25% were deleted. 1000 documents were
> deleted for every 4000 inserted.
>   - Insert/delete (small batches). In this case, 5 documents
> were deleted for every 20 inserted.
> Workload                       current IndexWriter   current IndexModifier   new IndexWriter
> --------------------------------------------------------------------------------------------
> Insert only                    116 min               119 min                 116 min
> Insert/delete (big batches)    --                    135 min                 125 min
> Insert/delete (small batches)  --                    338 min                 134 min
> As the experiments show, with the proposed changes, the performance
> improved by 60% when inserts and deletes were interleaved in small batches.
> Regards,
> Ning
> Ning Li
> Search Technologies
> IBM Almaden Research Center
> 650 Harry Road
> San Jose, CA 95120
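A minimal usage sketch of the proposed API (this assumes one of the attached
patches, e.g. NewIndexWriter.Aug23.patch, is applied; the index path and field
name are invented):

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class InterleavedUpdates {
      public static void main(String[] args) throws Exception {
        // One writer handles both inserts and deletes.
        IndexWriter writer =
            new IndexWriter("/tmp/index", new WhitespaceAnalyzer(), true);
        for (int i = 0; i < 100; i++) {
          Document doc = new Document();
          doc.add(new Field("id", Integer.toString(i),
                            Field.Store.YES, Field.Index.UN_TOKENIZED));
          writer.addDocument(doc);
          if (i % 20 == 19) {
            // Proposed method: the term is buffered, and the delete is
            // applied when the RAM buffer is flushed or the writer closes.
            writer.deleteDocuments(new Term("id", Integer.toString(i - 10)));
          }
        }
        writer.close();
      }
    }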


[jira] Resolved: (LUCENE-648) Allow changing of ZIP compression level for compressed fields

2006-09-15 Thread Grant Ingersoll (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-648?page=all ]

Grant Ingersoll resolved LUCENE-648.


Resolution: Won't Fix

Won't fix, as I think it is agreed that compression should be handled outside 
of Lucene and then stored as a binary value.
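A minimal sketch of that approach (illustrative only; the field name and
helper are invented): compress outside Lucene with a caller-chosen
java.util.zip.Deflater level, then store the bytes as a binary field.

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class ExternalCompression {
      // Compress with a caller-chosen level instead of Lucene's hardwired
      // Deflater.BEST_COMPRESSION.
      static byte[] deflate(byte[] input, int level) {
        Deflater compressor = new Deflater(level);
        compressor.setInput(input);
        compressor.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        byte[] buf = new byte[4096];
        while (!compressor.finished()) {
          out.write(buf, 0, compressor.deflate(buf));
        }
        return out.toByteArray();
      }

      static Document makeDoc(String text) throws Exception {
        Document doc = new Document();
        // Store the compressed bytes as an opaque binary value; the
        // application inflates them again at read time.
        doc.add(new Field("body",
                          deflate(text.getBytes("UTF-8"), Deflater.BEST_SPEED),
                          Field.Store.YES));
        return doc;
      }
    }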

> Allow changing of ZIP compression level for compressed fields
> -
>
> Key: LUCENE-648
> URL: http://issues.apache.org/jira/browse/LUCENE-648
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 1.9, 2.0.0, 2.1, 2.0.1
>Reporter: Michael McCandless
>Priority: Minor
>
> In response to this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-user/38810
> I think we should allow changing the compression level used in the call to 
> java.util.zip.Deflater in FieldsWriter.java.  Right now it's hardwired to 
> "best":
>   compressor.setLevel(Deflater.BEST_COMPRESSION);
> Unfortunately, this can apparently cause the zip library to take a very long 
> time (10 minutes for 4.5 MB in the above thread) and so people may want to 
> change this setting.
> One approach would be to read the default from a Java system property, but 
> it seems that recently (pre 2.0, I think) there was an effort not to rely on 
> Java system properties (many were removed).
> A second approach would be to add static methods (and a static class 
> attribute) to globally set the compression level.
> A third approach would be in the document.Field class, e.g. a 
> setCompressLevel/getCompressLevel?  But then every time a document is created 
> with this field you'd have to call setCompressLevel, since Lucene doesn't 
> have a global Field schema (like Solr).
> Any other ideas / preferences for any of these approaches?




Re: [jira] Commented: (LUCENE-665) temporary file access denied on Windows

2006-09-15 Thread Michael McCandless

Yonik Seeley wrote:

> If it will happen so rarely, make it simpler and go directly for
> segments_(N-1)... (treat it like your previous plan if segments_N.done
> hadn't been written yet).

Yes, true, we could just fall back to the prior segments_(N-1) file in
this case.  Though that means the reader will likely just hit an
IOException trying to load the segments (since a commit is "in
process") and then I'd have to re-retry against segments_N.


> You need to fall back in any case... (remember the writer crashing
> scenario).
> 
> Reusing the fallback logic makes the code simpler in a case that will
> almost never happen.
> It's really just a question of if you put in extra retry logic or not.


Yes, but with the separate-file (segments_N.done) approach this is
just a file-exists check.  In other words, the reader does a dir
listing, finds the most recent segments_N for which there also exists
a segments_N.done, and uses that.
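For concreteness, a sketch of that dir-listing check (the segments_N /
segments_N.done names follow this thread; the base-36 generation encoding is
an assumption):

    import java.io.File;

    public class SegmentsFinder {
      // Return the name of the newest segments_N that has a matching
      // segments_N.done marker, or null if no commit has completed.
      public static String findLatestCommitted(File indexDir) {
        String best = null;
        long bestGen = -1;
        String[] names = indexDir.list();
        if (names == null) return null;
        for (int i = 0; i < names.length; i++) {
          String name = names[i];
          if (name.startsWith("segments_") && !name.endsWith(".done")
              && new File(indexDir, name + ".done").exists()) {
            long gen = Long.parseLong(name.substring("segments_".length()), 36);
            if (gen > bestGen) {
              bestGen = gen;
              best = name;
            }
          }
        }
        return best;
      }
    }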

But you're right, I could re-use the retry logic I already have now,
but instead of retrying forwards (N+1), retry backwards specifically
when hitting an IOException while reading the contents of the segments
file.  I don't think it's that hard to implement ... my biggest worry
is whether filesystem caching (for the likes of NFS) will cause
delays, ie does NFS "cache" (under its own timeout policy) the fact
that the file was N bytes when it was last opened?  I would rather
have a solution that doesn't rely on the caching policies of the
filesystem.

I will try both approaches & report back!


>> I've been using NFS as my "proxy" for "least common denominator"
> 
> I think that's a safe bet ;-)
> NFS v2 or v3?


Good question -- so far only v3.  I will test v2 as well.  So many
filesystems!

Mike




Re: [jira] Commented: (LUCENE-665) temporary file access denied on Windows

2006-09-15 Thread robert engels
Correct, but if you are using a distributed index, you should make
copies of it, not access the same index (on one machine) from
multiple machines. If you do that, you still need some sort of
server process on the master, and if that is available, it can
control the distribution of the index - so once again, no problem
with NFS due to renames, etc.


I guess I just don't see how any NFS issues should matter in a
properly deployed Lucene - even multiserver - environment.


It is similar to the classic reason why Unix databases never used/
needed file locks - the server process managed the locks internally,
which is far more efficient.




On Sep 15, 2006, at 9:49 AM, Yonik Seeley wrote:


On 9/15/06, robert engels <[EMAIL PROTECTED]> wrote:

Why not just use a dedicated server with an HTTP/TCP listener and let
it respond to Lucene queries.


If you have more than one server to handle the load, you need to
distribute the index to all the search boxes.  NFS is an easy way, but
I imagine performance would suffer.
Solr can use rsync to update each server with the segment changes, but
other methods like letting NFS handle it are possible.


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene  
search server





Re: [jira] Commented: (LUCENE-665) temporary file access denied on Windows

2006-09-15 Thread Yonik Seeley

On 9/15/06, robert engels <[EMAIL PROTECTED]> wrote:

Why not just use a dedicated server with an HTTP/TCP listener and let
it respond to Lucene queries.


If you have more than one server to handle the load, you need to
distribute the index to all the search boxes.  NFS is an easy way, but
I imagine performance would suffer.
Solr can use rsync to update each server with the segment changes, but
other methods like letting NFS handle it are possible.


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Delay Problem Adding a document to existing index

2006-09-15 Thread sunildm4u

Hi,

I am new to Lucene. In my project we are implementing a search engine with 
Lucene, and we have lakhs (hundreds of thousands) of records. The problem is:

Once I index 5 lakh records, it takes approximately 20 minutes. Now suppose I 
want to add 5 more records to the database. Even if I use

 IndexWriter indexWriter = new IndexWriter(getFilePath(),
getAnalyzer(), false);

it still builds the whole index from scratch, which again takes 20 minutes 
for the 5 extra records. That is not affordable. Does anyone have a solution 
for this? I am sending the code for your reference:

IndexWriter indexWriter = new IndexWriter(getFilePath(), getAnalyzer(), false);
// Note: the third argument is the "create" flag; false opens the existing
// index and appends to it rather than rebuilding it.

while (resultSet.next()) {
    Document document = new Document();

    document.add(new Field("Surname", resultSet.getString("NAME_1"),
                           Field.Store.YES, Field.Index.TOKENIZED));

    indexWriter.addDocument(document);

    count = count + 1;
    System.out.println("indexed: " + count + " " + resultSet.getString("NAME_1"));
}

// optimize() merges all segments down to one; on a large index this step is
// itself expensive.
indexWriter.optimize();
indexWriter.close();
 


[jira] Commented: (LUCENE-671) Hashtable based Document

2006-09-15 Thread Chris (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-671?page=comments#action_12434958 ] 

Chris commented on LUCENE-671:
--

> It is to keep folks from thinking, if they subclass Document, that instances 
> of their subclass will be returned to them in search results. To make 
> Documents fully subclassable one would need to make their serialization 
> extensible.

Ahhh, that makes sense to me, and I think providing a method for informing the 
rest of Lucene which versions of various classes to use is probably more 
trouble than it's worth. We'll just maintain our own tree then.

Thanks



> Hashtable based Document
> 
>
> Key: LUCENE-671
> URL: http://issues.apache.org/jira/browse/LUCENE-671
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index, Search
>Affects Versions: 2.0.0, 1.9
>Reporter: Chris
>Priority: Minor
> Attachments: HashDocument.java, TestBenchDocuments.java
>
>
> I've attached a Document based on a hashtable and a performance test case. It 
> performs better in most cases (all but enumeration, by my measurements), but 
> likely uses a larger memory footprint. The Document test case will fail, since 
> it accesses the "fields" variable directly and gets confused when that is not 
> the list it expected it to be. 
> If nothing else, we would be interested in at least being able to extend 
> Document, which is currently declared final. (Does anyone know the performance 
> gains of declaring a class final?) Currently we have to maintain a copy of 
> Lucene with methods and classes definalized and overridden. 
> There are other classes as well that could be declared non-final (Fieldable 
> comes to mind), since it's possible to make project-specific changes in those 
> as well, but that's off-topic.
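A speculative sketch of the core idea (the attached HashDocument.java is the
real implementation; this only illustrates hash-keyed field lookup replacing
a linear scan over a field list):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class HashFields {
      // Field name -> list of values, so get(name) is a hash lookup instead
      // of the linear scan Document.get(String) performs over its field list.
      private final Map nameToValues = new HashMap();

      public void add(String name, String value) {
        List values = (List) nameToValues.get(name);
        if (values == null) {
          values = new ArrayList();
          nameToValues.put(name, values);
        }
        values.add(value);
      }

      public String get(String name) {
        List values = (List) nameToValues.get(name);
        return values == null ? null : (String) values.get(0);
      }
    }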




[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2006-09-15 Thread Paul Elschot (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-584?page=comments#action_12434901 ] 

Paul Elschot commented on LUCENE-584:
-

In the inheritance from Matcher to Scorer there is an asymmetry
in this patch.

Matcher provides a default implementation for Matcher.explain(),
but Scorer does not override it, which might lead to surprises
for future Scorers when the inherited Matcher.explain() is used.
One could add an abstract Scorer.explain() to catch these, or
provide a default implementation for Scorer.explain().

For matcher implementations quite a few other implementation
decisions still need to be taken.
Also, any place in the current code where a Scorer is used, but none
of the Scorer.score() methods is called, is a candidate for a change
from Scorer to Matcher.
This will mostly be the current filtering implementations,
but ConstantScoreQuery is another nice example.

Regards,
Paul Elschot
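For concreteness, a minimal sketch of the asymmetry and the abstract-method
remedy (class shapes assumed from this comment, not taken from the attached
patch):

    import java.io.IOException;
    import org.apache.lucene.search.Explanation;

    abstract class Matcher {
      public abstract boolean next() throws IOException;
      public abstract int doc();

      // Default implementation: fine for pure matchers, but a future Scorer
      // could inherit it silently and explain no score at all.
      public Explanation explain(int doc) throws IOException {
        Explanation e = new Explanation();
        e.setValue(0.0f);
        e.setDescription("match without score");
        return e;
      }
    }

    abstract class Scorer extends Matcher {
      public abstract float score() throws IOException;

      // Suggested remedy: redeclare explain() abstract here so every
      // concrete Scorer must provide its own explanation.
      public abstract Explanation explain(int doc) throws IOException;
    }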


> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: http://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: BitsMatcher.java, Filter-20060628.patch, 
> HitCollector-20060628.patch, IndexSearcher-20060628.patch, 
> MatchCollector.java, Matcher.java, Matcher20060830b.patch, 
> Scorer-20060628.patch, Searchable-20060628.patch, Searcher-20060628.patch, 
> Some Matchers.zip, SortedVIntList.java, TestSortedVIntList.java
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.
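A minimal sketch of such a delegating default, using the AbstractBitSet
interface quoted above (illustrative only):

    import java.util.BitSet;

    public class DefaultBitSet implements AbstractBitSet {
      private final BitSet bits;

      public DefaultBitSet(BitSet bits) {
        this.bits = bits;
      }

      // Delegate straight to java.util.BitSet; a sparse implementation such
      // as the attached SortedVIntList could implement the same interface.
      public boolean get(int index) {
        return bits.get(index);
      }
    }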
