[jira] Updated: (LUCENE-443) ConjunctionScorer tune-up

2006-09-21 Thread Paul Elschot (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-443?page=all ]

Paul Elschot updated LUCENE-443:


Attachment: Conjunction20060921.patch

IIRC the original performance problem was caused by creation of objects in the 
tight loop doing skipTo() on all the scorers.

This patch is against current trunk and based on the earlier posted versions of 
ConjunctionScorer, which were based (by the first poster) on an existing 
ConjunctionScorer with an ASL notice, which is why I could grant the license 
to the ASF.

> ConjunctionScorer tune-up
> -
>
> Key: LUCENE-443
> URL: http://issues.apache.org/jira/browse/LUCENE-443
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 1.9
> Environment: Linux, Java 1.5, Large Index with 4 million items and 
> some heavily nested boolean queries
>Reporter: Abdul Chaudhry
> Attachments: Conjunction20060921.patch, ConjunctionScorer.java, 
> ConjunctionScorer.java
>
>
> I just recently ran a load test on the latest code from Lucene, which is 
> using a new BooleanScorer, and noticed the ConjunctionScorer was crunching 
> through objects, especially while sorting as part of the skipTo call. It 
> turns a linked list into an array, sorts the array, then converts the array 
> back to a linked list for further processing by the scoring engines below.
> I'm not sure if anyone else is experiencing this, as I have a very large index 
> (> 4 million items) and I am issuing some heavily nested queries.
> Anyway, I decided to change the linked list into an array and use a first and 
> last marker to "simulate" a linked list.
> This scaled much better during my load test as the Java garbage collector 
> was less - umm - virulent 
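
For illustration, a minimal sketch of the array-with-markers idea the report
describes - hypothetical names, not the attached patch - assuming the Lucene
1.9 Scorer API with doc() and skipTo(int):

  import java.io.IOException;
  import org.apache.lucene.search.Scorer;

  // Keep the scorers in a fixed, pre-sorted array and advance two cursors
  // instead of rebuilding a linked list on every skipTo(); nothing is
  // allocated inside the loop, so the garbage collector stays quiet.
  final class ConjunctionSketch {
    private final Scorer[] scorers; // sorted once by current doc
    private int first = 0;          // cursor to the scorer with the smallest doc
    private int last;               // cursor to the scorer with the largest doc

    ConjunctionSketch(Scorer[] scorers) {
      this.scorers = scorers;
      this.last = scorers.length - 1;
    }

    // Skip the lagging scorer up to the leading scorer's doc, rotating the
    // markers, until all scorers sit on the same document.
    boolean doNext() throws IOException {
      while (scorers[first].doc() < scorers[last].doc()) {
        if (!scorers[first].skipTo(scorers[last].doc()))
          return false;                       // a scorer is exhausted
        last = first;                         // it is now the leader
        first = (first + 1) % scorers.length; // next candidate to advance
      }
      return true;                            // all agree on one doc
    }
  }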

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-443) ConjunctionScorer tune-up

2006-09-21 Thread Paul Elschot (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-443?page=comments#action_12436453 ] 

Paul Elschot commented on LUCENE-443:
-

I had just overlooked the grant by Abdul to the ASF.

> ConjunctionScorer tune-up
> -
>
> Key: LUCENE-443
> URL: http://issues.apache.org/jira/browse/LUCENE-443
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 1.9
> Environment: Linux, Java 1.5, Large Index with 4 million items and 
> some heavily nested boolean queries
>Reporter: Abdul Chaudhry
> Attachments: Conjunction20060921.patch, ConjunctionScorer.java, 
> ConjunctionScorer.java
>
>
> I just recently ran a load test on the latest code from Lucene, which is 
> using a new BooleanScorer, and noticed the ConjunctionScorer was crunching 
> through objects, especially while sorting as part of the skipTo call. It 
> turns a linked list into an array, sorts the array, then converts the array 
> back to a linked list for further processing by the scoring engines below.
> I'm not sure if anyone else is experiencing this, as I have a very large index 
> (> 4 million items) and I am issuing some heavily nested queries.
> Anyway, I decided to change the linked list into an array and use a first and 
> last marker to "simulate" a linked list.
> This scaled much better during my load test as the Java garbage collector 
> was less - umm - virulent 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Clustering IndexWriter?

2006-09-21 Thread adasal

Don't be coy, what's your company?
Adam

On 21/09/06, Steve Harris <[EMAIL PROTECTED]> wrote:


Warning, I'm a vendor dude but this isn't really a vendor message.

My IT guy had mentioned to me that a bunch of the open source products
we use (JIRA, JForum etc) have Lucene inside, so in the name of eating
our own dog food I tried to cluster IndexWriter (with a RAMDirectory)
using our (Terracotta) clustering technology.

Took me about half an hour to get the basics working from download
time. I was wondering: do people in the real world want to be able to
cluster this stuff? Is clustering the IndexWriter really all I need to do?

If it is interesting, how do I feed a small code change back into the
project? We don't yet support subclasses of collections, and
SegmentInfos subclasses Vector. I just turned it into aggregation
(that took 10 of the 30 minutes). We will support this in a future
release so it isn't a huge deal, but I could get something out sooner
if the change was made.

Cheers,
Steve
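
For concreteness, a minimal sketch of the subclass-to-aggregation change Steve
describes (illustrative only, assuming the class stays in
org.apache.lucene.index next to SegmentInfo; his actual file appears later in
this digest):

  import java.util.Vector;

  // SegmentInfos currently IS-A Vector; the aggregation version HAS-A Vector
  // and forwards the few calls the rest of the index code needs.
  final class SegmentInfos {
    private Vector vector = new Vector();

    public final SegmentInfo info(int i) {
      return (SegmentInfo) vector.elementAt(i);
    }

    public final void addElement(SegmentInfo si) {
      vector.addElement(si);
    }

    public final int size() {
      return vector.size();
    }
  }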

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-09-21 Thread Karl Wettin (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436502 ] 

Karl Wettin commented on LUCENE-675:


It is also interesting to know how much time is consumed assembling an 
instance of Document from storage. According to my own tests this is the 
major reason why InstantiatedIndex is so much faster than an FS/RAMDirectory. 
I also presume it to be the bottleneck of any RDBMS-, RMI- or other 
"proxy"-based storage. 

> Lucene benchmark: objective performance test for Lucene
> ---
>
> Key: LUCENE-675
> URL: http://issues.apache.org/jira/browse/LUCENE-675
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Andrzej Bialecki 
> Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing 
> and querying, on a known corpus. This issue is intended to collect comments 
> and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is 
> the original Reuters collection, available from 
> http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
> or 
> http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
>  I propose to use this corpus as a base for benchmarks. The benchmarking 
> suite could automatically retrieve it from known locations, and cache it 
> locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Clustering IndexWriter?

2006-09-21 Thread Vic Bancroft

adasal wrote:


Don't be coy, what's your company?


This URL is derivable from the text, with a little search engine help . . .
 http://www.terracottatech.com/terracotta_spring.shtml

more,
l8r,
v


On 21/09/06, Steve Harris <[EMAIL PROTECTED]> wrote:



Warning, I'm a vendor dude but this isn't really a vendor message.

My IT guy had mentioned to me that a bunch of the open source products
we use (JIRA, JForum etc) have Lucene inside, so in the name of eating
our own dog food I tried to cluster IndexWriter (with a RAMDirectory)
using our (Terracotta) clustering technology.

Took me about half an hour to get the basics working from download
time. I was wondering: do people in the real world want to be able to
cluster this stuff? Is clustering the IndexWriter really all I need
to do?


If it is interesting, how do I feed a small code change back into the
project? We don't yet support subclasses of collections, and
SegmentInfos subclasses Vector. I just turned it into aggregation
(that took 10 of the 30 minutes). We will support this in a future
release so it isn't a huge deal, but I could get something out sooner
if the change was made.

Cheers,
Steve

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







--
"The future is here. It's just not evenly distributed yet."
-- William Gibson, quoted by Whitfield Diffie


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-09-21 Thread Grant Ingersoll (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436516 ] 

Grant Ingersoll commented on LUCENE-675:


Since this has dependencies, do you think we should put it under contrib?  I 
would be for a Performance directory and we could then organize it from there.  
Perhaps into packages for quantitative and qualitative performance.  

> Lucene benchmark: objective performance test for Lucene
> ---
>
> Key: LUCENE-675
> URL: http://issues.apache.org/jira/browse/LUCENE-675
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Andrzej Bialecki 
> Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing 
> and querying, on a known corpus. This issue is intended to collect comments 
> and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is 
> the original Reuters collection, available from 
> http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
> or 
> http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
>  I propose to use this corpus as a base for benchmarks. The benchmarking 
> suite could automatically retrieve it from known locations, and cache it 
> locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-09-21 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436518 ] 

Andrzej Bialecki  commented on LUCENE-675:
--

The dependency on commons-compress could be avoided - I used it just to be 
able to unpack tar.gz files, and we can use Ant for that. If you meant the 
dependency on the corpus - can't Ant download that too, as a dependency?

Re: Project Gutenberg - good point, it is a good source for multi-lingual 
documents. The "Europarl" collection is another, although a bit more hefty, 
so it could be suitable for running large-scale benchmarks, with texts from 
Project Gutenberg for running small-scale tests.

> Lucene benchmark: objective performance test for Lucene
> ---
>
> Key: LUCENE-675
> URL: http://issues.apache.org/jira/browse/LUCENE-675
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Andrzej Bialecki 
> Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing 
> and querying, on a known corpus. This issue is intended to collect comments 
> and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is 
> the original Reuters collection, available from 
> http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
> or 
> http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
>  I propose to use this corpus as a base for benchmarks. The benchmarking 
> suite could automatically retrieve it from known locations, and cache it 
> locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-09-21 Thread Grant Ingersoll (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436519 ] 

Grant Ingersoll commented on LUCENE-675:


Yeah, Ant can do this, I think.  Take a look at the DB contrib package; it 
downloads its dependencies.  I think I can set up the necessary stuff in 
contrib, if people think that is a good idea.  First contribution will be this 
file and then we can go from there.  I think Otis has run some perf. stuff 
too, but I am not sure if it can be contributed.  I think someone else has 
really studied query perf., so it would be cool if that was added too.

> Lucene benchmark: objective performance test for Lucene
> ---
>
> Key: LUCENE-675
> URL: http://issues.apache.org/jira/browse/LUCENE-675
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Andrzej Bialecki 
> Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing 
> and querying, on a known corpus. This issue is intended to collect comments 
> and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is 
> the original Reuters collection, available from 
> http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
> or 
> http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
>  I propose to use this corpus as a base for benchmarks. The benchmarking 
> suite could automatically retrieve it from known locations, and cache it 
> locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

Yep, that's us. No secret, just didn't want to make my question a
billboard :-). Just needed a bit of info from the people who know
best.
Cheers,
steve

On 9/21/06, Vic Bancroft <[EMAIL PROTECTED]> wrote:

adasal wrote:

> Don't be coy, what's your company?

This URL is derivable from the text, with a little search engine help . . .
  http://www.terracottatech.com/terracotta_spring.shtml

more,
l8r,
v

> On 21/09/06, Steve Harris <[EMAIL PROTECTED]> wrote:
>
>>
>> Warning, I'm a vendor dude but this isn't really a vendor message.
>>
>> My IT guy had mentioned to me that a bunch of the open source products
>> we use (JIRA, JForum etc) have Lucene inside, so in the name of eating
>> our own dog food I tried to cluster IndexWriter (with a RAMDirectory)
>> using our (Terracotta) clustering technology.
>>
>> Took me about half an hour to get the basics working from download
>> time. I was wondering: do people in the real world want to be able to
>> cluster this stuff? Is clustering the IndexWriter really all I need
>> to do?
>>
>> If it is interesting, how do I feed a small code change back into the
>> project? We don't yet support subclasses of collections, and
>> SegmentInfos subclasses Vector. I just turned it into aggregation
>> (that took 10 of the 30 minutes). We will support this in a future
>> release so it isn't a huge deal, but I could get something out sooner
>> if the change was made.
>>
>> Cheers,
>> Steve
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>


--
"The future is here. It's just not evenly distributed yet."
 -- William Gibson, quoted by Whitfield Diffie


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Clustering IndexWriter?

2006-09-21 Thread Yonik Seeley

On 9/20/06, Steve Harris <[EMAIL PROTECTED]> wrote:

Is clustering the IndexWriter really all I need to do?


Hi Steve,
Could you explain the details of what "clustering" really means in this context?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

Sure,

I'm fairly new to Lucene, but what I was trying to do was make it so
that an index could be shared among multiple nodes. If an index is
updated in any way, it is updated across the cluster coherently.
In my first version I was really only taking advantage of the fact
that we detect fine-grained changes and can extend synchronization
across the cluster, but if I can prove to myself that this is actually
useful I'll go back and mark some of the synchronized blocks/methods as
read locks to improve concurrency and reduce instrumentation to only
what is needed.

If I'm going to be able to publish the config for what I'm doing, I
would need to change that one class that I mentioned above, because we
won't support subclasses of collections for a few more months.

I'm not a very good writer. Does any of that make sense?

Summary would be:
Goals:
Usefully cluster Lucene indexes across multiple nodes.

Questions:
Is this useful in the real world?
Would it be possible to get that one small thing changed?

Cheers,
Steve

On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 9/20/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> Is clustering the IndexWriter really all I need to do?

Hi Steve,
Could you explain the details of what "clustering" really means in this context?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Re: Clustering IndexWriter?

2006-09-21 Thread Otis Gospodnetic
I don't fully follow, and I don't even have the "it's late!" excuse.  It sounds 
like you want to have the same index on multiple nodes in the cluster and when 
a data change occurs, you want to synchronously make the same change to all 
indices in your cluster.  Is that it?

Solr has a different approach.  There, only the master index is modified, while 
slave servers copy the master index periodically.

Otis

- Original Message 
From: Steve Harris <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, September 21, 2006 11:18:43 AM
Subject: Re: Re: Clustering IndexWriter?

Sure,

I'm fairly new to Lucene, but what I was trying to do was make it so
that an index could be shared among multiple nodes. If an index is
updated in any way, it is updated across the cluster coherently.
In my first version I was really only taking advantage of the fact
that we detect fine-grained changes and can extend synchronization
across the cluster, but if I can prove to myself that this is actually
useful I'll go back and mark some of the synchronized blocks/methods as
read locks to improve concurrency and reduce instrumentation to only
what is needed.

If I'm going to be able to publish the config for what I'm doing, I
would need to change that one class that I mentioned above, because we
won't support subclasses of collections for a few more months.

I'm not a very good writer. Does any of that make sense?

Summary would be:
Goals:
Usefully cluster Lucene indexes across multiple nodes.

Questions:
Is this useful in the real world?
Would it be possible to get that one small thing changed?

Cheers,
Steve

On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On 9/20/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> > Is clustering the IndexWriter really all I need to do?
>
> Hi Steve,
> Could you explain the details of what "clustering" really means in this 
> context?
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search server
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-21 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---

Attachment: NewIndexModifier.Sept21.patch

This is to update the delete-support patch after the commit of the new merge 
policy.
  - Very few changes to IndexWriter.
  - The patch passes all tests.
  - A new test called TestNewIndexModifierDelete is added to show different 
scenarios when using the delete methods in NewIndexModifier.

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> -
>
> Key: LUCENE-565
> URL: http://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Ning Li
> Attachments: IndexWriter.java, IndexWriter.July09.patch, 
> IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, 
> NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, 
> NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, 
> newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, 
> perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We
> are aware that this kind of change has come up in discussions before. See
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> . The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> ---
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that, with this change it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
> Coding Changes
> --
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document are properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines of coding changes are needed. All
> changes are commented by "CHANGE". We have also attached a modified version
> of an example from Chapter 2.2 of Lucene in Action.
> Performance Results
> ---
> To test the performance our proposed changes, we ran some experiments using
> the TREC WT 10G dataset. The experiments were run on a dual 2.4 Ghz Intel
> Xeon server running Linux. The disk storage was configured as RAID0 array
> with 5 drives. Before indexes were built, the input documents were parsed
> to remove the HTML from them (i.e., only the text was indexed). This was
> done to minimize the impact of parsing on performance. A simple
> WhitespaceAnalyzer was used during index build.
> We experimented with three workloads:
>   - Insert only. 1.6M documents were inserted and the final
> index size was 2.3GB.
>   - Insert/delete (big batches). The same documents were
> inserted, but 25% were deleted. 1000 documents were
> deleted for every 4000 inserted.
>   - Insert/delete (small batches). In this case, 5 documents
> were deleted for every 20 inserted.
> Workload                        current IndexWriter   current IndexModifier   new IndexWriter
> ----------------------------------------------------------------------------------------------
> Insert only                     116 min               119 min                 116 min
> Insert/delete (big batches)     --                    135 min                 125 min
> Insert/delete (small batches)   --                    338 min                 134 min
> As the experiments show, with the proposed changes, the performance
> improved by 60% when inserts and deletes were interleaved in small batches.
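
A rough usage sketch against the proposed API - deleteDocuments(Term) is the
method from the proposal above; the path, field, and batch shape are invented
to mirror the small-batch workload:

  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;

  public class InterleavedUpdates {
    public static void main(String[] args) throws Exception {
      // One writer handles both inserts and buffered deletes; deletes are
      // applied when the RAM buffer is flushed or the writer is closed.
      IndexWriter writer = new IndexWriter("/tmp/idx", new WhitespaceAnalyzer(), true);
      for (int i = 0; i < 20; i++) {
        Document doc = new Document();
        doc.add(new Field("id", Integer.toString(i),
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);                  // insert
        if (i % 4 == 3)                           // 5 deletes per 20 inserts,
          writer.deleteDocuments(                 // as in the small-batch run
              new Term("id", Integer.toString(i - 3))); // buffered until flush
      }
      writer.close(); // flush applies any pending deletes
    }
  }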

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-09-21 Thread Otis Gospodnetic (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436587 ] 

Otis Gospodnetic commented on LUCENE-675:
-

I still haven't gotten my employer to sign and fax the CCLA, so I'm stuck and 
can't contribute my search benchmark.

I have a suggestion for a name for this - Lube, for Lucene Benchmark - 
contrib/lube.

> Lucene benchmark: objective performance test for Lucene
> ---
>
> Key: LUCENE-675
> URL: http://issues.apache.org/jira/browse/LUCENE-675
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Andrzej Bialecki 
> Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing 
> and querying, on a known corpus. This issue is intended to collect comments 
> and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is 
> the original Reuters collection, available from 
> http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz 
> or 
> http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz.
>  I propose to use this corpus as a base for benchmarks. The benchmarking 
> suite could automatically retrieve it from known locations, and cache it 
> locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Updated: (LUCENE-665) temporary file access denied on Windows

2006-09-21 Thread Chris Hostetter

The recurring pattern seems to be...

  ResultType methodName(ArgType args) throws ExceptionType {
int trialsSoFar = 0;
long maxTime = System.currentTimeMillis() + maxTotalDelay;
Exception error = null;
while (waitAgain(maxTime, trialsSoFar++, error)) {
  try {
return super.methodName(args);
  } catch (ExceptionType e) {
error = e;
  }
}
return super.methodName(args);
  }

...where the waitAgain method seems to take in more args than it really
needs (error is completely unused, and trialsSoFar is only needed to know if
we are on the first trial)

There may be a subtlety i'm missing here, but it seems like this might be
more clearly (and succinctly) expressed with something like...

  ResultType methodName(ArgType args) throws ExceptionType {
    long maxTime = System.currentTimeMillis() + maxTotalDelay;
    while (true) {
      try {
        return super.methodName(args);
      } catch (ExceptionType e) {
        if (maxTime < System.currentTimeMillis()) throw e;
      }
      delay(maxTime);
    }
  }

...where the delay helper also gets simpler (renamed from "wait", which would
collide with Object.wait)...

  static void delay(long maxTime) {
    long moreTime = maxTime - System.currentTimeMillis();
    long delay = Math.min(moreTime, intervalDelay);
    try {
      Thread.sleep(delay);
    } catch (InterruptedException e1) {  /* NOOP */  }
  }

...but i haven't tried this, and as i said: there may be some subtlety of
your approach that i'm missing.



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Re: Clustering IndexWriter?

2006-09-21 Thread Chris Hostetter

: Questions:
: Is this useful in the real world?
: Would it be possible to get that one small thing changed?

I'm not really clear on what the "small thing" is that you are asking
about ... you mentioned SegmentInfos subclassing Vector, are you proposing
an alternative?  If you've got a patch that doesn't break existing
functionality or have a negative impact on performance and makes lucene
more usable in some way it would certainly be considered ... i'm just not
really clear on what change you're suggesting and how it helps make Lucene
more usable for you.





-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



help on Lock.obtain(lockWaitTimeout)

2006-09-21 Thread Michael McCandless

I'm working on a LockFactory that uses java.nio.* (OS native locks)
for its locks.

This should be a big help for people who keep finding their lock files
left on disk due to abnormal shutdown, etc. (because the OS will free the
locks, no matter what, "in theory").

I thought I was nearly done, but in testing the new LockFactory on
an NFS server that didn't have locks properly configured (I think
possibly a common situation) I found a problem with how
Lock.obtain(lockWaitTimeout) works.

That function precomputes how many times to try to obtain the lock
(it just divides the lockWaitTimeout parameter by LOCK_POLL_INTERVAL) and
then tries Lock.obtain() followed by a sleep of LOCK_POLL_INTERVAL,
that many times, before timing out.
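
Paraphrasing the current loop from memory (a sketch, not copied verbatim from
org.apache.lucene.store.Lock), the hidden assumption is visible in the budget
arithmetic:

  public boolean obtain(long lockWaitTimeout) throws IOException {
    boolean locked = obtain();
    int maxSleepCount = (int) (lockWaitTimeout / LOCK_POLL_INTERVAL);
    int sleepCount = 0;
    while (!locked) {
      if (sleepCount++ == maxSleepCount)
        throw new IOException("Lock obtain timed out: " + this);
      try {
        Thread.sleep(LOCK_POLL_INTERVAL); // only the sleeps are budgeted;
      } catch (InterruptedException e) {  // time spent inside obtain() is not
        throw new IOException(e.toString());
      }
      locked = obtain();
    }
    return locked;
  }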

The problem is, in the above test case: the call to Lock.obtain() can
apparently take a long time (35 seconds, I assume some kind of
underlying timeout contacting "lockd" from the NFS client) only to
finally return "false".  But the "try N times" approach makes the
assumption that this call will take zero time.  (In fact, as things
stand now, when Lock.obtain() takes non-zero time, it causes the
timeout to be longer than what was asked for; but likely this is
typically a small amount?).

Anyway, my first reaction was to change this to use
System.currentTimeMillis() to measure elapsed time, but then I
remembered that is a dangerous approach, because whenever the clock on the
machine is updated (eg by a time-sync NTP client) it would mess up
this function, causing it to either take longer than was asked for (if
the clock is moved backwards) or to time out in [much] less time than was
asked for (if the clock was moved forwards).  I've hit such issues in the
past and it's devilish.  Timezone and daylight savings time don't
matter because it's measuring GMT.

So then what to do?  What's the best way to change the function to
"really" measure time?  In Java 1.5 there is now a "nanoTime()" which
is closer to what I need, but it's 1.5 (and we're still on 1.4), and
apparently it can "fall back" to currentTimeMillis() on some platforms.
In the past I've used a separate "clock" thread that just
sleeps & increments a counter, but I don't really like the idea of
spawning a whole new thread (Lucene doesn't launch its own threads
now, except for ParallelMultiSearcher).

Does anyone know of a good solution?

Alternatively, since this is really a "misconfiguration" (ie the
Lock.obtain() is never going to succeed), maybe we could try to obtain
a random "test" lock on creation of the LockFactory, just to confirm
that locking even "works" at all in the current environment, and then
leave the current implementation of Lock.obtain() unchanged (when NFS
locking is properly configured it seems to be fairly fast)?

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

Fair question.

All I did/needed was take SegmentInfos and, instead of subclassing Vector,
make it contain a Vector. Went from subclassing to aggregation. As far as I
could tell from reading the code it would make no difference to anyone and
should have no performance impact (good or bad). It just allowed me to
cluster the IndexWriter with a RAMDirectory.

Maybe a little background would help. Our clustering product doesn't
use Java serialization and has no API. We just use a little config
where you point us at what you want clustered and what Java
synchronization needs to be shared. One of the limitations that
currently exists is that we don't support clustering subclasses of
Java collections.

At this point I'm just experimenting to see if our product can cluster
Lucene in a useful/performant way. When my experimenting is complete,
if everything is positive, I am going to write a blog post on clustering
Lucene indexes, but it would be awkward to do that if the people who
run through the example have to change Lucene code.

Does this help?

Cheers,
Steve

On 9/21/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:


: Questions:
: Is this useful in the real world?
: Would it be possible to get that one small thing changed?

I'm not really clear on what the "small thing" is that you are asking
about ... you mentioned SegmentInfos subclassing Vector, are you proposing
an alternative?  If you've got a patch that doesn't break existing
functionality or have a negative impact on performance and makes lucene
more usable in some way it would certainly be considered ... i'm just not
really clear on what change you're suggesting and how it helps make Lucene
more usable for you.





-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Updated: (LUCENE-665) temporary file access denied on Windows

2006-09-21 Thread Doron Cohen
Thanks for the comments!

Indeed, the first version I wrote followed the pattern you suggest (let's
name it pattern_1 for the discussion). However, with pattern_1 I could not
cover the case of a method originally not throwing an exception. The
problem is that in pattern_1 we have to catch the exception before deciding
whether to wait or not. But if the decision is not to wait, the caught
exception must be thrown, which is not allowed by the original method
signature.

That's why I made waitAgain() (1) not wait on the first call, and (2)
return true iff another call to waitAgain() is anticipated.

This allows using the same code pattern for both types of methods: those
originally throwing an exception and those originally not throwing. I see
this as an advantage.

As for passing the exception in the waitAgain args - this served two purposes:
(1) debugging: it is a convenient single spot in the code to collect info
on the exception that caused the retry, and also on the number of
successive retries for the same original method call; (2) exception
analysis: if one wants to analyze the exception root-cause message when
deciding whether to wait or not, this is a convenient location (although it
would not allow deciding whether to retry or not, because even if waitMore
returns false there would be an additional try). So, when the debug is
commented out, it is possible to not pass the exception, and also to make
trialsSoFar a boolean - but for debug purposes I would rather leave them
there; I believe there is no real harm done here, particularly because this
method is hardly ever called. Perhaps I should mention in the waitMore
javadoc that these args are mainly for debugging?
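
For illustration, the pattern applied to a hypothetical wrapper method that
declares no checked exception (nameExists() is an invented stand-in;
waitAgain() and maxTotalDelay are the names from the patch). With a checked
ExceptionType, the early rethrow in pattern_1 would not compile here:

  // Hypothetical: the base class method nameExists(String) declares no
  // checked exception, so the wrapper's signature may not declare one either.
  boolean nameExists(String name) {
    int trialsSoFar = 0;
    long maxTime = System.currentTimeMillis() + maxTotalDelay;
    RuntimeException error = null;
    while (waitAgain(maxTime, trialsSoFar++, error)) {
      try {
        return super.nameExists(name);
      } catch (RuntimeException e) {
        error = e; // remember and retry; rethrowing a *checked* type here
      }            // would not compile, since the signature declares none
    }
    return super.nameExists(name); // final attempt, outside any try, so a
  }                                // last failure propagates untouched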

Chris Hostetter <[EMAIL PROTECTED]> wrote on 21/09/2006 12:10:56:

>
> The recurring pattern seems to be...
>
>   ResultType methodName(ArgType args) throws ExceptionType {
>     int trialsSoFar = 0;
>     long maxTime = System.currentTimeMillis() + maxTotalDelay;
>     Exception error = null;
>     while (waitAgain(maxTime, trialsSoFar++, error)) {
>       try {
>         return super.methodName(args);
>       } catch (ExceptionType e) {
>         error = e;
>       }
>     }
>     return super.methodName(args);
>   }
>
> ...where the waitAgain method seems to take in more args than it really
> needs (error is completely unused, and trialsSoFar is only needed to know
> if we are on the first trial)
>
> There may be a subtlety i'm missing here, but it seems like this might be
> more clearly (and succinctly) expressed with something like...
>
>   ResultType methodName(ArgType args) throws ExceptionType {
>     long maxTime = System.currentTimeMillis() + maxTotalDelay;
>     while (true) {
>       try {
>         return super.methodName(args);
>       } catch (ExceptionType e) {
>         if (maxTime < System.currentTimeMillis()) throw e;
>       }
>       delay(maxTime);
>     }
>   }
>
> ...where the delay helper also gets simpler (renamed from "wait", which would
> collide with Object.wait)...
>
>   static void delay(long maxTime) {
>     long moreTime = maxTime - System.currentTimeMillis();
>     long delay = Math.min(moreTime, intervalDelay);
>     try {
>       Thread.sleep(delay);
>     } catch (InterruptedException e1) {  /* NOOP */  }
>   }
>
> ...but i haven't tried this, and as i said: there may be some subtlety of
> your approach that i'm missing.
>
>
>
> -Hoss
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: help on Lock.obtain(lockWaitTimeout)

2006-09-21 Thread Yonik Seeley

On 9/21/06, Michael McCandless <[EMAIL PROTECTED]> wrote:

Anyway, my first reaction was to change this to use
System.currentTimeMillis() to measure elapsed time, but then I
remembered is a dangerous approach because whenever the clock on the
machine is updated (eg by a time-sync NTP client) it would mess up
this function, causing it to either take longer than was asked for (if
clock is moved backwards) or, to timeout in [much] less time than was
asked for (if clock was moved forwards).


Um, wow... that's thorough design work!

In this case, I don't think it's something to worry about though.
NTP corrections are likely to be very small, not on the scale of
lock-obtain timeouts.
If one can't obtain a lock, it's due to something else asynchronously
happening, and that throws a lot bigger time variation into the
equation anyway.


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Yonik Seeley

While automatically clustering java objects sure sounds cool, I have
to wonder what the performance ends up being.  Every small change to
the clustered objects is broadcast to all the nodes, correct?

Have you done any performance comparisons to see if this is a
practical approach for Lucene?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:

Fair question.

All I did/needed was take SegmentInfos and, instead of subclassing Vector,
make it contain a Vector. Went from subclassing to aggregation. As far as I
could tell from reading the code it would make no difference to anyone and
should have no performance impact (good or bad). It just allowed me to
cluster the IndexWriter with a RAMDirectory.

Maybe a little background would help. Our clustering product doesn't
use Java serialization and has no API. We just use a little config
where you point us at what you want clustered and what Java
synchronization needs to be shared. One of the limitations that
currently exists is that we don't support clustering subclasses of
Java collections.

At this point I'm just experimenting to see if our product can cluster
Lucene in a useful/performant way. When my experimenting is complete,
if everything is positive, I am going to write a blog post on clustering
Lucene indexes, but it would be awkward to do that if the people who
run through the example have to change Lucene code.

Does this help?

Cheers,
Steve


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

Good question. May or may not be performant enough. Only time (and
testing) will tell. My guess is that it will depend heavily on the
rate at which the data changes (or read/write ratio).

Believe me, I'm not proposing that everyone go out and cluster lucene
with terracotta dso. I'm really just playing, researching, learning.
I'm a firm believer in using the right tool for the right job and
would never claim that any product (especially one I wrote) is right
for everyone.

My guess is that some segment of the world cares a lot about realtime
coherent updates and some segment of the world needs blinding speed.
Part of my research is to gather the expertise of this group on these
issues.

Cheers,
Steve

On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

While automatically clustering java objects sure sounds cool, I have
to wonder what the performance ends up being.  Every small change to
the clustered objects is broadcast to all the nodes, correct?

Have you done any performance comparisons to see if this is a
practical approach for Lucene?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> Fair question.
>
> All I did/needed was take SegmentInfos and, instead of subclassing Vector,
> make it contain a Vector. Went from subclassing to aggregation. As far as I
> could tell from reading the code it would make no difference to anyone and
> should have no performance impact (good or bad). It just allowed me to
> cluster the IndexWriter with a RAMDirectory.
>
> Maybe a little background would help. Our clustering product doesn't
> use Java serialization and has no API. We just use a little config
> where you point us at what you want clustered and what Java
> synchronization needs to be shared. One of the limitations that
> currently exists is that we don't support clustering subclasses of
> Java collections.
>
> At this point I'm just experimenting to see if our product can cluster
> Lucene in a useful/performant way. When my experimenting is complete,
> if everything is positive, I am going to write a blog post on clustering
> Lucene indexes, but it would be awkward to do that if the people who
> run through the example have to change Lucene code.
>
> Does this help?
>
> Cheers,
> Steve

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:

My guess is that some segment of the world cares a lot about realtime
coherent updates and some segment of the world needs blinding speed.
Part of my research is to gather the expertise of this group on these
issues.


I hear ya...

There is another part to the equation for Lucene though.
Coherent realtime updates to the IndexWriter/RamDirectory alone
doesn't get you all the way there since things are only readable
through an IndexReader that needs to be reopened to see changes.


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

Interesting.
I wonder, I have a notification mechanism at my disposal as well. I
wonder if it could be worked out that, much like an MVC, an IndexReader
could be notified when the underlying Directory has changed so that
the reader can adjust itself?

Cheers,
Steve


On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> My guess is that some segment of the world cares a lot about realtime
> coherent updates and some segment of the world needs blinding speed.
> Part of my research is to gather the expertise of this group on these
> issues.

I hear ya...

There is another part to the equation for Lucene though.
Coherent realtime updates to the IndexWriter/RamDirectory alone
doesn't get you all the way there since things are only readable
through an IndexReader that needs to be reopened to see changes.


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: help on Lock.obtain(lockWaitTimeout)

2006-09-21 Thread Doron Cohen
For obtain(timeout), to prevent waiting too long you could compute the
maximum number of times that obtain() can be executed (assuming, as in the
current code, that obtain() executes in no time), then break if either it
was executed sufficiently many times or if time is up. I don't see how to
prevent waiting too short.

Btw, I wonder what happens if the time change from a sync occurs in the
middle of the sleep - since sleep is implemented natively, this must be
taken care of correctly by the underlying OS...?
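
As a sketch (hypothetical, untested), that dual bound would look something
like this - cap both the number of attempts and the elapsed wall-clock time,
so a slow obtain() cannot stretch the wait arbitrarily:

  public boolean obtain(long lockWaitTimeout) throws IOException {
    long deadline = System.currentTimeMillis() + lockWaitTimeout;
    int maxAttempts = (int) (lockWaitTimeout / LOCK_POLL_INTERVAL);
    for (int attempt = 0; attempt <= maxAttempts; attempt++) {
      if (obtain())
        return true;
      if (System.currentTimeMillis() >= deadline)
        break;                   // a slow obtain() already used the budget
      try {
        Thread.sleep(LOCK_POLL_INTERVAL);
      } catch (InterruptedException e) { /* NOOP */ }
    }
    throw new IOException("Lock obtain timed out: " + this);
  }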

[EMAIL PROTECTED] wrote on 21/09/2006 13:05:06:
> On 9/21/06, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > Anyway, my first reaction was to change this to use
> > System.currentTimeMillis() to measure elapsed time, but then I
> > remembered is a dangerous approach because whenever the clock on the
> > machine is updated (eg by a time-sync NTP client) it would mess up
> > this function, causing it to either take longer than was asked for (if
> > clock is moved backwards) or, to timeout in [much] less time than was
> > asked for (if clock was moved forwards).
>
> Um, wow... that's thorough design work!
>
> In this case, I don't think it's something to worry about though.
> NTP corrections are likely to be very small, not on the scale of
> lock-obtain timeouts.
> If one can't obtain a lock, it's due to something else asynchronously
> happening, and that throws a lot bigger time variation into the
> equation anyway.
>
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search
server
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:

Interesting.
I wonder, I have a notification mechanism at my disposal as well. I
wonder if it could be worked out that, much like an MVC, an IndexReader
could be notified when the underlying Directory has changed so that
the reader can adjust itself?


Another little factor is that the IndexWriter must be closed before
the IndexReader is opened to see all the changes.

There is cost to opening and using a new IndexReader such as reading
the term index and the norms.  One would probably want to have some
sort of logic to limit how fast a new IndexReader was opened (which
diminishes the value of realtime updates to the underlying
IndexWriter).

It still should be doable though.
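
A hedged sketch of such throttling (the wrapper class and interval are
invented; IndexReader.getCurrentVersion() and getVersion() are existing
calls):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.store.Directory;

  // Reopen only when the index version changed AND a minimum interval has
  // passed, so the term-index/norms cost is paid at most once per interval.
  class ThrottledReader {
    private static final long MIN_REOPEN_INTERVAL = 1000; // ms, tunable
    private final Directory dir;
    private IndexReader reader;
    private long lastReopen = System.currentTimeMillis();

    ThrottledReader(Directory dir) throws IOException {
      this.dir = dir;
      this.reader = IndexReader.open(dir);
    }

    synchronized IndexReader get() throws IOException {
      long now = System.currentTimeMillis();
      if (now - lastReopen >= MIN_REOPEN_INTERVAL
          && IndexReader.getCurrentVersion(dir) != reader.getVersion()) {
        reader.close();
        reader = IndexReader.open(dir);
        lastReopen = now;
      }
      return reader;
    }
  }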

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> > My guess is that some segment of the world cares a lot about realtime
> > coherent updates and some segment of the world needs blinding speed.
> > Part of my research is to gather the expertise of this group on these
> > issues.
>
> I hear ya...
>
> There is another part to the equation for Lucene though.
> Coherent realtime updates to the IndexWriter/RamDirectory alone
> doesn't get you all the way there since things are only readable
> through an IndexReader that needs to be reopened to see changes.
>
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search server
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Re: Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

I don't know the list server's rules, but I figured I would just include
the text of the file I changed. If that is bad form, give me a heads up
and I won't do it again.

Would this change break anything or bother anyone?

package org.apache.lucene.index;

/**
* Copyright 2004 The Apache Software Foundation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

import java.io.IOException;
import java.util.Vector;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

final class SegmentInfos {

  /** The file format version, a negative number. */
  /* Works since counter, the old 1st entry, is always >= 0 */
  public static final int FORMAT = -1;

  public int counter = 0; // used to name new segments

  private Vector vector = new Vector();

  /**
   * counts how often the index has been changed by adding or deleting docs.
   * starting with the current time in milliseconds forces to create unique
   * version numbers.
   */
  private long version = System.currentTimeMillis();

  public final SegmentInfo info(int i) {
    return (SegmentInfo) vector.elementAt(i);
  }

  public final void read(Directory directory) throws IOException {

    IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
    try {
      int format = input.readInt();
      if (format < 0) { // file contains explicit format info
        // check that it is a format we can understand
        if (format < FORMAT)
          throw new IOException("Unknown format version: " + format);
        version = input.readLong(); // read version
        counter = input.readInt(); // read counter
      } else { // file is in old format without explicit format info
        counter = format;
      }

      for (int i = input.readInt(); i > 0; i--) { // read segmentInfos
        SegmentInfo si = new SegmentInfo(input.readString(),
            input.readInt(), directory);
        vector.addElement(si);
      }

      if (format >= 0) {
        // in old format the version number may be at the end of the file
        if (input.getFilePointer() >= input.length())
          version = System.currentTimeMillis(); // old file format without
                                                // version number
        else
          version = input.readLong(); // read version
      }
    } finally {
      input.close();
    }
  }

  public final void write(Directory directory) throws IOException {
    IndexOutput output = directory.createOutput("segments.new");
    try {
      output.writeInt(FORMAT); // write FORMAT
      output.writeLong(++version); // every write changes the index
      output.writeInt(counter); // write counter
      output.writeInt(size()); // write infos
      for (int i = 0; i < size(); i++) {
        SegmentInfo si = info(i);
        output.writeString(si.name);
        output.writeInt(si.docCount);
      }
    } finally {
      output.close();
    }

    // install new segment info
    directory.renameFile("segments.new", IndexFileNames.SEGMENTS);
  }

  /**
   * version number when this SegmentInfos was generated.
   */
  public long getVersion() {
    return version;
  }

  /**
   * Current version number from segments file.
   */
  public static long readCurrentVersion(Directory directory)
      throws IOException {

    IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
    int

Distributed Indexes, Searches and HDFS

2006-09-21 Thread Chris D

Hi List,

As a bit of an experiment I'm redoing some of our indexing and searching
code to try to make it easier to manage and distribute. The system has to
modify its indexes frequently, sometimes in huge batches, and the documents
in the indexes are frequently modified (deleted, modified and re-added). Just
for scale, we're wanting the system to be capable of searching a terabyte or
so of data.

Currently we have a bunch of index machines indexing to a local file system;
every hour or so they merge to a group of indexes stored on NFS or a similar
common filesystem, and the search nodes retrieve the new indexes and search
on those. The merge can take about as long as it took to originally index
the files, since it has to re-index the "contents" field, because that field
isn't stored.

After reading this thread:
http://www.gossamer-threads.com/lists/lucene/java-user/13803#13803 - there
were several good suggestions, but I'm curious: is there a generally accepted
best practice for distributing Lucene? The cronjob/link solution, which is
quite clean, doesn't work well in a Windows environment. While it's my
favorite, no dice... Rats.

So I decided to experiment with a couple different ideas, and I have some
questions.

1) Indexing and Searching Directly from HDFS

Indexing to HDFS is possible with a patch if we don't use CFS. While not
ideal performance-wise, it's reliable, takes care of data redundancy and
component failure, and means that I can have cheap small drives instead of a
large expensive NAS. It's also quite simple to implement (see Nutch's
indexer.FsDirectory for the Directory implementation).

So I would have several indexes (ie 16) and the same number of indexers, and
a searcher for each index (possibly in the same process) that searches each
one directly from HDFS. One problem I'm having is an occasional
file-not-found exception (probably locking related):

org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open
filename /index/_3.f0
   at org.apache.hadoop.dfs.NameNode.open(NameNode.java:178)
   at sun.reflect.GeneratedMethodAccessor41.invoke (Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call (RPC.java:332)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:468)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:245)

It comes out of the Searcher when I try to do a search while things are
being indexed. I'd be interested to know what exactly is happening when this
exception is thrown, maybe I can design around it. (Do synchronization at
the appropriate times or similar)

2) Index Locally, Search in HDFS

I haven't implemented this but I was thinking something along the lines of
merging every little while and having the searchers refresh after that's
finished. I still have a problem with the merge taking a fairly long time
and if a node fails we lose the documents stored locally in that index.

3) Index HDFS, Search Locally

The system indexes to HDFS and the searchers ask the indexers to pause while
they retrieve the indexes from the store. The indexes are then searched
locally and the indexers continue trucking along. This, in my head, seems to
work alright, at least until the indexes get very large and copying them is
prohibitive. (Is there a Java rsync?) I'll have to investigate how much of a
performance hit indexing to the network actually is. If anyone has any
numbers I would be interested in seeing them.

4) Map/Reduce

I don't know a lot about this and haven't been able to find much on applying
map/reduce to Lucene indexing. Well, except for the Nutch source code, which
is rather difficult to sort through for an overview. So if anyone has a
snippet or a good overview I could look over, I would be grateful. Even if
you can just point at a critical part in Nutch, that would also be quite
helpful.

5) Anything else

I would appreciate any insight anyone has on distributing indexes, either on
list or off.

Many Thanks,
Chris

PS. Sorry if this got double posted. Didn't seem to get through first time.


Re: Re: Re: Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

Oops, I made a change and didn't test it. Doh.
This should work better:

package org.apache.lucene.index;

/**
* Copyright 2004 The Apache Software Foundation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

import java.io.IOException;
import java.util.Vector;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

final class SegmentInfos {

  /** The file format version, a negative number. */
  /* Works since counter, the old 1st entry, is always >= 0 */
  public static final int FORMAT = -1;

  public int counter = 0; // used to name new segments

  private Vector vector = new Vector();

  /**
   * counts how often the index has been changed by adding or deleting docs.
   * starting with the current time in milliseconds forces to create unique
   * version numbers.
   */
  private long version = System.currentTimeMillis();

  public final SegmentInfo info(int i) {
    return (SegmentInfo) vector.elementAt(i);
  }

  public final void read(Directory directory) throws IOException {

    IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
    try {
      int format = input.readInt();
      if (format < 0) { // file contains explicit format info
        // check that it is a format we can understand
        if (format < FORMAT)
          throw new IOException("Unknown format version: " + format);
        version = input.readLong(); // read version
        counter = input.readInt(); // read counter
      } else { // file is in old format without explicit format info
        counter = format;
      }

      for (int i = input.readInt(); i > 0; i--) { // read segmentInfos
        SegmentInfo si = new SegmentInfo(input.readString(),
            input.readInt(), directory);
        vector.addElement(si);
      }

      if (format >= 0) {
        // in old format the version number may be at the end of the file
        if (input.getFilePointer() >= input.length())
          version = System.currentTimeMillis(); // old file format without
                                                // version number
        else
          version = input.readLong(); // read version
      }
    } finally {
      input.close();
    }
  }

  public final void write(Directory directory) throws IOException {
    IndexOutput output = directory.createOutput("segments.new");
    try {
      output.writeInt(FORMAT); // write FORMAT
      output.writeLong(++version); // every write changes the index
      output.writeInt(counter); // write counter
      output.writeInt(size()); // write infos
      for (int i = 0; i < size(); i++) {
        SegmentInfo si = info(i);
        output.writeString(si.name);
        output.writeInt(si.docCount);
      }
    } finally {
      output.close();
    }

    // install new segment info
    directory.renameFile("segments.new", IndexFileNames.SEGMENTS);
  }

  /**
   * version number when this SegmentInfos was generated.
   */
  public long getVersion() {
    return version;
  }

  /**
   * Current version number from segments file.
   */
  public static long readCurrentVersion(Directory directory)
      throws IOException {

    IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
    int format = 0;
    long version = 0;
    try {
      format = input.readInt();
      if

Re: Re: Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

So I clustered this app: I switched to clustering the RAMDirectory
instead of the IndexWriter, and it worked in my experiments. What I did
was create a new IndexWriter on document adds and a new IndexSearcher on
document queries.

What I want to know is: how non-standard is this approach?

Cheers,
Steve
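
For reference, a sketch of the access pattern Steve describes, assuming the
RAMDirectory is the clustered root object and everything else is plain
Lucene (class and method names invented):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.store.RAMDirectory;

  class ClusteredIndex {
    private final RAMDirectory dir = new RAMDirectory(); // shared by the cluster

    ClusteredIndex() throws Exception {
      new IndexWriter(dir, new StandardAnalyzer(), true).close(); // create once
    }

    void add(Document doc) throws Exception {
      // a fresh writer per add; closing it makes the change visible
      IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
      writer.addDocument(doc);
      writer.close();
    }

    Hits search(Query q) throws Exception {
      // a fresh searcher per query sees the latest closed writes
      return new IndexSearcher(dir).search(q);
    }
  }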

On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> Interesting.
> I wonder, I have a notification mechanism at my disposal as well. I
> wonder if it could be worked out that, much like an MVC, an IndexReader
> could be notified when the underlying Directory has changed so that
> the reader can adjust itself?

Another little factor is that the IndexWriter must be closed before
the IndexReader is opened to see all the changes.

There is cost to opening and using a new IndexReader such as reading
the term index and the norms.  One would probably want to have some
sort of logic to limit how fast a new IndexReader was opened (which
diminishes the value of realtime updates to the underlying
IndexWriter).

It still should be doable though.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

> On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> > On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> > > My guess is that some segment of the world cares a lot about realtime
> > > coherent updates and some segment of the world needs blinding speed.
> > > Part of my research is to gather the expertise of this group on these
> > > issues.
> >
> > I hear ya...
> >
> > There is another part to the equation for Lucene though.
> > Coherent realtime updates to the IndexWriter/RamDirectory alone
> > doesn't get you all the way there since things are only readable
> > through an IndexReader that needs to be reopened to see changes.
> >
> >
> > -Yonik
> > http://incubator.apache.org/solr Solr, the open-source Lucene search server
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Distributed Indexes, Searches and HDFS

2006-09-21 Thread Yonik Seeley

On 9/21/06, Chris D <[EMAIL PROTECTED]> wrote:

The cronjob/link solution, which is
quite clean, doesn't work well in a Windows environment. While it's my
favorite, no dice... Rats.


There may be hope yet for that on Windows.
Hard links work on Windows, but the only problem is that you can't
rename/delete any links when the file is open.  Michael McCandless is
working on a patch that would eliminate all renames (and deletes can
be handled by deferring them).

http://www.nabble.com/Re%3A--Solr-Wiki--Update-of-%22TaskList%22-by-YonikSeeley-tf2081816.html#a5736265
http://www.nabble.com/-jira--Created%3A-%28LUCENE-665%29-temporary-file-access-denied-on-Windows-tf2167540.html#a6295771


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]