Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-04 Thread Marvin Humphrey


On Apr 3, 2007, at 9:52 AM, Michael McCandless wrote:


Yonik Seeley [EMAIL PROTECTED] wrote:

Wow, very nice results Mike!


Thanks :)  I'm just praying I don't have some sneaky bug making
the results far better than they really are!!


That's possible, but I'm confident that the model you're using is  
capable of the gains you're seeing.  When I benched KinoSearch a year  
ago against Lucene, KS was getting close, but was still a little  
behind... http://www.rectangular.com/kinosearch/benchmarks.html


(: Ironically, the numbers for Lucene on that page are a little  
better than they should be because of a sneaky bug.  I would have  
made updating the results a priority if they'd gone the other way.  :)


... However, Lucene has been tuned by an army of developers over the  
years, while KS is young yet and still had many opportunities for  
optimization.  Current svn trunk for KS is about twice as fast for  
indexing as when I did those benchmarking tests.


I look forward to studying your patch in detail at some point to see  
what you've done differently.  It sounds like you only familiarized  
yourself with the high-level details of how KS has been working,  
yes?  Hopefully, you misunderstood and came up with something better. ;)


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Created: (LUCENE-851) Pruning

2007-04-04 Thread Marvin Humphrey


On Mar 29, 2007, at 7:44 PM, Ning Li wrote:


If a query requires top-K results, isn't it
sufficient to find top-K results in each segment and merge them to
return the overall top-K results?


They are merged by collecting them into a HitQueue.


Early termination happens in
finding top-K results in one segment. Assuming each document has a
static score, document ids are assigned in the same order of their
static scores within a segment. If a top-K query is scored by the same
static score, query processing on a segment can stop as soon as the
first K results are found.


Indeed, that's exactly how the loop in Scorer_collect() works.
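The per-segment early-termination scheme can be sketched independently of Lucene or KS internals (a toy model; the class, method, and arguments here are illustrative, not Scorer_collect()'s actual shape):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of per-segment early termination.  Assumes doc ids were
// assigned in descending static-score order at index time, so the first
// k matches encountered in doc-id order are the segment's top k.
class EarlyTermination {
    /** Collect up to k matching doc ids from one segment, stopping early. */
    static List<Integer> topK(boolean[] matches, int k) {
        List<Integer> hits = new ArrayList<>();
        for (int docId = 0; docId < matches.length && hits.size() < k; docId++) {
            if (matches[docId]) {
                hits.add(docId);  // lower doc id implies higher static score
            }
        }
        return hits;              // the scan stops as soon as k hits are found
    }
}
```

Merging across segments is then just collecting each segment's (at most) k hits into one hit queue and keeping the overall top k.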


As to the indexing side, applications should be able to pick such a
static score?  If the Lucene score function is used, is the norm a good
candidate?  (One tricky thing with the norm is that it is updatable.)


I would argue that only a single mechanism based on indexed,
non-tokenized fields should be used to determine sort order.  Sort order
based upon norms is easy for the user to fake using a dedicated field
at a small cost, so library-level support is not needed.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-04 Thread Michael McCandless
Marvin Humphrey [EMAIL PROTECTED] wrote:
 
 On Apr 3, 2007, at 9:52 AM, Michael McCandless wrote:
 
  Yonik Seeley [EMAIL PROTECTED] wrote:
  Wow, very nice results Mike!
 
  Thanks :)  I'm just praying I don't have some sneaky bug making
  the results far better than they really are!!
 
 That's possible, but I'm confident that the model you're using is  
 capable of the gains you're seeing.  When I benched KinoSearch a year  
 ago against Lucene, KS was getting close, but was still a little  
 behind... http://www.rectangular.com/kinosearch/benchmarks.html

OK, glad to hear that :)  I *think* I don't have such bugs.

 (: Ironically, the numbers for Lucene on that page are a little  
 better than they should be because of a sneaky bug.  I would have  
 made updating the results a priority if they'd gone the other way.  :)

Hrm.  It would be nice to have a hard comparison of Lucene, KS (and
Ferret, and others?).
 
 ... However, Lucene has been tuned by an army of developers over the  
 years, while KS is young yet and still had many opportunities for  
 optimization.  Current svn trunk for KS is about twice as fast for  
 indexing as when I did those benchmarking tests.

Wow, that's an awesome speedup!  So KS is faster than Lucene today?

 I look forward to studying your patch in detail at some point to see  
 what you've done differently.  It sounds like you only familiarized  
 yourself with the high-level details of how KS has been working,  
 yes?  Hopefully, you misunderstood and came up with something better. ;)

Exactly!  I very carefully didn't look closely at how KS does
indexing.  I did read your posts on this list and did read the Wiki
page and I think a few other pages describing KS's merge model but
stopped there.  We can compare our approaches in detail at some point
and then cross-fertilize :)

Mike




[jira] Updated: (LUCENE-853) Caching does not work when using RMI

2007-04-04 Thread Matt Ericson (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Ericson updated LUCENE-853:


Attachment: RemoteCachingWrapperFilter.patch

A new version that will hopefully patch more correctly


 Caching does not work when using RMI
 

 Key: LUCENE-853
 URL: https://issues.apache.org/jira/browse/LUCENE-853
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 2.1
 Environment: All 
Reporter: Matt Ericson
Priority: Minor
 Attachments: RemoteCachingWrapperFilter.patch, 
 RemoteCachingWrapperFilter.patch, RemoteCachingWrapperFilter.patch .patch


 Filters and caching use transient maps, so caching does not work if you 
 are using RMI and a remote searcher. 
 I want to add a new RemoteCachedFilter that will make sure that the caching 
 is done on the remote searcher side.
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-04 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486758
 ] 

Otis Gospodnetic commented on LUCENE-855:
-

A colleague of mine is working on something similar, but possibly more 
efficient (less sorting and binary searching).  He'll probably attach his patch 
to this issue.


 MemoryCachedRangeFilter to boost performance of Range queries
 -

 Key: LUCENE-855
 URL: https://issues.apache.org/jira/browse/LUCENE-855
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.1
Reporter: Andy Liu
 Attachments: MemoryCachedRangeFilter.patch


 Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
 within the specified range.  This requires iterating through every single 
 term in the index and can get rather slow for large document sets.
 MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, 
 sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
 searches are used to find the start and end indices of the lower and upper 
 bound values.  The BitSet is populated by all the docId values that fall in 
 between the start and end indices.
 TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
 index with random date values within a 5 year range.  Executing bits() 1000 
 times on standard RangeQuery using random date intervals took 63904ms.  Using 
 MemoryCachedRangeFilter, it took 876ms.  The performance increase is less 
 dramatic when a field has fewer unique terms or when the index contains 
 fewer documents.
 Currently MemoryCachedRangeFilter only works with numeric values (values are 
 stored in a long[] array) but it can be easily changed to support Strings.  A 
 side benefit of storing the values as longs is that there's no longer any 
 need to make the values lexicographically comparable, i.e. by padding 
 numeric values with zeros.
 The downside of using MemoryCachedRangeFilter is there's a fairly significant 
 memory requirement.  So it's designed to be used in situations where range 
 filter performance is critical and memory consumption is not an issue.  The 
 memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
 MemoryCachedRangeFilter also requires a warmup step which can take a while to 
 run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
 can be called explicitly or is automatically called the first time 
 MemoryCachedRangeFilter is applied using a given field.
 So in summary, MemoryCachedRangeFilter can be useful when:
 - Performance is critical
 - Memory is not an issue
 - Field contains many unique numeric values
 Index contains a large number of documents
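The bits() mechanics described above can be sketched in plain Java (a sketch only; the class and method names are illustrative, not taken from the patch): pairs are pre-sorted by value once, then each call is two binary searches plus a scan over the matching slice.

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative sketch of a sorted per-field cache for range filtering.
class SortedFieldCache {
    final long[] values;  // field values, sorted ascending
    final int[] docIds;   // docIds[i] is the doc holding values[i]

    SortedFieldCache(long[] sortedValues, int[] docIdsByValue) {
        this.values = sortedValues;
        this.docIds = docIdsByValue;
    }

    /** BitSet of docs whose value falls in [lower, upper], inclusive. */
    BitSet bits(long lower, long upper, int numDocs) {
        BitSet result = new BitSet(numDocs);
        int lo = lowerBound(values, lower);      // first index >= lower
        int hi = lowerBound(values, upper + 1);  // first index > upper
                                                 // (assumes upper < Long.MAX_VALUE)
        for (int i = lo; i < hi; i++) {
            result.set(docIds[i]);
        }
        return result;
    }

    private static int lowerBound(long[] a, long key) {
        int i = Arrays.binarySearch(a, key);
        if (i < 0) return -i - 1;              // negative result encodes insertion point
        while (i > 0 && a[i - 1] == key) i--;  // step back to the first duplicate
        return i;
    }
}
```

Plugging in the stated memory formula: for the 3M-document corpus mentioned above, the cache costs roughly (4 + 8) bytes x 3M, about 36 MB per field.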




[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-04 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486763
 ] 

Yonik Seeley commented on LUCENE-855:
-

There is also something from Mark Harwood:
https://issues.apache.org/jira/browse/LUCENE-798




Re: publish to maven-repository

2007-04-04 Thread Otis Gospodnetic
Eh, missing Jars in the Maven repo again.  Why does this always get dropped?

I can push the Jars out, but I see we have no Maven POMs, or do we?
I can create one for 2.1.0 based on 
http://repo1.maven.org/maven2/org/apache/lucene/lucene-core/2.0.0/lucene-core-2.0.0.pom
 , but where should we keep those?
Perhaps it's time to keep a lucene-core.pom in our repo, rename it at release 
time (e.g. cp lucene-core.pom lucene-core-2.1.0.pom) and push the core jar + 
core POM out?

Thoughts?

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Joerg Hohwiller [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Tuesday, April 3, 2007 4:49:15 PM
Subject: publish to maven-repository


Hi there,

I will give it another try:

Could you please publish lucene 2.* artifacts (including contribs) to the maven2
repository at ibiblio?

Currently there is only the lucene-core available up to version 2.0.0:
http://repo1.maven.org/maven2/org/apache/lucene/

JARs and POMs go to:
scp://people.apache.org/www/www.apache.org/dist/maven-repository

If you need assistance I am pleased to help.
But I am not an official apache member and do NOT have access to do the
deployment myself.

Thank you so much...
  Jörg




[jira] Commented: (LUCENE-853) Caching does not work when using RMI

2007-04-04 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486764
 ] 

Otis Gospodnetic commented on LUCENE-853:
-

Nice.  Unit tests pass and caching seems to work.
I'll make some small javadoc and cosmetic fixes, upload the prettified patch 
and commit on Friday.

This will give others two more days to review your changes and raise any 
issues they may see.





[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-04 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486767
 ] 

Andy Liu commented on LUCENE-855:
-

Otis, looking forward to your colleague's patch.

LUCENE-798 caches RangeFilters so that if the same exact range is executed 
again, the cached RangeFilter is used.  However, the first time a range is 
encountered, you'll still have to calculate the RangeFilter, which can be slow. 
 I haven't looked at the patch, but I'm sure LUCENE-798 can be used in 
conjunction with MemoryCachedRangeFilter to further boost performance for 
repeated range queries.




[jira] Commented: (LUCENE-798) Factory for RangeFilters that caches sections of ranges to reduce disk reads

2007-04-04 Thread Matt Ericson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486768
 ] 

Matt Ericson commented on LUCENE-798:
-

I am working on a patch that will use the Field cache to do range queries.

The bit sets will be proxies to the field cache.  This way the data is stored in 
the field cache, and if you change the limits of your range it will just need a 
new proxy BitSet.
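A minimal sketch of that proxy idea (the class name is hypothetical; a real patch would implement Lucene's Filter contract): membership is answered on the fly from the shared per-field cache, so changing the bounds only needs a new lightweight view, not a new cache or a materialized BitSet.

```java
// Hypothetical proxy "bit set" backed by a shared per-field value cache.
class RangeBitSetProxy {
    final long[] valueByDoc;  // shared field cache: valueByDoc[docId]
    final long lower, upper;  // inclusive bounds for this view

    RangeBitSetProxy(long[] valueByDoc, long lower, long upper) {
        this.valueByDoc = valueByDoc;
        this.lower = lower;
        this.upper = upper;
    }

    /** True if docId's cached value falls within this view's range. */
    boolean get(int docId) {
        long v = valueByDoc[docId];
        return v >= lower && v <= upper;
    }
}
```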

 Factory for RangeFilters that caches sections of ranges to reduce disk reads
 

 Key: LUCENE-798
 URL: https://issues.apache.org/jira/browse/LUCENE-798
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Reporter: Mark Harwood
 Attachments: CachedRangesFilterFactory.java


 RangeFilters can be cached using CachingWrapperFilter but are only re-used if 
 a user happens to use *exactly* the same upper/lower bounds.
 This class demonstrates a caching approach where *sections* of ranges are 
 cached as bitsets and these are re-used/combined to construct large range 
 filters if they fall within the required range. This can improve the cache 
 hit ratio and avoid going to disk to read large lists of Doc ids from 
 TermDocs.
 This class needs some more work to add thread safety, but I'm making it 
 available to gather feedback on the design at this early stage before making 
 it robust.




Re: publish to maven-repository

2007-04-04 Thread Erik Hatcher


On Apr 4, 2007, at 4:33 PM, Otis Gospodnetic wrote:
Eh, missing Jars in the Maven repo again.  Why does this always get  
dropped?


Because none of us Lucene committers care much about Maven?  :)

Perhaps it's time to keep a lucene-core.pom in our repo, rename it  
at release time (e.g. cp lucene-core.pom lucene-core-2.1.0.pom) and  
push the core jar + core POM out?


I don't know the Maven specifics, but I'm all for us maintaining the  
Maven POM file and bundling it with releases that get pushed to the  
repos.


Erik





[jira] Created: (LUCENE-856) Optimize segment merging

2007-04-04 Thread Michael McCandless (JIRA)
Optimize segment merging


 Key: LUCENE-856
 URL: https://issues.apache.org/jira/browse/LUCENE-856
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.1
Reporter: Michael McCandless
 Assigned To: Michael McCandless
Priority: Minor


With LUCENE-843, the time spent indexing documents has been
substantially reduced and now the time spent merging is a sizable
portion of indexing time.

I ran a test using the patch for LUCENE-843, building an index of 10
million docs, each with ~5,500 bytes of plain text, with term vectors
(positions + offsets) on and with 2 small stored fields per document.
RAM buffer size was 32 MB.  I didn't optimize the index in the end,
though optimize speed would also improve if we optimize segment
merging.  Index size is 86 GB.

Total time to build the index was 8 hrs 38 minutes, 5 hrs 40 minutes
of which was spent merging.  That's 65.6% of the time!

Most of this time is presumably IO which probably can't be reduced
much unless we improve overall merge policy and experiment with values
for mergeFactor / buffer size.

These tests were run on a Mac Pro with 2 dual-core Intel CPUs.  The IO
system is RAID 0 of 4 drives, so, these times are probably better than
the more common case of a single hard drive which would likely be
slower IO.

I think there are some simple things we could do to speed up merging:

  * Experiment with buffer sizes -- maybe larger buffers for the
IndexInputs used during merging could help?  Because at a default
mergeFactor of 10, the disk heads must do a lot of seeking back and
forth between these 10 files (and then to the 11th file where we
are writing).

  * Use byte copying when possible, e.g. if there are no deletions on a
segment we can almost (I think?) just copy things like prox
postings, stored fields, term vectors, instead of fully parsing to
Java objects and then re-serializing them.

  * Experiment with mergeFactor / different merge policies.  For
example I think LUCENE-854 would reduce time spent merging for a
given index size.

This is currently just a place to list ideas for optimizing segment
merges.  I don't plan on working on this until after LUCENE-843.

Note that for autoCommit=false, this optimization is somewhat less
important, depending on how often you actually close/open a new
IndexWriter.  In the extreme case, if you open a writer, add 100 MM
docs, close the writer, then no segment merges happen at all.
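The byte-copying idea above is, at heart, a buffered raw copy; in plain java.io terms it looks like the sketch below (Lucene's IndexInput/IndexOutput play the roles of the streams here; this is not the actual merge code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// When a segment has no deletions, its postings bytes could be appended
// verbatim with a buffered copy rather than decoded into objects and
// re-serialized.
class BulkMerge {
    /** Copy all bytes from in to out; returns the number of bytes copied. */
    static long copy(InputStream in, ByteArrayOutputStream out) throws IOException {
        byte[] buffer = new byte[8192];  // a larger buffer reduces seek overhead
        long copied = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            copied += n;
        }
        return copied;
    }
}
```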





[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-04 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486788
 ] 

Yonik Seeley commented on LUCENE-855:
-

 LUCENE-798 caches RangeFilters so that if the same exact range is executed 
 again [...]

It's not just the exact same range though... it can reuse parts of ranges AFAIK.






Caching in QueryFilter - why?

2007-04-04 Thread Otis Gospodnetic
Hi,

I'm looking at LUCENE-853, so I also looked at CachingWrapperFilter and then at 
QueryFilter.  I noticed QueryFilter does its own BitSet caching, and the 
caching part of its code is nearly identical to the code in 
CachingWrapperFilter.

Why is that?  Is there a good reason for that?
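The caching part both classes duplicate boils down to this pattern (a reduced sketch, not the actual Lucene code): memoize one BitSet per index reader in a transient WeakHashMap, computing it on first use.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// Generic per-reader BitSet cache.  The reader type is a type parameter
// here only to keep the sketch free of Lucene dependencies.
class CachingFilterWrapper<R> {
    private final Function<R, BitSet> compute;          // the wrapped filter logic
    private final Map<R, BitSet> cache = new WeakHashMap<>();

    CachingFilterWrapper(Function<R, BitSet> compute) {
        this.compute = compute;
    }

    synchronized BitSet bits(R reader) {
        return cache.computeIfAbsent(reader, compute);  // reuse per reader
    }
}
```

If QueryFilter delegated to a shared wrapper along these lines, the near-identical code would collapse into one implementation.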

Thanks,
Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share






[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-04 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486791
 ] 

Andy Liu commented on LUCENE-855:
-

Ah, you're right.  I didn't read closely enough!




[jira] Resolved: (LUCENE-619) Lucene 1.9.1 and 2.0.0 Maven 2 packages are incorrectly deployed

2007-04-04 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved LUCENE-619.
-

Resolution: Fixed

That Jar is still invalid (2.3K).
However, if anyone is going to be upgrading to the newer version of Lucene, 
they'll go straight to Lucene 2.0.0, or 2.1.0, not 1.9.1, so I'll mark this as 
Won't Fix.

The Jars for Lucene 2.0.0 are good - see LUCENE-734.
We still need to push 2.1.0 jars + POMs, though.


 Lucene 1.9.1 and 2.0.0 Maven 2 packages are incorrectly deployed
 

 Key: LUCENE-619
 URL: https://issues.apache.org/jira/browse/LUCENE-619
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 1.9, 2.0.0
 Environment: 
 http://www.ibiblio.org/maven2/org/apache/lucene/lucene-core/
Reporter: Jordan Christensen

 The lucene JARs at the URL listed in the Environment field only contain the 
 maven 2 POMs, and not the actual compiled classes. The correct JARs need to 
 be uploaded so that Lucene 1.9.1 and 2.0 can work in Maven 2.
 This was listed as fixed in http://issues.apache.org/jira/browse/LUCENE-551, 
 but was not properly done. The JARs in the Apache Maven repo are incorrect as 
 well. 
 (http://www.apache.org/dist/maven-repository/org/apache/lucene/lucene-core/)
 This issue was raised and confirmed on the mailing list as well: 
 http://www.gossamer-threads.com/lists/lucene/java-user/37169

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Reopened: (LUCENE-619) Lucene 1.9.1 and 2.0.0 Maven 2 packages are incorrectly deployed

2007-04-04 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic reopened LUCENE-619:
-


Eh, I said Won't Fix, not Fixed.  Reopening...


 Lucene 1.9.1 and 2.0.0 Maven 2 packages are incorrectly deployed
 

 Key: LUCENE-619
 URL: https://issues.apache.org/jira/browse/LUCENE-619
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 1.9, 2.0.0
 Environment: 
 http://www.ibiblio.org/maven2/org/apache/lucene/lucene-core/
Reporter: Jordan Christensen


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-619) Lucene 1.9.1 and 2.0.0 Maven 2 packages are incorrectly deployed

2007-04-04 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved LUCENE-619.
-

Resolution: Won't Fix

 Lucene 1.9.1 and 2.0.0 Maven 2 packages are incorrectly deployed
 

 Key: LUCENE-619
 URL: https://issues.apache.org/jira/browse/LUCENE-619
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 1.9, 2.0.0
 Environment: 
 http://www.ibiblio.org/maven2/org/apache/lucene/lucene-core/
Reporter: Jordan Christensen


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-622) Provide More of Lucene For Maven

2007-04-04 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-622:


Attachment: lucene-core.pom

Here is the POM for Maven boys and girls who want lucene-core.

Stephen:
What would the contrib POM look like?
I don't think we'd have 1 POM, because each project in Lucene contrib is a 
separate project and a separate jar with its own dependencies.  But maybe one 
can construct a single POM for the whole Lucene contrib - I haven't touched 
Maven in a few years.


 Provide More of Lucene For Maven
 

 Key: LUCENE-622
 URL: https://issues.apache.org/jira/browse/LUCENE-622
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.0.0
Reporter: Stephen Duncan Jr
 Attachments: lucene-core.pom


 Please provide javadoc & source jars for lucene-core.  Also, please provide 
 the rest of Lucene (the jars inside of contrib in the download bundle) if 
 possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-853) Caching does not work when using RMI

2007-04-04 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-853:


Lucene Fields: [New, Patch Available]  (was: [New])

 Caching does not work when using RMI
 

 Key: LUCENE-853
 URL: https://issues.apache.org/jira/browse/LUCENE-853
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 2.1
 Environment: All 
Reporter: Matt Ericson
Priority: Minor
 Attachments: RemoteCachingWrapperFilter.patch, 
 RemoteCachingWrapperFilter.patch, RemoteCachingWrapperFilter.patch, 
 RemoteCachingWrapperFilter.patch .patch


 Filters use transient maps for caching, so caching does not work if you 
 are using RMI and a remote searcher. 
 I want to add a new RemoteCachingWrapperFilter that will make sure that the 
 caching is done on the remote searcher side.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-853) Caching does not work when using RMI

2007-04-04 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-853:


Attachment: RemoteCachingWrapperFilter.patch

Here is a cleaned up version.
- Changed CachingWrapperFilter's private vars to protected, so 
CachingWrapperFilterHelper can extend it
- Expanded unit tests to be more convincing
- Javadocs all fixed up + cosmetics + code comments

n.b.
The @todo in CachingWrapperFilter can go now:

  /**
   * @todo What about serialization in RemoteSearchable?  Caching won't work.
   *       Should transient be removed?
   */
  protected transient Map cache;

We keep the transient, and if you want remote caching, use 
RemoteCachingWrapperFilter.
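Why the transient field matters here can be shown with plain Java serialization, which is essentially what RMI does to arguments under the hood. TransientDemo below is a toy stand-in, not the real CachingWrapperFilter:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Toy filter with a transient cache, mimicking the CachingWrapperFilter field.
public class TransientDemo implements Serializable {

    protected transient Map cache = new HashMap();

    // Serialize and deserialize, which is roughly what RMI does to arguments.
    static TransientDemo roundTrip(TransientDemo f) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bos);
        out.writeObject(f);
        out.close();
        ObjectInputStream in =
            new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        return (TransientDemo) in.readObject();
    }

    public static void main(String[] args) throws Exception {
        TransientDemo local = new TransientDemo();
        local.cache.put("reader1", "cached bits");
        TransientDemo remote = roundTrip(local);
        System.out.println(remote.cache == null); // true: the cache did not travel
    }
}
```

Since transient fields come back null after deserialization, any cache built on the client side is lost; caching on the server side (the RemoteCachingWrapperFilter approach) sidesteps this, and also avoids shipping large BitSets across the wire.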


I'll commit on Friday.


 Caching does not work when using RMI
 

 Key: LUCENE-853
 URL: https://issues.apache.org/jira/browse/LUCENE-853
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 2.1
 Environment: All 
Reporter: Matt Ericson
Priority: Minor
 Attachments: RemoteCachingWrapperFilter.patch, 
 RemoteCachingWrapperFilter.patch, RemoteCachingWrapperFilter.patch, 
 RemoteCachingWrapperFilter.patch .patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-622) Provide More of Lucene For Maven

2007-04-04 Thread Stephen Duncan Jr (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486805
 ] 

Stephen Duncan Jr commented on LUCENE-622:
--

Because they are separate projects  jars, they would each have their own POM. 

 Provide More of Lucene For Maven
 

 Key: LUCENE-622
 URL: https://issues.apache.org/jira/browse/LUCENE-622
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.0.0
Reporter: Stephen Duncan Jr
 Attachments: lucene-core.pom



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-622) Provide More of Lucene For Maven

2007-04-04 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486807
 ] 

Otis Gospodnetic commented on LUCENE-622:
-

Right, that's what I was trying to say.
Can you provide POMs for contrib projects, or maybe just the ones that you 
use/need?


 Provide More of Lucene For Maven
 

 Key: LUCENE-622
 URL: https://issues.apache.org/jira/browse/LUCENE-622
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.0.0
Reporter: Stephen Duncan Jr
 Attachments: lucene-core.pom



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: publish to maven-repository

2007-04-04 Thread Otis Gospodnetic
Jörg,
Since you offered to help - please see 
https://issues.apache.org/jira/browse/LUCENE-622 .  lucene-core POM is there 
for 2.1.0, but if you need POMs for contrib/*, please attach them to that 
issue.  We have Jars, obviously, so we just need to copy those.  Then we'll 
need .sha1 and .md5 files for all pushed Jars.  One of the other developers 
will have to do that, as I don't have my PGP set up, and hence no key for the 
KEYS file (if that's needed for the .sha1).
 
Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Joerg Hohwiller [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Tuesday, April 3, 2007 4:49:15 PM
Subject: publish to maven-repository

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi there,

I will give it another try:

Could you please publish lucene 2.* artifacts (including contribs) to the maven2
repository at ibiblio?

Currently there is only the lucene-core available up to version 2.0.0:
http://repo1.maven.org/maven2/org/apache/lucene/

JARs and POMs go to:
scp://people.apache.org/www/www.apache.org/dist/maven-repository

If you need assistance I am pleased to help.
But I am not an official apache member and do NOT have access to do the
deployment myself.

Thank you so much...
  Jörg
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGEr3LmPuec2Dcv/8RAh1sAJ9m3qs7upNGJTgie5tNeAFKZenBowCgjufY
uB1/RNnI4wB3dviKy0w7XEs=
=llLh
-END PGP SIGNATURE-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-622) Provide More of Lucene For Maven

2007-04-04 Thread Stephen Duncan Jr (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486809
 ] 

Stephen Duncan Jr commented on LUCENE-622:
--

I'm no longer doing any work with Lucene, and I'm not even sure which contrib 
project I wanted at the time I filed this request.  While I'm sure that having 
poms for the contrib releases would be helpful to many people using Maven, this 
is no longer something that's a priority for me.

 Provide More of Lucene For Maven
 

 Key: LUCENE-622
 URL: https://issues.apache.org/jira/browse/LUCENE-622
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.0.0
Reporter: Stephen Duncan Jr
 Attachments: lucene-core.pom



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Caching in QueryFilter - why?

2007-04-04 Thread Erik Hatcher
CachingWrapperFilter came along after QueryFilter.  I think I added  
CachingWrapperFilter when I realized that every Filter should have  
the capability to be cached without having to implement it.  So, the  
only reason is legacy.  I'm perfectly fine with removing the  
caching from QueryFilter in a future major release.


Erik

On Apr 4, 2007, at 5:57 PM, Otis Gospodnetic wrote:


Hi,

I'm looking at LUCENE-853, so I also looked at CachingWrapperFilter  
and then at QueryFilter.  I noticed QueryFilter does its own BitSet  
caching, and the caching part of its code is nearly identical to  
the code in CachingWrapperFilter.


Why is that?  Is there a good reason for that?

Thanks,
Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Created: (LUCENE-856) Optimize segment merging

2007-04-04 Thread Ning Li

On 4/4/07, Michael McCandless (JIRA) [EMAIL PROTECTED] wrote:

Note that for autoCommit=false, this optimization is somewhat less
important, depending on how often you actually close/open a new
IndexWriter.  In the extreme case, if you open a writer, add 100 MM
docs, close the writer, then no segment merges happen at all.


I think in the current code, the merge behavior for autoCommit=false
is the same as that for autoCommit=true, isn't it?

Cheers,
Ning

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-636) [PATCH] Differently configured Lucene 'instances' in same JVM

2007-04-04 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved LUCENE-636.
-

Resolution: Won't Fix

We've moved away from using system properties.  I think there are only a couple 
of places in the code that still refer to system properties, and those are, I 
believe, deprecated:

$ ffjg System.getProp
./org/apache/lucene/analysis/standard/ParseException.java:  protected String 
eol = System.getProperty("line.separator", "\n");
./org/apache/lucene/index/SegmentReader.java:
System.getProperty("org.apache.lucene.SegmentReader.class",
./org/apache/lucene/queryParser/ParseException.java:  protected String eol = 
System.getProperty("line.separator", "\n");
./org/apache/lucene/store/FSDirectory.java:  public static final String 
LOCK_DIR = System.getProperty("org.apache.lucene.lockDir",
./org/apache/lucene/store/FSDirectory.java: 
  System.getProperty("java.io.tmpdir"));
./org/apache/lucene/store/FSDirectory.java:
System.getProperty("org.apache.lucene.FSDirectory.class",
./org/apache/lucene/store/FSDirectory.java:    String lockClassName = 
System.getProperty("org.apache.lucene.store.FSDirectoryLockFactoryClass");
./org/apache/lucene/util/Constants.java:  /** The value of 
<tt>System.getProperty("java.version")</tt>. **/
./org/apache/lucene/util/Constants.java:  public static final String 
JAVA_VERSION = System.getProperty("java.version");
./org/apache/lucene/util/Constants.java:  /** The value of 
<tt>System.getProperty("os.name")</tt>. **/
./org/apache/lucene/util/Constants.java:  public static final String OS_NAME = 
System.getProperty("os.name");


I'll close this as Won't Fix.
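The per-instance alternative the patch proposed, versus a JVM-wide system property, can be sketched like this. The Config class and its lockDir setting are hypothetical; the point is only that two "instances" in one JVM can hold different settings, which a single System.getProperty value cannot express:

```java
// Hypothetical sketch: per-instance configuration versus a JVM-wide property.
public class ConfigDemo {

    static class Config {
        private final String lockDir;
        Config(String lockDir) { this.lockDir = lockDir; }
        String getLockDir() { return lockDir; }
    }

    // Old style: a single JVM-wide setting shared by every Lucene "instance".
    static String globalLockDir() {
        return System.getProperty("org.apache.lucene.lockDir", "/tmp");
    }

    public static void main(String[] args) {
        // Two instances in the same JVM, each with its own setting:
        Config appA = new Config("/var/appA/locks");
        Config appB = new Config("/var/appB/locks");
        System.out.println(appA.getLockDir().equals(appB.getLockDir())); // false
    }
}
```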

 [PATCH] Differently configured Lucene 'instances' in same JVM
 -

 Key: LUCENE-636
 URL: https://issues.apache.org/jira/browse/LUCENE-636
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Johan Stuyts
 Attachments: Lucene2DifferentConfigurations.patch


 Currently Lucene can be configured using system properties. When running 
 multiple 'instances' of Lucene for different purposes in the same JVM, it is 
 not possible to use different settings for each 'instance'.
 I made changes to some Lucene classes so you can pass a configuration to that 
 class. The Lucene 'instance' will use the settings from that configuration. 
 The changes do not affect the API and/or the current behavior, so they are 
 backwards compatible.
 In addition to the changes above I also made the SegmentReader and 
 SegmentTermDocs extensible outside of their package. I would appreciate the 
 inclusion of these changes but don't mind creating a separate issue for them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-789) Custom similarity is ignored when using MultiSearcher

2007-04-04 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486835
 ] 

Otis Gospodnetic commented on LUCENE-789:
-

Alexey, the best way to start with this, and the way that will help get this 
fixed in Lucene core is to write a unit test class that does what your code 
does with MultiSearcher and BooleanQuery, and shows that the test fails when a 
custom Similarity class is used.  You can make that custom Similarity an inner 
class in your unit test class, to contain everything neatly in a single class.

Once we see the test failing we can apply your suggested fix and see if that 
works, if your previously broken unit test now passes, and if all other unit 
tests still pass.
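The shape of the reported bug, and of the suggested one-line fix, can be illustrated with self-contained toy classes. These are not the real Lucene Searcher/MultiSearcher/CachedDfSource; they only show a wrapper forgetting to copy its configured Similarity onto a helper object it creates:

```java
// Toy illustration of the bug shape: a delegating searcher that forgets to
// propagate its configured Similarity.  Not the real Lucene API.
public class SimilarityDemo {

    static class Similarity { float tf(int f) { return (float) Math.sqrt(f); } }
    static class NoTfSimilarity extends Similarity { float tf(int f) { return 1f; } }

    static class Searcher {
        private Similarity sim = new Similarity();  // default, like DefaultSimilarity
        void setSimilarity(Similarity s) { sim = s; }
        Similarity getSimilarity() { return sim; }
    }

    static class MultiSearcher extends Searcher {
        // Buggy shape: the helper keeps the default Similarity.
        Searcher createCachedSourceBuggy() { return new Searcher(); }

        // Fixed shape: propagate the configured Similarity, as the suggested
        // cacheSim.setSimilarity(getSimilarity()) line does.
        Searcher createCachedSourceFixed() {
            Searcher cacheSim = new Searcher();
            cacheSim.setSimilarity(getSimilarity());
            return cacheSim;
        }
    }

    public static void main(String[] args) {
        MultiSearcher ms = new MultiSearcher();
        ms.setSimilarity(new NoTfSimilarity());
        System.out.println(ms.createCachedSourceBuggy().getSimilarity()
                           instanceof NoTfSimilarity);  // false: custom sim lost
        System.out.println(ms.createCachedSourceFixed().getSimilarity()
                           instanceof NoTfSimilarity);  // true
    }
}
```

A real unit test would do the same check against MultiSearcher with an inner custom Similarity class, as suggested above.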


 Custom similarity is ignored when using MultiSearcher
 -

 Key: LUCENE-789
 URL: https://issues.apache.org/jira/browse/LUCENE-789
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.0.1
Reporter: Alexey Lef

 Symptoms:
 I am using Searcher.setSimilarity() to provide a custom similarity that turns 
 off tf() factor. However, somewhere along the way the custom similarity is 
 ignored and the DefaultSimilarity is used. I am using MultiSearcher and 
 BooleanQuery.
 Problem analysis:
 The problem seems to be in MultiSearcher.createWeight(Query) method. It 
 creates an instance of CachedDfSource but does not set the similarity. As the 
 result CachedDfSource provides DefaultSimilarity to queries that use it.
 Potential solution:
 Adding the following line:
 cacheSim.setSimilarity(getSimilarity());
 after creating an instance of CacheDfSource (line 312) seems to fix the 
 problem. However, I don't understand enough of the inner workings of this 
 class to be absolutely sure that this is the right thing to do.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-04-04 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486837
 ] 

Karl Wettin commented on LUCENE-848:


 Karl, it looks like your stuff grabs individual articles, right? I'm going to 
 have it download the bzip2 snapshots they provide (and that they prefer you 
 use, if you're getting much). 

They also supply the rendered HTML every now and then. It should be enough to 
change the URL pattern to file:///tmp/wikipedia/. I was considering porting the 
MediaWiki BNF as a tokenizer, but found it much simpler to just parse the HTML.

 Add supported for Wikipedia English as a corpus in the benchmarker stuff
 

 Key: LUCENE-848
 URL: https://issues.apache.org/jira/browse/LUCENE-848
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/benchmark
Reporter: Steven Parkes
 Assigned To: Steven Parkes
Priority: Minor
 Fix For: 2.2

 Attachments: WikipediaHarvester.java


 Add support for using Wikipedia for benchmarking.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene and Javolution: A good mix ?

2007-04-04 Thread robert engels
I would suggest that the Javolution folks do their tests against a  
modern JVM...


I have followed the Javolution project for some time, and while I  
agree that some of the techniques should improve things, I think that  
modern JVMs do most of this work for you (and the latest class  
libraries also help - StringBuilder and others).


I also think that when you start doing your own memory management you  
might as well write the code in C/C++, because you need to use similar  
techniques (similar to the resource management when using SWT).


Just my thoughts.

On Apr 4, 2007, at 8:54 PM, Jean-Philippe Robichaud wrote:


Hello Dear Lucene coders!



Some of you may remember, I'm using lucene for a product (and many  
other
internal utilities).  I'm also using another open source library  
called
Javolution (www.javolution.org http://www.javolution.org/ ) which  
does

many things, one of them being to offer excellent replacements for
ArrayList/Map/... and a super good memory management extension to the
java language.



As I'm [trying to] follow the conversations on this list, I see that
many of you are working towards optimizing lucene in term of memory
footprint and speed.  I just finished optimizing my code (not lucene
itself, but my code written on top of it) using Javolution PoolContext
and the FastList/FastMap/... classes.  The resulting speedup is a 6
times faster code.



Javolution makes it easy to recycle objects and do some object allocation 
on the stack rather than on the heap, which removes stress on the garbage 
collector.  Javolution also offers 2 classes (Text and TextBuilder) to

replace String/StringBuffer which are perfect for anything related to
string manipulation and some C union/struct equivalent for java.   
The

thing is really great.
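Object recycling is the core of what Javolution's PoolContext provides. The sketch below does not use the Javolution API; it is only a minimal hand-rolled pool showing why reuse takes pressure off the garbage collector:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal hand-rolled object pool illustrating the recycling idea.
public class PoolDemo {

    private final Deque<char[]> free = new ArrayDeque<char[]>();
    int allocations = 0;

    // Hand out a recycled buffer when one is available, allocate otherwise.
    char[] acquire() {
        if (free.isEmpty()) {
            allocations++;
            return new char[1024];
        }
        return free.pop();
    }

    void release(char[] buf) {
        free.push(buf);
    }

    public static void main(String[] args) {
        PoolDemo pool = new PoolDemo();
        for (int i = 0; i < 1000; i++) {
            char[] buf = pool.acquire(); // reused after the first iteration
            pool.release(buf);
        }
        System.out.println(pool.allocations); // 1
    }
}
```

One allocation serves a thousand uses; without the pool, each iteration would create garbage for the collector.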



Would anyone be interested in doing Lucene a face lift and start using
javolution as a core lucene dependency?  I understand that right now,
lucene is free of any dependencies, which is quite great, but anyone
interested in doing fast/lean/stable Java applications should seriously 
consider using javolution anyway.



Any thoughts?



Jp




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-636) [PATCH] Differently configured Lucene 'instances' in same JVM

2007-04-04 Thread Ken Geis (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486840
 ] 

Ken Geis commented on LUCENE-636:
-

This is not going to be sufficient.  There are active code paths that still use 
System.getProperty(..).  For instance, the static initializers of FSDirectory 
and SegmentReader.

If I load up a Compass-based web app, and it uses an old version of Lucene that 
works off system properties, it will set the 
org.apache.lucene.SegmentReader.class property to use a Compass-specific 
segment reader.  Then in another web app that uses a current version of Lucene 
that has moved away from using system properties, the application will crash 
when it tries to load the SegmentReader class.

 [PATCH] Differently configured Lucene 'instances' in same JVM
 -

 Key: LUCENE-636
 URL: https://issues.apache.org/jira/browse/LUCENE-636
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Johan Stuyts
 Attachments: Lucene2DifferentConfigurations.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-04 Thread Marvin Humphrey


On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:


(: Ironically, the numbers for Lucene on that page are a little
better than they should be because of a sneaky bug.  I would have
made updating the results a priority if they'd gone the other  
way.  :)


Hrm.  It would be nice to have a hard comparison of Lucene, KS (and
Ferret and others?).


Doing honest, rigorous benchmarking is exacting and labor-intensive.   
Publishing results tends to ignite flame wars I don't have time for.


The main point that I wanted to make with that page was that KS was a  
lot faster than Plucene, and that it was in Lucene's ballpark.   
Having made that point, I've moved on.  The benchmarking code is  
still very useful for internal development and I use it frequently.


At some point I would like to port the benchmarking work that has  
been contributed to Lucene of late, but I'm waiting for that code  
base to settle down first.  After that happens, I'll probably make a  
pass and publish some results.  Better to spend the time preparing  
one definitive presentation than to have to rebut every idiot's  
latest wildly inaccurate shootout.



... However, Lucene has been tuned by an army of developers over the
years, while KS is young yet and still had many opportunities for
optimization.  Current svn trunk for KS is about twice as fast for
indexing as when I did those benchmarking tests.


Wow, that's an awesome speedup!


The big bottleneck for KS has been its Tokenizer class.  There's only  
one such class in KS, and it's regex-based.  A few weeks ago, I  
finally figured out how to hook it into Perl's regex engine at the C  
level.  The regex engine is not an official part of Perl's C API, so  
I wouldn't do this if I didn't have to, but the tokenizing loop is  
only about 100 lines of code and the speedup is dramatic.


I've also squeezed out another 30-40% by changing the implementation  
in ways which have gradually winnowed down the number of malloc()  
calls.  Some of the techniques may be applicable to Lucene; I'll get  
around to firing up JIRA issues describing them someday.
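KinoSearch's single regex-driven Tokenizer is Perl/C, but the idea translates. The sketch below is a hypothetical Java analogue, not KS code: one configurable pattern replaces a family of special-purpose tokenizer classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java analogue of a single regex-based tokenizer: the
// tokenization policy lives entirely in one configurable pattern.
public class RegexTokenizerDemo {

    private final Pattern token;

    RegexTokenizerDemo(String pattern) {
        token = Pattern.compile(pattern);
    }

    // Emit every non-overlapping match of the pattern as a token.
    List<String> tokenize(String text) {
        List<String> out = new ArrayList<String>();
        Matcher m = token.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Default word pattern; swap the regex to change tokenization policy.
        RegexTokenizerDemo t = new RegexTokenizerDemo("\\w+(?:'\\w+)*");
        System.out.println(t.tokenize("KinoSearch's tokenizer, regex-based."));
    }
}
```

With the word pattern above, "KinoSearch's tokenizer" yields the tokens KinoSearch's and tokenizer; a whitespace-split policy would just be a different pattern, not a different class.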



So KS is faster than Lucene today?


I haven't tested recent versions of Lucene.  I believe that the  
current svn trunk for KS is faster for indexing than Lucene 1.9.1.   
But... A) I don't have an official release out with the current  
Tokenizer code, B) I have no immediate plans to prepare further  
published benchmarks, and C) it's not really important, because so  
long as the numbers are close you'd be nuts to choose one engine or  
the other based on that criterion rather than, say, what language your  
development team speaks.  KinoSearch scales to multiple machines, too.


Looking to the future, I wouldn't be surprised if Lucene edged ahead  
and stayed slightly ahead speed-wise, because I'm prepared to make  
some sacrifices for the sake of keeping KinoSearch's core API simple  
and the code base as small as possible.  I'd rather maintain a  
single, elegant, useful, flexible, plenty fast regex-based Tokenizer  
than the slew of Tokenizers Lucene offers, for instance.  It might be  
at a slight disadvantage going mano a mano against Lucene's  
WhiteSpaceTokenizer, but that's fine.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: publish to maven-repository

2007-04-04 Thread Sami Siren

hi,

I am volunteering to help put together releasable m2 artifacts for
Lucene, and I hope to start building and spreading m2 artifacts for
other Lucene sub-projects too (of course, if there are no objections).

--
Sami Siren

2007/4/5, Otis Gospodnetic [EMAIL PROTECTED]:


Jörg,
Since you offered to help - please see
https://issues.apache.org/jira/browse/LUCENE-622 .  lucene-core POM is
there for 2.1.0, but if you need POMs for contrib/*, please attach them to
that issue.  We have Jars, obviously, so we just need to copy those.  Then
we'll need .sha1 and .md5 files for all pushed Jars.  One of the other
developers will have to do that, as I don't have my PGP set up, and hence no
key for the KEYS file (if that's needed for the .sha1).

Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Joerg Hohwiller [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Tuesday, April 3, 2007 4:49:15 PM
Subject: publish to maven-repository

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi there,

I will give it another try:

Could you please publish lucene 2.* artifacts (including contribs) to the
maven2
repository at ibiblio?

Currently there is only the lucene-core available up to version 2.0.0:
http://repo1.maven.org/maven2/org/apache/lucene/

JARs and POMs go to:
scp://people.apache.org/www/www.apache.org/dist/maven-repository

If you need assistance I am pleased to help.
But I am not an official apache member and do NOT have access to do the
deployment myself.

Thank you so much...
  Jörg
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGEr3LmPuec2Dcv/8RAh1sAJ9m3qs7upNGJTgie5tNeAFKZenBowCgjufY
uB1/RNnI4wB3dviKy0w7XEs=
=llLh
-END PGP SIGNATURE-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]