[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-11 Thread Andy Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Liu updated LUCENE-855:


Attachment: contrib-filters.tar.gz

I made a few changes to MemoryCachedRangeFilter:

- SortedFieldCache's values[] now contains only sorted unique values, while 
docId[] has been changed to a ragged 2D array with an array of docIds 
corresponding to each unique value.  Since there are no longer repeated values 
in values[], forward() and rewind() are no longer required.  This also 
addresses the O(n) special case that Hoss brought up where every value is 
identical (see the sketch after this list).
- bits() now returns OpenBitSetWrapper, a subclass of BitSet that uses Solr's 
OpenBitSet as a delegate.  Wrapping OpenBitSet presents some challenges: since 
BitSet's internal bit store is private, it's difficult to perform operations 
(or, and, etc.) between a BitSet and an OpenBitSet.
- An in-memory OpenBitSet cache is kept.  During warmup, the global range is 
partitioned and an OpenBitSet instance is created for each partition.  During 
bits(), the cached OpenBitSet instances that fall between the lower and upper 
bounds are used.
- Moved MCRF to contrib/ due to the Solr dependency.
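
As a rough illustration of the new layout (class shape and names here are
assumptions for illustration, not the patch's actual code, which returns
OpenBitSetWrapper rather than a plain BitSet):

import java.util.Arrays;
import java.util.BitSet;

// Illustrative sketch only -- not the patch's actual SortedFieldCache.
class SortedFieldCacheSketch {
  long[] values;   // sorted, unique field values
  int[][] docIds;  // ragged: docIds[i] = all docs whose value == values[i]

  // With unique values there are no duplicate runs to skip, so no
  // forward()/rewind() is needed, and the all-identical-values case
  // costs one binary search plus a walk over the matching docs.
  BitSet bits(long lower, long upper) {
    BitSet result = new BitSet();
    int i = Arrays.binarySearch(values, lower);
    if (i < 0) i = -i - 1;  // negative result encodes the insertion point
    for (; i < values.length && values[i] <= upper; i++) {
      int[] docs = docIds[i];
      for (int j = 0; j < docs.length; j++) {
        result.set(docs[j]);
      }
    }
    return result;
  }
}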

Using the current (and incomplete) benchmark, MemoryCachedRangeFilter is 
slightly faster than FCRF when used in conjunction with ConstantScoreQuery and 
MatchAllDocsQuery:

Reader opened with 10 documents.  Creating RangeFilters...

TermQuery

FieldCacheRangeFilter
  * Total: 88ms
  * Bits: 0ms
  * Search: 14ms

MemoryCachedRangeFilter
  * Total: 89ms
  * Bits: 17ms
  * Search: 31ms

RangeFilter
  * Total: 9034ms
  * Bits: 4483ms
  * Search: 4521ms

Chained FieldCacheRangeFilter
  * Total: 33ms
  * Bits: 3ms
  * Search: 9ms

Chained MemoryCachedRangeFilter
  * Total: 77ms
  * Bits: 19ms
  * Search: 30ms


ConstantScoreQuery

FieldCacheRangeFilter
  * Total: 541ms
  * Bits: 2ms
  * Search: 485ms

MemoryCachedRangeFilter
  * Total: 473ms
  * Bits: 23ms
  * Search: 390ms

RangeFilter
  * Total: 13777ms
  * Bits: 4451ms
  * Search: 9298ms

Chained FieldCacheRangeFilter
  * Total: 12ms
  * Bits: 2ms
  * Search: 5ms

Chained MemoryCachedRangeFilter
  * Total: 80ms
  * Bits: 16ms
  * Search: 44ms


MatchAllDocsQuery

FieldCacheRangeFilter
  * Total: 1231ms
  * Bits: 3ms
  * Search: 1115ms

MemoryCachedRangeFilter
  * Total: 1222ms
  * Bits: 53ms
  * Search: 1149ms

RangeFilter
  * Total: 10689ms
  * Bits: 4954ms
  * Search: 5583ms

Chained FieldCacheRangeFilter
  * Total: 937ms
  * Bits: 1ms
  * Search: 862ms

Chained MemoryCachedRangeFilter
  * Total: 921ms
  * Bits: 19ms
  * Search: 894ms

Hoss, those were great comments you made.  I'd be happy to continue on and make 
those changes, although if the feeling around town is that Matt's range filter 
is the preferred implementation, I'll stop here.

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, 
> TestRangeFilterPerformanceComparison.java, 
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, 
> sorts by value, and stores them in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on a standard RangeQuery using random date intervals took 63904ms.  
> Using MemoryCachedRangeFilter, it took 876ms.  The performance increase is 
> less dramatic when a field has fewer unique terms or the index has fewer 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can easily be changed to support Strings.  A 
> side "benefit" of storing the values as longs is that there's no longer a 
> need to make the values lexicographically comparable, i.e. padding numeric 
> values with zeros.
> 

Re: optimize() method call

2007-04-11 Thread Antony Bowesman

Robert Engels wrote:

I think this is great, and it gave me an idea. What if another thread could
call a "stop optimize" which would stop the optimize after it came to a
consistent state (not in the middle of a segment merge).

We schedule our optimizes for the "lull" time period, but with 24/7 operation
this could be hard to find.

Being able to stop and then resume the optimize seems like a great idea.


+1.  It would be useful in shutdown cases where immediate shutdown is needed, 
or to allow a scheduled backup to kick in at a fixed time, rather than having 
to wait for optimize to complete.  Or is there another way to interrupt 
optimize safely?


Antony






Re: Large scale sorting

2007-04-11 Thread jian chen

I agree.  This falls into the area where a technical limit is reached; time to
modify the spec.

I thought about this issue over the past couple of days, and there is really NO
silver bullet.  If the field is a multi-valued field and the distinct field
values are not too many, you might reduce memory usage by storing the field as
bit sets, with each bit set corresponding to a distinct value (see the sketch
below).

But either way, you have to load the whole thing into memory for good
performance.
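
A minimal sketch of that idea, with all names invented for illustration --
one bit set per distinct value, so memory scales with the distinct-value
count rather than with per-document value arrays:

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: one BitSet per distinct value of a multi-valued field.
class PerValueBitSets {
  private final Map valueToDocs = new HashMap(); // String -> BitSet

  void add(String value, int docId) {
    BitSet docs = (BitSet) valueToDocs.get(value);
    if (docs == null) {
      docs = new BitSet();
      valueToDocs.put(value, docs);
    }
    docs.set(docId);
  }

  // Costs roughly numDistinctValues * numDocs / 8 bytes overall, so it
  // only pays off when the distinct-value count is small.
  BitSet docsFor(String value) {
    return (BitSet) valueToDocs.get(value);
  }
}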

Jian


On 4/10/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:


: I'm wondering then if the Sorting infrastructure could be refactored
: to allow  with some sort of policy/strategy where one can choose a
: point where one is not willing to use memory for sorting, but willing

...

: To accomplish this would require a substantial change to the
: FieldSortHitQueue et al, and I realize that the use of NIO

I don't follow ... why couldn't this be implemented entirely via a new
SortComparatorSource?  (you would also need something to create your file,
but that could probably be done as a decorator or subclass of IndexWriter,
couldn't it?)
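
A rough sketch of that suggestion against the Lucene 2.x
SortComparatorSource/ScoreDocComparator interfaces; the fixed-width key
file and all names here are assumptions:

import java.io.IOException;
import java.io.RandomAccessFile;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreDocComparator;
import org.apache.lucene.search.SortComparatorSource;
import org.apache.lucene.search.SortField;

// Sketch: sort keys live in a precomputed file (one long per doc id,
// written at index time) instead of an in-memory FieldCache.
class FileBackedComparatorSource implements SortComparatorSource {
  public ScoreDocComparator newComparator(IndexReader reader,
                                          String fieldname)
      throws IOException {
    final RandomAccessFile keys =
        new RandomAccessFile(fieldname + ".keys", "r");
    return new ScoreDocComparator() {
      private long key(int doc) {
        try {
          keys.seek((long) doc * 8);  // fixed-width record per doc id
          return keys.readLong();
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
      public int compare(ScoreDoc i, ScoreDoc j) {
        long a = key(i.doc), b = key(j.doc);
        return a < b ? -1 : (a > b ? 1 : 0);
      }
      public Comparable sortValue(ScoreDoc i) {
        return new Long(key(i.doc));
      }
      public int sortType() {
        return SortField.CUSTOM;
      }
    };
  }
}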

: immediately pins Lucene to Java 1.4, so I'm sure this is
: controversial.  But, if we wish Lucene to go beyond where it is now,

Java 1.5 is controversial; Lucene already has 1.4 dependencies.

: I think we need to start thinking about this particular problem
: sooner rather than later.

it depends on your timeline; Lucene's gotten pretty far with what it's
got.  Personally i'm banking on RAM getting cheaper fast enough that I
won't ever need to worry about this.

If i needed to support sorting on lots of fields with lots of different
locales, and my index was big enough that i couldn't feasibly keep all of
the FieldCaches in memory on one box, i wouldn't partition the index
across multiple boxes and merge results with a MultiSearcher ... i'd clone
the index across multiple boxes and partition the traffic based on the
field/locale it's searching on.

it's a question of cache management: if i know i have two very different
use cases for a Solr index, i partition those use cases to separate tiers
of machines to get better cache utilization; FieldCache is
just another type of cache.




-Hoss






Re: Maven artifacts for Lucene.*

2007-04-11 Thread Chris Hostetter

: Where and how would you store, for example, the dependency information
: that you would use to generate the poms? For lucene java it is easy
: for most modules, as there is only a dependency on lucene-core, but for
: example in solr, nutch and hadoop it starts to go beyond trivial.

Whatever files also need to be included along with the jars in order to
make the maven distribution complete that can't be built completely
dynamically (ie: the md5 files) can certainly be committed into the
repository ... but if making a release requires a lot of manual updating of
those files, it's going to be a hindrance to the process ... things like
version number and date should ideally be filled in via variables to help
keep things automated.

jar dependencies are another matter ... as you say, for java-lucene the
issue is trivial since there are no dependencies, but for other projects
it could get complicated.  Solr (for example) ships with the versions of
its dependencies that it expects to use, and in some cases these versions
may not be official release versions that you would ever find in a maven
repository.  I'm not sure how apps that want to publish to maven but depend
on apps that do not publish to maven deal with this problem, but whatever
solution they use could also be used in this case.

...either way, it's a discussion for the solr-dev list, not java-dev.

: projects). IMO we should however try to look at the big picture also and
: not only try to solve the minimal part to get it out of lucene-java's
: hands, because I am afraid that if the minimum is done here in
: lucene-java there might be gaps to fill in other projects and the way
: things are done here is not usable in other sub projects as it is.

each project has its own community ... even if you find a perfect
solution to every problem anyone in the world might ever encounter,
discussing it on java-dev does nothing to get your solution adopted by the
nutch, hadoop, or solr communities.



-Hoss





[jira] Commented: (LUCENE-625) Query auto completer

2007-04-11 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488213
 ] 

Karl Wettin commented on LUCENE-625:


(from a mail i just posted to java-user)

There is a memory leak in the trie at optimize() that has been fixed locally. 
It might be available in LUCENE-626 too. 

I'll repackage and post it as soon as I get time.

> Query auto completer
> 
>
> Key: LUCENE-625
> URL: https://issues.apache.org/jira/browse/LUCENE-625
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Karl Wettin
>Priority: Minor
> Attachments: autocomplete_0.0.1.tar.gz, autocomplete_20060730.tar.gz
>
>
> A trie that helps users type in their query.  Made for AJAX; works great 
> with common ruby on rails scripts.  Similar to the 
> Google labs suggester.
> Trained by user queries.  Optimizable.  Uses an in-memory corpus.  Serializable.






Re: Maven artifacts for Lucene.*

2007-04-11 Thread Joerg Hohwiller

Hi Grant.
> Initial thoughts and then more inline below, and keep in mind I long ago
> drank the Maven kool-aid and am a big fan.  :-)
> 
> I know it is a pain to a few, but realistically speaking there has not
> been all that much noise about Maven artifacts not being available.  We
> use Maven for everything we do and all I ever do when there is a new
> release of Lucene is put the new jars in our remote repository and
> everything works.  It takes two or three steps and about 5 minutes of my
> time, and would be less if I scripted it.  I frankly don't get what the
> big deal is.  

That was what I was thinking, too, when I asked on this list for somebody to
deploy lucene-highlighter 2.0.0 to the central maven2 repo on 08.01.2007 21:37.
I also supplied the suggested POM for it.
Anyhow, nobody has been able to do this for me for over a quarter of a year now.
I think it is just a matter of minutes, but I do NOT have permission to do it.
What I did instead was add it to the repository of my open-source project.
I am not an expert in rights and law and hope this is allowed under the
Apache License.

Anyway, it would be easier to do it once, centrally, instead of making all
maven+lucene users do this AND especially causing them to create
POMs on their own that will all be different.  Finally, maven users involved in
two different projects that did the same thing may end up in a conflicting
state if, say, the first project forgot to declare a dependency in a lucene
contrib POM.

In the end this shows that the process must be made so easy that it only takes
about one command to call.  I do NOT care too much whether this would be
"ant ..." or "mvn ...".

Best regards
  Jörg




Re: Maven artifacts for Lucene.*

2007-04-11 Thread Sami Siren
Chris Hostetter wrote:
> : Couldn't we just add various ANT targets that package the jars per
> : the Maven way, and even copy them to the appropriate places?  I
> : wonder how hard it would be
> : to have ANT output the POM and create Maven Jars.  I know it is
> 
> This is what i would view as the ideal situation ... a patch to the
> current ant build.xml that caused the package-all-binary and
> package-all-src targets to produce a new maven directory with everything
> we need to copy to the maven repository would be the best way to get
> people on board -- it's hard to complain about something that requires no
> effort to adopt.

Where and how would you store, for example, the dependency information
that you would use to generate the poms? For lucene java it is easy
for most modules, as there is only a dependency on lucene-core, but for
example in solr, nutch and hadoop it starts to go beyond trivial.

> 
> if that same patch included a new ant target named
> something like "publish-maven" which required key access to
> people.apache.org but took care of pushing the maven artifacts into the
> exact right spot, that would be one less thing people would have to worry
> about.
> 
> : > 1. There are differences when comparing to the ant-built jars (due to
> : > the release
> : > policy of a.o): the built jars will contain LICENSE.txt,
> : > NOTICE.txt in /META-INF. Is this a problem?
> 
> we should under no circumstances have two different jars calling
> themselves "lucene-core-X.Y.0.jar" with different md5 sums ... that's
> asking for a world of pain.  Fortunately there is an easy fix for this:
> start putting the LICENSE.txt and NOTICE.txt files in the jar ... i think
> there's already a patch for this floating around in Jira.
> 
> : > 2. I propose that we add additional folder level so the groupId for
> : > lucene
> : > java would be org.apache.lucene.java (it is now org.apache.lucene
> 
> I don't really see the advantage of this ... Lucene Java has always had
> the *java* package org.apache.lucene, and my understanding was that
> maven groupIds should attempt to match the java package structure of the
> code.  likewise, the other java subprojects have their own java packages;
> shouldn't their groupIds match their package structures?

I believe you are right, and I also think it makes more sense to match
the package names.

>.. but starting with Lucene
> Java is probably the right way to go ... if a simple solution is found
> for our build file, it will probably lend itself to similar solutions for
> the other Lucene projects that use ant.

Yes, that is my hope also, and the main motivation to start the
discussion from lucene java (as it also is a dependency of 2 more sub
projects). IMO we should however try to look at the big picture also and
not only solve the minimal part to get it out of lucene-java's
hands, because I am afraid that if only the minimum is done here in
lucene-java there might be gaps to fill in other projects, and the way
things are done here may not be usable in other sub projects as it is.

--
 Sami Siren




Re: Maven artifacts for Lucene.*

2007-04-11 Thread Chris Hostetter

: Couldn't we just add various ANT targets that package the jars per
: the Maven way, and even copy them to the appropriate places?  I
: wonder how hard it would be
: to have ANT output the POM and create Maven Jars.  I know it is

This is what i would view as the ideal situation ... a patch to the
current ant build.xml that caused the package-all-binary and
package-all-src targets to produce a new maven directory with everything
we need to copy to the maven repository would be the best way to get
people on board -- it's hard to complain about something that requires no
effort to adopt.

if that same patch included a new ant target named
something like "publish-maven" which required key access to
people.apache.org but took care of pushing the maven artifacts into the
exact right spot, that would be one less thing people would have to worry
about.

: > 1. There are differences when comparing to the ant-built jars (due to
: > the release
: > policy of a.o): the built jars will contain LICENSE.txt,
: > NOTICE.txt in /META-INF. Is this a problem?

we should under no circumstances have two different jars calling
themselves "lucene-core-X.Y.0.jar" with different md5 sums ... that's
asking for a world of pain.  Fortunately there is an easy fix for this:
start putting the LICENSE.txt and NOTICE.txt files in the jar ... i think
there's already a patch for this floating around in Jira.

: > 2. I propose that we add additional folder level so the groupId for
: > lucene
: > java would be org.apache.lucene.java (it is now org.apache.lucene

I don't really see the advantage of this ... Lucene Java has always had
the *java* package org.apache.lucene, and my understanding was that
maven groupIds should attempt to match the java package structure of the
code.  likewise, the other java subprojects have their own java packages;
shouldn't their groupIds match their package structures?

As a side note: nothing discussed here really has any bearing on the other
Lucene sub-projects; the individual project communities need to discuss
any changes to their build processes/policies .. but starting with Lucene
Java is probably the right way to go ... if a simple solution is found
for our build file, it will probably lend itself to similar solutions for
the other Lucene projects that use ant.




-Hoss





[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-11 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488125
 ] 

Hoss Man commented on LUCENE-855:
-

Another thing that occurred to me this morning is that the comparison test 
doesn't consider the performance of the various Filters when cached and reused 
(with something like CachingWrapperFilter) ... you may actually see the stock 
RangeFilter be faster than either implementation when you can reuse the same 
exact Filter over and over on the same IndexReader -- a fairly common use case 
(see the sketch after the list below).

In general the numbers that really need to be compared are...

  1) the time overhead of an implementation when opening a new IndexReader (and 
whether that overhead is per field)
  2) the time overhead of an implementation the first time a specific Filter is 
used on an IndexReader
  3) the average time it takes to use a Filter
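
For illustration, a minimal sketch of that reuse case (field name and
bounds invented; CachingWrapperFilter caches the computed BitSet per
IndexReader):

import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeFilter;

class FilterReuseSketch {
  // Wrap once, reuse many times: only the first search on a given
  // IndexReader pays the bits() cost.
  Hits searchTwice(IndexSearcher searcher, Query query) throws Exception {
    Filter cached = new CachingWrapperFilter(
        new RangeFilter("date", "20020101", "20070101", true, true));
    searcher.search(query, cached);         // computes and caches bits()
    return searcher.search(query, cached);  // reuses the cached BitSet
  }
}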

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, 
> TestRangeFilterPerformanceComparison.java, 
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, 
> sorts by value, and stores them in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on a standard RangeQuery using random date intervals took 63904ms.  
> Using MemoryCachedRangeFilter, it took 876ms.  The performance increase is 
> less dramatic when a field has fewer unique terms or the index has fewer 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can easily be changed to support Strings.  A 
> side "benefit" of storing the values as longs is that there's no longer a 
> need to make the values lexicographically comparable, i.e. padding numeric 
> values with zeros.
> The downside of using MemoryCachedRangeFilter is that there's a fairly 
> significant memory requirement.  So it's designed to be used in situations 
> where range filter performance is critical and memory consumption is not an 
> issue.  The memory requirement is: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step, which can take a while 
> to run on large datasets (it took 40s to run on a 3M document corpus).  
> Warmup can be called explicitly or is automatically invoked the first time 
> MemoryCachedRangeFilter is applied to a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - The field contains many unique numeric values
> - The index contains a large number of documents
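
Plugging the 3M-document corpus mentioned above into that formula gives a
sense of scale (raw array bytes only; JVM object overhead is ignored):

public class MemoryEstimate {
  public static void main(String[] args) {
    long numDocs = 3000000L;         // the 3M-document corpus above
    long bytes = (4 + 8) * numDocs;  // sizeof(int) + sizeof(long) per doc
    // Prints "36000000 bytes = ~34 MB per field"
    System.out.println(bytes + " bytes = ~"
        + bytes / (1024 * 1024) + " MB per field");
  }
}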






Re: Maven artifacts for Lucene.*

2007-04-11 Thread Grant Ingersoll


On Apr 11, 2007, at 11:02 AM, Sami Siren wrote:
We wouldn't touch the existing single maven artifact in the repository,
just would deploy the new artifacts under a different gId; nothing
existing is broken on the way. We could of course continue publishing
under gId 'org.apache.lucene' if so decided, but I think it's clearer
if the subprojects are under a different gId.



I was confused on the subject.  I thought you were talking about the
source for some reason, but you mean the structure for the artifacts
on the servers.  Like I said, I'm not fully up on M2 yet.  And,
honestly, the lack of a good migration plan from M1 to M2 leaves a
very small but bitter taste in my mouth, especially when it comes to
existing jelly scripts.





Re: Maven artifacts for Lucene.*

2007-04-11 Thread Sami Siren
Grant Ingersoll wrote:
> Initial thoughts and then more inline below, and keep in mind I long ago
> drank the Maven kool-aid and am a big fan.  :-)
> 
> I know it is a pain to a few, but realistically speaking there has not
> been all that much noise about Maven artifacts not being available.  We
> use Maven for everything we do and all I ever do when there is a new
> release of Lucene is put the new jars in our remote repository and
> everything works.  It takes two or three steps and about 5 minutes of my
> time, and would be less if I scripted it.  I frankly don't get what the
> big deal is.  OK, it does save a few bytes on a server somewhere and we
> have our own group/artifact names (lucene/lucene), but chances are it is
> more reliable than trying to get it from one of the mirrors and it will
> be faster and the names are clear cut and easy to remember.  I would
> venture anyone with Maven installed has their own repository to store
> their own, private, artifacts, so it isn't like they need to add some
> new, complex process.

Yes, it is true that many organizations use internal repositories (at
least from what I've seen); heck, even every developer has one
(.m2/repository by default).  But IMO lots of the benefits of maven are
lost if that's the way users at large utilize maven.

Just as you are solving the problem for your organization by deploying
lucene into your private repository (on behalf of the developers using
your local repository), I would like to solve the problem more globally.
Eventually you could save those 5 mins of your time to do some more
lucene magic ;)

>> The next best thing IMO would be using ant build as normally for the non
>> maven2 releases and use maven2 for building the maven releases (.jar
>> files, optionally also packages for sources used to build the binary and
>> packages for javadocs) with related check sums and signatures.
>>
>> To repeat it one more time: what I am proposing here is not meant to
>> replace
>> the current solid way of building the various Lucene projects -
>> I am just trying to provide a convenient way to make the release
>> artifacts
>> to be deployed to maven repositories.
>>
> 
> Couldn't we just add various ANT targets that package the jars per the
> Maven way, and even copy them to the appropriate places?  I wonder how
> hard it would be
> to have ANT output the POM and create Maven Jars.  I know it is
> backwards, but, it would be less complicated and could be hooked right
> into the ANT script and require very little from the RM.

For me it's more important to get where I am going than the detail
that gets me there.  So I would be a very happy man if one way were
adopted by the lucene community.

> 
> If I'm understanding correctly, you want to change the whole package
> structure by adding one more level?  Wouldn't this break every single
> user of Lucene?   We are still on M1 but are in the process of
> migrating, which is not straightforward, but, alas, the writing is on
> the wall concerning M1.

We wouldn't touch the existing single maven artifact in the repository,
just would deploy the new artifacts under a different gId; nothing
existing is broken on the way. We could of course continue publishing
under gId 'org.apache.lucene' if so decided, but I think it's clearer
if the subprojects are under a different gId.

--
 Sami Siren






Re: Maven artifacts for Lucene.*

2007-04-11 Thread Grant Ingersoll
Initial thoughts and then more inline below, and keep in mind I long  
ago drank the Maven kool-aid and am a big fan.  :-)


I know it is a pain to a few, but realistically speaking there has  
not been all that much noise about Maven artifacts not being  
available.  We use Maven for everything we do and all I ever do when  
there is a new release of Lucene is put the new jars in our remote  
repository and everything works.  It takes two or three steps and  
about 5 minutes of my time, and would be less if I scripted it.  I  
frankly don't get what the big deal is.  OK, it does save a few bytes  
on a server somewhere and we have our own group/artifact names  
(lucene/lucene), but chances are it is more reliable than trying to  
get it from one of the mirrors and it will be faster and the names  
are clear cut and easy to remember.  I would venture anyone with  
Maven installed has their own repository to store their own, private,  
artifacts, so it isn't like they need to add some new, complex process.



On Apr 10, 2007, at 4:20 AM, Sami Siren wrote:


I have been hoping to put up a mechanism for (easier) deployment of m2
artifacts to maven repositories (both the Apache snapshot repository and
the main maven repository at ibiblio).

The most convenient way would be to use maven2 to build the various lucene
projects, but as the mailing list conversation about this subject
indicates, there is no common interest in changing the (working) ant based
build system to a maven based one.

The next best thing IMO would be using the ant build as normal for the non
maven2 releases and using maven2 for building the maven releases (.jar
files, optionally also packages for the sources used to build the binary
and packages for javadocs) with related checksums and signatures.

To repeat it one more time: what I am proposing here is not meant to
replace the current solid way of building the various Lucene projects -
I am just trying to provide a convenient way for the release artifacts
to be deployed to maven repositories.



Couldn't we just add various ANT targets that package the jars per the
Maven way, and even copy them to the appropriate places?  I wonder how
hard it would be to have ANT output the POM and create Maven Jars.  I
know it is backwards, but it would be less complicated and could be
hooked right into the ANT script and require very little from the RM.

Ideally, I would love to see the release process automated so that it
became push button (I know Maven goes a long way toward this)






There are however a couple of things I need your opinion about (or at
least your attention):

1. There are differences when comparing to the ant-built jars: (due to the
release policy of a.o) the built jars will contain LICENSE.txt,
NOTICE.txt in /META-INF. Is this a problem?


Does this just mean it would be in two places?  I don't think that is  
a big deal.




2. I propose that we add an additional folder level so the groupId for
lucene java would be org.apache.lucene.java (it is now org.apache.lucene
within the currently released artifacts). The initial list of artifacts
(the new proposed structure) is listed below:


If I'm understanding correctly, you want to change the whole package
structure by adding one more level?  Wouldn't this break every single
user of Lucene?  We are still on M1 but are in the process of
migrating, which is not straightforward, but, alas, the writing is on
the wall concerning M1.







The text above was my initial thought about this; however, there have been
concerns that the procedure described here might not be the most optimal
one. So far the arguments have been the following:

1. Two build systems to maintain

True. However I don't quite see that as so black and white: you would
anyway need to maintain the poms manually (if you care about the quality
of the poms) or you would have to build some mechanism to generate them.
Of course, in a situation where you would not actually build with maven,
the poms could be a bit simpler.

2. Two build systems producing different jars; would maven2 releases
require a separate vote?

Yes, the artifacts (jars) would be different, because you would need to
add LICENSE and MANIFEST into them (because of apache policy). I don't
know about the vote; how do other projects deal with this kind of
situation - anyone here to tell?

One solution to the jar mismatch would be changing the ant build to put
those files in the produced jars.


I think that would be fine to unify the jars.



3. Additional burden for the RM: need to run an additional command and
install maven

There will be that external step for doing the maven release, and you
need to install maven as well. But compare that to the current situation,
where you would have to extract the jars, put some more files into them,
sign them, modify the poms to reflect the correct version numbers, and
upload them to repositories manually.

The other way to do it would be changing the current build system

[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-11 Thread Yiqing Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488075
 ] 

Yiqing Jin commented on LUCENE-855:
---

This seems very useful.  Just one thing I would like to know: does this Filter 
work properly with ChainedFilter?  Sometimes we have to filter the result with 
more than one range on different fields, say searching an area by lat/lon (see 
the sketch below). 
I made a simple test filtering two fields with ChainedFilter, and it seems 
that I can't find anything even when there are docs in that range. 
Maybe there are some bugs in my code; I'll check tomorrow.
BTW, the value type I used is Float.
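
For reference, a sketch of the two-range case with contrib's ChainedFilter;
the MemoryCachedRangeFilter constructor shown is a guess at the patch's
API, not confirmed:

import org.apache.lucene.misc.ChainedFilter;
import org.apache.lucene.search.Filter;

class LatLonFilterSketch {
  Filter boundingBox() {
    // Hypothetical MemoryCachedRangeFilter constructor -- the patch's
    // actual API may differ.
    Filter lat = new MemoryCachedRangeFilter("lat", 40.0f, 41.0f);
    Filter lon = new MemoryCachedRangeFilter("lon", -74.5f, -73.5f);
    return new ChainedFilter(new Filter[] { lat, lon }, ChainedFilter.AND);
  }
}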

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, 
> TestRangeFilterPerformanceComparison.java, 
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, 
> sorts by value, and stores them in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on a standard RangeQuery using random date intervals took 63904ms.  
> Using MemoryCachedRangeFilter, it took 876ms.  The performance increase is 
> less dramatic when a field has fewer unique terms or the index has fewer 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can easily be changed to support Strings.  A 
> side "benefit" of storing the values as longs is that there's no longer a 
> need to make the values lexicographically comparable, i.e. padding numeric 
> values with zeros.
> The downside of using MemoryCachedRangeFilter is that there's a fairly 
> significant memory requirement.  So it's designed to be used in situations 
> where range filter performance is critical and memory consumption is not an 
> issue.  The memory requirement is: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step, which can take a while 
> to run on large datasets (it took 40s to run on a 3M document corpus).  
> Warmup can be called explicitly or is automatically invoked the first time 
> MemoryCachedRangeFilter is applied to a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - The field contains many unique numeric values
> - The index contains a large number of documents






[jira] Commented: (LUCENE-794) SpanScorer and SimpleSpanFragmenter for Contrib Highlighter

2007-04-11 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488039
 ] 

Mark Miller commented on LUCENE-794:


I use that to make the Range Query test pass.  The old-style Range Query 
is highlightable.


> SpanScorer and SimpleSpanFragmenter for Contrib Highlighter
> ---
>
> Key: LUCENE-794
> URL: https://issues.apache.org/jira/browse/LUCENE-794
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Mark Miller
>Priority: Minor
> Attachments: CachedTokenStream.java, CachedTokenStream.java, 
> CachedTokenStream.java, DefaultEncoder.java, Encoder.java, Formatter.java, 
> Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java, 
> Highlighter.java, HighlighterTest.java, HighlighterTest.java, 
> HighlighterTest.java, HighlighterTest.java, MemoryIndex.java, 
> QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java, 
> QuerySpansExtractor.java, SimpleFormatter.java, spanhighlighter.patch, 
> spanhighlighter2.patch, spanhighlighter3.patch, spanhighlighter5.patch, 
> spanhighlighter_patch_4.zip, SpanHighlighterTest.java, 
> SpanHighlighterTest.java, SpanScorer.java, SpanScorer.java, 
> WeightedSpanTerm.java
>
>
> This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter 
> package that scores just like QueryScorer, but scores 0 for Terms that did 
> not cause the Query hit.  This gives 'actual' hit highlighting for the range 
> of SpanQuerys and for PhraseQuery.  There is also a new Fragmenter that 
> attempts to fragment without breaking up Spans.
> See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.






Re: Large scale sorting

2007-04-11 Thread Chris Hostetter
: I'm wondering then if the Sorting infrastructure could be refactored
: to allow  with some sort of policy/strategy where one can choose a
: point where one is not willing to use memory for sorting, but willing

...

: To accomplish this would require a substantial change to the
: FieldSortHitQueue et al, and I realize that the use of NIO

I don't follow ... why couldn't this be implemented entirely via a new
SortComparatorSource?  (you would also need something to create your file,
but that could probably be done as a decorator or subclass of IndexWriter,
couldn't it?)

: immediately pins Lucene to Java 1.4, so I'm sure this is
: controversial.  But, if we wish Lucene to go beyond where it is now,

Java 1.5 is controversial; Lucene already has 1.4 dependencies.

: I think we need to start thinking about this particular problem
: sooner rather than later.

it depends on your timeline; Lucene's gotten pretty far with what it's
got.  Personally i'm banking on RAM getting cheaper fast enough that I
won't ever need to worry about this.

If i needed to support sorting on lots of fields with lots of different
locales, and my index was big enough that i couldn't feasibly keep all of
the FieldCaches in memory on one box, i wouldn't partition the index
across multiple boxes and merge results with a MultiSearcher ... i'd clone
the index across multiple boxes and partition the traffic based on the
field/locale it's searching on.

it's a question of cache management: if i know i have two very different
use cases for a Solr index, i partition those use cases to separate tiers
of machines to get better cache utilization; FieldCache is
just another type of cache.




-Hoss

