Re: lucene and solr trunk

2010-03-16 Thread Jake Mannix
On Tue, Mar 16, 2010 at 3:10 PM, Yonik Seeley  wrote:

> On Tue, Mar 16, 2010 at 6:01 PM, Jake Mannix 
> wrote:
> > I'm not concerned with casual downloaders.  I'm talking
> > about the companies and people who may or may not be
> > interested in making multi-million dollar decisions regarding
> > using or not using Lucene or Solr.
>
> Heh - multi-million dollar decisions after a quick glance at an SVN url?
>

Clearly not.  But just as I think that making the development of
both solr and lucene easier is a noble goal, I think that giving
people the impression that choosing to "go with Lucene"
*means* choosing Solr as their end solution is not what
we want to do.  There are some places where Solr is just not
appropriate but Lucene may be.

Will this impression be "caused" by an SVN directory url
alone?  Of course not.  Merged committer lists, locked
releases, *and* an SVN url which shows this?  Yes, I
think the kinds of VPs and CTOs I've talked to and
tried to help decide whether to go with an open-source
search solution could indeed start to get the feeling that
there's really just one apache solution, the
"Solr/Lucene solution".  And if they look into Solr and
decide that this particular application is not for them,
they may then not look deep enough to see whether
doing a custom Lucene application *would be*.

  -jake


Re: lucene and solr trunk

2010-03-16 Thread Jake Mannix
On Tue, Mar 16, 2010 at 2:53 PM, Yonik Seeley  wrote:
>
> > Chiming in just a bit here - independent of whether or not people "can"
> > build lucene without checking out solr, isn't there a concern that the
> > mere fact that Lucene will be effectively a "subdirectory" of solr will
> > create a perception that Lucene is a subproject of Solr, instead of
> > vice-versa?
>
> Who would have this perception?
> Casual users will be using downloads.
>

Developers and dev managers at companies making build-vs.-buy decisions
about which of the following they will do:

1) pay big bucks to get FAST or whatever
2) use Solr (free/cheap!)
3) pay [variable] bucks to build their own with Lucene
4) pay [variable but high] to build their own from scratch

I'm not concerned with casual downloaders.  I'm talking about the companies
and people who may or may not be interested in making multi-million dollar
decisions regarding using or not using Lucene or Solr.

  -jake


Re: lucene and solr trunk

2010-03-16 Thread Jake Mannix
On Tue, Mar 16, 2010 at 2:31 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:
>
> If we move lucene under Solr's existing svn path, ie:
>
>  /solr/trunk/lucene


Chiming in just a bit here - independent of whether or not people "can"
build lucene without checking out solr, isn't there a concern that the
mere fact that Lucene will be effectively a "subdirectory" of solr will
create a perception that Lucene is a subproject of Solr, instead of
vice-versa?

The way mavenified projects work is that there would instead be a top level
in which both solr and lucene would be submodules (and thus also
subdirectories in svn), with a dependency from solr to lucene (in the
pom.xml for maven, but easy enough to do with the build.xml with ant).
Checking out solr without lucene should be doable (using snapshot jars from
lucene trunk nightly, maybe?), and the reverse should be easy, as could be
checking out the top-level and getting everything (including a top-level
build.xml which <ant>'s or antcall's into the subdirectory build.xmls).

It seems really weird to have Lucene appear as a subdirectory of Solr,
especially for people
out there who aren't using Solr.

  -jake


Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts

2010-01-12 Thread Jake Mannix
On Tue, Jan 12, 2010 at 10:14 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> Jake,
>
> I wonder how often people need reliable transactions for
> realtime search? Maybe Mysql's t-log could be used sans the
> database part?
>

A reliable message queue - I'd imagine all the time!  Transactions... that
depends on how "ACID" you care about.  For social media, news, log
monitoring, you can miss some events, and transactionality isn't
necessarily the key part - just the ability to replay from some point in
time, so you can elastically replicate, and handle server crashes (as well
as doing background batch indexing followed by incremental realtime
catchup).


> The created_at column for near realtime seems like it could hurt
> the database due to excessive polling? Has anyone tried it yet?
>

I haven't tried it in a production system, but in testing it seemed to be
bad only if you have a single, not replicated or sharded DB, but a fully
replicated search system (so having no separate message queue would entail
that you're spamming your db with polling queries).  If your ratio of
search shards to db shards isn't too high (like say, the non-distributed
case where you have a handful of replica indexes talking to one centralized
db), then this wouldn't be as much of a problem unless you go crazy and
query every couple of milliseconds.
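
For concreteness, here's a minimal sketch of what such created_at polling
could look like - assuming a JDBC source, a made-up "docs" table, and a
hypothetical handleEvent() hook into the indexing system (this is not
Zoie's actual DataProvider code):

    import java.sql.*;

    public class CreatedAtPoller {
      private final Connection conn;
      private Timestamp lastSeen; // in practice, persisted alongside the index

      public CreatedAtPoller(Connection conn, Timestamp startFrom) {
        this.conn = conn;
        this.lastSeen = startFrom;
      }

      /** One polling pass: fetch rows newer than the last version indexed. */
      public void pollOnce() throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
            "SELECT id, body, created_at FROM docs "
            + "WHERE created_at > ? ORDER BY created_at");
        ps.setTimestamp(1, lastSeen);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
          handleEvent(rs.getLong("id"), rs.getString("body"));
          lastSeen = rs.getTimestamp("created_at"); // advance the "index version"
        }
        rs.close();
        ps.close();
      }

      private void handleEvent(long id, String body) {
        // hypothetical: build a Lucene Document from the row and index it
      }
    }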


> > I wrote up a simple file-based indexing event log in an
> > afternoon
>
> Right, however it's probably a long perilous leap from this to a t-log
> that's production ready.
>

Certainly.

> I'm waiting for someone to dive in and mess with Bookkeeper
> http://wiki.apache.org/hadoop/BookKeeper and report back!
>

That could be great for this kind of thing.  Add some custom adapters
to write ledger entries which are easily translatable into Lucene
Documents, and that would hook right into Zoie rather easily.
BookKeeperStreamDataProvider!

  -jake


Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts

2010-01-12 Thread Jake Mannix
On Tue, Jan 12, 2010 at 8:55 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> > Zoie keeps track of an "index version" on disk alongside the Lucene index
> > which it uses to decide where it must reindex from to "catch up" if there
> > have been incoming indexing events while the server was out of commission.
>
> This begs a little more clarity... Sounds like a transaction log.  Oh
> right, with Zoie there's the assumption of an external transaction log -
> however it doesn't provide one out of the box?
>

The index versioning scheme Zoie uses is independent of what mechanism you
use to implement it.  If your indexing technique is to talk to a database
directly, you don't need a transaction log; something as simple as a
"created_at" column will suffice in many situations.  I gave a short talk to
demo zoie yesterday, and for it I wrote up a simple file-based indexing
event log in an afternoon.  Similarly, if you listen on a JMS queue or
basically any other message-queue based system that is not "push only",
you'll have some notion of "replay since [timestamp / version / incrementing
counter]", but they're all vendor dependent.

It's not the kind of thing you can just provide out of the box due to this
vendor dependence.  On the other hand, if someone came along and said they
wanted to use zoie with RabbitMQ or whatever, we'd certainly accept a patch
for a StreamDataProvider implementation which does that (and maybe one of
the zoie committers would even write it themselves if it seemed like a
common enough use case).

  -jake


>
> On Tue, Jan 12, 2010 at 8:43 PM, Jake Mannix 
> wrote:
> > On Tue, Jan 12, 2010 at 8:15 PM, Otis Gospodnetic
> >  wrote:
> >>
> >> John, you should have a look at Zoie.  I just finished adding LinkedIn's
> >> case study about Zoie to Lucene in Action 2, so this is fresh in my
> >> mind.
> >>
> >> :)
> >
> > Yep, Zoie ( http://zoie.googlecode.com ) will handle the server restart
> > part, in that while yes, you lose what is in RAM, Zoie keeps track of an
> > "index version" on disk alongside the Lucene index which it uses to
> > decide where it must reindex from to "catch up" if there have been
> > incoming indexing events while the server was out of commission.
> > Zoie does not support multiple servers using the same index, because each
> > zoie instance has IndexWriter instances, and you'll get locking problems
> > trying to do that.  You could have one Zoie instance effectively as the
> > "master/writer/realtime reader", and a bunch of raw Lucene "slaves" which
> > could read off of that index, but as you say, could not get access to the
> > RAMDirectory information until it was flushed to disk.
> > Why do you need a "cluster" of servers hitting the same index?  Are they
> > different applications (with different search logic, so they need to be
> > different instances), or is it just to try and utilize your hardware
> > efficiently?  If it's for performance reasons, you might find you get
> > better use of your CPU cores by just sharding your one index into smaller
> > ones, each having their own Zoie instance, and putting a "broker" on top
> > of them searching across all and mergesorting the results.  Often even
> > this isn't necessary, because Zoie will be opening the disk-backed
> > IndexReader in readonly mode, and thus all the synchronized blocks are
> > gone, and one single Zoie instance will easily saturate your cpu cores by
> > simple multi-threading by your appserver.
> > If you really needed to do many different kinds of writes (from different
> > applications) and also have applications not involved in the writing also
> > seeing (in real-time) these writes, then you could still do it with Zoie,
> > but it would take some interesting architectural juggling (write your own
> > StreamDataProvider class which takes input from a variety of sources and
> > merges them together to feed to one Zoie instance, then a broker on top
> > of zoie which serves out IndexReaders to different applications living on
> > top which can wrap them up in their own business logic as they saw
> > fit... as long as it was ok to have all the applications in the same JVM,
> > of course).
> >   -jake
> >
> >>
> >>  Otis
> >> --
> >> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> >>
> >> - Original Message 
> >> > From: jchang 
> >> > To: java-dev@lucene.apache.org

Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts

2010-01-12 Thread Jake Mannix
On Tue, Jan 12, 2010 at 8:15 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> John, you should have a look at Zoie.  I just finished adding LinkedIn's
> case study about Zoie to Lucene in Action 2, so this is fresh in my mind.
>
> :)
>

Yep, Zoie ( http://zoie.googlecode.com ) will handle the server restart
part, in that while yes, you lose what is in RAM, Zoie keeps track of an
"index version" on disk alongside the Lucene index which it uses to decide
where it must reindex from to "catch up" if there have been incoming
indexing events while the server was out of commission.

Zoie does not support multiple servers using the same index, because each
zoie instance has IndexWriter instances, and you'll get locking problems
trying to do that.  You could have one Zoie instance effectively as the
"master/writer/realtime reader", and a bunch of raw Lucene "slaves" which
could read off of that index, but as you say, could not get access to the
RAMDirectory information until it was flushed to disk.

Why do you need a "cluster" of servers hitting the same index?  Are they
different applications (with different search logic, so they need to be
different instances), or is it just to try and utilize your hardware
efficiently?  If it's for performance reasons, you might find you get better
use of your CPU cores by just sharding your one index into smaller ones,
each having their own Zoie instance, and putting a "broker" on top of them
searching across all and mergesorting the results.  Often even this isn't
necessary, because Zoie will be opening the disk-backed IndexReader in
readonly mode, and thus all the synchronized blocks are gone, and one single
Zoie instance will easily saturate your cpu cores by simple multi-threading
by your appserver.
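
As a rough sketch of that "broker" idea - assuming Lucene 2.9-era APIs and
made-up shard paths, not Zoie's actual code - plain Lucene's MultiSearcher
already does the search-across-all-and-mergesort part:

    import java.io.File;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.FSDirectory;

    public class ShardBroker {
      public static TopDocs search(Query q, String[] shardPaths)
          throws Exception {
        Searchable[] shards = new Searchable[shardPaths.length];
        for (int i = 0; i < shardPaths.length; i++) {
          // read-only searchers: the synchronized blocks are gone
          shards[i] = new IndexSearcher(
              FSDirectory.open(new File(shardPaths[i])), true);
        }
        // runs the query on each shard and merge-sorts the hits
        return new MultiSearcher(shards).search(q, 10);
      }
    }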

If you really needed to do many different kinds of writes (from different
applications) and also have applications not involved in the writing also
seeing (in real-time) these writes, then you could still do it with Zoie,
but it would take some interesting architectural juggling (write your own
StreamDataProvider class which takes input from a variety of sources and
merges them together to feed to one Zoie instance, then a broker on top of
zoie which serves out IndexReaders to different applications living on top
which can wrap them up in their own business logic as they saw fit... as
long as it was ok to have all the applications in the same JVM, of course).

  -jake


>
>  Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>
>
>
> - Original Message 
> > From: jchang 
> > To: java-dev@lucene.apache.org
> > Sent: Tue, January 12, 2010 6:10:56 PM
> > Subject: Lucene 2.9.0 Near Real Time Indexing and Service
> > Crashes/restarts
> >
> >
> > Lucene 2.9.0 has near real time indexing, writing to a RAMDir which gets
> > flushed to disk when you do a search.
> >
> > Does anybody know how this works out with service restarts (both orderly
> > shutdown and a crash)?  If the service goes down while indexed items are
> > in RAMDir but not on disk, are they lost?  Or is there some kind of log
> > recovery?
> >
> > Also, does anybody know the impact of this with clustered lucene
> > servers?  If you have numerous servers running off one index, I assume
> > there is no way for the other services to pick up the newly indexed items
> > until they are flushed to disk, correct?  I'd be happy if that is not so,
> > but I suspect it is so.
> >
> > Thanks,
> > John
> > --
> > View this message in context:
> > http://old.nabble.com/Lucene-2.9.0-Near-Real-Time-Indexing-and-Service-Crashes-restarts-tp27136539p27136539.html
> > Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

2009-12-11 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789618#action_12789618
 ] 

Jake Mannix commented on LUCENE-2026:
-

bq. Now we want more speed, and are ready to sacrifice something if needed.
bq. You decide to sacrifice new record (in)visibility. No choice, but to hack 
into IW to allow readers see its hot, fresh innards.

Chiming in here that of course, you don't *need* (ie there is a choice) to hack 
into the IW to do this.  Zoie is a completely user-land solution which modifies 
no IW/IR internals and yet achieves millisecond index-to-query-visibility 
turnaround while keeping speedy indexing and query performance.  It just keeps 
the RAMDir outside, encapsulated in an object (an IndexingSystem) which has 
IndexReaders built off of both the RAMDir and the FSDir, and hides the 
implementation details (in fact the IW itself) from the user.  

The API for this kind of thing doesn't *have* to be tightly coupled, and I 
would agree with you that it shouldn't be.
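
A minimal sketch of that encapsulation, assuming Lucene 2.9-era APIs (a
hypothetical class for illustration, not Zoie's actual implementation):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class IndexingSystem {
      private final RAMDirectory ramDir = new RAMDirectory(); // hot, fresh docs
      private final Directory fsDir;                          // durable segments
      // the IndexWriters for both directories stay private - never exposed

      public IndexingSystem(Directory fsDir) { this.fsDir = fsDir; }

      /** Callers only ever see a read-only view spanning both directories. */
      public IndexReader openReader() throws IOException {
        // assumes both directories already hold a committed segment
        IndexReader ramReader  = IndexReader.open(ramDir, true);
        IndexReader diskReader = IndexReader.open(fsDir, true);
        return new MultiReader(
            new IndexReader[] { ramReader, diskReader }, true);
      }
    }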

> Refactoring of IndexWriter
> --
>
> Key: LUCENE-2026
> URL: https://issues.apache.org/jira/browse/LUCENE-2026
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
> One could be called a SegmentWriter and as the
> name says its job would be to write one particular index segment. The
> default one just as today will provide methods to add documents and
> flushes when its buffer is full.
> Other SegmentWriter implementations would do things like e.g. appending or
> copying external segments [what addIndexes*() currently does].
> The second component's job would be to manage writing the segments
> file and merging/deleting segments. It would know about
> DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
> provide hooks that allow users to manage external data structures and
> keep them in sync with Lucene's data during segment merges.
> API wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
> Of course these should be lower level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy to use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.




[jira] Commented: (LUCENE-2143) Understand why NRT performance is affected by flush frequency

2009-12-10 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788871#action_12788871
 ] 

Jake Mannix commented on LUCENE-2143:
-

Also, why are you only picking one indexing rate?  Should you not vary it 
from low all the way up to saturating the system with the maximum indexing 
rate, to properly stress the system?

> Understand why NRT performance is affected by flush frequency
> -
>
> Key: LUCENE-2143
> URL: https://issues.apache.org/jira/browse/LUCENE-2143
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> In LUCENE-2061 (perf tests for NRT), I test NRT performance by first
> getting a baseline QPS with only searching, using enough threads to
> saturate.
> Then, I pick an indexing rate (I used 100 docs/sec), and index docs at
> that rate, and I also reopen a NRT reader at different frequencies
> (10/sec, 1/sec, every 5 seconds, etc.), and then again test QPS
> (saturated).
> I think this is a good approach for testing NRT -- apps can see, as a
> function of "freshness" and at a fixed indexing rate, what the cost is
> to QPS.  You'd expect as index rate goes up, and freshness goes up,
> QPS will go down.
> But I found something very strange: the low frequency reopen rates
> often caused a highish hit to QPS.  When I forced IW to flush every
> 100 docs (= once per second), the performance was generally much
> better.
> I actually would've expected the reverse -- flushing in batch ought to
> use fewer resources.
> One theory is something odd about my test env (based on OpenSolaris),
> so I'd like to retest on a more mainstream OS.
> I'm opening this issue to get to the bottom of it...




[jira] Commented: (LUCENE-1377) Add HTMLStripReader and WordDelimiterFilter from SOLR

2009-12-08 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787647#action_12787647
 ] 

Jake Mannix commented on LUCENE-1377:
-

bq. A solr developer adds something to solr w/o putting it in lucene - only 
solr users benefit.
bq. A lucene developer adds something to lucene w/o adding support for that in 
solr - only lucene users benefit.

I agree with the first, but not the second: solr users benefit on the next 
release cycle of solr which includes the lucene changes.  Your statements 
here would only apply if there weren't a dependency of solr on lucene.

> Add HTMLStripReader and WordDelimiterFilter from SOLR
> -
>
> Key: LUCENE-1377
> URL: https://issues.apache.org/jira/browse/LUCENE-1377
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 2.3.2
>Reporter: Jason Rutherglen
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> SOLR has two classes HTMLStripReader and WordDelimiterFilter which are very 
> useful for a wide variety of use cases.  It would be good to place them into 
> core Lucene.




[jira] Commented: (LUCENE-1377) Add HTMLStripReader and WordDelimiterFilter from SOLR

2009-12-08 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787605#action_12787605
 ] 

Jake Mannix commented on LUCENE-1377:
-

bq. But we've been unable to make that happen in the past, so betting on it 
going forward seems like a mistake.

Sure, but keeping code which could be used by any Lucene project sequestered in 
Solr because you want to be able to modify it on a different time-schedule 
seems like bad open-source policy.

I guess what I'd change my statement to is that yes, it would be a downside to 
the Solr community (anytime you depend on another project, you also depend on 
its release schedule).  But I'd still hold that this fact alone should not keep 
code *out* of Lucene-core.  Anything which isn't Solr-specific should belong 
deeper in, to allow use by more people, otherwise only Solr users get the 
benefit.

> Add HTMLStripReader and WordDelimiterFilter from SOLR
> -
>
> Key: LUCENE-1377
> URL: https://issues.apache.org/jira/browse/LUCENE-1377
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 2.3.2
>Reporter: Jason Rutherglen
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> SOLR has two classes HTMLStripReader and WordDelimiterFilter which are very 
> useful for a wide variety of use cases.  It would be good to place them into 
> core Lucene.




[jira] Commented: (LUCENE-1377) Add HTMLStripReader and WordDelimiterFilter from SOLR

2009-12-08 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787595#action_12787595
 ] 

Jake Mannix commented on LUCENE-1377:
-

bq. There are some downsides for Solr: if these are in Lucene, we (solr) lose 
the ability to easily add new features or fix bugs... mainly because Solr 
doesn't use Lucene trunk anymore.

Who is this "we" of which you speak?  If someone using Solr needs something 
changed in Lucene, a patch submitted to Lucene JIRA should be no more hard to 
get committed than a patch submitted to the Solr JIRA, no?  The release cycle 
may be slower, but yes, as Earwin says, maybe this just gives another reason 
for more rapid minor version iteration in Lucene.

> Add HTMLStripReader and WordDelimiterFilter from SOLR
> -
>
> Key: LUCENE-1377
> URL: https://issues.apache.org/jira/browse/LUCENE-1377
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 2.3.2
>Reporter: Jason Rutherglen
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> SOLR has two classes HTMLStripReader and WordDelimiterFilter which are very 
> useful for a wide variety of use cases.  It would be good to place them into 
> core Lucene.




Re: Whither Query Norm?

2009-11-24 Thread Jake Mannix
Now that Otis reminded me that this thread existed (I've got a brain like a
sieve these days, I tell you)...

On Fri, Nov 20, 2009 at 10:08 AM, Grant Ingersoll wrote:

>
>   -1 from me, even though it's confusing, because having that call there
> (somewhere, at least) allows you to actually compare scores across
> queries if you do the extra work of properly normalizing the documents as
> well (at index time).
>
>
> Do you have some references on this?  I'm interested in reading more on the
> subject.  I've never quite been sold on how it is meaningful to compare
> scores and would like to read more opinions.
>

So I couldn't find any really good papers on this specifically, but I seem
to remember seeing this stuff done a lot in Manning and Schütze's IR book -
they go over training field boosts with logistic regression and all that,
but they don't specifically look at the Lucene case (although they consider
similar scoring functions).  They must talk about the necessity of
comparable scores to do this, I'm sure.

  -jake


Re: Whither Query Norm?

2009-11-24 Thread Jake Mannix
On Tue, Nov 24, 2009 at 9:31 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hello,
>
> Regarding that monstrous term->idf map.
> Is this something that one could use to adjust the scores in
> http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations scenario?
>   Say a map like that was created periodically for each shard and
> distributed to all other nodes (so in the end each node has all maps
> locally).  Couldn't the local scorer in the Solr instance (and in
> distributed Lucene setup) consult idfs for relevant terms in all those maps
> and adjust the scores of local scores before returning results?
>
>
Why would you want all nodes to have all maps?  Why not merge them into one
map, then redistribute that out to all nodes - one merged map would be far
smaller than many maps anyways?  Then yes, the scoring can be done locally
using this big idfMap to produce scores, instead of using reader.docFreq()
for idf; that's what I do.  But then what are you implying should be done?
Just rescale the top scores based on the idfs before returning your top
results?  You'd need to know exactly which terms hit those top-scoring
documents, right?  Which implies the cost of basically explain(), doesn't
it?
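
A minimal sketch of that merge, assuming each shard exports a
term->docFreq map (the particular idf formula here is an assumption):

    import java.util.*;

    public class IdfMerger {
      public static Map<String, Float> merge(
          List<Map<String, Integer>> shardDocFreqs, long totalDocs) {
        // sum per-shard docFreqs into global docFreqs
        Map<String, Long> df = new HashMap<String, Long>();
        for (Map<String, Integer> shard : shardDocFreqs) {
          for (Map.Entry<String, Integer> e : shard.entrySet()) {
            Long cur = df.get(e.getKey());
            df.put(e.getKey(), (cur == null ? 0L : cur) + e.getValue());
          }
        }
        // turn global docFreqs into the one global idf map
        Map<String, Float> idf = new HashMap<String, Float>();
        for (Map.Entry<String, Long> e : df.entrySet()) {
          idf.put(e.getKey(), (float)
              (1.0 + Math.log(totalDocs / (double) (e.getValue() + 1))));
        }
        return idf;
      }
    }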

Although with the per-field scoring (the thing I do to be able to train on
sub-query field match scores), this gets easier, because then you can try
to hang onto this information if the query isn't too big, but this isn't
something normal BooleanQueries will handle for you naturally.

  -jake


Re: Whither Query Norm?

2009-11-24 Thread Jake Mannix
On Tue, Nov 24, 2009 at 9:18 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> I'm late to the thread, and although it looks like the discussion is over,
> I'll inline a Q for Jake.
>


> You mentioned this about 3 times in this thread (contrib/queries wants
> you!)
>

Yeah, I've got to get some of the stuff I've written pulled out of all of
our own code - now that Lucene is on java 1.5, it'll be much easier for me
to contribute this stuff.  Soon!

  -jake


Re: Whither Query Norm?

2009-11-24 Thread Jake Mannix
On Tue, Nov 24, 2009 at 9:18 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> I'm late to the thread, and although it looks like the discussion is over,
> I'll inline a Q for Jake.
>
> >
> >References on how people do this *with Lucene*, or just how this is done
> in general?  There are lots of papers on fancy things which can be done, but
> I'm not sure where to point you to start out.  The technique I'm referring
> to is really just the simplest possible thing beyond setting your weights
> "by hand": let's assume you have a boolean OR query, Q, built up out of
> sub-queries q_i (hitting, for starters, different fields, although you can
> overlap as well with some more work), each with a set of weights (boosts)
> b_i, then if you have a training corpus (good matches, bad matches, or
> ranked lists of matches in order of relevance for the queries at hand),
> *and* scores (at the q_i level) which are comparable,
>
> You mentioned this about 3 times in this thread (contrib/queries wants
> you!)
> It sounds like you've done this before (with Lucene?).  But how, if the
> scores are not comparable, and that's required for this "field boost
> learning/training" to work?
>

Well that's the point, right?  You need to make the scores comparable,
somehow.  The most general thing you can do is figure out what the maximum
possible score for a query is (if there is a maximum, which for most scoring
systems there will be, given strictly positive doc norms) and normalize with
respect to that.  For Lucene, the simplest possible way to do this (I
think?) is to swap in a true cosine (or something like Tanimoto) similarity
instead of the doc-length normalized one (which may require externalizing
the IDF).

When the score for a tf-idf weighted document and a boost-idf weighted query
(with both normalized on the same scale) is exactly just the cosine of
the angle between them, then scores become fairly comparable - they're all
on a 0 to 1 scale.  Now, longer fields/documents are still going to score
way lower than shorter documents, for typical user-generated queries, but at
least now "way lower" has more meaning than before (because it's "way lower
*relative to 1.0*").

Frankly, I've done this kind of logistic regression training of weights
even when you use raw lucene scoring, and while it doesn't work completely
(because scores are so incomparable), it's remarkable how well it ends up
working (I guess in comparison to randomly setting your boosts by hand and
simple A/B tests...)

  -jake



> Thanks,
> Otis
>
> > then you can do a simple regression (linear or logistic, depending on
> whether you map your final scores to a logit or not) on the b_i to fit for
> the best boosts to use.  What is critical here is that scores from different
> queries are comparable.  If they're not, then queries where the best
> document for a query scores 2.0 overly affect the training in comparison to
> the queries where the best possible score is 0.5 (actually, wait, it's the
> reverse: you're training to increase scores of matching documents, so the
> system tries to make that 0.5 scoring document score much higher by raising
> boosts higher and higher, while the good matches already scoring 2.0 don't
> need any more boosting, if that makes sense).
> >
> >There are of course far more complex "state of the art" training
> techniques, but probably someone like Ted would be able to give a better
> list of references on where is easiest to read those from.  But I can try to
> dredge up some places where I've read about doing this, and post again later
> if I can find any.
> >
> >  -jake
> >
>


Re: Whither Query Norm?

2009-11-20 Thread Jake Mannix
On Fri, Nov 20, 2009 at 5:02 PM, Mark Miller  wrote:

> Go back and put it in after you have all the documents for that commit
> point. Or on reader load, calculate it.
>

Ah!  Now I see what you mean by "expensive". :)  Basically run through all
the documents you've indexed, all over again, fixing their norms.  That's
pretty expensive.  Not as bad as doing it at reader load, but still pretty
harsh - though yes, not impossible!


 -jake


Re: Whither Query Norm?

2009-11-20 Thread Jake Mannix
Back to Grant's original question, for a second...

On Fri, Nov 20, 2009 at 1:59 PM, Grant Ingersoll wrote:


> This makes sense from a mathematical sense, assuming scores are comparable.
>  What I would like to get at is why anyone thinks scores are comparable
> across queries to begin with.  I agree it is beneficial in some cases (as
> you described) if they are.   Probably a question suited for an academic IR
> list...
>

Well, without getting into the academic IR which I'm not really qualified to
argue about, what is wrong with comparing two queries by saying that a
document which "perfectly" matches a query should score 1.0, and scale with
respect to that?

Maybe it's a better question to turn it around: can you give examples of two
queries where you can see that it *doesn't* make sense to compare scores?
Let's imagine we're doing pure, properly normalized tf-idf cosine scoring
(not default Lucene scoring) on a couple of different fields at once.  Then
whenever a sub-query is exactly equal to the field it's hitting (or else the
field is the repetition of that query some multiple number of times), the
score for that sub-query will be 1.0.  When the match isn't perfect, the
score will go down, ok.  Sub-queries hitting longer fields (which aren't
just pathologically made up of just repetitions of a smaller set of terms)
will in general have even the best scores be very low compared to the best
scores on the small fields (this is true for Lucene as well, of course), but
this makes sense: if you query with a very small set of terms (as is usually
done, unless you're doing a MoreLikeThis kind of query), and you find a
match in the "title" field which is exactly what you were looking for, that
field match is far and away better than anything else you could get in a
body match.

To put it more simply - if you do really have cosine similarity (or
Jaccard/Tanimoto or something like that, if you don't care about idf for
some reason), then queries scores are normalized relative to "how close did
I find documents to *perfectly* matching my query" - 1.0 means you found
your query in the corpus, and less than that means some fractional
proximity.  This is an absolute measure, across queries.

Of course, then you ask, well, in reality, in web and enterprise search,
documents are big, queries are small, you never really find documents which
are perfect matches, so if the best match for q1, out of your whole corpus,
is 0.1 for doc1, and the best match for q2 is 0.25 for doc2, is it really
true that the best match for the second query is "better" than the best
match for the first query?  I've typically tried to remain agnostic on that
front, and instead ask the related question: if the user (or really, a
sampling of many users) queried for (q1 OR q2) and assuming for simplicity
that q1 didn't match any of the good hits for q2, and vice-versa, then does
the user (ie. your gold-standard training set) say that the best result is
doc1, or doc2?  If it's doc1, then you'd better have found a way to boost
q1's score contribution higher than q2's, right?  Is this wrong?  (in the
theoretical sense?)

  -jake


Re: Whither Query Norm?

2009-11-20 Thread Jake Mannix
On Fri, Nov 20, 2009 at 4:51 PM, Mark Miller  wrote:

> Okay - my fault - I'm not really talking in terms of Lucene. Though even
> there I consider it possible. You'd just have to like, rewrite it :) And
> it would likely be pretty slow.
>

Rewrite it how?  When you index the very first document, the docFreq of all
terms is 1, out of numDocs = 1 docs in the corpus.  Everybody's idf is the
same.
No matter how you normalize this, it'll be wrong once you've indexed a
million documents.  This isn't a matter of Lucene architecture, it's a
matter of idf being a value that is only exactly available at query time
(you can approximate it partway through indexing, but you don't know it at
all when you start).

  -jake


Re: Whither Query Norm?

2009-11-20 Thread Jake Mannix
On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller  wrote:

> Mark Miller wrote:
> Okay - I guess that somewhat makes sense - you can calculate the
> magnitude of the doc vectors at index time. How is that impossible with
> incremental indexing though? Isn't it just expensive? Seems somewhat
> expensive in the non incremental case as well - you're just eating it at
> index time rather than query time - though the same could be done for
> incremental? The information is all there in either case.
>
>
Ok, I think I see what you were imagining I was doing: you take the current
state of the index as gospel for idf (when the index is already large, this
is a good approximation), and look up these factors at index time - this
means grabbing docFreq(Term) for each term in my document, and yes,
this would be very expensive, I'd imagine.  I've done it by pulling a
monstrous (the most common 1-million terms, say) Map<String, Float>
(effectively) outside of lucene entirely, which gives term idfs, and housing
this in memory so that computing field norms for cosine is a very fast
operation at index time.

Doing it like this is hard from scratch, but is fine incrementally, because
I've basically fixed idf using some previous corpus (and update the idfMap
every once in a while, in cases where it doesn't change much).  This has
the effect of also providing a global notion of idf in a distributed corpus.

  -jake




Re: Whither Query Norm?

2009-11-20 Thread Jake Mannix
On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller  wrote:

> Mark Miller wrote:
> >
> > it looks expensive to me to do both
> > of them properly.
> Okay - I guess that somewhat makes sense - you can calculate the
> magnitude of the doc vectors at index time. How is that impossible with
> incremental indexing though? Isn't it just expensive? Seems somewhat
> expensive in the non incremental case as well - you're just eating it at
> index time rather than query time - though the same could be done for
> incremental? The information is all there in either case.
>

The expense, if you have the idfs of all terms in the vocabulary (keep them
in the form of idf^2 for efficiency at index time), is pretty trivial, isn't
it?  If
you have a document with 1000 terms, it's maybe 3000 floating point
operations, all CPU actions, in memory, no disk seeks.

What it does require is knowing, even when you have no documents yet
on disk, what the idf of terms in the first few documents is.  How do you
know this, in Lucene, if you haven't externalized some notion of idf?

  -jake




Re: Whither Query Norm?

2009-11-20 Thread Jake Mannix
On Fri, Nov 20, 2009 at 4:09 PM, Mark Miller  wrote:


> But cosine has two norms? The query norm and the document norm - taking
> the two vectors to the unit space - it looks expensive to me to do both
> of them properly. The IR lit I've seen fudges them down to Root(L) even
> in the non incremental case for the document vector normalization.
>

Lucene has two norms too: lengthNorm (of the field), produced at index time,
and queryNorm.  The expense of queryNorm is incurred once per query, not
once per document, and so as you yourself say, is not a performance
problem.  The lengthNorm is factored into the document at index time, and
so is not a hit at *query time* at all.


> > (the norm Lucene does
> > use on the query side) of fields - 1/sqrt(sum_{i}(tf_(term_i)^2 *
> > idf(term_i)^2)) -
> > is impossible to calculate.
> But isn't that *not* the query side norm? Looks like the document vector
> norm to me.
>

This is exactly the query norm - 1/sqrt(sumOfSquaredWeights), and
sum of squared weights is just sum of tf^2 * idf^2, except queries have
"boost" in place of "tf".  The document vector norm that *Lucene* uses is
independent of idf of the terms in the document (it's 1/sqrt(doc_length)
for DefaultSimilarity).
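
For reference, a sketch of what the default queryNorm boils down to - this
runs once per query, when the Weight is set up:

    public class QueryNormSketch {
      // sumOfSquaredWeights = sum over query terms of (boost_i * idf_i)^2
      public static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
      }
    }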


> Where are you doing the correct document vector normalization to get the
> true cosine? I'm not following. It looks like the above formula is the
> doc vector norm - so what about the query vector norm then?
>

Ok, the technique I use to do this (I should really just refactor it to a
point where I can post it for contrib/queries so that other people can use
it, if it's not that commonly known) is to, at index time, calculate the
true document vector normalizations for the fields, and then feed them into
their boosts, and make sure that Similarity does *not* do field
normalization itself.

The formula for query vector normalization and document vector normalization
is *exactly* the same if you translate boosts of terms into termFrequency of
said term in the query "document".  It's just applied to a different bag of
words.


> >   a) angles,
> >   b) numbers which are less than one, and
> >   c) when they're equal to 1, you have a perfect match.
>


> When they think of cosine yes - but IR lit seems to fudge this - none of
> the common document length normalizations are real cosine - they are
> just square root variations - with Lucene's being the most simple in the
> family.
>

This is true, which is why I'm not really giving a better wording than what
Doron already wrote - I doubt I could do better.  I'm concerned myself
because I've seen lots of questions over the years on the lists from people
asking about how to compare scores across queries, no matter how much
we tell them not to.  And now here again we're telling them "they're not
comparable, but they're more comparable than if we didn't have this
queryNorm in there!".  What exactly does that mean?  I do search, and
I'm not really sure what it means - what is someone who is just learning
this stuff going to think?


>
> >
> > I think it's way better than it used to be (although it's even more
> > dense, which
> > I guess is required, given that it's a dense subject), but I really
> > don't think
> > we want to let people think it's a cosine - it's "like" a cosine, but
> > with a
> > rather different normalization.
>


> We can amp that - but like I said, I think that's standard practice -
> in the IR books I've seen, they say, here is the cosine formula for term
> vector scoring, and then they go on to fudge in a manner very similar to
> how Lucene does it.
>

Well just because it's standard practice, doesn't mean it's a good idea.
I came to search from a math background, not CS, so when I see "vector
space" and "cosine", I maybe think different things than the average coder
learning IR.  So maybe I'm not the right person to ask how such things
should be explained!


> > Doron's got it stated like that, already, but
> > it needs to be crystal clear that scores are sometimes higher than 1,
> > and in
> > fact the "best" score for one query may be rather different than the
> > "best"
> > for another query.
>


> I think this is general knowledge with Lucene, but that may be a stretch
> - perhaps it's not as widely known as I now assume for new users. Hits used
> to normalize to a 0-1 score actually, but its no longer around.
>

It was definitely a good idea to get rid of normalizing by the top score,
but what could replace it is normalizing by the "best possible score".
This tells you how close to a perfect match you could have gotten.  For
cosine, this would just be the cosine of the angle from the document vector
to the query vector; for Lucene, it would be something similar, but would
have absolute meaning (maybe... maybe not, because the theoretical best
score is likely to be far from the actual best score in a given corpus).

  -jake


Re: Whither Query Norm?

2009-11-20 Thread Jake Mannix
On Fri, Nov 20, 2009 at 2:50 PM, Mark Miller  wrote:

> Jake Mannix wrote:
> > Remember: we're not really doing cosine at all here.
> This, I think, is fuzzy right? It seems to be common to still call this
> cosine scoring loosely - pretty much every practical impl fudges things
> somewhat when doing the normalization (though we are on the heavy side
> of fudgers) - I think its pretty rare to do the true cosine because its
> so expensive. It can be somewhat misleading though.
>

Cosine isn't expensive - it's just *impossible* to do when you're doing
incremental indexing: to actually normalize for cosine similarity requires
knowing the idf at *index time*, so the cosine norm (the norm Lucene does
use on the query side) of fields -
1/sqrt(sum_{i}(tf_(term_i)^2 * idf(term_i)^2)) -
is impossible to calculate.

Incremental indexing has its costs!

In cases where you *do* have either the exact idf values for your corpus's
term-vocabulary or a good way to approximate them (since idf only varies
logarithmically, this isn't too hard), then if you override lengthNorm to
always return 1, and then take your document's Fields and boost() them by
the factor I listed above, you can get exact (well, as exact as you can
with only one-byte field-norms :\ [are we really that starved for RAM that
we have to have that little precision?]) cosine similarity, at really no
additional *query-time* performance hit.  Just a little math done at
indexing time.
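
Here's a minimal sketch of that trick, assuming Lucene 2.9-era APIs and an
externally precomputed term->idf map (the map, and these class names, are
the assumptions):

    import java.util.Map;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.search.DefaultSimilarity;

    public class CosineNorms {
      /** Disable Lucene's 1/sqrt(numTokens) length norm entirely. */
      public static class NoLengthNormSimilarity extends DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTokens) {
          return 1.0f;
        }
      }

      /** Fold 1/sqrt(sum_i tf_i^2 * idf_i^2) into the field boost instead. */
      public static void setCosineBoost(Field field,
                                        Map<String, Integer> termFreqs,
                                        Map<String, Float> idfMap) {
        double sumSq = 0.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
          Float idf = idfMap.get(e.getKey());
          double w = e.getValue() * (idf == null ? 1.0 : idf);
          sumSq += w * w;
        }
        field.setBoost(sumSq == 0.0 ? 1.0f : (float) (1.0 / Math.sqrt(sumSq)));
      }
    }

(Still subject to the one-byte norm encoding, of course.)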

> Have you looked at the Similarity scoring explanation page that was
> recently improved? Have any suggestions on changes to it? Doron put a
> fair amount of work into improving it recently, but I think it could
> always get better. Its currently leaning towards presenting this as
> cosine - that seems in line with the few text books I've seen, but I'm
> admittedly not that deep into any of this.
>

Yeah, that page is a lot better than it used to be, but I'd really try not
to
call this cosine - when people think of cosines, they think of:

  a) angles,
  b) numbers which are less than one, and
  c) when they're equal to 1, you have a perfect match.

We have a) - kinda.  We do not have b) or c), and we don't have scores
which are really comparable, and maybe Grant's question should be
answered with the statement: they shouldn't be:

If I query the google with "web" (getting a little less than three billion
hits,
so it maybe has an idf of 3 or 4), and find a document with 10 terms, one
of which is "web" (and the others are equally common in terms of idf),
then lucene says we should score that as:

   idf(web) / sqrt(10) =~ 0.9,

while cosine would say:

  idf(web) / sqrt(1^2 * idf(web)^2 + tf_common_term2^2 * idf(term2)^2 + ...)
  =~ 1/sqrt(10) = 0.31

If I instead query with "floccinaucinilipilification" (which comes up with
about 45K hits, so maybe has an idf of 20), and find a document with
10 terms, one of which is floccinaucinilipilification, and the other terms
are also similarly rare (also idf 20), lucene says the score is

  idf(floccinaucinilipilification) / sqrt(10) =~ 6.5,

while cosine would say:

  idf(floccinaucinilipilification)
    / sqrt(1^2 * idf(flocci...)^2 + tf_rareTerm2^2 * idf(rareTerm2)^2 + ...)
  =~ 1/sqrt(10) = 0.31

What is nice about *real* cosine is that it captures a sense of absolute
scoring - the hit on the really rare word in a document which does not
have a lot of other rare words (and is in general pretty short) is indeed
measured by the fact that the score is close to 1.0 (perfect match),
whereas the Lucene default scoring does not give a good measure -
web scores close to 1.0 on its hit, and you only notice that this isn't as
good as the rare word hit because that one scores way *higher* than 1.0.

On the other hand, the thing which Lucene scoring does which cosine
does *not* is measure the fact that hitting the rare word is *way* better
than hit on a common word, from a user perspective, so giving them
equal scores because their cosines are the same is maybe not the
right thing to do (and so "yay Doug!", heh).

> Have you looked at the Similarity scoring explanation page that was
> recently improved? Have any suggestions on changes to it? Doron put a
> fair amount of work into improving it recently, but I think it could
> always get better. Its currently leaning towards presenting this as
> cosine - that seems in line with the few text books I've seen, but I'm
> admittedly not that deep into any of this.


I think it's way better than it used to be (although it's even more dense,
which I guess is required, given that it's a dense subject), but I really
don't think we want to let people think it's a cosine - it's "like" a
cosine, but with a rather different normalization.  Doron's got it stated
like that, already, but it needs to be crystal clear that scores are
sometimes higher than 1, and in fact the "best" score for one query may be
rather different than the "best" for another query.

  -jake


Re: Whither Query Norm?

2009-11-20 Thread Jake Mannix
Remember: we're not really doing cosine at all here.  The factor of IDF^2 on
the top and the factor of 1/sqrt(numTermsInDocument) on the bottom couple
together to end up with the following effect:

 q1 = "TERM1"
 q2 = "TERM2"

doc1 = "TERM1"
doc2 = "TERM2"

score(q1, doc1) = idf(TERM1)
score(q2, doc2) = idf(TERM2)

Both are perfect matches, but one scores higher (possibly much higher) than
the other.

Boosts work just fine with cosine (it's just a way of putting "tf" into the
query side as well as the document side), but normalizing documents without
taking the idf of terms in the document into consideration blows away the
ability to compare scores in default Lucene scoring, even *with* the
queryNorm() factored in.

I know you probably know this Mark, but it's important to make sure we're
stating, for the sake of people reading this who haven't worked through the
scoring in detail, that in Lucene as it is currently structured, scores can
be *wildly* different between queries, even with queryNorm() factored in.

  -jake


On Fri, Nov 20, 2009 at 2:24 PM, Mark Miller  wrote:

> Grant Ingersoll wrote:
> >
> >  What I would like to get at is why anyone thinks scores are
> > comparable across queries to begin with.
> >
> They are somewhat comparable because we are using the approximate cosine
> between the document/query vectors for the score - plus boosts n stuff.
> How close the vectors are to each other. If q1 has a smaller angle diff
> with d1 than q2 does with d2, then you can do a comparison. Its just
> vector similarities. Its approximate because we fudge the normalization.
> Why do you think the scores within a query search are comparable? Whats
> the difference when you try another query? The query is the difference,
> and the query norm is what makes it more comparable. Its just a
> different query vector with another query. Its still going to just be a
> given "angle" from the doc vectors. Closer is considered a better match.
> We don't do it to improve anything, or because someone discovered
> something - its just part of the formula for calculating the cosine. Its
> the dot product formula. You can lose it and keep the same relative
> rankings, but then you are further from the cosine for the score - you
> start scaling by the magnitude of the query vector. When you do that
> they are not so comparable.
>
> If you take out the queryNorm, its much less comparable. You are
> effectively multiplying the cosine by the magnitude of the query vector
> - so different queries will scale the score differently - and not in a
> helpful way - a term vector and query vector can have very different
> magnitudes, but very similar term distributions. Thats why we are using
> the cosine rather than euclidean distance in the first place. Pretty
> sure its more linear algebra than IR - or the vector stuff from calc 3
> (or wherever else different schools put it).
>


Re: Whither Query Norm?

2009-11-20 Thread Jake Mannix
On Fri, Nov 20, 2009 at 10:08 AM, Grant Ingersoll wrote:

> I should add in my $0.02 on whether to just get rid of queryNorm()
> altogether:
>
>   -1 from me, even though it's confusing, because having that call there
> (somewhere, at least) allows you to actually compare scores across
> queries if you do the extra work of properly normalizing the documents as
> well (at index time).
>
>
> Do you have some references on this?  I'm interested in reading more on the
> subject.  I've never quite been sold on how it is meaningful to compare
> scores and would like to read more opinions.
>

References on how people do this *with Lucene*, or just how this is done in
general?  There are lots of papers on fancy things which can be done, but
I'm not sure where to point you to start out.  The technique I'm referring
to is really just the simplest possible thing beyond setting your weights
"by hand": let's assume you have a boolean OR query, Q, built up out of
sub-queries q_i (hitting, for starters, different fields, although you can
overlap as well with some more work), each with a set of weights (boosts)
b_i, then if you have a training corpus (good matches, bad matches, or
ranked lists of matches in order of relevance for the queries at hand),
*and* scores (at the q_i level) which are comparable, then you can do a
simple regression (linear or logistic, depending on whether you map your
final scores to a logit or not) on the b_i to fit for the best boosts to
use.  What is critical here is that scores from different queries are
comparable.  If they're not, then queries where the best document for a
query scores 2.0 overly affect the training in comparison to the queries
where the best possible score is 0.5 (actually, wait, it's the reverse:
you're training to increase scores of matching documents, so the system
tries to make that 0.5 scoring document score much higher by raising boosts
higher and higher, while the good matches already scoring 2.0 don't need any
more boosting, if that makes sense).
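
As a concrete sketch of the fit itself - plain logistic regression via
gradient ascent, with made-up training-data shapes (and assuming, as
argued, that the per-subquery scores are already comparable):

    public class BoostTrainer {
      /**
       * scores[n][i] = score of subquery q_i on training example n
       * labels[n]    = 1 for a good match, 0 for a bad one
       * returns the learned boosts b_i
       */
      public static double[] train(double[][] scores, int[] labels,
                                   double learningRate, int epochs) {
        double[] b = new double[scores[0].length];
        for (int epoch = 0; epoch < epochs; epoch++) {
          for (int n = 0; n < scores.length; n++) {
            double dot = 0.0;
            for (int i = 0; i < b.length; i++) dot += b[i] * scores[n][i];
            double predicted = 1.0 / (1.0 + Math.exp(-dot)); // logistic map
            double error = labels[n] - predicted;
            for (int i = 0; i < b.length; i++) {
              b[i] += learningRate * error * scores[n][i]; // gradient step
            }
          }
        }
        return b;
      }
    }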

There are of course far more complex "state of the art" training techniques,
but probably someone like Ted would be able to give a better list of
references on where is easiest to read those from.  But I can try to dredge
up some places where I've read about doing this, and post again later if I
can find any.

  -jake


Re: Whither Query Norm?

2009-11-20 Thread Jake Mannix
I should add in my $0.02 on whether to just get rid of queryNorm()
altogether:

  -1 from me, even though it's confusing, because having that call there
(somewhere, at least) allows you to actually compare scores across
queries if you do the extra work of properly normalizing the documents as
well (at index time).  And for people who actually do machine-learning
training of their per-field query boosts, this is pretty critical.

  -jake

On Fri, Nov 20, 2009 at 8:15 AM, Jake Mannix  wrote:

> The fact that Lucene Similarity is most decidedly *not* cosine similarity,
> but strongly resembles it with the queryNorm() in there, means that I
> personally would certainly like to see this called out, at least in the
> documentation.
>
> As for performance, is the queryNorm() called ever in any loops?  It's all
> set up in the construction of the Weight, right?  Which means that by the
> time you're doing scoring, all the weighting factors are already factored
> into one?  What's the performance issue which would be saved here?
>
>   -jake
>
>
> On Fri, Nov 20, 2009 at 7:56 AM, Grant Ingersoll wrote:
>
>> For a long time now, we've been telling people not to compare scores
>> across queries, yet we maintain the queryNorm() code as an attempt to do
>> this and the javadocs even promote it.  I'm in the process of researching
>> this some more (references welcomed), but wanted to hear what people think
>> about it here.  I haven't profiled it just yet, but it seems like a good
>> chunk of wasted computation to me (loops, divisions and square roots).  At a
>> minimum, I think we might be able to refactor the callback mechanism for it
>> just as we did for the collectors, such that we push off the actual
>> calculation of the sum of squares into Similarity, instead of just doing
>> 1/sqrt(sumSqs).  That way, when people want to override queryNorm() to
>> return 1, they are saving more than just the 1/sqrt calculation.  I haven't
>> tested it yet, but wanted to find out what others think.
>>
>> Thoughts?
>>
>> -Grant
>


Re: Whither Query Norm?

2009-11-20 Thread Jake Mannix
The fact that Lucene Similarity is most decidedly *not* cosine similarity,
but strongly resembles it with the queryNorm() in there, means that I
personally would certainly like to see this called out, at least in the
documentation.

As for performance, is queryNorm() ever called in any loops?  It's all
set up in the construction of the Weight, right?  Which means that by the
time you're doing scoring, all the weighting factors are already factored
into one?  What's the performance issue which would be saved here?

  -jake

On Fri, Nov 20, 2009 at 7:56 AM, Grant Ingersoll wrote:

> For a long time now, we've been telling people not to compare scores across
> queries, yet we maintain the queryNorm() code as an attempt to do this and
> the javadocs even promote it.  I'm in the process of researching this some
> more (references welcomed), but wanted to hear what people think about it
> here.  I haven't profiled it just yet, but it seems like a good chunk of
> wasted computation to me (loops, divisions and square roots).  At a minimum,
> I think we might be able to refactor the callback mechanism for it just as
> we did for the collectors, such that we push off the actual calculation of
> the sum of squares into Similarity, instead of just doing 1/sqrt(sumSqs).
>  That way, when people want to override queryNorm() to return 1, they are
> saving more than just the 1/sqrt calculation.  I haven't tested it yet, but
> wanted to find out what others think.
>
> Thoughts?
>
> -Grant
>
>


Re: Why release 3.0?

2009-11-16 Thread Jake Mannix
Yeah, sorry, I just meant that 3.0 can read 2.9 index format, but 3.1 will
not necessarily have that capability (this is the whole point of the
difference between 2.9 and 3.0, in my understanding).

On Mon, Nov 16, 2009 at 11:05 AM, Uwe Schindler  wrote:

>  2.9 does **not** have the same format as 3.0: an index created with 3.0 cannot
> be read with 2.9. This is because compressed field support was removed and
> therefore the version number of the stored fields file was upgraded. But
> indexes from 2.9 can be read with 3.0, and that support may get removed in 4.0.
> 3.0 indexes can be read until version 4.9.
>
>
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>   --
>
> *From:* Jake Mannix [mailto:jake.man...@gmail.com]
> *Sent:* Monday, November 16, 2009 7:15 PM
>
> *To:* java-dev@lucene.apache.org
> *Subject:* Re: Why release 3.0?
>
>
>
> Don't users need to upgrade to 3.0 because 3.1 won't necessarily be able to
> read your
> 2.4 index file formats?  I suppose if you've already upgraded to 2.9, then
> all is well because
> 2.9 is the same format as 3.0, but we can't assume all users upgraded from
> 2.4 to 2.9.
>
> If you've done that already, then 3.0 might not be necessary, but if you're
> on 2.4 right now,
> you will be in for a bad surprise if you try to upgrade to 3.1.
>
>   -jake
>
> On Mon, Nov 16, 2009 at 10:10 AM, Erick Erickson 
> wrote:
>
> One of my "specialties" is asking obvious questions just to see if
> everyone's assumptions
>
> are aligned. So with the discussion about branching 3.0 I have to ask "Is
> there going to
>
> be any 3.0 release intended for *production*?". And if not, would we save a
> lot of work
>
> by just not worrying about retrofitting fixes to a 3.0 branch and carrying
> on with 3.1
>
> as the first *supported* 3.x release?
>
>
>
> Since 3.0 is "upgrade-to-java5 and remove deprecations", I'm not sure *as a
> user* I see a
>
> good reason to upgrade to 3.0. Getting a "beta/snapshot" release to get a
> head start on
>
> cleaning up my code does seem worthwhile, if I have the spare time. And
> having a base
>
> 3.0 version that's not changing all over the place would be useful for
> that.
>
>
>
> That said, I'm also not terribly comfortable with a "release" that's out
> there and unsupported.
>
>
>
> Apologies if this has already been discussed, but I don't remember it.
> Although my memory
>
> isn't what it used to be (but some would claim it never was)...
>
>
>
> Erick
>
>
>
>
>
>
>


Re: Why release 3.0?

2009-11-16 Thread Jake Mannix
Don't users need to upgrade to 3.0 because 3.1 won't necessarily be able to
read your
2.4 index file formats?  I suppose if you've already upgraded to 2.9, then
all is well because
2.9 is the same format as 3.0, but we can't assume all users upgraded from
2.4 to 2.9.

If you've done that already, then 3.0 might not be necessary, but if you're
on 2.4 right now,
you will be in for a bad surprise if you try to upgrade to 3.1.

  -jake

On Mon, Nov 16, 2009 at 10:10 AM, Erick Erickson wrote:

> One of my "specialties" is asking obvious questions just to see if
> everyone's assumptions
> are aligned. So with the discussion about branching 3.0 I have to ask "Is
> there going to
> be any 3.0 release intended for *production*?". And if not, would we save a
> lot of work
> by just not worrying about retrofitting fixes to a 3.0 branch and carrying
> on with 3.1
> as the first *supported* 3.x release?
>
> Since 3.0 is "upgrade-to-java5 and remove deprecations", I'm not sure *as a
> user* I see a
> good reason to upgrade to 3.0. Getting a "beta/snapshot" release to get a
> head start on
> cleaning up my code does seem worthwhile, if I have the spare time. And
> having a base
> 3.0 version that's not changing all over the place would be useful for
> that.
>
> That said, I'm also not terribly comfortable with a "release" that's out
> there and unsupported.
>
> Apologies if this has already been discussed, but I don't remember it.
> Although my memory
> isn't what it used to be (but some would claim it never was)...
>
> Erick
>
>
>


[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

2009-11-12 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777068#action_12777068
 ] 

Jake Mannix commented on LUCENE-1526:
-

bq. OK. It's clear Zoie's design is optimized for insanely fast reopen.

That, and maxing out QPS and indexing rate while keeping query latency 
degradation to a minimum.  From trying to turn off the extra deleted check, the 
latency overhead on a 5M doc index is a difference of queries taking 12-13ms 
with the extra check turned on, and 10ms without it, and you only really start 
to notice on the extreme edges (the queries hitting all 5million docs by way of 
an actual query (not MatchAllDocs)), when your performance goes from maybe 
100ms to 140-150ms.  

bq. EG what I'd love to see is, as a function of reopen rate, the "curve" of 
QPS vs docs per sec. Ie, if you reopen 1X per second, that consumes some of 
your machine's resources. What's left can be spent indexing or searching or 
both, so, it's a curve/line. So we should set up fixed rate indexing, and then 
redline the QPS to see what's possible, and do this for multiple indexing 
rates, and for multiple reopen rates.

Yes, that curve would be a very useful benchmark.  Now that I think of it, it 
wouldn't be too hard to just sneak some reader caching into the ZoieSystem with 
a tunable parameter for how long you hang onto it, so that we could see how 
much that can help.  One of the nice things we could do in Zoie with this 
kind of index-latency backoff is that, because we have an in-memory two-way 
mapping of zoie-specific UID to docId, we'd actually have time (in the 
background, since we're caching these readers now) to zip through and update 
the real delete BitVectors on the segments, and lose the extra check at query 
time - only keeping that check when the index-latency time is set below some 
threshold (determined by how long it takes the system to do this resolution: 
mapping docId to UID is an array lookup, the reverse is a little slower).
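
For reference, the shape of that two-way mapping (a hypothetical sketch, not 
Zoie's actual classes):

{code}
// Hypothetical sketch of a UID <-> docId two-way mapping: docId -> UID is
// an array read; UID -> docId is a hash probe (hence "a little slower").
public class UidDocIdMapper {
  private final long[] uidByDocId;                        // index == docId
  private final java.util.Map<Long, Integer> docIdByUid;  // reverse map

  public UidDocIdMapper(long[] uidByDocId) {
    this.uidByDocId = uidByDocId;
    this.docIdByUid = new java.util.HashMap<Long, Integer>();
    for (int doc = 0; doc < uidByDocId.length; doc++) {
      docIdByUid.put(uidByDocId[doc], doc);
    }
  }

  public long uid(int docId) { return uidByDocId[docId]; }      // array lookup
  public Integer docId(long uid) { return docIdByUid.get(uid); } // hash probe
}
{code}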

bq. Right, Zoie is making determined tradeoffs. I would expect that most apps 
are fine with controlled reopen frequency, ie, they would choose to not lose 
indexing and searching performance if it means they can "only" reopen, eg, 2X 
per second.

In theory Zoie is making tradeoffs - in practice, at least against what is on 
trunk, Zoie's just going way faster in both indexing and querying in the 
redline perf test.  I agree that in principle, once LUCENE-1313 and other 
improvements and bugs have been worked out of NRT, that query performance 
should be faster, and if zoie's default BalancedMergePolicy (nee 
ZoieMergePolicy) is in use for NRT, the indexing performance should be faster 
too - it's just not quite there yet at this point.

bq. I agree - having such well defined API semantics ("once updateDoc returns, 
searches can see it") is wonderful. But I think they can be cleanly built on 
top of Lucene NRT as it is today, with a pre-determined (reopen) latency.

Of course!  These api semantics are already built up on top of plain-old Lucene 
- even without NRT, so I can't imagine how NRT would *remove* this ability! :)

bq. I think the "large merge just finished" case is the most costly for such 
apps (which the "merged segment warmer" on IW should take care of)? (Because 
otherwise the segments are tiny, assuming everything is cutover to "per 
segment").

Definitely.  One thing that Zoie benefited from, from an API standpoint, which 
might be nice in Lucene, now that 1.5 is in place, is that the 
IndexReaderWarmer could replace the raw SegmentReader with a warmed 
user-specified subclass of SegmentReader:

{code}
public abstract class IndexReaderWarmer<T extends SegmentReader> {
  public abstract T warm(IndexReader rawReader);
}
{code}

Which could replace the reader in the readerPool with the 
possibly-user-overridden subclass of SegmentReader (now that SegmentReader is 
as public as IndexReader itself is) which has now been warmed.  For users who 
like to decorate their readers to keep additional state, instead of use them as 
keys into WeakHashMaps kept separate, this could be extremely useful (I know 
that the people I talked to at Apple's iTunes store do this, as well as in 
bobo, and zoie, to name a few examples off the top of my head).
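
To illustrate, usage of such a warmer might look like the following (names 
here are made up - there is no FacetSegmentReader):

{code}
// Hypothetical usage: decorate each newly merged segment as it is warmed,
// instead of tracking decoration state in an external WeakHashMap.
public class FacetWarmer extends IndexReaderWarmer<FacetSegmentReader> {
  @Override
  public FacetSegmentReader warm(IndexReader rawReader) {
    FacetSegmentReader warmed = new FacetSegmentReader(rawReader);
    warmed.loadFacetData();  // pre-compute per-segment state up front
    return warmed;
  }
}
{code}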

bq.  I think Lucene could handle this well, if we made an IndexReader impl that 
directly searches DocumentWriter's RAM buffer. But that's somewhat challenging

Jason mentioned this approach in his talk at ApacheCon, but I'm not at all 
convinced it's necessary - if a single box can handle indexing a couple hundred 
smallish documents a second (into a RAMDirectory), and could be sped up by 
using multiple IndexWriters (writing into multiple RAMDirectories in parallel, 
if you w

[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

2009-11-11 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776568#action_12776568
 ] 

Jake Mannix commented on LUCENE-1526:
-

bq. These sound serious - if you can provide any details, that'd help.  I'll do 
some stress testing too. Thanks for testing and reporting 

Out of these, the specific issue of incorrectness of applied deletes is easiest 
to see - we saw it by indexing up to a million docs, then keep adding docs but 
only after doing a delete on the UID where UID instead of increasing, is looped 
around mod 1million.  Calling numDocs (not maxDoc) on the reader with Zoie 
always returns 1M after looping around, but with NRT, it starts slowly growing 
above 1M.

The CPU and memory is undoubtedly due to the constant reopening of these 
readers, and yes, could be alleviated by not doing this - we're just comparing 
to the zoie case, where we *do* reopen (the RAMDir) on every request (and copy 
the delSet) if there have been modifications since the last update.

bq. Lucene NRT makes a clone of the BitVector for every reader that has new 
deletions. Once this is done, searching is "normal" - it's as if the reader 
were a disk reader. There's no extra checking of deleted docs (unlike Zoie), no 
OR'ing of 2 BitVectors, etc.

Ok, so if this is copy-on-write, it's done every time there is a new delete for 
that segment?  If the disk index is optimized that means it would happen on 
every update, a clone of the full numDocs sized BitVector?  I'm still a little 
unsure of how this happens. 

* somebody calls getReader() - they've got all the SegmentReaders for the disk 
segments, and each of them have BitVectors for deletions.
* IW.update() gets called - the BitVector for the segment which now has a 
deletion is cloned, and set on a new pooled SegmentReader as its deletedSet
* maybe IW.update() gets called a bunch more - do these modify the pooled but 
as-yet-unused SegmentReader?  New readers in the pool?  What?
* another call to getReader() comes in, and gets an IndexReader wrapping the 
pooled SegmentReaders.

bq. Yes, this makes Lucene's reopen more costly. But, then there's no double 
checking for deletions. That's the tradeoff, and this is why the 64 msec is 
added to Zoie's search time. Zoie's searches are slower.

So we re-ran some of our tests last night, commenting out our deleted check to 
measure its cost in the most extreme case possible: a dead easy query (in that 
it's only one term), but one which yes, hits the entire index (doing a 
MatchAllDocs query is actually special-cased in our code, and is perfectly 
fast, so not a good worst case to check), and as the index grows up above a 
million documents, zoie could shave somewhere from 22-28% of its time off by 
not doing the extra check.  

We haven't re-run the test to see what happens as the index grows to 5M or 10M 
yet, but I can probably run that later today.

bq. The fact that Zoie on the pure indexing case (ie no deletions) was 10X 
faster than Lucene is very weird - that means something else is up,
besides how deletions are carried in RAM. It's entirely possible it's the fact 
that Lucene doesn't flush the tiny segments to a RAMDir (which LUCENE-1313 
addresses).

Yeah, if you call getReader() a bunch of times per second, each one does a 
flush(true,true,true), right?  Without having LUCENE-1313, this kills the 
indexing performance if querying is going on.  If no getReader() is being 
called at all, Zoie is about 10% slower than pure Lucene IndexWriter.add() 
(that's the cost of doing it in two steps - index into *two* RAMDirs [so they 
are hot-swappable] and then writing segments to disk with 
addIndexesNoOptimize() periodically).

I don't think there's any difference in the MergePolicy - I think they're both 
using the BalancedMergePolicy (since that's the one which is optimized for the 
realtime case).

bq. Actually I thought of a simple way to run the "search only" (not reopen) 
test - I'll just augment TopScoreDocCollector to optionally check the 
IntSetAccelerator, and measure the cost in practice, for different numbers of 
docs added to the IntSet.

Due to the bloomfilter living on top of the hashSet, at least at the scales 
we're dealing with, we didn't see any change in cost due to the number of 
deletions (zoie by default keeps no more than 10k modifications in memory 
before flushing to disk, so the biggest the delSet is going to be is that, and 
we don't see the more-than-constant scaling yet at that size).

bq. But your test is missing a dimension: frequency of reopen. If you reopen 
once per second, how do Zoie/Lucene compare? Twice per second? Once every 5 
seconds? Etc.

Yep, this is true.  It's a little more invasive to p

[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

2009-11-10 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776329#action_12776329
 ] 

Jake Mannix commented on LUCENE-1526:
-

bq. Whoa, pretty insane volume. 

Aiming for maxing out indexing speed and query throughput at the same time is 
what we're testing here, and this is a reasonable extreme limit to aim for when 
stress-testing real-time search.

bq. A handful by pooling the BitVector fixed size bytes arrays (see 
LUCENE-1574).

Pooling, you say?  But what if updates come in too fast to reuse your pool?  If 
you're indexing at the speeds I'm describing, won't you run out of BitVectors 
in the pool?

bq. I really need a solution that will absolutely not affect query performance 
from what is today

"You" really need this?  Why is the core case for real-time search a scenario 
where taking a hit of a huge reduction in throughput worth a possible gain in 
query latency?  If the cost was 20% query latency drop in exchange for 7x 
throughput cost when doing heavy indexing, is that worth it?  What about 10% 
latency cost vs 2x throughput loss?  These questions aren't easily answered by 
saying real-time search with Lucene needs  to _absolutely not affect query 
performance from what it is today_.  These kinds of absolute statements should 
be backed up by comparisons with real performance and load testing.

There are many axes of performance to optimize for: 
* query latency
* query throughput
* indexing throughput
* index freshness (how fast before documents are visible)

Saying that one of these is absolutely of more importance than the others 
without real metrics showing which ones are affected in which ways by different 
implementation choices is doing a disservice to the community, and is not by 
any means "conservative".

> For near real-time search, use paged copy-on-write BitVector impl
> -
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

2009-11-10 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776294#action_12776294
 ] 

Jake Mannix commented on LUCENE-1526:
-

bq. Zoie must do the IntSet check plus the BitVector check (done by
Lucene), right?

Yes, so how does Lucene NRT deal with new deletes?  The disk-backed IndexReader 
still does its internal check for deletions, right?  I haven't played with the 
latest patches on LUCENE-1313, so I'm not sure what has changed, but if you're 
leaving the disk index alone (to preserve point-in-time status of the index 
without writing to disk all the time), you've got your in-memory BitVector of 
newly uncommitted deletes, and then the SegmentReaders from the disk have their 
own internal deletedDocs BitVector.  Are these two OR'ed with each other 
somewhere?  What is done in NRT to minimize the time of checking both of these 
without modifying the read-only SegmentReader?  In the current 2.9.0 code, the 
segment is reloaded completely on getReader() if there are new add/deletes, 
right?

bq. Ie comparing IntSet lookup vs BitVector lookup isn't the comparison
you want to do. You should compare the IntSet lookup (Zoie's added
cost) to 0.

If you've got that technique for resolving new deletes against the disk-based 
ones while maintaining point-in-time nature and can completely amortize the 
reopen cost so that it doesn't affect performance, then yeah, that would be the 
comparison.  I'm not sure I understand how the NRT implementation is doing this 
currently - I tried to step through the debugger while running the 
TestIndexWriterReader test, but I'm still not quite sure what is going on 
during the reopen.

bq.  So, for a query that hits 5M docs, Zoie will take 64 msec longer than
Lucene, due to the extra check. What I'd like to know is what
pctg. slowdown that works out to be, eg for a simple TermQuery that
hits those 5M results - that's Zoie's worst case search slowdown.

Yes, this would be a good check to see.  It's still really a micro-benchmark, 
since it would be done in isolation, with no other production tasks going on 
(like rapid indexing and the consequent flushes to disk and reader 
reopening), but it would be useful to see.

What would be even better, however, would be to have a running system whereby 
there is continual updating of the index, and many concurrent requests are 
coming in which hit all 5M documents, and measure the mean latency for zoie in 
this case, in both comparison to NRT, and in comparison to lucene when you 
*don't* reopen the index (ie. you do things the pre-lucene2.9 way, where the 
CPU is still being consumed by indexing, but the reader is out of date until 
the next time it's scheduled by the application to reopen).  This would measure 
the effective latency and throughput costs of zoie and NRT vs non-NRT lucene.  
I'm not really sure it's terribly helpful to see "what is zoie's latency when 
you're not indexing at all" - why on earth would you use either NRT or zoie if 
you're not doing lots of indexing? 

> For near real-time search, use paged copy-on-write BitVector impl
> -
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental

[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

2009-11-10 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776277#action_12776277
 ] 

Jake Mannix commented on LUCENE-1526:
-

But Jason, if you're indexing 300 documents a second (possibly all of which are 
delete+re-add), and querying at a thousand queries a second, how many of these 
BitVectors are you going to end up making?

> For near real-time search, use paged copy-on-write BitVector impl
> -
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

2009-11-09 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775151#action_12775151
 ] 

Jake Mannix commented on LUCENE-1526:
-

bq. But how many msec does this clone add in practice?  Note that it's only 
done if there is a new deletion against that
segment.  I do agree it's silly wasteful, but searching should then be faster
than using AccelerateIntSet or MultiBitSet. It's a tradeoff of the
turnaround time for search perf.

I actually don't know for sure if this is the majority of the time, as I 
haven't actually run both the AcceleratedIntSet or 2.9 NRT through a profiler, 
but if you're indexing at high speed (which is what is done in our load/perf 
tests), you're going to be cloning these things hundreds of times per second 
(look at the indexing throughput we're forcing the system to go through), and 
even if it's fast, that's costly.

bq. I'd love to see how the worst-case queries (matching millions of hits)
perform with each of these three options.

It's pretty easy to change the index and query files in our test to do that, 
that's a good idea.  You can feel free to check out our load testing framework 
too - it will let you monkey with various parameters, monitor the whole thing 
via JMX, and so forth, both for the full zoie-based stuff, and where the zoie 
api is wrapped purely around Lucene 2.9 NRT.   The instructions for how to set 
it up are on the zoie wiki.

bq. When a doc needs to be updated, you index it immediately into the
RAMDir, and reopen the RAMDir's IndexReader. You add it's UID to the
AcceleratedIntSet, and all searches "and NOT"'d against that set. You
don't tell Lucene to delete the old doc, yet.

Yep, basically.  The IntSetAccelerator (of UIDs) is set on the (long lived) 
IndexReader for the disk index - this is why it's done as a ThreadLocal - 
everybody is sharing that IndexReader, but different threads have different 
point-in-time views of how much of it has been deleted.
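
Roughly, the shape of that arrangement (a hypothetical sketch, not zoie's 
actual classes):

{code}
// One shared long-lived disk reader; each searching thread snapshots the
// current set of not-yet-applied deletes (by UID) at the start of its
// query, giving it a point-in-time view without locking the reader.
public class PointInTimeDeletes {
  // written only by the single indexing thread; swapped whole on publish
  private volatile java.util.Set<Long> deletedUids =
      new java.util.HashSet<Long>();

  private final ThreadLocal<java.util.Set<Long>> snapshot =
      new ThreadLocal<java.util.Set<Long>>();

  /** indexing thread: publish a new delete set after a batch of updates. */
  public void publish(java.util.Set<Long> newDeletes) {
    deletedUids = newDeletes;  // volatile write: safe publication to readers
  }

  /** searching thread: copy-on-read freeze of a view for the whole query. */
  public void beginQuery() {
    snapshot.set(new java.util.HashSet<Long>(deletedUids));
  }

  public boolean isDeleted(long uid) {
    return snapshot.get().contains(uid);
  }
}
{code}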

bq. These are great results! If I'm reading them right, it looks like
generally you get faster query throughput, and roughly equal indexing
throughput, on upgrading from 2.4 to 2.9?

That's about right.  Of course, the comparison between zoie with either 2.4 or 
2.9 against lucene 2.9 NRT is an important one to look at: zoie is pushing 
about 7-9x better throughput for both queries and indexing than NRT.

I'm sure the performance numbers would change if we allowed not realtimeness, 
yes, that's one of the many dimensions to consider in this (along with 
percentage of indexing events which are deletes, how many of those are from 
really old segments vs. newer ones, how big the queries are, etc...).

bq. One optimization you could make with Zoie is, if a real-time deletion
(from the AcceleratedIntSet) is in fact hit, it could mark the
corresponding docID, to make subsequent searches a bit faster (and
save the bg CPU when flushing the deletes to Lucene).

That sound interesting - how would that work?  We don't really touch the disk 
indexReader, other than to set this modSet on it in the ThreadLocal, where 
would this mark live?


> For near real-time search, use paged copy-on-write BitVector impl
> -
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as t

[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

2009-11-09 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775082#action_12775082
 ] 

Jake Mannix commented on LUCENE-1526:
-

I'll try to get those numbers for you, they should be in our logs, and if not, 
it's easy enough to put them in.  My guess is that if zoie is doing 7x the QPS, 
the latency is significantly *less* than the NRT latency, not more, but I could 
be wrong.

Note that without concurrent indexing, the query speed will be the same, as all 
docs will be flushed to disk and the isDeleted check reduces to exactly the raw 
lucene case.

> For near real-time search, use paged copy-on-write BitVector impl
> -
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

2009-11-07 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774693#action_12774693
 ] 

Jake Mannix commented on LUCENE-1526:
-

bq. But, I agree it's wasteful of space when deletes are so
sparse... though it is fast.

It's fast for random access, but it's really slow if you need to make a lot of 
these (either during heavy indexing if copy-on-write, or during heavy query 
load if copy-on-reopen).

bq. So are you using this, only, as your deleted docs? Ie you don't store
the deletions with Lucene? I'm getting confused if this is only for
the NRT case, or, in general.

These are only to augment the deleted docs *of the disk reader* - the disk 
reader isn't reopened at all except infrequently - once a batch (a big enough 
RAMDirectory is filled, or enough time goes by, depending on configuration) is 
ready to be flushed to disk, diskReader.addIndexes is called and when the 
diskReader is reopened, the deletes live in the normal diskReader's delete set. 
  Before that point - while there is a batch in ram that hasn't been 
flushed - the IntSetAccelerator is applied to the not-reopened diskReader.  It's 
a copy-on-read ThreadLocal.  

So I'm not sure if that described it correctly: only the deletes which should 
have been applied to the diskReader are treated separately - those are 
basically batched: for T amount of time or D amount of docs (configurable) 
whichever comes first, they are applied to the diskReader, which knows about 
Lucene's regular deletions and now these new ones as well.   Once the memory is 
flushed to disk, the in-memory delSet is emptied, and applied to the diskReader 
using regular apis before reopening.

bq. OK, I think I'm catching up here... so you only open a new reader at
the batch boundary right? Ie, a batch update (all its adds & deletes)
is atomic from the readers standpoint?

Yes - disk reader, you mean, right?  This is only reopened at batch boundary.

bq. OK so a batch is quickly reopened, using bloom filter + int set for
fast "contains" check for the deletions that occurred during that
batch (and, custom TermDocs that does the "and not deleted"). This
gets you your fast turnaround and decent search performance.

The reopening isn't that quick, but it's in the background, or are you talking 
about the RAMDirectory?  Yeah, that is reopened per query (if necessary - if 
there are no changes, of course no reopen), but it is kept very small (10k docs 
or less, for example).  It's actually pretty fantastic performance - check out 
the zoie perf pages: 
http://code.google.com/p/zoie/wiki/Performance_Comparisons_for_ZoieLucene24ZoieLucene29LuceneNRT
 



> For near real-time search, use paged copy-on-write BitVector impl
> -
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

2009-11-07 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774628#action_12774628
 ] 

Jake Mannix commented on LUCENE-1526:
-

bq. Another approach we might take here is to track new deletions using a
sorted tree. New deletions insert in O(log(N)) time, and then we can
efficiently expose a DocIdSetIterator from that. 

We did this in Zoie for a while, and it turned out to be a bottleneck - not as 
much of a bottleneck as continually cloning a bitvector (that was even worse), 
but still not good.  We currently use a bloomfilter on top of an openintset, 
which performs pretty fantastically: constant-time adds and even-faster 
constant-time contains() checks, with small size (necessary for the new Reader 
per query scenario since this requires lots of deep-cloning of this structure).
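
Schematically, that structure looks something like this (a hypothetical 
sketch; the real thing uses an open-addressed int set rather than a 
java.util.HashSet for the exact check):

{code}
// Bloom filter in front of an exact set of ints: the common negative
// contains() is two bit probes and never touches the backing set.
public class BloomIntSet {
  private final long[] bits;   // Bloom filter bit array
  private final int mask;      // number of bits is a power of two
  private final java.util.Set<Integer> exact =
      new java.util.HashSet<Integer>();

  public BloomIntSet(int log2NumBits) {   // log2NumBits >= 6
    bits = new long[1 << (log2NumBits - 6)];
    mask = (bits.length << 6) - 1;
  }

  private int h1(int v) { return (v * 0x9E3779B9) & mask; }
  private int h2(int v) { return ((v ^ (v >>> 16)) * 0x85EBCA6B) & mask; }

  public void add(int v) {
    bits[h1(v) >>> 6] |= 1L << (h1(v) & 63);
    bits[h2(v) >>> 6] |= 1L << (h2(v) & 63);
    exact.add(v);
  }

  public boolean contains(int v) {
    if ((bits[h1(v) >>> 6] & (1L << (h1(v) & 63))) == 0) return false;
    if ((bits[h2(v) >>> 6] & (1L << (h2(v) & 63))) == 0) return false;
    return exact.contains(v);  // Bloom hit may be a false positive
  }
}
{code}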

Just a note from our experience over in zoie-land.  It also helped not to 
produce a docIdSet iterator using these bits, but instead to override the 
TermDocs returned by the disk reader, and keep track of it directly there.

> For near real-time search, use paged copy-on-write BitVector impl
> -
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-11-03 Thread Jake Mannix
There are really not that many hoops you need to jump through to be able to
periodically optimize down to 10 segments or so.  I've used lucene at plenty
of other places before LinkedIn, and rarely (since 2.3's indexing speed blew
through the roof) have I had to worry about setting the merge factor too
high, and even when I do, you simply index into another directory while
you're optimizing the original (which is now kept read-only while keeping an
in-memory delete set).  It's not that hard, and it does wonders for your
performance.
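
The two-directory trick is roughly the following (a sketch against the 
2.9-era APIs; the paths and the analyzer choice are placeholders):

{code}
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch: optimize a copy off the serving path, then swap searchers over.
public class OfflineOptimize {
  public static void main(String[] args) throws IOException {
    Directory live = FSDirectory.open(new File("index"));     // serving reads
    Directory scratch = FSDirectory.open(new File("index.opt"));

    IndexWriter w = new IndexWriter(scratch, new SimpleAnalyzer(),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
    w.addIndexesNoOptimize(new Directory[] { live });  // copy current segments
    w.optimize();   // the heavy merge happens while "live" keeps serving
    w.close();
    // ...then point the searcher at "scratch", retire "live", and replay
    // the in-memory delete set accumulated while the optimize ran.
  }
}
{code}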

Sure, plenty of lucene installations can't optimize, and while many of those
could do with some much-needed refactoring to allow them the possibility of
doing that (otherwise, you get what happened at my old company before I
worked there - there was never any optimize, high merge factor, and commit
after every document [ouch!], and eventually query latency went through the
roof and the system just fell over), I understand that not everyone is going
to do that.

But even in these installations, I'm still saying that you've narrowed the
field down to a very tiny number if you add up all the requirements for
multiPQ to be painful for them (seriously: when is 40MB going to hurt a
system that's designed to handle 100QPS per box?  Or when does 4MB hurt one
designed to handle 10QPS?)

  -jake

On Tue, Nov 3, 2009 at 12:40 PM, Mark Miller  wrote:

> Not *ever* being able to optimize is a common case, without jumping through
> a lot of hoops. There are many systems that need to be on nearly 24/7 - an
> optimize on a large index can take many hours - usually an unknown number.
> LinkedIn and its use cases are not the only consumers of lucene.
>
>
> - Mark
>
> http://www.lucidimagination.com (mobile)
>
> On Nov 3, 2009, at 10:51 AM, "Jake Mannix (JIRA)"  wrote:
>
>
>>   [
>> https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773114#action_12773114
>>  ]
>>
>> Jake Mannix commented on LUCENE-1997:
>> -
>>
>> bq. Since each approach has distinct advantages, why not offer both
>> ("simple" and "expert") comparator extensions APIs?
>>
>> +1 from me on this one, as long as the simpler one is around.  I'll bet
>> we'll find that we regret keeping the "expert" one by 3.2 or so though, but
>> I'll take any compromise which gets the simpler API in there.
>>
>> bq. Don't forget that this is multiplied by however many queries are
>> currently in flight.
>>
>> Sure, so if you're running with 100 queries per second on a single shard
>> (pretty fast!), with 100 segments, and you want to do sorting by value on
>> the top 1000 values (how far down the long tail of extreme cases are we at
>> now?  Do librarians hit their search servers with 100 QPS and have indices
>> poorly built with hundreds of segments and can't take downtime to *ever*
>> optimize?), we're now talking about 40MB.
>>
>> *Forty megabytes*.  On a beefy machine which is supposed to be handling
>> 100QPS across an index big enough to need 100 segments.  How much heap would
>> such a machine already be allocating?  4GB?  6?  More?
>>
>> We're talking about less than 1% of the heap being used by the multiPQ
>> approach in comparison to singlePQ.
>>
>>  Explore performance of multi-PQ vs single-PQ sorting API
>>> 
>>>
>>>   Key: LUCENE-1997
>>>   URL: https://issues.apache.org/jira/browse/LUCENE-1997
>>>   Project: Lucene - Java
>>>Issue Type: Improvement
>>>Components: Search
>>>  Affects Versions: 2.9
>>>  Reporter: Michael McCandless
>>>  Assignee: Michael McCandless
>>>   Attachments: LUCENE-1997.patch, LUCENE-1997.patch,
>>> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch,
>>> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>>>
>>>
>>> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
>>> where a simpler (non-segment-based) comparator API is proposed that
>>> gathers results into multiple PQs (one per segment) and then merges
>>> them in the end.
>>> I started from John's multi-PQ code and worked it into
>>> contrib/benchmark so that we could run perf tests.  Then I generified
>>> the Python script I use for running search benchmarks (in
>>> contrib/benchmark/sortBench.py

Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-11-03 Thread Jake Mannix
Um, according to Mike's latest numbers, multiPQ is actually *faster* at 1000
hits sometimes.  In fact, all of the most recent tests have shown no clear
winner either way in terms of QPS.  Sometimes (look at Yonik's linux
numbers), multiPQ is nearly across the board faster.

Either way, I think it's been made abundantly clear that any advantages
singlePQ has in terms of QPS performance degrade to near zero (if not
reverse to be negative) as the number of hits increases higher and higher.
I could show you what it looks like on a 10 or 20 million document index if
you'd like to see that as well, but I think that's pretty clear from the
difference from 1M to 5M.
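
For anyone following along, the shape of the multiPQ approach being 
benchmarked is roughly this (a sketch, not the actual LUCENE-1997 patch):

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: one bounded min-heap per segment during collection, merged into
// a single top-N at the end.  Assumes scores comparable across segments,
// and n >= 1.
public class MultiPQSketch {
  static class Hit {
    final int doc; final float score;
    Hit(int doc, float score) { this.doc = doc; this.score = score; }
  }
  static final Comparator<Hit> BY_SCORE = new Comparator<Hit>() {
    public int compare(Hit a, Hit b) { return Float.compare(a.score, b.score); }
  };

  /** scores[seg][doc] = score of doc in segment seg; returns top n, best first. */
  public static List<Hit> topN(float[][] scores, int n) {
    PriorityQueue<Hit> merged = new PriorityQueue<Hit>(n, BY_SCORE);
    for (float[] seg : scores) {
      PriorityQueue<Hit> pq = new PriorityQueue<Hit>(n, BY_SCORE);
      for (int doc = 0; doc < seg.length; doc++) {
        pq.offer(new Hit(doc, seg[doc]));
        if (pq.size() > n) pq.poll();      // evict the current worst
      }
      for (Hit h : pq) {                   // merge this segment's queue
        merged.offer(h);
        if (merged.size() > n) merged.poll();
      }
    }
    List<Hit> out = new ArrayList<Hit>(merged);
    Collections.sort(out, Collections.reverseOrder(BY_SCORE));
    return out;
  }
}
{code}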

  -jake

On Tue, Nov 3, 2009 at 12:56 PM, Mark Miller  wrote:

> The memory issue is just one example of something that's somewhat worse - I
> don't see it as a deciding factor. If things were clarified to be decidedly
> faster with multi queue, and not 50% worse at 1000 hits, I'd be for the
> change, more memory or not.
>
>
> - Mark
>
> http://www.lucidimagination.com (mobile)
>
> On Nov 3, 2009, at 12:42 PM, Jake Mannix  wrote:
>
> Mark, I'm not stuck on single examples, I'm thinking about all of lucene
> land: what tiny fraction of people need the combined intersection of
>
>   a) many many segments
>
> AND
>
>   b) deep paging
>
> AND
>
>   c) high QPS
>
> AND
>
>   d) can't handle another 40MB of RAM usage.
>
> Only people in the intersection of all of those bitsets would possibly have
> a problem with the memory requirements of multiPQ.
>
> On Tue, Nov 3, 2009 at 12:32 PM, Mark Miller  wrote:
>
>> You're obviously too stuck on single examples. We have to consider everyone
>> in lucene land.
>>
>> I'm against 2 APIs. A custom search is advanced - it's not worth the
>> baggage of maintaining two APIs or being limited by the APIs and back compat
>> when moving forward.
>>
>> If the advantage of the second API is just going to be that it's simpler, I'm
>> not for it currently.
>>
>> - Mark
>>
>> http://www.lucidimagination.com (mobile)
>>
>> On Nov 3, 2009, at 10:51 AM, "Jake Mannix (JIRA)"  wrote:
>>
>>
>>>   [
>>> https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773114#action_12773114
>>>  ]
>>>
>>> Jake Mannix commented on LUCENE-1997:
>>> -
>>>
>>> bq. Since each approach has distinct advantages, why not offer both
>>> ("simple" and "expert") comparator extensions APIs?
>>>
>>> +1 from me on this one, as long as the simpler one is around.  I'll bet
>>> we'll find that we regret keeping the "expert" one by 3.2 or so though, but
>>> I'll take any compromise which gets the simpler API in there.
>>>
>>> bq. Don't forget that this is multiplied by however many queries are
>>> currently in flight.
>>>
>>> Sure, so if you're running with 100 queries per second on a single shard
>>> (pretty fast!), with 100 segments, and you want to do sorting by value on
>>> the top 1000 values (how far down the long tail of extreme cases are we at
>>> now?  Do librarians hit their search servers with 100 QPS and have indices
>>> poorly built with hundreds of segments and can't take downtime to *ever*
>>> optimize?), we're now talking about 40MB.
>>>
>>> *Forty megabytes*.  On a beefy machine which is supposed to be handling
>>> 100QPS across an index big enough to need 100 segments.  How much heap would
>>> such a machine already be allocating?  4GB?  6?  More?
>>>
>>> We're talking about less than 1% of the heap being used by the multiPQ
>>> approach in comparison to singlePQ.
>>>
>>>  Explore performance of multi-PQ vs single-PQ sorting API
>>>> 
>>>>
>>>>   Key: LUCENE-1997
>>>>   URL: https://issues.apache.org/jira/browse/LUCENE-1997
>>>>   Project: Lucene - Java
>>>>Issue T

[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-11-03 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773114#action_12773114
 ] 

Jake Mannix commented on LUCENE-1997:
-

bq. Since each approach has distinct advantages, why not offer both ("simple" 
and "expert") comparator extensions APIs?

+1 from me on this one, as long as the simpler one is around.  I'll bet we'll 
find that we regret keeping the "expert" one by 3.2 or so though, but I'll take 
any compromise which gets the simpler API in there.

bq. Don't forget that this is multiplied by however many queries are currently 
in flight.

Sure, so if you're running with 100 queries per second on a single shard 
(pretty fast!), with 100 segments, and you want to do sorting by value on the 
top 1000 values (how far down the long tail of extreme cases are we at now?  Do 
librarians hit their search servers with 100 QPS and have indices poorly built 
with hundreds of segments and can't take downtime to *ever* optimize?), we're 
now talking about 40MB.  
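
For concreteness, the arithmetic behind that figure (assuming ~4 bytes per 
queue slot and all 100 of those queries in flight at once):

{code}
100 in-flight queries x 100 segments x 1000 top-N slots x 4 bytes/slot
  = 40,000,000 bytes ~= 40MB
{code}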

*Forty megabytes*.  On a beefy machine which is supposed to be handling 100QPS 
across an index big enough to need 100 segments.  How much heap would such a 
machine already be allocating?  4GB?  6?  More? 

We're talking about less than 1% of the heap being used by the multiPQ 
approach in comparison to singlePQ.

> Explore performance of multi-PQ vs single-PQ sorting API
> 
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, 
> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, 
> LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests.  Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available).  Then
> it runs various combinations:
>   * Index with 20 balanced segments vs index with the "normal" log
> segment size
>   * Queries with different numbers of hits (only for wikipedia index)
>   * Different top N
>   * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept.  The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-11-03 Thread Jake Mannix
Mark, I'm not stuck on single examples, I'm thinking about all of lucene
land: what tiny fraction of people need the combined intersection of

  a) many many segments

AND

  b) deep paging

AND

  c) high QPS

AND

  d) can't handle another 40MB of RAM usage.

Only people in the intersection of all of those bitsets would possibly have
a problem with the memory requirements of multiPQ.

On Tue, Nov 3, 2009 at 12:32 PM, Mark Miller  wrote:

> You're obviously too stuck on single examples. We have to consider everyone
> in lucene land.
>
> I'm against 2 APIs. A custom search is advanced - it's not worth the
> baggage of maintaining two APIs or being limited by the APIs and back compat
> when moving forward.
>
> If the advantage of the second API is just going to be that it's simpler, I'm
> not for it currently.
>
> - Mark
>
> http://www.lucidimagination.com (mobile)
>
> On Nov 3, 2009, at 10:51 AM, "Jake Mannix (JIRA)"  wrote:
>
>
>>   [
>> https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773114#action_12773114
>>  ]
>>
>> Jake Mannix commented on LUCENE-1997:
>> -
>>
>> bq. Since each approach has distinct advantages, why not offer both
>> ("simple" and "expert") comparator extensions APIs?
>>
>> +1 from me on this one, as long as the simpler one is around.  I'll bet
>> we'll find that we regret keeping the "expert" one by 3.2 or so though, but
>> I'll take any compromise which gets the simpler API in there.
>>
>> bq. Don't forget that this is multiplied by however many queries are
>> currently in flight.
>>
>> Sure, so if you're running with 100 queries per second on a single shard
>> (pretty fast!), with 100 segments, and you want to do sorting by value on
>> the top 1000 values (how far down the long tail of extreme cases are we at
>> now?  Do librarians hit their search servers with 100 QPS and have indices
>> poorly built with hundreds of segments and can't take downtime to *ever*
>> optimize?), we're now talking about 40MB.
>>
>> *Forty megabytes*.  On a beefy machine which is supposed to be handling
>> 100QPS across an index big enough to need 100 segments.  How much heap would
>> such a machine already be allocating?  4GB?  6?  More?
>>
>> We're talking about less than 1% of the heap being used by the multiPQ
>> approach in comparison to singlePQ.
>>
>>  Explore performance of multi-PQ vs single-PQ sorting API
>>> 
>>>
>>>   Key: LUCENE-1997
>>>   URL: https://issues.apache.org/jira/browse/LUCENE-1997
>>>   Project: Lucene - Java
>>>Issue Type: Improvement
>>>Components: Search
>>>  Affects Versions: 2.9
>>>  Reporter: Michael McCandless
>>>  Assignee: Michael McCandless
>>>   Attachments: LUCENE-1997.patch, LUCENE-1997.patch,
>>> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch,
>>> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>>>
>>>
>>> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
>>> where a simpler (non-segment-based) comparator API is proposed that
>>> gathers results into multiple PQs (one per segment) and then merges
>>> them in the end.
>>> I started from John's multi-PQ code and worked it into
>>> contrib/benchmark so that we could run perf tests.  Then I generified
>>> the Python script I use for running search benchmarks (in
>>> contrib/benchmark/sortBench.py).
>>> The script first creates indexes with 1M docs (based on
>>> SortableSingleDocSource, and based on wikipedia, if available).  Then
>>> it runs various combinations:
>>>  * Index with 20 balanced segments vs index with the "normal" log
>>>   segment size
>>>  * Queries with different numbers of hits (only for wikipedia index)
>>>  * Different top N
>>>  * Different sorts (by title, for wikipedia, and by random string,
>>>   random int, and country for the random index)
>>> For each test, 7 search rounds are run and the best QPS is kept.  The
>>> script runs singlePQ then multiPQ, and records the resulting best QPS
>>> for each and produces table (in Jira format) as output.
>>>
>>


Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-11-02 Thread Jake Mannix
Right, that's my common sense saying that I'm guessing that worrying about
memory usage of the order of numSegments * numResultsReturned is rarely
going to be an issue.

I think I actually took Latin once, so I may even have known what "eg"
literally meant at some point back in prehistory.  I have just seen how
these discussions can catch on one particular thing that was brought up and
then get derailed into long bantering back and forth on that topic, as if it
were the most important thing in the world (for example, how I'm highlighting
this one parenthetical remark which Mike made), losing track of what the
actual consensus and disagreement were.

As far as I could see, in general, for the most common use cases the
performance was so comparable as to be swayed by wind patterns.  As for RAM:
a lucene system which is asking for thousands of hits to be returned across
hundred-segment indices (with the typical LogMergePolicy, this would only
happen once you get up toward a billion documents) is probably already using
so much memory that worrying about another 10KB here and there really gets
washed away, in the same way that the leading-order complexity totally
dominates the CPU time.

The real thing to focus on is the API itself, for maximal ease of use for
the next possibly 3 years (if the timeframe of 2.x is any indicator).

  -jake

On Mon, Nov 2, 2009 at 3:16 PM, Mark Miller  wrote:

> That I don't know. Probably fewer than more, but that's just me guessing
> based on common sense :)
>
>
> - Mark
>
> http://www.lucidimagination.com (mobile)
>
> On Nov 2, 2009, at 6:13 PM, Jake Mannix  wrote:
>
> Ok, and how many of those users are also running on indices with hundreds
> of segments?
>
>   -jake
>
> On Mon, Nov 2, 2009 at 3:10 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> There are plenty of Lucene users that do go 1000 in. We've been calling it
>> "deep paging" at LI. I like that name :)
>>
>> - Mark
>>
>> http://www.lucidimagination.com (mobile)
>>
>>
>> On Nov 2, 2009, at 6:04 PM, "Jake Mannix (JIRA)" <j...@apache.org> wrote:
>>
>>
>>>   [
>>> https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772741#action_12772741
>>>  ]
>>>
>>> Jake Mannix commented on LUCENE-1997:
>>> -
>>>
>>> The current concern is to do with the memory?  I'm more concerned with
>>> the weird "java ghosts" that are flying around, sometimes swaying results by
>>> 20-40%...  the memory could only be an issue on a setup with hundreds of
>>> segments and sorting the top 1000 values (do we really try to optimize for
>>> this performance case?).  In the normal case (no more than tens of segments,
>>> and the top 10 or 100 hits), we're talking about what, 100-1000 PQ entries?
>>>

Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-11-02 Thread Jake Mannix
Ok, and how many of those users are also running on indices with hundreds of
segments?

  -jake

On Mon, Nov 2, 2009 at 3:10 PM, Mark Miller  wrote:

> There are plenty of Lucene users that do go 1000 in. We've been calling it
> "deep paging" at LI. I like that name :)
>
> - Mark
>
> http://www.lucidimagination.com (mobile)
>
>
> On Nov 2, 2009, at 6:04 PM, "Jake Mannix (JIRA)"  wrote:
>
>
>>   [
>> https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772741#action_12772741
>>  ]
>>
>> Jake Mannix commented on LUCENE-1997:
>> -
>>
>> The current concern is to do with the memory?  I'm more concerned with the
>> weird "java ghosts" that are flying around, sometimes swaying results by
>> 20-40%...  the memory could only be an issue on a setup with hundreds of
>> segments and sorting the top 1000 values (do we really try to optimize for
>> this performance case?).  In the normal case (no more than tens of segments,
>> and the top 10 or 100 hits), we're talking about what, 100-1000 PQ entries?
>>


[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-11-02 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772741#action_12772741
 ] 

Jake Mannix commented on LUCENE-1997:
-

The current concern is to do with the memory?  I'm more concerned with the 
weird "java ghosts" that are flying around, sometimes swaying results by 
20-40%...  the memory could only be an issue on a setup with hundreds of 
segments and sorting the top 1000 values (do we really try to optimize for this 
performance case?).  In the normal case (no more than tens of segments, and the 
top 10 or 100 hits), we're talking about what, 100-1000 PQ entries?




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-28 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771041#action_12771041
 ] 

Jake Mannix commented on LUCENE-1997:
-

Can you tell me more about this?  What kind of comparator can't pre-create a 
fixed ordinal list for all the possible values?  I'm sure I've seen this too, 
but I can't bring one to mind right now.




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-28 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771017#action_12771017
 ] 

Jake Mannix commented on LUCENE-1997:
-

But part of the point is that you don't have to get the values - you can have a 
fast in-memory structure which just encodes their sort order, right?  This is 
the whole point of using the ordinal - you pre-sort all of the possible values 
to get the ordinals, and then an arbitrarily complex comparator reduces to an 
int compare at sort time (*).  In the custom comparators we use, for example, 
this even allows sorting by multivalued fields in a custom way, via a simple 
compare(int doca, int docb) method.

(*) This reminds me: Mike, you switched the compare for ord values from being 
"return ordA - ordB" to being "return ordA > ordB ? 1 : (ordA == ordB ? 0 : 
-1)", on the basis of int overflow at some point, right?  This is only true if 
we're really sorting by integers, which could overflow - if they're ordinals, 
then these are both non-negative numbers, and their difference will always be 
greater than -MAX_INT, so the branching can be avoided in this innermost 
comparison in this case.
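
To make that concrete, here is a minimal sketch (hypothetical method names,
not from any patch here) of the two compare variants; the subtraction form
is safe exactly when both inputs are non-negative, as ordinals are:

    final class OrdCompare {
      // Branch-free form: safe only when ordA and ordB are both non-negative
      // (true for ordinals), since then ordA - ordB can never overflow an int.
      static int compareOrdinals(int ordA, int ordB) {
        return ordA - ordB;
      }

      // Overflow-proof general form, needed when the values may span the
      // whole int range (e.g. raw integer field values):
      static int compareInts(int a, int b) {
        return a > b ? 1 : (a == b ? 0 : -1);
      }
    }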




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-28 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771005#action_12771005
 ] 

Jake Mannix commented on LUCENE-1997:
-

That should speed things up, but that's way subleading in complexity.  This is 
an additive term of O(numSegments * numDesiredResults) total operations when 
done "slowly" (as opposed to the best merge, which is O(numDesiredResults * 
log(numSegments))), in comparison to the primary subleading piece for multiPQ, 
which is O(numSegments * numDesiredResults * log(numDesiredResults) * 
log(numHitsPerSegment)) - so that's taking a piece of the CPU time which is 
already smaller by a factor of 20-100 than the total PQ insert time, and 
reducing it by a further factor of maybe 5-10.  

If it's easy to code up, sure, why not.  But it's not really an "inner loop" 
optimization that's necessary anymore, I'd argue.
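
To put rough numbers on that (an illustrative plug-in, not a measurement):
with S = 20 segments, T = 100 desired results, and H = 50,000 hits per
segment, the slow merge costs about S*T = 2,000 operations, while the multiPQ
insert term is on the order of S*T*log(T)*log(H) =~ 20 * 100 * 7 * 16 =~
220,000 - so the merge is already a factor of ~100 below the dominant insert
cost, before any optimization.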




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-28 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770945#action_12770945
 ] 

Jake Mannix commented on LUCENE-1997:
-

bq. search segment 2 into queue B (with possible short circuiting by the 
smallest value in queueA)

Well, we're not doing the short circuit trick on multiPQ right now, are we?  It 
would certainly speed things up, but it requires that the API have the convert() 
method available - and not needing convert() was the big API-side savings of 
multiPQ.  If it were available, I think multiPQ (either with N or 2 queues) 
would perform *strictly* better than singlePQ, but I didn't suggest this 
because it seems to negate the cleanliness of the API.

One thing John mentioned offhand is that perhaps the convert() method could be 
optional?  If you don't implement it, you don't get to short-circuit using 
knowledge of previous segments, but if you do, you get maximum performance in 
the cases where multiPQ performs worse (mid-range hitCount, high 
numResultsToReturn, and in the numeric sorting case).

I think maybe combining this idea with 2 queues could be the best of all 
worlds, with best overall speed, only twice the memory of singlePQ, and the 
simplest API with the addition of one new *optional* method?
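
A hypothetical shape for that compromise (invented names, not the actual
patch), with convert() as the one optional hook:

    public abstract class SimpleSegmentComparator<T extends Comparable<T>> {
      // Required: the sort value for a doc within the current segment.
      public abstract T value(int doc);

      // Optional: translate the bottom value of a previous segment's queue
      // into this segment's value space, enabling the short-circuit trick.
      // Returning null tells the collector the trick is unavailable.
      public T convert(T bottomFromPreviousSegment) {
        return null;
      }
    }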




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-27 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770709#action_12770709
 ] 

Jake Mannix commented on LUCENE-1997:
-

Not terribly useful, in that it's a MacOS X laptop - 

2.4 GHz Intel Core 2 Duo, 4GB 667 MHz DDR2 SDRAM, 

Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02-304)
Java HotSpot(TM) Client VM (build 1.5.0_19-137, mixed mode, sharing)
 
but it's run with the latest patch.




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-27 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770708#action_12770708
 ] 

Jake Mannix commented on LUCENE-1997:
-

||Source||Seg size||Query||Tot hits||Sort||Top N||QPS old||QPS new||Pct change||
|random|balanced||500|rand int|10|22.02|22.99|{color:green}4.4%{color}|
|random|balanced||500|rand int|25|21.45|22.91|{color:green}6.8%{color}|
|random|balanced||500|rand int|50|21.40|22.58|{color:green}5.5%{color}|
|random|balanced||500|rand int|100|21.16|21.93|{color:green}3.6%{color}|
|random|balanced||500|rand int|500|21.27|17.45|{color:red}-18.0%{color}|
|random|balanced||500|rand int|1000|20.62|14.33|{color:red}-30.5%{color}|
|random|balanced||500|rand string|10|18.33|21.91|{color:green}19.5%{color}|
|random|balanced||500|rand string|25|18.37|22.20|{color:green}20.8%{color}|
|random|balanced||500|rand string|50|18.51|22.13|{color:green}19.6%{color}|
|random|balanced||500|rand string|100|20.30|21.91|{color:green}7.9%{color}|
|random|balanced||500|rand string|500|18.34|18.69|{color:green}1.9%{color}|
|random|balanced||500|rand string|1000|17.83|15.04|{color:red}-15.6%{color}|
|random|balanced||500|country|10|18.49|22.16|{color:green}19.8%{color}|
|random|balanced||500|country|25|18.62|22.28|{color:green}19.7%{color}|
|random|balanced||500|country|50|18.30|21.86|{color:green}19.5%{color}|
|random|balanced||500|country|100|20.46|22.08|{color:green}7.9%{color}|
|random|balanced||500|country|500|18.02|18.38|{color:green}2.0%{color}|
|random|balanced||500|country|1000|18.28|14.62|{color:red}-20.0%{color}|
|random|log||500|rand int|10|22.04|23.18|{color:green}5.2%{color}|
|random|log||500|rand int|25|21.48|23.02|{color:green}7.2%{color}|
|random|log||500|rand int|50|21.37|22.96|{color:green}7.4%{color}|
|random|log||500|rand int|100|21.47|22.43|{color:green}4.5%{color}|
|random|log||500|rand int|500|21.09|19.60|{color:red}-7.1%{color}|
|random|log||500|rand int|1000|20.37|17.58|{color:red}-13.7%{color}|
|random|log||500|rand string|10|18.54|22.49|{color:green}21.3%{color}|
|random|log||500|rand string|25|18.50|22.32|{color:green}20.6%{color}|
|random|log||500|rand string|50|18.53|22.24|{color:green}20.0%{color}|
|random|log||500|rand string|100|20.46|22.43|{color:green}9.6%{color}|
|random|log||500|rand string|500|18.59|20.83|{color:green}12.0%{color}|
|random|log||500|rand string|1000|18.27|18.01|{color:red}-1.4%{color}|
|random|log||500|country|10|18.55|22.50|{color:green}21.3%{color}|
|random|log||500|country|25|18.48|22.35|{color:green}20.9%{color}|
|random|log||500|country|50|18.48|22.24|{color:green}20.3%{color}|
|random|log||500|country|100|20.73|22.34|{color:green}7.8%{color}|
|random|log||500|country|500|18.77|20.12|{color:green}7.2%{color}|
|random|log||500|country|1000|17.92|18.44|{color:green}2.9%{color}|



[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-27 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770604#action_12770604
 ] 

Jake Mannix commented on LUCENE-1997:
-

What happened with these ones, Mike?




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-27 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770587#action_12770587
 ] 

Jake Mannix commented on LUCENE-1997:
-

Excellent, good to see that my big-O analysis is holding up on the 5M doc set: 
as the sub-leading terms drop off and become negligible, any improvement of 
singlePQ over multiPQ starts to go away entirely.

There still seem to be some statistical fluctuations here and there, though 
(why would 1000 hits ever show *better* perf for multiPQ vs singlePQ than 500 
hits do?), but I guess that's entropy for you...




Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-26 Thread Jake Mannix
> Also, I really do not like the large perf hits at highish topN sizes.
> Admittedly it's unusual (though I suspect not THAT rare) to use such
> a large topN, but, I especially don't like that it doesn't degrade
> gracefully – it's surprising. Likewise I'd like to see with highish
> number of segments how things degrade.

Well the degradation should not be surprising - it's coded into the
subleading complexity: redoing the complexity argument carefully
for the average case, probabilistically, the flop count for single PQ,
for top T docs out of S segments of H hits in each, is (see footnote 1)

  O(S*H + T*log(T)*log(S*H))

(note that it is a *sum* of the two terms)

while inserting in S different queues is (see footnote 2)

  O(S*H + T*log(T)*log(H)*S )

This means that multi-PQ's subleading term, as a function
of T, is worse by a factor of S*log(H) / log(S*H).

To see whether this is "bad", we have to consider the full
range of S = numSegments, H = numHitsPerSegment
and T = topNumHitsToReturn.

If you fix S and H (which fixes the dominating factor as
constant also), then as T increases, as the tests show,
multi-PQ diverges, getting worse as T goes up.

On the other hand, if you fix T, and instead let S and H
vary, then the flop-count ratio is the following:

flops_singlePQ / flops_multiPQ =

  (S*H + T*log(T)*log(S*H)) / (S*H + T*log(T)*log(H)*S)

which, dividing numerator and denominator by S*H, is:

  (1 + T*log(T)*log(S*H)/(S*H)) / (1 + T*log(T)*log(H)/H)

under the always-true (for us) case that S*H is the
dominating factor, and using 1/(1+x) =~ 1-x for small
x, we have:

  (1 + T*log(T)*log(S*H)/(S*H)) * (1 - T*log(T)*log(H)/H)

and using (1+x)*(1-y) =~ 1+x-y for small x and y, and
pulling out the common factor of T*log(T)/H, we have:

  1 - (T*log(T)/H) * (log(H) - log(S*H)/S)

What's the end result?  If you fix the queue size (T),
then as the total number of hits (and with it the number
of hits per segment) grows, the percentage performance
difference should go to *zero*, proportional to 1/H =
1/numHitsPerSegment.
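
(Plugging in rough numbers, with base-2 logs, purely as an illustration:
with S = 20 and H = 1,000,000, the correction term
(T*log(T)/H) * (log(H) - log(S*H)/S) comes to about 0.06% for T = 10
and about 1.2% for T = 100 - negligible either way - while shrinking
H to 10,000 at T = 100 pushes the same term up to order tens of
percent, where the linear approximation itself starts to break down.)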

This to me is pretty important to note, and should be
pretty easy to see in a test run: doing it on a run with
5 million documents hit by the query should work out
much better for multiPQ than for singlePQ, whereas
*decreasing* the hit count, to say 10,000 hits, should
penalize multiPQ versus singlePQ.

The average user, I would imagine, doesn't have *tons*
of variation in queue size or number of segments
(meaning I doubt either often grows above 100, and
only very rarely up to 1000), but the total
number of hits is extremely variable, and as even
recent discussions on java-user show, people even
look at multi-TB indexes, so performance as a function
of index size is probably the most important axis.

I can run these numbers on my mac desktop, because
this analysis is independent of hardware or even
java version, since it's a scaling argument, but we should
really see what the results are like on reasonable
hardware for both the small hit-count case (where the
performance might be really bad for multiPQ, and
rule it out altogether), and the very high hit-count case
(where the difference should be much more in favor
of multiPQ, and could in fact rule out singlePQ if the
constant factors (like the cost of the extra string
compares that singlePQ has to do) become more
important than the complexity, because the asymptotics
are the same in this regime).

  -jake

--

(1) To see this scaling, if you have a queue of size
T, and you have currently looked at X hits, then the
probability in the average case of the next hit beating
the bottom element of the queue is T/X.  Then the
total number of insertions in the queue you have to
do in a set of H hits is sum_X=1...H [ T/X ].  Replacing
the sum with an integral, it's just T log(H).  The
cost of an insertion into a PQ of size T is log(T), giving
an insertion flop count of T * log(T) * log(H)
while walking H hits, and T * log(T) * log(S*H)
while walking S*H hits.

(2) we're neglecting the final merge for the multiPQ case,
because it's sub-sub-leading in complexity: merging
S sorted lists of T elements is O(T log(S)) on average,
which is smaller than the next bigger term in this counting
by a multiplicative factor of log(T) * log(H).
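
Footnote 1's T*log(H) insertion count is also easy to sanity-check
empirically; a throwaway sketch (mine, not part of any patch), using
random scores against a min-heap as the queue:

    import java.util.PriorityQueue;
    import java.util.Random;

    // Rough check of footnote 1: a size-T queue over H random hits should
    // see on the order of T*ln(H) insertions (~138 for T=10, H=1,000,000).
    public class InsertCountCheck {
      public static void main(String[] args) {
        int T = 10, H = 1000000;
        Random rnd = new Random(42);
        PriorityQueue<Double> pq = new PriorityQueue<Double>(T);
        long inserts = 0;
        for (int x = 0; x < H; x++) {
          double score = rnd.nextDouble();
          if (pq.size() < T) {
            pq.add(score);            // queue not yet full: always insert
            inserts++;
          } else if (score > pq.peek()) {
            pq.poll();                // evict the current bottom
            pq.add(score);
            inserts++;
          }
        }
        System.out.println(inserts + " inserts, predicted ~"
            + Math.round(T * Math.log(H)));
      }
    }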


[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-25 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769860#action_12769860
 ] 

Jake Mannix commented on LUCENE-1997:
-

Mark, you say that with the previous numbers you'd say "-1", but if you look at 
the most common use case (top 10), the simpler API is faster in almost all 
cases, and in some cases it's 10-20% faster.  Top 500 and top 1000 are not just 
"not as common" - they're probably at the 1% level, or less.

As far as shifting back, API-wise, that really shouldn't be a factor: 2.9 
*just* came out, and what, we stick with a slightly *slower* API (for the most 
common use case across all Lucene users) which happens to be *more complex* 
and, more importantly, just very nonstandard?  Comparable is very familiar to 
everyone, even if you have to have two forms, one for primitives and one for 
Objects.  An api which *doesn't* have the whole slew of compare(), 
compareBottom(), copy(), setBottom(), value() and setNextReader() has a 
tremendous advantage over one which does.

It's "advanced" to implement a custom sort, but it will be *easier* if it's not 
complex, and then it doesn't *need* to be "advanced" (shouldn't we be striving 
to make there be less APIs which are listed as "advanced", and instead more 
features which can *do* complex things but are still listed as things "normal 
users" can do).

I think it's a *great* precedent to set with users to say, "oops!  we found 
that this new (just now as of this version) api was unnecessarily clumsy, so 
we're shifting back to a simpler one which is just like the one you used to 
have".  Sticking with a worse api, merely because it performs better in 
extreme scenarios and because "we already moved on to this new api, shouldn't 
go back now, don't want to admit we ever made a mistake!", is what is "ugly".

The main thing to remember is that the entire thinking around making this 
different from the old was *only* because it seemed that using a simpler api 
would perform much worse than this one, and it does not appear that this is the 
case.  If that original reasoning turns out to have been incorrect, then the 
answer is simple: go with the simpler API *now* before users *do* get used to 
using the new one.

If it turns out I'm wrong, and lots of users often sort based on field values 
for the top 1000 entries, or the most recent runs turn out to be flukes and 
not typical performance, only then would I change my opinion.




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-23 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769328#action_12769328
 ] 

Jake Mannix commented on LUCENE-1997:
-

Mike, thanks for all the hard work on this - it's clearly far more work than 
anyone has spent yet on just doing the upgrade to the newer api, and that's 
appreciated.  

Am I wrong in thinking that these results are pretty ambiguous?  How often do 
people take the top 500 or top 1000 sorted hits?  If you don't focus on that 
case (that of looking for pages *50 through 100* of normal 10-per-page search 
results), isn't there a bunch of green and a bunch of red, with the two 
techniques within +/- 10-20% of each other?

Is that what everyone else sees of Mike's newest numbers here, or am I 
misreading them?




Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Jake Mannix
On Thu, Oct 22, 2009 at 9:58 PM, Mark Miller  wrote:

> Yes - I've seen a handful of non core devs report back that they
> upgraded with no complaints on the difficulty. It's in the mailing list
> archives. The only core dev I've seen say its easy is Uwe. He's super
> sharp though, so I wasn't banking my comment on him ;)
>

Upgrade custom sorting?  Where has anyone talked about this?

2.9 is great, I like the new apis, they're great in general.  It's just this
multi-segment sorting we're talking about here.

  -jake


[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769058#action_12769058
 ] 

Jake Mannix commented on LUCENE-1997:
-

I would say that of course linux and solaris results should be weighted more 
highly than results on macs, because while I love my mac, I've yet to see a 
production cluster running on MacBook Pros... :)




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769056#action_12769056
 ] 

Jake Mannix commented on LUCENE-1997:
-

Java6 is standard in production servers since when?  What justified lucene 
staying on java1.4 for so long if this is the case?  In my own experience, my 
last job only moved to java1.5 a year ago, and at my current company we're 
still on 1.5.  I've seen that be pretty common, and I'm in the Valley, where 
things update pretty quickly.




Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Jake Mannix
On Thu, Oct 22, 2009 at 9:25 PM, Mark Miller  wrote:

> >> The new API is much harder for the
> >> average user to use, and even for the experienced user, it's not
> >> terribly fun,
> >> and more importantly:
>
> Do we have enough info to support that though? In all the cases I have seen
> on the list, people have figured it out pretty easily - haven't really
> seen any complaints in that regard (not counting you and John - that is
> two). The only other complaints I have noticed are those that happened
> to count on unsupported behavior (eg people counting on no MultiSearcher
> use)
>

John and I and TomS all found it complex, and we're all pretty serious
users of inner lucene apis.

You see *core developers* saying the api seems fine.  Have you seen *any
users* of the new sorting api say anything positive about it?  Do you know of
*anyone* who has implemented the new comparator interface at all, let alone
*likes* it?

3 negative votes by users, in comparison to *zero* positive votes by users
together with a bunch of core developers saying, "yeah it looks easy, what
are
you guys complaining about?".

Internal apis take a while to percolate out to the user base - we're only
the first
few running into this, and while the sample size is small, it shouldn't be
discounted.

Yes, of course it is possible to migrate to the new APIs - which is what we,
as well as many others, were in the process of doing.  This is just an example
of an API which got more complex in going to 2.9, and unlike the Collector
API, it's possible that in this case it didn't need to get as complex as it
did.

  -jake


Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Jake Mannix
On Thu, Oct 22, 2009 at 8:30 PM, Yonik Seeley wrote:

> On Thu, Oct 22, 2009 at 11:11 PM, Jake Mannix 
> wrote:
> > It's hard to read the column format, but if you look up above in the
> thread
> > from tonight,
> > you can see that yes, for PQ sizes less than 100 elements, multiPQ is
> > better, and only
> > starts to be worse at around 100 for strings, and 50 for ints.
>
> Ah, OK, I had missed John's followup with the numbers.
>
> I assume this is for Java5 + optimizations?
>

Yeah, this was for Java5 + optimizations.


> What does Java6 show?
>

Java6 on Mac showed close to what Mike posted in his report on the Jira
ticket -
that single-PQ performs a little better for small pq, and more like 30-40%
better
for large pq.


> My biggest reservation is that we've gone down the road of telling
> people to implement a new style of comparators, and told them that the
> old style comparators would be deleted in the next release (which is
> where we are).  Reversing that will be a bit of a headache/question...
> the new stuff isn't deprecated, and having *both* isn't desirable, but
> that's a separate decision to be made apart from performance testing.
>

Well, the issue comes down to this: if the performance is *basically
comparable* between the two approaches, then consider that the new API is
much harder for the average user to use; even for the experienced user it's
not terribly fun; and, more importantly, for the user who has already
implemented custom sorts on the old API, upgrading is enough trouble that
people may decide it's not worth it.  It probably *is* worth it, but if
you're going to even put that kind of thinking in the user's head, you've
got to ask yourself: what's the reasoning for going with a more complex API
if you can get equal (slightly better in some cases, slightly worse in
others) performance with a simpler API?

Yes, as Mike says, the new API is *not* breaking back-compat in a
functional sense, but how many users have converted to the new sorting
api already?  2.9 has barely just come out, and while it's work for the
community as a whole to reconsider the multi-segment sorting api, and
work to implement a change at this level, if it's the right thing to do,
we shouldn't let the question of which method is deprecated dictate
which one *should* be deprecated.


> Is there also an option of using a multiPQ approach with the new style
> comparators?
>

For the record: that would be the worst of all worlds, in my view: the harder
API, with better performance in only some cases, and sometimes worse
performance.

  -jake


[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769042#action_12769042
 ] 

Jake Mannix commented on LUCENE-1997:
-

Hah!  Thanks for posting that, Mark!   Much easier to read. :)

Hey John, can you comment with your hardware specs on this, so it can be 
recorded for posterity? ;)

> Explore performance of multi-PQ vs single-PQ sorting API
> 
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests.  Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available).  Then
> it runs various combinations:
>   * Index with 20 balanced segments vs index with the "normal" log
> segment size
>   * Queries with different numbers of hits (only for wikipedia index)
>   * Different top N
>   * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept.  The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.
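As a side note, the "best of several rounds" loop the script uses is a
standard way to reduce measurement noise; a throwaway sketch of the idea
(hypothetical harness code - runSearches() merely stands in for the real
benchmark workload, and is not the actual sortBench.py logic):

    import java.util.concurrent.ThreadLocalRandom;

    /** Sketch of a best-of-N-rounds QPS measurement (hypothetical). */
    public class BestQpsSketch {
      static int runSearches() { // pretend workload: returns query count
        long end = System.nanoTime() + 100_000_000L; // ~100ms round
        int queries = 0;
        while (System.nanoTime() < end) {
          ThreadLocalRandom.current().nextInt(); // stand-in for a search
          queries++;
        }
        return queries;
      }

      public static void main(String[] args) {
        double bestQps = 0;
        for (int round = 0; round < 7; round++) { // 7 rounds, keep the best
          long start = System.nanoTime();
          int queries = runSearches();
          double qps = queries / ((System.nanoTime() - start) / 1e9);
          bestQps = Math.max(bestQps, qps);
        }
        System.out.printf("best QPS: %.0f%n", bestQps);
      }
    }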






Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Jake Mannix
Of course, John's running on his Mac laptop, which may also be a factor -
another reason he wanted to see whether these results carry over to a
Linux desktop (for example).

  -jake

On Thu, Oct 22, 2009 at 8:11 PM, Jake Mannix  wrote:

> It's hard to read the column format, but if you look up above in the thread
> from tonight,
> you can see that yes, for PQ sizes less than 100 elements, multiPQ is
> better, and only
> starts to be worse at around 100 for strings, and 50 for ints.
>
>   -jake
>
>
> On Thu, Oct 22, 2009 at 8:06 PM, Yonik Seeley 
> wrote:
>
>> On Thu, Oct 22, 2009 at 10:35 PM, John Wang  wrote:
>> >    Please be patient with me.  I am seeing a difference and was
>> > wondering if Mike would see the same thing.
>>
>> Some differences are bound to be seen... with your changes (JVM
>> changes, branch optimizations), are you seeing better average
>> performance with multiPQ?
>>
>> -Yonik
>> http://www.lucidimagination.com
>>


Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Jake Mannix
It's hard to read the column format, but if you look up above in the thread
from tonight, you can see that yes, for PQ sizes less than 100 elements,
multiPQ is better, and only starts to be worse at around 100 for strings,
and 50 for ints.

  -jake

On Thu, Oct 22, 2009 at 8:06 PM, Yonik Seeley wrote:

> On Thu, Oct 22, 2009 at 10:35 PM, John Wang  wrote:
> >    Please be patient with me.  I am seeing a difference and was
> > wondering if Mike would see the same thing.
>
> Some differences are bound to be seen... with your changes (JVM
> changes, branch optimizations), are you seeing better average
> performance with multiPQ?
>
> -Yonik
> http://www.lucidimagination.com
>


Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Jake Mannix
Mark,

  We're not seeing exactly the numbers that Mike is seeing in his tests,
running with JDK 1.5 on Intel Macs, so we're trying to eliminate the
factors of difference.

  Point 2 does indeed make a difference; we've seen it, and it's only fair:
the single-PQ comparator does this branch optimization but the current
multi-PQ patch does not, so let's level the playing field.
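For reference, the branch optimization in question is the one John asks
for in the quoted text below.  A minimal sketch with hypothetical names
(not the actual Lucene comparator code): since ords are small non-negative
ints, a single subtraction can stand in for the two hard-to-predict
branches:

    /** Sketch of the branch-avoiding ord comparator; hypothetical
     *  standalone code, not Lucene's actual comparator classes. */
    public class OrdComparatorSketch {
      int[] order; // per-doc ord of the field value (small, non-negative)

      // Branching version: two compares the CPU may mispredict.
      int compareBranching(int docI, int docJ) {
        if (order[docI] < order[docJ]) return -1;
        if (order[docI] > order[docJ]) return 1;
        return 0;
      }

      // Branch-free version: the subtraction cannot overflow because
      // ords are non-negative, so the sign alone gives the ordering.
      int compareBranchFree(int docI, int docJ) {
        return order[docI] - order[docJ];
      }

      public static void main(String[] args) {
        OrdComparatorSketch c = new OrdComparatorSketch();
        c.order = new int[] {23, 45, 76};
        System.out.println(c.compareBranchFree(0, 2)); // negative: doc 0 first
      }
    }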

  John's on the road with limited net connectivity, but we'll have some
more numbers to compare over the weekend for sure.

  -jake

On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller  wrote:

> Why? What might he find? What's with the cryptic request?
>
> Why would Java 1.5 perform better than 1.6? Does it erase the 20 and 40%
> gains?
>
> I know point 2 certainly doesn't. Cards on the table?
>
> John Wang wrote:
> > Hey Michael:
> >
> >Would you mind rerunning the test you have with jdk1.5?
> >
> >    Also, if you would, change the comparator method to avoid
> > branching for int and string comparators, e.g.
> >
> >
> >   return index.order[i.doc] - index.order[j.doc];
> >
> >
> > Thanks
> >
> >
> > -John
> >
> >
> > On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless
> > <luc...@mikemccandless.com> wrote:
> >
> > On Thu, Oct 22, 2009 at 2:17 AM, John Wang  wrote:
> >
> > >  I have been playing with the patch, and I think I have some
> > information
> > > that you might like.
> > >  Let me spend some time and gather some more numbers and
> > update in jira.
> >
> > Excellent!
> >
> > >  Say bottom has ords 23, 45, 76, each corresponding to a string.
> > > When moving to the next segment, you need bottom to have ords that
> > > are comparable to other docs in this new segment, so you would need
> > > to find the new ords for the values of 23, 45, and 76, don't you?
> > > To find them, assuming the values are s1, s2, s3, you would do a
> > > binary search on the new val array, and find the index for s1, s2, s3.
> >
> > It's that inversion (from ord->Comparable in first seg, and
> > Comparable->ord in second seg) that I'm trying to avoid (w/ this new
> > proposal).
> >
> > > Which is 3 binary searches per conversion; I am not sure how you can
> > > short-circuit it. Are you suggesting we call Comparable on
> > > compareBottom until some doc beats it?
> >
> > I'm saying on seg transition you indeed get the Comparable for the
> > current bottom, but don't attempt to invert it.  Instead, as seg 2
> > finds a hit, you get that hit's Comparables and compare to bottom.  If
> > it beats bottom, it goes into the queue.  If it does not, you use the
> > ord (in seg 2's ord space) to "learn" a bottom in the ord space of seg 2.
> >
> > > That would hurt performance a lot though, no?
> >
> > Yeah I think likely it would, since we're talking about a binary
> > search on transition VS having to do possibly many
> > upgrade-to-Comparable and compare-Comparables to slowly learn the
> > equivalent ord in the new segment.  I was proposing it for cases where
> > inversion is very difficult.  But realistically, since you must keep
> > around the full ord -> Comparable map for every segment anyway (in
> > order to merge in the end), inversion shouldn't ever actually be
> > "difficult" -- it'd just be a binary search on presumably in-RAM
> > storage.
> >
> > Mike
> >
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
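Mike's "learn a bottom" idea quoted above amounts to something like the
following sketch (hypothetical standalone code; for simplicity it ignores
that bottom itself changes once a hit actually enters the queue):

    /** Lazily tighten an ord-space bound for the queue bottom from hits
     *  that fail the slow compare, instead of binary-searching for it up
     *  front.  Keeps the K largest values; all names are hypothetical. */
    public class LearnedBottomSketch {
      public static void main(String[] args) {
        String bottomValue = "melon";  // queue bottom from prior segments
        String[] seg2Values = {"apple", "kiwi", "pear", "banana", "plum", "fig"};
        int[] seg2Ords = {0, 3, 4, 1, 5, 2}; // each doc's ord in seg 2's space
        int learnedOrd = -1;           // ords <= this are non-competitive
        int slowCompares = 0;
        for (int doc = 0; doc < seg2Values.length; doc++) {
          if (seg2Ords[doc] <= learnedOrd) continue; // fast int-only reject
          slowCompares++;                            // one slow string compare
          if (seg2Values[doc].compareTo(bottomValue) > 0) {
            System.out.println("competitive: " + seg2Values[doc]);
          } else {
            learnedOrd = seg2Ords[doc];              // learned a tighter bound
          }
        }
        System.out.println("slow string compares: " + slowCompares); // 4, not 6
      }
    }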
>


Re: lucene 2.9 sorting algorithm

2009-10-19 Thread Jake Mannix
Given that this new API is pretty unwieldy, and seems to not actually
perform any better than the old one... are we going to consider revisiting
that?

  -jake

On Mon, Oct 19, 2009 at 11:27 PM, Uwe Schindler  wrote:

>  The old search API is already removed in trunk…
>
>
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>   --
>
> *From:* John Wang [mailto:john.w...@gmail.com]
> *Sent:* Tuesday, October 20, 2009 3:28 AM
> *To:* java-dev@lucene.apache.org
> *Subject:* Re: lucene 2.9 sorting algorithm
>
>
>
> Hi Michael:
>
>
>
>  Was wondering if you got a chance to take a look at this.
>
>
>
>  Since deprecated APIs are being removed in 3.0, I was wondering
> if/when we would decide on keeping the ScoreDocComparator API, and thus
> whether it would be kept for Lucene 3.0.
>
>
>
> Thanks
>
>
>
> -John
>
> On Fri, Oct 16, 2009 at 9:53 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
> Oh, no problem...
>
> Mike
>
>
> On Fri, Oct 16, 2009 at 12:33 PM, John Wang  wrote:
> > Mike, just a clarification on my first perf report email.
> > The first section's numHits is incorrectly labeled; it should be 20
> > instead of 50.  Sorry about the possible confusion.
> > Thanks
> > -John
> >
> > On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
> >  wrote:
> >>
> >> Thanks John; I'll have a look.
> >>
> >> Mike
> >>
> >> On Fri, Oct 16, 2009 at 12:57 AM, John Wang 
> wrote:
> >> > Hi Michael:
> >> > I added classes: ScoreDocComparatorQueue and OneSortNoScoreCollector
> >> > as a more general case.  I think keeping the old API for
> >> > ScoreDocComparator and SortComparatorSource would work.
> >> >   Please take a look.
> >> > Thanks
> >> > -John
> >> >
> >> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang 
> wrote:
> >> >>
> >> >> Hi Michael:
> >> >>  It is open:
> >> >> http://code.google.com/p/lucene-book/source/checkout
> >> >>  I think I sent the https url instead, sorry.
> >> >> The multi-PQ sorting is fairly self-contained; I have 2 versions,
> >> >> 1 for string and 1 for int, each of which is a Collector impl.
> >> >>  I shouldn't say the Multi Q is faster on int sort; it is within
> >> >> the error bounds.  The diff is very, very small; I would say they
> >> >> are about equal.
> >> >>  If you think it is a good thing to go this way (if not for the
> >> >> perf, just for the simpler API), I'd be happy to work on a patch.
> >> >> Thanks
> >> >> -John
> >> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
> >> >>  wrote:
> >> >>>
> >> >>> John, looks like this requires login -- any plans to open that up,
> >> >>> or post the code on an issue?
> >> >>>
> >> >>> How self-contained is your Multi PQ sorting?  EG is it a standalone
> >> >>> Collector impl that I can test?
> >> >>>
> >> >>> Mike
> >> >>>
> >> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang 
> >> >>> wrote:
> >> >>> > BTW, we have a little sandbox for these experiments, and all my
> >> >>> > test code is at the link below.  It's not very polished.
> >> >>> >
> >> >>> > https://lucene-book.googlecode.com/svn/trunk
> >> >>> >
> >> >>> > -John
> >> >>> >
> >> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang 
> >> >>> > wrote:
> >> >>> >>
> >> >>> >> Numbers Mike requested for Int types:
> >> >>> >>
> >> >>> >> Only the time/cputime are posted; the others are all the same
> >> >>> >> since the algorithm is the same.
> >> >>> >>
> >> >>> >> Lucene 2.9:
> >> >>> >> numhits: 10
> >> >>> >> time: 14619495
> >> >>> >> cpu: 146126
> >> >>> >>
> >> >>> >> numhits: 20
> >> >>> >> time: 14550568
> >> >>> >> cpu: 163242
> >> >>> >>
> >> >>> >> numhits: 100
> >> >>> >> time: 16467647
> >> >>> >> cpu: 178379
> >> >>> >>
> >> >>> >>
> >> >>> >> my test:
> >> >>> >> numHits: 10
> >> >>> >> time: 14101094
> >> >>> >> cpu: 144715
> >> >>> >>
> >> >>> >> numHits: 20
> >> >>> >> time: 14804821
> >> >>> >> cpu: 151305
> >> >>> >>
> >> >>> >> numHits: 100
> >> >>> >> time: 15372157
> >> >>> >> cpu time: 158842
> >> >>> >>
> >> >>> >> Conclusions:
> >> >>> >> They are very similar; the differences are all within error
> >> >>> >> bounds, especially with lower PQ sizes, where the second sort
> >> >>> >> alg is again slightly faster.
> >> >>> >>
> >> >>> >> Hope this helps.
> >> >>> >>
> >> >>> >> -John
> >> >>> >>
> >> >>> >>
> >> >>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
> >> >>> >> 
> >> >>> >> wrote:
> >> >>> >>>
> >> >>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
> >> >>> >>>  wrote:
> >> >>> >>> > Though it'd be odd if the switch to searching by segment
> >> >>> >>> > really was most of the gains here.
> >> >>> >>>
> >> >>> >>> I had assumed that much of the improvement was due to ditching
> >> >>> >>> MultiTermEnum/MultiTermDocs.
> >> >>> >>> Note that LUCENE-1483 was before LUCENE-1596... but that only
> >> >>> >>> helps with queries that use a TermEnum (range, prefi

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Jake Mannix
On Thu, Oct 15, 2009 at 2:33 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

>
> I don't think we do any branch tuning on the PQ insertion -- the ifs
> involved in re-heapifying the PQ are simply hard for the CPU to
> predict (though, apparently, not as hard as comparing strings ;).
>

But it does look like you do some nice short-circuiting where you compare
to bottom instead of doing the naive insertWithOverflow() and checking the
result, which is cool.
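To make that concrete, here's a toy version of the short-circuit using a
plain java.util.PriorityQueue (a sketch with hypothetical names, not the
actual collector code, which works on pre-allocated slots rather than
boxed values):

    import java.util.PriorityQueue;

    /** Keep the top-K largest hits, only touching the heap when a hit
     *  actually beats the current bottom.  Hypothetical sketch. */
    public class TopKSketch {
      public static void main(String[] args) {
        int k = 5;
        // Java's PriorityQueue is a min-heap, so peek() is the "bottom".
        PriorityQueue<Integer> pq = new PriorityQueue<>(k);
        int[] hits = {3, 9, 1, 27, 14, 2, 30, 8, 22, 5};
        for (int hit : hits) {
          if (pq.size() < k) {
            pq.offer(hit);              // queue not yet full: always insert
          } else if (hit > pq.peek()) { // one cheap compare with bottom
            pq.poll();                  // only competitive hits pay the
            pq.offer(hit);              // O(log K) re-heapify cost
          }                             // else: short-circuit, no heap work
        }
        System.out.println(pq);         // the top-5 values, heap-ordered
      }
    }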


> > The single-PQ approach requires only O(K * log(K) * log(M))
> > insertions, but each one requires, instead of O(log(K)) int compares,
> > O(log(K)) *string* compares, and a segment conversion.
>
> Right, except, for a large segment, if it has enough insertions, as
> more entries in the PQ belong to that segment, more of the comparisons
> become ints.
>

Assuming you're processing the larger segments first, as Yonik said is
done, then this won't typically help very much, right?  If the second
segment is much smaller than the first, then not as many replacements
will happen, leaving the majority of the PQ holding entries from the
previous segment.  Additionally, as you move to further and further
segments, whether they are the same size or smaller than the previous
ones, the bottom element of the PQ is getting better and better, so your
chances of filling even a significant minority of it with entries from
the current segment are very small.

> Looking back through LUCENE-1483, we were seeing much more sizable
> gains in the new API vs the old single-PQ-only-int-comparison approach
> (which should be faster than both the single and multi PQ approaches
> we are discussing here).  BUT: those tests combined searching by
> segment AND the new FieldCollector API, plus other more basic code
> speedups.  Though it'd be odd if the switch to searching by segment
> really was most of the gains here.  I'm confused (for now) on where
> the gains we were seeing in that issue came from...
>

You don't give yourself enough credit here: it could certainly be that the
perf improvement was from more basic code-level speedups, as we're
only dealing with the sub-leading behavior here, and the constants
look like they most certainly matter!

  -jake


Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Jake Mannix
On Thu, Oct 15, 2009 at 2:12 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Nice results!  Comments below...
>
> > Here are the numbers (times are measured in nanoseconds):
> >
> > numHits = 50:
> >
> > Lucene 2.9/OneComparatorNonScoringCollector:
> > num string compares: 251
> > num conversions: 34
> > time: 15272049
> > cpu time:  161105
> > num inserts into PQ: 211
> >
> > My test sort algorithm:
> > num string compares: 51
> > num conversions: 0
> > time: 14943172
> > cpu time: 153722
> > num inserts into PQ: 1500
> >
> >
> > numHits = 100:
> >
> >
> > Lucene 2.9/OneComparatorNonScoringCollector:
> > num string compares: 2285
> > num conversions: 172
> > time: 17703866
> > cpu time:  187073
> > num inserts into PQ: 982
> >
> > My test sort algorithm:
> > num string compares: 210
> > num conversions: 0
> > time: 15823473
> > cpu time: 160737
> > num inserts into PQ: 6011
>
> These are nice results.  Single PQ does look faster for this case.
>

You mean multiPQ, right?


Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Jake Mannix
On Thu, Oct 15, 2009 at 3:12 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> If I remembering it right... this (matching MultiSearcher's approach)
> was nearly the first thing we tried with LUCENE-1483.  But the CPU
> cost was higher in our tests.  I think we had tested unbalanced and
> balanced segments, but memory is definitely somewhat hazy at this
> point...
>

Yeah, from what I could tell, the way you used the PQ API, comparing
only with the bottom until you saw it beat the bottom instead of always
just calling insertWithOverflow(), leads to a significant performance
improvement, but this can be done with the MultiSearcher / multiple-PQ
approach as well - like I said, I'm pretty sure the overall complexity is
the same (ie. leading complexity, with the factor of numDocs = N * M out
front), and we're just talking about sub-leading behavior at this point.


> I suspect even in the balanced case the cost will be higher with the
> "separate PQs then merge at the end" approach.  It's not only the int
> ord comparisons (which are cheap) that take CPU.  It's the if
> statements that are executed as a result of the comparison when
> re-heapifying the PQ.  Those are far more costly especially if the CPU
> has trouble predicting which way the if will go, which it does here.


> Ie, it's the insertions into each PQ that are expensive; by tracking a
> separate PQ per reader you'll have many more insertions since
> each PQ must re-converge.
>

Insertions are expensive - totally agreed, and subtle branching tweaks
make a big difference at the sub-leading level.  So we need to do a careful
analysis of the sub-leading behavior (O(K,M) but O(1) in terms of N):

The multi-PQ approach requires more insertions: O(M * K log(K))
insertions, but each one can be made very fast, with careful branch
tuning similar to the tweaks currently in the 2.9 code - as all of the
comparisons are ints.  The final merging requires K log(M) string
compares as well.

The single-PQ approach requires only O(K * log(K) * log(M))
insertions, but each one requires, instead of O(log(K)) int compares,
O(log(K)) *string* compares, and a segment conversion.

Plugging in some numbers, with K = 100 and M = 20:

  multiPQ:  ~14,000 insertions with ~7 int compares each,
            followed by ~400 string compares for the merge
vs.
  singlePQ: ~2,800 insertions with maybe 4 string compares
            and 3 int compares each

-

So yeah, executive summary in theory: *at sub-leading order*, Yonik
is right, singlePQ has fewer overall operations (log(M) vs M), but it
has way more slow string compares (K log(K)^2 log(M) vs K log(M)).
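If it helps, here's a quick sanity check of those counts (a throwaway
sketch; the base-2 logs and the formulas above are the model's
assumptions, not measurements):

    /** Back-of-the-envelope operation counts for K = 100, M = 20. */
    public class OpCountSketch {
      static double log2(double x) { return Math.log(x) / Math.log(2); }

      public static void main(String[] args) {
        int k = 100, m = 20;
        double multiInserts = m * k * log2(k);        // ~13,300 (~14,000 above)
        double mergeStrCmps = k * log2(m);            // ~430 (~400 above)
        double singleInserts = k * log2(k) * log2(m); // ~2,900 (~2,800 above)
        System.out.printf("multiPQ:  %.0f inserts + %.0f merge string cmps%n",
            multiInserts, mergeStrCmps);
        System.out.printf("singlePQ: %.0f inserts%n", singleInserts);
      }
    }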


> But I agree it's not obvious which way it'll go... so why not simply
> build a comparator that does this and then let the computer tell us?
>

I agree completely - this is all talk until I see it run on an actual index.
I know John was writing up some code to port bobo to 2.9, so perhaps
you can try and compare these two, with a couple different values of K,
John?

I guess a bigger question than just the sub-leading behavior (if we're
talking 3.5ms vs. 3.1ms, this is really academic) is this: if the overall
performance is basically the same (which I'm guessing is the case),
then for developers, writing and understanding Comparators for the
old API is *way* easier than with the current one, and the multi-PQ
technique makes it easy to reuse the same logic and understanding
that already exists for MultiSearcher.

Of course, if my calculations above are wrong, or the sub-leading
behavior is dominated by the branching and string compares are super
fast, then difficulty of API be damned, we stick with what performs
best.

  -jake


Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Jake Mannix
On Thu, Oct 15, 2009 at 9:12 AM, Yonik Seeley wrote:

> On Thu, Oct 15, 2009 at 11:53 AM, Yonik Seeley
>  wrote:
> > And it seems like a PQ per segment simply delays many of the slow
> > lookups until the end where the PQs must be merged.
>
> Actually, I'm wrong about that part - one can simply merge on
> values... there will be lots of string comparisons (and a new merge
> queue) but no binary searches.
>

The number of string comparisons in the final merge of queues is
O(K log(M) ) if you do as you're describing (have basically a PQ of sorted
lists, the same way BooleanScorer works), where K is going to be roughly
what, 10, 100 (numResultsToReturn), and M is 10-50?  This is like a couple
thousand string compares, tops, and independent of the overall size of the
segments - it's measured in microseconds of CPU time.
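A toy sketch of that merge, assuming each segment hands over its top K
values already sorted (hypothetical standalone code, with M = 3 segments):

    import java.util.*;

    /** Sketch of the final merge being estimated above: a heap over M
     *  sorted per-segment top-K lists, popping the global top K with
     *  O(K log(M)) string compares.  Hypothetical standalone code. */
    public class QueueMergeSketch {
      public static void main(String[] args) {
        List<List<String>> segmentTops = List.of(
            List.of("apple", "cherry", "melon"),   // segment 0's top values
            List.of("banana", "kiwi", "pear"),     // segment 1's
            List.of("fig", "grape", "plum"));      // segment 2's
        int k = 5;
        // heap entries: {segment index, position within that segment's list}
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> segmentTops.get(e[0]).get(e[1])));
        for (int m = 0; m < segmentTops.size(); m++) heads.add(new int[] {m, 0});
        List<String> merged = new ArrayList<>();
        while (merged.size() < k && !heads.isEmpty()) {
          int[] top = heads.poll();            // O(log M) string compares
          merged.add(segmentTops.get(top[0]).get(top[1]));
          if (top[1] + 1 < segmentTops.get(top[0]).size()) {
            heads.add(new int[] {top[0], top[1] + 1});
          }
        }
        System.out.println(merged); // [apple, banana, cherry, fig, grape]
      }
    }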

  -jake


Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Jake Mannix
I had to dig through the source code (actually, walk through a unit test,
because that was simpler to see what was going on in the 2.9 sorting), but
I think John's way has slightly lower complexity in the balanced segment
size case.

On Wed, Oct 14, 2009 at 8:57 PM, Yonik Seeley wrote:

> Interesting idea... though one further piece of info in the mix is
> that large segments are typically processed first, and tend to fill up
> the priority queue.


[noted: in the unbalanced case, things work out nicely for the 2.9 approach
compared to the balanced case, but M roughly equal balanced segments is a
common enough use case to consider it non-extreme (c.f.
BalancedMergePolicy).  Of course this means that the sorting is very easy
using John's technique as well, as the number of string compares in the
priority-queue merge at the end is much lower.]


> Conversion from one segment to another is only
> done as needed... only the bottom slot is converted automatically when
> the segment is switched.
>

That's not what it looks like, actually: you convert the bottom slot, and
as soon as you get another hit from the new segment which is
competitive, it pushes what was the old bottom out, and there's a new
bottom, which often will be from one of the previous segments, in which
case *it* must be converted to the new ordering as well.

Additionally, whenever this happens, and a doc from the new segment
is competitive (which means that it wins in a fast int ordinal compare
with the bottom), it needs to do O(log(K)) slow *string* compares to find
out where in the queue this newly competitive doc belongs (where K is
numHits = size of the queue).
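For concreteness, the conversion being described is essentially a binary
search per value, as in this sketch (hypothetical code; a real
implementation also needs a tie-break convention for values absent from
the new segment):

    import java.util.Arrays;

    /** Sketch of the conversion: map the queue bottom's string value into
     *  the new segment's ord space by binary search over the segment's
     *  sorted value array.  Hypothetical standalone code. */
    public class OrdConversionSketch {
      static int convertBottom(String[] newSegmentValues, String bottomValue) {
        int idx = Arrays.binarySearch(newSegmentValues, bottomValue);
        // If absent, binarySearch returns -(insertionPoint) - 1; the
        // insertion point still orders correctly against the new ords
        // (a real impl needs a tie-break convention for such values).
        return idx >= 0 ? idx : -idx - 1;
      }

      public static void main(String[] args) {
        String[] seg2Values = {"ant", "bee", "cat", "dog", "emu"}; // sorted
        System.out.println(convertBottom(seg2Values, "cat")); // 2
        System.out.println(convertBottom(seg2Values, "cow")); // 3
      }
    }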


> Filling a complete priority queue for each segment is also more
> expensive (from the big-O perspective).


Huh?  If you have M roughly balanced segments of N docs each, and
you want to find the top K (out of all M*N) sorted docs, John is suggesting
(correct me if I'm wrong here, John) that you keep M priority queues of
size K, filling them in sequence: on average (and when K << N) this takes
M * N * c_i, where c_i is the cost of an integer compare, and then you
need to merge M sorted string lists of size K, which takes K log(M) * c_s,
where c_s is the cost of doing a string compare.  This latter factor
is sub-leading in N, which is the biggest number around, and the
M * N * c_i factor is the same as the dominating factor in the current
2.9 sorting, because M * N * c_i is minimal cost to compare with the
bottom of the queue across all M * N docs.

So the leading-in-M*N factors from both approaches are the same, and
anything without a factor of N is tiny (if N might be 10^6, M is maybe
10-20, K is 10-100), but I worked out the sub-leading factors on a
sheet of paper and I think I've convinced myself that John's form
actually wins out even there, if you keep track of how often string
compares and conversions are required.

I can write up that O(K,M) analysis if you'd like, but it should be
mostly academic - for K, M << N, both John's approach and the
current 2.9 have the same leading behavior and should perform
equally well.  After all, the multiple-queues approach is what Lucene
has always done for the multi-reader case; why should multiple
SegmentReaders be any different?
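To make the comparison concrete, here's a toy sketch of the multi-PQ fill
phase, under the assumption that we keep the K largest ords per segment
and resolve strings only for the survivors; the final merge is then the
K log(M) string-compare step discussed above:

    import java.util.PriorityQueue;

    /** Sketch of the multi-PQ fill phase: one queue of size K per segment,
     *  filled with cheap int ord compares only; strings are resolved just
     *  for each segment's K survivors before the final merge.  Hypothetical
     *  standalone code, keeping the K largest ords for concreteness. */
    public class MultiPQFillSketch {
      public static void main(String[] args) {
        int k = 3;
        int[][] segmentOrds = {{5, 1, 9, 3, 7}, {2, 8, 4, 6, 0}}; // per-doc ords
        for (int[] ords : segmentOrds) {
          PriorityQueue<Integer> pq = new PriorityQueue<>(); // min-heap on ord
          for (int doc = 0; doc < ords.length; doc++) {
            if (pq.size() < k) {
              pq.offer(ords[doc]);
            } else if (ords[doc] > pq.peek()) { // int-only compare with bottom
              pq.poll();
              pq.offer(ords[doc]);
            }
          }
          // here the K surviving ords would be looked up as strings for the
          // K log(M) merge step described earlier in the thread
          System.out.println("segment top-" + k + " ords: " + pq);
        }
      }
    }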

  -jake