[jira] Commented: (LUCENE-129) Finalizers are non-canonical

2005-11-16 Thread Sam Hough (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-129?page=comments#action_12357779 ] 

Sam Hough commented on LUCENE-129:
--

I think FSDirectory needs a finalize method added to remove its reference
from FSDirectory.DIRECTORIES; otherwise, even with normal garbage collection,
directories could linger.

I presume the originator of this issue is commenting on the finalize methods for
the Input and Output Streams.

I'm assuming that the intention is for Lucene to clean up after itself even if
close is not called explicitly. If this really is a bug then I'm happy to try to
construct a unit test to check that FSDirectory cleans up after itself properly.


> Finalizers are non-canonical
> 
>
>  Key: LUCENE-129
>  URL: http://issues.apache.org/jira/browse/LUCENE-129
>  Project: Lucene - Java
> Type: Bug
>   Components: Other
> Versions: unspecified
>  Environment: Operating System: other
> Platform: All
> Reporter: Esmond Pitt
> Assignee: Lucene Developers
> Priority: Minor

>
> The canonical form of a Java finalizer is:
> protected void finalize() throws Throwable
> {
>  try
>  {
>    // ... local code to finalize this class
>  }
>  catch (Throwable t)
>  {
>  }
>  super.finalize(); // finalize base class.
> }
> The finalizers in IndexReader, IndexWriter, and FSDirectory don't conform. This
> is probably minor or null in effect, but the principle is important.
> As a matter of fact FSDirectory.finalize() is entirely redundant and could be
> removed, as it doesn't do anything that RandomAccessFile.finalize wouldn't do
> automatically.
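
As an aside on the form quoted above: a frequently recommended variant puts
super.finalize() in a finally block, so the base class is finalized even if the
local cleanup throws. A minimal sketch, assuming a hypothetical resource class
(TrackedResource and its closed flag are illustrative, not Lucene code). Note
that finalize() is deprecated in current JDKs, so an explicit close() remains
the primary cleanup path and the finalizer is only a safety net.

```java
// Hypothetical resource class, for illustration only; not part of Lucene.
class TrackedResource {
    private boolean closed = false;

    /** Explicit cleanup; safe to call more than once. */
    public void close() {
        closed = true;
    }

    public boolean isClosed() {
        return closed;
    }

    /** Safety net only: runs (if at all) once the object is unreachable. */
    @Override
    @SuppressWarnings({"deprecation", "removal"})
    protected void finalize() throws Throwable {
        try {
            close();            // local cleanup for this class
        } finally {
            super.finalize();   // always finalize the base class
        }
    }

    public static void main(String[] args) {
        TrackedResource r = new TrackedResource();
        r.close();
        System.out.println(r.isClosed());
    }
}
```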

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-129) Finalizers are non-canonical

2005-11-16 Thread Sam Hough (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-129?page=comments#action_12357780 ] 

Sam Hough commented on LUCENE-129:
--

Doh. Sorry. Been a long day. Finalize won't be called if DIRECTORIES still
points at it :( Think twice, post once.

Does this mean that clients of FSDirectory should have finalize methods that
close the Directory?
IndexReader.finalize for instance just cleans up its lock but doesn't call
close()!?

It is making my head hurt thinking back to C++ days of no automatic garbage
collection.

Sorry.




Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-16 Thread Yonik Seeley
If that's the way to go, we should do it by default so the user doesn't have to.

Unless the scores between two types of queries are compatible, it's a
bad idea to transparently switch between them since it will cause
relevancy to unpredictably change in the future (triggered by either a
query changing slightly, or the index changing slightly).

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

On 11/16/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
> Why not leave that decision to the program using the query?
> Something like this:
> - catch the TooManyClauses exception,
> -  adapt (the offending parts of) the query to make these use
>    a FieldNormQuery,
> - retry with a warning to the provider of the query that
>   the term frequencies have been ignored.
>
> The only disadvantage of this is that the term expansion
> during rewrite has to be redone.
> Also, when enough terms are involved, they might still cause
> a memory problem because they are all needed at the same
> time.
>
> Regards,
> Paul Elschot
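
The catch-adapt-retry steps in the quoted suggestion can be sketched roughly as
follows. This is only an illustration under stated assumptions: the nested
TooManyClauses class is a local stand-in for Lucene's
BooleanQuery.TooManyClauses so the example compiles on its own,
searchWithFallback is a hypothetical helper, and the two suppliers stand for
"run the scored query" and "run the adapted FieldNormQuery-style query".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class TooManyClausesFallback {
    /** Local stand-in for Lucene's BooleanQuery.TooManyClauses (assumption). */
    static class TooManyClauses extends RuntimeException {
        private static final long serialVersionUID = 1L;
    }

    /**
     * Runs the scored search; if it fails with TooManyClauses, records a
     * warning for the provider of the query and runs the fallback instead.
     */
    static <T> T searchWithFallback(Supplier<T> scored, Supplier<T> fallback,
                                    List<String> warnings) {
        try {
            return scored.get();
        } catch (TooManyClauses e) {
            warnings.add("too many clauses: term frequencies were ignored");
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        String hits = searchWithFallback(
                () -> { throw new TooManyClauses(); },  // simulated expansion failure
                () -> "hits from fallback query",
                warnings);
        System.out.println(hits + " " + warnings);
    }
}
```

As the quoted text notes, the term expansion is done twice on the fallback path,
which is the cost of leaving the decision to the caller.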




Re: Lucene Index backboned by DB

2005-11-16 Thread Robert Kirchgessner
Hi, 

> 1) It might be OK to implement retrieving field values separately for a
> document. However, I think from a simplicity point of view, it might be
> better to have the application code do this drudgery. Adding this feature
> could complicate the nice and simple design of Lucene without much benefit.

Yes, it is possible to store document parts in a second index or in a
database. If only a small number of documents at a time must be
loaded into memory, there are good solutions for that with the existing
storage model.

> 2) The application could separate a document into several documents, for
> example, one document for indexing mainly, the other documents for storing
> binary values for different fields. Thus, giving the relevant doc id, its
> associated binary value for a particular field could be loaded very fast
> with just a disk lookup (looking up the fdx file).

But consider the problem of using the information stored in a field for
sorting purposes (dates, any numeric attributes of documents). There
is now a special case implemented in Lucene: norms are stored as bytes
per document in a single file per segment. They are designed to be loaded
all at once into memory to completely avoid a disk lookup for every document
to be weighted.

>
> This way, only the relevant field is loaded into memory rather than all of
> the fields for a doc. There is no change on Lucene side, only some more
> work for the application code.

I may have missed something, but I don't know how to implement
fast custom sorting without a disk access per document with the
current Lucene interface. Please tell me if I'm completely wrong.

Having the possibility of storing a binary field of fixed length in a
separate file to be loaded at once into memory for fast access would
solve the problem. Is it worth the effort? I don't think it would
make the interface that much more complicated.

Any comments?
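
To make the fixed-length idea concrete, here is a rough sketch of the 3-byte
date layout from the quoted message below (15 bits year, 4 bits month, 5 bits
day), kept in one flat in-memory array with 3 bytes per document. The class and
method names (PackedDate, pack, store, load) are made up for illustration and
are not part of Lucene.

```java
class PackedDate {
    /** Packs a date into 24 bits: 15 bits year, 4 bits month, 5 bits day. */
    static int pack(int year, int month, int day) {
        if (year < 0 || year > 0x7FFF || month < 1 || month > 12 || day < 1 || day > 31) {
            throw new IllegalArgumentException("date out of range");
        }
        return (year << 9) | (month << 5) | day;
    }

    static int year(int packed)  { return packed >>> 9; }
    static int month(int packed) { return (packed >>> 5) & 0xF; }
    static int day(int packed)   { return packed & 0x1F; }

    /** Writes a packed date into slot docId of a table holding 3 bytes per document. */
    static void store(byte[] table, int docId, int packed) {
        int off = docId * 3;
        table[off]     = (byte) (packed >>> 16);
        table[off + 1] = (byte) (packed >>> 8);
        table[off + 2] = (byte) packed;
    }

    /** Reads the packed date for docId back as a non-negative int. */
    static int load(byte[] table, int docId) {
        int off = docId * 3;
        return ((table[off] & 0xFF) << 16)
             | ((table[off + 1] & 0xFF) << 8)
             | (table[off + 2] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] table = new byte[3 * 2];          // two documents
        store(table, 0, pack(2005, 11, 16));
        store(table, 1, pack(2005, 11, 15));
        // Packed values compare in chronological order, so sorting needs no decoding.
        System.out.println(load(table, 1) < load(table, 0));
    }
}
```

Because the year occupies the high bits, comparing the packed integers orders
documents by date directly, which is what makes the in-memory sort cheap.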



>
> My view is that a search library (or, in general, any library) should be small
> and efficient, since it is used by a lot of applications; any additional
> feature could potentially impact its robustness and expose it to
> performance drawbacks.
>
> Any criticism or comments are welcome.
>
> Jian
>
> On 11/15/05, Robert Kirchgessner <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > a discussion in
> >
> > http://issues.apache.org/jira/browse/LUCENE-196
> >
> > might be of interest to you.
> >
> > Did you think about storing the large pieces of documents
> > in a database to reduce the size of Lucene index?
> >
> > I think there are good reasons to adding support for
> > storing fields in separate files:
> >
> > 1. One could define a binary field of fixed length and store it
> > in a separate file. Then load it into memory and have fast
> > access for field contents.
> >
> > A use case might be: store calendar date (YYYY-MM-DD)
> > in three bytes, 4 bits for months, 5 bits for days and up to
> > 15 bits for years. If you want to retrieve hits sorted by date
> > you can load the fields file of size (3 * documents in index) bytes
> > and support sorting by date without accessing hard drive
> > for reading dates.
> >
> > 2. One could store document contents in a separate
> > file and fields of small size like title and some metadata
> > in the way it is stored now. It could speed up access to
> > fields. It would be interesting to know whether you gain
> > significant performance leaving the big chunks out, i.e.
> > not storing them in index.
> >
> > In my opinion 1. is the most interesting case: storing some
> > binary fields (dates, prices, length, any numeric metrics of
> > documents) would enable *really* fast sorting of hits.
> >
> > Any thoughts about this?
> >
> > Regards,
> >
> > Robert
> >
> >
> >
> > We have a similar problem
> >
> > Am Dienstag, 15. November 2005 23:23 schrieb Karel Tejnora:
> > > Hi all,
> > > in our testing application we are using Lucene 1.4.3. Thank you guys for
> > > that great job.
> > > We have an index of around 12GiB, in one file (merged). Retrieving hits
> > > takes a nice small amount of time, but reading the fields (the stored
> > > ones) takes 10-100 times longer, I think because all the fields are read.
> > > I would like to try to implement the Lucene index files as tables in a db
> > > with some lazy field loading. Searching the web I have found only an
> > > implementation of store.Directory (bdb), but it only holds data as binary
> > > streams. This technique will not be so helpful because BLOB operations
> > > do not perform fast. On the other side I will lose some of the
> > > freedom of document field variability, but I can omit a lot of the
> > > skipping and many opened files. Also IndexWriter could have document/term
> > > locking granularity.
> > > So I think that way leads to extending IndexWriter / IndexReader and
> > > having my own implementation of the index.Segment* classes. Is it the best
> > > way, or am I missing something about how to achieve this?
> > > If it is a bad idea, I will be happy to hear other possibilities.
> > >
> > > 

[jira] Resolved: (LUCENE-395) CoordConstrainedBooleanQuery + QueryParser support

2005-11-16 Thread Yonik Seeley (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-395?page=all ]
 
Yonik Seeley resolved LUCENE-395:
-

Resolution: Fixed
 Assign To: Yonik Seeley  (was: Lucene Developers)

fixed BooleanQuery hashCode/equals and committed patches.

> CoordConstrainedBooleanQuery + QueryParser support
> --
>
>  Key: LUCENE-395
>  URL: http://issues.apache.org/jira/browse/LUCENE-395
>  Project: Lucene - Java
> Type: Improvement
>   Components: Search
> Versions: unspecified
>  Environment: Operating System: other
> Platform: Other
> Reporter: Mark Harwood
> Assignee: Yonik Seeley
> Priority: Minor
>  Attachments: BooleanQuery.patch, BooleanQuery.patch, BooleanScorer2.java, 
> CoordConstrainedBooleanQuery.java, CoordConstrainedBooleanQuery.java, 
> CustomQueryParserExample.java, CustomQueryParserExample.java, 
> LUCENE-395.patch, LUCENE-395.patch, LUCENE-395.patch, TestBoolean2Patch5.txt, 
> TestBooleanMinShouldMatch.java, TestBooleanMinShouldMatch.java, 
> TestBooleanMinShouldMatch.java, TestBooleanMinShouldMatch.java, 
> TestBooleanMinShouldMatch.java, TestBooleanMinShouldMatch.java
>
> Attached 2 new classes:
> 1) CoordConstrainedBooleanQuery
> A boolean query that only matches if a specified number of the contained
> clauses match. An example use might be a query that returns a list of books
> where ANY 2 people from a list of people were co-authors, eg:
> "Lucene In Action" would match ("Erik Hatcher" "Otis Gospodnetić"
> "Mark Harwood" "Doug Cutting") with a minRequiredOverlap of 2 because Otis
> and Erik wrote that.
> The book "Java Development with Ant" would not match because only 1 element in
> the list (Erik) was selected.
> 2) CustomQueryParserExample
> A customised QueryParser that allows definition of
> CoordConstrainedBooleanQueries. The solution (mis)uses fieldnames to pass
> parameters to the custom query.




[jira] Created: (LUCENE-466) Need QueryParser support for BooleanQuery.minNrShouldMatch

2005-11-16 Thread Yonik Seeley (JIRA)
Need QueryParser support for BooleanQuery.minNrShouldMatch
--

 Key: LUCENE-466
 URL: http://issues.apache.org/jira/browse/LUCENE-466
 Project: Lucene - Java
Type: Improvement
  Components: Search  
Versions: unspecified
 Environment: Operating System: other
Platform: Other
Reporter: Mark Harwood
 Assigned to: Yonik Seeley 
Priority: Minor


Attached 2 new classes:

1) CoordConstrainedBooleanQuery
A boolean query that only matches if a specified number of the contained clauses
match. An example use might be a query that returns a list of books where ANY 2
people from a list of people were co-authors, eg:
"Lucene In Action" would match ("Erik Hatcher" "Otis Gospodnetić" "Mark Harwood"
"Doug Cutting") with a minRequiredOverlap of 2 because Otis and Erik wrote that.
The book "Java Development with Ant" would not match because only 1 element in
the list (Erik) was selected.

2) CustomQueryParserExample
A customised QueryParser that allows definition of
CoordConstrainedBooleanQueries. The solution (mis)uses fieldnames to pass
parameters to the custom query.




Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-16 Thread Chris Hostetter

: > Should we dynamically decide to switch to FieldNormQuery when
: > BooleanQuery.maxClauseCount is exceeded?  That way queries that

: Why not leave that decision to the program using the query?
: Something like this:
: - catch the TooManyClauses exception,
: -  adapt (the offending parts of) the query to make these use
:    a FieldNormQuery,
: - retry with a warning to the provider of the query that

...because it seems like the people who typically run into TooManyClauses
aren't familiar enough with the whole API to understand why they are getting
the exception.  Right now they ask questions and people give them advice
on reducing clauses based on their specific use case.  If this change were
made the advice could be simplified and generalized - but the number of
confused questions probably wouldn't decrease that much.

I think Doug is suggesting that the "default" case for people who don't
look very deeply at the API is to "just work" all of the time, as best it
can.  People who dig deeper can call the method or set the property to
make it fail in the extreme cases where they want it to fail.




-Hoss





[jira] Commented: (LUCENE-323) [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate support for queries across multiple fields

2005-11-16 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-323?page=comments#action_12357806 ] 

Yonik Seeley commented on LUCENE-323:
-

Added Iterable to DisjunctionMaxQuery as a semi Java5 friendly way to iterate 
over the disjuncts.  Added ability to add all disjuncts from an Iterable 
(Collection, List, another DisjunctionMaxQuery, etc).

I committed DisjunctionMaxQuery/Scorer/Test since the interface should be
stable, and the implementation seems to work fine for the common cases.  I'll
be happy to evaluate & commit performance updates when they become available.

I'll leave this bug open since it contains multiple issues.



> [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate 
> support for queries across multiple fields
> -
>
>  Key: LUCENE-323
>  URL: http://issues.apache.org/jira/browse/LUCENE-323
>  Project: Lucene - Java
> Type: Bug
>   Components: QueryParser
> Versions: 1.4
>  Environment: Operating System: Windows XP
> Platform: PC
> Reporter: Chuck Williams
> Assignee: Lucene Developers
>  Attachments: DisjunctionMaxQuery.java, DisjunctionMaxScorer.java, 
> TestDisjunctionMaxQuery.java, TestMaxDisjunctionQuery.java, TestRanking.zip, 
> TestRanking.zip, TestRanking.zip, WikipediaSimilarity.java, 
> WikipediaSimilarity.java, WikipediaSimilarity.java
>
> The attached test case demonstrates this problem and provides a fix:
>   1.  Use a custom similarity to eliminate all tf and idf effects, just to 
> isolate what is being tested.
>   2.  Create two documents doc1 and doc2, each with two fields title and 
> description.  doc1 has "elephant" in title and "elephant" in description.  
> doc2 has "elephant" in title and "albino" in description.
>   3.  Express query for "albino elephant" against both fields.
> Problems:
>   a.  MultiFieldQueryParser won't recognize either document as containing 
> both terms, due to the way it expands the query across fields.
>   b.  Expressing query as "title:albino description:albino title:elephant 
> description:elephant" will score both documents equivalently, since each 
> matches two query terms.
>   4.  Comparison to MaxDisjunctionQuery and my method for expanding queries 
> across fields.  Using notation that () represents a BooleanQuery and ( | ) 
> represents a MaxDisjunctionQuery, "albino elephant" expands to:
> ( (title:albino | description:albino)
>   (title:elephant | description:elephant) )
> This will recognize that doc2 has both terms matched while doc1 only has 1
> term matched, and score doc2 over doc1.
> Refinement note:  the actual expansion for "albino elephant" that I use is:
> ( (title:albino | description:albino)~0.1
>   (title:elephant | description:elephant)~0.1 )
> This causes the score of each MaxDisjunctionQuery to be the score of highest
> scoring MDQ subclause plus 0.1 times the sum of the scores of the other MDQ
> subclauses.  Thus, doc1 gets some credit for also having "elephant" in the
> description but only 1/10 as much as doc2 gets for covering another query
> term in its description.  If doc3 has "elephant" in title and both "albino"
> and "elephant" in the description, then with the actual refined expansion, it
> gets the highest score of all (whereas with pure max, without the 0.1, it
> would get the same score as doc2).
> In real apps, tf's and idf's also come into play of course, but can affect
> these either way (i.e., mitigate this fundamental problem or exacerbate it).
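
The scoring rule described in the quoted issue (highest subclause score plus a
tie-breaker times the sum of the others) can be worked through for doc1, doc2
and doc3. This is a sketch under simplifying assumptions: every field match
scores exactly 1 and a miss scores 0, mirroring the "eliminate all tf and idf
effects" setup of the test case. DisjunctionMaxSketch and its methods are
illustrative names, not the committed classes.

```java
class DisjunctionMaxSketch {
    /** Score of one ( a | b | ... )~tieBreaker group: max plus tieBreaker times the rest. */
    static double disMax(double tieBreaker, double... subScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : subScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    /**
     * Score for "albino elephant": one group per query term, summed as a
     * BooleanQuery would. Arguments are per-field match scores (1 or 0).
     */
    static double score(double tie, double titleAlbino, double descAlbino,
                        double titleElephant, double descElephant) {
        return disMax(tie, titleAlbino, descAlbino)
             + disMax(tie, titleElephant, descElephant);
    }

    public static void main(String[] args) {
        double tie = 0.1;
        double doc1 = score(tie, 0, 0, 1, 1);  // "elephant" in title and description
        double doc2 = score(tie, 0, 1, 1, 0);  // "elephant" in title, "albino" in description
        double doc3 = score(tie, 0, 1, 1, 1);  // "elephant" in title, both terms in description
        System.out.println(doc1 + " " + doc2 + " " + doc3);
    }
}
```

With the 0.1 tie-breaker the ordering is doc1 < doc2 < doc3; with a tie-breaker
of 0 (pure max) doc2 and doc3 come out equal, as the issue text says.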




Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-16 Thread Doug Cutting

Yonik Seeley wrote:

> Totally untested, but here is a hack at what the scorer might look
> like when the number of terms is large.


Looks plausible to me.

You could instead use a byte[maxDoc] and encode/decode floats as you 
store and read them, to use a lot less RAM.



>   // could also use a bitset to keep track of docs in the set...


I think that is probably a very important optimization.

If you implemented both of these suggestions, this would use 5 bits/doc,
instead of 33 bits/doc.  With a 100M doc index, that would be the
difference between 62MB/query and 412MB/query.  The classic term
expanding approach uses perhaps 2k/term.  So, with a 100M document
index, the byte array approach uses less memory for queries which expand
to more than about 31,000 terms.  The float-array method uses less memory for
queries with more than 206k terms.
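
For anyone who wants to redo the arithmetic, here is a small sketch that takes
the figures from the message as given (5 and 33 bits per document, 100M
documents) and assumes "2k/term" means 2000 bytes per expanded term. The class
and method names are made up for the example.

```java
class ScoreArrayMemory {
    /** Bytes needed for a per-document structure of the given width. */
    static double bytesFor(long docs, int bitsPerDoc) {
        return docs * (double) bitsPerDoc / 8.0;
    }

    /** Number of expanded terms at which the per-term approach costs as much. */
    static double breakEvenTerms(double arrayBytes, double bytesPerTerm) {
        return arrayBytes / bytesPerTerm;
    }

    public static void main(String[] args) {
        long docs = 100_000_000L;
        double bytesPerTerm = 2_000.0;       // "perhaps 2k/term", taken as 2000 bytes
        double small = bytesFor(docs, 5);    // compact figure from the message
        double large = bytesFor(docs, 33);   // 32-bit float plus 1 bit of bitset
        System.out.printf("%.1f MB, break-even at %.0f terms%n",
                small / 1e6, breakEvenTerms(small, bytesPerTerm));
        System.out.printf("%.1f MB, break-even at %.0f terms%n",
                large / 1e6, breakEvenTerms(large, bytesPerTerm));
    }
}
```

Under these assumptions the compact array breaks even at roughly 31,000 terms
and the float array at roughly 206,000 terms.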


Doug




Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-16 Thread Yonik Seeley
On 11/16/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> You could instead use a byte[maxDoc] and encode/decode floats as you
> store and read them, to use a lot less RAM.

Hmmm, very interesting idea.
Less than one decimal digit of precision might be hard to swallow when
you have to add scores together though:

smallfloat(score1) + smallfloat(score2) + smallfloat(score3)

Do you think that the 5/3 exponent/mantissa split is right for this,
or would a 4/4 be better?
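
To see the precision concern in numbers, here is a rough sketch of a one-byte
float with 3 mantissa bits and 5 exponent bits. The layout constants (zero
exponent 15, truncation rather than rounding) are assumptions chosen for the
example, and SmallFloatSketch is an illustrative name; no claim is made that
this matches any existing encoder bit for bit.

```java
class SmallFloatSketch {
    // Assumed layout: 3 mantissa bits, 5 exponent bits, zero exponent 15.
    private static final int MANTISSA_BITS = 3;
    private static final int FZERO = (63 - 15) << MANTISSA_BITS;

    /** Encodes a float into one byte, truncating the mantissa to 3 bits. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - MANTISSA_BITS);
        if (small <= FZERO) {
            // zero and negatives map to 0; tiny positives to the smallest code
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= FZERO + 0x100) {
            return (byte) -1;               // overflow: clamp to the largest code
        }
        return (byte) (small - FZERO);
    }

    /** Decodes a byte produced by encode back into a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - MANTISSA_BITS);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float exact = 1.1f + 1.1f + 1.1f;
        float lossy = decode(encode(1.1f)) * 3;   // each 1.1 truncates to 1.0
        System.out.println(exact + " vs " + lossy);
    }
}
```

With only 3 mantissa bits, 1.1 is stored as 1.0, so the error grows with every
score that is added, which is the worry raised above.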

-Yonik



Float.floatToRawIntBits

2005-11-16 Thread Yonik Seeley
Float.floatToRawIntBits (in Java1.4) gives the raw float bits without
normalization (like *(int*)&floatvar would in C).  Since it doesn't do
normalization of NaN values, it's faster (and hopefully optimized to a
simple inline machine instruction by the JVM).

On my Pentium4, using floatToRawIntBits is over 5 times as fast as
floatToIntBits.
That can really add up in something like Similarity.floatToByte() for
encoding norms, especially if used as a way to compress an array of
float during query time as suggested by Doug.
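
A small sketch of the behavioural difference, using only the standard
java.lang.Float methods (RawBitsDemo and sameBits are names invented for the
example). No timing claims are made here; the speed difference will depend on
the JVM and hardware.

```java
class RawBitsDemo {
    /** True when both conversions return the same bits for f. */
    static boolean sameBits(float f) {
        return Float.floatToIntBits(f) == Float.floatToRawIntBits(f);
    }

    public static void main(String[] args) {
        // For ordinary (non-NaN) values the two methods agree.
        System.out.println(sameBits(1.0f) && sameBits(-0.0f) && sameBits(Float.MAX_VALUE));

        // floatToIntBits collapses every NaN to the canonical pattern 0x7fc00000;
        // floatToRawIntBits skips that check and may return the NaN's actual bits.
        float oddNaN = Float.intBitsToFloat(0x7fc00001);
        System.out.println(Integer.toHexString(Float.floatToIntBits(oddNaN)));
        System.out.println(Integer.toHexString(Float.floatToRawIntBits(oddNaN)));
    }
}
```

For code such as norm encoding, where NaN inputs are not expected, the two
methods should be interchangeable apart from speed.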

-Yonik



Issues while doing ant on lucene source

2005-11-16 Thread Pol, Parikshit
Hi Folks.
I downloaded Lucene and tried to run ant. It initially gave me the
following error:
BUILD FAILED
file:/home/parikpol/downloads/lucene-1.4.3/build.xml:11: Unexpected element "tstamp"

I commented out the tstamp tag from build.xml, and now it gives me the
following errors:
compile-core:
[javac] Compiling 160 source files to /home/parikpol/downloads/lucene-1.4.3/build/classes/java
[javac] /home/parikpol/downloads/lucene-1.4.3/src/java/org/apache/lucene/search/FieldCacheImpl.java:236: error: Type `StringIndex' not found in the declaration of the return type of method `getStringIndex'.
[javac]  public StringIndex getStringIndex (IndexReader reader, String field)
[javac] ^
[javac] /home/parikpol/downloads/lucene-1.4.3/src/java/org/apache/lucene/search/FieldCacheImpl.java:291: error: Type `StringIndex' not found in the declaration of the local variable `value'.
[javac]  StringIndex value = new StringIndex (retArray, mterms);
[javac]  ^
[javac] 2 errors, 5 warnings

Any help would be appreciated.
Thanks.
Parik


Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-16 Thread Doug Cutting

Yonik Seeley wrote:

> Hmmm, very interesting idea.
> Less than one decimal digit of precision might be hard to swallow when
> you have to add scores together though:
>
> smallfloat(score1) + smallfloat(score2) + smallfloat(score3)
>
> Do you think that the 5/3 exponent/mantissa split is right for this,
> or would a 4/4 be better?


The float epsilon should ideally be greater than the minimum score 
increment, and the float range should ideally be at least 100x greater 
than the maximum score increment, to permit boosting, large queries, etc.


Given a 100M document collection, the maximum idf is log(100M) = ~18, 
with a length-normalized tf of 1, for a max of 18.  So the float range 
should ideally be around 1800 or greater.


The minimum idf is 1, and the minimum normalized tf with 10k word 
documents is 1/100.  So the float epsilon should ideally be less than 1/100.


5 bits of mantissa and 3 bits of exponent is closest to this, but not 
quite there, with an epsilon of 1/32 and a range of up to ~1000.


Did I get the math right?
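
The estimates above can be re-derived with a few lines. This sketch assumes the
simplified model used in the message: idf taken as the natural log of the
collection size, and length normalization taken as 1/sqrt(document length).
ScoreRangeEstimate and its methods are illustrative names only.

```java
class ScoreRangeEstimate {
    /** Largest idf under the simplified model idf = ln(numDocs). */
    static double maxIdf(long numDocs) {
        return Math.log(numDocs);
    }

    /** Smallest length-normalized tf, assuming normalization by 1/sqrt(length). */
    static double minNormalizedTf(int docLengthInWords) {
        return 1.0 / Math.sqrt(docLengthInWords);
    }

    public static void main(String[] args) {
        double maxIncrement = maxIdf(100_000_000L) * 1.0;     // normalized tf of 1
        double wantedRange = 100 * maxIncrement;              // headroom for boosts, large queries
        double minIncrement = 1.0 * minNormalizedTf(10_000);  // idf of 1
        System.out.printf("max increment ~%.1f, wanted range ~%.0f, min increment %.2f%n",
                maxIncrement, wantedRange, minIncrement);
    }
}
```

This reproduces the figures in the message: a maximum increment of about 18, a
desired range of roughly 1800, and a minimum increment of 1/100.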

Doug




Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-16 Thread Paul Elschot
On Tuesday 15 November 2005 23:45, Yonik Seeley wrote:
> Totally untested, but here is a hack at what the scorer might look
> like when the number of terms is large.
> 
> -Yonik
> 
> 
> package org.apache.lucene.search;
> 
> import org.apache.lucene.index.TermEnum;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.TermDocs;
> 
> import java.io.IOException;
> 
> /**
>  * @author yonik
>  * @version $Id$
>  */
> public class MultiTermScorer extends Scorer{
>   protected final float[] scores;
>   protected int pos;
>   protected float docScore;
> 
>   public MultiTermScorer(Similarity similarity, IndexReader reader,
>       Weight w, TermEnum terms, byte[] norms, boolean include_idf,
>       boolean include_tf) throws IOException {
>     super(similarity);
>     float weightVal = w.getValue();
>     int maxDoc = reader.maxDoc();
>     this.scores = new float[maxDoc];
>     float[] normDecoder = Similarity.getNormDecoder();
>
>     TermDocs tdocs = reader.termDocs();

This part is only needed at the top level of the query, so
one could implement in this optimization hook of BooleanScorer:

  /** Expert: Collects matching documents in a range.
   * Note that {@link #next()} must be called once before this method is
   * called for the first time.
   * @param hc The collector to which all matching documents are passed through
   * {@link HitCollector#collect(int, float)}.
   * @param max Do not score documents past this.
   * @return true if more matching documents may remain.
   */
  protected boolean score(HitCollector hc, int max) throws IOException {
    ...
  }

>     while (terms.next()) {
>       tdocs.seek(terms);

terms.term() iirc.

>       float termScore = weightVal;
>       if (include_idf) {
>         termScore *= similarity.idf(terms.docFreq(),maxDoc);
>       }
>       while (tdocs.next()) {
>         int doc = tdocs.doc();
>         float subscore = termScore;
>         if (include_tf) subscore *= tdocs.freq();

getSimilarity().tf(tdocs.freq());

>         if (norms!=null) subscore *= normDecoder[norms[doc] & 0xff];
>         scores[doc] += subscore;

The scores[] array is the pain point, but when it can be used
this can be generalized to DisjunctionSumScorer, so it would
work for all disjunctions, not only terms.

I think it is possible to implement this hook for
DisjunctionSumScorer with a scores[] array, iterating over the
subscorers one by one.
Getting that hook called through BooleanScorer2 is no problem
when the coordination factor can be left out.

Regards,
Paul Elschot





Re: Float.floatToRawIntBits

2005-11-16 Thread Paul Smith
I can confirm this takes ~20% of an overall indexing operation (see
attached link from YourKit).


http://people.apache.org/~psmith/luceneYourkit.jpg

Mind you, the whole "signalling via IOException" in the
FastCharStream is a way bigger overhead, although I agree much harder
to fix.


Paul Smith

On 17/11/2005, at 7:21 AM, Yonik Seeley wrote:


> Float.floatToRawIntBits (in Java1.4) gives the raw float bits without
> normalization (like *(int*)&floatvar would in C).  Since it doesn't do
> normalization of NaN values, it's faster (and hopefully optimized to a
> simple inline machine instruction by the JVM).
>
> On my Pentium4, using floatToRawIntBits is over 5 times as fast as
> floatToIntBits.
> That can really add up in something like Similarity.floatToByte() for
> encoding norms, especially if used as a way to compress an array of
> float during query time as suggested by Doug.
>
> -Yonik


Re: Float.floatToRawIntBits

2005-11-16 Thread Yonik Seeley
Wow!  A much larger gain than I expected!
Thanks for the profile Paul!

-Yonik


On 11/16/05, Paul Smith <[EMAIL PROTECTED]> wrote:
> I can confirm this takes ~ 20% of an overall Indexing operation (see
> attached link from YourKit).
>
> http://people.apache.org/~psmith/luceneYourkit.jpg
>
> Mind you, the whole "signalling via IOException" in the
> FastCharStream is a way bigger overhead, although I agree much harder
> to fix.
>
> Paul Smith
>
> On 17/11/2005, at 7:21 AM, Yonik Seeley wrote:
>
> > Float.floatToRawIntBits (in Java1.4) gives the raw float bits without
> > normalization (like *(int*)&floatvar would in C).  Since it doesn't do
> > normalization of NaN values, it's faster (and hopefully optimized to a
> > simple inline machine instruction by the JVM).
> >
> > On my Pentium4, using floatToRawIntBits is over 5 times as fast as
> > floatToIntBits.
> > That can really add up in something like Similarity.floatToByte() for
> > encoding norms, especially if used as a way to compress an array of
> > float during query time as suggested by Doug.
> >
> > -Yonik



[jira] Created: (LUCENE-467) Use Float.floatToRawIntBits over Float.floatToIntBits

2005-11-16 Thread Yonik Seeley (JIRA)
Use Float.floatToRawIntBits over Float.floatToIntBits 
--

 Key: LUCENE-467
 URL: http://issues.apache.org/jira/browse/LUCENE-467
 Project: Lucene - Java
Type: Improvement
  Components: Other  
Versions: 1.9
Reporter: Yonik Seeley
Priority: Minor


Copied From my Email:
  Float.floatToRawIntBits (in Java1.4) gives the raw float bits without
normalization (like *(int*)&floatvar would in C).  Since it doesn't do
normalization of NaN values, it's faster (and hopefully optimized to a
simple inline machine instruction by the JVM).

On my Pentium4, using floatToRawIntBits is over 5 times as fast as
floatToIntBits.
That can really add up in something like Similarity.floatToByte() for
encoding norms, especially if used as a way to compress an array of
float during query time as suggested by Doug.




[jira] Commented: (LUCENE-467) Use Float.floatToRawIntBits over Float.floatToIntBits

2005-11-16 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-467?page=comments#action_12357827 ] 

Yonik Seeley commented on LUCENE-467:
-

Paul Smith's profiling shows encodeNorm() taking 20% of the total
indexing time, with floatToIntBits accounting for all of that 20%!  Almost
hard to believe...

There should be some good gains with this change.
It would be nice to change the usage in Query.hashCode too.





Re: Float.floatToRawIntBits

2005-11-16 Thread Doug Cutting
In general I would not take this sort of profiler output too literally. 
 If floatToRawIntBits is 5x faster, then you'd expect a 16% improvement 
from using it, but my guess is you'll see far less.  Still, it's 
probably worth switching & measuring as it might be significant.
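
The 16% figure above is just the profiled share times the part of it that goes away. A rough sketch of that arithmetic, assuming the 20% share and the 5x speedup quoted in this thread are both accurate (ProfileEstimate and timeSaved are hypothetical names, not Lucene code):

```java
// Hypothetical helper illustrating the arithmetic; not Lucene code.
final class ProfileEstimate {

    /**
     * Fraction of total run time saved when a method that accounts for
     * "share" of the run time becomes "speedup" times faster.
     */
    static double timeSaved(double share, double speedup) {
        return share * (1.0 - 1.0 / speedup);
    }

    public static void main(String[] args) {
        // 20% share and 5x speedup are the figures quoted in this thread.
        System.out.println(timeSaved(0.20, 5.0));
    }
}
```

This is an upper bound on paper: if the profiler overstates the share, the real saving shrinks with it.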


Doug

Paul Smith wrote:
> I can confirm this takes ~ 20% of an overall Indexing operation (see
> attached link from YourKit).
>
> http://people.apache.org/~psmith/luceneYourkit.jpg
>
> Mind you, the whole "signalling via IOException" in the FastCharStream
> is a way bigger overhead, although I agree much harder to fix.
>
> Paul Smith
>
> On 17/11/2005, at 7:21 AM, Yonik Seeley wrote:
>
>> Float.floatToRawIntBits (in Java1.4) gives the raw float bits without
>> normalization (like *(int*)&floatvar would in C).  Since it doesn't do
>> normalization of NaN values, it's faster (and hopefully optimized to a
>> simple inline machine instruction by the JVM).
>>
>> On my Pentium4, using floatToRawIntBits is over 5 times as fast as
>> floatToIntBits.
>> That can really add up in something like Similarity.floatToByte() for
>> encoding norms, especially if used as a way to compress an array of
>> float during query time as suggested by Doug.
>>
>> -Yonik



Re: Float.floatToRawIntBits

2005-11-16 Thread Paul Smith


On 17/11/2005, at 9:24 AM, Doug Cutting wrote:

> In general I would not take this sort of profiler output too
> literally.  If floatToRawIntBits is 5x faster, then you'd expect a
> 16% improvement from using it, but my guess is you'll see far
> less.  Still, it's probably worth switching & measuring as it might
> be significant.


Yes, I don't think we'll get a 5x speedup, as it will likely move
the bottleneck back to the IO layer, but still...  If you can reduce
CPU usage, then multithreaded indexing operations can gain better CPU
utilization (doing other stuff while waiting for IO).  Seems like an
easy win and dead easy to unit test?


I've been meaning to have a crack at reworking FastCharStream, but
every time I start thinking about it I realise there is a bit of a
dependency on this IOException signalling EOF, so I'm pretty sure it's
going to be a much harder task.  The JavaCC stuff is really designed
for compiling trees, which is usually a 'once off' type of usage, but
Lucene's usage of it (large indexing operations) means the flaws in it
are exacerbated.
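
To make the "IOException signalling EOF" point concrete, here is a rough sketch of the two styles side by side. It is not FastCharStream itself; EofSignalling and its methods are hypothetical, and the only assumption carried over from the thread is that end of input is reported by throwing.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical classes for illustration; this is not FastCharStream.
final class EofSignalling {

    /** Style 1: end of input is reported by throwing. */
    static char readCharOrThrow(Reader in) throws IOException {
        int c = in.read();
        if (c == -1) {
            // A new exception (with its stack trace) is built for every stream.
            throw new IOException("read past eof");
        }
        return (char) c;
    }

    /** Counts characters using the exception style. */
    static int countWithException(String text) {
        Reader in = new StringReader(text);
        int n = 0;
        try {
            while (true) {
                readCharOrThrow(in);
                n++;
            }
        } catch (IOException eof) {
            // The exception is the normal loop exit, so a genuine I/O
            // failure would be indistinguishable from end of input here.
            return n;
        }
    }

    /** Style 2: end of input is the -1 sentinel that Reader.read() returns. */
    static int countWithSentinel(String text) {
        Reader in = new StringReader(text);
        int n = 0;
        try {
            while (in.read() != -1) {
                n++;
            }
        } catch (IOException e) {
            throw new IllegalStateException(e); // real failure, not EOF
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countWithException("lucene"));
        System.out.println(countWithSentinel("lucene"));
    }
}
```

The sentinel style is cheaper per stream, but callers that rely on the exception would all have to change, which is the dependency mentioned above.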


Paul





Re: Float.floatToRawIntBits

2005-11-16 Thread Chris Lamprecht
1. Run profiler
2. Sort methods by CPU time spent
3. Optimize
4. Repeat

:)





Re: Float.floatToRawIntBits

2005-11-16 Thread Paul Smith


On 17/11/2005, at 10:21 AM, Chris Lamprecht wrote:

> 1. Run profiler
> 2. Sort methods by CPU time spent
> 3. Optimize
> 4. Repeat
>
> :)



Umm, well, I know I could make it quicker; the question is whether it still
_works_ as expected.  Maintaining the contract means I'll need to
develop some good JUnit tests that I feel confident cover the current
workings before making changes.  That's the hard bit.


Paul





[jira] Commented: (LUCENE-467) Use Float.floatToRawIntBits over Float.floatToIntBits

2005-11-16 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-467?page=comments#action_12357838 ] 

Yonik Seeley commented on LUCENE-467:
-

With -server mode, it's only 3 times as fast, and both are really fairly fast.
I do wonder if the profiler had its numbers right, or if the act of 
observation messed things up... 20% seems too high.





[jira] Commented: (LUCENE-467) Use Float.floatToRawIntBits over Float.floatToIntBits

2005-11-16 Thread Paul Smith (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-467?page=comments#action_12357839 ] 

Paul Smith commented on LUCENE-467:
---

I probably didn't describe my testing setup as clearly as I should have.  YourKit was 
set up to use method sampling (waking up every X milliseconds).  I wouldn't treat 
the 20% as an 'accurate' figure, but suffice it to say that improving this method 
would 'certainly' improve things.  Only testing the way you have will flush out 
the correct numbers.

We don't use -server (we've been careful with it on Linux because of some 
stability problems).





[jira] Commented: (LUCENE-467) Use Float.floatToRawIntBits over Float.floatToIntBits

2005-11-16 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-467?page=comments#action_12357851 ] 

Yonik Seeley commented on LUCENE-467:
-

Fun with premature optimization!
I know this isn't a bottleneck, but here is the fastest floatToByte() that I 
could come up with:

  public static byte floatToByte(float f) {
    int bits = Float.floatToRawIntBits(f);
    if (bits<=0) return 0;                        // zero and negatives
    int mantissa = (bits & 0xffffff) >> 21;
    int exponent = (bits >>> 24) - 63 + 15;
    if ((exponent & ~0x1f)==0) return (byte)((exponent << 3) | mantissa);
    else if (exponent<0) return 1;                // underflow: use min value
    return -1;                                    // overflow: use max value
  }

Here is the original from Lucene for reference:

  public static byte floatToByte(float f) {
    if (f < 0.0f)                                 // round negatives up to zero
      f = 0.0f;

    if (f == 0.0f)                                // zero is a special case
      return 0;

    int bits = Float.floatToIntBits(f);           // parse float into parts
    int mantissa = (bits & 0xffffff) >> 21;
    int exponent = (((bits >> 24) & 0x7f) - 63) + 15;

    if (exponent > 31) {                          // overflow: use max value
      exponent = 31;
      mantissa = 7;
    }

    if (exponent < 0) {                           // underflow: use min value
      exponent = 0;
      mantissa = 1;
    }

    return (byte)((exponent << 3) | mantissa);    // pack into a byte
  }
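
Since this was called "dead easy to unit test" earlier in the thread, here is a rough sketch of a sanity check that the two versions above return the same byte. Both methods are copied from this comment, assuming the mantissa mask is 0xffffff (which is what the shift by 21 implies). The class name, the method names floatToByteRaw and floatToByteOrig, and the sampling stride are hypothetical.

```java
// Sketch only: a sampled comparison of the two floatToByte versions.
final class FloatToByteCheck {

    static byte floatToByteRaw(float f) {
        int bits = Float.floatToRawIntBits(f);
        if (bits <= 0) return 0;                       // zero, -0.0f and negatives
        int mantissa = (bits & 0xffffff) >> 21;
        int exponent = (bits >>> 24) - 63 + 15;
        if ((exponent & ~0x1f) == 0) return (byte) ((exponent << 3) | mantissa);
        else if (exponent < 0) return 1;               // underflow
        return -1;                                     // overflow
    }

    static byte floatToByteOrig(float f) {
        if (f < 0.0f) f = 0.0f;
        if (f == 0.0f) return 0;
        int bits = Float.floatToIntBits(f);
        int mantissa = (bits & 0xffffff) >> 21;
        int exponent = (((bits >> 24) & 0x7f) - 63) + 15;
        if (exponent > 31) { exponent = 31; mantissa = 7; }
        if (exponent < 0)  { exponent = 0;  mantissa = 1; }
        return (byte) ((exponent << 3) | mantissa);
    }

    /** Number of sampled non-NaN bit patterns on which the two versions disagree. */
    static int countMismatches(int stride) {
        if (stride <= 0) throw new IllegalArgumentException("stride must be positive");
        int mismatches = 0;
        // long loop variable so walking the whole int range cannot overflow
        for (long b = Integer.MIN_VALUE; b <= Integer.MAX_VALUE; b += stride) {
            float f = Float.intBitsToFloat((int) b);
            if (Float.isNaN(f)) continue;              // NaNs are left out of the comparison
            if (floatToByteRaw(f) != floatToByteOrig(f)) mismatches++;
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches(4099));
    }
}
```

NaNs are skipped on purpose: a NaN with the sign bit set has negative raw bits, so the raw version returns 0 where the original returns -1. That only matters if such values can ever reach the encoder.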


Here is the performance (in seconds) on my P4 to do 640M conversions:

        JDK14-server  JDK14-client  JDK15-server  JDK15-client  JDK16-server  JDK16-client
orig          75.422        89.451         8.344        57.631         7.656        57.984
new           67.265        78.891         5.906        22.172         5.172        18.750
diff             12%           13%           41%          160%           48%          209%

Some decent gains... but the biggest moral of the story is: use Java>=1.5 and 
-server if you can!
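
For anyone wanting to repeat the measurement, a naive timing loop could look like the sketch below. The harness actually used for the numbers above isn't shown in this comment, so everything here (class name, loop body, iteration count) is an assumption. The diffPercent helper reflects how the "diff" row appears to be derived: orig/new minus one, rounded to a whole percent.

```java
// Hypothetical timing sketch; names, loop body and iteration count are assumptions.
final class ConversionTiming {

    /** How much faster the new time is than the original, as a rounded percentage. */
    static long diffPercent(double origSeconds, double newSeconds) {
        return Math.round((origSeconds / newSeconds - 1.0) * 100.0);
    }

    /** Elapsed nanoseconds for "count" calls to Float.floatToRawIntBits. */
    static long timeRaw(int count) {
        int sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            sink += Float.floatToRawIntBits(i * 0.5f);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.println(sink); // use the result so the loop is not discarded
        return elapsed;
    }

    /** Elapsed nanoseconds for "count" calls to Float.floatToIntBits. */
    static long timeCanonical(int count) {
        int sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            sink += Float.floatToIntBits(i * 0.5f);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.println(sink); // use the result so the loop is not discarded
        return elapsed;
    }

    public static void main(String[] args) {
        int count = 10_000_000;
        timeRaw(count);                           // warm-up passes before measuring
        timeCanonical(count);
        System.out.println("raw ns:       " + timeRaw(count));
        System.out.println("canonical ns: " + timeCanonical(count));
        // JDK16-client figures from the table: 57.984s orig, 18.750s new.
        System.out.println(diffPercent(57.984, 18.750) + "%");
    }
}
```

A loop like this is easily skewed by JIT warm-up and dead-code elimination, so treat its output as a rough indication only.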



