[jira] Commented: (LUCENE-129) Finalizers are non-canonical
[ http://issues.apache.org/jira/browse/LUCENE-129?page=comments#action_12357779 ] Sam Hough commented on LUCENE-129:
--
I think FSDirectory needs a finalize method added to remove its reference from FSDirectory.DIRECTORIES; otherwise, through normal garbage collection, directories could linger. I presume the originator of this issue is commenting on the finalize methods for the Input and Output Streams. I'm assuming that the intention is for Lucene to clean up after itself even if close is not called explicitly. If this really is a bug then I'm happy to try to construct a unit test to check that FSDirectory cleans up after itself properly.

> Finalizers are non-canonical
>
> Key: LUCENE-129
> URL: http://issues.apache.org/jira/browse/LUCENE-129
> Project: Lucene - Java
> Type: Bug
> Components: Other
> Versions: unspecified
> Environment: Operating System: other
> Platform: All
> Reporter: Esmond Pitt
> Assignee: Lucene Developers
> Priority: Minor
>
> The canonical form of a Java finalizer is:
>
>     protected void finalize() throws Throwable
>     {
>         try
>         {
>             // ... local code to finalize this class
>         }
>         catch (Throwable t)
>         {
>         }
>         super.finalize(); // finalize the base class
>     }
>
> The finalizers in IndexReader, IndexWriter, and FSDirectory don't conform.
> This is probably minor or null in effect, but the principle is important.
> As a matter of fact FSDirectory.finalize() is entirely redundant and could be
> removed, as it doesn't do anything that RandomAccessFile.finalize would do
> automatically.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-129) Finalizers are non-canonical
[ http://issues.apache.org/jira/browse/LUCENE-129?page=comments#action_12357780 ] Sam Hough commented on LUCENE-129:
--
Doh. Sorry. Been a long day. finalize won't be called if DIRECTORIES still points at it :( Think twice, post once.

Does this mean that clients of FSDirectory should have finalize methods that close the Directory? IndexReader.finalize, for instance, just cleans up its lock but doesn't call close()!? It is making my head hurt thinking back to C++ days of no automatic garbage collection. Sorry.
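Sam's realization can be demonstrated in a few lines of plain Java. The class, map, and key below are hypothetical stand-ins for illustration, not Lucene's actual FSDirectory code; only the reachability behavior is the point: an object held in a static cache map stays strongly reachable, so it can never be finalized or collected until the cache entry is removed.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

public class StaticMapLeak {
    // Stand-in for a static cache like FSDirectory.DIRECTORIES.
    static final Map<String, Object> DIRECTORIES = new HashMap<String, Object>();

    // Returns true if the cached object is still reachable after a GC.
    public static boolean stillReachableWhileCached() {
        Object dir = new Object();
        DIRECTORIES.put("/tmp/index", dir);            // hypothetical cache key
        WeakReference<Object> ref = new WeakReference<Object>(dir);
        dir = null;                                    // drop our local reference
        System.gc();
        // Guaranteed: the static map still strongly references the object,
        // so the weak reference cannot have been cleared -- and its
        // finalizer cannot have run.
        boolean reachable = (ref.get() != null);
        DIRECTORIES.remove("/tmp/index");              // only now can GC ever reclaim it
        return reachable;
    }

    public static void main(String[] args) {
        System.out.println("still reachable while cached: " + stillReachableWhileCached());
    }
}
```

This is why a finalize method on FSDirectory alone cannot fix the lingering-directory problem: finalization only runs once the object is unreachable, which the static map prevents.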
Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/
If that's the way to go, we should do it by default so the user doesn't have to. Unless the scores between the two types of queries are compatible, it's a bad idea to transparently switch between them, since it will cause relevancy to change unpredictably in the future (triggered by either a query changing slightly, or the index changing slightly).

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

On 11/16/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
> Why not leave that decision to the program using the query?
> Something like this:
> - catch the TooManyClauses exception,
> - adapt (the offending parts of) the query to make these use
>   a FieldNormQuery,
> - retry with a warning to the provider of the query that
>   the term frequencies have been ignored.
>
> The only disadvantage of this is that the term expansion
> during rewrite has to be redone.
> Also, when enough terms are involved, they might still cause
> a memory problem because they are all needed at the same
> time.
>
> Regards,
> Paul Elschot
Re: Lucene Index backboned by DB
Hi,

> 1) It might be OK to implement retrieving field values separately for a
> document. However, I think from a simplicity point of view, it might be
> better to have the application code do this drudgery. Adding this feature
> could complicate the nice and simple design of Lucene without much benefit.

Yes, it is possible to store document parts in a second index or in a database. If only a small number of documents at a time must be loaded into memory, there are good solutions for that with the existing storage model.

> 2) The application could separate a document into several documents, for
> example, one document for indexing mainly, the other documents for storing
> binary values for different fields. Thus, given the relevant doc id, its
> associated binary value for a particular field could be loaded very fast
> with just a disk lookup (looking up the fdx file).

But consider the problem of using the information stored in a field for sorting purposes (dates, any numeric attributes of documents). There is now a special case implemented in Lucene: norms are stored as bytes, one per document, in a single file per segment. They are designed to be loaded into memory at once, to completely avoid a disk lookup for every document to be weighted.

> This way, only the relevant field is loaded into memory rather than all of
> the fields for a doc. There is no change on the Lucene side, only some more
> work for the application code.

I may have missed something, but I don't know how to implement fast custom sorting without a disk access per document with the current Lucene interface. Please tell me if I'm completely wrong. Having the possibility of storing a binary field of fixed length in a separate file, to be loaded at once into memory for fast access, would solve the problem. Is it worth the effort? I don't think it would make the interface that much more complicated. Any comments?
> My view for a search library (or in general, a library) is that it should be
> small and efficient; since it is used by lots of applications, any additional
> feature could potentially impact its robustness and performance.
>
> Welcome any critics or comments?
>
> Jian
>
> On 11/15/05, Robert Kirchgessner <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > a discussion in
> >
> > http://issues.apache.org/jira/browse/LUCENE-196
> >
> > might be of interest to you.
> >
> > Did you think about storing the large pieces of documents
> > in a database to reduce the size of the Lucene index?
> >
> > I think there are good reasons for adding support for
> > storing fields in separate files:
> >
> > 1. One could define a binary field of fixed length and store it
> > in a separate file, then load it into memory and have fast
> > access to field contents.
> >
> > A use case might be: store a calendar date (YYYY-MM-DD)
> > in three bytes: 4 bits for months, 5 bits for days and up to
> > 15 bits for years. If you want to retrieve hits sorted by date
> > you can load the fields file of size (3 * documents in index) bytes
> > and support sorting by date without accessing the hard drive
> > to read dates.
> >
> > 2. One could store document contents in a separate
> > file, and fields of small size, like title and some metadata,
> > in the way they are stored now. It could speed up access to
> > fields. It would be interesting to know whether you gain
> > significant performance leaving the big chunks out, i.e.
> > not storing them in the index.
> >
> > In my opinion 1. is the most interesting case: storing some
> > binary fields (dates, prices, lengths, any numeric metrics of
> > documents) would enable *really* fast sorting of hits.
> >
> > Any thoughts about this?
> >
> > Regards,
> >
> > Robert
> >
> > We have a similar problem.
> >
> > On Tuesday, 15 November 2005 23:23, Karel Tejnora wrote:
> > > Hi all,
> > > in our testing application using lucene 1.4.3.
> > > Thank you guys for that great job.
> > > We have an index file of around 12 GiB, one file (merged). To retrieve hits
> > > takes a nice small amount of time, but reading the stored fields takes
> > > 10-100 times more, I think because all the fields are read.
> > > I would like to try to implement the Lucene index files as tables in a DB
> > > with some lazy field loading. As I have searched the web I have found only
> > > an impl. of store.Directory (bdb), but it only holds data as binary
> > > streams. This technique will not be so helpful because BLOB operations
> > > do not perform fast. On the other side I will lose some of the freedom of
> > > document field variability, but I can omit a lot of the skipping and many
> > > opened files. Also IndexWriter can have document/term locking granularity.
> > > So I think that way leads to extending IndexWriter / IndexReader and having
> > > my own implementation of the index.Segment* classes. Is that the best way,
> > > or am I missing something about how to achieve this?
> > > If it is a bad idea, I will be happy to hear other possibilities.
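Robert's date-packing use case can be sketched in a few lines. This is illustrative code, not Lucene's: it packs a calendar date into 24 bits (15 bits year, 4 bits month, 5 bits day), so an in-memory array of 3 bytes per document could support sorting hits by date with no disk access. Putting the year in the high bits keeps packed values in chronological order, so they can be compared as plain ints.

```java
public class PackedDate {
    // Pack year/month/day into the low 24 bits of an int:
    // [ 15 bits year | 4 bits month | 5 bits day ]
    public static int pack(int year, int month, int day) {
        return (year << 9) | (month << 5) | day;
    }

    public static int year(int packed)  { return packed >>> 9; }
    public static int month(int packed) { return (packed >>> 5) & 0x0f; }
    public static int day(int packed)   { return packed & 0x1f; }

    public static void main(String[] args) {
        int d = pack(2005, 11, 16);
        // Round-trips, and later dates compare greater as plain ints.
        System.out.println(year(d) + "-" + month(d) + "-" + day(d)); // 2005-11-16
    }
}
```

A sort over such packed values never touches the stored-fields files, which is exactly the disk-access-per-document problem Robert describes.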
[jira] Resolved: (LUCENE-395) CoordConstrainedBooleanQuery + QueryParser support
[ http://issues.apache.org/jira/browse/LUCENE-395?page=all ] Yonik Seeley resolved LUCENE-395:
-
Resolution: Fixed
Assign To: Yonik Seeley (was: Lucene Developers)

Fixed BooleanQuery hashCode/equals and committed patches.

> CoordConstrainedBooleanQuery + QueryParser support
> --
>
> Key: LUCENE-395
> URL: http://issues.apache.org/jira/browse/LUCENE-395
> Project: Lucene - Java
> Type: Improvement
> Components: Search
> Versions: unspecified
> Environment: Operating System: other
> Platform: Other
> Reporter: Mark Harwood
> Assignee: Yonik Seeley
> Priority: Minor
> Attachments: BooleanQuery.patch, BooleanQuery.patch, BooleanScorer2.java,
> CoordConstrainedBooleanQuery.java, CoordConstrainedBooleanQuery.java,
> CustomQueryParserExample.java, CustomQueryParserExample.java,
> LUCENE-395.patch, LUCENE-395.patch, LUCENE-395.patch, TestBoolean2Patch5.txt,
> TestBooleanMinShouldMatch.java, TestBooleanMinShouldMatch.java,
> TestBooleanMinShouldMatch.java, TestBooleanMinShouldMatch.java,
> TestBooleanMinShouldMatch.java, TestBooleanMinShouldMatch.java
>
> Attached 2 new classes:
> 1) CoordConstrainedBooleanQuery
> A boolean query that only matches if a specified number of the contained
> clauses match. An example use might be a query that returns a list of books
> where ANY 2 people from a list of people were co-authors, e.g.:
> "Lucene In Action" would match ("Erik Hatcher" "Otis Gospodnetić" "Mark
> Harwood" "Doug Cutting") with a minRequiredOverlap of 2 because Otis and
> Erik wrote that. The book "Java Development with Ant" would not match
> because only 1 element in the list (Erik) was selected.
> 2) CustomQueryParserExample
> A customised QueryParser that allows definition of
> CoordConstrainedBooleanQueries. The solution (mis)uses fieldnames to pass
> parameters to the custom query.

--
This message is automatically generated by JIRA.
[jira] Created: (LUCENE-466) Need QueryParser support for BooleanQuery.minNrShouldMatch
Need QueryParser support for BooleanQuery.minNrShouldMatch
--

Key: LUCENE-466
URL: http://issues.apache.org/jira/browse/LUCENE-466
Project: Lucene - Java
Type: Improvement
Components: Search
Versions: unspecified
Environment: Operating System: other
Platform: Other
Reporter: Mark Harwood
Assigned to: Yonik Seeley
Priority: Minor

Attached 2 new classes:
1) CoordConstrainedBooleanQuery
A boolean query that only matches if a specified number of the contained clauses match. An example use might be a query that returns a list of books where ANY 2 people from a list of people were co-authors, e.g.: "Lucene In Action" would match ("Erik Hatcher" "Otis Gospodnetić" "Mark Harwood" "Doug Cutting") with a minRequiredOverlap of 2 because Otis and Erik wrote that. The book "Java Development with Ant" would not match because only 1 element in the list (Erik) was selected.
2) CustomQueryParserExample
A customised QueryParser that allows definition of CoordConstrainedBooleanQueries. The solution (mis)uses fieldnames to pass parameters to the custom query.
Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/
: > Should we dynamically decide to switch to FieldNormQuery when
: > BooleanQuery.maxClauseCount is exceeded? That way queries that
: Why not leave that decision to the program using the query?
: Something like this:
: - catch the TooManyClauses exception,
: - adapt (the offending parts of) the query to make these use
:   a FieldNormQuery,
: - retry with a warning to the provider of the query that

...because it seems like the people who typically run into TooManyClauses aren't familiar enough with the whole API to understand why they are getting the exception. Right now they ask questions, and people give them advice on reducing clauses based on their specific use case. If this change were made, the advice could be simplified and generalized - but the number of confused questions probably wouldn't decrease that much.

I think Doug is suggesting that the "default" case, for people who don't look very deeply at the API, should be to "just work" all of the time, as best it can. People who dig deeper can call the method or set the property to make it fail in the extreme cases where they want it to fail.

-Hoss
[jira] Commented: (LUCENE-323) [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate support for queries across multiple fields
[ http://issues.apache.org/jira/browse/LUCENE-323?page=comments#action_12357806 ] Yonik Seeley commented on LUCENE-323:
-
Added Iterable to DisjunctionMaxQuery as a semi-Java5-friendly way to iterate over the disjuncts. Added the ability to add all disjuncts from an Iterable (Collection, List, another DisjunctionMaxQuery, etc).

I committed DisjunctionMaxQuery/Scorer/Test since the interface should be stable, and the implementation seems to work fine for the common cases. I'll be happy to evaluate & commit performance updates when they become available. I'll leave this bug open since it contains multiple issues.

> [PATCH] MultiFieldQueryParser and BooleanQuery do not provide adequate
> support for queries across multiple fields
> -
>
> Key: LUCENE-323
> URL: http://issues.apache.org/jira/browse/LUCENE-323
> Project: Lucene - Java
> Type: Bug
> Components: QueryParser
> Versions: 1.4
> Environment: Operating System: Windows XP
> Platform: PC
> Reporter: Chuck Williams
> Assignee: Lucene Developers
> Attachments: DisjunctionMaxQuery.java, DisjunctionMaxScorer.java,
> TestDisjunctionMaxQuery.java, TestMaxDisjunctionQuery.java, TestRanking.zip,
> TestRanking.zip, TestRanking.zip, WikipediaSimilarity.java,
> WikipediaSimilarity.java, WikipediaSimilarity.java
>
> The attached test case demonstrates this problem and provides a fix:
> 1. Use a custom similarity to eliminate all tf and idf effects, just to
> isolate what is being tested.
> 2. Create two documents doc1 and doc2, each with two fields title and
> description. doc1 has "elephant" in title and "elephant" in description.
> doc2 has "elephant" in title and "albino" in description.
> 3. Express the query "albino elephant" against both fields.
> Problems:
> a. MultiFieldQueryParser won't recognize either document as containing
> both terms, due to the way it expands the query across fields.
> b.
> Expressing the query as "title:albino description:albino title:elephant
> description:elephant" will score both documents equivalently, since each
> matches two query terms.
> 4. Comparison to MaxDisjunctionQuery and my method for expanding queries
> across fields. Using notation where () represents a BooleanQuery and ( | )
> represents a MaxDisjunctionQuery, "albino elephant" expands to:
>   ( (title:albino | description:albino)
>     (title:elephant | description:elephant) )
> This will recognize that doc2 has both terms matched while doc1 only has 1
> term matched, scoring doc2 over doc1.
> Refinement note: the actual expansion for "albino elephant" that I use is:
>   ( (title:albino | description:albino)~0.1
>     (title:elephant | description:elephant)~0.1 )
> This causes the score of each MaxDisjunctionQuery to be the score of its
> highest scoring subclause plus 0.1 times the sum of the scores of the other
> MDQ subclauses. Thus, doc1 gets some credit for also having "elephant" in
> the description, but only 1/10 as much as doc2 gets for covering another
> query term in its description. If doc3 has "elephant" in title and both
> "albino" and "elephant" in the description, then with the actual refined
> expansion it gets the highest score of all (whereas with pure max, without
> the 0.1, it would get the same score as doc2).
> In real apps, tf's and idf's also come into play of course, but can affect
> these either way (i.e., mitigate this fundamental problem or exacerbate it).
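The "max plus 0.1 times the rest" scoring above can be checked with a toy calculation. This is not Lucene code: it ignores tf/idf entirely (every field match contributes 1.0, as in Chuck's custom similarity) and just applies the per-term tie-break formula to the three example documents, reproducing the ranking doc3 > doc2 > doc1.

```java
public class MaxDisjunctionToy {
    static final float TIE = 0.1f; // the ~0.1 tie-breaker from the expansion

    // Score of one disjunct (one query term): the max over the fields that
    // matched, plus TIE times the sum of the remaining field scores.
    public static float disjunctScore(float[] fieldScores) {
        float max = 0f, sum = 0f;
        for (int i = 0; i < fieldScores.length; i++) {
            sum += fieldScores[i];
            if (fieldScores[i] > max) max = fieldScores[i];
        }
        return max + TIE * (sum - max);
    }

    public static void main(String[] args) {
        // doc1: "elephant" in title and description; no "albino"
        float doc1 = disjunctScore(new float[]{1f, 1f}) + disjunctScore(new float[]{});
        // doc2: "elephant" in title, "albino" in description
        float doc2 = disjunctScore(new float[]{1f}) + disjunctScore(new float[]{1f});
        // doc3: "elephant" in title and description, "albino" in description
        float doc3 = disjunctScore(new float[]{1f, 1f}) + disjunctScore(new float[]{1f});
        System.out.println(doc1 + " " + doc2 + " " + doc3); // 1.1 2.0 2.1
    }
}
```

With pure max (TIE = 0), doc3 and doc2 would tie at 2.0, which is exactly the case the refinement note addresses.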
Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/
Yonik Seeley wrote:
> Totally untested, but here is a hack at what the scorer might look like
> when the number of terms is large.

Looks plausible to me. You could instead use a byte[maxDoc] and encode/decode floats as you store and read them, to use a lot less RAM.

> // could also use a bitset to keep track of docs in the set...

I think that is probably a very important optimization. If you implemented both of these suggestions, this would use 5 bits/doc instead of 33 bits/doc. With a 100M doc index, that would be the difference between 62MB/query and 412MB/query.

The classic term-expanding approach uses perhaps 2k/term. So, with a 100M document index, the byte-array approach uses less memory for queries which expand to more than about 31,000 terms. The float-array method uses less memory for queries with more than 206k terms.

Doug
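Doug's break-even figures can be re-derived mechanically under the assumptions he states (100M documents; roughly 2 KB per term for the classic expansion). The constants below are taken from the mail, not from Lucene itself:

```java
public class MemoryMath {
    public static final long DOCS = 100_000_000L;      // Doug's 100M doc index
    public static final long BYTES_PER_TERM = 2 * 1024; // "perhaps 2k/term"

    // Memory for a per-document array at the given number of bits per doc.
    public static long bitsPerDocToBytes(long bits) {
        return DOCS * bits / 8;
    }

    public static void main(String[] args) {
        long byteApproach  = bitsPerDocToBytes(5);   // byte[] + bitset, per Doug
        long floatApproach = bitsPerDocToBytes(33);  // float[] + bitset
        System.out.println(byteApproach / 1_000_000 + " MB");              // 62 MB
        System.out.println(floatApproach / 1_000_000 + " MB");             // 412 MB
        System.out.println(byteApproach / BYTES_PER_TERM + " terms");      // 30517
        System.out.println(floatApproach / BYTES_PER_TERM + " terms");     // 201416
    }
}
```

The 62 MB and 412 MB figures match the mail, and dividing by 2 KB/term gives break-even points of roughly 31k and 206k expanded terms.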
Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/
On 11/16/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> You could instead use a byte[maxDoc] and encode/decode floats as you
> store and read them, to use a lot less RAM.

Hmmm, very interesting idea. Less than one decimal digit of precision might be hard to swallow when you have to add scores together though:
  smallfloat(score1) + smallfloat(score2) + smallfloat(score3)

Do you think that the 5/3 exponent/mantissa split is right for this, or would a 4/4 be better?

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706
Float.floatToRawIntBits
Float.floatToRawIntBits (in Java 1.4) gives the raw float bits without normalization (like *(int*)&floatvar would in C). Since it doesn't do normalization of NaN values, it's faster (and hopefully optimized to a simple inline machine instruction by the JVM).

On my Pentium 4, using floatToRawIntBits is over 5 times as fast as floatToIntBits. That can really add up in something like Similarity.floatToByte() for encoding norms, especially if used as a way to compress an array of floats at query time, as suggested by Doug.

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706
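The semantic difference Yonik is describing is narrow: the two methods agree bit-for-bit on every ordinary float, and can differ only for NaN, which floatToIntBits canonicalizes to 0x7fc00000 while floatToRawIntBits returns unmodified. Skipping that NaN check is what makes the raw variant cheap enough to matter in a tight loop like norm encoding:

```java
public class RawBits {
    public static void main(String[] args) {
        // Identical for non-NaN values...
        System.out.println(Integer.toHexString(Float.floatToIntBits(1.0f)));    // 3f800000
        System.out.println(Integer.toHexString(Float.floatToRawIntBits(1.0f))); // 3f800000
        // ...and floatToIntBits collapses every NaN to the canonical pattern.
        System.out.println(Integer.toHexString(Float.floatToIntBits(Float.NaN))); // 7fc00000
    }
}
```

Since norms are never NaN, the canonicalization work is pure overhead in that code path.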
Issues while doing ant on lucene source
Hi folks. I downloaded Lucene and tried to run ant on it. It initially gave me the following error:

BUILD FAILED
file:/home/parikpol/downloads/lucene-1.4.3/build.xml:11: Unexpected element "tstamp"

I commented out the tstamp tag from build.xml, and now it gives me the following errors:

compile-core:
[javac] Compiling 160 source files to /home/parikpol/downloads/lucene-1.4.3/build/classes/java
[javac] /home/parikpol/downloads/lucene-1.4.3/src/java/org/apache/lucene/search/FieldCacheImpl.java:236: error: Type `StringIndex' not found in the declaration of the return type of method `getStringIndex'.
[javac] public StringIndex getStringIndex (IndexReader reader, String field)
[javac]        ^
[javac] /home/parikpol/downloads/lucene-1.4.3/src/java/org/apache/lucene/search/FieldCacheImpl.java:291: error: Type `StringIndex' not found in the declaration of the local variable `value'.
[javac] StringIndex value = new StringIndex (retArray, mterms);
[javac] ^
[javac] 2 errors, 5 warnings

Any help would be appreciated. Thanks.

Parik
Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/
Yonik Seeley wrote:
> Hmmm, very interesting idea. Less than one decimal digit of precision
> might be hard to swallow when you have to add scores together though:
>   smallfloat(score1) + smallfloat(score2) + smallfloat(score3)
> Do you think that the 5/3 exponent/mantissa split is right for this, or
> would a 4/4 be better?

The float epsilon should ideally be less than the minimum score increment, and the float range should ideally be at least 100x greater than the maximum score increment, to permit boosting, large queries, etc.

Given a 100M document collection, the maximum idf is log(100M) = ~18, with a length-normalized tf of 1, for a max of 18. So the float range should ideally be around 1800 or greater. The minimum idf is 1, and the minimum normalized tf with 10k-word documents is 1/100. So the float epsilon should ideally be less than 1/100.

5 bits of mantissa and 3 bits of exponent is closest to this, but not quite there, with an epsilon of 1/32 and a range of up to ~1000. Did I get the math right?

Doug
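A minimal sketch of such an 8-bit "small float" (3 exponent bits, 5 mantissa bits) is below. This is illustrative code, not the encoding that later shipped in Lucene; in particular the exponent offset (the "zero point") is an arbitrary choice here, and moving it is exactly the epsilon-versus-range trade-off Doug is weighing. With an offset of 5, the smallest nonzero value is 1/32 and values clamp at the top of the 3-bit exponent range.

```java
public class SmallFloat8 {
    private static final int EXP_OFFSET = 5; // illustrative bias; tune to trade epsilon vs. range

    /** Encode a non-negative float into 8 bits: 3 exponent bits, 5 mantissa bits. */
    public static byte floatToByte(float f) {
        if (f <= 0f) return 0;
        int bits = Float.floatToRawIntBits(f);
        int mantissa = (bits >>> 18) & 0x1f;             // keep the top 5 of 23 mantissa bits
        int exponent = (bits >>> 23) - 127 + EXP_OFFSET; // rebias the 8-bit exponent into 3 bits
        if (exponent < 0) return 0;                      // underflow: below epsilon (1/32)
        if (exponent > 7) return (byte) 0xff;            // overflow: clamp to the maximum
        return (byte) ((exponent << 5) | mantissa);
    }

    public static float byteToFloat(byte b) {
        if (b == 0) return 0f;
        int mantissa = b & 0x1f;
        int exponent = (b & 0xff) >>> 5;
        int bits = ((exponent - EXP_OFFSET + 127) << 23) | (mantissa << 18);
        return Float.intBitsToFloat(bits);
    }
}
```

Note the encoding is order-preserving when the bytes are compared as unsigned values, which is what makes a byte[maxDoc] of encoded scores usable without decoding for comparisons.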
Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/
On Tuesday 15 November 2005 23:45, Yonik Seeley wrote:
> Totally untested, but here is a hack at what the scorer might look
> like when the number of terms is large.
>
> -Yonik
>
> package org.apache.lucene.search;
>
> import org.apache.lucene.index.TermEnum;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.TermDocs;
>
> import java.io.IOException;
>
> /**
>  * @author yonik
>  * @version $Id$
>  */
> public class MultiTermScorer extends Scorer {
>   protected final float[] scores;
>   protected int pos;
>   protected float docScore;
>
>   public MultiTermScorer(Similarity similarity, IndexReader reader,
>       Weight w, TermEnum terms, byte[] norms, boolean include_idf,
>       boolean include_tf) throws IOException {
>     super(similarity);
>     float weightVal = w.getValue();
>     int maxDoc = reader.maxDoc();
>     this.scores = new float[maxDoc];
>     float[] normDecoder = Similarity.getNormDecoder();
>
>     TermDocs tdocs = reader.termDocs();

This part is only needed at the top level of the query, so one could implement it in this optimization hook of BooleanScorer:

  /** Expert: Collects matching documents in a range.
   * Note that {@link #next()} must be called once before this method is
   * called for the first time.
   * @param hc The collector to which all matching documents are passed through
   * {@link HitCollector#collect(int, float)}.
   * @param max Do not score documents past this.
   * @return true if more matching documents may remain.
   */
  protected boolean score(HitCollector hc, int max) throws IOException { ... }

>     while (terms.next()) {
>       tdocs.seek(terms);

terms.term(), iirc.

>       float termScore = weightVal;
>       if (include_idf) {
>         termScore *= similarity.idf(terms.docFreq(), maxDoc);
>       }
>       while (tdocs.next()) {
>         int doc = tdocs.doc();
>         float subscore = termScore;
>         if (include_tf) subscore *= tdocs.freq();

getSimilarity().tf(tdocs.freq());

>         if (norms != null) subscore *= normDecoder[norms[doc] & 0xff];
>         scores[doc] += subscore;

The scores[] array is the pain point, but when it can be used this can be generalized to DisjunctionSumScorer, so it would work for all disjunctions, not only terms. I think it is possible to implement this hook for DisjunctionSumScorer with a scores[] array, iterating over the subscorers one by one. Getting that hook called through BooleanScorer2 is no problem when the coordination factor can be left out.

Regards,
Paul Elschot
Re: Float.floatToRawIntBits
I can confirm this takes ~20% of an overall indexing operation (see attached link from YourKit):

http://people.apache.org/~psmith/luceneYourkit.jpg

Mind you, the whole "signalling via IOException" in FastCharStream is a way bigger overhead, although I agree it's much harder to fix.

Paul Smith

On 17/11/2005, at 7:21 AM, Yonik Seeley wrote:
> Float.floatToRawIntBits (in Java 1.4) gives the raw float bits without
> normalization (like *(int*)&floatvar would in C). Since it doesn't do
> normalization of NaN values, it's faster (and hopefully optimized to a
> simple inline machine instruction by the JVM).
>
> On my Pentium 4, using floatToRawIntBits is over 5 times as fast as
> floatToIntBits. That can really add up in something like
> Similarity.floatToByte() for encoding norms, especially if used as a way
> to compress an array of floats at query time, as suggested by Doug.
>
> -Yonik
Re: Float.floatToRawIntBits
Wow! A much larger gain than I expected! Thanks for the profile, Paul!

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

On 11/16/05, Paul Smith <[EMAIL PROTECTED]> wrote:
> I can confirm this takes ~20% of an overall indexing operation (see
> attached link from YourKit).
>
> http://people.apache.org/~psmith/luceneYourkit.jpg
>
> Mind you, the whole "signalling via IOException" in FastCharStream
> is a way bigger overhead, although I agree much harder to fix.
>
> Paul Smith
[jira] Created: (LUCENE-467) Use Float.floatToRawIntBits over Float.floatToIntBits
Use Float.floatToRawIntBits over Float.floatToIntBits
--

Key: LUCENE-467
URL: http://issues.apache.org/jira/browse/LUCENE-467
Project: Lucene - Java
Type: Improvement
Components: Other
Versions: 1.9
Reporter: Yonik Seeley
Priority: Minor

Copied from my email:

Float.floatToRawIntBits (in Java 1.4) gives the raw float bits without normalization (like *(int*)&floatvar would in C). Since it doesn't do normalization of NaN values, it's faster (and hopefully optimized to a simple inline machine instruction by the JVM).

On my Pentium 4, using floatToRawIntBits is over 5 times as fast as floatToIntBits. That can really add up in something like Similarity.floatToByte() for encoding norms, especially if used as a way to compress an array of floats at query time, as suggested by Doug.
[jira] Commented: (LUCENE-467) Use Float.floatToRawIntBits over Float.floatToIntBits
[ http://issues.apache.org/jira/browse/LUCENE-467?page=comments#action_12357827 ] Yonik Seeley commented on LUCENE-467:
-
Paul Smith's profiling shows encodeNorm() taking 20% of the total indexing time, with floatToIntBits registering all of that 20%! Almost hard to believe... There should be some good gains with this change. It would be nice to change the usage in Query.hashCode too.
Re: Float.floatToRawIntBits
In general I would not take this sort of profiler output too literally. If floatToRawIntBits is 5x faster, then you'd expect a 16% improvement from using it, but my guess is you'll see far less. Still, it's probably worth switching & measuring, as it might be significant.

Doug

Paul Smith wrote:
> I can confirm this takes ~20% of an overall indexing operation (see
> attached link from YourKit).
>
> http://people.apache.org/~psmith/luceneYourkit.jpg
>
> Mind you, the whole "signalling via IOException" in FastCharStream
> is a way bigger overhead, although I agree much harder to fix.
>
> Paul Smith
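Doug's 16% ceiling is just Amdahl's law: if a phase takes 20% of the run and becomes 5x faster, the whole run shrinks to 80% + 20%/5 = 84% of its original time. A one-line check:

```java
public class Amdahl {
    // Fraction of the original runtime remaining after speeding up one phase.
    public static double remainingFraction(double phaseShare, double speedup) {
        return (1.0 - phaseShare) + phaseShare / speedup;
    }

    public static void main(String[] args) {
        // 20% phase, 5x faster: ~0.84 of the original time, i.e. at most ~16% saved.
        System.out.println(remainingFraction(0.20, 5.0));
    }
}
```

That 16% is a best case; as Doug notes, the real gain is usually smaller because the profiler overstates the cost of tiny hot methods.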
Re: Float.floatToRawIntBits
On 17/11/2005, at 9:24 AM, Doug Cutting wrote:
> In general I would not take this sort of profiler output too literally. If floatToRawIntBits is 5x faster, then you'd expect a 16% improvement from using it, but my guess is you'll see far less. Still, it's probably worth switching & measuring as it might be significant.

Yes, I don't think we'll get a 5x speedup, as it will likely move the bottleneck back to the IO layer, but still... If you can reduce CPU usage, then multithreaded indexing operations can gain better CPU utilization (doing other stuff while waiting for IO). Seems like an easy win, and dead easy to unit test?

I've been meaning to have a crack at reworking FastCharStream, but every time I start thinking about it I realise there is a bit of a dependency on this IOException signalling EOF, and I'm pretty sure it's going to be a much harder task. The JavaCC stuff is really designed for compiling trees, which is usually a 'once off' type of usage, but Lucene's usage of it (large indexing operations) means the flaws in it are exacerbated.

Paul
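The cost of the EOF-via-IOException pattern Paul mentions is easy to see in isolation: constructing the exception (including filling in its stack trace) on every end-of-input is far more work than returning a sentinel the way java.io.Reader.read() does. A generic sketch of the two styles, not FastCharStream's actual API:

```java
import java.io.IOException;

public class EofStyles {
    // EOF signalled by throwing: each end-of-input pays for an object
    // allocation, a stack-trace fill, and a stack unwind.
    static int readThrowing(int pos, int len) throws IOException {
        if (pos >= len) throw new IOException("read past eof");
        return pos; // stand-in for the character at pos
    }

    // EOF signalled by a sentinel value, as java.io.Reader.read() does.
    static int readSentinel(int pos, int len) {
        return (pos >= len) ? -1 : pos;
    }

    public static void main(String[] args) {
        int chars = 0, eofs = 0;
        for (int i = 0; i < 8; i++) {
            try {
                readThrowing(i, 5);
                chars++;
            } catch (IOException eof) {
                eofs++; // the "normal" termination path, taken via exception
            }
        }
        System.out.println(chars + " chars, " + eofs + " EOFs via exception");
        System.out.println("sentinel at EOF: " + readSentinel(7, 5)); // -1
    }
}
```

In a tight tokenizer loop the throwing variant turns a routine condition into the expensive path, which is why reworking it is attractive but, as noted, entangled with the generated JavaCC code.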
Re: Float.floatToRawIntBits
1. Run profiler
2. Sort methods by CPU time spent
3. Optimize
4. Repeat :)

On 11/16/05, Paul Smith <[EMAIL PROTECTED]> wrote:
>
> On 17/11/2005, at 9:24 AM, Doug Cutting wrote:
>
> > In general I would not take this sort of profiler output too literally. If floatToRawIntBits is 5x faster, then you'd expect a 16% improvement from using it, but my guess is you'll see far less. Still, it's probably worth switching & measuring as it might be significant.
>
> Yes, I don't think we'll get a 5x speedup, as it will likely move the bottleneck back to the IO layer, but still... If you can reduce CPU usage, then multithreaded indexing operations can gain better CPU utilization (doing other stuff while waiting for IO). Seems like an easy win, and dead easy to unit test?
>
> I've been meaning to have a crack at reworking FastCharStream, but every time I start thinking about it I realise there is a bit of a dependency on this IOException signalling EOF, and I'm pretty sure it's going to be a much harder task. The JavaCC stuff is really designed for compiling trees, which is usually a 'once off' type of usage, but Lucene's usage of it (large indexing operations) means the flaws in it are exacerbated.
>
> Paul
Re: Float.floatToRawIntBits
On 17/11/2005, at 10:21 AM, Chris Lamprecht wrote:
> 1. Run profiler
> 2. Sort methods by CPU time spent
> 3. Optimize
> 4. Repeat :)

Umm, well, I know I could make it quicker; it's just whether it still _works_ as expected. Maintaining the contract means I'll need to develop some good JUnit tests that I feel confident cover the current workings before making changes. That's the hard bit.

Paul
[jira] Commented: (LUCENE-467) Use Float.floatToRawIntBits over Float.floatToIntBits
[ http://issues.apache.org/jira/browse/LUCENE-467?page=comments#action_12357838 ]

Yonik Seeley commented on LUCENE-467:
-------------------------------------

With -server mode, it's only 3 times as fast, and both are really fairly fast. I do wonder if the profiler had its numbers right, or if the act of observation messed things up... 20% seems too high.
[jira] Commented: (LUCENE-467) Use Float.floatToRawIntBits over Float.floatToIntBits
[ http://issues.apache.org/jira/browse/LUCENE-467?page=comments#action_12357839 ]

Paul Smith commented on LUCENE-467:
-----------------------------------

I probably didn't make my testing setup as clear as I should have. YourKit was set up to use method sampling (waking up every X milliseconds). I wouldn't use the 20% as an accurate figure, but suffice to say that improving this method would certainly improve things. Only testing the way you have will flush out the correct numbers. We don't use -server (we've been careful with it because of some stability problems on Linux).
[jira] Commented: (LUCENE-467) Use Float.floatToRawIntBits over Float.floatToIntBits
[ http://issues.apache.org/jira/browse/LUCENE-467?page=comments#action_12357851 ]

Yonik Seeley commented on LUCENE-467:
-------------------------------------

Fun with premature optimization! I know this isn't a bottleneck, but here is the fastest floatToByte() that I could come up with:

public static byte floatToByte(float f) {
  int bits = Float.floatToRawIntBits(f);
  if (bits <= 0) return 0;
  int mantissa = (bits & 0xffffff) >> 21;
  int exponent = (bits >>> 24) - 63 + 15;
  if ((exponent & ~0x1f) == 0) return (byte)((exponent << 3) | mantissa);
  else if (exponent < 0) return 1;
  return -1;
}

Here is the original from Lucene for reference:

public static byte floatToByte(float f) {
  if (f < 0.0f)                          // round negatives up to zero
    f = 0.0f;
  if (f == 0.0f)                         // zero is a special case
    return 0;
  int bits = Float.floatToIntBits(f);    // parse float into parts
  int mantissa = (bits & 0xffffff) >> 21;
  int exponent = (((bits >> 24) & 0x7f) - 63) + 15;
  if (exponent > 31) {                   // overflow: use max value
    exponent = 31;
    mantissa = 7;
  }
  if (exponent < 0) {                    // underflow: use min value
    exponent = 0;
    mantissa = 1;
  }
  return (byte)((exponent << 3) | mantissa);  // pack into a byte
}

Here is the performance (in seconds) on my P4 to do 640M conversions:

       JDK14-server  JDK14-client  JDK15-server  JDK15-client  JDK16-server  JDK16-client
orig   75.422        89.451        8.344         57.631        7.656         57.984
new    67.265        78.891        5.906         22.172        5.172         18.750
diff   12%           13%           41%           160%          48%           209%

Some decent gains... but the biggest moral of the story is: use Java >= 1.5 and -server if you can!
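Since the two floatToByte() versions are meant to be drop-in equivalents, a sweep over float bit patterns is a cheap way to check one against the other. A sketch with a hypothetical class name, assuming the mantissa mask is 0xffffff as in Lucene's Similarity, and skipping NaN (where the two versions legitimately differ on negatively-signed NaNs):

```java
public class FloatToByteCheck {
    // Yonik's optimized version.
    static byte fast(float f) {
        int bits = Float.floatToRawIntBits(f);
        if (bits <= 0) return 0;                 // zero, -0.0f, and all negatives
        int mantissa = (bits & 0xffffff) >> 21;
        int exponent = (bits >>> 24) - 63 + 15;
        if ((exponent & ~0x1f) == 0) return (byte) ((exponent << 3) | mantissa);
        else if (exponent < 0) return 1;         // underflow: min value
        return -1;                               // overflow: max value
    }

    // The original Lucene version, for comparison.
    static byte slow(float f) {
        if (f < 0.0f) f = 0.0f;                  // round negatives up to zero
        if (f == 0.0f) return 0;                 // zero is a special case
        int bits = Float.floatToIntBits(f);
        int mantissa = (bits & 0xffffff) >> 21;
        int exponent = (((bits >> 24) & 0x7f) - 63) + 15;
        if (exponent > 31) { exponent = 31; mantissa = 7; }
        if (exponent < 0)  { exponent = 0;  mantissa = 1; }
        return (byte) ((exponent << 3) | mantissa);
    }

    public static void main(String[] args) {
        // Stride through the 32-bit space; set the stride to 1 for a full sweep.
        for (long i = Integer.MIN_VALUE; i <= Integer.MAX_VALUE; i += 997) {
            float f = Float.intBitsToFloat((int) i);
            if (Float.isNaN(f)) continue;        // norms are never NaN
            if (fast(f) != slow(f))
                throw new AssertionError("mismatch at bits=0x" + Integer.toHexString((int) i));
        }
        System.out.println("implementations agree");
    }
}
```

Checking the edge cases by hand: negatives and -0.0f hit the bits <= 0 branch and return 0 in both; subnormals underflow to 1 in both; infinity overflows to (31 << 3) | 7 = 255, i.e. byte -1, in both.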