Re: fileformats.html not in sync with fileformats.xml
Doron Cohen wrote: http://issues.apache.org/jira/browse/LUCENE-738 updated fileformats.xml. This shows correctly in http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml?view=markup but is not reflected (2nd day now) in the main site version http://lucene.apache.org/java/docs/fileformats.html OK, it looks like the docs just needed to be regenerated and pushed to the site. I've done this now (it was a great chance to test the instructions at http://wiki.apache.org/jakarta-lucene/HowToUpdateTheWebsite -- thanks Hoss!). So the changes should refresh to the public site in 30 minutes or so. Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12458158 ] Ning Li commented on LUCENE-565: > Can the same thing happen with your patch (with a smaller window), or are deletes applied between writing the new segment and writing the new segments file that references it? (hard to tell from the current diff in isolation) No, it does not happen with the patch, no matter what the window size is. This is because the results of flushing ram - both inserts and deletes - are committed in the same transaction. Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided) - Key: LUCENE-565 URL: http://issues.apache.org/jira/browse/LUCENE-565 Project: Lucene - Java Issue Type: Bug Components: Index Reporter: Ning Li Attachments: IndexWriter.java, IndexWriter.July09.patch, IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java Today, applications have to open/close an IndexWriter and open/close an IndexReader directly or indirectly (via IndexModifier) in order to handle a mix of inserts and deletes. This performs well when inserts and deletes come in fairly large batches. However, performance can degrade dramatically when inserts and deletes are interleaved in small batches. This is because the ramDirectory is flushed to disk whenever an IndexWriter is closed, causing a lot of small segments to be created on disk, which eventually need to be merged. We would like to propose a small API change to eliminate this problem. We are aware that this kind of change has come up in discussions before. See http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049 .
The difference this time is that we have implemented the change and tested its performance, as described below. API Changes --- We propose adding a deleteDocuments(Term term) method to IndexWriter. Using this method, inserts and deletes can be interleaved using the same IndexWriter. Note that, with this change, it would be very easy to add another method to IndexWriter for updating documents, allowing applications to avoid a separate delete and insert when updating a document. Also note that this change can co-exist with the existing APIs for deleting documents using an IndexReader. But if our proposal is accepted, we think those APIs should probably be deprecated. Coding Changes -- Coding changes are localized to IndexWriter. Internally, the new deleteDocuments() method works by buffering the terms to be deleted. Deletes are deferred until the ramDirectory is flushed to disk, either because it becomes full or because the IndexWriter is closed. Using Java synchronization, care is taken to ensure that an interleaved sequence of inserts and deletes for the same document is properly serialized. We have attached a modified version of IndexWriter from Release 1.9.1 with these changes. Only a few hundred lines of coding changes are needed. All changes are commented with CHANGE. We have also attached a modified version of an example from Chapter 2.2 of Lucene in Action. Performance Results --- To test the performance of our proposed changes, we ran some experiments using the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz Intel Xeon server running Linux. The disk storage was configured as a RAID0 array with 5 drives. Before the indexes were built, the input documents were parsed to remove the HTML from them (i.e., only the text was indexed). This was done to minimize the impact of parsing on performance. A simple WhitespaceAnalyzer was used during index build. We experimented with three workloads: - Insert only.
1.6M documents were inserted and the final index size was 2.3GB. - Insert/delete (big batches). The same documents were inserted, but 25% were deleted; 1000 documents were deleted for every 4000 inserted. - Insert/delete (small batches). In this case, 5 documents were deleted for every 20 inserted.

Workload                        current IndexWriter  current IndexModifier  new IndexWriter
-------------------------------------------------------------------------------------------
Insert only                     116 min              119 min                116 min
Insert/delete (big batches)     --                   135 min                125 min
Insert/delete (small batches)   --                   338 min                134 min

As the experiments show, with the proposed changes, the performance improved by 60% when inserts and deletes were interleaved in small batches. Regards, Ning -- Ning Li, Search Technologies, IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120
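The buffering scheme described in the proposal - defer the delete terms until the ramDirectory is flushed, then commit the buffered inserts and deletes together - can be modeled with plain collections. This is a hypothetical sketch of the idea only, not Lucene's actual IndexWriter internals; all names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the proposed IndexWriter.deleteDocuments(Term) behavior:
// deletes are buffered and only applied at flush time, so a flush commits
// buffered inserts and deletes in one step. Hypothetical sketch, not Lucene code.
public class BufferedDeletesDemo {
    private final List<String> committed = new ArrayList<>();    // stands in for on-disk segments
    private final List<String> ramBuffer = new ArrayList<>();    // buffered inserts
    private final Set<String> bufferedDeletes = new HashSet<>(); // buffered delete "terms"

    public synchronized void addDocument(String doc) { ramBuffer.add(doc); }

    // Nothing touches the index here; the term is merely remembered.
    public synchronized void deleteDocuments(String term) { bufferedDeletes.add(term); }

    // Flush applies buffered inserts and buffered deletes together.
    public synchronized void flush() {
        committed.addAll(ramBuffer);
        ramBuffer.clear();
        committed.removeIf(doc -> bufferedDeletes.stream().anyMatch(doc::contains));
        bufferedDeletes.clear();
    }

    public synchronized int numCommitted() { return committed.size(); }

    public static void main(String[] args) {
        BufferedDeletesDemo w = new BufferedDeletesDemo();
        w.addDocument("apache lucene");
        w.addDocument("apache harmony");
        w.deleteDocuments("harmony");     // deferred: nothing visible yet
        assert w.numCommitted() == 0;
        w.flush();                        // insert and delete commit together
        assert w.numCommitted() == 1;
        System.out.println("committed docs: " + w.numCommitted());
    }
}
```

Because the delete is only a buffered term until flush, interleaving many small insert/delete batches never forces small segments onto disk, which mirrors the performance argument above.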
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12458170 ] Yonik Seeley commented on LUCENE-565: > results of flushing ram - both inserts and deletes - are committed in the same transaction. OK, cool. I agree that's the ideal default behavior. Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided) - Key: LUCENE-565 URL: http://issues.apache.org/jira/browse/LUCENE-565 -- This message is automatically generated by JIRA.
[jira] Commented: (LUCENE-740) Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception
[ http://issues.apache.org/jira/browse/LUCENE-740?page=comments#action_12458201 ] Steven Parkes commented on LUCENE-740: I'm kind of wondering about the snowball licensing, so I'm intrigued by Yonik's comment. Is cleanup necessary? Did the original snowball authors agree to license the software under the AL 2.0? That's what LICENSE.txt says now. The source site cites the BSD license and says you can't claim it's licensed under another license. Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception -- Key: LUCENE-740 URL: http://issues.apache.org/jira/browse/LUCENE-740 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 1.9 Environment: linux amd64 Reporter: Andreas Kohn Priority: Minor Attachments: lucene-1.9.1-SnowballProgram.java, snowball.patch.txt (copied from mail to java-user) While playing with the various stemmers of Lucene(-1.9.1), I got an index out of bounds exception:

lucene-1.9.1$ java -cp build/contrib/snowball/lucene-snowball-1.9.2-dev.jar net.sf.snowball.TestApp Kp bla.txt
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:615)
        at net.sf.snowball.TestApp.main(TestApp.java:56)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 11
        at java.lang.StringBuffer.charAt(StringBuffer.java:303)
        at net.sf.snowball.SnowballProgram.find_among_b(SnowballProgram.java:270)
        at net.sf.snowball.ext.KpStemmer.r_Step_4(KpStemmer.java:1122)
        at net.sf.snowball.ext.KpStemmer.stem(KpStemmer.java:1997)

This happens when executing the command above; bla.txt contains just this word: 'spijsvertering'.
After some debugging, and some tests with the original snowball distribution from snowball.tartarus.org, it seems that the attached change is needed to avoid the exception. (The change comes from tartarus' SnowballProgram.java) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
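The stack trace above points at StringBuffer.charAt being called with an index at or past the buffer's length, which always throws. Below is a self-contained sketch of that failure mode and of a bounds guard in the spirit of the tartarus fix; safeCharEquals is a hypothetical helper, not the actual find_among_b code:

```java
// Demonstrates the failure mode from the stack trace: StringBuffer.charAt
// throws an out-of-bounds exception when the index is >= length(). The
// guarded variant treats an out-of-range index as a failed match instead.
public class CharAtGuardDemo {
    public static boolean safeCharEquals(StringBuffer sb, int index, char expected) {
        if (index < 0 || index >= sb.length()) {
            return false; // out of range: fail the match rather than throw
        }
        return sb.charAt(index) == expected;
    }

    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer("spijsvertering"); // 14 chars
        boolean threw = false;
        try {
            sb.charAt(14); // one past the end, like index 11 in the report
        } catch (IndexOutOfBoundsException e) {
            threw = true;
        }
        assert threw;
        assert !safeCharEquals(sb, 14, 'g'); // guarded access just fails
        assert safeCharEquals(sb, 13, 'g');  // last character is 'g'
    }
}
```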
[jira] Commented: (LUCENE-740) Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception
[ http://issues.apache.org/jira/browse/LUCENE-740?page=comments#action_12458209 ] Doug Cutting commented on LUCENE-740: This is a good question. We redistribute stuff generated from Snowball sources, not the original files. Does this constitute a redistribution in binary form? I think the LICENSE.txt here refers to the code that's included in this sub-tree, which is Apache-licensed. So that's okay. If anything, we might need to add something to NOTICE.txt and/or include a copy of Snowball's BSD license too, as something like SNOWBALL-LICENSE.txt. Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception -- Key: LUCENE-740 URL: http://issues.apache.org/jira/browse/LUCENE-740 -- This message is automatically generated by JIRA.
IBM OmniFind Yahoo! Edition
I just saw the following new Lucene application announced: http://omnifind.ibm.yahoo.net/productinfo.php While I work for Yahoo!, I know nothing about Yahoo!'s involvement in this except for what I've just read in the press. Were any of the IBM folks on this list involved? If so, congratulations! Can you tell us any more about how Lucene is used here? (I see that Steven Parkes already updated the Powered By page in the wiki...) Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: IBM OmniFind Yahoo! Edition
The primary folks on the Lucene side are Michael Busch and Andreas Neumann. Certainly other folks at IBM have contributed significant pieces (though notably NOT me), but Michael and Andreas did most of the heavy lifting. I'll leave them to take credit for their work. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-746) Incorrect error message in AnalyzingQueryParser.getPrefixQuery
Incorrect error message in AnalyzingQueryParser.getPrefixQuery -- Key: LUCENE-746 URL: http://issues.apache.org/jira/browse/LUCENE-746 Project: Lucene - Java Issue Type: Improvement Components: Other Reporter: Ronnie Kolehmainen Priority: Minor The error message of getPrefixQuery is incorrect when tokens were added, for example by a stemmer. The message is 'token was consumed' even if tokens were added. Attached is a patch which, when applied, gives a better description of what actually happened. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-746) Incorrect error message in AnalyzingQueryParser.getPrefixQuery
[ http://issues.apache.org/jira/browse/LUCENE-746?page=all ] Ronnie Kolehmainen updated LUCENE-746: -- Attachment: AnalyzingQueryParser.getPrefixQuery.patch Patch for current trunk. Incorrect error message in AnalyzingQueryParser.getPrefixQuery -- Key: LUCENE-746 URL: http://issues.apache.org/jira/browse/LUCENE-746 -- This message is automatically generated by JIRA.
GData DB4o reloaded
Hello all, two weeks ago I met the DB4O CEO at the DB4O roadshow in Berlin. We talked about gdata, lucene and the license nightmare. Two days later I got an email that db4o will release a third license to allow projects like lucene to closely distribute the db4o binaries and source. I received the license today. So here goes the question: do I have to talk to some ASF officials / lawyers about that stuff, or should I just add the license text and jar to the svn? Currently I just have a PDF license document; should I send it to the list at all?! best regards Simon - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: GData DB4o reloaded
What are the licensing terms? -Brian On Dec 13, 2006, at 11:31 AM, Simon Willnauer wrote: > Hello all, two weeks ago I met the DB4O CEO at the DB4O roadshow in Berlin. [...] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: GData DB4o reloaded
There you go: http://www.db4o.com/about/company/legalpolicies/docl.aspx thanks simon On 12/13/06, Brian McCallister [EMAIL PROTECTED] wrote: > What are the licensing terms? -Brian [...] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-436) [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception
[ http://issues.apache.org/jira/browse/LUCENE-436?page=comments#action_12458264 ] Otis Gospodnetic commented on LUCENE-436: 4 months later, I think I see the same problem here. I'm using JDK 1.6 (I saw the same problem under 1.5.0_0(8,9,10)) and Lucene from HEAD (2.1-dev). I'm running out of a 2GB heap in under 1 day on a production system that searches tens of thousands of indexes, where a few hundred of them have IndexSearchers open to them at any one time, with unused IndexSearchers getting closed after some period of inactivity. I'm periodically dumping the heap with jconsole and noticing a continuously increasing number of:

org.apache.lucene.index.TermInfo
org.apache.lucene.index.CompoundFileReader$CSIndexInput
org.apache.lucene.index.Term
org.apache.lucene.index.SegmentTermEnum
...

There was a LOT of back and forth here. What is the final solution? I see a complete new copy of TermInfosReader, but there are a lot of formatting changes in there; it's hard to tell what was actually changed, even with diff -bB --expand-tabs. I also see FixedThreadLocal, but I see no references to it from TermInfosReader...? [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception Key: LUCENE-436 URL: http://issues.apache.org/jira/browse/LUCENE-436 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 1.4 Environment: Solaris JVM 1.4.1 Linux JVM 1.4.2/1.5.0 Windows not tested Reporter: kieran Attachments: FixedThreadLocal.java, lucene-1.9.1.patch, Lucene-436-TestCase.tar.gz, TermInfosReader.java, ThreadLocalTest.java We've been experiencing terrible memory problems on our production search server, running lucene (1.4.3). Our live app regularly opens new indexes and, in doing so, releases old IndexReaders for garbage collection. But...there appears to be a memory leak in org.apache.lucene.index.TermInfosReader.java.
Under certain conditions (possibly related to JVM version, although I've personally observed it under both linux JVM 1.4.2_06, and 1.5.0_03, and SUNOS JVM 1.4.1) the ThreadLocal member variable, enumerators, doesn't get garbage-collected when the TermInfosReader object is gc-ed. Looking at the code in TermInfosReader.java, there's no reason why it _shouldn't_ be gc-ed, so I can only presume (and I've seen this suggested elsewhere) that there could be a bug in the garbage collector of some JVMs. I've seen this problem briefly discussed; in particular at the following URL: http://java2.5341.com/msg/85821.html The patch that Doug recommended, which is included in lucene-1.4.3, doesn't work in our particular circumstances. Doug's patch only clears the ThreadLocal variable for the thread running the finalizer (my knowledge of java breaks down here - I'm not sure which thread actually runs the finalizer). In our situation, the TermInfosReader is (potentially) used by more than one thread, meaning that Doug's patch _doesn't_ allow the affected JVMs to correctly collect garbage. So...I've devised a simple patch which, from my observations on linux JVMs 1.4.2_06, and 1.5.0_03, fixes this problem. Kieran PS Thanks to daniel naber for pointing me to jira/lucene

@@ -19,6 +19,7 @@
 import java.io.IOException;
 import org.apache.lucene.store.Directory;
+import java.util.Hashtable;

 /** This stores a monotonically increasing set of <Term, TermInfo> pairs in a
  * Directory. Pairs are accessed either by Term or by ordinal position the
@@ -29,7 +30,7 @@
   private String segment;
   private FieldInfos fieldInfos;

-  private ThreadLocal enumerators = new ThreadLocal();
+  private final Hashtable enumeratorsByThread = new Hashtable();
   private SegmentTermEnum origEnum;
   private long size;
@@ -60,10 +61,10 @@
   }

   private SegmentTermEnum getEnum() {
-    SegmentTermEnum termEnum = (SegmentTermEnum)enumerators.get();
+    SegmentTermEnum termEnum = (SegmentTermEnum)enumeratorsByThread.get(Thread.currentThread());
     if (termEnum == null) {
       termEnum = terms();
-      enumerators.set(termEnum);
+      enumeratorsByThread.put(Thread.currentThread(), termEnum);
     }
     return termEnum;
   }
@@ -195,5 +196,15 @@
   public SegmentTermEnum terms(Term term) throws IOException {
     get(term);
     return (SegmentTermEnum)getEnum().clone();
+  }
+
+  /* some jvms might have trouble gc-ing enumeratorsByThread */
+  protected void finalize() throws Throwable {
+    try {
+      // make sure gc can clear up.
+      enumeratorsByThread.clear();
+    } finally {
+      super.finalize();
+    }
   }
 }

TermInfosReader.java, full source: == package org.apache.lucene.index; /** * Copyright 2004 The Apache
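The substance of the patch - swapping the ThreadLocal for a Hashtable keyed on Thread.currentThread(), so cached entries can be released explicitly - can be illustrated with a self-contained sketch. All names here are hypothetical; the real code caches SegmentTermEnum instances created by terms():

```java
import java.util.Hashtable;

// Simplified model of the patch: a per-thread cache backed by a Hashtable
// keyed on Thread.currentThread(). Unlike a ThreadLocal, every entry can be
// released in one explicit clear() call, which is what the patched
// finalize() relies on. Hypothetical names, not the actual Lucene code.
public class PerThreadCacheDemo {
    private final Hashtable<Thread, StringBuilder> byThread = new Hashtable<>();
    private int created = 0;

    synchronized StringBuilder getEnum() {
        StringBuilder e = byThread.get(Thread.currentThread());
        if (e == null) {
            e = new StringBuilder(); // stands in for terms() in the patch
            created++;
            byThread.put(Thread.currentThread(), e);
        }
        return e;
    }

    // Analogue of the patched finalize(): drop all cached entries at once.
    synchronized void close() { byThread.clear(); }

    synchronized int createdCount() { return created; }
    synchronized int cachedCount()  { return byThread.size(); }

    public static void main(String[] args) throws InterruptedException {
        PerThreadCacheDemo cache = new PerThreadCacheDemo();
        cache.getEnum();
        cache.getEnum();                       // same thread: entry is reused
        Thread t = new Thread(cache::getEnum); // new thread: gets its own entry
        t.start();
        t.join();
        assert cache.createdCount() == 2;
        assert cache.cachedCount() == 2;
        cache.close();                         // everything becomes collectable
        assert cache.cachedCount() == 0;
    }
}
```

The trade-off the thread discusses follows directly: the Hashtable pins one entry per thread until clear() runs, but it never depends on a JVM's ThreadLocal cleanup behaving correctly.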
TermInfosReader and clone of SegmentTermEnum
Hi, I'm looking at Robert Engels' patches in http://issues.apache.org/jira/browse/LUCENE-436 and looking at TermInfosReader. I think I understand why there is a ThreadLocal there in the first place - to act as a per-thread cache for the expensive-to-compute SegmentTermEnum, yes? But why is there a need to clone() the (original) SegmentTermEnum? Thanks, Otis - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-436) [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception
[ http://issues.apache.org/jira/browse/LUCENE-436?page=comments#action_12458292 ] robert engels commented on LUCENE-436: -- I would doubt the ThreadLocal issue that was in 1.4, changed in 1.5, would be reintroduced in 1.6. I do not use Lucene 2.1 so I can't say for certain that a new memory bug hasn't been introduced. I suggest attaching a good profiler (like JProfiler) and figuring out the cause of the memory leak (the root references). I use 1.9-based Lucene and can say unequivocally there are no inherent memory issues (especially when running under 1.5+). There may also be new issues introduced in JDK 6 - we have not tested with it, only 1.4 and 1.5. [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception Key: LUCENE-436 URL: http://issues.apache.org/jira/browse/LUCENE-436 -- This message is automatically generated by JIRA.
Re: TermInfosReader and clone of SegmentTermEnum
Aaaah, I think I get it. TermInfosReader can be shared by multiple threads. Each thread will need access to the SegmentTermEnum inside the TIR, but since each of them will search, scan, and seek to a different location, each thread needs its own copy/clone of the original SegmentTermEnum. The ThreadLocal is then used as a simple cache for the clone of the original SegmentTermEnum, so a single thread can get to it without repeating the scan/seek stuff, and so that each thread works with its own clone of SegmentTermEnum. Otis - Original Message From: Otis Gospodnetic [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Wednesday, December 13, 2006 4:53:45 PM Subject: TermInfosReader and clone of SegmentTermEnum [...] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
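The pattern described here - one shared, never-repositioned original enumerator, with each thread lazily caching its own clone in a ThreadLocal - can be sketched standalone. TermEnum is a hypothetical stand-in, not the actual TermInfosReader/SegmentTermEnum code:

```java
// Sketch of a per-thread clone cache: the shared original is cloned once
// per thread via a ThreadLocal, so each thread's seek/scan position is
// independent and the original is never repositioned. Hypothetical classes.
public class CloneCacheDemo {
    static class TermEnum implements Cloneable {
        int position; // each thread seeks this independently

        @Override public TermEnum clone() {
            try {
                return (TermEnum) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e); // cannot happen: we are Cloneable
            }
        }
    }

    final TermEnum origEnum = new TermEnum(); // shared, never repositioned

    private final ThreadLocal<TermEnum> enumerators =
        ThreadLocal.withInitial(() -> origEnum.clone());

    TermEnum getEnum() { return enumerators.get(); }

    public static void main(String[] args) throws InterruptedException {
        CloneCacheDemo reader = new CloneCacheDemo();
        reader.getEnum().position = 42;          // main thread seeks its clone
        final int[] other = new int[1];
        Thread t = new Thread(() -> other[0] = reader.getEnum().position);
        t.start();
        t.join();
        assert reader.getEnum().position == 42;  // main thread's position kept
        assert other[0] == 0;                    // other thread saw a fresh clone
        assert reader.origEnum.position == 0;    // original untouched
    }
}
```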
[jira] Resolved: (LUCENE-681) org.apache.lucene.document.Field is Serializable but doesn't have default constructor
[ http://issues.apache.org/jira/browse/LUCENE-681?page=all ] Otis Gospodnetic resolved LUCENE-681. - Resolution: Won't Fix I think Jed's right. Plus, calling new Field(), which would now be possible, would give us a Field without the actual information about the field - name, value, tokenized, stored, indexed, etc. org.apache.lucene.document.Field is Serializable but doesn't have default constructor - Key: LUCENE-681 URL: http://issues.apache.org/jira/browse/LUCENE-681 Project: Lucene - Java Issue Type: Bug Components: Other Affects Versions: 1.9, 2.0.0, 2.1, 2.0.1 Environment: doesn't depend on environment Reporter: Elijah Epifanov Priority: Critical When I try to pass a Document via the network or do anything involving serialization/deserialization I will get an exception. The following patch should help (Field.java):

public Field() {
}

private void writeObject(java.io.ObjectOutputStream out) throws IOException {
  out.defaultWriteObject();
}

private void readObject(java.io.ObjectInputStream in) throws IOException, ClassNotFoundException {
  in.defaultReadObject();
  if (name == null) {
    throw new NullPointerException("name cannot be null");
  }
  this.name = name.intern();  // field names are interned
}

Maybe other classes do not conform to serialization requirements too... -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
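The Won't Fix resolution is consistent with how Java serialization actually works: deserialization never invokes a constructor declared on the Serializable class itself - only the nearest non-serializable superclass needs an accessible no-arg constructor, and Object supplies one. A minimal round-trip with an illustrative class (not Lucene's Field) shows that a class with only a parameterized constructor serializes fine:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// A Serializable class with *no* default constructor round-trips fine,
// because deserialization bypasses constructors of Serializable classes.
public class NoDefaultCtorDemo {
    static class FieldLike implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final String value;

        FieldLike(String name, String value) {  // the only constructor
            this.name = name;
            this.value = value;
        }
    }

    static FieldLike roundTrip(FieldLike f) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(f);  // serialize to an in-memory buffer
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (FieldLike) in.readObject();  // no constructor of FieldLike runs here
        }
    }

    public static void main(String[] args) throws Exception {
        FieldLike copy = roundTrip(new FieldLike("title", "Lucene in Action"));
        System.out.println(copy.name + "=" + copy.value);  // prints title=Lucene in Action
    }
}
```

So if serialization of Field threw an exception, the missing no-arg constructor was not the cause, which supports resolving the issue as reported.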
Re: TermInfosReader and clone of SegmentTermEnum
That is correct. On Dec 13, 2006, at 4:48 PM, Otis Gospodnetic wrote: Aaaah, I think I get it. TermInfosReader can be shared by multiple threads. Each thread will need access to the SegmentTermEnum inside the TIR, but since each of them will search, scan, and seek to a different location, each thread needs its own copy/clone of the original SegmentTermEnum. ThreadLocal is then used as a simple cache for the clone of the original SegmentTermEnum, so a single thread can get to it without repeating the scan/seek work, and so that each thread works with its own clone of SegmentTermEnum. Otis - Original Message From: Otis Gospodnetic [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Wednesday, December 13, 2006 4:53:45 PM Subject: TermInfosReader and clone of SegmentTermEnum Hi, I'm looking at Robert Engels' patches in http://issues.apache.org/jira/browse/LUCENE-436 and looking at TermInfosReader. I think I understand why there is a ThreadLocal there in the first place - to act as a per-thread cache for the expensive-to-compute SegmentTermEnum, yes? But why is there a need to clone() the (original) SegmentTermEnum? Thanks, Otis - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Lucene nightly build failure
javacc-uptodate-check:

javacc-notice:
     [echo] One or more of the JavaCC .jj files is newer than its corresponding
     [echo] .java file. Run the javacc target to regenerate the artifacts.

init:

clover.setup:

clover.info:
     [echo] Clover not found. Code coverage reports disabled.

clover:

common.compile-core:
    [mkdir] Created dir: /tmp/lucene-nightly/build/classes/java
    [javac] Compiling 204 source files to /tmp/lucene-nightly/build/classes/java
    [javac] Note: /tmp/lucene-nightly/src/java/org/apache/lucene/queryParser/QueryParser.java uses or overrides a deprecated API.
    [javac] Note: Recompile with -deprecation for details.

compile-core:
     [rmic] RMI Compiling 1 class to /tmp/lucene-nightly/build/classes/java

compile-demo:
    [mkdir] Created dir: /tmp/lucene-nightly/build/classes/demo
    [javac] Compiling 17 source files to /tmp/lucene-nightly/build/classes/demo

common.compile-test:
    [mkdir] Created dir: /tmp/lucene-nightly/build/classes/test
    [javac] Compiling 124 source files to /tmp/lucene-nightly/build/classes/test
    [javac] Note: /tmp/lucene-nightly/src/test/org/apache/lucene/queryParser/TestQueryParser.java uses or overrides a deprecated API.
    [javac] Note: Recompile with -deprecation for details.
     [copy] Copying 2 files to /tmp/lucene-nightly/build/classes/test
     [copy] Copied 1 empty directory to 1 empty directory under /tmp/lucene-nightly/build/classes/test

compile-test:

test:
    [mkdir] Created dir: /tmp/lucene-nightly/build/test
    [junit] Testsuite: org.apache.lucene.TestDemo
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.399 sec
    [junit] Testsuite: org.apache.lucene.TestHitIterator
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.343 sec
    [junit] Testsuite: org.apache.lucene.TestSearch
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.426 sec
    [junit] Testsuite: org.apache.lucene.TestSearchForDuplicates
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.916 sec
    [junit] Testsuite: org.apache.lucene.analysis.TestAnalyzers
    [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.272 sec
    [junit] Testsuite: org.apache.lucene.analysis.TestISOLatin1AccentFilter
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.271 sec
    [junit] Testsuite: org.apache.lucene.analysis.TestKeywordAnalyzer
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.378 sec
    [junit] Testsuite: org.apache.lucene.analysis.TestLengthFilter
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.258 sec
    [junit] Testsuite: org.apache.lucene.analysis.TestPerFieldAnalzyerWrapper
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.262 sec
    [junit] Testsuite: org.apache.lucene.analysis.TestStandardAnalyzer
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.315 sec
    [junit] Testsuite: org.apache.lucene.analysis.TestStopAnalyzer
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.268 sec
    [junit] Testsuite: org.apache.lucene.analysis.TestStopFilter
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.261 sec
    [junit] Testsuite: org.apache.lucene.document.TestBinaryDocument
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.334 sec
    [junit] Testsuite: org.apache.lucene.document.TestDateTools
    [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.337 sec
    [junit] Testsuite: org.apache.lucene.document.TestDocument
    [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.377 sec
    [junit] Testsuite: org.apache.lucene.document.TestNumberTools
    [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.633 sec
    [junit] Testsuite: org.apache.lucene.index.TestAddIndexesNoOptimize
    [junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 2.726 sec
    [junit] Testsuite: org.apache.lucene.index.TestBackwardsCompatibility
    [junit] Tests run: 9, Failures: 0, Errors: 0, Time elapsed: 1.041 sec
    [junit] Testsuite: org.apache.lucene.index.TestCompoundFile
    [junit] Tests run: 10, Failures: 0, Errors: 0, Time elapsed: 3.316 sec
    [junit] Testsuite: org.apache.lucene.index.TestDoc
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.435 sec
    [junit] Testsuite: org.apache.lucene.index.TestDocumentWriter
    [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.504 sec
    [junit] Testsuite: org.apache.lucene.index.TestFieldInfos
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.333 sec
    [junit] Testsuite: org.apache.lucene.index.TestFieldsReader
    [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 7.781 sec
    [junit] - Standard Output ---
    [junit] Average Non-lazy time (should be very close to zero): 0 ms for 50 reads
    [junit] Average
[EMAIL PROTECTED]: Project lucene-java (in module lucene-java) failed
To whom it may engage... This is an automated request, but not an unsolicited one. For more information please visit http://gump.apache.org/nagged.html, and/or contact the folk at [EMAIL PROTECTED] Project lucene-java has an issue affecting its community integration. This issue affects 4 projects. The current state of this project is 'Failed', with reason 'Build Failed'. For reference only, the following projects are affected by this: - eyebrowse : Web-based mail archive browsing - jakarta-lucene : Java Based Search Engine - jakarta-slide : Content Management System based on WebDAV technology - lucene-java : Java Based Search Engine Full details are available at: http://vmgump.apache.org/gump/public/lucene-java/lucene-java/index.html That said, some information snippets are provided here. The following annotations (debug/informational/warning/error messages) were provided: -DEBUG- Sole output [lucene-core-13122006.jar] identifier set to project name -DEBUG- Dependency on javacc exists, no need to add for property javacc.home. 
-INFO- Failed with reason build failed
-DEBUG- Extracted fallback artifacts from Gump Repository

The following work was performed: http://vmgump.apache.org/gump/public/lucene-java/lucene-java/gump_work/build_lucene-java_lucene-java.html
Work Name: build_lucene-java_lucene-java (Type: Build)
Work ended in a state of: Failed
Elapsed: 9 secs
Command Line: java -Djava.awt.headless=true -Xbootclasspath/p:/usr/local/gump/public/workspace/xml-commons/java/external/build/xml-apis.jar:/usr/local/gump/public/workspace/xml-xerces2/build/xercesImpl.jar org.apache.tools.ant.Main -Dgump.merge=/x1/gump/public/gump/work/merge.xml -Dbuild.sysclasspath=only -Dversion=13122006 -Djavacc.home=/usr/local/gump/packages/javacc-3.1 package
[Working Directory: /usr/local/gump/public/workspace/lucene-java]
CLASSPATH: /opt/jdk1.5/lib/tools.jar:/usr/local/gump/public/workspace/lucene-java/build/classes/java:/usr/local/gump/public/workspace/lucene-java/build/classes/demo:/usr/local/gump/public/workspace/lucene-java/build/classes/test:/usr/local/gump/public/workspace/ant/dist/lib/ant-jmf.jar:/usr/local/gump/public/workspace/ant/dist/lib/ant-swing.jar:/usr/local/gump/public/workspace/ant/dist/lib/ant-apache-resolver.jar:/usr/local/gump/public/workspace/ant/dist/lib/ant-trax.jar:/usr/local/gump/public/workspace/ant/dist/lib/ant-junit.jar:/usr/local/gump/public/workspace/ant/dist/lib/ant-launcher.jar:/usr/local/gump/public/workspace/ant/dist/lib/ant-nodeps.jar:/usr/local/gump/public/workspace/ant/dist/lib/ant.jar:/usr/local/gump/packages/junit3.8.1/junit.jar:/usr/local/gump/public/workspace/xml-commons/java/build/resolver.jar:/usr/local/gump/packages/je-1.7.1/lib/je.jar:/usr/local/gump/packages/javacc-3.1/bin/lib/javacc.jar:/usr/local/gump/packages/jtidy-04aug2000r7-dev/build/Tidy.jar:/usr/local/gump/public/workspace/dist/junit/junit.jar
-
Buildfile: build.xml

javacc-uptodate-check:

javacc-notice:
     [echo] One or more of the JavaCC .jj files is newer than its corresponding
     [echo] .java file. Run the javacc target to regenerate the artifacts.

init:

clover.setup:

clover.info:
     [echo] Clover not found. Code coverage reports disabled.

clover:

common.compile-core:
    [mkdir] Created dir: /x1/gump/public/workspace/lucene-java/build/classes/java
    [javac] Compiling 204 source files to /x1/gump/public/workspace/lucene-java/build/classes/java
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.

compile-core:
     [rmic] RMI Compiling 1 class to /x1/gump/public/workspace/lucene-java/build/classes/java

jar-core:
      [jar] Building jar: /x1/gump/public/workspace/lucene-java/build/lucene-core-13122006.jar

javadocs:
    [mkdir] Created dir: /x1/gump/public/workspace/lucene-java/build/docs/api

BUILD FAILED
/x1/gump/public/workspace/lucene-java/build.xml:126: The following error occurred while executing this line:
/x1/gump/public/workspace/lucene-java/build.xml:368: /x1/gump/public/workspace/lucene-java/contrib/gdata-server/src/java not found.

Total time: 9 seconds
-
To subscribe to this information via syndicated feeds:
- RSS: http://vmgump.apache.org/gump/public/lucene-java/lucene-java/rss.xml
- Atom: http://vmgump.apache.org/gump/public/lucene-java/lucene-java/atom.xml

== Gump Tracking Only ===
Produced by Gump version 2.2.
Gump Run 14001613122006, vmgump.apache.org:vmgump-public:14001613122006
Gump E-mail Identifier (unique within run) #1.
--
Apache Gump http://gump.apache.org/ [Instance: vmgump]
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Locale string compare: Java vs. C#
Hi folks, Over at Lucene.Net, I have run into a NUnit test which is failing with Lucene.Net (C#) but is passing with Lucene (Java). The two tests that fail are: TestInternationalMultiSearcherSort and TestInternationalSort. After several hours of investigation, I narrowed the problem down to what I believe is a difference in the way Java and .NET implement compare. The code in question is this method (found in FieldSortedHitQueue.java):

public final int compare (final ScoreDoc i, final ScoreDoc j) {
  return collator.compare (index[i.doc], index[j.doc]);
}

To demonstrate the compare problem (Java vs. .NET) I created this simple code both in Java and C#:

// Java code: you get back 1 for 'diff'
String s1 = "H\u00D8T";
String s2 = "HUT";
Collator collator = Collator.getInstance (Locale.US);
int diff = collator.compare(s1, s2);

// C# code: you get back -1 for 'res'
string s1 = "H\u00D8T";
string s2 = "HUT";
System.Globalization.CultureInfo locale = new System.Globalization.CultureInfo("en-US");
System.Globalization.CompareInfo collator = locale.CompareInfo;
int res = collator.Compare(s1, s2);

Java will give me back a 1 while .NET gives me back -1. So, what I am trying to figure out is who is doing the right thing? Or am I missing additional calls before I can compare? My goal is to understand why the difference exists so that, based on that understanding, I can judge how serious this issue is and either find a fix for it or just document it as a language difference between Java and .NET. Btw, this is based on Lucene 2.0 for both Java and C# Lucene. Regards, -- George Aroush - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Locale string compare: Java vs. C#
Surprising, but it looks to me like a bug in Java's collation rules for en-US. According to http://developer.mimer.com/collations/charts/UCA_latin.htm, \u00D8 (which is Latin Capital Letter O With Stroke) should sort before U, implying -1 is the correct result. Java is returning 1 for all strengths of the collator. Maybe there is some other subtlety with this character... Chuck

George Aroush wrote on 12/13/2006 04:20 PM: Hi folks, Over at Lucene.Net, I have run into a NUnit test which is failing with Lucene.Net (C#) but is passing with Lucene (Java). The two tests that fail are: TestInternationalMultiSearcherSort and TestInternationalSort. After several hours of investigation, I narrowed the problem down to what I believe is a difference in the way Java and .NET implement compare. The code in question is this method (found in FieldSortedHitQueue.java):

public final int compare (final ScoreDoc i, final ScoreDoc j) {
  return collator.compare (index[i.doc], index[j.doc]);
}

To demonstrate the compare problem (Java vs. .NET) I created this simple code both in Java and C#:

// Java code: you get back 1 for 'diff'
String s1 = "H\u00D8T";
String s2 = "HUT";
Collator collator = Collator.getInstance (Locale.US);
int diff = collator.compare(s1, s2);

// C# code: you get back -1 for 'res'
string s1 = "H\u00D8T";
string s2 = "HUT";
System.Globalization.CultureInfo locale = new System.Globalization.CultureInfo("en-US");
System.Globalization.CompareInfo collator = locale.CompareInfo;
int res = collator.Compare(s1, s2);

Java will give me back a 1 while .NET gives me back -1. So, what I am trying to figure out is who is doing the right thing? Or am I missing additional calls before I can compare? My goal is to understand why the difference exists so that, based on that understanding, I can judge how serious this issue is and either find a fix for it or just document it as a language difference between Java and .NET. Btw, this is based on Lucene 2.0 for both Java and C# Lucene.
Regards, -- George Aroush - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
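Chuck's observation ("Java is returning 1 for all strengths") can be probed directly. The sketch below checks the sign of the comparison at each collation strength; the sign observed for \u00D8 vs. U depends on the JDK's collation tables, so it is printed rather than assumed. Only properties that must hold for any correct collator (reflexivity, antisymmetry) are asserted:

```java
import java.text.Collator;
import java.util.Locale;

// Probes how this JVM's en-US Collator orders O-with-stroke against U
// at each strength. The observed sign is JDK/table dependent.
public class CollatorProbe {
    public static void main(String[] args) {
        String s1 = "H\u00D8T";  // H, Latin Capital Letter O With Stroke, T
        String s2 = "HUT";
        Collator collator = Collator.getInstance(Locale.US);
        int[] strengths = { Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY };
        for (int strength : strengths) {
            collator.setStrength(strength);
            // Integer.signum normalizes to -1/0/1 regardless of implementation.
            System.out.println("strength " + strength + ": "
                + Integer.signum(collator.compare(s1, s2)));
        }
        // Invariants that hold for any correct collator:
        System.out.println(collator.compare(s1, s1));  // prints 0
        System.out.println(Integer.signum(collator.compare(s1, s2))
            == -Integer.signum(collator.compare(s2, s1)));  // prints true
    }
}
```

If the three strength lines all print the same sign, that matches Chuck's report that the ordering is a primary (base-letter) difference in Java's tables rather than an accent-level one.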