[jira] Created: (LUCENE-503) Contrib: ThaiAnalyzer to enable Thai full-text search in Lucene
Contrib: ThaiAnalyzer to enable Thai full-text search in Lucene
---------------------------------------------------------------

Key: LUCENE-503
URL: http://issues.apache.org/jira/browse/LUCENE-503
Project: Lucene - Java
Type: New Feature
Components: Analysis
Versions: 1.4
Reporter: Samphan Raruenrom

Thai text doesn't have spaces between words. Usually, a dictionary-based algorithm is used to break a string into words. For Lucene to be usable for Thai, an Analyzer that knows how to break Thai words is needed. I've implemented such an Analyzer, ThaiAnalyzer, using the ICU4j DictionaryBasedBreakIterator for word breaking. I'll upload the code later. I'm normally a C++ programmer and very new to Java, so please review the code for any problems. One possible problem is that it requires ICU4j; I don't know whether this is OK.
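For context, here is a minimal sketch of the kind of dictionary-based word breaking the analyzer relies on, using java.text.BreakIterator with the Thai locale as a stand-in for ICU4J's DictionaryBasedBreakIterator (the actual ThaiAnalyzer attachment may differ, and the Thai dictionary is only present in some JDK/ICU builds):

    import java.text.BreakIterator;
    import java.util.Locale;

    public class ThaiBreakDemo {
      public static void main(String[] args) {
        // Thai for "Thai language", written with no spaces between the two words.
        String text = "\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22";
        // On JDK/ICU builds that ship a Thai dictionary, the word instance for
        // the Thai locale performs dictionary-based breaking.
        BreakIterator breaker = BreakIterator.getWordInstance(new Locale("th", "TH"));
        breaker.setText(text);
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE; start = end, end = breaker.next()) {
          System.out.println(text.substring(start, end)); // one line per detected word
        }
      }
    }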
FuzzyQuery / PriorityQueue BUG
Hi,

FuzzyQuery produces a java.lang.NegativeArraySizeException in PriorityQueue.initialize if I use Integer.MAX_VALUE as BooleanQuery.MaxClauseCount. This is because it adds 1 to MaxClauseCount and tries to allocate an array of that size (I think it overflows to MIN_VALUE). Usually nobody needs that many clauses, but I think this should be caught somehow. Perhaps an error like "your MaxClauseCount is too large" could do it, so the user knows where to find the problem.

Greets,
Joerg
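A minimal demonstration of the overflow being described, assuming (as the report says) that PriorityQueue allocates maxSize + 1 slots:

    public class OverflowDemo {
      public static void main(String[] args) {
        int maxSize = Integer.MAX_VALUE;
        int length = maxSize + 1;           // int overflow: wraps to Integer.MIN_VALUE
        Object[] heap = new Object[length]; // throws java.lang.NegativeArraySizeException
      }
    }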
Re: FuzzyQuery / PriorityQueue BUG
Jörg, could you please add this to JIRA so that things don't get lost? If you have a patch and/or a test case showing the problem, it would be great if you could append it to JIRA as well.

thanks,
Bernhard

Jörg Henß wrote:
> FuzzyQuery produces a java.lang.NegativeArraySizeException in PriorityQueue.initialize
> if I use Integer.MAX_VALUE as BooleanQuery.MaxClauseCount.
[jira] Created: (LUCENE-504) FuzzyQuery produces a java.lang.NegativeArraySizeException in PriorityQueue.initialize if I use Integer.MAX_VALUE as BooleanQuery.MaxClauseCount
FuzzyQuery produces a java.lang.NegativeArraySizeException in PriorityQueue.initialize if I use Integer.MAX_VALUE as BooleanQuery.MaxClauseCount
------------------------------------------------------------------------------------------------------------------------------------------------

Key: LUCENE-504
URL: http://issues.apache.org/jira/browse/LUCENE-504
Project: Lucene - Java
Type: Bug
Components: Search
Versions: 1.9
Reporter: Joerg Henss
Priority: Minor

PriorityQueue throws a java.lang.NegativeArraySizeException when initialized with Integer.MAX_VALUE, because the integer overflows. I think this could be a general problem with PriorityQueue. The error occurred when I set BooleanQuery.MaxClauseCount to Integer.MAX_VALUE and used a FuzzyQuery for searching.
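One possible guard along the lines Joerg suggests, sketched against a simplified stand-in for Lucene's PriorityQueue (the field names and the maxSize + 1 allocation are taken from the report; the real class differs):

    abstract class BoundedPriorityQueue {
      private Object[] heap;
      private int size;
      private int maxSize;

      protected final void initialize(int maxSize) {
        // Fail fast with a readable error instead of letting maxSize + 1
        // wrap around to a negative array size.
        if (maxSize >= Integer.MAX_VALUE) {
          throw new IllegalArgumentException("maxSize is too large: " + maxSize);
        }
        this.size = 0;
        this.maxSize = maxSize;
        this.heap = new Object[maxSize + 1]; // Lucene's heap is 1-based, hence the +1
      }
    }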
[jira] Updated: (LUCENE-504) FuzzyQuery produces a java.lang.NegativeArraySizeException in PriorityQueue.initialize if I use Integer.MAX_VALUE as BooleanQuery.MaxClauseCount
[ http://issues.apache.org/jira/browse/LUCENE-504?page=all ]

Joerg Henss updated LUCENE-504:

Attachment: TestFuzzyQueryError.java

Simple test showing the error.
Re: version issue ?
Andi Vajda wrote:
> It would seem to me that the source code snapshot that is made to release 'official' source
> and binary tarballs on the Lucene website should correspond to a precise svn version

It does correspond to a precise SVN version, but what we prefer is a tag. The tag for 1.9-final is: http://svn.apache.org/repos/asf/lucene/java/tags/lucene_1_9_final/ Tags should never be revised. If you're paranoid, you could also note the revision of this tag, which is 381437.

> and that that version's common-build.xml should reflect the release number, i.e. 1.9 or
> 1.9-final in its version property.

Here's where things are murky. I don't like it if what folks build themselves is easily confused with a released build. So the version in common-build.xml should not be 1.9 or 1.9-final but something else. But what? 1.9-dev? 1.9.1-dev? 1.9.1-rc1-dev? Classically we've used the -dev suffix to distinguish non-release builds. Can you suggest something better?

> If the 1.9 branch were to be modified for making, say, a 1.9.1 release, the first change
> should be in common-build.xml for the version to say something like 1.9.1-rc1-dev.

Right.

> What is the use case here?

The use case is folks downloading the source, tweaking things, and building things themselves. If they're not going to tweak things, there's no need to build things themselves. This is not like C code, where folks frequently need to build releases on different platforms. The predominant reason folks build Lucene is that they want to tweak it. And once it is tweaked, it is no longer a released version.

> It is no problem for me to patch common-build.xml for the 1.9-final release of PyLucene
> from Java Lucene, made from the lucene_1_9 branch, and then revert back to the trunk,
> without the patch, for the future ongoing 2.0 dev builds and PyLucene releases.

There's no need to patch: you can simply run 'ant -Dversion=XXX'.

Doug
Re: Index compatibility question
On Wednesday, 01 March 2006 16:21, DM Smith wrote:
> I find that 1.9 reads my 1.4.3-built indexes just fine. But not the other way around.

That's exactly how it is supposed to be: a newer Lucene can read indexes created by an older one, but not vice versa.

Regards
Daniel

--
http://www.danielnaber.de
[jira] Updated: (LUCENE-505) MultiReader.norm() takes up too much memory: norms byte[] should be made into an Object
[ http://issues.apache.org/jira/browse/LUCENE-505?page=all ]

Steven Tamm updated LUCENE-505:

Attachment: NormFactors.patch

Sorry, I didn't remove whitespace in the previous patch. This one's easier to read. svn diff --diff-cmd diff -x "-b -u" works better than svn diff --diff-cmd diff -x -b -x -u.

MultiReader.norm() takes up too much memory: norms byte[] should be made into an Object
----------------------------------------------------------------------------------------

Key: LUCENE-505
URL: http://issues.apache.org/jira/browse/LUCENE-505
Project: Lucene - Java
Type: Improvement
Components: Index
Versions: 1.9
Environment: Patch is against Lucene 1.9 trunk (as of Mar 1 06)
Reporter: Steven Tamm
Attachments: NormFactors.patch, NormFactors.patch

MultiReader.norms() is very inefficient: it has to construct a byte array that's as long as all the documents in every segment. This doubles the memory requirement for scoring MultiReaders vs. SegmentReaders. Although this array is cached, it's still a baseline of memory that is unnecessary.

The problem is that the normalization factors are passed around as a byte[]. If that were instead replaced with an Object, you could perform a whole host of optimizations:

a. When reading, you wouldn't have to construct a fakeNorms array of all 1.0fs. You could instead return a singleton object that would just return 1.0f.
b. MultiReader could use an object that delegates to the NormFactors of the subreaders.
c. You could write an implementation that uses mmap to access the norm factors. Or, if the index isn't long-lived, you could use an implementation that reads directly from disk.

The patch provided here replaces the use of byte[] with a new abstract class called NormFactors. NormFactors has two methods on it:

    public abstract byte getByte(int doc) throws IOException; // Returns the byte[doc]
    public float getFactor(int doc) throws IOException;       // Calls Similarity.decodeNorm(getByte(doc))

There are four implementations of this abstract class:

1. NormFactors.EmptyNormFactors - replaces the fakeNorms with a singleton that only returns 1.0.
2. NormFactors.ByteNormFactors - converts a byte[] to a NormFactors for backwards compatibility in constructors.
3. MultiNormFactors - multiplexes the NormFactors in MultiReader to prevent the need to construct the gigantic norms array.
4. SegmentReader.Norm - the same class, but now extends NormFactors to provide the same access.

In addition, many of the Query and Scorer classes were changed to pass around NormFactors instead of byte[], and to call getFactor() instead of indexing into the byte[]. I have kept IndexReader.norms(String) around for backwards compatibility, but marked it as deprecated. I believe that the use of ByteNormFactors in IndexReader.getNormFactors() will keep backward compatibility with other IndexReader implementations, but I don't know how to test that.
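To make the proposed abstraction concrete, here is a sketch of NormFactors as described in the issue text (the method names come from the description; the attached NormFactors.patch is authoritative, and details such as the EmptyNormFactors singleton may differ):

    import java.io.IOException;

    import org.apache.lucene.search.Similarity;

    public abstract class NormFactors {
      /** The raw norm byte for a document, i.e. what norms(field)[doc] holds today. */
      public abstract byte getByte(int doc) throws IOException;

      /** The decoded factor: Similarity.decodeNorm(getByte(doc)). */
      public float getFactor(int doc) throws IOException {
        return Similarity.decodeNorm(getByte(doc));
      }

      /** Singleton that replaces the fakeNorms array of all 1.0f values. */
      public static final NormFactors EMPTY = new NormFactors() {
        public byte getByte(int doc) {
          return Similarity.encodeNorm(1.0f); // every document gets factor 1.0
        }
      };
    }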
[jira] Commented: (LUCENE-505) MultiReader.norm() takes up too much memory: norms byte[] should be made into an Object
[ http://issues.apache.org/jira/browse/LUCENE-505?page=comments#action_12368389 ]

Steven Tamm commented on LUCENE-505:

> I also worry about performance with this change. Have you benchmarked this while searching
> large indexes?

Yes, see below.

> For example, in TermScorer.score(HitCollector, int), Lucene's innermost loop, you change two
> array accesses into a call to an interface. That could make a substantial difference. Small
> changes to that method can cause significant performance changes.

Specifically, I have changed two byte array references (one of which is static) to a method call on an abstract class. I'm using JDK 1.5.0_06. HotSpot inlines both calls, and performance was about the same with a 1M-doc index (we have a low term/doc ratio, so we have about 8.5M terms). HPROF doesn't even see the call to Similarity.decodeNorm. If I were using JDK 1.3 I'd probably agree with you, but HotSpot is very good at figuring this stuff out and auto-inlining the calls.

As for the numbers: an average request returning 5000 hits from our 0.5GB index averaged ~485ms on my box before; it's now at ~480ms (50 runs each). Most of that is overhead, granted. The increase in performance may be obscured by my other change in TermScorer (LUCENE-502). I'm not sure of the history of TermScorer, but it seems heavily optimized for a large number of terms per document; we have a low number of terms per document, so performance suffers greatly. Performance was dramatically improved by not unnecessarily caching things. TermScorer seems heavily optimized for a non-modern VM (inlining next() into score(), caching the result of Math.sqrt for each term being queried, having a doc/freq cache that provides no benefit unless iterating backwards, etc.). The total of the TermScorer changes brought the average down from ~580ms.

Since we use a lot of large indexes and don't keep them in memory all that often, our performance increases dramatically due to the reduction in GC overhead. As we move to not actually storing the norms array in memory but instead reading it from disk, this change will have an even higher benefit. I'm in the process of preparing a set of patches that will help people who don't have long-lived indexes, and this is just one part.
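The MultiNormFactors idea from the description can be sketched as follows: delegate each lookup to the owning segment's NormFactors rather than concatenating all segments' norms into one large byte[]. This builds on the NormFactors sketch above; the field names and readerIndex helper are assumptions (MultiReader uses the same starts-array scheme for doc ids):

    import java.io.IOException;

    class MultiNormFactors extends NormFactors {
      private final NormFactors[] subFactors; // one per subreader
      private final int[] starts;             // first doc id of each subreader; starts[0] == 0

      MultiNormFactors(NormFactors[] subFactors, int[] starts) {
        this.subFactors = subFactors;
        this.starts = starts;
      }

      public byte getByte(int doc) throws IOException {
        int i = readerIndex(doc); // which subreader owns this doc id
        return subFactors[i].getByte(doc - starts[i]); // rebase to that segment's doc space
      }

      // Largest i such that starts[i] <= doc, by binary search.
      private int readerIndex(int doc) {
        int lo = 0, hi = subFactors.length - 1;
        while (lo < hi) {
          int mid = (lo + hi + 1) >>> 1;
          if (starts[mid] <= doc) {
            lo = mid;
          } else {
            hi = mid - 1;
          }
        }
        return lo;
      }
    }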
[jira] Created: (LUCENE-506) Optimize Memory Use for Short-Lived Indexes (Do not load TermInfoIndex if you know the queries ahead of time)
Optimize Memory Use for Short-Lived Indexes (Do not load TermInfoIndex if you know the queries ahead of time)
-------------------------------------------------------------------------------------------------------------

Key: LUCENE-506
URL: http://issues.apache.org/jira/browse/LUCENE-506
Project: Lucene - Java
Type: Improvement
Components: Index
Versions: 2.0
Environment: Patch against Lucene 1.9 trunk as of Mar 1 06
Reporter: Steven Tamm

Summary: Provide a way to avoid loading the TermInfoIndex into memory if you know all the terms you are ever going to query.

In our search environment, we have a large number of indexes (many thousands), any of which may be queried by any number of hosts. These indexes may be very large (~1M documents), but since we have a low term/doc ratio, we have 7-11M terms. With an index interval of 128, that means ~70-90K indexed terms. On loading the index, Lucene instantiates a Term, a TermInfo, a String, and a char[] for each of them. When the index is long-lived, this makes some sense, because you can quickly search the list of terms using binary search. However, since we throw away the indexes very often, a lot of garbage is created per query. Here's an example where we load a large index 10 times; this corresponds to 7MB of garbage per query:

             percent          live           alloc'ed     stack  class
    rank   self  accum     bytes   objs     bytes    objs trace  name
       1  4.48%  4.48%   4678736 128946  23393680 644730 387749  char[]
       3  3.95% 12.61%   4126272 128946  20631360 644730 387751  org.apache.lucene.index.TermInfo
       6  2.96% 22.71%   3094704 128946  15473520 644730 387748  java.lang.String
       8  1.98% 26.97%   2063136 128946  10315680 644730 387750  org.apache.lucene.index.Term

This adds up after a while. Since we know exactly which Terms we're going to search for before even opening the index, there's no need to allocate this much memory. Upon opening the index, we can instead go through the TII in sequential order, retrieve only the entries for those terms from the main term dictionary, and reduce the storage requirements dramatically. This reduces the amount of garbage generated by querying by about 60% if you only make one query per index, with a 77% increase in throughput.

This is accomplished by factoring the index-loading aspects of TermInfosReader out into a new file, SegmentTermInfosReader. TermInfosReader becomes a base class that provides access to terms. A new class, PrefetchedTermInfosReader, will, upon startup, sort the passed-in terms and retrieve the IndexEntries for those terms. IndexReader and SegmentReader are modified with new constructor methods that take a Collection of Terms corresponding to the total set of terms that will ever be searched in the life of the index.

In order to support the skipping behavior, some changes need to be made to SegmentTermEnum: specifically, we need to be able to go back one entry in order to retrieve the previous TermInfo and IndexPointer. This is because, unlike the normal case, with the index we want to return the value right before the intended term (so that we can be positioned behind the desired term in the main dictionary). For example, if we're looking for "apple" in the index, and the two adjacent values are "abba" and "argon", we want to return "abba" instead of "argon". That way we won't miss any terms in the real index. This code is confusing; it should probably be moved to a subclass of TermBuffer, but that would require more code. Not wanting to modify TermBuffer, to keep it small, also led to the odd NPE catch in SegmentTermEnum.java.

Sticklers for contracts may want to rename SegmentTermEnum.skipTo() to a different name because it implements a different contract, but it would be useful for anyone trying to skip around in the TII, so I figured it was the right thing to do.
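To illustrate how the proposed API would be used, here is a hypothetical sketch (the IndexReader.open(Directory, Collection) overload exists only in the attached patch, not in stock Lucene, and the index path is made up):

    import java.util.ArrayList;
    import java.util.Collection;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class PrefetchDemo {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.getDirectory("/path/to/index", false);

        // The complete set of terms that will ever be queried against this reader.
        Collection terms = new ArrayList();
        terms.add(new Term("body", "apple"));
        terms.add(new Term("body", "argon"));

        // Proposed overload: skip loading the full TermInfoIndex and prefetch
        // only the dictionary entries for the terms above.
        IndexReader reader = IndexReader.open(dir, terms);
        // ... run the known queries, then discard the short-lived reader ...
        reader.close();
      }
    }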
[jira] Updated: (LUCENE-506) Optimize Memory Use for Short-Lived Indexes (Do not load TermInfoIndex if you know the queries ahead of time)
[ http://issues.apache.org/jira/browse/LUCENE-506?page=all ]

Steven Tamm updated LUCENE-506:

Attachment: Prefetching.patch

This also includes two additional test cases. The public exposure to the prefetching is controlled solely by IndexReader.open(Directory, Collection). If you try to query a term that wasn't included,

This also includes some wildcard handling. I'm not sure it's absolutely necessary for WildcardTermEnum or FuzzyTermEnum; you can probably remove the entire if (entry == null) block of PrefetchedTermInfosReader.seekEnum. But this provides more flexibility.
[jira] Created: (LUCENE-508) SegmentTermEnum.next() doesn't maintain prevBuffer at end
SegmentTermEnum.next() doesn't maintain prevBuffer at end
---------------------------------------------------------

Key: LUCENE-508
URL: http://issues.apache.org/jira/browse/LUCENE-508
Project: Lucene - Java
Type: Bug
Components: Index
Versions: 1.9, 2.0
Environment: Lucene Trunk
Reporter: Steven Tamm

When you're iterating a SegmentTermEnum and you go past the end, you end up in a state where the next term is null and prevBuffer holds the penultimate term, not the last term. This patch fixes it. (It's also required for my prefetching issue, LUCENE-506.)

Index: java/org/apache/lucene/index/SegmentTermEnum.java
===================================================================
--- java/org/apache/lucene/index/SegmentTermEnum.java (revision 382121)
+++ java/org/apache/lucene/index/SegmentTermEnum.java (working copy)
@@ -109,6 +109,7 @@
   /** Increments the enumeration to the next element. True if one exists.*/
   public final boolean next() throws IOException {
     if (position++ >= size - 1) {
+      prevBuffer.set(termBuffer);
       termBuffer.reset();
       return false;
     }