[jira] Assigned: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned LUCENE-2215:

Assignee: Grant Ingersoll

paging collector
Key: LUCENE-2215 URL: https://issues.apache.org/jira/browse/LUCENE-2215 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.4, 3.0 Reporter: Adam Heinz Assignee: Grant Ingersoll Priority: Minor Attachments: IterablePaging.java, PagingCollector.java, TestingPagingCollector.java

http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898 Somebody assign this to Aaron McCurry and we'll see if we can get enough votes on this issue to convince him to upload his patch. :)

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2227) separate chararrayset interface from impl
separate chararrayset interface from impl
Key: LUCENE-2227 URL: https://issues.apache.org/jira/browse/LUCENE-2227 Project: Lucene - Java Issue Type: Task Components: Analysis Affects Versions: 3.0 Reporter: Robert Muir Priority: Minor

CharArraySet should be abstract; the hashing implementation currently being used should instead be called CharArrayHashSet. Currently our 'CharArrayHashSet' is hardcoded across Lucene, but others might want their own impl. For example, implementing CharArraySet as a DFA with org.apache.lucene.util.automaton gives faster contains(char[], int, int) performance, as it can do a 'fast fail' and need not hash the entire string. This is useful as it speeds up indexing in StopFilter. I did not think this would be faster, but I did benchmarks over and over with the Reuters corpus, and it is, even with English text's weird average word length of 5.
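To make the proposed split concrete, here is a minimal sketch, with the class names taken from the issue but the method shapes assumed from Lucene's existing CharArraySet rather than any actual patch:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: CharArraySet as an abstract type, with the current hash-based
// implementation as one concrete subclass (names per the issue; details assumed).
abstract class CharArraySetSketch {
    // The key operation: a membership test over a char range, avoiding String allocation
    // in the real implementation. A DFA-backed subclass could fail fast on the first
    // unmatched character instead of hashing the entire range.
    public abstract boolean contains(char[] text, int off, int len);

    static class CharArrayHashSet extends CharArraySetSketch {
        private final Set<String> words = new HashSet<>();

        public void add(String word) { words.add(word); }

        @Override
        public boolean contains(char[] text, int off, int len) {
            // Simplified: Lucene's real impl hashes the char range directly
            // to avoid constructing a String per lookup.
            return words.contains(new String(text, off, len));
        }
    }
}
```

With the abstraction in place, StopFilter could accept any CharArraySetSketch, letting users swap in an automaton-backed implementation.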
[jira] Commented: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802235#action_12802235 ] Renaud Delbru commented on LUCENE-1410:

On another aspect, why is PFOR/FOR encoding the number of compressed integers into the block header, since this information is already stored in the stream header (block size information written in FixedIntBlockIndexOutput#init())? Is there a particular use case for that?

PFOR implementation
Key: LUCENE-1410 URL: https://issues.apache.org/jira/browse/LUCENE-1410 Project: Lucene - Java Issue Type: New Feature Components: Other Reporter: Paul Elschot Priority: Minor Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, TestPFor2.java Original Estimate: 21840h Remaining Estimate: 21840h

Implementation of Patched Frame of Reference.
[jira] Issue Comment Edited: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802235#action_12802235 ] Renaud Delbru edited comment on LUCENE-1410 at 1/19/10 1:10 PM:

On another aspect, why is PFOR/FOR encoding the number of compressed integers into the block header, since this information is already stored in the stream header (block size information written in FixedIntBlockIndexOutput#init())? Is there a particular use case for that? Is it for the special case when a block is complete (when the block encodes the remaining integers of the list)?

was (Author: renaud.delbru): On another aspect, why is PFOR/FOR encoding the number of compressed integers into the block header, since this information is already stored in the stream header (block size information written in FixedIntBlockIndexOutput#init())? Is there a particular use case for that?

PFOR implementation
Key: LUCENE-1410 URL: https://issues.apache.org/jira/browse/LUCENE-1410 Project: Lucene - Java Issue Type: New Feature Components: Other Reporter: Paul Elschot Priority: Minor Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, TestPFor2.java Original Estimate: 21840h Remaining Estimate: 21840h

Implementation of Patched Frame of Reference.
[jira] Updated: (LUCENE-2111) Wrapup flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2111:

Attachment: LUCENE-2111.patch

Attached patch w/ various fixes:
- Switch payloads over to use BytesRef in the flex API
- DocsEnum.positions now returns null if no positions were indexed (ie omitTFAP was set for the field). Also fixed Phrase/SpanQuery to throw IllegalStateException when run against an omitTFAP field.
- Rename PositionsConsumer.addPosition -> .add

Wrapup flexible indexing
Key: LUCENE-2111 URL: https://issues.apache.org/jira/browse/LUCENE-2111 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: Flex Branch Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch

Spinoff from LUCENE-1458. The flex branch is in fairly good shape -- all tests pass, initial search performance testing looks good, it survived several visits from the Unicode policeman ;) But it still has a number of nocommits, could use some more scrutiny, especially on the emulate-old-API-on-flex-index (and vice versa) code paths, and still needs some more performance testing. I'll do these under this issue, and we should open separate issues for other self-contained fixes. The end is in sight!
[jira] Commented: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802276#action_12802276 ] Adam Heinz commented on LUCENE-2215:

Awesome, thanks! I'll schedule some time in the coming week to patch our dev installation and sic some QA guys on it.

paging collector
Key: LUCENE-2215 URL: https://issues.apache.org/jira/browse/LUCENE-2215 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.4, 3.0 Reporter: Adam Heinz Assignee: Grant Ingersoll Priority: Minor Attachments: IterablePaging.java, PagingCollector.java, TestingPagingCollector.java

http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898 Somebody assign this to Aaron McCurry and we'll see if we can get enough votes on this issue to convince him to upload his patch. :)
[jira] Updated: (LUCENE-2213) Small improvements to ArrayUtil.getNextSize
[ https://issues.apache.org/jira/browse/LUCENE-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2213:

Attachment: LUCENE-2213.patch

New patch, just renaming to ArrayUtil.oversize.

Small improvements to ArrayUtil.getNextSize
Key: LUCENE-2213 URL: https://issues.apache.org/jira/browse/LUCENE-2213 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.1 Attachments: LUCENE-2213.patch, LUCENE-2213.patch, LUCENE-2213.patch, LUCENE-2213.patch

Spinoff from java-dev thread "Dynamic array reallocation algorithms" started on Jan 12, 2010. Here's what I did:
* Keep the +3 for small sizes
* Added 2nd arg = number of bytes per element
* Round up to 4 or 8 byte boundary (if it's 32 or 64 bit JRE respectively)
* Still grow by 1/8th
* If 0 is passed in, return 0 back

I also had to remove some asserts in tests that were checking the actual values returned by this method -- I don't think we should test that (it's an impl. detail).
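The growth policy in the bullet list above can be sketched as follows. This is an illustrative reconstruction from the description, assuming a 64-bit JRE (the real ArrayUtil also detects 32-bit JREs and rounds to 4-byte boundaries there):

```java
// Sketch of the described ArrayUtil.oversize policy (not the committed code).
final class OversizeSketch {
    static int oversize(int minTargetSize, int bytesPerElement) {
        if (minTargetSize == 0) return 0;  // if 0 is passed in, return 0 back
        int extra = minTargetSize >> 3;    // still grow by 1/8th
        if (extra < 3) extra = 3;          // keep the +3 for small sizes
        int newSize = minTargetSize + extra;
        // Round the resulting allocation up to an 8-byte boundary.
        switch (bytesPerElement) {
            case 4:  return (newSize + 1) & 0x7ffffffe; // 2 ints per 8 bytes
            case 2:  return (newSize + 3) & 0x7ffffffc; // 4 shorts per 8 bytes
            case 1:  return (newSize + 7) & 0x7ffffff8; // 8 bytes per 8 bytes
            default: return newSize;                    // 8 bytes/element: aligned
        }
    }
}
```

For example, growing a long[] to hold at least 1 element returns 4 (1 + the minimum 3 extra), while byte arrays come back rounded to multiples of 8.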
[jira] Commented: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802335#action_12802335 ] Paul Elschot commented on LUCENE-1410:

The only reason why the number of compressed integers is encoded in the block header here is that when I coded it I did not know that this was not necessary in Lucene indexes. That also means that the header can be used for different compression methods, for example in the following way, cases encoded in the 1st byte:
- 32 FrameOfRef cases (#frameBits) followed by 3 bytes for #exceptions (0 for BITS, 0 for PFOR)
- 16-64 cases for a SimpleNN variant
- 1-8 cases for run length encoding (for example followed by 3 bytes for length and value)

Total #cases is 49-104, or 6-7 bits. Run length encoding is good for terms that occur in every document and for the frequencies of primary keys. The only concern I have is that the instruction cache might get filled up with the code for all these decoding cases. At the moment I don't know how to deal with that other than by adding such cases slowly while doing performance tests all the time.

PFOR implementation
Key: LUCENE-1410 URL: https://issues.apache.org/jira/browse/LUCENE-1410 Project: Lucene - Java Issue Type: New Feature Components: Other Reporter: Paul Elschot Priority: Minor Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, TestPFor2.java Original Estimate: 21840h Remaining Estimate: 21840h

Implementation of Patched Frame of Reference.
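For readers unfamiliar with the technique being discussed, here is a minimal frame-of-reference (FOR) codec sketch: store the block minimum and a bit width in a small header, then pack the offsets (value - min) at that width. This is purely illustrative, not the LUCENE-1410 patch code, and it omits the patched-exception handling that distinguishes PFOR from plain FOR:

```java
import java.util.Arrays;

// Illustrative frame-of-reference codec: header = {min, bitsPerValue},
// body = bit-packed (value - min) deltas in a long[].
final class ForCodec {
    static long[] encode(long[] values) {
        long min = Arrays.stream(values).min().orElse(0);
        long max = Arrays.stream(values).max().orElse(0);
        int bits = Math.max(1, 64 - Long.numberOfLeadingZeros(max - min));
        long[] out = new long[2 + (values.length * bits + 63) / 64];
        out[0] = min;
        out[1] = bits;
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bits;
            int block = 2 + (int) (bitPos >>> 6), off = (int) (bitPos & 63);
            long v = values[i] - min;
            out[block] |= v << off;
            if (off + bits > 64) out[block + 1] |= v >>> (64 - off); // spill into next word
        }
        return out;
    }

    static long[] decode(long[] packed, int count) {
        long min = packed[0];
        int bits = (int) packed[1];
        long mask = bits == 64 ? -1L : (1L << bits) - 1;
        long[] out = new long[count];
        for (int i = 0; i < count; i++) {
            long bitPos = (long) i * bits;
            int block = 2 + (int) (bitPos >>> 6), off = (int) (bitPos & 63);
            long v = packed[block] >>> off;
            if (off + bits > 64) v |= packed[block + 1] << (64 - off);
            out[i] = (v & mask) + min;
        }
        return out;
    }
}
```

In the scheme Paul describes, the first header byte would select which of these decoding routines (FOR, SimpleNN, run-length) applies to the block.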
Re: Lucene memory consumption
Hello Frederic, I'm CCing java-dev@lucene.apache.org as Michael McCandless has been very helpful on IRC in discussing the ThreadLocal implication, and it would be nice if you could provide first-hand information. There's a good read to start from at http://issues.apache.org/jira/browse/LUCENE-1383

Basically your proposal has a problem: when you close the ThreadLocal it's only going to clean up the resources stored by the current thread, not by others; setting the reference to null also won't help. Quoting the ThreadLocal source's comment:

* However, since reference queues are not * used, stale entries are guaranteed to be removed only when * the table starts running out of space.

About your issues:

1. A ThreadLocal object should normally be a singleton used as a key to the thread map. Here it is repeatedly created and destroyed!

It's only built in the constructor, and destroyed on close. So its lifecycle is linked to the Analyzer / FieldCache using it: probably a long time, or the appropriate time to clean things up.

2. Setting t = null; is not affecting the garbage collection of the ThreadLocal map since t is the key (hard ref) of the thread map.

Well, t is unfortunately being reused as a variable name: t = null; is clearing the reference to the ThreadLocal, which really is the key of the map used by the ThreadLocal and referenced by the current Thread instance, and ThreadLocal uses weak *keys*, not values (and the key is the ThreadLocal itself).

3. There are no calls to t.remove() which will really clean the Map entry.

You could add one, but it would only clean up the garbage from the current thread, so it's ok but not enough. The current impl is making sure all stuff is collected by wrapping it all in weak values. Actually some stuff is not collected: the WeakReferences themselves, but they point to going-to-be-collected stuff. These WeakReferences are going to be removed when the ThreadLocal table is full, and should be harmless (?).
As you pointed out, since Lucene 3 it releases what is possible to release eagerly, but it's a very slight optimization: you still need the weak/hard-ref trick to clean the other values.

4. A ThreadLocal Map is already a WeakReference for the value.

No, it's on the keys: a collected ThreadLocal will be cleaned up for... eventually :-/

5. Leaving objects on a ThreadLocal after it is out of your control is bad practice. Another task may reuse the Thread and find dirty objects there.

Agreed, but with weak values it's not a big issue. Also it's not meant to be used by the faint-hearted; just people writing their own Analyzer could get this wrong :)

6. We found (in all our tests) the hardRef Map to be completely unnecessary in Lucene 2.4.1, but here I'm lacking more in-depth knowledge of the lifecycle of the objects added to this CloseableThreadLocal.

Well, as it's being used as a cache, functionality will be the same; performance should be worse. AFAIK all TokenFilters are able to rebuild what they need when get() returns null; you might have a problem in the unlikely case of the assertion at org.apache.lucene.util.CloseableThreadLocal:68 failing, but again not affecting functionality (assuming assertions are disabled). A vanilla ThreadLocal is obviously faster than this, but then we end up reverting LUCENE-1383 and so introducing more pressure on the GC. It would be very interesting to find out why your implementation is performing better. Maybe because in your case Analyzers are used by one thread at a time, and so you're not leaking memory? Could you tell more about this to lucene-dev directly? Regards, Sanne

2010/1/6 Frederic Simon fr...@jfrog.org: Thanks Emmanuel, Yes the main issue is that the hardRef map in this class was forcing all the objects to go to the Old generation space in the JVM GC, instead of staying at a ThreadLocal level. So, all objects put in the CloseableThreadLocal were GC'd only on full GC.
On heavy Lucene usage, it generated around 500Mb of heap each 5 secs until full GC kicked in. Our problem is that we rely a lot on SoftReference for our cache and so this Lucene behavior is really bad for us (customer feedback: http://old.nabble.com/What's-the-memory-requirements-for-2.1.3--to27026622.html#a27026622 ). With my class all objects stay in young gen and so the performance boost for us was huge. The issues with the class: A ThreadLocal object should normally be a singleton used as a key to the thread map. Here it is repeatedly created and destroyed! Setting t = null; is not affecting the garbage collection of the ThreadLocal map since t is the key (hard ref) of the thread map. There are no calls to t.remove() which will really clean the Map entry. A ThreadLocal Map is already a WeakReference for the value. Leaving objects on a ThreadLocal after it is out of your control is bad practice. Another task may reuse the Thread and find dirty objects there. We found (in all our tests) the hardRef Map to
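The weak/hard-ref pattern Sanne describes can be sketched as below. This is a simplified reconstruction of the idea from the thread, not the actual org.apache.lucene.util.CloseableThreadLocal source, which differs in detail:

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch: threads only ever see a WeakReference, so once close() drops the
// hard refs, all values become collectible without each thread cooperating.
class CloseableThreadLocalSketch<T> {
    private ThreadLocal<WeakReference<T>> t = new ThreadLocal<>();
    // Hard refs keyed weakly by Thread: a dead thread releases its value.
    private Map<Thread, T> hardRefs = new WeakHashMap<>();

    public synchronized T get() {
        if (t == null) return null;              // closed
        WeakReference<T> ref = t.get();
        return ref == null ? null : ref.get();
    }

    public synchronized void set(T value) {
        t.set(new WeakReference<>(value));
        hardRefs.put(Thread.currentThread(), value);
    }

    public synchronized void close() {
        hardRefs = null;       // drop all hard refs; weak refs alone can't keep values alive
        if (t != null) t.remove(); // eager cleanup, but only for the current thread (point 3)
        t = null;
    }
}
```

Note how close() makes every thread's value collectible at once via the hard-ref map, while t.remove() alone would only have cleaned the closing thread's entry.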
[jira] Commented: (LUCENE-2217) SortedVIntList allocation should use ArrayUtils.getNextSize()
[ https://issues.apache.org/jira/browse/LUCENE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802449#action_12802449 ] Paul Elschot commented on LUCENE-2217:

Btw. shouldn't IndexInput.bytes also be reallocated using ArrayUtils.getNextSize()? The growth factor there is a hardcoded 1.25.

SortedVIntList allocation should use ArrayUtils.getNextSize()
Key: LUCENE-2217 URL: https://issues.apache.org/jira/browse/LUCENE-2217 Project: Lucene - Java Issue Type: Improvement Components: Other Reporter: Paul Elschot Assignee: Michael McCandless Priority: Trivial Attachments: LUCENE-2217.patch, LUCENE-2217.patch

See recent discussion on ArrayUtils.getNextSize().
[jira] Commented: (LUCENE-2217) SortedVIntList allocation should use ArrayUtils.getNextSize()
[ https://issues.apache.org/jira/browse/LUCENE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802456#action_12802456 ] Michael McCandless commented on LUCENE-2217:

bq. Btw. shouldn't IndexInput.bytes also be reallocated using ArrayUtils.getNextSize()

+1 Wanna fold it into this patch? (And any others you find..?)

SortedVIntList allocation should use ArrayUtils.getNextSize()
Key: LUCENE-2217 URL: https://issues.apache.org/jira/browse/LUCENE-2217 Project: Lucene - Java Issue Type: Improvement Components: Other Reporter: Paul Elschot Assignee: Michael McCandless Priority: Trivial Attachments: LUCENE-2217.patch, LUCENE-2217.patch

See recent discussion on ArrayUtils.getNextSize().
[jira] Commented: (LUCENE-2217) SortedVIntList allocation should use ArrayUtils.getNextSize()
[ https://issues.apache.org/jira/browse/LUCENE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802510#action_12802510 ] Paul Elschot commented on LUCENE-2217:

Well, it's not that I'm searching, but I'll provide another patch that includes IndexInput for this. Would you have any idea about testcases for that? :)

SortedVIntList allocation should use ArrayUtils.getNextSize()
Key: LUCENE-2217 URL: https://issues.apache.org/jira/browse/LUCENE-2217 Project: Lucene - Java Issue Type: Improvement Components: Other Reporter: Paul Elschot Assignee: Michael McCandless Priority: Trivial Attachments: LUCENE-2217.patch, LUCENE-2217.patch

See recent discussion on ArrayUtils.getNextSize().
[jira] Commented: (LUCENE-2205) Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.
[ https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802535#action_12802535 ] Deepak commented on LUCENE-2205:

Hi Aaron, I hope you will be able to post the files today. Regards, D

Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.
Key: LUCENE-2205 URL: https://issues.apache.org/jira/browse/LUCENE-2205 Project: Lucene - Java Issue Type: Improvement Environment: Java5 Reporter: Aaron McCurry Attachments: patch-final.txt, RandomAccessTest.java, rawoutput.txt

Basically packing those three arrays into a byte array with an int array as an index offset. The performance benefits are staggering: on my test index (of size 6.2 GB, with ~1,000,000 documents and ~175,000,000 terms), the memory needed to load the terminfos into memory was reduced to 17% of their original size, from 291.5 MB to 49.7 MB. The random access speed has been made better by 1-2%, load time of the segments is ~40% faster as well, and full GCs on my JVM were made 7 times faster. I have already performed the work and am offering this code as a patch. Currently all tests in the trunk pass with this new code enabled. I did write a system property switch to allow for the original implementation to be used as well: -Dorg.apache.lucene.index.TermInfosReader=default or small. I have also written a blog about this patch; here is the link: http://www.nearinfinity.com/blogs/aaron_mccurry/my_first_lucene_patch.html
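The core idea of the issue, replacing parallel object arrays with one packed byte array indexed by an int[] of offsets, can be sketched as below. The class and method names here are hypothetical; the actual patch packs term bytes and term-info fields with more structure:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: variable-length records packed into a single byte[], with
// offsets[i] marking where record i starts (offsets[n] = data.length).
// One large array + one int[] is far cheaper for the GC than n small objects.
final class PackedRecords {
    private final byte[] data;
    private final int[] offsets;

    PackedRecords(List<byte[]> records) {
        offsets = new int[records.size() + 1];
        int total = 0;
        for (int i = 0; i < records.size(); i++) {
            offsets[i] = total;
            total += records.get(i).length;
        }
        offsets[records.size()] = total;
        data = new byte[total];
        for (int i = 0; i < records.size(); i++) {
            System.arraycopy(records.get(i), 0, data, offsets[i], records.get(i).length);
        }
    }

    byte[] get(int i) {
        return Arrays.copyOfRange(data, offsets[i], offsets[i + 1]);
    }
}
```

The GC win in the issue comes exactly from this shape: full GC walks two arrays instead of hundreds of millions of Term/TermInfo objects.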
[jira] Commented: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802542#action_12802542 ] Toke Eskildsen commented on LUCENE-1990:

Introducing yet another level of indirection and making a byte/short/int/long-provider detached from the implementation of the packed values is tempting. I'm fairly afraid of the overhead of the extra method calls, but I'll try it and see what happens. I've read your (Michael McCandless) code and I can see that the tiny interfaces for Reader and Writer work well for your scenario. However, as the Reader must have (fast) random access, wouldn't it make sense to make it possible to update values? That way the same code can be used to hold ords for sorting and similar structures. Instead of Reader, we could use {code} abstract class Mutator { public abstract long get(int index); public abstract long set(int index, long value); } {code} ...should the index also be a long? No need to be bound by Java's 31-bit limit on array length, although I might very well be over-engineering here. The whole 32-bit vs. 64-bit backing array does present a bit of a problem with persistence. We'll be in a situation where the index will be optimized for the architecture used for building, not the one used for searching. Leaving the option of a future mmap open means that it is not possible to do a conversion when retrieving the bits, so I have no solution for this (other than doing memory-only).

Add unsigned packed int impls in oal.util
Key: LUCENE-1990 URL: https://issues.apache.org/jira/browse/LUCENE-1990 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Priority: Minor Attachments: LUCENE-1990.patch, LUCENE-1990_PerformanceMeasurements20100104.zip

There are various places in Lucene that could take advantage of an efficient packed unsigned int/long impl.
EG the terms dict index in the standard codec in LUCENE-1458 could substantially reduce its RAM usage. FieldCache.StringIndex could as well. And I think load-into-RAM codecs like the one in TestExternalCodecs could use this too. I'm picturing something very basic like: {code} interface PackedUnsignedLongs { long get(long index); void set(long index, long value); } {code} Plus maybe an iterator for getting and maybe also for setting. If it helps, most of the usages of this inside Lucene will be write-once, so eg the set could make that an assumption/requirement. And a factory somewhere: {code} PackedUnsignedLongs create(int count, long maxValue); {code} I think we should simply autogen the code (we can start from the autogen code in LUCENE-1410), or, if there is a good existing impl that has a compatible license, that'd be great. I don't have time near-term to do this... so if anyone has the itch, please jump!
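To make the proposed interface concrete, here is a minimal mutable packed-values sketch with a long[] backing array, bitsPerValue bits per entry, and values allowed to straddle word boundaries. It is an illustration of the technique under discussion, not any attached patch (and it uses an int index for brevity, sidestepping Toke's long-index question):

```java
// Sketch: n-bit unsigned values packed into a long[], with get and set
// (the "Mutator" shape proposed above, rather than a read-only Reader).
final class PackedSketch {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long maskRight;

    PackedSketch(int valueCount, int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        this.maskRight = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
        this.blocks = new long[(int) (((long) valueCount * bitsPerValue + 63) / 64)];
    }

    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6), offset = (int) (bitPos & 63);
        long value = blocks[block] >>> offset;
        int read = 64 - offset;
        if (read < bitsPerValue) {          // value straddles into the next word
            value |= blocks[block + 1] << read;
        }
        return value & maskRight;
    }

    void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6), offset = (int) (bitPos & 63);
        blocks[block] = (blocks[block] & ~(maskRight << offset))
                      | ((value & maskRight) << offset);
        int written = 64 - offset;
        if (written < bitsPerValue) {       // spill the high bits into the next word
            long spillMask = (1L << (bitsPerValue - written)) - 1;
            blocks[block + 1] = (blocks[block + 1] & ~spillMask)
                              | ((value & maskRight) >>> written);
        }
    }
}
```

A 64-bit-word layout like this is one of the persistence headaches Toke raises: a 32-bit-oriented builder would lay the same values out differently.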
[jira] Commented: (LUCENE-1488) multilingual analyzer based on icu
[ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802568#action_12802568 ] Vilaythong Southavilay commented on LUCENE-1488:

I am developing an IR system for Lao. I've been searching for this kind of analyzer to be used in my development to index documents containing languages like Lao, French and English in one single passage. I tested it for the Lao language for Lucene 2.9 and 3.0 using my short passage. It worked correctly for both versions as I expected, especially for segmenting Lao single syllables. I also tried it with the bi-gram filter option for two syllables, which worked fine for simple words. The result contained some two-syllable words which do not make sense in the Lao language. I guess this is not a big issue. As Robert pointed out (in an email to me), we still need dictionary-based word segmentation for Lao, which can be integrated in ICU and used by this analyzer. Anyway, thanks for your assistance. This work will be helpful not only for Lao, but for others as well, because it's good to have a common analyzer for Unicode characters. I'll continue testing it and report any problems if I find one.

multilingual analyzer based on icu
Key: LUCENE-1488 URL: https://issues.apache.org/jira/browse/LUCENE-1488 Project: Lucene - Java Issue Type: New Feature Components: contrib/analyzers Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: 3.1 Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.patch, LUCENE-1488.patch, LUCENE-1488.txt, LUCENE-1488.txt

The standard analyzer in Lucene is not exactly Unicode-friendly with regards to breaking text into words, especially with respect to non-alphabetic scripts. This is because it is unaware of Unicode bounds properties.
I actually couldn't figure out how the Thai analyzer could possibly be working until I looked at the jflex rules and saw that the codepoint range for most of the Thai block was added to the alphanum specification. Defining the exact codepoint ranges like this for every language could help with the problem, but you'd basically be reimplementing the bounds properties already stated in the Unicode standard. In general it looks like this kind of behavior is bad in Lucene even for Latin; for instance, the analyzer will break words around accent marks in decomposed form. While most Latin letter + accent combinations have composed forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, I suppose.) I've got a partially tested StandardAnalyzer that uses an ICU rule-based BreakIterator instead of jflex. Using this method you can define word boundaries according to the Unicode bounds properties. After getting it into some good shape I'd be happy to contribute it for contrib, but I wonder if there's a better solution so that out of the box Lucene will be more friendly to non-ASCII text. Unfortunately it seems jflex does not support use of these properties such as [\p{Word_Break = Extend}], so this is probably the major barrier. Thanks, Robert
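The boundary-analysis approach Robert describes can be illustrated with the JDK's own BreakIterator, which implements the same Unicode word-boundary contract; ICU's RuleBasedBreakIterator extends this idea with customizable rules (e.g. the Lao.rbbi syllable rules in the patch). A minimal word tokenizer over boundaries, as a sketch rather than the patch's analyzer:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch: segment text into words using Unicode boundary analysis instead of
// hand-written codepoint ranges. Boundaries also fall around spaces and
// punctuation, so we keep only segments containing a letter or digit.
final class WordsSketch {
    static List<String> words(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getWordInstance(locale);
        bi.setText(text);
        List<String> out = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String w = text.substring(start, end);
            if (w.codePoints().anyMatch(Character::isLetterOrDigit)) out.add(w);
        }
        return out;
    }
}
```

Because the boundary rules come from the Unicode properties rather than a per-language grammar, the same code handles Latin, Thai, Lao, etc., which is exactly the advantage over enumerating codepoint ranges in jflex.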
NRT and IndexSearcher performance
The javadocs for IndexSearcher in Lucene 3.0.0 read: "For performance reasons it is recommended to open only one IndexSearcher and use it for all of your searches." However, to use NRT, it seems I have to do this for every search, which contradicts the advice above:

IndexSearcher myIndexSearcher = new IndexSearcher(myIndexWriter.getReader());

Is there any way to take advantage of NRT and not run into these performance problems under heavy load? Is the advice from the javadoc above aimed more at IndexSearcher(org.apache.lucene.store.Directory directory)? Or is it also aimed at IndexSearcher(org.apache.lucene.index.IndexReader indexReader), which I believe I have to use to get NRT (correct me if I am wrong)?

--
View this message in context: http://old.nabble.com/NRT-and-IndexSearcher-performance-tp27235434p27235434.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
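One common answer to this question is to avoid opening a searcher per query: reopen only when the index has actually changed (e.g. on a dedicated refresh thread calling IndexWriter.getReader()), and let all query threads share the current searcher through a volatile reference. The sketch below shows the sharing pattern generically, without Lucene types, as an assumption about the intended usage rather than an official recipe (later Lucene versions ship this idea as SearcherManager):

```java
import java.util.function.Supplier;

// Sketch: share one "searcher" across query threads; swap it only on refresh.
// S would be IndexSearcher in real code; here it is generic to stay self-contained.
final class SearcherHolder<S> {
    private volatile S current;

    SearcherHolder(S initial) { current = initial; }

    // Called by many query threads: a cheap volatile read, no allocation.
    S acquire() { return current; }

    // Called by one refresh thread (or after commits): install a fresh searcher.
    void refresh(Supplier<S> reopen) { current = reopen.get(); }
}
```

With real Lucene objects you would additionally reference-count the old IndexReader and close it once in-flight searches complete; that bookkeeping is omitted here.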
[jira] Commented: (LUCENE-1488) multilingual analyzer based on icu
[ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802596#action_12802596 ] Robert Muir commented on LUCENE-1488:

Thanks for sharing those results! Yes, the bigram behavior (right now enabled for Han, Lao, Khmer, and Myanmar) is an attempt to boost relevance in a consistent way, since we do not have dictionary-based word segmentation for those writing systems, only the ability to segment into syllables. In the next patch I'll make it easier to configure this behavior, and turn it off when you want, without writing your own analyzer. I am glad to hear the syllable segmentation algorithm is working well! The credit really belongs to the Pan Localization Project; I simply implemented the algorithm described here: http://www.panl10n.net/english/final%20reports/pdf%20files/Laos/LAO06.pdf You can see the code in Lao.rbbi in the patch. Warning, as it mentions, I am pretty sure Lao numeric digits are not yet working correctly, but hopefully I will fix those too in the next version.

multilingual analyzer based on icu
Key: LUCENE-1488 URL: https://issues.apache.org/jira/browse/LUCENE-1488 Project: Lucene - Java Issue Type: New Feature Components: contrib/analyzers Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: 3.1 Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.patch, LUCENE-1488.patch, LUCENE-1488.txt, LUCENE-1488.txt

The standard analyzer in Lucene is not exactly Unicode-friendly with regards to breaking text into words, especially with respect to non-alphabetic scripts. This is because it is unaware of Unicode bounds properties. I actually couldn't figure out how the Thai analyzer could possibly be working until I looked at the jflex rules and saw that the codepoint range for most of the Thai block was added to the alphanum specification.
Defining the exact codepoint ranges like this for every language could help with the problem, but you'd basically be reimplementing the bounds properties already stated in the Unicode standard. In general it looks like this kind of behavior is bad in Lucene even for Latin; for instance, the analyzer will break words around accent marks in decomposed form. While most Latin letter + accent combinations have composed forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, I suppose.) I've got a partially tested StandardAnalyzer that uses an ICU rule-based BreakIterator instead of jflex. Using this method you can define word boundaries according to the Unicode bounds properties. After getting it into some good shape I'd be happy to contribute it for contrib, but I wonder if there's a better solution so that out of the box Lucene will be more friendly to non-ASCII text. Unfortunately it seems jflex does not support use of these properties such as [\p{Word_Break = Extend}], so this is probably the major barrier. Thanks, Robert
[jira] Updated: (LUCENE-2205) Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.
[ https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron McCurry updated LUCENE-2205: -- Attachment: TermInfosReaderIndexDefault.java TermInfosReaderIndex.java TermInfosReader.java The patch as it exists now. It no longer needs any mods to the Term.java file. Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure. --- Key: LUCENE-2205 URL: https://issues.apache.org/jira/browse/LUCENE-2205 Project: Lucene - Java Issue Type: Improvement Environment: Java5 Reporter: Aaron McCurry Attachments: patch-final.txt, RandomAccessTest.java, rawoutput.txt, TermInfosReader.java, TermInfosReaderIndex.java, TermInfosReaderIndexDefault.java Basically, this packs those three arrays into a byte array, with an int array as an index of offsets. The performance benefits are staggering on my test index (6.2 GB, with ~1,000,000 documents and ~175,000,000 terms): the memory needed to load the term infos was reduced to 17% of its original size, from 291.5 MB to 49.7 MB. Random access speed improved by 1-2%, segment load time is ~40% faster, and full GCs on my JVM were made 7 times faster. I have already performed the work and am offering this code as a patch. Currently all tests in the trunk pass with this new code enabled. I also wrote a system property switch to allow the original implementation to be used as well: -Dorg.apache.lucene.index.TermInfosReader=default or small. I have also written a blog post about this patch; here is the link: http://www.nearinfinity.com/blogs/aaron_mccurry/my_first_lucene_patch.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
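The core idea of the patch above, replacing three parallel arrays of objects with one packed byte[] plus an int[] of record offsets, can be sketched generically. This is not the actual patch code; the record layout, class, and method names below are illustrative stand-ins. The memory win comes from eliminating per-object overhead (object headers, String wrappers, array-of-reference indirection) for millions of entries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch: pack (term, docFreq, pointer) records into a single
// byte[], with an int[] recording where each record starts, instead of
// keeping three parallel object arrays in memory.
public class PackedTermIndexSketch {
    private final byte[] data;   // all records, serialized back to back
    private final int[] offsets; // offsets[i] = start of record i in data

    public PackedTermIndexSketch(String[] terms, int[] docFreqs, long[] pointers)
            throws IOException {
        offsets = new int[terms.length];
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (int i = 0; i < terms.length; i++) {
            offsets[i] = out.size();      // remember where record i begins
            out.writeUTF(terms[i]);
            out.writeInt(docFreqs[i]);
            out.writeLong(pointers[i]);
        }
        data = bytes.toByteArray();
    }

    // Random access: seek to the i-th record and decode it on demand.
    public String term(int i) throws IOException {
        DataInputStream in = new DataInputStream(
            new ByteArrayInputStream(data, offsets[i], data.length - offsets[i]));
        return in.readUTF();
    }
}
```

Decoding on access trades a little CPU per lookup for a drastically smaller, GC-friendlier footprint, which is consistent with the reported 291.5 MB to 49.7 MB reduction and faster full GCs.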
[jira] Updated: (LUCENE-2205) Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.
[ https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron McCurry updated LUCENE-2205: -- Attachment: TermInfosReaderIndexSmall.java Here's the last file. I have also back-patched 3.0.0 and 2.9.1 and placed them on my blog in case you want a drop-in replacement to try out. http://www.nearinfinity.com/blogs/aaron_mccurry/low_memory_patch_for_lucene.html Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure. --- Key: LUCENE-2205 URL: https://issues.apache.org/jira/browse/LUCENE-2205 Project: Lucene - Java Issue Type: Improvement Environment: Java5 Reporter: Aaron McCurry Attachments: patch-final.txt, RandomAccessTest.java, rawoutput.txt, TermInfosReader.java, TermInfosReaderIndex.java, TermInfosReaderIndexDefault.java, TermInfosReaderIndexSmall.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-2205) Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.
[ https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802632#action_12802632 ] Aaron McCurry edited comment on LUCENE-2205 at 1/20/10 2:57 AM: Here's the last file. I have also back-patched 3.0.0 and 2.9.1 and placed them on my blog in case you want a drop-in replacement to try out. http://www.nearinfinity.com/blogs/aaron_mccurry/low_memory_patch_for_lucene.html was (Author: amccurry): Here's the last file. I have also back patched 3.0.0 and 2.9.1 and place them on my blog incase you want to have a drop in replacement to try out. http://www.nearinfinity.com/blogs/aaron_mccurry/low_memory_patch_for_lucene.html Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure. --- Key: LUCENE-2205 URL: https://issues.apache.org/jira/browse/LUCENE-2205 Project: Lucene - Java Issue Type: Improvement Environment: Java5 Reporter: Aaron McCurry Attachments: patch-final.txt, RandomAccessTest.java, rawoutput.txt, TermInfosReader.java, TermInfosReaderIndex.java, TermInfosReaderIndexDefault.java, TermInfosReaderIndexSmall.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
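The system-property switch described in the issue (-Dorg.apache.lucene.index.TermInfosReader=default or small) suggests a simple factory pattern; a rough sketch is below. The interfaces and class names here are simplified stand-ins, not the patch's actual types, and the fallback value is an assumption for the example.

```java
// Illustrative sketch of selecting an implementation via a system property,
// as the patch's -Dorg.apache.lucene.index.TermInfosReader=default|small
// switch is described to do. All names here are hypothetical stand-ins.
interface TermIndex { String name(); }

class DefaultTermIndex implements TermIndex { public String name() { return "default"; } }
class SmallTermIndex implements TermIndex { public String name() { return "small"; } }

public class TermIndexFactory {
    public static TermIndex create() {
        // Assumption for this sketch: the original implementation is the fallback.
        String impl = System.getProperty("org.apache.lucene.index.TermInfosReader", "default");
        return "small".equals(impl) ? new SmallTermIndex() : new DefaultTermIndex();
    }
}
```

Gating new behavior behind a property like this lets users fall back to the original code path without rebuilding, which is useful while a patch is still being validated.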
Re: NRT and IndexSearcher performance
J, The javadocs are illustrating that there's no need to create a new IndexSearcher for each query. Jason On Tue, Jan 19, 2010 at 5:04 PM, jchang jchangkihat...@gmail.com wrote: The javadocs for IndexSearcher in Lucene 3.0.0 read: For performance reasons it is recommended to open only one IndexSearcher and use it for all of your searches. However, to use NRT, it seems I have to do this for every search, which contradicts the advice above: IndexSearcher myIndexSearcher = new IndexSearcher(myIndexWriter.getReader()); Is there any way to take advantage of NRT and not run into these performance problems under heavy load? Is the advice from the javadoc above aimed more at IndexSearcher(org.apache.lucene.store.Directory directory)? Or is it also aimed at IndexSearcher(org.apache.lucene.index.IndexReader indexReader), which I believe I have to use to get NRT (correct me if I am wrong)? -- View this message in context: http://old.nabble.com/NRT-and-IndexSearcher-performance-tp27235434p27235434.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: NRT and IndexSearcher performance
I think the question here really is the cost of creating new IndexReader instances per query. Calling IndexWriter.getReader() for each query has been shown to be expensive in our benchmarks and previous discussions. -John On Tue, Jan 19, 2010 at 8:12 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: J, The javadocs are illustrating that there's no need to create a new IndexSearcher for each query. Jason On Tue, Jan 19, 2010 at 5:04 PM, jchang jchangkihat...@gmail.com wrote: The javadocs for IndexSearcher in Lucene 3.0.0 read: For performance reasons it is recommended to open only one IndexSearcher and use it for all of your searches. However, to use NRT, it seems I have to do this for every search, which contradicts the advice above: IndexSearcher myIndexSearcher = new IndexSearcher(myIndexWriter.getReader()); Is there any way to take advantage of NRT and not run into these performance problems under heavy load? Is the advice from the javadoc above aimed more at IndexSearcher(org.apache.lucene.store.Directory directory)? Or is it also aimed at IndexSearcher(org.apache.lucene.index.IndexReader indexReader), which I believe I have to use to get NRT (correct me if I am wrong)? -- View this message in context: http://old.nabble.com/NRT-and-IndexSearcher-performance-tp27235434p27235434.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
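The pattern the thread converges on, reusing a searcher and reopening only when the index has actually changed rather than per query, can be sketched without any Lucene dependency. The classes below are stand-ins I invented for the example (FakeWriter mimics an IndexWriter whose version advances on commit; FakeSearcher mimics an expensive-to-open IndexSearcher); this is not Lucene's API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Generic sketch of the reuse pattern discussed above: open an expensive
// "searcher" once and reopen it only when the underlying index version
// changes, instead of constructing one per query.
public class SearcherHolderSketch {
    // Stand-in for IndexWriter: exposes a version that changes on commit.
    static class FakeWriter {
        final AtomicLong version = new AtomicLong();
        void commit() { version.incrementAndGet(); }
    }

    // Stand-in for an expensive-to-open IndexSearcher.
    static class FakeSearcher {
        final long version;
        FakeSearcher(long version) { this.version = version; }
    }

    private final FakeWriter writer;
    private volatile FakeSearcher current;

    public SearcherHolderSketch(FakeWriter writer) {
        this.writer = writer;
        this.current = new FakeSearcher(writer.version.get());
    }

    // Return the cached searcher, reopening only if the index advanced.
    public synchronized FakeSearcher acquire() {
        long v = writer.version.get();
        if (current.version != v) {
            current = new FakeSearcher(v); // reopen only when stale
        }
        return current;
    }
}
```

Under heavy query load with infrequent updates, most acquire() calls return the cached instance, which keeps the per-query cost near zero while still giving near-real-time visibility after each commit.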