[jira] Updated: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1470: -- Attachment: fixbuild-LUCENE-1470.patch Sorry for yet another patch: when looking into the test again, I noticed I had missed a test for the automatic encoding detection by string length (TrieUtils.trieCodedToXxxAuto()). The appended patch fixes the Hudson build and adds this test. > Add TrieRangeQuery to contrib > - > > Key: LUCENE-1470 > URL: https://issues.apache.org/jira/browse/LUCENE-1470 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Affects Versions: 2.4 >Reporter: Uwe Schindler >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: fixbuild-LUCENE-1470.patch, fixbuild-LUCENE-1470.patch, > LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, > LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch > > > According to the thread in java-dev > (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and > http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to > include my fast numerical range query implementation in lucene > contrib-queries. > I implemented (based on RangeFilter) another approach for faster > RangeQueries, based on longs stored in the index in a special format. > The idea is to store the longs in the index at different precisions and to > partition the query range so that the outer boundaries are searched using > terms of the highest precision, while the center of the search range uses > lower precisions. The implementation stores the longs at 8 different > precisions (using a class called TrieUtils). It also supports Doubles, using > the IEEE 754 floating-point "double format" bit layout with some bit > mappings to make them binary sortable. The approach is used in rather big > indexes; query times are <<100 ms (!) even on low-performance desktop > computers, for very big ranges on indexes with 50 docs. > I called this RangeQuery variant and format "TrieRange" query because the > idea resembles the well-known trie structures (it is not identical to real > tries, but the algorithms are related). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
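The multi-precision idea in the description can be sketched in a few lines of Java. This is illustrative only -- the class name, the prefix scheme and the encoding itself are assumptions, not TrieUtils' actual format:

{code}
// Encode a long at 8 precisions; a range query then matches the
// range's edges with full-precision terms and covers the wide middle
// with a handful of coarse terms, so even huge ranges visit few terms.
public class TriePrefixSketch {

  // Drop 'shift' low-order bits; a prefix char derived from the shift
  // keeps terms of different precisions from colliding in one field.
  static String encode(long value, int shift) {
    return (char) ('a' + shift / 8) + Long.toHexString(value >>> shift);
  }

  public static void main(String[] args) {
    long value = 0x123456789ABCL; // hypothetical numeric field value
    for (int shift = 0; shift < 64; shift += 8) {
      System.out.println(encode(value, shift)); // one term per precision
    }
  }
}
{code}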
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
good open source projects should be better than the commercial counterparts. I really like 2.4. The DocIDSet/Filter apis really allowed me to do some interesting stuff. I feel lucene has potential to be more than just a full text search library. -John On Wed, Dec 3, 2008 at 11:58 PM, Robert Muir <[EMAIL PROTECTED]> wrote: > no, i'm not doing any caching but as mentioned it did require some work to > become almost completely i/o bound due to the nature of my wacky queries, > for example removing O(n) behavior from fuzzy and regexp. > > probably the os cache is not helping much because the indexes are very large. > I'm very happy being i/o bound because now and especially in the future i > think it will be cheaper to speed up with additional ram and faster storage. > > still, even out of the box without any tricks, lucene performs *much* better than > the commercial alternatives i have fought with. lucene was evaluated a while > ago before 2.3 and this was not the case, but I re-evaluated around the 2.3 > release and it is now. > > > On Thu, Dec 4, 2008 at 2:45 AM, John Wang <[EMAIL PROTECTED]> wrote: > >> Thanks Robert, definitely interested! >> We, too, are looking into SSDs for performance. >> 2.4 allows you to extend QueryParser and create your own "leaf" >> queries. >> I am surprised you are mostly IO bound. Lucene does a good job caching. Do >> you do some sort of caching yourself? If your index is not changing often, >> there is a lot you can do without SSDs. >> >> -John >> >> >> On Wed, Dec 3, 2008 at 11:27 PM, Robert Muir <[EMAIL PROTECTED]> wrote: >> >>> yeah i am using read-only. >>> >>> i will admit to subclassing queryparser and having customized >>> query/scorer for several. all queries contain fuzzy queries so this was >>> necessary. >>> >>> "high" throughput i guess is a matter of opinion. in attempting to >>> profile high throughput, again customized query/scorer made it easy for me >>> to simplify some things, such as some math in termquery that doesn't make >>> sense (redundant) for my Similarity. everything is pretty much i/o bound now, >>> so if there is some throughput issue i will look into SSD for high volume >>> indexes. >>> >>> i posted on Use Cases on the wiki how I made fuzzy and regex fast if you >>> are curious. >>> >>> >>> On Thu, Dec 4, 2008 at 2:10 AM, John Wang <[EMAIL PROTECTED]> wrote: Thanks Robert for sharing. Good to hear it is working for what you need it to do. 3) Especially with ReadOnlyIndexReaders, you should not be blocked while indexing. Especially if you have multicore machines. 4) do you stay with sub-second responses with high thru-put? -John On Wed, Dec 3, 2008 at 11:03 PM, Robert Muir <[EMAIL PROTECTED]> wrote: > > > On Thu, Dec 4, 2008 at 1:24 AM, John Wang <[EMAIL PROTECTED]> wrote: > >> Nice! >> Some questions: >> >> 1) one index? >> > no, but two individual ones today were around 100M docs > >> 2) how big is your document? e.g. how many terms etc. >> > the last one built has over 4M terms > >> 3) are you serving (searching) the docs in realtime? >> > i don't understand this question, but searching is slower if i am > indexing on a disk that's also being searched. > >> >> 4) search speed? >> > usually subsecond (or close) after some warmup. while this might seem > slow, it's fast compared to the competition, trust me. > >> >> I'd love to learn more about your architecture. >> > i hate to say you would be disappointed, but there's nothing fancy. > probably why it works... 
> >> >> -John >> >> >> On Wed, Dec 3, 2008 at 10:13 PM, Robert Muir <[EMAIL PROTECTED]>wrote: >> >>> sorry gotta speak up on this. i indexed 300m docs today. I'm using an >>> out of box jar. >>> >>> yeah i have some special subclasses but if i thought any of this >>> stuff was general enough to be useful to others i'd submit it. I'm just >>> happy to have something scalable that i can customize to my >>> peculiarities. >>> >>> so i think i fit in your 10% and i'm not stressing on either >>> scalability or api. >>> >>> thanks, >>> robert >>> >>> >>> On Thu, Dec 4, 2008 at 12:36 AM, John Wang <[EMAIL PROTECTED]>wrote: >>> Grant: I am sorry that I disagree with some points: 1) "I think it's a sign that Lucene is pretty stable." - While lucene is a great project, especially with the 2.x releases, great improvements are made, but do we really have a clear picture on how lucene is being used and deployed? While lucene works great running as a vanilla search library, when pushed to limits, one needs to "hack" into lucene to make certain things work. If 90% of the user base use it to build small indexes and using the vanilla api...
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653257#action_12653257 ] Michael McCandless commented on LUCENE-1470: Hmm -- I would prefer that contrib tests subclass LiaTestCase. We must be missing a dependency in the ant build files. OK this seems to fix it: Index: contrib/contrib-build.xml === --- contrib/contrib-build.xml (revision 723145) +++ contrib/contrib-build.xml (working copy) @@ -61,7 +61,7 @@ - + I'll commit that, and the fix to the test case. Thanks Uwe!
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653258#action_12653258 ] Michael McCandless commented on LUCENE-1470: bq. Hmm - I would prefer that contrib tests subclass LiaTestCase Woops, I meant LuceneTestCase ;) Time sharing not working very well in my brain this morning...
[jira] Issue Comment Edited: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653257#action_12653257 ] mikemccand edited comment on LUCENE-1470 at 12/4/08 3:07 AM: - Hmm -- I would prefer that contrib tests subclass LiaTestCase. We must be missing a dependency in the ant build files. OK this seems to fix it: {code} Index: contrib/contrib-build.xml === --- contrib/contrib-build.xml (revision 723145) +++ contrib/contrib-build.xml (working copy) @@ -61,7 +61,7 @@ - + {code} I'll commit that, and the fix to the test case. Thanks Uwe! was (Author: mikemccand): Hmm -- I would prefer that contrib tests subclass LiaTestCase. We must be missing a dependency in the ant build files. OK this seems to fix it: Index: contrib/contrib-build.xml === --- contrib/contrib-build.xml (revision 723145) +++ contrib/contrib-build.xml (working copy) @@ -61,7 +61,7 @@ - + I'll commit that, and the fix to the test case. Thanks Uwe!
[jira] Resolved: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1470. Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Committed revision 723287.
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
Robert Muir wrote: i posted on Use Cases on the wiki how I made fuzzy and regex fast if you are curious. It looks like this is the wiki page: http://wiki.apache.org/lucene-java/FastSSFuzzy?highlight=(fuzzy) The approach is similar to how contrib/spellchecker generates its candidates, in that you build a 2nd index from the primary index and use the 2nd index to generate candidates more quickly (not O(N)). It'd be nice to get your approach into contrib as well ;) Mike
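For readers following along, the FastSS idea behind that wiki page can be sketched with a plain HashMap standing in for the auxiliary index (hypothetical names; not the wiki's or Robert's code): index each term under all of its 1-deletion variants, then look the query's variants up instead of scanning every term.

{code}
import java.util.*;

class FastSSSketch {
  private final Map<String, Set<String>> deletionIndex = new HashMap<String, Set<String>>();

  // The term itself plus every string obtained by deleting one char.
  static List<String> variants(String term) {
    List<String> out = new ArrayList<String>();
    out.add(term);
    for (int i = 0; i < term.length(); i++)
      out.add(term.substring(0, i) + term.substring(i + 1));
    return out;
  }

  void index(String term) {
    for (String v : variants(term)) {
      Set<String> terms = deletionIndex.get(v);
      if (terms == null) deletionIndex.put(v, terms = new HashSet<String>());
      terms.add(term);
    }
  }

  // Candidates near edit distance 1; each still needs a final
  // edit-distance check, but only candidates are verified, not all N terms.
  Set<String> candidates(String query) {
    Set<String> out = new HashSet<String>();
    for (String v : variants(query))
      if (deletionIndex.containsKey(v)) out.addAll(deletionIndex.get(v));
    return out;
  }
}
{code}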
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
John Wang wrote: Seems like being a committer can be rather lucrative. I think being an Apache committer on any project can be somewhat lucrative. Companies know that you probably work well with others if you're a committer, which can probably lead to improved career opportunities. Can't say too much about working well with others :) I may not be extracting as much money as I can, though - sounds like I could be taking bribes to commit code if I wanted to make more ;) My comment was on the statements of being volunteers and don't get paid, which is a little misleading. It depends. Sometimes, something you're doing with a customer might make its way into Lucene. That's not most of the work that goes on here though. Most of the work is looking at submitted patches in our free time, going over them, running the tests, and possibly committing them. I do that for the project because I like to, not for any money I'm getting (true enough I haven't been a core committer long, but I did the same as a contrib committer). When I'm sitting around at 11 at night or 7 in the morning, trying to get patches committed, I'd hate to be classified as a non-volunteer. It's just as easy to get the committer title and then fall off the face of the world. No one ensures you are helping anyone get anything done. I guess I need to learn to be a good boy not to piss off the committers anymore (or convince my company to pay to get some patches in) And hopefully someday I get to grow up and get to become a committer and make some $ too. You might consider it. I think you have been a bit rude, but watch and see...quality patches you submit will still get processed like any other. The people around here are friendly and mainly interested in the quality of Lucene. No one is trying to enforce some sort of "power elite" here. There is no blacklist. At the same time, lashing out isn't going to help get any issues passed (in fact, I've seen it sink more than one issue). I've certainly never been involved in Lucene for the money myself (and I don't have much of it, believe you me). - Mark -John
[jira] Updated: (LUCENE-689) NullPointerException thrown by equals method in SpanOrQuery
[ https://issues.apache.org/jira/browse/LUCENE-689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-689: -- Fix Version/s: 2.9 > NullPointerException thrown by equals method in SpanOrQuery > --- > > Key: LUCENE-689 > URL: https://issues.apache.org/jira/browse/LUCENE-689 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.1 > Environment: Java 1.5.0_09, RHEL 3 Linux, Tomcat 5.0.28 >Reporter: Michael Goddard >Assignee: Otis Gospodnetic >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-689.txt > > > Part of our code utilizes the equals method in SpanOrQuery and, in certain > cases (details to follow, if necessary), a NullPointerException gets thrown > as a result of the String "field" being null. After applying the following > patch, the problem disappeared: > Index: src/java/org/apache/lucene/search/spans/SpanOrQuery.java > === > --- src/java/org/apache/lucene/search/spans/SpanOrQuery.java(revision > 465065) > +++ src/java/org/apache/lucene/search/spans/SpanOrQuery.java(working copy) > @@ -121,7 +121,8 @@ > final SpanOrQuery that = (SpanOrQuery) o; > if (!clauses.equals(that.clauses)) return false; > -if (!field.equals(that.field)) return false; > +if (field != null && !field.equals(that.field)) return false; > +if (field == null && that.field != null) return false; > return getBoost() == that.getBoost(); >}
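The patched lines are the standard null-safe comparison idiom. As a standalone sketch of the same check (on Java 7+, java.util.Objects.equals(a, b) is equivalent):

{code}
// Null-safe equality, equivalent to the two '+' lines in the patch:
static boolean nullSafeEquals(Object a, Object b) {
  return (a == null) ? (b == null) : a.equals(b);
}
{code}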
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653268#action_12653268 ] Michael McCandless commented on LUCENE-1470: bq. I think, this cannot work. The Cache is keyed by FieldCacheImpl.Entry containing the parser to use. Sigh, you are correct. How would you fix FieldCache? I guess the workaround is to also index the original value (unencoded by TrieUtils) as an additional field, for sorting.
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653270#action_12653270 ] Uwe Schindler commented on LUCENE-1470: --- Thanks, then I would also change TestTrieRangeQuery to use LuceneTestCase, just for completeness. bq. Sigh, you are correct. How would you fix FieldCache? I would fix FieldCache by giving SortField the possibility to supply a parser instance. So you create a SortField using a new constructor SortField(String field, int type, Object parser, boolean reverse). The parser is "Object" because the parsers have no common super-interface. The ideal solution would be to have SortField(String field, int type, FieldCache.Parser parser, boolean reverse), where FieldCache.Parser is a super-interface (just empty, more like a marker interface) of all the other parsers (like LongParser...). bq. I guess the workaround is to also index the original value (unencoded by TrieUtils) as an additional field, for sorting. The problem with the extra field is that it works well for longs or doubles (with some extra work), but Dates would still be kept as Strings, or you would use Date.getTime() as a long. That is not very elegant and needs more fields and terms. I prefer a clean solution.
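A minimal sketch of what Uwe proposes (names assumed; nothing here is committed code):

{code}
// Empty marker super-interface unifying the existing parsers...
interface Parser { }

// ...which each concrete parser interface would then extend:
interface LongParser extends Parser {
  long parseLong(String value);
}

// SortField could then carry a custom parser into the FieldCache:
class SortFieldSketch {
  final String field;
  final int type;
  final Parser parser;   // null means "use the default parser"
  final boolean reverse;

  SortFieldSketch(String field, int type, Parser parser, boolean reverse) {
    this.field = field;
    this.type = type;
    this.parser = parser;
    this.reverse = reverse;
  }
}
{code}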
[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
[ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653272#action_12653272 ] Mark Miller commented on LUCENE-1390: - So my final thought on this is performance...is handling more characters much slower? Could that be a reason to keep the Latin1 filter as well? > add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter > > > Key: LUCENE-1390 > URL: https://issues.apache.org/jira/browse/LUCENE-1390 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Environment: any >Reporter: Andi Vajda >Assignee: Mark Miller >Priority: Minor > Fix For: 2.9 > > Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, > ASCIIFoldingFilter.patch > > > The ISOLatin1AccentFilter removes accents from accented characters in the > ISO Latin 1 character set. > It does what it does and there is no bug in it. > It would be nicer, though, if there were a more comprehensive version of this > code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin-1 > Supplement and Latin Extended-A unicode blocks. > See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block > See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block > That way, all languages using roman characters are covered. > A new class, ISOLatinAccentFilter, is attached. It is intended to supersede > ISOLatin1AccentFilter, which should get deprecated.
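The coverage question can be illustrated with the JDK alone. This is not the patch's table-driven approach, just an approximation via Unicode decomposition (Java 6+), and it shows why a mapping table has to handle more:

{code}
import java.text.Normalizer;

public class FoldSketch {
  // Decompose ("é" -> "e" + combining accent), then strip the marks.
  static String fold(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFD)
                     .replaceAll("\\p{M}+", "");
  }

  public static void main(String[] args) {
    System.out.println(fold("Ångström, œil, łódź"));
    // prints roughly "Angstrom, œil, łodz": ligatures like œ and
    // stroked letters like ł do not decompose, which is exactly the
    // extra ground a table-driven folding filter has to cover.
  }
}
{code}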
[jira] Commented: (LUCENE-1465) NearSpansOrdered.getPayload does not return the payload from the minimum match span
[ https://issues.apache.org/jira/browse/LUCENE-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653277#action_12653277 ] Mark Miller commented on LUCENE-1465: - What's involved in a backport - just commit it to the 2.4 branch and that's all? Looks like I have to look into terms indexed at the same position first - I'll try to get to that soon. - Mark > NearSpansOrdered.getPayload does not return the payload from the minimum > match span > --- > > Key: LUCENE-1465 > URL: https://issues.apache.org/jira/browse/LUCENE-1465 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.4 >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Minor > Fix For: 2.4.1, 2.9 > > Attachments: LUCENE-1465.patch, LUCENE-1465.patch, LUCENE-1465.patch, > LUCENE-1465.patch, Test.java > >
[jira] Updated: (LUCENE-996) Parsing mixed inclusive/exclusive range queries
[ https://issues.apache.org/jira/browse/LUCENE-996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-996: --- Fix Version/s: (was: 2.9) 3.0 Because this requires changing a callback or two in the queryparser, it's probably easier to put it into 3.0 than 2.9. > Parsing mixed inclusive/exclusive range queries > --- > > Key: LUCENE-996 > URL: https://issues.apache.org/jira/browse/LUCENE-996 > Project: Lucene - Java > Issue Type: Improvement > Components: QueryParser >Affects Versions: 2.2 >Reporter: Andrew Schurman >Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-996.patch, LUCENE-996.patch, lucene-996.patch, > lucene-996.patch > > > The current query parser doesn't handle parsing a range query (i.e. > ConstantScoreRangeQuery) with mixed inclusive/exclusive bounds.
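For illustration, the kind of query the issue targets; the mixed-bracket syntax shown here is assumed, not confirmed by the patch:

{code}
// Hypothetical once mixed bounds parse: '['/']' inclusive, '{'/'}' exclusive.
QueryParser parser = new QueryParser("price", new WhitespaceAnalyzer());
Query q = parser.parse("price:[5 TO 10}"); // 5 <= price < 10
{code}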
[jira] Commented: (LUCENE-1286) LargeDocHighlighter - another span highlighter optimized for large documents
[ https://issues.apache.org/jira/browse/LUCENE-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653283#action_12653283 ] Mark Miller commented on LUCENE-1286: - Hey Koji, I actually have some ideas to come back to this with, but I won't have time to actually work on it for a while. bq. Can you elaborate this - "rebuild the document by running through the query terms by using their offsets"? Part of the problem with the Highlighter and large docs is that it runs through every token in the doc and scores that token, building the original highlighted doc as it goes. For a large doc, that can be a bit slow. What Ronnie's highlighter did was just look at the offsets of the query terms (hence the need for term vectors), which allows you to rebuild the original highlighted document in big quick chunks (stitching things together between query term offsets). I was attempting a similar thing here with phrase and span support, but I couldn't match the speed of what the current Span highlighter has - this is because the current Span Highlighter can highlight non position sensitive terms very fast. My method required getting non position sensitive terms from the MemoryIndex as well (via getSpans) and the cost ruined any benefit. I came up with a few things to try since then but haven't had the time to dedicate to it yet. It's hard to get around requiring term vectors (for the offsets), and I'd like to avoid that. At the same time, if you don't require term vectors, it's probably going to be pretty slow re-analyzing the documents anyway... > LargeDocHighlighter - another span highlighter optimized for large documents > > > Key: LUCENE-1286 > URL: https://issues.apache.org/jira/browse/LUCENE-1286 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/highlighter >Affects Versions: 2.4 >Reporter: Mark Miller >Priority: Minor > > The existing Highlighter API is rich and well designed, but the approach > taken is not very efficient for large documents. > I believe that this is because the current Highlighter rebuilds the document > by running through and scoring every token in the tokenstream. > With a break in the current API, an alternate approach can be taken: rebuild > the document by running through the query terms by using their offsets. The > benefit is clear - a large doc will have a large tokenstream, but a query > will likely be very small in comparison. > I expect this approach to be quite a bit faster for very large documents, > while still supporting Phrase and Span queries. > First rough patch to follow shortly.
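The chunk-stitching described above can be sketched in a few lines (assumed names; this is neither Ronnie's highlighter nor the attached patch):

{code}
class OffsetHighlighterSketch {
  // matches: sorted, non-overlapping [start, end) character offsets
  // of query-term hits, as recovered from term vectors.
  static String highlight(String doc, int[][] matches) {
    StringBuilder out = new StringBuilder(doc.length());
    int pos = 0;
    for (int[] m : matches) {
      out.append(doc, pos, m[0]); // copy the unhighlighted gap in one chunk
      out.append("<b>").append(doc, m[0], m[1]).append("</b>");
      pos = m[1];
    }
    return out.append(doc.substring(pos)).toString();
  }
}
{code}

The work is proportional to the number of query-term hits rather than the number of tokens in the document, which is the whole point for large docs.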
[jira] Commented: (LUCENE-1469) isValid should be invoked after analyze rather than before it so it can validate the output of analyze
[ https://issues.apache.org/jira/browse/LUCENE-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653285#action_12653285 ] Mark Miller commented on LUCENE-1469: - This makes sense to me. Care to submit a patch? > isValid should be invoked after analyze rather than before it so it can > validate the output of analyze > -- > > Key: LUCENE-1469 > URL: https://issues.apache.org/jira/browse/LUCENE-1469 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4 >Reporter: Vincent Li >Priority: Minor > Original Estimate: 0.08h > Remaining Estimate: 0.08h > > The Synonym map has a protected method String analyze(String word) designed > for custom stemming. > However, before analyze is invoked on a word, boolean isValid(String str) is > used to validate the word - which causes the program to discard words that > may be usable by the custom analyze method. > I think that isValid should be invoked after analyze rather than before it so > it can validate the output of analyze and allow implementers to decide what is > valid for the overridden analyze method. (In fact, if you look at the code > snippet below, isValid should really go after the empty string check) > This is a two line change in org.apache.lucene.index.memory.SynonymMap > /* >* Part B: ignore phrases (with spaces and hyphens) and >* non-alphabetic words, and let user customize word (e.g. do some >* stemming) >*/ > if (!isValid(word)) continue; // ignore > word = analyze(word); > if (word == null || word.length() == 0) continue; // ignore
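The suggested reordering, as a sketch (the surrounding loop is as in SynonymMap; the actual patch may differ):

{code}
/*
 * Part B (reordered): let the custom analyze() see every word,
 * then validate its output instead of its input.
 */
word = analyze(word);
if (word == null || word.length() == 0) continue; // ignore
if (!isValid(word)) continue; // now validates analyze()'s output
{code}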
[jira] Commented: (LUCENE-1465) NearSpansOrdered.getPayload does not return the payload from the minimum match span
[ https://issues.apache.org/jira/browse/LUCENE-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653292#action_12653292 ] Michael McCandless commented on LUCENE-1465: bq. What's involved in a backport - just commit it to the 2.4 branch and that's all? Yup. "svn merge" works well as long as the code hasn't diverged much, eg running this in a 2.4 branch checkout: {code} svn merge -r(N-1):N https://svn.apache.org/repos/asf/lucene/java/trunk {code} where N was the revision committed to trunk.
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653295#action_12653295 ] Michael McCandless commented on LUCENE-1470: bq. Thanks, then I would also change TestTrieRangeQuery to use LuceneTestCase, just for completeness. OK done. bq. I would fix FieldCache by giving SortField the possibility to supply a parser instance. So you create a SortField using a new constructor SortField(String field, int type, Object parser, boolean reverse). The parser is "Object" because the parsers have no common super-interface. This seems OK for now? Can you open an issue? Retro-fitting a super-interface would break back-compat for (admittedly very advanced) existing Parser instances external to Lucene, right? bq. but Dates would still be kept as Strings, or you would use Date.getTime() as a long Yeah. But if we open the new issue (to allow external FieldCache parsers to be used when sorting) then one could parse to a long directly from a TrieUtils-encoded Date field, right?
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
On Dec 4, 2008, at 12:36 AM, John Wang wrote: Grant: I am sorry that I disagree with some points: 1) "I think it's a sign that Lucene is pretty stable." - While lucene is a great project, especially with the 2.x releases, great improvements are made, but do we really have a clear picture on how lucene is being used and deployed? While lucene works great running as a vanilla search library, when pushed to limits, one needs to "hack" into lucene to make certain things work. If 90% of the user base use it to build small indexes and using the vanilla api, and the other 10% is really stressing both on the scalability and api side and are running into issues, would you still say: "running well for 90% of the users, therefore it is stable or extensible"? I think it is unfair to the project itself to be measured by the vanilla use-case. I have done a couple of large deployments, e.g. >30 million documents indexed and searched in realtime, and I really had to do some tweaking. Sorry, we should have written a perfect engine the first time out. I'll get on that. Question for you: how much of that tweaking have you contributed back? If you have such obvious wins, put them up as patches so we can all benefit, just like you've benefitted from our volunteering. As for 90%, I'd say it is more like > 95% and, gee, if I can write a general purpose open source search library that keeps 95% of a very, very, very large install base happy all while still improving it and maintaining backward compatibility, then color me stable. 2) "You want stuff committed, keep it up to date, make it manageable to review, document it, respond to questions/concerns with answers as best you can. " - To some degree I would hope it depends on what the issue is, e.g. enforcing such process on a one-line null check seems to be overkill. I agree with the process itself; what would make it better is some transparency on how patches/issues are evaluated to be committed. At least seen from the outside, it is purely being decided on by the committers, and since my understanding is that an open source project belongs to the public, the public user base should have some say. Here's your list of opened issues: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&reporterSelect=specificuser&[EMAIL PROTECTED] Only 1 of which has more than 2 votes and which is assigned to Hoss. However, from what I can see, you've had all but 1, I repeat ONE, of your issues resolved. And, yes, what gets committed is decided on by the COMMITTERS with input from the community; who else can be responsible for committing? Hence the title. We can't please everyone, but I'll be damned if you're going to disparage the work of so many because you have sour grapes over some people (not all) disagreeing with you over how serialization should work in Lucene just b/c you think the problem is trivial when clearly others do not. Committers are picked by the project over a long period of time (feel free to nominate someone who you feel has merit, we've elected committers based on community nominations in the past) because they stick around and stay involved and respond on the list, etc. I'm starting to think your real issue here is that we haven't all agreed with you the minute you suggest something, but sorry, that is how open source works. 
3) which brings me to this point: "I personally, would love to work on Lucene all day every day as I have a lot of things I'd love to engage the community on, but the fact is I'm not paid to do that, so I give what I can when I can. I know most of the other committers are that way too." - Is this really true? Isn't a large part of the committer base also a part of the for-profit, consulting business, e.g. Lucid? Would groups/companies that pay for consulting services get their patches/requirements committed with higher priority? If so, that seems to me to be a conflict of interest. Yes, John, it is true. I would love to work on Lucene all day. If I won the lottery tomorrow, I'd probably still volunteer on Lucene. Let me ask you back, who pays you to work on Lucene? Was this patch submitted because you just happened to spot it while poring over the code at night on your own and out of the goodness of your heart? Or did you discover it at LinkedIn where you were specifically hired because of your Lucene skills and knowledge of the Lucene community? In other words, you're accusing me and others of getting paid for my expertise in Lucene, all the while you are getting paid for your expertise in Lucene. 4) "Lather, rinse, repeat. Next thing you know, you'll be on the receiving end as a committer." - While I agree that being a committer is a great honor and many committers are awesome, assuming everyone would want to be a committer is a little presumptuous.
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653297#action_12653297 ] Michael McCandless commented on LUCENE-1473: bq. It seems best to remove Serialization from Lucene so that users are not confused and create a better solution. I don't think that's the case. If we choose to only support "live serialization" then we should add "implements Serializable" but spell out clearly in the javadocs that there is no guarantee of cross-version compatibility ("long term persistence") and in fact that often there are incompatibilities. I think "live serialization" is still a useful feature. > Implement standard Serialization across Lucene versions > --- > > Key: LUCENE-1473 > URL: https://issues.apache.org/jira/browse/LUCENE-1473 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Minor > Attachments: LUCENE-1473.patch > > Original Estimate: 8h > Remaining Estimate: 8h > > To maintain serialization compatibility between Lucene versions, > serialVersionUID needs to be added to classes that implement > java.io.Serializable. java.io.Externalizable may be implemented in classes > for faster performance.
RE: Build failed in Hudson: Lucene-trunk #665
Why does the compilation of my testcase for TrieRangeQuery fail on Hudson, but work here? - UWE SCHINDLER Webserver/Middleware Development PANGAEA - Publishing Network for Geoscientific and Environmental Data MARUM - University of Bremen Room 2500, Leobener Str., D-28359 Bremen Tel.: +49 421 218 65595 Fax: +49 421 218 65505 http://www.pangaea.de/ E-mail: [EMAIL PROTECTED] > -Original Message- > From: Apache Hudson Server [mailto:[EMAIL PROTECTED] > Sent: Thursday, December 04, 2008 3:11 AM > To: java-dev@lucene.apache.org > Subject: Build failed in Hudson: Lucene-trunk #665 > > See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/665/changes > > Changes: > > [mikemccand] LUCENE-1457: fix possible overflow bugs during binary search > > [mikemccand] LUCENE-1470: add TrieRangeQuery, a much more efficient > implementation of RangeQuery at the expense of added space consumed in the > index > > [markrmiller] LUCENE-1246: check for null sub queries so that > BooleanQuery.toString does not throw NullPointerException. > > -- > [...truncated 3201 lines...] > clover.setup: > > clover.info: > > clover: > > compile-core: > > common.compile-test: > [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/build/contrib/misc/classes/test > [javac] Compiling 7 source files to > http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/build/contrib/misc/classes/test > [javac] Note: Some input files use or override a deprecated API. > [javac] Note: Recompile with -Xlint:deprecation for details. > > build-artifacts-and-tests: > [echo] Building queries... > > javacc-uptodate-check: > > javacc-notice: > > jflex-uptodate-check: > > jflex-notice: > > common.init: > > build-lucene: > > init: > > clover.setup: > > clover.info: > > clover: > > compile-core: > [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/build/contrib/queries/classes/java > [javac] Compiling 12 source files to > http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/build/contrib/queries/classes/java > [javac] Note: Some input files use or override a deprecated API. > [javac] Note: Recompile with -Xlint:deprecation for details. > > jar-core: > [jar] Building jar: > http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/build/contrib/queries/lucene-queries-2.4-SNAPSHOT.jar > > jar: > > compile-test: > [echo] Building queries... 
> > javacc-uptodate-check: > > javacc-notice: > > jflex-uptodate-check: > > jflex-notice: > > common.init: > > build-lucene: > > init: > > clover.setup: > > clover.info: > > clover: > > compile-core: > > common.compile-test: > [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/build/contrib/queries/classes/test > [javac] Compiling 6 source files to > http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/build/contrib/queries/classes/test > [javac] http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/contrib/queries/src/test/org/apache/lucene/search/trie/Test > TrieUtils.java :23: cannot find symbol > [javac] symbol : class LuceneTestCase > [javac] location: package org.apache.lucene.util > [javac] import org.apache.lucene.util.LuceneTestCase; > [javac] ^ > [javac] http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/contrib/queries/src/test/org/apache/lucene/search/trie/Test > TrieUtils.java :25: cannot find symbol > [javac] symbol: class LuceneTestCase > [javac] public class TestTrieUtils extends LuceneTestCase { > [javac]^ > [javac] http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/contrib/queries/src/test/org/apache/lucene/search/trie/Test > TrieUtils.java :29: cannot find symbol > [javac] symbol : method > assertEquals(java.lang.String,java.lang.String) > [javac] location: class org.apache.lucene.search.trie.TestTrieUtils > [javac] assertEquals( > TrieUtils.VARIANT_8BIT.TRIE_CODED_NUMERIC_MIN, > "\u0100\u0100\u0100\u0100\u0100\u0100\u0100\u0100"); > [javac] ^ > [javac] http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/contrib/queries/src/test/org/apache/lucene/search/trie/Test > TrieUtils.java :30: cannot find symbol > [javac] symbol : method > assertEquals(java.lang.String,java.lang.String) > [javac] location: class org.apache.lucene.search.trie.TestTrieUtils > [javac] assertEquals( > TrieUtils.VARIANT_8BIT.TRIE_CODED_NUMERIC_MAX, > "\u01ff\u01ff\u01ff\u01ff\u01ff\u01ff\u01ff\u01ff"); > [javac] ^ > [javac] http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/contrib/queries/src/test/org/apache/lucene/search/trie/Test > TrieUtils.java :31: cannot find symbol > [j
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653298#action_12653298 ] Uwe Schindler commented on LUCENE-1470: --- Yes, I will open an issue! Maybe I will create a first patch after looking into the problem. bq. This seems OK for now? Can you open an issue? Retro-fitting a super-interface would break back-compat for (admittedly very advanced) existing Parser instances external to Lucene, right? I am not sure, but I think it's better to leave it as is for now. On the other hand, if we just have a "marker" super-interface, it should be backwards compatible, because the super-interface is new and existing code would only use the existing interfaces. No methods are added by the super-interface, so code would be source and binary compatible (as it only references the existing interfaces). I think we had this discussion some time in the past in another issue (Fieldable???), but that was another problem. bq. Yeah. But if we open the new issue (to allow external FieldCache parsers to be used when sorting) then one could parse to a long directly from a TrieUtils-encoded Date field, right? Correct. As soon as this works, I would simply add, as an "extra bonus", o.a.l.search.trie.TrieSortField, which automatically supplies a correct parser for easy usage. Date, Double and Long trie fields can always be sorted as longs without knowing the correct meaning (because the trie format was designed that way). Currently my code would just sort the trie-encoded fields using SortField.STRING, but this is resource-expensive (but I have no example currently running, as it was not needed for panFMP/PANGAEA and other projects).
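A sketch of how that TrieSortField could supply its parser (assumptions: the parser-carrying SortField constructor discussed above, and a decoder named after the TrieUtils.trieCodedToXxxAuto() methods mentioned earlier; nothing here is committed code):

{code}
// Hypothetical parser handing trie-decoded longs to the FieldCache:
FieldCache.LongParser trieLongParser = new FieldCache.LongParser() {
  public long parseLong(String trieCoded) {
    // Trie-coded longs, doubles and dates all order correctly when
    // compared as plain longs, so one decoder covers every trie field.
    return TrieUtils.trieCodedToLongAuto(trieCoded);
  }
};
// Usage, once the new constructor exists:
// new SortField("timestamp", SortField.LONG, trieLongParser, false);
{code}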
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653321#action_12653321 ] Michael McCandless commented on LUCENE-1473: bq. For classes that no one submits an Externalizable patch for, the serialVersionUID needs to be added. The serialVersionUID approach would be too simplistic, because we can't simply bump it up whenever we make a change since that then breaks back compatibility. We would have to override write/readObject or write/readExternal, and serialVersionUID would not be used.
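For illustration, the kind of hand-versioned stream format this implies (hypothetical class, not a Lucene patch): freeze serialVersionUID once and carry an explicit version byte in the stream instead.

{code}
import java.io.*;

class VersionedQuerySketch implements Serializable {
  private static final long serialVersionUID = 1L; // frozen forever

  private transient float boost = 1.0f; // written by hand below

  private void writeObject(ObjectOutputStream out) throws IOException {
    out.defaultWriteObject();
    out.writeByte(1);       // explicit stream-format version
    out.writeFloat(boost);
  }

  private void readObject(ObjectInputStream in)
      throws IOException, ClassNotFoundException {
    in.defaultReadObject();
    byte version = in.readByte();
    if (version > 1) {
      throw new InvalidObjectException("stream format too new: " + version);
    }
    boost = in.readFloat();
  }
}
{code}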
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
Mark and Grant: I do apologize if I came off seeming rude. I guess I let my frustration over the serialization issue get the better of me (along with frustration built up from some of the other issues, which I thought were trivial but which were made out not to be). I will improve my behavior in the future. There is a reason I have stopped submitting patches via Jira (one which I no longer dare to express). There is absolutely nothing wrong with getting paid for Lucene expertise. I was just commenting on your comment about "volunteering", but if you think I am wrong, then I am. I did have a concern about the focus of the project being biased by companies paying the committers, but obviously that is not my business. The issues/patches I am raising are trivial stuff, and that was precisely my point. I am not pushing for grandiose ideas; I am frustrated with some very brain-dead issues (I am not smart enough to provide any earth-shattering patches) that have been blown out of proportion in my mind. I will try to keep my mouth shut in the future. -John On Thu, Dec 4, 2008 at 5:24 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > On Dec 4, 2008, at 12:36 AM, John Wang wrote: > > Grant: >> >>I am sorry that I disagree with some points: >> >> 1) "I think it's a sign that Lucene is pretty stable." - While lucene is a >> great project, especially with 2.x releases, great improvements are made, >> but do we really have a clear picture on how lucene is being used and >> deployed. While lucene works great running as a vanilla search library, when >> pushed to limits, one needs to "hack" into lucene to make certain things >> work. If 90% of the user base use it to build small indexes and using the >> vanilla api, and the other 10% is really stressing both on the scalability >> and api side and are running into issues, would you still say: "running well >> for 90% of the users, therefore it is stable or extensible"? I think it is >> unfair to the project itself to be measured by the vanilla use-case. I have >> done couple of large deployments, e.g. >30 million documents indexed and >> searched in realtime., and I really had to do some tweaking. >> > > Sorry, we should have written a perfect engine the first time out. I'll > get on that. Question for you: how much of that tweaking have you > contributed back? If you have such obvious wins, put them up as patches so > we can all benefit, just like you've benefitted from our volunteering. > > As for 90%, I'd say it is more like > 95% and, gee, if I can write a > general purpose open source search library that keeps 95% of a very, very, > very large install base happy all while still improving it and maintaining > backward compatibility, than color me stable. > > >> 2) "You want stuff committed, keep it up to date, make it manageable to >> review, document it, respond to questions/concerns with answers as best you >> can. " - To some degree I would hope it depends on what the issue is, e.g. >> enforcing such process on a one-line null check seems to be an overkill. I >> agree with the process itself, what would make it better is some >> transparency on how patches/issues are evaluated to be committed. At least >> seemed from the outside, it is purely being decided on by the committers, >> and since my understanding is that an open source project belongs to the >> public, the public user base should have some say. 
>> > > Here's your list of opened issues: > https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&reporterSelect=specificuser&[EMAIL > PROTECTED] Only 1 of which has more than 2 votes and which is assigned to > Hoss. > However, from what I can see, you've had all but 1, I repeat ONE, issue not > resolved. > > And, yes, what gets committed is decided on by the COMMITTERS with input > from the community; who else can be responsible for committing? Hence the > title. We can't please everyone, but I'll be damned if you're going to > disparage the work of so many because you have sour grapes over some people > (not all) disagreeing with you over how serialization should work in Lucene > just b/c you think the problem is trivial when clearly others do not. > > Committers are picked by the project over a long period of time (feel free > to nominate someone who you feel has merit, we've elected committers based > on community nominations in the past) because they stick around and stay > involved and respond on the list, etc. I'm starting to think your real > issue here is that we haven't all agreed with you the minute you suggest > something, but sorry, that is how open source works. > > > >> 3) which brings me to this point: "I personally, would love to work on >> Lucene all day every day as I have a lot of things I'd love to engage the >> community on, but the fact is I'm not paid to do that, so I give what I can >> when I can. I know most of the other committers are that way too." - Is >> this really true? Isn't a large part of the
[jira] Created: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results
Missing possibility to supply custom FieldParser when sorting search results Key: LUCENE-1478 URL: https://issues.apache.org/jira/browse/LUCENE-1478 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Uwe Schindler When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was confronted by the problem that the special trie-encoded values (which are longs in a special encoding) cannot be sorted by Searcher.search() and SortField. The problem is: If you use SortField.LONG, you get NumberFormatExceptions. The trie encoded values may be sorted using SortField.String (as the encoding is in such a way, that they are sortable as Strings), but this is very memory ineffective. ExtendedFieldCache gives the possibility to specify a custom LongParser when retrieving the cached values. But you cannot use this during searching, because there is no possibility to supply this custom LongParser to the SortField. I propose a change in the sort classes: Include a pointer to the parser instance to be used in SortField (if not given use the default). My idea is to create a SortField using a new constructor {code}SortField(String field, int type, Object parser, boolean reverse){code} The parser is "object" bcause all parsers have no super-interface. The ideal solution would be to have: {code}SortField(String field, int type, FieldCache.Parser parser, boolean reverse){code} and FieldCache.Parser is a super-interface (just empty, more like a marker-interface) of all other parsers (like LongParser...). The sort implementation then must be changed to respect the given parser (if not NULL), else use the default FieldCache.get without parser. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
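As a rough sketch of the marker super-interface variant proposed above (heavily trimmed; the real FieldCache interface has many more members, and only one parser is shown):

{code}
// Sketch of the proposed API change, not committed code.
public interface FieldCache {

  /** Proposed empty marker super-interface for all parsers. */
  public interface Parser {}

  /** Existing parser, retrofitted to extend the marker. */
  public interface IntParser extends Parser {
    int parseInt(String value);
  }
}
{code}

Because Parser declares no methods, existing parser implementations outside Lucene would stay source and binary compatible; only code that wants the new SortField constructor needs to know about the marker.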
[jira] Updated: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results
[ https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1478: -- Description: When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was confronted by the problem that the special trie-encoded values (which are longs in a special encoding) cannot be sorted by Searcher.search() and SortField. The problem is: If you use SortField.LONG, you get NumberFormatExceptions. The trie encoded values may be sorted using SortField.String (as the encoding is in such a way, that they are sortable as Strings), but this is very memory ineffective. ExtendedFieldCache gives the possibility to specify a custom LongParser when retrieving the cached values. But you cannot use this during searching, because there is no possibility to supply this custom LongParser to the SortField. I propose a change in the sort classes: Include a pointer to the parser instance to be used in SortField (if not given use the default). My idea is to create a SortField using a new constructor {code}SortField(String field, int type, Object parser, boolean reverse){code} The parser is "object" because all current parsers have no super-interface. The ideal solution would be to have: {code}SortField(String field, int type, FieldCache.Parser parser, boolean reverse){code} and FieldCache.Parser is a super-interface (just empty, more like a marker-interface) of all other parsers (like LongParser...). The sort implementation then must be changed to respect the given parser (if not NULL), else use the default FieldCache.get without parser. was: When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was confronted by the problem that the special trie-encoded values (which are longs in a special encoding) cannot be sorted by Searcher.search() and SortField. The problem is: If you use SortField.LONG, you get NumberFormatExceptions. The trie encoded values may be sorted using SortField.String (as the encoding is in such a way, that they are sortable as Strings), but this is very memory ineffective. ExtendedFieldCache gives the possibility to specify a custom LongParser when retrieving the cached values. But you cannot use this during searching, because there is no possibility to supply this custom LongParser to the SortField. I propose a change in the sort classes: Include a pointer to the parser instance to be used in SortField (if not given use the default). My idea is to create a SortField using a new constructor {code}SortField(String field, int type, Object parser, boolean reverse){code} The parser is "object" bcause all parsers have no super-interface. The ideal solution would be to have: {code}SortField(String field, int type, FieldCache.Parser parser, boolean reverse){code} and FieldCache.Parser is a super-interface (just empty, more like a marker-interface) of all other parsers (like LongParser...). The sort implementation then must be changed to respect the given parser (if not NULL), else use the default FieldCache.get without parser. 
> Missing possibility to supply custom FieldParser when sorting search results > > > Key: LUCENE-1478 > URL: https://issues.apache.org/jira/browse/LUCENE-1478 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.4 >Reporter: Uwe Schindler >
[jira] Commented: (LUCENE-1461) Cached filter for a single term field
[ https://issues.apache.org/jira/browse/LUCENE-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653361#action_12653361 ] Otis Gospodnetic commented on LUCENE-1461: -- Is this related to LUCENE-855? The same? Aha, I see Paul asked the reverse question in LUCENE-855 already... Tim? > Cached filter for a single term field > - > > Key: LUCENE-1461 > URL: https://issues.apache.org/jira/browse/LUCENE-1461 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Tim Sturge >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: DisjointMultiFilter.java, FieldCacheRangeFilter.patch, > LUCENE-1461.patch, LUCENE-1461a.patch, LUCENE-1461b.patch, > LUCENE-1461c.patch, RangeMultiFilter.java, RangeMultiFilter.java, > TermMultiFilter.java, TestFieldCacheRangeFilter.patch > > > These classes implement inexpensive range filtering over a field containing a > single term. They do this by building an integer array of term numbers > (storing the term->number mapping in a TreeMap) and then implementing a fast > integer comparison based DocSetIdIterator. > This code is currently being used to do age range filtering, but could also > be used to do other date filtering or in any application where there need to > be multiple filters based on the same single term field. I have an untested > implementation of single term filtering and have considered but not yet > implemented term set filtering (useful for location based searches) as well. > The code here is fairly rough; it works but lacks javadocs and toString() and > hashCode() methods etc. I'm posting it here to discover if there is other > interest in this feature; I don't mind fixing it up but would hate to go to > the effort if it's not going to make it into Lucene. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653378#action_12653378 ] John Wang commented on LUCENE-1473: --- Mike: Suppose you have class A implementing Serializable, with a defined suid, say 1. Let A2 be a newer version of class A with the suid unchanged (still 1), and let A2 have a new field. Imagine A is running in VM1 and A2 is running in VM2. Serialization between VM1 and VM2 of class A is OK; A just will not get the new field, which is fine since VM1 does not make use of it. You can argue that A2 will not get the needed field from a serialized A, but isn't that better than crashing? In either case, I think the behavior is better than it is currently. (Maybe that's why Eclipse and FindBugs both report the lack of a suid definition in Lucene code as a warning.) I agree that adding an Externalizable implementation is more work, but it would make the serialization story correct. -John > Implement standard Serialization across Lucene versions > --- > > Key: LUCENE-1473 > URL: https://issues.apache.org/jira/browse/LUCENE-1473 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Minor > Attachments: LUCENE-1473.patch > > Original Estimate: 8h > Remaining Estimate: 8h > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
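As a hypothetical illustration of the scenario described here (class name and fields invented for the example), default Java serialization with a fixed serialVersionUID tolerates added fields in both directions:

{code}
import java.io.Serializable;

// Version of A deployed in VM1.
public class A implements Serializable {
  private static final long serialVersionUID = 1L;
  public int count;
}

// Newer version deployed in VM2: same suid, one extra field. With default
// serialization, a stream written by VM1 deserializes here with 'extra'
// left at its default (null), and a stream written by VM2 deserializes in
// VM1 with 'extra' silently dropped - the tolerant behavior argued for
// above. (Shown commented out, since both versions cannot coexist in one
// compilation unit.)
//
// public class A implements Serializable {
//   private static final long serialVersionUID = 1L;
//   public int count;
//   public String extra;
// }
{code}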
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
John Wang wrote: I agree with the process itself, what would make it better is some transparency on how patches/issues are evaluated to be committed. To be clear: there is no forum for communication about patches except this list, and, by extension, Jira. The process of patch evaluation is completely transparent. At least seemed from the outside, it is purely being decided on by the committers, and since my understanding is that an open source project belongs to the public, the public user base should have some say. It is not a democracy, it is a meritocracy. http://www.apache.org/foundation/how-it-works.html#meritocracy I'll repeat: committers are added when they've both contributed a series of high-quality, easy-to-commit patches, and when they've demonstrated that they are easy to work with. That process has resulted in the current set of committers, and those committers determine which patches are committed and when. Those are the rules. However, committers cannot ram just any patch through. Committers are only added after they've demonstrated the ability to build consensus around their patches. And they must continue to build consensus around their patches even after they are committers. Patches that receive no endorsement from others are not committed, no matter who contributes them. A contribution is not more rapidly committed simply because the contributor is a committer. Rather, committers know how to elicit and respond to criticism and build consensus around a patch in order to get it committed rapidly. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1461) Cached filter for a single term field
[ https://issues.apache.org/jira/browse/LUCENE-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653408#action_12653408 ] Tim Sturge commented on LUCENE-1461: That's amazing. LUCENE-855 (the FieldCacheRangeFilter part) is pretty much identical in purpose and design, down to the name. The major implementation difference is that it overloaded BitSet, which was necessary prior to the addition of DocIdSetIterator. Thus my implementation looks significantly cleaner even though it is basically functionally identical. I think this shows that any decent idea will be repeatedly reinvented until it is widely enough known. I personally would have saved some time in both conceptualization and implementation had I been aware of this. I would very much like to credit Matt in CHANGES.txt for this as well; it seems like an accident of fate that I'm not using his implementation today. > Cached filter for a single term field > - > > Key: LUCENE-1461 > URL: https://issues.apache.org/jira/browse/LUCENE-1461 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Tim Sturge >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: DisjointMultiFilter.java, FieldCacheRangeFilter.patch, > LUCENE-1461.patch, LUCENE-1461a.patch, LUCENE-1461b.patch, > LUCENE-1461c.patch, RangeMultiFilter.java, RangeMultiFilter.java, > TermMultiFilter.java, TestFieldCacheRangeFilter.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653413#action_12653413 ] Doug Cutting commented on LUCENE-1473: -- > Serialization between VM1 and VM2 of class A is OK; A just will not get > the new field, which is fine since VM1 does not make use of it. But VM1 might require an older field that the new field replaced, and VM1 may then crash in an unpredictable way. Not defining explicit suids is more conservative: you get a well-defined exception when things might not work. Defining suids but doing nothing else about compatibility is playing fast-and-loose: it might work in many cases, but it also might cause strange, hard-to-diagnose problems in others. If we want Lucene to work reliably across versions, then we need to commit to that goal as a project, define the limits of the compatibility, implement Externalizable, add tests, etc. Just adding suids doesn't achieve that, so far as I can see. > Implement standard Serialization across Lucene versions > --- > > Key: LUCENE-1473 > URL: https://issues.apache.org/jira/browse/LUCENE-1473 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Minor > Attachments: LUCENE-1473.patch > > Original Estimate: 8h > Remaining Estimate: 8h > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653414#action_12653414 ] Tim Sturge commented on LUCENE-855: --- Matt, Andy, Please take a look at LUCENE-1461. As far as I can tell it is identical in purpose and design to this patch. Matt, I would like to add you to the CHANGES.txt credits for LUCENE-1461. Are you OK with that? > MemoryCachedRangeFilter to boost performance of Range queries > - > > Key: LUCENE-855 > URL: https://issues.apache.org/jira/browse/LUCENE-855 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.1 >Reporter: Andy Liu > Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch, > MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, > TestRangeFilterPerformanceComparison.java, > TestRangeFilterPerformanceComparison.java > > > Currently RangeFilter uses TermEnum and TermDocs to find documents that fall > within the specified range. This requires iterating through every single > term in the index and can get rather slow for large document sets. > MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, > sorts by value, and stores them in a SortedFieldCache. During bits(), binary > searches are used to find the start and end indices of the lower and upper > bound values. The BitSet is populated by all the docId values that fall in > between the start and end indices. > TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed > index with random date values within a 5 year range. Executing bits() 1000 > times on standard RangeQuery using random date intervals took 63904ms. Using > MemoryCachedRangeFilter, it took 876ms. Performance increase is less > dramatic when you have fewer unique terms in a field or fewer documents. > Currently MemoryCachedRangeFilter only works with numeric values (values are > stored in a long[] array) but it can be easily changed to support Strings. A > side "benefit" of the values being stored as longs is that there's no > longer the need to make the values lexicographically comparable, i.e. padding > numeric values with zeros. > The downside of using MemoryCachedRangeFilter is there's a fairly significant > memory requirement. So it's designed to be used in situations where range > filter performance is critical and memory consumption is not an issue. The > memory requirements are: (sizeof(int) + sizeof(long)) * numDocs. > MemoryCachedRangeFilter also requires a warmup step which can take a while to > run on large datasets (it took 40s to run on a 3M document corpus). Warmup > can be called explicitly or is automatically called the first time > MemoryCachedRangeFilter is applied using a given field. > So in summary, MemoryCachedRangeFilter can be useful when: > - Performance is critical > - Memory is not an issue > - Field contains many unique numeric values > - Index contains a large number of documents -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
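To make the mechanism concrete, here is a minimal, self-contained sketch of the technique the description outlines; this is illustrative code with invented names, not the attached patch:

{code}
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;

public class SortedValueRangeSketch {
  private final long[] values; // field values, ascending
  private final int[] docs;    // docs[i] is the doc holding values[i]

  // The warmup step: sort (docId, value) pairs by value.
  public SortedValueRangeSketch(final long[] valueByDoc) {
    final int n = valueByDoc.length;
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) order[i] = Integer.valueOf(i);
    Arrays.sort(order, new Comparator<Integer>() {
      public int compare(Integer a, Integer b) {
        long x = valueByDoc[a.intValue()], y = valueByDoc[b.intValue()];
        return x < y ? -1 : (x == y ? 0 : 1);
      }
    });
    values = new long[n];
    docs = new int[n];
    for (int i = 0; i < n; i++) {
      docs[i] = order[i].intValue();
      values[i] = valueByDoc[docs[i]];
    }
  }

  // bits(): binary search instead of a TermEnum scan.
  public BitSet bits(long lower, long upper) {
    BitSet result = new BitSet(docs.length);
    for (int i = firstAtLeast(lower);
         i < values.length && values[i] <= upper; i++) {
      result.set(docs[i]);
    }
    return result;
  }

  // First index whose value is >= target.
  private int firstAtLeast(long target) {
    int i = Arrays.binarySearch(values, target);
    if (i < 0) return -i - 1;
    while (i > 0 && values[i - 1] == target) i--; // leftmost duplicate
    return i;
  }
}
{code}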
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653421#action_12653421 ] robert engels commented on LUCENE-1473: --- Even if you changed SUIDs based on version changes, there is the very real possibility that the new code CAN'T be instantiated in any meaningful way from the old data. Then what would you do? Even if you had all of the old classes and their dependencies available via dynamic classloading, it still won't work UNLESS every new feature is designed with backwards compatibility with previous versions - a burden that is just too great when required of all Lucene code. Given that, as has been discussed, there are other formats that can be used where isolated backwards persistence is desired (like XML based query descriptions). Even these won't work if the XML description references explicit classes - which is why designing such a format for a near limitless query structure (given user defined query classes) is probably impossible. So strive for a decent solution that covers most cases, and fails gracefully when it can't work. Using standard serialization (with proper transient fields) seems to fit this bill: in a stable API, most core classes should remain fairly constant, and those that are bound to change may take explicit steps in their serialization (if deemed necessary). > Implement standard Serialization across Lucene versions > --- > > Key: LUCENE-1473 > URL: https://issues.apache.org/jira/browse/LUCENE-1473 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Minor > Attachments: LUCENE-1473.patch > > Original Estimate: 8h > Remaining Estimate: 8h > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
To put things in perspective, I believe Microsoft (who could potentially place a lot of resources towards Lucene) now uses Lucene through Powerset(?), and I don't think those folks are contributing back. I know of several other companies who do the same, and many potential contributions that are not submitted because people and their companies do not see the benefit of going through the hoops required to get patches committed. A relatively simple patch such as 1473 Serialization represents this well. For example, if a company is developing custom search algorithms, Lucene supports TF/IDF but not much else. Custom search algorithms require rewriting lots of Lucene code. Companies who write new search algorithms do not necessarily want to rewrite Lucene as well to make it pluggable for new scoring, as that is out of scope; they will simply branch the code. It does not help that the core APIs underneath IndexReader are protected and package-protected, which assumes a user who is not advanced. It is repeated on the mailing lists that new features will threaten the existing user base, which is based on opinion rather than fact. More advanced users are currently hindered by the conservatism of the project and so naturally have stopped trying to submit changes that alter the core non-public code. The rancor is from users who would benefit from a faster pace and the ability to be more creative inside the core Lucene system. As the internals change frequently and unannounced, the process of developing core patches is difficult and frustrating. Now that Lucene is stable and flexible indexing is being implemented, it would benefit the community to focus on the future. Who exactly is responsible for this? Which of the committers are building for the future? Which are doing bug fixes? What is the process of developing more advanced features in open source? Right now it seems to be one person, Michael McCandless, developing all of the new core code. This is great forward progress; however, it's unclear how others can get involved and not get stampeded by the constant changes that all happen via one brilliant person. I have asked people such as Michael Busch to collaborate on the column-stride fields and received no response. To me, a good example of volunteers is people who prepare food and donate their time at soup kitchens with no pay, and no hope of pay, related to feeding the hungry. -J On Wed, Dec 3, 2008 at 2:52 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > On Dec 3, 2008, at 2:27 PM, Jason Rutherglen (JIRA) wrote: > > >> >> Hoss wrote: "sort of mythical "Lucene powerhouse" >> Lucene seems to run itself quite differently than other open source Java >> projects. Perhaps it would be good to spell out the reasons for the >> reluctance to move ahead with features that developers work on, that work, >> but do not go in. The developer contributions seem to be quite low right >> now, especially compared to neighbor projects such as Hadoop. Is this >> because fewer people are using Lucene? Or is it due to the reluctance to >> work with the developer community? Unfortunately the perception in the eyes >> of some people who work on search related projects it is the latter. >> > > > Or, could it be that Hadoop is relatively new and in vogue at the moment, > very malleable and buggy(?) 
and has a HUGE corporate sponsor who dedicates > lots of resources to it on a full time basis, whilst Lucene has been around > in the ASF for 7+ years (and 12+ years total) and has a really large install > base and thus must move more deliberately and basically has 1 person who > gets to work on it full time while the rest of us pretty much volunteer? > That's not an excuse, it's just the way it is. I personally, would love to > work on Lucene all day every day as I have a lot of things I'd love to > engage the community on, but the fact is I'm not paid to do that, so I give > what I can when I can. I know most of the other committers are that way > too. > > Thus, I don't think any one of us has a reluctance to move ahead with > features or bug fixes. Looking at CHANGES.txt, I see a lot of > contributors. Looking at java-dev and JIRA, I see lots of engagement with > the community. Is it near the historical high for traffic, no it's not, but > that isn't necessarily a bad thing. I think it's a sign that Lucene is > pretty stable. > > What we do have a reluctance for are patches that don't have tests (i.e. > this one), patches that massively change Lucene APIs in non-trivial ways or > break back compatibility or are not kept up to date. Are we perfect? Of > course not. I, personally, would love for there to be a way that helps us > process a larger volume of patches (note, I didn't say commit a larger > volume). Hadoop's automated patch tester would be a huge start in that, but > at the end of the day, Lucene still works the way all ASF projects do: via > meritocracy and volunteerism. You want stuff com
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
Jason Rutherglen wrote: A relatively simple patch such as 1473 Serialization represents this well. LUCENE-1473 is an incomplete patch that proposes to commit the project to new back-compatibility requirements. Compatibility requirements should not be added lightly, but only deliberately, as they have a long-term impact on the ability of the project to evolve. Prior to this we've not heard from folks who require cross-version java serialization compatibility. Without more folks asserting this as a need it is hard to rationalize adding this. As the internals change frequently and unannounced, the process of developing core patches is difficult and frustrating. The process is entirely in public. You have as much announcement as anyone. Patches are weighed on their merits as they are contributed. It would benefit the community to focus on the future. Who exactly is responsible for this? Which of the committers are building for the future? Which are doing bug fixes? What is the process of developing more advanced features in open source? I've already explained the process several times. We cannot easily make a long-term plan when we do not have the power to assign folks. We can state long-term goals, like flexible indexing, but in the end, it won't get done until someone volunteers to write the code. So you're welcome to start a wish list on the wiki, and you're welcome to then start contributing patches that implement items on your wish list. If you propose something that folks think is extremely useful, but requires an incompatible change, then it could perhaps be done in a branch. But most of the existing community is interested in pushing forward incrementally, trying hard to keep most things back-compatible. If that's too frustrating for you, you can fork Lucene and build a new community. Right now it seems to be one person, Michael McCandless, developing all of the new core code. Mike does a lot of development, but he also commits a lot of patches written by others. This is great forward progress; however, it's unclear how others can get involved and not get stampeded by the constant changes that all happen via one brilliant person. You want Mike to do less? Others can and do get involved all the time. Look at http://tinyurl.com/5nl78n. The majority of the things Mike works on are instigated by others. I have asked people such as Michael Busch to collaborate on the column-stride fields and received no response. Did you pay Michael? No one here is compelled to work with anyone else. We work with others when we feel it is in our mutual self interest. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653450#action_12653450 ] Andy Liu commented on LUCENE-855: - Yes, it looks the same. Glad this will finally make it to the source! > MemoryCachedRangeFilter to boost performance of Range queries > - > > Key: LUCENE-855 > URL: https://issues.apache.org/jira/browse/LUCENE-855 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.1 >Reporter: Andy Liu > Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch, > MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, > TestRangeFilterPerformanceComparison.java, > TestRangeFilterPerformanceComparison.java > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
jira attachments ?
I am having a problem posting an attachment to Jira. Just spins, and spins... Everything else seems to work fine (comments, etc.). Anyone else experiencing this? Thanks. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
I can't seem to post to Jira, so I am attaching here... I attached QueryFilter.java. In reading this patch, and other similar ones, the problem seems to be that if the index is modified, the cache is invalidated, causing a complete reload of the cache. Do I have this correct? The attached patch works really well in a highly interactive environment, as the cache is only invalidated at the segment level. The MyMultiReader is a subclass that allows access to the underlying SegmentReaders. The patch cannot be applied as-is, but I think the implementation works far better in many cases - it is also far less memory intensive. Scanning the bitset could also be optimized very easily using internal skip values. Maybe this is completely off-base, but the solution has worked very well for us. Maybe this is a completely different issue and a separate incident should be opened? Is there any interest in this? QueryFilter.java Description: Binary data On Dec 4, 2008, at 2:10 PM, Andy Liu (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653450#action_12653450 ] Andy Liu commented on LUCENE-855: - Yes, it looks the same. Glad this will finally make it to the source!
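A hedged sketch of the per-segment idea described in this mail, not the attached QueryFilter.java: cache one BitSet per segment reader, keyed weakly, so that reopening the index only recomputes bits for segments that actually changed. It assumes the caller can reach the underlying SegmentReaders (the role played by MyMultiReader above):

{code}
import java.io.IOException;
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;

import org.apache.lucene.index.IndexReader;

public abstract class PerSegmentCachingFilter {
  // Weak keys: entries vanish when a segment reader is closed and collected.
  private final Map<IndexReader, BitSet> cache =
      new WeakHashMap<IndexReader, BitSet>();

  /** The expensive part: compute filter bits for one segment. */
  protected abstract BitSet computeBits(IndexReader segmentReader)
      throws IOException;

  /** Cached bits for an unchanged segment; computed once otherwise. */
  public synchronized BitSet bits(IndexReader segmentReader)
      throws IOException {
    BitSet bits = cache.get(segmentReader);
    if (bits == null) {
      bits = computeBits(segmentReader);
      cache.put(segmentReader, bits);
    }
    return bits;
  }
}
{code}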
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
Correction: Powerset apparently did not use Lucene. And apparently there are a few other companies who, while not open sourcing their code, use Lucene serialization regularly. > Did you pay Michael? No one here is compelled to work with anyone else. We work with others when we feel it is in our mutual self interest. Nice... I guess our government is the macrocosm.
[jira] Updated: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results
[ https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1478: -- Attachment: LUCENE-1478-no-superinterface.patch Attached is a patch that implements the first variant (without a super-interface for all FieldParsers). All current tests pass. A special test case for a custom field parser was not implemented. For testing, I modified one of my contrib TrieRangeQuery test cases locally to sort using a custom LongParser that decodes the encoded longs in the cache [parseLong(value) returns TrieUtils.trieCodedToLong(value)]. A good test case would be to store some dates in ISO format in a field and then sort them as longs after parsing with SimpleDateFormat. This would be another typical use case (sorting by date, but not using SortField.STRING, to minimize memory usage). If you like my patch, we could also discuss using a super-interface for all Parsers. The modifications are rather simple (only the SortField constructor and some casts would be affected, plus of course the super-interface in all declarations inside FieldCache and ExtendedFieldCache). > Missing possibility to supply custom FieldParser when sorting search results > > > Key: LUCENE-1478 > URL: https://issues.apache.org/jira/browse/LUCENE-1478 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.4 >Reporter: Uwe Schindler > Attachments: LUCENE-1478-no-superinterface.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
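A hedged sketch of that suggested test-case parser (the class name is invented; it assumes the parser-aware SortField constructor from the attached patch):

{code}
import java.text.ParseException;
import java.text.SimpleDateFormat;

import org.apache.lucene.search.ExtendedFieldCache;

// Turns ISO-formatted date strings into sortable longs (ms since epoch).
public class IsoDateLongParser implements ExtendedFieldCache.LongParser {
  public long parseLong(String value) {
    try {
      // A new SimpleDateFormat per call, since the class is not thread-safe.
      return new SimpleDateFormat("yyyy-MM-dd").parse(value).getTime();
    } catch (ParseException e) {
      throw new RuntimeException("unparsable date: " + value, e);
    }
  }
}
{code}

Usage under the proposed constructor would then be something like new SortField("date", SortField.LONG, new IsoDateLongParser(), false).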
RE: jira attachments ?
Hi Robert, two minutes ago I uploaded a patch... Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: [EMAIL PROTECTED] > From: robert engels [mailto:[EMAIL PROTECTED] > Sent: Thursday, December 04, 2008 9:37 PM > To: java-dev@lucene.apache.org > Subject: jira attachments ? > > I am having a problem posting an attachment to Jira. Just spins, and > spins... > > Everything else seems to work fine (comments, etc.). > > Anyone else experiencing this? > > Thanks. > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
[ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653500#action_12653500 ] Robert Muir commented on LUCENE-1390: - its a bit slower, but the difference is minor. i just ran some tests with some cpu-bound indexes that i build (these filters are right at the top of hprof.txt). i ran em a couple times and it looks like this... not very scientific but it gives an idea. ASCIIFolding filter index time (ms): 143365 ISOLatin1Accent filter index time (ms): 134649 > add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter > > > Key: LUCENE-1390 > URL: https://issues.apache.org/jira/browse/LUCENE-1390 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Environment: any >Reporter: Andi Vajda >Assignee: Mark Miller >Priority: Minor > Fix For: 2.9 > > Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, > ASCIIFoldingFilter.patch > > > The ISOLatin1AccentFilter is removing accents from accented characters in the > ISO Latin 1 character set. > It does what it does and there is no bug with it. > It would be nicer, though, if there was a more comprehensive version of this > code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 > and Latin Extended A unicode blocks. > See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block > See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block > That way, all languages using roman characters are covered. > A new class, ISOLatinAccentFilter is attached. It is intended to supersede > ISOLatin1AccentFilter which should get deprecated. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: jira attachments ?
Dear God, I've been blocked! What will the Lucene community do! :) On Dec 4, 2008, at 3:27 PM, Uwe Schindler wrote: Hi Robert, two minutes ago I uploaded a patch... Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: [EMAIL PROTECTED] From: robert engels [mailto:[EMAIL PROTECTED] Sent: Thursday, December 04, 2008 9:37 PM To: java-dev@lucene.apache.org Subject: jira attachments ? I am having a problem posting an attachment to Jira. Just spins, and spins... Everything else seems to work fine (comments, etc.). Anyone else experiencing this? Thanks. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
I keep looking at LUCENE-831, which is a new version of FieldCache that is compatible with IndexReader.reopen() and invalidates only reloaded segments. With each release of Lucene I am very unhappy, because it is still not in. The same problem as yours occurs if you have a one-million-document index that is updated by adding a few documents each half hour. If you sort by a field, whenever the index is reopened (even though really only a very small segment was added) the complete FieldCache is rebuilt, which is very bad :(. So I think the ultimate fix would be to hopefully apply LUCENE-831 soon and also use LUCENE-1461 as a RangeFilter cache. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: [EMAIL PROTECTED] From: robert engels [mailto:[EMAIL PROTECTED] Sent: Thursday, December 04, 2008 9:39 PM To: java-dev@lucene.apache.org Subject: Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries I can't seem to post to Jira, so I am attaching here... I attached QueryFilter.java. In reading this patch, and other similar ones, the problem seems to be that if the index is modified, the cache is invalidated, causing a complete reload of the cache. Do I have this correct? The attached patch works really well in a highly interactive environment, as the cache is only invalidated at the segment level. The MyMultiReader is a subclass that allows access to the underlying SegmentReaders. The patch cannot be applied, but I think the implementation works far better in many cases - it is also far less memory intensive. Scanning the bitset could also be optimized very easily using internal skip values. Maybe this is completely off-base, but the solution has worked very well for us. Maybe this is a completely different issue and separate incident should be opened ? is there any interest in this? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
Lucene-831 is far more comprehensive. I also think that by exposing access to the sub-readers it can be far simpler (closer to what I have provided).

In the mean-time, you should be able to use the provided class with a few modifications. The "reload the entire cache" was a deal breaker for us, so I came up with the attached. Works very well.

On Dec 4, 2008, at 3:54 PM, Uwe Schindler wrote:

I am looking all the time at LUCENE-831, which is a new version of FieldCache that is compatible with IndexReader.reopen() and invalidates only reloaded segments. [...]
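The same segment-keyed pattern applies to filter bits as well as field values. A hypothetical sketch of the central idea of the attached QueryFilter (the attachment itself is not reproduced here, so this class is illustrative; it uses the BitSet-returning Filter.bits() API of that era, and callers are expected to pass each underlying SegmentReader rather than the top-level reader):

{code}
import java.io.IOException;
import java.util.BitSet;
import java.util.WeakHashMap;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

// Hypothetical segment-level filter cache: bits are cached per segment
// reader, so a reopen only recomputes bits for new/changed segments.
public class PerSegmentCachingFilter extends Filter {
  private final Filter filter;
  private final WeakHashMap cache = new WeakHashMap(); // segment reader -> BitSet

  public PerSegmentCachingFilter(Filter filter) {
    this.filter = filter;
  }

  public synchronized BitSet bits(IndexReader segment) throws IOException {
    BitSet bits = (BitSet) cache.get(segment);
    if (bits == null) {
      bits = filter.bits(segment);
      cache.put(segment, bits);
    }
    return bits;
  }
}
{code}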
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
The biggest benefit I see of using the field cache to do filter caching is that the same cache can be used for sorting, thereby improving the performance and memory usage.

The downside I see is that if you have a common filter that is built from many fields, you are going to use a lot more memory, as every field used needs to be cached. With my code you would only have a single "bitset" for the filter.

On Dec 4, 2008, at 4:00 PM, robert engels wrote:

Lucene-831 is far more comprehensive. [...]
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
On Thursday 04 December 2008 23:03:40, robert engels wrote:
> The biggest benefit I see of using the field cache to do filter
> caching is that the same cache can be used for sorting, thereby
> improving the performance and memory usage.

Would it be possible to build such Filter caching into CachingWrapperFilter instead of into QueryFilter? Both filter caching and the field value caching will need access to the underlying (segment?) readers.

> The downside I see is that if you have a common filter that is built
> from many fields, you are going to use a lot more memory, as every
> field used needs to be cached. With my code you would only have a
> single "bitset" for the filter.

But with many ranges that would mean many bitsets, and MemoryCachedRangeFilter only needs to cache the field values once for any number of ranges. It's a tradeoff.

Regards,
Paul Elschot

> On Dec 4, 2008, at 4:00 PM, robert engels wrote:
> > Lucene-831 is far more comprehensive. [...]
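A rough back-of-the-envelope on that tradeoff, using the memory figures from the LUCENE-855 description: with numDocs = 1,000,000, each cached range costs one bitset of numDocs/8 ≈ 122 KB, while MemoryCachedRangeFilter's value cache costs (sizeof(int) + sizeof(long)) * numDocs ≈ 11.4 MB once, shared by every range on that field. So the per-field value cache breaks even at roughly a hundred distinct cached ranges on the same field; below that, per-range bitsets are cheaper, and above it the shared values win.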
[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream
[ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653513#action_12653513 ] Michael Busch commented on LUCENE-1448:
---

{quote}
Another option is to "define" the API such that when incrementToken() returns false, then it has actually advanced to an "end-of-stream token". OffsetAttribute.getEndOffset() should return the final offset. Since we have not released the new API, we could simply make this change (and fix all instances in the core/contrib that use the new API accordingly). I think I like this option best.
{quote}

This adds some "cleaning up" responsibilities to all existing TokenFilters out there. So far it is very straightforward to change an existing TokenFilter to use the new API. You simply have to:
- add the attributes the filter needs in its constructor
- change next() to incrementToken(), and change return calls that returned null to false, others to true (or whatever input returns)
- don't access a Token, but the appropriate attributes, to set the data

(A sketch of a filter converted along these lines follows this message.)

But maybe there's a custom filter at the end of the chain that returns more tokens even after its input returned the last one. For example, a SynonymExpansionFilter might return a synonym for the last word it received from its input before it returns false. In this case it might overwrite the endOffset that another filter/stream already set to the final endOffset. It needs to cache that value and set it when it returns false. Also, all filters that currently use an offset now need to know to clean up before returning false.

I'm not saying this is necessarily bad. I also find this approach tempting, because it's simple. But it might be a common pitfall for bugs?

What I'd like to work on soon is an efficient way to buffer attributes (maybe add methods to Attribute that write into a byte buffer). Then attributes can implement which variables need to be serialized and which ones don't. In that case we could add a finalOffset to OffsetAttribute that does not get serialized/deserialized.

And possibly it might be worthwhile to have explicit states defined in a TokenStream that we can enforce with three methods: start(), increment(), end(). Then people would know that if they have to do something at the end of a stream, they have to do it in end().

> add getFinalOffset() to TokenStream
> ---
>
> Key: LUCENE-1448
> URL: https://issues.apache.org/jira/browse/LUCENE-1448
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic. For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and the next field's offsets are then all 3 too small. Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream. I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fall back to the faulty logic we have today.
> This has come up several times on the user's list.
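The sketch referenced above: a trivial TokenFilter converted from the old next() API, following the three steps Michael lists. This assumes the new attribute-based TokenStream API on trunk at the time (TermAttribute, addAttribute()); the filter itself is made up for illustration:

{code}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Illustrative filter on the new API: attributes are acquired in the
// constructor, and incrementToken() returns false where next() used to
// return null.
public class LowerCaseishFilter extends TokenFilter {
  private final TermAttribute termAtt;

  public LowerCaseishFilter(TokenStream input) {
    super(input);
    // step 1: add the attributes the filter needs in its constructor
    termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    // step 2: return what input returns instead of null/Token
    if (!input.incrementToken()) {
      return false;
    }
    // step 3: work on attributes instead of a Token instance
    final char[] buffer = termAtt.termBuffer();
    final int length = termAtt.termLength();
    for (int i = 0; i < length; i++) {
      buffer[i] = Character.toLowerCase(buffer[i]);
    }
    return true;
  }
}
{code}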
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
It would be cool to be able to explicitly list subreaders that were added/removed as a result of reopen(), or have some kind of notification mechanism. We have filter caches and custom field/sort caches here, and they are all reader-bound. Currently the warm-up delay is negated by reopening and warming up in the background before switching to the new reader/caches, but it still limits our minimum between-reopens delay.

On Fri, Dec 5, 2008 at 01:03, robert engels <[EMAIL PROTECTED]> wrote:
> The biggest benefit I see of using the field cache to do filter caching is
> that the same cache can be used for sorting, thereby improving the
> performance and memory usage. [...]

--
Kirill Zakharenko/Кирилл Захаренко ([EMAIL PROTECTED])
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
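For illustration, the kind of diff such a notification could hand to reader-bound caches, assuming some accessor exposes the sub-readers before and after reopen() (no such public accessor existed at the time; ReopenDiff and its array arguments are hypothetical):

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.index.IndexReader;

// Hypothetical reopen diff: caches keyed on removed segments can be
// dropped, and only the added segments need to be warmed.
public class ReopenDiff {
  public final List added = new ArrayList();
  public final List removed = new ArrayList();

  public ReopenDiff(IndexReader[] before, IndexReader[] after) {
    Set old = new HashSet(Arrays.asList(before));
    Set current = new HashSet(Arrays.asList(after));
    for (int i = 0; i < after.length; i++) {
      if (!old.contains(after[i])) added.add(after[i]);
    }
    for (int i = 0; i < before.length; i++) {
      if (!current.contains(before[i])) removed.add(before[i]);
    }
  }
}
{code}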
[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
[ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653520#action_12653520 ] Robert Muir commented on LUCENE-1390:
-

Sorry, that wasn't a fair test case: a good chunk of those docs contain accents outside of Latin-1, so ASCIIFoldingFilter was doing more work.

I reran on some heavily accented (but Latin-1-only) data and the difference was negligible, 1% or so.

It appears ASCIIFoldingFilter only slows you down versus ISOLatin1AccentFilter in the case where it probably should: when you have accents outside of Latin-1 but are using the Latin-1 filter.
[jira] Updated: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results
[ https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1478:
--
Lucene Fields: [New, Patch Available] (was: [New])

> Missing possibility to supply custom FieldParser when sorting search results
> ----------------------------------------------------------------------------
>
> Key: LUCENE-1478
> URL: https://issues.apache.org/jira/browse/LUCENE-1478
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.4
> Reporter: Uwe Schindler
> Attachments: LUCENE-1478-no-superinterface.patch
>
> When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was confronted with the problem that the special trie-encoded values (which are longs in a special encoding) cannot be sorted by Searcher.search() and SortField. The problem is: if you use SortField.LONG, you get NumberFormatExceptions. The trie-encoded values may be sorted using SortField.STRING (as the encoding is such that they are sortable as Strings), but this is very memory-inefficient.
> ExtendedFieldCache gives the possibility to specify a custom LongParser when retrieving the cached values. But you cannot use this during searching, because there is no possibility to supply this custom LongParser to the SortField.
> I propose a change in the sort classes: include a pointer to the parser instance to be used in SortField (if not given, use the default). My idea is to create a SortField using a new constructor
> {code}SortField(String field, int type, Object parser, boolean reverse){code}
> The parser is "Object" because all current parsers have no super-interface. The ideal solution would be to have:
> {code}SortField(String field, int type, FieldCache.Parser parser, boolean reverse){code}
> and FieldCache.Parser is a super-interface (just empty, more like a marker interface) of all other parsers (like LongParser...). The sort implementation then must be changed to respect the given parser (if not null), else use the default FieldCache.get without parser.
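A sketch of how the proposed constructor would be used to sort trie-encoded longs numerically. Everything here is hypothetical until the patch lands: the four-argument SortField constructor is the proposal itself, and trieCodedToLongAuto() stands in for TrieUtils' decoder in contrib:

{code}
import org.apache.lucene.search.ExtendedFieldCache;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.trie.TrieUtils; // contrib/queries (assumed package)

public class TrieSortExample {
  // Proposed usage: hand the custom LongParser to the SortField so the
  // field cache decodes trie-encoded terms instead of parsing them as
  // plain numeric strings (which throws NumberFormatException).
  public static Sort trieLongSort(String field) {
    ExtendedFieldCache.LongParser parser = new ExtendedFieldCache.LongParser() {
      public long parseLong(String value) {
        return TrieUtils.trieCodedToLongAuto(value);
      }
    };
    return new Sort(new SortField(field, SortField.LONG, parser, false));
  }
}
{code}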
[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
[ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653539#action_12653539 ] Mark Miller commented on LUCENE-1390:
-

Thanks Robert. I plan to commit this in a few days with the deprecation of the latin1 filter for removal in 3.0.
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653544#action_12653544 ] Uwe Schindler commented on LUCENE-1470:
---

Hi Mike,

I opened issue LUCENE-1478 and attached a first patch.

About the current issue: I have seen that TrieRangeQuery is missing in /lucene/java/trunk/contrib/queries/README.txt. Can you add it there, or should I write a small patch? I think it should at least be mentioned there with a note on what it is for, but the JavaDocs are much more informative, and the corresponding paper / code credits are cited there.

Thank you very much for helping to get this into Lucene!
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
On Dec 4, 2008, at 2:21 PM, Jason Rutherglen wrote:

> To put things in perspective, I believe Microsoft (who could potentially place a lot of resources towards Lucene) now uses Lucene through Powerset? and I don't think those folks are contributing back. I know of several other companies who do the same, and many potential contributions that are not submitted because people and their companies do not see the benefit of going through the hoops required to get patches committed. A relatively simple patch such as 1473 Serialization represents this well.

What do you suggest? We didn't force anyone to use Lucene. Heck, most of our users don't even ever participate on the mailing list.

We do provide a very clear, transparent path for making contributions and becoming a committer. I don't know what else we can do, but we're totally open to suggestions on how to improve it.

FWIW, just b/c you think 1473 is trivial doesn't make it so. You have a single use case and that's all you care about. The community has dozens, if not hundreds, of use cases, and your "trivial" patch may not be so trivial in that regard. How would you feel if we "broke" something that you have relied on for years in the name of us moving faster? I am willing to bet the large number of people here in Lucene appreciate our deliberations for the most part. As for my opinion on 1473, I personally think there are better ways of achieving what you are trying to do, as Robert and others have suggested, and I don't think it is worth it to maintain serialization across versions, as it is too large of a burden, IMO. But, heh, make an argument (preferably w/o the accusations) and convince me otherwise.

> For example if a company is developing custom search algorithms, Lucene supports TF/IDF but not much else. Custom search algorithms require rewriting lots of Lucene code. Companies who write new search algorithms do not necessarily want to rewrite Lucene as well to make it pluggable for new scoring, as it is out of scope; they will simply branch the code. It does not help that the core APIs underneath IndexReader are protected and package protected, which assumes a user that is not advanced. It is repeated on the mailing lists that new features will threaten the existing user base, which is based on opinion rather than fact. More advanced users are currently hindered by the conservatism of the project and so naturally have stopped trying to submit changes that alter the core non-public code.

So, you're mad at us for others not contributing back their forks? Even the ones we don't know about? Simply put, I'm sorry we can't please you. If you go read the archives, you will see plenty of times when even us committers have been frustrated from time to time by the process (just look at the JDK 1.5 debate, or the Interface/Abstract debate), but in the end, I feel Lucene is stronger for it. Community over code, it's the Apache Way. You are free to disagree. In fact, you have several options available to you to show that disagreement:

1. You can work to become a committer and change it from within. The bar really isn't that high: 3 to 4 non-trivial patches and a willingness to work with others in a mostly pleasant way.
2. You can make us aware of the patches and be persistent about seeing them through, and we'll try to get to them. Just look at CHANGES.txt and JIRA and you will see that this happens all the time and from a wide variety of contributors (including both you and John).
3. You can fork the code and go do your thing and build your own community, etc.

Personally, I hope you choose 1 or 2, as we're all stronger together than we are apart.

> The rancor is from users who would benefit from a faster pace and the ability to be more creative inside the core Lucene system. As the internals change frequently and unannounced, the process of developing core patches is difficult and frustrating.

I'm sorry that we can't work at a faster pace. Suggestions on how to deal with the number of patches we have and still maintain quality, and how to move forward w/o breaking old patches, are much appreciated. As for the internals changing, you have just hit the nail on the head as to why it is so important to maintain back-compat. I simply don't get the unannounced part. What isn't announced? Geez, I've been a committer for a few years now, and I have yet to see another open source project that is as public as Lucene, for better or worse. Look at the archives, we regularly even put our warts out for public consumption in an effort to improve ourselves.

Rather than continue hijacking this thread, why don't we either let it die and focus on serialization, or we go over to java-dev and you and John and the rest of us can create a concrete list of suggestions that we think could make Lucene better and we can all
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653545#action_12653545 ] John Wang commented on LUCENE-1473:
---

The discussion here is whether it is better to have 100% of the time failing vs. 10% of the time failing (these are just meaningless numbers to express a point).

I do buy Doug's comment about getting into a weird state due to data serialization, but this is something Externalizable would solve.

This discussion has digressed to general Java serialization design, where it was originally scoped to only several Lucene classes. If it is documented that Lucene only supports serialization of classes from the same jar, is that really enough? Doesn't it also depend on the compiler, if someone were to build their own jar? Furthermore, in a distributed environment with lots of machines, it is always ideal to upgrade bit by bit. Is taking this functionality away by imposing this restriction a good trade-off versus just implementing Externalizable for a few classes, if Serializable is deemed to be dangerous (which I am not so sure about, given the Lucene classes we are talking about)?

> Implement standard Serialization across Lucene versions
> ---
>
> Key: LUCENE-1473
> URL: https://issues.apache.org/jira/browse/LUCENE-1473
> Project: Lucene - Java
> Issue Type: Bug
> Components: Search
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1473.patch
>
> Original Estimate: 8h
> Remaining Estimate: 8h
>
> To maintain serialization compatibility between Lucene versions, serialVersionUID needs to be added to classes that implement java.io.Serializable. java.io.Externalizable may be implemented in classes for faster performance.
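For reference, pinning a serialVersionUID is a one-line change per class; it decouples stream compatibility from the compiler's view of the class, which is exactly the "depends on the compiler" concern above. A minimal sketch (ExampleQuery is a made-up stand-in, not a Lucene class):

{code}
import java.io.Serializable;

// Without an explicit serialVersionUID, the JVM derives one from the
// compiled class structure, so different compilers building identical
// source can still refuse to deserialize each other's streams.
public class ExampleQuery implements Serializable {
  private static final long serialVersionUID = 1L; // change only on incompatible changes

  private String field;
  private String text;
}
{code}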
[jira] Updated: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-831:
---
Attachment: LUCENE-831.patch

Updated to trunk. I've combined all of the dual (primitive array/ObjectArray) CacheKeys into one. Each cache key can support both modes or throw UnsupportedException or something.

I've also tried something a bit experimental to allow users to eventually use custom or alternate cache keys (payload or sparse arrays or something) that work with internal sorting. A cache implementation can now supply a ComparatorFactory (name will prob be tweaked) that handles creating comparators. You can subclass ComparatorFactory and add new or override currently supported CacheKeys. CustomComparators still need to be twiddled with some.

I've converted some of the sort tests to run with both primitive and object arrays as well.

- Mark

> Complete overhaul of FieldCache API/Implementation
> --
>
> Key: LUCENE-831
> URL: https://issues.apache.org/jira/browse/LUCENE-831
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Hoss Man
> Fix For: 3.0
>
> Attachments: fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
>
> Motivation:
> 1) Complete overhaul of the API/implementation of "FieldCache" type things...
> a) eliminate the global static map keyed on IndexReader (thus eliminating the synch block between completely independent IndexReaders)
> b) allow more customization of cache management (ie: use expiration/replacement strategies, disk backed caches, etc)
> c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader)
> d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed
> e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders)
> 2) Provide backwards compatibility to support the existing FieldCache API with the new implementation, so there is no redundant caching as client code migrates to the new API.
Re: jira attachments ?
Robert, which browser are you using?

Mike

robert engels wrote:
> Dear God, I've been blocked ! What will the Lucene community do ! :) [...]
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653553#action_12653553 ] Doug Cutting commented on LUCENE-1473:
--

> This discussion has digressed to general Java serialization design, where it was originally scoped to only several Lucene classes.

Which classes? The existing patch applies to one class. Jason said, "If it looks ok, I will implement Externalizable in other classes." but never said which. It would be good to know how wide the impact of the proposed change would be.
Re: jira attachments ?
I am using Safari 3.2 (on OSX Tiger).

On Dec 4, 2008, at 5:38 PM, Michael McCandless wrote:
> Robert, which browser are you using? [...]
Re: jira attachments ?
Hmmm the only time I've seen this was also with Safari (though on an older version). It caused me to switch [back] to Firefox. Try Firefox?

Mike

robert engels wrote:
> I am using Safari 3.2 (on OSX Tiger). [...]
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
On Dec 4, 2008, at 4:10 PM, Paul Elschot wrote:
> Would it be possible to build such Filter caching into CachingWrapperFilter
> instead of into QueryFilter? Both filter caching and the field value caching
> will need access to the underlying (segment?) readers.

I don't see why not. The QueryFilter extends from that... We are just on a much older code base. Not really sure why this hierarchy exists though, as the only extenders are QueryFilter and CachingWrapperFilterHelper. I would prefer QueryFilter, and then extend that as CachingQueryFilter. I've always been taught that if you see the words Wrapper or Helper, there is probably a design problem, or at least a naming problem.

> But with many ranges that would mean many bitsets, and
> MemoryCachedRangeFilter only needs to cache the field values once for any
> number of ranges. It's a tradeoff.

That was my point. I don't see the field-based caching and the filter-based caching as solving quite the same problem. It is going to depend on the actual usage; that is why I would like to support both.

> Regards,
> Paul Elschot [...]
Re: jira attachments ?
Could be... I will try next time... Seems a strange (and serious) bug in Jira (I have no problems with other "add attachment" sites)...

On Dec 4, 2008, at 5:59 PM, Michael McCandless wrote:
> Hmmm the only time I've seen this was also with Safari (though on an older
> version). It caused me to switch [back] to Firefox. Try Firefox? [...]
[jira] Updated: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-831:
---
Attachment: LUCENE-831.patch

Couple of needed tweaks and a test for a custom ComparatorFactory.
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
Hi Grant:

I agree, and I apologize for hijacking this thread. If Luceners feel our criticisms are invalid, then so be it. We should focus on this issue, being the serialization story in Lucene, not general Java serialization, so I don't see how it would benefit to move this to the java-dev list.

As far as Lucene serialization goes, incorporating comments from various people, this is what I gather are the choices (feel free to correct me):

1) Remove implementation and support of Serializable: we all agreed this is bad and breaks backward compatibility.

2) Do nothing to the code base, fix the documentation, and clarify that Lucene only supports serialization between components using the same release jar. This seems to be the suggested approach, where I have a couple of concerns:

a) Given the exact same code base, due to the nature of Java serialization, different builds of the jar via the IBM VM vs. the Sun VM vs. JRockit etc. cannot guarantee compatibility. Thus we are forcing users that care about serialization to use the release jar.

b) There is at least one place, as I have previously mentioned, e.g. ScoreDocComparator, where the contract returns a Comparable that, per the javadoc, must be Serializable. How should this be treated? This can be an application object; should we impose the same restriction there, since when merging/sorting happens across the wire a similar serialization problem would break inside MultiSearcher?

3) Clean up the serialization story, either adding SUIDs or implementing Externalizable for the classes within Lucene that implement Serializable: from what I am told, this is too much work for the committers.

I hope you guys at least agree with me that, the way it is currently, the serialization story is broken, whether in documentation or in code. I see the disagreement being about its severity, and whether it is a trivial fix, which I have learned it is not really my place to say.

Please do understand this is not a far-fetched, made-up use case; we are running into this in production, and we are developing in accordance with the Lucene documentation.

Thanks

-John

On Thu, Dec 4, 2008 at 3:23 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> On Dec 4, 2008, at 2:21 PM, Jason Rutherglen wrote:
>> To put things in perspective, I believe Microsoft (who could potentially
>> place a lot of resources towards Lucene) now uses Lucene through Powerset?
>> and I don't think those folks are contributing back. [...]
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653563#action_12653563 ] John Wang commented on LUCENE-1473:
---

For our problem, it is Query and all its derived and encapsulated classes. I guess the title of the bug is too generic.

As far as my comment about other Lucene classes, one can just go to the Lucene javadoc, click on "Tree", and look for Serializable. If you want me to, I can go and fetch the complete list, but here are some examples:

1) Document (Field etc.)
2) OpenBitSet, Filter ...
3) Sort, SortField
4) Term
5) TopDocs, Hits etc.

for the top-level API.
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653569#action_12653569 ] Michael McCandless commented on LUCENE-1470:
----

bq. Thank you very much for helping to get this into Lucene!

You're welcome! But, that was the easy part ;) Thank you for creating it & getting it into Lucene!

bq. About the current issue: I have seen that TrieRangeQuery is missing in /lucene/java/trunk/contrib/queries/README.txt.

I agree -- can you create a patch? Thanks.
[jira] Updated: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1473:
-------------------------------------

    Attachment: LUCENE-1473.patch

LUCENE-1473.patch

serialVersionUID added to the relevant classes manually. Defaulted to 10 because the value does not matter, as long as it is different between versions. I thought of writing some code that goes through the Lucene JAR, does an instanceof check on the classes for Serializable, and then verifies that the serialVersionUID is 10.

Term implements Externalizable. SerializationUtils was adapted from Hadoop's WritableUtils for writing VLongs. The TestSerialization use case exercises term serialization, and also serializes an arbitrary query to a file and compares the results.

TODO:
- Implement Externalizable
- More unit tests? How does one write a unit test for multiple versions?

> Implement standard Serialization across Lucene versions
> --------------------------------------------------------
>
>          Key: LUCENE-1473
>          URL: https://issues.apache.org/jira/browse/LUCENE-1473
>      Project: Lucene - Java
>   Issue Type: Bug
>   Components: Search
> Affects Versions: 2.4
>     Reporter: Jason Rutherglen
>     Priority: Minor
>  Attachments: LUCENE-1473.patch, LUCENE-1473.patch
>
> Original Estimate: 8h
> Remaining Estimate: 8h
>
> To maintain serialization compatibility between Lucene versions,
> serialVersionUID needs to be added to classes that implement
> java.io.Serializable. java.io.Externalizable may be implemented in classes
> for faster performance.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
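For readers unfamiliar with VLongs: variable-length encoding writes small values in one or two bytes instead of a fixed eight. The sketch below shows the generic 7-bits-per-byte varint technique only; Hadoop's WritableUtils actually uses a different, length-prefixed layout, and the patch's SerializationUtils code may differ from both.

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public final class VLongSketch {

  // 7 payload bits per byte, least significant group first; the high bit
  // of each byte flags that another byte follows.
  public static void writeVLong(DataOutput out, long value) throws IOException {
    while ((value & ~0x7FL) != 0) {
      out.writeByte((int) ((value & 0x7F) | 0x80)); // more bytes follow
      value >>>= 7;
    }
    out.writeByte((int) value);                     // final byte, high bit clear
  }

  public static long readVLong(DataInput in) throws IOException {
    long value = 0;
    int shift = 0;
    byte b;
    do {
      b = in.readByte();
      value |= (long) (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    writeVLong(new DataOutputStream(bytes), 300L);  // encodes to 2 bytes
    long back = readVLong(
        new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    System.out.println(back);                       // prints 300
  }
}
{code}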
Hudson build is back to normal: Lucene-trunk #666
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/666/changes
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653635#action_12653635 ]

Matt Ericson commented on LUCENE-855:
-------------------------------------

Looks similar to what I wrote, but it uses more data structures. I like the one I built because it has direct access to the FieldCache and needs no other data structures; once you load the data into the FieldCache, you can run any other search on that field without rebuilding anything and just re-use the data.

I think all 3 are improvements on what's there, but I am prejudiced: I really like the one I wrote, and I think it will stack up faster than LUCENE-1461 if you run load tests on it.

Just my $0.02

Matt

> MemoryCachedRangeFilter to boost performance of Range queries
> --------------------------------------------------------------
>
>          Key: LUCENE-855
>          URL: https://issues.apache.org/jira/browse/LUCENE-855
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Search
> Affects Versions: 2.1
>     Reporter: Andy Liu
>  Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch,
>  FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch,
>  FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch,
>  FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch,
>  MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch,
>  TestRangeFilterPerformanceComparison.java,
>  TestRangeFilterPerformanceComparison.java
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall
> within the specified range. This requires iterating through every single
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all <docId, value> pairs of a given field,
> sorts by value, and stores them in a SortedFieldCache. During bits(), binary
> searches are used to find the start and end indices of the lower and upper
> bound values. The BitSet is populated with all the docId values that fall
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed
> index with random date values within a 5-year range. Executing bits() 1000
> times on a standard RangeQuery using random date intervals took 63904 ms.
> Using MemoryCachedRangeFilter, it took 876 ms. The performance increase is
> less dramatic when a field has fewer unique terms or when the index contains
> fewer documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are
> stored in a long[] array), but it can easily be changed to support Strings.
> A side "benefit" of storing the values as longs is that there is no longer
> any need to make the values lexicographically comparable, i.e. by padding
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is that there is a fairly
> significant memory requirement, so it is designed for situations where range
> filter performance is critical and memory consumption is not an issue. The
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.
> MemoryCachedRangeFilter also requires a warmup step, which can take a while
> on large datasets (it took 40 s on a 3M-document corpus). Warmup can be
> called explicitly, or it is automatically called the first time
> MemoryCachedRangeFilter is applied to a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - The field contains many unique numeric values
> - The index contains a large number of documents

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
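The description above maps directly onto a small amount of code. Below is a minimal sketch of the technique as described; the class name, field names, and bits() signature are assumptions for illustration, not the patch's actual classes. Building the two parallel arrays is the warmup step the description mentions, and their footprint matches the stated (sizeof(int) + sizeof(long)) * numDocs.

{code:java}
import java.util.Arrays;
import java.util.BitSet;

public final class SortedFieldCacheSketch {
  private final long[] sortedValues; // field values, ascending
  private final int[] docIds;        // docIds[i] holds the doc for sortedValues[i]

  public SortedFieldCacheSketch(long[] sortedValues, int[] docIds) {
    this.sortedValues = sortedValues;
    this.docIds = docIds;
  }

  // Two binary searches bound the matching slice; the BitSet is then
  // populated with the docIds of every value in [lower, upper].
  public BitSet bits(int maxDoc, long lower, long upper) {
    BitSet result = new BitSet(maxDoc);
    int start = lowerBound(lower);   // first index with value >= lower
    int end = upperBound(upper);     // one past the last index with value <= upper
    for (int i = start; i < end; i++) {
      result.set(docIds[i]);
    }
    return result;
  }

  private int lowerBound(long key) {
    int idx = Arrays.binarySearch(sortedValues, key);
    if (idx < 0) return -idx - 1;    // not found: insertion point
    while (idx > 0 && sortedValues[idx - 1] == key) idx--; // first duplicate
    return idx;
  }

  private int upperBound(long key) {
    int idx = Arrays.binarySearch(sortedValues, key);
    if (idx < 0) return -idx - 1;
    while (idx < sortedValues.length - 1 && sortedValues[idx + 1] == key) idx++;
    return idx + 1;                  // one past the last duplicate
  }
}
{code}

The two O(log n) searches replace the per-term iteration of the stock RangeFilter, which is where the reported speedup comes from.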