[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833860#action_12833860 ] Eks Dev commented on LUCENE-329: {quote} query for John~ Patitucci~ I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. {quote} As a matter of fact, we have not just one frequency to consider, but two term frequencies! Consider a simpler case. Query term: "Johan" // a high-frequency term gives: Fuzzy expanded term1 "Johana" // high frequency Fuzzy expanded term2 "Joahn" // low frequency I guess you would like to score the second term higher, meaning lower frequency (higher IDF)... So far so good. Now turn it upside down and search for the low-frequency typo "Joahn"... in that case you would prefer the high-frequency term "Johan" from the expanded list to score higher... Point being, this situation is just not "complete" without taking both frequencies into consideration (query term and expanded term). In my experience, some simple nonlinear hints based on these two frequencies bring some easy precision points (HF-LF pairs are much more likely to be typo pairs than HF-HF pairs...). > Fuzzy query scoring issues > -- > > Key: LUCENE-329 > URL: https://issues.apache.org/jira/browse/LUCENE-329 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 1.2rc5 > Environment: Operating System: All > Platform: All >Reporter: Mark Harwood >Priority: Minor > Attachments: patch.txt > > > Queries which automatically produce multiple terms (wildcard, range, prefix, > fuzzy etc) currently suffer from two problems: > 1) Scores for matching documents are significantly smaller than term queries > because of the volume of terms introduced (A match on query Foo~ is 0.1 > whereas a match on query Foo is 1). > 2) The rarer forms of expanded terms are favoured over those of more common > forms because of the IDF. 
When using Fuzzy queries for example, rare > misspellings typically appear in results before the more common correct > spellings. > I will attach a patch that corrects the issues identified above by > 1) Overriding Similarity.coord to counteract the downplaying of scores > introduced by expanding terms. > 2) Taking the IDF factor of the most common form of expanded terms as the > basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
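The "two frequencies" hint described in the comment can be sketched as a damping factor applied to the expanded term's IDF: an HF query term paired with an LF expansion (or vice versa) is a likely typo pair, so the rare term's IDF advantage gets damped. The class, method name, and the logarithmic damping are illustrative assumptions, not anything from the attached patch.

```java
// Hypothetical sketch: a multiplier in (0,1] applied to the expanded
// term's IDF. Large docFreq ratio => HF-LF pair => likely typo => damp.
public class TwoFreqHint {
    public static double pairBoost(int queryTermDocFreq, int expandedTermDocFreq) {
        if (queryTermDocFreq <= 0 || expandedTermDocFreq <= 0) return 1.0;
        double hi = Math.max(queryTermDocFreq, expandedTermDocFreq);
        double lo = Math.min(queryTermDocFreq, expandedTermDocFreq);
        // equal frequencies -> 1.0 (no damping); big ratio -> strong damping
        return 1.0 / (1.0 + Math.log(hi / lo));
    }
}
```

With this shape, an expansion whose frequency is close to the query term's keeps its full IDF, while a very rare expansion of a very common query term (the typical typo case) is damped hardest.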
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832911#action_12832911 ] Eks Dev commented on LUCENE-2089: - {quote} ...Aaron i think generation may pose a problem for a full unicode alphabet... {quote} I wouldn't discount Aaron's approach so quickly! There is one *really smart* way to approach generation of the distance neighborhood. Have a look at FastSS http://fastss.csg.uzh.ch/ The trick is to delete, not to generate variations over the complete alphabet! They call it a "deletion neighborhood". It also generates far fewer variation terms, reducing pressure on the binary search in the term dictionary! You do not get all the goodies of a weighted-distance implementation, but the solution is much simpler. It would work similarly to the current spellchecker (just a lookup on "variations"), only faster. They even have some example code showing how they generate "deletions" (http://fastss.csg.uzh.ch/FastSimilarSearch.java). {quote} but the more intelligent stuff you speak of could be really cool esp. for spellchecking, sure you dont want to rewrite our spellchecker? btw its not clear to me yet, could you implement that stuff on top of "ghetto DFA" (the sorted terms dict we have now) or is something more sophisticated needed? its a lot easier to write this stuff now with the flex MTQ apis {quote} I really would love to, but I was paid before to work on this. I guess the "ghetto DFA" would not work, at least not fast enough (I haven't thought it through, really). Practically, you need to know which characters extend the current character in your dictionary, or in DFA parlance, all outgoing transitions from the current state; the "ghetto DFA" cannot do that efficiently. An idea for flex would be to implement this on top of an in-memory trie (full trie or TST), before jumping into the noisy-channel model (that is easy to add later) and a persistent trie dictionary. 
The traversal part is identical, and would make a nice contrib with a useful use case, as the majority of folks have enough memory to slurp the complete term dictionary into memory... It would serve as a proof of concept for flex and FuzzyQuery, and help you understand the magic of calculating edit distance against trie structures. Once you have a trie structure, the sky is the limit: prefix, regex... If I remember correctly, there were some trie implementations floating around; with one of those you need just one extra traversal method to find all terms at distance N. You can have a look at the http://jaspell.sourceforge.net/ TST implementation, class TernarySearchTrie.matchAlmost(...) methods. Just as an illustration of what is going on there: it is a simple recursive traversal of all terms at a max distance of N. Later we could tweak memory demand, switch to some more compact trie... and in the end add weighted distance and convince Mike to make a blazing fast persistent trie :)... In the meantime, the folks with enough memory would have really, really fast fuzzy, prefix... better distance... So the theory :) I hope you find these comments useful, even without patches > explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java > > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least small common E up to some max K (1,2,3, etc) we should create > a DFA for each E. > if the required E is above our supported max, we use "dumb mode" at first (no > seeking, no DFA, just brute force like now). 
> As the pq fills, we swap progressively lower DFAs into the enum, based upon > the lowest score in the pq. > This should work well on avg, at high E, you will typically fill the pq very > quickly since you will match many terms. > This not only provides a mechanism to switch to more efficient DFAs during > enumeration, but also to switch from "dumb mode" to "smart mode". > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way.
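The FastSS "deletion neighborhood" trick referenced in the comment can be sketched as below. This is an illustrative reimplementation of the idea, not the FastSS code itself: every variant of a term with up to k characters deleted is generated, and two terms whose neighborhoods intersect are candidates for being within edit distance k (candidates still need a verification pass with a real edit-distance check).

```java
import java.util.HashSet;
import java.util.Set;

// Generate all variants of a term with up to k characters deleted.
// Note: no alphabet enumeration at all, which is the whole trick.
public class DeletionNeighborhood {
    public static Set<String> neighborhood(String term, int k) {
        Set<String> result = new HashSet<>();
        result.add(term);
        if (k == 0) return result;
        for (int i = 0; i < term.length(); i++) {
            // delete character i, then recurse with one fewer deletion allowed
            String deleted = term.substring(0, i) + term.substring(i + 1);
            result.addAll(neighborhood(deleted, k - 1));
        }
        return result;
    }
}
```

For a term of length n and k=1 this produces only n+1 variants, versus the O(n * |alphabet|) variants of full insertion/substitution generation, which is exactly the "much less variation terms" point made above.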
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832741#action_12832741 ] Eks Dev commented on LUCENE-2089: - {quote} I assume you mean by weighted edit distance that the transitions in the state machine would have costs? {quote} Yes, kind of, only not embedded in the trie, just defined externally. What I am talking about is a part of the noisy-channel approach, modeling only the channel distribution. Have a look at http://norvig.com/spell-correct.html for the basic theory. I am suggesting almost the same, just applied at the character level and without the language-model part. It is rather easy once you have your dictionary in some sort of tree structure. You guide your traversal over the trie by iterating on each char in your search term, accumulating log probabilities of single transformations (recycling the prefix part). When you hit a leaf, insert into a PriorityQueue of appropriate depth. What I mean by "probabilities of single transformations" is: insertion(char a) // map char -> log probability (think of it as the cost of inserting this particular character) deletion(char a) // map char -> log probability... transposition(char a, char b) replacement(char a, char b) // 2D matrix -> probability (cost) If you wish, you could even add some positional information, boosting a match at the start/end of the string. I skipped the tricky mechanics of traversal, insertion, and deletion, but on a trie you can do it by following different paths... The only good (in-memory) implementation around that I know of is in the LingPipe spell checker (they implement the full noisy channel, with the language model driving the traversal)... It has huge educational value; Bob is really great at explaining things. The code itself is proprietary. 
I would suggest you peek into that code to see the two-minute rambling I wrote here properly explained :) Just ignore the language-model part and assume you have a NULL language model (all chars in the language are equally probable), doing a full traversal over the trie. {quote} If this is the case couldn't we even define standard levenshtein very easily (instead of nasty math), and would the beam search technique enumerate efficiently for us? {quote} Standard Levenshtein is trivially configured once you have this: it is just setting all these costs to 1 (delete, insert... in the log domain)... But who would use standard distance with such a beast, which can reduce the impact of inserting/deleting a silent "h" as in "Thomas"/"Tomas"... Enumeration is trie traversal, practically calculating the distance against all terms at the same time and collecting the N best along the way. The place where you save time is recycling the prefix part of this calculation. Enumeration is optimal in the sense that the trie contains only the terms from the term dictionary: you are not trying all possible alphabet characters, and you can implement "early path abandoning" easily, either by cost (log probability) and/or by limiting the number of successive insertions. If interested in really in-depth things, look at http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198 Great book (another great tip from b...@lingpipe). A bit strange with terminology (at least to me), but once you get used to it, it is really worth the time you spend trying to grasp it. 
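The externally defined transformation costs described above can be illustrated with a flat dynamic-programming version of the weighted distance. The trie-based version additionally recycles the shared-prefix part of this table across terms; here it is just two strings. All cost values are made-up placeholders for a trained channel model, including the cheap silent "h".

```java
// Weighted Levenshtein distance with per-character costs, in "negative
// log-probability" style: smaller cost = more likely transformation.
public class WeightedEditDistance {
    static double insertCost(char c)  { return c == 'h' ? 0.3 : 1.0; } // silent 'h' is cheap
    static double deleteCost(char c)  { return c == 'h' ? 0.3 : 1.0; }
    static double replaceCost(char a, char b) { return a == b ? 0.0 : 1.0; }

    public static double distance(String s, String t) {
        double[][] d = new double[s.length() + 1][t.length() + 1];
        for (int i = 1; i <= s.length(); i++) d[i][0] = d[i - 1][0] + deleteCost(s.charAt(i - 1));
        for (int j = 1; j <= t.length(); j++) d[0][j] = d[0][j - 1] + insertCost(t.charAt(j - 1));
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                d[i][j] = Math.min(
                    d[i - 1][j - 1] + replaceCost(s.charAt(i - 1), t.charAt(j - 1)),
                    Math.min(d[i - 1][j] + deleteCost(s.charAt(i - 1)),
                             d[i][j - 1] + insertCost(t.charAt(j - 1))));
            }
        }
        return d[s.length()][t.length()];
    }
}
```

Setting every insert/delete/replace cost to 1 recovers standard Levenshtein, which is the point made in the comment; with the costs above, "Thomas" vs. "Tomas" scores 0.3 instead of 1.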
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832424#action_12832424 ] Eks Dev commented on LUCENE-2089: - {quote} What about this, http://www.catalysoft.com/articles/StrikeAMatch.html it seems logically more appropriate to (human-entered) text objects than Levenshtein distance, and it is (in theory) extremely fast; is DFA-distance faster? {quote} Is it only me who sees plain, vanilla bigram distance here? What is new or better in StrikeAMatch compared to the first phase of the current spellchecker (feeding a PriorityQueue with candidates)? If you need to use this, nothing simpler: you do not even need pair comparison (aka traversal), just index terms split into bigrams and search with a standard Query. The automaton trick is a neat one. Imo, the only thing that would work better is to make the term dictionary a real trie (ternary, n-ary, DFA, makes no big difference). Making the term dictionary some sort of trie/DFA would permit smart beam search, even without compiling a query DFA. Beam search also makes implementation of better distances possible (weighted edit distance without the "metric constraint"). I guess this is going to be possible with flex; Mike was already talking about a DFA dictionary :) It took a while to figure out the trick Robert pulled here, treating the term dictionary as another DFA due to its sortedness, nice. 
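For reference, the measure the StrikeAMatch article proposes, called "plain, vanilla bigram distance" above, is the Dice coefficient over character bigrams: 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|). A minimal sketch:

```java
import java.util.HashMap;
import java.util.Map;

public class BigramSimilarity {
    // Multiset of character bigrams of s (counts matter for repeated bigrams).
    private static Map<String, Integer> bigrams(String s) {
        Map<String, Integer> m = new HashMap<>();
        for (int i = 0; i + 2 <= s.length(); i++)
            m.merge(s.substring(i, i + 2), 1, Integer::sum);
        return m;
    }

    public static double similarity(String a, String b) {
        Map<String, Integer> ma = bigrams(a), mb = bigrams(b);
        int shared = 0, total = 0;
        for (Map.Entry<String, Integer> e : ma.entrySet()) {
            shared += Math.min(e.getValue(), mb.getOrDefault(e.getKey(), 0));
            total += e.getValue();
        }
        for (int v : mb.values()) total += v;
        return total == 0 ? 0.0 : 2.0 * shared / total;
    }
}
```

The article's own example, "night" vs. "nacht", shares only the bigram "ht" and scores 0.25, which is exactly what the bigram phase of a spellchecker candidate-generation step computes.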
> explore using automaton for fuzzyquery > -- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search >Reporter: Robert Muir >Assignee: Mark Miller >Priority: Minor > Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java > > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least small common E up to some max K (1,2,3, etc) we should create > a DFA for each E. > if the required E is above our supported max, we use "dumb mode" at first (no > seeking, no DFA, just brute force like now). > As the pq fills, we swap progressively lower DFAs into the enum, based upon > the lowest score in the pq. > This should work well on avg, at high E, you will typically fill the pq very > quickly since you will match many terms. > This not only provides a mechanism to switch to more efficient DFAs during > enumeration, but also to switch from "dumb mode" to "smart mode". > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. 
> instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
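The first step of the idea quoted above, deriving the maximum required K edits up front from the user's float threshold, can be sketched under the classic FuzzyQuery scoring assumption that similarity = 1 - distance / min(termLen, targetLen). This is an illustration of the arithmetic, not the committed code:

```java
public class FuzzyMaxEdits {
    // Any target with similarity >= minSimilarity satisfies
    //   distance <= (1 - minSimilarity) * min(termLen, targetLen)
    // and min(termLen, targetLen) <= termLen, so termLen bounds K.
    public static int maxEdits(float minSimilarity, int termLength) {
        return (int) ((1.0f - minSimilarity) * termLength);
    }
}
```

For example, a 7-character term at the default-ish threshold 0.5 needs at most 3 edits, so DFAs for K = 1, 2, 3 cover it without ever falling back to "dumb mode".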
[jira] Commented: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762742#action_12762742 ] Eks Dev commented on LUCENE-1410: - Mike, That is definitely the way to go: distribution-dependent encoding, where every term gets individual treatment. Take for example the simple, but not all that rare, case where the index gets sorted on one of the indexed fields (we use this really extensively, e.g. a pre-sorted doc collection on user_rights/zip/city, all indexed). There you get perfectly "compressible" postings by simply managing intervals of set bits. Updates distort this picture, but we rebuild the index periodically and all gets good again. At the moment we load them into RAM as Filters in IntervalSets. If that were possible in Lucene, we wouldn't bother with Filters (VInt decoding on such super-dense fields was killing us, even in a RAMDirectory)... Thinking about your comments, isn't pulsing somewhat orthogonal to the packing method? For example, if you load the index into a RAMDirectory, one could avoid one indirection level and inline all postings. Flex indexing rocks; that is going to be the most important addition to Lucene since it started (imo)... I would even bet on double search speed on the first attempt for average queries :) Cheers, eks > PFOR implementation > --- > > Key: LUCENE-1410 > URL: https://issues.apache.org/jira/browse/LUCENE-1410 > Project: Lucene - Java > Issue Type: New Feature > Components: Other >Reporter: Paul Elschot >Priority: Minor > Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, > LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, > LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, > TestPFor2.java > > Original Estimate: 21840h > Remaining Estimate: 21840h > > Implementation of Patched Frame of Reference. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
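The "intervals of set bits" representation described in the comment can be sketched as follows; when the index is pre-sorted on the indexed field, each term's postings collapse into a few dense runs of consecutive docids, so storing (start, end) pairs beats decoding a VInt per document. IntervalPostings is a hypothetical name, not a Lucene class:

```java
import java.util.ArrayList;
import java.util.List;

public class IntervalPostings {
    private final List<int[]> intervals = new ArrayList<>(); // {start, end}, inclusive

    // Docs must be added in increasing order, as postings are.
    public void addDoc(int docId) {
        if (!intervals.isEmpty() && intervals.get(intervals.size() - 1)[1] == docId - 1) {
            intervals.get(intervals.size() - 1)[1] = docId;   // extend the current run
        } else {
            intervals.add(new int[] { docId, docId });        // start a new run
        }
    }

    public int intervalCount() { return intervals.size(); }

    public boolean contains(int docId) {
        for (int[] iv : intervals)
            if (docId >= iv[0] && docId <= iv[1]) return true;
        return false;
    }
}
```

A super-dense field with, say, 1500 matching docs in two sorted runs needs only two interval records instead of 1500 decoded deltas, which is the compression the comment is pointing at.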
[jira] Commented: (LUCENE-1762) Slightly more readable code in TermAttributeImpl
[ https://issues.apache.org/jira/browse/LUCENE-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735809#action_12735809 ] Eks Dev commented on LUCENE-1762: - cool, thanks for the review. > Slightly more readable code in TermAttributeImpl > - > > Key: LUCENE-1762 > URL: https://issues.apache.org/jira/browse/LUCENE-1762 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Affects Versions: 2.9 >Reporter: Eks Dev >Assignee: Uwe Schindler >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1762.patch, LUCENE-1762.patch, LUCENE-1762.patch, > LUCENE-1762.patch > > > No big deal. > growTermBuffer(int newSize) was using correct, but slightly hard to follow > code. > the method was returning null as a hint that the current termBuffer has > enough space to the upstream code or reallocated buffer. > this patch simplifies logic making this method to only reallocate buffer, > nothing more. > It reduces number of if(null) checks in a few methods and reduces amount of > code. > all tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1762) Slightly more readable code in TermAttributeImpl
[ https://issues.apache.org/jira/browse/LUCENE-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1762: Attachment: LUCENE-1762.patch - made the allocation in initTermBuffer() consistent with ArrayUtil.getNextSize(int) - it is ok not to start with MIN_BUFFER_SIZE, but rather with ArrayUtil.getNextSize(MIN_BUFFER_SIZE)... e.g. in case getNextSize gets very sensitive to initial conditions one day... - nulled termText on the switch to termBuffer in resizeTermBuffer (as it was before!). This was a bug in the previous patch > Slightly more readable code in TermAttributeImpl > - > > Key: LUCENE-1762 > URL: https://issues.apache.org/jira/browse/LUCENE-1762 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Eks Dev >Assignee: Uwe Schindler >Priority: Trivial > Attachments: LUCENE-1762.patch, LUCENE-1762.patch, LUCENE-1762.patch > > > No big deal. > growTermBuffer(int newSize) was using correct, but slightly hard to follow > code. > the method was returning null as a hint that the current termBuffer has > enough space to the upstream code or reallocated buffer. > this patch simplifies logic making this method to only reallocate buffer, > nothing more. > It reduces number of if(null) checks in a few methods and reduces amount of > code. > all tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1762) Slightly more readable code in TermAttributeImpl
[ https://issues.apache.org/jira/browse/LUCENE-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1762: Attachment: LUCENE-1762.patch Made the changes in Token along the same lines. - had to change one constant in TokenTest, as I changed the initial allocation policy of termBuffer to be consistent with ArrayUtil.getNextSize() if (termBuffer == null) NEW: termBuffer = new char[ArrayUtil.getNextSize(newSize < MIN_BUFFER_SIZE ? MIN_BUFFER_SIZE : newSize)]; OLD: termBuffer = new char[newSize < MIN_BUFFER_SIZE ? MIN_BUFFER_SIZE : newSize]; Not sure if this is better, but it looks more consistent to me (the buffer size is always determined via getNextSize()). Uwe, setOnlyUseNewAPI(false) does not exist; it was removed with some of the recent patches. It gets automatically detected via reflection? > Slightly more readable code in TermAttributeImpl > - > > Key: LUCENE-1762 > URL: https://issues.apache.org/jira/browse/LUCENE-1762 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Eks Dev >Assignee: Uwe Schindler >Priority: Trivial > Attachments: LUCENE-1762.patch, LUCENE-1762.patch > > > No big deal. > growTermBuffer(int newSize) was using correct, but slightly hard to follow > code. > the method was returning null as a hint that the current termBuffer has > enough space to the upstream code or reallocated buffer. > this patch simplifies logic making this method to only reallocate buffer, > nothing more. > It reduces number of if(null) checks in a few methods and reduces amount of > code. > all tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
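The growth policy being discussed can be sketched like this. getNextSize below is a stand-in assumed to mirror the 1/8-oversizing formula of Lucene's ArrayUtil.getNextSize at the time, and the rest is an illustrative simplification of the patched growTermBuffer (which only reallocates, and no longer returns null as a "nothing to do" hint):

```java
public class TermBufferSketch {
    static final int MIN_BUFFER_SIZE = 10;
    private char[] termBuffer;

    // Assumed oversize policy: grow by ~1/8, with a small floor for tiny sizes.
    static int getNextSize(int targetSize) {
        return (targetSize >> 3) + (targetSize < 9 ? 3 : 0) + targetSize;
    }

    // Always size the buffer through getNextSize, even on first allocation,
    // which is the "consistent allocation" point of the patch.
    void growTermBuffer(int newSize) {
        if (termBuffer == null) {
            termBuffer = new char[getNextSize(Math.max(newSize, MIN_BUFFER_SIZE))];
        } else if (termBuffer.length < newSize) {
            char[] grown = new char[getNextSize(newSize)];
            System.arraycopy(termBuffer, 0, grown, 0, termBuffer.length);
            termBuffer = grown;
        }
    }

    int bufferLength() { return termBuffer == null ? 0 : termBuffer.length; }
}
```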
[jira] Updated: (LUCENE-1762) Slightly more readable code in TermAttributeImpl
[ https://issues.apache.org/jira/browse/LUCENE-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1762: Attachment: LUCENE-1762.patch > Slightly more readable code in TermAttributeImpl > - > > Key: LUCENE-1762 > URL: https://issues.apache.org/jira/browse/LUCENE-1762 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Eks Dev >Priority: Trivial > Attachments: LUCENE-1762.patch > > > No big deal. > growTermBuffer(int newSize) was using correct, but slightly hard to follow > code. > the method was returning null as a hint that the current termBuffer has > enough space to the upstream code or reallocated buffer. > this patch simplifies logic making this method to only reallocate buffer, > nothing more. > It reduces number of if(null) checks in a few methods and reduces amount of > code. > all tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1762) Slightly more readable code in TermAttributeImpl
Slightly more readable code in TermAttributeImpl - Key: LUCENE-1762 URL: https://issues.apache.org/jira/browse/LUCENE-1762 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Eks Dev Priority: Trivial No big deal. growTermBuffer(int newSize) was using correct, but slightly hard to follow code. the method was returning null as a hint that the current termBuffer has enough space to the upstream code or reallocated buffer. this patch simplifies logic making this method to only reallocate buffer, nothing more. It reduces number of if(null) checks in a few methods and reduces amount of code. all tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1743) MMapDirectory should only mmap large files, small files should be opened using SimpleFS/NIOFS
[ https://issues.apache.org/jira/browse/LUCENE-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731104#action_12731104 ] Eks Dev commented on LUCENE-1743: - Right, it is not all about reading the index, you have to write it as well... Why not make it an abstract class with abstract Directory getDirectory(String file, int minSize, int maxSize, String [read/write/append], String context); String getName(); // for logging What do you understand under "context"? Something along the lines of: /Give me a directory for "segment merges", "read only" for search./ ...Maybe one day we will have the possibility to not kill the OS cache by merging. > MMapDirectory should only mmap large files, small files should be opened > using SimpleFS/NIOFS > - > > Key: LUCENE-1743 > URL: https://issues.apache.org/jira/browse/LUCENE-1743 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Affects Versions: 2.9 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 3.1 > > > This is a followup to LUCENE-1741: > Javadocs state (in FileChannel#map): "For most operating systems, mapping a > file into memory is more expensive than reading or writing a few tens of > kilobytes of data via the usual read and write methods. From the standpoint > of performance it is generally only worth mapping relatively large files into > memory." > MMapDirectory should get a user-configurable size parameter that is a lower > limit for mmapping files. All files with a size below that limit should be opened as a conventional IndexInput from SimpleFS or NIO (another configuration option > for the fallback?). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1743) MMapDirectory should only mmap large files, small files should be opened using SimpleFS/NIOFS
[ https://issues.apache.org/jira/browse/LUCENE-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731085#action_12731085 ] Eks Dev commented on LUCENE-1743: - Indeed! An obvious idea; the only thing I do not like about it is making these hidden, deceptive decisions: "I said I want MMapDirectory and someone else decided something else for me"... It does not matter that we have consensus here now; it may change tomorrow. A probably better way would be to turbo-charge FileSwitchDirectory with sexy parametrization options: MMapDirectory <- F(fileExtension, minSize, maxSize) // if the file extension matches and the file size is between minSize and maxSize, open the file with MMapDirectory... otherwise go on to the next rule... (can be designed upside down as well... changes nothing in the idea) The same for RAMDir, NIO, FS... With this, we can make UwesBestOfMMapDirectoryFor32BitOSs (your proposal here) or HighlyConcurentForWindows64WithTermDictionaryInRamAndStoredFieldsOnDiskDirectory just for me :) So most of the end users get some smart defaults we provide in core, and freaks (expert users in official lingo :) have an easy job, just configuring a TurboChargedFileSwitchDirectory. It should be easy to come up with a clean design for these "concrete Directory selection rules" while keeping the concrete Directories "pure". Cheers, Eks > MMapDirectory should only mmap large files, small files should be opened > using SimpleFS/NIOFS > - > > Key: LUCENE-1743 > URL: https://issues.apache.org/jira/browse/LUCENE-1743 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Affects Versions: 2.9 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 3.1 > > > This is a followup to LUCENE-1741: > Javadocs state (in FileChannel#map): "For most operating systems, mapping a > file into memory is more expensive than reading or writing a few tens of > kilobytes of data via the usual read and write methods. 
From the standpoint > of performance it is generally only worth mapping relatively large files into > memory." > MMapDirectory should get a user-configurable size parameter that is a lower > limit for mmapping files. All files with a size below that limit should be opened as a conventional IndexInput from SimpleFS or NIO (another configuration option > for the fallback?). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
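The rule-based FileSwitchDirectory parametrization proposed in the comment above could look roughly like this; all class, rule, and directory names are hypothetical, and the String directory names stand in for real Lucene Directory instances:

```java
import java.util.ArrayList;
import java.util.List;

// Ordered (extension, minSize, maxSize) -> directory rules; first match wins.
public class DirectorySelector {
    public static class Rule {
        final String extension; final long minSize, maxSize; final String directoryName;
        public Rule(String ext, long min, long max, String dir) {
            this.extension = ext; this.minSize = min; this.maxSize = max; this.directoryName = dir;
        }
        boolean matches(String fileName, long size) {
            return fileName.endsWith("." + extension) && size >= minSize && size <= maxSize;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final String fallback;

    public DirectorySelector(String fallback) { this.fallback = fallback; }
    public DirectorySelector add(Rule r) { rules.add(r); return this; }

    public String select(String fileName, long size) {
        for (Rule r : rules) if (r.matches(fileName, size)) return r.directoryName;
        return fallback;
    }
}
```

Smart defaults become one canned rule list shipped in core, while "expert users" just register their own rules, keeping the concrete Directory implementations untouched.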
[jira] Commented: (LUCENE-1741) Make MMapDirectory.MAX_BBUF user configureable to support chunking the index files in smaller parts
[ https://issues.apache.org/jira/browse/LUCENE-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730560#action_12730560 ] Eks Dev commented on LUCENE-1741: - Uwe, you convinced me. I looked at the code, and indeed, there is no performance penalty for this. What helped me was 1.1G... (I tried to find the maximum); the max file size is 1.4G... but 1.1 is just an OS coincidence, no magic about it. I guess 512 MB makes a good value; if memory is so fragmented that you cannot allocate 0.5 GB, you definitely have some other problems. We are talking here about VM memory, and even on Windows, having 512 MB in a block is not an issue (or better said, I have never seen problems with this value). @Paul: It is a misunderstanding; my "algorithm" was meant to be manual... no catching OOM and retrying (I've burned my fingers already on catching RuntimeException; do that only when absolutely desperate :). Uwe made this value user-settable anyhow. Thanks Uwe! > Make MMapDirectory.MAX_BBUF user configureable to support chunking the index > files in smaller parts > --- > > Key: LUCENE-1741 > URL: https://issues.apache.org/jira/browse/LUCENE-1741 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.9 >Reporter: Uwe Schindler >Assignee: Uwe Schindler >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1741.patch, LUCENE-1741.patch > > > This is a followup for java-user thred: > http://www.lucidimagination.com/search/document/9ba9137bb5d8cb78/oom_with_2_9#9bf3b5b8f3b1fb9b > It is easy to implement, just add a setter method for this parameter to > MMapDir. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
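The chunking arithmetic behind MAX_BBUF is simple: a file larger than the chunk size is mapped as several ByteBuffers, and a logical file position splits into a (buffer index, offset within buffer) pair. A sketch with illustrative names, assuming the 512 MB chunk size discussed above:

```java
public class ChunkedPosition {
    // Which mapped buffer a logical file position falls into.
    public static int bufferIndex(long pos, int chunkSize) {
        return (int) (pos / chunkSize);
    }
    // Offset of the position inside that buffer.
    public static int bufferOffset(long pos, int chunkSize) {
        return (int) (pos % chunkSize);
    }
}
```

A smaller chunk size means more buffers per file but smaller contiguous virtual-memory allocations, which is exactly the fragmentation trade-off discussed in the comment.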
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725182#action_12725182 ] Eks Dev commented on LUCENE-1720: - Sure, I just wanted to "sharpen definition" what is Lucene core issue, and what we can leave to end users. It is not only about the time, rather about canceling search requests (even better, general activities). > TimeLimitedIndexReader and associated utility class > --- > > Key: LUCENE-1720 > URL: https://issues.apache.org/jira/browse/LUCENE-1720 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Mark Harwood >Assignee: Mark Harwood >Priority: Minor > Attachments: ActivityTimedOutException.java, > ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, > TimeLimitedIndexReader.java > > > An alternative to TimeLimitedCollector that has the following advantages: > 1) Any reader activity can be time-limited rather than just single searches > e.g. the document retrieve phase. > 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly > before last "collect" stage of query processing) > Uses new utility timeout class that is independent of IndexReader. > Initial contribution includes a performance test class but not had time as > yet to work up a formal Junit test. > TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725168#action_12725168 ] Eks Dev commented on LUCENE-1720: - It's a bit late for this issue, but maybe worth thinking about. We could change the semantics of this problem completely. Imo, the problem can be reformulated as "Provide the possibility to cancel running queries on a best-effort basis, with or without returning the results collected so far." That would leave timer management to end users and keep the issue focused on the "Lucene core" part... Timeout management could then be provided as an example somewhere: "How to implement timeout management using ..." -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
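The "cancel on a best-effort basis" reformulation can be sketched in plain Java (hypothetical names, not the attached ActivityTimeMonitor/TimeLimitedIndexReader code): the hot collect loop polls a shared flag every few hits and stops early, keeping whatever was collected so far.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical best-effort cancellation sketch; another thread (a timer,
// a user click, ...) sets the flag, and the loop notices it soon after.
final class CancellableCollect {
    static final int CHECK_INTERVAL = 64; // poll cost amortized over many hits

    static List<Integer> collect(int[] docIds, AtomicBoolean cancelled) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docIds.length; i++) {
            if ((i % CHECK_INTERVAL) == 0 && cancelled.get()) {
                break; // best effort: keep the partial results
            }
            hits.add(docIds[i]);
        }
        return hits;
    }
}
```

Whether the caller gets the partial list or an exception is exactly the "with or without so-far-collected results" policy question from the comment.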
[jira] Commented: (LUCENE-1594) Use source code specialization to maximize search performance
[ https://issues.apache.org/jira/browse/LUCENE-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707116#action_12707116 ] Eks Dev commented on LUCENE-1594: - huh, it reduces hardware costs 2-3 times for larger setup! great > Use source code specialization to maximize search performance > - > > Key: LUCENE-1594 > URL: https://issues.apache.org/jira/browse/LUCENE-1594 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Attachments: FastSearchTask.java, LUCENE-1594.patch, > LUCENE-1594.patch, LUCENE-1594.patch > > > Towards eeking absolute best search performance, and after seeing the > Java ghosts in LUCENE-1575, I decided to build a simple prototype > source code specializer for Lucene's searches. > The idea is to write dynamic Java code, specialized to run a very > specific query context (eg TermQuery, collecting top N by field, no > filter, no deletions), compile that Java code, and run it. 
> Here're the performance gains when compared to trunk: > ||Query||Sort||Filt|Deletes||Scoring||Hits||QPS (base)||QPS (new)||%|| > |1|Date (long)|no|no|Track,Max|2561886|6.8|10.6|{color:green}55.9%{color}| > |1|Date (long)|no|5%|Track,Max|2433472|6.3|10.5|{color:green}66.7%{color}| > |1|Date (long)|25%|no|Track,Max|640022|5.2|9.9|{color:green}90.4%{color}| > |1|Date (long)|25%|5%|Track,Max|607949|5.3|10.3|{color:green}94.3%{color}| > |1|Date (long)|10%|no|Track,Max|256300|6.7|12.3|{color:green}83.6%{color}| > |1|Date (long)|10%|5%|Track,Max|243317|6.6|12.6|{color:green}90.9%{color}| > |1|Relevance|no|no|Track,Max|2561886|11.2|17.3|{color:green}54.5%{color}| > |1|Relevance|no|5%|Track,Max|2433472|10.1|15.7|{color:green}55.4%{color}| > |1|Relevance|25%|no|Track,Max|640022|6.1|14.1|{color:green}131.1%{color}| > |1|Relevance|25%|5%|Track,Max|607949|6.2|14.4|{color:green}132.3%{color}| > |1|Relevance|10%|no|Track,Max|256300|7.7|15.6|{color:green}102.6%{color}| > |1|Relevance|10%|5%|Track,Max|243317|7.6|15.9|{color:green}109.2%{color}| > |1|Title (string)|no|no|Track,Max|2561886|7.8|12.5|{color:green}60.3%{color}| > |1|Title (string)|no|5%|Track,Max|2433472|7.5|11.1|{color:green}48.0%{color}| > |1|Title (string)|25%|no|Track,Max|640022|5.7|11.2|{color:green}96.5%{color}| > |1|Title (string)|25%|5%|Track,Max|607949|5.5|11.3|{color:green}105.5%{color}| > |1|Title (string)|10%|no|Track,Max|256300|7.0|12.7|{color:green}81.4%{color}| > |1|Title (string)|10%|5%|Track,Max|243317|6.7|13.2|{color:green}97.0%{color}| > Those tests were run on a 19M doc wikipedia index (splitting each > Wikipedia doc @ ~1024 chars), on Linux, Java 1.6.0_10 > But: it only works with TermQuery for now; it's just a start. > It should be easy for others to run this test: > * apply patch > * cd contrib/benchmark > * run python -u bench.py -delindex > -nodelindex > (You can leave off one of -delindex or -nodelindex and it'll skip > those tests). 
> For each test, bench.py generates a single Java source file that runs > that one query; you can open > contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/FastSearchTask.java > to see it. I'll attach an example. It writes "results.txt", in Jira > table format, which you should be able to copy/paste back here. > The specializer uses pretty much every search speedup I can think of > -- the ones from LUCENE-1575 (to score or not, to maxScore or not), > the ones suggested in the spinoff LUCENE-1593 (pre-fill w/ sentinels, > don't use docID for tie breaking), LUCENE-1536 (random access > filters). It bypasses TermDocs and interacts directly with the > IndexInput, and with BitVector for deletions. It directly folds in > the collector, if possible. A filter if used must be random access, > and is assumed to pre-multiply-in the deleted docs. > Current status: > * I only handle TermQuery. I'd like to add others over time... > * It can collect by score, or single field (with the 3 scoring > options in LUCENE-1575). It can't do reverse field sort nor > multi-field sort now. > * The auto-gen code (gen.py) is rather hideous. It could use some > serious refactoring, etc.; I think we could get it to the point > where each Query can gen its own specialized code, maybe. It also > needs to be eventually ported to Java. > * The script runs old, then new, then checks that the topN results > are identical, and aborts if not. So I'm pretty sure the > specialized code is working correctly, for the cases I'm testing. > * The patch includes a few small changes to core, mostly to open up > package protected APIs so I can access stuff > I think this is an interesting effort for several reasons: > * It gives us a be
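As a rough illustration of what the specializer buys (plain Java with hypothetical method names, nothing from the actual patch): the generic loop re-tests query-time options for every document, while a "generated" variant has those decisions folded away, leaving a branch-free hot loop.

```java
// Manual stand-in for source-code specialization: same result, but the
// per-document option check is hoisted out of the specialized loop.
final class Specialized {
    // generic path: query-time flag checked once per document
    static float genericMax(float[] scores, boolean applyBoost, float boost) {
        float max = Float.NEGATIVE_INFINITY;
        for (float s : scores) {
            if (applyBoost) s *= boost; // the branch a specializer removes
            if (s > max) max = s;
        }
        return max;
    }

    // "generated" path, specialized for applyBoost == true
    static float boostedMax(float[] scores, float boost) {
        float max = Float.NEGATIVE_INFINITY;
        for (float s : scores) {
            float b = s * boost;
            if (b > max) max = b;
        }
        return max;
    }
}
```

The real patch goes further, of course: it emits and compiles a Java source file per query shape instead of hand-writing each variant.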
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704618#action_12704618 ] Eks Dev commented on LUCENE-1518: - Paul: ...The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values... Me: ...Imagine you have a Query and you are not interested in Scoring at all, this can be accomplished with only DocID iterator arithmetic, ignoring score() totally. But that is only an optimization (maybe already there?)... I knew Paul would kick in at this point; he said exactly the same thing I did, but, as opposed to me, he made a formulation that executes :) Pfff, I feel bad :) > Merge Query and Filter classes > -- > > Key: LUCENE-1518 > URL: https://issues.apache.org/jira/browse/LUCENE-1518 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.4 >Reporter: Uwe Schindler > Fix For: 2.9 > > Attachments: LUCENE-1518.patch > > > This issue presents a patch, that merges Queries and Filters in a way, that > the new Filter class extends Query. This would make it possible, to use every > filter as a query. > The new abstract filter class would contain all methods of > ConstantScoreQuery, deprecate ConstantScoreQuery. If somebody implements the > Filter's getDocIdSet()/bits() methods he has nothing more to do, he could > just use the filter as a normal query. > I do not want to completely convert Filters to ConstantScoreQueries. The idea > is to combine Queries and Filters in such a way, that every Filter can > automatically be used at all places where a Query can be used (e.g. also > alone a search query without any other constraint). For that, the abstract > Query methods must be implemented and return a "default" weight for Filters > which is the current ConstantScore Logic.
If the filter is used as a real > filter (where the API wants a Filter), the getDocIdSet part could be directly > used, the weight is useless (as it is currently, too). The constant score > default implementation is only used when the Filter is used as a Query (e.g. > as direct parameter to Searcher.search()). For the special case of > BooleanQueries combining Filters and Queries the idea is, to optimize the > BooleanQuery logic in such a way, that it detects if a BooleanClause is a > Filter (using instanceof) and then directly uses the Filter API and not take > the burden of the ConstantScoreQuery (see LUCENE-1345). > Here some ideas how to implement Searcher.search() with Query and Filter: > - User runs Searcher.search() using a Filter as the only parameter. As every > Filter is also a ConstantScoreQuery, the query can be executed and returns > score 1.0 for all matching documents. > - User runs Searcher.search() using a Query as the only parameter: No change, > all is the same as before > - User runs Searcher.search() using a BooleanQuery as parameter: If the > BooleanQuery does not contain a Query that is subclass of Filter (the new > Filter) everything as usual. If the BooleanQuery only contains exactly one > Filter and nothing else the Filter is used as a constant score query. If > BooleanQuery contains clauses with Queries and Filters the new algorithm > could be used: The queries are executed and the results filtered with the > filters. > For the user this has the main advantage: That he can construct his query > using a simplified API without thinking about Filters oder Queries, you can > just combine clauses together. The scorer/weight logic then identifies the > cases to use the filter or the query weight API. Just like the query > optimizer of a RDB. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704613#action_12704613 ] Eks Dev commented on LUCENE-1518: - Shai, Regarding pure ranked, CSQ is really what we need, no? --- Yep, it would work for Filters, but why not make it possible to have a normal Query "constant score"? For these cases, I am just not sure if this approach gets max performance (did not look at this code for quite a while). Imagine you have a Query and you are not interested in Scoring at all; this can be accomplished with only DocID iterator arithmetic, ignoring score() totally. But that is only an optimization (maybe already there?) Paul, How about materializing the DocIds _and_ the score values? Exactly, that would open the full caching possibility (the original purpose of Filters). Think search results caching ... that is practically another name for the search() method. It is easy to create this, but using it again would require some bigger changes :) Filter_on_Steroids materialize(boolean without_score); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
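The "DocID iterator arithmetic, ignoring score() totally" idea amounts to a leapfrog intersection of sorted doc-id streams. A self-contained sketch of that pure-boolean matching (plain arrays standing in for Lucene's DocIdSetIterator; not Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

// Leapfrog intersection of two sorted doc-id lists: the iterator that is
// behind advances, and score() is never consulted.
final class BooleanOnly {
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++; // advance the lagging side
            } else {
                j++;
            }
        }
        return out;
    }
}
```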
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704561#action_12704561 ] Eks Dev commented on LUCENE-1518: - Imo, it is really not all that important to make Filter and Query the same (that is just one alternative to achieve the goal). The basic problem we try to solve is adding a Filter directly to BooleanQuery, and making optimizations after that easier. Wrapping with CSQ just adds another layer between the Lucene search machinery and the Filter, making these optimizations harder. On the other hand, I must accept that, conceptually, Filter and Query are "the same", together supporting the following options: 1. Pure boolean model: You do not care about scores (today we can do it only via CSQ, as Filter does not enter BooleanQuery) 2. Mixed boolean and ranked: you have to define the Filter contribution to the documents (CSQ) 3. Pure ranked: No filters, all gets scored (the same as 2.) Ideally, as a user, I define only a Query (Filter based or not) and for each clause in my Query define Query.setScored(true/false) or useConstantScore(double score); also I should be able to say, "Dear Lucene, please materialize this "Query_Filter" for me as I would like to have it cached, and please store only DocIds (Filter today)." Maybe also open the possibility to cache the scores of the documents as well. One thing is concept and another is optimization.
From the optimization point of view, we have a couple of decisions to make: - DocID Set supports random access, yes or no (my "Materialized Query") - Decide if a clause should / should not be scored, or should be constant So, for each "Query" we need to decide/support: - scoring {yes, no, constant} and - opening the option to "materialize a Query" (that is how we create Filters today) - these Materialized Queries (aka Filters) should be able to tell us if they support random access, and whether they cache only doc ids or scores as well Nothing useful in this email, just thinking aloud, sometimes helps :)
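The "materialized Query" idea above can be sketched as follows (a hypothetical class mirroring the proposed materialize(boolean without_score) signature, not a Lucene API): matches are cached in a random-access bit set, with scores kept only on request.

```java
import java.util.BitSet;

// Hypothetical "Filter on steroids": a query's matches, materialized once
// and then served with O(1) random access; scores are optional cargo.
final class MaterializedMatches {
    final BitSet matches;
    final float[] scores; // null when materialized without scores

    MaterializedMatches(int maxDoc, int[] docs, float[] docScores, boolean withScores) {
        matches = new BitSet(maxDoc);
        scores = withScores ? new float[maxDoc] : null;
        for (int i = 0; i < docs.length; i++) {
            matches.set(docs[i]);
            if (scores != null) scores[docs[i]] = docScores[i];
        }
    }

    // the random-access test a BooleanQuery clause could use directly
    boolean randomAccess(int doc) {
        return matches.get(doc);
    }
}
```

This is essentially what caching a Filter does today, with the score array as the extra step the comment asks about.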
[jira] Commented: (LUCENE-1619) TermAttribute.termLength() optimization
[ https://issues.apache.org/jira/browse/LUCENE-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703543#action_12703543 ] Eks Dev commented on LUCENE-1619: - thanks Mike > TermAttribute.termLength() optimization > --- > > Key: LUCENE-1619 > URL: https://issues.apache.org/jira/browse/LUCENE-1619 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Eks Dev >Assignee: Michael McCandless >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1619.patch > > >public int termLength() { > initTermBuffer(); // This patch removes this method call > return termLength; >} > I see no reason to initTermBuffer() in termLength()... all tests pass, but I > could be wrong? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory
[ https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703406#action_12703406 ] Eks Dev commented on LUCENE-1618: - Maybe FileSwitchDirectory should have the possibility to take a list of files/extensions that should be loaded into RAM... making it maintenance-free and pushing this decision to the end user... If and when we decide to support users in it, we could then maintain a static list at a separate place. Kind of separating execution and configuration. I *think* I saw something similar Ning Lee made quite a while ago, from the hadoop camp (indexing on hadoop or something...). But I cannot remember what it was :( > Allow setting the IndexWriter docstore to be a different directory > -- > > Key: LUCENE-1618 > URL: https://issues.apache.org/jira/browse/LUCENE-1618 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4.1 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 2.9 > > Original Estimate: 336h > Remaining Estimate: 336h > > Add an IndexWriter.setDocStoreDirectory method that allows doc > stores to be placed in a different directory than the IW default > dir. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
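A minimal sketch of that suggestion (stand-in strings instead of real Directory instances; class and method names are illustrative, not a Lucene API): the user supplies the set of extensions to serve from RAM, and everything else falls through to disk.

```java
import java.util.Set;

// Hypothetical FileSwitchDirectory-style router: the extension set is the
// user-maintained configuration, the routing is the execution.
final class ExtensionRouter {
    private final Set<String> ramExtensions;

    ExtensionRouter(Set<String> ramExtensions) {
        this.ramExtensions = ramExtensions;
    }

    // "ram" or "disk", decided purely by file extension
    String route(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1);
        return ramExtensions.contains(ext) ? "ram" : "disk";
    }
}
```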
[jira] Created: (LUCENE-1619) TermAttribute.termLength() optimization
TermAttribute.termLength() optimization --- Key: LUCENE-1619 URL: https://issues.apache.org/jira/browse/LUCENE-1619 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Eks Dev Priority: Trivial Attachments: LUCENE-1619.patch
public int termLength() {
    initTermBuffer(); // This patch removes this method call
    return termLength;
}
I see no reason to initTermBuffer() in termLength()... all tests pass, but I could be wrong? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1619) TermAttribute.termLength() optimization
[ https://issues.apache.org/jira/browse/LUCENE-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1619: Attachment: LUCENE-1619.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703335#action_12703335 ] Eks Dev commented on LUCENE-1616: - ant build-contrib > add one setter for start and end offset to OffsetAttribute > -- > > Key: LUCENE-1616 > URL: https://issues.apache.org/jira/browse/LUCENE-1616 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Eks Dev >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, > LUCENE-1616.patch > > > add OffsetAttribute. setOffset(startOffset, endOffset); > trivial change, no JUnit needed > Changed CharTokenizer to use it -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1616: Attachment: LUCENE-1616.patch ok, maybe this time it will work, I hope I managed to clean it up (core build and test pass). The only thing that fails is contrib, but I guess this has nothing to do with it? [javac] D:\Repository\SerachAndMatch\Lucene\lucene\java\trunk\contrib\highlighter\src\java\org\apache\lucene\search\highlight\WeightedSpanTermExtractor.java:306: cannot find symbol [javac] MemoryIndex indexer = new MemoryIndex(); [javac] ^ [javac] symbol: class MemoryIndex [javac] location: class org.apache.lucene.search.highlight.WeightedSpanTermExtractor [javac] D:\Repository\SerachAndMatch\Lucene\lucene\java\trunk\contrib\highlighter\src\java\org\apache\lucene\search\highlight\WeightedSpanTermExtractor.java:306: cannot find symbol [javac] MemoryIndex indexer = new MemoryIndex(); [javac] ^ [javac] symbol: class MemoryIndex [javac] location: class org.apache.lucene.search.highlight.WeightedSpanTermExtractor [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. [javac] 3 errors -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703254#action_12703254 ] Eks Dev commented on LUCENE-1616: - me too, sorry! Eclipse left me blind for some funny reason; waiting for the test to complete before I commit again ... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1616: Attachment: LUCENE-1616.patch whoops, this time it compiles :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1616: Attachment: LUCENE-1616.patch the same as the first patch, just with setStart/EndOffset(int) removed -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703085#action_12703085 ] Eks Dev commented on LUCENE-1616: - I am fine with both options; removing the separate setters looks slightly better to me, as it forces users to think "atomically" about the offset <=> {start, end} pair. If you separate the start and end offsets too far in your code, the probability of overlooking a mistake somewhere is higher than when you manage start and end explicitly yourself... But this is really nothing we should overthink :) We make no mistakes either way. I can provide a new patch, if needed. > add one setter for start and end offset to OffsetAttribute > -- > > Key: LUCENE-1616 > URL: https://issues.apache.org/jira/browse/LUCENE-1616 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Eks Dev >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1616.patch > > > add OffsetAttribute. setOffset(startOffset, endOffset); > trivial change, no JUnit needed > Changed CharTokenizer to use it -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
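The "atomic" setter discussed above can be sketched as follows; this is a minimal stand-in for illustration only, not the actual Lucene OffsetAttribute interface or its implementation:

```java
// Minimal sketch of the proposed single setter. A single setOffset(start, end)
// keeps the pair consistent and gives one natural place to validate the
// invariant start <= end, which separate setters cannot do.
public class SimpleOffsetAttribute {
    private int startOffset;
    private int endOffset;

    // One atomic call for the pair, instead of separate
    // setStartOffset()/setEndOffset() calls that can drift apart.
    public void setOffset(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException(
                "invalid offsets: start=" + startOffset + " end=" + endOffset);
        }
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    public int startOffset() { return startOffset; }
    public int endOffset() { return endOffset; }
}
```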
[jira] Updated: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1616: Attachment: LUCENE-1616.patch > add one setter for start and end offset to OffsetAttribute > -- > > Key: LUCENE-1616 > URL: https://issues.apache.org/jira/browse/LUCENE-1616 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Eks Dev >Priority: Trivial > Attachments: LUCENE-1616.patch > > > add OffsetAttribute. setOffset(startOffset, endOffset); > trivial change, no JUnit needed > Changed CharTokenizer to use it -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
add one setter for start and end offset to OffsetAttribute -- Key: LUCENE-1616 URL: https://issues.apache.org/jira/browse/LUCENE-1616 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Eks Dev Priority: Trivial add OffsetAttribute. setOffset(startOffset, endOffset); trivial change, no JUnit needed Changed CharTokenizer to use it -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1615) deprecated method used in fieldsReader / setOmitTf()
[ https://issues.apache.org/jira/browse/LUCENE-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702901#action_12702901 ] Eks Dev commented on LUCENE-1615: - Sure, replacing Fieldable is good; I just noticed a quick win while cleaning up deprecations from our code base... one step at a time > deprecated method used in fieldsReader / setOmitTf() > > > Key: LUCENE-1615 > URL: https://issues.apache.org/jira/browse/LUCENE-1615 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Eks Dev >Priority: Trivial > Attachments: LUCENE-1615.patch > > > setOmitTf(boolean) is deprecated and should not be used by core classes. One > place where it appears is FieldsReader , this patch fixes it. It was > necessary to change Fieldable to AbstractField at two places, only local > variables. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1615) deprecated method used in fieldsReader / setOmitTf()
[ https://issues.apache.org/jira/browse/LUCENE-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1615: Attachment: LUCENE-1615.patch > deprecated method used in fieldsReader / setOmitTf() > > > Key: LUCENE-1615 > URL: https://issues.apache.org/jira/browse/LUCENE-1615 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Eks Dev >Priority: Trivial > Attachments: LUCENE-1615.patch > > > setOmitTf(boolean) is deprecated and should not be used by core classes. One > place where it appears is FieldsReader , this patch fixes it. It was > necessary to change Fieldable to AbstractField at two places, only local > variables. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1615) deprecated method used in fieldsReader / setOmitTf()
deprecated method used in fieldsReader / setOmitTf() Key: LUCENE-1615 URL: https://issues.apache.org/jira/browse/LUCENE-1615 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Eks Dev Priority: Trivial setOmitTf(boolean) is deprecated and should not be used by core classes. One place where it appears is FieldsReader , this patch fixes it. It was necessary to change Fieldable to AbstractField at two places, only local variables. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701298#action_12701298 ] Eks Dev commented on LUCENE-1606: - hmmm, sounds like a good idea, but I am still not convinced it would work for Fuzzy. Take a simple dictionary: {one, two, three, four}. The query term is, e.g., "ana", and n=1 means your DFA would be: {.na, a.a, an., an, na, ana, .ana, ana., a.na, an.a, ana.} where the dot represents any character in your alphabet. For the first element of the DFA (in expanded form) you need to visit all terms, no matter how you walk the DFA... or am I missing something? Where you could save time is the actual calculation of the LD matrix for terms that do not pass the automaton > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Reporter: Robert Muir >Priority: Minor > Fix For: 2.9 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if its not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. > Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA.
Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
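The wildcard pattern set from the comment above ({.na, a.a, an., ...} for "ana" with n=1) can be generated mechanically. A hedged sketch, with illustrative names: enumerate the distance-1 neighborhood of a term as patterns, '.' standing for any single character:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch: enumerate the edit-distance-1 "wildcard" neighborhood of a term,
// '.' meaning any single character, as in the expanded DFA set discussed
// above. Generates one pattern per substitution, deletion, and insertion
// position, plus the exact term itself.
public class FuzzyNeighborhood {
    public static Set<String> patterns(String term) {
        Set<String> out = new LinkedHashSet<>();
        out.add(term);                                     // exact match
        for (int i = 0; i < term.length(); i++) {          // substitutions
            out.add(term.substring(0, i) + '.' + term.substring(i + 1));
        }
        for (int i = 0; i < term.length(); i++) {          // deletions
            out.add(term.substring(0, i) + term.substring(i + 1));
        }
        for (int i = 0; i <= term.length(); i++) {         // insertions
            out.add(term.substring(0, i) + '.' + term.substring(i));
        }
        return out;
    }
}
```

For "ana" this yields 11 distinct patterns, one more than the list in the comment (the deletion "aa" is easy to miss by hand), which illustrates the point: the set grows quickly and any pattern beginning with '.' forces a visit to every term.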
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701279#action_12701279 ] Eks Dev commented on LUCENE-1606: - Robert, in order for Levenshtein automata to work, you need to have the complete dictionary as a DFA. Once you have the dictionary as a DFA (or any sort of trie), computing simple regexes or a fixed or weighted Levenshtein distance becomes a snap. A Levenshtein automaton is particularly fast at it; a much simpler and only slightly slower method (one page of code) is K. Oflazer's: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.136.3862 As said, you cannot really walk the current term dictionary as an automaton/trie (or do you have an idea how to do that?). I guess there are enough applications where storing the complete term dictionary as a RAM DFA is not a problem. Even making a smart (heavily cached) persistent trie/DFA should not be all that complex. Or did you intend just to iterate over all terms and compute the distance faster, i.e. break the LD matrix computation as soon as you see you have hit the boundary? But that still requires iterating over all terms. I have done something similar, in memory, but unfortunately someone else paid me for it and is not willing to share... > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Reporter: Robert Muir >Priority: Minor > Fix For: 2.9 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if its not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc.
> Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA. Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
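The "break the LD matrix computation as soon as you hit the boundary" idea mentioned in the comment above is easy to sketch: compute the standard dynamic-programming rows and abandon a term once every cell in the current row exceeds the maximum distance. This is a generic textbook sketch, not code from Lucene:

```java
// Sketch of early-terminating Levenshtein: returns true iff
// dist(a, b) <= maxDist, abandoning the DP as soon as an entire row
// exceeds maxDist (useful when scanning many dictionary terms, since
// most terms are rejected after only a few rows).
public class BoundedLevenshtein {
    public static boolean withinDistance(String a, String b, int maxDist) {
        if (Math.abs(a.length() - b.length()) > maxDist) return false;
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            if (rowMin > maxDist) return false;   // boundary hit: give up early
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()] <= maxDist;
    }
}
```

The early exit avoids the full O(|a|*|b|) work for clearly distant terms, but as the comment notes, the enumeration over all terms remains.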
[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs
[ https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688429#action_12688429 ] Eks Dev commented on LUCENE-1561: - maybe something along the lines of: usePureBooleanPostings() or minimalInvertedList() > Maybe rename Field.omitTf, and strengthen the javadocs > -- > > Key: LUCENE-1561 > URL: https://issues.apache.org/jira/browse/LUCENE-1561 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4.1 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1561.patch > > > Spinoff from here: > > http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html > Maybe rename omitTf to something like omitTermPositions, and make it clear > what queries will silently fail to work as a result. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688284#action_12688284 ] Eks Dev commented on LUCENE-1410: - It looks like Google went there as well (block encoding); see this blog: http://blogs.sun.com/searchguy/entry/google_s_postings_format and http://research.google.com/people/jeff/WSDM09-keynote.pdf (slides 47-63) > PFOR implementation > --- > > Key: LUCENE-1410 > URL: https://issues.apache.org/jira/browse/LUCENE-1410 > Project: Lucene - Java > Issue Type: New Feature > Components: Other >Reporter: Paul Elschot >Priority: Minor > Attachments: autogen.tgz, LUCENE-1410b.patch, LUCENE-1410c.patch, > LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, > TestPFor2.java, TestPFor2.java > > Original Estimate: 21840h > Remaining Estimate: 21840h > > Implementation of Patched Frame of Reference. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied
[ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669595#action_12669595 ] Eks Dev commented on LUCENE-1532: - bq. but I'm not sure the exact frequency number at just word-level is really that useful for spelling correction, assuming a normal zipfian distribution. You are probably right that you cannot expect high resolution from frequency, but the exact frequency is your "source information". Clustering it into buckets is just an algorithmic modification after which less information remains. Mark suggests 1-10, someone else would be happy with 1-3... who can tell? Therefore I would recommend keeping the real frequency information and leaving it to the end user to decide what to do with it. A frequency distribution is not a simple measure; it depends heavily on corpus composition and size. In one corpus a doc frequency of 3 means the word is probably a typo; in another it means nothing... My proposal is to work with the real frequency, as there is no information loss there... > File based spellcheck with doc frequencies supplied > --- > > Key: LUCENE-1532 > URL: https://issues.apache.org/jira/browse/LUCENE-1532 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/spellchecker >Reporter: David Bowen >Priority: Minor > > The file-based spellchecker treats all words in the dictionary as equally > valid, so it can suggest a very obscure word rather than a more common word > which is equally close to the misspelled word that was entered. It would be > very useful to have the option of supplying an integer with each word which > indicates its commonness. I.e. the integer could be the document frequency > in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by > defining a DocFrequencyInfo interface for obtaining the doc frequency of a > word, and a class which implements the interface by looking up the frequency > in an index. So Lucene users can provide alternative implementations of > DocFrequencyInfo. I could submit this as a patch if there is interest. > Alternatively, it might be better to just extend the spellcheck API to have a > way to supply the frequencies when you create a PlainTextDictionary, but that > would mean storing the frequencies somewhere when building the spellcheck > index, and I'm not sure how best to do that. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
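The DocFrequencyInfo interface proposed in the issue description above can be sketched in a few lines; the backing map here is illustrative (the issue proposes an index-backed implementation), and the names are taken from the description, not from a released API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the DocFrequencyInfo idea from the issue description: an
// interface the spellchecker can query for a word's commonness, here
// backed by a simple in-memory map so users can see the contract.
interface DocFrequencyInfo {
    int docFreq(String word);
}

class MapDocFrequencyInfo implements DocFrequencyInfo {
    private final Map<String, Integer> freqs = new HashMap<>();

    void add(String word, int freq) {
        freqs.put(word, freq);
    }

    @Override
    public int docFreq(String word) {
        // unknown words report frequency 0, i.e. maximally obscure
        return freqs.getOrDefault(word, 0);
    }
}
```

Keeping the raw count, as argued above, lets each user apply their own bucketing (1-10, 1-3, or none) downstream instead of baking one into the dictionary.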
[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied
[ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669579#action_12669579 ] Eks Dev commented on LUCENE-1532: - bq. I got better results by refining edit distance costs by keyboard layout Sure, a better distance helps a lot, but even in that case frequency information brings a lot. Frequency gives you information about the corpus that is orthogonal to what you get from a pure "word1" vs "word2" comparison. > File based spellcheck with doc frequencies supplied > --- > > Key: LUCENE-1532 > URL: https://issues.apache.org/jira/browse/LUCENE-1532 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/spellchecker >Reporter: David Bowen >Priority: Minor > > The file-based spellchecker treats all words in the dictionary as equally > valid, so it can suggest a very obscure word rather than a more common word > which is equally close to the misspelled word that was entered. It would be > very useful to have the option of supplying an integer with each word which > indicates its commonness. I.e. the integer could be the document frequency > in some index or set of indexes. > I've implemented a modification to the spellcheck API to support this by > defining a DocFrequencyInfo interface for obtaining the doc frequency of a > word, and a class which implements the interface by looking up the frequency > in an index. So Lucene users can provide alternative implementations of > DocFrequencyInfo. I could submit this as a patch if there is interest. > Alternatively, it might be better to just extend the spellcheck API to have a > way to supply the frequencies when you create a PlainTextDictionary, but that > would mean storing the frequencies somewhere when building the spellcheck > index, and I'm not sure how best to do that. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied
[ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669018#action_12669018 ] Eks Dev commented on LUCENE-1532: - bq. so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered In my experience frequency information brings a lot here, but its effect is not linear. The word with the higher frequency does not always make the better suggestion. Common sense says that high-frequency words often get misspelled in different ways in a normal corpus, producing the following pattern: an HF (high-frequency) word against an LF (low-frequency) word that is similar in the edit-distance sense is much more likely a typo/misspelling than an HF vs HF pair. Similar, HF vs LF: "the" against "hte", "think" vs "tihnk". Very similar, but HF vs HF: "think" vs "thing". Some cases that escape these heuristics are synonyms, alternative spellings and very common mistakes; they are tricky to isolate using only a distance measure and frequency. Here you need context. Similar and HF vs HF: "thomas" vs "tomas" is sometimes a spelling mistake, sometimes two different names... It depends on what you are trying to achieve: if you expect mistakes in the query, you are fine assuming HF suggestions are better; but if you go for high recall, you also need to cover the cases where the query term is correct and you have to dig into your corpus to find the incorrect words (the query "think about it" should find a document containing "tihnk about it"). A very challenging problem, but cutting to the chase: the proposal is to make it possible to define float function(editDistance, queryTokenFreq, corpusTokenFreq) that returns a measure that is higher for more similar pairs considering edit distance and frequency (the value used as the condition for the priority queue). The default could just work as you described. (It may already be possible; I did not look at it.)
> File based spellcheck with doc frequencies supplied > --- > > Key: LUCENE-1532 > URL: https://issues.apache.org/jira/browse/LUCENE-1532 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/spellchecker >Reporter: David Bowen > > The file-based spellchecker treats all words in the dictionary as equally > valid, so it can suggest a very obscure word rather than a more common word > which is equally close to the misspelled word that was entered. It would be > very useful to have the option of supplying an integer with each word which > indicates its commonness. I.e. the integer could be the document frequency > in some index or set of indexes. > I've implemented a modification to the spellcheck API to support this by > defining a DocFrequencyInfo interface for obtaining the doc frequency of a > word, and a class which implements the interface by looking up the frequency > in an index. So Lucene users can provide alternative implementations of > DocFrequencyInfo. I could submit this as a patch if there is interest. > Alternatively, it might be better to just extend the spellcheck API to have a > way to supply the frequencies when you create a PlainTextDictionary, but that > would mean storing the frequencies somewhere when building the spellcheck > index, and I'm not sure how best to do that. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
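One possible shape for the proposed float function(editDistance, queryTokenFreq, corpusTokenFreq) is sketched below. This is purely illustrative: the log form and the way the two frequencies are combined are assumptions for demonstration, not anything from a patch; the only properties it tries to capture are the ones argued above (closer terms score higher, and an HF suggestion is boosted relative to an LF query term):

```java
// Illustrative sketch of the proposed scoring hook for the suggestion
// priority queue. The exact nonlinearity is an assumption: distanceTerm
// decays with edit distance, and freqHint boosts a high-frequency
// candidate relative to a (possibly misspelled, low-frequency) query
// term, reflecting that HF-LF pairs are more likely typos than HF-HF.
public class SuggestionScorer {
    public static double score(int editDistance, long queryFreq, long corpusFreq) {
        double distanceTerm = 1.0 / (1.0 + editDistance);
        // log damps raw counts so the hint stays nonlinear but bounded in (0, 1]
        double freqHint = Math.log1p(corpusFreq) / Math.log1p(queryFreq + corpusFreq);
        return distanceTerm * freqHint;
    }
}
```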
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663120#action_12663120 ] Eks Dev commented on LUCENE-1518: - Nice, you did it top down (API), Paul takes it bottom up (speed). This makes some really crazy things possible, e.g. implementing a normal TermQuery as a "DirectFilter"; and once the BooleanQuery optimization gets done (no score calculation, direct use of DocIdSetIterators) you can speed up some queries containing TermQuery without really instantiating a Filter. Of course only for cases where tf/idf/norm can be ignored. It is a kind of middle ground between a Filter and a fully ranked TermQuery (better said, any BooleanQuery!): faster than the ranked case due to the switched-off score calculation, and more comfortable than Filter usage, with no instantiation of DocIdSets... Very nice indeed, a smooth mix of the ranked and "pure boolean" models with the benefits of both. > Merge Query and Filter classes > -- > > Key: LUCENE-1518 > URL: https://issues.apache.org/jira/browse/LUCENE-1518 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.4 >Reporter: Uwe Schindler > Attachments: LUCENE-1518.patch > > > This issue presents a patch, that merges Queries and Filters in a way, that > the new Filter class extends Query. This would make it possible, to use every > filter as a query. > The new abstract filter class would contain all methods of > ConstantScoreQuery, deprecate ConstantScoreQuery. If somebody implements the > Filter's getDocIdSet()/bits() methods he has nothing more to do, he could > just use the filter as a normal query. > I do not want to completely convert Filters to ConstantScoreQueries. The idea > is to combine Queries and Filters in such a way, that every Filter can > automatically be used at all places where a Query can be used (e.g. also > alone a search query without any other constraint).
For that, the abstract > Query methods must be implemented and return a "default" weight for Filters > which is the current ConstantScore Logic. If the filter is used as a real > filter (where the API wants a Filter), the getDocIdSet part could be directly > used, the weight is useless (as it is currently, too). The constant score > default implementation is only used when the Filter is used as a Query (e.g. > as direct parameter to Searcher.search()). For the special case of > BooleanQueries combining Filters and Queries the idea is, to optimize the > BooleanQuery logic in such a way, that it detects if a BooleanClause is a > Filter (using instanceof) and then directly uses the Filter API and not take > the burden of the ConstantScoreQuery (see LUCENE-1345). > Here some ideas how to implement Searcher.search() with Query and Filter: > - User runs Searcher.search() using a Filter as the only parameter. As every > Filter is also a ConstantScoreQuery, the query can be executed and returns > score 1.0 for all matching documents. > - User runs Searcher.search() using a Query as the only parameter: No change, > all is the same as before > - User runs Searcher.search() using a BooleanQuery as parameter: If the > BooleanQuery does not contain a Query that is subclass of Filter (the new > Filter) everything as usual. If the BooleanQuery only contains exactly one > Filter and nothing else the Filter is used as a constant score query. If > BooleanQuery contains clauses with Queries and Filters the new algorithm > could be used: The queries are executed and the results filtered with the > filters. > For the user this has the main advantage: That he can construct his query > using a simplified API without thinking about Filters oder Queries, you can > just combine clauses together. The scorer/weight logic then identifies the > cases to use the filter or the query weight API. Just like the query > optimizer of a RDB. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
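The "every Filter can be used as a Query" idea discussed above boils down to wrapping the filter's doc-id set in a scorer that hands out a constant score. A toy sketch with simplified stand-in types, not Lucene's actual Weight/Scorer machinery:

```java
import java.util.BitSet;

// Toy sketch of the "Filter extends Query" idea: a filter only knows
// which docs match; used as a query it gives every match the same
// constant score. Types here are simplified stand-ins.
interface Filter {
    BitSet bits(int maxDoc);
}

class ConstantScoreAdapter {
    private final Filter filter;
    private final float constantScore;

    ConstantScoreAdapter(Filter filter, float constantScore) {
        this.filter = filter;
        this.constantScore = constantScore;
    }

    // Constant score for matches, 0 for non-matches: the "pure boolean"
    // model expressed through the scoring API.
    float score(int doc, int maxDoc) {
        return filter.bits(maxDoc).get(doc) ? constantScore : 0f;
    }
}
```

A BooleanQuery optimizer could then detect such clauses (via instanceof, as the description suggests) and skip the scoring machinery entirely.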
[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641128#action_12641128 ] Eks Dev commented on LUCENE-1426: - Just a few random thoughts on this topic: - I am sure I read somewhere in those PDFs that were floating around that it would make sense to use VInts for very short postings and PFOR for the rest. I just do not remember the rationale behind it. - During the omitTf() discussion, we came up with the cool idea of actually inlining very short postings into the term dictionary instead of storing an offset. This way we save one seek per term in many cases, as well as some of the space used for storing the offset. With a standard Zipfian distribution, a lot of postings should get inlined. Use cases with query expansion over many terms (think spell checker, synonyms...) should benefit from that heavily. These postings are small but there are a lot of them, so it adds up... a seek is deadly :) I am sorry to miss the party here with PFOR, but let us hope this credit crunch is over soon so that I can dedicate some time to fun things like this :) cheers, eks > Next steps towards flexible indexing > > > Key: LUCENE-1426 > URL: https://issues.apache.org/jira/browse/LUCENE-1426 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1426.patch > > > In working on LUCENE-1410 (PFOR compression) I tried to prototype > switching the postings files to use PFOR instead of vInts for > encoding. > But it quickly became difficult. EG we currently mux the skip data > into the .frq file, which messes up the int blocks. We inline > payloads with positions which would also mess up the int blocks. > Skipping offsets and TermInfo offsets hardwire the file pointers of > frq & prox files yet I need to change these to block + offset, etc.
> Separately this thread also started up, on how to customize how Lucene > stores positional information in the index: > http://www.gossamer-threads.com/lists/lucene/java-user/66264 > So I decided to make a bit more progress towards "flexible indexing" > by first modularizing/isolating the classes that actually write the > index format. The idea is to capture the logic of each (terms, freq, > positions/payloads) into separate interfaces and switch the flushing > of a new segment as well as writing the segment during merging to use > the same APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
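For reference, the VInt format mentioned in the thread above stores 7 data bits per byte with the high bit set while more bytes follow, low-order bits first. A quick self-contained sketch (standalone helper, not Lucene's DataOutput code):

```java
import java.io.ByteArrayOutputStream;

// Sketch of VInt coding: 7 data bits per byte, high bit set while more
// bytes follow, low-order group first. Small numbers (the common case
// for very short postings) take a single byte, which is why VInt
// competes well with block codes like PFOR for short lists.
public class VInt {
    public static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);  // continuation bit set
            value >>>= 7;
        }
        out.write(value);                       // final byte, high bit clear
        return out.toByteArray();
    }

    public static int decode(byte[] bytes) {
        int value = 0, shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) break;         // last byte of this number
        }
        return value;
    }
}
```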
[jira] Commented: (LUCENE-1329) Remove synchronization in SegmentReader.isDeleted
[ https://issues.apache.org/jira/browse/LUCENE-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624657#action_12624657 ] Eks Dev commented on LUCENE-1329: - Ok, I see, thanks. At least it resolves the issue completely for RAM-based indexes. We have seen a performance drop for a RAM-based index when we switched to a multi-threaded setup with a shared IndexReader; I am not yet sure whether the reason is a problem in our code or something indeed related to Lucene. I am talking about a 25-30% drop with 3 threads on a 4-core CPU. Must measure it properly... > Remove synchronization in SegmentReader.isDeleted > - > > Key: LUCENE-1329 > URL: https://issues.apache.org/jira/browse/LUCENE-1329 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.3.1 >Reporter: Jason Rutherglen >Assignee: Michael McCandless >Priority: Trivial > Fix For: 2.4 > > Attachments: LUCENE-1329.patch, LUCENE-1329.patch, lucene-1329.patch > > > Removes SegmentReader.isDeleted synchronization by using a volatile > deletedDocs variable on Java 1.5 platforms. On Java 1.4 platforms > synchronization is limited to obtaining the deletedDocs reference. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1329) Remove synchronization in SegmentReader.isDeleted
[ https://issues.apache.org/jira/browse/LUCENE-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624634#action_12624634 ] Eks Dev commented on LUCENE-1329: - Mike, did anyone measure what this brings? It practically removes the need for many IndexReaders in a multi-threaded setup when the index is used read-only. > Remove synchronization in SegmentReader.isDeleted > - > > Key: LUCENE-1329 > URL: https://issues.apache.org/jira/browse/LUCENE-1329 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.3.1 >Reporter: Jason Rutherglen >Assignee: Michael McCandless >Priority: Trivial > Fix For: 2.4 > > Attachments: LUCENE-1329.patch, LUCENE-1329.patch, lucene-1329.patch > > > Removes SegmentReader.isDeleted synchronization by using a volatile > deletedDocs variable on Java 1.5 platforms. On Java 1.4 platforms > synchronization is limited to obtaining the deletedDocs reference. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
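The essence of the change described in the issue above is replacing method-level synchronization with a volatile read on the readers' hot path. A stripped-down sketch of the pattern (not the actual SegmentReader code; the copy-on-write in delete() is one way to keep readers lock-free and is an assumption here):

```java
import java.util.BitSet;

// Stripped-down sketch of the pattern: readers take a single volatile
// read of the deletedDocs reference instead of synchronizing on every
// isDeleted() call; the (rare) writer installs a fresh copy and
// re-publishes it through the volatile field.
public class ReadMostlyDeletes {
    private volatile BitSet deletedDocs = new BitSet();

    public boolean isDeleted(int doc) {
        BitSet current = deletedDocs;   // one volatile read, no lock
        return current.get(doc);
    }

    public synchronized void delete(int doc) {
        BitSet copy = (BitSet) deletedDocs.clone();  // copy-on-write
        copy.set(doc);
        deletedDocs = copy;             // volatile write publishes the change
    }
}
```

This is exactly the kind of change that helps the shared-IndexReader, multi-threaded, read-mostly case discussed in the comments.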
[jira] Commented: (LUCENE-1219) support array/offset/ length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623593#action_12623593 ] Eks Dev commented on LUCENE-1219: - bq. did you ever measure the before/after performance difference? Sure we did; it's been a while since we measured it, so I do not have the real numbers at hand. But for both cases (indexing and fetching a stored binary field) it showed up during profiling as the only easy quick win we could make. We index very short documents, and indexing speed per thread before this patch was in the 7.5k documents/second range; with the patch we run at 9.5-10k/second, sweet... For searching I do not remember the numbers, but it was surely above the 5% range (try allocating 12 MB in 6k objects per second as an unnecessary addition and you will see it :) > support array/offset/ length setters for Field with binary data > --- > > Key: LUCENE-1219 > URL: https://issues.apache.org/jira/browse/LUCENE-1219 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Eks Dev >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > Attachments: LUCENE-1219.extended.patch, LUCENE-1219.extended.patch, > LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, > LUCENE-1219.take2.patch, LUCENE-1219.take3.patch > > > currently the Field/Fieldable interface supports only compact, zero-based byte > arrays. This forces end users to create and copy the content of new objects > before passing them to Lucene, as such fields are often of variable size. > Depending on the use case, this can bring a far from negligible performance > improvement. > this approach extends the Fieldable interface with 3 new methods: > getOffset(), getLength() and getBinaryValue() (this only returns a reference > to the array)
[jira] Commented: (LUCENE-1219) support array/offset/ length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623332#action_12623332 ] Eks Dev commented on LUCENE-1219: - how was it: "repetitio est mater studiorum" ;) thanks Mike!
[jira] Updated: (LUCENE-1219) support array/offset/ length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1219: Attachment: LUCENE-1219.extended.patch bq. couldn't you just call document.getFieldable(name), and then call binaryValue(byte[] result) on that Fieldable, and then get the length from it (getBinaryLength()) too? (Trying to minimize API changes). Sure, good tip, this could work. No need to have this byte[] -> Fieldable -> byte[] loop, it confuses. I have attached a patch that uses this approach, but I created getBinaryValue(byte[]) instead of binaryValue(byte[]), as we already have binaryValue() as a deprecated method (reusing that name would be confusing as well). Not really tested, but it looks simple enough. Just thinking aloud: this is one nice feature, but I have always had a feeling that I do not understand these Field structures, roles and responsibilities :) The Field/Fieldable/AbstractField hierarchy is really ripe for a good refactoring. Its double life across the index and search use cases makes things not easy to follow. Hoss is right, we need some way to divorce RetrievedField from FieldToBeIndexed; they are definitely not the same, just very similar.
[jira] Commented: (LUCENE-1219) support array/offset/ length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621036#action_12621036 ] Eks Dev commented on LUCENE-1219: - bq. could we instead add this to Field: byte[] binaryValue(byte[] result) This is exactly where I started, but then I realized I am missing the actual length we read in LazyField; without it you would have to reallocate each time, except when your buffer length happens to equal toRead in LazyField. Simply put, the question is: how can the caller of byte[] getBinaryValue(String name, byte[] result) know the length of the valid data in the returned byte[]? Am I missing something obvious?
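The problem can be illustrated outside Lucene (the class and method names below are hypothetical, chosen to mirror the discussion): a method that fills a caller-supplied buffer may return a buffer larger than the data, so the number of valid bytes has to be communicated separately, e.g. via a length getter on the holder object, which is essentially what getBinaryLength() does in the patch.

```java
// Hypothetical stand-in for a lazily loaded binary field. The reused
// buffer can be larger than the payload, so the valid length is kept
// in a field and exposed through getBinaryLength().
class BinaryHolder {
    private int length; // number of valid bytes in the last returned buffer

    // Copy 'src' into 'result', reusing it when it is big enough.
    byte[] getBinaryValue(byte[] src, byte[] result) {
        if (result == null || result.length < src.length) {
            result = new byte[src.length]; // allocate only when needed
        }
        System.arraycopy(src, 0, result, 0, src.length);
        length = src.length;               // remember the valid length
        return result;
    }

    int getBinaryLength() { return length; }
}
```

Without the length getter, a caller holding a 16-byte scratch buffer that received 3 bytes of data has no way to tell payload from garbage.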
[jira] Updated: (LUCENE-1219) support array/offset/ length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1219: Attachment: LUCENE-1219.extended.patch Mike, this new patch includes take3 and adds the following: Fieldable Document.getStoredBinaryField(String name, byte[] scratch); where the scratch parameter is a user-supplied byte buffer that will be used if it is big enough; if not, a new one is simply allocated, as today. If scratch is used, you get the same object back through Fieldable.getByteValue(). For this to work, I added one new method to Fieldable: abstract Fieldable getBinaryField(byte[] scratch); the only interesting implementation is in LazyField. The reason for this is in my previous comment. This does not affect the issues from take3 at all, but it depends on take3, as you need to know the length of the byte[] you read. take3 remains good to commit; I just did not know how to make one isolated patch with only these changes without too much work in a text editor.
[jira] Commented: (LUCENE-1219) support array/offset/ length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620019#action_12620019 ] Eks Dev commented on LUCENE-1219: - Great Mike, it gets better and better; I saw LUCENE-1340 committed. Thanks to you, Grant, Doug and all others who voted for it, this happened so quickly. Trust me, these two issues are really making my life easier. I pushed the decision to add new hardware to some future point (meaning: save the customer's money now)... a few weeks later would have been too late. Now it only remains to make one nice patch that enables us to pass our own byte[] for retrieving stored fields during search. I was thinking along the lines of what you did in Analyzers; we could pull the same trick here, e.g. Field Document.getBinaryValue(String FIELD_NAME, Field destination); Field already has all the access methods (get/set), and the contract would be: if destination == null, a new one is created and returned; if not, we use the given one and return the same object back. The method should check whether the byte[] is big enough; if not, a simple growth policy can be applied. This way we avoid a new byte[] each time a stored field is fetched. I did not look at the code just now, but the last time I did, it looked quite simple to do something along these lines. Do you have ideas how we could do it better? A simple calculation in my case: the average hit count is around 200, and for each hit we fetch one stored field on which we do some post-processing, re-scoring and whatnot. Currently we run at most 30 requests/second; with an average document length of 2 KB you land at 200 * 30 = 6,000 object allocations per second, totaling 12 MB... only to get the data. I can imagine people with much longer documents (the more typical Lucene use case) where it gets worse. Simply reducing gc() pressure with a really small amount of work; I am sure this would have nice effects on some other use cases in Lucene. Thanks again to all the "workers" behind this great piece of software... eks PS: I need to find some time to peek at Paul's work in LUCENE-1345 and my wish list will be complete, at least for now (at least until you get your magic with the flexible index format done :)
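The allocation arithmetic above, and the proposed reuse contract, can be checked with a small stand-alone sketch (the fetcher name and signature are hypothetical illustrations, not the Lucene API):

```java
// Illustrates the comment's math (200 hits * 30 rq/s = 6000 allocations/s,
// ~12 MB/s of garbage at 2 KB each) and the reuse contract: pass a
// destination buffer in, get the same one back when it is big enough.
class StoredFieldFetcher {
    // Hypothetical fetch: fills 'dest' if large enough, else allocates.
    static byte[] fetch(byte[] stored, byte[] dest) {
        if (dest == null || dest.length < stored.length) {
            dest = new byte[stored.length]; // simple growth policy
        }
        System.arraycopy(stored, 0, dest, 0, stored.length);
        return dest;
    }

    public static void main(String[] args) {
        int hits = 200, requestsPerSecond = 30, docBytes = 2 * 1024;
        long allocationsPerSecond = (long) hits * requestsPerSecond;
        long bytesPerSecond = allocationsPerSecond * docBytes;
        // without reuse, every fetch is a fresh byte[] of garbage
        System.out.println(allocationsPerSecond + " allocations/s, "
            + bytesPerSecond / 1024 + " KB/s");
    }
}
```

With a shared scratch buffer sized for the largest document, the hot loop allocates nothing after warm-up, which is exactly the gc() pressure the comment wants to remove.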
[jira] Updated: (LUCENE-1219) support array/offset/ length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1219: Attachment: LUCENE-1219.take3.patch - updated this patch to apply to trunk - implemented the abstract getBinary*() methods in Fieldable, and removed a few ugly instanceof AbstractField checks from a few places (introduced by previous versions of this patch, which assumed Fieldable should stay unchanged). All tests pass (as expected; only a minor diff to the take2 version, much like the initial version).
[jira] Commented: (LUCENE-1340) Make it possible not to include TF information in index
[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618069#action_12618069 ] Eks Dev commented on LUCENE-1340: - that sounds like consensus :) Great! In that case LUCENE-1219 can be reworked slightly to avoid instanceof (less code). It also opens a way to pass a byte[] reference for retrieving stored fields out of Lucene and communicating the length back to the caller (now we allocate a new byte[] every time we fetch a stored field). bq. it's one of my biggest regrets in Lucene (yes, I am responsible for it), yet I firmly believe there is a way to do interfaces and abstracts in a proper way in Java. No need to regret, Grant; if you do nothing, you make no mistakes... Interfaces are ok as long as you can tell what they are going to be doing for the next 5 years... they force you to design "for the future", something we cannot afford in so popular and complex a library as Lucene, at places like Field. Abstract* is an equally good design abstraction... Proposal: we could live with the statement "Fieldable changes are allowed from now on; it is deprecated and will probably be removed in 3.0". It causes just a tiny bit of work in case someone is really implementing it (adding new methods to Fieldable, like omitTf(), costs at most 5 minutes of work to update an implementing class!). From 3.0 on I could very well live without it; until then, we cause 5 minutes of work for people who implement Fieldable on their own and want to stay up to date with the trunk. It is a fair deal for everyone and Lucene moves forward...
> Make it possible not to include TF information in index > -- > > Key: LUCENE-1340 > URL: https://issues.apache.org/jira/browse/LUCENE-1340 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Eks Dev >Priority: Minor > Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, > LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Term Frequency is typically not needed for all fields; some CPU (reading one > VInt less and one X>>>1...) and IO can be spared by making pure boolean fields > possible in Lucene. This topic has already been discussed and accepted as a > part of Flexible Indexing... This issue tries to push things a bit faster > forward, as I have some concrete customer demands. > benefits can be expected for fields that are typical candidates for Filters: > enumerations, user rights, IDs or very short "texts", phone numbers, zip > codes, names... > Status: just passed the standard tests (compatibility), committed for early review; > I have not tried the new feature; missing some asserts and one or two unit tests > Complexity: simpler than expected > can be used via omitTf() (whoever used omitNorms() will know where to find it :)
[jira] Commented: (LUCENE-1340) Make it possible not to include TF information in index
[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617978#action_12617978 ] Eks Dev commented on LUCENE-1340: - ouch! It is kind of getting personal between me and Fieldable :) Not the first time I have been bugged by it! Due to Fieldable (things really important, at least to me): - we cannot get a binary stored field in and out of Lucene without making gc() go crazy - we cannot omitTf. It would be possible somehow to do it at the AbstractField level with instanceof at a few places, but I simply hate to do it (I will patch my local copy; this issue is worth it to me... must branch off from the trunk for the first time, sigh). Funny thing is, I see no reason to have anything but AbstractField (Field/Fieldable are just redundant).
[jira] Updated: (LUCENE-1345) Allow Filter as clause to BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1345: Attachment: OpenBitSetIteratorExperiment.java TestIteratorPerf.java I just enhanced TestIteratorPerf to work with OpenBitSetIterator(Experiment)... On dense bit sets the sentinel-based iterators are faster (ca. 9%); on low density, about the same. Yonik's tip of -1 < doc instead of -1 != doc still performs worse, and knowing Yonik's hunch on these things, I am still not convinced it is really faster... Paul's work here is more interesting: a clear API and a performance win on many fronts... Practically, no need to pollute this issue further with iterator semantics; if I (or someone else) figure out something really interesting there, I will create a new issue. > Allow Filter as clause to BooleanQuery > -- > > Key: LUCENE-1345 > URL: https://issues.apache.org/jira/browse/LUCENE-1345 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Paul Elschot >Priority: Minor > Attachments: DisjunctionDISI.java, DisjunctionDISI.patch, > DisjunctionDISI.patch, LUCENE-1345.patch, LUCENE-1345.patch, > OpenBitSetIteratorExperiment.java, TestIteratorPerf.java, > TestIteratorPerf.java > >
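The sentinel idea being benchmarked can be sketched outside Lucene (hypothetical names, a simplified illustration of the loop style rather than the patch itself): instead of signaling exhaustion with a special return value like -1, the iterator parks on a sentinel doc id larger than any real one, so the termination check becomes a plain comparison and the loop cannot run off the end.

```java
// A doc-id iterator over a sorted int[] using a sentinel end marker.
class SentinelIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE; // sentinel doc id
    private final int[] docs;
    private int pos = -1;

    SentinelIterator(int[] docs) { this.docs = docs; }

    // Returns the next doc id, or NO_MORE_DOCS when exhausted. After
    // exhaustion it keeps returning the sentinel, which is what makes
    // loops robust against "one off" bugs: any comparison against a
    // real doc id limit also terminates the loop.
    int next() {
        pos++;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    static int sum(SentinelIterator it) {
        int total = 0;
        int doc;
        while ((doc = it.next()) != NO_MORE_DOCS) {
            total += doc;
        }
        return total;
    }
}
```

A -1-style iterator must guard every call site against the special value; with the sentinel, skipTo(target) and next() can share one exit condition.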
[jira] Commented: (LUCENE-1345) Allow Filter as clause to BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617836#action_12617836 ] Eks Dev commented on LUCENE-1345: - bq. comparison with -1 is being optimized away entirely I do not think so; how could the compiler "optimize away" the only condition that stops the loop? The loop would never finish, or am I misreading something here? Anyhow, the test is so simple that the compiler can take a completely different direction from the real case. I guess a much better test (without too much effort!) would be to take something like OpenBitSetIterator, make one iterator implementation with the sentinel approach, and then compare... This test is really just a dumb loop, but on the other side it isolates the difference between the two approaches...
[jira] Commented: (LUCENE-1345) Allow Filter as clause to BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617726#action_12617726 ] Eks Dev commented on LUCENE-1345: - Yonik, this would probably work fine for int values (on my CPU); I tried it on long values and it was significantly slower in this test... it boils down again to "which CPU are we optimizing for" :)
[jira] Commented: (LUCENE-1345) Allow Filter as clause to BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617603#action_12617603 ] Eks Dev commented on LUCENE-1345: - great! Will look into it at the weekend in more detail. I have moved this part to the constructor in my local copy; it passes all tests: +if (disiDocQueue == null) { + initDisiDocQueue(); +} It is in next() and skipTo(), practically the same as reported in https://issues.apache.org/jira/browse/LUCENE-1145; with this, 1145 can be closed.
[jira] Updated: (LUCENE-1345) Allow Filter as clause to BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1345: Attachment: TestIteratorPerf.java Hi Paul, I gave micro-benchmarking a try, and it looks like we could gain a lot by switching to the sentinel approach for iterators; apart from being faster, they are also a bit more robust against "one off" bugs. This test is just a simulation made assuming docId is long (I tried it with int and the result is about the same). I am just attaching it here, as I did not want to create a new issue before we identify whether there are some design/performance knock-out criteria. Test on my setup: 32-bit java version "1.6.0_10-rc", Java(TM) SE Runtime Environment (build 1.6.0_10-rc-b28), Windows XP Professional 32-bit notebook, 3 GB RAM, CPU x86 Family 6 Model 15 Stepping 11 GenuineIntel ~2194 MHz, run with java -server -Xbatch. Result (with docID long): old milliseconds = 6938, 6953, 6890, 6938, 6906, 6922, 6906, 6938, 6906, 6906; old total milliseconds = 69203. new milliseconds = 5797, 5703, 5266, 5250, 5234, 5250, 5235, 5250, 5250, 5250; new total milliseconds = 53485. New/Old time 53485/69203 (77.29%); all in all, more than 22% faster!! Of course, this type of benchmark does not mean all iterator ops in real life are going to be 20% faster... other things probably dominate, but if it turns out that this test does not have some flaw (easily possible)... well worth pursuing. cheers, eks
[jira] Updated: (LUCENE-1345) Allow Filter as clause to BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1345: Attachment: DisjunctionDISI.patch I just realised TestDisjunctionDISI had a bug (iterators have to be reinitialized)... Apart from that, only a small change in DISIQueue to use constants instead of variables (the compiler should do this as well, but you never know): private final void downHeap() { +int i = 1; +int j = 2; //i << 1; // find smaller child +int k = 3; //j + 1; +
[jira] Updated: (LUCENE-1345) Allow Filter as clause to BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1345: Attachment: DisjunctionDISI.patch bq. Would anyone have a DisjunctionDISI (Disjunction over DocIdSetIterators) somewhere? I have played with a DisjunctionSumScorer rip-off; maybe you find it useful for this issue... What would be nice here (and in DisjunctionSumScorer), if possible: - to remove initDISIQueue() from next() and skipTo() (likewise in DisjunctionSumScorer)... this is due to the ugly -1 position before the first call; I just do not know how to get rid of it :) - to switch to conjunction "mode" if minNrShouldMatch kicks in; there are already TODOs for it around. If you think you can use it, just go ahead and include it in your patch; I am not using this for anything, I just wrapped it up when you asked.
[jira] Commented: (LUCENE-1340) Make it possible not to include TF information in index
[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617140#action_12617140 ] Eks Dev commented on LUCENE-1340: - we finished our tests Index without omitTf() : - 87Mio Documents, 2 indexed Fields one stored field - Unique terms in index 2.5Mio - Average Field lengths in tokens: 3.3 and 5.5 (very short fields) - On Disk size 3.8 Gb total with stored field Queries under test: - BooleanQuery in all shapes and forms (disjunctive, conjunctive, nested, with minNumberShouldMatch()) . with a lot of clauses (5-100). - Filter used, yes Test scope, regression with 30k Queries on the same index with omitTf(true/false). Result: - The Queries returned 100% identical Hits (full recall tested, all hits checked)! - Index size reduction(not including stored field!): 7% (short documents => less positions than in Mike's case) - Performance of Queries: 5.2% faster, but index was loaded as RAMIndex (on disk setup should bring even more due to the reduced IO for reading postings) -Indexing performance (FSDisk!) 13% faster Also, we compared omitTf(false) with this patch and lucene.jar without this patch, no changes whatsoever. >From my perspective, this is good to go into production. At least for our >usage of lucene, there are no differences with homitTf(true)... >One more thing here: since the tiis are loaded into RAM, that unused >proxPointer wastes 8 bytes for each indexed terms. For indices with alot of >terms this can add up to alot of wasted ram. But still I think we should wait >and fix this as part of flexible indexing, when we maybe refactor the >TermInfos to be "column stride" instead. I am more than happy with the results, no need to squeeze the last bit out of it right now. Mike, thanks again for the great work! 
> Make it possible not to include TF information in index > -- > > Key: LUCENE-1340 > URL: https://issues.apache.org/jira/browse/LUCENE-1340 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Eks Dev >Priority: Minor > Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, > LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Term Frequency is typically not needed for all fields; some CPU (reading one > VInt less and one X>>>1...) and IO can be spared by making pure boolean fields > possible in Lucene. This topic has already been discussed and accepted as a > part of Flexible Indexing... This issue tries to push things a bit faster > forward, as I have some concrete customer demands. > Benefits can be expected for fields that are typical candidates for Filters: > enumerations, user rights, IDs, or very short "texts" such as phone numbers, zip > codes, names... > Status: just passed the standard (compatibility) tests, committed for early review; > I have not tried the new feature yet, and it is missing some asserts and one or two unit tests > Complexity: simpler than expected > Can be used via omitTf() (whoever used omitNorms() will know where to find it :)
[jira] Commented: (LUCENE-1340) Make it possible not to include TF information in index
[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615357#action_12615357 ] Eks Dev commented on LUCENE-1340: - Great, it is already more than I expected; even indexing is going to be somewhat faster. I have tried your patch on a smallish index with 8 million documents and it passed our regression test without problems. It worked fine with and without omitTf(true): no performance drop or bad surprises when we do not use it. Tomorrow the real test is scheduled, with production data, around 80 million very small documents and some very extensive tests; I will report back. "The one place I know of that will still waste bytes is the term dict (TermInfo): it stores a long proxPointer on disk (in .tii, .tis) and also in memory because we load *.tii into RAM" About this one, it would be nice not to store it as well, but I think the pointers are already reduced to one byte, as they are 0 for these cases (are they?). So we get this benefit without expecting it :) And yes, more "column stride" is great; if you followed my comments on LUCENE-1278, that would mean we could easily "inline" very short postings into the term dict (here I expect a huge performance benefit, as a skip() into another large file is going to be saved, independent of omitTf(true)), without an increase (or only a minimal one) in the size of .tii (no locality penalty). If we follow a Zipfian distribution, there are *a lot* of terms with postings shorter than e.g. 16...
Thanks again for your support; without you this patch would be just another nice idea :) > Make it possible not to include TF information in index > -- > > Key: LUCENE-1340 > URL: https://issues.apache.org/jira/browse/LUCENE-1340 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Eks Dev >Priority: Minor > Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, > LUCENE-1340.patch, LUCENE-1340.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Term Frequency is typically not needed for all fields; some CPU (reading one > VInt less and one X>>>1...) and IO can be spared by making pure boolean fields > possible in Lucene. This topic has already been discussed and accepted as a > part of Flexible Indexing... This issue tries to push things a bit faster > forward, as I have some concrete customer demands. > Benefits can be expected for fields that are typical candidates for Filters: > enumerations, user rights, IDs, or very short "texts" such as phone numbers, zip > codes, names... > Status: just passed the standard (compatibility) tests, committed for early review; > I have not tried the new feature yet, and it is missing some asserts and one or two unit tests > Complexity: simpler than expected > Can be used via omitTf() (whoever used omitNorms() will know where to find it :)
[jira] Updated: (LUCENE-1340) Make it possible not to include TF information in index
[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1340: Attachment: LUCENE-1340.patch - fixed a stupid bug in SegmentTermDocs (it read doc = docCode; instead of doc += docCode;) - TestOmitTf extended a bit > Make it possible not to include TF information in index > -- > > Key: LUCENE-1340 > URL: https://issues.apache.org/jira/browse/LUCENE-1340 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Eks Dev >Priority: Minor > Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, > LUCENE-1340.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Term Frequency is typically not needed for all fields; some CPU (reading one > VInt less and one X>>>1...) and IO can be spared by making pure boolean fields > possible in Lucene. This topic has already been discussed and accepted as a > part of Flexible Indexing... This issue tries to push things a bit faster > forward, as I have some concrete customer demands. > Benefits can be expected for fields that are typical candidates for Filters: > enumerations, user rights, IDs, or very short "texts" such as phone numbers, zip > codes, names... > Status: just passed the standard (compatibility) tests, committed for early review; > I have not tried the new feature yet, and it is missing some asserts and one or two unit tests > Complexity: simpler than expected > Can be used via omitTf() (whoever used omitNorms() will know where to find it :)
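The SegmentTermDocs bug mentioned above is easy to see in isolation: postings store doc ids as deltas (gaps), so the reader must accumulate the codes, not assign them. A minimal sketch in plain Java (illustrative only; it ignores the freq bit that the real on-disk format folds into the low bit of the code when TF is kept):

```java
public class DeltaDecodeSketch {

    // Correct decoder: each code is a gap from the previous doc id,
    // so the absolute id is recovered with "doc += docCode".
    public static int[] decode(int[] gaps) {
        int[] docs = new int[gaps.length];
        int doc = 0;
        for (int i = 0; i < gaps.length; i++) {
            doc += gaps[i];   // the fix; "doc = gaps[i]" would return raw gaps
            docs[i] = doc;
        }
        return docs;
    }
}
```

With gaps {5, 2, 10} the buggy assignment would yield {5, 2, 10} instead of the intended absolute ids {5, 7, 17}, which is exactly why the regression only shows up on multi-document postings.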
[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615077#action_12615077 ] Eks Dev commented on LUCENE-1278: - In light of Mike's comments here (Michael McCandless - 05/May/08 05:33 AM), I think it is worth mentioning that I am working on LUCENE-1340, that is, storing postings without the additional frq info. Correct me if I am wrong, but the only difference is that this approach with *.frq needs one more seek... at the same time, this could potentially increase the term dict size, so we lose some locality. Your last proposal sounds interesting: "inline short postings" into the term dict, so for short postings (about the size of an offset pointer into *.frq) with tf==1 (which is always the case if you use omitTf(true) from LUCENE-1340) we save one seek()... this could be a lot. Also, there is then no need to store those postings in *.frq (which complicates maintenance, I guess). > Add optional storing of document numbers in term dictionary > --- > > Key: LUCENE-1278 > URL: https://issues.apache.org/jira/browse/LUCENE-1278 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Affects Versions: 2.3.1 >Reporter: Jason Rutherglen >Priority: Minor > Attachments: lucene.1278.5.4.2008.patch, > lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, > lucene.1278.5.7.2008.patch, lucene.1278.5.7.2008.test.patch, > TestTermEnumDocs.java > > > Add optional storing of document numbers in term dictionary. String index > field cache and range filter creation will be faster. 
> Example read code: > {noformat} > TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS); > do { > Term term = termEnum.term(); > if (term == null || term.field() != field) break; > int[] docs = termEnum.docs(); > } while (termEnum.next()); > {noformat} > Example write code: > {noformat} > Document document = new Document(); > document.add(new Field("tag", "dog", Field.Store.YES, > Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS)); > indexWriter.addDocument(document); > {noformat}
[jira] Updated: (LUCENE-1340) Make it possible not to include TF information in index
[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1340: Attachment: LUCENE-1340.patch Thanks Mike; with just a little bit more hand-holding we are going to get there :) I *think* I have the *.prx IO excluded in case omitTf==true; please have a look, this part is really not an easy one (*Merger). Also, if a single field now has mixed true/false for omitTf, I set it to true. One unit test is already there and the basic use case works, but the test has to cover a bit more. > Make it possible not to include TF information in index > -- > > Key: LUCENE-1340 > URL: https://issues.apache.org/jira/browse/LUCENE-1340 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Eks Dev >Priority: Minor > Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Term Frequency is typically not needed for all fields; some CPU (reading one > VInt less and one X>>>1...) and IO can be spared by making pure boolean fields > possible in Lucene. This topic has already been discussed and accepted as a > part of Flexible Indexing... This issue tries to push things a bit faster > forward, as I have some concrete customer demands. > Benefits can be expected for fields that are typical candidates for Filters: > enumerations, user rights, IDs, or very short "texts" such as phone numbers, zip > codes, names... > Status: just passed the standard (compatibility) tests, committed for early review; > I have not tried the new feature yet, and it is missing some asserts and one or two unit tests > Complexity: simpler than expected > Can be used via omitTf() (whoever used omitNorms() will know where to find it :)
[jira] Updated: (LUCENE-1340) Make it possible not to include TF information in index
[ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1340: Attachment: LUCENE-1340.patch first cut > Make it possible not to include TF information in index > -- > > Key: LUCENE-1340 > URL: https://issues.apache.org/jira/browse/LUCENE-1340 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Eks Dev >Priority: Minor > Attachments: LUCENE-1340.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > Term Frequency is typically not needed for all fields; some CPU (reading one > VInt less and one X>>>1...) and IO can be spared by making pure boolean fields > possible in Lucene. This topic has already been discussed and accepted as a > part of Flexible Indexing... This issue tries to push things a bit faster > forward, as I have some concrete customer demands. > Benefits can be expected for fields that are typical candidates for Filters: > enumerations, user rights, IDs, or very short "texts" such as phone numbers, zip > codes, names... > Status: just passed the standard (compatibility) tests, committed for early review; > I have not tried the new feature yet, and it is missing some asserts and one or two unit tests > Complexity: simpler than expected > Can be used via omitTf() (whoever used omitNorms() will know where to find it :)
[jira] Created: (LUCENE-1340) Make it possible not to include TF information in index
Make it possible not to include TF information in index -- Key: LUCENE-1340 URL: https://issues.apache.org/jira/browse/LUCENE-1340 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Eks Dev Priority: Minor Term Frequency is typically not needed for all fields; some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolean fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward, as I have some concrete customer demands. Benefits can be expected for fields that are typical candidates for Filters: enumerations, user rights, IDs, or very short "texts" such as phone numbers, zip codes, names... Status: just passed the standard (compatibility) tests, committed for early review; I have not tried the new feature yet, and it is missing some asserts and one or two unit tests Complexity: simpler than expected Can be used via omitTf() (whoever used omitNorms() will know where to find it :)
[jira] Commented: (LUCENE-1187) Things to be done now that Filter is independent from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578656#action_12578656 ] Eks Dev commented on LUCENE-1187: - Michael, I do not think we need to add a Factory (for this particular reason); the DocIdSet type should not be assumed, as we could come up with smart ways to select the optimal Filter representation depending on doc-id distribution, size... The only problem we have is that the contrib classes ChainedFilter and BooleanFilter assume BitSet. The solution for this would be to add just a few methods to DocIdSet that can do AND/OR/NOT on DocIdSet[] using DocIdSetIterator, e.g. DocIdSet or(DocIdSet[], int minimumShouldMatch); DocIdSet or(DocIdSet[]); Optimized code for these basic operations *already exists* and can be copied from the Conjunction/Disjunction/ReqOpt/ReqExcl Scorer classes by simply stripping off the scoring part. With these utility methods in DocIdSet, rewriting ChainedFilter/BooleanFilter to work with DocIdSet (so that they work on all implementations of Filter/DocIdSet) is a 10-minute job... then, if needed, this implementation can be optimized to cover type-specific cases. Imo, BooleanFilter is the better bet; we do not need both of them. Unfortunately I do not have time to play with it for the next 3-4 weeks, but it should be no more than 2 days of work (remember, we have the difficult part already done in the Scorers). Having so much code duplication is not really good, but we can "merge" these somehow later. > Things to be done now that Filter is independent from BitSet > > > Key: LUCENE-1187 > URL: https://issues.apache.org/jira/browse/LUCENE-1187 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Paul Elschot >Priority: Minor > Attachments: ChainedFilterAndCachingFilterTest.patch, > javadocsZero2Match.patch > > > (Aside: where is the documentation on how to mark up text in jira comments?) 
> The following things are left over after LUCENE-584 : > For Lucene 3.0 Filter.bits() will have to be removed. > There is a CHECKME in IndexSearcher about using ConjunctionScorer to have the > boolean behaviour of a Filter. > I have not looked into Filter caching yet, but I suppose there will be some > room for improvement there. > Iirc the current core has moved to use OpenBitSetFilter and that is probably > what is being cached. > In some cases it might be better to cache a SortedVIntList instead. > Boolean logic on DocIdSetIterator is already available for Scorers (which > inherit from DocIdSetIterator) in the search package. This is currently > implemented by ConjunctionScorer, DisjunctionSumScorer, > ReqOptSumScorer and ReqExclScorer. > Boolean logic on BitSets is available in contrib/misc and contrib/queries > DisjunctionSumScorer calls score() on its subscorers before the score value is > actually needed. > This could be a reason to introduce a DisjunctionDocIdSetIterator, perhaps as > a superclass of DisjunctionSumScorer. > To fully implement non-scoring queries a TermDocIdSetIterator will be needed, > perhaps as a superclass of TermScorer. > The javadocs in org.apache.lucene.search using matching vs non-zero score: > I'll investigate this soon, and provide a patch when necessary. > An early version of the patches of LUCENE-584 contained a class Matcher, > which differs from the current DocIdSet in that Matcher has an explain() > method. > It remains to be seen whether such a Matcher could be useful between > DocIdSet and Scorer. > The semantics of scorer.skipTo(scorer.doc()) was discussed briefly. > This was also discussed at another issue recently, so perhaps it is worthwhile > to open a separate issue for this. > Skipping on a SortedVIntList is done using linear search; this could be > improved by adding multilevel skiplist info, much like in the Lucene index for > documents containing a term. 
> One comment by me of 3 Dec 2008: > A few complete (test) classes are deprecated; it might be good to add the > target release for removal there.
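The proposed or(DocIdSet[], int minimumShouldMatch) from the comment above can be sketched without any Lucene types by treating each DocIdSet as a sorted array of unique doc ids. This is a simplified illustration, not the Scorer-derived implementation the comment actually suggests: count how many sets contain each doc and keep the docs that reach the minimum.

```java
import java.util.Map;
import java.util.TreeMap;

public class OrMinMatchSketch {

    // sets: each inner array is a sorted list of unique doc ids, standing in
    // for one DocIdSet. Returns, in ascending order, the docs present in at
    // least minShouldMatch of the sets.
    public static int[] or(int[][] sets, int minShouldMatch) {
        TreeMap<Integer, Integer> counts = new TreeMap<>();  // doc -> match count
        for (int[] set : sets) {
            for (int doc : set) {
                counts.merge(doc, 1, Integer::sum);          // one vote per set
            }
        }
        return counts.entrySet().stream()
            .filter(e -> e.getValue() >= minShouldMatch)
            .mapToInt(Map.Entry::getKey)
            .toArray();
    }
}
```

An iterator-based version (as in DisjunctionSumScorer with the scoring stripped off) would stream the same result without materializing the counts, which is what makes the "copy the Scorer logic" suggestion attractive.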
[jira] Commented: (LUCENE-1219) support array/offset/length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578253#action_12578253 ] Eks Dev commented on LUCENE-1219: - >>Eks can you see if the changes look OK? Thanks. It looks perfect; you have brought it to "commit ready" status already. I will try it on our production mirror a bit later today and report back if something goes wrong. >>I guess I don't really understand the need for Fieldable. In fact I also don't really understand why we even needed to add AbstractField. I am with you 100% here; it looks to me as well that one concrete class could replace it all. But... maybe someone kicks in with some good arguments for why we have it that way. > support array/offset/length setters for Field with binary data > --- > > Key: LUCENE-1219 > URL: https://issues.apache.org/jira/browse/LUCENE-1219 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Eks Dev >Assignee: Michael McCandless >Priority: Minor > Attachments: LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, > LUCENE-1219.patch, LUCENE-1219.take2.patch > > > currently the Field/Fieldable interface supports only compact, zero-based byte > arrays. This forces end users to create and copy the content of new objects > before passing them to Lucene, as such fields are often of variable size. > Depending on the use case, this can bring a far from negligible performance > improvement. > this approach extends the Fieldable interface with 3 new methods: > getOffset(), getLength(), and getBinaryValue() (this only returns a reference > to the array)
[jira] Updated: (LUCENE-1219) support array/offset/length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1219: Attachment: LUCENE-1219.patch latest patch, updated to the trunk (LUCENE-1217 is there; Michael, you did not mark it as resolved.) > support array/offset/length setters for Field with binary data > --- > > Key: LUCENE-1219 > URL: https://issues.apache.org/jira/browse/LUCENE-1219 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Eks Dev >Assignee: Michael McCandless >Priority: Minor > Attachments: LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, > LUCENE-1219.patch > > > currently the Field/Fieldable interface supports only compact, zero-based byte > arrays. This forces end users to create and copy the content of new objects > before passing them to Lucene, as such fields are often of variable size. > Depending on the use case, this can bring a far from negligible performance > improvement. > this approach extends the Fieldable interface with 3 new methods: > getOffset(), getLength(), and getBinaryValue() (this only returns a reference > to the array)
[jira] Updated: (LUCENE-1219) support array/offset/length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1219: Attachment: LUCENE-1219.patch this one keeps the addition of the new methods localized to AbstractField and does not change the Fieldable interface... it looks like it could work done this way, with a few instanceof checks in FieldsWriter. This one has a dependency on LUCENE-1217. It will not give you any benefit if you directly implement your own Fieldable without extending AbstractField; therefore I would suggest eventually changing Fieldable to support all these methods that operate with offset/length. Or someone clever finds a way to change an interface without breaking backwards compatibility :) > support array/offset/length setters for Field with binary data > --- > > Key: LUCENE-1219 > URL: https://issues.apache.org/jira/browse/LUCENE-1219 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Eks Dev >Assignee: Michael McCandless >Priority: Minor > Attachments: LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch > > > currently the Field/Fieldable interface supports only compact, zero-based byte > arrays. This forces end users to create and copy the content of new objects > before passing them to Lucene, as such fields are often of variable size. > Depending on the use case, this can bring a far from negligible performance > improvement. > this approach extends the Fieldable interface with 3 new methods: > getOffset(), getLength(), and getBinaryValue() (this only returns a reference > to the array)
[jira] Updated: (LUCENE-1217) use isBinary cached variable instead of instanceof in Field
[ https://issues.apache.org/jira/browse/LUCENE-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1217: Attachment: Lucene-1217-take1.patch new patch; fixes the isBinary status in LazyField > use isBinary cached variable instead of instanceof in Field > --- > > Key: LUCENE-1217 > URL: https://issues.apache.org/jira/browse/LUCENE-1217 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Reporter: Eks Dev >Assignee: Michael McCandless >Priority: Trivial > Attachments: Lucene-1217-take1.patch, LUCENE-1217.patch > > > The Field class can hold three types of values. > See: AbstractField.java protected Object fieldsData = null; > Currently, mainly RTTI (instanceof) is used to determine the type of the > value stored in a particular instance of Field, but for the binary value we > have mixed RTTI and the cached variable "boolean isBinary". > This patch makes consistent use of the cached variable isBinary. > Benefit: consistent usage of the method to determine the run-time type for the binary > case (reduces the chance of getting out of sync on the cached variable). It should be > slightly faster as well. > Thinking aloud: > Would it not make sense to maintain the type with some integer/byte "poor man's > enum" (an interface with a couple of constants) > {code:java} > public static final interface Type{ > public static final byte BOOLEAN = 0; > public static final byte STRING = 1; > public static final byte READER = 2; > > } > {code} > and use that instead of isBinary + instanceof?
[jira] Commented: (LUCENE-1217) use isBinary cached variable instead of instanceof in Field
[ https://issues.apache.org/jira/browse/LUCENE-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577601#action_12577601 ] Eks Dev commented on LUCENE-1217: - hah, this bug just justified this patch :) Sorry, I should have run the tests before... nothing is trivial enough. The problem was indeed isBinary going out of sync in LazyField; a new patch follows. > use isBinary cached variable instead of instanceof in Field > --- > > Key: LUCENE-1217 > URL: https://issues.apache.org/jira/browse/LUCENE-1217 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Reporter: Eks Dev >Assignee: Michael McCandless >Priority: Trivial > Attachments: LUCENE-1217.patch > > > The Field class can hold three types of values. > See: AbstractField.java protected Object fieldsData = null; > Currently, mainly RTTI (instanceof) is used to determine the type of the > value stored in a particular instance of Field, but for the binary value we > have mixed RTTI and the cached variable "boolean isBinary". > This patch makes consistent use of the cached variable isBinary. > Benefit: consistent usage of the method to determine the run-time type for the binary > case (reduces the chance of getting out of sync on the cached variable). It should be > slightly faster as well. > Thinking aloud: > Would it not make sense to maintain the type with some integer/byte "poor man's > enum" (an interface with a couple of constants) > {code:java} > public static final interface Type{ > public static final byte BOOLEAN = 0; > public static final byte STRING = 1; > public static final byte READER = 2; > > } > {code} > and use that instead of isBinary + instanceof?
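The kind of out-of-sync bug described above is easy to reproduce in a toy version of the idea (illustrative names, not Lucene's actual classes): once the type is cached in a flag instead of being derived via instanceof, every code path that replaces fieldsData must also update the flag.

```java
public class TypedFieldSketch {
    // "Poor man's enum" type tag, pre-Java-5 style.
    public static final byte BINARY = 0;
    public static final byte STRING = 1;

    private Object fieldsData;
    private byte type = STRING;

    public void setValue(byte[] value) {
        fieldsData = value;
        type = BINARY;   // the cached flag must be kept in sync here...
    }

    public void setValue(String value) {
        fieldsData = value;
        type = STRING;   // ...and here; a lazy loader that skips this re-creates the bug
    }

    // No instanceof needed once the tag is trusted.
    public boolean isBinary() { return type == BINARY; }

    public Object value() { return fieldsData; }
}
```

A field that assigns fieldsData directly (as a lazy-loading subclass might) without touching the tag would report the wrong type while still holding the right value, which is exactly the LazyField failure mode fixed by the new patch.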
[jira] Commented: (LUCENE-1219) support array/offset/length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577597#action_12577597 ] Eks Dev commented on LUCENE-1219: - I do not know for sure whether this is something we could not live with. Adding a new interface sounds equally bad; it would work nicely, but I do not like it, as it makes the code harder to follow with too many interfaces... I'll have another look at it to see if there is a way to do it without interface changes. Any ideas? > support array/offset/length setters for Field with binary data > --- > > Key: LUCENE-1219 > URL: https://issues.apache.org/jira/browse/LUCENE-1219 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Eks Dev >Assignee: Michael McCandless >Priority: Minor > Attachments: LUCENE-1219.patch, LUCENE-1219.patch > > > currently the Field/Fieldable interface supports only compact, zero-based byte > arrays. This forces end users to create and copy the content of new objects > before passing them to Lucene, as such fields are often of variable size. > Depending on the use case, this can bring a far from negligible performance > improvement. > this approach extends the Fieldable interface with 3 new methods: > getOffset(), getLength(), and getBinaryValue() (this only returns a reference > to the array)
[jira] Commented: (LUCENE-1217) use isBinary cached variable instead of instanceof in Field
[ https://issues.apache.org/jira/browse/LUCENE-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577591#action_12577591 ] Eks Dev commented on LUCENE-1217: - thanks for looking into it! Subclassing now with backwards compatibility would be clumsy; I was thinking about it but could not find a clean way to do it. >>Or we could wait until Java 5 (3.0) and use real enums? Yes, that is the ultimate solution, but my line of thought was that the "poor man's enum" -> Java 5 enum migration would be trivial later... but "do not change working code" kicks in here :) > use isBinary cached variable instead of instanceof in Field > --- > > Key: LUCENE-1217 > URL: https://issues.apache.org/jira/browse/LUCENE-1217 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Reporter: Eks Dev >Assignee: Michael McCandless >Priority: Trivial > Attachments: LUCENE-1217.patch > > > The Field class can hold three types of values. > See: AbstractField.java protected Object fieldsData = null; > Currently, mainly RTTI (instanceof) is used to determine the type of the > value stored in a particular instance of Field, but for the binary value we > have mixed RTTI and the cached variable "boolean isBinary". > This patch makes consistent use of the cached variable isBinary. > Benefit: consistent usage of the method to determine the run-time type for the binary > case (reduces the chance of getting out of sync on the cached variable). It should be > slightly faster as well. > Thinking aloud: > Would it not make sense to maintain the type with some integer/byte "poor man's > enum" (an interface with a couple of constants) > {code:java} > public static final interface Type{ > public static final byte BOOLEAN = 0; > public static final byte STRING = 1; > public static final byte READER = 2; > > } > {code} > and use that instead of isBinary + instanceof?
[jira] Updated: (LUCENE-1219) support array/offset/length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1219: Attachment: LUCENE-1219.patch Michael McCandless had some nice ideas on how to make the getValue() change's performance penalty for legacy usage negligible; this patch includes them: - deprecates the getValue() method - returns a direct reference if offset == 0 && length == data.length > support array/offset/length setters for Field with binary data > --- > > Key: LUCENE-1219 > URL: https://issues.apache.org/jira/browse/LUCENE-1219 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Eks Dev >Assignee: Michael McCandless >Priority: Minor > Attachments: LUCENE-1219.patch, LUCENE-1219.patch > > > currently the Field/Fieldable interface supports only compact, zero-based byte > arrays. This forces end users to create and copy the content of new objects > before passing them to Lucene, as such fields are often of variable size. > Depending on the use case, this can bring a far from negligible performance > improvement. > this approach extends the Fieldable interface with 3 new methods: > getOffset(), getLength(), and getBinaryValue() (this only returns a reference > to the array)
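The compromise above can be sketched as follows (illustrative class and method names, assuming a field that stores a byte[] slice): the legacy whole-array accessor only copies when the slice does not span the entire backing array, so existing zero-based callers pay nothing.

```java
public class BinaryFieldSketch {
    private final byte[] data;
    private final int offset;
    private final int length;

    // The new offset/length-style setter: keep a slice of a caller-owned array.
    public BinaryFieldSketch(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    // Legacy-style accessor that must return a compact, zero-based array.
    public byte[] binaryValue() {
        if (offset == 0 && length == data.length) {
            return data;                       // whole array: hand out the reference, no copy
        }
        byte[] copy = new byte[length];        // partial slice: copy so the result is zero-based
        System.arraycopy(data, offset, copy, 0, length);
        return copy;
    }
}
```

The design choice mirrors the comment: legacy code that always passed compact arrays hits the reference-returning fast path, while only callers that actually use offset/length pay for the one-off copy when they fall back to the old accessor.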
[jira] Updated: (LUCENE-1219) support array/offset/length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1219: Attachment: LUCENE-1219.patch > support array/offset/ length setters for Field with binary data > --- > > Key: LUCENE-1219 > URL: https://issues.apache.org/jira/browse/LUCENE-1219 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Eks Dev >Priority: Minor > Attachments: LUCENE-1219.patch > > > currently Field/Fieldable interface supports only compact, zero based byte > arrays. This forces end users to create and copy content of new objects > before passing them to Lucene as such fields are often of variable size. > Depending on use case, this can bring far from negligible performance > improvement. > this approach extends Fieldable interface with 3 new methods > getOffset(); gettLenght(); and getBinaryValue() (this only returns reference > to the array) > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1219) support array/offset/length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1219: Attachment: (was: LUCENE-1219.patch) > support array/offset/ length setters for Field with binary data > --- > > Key: LUCENE-1219 > URL: https://issues.apache.org/jira/browse/LUCENE-1219 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Eks Dev >Priority: Minor > > currently Field/Fieldable interface supports only compact, zero based byte > arrays. This forces end users to create and copy content of new objects > before passing them to Lucene as such fields are often of variable size. > Depending on use case, this can bring far from negligible performance > improvement. > this approach extends Fieldable interface with 3 new methods > getOffset(); gettLenght(); and getBinaryValue() (this only returns reference > to the array) > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1219) support array/offset/length setters for Field with binary data
[ https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1219: Attachment: LUCENE-1219.patch all tests pass with this patch. Some polish needed and probably more testing. TODOs: - someone pedantic should check whether these new set/get methods could be named better - check if there are more places where this new feature could/should be used; I think I have changed all of them but one place, the direct subclass FieldForMerge in FieldsReader. This is code I do not know, so I did not touch it... - javadoc is poor, but should be enough to get us started. The only "pseudo-issue" I see is that public byte[] binaryValue(); now creates a byte[] and copies the content into it; a reference to the original array can now be fetched via the getBinaryValue() method... This is to preserve compatibility, as users expect a compact, zero-based array from this method and we now keep offset/length in Field. This is a "pseudo-issue" because users should already have a reference to this array, so this method is rather superfluous for end users. > support array/offset/length setters for Field with binary data > --- > > Key: LUCENE-1219 > URL: https://issues.apache.org/jira/browse/LUCENE-1219 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Eks Dev >Priority: Minor > Attachments: LUCENE-1219.patch > > > currently the Field/Fieldable interface supports only compact, zero-based byte > arrays. This forces end users to create new objects and copy their content > before passing them to Lucene, as such fields are often of variable size. > Depending on the use case, this can bring a far from negligible performance > improvement. > This approach extends the Fieldable interface with 3 new methods: > getOffset(), getLength() and getBinaryValue() (the latter only returns a reference > to the array) > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-1219) support array/offset/length setters for Field with binary data
support array/offset/length setters for Field with binary data --- Key: LUCENE-1219 URL: https://issues.apache.org/jira/browse/LUCENE-1219 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Eks Dev Priority: Minor currently the Field/Fieldable interface supports only compact, zero-based byte arrays. This forces end users to create new objects and copy their content before passing them to Lucene, as such fields are often of variable size. Depending on the use case, this can bring a far from negligible performance improvement. This approach extends the Fieldable interface with 3 new methods: getOffset(), getLength() and getBinaryValue() (the latter only returns a reference to the array) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
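A minimal sketch of the proposal, with hypothetical names (the real Fieldable API surface is larger): keeping array/offset/length stores a reference with no copying, while the legacy compact-array accessor copies only when the stored slice is not already zero-based, mirroring the fast path from the later patch.

```java
import java.util.Arrays;

// Illustrative sketch, not the actual Lucene Field implementation.
public class BinaryFieldSketch {
    private byte[] data;
    private int offset, length;

    // proposed-style setter: stores a reference, no per-document copy
    public void setBinaryValue(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    public int getOffset() { return offset; }
    public int getLength() { return length; }

    // reference accessor, as described for getBinaryValue()
    public byte[] getBinaryValue() { return data; }

    // legacy-style accessor: must return a compact, zero-based array,
    // so it copies unless the stored slice already is one
    public byte[] binaryValue() {
        if (offset == 0 && length == data.length) return data;
        return Arrays.copyOfRange(data, offset, offset + length);
    }
}
```

The point of the issue is that a caller reusing one large buffer across documents pays the copy only if it still goes through the legacy accessor.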
[jira] Updated: (LUCENE-1217) use isBinary cached variable instead of instanceof in Field
[ https://issues.apache.org/jira/browse/LUCENE-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1217: Attachment: LUCENE-1217.patch > use isBinary cached variable instead of instanceof in Filed > --- > > Key: LUCENE-1217 > URL: https://issues.apache.org/jira/browse/LUCENE-1217 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Reporter: Eks Dev >Priority: Trivial > Attachments: LUCENE-1217.patch > > > Filed class can hold three types of values, > See: AbstractField.java protected Object fieldsData = null; > currently, mainly RTTI (instanceof) is used to determine the type of the > value stored in particular instance of the Field, but for binary value we > have mixed RTTI and cached variable "boolean isBinary" > This patch makes consistent use of cached variable isBinary. > Benefit: consistent usage of method to determine run-time type for binary > case (reduces chance to get out of sync on cached variable). It should be > slightly faster as well. > Thinking aloud: > Would it not make sense to maintain type with some integer/byte"poor man's > enum" (Interface with a couple of constants) > code:java{ > public static final interface Type{ > public static final byte BOOLEAN = 0; > public static final byte STRING = 1; > public static final byte READER = 2; > > } > } > and use that instead of isBinary + instanceof? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-1217) use isBinary cached variable instead of instanceof in Field
use isBinary cached variable instead of instanceof in Field --- Key: LUCENE-1217 URL: https://issues.apache.org/jira/browse/LUCENE-1217 Project: Lucene - Java Issue Type: Improvement Components: Other Reporter: Eks Dev Priority: Trivial The Field class can hold three types of values, see: AbstractField.java protected Object fieldsData = null; Currently, mainly RTTI (instanceof) is used to determine the type of the value stored in a particular instance of the Field, but for the binary value we have mixed RTTI and the cached variable "boolean isBinary". This patch makes consistent use of the cached variable isBinary. Benefit: consistent usage of one method to determine the run-time type for the binary case (reduces the chance of getting out of sync on the cached variable). It should be slightly faster as well. Thinking aloud: Would it not make sense to maintain the type with some integer/byte "poor man's enum" (an interface with a couple of constants)? {code:java} public static final interface Type{ public static final byte BOOLEAN = 0; public static final byte STRING = 1; public static final byte READER = 2; } {code} and use that instead of isBinary + instanceof? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance
[ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574759#action_12574759 ] Eks Dev commented on LUCENE-1035: - Robert, you said: We actually have a multiplexing directory that (depending on file type and size) either opens the file purely in memory, uses a cached file, or lets the OS do the caching. Works really well... Did you create a patch somewhere, or is this your internal work? I have a case where this could come in very handy; I plan to use MMap for postings & co... but FSDirectory for stored fields, as they could easily blow up the size... The possibility to select on file type/size brings the MMap use case much closer to many users... One Directory implementation that allows users to select a strategy would indeed be perfect: LRU, FSDirectory, MMap, RAM or whatnot. > Optional Buffer Pool to Improve Search Performance > -- > > Key: LUCENE-1035 > URL: https://issues.apache.org/jira/browse/LUCENE-1035 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Ning Li > Attachments: LUCENE-1035.patch > > > An index in a RAMDirectory provides better performance than one in an FSDirectory. > But many indexes cannot fit in memory, or applications cannot afford to > spend that much memory on the index. On the other hand, because of locality, > a reasonably sized buffer pool may provide a good improvement over FSDirectory. > This issue aims at providing such an optional buffer pool layer. In cases > where it fits, i.e. a reasonable hit ratio can be achieved, it should provide > a good improvement over FSDirectory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
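The "multiplexing directory" idea discussed above can be reduced to a routing policy. This sketch is purely illustrative, not a real Lucene Directory: the size cutoff and strategy names are assumptions, though .fdt/.fdx are the actual stored-fields file extensions of the era.

```java
public class DirectoryPolicy {
    // assumed 64 MB cutoff for keeping a file fully in memory; tune per deployment
    static final long RAM_LIMIT = 64L << 20;

    // decide a backing store per index file, by name and size,
    // mirroring the "depending on file type and size" behavior described
    public static String choose(String fileName, long sizeBytes) {
        if (fileName.endsWith(".fdt") || fileName.endsWith(".fdx")) {
            return "fs";   // stored fields can blow up the size: keep them on FSDirectory
        }
        if (sizeBytes <= RAM_LIMIT) {
            return "ram";  // small files straight into memory
        }
        return "mmap";     // large postings etc. via memory mapping
    }
}
```

A real implementation would wrap three Directory instances and dispatch openInput/createOutput through a policy like this.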
[jira] Commented: (LUCENE-1187) Things to be done now that Filter is independent from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12571939#action_12571939 ] Eks Dev commented on LUCENE-1187: - Paul, I think there is one CHECKME in DisjunctionSumScorer I stumbled upon recently when I realized that a (token1+ token2+) query works way faster than (token1 token2) with setMinimumNumberShouldMatch(2). It is not directly related to LUCENE-584, but just as a reminder. Also, I think there is a hard_to_detect_small_maybe_performance_bug in ConjunctionScorer:
{code:java}
// If first-time skip distance is any predictor of
// scorer sparseness, then we should always try to skip first on
// those scorers.
// Keep last scorer in its last place (it will be the first
// to be skipped on), but reverse all of the others so that
// they will be skipped on in order of original high skip.
int end = (scorers.length - 1) - 1;
for (int i = 0; i < (end >> 1); i++) {
  Scorer tmp = scorers[i];
  scorers[i] = scorers[end - i];
  scorers[end - i] = tmp;
}
{code}
It has not been detected so far as it has only performance implications (I think?), and it sometimes works and sometimes not, depending on the number of scorers. To see what I am talking about, try this "simulator":
{code:java}
public static void main(String[] args) {
  int[] scorers = new int[7]; // 3 and 7 do not work
  for (int i = 0; i < scorers.length; i++) scorers[i] = i;
  System.out.println(Arrays.toString(scorers));
  int end = (scorers.length - 1) - 1;
  for (int i = 0; i < (end >> 1); i++) {
    int tmp = scorers[i];
    scorers[i] = scorers[end - i];
    scorers[end - i] = tmp;
  }
  System.out.println(Arrays.toString(scorers));
}
{code}
for 7 you get: [0, 1, 2, 3, 4, 5, 6] [5, 4, 2, 3, 1, 0, 6] instead of [5, 4, 3, 2, 1, 0, 6] and for 3: [0, 1, 2] [0, 1, 2] (should be [1, 0, 2]) > Things to be done now that Filter is independent from BitSet > > > Key: LUCENE-1187 > URL: https://issues.apache.org/jira/browse/LUCENE-1187 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Paul Elschot >Priority: Minor > > (Aside: where is the documentation on how to mark up text in jira comments?)
> The following things are left over after LUCENE-584 : > For Lucene 3.0 Filter.bits() will have to be removed. > There is a CHECKME in IndexSearcher about using ConjunctionScorer to have the > boolean behaviour of a Filter. > I have not looked into Filter caching yet, but I suppose there will be some > room for improvement there. > Iirc the current core has moved to use OpenBitSetFilter and that is probably > what is being cached. > In some cases it might be better to cache a SortedVIntList instead. > Boolean logic on DocIdSetIterator is already available for Scorers (that > inherit from DocIdSetIterator) in the search package. This is currently > implemented by ConjunctionScorer, DisjunctionSumScorer, > ReqOptSumScorer and ReqExclScorer. > Boolean logic on BitSets is available in contrib/misc and contrib/queries > DisjunctionSumScorer calls score() on its subscorers before the score value is > actually needed. > This could be a reason to introduce a DisjunctionDocIdSetIterator, perhaps as > a superclass of DisjunctionSumScorer. > To fully implement non-scoring queries a TermDocIdSetIterator will be needed, > perhaps as a superclass of TermScorer. > The javadocs in org.apache.lucene.search using matching vs non-zero score: > I'll investigate this soon, and provide a patch when necessary. > An early version of the patches of LUCENE-584 contained a class Matcher, > which differs from the current DocIdSet in that Matcher has an explain() > method. > It remains to be seen whether such a Matcher could be useful between > DocIdSet and Scorer. > The semantics of scorer.skipTo(scorer.doc()) was discussed briefly. > This was also discussed at another issue recently, so perhaps it is worthwhile > to open a separate issue for this. > Skipping on a SortedVIntList is done using linear search; this could be > improved by adding multilevel skiplist info, much like in the Lucene index for > documents containing a term. 
> One comment by me of 3 Dec 2008: > A few complete (test) classes are deprecated, it might be good to add the > target release for removal there. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
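The "simulator" in the comment above points at an off-by-one: with end = length - 2, the bound i < (end >> 1) skips the middle swap. A possible fix, sketched self-contained here with int standing in for Scorer (this is an illustration, not the committed ConjunctionScorer code), is to bound the loop by (end + 1) >> 1:

```java
public class ReversalDemo {
    // reverse all elements except the last one, which is the stated intent
    // of the ConjunctionScorer snippet quoted in the comment
    public static int[] reverseAllButLast(int[] scorers) {
        int[] s = scorers.clone();
        int end = s.length - 2;                    // last element keeps its place
        for (int i = 0; i < (end + 1) >> 1; i++) { // note: (end + 1) >> 1, not end >> 1
            int tmp = s[i];
            s[i] = s[end - i];
            s[end - i] = tmp;
        }
        return s;
    }
}
```

With this bound, 7 scorers give [5, 4, 3, 2, 1, 0, 6] and 3 scorers give [1, 0, 2], the outputs the comment says should be produced.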
[jira] Commented: (LUCENE-1169) Search with Filter does not work!
[ https://issues.apache.org/jira/browse/LUCENE-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567306#action_12567306 ] Eks Dev commented on LUCENE-1169: - Thanks for explaining it! So we now have classes implementing DocIdSetIterator (OpenBitSetIterator, SortedVIntList...) that are, strictly speaking, not conforming to the specification for skipTo(). The side effects we had here are probably local to this issue, but I somehow have a bad feeling about having differently behaving implementations of the same interface. Sounds paranoid, no? :) To make things better, new classes in core like e.g. OpenBitSet cover the case you described, when the iterator is positioned one before the first document, but they do not comply with the other side effects. Mainly, invoking iterator.skipTo(anything <= iterator.doc()) should have the same effect as next(), meaning the iterator gets moved, not only for iterator.skipTo(iterator.doc())... To cut to the chase: should we attempt to fix all DocIdSetIterator implementations to comply with these effects, or will it be enough to document these differences as a "relaxed skipTo contract"? The current usage of these classes is in Filter-related code and is practically a replacement for BitSet iteration, therefore "under control". But if we move on to using these classes tightly with Scorers, I am afraid we could expect "one off" and similar bugs. Another option would be to change the specification and use this sentinel -1 approach, but honestly, that is way above my head to comment on... > Search with Filter does not work! > - > > Key: LUCENE-1169 > URL: https://issues.apache.org/jira/browse/LUCENE-1169 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Reporter: Eks Dev >Assignee: Michael Busch >Priority: Blocker > Attachments: lucene-1169.patch, TestFilteredSearch.java > > > See attached JUnitTest, self-explanatory -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1169) Search with Filter does not work!
[ https://issues.apache.org/jira/browse/LUCENE-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566971#action_12566971 ] Eks Dev commented on LUCENE-1169: - Thank you for fixing it in no time :) But... I am getting confused with the skipping iterator semantics: is this a requirement for the other DocIdSetIterators, or only for scorers (it should be, I guess)? iterator.skipTo(iterator.doc()) <=> iterator.next(); // is this the contract? If that is the case, we have another bug in OpenBitSetIterator (border condition):
{code:java}
// this is the code in the javadoc, the "official contract"
boolean simulatedSkipTo(DocIdSetIterator i, int target) throws IOException {
  do {
    if (!i.next()) return false;
  } while (target > i.doc());
  return true;
}

public void testOpenBitSetBorderCondition() throws IOException {
  OpenBitSet bs = new OpenBitSet();
  bs.set(0);
  DocIdSetIterator i = bs.iterator();
  i.skipTo(i.doc());
  assertEquals(0, i.doc()); // cool, moved to the first legal position
  assertFalse("End of Matcher", i.skipTo(i.doc())); // NOT OK according to the javadoc
}

public void testOpenBitSetBorderConditionSimulated() throws IOException {
  OpenBitSet bs = new OpenBitSet();
  bs.set(0);
  DocIdSetIterator i = bs.iterator();
  simulatedSkipTo(i, i.doc());
  assertEquals(0, i.doc()); // cool, moved to the first legal position
  assertFalse("End of Matcher", simulatedSkipTo(i, i.doc())); // OK according to the javadoc!!
}
{code}
> Search with Filter does not work! > - > > Key: LUCENE-1169 > URL: https://issues.apache.org/jira/browse/LUCENE-1169 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Reporter: Eks Dev >Assignee: Michael Busch >Priority: Blocker > Attachments: lucene-1169.patch, TestFilteredSearch.java > > > See attached JUnitTest, self-explanatory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1145) DisjunctionSumScorer small tweak
[ https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566961#action_12566961 ] Eks Dev commented on LUCENE-1145: - test using Sun 1.4 jvm on the same hardware showed the same "a bit faster" behavior, so this is in my opinion OK to be committed. > DisjunctionSumScorer small tweak > > > Key: LUCENE-1145 > URL: https://issues.apache.org/jira/browse/LUCENE-1145 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Environment: all >Reporter: Eks Dev >Priority: Trivial > Attachments: DisjunctionSumScorerOptimization.patch, > DSSQueueSizeOptimization.patch, TestScorerPerformance.java > > > Move ScorerDocQueue initialization from next() and skipTo() methods to the > Constructor. Makes DisjunctionSumScorer a bit faster (less than 1% on my > tests). > Downside (if this is one, I cannot judge) would be throwing IOException from > DisjunctionSumScorer constructors as we touch HardDisk there. I see no > problem as this IOException does not propagate too far (the only modification > I made is in BooleanScorer2) > if (scorerDocQueue == null) { > initScorerDocQueue(); > } > > Attached test is just quick & dirty rip of TestScorerPerf from standard > Lucene test package. Not included as patch as I do not like it. > All test pass, patch made on trunk revision 613923 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1169) Search with Filter does not work!
[ https://issues.apache.org/jira/browse/LUCENE-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1169: Attachment: TestFilteredSearch.java Filter Bug > Search with Filter does not work! > - > > Key: LUCENE-1169 > URL: https://issues.apache.org/jira/browse/LUCENE-1169 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Reporter: Eks Dev >Priority: Blocker > Attachments: TestFilteredSearch.java > > > See attached JUnitTest, self-explanatory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-1169) Search with Filter does not work!
Search with Filter does not work! - Key: LUCENE-1169 URL: https://issues.apache.org/jira/browse/LUCENE-1169 Project: Lucene - Java Issue Type: Bug Components: Search Reporter: Eks Dev Priority: Blocker Attachments: TestFilteredSearch.java See attached JUnitTest, self-explanatory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1145) DisjunctionSumScorer small tweak
[ https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561836#action_12561836 ] Eks Dev commented on LUCENE-1145: - Well, I do not know how it behaves on earlier JVMs and what the "JVM we optimize for" would be; I would not be surprised if JVM 6+ evolved its optimization methods. These patches are just side effects of trying to get familiar with the scorer family's inner workings in light of LUCENE-584. Boolean arithmetic on multiple skipping iterators in Scorers can hardly be beaten and can be recycled for cases like BooleanFilter... and maybe one day merged to avoid code duplication :) Anyhow, if it proves that performance on 1.4 behaves similarly, I would opt for size(), as it makes the code slightly cleaner. If not, I would suggest replacing the only size() usage in next() with the cached queueSize. > DisjunctionSumScorer small tweak > > > Key: LUCENE-1145 > URL: https://issues.apache.org/jira/browse/LUCENE-1145 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Environment: all >Reporter: Eks Dev >Priority: Trivial > Attachments: DisjunctionSumScorerOptimization.patch, > DSSQueueSizeOptimization.patch, TestScorerPerformance.java > > > Move ScorerDocQueue initialization from next() and skipTo() methods to the > Constructor. Makes DisjunctionSumScorer a bit faster (less than 1% on my > tests). > Downside (if this is one, I cannot judge) would be throwing IOException from > DisjunctionSumScorer constructors as we touch HardDisk there. I see no > problem as this IOException does not propagate too far (the only modification > I made is in BooleanScorer2) > if (scorerDocQueue == null) { > initScorerDocQueue(); > } > > Attached test is just quick & dirty rip of TestScorerPerf from standard > Lucene test package. Not included as patch as I do not like it. > All test pass, patch made on trunk revision 613923 -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1145) DisjunctionSumScorer small tweak
[ https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1145: Attachment: DSSQueueSizeOptimization.patch Simplification of the DisjunctionSumScorer: - removed the cached field "private int queueSize", which mirrored ScorerDocQueue.size(), and replaced it with the method call. It is faster with this patch, but hardly measurably so: 585660ms vs 586090ms (test made with the attached TestScorerPerformance). Tested on WIN XP Prof., Dual Core Intel T7300 2GHz, with Java 6.0, java -server -Xbatch. At the moment I have no other configurations to test on; it would be good to see what happens on JVM 1.4. It makes sense to commit this as it simplifies (pff, ok, simplifies it a bit :) the already complex code in DSScorer and is not slower. > DisjunctionSumScorer small tweak > > > Key: LUCENE-1145 > URL: https://issues.apache.org/jira/browse/LUCENE-1145 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Environment: all >Reporter: Eks Dev >Priority: Trivial > Attachments: DisjunctionSumScorerOptimization.patch, > DSSQueueSizeOptimization.patch, TestScorerPerformance.java > > > Move ScorerDocQueue initialization from next() and skipTo() methods to the > Constructor. Makes DisjunctionSumScorer a bit faster (less than 1% on my > tests). > Downside (if this is one, I cannot judge) would be throwing IOException from > DisjunctionSumScorer constructors as we touch HardDisk there. I see no > problem as this IOException does not propagate too far (the only modification > I made is in BooleanScorer2) > if (scorerDocQueue == null) { > initScorerDocQueue(); > } > > Attached test is just quick & dirty rip of TestScorerPerf from standard > Lucene test package. Not included as patch as I do not like it. > All test pass, patch made on trunk revision 613923 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Closed: (LUCENE-1146) ConjunctionScorer small (ca. 3.5%) optimization
[ https://issues.apache.org/jira/browse/LUCENE-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev closed LUCENE-1146. --- Resolution: Incomplete Lucene Fields: [New] (was: [Patch Available, New]) not ready, patch too buggy > ConjunctionScorer small (ca. 3.5%) optimization > --- > > Key: LUCENE-1146 > URL: https://issues.apache.org/jira/browse/LUCENE-1146 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Environment: all >Reporter: Eks Dev >Priority: Minor > > ConjunctionScorer initialization is done lazily in the next() and skipTo() methods, > using one if(firstTime) check; this patch moves the initialization to the > constructor. The constructor already throws an IOException. The speed-up on JDK 5 & > 6 is in the 3.5%-4% range. The speed-up was measured with the standard > TestScorerPerf test in the Lucene test package (very dense bit sets). > A similar issue: > https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel > patch made on trunk revision: 614219 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1146) ConjunctionScorer small (ca. 3.5%) optimization
[ https://issues.apache.org/jira/browse/LUCENE-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1146: Attachment: (was: ConjuctionScorerInitialization.patch) > ConjunctionScorer small (ca. 3.5%) optimization > --- > > Key: LUCENE-1146 > URL: https://issues.apache.org/jira/browse/LUCENE-1146 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Environment: all >Reporter: Eks Dev >Priority: Minor > > ConjunctionScorer initialization is done lazy in next() and skipTo() methods, > using one if(firstTime) check, this patch moves this initialization to the > Constructor. Constructor already throws an IOException. speed-up on jdk 5 & > 6 is in the 3.5% - 4% range. Speed-up was measured with standard > TestScorerPerf test in Lucene test package (very dense bit sets) . > Similar issue is with: > https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel > patch made on trunk revision: 614219 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1146) ConjunctionScorer small (ca. 3.5%) optimization
[ https://issues.apache.org/jira/browse/LUCENE-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561375#action_12561375 ] Eks Dev commented on LUCENE-1146: - argh.. these were not core tests, all CoreTests pass with this patch... > ConjunctionScorer small (ca. 3.5%) optimization > --- > > Key: LUCENE-1146 > URL: https://issues.apache.org/jira/browse/LUCENE-1146 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Environment: all >Reporter: Eks Dev >Priority: Minor > Attachments: ConjuctionScorerInitialization.patch > > > ConjunctionScorer initialization is done lazy in next() and skipTo() methods, > using one if(firstTime) check, this patch moves this initialization to the > Constructor. Constructor already throws an IOException. speed-up on jdk 5 & > 6 is in the 3.5% - 4% range. Speed-up was measured with standard > TestScorerPerf test in Lucene test package (very dense bit sets) . > Similar issue is with: > https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel > patch made on trunk revision: 614219 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1146) ConjunctionScorer small (ca. 3.5%) optimization
[ https://issues.apache.org/jira/browse/LUCENE-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561370#action_12561370 ] Eks Dev commented on LUCENE-1146: - Whoops, some tests fail! > ConjunctionScorer small (ca. 3.5%) optimization > --- > > Key: LUCENE-1146 > URL: https://issues.apache.org/jira/browse/LUCENE-1146 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Environment: all >Reporter: Eks Dev >Priority: Minor > Attachments: ConjuctionScorerInitialization.patch > > > ConjunctionScorer initialization is done lazy in next() and skipTo() methods, > using one if(firstTime) check, this patch moves this initialization to the > Constructor. Constructor already throws an IOException. speed-up on jdk 5 & > 6 is in the 3.5% - 4% range. Speed-up was measured with standard > TestScorerPerf test in Lucene test package (very dense bit sets) . > Similar issue is with: > https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel > patch made on trunk revision: 614219 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1146) ConjunctionScorer small (ca. 3.5%) optimization
[ https://issues.apache.org/jira/browse/LUCENE-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1146:

Attachment: ConjuctionScorerInitialization.patch
[jira] Created: (LUCENE-1146) ConjunctionScorer small (ca. 3.5%) optimization
ConjunctionScorer small (ca. 3.5%) optimization
-----------------------------------------------

Key: LUCENE-1146
URL: https://issues.apache.org/jira/browse/LUCENE-1146
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Environment: all
Reporter: Eks Dev
Priority: Minor
Attachments: ConjuctionScorerInitialization.patch

ConjunctionScorer initialization is done lazily in next() and skipTo(), guarded by a single if(firstTime) check; this patch moves the initialization into the constructor, which already throws an IOException. The speed-up on JDK 5 and 6 is in the 3.5%-4% range, measured with the standard TestScorerPerf test in the Lucene test package (very dense bit sets).

A similar issue exists in:
https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Patch made on trunk revision 614219.
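The shape of the change can be sketched as follows. This is a minimal, hypothetical model, not the real Lucene class: sub-scorers are simplified to sorted arrays of doc ids, and the class and method names (EagerConjunctionScorer, doc()) are invented for illustration. The point it shows is the patch's core idea: the setup work that used to hide behind an if(firstTime) check executed on every next() call is done once in the constructor, so next() carries no per-call branch.

```java
import java.util.Arrays;

// Simplified conjunction ("AND") scorer over sorted doc-id arrays.
// Initialization happens in the constructor, not lazily in next().
class EagerConjunctionScorer {
    private final int[][] postings;
    private int cursor = 0; // position in the shortest postings list
    private int doc = -1;

    // In the real ConjunctionScorer the constructor already declares
    // IOException, so moving disk-touching initialization here adds
    // no new checked exception to the API.
    EagerConjunctionScorer(int[][] postings) {
        this.postings = postings.clone();
        // Formerly guarded by if(firstTime) inside next()/skipTo():
        // order sub-scorers so the rarest term drives the iteration.
        Arrays.sort(this.postings, (a, b) -> a.length - b.length);
    }

    // next() is now free of the per-call firstTime branch.
    boolean next() {
        if (postings.length == 0) return false;
        while (cursor < postings[0].length) {
            int candidate = postings[0][cursor++];
            boolean inAll = true;
            for (int i = 1; i < postings.length; i++) {
                if (Arrays.binarySearch(postings[i], candidate) < 0) {
                    inAll = false;
                    break;
                }
            }
            if (inAll) {
                doc = candidate;
                return true;
            }
        }
        return false;
    }

    int doc() {
        return doc;
    }
}
```

The measured 3.5%-4% gain is plausible precisely because next() is the hot loop in dense bit-set benchmarks like TestScorerPerf: even a single always-false branch per call adds up.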
[jira] Updated: (LUCENE-1145) DisjunctionSumScorer small tweak
[ https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1145:

Attachment: TestScorerPerformance.java

> DisjunctionSumScorer small tweak
> --------------------------------
>
> Key: LUCENE-1145
> URL: https://issues.apache.org/jira/browse/LUCENE-1145
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Environment: all
> Reporter: Eks Dev
> Priority: Trivial
> Attachments: DisjunctionSumScorerOptimization.patch, TestScorerPerformance.java
>
> Move the ScorerDocQueue initialization from the next() and skipTo() methods into the constructor. This makes DisjunctionSumScorer a bit faster (less than 1% in my tests).
> The downside (if it is one, I cannot judge) is that the DisjunctionSumScorer constructors now throw IOException, since we touch the hard disk there. I see no problem, as this IOException does not propagate far (the only other modification I made is in BooleanScorer2). The lazy check removed from next()/skipTo() is:
>
>   if (scorerDocQueue == null) {
>     initScorerDocQueue();
>   }
>
> The attached test is just a quick & dirty rip of TestScorerPerf from the standard Lucene test package; it is not included in the patch because I do not like it.
> All tests pass; patch made on trunk revision 613923.
[jira] Created: (LUCENE-1145) DisjunctionSumScorer small tweak
DisjunctionSumScorer small tweak
--------------------------------

Key: LUCENE-1145
URL: https://issues.apache.org/jira/browse/LUCENE-1145
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Environment: all
Reporter: Eks Dev
Priority: Trivial
Attachments: DisjunctionSumScorerOptimization.patch

Move the ScorerDocQueue initialization from the next() and skipTo() methods into the constructor. This makes DisjunctionSumScorer a bit faster (less than 1% in my tests).

The downside (if it is one, I cannot judge) is that the DisjunctionSumScorer constructors now throw IOException, since we touch the hard disk there. I see no problem, as this IOException does not propagate far (the only other modification I made is in BooleanScorer2). The lazy check removed from next()/skipTo() is:

  if (scorerDocQueue == null) {
    initScorerDocQueue();
  }

The attached test is just a quick & dirty rip of TestScorerPerf from the standard Lucene test package; it is not included in the patch because I do not like it.

All tests pass; patch made on trunk revision 613923.
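The tweak can be illustrated with a stripped-down model. This is a hypothetical sketch, not the real DisjunctionSumScorer: sub-scorers are reduced to sorted doc-id arrays, and a plain java.util.PriorityQueue stands in for Lucene's ScorerDocQueue. The invented names (EagerDisjunctionScorer, doc()) are for illustration only. What it demonstrates is the patch's idea: the queue is built once in the constructor rather than behind a null check inside next()/skipTo().

```java
import java.util.PriorityQueue;

// Simplified disjunction ("OR") scorer over sorted doc-id arrays.
// Entries in the queue are {currentDoc, listIndex, position}.
class EagerDisjunctionScorer {
    private final int[][] postings;
    private final PriorityQueue<int[]> queue;
    private int doc = -1;

    EagerDisjunctionScorer(int[][] postings) {
        this.postings = postings;
        // Built here instead of lazily via
        //   if (scorerDocQueue == null) initScorerDocQueue();
        // in next()/skipTo(). In the real patch this moves the
        // IOException to the constructor, which BooleanScorer2 absorbs.
        this.queue = new PriorityQueue<>((a, b) -> a[0] - b[0]);
        for (int i = 0; i < postings.length; i++) {
            if (postings[i].length > 0) {
                queue.add(new int[] {postings[i][0], i, 0});
            }
        }
    }

    // Returns each matching doc id once, in increasing order.
    boolean next() {
        while (!queue.isEmpty()) {
            int[] top = queue.poll();
            int candidate = top[0];
            // Re-queue the next doc from that sub-scorer's list.
            int list = top[1], pos = top[2] + 1;
            if (pos < postings[list].length) {
                queue.add(new int[] {postings[list][pos], list, pos});
            }
            if (candidate != doc) { // skip duplicates across lists
                doc = candidate;
                return true;
            }
        }
        return false;
    }

    int doc() {
        return doc;
    }
}
```

The trade-off the reporter weighs is visible here: constructor-time initialization is simpler and shaves a branch off the hot loop, but any I/O error now surfaces at construction, so every caller that news up the scorer must be prepared for it.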