[jira] [Commented] (LUCENE-7091) Add doc values support to MemoryIndex
[ https://issues.apache.org/jira/browse/LUCENE-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194961#comment-15194961 ] ASF subversion and git services commented on LUCENE-7091: - Commit cf3eea26406692306505d2606d7ff73ee3634c30 in lucene-solr's branch refs/heads/master from [~martijn.v.groningen] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cf3eea2 ] LUCENE-7091: Added doc values support to memory index > Add doc values support to MemoryIndex > - > > Key: LUCENE-7091 > URL: https://issues.apache.org/jira/browse/LUCENE-7091 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Martijn van Groningen >Assignee: David Smiley > Fix For: 6.0 > > Attachments: LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, > LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, > LUCENE-7091.patch, LUCENE-7091.patch > > > Sometimes queries executed via the MemoryIndex require certain things to be > stored as doc values. Today this isn't possible because the memory index > doesn't support this and these queries silently return no results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7091) Add doc values support to MemoryIndex
[ https://issues.apache.org/jira/browse/LUCENE-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194959#comment-15194959 ] ASF subversion and git services commented on LUCENE-7091: - Commit bd0803fd41c68297c54201529c2c14ad50cda48e in lucene-solr's branch refs/heads/branch_6_0 from [~martijn.v.groningen] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bd0803f ] LUCENE-7091: Added doc values support to memory index > Add doc values support to MemoryIndex > - > > Key: LUCENE-7091 > URL: https://issues.apache.org/jira/browse/LUCENE-7091 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Martijn van Groningen >Assignee: David Smiley > Fix For: 6.0 > > Attachments: LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, > LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, > LUCENE-7091.patch, LUCENE-7091.patch > > > Sometimes queries executed via the MemoryIndex require certain things to be > stored as doc values. Today this isn't possible because the memory index > doesn't support this and these queries silently return no results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7091) Add doc values support to MemoryIndex
[ https://issues.apache.org/jira/browse/LUCENE-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194960#comment-15194960 ] ASF subversion and git services commented on LUCENE-7091: - Commit 559432fcdcd828864a800072cf144e1f5c96647e in lucene-solr's branch refs/heads/branch_6x from [~martijn.v.groningen] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=559432f ] LUCENE-7091: Added doc values support to memory index > Add doc values support to MemoryIndex > - > > Key: LUCENE-7091 > URL: https://issues.apache.org/jira/browse/LUCENE-7091 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Martijn van Groningen >Assignee: David Smiley > Fix For: 6.0 > > Attachments: LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, > LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, > LUCENE-7091.patch, LUCENE-7091.patch > > > Sometimes queries executed via the MemoryIndex require certain things to be > stored as doc values. Today this isn't possible because the memory index > doesn't support this and these queries silently return no results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7091) Add doc values support to MemoryIndex
[ https://issues.apache.org/jira/browse/LUCENE-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193749#comment-15193749 ] David Smiley commented on LUCENE-7091: -- Eh, the grow algorithm/code still isn't right. The code in the patch will double the size even though it's likely plenty big to add the current long trying to get added. This is what I mean: {code:java} case SORTED_NUMERIC: if (info.numericProducer.dvLongValues == null) { info.numericProducer.dvLongValues = new long[4]; } info.numericProducer.dvLongValues = ArrayUtil.grow(info.numericProducer.dvLongValues, info.numericProducer.count + 1); info.numericProducer.dvLongValues[info.numericProducer.count++] = (long) docValuesValue; break; {code} Everything else is good. Assuming you agree with this code snippet above, +1 from me to commit. > Add doc values support to MemoryIndex > - > > Key: LUCENE-7091 > URL: https://issues.apache.org/jira/browse/LUCENE-7091 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Martijn van Groningen >Assignee: David Smiley > Attachments: LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, > LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, > LUCENE-7091.patch > > > Sometimes queries executed via the MemoryIndex require certain things to be > stored as doc values. Today this isn't possible because the memory index > doesn't support this and these queries silently return no results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7091) Add doc values support to MemoryIndex
[ https://issues.apache.org/jira/browse/LUCENE-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193519#comment-15193519 ] Martijn van Groningen commented on LUCENE-7091: --- Right, speed over memory, I'll change it to the doubling logic you initially suggested. > Add doc values support to MemoryIndex > - > > Key: LUCENE-7091 > URL: https://issues.apache.org/jira/browse/LUCENE-7091 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Martijn van Groningen >Assignee: David Smiley > Attachments: LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, > LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch > > > Sometimes queries executed via the MemoryIndex require certain things to be > stored as doc values. Today this isn't possible because the memory index > doesn't support this and these queries silently return no results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7091) Add doc values support to MemoryIndex
[ https://issues.apache.org/jira/browse/LUCENE-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193500#comment-15193500 ] David Smiley commented on LUCENE-7091: -- My main concern is that it not be O(N^2). Personally I'm not too concerned with ArrayUtil.grow's algorithm. You might also pick an initial size of something like '4' for the SORTED_NUMERIC case (since it implies the intent to add more than 1). With MemoryIndex I think speed is generally more important than memory size any way. > Add doc values support to MemoryIndex > - > > Key: LUCENE-7091 > URL: https://issues.apache.org/jira/browse/LUCENE-7091 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Martijn van Groningen >Assignee: David Smiley > Attachments: LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, > LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch > > > Sometimes queries executed via the MemoryIndex require certain things to be > stored as doc values. Today this isn't possible because the memory index > doesn't support this and these queries silently return no results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7091) Add doc values support to MemoryIndex
[ https://issues.apache.org/jira/browse/LUCENE-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193486#comment-15193486 ] Martijn van Groningen commented on LUCENE-7091: --- > in storeDocValues() SORTED_NUMERIC: you call ArrayUtil.grow with only the > array. This results in the same O(N^2) we're trying to avoid! Pass in a > second argument of the desired length. Maybe I was a bit concerned about the size of these arrays. ArrayUtil.grow(array) grows an array by an 1/8th and at least adds 3 additional slots, I thought that would be enough, since we don't know upfront how many new values are going to be added. I can change it double the array instead. > Add doc values support to MemoryIndex > - > > Key: LUCENE-7091 > URL: https://issues.apache.org/jira/browse/LUCENE-7091 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Martijn van Groningen >Assignee: David Smiley > Attachments: LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, > LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch > > > Sometimes queries executed via the MemoryIndex require certain things to be > stored as doc values. Today this isn't possible because the memory index > doesn't support this and these queries silently return no results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7091) Add doc values support to MemoryIndex
[ https://issues.apache.org/jira/browse/LUCENE-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193364#comment-15193364 ] David Smiley commented on LUCENE-7091: -- Almost there... * (still applies to the _other_ addField): I think the javadocs sentence you added to addField meant to use "if" not "is". * At first I thought there might have been a bug for double-applying the boost since I see you're still passing "boost" as a constructor argument to Info. But now I see you only apply when numTokens > 0. I think it would be much clearer (and simpler) to remove boost from the constructor to Info, and simply apply it in storeTerms (no matter what numTokens is). It's hard to judge the testDocValuesDoNotAffectBoostPositionsOrOffset for this problem... it'd get encoded in the norm and I have no idea what norm it should be; your test asserts -127. H. Perhaps simply leave a check of that nature to the test that asserts parity with the real index in RAMDirectory * in {{storeDocValues()}} SORTED_NUMERIC: you call ArrayUtil.grow with only the array. This results in the same {{O(N^2)}} we're trying to avoid! Pass in a second argument of the desired length. > Add doc values support to MemoryIndex > - > > Key: LUCENE-7091 > URL: https://issues.apache.org/jira/browse/LUCENE-7091 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Martijn van Groningen > Attachments: LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, > LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch > > > Sometimes queries executed via the MemoryIndex require certain things to be > stored as doc values. Today this isn't possible because the memory index > doesn't support this and these queries silently return no results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7091) Add doc values support to MemoryIndex
[ https://issues.apache.org/jira/browse/LUCENE-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192781#comment-15192781 ] David Smiley commented on LUCENE-7091: -- I _really_ like how this refactor has turned out of making Info mutable. * I think the javadocs sentence you added to {{addField}} meant to use "if" not "is". * In {{getInfoForExpectedDocValuesType()}}, the two checks using {{(info.fieldInfo.getDocValuesType()}} read a little weird/confusing to me but I get the gist of the intent. Couldn't the first condition simply check {{expectedType == DocValuesType.NONE}} to return null? That's logically the same I think and could even be checked at the very front of the method. * getSortedSetDocValues's impl ought to reset index to 0 on setDocument(). * getDocsWithField doesn't need to actually implement a bits, it can use {{return new Bits.MatchAllBits(1);}} * I think there's a bug in addField's positionIncrementGap & offsetGap handling. If fields are added to MemoryIndex that are DV then terms data, the posInc & offset will get incremented when it shouldn't be. This could be corrected by checking info.numTokens. Please modify or add a test for this. The boost should be applied conditionally as well (only when storeTerms is invoked). * {{storeDocValues()}}: for SORTED_NUMERIC I'm concerned about that growing algorithim... it's O(N^2) as it creates a new array each time that is just one larger. Ouch. Maybe use a straight-forward array doubling algorithm and keep track of the number of values? * It's a shame that SORTED & BINARY use a BytesRefHash (adds overhead) and ultimately get sorted when, really, it's not necessary of course. The ByteBlockPool could be used directly to store it (see BytesRefArray for examples) with a little bit of code. This isn't a blocker but it would sure be nice. * {{testDocValues()}} should test non-set behavior of sorted numerics, and set behavior of sorted set DV. Could be done easily by changing or adding a couple lines of code. Add term text here too, and under same field names as DV ones at that. > Add doc values support to MemoryIndex > - > > Key: LUCENE-7091 > URL: https://issues.apache.org/jira/browse/LUCENE-7091 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Martijn van Groningen > Attachments: LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch, > LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch > > > Sometimes queries executed via the MemoryIndex require certain things to be > stored as doc values. Today this isn't possible because the memory index > doesn't support this and these queries silently return no results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7091) Add doc values support to MemoryIndex
[ https://issues.apache.org/jira/browse/LUCENE-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15191098#comment-15191098 ] David Smiley commented on LUCENE-7091: -- bq. Returning the same doc values instance for binary, sorted, number and sorted number doc values is fine, but not for sorted set doc values. Good point. Some comments: * addField: unfortunately this method is very long and it's difficult to follow. I understand that it's not easy to split it up because of the number of local variables. One thing that would help is renaming "longValues" and "bytesValuesSet" to make them clearly associated with doc-values. I suggest "dvLongValues" and "dvBytesValuesSet" and add a comment to the former {{//NOT a set}}. Another thing that would help is comments to declare the different phases of this method... like definitely before the switch(docValuesType) and at other junctures. But I already see some code duplication in how numericProducer & binaryProducer are initialized. Here's an idea: Maybe Info could be changed to hold this state mutably. Then, there wouldn't be a long stage of pulling out each var from the info only to put it all back again. If this idea is successful, there would be much fewer local variables, and then you could easily extract a method to handle the DV stuff and a separate method for the Terms stuff. What do you think? * instead of freeze() knowing to call both getNormDocValues & prepareDocValues (and to sort terms), I suggest that freeze be implemented on each Info where those methods can be called there. I think that's easier to maintain. ... to be continued; I didn't finish reviewing ... > Add doc values support to MemoryIndex > - > > Key: LUCENE-7091 > URL: https://issues.apache.org/jira/browse/LUCENE-7091 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Martijn van Groningen > Attachments: LUCENE-7091.patch, LUCENE-7091.patch, LUCENE-7091.patch > > > Sometimes queries executed via the MemoryIndex require certain things to be > stored as doc values. Today this isn't possible because the memory index > doesn't support this and these queries silently return no results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7091) Add doc values support to MemoryIndex
[ https://issues.apache.org/jira/browse/LUCENE-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190743#comment-15190743 ] Martijn van Groningen commented on LUCENE-7091: --- Thanks for looking at this! I'll try to split this DocValuesHolder class into separate classes per DV type. bq. In your patch you do an anonymous class for each one but I think it would be clearer to not do that and eliminate the DocValuesHolder which is holding for multiple types of values at once which is a little confusing. MemoryIndex is supposed to be thread safe after the freeze() method has been invoked. (this is describe in the jdocs) Returning the same doc values instance for binary, sorted, number and sorted number doc values is fine, but not for sorted set doc values. That implementation keeps state in order to iterate over the values. So for sorted set doc values we always need to return a new instance and in order to be consistent with the other doc values types I did they same thing there too. I think this isn't a big of deal. bq. Could you move fields() above MemoryFields so it reads more naturally? Sure, will do bq. Why is there a different sorting approach to sorted numeric vs BytesRef DV types? The former you add to an array and sort eventually, and the latter you use a TreeSet. They could both be handled consistently – array first then sort later. That is what I thought initially too, but the difference between the two is that sorted numeric DV can hold duplicates while sort set DV doesn't hold duplicates. This is why I took the approach of using treeset during the building phase for any binary DV. bq. I noticed the BytesRefs DV data isn't copied; we retain a reference. Is that allowed? Good point, let me fix this. > Add doc values support to MemoryIndex > - > > Key: LUCENE-7091 > URL: https://issues.apache.org/jira/browse/LUCENE-7091 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Martijn van Groningen > Attachments: LUCENE-7091.patch, LUCENE-7091.patch > > > Sometimes queries executed via the MemoryIndex require certain things to be > stored as doc values. Today this isn't possible because the memory index > doesn't support this and these queries silently return no results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7091) Add doc values support to MemoryIndex
[ https://issues.apache.org/jira/browse/LUCENE-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189769#comment-15189769 ] David Smiley commented on LUCENE-7091: -- Cool. Instead of DocValuesHolder, how about actually having one implementation class per DocValues type that holds the value. So, a MemoryNumericDocValues for example. In your patch you do an anonymous class for each one but I think it would be clearer to not do that and eliminate the DocValuesHolder which is holding for multiple types of values at once which is a little confusing. Note SortedDocValues extends BinaryDocValues so that eliminates an implementation. I wish Lucene's DocValue abstractions were interfaces and not abstract classes since it prevents collapsing some other similar implementations into fewer concrete classes. Could you move fields() above MemoryFields so it reads more naturally? Why is there a different sorting approach to sorted numeric vs BytesRef DV types? The former you add to an array and sort eventually, and the latter you use a TreeSet. They could both be handled consistently -- array first then sort later. I noticed the BytesRefs DV data isn't copied; we retain a reference. Is that allowed? I suspect not; seems very risky. Notice that the term index data uses BytesRefHash wrapped around the shared ByteBlockPool. Perhaps we should do similarly using the same ByteBlockPool. > Add doc values support to MemoryIndex > - > > Key: LUCENE-7091 > URL: https://issues.apache.org/jira/browse/LUCENE-7091 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Martijn van Groningen > Attachments: LUCENE-7091.patch, LUCENE-7091.patch > > > Sometimes queries executed via the MemoryIndex require certain things to be > stored as doc values. Today this isn't possible because the memory index > doesn't support this and these queries silently return no results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7091) Add doc values support to MemoryIndex
[ https://issues.apache.org/jira/browse/LUCENE-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189014#comment-15189014 ] Martijn van Groningen commented on LUCENE-7091: --- bq. Is there any reason for DocValuesHolder to be package-private? No, nothing outside MemoryIndex uses it. I can change the visibility to private. > Add doc values support to MemoryIndex > - > > Key: LUCENE-7091 > URL: https://issues.apache.org/jira/browse/LUCENE-7091 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Martijn van Groningen > Attachments: LUCENE-7091.patch > > > Sometimes queries executed via the MemoryIndex require certain things to be > stored as doc values. Today this isn't possible because the memory index > doesn't support this and these queries silently return no results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7091) Add doc values support to MemoryIndex
[ https://issues.apache.org/jira/browse/LUCENE-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188994#comment-15188994 ] Alan Woodward commented on LUCENE-7091: --- +1, nice! Is there any reason for DocValuesHolder to be package-private? > Add doc values support to MemoryIndex > - > > Key: LUCENE-7091 > URL: https://issues.apache.org/jira/browse/LUCENE-7091 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Martijn van Groningen > Attachments: LUCENE-7091.patch > > > Sometimes queries executed via the MemoryIndex require certain things to be > stored as doc values. Today this isn't possible because the memory index > doesn't support this and these queries silently return no results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org