Re: Welcome Ben Trent as Lucene committer

2023-01-28 Thread LuXugang
Congratulations and welcome, Ben!

Xugang

https://www.amazingkoala.com.cn




> On Jan 28, 2023, at 00:51, Anshum Gupta  wrote:
> 
> Congratulations and welcome, Ben!



Re: Dense union of doc IDs

2022-11-07 Thread LuXugang
+1. If we had a new BulkAdder that could detect long runs of set bits, could 
it also be used in LRUQueryCache to cache partially dense doc ID sets, 
instead of always building a huge BitSet sized by maxDoc?

Xugang

https://www.amazingkoala.com.cn




> On Nov 4, 2022, at 08:15, Michael Froh  wrote:
> 
> Hi,
> 
> I was recently poking around in the createWeight implementation for 
> MultiTermQueryConstantScoreWrapper to get to the bottom of some slow queries, 
> and I realized that the worst-case performance could be pretty bad, but 
> (maybe) possible to optimize for.
> 
> Imagine if we have a segment with N docs and our MultiTermQuery expands to 
> hit M terms, where each of the M terms matches N-1 docs. (If we matched all N 
> docs, then Greg Miller's recent optimization to replace the MultiTermQuery 
> with a TermQuery would kick in.) In this case, my understanding is that we 
> would iterate through all the terms and pass each one's postings to a 
> DocIdSetBuilder, which will iterate through the postings to set bits. This 
> whole thing would be O(MN), I think.
> 
> I was thinking that it would be cool if the DocIdSetBuilder could detect long 
> runs of set bits and advance() each DISI to skip over them (since they're 
> guaranteed not to contribute anything new to the union). In the worst case 
> that I described above, I think it would make the whole thing O(M log N) 
> (assuming advance() takes log time). 
> 
> At the risk of overcomplicating things, maybe DocIdSetBuilder could use a 
> third ("dense") BulkAdder implementation that kicks in once enough bits are 
> set, to efficiently implement the "or" operation to skip over known 
> (sufficiently long) runs of set bits?
> 
> Would something like that be useful? Is the "dense union of doc IDs" case 
> common enough to warrant it?
> 
> Thanks,
> Froh
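
A minimal sketch of the run-skipping idea described above, using plain sorted 
int arrays and java.util.BitSet rather than Lucene's DocIdSetBuilder or 
DocIdSetIterator. The class and method names (DenseUnionSketch, orInto, 
advance) are hypothetical, and the binary search only stands in for a 
log-time advance():

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.stream.IntStream;

public class DenseUnionSketch {

    /** Index of the first element >= target in sorted[from..], standing in for advance(). */
    static int advance(int[] sorted, int from, int target) {
        int idx = Arrays.binarySearch(sorted, from, sorted.length, target);
        return idx >= 0 ? idx : -idx - 1;
    }

    /**
     * ORs one sorted postings list into bits. When a doc is already set, the
     * whole run of set bits is skipped, because it cannot add anything new to
     * the union. Returns how many postings entries were actually visited.
     */
    static int orInto(BitSet bits, int[] postings) {
        int visited = 0;
        int i = 0;
        while (i < postings.length) {
            int doc = postings[i];
            visited++;
            if (bits.get(doc)) {
                int runEnd = bits.nextClearBit(doc); // first unset doc after the run
                i = advance(postings, i + 1, runEnd);
            } else {
                bits.set(doc);
                i++;
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Two terms that each match N-1 of N docs, as in the worst case above.
        int[] termA = IntStream.range(0, 1000).filter(d -> d != 500).toArray();
        int[] termB = IntStream.range(0, 1000).filter(d -> d != 700).toArray();
        BitSet bits = new BitSet(1000);
        System.out.println("visited for termA: " + orInto(bits, termA));
        System.out.println("visited for termB: " + orInto(bits, termB));
        System.out.println("docs in union:     " + bits.cardinality());
    }
}
```

In this toy case the first list is visited in full, while the second list 
only touches 3 entries, since everything else falls inside runs that are 
already set. Whether that pays off in Lucene depends on the real cost of 
advance() and of detecting runs, which this sketch does not model.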



Re: [Lucene] Selection of threshold

2021-07-02 Thread LuXugang
Thanks for sharing your ideas, Adrien.

> On Jul 2, 2021, at 1:26 AM, Adrien Grand wrote:
> 
> Hi,
> 
> This is just a number that proved to work well in practice.
> 
> The general idea is that we want to narrow down the set of candidates 
> periodically in order to speed up query execution. If we do it too often, 
> then we might spend more time narrowing down the set of candidates than 
> actually evaluating candidates, and if we don't do it often enough, then 
> we're still evaluating lots of candidates that have no chance of being 
> competitive and the query is slow too. What the code samples you shared mean 
> is that Lucene would only re-evaluate the set of candidates whenever it seems 
> that we could reduce the number of candidates by 8x.
> 
> On Thu, Jul 1, 2021 at 11:57 AM LuXugang  wrote:
> Hi,
> 
> While reading the Lucene source code, I had a small question about how the 
> threshold is chosen: threshold = value >>> 3.
> 
> E.g., in NumericComparator#updateCompetitiveIterator(), 'threshold = 
> iteratorCost >>> 3' is used as a condition for whether to update the iterator.
> 
> E.g., in IndexOrDocValuesQuery, 'threshold = cost() >>> 3' is used as a 
> condition for choosing between indexScorerSupplier and dvScorerSupplier.
> 
> So is the choice of threshold based on some theory, a tradeoff, or some 
> other reason?
> 
> Could I get some suggestions?
> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org 
> <mailto:dev-unsubscr...@lucene.apache.org>
> For additional commands, e-mail: dev-h...@lucene.apache.org 
> <mailto:dev-h...@lucene.apache.org>
> 
> 
> 
> -- 
> Adrien
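
To make the "8x" in the reply concrete: for a non-negative long, x >>> 3 is 
the same as x / 8. The sketch below is not Lucene code. The class and method 
names (ThresholdSketch, worthNarrowing) are hypothetical, and it only 
illustrates the kind of check described above, where the candidate set is 
re-evaluated only if it would shrink by at least 8x:

```java
public class ThresholdSketch {

    /** One eighth of the current cost; >>> 3 divides a non-negative long by 8. */
    static long threshold(long currentCost) {
        return currentCost >>> 3;
    }

    /**
     * Returns true if narrowing down the candidates looks worthwhile, i.e. the
     * estimated number of remaining matches is below one eighth of the
     * current cost.
     */
    static boolean worthNarrowing(long currentCost, long estimatedMatches) {
        return estimatedMatches < threshold(currentCost);
    }

    public static void main(String[] args) {
        long currentCost = 8_000;
        System.out.println(threshold(currentCost));             // 1000
        System.out.println(worthNarrowing(currentCost, 999));   // true
        System.out.println(worthNarrowing(currentCost, 1_000)); // false
    }
}
```

A shift is used instead of a division presumably because it is cheap and the 
exact factor does not matter much. As the reply says, 8 is simply a value 
that worked well in practice.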



[Lucene] Selection of threshold

2021-07-01 Thread LuXugang
Hi,

While reading the Lucene source code, I had a small question about how the 
threshold is chosen: threshold = value >>> 3.

E.g., in NumericComparator#updateCompetitiveIterator(), 'threshold = 
iteratorCost >>> 3' is used as a condition for whether to update the iterator.

E.g., in IndexOrDocValuesQuery, 'threshold = cost() >>> 3' is used as a 
condition for choosing between indexScorerSupplier and dvScorerSupplier.

So is the choice of threshold based on some theory, a tradeoff, or some other 
reason?

Could I get some suggestions?





Re: Use DirectMonotonicWriter store sorted NumericDocValues

2021-06-16 Thread LuXugang
Thanks, Robert and Adrien. Your replies are helpful to me.

> On Jun 15, 2021, at 10:19 PM, Robert Muir wrote:
> 
> Well it definitely wouldn't be as useful as changing to a
> postings-style approach. That would bring a lot more benefits to
> general cases, e.g. use of PFOR and so on.
> 
> But it is also easier to implement right now, to accelerate cases
> where fields are sorted, without hurting other things.
> 
> On Tue, Jun 15, 2021 at 9:53 AM Adrien Grand  wrote:
>> 
>> SegmentWriteState has a reference to SegmentInfo, which itself has the index 
>> sort, so I believe that it would be possible.
>> 
>> I wonder how useful it would be in practice. E.g. in the Elasticsearch case, 
>> even though we store lots of time-based data and have been looking into 
>> index sorting for storage/query efficiency reasons, the index sorts that we 
>> are interested in in practice look more like `host.name ASC, @timestamp 
>> DESC` than just `@timestamp DESC`. The reason for sorting by `host` first is 
>> that it helps a lot with storage/query efficiency of metadata that is tied 
>> to the host (e.g. IP addresses, operating system, etc.), and then because 
>> `host.name` is usually a low-cardinality field, queries by descending 
>> timestamp remain super efficient thanks to LUCENE-9280. So we'd be more 
>> interested in an optimization that would support piecewise monotonic fields.
>> 
>> On Tue, Jun 15, 2021 at 3:33 PM Robert Muir  wrote:
>>> 
>>> +1 to that idea. Maybe a shorter-term possibility would be to only do
>>> this compression on a field when the user has explicitly configured
>>> index sorting on the field (can we hackishly peek at it and tell?)
>>> 
>>> On Tue, Jun 15, 2021 at 9:04 AM Adrien Grand  wrote:
>>>> 
>>>> I believe that this sort of optimization would be more effective and 
>>>> robust if we made doc values look more like postings, with relatively 
>>>> small blocks of values that would get compressed independently and 
>>>> decompressed in bulk. This way, we wouldn't require data to be sorted 
>>>> across entire segments for this optimization to kick in, and we would be 
>>>> less likely to slow down the normal case.
>>>> 
>>>> On Tue, Jun 15, 2021 at 12:06 PM Robert Muir  wrote:
>>>>> 
>>>>> We did this monotonic detection/compression before in older times, but
>>>>> had to remove it because it caused too many slowdowns.
>>>>> 
>>>>> I think it easily causes too much type pollution, for example, for a
>>>>> typical large index with unsorted docvalues field, big segments
>>>>> won't be sorted, tiny segments with a few values might happen to be
>>>>> sorted (depending on chance/luck), tiny tiny ones with e.g. a single
>>>>> document are sorted. Now we have a mix of monotonic and non-monotonic
>>>>> over the same field.
>>>>> 
>>>>> On the other hand, the optimization is very fragile and rare: even for
>>>>> these log users actually sorting on that field at index-time, it will
>>>>> just apply to one field out of the somehow typical dozens/hundreds
>>>>> that they like to have. But may destroy performance of all the other
>>>>> fields and overall causes more harm than good.
>>>>> 
>>>>> On Tue, Jun 15, 2021 at 5:49 AM LuXugang  
>>>>> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> In Lucene80DocValuesConsumer#writeValues(FieldInfo field, 
>>>>>> DocValuesProducer valuesProducer), all numeric doc values are visited 
>>>>>> to calculate the GCD. In the same pass, we could check whether all 
>>>>>> values are sorted. If so, maybe we could use DirectMonotonicWriter to 
>>>>>> store them. DirectMonotonicWriter can achieve impressive compression.
>>>>>> 
>>>>>> In addition, when I use Elasticsearch to store numeric field types, at 
>>>>>> the Lucene level the data is always stored at least as 
>>>>>> NumericDocValues/SortedNumericDocValues. So when indexing sorted values 
>>>>>> such as IDs or timestamps, the optimization above may be applicable.
>>>>>> 
>>>>>> Could I have some suggestions?

Use DirectMonotonicWriter store sorted NumericDocValues

2021-06-15 Thread LuXugang
Hi,

In Lucene80DocValuesConsumer#writeValues(FieldInfo field, DocValuesProducer 
valuesProducer), all numeric doc values are visited to calculate the GCD. In 
the same pass, we could check whether all values are sorted. If so, maybe we 
could use DirectMonotonicWriter to store them. DirectMonotonicWriter can 
achieve impressive compression.

In addition, when I use Elasticsearch to store numeric field types, at the 
Lucene level the data is always stored at least as 
NumericDocValues/SortedNumericDocValues. So when indexing sorted values such 
as IDs or timestamps, the optimization above may be applicable.

Could I have some suggestions?
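
A minimal sketch of the single-pass check proposed above, written against a 
plain long[] rather than Lucene's doc values APIs. The names 
(MonotonicCheckSketch, Stats, scan) are hypothetical, and unlike the real 
writeValues() code this ignores overflow when subtracting values:

```java
public class MonotonicCheckSketch {

    /** Result of one pass over the values. */
    static final class Stats {
        final long gcd;       // GCD of the deltas from the first value
        final boolean sorted; // true if the values never decrease

        Stats(long gcd, boolean sorted) {
            this.gcd = gcd;
            this.sorted = sorted;
        }
    }

    /** Euclid's algorithm on absolute values; gcd(0, x) is |x|. */
    static long gcd(long a, long b) {
        a = Math.abs(a);
        b = Math.abs(b);
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    /** Computes the GCD and the sortedness flag in the same loop. */
    static Stats scan(long[] values) {
        long gcd = 0;
        boolean sorted = true;
        for (int i = 1; i < values.length; i++) {
            gcd = gcd(gcd, values[i] - values[0]); // assumes no overflow
            if (values[i] < values[i - 1]) {
                sorted = false;
            }
        }
        return new Stats(gcd, sorted);
    }

    public static void main(String[] args) {
        Stats timestamps = scan(new long[] {1000, 1010, 1020, 1050});
        System.out.println(timestamps.gcd + " " + timestamps.sorted);
        Stats unsorted = scan(new long[] {5, 3, 9});
        System.out.println(unsorted.gcd + " " + unsorted.sorted);
    }
}
```

The extra comparison per value is cheap on the write side. The sketch says 
nothing about read-side cost, such as a single field ending up with a mix of 
monotonic and non-monotonic encodings across segments.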



[Lucene] BYTE_BLOCK_SIZE in class ByteBlockPool

2021-04-02 Thread LuXugang
Hi, 

In class ByteBlockPool, each buffer's size (BYTE_BLOCK_SIZE) is set to 32 KB 
(1 << 15). Is this choice related to the CPU L1 cache, whose size is commonly 
32 KB?

If not, could anyone explain the reasoning?



Re: [Lucene] confusion in posting encoding

2021-01-13 Thread LuXugang
Thanks for your answer, Adrien.

> On Jan 13, 2021, at 9:26 PM, Adrien Grand wrote:
> 
> Hello,
> 
> It is indeed because I could get the compiler to use SIMD instructions with 
> the loop written this way.
> 
> On Wed, Jan 13, 2021 at 11:29 AM LuXugang  wrote:
> Hi Adrien,
> 
> I am confused about the method collapse8(long[] arr) in the ForUtil class.
> 
> <Pasted Graphic-1.png>
> 
> 
> On line 85, the loop runs 16 times, because there are 128 elements in arr 
> and eight elements are processed per iteration.
> 
> My question is why the elements are not chosen in order: for example, 
> arr[0] ~ arr[7] in the first iteration, arr[8] ~ arr[15] in the second 
> iteration, and so on.
> 
> Is it because the compiler can't generate SIMD instructions otherwise, or is 
> there some other reason?
> 
> Could you help me find the answer here?
> 
> 
> -- 
> Adrien
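
For readers without the screenshot, here is a sketch of the two access 
patterns being compared. The strided loop is written from memory of how 
collapse8 is laid out and may not match the ForUtil source exactly. The 
in-order variant is hypothetical and only shows what the question proposes. 
The class and method names are made up for this example:

```java
public class Collapse8Sketch {

    /**
     * Strided packing: iteration i combines arr[i], arr[16 + i], ..., arr[112 + i].
     * Every iteration does the same shifts and ORs on neighboring elements of
     * eight 16-element slices, a shape that is easier to auto-vectorize.
     */
    static void collapseStrided(long[] arr) {
        for (int i = 0; i < 16; ++i) {
            arr[i] = (arr[i] << 56) | (arr[16 + i] << 48) | (arr[32 + i] << 40)
                    | (arr[48 + i] << 32) | (arr[64 + i] << 24) | (arr[80 + i] << 16)
                    | (arr[96 + i] << 8) | arr[112 + i];
        }
    }

    /**
     * In-order packing: iteration i combines arr[8 * i] .. arr[8 * i + 7].
     * The eight inputs of one output are adjacent, so each output needs a
     * reduction across neighboring elements.
     */
    static void collapseInOrder(long[] arr) {
        for (int i = 0; i < 16; ++i) {
            long packed = 0;
            for (int j = 0; j < 8; ++j) {
                packed = (packed << 8) | arr[8 * i + j];
            }
            arr[i] = packed;
        }
    }

    public static void main(String[] args) {
        long[] a = new long[128];
        long[] b = new long[128];
        for (int k = 0; k < 128; k++) {
            a[k] = k; // every value fits in 8 bits
            b[k] = k;
        }
        collapseStrided(a);
        collapseInOrder(b);
        System.out.println(Long.toHexString(a[0])); // bytes 00 10 20 30 40 50 60 70
        System.out.println(Long.toHexString(b[0])); // bytes 00 01 02 03 04 05 06 07
    }
}
```

Both variants pack 128 8-bit values into 16 longs. They differ only in which 
inputs end up in which output, so the decoder has to use the matching layout.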



[Lucene] confusion in posting encoding

2021-01-13 Thread LuXugang
Hi Adrien,

I am confused about the method collapse8(long[] arr) in the ForUtil class.

On line 85, the loop runs 16 times, because there are 128 elements in arr and 
eight elements are processed per iteration.

My question is why the elements are not chosen in order: for example, 
arr[0] ~ arr[7] in the first iteration, arr[8] ~ arr[15] in the second 
iteration, and so on.

Is it because the compiler can't generate SIMD instructions otherwise, or is 
there some other reason?

Could you help me find the answer here?

Re: [Lucene] Add javadoc for Lucene86PointsFormat class

2020-12-23 Thread LuXugang
Thanks, David.

My userId is luxugang



> On Dec 23, 2020, at 21:31, David Smiley  wrote:
> 
> Please register at the ASF's Confluence / wiki space.  Then tell me your 
> userId, and I will then grant you permissions to edit our wiki.
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley 
> <http://www.linkedin.com/in/davidwsmiley>
> 
> On Wed, Dec 23, 2020 at 2:50 AM LuXugang  wrote:
> Hi, David
> 
> Here is the link to an article on my personal website that introduces the 
> .kdd and .kdi index files: 
> https://www.amazingkoala.com.cn/Lucene_Document/IndexFile/2020/1104/175.html
> <http://www.amazingkoala.com.cn/Lucene_Document/IndexFile/2020/1104/175.html>
> 
> If it's OK, I would like to rewrite it on 
> https://cwiki.apache.org/confluence/display/LUCENE/Home 
> <https://cwiki.apache.org/confluence/display/LUCENE/Home>, but I do not have 
> permission to edit. 
> 
> Jira link: https://issues.apache.org/jira/browse/LUCENE-9590?filter=-2 
> <https://issues.apache.org/jira/browse/LUCENE-9590?filter=-2>
> 
> Actually, I have written some other articles introducing Lucene, but in 
> Chinese. If needed, I would be glad to write more in English.
> 
> 
> 
> >> On Oct 30, 2020, at 16:57, LuXugang <mailto:xugan...@icloud.com.INVALID> wrote:
> >> 
> >> Thanks, David. Adding a link in the javadocs is a great idea. Got it.
>> 
>>> On Oct 30, 2020, at 12:45 PM, David Smiley >> <mailto:dsmi...@apache.org>> wrote:
>>> 
>>> Fantastic contribution! 
>>> 
>>> I don't think we have images in our javadocs, but if you can prove me wrong 
>>> then great!  We could link to it from javadocs and host it at the 
>>> Confluence based wiki here: 
>>> https://cwiki.apache.org/confluence/display/LUCENE/Home 
>>> <https://cwiki.apache.org/confluence/display/LUCENE/Home>
>>> 
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley 
>>> <http://www.linkedin.com/in/davidwsmiley>
>>> 
>>> On Wed, Oct 28, 2020 at 11:39 AM LuXugang >> <mailto:xugan...@icloud.com.invalid>> wrote:
>>> Hi,
>>> 
> >>> I would like to add javadoc for the Lucene86PointsFormat class. It would 
> >>> be really helpful for readers of the source to understand the data 
> >>> structure of point values.
> >>> 
> >>> The attachment shows part of the data structure (a colored box means it 
> >>> has a sub-structure). <1.png>
>>> 
>>> 
>> 
> 



Re: [Lucene] Add javadoc for Lucene86PointsFormat class

2020-12-22 Thread LuXugang
Hi, David

Here is the link to an article on my personal website that introduces the 
.kdd and .kdi index files: 
https://www.amazingkoala.com.cn/Lucene_Document/IndexFile/2020/1104/175.html
<http://www.amazingkoala.com.cn/Lucene_Document/IndexFile/2020/1104/175.html>

If it's OK, I would like to rewrite it on 
https://cwiki.apache.org/confluence/display/LUCENE/Home 
<https://cwiki.apache.org/confluence/display/LUCENE/Home>, but I do not have 
permission to edit. 

Jira link: https://issues.apache.org/jira/browse/LUCENE-9590?filter=-2 
<https://issues.apache.org/jira/browse/LUCENE-9590?filter=-2>

Actually, I have written some other articles introducing Lucene, but in 
Chinese. If needed, I would be glad to write more in English.



> On Oct 30, 2020, at 16:57, LuXugang wrote:
> 
> Thanks, David. Adding a link in the javadocs is a great idea. Got it.
> 
>> On Oct 30, 2020, at 12:45 PM, David Smiley > <mailto:dsmi...@apache.org>> wrote:
>> 
>> Fantastic contribution! 
>> 
>> I don't think we have images in our javadocs, but if you can prove me wrong 
>> then great!  We could link to it from javadocs and host it at the Confluence 
>> based wiki here: https://cwiki.apache.org/confluence/display/LUCENE/Home 
>> <https://cwiki.apache.org/confluence/display/LUCENE/Home>
>> 
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley 
>> <http://www.linkedin.com/in/davidwsmiley>
>> 
>> On Wed, Oct 28, 2020 at 11:39 AM LuXugang > <mailto:xugan...@icloud.com.invalid>> wrote:
>> Hi,
>> 
>> I would like to add javadoc for the Lucene86PointsFormat class. It would be 
>> really helpful for readers of the source to understand the data structure 
>> of point values.
>> 
>> The attachment shows part of the data structure (a colored box means it has 
>> a sub-structure). <1.png>
>> 
>> 
> 



Re: [Lucene] Add javadoc for Lucene86PointsFormat class

2020-11-04 Thread LuXugang
Hi,

I just wrote an article about the point-value data structure used by the 
Lucene86PointsFormat class. Link: 
https://www.amazingkoala.com.cn/Lucene_Document/IndexFile/2020/1104/175.html 
<https://www.amazingkoala.com.cn/Lucene_Document/IndexFile/2020/1104/175.html>

I hope it helps when reading the source code for the points data structure.

> On Oct 30, 2020, at 4:57 PM, LuXugang  wrote:
> 
> Thanks, David. Adding a link in the javadocs is a great idea. Got it.
> 
>> On Oct 30, 2020, at 12:45 PM, David Smiley > <mailto:dsmi...@apache.org>> wrote:
>> 
>> Fantastic contribution! 
>> 
>> I don't think we have images in our javadocs, but if you can prove me wrong 
>> then great!  We could link to it from javadocs and host it at the Confluence 
>> based wiki here: https://cwiki.apache.org/confluence/display/LUCENE/Home 
>> <https://cwiki.apache.org/confluence/display/LUCENE/Home>
>> 
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley 
>> <http://www.linkedin.com/in/davidwsmiley>
>> 
>> On Wed, Oct 28, 2020 at 11:39 AM LuXugang > <mailto:xugan...@icloud.com.invalid>> wrote:
>> Hi,
>> 
>> I would like to add javadoc for the Lucene86PointsFormat class. It would be 
>> really helpful for readers of the source to understand the data structure 
>> of point values.
>> 
>> The attachment shows part of the data structure (a colored box means it has 
>> a sub-structure). <1.png>
>> 
>> 
> 



Re: [Lucene] Add javadoc for Lucene86PointsFormat class

2020-10-30 Thread LuXugang
Thanks, David. Adding a link in the javadocs is a great idea. Got it.

> On Oct 30, 2020, at 12:45 PM, David Smiley  wrote:
> 
> Fantastic contribution! 
> 
> I don't think we have images in our javadocs, but if you can prove me wrong 
> then great!  We could link to it from javadocs and host it at the Confluence 
> based wiki here: https://cwiki.apache.org/confluence/display/LUCENE/Home 
> <https://cwiki.apache.org/confluence/display/LUCENE/Home>
> 
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley 
> <http://www.linkedin.com/in/davidwsmiley>
> 
> On Wed, Oct 28, 2020 at 11:39 AM LuXugang  wrote:
> Hi,
> 
> I would like to add javadoc for the Lucene86PointsFormat class. It would be 
> really helpful for readers of the source to understand the data structure 
> of point values.
> 
> The attachment shows part of the data structure (a colored box means it has 
> a sub-structure). <1.png>
> 
>