Question about field weight value checks in CombinedFieldQuery

2024-09-24 Thread S G
Hello! I have been exploring the use of BM25F queries (CombinedFieldQuery) in Lucene, and I have noticed that field weights with values less than 1.0f are not allowed (as shown here: https://github.com/apache/lucene/blob/53d1c2bd2fb3e6b9da590bee360996dbbdc8ea34/lucene/sandbox/src/java/

Re: Question about the performance of Lucene99PostingsFormat

2024-09-16 Thread Rui Wu
space-efficiency reasons. We >> have many indexes that use IndexOptions.DOCS where the move from PFOR to >> FOR significantly increased disk usage (unlike indexes that use >> IndexOptions.DOCS_AND_FREQS_AND_POSITIONS where space is typically >> dominated by positions anyw

Re: Question about the performance of Lucene99PostingsFormat

2024-09-11 Thread Rui Wu
the move from PFOR to > FOR significantly increased disk usage (unlike indexes that use > IndexOptions.DOCS_AND_FREQS_AND_POSITIONS where space is typically > dominated by positions anyway). > Got it. Thanks! > > On Tue, Sep 10, 2024 at 9:31 PM Rui Wu wrote: > > > Dear exp

Re: Question about the performance of Lucene99PostingsFormat

2024-09-10 Thread Adrien Grand
ndexes that use IndexOptions.DOCS_AND_FREQS_AND_POSITIONS where space is typically dominated by positions anyway). On Tue, Sep 10, 2024 at 9:31 PM Rui Wu wrote: > Dear experts, > > I have a question about the following change: > The Lucene9.11 changed the Posting list format > (Lucene GITHUB#12696 <https://gi

Question about the performance of Lucene99PostingsFormat

2024-09-10 Thread Rui Wu
Dear experts, I have a question about the following change: Lucene 9.11 changed the postings list format (Lucene GITHUB#12696 <https://github.com/apache/lucene/pull/12696>: Change Postings back to using FOR in Lucene99PostingsFormat. Freqs, positions and offset keep using PFOR) However,
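
To make the FOR versus PFOR trade-off discussed in this thread concrete, here is a minimal size-model sketch in Java. It is not Lucene's actual encoder: the block size of 128, the cap of 7 exceptions, the 2-byte cost per exception, and all class and method names are assumptions chosen for illustration. The model also ignores header bytes and any limit on how large a patched value may be.

```java
import java.util.Arrays;

/** Simplified size model for FOR vs. PFOR blocks; not Lucene's real encoder. */
final class ForVsPforSketch {
    static final int MAX_EXCEPTIONS = 7;      // assumed cap on patched values per block
    static final int BYTES_PER_EXCEPTION = 2; // assumed cost: 1 byte index + 1 byte high bits

    /** Bits needed to represent a non-negative value (at least 1). */
    static int bitsRequired(long v) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
    }

    /** FOR: every value is packed with the bit width of the largest value. */
    static int forBytes(long[] block) {
        long max = Arrays.stream(block).max().orElse(0);
        return (bitsRequired(max) * block.length + 7) / 8;
    }

    /** PFOR: pack with a narrower width and patch up to MAX_EXCEPTIONS outliers. */
    static int pforBytes(long[] block) {
        long[] sorted = block.clone();
        Arrays.sort(sorted);
        int best = forBytes(block); // zero exceptions is always an option
        for (int ex = 1; ex <= MAX_EXCEPTIONS && ex < sorted.length; ex++) {
            // The packed width is set by the largest value that is NOT an exception.
            long maxPacked = sorted[sorted.length - 1 - ex];
            int bytes = (bitsRequired(maxPacked) * block.length + 7) / 8
                    + ex * BYTES_PER_EXCEPTION;
            best = Math.min(best, bytes);
        }
        return best;
    }

    public static void main(String[] args) {
        long[] block = new long[128];
        Arrays.fill(block, 3); // most doc-ID deltas are tiny...
        block[17] = 1000;      // ...but one outlier forces FOR to use 10 bits per value
        System.out.println("FOR bytes:  " + forBytes(block));
        System.out.println("PFOR bytes: " + pforBytes(block));
    }
}
```

Under these assumptions a single outlier makes the FOR block several times larger than the PFOR block, which is consistent with the report in this thread that indexes using IndexOptions.DOCS grew on disk after the move from PFOR to FOR.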

Re: Lucene LRUQueryCache question

2024-07-16 Thread Yixun Xu
This post explains why Lucene doesn't cache all queries: https://www.mail-archive.com/java-user@lucene.apache.org/msg51649.html Your queries could be skipping the cache because of the LRUQueryCache constructor parameters, or because of the QueryCachingPolicy.shouldCache predicate. They probably ha

Lucene LRUQueryCache question

2024-07-02 Thread Δημήτρης Κλειναυτάκης
Hi all, I am using Lucene 9.6 and I am trying to add queries into LRUQueryCache from my benchmarks that evaluate the queries and create the LRUQueryCache. At first, I believed that Lucene puts queries into the query cache by default, but that was never the case. So, I read the LRUQueryCac

Re: Filter question

2023-11-21 Thread Mikhail Khludnev
l be built > from something like "product:a". (The boost queries serve to push recent > content higher in the results, they don't really matter for this > question.) > > > > There will always be a content query. My question relates to the filter > qu

Filter question

2023-11-21 Thread Trevor Nicholls
from something like "product:a". (The boost queries serve to push recent content higher in the results, they don't really matter for this question.) There will always be a content query. My question relates to the filter query. I know that a query which has no 'positive

Re: Question - Why stopwords.txt provided by smartcn contains blank lines?

2023-05-15 Thread Jerry Chin
Hi Michael, Thanks for clarifying, I have created an issue to follow up in Github. Much appreciated! On Monday, May 15, 2023, Michael McCandless wrote: > Hi Jerry, > > I agree, that makes no sense! Maybe the stopword loader should ignore > truly

Re: Question - Why stopwords.txt provided by smartcn contains blank lines?

2023-05-15 Thread Michael McCandless
Hi Jerry, I agree, that makes no sense! Maybe the stopword loader should ignore truly blank lines? Also, the comments on lines 57 and 59 are confusing -- there are no (default) English and Chinese stopwords in the file. I guess they are placeholders. Could you open an issue in Lucene's GitHub
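
The suggestion above (have the loader ignore truly blank lines) can be sketched as follows. This is a standalone illustration, not the actual smartcn loading code: the class and method names are hypothetical, and treating "//" as the comment prefix is an assumption. Only zero-length lines are skipped, so a line consisting of whitespace would still be kept as a stopword.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stopword parser that skips truly blank lines and comment lines. */
final class StopwordLoaderSketch {
    static final String COMMENT_PREFIX = "//"; // assumed comment marker

    static Set<String> parse(String content) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : content.split("\\R")) {
            // Skip zero-length lines and comments; keep everything else verbatim.
            if (line.isEmpty() || line.startsWith(COMMENT_PREFIX)) {
                continue;
            }
            words.add(line);
        }
        return words;
    }

    public static void main(String[] args) {
        String content = "// punctuation\n,\n.\n\n// placeholder section\n\n";
        Set<String> stopwords = parse(content);
        System.out.println(stopwords.size() + " stopwords, contains empty string: "
                + stopwords.contains(""));
    }
}
```

With this rule the blank lines in the file would no longer produce an empty-string entry in the stop set.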

Question - Why stopwords.txt provided by smartcn contains blank lines?

2023-05-15 Thread Jerry Chin
Hi all, The following file contains two blank lines (lines 56 and 58): https://github.com/apache/lucene/blob/main/lucene/analysis/smartcn/src/resources/org/apache/lucene/analysis/cn/smart/stopwords.txt As a result, SmartChineseAnalyzer.getDefaultStopSet() will produce an empty string as st

Question - ClassicSimilarity method not being called

2023-05-14 Thread Usman Shaikh
oadsTest.java#L76> is failing. The document "warning label maker" should be returned lower, because the other two documents should be boosted high (using the payload). This isn't happening. Sorry if my question isn't clear, hopefully the unit test is. Thanks Usman Shaikh

Re: Question about index segment search order

2023-05-13 Thread Uwe Schindler
Hi, in reference to previous code references and discussions from other Lucene committers I have to clarify: * If you run the query multithreaded (per segment), this means when you add an Executor to IndexSearcher, the order is not predictable, plain and simple * If you use Solr, a single

Re: Question about index segment search order

2023-05-11 Thread Wei
Hi Michael, Yes the collector counts hits across all segments. Thanks for the suggestion, I'm also asking the question on solr-dev. Wei On Thu, May 11, 2023 at 11:57 AM Michael Sokolov wrote: > Maybe ask this issue on solr-dev then? I'm not familiar with how that > collecto

Re: Question about index segment search order

2023-05-11 Thread Michael Sokolov
Maybe ask this issue on solr-dev then? I'm not familiar with how that collector works. Does it count hits across all segments? only within a single segment? On Tue, May 9, 2023 at 1:36 PM Wei wrote: > > Hi Michael, > > I am applying early termination with Solr's EarlyTerminatingCollector > https:

Re: Question about index segment search order

2023-05-09 Thread Wei
Hi Michael, I am applying early termination with Solr's EarlyTerminatingCollector https://github.com/apache/solr/blob/d9ddba3ac51ece953d762c796f62730e27629966/solr/core/src/java/org/apache/solr/search/EarlyTerminatingCollector.java , which triggers EarlyTerminatingCollectorException in SolrIndexSe

Re: Question about index segment search order

2023-05-04 Thread Michael Sokolov
Yes, sorry I didn't mean to imply you couldn't control this if you want to. I guess in the typical setup it is not predictable. How are you applying early termination? Are you using a standard Lucene Collector or do you have your own? On Thu, May 4, 2023 at 2:03 PM Patrick Zhai wrote: > > Hi Mike

Re: Question about index segment search order

2023-05-04 Thread Patrick Zhai
Hi Mike, Just want to mention if the user chooses to use single thread to index and use LogXXMergePolicy then the document order will be preserved as index order. On Thu, May 4, 2023 at 10:04 AM Wei wrote: > Hi Michael, > > We are interested in the segment sequence for early termination. In ou

Re: Question about index segment search order

2023-05-04 Thread Wei
Hi Michael, We are interested in the segment sequence for early termination. In our case there is always a large dominant segment after index rebuild, then many small segments are generated with continuous updates as time goes by. When early termination is applied, the limit could be reached just

Re: Question about index segment search order

2023-05-04 Thread Michael Sokolov
There is no meaning to the sequence. The segments are created concurrently by many threads and the merge process will merge them without regards to any ordering. On Wed, May 3, 2023, 1:09 PM Patrick Zhai wrote: > For that part I'm not entirely sure, if other folks know it please chime in > :)

Re: Question about index segment search order

2023-05-03 Thread Patrick Zhai
For that part I'm not entirely sure, if other folks know it please chime in :) On Wed, May 3, 2023 at 8:48 AM Wei wrote: > Thanks Patrick! In the default case when no LeafSorter is provided, are the > segments traversed in the order of creation time, i.e. the oldest segment > is always visited f

Re: Question about index segment search order

2023-05-03 Thread Wei
Thanks Patrick! In the default case when no LeafSorter is provided, are the segments traversed in the order of creation time, i.e. the oldest segment is always visited first? Wei On Tue, May 2, 2023 at 7:22 PM Patrick Zhai wrote: > Hi Wei, > Lucene in general iterate through the index in the or

Re: Question about index segment search order

2023-05-02 Thread Patrick Zhai
Hi Wei, Lucene in general iterates through the index in the order recorded in the SegmentInfos. At search time, you can specify the order using LeafSorter.
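
As a rough illustration of why the visiting order matters for early termination, here is a plain-Java model. It does not use Lucene: the Segment class, its creation counter, and the per-segment hit counts are hypothetical stand-ins, and in a real setup the comparator would be supplied over leaf readers rather than over this stub.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model of segment visiting order with an early-terminating collector. */
final class LeafOrderSketch {
    /** Hypothetical stand-in for a segment. */
    static final class Segment {
        final String name;
        final long createdAt; // assumed monotonically increasing creation counter
        final int hits;       // hits this segment would contribute to the query

        Segment(String name, long createdAt, int hits) {
            this.name = name;
            this.createdAt = createdAt;
            this.hits = hits;
        }
    }

    /** Visits segments in the given order and stops once `limit` hits are collected. */
    static List<String> visitUntil(List<Segment> segments, Comparator<Segment> order, int limit) {
        List<Segment> sorted = new ArrayList<>(segments);
        sorted.sort(order);
        List<String> visited = new ArrayList<>();
        int collected = 0;
        for (Segment s : sorted) {
            if (collected >= limit) {
                break; // early termination: remaining segments are never searched
            }
            visited.add(s.name);
            collected += s.hits;
        }
        return visited;
    }

    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment("big", 1, 1000)); // dominant segment from the rebuild
        segments.add(new Segment("small1", 2, 5)); // later incremental updates
        segments.add(new Segment("small2", 3, 5));

        Comparator<Segment> oldestFirst = Comparator.comparingLong(s -> s.createdAt);
        System.out.println("oldest first: " + visitUntil(segments, oldestFirst, 10));
        System.out.println("newest first: " + visitUntil(segments, oldestFirst.reversed(), 10));
    }
}
```

In this model, oldest-first ordering fills the limit from the large segment alone, while newest-first ordering fills it from the small, recently written segments.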

Question about index segment search order

2023-05-02 Thread Wei
Hello, We have an index that has multiple segments generated with continuous updates. Does Lucene have a specific order when iterating through the segments (assuming a single query thread)? Can the order be customized so that the most recently generated segments are searched first? Thanks, Wei

Re: Question about current situation of good first issues in GitHub

2023-03-11 Thread Shunya Ueta
And sorry for the very late response. I completely missed your kind response. Thank you again! I will try to contribute to Apache Lucene. : ) Regards! On Sat, Mar 11, 2023 at 16:03, Shunya Ueta wrote: > Oh, Thank you very much. > I don't know those beginner-friendly labels. > I will try to find the good first iss

Re: Question about current situation of good first issues in GitHub

2023-03-10 Thread Shunya Ueta
Oh, Thank you very much. I didn't know about those beginner-friendly labels. I will try to find a good first issue. On Sat, Jan 14, 2023 at 5:51, Michael Sokolov wrote: > That label seems to be something GitHub created automatically? > > You might have better luck browsing the full list of labels. I found these: > >

Re: Question about searcherManager applyAllDeletes parameter and maybeRefresh method

2023-03-03 Thread Patrick Zhai
o search over >>> refreshed docs before we call writer.Commit(). So we are thinking of >>> constructing the SearcherManager with applyAllDeletes true. However, we >>> find there is another method ReferenceManager.maybeRefresh() >>> < >>> https://lucene.apa

Re: Question about searcherManager applyAllDeletes parameter and maybeRefresh method

2023-03-02 Thread Ningshan Li
er, we >> find there is another method ReferenceManager.maybeRefresh() >> < >> https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/search/ReferenceManager.html#maybeRefresh-- >> > >> where >> we refresh instances. >> >> My question is if we periodically

Re: Question about searcherManager applyAllDeletes parameter and maybeRefresh method

2023-03-02 Thread Patrick Zhai
/lucene.apache.org/core/7_4_0/core/org/apache/lucene/search/ReferenceManager.html#maybeRefresh-- > > > where > we refresh instances. > > My question is if we periodically call maybeRefresh(), and make sure the > manager is refreshed every time before we acquire a new searcher, do we > still

Question about searcherManager applyAllDeletes parameter and maybeRefresh method

2023-03-02 Thread Ningshan Li
tml#maybeRefresh--> where we refresh instances. My question is if we periodically call maybeRefresh(), and make sure the manager is refreshed every time before we acquire a new searcher, do we still need to set applyAllDeletes to true when constructing the manager? For example, I wrote a simple test like:

DoubleLeafComparator Question

2023-02-16 Thread Puneeth Bikkumanla
tation, I see that it is using Double::longBitsToDouble. My question is why does Lucene use that and not NumericUtils::sortableLongToDouble <https://github.com/apache/lucene/blob/8340b01c3cc229f33584ce2178b07b8984daa6a9/lucene/core/src/java/org/apache/lucene/util/NumericUtils.java#L56>? I am a
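
To show what distinguishes the two decodings named in this question, here is a small self-contained sketch. The helper methods are re-implemented locally and are meant to mirror the bit trick behind NumericUtils' sortable encoding as I understand it, so treat them as an assumption rather than a copy of Lucene's code. The general point is that a stored long must be decoded with the inverse of whichever encoding wrote it.

```java
/** Raw IEEE 754 bits vs. a sortable long encoding of doubles (illustrative only). */
final class SortableDoubleSketch {
    /** Flips the 63 non-sign bits of negative values so that long order matches double order. */
    static long sortableDoubleBits(long bits) {
        return bits ^ ((bits >> 63) & 0x7fffffffffffffffL);
    }

    static long doubleToSortableLong(double value) {
        return sortableDoubleBits(Double.doubleToLongBits(value));
    }

    static double sortableLongToDouble(long encoded) {
        return Double.longBitsToDouble(sortableDoubleBits(encoded));
    }

    public static void main(String[] args) {
        double a = -2.5;
        double b = -1.0;
        // Raw bits of negative doubles compare in the opposite order to the doubles themselves.
        System.out.println("raw bits compare:      "
                + Long.compare(Double.doubleToLongBits(a), Double.doubleToLongBits(b)));
        // The sortable encoding restores numeric order when comparing as longs.
        System.out.println("sortable bits compare: "
                + Long.compare(doubleToSortableLong(a), doubleToSortableLong(b)));
        // Decoding with the matching inverse recovers the original value.
        System.out.println("round trip:            "
                + sortableLongToDouble(doubleToSortableLong(a)));
    }
}
```

If values were written as raw bits, decoding them with the sortable inverse would corrupt every negative value, and vice versa, so the choice of decoder follows from how the field was indexed.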

Re: Question for SynonymQuery

2023-01-27 Thread Mikhail Khludnev
> us. > > Regards, > Anh Dung Bui > > On Mon, Jan 2, 2023 at 8:07 Mikhail Khludnev wrote: > > > Hello Anh, > > I was intrigued by your question. And I managed it to work somehow. > > see > > > > > https://github.com/mkhludnev/likely/blob/eval-mulyw-s

Re: Question for SynonymQuery

2023-01-27 Thread Mikhail Khludnev
urns out I used FlattenGraphFilter and cause the PositionLength to be > > all 1 and resulted in the behavior above =) > > > > A side note is that we don't need to use WORD_SEPARATOR in the synonym > > file. SynonymMap.Parser.analyze would tokenize and append the separato

Re: Question for SynonymQuery

2023-01-20 Thread _ SATNAM
> file. SynonymMap.Parser.analyze would tokenize and append the separator for > us. > > Regards, > Anh Dung Bui > > On Mon, Jan 2, 2023 at 8:07 Mikhail Khludnev wrote: > > > Hello Anh, > > I was intrigued by your question. And I managed it to work somehow. >

Re: Question for SynonymQuery

2023-01-18 Thread Anh Dũng Bùi
gards, Anh Dung Bui On Mon, Jan 2, 2023 at 8:07 Mikhail Khludnev wrote: > Hello Anh, > I was intrigued by your question. And I managed it to work somehow. > see > > https://github.com/mkhludnev/likely/blob/eval-mulyw-syns/src/test/java/org/apache/lucene/playground/TestMultiPulty.ja

Re: Question about current situation of good first issues in GitHub

2023-01-13 Thread Michael Sokolov
That label seems to be something GitHub created automatically? You might have better luck browsing the full list of labels. I found these: https://github.com/apache/lucene/labels/legacy-jira-label%3Anewbie https://github.com/apache/lucene/labels/legacy-jira-label%3Anewdev https://github.com/apach

Re: Question about current situation of good first issues in GitHub

2023-01-10 Thread Uwe Schindler
Hi, The old JIRA labels are also in Github. See tags named "legacy-jira-label:*". The equivalent search would be this: https://github.com/apache/lucene/labels/legacy-jira-label%3Anewdev Uwe On 10.01.2023 at 12:41, Stefan Vodita wrote: Hello Shunya, As far as I know, GitHub issues are not m

Question about current situation of good first issues in GitHub

2023-01-10 Thread Stefan Vodita
Hello Shunya, As far as I know, GitHub issues are not marked for new developers yet. The project migrated a few months ago from Jira to GitHub issues, so you can still search the old labels in Jira. In particular, there is `newdev` for good starter issues [1]. Hope this helps, Stefan [1] https

Question about current situation of good first issues in GitHub

2023-01-08 Thread Shunya Ueta
Hello Lucene users. Last time I checked `good first issue` in GitHub issues to start a contribution of Lucene. https://github.com/apache/lucene/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22 But currently no issues with this label. I don't know the current operation of this label, b

RE: Question for SynonymQuery

2023-01-03 Thread Trevor Nicholls
nuary 2023 09:55 To: java-user@lucene.apache.org Subject: Re: Question for SynonymQuery Hello Trevor. Can you help me better understand this approach? If we have a text "wifi router" and inject "internet device" at indexing time, terms reside at the same positions. How to avoid fals

Re: Question for SynonymQuery

2023-01-02 Thread Michael Wechner
fi device" is closer to "internet device" than "wifi router" to "internet device" using the model "all-mpnet-base-v2", whereas if you consider "wifi device" a false positive, then it is not helpful of course, but it might be useful otherwise consid

Re: Question for SynonymQuery

2023-01-02 Thread Mikhail Khludnev
during searching - and I'm sure you'll be doing far more > searching than indexing). > > cheers > T > > > -Original Message- > From: Michael Wechner > Sent: Thursday, 29 December 2022 08:56 > To: java-user@lucene.apache.org > Subject: Re: Question for

RE: Question for SynonymQuery

2023-01-02 Thread Trevor Nicholls
ing). cheers T -Original Message- From: Michael Wechner Sent: Thursday, 29 December 2022 08:56 To: java-user@lucene.apache.org Subject: Re: Question for SynonymQuery Hi Anh The following Stackoverflow link might help https://stackoverflow.com/questions/73240494/can-someone-assist-me-with-a-mul

Re: Question for SynonymQuery

2023-01-01 Thread Mikhail Khludnev
Hello Anh, I was intrigued by your question. And I managed it to work somehow. see https://github.com/mkhludnev/likely/blob/eval-mulyw-syns/src/test/java/org/apache/lucene/playground/TestMultiPulty.java Beware, synonym files https://github.com/mkhludnev/likely/blob/eval-mulyw-syns/src/test

Re: Question for SynonymQuery

2022-12-28 Thread Anh Dũng Bùi
Thanks everyone for the insight. I guess I'll use BooleanQuery then. There is also a caveat I noticed (not sure if it's an issue or not), which is slightly different from the mentioned thread. When I have a multi-word synonym, let say "wifi router" and "internet device". Then using SynonymGraphFil

Re: Question for SynonymQuery

2022-12-28 Thread Michael Wechner
Hi Anh The following Stackoverflow link might help https://stackoverflow.com/questions/73240494/can-someone-assist-me-with-a-multi-word-synonym-problem-in-lucene The following thread seems to confirm, that escaping the space with a backslash does not help https://lists.apache.org/list?java-u

Re: Question for SynonymQuery

2022-12-28 Thread Mikhail Khludnev
Hello. 1)Yes. That's the purpose. 2) I've skimmed through QueryBuilder.java. Conclusion is that it creates BQ.SHOULD (however, there should be something like DisjunctionMaxQuery) over PhraseQuery or MultiPhraseQuery (-ies). Good hack! On Wed, Dec 28, 2022 at 2:23 AM Anh Dũng Bùi wrote: > Hi Luc

Question for SynonymQuery

2022-12-27 Thread Anh Dũng Bùi
Hi Lucene users, I recently came across SynonymQuery and found out that it only supports single-term synonyms (since it accepts a list of Term which will be considered as synonyms). We have some multi-term synonyms like "internet device" <-> "wifi router" or "dns" <-> "domain name service". Am I r

Question regarding UnifiedHighlighter and SynonymGraphFilter

2022-09-13 Thread Rune Stilling
Hi list I’m using the SynonymGraphFilter in Lucene 8.11 together with the UnifiedHighlighter class. I’m not sure if they are supposed to work seamlessly together but I’m having issues with the highlighter showing partial matches of multi token synonym phrases. Ie. the word “in” in the following

Re: Lucene Suggester APIs question

2022-08-20 Thread Dawid Weiss
l#add-org.apache.lucene.util.IntsRef-T- This should be fast as fst construction is linear with data size (if it's sorted). Dawid D. On Mon, Aug 15, 2022 at 4:36 PM Nitish Jain wrote: > > Hi, > > I have a question about lucene suggester APIs. If I build multiple FSTs > usin

Lucene Suggester APIs question

2022-08-15 Thread Nitish Jain
Hi, I have a question about lucene suggester APIs. If I build multiple FSTs using a suggester, is there a way to merge two generated FSTs? -- Nitish Jain

Re: Lucene Suggester APIs question

2022-08-14 Thread Mikhail Khludnev
Hello Nitish. What about https://lucene.apache.org/core/7_2_1/core/org/apache/lucene/util/automaton/Operations.html#union-org.apache.lucene.util.automaton.Automaton-org.apache.lucene.util.automaton.Automaton- ? On Mon, Aug 15, 2022 at 4:42 AM Nitish Jain wrote: > Hi, > > I have a

Re: Question about Benchmark

2022-05-17 Thread Michael Sokolov
OK I replied on the issue. This ann-benchmarks is a separate project, and I think you are asking about how to change it. Probably should take it up with erikbern or whatever community is supporting that actively. I just created a "plugin" so we could use it to test Lucene's KNN implementation, but

Re: Question about Benchmark

2022-05-17 Thread balmukund mandal
Hi All, My apologies for not mentioning the benchmark which I was using. Also, I realized that I had not subscribed to this group, hence duplicating this mail. The below queries are for ANN-Benchmark https://issues.apache.org/jira/browse/LUCENE-9625 Indexing takes a long time, so is there a way

Re: Question about Benchmark

2022-05-16 Thread Mikhail Khludnev
Hi, Balmukund. Assuming you are asking about the Lucene benchmark module. 1) If one builds the index once, it's possible to start the benchmark with ResetSystemSoft, which keeps index files intact and allows benchmarking search again and again without waiting for a long reindex. 2) Check indexing-multithreaded.alg

Re: Question about Benchmark

2022-05-16 Thread Adrien Grand
Hi Balmukund, What benchmark are you talking about? On Mon, May 16, 2022 at 4:35 PM balmukund mandal wrote: > > Hi All, > I was trying to run the benchmark and had a couple of questions. Indexing > takes a long time, so is there a way to configure the benchmark to use an > already existing index

Question about Benchmark

2022-05-16 Thread balmukund mandal
Hi All, I was trying to run the benchmark and had a couple of questions. Indexing takes a long time, so is there a way to configure the benchmark to use an already existing index for search? Also, is there a way to configure the benchmark to use multiple threads for indexing (looks to me that it’s

Re: RangeFacetsCount Question

2022-04-26 Thread Greg Miller
t something like drill sideways?? > > On Thu, Apr 21, 2022 at 6:08 PM Marc D'Mello wrote: > > > > Hi, > > > > I had a quick question about RangeFacetsCounts > > <https://github.com/apache/lucene/blob/a071180a806d1bb7ae11ae30a07e43e452bea810/lucene/

Re: RangeFacetsCount Question

2022-04-26 Thread Michael Sokolov
pport something like drill sideways?? On Thu, Apr 21, 2022 at 6:08 PM Marc D'Mello wrote: > > Hi, > > I had a quick question about RangeFacetsCounts > <https://github.com/apache/lucene/blob/a071180a806d1bb7ae11ae30a07e43e452bea810/lucene/facet/src/java/org/apache/lucene/facet/ran

RangeFacetsCount Question

2022-04-21 Thread Marc D'Mello
Hi, I had a quick question about RangeFacetsCounts <https://github.com/apache/lucene/blob/a071180a806d1bb7ae11ae30a07e43e452bea810/lucene/facet/src/java/org/apache/lucene/facet/range/RangeFacetCounts.java#L65>, I'm a bit confused by the fastMatchQuery param. Specifically, I was wonde

RE: synonym question

2022-03-14 Thread Trevor Nicholls
Just to confirm, escaping the spaces in synonym table construction, query construction, or both, does not solve the problem. -Original Message- From: Trevor Nicholls Sent: Tuesday, 15 March 2022 05:02 To: java-user@lucene.apache.org Subject: RE: synonym question Hi, thanks for such a

RE: synonym question

2022-03-14 Thread Trevor Nicholls
e.org Subject: Re: synonym question Hello, just a guess, have you tried escaping the space in your multi-word terms with backslash? isoweek,iso\ week Regards Bernd On 14.03.22 at 15:54, Trevor Nicholls wrote: > I have technical data which I am querying with Lucene; one of the > feat

Re: synonym question

2022-03-14 Thread Bernd Fehling
Hello, just a guess, have you tried escaping the space in your multi-word terms with backslash? isoweek,iso\ week Regards Bernd On 14.03.22 at 15:54, Trevor Nicholls wrote: I have technical data which I am querying with Lucene; one of the features of the content is that a large number of te

synonym question

2022-03-14 Thread Trevor Nicholls
I have technical data which I am querying with Lucene; one of the features of the content is that a large number of technical terms may be written as multiple words or as a compound word. For example, ISOWEEK or ISO WEEK. Or SynonymFilter or synonym filter. I have a synonym table which includes

Re: Question about using Lucene to search source code

2021-12-20 Thread Michael Wechner
Hi Yuxin Can you provide a concrete example of a query and a document/code snippet? Thanks Michael On 20.12.21 at 03:06, Yuxin Liu wrote: Dear development community of Lucene: Hi from student research assistant Yuxin Liu. I'm using Lucene to build an index search for source code indexes usi

Question about using Lucene to search source code

2021-12-20 Thread Yuxin Liu
Dear development community of Lucene: Hi from student research assistant Yuxin Liu. I'm using Lucene to build an index search for source code indexes using TF-IDF similarity. I have a set of source code snippets and I want to use part of the source code snippet as a query and obtain the document wi

Re: A question on PhraseQuery and slop

2021-12-13 Thread Michael Sokolov
I wonder if the Analysis chain could be involved. If those stop words ("is") are removed without leaving a hole somehow, then that could explain? On Mon, Dec 13, 2021 at 9:35 AM Michael McCandless wrote: > > Hello Claude, > > Hmm, that is interesting that you see slop=2 matching query "quick fox"

Re: A question on PhraseQuery and slop

2021-12-13 Thread Michael McCandless
Hello Claude, Hmm, that is interesting that you see slop=2 matching query "quick fox" against document "the fox is quick". Edit distance (Levenshtein) is a bit tricky because it might include a transposition (just swapping the two words) as edit distance 1 OR 2. So maybe Lucene's PhraseQuery is

A question on PhraseQuery and slop

2021-12-10 Thread Claude Lepere
Hello. The explanation of https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/PhraseQuery.html#getSlop states that the edit distance between "quick fox" and "the fox is quick" would be a

Re: Question about readVint & writeVint from DataOutput and DataInput

2021-09-03 Thread Aaron Cohen
Thank you for the clarification. > On Sep 3, 2021, at 10:46 AM, Uwe Schindler wrote: > > They are fully supported, so you can write and read them. > > The problem with negative numbers is that they need lot of (disk) space, > because in two's complement they have almost all bits set. The large

Re: Question about readVint & writeVint from DataOutput and DataInput

2021-09-03 Thread Uwe Schindler
They are fully supported, so you can write and read them. The problem with negative numbers is that they need a lot of (disk) space, because in two's complement they have almost all bits set. The largest number in terms of disk space is -1. Negative numbers appear in older index formats, so they
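
The space cost described above can be seen with a small standalone sketch of a variable-length int encoding: seven payload bits per byte, low-order group first, with the high bit of each byte marking that more bytes follow. The class and method names are hypothetical, and this is an illustration of the scheme rather than Lucene's DataOutput/DataInput code.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative variable-length int codec: 7 payload bits per byte, high bit = "more follows". */
final class VIntSketch {
    static byte[] writeVInt(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80); // low 7 bits plus continuation flag
            i >>>= 7;                     // unsigned shift, so negative values terminate
        }
        out.write(i);                     // final byte has the continuation flag clear
        return out.toByteArray();
    }

    static int readVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // last byte of this value
            }
        }
        return value;
    }

    public static void main(String[] args) {
        int[] samples = {1, 127, 128, 16384, -1};
        for (int v : samples) {
            byte[] encoded = writeVInt(v);
            System.out.println(v + " -> " + encoded.length + " byte(s), decoded "
                    + readVInt(encoded));
        }
    }
}
```

Small non-negative values fit in one or two bytes, whereas any negative int has its top bit set and therefore always occupies the maximum of five bytes in this scheme, one more than a fixed-width int.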

Question about readVint & writeVint from DataOutput and DataInput

2021-09-03 Thread Aaron Cohen
While reading the Lucene JavaDoc I came across writeVInt & readVInt from DataOutput and DataInput base

Re: hello~~i have a question

2021-08-02 Thread Michael Wechner
b.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/store/FSDirectory.java HTH Michael On 02.08.21 at 08:07, nic k wrote: hello i have a question, so im sending u an e-mail when searching in lucene, i wonder if it reads from the oldest segment or the most recently created se

hello~~i have a question

2021-08-02 Thread nic k
hello i have a question, so im sending u an e-mail when searching in lucene, i wonder if it reads from the oldest segment or the most recently created segment when i test it, i think it reads the oldest file first, but i ask for conviction, not conjecture please im looking around and i cant

Question Regarding Computing Vocabulary Size

2020-04-07 Thread jointcc2 .
Hello, I am a master's student currently working on a search engine project on BM25Similarity. My question is about computing the vocabulary size of a single document. I have looked through the code base but have not found anything useful for that specific application. I am wondering if

Re: ComplexPhraseQueryParser performance question

2020-02-13 Thread baris . kazar
Thanks Mikhail. On 2/13/20 5:05 AM, Mikhail Khludnev wrote: Hello, I picked two first questions for reply. does this class offer any Shingling capability embedded to it? No, it doesn't allow to expand wildcard phrase with shingles. I could not find any api within this class ComplexPhra

Re: ComplexPhraseQueryParser performance question

2020-02-13 Thread Mikhail Khludnev
Hello, I picked the first two questions to reply to. > does this class offer any Shingling capability embedded to it? > No, it doesn't allow expanding a wildcard phrase with shingles. > I could not find any api within this class ComplexPhraseQueryParser for > that purpose. > There is none. > B

Re: ComplexPhraseQueryParser performance question

2020-02-12 Thread baris . kazar
org.apache.lucene.search.PhraseWildcardQuery looks very good, i hope this makes into Lucene build soon. Thanks > On Feb 12, 2020, at 10:01 PM, baris.ka...@oracle.com wrote: > > Thanks David, can i look at the source code? > i think ComplexPhraseQueryParser uses > something similar. > i will che

Re: ComplexPhraseQueryParser performance question

2020-02-12 Thread baris . kazar
Thanks David, can i look at the source code? i think ComplexPhraseQueryParser uses something similar. i will check the differences but do You know the differences for quick reference? Thanks > On Feb 12, 2020, at 6:41 PM, David Smiley wrote: > > > Hi, > > See org.apache.lucene.search.Phra

Re: ComplexPhraseQueryParser performance question

2020-02-12 Thread David Smiley
Hi, See org.apache.lucene.search.PhraseWildcardQuery in Lucene's sandbox module. It was recently added by my amazing colleague Bruno. At this time there is no query parser that uses it in Lucene unfortunately but you can rectify this for your own purposes. I hope this query "graduates" to Lucen

Re: ComplexPhraseQueryParser performance question

2020-02-12 Thread baris . kazar
Hi,- Regarding this mechanisms below i mentioned, does this class offer any Shingling capability embedded to it? I could not find any api within this class ComplexPhraseQueryParser for that purpose. For instance does this class offer the most commonly used words api? i can then use one of

Re: ComplexPhraseQueryParser performance question

2020-02-04 Thread baris . kazar
Thanks but i thought this class would have a mechanism to fix this issue. Thanks > On Feb 4, 2020, at 4:14 AM, Mikhail Khludnev wrote: > > It's slow per se, since it loads terms positions. Usual advices are > shingling or edge ngrams. Note, if this is not a text but a string or enum, > it pr

Re: ComplexPhraseQueryParser performance question

2020-02-04 Thread Mikhail Khludnev
It's slow per se, since it loads term positions. The usual advice is shingling or edge ngrams. Note, if this is not a text but a string or enum, other tricks can probably be applied. Another idea is that IntervalQueries may be smarter and faster in certain cases, although they are backed on th

Re: ComplexPhraseQueryParser performance question

2020-02-03 Thread baris . kazar
How can this slowdown be resolved? is this another limitation of this class? Thanks > On Feb 3, 2020, at 4:14 PM, baris.ka...@oracle.com wrote: > > Please ignore the first comparison there. i was comparing there {term1 with > 2 chars} vs {term1 with >= 5 chars + term2 with 1 char} > > > The s

Re: ComplexPhraseQueryParser performance question

2020-02-03 Thread baris . kazar
Please ignore the first comparison there. i was comparing there {term1 with 2 chars} vs {term1 with >= 5 chars + term2 with 1 char} The slowdown is The query "term1 term2*" slows down 400 times (~1500 millisecs) compared to "term1*" when term1 has >5 chars and term2 is still 1 char. Best re

ComplexPhraseQueryParser performance question

2020-02-03 Thread baris . kazar
Hi,-  i hope everyone is doing great. I saw this issue with this class such that if you search for "term1*"  it is good, (i.e., 4 millisecs when it has >= 5 chars and it is ~250 millisecs when it is 2 chars) but when you search for "term1 term2*" where when term2 is a single char, the perfo

Re: ComplexPhraseQueryParser class question

2020-01-29 Thread baris . kazar
ucene api. My question is asking also whether ComplexPhraseQueryParser has a way to support partial phrase match capability? Elastic Search has this capability with a percentage indication. i am surprised Lucene Core does not have this, i hope i am wrong. Best regards > On Jan 29, 2020, at

Re: ComplexPhraseQueryParser class question

2020-01-29 Thread 陈志祥
From: Date: 2020-01-30 05:02:50 To: java-user@lucene.apache.org Cc: baris.kazar Subject: ComplexPhraseQueryParser class question Hi,- I hope everyone is doing great. i have a question regarding ComplexPhraseQueryParser class. This class can handle this queryText case very well: "term1 term2

ComplexPhraseQueryParser class question

2020-01-29 Thread baris . kazar
Hi, I hope everyone is doing great. I have a question regarding the ComplexPhraseQueryParser class. This class can handle this queryText case very well: "term1 term2 abcd term3*"~2 (the last term, term3, has * at the end and the whole phrase has a slop value of 2). The term1, term2 and term3 are

Re: Question about PhraseQuery's capacity...

2020-01-12 Thread 小鱼儿
Hi, I have filed an issue with lucene-core: https://issues.apache.org/jira/browse/LUCENE-9130 I just wrote a test case and found that BooleanQuery with MUST filter mode is OK, but PhraseQuery fails. On Fri, Jan 10, 2020 at 7:14 PM 小鱼儿 wrote: > The explain API helps! Thanks for the hint! > I have found out that one case fa

Re: Question about PhraseQuery's capacity...

2020-01-10 Thread 小鱼儿
The explain API helps! Thanks for the hint! I have found out that one case failed because I carelessly added another filter condition, but the other case (which is analyzed into 30 terms) still fails, and I don't know why. I guess I need to write a unit test case that uses the MultiTerms.getTerms API to find out if th

Re: Question about PhraseQuery's capacity...

2020-01-10 Thread Mikhail Khludnev
Hello, Sometimes IndexSearcher.explain(Query, int) helps to analyse mismatches. On Fri, Jan 10, 2020 at 1:13 PM 小鱼儿 wrote: > After directly calling the Analyzer.tokenStream() method to extract terms from > the query, I still cannot get results. I don't know why... > > Code when building the index: >

Re: Question about PhraseQuery's capacity...

2020-01-10 Thread 小鱼儿
After directly calling the Analyzer.tokenStream() method to extract terms from the query, I still cannot get results. I don't know why... Code when building the index: IndexWriterConfig iwc = new IndexWriterConfig(analyzer); //new SmartChineseAnalyzer(); Code to do the query: (1) extract terms from query

Re: Question about PhraseQuery's capacity...

2020-01-10 Thread 小鱼儿
Hi Adrien, I think I might have made a mistake: there are 2 levels of processing in an Analyzer class: one is the Tokenizer, which is HMMChineseTokenizer, and the other is the Analyzer, which may apply some filtering... I'm using Lucene's default interface to set an Analyzer instance to do the indexing, b

Re: Question about PhraseQuery's capacity...

2020-01-10 Thread Adrien Grand
It should match. My guess is that you might not be reusing the same positions as set by the analysis chain when creating the phrase query? Can you show us how you build the phrase query? On Fri, Jan 10, 2020 at 9:24 AM 小鱼儿 wrote: > I use SmartChineseAnalyzer to do the indexing, and add a document w
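
The point about reusing positions can be sketched in plain Java. This is a simplified model, not Lucene's PhraseQuery implementation: it handles only exact phrases (slop 0, whereas the thread used slop 2), and all class and method names are hypothetical. It assumes the analysis chain reports a position increment per token, so that a removed token (for example a stop word) leaves a gap, and it shows a phrase failing when the query assumes consecutive positions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of positional phrase matching; not Lucene code.
class PhrasePositionSketch {

    // Builds term -> positions from analyzed tokens and their position increments.
    // As in Lucene, the first position is (first increment - 1).
    static Map<String, Set<Integer>> index(String[] terms, int[] increments) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        int pos = -1;
        for (int i = 0; i < terms.length; i++) {
            pos += increments[i];
            postings.computeIfAbsent(terms[i], t -> new TreeSet<>()).add(pos);
        }
        return postings;
    }

    // Exact (slop 0) match: some occurrence of the first term must be followed by
    // every other term at the same relative offset the query declares.
    static boolean matchesExactPhrase(Map<String, Set<Integer>> postings,
                                      String[] terms, int[] positions) {
        for (int start : postings.getOrDefault(terms[0], Set.of())) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                int expected = start + (positions[i] - positions[0]);
                ok = postings.getOrDefault(terms[i], Set.of()).contains(expected);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "quick the fox" with "the" removed: "fox" gets increment 2, i.e. position 2.
        Map<String, Set<Integer>> postings =
                index(new String[] {"quick", "fox"}, new int[] {1, 2});
        String[] phrase = {"quick", "fox"};
        System.out.println(matchesExactPhrase(postings, phrase, new int[] {0, 1})); // consecutive positions
        System.out.println(matchesExactPhrase(postings, phrase, new int[] {0, 2})); // positions from analysis
    }
}
```

In real Lucene code the equivalent step would be to pass each token's analyzed position to the phrase query builder instead of adding terms one after another.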

Question about PhraseQuery's capacity...

2020-01-10 Thread 小鱼儿
I use SmartChineseAnalyzer to do the indexing, and add a document with a TextField whose value is a long sentence that, when analyzed, yields 18 terms. Then I use the same value to construct a PhraseQuery, setting slop to 2, and adding the 18 terms consecutively... I expect the search API to find

Re: Question about combining InvertedIndex and SortField

2020-01-01 Thread 小鱼儿
Hi Mikhail, Your words are very encouraging. I was thinking I might need to write another custom Lucene query to apply my business-specific "index-only sort" and "early-termination"... The SortField API says any numeric/String field can be used, which is perfect. (In this way, Lucene should be able to a

Re: Question about combining InvertedIndex and SortField

2019-12-31 Thread Mikhail Khludnev
Hello, 小鱼儿. On Tue, Dec 31, 2019 at 6:32 AM 小鱼儿 wrote: > Assume I first use keyword search to get a DocIDSet from the inverted index, > then I want to sort these docIds by some numeric field, like > `updateTime`. Does Lucene do this without needing to load the Document > objects, using only an s

Question about combining InvertedIndex and SortField

2019-12-30 Thread 小鱼儿
Assume I first use keyword search to get a DocIDSet from the inverted index, then I want to sort these docIds by some numeric field, like `updateTime`. Does Lucene do this without needing to load the Document objects, using only a sorted index on `updateTime`? I call this "Index-Only Sort Opti
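
The idea in the question can be modeled in plain Java as sorting matching doc IDs by a per-document column, never touching stored documents. This is only a sketch with hypothetical names: the array stands in for a column-oriented store keyed by doc ID (in Lucene this role is played by doc values), and the newest-first ordering is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical model of sorting hits by a per-document numeric column; not Lucene code.
class ColumnSortSketch {

    // Sorts matching doc IDs by updateTime, newest first, reading only the column.
    // updateTimeByDocId[docId] holds the field value for that document.
    static int[] sortByUpdateTimeDesc(int[] matchingDocIds, long[] updateTimeByDocId) {
        return Arrays.stream(matchingDocIds)
                .boxed()
                .sorted(Comparator.comparingLong((Integer docId) -> updateTimeByDocId[docId])
                        .reversed())
                .mapToInt(Integer::intValue)
                .toArray();
    }

    public static void main(String[] args) {
        long[] updateTime = {50L, 10L, 40L, 30L}; // one value per doc ID 0..3
        int[] hits = {0, 1, 3};                   // doc IDs returned by the keyword search
        System.out.println(Arrays.toString(sortByUpdateTimeDesc(hits, updateTime)));
    }
}
```

A full sort is shown for simplicity; when only the top few hits are needed, a bounded priority queue over the same column avoids sorting every match.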
