Re: Question - Why stopwords.txt provided by smartcn contains blank lines?

2023-05-15 Thread Jerry Chin
Hi Michael, Thanks for clarifying, I have created an issue on GitHub to follow up. Much appreciated! On Monday, May 15, 2023, Michael McCandless wrote: > Hi Jerry, > > I agree, that makes no sense! Maybe the stopword loader should ignore >

Re: Question - Why stopwords.txt provided by smartcn contains blank lines?

2023-05-15 Thread Michael McCandless
Hi Jerry, I agree, that makes no sense! Maybe the stopword loader should ignore truly blank lines? Also, the comments on lines 57 and 59 are confusing -- there are no (default) English and Chinese stopwords in the file. I guess they are placeholders. Could you open an issue in Lucene's GitHub

Re: Question about index segment search order

2023-05-13 Thread Uwe Schindler
Hi, in reference to previous code references and discussions from other Lucene committers I have to clarify: * If you run the query multithreaded (per segment), which is what happens when you add an Executor to IndexSearcher, the order is not predictable, plain and simple * If you use Solr, a

Re: Question about index segment search order

2023-05-11 Thread Wei
Hi Michael, Yes the collector counts hits across all segments. Thanks for the suggestion, I'm also asking the question on solr-dev. Wei On Thu, May 11, 2023 at 11:57 AM Michael Sokolov wrote: > Maybe ask this issue on solr-dev then? I'm not familiar with how that > collector works. Does it

Re: Question about index segment search order

2023-05-11 Thread Michael Sokolov
Maybe ask this issue on solr-dev then? I'm not familiar with how that collector works. Does it count hits across all segments? only within a single segment? On Tue, May 9, 2023 at 1:36 PM Wei wrote: > > Hi Michael, > > I am applying early termination with Solr's EarlyTerminatingCollector >

Re: Question about index segment search order

2023-05-09 Thread Wei
Hi Michael, I am applying early termination with Solr's EarlyTerminatingCollector https://github.com/apache/solr/blob/d9ddba3ac51ece953d762c796f62730e27629966/solr/core/src/java/org/apache/solr/search/EarlyTerminatingCollector.java , which triggers EarlyTerminatingCollectorException in

Re: Question about index segment search order

2023-05-04 Thread Michael Sokolov
Yes, sorry I didn't mean to imply you couldn't control this if you want to. I guess in the typical setup it is not predictable. How are you applying early termination? Are you using a standard Lucene Collector or do you have your own? On Thu, May 4, 2023 at 2:03 PM Patrick Zhai wrote: > > Hi

Re: Question about index segment search order

2023-05-04 Thread Patrick Zhai
Hi Mike, Just want to mention that if the user chooses to index with a single thread and use LogXXMergePolicy, then the document order will be preserved as index order. On Thu, May 4, 2023 at 10:04 AM Wei wrote: > Hi Michael, > > We are interested in the segment sequence for early termination. In

Re: Question about index segment search order

2023-05-04 Thread Wei
Hi Michael, We are interested in the segment sequence for early termination. In our case there is always a large dominant segment after index rebuild, then many small segments are generated with continuous updates as time goes by. When early termination is applied, the limit could be reached
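
The concern in this message can be pictured with a small, self-contained sketch. It is not Lucene or Solr code: Segment, visitedUntilLimit, and the hit counts are hypothetical stand-ins, and the only assumption is that collection stops visiting further segments once a fixed number of hits has been gathered. It shows how the visiting order decides which segments are reached before that limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustration only: models segments as (name, hits) pairs, not as Lucene readers.
final class SegmentOrderSketch {

    // Hypothetical stand-in for a segment and the number of hits it holds for a query.
    static final class Segment {
        final String name;
        final int hits;

        Segment(String name, int hits) {
            this.name = name;
            this.hits = hits;
        }
    }

    // Visits segments in the given order (or as listed when order is null) and
    // returns the names of the segments reached before the hit limit stops the walk.
    static List<String> visitedUntilLimit(List<Segment> segments, Comparator<Segment> order, int limit) {
        List<Segment> ordered = new ArrayList<>(segments);
        if (order != null) {
            ordered.sort(order);
        }
        List<String> visited = new ArrayList<>();
        int collected = 0;
        for (Segment segment : ordered) {
            if (collected >= limit) {
                break; // early termination: remaining segments are never visited
            }
            visited.add(segment.name);
            collected += segment.hits;
        }
        return visited;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
                new Segment("small-1", 5),
                new Segment("small-2", 3),
                new Segment("dominant", 1000));
        Comparator<Segment> largestFirst =
                Comparator.comparingInt((Segment s) -> s.hits).reversed();

        System.out.println(visitedUntilLimit(segments, null, 100));
        System.out.println(visitedUntilLimit(segments, largestFirst, 100));
    }
}
```

With a limit of 100, the listed order reaches all three segments, while the largest-first order stops after the dominant one. Which order Lucene or Solr actually uses by default is the open question of this thread; the comparator here only illustrates the effect of choosing an explicit order.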

Re: Question about index segment search order

2023-05-04 Thread Michael Sokolov
There is no meaning to the sequence. The segments are created concurrently by many threads and the merge process will merge them without regard to any ordering. On Wed, May 3, 2023, 1:09 PM Patrick Zhai wrote: > For that part I'm not entirely sure, if other folks know it please chime in > :)

Re: Question about index segment search order

2023-05-03 Thread Patrick Zhai
For that part I'm not entirely sure, if other folks know it please chime in :) On Wed, May 3, 2023 at 8:48 AM Wei wrote: > Thanks Patrick! In the default case when no LeafSorter is provided, are the > segments traversed in the order of creation time, i.e. the oldest segment > is always visited

Re: Question about index segment search order

2023-05-03 Thread Wei
Thanks Patrick! In the default case when no LeafSorter is provided, are the segments traversed in the order of creation time, i.e. the oldest segment is always visited first? Wei On Tue, May 2, 2023 at 7:22 PM Patrick Zhai wrote: > Hi Wei, > Lucene in general iterate through the index in the

Re: Question about index segment search order

2023-05-02 Thread Patrick Zhai
Hi Wei, Lucene in general iterates through the index in the order recorded in the SegmentInfos. At search time, you can specify the order using LeafSorter

Re: Question about current situation of good first issues in GitHub

2023-03-11 Thread Shunya Ueta
And sorry for the very late response. I completely missed your kind reply. Thank you again! I will try to contribute to Apache Lucene. : ) Regards! On Sat, Mar 11, 2023 at 16:03, Shunya Ueta wrote: > Oh, Thank you very much. > I don't know those beginner-friendly labels. > I will try to find the good first

Re: Question about current situation of good first issues in GitHub

2023-03-10 Thread Shunya Ueta
Oh, thank you very much. I didn't know about those beginner-friendly labels. I will try to find a good first issue. On Sat, Jan 14, 2023 at 5:51, Michael Sokolov wrote: > That label seems to be something GitHub created automatically? > > You might have better luck browsing the full list of labels. I found these: > >

Re: Question about searcherManager applyAllDeletes parameter and maybeRefresh method

2023-03-03 Thread Patrick Zhai
Note that the javadoc says "If false, the deletes may or may not be applied", which means it will not force applying all the deletes; it's up to IndexWriter to decide whether to apply them at refresh time or not. I'm not 100% sure how IndexWriter decides that, and maybe someone who knows more can chime

Re: Question about searcherManager applyAllDeletes parameter and maybeRefresh method

2023-03-02 Thread Ningshan Li
Hi Patrick, Thanks for the quick response; the explanation and sources are helpful! But there is still a point we couldn't quite understand: why did the test I mentioned earlier pass (applyAllDeletes false and do maybeRefresh())? If the delete is not applied, we should see the deleted doc in

Re: Question about searcherManager applyAllDeletes parameter and maybeRefresh method

2023-03-02 Thread Patrick Zhai
Hi Ningshan, If you want to make sure the deletes are applied after you call maybeRefresh(), then you need to set applyAllDeletes to true. A few more details: The constructor of SearcherManager actually internally passes applyAllDeletes to the IndexWriter, which then will pass it to the

Re: Question for SynonymQuery

2023-01-27 Thread Mikhail Khludnev
Right. SynonymMap.html#WORD_SEPARATOR was a redundant complication. Spaces work fine. On Thu, Jan 19, 2023 at 4:26 AM Anh Dũng Bùi wrote: > Thanks Mikhail! > > It turns out

Re: Question for SynonymQuery

2023-01-27 Thread Mikhail Khludnev
Hello, Satnam. It seems I achieved what you are asking for. https://github.com/mkhludnev/likely/blob/381b491d25e4d2035dd5b8a891dfdcfe2b986b90/src/test/java/org/apache/lucene/playground/TestMultiPulty.java#L32 It expands API and UI into phrases, which match like you expect. On Fri, Jan 20, 2023 at

Re: Question for SynonymQuery

2023-01-20 Thread _ SATNAM
Hey Mikhail and Anh Dung Bui, I am also struggling with synonym queries. My use case, for example: I created synonyms for the words API -> Application program interface and UI -> user interface. doc1 -> This is API and it is called Application program interface; doc2 -> How i help you in UI

Re: Question for SynonymQuery

2023-01-18 Thread Anh Dũng Bùi
Thanks Mikhail! It turns out I used FlattenGraphFilter, which caused the PositionLength to be all 1 and resulted in the behavior above =) A side note is that we don't need to use WORD_SEPARATOR in the synonym file. SynonymMap.Parser.analyze would tokenize and append the separator for us. Regards,

Re: Question about current situation of good first issues in GitHub

2023-01-13 Thread Michael Sokolov
That label seems to be something GitHub created automatically? You might have better luck browsing the full list of labels. I found these: https://github.com/apache/lucene/labels/legacy-jira-label%3Anewbie https://github.com/apache/lucene/labels/legacy-jira-label%3Anewdev

Re: Question about current situation of good first issues in GitHub

2023-01-10 Thread Uwe Schindler
Hi, The old JIRA labels are also in GitHub. See tags named "legacy-jira-label:*". The equivalent search would be this: https://github.com/apache/lucene/labels/legacy-jira-label%3Anewdev Uwe On 10.01.2023 at 12:41, Stefan Vodita wrote: Hello Shunya, As far as I know, GitHub issues are not

RE: Question for SynonymQuery

2023-01-03 Thread Trevor Nicholls
ocessing > occurs once at indexing time and not at all during searching - and I'm > sure you'll be doing far more searching than indexing). > > cheers > T > > > -----Original Message- > From: Michael Wechner > Sent: Thursday, 29 December 2022 08:56

Re: Question for SynonymQuery

2023-01-02 Thread Michael Wechner
onymQuery; I have just used the standard QueryParser. Instead the synonym processing occurs in the indexing phase, which is not only simpler (one search pattern, one query), but also I think you would also find it gives you superior performance (because the synonym processing occurs once at indexing time

Re: Question for SynonymQuery

2023-01-02 Thread Mikhail Khludnev
ng searching - and I'm sure you'll be doing far more > searching than indexing). > > cheers > T > > > -Original Message- > From: Michael Wechner > Sent: Thursday, 29 December 2022 08:56 > To: java-user@lucene.apache.org > Subject: Re: Question for SynonymQuery >

RE: Question for SynonymQuery

2023-01-02 Thread Trevor Nicholls
-Original Message- From: Michael Wechner Sent: Thursday, 29 December 2022 08:56 To: java-user@lucene.apache.org Subject: Re: Question for SynonymQuery Hi Anh The following Stackoverflow link might help https://stackoverflow.com/questions/73240494/can-someone-assist-me-with-a-multi-word-synonym

Re: Question for SynonymQuery

2023-01-01 Thread Mikhail Khludnev
Hello Anh, I was intrigued by your question. And I managed to get it working somehow. See https://github.com/mkhludnev/likely/blob/eval-mulyw-syns/src/test/java/org/apache/lucene/playground/TestMultiPulty.java Beware, synonym files

Re: Question for SynonymQuery

2022-12-28 Thread Anh Dũng Bùi
Thanks everyone for the insight. I guess I'll use BooleanQuery then. There is also a caveat I noticed (not sure if it's an issue or not), which is slightly different from the mentioned thread. When I have a multi-word synonym, let's say "wifi router" and "internet device". Then using

Re: Question for SynonymQuery

2022-12-28 Thread Michael Wechner
Hi Anh The following Stackoverflow link might help https://stackoverflow.com/questions/73240494/can-someone-assist-me-with-a-multi-word-synonym-problem-in-lucene The following thread seems to confirm, that escaping the space with a backslash does not help

Re: Question for SynonymQuery

2022-12-28 Thread Mikhail Khludnev
Hello. 1)Yes. That's the purpose. 2) I've skimmed through QueryBuilder.java. Conclusion is that it creates BQ.SHOULD (however, there should be something like DisjunctionMaxQuery) over PhraseQuery or MultiPhraseQuery (-ies). Good hack! On Wed, Dec 28, 2022 at 2:23 AM Anh Dũng Bùi wrote: > Hi

Re: Question about Benchmark

2022-05-17 Thread Michael Sokolov
OK I replied on the issue. This ann-benchmarks is a separate project, and I think you are asking about how to change it. Probably should take it up with erikbern or whatever community is supporting that actively. I just created a "plugin" so we could use it to test Lucene's KNN implementation, but

Re: Question about Benchmark

2022-05-17 Thread balmukund mandal
Hi All, My apologies for not mentioning which benchmark I was using. Also, I realized that I had not subscribed to this group, hence the duplicate mail. The queries below are for ANN-Benchmark https://issues.apache.org/jira/browse/LUCENE-9625 Indexing takes a long time, so is there a way

Re: Question about Benchmark

2022-05-17 Thread Mikhail Khludnev
Hi, Balmukund. Assuming you are asking about the Lucene benchmark module: 1) Once the index has been built, it's possible to start the benchmark with ResetSystemSoft, which keeps the index files intact and allows you to benchmark search again and again without waiting for a long reindex. 2) Check indexing-multithreaded.alg

Re: Question about Benchmark

2022-05-16 Thread Adrien Grand
Hi Balmukund, What benchmark are you talking about? On Mon, May 16, 2022 at 4:35 PM balmukund mandal wrote: > > Hi All, > I was trying to run the benchmark and had a couple of questions. Indexing > takes a long time, so is there a way to configure the benchmark to use an > already existing

Re: Question about using Lucene to search source code

2021-12-20 Thread Michael Wechner
Hi Yuxin Can you provide a concrete example of a query and a document/code snippet? Thanks Michael On 20.12.21 at 03:06, Yuxin Liu wrote: Dear development community of Lucene: Hi from student research assistant Yuxin Liu. I'm using Lucene to build an index search for source code indexes

Re: Question about readVint & writeVint from DataOutput and DataInput

2021-09-03 Thread Aaron Cohen
Thank you for the clarification. > On Sep 3, 2021, at 10:46 AM, Uwe Schindler wrote: > > They are fully supported, so you can write and read them. > > The problem with negative numbers is that they need lot of (disk) space, > because in two's complement they have almost all bits set. The

Re: Question about readVint & writeVint from DataOutput and DataInput

2021-09-03 Thread Uwe Schindler
They are fully supported, so you can write and read them. The problem with negative numbers is that they need a lot of (disk) space, because in two's complement they have almost all bits set. The largest number in terms of disk space is -1. Negative numbers appear in older index formats, so they
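
A minimal sketch of why negative values are expensive, assuming the usual variable-length scheme of 7 payload bits per byte with the high bit as a continuation flag. The class and method names are illustrative; this is not the Lucene source, only an encoding of the same general shape.

```java
// Sketch of a variable-length int encoding: 7 payload bits per byte,
// with the high bit set on every byte except the last one.
final class VIntSketch {

    // Encodes value into buf starting at index 0 and returns the number of bytes written.
    static int writeVInt(int value, byte[] buf) {
        int pos = 0;
        // While any bit above the low 7 is set, emit 7 bits plus a continuation flag.
        while ((value & ~0x7F) != 0) {
            buf[pos++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7; // unsigned shift, so negative values still terminate
        }
        buf[pos++] = (byte) value;
        return pos;
    }

    // Decodes a value previously written by writeVInt at index 0.
    static int readVInt(byte[] buf) {
        int pos = 0;
        byte b = buf[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[5]; // 5 bytes is the maximum for a 32-bit value
        for (int v : new int[] {0, 127, 128, 16384, Integer.MAX_VALUE, -1}) {
            int len = writeVInt(v, buf);
            System.out.println(v + " -> " + len + " byte(s), decoded " + readVInt(buf));
        }
    }
}
```

Small non-negative values fit in one or two bytes, while -1 has all 32 bits set and therefore needs the full five bytes, one more than a fixed-width int.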

Re: Question about PhraseQuery's capacity...

2020-01-12 Thread 小鱼儿
Hi, I have filed an issue against lucene-core: https://issues.apache.org/jira/browse/LUCENE-9130 I just wrote a test case, and found that BooleanQuery with MUST filter mode is ok, but PhraseQuery fails. On Fri, Jan 10, 2020 at 7:14 PM, 小鱼儿 wrote: > explain api helps! thanks for hint~! > I have found out that one case

Re: Question about PhraseQuery's capacity...

2020-01-10 Thread 小鱼儿
The explain API helps! Thanks for the hint~! I have found out that one case failed because I carelessly added another filter condition, but the other case (which is analyzed into 30 terms) still fails, and I don't know why. I guess I need to write a unit test case that uses the MultiTerms.getTerms API to find out if

Re: Question about PhraseQuery's capacity...

2020-01-10 Thread Mikhail Khludnev
Hello, Sometimes IndexSearcher.explain(Query, int) allows to analyse mismatches. On Fri, Jan 10, 2020 at 1:13 PM 小鱼儿 wrote: > After i directly call Analyzer.tokenStream() method to extract terms from > query, i still cannot get results. Doesn't know the why... > > Code when build index: >

Re: Question about PhraseQuery's capacity...

2020-01-10 Thread 小鱼儿
After I directly call the Analyzer.tokenStream() method to extract terms from the query, I still cannot get results. I don't know why... Code when build index: IndexWriterConfig iwc = new IndexWriterConfig(analyzer); //new SmartChineseAnalyzer(); Code do query: (1) extract terms from

Re: Question about PhraseQuery's capacity...

2020-01-10 Thread 小鱼儿
Hi Adrien, I find I might have made a mistake: There are 2 levels of processing in an Analyzer class: one is the Tokenizer, which is HMMChineseTokenizer, and the other is the Analyzer, which may apply some filtering... I'm using Lucene's default interface to set an Analyzer instance to do the indexing,

Re: Question about PhraseQuery's capacity...

2020-01-10 Thread Adrien Grand
It should match. My guess is that you might not be reusing the same positions as set by the analysis chain when creating the phrase query? Can you show us how you build the phrase query? On Fri, Jan 10, 2020 at 9:24 AM 小鱼儿 wrote: > I use SmartChineseAnalyzer to do the indexing, and add a document

Re: Question about combining InvertedIndex and SortField

2020-01-01 Thread 小鱼儿
Hi, Mikhail Your words are very encouraging. I was thinking I might need to do another Lucene custom query to apply my business-specific "index-only sort" and "early-termination"... The SortField API says it can use any numeric/String field, which is perfect. (In this way, Lucene should be able to

Re: Question about combining InvertedIndex and SortField

2019-12-31 Thread Mikhail Khludnev
Hello, 小鱼儿. On Tue, Dec 31, 2019 at 6:32 AM 小鱼儿 wrote: > Assume i first use keyword search to get a DocIDSet from inverted index, > then i want to sort these docIds by some numeric field, like a > `updateTime`, does Lucene do this without need of loading the Document > objects but only with an

Re: Question about the light and minimal French stemmers

2019-07-28 Thread Adrien Gallou
Hi Tomoko, Thanks for your answers. Following them, I have opened an issue with a patch attached: https://issues.apache.org/jira/browse/LUCENE-8937 Adrien On Sun, Jul 28, 2019 at 13:51, Michael Sokolov wrote: > Oh sorry for jumping in with my irrelevant comment, you are right, of > course!

Re: Question about the light and minimal French stemmers

2019-07-28 Thread Michael Sokolov
Oh sorry for jumping in with my irrelevant comment, you are right, of course! On Sat, Jul 27, 2019, 10:36 PM Tomoko Uchida wrote: > Let me just make things a bit clear... > I think the concern here is that FrenchMinimalStemmer would remove the > last "digit" from a token because of it does not

Re: Question about the light and minimal French stemmers

2019-07-27 Thread Tomoko Uchida
Let me just make things a bit clearer... I think the concern here is that FrenchMinimalStemmer would remove the last "digit" from a token because it does not check whether the character is a letter or not. e.g., "123455" is trimmed to "12345" by FrenchMinimalStemmer. To me, this behaviour is beyond
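
The behaviour described here can be reproduced with a deliberately simplified sketch. It only models a "drop the last character when the final two characters are identical" rule and an optional letter check; the class and method names are hypothetical, and the real stemmer applies further rules that are not shown.

```java
// Simplified illustration of a doubled-ending rule, with and without a letter check.
// This is not the Lucene stemmer; it isolates the one rule discussed in the thread.
final class DoubledEndingSketch {

    // Drops the final character when the last two characters are identical.
    // With lettersOnly == true, the rule only fires when that character is a letter.
    static String trimDoubledEnding(String token, boolean lettersOnly) {
        int len = token.length();
        if (len < 2) {
            return token;
        }
        char last = token.charAt(len - 1);
        boolean doubled = last == token.charAt(len - 2);
        if (doubled && (!lettersOnly || Character.isLetter(last))) {
            return token.substring(0, len - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(trimDoubledEnding("123455", false)); // digits are trimmed too
        System.out.println(trimDoubledEnding("123455", true));  // digits are left alone
        System.out.println(trimDoubledEnding("mess", true));    // letters are still trimmed
    }
}
```

Without the check, "123455" becomes "12345"; with it, numeric tokens pass through unchanged while doubled letters are still reduced.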

Re: Question about the light and minimal French stemmers

2019-07-27 Thread Michael Sokolov
I'm not so sure. I think the whole idea of having both stemmers is that the minimal one does less than the light one. Removing the final character of a double letter suffix is going to sacrifice some precision. For example mes/mess, ne/née, I'm sure there are others. So having both options is

Re: Question about the light and minimal French stemmers

2019-07-27 Thread Tomoko Uchida
I found an issue which adds the isLetter() check on FrenchLightStemmer. https://issues.apache.org/jira/browse/LUCENE-4063 It seems the same change has not been applied to FrenchMinimalStemmer; would it be a good idea to add the same check to it to avoid overly aggressive stemming? Tomoko

Re: Question about the light and minimal French stemmers

2019-07-27 Thread Tomoko Uchida
Hi Adrien, To me, it simply sounds like a bug. Can you please open a JIRA (with a patch if possible)? Tomoko On Tue, Jul 23, 2019 at 22:05, Adrien Gallou wrote: > > Hi, > > I'm using both light and minimal French stemmers and encountered an issue > when using the minimal stemmer. > > The light stemmer removes the

Re: Question about Lucene in my project ..

2019-05-28 Thread Adrien Grand
Hi John, I heard of many users who used Lucene for this use-case, it's definitely a valid one. Indexes are stored mostly on disk, with a tiny part of them being held in memory to guarantee good access speed. Lucene supports both inverted indexes and KD trees up to 8 dimensions. Lookup, sorting

Re: Question about Indexsearcher.search()

2019-01-25 Thread Tomoko Uchida
Hi, Tokenization is usually performed by a query parser before searching and the result documents may include all terms or some of the terms or only one term in the query string (it depends on your query configuration). > I'm trying to make sample search application with Lucene. Have you

Re: Question about upgrading lucene 4.4.0 to 7.5.0

2018-11-06 Thread Tomoko Uchida
Hi, I think changing the analyzer for each document when indexing will lead to inconsistent or unstable search results. I would break down the reason why this is needed. > While adding a document we are adding a different analyzer. If a field needs to be analyzed by multiple analyzers, I would split up

Re: Question about upgrading lucene 4.4.0 to 7.5.0

2018-11-05 Thread Arpit Mittal
Could you please help us with it? This is urgent for us. On Sun, Nov 4, 2018 at 10:04 PM Arpit Mittal wrote: > Hi All, > > We are working on upgrading lucene version from 4.4.0 to 7.5.0. > > We have a few questions. Could you please help us by giving us suggestions > to fix it? > > Remove

Re: Question About FST, multiple-column index

2018-09-22 Thread Michael McCandless
You might want to index the name field normally (as StringField, for example), then index the age as a NumericDocValuesField, and then make a BooleanQuery with two required clauses, one clause TermQuery on the name, the other a NumericDocValuesField.newSlowExactQuery. Even though its name is

Re: Question About FST, multiple-column index

2018-09-21 Thread Mikhail Khludnev
No way. And this is the point. To have combined index you need to combine fields concatenating terms. It will be faster but it brings much other hurdles. Do you think that this is the real problem? What's the search time now and how do you search exactly? On Thu, Sep 20, 2018 at 5:57 PM ly铖

Re: Question about usage of LuceneTestCase

2018-08-27 Thread Tomoko Uchida
> i haven't looked closely into what exactly that "useFactory(null)" call > does, but it's probably worth getting to the bottom of the failures and > *IF* it's tied to some specific dir type or codec, using annotations to > suppress them -- rather than just eliminating all directory randomization.

Re: Question about usage of LuceneTestCase

2018-08-27 Thread Chris Hostetter
: Current version of Luke supports FS based directory implementations only. : (I think it will be better if future versions support non-FS based custom : implementations, such as HdfsDirectoryFactory for users who need it.) : Disabling the randomization, at least for now, sounds reasonable to me

Re: Question about BytesRef and BinaryDocValues

2018-08-24 Thread Vadim Gindin
Kevin, the sequence is the following: get terms for the field, get postings for a term, and then get the payload from the postings. Read a little about the inverted index structure and it will become clearer to you. Your Query creates a Weight, which must create a scorer in the method

Re: Question about BytesRef and BinaryDocValues

2018-08-23 Thread Kevin Manuel
Hi Vadim, Thank you so much for your reply. I think you were right. So if a field is 'analyzed' how can I get both terms 'hey' and 'tom'? Thanks, Kevin On Thu, Aug 23, 2018, 20:26 Vadim Gindin wrote: > Hi Kevin! > > I think that your field is "analyzed" and so your field value is divided to

Re: Question about BytesRef and BinaryDocValues

2018-08-23 Thread Vadim Gindin
Hi Kevin! I think that your field is "analyzed" and so your field value is divided into 2 terms "hey" and "tom". So a docvalue is written for each of them. Regards Vadim Gindin On Fri, Aug 24, 2018 at 5:19, Kevin Manuel wrote: > Hi, > > I'm using lucene version 4.3.1 and I've implemented a custom score

Re: Question about usage of LuceneTestCase

2018-08-22 Thread Tomoko Uchida
> You don't really have to figure out exactly what the combinations are, > just execute the test with the "reproduce with" flags set, cut/paste > the error message at the root of your local Solr source tree in a > command prompt. > ant test -Dtestcase=CommitsImplTest >

Re: Question about usage of LuceneTestCase

2018-08-22 Thread Michael Sokolov
It looks to me as if this test is asserting that the segment in an index it just created has some attributes, but in fact it does not. Perhaps there is a codec that does not store any attributes with its segments, and Luke does not expect this, and maybe the codec is being selected randomly by the

Re: Question about usage of LuceneTestCase

2018-08-22 Thread Michael Sokolov
Here's a seed that fails for me consistently in IntelliJ: "FEF692F43FE50191:656E22441676701C" running CommitsImplTest. Warning: I have a bunch of local changes that might have perturbed the randomness so possibly it might not reproduce for others. I just run the tests, open the "Edit

Re: Question about usage of LuceneTestCase

2018-08-22 Thread Erick Erickson
bq. My understanding at this point is (though it may be a repeat of your words,) first we should find out the combinations behind the failures. If there are any particular patterns, there could be bugs, so we should fix it. You don't really have to figure out exactly what the combinations are,

Re: Question about usage of LuceneTestCase

2018-08-22 Thread Tomoko Uchida
Can I ask one more question? 4> If Mike's intuition that it's one of the file system randomizations that occasionally gets hit _and_ you determine that that's an invalid test case (and for Luke requiring that the FS-based tests are all that are necessary may be fine) I'm pretty sure you can

Re: Question about usage of LuceneTestCase

2018-08-22 Thread Tomoko Uchida
Thanks for your kind explanations. Sorry, of course I know what the randomization seed is, but your description and instructions are exactly what I wanted. > The randomization can cause different > combinations of "stuff" to happen. Say the locale is randomized to > Turkish and a token is also

Re: Question about usage of LuceneTestCase

2018-08-21 Thread Erick Erickson
The pseudo-random generator in the Lucene test framework is used to randomize lots of test conditions, we're talking about the file system implementation here, but there are lots of others. Whenever you see a call to random().whatever, that's the call to the framework's method. But here's the
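
The role of the seed can be illustrated without the test framework at all. The sketch below is an assumption-laden stand-in: the option lists and the pickConfiguration helper are hypothetical, and the hex value is simply the first half of the seed quoted elsewhere in this thread, parsed as a 64-bit number. How the real framework derives its generators from a seed is not shown here. The point is only that a pseudo-random generator started from the same seed makes the same choices every time.

```java
import java.util.Random;

// Illustration only: a fixed seed reproduces the same "random" test configuration.
final class SeededChoiceSketch {

    // Hypothetical stand-ins for things a test framework might randomize.
    static final String[] DIRECTORIES = {"fs", "mmap", "nio"};
    static final String[] LOCALES = {"en-US", "tr-TR", "ja-JP"};

    // Picks one directory type and one locale, driven entirely by the seed.
    static String pickConfiguration(long seed) {
        Random random = new Random(seed);
        String directory = DIRECTORIES[random.nextInt(DIRECTORIES.length)];
        String locale = LOCALES[random.nextInt(LOCALES.length)];
        return directory + " / " + locale;
    }

    public static void main(String[] args) {
        long seed = Long.parseUnsignedLong("FEF692F43FE50191", 16);
        // Both lines print the same configuration because the seed is the same.
        System.out.println(pickConfiguration(seed));
        System.out.println(pickConfiguration(seed));
    }
}
```

This is why re-running a failing test with the reported seed is expected to recreate the same combination of conditions, whereas a fresh run may pick a different one.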

Re: Question about usage of LuceneTestCase

2018-08-21 Thread Tomoko Uchida
Thanks a lot for your information & insights, I will try to reproduce the errors and investigate the results. And maybe I should learn more about the internals of the test framework; I'm not familiar with it and still do not understand what "seed" means exactly in this context. Regards, Tomoko

Re: Question about usage of LuceneTestCase

2018-08-21 Thread Erick Erickson
Couple of things (and I know you've been around for a while, so pardon me if it's all old hat to you): 1> if you run the entire "reproduce with" line and can get a consistent failure, then you are half way there, nothing is as frustrating as not getting failures reliably. The critical bit is

Re: Question about usage of LuceneTestCase

2018-08-21 Thread Tomoko Uchida
Hi, Mike Thanks for sharing your experiments. > CommitsImplTest.testListCommits > CommitsImplTest.testGetCommit_generation_notfound > CommitsImplTest.testGetSegments > DocumentsImplTest.testGetDocumentFIelds I also found CommitsImplTest and DocumentsImplTest fail frequently, especially

Re: Question about usage of LuceneTestCase

2018-08-21 Thread Michael Sokolov
I was running these luke tests a bunch and found the following tests fail intermittently; pretty frequently. Once I @Ignore them I can get a consistent pass: CommitsImplTest.testListCommits CommitsImplTest.testGetCommit_generation_notfound CommitsImplTest.testGetSegments

Re: Question about threading in search

2018-08-17 Thread Erick Erickson
Please don't optimize to 1 segment unless you can afford to do it quite regularly, see: https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/ (NOTE: this is changing as of 7.5, see: https://lucidworks.com/2018/06/20/solr-and-optimizing-your-index-take-ii/). bq. It

Re: Question about threading in search

2018-08-17 Thread Toke Eskildsen
On Sat, 2017-09-02 at 18:33 -0700, Peilin Yang wrote: > we're comparing two different indexes on the same collection - one > with lots of different segments (default settings), and one with a > force merged into one segment. It seems that search is sometimes > faster with multiple segments. If

Re: Question: find term by position and document id

2018-04-05 Thread Adrien Grand
Indeed it is not possible to do something like this efficiently. The only way to do something like this that I know of would be to index shingles and then query them by prefix. On Thu, Apr 5, 2018 at 14:11, Adam Hornacek wrote: > Hello, > > I’m having difficulty

Re: Question about a documentation note in CompressingStoredFieldsIndexWriter

2017-11-24 Thread Adrien Grand
That makes sense to me, I'll push that change. Thanks! On Fri, Nov 24, 2017 at 10:40, Roman Margolis wrote: > Sorry about that. > In my original message, I highlighted the relevant parts which probably > didn't make it to the mail archive. > > I would expect the note

Re: Question about a documentation note in CompressingStoredFieldsIndexWriter

2017-11-24 Thread Roman Margolis
Sorry about that. In my original message, I highlighted the relevant parts which probably didn't make it to the mail archive. I would expect the note to state the following (unless I misunderstood some of the details): "Once data is loaded into memory, you can lookup the start pointer of any

Re: Question about a documentation note in CompressingStoredFieldsIndexWriter

2017-11-24 Thread Adrien Grand
Hi Roman, It's unclear to me what modification you are suggesting, could you please share what the updated comment would look like? On Wed, Nov 22, 2017 at 14:17, Roman Margolis wrote: > Hi, > > I was reading some internal info about Lucene, and was confused by a

Re: question

2017-01-20 Thread Dawid Weiss
> But it is fairly trivially to tweak/extend the query parser to produce > diff behavior. I think the conclusion for the original poster should be that there's really not enough information to provide a definite answer. Lucene is a search engine. Much like with a mechanical engine, its final

RE: question

2017-01-19 Thread Chris Hostetter
: Yes, they should be the same unless the field is indexed with shingles, in that case order matters. : Markus just to clarify... The examples provided show *strings* which would have to be parsed into Query objects by a query parser. The *default* QueryParser will produce queries that

Re: question

2017-01-16 Thread Erick Erickson
"it depends". I'm assuming that your case 1 is intended to be phrase searches whereas case 2 is just boolean (and specifically AND is the operator). So, within 1 (assuming phrase queries) the results should NOT be the same, that is "sas institute" (as a phrase) should not return the same results

Re: question

2017-01-16 Thread Erik Hatcher
Or a no-slop PhraseQuery, where order also matters. Erik > On Jan 16, 2017, at 12:27, Markus Jelsma wrote: > > Yes, they should be the same unless the field is indexed with shingles, in > that case order matters. > Markus > > -Original message- >>
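
The difference between "all terms present" and "terms adjacent and in order" can be shown with a plain-Java sketch. It does not use Lucene: the whitespace tokenizer and the two match helpers are simplified, hypothetical stand-ins for an AND query and a no-slop phrase query.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustration only: contrasts an order-insensitive AND match with an exact phrase match.
final class PhraseVersusAndSketch {

    // Lowercases and splits on whitespace; a stand-in for a real analyzer.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // AND semantics: every query term occurs somewhere in the document.
    static boolean matchesAllTerms(String document, String query) {
        return tokenize(document).containsAll(tokenize(query));
    }

    // Phrase semantics without slop: the query terms occur adjacently and in order.
    static boolean matchesPhrase(String document, String query) {
        return Collections.indexOfSubList(tokenize(document), tokenize(query)) >= 0;
    }

    public static void main(String[] args) {
        String doc = "The SAS Institute is based in Cary";
        System.out.println(matchesAllTerms(doc, "sas institute"));
        System.out.println(matchesAllTerms(doc, "institute sas"));
        System.out.println(matchesPhrase(doc, "sas institute"));
        System.out.println(matchesPhrase(doc, "institute sas"));
    }
}
```

Both term orders satisfy the AND match, but only "sas institute" satisfies the phrase match, which is the sense in which order matters for a phrase query.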

RE: question

2017-01-16 Thread Markus Jelsma
Yes, they should be the same unless the field is indexed with shingles, in that case order matters. Markus -Original message- > From:Julius Kravjar > Sent: Monday 16th January 2017 18:20 > To: java-user@lucene.apache.org > Subject: question > > May I have

Re: Question on Lucene Behavior in 4.9 vs 5.4.1

2016-05-05 Thread Jeremy Glesner
Thanks Christoph. I can upgrade, but would prefer not to go through the trouble only to find that my issue was still there. I looked in JIRA and I don't see any other tickets logged for changes to Fuzzy matching. There may be other changes that would affect my situation. But would be curious

Re: Question on Lucene Behavior in 4.9 vs 5.4.1

2016-05-05 Thread Christoph Läubrich
I recently had a problem with strange search results in Lucene 5.5; upgrading to 6.0 fixed that, so is upgrading to the latest version an option for you? On 04.05.2016 at 23:08, Jeremy Glesner wrote: Thanks to Adrien for responding. I performed the explain on indexSearcher in Lucene

Re: Question on Lucene Behavior in 4.9 vs 5.4.1

2016-05-04 Thread Jeremy Glesner
Thanks to Adrien for responding. I performed the explain on indexSearcher in Lucene 5.4.1, the results are pasted below for Basti Bosan (the highest ranked result) and Boston (the preferred result). I'm not 100% sure how to interpret this based on (a) lucene's weighting of the term in the

Re: Question on Lucene Behavior in 4.9 vs 5.4.1

2016-04-22 Thread Adrien Grand
FuzzyQuery scoring was changed in Lucene 5.3: https://issues.apache.org/jira/browse/LUCENE-329 Maybe look at the result of IndexSearcher.explain to understand why the "Boston" doc got a lower score than your "Basti Bosan" doc? On Thu, Apr 21, 2016 at 15:39, Jeremy Glesner

Re: Question related to reranking and RankQuery

2015-09-18 Thread Ajinkya Kale
Is there a way to do something like q=hello+world={!rerank reRankQuery=$rqq reRankDocs=100}=sort={!func}myFunc() desc ? or even as simple as 1. http://localhost:8983/solr/0/select?q=edgengram:abc=json=true=true={!rerank reRankQuery=$rqq reRankDocs=20}=sort=some_field desc I am not

Re: Question related to reranking and RankQuery

2015-09-18 Thread Ajinkya Kale
Is there a way I can issue a regular query with q and then apply functionQuery only on the top n documents of the result from q ? Applying functionQuery on all documents will be very expensive in my case. I am not able to find a way to "rerank" only top N documents using Function Query. --aj On

Re: Question related to reranking and RankQuery

2015-09-18 Thread Joel Bernstein
The syntax would be something like this: q=hello+world&rq={!rerank reRankQuery=$rqq reRankDocs=100}&rqq={!func}myFunc() I'm not sure if there is a test case demonstrating this but it should work. Joel Bernstein http://joelsolr.blogspot.com/ On Fri, Sep 18, 2015 at 2:42 PM, Ajinkya Kale

Re: Question related to reranking and RankQuery

2015-09-18 Thread Ajinkya Kale
Thanks Joel! This is exactly what I was looking for. I did not realize rerank was extensible to your own Function Query. This is good. --aj On Fri, Sep 18, 2015 at 12:00 PM Joel Bernstein wrote: > The syntax would be something like this: > > q=hello+world&rq={!rerank

Re: Question related to reranking and RankQuery

2015-09-18 Thread Joel Bernstein
The ReRankQuery re-ranks the Top N documents of the main query based on a query. Rather than the CustomScoreQuery you may want to look at ReRanking by a Function Query using the FunctionQParserPlugin. This would allow you to directly control the ReRankScore for the top N documents. Writing your
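The idea of re-ranking only the top N documents can be sketched in plain Java, independent of Solr. This is a simplified illustration, not Solr's ReRankQuery implementation: the class `TopNRerankSketch` and the method `rerankTopN` are hypothetical names, and the sketch simply re-sorts the first N entries by a secondary score instead of combining it with the original score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Toy illustration only: hypothetical helper, not part of Solr or Lucene.
final class TopNRerankSketch {

    // Re-orders only the first n entries of an already ranked list by a
    // secondary score (descending). Entries after position n keep their
    // original order, so the scoring function is never evaluated for them.
    static <T> List<T> rerankTopN(List<T> ranked, int n, ToDoubleFunction<T> rerankScore) {
        int cut = Math.max(0, Math.min(n, ranked.size()));
        List<T> head = new ArrayList<>(ranked.subList(0, cut));
        // List.sort is stable, so ties keep the order of the main query.
        head.sort(Comparator.comparingDouble(rerankScore).reversed());
        List<T> result = new ArrayList<>(head);
        result.addAll(ranked.subList(cut, ranked.size()));
        return result;
    }
}
```

The point of the design is cost: the potentially expensive function is applied to at most N documents, which is the concern raised earlier in this thread about applying a function query to every match.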

Re: Question about JoinUtil

2014-12-17 Thread Glen Newton
Hi Gregory, Thanks for your reply. In reading it, I realized that one side of my relational join wasn't that large, and I could bring it in as a couple of fields to the main document without any penalty, so my need to join two different document types then goes away. Thanks! :-) Glen On Tue,

Re: Question about JoinUtil

2014-12-16 Thread Glen Newton
Anyone? On Thu, Dec 11, 2014 at 2:53 PM, Glen Newton glen.new...@gmail.com wrote: Is there any reason JoinUtil (below) does not have a 'Query toQuery' available? I was wanting to filter on the 'to' side as well. I feel I am missing something here. To make sure this is not an XY problem, here

Re: Question about JoinUtil

2014-12-16 Thread Gregory Dearing
Glen, Lucene isn't relational at heart and may not be the right tool for what you're trying to accomplish. Note that JoinQuery doesn't join 'left' and 'right' answers; rather it transforms a 'left' answerset into a 'right' answerset. JoinQuery is able to perform this transformation with a single

Re: Question regarding complex queries and long tail suggestions

2014-09-08 Thread Mirko Sertic
/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java?revision=1622067view=markup -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Wednesday, September 3, 2014 7:14 PM To: java-user Subject: Re: Question regarding complex queries and long tail

Re: Question regarding complex queries and long tail suggestions

2014-09-03 Thread Erick Erickson
Take a look at the ComplexPhraseQueryParser here: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser Best, Erick On Wed, Sep 3, 2014 at 12:41 PM, Mirko Sertic mirko.ser...@web.de wrote: Hi@all I am using Lucene 4.9 for a search application.
