Re: Performance Difference between files getting opened with IoContext.RANDOM vs IoContext.READ

2024-09-29 Thread Michael McCandless
Hi Navneet, With RANDOM IOContext, on modern OS's / Java versions, Lucene will hint to the OS that IO on the memory mapped segment will be random, using the POSIX madvise API with the MADV_RANDOM flag. For READ IOContext, Lucene maybe hints with MADV_SEQUENTIAL, I'm not sure. Or maybe it doesn't hint anything? It'

Re: Excessive reads while doing commit in lucene

2024-09-04 Thread Michael McCandless
It's odd to have a ~500X difference in writes versus reads. Are you sure? Is it possible you are also opening IndexReaders and searching the commit points? Lucene does re-read previously written (already indexed) documents during segment merges. But at default settings (as long as you did not ch

Re: ArithmeticException: due to integer overflow during lucene merging

2024-05-15 Thread Michael McCandless
Thanks Jerven, more response inlined below: On Tue, May 14, 2024 at 12:58 PM Jerven Tjalling Bolleman wrote: The index that had an issue when merging into one segment definitely had > more than 1 billion times the word "positional" in it. I hope to be able > to give a closer number once re-indexi

Re: ArithmeticException: due to integer overflow during lucene merging

2024-05-14 Thread Michael McCandless
I think we should at least open an issue to try to improve the exception message? We might catch the exception higher up (where we know the field name) and rethrow with the field name, maybe. We can discuss options on the issue ... If you are not using custom term frequencies it's not clear to m
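
The idea of catching the exception higher up and rethrowing with the field name can be sketched in plain Java. This is only an illustration, not Lucene's merge code: the class, the helper names, and the per-segment counts are all hypothetical, and it assumes the overflow comes from exact int arithmetic (Math.addExact / Math.toIntExact).

```java
final class OverflowWithFieldName {
    // Hypothetical low-level helper: sums per-segment counts into an int.
    // It has no idea which field it is working on.
    static int sumExact(long[] perSegmentCounts) {
        int total = 0;
        for (long count : perSegmentCounts) {
            // Both calls throw ArithmeticException instead of silently wrapping.
            total = Math.addExact(total, Math.toIntExact(count));
        }
        return total;
    }

    // Hypothetical caller that does know the field name: catch and rethrow
    // with the field name added to the message, keeping the original cause.
    static int sumForField(String field, long[] perSegmentCounts) {
        try {
            return sumExact(perSegmentCounts);
        } catch (ArithmeticException e) {
            ArithmeticException wrapped = new ArithmeticException(
                "overflow while merging field \"" + field + "\": " + e.getMessage());
            wrapped.initCause(e);
            throw wrapped;
        }
    }

    public static void main(String[] args) {
        System.out.println(sumForField("title", new long[] {1, 2, 3}));
        try {
            // Two segments with 1.5 billion occurrences each do not fit in an int.
            sumForField("body", new long[] {1_500_000_000L, 1_500_000_000L});
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With this shape the low-level code stays unchanged, and only the layer that knows the field name pays for the better message.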

Re: recommended index size

2024-01-04 Thread Michael McCandless
Hi Vincent, Lucene has a hard limit of ~2.1 B documents in a single index; hopefully you hit the ~50 - 100 GB limit well before that. Otherwise it's very application dependent: how much latency can you tolerate during searching, how fast are the underlying IO devices at random and large sequentia

Re: Performance changes within the Lucene 8 branch

2023-12-14 Thread Michael McCandless
Hi Marc, How are you retrieving your hits? Lucene's stored fields, or doc values, or both? Do you sort the hits docids and then retrieve them in docid order (NOT in the sorted order Lucene returned them in)? I think that might be faster as Lucene's stored fields use block compression and if the
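
A minimal sketch of the "retrieve in docid order" suggestion, using only the JDK. The docids and the loaded[] placeholder are made up; in a real application the marked line would be the stored-fields or doc-values lookup. The point is just that the fetch order and the ranked order can differ.

```java
import java.util.Arrays;
import java.util.Comparator;

final class DocIdOrderFetch {
    /** Returns rank positions (indexes into hitDocIds), ordered by ascending docid. */
    static Integer[] fetchOrder(int[] hitDocIds) {
        Integer[] order = new Integer[hitDocIds.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingInt(i -> hitDocIds[i]));
        return order;
    }

    public static void main(String[] args) {
        int[] hitDocIds = {907, 12, 455, 13}; // hypothetical hits, in ranked order
        String[] loaded = new String[hitDocIds.length];
        for (int rank : fetchOrder(hitDocIds)) {
            // Placeholder for the real per-document lookup; visited in docid order.
            loaded[rank] = "doc-" + hitDocIds[rank];
        }
        // Results are still reported in the original ranked order.
        System.out.println(Arrays.toString(loaded));
    }
}
```

Whether this actually helps depends on how the hits are spread across compressed blocks, so it is worth measuring before and after.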

Re: Consistent NRT searching with SearcherLifetimeManager and multiple instances

2023-12-14 Thread Michael McCandless
Hi Steven, Great question! I'm so glad to hear your app is providing consistent pagination :) I've long felt Lucene (with NRT segment replication) could do a great job at this, yet so few apps manage to implement it. Every time I interact with a search engine and go to the next page it irks me

Re: When to use StringField and when to use FacetField for categorization?

2023-10-20 Thread Michael McCandless
e has to > "connect" it with a TaxonomyWriter > > FacetsConfig config = new FacetsConfig(); > DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir); > indexWriter.addDocument(config.build(taxoWriter, doc)); > > right? > > Thanks > > Michael

Re: When to use StringField and when to use FacetField for categorization?

2023-10-20 Thread Michael McCandless
There are some differences. StringField is indexed into the inverted index (postings) so you can do efficient filtering. You can also store in stored fields to retrieve. FacetField does everything StringField does (filtering, storing (maybe?)), but in addition it stores data for faceting. I.e.

Re: Lucene Index Writer in a distributed system

2023-10-19 Thread Michael McCandless
Hi Gopal, Indeed, for a single Lucene index, only one writer may be open at a time. Lucene tries to catch you if you mess this up, using file-based locking. If you really need concurrent indexing, you could have N IndexWriters each writing into a private Directory, and then periodically use addIn

Re: Reindexing leaving behind 0 live doc segments

2023-08-31 Thread Michael McCandless
Hi Rahul, Please do not pursue Approach 2 :) ReadersAndUpdates.release is not something the application should be calling. This path can only lead to pain. It sounds to me like something in Solr is holding an old reader (maybe the last commit point, or reader prior to the refresh after you re-i

Re: Vector Search with OpenAI Embeddings: Lucene Is All You Need

2023-08-31 Thread Michael McCandless
Thanks Michael, very interesting! I of course agree that Lucene is all you need, heh ;) Jimmy Lin also tweeted about the strength of Lucene's HNSW: https://twitter.com/lintool/status/1681333664431460353?s=20 Mike McCandless http://blog.mikemccandless.com On Thu, Aug 31, 2023 at 3:31 AM Michae

Re: LuceneTestCase altered the default query cache policy

2023-06-27 Thread Michael McCandless
Hi Yuan, [Disclaimer: I work in the same team at Amazon, customer facing product search, where we heavily use Lucene at high scale!] LuceneTestCase already has similar assertions, e.g. to confirm that no system properties were changed, no threads leaked, not too many static objects left reference

Re: Lucene in action

2023-06-10 Thread Michael McCandless
Hi Vimal, Indeed I think it is unlikely I have the energy for a 3rd edition ... but anyone can drive the 3rd edition, not just the prior authors. New authors welcome! > Since 2nd edition ( based on lucene 4), I'm sorry to say that 2nd edition is based on Lucene 3.0 not 4! It's even older than

Re: Analyzer.createComponents(String fieldname) only being called once, when indexing multiple documents

2023-06-09 Thread Michael McCandless
Hi Usman, Long ago Lucene switched to reusing these analysis components (per Analyzer, per thread), so that explains why createComponents is called once. However, the reuse policy is controllable (expert usage), so in theory you could implement an Analyzer.ReuseStrategy that never reuses and pass

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-09 Thread Michael McCandless
I'd also love to understand this: > using SimpleFSDirectoryFactory (since Mmap doesn't quite work well on Windows for our index sizes which commonly run north of 1 TB) Is this a known problem on certain versions of Windows? Normally memory mapped IO can scale to very large sizes (well beyond s

Re: Info required on licensing of Lucene component

2023-05-18 Thread Michael McCandless
> > > > We do see the fix included in Lucene 9.6.0. > > Appreciate your prompt response and thank you so much for resolving the > issue! > > > > Regards, > > Open Source Request Team > > > > *From:* Michael McCandless > *Sent:* 11 May

Re: Question - Why stopwords.txt provided by smartcn contains blank lines?

2023-05-15 Thread Michael McCandless
Hi Jerry, I agree, that makes no sense! Maybe the stopword loader should ignore truly blank lines? Also, the comments on lines 57 and 59 are confusing -- there are no (default) English and Chinese stopwords in the file. I guess they are placeholders. Could you open an issue in Lucene's GitHub

Re: Info required on licensing of Lucene component

2023-05-11 Thread Michael McCandless
s/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue I'll start a separate thread ... Mike McCandless http://blog.mikemccandless.com On Wed, May 10, 2023 at 12:28 PM Michael McCandless < luc...@mikemccandless.com> wrote: > Hello, > > That's a great ques

Re: Info required on licensing of Lucene component

2023-05-10 Thread Michael McCandless
ory.com/artifact/org.apache.lucene/lucene-backward-codecs/9.3.0 > > > > Hence, just wanted to confirm exactly which Lucene release is the > update/pull request applied to? > > > > > > Thanks, > > Open Source Request Team > > > > > > *From:* Micha

Re: Info required on licensing of Lucene component

2023-04-06 Thread Michael McCandless
> In that case, can you’ll update your source repo for Lucene to exclude references to ‘junit’ from Notices.txt file since it is something which is not part of distribution for Lucene. That sounds reasonable to me. I'll open an issue in our GitHub repo, but IANAL and I'm not sure how to specifica

Re: Info required on licensing of Lucene component

2023-04-04 Thread Michael McCandless
Hello, You maybe missed the two responses already to the email, since by default responses only go to the user list, not back to the individual. See the archived responses here: https://lists.apache.org/thread/zg01tkq8wtmym27q3dolcg1msbtoxoxl Mike McCandless http://blog.mikemccandless.com On

Re: Vector Search on Lucene

2023-03-16 Thread Michael McCandless
Note that Lucene's demo package (IndexFiles.java, SearchFiles.java) also shows examples of how to index and search KNN vectors. Mike McCandless http://blog.mikemccandless.com On Thu, Mar 2, 2023 at 4:46 AM Michael Wechner wrote: > Hi Marcos > > The indexing looks kind of > > Document doc =new

Re: [ANNOUNCE] Issue migration Jira to GitHub starts on Monday, August 22

2022-08-25 Thread Michael McCandless
"fix-version"), please review this >> manual. >> > >> https://github.com/apache/lucene/blob/main/dev-docs/github-issues-howto.md >> > >> > Tomoko >> > >> > >> > 2022年8月22日(月) 19:46 Michael McCandless : >> >> >>

Re: [ANNOUNCE] Issue migration Jira to GitHub starts on Monday, August 22

2022-08-22 Thread Michael McCandless
Wooot! Thank you so much Tomoko!! Mike On Mon, Aug 22, 2022 at 6:44 AM Tomoko Uchida wrote: > > > Issue migration has been started. Jira is now read-only. > > GitHub issue is available for new issues. > > - You should open new issues on GitHub. E.g. > https://github.com/apache/lucene/issues/1

Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-06 Thread Michael McCandless
OK done: https://github.com/apache/lucene-jira-archive/commit/13fa4cb46a1a6d609448240e4f66c263da8b3fd1 Mike McCandless http://blog.mikemccandless.com On Sat, Aug 6, 2022 at 10:29 AM Baris Kazar wrote: > I think so. > Best regards > -- > *From:* Michae

Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-06 Thread Michael McCandless
Thanks Baris, And your Jira ID is bkazar right? Mike On Sat, Aug 6, 2022 at 10:05 AM Baris Kazar wrote: > My github username is bmkazar > can You please register me? > Best regards > ____ > From: Michael McCandless > Sent: Saturday, August 6, 20

Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-06 Thread Michael McCandless
p the linked accounts coming! Mike On Thu, Aug 4, 2022 at 7:02 PM Rushabh Shah wrote: > Hi, > My mapping is: > JiraName,GitHubAccount,JiraDispName > shahrs87, shahrs87, Rushabh Shah > > Thank you Tomoko and Mike for all of your hard work. > > > > > On Sun, Jul 31, 2022

Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-07-31 Thread Michael McCandless
hub id: wjp719 > > the jira issue I create before: > https://issues.apache.org/jira/browse/LUCENE-10425 > the github pr I submit before: https://github.com/apache/lucene/pull/780 > > > Best Regards, > jianping weng > > > > Michael McCandless 于2022年7月31日周日

[HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-07-31 Thread Michael McCandless
Hello Lucene users, contributors and developers, If you have used Lucene's Jira and you have a GitHub account as well, please check whether your user id mapping is in this file: https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/account-map.csv.20220722.verified If no

Re: Unclear on what position means

2022-07-22 Thread Michael McCandless
Hi Kendall, "Position" and "Offset" are often confused in Lucene ;) Lucene uses offset to track what you referred to ("(character, not byte) offset into a text file", or into an indexed string). Lucene uses position to track the Nth token: position 0 is first token, position 1 is the second toke
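
To make the distinction concrete, here is a small plain-Java sketch that prints both numbers for every token. It is a toy whitespace tokenizer, not a Lucene Analyzer; the Token class and its field names are hypothetical, and the end offset here is exclusive (one past the token's last character).

```java
import java.util.ArrayList;
import java.util.List;

final class PositionVsOffset {
    /** Hypothetical token holder: position counts tokens, offsets count characters. */
    static final class Token {
        final String term;
        final int position;    // 0 for the first token, 1 for the second, ...
        final int startOffset; // index of the token's first character
        final int endOffset;   // index one past the token's last character

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i == text.length()) {
                break;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            tokens.add(new Token(text.substring(start, i), position++, start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("the fox is quick")) {
            System.out.println(t.term + " position=" + t.position
                + " offsets=[" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```

For "the fox is quick" the token "quick" is at position 3 but spans character offsets 11 to 16, which is the usual source of the confusion.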

Re: Replicator PrimaryNode waits forever for remotes to close

2022-06-30 Thread Michael McCandless
+1 to provide a timeout, or, to simply fix close to aggressively close regardless of what the replicas are doing? It's not a great design for primary to be so dependent on the replicas (but vice versa makes sense?). Maybe open a Jira issue or start a PR so we can discuss? Thanks for uncovering

Re: Index corruption and repair

2022-05-05 Thread Michael McCandless
Antony, do you maybe have Microsoft Defender turned on, which might quarantine files that it suspects are malicious? I'm not sure if it is on by default these days on modern Windows boxes ... Mike McCandless http://blog.mikemccandless.com On Thu, May 5, 2022 at 10:34 AM Michael McCan

Re: Index corruption and repair

2022-05-05 Thread Michael McCandless
On Thu, May 5, 2022 at 10:30 AM Uwe Schindler wrote: To find all errors in an index, you should pass -ea to the java command > line to enable assertions. > +1 Tempting to make CheckIndex demand that :) Or at least, slow you down and make it clear why, if assertions are disabled. Mike McCandle
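
For reference, the standard Java idiom for detecting whether the JVM was started with -ea looks roughly like this. It is a sketch of how a tool could warn when assertions are off, not what CheckIndex actually does; the class name and message wording are made up.

```java
final class AssertionCheck {
    /** Returns true only if assertions are enabled for this class. */
    static boolean assertionsEnabled() {
        boolean enabled = false;
        // The assignment is a side effect that only runs under -ea.
        assert enabled = true;
        return enabled;
    }

    /** Hypothetical message a tool could print before doing its checks. */
    static String describe(boolean enabled) {
        return enabled
            ? "assertions are enabled"
            : "assertions are disabled; re-run with -ea for more thorough checks";
    }

    public static void main(String[] args) {
        System.out.println(describe(assertionsEnabled()));
    }
}
```

Running it as "java AssertionCheck" versus "java -ea AssertionCheck" shows the two outcomes.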

Re: Index corruption and repair

2022-05-05 Thread Michael McCandless
e? > > Regards, > Antony > > On Sun, 1 May 2022 at 19:35, Antony Joseph > wrote: > >> Hi Michael, >> >> Thank you for your reply. Please find responses to your questions below. >> >> Regards, >> Antony >> >> On Sat, 30 Apr 2022

Re: Index corruption and repair

2022-04-30 Thread Michael McCandless
Hi Antony, Hmm it looks like the root cause is this: Caused by: java.nio.file.NoSuchFileException: D:\i\202204\_14gb.si Can you list all the files in the index directory at the time this exception happens, and reply here? We need to figure out whether the file is really missing or what.

Re: A question on PhraseQuery and slop

2021-12-13 Thread Michael McCandless
Hello Claude, Hmm, that is interesting that you see slop=2 matching query "quick fox" against document "the fox is quick". Edit distance (Levenshtein) is a bit tricky because it might include a transposition (just swapping the two words) as edit distance 1 OR 2. So maybe Lucene's PhraseQuery is

Re: Java 17 and Lucene

2021-10-20 Thread Michael McCandless
ontroller > > > > in the RamUsageEstimator class. > > > > We suppressed the warning for now (based on recommendations > > > > <http://mail-archives.apache.org/mod_mbox/db-derby- > > > > dev/202106.mbox/%3CJIRA.13369440.1617476525000.615331.16239514800 >

Re: Java 17 and Lucene

2021-10-18 Thread Michael McCandless
Also, I try to semi-aggressively upgrade Lucene's nightly benchmarks to new JDK releases and leave an annotation on the nightly charts: https://home.apache.org/~mikemccand/lucenebench/ I just now upgraded to JDK 17 and kicked off a new benchmark run ... in a few hours it should show the new data p

Re: Issue regarding build

2021-08-19 Thread Michael McCandless
Hello Udit, The screen shot did not come through for me -- it's a broken image. Maybe copy/paste the text of the error instead? Also, try running "./gradlew assemble" from the command-line (in a console shell, e.g. Terminal on OS X) instead? Mike McCandless http://blog.mikemccandless.com On

Re: Info about the Lucene 4.10.4 version.

2021-06-22 Thread Michael McCandless
Hi Arvind, I responded about this on the issue you also opened: https://issues.apache.org/jira/browse/LUCENE-10013 Mike McCandless http://blog.mikemccandless.com On Tue, Jun 22, 2021 at 10:04 AM Arvind Kumar Sahu wrote: > Hi Team, > > Currently we are using Lucene 4.10.4 version. We are gett

Re: Multiple merge-runs from same set of segments

2021-05-24 Thread Michael McCandless
Are you trying to rewrite your already created index into a different segment geometry? Maybe have a look at the new IndexRearranger tool? It is already doing something like what you enumerated below, including mocking LiveDocs to get the right

Re: Performance decrease with NRT use-case in 8.8.x (coming from 8.3.0)

2021-05-19 Thread Michael McCandless
> The update showed no issues (e.g. compiled without changes) but I noticed that our test-suites take a lot longer to finish. Hmm, that sounds bad. We need our tests to stay fast but also do a good job testing things ;) Does your production usage also slow down? Tests do other interesting thing

Re: Correct usage of synonyms with Japanese

2021-05-18 Thread Michael McCandless
Hi Geoffrey, [Disclaimer: Geoffrey and I both work at Amazon on customer-facing product search] We absolutely must get SynonymGraphFilter consuming input graphs! This is just a (serious) bug in it! But it's just software, let's fix it :) That is clearly the right fix, it is just rather fun and

Re: CorruptIndexException after failed segment merge caused by No space left on device

2021-03-24 Thread Michael McCandless
+1, this sounds like a bad bug in Lucene! We try hard to test for and prevent such bugs! As long as you succeeded in at least one commit since creating the index before you hit the disk full, restarting Lucene on the index should have recovered from that last successful commit. How often do you

Re: [VOTE] Lucene logo contest, third time's a charm

2020-12-21 Thread Michael McCandless
Thank you Ryan for pushing forwards to our new logo. Now that this VOTE has passed, are there issues open to actually "deliver it" to the world? E.g. I see https://lucene.apache.org still shows our old logo. Branding is a lot of work! Mike McCandless http://blog.mikemccandless.com On Tue, Se

Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2020-12-14 Thread Michael McCandless
Hello, Yes, that is exactly what MMapDirectory.setPreload is trying to do, but no promises (it is best effort). I think it asks the OS to touch all pages in the mapped region so they are cached in RAM, if you have enough RAM. Make your JVM heap as low as possible to let the OS have more RAM to

Re: best way (performance wise) to search for field without value?

2020-11-13 Thread Michael McCandless
_ALLOWED_EMPTY", BooleanClause.Occur.SHOULD); > > On Fri, Nov 13, 2020 at 2:09 PM Michael McCandless < > luc...@mikemccandless.com> wrote: > > > Maybe NormsFieldExistsQuery as a MUST_NOT clause? Though, you must > enable > > norms on your field to use that. >

Re: best way (performance wise) to search for field without value?

2020-11-13 Thread Michael McCandless
Maybe NormsFieldExistsQuery as a MUST_NOT clause? Though, you must enable norms on your field to use that. TermRangeQuery is indeed a horribly costly way to execute this, but if you cache the result on each refresh, perhaps it is OK? You could also index a dedicated doc values field indicating t

Re: BooleanQuery normal form

2020-09-27 Thread Michael McCandless
Hi Patrick, I don't think Lucene supports CNF or DNF for BooleanQuery? BooleanQuery will try to do some rewriting simplifications for degenerate cases, e.g. a BooleanQuery with a single clause. Probably it could do more optimizing? It is quite complex already :) Mike On Tue, Sep 22, 2020 at 1

Re: Optimizing term-occurrence counting (code included)

2020-09-21 Thread Michael McCandless
I left a comment on the issue. Mike McCandless http://blog.mikemccandless.com On Sun, Sep 20, 2020 at 1:08 PM Alex K wrote: > Hi all, I'm still a bit stuck on this particular issue. I posted an issue on > the Elastiknn repo outlining some measurements and thoughts on potential > solutions: htt

Re: [VOTE] Lucene logo contest, third time's a charm

2020-09-02 Thread Michael McCandless
A2, A1, C5, D (binding) Thank you to everyone for working so hard to make such cool looking possible future Lucene logos! And to Ryan for the challenging job of calling this VOTE :) Mike McCandless http://blog.mikemccandless.com On Tue, Sep 1, 2020 at 4:21 PM Ryan Ernst wrote: > Dear Lucene

Re: Hierarchical facet select a subtree but one child

2020-08-17 Thread Michael McCandless
I think this is a missing API in DrillDownQuery? Nicola, could you open an issue? The filtering is as Mike Sokolov described, but I think we should add a sugar method, e.g. DrillDownQuery.remove or something, to add a negated query clause. And until this API is added and you can upgrade to it, y

Re: Adding fields with same field type complains that they have different term vector settings

2020-06-30 Thread Michael McCandless
> would work to just reconstruct the values for the field being modified, or > am I likely to just run into more issues by modifying a loaded Document? > > Regards, > Albert > > > From: "Michael McCandless" > > To: "java-user" , "albert macsw

Re: Adding fields with same field type complains that they have different term vector settings

2020-06-29 Thread Michael McCandless
Hi Albert, Unfortunately, you have fallen into a common and sneaky Lucene trap. The problem happens because you loaded a Document from the index's stored fields (the one you previously indexed) and then tried to modify that one and re-index. Lucene does not guarantee that this will work, because

Re: Sharing buffer between large number of IndexWriters?

2020-06-22 Thread Michael McCandless
Hello Marcin, Alas, Lucene does not have this capability out of the box. However, you are able to live-update the IndexWriterConfig.setRAMBufferSizeMB, and the change should take effect on the next document indexed in that IndexWriter instance. So you could build your own "proportional RAM" on t
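
One possible shape for the bookkeeping side of a "proportional RAM" scheme, in plain Java. Everything here is an assumption for illustration: the class, the weights (e.g. recent indexing rate per writer), and the per-writer floor are hypothetical. Each computed value would then be handed to the live setRAMBufferSizeMB setting mentioned above for the matching IndexWriter.

```java
import java.util.Arrays;

final class ProportionalRamBudget {
    /**
     * Splits totalMB across writers: every writer gets minMB, and whatever is
     * left is divided in proportion to the weights (evenly if all weights are 0).
     */
    static double[] split(double totalMB, double[] weights, double minMB) {
        int n = weights.length;
        double remainder = totalMB - n * minMB;
        if (remainder < 0) {
            throw new IllegalArgumentException("total budget is below the per-writer floor");
        }
        double weightSum = 0;
        for (double w : weights) {
            weightSum += w;
        }
        double[] budgets = new double[n];
        for (int i = 0; i < n; i++) {
            double share = weightSum > 0 ? weights[i] / weightSum : 1.0 / n;
            budgets[i] = minMB + remainder * share;
        }
        return budgets;
    }

    public static void main(String[] args) {
        // Two hypothetical writers, the second three times as busy as the first.
        double[] budgets = split(100.0, new double[] {1.0, 3.0}, 10.0);
        System.out.println(Arrays.toString(budgets));
    }
}
```

Recomputing the split periodically, rather than on every document, keeps the overhead negligible.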

Re: [VOTE] Lucene logo contest

2020-06-17 Thread Michael McCandless
Change is good :) I vote Option A (binding PMC vote). Thank you to all the open-source artists who helped out here. Mike McCandless http://blog.mikemccandless.com On Mon, Jun 15, 2020 at 6:08 PM Ryan Ernst wrote: > Dear Lucene and Solr developers! > > In February a contest was started to de

Re: CheckIndex complaining about -1 for norms value

2020-06-11 Thread Michael McCandless
Maybe we should fix CheckIndex to print norms as unsigned integers? Mike McCandless http://blog.mikemccandless.com On Thu, Jun 11, 2020 at 3:00 AM Adrien Grand wrote: > To my knowledge, -1 always represented the maximum supported length, both > before and after 7.0 (when we changed the norms
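
To show what "print as unsigned" would change, here is a tiny plain-Java sketch. It is not CheckIndex code; the class and method names are hypothetical, and which width applies (a single byte versus a long) is an assumption that depends on how the norm is stored and exposed.

```java
final class UnsignedNorm {
    /** Formats a norm held in one signed byte as an unsigned value (0..255). */
    static String byteNormAsUnsigned(byte norm) {
        return Integer.toString(Byte.toUnsignedInt(norm));
    }

    /** Formats a norm held in a signed long as an unsigned 64-bit value. */
    static String longNormAsUnsigned(long norm) {
        return Long.toUnsignedString(norm);
    }

    public static void main(String[] args) {
        System.out.println(byteNormAsUnsigned((byte) -1)); // all bits set in one byte
        System.out.println(longNormAsUnsigned(-1L));       // all bits set in 64 bits
    }
}
```

Either way, a -1 would show up as the largest value for its width instead of looking like an error code.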

Re: Lucene Migration issue

2020-06-08 Thread Michael McCandless
You're welcome! Mike McCandless http://blog.mikemccandless.com On Mon, Jun 8, 2020 at 10:48 AM Adarsh Sunilkumar < adarshsunilkuma...@gmail.com> wrote: > Hi Michael, > > Thanks for your information. > > > Thanks&Regards, > Adarsh Sunilkumar > > On

Re: Lucene Migration issue

2020-06-08 Thread Michael McCandless
jira/browse/LUCENE-8134 > <https://issues.apache.org/jira/browse/LUCENE-8134> > > Thanks& Regards, > Adarsh Sunilkumar > > On Fri, Jun 5, 2020 at 7:28 PM Michael McCandless < > luc...@mikemccandless.com> wrote: > >> This just means you previously ind

Re: Lucene Migration issue

2020-06-05 Thread Michael McCandless
This just means you previously indexed only docids (skipping term frequencies, positions) for at least one of the fields in at least one document in your existing index. But now you are trying to also index with term frequencies and positions, which Lucene cannot do. You either have to reindex wit

Re: Retrieving query-time join fromQuery hits

2020-06-03 Thread Michael McCandless
e the left side of the join must retain some state, to know which top hits corresponded to those join values, and then add an API to retrieve them? Mike McCandless http://blog.mikemccandless.com On Wed, May 20, 2020 at 6:31 PM Michael McCandless < luc...@mikemccandless.com> wrote:

Re: Retrieving query-time join fromQuery hits

2020-05-20 Thread Michael McCandless
I am trying first to understand the proposed solution from the previous thread. You run query #1, it returns top N hits. From those hits you ask JoinUtil to create the "joined" query #2. You run the query #2 to get the top final (joined) hits. Then, to reconstruct which docids from query #1 mat

Re: Resizable LRUQueryCache

2020-03-10 Thread Michael McCandless
Maybe start with your own cache implementation that implements a resize method? The cache is pluggable through IndexSearcher. Fully discarding the cache and swapping in a newly sized (empty) one could also be jarring, so I can see the motivation for this method. Especially for usages that are ho
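
A rough idea of what a resize method could look like, shown on a generic JDK-only LRU map. This is not a Lucene QueryCache implementation and does not plug into IndexSearcher as written; the class is hypothetical and it only bounds the number of entries, not RAM usage. It does show the key property: shrinking evicts the least-recently-used entries and keeps the hot ones, instead of starting from an empty cache.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

final class ResizableLruCache<K, V> {
    // accessOrder=true makes iteration go from least to most recently used.
    private final LinkedHashMap<K, V> map = new LinkedHashMap<>(16, 0.75f, true);
    private int maxEntries;

    ResizableLruCache(int maxEntries) {
        if (maxEntries < 0) {
            throw new IllegalArgumentException("maxEntries must be >= 0");
        }
        this.maxEntries = maxEntries;
    }

    synchronized V get(K key) {
        return map.get(key);
    }

    synchronized void put(K key, V value) {
        map.put(key, value);
        evict();
    }

    /** Changes the capacity in place; shrinking drops only the coldest entries. */
    synchronized void resize(int newMaxEntries) {
        if (newMaxEntries < 0) {
            throw new IllegalArgumentException("newMaxEntries must be >= 0");
        }
        maxEntries = newMaxEntries;
        evict();
    }

    synchronized int size() {
        return map.size();
    }

    private void evict() {
        Iterator<K> it = map.keySet().iterator();
        while (map.size() > maxEntries && it.hasNext()) {
            it.next();
            it.remove();
        }
    }

    public static void main(String[] args) {
        ResizableLruCache<String, Integer> cache = new ResizableLruCache<>(3);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.put("c", 3);
        cache.get("a");  // "a" becomes the most recently used entry
        cache.resize(2); // evicts "b", the least recently used
        System.out.println(cache.size() + " " + cache.get("b") + " " + cache.get("a"));
    }
}
```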

Re: Lucene 7.7.2 Indexwriter.numDocs() replacement in Lucene 8.4.1

2020-02-26 Thread Michael McCandless
Yes. Mike McCandless http://blog.mikemccandless.com On Mon, Feb 24, 2020 at 5:55 PM wrote: > A typo corrected below. > > Best regards > > > On 2/24/20 5:54 PM, baris.ka...@oracle.com wrote: > > Hi,- > > > > I hope everyone is doing great. > > > > > > I think the Lucene 7.7.2 Indexwriter.num

Re: Searching number of tokens in text field

2020-01-02 Thread Michael McCandless
Norms encode the number of tokens in the field, but in a lossy manner (1 byte by default), so you could probably create a custom query that filtered based on that, if you could tolerate the loss in precision? Or maybe change your norms storage to more precision? You could use NormsFieldExistsQuer

Re: Lucene Index Cloud Replication

2019-07-09 Thread Michael McCandless
+1 to share code for doing 1) and 3) both of which are tricky! Safely moving / copying bytes around is a notoriously difficult problem ... but Lucene's "end to end checksums" and per-segment-file-GUID make this safer. I think Lucene's replicator module is a good place for this? Mike McCandless
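
As a generic illustration of why checksums make copying bytes around safer, here is a JDK-only sketch that compares a checksum of the source bytes with a checksum of the copy. The use of java.util.zip.CRC32 and the helper names are assumptions made for this example; it does not read or reproduce Lucene's on-disk checksum format.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class CopyChecksum {
    /** CRC32 over the whole byte array. */
    static long crc32(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    /** A copy is accepted only if its checksum matches the source's checksum. */
    static boolean copyIsIntact(byte[] source, byte[] copy) {
        return crc32(source) == crc32(copy);
    }

    public static void main(String[] args) {
        byte[] source = "segment file bytes".getBytes(StandardCharsets.UTF_8);
        byte[] goodCopy = source.clone();
        byte[] badCopy = source.clone();
        badCopy[3] ^= 0x01; // simulate a single flipped bit in transit

        System.out.println(copyIsIntact(source, goodCopy));
        System.out.println(copyIsIntact(source, badCopy));
    }
}
```

In a real replication setup the expected checksum would travel with the file metadata, so the receiver can verify without re-reading the source.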

Re: find documents with big stored fields

2019-07-01 Thread Michael McCandless
Hi Rob, The codec records per docid how many bytes each document consumes -- maybe instrument the codec's sources locally, then open your index and have it visit stored fields for every doc in the index and gather stats? Or, to avoid touching Lucene level code, you could make a small tool that lo

Re: ArrayIndexOutOfBoundsException during System.arraycopy in BKDWriter

2019-05-03 Thread Michael McCandless
Note that the -Xint flag will make your code run tremendously more slowly! Likely to the point of not really being usable. But it'd be interesting to see if that side-steps the bug. Is it possible to test with OpenJDK as well? The BKDWriter code is quite complex, so it is also possible there i

Re: Ask about Lucene/Core/Index DocumentsWriter

2019-03-19 Thread Michael McCandless
Can you try increasing your IndexWriter.setRAMBufferSizeMB? That flush control logic will block incoming threads if the number of bytes trying to flush to disk is too large relative to your RAM buffer. Mike McCandless http://blog.mikemccandless.com On Mon, Mar 18, 2019 at 2:30 PM yuncheng lu

Re: FlattenGraphFilter assertion error

2019-03-12 Thread Michael McCandless
ttenGraphFilter.java:195) > at > org.apache.lucene.analysis.core.FlattenGraphFilter.incrementToken(FlattenGraphFilter.java:258) > at com.wolfram.textsearch.AnalyzerError.main(AnalyzerError.java:32) > > It's the interaction between WordDelimiterGraphFilter and stop word > removal

Re: IndexWriter concurrent flushing

2019-02-17 Thread Michael McCandless
+1 to make it simple to let multiple threads help with commit/refresh operations. IW.yield is a simple way to achieve it, matching (roughly) how IW's commit/refresh work today, hijacking incoming indexing threads to gain concurrency. I think this would be a small change? Adding an ExecutorServic

Re: prorated early termination

2019-02-03 Thread Michael McCandless
On Sun, Feb 3, 2019 at 10:41 AM Michael Sokolov wrote: > > In single-threaded mode we can check against minCompetitiveScore and > terminate collection for each segment appropriately, > > > Does Lucene do this today by default? That should be a nice > optimization, > and it'd be safe/correct. >

Re: prorated early termination

2019-02-03 Thread Michael McCandless
I think this is because our per-hit cost is sometimes very high -- we have "post filters" that are sometimes very restrictive. We are working to get those post-filters out into an inverted index to make them more efficient, but net/net reducing how many hits we must collect for each segment can he

Re: prorated early termination

2019-02-03 Thread Michael McCandless
One question about this: > In single-threaded mode we can check against minCompetitiveScore and terminate collection for each segment appropriately, Does Lucene do this today by default? That should be a nice optimization, and it'd be safe/correct. Mike McCandless http://blog.mikemccandless.co

Re: RamUsageCrawler

2018-12-06 Thread Michael McCandless
I think you mean RamUsageEstimator (in Lucene's test-framework)? It's entirely possible it fails to dig into Maps correctly with newer Java releases; maybe Dawid or Uwe would know? Mike McCandless http://blog.mikemccandless.com On Tue, Dec 4, 2018 at 12:18 PM Michael Sokolov wrote: > Hi, I'm

Re: Race condition between IndexWriter.commit and IndexWriter.close

2018-12-05 Thread Michael McCandless
n the > documentation does it say that these two calls should be synchronized... > at least that must be fixed. :) > > On 12/1/18 6:25 PM, Michael McCandless wrote: > > I think if you call commit and close concurrently the results are > undefined > > and so this is accepta

Re: Race condition between IndexWriter.commit and IndexWriter.close

2018-12-01 Thread Michael McCandless
I think if you call commit and close concurrently the results are undefined and so this is acceptable. Mike On Thu, Nov 29, 2018 at 5:53 AM Boris Petrov wrote: > Hi all, > > We're getting the following exception: > > java.lang.IllegalStateException: cannot close: prepareCommit was already > cal

Re: SearcherManager not seeing changes in IndexWriteral and

2018-11-12 Thread Michael McCandless
Thanks for bringing closure, Boris. Mike McCandless http://blog.mikemccandless.com On Mon, Nov 12, 2018 at 7:13 AM Boris Petrov wrote: > Hello, > > OK, so actually this appears to be a bug in our code - Lucene is searching > correctly, we were doing something wrong with the result after that.

Re: MultiPhraseQuery or PhraseQuery to take the synonyms into account?

2018-09-22 Thread Michael McCandless
PhraseQuery can indeed be used to represent a multi-token synonym. In fact, I mis-spoke before: MultiPhraseQuery can also represent a multi-token synonym when the multiple tokens are all the same except in one spot. Mike McCandless http://blog.mikemccandless.com On Thu, Sep 20, 2018 at 2:32 PM

Re: Question About FST, multiple-column index

2018-09-22 Thread Michael McCandless
You might want to index the name field normally (as StringField, for example), then index the age as a NumericDocValuesField, and then make a BooleanQuery with two required clauses, one clause TermQuery on the name, the other a NumericDocValuesField.newSlowExactQuery. Even though its name is "slow

Re: MultiPhraseQuery

2018-09-18 Thread Michael McCandless
Yes, +1 for a patch to improve the docs! MultiPhraseQuery only works for single term synonyms, and is usually produced by query parsers when the incoming query text had single term synonyms matching, I think? The query parser will use other (span?) queries for multi token synonyms. I think the e

Re: SynonymGraphFilter

2018-09-11 Thread Michael McCandless
Try reading the blog post I wrote about token stream graphs? http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html Mike McCandless http://blog.mikemccandless.com On Tue, Sep 11, 2018 at 1:35 PM, wrote: > Any comments please? > > Thanks > > > On 9/10/18 5:07 PM, baris.k

Re: SynonymMap.Builder.add method

2018-09-11 Thread Michael McCandless
That's correct. When the input sequence is seen during tokenization, the synonym (graph) filter will also insert the output tokens into the TokenStream, as if they "naturally" occurred. Mike McCandless http://blog.mikemccandless.com On Tue, Sep 11, 2018 at 1:35 PM, wrote: > Any comments pleas

Re: SynonymMap

2018-09-10 Thread Michael McCandless
The SynonymMap.Builder constructor takes a dedup parameter to tell it what to do in that case (when input and output are identical across added rules). Mike McCandless http://blog.mikemccandless.com On Thu, Sep 6, 2018 at 2:06 PM, Baris Kazar wrote: > Hi,- > how does SynonymMap deal with repea

Re: offsets

2018-07-29 Thread Michael McCandless
How would a fixup API work? We would try to provide correctOffset throughout the full analysis chain? Mike McCandless http://blog.mikemccandless.com On Wed, Jul 25, 2018 at 8:27 AM, Michael Sokolov wrote: > I've run into some difficulties with offsets in some TokenFilters I've been > writing,

Re: Deleted documents and NRT Readers

2018-07-20 Thread Michael McCandless
add one > document and then update it, that the load is so small that it for sure > would not have applied the delete. > > Why am I wrong in thinking this? > > > On Thu, Jul 19, 2018, 5:50 PM Michael McCandless < > luc...@mikemccandless.com> wrote: > >> Passing a

Re: Deleted documents and NRT Readers

2018-07-19 Thread Michael McCandless
Passing applyDeletes=false means Lucene does not have to apply all of its buffered deletes. But, it still may have already applied some deletes, so there's no guarantee that it won't have applied deletes. Mike McCandless http://blog.mikemccandless.com On Thu, Jul 19, 2018 at 3:23 PM, Stuart Gol

Re: Lucene Speed

2018-07-18 Thread Michael McCandless
Hi Ehson, Have you looked at the luceneutil source code that runs the benchmarks? https://github.com/mikemccand/luceneutil The sources are not super clean, but that's what's running the nightly benchmarks, starting from src/main/perf/Indexer.java. Mike McCandless http://blog.mikemccandless.com

Re: Recreating index lucene without stopping client applications

2018-07-18 Thread Michael McCandless
If you use IndexWriter.deleteAll, rather than any of the delete-by-Query or delete-by-Term methods, the delete should be quite efficient, as IndexWriter just drops all segments. That API is also transactional, so you could call IW.deleteAll, proceed to reindex all your documents, and if somehow that crash

Re: UTF8TaxonomyWriterCache inconsistency

2018-07-02 Thread Michael McCandless
Yes please create a Jira issue! Mike On Mon, Jul 2, 2018, 12:31 AM Руслан Торобаев wrote: > Hi! > > I’m facing a problem with taxonomy writer cache inconsistency. At some > point in time UTF8TaxonomyWriterCache starts to return wrong ord for some > facet labels. As result wrong ord are written

Re: Help! - Max Segment name reached

2018-04-21 Thread Michael McCandless
Well I think as time goes on we'll see more and more people running into it ;) But you really need to commit at a surprisingly high rate, and have a surprisingly long lived index, to overflow the int that holds the segment number. E.g. if you commit once per second, it should take ~68 years to ov
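
The ~68 year figure can be checked with a little arithmetic. What follows is only a rough sketch in plain Java, assuming the counter is a signed 32-bit int and that each commit consumes exactly one segment number (a simplification). The class and method names are made up for illustration and are not Lucene APIs.

```java
public class SegmentCounterOverflow {
    // Assumption: a 365-day year, ignoring leap days.
    static final double SECONDS_PER_YEAR = 365.0 * 24 * 60 * 60;

    // Years until a signed 32-bit counter overflows, assuming one
    // segment number is consumed per commit (hypothetical simplification).
    static double yearsToOverflow(double commitsPerSecond) {
        double secondsToOverflow = Integer.MAX_VALUE / commitsPerSecond;
        return secondsToOverflow / SECONDS_PER_YEAR;
    }

    public static void main(String[] args) {
        System.out.println("1 commit/sec:    " + yearsToOverflow(1.0) + " years");
        System.out.println("100 commits/sec: " + yearsToOverflow(100.0) + " years");
    }
}
```

At one commit per second this comes out to roughly 68 years, in line with the estimate above; a higher commit rate shortens the time proportionally.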

Re: WordDelimiterGraphFilter does not respect KeywordAttribute

2018-04-21 Thread Michael McCandless
+1 Mike On Fri, Apr 20, 2018, 9:42 AM Michael Sokolov wrote: > I have a use case that generates some tokens containing punctuation > (fractions and other numerical constructs), but I am handling most > punctuation with WordDelimiterGraphFilter, which then decomposes those > tokens into parts an

Re: IndexWriter updateDocument is removing doc from index

2018-03-16 Thread Michael McCandless
Yes you can add documents by calling updateDocument -- if no prior documents matched the deletion Term you provide, nothing is deleted and your new doc is added. Hmm are you sure your 2nd update really updated and then added 12 new docs? Dropping segment 1 makes sense because you deleted the one

Re: any api to get segment number of index

2018-01-14 Thread Michael McCandless
How about IndexSearcher.getIndexReader().leaves().size()? Mike McCandless http://blog.mikemccandless.com On Wed, Jan 10, 2018 at 5:19 AM, Yonghui Zhao wrote: > Hi, > > Is there any public API that I can get segment number of current version > index? > > I didn't find in indexwriter or indexsea

Re: typed IntPoint.RangeQuery & LongPoint.rangeQuery

2018-01-09 Thread Michael McCandless
Lucene doesn't (shouldn't?) let you add 'a' at first as an IntPoint and then later as a LongPoint -- they must always be consistent. So however you indexed it, you must use the corresponding class to construct the query. String 'hi' can only be found if you had indexed a token 'hi' in that field

Re: index sorting merge

2017-12-28 Thread Michael McCandless
You should upgrade to newer versions of Lucene, where all segments are sorted, not just merged segments. Mike McCandless http://blog.mikemccandless.com On Thu, Dec 28, 2017 at 11:13 AM, Yonghui Zhao wrote: > Hi, > > I specified a SortingMergePolicy in my case. I find only the first N-1 > segm

Re: may be lucene bug

2017-12-28 Thread Michael McCandless
I think there's a bug in your code: this line: doc.doc <= leaf.docBase + leaf.reader().maxDoc()) should be < not <=. Mike McCandless http://blog.mikemccandless.com On Thu, Dec 28, 2017 at 6:15 AM, 291699763 <291699...@qq.com> wrote: > Lucene version:6.6.0 > > when Index > document.add(ne
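
To see why the comparison must be < rather than <=, here is a small stand-alone sketch. It does not use Lucene; the Leaf class below is a hypothetical stand-in that keeps only a docBase and a maxDoc. A leaf covers global doc IDs from docBase (inclusive) up to docBase + maxDoc (exclusive), so the ID equal to docBase + maxDoc already belongs to the next leaf.

```java
import java.util.Arrays;
import java.util.List;

public class LeafLookup {
    // Hypothetical stand-in for a segment: first global doc ID and doc count.
    static final class Leaf {
        final int docBase;
        final int maxDoc;

        Leaf(int docBase, int maxDoc) {
            this.docBase = docBase;
            this.maxDoc = maxDoc;
        }
    }

    // Returns the index of the leaf containing globalDoc, or -1 if none does.
    static int leafIndexFor(List<Leaf> leaves, int globalDoc) {
        for (int i = 0; i < leaves.size(); i++) {
            Leaf leaf = leaves.get(i);
            // Upper bound is exclusive: '<', not '<='.
            if (globalDoc >= leaf.docBase && globalDoc < leaf.docBase + leaf.maxDoc) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Two leaves: global docs 0..2 and 3..4.
        List<Leaf> leaves = Arrays.asList(new Leaf(0, 3), new Leaf(3, 2));
        System.out.println(leafIndexFor(leaves, 2));
        System.out.println(leafIndexFor(leaves, 3));
    }
}
```

With <= in place of <, global doc 3 would be matched by the first leaf, even though it is the first document of the second one.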

Re: CompiledAutomaton performance issue

2017-12-17 Thread Michael McCandless
This is just an optimization; maybe we should expose an option to disable it? Or maybe we can find the common suffix on an NFA instead, to avoid determinization? Can you open a Jira issue so we can discuss options? Thanks, Mike McCandless http://blog.mikemccandless.com On Fri, Dec 15, 2017 at

Re: Optimize FTS memory footprint

2017-12-12 Thread Michael McCandless
Try upgrading Elasticsearch -- it's up to 6.0, released just a few weeks ago -- its (and Lucene's) memory usage has decreased over time. The _uid field in particular will always be costly, unfortunately. Since it's a primary key, every term will be unique, and the term index has to work hard to
