Re: Hunspell performance

2021-02-10 Thread Dawid Weiss
I didn't mean for Peter to write both backends but perhaps, if he's experimenting already anyway, make it possible to extract an interface which could be substituted externally with different implementations. Makes it easier to tinker with various options, even for us. D. On Thu, Feb 11, 2021 at

Re: Hunspell performance

2021-02-10 Thread Robert Muir
On Wed, Feb 10, 2021 at 3:05 PM Dawid Weiss wrote: > Maybe the "backend" could be configurable somehow so that you could change > the strategy depending on your needs?... I haven't looked at how FSTs are > used but if it can be hidden behind a facade then an alternative implementation > could be

Re: Trouble with building PyLucene on Mac

2021-02-10 Thread Andi Vajda
Hi Clem, Lots of replies inline... On Wed, 10 Feb 2021, Wang, Clem wrote: (My msg was originally posted here: https://issues.apache.org/jira/projects/PYLUCENE/issues/PYLUCENE-10 but Andreas Vajda said I should send to the mailing list. I missed whatever he had posted to the mailing list

Re: Help needed with fixing lucene-site GitHub repo

2021-02-10 Thread Anshum Gupta
This has been resolved. Thanks to everyone who helped :) On Wed, Feb 10, 2021 at 12:36 PM Anshum Gupta wrote: > Can you elaborate more around this? I was also trying to see if I could > just create a PR to merge production -> master, but that would just mess > up the history. It will bring

Re: Hunspell performance

2021-02-10 Thread Gus Heck
+1 to configurability that is well documented, and reasonably actionable downstream in Solr... Some folks struggle with the costs of buying machines with lots of memory. On Wed, Feb 10, 2021 at 3:05 PM Dawid Weiss wrote: > > >> To me the challenge with such a change is just trying to prevent >

Re: Help needed with fixing lucene-site GitHub repo

2021-02-10 Thread Anshum Gupta
Can you elaborate more around this? I was also trying to see if I could just create a PR to merge production -> master, but that would just mess up the history. It will bring the code in sync but I'm also not sure if that would fix the larger problem. On Wed, Feb 10, 2021 at 12:01 PM Michael

Re: Hunspell performance

2021-02-10 Thread Dawid Weiss
> To me the challenge with such a change is just trying to prevent strange dictionaries from blowing up to 30x the space :) > Maybe the "backend" could be configurable somehow so that you could change the strategy depending on your needs?... I haven't looked at how FSTs are used but if it can be

Re: Help needed with fixing lucene-site GitHub repo

2021-02-10 Thread Michael Sokolov
Have you considered using a merge commit for this? That won't require force pushing. On Wed, Feb 10, 2021 at 2:51 PM Anshum Gupta wrote: > > Hi All, > > Seems like during the last release, we directly committed the website changes > to the production branch, bypassing the master. This is now

Help needed with fixing lucene-site GitHub repo

2021-02-10 Thread Anshum Gupta
Hi All, Seems like during the last release, we directly committed the website changes to the production branch, bypassing the master. This is now causing issues with merging updates from master into prod using the simple 'create PR' -> 'merge master to prod' workflow. I was working with

Re: 8.8.1 release soon

2021-02-10 Thread Anshum Gupta
Thanks for taking care of this, Tim. I've added a note to the 'downloads' page so folks who head there know about this issue and that a release with the fix is in the works. (thanks for reviewing that too :) ) -Anshum On Wed, Feb 10, 2021 at 7:37 AM Timothy Potter wrote: > I was a tad bit

Re: Hunspell performance

2021-02-10 Thread Robert Muir
50% speedup for the HunspellStemmer use case? for 3x the memory space? Just my opinion: Seems like the correct tradeoff to me. Analysis chain is a serious bottleneck for indexing speed: this hunspell is one of the slower ones. To me the challenge with such a change is just trying to prevent

Re: Hunspell performance

2021-02-10 Thread Peter Gromov
I was hoping for some numbers :) In the meantime, I've got some of my own. I loaded 90 dictionaries from https://github.com/wooorm/dictionaries (there's more, but I ignored dialects of the same base language). Together they currently consume a humble 166MB. With one of my less memory-hungry

Re: 8.8.1 release soon

2021-02-10 Thread Tomás Fernández Löbbe
I'd like to get SOLR-15114 in. It already has a patch that I'm testing, I'll try to merge it today. On Wed, Feb 10, 2021 at 8:23 AM Timothy Potter wrote: > Hi Ishan, > > Please let me know how SOLR-15138 is looking on Friday and we can make a >

Re: Hunspell performance

2021-02-10 Thread Peter Gromov
> > at the price of not being able to enumerate all of node's outgoing arcs. > So FSTEnum isn't possible there? Too bad, I need it for suggestions.

Re: 8.8.1 release soon

2021-02-10 Thread Timothy Potter
Hi Ishan, Please let me know how SOLR-15138 is looking on Friday and we can make a decision then. My hope is for 8.8.1 sooner rather than later, but a couple more days seems fine too. Cheers, Tim On Wed, Feb 10, 2021 at 8:55 AM Ishan Chattopadhyaya < ichattopadhy...@gmail.com> wrote: > I'd like for

Re: Hunspell performance

2021-02-10 Thread Robert Muir
Peter, looks like you are way ahead of me :) Thanks for all the work you have been doing here, and thanks to Dawid for helping! You probably know a lot of this code better than me at this point, but I remember a couple of these pain points, inline below: On Wed, Feb 10, 2021 at 9:44 AM Peter

Re: Hunspell performance

2021-02-10 Thread Dawid Weiss
> They just seem to need reading/analyzing too many bytes, doing much more work than a typical hashmap access :) This is a very tough score to beat... Pretty much any trie structure will have to descend somehow. FSTs are additionally densely packed in Lucene and outgoing arc lookup is what's

Re: 8.8.1 release soon

2021-02-10 Thread Ishan Chattopadhyaya
I'd like for us to include SOLR-15138 please, but the fix is still under review and development. Please let us know if it should be possible for us to wait until that one is done (hopefully quickly), otherwise we can release it later (if you want to proceed with the release before this is ready).

8.8.1 release soon

2021-02-10 Thread Timothy Potter
I was a tad bit ambitious with backporting SOLR-12182 to 8.8.0 and it seems we have no automated SolrJ back-compat tests in our RC vetting process, so unfortunately older SolrJ clients don't work with Solr 8.8 server, see SOLR-15145. I'd like to release 8.8.1 ASAP to address this problem and will

Re: Hunspell performance

2021-02-10 Thread Peter Gromov
Hi Robert, Yes, having multiple dictionaries in the same process would increase the memory significantly. Do you have any idea about how many of them people are loading, and how much memory they give to Lucene? Yes, I've mentioned I've prototyped "using FST in a smarter way" :) Namely, it's

Re: Hunspell performance

2021-02-10 Thread Robert Muir
Just throwing out another random idea: if you are doing a lot of FST traversals (e.g. for inexact matching or decomposition), you may end up "hammering" the root arcs of the FST heavily, depending on how the algorithm works. Because root arcs are "busy", they end up being O(logN) lookups in the

Re: Hunspell performance

2021-02-10 Thread Robert Muir
The RAM usage used to be bad, as you describe; it blows up way worse for other languages than German. There were many issues :) For Lucene, one common issue was that users wanted to have a lot of these things in RAM: e.g. supporting many different languages on a single server (multilingual data)

Hunspell performance

2021-02-10 Thread Peter Gromov
Hi there, I'm mostly done with supporting major Hunspell features necessary for most European languages (https://issues.apache.org/jira/browse/LUCENE-9687) (but of course I anticipate more minor fixes to come). Thanks Dawid Weiss for thorough reviews and promptly accepting my PRs so far! Now I'd

Re: Seeking an adventurous individual that has decent SolrCloud experience and works well independently and as part of a team.

2021-02-10 Thread Mark Miller
Thanks to those that volunteered for this. We are almost ready for kick off. I’ve just got a couple more tests to run and a few help docs to finish. Also, thanks David Smiley for helping to recruit. Sorry bout that outburst a while back, I literally woke up at 4 am or whenever on the couch, took