Alex,

We did consider optimizing Taxonomy indexing performance, but we never got around to it. The sidecar index is annoying to deal with, and we have had occasional issues with it.

Zulia has sharding implemented. The main issue there is not the taxonomy itself but getting exact counts without returning all facet values. We chose to implement a method similar to Elasticsearch's (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_per_bucket_document_count_error).

For replication we plan to use the native index replication built into Lucene. The framework for routing queries and such is already in place, but the actual copying of the index has not been implemented yet, so I can't speak to that.

Hope this helps some.

Thanks,
Matt

On Wed, Apr 28, 2021 at 5:48 PM Alexander Lukyanchikov <alexanderlukyanchi...@gmail.com> wrote:

> Hi Matt,
> It's very interesting, thanks for the response! Did you have any issues
> with Taxonomy indexing performance, or maybe tried to optimize it somehow?
> Also, any problems maintaining a sidecar index or experience building a
> distributed system around it with sharding/rebalancing?
>
> --
> Regards,
> Alex
>
> On Wed, Apr 28, 2021 at 11:18 AM Matt Davis <kryptonics...@gmail.com> wrote:
>
> > Alex,
> >
> > With our Lucene-based implementation of Zulia
> > (https://github.com/zuliaio/zuliasearch) we have gone back and forth. We
> > started with Taxonomy, switched away, and then switched back to Taxonomy.
> > In our experience the Taxonomy-based approach is more scalable and
> > performant. We do large searches (sometimes returning millions of
> > results) with about 20 facets being run, some of them high-cardinality.
> > A small-dataset version of the tool backed by Zulia, which we released
> > for COVID, can be found here:
> > https://icite.od.nih.gov/covid19/search/#search:searchId=6089a5b7218c6902d422e907
> > If you click on the facet tab you can see how we use facets. I believe
> > the use case might largely drive the choice.
> >
> > Thanks,
> > Matt
> >
> > On Wed, Apr 28, 2021 at 1:26 PM Alexander Lukyanchikov <alexanderlukyanchi...@gmail.com> wrote:
> >
> > > Hello everyone,
> > >
> > > We are trying to choose between the Taxonomy and
> > > SortedSetDocValuesFacetField implementations for faceted search, and
> > > based on available information and our quick tests, the differences
> > > are the following:
> > >
> > > - Taxonomy is faster at query time (on our test workload, the
> > >   difference is sometimes higher than the documented 25%). Also,
> > >   SortedSet adds latency to an NRT refresh.
> > > - Taxonomy is slower at index time and, unlike the SortedSet
> > >   implementation, it does not scale as well with more than 4 threads
> > >   (a lot of contention at the DirectoryTaxonomyWriter#addCategory()
> > >   and UTF8TaxonomyWriterCache.get() synchronized blocks)
> > > - SortedSet does not support hierarchical queries
> > > - SortedSet does not require a sidecar index
> > > - Tie-break differences for labels with the same count
> > >
> > > Am I missing something, or is that everything we should take into
> > > account as of today?
> > >
> > > I know that Solr and ES use their own faceting for historical reasons,
> > > but are there any other large Lucene-based products that have chosen
> > > one implementation over the other? Do we know why?
> > > Any insight on lesser-known trade-offs and production experience is
> > > greatly appreciated!
> > >
> > > --
> > > Thank you,
> > > Alex
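
For readers following the sharding point in the reply at the top of this thread: the per-bucket count error idea described in the linked Elasticsearch documentation can be sketched roughly as below. This is only an illustration under stated assumptions, not Zulia's or Elasticsearch's actual code. The class name ShardFacetMerge, the merge method, and the shardSize parameter are hypothetical. Each shard is assumed to report only its top shardSize labels, so a label missing from a shard's list could still have up to that shard's smallest reported count there.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Lucene, Zulia, or Elasticsearch. */
final class ShardFacetMerge {

    /**
     * Merges per-shard top-N facet counts.
     *
     * @param shardTopN one map per shard: label -> count, already truncated to shardSize entries
     * @param shardSize how many labels each shard was asked to return (assumed parameter)
     * @return label -> {summed count, upper bound on the count that may be missing}
     */
    static Map<String, long[]> merge(List<Map<String, Long>> shardTopN, int shardSize) {
        Map<String, Long> counts = new HashMap<>();
        // For each label, the sum of cutoffs of the shards that DID report it.
        Map<String, Long> reportedCutoffs = new HashMap<>();
        long totalCutoff = 0;

        for (Map<String, Long> shard : shardTopN) {
            // A shard that returned fewer than shardSize labels returned everything it had,
            // so a label absent from it has a true count of 0 on that shard.
            long cutoff = (shard.isEmpty() || shard.size() < shardSize)
                    ? 0L
                    : Collections.min(shard.values());
            totalCutoff += cutoff;
            for (Map.Entry<String, Long> e : shard.entrySet()) {
                counts.merge(e.getKey(), e.getValue(), Long::sum);
                reportedCutoffs.merge(e.getKey(), cutoff, Long::sum);
            }
        }

        Map<String, long[]> merged = new HashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            // Error bound: sum of cutoffs from shards that did NOT report this label.
            long maxError = totalCutoff - reportedCutoffs.get(e.getKey());
            merged.put(e.getKey(), new long[] {e.getValue(), maxError});
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> shards = List.of(
                Map.of("a", 10L, "b", 5L),
                Map.of("a", 8L, "c", 6L),
                Map.of("b", 3L));
        merge(shards, 2).forEach((label, v) ->
                System.out.println(label + " count=" + v[0] + " maxError=" + v[1]));
    }
}
```

A label whose error bound is 0 has an exact count. A nonzero bound signals that asking the shards for more labels (a larger shardSize) could change the result, which is the trade-off against returning all facet values.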