Sadly, the join performance is poor.
The joined collection has 12M documents, and a query takes about 6,000 ms
with the join versus 60 ms with the denormalized field.

The timing also does not change when the filter on the joined
collection changes: it is still 6,000 ms whether the matching subset is
12M documents or a single one. So the join cost appears to depend on the
size of the joined collection, not on how selective the filter is.
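
For reference, here is a minimal sketch of how a cross-collection join
filter of this kind can be built and sent as an HTTP parameter. The
collection names (main, joined), the field names (id, ref_id, longs) and
the host URL are hypothetical placeholders, not the real schema; only the
{!join from=... to=... fromIndex=...} local-params syntax comes from Solr.

```python
from urllib.parse import urlencode


def join_fq(from_field, to_field, from_index, inner_query):
    """Build a Solr {!join} filter query string.

    inner_query runs against from_index; documents in the queried
    collection match when their to_field equals a from_field value
    of a document selected by inner_query.
    """
    return (
        f"{{!join from={from_field} to={to_field} "
        f"fromIndex={from_index}}}{inner_query}"
    )


def select_url(base_url, collection, q, fq):
    """Return a /select URL with q and fq URL-encoded."""
    return f"{base_url}/{collection}/select?{urlencode({'q': q, 'fq': fq})}"


# Hypothetical names: 'joined' holds (id, longs), 'main' refers to it via ref_id.
fq = join_fq("id", "ref_id", "joined", "longs:42")
url = select_url("http://localhost:8983/solr", "main", "*:*", fq)
print(fq)
print(url)
```

Changing only the inner query (here longs:42) while keeping the rest of
the request fixed is one way to reproduce the comparison above between a
filter matching everything and a filter matching a single document.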

I will explore streaming expressions next.
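
As a starting point, a streaming expression is passed as the expr HTTP
parameter to a collection's /stream handler, so it can be tested with a
plain HTTP client. The sketch below only builds the request URL; the
collection names, field names and host are hypothetical placeholders. It
assumes the join fields have docValues (needed by the /export handler)
and that both inner streams are sorted on their join key, which
innerJoin requires.

```python
from urllib.parse import urlencode


def search_expr(collection, q, fl, sort):
    """Build a search() stream source that reads from the /export handler."""
    return (
        f'search({collection}, q="{q}", fl="{fl}", '
        f'sort="{sort}", qt="/export")'
    )


def inner_join_expr(left, right, on):
    """Wrap two stream sources in an innerJoin() decorator."""
    return f'innerJoin({left}, {right}, on="{on}")'


def stream_url(base_url, collection, expr):
    """Return a /stream URL carrying the expression in the expr parameter."""
    return f"{base_url}/{collection}/stream?{urlencode({'expr': expr})}"


# Hypothetical names: 'main' documents point at 'joined' documents via ref_id.
left = search_expr("main", "*:*", "ref_id", "ref_id asc")
right = search_expr("joined", "longs:42", "id", "id asc")
expr = inner_join_expr(left, right, "ref_id=id")
url = stream_url("http://localhost:8983/solr", "main", expr)
print(expr)
print(url)
```

Whether this performs better than the {!join} query parser on a 12M
document collection is not something the sketch can answer; it only
shows the shape of the request to experiment with.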

On Wed, Oct 16, 2019 at 08:00:43AM +0200, Nicolas Paris wrote:
> > You can certainly replicate the joined collection to every shard. It
> > must fit in one shard and a replica of that shard must be co-located
> > with every replica of the “to” collection.
> 
> Yes, I found this in the documentation, with a clear example, just after
> sending this mail. I will test it today. I also read your blog post about
> join performance[1], and I suspect the performance impact of joins will be
> huge because the joined collection has about 10M documents (only two
> fields: a unique id and an array of longs, with a filter applied to the
> array; the join key is the 10M unique ids).
> 
> > Have you looked at streaming and “streaming expressions”? It does not
> > have the same problem, although it does have its own limitations.
> 
> I have never tested them, and I am not yet comfortable with how to test
> them. Is it possible to mix query parsers and streaming expressions in
> the client call via HTTP parameters, or can streaming expressions only
> be applied programmatically?
> 
> [1] https://lucidworks.com/post/solr-and-joins/
> 
> On Tue, Oct 15, 2019 at 07:12:25PM -0400, Erick Erickson wrote:
> > You can certainly replicate the joined collection to every shard. It must 
> > fit in one shard and a replica of that shard must be co-located with every 
> > replica of the “to” collection.
> > 
> > Have you looked at streaming and “streaming expressions”? It does not have 
> > the same problem, although it does have its own limitations.
> > 
> > Best,
> > Erick
> > 
> > > On Oct 15, 2019, at 6:58 PM, Nicolas Paris <nicolas.pa...@riseup.net> 
> > > wrote:
> > > 
> > > Hi
> > > 
> > > I have several large collections that cannot fit in a standalone solr
> > > instance. They are split over multiple shards in solr-cloud mode.
> > > 
> > > Those collections are supposed to be joined to another collection to
> > > retrieve a subset. Because I am using distributed collections, I am not
> > > able to use the Solr join feature.
> > > 
> > > For this reason, I denormalize the information by copying the joined
> > > collection's data into every collection. Naturally, when I want to
> > > update the joined collection, I have to update every one of the
> > > distributed collections.
> > > 
> > > In standalone mode, I would only have to update the joined collection.
> > > 
> > > I wonder if there is a way to overcome this limitation, for example by
> > > replicating the joined collection to every shard, or by some other
> > > method I am not aware of.
> > > 
> > > Any thoughts?
> > > -- 
> > > nicolas
> > 
> 
> -- 
> nicolas
> 

-- 
nicolas
