Hi Joel,

I reviewed a few options with my team, and your recommendation is at the top of the list. I believe it will work for our use case.

You mentioned that if this approach worked, you would be willing to share more details on an "optimized self join." I would enjoy hearing more.

Thanks,
Matt

On Fri, Jul 9, 2021 at 9:36 AM Joel Bernstein <[email protected]> wrote:

> Block join is another option. If that works for you from an indexing
> standpoint, it's the most performant query-time join.
>
> If block indexing doesn't work for you, then the optimized self join is
> almost as fast.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Jul 9, 2021 at 11:31 AM Matt Kuiper <[email protected]> wrote:
>
> > Thanks Joel!
> >
> > On my list is to investigate Block Joins and Nested Child docs.
> >
> > https://solr.apache.org/guide/8_8/other-parsers.html#block-join-query-parsers
> >
> > https://solr.apache.org/guide/8_8/indexing-nested-documents.html#indexing-nested-documents
> >
> > However, it looks like you are not suggesting using nested docs, but
> > specifying a type field to differentiate between types of docs and then a
> > join field. Not having to build nested docs prior to updates would be an
> > advantage. And it makes sense that the join field would allow for reliable
> > routing to the appropriate shard for both doc types.
> >
> > I will take a further look and see if this approach will work, and get back
> > if more info is needed on the optimized self join.
> >
> > Thanks again,
> > Matt
> >
> > On Fri, Jul 9, 2021 at 7:01 AM Joel Bernstein <[email protected]> wrote:
> >
> > > Can you solve this problem by adding all documents into the same collection
> > > and performing self joins? You could add a field called rec_type to
> > > differentiate between the records.
> > >
> > > There are two good reasons for wanting to do this.
> > >
> > > 1) This allows you to route by the join key and easily co-locate records.
> > >
> > > 2) There is an optimized self join which is extremely fast that you could
> > > take advantage of if you did this.
> > >
> > > Let me know if this might be an option for you and we can discuss the
> > > optimized self join in more detail.
> > >
> > > Joel
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Fri, Jul 2, 2021 at 6:28 PM Matt Kuiper <[email protected]> wrote:
> > >
> > > > After some research, it appears the following approach may help in this
> > > > situation and relieve the requirement of co-locating indexes for Joins.
> > > > It appears one drawback may be the types of fields supported for the
> > > > JOIN field.
> > > >
> > > > https://solr.apache.org/guide/8_8/other-parsers.html#cross-collection-join
> > > >
> > > > Matt
> > > >
> > > > On Wed, Jun 30, 2021 at 11:59 AM Matt Kuiper <[email protected]> wrote:
> > > >
> > > > > Hi Solr Group,
> > > > >
> > > > > I am not sure the following is a viable use case, and I welcome input and
> > > > > any implementation recommendations.
> > > > >
> > > > > I would like to perform joins over two sharded collections, where docs
> > > > > are routed to specific shards based on a date range and the ranges are
> > > > > the same for the shards in each collection.
> > > > >
> > > > > I understand that this means that the replicas from each collection that
> > > > > hold data to be joined need to be co-located on the same Solr Server. I
> > > > > have read solutions that use ADD REPLICA to add a Collection B replica to
> > > > > all SolrServers, assuming Collection B has only one shard. For my use
> > > > > case I need Collection B to have multiple shards.
> > > > >
> > > > > Collection A    Collection B    SolrServer
> > > > > Shard1_2020     Shard1_2020     172.33.0.1:8983_solr
> > > > > Shard2_2021     Shard2_2021     172.33.0.2:8983_solr
> > > > > Shard3_2022     Shard3_2022     172.33.0.3:8983_solr
> > > > >
> > > > > I think my question comes down to how do I break shards by a date range,
> > > > > and do it in a way that both Collections A and B would be defined by the
> > > > > same date range? If I could reliably break shards by date, and know the
> > > > > date range of the shard, I think I could use the ADD REPLICA API to align.
> > > > >
> > > > > I am not sure a compositeId routing approach would work, but I think an
> > > > > implicit id may be hard to manage over time.
> > > > >
> > > > > Is an approach like this viable? I am a bit concerned about maintenance.
> > > > > Are there other ideas to support this join?
> > > > >
> > > > > Note: I am considering this within Time series collections...
> > > > >
> > > > > Matt
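
For readers following the thread, the single-collection layout Joel describes (a rec_type field plus routing by the join key) can be sketched roughly as below. This is only an illustration under stated assumptions: the field name join_key, the helper names, and the sample values are hypothetical; the id format relies on Solr's compositeId router, which routes on the part of the id before "!"; and the query uses the ordinary {!join} parser, not necessarily the "optimized self join" mentioned above, whose details are not given in this thread.

```python
from urllib.parse import urlencode

JOIN_FIELD = "join_key"  # hypothetical field name holding the join key
TYPE_FIELD = "rec_type"  # field name suggested in the thread


def make_doc(join_key, doc_id, rec_type, **fields):
    """Build a document whose id uses the join key as a compositeId prefix.

    With the compositeId router, ids of the form "prefix!rest" are routed
    by the prefix, so records sharing a join key land on the same shard.
    """
    doc = {
        "id": f"{join_key}!{doc_id}",
        JOIN_FIELD: join_key,
        TYPE_FIELD: rec_type,
    }
    doc.update(fields)
    return doc


def self_join_params(from_type, from_query, to_type):
    """Build query params for a join within one collection.

    Returns docs of to_type whose join key matches a doc of from_type
    that satisfies from_query. The same field is used for from and to.
    """
    join = (
        f"{{!join from={JOIN_FIELD} to={JOIN_FIELD}}}"
        f"+{TYPE_FIELD}:{from_type} +({from_query})"
    )
    return {"q": join, "fq": f"{TYPE_FIELD}:{to_type}"}


if __name__ == "__main__":
    # Hypothetical sample records: one "parent" and one "child" sharing a key.
    print(make_doc("order42", "header", "parent"))
    print(make_doc("order42", "line1", "child", status="active"))
    # URL-encoded parameters that could be appended to a /select request.
    print(urlencode(self_join_params("child", "status:active", "parent")))
```

Because both record types share the same id prefix, no ADD REPLICA alignment between two collections would be needed in this layout; whether it fits a date-range sharding scheme is a separate question that the thread leaves open.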
