Re: Aligning Shards from different Collections on the same Solr server based on Date Range

Joel Bernstein Tue, 13 Jul 2021 08:14:23 -0700

The optimized join was added in Solr 8.8:
https://issues.apache.org/jira/browse/SOLR-15049


It kicks in when you use the join qparser plugin in the following scenario:

1) Do not specify a fromIndex, because the to and from indexes are the
same (it is a self join).
2) The to and from fields are the same.
3) The join method is topLevelDV.

{!join to=store_id from=store_id method=topLevelDV}

If you do this with Solr 8.8+ you get the effect of SOLR-15049. It is a
massive performance improvement. In my testing it was 7000 times faster
than the standard join parser plugin for larger joins.
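
As a rough sketch of how the three conditions above translate into a
request, the Python below builds the query parameters for a self join. The
field store_id comes from the example above and rec_type from earlier in
this thread; the values "store", "sale", "region:west", the host, and the
collection name are hypothetical placeholders. As far as I know, topLevelDV
also needs docValues on the join field, so check your schema before relying
on this.

```python
from urllib.parse import urlencode


def self_join_query(join_field: str, from_query: str) -> str:
    """Return a join query that meets the three conditions listed above."""
    # No fromIndex, identical to/from fields, and method=topLevelDV.
    local_params = f"{{!join to={join_field} from={join_field} method=topLevelDV}}"
    return local_params + from_query


def select_params(join_field: str, from_query: str, to_filter: str) -> dict:
    """Build /select parameters: q runs the join, fq narrows the result side."""
    return {
        "q": self_join_query(join_field, from_query),
        "fq": to_filter,
    }


if __name__ == "__main__":
    # Hypothetical values: find "sale" records whose store is in the west region.
    params = select_params(
        "store_id", "rec_type:store AND region:west", "rec_type:sale"
    )
    print(params["q"])
    # Hypothetical host and collection name.
    print("http://localhost:8983/solr/mycollection/select?" + urlencode(params))
```

The fq is there because a self join also returns the "from" documents
themselves (they share the same store_id), so filtering on rec_type keeps
only the side you want back.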


Joel Bernstein
http://joelsolr.blogspot.com/


On Mon, Jul 12, 2021 at 1:34 PM Matt Kuiper <[email protected]> wrote:

> Hi Joel,
>
> I reviewed a few options with my team, and your recommendation is at the
> top of the list.  I believe it will work for our use case.
>
> You mentioned that if this approach worked, you would be willing to share
> more details on an "optimized self join."
>
> I would enjoy hearing more.
>
> Thanks,
> Matt
>
> On Fri, Jul 9, 2021 at 9:36 AM Joel Bernstein <[email protected]> wrote:
>
> > Block join is another option. If that works for you, from an indexing
> > standpoint, it's the most performant query time join.
> >
> > If block indexing doesn't work for you then the optimized self join is
> > almost as fast.
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> >
> > On Fri, Jul 9, 2021 at 11:31 AM Matt Kuiper <[email protected]> wrote:
> >
> > > Thanks Joel!
> > >
> > > On my list is to investigate Block Joins and Nested Child docs.
> > >
> > > https://solr.apache.org/guide/8_8/other-parsers.html#block-join-query-parsers
> > >
> > > https://solr.apache.org/guide/8_8/indexing-nested-documents.html#indexing-nested-documents
> > >
> > > However, it looks like you are not suggesting using nested docs, but
> > > specifying a type field to differentiate between types of docs and then
> > > a join field.  Not having to build nested docs prior to updates would be
> > > an advantage.  And it makes sense that the join field would allow for
> > > reliable routing to the appropriate shard for both doc types.
> > >
> > > I will take a further look and see if this approach will work, and get
> > > back if more info is needed on the optimized self join.
> > >
> > > Thanks again,
> > > Matt
> > >
> > >
> > > On Fri, Jul 9, 2021 at 7:01 AM Joel Bernstein <[email protected]> wrote:
> > >
> > > > Can you solve this problem by adding all documents into the same
> > > > collection and performing self joins? You could add a field called
> > > > rec_type to differentiate between the records.
> > > >
> > > > There are two good reasons for wanting to do this.
> > > >
> > > > 1) This allows you to route by the join key and easily co-locate
> > > > records.
> > > >
> > > > 2) There is an optimized self join which is extremely fast that you
> > > > could take advantage of if you did this.
> > > >
> > > > Let me know if this might be an option for you and we can discuss the
> > > > optimized self join in more detail.
> > > >
> > > > Joel
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Joel Bernstein
> > > > http://joelsolr.blogspot.com/
> > > >
> > > >
> > > > On Fri, Jul 2, 2021 at 6:28 PM Matt Kuiper <[email protected]> wrote:
> > > >
> > > > > After some research, it appears the following approach may help in
> > > > > this situation and relieve the requirement of co-locating indexes for
> > > > > joins. It appears one drawback may be the types of fields supported
> > > > > for the JOIN field.
> > > > >
> > > > > https://solr.apache.org/guide/8_8/other-parsers.html#cross-collection-join
> > > > >
> > > > > Matt
> > > > >
> > > > > On Wed, Jun 30, 2021 at 11:59 AM Matt Kuiper <[email protected]> wrote:
> > > > >
> > > > > > Hi Solr Group,
> > > > > >
> > > > > > I am not sure the following is a viable use case, and I welcome
> > > > > > input and any implementation recommendations.
> > > > > >
> > > > > > I would like to perform joins over two sharded collections, where
> > > > > > docs are routed to specific shards based on a date range and the
> > > > > > date ranges are the same for shards in each collection.
> > > > > >
> > > > > > I understand that this means that the replicas from each collection
> > > > > > that hold data to be joined need to be co-located on the same Solr
> > > > > > server.  I have read solutions that use ADD REPLICA to add a
> > > > > > Collection B replica to all Solr servers, assuming Collection B has
> > > > > > only one shard.  For my use case I need Collection B to have
> > > > > > multiple shards.
> > > > > >
> > > > > > Collection A    Collection B    SolrServer
> > > > > > Shard1_2020     Shard1_2020     172.33.0.1:8983_solr
> > > > > > Shard2_2021     Shard2_2021     172.33.0.2:8983_solr
> > > > > > Shard3_2022     Shard3_2022     172.33.0.3:8983_solr
> > > > > >
> > > > > > I think my question comes down to how do I break shards by a date
> > > > > > range, and do it in a way that both Collections A and B would be
> > > > > > defined by the same date range?  If I could reliably break shards by
> > > > > > date, and know the date range of the shard, I think I could use the
> > > > > > ADD REPLICA API to align.
> > > > > >
> > > > > > I am not sure a compositeId routing approach would work, but I am
> > > > > > thinking an implicit id may be hard to manage over time.
> > > > > >
> > > > > > Is an approach like this viable? I am a bit concerned about
> > > > > > maintenance. Are there other ideas to support this join?
> > > > > >
> > > > > > Note: I am considering this within Time series collections...
> > > > > >
> > > > > > Matt
> > > > > >
> > > > >
> > > >
> > >
> >
>
