Re: Aligning Shards from different Collections on the same Solr server based on Date Range

Matt Kuiper Mon, 12 Jul 2021 10:28:24 -0700

Hi Joel,

I reviewed a few options with my team, and your recommendation is at the
top of the list.  I believe it will work for our use case.


You mentioned that if this approach worked, you would be willing to share
more details on an "optimized self join."

I would enjoy hearing more.

Thanks,
Matt

On Fri, Jul 9, 2021 at 9:36 AM Joel Bernstein <[email protected]> wrote:

> Block join is another option. If that works for you, from an indexing
> standpoint, it's the most performant query time join.
>
> If block indexing doesn't work for you then the optimized self join is
> almost as fast.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Fri, Jul 9, 2021 at 11:31 AM Matt Kuiper <[email protected]> wrote:
>
> > Thanks Joel!
> >
> > On my list is to investigate Block Joins and Nested Child docs.
> >
> >
> > https://solr.apache.org/guide/8_8/other-parsers.html#block-join-query-parsers
> >
> > https://solr.apache.org/guide/8_8/indexing-nested-documents.html#indexing-nested-documents
> >
> > However, it looks like you are not suggesting using nested docs, but
> > specifying a type field to differentiate between types of docs and then a
> > join field.  Not having to build nested docs prior to updates would be an
> > advantage.  And it makes sense that the join field would allow for
> > reliable routing to the appropriate shard for both doc types.
> >
> > I will take a further look and see if this approach will work, and get
> > back if more info is needed on the optimized self join.
> >
> > Thanks again,
> > Matt
> >
> >
> > On Fri, Jul 9, 2021 at 7:01 AM Joel Bernstein <[email protected]> wrote:
> >
> > > Can you solve this problem by adding all documents into the same
> > > collection and performing self joins? You could add a field called
> > > rec_type to differentiate between the records.
> > >
> > > There are two good reasons for wanting to do this.
> > >
> > > 1) This allows you to route by the join key and easily co-locate
> > > records.
> > >
> > > 2) There is an optimized self join which is extremely fast that you
> > > could take advantage of if you did this.
> > >
> > > Let me know if this might be an option for you and we can discuss the
> > > optimized self join in more detail.
> > >
> > > Joel
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > >
> > > On Fri, Jul 2, 2021 at 6:28 PM Matt Kuiper <[email protected]> wrote:
> > >
> > > > After some research, it appears the following approach may help in
> > > > this situation and relieve the requirement of collocating indexes for
> > > > Joins. It appears one drawback may be the types of fields supported
> > > > for the JOIN field.
> > > >
> > > > https://solr.apache.org/guide/8_8/other-parsers.html#cross-collection-join
> > > >
> > > > Matt
> > > >
> > > > On Wed, Jun 30, 2021 at 11:59 AM Matt Kuiper <[email protected]> wrote:
> > > >
> > > > > Hi Solr Group,
> > > > >
> > > > > I am not sure the following is a viable use case, so I welcome input
> > > > > and any implementation recommendations.
> > > > >
> > > > > I would like to perform joins over two sharded collections, where
> > > > > docs are routed to specific shards based on a date range, and the
> > > > > date ranges are the same for the shards in each collection.
> > > > >
> > > > > I understand that this means that the replicas from each collection
> > > > > that hold data to be joined need to be co-located on the same Solr
> > > > > server.  I have read solutions that use ADD REPLICA to add a
> > > > > Collection B replica to all Solr servers, assuming Collection B has
> > > > > only one shard.  For my use case I need Collection B to have
> > > > > multiple shards.
> > > > >
> > > > > Collection A     Collection B     Solr server
> > > > > Shard1_2020      Shard1_2020      172.33.0.1:8983_solr
> > > > > Shard2_2021      Shard2_2021      172.33.0.2:8983_solr
> > > > > Shard3_2022      Shard3_2022      172.33.0.3:8983_solr
> > > > >
> > > > > I think my question comes down to this: how do I break shards by a
> > > > > date range, and do it in a way that both Collections A and B would be
> > > > > defined by the same date range?  If I could reliably break shards by
> > > > > date, and know the date range of each shard, I think I could use the
> > > > > ADD REPLICA API to align them.
> > > > >
> > > > > I am not sure a compositeId routing approach would work, but I think
> > > > > an implicit id may be hard to manage over time.
> > > > >
> > > > > Is an approach like this viable?  I am a bit concerned about
> > > > > maintenance.  Are there other ideas to support this join?
> > > > >
> > > > > Note: I am considering this within Time series collections...
> > > > >
> > > > > Matt
> > > > >
> > > >
> > >
> >
>
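
A minimal sketch of the single-collection approach Joel describes in the
thread: every document carries a rec_type field and a join key, and the join
key is used as the compositeId routing prefix (the "key!id" form) so that
related records land on the same shard. The field name join_key_s, the
record type names, and the helper functions are hypothetical and chosen only
for illustration. The query uses Solr's standard {!join} parser; the thread
does not spell out the "optimized self join," so it is not shown here.

```python
def routed_id(join_key: str, doc_id: str) -> str:
    """Build a compositeId-style id: the part before '!' is the routing key."""
    return f"{join_key}!{doc_id}"


def make_doc(rec_type: str, join_key: str, doc_id: str, **fields) -> dict:
    """Build a document dict carrying the record type and the join key.

    'join_key_s' is an assumed field name (a string field), not one taken
    from the thread.
    """
    doc = {
        "id": routed_id(join_key, doc_id),
        "rec_type": rec_type,
        "join_key_s": join_key,
    }
    doc.update(fields)
    return doc


def self_join_params(from_type: str, to_type: str,
                     join_field: str = "join_key_s") -> dict:
    """Query params for a self join inside one collection.

    The join collects join_field values from documents of from_type and
    returns documents sharing those values; the filter query then keeps
    only documents of to_type.
    """
    return {
        "q": f"{{!join from={join_field} to={join_field}}}rec_type:{from_type}",
        "fq": f"rec_type:{to_type}",
    }


if __name__ == "__main__":
    # Two record types sharing the join key "k1" (hypothetical data).
    print(make_doc("typeA", "k1", "a-1"))
    print(make_doc("typeB", "k1", "b-7"))
    print(self_join_params("typeA", "typeB"))
```

Because both record types share the same routing prefix, the join can be
resolved shard-locally, which is the co-location benefit noted in point 1 of
Joel's message.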
