Jan,
Thanks a lot for the response.
My application's indexer generates the id from the raw data plus another
metadata field that ties that piece of data to its origin.
This way I can rely on the unique key to ensure uniqueness per origin per
row (at least that's what I did before I migrated to TRA).
Now, with the rules for routed aliases, I have to make sure myself that a
doc wasn't indexed before, which is harder to manage and will certainly
hurt indexing performance.
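A minimal sketch of this kind of id scheme, assuming a SHA-256 hash over the origin and the raw data. `make_doc_id` is a hypothetical name for illustration, not the actual indexer code:

```python
import hashlib


def make_doc_id(raw_data: bytes, origin: str) -> str:
    """Derive a deterministic id from the raw payload and its origin."""
    h = hashlib.sha256()
    origin_bytes = origin.encode("utf-8")
    # Length-prefix the origin so that shifting bytes between the origin
    # and the payload cannot produce the same hash input.
    h.update(len(origin_bytes).to_bytes(4, "big"))
    h.update(origin_bytes)
    h.update(raw_data)
    return h.hexdigest()


# Same data from the same origin always maps to the same id,
# while the same data from another origin gets a different one.
same = make_doc_id(b"row-1", "origin-a") == make_doc_id(b"row-1", "origin-a")
different = make_doc_id(b"row-1", "origin-a") != make_doc_id(b"row-1", "origin-b")
print(same, different)
```

Within a single collection the unique key then turns a re-index of the same row into an overwrite; across the collections of an alias nothing enforces that.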
I really liked your idea of a query-time distinct. I think I can live with
my big TRA having some dups across the collections and "hiding" them at
query time, but I have two questions:
1) How will using the collapse query parser affect query performance? It
sounds to me like it depends on the size of the result set, does it?
2) I tried what you suggested on the very same simplified use case and it
didn't work for me. It seems that collapse doesn't affect the way Solr
calculates the facet counts. Should I add something else? What I did:
http://localhost:8983/solr/test/select?fq=%7B!collapse%20field%3Did%7D&q=*%3A*&facet=on&facet.field=id
{
  "responseHeader": {
    "zkConnected": true,
    "status": 0,
    "QTime": 8,
    "params": {
      "q": "*:*",
      "facet.field": "id",
      "fq": "{!collapse field=id}",
      "facet": "on"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "maxScore": 1,
    "numFoundExact": true,
    "docs": [
      {
        "id": "123",
        "_version_": 1696500688522051600,
        "score": 1
      }
    ]
  },
  "facet_counts": {
    "facet_queries": {},
    "facet_fields": {
      "id": ["123", 2]
    },
    "facet_ranges": {},
    "facet_intervals": {},
    "facet_heatmaps": {}
  }
}
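One possible reading of the response above, written as a toy model rather than Solr's actual code: assuming collapse and faceting run independently on each shard, and the coordinating node then deduplicates documents by id while summing the facet counts, the mismatch falls out naturally. The `shards` contents and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical contents: the same id indexed into both collections.
shards = {
    "test1": [{"id": "123"}],
    "test2": [{"id": "123"}],
}


def shard_response(docs):
    """Collapse on id and facet on id, looking at one shard only."""
    collapsed = {d["id"]: d for d in docs}  # one doc per id within the shard
    facets = Counter(d["id"] for d in collapsed.values())
    return list(collapsed.values()), facets


def merge(responses):
    """Merge shard responses: docs are deduplicated by id, facet counts are summed."""
    merged_docs = {}
    merged_facets = Counter()
    for docs, facets in responses:
        for d in docs:
            merged_docs.setdefault(d["id"], d)
        merged_facets.update(facets)
    return list(merged_docs.values()), merged_facets


docs, facets = merge(shard_response(d) for d in shards.values())
print(len(docs), dict(facets))  # 1 {'123': 2}
```

In this model each shard sees only one doc with id 123, so there is nothing to collapse locally, and the per-shard facet counts of 1 add up to 2.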
.
.
BUT! While trying your idea I thought of another one: use a sub-facet on
the faceted field, with a unique facet function on the same field, like so:
http://localhost:8983/solr/test/select?&q=*%3A*&json.facet={ids:{type:terms,field:id,facet:{unique_count:%22unique(id)%22}}}
and if I add another doc {"id":"abc"} for illustration I get:
{
  "responseHeader": {
    "zkConnected": true,
    "status": 0,
    "QTime": 19,
    "params": {
      "q": "*:*",
      "json.facet": "{ids:{type:terms,field:id,facet:{unique_count:\"unique(id)\"}}}"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "maxScore": 1,
    "numFoundExact": true,
    "docs": [
      {
        "id": "123",
        "_version_": 1696500688522051600
      },
      {
        "id": "abc",
        "_version_": 1696504041626927000
      }
    ]
  },
  "facets": {
    "count": 3,
    "ids": {
      "buckets": [
        {
          "val": "123",
          "count": 2,
          "unique_count": 1
        },
        {
          "val": "abc",
          "count": 1,
          "unique_count": 1
        }
      ]
    }
  }
}
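On the client side, reading this response could look roughly like the sketch below. It assumes the request is built with the Python standard library only. `build_facet_url` and `deduplicated_counts` are hypothetical helper names, and the `facets` dict is copied from the response above.

```python
import json
from urllib.parse import urlencode


def build_facet_url(base_url: str, field: str) -> str:
    """Build a select URL with a terms facet and a nested unique() per bucket."""
    json_facet = {
        "ids": {
            "type": "terms",
            "field": field,
            "facet": {"unique_count": f"unique({field})"},
        }
    }
    params = {"q": "*:*", "json.facet": json.dumps(json_facet)}
    return f"{base_url}/select?{urlencode(params)}"


def deduplicated_counts(facets: dict) -> dict:
    """Map each bucket value to its unique_count, ignoring the raw count."""
    return {b["val"]: b["unique_count"] for b in facets["ids"]["buckets"]}


# The "facets" section of the response shown above.
facets = {
    "count": 3,
    "ids": {
        "buckets": [
            {"val": "123", "count": 2, "unique_count": 1},
            {"val": "abc", "count": 1, "unique_count": 1},
        ]
    },
}

counts = deduplicated_counts(facets)
print(counts, sum(counts.values()))  # {'123': 1, 'abc': 1} 2
```

The raw `count` values still add up to 3, so any consumer has to read `unique_count` consistently instead of `count`.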
I think this can basically solve my issue: I allow dups across the TRA
collections and just "ignore" them with this approach.
WDYT? Am I missing something? How do facet functions, and specifically the
unique facet function, perform, especially when nested?
Looking forward to hearing what you and others think :)
THANKS!
On Thu, Apr 8, 2021 at 15:52, Jan Høydahl <[email protected]> wrote:
> You are right - when you search across multiple collections,
> whether through an alias or explicitly, Solr no longer guarantees the
> uniqueness of IDs for you, as that is only enforced per collection.
> Meaning, you need to enforce ID uniqueness yourself. And if using routed
> aliases, ..."It’s extremely important with all routed aliases that the
> route values NOT change."
>
> So if this is outside your control, the question becomes - are documents
> with the same ID really duplicates that should not be counted twice? Or are
> they distinct docs which happen to have the same ID?
> If they indeed are duplicates, you may attempt to do duplicate removal in
> your query by e.g. adding fq={!collapse field=id} to your query
>
> Jan
>
> > 24. mar. 2021 kl. 18:09 skrev Eran Buchnick <[email protected]>:
> >
> > Hi,
> > I've noticed the following warning in the *aliases documentation*:
> > *"...Reindexing a document with a different route value for the same ID*
> > *produces two distinct documents with the same ID accessible via the*
> > *alias..."*
> > When I tested such a case, only one doc was actually retrieved, but
> > when turning on *facets, they aren't aligned with the result set.*
> >
> > Expected behavior or bug?
> > If expected - how should I avoid dups and implement upserts without the
> > overhead of preliminary queries?
> >
> > My test:
> > 1) create two collections test1 and test2 and alias named test for both
> > 2) index docs with the same id to both of the collections
> > {"id":123}
> > 3) query the alias as follows, with structured explain debug:
> >
> http://localhost:8983/solr/test/select?debug.explain.structured=true&debugQuery=on&facet.field=id&facet=on&q=*%3A*
> > {
> > "responseHeader":{
> > "zkConnected":true,
> > "status":0,
> > "QTime":25,
> > "params":{
> > "q":"*:*",
> > "facet.field":"id",
> > "debug.explain.structured":"true",
> > "facet":"on",
> > "debugQuery":"on",
> > "_":"1616269705741"}},
> >
> > "response":{*"numFound":1*
> > ,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[
> > {
> > "id":"123",
> > "_version_":1694670492462481408}]
> > },
> > "facet_counts":{
> > "facet_queries":{},
> > "facet_fields":{
> > *"id":[*
> > * "123",2*]},
> > "facet_ranges":{},
> > "facet_intervals":{},
> > "facet_heatmaps":{}},
> > "debug":{
> > "track":{
> > "rid":"-31",
> > "EXECUTE_QUERY":{
> > "http://some_ip:8983/solr/test2_shard1_replica_n1/":{
> > "QTime":"3",
> > "ElapsedTime":"10",
> > "RequestPurpose":"GET_TOP_IDS,GET_FACETS,SET_TERM_STATS",
> > "NumFound":"1",
> >
> >
> "Response":"{responseHeader={zkConnected=true,status=0,QTime=3,params={df=_text_,distrib=false,fl=[id,
> > score],shards.purpose=16404,fsv=true,shard.url=
> >
> http://some_ip:8983/solr/test2_shard1_replica_n1/,rid=-31,wt=javabin,_=1616269705741,facet.field=id,f.id.facet.mincount=0,debug=[false
> > ,
> > timing,
> >
> track],start=0,f.id.facet.limit=160,collection=test1,test2,rows=10,debug.explain.structured=true,version=2,q=*:*,omitHeader=false,requestPurpose=GET_TOP_IDS,GET_FACETS,SET_TERM_STATS,NOW=1616270594521,isShard=true,facet=on,debugQuery=false}},response={numFound=1,numFoundExact=true,start=0,maxScore=1.0,docs=[SolrDocument{id=123,
> >
> score=1.0}]},sort_values={},facet_counts={facet_queries={},facet_fields={id={123=1}},facet_ranges={},facet_intervals={},facet_heatmaps={}},debug={facet-debug={elapse=0,sub-facet=[{processor=SimpleFacets,elapse=0,action=field
> > facet,maxThreads=0,sub-facet=[{elapse=0,requestedMethod=not
> >
> specified,appliedMethod=FC,inputDocSetSize=1,field=id,numBuckets=2}]}]},timing={time=2.0,prepare={time=0.0,query={time=0.0},facet={time=0.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=0.0}},process={time=2.0,query={time=0.0},facet={time=1.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=0.0}}}}}"},
> > "http://some_ip:8983/solr/test1_shard1_replica_n1/":{
> > "QTime":"2",
> > "ElapsedTime":"12",
> > "RequestPurpose":"GET_TOP_IDS,GET_FACETS,SET_TERM_STATS",
> > "NumFound":"1",
> >
> >
> "Response":"{responseHeader={zkConnected=true,status=0,QTime=2,params={df=_text_,distrib=false,fl=[id,
> > score],shards.purpose=16404,fsv=true,shard.url=
> >
> http://some_ip:8983/solr/test1_shard1_replica_n1/,rid=-31,wt=javabin,_=1616269705741,facet.field=id,f.id.facet.mincount=0,debug=[false
> > ,
> > timing,
> >
> track],start=0,f.id.facet.limit=160,collection=test1,test2,rows=10,debug.explain.structured=true,version=2,q=*:*,omitHeader=false,requestPurpose=GET_TOP_IDS,GET_FACETS,SET_TERM_STATS,NOW=1616270594521,isShard=true,facet=on,debugQuery=false}},response={numFound=1,numFoundExact=true,start=0,maxScore=1.0,docs=[SolrDocument{id=123,
> >
> score=1.0}]},sort_values={},facet_counts={facet_queries={},facet_fields={id={123=1}},facet_ranges={},facet_intervals={},facet_heatmaps={}},debug={facet-debug={elapse=0,sub-facet=[{processor=SimpleFacets,elapse=0,action=field
> > facet,maxThreads=0,sub-facet=[{elapse=0,requestedMethod=not
> >
> specified,appliedMethod=FC,inputDocSetSize=1,field=id,numBuckets=2}]}]},timing={time=2.0,prepare={time=0.0,query={time=0.0},facet={time=0.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=0.0}},process={time=2.0,query={time=0.0},facet={time=1.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=0.0}}}}}"}},
> > "GET_FIELDS":{
> > "http://some_ip:8983/solr/test2_shard1_replica_n1/":{
> > "QTime":"5",
> > "ElapsedTime":"8",
> > "RequestPurpose":"GET_FIELDS,GET_DEBUG,SET_TERM_STATS",
> > "NumFound":"1",
> >
> >
> "Response":"{responseHeader={zkConnected=true,status=0,QTime=5,params={facet.field=id,df=_text_,distrib=false,debug=[timing,
> > track],shards.purpose=16704,collection=test1,test2,shard.url=
> >
> http://some_ip:8983/solr/test2_shard1_replica_n1/,rows=10,rid=-31,debug.explain.structured=true,version=2,q=*:*,omitHeader=false,requestPurpose=GET_FIELDS,GET_DEBUG,SET_TERM_STATS,NOW=1616270594521,ids=123,isShard=true,facet=false,wt=javabin,debugQuery=true,_=1616269705741
> }
> >
> },response={numFound=1,numFoundExact=true,start=0,docs=[SolrDocument{id=123,
> >
> _version_=1694670492462481408}]},debug={rawquerystring=*:*,querystring=*:*,parsedquery=MatchAllDocsQuery(*:*),parsedquery_toString=*:*,explain={123={match=true,value=1.0,description=*:*}},QParser=LuceneQParser,timing={time=4.0,prepare={time=0.0,query={time=0.0},facet={time=0.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=0.0}},process={time=4.0,query={time=0.0},facet={time=0.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=4.0}}}}}"}}},
> > "facet-debug":{
> > "elapse":0,
> > "sub-facet":[{
> > "processor":"SimpleFacets",
> > "elapse":0,
> > "action":"field facet",
> > "maxThreads":0,
> > "sub-facet":[{
> > "elapse":0,
> > "requestedMethod":"not specified",
> > "appliedMethod":"FC",
> > "inputDocSetSize":1,
> > "field":"id",
> > "numBuckets":2}]}]},
> > "timing":{
> > "time":8.0,
> > "prepare":{
> > "time":0.0,
> > "query":{
> > "time":0.0},
> > "facet":{
> > "time":0.0},
> > "facet_module":{
> > "time":0.0},
> > "mlt":{
> > "time":0.0},
> > "highlight":{
> > "time":0.0},
> > "stats":{
> > "time":0.0},
> > "expand":{
> > "time":0.0},
> > "terms":{
> > "time":0.0},
> > "debug":{
> > "time":0.0}},
> > "process":{
> > "time":8.0,
> > "query":{
> > "time":0.0},
> > "facet":{
> > "time":2.0},
> > "facet_module":{
> > "time":0.0},
> > "mlt":{
> > "time":0.0},
> > "highlight":{
> > "time":0.0},
> > "stats":{
> > "time":0.0},
> > "expand":{
> > "time":0.0},
> > "terms":{
> > "time":0.0},
> > "debug":{
> > "time":4.0}}},
> > "rawquerystring":"*:*",
> > "querystring":"*:*",
> > "parsedquery":"MatchAllDocsQuery(*:*)",
> > "parsedquery_toString":"*:*",
> > "QParser":"LuceneQParser",
> > "explain":{
> > "123":{
> > "match":true,
> > "value":1.0,
> > "description":"*:*"}}}}
> >
> > Thanks.
>
>
--
*BR,*
*Eran Buchnick*