Re: different results in numFound vs using the cursor
> : I am going to adjust my schema, re-index, and try again. See if that
> : doesn't fix this problem. I didn't know that having the uniqueKey be a
> : textField was a bad idea.
>
> https://lucene.apache.org/solr/guide/8_3/other-schema-elements.html#OtherSchemaElements-UniqueKey
>
> "The fieldType of uniqueKey must not be analyzed"
>
> (hence my comment about "possible, but hard to get right" ... you can use
> something like the KeywordTokenizer, but at that point you might as well
> use StrField except in some really esoteric special situations)

Good news. I added a field called ID, and made it string. Then I deleted
documents, re-indexed my data, and tried the search again. Now solrResults
size and numFound size are exactly the same.

Thanks for your help.

Rhys
Re: different results in numFound vs using the cursor
: > whoa... that's not normal .. what *exactly* does the fieldType declaration
: > (with all analyzers) look like, and what does the declaration
: > look like?
: >
: >
:
:
:

NOTE: "text_general" != "text_gen_sort"

Assuming your "text_general" declaration looks like it does in the
_default config set, then using that for uniqueKey or sorting is definitely
not a good idea.

If you were *actually* using SortableTextField for your uniqueKeyField ...
well, that should be ok to *sort* on, but i still wouldn't suggest using it
as a uniqueKey field ... honestly not sure what behavior that might have
with things like deleteById, etc...

: I am going to adjust my schema, re-index, and try again. See if that
: doesn't fix this problem. I didn't know that having the uniqueKey be a
: textField was a bad idea.

https://lucene.apache.org/solr/guide/8_3/other-schema-elements.html#OtherSchemaElements-UniqueKey

"The fieldType of uniqueKey must not be analyzed"

(hence my comment about "possible, but hard to get right" ... you can use
something like the KeywordTokenizer, but at that point you might as well
use StrField except in some really esoteric special situations)

-Hoss
http://www.lucidworks.com/
Re: different results in numFound vs using the cursor
On Tue, Nov 12, 2019 at 12:18 PM Chris Hostetter wrote:

> : > a) What is the fieldType of the uniqueKey field in use?
> : >
> :
> : It is a textField
>
> whoa... that's not normal .. what *exactly* does the fieldType declaration
> (with all analyzers) look like, and what does the declaration
> look like?
>
> you should really never use TextField for a uniqueKey ... it's possible,
> but incredibly tricky to get "right".

I am going to adjust my schema, re-index, and try again. See if that
doesn't fix this problem. I didn't know that having the uniqueKey be a
textField was a bad idea.

> Independent from that, "sorting" on a TextField doesn't always do what you
> might think (again: depending on the analysis in use)
>
> With a cursorMark you have other factors to consider: i bet what's
> happening is that the post-analysis terms for your docs result in
> duplicate values, so the cursorMark is skipping all docs that have the
> same (post analysis) sort value ... this could also manifest itself in
> other weird ways, like trying to deleteById.
>
> Step #1: switch to using a simple StrField for your uniqueKey field and
> see if that solves all your problems.

Thanks, doing this now.

Rhys
Re: different results in numFound vs using the cursor
: > a) What is the fieldType of the uniqueKey field in use?
: >
:
: It is a textField

whoa... that's not normal .. what *exactly* does the fieldType declaration
(with all analyzers) look like, and what does the declaration
look like?

you should really never use TextField for a uniqueKey ... it's possible,
but incredibly tricky to get "right".

Independent from that, "sorting" on a TextField doesn't always do what you
might think (again: depending on the analysis in use)

With a cursorMark you have other factors to consider: i bet what's
happening is that the post-analysis terms for your docs result in
duplicate values, so the cursorMark is skipping all docs that have the
same (post analysis) sort value ... this could also manifest itself in
other weird ways, like trying to deleteById.

Step #1: switch to using a simple StrField for your uniqueKey field and
see if that solves all your problems.

-Hoss
http://www.lucidworks.com/
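The skipping effect guessed at above can be illustrated with a toy model in plain Perl. This is only a sketch, not how Solr implements cursors: `analyzed_sort_value` is a made-up stand-in for an analyzer (lowercase, keep the first token), the ids are invented, and `next_page` simply returns the rows whose sort value is strictly greater than the last one seen.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an analyzed field: lowercase the value and keep
# only the first whitespace-separated token. Real analyzers differ.
sub analyzed_sort_value {
    my ($raw) = @_;
    my ($first) = split ' ', lc $raw;
    return $first;
}

# Toy cursor page: up to $rows docs whose sort value is strictly greater
# than $after (undef means "start from the beginning").
sub next_page {
    my ($docs, $key_of, $after, $rows) = @_;
    my @sorted = sort { $key_of->($a) cmp $key_of->($b) } @$docs;
    my @page = grep { !defined $after || $key_of->($_) gt $after } @sorted;
    splice @page, $rows if @page > $rows;
    return @page;
}

# Walk pages until an empty page comes back, remembering the last sort value.
sub walk_cursor {
    my ($docs, $key_of, $rows) = @_;
    my (@seen, $after);
    while ( my @page = next_page($docs, $key_of, $after, $rows) ) {
        push @seen, @page;
        $after = $key_of->( $page[-1] );
    }
    return @seen;
}

# Invented ids; with the toy analyzer they collapse to "acct" and "case".
my @ids = ( 'acct 1001', 'acct 1002', 'acct 1003', 'case 2001', 'case 2002' );

my @via_text   = walk_cursor( \@ids, \&analyzed_sort_value, 2 );
my @via_string = walk_cursor( \@ids, sub { $_[0] }, 2 );

printf "numFound=%d text-cursor=%d string-cursor=%d\n",
    scalar @ids, scalar @via_text, scalar @via_string;
```

With the analyzed key, every doc that shares a sort value with the last doc of a page is passed over on the next request, so the walk collects fewer docs than exist. With the raw string key, every doc has a distinct sort value and nothing is skipped.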
Re: different results in numFound vs using the cursor
On Mon, Nov 11, 2019 at 8:32 PM Chris Hostetter wrote:

> Based on the info provided, it's hard to be certain, but reading between
> the lines here are the assumptions i'm making...
>
> 1) your core name is "dbtr"
> 2) the uniqueId field for the "dbtr" core is "debtor_id"
>
> ..are those assumptions correct?

Yes they are. Sorry I didn't provide that from the beginning.

> Two key pieces of information that don't seem to be assumable from the
> info you've provided:
>
> a) What is the fieldType of the uniqueKey field in use?

It is a textField

> b) how are you determining that "The numFound: 35008"

I do a preliminary query to the solr core and print out the numFound from
this:

    my $solrResponse = $ua->post( $solrURI );
    my $decoded = decode_json( $solrResponse->{_content} );
    my $numFound = $decoded->{response}{numFound};

> ...
>
> You show the code that prints out "size of solrResults: 22006" but nothing
> in your code ever prints $numFound. there is a snippet of code at the top

I am printing numFound every time it loops. This should remain constant,
because it is the total of all documents found. It's not really necessary
that I am printing it. The number of docs is the size that I also print,
and that is 1000 every time, until the last little bit, and then it is 6
docs found.

> of your perl logic that seems disconnected from the rest of the code which
> makes me think that before you do anything with a cursor you are already
> parsing some *other* query response to get $numFound that way...

I am running this query first, to get the cursor set:

"http://10.40.10.14:8983/solr/debt/select?indent=on&rows=1000&sort=id asc&q=debt_id: 608384 OR debt_id: 393291&cursorMark=*"

This sets the cursor, and then returns a cursorMark that I start using in
order to grab 1000 documents at a time.

> ...what exactly does all the code *before* this look like? what is the
> request that you are using to get that initial '$solrResponse' that you
> are parsing to extract '$numFound' are you sure it's exactly the same as
> the query whose cursor you are iterating over?

query from before the loop:

"http://10.40.10.14:8983/solr/debt/select?indent=on&rows=1000&sort=id asc&q=debt_id: 608384 OR debt_id: 393291&cursorMark=*"

query in the loop:

http://10.40.10.14:8983/solr/debt/select?indent=on&rows=1000&sort=id+asc&q=debt_id: 608384 OR debt_id: 393291&cursorMark=AoElMTg1MzE=

I do have some logic to make sure I grab the first 1000 from the first
query, but other than that, it's a simple loop.

> It looks like you are (also) extracting 'my $numFound =
> $decoded->{response}{numFound};' on every (cursor) request ... what do you
> get if you add this to your cursor loop...
>
>     print STDERR "numFound = $numFound at '$cursor'";

numFound is always 35008 because that is how many total documents are
found. The number of docs in the response is the number that I care about,
because that shows me how many came back for this slice.

> ...because unless documents are being added/deleted as you iterate over
> the cursor, the numFound value should be consistent on each request.

numFound is consistently 35008.

Thanks

Rhys
Re: different results in numFound vs using the cursor
Based on the info provided, it's hard to be certain, but reading between
the lines here are the assumptions i'm making...

1) your core name is "dbtr"
2) the uniqueId field for the "dbtr" core is "debtor_id"

..are those assumptions correct?

Two key pieces of information that don't seem to be assumable from the
info you've provided:

a) What is the fieldType of the uniqueKey field in use?
b) how are you determining that "The numFound: 35008"

...

You show the code that prints out "size of solrResults: 22006" but nothing
in your code ever prints $numFound. there is a snippet of code at the top
of your perl logic that seems disconnected from the rest of the code which
makes me think that before you do anything with a cursor you are already
parsing some *other* query response to get $numFound that way...

: i am using this logic in perl:
:
: my $decoded = decode_json( $solrResponse->{_content} );
: my $numFound = $decoded->{response}{numFound};
:
: $cursor = "*";
: $prevCursor = '';
:
: while ( $prevCursor ne $cursor )
: {
:   my $solrURI = "\"http://[SOLR URL]:8983/solr/";
:   $solrURI .= $fdat{core};
  ...

...what exactly does all the code *before* this look like? what is the
request that you are using to get that initial '$solrResponse' that you
are parsing to extract '$numFound' are you sure it's exactly the same as
the query whose cursor you are iterating over?

It looks like you are (also) extracting 'my $numFound =
$decoded->{response}{numFound};' on every (cursor) request ... what do you
get if you add this to your cursor loop...

    print STDERR "numFound = $numFound at '$cursor'";

...because unless documents are being added/deleted as you iterate over
the cursor, the numFound value should be consistent on each request.

-Hoss
http://www.lucidworks.com/
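For reference, the check suggested above (compare numFound on every cursor request, then compare it with the number of docs collected) might look roughly like the sketch below. To keep it runnable without a server, `fetch_page` is a hypothetical stub that returns a hash shaped like a decoded Solr JSON response and uses a plain offset as its fake cursor; in real code it would be the `$ua->post(...)` plus `decode_json(...)` pair from the original script.

```perl
use strict;
use warnings;

# Hypothetical in-memory "index" so the sketch runs without a Solr server.
my @all_ids = map { sprintf 'id%03d', $_ } 1 .. 25;

# Stub for the HTTP request + decode_json(). The fake cursor is an offset;
# like Solr, it hands back the same cursor once there is nothing left.
sub fetch_page {
    my ($cursor, $rows) = @_;
    my $start = $cursor eq '*' ? 0 : $cursor;
    my $end   = $start + $rows;
    $end = @all_ids if $end > @all_ids;
    my @docs = map { +{ id => $_ } } @all_ids[ $start .. $end - 1 ];
    return {
        response       => { numFound => scalar @all_ids, docs => \@docs },
        nextCursorMark => $cursor eq '*' && !@docs ? '*' : $end,
    };
}

my ( $cursor, $expected, @ids ) = ('*');
while (1) {
    my $res       = fetch_page( $cursor, 10 );
    my $num_found = $res->{response}{numFound};

    # numFound should not move unless the index changes mid-iteration.
    $expected //= $num_found;
    warn "numFound changed from $expected to $num_found at '$cursor'\n"
        if $num_found != $expected;

    push @ids, map { $_->{id} } @{ $res->{response}{docs} };

    # Done when the server returns the cursor we just sent.
    last if $res->{nextCursorMark} eq $cursor;
    $cursor = $res->{nextCursorMark};
}

printf "numFound=%d collected=%d\n", $expected, scalar @ids;
```

If the two numbers printed at the end disagree while numFound stayed constant throughout, the loop itself is probably fine and the sort field is the next thing to look at.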
different results in numFound vs using the cursor
i am using this logic in perl:

    my $decoded = decode_json( $solrResponse->{_content} );
    my $numFound = $decoded->{response}{numFound};

    $cursor = "*";
    $prevCursor = '';

    while ( $prevCursor ne $cursor )
    {
      my $solrURI = "\"http://[SOLR URL]:8983/solr/";
      $solrURI .= $fdat{core};

      $solrSort = ( $fdat{core} eq 'dbtr' ) ? "debtor_id+asc" : "id+asc";
      $solrOptions = "/select?indent=on&rows=$getrows&sort=$solrSort&q=";
      $solrURI .= $solrOptions;
      $solrURI .= $query;

      $solrURI .= ( $prevCursor eq '' ) ? "&cursorMark=*\"": "&cursorMark=$cursor\"";

      print STDERR "solrURI '$solrURI'\n";
      my $solrResponse = $ua->post( $solrURI );
      my $decoded = decode_json( $solrResponse->{_content} );
      my $numFound = $decoded->{response}{numFound};

      foreach my $d ( $decoded->{response}{docs} )
      {
        my @docs = @$d;
        print STDERR "size of docs '" . scalar( @docs ) . "'\n";
        foreach my $r ( @docs )
        {
          if ( $fdat{cust_num} and $fdat{core} eq 'dbtr' )
          {
            push ( @solrResults, $r->{debtor_id} );
          }
          elsif ( $fdat{cust_num} and $fdat{core} eq 'debt' )
          {
            push ( @solrResults, $r->{debt_id} );
          }
        }
      }

      $prevCursor = ( $prevCursor eq '' ) ? "*" : $cursor;
      $cursor = $decoded->{nextCursorMark};
      print STDERR "cursor '$cursor'\n";
      print STDERR "prevCursor '$prevCursor'\n";
      print STDERR "size of solrResults '" . scalar( @solrResults ) . "'\n";
    }

print out:

http://[SOLR URL]:8983/solr/debt/select?indent=on&rows=1000&sort=id+asc&q=debt_id: 608384 OR debt_id: 393291&cursorMark=AoEmMzkzMjkx

The numFound: 35008
final size of solrResults: 22006

Am I missing something I should be using with cursorMark? Or is this
expected? I've checked my logic, and I'm using the cursors the way this
page is using them in examples:
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html

Thanks

Rhys
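A side note on the URL building in the code above: the query and the cursorMark are concatenated into the URL unescaped. cursorMark values are Base64-style strings and can contain '+', '/' and '=', which have special meaning in a query string, so it may be worth percent-encoding each parameter to rule that out. A minimal sketch follows, assuming ASCII-only values; `uri_escape_min` and `build_select_url` are hypothetical helper names, and the host is a placeholder.

```perl
use strict;
use warnings;

# Percent-encode everything except RFC 3986 unreserved characters.
# Assumes ASCII input; use a proper URI module for anything else.
sub uri_escape_min {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $s;
}

# Build a /select URL from a flat key/value list, escaping both sides.
sub build_select_url {
    my ( $base, $core, @params ) = @_;
    my @pairs;
    while ( my ( $k, $v ) = splice @params, 0, 2 ) {
        push @pairs, uri_escape_min($k) . '=' . uri_escape_min($v);
    }
    return "$base/$core/select?" . join '&', @pairs;
}

# Query and cursorMark values copied from the thread; host is a placeholder.
my $url = build_select_url(
    'http://localhost:8983/solr', 'debt',
    rows       => 1000,
    sort       => 'id asc',
    q          => 'debt_id: 608384 OR debt_id: 393291',
    cursorMark => 'AoElMTg1MzE=',
);
print "$url\n";
```

This also removes the need for the literal double quotes that the original code wraps around the URL.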