Re: different results in numFound vs using the cursor

2019-11-12 Thread rhys J
> : I am going to adjust my schema, re-index, and try again. See if that
> : doesn't fix this problem. I didn't know that having the uniqueKey be a
> : textField was a bad idea.
>
>
> https://lucene.apache.org/solr/guide/8_3/other-schema-elements.html#OtherSchemaElements-UniqueKey
>
> "The fieldType of uniqueKey must not be analyzed"
>
> (hence my comment about "possible, but hard to get right" ... you can use
> something like the KeywordTokenizer, but at that point you might as well
> use StrField except in some really esoteric special situations)
>
>
Good news. I added a field called ID and made it a string field. Then I
deleted the existing documents, re-indexed my data, and tried the search again.

Now solrResults size and numFound size are exactly the same.

Thanks for your help.

Rhys
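
For reference, the change described above amounts to a schema along these lines. This is only a sketch: the poster's actual declarations were not shown, so it assumes the field is named "id" and that a "string" fieldType backed by solr.StrField exists, as in the _default config set.

```xml
<!-- Assumed names: "string" fieldType and "id" field, as in the _default config set -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>
```

Because StrField is not analyzed, each document's key is indexed as one verbatim term, which is what the uniqueKey documentation quoted in this thread requires.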


Re: different results in numFound vs using the cursor

2019-11-12 Thread Chris Hostetter


: > whoa... that's not normal .. what *exactly* does the fieldType declaration
: > (with all analyzers) look like, and what does the field declaration
: > look like?
: >
: >
: [fieldType and field declarations stripped from the archived message]

NOTE: "text_general" != "text_gen_sort"

Assuming your "text_general" declaration looks like it does in the
_default config set, then using that for uniqueKey or sorting is definitely
not a good idea.

If you were *actually* using SortableTextField for your uniqueKeyField ... 
well, that should be ok to *sort* on, but i still wouldn't suggest using 
it as a uniqueKey field ... honestly not sure what behavior that might 
have with things like deleteById, etc...


: I am going to adjust my schema, re-index, and try again. See if that
: doesn't fix this problem. I didn't know that having the uniqueKey be a
: textField was a bad idea.

https://lucene.apache.org/solr/guide/8_3/other-schema-elements.html#OtherSchemaElements-UniqueKey

"The fieldType of uniqueKey must not be analyzed"

(hence my comment about "possible, but hard to get right" ... you can use
something like the KeywordTokenizer, but at that point you might as well
use StrField except in some really esoteric special situations)



-Hoss
http://www.lucidworks.com/


Re: different results in numFound vs using the cursor

2019-11-12 Thread rhys J
On Tue, Nov 12, 2019 at 12:18 PM Chris Hostetter wrote:

>
> : > a) What is the fieldType of the uniqueKey field in use?
> : >
> :
> : It is a textField
>
> whoa... that's not normal .. what *exactly* does the fieldType declaration
> (with all analyzers) look like, and what does the field declaration
> look like?
>
>

[fieldType and field declarations stripped from the archived message]


> you should really never use TextField for a uniqueKey ... it's possible,
> but incredibly tricky to get "right".
>
>
I am going to adjust my schema, re-index, and try again. See if that
doesn't fix this problem. I didn't know that having the uniqueKey be a
textField was a bad idea.


> Independent from that, "sorting" on a TextField doesn't always do what you
> might think (again: depending on the analysis in use)
>
> With a cursorMark you have other factors to consider: i bet what's
> happening is that the post-analysis terms for your docs result in
> duplicate values, so the cursorMark is skipping all docs that have the
> same (post analysis) sort value ... this could also manifest itself in
> other weird ways, like trying to deleteById.
>
> Step #1: switch to using a simple StrField for your uniqueKey field and
> see if that solves all your problems.
>
>
Thanks, doing this now.

Rhys


Re: different results in numFound vs using the cursor

2019-11-12 Thread Chris Hostetter


: > a) What is the fieldType of the uniqueKey field in use?
: >
: 
: It is a textField

whoa... that's not normal .. what *exactly* does the fieldType declaration
(with all analyzers) look like, and what does the field declaration
look like?

you should really never use TextField for a uniqueKey ... it's possible, 
but incredibly tricky to get "right".

Independent from that, "sorting" on a TextField doesn't always do what you 
might think (again: depending on the analysis in use)

With a cursorMark you have other factors to consider: i bet what's
happening is that the post-analysis terms for your docs result in
duplicate values, so the cursorMark is skipping all docs that have the
same (post analysis) sort value ... this could also manifest itself in
other weird ways, like trying to deleteById.

Step #1: switch to using a simple StrField for your uniqueKey field and
see if that solves all your problems.


-Hoss
http://www.lucidworks.com/
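
To illustrate the skipping described above, here is a minimal Perl sketch. It is a simplified model, not Solr's implementation: the documents, the lowercasing used as a stand-in for analysis, and the helper names fetch_page and collect_all are all invented for the example. Each page starts strictly after the last sort value seen, so a document that shares that value with the last document of a page is never returned.

```perl
use strict;
use warnings;

# Invented documents. 'sort_key' stands in for a post-analysis term;
# lowercasing is an assumed analysis step chosen to force collisions.
my @docs = map { +{ id => $_, sort_key => lc($_) } } qw(A1 a1 B2 b2 C3 c3 D4);

# Simplified cursor page: up to $rows docs whose sort key is strictly
# greater than $after (undef means "start from the beginning").
sub fetch_page {
    my ( $after, $rows ) = @_;
    my @sorted = sort { $a->{sort_key} cmp $b->{sort_key} } @docs;
    my @rest   = grep { !defined $after || $_->{sort_key} gt $after } @sorted;
    splice @rest, $rows if @rest > $rows;
    return @rest;
}

# Page through everything and return the ids that were actually seen.
sub collect_all {
    my ($rows) = @_;
    my ( $after, @seen );
    while ( my @page = fetch_page( $after, $rows ) ) {
        push @seen, map { $_->{id} } @page;
        $after = $page[-1]{sort_key};
    }
    return @seen;
}

my @seen = collect_all(3);
printf "numFound %d, collected %d\n", scalar(@docs), scalar(@seen);
```

With a page size of 3 this prints "numFound 7, collected 6": one of the two documents whose key is "b2" sits on the page boundary and is lost. With unique, unanalyzed keys every sort value is distinct and nothing can be skipped this way.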


Re: different results in numFound vs using the cursor

2019-11-12 Thread rhys J
On Mon, Nov 11, 2019 at 8:32 PM Chris Hostetter wrote:

>
> Based on the info provided, it's hard to be certain, but reading between
> the lines, here are the assumptions i'm making...
>
> 1) your core name is "dbtr"
> 2) the uniqueKey field for the "dbtr" core is "debtor_id"
>
> ..are those assumptions correct?
>

Yes they are. Sorry I didn't provide that from the beginning.


> Two key pieces of information that don't seem to be inferable from the
> info you've provided:
>
> a) What is the fieldType of the uniqueKey field in use?
>

It is a textField


> b) how are you determining that "The numFound: 35008"
>
>
I do a preliminary query to the solr core and print out the numFound from
this:

 my $solrResponse = $ua->post( $solrURI );

 my $decoded = decode_json( $solrResponse->{_content} );
 my $numFound = $decoded->{response}{numFound};


> ...
>
> You show the code that prints out "size of solrResults: 22006" but nothing
> in your code ever prints $numFound.  there is a snippet of code at the top
>

I am printing numFound every time it loops. This should remain constant,
because it is the total of all documents found. It's not really necessary
that I am printing it.

The number of docs is the size that I also print, and that is 1000 every
time, until the last little bit, and then it is 6 docs found.


> of your perl logic that seems disconnected from the rest of the code which
> makes me think that before you do anything with a cursor you are already
> parsing some *other* query response to get $numFound that way...
>
>
I am running this query first, to get the cursor set:

"http://10.40.10.14:8983/solr/debt/select?indent=on&rows=1000&sort=id asc&q=debt_id: 608384 OR debt_id: 393291&cursorMark=*"

This sets the cursor, and then returns a cursorMark that I start using in
order to grab 1000 documents at a time.



> ...what exactly does all the code *before* this look like? what is the
> request that you are using to get that initial '$solrResponse' that you
> are parsing to extract '$numFound'  are you sure it's exactly the same as
> the query whose cursor you are iterating over?
>
>
query from before the loop:

"http://10.40.10.14:8983/solr/debt/select?indent=on&rows=1000&sort=id asc&q=debt_id: 608384 OR debt_id: 393291&cursorMark=*"

query in the loop:

http://10.40.10.14:8983/solr/debt/select?indent=on&rows=1000&sort=id+asc&q=debt_id: 608384 OR debt_id: 393291&cursorMark=AoElMTg1MzE=

I do have some logic to make sure I grab the first 1000 from the first
query, but other than that, it's a simple loop.


> It looks like you are (also) extracting 'my $numFound =
> $decoded->{response}{numFound};' on every (cursor) request ... what do you
> get if you add this to your cursor loop...
>
>    print STDERR "numFound = $numFound at '$cursor'";
>
>
numFound is always 35008 because that is how many total documents are
found. The number of docs in the response is the number that I care about,
because that shows me how many came back for this slice.


> ...because unless documents are being added/deleted as you iterate over
> the cursor, the numFound value should be consistent on each request.
>
>
numFound is consistently 35008.

Thanks

Rhys


Re: different results in numFound vs using the cursor

2019-11-11 Thread Chris Hostetter


Based on the info provided, it's hard to be certain, but reading between
the lines, here are the assumptions i'm making...

1) your core name is "dbtr"
2) the uniqueKey field for the "dbtr" core is "debtor_id"

..are those assumptions correct?

Two key pieces of information that don't seem to be inferable from the
info you've provided:

a) What is the fieldType of the uniqueKey field in use?
b) how are you determining that "The numFound: 35008"

...

You show the code that prints out "size of solrResults: 22006" but nothing 
in your code ever prints $numFound.  there is a snippet of code at the top 
of your perl logic that seems disconnected from the rest of the code which 
makes me think that before you do anything with a cursor you are already 
parsing some *other* query response to get $numFound that way...

: i am using this logic in perl:
: 
: my $decoded = decode_json( $solrResponse->{_content} );
: my $numFound = $decoded->{response}{numFound};
: 
: $cursor = "*";
: $prevCursor = '';
: 
: while ( $prevCursor ne $cursor )
: {
:   my $solrURI = "\"http://[SOLR URL]:8983/solr/";
:   $solrURI .= $fdat{core};
...

...what exactly does all the code *before* this look like? what is the 
request that you are using to get that initial '$solrResponse' that you 
are parsing to extract '$numFound'  are you sure it's exactly the same as 
the query whose cursor you are iterating over?

It looks like you are (also) extracting 'my $numFound =
$decoded->{response}{numFound};' on every (cursor) request ... what do you
get if you add this to your cursor loop...

   print STDERR "numFound = $numFound at '$cursor'";


...because unless documents are being added/deleted as you iterate over
the cursor, the numFound value should be consistent on each request.


-Hoss
http://www.lucidworks.com/


different results in numFound vs using the cursor

2019-11-11 Thread rhys J
I am using this logic in Perl:

my $decoded = decode_json( $solrResponse->{_content} );
my $numFound = $decoded->{response}{numFound};

$cursor = "*";
$prevCursor = '';

while ( $prevCursor ne $cursor )
{
  my $solrURI = "\"http://[SOLR URL]:8983/solr/";
  $solrURI .= $fdat{core};

  $solrSort = ( $fdat{core} eq 'dbtr' ) ? "debtor_id+asc" : "id+asc";
  $solrOptions = "/select?indent=on&rows=$getrows&sort=$solrSort&q=";
  $solrURI .= $solrOptions;
  $solrURI .= $query;

  $solrURI .= ( $prevCursor eq '' ) ? "&cursorMark=*\"" : "&cursorMark=$cursor\"";

  print STDERR "solrURI '$solrURI'\n";
  my $solrResponse = $ua->post( $solrURI );
  my $decoded = decode_json( $solrResponse->{_content} );
  my $numFound = $decoded->{response}{numFound};

  # {response}{docs} is a single array reference, so this outer loop runs once
  foreach my $d ( $decoded->{response}{docs} )
  {
    my @docs = @$d;
    print STDERR "size of docs '" . scalar( @docs ) . "'\n";
    foreach my $r ( @docs )
    {
      if ( $fdat{cust_num} and $fdat{core} eq 'dbtr' )
      {
        push ( @solrResults, $r->{debtor_id} );
      }
      elsif ( $fdat{cust_num} and $fdat{core} eq 'debt' )
      {
        push ( @solrResults, $r->{debt_id} );
      }
    }
  }

  # remember the mark just sent, then advance to the one Solr returned
  $prevCursor = ( $prevCursor eq '' ) ? "*" : $cursor;
  $cursor = $decoded->{nextCursorMark};
  print STDERR "cursor '$cursor'\n";
  print STDERR "prevCursor '$prevCursor'\n";
  print STDERR "size of solrResults '" . scalar( @solrResults ) . "'\n";
}

print out:

http://[SOLR URL]:8983/solr/debt/select?indent=on&rows=1000&sort=id+asc&q=debt_id: 608384 OR debt_id: 393291&cursorMark=AoEmMzkzMjkx

The numFound: 35008
final size of solrResults: 22006

Am I missing something I should be using with cursorMark? Or is this
expected?

I've checked my logic, and I'm using the cursors the way this page is using
them in examples:

https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html

Thanks

Rhys
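
For comparison, here is a minimal Perl sketch of the loop shape the pagination guide describes: keep requesting pages until the nextCursorMark that comes back equals the cursorMark that was sent. The name collect_with_cursor, the $fetch callback, and the canned %pages responses are hypothetical and only stand in for a real request to a Solr core.

```perl
use strict;
use warnings;

# Walk a cursor until the server hands back the same mark that was sent.
# $fetch is a hypothetical callback: given a cursorMark, it returns the
# decoded JSON response as a hash reference.
sub collect_with_cursor {
    my ($fetch) = @_;
    my $cursor = '*';
    my @docs;
    while (1) {
        my $decoded = $fetch->($cursor);
        push @docs, @{ $decoded->{response}{docs} };
        my $next = $decoded->{nextCursorMark};
        last if $next eq $cursor;    # same mark back: no further pages
        $cursor = $next;
    }
    return @docs;
}

# Canned responses standing in for a real core (invented data).
my %pages = (
    '*' => {
        response       => { numFound => 3, docs => [ { id => '1' }, { id => '2' } ] },
        nextCursorMark => 'm1',
    },
    'm1' => {
        response       => { numFound => 3, docs => [ { id => '3' } ] },
        nextCursorMark => 'm2',
    },
    'm2' => {
        response       => { numFound => 3, docs => [] },
        nextCursorMark => 'm2',
    },
);

my @all = collect_with_cursor( sub { $pages{ $_[0] } } );
print scalar(@all), " docs collected\n";
```

Against a real core the callback would build the request URL and decode the response. The cursorMark value should be URL-encoded when it is placed in the query string, since marks can contain characters such as "+" and "=".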